

Thesis for the degree of Doctor of Philosophy

Non-Gaussian Statistical Models

and Their Applications

Zhanyu Ma

Sound and Image Processing Laboratory
School of Electrical Engineering

KTH - Royal Institute of Technology

Stockholm 2011


Ma, Zhanyu
Non-Gaussian Statistical Models and Their Applications

Copyright © 2011 Zhanyu Ma except where otherwise stated. All rights reserved.

ISBN 978-91-7501-158-5
TRITA-EE 2011:073
ISSN 1653-5146

Sound and Image Processing Laboratory
School of Electrical Engineering
KTH - Royal Institute of Technology
SE-100 44 Stockholm, Sweden


Abstract

Statistical modeling plays an important role in various research areas. It provides a way to connect the data with the statistics. Based on the statistical properties of the observed data, an appropriate model can be chosen that leads to a promising practical performance. The Gaussian distribution is the most popular and dominant probability distribution used in statistics, since it has an analytically tractable Probability Density Function (PDF) and analysis based on it can be derived in an explicit form. However, various data in real applications have bounded or semi-bounded support. As the support of the Gaussian distribution is unbounded, such data are obviously not Gaussian distributed. Thus we can apply some non-Gaussian distributions, e.g., the beta distribution or the Dirichlet distribution, to model the distribution of this type of data. The choice of a suitable distribution is favorable for modeling efficiency. Furthermore, the practical performance based on the statistical model can also be improved by better modeling.

An essential part of statistical modeling is to estimate the values of the parameters in the distribution, or to estimate the distribution of the parameters if we consider them as random variables. Unlike the Gaussian distribution or the corresponding Gaussian Mixture Model (GMM), a non-Gaussian distribution or a mixture of non-Gaussian distributions does not, in general, have an analytically tractable solution. In this dissertation, we study several estimation methods for non-Gaussian distributions. For the Maximum Likelihood (ML) estimation, a numerical method is utilized to search for the optimal solution in the estimation of the Dirichlet Mixture Model (DMM). For the Bayesian analysis, we utilize some approximations to derive an analytically tractable solution to approximate the distribution of the parameters. The Variational Inference (VI) framework based method has been shown by several researchers to be efficient for approximating the parameter distribution. Under this framework, we adapt the conventional Factorized Approximation (FA) method to the Extended Factorized Approximation (EFA) method and use it to approximate the parameter distribution in the beta distribution. Also, the Local Variational Inference (LVI) method is applied to approximate the predictive distribution of the beta distribution. Finally, by assigning a beta distribution to each element in the matrix, we propose a variational Bayesian Nonnegative Matrix Factorization (NMF) for bounded support data.

The performance of the proposed non-Gaussian model based methods is evaluated by several experiments. The beta distribution and the Dirichlet distribution are applied to model the Line Spectral Frequency (LSF) representation of the Linear Prediction (LP) model for statistical model based speech coding. The beta distribution is also applied in some image processing applications. The proposed beta distribution based variational Bayesian NMF is applied to image restoration and collaborative filtering. Compared to some conventional statistical model based methods, the non-Gaussian model based methods show a promising improvement.

Keywords: Statistical model, non-Gaussian distribution, Bayesian analysis, variational inference, speech processing, image processing, nonnegative matrix factorization


List of Papers

The thesis is based on the following papers:

[A] Z. Ma and A. Leijon, "Bayesian estimation of beta mixture models with variational inference," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 11, pp. 2160-2173, 2011.

[B] Z. Ma and A. Leijon, "Approximating the predictive distribution of the beta distribution with the local variational method," in Proceedings of IEEE International Workshop on Machine Learning for Signal Processing, 2011.

[C] Z. Ma and A. Leijon, "Expectation propagation for estimating the parameters of the beta distribution," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 2082-2085, 2010.

[D] Z. Ma and A. Leijon, "Vector quantization of LSF parameters with mixture of Dirichlet distributions," in IEEE Transactions on Audio, Speech, and Language Processing, submitted, 2011.

[E] Z. Ma and A. Leijon, "Modelling speech line spectral frequencies with Dirichlet mixture models," in Proceedings of INTERSPEECH, pp. 2370-2373, 2010.

[F] Z. Ma and A. Leijon, "BG-NMF: a variational Bayesian NMF model for bounded support data," submitted, 2011.


In addition to papers A-F, the following papers have also been produced in part by the author of the thesis:

[1] Z. Ma and A. Leijon, "Super-Dirichlet mixture models using differential line spectral frequencies for text-independent speaker identification," in Proceedings of INTERSPEECH, pp. 2349-2352, 2011.

[2] Z. Ma and A. Leijon, "PDF-optimized LSF vector quantization based on beta mixture models," in Proceedings of INTERSPEECH, pp. 2374-2377, 2010.

[3] Z. Ma and A. Leijon, "Coding bounded support data with beta distribution," in Proceedings of IEEE International Conference on Network Infrastructure and Digital Content, pp. 246-250, 2010.

[4] Z. Ma and A. Leijon, "Human skin color detection in RGB space with Bayesian estimation of beta mixture models," in Proceedings of European Signal Processing Conference, pp. 2045-2048, 2010.

[5] Z. Ma and A. Leijon, "Beta mixture models and the application to image classification," in Proceedings of IEEE International Conference on Image Processing, pp. 2045-2048, 2009.

[6] Z. Ma and A. Leijon, "Human audio-visual consonant recognition analyzed with three bimodal integration models," in Proceedings of INTERSPEECH, pp. 812-815, 2009.

[7] Z. Ma and A. Leijon, "A probabilistic principal component analysis based hidden Markov model for audio-visual speech recognition," in Proceedings of IEEE Asilomar Conference on Signals, Systems, and Computers, pp. 2170-2173, 2008.


Acknowledgements

Pursuing a Ph.D. degree is a challenging task. It has taken me about four and a half years, or even longer if my primary school, middle school, high school, university, and master studies are also counted. Years passed, a baby is coming, good times, hard times, but never bad times.

At the moment of approaching my Ph.D. degree, I would like to thank my supervisor, Prof. Arne Leijon, for opening the door of the academic world to me. Your dedication, creativity, and hardworking nature influenced me. I benefited a lot from your support, guidance, and encouragement. Also, I would like to thank my co-supervisor, Prof. Bastiaan Kleijn, for the fruitful discussions. Special thanks also go to Assoc. Prof. Markus Flierl for the ideas that inspired my research.

It is a great pleasure to work with the former and current colleagues in SIP and to share my research experience with you. I am indebted to Guoqiang Zhang, Minyue Li, Janusz Klejsa, Gustav Henter, and all the others for the constructive discussions about my research. I also enjoyed the teaching experience with Petko Petkov. Writing a thesis is not an easy job. I express my thanks to Dr. Timo Gerkmann, Gustav Henter, Janusz Klejsa, Nasser Mohammadiha, and Haopeng Li for proofreading the summary part of my thesis. I am grateful to Dora Soderberg for taking care of the administrative issues kindly and patiently, especially when I came to you with a lot of receipts. It would be too long a list to give every name here. Once more, I would like to thank all the SIPers sincerely.

Moving to a new country could be stressful. To Lei, David & Lili, Xi & Yi, Sha, Hailong, Qie & Bo, thank you for your generous help that made the start of a new life so easy. To all my Chinese friends, thank you for the friendship that makes my days here wonderful and colorful.

Last but not least, I devote my special thanks to my parents and my parents-in-law for their tremendous support throughout the years. I owe my deepest gratitude to my wife Zhongwei for her support, understanding, encouragement, and love. We entered the university on the same day, received our bachelor's and master's degrees on the same day, flew to Stockholm on the same day, started our Ph.D. studies on the same day, got married on the same day, and will become parents on the same day. A simple "thank you" is far from expressing my gratitude. I hope we can spend every day together to make more "same days". Love is not a word but an action. It lasts forever and ever.

Zhanyu Ma
Stockholm, December 2011


Contents

Abstract i

List of Papers iii

Acknowledgements v

Contents vii

Acronyms xi

I Summary 1

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 Statistical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.1 Probability Distribution and Bayes’ Theorem . . . . . . . . 4

2.2 Parametric, Non-parametric, and Semi-parametric Models . 6

2.3 Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . 7

2.4 Non-Gaussian Distributions . . . . . . . . . . . . . . . . . . 8

2.5 Mixture Model . . . . . . . . . . . . . . . . . . . . . . . . . 14

3 Analysis of Non-Gaussian Models . . . . . . . . . . . . . . . . . . . 16

3.1 Maximum Likelihood Estimation . . . . . . . . . . . . . . . 16

3.2 Bayesian Analysis . . . . . . . . . . . . . . . . . . . . . . . 18

3.3 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.4 Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . 28

4 Applications of Non-Gaussian Models . . . . . . . . . . . . . . . . 28

4.1 Speech Processing . . . . . . . . . . . . . . . . . . . . . . . 28

4.2 Image Processing . . . . . . . . . . . . . . . . . . . . . . . . 33

4.3 Nonnegative Matrix Factorization . . . . . . . . . . . . . . 34

5 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . 38

5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39


II Included papers 51

A Bayesian estimation of beta mixture models with variational inference A1

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A1

2 Beta Mixture Models and Maximum Likelihood Estimation . . . . A4

2.1 The Mixture Models . . . . . . . . . . . . . . . . . . . . . . A4

2.2 Maximum Likelihood Estimation . . . . . . . . . . . . . . . A6

3 Bayesian Estimation with Variational Inference Framework . . . . A6

3.1 Conjugate Prior of Beta Distribution . . . . . . . . . . . . . A6

3.2 Factorized Approximation to the Parameter Distributions of BMM . . . . . . . . . A7

3.3 Extended Factorized Approximation Method . . . . . . . . A12

3.4 Lower Bound Approximation . . . . . . . . . . . . . . . . . A13

3.5 Algorithm of Bayesian Estimation . . . . . . . . . . . . . . A16

3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . A19

4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . A20

4.1 Synthetic Data Evaluation . . . . . . . . . . . . . . . . . . . A22

4.2 Real Data Evaluation . . . . . . . . . . . . . . . . . . . . . A26

5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A31

Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A31

6 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A32

6.1 Proof of the Relative Convexity of the Log-Inverse-Beta Function (property 3) . . . . . . . . . A32

6.2 Relative Convexity of the Pseudo Digamma Function (property 5) . . . . . . . . . A34

6.3 Approximations of the LIB Function and the Pseudo Digamma Function (properties 4 and 6) . . . . . . . . . A34

6.4 Approximation of the Bivariate LIG Function (property 7) A35

Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A35

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A35

B Approximating the predictive distribution of the beta distribution with the local variational method B1

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B1

2 Bayesian Estimation of the Parameters in the Beta Distribution . . B3

3 Predictive Distribution of the Beta Distribution . . . . . . . . . . . B4

3.1 Convexity of the Inverse Beta Function . . . . . . . . . . . B5

3.2 Local Variational Method . . . . . . . . . . . . . . . . . . . B7

3.3 Upper Bound of the Predictive Distribution . . . . . . . . . B7

3.4 Global Minimum of the Upper Bound . . . . . . . . . . . . B8

3.5 Approximation of min_{u0,v0} F(x, u0, v0) . . . . . . . . . . . . B9

3.6 Approximation of the Predictive Distribution . . . . . . . . B10

4 Experimental Results and Discussion . . . . . . . . . . . . . . . . . B11

5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B14

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B14


C Expectation propagation for estimating the parameters of the beta distribution C1

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C1
2 Beta Distribution and Parameter Estimation . . . . . . . . . . . . C2

2.1 Posterior Approximation with Variational Inference . . . . C3
2.2 Posterior Approximation with Expectation Propagation . . C4

3 Experimental Results and Discussion . . . . . . . . . . . . . . . . C7
3.1 Approximation to the Posterior Distribution . . . . . . . . C8
3.2 Model Comparison . . . . . . . . . . . . . . . . . . . . . . . C8

4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C10
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C10

D Vector quantization of LSF parameters with mixture of Dirichlet distributions D1

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D1
2 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D4

2.1 LPC Parameters Representation by ∆LSF . . . . . . . . . . D5
2.2 Modelling and Inter-component Bit Allocation . . . . . . . D6

3 Decorrelation of Dirichlet Variable . . . . . . . . . . . . . . . . . . D7
3.1 Previous work on the Dirichlet variable . . . . . . . . . . . D8
3.2 Three-dimensional Dirichlet Variable Decorrelation . . . . . D9
3.3 K-dimensional Dirichlet Variable Decorrelation . . . . . . . D10
3.4 Decorrelation of ∆LSF parameters . . . . . . . . . . . . . . D10
3.5 Computational Complexity . . . . . . . . . . . . . . . . . . D11

4 Intra-component Bit Allocation and Practical Coding Scheme . . . D13
4.1 Distortion Transformation by Sensitivity Matrix . . . . . . D13
4.2 Intra-component Bit Allocation . . . . . . . . . . . . . . . . D15
4.3 Practical Coding Scheme . . . . . . . . . . . . . . . . . . . D17

5 Experimental Results and Discussion . . . . . . . . . . . . . . . . . D19
5.1 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . D19
5.2 Model Fitness . . . . . . . . . . . . . . . . . . . . . . . . . . D20
5.3 High Rate D-R Performance . . . . . . . . . . . . . . . . . D22
5.4 Vector Quantization Performance . . . . . . . . . . . . . . . D23
5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . D24

6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D25
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D25

E Modelling speech line spectral frequencies with Dirichlet mixture models E1

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E1
2 Line Spectral Frequencies and ∆LSF . . . . . . . . . . . . . . . . . E3

2.1 Representations . . . . . . . . . . . . . . . . . . . . . . . . . E3
2.2 Transformation between LSF and ∆LSF domain . . . . . . E3

3 Probabilistic Model Frameworks . . . . . . . . . . . . . . . . . . . E5
3.1 Dirichlet Mixture Models . . . . . . . . . . . . . . . . . . . E5
3.2 Parameter Estimation for DMM . . . . . . . . . . . . . . . E6

4 PDF-optimized Vector Quantizer . . . . . . . . . . . . . . . . . . . E6
4.1 Distortion-Rate Relation with High Rate Theory . . . . . . E7
4.2 Bit Allocation for DMM . . . . . . . . . . . . . . . . . . . . E7

5 Evaluation Results and Discussion . . . . . . . . . . . . . . . . . . E8
6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E10
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E11

F BG-NMF: a variational Bayesian NMF model for bounded support data F1

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F1
2 Bayesian Nonnegative Matrix Factorization . . . . . . . . . . . . . F3
3 Beta-Gamma Nonnegative Matrix Factorization for Bounded Support Data . . . F4

3.1 The Generative Model . . . . . . . . . . . . . . . . . . . . . F4
3.2 Variational Inference . . . . . . . . . . . . . . . . . . . . . . F6
3.3 The BG-NMF Algorithm . . . . . . . . . . . . . . . . . . . F13

4 Experimental Results and Discussion . . . . . . . . . . . . . . . . . F14
4.1 Sparseness Constraints . . . . . . . . . . . . . . . . . . . . . F14
4.2 Source Separation . . . . . . . . . . . . . . . . . . . . . . . F15
4.3 Predict the Missing Data . . . . . . . . . . . . . . . . . . . F17
4.4 Collaborative Filtering . . . . . . . . . . . . . . . . . . . . . F18

5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F18
Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F19
6 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F19

6.1 BG-NMF and IS-NMF . . . . . . . . . . . . . . . . . . . . . F19
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F21


Acronyms

ADF Assumed Density Filtering

AIC Akaike Information Criterion

ASRC ArcSine Reflection Coefficient

BF Bayes Factor

BIC Bayesian Information Criterion

BMM Beta Mixture Model

DAG Directed Acyclic Graph

DMM Dirichlet Mixture Model

DPCM Differential Pulse Code Modulation

ECM Expectation Conditional Maximization

EFA Extended Factorized Approximation

EM Expectation Maximization

EP Expectation Propagation

FA Factorized Approximation

GEM Generalized Expectation Maximization

GMM Gaussian Mixture Model

IEEE Institute of Electrical and Electronics Engineers

i.i.d. Independent and Identically Distributed

IS Itakura-Saito

ISCA International Speech Communication Association

ISF Immittance Spectral Frequency

KDE Kernel Density Estimator

KL Kullback-Leibler

KLT Karhunen-Loeve Transform

LAR Log Area Ratio


LDA Latent Dirichlet Allocation

LP Linear Prediction

LPC Linear Prediction Coefficient

LRMA Low Rank Matrix Approximation

LSF Line Spectral Frequency

LVI Local Variational Inference

MAP Maximum A-Posteriori

MCMC Markov Chain Monte Carlo

MFCC Mel-Frequency Cepstral Coefficient

ML Maximum Likelihood

nGMM non-Gaussian Mixture Model

NMF Nonnegative Matrix Factorization

PCA Principal Component Analysis

PDF Probability Density Function

PMF Probability Mass Function

RC Reflection Coefficient

RGB Red Green Blue

sDMM super-Dirichlet Mixture Model

SI Speaker Identification

SQ Scalar Quantization

SV Speaker Verification

SVD Singular Value Decomposition

VI Variational Inference

VQ Vector Quantization


Part I

Summary


1 Introduction

Nowadays, statistical modeling plays an important role in various research areas. Data obtained from experiments, measurements, surveys, etc., can be described efficiently by a statistical model for facilitating analysis, transmission, prediction, classification, etc. Statistical modeling provides a way to connect the data with the statistics. According to some accepted theories, Cox et al. [1] defined the statistical model as "statistical methods of analysis are intended to aid the interpretation of data that are subject to appreciable haphazard variability", McCullagh [2] simplified it as "a statistical model is a set of probability distributions on the sample space", and Davison [3] emphasized the purpose of a statistical model as "a statistical model is a probability distribution constructed to enable inferences to be drawn or decisions made from data". More discussions about statistical models can also be found in, for example, [4–6].

Generally speaking, a statistical model comprises one or more probability distributions. Given a set of observed data, we are free to choose any valid probability distribution to establish a statistical model for the data. Assuming the observed data are realizations of a random variable, the probability distribution is a mathematical formula that gives the probability of each value of the variable (discrete case) or gives the probability that the variable falls in a particular interval (continuous case) [7]. For a discrete variable, the Probability Mass Function (PMF) is a mathematical function that describes the probability distribution. Similarly, the Probability Density Function (PDF) is used to describe the probability distribution of a continuous variable. If the PMF or the PDF can be determined by using a parameter vector with known dimensionality, such a model is a so-called parametric model [8] (e.g., the Poisson distribution, the Gaussian distribution). However, if the model can only be determined using a parameter vector with unknown dimensionality (i.e., the number of parameters is not set in advance and may change according to the data), it is referred to as a non-parametric model [9] (e.g., the Kernel Density Estimator (KDE) [10, 11], the Dirichlet process mixture model [12]). A model containing both finite dimensional and infinite dimensional parameter vectors is named a semi-parametric model [13] (e.g., the Cox proportional hazards model [14]). To summarize, the parametric model has a fixed structure, while the non-parametric model has a flexible structure; the semi-parametric model is a compromise between these two. In general, a statistical model describes the relation of a set of random variables to another. It could be parametric, non-parametric, or semi-parametric. Regardless of which, the model is parameterized by a parameter vector sampled from a parameter space.

A parameterized statistical model should describe the observed data efficiently, by choosing a suitable probability distribution. This choice is made depending on the properties of the data. For instance, a discrete variable such as the result of tossing a coin is usually modeled by a Bernoulli distribution, and a categorical distribution is used to describe a random event which takes on one of several possible outcomes [8]. For the continuous case, the exponential distribution describes the time difference between events in a Poisson process, and the gamma distribution is frequently used as a model for waiting times. In other words, selection of a suitable probability distribution is favorable for the efficiency of modeling the data.

The Gaussian distribution (i.e., the normal distribution) is the ubiquitous probability distribution used in statistics, since it has an analytically tractable PDF and analysis based on it can be derived in an explicit form [8, 15–18]. Furthermore, by the technique of mixture modeling [8, 19, 20], the corresponding Gaussian Mixture Model (GMM) can be used to approximate arbitrary probability distributions, with a rather flexible model complexity. The research employing the Gaussian distribution and the corresponding GMM is vast (see, e.g., [16, 21–24]).

However, not all the data we would like to model are Gaussian distributed [25]. The Gaussian distribution has an unbounded support, while some data have a semi-bounded or bounded support. For example, the digital image pixel value is bounded in an interval, the magnitude of the speech spectrum is nonnegative, and the Line Spectral Frequency (LSF) representation of the Linear Prediction Coefficients (LPC) is bounded and ordered. Recent research [26–33] demonstrated that the usage of non-Gaussian statistical models is advantageous in applications where the data are not Gaussian distributed. Common non-Gaussian distributions include, among others, the beta distribution, the gamma distribution, the Dirichlet distribution, and the Poisson distribution. The non-Gaussian distributions mentioned above all belong to the exponential family [34, 35], which is a class of probability distributions chosen for mathematical convenience. Also, the distributions in the exponential family are appropriate for modeling the data in some applications. We restrict our attention to the non-Gaussian distributions in the exponential family in this dissertation and use the term "non-Gaussian distribution" to denote "non-Gaussian distribution in the exponential family" in the following. Similarly, the technique of mixture models can also be applied to the non-Gaussian distributions to build a non-Gaussian Mixture Model (nGMM). Even though the GMM could approximate an arbitrary probability distribution when the number of mixture components is unlimited, the nGMM can model the probability distribution more efficiently than the GMM, with comparable model complexity [27–30].

An important task in statistical modeling is to fit a statistical model by estimating its parameter vector. As mentioned, a statistical model is a parameterized model. When modeling the data with a probability distribution based on some parameters, the PMF or PDF of this probability distribution can also be interpreted as a function of the parameter vector, given the observed data. In such a case, this function is named the "likelihood function" [36]. There are many methods to fit the parameter vector or to estimate the distribution of the parameter vector (e.g., minimum description length (MDL) estimation, Maximum Likelihood (ML) estimation). The method of finding suitable values of the parameter vector that maximize the likelihood function is named ML estimation, which is an essential method in estimation theory. This method was introduced by Fisher in [37]. Reviews and introductions to ML estimation can be found in, for example, [8, 36, 38–40]. If we treat the parameter vector as a random vector variable and assign it a prior distribution, we can obtain the posterior distribution of the parameter vector (variable) using Bayes' theorem [8, 38, 41–43]. Instead of providing a point estimate of the parameter vector, the posterior distribution provides a description of the probability distribution of the parameter vector. Such an estimate is more informative than the point estimate, since the posterior distribution describes the possible values of the parameter vector, with a certain probability. If a point estimate is still required, the mode of the posterior distribution of the parameter vector can be traced by the so-called Maximum A-Posteriori (MAP) estimation [44, 45]. In the framework of Bayesian estimation, if the prior information is non-informative, then the MAP estimate is identical to the ML estimate. Moreover, when the posterior distribution is unimodal and the amount of data goes to infinity, the Bayesian estimate converges to the ML estimate [8]. In such a case, both estimates converge to the true values of the parameters.
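To make the ML/MAP relationship concrete, the short sketch below (a minimal illustration, not taken from the thesis; the Beta(a, b) prior and the coin-flip data are assumed purely for the example) compares the ML estimate of a Bernoulli parameter with the MAP estimate under a conjugate beta prior. With a uniform Beta(1, 1) prior the two estimates coincide, and as the number of observations grows the effect of an informative prior vanishes.

```python
import numpy as np

def ml_estimate(k, n):
    """ML estimate of a Bernoulli parameter from k successes in n trials."""
    return k / n

def map_estimate(k, n, a, b):
    """MAP estimate under a conjugate Beta(a, b) prior (mode of the posterior)."""
    return (k + a - 1.0) / (n + a + b - 2.0)

rng = np.random.default_rng(0)
true_p = 0.3
for n in (10, 100, 10000):
    k = rng.binomial(n, true_p)
    print(n,
          round(ml_estimate(k, n), 4),
          round(map_estimate(k, n, 1, 1), 4),   # uniform prior: identical to ML
          round(map_estimate(k, n, 5, 5), 4))   # informative prior: effect shrinks with n
```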

Often, it is not practical to use a single probability distribution to describe the data. Thus the mixture modeling technique is often applied in practical problems. The Expectation Maximization (EM) algorithm [45, 46] and its variants, e.g., the Generalized Expectation Maximization (GEM) algorithm [47] or the Expectation Conditional Maximization (ECM) algorithm [48], are frequently used to carry out the ML estimation or the MAP estimation of the parameter vector. The EM based algorithms assign a point estimate to the parameter vector but do not yield a posterior distribution. To estimate the distribution of the parameter vector, the Bayesian estimation method, which involves the prior and posterior distributions as a conjugate pair, is applied. However, the Bayesian estimation of the parameter vector is not always analytically tractable. In such a case, the Variational Inference (VI) framework [8, 49–51] is frequently applied to factorize the parameter vector into sub-groups and approximate the posterior distribution. Unlike the EM based algorithms, the VI based methods result in an approximation to the posterior distribution of the parameter vector. A point estimate could also be obtained by taking some distinctive value (e.g., the mode, the mean) from the posterior distribution. With a sufficiently large amount of data, the result of the VI based method converges to that of the EM based methods [8]. Meanwhile, various approaches including sampling methods [52, 53] (e.g., importance sampling [8], Gibbs sampling [54]) can be applied to generate data according to an obtained distribution.

In addition to the algebraic representation of statistical models, the graphical model [8, 55, 56] provides a diagrammatic representation of the statistical model. With the graphical model, the relations among different variables can be easily inferred by Bayes' theorem and the conditional independence among different variables can be obtained. By visualizing the relations, the graphical model facilitates the analysis of a statistical model.

According to the above discussion, the main features of the statistical modeling framework are, among others:

1. Usage of probability distributions, which are usually parameterized;

2. The model should be able to fit the data;

3. There exist algorithms that can be applied to estimate the parameter vector.

Utilizing an appropriate statistical model in practical applications can improve the modeling performance. For instance, in the field of pattern recognition, the statistical model based approach is the most intensively studied and applied [16]. For source coding problems, for example in speech coding, the statistical model is applied to model the speech signal or the speech model for the purpose of data compression [57, 58]. In speech enhancement, the speech, the noise, and the noisy speech are modeled by statistical models and enhancement algorithms are derived based on the estimated models [57, 59]. Another application of the statistical model is found in matrix factorization. To factorize a matrix with a low rank approximation, several statistical methods, e.g., probabilistic Principal Component Analysis (PCA) [22] and Bayesian Nonnegative Matrix Factorization (NMF) [24], were proposed to give a probabilistic representation of the conventional methods.

The statistical models mentioned above are mainly based on the Gaussian distribution. One can often do better if the statistical model is non-Gaussian, because real data are seldom Gaussian distributed. Thus applying non-Gaussian statistical models to non-Gaussian data could improve the performance of the applications. This motivates us to work on non-Gaussian statistical models. The work in this dissertation focuses mainly on the following three aspects:

1. Find suitable non-Gaussian statistical models to describe data that are non-Gaussian distributed;

2. Derive efficient parameter estimation methods, including both the ML and the Bayesian estimations, for non-Gaussian statistical models;

3. Propose applications where the non-Gaussian modeling could improve the performance.

The remaining part is organized as follows: the concepts, the principles, and some examples of the statistical models (Gaussian and non-Gaussian) are reviewed in Section 2; the parameter estimation methods for the non-Gaussian statistical models are introduced in Section 3; the performance of the non-Gaussian statistical models is evaluated in Section 4; the contributions of this dissertation are summarized in Section 5.

2 Statistical Models

A statistical model contains a set of probability distributions to describe the data [1–3]. Based on the established statistical model, inferences and predictions can be made. Usually, a statistical model describes the relation between the data and the parameters and connects them in a statistical way.

2.1 Probability Distribution and Bayes’ Theorem

A probability distribution is a formula that gives the probability of a random variable taking certain values [7]. For a discrete random variable X, the PMF pX(x) is a mathematical function describing the probability that X equals x, i.e.,

pX(x) = Pr(X = x), (1)

where x denotes any possible value that X can take. The PMF satisfies the constraint that

p_X(x) \ge 0, \qquad \sum_{x \in \Omega_X} p_X(x) = 1,  (2)


where ΩX is the sample space of X. If we have another discrete random variable Y, then pX,Y(x, y) is the joint PMF representing the probability that (X, Y) equals (x, y). Furthermore, the conditional PMF pX(x|Y = y) gives the probability that X = x when Y = y. With the sum and product rules [8], we have

p_X(x) = \sum_{y \in \Omega_Y} p_{X,Y}(x, y), \qquad p_{X,Y}(x, y) = p_Y(y|X = x)\, p_X(x).  (3)

Also, pX(x) is called the marginal PMF. X is independent of Y if

pX,Y (x, y) = pX(x)pY (y). (4)

Furthermore, by involving a third discrete random variable Z, X and Y are conditionally independent given Z = z, if

pX,Y (x, y|Z = z) = pX(x|Z = z)pY (y|Z = z). (5)

Bayes’ theorem links the conditional probability a given b and the inverseform. Given conditional probability p(a|b) and the marginal probability p(b),Bayes’s theorem [8,38,41–43] infers the conditional probability of b given a as

p(b|a) = \frac{p(a|b)\, p(b)}{p(a)} = \frac{p(a|b)\, p(b)}{\sum_{b} p(a|b)\, p(b)}.  (6)

With Bayes’ theorem in (6), we can obtain the PMF of Y = y given X = x as

p_Y(y|X = x) = \frac{p_X(x|Y = y)\, p_Y(y)}{\sum_{y \in \Omega_Y} p_X(x|Y = y)\, p_Y(y)}.  (7)

The above procedure is named "Bayesian inference" [43, 60, 61].

When X denotes a continuous random variable, the PDF is used to describe the "likelihood" that X takes the value x. If the probability that X falls in an interval [x, x + ∆] can be denoted as

\Pr(X \in [x, x + \Delta]) = \int_{x}^{x+\Delta} f_X(v)\, dv  (8)

and

\lim_{\Delta \to 0} \frac{1}{\Delta} \int_{x}^{x+\Delta} f_X(v)\, dv = f_X(x),  (9)

then fX(x) is named the PDF of X. As a PDF, fX(x) should satisfy the following constraints:

f_X(x) \ge 0, \; x \in \Omega_X, \qquad \int_{x \in \Omega_X} f_X(x)\, dx = 1.  (10)

Similarly, given another continuous random variable Y, the "conditional" PDF of X given Y is denoted as fX(x|y) and the "joint" PDF is denoted as fX,Y(x, y). The sum and product rules are also valid for X and Y:

f_X(x) = \int_{y \in \Omega_Y} f_{X,Y}(x, y)\, dy, \qquad f_{X,Y}(x, y) = f_Y(y|x)\, f_X(x).  (11)


Again, with Bayes’ theorem, the conditional PDF of Y given X is formulated as

f_Y(y|x) = \frac{f_X(x|y)\, f_Y(y)}{\int_{y \in \Omega_Y} f_X(x|y)\, f_Y(y)\, dy}.  (12)

Similarly, X and Y are independent if

fX,Y (x, y) = fX(x)fY (y). (13)

Furthermore, X and Y are conditionally independent given another continuous random variable Z if

fX,Y (x, y|z) = fX(x|z)fY (y|z). (14)

The PMF applies to discrete random variables while the PDF applies to continuous random variables. They have similar properties and the same inference methodology can be applied to both. Thus the analysis, inference, and discussion in the following are based only on continuous random variables.
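As a concrete illustration of the discrete form of Bayes' theorem in (7), the following minimal sketch (not part of the thesis; the prior and likelihood values are invented purely for illustration) computes a posterior PMF by normalizing the product of prior and likelihood.

```python
import numpy as np

# Hypothetical example: Y is a latent class with prior p_Y(y),
# and p_X(x|y) is the likelihood of an observation x under each class.
prior = np.array([0.5, 0.3, 0.2])          # p_Y(y) for y = 0, 1, 2
likelihood = np.array([0.10, 0.40, 0.70])  # p_X(x|y) for the observed x

# Bayes' theorem, eq. (7): posterior = likelihood * prior / evidence
joint = likelihood * prior
posterior = joint / joint.sum()

print(posterior)        # approx. [0.161 0.387 0.452]
print(posterior.sum())  # 1.0
```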

2.2 Parametric, Non-parametric, and Semi-parametric Models

As a mathematical formula describing the connection between X and θ, the PDF is a function of x based on the parameter θ. The random variable X could be either a scalar variable or a vector variable, and the same holds for the parameter θ. A scalar can be considered as a special case of a vector (i.e., a vector with dimensionality 1) and a vector can be considered as an extension (in the dimensionality) of a scalar. Since a PDF is usually defined for a scalar random variable (then extended to a vector random variable, if possible) and contains more than one parameter, X denotes the scalar random variable and θ denotes the parameter vector in the following, except where otherwise stated. In this dissertation, a formal expression for the PDF of X based on θ is fX(x; θ), where x is a sample of the variable X. The notation fX(x|θ) has the same mathematical expression as fX(x; θ) but denotes the conditional PDF of the variable X given the variable Θ, where θ is a sample of Θ; it is used in Bayesian analysis.

A parametric statistical model has a fixed model complexity. In other words, the dimensionality of θ is known in advance and fixed. The parametric model is widely used in research. A scalar Gaussian distribution is a parametric model with mean and standard deviation as its parameters. A GMM contains a known (or, say, pre-chosen) number of mixture components, thus it also has a fixed number of parameters. A Bayesian version of the mixture model, e.g., the Bayesian GMM [62], could give zero probabilities (or slightly positive probabilities, not distinguishable from zero) to some mixture components according to the data. However, assigning zero probability to a mixture component does not mean this mixture component "does not exist". The component with zero probability still belongs to the whole model, but contributes nothing.


Some applications require a flexible model complexity, which means the dimensionality of the parameter vector is not pre-assigned and may change according to the data. This is also because a simple model is not sufficient for the analysis of a complicated data set. To this end, the non-parametric model [9, 13, 63] can be applied to describe the data efficiently and flexibly. The term "non-parametric" here means the dimensionality of the parameter vector is unknown or unbounded in advance. The number of parameters may change, depending on the data. One of the frequently used non-parametric models is the KDE [10, 11], which is applied to estimate the PDF of a variable. It is a smoothing method and can use different kernel functions. For the purpose of Bayesian analysis, the Bayesian non-parametric methods [63, 64] aim to make an informative estimate with fewer assumptions about the model. The Dirichlet process mixture model [12, 65–67] is a powerful tool for modeling data with full flexibility in the model size. The Dirichlet process mixture model generates a discrete random probability measure for each mixture component. The activation of a new mixture component is controlled by the new data, the existing components, and a concentration parameter. This generating process is one perspective of the Chinese restaurant process [68].
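As a small illustration of a non-parametric density estimate, the sketch below (a minimal example on assumed synthetic data, not drawn from the thesis) fits a Gaussian-kernel KDE to samples and evaluates the estimated PDF on a grid; the effective complexity grows with the data rather than being fixed in advance.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Synthetic bimodal data: the kind of multimodal sample a single Gaussian fits poorly.
data = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(1.5, 1.0, 500)])

kde = gaussian_kde(data)          # bandwidth chosen by Scott's rule by default
grid = np.linspace(-5, 5, 201)
density = kde(grid)               # estimated PDF values on the grid

print(density.max(), np.trapz(density, grid))  # peak height and approx. total mass (~1)
```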

A model that contains both a parametric component and a non-parametric component is called a semi-parametric model [13, 69]. The parametric model has a simple structure while the non-parametric model is flexible in its model complexity; the semi-parametric model combines these two features. One of the best-known semi-parametric models is the Cox proportional hazards model [14], which contains a finite dimensional parameter vector of interest and an infinite dimensional parameter, the baseline hazard function.

With the assistance of the Bayesian framework, the model complexity of a parametric model can also be adapted. This adaptation is mainly based on assigning zero weights to components whose contribution to the whole model can be ignored. It does not change the actual model size but limits the number of "active" parameters in the model. The model complexity is also flexible in this sense. Thus, this dissertation focuses only on how to find a suitable parametric model for the data and how to estimate the parameters efficiently. The basic idea of choosing a proper probability distribution for the data and the proposed estimation methodology can also be easily generalized to non-parametric models.

2.3 Gaussian Distribution

The Gaussian distribution is the most frequently used probability distribution [70, 71]. There are numerous applications of the Gaussian distribution [16], the corresponding GMM [19, 21], or the Gaussian process [18, 72].

The Gaussian distribution has a PDF as

f_X(x; \mu, \sigma) = \mathcal{N}(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}},  (15)

where µ is the mean and σ denotes the standard deviation. It is easy to extend the scalar case to the multivariate case with a K-dimensional vector variable X as

f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(\sqrt{2\pi})^{K} \sqrt{|\boldsymbol{\Sigma}|}}\, e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})},  (16)

with mean vector µ and covariance matrix Σ. The Gaussian distribution has a symmetric "bell" shape and its parameters can be easily estimated. For the Bayesian analysis, prior distributions can be assigned to the parameters and the posterior distribution of the parameters can be obtained in an analytically tractable form [62]. This is another advantage of applying the Gaussian distribution. If the data to model are not unimodal or do not have a symmetric shape, a GMM could be applied to describe the data distribution. In principle, the GMM can model an arbitrary distribution with an unlimited number of mixture components. In other words, it increases the model complexity to improve the flexibility of modeling. More applications based on the Gaussian distribution can also be found in [8, 15, 17, 24, 73].
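For reference, the sketch below (a minimal check, not from the thesis; the mean vector and covariance matrix are arbitrary example values) evaluates the multivariate Gaussian PDF of (16) directly from the formula and compares it with scipy's implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Evaluate eq. (16) for a K-dimensional vector x."""
    K = len(mu)
    diff = x - mu
    norm = (np.sqrt(2.0 * np.pi) ** K) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
x = np.array([0.5, 0.5])

print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))  # should match
```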

2.4 Non-Gaussian Distributions

Although the Gaussian distribution is widely applied in a lot of applications, some practical data are obviously non-Gaussian. For example, the queueing time is nonnegative, the short term spectrum of speech is semi-bounded in [0, ∞), the pixel value of a digitalized image is bounded in a fixed interval, and the LSF representation of the LPC is bounded in [0, π] and strictly ordered. If we assume such data are Gaussian distributed, the domain of the Gaussian variable violates the boundary property of the semi-bounded or the bounded support data. Even though a general solution to this kind of problem is to apply a GMM to model the data, it requires a large number of mixture components to describe the edge of the data's sample space [29].

Recently, several studies showed that applying a distribution with a suitable domain (or a mixture model of such distributions) can explicitly improve the efficiency of modeling the data and hence improve the performance (e.g., recognition, classification, quantization, and prediction) based on such a modeling framework. Ji et al. [74] applied the Beta Mixture Model (BMM) in bioinformatics to solve a variety of problems related to the correlations of gene-expression. A practical Gibbs sampler based Bayesian estimation of the BMM was introduced in [27]. To prevent the edge effect in the GMM, the bounded support GMM was proposed in [29] to model the underlying distribution of the LSF parameters for speech coding. As an extension of the beta distribution, the Dirichlet distribution and the corresponding Dirichlet Mixture Model (DMM) were applied to model the color image pixels in Red Green Blue (RGB) space [75]. In the area of Nonnegative Matrix Factorization (NMF), Cemgil et al. [31] proposed a Bayesian inference framework for NMF based on the Poisson distribution. An exponential distribution based Bayesian NMF was introduced in [32] for recorded music. These applications focus on capturing the semi-bounded or bounded support properties of the data and apply an appropriate distribution to model the data, so that the model complexity is decreased and the application performance is improved, compared to some Gaussian distribution based methods. Thus, it is more convenient and efficient to utilize a distribution which has a suitable domain and a clear mathematical form to describe the data.

[Figure 1: Examples of the beta distribution: (a) u = 3, v = 3; (b) u = 2, v = 5; (c) u = 0.2, v = 0.8.]

To this end, this dissertation analyzes several non-Gaussian distributions, e.g., the beta distribution, the Dirichlet distribution, and the gamma distribution. Also, it studies the parameter estimation methods for some non-Gaussian distributions. Some applications in the areas of speech processing, image processing, and NMF are evaluated to validate the proposed methods. These contents are covered in the attached papers A-F.

One of the obvious and main differences between the Gaussian and non-Gaussian distributions is the sample space of the variable X. In the remaining parts of this section, some non-Gaussian distributions will be introduced. The methods for estimating the parameters and the corresponding applications will be introduced in sections 3 and 4, respectively.

Beta Distribution

The beta distribution is a family of continuous distributions defined on the interval [0, 1] with two parameters. The PDF of the beta distribution is

f_X(x; u, v) = \mathrm{Beta}(x; u, v) = \frac{1}{\mathrm{beta}(u, v)}\, x^{u-1} (1 - x)^{v-1}, \quad u, v > 0,  (17)

where beta(u, v) is the beta function, defined as \mathrm{beta}(u, v) = \frac{\Gamma(u)\Gamma(v)}{\Gamma(u+v)}, and Γ(·) is the gamma function, defined as \Gamma(z) = \int_{0}^{\infty} t^{z-1} e^{-t}\, dt. The mean value, the mode, and the variance are

\bar{X} = \mathrm{E}[X] = \frac{u}{u+v},
\mathrm{Mode}(X) = \frac{u-1}{u+v-2}, \quad \text{if } u, v > 1,
\mathrm{Var}(X) = \mathrm{E}\left[(X - \bar{X})^2\right] = \frac{uv}{(u+v)^2 (u+v+1)},  (18)

respectively. The differential entropy of the beta variable is

H(X) = \ln \mathrm{beta}(u, v) - (u-1)\psi(u) - (v-1)\psi(v) + (u+v-2)\psi(u+v),  (19)


where ψ(·) is the digamma function, defined as \psi(z) = \frac{\partial \ln \Gamma(z)}{\partial z}. The beta distribution has a flexible shape: it can be symmetric, asymmetric, or convex, depending on the parameters u and v. These two parameters play similar roles in the beta distribution. Examples of the beta distribution can be seen in Fig. 1. For bounded support data beyond the interval [0, 1], it is straightforward to shift and scale the data to fit in this interval.
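The short sketch below (a minimal illustration, not taken from the papers; the parameter pairs simply mirror Fig. 1) evaluates the beta PDF of (17) with scipy and checks the mean and variance formulas in (18); it also shows the shift-and-scale step for data supported on an arbitrary interval [a, b].

```python
import numpy as np
from scipy.stats import beta as beta_dist

for u, v in [(3, 3), (2, 5), (0.2, 0.8)]:        # the parameter pairs of Fig. 1
    mean = u / (u + v)                            # eq. (18)
    var = u * v / ((u + v) ** 2 * (u + v + 1))    # eq. (18)
    print(u, v, mean, var,
          beta_dist.mean(u, v), beta_dist.var(u, v))  # scipy agrees

# Bounded data on [a, b] are shifted and scaled to [0, 1] before beta modeling.
a, b = -2.0, 5.0
raw = np.random.default_rng(2).uniform(a, b, 1000)
scaled = (raw - a) / (b - a)
print(scaled.min() >= 0.0, scaled.max() <= 1.0)
```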

The multivariate beta distribution can be obtained by cascading a set of beta variables together, i.e., each element in the K-dimensional vector variable X is a scalar beta variable. The PDF of the multivariate beta distribution is expressed as

f_{\mathbf{X}}(\mathbf{x}; \mathbf{u}, \mathbf{v}) = \prod_{k=1}^{K} \mathrm{Beta}(x_k; u_k, v_k) = \prod_{k=1}^{K} \frac{1}{\mathrm{beta}(u_k, v_k)}\, x_k^{u_k - 1} (1 - x_k)^{v_k - 1}, \quad u_k, v_k > 0.  (20)

Unlike the multivariate Gaussian distribution, the covariance matrix of the multivariate beta distribution is diagonal. Thus the correlation among the elements in the vector variable cannot be described by a single multivariate beta distribution. In such a case, the BMM is applied and the correlation is handled by the mixture modeling.

The beta distribution is usually applied to model events that take place in a limited interval and is widely used in financial model building, project control systems, and some business applications [76]. For the purpose of Bayesian analysis, it is often used as the conjugate prior to the Bernoulli distribution [76–78]. In the field of electrical engineering, it is not usual to model the distribution of the data directly with the beta distribution. The reason that the beta distribution (or the corresponding BMM) has not received so much attention might be due to the difficulties in parameter estimation, where a closed-form solution does not exist and some approximations are required [27, 28]. To solve this problem, the EM algorithm for the ML estimation of the parameters in the BMM was proposed in [79]. To make a more informative estimate, a practical Bayesian estimation method for the BMM through Gibbs sampling was introduced in [27]. In contrast to the method based on sampling, we proposed an analytically tractable solution to obtain the posterior distribution of the BMM in [28], i.e., the attached paper A. A brief introduction to the closed-form solution is given in section 3.2 and more details are provided in the attached paper A.
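To make the estimation problem concrete, the sketch below (an illustrative baseline, not the estimators developed in the papers) estimates the two beta parameters from data by simple moment matching and, for comparison, by scipy's numerical ML fit with the location and scale fixed to the [0, 1] support.

```python
import numpy as np
from scipy.stats import beta as beta_dist

def beta_moment_estimates(x):
    """Method-of-moments estimates of (u, v) from samples on (0, 1)."""
    m, s2 = x.mean(), x.var()
    common = m * (1.0 - m) / s2 - 1.0
    return m * common, (1.0 - m) * common

rng = np.random.default_rng(3)
x = beta_dist.rvs(2.0, 5.0, size=5000, random_state=rng)

print(beta_moment_estimates(x))                 # roughly (2, 5)
print(beta_dist.fit(x, floc=0, fscale=1)[:2])   # numerical ML estimate of (u, v)
```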

Furthermore, we also proposed a new NMF strategy, the beta-gamma NMF (BG-NMF), for Low Rank Matrix Approximation (LRMA) of bounded support data [33]. Each element in the matrix is modeled by a beta distribution and then the parameter matrices are factorized. More details about the BG-NMF can be found in the attached paper F.

Dirichlet Distribution

[Figure 2: An example of a three-dimensional Dirichlet distribution with parameter vector α = [6, 2, 4]^T: (a) the Dirichlet distribution; (b) top view.]

The Dirichlet distribution is the conjugate prior to the multinomial distribution. It is parameterized by a vector parameter α, in which each element is positive. The K-dimensional vector variable X in the Dirichlet distribution follows two constraints: 1) each element in the vector variable is positive and 2) the sum of the elements is equal to one. Thus a K-dimensional Dirichlet variable has K − 1 degrees of freedom. The PDF of the Dirichlet distribution is defined as

f_{\mathbf{X}}(\mathbf{x}; \boldsymbol{\alpha}) = \mathrm{Dir}(\mathbf{x}; \boldsymbol{\alpha}) = \frac{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} x_k^{\alpha_k - 1}.  (21)

The mean, mode, and covariance matrix of the Dirichlet distribution are

\bar{X}_k = \frac{\alpha_k}{\alpha_0}, \qquad \mathrm{Mode}(X_k) = \frac{\alpha_k - 1}{\alpha_0 - K}, \quad \text{if } \alpha_k > 1,
\mathrm{Var}(X_k) = \frac{\alpha_k (\alpha_0 - \alpha_k)}{\alpha_0^2 (\alpha_0 + 1)}, \qquad \mathrm{Cov}(X_l, X_k) = \frac{-\alpha_l \alpha_k}{\alpha_0^2 (\alpha_0 + 1)}, \quad l \ne k,  (22)

respectively, where \alpha_0 = \sum_{k=1}^{K} \alpha_k. The differential entropy of the Dirichlet variable is

H(\mathbf{X}) = \sum_{k=1}^{K} \ln \Gamma(\alpha_k) - \ln \Gamma\left(\sum_{k=1}^{K} \alpha_k\right) + (\alpha_0 - K)\psi(\alpha_0) - \sum_{k=1}^{K} (\alpha_k - 1)\psi(\alpha_k).  (23)

Fig. 2 shows an example of a three-dimensional Dirichlet distribution with different view angles. It can be recognized that the elements in the Dirichlet variable are negatively correlated, which means that if we increase one element then another (or some other) element(s) should be decreased correspondingly. This property plays an important role in mixture modeling, where the Dirichlet distribution is used to model the prior/posterior distribution of the weighting factors of the mixture components. In the Bayesian analysis, if the kth mixture component has a very small weighting, i.e., the corresponding parameter αk in the Dirichlet distribution is relatively small in the whole parameter vector, this mixture component has very little effect on the whole model, and can thus be discarded. In the Dirichlet process mixture model [12, 65–67], the Dirichlet process is considered as an infinite dimensional Dirichlet distribution, where only a limited number of mixture components are activated by the observed data. Also, it is still possible to activate a new mixture component with the upcoming data.

Two more distinctive properties of the Dirichlet variable should also be mentioned here: the aggregation property [80] and the neutrality [81]. By the aggregation property of the Dirichlet variable, a K-dimensional Dirichlet vector variable can be aggregated to a K − 1 dimensional vector variable as

\mathbf{X}' = [X_1, \ldots, X_k + X_{k+1}, \ldots, X_K]^{\mathrm{T}} \sim \mathrm{Dir}(\mathbf{x}'; \boldsymbol{\alpha}'),  (24)

with α' = [α_1, \ldots, α_k + α_{k+1}, \ldots, α_K]^T. As the elements in the Dirichlet variable are permutable, any two elements can be added together and a new Dirichlet variable with a new parameter vector is obtained. This procedure can be carried out iteratively so that a beta variable [X, 1 − X]^T is finally obtained. The neutrality of the Dirichlet variable was introduced by Connor et al. [81] to study proportions in biological data. The neutrality of a K-dimensional Dirichlet variable can be expressed as

f_{X_k, \mathbf{X}_{\setminus k}}\left(x_k, \frac{x_1}{1-x_k}, \ldots, \frac{x_K}{1-x_k}\right) = f_{X_k}(x_k)\, f_{\mathbf{X}_{\setminus k}}\left(\frac{x_1}{1-x_k}, \ldots, \frac{x_K}{1-x_k}\right),  (25)

which means that if we take an arbitrary element from the Dirichlet variable and normalize the remaining elements, the selected element is independent of the remaining normalized ones.
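The sketch below (a minimal numerical check using the example parameter vector α = [6, 2, 4]^T from Fig. 2; it is not part of the papers) samples from the Dirichlet distribution, verifies the negative covariance of (22), and illustrates the aggregation property (24): summing the last two elements yields a two-dimensional Dirichlet variable whose first component behaves like a Beta(6, 6) variable.

```python
import numpy as np
from scipy.stats import dirichlet, beta

alpha = np.array([6.0, 2.0, 4.0])
alpha0 = alpha.sum()
rng = np.random.default_rng(4)

samples = dirichlet.rvs(alpha, size=200000, random_state=rng)   # shape (N, 3)

# Negative correlation between elements, eq. (22)
cov_12_theory = -alpha[0] * alpha[1] / (alpha0 ** 2 * (alpha0 + 1))
cov_12_empirical = np.cov(samples[:, 0], samples[:, 1])[0, 1]
print(cov_12_theory, cov_12_empirical)

# Aggregation property, eq. (24): [X1, X2 + X3] ~ Dir([6, 2 + 4]),
# so X1 alone is Beta(6, 6) distributed.
x1 = samples[:, 0]
print(x1.mean(), beta.mean(6, 6))   # both close to 0.5
print(x1.var(), beta.var(6, 6))
```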

The Dirichlet distribution is usually used as the conjugate prior to the multinomial distribution in Bayesian analysis [82, 83]. Instead of applying the Dirichlet distribution as the prior/posterior distribution, Bouguila et al. [75, 84] applied the DMM directly to the color image pixel values, where the pixel value in the RGB space is considered as a three-dimensional vector and normalized to satisfy the constraints of the Dirichlet variable. A generalized DMM was applied for eye modeling in [85]. Blei et al. proposed the Latent Dirichlet Allocation (LDA) model for collections of discrete data such as text corpora [26, 86]. For speech coding, we utilized the boundary and ordering properties of the LSF representation to introduce a new representation named ∆LSF [30, 87]. The ∆LSF satisfies the constraints of the Dirichlet variable, thus a DMM is used to model the underlying distribution of the ∆LSF vectors. Furthermore, based on the obtained DMM, the aggregation property, and the neutrality, we proposed a PDF-optimized Vector Quantization (VQ) to quantize the Linear Prediction (LP) model. A brief introduction is given in section 4.1 and full details are presented in the attached papers D and E.

Although in some literature [75, 84] the Dirichlet distribution is considered as the multivariate generalization of the beta distribution, we do not follow this terminology here. A K-dimensional Dirichlet variable satisfies the constraint of unit summation, thus the sample space is a standard K − 1 simplex (see (21)). However, each element in the vector variable of the multivariate beta distribution (please refer to (20)) has the support [0, 1], which differs from the elements in the Dirichlet variable. For example, if we set K = 3, then the sample space of the Dirichlet variable in (21) is a simplex (a triangle plane) while the support of the multivariate beta distribution in (20) is within a unit cube. A comparison of the sample spaces of the three-dimensional Dirichlet and the three-dimensional multivariate beta distribution is illustrated in Fig. 3. In this dissertation, we consider the multivariate beta distribution and the Dirichlet distribution as two different cases, thus these two types of distributions can be applied to different data. The ML estimation of the Dirichlet distribution and the corresponding DMM can be found in, e.g., [75, 88, 89]. Also, the ML estimation of the multivariate beta distribution/BMM and the Bayesian estimation of the multivariate beta distribution/BMM can be found in [79] and [28], respectively. To demonstrate the effects of different sample spaces, we also proposed a multivariate BMM based VQ and compared it with the DMM based VQ in [30, 87, 89].

[Figure 3: Domain comparison of the three-dimensional Dirichlet distribution and the three-dimensional multivariate beta distribution: (a) Dirichlet distribution; (b) multivariate beta distribution.]

Gamma Distribution

The gamma distribution is well-known for modeling waiting times [90]. As the variable in the gamma distribution is nonnegative, Dat et al. [91] also applied the gamma distribution for modeling the speech power in speech enhancement. The PDF of the gamma distribution is represented as

fX(x; ν, β) = Gam(x; ν, β) = (β^ν / Γ(ν)) x^(ν−1) e^(−βx),  β, ν > 0, (26)

where x is a nonnegative variable, β is the inverse scale parameter, and ν is the shape parameter. The mean value, the mode, and the variance are

X̄ = E[X] = ν/β,  Mode(X) = (ν − 1)/β if ν > 1,  Var(X) = E[(X − X̄)²] = ν/β², (27)


Figure 4: Examples of the gamma distribution: (a) ν = 1, β = 2; (b) ν = 2, β = 2; (c) ν = 4, β = 2.

respectively. If ν is an integer, the gamma distribution represents the sum of ν independent exponential variables, each of which has the same rate parameter β. Furthermore, if ν = 1, the gamma distribution is exactly the same as an exponential distribution. Fig. 4 shows examples of the gamma distribution with different parameter pairs.
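As a small numerical illustration (not from the thesis), the gamma PDF in (26) and the moments in (27) can be evaluated with standard tools; note that scipy parameterizes the gamma distribution by the shape ν and the scale 1/β, and the parameter values below are just examples.

import numpy as np
from scipy.stats import gamma

nu, beta = 4.0, 2.0                      # shape and inverse scale as in (26)
dist = gamma(a=nu, scale=1.0 / beta)     # scipy uses scale = 1/beta

# Moments from (27)
print("mean:", dist.mean(), "theory:", nu / beta)
print("mode (nu > 1):", (nu - 1.0) / beta)
print("var :", dist.var(), "theory:", nu / beta**2)

# PDF values on a grid, as in Fig. 4(c)
x = np.linspace(0.01, 7.0, 5)
print("Gam(x; nu, beta):", dist.pdf(x))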

In the field of Bayesian analysis, the gamma distribution is often used as the conjugate prior to many distributions, such as the exponential distribution, the Gaussian distribution with fixed mean, and the gamma distribution with known shape parameter. In [92], the gamma distribution was used as the prior distribution for the precision parameter1. Hoffman et al. [32] proposed an NMF scheme for recorded music, in which the gamma distribution is used as the conjugate prior to the exponential distribution. In this dissertation, we mainly consider the gamma distribution as the prior distribution for the parameters in the beta distribution, because the variable in the gamma distribution is nonnegative. Even though the gamma distribution is not a conjugate prior to the beta distribution, with some approximations and by the strategy of Variational Inference (VI), we can still obtain a conjugate pair and derive an analytically tractable solution for the Bayesian analysis of the beta distribution. In the attached papers A and F, we present the details of how to apply the gamma distribution as the prior distribution to the parameters in the beta distribution.

2.5 Mixture Model

Most statistical distributions are unimodal, which means that the distribution has a single mode. However, the data in real applications are usually multimodally distributed. To handle the multimodality of the data and to describe the data distribution flexibly, the mixture modeling technique was introduced [8, 19, 20]. The mixture model PDF is a linear combination of several single density functions (mixture components), where each mixture component is assigned a positive weighting factor. The sum of the weighting factors is equal to 1 so that a mixture

1In the original paper, the inverse-gamma distribution was used as the prior distribution for the variance parameter. As the gamma and inverse-gamma distributions are an inverse pair, and the variance and the precision are also an inverse pair, these two statements are equivalent.


Figure 5: Examples of GMM and BMM. (a) A GMM density with three mixture components; the parameters are π1 = 0.5, µ = −3, σ = 1; π2 = 0.2, µ = 0, σ = 1; and π3 = 0.3, µ = 4, σ = 1.5. (b) A BMM density with three mixture components; the parameters are π1 = 0.5, u = 2, v = 8; π2 = 0.2, u = 15, v = 15; and π3 = 0.3, u = 10, v = 2.

model is also a normalized PDF. The mathematical expression for a mixture density with I mixture components is

FX(x; θ) = Σ_{i=1}^{I} πi fX(x; θi),  0 < πi < 1,  Σ_{i=1}^{I} πi = 1, (28)

where θ = [π, θ1, . . . , θI] denotes all the parameters, fX(x; θi) can be any parameterized distribution, and θi is the parameter vector for the ith mixture component.
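For illustration (not from the thesis), a minimal sketch of evaluating the mixture density (28) for a BMM is given below, using the component parameters of Fig. 5(b); the evaluation grid is arbitrary.

import numpy as np
from scipy.stats import beta

# Mixture weights and component parameters from Fig. 5(b)
pi = np.array([0.5, 0.2, 0.3])
uv = [(2.0, 8.0), (15.0, 15.0), (10.0, 2.0)]

def bmm_pdf(x):
    """Evaluate the mixture density F_X(x; theta) of (28) for a BMM."""
    return sum(w * beta(u, v).pdf(x) for w, (u, v) in zip(pi, uv))

x = np.linspace(0.01, 0.99, 5)
print(bmm_pdf(x))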

The most frequently used mixture model is the Gaussian Mixture Model (GMM) [19, 21, 62, 73], which contains several Gaussian distributions. When applying non-Gaussian statistical models, the concept of mixture modeling can easily be extended to a mixture model with several non-Gaussian distributions. The Beta Mixture Model (BMM) was used in several applications [27,28,74,79]. Also, mixtures of Dirichlet distributions were applied to model image pixels [75], to eye modeling [85], to LSF quantization [30, 89], etc. The Bayesian gamma mixture model was applied for radar target recognition in [93]. Fig. 5 shows examples of a GMM and a BMM.

In this section, we introduced some basic concepts about statistical models, probability distributions, and probability density functions. Instead of the classic Gaussian distribution, we focused on the non-Gaussian distributions, e.g., the beta distribution, the Dirichlet distribution, and the gamma distribution, which are efficient for modeling data with “non-bell” shape and semi-bounded/bounded support. Finally, the technique of mixture modeling was briefly summarized. In the next section, the methods of analyzing the non-Gaussian statistical models will be presented.


3 Analysis of Non-Gaussian Models

When applying non-Gaussian statistical models to describe the distribution of the data, we usually assume that one or more parameterized models fit the data. Key analysis tasks for non-Gaussian statistical models include parameter estimation, derivation of the predictive distribution, and model selection. In this section, we first introduce different methods for estimating the parameters, which cover ML estimation and Bayesian estimation. Then we present how to select a suitable statistical model according to the data. Finally, graphical models are introduced as a supplemental tool for analyzing non-Gaussian statistical models.

3.1 Maximum Likelihood Estimation

Maximum Likelihood (ML) estimation is a widely used method for estimating the parameters in statistical models. Assuming that the variable X is distributed following a PDF as

X ∼ fX(x; θ), (29)

then given a set of i.i.d. observations X = {x1, . . . , xN}, the joint density function can be expressed as

f(X; θ) = ∏_{n=1}^{N} fX(xn; θ). (30)

If we interpret the PDF in (30) as the likelihood function of the parameter θ, then the ML estimate of the parameter is denoted as

θML = argmax_θ ∏_{n=1}^{N} fX(xn; θ). (31)

Usually, it is more convenient to work with the logarithm of the likelihood function, i.e., the log-likelihood function, as

θML = argmax_θ Σ_{n=1}^{N} ln fX(xn; θ), (32)

which is equivalent to maximizing the original likelihood function.

If the log-likelihood function has a global maximum and is differentiable at the extreme points, we can take the derivative of the log-likelihood function with respect to the parameter vector and then set the gradient to zero as

∂ ln L(θ; X)/∂θ = Σ_{n=1}^{N} ∂ ln fX(xn; θ)/∂θ = 0. (33)

Solving the above equation could then lead to the ML estimates of the parameters. For the Gaussian distribution, the ML estimates of the parameter θ = [µ, σ]^T can simply be expressed in a closed-form solution as

µML = (1/N) Σ_{n=1}^{N} xn,  σML = √( (1/N) Σ_{n=1}^{N} (xn − µML)² ). (34)


However, for some non-Gaussian distributions, the ML estimates do not have an analytically tractable solution. For example, substituting the beta PDF in (17) into (33) will yield the expression²

[ψ(u + v) − ψ(u) + (1/N) Σ_{n=1}^{N} ln xn,  ψ(u + v) − ψ(v) + (1/N) Σ_{n=1}^{N} ln(1 − xn)]^T = 0. (35)

As the digamma function ψ(·) is defined through an integral, a closed-form solution to (35) does not exist. Numerical methods, e.g., Newton's method [94], are thus required to solve this nonlinear problem. In [79], the Newton-Raphson method was applied to calculate the solution numerically. This works well in practical problems.
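A minimal sketch of such a Newton iteration for (35) is given below; the moment-matching initialization and the clipping constants are illustrative choices, not taken from [79] or [94].

import numpy as np
from scipy.special import digamma, polygamma

def beta_ml(x, iters=50):
    """Newton's method for the ML estimate of the beta parameters, solving (35)."""
    eps = 1e-6
    x = np.clip(x, eps, 1.0 - eps)       # footnote 2: avoid ln(0)
    s1, s2 = np.mean(np.log(x)), np.mean(np.log1p(-x))
    # moment-matching initialization (a common heuristic, not from the thesis)
    m, var = x.mean(), x.var()
    c = m * (1.0 - m) / var - 1.0
    u, w = m * c, (1.0 - m) * c
    for _ in range(iters):
        g = np.array([digamma(u + w) - digamma(u) + s1,
                      digamma(u + w) - digamma(w) + s2])
        t = polygamma(1, u + w)          # trigamma of (u + v)
        J = np.array([[t - polygamma(1, u), t],
                      [t, t - polygamma(1, w)]])
        u, w = np.maximum([u, w] - np.linalg.solve(J, g), eps)
    return u, w

rng = np.random.default_rng(1)
print(beta_ml(rng.beta(5.0, 8.0, size=2000)))   # should be close to (5, 8)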

When computing the ML estimate of the parameters in a mixture model, the Expectation Maximization (EM) algorithm [8,45,46,95] is generally applied. Recall the expression of the mixture model in (28) and assume we have a set of i.i.d. observations X = {x1, . . . , xN}. An I-dimensional indicator vector zn is introduced for each observation xn. The indicator vector zn has exactly one element equal to 1 and the remaining elements equal to 0. If the ith element of zn equals 1, i.e., zin = 1, we assume that the nth observation was generated from the ith mixture component. The indicator vector zn thus follows a categorical distribution with parameter vector π = [π1, . . . , πI]^T as

fZn(zn; π) = ∏_{i=1}^{I} πi^(zin). (36)

In the expectation step, the expected value of zin is calculated with the current parameter estimates as

Z̄in = E[Zin|X, θ] = πi fX(xn; θi) / FX(xn; θ). (37)

In the maximization step, the weight factor πi is calculated as

πi = (1/N) Σ_{n=1}^{N} Z̄in. (38)

The estimate of the parameter vector θi can be obtained by taking the derivative of the log-likelihood of the mixture model (28) with respect to θi and setting the gradient equal to zero as

∂ Σ_{n=1}^{N} ln FX(xn; θ) / ∂θi = Σ_{n=1}^{N} (πi / FX(xn; θ)) ∂fX(xn; θi)/∂θi
= Σ_{n=1}^{N} (πi fX(xn; θi) / FX(xn; θ)) ∂ ln fX(xn; θi)/∂θi = 0, (39)

where the factor πi fX(xn; θi)/FX(xn; θ) is the responsibility Z̄in defined in (37).

2To prevent infinite numbers in the practical implementation, we assign ε1 to xn when xn = 0 and 1 − ε2 to xn when xn = 1. Both ε1 and ε2 are small positive real numbers.


This equation is a weighted-sum version of (33). Thus, for the GMM, we still have a closed-form solution in the maximization step. As with the single non-Gaussian distributions, a mixture model that contains non-Gaussian distributions as mixture components typically does not have an analytically tractable solution. Thus we can apply the same strategies as for ML estimation of non-Gaussian distributions to carry out the maximization step. By performing the expectation step and the maximization step iteratively, we obtain an EM algorithm for a mixture of non-Gaussian distributions. The EM algorithms for the BMM and the DMM were introduced in [79] and [75], respectively.
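A minimal EM sketch for a two-component BMM is given below. The E-step follows (37) and (38); for brevity, the M-step here uses weighted moment matching for the beta parameters instead of the Newton update discussed above, so it is only an approximate variant of the algorithms in [75, 79].

import numpy as np
from scipy.stats import beta

def bmm_em(x, I=2, iters=100, seed=0):
    """EM for a beta mixture model (28); responsibilities as in (37)-(38)."""
    rng = np.random.default_rng(seed)
    pi = np.full(I, 1.0 / I)
    u = rng.uniform(1.0, 5.0, I)
    v = rng.uniform(1.0, 5.0, I)
    for _ in range(iters):
        # E-step: responsibilities, eq. (37)
        comp = np.stack([pi[i] * beta(u[i], v[i]).pdf(x) for i in range(I)])
        z = comp / comp.sum(axis=0, keepdims=True)
        # M-step: weights, eq. (38), and component parameters (moment matching)
        pi = z.mean(axis=1)
        for i in range(I):
            w = z[i] / z[i].sum()
            m = np.sum(w * x)
            s2 = np.sum(w * (x - m) ** 2)
            c = m * (1.0 - m) / s2 - 1.0
            u[i], v[i] = m * c, (1.0 - m) * c
    return pi, u, v

rng = np.random.default_rng(2)
x = np.concatenate([rng.beta(2, 8, 500), rng.beta(10, 2, 300)])
print(bmm_em(np.clip(x, 1e-6, 1 - 1e-6)))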

Note that the EM algorithm may not simplify the calculations if the maximization step is not analytically tractable [95]. Some variants of the EM algorithm, e.g., the GEM [47] or ECM [48] algorithms, can be applied to overcome this kind of problem. Aside from this drawback, the EM algorithm is not guaranteed to reach a global maximum when the log-likelihood function is non-convex [95]; it may then only find a local maximum.

Another drawback of ML estimation is the problem of overfitting [7]: if the number of observations is relatively small compared to the number of parameters, the estimated model may have a very poor predictive performance. We may utilize cross-validation [96, 97], regularization [98, 99], or Bayesian estimation [6, 8, 100] techniques to avoid such problems. In the next section, we focus on Bayesian estimation of the parameters in non-Gaussian statistical models.

3.2 Bayesian Analysis

In ML estimation, the parameter vector θ is assumed to have a fixed but unknown value. ML estimation thus only gives a point estimate of the parameters. If we consider the parameter vector θ to be a vector-valued random variable with some distribution, then a Bayesian estimate [6, 8, 100] of the parameter vector can be made. Given a set of observations X = {x1, . . . , xN} and assuming that the parameter vector is a random variable, the likelihood function of Θ is

LΘ(θ) = f(X|θ) = ∏_{n=1}^{N} fX(xn|θ). (40)

Meanwhile, the prior distribution of Θ is assumed to have a parameterized PDF following

Θ ∼ fΘ(θ; ω0), (41)

where ω0 is called the hyperparameter of the prior distribution. By combining Bayes' theorem in (6) with (40) and (41), we can derive the posterior density function of the variable Θ, given the observations X, as

fΘ(θ|X; ωN) = f(X|θ) fΘ(θ; ω0) / ∫_{θ′∈ΩΘ} f(X|θ′) fΘ(θ′; ω0) dθ′ ∝ f(X|θ) fΘ(θ; ω0), (42)

where ωN is the hyperparameter vector of the posterior distribution but may refer to a sample space different from that of ω0. By (42), the posterior distribution of the variable Θ given the observations X can be obtained. The posterior describes the shape of the distribution of the parameter. Thus Bayesian estimation is


more informative than ML estimation. A point estimate can also be made based on the posterior distribution. For example, the posterior mode is

θMAP = argmax_θ fΘ(θ|X; ωN), (43)

which is the Maximum A-Posteriori (MAP) estimate of the parameter. Also, we can estimate the posterior mean, the posterior variance, etc.

Conjugate Priors and the Exponential Family

If the prior distribution in (41) and the posterior distribution in (42) have the same mathematical form, then the prior distribution is called the conjugate prior to the likelihood function, and the prior distribution and the posterior distribution are said to be a conjugate pair [8].

Among all distributions, the distributions that belong to the exponential family [8,34,35,56] have conjugate priors. The exponential family is defined by [8]

fX(x|θ) = h(x) q(θ) e^(θ^T u(x)), (44)

where h(·), q(·), and u(·) are some functions. The conjugate prior to the likelihood function in (44) is

fΘ(θ; η0, λ0) = p(η0, λ0) q(θ)^(λ0) e^(λ0 θ^T η0), (45)

where p(·, ·) is a function of η0 and λ0. Combining (44) and (45), the posterior distribution, given a set of observations X = {x1, . . . , xN}, can be written as

fΘ(θ|X; ηN, λN) ∝ q(θ)^(λ0+N) e^(θ^T (Σ_{n=1}^{N} u(xn) + λ0 η0)). (46)

The posterior distribution in (46) has the same mathematical form as the likelihood function in (44), up to a normalization factor. For any distribution belonging to the exponential family, it is straightforward to obtain the posterior distribution. The closed-form solution for conjugate pairs facilitates Bayesian estimation for distributions belonging to the exponential family.

The posterior distributions of the parameters in the Gaussian distribution have been introduced in [8]. For non-Gaussian distributions, we studied the conjugate prior of the beta distribution in [28] (i.e., paper A). Since the beta distribution is a member of the exponential family, it has a conjugate prior. This prior distribution is

fU,V(u, v; α0, β0, ν0) ∝ [Γ(u + v)/(Γ(u) Γ(v))]^(ν0) e^(−α0(u−1)) e^(−β0(v−1)), (47)

where α0, β0, and ν0 are the prior hyperparameters. Combining the likelihood function in (17) and the prior distribution in (47), the posterior distribution can be obtained as

fU,V(u, v|X; αN, βN, νN) ∝ [Γ(u + v)/(Γ(u) Γ(v))]^(νN) e^(−αN(u−1)) e^(−βN(v−1)), (48)

where νN = ν0 + N, αN = α0 − Σ_{n=1}^{N} ln xn, and βN = β0 − Σ_{n=1}^{N} ln(1 − xn) are the posterior hyperparameters.
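A minimal sketch of these hyperparameter updates is given below; the prior values are arbitrary examples.

import numpy as np

def beta_conjugate_update(x, alpha0=0.1, beta0=0.1, nu0=0.1):
    """Posterior hyperparameters of the conjugate prior (47) for data x in (0, 1),
    following the updates below (48). The prior values here are illustrative."""
    x = np.clip(x, 1e-6, 1.0 - 1e-6)
    nu_N = nu0 + x.size
    alpha_N = alpha0 - np.sum(np.log(x))
    beta_N = beta0 - np.sum(np.log1p(-x))
    return alpha_N, beta_N, nu_N

rng = np.random.default_rng(3)
print(beta_conjugate_update(rng.beta(5.0, 8.0, size=100)))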


Unfortunately, this posterior is difficult to use in practice since it leads to analytically intractable integrals. There are several ways to handle this problem. Using Gibbs sampling [27], the posterior mean can be obtained by taking the mean of a set of samples generated according to the posterior distribution. By approximate inference methods [8,28,56], the posterior distribution can be approximated by another distribution, subject to some cost function (e.g., minimizing the Kullback-Leibler (KL) divergence).

Approximate Inference

The central task in Bayesian estimation is to obtain the posterior distribution of the parameter variable Θ given the observed data X = {x1, . . . , xN} and to evaluate some quantity with respect to this distribution. In other words, when the mathematical expression of the posterior distribution is known, the hyperparameters need to be estimated. Besides the straightforward solution above obtained by using conjugate pairs, other approximation schemes can also be applied.

Variational Inference The Variational Inference (VI) framework is a general strategy for the inference of probability distributions [8, 49, 51, 56]. Suppose that the true posterior distribution fΘ(θ|X) is not practical to work with; then a tractable approximation gΘ(θ) of the true posterior distribution might be computed as an alternative solution. The logarithm of the marginal probability can be decomposed as

ln f(X) = ∫ gΘ(θ) ln[f(X, θ)/gΘ(θ)] dθ + ∫ gΘ(θ) ln[gΘ(θ)/fΘ(θ|X)] dθ
        = L(gΘ(θ)) + KL(gΘ(θ)‖fΘ(θ|X))
        = L(g) + KL(g‖f). (49)

The first term in the last line of (49) is a lower bound of ln f(X) since the KL divergence is nonnegative. The KL divergence vanishes when gΘ(θ) equals fΘ(θ|X). As the logarithm of the marginal probability is fixed, minimizing the KL divergence is equivalent to maximizing the lower bound L(g). Thus we can obtain an optimal approximation to the posterior distribution by maximizing the lower bound. Meanwhile, the form of the approximating distribution gΘ(θ) should be chosen carefully so that the KL divergence can be minimized and gΘ(θ) is feasible to work with in practical problems.

The remaining problem in VI is how to maximize the lower bound. A common strategy is to factorize the vector variable Θ into K subgroups as

gΘ(θ) = ∏_{k=1}^{K} gΘk(θk). (50)

The above decomposition assumes mutual independence among the subgroups. If we also, for the moment, treat the variable subgroup Θl as the only free factor, the


lower bound can be factorized as [8]

L(g) = ∫ (∏_{k=1}^{K} gΘk(θk)) [ln f(X, θ) − Σ_{k=1}^{K} ln gΘk(θk)] dθ
     = ∫ gΘl(θl) { ∫ ln f(X, θ) ∏_{k=1, k≠l}^{K} gΘk(θk) dθk } dθl − ∫ gΘl(θl) ln gΘl(θl) dθl + const
     = ∫ gΘl(θl) E_{k≠l}[ln f(X, θ)] dθl − ∫ gΘl(θl) ln gΘl(θl) dθl + const
     = −KL(gΘl(θl) ‖ e^(E_{k≠l}[ln f(X, θ)])) + const, (51)

where

E_{k≠l}[ln f(X, θ)] = ∫ ln f(X, θ) ∏_{k=1, k≠l}^{K} gΘk(θk) dθk (52)

denotes the expectation of ln f(X, θ) with respect to all the variable subgroups except for subgroup Θl. The decomposition in (51) splits the lower bound into two parts: one part containing only the variable subgroup Θl, and one part that is constant when Θl is considered as the only variable. By recognizing the first part as the negative KL divergence of e^(E_{k≠l}[ln f(X, θ)]) from gΘl(θl), the lower bound L(g) is maximized with respect to gΘl when this KL divergence vanishes. This indicates an optimal solution for maximizing the lower bound, which is

ln g*Θl(θl) = E_{k≠l}[ln f(X, θ)] + const. (53)

The above solution is only optimal for the variable subgroup Θl. The optimal solution to the entire posterior distribution can be obtained by cycling through all the variable subgroups, one after the other. Since the optimization problem with respect to the subgroup Θl is a convex problem [101], a unique global maximum exists. The above procedure is named the Factorized Approximation (FA) [8], and was originally developed as mean field theory in statistical physics [102].

Another advantage of the FA is that it provides an analytically tractable solution for approximating the posterior distribution, if the mathematical form of ln gΘl(θl) is the same as that of E_{k≠l}[ln f(X, θ)]. An example for the Gaussian distribution has been introduced in [8], where the mean and the precision parameters were assumed to be independent.
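For concreteness, a minimal coordinate-ascent sketch of that Gaussian example (unknown mean µ and precision τ, with q(µ) Gaussian and q(τ) gamma) is given below; the prior hyperparameters and data are illustrative only, not taken from [8] or the attached papers.

import numpy as np

def gaussian_fa(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, iters=50):
    """Factorized approximation q(mu)q(tau) for a Gaussian with unknown mean
    and precision, alternating the updates of the two factors as in (53)."""
    N, xbar = x.size, x.mean()
    e_tau = a0 / b0                              # initial E[tau]
    aN = a0 + 0.5 * (N + 1)                      # fixed over iterations
    for _ in range(iters):
        # update q(mu) = N(muN, 1/lamN)
        muN = (lam0 * mu0 + N * xbar) / (lam0 + N)
        lamN = (lam0 + N) * e_tau
        # update q(tau) = Gam(aN, bN) using E[mu] and E[mu^2] under q(mu)
        e_sq = np.sum((x - muN) ** 2) + N / lamN
        bN = b0 + 0.5 * (e_sq + lam0 * ((muN - mu0) ** 2 + 1.0 / lamN))
        e_tau = aN / bN
    return muN, lamN, aN, bN

rng = np.random.default_rng(4)
x = rng.normal(2.0, 0.5, size=500)
muN, lamN, aN, bN = gaussian_fa(x)
print("E[mu] =", muN, " E[tau] =", aN / bN, " (true tau =", 1 / 0.5**2, ")")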

However, it is not always true that the terms in (53) have the same mathematical form. Further approximations may thus be required to “force” these two terms to have the same expression. To this end, we extend the lower bound by replacing ln f(X, θl) = E_{k≠l}[ln f(X, θ)] with an (unnormalized) lower bound ln g(X, θl) ≤ ln f(X, θl). This lower bound should have the same mathematical expression as ln gΘl(θl). With this substitution, the lower bound L(g) is lower


bounded by [28] (i.e., paper A)

L(g) = −KL(gΘl(θl) ‖ f(X, θl)) + const ≥ −KL(gΘl(θl) ‖ g(X, θl)) + const. (54)

Similar to the FA, the optimal solution is

ln g*Θl(θl) = ln g(X, θl). (55)

This Extended Factorized Approximation (EFA) procedure maximizes a lower bound of the objective function L(g) instead of the objective function itself. Generally speaking, maximizing this lower bound will asymptotically maximize the objective function. The approximation performance depends on how closely the (unnormalized) bound g(X, θ) follows f(X, θ); the tighter, the better. If equality holds, i.e., the (unnormalized) distribution g(X, θ) is identical to f(X, θ), the EFA method is equivalent to the FA method.

In paper A, we extended FA to EFA and applied this VI strategy to factorize the parameters of the prior/posterior distribution of the beta distribution (i.e., (47) and (48)) into two independent groups as

fU,V(u, v) ≈ fU(u) fV(v). (56)

fU(u) and fV(v) were assigned gamma distributions to capture the nonnegativity of U and V. Based on this EFA method and relative convexity [103], we approximated the log-inverse-beta function and the pseudo digamma function with their first and second order Taylor expansions [104, 105] and derived an analytically tractable solution. Fig. 6 shows a comparison between the true posterior distribution fU,V(u, v) and the variational product approximation fU(u) fV(v). Evidently, the factorization cannot capture the correlation between U and V. This is a concession made to speed up the estimation procedure with an analytically tractable solution. The accuracy of the approximation improves as the number of observations increases. More details are available in the attached paper A.

Also, for data with bounded support, we modeled the distribution of the data with a beta PDF, and assigned a gamma prior to the parameters in the beta distribution. By taking advantage of the Low Rank Matrix Approximation (LRMA), the two parameter matrices were factorized by the NMF strategy. The EFA method was applied to derive a closed-form solution for parameter estimation. More details about this beta-gamma NMF (BG-NMF) can be found in the attached paper F.

In the VI framework, if we factorize the variables into subgroups but cannot find a closed-form solution for updating each variable subgroup as in (53) or (55), we could also maximize the lower bound L(g) directly by taking the derivative with respect to each variable subgroup one by one. An example of such a solution can be found in [32].

Because of the factorization and lower bound approximation, both the FA and the EFA introduce a systematic gap between the true optimal solution and the approximate solution [6]. However, as the VI framework can provide a deterministic solution and facilitate estimation compared to sampling methods, it is still an efficient and easily implemented method for Bayesian parameter estimation.


Figure 6: Comparison of the true posterior distribution and the approximation obtained with the VI based method, for (a) N = 10 and (b) N = 100. The data was generated from a Beta(x; 5, 8) distribution. The average Euclidean distances between the obtained posterior mean and the true parameters are 5.73 and 1.09, while the systematic biases of the posterior mean estimation (measured in Euclidean distance) are 5.23 and 0.32 for N = 10 and 100, respectively. This figure is copied from [28], i.e., the attached paper A.

Expectation Propagation The VI framework based method mentioned above approximates the posterior distribution fΘ(θ|X) with gΘ(θ) by minimizing the KL divergence of fΘ(θ|X) from gΘ(θ) as

g*Θ(θ) = argmin_{gΘ(θ)} KL(gΘ(θ) ‖ fΘ(θ|X)). (57)

In this section, we introduce and discuss the Expectation Propagation (EP) method [8, 106, 107], which is another form of deterministic approximate inference for approximating the posterior distribution in Bayesian analysis. The EP method is an improved version of Assumed Density Filtering (ADF) [108,109], so that the result obtained from EP does not depend on the ordering of the input sequence. In EP, the posterior distribution is considered to be a product of a set of factor distributions as

fΘ(θ|X) ∝ f0(θ) ∏_{n=1}^{N} fn(θ), (58)

where f0(θ) denotes the prior distribution and fn(θ) = fX(xn|θ) is the likelihood function of θ given the nth observation xn. Furthermore, EP approximates the posterior distribution by a distribution gΘ(θ), which is assumed to be a product of distribution factors as

gΘ(θ|X) ∝ ∏_{n=0}^{N} gn(θ), (59)


Figure 7: The true posterior, the posterior via EP, and the posterior via VI, based on data generated from a beta distribution Beta(x; 3, 8). Upper row: N = 20; lower row: N = 100. This figure is copied from [110], i.e., the attached paper C.

where gn(θ) is an approximate PDF of Θ with different hyperparameters for every n, i.e., gn(θ) = gΘ(θ; ωn). In contrast to VI, the EP method minimizes the KL divergence of gΘ(θ|X) from fΘ(θ|X) as

g*Θ(θ) = argmin_{gΘ(θ)} KL(fΘ(θ|X) ‖ gΘ(θ)). (60)

The EP method computes an approximation by optimizing each factor gk(θ), k = 1, . . . , N, in turn with respect to all the other factors, to ensure that the approximating distribution gΘ(θ|X) ∝ ∏_{n=0}^{N} gn(θ) is as close as possible to fk(θ) ∏_{n≠k} gn(θ). In other words, we refine one factor gk(θ) while holding the remaining factors fixed. The hyperparameters of gk(θ) are obtained based on a combination of the kth likelihood factor fk(θ) and the remaining factors. It has been shown [8] that the optimal solution to the problem stated above corresponds to matching the expected sufficient statistics. In each iteration, we select one factor and update it by moment matching. This is repeated until convergence. If both the likelihood function and the prior/posterior factor distribution are Gaussian, an analytically tractable solution is obtained. Otherwise, we can apply a sampling method, e.g., importance sampling [8], to generate samples and calculate sufficient statistics for moment matching.

In [110], i.e., the attached paper C, we approximated the posterior distribution


of the parameters in the beta distribution by a product of Gaussian distributions using the EP framework. This EP based method is different from the VI based method proposed in the attached paper A. The differences are illustrated in Fig. 7. The EP based method applies a Gaussian approximation (with boundary truncation) to capture the correlation between the two parameters u and v in (48), while the VI based method neglects this correlation. In the moment matching step of EP, the importance sampling method was utilized, while the VI based method provides an analytically tractable solution. When the amount of data is small, the EP based method outperforms the VI based method, since it retains the information contributed by every observation by iteratively updating one factor at a time. As the number of observations increases, both EP and VI lead to good approximations, especially for point estimates. More details about the EP based method and the comparison can be found in the attached paper C.

Predictive Distribution

In some practical problems, it is more interesting to study the predictive distribution. A simple way to obtain the predictive distribution of x given previously observed data X = {x1, . . . , xN} is to plug in a point estimate of the parameters as

fX(x|X) ≈ fX(x; θ), (61)

where θ is a point estimate obtained from, e.g., ML estimation, based on the observations X. However, in Bayesian analysis, we compute the entire distribution of the parameter vector. Thus a single point estimate may not be sufficient to describe all statistical characteristics. Taking the uncertainty of the parameters into account, the predictive distribution can be formulated, following a standard Bayesian approach, as

fX(x|X) = ∫_{θ∈ΩΘ} fX(x|θ) fΘ(θ|X) dθ, (62)

where fΘ(θ|X) is the posterior distribution obtained based on X. If the likelihood function fX(x|θ) is Gaussian and we assign an inverse-gamma distribution as the prior distribution for the variance, the predictive distribution becomes a Student's t-distribution [8]. However, in most cases the predictive distribution does not have an analytically tractable solution. In such situations, a numerical method can be used to simulate the predictive distribution accurately, or some approximations can be applied to derive an analytically tractable approximation of the predictive distribution.

Sampling Methods Sampling methods [8, 52, 53] are numerical solutions that simulate a target distribution by generating a sufficiently large number of independent samples from it. For the purpose of obtaining the predictive distribution, the distribution defined in (62) can be reformulated as

fX(x|X) = E_{fΘ(θ|X)}[fX(x|θ)], (63)

which is the expectation of fX(x|θ) with respect to fΘ(θ|X). According to the posterior distribution, we can generate a set of independent samples {θ1, . . . , θL},


after which the expectation in (63) can be approximated by a summation as

fX(x|X) ≈ (1/L) Σ_{l=1}^{L} fX(x; θl). (64)

Obviously, the more samples we have from the posterior distribution, the more accurate the approximation in (64) will be. However, it is not always feasible to sample from the posterior distribution directly. In that case, some basic sampling methods [8], e.g., importance sampling or rejection sampling, can be applied to generate samples according to the posterior distribution, based on a reference distribution which is easy to sample from. The performance of such sampling methods depends on how well the reference distribution matches the target distribution. Moreover, methods based on Markov Chain Monte Carlo (MCMC) techniques [52, 53], e.g., Gibbs sampling [54], are widely used as an alternative to the basic sampling methods, especially when sampling from a high-dimensional space.
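A minimal sketch of the averaging step (64) for a beta likelihood is given below; the "posterior samples" of (u, v) are placeholders drawn from arbitrary gamma distributions purely for illustration, standing in for samples produced by, e.g., Gibbs sampling.

import numpy as np
from scipy.stats import beta, gamma

# Placeholder posterior samples of (u, v); in practice these would come from,
# e.g., Gibbs sampling of the posterior (48).
rng = np.random.default_rng(5)
L = 2000
u_samples = gamma(a=50, scale=0.1).rvs(L, random_state=rng)   # centered near u = 5
v_samples = gamma(a=80, scale=0.1).rvs(L, random_state=rng)   # centered near v = 8

# Monte Carlo approximation of the predictive distribution, eq. (64)
x = np.linspace(0.01, 0.99, 99)
predictive = np.mean([beta(u, v).pdf(x) for u, v in zip(u_samples, v_samples)], axis=0)
print(predictive[:5])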

Local Variational Inference An analytically tractable solution to the predictive distribution is preferred as it facilitates calculations. To this end, some approximation can be applied for efficiently calculating the predictive distribution. Different from the “global” Variational Inference (VI) based method introduced in section 3.2, the Local Variational Inference (LVI) method, which also stems from the variational framework, can be applied as an alternative way to compute an approximation of the predictive distribution, instead of simulating it. Recalling the expression of the predictive distribution in (62), if there exists a function g(θ, γ) such that g(θ, γ) ≥ fΘ(θ|X)3, then an upper bound of the predictive distribution can be obtained as

fX(x|X) ≤ ∫_{θ∈ΩΘ} fX(x|θ) g(θ, γ) dθ ≜ J(γ, x), (65)

where γ is a free parameter vector. If the integration in (65) can be calculated analytically, then the upper bound of the predictive distribution is a function of only γ (x is treated as a known value in this case). Minimizing the upper bound of the predictive distribution and then normalizing it afterwards, we can obtain an approximation of the value of the true predictive PDF as

fX(x|X) ≈ (1/C) min_γ J(γ, x) = J(γ*, x)/C, (66)

where C is the normalization factor. Thus the remaining problem is how to find the optimal γ* so that the upper bound is minimized. It can be obtained by solving an optimization problem [101], subject to certain constraints, as

min_γ J(γ, x)  s.t. constraints. (67)

3γ is a parameter vector in a sample space Ωγ, in which all the possible values of γ satisfy this inequality.


Note that the optimal value γ* of (67) depends on x. To facilitate the calculation of the normalization factor C, we can take a reasonable approximation to γ*, regardless of the value of x [111]. In general, the optimized upper bound J(γ*, x) is not exactly equal to fX(x|X) in (65), thus the result is not exact. However, it provides a convenient way to approximate the integration with an analytically tractable solution.

In [111], i.e., the attached paper B, we replaced the posterior distribution of the parameters in the beta distribution, as obtained from the attached paper A, by an upper bound. With some mathematical analysis, we proved the existence of a global minimum of the upper bound. The predictive distribution of the beta distribution is then approximated by an analytically tractable solution, which was shown to be more similar to the true one, compared to the predictive distribution approximated by plugging in the point estimate of the parameters. More details about this LVI based method for approximating the predictive distribution of the beta distribution can be found in the attached paper B.

3.3 Model Selection

When describing data using a statistical model, considerable attention should be paid, in addition to parameter estimation, to which model, from a set of possible descriptions, is to be preferred. Several criteria have been proposed and applied for the purpose of model selection. Frequently used criteria, among others, include the Akaike Information Criterion (AIC) [112], the Bayesian Information Criterion (BIC) [98], and the Bayes Factor (BF) [113]. The AIC is based on information theory and measures the KL divergence loss of the hypothesized model from the true one; the smaller, the better. The AIC is defined as

AIC = −2 ln f(X|θ) + 2s, (68)

where X = {x1, . . . , xN} is the observed data and s denotes the number of free parameters (degrees of freedom) in the statistical model.

For the purpose of Bayesian model selection, the BIC, also called the Schwarz criterion, was proposed in order to favor simpler models than those chosen by the AIC. The BIC applies more weight to the model size and penalizes it as

BIC = −2 ln f(X|θ) + s ln N. (69)

The model with a smaller BIC value is preferred. A comparison of AIC and BIC is available in [114].
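A minimal sketch of computing (68) and (69) is given below, comparing a beta model and a Gaussian model fitted by ML to bounded data; the synthetic data and the choice of s = 2 free parameters per model are illustrative only.

import numpy as np
from scipy.stats import beta, norm

def aic_bic(loglik, s, N):
    """AIC and BIC as defined in (68) and (69)."""
    return -2.0 * loglik + 2.0 * s, -2.0 * loglik + s * np.log(N)

rng = np.random.default_rng(6)
x = rng.beta(2.0, 5.0, size=500)
N = x.size

# Beta model: ML fit of the two shape parameters (support fixed to [0, 1])
u, v, _, _ = beta.fit(x, floc=0, fscale=1)
ll_beta = np.sum(beta(u, v).logpdf(x))

# Gaussian model: closed-form ML estimates, eq. (34)
mu, sigma = x.mean(), x.std()
ll_gauss = np.sum(norm(mu, sigma).logpdf(x))

print("beta     AIC/BIC:", aic_bic(ll_beta, 2, N))
print("gaussian AIC/BIC:", aic_bic(ll_gauss, 2, N))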

The BF offers a way to evaluate evidence in favor of a null hypothesis [113]. It is a more general criterion for Bayesian model comparison. In the Bayesian framework, the BF of model hypothesis H1 over H2 is

BF = f(X|H1) / f(X|H2), (70)

where

f(X|Hk) = ∫_{θk∈ΩΘk} f(X|θk) fΘk(θk|Hk) dθk,  k = 1, 2 (71)


represents the model evidence for the observations X under hypothesis Hk, and θk is the parameter vector under that hypothesis. If the prior probabilities of the two hypotheses are equal, the BF is equal to the ratio of the posterior hypothesis probabilities given the data, since

p(H1|X) / p(H2|X) = BF × p(H1) / p(H2). (72)

Actually, the BIC is an approximation of the logarithm of the BF. As the BF is difficult to calculate in general, the BIC is usually used as an alternative. When the number of parameters is equal in both statistical model hypotheses, the BF and the BIC give identical results.

In our research, when comparing the performance of non-Gaussian statistical models to Gaussian statistical models, or when deciding the complexity (the number of free parameters) of a mixture model, we utilized both the BIC and the BF.

3.4 Graphical Models

In statistical models, the relations among the variables, no matter how complex they are, are usually represented by mathematical expressions. Graphical models provide a way to visualize the structure of statistical models [49, 56, 115]. A graphical model contains nodes (the variables) and edges (the probabilistic relations among the variables). For Bayesian analysis, a Directed Acyclic Graph (DAG) is used to infer the (conditional) dependencies and independencies among the variables. A DAG is also called a Bayesian network [8]. Fig. 8 illustrates the conditional dependence/independence of two variables, also known as the Bayes ball algorithm [115]. A Bayesian network consists of several basic probabilistic relations of the kind described in Fig. 8. According to these, the statistical model can be separated into several groups to facilitate the analysis. Sometimes, (conditional) independence is introduced to simplify the problem, by modifying the connections in the Bayesian network. We utilized Bayesian networks for analyzing the probabilistic relations among the variables in the attached papers A and F.

4 Applications of Non-Gaussian Models

For data with bounded support or semi-bounded support, it has been shown that non-Gaussian statistical models and the corresponding non-Gaussian Mixture Model (nGMM) can provide better modeling performance than Gaussian models [27,28,30–32,74,87,89] according to the criteria introduced in section 3.3. We thus expect that these better models can lead to improved performance in different aspects of practical applications. In our research, we investigated the use of non-Gaussian statistical models in several applications.

4.1 Speech Processing

Speech processing is the application of digital signal processing to speech signals. This includes speech coding, speech recognition, and speaker recognition, to mention a few.


Figure 8: Illustration of the Bayes ball algorithm [115]. The small solid circles are random variables. The shaded nodes are instantiated. “√” denotes that the variables are d-connected (conditionally dependent) and “×” denotes that the variables are d-separated (conditionally independent).

Speech signals are not stationary processes, so the Linear Prediction (LP) method is applied to a short segment of speech (usually 20∼30 ms block length, modified by a window function) to estimate a linear representation of the speech signal in each frame [57]. The LP model with prediction order K is formulated as

x(t) = Σ_{k=1}^{K} ρk x(t − k), (73)

where x(t) is the speech sample at time t, and ρk denotes the Linear Prediction Coefficient (LPC). The parameters in the LP model can be estimated in several ways, e.g., the autocorrelation method [117, 118], the covariance matrix


Figure 9: A speech example, a speech segment, a windowed speech segment, and the effect of modifying the LPC and the LSF. (a) A speech example from the TIMIT database [116]. (b) One speech segment of 25 ms (400 samples) extracted from (a). (c) The speech segment in (b) multiplied by a Hanning window. (d) Comparison of the original synthesis filter envelope for (c) with the envelopes obtained by modifying the LSF and by modifying the LPC. The modification was done by multiplying the first element in the LPC and the LSF parameters with 1.2, respectively.

method [57]. Then the LP analysis filter can be expressed by its z-transform as

G(z) = 1 − Σ_{k=1}^{K} ρk z^(−k), (74)

which is known as the “whitening” filter as it removes the short-term correlation and flattens the spectrum [119]. As the LP filter is a minimum-phase filter, the synthesis filter, which is the inverse of the LP analysis filter, exists and is also stable. Fig. 9 shows an example of speech and the corresponding envelope obtained by the LP synthesis filter.
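A minimal sketch of estimating the LPCs in (73) with the autocorrelation method and applying the whitening filter (74) is given below; the synthetic frame, sampling assumptions, and prediction order are illustrative and do not correspond to real speech data.

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_autocorrelation(frame, K=10):
    """Estimate the LPCs rho_k in (73) by the autocorrelation method."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return solve_toeplitz(r[:K], r[1:K + 1])

rng = np.random.default_rng(7)
# A synthetic 25 ms "speech-like" frame (400 samples): AR(2) noise, Hanning windowed
frame = lfilter([1.0], [1.0, -1.3, 0.7], rng.normal(size=400)) * np.hanning(400)
rho = lpc_autocorrelation(frame, K=10)

# Analysis ("whitening") filter G(z) in (74): coefficients [1, -rho_1, ..., -rho_K]
residual = lfilter(np.concatenate(([1.0], -rho)), [1.0], frame)
print("prediction gain:", np.var(frame) / np.var(residual))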

In speech coding, an essential part is quantizing the LPCs efficiently. Direct quantization of the LPCs is usually avoided because small quantization errors


in the LPCs may result in relatively large spectral distortion and could lead to instability of the filter [119]. Thus, some representations of the LPCs have been proposed for the purpose of effective quantization. The most used representations are the Reflection Coefficient (RC), the ArcSine Reflection Coefficient (ASRC), the Log Area Ratio (LAR), the Line Spectral Frequency (LSF), and the Immittance Spectral Frequency (ISF) [57, 119–121]. Among these representations, the LSF is the most widely used one when quantizing the LP model. With the analysis filter in (74), two symmetric polynomials can be created as [119]

P(z) = G(z) + z^(−(K+1)) G(z^(−1)),  Q(z) = G(z) − z^(−(K+1)) G(z^(−1)). (75)

The zeros of P(z) and Q(z) are interleaved on the unit circle as (by assuming that K is even)

0 = φq0 < φp1 < φq1 < . . . < φq(K/2) < φp(K/2+1) = π. (76)

Then LSF parameters are defined as [120]

s = [s1, s2, . . . , sK]^T = [φp1, φq1, . . . , φp(K/2), φq(K/2)]^T. (77)

The modification effect of the LSF parameters is local, which means that modifying the value of one element in the LSF vector only affects the envelope in a small local region. Fig. 9(d) illustrates this local effect of modifying the LSF parameters. As mentioned above, the LSF parameters are ordered and bounded in [0, π]. To exploit these properties, several quantization methods for the LSF parameters were studied in the literature (see e.g., [122–125]). The problem of designing an optimal quantization scheme for the LPC parameters has also been investigated by many researchers [73, 123, 126]. The well-known Lloyd [127] and Linde-Buzo-Gray (LBG) [128] algorithms are usually utilized for obtaining the codebook. However, these algorithms depend on the training data and lead to worse quantization performance if the training data is not representative or the amount of training data is limited.

Recently, the PDF-optimized Vector Quantization (VQ) scheme was proposed [73, 129, 130] to overcome the problem of lacking training data. For this purpose, the GMM is widely applied to model the underlying distribution of the LSF parameters. Based on the trained model, a training set with a sufficient amount of data (theoretically infinite) can be generated and used for obtaining the codebook. Also, to prevent training a VQ in a high-dimensional space, the transform coding [131] technique is often applied to decorrelate the vector variable into a set of independent scalar variables. Thus, the VQ can be replaced by a set of Scalar Quantizers (SQ) without losing the memory advantage of VQ. For a Gaussian source, the Karhunen-Loeve Transform (KLT) is usually applied. By the high-rate theory in source coding [58], the optimal bit allocation strategy can be derived based on the trained model, and an entropy coding scheme can also be derived afterwards [130].

As suggested in [129], explicitly taking the bounded support constraint of the LSF parameters into account could improve the VQ performance. Based on


this, Lindblom et al. [29] proposed a bounded support GMM based VQ, which is a modified and improved version of the conventional GMM based VQ, both in model estimation and in VQ design. However, the estimation algorithm of the bounded support GMM is computationally costly when the bounded support is taken into account [29]. Also, many mixture components were spent on describing the edge of the distribution. To avoid the extra computational cost and save the bit budget, we applied the BMM based VQ for the quantization of the LSF parameters by only considering the bounded support property. The performance of the BMM based VQ was shown to be superior to the GMM based VQ with the same level of model complexity (in terms of the number of parameters in the statistical model). We introduced this work in [87].

To further exploit the ordered property, we transformed the LSF parameters linearly to another representation, the LSF difference (∆LSF), as [89]

x = ϕ(s) = As,  A = (1/π) ·
[  1   0   0  · · ·  0
  −1   1   0  · · ·  0
   0  −1   1  · · ·  0
   ·   ·   ·  · · ·  ·
   0  · · ·   0  −1  1 ]  (K × K). (78)

The linear transformation in (78) is invertible. Since there is no information loss during the transformation, quantizing the ∆LSF parameters is equivalent to quantizing the LSF parameters. The ∆LSF vector x contains positive elements and the sum of all the elements is smaller than π. Thus x has the same domain as the Dirichlet variable, up to a normalization constant π. Hence, we applied a Dirichlet Mixture Model (DMM) to model the underlying distribution of the ∆LSF parameters, by considering both the boundary and the ordering properties of the LSF parameters. As the Dirichlet variable is neutral [81], we proposed a non-linear transformation to decorrelate the Dirichlet vector variable into a set of independent beta variables, which is similar to the KLT for Gaussian variables but is nonlinear. After that, an optimal bit allocation strategy was derived based on the distortion-rate (D-R) relation. Finally, a practical coding scheme similar to Differential Pulse Code Modulation (DPCM)4 was proposed for preventing error propagation in the sequential quantizations. The modeling performance of the ∆LSF parameters was studied in the attached paper E. The details of the DMM based VQ design and implementation can be found in the attached paper D.
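A minimal sketch of the transformation (78) is given below; the ordered LSF vector is an illustrative example, not real speech data.

import numpy as np

def delta_lsf(s):
    """Map an ordered LSF vector s (elements in (0, pi)) to the Delta-LSF
    representation x = A s of (78)."""
    K = len(s)
    A = (np.eye(K) - np.eye(K, k=-1)) / np.pi   # 1 on diagonal, -1 on subdiagonal
    return A @ s, A

# An illustrative ordered LSF vector (not from real speech data)
s = np.array([0.25, 0.60, 1.10, 1.70, 2.30, 2.90])
x, A = delta_lsf(s)
print("Delta-LSF:", x)
# with the 1/pi factor included in (78), the elements are positive and sum to < 1
print("all positive:", np.all(x > 0), " sum:", x.sum())
print("recovered LSF:", np.linalg.solve(A, x))   # the transformation is invertible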

Another important application in speech processing is speaker recognition [132], which includes two tasks: 1) Speaker Identification (SI), to identify a particular speaker [133]; and 2) Speaker Verification (SV), to verify a speaker's claimed identity [134, 135]. For both tasks, the Mel-Frequency Cepstral Coefficient (MFCC) [59] is the most widely used feature to represent the speaker's characteristics. For the task of SI, the LSF parameters are also used as a feature representation [132,136], since the LP filter parameters are determined by the speaker's articulatory system, and the LSF parameters convey information about the speaker's

4A similar approach was also used in [124], in which an empirical grouping strategy was applied to the LSF parameters.


identity.

The MFCC feature suppresses the spectral details in the high frequency area by introducing a mel-scale filter bank, which takes advantage of the human ear's frequency selectivity. However, the information contained in the high frequency area might be useful for the machine to identify the speaker. In other words, the LSF contains “full band” information and might perform better than the band-modified MFCCs. Hence, we used the ∆LSF parameters, which contain the same information as the LSF parameters, as the feature for the task of SI [137]. Since the dynamic information is useful for speaker recognition5, we represented the dynamic information of the ∆LSF parameters by considering two neighbors of the current frame, one from the past frames and the other from the following frames. These ∆LSF vectors are concatenated to form a supervector. To model the distribution of the supervectors, the so-called super-Dirichlet distribution and the corresponding super-Dirichlet Mixture Model (sDMM) were proposed. The sDMM based SI system was shown to be superior to the GMM based system, for the task of text-independent speaker identification [137].

4.2 Image Processing

For digital images, the pixel-histogram based method is a simple but efficient approach for image classification [138], image segmentation [139, 140], etc. In most cases, the Gaussian distribution and the corresponding GMM are used to describe the distribution of pixel values [23,139–142]. Actually, the pixel values are located in a fixed interval [0, 2^R − 1], where R is the number of bits for storing a pixel in the computer. Thus the assumption that the pixel values are Gaussian distributed is not accurate, as it violates the boundary property [25,27,79,84].

To model the distribution of pixel values efficiently, Bouguila et al. [27] modeled the distribution of gray image pixels by a Beta Mixture Model (BMM) and proposed a practical Gibbs sampling based method for parameter estimation. We applied the BMM to describe the histogram of handwritten digits for the purpose of classification [79]. The color image is usually described and stored in the RGB space. One pixel in the color image is a composition of three channels, each of which has a value in a fixed interval [0, 2^R − 1]. The Dirichlet distribution and the corresponding DMM were used to model the histogram of the color pixels in the RGB space and showed an improvement over the GMM based method [75, 84]. In these papers, the human skin color pixel in the RGB space was first normalized by the sum of the three channels, and then the distribution of the normalized pixel vectors was modeled by a DMM. This normalization could remove the illuminance so that the variance of each channel is reduced [143]. However, the variance of the non-skin color pixels may be reduced as well. Furthermore, if the illuminance of the color pixel is removed, color pixels containing different illuminances but the same RGB proportions will be recognized as the same cluster. For example, the colors “light green” [0, 255, 0] and “deep green” [0, 100, 0] have the same proportion [0, 1, 0] and may not be distinguished efficiently. As mentioned on page 12, the domain of the three-dimensional

5When taking the MFCC as the feature, the ∆MFCC and the ∆∆MFCC are usually combined with the MFCC to describe the dynamic information.


Dirichlet variable is a simplex while the domain of a three-dimensional beta variable is a unit cube. According to the above discussion, we considered the color pixel as an extension of the gray image pixel and applied the multivariate beta distribution [28,144] to model the color pixels' distribution for the task of human skin color detection [144]. The BMM based method outperformed some other methods based on pixel probabilistic models. In general, the non-Gaussian statistical model based methods are better than the Gaussian statistical model based methods, in terms of the ability to describe the pixel histogram [27,28,75].

4.3 Nonnegative Matrix Factorization

The Nonnegative Matrix Factorization (NMF) is an important technique for matrix decomposition. It decomposes a nonnegative matrix into a product of two nonnegative matrices [145,146]. The NMF can be applied in several applications, such as face feature extraction [146], image denoising [24], sparse coding [147], music spectrum modeling [32, 148], etc. The conventional form of NMF can be expressed as

XP×T ≈ WP×KVK×T, (79)

where xpt, wpk, and vkt are nonnegative elements and p = 1, . . . , P, t = 1, . . . , T, k = 1, . . . , K. If we choose K < min(P, T), then the NMF is a Low Rank Matrix Approximation (LRMA). By the expression in (79), a column xt in X can be interpreted as a linear combination of the columns in W, weighted by the factors in the tth column of V, as

xt = Σ_{k=1}^{K} wk vkt. (80)

As the observation X contains nonnegative data (e.g., image data, speech spectra) in various applications, the NMF can provide better performance for nonnegative data [146], compared to some other classical LRMA methods, e.g., the Singular Value Decomposition (SVD).

The conventional approach is to minimize the distance between XP×T and WP×KVK×T as

min_{W,V} D(X‖WV)  s.t. wpk ≥ 0, vkt ≥ 0, (81)

where D(·‖·) denotes a distance measure. For example, by taking the Frobenius norm for measuring the distortion, the NMF algorithm tries to minimize the Euclidean distance between the original matrix and the approximated one as [149]

min_{W,V} ‖X − WV‖²_F  s.t. wpk ≥ 0, vkt ≥ 0. (82)

If the generalized KL divergence is taken as the criterion [149], the NMF algorithm solves a constrained optimization problem as

min_{W,V} GKL(X‖WV)  s.t. wpk ≥ 0, vkt ≥ 0, (83)


where

GKL(A‖B) = Σ_{ij} (aij log(aij/bij) − aij + bij). (84)

This measure reduces to the standard KL divergence when Σ_{ij} aij = Σ_{ij} bij = 1. Another distortion measure for NMF is the Itakura-Saito (IS) distance:

min_{W,V} IS(X‖WV)  s.t. wpk ≥ 0, vkt ≥ 0 (85)

with

IS(A‖B) = Σ_{ij} (aij/bij − log(aij/bij) − 1). (86)

The IS-NMF showed a promising improvement over the above two methods when modeling music spectra [148]. Generally speaking, the optimization problem connected to the NMF algorithm is convex with respect to W or V separately, but it is not jointly convex in W and V. Thus, gradient-based methods are usually utilized to solve this constrained optimization problem approximately [147–149], as illustrated by the sketch below.
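As an illustration of such iterative solutions, a minimal sketch of the multiplicative update rules of [149] for the Euclidean cost in (82) is given below; the matrix sizes, iteration count, and initialization are arbitrary.

import numpy as np

def nmf_euclidean(X, K=5, iters=200, seed=0):
    """Multiplicative updates of [149] for the Euclidean NMF cost in (82)."""
    rng = np.random.default_rng(seed)
    P, T = X.shape
    W = rng.uniform(0.1, 1.0, (P, K))
    V = rng.uniform(0.1, 1.0, (K, T))
    eps = 1e-12                               # avoid division by zero
    for _ in range(iters):
        V *= (W.T @ X) / (W.T @ W @ V + eps)
        W *= (X @ V.T) / (W @ V @ V.T + eps)
    return W, V

rng = np.random.default_rng(8)
X = rng.uniform(size=(30, 40))
W, V = nmf_euclidean(X, K=5)
print("relative error:", np.linalg.norm(X - W @ V) / np.linalg.norm(X))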

The above mentioned NMF frameworks, based on different criteria, can all be interpreted in a probabilistic way in terms of different probability distributions [148]. To prevent the overfitting problem, the NMF can also be carried out in a Bayesian framework. The minimization problem in (81) can be expressed as a maximization problem with respect to an appropriate PDF as

W*, V* = argmax_{W,V} f(X; WV, Θ), (87)

where Θ denotes the (possible) parameter set according to the choice of PDF. For example, the Euclidean distance based NMF (E-NMF) method is equivalent to ML estimation of the parameters by assigning a Gaussian distribution to the observation as

W*, V* = argmax_{wpk,vkt} ∏_{p,t} N(xpt; Σ_k wpk vkt, σ). (88)

Schmidt et al. [24,150] proposed two Bayesian NMF methods based on this Gaussian assumption. In [150], wpk and vkt were assigned Gaussian priors, via a link function to preserve the nonnegativity. In [24], the exponential distribution was assigned to wpk and vkt, respectively. The Gibbs sampler was used to simulate the posterior distribution. The proposed Gaussian-NMF method was applied to image feature extraction and showed a promising improvement.

The generalized KL based NMF (GKL-NMF) can be connected to the Poisson distribution as

W*, V* = argmax_{wpk,vkt} ∏_{p,t} P(xpt; Σ_k wpk vkt), (89)

where

P(x; λ) = λ^x e^(−λ) / x!,  x a nonnegative integer. (90)


Cemgil [31] took this assumption and proposed a Bayesian inference method in NMF, by assigning gamma distributions to wpk and vkt. Since the gamma distribution is the conjugate prior to the Poisson distribution and the summation of several Poisson variables is still Poisson distributed, an analytically tractable solution was obtained in the Variational Inference (VI) framework. However, the proposed Poisson-NMF method is only suitable for variables Xpt with discrete values. The application of the Poisson-NMF is mainly on image data [31], which is usually considered as continuous with bounded support. This mismatch impairs the statistical interpretation of the GKL-NMF on non-countable data [148].

The IS-NMF is equivalent to ML estimation of the parameters under gamma multiplicative noise with mean 1 [148], up to a scale factor and a constant. To this end, the IS-NMF method can be interpreted as [148, eq. 31]

W*, V* = argmax_{wpk,vkt} ∏_{p,t} (1/(Σ_k wpk vkt)) Gam(xpt/(Σ_k wpk vkt); α, α), (91)

where α is a free parameter that controls the variance of the gamma noise. A Bayesian NMF with the exponential assumption (which is a special case of the gamma assumption in [148]) was proposed in [32]. The prior distributions of wpk and vkt were assumed to be gamma. The VI framework was applied and the approximations to the posterior distributions were derived analytically. This approach was proposed mainly for the analysis of music spectra [32].

As mentioned above, when applying the Poisson-NMF to image data, it vio-lates the continuous and bounded support properties of the pixel values. Sinceit has been shown [27, 28] that the beta distribution can model the pixel databetter than some other distributions because of its bounded support, we proposeda beta distribution based NMF strategy for modeling the bounded support datain [33] (i.e., the attached paper F). For a nonnegative matrix XP×T , each elementvariable Xpt has a bounded support and is assumed to be beta distributed withparameter mpt and npt as

Xpt ∼ Beta (xpt;mpt, npt) . (92)

Then, for the whole matrix, we obtain two parameter matrices $\mathbf{M}_{P\times T}$ and $\mathbf{N}_{P\times T}$. To create a hierarchical Bayesian framework, we assume that these two parameter matrices are latent random variable matrices. If we factorize these two variable matrices jointly as

$$
\mathbf{M}_{P\times T} \approx \mathbf{A}_{P\times K}\mathbf{H}_{K\times T}, \qquad \mathbf{N}_{P\times T} \approx \mathbf{B}_{P\times K}\mathbf{H}_{K\times T}, \tag{93}
$$

and assign a gamma distribution to each element in $\mathbf{A}_{P\times K}$, $\mathbf{B}_{P\times K}$, and $\mathbf{H}_{K\times T}$, a generative model for the bounded support variable can be obtained as

$$
\begin{aligned}
A_{pk} &\sim \mathrm{Gam}(a_{pk};\,\mu_{0},\,\alpha_{0}),\\
B_{pk} &\sim \mathrm{Gam}(b_{pk};\,\nu_{0},\,\beta_{0}),\\
H_{kt} &\sim \mathrm{Gam}(h_{kt};\,\rho_{0},\,\zeta_{0}),\\
X_{pt} &\sim \mathrm{Beta}\Bigl(x_{pt}\,\Big|\,\sum_{k} a_{pk}h_{kt},\, \sum_{k} b_{pk}h_{kt}\Bigr).
\end{aligned} \tag{94}
$$
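
The generative model in (92)-(94) can be illustrated by ancestral sampling. The following sketch (our own illustration; the hyperparameter values and the shape-rate parameterization of the gamma distribution are placeholder assumptions, not the settings used in [33]) draws the latent matrices, forms the beta parameters by the joint factorization (93), and checks that the sample average of the bounded observations is close to the conditional mean in (95).

```python
import numpy as np

# Ancestral sampling from the BG-NMF generative model (94).
rng = np.random.default_rng(0)
P, T, K = 10, 15, 3
mu0, alpha0 = 2.0, 1.0   # Gam(shape, rate) hyperparameters, placeholders
nu0, beta0 = 2.0, 1.0
rho0, zeta0 = 2.0, 1.0

A = rng.gamma(shape=mu0, scale=1.0 / alpha0, size=(P, K))
B = rng.gamma(shape=nu0, scale=1.0 / beta0, size=(P, K))
H = rng.gamma(shape=rho0, scale=1.0 / zeta0, size=(K, T))

M, N = A @ H, B @ H                         # beta parameters, cf. (93)
Xs = rng.beta(M, N, size=(2000, P, T))      # repeated draws of the bounded data
print(Xs.min(), Xs.max())                   # samples lie in (0, 1)
print(np.max(np.abs(Xs.mean(axis=0) - M / (M + N))))  # small: agrees with (95)
```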


This generative model assumes mutual independence of the elements in $\mathbf{M}_{P\times T}$ and $\mathbf{N}_{P\times T}$. However, the correlations of the elements are captured by the joint NMF in (93), since they share the same weighting matrix $\mathbf{H}_{K\times T}$. Within the VI framework, an analytically tractable solution was derived to approximate the posterior distributions of $\mathbf{A}_{P\times K}$, $\mathbf{B}_{P\times K}$, and $\mathbf{H}_{K\times T}$ in [33] (i.e., the attached paper F). By definition, we have

$$
\mathrm{E}\left[X_{pt}\,\middle|\,\mathbf{A}_{P\times K},\mathbf{B}_{P\times K},\mathbf{H}_{K\times T}\right] = \frac{\sum_{k} a_{pk}h_{kt}}{\sum_{k} a_{pk}h_{kt} + \sum_{k} b_{pk}h_{kt}}. \tag{95}
$$

If we take the posterior mean as the point estimate of the latent variables $a_{pk}$, $b_{pk}$, and $h_{kt}$, the expected value of $X_{pt}$ can be approximated as

$$
\mathrm{E}\left[X_{pt}\,\middle|\,\mathbf{A}_{P\times K},\mathbf{B}_{P\times K},\mathbf{H}_{K\times T}\right] \approx \frac{\sum_{k} A_{pk}H_{kt}}{\sum_{k} A_{pk}H_{kt} + \sum_{k} B_{pk}H_{kt}}. \tag{96}
$$

We proposed the beta-gamma-NMF (BG-NMF) in [33] and applied it to image processing and collaborative filtering problems. Compared to recently proposed methods [31, 147, 151, 152], it showed a promising improvement.

The proposed BG-NMF can also be interpreted via the IS-NMF, under certain conditions. Recalling (91), if we set $\alpha = \sum_{k} w_{pk}v_{kt}$, then we have

$$
\mathbf{W}^{*},\mathbf{V}^{*} = \mathop{\arg\max}_{w_{pk},v_{kt}} \; \prod_{p,t} \mathrm{Gam}\!\left(x_{pt};\, \sum_{k} w_{pk}v_{kt},\, 1\right), \tag{97}
$$

which means that $x_{pt}$ is gamma distributed with shape parameter $\sum_{k} w_{pk}v_{kt}$ and scale parameter 1. (Note that $x_{pt}$ in (97) can take any nonnegative value, which is not the same as the bounded $x_{pt}$ in (94); the notation $x_{pt}$ is kept in (97) for consistency with (91).) Furthermore, if we assume that $Y_{pt}$ and $Z_{pt}$ are gamma distributed as

$$
Y_{pt} \sim \mathrm{Gam}\left(y_{pt};\, \bar{y}_{pt},\, 1\right) \quad \text{and} \quad Z_{pt} \sim \mathrm{Gam}\left(z_{pt};\, \bar{z}_{pt},\, 1\right), \tag{98}
$$

then $X_{pt} = \frac{Y_{pt}}{Y_{pt}+Z_{pt}}$ is beta distributed with parameters $\bar{y}_{pt}$ and $\bar{z}_{pt}$ as [153]
$$
X_{pt} \sim \mathrm{Beta}\left(x_{pt};\, \bar{y}_{pt},\, \bar{z}_{pt}\right). \tag{99}
$$
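
The construction behind (98)-(99), namely that the ratio of two independent unit-scale gamma variables is beta distributed [153], can be verified by simulation; the sketch below (our own illustration) compares the empirical distribution of Y/(Y+Z) with the corresponding beta distribution.

```python
import numpy as np
from scipy.stats import beta, kstest

# If Y ~ Gam(a, 1) and Z ~ Gam(b, 1) are independent, then Y / (Y + Z) ~ Beta(a, b).
rng = np.random.default_rng(0)
a, b, n = 3.0, 5.0, 200_000
Y = rng.gamma(shape=a, scale=1.0, size=n)
Z = rng.gamma(shape=b, scale=1.0, size=n)
U = Y / (Y + Z)
print(kstest(U, beta(a, b).cdf))  # large p-value: consistent with Beta(a, b)
```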

Under the IS-NMF framework, factorizations can be placed on the two variable matrices $\mathbf{Y}_{P\times T}$ and $\mathbf{Z}_{P\times T}$ as

$$
\mathbf{Y}_{P\times T} \approx \bar{\mathbf{Y}}_{P\times T} = \mathbf{A}_{P\times K}\mathbf{H}_{K\times T}, \qquad \mathbf{Z}_{P\times T} \approx \bar{\mathbf{Z}}_{P\times T} = \mathbf{B}_{P\times K}\mathbf{H}_{K\times T}. \tag{100}
$$

Again, if we assign a gamma prior to each of the elements in $\mathbf{A}_{P\times K}$, $\mathbf{B}_{P\times K}$, and $\mathbf{H}_{K\times T}$, a generative model can be obtained as

$$
\begin{aligned}
A_{pk} &\sim \mathrm{Gam}(a_{pk};\,\mu_{0},\,\alpha_{0}),\\
B_{pk} &\sim \mathrm{Gam}(b_{pk};\,\nu_{0},\,\beta_{0}),\\
H_{kt} &\sim \mathrm{Gam}(h_{kt};\,\rho_{0},\,\zeta_{0}),\\
Y_{pt} &\sim \mathrm{Gam}\Bigl(y_{pt}\,\Big|\,\sum_{k} a_{pk}h_{kt},\, 1\Bigr),\\
Z_{pt} &\sim \mathrm{Gam}\Bigl(z_{pt}\,\Big|\,\sum_{k} b_{pk}h_{kt},\, 1\Bigr),\\
X_{pt} &= \frac{Y_{pt}}{Y_{pt}+Z_{pt}} \sim \mathrm{Beta}\Bigl(x_{pt}\,\Big|\,\sum_{k} a_{pk}h_{kt},\, \sum_{k} b_{pk}h_{kt}\Bigr).
\end{aligned} \tag{101}
$$

This generative model is equivalent to that in (94). The only difference is that two intermediate variables $Y_{pt}$ and $Z_{pt}$ are introduced. Similarly, the expected value of the bounded variable $X_{pt}$ is

$$
\mathrm{E}\left[X_{pt}\,\middle|\,\mathbf{A}_{P\times K},\mathbf{B}_{P\times K},\mathbf{H}_{K\times T}\right] = \frac{\sum_{k} a_{pk}h_{kt}}{\sum_{k} a_{pk}h_{kt} + \sum_{k} b_{pk}h_{kt}}, \tag{102}
$$

and it can also be approximated by taking the point estimates of $A_{pk}$, $B_{pk}$, and $H_{kt}$ as

$$
\mathrm{E}\left[X_{pt}\,\middle|\,\mathbf{A}_{P\times K},\mathbf{B}_{P\times K},\mathbf{H}_{K\times T}\right] \approx \frac{\sum_{k} A_{pk}H_{kt}}{\sum_{k} A_{pk}H_{kt} + \sum_{k} B_{pk}H_{kt}}. \tag{103}
$$

5 Summary of Contributions

5.1 Overview

The work introduced in this dissertation focuses mainly on

• the Maximum Likelihood (ML) and Bayesian estimations of non-Gaussian statistical models,

• their applications in various fields.

This dissertation mainly consists of six papers, in which I formulated the problems and proposed the approaches to solve them. I also carried out the mathematical derivations and conducted the experimental evaluations. The coauthor of these papers, who is my supervisor, provided many fruitful suggestions on both the theoretical and the experimental parts. The attached papers can be categorized into three groups, according to the contents they cover.

The attached papers A, B, and C focused on the Bayesian analysis of the beta distribution. Following the principles of the Variational Inference (VI) framework, the posterior distribution of the correlated parameters in the beta distribution was approximated by a product of independent gamma distributions. The Factorized Approximation (FA) method was extended to the Extended Factorized Approximation (EFA) method by relaxing the lower bound. This approximation facilitated the estimation procedure with an analytically tractable solution.


Based on this posterior approximation, the predictive distribution can be approximated by the Local Variational Inference (LVI) method. Again, an analytically tractable expression for the predictive distribution makes the problem easy to handle. To capture the correlation between the parameters in the beta distribution, an Expectation Propagation (EP) based method was proposed to approximate the posterior distribution; it showed an advantage when the amount of observed data is small. In general, the beta distribution can model bounded support data better than the conventional Gaussian distribution. The Beta Mixture Model (BMM) with Bayesian estimation was applied in several applications and showed an improvement.

The attached papers D and E mainly focused on improving the VQ performance of the LP model for speech. By taking the bounded and ordered properties into account, the LSF parameters were linearly transformed to the ∆LSF parameters, and a DMM was applied to model the underlying distribution. The ML estimation of the parameters in the mixture model was proposed. The Dirichlet variable was decorrelated into a set of independent beta variables, according to its aggregation property and neutrality. The optimal inter- and intra-component bit allocation strategy was also proposed based on the estimated distribution. The proposed PDF-optimized VQ outperformed the state-of-the-art GMM based VQ for transparent coding.

In the attached paper F, we derived a beta-gamma NMF (BG-NMF) method for bounded support data. The distribution of the bounded support data matrix was modeled by a matrix of beta variables. The parameter matrices of the beta variables were then factorized jointly, and each of the elements in the basis and excitation matrices was assigned a gamma prior. This generative model can also be interpreted via the IS-NMF under certain conditions. Based on the VI framework and the EFA method, some lower bound approximations were used to derive an analytically tractable solution. The proposed BG-NMF was applied to different applications in image processing and also to the collaborative filtering application.

5.2 Conclusions

For data with bounded support or semi-bounded support, it has been shown that non-Gaussian statistical models and the corresponding non-Gaussian Mixture Models (nGMMs) can provide a better modeling performance [27, 28, 30–32, 74, 87, 89], by considering the criteria introduced in section 3.3. Thus, in real applications, it is also expected that a better modeling behavior can lead to a better performance in different application aspects. In our research work, we applied non-Gaussian statistical models in several applications. Compared to some conventional statistical model based methods, the non-Gaussian model based methods show a promising improvement.

References

[1] D. R. Cox and D. V. Hinkley, Theoretical Statistics, 1st ed. Chapman and Hall, Sep. 1979.


[2] P. McCullagh, "What is a statistical model?" The Annals of Statistics, vol. 30, no. 5, 2002.

[3] A. Davison, Statistical Models, ser. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2003.

[4] O. E. Barndorff-Nielsen and D. R. Cox, Inference and asymptotics, ser. Monographs on statistics and applied probability. Chapman & Hall, 1994.

[5] E. L. Lehmann and G. Casella, Theory of point estimation, ser. Springer texts in statistics. Springer, 1998.

[6] J. M. Bernardo and A. Smith, Bayesian Theory, ser. Wiley Series in Probability and Statistics. Chichester: John Wiley & Sons Ltd., 2000.

[7] B. S. Everitt and A. Skrondal, The Cambridge Dictionary of Statistics. Cambridge University Press, 2010.

[8] C. M. Bishop, Pattern recognition and machine learning, ser. Information science and statistics. Springer, 2006.

[9] J. D. Gibbons and S. Chakraborti, Nonparametric statistical inference, ser. Statistics, textbooks and monographs. Marcel Dekker, 2003.

[10] M. Rosenblatt, "Remarks on some nonparametric estimates of a density function," The Annals of Mathematical Statistics, vol. 27, no. 3, pp. 832–837, 1956.

[11] E. Parzen, "On estimation of a probability density function and mode," The Annals of Mathematical Statistics, vol. 33, no. 3, pp. 1065–1076, 1962.

[12] D. M. Blei and M. I. Jordan, "Variational inference for Dirichlet process mixtures," Bayesian Analysis, vol. 1, pp. 121–144, 2005.

[13] W. Hardle, Nonparametric and semiparametric models, ser. Springer series in statistics. Springer, 2004.

[14] D. R. Cox, "Regression models and life-tables," Journal of the Royal Statistical Society. Series B (Methodological), vol. 34, no. 2, pp. 187–220, 1972.

[15] J. K. Patel and C. B. Read, Handbook of the normal distribution, ser. Statistics, textbooks and monographs. Marcel Dekker, 1996.

[16] A. K. Jain, R. P. W. Duin, and J. Mao, "Statistical pattern recognition: a review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 4–37, 2000.

[17] G. Casella and R. L. Berger, Statistical inference, ser. Duxbury advanced series in statistics and decision sciences. Thomson Learning, 2002.

[18] C. E. Rasmussen and C. K. I. Williams, Gaussian processes for machine learning, ser. Adaptive computation and machine learning. MIT Press, 2006.

[19] G. McLachlan and D. Peel, Finite mixture models, ser. Wiley series in probability and statistics: Applied probability and statistics. Wiley, 2000.

[20] M. A. T. Figueiredo and A. K. Jain, "Unsupervised learning of finite mixture models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 381–396, 2002.


[21] Y. Zhao, X. Zhuang, and S. Ting, "Gaussian mixture density modeling of non-Gaussian source for autoregressive process," IEEE Transactions on Signal Processing, vol. 43, no. 4, pp. 894–903, Apr. 1995.

[22] M. E. Tipping and C. M. Bishop, "Probabilistic principal component analysis," Journal of the Royal Statistical Society. Series B (Statistical Methodology), vol. 61, no. 3, pp. 611–622, 1999.

[23] M. H. Yang, D. J. Kriegman, and N. Ahuja, "Detecting faces in images: a survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 1, pp. 34–58, 2002.

[24] M. N. Schmidt, O. Winther, and L. K. Hansen, "Bayesian non-negative matrix factorization," in Independent Component Analysis and Signal Separation, ser. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2009.

[25] J. D. Banfield and A. E. Raftery, "Model-based Gaussian and non-Gaussian clustering," Biometrics, vol. 49, no. 3, pp. 803–821, 1993.

[26] D. M. Blei, "Probabilistic models of text and images," Ph.D. dissertation, University of California, Berkeley, 2004.

[27] N. Bouguila, D. Ziou, and E. Monga, "Practical Bayesian estimation of a finite beta mixture through Gibbs sampling and its applications," Statistics and Computing, vol. 16, pp. 215–225, 2006.

[28] Z. Ma and A. Leijon, "Bayesian estimation of beta mixture models with variational inference," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 11, pp. 2160–2173, 2011.

[29] J. Lindblom and J. Samuelsson, "Bounded support Gaussian mixture modeling of speech spectra," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 1, pp. 88–99, Jan. 2003.

[30] Z. Ma and A. Leijon, "Vector quantization of LSF parameters with mixture of Dirichlet distributions," IEEE Transactions on Audio, Speech, and Language Processing, submitted, 2011.

[31] A. T. Cemgil, "Bayesian inference in non-negative matrix factorisation models," Computational Intelligence and Neuroscience, vol. 2009, no. CUED/F-INFENG/TR.609, Jul. 2009.

[32] M. Hoffman, D. M. Blei, and P. Cook, "Bayesian nonparametric matrix factorization for recorded music," in Proceedings of the International Conference on Machine Learning, 2010.

[33] Z. Ma and A. Leijon, "BG-NMF: a variational Bayesian NMF model for bounded support data," IEEE Transactions on Pattern Analysis and Machine Intelligence, submitted, 2011.

[34] B. O. Koopman, "On distributions admitting a sufficient statistic," Transactions of the American Mathematical Society, vol. 39, no. 3, pp. 399–409, 1936.

[35] E. B. Andersen, "Sufficiency and exponential families for discrete sample spaces," Journal of the American Statistical Association, vol. 65, no. 331, pp. 1248–1255, 1970.


[36] A. Edwards, Likelihood, ser. Cambridge science classics. Cambridge University Press, 1984.

[37] R. A. Fisher, "On an absolute criterion for fitting frequency curves," Messenger of Mathematics, vol. 41, no. 1, pp. 155–160, 1912.

[38] K. Fukunaga, Introduction to statistical pattern recognition, ser. Computer science and scientific computing. Academic Press, 1990.

[39] A. W. F. Edwards, "Three early papers on efficient parametric estimation," Statistical Science, vol. 12, no. 1, pp. 35–38, 1997.

[40] J. Aldrich, "R. A. Fisher and the making of maximum likelihood 1912-1922," Statistical Science, vol. 12, no. 3, pp. 162–176, 1997.

[41] S. M. Stigler, "Thomas Bayes's Bayesian inference," Journal of the Royal Statistical Society. Series A (General), vol. 145, no. 2, pp. 250–258, 1982.

[42] ——, "Who discovered Bayes's theorem?" The American Statistician, vol. 37, no. 4, pp. 290–296, 1983.

[43] M. E. Tipping, "Bayesian inference: An introduction to principles and practice in machine learning," 2004, pp. 41–62.

[44] E. W. Kamen and J. Su, Introduction to optimal estimation, ser. Advanced textbooks in control and signal processing. Springer, 1999.

[45] G. J. McLachlan and T. Krishnan, The EM algorithm and extensions, ser. Wiley series in probability and statistics. Wiley-Interscience, 2008.

[46] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society. Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.

[47] R. M. Neal and G. E. Hinton, "A view of the EM algorithm that justifies incremental, sparse, and other variants," in Learning in Graphical Models. Kluwer Academic Publishers, 1998, pp. 355–368.

[48] X. L. Meng and D. B. Rubin, "Maximum likelihood estimation via the ECM algorithm: a general framework," Biometrika, vol. 80, no. 2, pp. 267–278, Jun. 1993.

[49] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, "An introduction to variational methods for graphical models," Machine Learning, vol. 37, no. 2, pp. 183–233, 1999.

[50] T. S. Jaakkola and M. I. Jordan, "Bayesian parameter estimation via variational methods," Statistics and Computing, vol. 10, pp. 25–37, 2000.

[51] T. S. Jaakkola, "Tutorial on variational approximation methods," in Advances in Mean Field Methods, M. Opper and D. Saad, Eds. MIT Press, 2001, pp. 129–159.

[52] S. P. Brooks, "Markov chain Monte Carlo method and its application," Journal of the Royal Statistical Society. Series D (The Statistician), vol. 47, no. 1, pp. 69–100, 1998.

[53] M. H. Chen, Q. M. Shao, and J. G. Ibrahim, Monte Carlo Methods in Bayesian Computation, ser. Springer Series in Statistics. Springer, 2000.


[54] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-6, no. 6, pp. 721–741, Nov. 1984.

[55] H. Attias, "A variational Bayesian framework for graphical models," in Advances in Neural Information Processing Systems (NIPS). MIT Press, 2000, pp. 209–215.

[56] M. J. Wainwright and M. I. Jordan, Graphical Models, Exponential Families, and Variational Inference. Now Publishers, 2008.

[57] P. Vary and R. Martin, Digital speech transmission: enhancement, coding and error concealment. John Wiley, 2006.

[58] W. B. Kleijn, A basis for source coding, 2010, KTH lecture notes.

[59] J. Benesty, M. M. Sondhi, and Y. Huang, Springer handbook of speech processing, ser. Springer Handbook Of Series. Springer, 2008.

[60] G. E. P. Box and G. C. Tiao, Bayesian inference in statistical analysis, ser. Addison-Wesley Series in Behavioral Science, Quantitative Methods. Addison-Wesley Pub. Co., 1973.

[61] D. J. C. MacKay, Information theory, inference, and learning algorithms. Cambridge University Press, 2003.

[62] S. Roberts, D. Husmeier, I. Rezek, and W. Penny, "Bayesian approaches to Gaussian mixture modeling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1133–1142, Nov. 1998.

[63] S. G. Walker, P. Damien, P. Laud, and A. F. M. Smith, "Bayesian nonparametric inference for random distributions and related functions," Journal of the Royal Statistical Society. Series B (Statistical Methodology), vol. 61, no. 3, pp. 485–527, 1999.

[64] P. Muller and F. A. Quintana, "Nonparametric Bayesian data analysis," Statistical Science, vol. 19, no. 1, pp. 95–110, 2004.

[65] T. S. Ferguson, "A Bayesian analysis of some nonparametric problems," The Annals of Statistics, vol. 1, no. 2, pp. 209–230, 1973.

[66] C. E. Antoniak, "Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems," The Annals of Statistics, vol. 2, no. 6, pp. 1152–1174, 1974.

[67] R. M. Neal, "Markov chain sampling methods for Dirichlet process mixture models," Journal of Computational and Graphical Statistics, vol. 9, no. 2, pp. 249–265, 2000.

[68] J. Pitman, "Some developments of the Blackwell-MacQueen urn scheme," Lecture Notes-Monograph Series, vol. 30, pp. 245–267, 1996.

[69] D. Dey, P. Muller, and D. Sinha, Practical nonparametric and semiparametric Bayesian statistics, ser. Lecture notes in statistics. Springer, 1998.

[70] W. Bryc, The normal distribution: characterizations with applications, ser. Lecture notes in statistics. Springer-Verlag, 1995.


[71] T. M. Cover and J. A. Thomas, Elements of information theory, ser. Wiley Series in Telecommunications and Signal Processing. Wiley-Interscience, 2006.

[72] M. N. Gibbs and D. J. C. Mackay, "Variational Gaussian process classifiers," IEEE Transactions on Neural Networks, vol. 11, no. 6, pp. 1458–1464, Nov. 2000.

[73] A. D. Subramaniam and B. D. Rao, "PDF optimized parametric vector quantization of speech line spectral frequencies," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 2, pp. 130–142, Mar. 2003.

[74] Y. Ji, C. Wu, P. Liu, J. Wang, and K. R. Coombes, "Application of beta-mixture models in bioinformatics," Bioinformatics applications note, vol. 21, pp. 2118–2122, 2005.

[75] N. Bouguila, D. Ziou, and J. Vaillancourt, "Unsupervised learning of a finite mixture model based on the Dirichlet distribution and its application," IEEE Transactions on Image Processing, vol. 13, no. 11, pp. 1533–1543, Nov. 2004.

[76] A. K. Gupta and S. Nadarajah, Eds., Handbook of Beta Distribution and Its Applications. Marcel Dekker, 2004.

[77] V. P. Savchuk and H. F. Martz, "Bayes reliability estimation using multiple sources of prior information: binomial sampling," IEEE Transactions on Reliability, vol. 43, pp. 138–144, 1994.

[78] J. C. Lee and Y. L. Lio, "A note on Bayesian estimation and prediction for the beta-binomial model," Journal of Statistical Computation and Simulation, vol. 63, pp. 73–91, 1999.

[79] Z. Ma and A. Leijon, "Beta mixture models and the application to image classification," in Proceedings of IEEE International Conference on Image Processing (ICIP), Nov. 2009, pp. 2045–2048.

[80] B. A. Frigyik, A. Kapila, and M. R. Gupta, "Introduction to the Dirichlet distribution and related processes," Department of Electrical Engineering, University of Washington, Tech. Rep., 2010.

[81] R. J. Connor and J. E. Mosimann, "Concepts of independence for proportions with a generalization of the Dirichlet distribution," J. Am. Stat. Assoc., vol. 64, no. 325, pp. 194–206, 1969.

[82] P. Guimaraes and R. C. Lindrooth, "Dirichlet-multinomial regression," EconWPA, Econometrics, 2005.

[83] S. Yu, K. Yu, V. Tresp, and H. P. Kriegel, "Variational Bayesian Dirichlet-multinomial allocation for exponential family mixtures," in Machine Learning: ECML 2006, ser. Lecture Notes in Computer Science, J. Furnkranz, T. Scheffer, and M. Spiliopoulou, Eds. Springer Berlin / Heidelberg, 2006, vol. 4212, pp. 841–848.

[84] N. Bouguila and D. Ziou, "Dirichlet-based probability model applied to human skin detection," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, May 2004, pp. 512–524.


[85] N. Bouguila, D. Ziou, and R. Hammoud, "A Bayesian non-Gaussian mixture analysis: Application to eye modeling," in Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2007, pp. 1–8.

[86] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.

[87] Z. Ma and A. Leijon, "PDF-optimized LSF vector quantization based on beta mixture models," in Proceedings of INTERSPEECH, 2010, pp. 2374–2377.

[88] T. P. Minka, "Estimating a Dirichlet distribution," Annals of Physics, vol. 2000, no. 8, pp. 1–13, 2003.

[89] Z. Ma and A. Leijon, "Modeling speech line spectral frequencies with Dirichlet mixture models," in Proceedings of INTERSPEECH, 2010, pp. 2370–2373.

[90] R. V. Hogg, J. W. McKean, and A. T. Craig, Introduction to mathematical statistics. Pearson Education, 2005.

[91] T. H. Dat, K. Takeda, and F. Itakura, "Gamma modeling of speech power and its on-line estimation for statistical speech enhancement," IEICE Transactions on Information and Systems, vol. E89-D, pp. 1040–1049, 2006.

[92] A. Gelman, "Prior distributions for variance parameters in hierarchical models," Bayesian Analysis, vol. 1, pp. 515–533, 2006.

[93] K. Copsey and A. Webb, "Bayesian gamma mixture model approach to radar target recognition," IEEE Transactions on Aerospace and Electronic Systems, vol. 39, no. 4, pp. 1201–1217, Oct. 2003.

[94] C. T. Kelley, Solving nonlinear equations with Newton's method, ser. Fundamentals of algorithms. Society for Industrial and Applied Mathematics, 2003.

[95] M. R. Gupta and Y. Chen, "Theory and use of the EM algorithm," Foundations and Trends in Signal Processing, vol. 4, pp. 223–296, Mar. 2011.

[96] S. Geisser, Predictive inference: an introduction, ser. Monographs on statistics and applied probability. Chapman & Hall, 1993.

[97] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection." Morgan Kaufmann, 1995, pp. 1137–1143.

[98] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

[99] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society. Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.

[100] D. Lindley, Bayesian statistics, a review, ser. CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics, 1972.

[101] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.


[102] G. Parisi, Statistical field theory, ser. Advanced book classics. Perseus Books, 1998.

[103] J. A. Palmer, "Relative convexity," ECE Dept., UCSD, Tech. Rep., 2003.

[104] D. M. Blei and J. D. Lafferty, "Correlated topic models," in Advances in Neural Information Processing Systems (NIPS), 2006.

[105] ——, "A correlated topic model of Science," The Annals of Applied Statistics, vol. 1, pp. 17–35, 2007.

[106] T. P. Minka, "Expectation propagation for approximate Bayesian inference," in Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, 2001, pp. 362–369.

[107] ——, "A family of algorithms for approximate Bayesian inference," Ph.D. dissertation, Massachusetts Institute of Technology, 2001.

[108] S. L. Lauritzen, "Propagation of probabilities, means and variances in mixed graphical association models," Journal of the American Statistical Association, vol. 87, pp. 1098–1108, 1992.

[109] M. Opper, "A Bayesian approach to on-line learning," On-line learning in neural networks, pp. 363–378, 1999.

[110] Z. Ma and A. Leijon, "Expectation propagation for estimating the parameters of the beta distribution," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar. 2010, pp. 2082–2085.

[111] ——, "Approximating the predictive distribution of the beta distribution with the local variational method," in Proceedings of IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2011.

[112] H. Akaike, "A new look at the statistical model identification," IEEE Transactions on Automatic Control, vol. 19, no. 6, pp. 716–723, Dec. 1974.

[113] R. E. Kass and A. E. Raftery, "Bayes factors," Journal of the American Statistical Association, vol. 90, no. 430, pp. 773–795, 1995.

[114] K. P. Burnham and D. R. Anderson, "Multimodel inference," Sociological Methods & Research, vol. 33, no. 2, pp. 261–304, 2004.

[115] T. Koski and J. Noble, Bayesian networks: an introduction, ser. Wiley series in probability and statistics. John Wiley, 2009.

[116] DARPA-TIMIT, "Acoustic-phonetic continuous speech corpus," NIST Speech Disc 1.1-1, 1990.

[117] N. Levinson, "The Wiener RMS error criterion in filter design and prediction," Journal of Mathematical Physics, vol. 25, pp. 261–278, 1947.

[118] J. Durbin, "The fitting of time-series models," Review of the International Statistical Institute, vol. 28, no. 3, pp. 233–244, 1960.

[119] K. K. Paliwal and W. B. Kleijn, Speech Coding and Synthesis. Amsterdam: Elsevier, 1995, ch. Quantization of LPC parameters, pp. 433–466.

[120] F. Soong and B. Juang, "Line spectrum pair (LSP) and speech data compression," in IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 9, Mar. 1984, pp. 37–40.


[121] ITU-T Recommendation G.722.2, "Wideband coding of speech at around 16 kbit/s using adaptive multi-rate wideband (AMR-WB)," Jun. 2003.

[122] F. Soong and B. Juang, "Optimal quantization of LSP parameters using delayed decisions," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, 1990, pp. 185–188.

[123] K. K. Paliwal and B. S. Atal, "Efficient vector quantization of LPC parameters at 24 bits/frame," IEEE Transactions on Speech and Audio Processing, vol. 1, no. 1, pp. 3–14, Jan. 1993.

[124] M. Xie and J. P. Adoul, "Fast and low-complexity LSF quantization using algebraic vector quantizer," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, May 1995, pp. 716–719.

[125] S. So and K. K. Paliwal, "Empirical lower bound on the bitrate for the transparent memoryless coding of wideband LPC parameters," IEEE Signal Processing Letters, vol. 13, no. 9, pp. 569–572, Sep. 2006.

[126] W. R. Gardner and B. D. Rao, "Theoretical analysis of the high-rate vector quantization of LPC parameters," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 5, pp. 367–381, Sep. 1995.

[127] S. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, Mar. 1982.

[128] Y. Linde, A. Buzo, and R. Gray, "An algorithm for vector quantizer design," IEEE Transactions on Communications, vol. 28, no. 1, pp. 84–95, Jan. 1980.

[129] P. Hedelin and J. Skoglund, "Vector quantization based on Gaussian mixture models," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 4, pp. 385–401, Jul. 2000.

[130] D. Zhao, J. Samuelsson, and M. Nilsson, "On entropy-constrained vector quantization using Gaussian mixture models," IEEE Transactions on Communications, vol. 56, no. 12, pp. 2094–2104, Dec. 2008.

[131] V. K. Goyal, "Theoretical foundations of transform coding," IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 9–21, Sep. 2001.

[132] J. P. Campbell, "Speaker recognition: a tutorial," Proceedings of the IEEE, vol. 85, no. 9, pp. 1437–1462, Sep. 1997.

[133] D. Reynolds and R. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, Jan. 1995.

[134] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-Garcia, D. Petrovska-Delacretaz, and D. A. Reynolds, "A tutorial on text-independent speaker verification," EURASIP Journal on Applied Signal Processing, vol. 2004, pp. 430–451, Jan. 2004.

[135] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, "Support vector machines using GMM supervectors for speaker verification," IEEE Signal Processing Letters, vol. 13, no. 5, pp. 308–311, May 2006.


[136] H. Cordeiro and C. Ribeiro, "Speaker characterization with MLSFs," in IEEE Odyssey: The Speaker and Language Recognition Workshop, Jun. 2006, pp. 1–4.

[137] Z. Ma and A. Leijon, "Super-Dirichlet mixture models using differential line spectral frequencies for text-independent speaker identification," in Proceedings of INTERSPEECH, 2011, pp. 2349–2352.

[138] T. Lei and J. Udupa, "Performance evaluation of finite normal mixture model-based image segmentation techniques," IEEE Transactions on Image Processing, vol. 12, no. 10, pp. 1153–1169, Oct. 2003.

[139] H. Caillol, W. Pieczynski, and A. Hillion, "Estimation of fuzzy Gaussian mixture and unsupervised statistical image segmentation," IEEE Transactions on Image Processing, vol. 6, no. 3, pp. 425–440, Mar. 1997.

[140] M. J. Jones and J. M. Rehg, "Statistical color models with application to skin detection," International Journal of Computer Vision, vol. 46, no. 1, pp. 81–96, 2002.

[141] E. Littmann and H. Ritter, "Adaptive color segmentation - a comparison of neural and statistical methods," IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 175–185, Jan. 1997.

[142] Y. Tai, J. Jia, and C. Tang, "Soft color segmentation and its applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 9, pp. 1520–1537, Sep. 2007.

[143] J. Yang, W. Lu, and A. Waibel, "Skin-color modeling and adaptation," in Proceedings of the Third Asian Conference on Computer Vision, vol. 2. London, UK: Springer-Verlag, 1997, pp. 687–694.

[144] Z. Ma and A. Leijon, "Human skin color detection in RGB space with Bayesian estimation of beta mixture models," in 18th European Signal Processing Conference (EUSIPCO), 2010, pp. 1204–1208.

[145] P. Paatero and U. Tapper, "Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values," Environmetrics, vol. 5, pp. 111–126, 1994.

[146] D. D. Lee and H. S. Seung, "Learning the parts of objects with nonnegative matrix factorization," Nature, vol. 401, pp. 788–791, 1999.

[147] P. O. Hoyer, "Non-negative matrix factorization with sparseness constraints," Journal of Machine Learning Research, vol. 5, pp. 1457–1469, Dec. 2004.

[148] C. Fevotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.

[149] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Proceedings of Advances in Neural Information Processing Systems (NIPS), 2001.

[150] M. N. Schmidt and H. Laurberg, "Nonnegative matrix factorization with Gaussian process priors," Computational Intelligence and Neuroscience, vol. 2008, 2008.


[151] T. Raiko, A. Ilin, and J. Karhunen, "Principal component analysis for sparse high-dimensional data," in Neural Information Processing, M. Ishikawa, K. Doya, H. Miyamoto, and T. Yamakawa, Eds. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 566–575.

[152] R. Salakhutdinov and A. Mnih, "Bayesian probabilistic matrix factorization using Markov chain Monte Carlo," in Proceedings of the International Conference on Machine Learning, 2008.

[153] I. Olkin and R. Liu, "A bivariate beta distribution," Statistics & Probability Letters, vol. 62, pp. 407–412, 2003.
