Pattern Recognition Letters 32 (2011) 1604–1617

Comparison of clustering methods: A case study of text-independent speaker modeling

Tomi Kinnunen, Ilja Sidoroff, Marko Tuononen, Pasi Fränti *
Speech and Image Processing Unit, School of Computing, University of Eastern Finland, P.O. Box 111, FI-80101 Joensuu, Finland

Article info

Article history: Received 10 September 2010. Available online 2 July 2011. Communicated by G. Borgefors.

Keywords: Clustering methods; Speaker recognition; Vector quantization; Gaussian mixture model; Universal background model

0167-8655/$ - see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2011.06.023

Abbreviations: ANN, artificial neural network; DET, detection error trade-off; EER, equal error rate; EM, expectation maximization; FAR, false acceptance rate; FRR, false rejection rate; FCM, fuzzy C-means; GMM, Gaussian mixture model; GA, genetic algorithm; MAP, maximum a posteriori; MFCC, mel-frequency cepstral coefficient; PNN, pairwise nearest neighbor; RS, random swap; SOM, self-organizing map; SM, split-and-merge; SVM, support vector machine; UBM, universal background model; VQ, vector quantization.

* Corresponding author.
E-mail addresses: [email protected] (T. Kinnunen), [email protected] (I. Sidoroff), [email protected] (M. Tuononen), [email protected] (P. Fränti).

Abstract

Clustering is needed in various applications such as biometric person authentication, speech coding and recognition, image compression and information retrieval. Hundreds of clustering methods have been proposed for the task in various fields but, surprisingly, there are few extensive studies actually comparing them. An important question is how much the choice of a clustering method matters for the final pattern recognition application. Our goal is to provide a thorough experimental comparison of clustering methods for text-independent speaker verification. We consider the parametric Gaussian mixture model (GMM) and the non-parametric vector quantization (VQ) model using the best known clustering algorithms, including iterative (K-means, random swap, expectation-maximization), hierarchical (pairwise nearest neighbor, split, split-and-merge), evolutionary (genetic algorithm), neural (self-organizing map) and fuzzy (fuzzy C-means) approaches. We study recognition accuracy, processing time, clustering validity, and the correlation of clustering quality with recognition accuracy. Experiments from these complementary observations indicate that clustering is not a critical task in speaker recognition and that the choice of the algorithm should be based on computational complexity and simplicity of the implementation. This is mainly because of three reasons: the data is not clustered, large models are used and only the best algorithms are considered. For low-order models, however, the choice of the algorithm can have a significant effect.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Text-independent speaker recognition (Bimbot et al., 2004; Campbell, 1997; Kinnunen and Li, 2010) aims at recognizing persons from their voice. It consists of two different tasks: identification and verification. The identification task aims at finding the best match (or a set of potential matches) for an unknown voice from a speaker database. The goal of the verification task, in turn, is either to accept or reject a claimed identity given by speaking ("I am Tomi, verify me"), or by typing a personal identification number (PIN), for instance.

The speaker recognition process is illustrated in Fig. 1. When a new person is enrolled into the system, the audio signal is first converted into a set of feature vectors. Although short-term spectral features (Huang et al., 2001) are sensitive to noise and channel effects, they provide better recognition accuracy than prosodic and "high-level" features (Reynolds et al., 2004), and are therefore used in this study. Following feature extraction, a speaker model is trained and added into the database. In the matching phase, feature vectors are extracted from the unknown sample and compared with the model(s) in the database, providing a similarity score. To increase robustness to signal variability, recent solutions use sophisticated speaker model compensation (Burget et al., 2007; Kenny et al., 2008) and score normalization using background speakers (Auckenthaler et al., 2000; Reynolds et al., 2000). Finally, the normalized score is compared with a threshold (verification), or the best scoring speaker(s) is selected as such (identification).

A number of different classifiers have been studied for speaker recognition; see Ramachandran et al. (2002) and Kinnunen and Li (2010) for an overview. Speaker models can be divided into generative and discriminative models. Generative models characterize the distribution of the feature vectors within the classes (speakers), whereas discriminative modeling focuses on modeling the decision boundary between the classes. For generative modeling, vector quantization (VQ) (Burton, 1987; He et al., 1999; Hautamäki et al., 2008; Kinnunen et al., 2006; Soong et al., 1987; Tran and Wagner, 2002) and the Gaussian mixture model (GMM) (Reynolds and Rose, 1995; Reynolds et al., 2000) are commonly used. For discriminative training, artificial neural networks (ANNs) (Farrell et al., 1994; Yegnanarayana and Kishore, 2002) and, more recently, support vector machines (SVMs) (Campbell et al., 2006a,b) are representative techniques.


Fig. 1. System diagram of a spectral feature speaker recognizer, the focus of this study. Clustering methods are characterized here by their clustering quality, resulting speaker recognition accuracy, time consumption of the modeling, and usability aspects. There are two common ways to train speaker models: (a) maximum likelihood (ML), which trains the model using feature vectors of the speaker, and (b) maximum a posteriori (MAP), which in addition uses a universal background model (UBM) to generate a robust model.

¹ In practice, the various datasets need to be selected according to the expected conditions of the actual application data, reflecting the channel conditions, noise levels, as well as the speaker population (e.g. native language of speakers). Additional care must be taken when the same speakers (but possibly different utterances) are re-used in background modeling and score normalization. The degree of variability in the NIST speaker recognition evaluation datasets (http://nist.gov/itl/iad/mig/sre.cfm) increases every year, and proper selection of training datasets is critical.


In the past few years, the research community has also focused on combining generative and discriminative models, leading to hybrid models. In particular, GMMs are extensively used for mapping variable-length vector sequences into fixed-dimensional supervectors (Campbell et al., 2006b; Dehak et al., 2011; Lee et al., 2008) that are used as features in an SVM. Parallel to, and in conjunction with, this research trend, significant recent advances have also been made in intersession variability compensation of the supervectors (Burget et al., 2007; Dehak et al., 2011; Kenny et al., 2008). A representative class of such techniques is the factor analysis (FA) model in the GMM supervector space (Dehak et al., 2011; Kenny et al., 2008). Both the hybrid GMM–SVM approaches and the factor analysis models have excellent performance, especially under severe channel and session mismatches. However, due to the additional training steps required for constructing the GMM front-end and, subsequently, the session variability models, the supervector methods typically require at least an order of magnitude more development data compared to traditional generative models (Hasan and Hansen, in press; Hautamäki et al., 2008; Reynolds et al., 2000), and are therefore much more CPU-intensive.

Careful selection of the various modeling data sets is also a critical step.¹

1.1. Relevance of clustering in a large-scale pattern recognition problem

In this paper, we focus on the training methodology of two classical generative speaker models, GMM and VQ, for two reasons. Firstly, these methods underlie the traditional maximum likelihood (or minimum distortion) trained speaker models (Burton, 1987; He et al., 1999; Kinnunen et al., 2006; Soong et al., 1987; Tran and Wagner, 2002), their maximum a posteriori adapted extensions using universal background model (UBM) priors (Hasan and Hansen, in press; Hautamäki et al., 2008; Reynolds et al., 2000) and, importantly, also the recent hybrid GMM–SVM and FA models (Campbell et al., 2006b; Dehak et al., 2011; Kenny et al., 2008; Lee et al., 2008). The way the underlying generative model is trained will have a major effect on the performance of all these methods. Secondly, while there are good guidelines for composing a balanced and representative training set for background modeling – see the recent study (Hasan and Hansen, in press) and references therein – the question of how to train the generative model itself has received little attention in the literature. Typically, Gaussian mixture models, which are pertinent not only in speaker recognition but in all speech applications and general audio classification tasks (Bagci and Erzin, 2007), are trained using the expectation-maximization (EM) algorithm or, in the case of vector quantization, the K-means algorithm.

Better clustering algorithms have been introduced after K-means and EM (Jain, 2010), in terms of avoiding local minima, being less sensitive to the parameter setup and providing faster processing. Even though several literature surveys exist (Jain et al., 1999; Jain, 2010; Milligan, 1981; Theodoridis and Koutroumbas, 2009), only a few extensive comparisons are available in image processing (Fränti and Virmajoki, 2006) and text retrieval (Steinbach et al., 2000), but none in speaker recognition. In clustering research, new methods are usually compared in terms of clustering quality. But should better clustering quality improve the recognition accuracy of the full pattern recognition system? Overall, given the long history of clustering research (Jain, 2010) and the existence of thousands of clustering methods, we feel that it is time to review the choice of clustering methodology in a large-scale, real-world pattern recognition problem involving tens of dimensions and hundreds of pattern classes of highly noisy data. In our view, text-independent speaker recognition is a representative application. The main goal of this paper is to bridge some of the gap between theoretical clustering research and large-scale pattern recognition applications, by focusing on an important practical design question: the choice of clustering methodology. Before presenting the research hypotheses, we first review the role of GMM and VQ clustering methods in our target application.

1.2. Review of clustering methods in speaker recognition

The VQ model (centroid model) is a collection of prototype vectors determined by minimizing a distance-based objective function. GMM is a model-based approach (Meila and Heckerman, 2001) where the data is assumed to follow a Gaussian mixture distribution parameterized by mean vectors, covariance matrices and mixing weights. For a fixed number of clusters, GMM has more free parameters than VQ. Their main difference is the cluster overlap in GMM. In fact, VQ can be seen as a special case of the GMM in which the posterior probabilities have been hardened, and unit variance is assumed in all clusters. Similarly, the K-means algorithm (Linde et al., 1980) can be considered a special case of the expectation-maximization (EM) algorithm for GMM (Bishop, 2006).

The VQ model was first introduced to speaker recognition in (Burton, 1987; Soong et al., 1987) and the GMM model in (Reynolds and Rose, 1995). GMM remains a core component in state-of-the-art speaker recognition, whereas VQ is usually seen as a simplified variant of GMM. GMM combined with UBM (Reynolds et al., 2000) is the de facto reference method (Fig. 1b). The role of VQ, on the other hand, has mostly been in reducing the number of training or testing vectors to reduce the computational overhead. VQ has also been used as a pre-processor for ANN and SVM classifiers in (Louradour and Daoudi, 2005; Um et al., 2000) to reduce the training time, and for speeding up GMM-based verification in (Kinnunen et al., 2006; Roch, 2006). In (Lei et al., 2005), VQ is used for partitioning the feature space into local decision regions modeled by SVMs to increase accuracy. Despite its secondary role, VQ gives comparable accuracy to GMM (Brew and Cunningham, 2010; Kinnunen et al., 2009) when equipped with MAP adaptation (Hautamäki et al., 2008). The computational benefits over GMM are important in small-footprint implementations such as mobile devices (Saastamoinen et al., 2005). Recently, similar to hybrids of GMM and SVM (Campbell et al., 2006b), the combination of VQ with SVM has also been studied (Brew and Cunningham, 2010).

Fuzzy clustering (Dunn, 1974) is a compromise between the VQ and GMM models. It retains the simplicity of VQ while allowing soft cluster assignments using a membership function. Fuzzy extensions of both VQ (Tran and Wagner, 2002) and GMM (Tran et al., 1998) have been studied in speaker recognition. For a useful review, refer to (Tran, 2000). Another recent extension of GMM is based on nonlinear warping of the GMM density function (Wu et al., 2009). These methods, however, lack a formulation for background model adaptation (Reynolds et al., 2000), which is an essential part of modern speaker verification relying on MAP training (Fig. 1b).

The model order – the number of centroid vectors in VQ or Gaussian components in GMM – is an important control parameter in both VQ and GMM. Typically the number varies from 64 to 2048, depending on the chosen features and their dimensionality, the number of training vectors, and the selected clustering model (VQ or GMM). In general, increasing the number of clusters improves recognition accuracy, but the improvement levels off after a certain point due to overfitting. Of the two clustering models, VQ was found to be less sensitive to the choice of the number of clusters in (Stapert and Mason, 2001) when trained without UBM adaptation. The model order in both VQ and GMM needs to be carefully optimized for the given data to achieve good performance (Kinnunen et al., 2009).

The choice of the clustering method, on the other hand, has been much less studied. Usually the K-means (Linde et al., 1980) and expectation-maximization (EM) (Bishop, 2006; McLachlan and Peel, 2001) methods have been used, although several better clustering methods exist (Fränti and Virmajoki, 2006). This raises the questions of which clustering algorithm should be chosen, and whether the choice between the VQ and GMM models matters. Regarding the choice between these models, experimental evidence is diverse. GMM has been shown to perform better for small model orders (Stapert and Mason, 2001), but the difference vanishes when using larger model orders (He et al., 1999; Kinnunen et al., 2006; Stapert and Mason, 2001). However, GMM has been reported to work better than VQ only when cluster-dependent covariance matrices were used, and to perform worse when a shared covariance matrix was used (Reynolds and Rose, 1995). Several authors have used a GMM derived from the VQ model for faster training (Kolano and Regel-Brietzmann, 1999; Pelecanos et al., 2000; Singh et al., 2003). All these observations are based on maximum likelihood (ML) training of the speaker models, though.

Two recent studies include more detailed comparisons of GMM and VQ (Kinnunen et al., 2009; Hanilci and Ertas, 2011). In (Kinnunen et al., 2009), the MAP-trained VQ outperformed MAP-trained GMM for longer training data (2.5 min), but the situation was reversed for 10-second speech samples. The study of Hanilci and Ertas (2011) focused on the choice of dissimilarity measure (city-block, Euclidean, Chebyshev) in VQ and two different clustering initializations (binary LBG splitting (Linde et al., 1980) versus random selection). Differences in the identification and verification tasks, as well as ML versus MAP training, were also considered. The authors found the distance measure and the number of clusters to be more important than the choice of the K-means initialization. ML-trained models performed better with the shorter NTIMIT data in speaker identification, whereas MAP-trained models (both GMM and VQ) worked better on longer training segments (NIST 2001). Regarding the choice between GMM and VQ, they performed equally well on the NIST 2001 verification task, regardless of whether trained by ML or MAP. However, in the identification task, MAP-trained GMM outperformed MAP-trained VQ on both corpuses.

A recent study (Brew and Cunningham, 2010) compares MAP-trained GMM and VQ models when used as front-end features for an SVM. Of the two corpuses, the GMM variant outperformed VQ on the YOHO corpus with short utterances, whereas VQ performed slightly better on the KING corpus with longer free-vocabulary utterances.

1.3. Research objectives and hypotheses

The existing literature lacks an extensive comparison between different clustering algorithms that would be useful for practitioners. The existing comparisons in speaker recognition study only a few methods, and use different features and datasets, preventing meaningful cross-comparisons. Even in (Hanilci and Ertas, 2011; Kinnunen et al., 2009), only the basic EM and K-means algorithms were studied. Thus, an extensive comparison of better clustering algorithms is still missing.

In the experimental section of this paper, we consider the GMM and VQ models in both the maximum likelihood (ML) and maximum a posteriori (MAP) training settings, without an additional SVM back-end, inter-session compensation or score normalization (Auckenthaler et al., 2000). Focusing on this computationally feasible core component enables a detailed study of generative model training methodology without re-training the full recognition system from scratch every time the background models are changed; the same rationale was chosen recently in (Hasan and Hansen, in press).

In the experiments, we consider both controlled laboratory quality speech (TIMIT corpus) and noisy conversational telephony speech (NIST 1999 and NIST 2006 corpuses). Our main evaluation criteria are recognition accuracy, processing time and ease of implementation. We aim at answering the following questions:

1. Is clustering needed or would random sub-sampling be sufficient?

2. What is the best algorithm in terms of quality, efficiency and simplicity?

3. What is the difference between the accuracy of the VQ and GMM models?

It was hypothesized in (Kinnunen et al., 2000) that clustering would be required but that the choice of clustering algorithm would not be critical. A possible explanation is that the speech data may not have a clustering tendency (Kinnunen et al., 2001). These observations were based on a small 25-speaker laboratory-quality dataset collected using the same microphone and read sentences. In this paper, we aim at confirming these hypotheses via extensive large-scale experiments. Since the main advantage of speaker recognition over other biometric modalities is the possibility of low-cost remote authentication, we experiment using realistic telephony data including different handsets, transmission lines, GSM coding and environmental noises. The two NIST corpuses used in the study include 290,521 (NIST 1999) and 53,966 (NIST 2006) verification trials involving 539 and 816 speakers, respectively. Furthermore, in the NIST 2006 corpus, all the verification trials are from highly mismatched channel conditions. This makes it a very challenging pattern recognition problem.

Regarding the difference between the VQ and GMM models, our results reveal insights which are not obvious, and sometimes contradict previous understanding based on the literature. For example, even though the models are of similar quality in terms of average speaker verification accuracy (equal error rate), their performance differs systematically at the extreme cases where small false acceptance or false rejection errors are required.

2. Clustering models as speaker models

2.1. Problem formulation

We consider a training set X = {x_1, ..., x_N}, where x_i = (x_i^{(1)}, ..., x_i^{(d)}) \in \mathbb{R}^d are the d-dimensional feature vectors. In the centroid-based model, also known as the vector quantization (VQ) model, the clustering structure is represented by a set of code vectors known as the codebook, denoted here as C = {c_1, ..., c_K}, where K << N. The size of the codebook (K) is considered a control parameter. For a fixed K, the clustering problem can be defined as an optimization problem, in which the goal is to find a codebook C that minimizes a given objective function. Here we use the mean square error (MSE):

\mathrm{MSE}(X, C) = \frac{1}{N} \sum_{i=1}^{N} \min_{1 \le k \le K} \| x_i - c_k \|^2, \qquad (1)

where \|x\|^2 = \sum_{j=1}^{d} (x^{(j)})^2 denotes the squared Euclidean norm.
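Eq. (1) translates directly into a few lines of array code. The following is a minimal sketch, assuming NumPy; the helper name `mse` is ours, not from the paper:

```python
import numpy as np

def mse(X, C):
    """Eq. (1): mean of squared distances to the nearest code vector.
    X: (N, d) training vectors, C: (K, d) codebook."""
    # Squared Euclidean distance from every vector to every code vector, (N, K).
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()
```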

In the Gaussian mixture model (GMM), each cluster is represented by three parameters: the mean vector \mu_k, the covariance matrix \Sigma_k, and the mixing weight w_k. By considering K Gaussian components, the clustering objective function can be defined as the average log-likelihood:

L(X, \Theta) = \frac{1}{N} \sum_{i=1}^{N} \log \sum_{k=1}^{K} w_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k), \qquad (2)

where \Theta = \{\mu_k, \Sigma_k, w_k\}_{k=1}^{K} denotes the model parameters, and \mathcal{N}(x_i \mid \mu_k, \Sigma_k) is the multivariate Gaussian density function with parameters \mu_k and \Sigma_k. The mixing weights w_k are constrained to be positive and to sum to 1.
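Analogously, a hedged sketch of Eq. (2) for the diagonal-covariance case used later in this paper (NumPy assumed; the function name and argument layout are ours):

```python
import numpy as np

def avg_log_likelihood(X, weights, means, variances):
    """Eq. (2) for a diagonal-covariance GMM.
    X: (N, d); weights: (K,); means, variances: (K, d)."""
    N, d = X.shape
    # log of each weighted Gaussian component for every vector, shape (N, K)
    diff2 = ((X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :]).sum(axis=2)
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    log_comp = np.log(weights)[None, :] + log_norm[None, :] - 0.5 * diff2
    # Numerically stable log-sum-exp over the K components, averaged over vectors.
    m = log_comp.max(axis=1, keepdims=True)
    return (m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))).mean()
```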

In early studies, speaker models were trained by optimizing (1) or (2) directly on the enrollment data of that speaker. The same optimization criterion would then be used as the similarity score between an unknown sample and the given model(s). The current paradigm, however, uses a two-stage training process. First, a universal background model (UBM) is trained by pooling a large number of feature vectors from different speakers and optimizing (1) or (2) with any suitable clustering algorithm. The UBM serves as prior information about the general (speaker-independent) distribution of the spectral feature space, and it is used as a form of regularization (smoothing) in the training. To be precise, for the GMM model the mean vectors of the UBM, \mu_k^{\mathrm{UBM}}, are adapted as

\mu_k^{\mathrm{adapted}} = \alpha_k E_k(X) + (1 - \alpha_k)\,\mu_k^{\mathrm{UBM}}, \qquad (3)

where

\alpha_k = \frac{\sum_{i=1}^{N} P(k \mid x_i)}{r + \sum_{i=1}^{N} P(k \mid x_i)} \quad \text{and} \quad E_k(X) = \frac{1}{n_k} \sum_{i=1}^{N} P(k \mid x_i)\, x_i. \qquad (4)

Here P(k|x_i) is the posterior probability of vector x_i originating from the kth Gaussian, n_k is the soft count of vectors assigned to the kth Gaussian, and r is a constant known as the relevance factor. Typically r is fixed (Reynolds et al., 2000), but a data-adaptive relevance factor using a fuzzy control rule has also been suggested (Juang et al., 2003). In this study, we use the fixed constant r = 16, as usually done in speaker verification. Note that only the mean vectors are adapted; the rest of the parameters are shared between speakers. For more details, refer to (Reynolds et al., 2000). For the VQ model, the adaptation formulae are a special case of (3) and (4) with the assumption that P(k|x_i) = 1 for the nearest centroid vector in the UBM, and P(k|x_i) = 0 otherwise. For VQ adaptation, we use the relevance factor r = 12 as in (Hautamäki et al., 2008).
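To make the adaptation concrete, here is a sketch of mean-only MAP adaptation, Eqs. (3) and (4), for diagonal covariances (NumPy assumed; the function and argument names are ours, and the default r = 16 follows the text; the VQ case replaces the soft posteriors with hard 0/1 assignments and r = 12):

```python
import numpy as np

def map_adapt_means(X, w, mu, var, r=16.0):
    """Adapt UBM means to enrollment data X, Eqs. (3)-(4).
    w: (K,) weights, mu: (K, d) means, var: (K, d) diagonal covariances."""
    # Posterior probabilities P(k|x_i) under the UBM, shape (N, K).
    diff2 = ((X[:, None, :] - mu[None, :, :]) ** 2 / var[None, :, :]).sum(axis=2)
    log_p = np.log(w)[None, :] - 0.5 * np.log(2 * np.pi * var).sum(axis=1)[None, :] - 0.5 * diff2
    post = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    n_k = post.sum(axis=0)                                  # soft counts
    E_k = (post.T @ X) / np.maximum(n_k[:, None], 1e-10)    # Eq. (4)
    alpha = n_k / (r + n_k)                                 # adaptation coefficients
    return alpha[:, None] * E_k + (1.0 - alpha[:, None]) * mu  # Eq. (3)
```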


In the recognition phase, the average log-likelihood (mean square error in the case of VQ) of the data is evaluated with respect to both the target speaker model and the UBM, and their difference gives the normalized score. Normalization with the background model equalizes the score ranges of different speakers and test segments so that a common verification threshold can be used. The normalized score is finally compared with a verification threshold to give the accept/reject decision.
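In code, the normalized score is just a difference of two average scores; a minimal sketch, assuming the scorer is a function such as `avg_log_likelihood` above (all names here are ours, and the model arguments are assumed to be parameter tuples accepted by the scorer):

```python
def normalized_score(X_test, target, ubm, score_fn):
    """Log-likelihood-ratio-style score: target model vs. UBM."""
    return score_fn(X_test, *target) - score_fn(X_test, *ubm)

# Verification decision: accept the claim if the score exceeds a threshold.
# accept = normalized_score(X_test, target, ubm, avg_log_likelihood) >= theta
```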

2.2. Clustering algorithms

Optimal algorithms for solving the clustering problem have exponential time complexity (Garey et al., 1982). Thus, all methods for data sets consisting of thousands or millions of training vectors are based on different heuristics; several hundreds of methods have been proposed in the literature (Jain et al., 1999). For the comparisons in this paper, we include algorithms that, according to our experience (Fränti and Virmajoki, 2006), consistently provide high quality clustering, and algorithms that are popular due to their simplicity or for other reasons. We include two hierarchical algorithms (PNN, SPLIT) and six iterative algorithms (K-means, SOM, RS, SM, FCM, GA). Random clustering is used as a reference point. GMM, on the other hand, is the de facto standard in text-independent speaker recognition, and provides another good reference point.

Random: A trivial method for modeling the data is to construct the codebook from K randomly chosen data vectors. The random codebook will be used as the initial solution for the iterative algorithms described below, but it also serves as a reference solution for measuring the quality of the clustering algorithms. A good clustering algorithm should produce a significantly better codebook than random selection.

Repeated K-means: K-means (McQueen, 1967) starts from any initial solution, which is then iteratively improved by two optimization steps for as long as improvement is achieved. The algorithm is known as the Linde-Buzo-Gray (LBG) or generalized Lloyd algorithm (GLA) in vector quantization (Linde et al., 1980). Since K-means is sensitive to the initial solution, we apply it repeatedly, each time starting from a new random initial solution (Duda and Hart, 1973). The codebook providing the smallest MSE is retained as the final solution.
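A minimal sketch of repeated K-means (NumPy assumed; it reuses the `mse` helper from Section 2.1 and accepts an optional initial codebook so that later sketches can call it for fine-tuning):

```python
import numpy as np

def kmeans(X, K, iters=100, C=None, seed=0):
    rng = np.random.default_rng(seed)
    if C is None:
        C = X[rng.choice(len(X), K, replace=False)]  # random initial codebook
    for _ in range(iters):
        # Partition step: assign each vector to its nearest code vector.
        labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        # Centroid step: move each code vector to the mean of its partition.
        newC = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else C[k]
                         for k in range(len(C))])
        if np.allclose(newC, C):
            break
        C = newC
    return C

def repeated_kmeans(X, K, R=10):
    """Run K-means from R random initial solutions; keep the lowest-MSE codebook."""
    return min((kmeans(X, K, seed=r) for r in range(R)), key=lambda C: mse(X, C))
```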

SOM: The self-organizing map (Nasrabadi and Feng, 1988) is a neural network approach to the clustering problem. The neurons in the network are connected with a 1-D or 2-D structure, and they correspond to the code vectors. Each feature vector is fed to the network by finding the nearest code vector. The best matched code vector and its neighboring vectors in the network are updated by moving them towards the input vector. After processing the training set a predefined number of times, the neighborhood size is shrunk. The entire process is repeated until the neighborhood size shrinks to zero.

PNN: Pairwise nearest neighbor (Equitz, 1989; Fränti et al., 2000) generates the codebook hierarchically. It starts by initializing each feature vector as a separate code vector. Two code vectors are merged at each step of the algorithm, and the process is repeated until the desired size of the codebook is obtained. The code vectors to be merged are always the ones that result in the least distortion. We use the fast exact PNN algorithm introduced in (Fränti et al., 2000).

SPLIT: The opposite, top-down approach starts with a single cluster of all the feature vectors. New clusters are then created one at a time by dividing the existing clusters. The splitting process is repeated until the desired number of clusters is reached. This approach usually requires much less computation than the PNN. We use the algorithm in (Fränti et al., 1997b) that always selects the optimal hyperplane, dividing the particular cluster along its principal axis, augmented with a local repartitioning phase at each division step. This variant gives comparable results to the PNN but with a much faster algorithm.

SM: Split-and-merge (Kaukoranta et al., 1998) is an iterative algorithm that modifies the codebook by a series of split and merge operations. At every step, the code vectors to be split and merged are chosen as the ones that provide the best improvement (split), or the least increase (merge) in distortion. The algorithm provides high quality codebooks, but with a significantly more complex implementation than the other algorithms.

RS: The random swap algorithm (Fränti and Kivijärvi, 2000) starts with a random codebook, which is then iteratively improved. At every step, a randomly selected code vector is tentatively re-allocated to a randomly chosen training vector. The new candidate codebook is fine-tuned by two iterations of K-means; the solution is then evaluated and accepted if it improves on the previous solution. The algorithm is run for a fixed number of iterations. This trial-and-error approach is much simpler to implement than the split-and-merge algorithm, and is surprisingly effective. It was shown to find the optimal global allocation of the codebook in an expected time of O(N²K) (Fränti et al., 2008). See Fig. 2 for an illustration of the algorithm.
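A sketch of the RS loop, reusing the `kmeans` and `mse` helpers from the earlier sketches (the iteration count I = 2500 anticipates the setting selected in Table 4; all names are ours):

```python
import numpy as np

def random_swap(X, K, I=2500, seed=0):
    rng = np.random.default_rng(seed)
    C = kmeans(X, K, iters=2, seed=seed)                 # initial codebook, lightly tuned
    best = mse(X, C)
    for _ in range(I):
        cand = C.copy()
        cand[rng.integers(K)] = X[rng.integers(len(X))]  # tentative random swap
        cand = kmeans(X, K, iters=2, C=cand)             # two K-means fine-tuning iterations
        err = mse(X, cand)
        if err < best:                                   # accept only improvements
            C, best = cand, err
    return C
```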

FCM: Fuzzy C-means (Dunn, 1974) generalizes K-means to fuzzy clustering, in which data vectors can belong to several partitions at the same time with a given weight. Traditional K-means is then applied in the final step in order to obtain the centroids (codebook) from the fuzzy partitions. Another alternative would be to formulate fuzzy MAP adaptation based on the fuzzy memberships. Since we are not aware of such a formulation, we use the centroids obtained from the hard partitions.
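For illustration, a sketch of the standard FCM updates with fuzzifier m = 2 (NumPy assumed; the function name is ours); in the setup described above, the resulting soft partitions are subsequently hardened and K-means produces the final codebook:

```python
import numpy as np

def fcm(X, K, m=2.0, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # Squared distances to all centroids, clipped away from zero, (N, K).
        d2 = np.maximum(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), 1e-12)
        w = d2 ** (-1.0 / (m - 1.0))
        U = w / w.sum(axis=1, keepdims=True)       # memberships; each row sums to 1
        Um = U ** m
        C = (Um.T @ X) / Um.sum(axis=0)[:, None]   # membership-weighted centroids
    return C, U
```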

GA: The genetic algorithm generates a set of solutions (called a population) and evolves it using the survival-of-the-fittest principle. In (Fränti et al., 1997a), PNN is used as the crossover to generate new candidates, which are further fine-tuned by K-means. For more than a decade, the algorithm has outperformed every competing clustering algorithm; slightly better results have been reported only by other (more complicated) GA variants (Fränti and Virmajoki, 2006). It therefore serves as a good reference point for the clustering model.

GMM: We use the expectation-maximization (EM) algorithm (Bishop, 2006; McLachlan and Peel, 2001) for training the GMM, as described in (Reynolds and Rose, 1995). In the EM algorithm, an initial guess is made for the model parameters, and the solution is then improved by two optimization steps similar to K-means. Since the EM algorithm is also sensitive to the initialization, we apply it repeatedly, starting from a new random solution which is always first fine-tuned by K-means. The result providing the highest likelihood is retained as the final model. An important consideration in GMM is the type of the covariance matrices of the Gaussian components. As generally done with MFCC features, we use diagonal covariance matrices instead of full covariances for numerical and computational reasons: for limited data, full covariance matrices easily become singular (ill-conditioned). Using diagonal covariances is also computationally efficient since no full covariance matrix inversions are required.
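The same recipe can be approximated with off-the-shelf tools. A hedged sketch using scikit-learn, which is our assumption rather than the paper's implementation (that one is in C/C++); note that `reg_covar` is an additive regularizer, only approximating the hard variance floor described in Section 3.3:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(X, K=256, R=10):
    floor = 1.52e-5 * X.var(axis=0).mean()   # variance-floor scale from the text
    gmm = GaussianMixture(n_components=K, covariance_type='diag',
                          n_init=R,          # repeated EM from K-means-refined inits
                          init_params='kmeans',
                          tol=2.0 ** -16,    # convergence threshold epsilon
                          reg_covar=floor, random_state=0)
    return gmm.fit(X)
```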

2.3. Number of clusters

The number of clusters (model order) is a control parameter of a clustering algorithm which must be optimized for a given application. In recognition applications, this is usually done by picking the model order that gives the best recognition accuracy on a development set. Practice has shown that, irrespective of the implementation details, corpus and chosen short-term features, the best accuracy is typically found using K = 64-2048 code vectors or Gaussian components. The main drawback of this approach is the expensive computation involved – for each considered model order, one needs to re-train the speaker models (and the UBM if MAP adaptation is used) and re-classify the development test samples.


Fig. 2. Illustration of a single step of the random swap (RS) algorithm. A randomly chosen centroid is re-allocated to a new location, followed by K-means fine-tuning. The new solution is accepted if it provides smaller distortion than the original codebook.


In clustering research, a large number of clustering validity indices have been proposed for automatically detecting the number of clusters (e.g. Bezdek and Pal, 1998; Davies and Bouldin, 1979; Geva et al., 2000; McLachlan and Peel, 2001; Milligan, 1981; Sarkar et al., 1997). In an early phase of this study, we also evaluated the classical F-ratio (based on the ANOVA test procedure; Ito, 1980) and the Davies-Bouldin index (DBI) (Davies and Bouldin, 1979) in VQ-based speaker modeling, but found the selected model order to correlate poorly with recognition accuracy. Even though these indices have been reported to work reasonably well for low-dimensional features and data with a clear clustering tendency (Kärkkäinen and Fränti, 2002), they are affected by overlapping clusters and noise (e.g. Bezdek and Pal, 1998), as well as increasing dimensionality. Since speech features have tens of dimensions and are unlikely to have any clustering tendency (Kinnunen et al., 2001), this may explain the result. For practitioners, we therefore recommend the optimize-on-devset procedure.
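For completeness, a sketch of scoring candidate model orders with the Davies-Bouldin index (scikit-learn assumed; the function name is ours); as noted above, on speech features the order selected this way correlated poorly with recognition accuracy:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def dbi_per_order(X, orders=(64, 128, 256, 512)):
    scores = {}
    for K in orders:
        labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
        scores[K] = davies_bouldin_score(X, labels)  # lower = more compact, separated clusters
    return scores
```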

3. Experimental setup

3.1. Corpora and features

For the experiments, we use three corpora: TIMIT, NIST 1999 and NIST 2006, as documented in Tables 1 and 2. TIMIT represents laboratory quality speech recorded in highly controlled conditions. It was selected for the purpose of parameter optimization in an initial stage of our study. The NIST 1999 and NIST 2006 corpora, on the other hand, represent uncontrolled, conversational telephone quality speech, which is expected in real applications. We use 12- and 36-dimensional mel-frequency cepstral coefficient (MFCC) features for NIST 1999 and NIST 2006, respectively (see below).

The TIMIT corpus (Linguistic Data Consortium) consists of 630 speakers, and for each speaker there are 10 speech files. We split the files into non-overlapping training and test sections of 70% (22 s) and 30% (9 s), respectively. For consistency with the NIST files, TIMIT files were anti-alias filtered and downsampled from 16 kHz to 8 kHz.

The NIST 1999 corpus (Martin and Przybocki, 2000) consists of 539 speakers (230 males, 309 females). We use the training section of the corpus for our experiments. Each speaker's training data is given in two files labeled "a" and "b", each with a duration of one minute. We use the "a" files for training and the "b" files for classification. For a given speaker, these two files are from the same telephone number but from two different telephone calls (sessions). Different speakers may have the same or different type of handset (electret or carbon button). To evaluate verification performance, we match each of the test files against each of the speakers, yielding a total of 539 × 539 = 290,521 test trials, of which 539 are genuine and the remaining 289,982 are impostor trials.

From the NIST 2006 corpus, we have selected the common "core condition" as specified in the NIST 2006 SRE evaluation plan (NIST 2006 Speaker Recognition Evaluation Webpage, 2006). This benchmark test consists of 816 target speakers (354 males, 462 females) and a total of 53,966 verification trials (5077 genuine, 48,889 impostor).

We use mel-frequency cepstral coefficients (MFCCs) as the acoustic features (Huang et al., 2001); see Table 2. Each frame is multiplied by a 30 ms Hamming window, shifted by 10 ms.


Table 1
Summary of the speech material.

                      TIMIT        NIST 1999                        NIST 2006
Language              English      English                          Mostly English(a)
Speakers              630          539                              816
Test trials           N/A          539 genuine + 289,982 impostor   5077 genuine + 48,889 impostor
Speech type           Read         Conversational                   Conversational
Quality               Laboratory   Telephone                        Telephone
Sampling rate         8.0 kHz      8.0 kHz                          8.0 kHz
Session mismatch      Matched      Mismatched                       Mismatched
Channel mismatch      Matched      Mixed                            Mismatched
Training data (avg.)  22 s         ~1 min                           ~2.5 min
Test data (avg.)      9 s          ~1 min                           ~2.5 min

(a) A small part of the data contains Arabic, Mandarin, Russian, or Spanish speakers.

Table 2
Evaluation set-up for each corpus. UBM = universal background model, d = feature dimensionality, N = number of vectors.

Corpus      Evaluation task   Features used            UBM training data and UBM type     UBM training vectors
TIMIT       Identification    MFCC (d = 12)            N/A                                N/A
NIST 1999   Verification      MFCC (d = 12)            NIST 2000, single UBM              N = 591,378
NIST 2006   Verification      MFCC + Δ + Δ² (d = 36)   NIST 2004, gender-dependent UBMs   N = 500,000 per gender


From the windowed frame, the magnitude spectrum is computed using the fast Fourier transform (FFT) and then filtered with a bank of 27 triangular filters spaced linearly on the mel-frequency scale. The log-compressed filter outputs are then converted into cepstral coefficients by the discrete cosine transform (DCT), and coefficients 1-12 are retained.

For the TIMIT and NIST 1999 corpora, we use the 12 MFCCs as features, followed by utterance-level mean and variance normalization to give zero-mean, unit-variance features. For these two corpora, voice activity detection (VAD) is not needed: the TIMIT samples are short and of high quality, containing mostly speech, and the NIST 1999 corpus has been pre-processed by NIST for silence removal. This simple setup is sufficient on these data sets according to our experience.
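A sketch of this front-end using librosa, which is our assumption; the exact filterbank and windowing details may differ from the authors' implementation, and the file name is hypothetical:

```python
import librosa
import numpy as np

y, sr = librosa.load('utterance.wav', sr=8000)           # hypothetical input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=512, win_length=240,   # 30 ms window at 8 kHz
                            hop_length=80,               # 10 ms shift
                            window='hamming', n_mels=27) # 27 triangular mel filters
feat = mfcc[1:13].T                                      # keep coefficients 1-12, (frames, 12)
feat = (feat - feat.mean(axis=0)) / feat.std(axis=0)     # utterance-level mean/variance norm.
```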

For the more challenging NIST 2006 data, our front-end includes additional RelAtive SpecTrAl (RASTA) filtering (Hermansky and Morgan, 1994) to mitigate convolutive channel effects, followed by estimation of the Δ and Δ² parameters to capture local spectral dynamics. Finally, an adaptive energy-based algorithm is used for picking speech-only frames, followed by utterance-level mean and variance normalization. For the NIST 2006 corpus, voice activity detection is crucial – according to (Hautamäki et al., 2007), error rates may increase to near chance level if VAD is excluded. The Matlab code of the energy VAD used in this study is available in (Kinnunen and Li, 2010).

For the NIST 1999 and NIST 2006 corpora, we need universal background models. For the NIST 1999 data, we use a subset of the 1-speaker detection task training files of the NIST 2000 speaker recognition corpus (Linguistic Data Consortium), and for the NIST 2006 data, we use a subset of the 1-side training files of the NIST 2004 SRE corpus. For NIST 1999 we use a gender-independent UBM, and for NIST 2006 we use separate UBMs for female and male speakers.

3.2. Performance criteria

To assess speaker verification performance, we use the detection error tradeoff (DET) curve (Martin et al., 1997) as an evaluation tool. The DET curve presents the trade-off between the two detection error rates, false acceptance rate (FAR) and false rejection rate (FRR), on a normal deviate scale over all decision thresholds. For Gaussian score distributions, the resulting DET curves are straight lines. As an average error measurement, we report the equal error rate (EER), i.e. the error rate at the operating point where FAR = FRR. We also provide the FAR and FRR at a few additional operating points corresponding to security and user-convenience application scenarios.
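A sketch of estimating the EER from arrays of genuine and impostor scores by a threshold sweep (NumPy assumed; the function name is ours):

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """EER: error rate at the threshold where FAR and FRR (approximately) cross."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false acceptances
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejections
    i = np.argmin(np.abs(far - frr))
    return 0.5 * (far[i] + frr[i])
```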

To measure the computational efficiency of background model training, we use the average CPU time over 10 repetitions. All the clustering algorithms have been implemented in either C or C++. The NIST 2006 experiments were carried out on a Dell PE2900 workstation with two 3 GHz X5450 CPUs, 48 GB of RAM and the CentOS release 5.3 operating system. Care was taken to exclude the file I/O overhead from the running times. The TIMIT and NIST 1999 experiments were carried out on two older Dell Optiplex G270 computers (2.8 GHz CPU) with 1 GB RAM.

3.3. Parameter setting of the clustering algorithms

The clustering algorithms have several control parameters that should be fixed beforehand. In the following, we document the selection of the parameters and comment on their importance for the success of each algorithm. A summary is given in Table 4. Note that, even though the number of clusters (K) is a control parameter, it is common to all the algorithms and hence not counted here.

Random: This algorithm has no control parameters.

Repeated K-means: K-means does not have any control parameters, but its quality strongly depends on the initial solution. We therefore repeat the algorithm several times (R), restarting from different random initial solutions as originally proposed in (Duda and Hart, 1973). As a negative side, this also multiplies the processing time R times compared to a single run of K-means. Here we set R = 10.

SOM: The SOM algorithm does not depend much on the initialization, but it is very sensitive to the parameter setup (Fränti, 1999). We fix the initial neighborhood size (Dmax) to 16, and then study the learning rate (α) and the number of iterations (I) on the TIMIT corpus. Based on the identification results in Table 3, we fix the learning rate to α = 0.01 and the number of iterations to I = 1000.

PNN and SPLIT: The hierarchical algorithms (PNN and SPLIT) have no control parameters and they always produce the same result. The SPLIT approach itself includes several design alternatives, such as which cluster to split and how to split it. However, the proposed solution in (Fränti et al., 1997b) works very well for all data sets without the need of any data-dependent parameter tuning. It was aimed at maximum quality (at the cost of speed), but since it still remains the fastest among the tested algorithms, faster variants are not considered.


Table 4
List of control parameters of the algorithms, and the values considered. The selected values are marked with an asterisk (*).

Algorithm      Control parameters                         Values tested
Random         N/A                                        N/A
Rep. K-means   Number of restarts (R)                     R = 5, 10*, 100
SOM            Number of iterations (I)                   I = 5, 10, 20, 50, 100, 1000*, 10000
               Maximum learning rate (α)                  α = 0.001, 0.01*, 0.1, 1
               Size of the initial neighborhood (Dmax)    Dmax = 16*
PNN            N/A                                        N/A
SPLIT          N/A                                        N/A
RS             Number of iterations (I)                   I = 2500*, 5000
SM             Number of splits before merging (H)        H = K (codebook size)*
GA             Number of generations (I)                  I = until no improvement*
               Size of generation (Z)                     Z = 10*
FCM            Number of FCM iterations (I_FCM)           I_FCM = 200*
               Number of K-means iterations (I_km)        I_km = 10, 100
GMM            Number of restarts (R)                     R = 10*
               Variance floor (σ²_min)                    σ²_min = 1.52 × 10⁻⁵ σ²*

Table 3
Dependency of the SOM performance on the control parameters for the TIMIT corpus using codebook size 32. The reported numbers are closed-set identification error rates (IER%) over the whole TIMIT corpus with 630 speakers.

Number of          Learning rate (α)
iterations (I)     0.001    0.01    0.1    1
5                  66.0     15.9    3.2    60.5
10                 48.1     10.3    3.2    59.5
20                 21.4     5.2     2.9    60.3
50                 19.7     2.7     1.4    59.2
100                11.4     2.2     2.9    61.4
1000               1.7      1.4     4.0    62.2


SM: In the SM algorithm, there is one control parameter (step size H), which defines how many subsequent split steps are performed before the same number of merge operations. We follow the recommendation of Kaukoranta et al. (1998) and fix it to be equal to the size of the codebook (H = K). Smaller values would yield slightly higher MSE using less processing time, whereas higher values do not provide much further improvement. The exact choice of this parameter is not critical. The other parameters are insignificant, and the default values described in (Kaukoranta et al., 1998) can be used.

RS: In the RS algorithm, we must select the number of iterations (I), which determines the trade-off between processing time and quality. Our previous experience indicates that the number of iterations should be proportional to the number of input vectors (N) to guarantee a high quality result. We considered the values I = 2500 and 5000; the first is used in the sequel, as there was not much difference in accuracy when tested on TIMIT.

FCM: We need to fix the number of iterations. According to previous experiments (Fränti and Virmajoki, 2006), they are fixed as shown in Table 4.

GA: We need to fix the population size and the number of generations. Their selection is merely a trade-off between quality and time. The results in (Fränti, 2000) have shown that even the faster variant provides very good performance. We therefore fix the generation size to Z = 10 accordingly, and iterate the algorithm until no improvement is found (usually 5-20 iterations).

GMM: The initial mean vectors for the expectation-maximization (EM) algorithm are initialized by random selection from the training set, followed by 10 K-means iterations. Following this, the covariance matrices are computed from the vectors assigned to each cluster, and the weights are set to the relative count of vectors assigned to the cluster. After initialization, the expectation and maximization steps are iteratively repeated until the relative increase in likelihood falls below a given threshold (ε). As in K-means, we repeat the algorithm R times, each time starting from a new random solution, and choose as the final model the one which yields the highest likelihood. Here, we fix the parameters to ε = 2⁻¹⁶ and R = 10. We also need to set a variance floor (σ²_min) for each dimension to prevent components from becoming singular (Reynolds and Rose, 1995). The values were optimized on the TIMIT data and fixed to 1.52 × 10⁻⁵ times the variance of the training set. The number of restarts (R) and the variance floor are important control parameters of the algorithm, which must be set up experimentally since there are no good theoretical guidelines on how to choose them optimally for a given data set (Reynolds and Rose, 1995). The selection of the convergence threshold (ε), on the other hand, is much less critical and can be considered a fixed constant.

4. Results and discussion

4.1. Speaker recognition accuracy

First, we present results on the NIST 1999 corpus and study the effect of the background model. We use random swap (RS) as a representative vector quantization method, and compare it against the Gaussian mixture model (GMM) trained with the repeated expectation-maximization (EM) algorithm. The speaker models are trained either independently, without a UBM and without score normalization (EM, RS), or by using MAP adaptation from the UBM with UBM score normalization (EM + MAP, RS + MAP). The DET plots are presented in Fig. 3 and representative score distributions are shown in Fig. 5.

VQ achieves higher accuracy at small FAR levels, but the differences become smaller when UBMs are used. The UBM adaptation and normalization improve the accuracy of both methods. For the rest of the experiments, we use background modeling; thus, the focus is on comparing different clustering methods for training the UBM.

While UBM adaptation and score normalization are known to improve accuracy, it is much less studied how the set-up of UBM training affects recognition accuracy. To study this in more detail, we consider two different methods to initialize the GMM: (1) the repeated EM method described in Sections 2.2 and 3.3, and (2) a deterministic method used in (Kinnunen et al., 2009). The latter was specifically optimized on the latest NIST corpora to give good recognition accuracy with small time consumption. It uses the splitting algorithm to generate an initial codebook, which is fine-tuned by seven K-means iterations; the covariances and weights are then initialized from the codebook partitions; finally, two EM iterations are executed (further EM iterations did not improve accuracy in Kinnunen et al. (2009)). Comparative results with VQ-UBM (using the random swap algorithm) are shown in Fig. 4.

The two alternative methods of training the GMM do exhibit different performance. At small FRR levels (lower right corner), the repeated EM clearly outperforms the faster heuristic variant; at small FAR levels (upper left corner), in turn, the order is reversed, even though the difference is smaller. The RS algorithm gives the same performance as repeated EM at small FRR levels and model size K = 512; however, RS outperforms EM at small FAR levels. This difference is even larger when the model order is increased to K = 2048. At small FRR levels and with K = 2048, repeated EM gives the best performance. In summary, VQ-UBM is better suited for security applications (small FAR desired), whereas GMM-UBM is better for user-convenience applications (small FRR desired).


Fig. 3. Results on the NIST 1999 corpus (model orders K = 64 and K = 256) using the VQ and GMM approaches, with and without UBM. [DET curves of FRR vs. FAR. K = 64 legend: GMM (EM) EER = 12.99, VQ (RS) EER = 12.43, GMM-UBM (EM + MAP) EER = 10.42, VQ-UBM (RS + MAP) EER = 10.95. K = 256 legend: GMM (EM) EER = 12.61, VQ (RS) EER = 12.55, GMM-UBM (EM + MAP) EER = 10.76, VQ-UBM (RS + MAP) EER = 9.75.]

Fig. 5. Match score distributions for VQ-UBM and GMM-UBM on the NIST 1999 corpus for model size 256. Two operating points (thresholds) with the corresponding error rates are indicated. [VQ-UBM: Θ₁ = −0.19 (FAR = 76%, FRR = 2%), Θ₂ = 0.02 (FAR = 2%, FRR = 17%). GMM-UBM: Θ₁ = −0.71 (FAR = 78%, FRR = 2%), Θ₂ = 0.37 (FAR = 2%, FRR = 21%).]

Fig. 4. Comparison of random swap (RS) to GMM with two different initializations on the NIST 2006 corpus. The UBMs are trained with the indicated methods. [DET curves. K = 512 legend: VQ-UBM (RS) EER = 9.65, GMM-UBM (repeated EM) EER = 9.61, GMM-UBM (Split + 7 K-means + 2 EM iter.) EER = 10.34. K = 2048 legend: VQ-UBM (RS) EER = 8.94, GMM-UBM (repeated EM) EER = 9.55, GMM-UBM (Split + 7 K-means + 2 EM iter.) EER = 9.97.]


For a more complete analysis, Table 5 summarizes error rates at a few selected operating scenarios for both the NIST 1999 and the NIST 2006 corpora. The notation "FAR @ FRR = 1%" means that the verification threshold has been adjusted to give 1% FRR, and the corresponding FAR is reported in the table; similarly for the other error type. Independent of the corpus and feature set-up, GMM-UBM seems to be better for the user-convenience application and VQ-UBM for the security application, respectively.
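To make the notation concrete, the following minimal sketch shows one way to read such an operating point off raw trial scores; the quantile-based threshold selection and the function names are our own illustrative choices, not the exact evaluation code used here.

import numpy as np

def far_at_frr(genuine, impostor, frr_target=0.01):
    """FAR at the threshold that gives the requested FRR.

    genuine, impostor: 1-D arrays of match scores (higher = more similar).
    Genuine trials scoring below the threshold are false rejections;
    impostor trials scoring at or above it are false acceptances.
    """
    thr = np.quantile(np.asarray(genuine), frr_target)   # reject scores below thr
    far = np.mean(np.asarray(impostor) >= thr)
    return far, thr

def frr_at_far(genuine, impostor, far_target=0.01):
    """The symmetric operating point: FRR at the threshold giving the requested FAR."""
    thr = np.quantile(np.asarray(impostor), 1.0 - far_target)
    frr = np.mean(np.asarray(genuine) < thr)
    return frr, thr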

Table 5
Error rates for the VQ-UBM and GMM-UBM for two application scenarios. Here d = feature dimensionality, K = model order, rep.EM = repeated expectation-maximization (EM), heur.EM = split + 7 K-means initialization, followed by two EM iterations.

                             NIST 1999 (d = 12, K = 256)    NIST 2006 (d = 36, K = 2048)
Application scenario         VQ-UBM     GMM-UBM             VQ-UBM     GMM-UBM     GMM-UBM
                             (RS)       (rep.EM)            (RS)       (heur.EM)   (rep.EM)
EER                          9.75       10.76               8.94       9.97        9.55
User-convenient application
  FAR @ FRR = 1%             85.56*     80.82               67.51*     99.42       52.36
  FAR @ FRR = 2%             74.64      74.87               41.01*     48.97       32.82
  FAR @ FRR = 5%             41.65*     36.41               18.09*     21.14       17.16
  FAR @ FRR = 10%            9.69*      12.84               7.73*      9.96        9.22
Security application
  FRR @ FAR = 1%             21.71*     30.98               30.75*     42.17       48.20
  FRR @ FAR = 2%             17.07*     20.59               23.30*     31.91       32.62
  FRR @ FAR = 5%             12.99      14.10               13.69*     18.20       17.61
  FRR @ FAR = 10%            9.65       11.32               8.29*      9.97        9.18

* Significantly different from GMM-UBM (rep.EM), at the confidence level of 95%, as evaluated using McNemar's test.

McNemar's significance test at the 95% confidence level (Huang et al., 2001; Leeuwen et al., 2006) was performed at each operating point between GMM-UBM (repeated EM) and VQ-UBM (RS). We measure the difference in the decisions on the impostor trials in the case of the user-convenience application (FAR @ FRR = 1-10%), the difference in the genuine trials in the case of the security application (FRR @ FAR = 1-10%), and all trials at the EER operating point. Statistically significant differences are indicated by an asterisk (*) in Table 5. On NIST 2006, GMM-UBM outperforms VQ-UBM in the user-convenience scenario, whereas the situation is reversed in the security scenario. On NIST 1999, the same conclusion holds except in a few cases. In general, the differences are smaller near the EER operating point, and significant only in the impostor trials but not in the genuine trials. The reason why the differences are not always significant for the genuine trials on NIST 1999 is the much smaller number of genuine trials in comparison to NIST 2006.
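A sketch of the test in its usual paired formulation (Huang et al., 2001): only the trials on which the two systems disagree contribute, and the continuity-corrected statistic is compared against the 95% critical value of the chi-square distribution with one degree of freedom. The boolean decision encoding below is an assumption for illustration.

import numpy as np

def mcnemar(correct_a, correct_b):
    """McNemar's test on paired binary decisions.

    correct_a, correct_b: boolean arrays, True where system A (resp. B)
    decided the trial correctly. Returns the test statistic and whether
    the difference is significant at the 95% level (chi-square, 1 dof).
    """
    a, b = np.asarray(correct_a), np.asarray(correct_b)
    n01 = np.sum(a & ~b)       # A correct, B wrong
    n10 = np.sum(~a & b)       # A wrong, B correct
    if n01 + n10 == 0:
        return 0.0, False      # the systems never disagree
    stat = (abs(int(n01) - int(n10)) - 1) ** 2 / (n01 + n10)   # continuity correction
    return stat, stat > 3.841  # 95% critical value of chi-square(1)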

A full comparison of all clustering methods on the NIST 2006 corpus is shown in Fig. 6. Firstly, all the methods produce significantly better results than random clustering. Secondly, the VQ and GMM methods are equal in terms of EER. Generally, the VQ methods outperform GMM at small FAR levels (security application), whereas GMM outperforms VQ at small FRR levels (user-convenience applications) when the model order is increased to K = 1024 or 2048. Comparing the VQ methods, the recognition accuracies are very close to each other, in particular for models with a large number of clusters. For a smaller number of clusters (K = 16), there are some differences: SOM gives significantly poorer accuracy than the other methods, and the hierarchical methods, PNN and Split, are also slightly poorer. This is of little practical significance, however, since larger codebook sizes are expected to be used in most applications. It must also be noted that, although SOM works reasonably well in these tests, the parameter tuning was a crucial step, which makes it an unfavorable method.

An interesting question is whether better clustering quality correlates with increased recognition accuracy. To answer this, we present the mean square error (MSE) of the background model on the NIST 2006 corpus against the three different error metrics used in Table 5. The results for model order K = 1024 in Fig. 7 suggest that there exists a weak correlation: the PNN, which yields the highest MSE among the tested methods, also yields slightly poorer recognition accuracy. However, the rest of the methods are so close to each other in clustering quality that the differences in recognition accuracy cannot originate from a better clustering algorithm.
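For reference, the MSE used as the quality measure is the average distortion of the data with respect to the nearest code vector; a minimal sketch follows (whether one additionally normalizes by the dimensionality is a convention we leave aside here).

import numpy as np

def mse(X, C):
    """Mean square error of data X (N, d) quantized onto codebook C (K, d)."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # squared distances
    return d2.min(axis=1).mean()                             # distortion to nearest code vector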

4.2. Processing time

In the following examples, we study the computational efficiency of the clustering methods. In the case of iterative algorithms, the processing time increases with the size of the model. In Fig. 8, this can be seen most clearly for K-means and GMM. On the other hand, we use the reduced-search variant of K-means (Kaukoranta et al., 2000), also in RS, SM and GA; it exploits the fact that only a small portion of the code vectors changes during each iteration. In the case of large codebooks, this proportion becomes smaller and smaller, which keeps the growth of the processing time with the codebook size rather modest.
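The reduced-search idea of Kaukoranta et al. (2000), as we read it, can be sketched as follows: code vectors that did not move ("static") cannot change the stored distances, so a full nearest-neighbor search is needed only for vectors whose current code vector moved, while the remaining vectors are checked against the moved ("active") code vectors only. This is an illustrative reconstruction, not the authors' exact implementation.

import numpy as np

def kmeans_reduced_search(X, C, max_iters=50):
    """K-means with code vector activity detection (reduced search)."""
    N, K = len(X), len(C)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    best = d2[np.arange(N), labels]          # distance to current code vector
    for _ in range(max_iters):
        newC = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                         else C[k] for k in range(K)])
        active = np.flatnonzero(np.any(newC != C, axis=1))
        C = newC
        if active.size == 0:                 # converged: nothing moved
            break
        moved = np.isin(labels, active)
        # Full search only for vectors whose own code vector moved.
        idx = np.flatnonzero(moved)
        if idx.size:
            d2 = ((X[idx, None, :] - C[None, :, :]) ** 2).sum(axis=2)
            labels[idx] = d2.argmin(axis=1)
            best[idx] = d2.min(axis=1)
        # Vectors in static clusters: only the active code vectors can win.
        idx = np.flatnonzero(~moved)
        if idx.size:
            d2a = ((X[idx, None, :] - C[active][None, :, :]) ** 2).sum(axis=2)
            j = d2a.argmin(axis=1)
            m = d2a[np.arange(idx.size), j]
            win = m < best[idx]
            labels[idx[win]] = active[j[win]]
            best[idx[win]] = m[win]
    return C, labels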

In the case of hierarchical algorithms, the processing time depends mainly on the direction of the process. In the divisive approach (SPLIT), the algorithm proceeds from K = 1 towards K = N, and therefore, the processing time increases as a function of the codebook size. In the merge-based approach (PNN), the situation is the opposite, as it proceeds from K = N down to K = 1. However, most of the time is spent in the early stage of the process when there is a large number of code vectors, and therefore, the increase at the smaller sizes no longer shows in Fig. 8.

The dimensionality of the feature space affects the SPLIT and SM methods because they need to calculate the principal axis at every splitting stage. These steps have a quadratic, O(d²), time dependency on the dimensionality, and thus, the methods become somewhat slower when the dimensionality increases. All other methods depend linearly on the dimensionality and the size of the data, except PNN, which has a quadratic, O(N²), dependency on the size of the data.
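The principal axis needed by the splitting step can be obtained without a full eigendecomposition by power iteration on the (implicit) covariance matrix; a minimal sketch, with a fixed iteration count standing in for a proper convergence test:

import numpy as np

def principal_axis(X, iters=30, seed=0):
    """Dominant eigenvector of the covariance of X via the power method."""
    Xc = X - X.mean(axis=0)                    # center the cluster
    v = np.random.default_rng(seed).standard_normal(X.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = Xc.T @ (Xc @ v)                    # covariance-vector product
        v = w / np.linalg.norm(w)
    return v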

In summary, the SPLIT method is the fastest of the tested methods. Among the other methods, RS and repeated K-means are somewhat slow for the highest codebook sizes, but both of them were tuned here for maximal quality instead of speed.

5. Implementation considerations

A somewhat less appreciated property of an algorithm is the human effort needed to make the algorithm work in practice. There are two viewpoints: (1) the programmer's viewpoint: complexity of the implementation, and (2) the user's viewpoint: the effort required to tune the parameters for a certain corpus. In general, these two factors are difficult to measure quantitatively given the varying skills of programmers and users. To give a rough indication of the first factor, we estimate the number of functions and the program code length. Program code length was estimated as the number of lines in the source code, and the file size of the binary code. All programs were compiled using GCC version 4.1.1, without code optimization and debug information. To give an indication of the user effort, we count the number of control parameters.

The numbers in Table 6 match our own experience in implementing these algorithms. All the programs (except FCM and GMM) consist of the same data structures for the feature vectors, codebooks and partition, and the same functions for distance calculations. In addition to these, K-means includes four functions to generate the partition and the centroids, to find the nearest code vector, and to sum up the distortion value. The reduced-search variant includes one more. In addition to these, RS includes the swap and selection procedures. Thus, their corresponding code sizes are almost the same. In our opinion, these two algorithms are the simplest ones to implement.

Fig. 6. Recognition results for the NIST 2006 SRE corpus. (DET curves for model orders K = 16-2048; EERs (%) per method:
K = 16: Random 22.65, K-means 12.82, SOM 16.45, PNN 13.78, Split 13.96, SM 13.24, RS 13.10, FCM 12.77, GA 13.20, GMM 13.53.
K = 64: Random 22.69, K-means 10.85, SOM 11.50, PNN 11.36, Split 10.84, SM 11.18, RS 10.97, FCM 11.10, GA 10.82, GMM 11.44.
K = 256: Random 21.06, K-means 9.87, SOM 9.81, PNN 9.91, Split 9.81, SM 9.89, RS 9.67, FCM 9.89, GA 9.71, GMM 10.32.
K = 512: Random 20.43, K-means 9.67, SOM 9.55, PNN 9.67, Split 9.53, SM 9.62, RS 9.65, FCM 9.68, GA 9.55, GMM 9.61.
K = 1024: Random 20.88, K-means 9.26, SOM 9.19, PNN 9.61, Split 9.34, SM 9.28, RS 9.32, FCM 9.42, GA 9.36, GMM 9.08.
K = 2048: Random 20.58, K-means 9.05, SOM 9.14, PNN 9.78, Split 9.30, SM 9.00, RS 8.94, FCM 9.26, GA 9.40, GMM 9.55.)

The SOM, PNN and FCM algorithms have somewhat longer codes, but their implementations are also quite straightforward. The SPLIT, however, is significantly more complex to implement, which is partly reflected in its code size as well. The algorithm includes functions for calculating the principal component (power method), finding the optimal dividing hyperplane, projection, sorting the data vectors, a binary tree structure for efficient selection of the next cluster to split, a procedure for re-partitioning, and several smaller routines.

Split-and-merge is basically a combination of the PNN and SPLIT codes, added with a main routine coordinating between these two steps. However, additional difficulties arise from the fact that updating one of the data structures, either by a split or a merge, will influence the other data structure. As the success of the algorithm requires that both of these steps are implemented accurately, it is far from trivial to make the algorithm work in practice. According to our experience, we cannot really recommend this method for practitioners due to its complex implementation.

Fig. 7. Clustering quality versus recognition accuracy for model order K = 1024. Here quality is measured as the mean square error (MSE) of the (male) universal background model. (Three panels plot the UBM MSE of K-means, SOM, PNN, Split, SM, RS, FCM and GA against the equal error rate, against FAR @ FRR = 1.0% (user-convenient application), and against FRR @ FAR = 1.0% (security application).)

Fig. 8. Running times of the model training for the NIST 1999 (left) and NIST 2006 (right) corpora. The dimensionalities of the corresponding feature spaces are d = 12 and 36. (Processing times of FCM and GA for NIST 1999 are missing due to historical reasons.) (Log-scale CPU time in seconds as a function of model size, 2-256 for NIST 1999 and 2-2048 for NIST 2006, for Random, K-means, SOM, PNN, SPLIT, RLS, SM, FCM, GA and GMM.)

Table 6
Measures estimating the difficulty of implementation.

Method             Control parameters   Functions   Lines in source code   Binary code size (bytes)
Random             0                    2           26                     64,448
Repeated K-means   1                    5           162                    71,821
RS                 1                    7           226                    70,690
SOM                3                    7           252                    95,448
PNN                0                    12          317                    107,530
SPLIT              0                    22          947                    101,778
SM                 1                    38          1381                   131,834
GA                 2                    21          573                    105,956
FCM*               2                    11          295                    128,735
GMM*               2                    44          1111                   87,535

* C++ implementation; the other methods are in ANSI C.


GA is a combination of the PNN and K-means algorithms, including a few additional steps in the crossover stage, handling a set of solutions instead of only one, and implementing the selection step. The parameters are rather straightforward to set according to previous recommendations, and the method works rather robustly from one data set to another. It is worth considering if top performance and the extra (usually insignificant) quality increase in comparison to, say, RS or PNN, is desired.

Robust implementation of the EM algorithm for GMM requires knowledge of linear algebra and of issues related to numerical precision. First of all, in our experience, double precision should always be used with the EM algorithm. In contrast, standard K-means can be implemented even with fixed-point arithmetic on a hand-held device (Saastamoinen et al., 2005). However, in GMM, the covariance matrices can become singular (non-invertible). The common options are either to limit the variance or to set the component weight to zero, thus effectively decreasing the model size. Another option is to relocate the component elsewhere with a new covariance matrix, or to copy the component from the previous iteration of the EM algorithm. In summary, implementing GMM requires more care with numerics than VQ.
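A sketch of the variance-limiting safeguard in an M-step for diagonal covariances follows; the floor value and the collapse threshold are illustrative, and in practice the floor is often tied to the global variance of the training data.

import numpy as np

def m_step_with_floor(X, post, global_var, floor_factor=0.01, min_count=1e-8):
    """M-step of diagonal-covariance EM with variance flooring (sketch).

    post: (N, K) responsibilities from the E-step. Components whose soft
    count collapses are flagged so the caller can zero their weight,
    re-seed them, or copy them from the previous iteration.
    """
    n_k = post.sum(axis=0)
    dead = n_k < min_count                              # degenerate components
    n_safe = np.maximum(n_k, min_count)
    w = n_k / len(X)
    mu = (post.T @ X) / n_safe[:, None]
    var = (post.T @ X ** 2) / n_safe[:, None] - mu ** 2
    var = np.maximum(var, floor_factor * global_var)    # avoid singular covariances
    return w, mu, var, dead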


The source codes of the clustering algorithms used in this paper have been made publicly available at http://cs.joensuu.fi/sipu/clustering/.

6. Conclusions

We have presented an extensive comparison of clustering methods in a demanding pattern recognition task including highly noisy telephony speech data. Our main conclusion is that the most important parameter is the order of the model, whereas the choice of the clustering algorithm is less important. It is therefore enough that the data is modeled by any reasonably good clustering algorithm, as long as the size of the codebook or the number of Gaussians is sufficiently large.

We found the choice of the algorithm to be critical only if a very small model size is used. However, the result of random clustering indicated that the error rate increases significantly if no clustering is done at all. We therefore conclude that some clustering algorithm is needed and random sub-sampling is not enough. Regarding the choice of the algorithm, for practitioners we recommend the random swap (RS) algorithm because of its simple implementation and robust performance in all test conditions. If the running time is critical, we recommend the SPLIT algorithm even though its implementation is more complex.

In the current study, all VQ algorithms were optimized for the same squared-error cost function. This explains the rather similar results obtained with the different VQ variants. For the same reason, since GMM is based on a different objective function, the differences between GMM and VQ types of models tend to be generally larger. In two recent independent studies (Hanilci and Ertas, 2011; Brew and Cunningham, 2010), differences between these two clustering models were reported for different distance functions (Hanilci and Ertas, 2011) and in an SVM back-end setting (Brew and Cunningham, 2010). We conclude that the training methodology and data selection for the UBM (Hasan and Hansen, in press) are worth re-addressing.

According to our tests, GMM-UBM works better for user-convenience applications where false rejections must be minimized; however, the order was reversed in favor of VQ-UBM when small false acceptance rates were desired. The observations were similar for both the 12-dimensional MFCC features (NIST 1999 corpus) and the 36-dimensional MFCC + Δ + ΔΔ features (NIST 2006). Differences in the EER region, on the other hand, were not found statistically significant. Interestingly, similar observations have recently been made in another study on NIST 2001; see Fig. 14 in Hanilci and Ertas (2011).

Regarding clustering quality, the results are consistent with the comparisons made with image data (Fränti and Virmajoki, 2006). We expect the results to generalize to other variations of spectral features as well, and, to some extent, to other pattern recognition applications.

Acknowledgements

The work of T. Kinnunen and P. Fränti was supported by the Institute for Infocomm Research (I2R) and Nanyang Technological University (NTU) in Singapore, and by the Academy of Finland. The work of I. Sidoroff and M. Tuononen was supported by TEKES under the 4-year PUMS program. The authors would like to thank Mrs. Amanda Hajnal for carrying out spell-checking on the manuscript.

References

Auckenthaler, R., Carey, M., Lloyd-Thomas, H., 2000. Score normalization for text-independent speaker verification systems. Digital Signal Process. 10, 42–54.
Bagci, U., Erzin, E., 2007. Automatic classification of musical genres using inter-gender similarity. IEEE Signal Process. Lett. 14 (8), 521–524.
Bezdek, J., Pal, N., 1998. Some new indices of cluster validity. IEEE Trans. Systems Man Cybernet. Part B 28 (3), 301–315.
Bimbot, F., Bonastre, J.-F., Fredouille, C., Gravier, G., Magrin-Chagnolleau, I., Meignier, S., Merlin, T., Ortega-Garcia, J., Petrovska-Delacretaz, D., Reynolds, D.A., 2004. A tutorial on text-independent speaker verification. EURASIP J. Appl. Signal Process. 4, 430–451.
Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Springer.
Brew, A., Cunningham, P., 2010. Vector quantization mappings for speaker verification. In: Proc. 20th Internat. Conf. on Pattern Recognition (ICPR 2010), pp. 560–564.
Burget, L., Matejka, P., Schwarz, P., Glembek, O., Cernocky, J.H., 2007. Analysis of feature extraction and channel compensation in a GMM speaker recognition system. IEEE Trans. Audio Speech Lang. Process. 15 (7), 1979–1986.
Burton, D., 1987. Text-dependent speaker verification using vector quantization source coding. IEEE Trans. Acoust. Speech Signal Process. 35 (2), 133–143.
Campbell, J., 1997. Speaker recognition: A tutorial. Proc. IEEE 85 (9), 1437–1462.
Campbell, W.M., Campbell, J.P., Reynolds, D.A., Singer, E., Torres-Carrasquillo, P.A., 2006a. Support vector machines for speaker and language recognition. Comput. Speech Lang. 20 (2–3), 210–229.
Campbell, W.M., Sturim, D.E., Reynolds, D.A., 2006b. Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett. 13 (5), 308–311.
Davies, D.L., Bouldin, D.W., 1979. A cluster separation measure. IEEE Trans. Pattern Anal. Machine Intell. 1 (2), 224–227.
Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., Ouellet, P., 2011. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19 (4), 788–798.
Duda, R.O., Hart, P.E., 1973. Pattern Classification and Scene Analysis. John Wiley and Sons, New York.
Dunn, J.C., 1974. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J. Cybernet. 3 (3), 32–57.
Equitz, W.H., 1989. A new vector quantization clustering algorithm. IEEE Trans. Acoust. Speech Signal Process. 37 (10), 1568–1575.
Farrell, K.R., Mammone, R.J., Assaleh, K.T., 1994. Speaker recognition using neural networks and conventional classifiers. IEEE Trans. Speech Audio Process. 2 (1), 194–205.
Fränti, P., Kivijärvi, J., Kaukoranta, T., Nevalainen, O., 1997a. Genetic algorithms for large scale clustering problems. Comput. J. 40 (9), 547–554.
Fränti, P., Kaukoranta, T., Nevalainen, O., 1997b. On the splitting method for vector quantization codebook generation. Opt. Eng. 36 (11), 3043–3051.
Fränti, P., 1999. On the usefulness of self-organizing maps for the clustering problem in vector quantization. In: Proc. 11th Scandinavian Conf. on Image Analysis (SCIA 1999), Kangerlussuaq, Greenland, pp. 415–422.
Fränti, P., 2000. Genetic algorithm with deterministic crossover for vector quantization. Pattern Recognition Lett. 21 (1), 61–68.
Fränti, P., Kivijärvi, J., 2000. Randomized local search algorithm for the clustering problem. Pattern Anal. Appl. 3 (4), 358–369.
Fränti, P., Kaukoranta, T., Shen, D.-F., Chang, K.-S., 2000. Fast and memory efficient implementation of the exact PNN. IEEE Trans. Image Process. 9 (5), 773–777.
Fränti, P., Virmajoki, O., 2006. Iterative shrinking method for clustering problems. Pattern Recognition 39 (5), 761–765.
Fränti, P., Virmajoki, O., Hautamäki, V., 2008. Probabilistic clustering by random swap algorithm. In: Proc. IAPR Internat. Conf. on Pattern Recognition (ICPR 2008), Tampa, Florida, USA.
Garey, M.R., Johnson, D.S., Witsenhausen, H.S., 1982. The complexity of the generalized Lloyd–Max problem. IEEE Trans. Inform. Theory 28 (2), 255–256.
Geva, A.B., Steinber, Y., Bruckmair, S., Nahum, G., 2000. A comparison of cluster validity criteria for a mixture of normal distributed data. Pattern Recognition Lett. 21, 511–529.
Hanilci, C., Ertas, F., 2011. Comparison of the impact of some Minkowski metrics on VQ/GMM based speaker recognition. Comput. Electr. Eng. 37, 41–56.
Hasan, T., Hansen, J.H.L., 2011. A study on universal background model training in speaker verification. IEEE Trans. Audio Speech Lang. Process., in press.
Hautamäki, V., Tuononen, M., Niemi-Laitinen, T., Fränti, P., 2007. Improving speaker verification by periodicity based voice activity detection. In: Proc. 12th Internat. Conf. on Speech and Computer (SPECOM 2007), vol. 2, Moscow, pp. 645–650.
Hautamäki, V., Kinnunen, T., Kärkkäinen, I., Tuononen, M., Saastamoinen, J., Fränti, P., 2008. Maximum a posteriori estimation of centroid model parameters for speaker verification. IEEE Signal Process. Lett. 15, 162–165.
He, J., Liu, L., Palm, G., 1999. A discriminative training algorithm for VQ-based speaker identification. IEEE Trans. Speech Audio Process. 7 (3), 353–356.
Hermansky, H., Morgan, N., 1994. RASTA processing of speech. IEEE Trans. Speech Audio Process. 2 (4), 578–589.
Huang, X., Acero, A., Hon, H.-W., 2001. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice-Hall, New Jersey.
Ito, P.K., 1980. Robustness of ANOVA and MANOVA test procedures. In: Krishnaiah, P.R. (Ed.), Handbook of Statistics 1: Analysis of Variance. North-Holland Publishing Company, pp. 199–236.
Jain, A.K., Murty, M.N., Flynn, P.J., 1999. Data clustering: A review. ACM Comput. Surv. 31 (3), 264–323.
Jain, A.K., 2010. Data clustering: 50 years beyond K-means. Pattern Recognition Lett. 31 (8), 651–666.
Juang, Y.-T., Huang, K.-C., Ding, I.-J., 2003. Speaker adaptation based on MAP estimation using fuzzy controller. Pattern Recognition Lett. 24, 2807–2813.
Kaukoranta, T., Fränti, P., Nevalainen, O., 1998. Iterative split-and-merge algorithm for VQ codebook generation. Opt. Eng. 37 (10), 2726–2732.
Kaukoranta, T., Fränti, P., Nevalainen, O., 2000. A fast exact GLA based on code vector activity detection. IEEE Trans. Image Process. 9 (8), 1337–1342.
Kenny, P., Ouellet, P., Dehak, N., Gupta, V., Dumouchel, P., 2008. A study of inter-speaker variability in speaker verification. IEEE Trans. Audio Speech Lang. Process. 16 (5), 980–988.
Kinnunen, T., Li, H., 2010. An overview of text-independent speaker recognition: From features to supervectors. Speech Commun. 52 (1), 12–40.
Kinnunen, T., Kilpeläinen, T., Fränti, P., 2000. Comparison of clustering algorithms in speaker identification. In: Proc. Internat. Conf. on Signal Processing and Communications (SPC 2000), Marbella, Spain, pp. 222–227.
Kinnunen, T., Kärkkäinen, I., Fränti, P., 2001. Is speech data clustered? – Statistical analysis of cepstral features. In: Proc. 7th European Conf. on Speech Communication and Technology (Eurospeech 2001), vol. 4, Aalborg, Denmark, pp. 2627–2630.
Kinnunen, T., Karpov, E., Fränti, P., 2006. Real-time speaker identification and verification. IEEE Trans. Audio Speech Lang. Process. 14 (1), 277–288.
Kinnunen, T., Saastamoinen, J., Hautamäki, V., Vinni, M., Fränti, P., 2009. Comparative evaluation of maximum a posteriori vector quantization and Gaussian mixture models in speaker verification. Pattern Recognition Lett. 30 (4), 341–347.
Kolano, G., Regel-Brietzmann, P., 1999. Combination of vector quantization and Gaussian mixture models for speaker verification. In: Proc. 6th European Conf. on Speech Communication and Technology (Eurospeech 1999), Budapest, Hungary, pp. 1203–1206.
Kärkkäinen, I., Fränti, P., 2002. Stepwise algorithm for finding unknown number of clusters. In: Proc. Advanced Concepts for Intelligent Vision Systems (ACIVS 2002), Gent, Belgium, pp. 136–143.
Lee, K.A., You, C., Li, H., Kinnunen, T., Zhu, D., 2008. Characterizing speech utterances for speaker verification with sequence kernel SVM. In: Proc. Interspeech 2008, Brisbane, Australia, pp. 1397–1400.
Lei, Z., Yang, Y., Wu, Z., 2005. Mixture of support vector machines for text-independent speaker recognition. In: Proc. 9th European Conf. on Speech Communication and Technology (Interspeech 2005), pp. 2041–2044.
Leeuwen, D.A.v., Martin, A.F., Przybocki, M.A., Bouten, J.S., 2006. NIST and NFI-TNO evaluations of automatic speaker recognition. Comput. Speech Lang. 20, 128–158.
Linde, Y., Buzo, A., Gray, R.M., 1980. An algorithm for vector quantizer design. IEEE Trans. Commun. 28 (1), 84–95.
Linguistic Data Consortium. <http://www.ldc.upenn.edu/>.
Louradour, J., Daoudi, K., 2005. SVM speaker verification using a new sequence kernel. In: Proc. 13th European Conf. on Signal Processing (EUSIPCO 2005), Antalya, Turkey.
Martin, A., Doddington, G., Kamm, T., Ordowski, M., Przybocki, M., 1997. The DET curve in assessment of detection task performance. In: Proc. 5th European Conf. on Speech Communication and Technology (Eurospeech 1997), Rhodes, Greece, pp. 1895–1898.
Martin, A., Przybocki, M., 2000. The NIST 1999 speaker recognition evaluation – An overview. Digital Signal Process. 10, 1–18.
McLachlan, G., Peel, D., 2001. Finite Mixture Models. John Wiley & Sons, Brisbane.
McQueen, J.B., 1967. Some methods for classification and analysis of multivariate observations. In: Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297.
Meila, M., Heckerman, D., 2001. An experimental comparison of model-based clustering methods. Machine Learn. 42, 9–29.
Milligan, G.W., 1981. A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika 46 (2), 187–199.
Nasrabadi, N.M., Feng, Y., 1988. Vector quantization of images based upon the Kohonen self-organization feature maps. Neural Networks 1, 518.
NIST 2006 Speaker Recognition Evaluation Webpage, 2006. <http://www.itl.nist.gov/iad/mig/tests/sre/2006/index.html>.
Pelecanos, J., Myers, S., Sridharan, S., Chandran, V., 2000. Vector quantization based Gaussian modeling for speaker verification. In: Proc. 15th Internat. Conf. on Pattern Recognition (ICPR 2000), vol. 3, pp. 294–297.
Ramachandran, R.P., Farrell, K.R., Ramachandran, R., Mammone, R.J., 2002. Speaker recognition – General classifier approaches and data fusion methods. Pattern Recognition 35 (12), 2801–2821.
Reynolds, D.A., Rose, R.C., 1995. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. Speech Audio Process. 3 (1), 72–83.
Reynolds, D.A., Quatieri, T.F., Dunn, R.B., 2000. Speaker verification using adapted Gaussian mixture models. Digital Signal Process. 10 (1), 19–41.
Reynolds, D.A., Campbell, W., Gleason, T., Quillen, C., Sturim, D., Torres-Carrasquillo, P., Adami, A., 2005. The 2004 MIT Lincoln Laboratory speaker recognition system. In: Proc. Internat. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2005), vol. 1, pp. 177–180.
Roch, M., 2006. Gaussian-selection-based non-optimal search for speaker identification. Speech Commun. 48, 85–95.
Saastamoinen, J., Karpov, E., Hautamäki, V., Fränti, P., 2005. Accuracy of MFCC based speaker recognition in Series 60 device. EURASIP J. Appl. Signal Process. 17, 2816–2827.
Sarkar, M., Yegnanarayana, B., Khemani, D., 1997. A clustering algorithm using an evolutionary programming-based approach. Pattern Recognition Lett. 18 (10), 975–986.
Singh, G., Panda, A., Bhattacharyya, S., Srikanthan, T., 2003. Vector quantization techniques for GMM based speaker verification. In: Proc. Internat. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2003), vol. 2, pp. 65–68.
Soong, F.K., Rosenberg, A.E., Juang, B.-H., Rabiner, L.R., 1987. A vector quantization approach to speaker recognition. AT&T Tech. J. 66, 14–26.
Stapert, R., Mason, J.S., 2001. Speaker recognition and the acoustic speech space. In: Proc. 2001: A Speaker Odyssey – The Speaker Recognition Workshop, pp. 195–199.
Steinbach, M., Karypis, G., Kumar, V., 2000. A comparison of document clustering techniques. In: Proc. KDD Workshop on Text Mining.
Theodoridis, S., Koutroumbas, K., 2009. Pattern Recognition, 4th ed. Elsevier Inc.
Tran, D.T., 2000. Fuzzy Approaches to Speech and Speaker Recognition. Ph.D. Thesis, University of Canberra, Australia.
Tran, D., Le, T.V., Wagner, M., 1998. Fuzzy Gaussian mixture models for speaker recognition. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP 1998), paper 0798, Sydney, Australia.
Tran, D., Wagner, M., 2002. Fuzzy C-means clustering-based speaker verification. In: Proc. Advances in Soft Computing (AFSS 2002), Calcutta, India, pp. 318–324.
Um, I.-T., Ra, J.-H., Kim, M.-H., 2000. Comparison of clustering methods for MLP-based speaker verification. In: Proc. 15th Internat. Conf. on Pattern Recognition (ICPR 2000), vol. 2, pp. 475–478.
Wu, D., Li, J., Wu, H., 2009. α-Gaussian mixture modelling for speaker recognition. Pattern Recognition Lett. 30, 589–594.
Yegnanarayana, B., Kishore, S.P., 2002. AANN: An alternative to GMM for pattern recognition. Neural Networks 15, 459–469.