
Text-Independent Speaker Verification

Cody A. Ray

Drexel University
[email protected]

Abstract—This paper provides an introduction to the task of speaker recognition, and describes a not-so-novel speaker recognition system based upon a minimum-distance classification scheme. We describe both the theory and practical details for a reference implementation. Furthermore, we discuss an advanced technique for classification based upon Gaussian Mixture Models (GMM). Finally, we discuss the results of a set of experiments performed using our reference implementation.

I. INTRODUCTION

The objective of this project was to develop a basic speaker recognition system to demonstrate an understanding of the subjects covered in the course Processing of the Human Voice. Speaker recognition systems can generally be classified as either identification or verification. In speaker identification, the challenge is to decide which voice model from a known set of voice models best characterizes a speaker. In the different task of speaker verification, the goal is to decide whether a speaker corresponds to a particular known voice or to some other unknown voice. In either case, the problem can be further divided into text-dependent and text-independent subproblems, depending on whether we may rely upon the same utterance being used for both training and testing. This loose classification scheme is shown in Figure 1. For the purpose of this project, we focused on the task of text-independent speaker verification.

II. TERMINOLOGY

• A background speaker is an imposter speaker.
• A claimant is a speaker known to the system who is correctly claiming his/her identity.
• A false negative is an error where a claimant is rejected as an imposter.
• A false positive is an error where an imposter is accepted as a claimant.
• Speaker identification decides which voice model from a known set of voice models best characterizes a speaker.
• Speaker verification decides whether a speaker corresponds to a particular known voice or some other unknown voice.
• A target speaker is a known speaker.

III. SYSTEM OVERVIEW

Speaker recognition systems must first build a model of the voice of each target speaker, as well as a model of a collection of background speakers, using speaker-dependent features extracted from the speech waveform. This is referred to as the training stage, and the associated speech data used to build the speaker model is called training data. During the recognition or testing stage, the features measured from the waveform of a test utterance, i.e., the test data of a speaker, are matched (in some sense) against speaker models obtained during training. An overview of the components of a general speaker recognition system is given in Figure 2.

Fig. 1. Speaker Recognition System Classification

As with any biometric (pattern) recognition system, the speaker recognition system consists of two core modules: feature extraction and feature matching. In Section IV, we provide an overview of various dimensions used for speaker analysis as well as describe the features we selected in more depth. Section V continues with mathematical techniques used in the matching process.

IV. FEATURE EXTRACTION

Although it is possible to extract a number of features from sampled audio, including both spectral and non-spectral features, for the purpose of this project we characterize a speaker’s voice attributes exclusively through spectral features. In selecting acoustic spectral features, we want our feature set to reflect the unique characteristics of a speaker. For this purpose, we use the magnitude component of the short-time Fourier transform (STFT) as a basis. As the phase is difficult to measure and susceptible to channel distortion, it is discarded.

Fig. 2. General Speaker Recognition System

We first compute the discrete STFT and then weight it by a series of filter frequency responses that roughly match those of the auditory critical band filters. To approximate the auditory critical bands, we use triangular mel-scale filter bands, as shown in Figure 3.

$$X(n, \omega_k) = \sum_{m=-\infty}^{\infty} x[m]\, w[n-m]\, e^{-j\omega_k m}$$
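To make the filter bank concrete, the following MATLAB sketch constructs an idealized triangular mel-scale filter bank. The mel mapping mel(f) = 2595 log10(1 + f/700) is a common convention; the sampling rate, FFT length, and filter count here are illustrative assumptions, not values taken from the reference implementation.

    % Sketch: idealized triangular mel-scale filter bank (illustrative values).
    fs   = 16000;                           % assumed sampling rate (Hz)
    nfft = 256;                             % FFT length (matches 256-sample frames)
    L    = 20;                              % assumed number of filters
    mel  = @(f) 2595*log10(1 + f/700);      % Hz -> mel
    imel = @(m) 700*(10.^(m/2595) - 1);     % mel -> Hz

    % Filter edge frequencies, uniformly spaced on the mel scale
    edges = imel(linspace(mel(0), mel(fs/2), L+2));
    bins  = floor((nfft+1)*edges/fs);       % map edge frequencies to FFT bins

    V = zeros(L, nfft/2+1);                 % V(l,k+1) ~ V_l(w_k)
    for l = 1:L
        for k = bins(l):bins(l+1)           % rising edge of filter l
            V(l,k+1) = (k - bins(l)) / (bins(l+1) - bins(l));
        end
        for k = bins(l+1):bins(l+2)         % falling edge of filter l
            V(l,k+1) = (bins(l+2) - k) / (bins(l+2) - bins(l+1));
        end
    end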

Next, we compute the energy in the mel-scale weighted STFT and normalize the energy in each frame so as to give equal energy for a flat input spectrum.

$$E_{\mathrm{mel}}(n, l) = \frac{1}{A_l} \sum_{k=L_l}^{U_l} \left| V_l(\omega_k)\, X(n, \omega_k) \right|^2,$$

where

$$A_l = \sum_{k=L_l}^{U_l} \left| V_l(\omega_k) \right|^2$$

Finally, we compute the mel-cepstrum for each frame using the even property of the real cepstrum to rewrite the inverse transform as the discrete cosine transform. We then extract the mel-frequency cepstrum coefficients (MFCC) for use as our feature vector.

$$C_{\mathrm{mel}}[n, m] = \frac{1}{R} \sum_{l=0}^{R-1} \log\{E_{\mathrm{mel}}(n, l)\} \cos\left(\frac{2\pi}{R}\, lm\right)$$

We repeat this feature extraction process for both training and testing data to produce $C^{tr}_{\mathrm{mel}}$ and $C^{ts}_{\mathrm{mel}}$, respectively.
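As a worked sketch of the last two steps, the fragment below computes the normalized mel-scale energies for a single frame and takes the cosine transform of their logarithms. Here V is a filter bank matrix as sketched above and X is the complex STFT of one frame; all variable names are illustrative.

    % Sketch: normalized mel energies and mel-cepstrum for one frame.
    A    = sum(abs(V).^2, 2);                 % A_l = sum_k |V_l(w_k)|^2
    Emel = (abs(V).^2 * abs(X(:)).^2) ./ A;   % E_mel(n,l) for l = 1..R
    R    = numel(Emel);
    idx  = 0:R-1;
    % C_mel[n,m] = (1/R) sum_l log(E_mel(n,l)) cos(2*pi*l*m/R)
    Cmel = (1/R) * cos(2*pi/R * idx' * idx) * log(Emel);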

Fig. 3. Idealized Triangular Mel-Scale Filter Bank

V. FEATURE MATCHING

A survey of the literature reveals numerous approaches based upon minimum-distance classification, dynamic time-warping, vector quantization, hidden Markov models, Gaussian mixture models, or artificial neural networks. For this project, we chose to implement a minimum-distance classifier and, as time permitted, to improve upon this with Gaussian-mixture-model-based matching.

A. Minimum-Distance Classification

The concept of minimum-distance classification is simple: we calculate a feature vector for each new test case and measure how far it is from the stored training data using some distance metric. We then select a threshold distance to determine the point at which we consider the speaker verification to have been successful, or equivalently, the speaker to have been recognized. As we will later see, this threshold determines the tradeoff between the number of false negatives and false positives.

Specifically, we compute a feature vector based upon the averages of the MFCCs for the test and training data sets.

$$\bar{C}^{tr}_{\mathrm{mel}}[n] = \frac{1}{M} \sum_{m=1}^{M} C^{tr}_{\mathrm{mel}}[mL, n]$$

$$\bar{C}^{ts}_{\mathrm{mel}}[n] = \frac{1}{M} \sum_{m=1}^{M} C^{ts}_{\mathrm{mel}}[mL, n]$$

As a distance measure, we then use the mean-squared difference between the average testing and training feature vectors.

$$D = \frac{1}{R-1} \sum_{n=1}^{R-1} \left( \bar{C}^{ts}_{\mathrm{mel}}[n] - \bar{C}^{tr}_{\mathrm{mel}}[n] \right)^2$$

If this distance is less than the given threshold, the speaker has been verified.
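A minimal MATLAB sketch of this classifier, assuming Ctr and Cts hold the per-frame MFCCs (one M x R matrix per utterance, with the 0th coefficient already excluded); the variable names are illustrative, and the threshold shown is the one selected empirically in Section VII:

    % Sketch: minimum-distance speaker verification.
    Ctr_avg = mean(Ctr, 1);            % average training MFCCs over all frames
    Cts_avg = mean(Cts, 1);            % average testing MFCCs over all frames
    D = mean((Cts_avg - Ctr_avg).^2);  % mean-squared difference
    theta = 0.12;                      % empirically selected threshold (Sec. VII)
    verified = (D < theta);            % accept iff distance is below threshold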

B. Gaussian Mixture Models

Recognizing that speech production is inherently non-deterministic (due to subtle variations in vocal tract shape and glottal flow), we represent a speaker probabilistically through a multivariate Gaussian probability density function (pdf). This is a multi-dimensional structure in which we can think of each statistical variable as a state corresponding to a single acoustic sound class, whether at a broad level, such as quasi-periodic, noise-like, and impulse-like, or at a very fine level, such as individual phonemes. The Gaussian pdf of a feature vector x for the ith state is written as

$$b_i(x) = \frac{1}{(2\pi)^{R/2} |\Sigma_i|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)}$$

where µ_i is the state mean vector, Σ_i is the state covariance matrix, and R is the dimension of the feature vector.¹

¹Using MFCC-based feature vectors, this corresponds to the number of MFCC coefficients.


TABLE I
DISTANCES BETWEEN TRAINING AND TESTING DATA

            Test1   Test2   Test3   Test4   Test5   Test6   Test7   Test8
    Train1  0.1192  0.1945  0.2151  0.2184  0.5364  0.3823  0.4963  0.4538
    Train2  0.0724  0.0378  0.0406  0.0783  0.4035  0.3177  0.4125  0.3986
    Train3  0.1672  0.1311  0.1042  0.0969  0.3382  0.2121  0.3597  0.2847
    Train4  0.1482  0.1412  0.1363  0.0817  0.3268  0.3211  0.3154  0.3282
    Train5  0.1882  0.1928  0.2237  0.1466  0.1044  0.0709  0.1382  0.1299
    Train6  0.3012  0.3521  0.3208  0.3112  0.3023  0.0958  0.2755  0.2094
    Train7  0.2743  0.2973  0.3252  0.2517  0.1618  0.1318  0.0724  0.1427
    Train8  0.3589  0.3600  0.3381  0.2186  0.1976  0.1133  0.2487  0.0585

The speaker model λ then represents the set of GMM mean, covariance, and weight parameters.

$$\lambda = \{\mu_i, \Sigma_i, p_i\}$$

The probability of a feature vector being in any one of I states for a particular speaker model λ is represented by the union, or mixture, of different Gaussian pdfs:

$$p(x|\lambda) = \sum_{i=1}^{I} p_i\, b_i(x)$$

where $p_i$ are the component mixture weights and $b_i(x)$ are the mixture densities.
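As a sketch, evaluating this mixture density in MATLAB might look as follows, where mu (R x I), Sigma (R x R x I), and p (I x 1) hold the parameters of λ; the function name and parameter layout are our own illustration, not part of the GMM library used later.

    % Sketch: evaluate the GMM density p(x|lambda) for one feature vector x.
    function px = gmm_pdf(x, mu, Sigma, p)
        R  = numel(x);
        px = 0;
        for i = 1:numel(p)
            d  = x(:) - mu(:,i);
            bi = exp(-0.5 * (d' * (Sigma(:,:,i) \ d))) ...
                 / sqrt((2*pi)^R * det(Sigma(:,:,i)));  % Gaussian pdf b_i(x)
            px = px + p(i) * bi;                        % weighted mixture sum
        end
    end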

In speaker verification, we must decide whether a test utterance belongs to the target speaker. Formally, we are making a binary decision (yes/no) on two hypotheses: whether the test utterance belongs to the target speaker, hypothesis H1, or whether it comes from an imposter, hypothesis H2.

Suppose that we have already computed the GMM of the target speaker and the GMM for a collection of background speakers²; we then use a likelihood ratio test to decide between H1 and H2. This test is the ratio between the probability that the collection of feature vectors $X = \{x_0, x_1, \ldots, x_{M-1}\}$ is from the claimed speaker, $P(\lambda_C|X)$, and the probability that the collection of feature vectors X is not from the claimed speaker, $P(\bar{\lambda}_C|X)$, i.e., from the background. Using Bayes’ rule, we can write this ratio as

$$\frac{P(\lambda_C|X)}{P(\bar{\lambda}_C|X)} = \frac{p(X|\lambda_C)\, P(\lambda_C)/P(X)}{p(X|\bar{\lambda}_C)\, P(\bar{\lambda}_C)/P(X)}$$

where P(X) denotes the probability of the vector stream X. Discarding the constant probability terms and applying the logarithm, we have the log-likelihood ratio

$$\Lambda(X) = \log p(X|\lambda_C) - \log p(X|\bar{\lambda}_C)$$

that we compare with a threshold to accept or reject whether the utterance belongs to the claimed speaker. If Λ(X) ≥ θ, then the speaker has been verified.

²One common model for generating a background pdf $p(X|\bar{\lambda}_C)$ is through models of a variety of background (imposter) speakers. In our development, we used speakers from the TIMIT database for this purpose.
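Assuming frame-to-frame independence (the usual assumption in GMM scoring, though not stated above), log p(X|λ) is a sum of per-frame log densities and the test reduces to a few lines. This sketch reuses the hypothetical gmm_pdf above; X holds the feature vectors as columns, and the two parameter sets and theta are placeholders.

    % Sketch: log-likelihood ratio test for speaker verification.
    logLc = 0;  logLb = 0;
    for m = 1:size(X, 2)
        logLc = logLc + log(gmm_pdf(X(:,m), muC, SigmaC, pC));  % claimant model
        logLb = logLb + log(gmm_pdf(X(:,m), muB, SigmaB, pB));  % background model
    end
    Lambda = logLc - logLb;        % log-likelihood ratio Lambda(X)
    verified = (Lambda >= theta);  % accept the claimed identity iff above threshold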

VI. IMPLEMENTATION

The reference speaker recognition system was implemented in MATLAB using training data and test data stored in WAV files. There are tools included in MATLAB and publicly-available libraries to aid in creating this system. For reading in the data sets, we used MATLAB’s wavread function. For feature extraction, we used the melcepst function from Voicebox, a MATLAB toolbox. We used twelve MFCC coefficients (skipping the 0th-order coefficient), computed with a 256-sample Hamming window and a 128-sample frame increment. We used custom matching and testing routines based upon minimum-distance classification as described above. For the Gaussian Mixture Models, we used T. N. Vikram’s GMM library, based upon the text Algorithm Collections for Digital Signal Processing Applications Using Matlab by E. S. Gopi.
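Concretely, the feature extraction step might look like the sketch below. The file name is a placeholder; in Voicebox’s melcepst, the mode string 'M' selects a Hamming window, and the filter-bank size shown is the toolbox’s documented default, passed explicitly only because the frame parameters follow it in the argument list.

    % Sketch: MFCC extraction with MATLAB + Voicebox, as configured above.
    [s, fs] = wavread('train1.wav');   % hypothetical training utterance
    nfilt = floor(3*log(fs));          % Voicebox's default filter-bank size
    % 12 MFCCs (0th coefficient skipped by default), Hamming window ('M'),
    % 256-sample frames with a 128-sample increment:
    Ctr = melcepst(s, fs, 'M', 12, nfilt, 256, 128);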

VII. EXPERIMENTAL RESULTS

We performed a series of experiments to determine the accuracy of the reference implementation for text-independent speaker verification. We selected eight speakers from the TIMIT database, four males and four females, each saying two sentences. The sentences used for this experiment were “Don’t ask me to carry an oily rag like that” and “She had your dark suit in greasy wash water all year”. We used the sentence referring to an “oily rag” as training data, and the sentence referring to a “dark suit” as testing data.³

The results of the experiment are shown in Table I. This table gives the “distance” between every pairing of the training and testing data sets. For example, the value in cell (1, 3) (in row-major order) corresponds to the distance between speaker 1’s training data and speaker 3’s testing data. Note that we are trying to minimize the diagonal entries, that is, the distance between each individual speaker’s testing and training data, while maximizing the off-diagonal cells, in a text-independent manner.

For these data, we empirically selected an initial threshold value of 0.12 to avoid any false negatives. With this threshold, however, the system produced six false positives, for an accuracy of 91%.

³The allocation of one sentence for training and the other for testing does not affect the results of the experiment. Due to the symmetry of the distance metric used for computation, if we had switched the sentences used for training and testing, the results of Table I would simply be transposed.


Decrementing the threshold to 0.11 results in one false negative and five false positives, while maintaining an accuracy of 91%. For these data, further varying the threshold cannot increase the accuracy; however, it clearly provides a tradeoff between false negatives and false positives. Due to this tradeoff, the most desirable threshold is application-dependent and must be determined on a case-by-case basis.
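A short sweep over Table I makes this tradeoff explicit. In this sketch, D is the 8 x 8 distance matrix with training utterances on the rows; the diagonal holds the true-speaker trials and the off-diagonal cells the imposter trials.

    % Sketch: false negatives/positives from the distance matrix at a threshold.
    for theta = [0.12 0.11]
        FN  = sum(diag(D) >= theta);      % claimants rejected as imposters
        FP  = sum(D(~eye(8)) < theta);    % imposters accepted as claimants
        acc = 1 - (FN + FP)/numel(D);     % fraction of the 64 trials correct
        fprintf('theta = %.2f: FN = %d, FP = %d, accuracy = %.0f%%\n', ...
                theta, FN, FP, 100*acc);
    end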

It is informative to note that this is a challenging experiment for a text-independent speaker recognition system, given the small amount of data used for training as well as the presence of differing acoustic sound classes in the two sentences; e.g., the phoneme “sh” appears twice in the testing sentence but is absent from the training sentence. Note that even with these difficulties, the reference implementation performs decently; we expect this is due to the effect of averaging the mel-cepstral features. Furthermore, we expect that, contrary to conventional wisdom, the minimum-distance classifier might perform better than the GMM when confronted with differing acoustic classes in the testing and training data sets during text-independent speaker recognition.

As a final comment, note that from these results it is easy to see why an MFCC-based minimum-distance classifier system should never be used for text-independent authentication systems: there is no way to eliminate false positives while maintaining a high degree of accuracy!

VIII. FUTURE WORK

Although it wasn’t in the stated proposal for this project, we found the task of comparing the minimum-distance classifier with one based upon a Gaussian Mixture Model intriguing. We found that the minimum-distance classifier was easily implemented, and used the remaining time to explore the use of GMMs for speaker recognition. However, we did not find sufficient time to complete the implementation of the GMM-based speaker recognition system and repeat the experiment. We’ve integrated an existing GMM library into the project framework to compute the means, covariances, and weights of each state of the target speaker model; however, it remains to implement the log-likelihood ratio classifier. Given the close parallels between the remaining tasks and those already completed, i.e., threshold-based classification and multivariate Gaussian pdf computation, it should be simple to compute a background model and repeat the above experiment with the new GMM-based reference implementation. During that analysis, it would be insightful to compare the minimum-distance and GMM classification schemes, particularly focusing on performance with regard to the differing acoustic makeup of the test and training data for text-independent speaker recognition.

IX. CONCLUSION

We have shown that minimum-distance classification for text-independent speaker recognition performs moderately well, though there is obvious room for improvement. Through a simple experiment, we’ve clearly demonstrated the tradeoff between false negatives and false positives in selecting a threshold, and noted that the most desirable threshold is an application-dependent parameter. From our results, we’ve also hypothesized that the minimum-distance classifier might outperform a GMM classifier on acoustically diverse test and training data sets, though this remains to be seen.