

Speaker identification using utterances corresponding to speaker-specific-text

B. Bharathi, P. Vijayalakshmi and T. Nagarajan

SSN College of Engineering, Kalavakkam, Chennai, India
Email: bharathib, vijayalakshmip, [email protected]

Abstract—In speaker recognition tasks, the main reason for reduced accuracy is confusion between closely resembling speakers in the acoustic space. The conventional GMM-based modelling technique captures unique features along with features common to various classes. Further, it ignores knowledge of the phonetic content of the speech. To increase the discriminative power of the classifier, the system must be able to use only the unique features of a given speaker with respect to his/her acoustically closely resembling speaker. This paper proposes a technique to reduce confusion errors by finding speaker-specific phonemes and formulating a text, using the subset of phonemes that are unique, for the speaker identification task. Experiments have been conducted on a speaker identification task using speech data of 192 female speakers from the TIMIT corpus. The performance of the proposed system is compared with that of a conventional GMM-based technique, and a significant improvement is noted.

I. INTRODUCTION

Gaussian Mixture Modeling (GMM) and Hidden Markov Modeling (HMM) techniques have been successful in recognition tasks. Maximum Likelihood Estimation via the Expectation-Maximization algorithm can be used to estimate the model parameters efficiently. However, a major drawback in this type of modeling technique is that the modeling is done in isolation, i.e., the modeling technique, when modeling a class, does not consider the information from other classes. This may lead to poor models with parameters that are common to other classes, in addition to the unique parameters of a class. This increases the classification (confusion) error. Further, in a conventional GMM-based classifier, the performance is directly proportional to the duration of the test utterances, which is another major drawback.
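A per-speaker GMM classifier of the kind described above can be sketched as follows. This is a minimal illustration using scikit-learn's GaussianMixture (which fits parameters by EM); the feature matrices are synthetic stand-ins, not real speech features, and the speaker names are hypothetical:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-speaker feature matrices (frames x dims).
features = {
    "spk1": rng.normal(loc=0.0, scale=1.0, size=(500, 13)),
    "spk2": rng.normal(loc=3.0, scale=1.0, size=(500, 13)),
}

# Each class is modelled in isolation: the EM fit for one speaker
# never sees the other speakers' data.
models = {
    spk: GaussianMixture(n_components=4, covariance_type="diag",
                         random_state=0).fit(X)
    for spk, X in features.items()
}

def identify(test_frames):
    # Average per-frame log-likelihood under each speaker model;
    # the highest-scoring model wins.
    scores = {spk: gmm.score(test_frames) for spk, gmm in models.items()}
    return max(scores, key=scores.get)

print(identify(rng.normal(loc=3.0, scale=1.0, size=(100, 13))))  # prints "spk2"
```

Note that nothing in this fit discourages a component from landing on a region shared with another speaker, which is exactly the non-discriminative weakness the text describes.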

Better classification accuracy can be achieved if the training technique is able to capture the unique features of a class, i.e., the features that discriminate one class from another. One of the major reasons for the reduction in the accuracy of a speaker verification/identification task is the confusion between two closely resembling (in the acoustic sense) speakers. For this reason, such systems cannot reliably be used in places where tight security is required. Better classification can be achieved if the system can identify the unique features of a speaker when compared with his/her acoustically closely resembling speaker.

The discriminative power of a classifier can be increased mainly in two ways. In the first method, the classes are discriminated at the feature level itself, i.e., common features between two classes are identified and removed from the training data. In the second method, the classes are discriminated by adjusting the model parameters themselves. In [1], [2], [3], the use of GMMs for speaker identification was shown to provide good performance compared with several existing techniques. However, this kind of modeling is a non-discriminative way to build speaker models, as it does not consider the information from other classes. Further, the GMM-based techniques ignore the knowledge of the underlying phonetic content of the speech.

In [4], a segmental Generalized Probabilistic Descent (GPD) algorithm was used to estimate the model parameters of a class while considering the competing speakers. A Minimum Classification Error (MCE) approach for speaker verification is proposed in [5]. In this approach [5], all the competing speakers are used to evaluate the score of the anti-speaker, which is found to be effective. However, it is not practical for verification over a large population. In [6], discrimination among GMMs was introduced using the MCE criterion. In [7], [8], a Maximum Model Distance algorithm for HMMs (Hidden Markov Models) was used. In [9], a Maximum Model Distance algorithm for GMMs was described. This approach [9] tries to maximize the distance between each model and a set of competing speakers' models. In [10], a product of Gaussians was used to identify the most probable confusing features between two classes. The common features are then removed from the training data. By eliminating the confusing features, during testing, evidence is derived only from the features that are unique to a class.

In our proposed work, the classes are discriminated at the phoneme level, i.e., the phonemes of a speaker that are acoustically dissimilar when compared with his/her closely resembling speakers are derived. During testing, speaker-specific-text (words that contain acoustically dissimilar phonemes) is used, thereby increasing the classification accuracy.

The outline of this paper is as follows. The next section describes the theoretical background of our proposed technique. The experimental setup of the system is presented in Section III. Section IV deals with the performance analysis of the speaker identification task. Finally, Section V concludes the paper.

II. THEORETICAL BACKGROUND

Better classification accuracy can be achieved if the training technique is able to capture the unique features of a class, the features that discriminate one class from another. In [10], a discriminative GMM technique was proposed to equip a

Proceeding of the 2011 IEEE Students' Technology Symposium 14-16 January, 2011, IIT Kharagpur

TS11IMSP0P057 978-1-4244-8943-5/11/$26.00 ©2011 IEEE 171


TABLE I
SPEAKER IDENTIFICATION PERFORMANCE OF THE SYSTEM BASED ON DIFFERENT THRESHOLDS AND CONSTRAINTS (ADP - ACOUSTICALLY DISSIMILAR PHONEMES)

Case | Threshold | Constraints (No. of phonemes in the test utterance / No. of ADPs) | No. of speakers satisfying the constraints | No. of speakers recognized correctly | Identification accuracy
1    | ≥ 9       | 6 / ≥ 3 | 95 | 78 | 82.10%
2    | ≥ 10      | 6 / ≥ 3 | 52 | 44 | 84.61%
3    | ≥ 11      | 6 / ≥ 3 | 21 | 17 | 81%

classifier to capture the unique features of a class and to make decisions based on the unique features alone. During testing, feature vectors that are unique to a class are derived, thereby increasing the classification accuracy. One of the drawbacks is that, if the test utterance does not contain the unique features, the classification accuracy can be drastically reduced. Another drawback is that the unique features have to be identified from the test utterances during testing, which increases the computation time. If the speaker utters words that contain only the unique features, the computation time is reduced. Even though the unique feature vectors are known, one cannot expect/force a speaker to utter speech segments that contain these features alone. On the other hand, if the unique phoneme list is known a priori, one can formulate a text, to be uttered, using such phonemes alone.

In this proposed work, we investigate the effect of a subset of phonemes that are unique to a speaker in the acoustic sense on a speaker recognition task. The proposed technique involves three main steps:

1) Find the confusing speaker for each speaker.
2) Derive the set of acoustically dissimilar phonemes for each speaker when compared to his/her confusing speaker.
3) Test the system using utterances that have the maximum number of acoustically dissimilar phonemes.

The proposed technique is evaluated on a speaker identification task using the TIMIT speech corpus. The results are compared with the performance of a conventional GMM-based classifier.

III. EXPERIMENTAL SETUP

The TIMIT speech corpus is used for both training and testing. The TIMIT corpus has 6300 utterances from 630 speakers. Each speaker has 10 utterances, and each utterance is approximately 3 seconds in duration. For the current study, only the female speakers (192 in number) are considered, because the classification accuracy for female data is inferior to that for male data. For each speaker, among the ten sentences, the first 8 are used for training and the last 2 for testing. In the TIMIT corpus, the speech data is segmented at the phoneme level and the corresponding phonetic transcription is provided. The total number of training utterances is 1536 and the total number of test utterances is 384. For each speaker, a GMM with 64 mixture components has been trained, using Mel-frequency cepstral coefficients (13 static + 13 dynamic + 13 acceleration) as the features.

The training utterances of each speaker have been tested against all 192 speaker models. Based on the log-likelihoods, the two best-scoring models have been identified, and the second-best speaker is considered the closely resembling speaker. This process is repeated for all 192 speakers to derive a confusing-speaker list.
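The confusing-speaker derivation can be sketched as follows. This is a toy illustration: the three speaker names and the synthetic feature matrices are hypothetical, but the ranking logic mirrors the description above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Toy stand-ins: per-speaker training features; spkA and spkB are
# deliberately close in the acoustic space, spkC is far away.
train = {s: rng.normal(loc=m, size=(300, 13))
         for s, m in [("spkA", 0.0), ("spkB", 0.4), ("spkC", 5.0)]}
models = {s: GaussianMixture(n_components=2, random_state=0).fit(X)
          for s, X in train.items()}

def confusing_speaker(speaker):
    # Score the speaker's own training data against all models and
    # take the second-best-scoring model as the confusing speaker.
    scores = {s: gmm.score(train[speaker]) for s, gmm in models.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[1] if ranked[0] == speaker else ranked[0]
```

Here `confusing_speaker("spkA")` returns `"spkB"` and vice versa, since their training data overlap far more than either does with spkC.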

To derive the speaker-specific-text of a speaker, as an initial step, we find the acoustically dissimilar phonemes of that speaker. The phonemes common to the speaker and her confusing speaker (i.e., the corresponding speech segments), available in the training utterances, are tested against both her model and her confusing speaker's model. The average log-likelihood of each phoneme is computed under both models. If the difference between these averages is greater than a specified threshold, the corresponding phoneme is considered an acoustically dissimilar phoneme. The same process is repeated for the phonemes of all the speakers.
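The threshold test described above can be sketched as follows. This is a toy illustration: the two models, the phoneme labels, the segment data, and the threshold value are all synthetic stand-ins, not values from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# Stand-ins for the speaker's own model and her confusing speaker's model.
own = GaussianMixture(n_components=2, random_state=0).fit(
    rng.normal(loc=0.0, size=(400, 13)))
confusing = GaussianMixture(n_components=2, random_state=0).fit(
    rng.normal(loc=1.0, size=(400, 13)))

# Hypothetical phoneme-labelled segments: "aa" frames sit near the own
# model's mean (dissimilar to the confusing speaker), "iy" frames sit
# between the two means (common to both).
segments = {
    "aa": [rng.normal(loc=0.0, size=(40, 13)) for _ in range(5)],
    "iy": [rng.normal(loc=0.5, size=(40, 13)) for _ in range(5)],
}

def dissimilar_phonemes(segments, threshold):
    adps = []
    for ph, segs in segments.items():
        X = np.vstack(segs)
        # Difference of average per-frame log-likelihoods under the
        # speaker's own model and the confusing speaker's model.
        if own.score(X) - confusing.score(X) > threshold:
            adps.append(ph)
    return adps
```

With these stand-ins, `dissimilar_phonemes(segments, 3.0)` keeps only `"aa"`: the common phoneme `"iy"` scores almost equally under both models, so its difference falls below the threshold.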

During testing, the speaker-specific-text (utterances that contain acoustically dissimilar phonemes) is used. Since the TIMIT corpus is used, the speaker-specific-text cannot be formulated using only the acoustically dissimilar phonemes. Therefore, the speaker-specific-text is derived from the two test utterances by taking the words that have the maximum number of acoustically dissimilar phonemes. Results for words with the maximum number of acoustically dissimilar phonemes were compared with those for words chosen without considering the acoustically dissimilar phonemes. When the system is tested using speech utterances corresponding to the speaker-specific-text, the confusion error is found to be considerably lower than with the conventional GMM-based classification technique, as discussed below.
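Selecting the words with the most acoustically dissimilar phonemes can be sketched as follows. The word transcriptions and the ADP set below are hypothetical examples for illustration, not data from the paper.

```python
# Hypothetical ADP set for one speaker, derived in the previous step.
adps = {"aa", "r", "s", "iy"}

# Hypothetical word-level phonetic transcriptions from the test utterances.
words = {
    "greasy": ["g", "r", "iy", "s", "iy"],
    "dark":   ["d", "aa", "r", "k"],
    "wash":   ["w", "aa", "sh"],
}

def best_word(words, adps, min_phonemes=0):
    # Pick the word whose transcription contains the most ADPs,
    # optionally requiring a minimum total phoneme count.
    eligible = {w: sum(p in adps for p in ph)
                for w, ph in words.items() if len(ph) >= min_phonemes}
    return max(eligible, key=eligible.get)
```

Here `best_word(words, adps)` returns `"greasy"` (four ADP occurrences), which would be chosen as this speaker's test word.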


TABLE II
SPEAKER IDENTIFICATION PERFORMANCE OF THE SYSTEM WITHOUT CONSIDERING THE ACOUSTICALLY DISSIMILAR PHONEMES (THE SPEAKERS THAT SATISFY THE CONSTRAINTS GIVEN IN CASES 1, 2, 3 OF TABLE I ARE CONSIDERED FOR TESTING)

Case | No. of speakers for testing (as in Table I) | No. of 500 ms speech utterances | No. of times recognized correctly | Identification accuracy
1    | 95 | 1140 | 788 | 69%
2    | 52 | 624  | 429 | 68.5%
3    | 21 | 252  | 177 | 70%

IV. PERFORMANCE ANALYSIS

The performance of the system has been analyzed using acoustically dissimilar phonemes. Various values of the threshold (the average log-likelihood difference between the speaker and her confusing speaker) are set, and different constraints are used for testing the performance of the system. Since the TIMIT corpus is used, we cannot formulate a text using only the acoustically dissimilar phonemes for testing. To derive speaker characteristics, the constraint set in our work is that the test utterances (words) should have at least six phonemes. Among these six phonemes, the word should have a minimum of three acoustically dissimilar phonemes (ADPs), i.e., the word should contain at least 50% ADPs. For each speaker, one such word (satisfying the constraints) has been chosen for testing. The performance analysis of such a system is tabulated in Table I.

From Table I, it can be noted that even with a single word that contains three or more acoustically dissimilar phonemes, the classification accuracy is reasonably good (i.e., above 80%). Further, the deviation in the performance for various thresholds¹ is only minor. This shows that the performance of the system is not very sensitive to the threshold.

Speaker identification performance is compared between utterances with acoustically dissimilar phonemes and utterances chosen without considering the acoustically dissimilar phonemes. To derive speaker characteristics, the constraint set in our work is that the test utterances (words) should have at least six phonemes. Each phoneme has a duration of approximately 80 ms. Therefore, each test utterance is divided into 500 ms speech segments and given for testing. These 500 ms segments may contain both acoustically similar and dissimilar phonemes (segments corresponding to silences longer than 100 ms are not considered).
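The 500 ms segmentation can be sketched as follows. This is a minimal version that splits a waveform into non-overlapping segments and drops a trailing remainder; the silence-removal step described in the text is omitted for brevity.

```python
import numpy as np

def split_500ms(y, sr=16000, seg_ms=500):
    # Split a test utterance into non-overlapping 500 ms segments;
    # a final remainder shorter than one segment is discarded.
    seg = int(sr * seg_ms / 1000)
    n = len(y) // seg
    return [y[i * seg:(i + 1) * seg] for i in range(n)]
```

At 16 kHz, each segment holds 8000 samples, i.e., roughly six phonemes at ~80 ms each, matching the constraint above.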

From Tables I and II, it can be noted that there is a 16% performance improvement from using speaker-specific-text, as seen in row 2 of Table I.

The speaker identification performance is also measured based on the number of acoustically dissimilar phonemes in the test utterance. From each test utterance, words with a minimum of six phonemes and at most two acoustically dissimilar phonemes have been taken for testing. Similarly, words with a minimum of six phonemes and at least three acoustically dissimilar phonemes have been taken for testing. The results are tabulated in Table III. The number of speakers taken for this experiment is 40.

¹Since the TIMIT corpus is used, the authors do not have control over the number of speakers who satisfy the constraints.

TABLE III
SPEAKER IDENTIFICATION PERFORMANCE BASED ON NUMBER OF ACOUSTICALLY DISSIMILAR PHONEMES

Case | No. of acoustically dissimilar phonemes | No. of speakers | No. of speakers recognized correctly | Identification accuracy
1    | ≤ 2 | 40 | 31 | 77%
2    | ≥ 3 | 40 | 35 | 85%

From Table III, it can be noted that the classification performance improves as the number of acoustically dissimilar phonemes increases.

The speaker identification performance is also measured by comparing the acoustically similar and acoustically dissimilar phonemes in the test utterance. Feature vectors of acoustically similar and dissimilar phonemes are extracted and given for testing. That is, testing is done with feature vectors extracted from the utterance of a single phoneme. The experimental results show that even with a single acoustically dissimilar phoneme the speakers can be identified with reasonable accuracy, as shown in Fig. 1.

From Fig. 1, it can be noted that the acoustically dissimilar phonemes yield accuracy greater than that of the acoustically similar phonemes. Speakers 9, 10, and 11 have lower accuracy for the acoustically dissimilar phonemes. However, the majority of the speakers were identified even with a single acoustically dissimilar phoneme. This result shows that, if the test utterance contains only the acoustically dissimilar phonemes,


Fig. 1. Comparison between acoustically similar and dissimilar phonemes (ADP - Acoustically Dissimilar Phoneme, ASP - Acoustically Similar Phoneme). [Figure: identification accuracy (%), 0-70%, plotted per speaker number (1-20) for ADPs vs. ASPs.]

the confusion error can be reduced and the classification accuracy can be increased. Computation time is also reduced, because only the unique features (acoustically dissimilar phonemes) are considered before testing, i.e., testing is done using speech utterances corresponding to a speaker-specific-text alone. Further, this shows that the duration of the test utterances can be reduced drastically without compromising the classification accuracy.

Even though our proposed technique gives better accuracy than the conventional GMM-based technique, three sources of misclassification were identified in the proposed technique:

1) For each speaker, only one speaker is considered as the confusing speaker.
2) Some phonemes may be missed while taking the common phonemes between the first speaker and her confusing speaker.
3) Acoustically dissimilar phonemes are derived based on the average log-likelihood value. Some phonemes may have only a small number of examples; in such cases, using a statistical parameter (the mean of the log-likelihoods) may not be reliable.

These errors can be avoided by creating our own speech corpus that has a sufficient number of examples for each of the phonemes.

V. CONCLUSIONS

In this paper, we have proposed to use speech utterances that correspond to a speaker-specific-text for speaker recognition tasks. Here, the speaker-specific-text is formed using the unique phonemes of a speaker, in other words, a set of phonemes that are acoustically dissimilar when compared with those of a competing (acoustically closely resembling) speaker. We have shown that the classification accuracy in a speaker identification task is considerably higher than that of a conventional GMM-based technique when speech utterances corresponding to the unique phonemes are used. Further, we have shown that, even with a single phoneme, if it is unique to a speaker, the classification accuracy is quite satisfactory. These results show that the duration of the test utterances can also be reduced considerably without compromising accuracy.

REFERENCES

[1] D. Reynolds and R. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models", IEEE Trans. Speech Audio Processing, vol. 3, pp. 72–83, 1995.

[2] D. A. Reynolds, "Automatic speaker recognition using Gaussian mixture speaker models", The Lincoln Laboratory Journal, vol. 8, no. 2, pp. 173–192, 1995.

[3] D. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models", Speech Communication, vol. 17, pp. 91–108, 1995.

[4] C. M. del Alamo, F. J. Caminero Gil, C. de la Torre Munilla, and L. Hernandez Gomez, "Discriminative training of GMM for speaker identification", in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, pp. 89–92, May 1996.

[5] Chi-Shi Liu, Chin-Hui Lee, Biing-Hwang Juang, and A. E. Rosenberg, "Speaker recognition based on minimum error discriminative training", in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, pp. 325–328, April 1994.

[6] Chi-Shi Liu, Chin-Hui Lee, W. Chou, B.-H. Juang, and A. E. Rosenberg, "A study on minimum error discriminative training for speaker recognition", The Journal of the Acoustical Society of America, vol. 97, pp. 637–648, January 1995.

[7] S. Kwong, Q. H. He, K. F. Man, and K. S. Tang, "Improved maximum model distance for HMM training", Pattern Recognition, vol. 33, pp. 1749–1758, 2000.

[8] S. Kwong, Q. He, K. Man, and K. Tang, "A Maximum Model Distance approach for HMM based speech recognition", Pattern Recognition, pp. 219–229, 1998.

[9] Q. Hong and S. Kwong, "Discriminative training for speaker identification based on maximum model distance algorithm", in Proc. ICASSP, pp. 25–28, 2004.

[10] C. Arun Kumar, B. Bharathi, and T. Nagarajan, "A discriminative GMM technique using product of likelihood Gaussians", in Proc. IEEE TENCON, pp. 1–6, 2009.
