CHAPTER 6
TWO-LEVEL APPROACH FOR SPEAKER
IDENTIFICATION USING SPEAKER-SPECIFIC-TEXT
6.1 NEED FOR TWO-LEVEL APPROACH
The speaker identification task explained in Chapter 5 considers
only one confusing speaker for each speaker. During testing, if the
confusing speaker is not present in the first position, there is no
chance to improve the performance. On the other hand, if more
than one confusing speaker is considered for each speaker, then a common set of
unique phonemes can be derived from all of the confusing speakers. One may
assume that this phoneme set is, to a certain extent, unique with respect to the other
speakers too. Let us consider a closed-set speaker recognition task with N
speakers. In the proposed approach, the important and sensitive task is to
derive unique phonemes for each of the N speakers. For any speaker in a
given set of N speakers, this can be carried out in the following two ways:
1. Considering the rest of the N - 1 speakers as competing
speakers.
2. Considering a smaller set of speakers (say N1 speakers, where
N1 << N) as competing speakers.
In case (1), when N is very large, deriving unique phonemes is
computationally expensive. It is reasonable to assume that most of the
speakers in the total set of N speakers will not be acoustically close to the test speaker.
For this reason, in our work only a subset of speakers is considered.
Since the intention here is to improve the classification accuracy of the GMM-
based technique, conventional GMM testing can be used to derive this subset
by considering the N1-best results of the GMM technique.
Having selected the N1 confusing speakers for a given speaker,
the task now is to derive unique phonemes. For each pair of speakers, where
the pair consists of the intended speaker and one of the competing speakers,
the unique phonemes can be derived as follows: given the speech segments
for each of the phonemes and the models (GMMs) of the two speakers in the pair, the
unique phonemes (or acoustically dissimilar phonemes) can be derived by
comparing the acoustic likelihoods.
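The pairwise comparison just described can be sketched as follows. This is a minimal illustration, not the actual implementation: the model objects are only assumed to expose a `score()` method returning the average per-frame log-likelihood of a feature matrix (as scikit-learn's `GaussianMixture` does), and the threshold value is an assumed placeholder.

```python
def unique_phonemes(segments, gmm_a, gmm_b, threshold=2.0):
    """For each phoneme shared by a speaker pair, compare the average
    log-likelihood under the two speaker models; phonemes whose
    likelihood gap exceeds the threshold are treated as acoustically
    dissimilar (unique) for this pair.

    segments: dict mapping phoneme label -> feature matrix (frames x dims)
    gmm_a, gmm_b: trained models exposing score(X) -> avg log-likelihood
    threshold: illustrative value; in practice tuned on held-out data.
    """
    dissimilar = []
    for phoneme, feats in segments.items():
        ll_a = gmm_a.score(feats)
        ll_b = gmm_b.score(feats)
        if abs(ll_a - ll_b) > threshold:
            dissimilar.append(phoneme)
    return dissimilar
```

The same routine applies symmetrically to either speaker of the pair, since only the magnitude of the likelihood gap is used.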
This kind of technique, by default, mandates a two-level approach.
In the first level of testing, the 2-best results can be derived. For a reasonable
classifier, the probability of the actual class being in the first place is expected to
be high; if it fails, the probability of it being in the second place, losing to its
competing speaker, is also high. In the proposed approach, the prime interest
is in detecting the latter case by using the unique phonemes in the second
level, to improve the classification accuracy.
These unique phonemes, used in the second level of testing, can be
derived for one specific competing speaker, or a common set of unique
phonemes can be derived from the whole set of competing speakers. In the former
case, during testing, if that competing speaker is not present in the first position, there is no
chance to improve the performance. On the other hand, if a
common set of unique phonemes is derived from all of the competing
speakers, then one may assume that this phoneme set is, to a certain extent,
unique with respect to the other speakers too. Even though this may not always be
correct, it at least gives us a chance to go for second-level testing.
In our proposed work, the classes are discriminated at the phoneme
level, i.e., acoustically dissimilar phonemes of a speaker, when compared to
his/her closely resembling speakers, have been derived. During testing, in the
first level, using the conventional GMM-based system, the 2-best results have been
derived. In the second level, only for these two speakers, testing has been carried
out using the speaker-specific-text (the speech utterances which contain acoustically
dissimilar phonemes). Since the speaker-specific-text is
formed using the unique set of phonemes of a particular speaker, the
confusion error is reduced considerably. The proposed technique is
evaluated on a speaker identification task using the TIMIT speech corpus, and the
results are compared with the conventional GMM-based classifier.
6.2 EXPERIMENTAL SETUP
The TIMIT speech corpus is used for both training and testing. For
each speaker, a GMM with 64 mixture components has been trained,
considering Mel frequency cepstral coefficients (13 static + 13 dynamic + 13
acceleration) as the features. The proposed technique involves three main
steps:
1. To find out the m confusing speakers for each speaker.
2. To derive the acoustically dissimilar phoneme set for each
speaker when compared to his/her confusing speakers.
3. To perform two-level testing using speaker-specific-text.
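As a rough sketch of the per-speaker modelling in this setup, one GMM per speaker can be trained on pooled 39-dimensional MFCC vectors (13 static + 13 dynamic + 13 acceleration). The use of scikit-learn and of diagonal covariances here is an illustrative assumption, not a statement of the actual implementation:

```python
from sklearn.mixture import GaussianMixture

def train_speaker_gmm(features, n_components=64, seed=0):
    """Train one speaker model on that speaker's pooled training data.

    features: (n_frames, 39) array of MFCC vectors for one speaker.
    n_components: 64 mixture components, as in the setup above.
    Diagonal covariances are an assumption of this sketch.
    """
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          random_state=seed)
    gmm.fit(features)
    return gmm
```

The resulting model's `score(X)` returns the average per-frame log-likelihood, which is the quantity compared throughout the two-level procedure.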
To find out the confusing speakers, the training utterances of each
speaker have been tested with all the speaker models. Leave-one-out
procedure has been used. For each speaker, m confusing speakers have been
derived based on sorted log-likelihoods (for this work, m = 5). This process
is repeated for all the speakers and a confusing speakers list is derived.
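The derivation of the confusing-speaker list can be sketched as follows. The model objects are assumed (for this illustration only) to expose a `score()` method returning the average log-likelihood of a feature matrix:

```python
def confusing_speakers(speaker_models, train_feats, m=5):
    """For each speaker, score his/her training data against all other
    speaker models and keep the m best-scoring other speakers as the
    confusing-speaker list (m = 5 in this work).

    speaker_models: dict name -> model exposing score(X)
    train_feats: dict name -> feature matrix of that speaker's training data
    """
    confusing = {}
    for spk, feats in train_feats.items():
        scores = {other: model.score(feats)
                  for other, model in speaker_models.items()
                  if other != spk}        # exclude the speaker's own model
        ranked = sorted(scores, key=scores.get, reverse=True)
        confusing[spk] = ranked[:m]
    return confusing
```

In the actual procedure the scoring additionally follows a leave-one-out protocol over training utterances; that bookkeeping is omitted here for brevity.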
To derive the speaker-specific-text of a speaker, the common
phonemes (i.e., the corresponding speech segments) of the speaker and his/her
confusing speaker, available in the training utterances, are tested with both the
speaker's model and the confusing speaker's model. The average log-likelihood of each
phoneme is computed for the first speaker and his/her confusing speaker. If the
difference of the mean log-likelihoods is greater than a specific threshold, then the
corresponding phoneme is considered an acoustically dissimilar phoneme.
For each speaker, a different subset of acoustically dissimilar phonemes is derived
with respect to each of his/her closely resembling speakers. The same process is
repeated for the phonemes of all the speakers. For each speaker, common
acoustically dissimilar phonemes have been derived by considering two,
three, four and five confusing speakers. For each speaker, the speaker-
specific-text has been derived by concatenating six common acoustically
dissimilar phonemes.
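The intersection step described above can be sketched as follows, with phoneme sets represented as Python sets (a minimal illustration):

```python
def common_dissimilar(dissimilar_per_confuser, k):
    """Intersect the per-confusing-speaker dissimilar phoneme lists over
    the first k confusing speakers, as done here for k = 2, 3, 4 and 5.

    dissimilar_per_confuser: list of phoneme lists, one per confusing
    speaker, each derived by the likelihood-threshold comparison above.
    """
    sets = [set(s) for s in dissimilar_per_confuser[:k]]
    common = set.intersection(*sets) if sets else set()
    return sorted(common)
```

Six phonemes drawn from this common set are then concatenated to form the speaker-specific-text.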
Testing has been carried out in two levels. In the first-level testing,
using the conventional GMM-based system, the 2-best results have been derived. Let
us denote the first speaker in the result as speaker A and the second speaker as speaker
B. Second-level testing checks whether speaker B is the actual speaker
or not. The steps involved in second-level testing are as follows:
1. If speaker A is present in the confusing speaker list of B, then the
speaker-specific-text is formulated using the acoustically dissimilar
phonemes of speaker B with respect to speaker A. In the
second-level testing, the test speaker has to be asked to utter the
speaker-specific-text. Since the TIMIT corpus is used, the speech
utterance corresponding to the speaker-specific-text has been formulated from the
test utterances of the test speaker by concatenating six randomly
picked acoustically dissimilar phonemes. Using the speaker-
specific-text, testing has been performed with speaker models A
and B. If the log-likelihood of speaker model B is higher than
that of speaker A, then speaker B is declared
the winner; otherwise speaker A is declared the winner.
2. If speaker A is not present in the confusing speaker list of speaker
B, then the speaker-specific-text is formulated using the common
unique phonemes of speaker B, which are derived by considering
all the confusing speakers of speaker B. Using the speaker-specific-text,
testing is done with speakers A and B. If the log-likelihood of
speaker model B is higher than that of speaker A, then
speaker B is declared the winner; otherwise speaker A is
declared the winner.
3. If the number of acoustically dissimilar phonemes is less than six
while taking the common unique phonemes of speaker B, then speaker
A is declared the winner, since a minimum of six acoustically
dissimilar phonemes is required to formulate the speaker-specific-text.
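The three decision steps above can be sketched as follows. This is a simplified illustration: the data structures and the `make_text_feats` helper (which turns a phoneme list into a test feature matrix) are hypothetical names introduced only for this sketch.

```python
def second_level(spk_a, spk_b, confusing, dissim_pair, common_unique,
                 model_a, model_b, make_text_feats, min_phonemes=6):
    """Decide between the 2-best speakers A and B using speaker-specific-text.

    confusing: dict speaker -> list of confusing speakers
    dissim_pair: dict (speaker, confuser) -> dissimilar phoneme list
    common_unique: dict speaker -> common unique phoneme list
    Returns the name of the declared winner.
    """
    if spk_a in confusing[spk_b]:
        phonemes = dissim_pair[(spk_b, spk_a)]   # step 1: pairwise set
    else:
        phonemes = common_unique[spk_b]          # step 2: common unique set
    if len(phonemes) < min_phonemes:             # step 3: too few phonemes
        return spk_a
    feats = make_text_feats(phonemes[:min_phonemes])
    return spk_b if model_b.score(feats) > model_a.score(feats) else spk_a
```

Note that speaker A wins by default in step 3, since no speaker-specific-text can be formed.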
When the system is tested using speech utterances that correspond
to the speaker-specific-text, the confusion error is found to be reduced
considerably compared to the conventional GMM-based classification
technique, as discussed below.
6.3 PERFORMANCE ANALYSIS
Speaker identification performance is compared between testing with
utterances containing acoustically dissimilar phonemes and testing without
considering them. To capture sufficient speaker characteristics, the
constraint set in our work is that the test utterances (words) should
have at least six phonemes. Each phoneme has a duration of approximately 80 ms.
Therefore, each test utterance is divided into 500 ms speech segments
and given for testing. Each 500 ms speech segment may contain both
acoustically similar and dissimilar phonemes (segments corresponding to
silences longer than 100 ms are not considered). The number of speakers
taken for this experiment is 192. The first-level testing has been carried out
using the 500 ms speech utterance and the 2-best results have been derived.
Second-level testing has been carried out using the speaker-specific-text. The
speaker identification accuracies using the conventional GMM and the proposed two-
level approach are tabulated in Table 6.1. In the two-level approach,
one confusing speaker has been considered for this experiment.
Table 6.1 Speaker identification performance using conventional GMM and two-level approach

S.No.  Method               Identification accuracy
1      Conventional method  66.8%
2      Two-level approach   76.36%
From Table 6.1, it can be noted that the two-level approach gives a 9.56%
absolute performance improvement over the conventional method, as specified in row 2 of Table 6.1.
We expect the speaker identification performance to increase
as the number of confusing speakers is increased. In the following
experiment, the speaker identification accuracy is measured by varying the
number of confusing speakers. The number of speakers taken for this
experiment is 192.
Figure 6.1 Comparison between number of confusing speakers used in the two-level approach and speaker identification accuracy
From Figure 6.1, it can be noted that the speaker identification
accuracy reduces when the number of confusing speakers is
increased. The reason for the reduction in accuracy is that, in the second-level
testing, for some speakers, a few acoustically dissimilar phonemes may be
missed while taking the common acoustically dissimilar phonemes of speaker B
over his/her confusing speakers. (Common unique phonemes are
considered only when speaker A is not present in the confusing speaker
list of B.)
The reason for the reduction in accuracy, when the number of
confusing speakers is increased, is explained below with the help of Figure 6.2.
Figure 6.2 Representation of the phoneme space and common phoneme space of a speaker A by considering more than one confusing speaker (AC1, AC2, AC3)

In Figure 6.2, let the phoneme space of speaker A be represented by
circle A (middle circle). The phoneme spaces of the confusing speakers of
speaker A are represented by AC1, AC2 and AC3.
The common unique phoneme space of speaker A (U_P), considering
his/her confusing speakers, is represented by

U_P = A - (A_1 U A_2 U . . . U A_m)                    (6.1)

where
m - number of confusing speakers,
A_i - common phonemes between speaker A and his/her
i-th confusing speaker AC_i, i = 1, 2, . . ., m.
From Equation (6.1), it can be noted that when the second term on the
RHS grows, the value of U_P decreases. From Figure 6.2, we can conclude
that when the number of confusing speakers is increased, the number of
common unique phonemes decreases, and hence the performance decreases.
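The set relation in Equation (6.1) can be illustrated directly with set operations; adding confusing speakers can only enlarge the union and hence shrink U_P:

```python
def unique_phoneme_space(A, common_with_confusers):
    """Compute U_P = A minus the union of the common phoneme sets A_i
    with each of the m confusing speakers (Equation 6.1).

    A: set of phonemes of speaker A
    common_with_confusers: list of sets A_1 .. A_m
    """
    union = set().union(*common_with_confusers) if common_with_confusers else set()
    return set(A) - union
```

Each additional confusing speaker contributes another A_i to the union, which is exactly why the common unique phoneme counts in Table 6.2 shrink from 16 down to 6.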
That the number of common unique phonemes decreases as the number of
confusing speakers increases can also be seen from the unique phoneme
lists of an example speaker with respect to his/her confusing
speakers, tabulated in Table 6.2.
Table 6.2 Common unique phonemes of a speaker by considering varying number of confusing speakers

2 confusing speakers: /a/ /ae/ /ah/ /ar/ /dh/ /eh/ /en/ /eng/ /i/ /ih/ /iy/ /n/ /ng/ /oy/ /w/ /y/ (total 16)
3 confusing speakers: /a/ /ae/ /ar/ /dh/ /eh/ /en/ /i/ /ih/ /iy/ /n/ /ng/ /oy/ /w/ (total 13)
4 confusing speakers: /a/ /ae/ /en/ /i/ /ih/ /iy/ /oy/ /w/ (total 8)
5 confusing speakers: /ae/ /en/ /i/ /ih/ /oy/ /w/ (total 6)
From Table 6.2, it can be noted that when the number of
confusing speakers is increased, the number of common unique phonemes
decreases.
The speaker identification performance is measured by varying the
duration of the speech utterances using the conventional GMM-based technique,
and the results are plotted in Figure 6.3. The speaker identification performance
using speaker-specific-text for a 500 ms speech utterance is 76.36%,
as specified in row 2 of Table 6.1.
Figure 6.3 Comparison between speaker identification accuracy and duration of speech utterance using conventional GMM method
From Figure 6.3, it can be noted that the 76.36% performance achieved
using speaker-specific-text with 500 ms utterances is reached by the
conventional GMM only when 700 ms
utterances are used. This shows that the classification
accuracy has increased even with test utterances of reduced duration (1.4
times shorter than the duration needed by the conventional
GMM).
The speaker identification performance is compared between the
conventional method and the two-level approach for different population sizes.
Figure 6.4 Comparison between speaker identification accuracy and number of speakers using conventional and two-level approach
From Figure 6.4, it can be noted that the speaker identification
accuracy using the two-level approach is higher than that of the conventional method
across different population sizes as well.
6.4 MOTIVATION FOR CREATING A NEW SPEECH CORPUS
Even though our proposed technique gives better accuracy than
the conventional GMM-based technique, two reasons were
identified for misclassification in our proposed technique, as given
below:
1. Some acoustically dissimilar phonemes may be missed while
taking the common phonemes between the first speaker and his/her
confusing speakers.
2. Acoustically dissimilar phonemes are derived based on the
average log-likelihood value. If some phonemes
have only a few examples, then a statistical
parameter like the mean of the log-likelihoods is not
appropriate. In Section 6.2, since the TIMIT speech corpus is
used, many phonemes have very few examples
(even just two); using the mean value in such cases is not appropriate and
might have led to a false set of phonemes being identified as unique.
These errors can be avoided by creating our own phonetically
balanced speech corpus.
6.4.1 Speech Data Collection
For the present study, we have created and used a new speech
corpus. For this purpose, we have collected 142
sentences from the TIMIT corpus that together provide a sufficient number (minimum 30) of
examples for all 45 phonemes; including silence, 46 phoneme classes are
considered in this work. The speech data is recorded using a head-
mounted microphone at a 16 kHz sampling rate. The frequency response of
the microphone is 20 Hz - 20 kHz, and the same microphone is used for all
the speakers. The speech data is recorded in a laboratory environment,
without any background noise such as fan or AC noise. The Wavesurfer
tool is used for speech data collection, and for removing long silences at the
beginning and end of the utterances. We have collected speech utterances
from 50 speakers, comprising 43 female speakers and 7 male speakers. Each
utterance is of approximately 3.5 sec duration. (NIST SRE corpora cannot be
used for the proposed approach, because our approach requires speech data
to be collected for the speaker-specific-text.) The following is the phoneme
list considered for this task:
Table 6.3 Phoneme list considered for creating the speech corpus

Phoneme  Word      Phoneme  Word
/sh/     Shout     /jh/     Joke
/i/      Beet      /ih/     Bit
/h/      Hay       /d/      Day
/eh/     Bet       /ah/     But
/k/      Key       /r/      Right
/s/      Sound     /w/      Wire
/u/      Boot      /ao/     Bought
/en/     Button    /ar/     Butter
/g/      Gay       /l/      Like
/n/      Noon      /oy/     Boy
/ae/     Bat       /a/      About
/m/      Moon      /dh/     Then
/t/      Tea       /iy/     Beet
/v/      Vote      /f/      Fish
/p/      Pea       /ow/     Boat
/ch/     Choke     /b/      Bee
/aa/     Father    /em/     Bottom
/ng/     Sing      /ay/     Bite
/th/     Thin      /ey/     Bait
/aw/     About     /er/     Bird
/z/      Zoo       /el/     Bottle
/zh/     Azure     /eng/    Washington
The entire speech data is automatically segmented at the phoneme level using
the forced Viterbi algorithm (Brugnara et al 1993). For the forced Viterbi
algorithm, the speech data, the corresponding phonetic transcriptions, and the
phoneme models have to be provided. Since the text is taken from the
TIMIT corpus, the phonetic transcription provided in the TIMIT corpus is
used. The monophone models are also trained using the speech data of the TIMIT
corpus. Separate sets of models are trained for male and female speakers,
and used for deriving the time-aligned phonetic transcriptions for the newly
collected speech data.
The following are the steps taken for phoneme segmentation.
1. The utterances of one speaker are segmented manually. The HTK
transcription format has been used to represent the phonetic transcription,
as follows: start time, end time, phoneme identity
(times are represented in units of 100 nanoseconds).
2. Hidden Markov models are trained for all the 46 phonemes
using the manually segmented data.
3. Using these models, a second speaker's speech data
is automatically segmented at the phoneme level using the forced
Viterbi algorithm.
4. Phoneme models are then created using these two speakers'
speech data.
5. Using these phoneme models, a third speaker's speech data is
automatically segmented at the phoneme level using the forced
Viterbi algorithm.
6. This procedure is repeated until five speakers are covered, and
phoneme models are created from the five speakers' speech
data.
7. Using the phoneme models created at step 6, the remaining 38
female speakers' speech data has been automatically
segmented using the forced Viterbi algorithm.
8. Phoneme models are created using all the 43 female speakers'
speech data.
9. The entire female speech data is then automatically
segmented again using the forced Viterbi algorithm.
The same steps are followed for segmenting the male speech data.
The phoneme boundaries are refined to a greater extent by applying the
forced Viterbi algorithm iteratively.
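The bootstrapping procedure above can be summarised in a simplified sketch. Here `train_hmms` and `force_align` stand in for an HMM toolkit such as HTK; their signatures are hypothetical, and the per-speaker incremental schedule of steps 1-9 is collapsed into one loop for brevity:

```python
def bootstrap_segmentation(speakers, manual_labels, train_hmms, force_align):
    """Bootstrap phoneme segmentation: start from one manually segmented
    speaker, then alternately retrain phoneme HMMs on all labelled data
    and force-align the next speaker's data.

    speakers: ordered list of speaker IDs; speakers[0] is the manual seed.
    manual_labels: hand-made segmentation for the first speaker.
    train_hmms(labelled): placeholder -> phoneme models
    force_align(spk, models): placeholder -> time-aligned labels
    """
    labelled = {speakers[0]: manual_labels}          # step 1: manual seed
    for spk in speakers[1:]:
        models = train_hmms(labelled)                # retrain on all labels
        labelled[spk] = force_align(spk, models)     # forced Viterbi pass
    # final pass: re-align everyone with models trained on all speakers
    models = train_hmms(labelled)
    return {spk: force_align(spk, models) for spk in labelled}
```

The final re-alignment pass corresponds to the iterative boundary refinement mentioned above.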
6.5 SPEAKER IDENTIFICATION USING THE NEW SPEECH CORPUS
The proposed technique is evaluated on the speaker identification
task using the new speech corpus, and the results are compared with the conventional
GMM-based classifier. Speech utterances are collected from 50 speakers.
For each speaker, among 142 sentences, 130 utterances are used for training
and 12 utterances are used for testing. For each speaker, a GMM with 64
mixture components has been trained, considering Mel-frequency cepstral
coefficients (13 static + 13 dynamic + 13 acceleration) as the features.
To find out the confusing speakers, the training utterances of each
speaker have been tested with all the speaker models. Leave-one-out
procedure has been used. For each speaker, m confusing speakers have been
derived based on sorted log likelihoods (for this work, m = 5). This process is
repeated for all the speakers and a confusing speakers list is derived.
To derive the speaker-specific-text of a speaker, the common phonemes (i.e.,
the corresponding speech segments) of the speaker and his/her confusing speaker,
available in the training utterances, are tested with his/her model and his/her
confusing speaker's model. The average log-likelihood of each phoneme is
computed for the first speaker and the confusing speaker. Based on the sorted
differences of the average log-likelihoods, the first twenty phonemes are considered
acoustically dissimilar phonemes. For each speaker, a different subset of
acoustically dissimilar phonemes is derived with respect to each of his/her
closely resembling speakers. The same process is repeated for the phonemes of all
the speakers. For each speaker, common acoustically dissimilar phonemes
have been derived by considering two, three, four and five confusing
speakers. For each speaker, the speaker-specific-text has been derived by
concatenating six common acoustically dissimilar phonemes. By the same
procedure, acoustically similar phonemes are derived by taking the last fifteen
phonemes from the sorted list. (This is to study the effect of
acoustically dissimilar versus acoustically similar phonemes on the
speaker identification task.)
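The ranking just described (the top twenty phonemes treated as dissimilar, the last fifteen as similar) can be sketched as follows; the input dictionaries of per-phoneme average log-likelihoods are illustrative:

```python
def split_by_likelihood_gap(avg_ll_self, avg_ll_confuser,
                            n_dissimilar=20, n_similar=15):
    """Rank the shared phonemes by the magnitude of the difference in
    average log-likelihood between the speaker's own model and the
    confusing speaker's model; return (dissimilar, similar) phoneme lists.

    avg_ll_self, avg_ll_confuser: dict phoneme -> average log-likelihood
    """
    gaps = {p: abs(avg_ll_self[p] - avg_ll_confuser[p]) for p in avg_ll_self}
    ranked = sorted(gaps, key=gaps.get, reverse=True)
    return ranked[:n_dissimilar], ranked[-n_similar:]
```

The dissimilar list feeds the speaker-specific-text; the similar list is used only for the contrastive experiment of Figure 6.5.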
Testing is carried out in two levels. In the first-level testing, using the
conventional GMM-based system, the 2-best results are derived. Let us denote the
first speaker in the result as speaker A and the second speaker as speaker B.
Second-level testing checks whether speaker B is the actual speaker or
not.
Steps involved in second-level testing are as follows:
1) If speaker A is present in the confusing speaker list of B, then
the speaker-specific-text is formulated using the acoustically
dissimilar phonemes of speaker B with respect to speaker A.
Using the speaker-specific-text, testing is performed with
speaker models A and B. If the log-likelihood of speaker model B
is higher than that of speaker A, then speaker B is declared
the winner; otherwise speaker A is declared the winner.
(Instead of concatenating phonemes, as a next experiment, a
readable text will be formulated using the unique phonemes,
and testing will be carried out.)
2) If speaker A is not present in the confusing speaker list of
speaker B, then the speaker-specific-text is formulated using the
common unique phonemes of speaker B, which are derived by
considering all the confusing speakers of speaker B. Using the
speaker-specific-text, testing is carried out with speakers A and
B. If the log-likelihood of speaker model B is higher than
that of speaker A, then speaker B is declared the winner;
otherwise speaker A is declared the winner.
When the system is tested using speech utterances that correspond
to the speaker-specific-text, the confusion error is found to be reduced
considerably compared to the conventional GMM-based classification
technique, as discussed below.
6.6 PERFORMANCE ANALYSIS OF THE TWO-LEVEL APPROACH
(USING THE NEW SPEECH CORPUS)
Speaker identification performance is compared between testing with
utterances containing acoustically dissimilar phonemes and testing without
considering them. Each test utterance is divided into 500 ms
(approximately 6 phonemes) speech segments and given for testing. Each 500 ms
speech segment may contain both acoustically similar and dissimilar phonemes
(segments corresponding to silences longer than 100 ms are not considered).
For each speaker, the number of 500 ms test utterances considered for this
experiment is 60, and the total number of test utterances is 3000. The first-
level testing has been carried out using the 500 ms speech utterances and the 2-
best results have been derived. Second-level testing has been carried out using
the speaker-specific-text formed by concatenating six unique phonemes, as
mentioned earlier. The speaker identification accuracies using the conventional
GMM (up to the first level) and the proposed two-level approach are tabulated in
Table 6.4. Second-level testing is done using common unique phonemes
derived by considering one, two, and up to five confusing speakers; the identification
accuracy is the same in all these cases. For each test case, a single speaker-specific-
text is considered for this experiment.
Table 6.4 Speaker identification performance using conventional GMM and two-level approach (new speech corpus)

S.No.  Method                                    Identification accuracy
1      Conventional method (first-level output)  82%
2      Two-level approach                        88.5%
From Table 6.4, it can be noted that the two-level approach gives a 6.5% absolute
performance improvement over the conventional method, as specified in row 2 of Table 6.4.
The following experiment measures the speaker
identification performance by using only the common acoustically dissimilar
phonemes (common unique phonemes) and the common acoustically similar
phonemes (common phonemes) of each speaker, varying the number of
confusing speakers. For each speaker, the number of test utterances
considered for this experiment is 10 (formed by concatenating six common unique
phonemes or six common phonemes). The total number of test utterances is 500.
Figure 6.5 Speaker identification accuracy in terms of number of confusing speakers
From Figure 6.5, it can be noted that, using common unique
phonemes, the speaker identification accuracy increases when the
number of confusing speakers is increased from one to three, and reduces
when the number of confusing speakers is increased to four or five. The
reason for the reduction in accuracy is that, in the second-level testing, for
some speakers, the number of acoustically dissimilar phonemes may be
reduced while taking the common acoustically dissimilar phonemes of
speaker B over his/her confusing speakers. From Figure 6.5, it can also be
noted that, using common acoustically similar phonemes (common phonemes),
the speaker identification accuracy reduces as the number of confusing
speakers is increased.
The speaker identification performance is measured by varying the
duration of the speech utterances using the conventional GMM-based technique,
and the results are plotted in Figure 6.6. The speaker identification performance
using speaker-specific-text for a 500 ms speech utterance is 88.5%, as
specified in row 2 of Table 6.4.
Figure 6.6 Comparison between speaker identification accuracy and duration of speech utterance using conventional GMM method
Further, it is noted that the 88.5% performance achieved
using speaker-specific-text with 500 ms utterances is
reached by the conventional GMM only when 700 ms utterances
are used. This shows that the classification accuracy has increased even
with test utterances of reduced duration (shorter by a factor of 1.4).
The speaker identification performance is compared between the
conventional method and the two-level approach for different population sizes.
From Figure 6.7, it can be noted that the speaker identification accuracy using the
two-level approach is higher than that of the conventional method across
different population sizes as well.
Figure 6.7 Comparison between speaker identification accuracy and number of speakers using conventional and two-level approach
6.7 SUMMARY
In this chapter, we have proposed to use speech utterances that
correspond to a speaker-specific-text for speaker recognition tasks. We have
shown that the classification accuracy, in a speaker identification task using
two-level approach, is considerably higher than that of a conventional GMM-
based technique, if the speech utterances correspond to the unique phonemes
are used. We have experimented the two-level approach, using TIMIT corpus
and the new speech corpus. In speaker identification task, the speaker is not
cooperative to utter the speaker-specific-text during the second-level of
testing. Our proposed work is a hypothetical experiment to prove that the
classification accuracy will be increased when the speaker-specific-text is
used. In fact, the proposed approach can be used for speaker verification task
also, with required modification.