CHAPTER 6
TWO-LEVEL APPROACH FOR SPEAKER
IDENTIFICATION USING SPEAKER-SPECIFIC-TEXT
6.1 NEED FOR TWO-LEVEL APPROACH
The speaker identification task explained in Chapter 5 considers
only one confusing speaker for each speaker. During testing, if the
confusing speaker is not present in the first position, there is no
chance to improve the performance. On the other hand, if more
than one confusing speaker is considered for each speaker, then a common set of
unique phonemes can be derived from all of the confusing speakers. One may
assume that this phoneme set is, to a certain extent, unique with respect to the other
speakers too. Let us consider a closed-set speaker recognition task with N
speakers. In the proposed approach, the important and sensitive task is to
derive unique phonemes for each of the N speakers. For any speaker in a
given set of N speakers, this can be carried out in the following two ways:
1. Considering the rest of the N - 1 speakers as competing
speakers.
2. Considering a smaller set of speakers (say N1 speakers, where
N1 << N) as competing speakers.
In case (1), when N is very large, deriving unique phonemes is
computationally expensive. It is reasonable to assume that most of the
speakers in the total set of N speakers will not be acoustically close to the test speaker.
For this reason, in our work only a subset of speakers is considered.
Since the intention here is to improve the classification accuracy of the GMM-
based technique, conventional GMM testing can be used to derive this subset
by considering the N1-best results of the GMM technique.
Having selected the N1 confusing speakers for a given speaker,
the task now is to derive unique phonemes. For each pair of speakers, where
the pair consists of the intended speaker and one of the competing speakers,
the unique phonemes can be derived as follows: given the speech segments
for each of the phonemes and the models (GMMs) of the two speakers in the pair, the
unique phonemes (or acoustically dissimilar phonemes) can be derived by
comparing the acoustic likelihoods.
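The pairwise comparison just described can be sketched as follows. This is a minimal illustration, not the actual implementation: the model objects are only assumed to expose a `score()` method returning the average per-frame log-likelihood of a feature matrix (as scikit-learn's `GaussianMixture` does), and the threshold value is an assumed placeholder.

```python
def unique_phonemes(segments, gmm_a, gmm_b, threshold=2.0):
    """For each phoneme shared by a speaker pair, compare the average
    log-likelihood under the two speaker models; phonemes whose
    likelihood gap exceeds the threshold are treated as acoustically
    dissimilar (unique) for this pair.

    segments: dict mapping phoneme label -> feature matrix (frames x dims)
    gmm_a, gmm_b: trained models exposing score(X) -> avg log-likelihood
    threshold: illustrative value; in practice tuned on held-out data.
    """
    dissimilar = []
    for phoneme, feats in segments.items():
        ll_a = gmm_a.score(feats)
        ll_b = gmm_b.score(feats)
        if abs(ll_a - ll_b) > threshold:
            dissimilar.append(phoneme)
    return dissimilar
```

The same routine applies symmetrically to either speaker of the pair, since only the magnitude of the likelihood gap is used.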
This kind of technique, by default, mandates a two-level approach.
In the first level of testing, the 2-best results can be derived. For a reasonable
classifier, the probability of the actual class being in the first place is expected to
be high; if it fails, the probability of it being in the second place, losing to its
competing speaker, is also high. In the proposed approach, the prime interest
is in detecting the latter case by using the unique phonemes in the second
level, to improve the classification accuracy.
These unique phonemes, used in the second level of testing, can be
derived for one specific competing speaker, or a common set of unique
phonemes can be derived from the whole set of competing speakers. In the former
case, during testing, if that competing speaker is not present in the first position, there is no
chance to improve the performance. On the other hand, if a
common set of unique phonemes is derived from all of the competing
speakers, then one may assume that this phoneme set is, to a certain extent,
unique with respect to the other speakers too. Even though this may not always be
correct, it at least gives us a chance to go for second-level testing.
In our proposed work, the classes are discriminated at the phoneme
level, i.e., acoustically dissimilar phonemes of a speaker, when compared to
his/her closely resembling speakers, have been derived. During testing, in the
first level, using the conventional GMM-based system, the 2-best results have been
derived. In the second level, only for these two speakers, testing has been carried
out using the speaker-specific-text (the speech utterances which contain acoustically
dissimilar phonemes). Since the speaker-specific-text is
formed using the unique set of phonemes of a particular speaker, the
confusion error is reduced considerably. The proposed technique is
evaluated on a speaker identification task using the TIMIT speech corpus, and the
results are compared with the conventional GMM-based classifier.
6.2 EXPERIMENTAL SETUP
The TIMIT speech corpus is used for both training and testing. For
each speaker, a GMM with 64 mixture components has been trained,
considering Mel frequency cepstral coefficients (13 static + 13 dynamic + 13
acceleration) as the features. The proposed technique involves three main
steps:
1. To find out the m confusing speakers for each speaker.
2. To derive the acoustically dissimilar phoneme set for each
speaker when compared to his/her confusing speakers.
3. To perform two-level testing using speaker-specific-text.
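As a rough sketch of the per-speaker modelling in this setup, one GMM per speaker can be trained on pooled 39-dimensional MFCC vectors (13 static + 13 dynamic + 13 acceleration). The use of scikit-learn and of diagonal covariances here is an illustrative assumption, not a statement of the actual implementation:

```python
from sklearn.mixture import GaussianMixture

def train_speaker_gmm(features, n_components=64, seed=0):
    """Train one speaker model on that speaker's pooled training data.

    features: (n_frames, 39) array of MFCC vectors for one speaker.
    n_components: 64 mixture components, as in the setup above.
    Diagonal covariances are an assumption of this sketch.
    """
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          random_state=seed)
    gmm.fit(features)
    return gmm
```

The resulting model's `score(X)` returns the average per-frame log-likelihood, which is the quantity compared throughout the two-level procedure.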
To find out the confusing speakers, the training utterances of each
speaker have been tested with all the speaker models. Leave-one-out
procedure has been used. For each speaker, m confusing speakers have been
derived based on sorted log-likelihoods (for this work, m = 5). This process
is repeated for all the speakers and a confusing speakers list is derived.
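The derivation of the confusing-speaker list can be sketched as follows. The model objects are assumed (for this illustration only) to expose a `score()` method returning the average log-likelihood of a feature matrix:

```python
def confusing_speakers(speaker_models, train_feats, m=5):
    """For each speaker, score his/her training data against all other
    speaker models and keep the m best-scoring other speakers as the
    confusing-speaker list (m = 5 in this work).

    speaker_models: dict name -> model exposing score(X)
    train_feats: dict name -> feature matrix of that speaker's training data
    """
    confusing = {}
    for spk, feats in train_feats.items():
        scores = {other: model.score(feats)
                  for other, model in speaker_models.items()
                  if other != spk}        # exclude the speaker's own model
        ranked = sorted(scores, key=scores.get, reverse=True)
        confusing[spk] = ranked[:m]
    return confusing
```

In the actual procedure the scoring additionally follows a leave-one-out protocol over training utterances; that bookkeeping is omitted here for brevity.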
To derive the speaker-specific-text of a speaker, the common
phonemes (i.e., the corresponding speech segments) of the speaker and his/her
confusing speaker, available in the training utterances, are tested with both the
speaker's model and the confusing speaker's model. The average log-likelihood of each
phoneme is computed for the first speaker and his/her confusing speaker. If the
difference of the mean log-likelihoods is greater than a specific threshold, then the
corresponding phoneme is considered an acoustically dissimilar phoneme.
For each speaker, a different subset of acoustically dissimilar phonemes is derived
with respect to each of his/her closely resembling speakers. The same process is
repeated for the phonemes of all the speakers. For each speaker, common
acoustically dissimilar phonemes have been derived by considering two,
three, four and five confusing speakers. For each speaker, the speaker-
specific-text has been derived by concatenating six common acoustically
dissimilar phonemes.
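The intersection step described above can be sketched as follows, with phoneme sets represented as Python sets (a minimal illustration):

```python
def common_dissimilar(dissimilar_per_confuser, k):
    """Intersect the per-confusing-speaker dissimilar phoneme lists over
    the first k confusing speakers, as done here for k = 2, 3, 4 and 5.

    dissimilar_per_confuser: list of phoneme lists, one per confusing
    speaker, each derived by the likelihood-threshold comparison above.
    """
    sets = [set(s) for s in dissimilar_per_confuser[:k]]
    common = set.intersection(*sets) if sets else set()
    return sorted(common)
```

Six phonemes drawn from this common set are then concatenated to form the speaker-specific-text.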
Testing has been carried out in two levels. In the first-level testing,
using the conventional GMM-based system, the 2-best results have been derived. Let
us denote the first speaker in the result as speaker A and the second speaker as speaker
B. Second-level testing checks whether speaker B is the actual speaker
or not. The steps involved in second-level testing are as follows:
1. If speaker A is present in the confusing speaker list of B, then the
speaker-specific-text is formulated using the acoustically dissimilar
phonemes of speaker B with respect to speaker A. In the
second-level testing, the test speaker has to be asked to utter the
speaker-specific-text. Since the TIMIT corpus is used, the speech
utterance corresponding to the speaker-specific-text has been formulated from the
test utterances of the test speaker by concatenating six randomly
picked acoustically dissimilar phonemes. Using the speaker-
specific-text, testing has been performed with speaker models A
and B. If the log-likelihood of speaker model B is higher than
that of speaker A, then speaker B is declared
the winner; otherwise speaker A is declared the winner.
2. If speaker A is not present in the confusing speaker list of speaker
B, then the speaker-specific-text is formulated using the common
unique phonemes of speaker B, which are derived by considering
all the confusing speakers of speaker B. Using the speaker-specific-text,
testing is done with speakers A and B. If the log-likelihood of
speaker model B is higher than that of speaker A, then
speaker B is declared the winner; otherwise speaker A is
declared the winner.
3. If the number of acoustically dissimilar phonemes is less than six
while taking the common unique phonemes of speaker B, then speaker
A is declared the winner, since a minimum of six acoustically
dissimilar phonemes is required to formulate the speaker-specific-text.
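The three decision steps above can be sketched as follows. This is a simplified illustration: the data structures and the `make_text_feats` helper (which turns a phoneme list into a test feature matrix) are hypothetical names introduced only for this sketch.

```python
def second_level(spk_a, spk_b, confusing, dissim_pair, common_unique,
                 model_a, model_b, make_text_feats, min_phonemes=6):
    """Decide between the 2-best speakers A and B using speaker-specific-text.

    confusing: dict speaker -> list of confusing speakers
    dissim_pair: dict (speaker, confuser) -> dissimilar phoneme list
    common_unique: dict speaker -> common unique phoneme list
    Returns the name of the declared winner.
    """
    if spk_a in confusing[spk_b]:
        phonemes = dissim_pair[(spk_b, spk_a)]   # step 1: pairwise set
    else:
        phonemes = common_unique[spk_b]          # step 2: common unique set
    if len(phonemes) < min_phonemes:             # step 3: too few phonemes
        return spk_a
    feats = make_text_feats(phonemes[:min_phonemes])
    return spk_b if model_b.score(feats) > model_a.score(feats) else spk_a
```

Note that speaker A wins by default in step 3, since no speaker-specific-text can be formed.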
When the system is tested using speech utterances that correspond
to the speaker-specific-text, the confusion error is found to be reduced
considerably compared to the conventional GMM-based classification
technique, as discussed below.
6.3 PERFORMANCE ANALYSIS
Speaker identification performance is compared between testing with
utterances containing acoustically dissimilar phonemes and testing without
considering them. To capture sufficient speaker characteristics, the
constraint set in our work is that the test utterances (words) should
have at least six phonemes. Each phoneme has a duration of approximately 80 ms.
Therefore, each test utterance is divided into 500 ms speech segments
and given for testing. Each 500 ms speech segment may contain both
acoustically similar and dissimilar phonemes (segments corresponding to
silences longer than 100 ms are not considered). The number of speakers
taken for this experiment is 192. The first-level testing has been carried out
using the 500 ms speech utterance and the 2-best results have been derived.
Second-level testing has been carried out using the speaker-specific-text. The
speaker identification accuracies using the conventional GMM and the proposed two-
level approach are tabulated in Table 6.1. In the two-level approach,
one confusing speaker has been considered for this experiment.
Table 6.1 Speaker identification performance using conventional GMM and two-level approach

S.No.  Method               Identification accuracy
1      Conventional method  66.8%
2      Two-level approach   76.36%
From Table 6.1, it can be noted that the two-level approach gives a 9.56%
absolute performance improvement over the conventional method, as specified in row 2 of Table 6.1.
We expect the speaker identification performance to increase
as the number of confusing speakers is increased. In the following
experiment, the speaker identification accuracy is measured by varying the
number of confusing speakers. The number of speakers taken for this
experiment is 192.
Figure 6.1 Comparison between number of confusing speakers used in the two-level approach and speaker identification accuracy
From Figure 6.1, it can be noted that the speaker identification
accuracy reduces when the number of confusing speakers is
increased. The reason for the reduction in accuracy is that, in the second-level
testing, for some speakers, a few acoustically dissimilar phonemes may be
missed while taking the common acoustically dissimilar phonemes of speaker B
over his/her confusing speakers. (Common unique phonemes are
considered only when speaker A is not present in the confusing speaker
list of B.)
The reason for the reduction in accuracy, when the number of
confusing speakers is increased, is explained below with the help of Figure 6.2.
Figure 6.2 Representation of the phoneme space and common phoneme space of a speaker A by considering more than one confusing speaker (AC1, AC2, AC3)

In Figure 6.2, let the phoneme space of speaker A be represented by
circle A (middle circle). The phoneme spaces of the confusing speakers of
speaker A are represented by AC1, AC2 and AC3.
The common unique phoneme space of speaker A (U_P), considering
his/her confusing speakers, is represented by

U_P = A - (A_1 U A_2 U . . . U A_m)                    (6.1)

where
m - number of confusing speakers,
A_i - common phonemes between speaker A and his/her
i-th confusing speaker AC_i, i = 1, 2, . . ., m.
From Equation (6.1), it can be noted that when the second term on the
RHS grows, the value of U_P decreases. From Figure 6.2, we can conclude
that when the number of confusing speakers is increased, the number of
common unique phonemes decreases, and hence the performance decreases.
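The set relation in Equation (6.1) can be illustrated directly with set operations; adding confusing speakers can only enlarge the union and hence shrink U_P:

```python
def unique_phoneme_space(A, common_with_confusers):
    """Compute U_P = A minus the union of the common phoneme sets A_i
    with each of the m confusing speakers (Equation 6.1).

    A: set of phonemes of speaker A
    common_with_confusers: list of sets A_1 .. A_m
    """
    union = set().union(*common_with_confusers) if common_with_confusers else set()
    return set(A) - union
```

Each additional confusing speaker contributes another A_i to the union, which is exactly why the common unique phoneme counts in Table 6.2 shrink from 16 down to 6.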
That the number of common unique phonemes decreases as the number of
confusing speakers increases can also be seen from the unique phoneme
lists of an example speaker with respect to his/her confusing
speakers, tabulated in Table 6.2.
Table 6.2 Common unique phonemes of a speaker by considering varying number of confusing speakers

2 confusing speakers: /a/ /ae/ /ah/ /ar/ /dh/ /eh/ /en/ /eng/ /i/ /ih/ /iy/ /n/ /ng/ /oy/ /w/ /y/ (total 16)
3 confusing speakers: /a/ /ae/ /ar/ /dh/ /eh/ /en/ /i/ /ih/ /iy/ /n/ /ng/ /oy/ /w/ (total 13)
4 confusing speakers: /a/ /ae/ /en/ /i/ /ih/ /iy/ /oy/ /w/ (total 8)
5 confusing speakers: /ae/ /en/ /i/ /ih/ /oy/ /w/ (total 6)
From Table 6.2, it can be noted that when the number of
confusing speakers is increased, the number of common unique phonemes
decreases.
The speaker identification performance is measured by varying the
duration of the speech utterances using the conventional GMM-based technique,
and the results are plotted in Figure 6.3. The speaker identification performance
using speaker-specific-text for a 500 ms speech utterance is 76.36%,
as specified in row 2 of Table 6.1.
Figure 6.3 Comparison between speaker identification accuracy and duration of speech utterance using conventional GMM method
From Figure 6.3, it can be noted that the 76.36% performance achieved
using speaker-specific-text with 500 ms utterances is reached by the
conventional GMM only when 700 ms
utterances are used. This shows that the classification
accuracy has increased even with test utterances of reduced duration (1.4
times shorter than the duration needed by the conventional
GMM).
The speaker identification performance is compared between the
conventional method and the two-level approach for different population sizes.
Figure 6.4 Comparison between speaker identification accuracy and number of speakers using conventional and two-level approach
From Figure 6.4, it can be noted that the speaker identification
accuracy using the two-level approach is higher than that of the conventional method
across different population sizes as well.
6.4 MOTIVATION FOR CREATING A NEW SPEECH CORPUS
Even though our proposed technique gives better accuracy than
the conventional GMM-based technique, two reasons were
identified for misclassification in our proposed technique, as given
below:
1. Some acoustically dissimilar phonemes may be missed while
taking the common phonemes between the first speaker and his/her
confusing speakers.
2. Acoustically dissimilar phonemes are derived based on the
average log-likelihood value. If some phonemes
have only a few examples, then a statistical
parameter like the mean of the log-likelihoods is not
appropriate. In Section 6.2, since the TIMIT speech corpus is
used, many phonemes have very few examples
(even just two); using the mean value in such cases is not appropriate and
might have led to a false set of phonemes being identified as unique.
These errors can be avoided by creating our own phonetically
balanced speech corpus.
6.4.1 Speech Data Collection
For the present study, we have created and used a new speech
corpus. For this purpose, we have collected 142
sentences from the TIMIT corpus that together provide a sufficient number (minimum 30) of
examples for all 45 phonemes; including silence, 46 phoneme classes are
considered in this work. The speech data is recorded using a head-
mounted microphone at a 16 kHz sampling rate. The frequency response of
the microphone is 20 Hz - 20 kHz, and the same microphone is used for all
the speakers. The speech data is recorded in a laboratory environment,
without any background noise such as fan or AC noise. The Wavesurfer
tool is used for speech data collection, and for removing long silences at the
beginning and end of the utterances. We have collected speech utterances
from 50 speakers, comprising 43 female speakers and 7 male speakers. Each
utterance is of approximately 3.5 sec duration. (NIST SRE corpora cannot be
used for the proposed approach, because our approach requires speech data
to be collected for the speaker-specific-text.) The following is the phoneme
list considered for this task:
Table 6.3 Phoneme list considered for creating the speech corpus

Phoneme  Word      Phoneme  Word
/sh/     Shout     /jh/     Joke
/i/      Beet      /ih/     Bit
/h/      Hay       /d/      Day
/eh/     Bet       /ah/     But
/k/      Key       /r/      Right
/s/      Sound     /w/      Wire
/u/      Boot      /ao/     Bought
/en/     Button    /ar/     Butter
/g/      Gay       /l/      Like
/n/      Noon      /oy/     Boy
/ae/     Bat       /a/      About
/m/      Moon      /dh/     Then
/t/      Tea       /iy/     Beet
/v/      Vote      /f/      Fish
/p/      Pea       /ow/     Boat
/ch/     Choke     /b/      Bee
/aa/     Father    /em/     Bottom
/ng/     Sing      /ay/     Bite
/th/     Thin      /ey/     Bait
/aw/     About     /er/     Bird
/z/      Zoo       /el/     Bottle
/zh/     Azure     /eng/    Washington
The entire speech data is automatically segmented at the phoneme level using
the forced Viterbi algorithm (Brugnara et al 1993). For the forced Viterbi
algorithm, the speech data, the corresponding phonetic transcriptions, and the
phoneme models have to be provided. Since the text is taken from the
TIMIT corpus, the phonetic transcription provided in the TIMIT corpus is
used. The monophone models are also trained using the speech data of the TIMIT
corpus. Separate sets of models are trained for male and female speakers,
and used for deriving the time-aligned phonetic transcriptions for the newly
collected speech data.
The following are the steps taken for phoneme segmentation.
1. The utterances of one speaker are segmented manually. The HTK
transcription format has been used to represent the phonetic transcription,
as follows: start time, end time, phoneme identity
(times are represented in units of 100 nanoseconds).
2. Hidden Markov models are trained for all the 46 phonemes
using the manually segmented data.
3. Using these models, a second speaker's speech data
is automatically segmented at the phoneme level using the forced
Viterbi algorithm.
4. Phoneme models are then created using these two speakers'
speech data.
5. Using these phoneme models, a third speaker's speech data is
automatically segmented at the phoneme level using the forced
Viterbi algorithm.
6. This procedure is repeated until five speakers are covered, and
phoneme models are created from the five speakers' speech
data.
7. Using the phoneme models created at step 6, the remaining 38
female speakers' speech data has been automatically
segmented using the forced Viterbi algorithm.
8. Phoneme models are created using all the 43 female speakers'
speech data.
9. The entire female speech data is then automatically
segmented again using the forced Viterbi algorithm.
The same steps are followed for segmenting the male speech data.
The phoneme boundaries are refined to a greater extent by applying the
forced Viterbi algorithm iteratively.
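The bootstrapping procedure above can be summarised in a simplified sketch. Here `train_hmms` and `force_align` stand in for an HMM toolkit such as HTK; their signatures are hypothetical, and the per-speaker incremental schedule of steps 1-9 is collapsed into one loop for brevity:

```python
def bootstrap_segmentation(speakers, manual_labels, train_hmms, force_align):
    """Bootstrap phoneme segmentation: start from one manually segmented
    speaker, then alternately retrain phoneme HMMs on all labelled data
    and force-align the next speaker's data.

    speakers: ordered list of speaker IDs; speakers[0] is the manual seed.
    manual_labels: hand-made segmentation for the first speaker.
    train_hmms(labelled): placeholder -> phoneme models
    force_align(spk, models): placeholder -> time-aligned labels
    """
    labelled = {speakers[0]: manual_labels}          # step 1: manual seed
    for spk in speakers[1:]:
        models = train_hmms(labelled)                # retrain on all labels
        labelled[spk] = force_align(spk, models)     # forced Viterbi pass
    # final pass: re-align everyone with models trained on all speakers
    models = train_hmms(labelled)
    return {spk: force_align(spk, models) for spk in labelled}
```

The final re-alignment pass corresponds to the iterative boundary refinement mentioned above.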
6.5 SPEAKER IDENTIFICATION USING THE NEW SPEECH CORPUS
The proposed technique is evaluated on the speaker identification
task using the new speech corpus, and the results are compared with the conventional
GMM-based classifier. Speech utterances are collected from 50 speakers.
For each speaker, among 142 sentences, 130 utterances are used for training
and 12 utterances are used for testing. For each speaker, a GMM with 64
mixture components has been trained, considering Mel-frequency cepstral
coefficients (13 static + 13 dynamic + 13 acceleration) as the features.
To find out the confusing speakers, the training utterances of each
speaker have been tested with all the speaker models. Leave-one-out
procedure has been used. For each speaker, m confusing speakers have been
derived based on sorted log likelihoods (for this work, m = 5). This process is
repeated for all the speakers and a confusing speakers list is derived.
To derive the speaker-specific-text of a speaker, the common phonemes (i.e.,
the corresponding speech segments) of the speaker and his/her confusing speaker,
available in the training utterances, are tested with his/her model and his/her
confusing speaker's model. The average log-likelihood of each phoneme is
computed for the first speaker and the confusing speaker. Based on the sorted
differences of the average log-likelihoods, the first twenty phonemes are considered
acoustically dissimilar phonemes. For each speaker, a different subset of
acoustically dissimilar phonemes is derived with respect to each of his/her
closely resembling speakers. The same process is repeated for the phonemes of all
the speakers. For each speaker, common acoustically dissimilar phonemes
have been derived by considering two, three, four and five confusing
speakers. For each speaker, the speaker-specific-text has been derived by
concatenating six common acoustically dissimilar phonemes. By the same
procedure, acoustically similar phonemes are derived by taking the last fifteen
phonemes from the sorted list. (This is to study the effect of
acoustically dissimilar versus acoustically similar phonemes on the
speaker identification task.)
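The ranking just described (the top twenty phonemes treated as dissimilar, the last fifteen as similar) can be sketched as follows; the input dictionaries of per-phoneme average log-likelihoods are illustrative:

```python
def split_by_likelihood_gap(avg_ll_self, avg_ll_confuser,
                            n_dissimilar=20, n_similar=15):
    """Rank the shared phonemes by the magnitude of the difference in
    average log-likelihood between the speaker's own model and the
    confusing speaker's model; return (dissimilar, similar) phoneme lists.

    avg_ll_self, avg_ll_confuser: dict phoneme -> average log-likelihood
    """
    gaps = {p: abs(avg_ll_self[p] - avg_ll_confuser[p]) for p in avg_ll_self}
    ranked = sorted(gaps, key=gaps.get, reverse=True)
    return ranked[:n_dissimilar], ranked[-n_similar:]
```

The dissimilar list feeds the speaker-specific-text; the similar list is used only for the contrastive experiment of Figure 6.5.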
Testing is carried out in two levels. In the first-level testing, using the
conventional GMM-based system, the 2-best results are derived. Let us denote the
first speaker in the result as speaker A and the second speaker as speaker B.
Second-level testing checks whether speaker B is the actual speaker or
not.
Steps involved in second-level testing are as follows:
1) If speaker A is present in the confusing speaker list of B, then
the speaker-specific-text is formulated using the acoustically
dissimilar phonemes of speaker B with respect to speaker A.
Using the speaker-specific-text, testing is performed with
speaker models A and B. If the log-likelihood of speaker model B
is higher than that of speaker A, then speaker B is declared
the winner; otherwise speaker A is declared the winner.
(Instead of concatenating phonemes, as a next experiment, a
readable text will be formulated using the unique phonemes,
and testing will be carried out.)
2) If speaker A is not present in the confusing speaker list of
speaker B, then the speaker-specific-text is formulated using the
common unique phonemes of speaker B, which are derived by
considering all the confusing speakers of speaker B. Using the
speaker-specific-text, testing is carried out with speakers A and
B. If the log-likelihood of speaker model B is higher than
that of speaker A, then speaker B is declared the winner;
otherwise speaker A is declared the winner.
When the system is tested using speech utterances that correspond
to the speaker-specific-text, the confusion error is found to be reduced
considerably compared to the conventional GMM-based classification
technique, as discussed below.
6.6 PERFORMANCE ANALYSIS OF THE TWO-LEVEL APPROACH
(USING THE NEW SPEECH CORPUS)
Speaker identification performance is compared between testing with
utterances containing acoustically dissimilar phonemes and testing without
considering them. Each test utterance is divided into 500 ms
(approximately 6 phonemes) speech segments and given for testing. Each 500 ms
speech segment may contain both acoustically similar and dissimilar phonemes
(segments corresponding to silences longer than 100 ms are not considered).
For each speaker, the number of 500 ms test utterances considered for this
experiment is 60, and the total number of test utterances is 3000. The first-
level testing has been carried out using the 500 ms speech utterances and the 2-
best results have been derived. Second-level testing has been carried out using
the speaker-specific-text formed by concatenating six unique phonemes, as
mentioned earlier. The speaker identification accuracies using the conventional
GMM (up to the first level) and the proposed two-level approach are tabulated in
Table 6.4. Second-level testing is done using common unique phonemes
derived by considering one, two, and up to five confusing speakers; the identification
accuracy is the same in all these cases. For each test case, a single speaker-specific-
text is considered for this experiment.
Table 6.4 Speaker identification performance using conventional GMM and two-level approach (new speech corpus)

S.No.  Method                                    Identification accuracy
1      Conventional method (first-level output)  82%
2      Two-level approach                        88.5%
From Table 6.4, it can be noted that the two-level approach gives a 6.5% absolute
performance improvement over the conventional method, as specified in row 2 of Table 6.4.
The following experiment measures the speaker
identification performance by using only the common acoustically dissimilar
phonemes (common unique phonemes) and the common acoustically similar
phonemes (common phonemes) of each speaker, varying the number of
confusing speakers. For each speaker, the number of test utterances
considered for this experiment is 10 (formed by concatenating six common unique
phonemes or six common phonemes). The total number of test utterances is 500.
Figure 6.5 Speaker identification accuracy in terms of number of confusing speakers
From Figure 6.5, it can be noted that, using common unique
phonemes, the speaker identification accuracy increases when the
number of confusing speakers is increased from one to three, and reduces
when the number of confusing speakers is increased to four or five. The
reason for the reduction in accuracy is that, in the second-level testing, for
some speakers, the number of acoustically dissimilar phonemes may be
reduced while taking the common acoustically dissimilar phonemes of
speaker B over his/her confusing speakers. From Figure 6.5, it can also be
noted that, using common acoustically similar phonemes (common phonemes),
the speaker identification accuracy reduces as the number of confusing
speakers is increased.
The speaker identification performance is measured by varying the
duration of the speech utterances using the conventional GMM-based technique,
and the results are plotted in Figure 6.6. The speaker identification performance
using speaker-specific-text for a 500 ms speech utterance is 88.5%, as
specified in row 2 of Table 6.4.
Figure 6.6 Comparison between speaker identification accuracy and duration of speech utterance using conventional GMM method
Further, it is noted that the 88.5% performance achieved
using speaker-specific-text with 500 ms utterances is
reached by the conventional GMM only when 700 ms utterances
are used. This shows that the classification accuracy has increased even
with test utterances of reduced duration (shorter by a factor of 1.4).
The speaker identification performance is compared between the
conventional method and the two-level approach for different population sizes.
From Figure 6.7, it can be noted that the speaker identification accuracy using the
two-level approach is higher than that of the conventional method across
different population sizes as well.
Figure 6.7 Comparison between speaker identification accuracy and number of speakers using conventional and two-level approach
6.7 SUMMARY
In this chapter, we have proposed to use speech utterances that
correspond to a speaker-specific-text for speaker recognition tasks. We have
shown that the classification accuracy, in a speaker identification task using
two-level approach, is considerably higher than that of a conventional GMM-
based technique, if the speech utterances correspond to the unique phonemes
are used. We have experimented the two-level approach, using TIMIT corpus
and the new speech corpus. In speaker identification task, the speaker is not
cooperative to utter the speaker-specific-text during the second-level of
testing. Our proposed work is a hypothetical experiment to prove that the
classification accuracy will be increased when the speaker-specific-text is
used. In fact, the proposed approach can be used for speaker verification task
also, with required modification.