TRANSCRIPT
Dimension-Decoupled Gaussian Mixture Model
for Short Utterance Speaker Recognition
Thilo Stadelmann, Bernd Freisleben, Ralph Ewerth
University of Marburg, Germany
International Conference on Pattern Recognition, Istanbul, Turkey
24 August 2010
2
Content
1. Introduction
2. Related Work
3. Idea and Justification
4. Implementation
5. Results
6. Conclusions
3
Introduction
Why speaker recognition?
Speaker recognition is useful for:
– User authentication (e.g., telephone services)
– Video indexing by person (e.g., movies)
– Preprocessing for automatic speech recognition (e.g., speaker adaptation)

Scenarios have in common:
– Additional training and testing data is unavailable…
• E.g., movies: typical speaker turn duration of 1-2s
– …or costly
• E.g., access control and speaker adaptation: the user has to provide enrollment data, but just wants to proceed with his/her actual purpose
4
Introduction
And why on short utterances?
But: typical speaker recognition systems need:
– 30-100s of speech on average for training (evaluation: 10s)
– 7-10s as a minimum for training in specialized forensic software

=> Furui ["40 Years of…", 2009]: "The most pressing issues […] for speaker recognition are rooted in […] insufficient data."
5
Related Work
How is this dealt with normally?
Use of additional data, assumptions or modalities:
– Anchor models, phonetic structure [Merlin et al., 1999]
– Speech content, word dependencies [Larcher et al., 2008]
– Video in multimodal data streams [Larcher et al., 2008]
– Subspace models, confidence intervals [Vogt et al., 2008/2009]
Combine this with the typical Gaussian Mixture Model (GMM) approach
6
Idea and Justification
How is this dealt with here?
The typical approach to speaker recognition is to use a statistical voice model
If a similar model formulation can be found that uses fewer parameters…
=> less data is necessary for reliable estimates
=> side effect: improved runtime due to the compact model
The typical (almost omnipresent) statistical voice model is the GMM
=> optimize the GMM to employ fewer parameters
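A back-of-the-envelope calculation shows what is at stake. The numbers below are illustrative only: the feature dimension and mixture count match the 19-dim. MFCC / 32-mixture setup used later in the talk, but the average number of mixtures per dimension (`k_avg = 3`) is an assumed value, not a figure from the paper.

```python
# Rough parameter counts: standard diagonal GMM vs. per-dimension 1-D GMMs.
D = 19                                   # feature dimensions (19-dim. MFCCs)

# Standard diagonal-covariance GMM with 32 mixtures:
# each mixture has 1 weight + D means + D variances.
K = 32
gmm_params = K * (1 + 2 * D)             # = 1248 parameters

# Dimension-decoupled model: one small 1-D GMM per dimension; assume
# 3 mixtures on average, each with 1 weight + 1 mean + 1 variance.
k_avg = 3
dd_params = D * (3 * k_avg)              # = 171 parameters

reduction = 1 - dd_params / gmm_params   # roughly 86% fewer parameters
print(gmm_params, dd_params, round(reduction, 2))
```

Even with this assumed average of 3 mixtures per dimension, the parameter count drops by roughly 86%, which is the mechanism behind the reduced data requirements claimed above.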
7
Idea and Justification
Idea
Observations:
– Some dimensions (e.g., 0, 1, 4, 7) are multimodal/skewed
=> need a Gaussian mixture to be modeled accurately
– Some dimensions (6, 11, 13, 18) look Gaussian themselves
=> why spend the parameters of 31 more mixtures on them?

[Figure: per-dimension plot of a 32-mixture GMM modeling 19-dim. MFCCs; upper blue curve: joint density]
8
Idea and Justification
Justification
Idea: model each dimension individually with the optimal number of mixtures for its marginal density

Promising:
– LPCCs are similar to MFCCs in the (non-)Gaussianity of individual dimensions
– LSPs are Gaussian-like in every dimension
– Pitch is quite non-Gaussian
– Combinations are common
=> method is generally applicable

Permissible:
– The standard GMM already treats dimensions as decorrelated via its diagonal covariance
– Chances are that additionally treating them as independent doesn't miss important information for speaker recognition
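The step from "decorrelated" to "independent" can be written down compactly. In the notation below (mine, a sketch of the factorization just described): a standard diagonal-covariance GMM sums over K joint mixtures, while the dimension-decoupled model swaps the sum and the product, giving each dimension its own 1-D mixture with its own mixture count K_d.

```latex
% Standard diagonal-covariance GMM: one sum over K joint mixtures
p(x) = \sum_{k=1}^{K} w_k \prod_{d=1}^{D}
       \mathcal{N}\!\left(x_d;\, \mu_{k,d},\, \sigma_{k,d}^2\right)

% Dimension-decoupled model: an independent 1-D mixture per dimension
p(x) = \prod_{d=1}^{D} \sum_{k=1}^{K_d}
       w_{d,k}\, \mathcal{N}\!\left(x_d;\, \mu_{d,k},\, \sigma_{d,k}^2\right)
```

The two coincide only if the true density factorizes across dimensions; the "permissible" argument above is that, for speaker recognition, what is lost by assuming this is small. Parameter-wise, the first form costs K(2D+1) values, the second only \(\sum_d 3K_d\).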
9
Implementation
How is it put into existence?
Wrapper around an existing GMM implementation:
– Build a single GMM per dimension of the feature set
– Optimize the number of mixtures in each dimension via the Bayesian Information Criterion (BIC)
– Apply an orthogonal transform prior to training/test to further decorrelate the data

=> Dimension-Decoupled GMM (DD-GMM) is a tuple (#mixtures, GMM) per dimension plus a transformation matrix
=> easy to integrate with existing GMM implementations
=> combinable with other short utterance schemes from related work
10
Results
Speaker recognition performance
Until 45% removal: nearly no difference

>50% removal: DD-GMM is 7.56% (avg.) better than the best competitor with the same amount of data

>50% removal: DD-GMM is as good as the best competitor with 4.17% (avg.) less data

The effect is stronger when only training data is removed

The effect is still visible when only test data is removed
[Plot: speaker identification rate on TIMIT vs. % of training/test data removed (100%: ca. 20/5s)]
Competitors in the 630-speaker identification experiment:
– BIC-GMM: GMM with #multimodal mixtures optimized via BIC
– 32-GMM: multimodal GMM with always 32 mixtures
– DD-GMM: new dimension-decoupled GMM
11
Results
Evolution of parameter count
DD-GMM uses 90.98% (avg.) fewer parameters than the BIC-GMM

Best in the literature [Liu and He, 1999]: 75% reduction, without a better identification rate
12
Results
Run time
DD-GMM train time: 2.5 times longer than 32-GMM (best), but still 3.5 times faster than real time

DD-GMM test time: 2.1 times faster than BIC-GMM (best), i.e., 54.5% of real time
Test phase is practically more relevant (occurs more frequently)
13
Conclusions
What remains
DD-GMM gives more reliable speaker recognition results when data is lacking

DD-GMM is computationally more efficient when data is plentiful

DD-GMM performs speaker recognition where standard GMM approaches are no longer usable
– >80% identification rate with <5.5/1.3s training/test data

DD-GMM is easy to integrate with other systems
– Wrapper comprises effectively 80 lines of code around an existing GMM
– Approach is supplemental to other short utterance schemes

Future work:
– Apply and test on other features and data sets beyond speaker recognition