TRANSCRIPT
Dimension-Decoupled Gaussian Mixture Model
for Short Utterance Speaker Recognition
Thilo Stadelmann, Bernd Freisleben, Ralph Ewerth
University of Marburg, Germany
International Conference on Pattern Recognition, Istanbul, Turkey
24 August 2010
2
Content
1. Introduction
2. Related Work
3. Idea and Justification
4. Implementation
5. Results
6. Conclusions
3
Introduction
Why speaker recognition?
Speaker recognition is useful for:
– User authentication (e.g., telephone services)
– Video indexing by person (e.g., movies)
– Preprocessing for automatic speech recognition (e.g., speaker adaptation)

Scenarios have in common:
– Additional training and testing data is unavailable…
• E.g., movies: typical speaker turn duration of 1-2s
– …or costly
• E.g., access control and speaker adaptation: the user has to provide enrollment data, but just wants to proceed with his/her actual purpose
4
Introduction
And why on short utterances?
But: typical speaker recognition systems need:
– 30-100s of speech on average for training (evaluation: 10s)
– 7-10s as a minimum for training in specialized forensic software

=> Furui ["40 Years of…", 2009]: "The most pressing issues […] for speaker recognition are rooted in […] insufficient data."
5
Related Work
How is this dealt with normally?
Use of additional data, assumptions or modalities:
– Anchor models, phonetic structure [Merlin et al., 1999]
– Speech content, word dependencies [Larcher et al., 2008]
– Video in multimodal data streams [Larcher et al., 2008]
– Subspace models, confidence intervals [Vogt et al., 2008/2009]
Combine this with the typical Gaussian Mixture Model (GMM) approach
6
Idea and Justification
How is this dealt with here?
The typical approach to speaker recognition is to use a statistical voice model
If a similar model formulation can be found that uses fewer parameters…
=> less data is necessary for reliable estimates
=> side effect: improved runtime due to the compact model
The typical (almost omnipresent) statistical voice model is the GMM
=> optimize the GMM to employ fewer parameters
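A back-of-the-envelope calculation shows what is at stake. The numbers below are illustrative only: the feature dimension and mixture count match the 19-dim. MFCC / 32-mixture setup used later in the talk, but the average number of mixtures per dimension (`k_avg = 3`) is an assumed value, not a figure from the paper.

```python
# Rough parameter counts: standard diagonal GMM vs. per-dimension 1-D GMMs.
D = 19                                   # feature dimensions (19-dim. MFCCs)

# Standard diagonal-covariance GMM with 32 mixtures:
# each mixture has 1 weight + D means + D variances.
K = 32
gmm_params = K * (1 + 2 * D)             # = 1248 parameters

# Dimension-decoupled model: one small 1-D GMM per dimension; assume
# 3 mixtures on average, each with 1 weight + 1 mean + 1 variance.
k_avg = 3
dd_params = D * (3 * k_avg)              # = 171 parameters

reduction = 1 - dd_params / gmm_params   # roughly 86% fewer parameters
print(gmm_params, dd_params, round(reduction, 2))
```

Even with this assumed average of 3 mixtures per dimension, the parameter count drops by roughly 86%, which is the mechanism behind the reduced data requirements claimed above.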
7
Idea and Justification
Idea
Observations:
– Some dimensions (e.g., 0, 1, 4, 7) are multimodal/skewed
=> need a Gaussian mixture to be modeled accurately
– Some dimensions (6, 11, 13, 18) look Gaussian themselves
=> why spend the parameters of 31 more mixtures on them?

[Figure: per-dimension plot of a 32-mixture GMM modeling 19-dim. MFCCs; upper blue curve: joint density]
8
Idea and Justification
Justification
Idea: model each dimension individually with the optimal number of mixtures for its marginal density

Promising:
– LPCCs are similar to MFCCs in the (non-)Gaussianity of individual dimensions
– LSPs are Gaussian-like in every dimension
– Pitch is quite non-Gaussian
– Combinations are common
=> method is generally applicable

Permissible:
– The standard GMM already treats dimensions as decorrelated via its diagonal covariance
– Chances are that additionally treating them as independent doesn't miss important information for speaker recognition
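The step from "decorrelated" to "independent" can be written down compactly. In the notation below (mine, a sketch of the factorization just described): a standard diagonal-covariance GMM sums over K joint mixtures, while the dimension-decoupled model swaps the sum and the product, giving each dimension its own 1-D mixture with its own mixture count K_d.

```latex
% Standard diagonal-covariance GMM: one sum over K joint mixtures
p(x) = \sum_{k=1}^{K} w_k \prod_{d=1}^{D}
       \mathcal{N}\!\left(x_d;\, \mu_{k,d},\, \sigma_{k,d}^2\right)

% Dimension-decoupled model: an independent 1-D mixture per dimension
p(x) = \prod_{d=1}^{D} \sum_{k=1}^{K_d}
       w_{d,k}\, \mathcal{N}\!\left(x_d;\, \mu_{d,k},\, \sigma_{d,k}^2\right)
```

The two coincide only if the true density factorizes across dimensions; the "permissible" argument above is that, for speaker recognition, what is lost by assuming this is small. Parameter-wise, the first form costs K(2D+1) values, the second only \(\sum_d 3K_d\).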
9
Implementation
How is it put into existence?
Wrapper around an existing GMM implementation:
– Build a single GMM per dimension of the feature set
– Optimize the number of mixtures in each dimension via the Bayesian Information Criterion (BIC)
– Apply an orthogonal transform prior to training/test to further decorrelate the data

=> Dimension-Decoupled GMM (DD-GMM) is a tuple (#mixtures, GMM) per dimension plus a transformation matrix
=> easy to integrate with existing GMM implementations
=> combinable with other short utterance schemes from related work
10
Results
Speaker recognition performance
Until 45% removal: nearly no difference

>50% removal: DD-GMM is 7.56% (avg.) better than the best competitor with the same amount of data

>50% removal: DD-GMM is as good as the best competitor with 4.17% (avg.) less data

The effect is stronger when only training data is removed

The effect is still visible when only test data is removed
[Plot: speaker identification rate on TIMIT vs. % of training/test data removed (100%: ca. 20/5s)]
Competitors in the 630-speaker identification experiment:
– BIC-GMM: GMM with #multimodal mixtures optimized via BIC
– 32-GMM: multimodal GMM with always 32 mixtures
– DD-GMM: new dimension-decoupled GMM
11
Results
Evolution of parameter count
DD-GMM uses 90.98% (avg.) fewer parameters than the BIC-GMM

Best in the literature [Liu and He, 1999]: 75% reduction, without a better identification rate
12
Results
Run time
DD-GMM train time: 2.5 times longer than 32-GMM (best), but still 3.5 times faster than real time

DD-GMM test time: 2.1 times faster than BIC-GMM (best), i.e., 54.5% of real time
Test phase is practically more relevant (occurs more frequently)
13
Conclusions
What remains
DD-GMM gives more reliable speaker recognition results when data is lacking

DD-GMM is computationally more efficient when data is plentiful

DD-GMM performs speaker recognition where standard GMM approaches are no longer usable
– >80% identification rate with <5.5/1.3s training/test data

DD-GMM is easy to integrate with other systems
– Wrapper comprises effectively 80 lines of code around an existing GMM
– Approach is supplemental to other short utterance schemes

Future work:
– Apply and test on other features and data sets beyond speaker recognition