gmm-svm+up-avr

Acoustic Vector Re-sampling for GMMSVM-Based Speaker Verification

Man-Wai MAK and Wei RAO
The Hong Kong Polytechnic University

[email protected]
www.eie.polyu.edu.hk/~mwmak/

Outline

• GMM-UBM for Speaker Verification
• GMM-SVM for Speaker Verification
• Data-Imbalance Problem in GMM-SVM
• Utterance Partitioning for GMM-SVM
• Experiments on NIST SRE

Speaker Verification

To verify the identity of a claimant based on his or her voice

Is this Mary’s voice?

I am Mary

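Verification Process

[Figure: verification process. The claimant's utterance ("I'm John") goes through feature extraction. The features are scored against John's model (John's "voiceprint", +) and the impostor model (impostors' "voiceprints", −). The scores pass to score normalization and decision making, where a decision threshold gives accept/reject.]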

Acoustic Features

• Speech is a continuous evolution of the vocal tract
• Need to extract a sequence of spectra or a sequence of spectral coefficients
• Use a sliding window: 25 ms window, 10 ms shift

[Figure: log|X(ω)| → DCT → MFCC]
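The sliding-window analysis above can be sketched roughly as follows. This is a minimal illustration, not the authors' front end: it assumes an 8 kHz sampling rate and a Hamming window (neither is stated in the slides), and it omits the mel filterbank, so the output is a plain cepstrum rather than true MFCCs. The function names are hypothetical.

```python
import numpy as np

def frame_signal(signal, sample_rate, win_ms=25.0, shift_ms=10.0):
    """Split a 1-D signal into overlapping frames (25 ms window, 10 ms shift)."""
    win = int(round(sample_rate * win_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    n_frames = 1 + (len(signal) - win) // shift
    # Row t holds the sample indices of frame t.
    idx = np.arange(win)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[idx]

def cepstral_features(signal, sample_rate, n_coeffs=12):
    """log|X(w)| followed by a DCT, one feature vector per frame.

    Assumption: no mel filterbank is applied, so these are plain
    cepstral coefficients, a simplification of MFCCs.
    """
    frames = frame_signal(signal, sample_rate)
    frames = frames * np.hamming(frames.shape[1])
    log_spec = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)
    n_bins = log_spec.shape[1]
    # DCT-II basis; coefficient 0 is skipped.
    k = np.arange(1, n_coeffs + 1)[:, None]
    n = np.arange(n_bins)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_bins))
    return log_spec @ basis.T
```

With these settings, one second of 8 kHz audio yields 200-sample frames shifted by 80 samples.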

GMM-UBM for Speaker Verification

• The acoustic vectors (MFCCs) of speaker s are modeled by a probability density function parameterized by

$$\Lambda^{(s)} = \{\lambda_j^{(s)}, \mu_j^{(s)}, \Sigma_j^{(s)}\}_{j=1}^{M}$$

• Gaussian mixture model (GMM) for speaker s:

$$p(\mathbf{x} \mid \Lambda^{(s)}) = \sum_{j=1}^{M} \lambda_j^{(s)}\, p(\mathbf{x} \mid \mu_j^{(s)}, \Sigma_j^{(s)})$$
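As an illustration of how such a mixture density is evaluated on a sequence of acoustic vectors, here is a minimal sketch. It assumes diagonal covariance matrices (an assumption; the slides do not say which covariance type is used) and returns the average per-frame log-likelihood. The function name is hypothetical.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Average per-frame log p(x | Lambda) for a diagonal-covariance GMM.

    X: (T, D) acoustic vectors; weights: (M,);
    means, variances: (M, D). Diagonal covariances are an assumption.
    """
    diff = X[:, None, :] - means[None, :, :]                    # (T, M, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_joint = np.log(weights) + log_gauss                     # (T, M)
    # Log-sum-exp over the mixture components for numerical stability.
    mx = log_joint.max(axis=1, keepdims=True)
    log_px = mx[:, 0] + np.log(np.sum(np.exp(log_joint - mx), axis=1))
    return float(log_px.mean())
```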

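• The acoustic vectors of a general population are modeled by another GMM called the universal background model (UBM):

$$p(\mathbf{x} \mid \Lambda^{(\text{ubm})}) = \sum_{j=1}^{M} \lambda_j^{(\text{ubm})}\, p(\mathbf{x} \mid \mu_j^{(\text{ubm})}, \Sigma_j^{(\text{ubm})})$$

• Parameters of the UBM:

$$\Lambda^{(\text{ubm})} = \{\lambda_j^{(\text{ubm})}, \mu_j^{(\text{ubm})}, \Sigma_j^{(\text{ubm})}\}_{j=1}^{M}$$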


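[Figure: MAP adaptation. The enrollment utterance X^(s) of the client speaker adapts the universal background model Λ^(ubm) into the client speaker model Λ^(s).]

$$\mu_j^{(s)} = \alpha_j E_j(X^{(s)}) + (1 - \alpha_j)\,\mu_j^{(\text{ubm})}$$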
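The mean-adaptation step can be sketched as below. This is a hedged illustration rather than the authors' code: it assumes diagonal covariances and the commonly used data-dependent coefficient α_j = n_j / (n_j + r), where n_j is the soft count of mixture j and r is the relevance factor (the slides list r = 16 in the experimental setup but do not spell out the formula for α_j). Function names are hypothetical.

```python
import numpy as np

def map_adapt_means(X, weights, means, variances, relevance=16.0):
    """MAP-adapt the UBM means to the frames X of shape (T, D).

    Assumptions: diagonal covariances; alpha_j = n_j / (n_j + relevance).
    """
    diff = X[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_joint = np.log(weights) + log_gauss
    log_joint -= log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    post /= post.sum(axis=1, keepdims=True)            # posteriors, (T, M)
    n = post.sum(axis=0)                               # soft counts, (M,)
    E = (post.T @ X) / np.maximum(n, 1e-10)[:, None]   # E_j(X), (M, D)
    alpha = (n / (n + relevance))[:, None]
    return alpha * E + (1.0 - alpha) * means

def stack_means(adapted_means):
    """Mean stacking: concatenate the M adapted means into one supervector."""
    return adapted_means.reshape(-1)
```

A mixture that sees almost no data keeps its UBM mean, while a mixture with many frames moves toward the data mean.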


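GMM-UBM Scoring

2-class hypothesis problem:
H0: MFCC sequence X^(c) comes from the true speaker
H1: MFCC sequence X^(c) comes from an impostor

Verification score is a likelihood ratio:

$$\text{Score} = \log\frac{p(X^{(c)} \mid H_0)}{p(X^{(c)} \mid H_1)} = \log p(X^{(c)} \mid \Lambda^{(s)}) - \log p(X^{(c)} \mid \Lambda^{(\text{ubm})})$$

[Figure: GMM-UBM scoring. After feature extraction, X^(c) is scored against the speaker model Λ^(s) (+) and the background model Λ^(ubm) (−); the decision stage compares the score with a threshold to accept or reject.]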
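A minimal sketch of the scoring and decision step follows. It assumes the per-frame log-likelihoods under the speaker model and the UBM have already been computed, and it averages the ratio over frames, a common length normalization that the slides do not state. The threshold value and function names are hypothetical.

```python
import numpy as np

def llr_score(frame_loglik_speaker, frame_loglik_ubm):
    """Log-likelihood ratio score, averaged over frames (an assumption)."""
    diff = np.asarray(frame_loglik_speaker) - np.asarray(frame_loglik_ubm)
    return float(np.mean(diff))

def decide(score, threshold=0.0):
    """Accept the claim if the score exceeds the decision threshold."""
    return "accept" if score > threshold else "reject"
```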


GMM-SVM for Speaker Verification

[Figure: mapping an utterance to a GMM-supervector. utt^(s) → feature extraction → X^(s) → MAP adaptation of the UBM → mean stacking of the M adapted mean vectors μ_1, μ_2, ..., μ_M → GMM-supervector of dimension MD × 1.]

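GMM-SVM Scoring

[Figure: GMM-SVM scoring. Features are extracted from the target speaker's utterance utt^(s), the background speakers' utterances utt^(b_1), ..., utt^(b_B), and the claimant's utterance utt^(c). Using the UBM, the GMM-supervectors of target speaker s, of the background speakers, and of claimant c are computed from X^(s), X^(b_1), ..., X^(b_B), and X^(c). SVM scoring combines the kernel values K(X^(c), X^(s)), K(X^(c), X^(b_1)), ..., K(X^(c), X^(b_B)) with the weights α_0^(s), α_1^(s), ..., α_B^(s) and the bias d^(s) to give S_GMM-SVM(X^(c)).]

$$K(X^{(c)}, X^{(b)}) = \sum_{j=1}^{M} \left(\sqrt{\lambda_j}\,\Sigma_j^{-1/2}\mu_j^{(c)}\right)^{T} \left(\sqrt{\lambda_j}\,\Sigma_j^{-1/2}\mu_j^{(b)}\right)$$

$$S_{\text{GMM-SVM}}(X^{(c)}) = \alpha_0^{(s)} K(X^{(c)}, X^{(s)}) - \sum_{i \in \text{SV from bkg}} \alpha_i^{(s)} K(X^{(c)}, X^{(b_i)}) + d^{(s)}$$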
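The SVM scoring function can be sketched as follows. This illustration assumes the supervectors have already been normalized, so the kernel reduces to a plain inner product, and that the SVM weights and bias come from an already trained model. All argument names are hypothetical.

```python
import numpy as np

def svm_score(m_c, m_s, bkg_svs, alpha0, alphas_bkg, bias):
    """Score = alpha0 * K(c, s) - sum_i alpha_i * K(c, b_i) + bias.

    m_c, m_s: normalized supervectors of the claimant and target speaker.
    bkg_svs: (n_sv, dim) background supervectors chosen as support vectors.
    The kernel is assumed to be an inner product of normalized supervectors.
    """
    k_target = m_c @ m_s
    k_bkg = bkg_svs @ m_c
    return float(alpha0 * k_target - alphas_bkg @ k_bkg + bias)
```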

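GMM-UBM Scoring Vs. GMM-SVM Scoring

GMM-UBM:

$$S_{\text{GMM-UBM}}(X^{(c)}) = \log p(X^{(c)} \mid \Lambda^{(s)}) - \log p(X^{(c)} \mid \Lambda^{(\text{ubm})})$$

GMM-SVM:

$$S_{\text{GMM-SVM}}(X^{(c)}) = \alpha_0^{(s)} K(X^{(c)}, X^{(s)}) - \sum_{i \in \text{SV from bkg}} \alpha_i^{(s)} K(X^{(c)}, X^{(b_i)}) + d^{(s)}$$

where

$$K(X^{(c)}, X^{(s)}) = \sum_{j=1}^{M} \left(\lambda_j^{1/2}\,\Sigma_j^{-1/2}\mu_j^{(c)}\right)^{T} \left(\lambda_j^{1/2}\,\Sigma_j^{-1/2}\mu_j^{(s)}\right)$$

is the inner product between the normalized GMM-supervector of the claimant's utterance and the normalized GMM-supervector of the target speaker's utterance.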
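To make the "inner product of normalized supervectors" reading concrete, the sketch below computes the kernel both ways: as a sum over mixtures and as a dot product of stacked, normalized means. It assumes diagonal covariances shared with the UBM (an assumption). Function names are hypothetical.

```python
import numpy as np

def normalized_supervector(weights, means, variances):
    """Stack sqrt(lambda_j) * Sigma_j^(-1/2) * mu_j over all mixtures j.

    weights: (M,); means, variances: (M, D), diagonal covariances assumed.
    """
    return (np.sqrt(weights)[:, None] * means / np.sqrt(variances)).reshape(-1)

def supervector_kernel(weights, variances, means_c, means_s):
    """Kernel written as an explicit sum over the M mixtures."""
    total = 0.0
    for lam, var, mu_c, mu_s in zip(weights, variances, means_c, means_s):
        a = np.sqrt(lam) * mu_c / np.sqrt(var)
        b = np.sqrt(lam) * mu_s / np.sqrt(var)
        total += a @ b
    return float(total)
```

Because the two forms agree, a linear SVM trained on normalized supervectors implements this kernel directly.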


Data Imbalance in GMM-SVM

For each target speaker, we have only one utterance (GMM-supervector) from the target speaker and many utterances from the background speakers.

So, we have a highly imbalanced learning problem.

[Figure: 2-D example (axes x1, x2) of a linear SVM, C=10.0, #SV=3, slope=-1.00, separating the speaker class from the impostor class. There is only one training vector from the target speaker.]

[Figure: 2-D example (axes x1, x2) of a linear SVM, C=10.0, #SV=3, slope=-1.44.]

Orientation of the decision boundary depends mainly on impostor-class data


A 3-dimensional two-class example illustrating that the SVM decision plane is largely governed by the impostor-class supervectors.

[Figure: impostor class, speaker class, and the region in which the target-speaker vector can be located without changing the orientation of the decision plane.]



Utterance Partitioning

Partition an enrollment utterance of a target speaker into a number of sub-utterances, with each sub-utterance producing one GMM-supervector.
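A minimal sketch of the partitioning step, assuming the utterance is already a (frames × dimensions) array of acoustic vectors and that the full utterance is kept alongside its sub-utterances. The function name is hypothetical; each returned segment would then go through MAP adaptation and mean stacking to give one supervector.

```python
import numpy as np

def partition_utterance(X, n_parts):
    """Return the full sequence followed by n_parts contiguous sub-utterances.

    X: (T, D) array of acoustic vectors. Segments differ in length by at
    most one frame when T is not a multiple of n_parts.
    """
    return [X] + np.array_split(X, n_parts, axis=0)
```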

Utterance Partitioning

[Figure: utterance partitioning. After feature extraction, the target speaker's enrollment utterance utt^(s) gives the full sequence X_0^(s) and the sub-utterance sequences X_1^(s), ..., X_4^(s). Each background speaker's utterance utt^(b_1), utt^(b_2), ..., utt^(b_B) likewise gives X_0^(b_i), ..., X_4^(b_i). MAP adaptation of the UBM and mean stacking turn these into the supervectors m_0^(s), ..., m_4^(s) and m_0^(b_1), ..., m_4^(b_B), which are passed to SVM training to produce the SVM of target speaker s.]

Length-Representation Trade-off

• When the number of partitions increases, the length of each sub-utterance decreases.

• If a sub-utterance is too short, its supervector will be almost the same as that of the UBM.

[Figure: 2-D example (axes x1, x2) of a linear SVM, C=10.0, #SV=3, slope=-1.44, with the supervector corresponding to the UBM marked.]

Utterance Partitioning with Acoustic Vector Resampling (UP-AVR)

Goal: Increase the number of sub-utterances without compromising their representation power

Procedure of UP-AVR:

1. Randomly rearrange the sequence of acoustic vectors in an utterance;

2. Partition the acoustic vectors of the utterance into N segments;

3. Repeat Steps 1 and 2 R times to obtain RN+1 target-speaker supervectors (RN from the sub-utterances plus one from the full utterance).

[Figure: MFCC sequence before randomization and after randomization.]
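The procedure above can be sketched as follows. This is an illustrative outline, not the authors' implementation: `to_supervector` is a hypothetical callable standing in for MAP adaptation plus mean stacking, and the random seed is only there to make the example reproducible.

```python
import numpy as np

def up_avr(X, n_parts, n_resamplings, to_supervector, seed=0):
    """Return R*N + 1 supervectors for one utterance X of shape (T, D).

    to_supervector: hypothetical callable mapping a (t, D) array of
    acoustic vectors to one supervector (e.g. MAP adaptation followed
    by mean stacking).
    """
    rng = np.random.default_rng(seed)
    supervectors = [to_supervector(X)]             # full-utterance supervector
    for _ in range(n_resamplings):
        shuffled = X[rng.permutation(len(X))]      # Step 1: randomize frame order
        for segment in np.array_split(shuffled, n_parts, axis=0):  # Step 2
            supervectors.append(to_supervector(segment))
    return np.stack(supervectors)
```

For example, with N = 4 partitions and R = 3 resamplings the function returns 13 supervectors. Because each segment is a random sample of the whole utterance, every sub-utterance keeps a fixed length of roughly T/N frames no matter how large R is.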

[Figure: UP-AVR. After feature extraction and index randomization, the target speaker's enrollment utterance utt^(s) gives X_0^(s), X_1^(s), ..., X_4^(s), and each background speaker's utterance utt^(b_1), utt^(b_2), ..., utt^(b_B) gives X_0^(b_i), ..., X_4^(b_i). MAP adaptation of the UBM and mean stacking turn these into the supervectors m_0^(s), ..., m_4^(s) and m_0^(b_1), ..., m_4^(b_B), which are passed to SVM training to produce the SVM of target speaker s.]

• Characteristics of supervectors created by UP-AVR:
  - The average pairwise distance between sub-utterance supervectors is larger than the average pairwise distance between sub-utterance supervectors and the full-utterance supervector.
  - The average pairwise distance between the speaker class's sub-utterance supervectors and the impostor class's supervectors is smaller than the average pairwise distance between the speaker class's full-utterance supervector and the impostor class's supervectors.

[Figure: impostor-class and speaker-class supervectors, showing the sub-utt supervectors and the full-utt supervector.]

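Nuisance Attribute Projection (NAP) [Solomonoff et al., ICASSP2005]

Goal: To reduce the effect of session variability

Recall the GMM-supervector kernel:

$$K(X^{(c,h)}, X^{(s,h)}) = \sum_{j=1}^{M} \left(\sqrt{\lambda_j}\,\Sigma_j^{-1/2}\mu_j^{(c,h)}\right)^{T} \left(\sqrt{\lambda_j}\,\Sigma_j^{-1/2}\mu_j^{(s,h)}\right)$$

Define the session- and speaker-dependent supervector as

$$m^{(s,h)} = \left[\sqrt{\lambda_1}\,\Sigma_1^{-1/2}\mu_1^{(s,h)};\ \ldots;\ \sqrt{\lambda_M}\,\Sigma_M^{-1/2}\mu_M^{(s,h)}\right]$$

where s stands for speaker and h stands for session, so that $K(X^{(c,h)}, X^{(s,h)}) = m^{(c,h)T} m^{(s,h)}$.

Remove the session-dependent part (h) by removing the sub-space that causes the session variability:

$$m^{(s)} = P\,m^{(s,h)} = (I - VV^{T})\,m^{(s,h)}$$

The new kernel becomes

$$K(X^{(c)}, X^{(s)}) = \left(P\,m^{(c,h)}\right)^{T}\left(P\,m^{(s,h)}\right) = m^{(c)T} m^{(s)}$$

[Figure: the sub-space representing session variability is defined by V. The supervector m^(s,h) has a component V V^T m^(s,h) inside that sub-space; removing it leaves m^(s) = P m^(s,h).]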
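The projection step can be sketched as below. This is a minimal illustration that assumes V has orthonormal columns spanning the session-variability sub-space and that the supervectors are already normalized. Function names are hypothetical.

```python
import numpy as np

def nap_project(m, V):
    """Apply (I - V V^T) to supervector m without forming the full matrix.

    V: (dim, corank) matrix with orthonormal columns (assumed).
    """
    return m - V @ (V.T @ m)

def nap_kernel(m_c, m_s, V):
    """Kernel between two session-dependent supervectors after NAP."""
    return float(nap_project(m_c, V) @ nap_project(m_s, V))
```

Computing `V @ (V.T @ m)` avoids building the dim × dim projection matrix, which matters when the supervector dimension is large.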

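Nuisance Attribute Projection (NAP) [Solomonoff et al., ICASSP2005]

$$P^{*} = \arg\min_{P} \sum_{i,j} w_{ij} \left\| P\,m^{(h_i)} - P\,m^{(h_j)} \right\|^{2}, \qquad w_{ij} = \begin{cases} 1 & m^{(h_i)} \text{ and } m^{(h_j)} \text{ correspond to the same speaker} \\ 0 & \text{otherwise} \end{cases}$$

[Figure: the sub-space representing session variability is defined by V. The supervector m^(s,h) has a component V V^T m^(s,h) inside that sub-space; removing it leaves m^(s) = P m^(s,h).]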

Enrollment Process of GMM-SVM with UP-AVR

[Figure: enrollment. MFCCs X^(s,h) of an utterance from target-speaker s → resampling/partitioning → X_i^(s,h) → MAP (using the UBM) and mean stacking → session-dependent supervectors m_i^(s,h) → NAP → session-independent supervectors m_i^(s) → SVM training, together with the background-speaker supervectors m_i^(b_j) → SVM of target-speaker s.]

Verification Process of GMM-SVM with UP-AVR

[Figure: verification. MFCCs X^(c) of a test utterance from claimant c → MAP (using the UBM) and mean stacking → session-dependent supervector m^(c,h) → NAP → session-independent supervector m^(c) → SVM scoring with the SVM of target-speaker s → score S(X^(c)) → T-Norm (using the Tnorm models) → normalized score S~(X^(c)).]

T-Norm (Auckenthaler, 2000)

Goal: To shift and scale the verification scores so that a global decision threshold can be used for all speakers

$$\tilde{S}(X^{(c)}) = \frac{S(X^{(c)}) - \mu(X^{(c)})}{\sigma(X^{(c)})}$$

[Figure: T-norm. The claimant supervector m^(c) is scored by T-Norm SVM 1, ..., T-Norm SVM R. The mean μ(X^(c)) and standard deviation σ(X^(c)) of these scores, computed from the test utterance, are used by a Z-norm block to turn the score S(X^(c)) into the normalized score.]
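A minimal sketch of the normalization step, assuming the scores of the test utterance against the T-norm models are already available. The use of the population standard deviation (NumPy's default) is an assumption, and the function name is hypothetical.

```python
import numpy as np

def t_norm(score, tnorm_scores):
    """Shift and scale a verification score using T-norm model scores.

    tnorm_scores: scores of the same test utterance against the T-norm
    models. The population standard deviation is used (an assumption).
    """
    tnorm_scores = np.asarray(tnorm_scores, dtype=float)
    mu = tnorm_scores.mean()
    sigma = tnorm_scores.std()
    return float((score - mu) / sigma)
```

Because the mean and standard deviation are recomputed for every test utterance, the normalized scores of different trials become comparable under a single threshold.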


Experiments

Speech Data

Evaluations on NIST SRE 2002 and 2004

• NIST SRE 2002:
  - Use NIST'01 for computing the UBMs, impostor-class supervectors of SVMs, Tnorm models, and NAP parameters
  - 2983 true-speaker trials and 36287 impostor attempts
  - 2-min utterances for training and about 1-min utterances for test

• NIST SRE 2004:
  - Use the Fisher corpus for computing UBMs, impostor-class supervectors of SVMs, and Tnorm models
  - NIST'99 and NIST'00 for computing NAP parameters
  - 2386 true-speaker trials and 23838 impostor attempts
  - 5-min utterances for training and testing

Experiments

Features and Models

• 12 MFCC + 12 ΔMFCC with feature warping
• 1024-mixture GMMs for GMM-UBM
• 256-mixture GMMs for GMM-SVM
• MAP relevance factor = 16
• 300 impostor-class supervectors for GMM-SVM
• 200 T-norm models
• 64-dim session variability subspace (NAP corank, rank of V)

Results

No. of mixtures in GMM-SVM (NIST'02)

[Figure: normalized plot annotated with "Large number of features with small variance" and "Threshold below which the variances of features are deemed too small".]

Results

Effects of NAP on Different NIST SRE

[Figure: annotated "Large eigenvalues mean large session variation".]

Results

Effect of NAP Corank on Performance

[Figure: includes a "No NAP" reference.]

Results

Comparing discriminative power of GMM-SVM and GMM-SVM with UP-AVR

Results

EER and MinDCF vs. No. of Target-Speaker Supervectors (NIST'02)

Results

Varying the number of resamplings (R) and number of partitions (N) (NIST'02)

Results

[Figure: NIST'02 results.]

Experiments and Results

Performance on NIST'02

[Figure: EER values marked on the plot: 9.05%, 9.39%, and 8.16%.]

Experiments and Results

Performance on NIST'04

[Figure: legend: GMM-UBM, GMM-SVM, GMM-SVM w/ UP-AVR. EER values marked on the plot: 9.46%, 10.42%, and 16.05%.]

References

1. S.X. Zhang and M.W. Mak, "Optimized Discriminative Kernel for SVM Scoring and its Application to Speaker Verification," IEEE Trans. on Neural Networks, to appear.

2. M.W. Mak and W. Rao, "Utterance Partitioning with Acoustic Vector Resampling for GMM-SVM Speaker Verification," Speech Communication, vol. 53, no. 1, pp. 119-130, Jan. 2011.

3. M.W. Mak and W. Rao, "Acoustic Vector Resampling for GMMSVM-Based Speaker Verification," Interspeech 2010, Makuhari, Japan, Sept. 2010, pp. 1449-1452.

4. S.Y. Kung, M.W. Mak, and S.H. Lin, Biometric Authentication: A Machine Learning Approach, Prentice Hall, 2005.

5. W.M. Campbell, D.E. Sturim, and D.A. Reynolds, "Support vector machines using GMM supervectors for speaker verification," IEEE Signal Processing Letters, vol. 13, pp. 308-311, 2006.

6. D.A. Reynolds, T.F. Quatieri, and R.B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, pp. 19-41, 2000.
