Evaluation of Speaker Recognition Algorithms


Page 1: Evaluation of Speaker Recognition Algorithms

Evaluation of Speaker Recognition Algorithms

Page 2:

Speaker Recognition

• Speech Recognition and Speaker Recognition

• Speaker recognition performance depends on the channel and on noise quality.

• Two sets of data are used: one to enroll and the other to verify.

Page 3:

Data Collection and Processing

• MFCC extraction

• Test algorithms include:

AHS (Arithmetic Harmonic Sphericity)

Gaussian Divergence

Radial Basis Function

Linear Discriminant Analysis, etc.

Page 4:

Cepstrum

• The cepstrum is a common transform used to extract information from a speech signal; its x-axis is quefrency.

• It is used to separate the transfer function from the excitation signal.

X(ω) = G(ω) H(ω)

log|X(ω)| = log|G(ω)| + log|H(ω)|

F⁻¹{log|X(ω)|} = F⁻¹{log|G(ω)|} + F⁻¹{log|H(ω)|}
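As an illustration of the deconvolution above, a minimal real-cepstrum sketch (an assumed NumPy implementation; the FFT size, log floor, and test tone are arbitrary choices, not values from the slides):

```python
import numpy as np

def real_cepstrum(x, n_fft=512):
    """Real cepstrum c = F^-1{ log|X(w)| }.

    After the log, the excitation G and the vocal-tract transfer
    function H become additive, so they can be separated by liftering.
    """
    spectrum = np.fft.rfft(x, n=n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-10)  # small floor avoids log(0)
    return np.fft.irfft(log_mag, n=n_fft)

# cepstrum of a short synthetic 200 Hz tone sampled at 8 kHz
t = np.arange(512) / 8000.0
c = real_cepstrum(np.sin(2 * np.pi * 200.0 * t))
```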

Page 5:

Cepstrum

Page 6:

Cepstrum

Page 7:

MFCC Extraction

Page 8:

MFCC Extraction

• Short-time FFT

• Frame blocking and windowing. E.g., the first frame has N samples; the second frame begins M samples later (M < N), giving an overlap of N − M samples, and so on.

• Window function: y(n) = x(n) w(n). E.g., Hamming window: w(n) = 0.54 − 0.46 cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1.
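The frame-blocking and windowing step can be sketched as follows (a toy sketch: the frame length, hop size, and random test signal are illustrative choices):

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames: frame i starts at sample i*hop.

    With frame_len = N and hop = M (M < N), consecutive frames
    overlap by N - M samples, as described above.
    """
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def hamming(N):
    """w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1)), n = 0 .. N-1."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (N - 1))

x = np.random.randn(8000)                         # stand-in for one second of 8 kHz speech
frames = frame_signal(x, frame_len=256, hop=100)  # overlap of 156 samples
windowed = frames * hamming(256)                  # y(n) = x(n) w(n), per frame
```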

Page 9:

• Mel-Frequency Wrapping

The mel frequency scale is linear up to 1000 Hz and logarithmic above 1000 Hz:

mel(f) = 2595 · log10(1 + f / 700)
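The warping formula transcribes directly to code (the inverse function is added here for illustration, e.g. for placing filter-bank edges back on the Hz axis):

```python
import math

def hz_to_mel(f):
    """mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```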

Page 10:

Mel-Spaced Filter bank

Page 11:

MFCC

• Taking the cepstrum of the log mel spectrum (i.e., transforming it back to the time domain) yields the MFCCs.

The MFCCs Cn are given by

C_n = Σ_{k=1}^{K} (log S_k) · cos[ n (k − 1/2) π / K ]

where S_k are the mel power spectrum coefficients and K is the number of mel filters.
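A direct transcription of that cosine-transform step (a sketch; K = 20 filters and 12 coefficients are illustrative choices, and the flat spectrum is used only as a sanity check):

```python
import numpy as np

def mfcc_from_mel_power(S, n_ceps=12):
    """C_n = sum_{k=1..K} log(S_k) * cos(n * (k - 1/2) * pi / K)."""
    K = len(S)
    k = np.arange(1, K + 1)
    return np.array([
        np.sum(np.log(S) * np.cos(n * (k - 0.5) * np.pi / K))
        for n in range(1, n_ceps + 1)
    ])

# sanity check: a flat mel spectrum has log S_k = 0, so every C_n is 0
flat = mfcc_from_mel_power(np.ones(20))
```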

Page 12:

Arithmetic Harmonic Sphericity

• A function of the eigenvalues of a test covariance matrix relative to a reference covariance matrix for speakers x and y; it is commonly defined as

μ_AHS(x, y) = log[ tr(Cx Cy⁻¹) · tr(Cy Cx⁻¹) / D² ]

where D is the dimensionality of the covariance matrices and Cx, Cy are the covariance matrices of speakers x and y.
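Under that (hedged) definition the measure is a few lines of NumPy; the random feature matrix is a placeholder, not data from the slides:

```python
import numpy as np

def ahs(cov_x, cov_y):
    """AHS measure log[ tr(Cx Cy^-1) * tr(Cy Cx^-1) / D^2 ].

    Zero when the covariances are equal (or differ only by a scale
    factor); positive otherwise, by the AM-HM inequality.
    """
    D = cov_x.shape[0]
    t1 = np.trace(cov_x @ np.linalg.inv(cov_y))
    t2 = np.trace(cov_y @ np.linalg.inv(cov_x))
    return float(np.log(t1 * t2 / D**2))

rng = np.random.default_rng(0)
feats = rng.standard_normal((500, 12))  # placeholder for 500 frames of 12 MFCCs
C = np.cov(feats, rowvar=False)
```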

Page 13:

Gaussian Divergence

• A mixture of Gaussian densities is used to model the distribution of the features of each speaker.

Page 14:

YOHO Dataset

Sampling frequency: 8 kHz

Page 15:

Performance – AHS with 138 subjects and 24 MFCCs

Page 16:

Performance – Gaussian Div with 138 subjects and 24 MFCCs

Page 17:

Performance – AHS with 138 subjects and 12 MFCCs

Page 18:

Performance – Gaussian Div with 138 subjects and 12 MFCCs

Page 19:

Review of Probability and Statistics

• Probability Density Functions

Example 2:

f(x) = (3/2)(1 − x²) for 0 ≤ x ≤ 1; f(x) = 0 otherwise

(figure: f(x) plotted over x, with a = 0.25 and b = 0.75 marked)

The probability that x is between 0.25 and 0.75 is

P(0.25 ≤ X ≤ 0.75) = ∫ from 0.25 to 0.75 of (3/2)(1 − x²) dx = (3/2)[x − x³/3] from 0.25 to 0.75 ≈ 0.547
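The integral can be verified numerically (trapezoid rule; the grid size is an arbitrary choice):

```python
import numpy as np

# f(x) = (3/2)(1 - x^2) on [0, 1]; check P(0.25 <= X <= 0.75) numerically
x = np.linspace(0.25, 0.75, 100001)
f = 1.5 * (1.0 - x**2)
p = float(np.sum((f[:-1] + f[1:]) / 2.0 * np.diff(x)))  # trapezoid rule
# analytic value: (3/2)[x - x^3/3] from 0.25 to 0.75 = 35/64 = 0.546875
```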

Page 20:

Review of Probability and Statistics

• Cumulative Distribution Functions

The cumulative distribution function (c.d.f.) F(x) for a c.r.v. X is

F(x) = P(X ≤ x) = ∫ from −∞ to x of f(y) dy

example:

f(x) = (3/2)(1 − x²) for 0 ≤ x ≤ 1; f(x) = 0 otherwise

(figure: f(x) plotted over x, with b = 0.75 marked)

The c.d.f. of f(x) is

F(x) = ∫ from 0 to x of (3/2)(1 − y²) dy = (3/2)[y − y³/3] from 0 to x = (3/2)(x − x³/3), 0 ≤ x ≤ 1

Page 21:

Review of Probability and Statistics

• Expected Values and Variance

The expected (mean) value of a c.r.v. X with p.d.f. f(x) is

E(X) = ∫ from −∞ to ∞ of x f(x) dx

example 1 (discrete):

E(X) = 2·0.05 + 3·0.10 + … + 9·0.05 = 5.35

(figure: discrete probability histogram over the values 1.0 … 9.0)

example 2 (continuous), with f(x) = (3/2)(1 − x²) for 0 ≤ x ≤ 1 and 0 otherwise:

E(X) = ∫ from 0 to 1 of x · (3/2)(1 − x²) dx = (3/2)[x²/2 − x⁴/4] from 0 to 1 = 3/8

Page 22:

Review of Probability and Statistics

• The Normal (Gaussian) Distribution

the p.d.f. of a normal distribution is

f(x; μ, σ) = (1 / (σ√(2π))) e^{−(x − μ)² / (2σ²)},  −∞ < x < ∞

where μ is the mean and σ is the standard deviation

(figure: normal density curve with mean μ and standard deviation σ)

Page 23:

Review of Probability and Statistics

• The Normal Distribution

any arbitrary p.d.f. can be approximated by summing N weighted Gaussians (a mixture of Gaussians)

(figure: a density built from six weighted Gaussian components w1 … w6)
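A minimal sketch of such a weighted sum (pure Python, 1-D; the component parameters are whatever the mixture is fit to, not values from the slide):

```python
import math

def gmm_pdf(x, weights, means, stds):
    """1-D mixture density: sum_i w_i * N(x; mu_i, sigma_i)."""
    total = 0.0
    for w, mu, sigma in zip(weights, means, stds):
        total += w * math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))
    return total
```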

Page 24:

Review of Markov Models

A Markov Model (Markov Chain) is:

• similar to a finite-state automaton, with probabilities of transitioning from one state to another

(figure: five-state chain S1 … S5 with transition probabilities 0.5, 0.5, 0.3, 0.7, 0.1, 0.9, 0.8, 0.2, 1.0)

• transitions from state to state occur at discrete time intervals

• the model can only be in one state at any given time

Page 25:

Transition Probabilities: • no assumptions (full probabilistic description of system):

P[qt = j | qt-1= i, qt-2= k, … , q1=m]

• usually use first-order Markov Model: P[qt = j | qt-1= i] = aij

• first-order assumption: transition probabilities depend only on previous state

• aij obeys usual rules:

• sum of probabilities leaving a state = 1 (must leave a state)

Review of Markov Models

a_ij ≥ 0 for all i, j

Σ_{j=1}^{N} a_ij = 1 for all i

Page 26:

Transition Probabilities:
• example:

(figure: three-state chain S1, S2, S3 with the transition probabilities tabulated below)

Review of Markov Models

a11 = 0.0  a12 = 0.5  a13 = 0.5  a1,Exit = 0.0  (sum = 1.0)
a21 = 0.0  a22 = 0.7  a23 = 0.3  a2,Exit = 0.0  (sum = 1.0)
a31 = 0.0  a32 = 0.0  a33 = 0.0  a3,Exit = 1.0  (sum = 1.0)
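The table can be checked mechanically: with the Exit column included, every row of the transition table must sum to 1 (the matrix literal simply transcribes the slide):

```python
# rows of the slide's table: [a_i1, a_i2, a_i3, a_iExit]
A = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.7, 0.3, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
row_sums = [sum(row) for row in A]  # each must equal 1: the chain must leave a state
```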

Page 27:

Transition Probabilities: • probability distribution function:

Review of Markov Models

(figure: chain S1 → S2 → S3 with self-loop probability 0.4 on S2 and exit probability 0.6)

p(remain in state S2 exactly 1 time) = 0.4 · 0.6 = 0.240
p(remain in state S2 exactly 2 times) = 0.4 · 0.4 · 0.6 = 0.096
p(remain in state S2 exactly 3 times) = 0.4 · 0.4 · 0.4 · 0.6 = 0.038

= exponential decay (characteristic of Markov Models)
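The geometric decay above can be reproduced directly (self-loop probability 0.4 and exit probability 0.6, as in the slide):

```python
def duration_prob(self_loop, d):
    """p(remain in a state exactly d times) = a^d * (1 - a) for self-loop a."""
    return (self_loop ** d) * (1.0 - self_loop)

probs = [duration_prob(0.4, d) for d in (1, 2, 3)]  # 0.240, 0.096, 0.0384
```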

Page 28:

• Example 1: Single Fair Coin

Review of Markov Models

(figure: two-state chain S1 ↔ S2 with all transition probabilities equal to 0.5)

S1 corresponds to e1 = Heads: a11 = 0.5, a12 = 0.5
S2 corresponds to e2 = Tails: a21 = 0.5, a22 = 0.5

• Generated events: H T H H T H T T T H H
  corresponding state sequence: S1 S2 S1 S1 S2 S1 S2 S2 S2 S1 S1

Page 29:

• Example 2: Weather

Review of Markov Models

(figure: three-state weather chain S1, S2, S3 with transition probabilities 0.7, 0.25, 0.05, 0.4, 0.5, 0.1, 0.2, 0.7, 0.1)

Page 30:

• Example 2: Weather (con’t)

• S1 = event1 = rain, S2 = event2 = clouds, S3 = event3 = sun, with transition matrix A = {aij}

• what is the probability of {rain, rain, rain, clouds, sun, clouds, rain}?
  Obs. = {r, r, r, c, s, c, r}
  S = {S1, S1, S1, S2, S3, S2, S1}
  time = {1, 2, 3, 4, 5, 6, 7} (days)

= P[S1] P[S1|S1] P[S1|S1] P[S2|S1] P[S3|S2] P[S2|S3] P[S1|S2]

= 0.5 · 0.7 · 0.7 · 0.25 · 0.1 · 0.7 · 0.4

= 0.001715

Review of Markov Models

A = {aij} =
  0.70  0.25  0.05
  0.40  0.50  0.10
  0.20  0.70  0.10

π1 = 0.5, π2 = 0.4, π3 = 0.1
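The day-by-day product above can be computed with a small helper (state indices 0 = rain, 1 = clouds, 2 = sun; the literals transcribe A and π from the slide):

```python
# states: 0 = rain, 1 = clouds, 2 = sun; A and pi transcribed from the slide
A = [[0.70, 0.25, 0.05],
     [0.40, 0.50, 0.10],
     [0.20, 0.70, 0.10]]
pi = [0.5, 0.4, 0.1]

def sequence_prob(states, A, pi):
    """First-order chain: P(q1..qT) = pi[q1] * prod_t A[q_{t-1}][q_t]."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

p_rain = sequence_prob([0, 0, 0, 1, 2, 1, 0], A, pi)  # the {r,r,r,c,s,c,r} sequence
p_sun = sequence_prob([2, 2, 2, 0, 1, 2, 2], A, pi)   # the {s,s,s,r,c,s,s} sequence
```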

Page 31:

• Example 2: Weather (con’t)

• S1 = event1 = rain, S2 = event2 = clouds, S3 = event3 = sun, with transition matrix A = {aij}

• what is the probability of {sun, sun, sun, rain, clouds, sun, sun}?
  Obs. = {s, s, s, r, c, s, s}
  S = {S3, S3, S3, S1, S2, S3, S3}
  time = {1, 2, 3, 4, 5, 6, 7} (days)

= P[S3] P[S3|S3] P[S3|S3] P[S1|S3] P[S2|S1] P[S3|S2] P[S3|S3]

= 0.1 · 0.1 · 0.1 · 0.2 · 0.25 · 0.1 · 0.1

= 5.0 × 10⁻⁷

Review of Markov Models

A = {aij} =
  0.70  0.25  0.05
  0.40  0.50  0.10
  0.20  0.70  0.10

π1 = 0.5, π2 = 0.4, π3 = 0.1

Page 32:

Simultaneous speech and speaker recognition using hybrid architecture

– Dominique Genoud, Dan Ellis, Nelson Morgan

• The automatic recognition of the human voice is often divided into two parts:
  – speech recognition
  – speaker recognition

Page 33:

Traditional System

• A traditional, state-of-the-art speaker recognition system can be divided into two parts:
  – Feature Extraction
  – Model Creation

Page 34:

Feature Extraction

(figure: the input signal is divided into Frame 1 … Frame N with a given frame length and frame overlap; a window function is applied to each frame, signal processing is performed, and each frame i yields a d-dimensional frame vector Xi = [Xi1, Xi2, …, Xid])


Page 35:

Model Creation

• Once the features are extracted, a model can be created using various techniques, e.g., a Gaussian Mixture Model.

• Once the model is created, we can compute a distance from one model to another.

• Based on the distance, a decision can be inferred.

Page 36:

Simultaneous speaker and speech recognition

• A system that models the “phones” of the speaker as well as the speaker’s features, and combines them into one model, could perform very well.

Page 37:

Simultaneous speaker and speech recognition

• Maximum a posteriori (MAP) estimation is used to generate speaker-specific models from a set of speaker independent (SI) seed models.

• Assuming no prior knowledge about the speaker distribution, the a posteriori probability Pr is approximated by a score computed from the speaker-specific models and the world model.

Page 38:

Simultaneous speaker and speech recognition

• In the previous equation, the parameter was determined empirically to be 0.02.

• Using the Viterbi algorithm, the N most probable speakers P(x| ) can be found.

• Results:
  – The authors reported 0.7% EER, compared to 5.6% EER for a GMM-based system on the same dataset of 100 persons.

Page 39:

Speech and Speaker Combination

• Posteriori Probabilities and Likelihoods Combination for Speech and Speaker Recognition

• Mohamed Faouzi BenZeghiba, Eurospeech 2003.

• The author used a hybrid HMM/ANN (MLP) system for this work.

• For the speech features, 12 MFCC coefficients with energy and their first derivatives were calculated every 10 ms over a 30 ms window.

Page 40:

System Description

Ŵ = argmax over w in {W} of [ log P(w | X, Θ) ]

Ŝ = argmax over s in {S} of [ log P(X | λ_s) ]

where

W is a word from the finite set of words {W}

S is a speaker from the finite set of registered speakers {S}

Θ is the set of ANN parameters

Page 41:

System Description

The probability that a speaker is accepted is

LLR(X) = log P(X | λ_s) − log P(X | λ̄) ≥ Threshold

where LLR(X) is the log-likelihood ratio, λ_s is the speaker’s GMM model, and λ̄ is the background model; the speaker model’s parameters are derived from the world data set using MAP adaptation.
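The accept/reject step above is a one-line comparison; a sketch (the log-likelihood values and threshold here are hypothetical placeholders, not numbers from the paper):

```python
def accept_speaker(log_p_speaker, log_p_background, threshold):
    """Accept if LLR(X) = log P(X|speaker) - log P(X|background) >= threshold.

    In the paper the two log-likelihoods come from the MAP-adapted GMM
    and the background (world) model; here they are placeholder values.
    """
    llr = log_p_speaker - log_p_background
    return llr, llr >= threshold

llr, accepted = accept_speaker(-120.0, -135.0, threshold=5.0)
```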

Page 42:

Combination

• Use of MLP adaptation:
  – shifting the boundaries between the phone classes without strongly affecting the posterior probabilities of the speech sounds of other speakers

• The author proposed the following formula to combine the two systems:

(Ŵ, Ŝ) = argmax over (w, s) of [ log P(w | X, Θ_s) + log P(X | λ_s) ]

Page 43:

Combination

• Using a posteriori probabilities on the test set, it can be shown that

(Ŵ, Ŝ) = argmax over (w, s) of [ α1 log P(w | X, Θ_s) + log P(X | λ_s) ]

• The probability that a speaker is accepted is

α2 log P(Ŵ | X, Θ_s) + LLR(X) ≥ threshold

where α1 and α2 are determined from a posteriori probabilities on the test set.

Page 44:

HMM-Parameter Estimation

• Given an observation sequence O, determine the model parameters λ = (A, B, π) that maximize P(O|λ).

• γ_t(i) is the probability of being in state i at time t, and ξ_t(i, j) is the probability of being in state i at time t and state j at time t+1. Then

ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j)

γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j)

Page 45:

HMM-Parameter Estimation

• π̄_i = expected frequency in state i at time t = 1 = γ_1(i)

• ā_ij = (expected number of transitions from state i to state j) / (expected number of transitions from state i) = Σ_t ξ_t(i, j) / Σ_t γ_t(i)

• b̄_j(k) = (expected number of times in state j observing symbol v_k) / (expected number of times in state j)
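The γ and ξ quantities behind these re-estimates can be computed from the forward/backward variables; a NumPy sketch (the α and β arrays are assumed to come from the usual forward and backward passes, which are not shown here):

```python
import numpy as np

def xi_gamma(alpha, beta, A, B, obs):
    """xi_t(i, j) and gamma_t(i) from forward/backward variables.

    alpha, beta: (T, N) forward and backward probabilities.
    A: (N, N) transition matrix; B: (N, M) emission matrix;
    obs: length-T observation symbol indices.
    """
    T, N = alpha.shape
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        num = alpha[t][:, None] * A * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :]
        xi[t] = num / num.sum()   # normalize so xi_t(., .) sums to 1
    gamma = xi.sum(axis=2)        # gamma_t(i) = sum_j xi_t(i, j)
    return xi, gamma
```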

Page 46:

• Thank You