text-independent speaker verification

Speaker Recognition

Cody A. Ray

ECES 435 Final Project

March 11, 2010

• Speaker Recognition
  – Speaker Identification
    – Text-Dependent
    – Text-Independent
  – Speaker Verification
    – Text-Dependent
    – Text-Independent

Speaker Recognition System

[Block diagram: speaker recognition system]

• Training: training speech → Feature Extraction → feature vectors → Training → Speaker Model (target & background)
• Testing: test speech → Feature Extraction → Matching → score → Verification
• Feature extraction: Cepstrum, LPCC, MFCC, Glottal Flow Derivative
• Deterministic models: Min Distance, DTW
• Stochastic models: GMM, HMM
• Matching: Minimum Distance, Maximum-Likelihood, Maximum a posteriori, Minimum-Mean-Squared Error

Feature Extraction

• Big surprise here – MFCCs!

Speech signal x[m] → Window w[n−m] → DFT X(n, ω) → | · | → Mel-Scale Filter Bank E_mel(n, l) → Log → DCT → MFCCs

• MFCC: 12 coefficients (skip the 0th-order coefficient)
• 256-sample frames, 128-sample increment, Hamming window
• Triangular filters in the mel domain (absolute magnitude)

Mel Frequency Bank
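The feature-extraction settings listed on this slide can be sketched in NumPy as follows. This is a minimal illustration, not the project's actual code: the sampling rate (`fs=8000`), the number of mel filters (`n_filters=20`), the mel-scale formula, and all function names are assumptions that the slides do not state.

```python
import numpy as np

def hz_to_mel(f):
    # Common mel-scale formula (assumed; the slides do not give one).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, fs):
    """Triangular filters spaced evenly on the mel scale.
    Returns an array of shape (n_filters, n_fft // 2 + 1)."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / fs).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):          # rising edge of the triangle
            bank[i, k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):          # falling edge of the triangle
            bank[i, k] = (hi - k) / (hi - mid)
    return bank

def mfcc(signal, fs=8000, frame_len=256, hop=128, n_filters=20, n_ceps=12):
    """Return an (n_frames, n_ceps) array of MFCCs, skipping coefficient 0."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Absolute magnitude of the DFT of each windowed frame.
    magnitude = np.abs(np.fft.rfft(frames, n=frame_len, axis=1))
    # Mel filter-bank energies, then log (small floor avoids log(0)).
    energies = magnitude @ mel_filter_bank(n_filters, frame_len, fs).T
    log_energies = np.log(energies + 1e-10)
    # DCT-II of the log energies, written out as a matrix product.
    l = np.arange(n_filters)
    n = np.arange(n_ceps + 1)[:, None]
    dct_basis = np.cos(np.pi * n * (l + 0.5) / n_filters)
    cepstra = log_energies @ dct_basis.T
    return cepstra[:, 1:]                 # drop the 0th-order coefficient

# Usage: one second of synthetic noise standing in for speech.
rng = np.random.default_rng(0)
features = mfcc(rng.standard_normal(8000))
print(features.shape)  # (61, 12)
```

Each row of `features` is one frame's 12-dimensional feature vector; real use would replace the synthetic noise with a recorded utterance.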

System 1: Minimum-Distance

• Average of mel-cepstral features for test and training data

\[
\bar{C}_{mel}^{ts}[n] = \frac{1}{M} \sum_{m=1}^{M} C_{mel}^{ts}[mL, n]
\]

\[
\bar{C}_{mel}^{tr}[n] = \frac{1}{M} \sum_{m=1}^{M} C_{mel}^{tr}[mL, n]
\]

Minimum-Distance Classifier

• Mean-squared difference between average testing and training feature vectors

\[
D = \frac{1}{R-1} \sum_{n=1}^{R-1} \left( \bar{C}_{mel}^{ts}[n] - \bar{C}_{mel}^{tr}[n] \right)^2
\]

if D < T, then speaker is present
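The averaging and distance steps of System 1 can be sketched as below. This is a sketch under stated assumptions: the function names are hypothetical, and each input is assumed to be an array of per-frame mel-cepstral vectors with the 0th coefficient already dropped, so the mean over columns matches the 1/(R−1) sum over n = 1 … R−1.

```python
import numpy as np

def average_features(frames):
    """Average the per-frame mel-cepstral vectors over all M frames."""
    return np.asarray(frames, dtype=float).mean(axis=0)

def min_distance(test_frames, train_frames):
    """Mean-squared difference D between the averaged test and
    training feature vectors."""
    diff = average_features(test_frames) - average_features(train_frames)
    return float(np.mean(diff ** 2))

def speaker_present(test_frames, train_frames, threshold):
    """Decision rule from the slide: the speaker is present if D < T."""
    return min_distance(test_frames, train_frames) < threshold

# Toy usage with two frames of two coefficients each (made-up numbers).
train = [[1.0, 2.0], [3.0, 4.0]]   # averages to [2, 3]
test = [[2.0, 3.0], [4.0, 5.0]]    # averages to [3, 4]
print(min_distance(test, train))           # 1.0
print(speaker_present(test, train, 0.12))  # False
```

Because only the per-utterance averages are compared, all information about how the features vary over time is discarded, which is one reason the threshold T is hard to set well.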

System 2: Gaussian Mixture Model

Multivariate Normal Distribution

Gaussian Mixture Model
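For reference, the standard textbook forms behind these two slide titles are written out below. The dimension d, the component count K, and the symbol b_i are notation assumed here; p_i, μ_i, and Σ_i are the mixture weight, mean vector, and covariance matrix of component i.

```latex
% Multivariate normal density for a d-dimensional feature vector x
b_i(x) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_i|^{1/2}}
         \exp\left( -\tfrac{1}{2} (x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i) \right)

% Gaussian mixture model: weighted sum of K component densities
p(x \mid \lambda) = \sum_{i=1}^{K} p_i \, b_i(x),
\qquad \sum_{i=1}^{K} p_i = 1
```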

GMM Speaker Recognition System

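\[
\lambda = \{ p_i, \mu_i, \Sigma_i \}
\]

[Block diagram: the feature vectors are scored against the Target Model (+) and against the Imposter 1 and Imposter 2 models (−); the scores are summed (Σ) to give Λ(X).]

\[
\Lambda(X) \ge \theta: \text{accept} \qquad \Lambda(X) < \theta: \text{reject}
\]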

Experiments

• 8 speakers (4 male, 4 female)
• 2 sentences each
  – Don’t ask me to carry an oily rag like that
  – She had your dark suit in greasy wash water all year

• “Rag” used for training, “suit” for testing

Results

         Test1   Test2   Test3   Test4   Test5   Test6   Test7   Test8
Train1   0.1192  0.1945  0.2151  0.2184  0.5364  0.3823  0.4963  0.4538
Train2   0.0724  0.0378  0.0406  0.0783  0.4035  0.3177  0.4125  0.3986
Train3   0.1672  0.1311  0.1042  0.0969  0.3382  0.2121  0.3597  0.2847
Train4   0.1482  0.1412  0.1363  0.0817  0.3268  0.3211  0.3154  0.3282
Train5   0.1882  0.1928  0.2237  0.1466  0.1044  0.0709  0.1382  0.1299
Train6   0.3012  0.3521  0.3208  0.3112  0.3023  0.0958  0.2755  0.2094
Train7   0.2743  0.2973  0.3252  0.2517  0.1618  0.1318  0.0724  0.1427
Train8   0.3589  0.3600  0.3381  0.2186  0.1976  0.1133  0.2487  0.0585



Threshold = 0.12, Accuracy = 91%



Threshold = 0.11, Accuracy = 91%
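The reported accuracies can be checked against the distance matrix with a short script. This sketch assumes that rows are training speakers and columns are test speakers, that matching indices are the same speaker, and that accuracy is counted over all 64 train/test pairs; the slides do not state these conventions, but they reproduce the 91% figure at both thresholds.

```python
import numpy as np

# Distances copied from the results table (rows: Train1-8, columns: Test1-8).
D = np.array([
    [0.1192, 0.1945, 0.2151, 0.2184, 0.5364, 0.3823, 0.4963, 0.4538],
    [0.0724, 0.0378, 0.0406, 0.0783, 0.4035, 0.3177, 0.4125, 0.3986],
    [0.1672, 0.1311, 0.1042, 0.0969, 0.3382, 0.2121, 0.3597, 0.2847],
    [0.1482, 0.1412, 0.1363, 0.0817, 0.3268, 0.3211, 0.3154, 0.3282],
    [0.1882, 0.1928, 0.2237, 0.1466, 0.1044, 0.0709, 0.1382, 0.1299],
    [0.3012, 0.3521, 0.3208, 0.3112, 0.3023, 0.0958, 0.2755, 0.2094],
    [0.2743, 0.2973, 0.3252, 0.2517, 0.1618, 0.1318, 0.0724, 0.1427],
    [0.3589, 0.3600, 0.3381, 0.2186, 0.1976, 0.1133, 0.2487, 0.0585],
])

def evaluate(distances, threshold):
    """Return (accuracy, false_accepts, false_rejects) for the rule D < T."""
    accept = distances < threshold
    same_speaker = np.eye(len(distances), dtype=bool)  # diagonal = true match
    false_accepts = int(np.sum(accept & ~same_speaker))
    false_rejects = int(np.sum(~accept & same_speaker))
    accuracy = 1.0 - (false_accepts + false_rejects) / distances.size
    return accuracy, false_accepts, false_rejects

for t in (0.12, 0.11):
    print(t, evaluate(D, t))
# 0.12 (0.90625, 6, 0)
# 0.11 (0.90625, 5, 1)
```

Under these assumptions both thresholds make 6 errors out of 64 trials (about 91%), but lowering the threshold from 0.12 to 0.11 trades one false accept for one false reject.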

Conclusions

• Accuracy isn’t terrible, but there is room to improve
• Threshold tradeoff
  – false negatives vs. false positives

• DON’T use Minimum-Distance classifier for text-independent authentication systems

Future Work

• Implement LLR classifier using a GMM library
• Repeat the experiment with the GMM-based system
• Compare Min-Distance and GMM results
