text-independent speaker verification
DESCRIPTION
Presentation slides discussing the theory and empirical results of a text-independent speaker verification system I developed based on classification of MFCCs. Both minimum-distance classification and log-likelihood ratio classification using Gaussian Mixture Models are discussed.
TRANSCRIPT
Speaker Recognition
Cody A. Ray
ECES 435 Final Project
March 11, 2010
• Speaker Recognition
  – Speaker Identification
    • Text-Dependent
    • Text-Independent
  – Speaker Verification
    • Text-Dependent
    • Text-Independent
Speaker Recognition System
[Block diagram]
Training: training speech → feature extraction → feature vectors → training → speaker model (target & background)
Testing: test speech → feature extraction → matching → score → verification
• Features: cepstrum, LPCC, MFCC, glottal flow derivative
• Deterministic models: minimum distance, DTW
• Stochastic models: GMM, HMM
• Matching: minimum distance, maximum likelihood, maximum a posteriori, minimum mean-squared error
Feature Extraction
• Big surprise here – MFCCs!
Speech signal x[m] → window w[n−m] → DFT X(n, ω) → | · | → mel-scale filter bank E_mel(n, l) → log → DCT → MFCCs
• MFCC: 12 coefficients (skip the 0th-order coefficient)
• 256-sample frames, 128-sample increment, Hamming window
• Triangular filters in the mel domain (absolute magnitude)
Mel Frequency Bank
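The pipeline above can be sketched in plain Python. This is a minimal illustration, not the code behind the slides: the 8 kHz sample rate, the 20 filters, the 2595·log10(1 + f/700) mel formula, the natural log, and the unnormalized DCT-II are all assumptions, since the slides give only the frame length, increment, window, and coefficient count.

```python
import cmath
import math

def hz_to_mel(f):
    # Assumed mel formula (a common convention; the slides do not specify one).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc_frame(frame, bank, n_coeffs=12):
    """Hamming window -> |DFT| -> mel filter bank -> log -> DCT, skipping order 0."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    # Direct DFT magnitude for the non-negative frequency bins.
    mag = [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                   for t in range(n)))
           for k in range(n // 2 + 1)]
    energies = [sum(w * a for w, a in zip(row, mag)) for row in bank]
    log_e = [math.log(e + 1e-10) for e in energies]  # small floor avoids log(0)
    n_f = len(log_e)
    return [sum(log_e[l] * math.cos(math.pi * q * (l + 0.5) / n_f)
                for l in range(n_f))
            for q in range(1, n_coeffs + 1)]

def mfcc(signal, sample_rate=8000, frame_len=256, hop=128, n_filters=20):
    # sample_rate and n_filters are assumed values, not taken from the slides.
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    return [mfcc_frame(signal[s:s + frame_len], bank)
            for s in range(0, len(signal) - frame_len + 1, hop)]
```

Calling `mfcc(signal)` on a list of samples returns one 12-element feature vector per 256-sample frame, with frames starting every 128 samples.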
System 1: Minimum-Distance
• Average of mel-cepstral features for test and training data
$$\bar{C}_{mel}^{ts}[n] = \frac{1}{M} \sum_{m=1}^{M} C_{mel}^{ts}[mL, n]$$
$$\bar{C}_{mel}^{tr}[n] = \frac{1}{M} \sum_{m=1}^{M} C_{mel}^{tr}[mL, n]$$
Minimum-Distance Classifier
• Mean-squared difference between average testing and training feature vectors
$$D = \frac{1}{R-1} \sum_{n=1}^{R-1} \left( \bar{C}_{mel}^{ts}[n] - \bar{C}_{mel}^{tr}[n] \right)^2$$
If D < T, then the speaker is present.
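A minimal sketch of this classifier, assuming each utterance is given as a list of per-frame feature vectors that already exclude the 0th coefficient (so each vector has R − 1 entries). The function names are illustrative, not from the original project.

```python
def mean_cepstrum(frames):
    """Average each coefficient over the M frames of an utterance."""
    m = len(frames)
    return [sum(column) / m for column in zip(*frames)]

def distance(test_frames, train_frames):
    """Mean-squared difference D between the averaged test and training vectors."""
    ts = mean_cepstrum(test_frames)
    tr = mean_cepstrum(train_frames)
    # len(ts) plays the role of R - 1 (assumption: coefficient 0 already dropped).
    return sum((a - b) ** 2 for a, b in zip(ts, tr)) / len(ts)

def speaker_present(test_frames, train_frames, threshold):
    """Accept the claimed speaker when D < T."""
    return distance(test_frames, train_frames) < threshold
```

Because the features are averaged over all frames before comparison, the order of the frames (and so the order of the spoken sounds) has no effect on D.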
System 2: Gaussian Mixture Model
Multivariate Normal Distribution
Gaussian Mixture Model
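For reference, the standard forms behind these two slide titles, written with the component weights p_i, means μ_i, and covariances Σ_i that make up the model λ. The feature dimension d and the number of mixture components K are symbols assumed here, not taken from the slides.

$$\mathcal{N}(x; \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)$$

$$p(x \mid \lambda) = \sum_{i=1}^{K} p_i \, \mathcal{N}(x; \mu_i, \Sigma_i), \qquad \sum_{i=1}^{K} p_i = 1$$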
GMM Speaker Recognition System
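$$\lambda = \{ p_i, \mu_i, \Sigma_i \}$$
[Block diagram] Feature vectors → target model (+), imposter 1 and imposter 2 models (−) → Σ → Λ(X)
$$\Lambda(X) \ge \theta \Rightarrow \text{accept}, \qquad \Lambda(X) < \theta \Rightarrow \text{reject}$$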
Log-Likelihood Ratio
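$$\Lambda(X) = \log[p(X \mid \lambda_C)] - \log[p(X \mid \lambda_{\bar{C}})]$$
$$\Lambda(X) \ge \theta \Rightarrow \text{accept}, \qquad \Lambda(X) < \theta \Rightarrow \text{reject}$$
$$\frac{P(\lambda_C \mid X)}{P(\lambda_{\bar{C}} \mid X)} = \frac{p(X \mid \lambda_C) \, P(\lambda_C) / P(X)}{p(X \mid \lambda_{\bar{C}}) \, P(\lambda_{\bar{C}}) / P(X)}$$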
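P(X) cancels in the posterior ratio, so once the priors are folded into the threshold θ the test reduces to the log-likelihood ratio Λ(X). The sketch below shows one way to score it; it is not the author's implementation (the slides list the GMM system as future work). It assumes diagonal covariances, independent frames, a single background model, and models that are already trained and passed in as lists of (weight, mean, variance) tuples.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumed form)."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(frames, gmm):
    """log p(X | lambda), summing per-frame log-likelihoods (frames assumed independent)."""
    total = 0.0
    for x in frames:
        terms = [math.log(w) + log_gauss_diag(x, mu, var) for w, mu, var in gmm]
        peak = max(terms)
        # log-sum-exp keeps the mixture sum numerically stable
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total

def llr(frames, target, background):
    """Lambda(X) = log p(X | target model) - log p(X | background model)."""
    return gmm_log_likelihood(frames, target) - gmm_log_likelihood(frames, background)

def accept(frames, target, background, theta):
    """Accept the claimed speaker when Lambda(X) >= theta."""
    return llr(frames, target, background) >= theta
```

Working in the log domain avoids underflow when the likelihoods of many frames are multiplied together.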
Experiments
• 8 Speakers (4 Male, 4 Female)
• 2 Sentences Each
  – Don’t ask me to carry an oily rag like that
  – She had your dark suit in greasy wash water all year
• “Rag” used for training, “suit” for testing
Results
        Test1   Test2   Test3   Test4   Test5   Test6   Test7   Test8
Train1  0.1192  0.1945  0.2151  0.2184  0.5364  0.3823  0.4963  0.4538
Train2  0.0724  0.0378  0.0406  0.0783  0.4035  0.3177  0.4125  0.3986
Train3  0.1672  0.1311  0.1042  0.0969  0.3382  0.2121  0.3597  0.2847
Train4  0.1482  0.1412  0.1363  0.0817  0.3268  0.3211  0.3154  0.3282
Train5  0.1882  0.1928  0.2237  0.1466  0.1044  0.0709  0.1382  0.1299
Train6  0.3012  0.3521  0.3208  0.3112  0.3023  0.0958  0.2755  0.2094
Train7  0.2743  0.2973  0.3252  0.2517  0.1618  0.1318  0.0724  0.1427
Train8  0.3589  0.3600  0.3381  0.2186  0.1976  0.1133  0.2487  0.0585
Threshold = 0.12, Accuracy = 91%
Threshold = 0.11, Accuracy = 91%
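The two accuracy figures can be checked directly from the Results table. The sketch below assumes that TrainK and TestK come from the same speaker, that every one of the 64 train/test pairs counts as one decision, and that a pair is accepted when D < T.

```python
# Distances copied from the Results table: rows Train1..Train8, columns Test1..Test8.
D = [
    [0.1192, 0.1945, 0.2151, 0.2184, 0.5364, 0.3823, 0.4963, 0.4538],
    [0.0724, 0.0378, 0.0406, 0.0783, 0.4035, 0.3177, 0.4125, 0.3986],
    [0.1672, 0.1311, 0.1042, 0.0969, 0.3382, 0.2121, 0.3597, 0.2847],
    [0.1482, 0.1412, 0.1363, 0.0817, 0.3268, 0.3211, 0.3154, 0.3282],
    [0.1882, 0.1928, 0.2237, 0.1466, 0.1044, 0.0709, 0.1382, 0.1299],
    [0.3012, 0.3521, 0.3208, 0.3112, 0.3023, 0.0958, 0.2755, 0.2094],
    [0.2743, 0.2973, 0.3252, 0.2517, 0.1618, 0.1318, 0.0724, 0.1427],
    [0.3589, 0.3600, 0.3381, 0.2186, 0.1976, 0.1133, 0.2487, 0.0585],
]

def evaluate(dist, threshold):
    """Return (correct decisions, false accepts, false rejects) for D < threshold."""
    false_accepts = false_rejects = 0
    n = len(dist)
    for i in range(n):
        for j in range(n):
            accepted = dist[i][j] < threshold
            same_speaker = (i == j)  # assumption: TrainK and TestK share a speaker
            if accepted and not same_speaker:
                false_accepts += 1
            elif same_speaker and not accepted:
                false_rejects += 1
    correct = n * n - false_accepts - false_rejects
    return correct, false_accepts, false_rejects

for t in (0.12, 0.11):
    correct, fa, fr = evaluate(D, t)
    print(f"T={t}: {correct}/64 correct, {fa} false accepts, {fr} false rejects")
```

Under these assumptions both thresholds give 58 of 64 correct decisions (about 91%). T = 0.12 yields 6 false accepts and no false rejects, while T = 0.11 trades one false accept for one false reject.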
Conclusions
• Accuracy isn’t terrible, but there is room to improve
• Threshold tradeoff
  – false negatives vs. false positives
• DON’T use Minimum-Distance classifier for text-independent authentication systems
Future Work
• Implement LLR Classifier using GMM library
• Repeat experiment with GMM-based system
• Compare Min-Distance and GMM results