![Page 1: Speaker Recognition](https://reader035.vdocuments.us/reader035/viewer/2022062305/56814c46550346895db947cd/html5/thumbnails/1.jpg)
Speaker Recognition
Sharat S. Chikkerur
Center for Unified Biometrics and Sensors
http://www.cubs.buffalo.edu
![Page 2: Speaker Recognition](https://reader035.vdocuments.us/reader035/viewer/2022062305/56814c46550346895db947cd/html5/thumbnails/2.jpg)
Speech Fundamentals
Characterizing speech
• Content (Speech recognition)
• Signal representation (Vocoding): Waveform, or Parametric (Excitation, Vocal Tract)
• Signal analysis (Gender determination, Speaker recognition)
Terminologies
• Phonemes: basic discrete units of speech; English has around 42 phonemes; language specific
• Types of speech: Voiced speech, Unvoiced speech (Fricatives), Plosives
• Formants
![Page 3: Speaker Recognition](https://reader035.vdocuments.us/reader035/viewer/2022062305/56814c46550346895db947cd/html5/thumbnails/3.jpg)
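Speech production
Speech production mechanism (figure: the vocal tract, labelled 17 cm)
Speech production model (block diagram):
• Impulse Train Generator, controlled by Pitch, feeding the Glottal Pulse Model G(z) with gain Av
• Noise source with gain AN
• Vocal Tract Model V(z)
• Radiation Model R(z)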
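The block diagram can be illustrated with a small source-filter sketch. It is only an approximation under stated assumptions: G(z), V(z) and R(z) are collapsed into one all-pole filter, and the pitch period, pole radius and pole angle are hypothetical values chosen for illustration, not taken from the slides.

```python
import math

def impulse_train(num_samples, pitch_period):
    """Voiced excitation: a unit impulse every `pitch_period` samples."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]

def all_pole_filter(excitation, coeffs, gain=1.0):
    """Compute s[n] = gain * e[n] + sum_k coeffs[k-1] * s[n-k]."""
    out = []
    for n, e in enumerate(excitation):
        acc = gain * e
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out.append(acc)
    return out

# Hypothetical single resonance: pole pair with radius 0.95 at angle 0.3*pi.
r, theta = 0.95, 0.3 * math.pi
coeffs = [2 * r * math.cos(theta), -r * r]

# Voiced speech sketch: impulse train (period 50 samples) through the filter.
voiced = all_pole_filter(impulse_train(200, 50), coeffs, gain=1.0)
```

For unvoiced speech, the same filter would be driven by a noise sequence instead of the impulse train.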
![Page 4: Speaker Recognition](https://reader035.vdocuments.us/reader035/viewer/2022062305/56814c46550346895db947cd/html5/thumbnails/4.jpg)
Nature of speech
(Figure: speech waveform, amplitude between -1 and 1 over 10000 samples, and its Spectrogram, with Time from 0 to 1.2 on the horizontal axis and Frequency from 0 to 4000 on the vertical axis)
![Page 5: Speaker Recognition](https://reader035.vdocuments.us/reader035/viewer/2022062305/56814c46550346895db947cd/html5/thumbnails/5.jpg)
Vocal Tract modeling
(Figure panels: Signal Spectrum, Smoothened Signal Spectrum)
• The smoothened spectrum indicates the locations of the formants for each user
• The smoothened spectrum is obtained from the cepstral coefficients
![Page 6: Speaker Recognition](https://reader035.vdocuments.us/reader035/viewer/2022062305/56814c46550346895db947cd/html5/thumbnails/6.jpg)
Parametric Representations: Formants
Formant Frequencies
• Characterize the frequency response of the vocal tract
• Used in the characterization of vowels
• Can be used to determine the gender
(Figure: two spectra plotted over 0 to 4000 on the frequency axis)
![Page 7: Speaker Recognition](https://reader035.vdocuments.us/reader035/viewer/2022062305/56814c46550346895db947cd/html5/thumbnails/7.jpg)
Parametric Representations: LPC

$$s[n] = \sum_{k} a_k\, s[n-k] + G\,u[n]$$

Linear predictive coefficients
• Used in vocoding
• Spectral estimation
(Figure: spectra plotted over 0 to 4000 on the frequency axis, with panel labels 5, 2, 20, 40 and 200)
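As an illustration of how the coefficients a_k in the linear prediction model s[n] = Σ a_k s[n-k] + G u[n] might be estimated, here is a minimal sketch of the autocorrelation method solved with the Levinson-Durbin recursion. The slides do not say which estimation method was used, so this choice, the function names and the test signal are assumptions.

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite signal."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for a_1..a_p in s[n] ~ sum_k a_k s[n-k].

    Returns (coefficients, residual prediction error energy).
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        k_i = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        err *= (1.0 - k_i * k_i)
    return a[1:], err

# Hypothetical test signal: impulse response of s[n] = 1.2 s[n-1] - 0.5 s[n-2] + u[n].
signal = []
for n in range(400):
    prev1 = signal[n - 1] if n >= 1 else 0.0
    prev2 = signal[n - 2] if n >= 2 else 0.0
    signal.append(1.2 * prev1 - 0.5 * prev2 + (1.0 if n == 0 else 0.0))

lpc, residual = levinson_durbin(autocorr(signal, 2), 2)
```

On this synthetic signal the recovered coefficients should be close to the generating values 1.2 and -0.5; real speech frames would normally be windowed first.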
![Page 8: Speaker Recognition](https://reader035.vdocuments.us/reader035/viewer/2022062305/56814c46550346895db947cd/html5/thumbnails/8.jpg)
Parametric Representations: Cepstrum
• Speech production model: P[n] → G(z) → V(z) → R(z), with noise input u[n], Pitch control, and gains Av and AN
• Homomorphic system D[] → L[] → D⁻¹[]: x1[n]*x2[n] → x1'[n]+x2'[n] → y1'[n]+y2'[n] → y1[n]*y2[n]
• Characteristic system D[] = DFT[] → LOG[] → IDFT[]: x1[n]*x2[n] → X1(z)X2(z) → log(X1(z)) + log(X2(z)) → x1'[n]+x2'[n]
(Figure: three plots over 0 to 6000 on the horizontal axis, with panel labels 5, 10 and 40)
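The DFT → LOG → IDFT chain can be tried out with a short sketch. It assumes the real cepstrum (log of the DFT magnitude rather than the complex logarithm) and circular convolution of equal-length sequences, so that convolution in time becomes an exact sum in the cepstral domain. The two input sequences are hypothetical.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]

def idft(X):
    """Direct O(N^2) inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[m] * cmath.exp(2j * math.pi * m * k / n) for m in range(n)) / n
            for k in range(n)]

def real_cepstrum(x):
    """IDFT of the log magnitude spectrum (requires a spectrum with no zeros)."""
    log_mag = [math.log(abs(v)) for v in dft(x)]
    return [v.real for v in idft(log_mag)]

def circular_convolve(a, b):
    n = len(a)
    return [sum(a[k] * b[(i - k) % n] for k in range(n)) for i in range(n)]

# Hypothetical sequences whose spectra never reach zero.
x1 = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
x2 = [1.0, -0.3, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0]

cep_of_convolution = real_cepstrum(circular_convolve(x1, x2))
sum_of_cepstra = [a + b for a, b in zip(real_cepstrum(x1), real_cepstrum(x2))]
```

The two results agree to numerical precision, which is the property that lets the cepstrum separate excitation from vocal tract contributions.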
![Page 9: Speaker Recognition](https://reader035.vdocuments.us/reader035/viewer/2022062305/56814c46550346895db947cd/html5/thumbnails/9.jpg)
Speaker Recognition
Definition
• It is the method of recognizing a person based on their voice
• It is one of the forms of biometric identification
• It depends on speaker-dependent characteristics
Taxonomy (tree diagram)
• Speaker Recognition: Speaker Identification, Speaker Verification, Speaker Detection
• Modes: Text Dependent, Text Independent
Speech Applications
• Transmission, Speech Synthesis, Speech enhancement, Aids to handicapped, Speech Recognition, Speaker Verification
![Page 10: Speaker Recognition](https://reader035.vdocuments.us/reader035/viewer/2022062305/56814c46550346895db947cd/html5/thumbnails/10.jpg)
Generic Speaker Recognition System
• Verification: Preprocessing → Feature Extraction → Pattern Matching
• Enrollment: Preprocessing → Feature Extraction → Speaker Model
• Signal flow: Speech signal → Analysis Frames → Feature Vector → Score
• Preprocessing: A/D Conversion, End point detection, Pre-emphasis filter, Segmentation
• Feature Extraction: LAR, Cepstrum, LPCC, MFCC
• Pattern Matching and Speaker Model: Stochastic Models (GMM, HMM), Template Models (DTW), Distance Measures
Choice of features
• Differentiating factors between speakers include vocal tract shape and behavioral traits
• Features should have high inter-speaker and low intra-speaker variation
![Page 11: Speaker Recognition](https://reader035.vdocuments.us/reader035/viewer/2022062305/56814c46550346895db947cd/html5/thumbnails/11.jpg)
Our Approach
Stages: Preprocessing → Feature Extraction → Speaker model → Matching
Processing steps:
• Silence Removal
• Cepstrum Coefficients
• Cepstral Normalization (long time average)
• Polynomial Function Expansion
• Dynamic Time Warping
• Distance Computation
• Reference Template
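The cepstral normalization step is only named on the slide. A minimal sketch, assuming that "long time average" means subtracting each coefficient's mean over the whole utterance (often called cepstral mean normalization), could look like this. The function name and the toy data are hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient average taken over all frames.

    `frames` is a list of cepstral vectors, one per analysis frame.
    """
    if not frames:
        return []
    num_frames = len(frames)
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / num_frames for i in range(dim)]
    return [[f[i] - mean[i] for i in range(dim)] for f in frames]

# Toy utterance: three frames with two cepstral coefficients each.
normalized = cepstral_mean_normalize([[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]])
```

After normalization every coefficient has zero mean over the utterance, which removes a constant offset such as a fixed channel response.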
![Page 12: Speaker Recognition](https://reader035.vdocuments.us/reader035/viewer/2022062305/56814c46550346895db947cd/html5/thumbnails/12.jpg)
Silence Removal

$$E_{avg} = \frac{1}{N}\sum_{k=1}^{N} x^2[k]$$

$$E_n = \sum_{k=n-W+1}^{n} x^2[k]\, w[n-k]$$
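A possible way to use an average energy and a short-time energy for silence removal is sketched below. The decision rule (keep a frame when its mean energy is at least a fraction of the utterance average), the rectangular non-overlapping window, and the parameter values are assumptions; the slide does not state the threshold.

```python
def remove_silence(x, window=4, ratio=0.1):
    """Drop frames whose mean energy falls below ratio * average energy."""
    if not x:
        return []
    e_avg = sum(v * v for v in x) / len(x)          # long-term average energy
    kept = []
    for start in range(0, len(x), window):
        frame = x[start:start + window]
        e_frame = sum(v * v for v in frame) / len(frame)  # short-time energy
        if e_frame >= ratio * e_avg:
            kept.extend(frame)
    return kept

# Toy signal: silence, a burst of activity, then silence again.
signal = [0.0] * 8 + [1.0, -1.0, 1.0, -1.0] * 2 + [0.0] * 8
speech_only = remove_silence(signal)
```

Only the eight active samples survive in this toy case; on real recordings the window length and ratio would need tuning against background noise.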
![Page 13: Speaker Recognition](https://reader035.vdocuments.us/reader035/viewer/2022062305/56814c46550346895db947cd/html5/thumbnails/13.jpg)
Pre-emphasis

$$H(z) = 1 - a z^{-1}, \quad a = 0.95$$
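In the time domain the pre-emphasis filter H(z) = 1 - a z^{-1} with a = 0.95 corresponds to y[n] = x[n] - a x[n-1]. A minimal sketch follows; passing the first sample through unchanged is an assumption about the boundary handling.

```python
def pre_emphasize(x, a=0.95):
    """Apply y[n] = x[n] - a * x[n-1]; the first sample is passed through."""
    return [x[n] - a * x[n - 1] if n > 0 else x[n] for n in range(len(x))]

# A constant (purely low-frequency) input is strongly attenuated after the first sample.
flattened = pre_emphasize([1.0, 1.0, 1.0, 1.0])
```

The filter boosts high frequencies relative to low ones, which is why the constant input above shrinks to about 0.05 after the first sample.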
![Page 14: Speaker Recognition](https://reader035.vdocuments.us/reader035/viewer/2022062305/56814c46550346895db947cd/html5/thumbnails/14.jpg)
Segmentation
Short time analysis
The speech signal is segmented into overlapping ‘Analysis Frames’
The speech signal is assumed to be stationary within each frame

$$Q_n = \sum_{k} x[k]\, w[n-k]$$

$$w[n] = 0.54 - 0.46\cos\left(\frac{2\pi n}{N}\right)$$

where $Q_n$ is the $n^{th}$ analysis frame and $N$ is the length of the analysis frame.
(Figure: consecutive analysis frames Q31, Q32, Q33, Q34)
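The segmentation into overlapping, Hamming-windowed analysis frames can be sketched as follows. The window uses the formula 0.54 - 0.46 cos(2πn/N) as it appears on the slide (many libraries divide by N-1 instead), and the frame length and hop size are hypothetical values.

```python
import math

def hamming(length):
    """Window w[n] = 0.54 - 0.46 * cos(2*pi*n / N) for n = 0..N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / length) for n in range(length)]

def split_frames(x, frame_len, hop):
    """Cut x into windowed frames; consecutive frames overlap when hop < frame_len."""
    w = hamming(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frames.append([x[start + n] * w[n] for n in range(frame_len)])
    return frames

# Toy example: 10 samples, frames of 4 samples advancing by 2 (50% overlap).
frames = split_frames([1.0] * 10, frame_len=4, hop=2)
```

Each frame can then be handed to the feature extraction stage, under the slide's assumption that speech is stationary within a frame.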
![Page 15: Speaker Recognition](https://reader035.vdocuments.us/reader035/viewer/2022062305/56814c46550346895db947cd/html5/thumbnails/15.jpg)
Feature Representation
(Figure: speech signal and spectrum of two users uttering ‘ONE’)
![Page 16: Speaker Recognition](https://reader035.vdocuments.us/reader035/viewer/2022062305/56814c46550346895db947cd/html5/thumbnails/16.jpg)
Speaker Model
F1 = [a1…a10, b1…b10]
F2 = [a1…a10, b1…b10]
…………….
FN = [a1…a10, b1…b10]

$$b = \frac{\sum_{j=1}^{9} c_j P_{1j}}{\sum_{j=1}^{9} P_{1j}^{2}}, \quad P_{1j} = j - 5$$
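One reading of the formula on this slide is a first-order orthogonal polynomial fit over a nine-frame window: b = Σ c_j P_1j / Σ P_1j², with P_1j = j - 5 and j running from 1 to 9. Under that assumption, b is the local slope of one cepstral coefficient across the window. The sketch below follows this reading; the function name is hypothetical.

```python
def slope_coefficient(c):
    """Slope b of a cepstral coefficient track c_1..c_9 over a 9-frame window."""
    if len(c) != 9:
        raise ValueError("expected a window of exactly 9 frames")
    p = [j - 5 for j in range(1, 10)]            # P_1j = j - 5
    numerator = sum(cj * pj for cj, pj in zip(c, p))
    denominator = sum(pj * pj for pj in p)       # equals 60 for a 9-frame window
    return numerator / denominator

# A track rising by 2 per frame has slope 2; a flat track has slope 0.
rising = slope_coefficient([2 * j + 1 for j in range(1, 10)])
flat = slope_coefficient([3.0] * 9)
```

Appending such slopes b1…b10 to the coefficients a1…a10 would give a 20-element vector of the form shown for F1…FN.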
![Page 17: Speaker Recognition](https://reader035.vdocuments.us/reader035/viewer/2022062305/56814c46550346895db947cd/html5/thumbnails/17.jpg)
Dynamic Time Warping

$$X(n) = \{x_1(n), x_2(n), \ldots, x_K(n)\}, \quad n = 1 \ldots N$$

$$Y(m) = \{y_1(m), y_2(m), \ldots, y_K(m)\}, \quad m = 1 \ldots M$$

$$m = w(n)$$

$$D(X(n), Y(m)) = \sum_{i=1}^{K} \left(x_i(n) - y_i(m)\right)^2$$

$$D_T = \min_{w}\left\{\sum_{n=1}^{N} D\big(X(n), Y(w(n))\big)\right\}$$

• The DTW warping path through the N-by-M cost matrix is the path with the minimum average cumulative cost. The unmarked area in the figure is the constraint region within which the path is allowed to go.
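A compact dynamic-programming sketch of DTW is given below. It uses the squared Euclidean distance between K-dimensional frames and the classic three-way step (insert, delete, match). It does not implement the path constraint region or the length normalization mentioned in the slides, so it should be read as an illustration of the idea rather than the authors' exact procedure.

```python
def frame_distance(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def dtw_distance(X, Y):
    """Minimum cumulative frame distance over monotonic alignments of X and Y."""
    n, m = len(X), len(Y)
    inf = float("inf")
    # cost[i][j]: best cumulative cost aligning X[:i] with Y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(X[i - 1], Y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # X advances alone
                                 cost[i][j - 1],      # Y advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# Toy one-dimensional "utterances": the second is a time-stretched copy of the first.
same_shape = dtw_distance([[0.0], [1.0], [2.0]], [[0.0], [1.0], [1.0], [2.0]])
different = dtw_distance([[0.0], [0.0]], [[1.0], [1.0]])
```

The stretched copy aligns at zero cost, while the two different sequences accumulate a cost of 2.0. Dividing the result by the sequence length would give a length-normalized score.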
![Page 18: Speaker Recognition](https://reader035.vdocuments.us/reader035/viewer/2022062305/56814c46550346895db947cd/html5/thumbnails/18.jpg)
Results

| | a0 | a1 | r0 | r1 | s0 | s1 |
|---|---|---|---|---|---|---|
| a0 | 0 | 0.1226 | 0.3664 | 0.3297 | 0.4009 | 0.4685 |
| a1 | 0.1226 | 0 | 0.5887 | 0.3258 | 0.4086 | 0.4894 |
| r0 | 0.3664 | 0.5887 | 0 | 0.0989 | 0.3299 | 0.4243 |
| r1 | 0.3297 | 0.3258 | 0.0989 | 0 | 0.367 | 0.4287 |
| s0 | 0.4009 | 0.4086 | 0.3299 | 0.367 | 0 | 0.1401 |
| s1 | 0.4685 | 0.4894 | 0.4243 | 0.4287 | 0.1401 | 0 |

• Distances are normalized with respect to the length of the speech signal
• Intra-speaker distances are smaller than inter-speaker distances
• The distance matrix is symmetric
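The two observations about the distance matrix can be checked directly from the reported numbers. The sketch assumes that labels sharing a first letter (a0/a1, r0/r1, s0/s1) are two utterances by the same speaker; the slide does not state this explicitly.

```python
labels = ["a0", "a1", "r0", "r1", "s0", "s1"]
D = [
    [0.0,    0.1226, 0.3664, 0.3297, 0.4009, 0.4685],
    [0.1226, 0.0,    0.5887, 0.3258, 0.4086, 0.4894],
    [0.3664, 0.5887, 0.0,    0.0989, 0.3299, 0.4243],
    [0.3297, 0.3258, 0.0989, 0.0,    0.367,  0.4287],
    [0.4009, 0.4086, 0.3299, 0.367,  0.0,    0.1401],
    [0.4685, 0.4894, 0.4243, 0.4287, 0.1401, 0.0],
]

size = len(labels)
is_symmetric = all(D[i][j] == D[j][i] for i in range(size) for j in range(size))

# Split off-diagonal entries by whether both utterances share a speaker letter.
intra = [D[i][j] for i in range(size) for j in range(size)
         if i != j and labels[i][0] == labels[j][0]]
inter = [D[i][j] for i in range(size) for j in range(size)
         if labels[i][0] != labels[j][0]]

max_intra = max(intra)   # largest same-speaker distance
min_inter = min(inter)   # smallest different-speaker distance
```

Because the largest same-speaker distance is below the smallest different-speaker distance, any threshold between the two values separates the speakers on this data set.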
![Page 19: Speaker Recognition](https://reader035.vdocuments.us/reader035/viewer/2022062305/56814c46550346895db947cd/html5/thumbnails/19.jpg)
Matlab Implementation
![Page 20: Speaker Recognition](https://reader035.vdocuments.us/reader035/viewer/2022062305/56814c46550346895db947cd/html5/thumbnails/20.jpg)
THANK YOU