Speaker Recognition
Sharat S. Chikkerur
Center for Unified Biometrics and Sensors
http://www.cubs.buffalo.edu
Speech Fundamentals
Characterizing speech:
• Content (speech recognition)
• Signal representation (vocoding)
  • Waveform
  • Parametric (excitation, vocal tract)
• Signal analysis (gender determination, speaker recognition)
Terminologies
Phonemes:
• Basic discrete units of speech
• English has around 42 phonemes
• Language specific
Types of speech:
• Voiced speech
• Unvoiced speech (fricatives)
• Plosives
Formants
Speech production
[Figure: speech production mechanism and source-filter speech production model. An impulse train generator (pitch period, gain Av) drives a glottal pulse model G(z) for voiced speech; a noise source (gain AN) provides the excitation for unvoiced speech. Either excitation passes through the vocal tract model V(z) (vocal tract length about 17 cm) and the radiation model R(z).]
Nature of speech
[Figure: time-domain speech waveform (amplitude vs. time) and spectrogram (frequency 0-4000 Hz vs. time).]
Vocal Tract modeling
[Figure: signal spectrum and smoothed signal spectrum.]
• The smoothed spectrum indicates the locations of each speaker's formants
• The smoothed spectrum is obtained from the cepstral coefficients
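The cepstral smoothing of the spectrum described above can be sketched in NumPy. The deck's own implementation is in Matlab; this Python sketch, including the FFT size and the number of retained cepstral coefficients, is an illustrative assumption:

```python
import numpy as np

def smoothed_spectrum(frame, n_fft=1024, n_keep=30):
    """Smooth a frame's log-magnitude spectrum by keeping only the
    low-quefrency cepstral coefficients (cepstral liftering)."""
    log_mag = np.log(np.abs(np.fft.rfft(frame, n_fft)) + 1e-10)
    cepstrum = np.fft.irfft(log_mag)        # real cepstrum
    lifter = np.zeros_like(cepstrum)
    lifter[:n_keep] = 1.0                   # low quefrencies...
    lifter[-(n_keep - 1):] = 1.0            # ...and their mirror (real signal)
    return log_mag, np.fft.rfft(cepstrum * lifter).real

# Synthetic voiced-like frame: two "formant" sinusoids plus noise
rng = np.random.default_rng(0)
t = np.arange(512) / 8000.0
frame = (np.sin(2 * np.pi * 700 * t)
         + 0.5 * np.sin(2 * np.pi * 1800 * t)
         + 0.05 * rng.standard_normal(512))
log_mag, smooth = smoothed_spectrum(frame)
```

The liftered curve varies slowly across frequency, so the broad formant peaks survive while the fine harmonic and noise ripple is removed.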
Parametric Representations: Formants
Formant frequencies:
• Characterize the frequency response of the vocal tract
• Used in the characterization of vowels
• Can be used to determine gender
[Figure: spectra (0-4000 Hz) showing formant peaks for several vowel sounds.]
Parametric Representations: LPC
s[n] = Σk ak s[n−k] + G u[n]
Linear predictive coefficients:
• Used in vocoding
• Spectral estimation
[Figure: speech spectra with LPC spectral envelope estimates for several model orders.]
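The LPC model of this slide can be estimated with the autocorrelation method and the Levinson-Durbin recursion. A minimal NumPy sketch (the deck's implementation is in Matlab; the synthetic test signal and order are assumptions):

```python
import numpy as np

def lpc(x, order):
    """Estimate coefficients a_k of s[n] = sum_k a_k s[n-k] + G u[n]
    (autocorrelation method, Levinson-Durbin recursion)."""
    n = len(x)
    r = [float(np.dot(x[:n - i], x[i:])) for i in range(order + 1)]
    a = [1.0] + [0.0] * order            # A(z) polynomial, a[0] = 1
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                   # reflection coefficient
        prev = a[:]
        for j in range(1, i + 1):
            a[j] = prev[j] + k * prev[i - j]
        err *= 1.0 - k * k
    # Negate to match the predictor form used on the slide
    return -np.array(a[1:]), err

# Sanity check on a synthetic 2nd-order autoregressive signal
rng = np.random.default_rng(1)
x = np.zeros(4000)
e = rng.standard_normal(4000)
for i in range(2, 4000):
    x[i] = 0.9 * x[i - 1] - 0.5 * x[i - 2] + e[i]
coeffs, residual = lpc(x, 2)             # coeffs should be near [0.9, -0.5]
```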
Parametric Representations: Cepstrum
In the source-filter model, the excitation u[n] (impulse train P[n] with gain Av for voiced speech, or noise with gain AN) is filtered by G(z), V(z), and R(z), so the observed signal is a convolution of source and filter. A homomorphic system D[·], L[·], D⁻¹[·] converts this convolution into addition:
x1[n] * x2[n] → D[·] → X1(z)X2(z) → L[·] → log(X1(z)) + log(X2(z)) → D⁻¹[·] → x1'[n] + x2'[n]
With D[·] = DFT, L[·] = log, and D⁻¹[·] = IDFT, the cepstrum is
c[n] = IDFT( log |DFT( x[n] )| )
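The homomorphic chain DFT → log → IDFT above can be sketched directly, and its defining property (convolution in time becomes addition in the cepstral domain) checked on random signals. A minimal NumPy sketch; the signal lengths and FFT size are illustrative assumptions:

```python
import numpy as np

def real_cepstrum(x, n_fft):
    """c[n] = IDFT( log |DFT(x[n])| ): the homomorphic transform that
    turns convolution into addition (small epsilon avoids log(0))."""
    return np.fft.ifft(np.log(np.abs(np.fft.fft(x, n_fft)) + 1e-12)).real

# Convolution in the time domain -> addition in the cepstral domain
rng = np.random.default_rng(2)
x1 = rng.standard_normal(64)
x2 = rng.standard_normal(64)
n_fft = 256                      # >= len(x1) + len(x2) - 1 for linear convolution
c_conv = real_cepstrum(np.convolve(x1, x2), n_fft)
c_sum = real_cepstrum(x1, n_fft) + real_cepstrum(x2, n_fft)
```

Because the FFT length covers the full linear convolution, `c_conv` and `c_sum` agree up to floating-point error.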
Speaker Recognition
Definition:
• The method of recognizing a person based on his or her voice
• One of the forms of biometric identification
• Depends on speaker-dependent characteristics
Speaker Recognition
• Speaker Identification (text dependent / text independent)
• Speaker Verification (text dependent / text independent)
• Speaker Detection
Speech Applications
• Transmission
• Speech synthesis
• Speech enhancement
• Aids to the handicapped
• Speech recognition
• Speaker verification
Generic Speaker Recognition System
Enrollment: speech signal → Preprocessing → Feature Extraction → Speaker Model
Verification: speech signal → Preprocessing → Feature Extraction → Pattern Matching → score
• Preprocessing: A/D conversion, end point detection, pre-emphasis filter, segmentation
• Features: LAR, Cepstrum, LPCC, MFCC
• Speaker models: stochastic models (GMM, HMM), template models (DTW)
• Matching: distance measures
The speech signal is divided into analysis frames, each of which yields a feature vector.
Choice of features
• Differentiating factors between speakers include vocal tract shape and behavioral traits
• Features should have high inter-speaker and low intra-speaker variation
Our Approach
• Preprocessing: silence removal
• Feature Extraction: cepstrum coefficients, cepstral normalization (long time average), polynomial function expansion
• Speaker model: reference template
• Matching: dynamic time warping, distance computation
Silence Removal
The short-time energy of the signal within a window w[·] of length N is
E[n] = Σk ( x[k] w[n−k] )²
and the average energy is
E_avg = (1/N) Σk x²[k]
Frames whose short-time energy falls below a threshold derived from E_avg are treated as silence and discarded.
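The energy-based silence removal above can be sketched as follows. The frame length, the use of a rectangular window, and the threshold ratio are illustrative assumptions, not the deck's exact settings:

```python
import numpy as np

def remove_silence(x, frame_len=256, threshold_ratio=0.1):
    """Drop frames whose short-time energy is below a fraction of the
    average frame energy (rectangular window; settings are assumptions)."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)        # E = sum of x^2[k] per frame
    avg = energy.mean()                       # average frame energy
    keep = energy >= threshold_ratio * avg
    return frames[keep].ravel()

# Example: a half-silent signal; the silent half is discarded
x = np.concatenate([np.zeros(512), np.ones(512)])
voiced = remove_silence(x)
```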
Pre-emphasis
A first-order filter boosts the high-frequency content of the speech signal:
H(z) = 1 − a·z⁻¹,  a = 0.95
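The pre-emphasis filter H(z) = 1 − a·z⁻¹ is a one-line difference equation, y[n] = x[n] − a·x[n−1]. A minimal NumPy sketch:

```python
import numpy as np

def pre_emphasis(x, a=0.95):
    """Apply H(z) = 1 - a*z^-1, i.e. y[n] = x[n] - a*x[n-1]
    (the first sample is passed through unchanged)."""
    x = np.asarray(x, dtype=float)
    return np.concatenate(([x[0]], x[1:] - a * x[:-1]))

y = pre_emphasis([1.0, 1.0, 1.0, 1.0])
```

A constant (DC) input is almost entirely suppressed after the first sample, which is exactly the low-frequency attenuation the filter is meant to provide.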
Segmentation
Short-time analysis:
• The speech signal is segmented into overlapping 'Analysis Frames' Q1, Q2, …
• The speech signal is assumed to be stationary within each frame
Each frame is extracted with a Hamming window of length N:
w[n] = 0.54 − 0.46 cos( 2πn / (N−1) )
Q[n] = Σk x[k] w[n−k]   (the n-th windowed analysis frame)
where N is the length of the analysis frame.
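The framing and Hamming windowing above can be sketched in NumPy. The frame length and 50% hop are illustrative assumptions:

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split x into overlapping frames and apply a Hamming window
    w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])
    return frames * w

frames = frame_signal(np.ones(1024))
```

For a constant input each row is simply the window itself; the window tapers to 0.54 − 0.46 = 0.08 at both ends, which reduces spectral leakage at the frame boundaries.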
Feature Representation
[Figure: speech signal and spectrum of two users uttering 'ONE'.]
Speaker Model
Each utterance is stored as a sequence of frame-level feature vectors:
F1 = [a1…a10, b1…b10]
F2 = [a1…a10, b1…b10]
…………….
FN = [a1…a10, b1…b10]
The aj are cepstral coefficients, and the bj are the coefficients of a least-squares polynomial fit to the trajectory of each cepstral coefficient across neighbouring frames (the polynomial function expansion).
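The polynomial function expansion above can be sketched with a least-squares fit over frame indices. The fit degree and the toy trajectories are illustrative assumptions:

```python
import numpy as np

def trajectory_coeffs(cepstra, degree=2):
    """Least-squares polynomial fit to each cepstral coefficient's
    trajectory over frames. cepstra has shape (n_frames, n_ceps);
    returns (degree+1, n_ceps) coefficients, highest power first."""
    t = np.arange(cepstra.shape[0])
    # np.polyfit fits every column of a 2-D y at once
    return np.polyfit(t, cepstra, degree)

# Toy example: coefficient 0 rises linearly, coefficient 1 is constant
t = np.arange(10)
cepstra = np.stack([2.0 * t + 1.0, np.full(10, 3.0)], axis=1)
b = trajectory_coeffs(cepstra, degree=1)
```

The fitted coefficients summarize how each cepstral coefficient evolves in time, which is the behavioral information the b-features are meant to capture.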
Dynamic Time Warping
Given two sequences of K-dimensional feature vectors
X(n) = {x1(n), x2(n), …, xK(n)},  n = 1…N
Y(m) = {y1(m), y2(m), …, yK(m)},  m = 1…M
the frame-to-frame distance is
D( X(n), Y(m) ) = Σi=1…K ( xi(n) − yi(m) )²
and the template distance is the minimum over warping paths w(·):
DT = min over w { Σn=1…N D( X(n), Y(w(n)) ) }
• The DTW warping path in the N-by-M matrix is the path with the minimum average cumulative cost. The unmarked area is the constraint region within which the path is allowed to go.
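The DTW recurrence above can be sketched with a standard dynamic program. The allowed steps (match, insertion, deletion) and the path-length normalization are common choices but are assumptions here, not the deck's exact variant:

```python
import numpy as np

def dtw_distance(X, Y):
    """DTW between frame sequences X (N x K) and Y (M x K).
    Local cost: squared Euclidean distance between frames."""
    N, M = len(X), len(Y)
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            cost = np.sum((X[n - 1] - Y[m - 1]) ** 2)
            # Cheapest way to reach (n, m): diagonal, vertical, horizontal
            D[n, m] = cost + min(D[n - 1, m - 1], D[n - 1, m], D[n, m - 1])
    return D[N, M] / (N + M)        # normalize by a path-length bound

X = np.array([[0.0], [1.0], [2.0]])
same = dtw_distance(X, X)           # identical templates -> zero distance
shifted = dtw_distance(X, X + 5.0)  # offset templates -> nonzero distance
```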
Results
        a0      a1      r0      r1      s0      s1
a0      0       0.1226  0.3664  0.3297  0.4009  0.4685
a1      0.1226  0       0.5887  0.3258  0.4086  0.4894
r0      0.3664  0.5887  0       0.0989  0.3299  0.4243
r1      0.3297  0.3258  0.0989  0       0.367   0.4287
s0      0.4009  0.4086  0.3299  0.367   0       0.1401
s1      0.4685  0.4894  0.4243  0.4287  0.1401  0
• Distances are normalized with respect to the length of the speech signal
• Intra-speaker distances are smaller than inter-speaker distances
• The distance matrix is symmetric
Matlab Implementation
THANK YOU