1
Robust HMM classification schemes for speaker recognition
using integral decode
Marie Roch
Florida International University
2
Who am I?
3
Speaker Recognition
• Types of speaker recognition:
– Verification vs. identification
– Text dependent vs. text independent
4
Speaker Recognition
• Why is it hard?
• Minimal training data
• Background noise
• Transducer mismatch
• Channel distortions
• People’s voices change over time and under stress
• Performance
5
Feature Extraction
• Extract speech
• Spectral analysis
• Cepstrum: $c = \mathrm{dft}^{-1}\left(\log\left|\mathrm{dft}(S)\right|\right)$
• Cepstral mean removal
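A minimal numpy sketch of the cepstrum and cepstral mean removal above; framing, windowing, and the exact DFT variant are simplifying assumptions:

```python
import numpy as np

def cepstrum(frames, eps=1e-10):
    """Real cepstrum per frame: inverse DFT of the log magnitude spectrum."""
    spectrum = np.fft.rfft(frames, axis=-1)
    log_mag = np.log(np.abs(spectrum) + eps)   # eps guards against log(0)
    return np.fft.irfft(log_mag, axis=-1)

def cepstral_mean_removal(cepstra):
    """Subtract the per-utterance mean of each coefficient to reduce
    stationary channel effects (e.g. transducer/channel mismatch)."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# frames: (num_frames, frame_len) windowed speech frames (stand-in data)
frames = np.random.randn(100, 256)
features = cepstral_mean_removal(cepstrum(frames))
```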
6
Hidden Markov Models
• Statistical pattern recognition
• State dependent modeling
– Distribution/state
– Radial basis functions common
• State sequence unobservable
7
HMM
• Efficient decoders: $O(N^2 T)$
• Training
– EM algorithm
– Convergence to a local maximum guaranteed
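A sketch of a log-domain Viterbi decode showing where the $O(N^2 T)$ cost comes from; the array layout is an assumption:

```python
import numpy as np

def viterbi_loglik(log_A, log_pi, log_B):
    """Best-path log likelihood for an N-state HMM over T frames.

    log_A:  (N, N) log transition matrix
    log_pi: (N,)   log initial state probabilities
    log_B:  (T, N) log output probabilities per frame and state

    The loop runs T times, and each iteration performs an N x N
    max-plus step: hence the O(N^2 T) decoder cost.
    """
    T, N = log_B.shape
    delta = log_pi + log_B[0]
    for t in range(1, T):
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[t]
    return delta.max()
```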
8
Recognition
• Model for each speaker
• Maximum a posteriori (MAP) decision rule
[Diagram: features are scored against each speaker model; ArgMax over the scores selects the speaker]
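A sketch of the recognition step under the MAP rule with equal speaker priors; the `log_likelihood` model interface is hypothetical:

```python
def identify(features, models):
    """MAP identification with equal speaker priors: the decision
    reduces to the maximum-likelihood speaker.

    models: dict mapping speaker id -> HMM exposing a
            .log_likelihood(features) method (hypothetical interface).
    """
    scores = {spk: hmm.log_likelihood(features) for spk, hmm in models.items()}
    return max(scores, key=scores.get)
```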
9
The MAP decision rule
• Optimal decision rule provided we have accurate distribution parameters & observations.
• Problem:
– Corruption of feature vectors.
– Distribution known to be inaccurate.
10
A case of mistaken identity
11
Integral decode
• Goal: Include the uncorrupted observation $\hat{o}_t$.
• Problem: $\hat{o}_t$ is unobservable.
• Determine a local neighborhood $\Omega_t$ about $o_t$ and use a priori information to weight the likelihood:

$\Pr(o_t \mid M) \approx \int_{\Omega_t} \Pr(\hat{o}_t \mid M)\,\Pr(\hat{o}_t \mid o_t)\,d\hat{o}_t$
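As a 1-D illustration of the weighted likelihood (assuming a zero-mean normal error model; `model_pdf` is a hypothetical vectorized state output density):

```python
import numpy as np

def weighted_likelihood_1d(o_t, model_pdf, sigma, half_width, n=51):
    """Approximate Pr(o_t|M) by integrating Pr(o_hat|M) * Pr(o_hat|o_t)
    over Omega_t = [o_t - half_width, o_t + half_width], with a
    zero-mean normal error model of standard deviation sigma."""
    o_hat = np.linspace(o_t - half_width, o_t + half_width, n)
    step = o_hat[1] - o_hat[0]
    # Pr(o_hat | o_t): normal density centered on the observed frame
    prior = np.exp(-0.5 * ((o_hat - o_t) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.sum(model_pdf(o_hat) * prior) * step
```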
12
Integral decode issues
• Problems approximating the integral
– High frame rate × number of models
– Non-trivial dimensionality
• Selection of the neighborhood
13
Approximating the integral
• Monte Carlo impractical
• Use simplified cubature technique:
$\Pr(o_t \mid M) \approx \sum_{i_1=1}^{C}\cdots\sum_{i_d=1}^{C} \mathrm{area}(i)\,\Pr(\mathrm{step}(i) \mid M)\,\Pr_{\mathrm{prior}}(\mathrm{step}(i) \mid o_t)$

where $\mathrm{step}(i)$ is the midpoint of cell $i$ of a uniform grid with $C$ cells per axis over $\Omega_t$:

$\mathrm{step}_k(i) = \ell_k + \frac{2 i_k - 1}{2}\cdot\frac{u_k - \ell_k}{C}$
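A sketch of the step-function cubature under the midpoint reading above; the $C^d$ growth in evaluation points is exactly the dimensionality problem noted on the previous slide. All names here are illustrative:

```python
import numpy as np

def midpoint_cubature(model_pdf, prior_pdf, lower, upper, C):
    """Step-function cubature over the box [lower, upper] in d dimensions:
    evaluate the integrand at each of the C**d cell midpoints (the step
    points) and weight by the common cell volume (area(i))."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    d = lower.size
    # midpoints along axis k: step_k(i) = lower_k + (2*i - 1)/2 * width_k / C
    axes = [lower[k] + (np.arange(C) + 0.5) * (upper[k] - lower[k]) / C
            for k in range(d)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, d)
    volume = np.prod((upper - lower) / C)      # equal cell volume
    return sum(model_pdf(x) * prior_pdf(x) for x in grid) * volume
```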
14
Neighborhood choice
• Choosing an appropriate neighborhood:
– Upper bound difference neighborhoods [Merhav and Lee 93]
– Error source modeling
15
Upper bound difference neighborhoods
• Arbitrary signal pairs with a few general conditions.
• PSD (rational pole/zero form):

$S(e^{j\theta}) = \sigma^2\,\frac{\prod_i \left(1 - \beta_i e^{j\theta}\right)\left(1 - \bar{\beta}_i e^{-j\theta}\right)}{\prod_i \left(1 - \alpha_i e^{j\theta}\right)\left(1 - \bar{\alpha}_i e^{-j\theta}\right)}$

• Cepstra:

$c_i = \frac{1}{i}\left(\sum_k \alpha_k^i - \sum_k \beta_k^i\right), \quad i \ge 1$
16
Taking the upper bound
• Asymptotic difference between cepstral parameters:
$\left|c_i^{(1)} - c_i^{(2)}\right| \le \frac{4k}{i}\,\rho^i, \qquad \rho = \max_j\left\{\left|\alpha_j^{(1)}\right|,\left|\beta_j^{(1)}\right|,\left|\alpha_j^{(2)}\right|,\left|\beta_j^{(2)}\right|\right\}$
17
Error source modeling
• Multiple error sources
• Simplifying assumption of one normal distribution with zero mean
• Use time series analysis to estimate the noise
• Trend
$O_t = o_t + n_t$

with the trend $\hat{o}_t$ estimated from the observations neighboring time $t$, and the noise estimate taken as the detrended residual:

$\hat{n}_t = O_t - \hat{o}_t$
18
Error Source Modeling
• Estimate variance from detrended signal
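A sketch of the detrend-and-estimate step, assuming a centered moving-average trend; the window length is a hypothetical hyperparameter:

```python
import numpy as np

def estimate_noise_variance(x, window=5):
    """Estimate the error variance under O_t = o_t + n_t with n_t
    assumed zero-mean normal: fit the trend with a centered moving
    average, then take the variance of the detrended residual."""
    kernel = np.ones(window) / window
    trend = np.convolve(x, kernel, mode="same")  # local trend estimate o_hat_t
    residual = x - trend                         # detrended signal n_hat_t
    return residual.var()                        # edges are slightly biased
```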
19
Error source modeling
• Problem:
– $\Omega_t$ is infinite
• Solution:
– Most of the points are outliers
– Set a percentage of the distribution beyond which points are culled

$\Pr(o_t \mid M) \approx \int_{\Omega_t} \Pr(\hat{o}_t \mid M)\,\Pr(\hat{o}_t \mid o_t)\,d\hat{o}_t$
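A sketch of the culling step: choose integration bounds that drop a fixed fraction of the normal error distribution's mass, using the matching quantile from `scipy.stats.norm.ppf`:

```python
from scipy.stats import norm

def neighborhood_bounds(o_t, sigma, cull=0.05):
    """The normal error model makes Omega_t infinite; truncate it by
    culling the outer `cull` fraction of the distribution's mass.
    Returns integration bounds around the observed frame o_t."""
    z = norm.ppf(1.0 - cull / 2.0)   # e.g. cull=0.05 -> z ~ 1.96
    return o_t - z * sigma, o_t + z * sigma
```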
20
Complexity of integration
• Expensive
• Ways to reduce/cope
– Implemented:
• Top K processing
• Principal Components Analysis
– Possible:
• Gaussian Selection
• Sub-band Models
• SIMD or MIMD parallelism
$O(\underbrace{S}_{\text{Speakers}}\;\underbrace{N^2 T}_{\text{Decoder}}\;\underbrace{M}_{\text{Mixtures}}\;\underbrace{E\,C}_{\text{Integration}})$
21
Top K Processing
$O(S\,N^2 T\,M\,E\,C_{\mathrm{TopK}})$
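One plausible reading of Top K processing, sketched below: rank the mixture components with a cheap point score, then apply the expensive integral evaluation only to the K best, shrinking the integration factor in the cost above. The `expensive_score` interface is hypothetical:

```python
import numpy as np

def top_k_mixture_score(o_t, means, covs, weights, expensive_score, k=4):
    """means, covs: (M, d) component means and diagonal covariances.
    expensive_score: callable(o_t, mean, cov) -> integral-decode score."""
    cheap = -np.sum((o_t - means) ** 2 / covs, axis=1)  # neg. Mahalanobis-like
    top = np.argsort(cheap)[-k:]                        # indices of the K best
    return sum(weights[i] * expensive_score(o_t, means[i], covs[i]) for i in top)
```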
[Figure: Top K performance for 1, 3, and 5 second test segments]
22
Principal Component Analysis
• Choose P most important directions
23
Principal Component Analysis
• Integrate using new basis set for step function
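A sketch of the projection step: estimate the P directions of largest variance and re-express the features in that basis, so the integration grid lives in fewer dimensions:

```python
import numpy as np

def pca_project(features, p):
    """Project (num_frames, d) features onto the top-P principal directions."""
    centered = features - features.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    basis = eigvecs[:, ::-1][:, :p]          # top-P principal directions
    return centered @ basis, basis
```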
24
Speech Corpus
• King-92
– Used San Diego subset
• 26 male speakers
• Long distance telephone speech
• Quiet room environment
• 5 sessions recorded one week apart
– Sessions 1–3 used for training
– Sessions 4–5 partitioned into test segments
25
Baseline performance
26
Integral decode performance
Test     Baseline   Upper Bound Difference          Error Modeling
Length   Error      Error    Reduction   %          Error    Reduction   %
1 s      0.4420     0.4237   0.0183      4.14       0.4401   0.0019      0.43
3 s      0.1833     0.1554   0.0279      15.22      0.1753   0.0080      4.64
5 s      0.0872     0.0738   0.0134      15.37      0.0638   0.0234      26.83
27
Integral decode with other conditions
• Performance on
– high quality speech
– transducer mismatch
28
Future work
• Extensions to the integral decode
– Automatic parameter selection
– Gaussian selection
– Distributed computation
• Efficient multiple class preclassifiers
29
30
Optimal/utterance hyperparameters – 5 seconds
[Chart: optimal per-utterance hyperparameters for the KingNB, KingWB, SpidreF, and SpidreM test conditions]
31
95% Confidence Intervals
• Caveat:
– Per speaker means
– Large granularity
32
Pattern Recognition
• Long term statistics [Bricker et al 71, Markel et al 77]
• Vector Quantization [Soong et al 87]
• HMM [Rosenberg et al 90, Tishby 91, Matsui & Furui 92, Reynolds et al 95]
• Connectionist frameworks
– Feed forward [Oglesby & Mason 90]
– Learning vector quantization [He et al 99]
33
Pattern Recognition Contd.
• Hybrid/Modified HMMs
– Min Classification Error discriminant [Liu et al 95]
– Tree structured neural classifiers [Liou & Mammone 95]
• Trajectory modeling [Russell et al 85, Liu et al 95, Ostendorf et al 96, He et al 99]
• Sub-band recognition [Besacier & Bonastre 97]