CRICOS No. 000213J
* Speech, Audio, Image and Video Research Laboratory
† CSIRO ICT Centre
Audio-visual speaker verification using continuous fused HMMs
David Dean*, Sridha Sridharan*, and Tim Wark*†
Presented by David Dean
Slides will be available at http://www.davidbdean.com/category/publications
Why audio-visual speaker recognition
Bimodal recognition exploits the synergy between acoustic speech and visual speech, particularly under adverse conditions. It is motivated by the need, in many potential applications of speech-based recognition, for robustness to speech variability, high recognition accuracy, and protection against impersonation.
(Chibelushi, Deravi and Mason 2002)
Early and late fusion
• Most early approaches to audio-visual speaker recognition (AVSPR) used either early (feature) or late (output) fusion
• Problems
– Output fusion cannot model temporal dependencies
– Feature fusion suffers from problems with noise
[Figure: Early fusion: acoustic (A1 to A4) and visual (V1 to V4) features feed one set of speaker models, which give the speaker decision. Late fusion: separate acoustic and visual speaker models are fused to give the speaker decision.]
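The difference between the two schemes can be sketched in a few lines. This is only an illustration under assumed names: `early_fusion`, `late_fusion`, the `audio_weight` parameter and all the numbers are hypothetical and do not come from the slides.

```python
def early_fusion(audio_frames, video_frames):
    """Feature (early) fusion: concatenate per-frame feature vectors."""
    if len(audio_frames) != len(video_frames):
        raise ValueError("streams must be frame-synchronous before concatenation")
    return [a + v for a, v in zip(audio_frames, video_frames)]


def late_fusion(audio_score, video_score, audio_weight=0.5):
    """Output (late) fusion: weighted sum of per-modality classifier scores."""
    return audio_weight * audio_score + (1.0 - audio_weight) * video_score


# Four frames per stream, mirroring A1..A4 and V1..V4 in the diagram.
audio = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
video = [[1.0], [2.0], [3.0], [4.0]]

fused_features = early_fusion(audio, video)      # one joint vector per frame
fused_score = late_fusion(-1.0, -3.0, audio_weight=0.75)
```

Early fusion hands a single classifier the joint vectors, so noise in either stream reaches every model parameter. Late fusion only ever sees two final scores, so any frame-level timing between the streams is lost.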
Middle fusion - coupled HMMs
• Middle fusion models can accept two streams of input and the combination is done within the classifier
• Most middle fusion is performed using coupled HMMs (shown here)
– Can be difficult to train
– Dependencies between hidden states are not strong (Brand 1999)
Middle fusion – fused HMMs
• Pan et al. (2004) used probabilistic models to investigate the optimal multi-stream HMM design
– Maximise mutual information in audio and video
• They found that linking the observations of one modality to the hidden states of the other was closer to optimal than linking just the hidden states (i.e. coupled HMM)
Acoustic-biased FHMM: $\hat{p}_A(O_A; O_V) = p(O_A)\, p(O_V \mid \hat{U}_A)$
Choosing the dominant modality
• The fused HMM design results in two variants: acoustic-biased or video-biased
• The choice of the dominant modality (the one biased towards) should be based upon which individual HMM can more reliably estimate the hidden state sequence for a particular application
– Generally audio
• Alternatively, both versions can be used concurrently and decision fused (as in Pan et al. 2004)
Continuous fused HMMs
• The original Fused HMM implementation treated the secondary domain as discrete (Pan et al. 2004)
• This caused problems with within-speaker variation
– Works fine on single-session data (CUAVE, Dean et al. 2006)
– Fails on multi-session data (XM2VTS)
• Continuous FHMMs model both modalities with GMMs
[Figure: discrete FHMM and continuous FHMM structures]
Training FHMMs
• Both biased FHMMs (if both are needed) are trained independently
1. Train the dominant HMM (audio for acoustic-biased, video for video-biased) independently on the training observation sequences for that modality
2. Find the best hidden state sequence of the trained HMM for each training observation sequence using the Viterbi algorithm
3. Model the relationship between the dominant hidden state sequence and the training observation sequences for the subordinate modality
– i.e. model the probability of seeing a given subordinate observation whilst within a particular dominant hidden state
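Step 3 can be sketched as follows, assuming the Viterbi alignment from step 2 is already available as a list of state indices. To keep the example short, each dominant state gets a single diagonal Gaussian rather than the GMMs used in the continuous FHMM. The function name `fit_subordinate_models`, the variance floor and the toy data are all hypothetical.

```python
from collections import defaultdict


def fit_subordinate_models(state_path, sub_obs, var_floor=1e-3):
    """Fit one diagonal Gaussian per dominant state (a stand-in for a GMM).

    state_path[t] is the dominant HMM state at frame t (from Viterbi, step 2);
    sub_obs[t] is the subordinate-modality feature vector at the same frame.
    """
    frames_by_state = defaultdict(list)
    for state, obs in zip(state_path, sub_obs):
        frames_by_state[state].append(obs)

    models = {}
    for state, frames in frames_by_state.items():
        n, dim = len(frames), len(frames[0])
        mean = [sum(f[d] for f in frames) / n for d in range(dim)]
        # Floor the variance so states with near-constant data stay usable.
        var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, var_floor)
               for d in range(dim)]
        models[state] = (mean, var)
    return models


# Hypothetical alignment: six frames spread over three acoustic states.
state_path = [0, 0, 1, 1, 2, 2]
video_obs = [[1.0], [3.0], [10.0], [12.0], [-5.0], [-5.0]]
subordinate = fit_subordinate_models(state_path, video_obs)
```

The dominant HMM is never re-estimated here. Only the subordinate emission model attached to each of its states is learned.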
Decoding FHMMs
• The dominant FHMM can be viewed as a special type of HMM that outputs observations in two streams
• This does not affect the decoding lattice, and the Viterbi algorithm can be used to decode
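A minimal sketch of this decoding, assuming single diagonal Gaussians per state in place of GMMs: the lattice and transitions are those of the dominant HMM, and the only change to standard Viterbi is that each state's emission score is the sum of the two streams' log-likelihoods. All names and the toy parameters are hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def viterbi_two_stream(dom_obs, sub_obs, log_init, log_trans, dom_models, sub_models):
    """Viterbi over the dominant HMM's lattice with two-stream emissions."""
    n_states = len(log_init)

    def emit(t, s):
        # Both streams are emitted by the same (dominant) state.
        return (log_gauss(dom_obs[t], *dom_models[s])
                + log_gauss(sub_obs[t], *sub_models[s]))

    score = [log_init[s] + emit(0, s) for s in range(n_states)]
    back = []
    for t in range(1, len(dom_obs)):
        prev, score, ptr = score, [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            ptr.append(best)
            score.append(prev[best] + log_trans[best][s] + emit(t, s))
        back.append(ptr)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    best_score, path = score[state], [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return best_score, path[::-1]


NEG_INF = float("-inf")
log_init = [0.0, NEG_INF]                       # always start in state 0
log_trans = [[math.log(0.5), math.log(0.5)],    # left-to-right topology
             [NEG_INF, 0.0]]
audio_models = [([0.0], [1.0]), ([5.0], [1.0])]
video_models = [([0.0], [1.0]), ([5.0], [1.0])]
audio = [[0.1], [0.2], [4.9], [5.1]]
video = [[0.0], [0.3], [5.2], [4.8]]

best_score, path = viterbi_two_stream(audio, video, log_init, log_trans,
                                      audio_models, video_models)
```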
Experimental setup
[Figure: experimental setup. Acoustic feature extraction, and visual feature extraction after manual lip tracking, feed three systems that each produce a speaker score: HMM/GMM output fusion (acoustic HMM/GMM and visual HMM/GMM combined by output fusion), the acoustic-biased FHMM, and the video-biased FHMM.]
Training and testing datasets
• Training and testing configuration was based on the XM2VTSDB protocol (Messer et al. 1999)
• 12 configurations were generated based on the single XM2VTSDB configuration
• For each configuration
– 400 client tests
– 8000 impostor tests
Feature extraction
• Audio
– MFCC – 12 + 1 energy, + deltas and accelerations = 43 features
• Video
– Lip ROI manually tracked every 50 frames
  • 120x80 pixels
  • Grayscale
  • Down-sampled to 24x16
– DCT – 20 coefficients + deltas and accelerations = 60 features
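As a rough illustration of how 20 static DCT coefficients become 60 features, the sketch below appends deltas and accelerations to each frame. The slides do not say how the derivatives were computed, so the central difference with replicated edge frames used here is an assumption, as are the function names and the input data.

```python
def deltas(frames):
    """Central-difference time derivative with edge frames replicated."""
    padded = [frames[0]] + frames + [frames[-1]]
    # padded[t] is the previous frame and padded[t + 2] the next frame.
    return [[(nxt - prv) / 2.0 for prv, nxt in zip(padded[t], padded[t + 2])]
            for t in range(len(frames))]


def add_dynamic_features(frames):
    """Append deltas and accelerations (deltas of deltas) to static features."""
    d = deltas(frames)
    a = deltas(d)
    return [s + dv + av for s, dv, av in zip(frames, d, a)]


# Hypothetical input: 5 frames of 20 static DCT coefficients each.
static = [[float(t * k) for k in range(20)] for t in range(5)]
full = add_dynamic_features(static)
print(len(full), len(full[0]))  # 5 60
```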
Output fusion training
• Two classifier types for each modality
– Gaussian mixture models (GMMs)
  • Trained over entire sequences
– Hidden Markov models (HMMs)
  • Trained for each word
• Speaker models adapted from background models using maximum a posteriori (MAP) adaptation (Lee & Gauvain 1993)
• Topology of HMMs and GMMs determined by testing on the evaluation partition of the first configuration
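One common form of MAP adaptation updates only the component means, interpolating between the background-model mean and the mean of the speaker's data. The sketch below shows that form. The slides do not state which parameters were adapted or what prior weight was used, so the mean-only update, the `relevance` value and the toy data are assumptions.

```python
def map_adapt_means(ubm_means, frames, posteriors, relevance=16.0):
    """Mean-only MAP adaptation of a GMM from a background model.

    posteriors[t][k] is the responsibility of component k for frame t,
    computed under the background model.
    """
    adapted = []
    for k, prior_mean in enumerate(ubm_means):
        n_k = sum(p[k] for p in posteriors)            # soft frame count
        if n_k == 0.0:
            adapted.append(list(prior_mean))           # no data: keep the prior
            continue
        data_mean = [sum(p[k] * f[d] for p, f in zip(posteriors, frames)) / n_k
                     for d in range(len(prior_mean))]
        alpha = n_k / (n_k + relevance)                # weight given to the data
        adapted.append([alpha * x + (1.0 - alpha) * m
                        for x, m in zip(data_mean, prior_mean)])
    return adapted


# Hypothetical speaker data: 16 frames, all assigned to component 0.
ubm_means = [[0.0], [10.0]]
frames = [[2.0] for _ in range(16)]
posteriors = [[1.0, 0.0] for _ in range(16)]
speaker_means = map_adapt_means(ubm_means, frames, posteriors)
print(speaker_means)  # [[1.0], [10.0]]
```

Components that see little speaker data stay close to the background model, which is what makes adaptation workable with short enrolment sessions.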
Output fusion testing
• Fused HMM performance is compared to output fusion of normal HMMs and GMMs in each modality
– Audio HMM + Video HMM
– Audio HMM + Video GMM
– Audio GMM + Video HMM
– Audio GMM + Video GMM
• Evaluation session used to estimate each modality's output score distribution to normalise scores within each modality
• Background model score subtracted from speaker scores to normalise for environment and length
[Figure: each modality's score is normalised, then the two normalised scores are summed (output fusion)]
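The normalise-then-sum pipeline can be sketched as below. It assumes that the evaluation-session scores are already background-subtracted and that normalisation means removing the mean and dividing by the standard deviation. Neither detail is given on the slide, and the function names, the equal weighting and every number are hypothetical.

```python
import statistics


def fit_normaliser(eval_scores):
    """Estimate a modality's score distribution from evaluation-session scores."""
    return statistics.mean(eval_scores), statistics.pstdev(eval_scores)


def normalise(speaker_score, background_score, mean, std):
    """Subtract the background-model score, then standardise."""
    return ((speaker_score - background_score) - mean) / std


def output_fusion(audio_norm, video_norm, audio_weight=0.5):
    """Weighted sum of the two normalised modality scores."""
    return audio_weight * audio_norm + (1.0 - audio_weight) * video_norm


# Hypothetical evaluation-session scores (already background-subtracted).
audio_mean, audio_std = fit_normaliser([1.0, 3.0])
video_mean, video_std = fit_normaliser([0.0, 4.0])

audio_norm = normalise(-40.0, -45.0, audio_mean, audio_std)
video_norm = normalise(-90.0, -92.0, video_mean, video_std)
fused = output_fusion(audio_norm, video_norm)
print(fused)  # 1.5
```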
Output fusion results
• HMM-based models take advantage of temporal information to improve performance over GMM-based models in both modalities
• Audio GMM performance approaches that of the audio HMM, but only with a large number of Gaussians
• Video GMM does not improve with more Gaussians
Output fusion results
• Advantage of video HMM does not carry over to output fusion
• Little difference between video HMM and GMM in output fusion
• Output fusion performance affected mostly by choice of audio classifier
• Output fusion doesn’t take advantage of video temporal information
[Figure: output fusion results, with panels for the audio GMM and the audio HMM]
Fused HMM training
• Both acoustic- and visual-biased FHMMs are examined
• Individual audio and video HMMs used as basis for FHMMs
• Secondary models adapted from individual speaker’s GMMs for each state of the underlying HMM
• Background FHMM was formed similarly
Fused HMM testing
• Subordinate observations are up/down sampled to the same rate as the dominant HMM
• Evaluation session used to estimate each modality's frame-score distribution to normalise scores within each modality
– Similar to output fusion, but on a frame-by-frame basis rather than using the final output score
• As well as using subordinate models adapted to the states of the dominant HMM, testing is performed with
– Word subordinate models (same secondary model for entire word)
– Global subordinate models (same secondary model for all words)
• Finally, background FHMM score is subtracted to normalise for environment and length
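The resampling step in the first bullet can be sketched with nearest-frame selection, which covers both up-sampling and down-sampling. The slides do not say which resampling method was used, so this is an assumption, and `resample_to` is a hypothetical helper.

```python
def resample_to(sub_frames, n_target):
    """Nearest-frame resampling of a subordinate stream to n_target frames."""
    n_src = len(sub_frames)
    return [sub_frames[(t * n_src) // n_target] for t in range(n_target)]


# Hypothetical rates: 4 video frames stretched to 8 audio-rate frames.
video = ["v0", "v1", "v2", "v3"]
upsampled = resample_to(video, 8)
print(upsampled)  # ['v0', 'v0', 'v1', 'v1', 'v2', 'v2', 'v3', 'v3']

# The same helper down-samples when the dominant stream is the slower one.
downsampled = resample_to(list(range(8)), 4)
print(downsampled)  # [0, 2, 4, 6]
```

After resampling, every dominant frame has exactly one subordinate observation, so the two streams can be scored state by state.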
Comparison with output-fusion
• If the same subordinate model is used for each dominant state, the FHMM model can be viewed as functionally equivalent to HMM-GMM output fusion
– Although in practice this is not the case due to resampling of the subordinate observations and where modality-normalisation occurs
• Word and State subordinate models can also be viewed as functionally equivalent to HMM-GMM output fusion
– Choose the subordinate GMM based on the dominant state for each frame
– Provided that the FHMM design doesn’t affect the best path through the lattice
Acoustic-biased FHMM results
• There is some benefit in using state-based FHMM models for audio
• State or word-based FHMM models are better than global for most of the plot
[Figure: acoustic-biased FHMM results, with the best-performing output fusion marked for comparison]
Video-biased FHMM results
• No benefit in using word or state-based FHMM models for video
• Therefore, there is no benefit in using video-biased FHMM models at all
[Figure: video-biased FHMM results, with the best-performing output fusion marked for comparison]
Acoustic vs. video-biased FHMM
• The acoustic-biased FHMM shows that the audio can be used to segment the video into visually-similar sequences
• However, the video-biased FHMM cannot use video to segment the audio into acoustically-similar sequences
• Whilst the performance increase is small, it appears that the acoustic FHMM is benefiting from a temporal relationship between the acoustic states and the visual observations
Conclusion
• Video HMM improves performance over video GMM, but not when used in output fusion
• Output fusion performance based mainly on acoustic classifier chosen
• Audio-biased continuous FHMMs can take advantage of the temporal relationship between audio states and video observations
• However, the video-biased continuous FHMM performance appears to show no corresponding relationship between video states and audio observations
Continuing/Future Research
• Secondary GMMs are recognising a large amount of static video information
– Skin or lip colour, facial hair, etc.
• This information has no temporal relationship with the audio states, and may be swamping the more dynamic information available in facial movements
• A more efficient structure may be realised by using more dynamic video features (mean-removed DCT, contour-based or optical flow) and output fusion with a face GMM
– This would take advantage of the temporal audio-visual relationship, in addition to static face-recognition
Speech Recognition with Fused HMMs
• FHMMs improve single-modal speech processing in two ways:
1. 2nd modality improves scores within states
2. 2nd modality improves state sequence
• Text-dependent speaker recognition only benefits from the first improvement
– State sequence is fixed
• However, speech recognition can take advantage of both improvements
Speech Recognition with Fused HMMs
• Using first XM2VTS configuration (XM2VTSDB)
• Speaker-independent, continuous-speech, digit recognition
• PLP-based Audio Features, Hierarchical LDA-based (of mean-removed DCT) video features
• We believe this is comparable performance to coupled and asynchronous HMMs
– But simpler to train and decode
FHMMs, synchronous HMMs and feature fusion
• The FHMM structure can be implemented as a multi-modal synchronous HMM, and therefore with minor simplification as a feature-fusion HMM
• The difference is in how the structure is trained
– In synchronous HMMs and feature-fusion, both modalities are used to train the HMMs
– FHMMs can be viewed as adapting a multi-modal synchronous HMM from the dominant single-modal HMM
• If the same number of Gaussians are used for both modalities, an FHMM can be implemented within a single-modal HMM decoder
– Decoding is exactly the same as with feature-fusion
References
• Brand, M. (1999), A Bayesian computer vision system for modeling human interactions, in `ICVS'99', Gran Canaria, Spain.
• Chibelushi, C., Deravi, F. & Mason, J. (2002), `A review of speech-based bimodal recognition', IEEE Transactions on Multimedia 4(1), 23-37.
• Dean, D., Wark, T. & Sridharan, S. (2006), An examination of audio-visual fused HMMs for speaker recognition, in `MMUA 2006', Toulouse, France.
• Lee, C.-H. & Gauvain, J.-L. (1993), Speaker adaptation based on MAP estimation of HMM parameters, in `IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-93)', Vol. 2, pp. 558-561.
• Luettin, J. & Maitre, G. (1998), Evaluation protocol for the extended M2VTS database (XM2VTSDB), Technical report, IDIAP.
• Messer, K., Matas, J., Kittler, J., Luettin, J. & Maitre, G. (1999), XM2VTSDB: The extended M2VTS database, in `Audio and Video-based Biometric Person Authentication (AVBPA '99), Second International Conference on', Washington D.C., pp. 72-77.
• Pan, H., Levinson, S., Huang, T. & Liang, Z.-P. (2004), `A fused hidden Markov model with application to bimodal speech processing', IEEE Transactions on Signal Processing 52(3), 573-581.
Questions?