Synchronous HMMs for Audio-Visual
Speech Processing
by
David Dean, BEng (Hons), BIT
PhD Thesis
Submitted in Fulfilment
of the Requirements
for the Degree of
Doctor of Philosophy
at the
Queensland University of Technology
Faculty of Engineering
July 2008
Keywords
Speech processing, speech recognition, speaker recognition, speaker verification, multi-
modal, audio-visual, data fusion, pattern recognition, hidden Markov models, syn-
chronous hidden Markov models
Abstract
Both human perceptual studies and automatic machine-based experiments have shown
that visual information from a speaker’s mouth region can improve the robustness of
automatic speech processing tasks, especially in the presence of acoustic noise. By
taking advantage of the complementary nature of the acoustic and visual speech in-
formation, audio-visual speech processing (AVSP) applications can work reliably in
more real-world situations than would be possible with traditional acoustic speech
processing applications. The two most prominent applications of AVSP for viable
human-computer interfaces involve the recognition of the speech events themselves,
and the recognition of speakers' identities based upon their speech. However, while
these two fields of speech and speaker recognition are closely related, there has been
little systematic comparison of the two tasks under similar conditions in the existing
literature. Accordingly, the primary focus of this thesis is to compare the suitability of
general AVSP techniques for speech or speaker recognition, with a particular focus on
synchronous hidden Markov models (SHMMs).
The cascading appearance-based approach to visual speech feature extraction has been
shown to work well in removing irrelevant static information from the lip region to
greatly improve visual speech recognition performance. This thesis demonstrates that
these dynamic visual speech features also provide for an improvement in speaker
recognition, showing that speakers can be visually recognised by how they speak,
in addition to their appearance alone.
This thesis investigates a number of novel techniques for training and decoding of
SHMMs that improve the audio-visual speech modelling ability of the SHMM ap-
proach over the existing state-of-the-art joint-training technique. Novel experiments
are conducted to demonstrate that the reliability of the two streams during training
is of little importance to the final performance of the SHMM. Additionally,
two novel techniques of normalising the acoustic and visual state classifiers within the
SHMM structure are demonstrated for AVSP. Fused hidden Markov model (FHMM)
adaptation is introduced as a novel method of adapting SHMMs from existing well-
performing acoustic hidden Markov models (HMMs). This technique is demonstrated
to provide improved audio-visual modelling over the jointly-trained SHMM approach
at all levels of acoustic noise for the recognition of audio-visual speech events. How-
ever, the close coupling of the SHMM approach will be shown to be less useful for
speaker recognition, where a late integration approach is demonstrated to be supe-
rior.
Contents
Keywords iii
Abstract v
List of Tables xvii
List of Figures xix
Commonly used Abbreviations xxv
Certification of Thesis xxvii
Acknowledgements xxix
Chapter 1 Introduction 1
1.1 Motivation and overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Aims and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Outline of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Original contributions of thesis . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Publications resulting from research . . . . . . . . . . . . . . . . . . . . . 7
1.5.1 International journal publications . . . . . . . . . . . . . . . . . . 7
1.5.2 International conference publications . . . . . . . . . . . . . . . . 8
Chapter 2 Audio-Visual Speech Processing 11
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Audio-visual speech processing by humans . . . . . . . . . . . . . . . . . 12
2.2.1 The speech chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 Speech production . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.3 Phonemes and visemes . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.4 Audio-visual speech perception . . . . . . . . . . . . . . . . . . . 17
2.2.5 Audio-visual speaker perception . . . . . . . . . . . . . . . . . . . 18
2.3 Automatic audio-visual speech processing . . . . . . . . . . . . . . . . . 21
2.3.1 Audio-visual speech recognition . . . . . . . . . . . . . . . . . . . 22
2.3.2 Audio-visual speaker recognition . . . . . . . . . . . . . . . . . . 25
2.3.3 Comparing speech and speaker recognition . . . . . . . . . . . . . 26
2.4 Audio-visual databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.1 A brief review of audio-visual databases . . . . . . . . . . . . . . 27
2.4.2 The XM2VTS database . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Chapter 3 Speech and Speaker Classification 33
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 Bayes classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.2 Non-parametric classifiers . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.3 Parametric classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Gaussian mixture models . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.1 GMM complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.2 GMM parameter estimation . . . . . . . . . . . . . . . . . . . . . . 39
3.4 Hidden Markov models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4.1 Markov models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4.2 Hidden Markov models . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.3 Viterbi decoding algorithm . . . . . . . . . . . . . . . . . . . . . . 46
3.4.4 HMM parameter estimation . . . . . . . . . . . . . . . . . . . . . . 47
3.4.5 HMM types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.5 Speaker adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5.1 MAP adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.6 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Chapter 4 Speech and Speaker Recognition Framework 59
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 Speech recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2.1 Speaker dependency . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2.2 Speech decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3 Speaker recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.1 Text dependency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.2 Background adaptation . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.3 Evaluating speaker recognition performance . . . . . . . . . . . . 68
4.4 Speech processing framework† . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4.1 Training and testing datasets . . . . . . . . . . . . . . . . . . . . . 72
4.4.2 Background training . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4.3 Speaker adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4.4 Speech recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4.5 Speaker verification . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.5 Acoustic and visual conditions . . . . . . . . . . . . . . . . . . . . . . . . 77
4.6 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Chapter 5 Feature Extraction 79
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2 Acoustic feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2.2 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2.3 Filter bank analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2.4 Mel frequency Cepstral coefficients . . . . . . . . . . . . . . . . . 82
5.2.5 Perceptual linear prediction . . . . . . . . . . . . . . . . . . . . . . 83
5.2.6 Energy and time derivative features . . . . . . . . . . . . . . . . . 83
5.3 Visual front-end . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.3.1 The front-end effect . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.2 A brief review of visual front-ends . . . . . . . . . . . . . . . . . . 86
5.3.3 Manual front-end implementation . . . . . . . . . . . . . . . . . . 87
5.4 Visual features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.4.1 Appearance based . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4.2 Contour based . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4.3 Combination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.4.4 Choosing a visual feature extraction method . . . . . . . . . . . . 93
5.5 Dynamic visual speech features . . . . . . . . . . . . . . . . . . . . . . . . 94
5.5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.5.2 Cascading appearance-based features . . . . . . . . . . . . . . . . 96
5.6 Comparing speech and speaker recognition . . . . . . . . . . . . . . . . . 103
5.6.1 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.6.2 Model training and tuning . . . . . . . . . . . . . . . . . . . . . . 105
5.7 Speech recognition experiments . . . . . . . . . . . . . . . . . . . . . . . . 107
5.7.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.7.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.8 Speaker verification experiments† . . . . . . . . . . . . . . . . . . . . . . . 111
5.8.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.8.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.9 Speech and speaker discussion . . . . . . . . . . . . . . . . . . . . . . . . 113
5.10 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Chapter 6 Simple Integration Strategies 115
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.2 Integration strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.3 Early integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.3.2 Concatenative feature fusion . . . . . . . . . . . . . . . . . . . . . 119
6.3.3 Discriminative feature fusion . . . . . . . . . . . . . . . . . . . . . 119
6.4 Late integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.4.2 Output score fusion for speaker verification . . . . . . . . . . . . 123
6.4.3 Score-normalisation . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.4.4 Modality weighting . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.5 Speech recognition experiments . . . . . . . . . . . . . . . . . . . . . . . . 128
6.5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.5.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.6 Speaker verification experiments . . . . . . . . . . . . . . . . . . . . . . . 131
6.6.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.6.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.7 Speech and speaker discussion . . . . . . . . . . . . . . . . . . . . . . . . 133
6.8 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Chapter 7 Synchronous HMMs 135
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.2 Multi-stream HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.3 Synchronous HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.3.2 SHMM joint-training . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.4 Weighting of synchronous HMMs† . . . . . . . . . . . . . . . . . . . . . . 142
7.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.5 Normalisation of synchronous HMMs† . . . . . . . . . . . . . . . . . . . 145
7.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.5.2 Determining normalisation parameters . . . . . . . . . . . . . . . 148
7.5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.6 Speech recognition experiments† . . . . . . . . . . . . . . . . . . . . . . . 152
7.6.1 Choosing the stream weight parameters . . . . . . . . . . . . . . . 152
7.6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
7.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
7.8 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Chapter 8 Fused HMM-Adaptation of Synchronous HMMs 159
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
8.2 Discrete fused HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
8.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
8.2.2 Maximising mutual information for audio-visual speech . . . . . 161
8.2.3 Discrete implementation . . . . . . . . . . . . . . . . . . . . . . . . 163
8.3 Fused HMM adaptation of synchronous HMMs† . . . . . . . . . . . . . . 164
8.3.1 Continuous FHMMs . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.3.2 Fused-HMM adaptation . . . . . . . . . . . . . . . . . . . . . . . . 165
8.4 Biasing of FHMM-adapted SHMMs† . . . . . . . . . . . . . . . . . . . . . 168
8.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.4.2 Acoustic or visual biased . . . . . . . . . . . . . . . . . . . . . . . 169
8.4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
8.5 Speech recognition experiments† . . . . . . . . . . . . . . . . . . . . . . . 171
8.5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
8.5.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
8.6 Speaker verification experiments† . . . . . . . . . . . . . . . . . . . . . . . 175
8.6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
8.6.2 Stream weighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
8.6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
8.6.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
8.7 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
Chapter 9 Conclusions and Future Work 181
9.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
9.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
Bibliography 187
List of Tables
4.1 Configurations of the XM2VTS clients possible under this framework. . 74
5.1 HMM topologies used for the uni-modal speech processing experiments. 106
5.2 WERs for speech recognition on all 12 configurations of the XM2VTS
database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.1 Normalisation parameters determined from the per-frame evaluation
score distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.2 Final weighting parameter αfinal calculated from the intended weight-
ing parameter αtest using the normalisation parameter αnorm = 0.751. . . 150
List of Figures
2.1 Schematic diagram of human speech communication, considering only
the auditory systems (Adapted from [153]) . . . . . . . . . . . . . . . . . 13
2.2 Sagittal section of the human speech production system. (public do-
main, from [191]) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Some examples of raw frame images from the XM2VTS database [119]. . 30
2.4 Configurations for person recognition defined by the XM2VTS proto-
col [107]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1 A Markov process can be modelled as a state machine with probabilistic
transitions (aij) between states at discrete intervals of time (t = 1, 2, . . .). . 42
3.2 A diagrammatic representation of a typical left-to-right HMM for speech
processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1 A typical speech recognition system, outlining both the training of speech
models and testing using these models. . . . . . . . . . . . . . . . . . . . 60
4.2 Speaker-dependent speech recognition can be impractical for some
applications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3 An example of a possible voice-dialling speech grammar for continuous
speech recognition. Adapted from [194]. . . . . . . . . . . . . . . . . . . . 63
4.4 A typical automatic speaker recognition system, outlining both the train-
ing of speaker models and testing using these models. . . . . . . . . . . . 65
4.5 An example of a DET plot comparing two systems for speaker verifica-
tion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.6 Overview of the speech processing framework used in this thesis. . . . 72
4.7 Word recognition grammar used in this framework. . . . . . . . . . . . . 76
5.1 Configuration of an acoustic feature vector including the static (ci) and
energy (E) coefficients and their corresponding delta and acceleration
coefficients. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2 The visual feature extraction process, highlighting the visual front end,
encompassing the localisation, tracking and normalisation of the lip ROI. 85
5.3 Manual tracking was performed by recording the eye and lip locations
every 50 frames and interpolating between. . . . . . . . . . . . . . . . . 87
5.4 Some examples of the original and grey-scaled resized ROIs extracted
from the XM2VTS database. . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.5 Contour-based feature extraction uses the geometry of the lip region as
the basis of the extracted features. . . . . . . . . . . . . . . . . . . . . . . 91
5.6 Overview of the dynamic visual feature extraction system used for this
thesis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.7 Most of the energy of a 2D-DCT resides in the lower-order coefficients,
and can be collected easily using a zig-zag pattern. . . . . . . . . . . . . 99
5.8 Text dependent speaker verification performance on all 12 configura-
tions of the XM2VTS database. . . . . . . . . . . . . . . . . . . . . . . . . 112
6.1 Overview of the feature fusion systems used for this thesis, covering
both concatenative and discriminative feature fusion. . . . . . . . . . . . 120
6.2 Overview of the output score fusion approach used for this thesis. . . . 123
6.3 Histograms of speaker verification scores (a) before and (b) after nor-
malisation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.4 Performance of weighted output score fusion for speaker verification as
α is varied from 0 to 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.5 Speaker-independent feature-fusion speech recognition performance av-
eraged over all 12 configurations of the XM2VTS database. . . . . . . . . 129
6.6 Speaker-dependent feature-fusion speech recognition performance av-
eraged over all 12 configurations of the XM2VTS database. . . . . . . . . 130
6.7 Simple integration strategies for text-dependent speaker verification over
noisy acoustic conditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.1 Various multi-stream HMM modelling techniques used for AVSP in
comparison to the uni-modal HMM (a). Acoustic emission densities
are shown in blue and visual in red. . . . . . . . . . . . . . . . . . . . . . 137
7.2 Speech recognition performance using SHMMs as αtest is varied. Each
point represents a different αtrain and the line is the average over all αtrain values
for each αtest. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.3 Speech recognition performance using SHMMs as αtrain is varied. αtest
is chosen based on the best average performance in Figure 7.2. . . . . . . 144
7.4 Distribution of per-frame scores for individual A-PLP audio and video
state-models within the SHMM under different types of normalisation. . 148
7.5 Speech recognition performance under normalisation . . . . . . . . . . . 151
7.6 Speaker independent speech recognition using full-normalised word-
model SHMMs as αtest is varied on the first configuration of the XM2VTS
database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
7.7 Speaker-independent speech recognition performance using SHMMs
over all 12 configurations of the XM2VTS database. . . . . . . . . . . . . 154
8.1 By replacing the discrete secondary representations with continuous
representations in Pan et al.’s [130] original FHMM, it can be seen that
a SHMM will be created. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
8.2 Performance of acoustic and visual biased FHMM-adapted SHMMs as
testing stream weights are varied. . . . . . . . . . . . . . . . . . . . . . . . 170
8.3 Speaker independent speech recognition performance using FHMM-
adapted HMMs over all 12 configurations of the XM2VTS database. . . . 171
8.4 Speaker dependent speech recognition performance using FHMM-adapted
HMMs over all 12 configurations of the XM2VTS database. . . . . . . . . 172
8.5 Comparing the A-PLP biased FHMM-adapted SHMM with an
equivalent jointly-trained SHMM on the first configuration of the XM2VTS
database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
8.6 Tuning the testing stream weight parameter αtest for speaker verification
using FHMM-adapted SHMMs. . . . . . . . . . . . . . . . . . . . . . . . . 176
8.7 Text-dependent speaker recognition performance using FHMM-adapted
HMMs over all 12 configurations of the XM2VTS database. . . . . . . . . 178
Commonly used Abbreviations
AVICAR Audio-visual Speech Corpus in a Car Environment (database)
AVSP Audio-visual speech processing
AVSPR Audio-visual speaker recognition
AVSR Audio-visual speech recognition
CUAVE Clemson University Audio Visual Experiments (database)
DCT Discrete cosine transform
DET Detection error tradeoff
EER Equal error rate
EM Expectation maximisation
FF Feature fusion
FHMM Fused HMM
GMM Gaussian mixture model
HCI Human-computer interface
HMM Hidden Markov model
HTK HMM Toolkit (software)
LDA Linear discriminant analysis
M2VTS MultiModal Verification for Teleservices and Security applications (database)
MAP Maximum a posteriori
MFCC Mel frequency cepstral coefficients
MRDCT Mean-removed DCT
PLP Perceptual linear predictive
ROI Region of interest
SD Speaker dependent
SHMM Synchronous HMM
SI Speaker independent
SNR Signal to noise ratio
TD Text dependent
TI Text independent
TIMIT An acoustic speech database developed by Texas Instruments (TI) and Mas-
sachusetts Institute of Technology (MIT)
WER Word error rate
XM2VTS Extended M2VTS (database)
Certification of Thesis
The work contained in this thesis has not been previously submitted for a degree or
diploma at any other higher educational institution. To the best of my knowledge and
belief, the thesis contains no material previously published or written by another per-
son except where due reference is made.
Signed:
Date:
Acknowledgements
Completing a PhD research programme is certainly one of the more interesting ex-
periences I have had, and a lot of people have helped me along the way. While it
is probably not possible to thank everyone (if only due to my poor memory), there
are certain people who must be mentioned. Firstly and most importantly, I would
like to thank my lovely wife Melly and (sometimes) lovely boys Axel and Henry for
the support they provided, and especially for putting up with me as I experimented
with weird working hours during the hectic write-up stage that produced this final
document. In addition, I would like to thank my parents for the encouragement and
support they have always provided me.
I would also like to thank my supervisory team, Sridha Sridharan, Vinod Chandran
and Tim Wark for providing valuable guidance and encouragement throughout the
course of my study. I am particularly indebted to Sridha for the excellent research
environment he has provided in the Speech, Audio Image and Video Technologies
(SAIVT) research laboratory at Queensland University of Technology (QUT), and the
many opportunities I have had to present my research at both domestic and interna-
tional conferences. I am also thankful for my regular meetings with Tim to discuss the
direction of my research, and the help he has provided in nutting out the difficult little
problems that came up along the way. It should also be mentioned that part of this
PhD was supported by the Australian Research Council Grant No. LP0562101, and I
am grateful for that support.
During my PhD I was fortunate to present my research at a number of significant
speech processing conferences, and I am grateful for the opportunity that this pre-
sented for me to network with my fellow researchers from other institutions. I would
like to thank Roland Goecke, Iain Matthews and Gerasimos Potamianos, and many
others I cannot remember specifically (sorry), for listening to me and providing valu-
able feedback that significantly improved my research.
Of course, the group who probably had the largest impact on my research are the past
and present members of the SAIVT laboratory. In addition to the incredibly valuable
research expertise embodied in my colleagues, the great social atmosphere within the
laboratory also made it a pleasure to work there. Particular thanks must go to my colleague
Patrick Lucey for his help in sorting out problems in the field of audio-visual speech
processing that we shared. Special mention must also go to Brendan Baker, Jamie
Cook, Simon Denman, Ivan Drago, Clinton Fookes, Tristan Kleinschmidt, Frank Lin,
Terrance Martin, Michael Mason, Chris McCool, Mitchel McLaren, Robbie Vogt, Roy
Wallace and Eddie Wong, who all helped me in some way, at some point.
Finally, I would like to especially thank and acknowledge everybody whom I have
forgotten above.
Chapter 1
Introduction
1.1 Motivation and overview
Automatic speech processing is a very mature area of research, and one that is play-
ing an ever-increasing role in our day-to-day lives. While these systems have shown
promise when performing well-defined tasks like dictation or call-centre navigation
in reasonably clean and controlled environments, they have not yet reached the stage
where they can be fully deployed in real-world situations. The major reason behind
this is the susceptibility of audio speech recognition systems to environmental
noise, which can severely degrade recognition performance.
However, speech does not consist of the audio modality alone, and studies of human
production and perception of speech have shown that the visual movement of the
speaker’s face and lips are an important factor in human communication.
Fortunately, many of the sources of audio degradation can be considered to have little
effect on the visual signal, and a similar assumption can also be drawn about many
sources of video degradation. By taking advantage of the complementary nature of
audio-visual speech, combining the two modalities will increase the robustness
to independent sources of degradation in either modality. This is the motivation be-
hind audio-visual speech processing (AVSP).
In AVSP, the method chosen for combining the two sources of speech information
remains a major area of ongoing research. Early AVSP systems could generally be
divided into two main groups, early or late integration, based on whether the two
modalities were combined before or after classification/scoring. Late integration had
the advantage that the reliability of each modality’s classifier could be weighted easily
before combination, but was difficult to use on anything but isolated word recognition
due to the problem of aligning and fusing two possibly significantly different speech
transcriptions. This was not a problem with early integration, where features are com-
bined before using a single classifier, but, on the other hand, it would be very difficult
to model the reliability of each modality.
To allow a compromise between these two extremes, middle integration schemes were
developed that allow classifier scores to be combined in a weightedmanner within the
structure of the classifier itself. The simplest of the middle integration methods, and
the subject of this thesis, is the synchronous HMM (SHMM). There are more compli-
cated middle integration designs, primarily intended to allow modelling of the asyn-
chronous nature of audio-visual speech, such as asynchronous, product or coupled
HMMs. However, while these models do show a performance increase over SHMMs,
the performance increase is not large and may not be worth the increased complex-
ity in the training and testing of the asynchronous models. It is the simplicity of the
SHMM that encourages further research into improving speech recognition perfor-
mance whilst staying within the synchronous design pattern.
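To make the "weighted combination within the classifier" concrete, a minimal sketch of SHMM decoding is given below, assuming log-domain quantities and a single state sequence shared by both streams; the per-frame emission score is the alpha-weighted sum of the two streams' log-likelihoods (a weighted product of the stream densities in the linear domain). All variable names are illustrative, not taken from the thesis.

```python
def viterbi_shmm(log_pi, log_A, log_b_audio, log_b_video, alpha):
    """Viterbi decoding for a synchronous HMM: both streams share one
    state sequence, and each frame's emission score fuses the two
    streams inside the classifier rather than before or after it."""
    T, N = len(log_b_audio), len(log_pi)
    # Per-frame fused emission: alpha-weighted sum of stream log-likelihoods.
    fused = [[alpha * log_b_audio[t][j] + (1.0 - alpha) * log_b_video[t][j]
              for j in range(N)] for t in range(T)]
    delta = [log_pi[j] + fused[0][j] for j in range(N)]
    for t in range(1, T):
        delta = [max(delta[i] + log_A[i][j] for i in range(N)) + fused[t][j]
                 for j in range(N)]
    return max(delta)  # log score of the best synchronous state path
```

Because the fusion happens per frame inside the recursion, the stream weight can rebalance the modalities without the transcript-alignment problem of late integration.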
This thesis will focus on investigating the suitability of the SHMM structure for audio-
visual speech and speaker recognition, in comparison to the baseline performance pro-
vided by uni-modal speech modelling as well as early and late integration strategies.
In the process of investigating the SHMM approach, a number of novel training and
testing techniques relating to the use of SHMMs for audio-visual speech modelling
will be developed. Particular attention will be paid to the novel fused HMM (FHMM)
adaptation process, which will be shown to produce a SHMM that can outperform
SHMMs trained using the existing state-of-the-art jointly-trained method at all levels
of acoustic noise.
1.2 Aims and objectives
It follows from Section 1.1, that the broad aims of this thesis can be summarised as
follows:
1. To investigate the suitability of existing feature extraction and integration tech-
niques for both speech and speaker recognition.
2. To study and develop techniques to improve the audio-visual speech modelling
ability of SHMMs trained using the state-of-the-art joint-training process.
3. To develop an alternative training technique for SHMMs that can improve the
audio-visual speech modelling ability in comparison to the existing state-of-the-
art joint-training process.
4. To compare and contrast the suitability of SHMMs for speech and speaker recog-
nition in comparison to existing baseline integration techniques.
More specifically, the objectives of this research programme are:
1. To review existing knowledge and techniques relevant to both speech and speaker
recognition using the audio and visual modalities.
2. To create a speech processing framework that can be used to evaluate both speech
and speaker recognition techniques, encouraging the re-use of models and tech-
niques between the two speech processing tasks where appropriate.
3. To investigate the state of the art in acoustic and visual feature extraction tech-
niques for audio-visual speech processing, and compare the suitability of these
features between the two speech processing tasks.
4. To review and investigate simple integration techniques that can be applied
through fusion before or after uni-modal classification to serve as a baseline for
middle integration experiments.
5. To review middle integration methods for audio-visual speech processing, with
a particular focus on SHMMs due to their simplicity in comparison to other mid-
dle integration approaches.
6. To investigate the behaviour of jointly-trained SHMMs during training and test-
ing of a speech processing system, and to develop techniques to improve the
speech modelling ability within the existing training techniques.
7. To develop methods of improving the SHMM performance through FHMM-
adaptation to improve the audio-visual speech modelling ability over the ex-
isting jointly-trained SHMMs.
1.3 Outline of thesis
The remainder of this thesis is organised as follows:
Chapter 2 gives an overview of the broad area of audio-visual speech processing, cov-
ering both speech production and the audio-visual perception of speech and
speakers by both humans and machines. A brief review of suitable audio-visual
speech processing databases is also conducted in this chapter.
Chapter 3 introduces the theory behind data classification, as well as outlining the
classification techniques in common use for automatic speech processing. Gaus-
sian mixture models are introduced as static speech classification models, and
are extended into hidden Markov models for the temporal modelling of speech
events. Finally, maximum a posteriori speaker adaptation using these modelling
techniques is introduced to allow speaker dependent models to be generated
from well-trained background models.
Chapter 4 provides a detailed overview of automatic speech and speaker recognition,
covering the methods and techniques that are involved in both exercises. The
chapter is concluded with a novel framework based on the XM2VTS database
that can be used to test both speech and speaker recognition within a single
training process.
Chapter 5 looks at audio and video feature extraction techniques that have demon-
strated suitability for speech processing applications, and concludes with a com-
parison of visual features at various stages of a dynamic feature extraction cas-
cade for both the speech and speaker verification applications. Early in this
chapter, a review of both acoustic and visual feature extraction techniques is con-
ducted, with a particular focus on visual feature extraction. After a brief review
of the visual front-end, both appearance and geometric based visual feature ex-
traction techniques are reviewed. Within appearance-based feature extraction, a
number of dynamic feature extraction techniques are outlined that are designed
to extract the most relevant speech features from a given region of interest (ROI). In the experi-
mental section of this chapter, a number of visual and acoustic speech features
are compared to determine the suitability of dynamic visual speech features for
both speech and speaker recognition.
Chapter 6 investigates simple methods of fusing the acoustic and visual modalities
that can be considered with the existing classification techniques already devel-
oped for uni-modal speech processing. Early integration techniques are investi-
gated for speech and speaker recognition, while late integration is only consid-
ered for speaker recognition due to the difficulty of combining speech transcrip-
tions in an output fusion configuration.
Chapter 7 reviews middle integration approaches to audio-visual speech processing
in the literature, with particular attention paid to the simplest of the middle inte-
gration methods for AVSP, the SHMM. An investigation of the SHMM structure
is conducted to investigate the effect that each modality has on the final speech
recognition performance based upon how each stream is weighted during the
training and decoding of the structure. Additionally, a number of novel clas-
sifier normalisation techniques are investigated within the SHMM structure to
improve the robustness of the SHMM to acoustic noise.
Chapter 8 introduces an alternative training technique for SHMMs that provides im-
proved audio-visual speech modelling ability when compared to the existing
state of the art training techniques for SHMMs. Experiments are conducted
with the resulting FHMM-adapted SHMMs to compare and contrast this SHMM
training technique against the earlier fusion methods for both speech and speaker
recognition.
Chapter 9 summarises the work presented in this thesis, and presents the main con-
clusions that have been drawn from the work. This chapter also suggests future
work that may be taken to improve upon the research conducted in this thesis.
1.4 Original contributions of thesis
The work presented in this thesis makes original contributions1 in a number of differ-
ent areas, summarised as follows:
1. A novel framework for evaluating both speech and speaker recognition whilst
reusing the same speech models for both tasks is presented in Chapter 4.
2. A comparison of appearance based static and dynamic visual speech features
is conducted for visual speaker verification in Chapter 5 to show that visual
speaker verification improves as more dynamic information is extracted from
the ROI.
1Sections throughout this thesis which contain significant original work are indicated with the “†” symbol.
3. A study of the effect of varying the stream weights independently during the
training and testing of SHMMs is conducted in Chapter 7 to show that the choice
of stream weight during training has a minor effect on the final speech process-
ing ability of the SHMM.
4. A novel adaptation of zero normalisation is applied within the states of a SHMM
in Chapter 7 to normalise the video scores to a similar range to the audio, allow-
ing the final SHMM to be more robust to acoustic noise.
5. An additional variance-only normalisation technique is developed in Chap-
ter 7 to allow stream normalisation to occur within SHMMs solely through the
use of the stream weighting parameters, rather than requiring access within the
Viterbi process to apply full mean and variance normalisation.
6. The novel FHMM-adaptation method of training a SHMM from a uni-modal
acoustic or visual HMM through the addition of separately trained GMMs
for the secondary modality is developed in Chapter 8 to show improved audio-
visual speech modelling ability over existing SHMM training techniques.
1.5 Publications resulting from research
The following fully-refereed publications have been produced as a result of the work
in this thesis:
1.5.1 International journal publications
1. D. Dean and S. Sridharan, “Dynamic visual features for audio-visual speaker
verification,” Computer Speech and Language (submitted)
2. D. Dean, P. Lucey, S. Sridharan, and T. Wark, “Fused HMM-adaptation of syn-
chronous HMMs for audio-visual speech recognition,” Digital Signal Processing
(submitted)
1.5.2 International conference publications
1. D. Dean and S. Sridharan, “Fused HMM adaptation of synchronous HMMs for
audio-visual speaker verification,” in Auditory-Visual Speech Processing (accepted),
2008
2. D. Dean, S. Sridharan, and P. Lucey, “Cascading appearance based features for
visual speaker verification,” in Interspeech 2008 (accepted), 2008
3. D. Dean, P. Lucey, S. Sridharan, and T. Wark, “Weighting and normalisation
of synchronous HMMs for audio-visual speech recognition,” in Auditory-Visual
Speech Processing, Hilvarenbeek, The Netherlands, September 2007, pp. 110–115
4. D. Dean, P. Lucey, S. Sridharan, and T. Wark, “Fused HMM-adaptation of multi-
stream HMMs for audio-visual speech recognition,” in Interspeech, Antwerp,
August 2007, pp. 666–669
5. T. Kleinschmidt, D. Dean, S. Sridharan, and M. Mason, “A continuous speech
recognition evaluation protocol for the AVICAR database,” in International Con-
ference on Signal Processing and Communication Systems (ICSPCS) (accepted), 2007
6. D. Dean, S. Sridharan, and T. Wark, “Audio-visual speaker verification using
continuous fused HMMs,” in HCSNet Workshop on the Use of Vision in HCI (VisHCI),
2006
7. D. Dean, T. Wark, and S. Sridharan, “An examination of audio-visual fused
HMMs for speaker recognition,” in Second Workshop on Multimodal User Authen-
tication (MMUA), Toulouse, France, 2006
8. D. Dean, P. Lucey, and S. Sridharan, “Audio-visual speaker identification us-
ing the CUAVE database,” in Auditory-Visual Speech Processing (AVSP), British
Columbia, Canada, July 24-27 2005, pp. 97–101
9. D. Dean, P. Lucey, S. Sridharan, and T. Wark, “Comparing audio and visual infor-
mation for speech processing,” in Eighth International Symposium on Signal Pro-
cessing and Its Applications (ISSPA), Sydney, Australia, 2005, pp. 58–61
10. P. Lucey, D. Dean, and S. Sridharan, “Problems associated with area-based visual
speech feature extraction,” in Auditory-Visual Speech Processing (AVSP), British
Columbia, Canada, 2005, pp. 73–78
Chapter 2
Audio-Visual Speech Processing
2.1 Introduction
Speech is clearly one of the most important, if not the most important, communication
methods available between humans, and it is the primacy of this medium that motivates research
efforts to allow speech to become a viable human-computer interface (HCI). By allow-
ing computers to recognise both speech and the identities of speakers, the interface can
be more direct, with no need for an additional format conversion (i.e., typing) in the
communications chain. These two main areas of research, using computers to recognise
speech and to recognise the identities of speakers, are collectively referred to
as automatic speech processing.
Human speech is transmitted between speakers both through the acoustic speech
wave and the visual movement of the lips, and while it may not be immediately
obvious, useful information is contained in both of these modalities. The widespread
adoption of telephones, radios and other audio-based technology clearly shows that
speech can be understood by humans with high accuracy using audio alone in good
conditions. However, when visual information is available, psychological studies
have shown that this information can and does improve speech perception. Simi-
larly, incorrect or mistimed visual information can be jarring to users and even cause
mistakes in perception in extreme cases.
This chapter will review existing research in the field of audio-visual speech pro-
cessing, covering both human perception studies and systems designed to recognise
speech or speakers automatically. Reviews of the existing literature in both human
and machine-based speech processing will be conducted to demonstrate the signifi-
cant improvements that can be realised by including visual speech information along-
side traditional acoustic speech processing.
2.2 Audio-visual speech processing by humans
Human speech is a complicated physiological process, with many components com-
ing into play for both the production and perception of the speech events. However,
as spoken language is one of the primary characteristics that made humans what they
are [101], the physiological basis of human speech has been studied in extensive detail,
and is reasonably well understood.
In this section a review of the human perception literature will be conducted to explore
the physiological processes involved in human speech production, speech perception
and the recognition of speakers (speaker perception). As audio-visual speech is the
focus of this research program, particular attention will be paid to the impact that the
visual modality has on these physiological processes.
2.2.1 The speech chain
At the highest level, the process of human-to-human communication can be seen as
the imperfect transmission of an idea from one mind into the other. This idea of a
speech chain [63, 153] encompasses both speech production and perception, as well as
the transmission channels between the two participants. In face-to-face communica-
Figure 2.1: Schematic diagram of human speech communication, considering only the auditory systems (adapted from [153])
tion, this channel would simply be the acoustic sound wave and reflected light from
the speaker’s mouth region. However, the transmission channel can easily get more
complicated if, as an example, a telephone or video transmission device were intro-
duced to allow communication at a distance.
An example of such a chain, considering acoustic speech only, is shown in Figure 2.1 [153].
It can be seen that the idea is first converted into a language-based representation. This
is further translated into the signals necessary to control the lungs and vocal tract
(consisting of the vocal cords and mouth region) which finally generate the acoustic
wave for transmission. Once the signal reaches the listener, the movements of the
ear drum are converted back into nerve signals, then into a language representation, and
finally, hopefully, into the idea intended by the speaker.
Including visual speech in this model has little effect on the speaker’s end of the chain,
where the visual aspect of speech can largely be considered a side effect [105], but
the listener’s end must additionally be cognisant of the visual information
which is then converted to nerve signals by the retina. At this point in the speech
perception process the two nerve signals (hearing and vision) are fused within the
brain to arrive at a language model and finally, the idea.
2.2.2 Speech production
Human speech is an acoustic waveform that travels in the form of sound pressure
changes through the air. This pressure wave is generated by transforming the original
expulsion of air from the lungs through the vocal folds and articulators within the vocal
tract. This term refers to the portion of the speech production system that transforms
the air expelled from the lungs into recognisable human speech, and consists of the larynx,
vocal folds, pharynx and the oral and nasal cavities, shown in Figure 2.2.
The sounds produced within the vocal tract can be classified according to the actions
of a number of its components. Upon leaving the lungs through
the trachea, the airstream enters the larynx and encounters the vocal folds, which can
either be tightened or relaxed. If tightened, the vocal folds interfere and vibrate with
the airflow, with the resulting sound said to be voiced. Correspondingly if the vocal
folds are relaxed they do not vibrate and the sound is unvoiced. The airstream then
enters the pharynx, to be directed into both the oral and nasal cavities, or just the
oral cavity if the soft palate is closed. If the sound is produced with only the oral cavity it is
referred to as oral, or if the nasal cavity is also used, nasal.
Finally, the sounds produced can be further classified according to their place and
manner of articulation. In speech, articulation is the process by which the tongue or lips
make contact with other portions of the oral cavity to form specific speech sounds. The
manner of articulation can vary from approximant, where there is very little obstruction
of the airflow, to fricative, where the obstruction is enough to cause turbulence, and
finally to a stop, where the articulators involved completely obstruct the airflow. The
place of articulation refers to which articulators are involved in the speech event, which
generally will be either the tongue or the lips and another portion of the oral cavity
such as the teeth, alveolar ridge, or the soft or hard palate [93].
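The voicing, manner and place distinctions above can be captured as a small attribute table. The sketch below covers only a handful of English consonants, an illustrative subset rather than a complete inventory:

```python
# Illustrative articulatory attributes for a few English consonants,
# following the standard voicing/manner/place descriptions.
PHONEMES = {
    "p": {"voiced": False, "manner": "stop",      "place": "bilabial"},
    "b": {"voiced": True,  "manner": "stop",      "place": "bilabial"},
    "t": {"voiced": False, "manner": "stop",      "place": "alveolar"},
    "d": {"voiced": True,  "manner": "stop",      "place": "alveolar"},
    "f": {"voiced": False, "manner": "fricative", "place": "labiodental"},
    "v": {"voiced": True,  "manner": "fricative", "place": "labiodental"},
}

def minimal_contrast(p1, p2):
    """Return the attributes on which two phonemes differ, e.g. the
    /p/-/b/ pair from 'pat'/'bat' differs only in voicing."""
    a, b = PHONEMES[p1], PHONEMES[p2]
    return [k for k in a if a[k] != b[k]]
```

For example, `minimal_contrast("p", "b")` reports only a voicing difference, mirroring the minimal-pair reasoning used in Section 2.2.3.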
Figure 2.2: Sagittal section of the human speech production system. (public domain, from [191])
While body language, including facial expressions, is (at least) subconsciously intended
to communicate, the visual movement of the speech articulators appears to primarily
be a side effect of the shaping of the acoustic speech, and not an intentional method
of visual communication. However, as humans have clearly adapted to make use of
this visual information, as will be shown later in this chapter, the study of how the
visible speech is related to the acoustic speech is important. Of the large number of
components involved in human speech it can be seen from Figure 2.2 that, even in the
best conditions, only a subset of the articulators are visible, being the lips, teeth and
tongue, and only the lips are visible in an unobscured manner.
2.2.3 Phonemes and visemes
In traditional acoustic speech processing tasks, phonemes are the smallest units of speech
that can be distinguished linguistically. Two phonemes can be considered linguisti-
cally distinct if two words can be found that differ only in the two phonemes, forming
a minimal pair. An example of this would be using ‘pat’ and ‘bat’ to demonstrate
that /p/ and /b/ are distinct phonemes. It is difficult to establish an exhaustive set of
phonemes, particularly if multiple languages are considered, but the International Pho-
netic Alphabet (IPA) [76] is generally considered to be the standard list. Of the 107
distinct phonemes in the IPA, only around 50 are commonly used in English [153].
Visemes are generally considered to be the equivalent of phonemes in the visual do-
main, although they do not actually serve to be linguistically distinct, but are rather
based on visual distinction [111]. Because the variety of acoustic speech events is not
completely represented by the visible articulators, each viseme generally corresponds
to many visually similar but linguistically distinct phonemes. No real consensus exists
on the number and grouping of visemes, but generally there is considered to be on the
order of 10-20 visemes [23] as compared to the 50 or so phonemes in common English
usage.
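The many-to-one phoneme-to-viseme relationship described above can be sketched as a simple lookup table. The grouping below is purely illustrative, since viseme inventories differ between authors:

```python
# Illustrative phoneme-to-viseme grouping: several phonemes map to one
# viseme class. Real inventories vary between studies.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar", "n": "alveolar",
    "k": "velar", "g": "velar",
}

def viseme_class(phoneme):
    """Look up the viseme class shared by visually similar phonemes."""
    return PHONEME_TO_VISEME[phoneme]

# /t/ and /d/ fall in the same viseme class despite being linguistically
# distinct phonemes -- the visual channel alone cannot separate them.
```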
2.2.4 Audio-visual speech perception
Human speech perception is commonly assumed to primarily be an acoustic pro-
cess [153], and humans can certainly understand speech easily when only the audio
is available, such as in telephone-based communication. For the case of visual-only
speech, the ability of the hard-of-hearing to lip-read well enough to take part in regular
conversations demonstrates that there is sufficient visual information to understand
speech, although context plays a larger part than in auditory listening [171].
However, studies of human speech perception have shown that it is not just the hard-of-
hearing who make use of visual information to aid in speech perception. The earliest
such study was performed by Sumby and Pollack in 1954 [170], where they looked
at the effects of auditory noise on human speech perception with and without visual
information being available to the listener. From their experiments they found that
allowing the participants to see the lip movements provided a speech perception in-
crease equivalent to raising the auditory signal-to-noise ratio by up to 15 dB. More
recently, Reisberg et al. [156] showed that listeners with normal hearing ability still
make enough use of the visual information to show improvement in speech recogni-
tion performance even in clearly articulated speech.
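To put the Sumby and Pollack result in context, signal-to-noise ratio is measured in decibels as ten times the base-10 logarithm of the power ratio; the sketch below, with invented powers, shows what a 15 dB gain corresponds to:

```python
import math

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(signal_power / noise_power)

# A 15 dB gain corresponds to roughly a 31.6-fold increase in the
# signal-to-noise power ratio:
power_ratio = 10 ** (15 / 10)   # approximately 31.62
```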
Visual speech can be considered useful for human speech perception in two main
ways. Firstly, it is useful for directing the listener’s attention to the speaker, and secondly
the visual speech can provide complementary information to the acoustic. In the first case, the
visual speech can be used to allow a listener to determine who is talking, where they
are, and even when they are actively speaking. By allowing a listener to focus on the
speaker and even to take advantage of the lip movements to filter many simultaneous
voices, such as might be encountered at a noisy party, the visual speech can be considered
a speech enhancement stage prior to the actual acoustic speech perception.
Secondly, the complementary nature of audio and visual speech can be shown by
studying the confusability of speech events in either modality. Summerfield [172] has
shown that many of the easily confused phonemes have distinct visemes that can be
easily distinguished provided the lip region is clearly visible. An example of this
would be /f/ and /th/ which can be difficult to distinguish acoustically, but can eas-
ily be distinguished visually based on whether the lower lip or tongue is against the
teeth. Summerfield also showed that the converse is true: the phonemes corresponding
to a particular viseme are often acoustically quite different, as demonstrated by /t/
and /d/, which share the same viseme but can be distinguished easily acoustically
because /d/ is voiced and /t/ is not.
Some of the more powerful indicators of the impact of visual speech become apparent
when visual and acoustic speech is combined incorrectly. For example, if a listener
is presented with differing audio and visual cues simultaneously, a third sound can
be perceived rather than either of the two actually ‘said’ in either modality. This is
referred to as the McGurk effect [117], of which the commonly given example is that a
listener seeing ‘ga’ but hearing ‘ba’ will believe they are hearing ‘da’, rather than
either of the actual spoken syllables. This effect is also likely related to the jarring
effect that can make badly dubbed movies difficult to watch.
2.2.5 Audio-visual speaker perception
Audio-visual speaker perception is used here to refer to the process by which humans
make use of both the acoustic and visual modalities to successfully recognise the
identity of a speaking person. However, this is not a very active area of research,
with most human-person-recognition research focusing on recognition from audio
speech [9] or facial images [199] in isolation. Of these two approaches, facial images
have been shown to be faster and more efficient in human recognition studies when
compared to voice recognition [50, 166].
While human person recognition is dominated by face recognition, recognition of fa-
miliar voices is still quite powerful, as evidenced by recognition over telephone lines
or radio where the visual modality isn’t available. The study of human recognition
by voice is a relatively new area of research compared to face recognition and acous-
tic speech recognition, but similar voice-recognition responses have been found in the
acoustic processing areas of the brain to that of face recognition in the visual cortex,
suggesting that acoustic recognition operates in a similar manner [9]. Recently, von
Kriegstein et al. [180] have shown that recognition of familiar voices also activates a
region of the brain normally associated with face recognition, even in the absence of a
visible face, suggesting that some processing may be shared between the two modali-
ties.
Studies of the human perception of faces have generally focused on the recognition of
static faces [199] and it has been found that hair, the face-outline, the eyes and the
mouth are all very important for human face recognition [167, 17], but that the top
half of the face as a whole is more important than the lower [167]. However, recent
studies into the recognition of moving faces have demonstrated an improvement in
person recognition over static face images [128, 139, 160], particularly if the move-
ments are within the face region, such as speech or expression variation, rather than
movements of the face as a whole [94]. Indeed, Knappmeyer et al. [91] have demon-
strated that a novel face can be easily confused with a similar-looking familiar face
if the characteristic movements of the familiar face are transferred to the novel one, suggesting
that the characteristic movement of a person’s face is an important factor in person
recognition.
While these studies have shown that recognition improved in the presence of a speak-
ing face, the only stimulus presented to the participants was the moving video, and
no complementary acoustic stimulus was supplied alongside to study the effect of
having both modalities available on human person recognition. One early study of
cross-modality person recognition was based on priming studies, where a priming
stimulus is used to influence a future target recognition. In this study, performed in
1997 by Schweinberger et al. [166], it was shown that a face prime appeared to fa-
cilitate improved recognition of a celebrity’s voice, even with a long (30 minutes of
other stimuli) interval between the prime and target. A similar effect was not found
by priming with the name alone, suggesting that there was a perceptual rather than
semantic effect present.
Only recently have perception studies been performed that looked at the combination
of both modalities and their effect on person recognition. The first study in this area
was by Kamichi et al. [85] in 2003, who found that participants could match unfamil-
iar faces and voices above the level of chance, suggesting that the movement of the
face and the acoustic signals are correlated in a manner that can be recognised by the
participants. Based on these results, Kamichi et al. suggested that the movements
of the face during speech contain dynamic information about speaker identity. How-
ever this study did not directly investigate the recognition of familiar speakers using
audio-visual stimuli.
The only existing study in the literature of audio-visual person recognition by humans
was published by Schweinberger et al. in 2007 [165]. Attempting to rectify the lack of
any multi-modal studies in the literature, they conducted an exhaustive comparison of
human recognition of people under 14 different conditions based on four underlying
variables:
• familiarity of the face,
• whether the face is dynamic or static,
• familiarity of the voice, and
• whether the voice presented matches the face.
Participants were presented with a face and audio stimulus for around 2 seconds, and
were asked to judge whether they were familiar with the faces presented or not. Partic-
ipants were encouraged to submit their answer as quickly and accurately as possible,
and their response time was recorded alongside their answer.
From the results of their experiments, Schweinberger et al. concluded that recognition
of familiar voices was faster and more accurate when the matching face was shown,
and that the performance was degraded when an incorrect face was shown, when com-
pared to a baseline audio-only recognition. They found that the improvement in per-
formance and speed was much larger for the dynamic faces, but was still present for
the static faces. Additionally, it was found that when an unmatched face was pre-
sented against a familiar voice, it was easier to ignore if static, but cause significant
degradation if dynamic. Similar trends were found in the results for the unfamiliar
voices, but the overall results were not significantly better than the baseline audio
performance.
However, while Schweinberger et al. compared the audio-visual recognition against
an audio-only baseline, they did not evaluate a similar video-only baseline, limiting
the conclusions that can be drawn from the research about the benefits of acoustic
information in addition to face-based person recognition. While Schweinberger et
al.’s [165] research and the improvement gained through dynamic versus static face
recognition [94, 91] demonstrate that human recognition of speakers improves as more
speech-related information is available, no conclusive research has yet shown that
acoustic information improves person recognition in the presence of dynamic visual
information, although it seems sensible that it should.
2.3 Automatic audio-visual speech processing
Because it was clear from human studies that the audio signal contains most of the
speech information, most early research into speech-based human computer inter-
faces (HCIs) was based around automatic acoustic speech processing. This area is
very mature and many commercial implementations are now deployed making use of
both speech recognition and speaker recognition technologies in constrained con-
ditions, such as limited speech vocabularies or well-controlled environments. One
pertinent example of commercially deployed speech recognition systems would be
the replacement of touch-tone phone menus with automatic speech prompts, where
the limited vocabulary reduces the difficulty of the speech recognition task signifi-
cantly [31]. Commercial speaker recognition systems are not quite as widespread as
speech recognition, but they do have application in forensic and security work. An
example of such a system would be Hollien’s SAUSI (Semi-automatic Speaker Identi-
fication) system designed expressly for the use of forensic phoneticians [74].
One of the factors that is holding back widespread adoption of automatic speech pro-
cessing systems is the susceptibility of acoustic speech to environmental noise, which
can degrade performance by many orders of magnitude [66]. One of the obvious pos-
sibilities for improving acoustic speech processing systems, and one that is clearly
motivated by human perception studies, is to introduce visual information into existing
audio speech processing systems. Because the visual information is complementary to
the acoustic, this should improve the system’s performance, particularly in the kinds
of environments in which existing acoustic systems perform poorly. The introduction
of visual information to acoustic speech processing systems leads to the research area
of audio-visual speech processing (AVSP), covering the related areas
of audio-visual speech and speaker recognition (AVSR and AVSPR).
2.3.1 Audio-visual speech recognition
Research into the automatic recognition of human speech has been ongoing
since the end of World War II, with the rapid growth of military and civil
radio for aviation and other purposes providing the motivation to ease the workload
of radio operators. Two of the earliest attempts at automatic speech recognition were
conducted independently by Davis et al. [33] at MIT in 1952 and Olson and Belar [127]
at RCA Laboratories in 1956. Both of these efforts focused on recognising a limited vo-
cabulary of words using spectral measurements of the acoustic signal captured using
analog filter banks.
Between these early efforts and the late 1970s, the field of automatic acoustic speech
recognition developed considerably, with many more small-vocabulary systems built,
and the pioneering of many modern speech recognition techniques such as dynamic
time warping [178] and linear predictive coding of acoustic features [77]. In the 1970s,
early research into large vocabulary speech recognition was begun at IBM [81], and
efforts towards truly speaker-independent speech recognition systems were begun by
AT&T’s Bell Labs [152]. The main focus of the 1980s was on the recognition of continuous
speech, rather than the isolated-word recognition that had dominated earlier
efforts, spearheaded by Carnegie Mellon University’s early work in the 1960s [155].
The continuous speech focus was accompanied by a widespread shift from template-
matching methods to statistical modelling methods, in particular the use of the hidden
Markov model (HMM) to easily negotiate connected word and phone networks [153].
Alongside this maturing of the acoustic speech recognition field, and motivated by
human perception studies, the first automatic audio-visual speech recognition system
was developed by Petajan in 1984 [136]. Petajan’s system extracted geometric param-
eters (height, width, and perimeter) from black and white images of the speaker’s
mouth region and used dynamic time-warping and template matching to recognise
words using these features. Later research by Petajan et al. found that the binary
image data outperformed the geometric features [135].
Much of the feature extraction research for visual speech features followed similar
work in face recognition. This was evident in Bregler and Konig’s adaptation of
eigenfaces [175] to create eigenlips features for AVSR [16] in the mid-90s, as well as the
further extension of these features using linear discriminant analysis by Duchnowski
et al. [47] to improve speech-discrimination performance. The modelling techniques
used for visual speech recognition tended to follow that of acoustic speech recognition,
with early systems focusing on template matching [136] and neural networks [16], but
HMMs rose to prominence to become the de-facto standard [65, 121] around the mid-
90s.
While the early research into AVSR mainly focused on recognising visual speech on
its own, the performance obtained using such a design could not match that of audio
alone. Given that human perception studies had shown that best recognition per-
formance could be obtained through a combination of both, research into fusing the
acoustic and visual information was of paramount importance to the development of
useful AVSR systems.
The earliest attempt at combining the two modalities was performed by Yuhas et al.
in 1989 [195]. Yuhas et al.’s system used a neural network with the pixel values of
the lip region as inputs to estimate the acoustic spectrum from the
visual information. This estimated spectrum was then combined with the true acoustic
spectrum, weighted manually according to the acoustic noise level. This combined
spectrum was then fed into a regular acoustic vowel recogniser, and performance was
found to improve upon the acoustic-only result.
This early effort was followed in the early 1990s with research on combining the acous-
tic and visual modalities using time-delay neural networks [169, 47], followed by
many papers looking at various fusion techniques using the now prominent HMMs
as the basis of modelling [110, 168, 1, 140].
Most early efforts at audio-visual fusion developed in the 1990s focused on either
combining audio and video features before classification, or combining the results of
separate classifiers. These two approaches are referred to as early and late integration
respectively. Recent AVSR research has focused on modelling techniques
that can be considered a compromise between these two approaches. Most of
these approaches focus on a variant of multi-stream HMMs, of which the simplest is
the synchronous HMM [145], the subject of this thesis. More complicated approaches
have been developed [125, 145, 12], intended mostly to deal with the asynchronous
nature of audio-visual speech, but their training and testing complexity has limited
their application for real-world use.
Reflecting the maturity of AVSR research in the last decade, a number of review papers
have been devoted solely to the topic, the earliest by Chen and Rao in 1998 [24], with
more recent research covered by Chibelushi et al. [25] in 2002 and Potamianos
et al. [145] in 2003. Most recently, the MIT Press has published an entire book solely
devoted to AVSP research [177].
2.3.2 Audio-visual speaker recognition
The earliest work on acoustic speaker recognition came from the idea of forensic voice-
print identification from Sonographs, first studied by Kersta [88], based on earlier work
done during World War II by Ralph Potter and colleagues at Bell Laboratories [150].
Whilst Kersta’s paper is not entirely clear on the methodology [74], it appears that he
found that his fellow staff members could recognise a person by their Sonograph with
99% accuracy [88].
Research on true automatic speaker recognition started in the 1970s with Atal’s work
on text-dependent recognition based on cepstral features [4], and its techniques and
methods tended to be shared with, and follow, speech recognition research [18].
Similar to speech recognition research, by the mid-1990s most speaker recognition re-
search had settled on using HMMs or GMMs to model cepstral-based acoustic speech
features [190, 158].
The earliest effort in attempting to recognise a speaker by both the acoustic and vi-
sual modalities was performed by Wagner and Dieckmann in 1994 [181]. This system
used optical flow to represent the visual features and a frequency representation of
the audio in synergetic classifiers before combining the results. They found the motion
features to work better than the acoustic features, but could not obtain an improvement
through fusion of the two classifiers.
Luettin et al. [109] were the first researchers to use HMMs for text-dependent recog-
nition of speakers from lip images, using contour-based feature extraction on the lip
region. Contour-based features were popular at the time in AVSPR research [189,
27, 5], in part encouraged by the release of the DAVID audio-visual database [26],
which had blue-highlighted lip images of speakers. Hybrid features, incorporating
both contour and intensity information, were also investigated showing improved
performance over contour alone [84, 185]. Jourlin et al.’s paper [84] also showed the
first combination of acoustic and visual features within the HMM approach, which
served as the basis of much future AVSPR research.
Most avenues of research continued to focus on simple fusion techniques until early in
the new century, when multi-stream HMMs were introduced for the AVSPR task by
Wark et al. [188]. Research continued to grow into methods of handling both modalities
simultaneously, in particular handling the asynchronous nature of audio and video
speech events, with the introduction of the coupled HMM [57] and the asynchronous
HMM [11] for AVSPR.
While AVSPR research is interested in recognising persons whilst they are speaking,
there is still a significant amount of static face information available in most applications
that can be used for traditional face recognition, in addition to the acoustic and visual
speech features. Some recent examples of such hybrid systems are those developed
by Fox et al. [53] and Nefian et al. [123]. Both of these systems have shown an im-
provement by bringing back the static face information that was discarded when only
considering the mouth region for visual speaker recognition.
2.3.3 Comparing speech and speaker recognition
As the two fields of speech and speaker recognition are very closely related, there have
been a number of efforts in the literature to compare and contrast the two tasks under
similar conditions. In particular, two researchers in the field, Luettin in 1997 [111] and
Lucey in 2002 [105], have published complete theses covering both fields of research
that provide a good summary of the two fields at the time they were published.
While a number of review papers have been published covering both speech and speaker
recognition [23, 25], few experimental comparisons of the two fields were conducted
until the most recent half-decade or so. In 2003, Nefian and Liang [122] and Lucey [106]
both published papers comparing speech and speaker recognition for audio-visual
speech. Results from both papers appear to show that the visual modality is much closer
in performance to the acoustic modality for speaker recognition than for the recognition
of speech, although neither paper particularly emphasises this point. A similar
conclusion also appears to follow from Bengio’s 2004 comparative paper [12].
While these three papers have looked at both speech and speaker recognition under
similar conditions, none drew any conclusion as to the comparative suitability of ei-
ther modality for speech or speaker recognition.
2.4 Audio-visual databases
One of the limiting factors on AVSP research is the limited availability of suitable
databases, especially when compared to similar databases for audio-only speech pro-
cessing. While this is partly due to AVSP being a newer area of research, the main
reason for the sparseness of audio-visual databases is the difficulty in collecting, stor-
ing and distributing audio-visual data. For example, a typical audio-visual utterance
stored in a compressed video format might be 20-30 times as large as an equivalent
audio-only utterance. If the video data is not compressed, such as during data collec-
tion, then the difference is even more dramatic. Add in the difficulty of distributing
this volume of data to researchers and it can be seen that storage size has been (and
continues to be) a severe limiting factor on the development of audio-visual databases.
Due to these limitations, most early audio-visual databases were either designed for
a single task, or very limited in scope. However, as the costs of processing and
storage have steadily decreased, size has become less of an issue, and more databases
have recently become available that are suitable for more general research. This sec-
tion will begin with a brief review of audio-visual speech processing databases, and
finish with an examination of the XM2VTS database [119], which will be used as the
basis for the experiments performed in this thesis.
2.4.1 A brief review of audio-visual databases
Most of the early audio-visual speech processing research focused on the speech recog-
nition task, and early databases were generally only designed to show the utility of
audio-visual speech on a single speaker [136, 168, 28]. When audio-visual speaker
recognition was studied, the speech was typically limited to a single short phrase over
a small number of speakers [182]. Most of these databases were collected directly by
the researchers involved and generally were not widely distributed due to their lim-
ited utility and large (for the time) size.
Starting in the mid-1990s, a number of larger multi-speaker databases were released,
such as the Tulips 1 [121] and DAVID [26] databases. By making these databases
available beyond their creators, different researchers were able to compare speech and
speaker recognition performance on the same data. However, the size
of these databases was still limited compared to the far more abundant audio speech
databases available at the time, typically with only 10 to 30 speakers and a very limited
vocabulary.
The M2VTS [137] database was released, and then extended into the XM2VTS database [119],
in the late nineties, and it has proved very popular for audio-visual speech research.
While the vocabulary of the XM2VTS database was still relatively limited, the large
number of speakers available (295) has provided a much more robust research base
for both speech and speaker recognition research, and it is currently the largest pub-
licly available audio-visual database with around 30 hours of speech. As it will serve
as the basis of the research in this thesis, the XM2VTS database will be examined in
more detail in Section 2.4.2.
The XM2VTS database has served as a useful benchmark for audio-visual speech re-
search, but its vocabulary is limited to English digits and a single phonetically bal-
anced phrase. The VidTIMIT database [163] has been recently released to examine
audio-visual speech over a wider vocabulary by having 43 speakers say 10 phoneti-
cally balanced phrases selected from the TIMIT [79] acoustic speech database. Inspired
by the VidTIMIT database, the AVTIMIT [68] database was collected with 223 speak-
ers of 20 TIMIT phrases. While these databases are certainly a good start towards large
vocabulary audio-visual speech processing, their relatively small size (40 minutes for
VidTIMIT and 4 hours for AVTIMIT) puts limitations on their utility for developing
reliable audio-visual speech models. To date, the most extensive database available for
large vocabulary audio-visual speech recognition is the IBM ViaVoice database [126],
with 50 hours of audio-visual speech collected over 290 speakers. Unfortunately, due
to commercial constraints this database is not publicly available, leaving most research
to be performed on the smaller publicly available databases.
Until quite recently, most audio-visual speech databases have consisted of data col-
lected in clean studio conditions. While this has been useful for the study of audio-
visual speech, more realistic conditions are required to demonstrate the efficacy of
audio-visual speech in the real world. Some examples of recent databases designed
to study more real-world conditions are the CUAVE [133] database, which deals with
problems in face and pose tracking; the AVICAR [96] database, looking at audio-visual
speech recognition in automotive environments; and the IBM Smart Room database [148],
focused on meeting room environments. The office-environment-based BANCA [6]
database also looks promising, but has not yet been released in a usable form for audio-
visual speech research.
The recent reduction in distribution and storage costs has allowed some of the more
recent small audio-visual databases to be released to interested researchers at very
low, or even no cost, in the hope of wider use by the audio-visual speech research
community. Some examples of this are the CUAVE dataset [133], VidTIMIT [163] and
the Australian English dataset AVOZES [62]. While these databases are not as large
as XM2VTS, the releasers of these databases hope that the very low cost (or no-cost)
will encourage their wide distribution amongst researchers and subsequent use as a
benchmark for audio-visual speech research.
2.4.2 The XM2VTS database
The XM2VTS [119] database was released by the European M2VTS project (Multi
Modal Verification for Teleservices and Security applications) [138] with the aim of
extending their existing M2VTS database [137] into a large, high-quality audio-visual
Figure 2.3: Some examples of raw frame images from the XM2VTS database [119].
database. Since its release the XM2VTS (extended M2VTS) database has continued to
be the largest publicly available audio-visual speech database, with around 30 hours
of raw video available. The only audio-visual database which is larger is IBM’s
ViaVoice database [126], which has not been made available to the research public.
The XM2VTS database consists of 295 participants speaking three distinct phrases. These
phrases are the same for all speakers and sessions, and are:
1. “0 1 2 3 4 5 6 7 8 9”
2. “5 0 6 9 2 8 1 3 7 4”
3. “Joe took father’s green shoe bench out”
The speech events were arranged into two ‘shots’ per session, where each of the three
phrases was spoken in each shot. Four sessions were recorded in total over a period of
five months to capture the natural variability of speakers over time. Each of the shots
was recorded in studio conditions with good illumination and a blue background
suitable for chroma-keying. Some examples of such frames from the database are
given in Figure 2.3.
The XM2VTS database was primarily designed for the speaker recognition task, and
a speaker-verification protocol [107] was released alongside the database to enable re-
searchers to benchmark performance easily. In the protocol the 295 speakers of the
database were split into 200 clients and 95 impostors. Two configurations were created,
defining which sessions were used for training, evaluation and testing of the
speaker verification system, which are shown in Figure 2.4. The second configuration
will serve as the basis of the speech processing framework used for the experiments
performed in this thesis, but adapted such that it can be used for both speech and
speaker recognition.

Figure 2.4: Configurations for person recognition defined by the XM2VTS protocol [107].
2.5 Chapter summary
This chapter has provided a concise summary of the field of AVSP, covering both areas
of speech and speaker recognition. Both human-based and automatic speech process-
ing research was reviewed to introduce the fundamental concepts involved in AVSP
research.
A review of the existing literature in human production and perception of speech
and speakers was conducted, including the benefits of speech-related movement for
human recognition of faces. The speech perception studies clearly show that while
speech production itself is primarily an acoustic process, it does have visual side-
effects that humans have come to rely upon to improve their perception of each other’s
speech. In particular, studies have shown that even with clearly articulated speech,
human listeners could recognise speech with higher accuracy than with the audio alone.
Studies of human recognition of speakers based on audio-visual speech were very
limited in the literature, but a number of studies showed that human recognition of
faces was improved with speech-like movement. One recent significant study has
shown that recognition of familiar voices was faster when the correct face was shown,
suggesting that a combination of acoustic and visual speaker recognition occurs when
both are available for human recognition of speakers.
In the final sections of this chapter a brief history of automatic speech and speaker
recognition systems was presented, along with a review of databases available for
audio-visual speech processing. Major publications of importance in both fields were
indicated, as well as the cross-over between each of these fields and the closely
related fields of acoustic speech and speaker recognition.
Chapter 3
Speech and Speaker Classification
3.1 Introduction
Classification is the process of assigning input features into one of a finite number
of classes. For the tasks of speech and speaker classification, these classes are either
speech events or speakers’ identities respectively. When tested against a particular
sequence of features, classification is generally given as a score representing the likeli-
hood of the features belonging to the class represented by a particular classifier. This
score can then be compared to other classifier scores to make a decision on the most
likely class of a particular set of data. Before the classifiers can make such decisions,
they need to be trained on many sets of features that are typical of a particular class so
that accurate classification can occur. This training process is conducted on a separate
set of data from that on which the classifiers will eventually be used.
This chapter will focus on classification methods that are suitable for implementing
speech models of acoustic and visual speech features.
Both Gaussian mixture models (GMMs) and hidden Markov models (HMMs) will be
introduced as speech classifiers that have shown good performance in the existing
literature at modelling human speech events. In addition to training these models di-
rectly, maximum a posteriori (MAP) adaptation will also be introduced as a technique
to allow speaker dependent speech models to be trained with limited data.
3.2 Background
The goal of classification is to divide a multi-dimensional feature-space into regions
based upon whether a particular point, or observation, in that space belongs to a par-
ticular class [59]. For speech recognition these classes would correspond to words or
sub-words, whereas a speaker recognition classifier would be choosing amongst sep-
arate classes for each speaker. Classifiers can be used either to choose amongst many
different classes, or to make a binary accept/reject decision for a single class.
Ideally classes should be completely separate within the feature space, allowing clas-
sifiers to unambiguously determine which class any particular point in feature space
would correspond to. Unfortunately, this is not the case in the real world, so the aim
of classifier design is to reduce the classification error. As the classifiers are trained on
known data, this classification error can easily be calculated as the number of feature-
space points placed in a class that does not match the labelled class.
3.2.1 Bayes classifier
The Bayes classifier [59] is a theoretical classifier that provides the best performance
for any pattern recognition application by minimising the probability of classification
error. Bayes classification is based upon the assumption that the observations for a
particular class can be modelled as a random variable with a known probability dis-
tribution. Bayes theorem defines the posterior probability of observation o being in a
particular class ωi as:
P(\omega_i | o) = \frac{p(o | \omega_i) \, P(\omega_i)}{p(o)} \qquad (3.1)

where p(o|ωi) is the class-conditional probability density function for observation o in
class ωi, P(ωi) is the a priori probability of class ωi, and p(o) is the probability density
function for observation o.
For the purposes of choosing between a number of classes only the numerator of (3.1)
is important as the denominator is identical for all classes ωi. Therefore given two
classes ω1 and ω2, a classification decision can be made as:
\text{Assign } \omega \rightarrow \begin{cases} \omega_1 & p(o|\omega_1) \, P(\omega_1) > p(o|\omega_2) \, P(\omega_2) \\ \omega_2 & \text{otherwise} \end{cases} \qquad (3.2)
Therefore choosing the most likely class for a particular observation is simply a matter
of sorting the numerators of (3.1) and choosing the highest score [48].
While the Bayesian classifier can theoretically provide the best performance of any
classifier, it does require that P(ωi) and p(o|ωi) are known for every class ωi.
While P(ωi) can be calculated easily given enough training data [59], the probability
density function p(o|ωi) must be estimated based on a training set of observations
for each class. Clearly, the more training observations that can be obtained, the better
the true p(o|ωi) can be modelled, and the closer Bayesian classifier performance will
approach the theoretical maximum [105].
3.2.2 Non-parametric classifiers
One of the simplest methods of estimating the class density function p(o|ωi) is to use a
non-parametric classifier. Non-parametric classifiers are so called because they make
no major assumptions about the underlying form of the class distributions, and therefore
are not represented using parameters of any particular modelling technique. Non-
parametric classifiers generally compare a test observation directly with the known-
class training observations to determine the class under test.
The simplest, and classical, implementation of a non-parametric classifier is the nearest
neighbour classifier. This classifier works by choosing the class of the closest (or a
majority of the k closest) training observations to the test observation. This class is
then returned as the most likely class for the test observation.
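A minimal sketch of such a k-nearest-neighbour classifier follows; the tiny two-class training set is hypothetical and for illustration only.

```python
from collections import Counter

def knn_classify(test_obs, train_obs, train_labels, k=1):
    """Return the majority label among the k training observations
    closest (in squared Euclidean distance) to test_obs."""
    order = sorted(
        range(len(train_obs)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_obs[i], test_obs)),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training observations for two classes
train = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
labels = ["A", "A", "B", "B"]
print(knn_classify((0.2, 0.1), train, labels, k=3))
```

Note that every training observation is retained and compared at test time, which is exactly the storage limitation discussed below.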
Non-parametric methods can be very useful when the training data available for each
class is limited, such as in face recognition. For applications where a reasonable amount
of training data is available, the limitation of having to store and compare every training
observation comes into play [105].
3.2.3 Parametric classifiers
Parametric classifiers are designed such that they make some assumption about the
form of the classes within the feature space, and the training process consists of esti-
mating the parameters of an assumed modelling technique [48].
As some assumption is made about the approximate nature of observations within
classes, a large number of training observations can be reduced to a relatively small
number of parameters defining the form of the assumed modelling technique. Fur-
thermore, since p (o|ωi) is calculated directly for each class, statistical methods can be
used to form the best models for each class [105].
For speech processing work, it is generally assumed that observations for a particular
class can be considered to be Gaussian about a relatively small number of points in
feature space [194]. This assumption has led to the development of GMMs and HMMs,
which are the basis of static and dynamic speech processing respectively, and will be
covered in detail for the remainder of this chapter.
3.3 Gaussian mixture models
Gaussian mixture models (GMMs) are a modelling technique that has been used
extensively for general pattern recognition research [59, 48]. As the name implies,
GMMs model classes with a weighted sum of Gaussian probability density functions
in feature-space.
The use of Gaussian models is encouraged by the Central Limit Theorem [58] which
states that a large number of measurements subject to small random errors will lead
to a Gaussian, or normal, distribution. Because such measurements are very common
in nature and other complex systems, Gaussian distributions, and therefore GMMs,
are well suited to representing complex variables. In particular GMMs have shown
good performance for text-independent acoustic speaker recognition [158].
GMMs are defined by a weighted sum of M Gaussian density functions, given by:
p(o | \omega_i) = \sum_{i=1}^{M} c_i \, b_i(o) \qquad (3.3)

where o is a D-dimensional observation vector, b_i(o) is the Gaussian density function
for mixture i, and c_i is the weight of mixture i. The weights c_i must sum to unity over
all mixtures, \sum_{i=1}^{M} c_i = 1. Each b_i(o) is a D-variate Gaussian function of the form

b_i(o) = \mathcal{N}(o, \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (o - \mu_i)' \Sigma_i^{-1} (o - \mu_i) \right\} \qquad (3.4)

with mean vector \mu_i and covariance matrix \Sigma_i for Gaussian i determined during
training of the GMM.
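Equations (3.3) and (3.4) can be evaluated directly. The sketch below assumes diagonal covariances, stored as vectors of variances (a common simplification discussed later in this section); the example weights and means are hypothetical.

```python
import numpy as np

def gmm_density(o, weights, means, variances):
    """Evaluate p(o) = sum_i c_i b_i(o), equation (3.3), for a GMM
    with M diagonal-covariance mixtures over a D-dimensional o.

    weights:   (M,)   mixture weights c_i (summing to unity)
    means:     (M, D) mean vectors mu_i
    variances: (M, D) diagonals of the covariance matrices Sigma_i"""
    D = o.shape[0]
    # b_i(o) from equation (3.4); with a diagonal Sigma_i the
    # determinant is simply the product of the variances.
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.prod(variances, axis=1))
    mahal = np.sum((o - means) ** 2 / variances, axis=1)
    return float(np.dot(weights, np.exp(-0.5 * mahal) / norm))

# Hypothetical two-mixture model in two dimensions
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.ones((2, 2))
print(gmm_density(np.array([0.0, 0.0]), weights, means, variances))
```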
3.3.1 GMM complexity
Because the form of the distributions is assumed (i.e. Gaussian), a GMM can be compactly
defined by a single parameter vector, \lambda, consisting of the weight, mean and
covariance for each of the M mixture components:

\lambda = [\lambda_1, \ldots, \lambda_M] = [c_1, \mu_1, \Sigma_1, \ldots, c_M, \mu_M, \Sigma_M] \qquad (3.5)
This representation is clearly much more compact than would be required by a non-
parametric classifier, which generally stores every training observation. This simplified
form allows statistical methods to be used to determine the optimal \lambda for a given set
of data representing the class in training. However, GMMs can still be quite complex
in terms of the number of free parameters in \lambda. While this complexity may provide
better modelling of the idiosyncrasies of the class under training, it must
be traded off against the volume of training observations required to support it. A number
of decisions can be made to reduce the complexity of a GMM without greatly degrading
speech modelling performance, generally related to the form of the covariances
and the topology of the GMM.
The first decision is choosing the form of the covariance matrix \Sigma_i. In the general case,
the covariances between all D dimensions of a collection of observation vectors can be
represented by a full D × D covariance matrix. These covariance matrices can be of
the following form:
1. Nodal, where each Gaussian (node) has its own covariance matrix
2. Grand, where all Gaussians within a GMM share a single covariance matrix
3. Global, where all Gaussians within all GMMs share a single covariance matrix
Nodal covariance models are typically chosen as they allow each Gaussian to individually
choose the best covariance, but the other options can be useful when training data
is limited.
Additionally, rather than training a full covariance matrix, the data and training
requirements can be reduced by training only a diagonal covariance vector, setting
all inter-dimensional covariances to zero. The use of nodal, diagonal covariance
vectors has been shown empirically to provide the best performance for most speech
applications [158].
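The saving offered by diagonal covariances is easy to quantify by counting free parameters. The sketch below uses a hypothetical 16-mixture GMM over 39-dimensional features, and for simplicity counts one weight per mixture, ignoring the sum-to-unity constraint.

```python
def gmm_param_count(M, D, diagonal=True):
    """Count the free parameters of an M-mixture, D-dimensional GMM
    with nodal covariances: per mixture, one weight, a D-dimensional
    mean, and either D variances (diagonal) or D * (D + 1) / 2
    distinct terms (full, symmetric)."""
    cov = D if diagonal else D * (D + 1) // 2
    return M * (1 + D + cov)

# e.g. a hypothetical 16-mixture GMM over 39-dimensional features
print(gmm_param_count(16, 39, diagonal=True))   # 1264
print(gmm_param_count(16, 39, diagonal=False))  # 13120
```

Roughly an order of magnitude fewer parameters need to be estimated in the diagonal case, which is why it demands far less training data.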
The second decision involves choosing the topology of the GMM. In model design
topology generally refers to the top-level layout of the classifier, which in GMMs boils
down to choosing the number of Gaussians, defined as M above. The choice of M
comes down to a simple trade-off between the complexity of the classifier and the
amount of training data available. If the GMM is too complex (M too large), it may
over-fit to the training data, impairing the model's performance on unseen data; however,
if the GMM is too simple (M too small), it may not model the variety of the training
observations adequately. Unfortunately, there does not exist a known theoretical
method of calculating the optimal value of M prior to performing the Gaussian train-
ing. M is therefore chosen through heuristic and empirical evidence based on the final
model performance on unseen data [105].
3.3.2 GMM parameter estimation
Once the covariance form and topology of the GMM have been chosen, the GMM can be
trained by determining the values of \lambda that best fit the training data, through a process
called maximum likelihood estimation. In maximum likelihood estimation, if we have a set
of N observations drawn from the class being modelled, O = \{o_1, \ldots, o_N\},
then the likelihood of a given set of parameters \lambda producing that data set is given by:

L(\lambda | O) = p(O | \lambda) = \prod_{i=1}^{N} p(o_i | \lambda) \qquad (3.6)
The optimal parameters \lambda' can then simply be expressed as:

\lambda' = \arg\max_{\lambda} L(\lambda | O) \qquad (3.7)
However, this does not specify how the parameter vector \lambda is varied to determine the
maximum likelihood, which is a non-trivial problem for any number of Gaussian mixtures
M > 1. While a single Gaussian's parameters could be determined directly by examining
the data, the parameters of multiple mixtures must be determined through a more
elaborate process, of which one popular method is known as expectation maximisation
(EM) [13].
EM is an iterative process used to improve a parameter vector based upon the likelihood
of the observations being fitted by that vector. For GMM training, EM is performed
by maximising the parameters of each mixture individually based upon the training
observations that suit each individual mixture. The EM algorithm consists of four
stages, which are given for the training of GMM mixture i here:
1. Initialisation: set \lambda_i^{(0)} to an initial value, and set t = 0

2. Expectation: calculate L(\lambda_i^{(t)} | O)

3. Maximisation: \lambda_i^{(t+1)} = \arg\max_{\lambda_i} L(\lambda_i | O)

4. Iteration: set t = t + 1 and repeat from step 2 until L(\lambda_i^{(t)} | O) - L(\lambda_i^{(t-1)} | O) \leq Th, or until t reaches T

where \lambda_i^{(t)} is the estimate of \lambda_i at iteration t, Th is a predefined convergence threshold,
and T is the maximum number of iterations permitted.
Before the EM algorithm can be applied, a good ‘first guess’ of each mixture’s parameter
vector, λ_i^{(0)} = {c_i, µ_i, Σ_i}, must first be provided to serve as a starting point for the
expectation and maximisation steps. The initial parameters can be chosen based on
a random selection from the training observations, but the best performance is normally
obtained using a non-random initialisation process. The most common initialisation
method uses k-means clustering [59, 2] to choose M clusters from the training
observations and initialise λ_i^{(0)} with one mixture per cluster. All GMM-based
experiments conducted in this thesis are initialised in this manner.
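As an illustration of this initialisation strategy, the following sketch runs a few iterations of k-means and derives one initial mixture (weight, mean, covariance) per cluster. This is a minimal example assuming NumPy; the function name and the guards for empty clusters are illustrative additions, not the thesis implementation:

```python
import numpy as np

def kmeans_init(obs, M, iters=10, seed=0):
    """Initialise M GMM mixtures (weights, means, covariances) via k-means."""
    rng = np.random.default_rng(seed)
    # Start from M randomly chosen observations as cluster centres.
    centres = obs[rng.choice(len(obs), M, replace=False)]
    for _ in range(iters):
        # Assign each observation to its nearest centre (squared distance).
        d2 = ((obs[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each non-empty cluster's centre to the mean of its members.
        centres = np.array([obs[labels == m].mean(axis=0)
                            if np.any(labels == m) else centres[m]
                            for m in range(M)])
    dim = obs.shape[1]
    # One mixture per cluster: weight, mean and covariance from its members.
    weights = np.array([(labels == m).mean() for m in range(M)])
    covs = np.array([np.cov(obs[labels == m].T) if (labels == m).sum() > 1
                     else np.eye(dim) for m in range(M)])
    return weights, centres, covs
```

The resulting weights, means and covariances then serve as λ_i^{(0)} for the EM iterations above.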
The expectation step of the EM algorithm determines the likelihood of the current
parameter vector λ_i^{(t)} fitting each observation in the training set, o_n ∈ O. This is
calculated based on a mixture-normalised likelihood:

l_i(n) = c_i b_i(o_n) / ∑_{k=1}^{M} c_k b_k(o_n)   (3.8)

where L(λ_i|O) ≈ ∏_{n=1}^{N} l_i(n).
Once the likelihoods have been calculated, the parameters can be recalculated using
l_i(n) to determine the likelihood that observation o_n is covered by mixture i under the
previous choice of parameters. As a single mixture comprises only a mean, a covariance
and a weight parameter, these parameters can be calculated using standard statistical
methods [105]:
µ_i = ∑_{n=1}^{N} l_i(n) o_n / ∑_{n=1}^{N} l_i(n)   (3.9)

Σ_i = ∑_{n=1}^{N} l_i(n) (o_n − µ_i)(o_n − µ_i)′ / ∑_{n=1}^{N} l_i(n)   (3.10)

c_i = (1/N) ∑_{n=1}^{N} l_i(n)   (3.11)
Finally, the iterative step compares the likelihoods under the new parameters to the
old likelihoods to decide whether the EM algorithm has converged to a maximum, at which
point the algorithm concludes.

Figure 3.1: A Markov process can be modelled as a state machine with probabilistic transitions (a_ij) between states at discrete intervals of time (t = 1, 2, ...).
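Putting (3.6) to (3.11) together, the complete EM loop for a GMM can be sketched as follows. This is a minimal illustration assuming NumPy and full covariance matrices; the function names are hypothetical and this is not the implementation used in the thesis:

```python
import numpy as np

def gaussian_pdf(o, mu, cov):
    """Multivariate Gaussian density b_i(o), evaluated at each row of o."""
    d = o.shape[1]
    diff = o - mu
    inv = np.linalg.inv(cov)
    expo = -0.5 * np.einsum('nd,de,ne->n', diff, inv, diff)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(expo)

def gmm_em(obs, c, mu, cov, max_iter=50, thresh=1e-6):
    """EM for a GMM: iterate (3.8)-(3.11) until the likelihood converges."""
    prev_ll = -np.inf
    for _ in range(max_iter):
        # Expectation: mixture-normalised likelihoods l_i(n), eq. (3.8).
        b = np.array([ci * gaussian_pdf(obs, mi, Si)
                      for ci, mi, Si in zip(c, mu, cov)])   # shape (M, N)
        l = b / b.sum(axis=0, keepdims=True)
        # Maximisation: re-estimate means, covariances, weights, (3.9)-(3.11).
        for i in range(len(c)):
            w = l[i]
            mu[i] = (w[:, None] * obs).sum(axis=0) / w.sum()
            diff = obs - mu[i]
            cov[i] = (w[:, None, None]
                      * np.einsum('nd,ne->nde', diff, diff)).sum(axis=0) / w.sum()
            c[i] = w.mean()
        # Iterate: stop once the log-likelihood improvement falls below thresh.
        ll = np.log(b.sum(axis=0)).sum()
        if ll - prev_ll <= thresh:
            break
        prev_ll = ll
    return c, mu, cov
```

Each pass through the loop is one expectation-maximisation iteration; the stopping rule mirrors step 4 of the EM algorithm above.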
3.4 Hidden Markov models
Hidden Markov models (HMMs) are a well-established mathematical tool for building
statistical models of temporal observations. Whereas the GMM introduced in the
previous section models individual observations independently of each other, HMMs are
designed to treat observations as a sequence in time or space. While spatial HMMs
can be useful for applications such as handwriting recognition [3], most applications
of HMMs involve temporal observations, of which speech is a very common application [194].
For this reason, HMMs will be introduced in this section, and used throughout
this thesis, as a temporal model.
3.4.1 Markov models
HMMs are designed to model sequences of observations under the assumption that these
observations are produced by a hidden state machine whose parameters are not known.
In the underlying state machine, referred to as a Markov chain or model [153], the state
can change according to statistical probabilities at discrete points in time, as shown in
Figure 3.1.
Markov models can be used to determine the likelihood of a particular sequence of
events, given a particular path through the network. Given a Markov model with S
states, the parameters of the model can be defined as

λ = [a_ij : 1 ≤ i ≤ S, 1 ≤ j ≤ S]   (3.12)
If the known path, or sequence, through the network is given as

q = [q_1, q_2, ..., q_T],   1 ≤ q_t ≤ S   (3.13)

where q_t is the state occupied at time t, then the model parameters can easily be
examined to determine the probability of path q being traversed.
If the probability of the initial state is defined as π_i = P(q_1 = i), then the probability
of Markov model λ producing sequence q can be given as the product of the initial
and transition probabilities:

P(q|λ) = ∏_{t=1}^{T} P(q_t|λ)   (3.14)
       = π_{q_1} a_{q_1 q_2} a_{q_2 q_3} ... a_{q_{T−1} q_T}   (3.15)
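For a known path, (3.15) is simply a product of table look-ups. A small sketch, using an invented two-state chain (the probabilities below are illustrative, not from the thesis), assuming NumPy:

```python
import numpy as np

# Hypothetical 2-state Markov chain: initial and transition probabilities.
pi = np.array([0.6, 0.4])              # pi_i = P(q_1 = i)
A = np.array([[0.7, 0.3],              # a_ij = P(q_{t+1} = j | q_t = i)
              [0.2, 0.8]])

def path_probability(q, pi, A):
    """P(q|lambda) = pi_{q1} a_{q1 q2} ... a_{q_{T-1} q_T}, eq. (3.15)."""
    p = pi[q[0]]
    for prev, cur in zip(q[:-1], q[1:]):
        p *= A[prev, cur]
    return p

p = path_probability([0, 0, 1, 1], pi, A)   # 0.6 * 0.7 * 0.3 * 0.8
```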
3.4.2 Hidden Markov models
While Markov models can be very useful for modelling observations where the underlying
states can easily be determined from the observations, in practice the actual state
sequence is often unknown. This problem led to the development of the HMM design,
in which the underlying state sequence is said to be hidden.
Rather than being presented with a known state sequence q, a HMM works with a
sequence of observation vectors given by

O = [o_1, o_2, ..., o_T]   (3.16)

where o_t is the observation vector at time t.
As the underlying state sequence q is unknown, the probability of obtaining this sequence
from the HMM must be evaluated over all possible q [105]:

P(O|λ) = ∑_{all q} P(O|q, λ) P(q|λ)   (3.17)
It can be seen that, in addition to the sequence probability P(q|λ) given in (3.15), the
output emission probability density P(O|q, λ) must also be calculated over all sequences.
Given the assumption that the observations are independent given the states, this
probability can be represented as a conditional density function with no loss of
accuracy [105]. This density function can then be expressed as the product of
state-specific emission densities over time as follows:

p(O|q, λ) = ∏_{t=1}^{T} p(o_t|q_t, λ)   (3.18)
          = ∏_{t=1}^{T} b_{q_t}(o_t)   (3.19)
where b_i(o) is the output-emission probability density function of state i.

Therefore, to fully represent a HMM, the parameter vector λ must also contain these
output density functions in addition to the state transition likelihoods:

λ = [A, B]   (3.20)
  = [a_ij, b_i(o) : 1 ≤ i ≤ S, 1 ≤ j ≤ S]   (3.21)
It must be noted that this particular implementation of the state-based probabilities is
for continuous HMMs, as the model works on the actual continuous, real-valued
observations, rather than quantising the observations into discrete symbols before
modelling as in a discrete HMM. Continuous HMMs have been shown to provide much
better performance than discrete HMMs for speech recognition tasks [153].
At a basic level, HMMs can be viewed as a temporal structure, and the choice of
modelling technique for b_j(o) is not dictated by the HMM framework in any way.
However, in practice, continuous HMMs are typically implemented with the output
density functions represented by a GMM for each state:

b_j(o) = ∑_{m=1}^{M} c_jm b_jm(o)   (3.22)
       = ∑_{m=1}^{M} c_jm N(o; µ_jm, Σ_jm)   (3.23)
where c_jm, µ_jm and Σ_jm are the weight, mean vector and covariance matrix,
respectively, of each mixture in the GMM. More details on the implementation of GMMs
can be found in Section 3.3, where they are covered as a classifier in their own right.
Therefore, with the complete specification of the HMM parameters λ, the likelihood
of a particular sequence of observations O can be calculated by (3.17). However, the
need to calculate the probabilities over all possible paths through the state machine is
typically prohibitive, with on the order of S^T possible paths available. This calculation
can be simplified immensely if, instead of calculating the probability over all possible
paths, only the probability of the most-likely path is considered. If this most-likely path
is referred to as q′, then

P(O|λ) ≈ P(O|q′, λ)   (3.24)
This simplification is referred to as the Viterbi approximation [194, 105], and can greatly
simplify the calculation of P(O|λ) with no significant loss in performance [105]. Of
course, the Viterbi approximation does require that the optimal path q′ be calculated
in some manner first, which leads to the Viterbi decoding algorithm, designed for
this very purpose.
3.4.3 Viterbi decoding algorithm
The Viterbi decoding algorithm is designed to find the most likely path q′ for a
particular sequence of observations O through a HMM defined by λ, without having to
exhaustively search every possible path in the process. To simplify the task of choosing
the most likely path, the Viterbi decoding algorithm only calculates (and remembers)
the single most likely path to each state j at time t. Only S possible paths are kept
for each time step, rather than the S^t that would be required for an exhaustive search,
with a corresponding increase in performance. While this approach is not guaranteed
to always find the best path due to this assumption, in practice the algorithm works
effectively for speech and other applications of HMMs [153, 194, 105].

The Viterbi algorithm can easily be expressed as a recursive algorithm that calculates
both the probability of arriving in state j at time t through the most likely path, defined
as δ_t(j), and the previous state in that same path, ψ_t(j).
Initially, these parameters are defined for each state at time t = 1:

δ_1(j) = π_j b_j(o_1),   1 ≤ j ≤ S
ψ_1(j) = 0,   1 ≤ j ≤ S   (3.25)
Then at each step t, the most likely previous state is chosen for each current state, and
the current probability is calculated:

δ_t(j) = max_{1≤i≤S} [δ_{t−1}(i) a_ij] b_j(o_t),   2 ≤ t ≤ T, 1 ≤ j ≤ S
ψ_t(j) = argmax_{1≤i≤S} [δ_{t−1}(i) a_ij],   2 ≤ t ≤ T, 1 ≤ j ≤ S   (3.26)
So at each t, the best path (i.e. the most likely previous state) for state j is in ψ_t(j), and
the probability of arriving at state j by that path is in δ_t(j). Therefore, at the final
observation t = T, the final probability and final path state are given by:

P′ = max_{1≤j≤S} [δ_T(j)]
q′_T = argmax_{1≤j≤S} [δ_T(j)]   (3.27)
Once the final path state q′_T has been determined, the full path q′ = [q′_1, q′_2, ..., q′_T]
can be determined by backtracking through ψ:

q′_t = ψ_{t+1}(q′_{t+1}),   t = T − 1, T − 2, ..., 1   (3.28)
In practice, a closely related version of this algorithm is implemented using logarithms
of the parameters, which simplifies implementation in computer code by eliminating the
need for multiplications of very small probabilities.
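The recursion (3.25) to (3.28) translates directly into code. The sketch below is a minimal NumPy illustration (function name and interface are illustrative, not from the thesis) using the log-domain variant just mentioned, so that the products become sums:

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Viterbi decoding in the log domain.

    log_pi: (S,) initial log probabilities
    log_A:  (S, S) transition log probabilities, log a_ij
    log_B:  (T, S) emission log likelihoods, log b_j(o_t)
    Returns the most likely path q' and its log probability.
    """
    T, S = log_B.shape
    delta = np.zeros((T, S))            # delta_t(j): best-path log probability
    psi = np.zeros((T, S), dtype=int)   # psi_t(j): best previous state
    delta[0] = log_pi + log_B[0]        # initialisation, eq. (3.25)
    for t in range(1, T):
        # Recursion, eq. (3.26): additions replace multiplications in logs.
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    # Termination and backtracking, eqs. (3.27)-(3.28).
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return list(reversed(path)), delta[-1].max()
```

Only the S best partial paths survive at each time step, exactly as described above.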
3.4.4 HMM parameter estimation
The goal of HMM parameter estimation is to determine a HMM parameter vector λ,
defined in (3.20), based on a set of training observation sequences {O_1, O_2, ..., O_N}.
HMM parameter estimation can be considered to encompass the GMM parameter
estimation covered in Section 3.3.2, but complicated further, as each state of the HMM
has a GMM that requires parameter estimation in its own right. The individual GMM
parameter estimation is made more difficult again because the alignment of
observations to state GMMs is not completely defined, and can change somewhat
during the training process.
A single HMM contains S states, each represented by a single GMM. As it has already
been established that a single GMM's parameter vector is too complicated to optimise
analytically, having to estimate S of these does not make the process any easier.
However, the underlying general EM algorithm introduced in Section 3.3.2 can be applied to
the task of HMM parameter estimation in a similar manner. This specific instance of
the EM algorithm is called the Baum-Welch algorithm, and has been shown to provide
good performance for HMM training for speech and other applications [153, 194].
Like the more general EM algorithm, the Baum-Welch algorithm iteratively determines
the λ that locally maximises L(λ|O) at each stage, thereby arriving at a suitable
λ for the models being trained. This algorithm will be introduced shortly, but first
a suitable initialisation, or ‘best guess’, of λ must be determined to provide a
suitable starting point. This initialisation is performed using the technique of Viterbi
training.
Viterbi training
For HMM parameter estimation, Viterbi training serves as the initialisation step of the
EM algorithm, providing a good ‘first guess’ of the HMM parameter vector λ. This
can be considered analogous to the application of k-means clustering in initialising a
GMM parameter vector.
The main task of Viterbi training is to align the observations in the training observation
sequences against the correct state model, whereupon the state models can then be
estimated from the aligned observations. This alignment is performed by dividing each
observation sequence into S equal segments at the initial stage, after which the best
alignment, q′, is found using the Viterbi decoding algorithm described in Section 3.4.3.
At this stage, the state-transition probabilities of the HMM can be estimated from the
Viterbi alignment. If A_ij is the total number of transitions from state i to state j in q′
over all N training observation sequences, then the state-transition probabilities can
be estimated by

a_ij = A_ij / ∑_{k=1}^{S} A_ik   (3.29)
Once each state has been aligned with the corresponding training observations, each
observation is assigned to a particular mixture in the state-model GMM, and each of
these mixtures' parameters µ_i and Σ_i, as well as the mixture weights c_i, can be
calculated using standard statistical methods similar to those of GMM training in
Section 3.3.2. This assignment of observation to mixture is performed using k-means
clustering on the first step, and thereafter by choosing the most likely mixture for
each observation.
Once the state-model GMM parameter vector has been calculated, a new alignment is
performed and the process begins anew. The Viterbi training process ends when there
is minimal change in the HMM parameter vector λ.
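The transition re-estimation of (3.29) amounts to counting transitions in the aligned state paths and normalising each row. A small sketch (assuming NumPy; the alignments below are invented for illustration):

```python
import numpy as np

def estimate_transitions(alignments, S):
    """Estimate a_ij = A_ij / sum_k A_ik from Viterbi alignments q'."""
    counts = np.zeros((S, S))
    for q in alignments:            # each q is one aligned state sequence
        for i, j in zip(q[:-1], q[1:]):
            counts[i, j] += 1       # accumulate A_ij over all sequences
    # Normalise each row so that outgoing probabilities sum to one.
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts),
                     where=totals > 0)

A = estimate_transitions([[0, 0, 1, 2], [0, 1, 1, 2]], S=3)
```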
Baum-Welch algorithm
Once a good initial estimate of the HMM parameter vector λ has been provided by the
Viterbi training algorithm, the Baum-Welch algorithm is used to iteratively improve
this estimate. Being within the same class of EM algorithms, the Baum-Welch algorithm
is quite similar to Viterbi training. However, rather than assigning observations
definitively to states and mixtures, Baum-Welch re-estimation considers states and
mixtures to have soft boundaries, and each observation is spread amongst all states
and mixtures based upon its likelihood of being in each. This likelihood is calculated
from both forward and backward probabilities of state-and-mixture occupation, leading
to the Baum-Welch algorithm's other moniker, the forward-backward algorithm.
The forward probability, α_j(t), is defined as the likelihood of a particular observation
sequence arriving at state j at time t, or more formally,

α_j(t) = p(o_1 o_2 ... o_t, q_t = j|λ)   (3.30)

This value can be calculated recursively by first defining α_j(1) based on the initial
observation,

α_j(1) = π_j b_j(o_1),   1 ≤ j ≤ S   (3.31)
and can then be extended to higher values of t based on the previous α_i(t) values and
the current observation:

α_j(t) = [∑_{i=1}^{S} α_i(t − 1) a_ij] b_j(o_t),   1 ≤ i, j ≤ S, 2 ≤ t ≤ T   (3.32)
In a similar manner to the forward probability, the backward probability is defined
as the likelihood of the remainder of the observation sequence after time t, given that
state j is occupied at time t. This can be expressed formally as

β_j(t) = p(o_{t+1} o_{t+2} ... o_T|q_t = j, λ)   (3.33)

and can be calculated in a similar recursive manner to α_j(t). At time t = T, this
likelihood is unity for every state:

β_j(T) = 1,   1 ≤ j ≤ S   (3.34)
The backward probabilities at earlier times can then be calculated recursively,

β_j(t) = ∑_{k=1}^{S} a_jk b_k(o_{t+1}) β_k(t + 1),   1 ≤ j, k ≤ S, 1 ≤ t ≤ T − 1   (3.35)
Because the forward probability has been defined as a joint probability, but the backward
probability as a conditional one, the two can be combined to determine the joint
probability of being in state j at time t within a complete sequence of observations,
O = {o_1, o_2, ..., o_T}:

p(O, q_t = j|λ) = α_j(t) β_j(t)   (3.36)
Using (3.36), we can define the likelihood of state j being occupied at time t for the nth
training observation sequence O_n in terms of the forward and backward probabilities:

L^n_j(t) = p(q^n_t = j|O_n, λ)   (3.37)
         = p(O_n, q^n_t = j|λ) / p(O_n|λ)   (3.38)
         = (1/P_n) α^n_j(t) β^n_j(t)   (3.39)

where P_n = p(O_n|λ), which can be calculated from the full iteration of either
probability:

P_n = α^n_S(T) = β^n_1(1)   (3.40)
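The recursions (3.31) to (3.35) can be sketched as below. This is a minimal NumPy illustration for a general HMM (the model values are invented); here the sequence likelihood is recovered as ∑_j α_j(T), which reduces to the α_S(T) form of (3.40) when, as in a left-to-right model, only state S can end the sequence:

```python
import numpy as np

def forward_backward(pi, A, B):
    """Compute alpha_j(t) and beta_j(t) for emission likelihoods B[t, j]."""
    T, S = B.shape
    alpha = np.zeros((T, S))
    beta = np.ones((T, S))                      # beta_j(T) = 1, eq. (3.34)
    alpha[0] = pi * B[0]                        # eq. (3.31)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]    # eq. (3.32)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])  # eq. (3.35)
    return alpha, beta

pi = np.array([1.0, 0.0])
A = np.array([[0.6, 0.4], [0.0, 1.0]])
B = np.array([[0.9, 0.2], [0.5, 0.5], [0.1, 0.8]])  # B[t, j] = b_j(o_t)
alpha, beta = forward_backward(pi, A, B)
# By (3.36), sum_j alpha_j(t) beta_j(t) gives the same P(O|lambda) at every t.
```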
The state likelihood in (3.39) can be extended to mixture m within state j as

L^n_jm(t) = (1/P_n) [∑_{i=1}^{S} α^n_i(t − 1) a_ij] c_jm b_jm(o^n_t) β^n_j(t)   (3.41)
which serves as the expectation step for further EM under the Baum-Welch algorithm.
Once the likelihoods have been calculated for each state and mixture in each training
observation sequence, the maximisation step can proceed to estimate the new parameters,

λ = [A, B]   (3.42)
  = [a_ij, b_j(o) : 1 ≤ i, j ≤ S]   (3.43)
  = [a_ij, c_jm, µ_jm, Σ_jm : 1 ≤ i, j ≤ S, 1 ≤ m ≤ M_j]   (3.44)
which are individually estimated using standard statistical methods [194]:
a_ij = [∑_{n=1}^{N} (1/P_n) ∑_{t=1}^{T_n−1} α^n_i(t) a_ij b_j(o^n_{t+1}) β^n_j(t + 1)] / [∑_{n=1}^{N} (1/P_n) ∑_{t=1}^{T_n} α^n_i(t) β^n_i(t)]   (3.45)

c_jm = [∑_{n=1}^{N} ∑_{t=1}^{T_n} L^n_jm(t)] / [∑_{n=1}^{N} ∑_{t=1}^{T_n} L^n_j(t)]   (3.46)

µ_jm = [∑_{n=1}^{N} ∑_{t=1}^{T_n} L^n_jm(t) o^n_t] / [∑_{n=1}^{N} ∑_{t=1}^{T_n} L^n_jm(t)]   (3.47)

Σ_jm = [∑_{n=1}^{N} ∑_{t=1}^{T_n} L^n_jm(t) (o^n_t − µ_jm)(o^n_t − µ_jm)′] / [∑_{n=1}^{N} ∑_{t=1}^{T_n} L^n_jm(t)]   (3.48)
The expectation (3.39 and 3.41) and maximisation (3.45 to 3.48) steps of the Baum-Welch
algorithm are then iterated until convergence of the parameters occurs, at which point
the algorithm concludes.
Figure 3.2: A diagrammatic representation of a typical left-to-right HMM for speech processing.
3.4.5 HMM types
Whilst it is possible for the states of a HMM to be interconnected in any manner, in
general some simplification is performed to make training and decoding of the final
HMM simpler. The structure of the HMM can be read directly from the transition
matrix A = [a_ij : 1 ≤ i, j ≤ S].
The general case is the ergodic HMM, where any state can be reached from any other
state in a single step (that is, where a_ij > 0 for all i, j), but for most speech processing
tasks, the left-to-right HMM has been found to provide a better representation of human
speech [153, 194]. In this form of HMM, connections can only be made from a lower to
a higher (or the same) state, or

a_ij = 0,   j < i   (3.49)
An example of a typical left-to-right HMM is shown in Figure 3.2, and the reason for
the name can be seen when the states are laid out by order of index.
The design of the left-to-right HMM puts a number of restrictions on the training (and
decoding) process whilst still adequately modelling the natural non-cyclic nature of
speech [105]. In particular, the single entry and exit states, and the absence of backwards
transitions, dictated by the left-to-right HMM can simplify the possible network paths for
EM considerably.
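The constraint (3.49) can be imposed on any transition matrix by zeroing the lower triangle and renormalising each row, for example (a NumPy sketch with an invented uniform matrix):

```python
import numpy as np

def make_left_to_right(A):
    """Zero the transitions a_ij for j < i, eq. (3.49), and renormalise rows."""
    A = np.triu(A)                      # keep only j >= i
    return A / A.sum(axis=1, keepdims=True)

A = make_left_to_right(np.full((3, 3), 1.0 / 3))
```

After this step only same-state and forward transitions carry probability mass, giving the structure shown in Figure 3.2.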
3.5 Speaker adaptation
Using the training procedures outlined earlier in this chapter, it is possible to train
GMM- and HMM-based classifiers that perform well at their respective tasks, provided
that there is enough training data to adequately estimate the parameters of these
models. While this is not generally a problem when training speaker-independent
background models, it can be more difficult when training speaker-dependent models.
Because the speaker-dependent models have a much smaller amount of data for
training, their estimation can be considerably more difficult.
Speaker adaptation is simply the process of adapting the parameters of a previously
trained set of models to a new set of observations. The adaptation process can be
considered very similar to the EM algorithms used to train models from scratch, outlined
earlier in Section 3.3.2 for GMMs and Section 3.4.4 for HMMs, except that the
initialisation of the parameters already exists in the background models.
For this thesis, the maximum a posteriori (MAP) method of adaptation will be used
to form the speaker-dependent GMM and HMM models. MAP adaptation was chosen
over the other major alternative, maximum likelihood linear regression (MLLR)
adaptation, due to the improved performance of MAP on reasonably sized adaptation
datasets [98]. This performance benefit was also verified empirically using the
XM2VTS speech recognition framework developed in Chapter 4.
3.5.1 MAP adaptation
In the standard EM algorithm, first outlined in Section 3.3.2, the aim at each iteration
is to determine a new set of parameters λ′ given a fixed initial λ:

λ′ = argmax_{λ} p(O|λ)   (3.50)
However, for MAP adaptation, the initial λ is assumed to be a random vector with a
certain distribution [98]. Additionally, there is assumed to be a correlation between the
parameter vector and the observations that led to it, such that the adaptation
observations can be used to form an inference about the final parameter vector λ. If the
prior density of the parameter vector is given as g(λ), then the MAP parameter estimate
can be given by maximising the posterior density g(λ|O) [98]:

λ′_MAP = argmax_{λ} g(λ|O)   (3.51)
       = argmax_{λ} p(O|λ) g(λ)   (3.52)
It can be seen that if the prior density in (3.52) is constant (i.e. all parameter vectors are
equally likely), MAP adaptation reduces to the standard maximum likelihood rule
shown in (3.50). By simply using the MAP estimate of the parameter vector, rather
than the simpler maximum likelihood estimate, in the EM algorithms for training
both GMMs and HMMs, MAP adaptation can be performed using the same iterative
framework.
MAP adaptation can also put additional restrictions upon which parameters can be
varied, to simplify the adaptation process and to ensure that over-fitting to the
adaptation data does not occur [194]. Typically for speech applications, and particularly
for the experiments performed within this thesis, only the means of the mixtures are
adapted to the adaptation dataset. Because of this, only the mean adaptation formulas
will be outlined below; the reader interested in the adaptation of other parameters
should refer to Lee and Gauvain's publications on the topic [97, 98].
The estimate of the new mean parameter µ̂_jm for state j and mixture m under MAP
adaptation is given as [98, 194]:

µ̂_jm = [P_jm / (P_jm + τ)] µ̄_jm + [τ / (P_jm + τ)] µ_jm   (3.53)
where τ is a weighting parameter for the a priori model parameters, and P_jm is the
occupation likelihood of the adaptation data, given by

P_jm = ∑_{n=1}^{N} ∑_{t=1}^{T_n} L^n_jm(t)   (3.54)
Here µ_jm is the prior mean parameter, and µ̄_jm is the mean of the adaptation data,
defined as

µ̄_jm = [∑_{n=1}^{N} ∑_{t=1}^{T_n} L^n_jm(t) o^n_t] / [∑_{n=1}^{N} ∑_{t=1}^{T_n} L^n_jm(t)]   (3.55)
It can be seen that if the likelihood of mixture occupation, P_jm, is small, then the effect
of the MAP adaptation in (3.53) will be relatively minor, whereas the mixtures that are
better represented by the new observations will have the largest change in their mean
parameters [194].
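The interpolation in (3.53) can be sketched as follows. This is a minimal NumPy illustration; the occupation count, τ and the means are invented for the example:

```python
import numpy as np

def map_adapt_mean(prior_mean, adapt_mean, occupation, tau):
    """MAP mean update, eq. (3.53): interpolate between the adaptation-data
    mean and the prior mean according to the occupation likelihood P_jm."""
    w = occupation / (occupation + tau)
    return w * adapt_mean + (1.0 - w) * prior_mean

prior = np.array([0.0, 0.0])
adapted = map_adapt_mean(prior, np.array([2.0, 2.0]),
                         occupation=10.0, tau=10.0)
# With P_jm equal to tau, the new mean lies halfway between the prior
# and the adaptation-data mean; with P_jm = 0 the prior is kept unchanged.
```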
3.6 Chapter summary
This chapter has provided a summary of existing techniques for training and adapting
the models suitable for audio and visual speech processing. An introduction to classifier
theory presented the theoretically optimal Bayes classifier, followed by a summary of
non-parametric and parametric methods of estimating classifiers from training data.
The first of the two main classifier types used in this thesis, the GMM classifier, was
then introduced. GMM classifiers represent observations as a collection of multi-variate
Gaussian distributions in the feature-representation space. A number of design decisions
in representing the parameters of GMMs were discussed, followed by an implementation
of the maximum likelihood EM algorithm for estimating these same parameters.
The second main classifier introduced in this chapter was the HMM, which is used to
chain together a number of assumed static states in a temporal structure. Each of the
states can then be modelled with a single GMM, allowing the HMM to model both
the static and dynamic nature of human speech. The Viterbi decoding algorithm was
introduced, as well as the Viterbi and Baum-Welch EM-based training algorithms for
determining HMM parameters.
Finally in this chapter, the process of MAP speaker adaptation was presented to
demonstrate how speaker-specific models can be trained even when there is comparatively
little speaker-specific training data available, by adapting the speaker-specific models
from speaker-independent models that can be trained over a much larger dataset.
Chapter 4
Speech and Speaker Recognition
Framework
4.1 Introduction
Automatic speech and speaker recognition can be considered to be two highly related
activities, and many methods and techniques are common to both. In this chapter,
these two tasks will be clearly defined, and a protocol will be developed to allow the
performance of both to be evaluated on the same datasets, sharing models and techniques
where appropriate. This protocol will be developed such that it can be used across a
wide range of modelling techniques and features, and will serve as the basis for all of
the subsequent experiments in this thesis.
The chapter will begin with a separate study of the two speech processing tasks, giving
an overview of the methods and techniques involved in each. Finally, a
novel framework will be developed, based on the XM2VTS database, that can be used
to evaluate both speech and speaker recognition performance using the same training
process.
Figure 4.1: A typical speech recognition system, outlining both the training of speech models and testing using these models.
4.2 Speech recognition
Speech recognition is the process of converting human speech into a sequence of
words through a computer algorithm. A broad overview of a typical speech recognition
system is shown in Figure 4.1. Before the system can be used to transcribe unknown
speech, models must first be trained to recognise words based upon a training
dataset. The trained models can then be used with a testing dataset to evaluate the
system's performance. These datasets will typically be subsets of the same database,
and could contain audio or video features, or possibly a fusion of the two.
By aligning the training data and transcriptions, speech models can then be trained for
the words or sub-words present in the transcriptions. The choice of speech event model
used for speech recognition is generally dictated by the size of the vocabulary the
system is intended to accurately recognise. Systems that need to recognise a wide
range of words are typically best modelled using sub-words such as phonemes or
triphones, whereas systems that only need to recognise a limited subset of words can
achieve better performance by modelling each word individually. For this thesis, word
models will be used in all experiments, as the vocabulary will be limited to English
digits, as will be discussed later in Section 4.4.
Figure 4.2: Speaker-dependent speech recognition can be impractical for some applications.
4.2.1 Speaker dependency
Speech recognition systems can be designed to work well for all speakers, or they can
be trained for the use of a particular speaker. These two options are generally referred
to as speaker independent (SI) and speaker dependent (SD) speech recognition, respectively.
The benefit of limiting recognition to a single speaker is that performance can generally
be increased by an order of magnitude, as the variations between speakers are not
an issue. However, SD systems tend to have scaling problems, as each speaker would
need their own set of speech models, which could be prohibitive for many
applications, as illustrated in Figure 4.2. Systems intended for individual use, such as
desktop computer speech recognition software, would be relatively easy to use in a
SD manner, but users would likely still expect them to work adequately out of the
box, and improve with individual training.
To have adequately trained speech models, there should be many examples of each
word or sub-word in the training set to ensure that each model can be adequately
discriminated during decoding. This can easily add up to a very large amount of
data, especially if the intended vocabulary is large. While SI systems can collect this
over a range of speakers, SD systems obviously have much less training data available
unless the users are very cooperative (again, see Figure 4.2).
To help alleviate data shortage problems with SD speech recognition, the SD speech
models are typically trained through an adaptation process from SI speech models trained
on a set of speakers independent of the intended speaker. This adaptation process
consists of attempting to translate the variances existing between and within a set of SI
speech models onto a specific speaker to create a set of SD speech models. As the base
models for adaptation are speaker independent, they can be trained over a much wider
variety of speech events than any SD speech models could be, and the adaptation
process should keep much of this variety while better modelling the intended speaker's
speech.
4.2.2 Speech decoding
The performance of a speech recognition system is evaluated by comparing the
estimated speech transcription with a known transcription of the speech event. The
methods of comparing the two transcriptions and arriving at a performance measure
differ depending upon how the speech is decoded, and can be split into systems that
recognise either isolated words or continuous speech.
Isolated word speech recognition
Isolated word speech recognition systems are designed and intended, as the name
suggests, to recognise only a single word at a time. Because this model of speech
recognition requires that the word boundaries be known, a word-segmentation
front-end is required to perform this task before actual speech recognition can occur.
Therefore, the sole task of the speech recogniser is to determine what the words
are within the predefined boundaries. By reducing the freedom of the decoder in this
manner, an isolated word speech recognition system can commonly outperform a
continuous speech system in controlled conditions. As the word boundaries have already
been defined, and presumably match the known transcriptions, isolated word recognition
performance can be measured easily as a percentage of correctly (or incorrectly)
guessed words.
Figure 4.3: An example of a possible voice-dialling speech grammar for continuous speech recognition. Adapted from [194].
The need to determine the word boundaries separately from the speech recognition
task limits the application of isolated word recognition systems in real-world situations.
Many word-segmentation algorithms work well on speech with pauses between
words, but the performance degrades significantly when recognising natural continuous
speech. Continuous speech is more readily recognised using classifiers that can
handle transitions between words automatically.
Continuous speech recognition
Continuous speech recognition systems combine the word-segmentation and word-classification
tasks into a single set of models. Rather than segmenting word boundaries
before attempting to recognise the words, continuous speech recognition systems
are designed to take a multi-word utterance and attempt to create a transcription
outlining both the words spoken and the boundaries between them.

Before a continuous speech recognition system can be used, a recognition grammar must
be created. The recognition grammar is a definition of the allowable paths through a
speech network that the speech recogniser can take. Provided that the actual speech
does fit the grammar, this can greatly help the speech recogniser when compared to an
exhaustive dictionary search for every word uttered. An example of such a grammar
for a voice-dialling application is shown in Figure 4.3. Once the grammar has been
defined, the speech recogniser's task is to use the speech models to determine the most
likely path through a speech network formed by joining the models as defined in the
grammar.
Once the speech recogniser has generated a transcription of the input speech, the performance of the recogniser is evaluated by comparing the output transcription with a known reference transcription. The two transcriptions are first aligned by performing an optimal string match, without consideration of any actual timing information in the transcriptions. The two transcriptions can differ through three main types of errors: insertions, substitutions and deletions, and the transcriptions are aligned by assigning costs to these errors and minimising the sum over the entire transcriptions. Once the transcriptions are aligned, speech recognition performance
is measured in terms of the differences between the two transcriptions. Typically this
speech recognition performance is expressed in terms of an accuracy of the form [194]:

    Accuracy = (H − I)/N × 100%    (4.1)

where H is the number of matching words in the two transcriptions, I is the number of words incorrectly inserted into the estimated transcription (as compared to the reference) and N is the total number of words in the known transcription. Many publications alternatively report a word error rate (WER), which is simply the complement of the accuracy:

    WER = 100% − Accuracy = (1 − (H − I)/N) × 100%    (4.2)
While either of these metrics can be calculated for a single test sequence, calculating them over a large testing set gives a good idea of the speech recognition system's overall performance.
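As a concrete illustration, the alignment and scoring described above can be sketched in a few lines of Python. This is a minimal equal-cost edit-distance version; HTK's HResults assigns particular penalties to each error type, so counts may differ slightly on ambiguous alignments:

```python
def align_and_score(ref, hyp):
    """Word-level edit-distance alignment between a reference and a
    hypothesis transcription, returning (accuracy %, WER %)."""
    R, Y = len(ref), len(hyp)
    # d[i][j] = minimum edits to align ref[:i] with hyp[:j]
    d = [[0] * (Y + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i                  # all deletions
    for j in range(Y + 1):
        d[0][j] = j                  # all insertions
    for i in range(1, R + 1):
        for j in range(1, Y + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # Backtrack to count hits (H) and insertions (I)
    i, j, hits, ins = R, Y, 0, 0
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            hits += ref[i - 1] == hyp[j - 1]
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ins += 1
            j -= 1
        else:
            i -= 1
    accuracy = 100.0 * (hits - ins) / R      # equation (4.1)
    return accuracy, 100.0 - accuracy        # equation (4.2)

acc, wer = align_and_score("one two three four".split(),
                           "one two oh three".split())
```

For the example above, two of the four reference words survive the alignment with no insertions, giving an accuracy (and WER) of 50%.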
Figure 4.4: A typical automatic speaker recognition system, outlining both the training of speaker models and testing using these models.
4.3 Speaker recognition
As a research area, speaker recognition covers the use of human speech as a biometric
to identify or verify a speaking person. A broad overview of a typical speaker recogni-
tion system is shown in Figure 4.4. Comparing this system and the speech recognition
system shown in Figure 4.1, it can be seen that there are many similarities between
the two designs. Rather than training models to recognise speech events, for speaker
recognition, models are trained to recognise each speaker based on a training dataset
accompanied by speaker identities for each sequence. This section will cover a broad
overview of speaker recognition systems in comparison to speech recognition systems
as described in Section 4.2.
4.3.1 Text dependency
Speaker recognition can be performed either when the system knows, or can dictate,
what is being said, or alternatively when the speaker is free to say what they like.
These two design choices are referred to as text dependent (TD) and text independent
(TI) speaker recognition respectively. Because of the more limited nature of the task, TD systems can generally outperform TI systems, but the limitation of having to know the text of the utterance puts severe limits on the practical uses of TD speaker recognition. For example, TD speaker recognition would not be usable in a situation where the speaker is uncooperative, such as surveillance, where TI recognition would be more suitable. TD speaker recognition is better suited to controlled circumstances, such as allowing entry to a secure computer, where the user would be fully cooperative.
Ideally TI speaker recognition could be used in all circumstances, and as such it can
be considered the ‘holy grail’ of speaker recognition. However, it does require a wide
variety of speech to train up models that can recognise speakers reliably regardless of
what is being said. To this end, databases used to design TI speaker recognition must
have a similarly large vocabulary to adequately train and evaluate TI speaker recog-
nition models. While this has become less of a problem in audio speaker recognition
research with large speech databases available such as the Wall Street Journal [134] or
TIMIT [79] corpora, audio-visual databases suitable for speaker recognition are much
thinner on the ground, and the only existing one with a large vocabulary is unavail-
able outside of its parent organisation [126]. For this reason, the speaker recognition
research in this thesis will focus on the TD case, using the small vocabulary database
XM2VTS which will be discussed in more detail in Section 2.4.2.
For TI speaker recognition systems, the speaker models are trained to have a single
model for each speaker over the entire vocabulary available, whereas for TD speaker
recognition, the speaker models are trained for a specific speech event. These TD
systems can be further divided into pass-phrase based systems or prompted text sys-
tems, based upon whether the person being recognised says the same phrase every
time or if they are prompted with a different phrase at each use, respectively. As the
prompt can be different for every use, prompted text speaker models are generated
from speaker-dependent word models concatenated together to match the prompted
text. Pass-phrase systems can be modelled using a single speech model for the en-
tire prompted phrase, but more flexibility can be obtained by modelling each word or
sub-word separately in a similar manner to prompted text systems. Such an approach
will form the basis of the TD speaker recognition experiments in this thesis.
4.3.2 Background adaptation
Background adaptation is used in speaker recognition to generate speaker-dependent models from background models. These speaker-dependent models can then be used to model the speakers. The benefit of adaptation over training speaker models directly on the speakers is that, by starting with models trained on the large background set, the final speaker models can cope better with a large variety of speech than speaker-trained models. Additionally, there may simply not be enough data for training directly on a specific speaker, whereas adaptation can provide good performance with a limited set of speaker-specific data.
Generally, two main types of differences arise between speech events relevant to speaker recognition, broadly summarised as between-speaker and within-speaker differences. The between-speaker differences are obviously more important to recognising individual speakers, but the models must also be robust to within-speaker differences, which are generally related to varying speech events. As it is trained over a wide variety of speakers and speech, the background model can serve as a baseline that averages out the within-speaker differences, with the remaining speaker-specific characteristics then incorporated through adaptation to each speaker to form the final speaker models.
The adaptation of the speaker models is very similar to the adaptation of SD speech models. Background models are trained for each word, and then individually adapted based on occurrences of the word within the training speech for a particular speaker.
While ideally the set of speakers used to train the background models should be sep-
arate from the speakers intended to be recognised by the system, it may be difficult to
achieve this due to the large amount of data required. As a compromise when there
is a limited amount of data available, as in audio-visual speech, the background mod-
els are commonly trained over all intended speakers and then adapted to the specific
speaker.
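The adaptation step can be illustrated with a simplified sketch of the common relevance-MAP recipe for Gaussian means. The function name, shapes, and relevance factor r below are illustrative assumptions, not the thesis's exact recipe (the experiments use HTK's adaptation tools on full HMMs):

```python
import numpy as np

def map_adapt_means(bg_means, frames, posteriors, r=16.0):
    """Relevance-MAP adaptation of background Gaussian means toward a
    speaker's data (a simplified sketch of the usual GMM/HMM recipe).

    bg_means   : (M, D) background component means
    frames     : (T, D) speaker feature vectors
    posteriors : (T, M) component occupation probabilities per frame
    r          : relevance factor controlling adaptation strength
    """
    n = posteriors.sum(axis=0)                       # (M,) soft counts
    ex = posteriors.T @ frames                       # (M, D) weighted sums
    speaker_means = ex / np.maximum(n, 1e-10)[:, None]
    # Components with much data move toward the speaker; components
    # with little data stay close to the background model.
    alpha = (n / (n + r))[:, None]
    return alpha * speaker_means + (1 - alpha) * bg_means

adapted = map_adapt_means(np.zeros((1, 1)),           # background mean 0
                          np.ones((16, 1)),           # speaker data at 1
                          np.ones((16, 1)), r=16.0)
```

With 16 frames and a relevance factor of 16, the adapted mean sits exactly halfway between the background mean and the speaker's data, showing how limited speaker data yields conservative adaptation.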
4.3.3 Evaluating speaker recognition performance
Speaker recognition systems can be subdivided into speaker identification or speaker verification. Identification systems are designed to choose the identity of a speaker from a list of known speakers, possibly including the choice of ‘none of the above’. Alternatively, in a verification system, the speaker claims a particular identity in some manner and the system must decide whether to accept or reject the speaker’s claim. This section will outline these two models of speaker recognition and how performance is evaluated in both.
Speaker identification
In a speaker identification system, the speech utterance is compared against all speaker
models available to the system. The scores returned by each of these models are then
used to rank the speakers according to which is most likely to have spoken the ut-
terance. Speaker identification can either be closed-set or open-set, depending upon
whether the possibility exists that a unknown speakermay be presented to the system.
For open-set speaker identification, for which this possibility does exist, an additional
background model may be required to serve as a threshold to catch out-of-set speak-
ers.
The output of a speaker identification system is typically provided in the form of a
top N list of most-likely speakers for a given utterance. Over a large number of test
utterances, the speaker identification performance can be measured as the number of
times the correct speaker appeared within the top N.
One of the major drawbacks of speaker identification is that every utterance must
be tested against all possible speaker models. While this may be practical in some circumstances, such as spotting a small number of suspicious people in surveillance video, the time taken to test each model can be a limitation in others.
Speaker verification
Instead of having to choose a speaker’s identity as in speaker identification, speaker
verification only requires the system to verify a speaker’s claimed identity. To verify
the speaker’s claim, only one speaker model is consulted (i.e. the claimed speaker) as
compared to all speaker models for an identification task.
Speaker verification systems are typically designed to generate a score that represents
the likelihood of the claimed speaker being the same as the speaker who produced the
utterance. This verification score is calculated using the claimed speaker’s models,
from which the background speaker model’s score is then subtracted to normalise the
verification score for the length of the utterance and environmental factors. Finally
the normalised score can then be compared against a threshold to decide whether to
accept or reject the speaker’s claim.
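A minimal sketch of this scoring rule follows; the function and variable names are illustrative, and the division by the number of frames is one common way of normalising the score for utterance length:

```python
def verify(claim_loglik, background_loglik, n_frames, threshold):
    """Background-normalised verification score: the average per-frame
    log-likelihood difference between the claimed speaker's models and
    the background models, compared against a decision threshold."""
    score = (claim_loglik - background_loglik) / n_frames
    return score, score >= threshold

# An utterance scoring well above the background supports the claim.
score, accepted = verify(claim_loglik=-100.0, background_loglik=-120.0,
                         n_frames=10, threshold=1.5)
```

Subtracting log-likelihoods is equivalent to taking a likelihood ratio, so the threshold directly trades off the two error types discussed below.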
To evaluate verification systems completely, both speakers who correctly claim their
identity and speakers claiming another identity, referred to as clients and impostors, are
used to test the system. This evaluation is based upon the rate of verification errors,
of which there are two types: misses, and false alarms. Misses occur when a client is
incorrectly rejected, and false alarms occur when impostors are incorrectly accepted.
These two types of errors can be considered to be in opposition, and lowering one will
cause the other to rise.
The process of trading off these two errors comes about when choosing the accept/reject decision threshold. If the threshold is low, few misses will occur, but there will be many false alarms. Correspondingly, a high threshold will cause more misses and fewer false alarms. These choices can be illustrated through the use of detection error
[DET plot: Miss probability (in %) against False Alarm probability (in %), both on logarithmic axes from 0.1 to 10, with curves for System 1 and System 2.]
Figure 4.5: An example of a DET plot comparing two systems for speaker verification.
trade-off (DET) [112] plots, showing the two error rates at each possible operating point
of a system. DET plots, of which an example is shown in Figure 4.5, can be used to
succinctly illustrate the relative performance of a number of verification systems on a
single plot.
As the axes of the DET plot represent errors in the verification system, the best performance is obtained as the results move towards the bottom left of the plot. The dashed line represents the point at which the false alarms and misses are equal, which is referred to as the equal error rate (EER). This point can serve as one possible summary of a DET curve when multiple systems are compared.
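As an illustration, the EER can be estimated from sets of client and impostor scores by sweeping the decision threshold. This is a simple sketch; practical evaluations interpolate the DET curve more carefully between operating points:

```python
def equal_error_rate(client_scores, impostor_scores):
    """Sweep the threshold over all observed scores and return the
    error rate at the point where the miss rate and false-alarm rate
    are closest to equal."""
    best_gap, eer = float("inf"), None
    for t in sorted(client_scores + impostor_scores):
        miss = sum(s < t for s in client_scores) / len(client_scores)
        fa = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        if abs(miss - fa) < best_gap:
            best_gap, eer = abs(miss - fa), (miss + fa) / 2
    return eer

# Overlapping client/impostor score distributions give a non-zero EER.
eer = equal_error_rate([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.0, 3.0])
```

For the toy scores above, a threshold of 3 rejects one of four clients and accepts one of four impostors, giving an EER of 25%.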
For speaker recognition research, speaker verification is generally considered more
useful for two main reasons:
• Speaker verification requires fewer models to test (only the claimed speaker and background)
• DET plots provide a more detailed look at performance than a single top N
speaker identification percentage
For real-world applications speaker verification is limited by the need for an identity claim. This identity claim may consist of a PIN number or identity card, but both would obviously require the cooperation of the speaker. For this reason verification is better suited to security access applications, rather than surveillance, where identification would be more suitable (provided that only a limited number of identities are being considered).
4.4 Speech processing framework†
For the experimental work in this thesis, a comprehensive framework was required to
meet the following criteria:
1. Can be used for both SD and SI speech recognition experiments,
2. Can be used for TD speaker verification experiments, and
3. Allows models to be re-used where possible
By examining the modelling requirements of both speech and speaker recognition presented earlier in this chapter, it can be seen that many of the requirements are common
to both methods. The same models used for SI speech recognition can also serve as
the background models for speaker adaptation and for performing score normalisa-
tion in speaker verification. In a similar manner, the adapted speech models used
for SD speech recognition can also serve as the TD speaker verification models. An
overview of the speech processing framework used in this thesis, which combines the
two speech processing tasks in this manner is shown in Figure 4.6. After examining
the data segmentation requirements of this framework, each of the sections of this
framework will be examined in more detail in the following subsections.
Figure 4.6: Overview of the speech processing framework used in this thesis.
4.4.1 Training and testing datasets
For the experimental work performed in this thesis, this framework will be implemented on the digits section of the XM2VTS database (see Section 2.4.2). It was decided not to use the phonetically balanced phrase (“Joe took father’s green shoe bench out”) as each word in the phrase was only represented half as often as each of the digits. As a result, this framework is based on two shots of two 10-digit strings for each of the 4 sessions in the XM2VTS database, for a total of 4720 (2 × 2 × 4 × 295) repeats of each digit over the entire database.
As can be seen from Figure 4.6, implementing this framework requires that the database
be divided into five datasets:
• client training,
• client evaluation,
• client testing,
• impostor evaluation, and
• impostor testing
For this framework it was decided to use the same dataset for background training and adaptation due to the limitation of the relatively small number of speakers available compared to audio speech processing databases. Additionally, this configuration allowed the framework to stay as close to the existing XM2VTS protocol [107] as possible, which did not define a background set.
While the evaluation and testing datasets are separate within the database, they are
both treated exactly the same within the framework. The evaluation datasets are used
to test and tune the speech processing algorithms and to enable parameters of the
modelling and recognition to be estimated. Once these parameters have been deter-
mined, the speech processing algorithms can be re-run on the testing dataset to report
the final speech or speaker recognition performance.
The first split of the database for this protocol was over the speakers to create the
client and impostor sets. This split was performed in the same manner as the existing
XM2VTS protocol [107], with 200 client and 95 impostor speakers. Client speakers are used in the framework to train the background models, as well as to adapt the speaker models. For testing and evaluation, these speakers both test SD speech recognition and serve as the clients for speaker verification. The impostor speakers are not involved in training at all, but are used solely in testing both SI speech recognition and challenging speaker verification with unknown speakers.
Table 4.1: Configurations of the XM2VTS clients possible under this framework.
As the client speakers are used for both training and testing/evaluation, a further split
had to be made over the XM2VTS sessions defining which sessions are to be used to
train the models and which are used for testing and evaluation. For this framework,
the XM2VTS protocol’s Configuration II (Figure 2.4(b)) was chosen as the basis, with
two sessions for the client training and one session each for evaluation and testing.
However, to allow for a larger number of experiments to be run than is possible under
the XM2VTS protocol, 12 configurations of 2-train/1-test/1-evaluation were defined,
shown in Table 4.1. As can be seen in the table, this configuration resulted in 6 distinct training partitions, for which the evaluation and testing partitions can be swapped to produce the 12 configurations. While it is not necessary to run all 12 configurations within the framework, using more configurations allows for more speech processing experiments, and greater confidence in the performance measures reported.
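The 12 configurations can be enumerated mechanically. The sketch below is illustrative rather than part of the original protocol definition, but it shows how the 6 training pairs and the evaluation/testing swap combine:

```python
from itertools import combinations, permutations

def session_configurations(sessions=(1, 2, 3, 4)):
    """Enumerate 2-train/1-evaluation/1-test splits over the four
    XM2VTS sessions: C(4,2) = 6 training pairs, each with the two
    remaining sessions swapped between evaluation and testing."""
    configs = []
    for train in combinations(sessions, 2):
        rest = [s for s in sessions if s not in train]
        for evaluation, test in permutations(rest, 2):
            configs.append({"train": train, "eval": evaluation, "test": test})
    return configs

configs = session_configurations()
```

Enumerating this way yields exactly the 12 configurations of Table 4.1: 6 choices of training pair, each with the two remaining sessions in either order.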
Before the XM2VTS database could be used to train up speech-based models, a transcription of the database had to be obtained. While the database was not supplied with a time-aligned transcription, the textual contents of each shot were clearly indicated. By using external speech models trained on the large Wall Street Journal [134] audio speech corpus to estimate the boundaries of the known transcriptions, a good estimate of the time-aligned speech transcriptions was obtained. These transcriptions were used both to train and test the speech models within this framework.
4.4.2 Background training
Within this framework, the background speech models are trained from the client training dataset. These background speech models are used for SI speech recognition and, in speaker verification, to normalise the claimed-model scores. Additionally, these background models serve as the base models for adaptation of the speaker-dependent speech models.
As the vocabulary of this dataset is very small, with only the 10 English digits in use, no benefit was found in modelling below the word level. Accordingly, it was decided to form the background models over entire words, resulting in 10 background word models and 1 background silence model. By synchronising the client training transcriptions with the audio, video or fused features intended for modelling in the client dataset, the features corresponding to each word can be separated and combined over all client speakers to form each of the background models.
4.4.3 Speaker adaptation
The adaptation of the speaker models is also performed on the client training dataset. These adapted speech models are used for the SD speech recognition, and also form the speaker models for speaker verification.
Speaker adaptation is performed in a very similar manner to background training. Instead of training models over the entire client speaker set, adaptation is performed by taking a particular background model and adapting it to all of the matching speech events for one particular speaker. This adaptation process is performed for each of the background models and for each speaker. The result of the speaker adaptation is that each of the 11 speech models (10 digits and silence) has an adapted form for each of the 200 client speakers within the framework.
4.4.4 Speech recognition
Both SD and SI speech recognition can be accomplished, based upon the choice of
models and testing data used to perform the speech recognition experiments. SI
Figure 4.7: Word recognition grammar used in this framework.
speech recognition is performed by testing the background models against all utterances by the ‘impostor’ speakers. SD speech recognition is performed by testing each
speaker’s adapted speech models against their utterances in the client evaluation or
testing dataset, and the final performance is reported over all client speakers. The
specific details of how HMM-based modelling techniques will be used to recognise
speech will be covered in more detail in later chapters.
Other than the choice of models and datasets, both SI and SD speech recognition are tested in an identical manner, through the evaluation of continuous speech recognition. Because
the vocabulary tested is very small (English digits) the grammar used for recognition
is a simple word loop with silences, as shown in Figure 4.7. Once the estimated tran-
scriptions have been generated, performance is calculated as a WER by comparison
against the actual testing transcriptions supplied with the database.
4.4.5 Speaker verification
For this thesis, it was decided to concentrate on speaker verification, as it has the benefit that fewer comparisons are required between speaker models, and additionally it provides for finer-grained differentiation between verification systems than the single rank-N correct percentage obtained from speaker identification experiments.
To get a reasonable idea of the speaker verification performance, each client’s models
are compared against a number of impostor shots as well as the matching client shots.
For each of the 400 client shots in the client testing (or evaluation) dataset, 20 im-
postor shots are randomly selected from the impostor dataset and used to attempt to
gain entrance while claiming to be the client. Because every speaker in the XM2VTS
database speaks the exact same phrase, there were no issues with finding identical
client/impostor phrases to test the speaker verification system.
TD speaker verification can be performed by choosing the appropriate speaker adapted
models. As the phrase spoken by the clients or impostors was known and identical
over all speakers, the speaker recognition model was formed by concatenating the
claimed client word models together to form the known phrase spoken. The back-
ground speech models used for normalisation are similarly formed by concatenating
the background speech models. Speaker verification performance will be presented
as DET plots under this framework, although they may be simplified to a single EER
measure when space considerations dictate.
4.5 Acoustic and visual conditions
In addition to conducting the speech processing experiments in clean conditions, the
acoustic modality was also corrupted with a source of background office babble [176] at signal-to-noise ratios (SNRs) of 0, 6, 12 and 18 decibels (dB) to investigate the robustness of the recognition experiments to train/test mismatch. Visual degradation
was also considered for this thesis, but was not included due to time constraints and
the difficulties of simulating real-world sources of visual degradation [70].
All training and adaptation was performed using the clean data, while the final evaluation of the speech and speaker recognition systems can be conducted over the clean and
noisy conditions. Finally, when the speech processing experiments are considered in
noisy conditions, only the noisy conditions are presented as the 18 dB SNR acoustic
data was largely indistinguishable from the clean data.
4.6 Chapter summary
This chapter has provided a summary of both speech and speaker recognition in
audio-visual environments. A novel framework was presented that can be used to
perform both speech and speaker recognition on the XM2VTS database.
By examining both speech and speaker recognition in detail, this chapter has shown
that many of the models and techniques are common to both recognition paradigms.
Taking advantage of these commonalities, the framework developed in this chapter
can be used to test both speech and speaker recognition using a single training pro-
cess. By having a single set of models that can be used both to recognise speech and
speakers, the similarities and differences between the two speech processing tasks can
be examined throughout this thesis without having to be concerned about differences
in training for either task.
Chapter 5
Feature Extraction
5.1 Introduction
The aim of feature extraction is to convert the raw observations into a concise set of
features suitable for classification. In ideal circumstances, the feature extraction should
be able to divide the observations into distinct, non-overlapping regions in a multi-
dimensional feature space for each class under consideration, such that the job of the
classifier is trivial. However, this does not typically happen in real-world scenarios,
so the aim of feature extraction is generally to reduce the number of features used for
classification, whilst still maintaining good separation between classes, and providing
some measure of invariance to changes in observations within the groups chosen for
classification. This chapter will conduct a review of the existing literature in audio
and visual feature extraction, with particular focus on the extraction of visual speech
features.
Recently, video features designed to emphasise the temporal nature of human speech have been implemented and have shown much better performance than static features for speech recognition. This chapter will outline a particular implementation using a cascade of appearance-based feature extraction techniques to form a dynamic representation that has been shown to work well for speech recognition applications.
However, such features have not been examined in detail for speaker recognition to
date, and as the goals of the speech and speaker recognition applications under the
speech processing banner are quite distinct, it stands to reason that the features best
suited to each task may not match. To this end, a novel study will be presented of
various video feature representations for both speech recognition and speaker iden-
tification. The models and the performance obtained using them will also serve as a
baseline for future experiments in this thesis.
5.2 Acoustic feature extraction
5.2.1 Introduction
Extraction of suitable feature vectors for speech processing applications from acoustic
signals is a very mature area of research [34, 132, 157, 154] and, as such, will not be
covered in any great detail in this thesis. This section will give a brief overview of the main concepts of acoustic feature extraction, followed by a similarly brief introduction to MFCC and PLP-based acoustic feature extraction techniques, which will both be examined as acoustic features experimentally throughout this thesis. All acoustic feature extraction for this thesis was performed with the HMM Toolkit (HTK), and more information on this topic is available in Chapter 5 of The HTK Book [194].
Like all feature extraction techniques, the aim of acoustic feature extraction is to form
a concise representation of the relevant features, while providing invariance to irrel-
evant features of the input acoustic signal. The relevancy of these features is eval-
uated in terms of the intended application of the classifier. For example, a speaker-
independent speech recognition application would be interested in changes relevant
to differing words, and unconcerned about changes due to differing speakers, but for
a speaker identification application the opposite may apply.
The process of acoustic feature extraction can be divided into pre-processing, filter-
bank analysis, and the extraction of features from the accumulated filter banks. These
features can then also be augmented with energy and time derivative features. The
details of these processes will be covered in the following subsections.
5.2.2 Pre-processing
Acoustic speech is a naturally varying continuous signal with the characteristics of
the signal varying considerably over time. To be intelligible, the raw speech signal
is generally recorded at a sampling rate of at least 8 kHz, the standard for telephone
speech. However, even at the low quality of telephone speech, the (comparatively low) sampling rate employed still results in too many features for most classifiers to reliably handle. Fortunately, the variation in the speech signal can be considered slow enough that its statistics are quasi-stationary over segments of up to 100 milliseconds [64], and most acoustic feature extraction occurs over windows of the acoustic signal of up to that length.
For the experiments conducted in this thesis a Hamming window function is used to divide the incoming acoustic speech signals into 25-millisecond windows every 10 milliseconds, resulting in 100 speech feature vectors extracted for every second of speech. Before windowing, a pre-emphasis function was used to flatten the frequency characteristics of the speech signal, to compensate for the tendency of speech to have most of its energy in the low frequencies [64, 194].
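A minimal sketch of this pre-processing stage follows. The pre-emphasis coefficient of 0.97 is a common default rather than a value specified in the thesis:

```python
import numpy as np

def frame_signal(signal, rate, win_ms=25.0, step_ms=10.0, preemph=0.97):
    """Pre-emphasise a speech waveform and slice it into overlapping
    Hamming-windowed frames (25 ms windows every 10 ms by default)."""
    # First-order pre-emphasis flattens the spectral tilt of speech
    emphasised = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    win = int(rate * win_ms / 1000)
    step = int(rate * step_ms / 1000)
    n_frames = 1 + (len(emphasised) - win) // step
    frames = np.stack([emphasised[i * step : i * step + win]
                       for i in range(n_frames)])
    # The Hamming window tapers frame edges to reduce spectral leakage
    return frames * np.hamming(win)

frames = frame_signal(np.zeros(8000), rate=8000)   # one second of audio
```

One second of 8 kHz audio yields 98 frames of 200 samples each, close to the 100 feature vectors per second quoted above (the last partial window is dropped).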
5.2.3 Filter bank analysis
Once the pre-processing has divided the incoming speech signal into quasi-stationary
windows, the frequency spectrum of the speech signal within each time-window is
examined to generate the final speech features. The two most commonly chosen tech-
niques of spectrum analysis are
1. linear prediction analysis and
2. perception-based filter bank analysis.
Linear prediction analysis is based on modelling the vocal tract with an all-pole model,
whereas filter bank analysis derives from a human-perception based filter bank on the
power spectrum of the signal. For this thesis, the filter bank analysis technique was
chosen, as such features can be calculated more easily [194], whilst still performing
extremely well for speech processing tasks when compared to features derived from
linear prediction analysis [34, 157].
5.2.4 Mel frequency Cepstral coefficients
Filter bank analysis is based upon studies on the perception of speech, showing that
the human ear resolves frequencies non-linearly across the speech spectrum [194].
This non-linear behaviour can be approximated by a triangular filter bank spaced across the spectrum according to the human-perception based Mel scale [153, 194], defined by

    Mel(f) = 2595 log10(1 + f/700)    (5.1)
Once the filter bank has been calculated, the incoming speech window is transformed into the frequency domain using a fast Fourier transform (FFT) and the magnitudes in the frequency domain are binned according to the value of each filter to arrive at N weighted sums, [m_i : 1 ≤ i ≤ N], for each window. Finally, Mel-frequency cepstral coefficients (MFCC) can be calculated by taking the discrete cosine transform (DCT) of the log of those accumulated amplitudes [194],

    c_i = sqrt(2/N) ∑_{j=1}^{N} m_j cos[ (2j − 1)iπ / (2N) ]    (5.2)
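The filter-bank and DCT steps can be sketched as follows. This is a simplified illustration with assumed parameter values; HTK's implementation adds details such as cepstral liftering and energy handling:

```python
import numpy as np

def mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_from_power(power_spectrum, rate, n_filters=24, n_ceps=12):
    """Bin an FFT power spectrum into Mel-spaced triangular filters,
    then take the DCT of the log filter-bank energies (a minimal
    sketch of the MFCC pipeline)."""
    n_fft = (len(power_spectrum) - 1) * 2
    # Filter edge frequencies equally spaced on the Mel scale
    pts = inv_mel(np.linspace(0.0, mel(rate / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / rate).astype(int)
    fbank = np.zeros((n_filters, len(power_spectrum)))
    for i in range(n_filters):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, c):                  # rising edge of triangle
            fbank[i, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):                  # falling edge of triangle
            fbank[i, k] = (hi - k) / max(hi - c, 1)
    m = np.log(np.maximum(fbank @ power_spectrum, 1e-10))
    # DCT of the log filter-bank energies, as in equation (5.2)
    j = np.arange(1, n_filters + 1)
    return np.array([np.sqrt(2.0 / n_filters) *
                     np.sum(m * np.cos(np.pi * i * (2 * j - 1) /
                                       (2 * n_filters)))
                     for i in range(1, n_ceps + 1)])

ceps = mfcc_from_power(np.ones(257), rate=16000)   # flat toy spectrum
```

The 12 cepstral coefficients per frame from a sketch like this, augmented with energy and derivatives as described below, form a typical acoustic feature vector.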
5.2.5 Perceptual linear prediction
PLP features [72] are an alternative to MFCC-based feature extraction that are also
popular for acoustic speech processing tasks [157, 73]. As suggested by its name,
this method of acoustic feature extraction can be seen as an approach combining both
linear prediction analysis and the perception-based filter banks.
As implemented by the HTK Toolkit [194], and used in this thesis, PLP-based features are calculated based on the same Mel-scale filter bank as used for MFCC feature extraction. The Mel filter-bank coefficients, [m_i : 1 ≤ i ≤ N], are first weighted by an equal loudness curve and compressed by taking the cubic root. The resulting modified acoustic spectrum is then converted to cepstral coefficients in an identical manner to linear prediction analysis [194].
5.2.6 Energy and time derivative features
In addition to the spectrum-based coefficients, a number of other components can be added to each feature vector to improve the speech processing performance. The two
main types of additional features are an energy term and features calculated from time
derivatives.
The energy term is used to augment the spectral features, and is computed as the log of the signal energy. That is, for speech samples [s_t : 1 ≤ t ≤ T] corresponding to a particular audio window,

    E = log ∑_{t=1}^{T} s_t²    (5.3)
Various normalisation and adjustment techniques can be applied to this energy term,
which for this thesis, are left at the default settings of the HTK Toolkit [194].
Time derivative features are used to allow the classifiers to have limited knowledge
of the dynamic changes in the acoustic features, rather than just their values in each
particular window. Typically, both first- and second-order derivatives, referred to as
deltas and accelerations respectively, are added.

Figure 5.1: Configuration of an acoustic feature vector including the static (c_i) and energy (E) coefficients and their corresponding delta and acceleration coefficients.

These time derivative features are calculated by
d_{i,t} = [ ∑_{θ=1}^{Θ} θ (c_{i,t+θ} − c_{i,t−θ}) ] / [ 2 ∑_{θ=1}^{Θ} θ² ]   (5.4)
where d_{i,t} is the delta coefficient at time t of static feature c_i, calculated in terms of
the surrounding static coefficients (including the energy term) [c_{i,t−Θ}, …, c_{i,t+Θ}]. The
value of Θ is the window size for calculating the derivatives, which is typically set
to 2. The corresponding acceleration coefficients can be formed by using (5.4) on the
delta features instead of the static. The final feature vector for each window is then
the concatenation of the static spectral coefficients and energy with the deltas and
acceleration features as shown in Figure 5.1.
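The energy term (5.3) and the delta computation (5.4) might be sketched as follows. The replication of edge frames at utterance boundaries is an assumption (HTK-style padding); the text does not specify boundary handling.

```python
import numpy as np

def log_energy(s):
    """Log signal energy of one audio window, eq. (5.3)."""
    return np.log(np.sum(np.asarray(s, dtype=float) ** 2))

def time_derivatives(C, big_theta=2):
    """Delta coefficients per eq. (5.4) for a (T, D) array of static features.

    Edge frames are replicated so the window is defined at the boundaries
    (an assumption; the text leaves boundary handling unspecified).
    """
    T = C.shape[0]
    denom = 2 * sum(th ** 2 for th in range(1, big_theta + 1))
    P = np.pad(C, ((big_theta, big_theta), (0, 0)), mode='edge')
    d = np.zeros_like(C, dtype=float)
    for th in range(1, big_theta + 1):
        d += th * (P[big_theta + th:big_theta + th + T]
                   - P[big_theta - th:big_theta - th + T])
    return d / denom
```

Accelerations follow by applying `time_derivatives` to the deltas, and the final vector of Figure 5.1 is the horizontal concatenation of statics, energy, deltas and accelerations.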
5.3 Visual front-end
While an acoustic speech signal can be simply represented as amplitude values at the
sampling rate of the recording device, the raw visual speech signal consists of an entire
image for every video-frame collected by the camera. Although video frame-rates are
many orders of magnitude lower than acoustic sampling rates, the large amount of
information in each frame still presents problems for the extraction of visual speech
features.
The first major problem with extracting visual speech features is that in many applications,
much of every video frame is unrelated to the visual speech.

Figure 5.2: The visual feature extraction process, highlighting the visual front-end, encompassing the localisation, tracking and normalisation of the lip ROI.

It is widely agreed
that the majority of visual speech information emanates from the region around the
speaker’s mouth [95], and therefore the extraction of visual features should be primar-
ily based upon this area. The task of the visual front-end is to locate, track and normalise
this ROI over an entire video, before the visual speech feature extraction stage can oc-
cur, as shown in Figure 5.2. Some researchers also include the feature extraction step
within the visual front-end [145], but for this thesis the front-end will refer solely to
the location and normalisation of the ROI.
5.3.1 The front-end effect
Clearly the accuracy of the visual front-end will have a large effect on the final accu-
racy of the speech processing system. If the tracking of the ROI is poor, then the per-
formance of the final classifiers using the tracked features will be unreliable as they
will not always be evaluated on a consistent ROI. This is referred to as the front-end
effect, and can be expressed simply as
p (c) = p (c| f ) p ( f ) (5.5)
Where c represents the speech or speaker classifier working correctly, and f represents
the front-endworking correctly. It can be seen that an ideal front end (p ( f ) = 1) would
allow effort to be solely faced on improving the classifier performance.
5.3.2 A brief review of visual front-ends
One simple way to make p ( f ) approach 1 is to manually label the ROI [54, 86], which
is the approach taken in this thesis, as the classifier performance is of primary interest.
In a similar vein, many early AVSP applications [5, 159] made use of the DAVID [26]
database, which had blue-highlighted lips to allow a nearly trivial front-end imple-
mentation.
While manual labelling can be useful on previously captured data in research set-
tings, it is unrealistic for real world circumstances, where most speech processing
applications should work unsupervised. Automatic mouth ROI detection methods
vary considerably in the literature, but most approaches start with a broad face de-
tector and move to specific facial feature detectors to finally locate and normalise the
mouth ROI [146]. Many of the methods and techniques are shared between the face
and facial feature detectors, but there exists no single method that works well in all
circumstances.
In a survey of over 150 publications in face and facial feature detection, Yang et al. [192]
established that four broad groups of facial detection algorithms could be defined:
1. Knowledge based methods, encoding human knowledge of what makes a face.
Generally these methods are based upon the spatial relationships between facial
features.
2. Feature invariant approaches, where a bottom up approach is used to attempt
to generate features that are robust to the conditions in which the video has been
collected. Some examples are texture and chromatic based features.
3. Template matching methods, where a number of face or facial-feature templates
are stored and compared against test images to track faces.
4. Appearance based methods, where a model is generated on a set of training
face or facial features to adequately capture the variability of facial appearance.
Many different modelling techniques have been used for this purpose, including
neural networks, HMMs and support vector machines.

Figure 5.3: Manual tracking was performed by recording the eye and lip locations every 50 frames and interpolating between them.
Whilst there is little consensus on the best method of achieving a well performing vi-
sual front-end, one appearance-based method that has been developed recently and
shown good performance in implementing an AVSP visual front-end is the Viola-Jones
algorithm [179]. The Viola-Jones algorithm is a generic object detection algorithm
formed by a cascading chain of very simple features derived from intensity values
in the region being searched. A thorough review of this algorithm as a front-end for
both frontal and profile visual speech is presented by Lucey (2007) [102].
5.3.3 Manual front-end implementation
As this thesis is not concerned with the performance of the visual front-end, a manu-
ally tracked visual front-end approach was chosen to allow p ( f ) in (5.5) to approach
unity, and allow the thesis to focus on improving the classifier performance under the
assumption of a well-performing front-end.
To allow the location of the mouth ROI to be known for every frame in the database, a
volunteer was recruited to track the locations of the eyes and mouth in every 50th
frame, or approximately every 2 seconds, of each video in the XM2VTS database.
Some examples of the manually tracked frames are shown in Figure 5.3. The locations
of these points on the intermediate frames were then interpolated from these
landmark frames.

Figure 5.4: Some examples of the original and grey-scaled resized ROIs extracted from the XM2VTS database.
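The interpolation of the manually labelled points to intermediate frames can be sketched with numpy's linear interpolation; the function name and linear scheme are assumptions, as the thesis does not state the interpolation method.

```python
import numpy as np

def interpolate_points(key_frames, key_xy, num_frames):
    """Linearly interpolate manually labelled (x, y) points to every frame.

    key_frames: sorted frame indices labelled by the volunteer (every 50th frame)
    key_xy:     (len(key_frames), 2) array of (x, y) positions at those frames
    """
    t = np.arange(num_frames)
    x = np.interp(t, key_frames, key_xy[:, 0])
    y = np.interp(t, key_frames, key_xy[:, 1])
    return np.stack([x, y], axis=1)   # (num_frames, 2)
```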
Once the locations of the eyes and mouth were determined for a particular frame,
the mouth image was chosen as a 120× 80 pixel region centred on the mouth region
and with the long side parallel to a line drawn between the eyes. This image was
then grey-scaled and down-sampled to a 24× 16 region to reduce the number of raw
pixels for the feature extraction stage. The down-sampling and grey-scaling are not
expected to affect lipreading performance, based on work by Jordan and Sergeant [82]
and Potamianos et al. [149]. Some examples of lip ROIs generated from this manual
tracking process are shown in Figure 5.4.
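The grey-scaling and down-sampling step might be sketched as below. The unweighted channel mean and 5×5 block averaging are assumptions; the thesis does not specify the colour conversion or resampling filter used.

```python
import numpy as np

def normalise_roi(roi):
    """Grey-scale an 80x120 RGB mouth ROI and down-sample it to 16x24.

    Uses an unweighted channel mean and 5x5 block averaging (both are
    assumptions; the exact conversion and filter are unspecified in the text).
    """
    grey = np.asarray(roi, dtype=float).mean(axis=2)             # (80, 120)
    h, w = grey.shape
    return grey.reshape(h // 5, 5, w // 5, 5).mean(axis=(1, 3))  # (16, 24)
```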
5.4 Visual features
Once the relevant ROI has been located, feature-extraction must be performed on the
region before classification can occur. It is widely agreed in the literature [146, 25] that
extraction of visual speech features can be divided into three main categories:
1. Appearance based,
2. Contour based, and
3. A combination of both appearance and contour.
This section will summarise the techniques that have been developed for extraction
of visual speech features under these categories, noting the advantages and disadvan-
tages of each method. Finally a comparison of the three methods will be made.
5.4.1 Appearance based
Appearance-based methods are designed to take all the information available in the
ROI and produce a single feature vector based on the appearance of the ROI. These
methods generally make no assumptions about which components of the ROI are
important for speech, and allow features to be extracted from the entire ROI and not
just the lip movements. Some of these additional
visual indicators that can be relevant to visual speech are the visibility of the tongue
and teeth, as well as any visible muscle or jaw movements [172].
While the size and shape of the ROI for appearance-based features is typically a square
or rectangular region centred on the mouth region of the face [146], this does not
have to be the case. Illustrating the extremes of these approaches, some researchers
have extracted features from the entire face region [116], whereas others have used
disc-shaped regions around the lips to limit the amount of non-lip pixels being ex-
tracted [47]. Indeed, some researchers have even used temporal information to form a
three dimensional ROI [99, 141] from which the feature extraction is performed.
Within the extracted ROI the pixel values, either colour [28] or grayscale [49, 54], are
typically concatenated to form a monolithic feature vector. However, frame-to-frame
differences [164] or optical flow analysis [67, 19] have also been used to form alterna-
tive feature vectors with more of a focus on the dynamic information available in the
ROI. While some approaches have used such feature vectors directly [15, 47], the sheer
number of pixels, and therefore feature dimensions, in any reasonably sized ROI can
overwhelm the parameter estimation of many classification techniques, in an effect
commonly termed the curse of dimensionality [10].
It is therefore a primary goal of visual speech feature extraction that the number of
features available to the classifier stage can be reduced whilst still maintaining the
discriminative power of the remaining features. This is a similar goal to that required
in face recognition, and therefore many visual speech feature extraction techniques
mirror their earlier counterparts in face recognition research. The earliest example
of this propagation from face recognition to AVSP was Bregler and Konig’s work on
eigenlips [16], which was closely based on Turk and Pentland’s groundbreaking face
recognition paper introducing the principal component analysis (PCA) based feature
extraction technique eigenfaces [175]. Since its introduction, PCA-based feature extraction
has become one of the most common feature extraction techniques used in AVSP
research [47, 99, 67, 86, 106, 146].
However, one of the problems with the PCA approach is that a corpus of represen-
tative eigen-ROIs must first be generated before an unseen ROI can be projected into
the eigen-ROI feature space. By comparison a number of linear image transformation
techniques such as discrete wavelet transforms (DWTs) [45, 140, 142] and discrete co-
sine transforms (DCTs) [46, 47, 5, 126, 164, 71, 54] have been used to remove redundant
information in the ROI. The higher energy components can then be extracted from the
transformed image to form a compact feature representation. In particular DCT fea-
ture extraction has been shown to perform as well as PCA techniques for most AVSP
tasks with the benefit of not requiring training to establish the feature-space [67, 147].
The feature-reduction methods presented above have been shown to perform well
for AVSP tasks. However, these methods simply produce a compact representation
of the entire ROI, and are blind to the classification of the reduced features. While
these algorithms have shown good ability to discriminate between speech or speaker
classes, they were not designed with such a discriminative ability in mind.
To create a better feature representation, the linear discriminant analysis (LDA) algorithm
[48] can be used to first map visual data to specific classes, and well-discriminating
features can be extracted with the mapped classes in mind.

Figure 5.5: Contour-based feature extraction uses the geometry of the lip region as the basis of feature extraction.

LDA was first proposed
for such an extraction of visual speech features by Duchnowski et al. [47], where the
LDA step was performed on the raw pixels in the ROI. However, the LDA transfor-
mation matrix can easily get too large to calculate using this approach, and is com-
monly performed as a stage in a cascade beginning with an earlier PCA [100, 57] or
DCT [145, 21] feature extraction stage. Indeed, Potamianos et al. [144] found that an
additional maximum likelihood linear transform (MLLT) stage after the LDA
improved speech-reading performance further still.
5.4.2 Contour based
The choice of contour- or geometry-based visual speech features over appearance-
based comes from a desire to represent the visual speech directly as the position of the
visual articulators during the speech event. Because the articulators are represented
based on their position relative to the face, contour-based features require that the
articulators first be located within the wider ROI before the features
can be extracted. For contour-based feature extraction, only the positions of a
relatively small number of points are desired, compared to the large number of pix-
els in the ROI for appearance-based methods. This combination of minimal features,
recorded directly from the positions of the visible articulators,
should provide good performance for AVSP tasks. However, contour extraction does
require further localisation and tracking of the articulators, and this secondary front-
end effect (after the front-end effect in ROI tracking) can have a major effect on AVSP
performance.
Many systems use contour-based features taken directly from geometric measures
based on the tracked points, such as the height, width, perimeter or area of the inner
and/or outer lip contours [136, 109, 142, 5, 159, 87]. In addition to the lips, the presence
or absence of the tongue and teeth [198, 63] can also be used as features, particularly as
many visemes are identical but for such features. Alternatively, the location of the vis-
ible articulators can be modelled and the features can be obtained from the parameters
of these models. Some of these parametric modelling approaches include active shape
models (ASMs) [110, 116, 184], snake based algorithms [28, 51] and deformable lip
templates [196, 22]. The choice of geometric or parametric contour feature extraction
is similar to that of non-parametric versus parametric classifiers introduced in Chap-
ter 3, and it is not yet clear which approach is better suited to contour-based feature
extraction.
5.4.3 Combination
As the natures of appearance- and contour-based features are quite different, the
properties of the ROI that each method represents also differ. Contour-based
feature extraction can be considered to extract high-level, and appearance-based
low-level, speech-related features, and the two can therefore be considered complementary
sources of information for fusion. A number of combinational approaches have shown
promise for visual speech feature extraction, of which the simplest is basic concatena-
tion of the appearance- and contour-based feature vectors, possibly followed by fur-
ther feature reduction steps. In another approach, rather than collecting the intensity
information from the ROI directly, the intensities are collected based on the location
of the tracked lip contours, so that regions outside of the lip region are not considered
for appearance-based feature extraction [110, 49, 186]. A parametric combination approach was taken in
active appearance models (AAMs) [126, 30, 116] which combine ASMs and intensity
information into a single model, from which speech features could be extracted.
5.4.4 Choosing a visual feature extraction method
The choice between appearance and contour-based features is not yet an easy one,
despite the large body of accumulated AVSP research using various techniques from
either methodology and combinations of both. Even within the two camps, there is
little consensus on the best feature extraction methods, so comparing the methodolo-
gies is an even more difficult task, with no comprehensive comparison of various ap-
proaches yet completed in the literature. In limited comparisons (both in techniques
tested, and AVSP tasks evaluated), it has been demonstrated that appearance-based
features outperformed contour-based [142, 164], and combination strategies tend to
perform better than appearance-based alone [28, 115]. However, in a large vocabulary
speech recognition application, Matthews et al. [116] found that appearance models
outperformed the combinational AAM approach.
Intuitively, it would seem that ideal contour-based features would provide better
performance than appearance-based, because they are related directly to the movement
of the visible articulators. Appearance-based features can also contain a large amount
of information, such as illumination or variations in speaker appearance, that may
be unimportant to modelling either speech or speakers [162], and to which contour-based
feature extraction methods are largely invariant.
However, the ability of contour based methods to accurately represent the position of
the lips is highly dependent upon the front-end-effect of the lip tracking itself. Even if
the lip tracking can be performed accurately, it does introduce an additional complica-
tion to the visual front-end. Therefore, any potential performance increase of contour
based methods must be considered in trade-off with the extra processing in the video
front-end, and a minimal increase may not be worth the additional complication. By
comparison, appearance-based methods rely only on a coarse localisation of the lips,
making the feature extraction much more stable, especially in poor environmental
conditions.
The case for appearance-based features is furthered in a comprehensive review of re-
cent visual speech recognition systems by Potamianos et al. [145]. They found that
appearance-based features can extract information about all articulators present in the
ROI, whereas contour based methods had to explicitly track the articulators, and quite
often the teeth, tongue and jaw muscles were not considered. This conclusion was also
motivated by perceptual studies that showed human speech perception improved
when the entire ROI could be seen versus movement alone. Potamianos et al. also
found that appearance-based features can be extracted much more quickly than contour-
based, as there was no further need for tracking after the ROI was located. As most
real-world implementations must be expected to extract features at the frame-rate of
the video, this point is a particularly cogent one for the choice of appearance-based
feature extraction techniques. Correspondingly, appearance-based feature extraction
was chosen as the visual feature-extraction methodology for the experimental
work performed in this thesis.
5.5 Dynamic visual speech features
While generally demonstrated to perform as well as or better than contour based
methods, most simple appearance-based methods do tend to contain a significant
amount of information irrelevant to the visual speech events. However, a number
of techniques have been demonstrated that attempt to extract the maximum
visual speech information from the ROI, whilst discarding unwanted variance
due to other factors. This section will begin with a background on existing methods
for extracting dynamic visual speech features, and will then detail the visual feature extraction
technique used for this thesis.
5.5.1 Background
As visual speech is fundamentally represented by the movements of the visual articu-
lators, the best features for representing visual speech are generally considered to focus
on the movement of these articulators, rather than their stationary appearance within each
frame [16, 108]. While this is clearly the case in speech recognition applications, it is
not completely clear that this would apply for speaker recognition, where the static
features, such as skin colour or facial hair, of the ROI may be useful for identity pur-
poses [113, 25]. A number of researchers have shown that purely dynamic features
can work well for speaker recognition applications [20, 129, 52], although there has
not been any significant comparison of dynamic features with existing static features
in the literature. Such a comparison will be conducted in Section 5.6 of this thesis.
The simplest method of attempting to extract dynamic information from the video fea-
tures is through the use of time-derivative-based delta and acceleration coefficients.
These coefficients are generally used in addition to the original static feature val-
ues [146], although some researchers have discarded the static and used only the time-
derivative features [54]. In a similar manner, rather than calculating frame differences
using extracted features, the ROIs can be converted into frame-to-frame difference im-
ages before feature extraction can occur [67].
While time-derivative features, whether calculated before or after normal feature
extraction, show the differences between adjacent frames, they do not directly indicate
the movement of the visual articulators. For this purpose features based on calcu-
lating the optical flow [7] within the ROI have been used widely for both speech and
speaker recognition applications in the visual domain [120, 20]. However, it is not
clear that there is any performance increase when compared to time-derivative-based
features [67, 120].
One technique that has recently shown good performance in AVSP applications is
the use of LDA to extract the relevant dynamic speech features from the ROI [126, 145,
123]. To emphasize the dynamic features, the static features of a number of consecutive
ROIs are first concentrated before speech-class based LDA is performed based on a
know transcription. This approach will form the basis of the visual feature extraction
of this thesis.
Figure 5.6: Overview of the dynamic visual feature extraction system used for this thesis.
5.5.2 Cascading appearance-based features
The current state-of-the-art in visual speech feature extraction is a multi-stage cas-
cade of appearance-based feature extraction techniques developed by Potamianos
et al. [147]. This approach has been shown to work well for both speech [145] and
speaker [123] recognition. A simplified version of Potamianos et al.’s cascade will
form the basis of the visual speech feature extraction techniques used for the experi-
mental work in this thesis.
An outline of the simplified feature extraction system is shown in Figure 5.6, and can
be seen to have three main stages:
1. Frames are (optionally) first normalised to remove irrelevant information,
2. Static features are extracted for each individual frame, and
3. Dynamic features are calculated from the static features over several frames.
Frame normalisation
Before the static features can be extracted from each frame’s ROI, an image normali-
sation step is first performed to remove any irrelevant information, such as illumina-
tion or speaker variances. In Potamianos et al.’s original implementation of the cas-
cade [147], this step was performed using feature normalisation on the static features,
but image normalisation has been shown to work slightly better due to the ability
to handle variations in speaker appearance, illumination and pose as part of a wider
pre-processing front-end [102]. As such, image mean normalisation was chosen over
feature mean normalisation for this thesis.
This image normalisation step consists of calculating the mean ROI image, Ī, over an
entire utterance [I_1, I_2, …, I_T]:

Ī = (1/T) ∑_{t=1}^{T} I_t   (5.6)
This mean image can then be subtracted from each ROI image I_t as it is presented to
the static feature extraction stage:

I′_t = I_t − Ī   (5.7)
While this approach is not suitable for real-time use, due to the need to have seen
the entire utterance, a suitable alternative could easily be developed if needed through
the use of running averages.
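Equations (5.6) and (5.7) amount to a single broadcast subtraction over the utterance, as in this minimal sketch:

```python
import numpy as np

def mean_normalise(rois):
    """Eqs. (5.6)-(5.7): subtract the utterance-mean ROI image from each frame.

    rois: (T, H, W) array of grey-scale ROI images for one utterance.
    """
    mean_image = rois.mean(axis=0)   # eq. (5.6)
    return rois - mean_image          # eq. (5.7), broadcast over all T frames
```

By construction, the normalised frames average to the zero image over the utterance.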
The motivation behind normalising the ROI in this manner comes from the notion that
a large amount of speaker appearance-based information is collected in the standard
appearance-based feature extraction techniques [29], and that this information would
not be useful for modelling speech events. Of course, it is quite possible that this in-
formation would be useful for the speaker recognition application, so a version of the
cascade will also be tested without this normalisation step in Section 5.6 to investigate
this effect.
Static feature extraction
Once the ROI has been mean-normalised, static visual speech features can then be
extracted. As has been mentioned previously in Section 5.4, the main aim of feature
extraction is to provide compression of the raw pixel values in the ROI whilst
still maintaining good separation of the differing speech events. DCT-based feature
extraction was chosen for Potamianos et al.'s original cascade [147], as well as for the
implementation in this thesis. This method of feature extraction was chosen because it
had previously been shown to work as well as major alternative feature extraction
techniques like PCA, and slightly better than DWT [147]. DCT (and DWT)
also have the added advantage that they do not require the extensive subspace train-
ing needed for PCA-based feature extraction, as these algorithms are not based on any
prior knowledge of the ROIs.
Given an input greyscale ROI of size L × H pixels, the two-dimensional DCT at
position (x, y) can be defined as

F_t(x, y) = √(2/L) √(2/H) c_x c_y ∑_{i=0}^{L−1} ∑_{j=0}^{H−1} I′_t(i, j) cos[ (2i + 1)xπ / (2L) ] cos[ (2j + 1)yπ / (2H) ]   (5.8)
where I′_t(i, j) is the grey-scale value at position (i, j) of the mean-normalised ROI, and
the c_x and c_y coefficients are defined by

c_a = 1/√2 if a = 0;  1 if a ≠ 0   (5.9)
The DCT produces a representation of the original ROI in the frequency domain, with
the DCT calculated in (5.8) resulting in an image of identical dimensions to the original
ROI. As the DCT produces a full representation of the original image, in the frequency
rather than spatial domain, no loss of information occurs and the application of an
inverse-DCT can transform F_t back into the original normalised ROI, I′_t, flawlessly.
Although a complete DCT transformation provides no compaction of the ROI, the
transform does have a useful property that results in most of the energy residing in
the low-order coefficients. Using this characteristic, most of the transformed image,
F_t, can be discarded with little loss of information.

Figure 5.7: Most of the energy of a 2D-DCT resides in the lower-order coefficients, and can be collected easily using a zig-zag pattern.

One of the most common tech-
niques of extracting just the lower-order coefficients is through the use of a zig-zag
scheme, developed for use in the JPEG image compression scheme [183]. The zig-zag
scheme, shown in Figure 5.7, is designed to keep DCT coefficients of similar frequen-
cies together, and by choosing the first D^S coefficients of the DCT images through this
scheme, a compact representation of the original ROI can be realised:

o^S_t = [ F_t(z_x(1), z_y(1)), …, F_t(z_x(D^S), z_y(D^S)) ]   (5.10)
where z_x(d) and z_y(d) are defined based on the zig-zag pattern illustrated in Figure 5.7:

z_x = [1, 2, 1, 1, …]
z_y = [1, 1, 2, 3, …]   (5.11)
Determining the number of coefficients (D^S) that should be extracted from the DCT
images involves a trade-off between the amount of information available and the
complexity required to train classifiers in higher-dimensional spaces. For this thesis,
D^S was chosen as 20 based on empirical experiments. For evaluation of the static
features, delta and acceleration components were added to result in a 60-dimensional
feature space, but only the primary 20 features were used as input to the dynamic
feature extraction stage.
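The static feature extraction of (5.8)-(5.10) can be sketched directly in numpy. The zig-zag enumeration below follows the JPEG convention of grouping coefficients by diagonal with alternating direction; the exact within-diagonal direction is a convention assumed here, as the text only shows the start of the pattern.

```python
import numpy as np

def dct2(img):
    """Orthonormal 2-D DCT of an L x H image, directly implementing eq. (5.8)."""
    L, H = img.shape
    x, i = np.meshgrid(np.arange(L), np.arange(L), indexing='ij')
    y, j = np.meshgrid(np.arange(H), np.arange(H), indexing='ij')
    Bx = np.cos((2 * i + 1) * x * np.pi / (2 * L))   # Bx[x, i]
    By = np.cos((2 * j + 1) * y * np.pi / (2 * H))   # By[y, j]
    c = lambda n: np.where(np.arange(n) == 0, 1 / np.sqrt(2), 1.0)  # eq. (5.9)
    return np.sqrt(2 / L) * np.sqrt(2 / H) * np.outer(c(L), c(H)) * (Bx @ img @ By.T)

def zigzag_indices(h, w):
    """(row, col) pairs ordered by diagonal, alternating direction (JPEG-style)."""
    cells = [(r, c) for r in range(h) for c in range(w)]
    return sorted(cells, key=lambda rc: (rc[0] + rc[1],
                                         rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def static_dct_features(roi, num_coeffs=20):
    """Eq. (5.10): keep the first D^S zig-zag coefficients of the DCT image."""
    F = dct2(np.asarray(roi, dtype=float))
    return np.array([F[r, c] for r, c in zigzag_indices(*F.shape)[:num_coeffs]])
```

For a constant ROI, all energy lands in the first (DC) coefficient, illustrating the energy-compaction property described above.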
In Potamianos et al.’s original implementation an application of speech-event based
LDA was performed upon the DCT features before the dynamic feature extraction
stage, but this stage was not applied for this thesis, as it added complication and was
found not to provide significant benefit over just applying the speech-based LDA
during dynamic feature extraction.
Dynamic feature extraction
To extract the dynamic visual features that have been shown to improve human per-
ception of speech, this stage of the cascade extracts LDA-based features over a range
of consecutive ROIs. An overview of this stage is shown as the bottom row of Fig-
ure 5.6, and is identical to the dynamic stage of Potamianos et al.’s [147] cascade. To
allow the LDA to be used to extract the dynamic, rather than static, features of the
visual speech, ±J consecutive frames were concatenated around the frame under con-
sideration before the LDA transformation matrix could be calculated. For this thesis,
J = 3 was found to provide the best balance between the amount of information
captured and the size of the resulting LDA transformation matrix [102]. The input to the
LDA algorithm for the concatenated ROI features around oSt is given as
o^C_t = [ o^S_{t−J}, …, o^S_t, …, o^S_{t+J} ]   (5.12)
It can be seen that this results in a feature vector of size D^C = (2J + 1)D^S, where D^S is
the dimensionality of the original static vector.
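The concatenation of (5.12) is a simple sliding-window stack; the replication of edge frames at utterance boundaries is an assumption, since the text does not specify boundary handling.

```python
import numpy as np

def concat_context(static_feats, J=3):
    """Eq. (5.12): concatenate the +/-J neighbouring static vectors around each frame.

    static_feats: (T, D_S) array; edge frames are replicated at the utterance
    boundaries (an assumption - boundary handling is unspecified in the text).
    """
    T, D = static_feats.shape
    P = np.pad(static_feats, ((J, J), (0, 0)), mode='edge')
    return np.stack([P[t:t + 2 * J + 1].ravel() for t in range(T)])  # (T, (2J+1)*D_S)
```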
The aim of the LDA algorithm is to arrive at a suitable transformation matrix W^D_LDA
that provides the best separation over a range of known classes. LDA training
is based upon a set of N training examples X^C = {o^C_1, …, o^C_N} and a set of matching
class labels L = {l_1, …, l_N}, where each l_n is between 1 and the number of possible
classes A. For the implementation of the cascade used for this thesis, the class labels
are calculated by force-aligning word models trained on the PLP-based acoustic fea-
tures against the known transcription of the training sequences. That is, each video
observation is labelled with the word (including the ‘silence’ meta-word) and state it
appears within based on the acoustic models.
The LDA transformation matrix is calculated such that the within-class dispersion is
minimised, while the between-class distance is maximised. To allow these goals to be
met, the within-class scatter matrix Sw and the between-class scatter matrix Sb are
defined based on the statistics of the training data around their own class and around
the whole distribution respectively. The within-class scatter matrix is defined by
S_w = ∑_{a=1}^{A} P(a) Σ_a   (5.13)
where P(a) is the likelihood of class a occurring based on the labelled training data,
and Σ_a is the covariance matrix of the a-th class. The between-class scatter matrix is then
defined as
S_b = ∑_{a=1}^{A} P(a) (µ_a − µ_0)(µ_a − µ_0)^T   (5.14)
where µ_a is the mean of the a-th class, and µ_0 is the global mean over all classes, given
by
µ_0 = ∑_{a=1}^{A} P(a) µ_a   (5.15)
The transformation matrix can then be found by maximising [126]

Q(W^D_LDA) = | W^D_LDA S_b (W^D_LDA)^T | / | W^D_LDA S_w (W^D_LDA)^T |   (5.16)
where |X| denotes the determinant of matrix X. In a similar manner to PCA, (5.16) can
be solved by calculating the eigenvalues and eigenvectors of the matrix pair (S_b, S_w)
such that S_b F = S_w F D, where F contains the eigenvectors as its columns, F = [f_1, …, f_{D^C}],
and the D^C highest eigenvalues form the diagonal of D. The LDA transformation
matrix can then be defined simply as the transpose of the eigenvector matrix, and the
eigenvalues are discarded:

    W^D_LDA = F^T                                                        (5.17)

Because the dimensionality of the LDA transformation matrix is identical to that of
the data being transformed, LDA does not perform well on high-dimensional data
such as raw images [8], due to the computational difficulty in calculating such a large
matrix. It is for this reason that the LDA stage of the cascade is based upon the DCT
transformation of the normalised ROIs calculated earlier in the cascade. Even with the
concatenation of the ±J adjacent frames, D_C is still much smaller than the number of
pixels in the normalised ROI region.
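Equations (5.13)-(5.17) can be realised quite directly. The following sketch (Python with NumPy assumed; not the thesis implementation) estimates the scatter matrices from labelled examples and solves the maximisation through the equivalent standard eigenproblem of inv(S_w) S_b, which is adequate at the modest dimensionality D_C used in the cascade:

```python
import numpy as np

def lda_transform(X, labels, n_classes):
    """Estimate the LDA matrix W^D_LDA from labelled feature vectors.

    X: (N, D) array of concatenated static features o^C_t.
    labels: length-N integer array of class indices in [0, n_classes).
    Returns W with eigenvector rows sorted by decreasing eigenvalue."""
    D = X.shape[1]
    Sw, Sb, mu0 = np.zeros((D, D)), np.zeros((D, D)), np.zeros(D)
    priors, means = [], []
    for a in range(n_classes):
        Xa = X[labels == a]
        P_a = len(Xa) / len(X)                 # P(a) from the labelled data
        mu_a = Xa.mean(axis=0)
        priors.append(P_a)
        means.append(mu_a)
        Sw += P_a * np.cov(Xa.T, bias=True)    # within-class scatter (5.13)
        mu0 += P_a * mu_a                      # global mean (5.15)
    for P_a, mu_a in zip(priors, means):
        d = (mu_a - mu0)[:, None]
        Sb += P_a * (d @ d.T)                  # between-class scatter (5.14)
    # Solve S_b f = lambda S_w f via the standard eigenproblem of inv(S_w) S_b
    evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order].T              # W = F^T (5.17)
```

Projecting the training data with the leading rows of the returned matrix then gives the most class-discriminative directions first, which is what allows the later truncation to D_D dimensions.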
The set of training examples, XC, is taken from the training sessions of the speech
processing framework developed in Chapter 3. Ideally, these training observations
would be separate from the set of observations used to train and test the speech mod-
els. However, this was not able to be achieved due to the limited data available in the
XM2VTS database. Although the training sequences for the LDA transformation and
the word models are the same, neither of these operations has any knowledge of the
testing sequences, which should allow for a valid evaluation of their performance in
testing. Because the output of this cascade is based on training examples of the mean-
removed DCT static features, an additional complication is introduced: a differing
set of cascading appearance-based features must be formed for each unique training
configuration under the speech processing framework.
Once the LDA transformation matrix W^D_LDA has been calculated using the training
sequences, it can be used to transform the static observation vectors from the DCT
stage of the cascade to form the dynamic visual speech features used to train and test
the models for speech and speaker recognition. Before applying the transformation
matrix to the concatenated static features, the dimensionality of the output dynamic
features can be limited by choosing only the first D_D eigenvectors in W^D_LDA to
arrive at W'^D_LDA. Given a concatenated mean-removed DCT feature vector, o^C_t,
the final dynamic speech feature vector, o^D_t, can then be calculated by matrix
multiplication:

    o^D_t = W'^D_LDA o^C_t                                               (5.18)
This dynamic visual speech vector is the final stage of the cascading appearance-based
feature extraction technique, and will be the basis of the visual features used through-
out the remainder of this thesis.
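The truncation and projection of (5.18) amount to keeping the leading rows of the trained matrix. A minimal sketch, with all sizes purely illustrative (D_D = 40 is an assumption, not a value from the thesis):

```python
import numpy as np

# Illustrative sizes only: a D_C-dimensional stacked vector reduced to D_D features.
D_C, D_D = 140, 40

# A random orthogonal matrix stands in here for the trained W^D_LDA
W = np.linalg.qr(np.random.default_rng(0).normal(size=(D_C, D_C)))[0].T

W_prime = W[:D_D]                                  # keep the first D_D eigenvector rows
o_C = np.random.default_rng(1).normal(size=D_C)    # a stacked mean-removed DCT vector o^C_t
o_D = W_prime @ o_C                                # equation (5.18)
```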
5.6 Comparing speech and speaker recognition
It is clear from existing research that the dynamic information contained in visual
speech is of primary importance for the task of visual speech recognition [126, 145].
However, while dynamic information has been shown to perform well for the task of
visual speaker recognition [123, 20], it is not clear that actively removing static infor-
mation, which can be a useful pre-processing stage for speech, is a sensible decision
for the speaker recognition application. To this end, this section will look at visual
features extracted from the static and dynamic stages of the cascade outlined in Sec-
tion 5.5 for both the speech and speaker recognition tasks. Additionally, the cascade
will also be tested without the image normalisation stage to determine whether the
normalisation, and resulting removal of static speaker information, has any effect on
the final speaker recognition performance.
The training and testing of the models used for speech and speaker recognition in this
section will be based on the framework developed in Chapter 4 on all 12 configura-
tions of the XM2VTS database. The performance of the various stages of the visual
feature extraction cascade will first be presented for the speech recognition task to
confirm the speech recognition ability of the cascade-derived features on the XM2VTS
database with Potamianos et al.'s earlier work [147]. Potamianos et al.'s work on this
cascade has primarily been tested on proprietary databases, and an evaluation of
these features on the publicly available XM2VTS database should allow for a good
baseline for future visual speech research.
The same set of features will then be evaluated for the speaker verification task to
determine the utility of the cascading appearance-based features for speaker recog-
nition applications. Such features have been used for speaker recognition only
once in the literature [125], and were not studied in detail at the time. This chapter
aims to rectify this situation, and provide a detailed comparison of dynamic and static
video features for speaker recognition, based on both image-normalised and raw DCT
static features. Finally conclusions will be drawn from both the speech and speaker
recognition performance as to the dynamic nature of visual speech.
5.6.1 Feature extraction
While the main focus of this section will be on the suitability of dynamic visual speech
features for speech and speaker recognition, acoustic features will also be evaluated
to serve as a baseline. For these experiments the acoustic features will be PLP- and
MFCC-based features extracted from the raw acoustic signal every 10 milliseconds
over 25 millisecond Hamming windows. Both acoustic features were based on the
first 12 Mel-frequency banks, with an energy coefficient added to result in 13 static
features. Delta and acceleration features were then appended to arrive at a 39-
dimensional acoustic feature vector for each window. These features will be referred
to as the A-PLP and A-MFCC features throughout these experiments.
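The delta and acceleration coefficients can be appended with the standard regression formula (the HTK-style recipe is assumed here; the exact window width used in the thesis is an assumption):

```python
import numpy as np

def deltas(feats, theta=2):
    """Regression deltas over a +/-theta frame window, the common recipe
    for appending dynamic coefficients to static features."""
    T = len(feats)
    padded = np.pad(feats, ((theta, theta), (0, 0)), mode="edge")
    denom = 2 * sum(th * th for th in range(1, theta + 1))
    num = sum(th * (padded[theta + th:theta + th + T]
                    - padded[theta - th:theta - th + T])
              for th in range(1, theta + 1))
    return num / denom

# 13 static coefficients -> 39-dimensional vector per window
static = np.random.default_rng(0).normal(size=(100, 13))
full = np.hstack([static, deltas(static), deltas(deltas(static))])
```

Applying the same delta operator twice yields the acceleration coefficients, giving the 13 + 13 + 13 = 39 dimensions described above.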
Video features for these experiments were gathered from both the static and dynamic
stages of the appearance-based cascade described earlier extracted from the manually
tracked ROIs.
Two versions of the static features from the cascade will be evaluated in this sec-
tion, one based on the image normalised grayscale ROIs, and one based on the un-
normalised grayscale ROIs. 20-dimensional DCT-based static feature extraction is
performed on these ROIs as described in Section 5.5.2, and deltas and acceleration
coefficients are appended to arrive at a 60-dimensional visual feature vector for each
video frame. The dimensionalities chosen here were based on tuning experiments
and experiments performed by Lucey [102]. The image-mean-normalised DCT and
un-normalised DCT features will be referred to as V-MRDCT and V-DCT features
throughout these experiments.
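The mean-removed DCT extraction can be sketched as follows (Python with NumPy assumed, and an orthonormal DCT-II basis built by hand; for brevity the first 20 coefficients are taken in row-major order rather than by the zig-zag scan commonly used, which is a simplifying assumption):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    C = np.cos(np.pi * (2 * np.arange(n) + 1) * k / (2 * n))
    C[0] *= 1 / np.sqrt(2)
    return C * np.sqrt(2 / n)

def mrdct_features(rois, n_coeffs=20):
    """Mean-removed DCT features: subtract the utterance-mean image,
    take a 2-D DCT of each ROI, and keep the first n_coeffs coefficients.

    rois: (T, H, W) array of grayscale ROI images from one utterance."""
    mean_image = rois.mean(axis=0)            # static information to remove
    Ch = dct_matrix(rois.shape[1])
    Cw = dct_matrix(rois.shape[2])
    out = []
    for roi in rois - mean_image:
        coeffs = Ch @ roi @ Cw.T              # separable 2-D DCT
        out.append(coeffs.flatten()[:n_coeffs])
    return np.array(out)
```

Because the DCT is linear and the mean image has been removed, the extracted coefficients average to zero over the utterance, which is exactly the suppression of static information the V-MRDCT features rely on.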
The two static feature vectors are then used as the basis of the dynamic feature ex-
traction. The 7 static feature vectors (not including deltas and accelerations) surround-
ing and including each video frame were concatenated. This concatenated feature
vector then underwent LDA-based feature reduction based on speech events deter-
mined by force aligning the A-PLP features with a known transcription. The resulting
image normalised and non-normalised dynamic features will be referred to as V-LDA-
MRDCT and V-LDA-DCT respectively throughout these experiments.
The A-PLP, A-MFCC, V-DCT and V-MRDCT features were designed such that they
could be extracted from any given utterance without any prior knowledge of the type
of data they were working with, allowing their feature vectors to be used for each
of the configurations of the XM2VTS database. However, because the LDA-derived
features, V-LDA-MRDCT and V-LDA-DCT, were trained based on acoustic speech
events in the training sessions of the framework, each unique training configuration
of the framework had to use a differing set of LDA-derived visual feature vectors.
As a result, each sequence being tested had 6 different feature representations for
V-LDA-MRDCT and V-LDA-DCT based upon which framework configuration was
being tested.
5.6.2 Model training and tuning
To evaluate the speech and speaker recognition performance, two different model
types had to be trained for each of the 6 training configurations of the AVSP frame-
    Datatype        HMM States   Mixtures
    A-MFCC          11           8
    A-PLP           11           8
    V-DCT           9            16
    V-LDA-DCT       9            16
    V-MRDCT         9            16
    V-LDA-MRDCT     9            16

Table 5.1: HMM topologies used for the uni-modal speech processing experiments.
work. These models are as follows:
1. background word models
2. speaker word models
Under the AVSP framework developed in Chapter 4, these two models would allow
both speaker-independent and speaker-dependent continuous speech recognition, as
well as text-dependent speaker verification.
For this particular implementation of the framework, the word models were imple-
mented as left-to-right HMMs as described in Chapter 3. The topologies of the mod-
els were tuned by evaluating the speech and speaker recognition performance on
a single training configuration of the XM2VTS database, and these topologies were
kept for all remaining configurations. Table 5.1 shows the tuned topologies for each
datatype tested in these experiments. In the process of tuning the HMM topologies
it was discovered that the best performing topologies for speech and speaker recog-
nition tended to be very similar. This was fruitful as it allowed the same models to
be trained and then tested on both tasks, as intended for the AVSP framework, rather
than having to train a separate set of models for speech and speaker recognition.
An additional parameter that required tuning was the MAP-adaptation factor, or τ,
for adapting the speaker word and text-independent speech models from the equiv-
alent background models. This factor controls the relative importance of the existing
                    WER (%)
    Datatype        SI        SD
    A-MFCC          4.65      2.72
    A-PLP           4.05      1.06
    V-DCT           52.77     18.84
    V-LDA-DCT       33.88     10.24
    V-MRDCT         39.01     15.15
    V-LDA-MRDCT     27.90     8.22

Table 5.2: WERs for speech recognition on all 12 configurations of the XM2VTS database.
background models as compared to the data being adapted towards. Tuning of this
parameter was performed in a similar manner to the topologies, but only for the task
of speaker verification. A MAP-adaptation factor of τ = 0.75 was found to perform
well over all datatypes and was therefore chosen for these experiments.
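The role of τ can be illustrated with a sketch of the mean update for a single Gaussian (a common relevance-MAP recipe is assumed here; the exact update used by the thesis toolkit may differ in detail):

```python
import numpy as np

def map_adapt_mean(mu_background, adapt_frames, tau=0.75):
    """MAP adaptation of a Gaussian mean: tau weights the background
    model statistics against the occupancy of the adaptation data."""
    n = len(adapt_frames)
    alpha = n / (n + tau)          # data weight grows with more frames
    return alpha * np.mean(adapt_frames, axis=0) + (1 - alpha) * mu_background
```

With a small factor such as τ = 0.75, even a handful of adaptation frames pulls the mean strongly towards the target speaker, while a very large τ would leave the background model essentially unchanged.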
5.7 Speech recognition experiments
5.7.1 Results
Speech recognition experiments were performed using both the background word
models and the speaker-specific models. Results were reported using word error rates
(WER) calculated for each configuration of the database. The relative performance of
each datatype was found to be similar between differing database configurations, and
so each datatype’s WER was reported as the average over all 12 configurations. The
SI and SD average WERs are shown in Table 5.2.
5.7.2 Discussion
The SI speech recognition results reported in Table 5.2 confirm the relative improve-
ments shown by Potamianos et al.’s [147] original cascade, but SD speech recognition
results against such a cascade have not yet been published in the literature. This sec-
tion will discuss these results in some detail, focusing on the effect of the normalisa-
tion and LDA stages of the cascade on the speech recognition WERs. The difference
in speech recognition performance will then be discussed in relation to the speaker-
dependency (or not) of the speech models, followed by a quick examination of the
acoustic results.
Mean-image normalisation
By comparing the V-DCT and V-MRDCT speech recognition WERs in Table 5.2, it can
be seen that by removing the mean image from each utterance, the V-MRDCT features
reduce the WER by 13.76% in the SI case. That such a large improvement in the WER
can come by simply removing the mean image suggests that the speaker and environ-
ment specific information contained in the mean image are hindering the SI speech
recognition performance due to variations in this information between speakers and
even between sessions with specific speakers.
Although the speaker variations should not be a significant problem with SD speech
recognition (as each model is adapted to the target speaker), the V-MRDCT still pro-
vided a 3.69% improvement in the WER over the V-DCT features. This improvement is
likely to be related to normalising other environmental variations in the utterances be-
tween the differing training and testing sessions of the database configurations. Some
examples of such variations may be changes in lighting or positioning of the speaker,
or if the facial hair, glasses or makeup of the speaker has changed between the training
and testing sessions.
One additional factor that may relate to the improvement of the V-MRDCT SD speech
recognition performance is the underlying improvement in the SI background model
before the adaptation to the SD models.
Dynamic feature extraction
The application of speech-event-based LDA for the V-LDA-DCT and V-LDA-MRDCT
features in the feature extraction cascade provides further improvements in the speech
recognition WER over the underlying V-DCT and V-MRDCT datatype.
Referring back to Table 5.2, it can be seen that for SI speech recognition, the un-
normalised V-LDA-DCT features provide an 18.89% decrease in the WER over the
original V-DCT features. However, the best performance comes from the application
of speech-event-based LDA to the already normalised V-MRDCT features providing
a further 11.11% WER decrease, resulting in the best SI visual speech recognition per-
formance.
The LDA algorithm was designed to choose features that best discriminate between
a set of classes, which for these experiments were the transcribed words models and
their states. For this reason, it can be seen as a more intelligent form of removing
irrelevant features than was attempted using mean image normalisation earlier in the
cascade, as features of the ROI that do not vary with the speech are unlikely to be
included in the LDA-transformed features. The small improvement of the V-LDA-
DCT over the V-MRDCT features shows that the discriminative nature of the LDA
feature extraction can outperform the more brute-force removal of the mean image.
Of course, there is no reason not to perform both the normalisation and discriminative
stages of the cascade, resulting in the best-performing V-LDA-MRDCT features.
A similar, but smaller, performance increase is obtained from the LDA stage
of the cascade for the SD speech recognition results, with a 6.93% improvement in
the image-normalised results and 8.6% for the un-normalised. In a similar manner to
the SD performance increase of mean-image-removal, this improvement is likely to be
related to variations in environmental and speaker’s appearances between the training
and testing sessions, as well as the possibility of the improvement in the SI background
models being passed through to the speaker-adapted speech models. Similar to SI
speech, the best SD speech recognition performance is provided by the final stage of
the cascade, with the WER of 8.22% starting to approach that expected of
acoustic features, suggesting that in controlled environments such features could be
usable without the need for any audio at all.
Speaker dependency
For all of the datatypes tested in Table 5.2, it can be seen that the speech recognition
WER is decreased by at least half when SD speech models are used. The poorer
performance in the SI case is obviously related to the variation in speakers between
the training and testing sets that is not a factor for SD speech recognition, where each
speech model is adapted directly to the target speaker.
While the acoustic WERs are only separated by a few percent between the SI and
SD cases, the best performing V-LDA-MRDCT features have a much wider gulf be-
tween the two possible speech recognition configurations. Even though the entire
appearance-based cascade provided a major improvement in speech recognition per-
formance over the raw V-DCT features, there is still a large amount of speaker-specific
information, as seen by the 19.68% decrease in WER gained by using SD models. This
suggests that there is still much room for improvement in the development of video
features for SI video speech recognition, and any such improvements are likely to
move video speech recognition towards where it can reasonably be used uni-modally
in controlled conditions.
Acoustic performance
It can be seen that both the A-PLP and A-MFCC features work well for speech recogni-
tion, which is reflected by their widespread use in mature acoustic speech processing
research [157, 151]. While both features appear to work equally well when the training
and testing speakers are mismatched in SI speech, the A-PLP features appear to adapt
better to the individual speaker word models, resulting in an improved SD WER when
compared to the A-MFCC results.
5.8 Speaker verification experiments†
5.8.1 Results
Speaker verification experiments were performed on the same set of features as the
speech recognition experiments in the previous section. Speaker verification scores
were calculated by comparing the scores obtained with the speaker-specific models and
the background models, and plotting the difference between the two using DET plots
to investigate the relative false alarm and miss rates that can be obtained with each
datatype under consideration. In a similar manner to the speech recognition exper-
iments, and to ensure that enough scores are available to accurately evaluate the
performance of each feature-extraction method, all 12 configurations of the XM2VTS
database were evaluated for speaker verification.
The results of the text-dependent speaker verification experiments using the speaker
dependent and background word HMMs are shown in Figure 5.8. It can be seen that
all of the visual features are providing speaker verification performance in a similar
range to the acoustic features, which is quite different from speech recognition, where
both audio feature extraction methods were clearly better than any of the visual
features. The ability of the A-PLP features to better represent individual speakers
when compared to A-MFCC features is also clearly visible here.
5.8.2 Discussion
Figure 5.8 shows that even though the original intention of the appearance-based
cascade was to remove speaker-specific information and emphasise the speech-event-
specific information, all stages of the cascade actually perform better for visual speaker verification
[Figure 5.8: DET plot of miss probability versus false alarm probability (both in %) for the A-MFCC, A-PLP, V-DCT, V-LDA-DCT, V-MRDCT and V-LDA-MRDCT features.]

Figure 5.8: Text-dependent speaker verification performance on all 12 configurations of the XM2VTS database.
than the V-DCT features, which theoretically contain the most static speaker-specific
information. Even in the relative improvements between the various video datatypes,
the improvements in video speaker verification performance appear to match those in
video speech recognition, suggesting that (of the features tested here) the best video
features for both speech and speaker recognition are the same.
These results suggest that for visual speaker recognition, the (dynamic) behavioural
nature of speech could be more important than the (relatively static) physiological
characteristics [14]. That is, it may be easier to recognise speakers by how they speak,
than by their appearance while they speak. This also has the benefit that because static
appearance is less important, environmental conditions such as illumination, and within-
speaker variations such as facial hair or makeup, become less of an issue provided
they do not change throughout an utterance, and as long as the extraction of dynamic
features can still be performed adequately.
5.9 Speech and speaker discussion
Interestingly, these experiments show that the same features that have been shown
to perform well for the task of speaker-independent speech recognition by other re-
searchers [147, 126] also perform well for speaker-dependent speech recognition and
speaker recognition. All of the stages of the cascade that provided benefits by remov-
ing speaker- and session-specific information also provided similar benefits for the
speaker verification experiments, even though the original intent of the cascade was
to provide a form of normalisation across speakers and subsequently improve speech
recognition in unknown speakers.
However, as demonstrated by the speaker-dependent speech recognition results re-
ported earlier, this normalisation effect of the cascade still had benefits even when the
speech was tested on the same speaker as the training, in part due to the normali-
sation of within-speaker session variability such as illumination, facial hair, makeup
and other factors. Of course, it stands to reason that if individual speakers' word
models can provide a significant improvement in speech recognition over the back-
ground speaker-independent models, then comparing a number of individual sub-
jects' word models against a given utterance can allow conclusions as to the identity of
the speaker.
That the speaker verification experiments improved in performance as static informa-
tion was removed suggests that dynamic visual information can play a very important
role in visual (and audio-visual) person recognition, particularly when the facial move-
ments are speech related.
Of course, face recognition is a very mature area of research that has shown that static
recognition of faces can provide good performance, and the possibility certainly ex-
ists of using a combination of static face and dynamic features to represent the visual
modality with a minimum loss of information. Some promising versions of such sys-
tems have been developed [123], but this remains a relatively new area of research.
5.10 Chapter summary
This chapter has covered the broad fields of acoustic and visual feature extraction for
AVSP. The first half of this chapter covered a review of the state-of-the-art in visual
feature extraction for AVSP. A brief overview of visual front ends for localisation and
tracking of lip ROIs was provided, followed by the manual tracking approach that
was chosen for this thesis to attempt to avoid the front-end effect. A more detailed
review of visual feature extraction techniques was conducted covering appearance
and contour based extraction as well as combinations of the two. A review covering
the extraction of dynamic visual features to better model the nature of visual speech
was then conducted, with particular focus on the cascading approach first suggested
by Potamianos et al. [147] for continuous speaker-independent speech recognition.
The final half of the chapter was devoted to experimental evaluation of the visual fea-
tures extracted from the various stages of the dynamic appearance-based cascade for both
the speech recognition and speaker verification tasks according to the framework de-
veloped in Chapter 4. These results confirmed the results found by Potamianos et
al. and other researchers [126] for speaker-independent speech recognition, but also
showed good performance as the cascade progressed for speaker-dependent speech
recognition and both text-dependent and independent speaker verification. These ex-
periments showed that even though the cascade was intended to remove speaker (and
session) specific information to improve speaker-independent speech recognition, the
dynamic information extraction works very well for the recognition of speakers as
well, suggesting that visual speech could be considered more of a behavioural than a
physiological characteristic for the purposes of recognising speaking persons.
Chapter 6
Simple Integration Strategies
6.1 Introduction
This chapter will present two simple integration strategies that can be performed us-
ing the existing classifier methods and techniques developed in Chapter 3. As these
strategies do not modify the existing classifier techniques, they focus on:
1. Fusing the speech features before classification, or
2. Fusing the output scores after classification
These two techniques will be referred to as early and late integration respectively
throughout this thesis. Both of these techniques have been used extensively in the
AVSP literature and a brief review will be conducted in the beginning of this chapter
to illustrate and compare both techniques for audio-visual speech and speaker recog-
nition.
In the final half of this chapter, audio-visual speech and speaker recognition experi-
ments will be conducted using these simple integration strategies to serve as a com-
parative baseline for the novel SHMM experiments conducted later in the thesis.
6.2 Integration strategies
The study of audio-visual fusion for speech processing is a subset of the broader field
of research referred to as sensor fusion [32]. Sensor fusion covers any research in-
volved in extracting information from multi-sensor environments through some form
of integration of the multi-sensor data. Since the earliest research into sensor fusion
in the early 1980s [174], this area has been adapted for a wide range of applications,
of which one of the more popular has been the improvement in recognition of hu-
mans and their activities. The most obvious such application would be the identifi-
cation of people using multiple biometrics such as face, fingerprints or voices [161],
but many other such applications can also benefit from multiple sensor fusion, includ-
ing person tracking [75], expression recognition [197] and, of course, speech process-
ing [126, 23, 25].
One important way of characterising sensor fusion methods is based on where the
integration of the information obtained from multiple sensors occurs, of which the
main levels can be defined as:
• Early integration, where the raw sensor data or features extracted from this data
are combined before classification,
• Middle integration, covering classifiers inherently designed to handle data or
features from multiple modalities, or
• Late integration, where scores or decisions of individual classifiers for each sen-
sor are combined.
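The late-integration case in particular is often realised as a simple weighted combination of per-modality classifier scores. A minimal sketch (the weight and the use of log-likelihood scores are illustrative assumptions, not details from the thesis):

```python
def late_fuse(score_audio, score_video, lam=0.7):
    """Weighted-sum late integration of per-modality classifier scores
    (e.g. log-likelihoods); lam is an illustrative weight, typically
    tuned to the expected reliability of each modality."""
    return lam * score_audio + (1 - lam) * score_video
```

Raising lam favours the acoustic classifier, which is the usual choice in clean audio, while noisy conditions argue for shifting weight towards the visual scores.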
The choice of the level of integration often comes down to the type of sensors used for
a particular application. In particular, late integration is more popular in applications
where the sensors are capturing completely independent data such as the recognition
of a person by their signature and face [161]. Early and middle integration can be
more useful in situations where the sensors are capturing similar information, an ex-
ample of which might be combining visible and infrared images of a face to improve
face recognition in adverse environments [92]. Middle integration largely serves as a
‘catch-all’ category for approaches that do not fit the other two approaches, but many
such systems typically involve one or more of the sensors controlling the integration
of all the sensors.
For audio-visual speech processing, all three levels of fusion have been considered for
both the speech and speaker recognition tasks, and as such, all three integration
strategies will be considered in this thesis. This chapter will investigate the early and
late integration strategies, as they can be implemented simply using the same classifier
design as the individual modelling experiments conducted in Chapter 5. Based on
the results reported in that chapter, only A-PLP and V-LDA-MRDCT features will be
considered for fusion in this thesis due to their superior speech and speaker modelling
ability.
As the middle-integration-based SHMM approach is a primary focus of this thesis,
this chapter will only focus on early and late integration, and the middle integration
approach will be deferred until Chapters 7 and 8.
6.3 Early integration
6.3.1 Introduction
As the audio and visual modalities are combined before classification, early integra-
tion is one of the simplest methods of fusion available for AVSP research, and many
audio-visual speech [1, 173, 145] and speaker [27, 54, 193] recognition systems have
used this approach for this very reason.
Although early integration can come about through the fusion of either the raw sensor
data or from features extracted from the raw data, only the feature-fusion approach has
been demonstrated as feasible in the literature, primarily due to the large data volumes
and registration difficulties involved in combining raw audio and video data [25].
The simplest method of feature fusion is through direct concatenation of the acoustic
and visual feature vectors, resulting in a single multimodal feature vector [114, 27,
141]. Because the audio and video features are normally captured at differing frame-
rates, some form of interpolation or oversampling is generally used to improve the
video feature rate to that of the acoustic data.
Because a simple concatenation can result in a much larger feature vector than is typ-
ically encountered in single-modality classifiers, some form of feature reduction can
be performed on the concatenated feature vector to reduce the overall size and min-
imise the effect of the ‘curse of dimensionality’. Common feature reduction meth-
ods for this purpose include PCA and LDA [27], with the hierarchical LDA approach
adopted by Potamianos et al. [143] showing particular promise in this area.
Another approach that can be considered a form of early fusion is using visual in-
formation to enhance acoustic features for use in regular acoustic classifiers [60, 146].
By estimating a linear transformation from either the video features alone or a con-
catenation of both modalities, this approach allows a simulated or enhanced acoustic
feature vector to be presented to a regular acoustic speech processing system, allowing
integration of video data into an existing acoustic system with minor modifications.
Two main forms of early integration speech processing experiments will be conducted
in this thesis: plain concatenative feature-fusion and discriminative feature-fusion.
These experiments will be conducted both to investigate the effectiveness of feature
fusion within the speech processing framework developed earlier, and as a baseline
for the late and middle integration experiments conducted later in this thesis.
6.3.2 Concatenative feature fusion
The concatenative feature fusion vectors used under the speech processing framework
were calculated by concatenating each acoustic feature vector with a corresponding vi-
sual feature vector, resulting in the concatenative feature-fusion vector. As the original
datatypes consisted of 39 acoustic and 60 video features, a large feature vector of 99
elements was generated for each original acoustic feature extraction window.
As the video features were not extracted at the same rate as the acoustic features, the
corresponding video feature vector for each acoustic window was chosen as the closest
(in time) video feature vector. Each video vector was therefore copied completely and
appended to approximately four acoustic feature vectors, with no estimated interpola-
tion occurring between the video frames. This approach was chosen as physiological
experiments have shown that there is little value in using visual features at frame rates
above 15-20 Hz [56].
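The nearest-frame alignment described above can be sketched as follows (Python with NumPy assumed; the 100 Hz acoustic rate follows from the 10 ms windows described earlier, while the 25 Hz video rate is an assumption about the database frame rate):

```python
import numpy as np

def fuse_concat(audio, video, audio_rate=100.0, video_rate=25.0):
    """Early integration by concatenation: append the nearest-in-time
    video vector to each acoustic vector, with no interpolation
    between video frames."""
    t_audio = np.arange(len(audio)) / audio_rate
    idx = np.clip(np.round(t_audio * video_rate).astype(int), 0, len(video) - 1)
    return np.hstack([audio, video[idx]])
```

With 39-dimensional acoustic and 60-dimensional visual vectors this yields the 99-element fused vector, and each video frame is reused for roughly four consecutive acoustic windows.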
The resulting 99-dimensional feature vectors were then used to train the full variety
of HMM models defined under the speech processing framework. Because feature-
fusion-derived feature vectors can be used identically to individual-modality features,
the full range of speech processing experiments could then be conducted. The HMM
topologies for the concatenative feature fusion features were chosen to be 11 states
and 16 mixtures, as these performed well in empirical tuning experiments and provide
a good baseline for comparison to the similarly configured uni-modal HMMs.
6.3.3 Discriminative feature fusion
One of the major problems with simple concatenation is the large feature vectors that
result from such an approach. As this approach can lead to feature vectors significantly larger than those of the individual modalities, the spectre of the 'curse of dimensionality' can rise again. However, similar methods used to
Figure 6.1: Overview of the feature fusion systems used for this thesis, covering both concatenative and discriminative feature fusion.
reduce the size of uni-modal feature vectors can also be applied to reduce the size of the concatenated feature fusion vector.
The implementation of such an approach for this thesis is shown in Figure 6.1, and
is modelled after Potamianos et al.’s hierarchical LDA approach [143]. As the feature
reduction is performed by LDA, such a system provides the additional benefit over
feature reduction using PCA or DCT of choosing features based upon their ability to
most efficiently separate speech event classes from one another.
Potamianos et al.’s hierarchical LDA approach was so called because the acoustic and
visual features progressed through a hierarchy of LDA transformations, performed
first on the individual modality feature vectors and then again on the concatenated
feature vector resulting from these features. Although the approach conducted here does perform LDA feature reduction of the concatenated feature vectors, only the visual feature vector is itself LDA-derived; the acoustic feature vectors are left as the original A-MFCC or A-PLP features before concatenation. This approach was chosen because
the regular acoustic features offered good speech-event-separation performance and
any benefit of LDA feature reduction would likely be easily offset by the increased
time and processing required to take such an approach.
The process of transforming the concatenated feature vector into a smaller dimensional space was conducted in an identical manner to the transformation of the video feature vectors in Chapter 5, although only 5 frame vectors were combined instead of the 7 used for the video features. The concatenated feature vectors used for this purpose did
not use the deltas or accelerations of the underlying datatypes, and to limit the pro-
cessing and memory required to calculate the LDA transformation matrix, only every
4th concatenated feature vector was considered. Once the LDA transformation matrix
was obtained, it was used to extract the top 24 LDA features for each concatenated feature vector, and deltas and accelerations were appended to produce 72-dimensional feature vectors for the FF-LDA datatype.
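The stacking, subsampling, and LDA steps just described might be sketched as below. This is a hedged illustration: the function name, array shapes, and the use of scikit-learn's `LinearDiscriminantAnalysis` are assumptions for the example, not the thesis's actual tooling.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_lda_fusion(concat_feats, class_labels, n_out=24, stack=5, step=4):
    """Sketch of the discriminative (FF-LDA) fusion stage: stack `stack`
    consecutive concatenated audio-visual frames, subsample every `step`-th
    stacked vector to limit the memory needed to estimate the transform,
    then fit an LDA transform keeping the `n_out` most class-discriminative
    directions. Labels are taken from the centre frame of each stack."""
    T, D = concat_feats.shape
    idx = np.arange(stack)[None, :] + np.arange(T - stack + 1)[:, None]
    stacked = concat_feats[idx].reshape(-1, stack * D)
    labels = np.asarray(class_labels)[stack // 2 : stack // 2 + len(stacked)]
    lda = LinearDiscriminantAnalysis(n_components=n_out)
    lda.fit(stacked[::step], labels[::step])  # only every `step`-th vector
    return lda, stacked
```

Once fitted, `lda.transform(stacked)` yields the reduced features, to which deltas and accelerations would then be appended.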
This discriminative feature fusion datatype was then used to train the full range of
HMM and GMM models in the speech processing framework, in an identical manner
to the concatenative and individual features. An identical topology was chosen as for
concatenative feature fusion (11 states, 16 mixtures) to allow for easy comparisons to
be made between the two fusion techniques and the uni-modal modelling techniques.
6.4 Late integration
6.4.1 Introduction
One of the limitations of the early integration strategy is that, by combining both the
acoustic and visual modalities into a single feature vector, there is limited ability to
model the reliability of each modality. The ability to explicitly model the reliability
of either modality is very important for both speech and speaker recognition appli-
cations, for the simple reason that the discriminative ability of either modality can
vary widely in real world conditions with either modality behaving differently in the
presence of acoustic noise, visual degradation, tracking inaccuracies and individual
speaker characteristics. Fortunately, most sources of degradation typically affect one modality to a greater extent than the other, allowing the unaffected modality to take up the slack when the other has become degraded. Some examples of such single-modality degradation might be acoustic background noise (invisible to the camera), which would favour the visual modality, or a continuously moving speaker causing tracking difficulties, which would favour the acoustic modality.
Late integration systems combine the outputs of individual classifiers in the acoustic
and visual modalities, allowing the scores or decisions from each classifier to be eas-
ily weighted up or down based on the perceived reliability of either modality before
arriving at a final decision based on both modalities [89]. This approach therefore pro-
vides a simple mechanism for explicitly modelling the reliability of the acoustic and
visual modalities for audio-visual speech processing, and is an active area of research
in both speaker [104, 55, 20] and isolated speech recognition [49, 126, 69].
However, for continuous speech recognition late integration is considerably more difficult to implement, because the sequence of classes must be agreed upon between
the modalities before the decisions of either modality can be combined. An extreme
example of this approach clearly leads back to isolated speech recognition, where the
boundaries of each word or smaller speech event are determined by either the acoustic
speech models or another external source before each modality's classifiers are compared strictly within those boundaries. The other alternative is attempting to choose the highest-scored transcription by combining n-best transcriptions from both modalities, but difficulties can arise if a particular transcription is not represented in both
modalities.
Due to the difficulties of a late integration approach with continuous speech recog-
nition, only speaker verification experiments will be presented to demonstrate the
late integration approach to audio-visual speech processing. An alternative approach
which allows modelling stream reliability within the speech models will be presented
by the middle-integration-based SHMM approach in Chapters 7 and 8.
Figure 6.2: Overview of the output score fusion approach used for this thesis.
6.4.2 Output score fusion for speaker verification
The late integration approach to speaker verification will be demonstrated in this the-
sis using weighted sum score fusion of the output scores of the individual classifiers
in the acoustic and visual modalities, including normalisation of the underlying score
distributions before combination. This approach is depicted in Figure 6.2.
While a speaker identification approach would require that many scores are normalised
and fused before they can be ranked, the verification approach chosen for speaker
recognition in the speech processing framework developed in Chapter 4 has the ad-
vantage that each test utterance can be represented by a single score for each modality.
The output score fusion score can easily be calculated from these two scores using
s_f = α Z_a(s_a) + (1 − α) Z_v(s_v)    (6.1)
where sa and sv are the output scores of the audio and video classifiers, Za and Zv
the score-normalisation functions, and α is the weighting parameter from which the
individual stream weights are calculated.
As the scores from the HMM and GMM classifiers are given as log likelihood scores, the choice of weighted sum fusion corresponds to exponentially weighted product fusion of the likelihoods. However, most classifier and fusion strategies operate in the log-likelihood domain to avoid having to calculate exponentials and to deal accurately with the multiplication of very small magnitude likelihoods.
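The equivalence between weighted-sum fusion of log scores and exponentially weighted product fusion of likelihoods can be checked in a few lines (a sketch; the function name is an assumption for illustration):

```python
import math

def fuse_log_scores(log_a, log_v, alpha=0.5):
    """Weighted-sum fusion of log-likelihood scores, as in equation (6.1)
    but without normalisation. Working in the log domain avoids computing
    exponentials and underflow from products of tiny likelihoods."""
    return alpha * log_a + (1 - alpha) * log_v
```

Exponentiating the fused log score recovers the exponentially weighted product `exp(log_a)**alpha * exp(log_v)**(1-alpha)` mentioned above.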
Output score fusion must wait for the individual classifiers in each modality to finish,
and the individual scores sa and sv are therefore gathered over the entire utterance.
If a decision is required to be reached on smaller regions than the entire utterance,
some form of segmentation must first be used to limit the original period on which
the classifiers are evaluated.
For the experiments presented later in this chapter, the individual scores before fusion
were identical to those used for speaker verification in Chapter 5, and are already nor-
malised against the background models, and so no further background-speech nor-
malisation is required after the output-score fusion.
6.4.3 Score-normalisation
Score normalisation is a technique used in multimodal biometric systems to com-
bine scores from multiple different classifiers [78] that may have very different score
distributions. By transforming the output of the classifiers into a common domain,
the scores can be fused through a simple weighted combination of scores, where the
weights canmore accurately represent the true dependence of the final score on the in-
dividual classifiers. In this section the zero normalisation [78] method will be demon-
strated for the purpose of normalising audio and video classifier scores before fusion
can occur.
Zero normalisation transforms scores from different classifiers that are assumed to be normally distributed into the standard normal distribution N(μ = 0, σ² = 1) using the following function for each modality i:

Z_i(s_i) = (s_i − μ_i) / σ_i    (6.2)

where s_i is an output score from a classifier whose scores follow a distribution S such that S ∼ N(μ_i, σ_i²).
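A minimal sketch of zero normalisation as in equation (6.2), with the parameters μ_i and σ_i estimated from a held-out set of evaluation scores (function names are illustrative assumptions):

```python
import numpy as np

def zero_norm_params(eval_scores):
    """Estimate the zero-normalisation parameters (mu_i, sigma_i) for one
    modality from a held-out set of evaluation scores."""
    scores = np.asarray(eval_scores, dtype=float)
    return float(scores.mean()), float(scores.std())

def zero_norm(s, mu, sigma):
    """Map a raw classifier score into the standard normal domain (eq. 6.2)."""
    return (s - mu) / sigma
```

Applied to the evaluation scores themselves, the normalised scores have zero mean and unit variance by construction.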
Figure 6.3: Histograms of speaker verification scores (score vs. frequency, A-PLP and V-LDA-MRDCT classifiers) (a) before and (b) after zero normalisation.
The estimated normalisation parameters µi and σi are typically calculated on a left-
out evaluation portion of the data. Then during recognition, the original scores si are
replaced in the fusion equations with Zi (si) as shown in (6.1).
For the implementation of zero-normalisation conducted in this thesis, each of the test-
ing partitions in the 12 testing configurations of the XM2VTS database defined in the
speech processing framework used the corresponding evaluation partition to estimate
the normalisation parameters. For these experiments, the acoustic (and video) normal-
isation parameters were calculated in clean conditions, and these clean normalisation
parameters were used for all testing observations, including the acoustically noisy
versions.
An example of the effect of zero-normalisation on the first testing configuration for
text-independent speaker verification is shown in Figure 6.3 as a histogram of the ob-
served likelihoods of getting each score from each modality over entire utterances.
From Figure 6.3(a) it can be seen that while the original acoustic and visual classifiers have similarly shaped distributions, the variance of the video classifiers' scores is around twice that of their acoustic counterparts, giving the video scores a comparatively large impact on the final fusion score. By normalising the means and variances of the individual classifiers' distributions to match N(0, 1), as shown in Figure 6.3(b), both modalities can be considered equal before each modality is weighted according to the chosen environment or application.
6.4.4 Modality weighting
The primary benefit of late integration over early integration is that the reliability
of each modality can be modelled simply through the use of multiplicative stream
weights applied before the individual scores are combined to form the fusion score.
While it is certainly possible for the stream weights to be represented using individual weights γ_a and γ_v for the audio and video streams respectively, the common convention is to have the weights sum to unity, so that both can be represented using a single weighting parameter, α:

γ_a + γ_v = 1    (6.3)
γ_a = α    (6.4)
∴ γ_v = 1 − α    (6.5)
In an ideal fusion system the value of the weighting parameter α should be adaptive to
the prevailing conditions, such that the reliability of each modality can be estimated
on an utterance or even second-by-second basis and the reliance of the fusion on ei-
ther modality can easily vary based on this estimation. This is a relatively new area of
research in audio-visual speech processing, although a number of efforts have taken
place based upon SNR estimates [49], entropy measures [118], or the degree of acous-
tic voicing [126]. One of the more popular methods for adaptive fusion is based on some measure of the perceived quality of the individual classifiers [186, 55]. However, most audio-visual speech processing systems dealing with modality weighting parameters generally use a training or evaluation partition to determine the best weight for each modality on data similar to that under test, and have used such weights for all
Figure 6.4: Performance of weighted output score fusion for speaker verification (equal error rate in %) as α is varied from 0 to 1, at 0, 6, 12 and 18 dB SNR, with the average over all noise levels.
of the testing utterances, with no direct consideration of the individual environmental
conditions in each utterance.
To determine the modality weights for the comparative experiments in this chapter,
output score fusion experiments were conducted using the A-PLP acoustic classifiers
alongside the V-LDA-MRDCT visual classifiers in performing text-dependent speaker
verification experiments over the entire speaker verification framework. The weight-
ing parameter α was varied from 0.0 to 1.0 in increments of 0.1 and the EERs of each
normalised output fusion combination was recorded for each weighting parameter
over all acoustic noise levels.
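The tuning procedure can be sketched as follows; the `eer` helper is a simple threshold-sweep approximation, and both function names are assumptions for illustration rather than the thesis's actual tooling:

```python
import numpy as np

def eer(scores, is_target):
    """Approximate equal error rate: sweep thresholds down the sorted
    scores and return the point where the false-accept and false-reject
    rates are closest."""
    order = np.argsort(scores)[::-1]
    tgt = np.asarray(is_target, dtype=bool)[order]
    far = np.cumsum(~tgt) / max((~tgt).sum(), 1)    # false accept rate
    frr = 1.0 - np.cumsum(tgt) / max(tgt.sum(), 1)  # false reject rate
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2)

def sweep_alpha(z_audio, z_video, is_target):
    """Vary the weighting parameter alpha from 0.0 to 1.0 in steps of 0.1
    and record the verification EER of the fused, normalised scores."""
    alphas = np.round(np.arange(0.0, 1.01, 0.1), 1)
    return {float(a): eer(a * z_audio + (1 - a) * z_video, is_target)
            for a in alphas}

# the best unsupervised weight is then the alpha with the lowest
# (average) EER: best_alpha = min(results, key=results.get)
```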
The results of these tuning experiments are shown in Figure 6.4 for text-dependent
speaker verification. The verification EER at each acoustic noise level is shown; to allow the late integration system to perform well unsupervised over all noise levels, the average performance over all noise levels was also calculated for each α and is shown as a dashed line.
In order to compare the late-integration system presented here with the feature-fusion
systems, the late integration system had to be designed such that it could be run over
all noise values unsupervised. While the optimal approach would result from a sys-
tem that could adaptively estimate the noise level present in each utterance and choose
an appropriate α, such an adaptive system is non-trivial to implement [69]. Accord-
ingly, it was decided to simply choose the value of α which had the lowest average
EER over all noise levels, which can be seen in Figure 6.4 to be α = 0.2.
The choice of the weighting parameter should reflect the environment that the final
speech processing experiments will be running in. In this case, because the visual and
clean-acoustic speaker verification EERs are already very low, more attention was paid to the fusion performance in noisy acoustic conditions. By choosing the best average α over all noise levels from the weighting experiments, an output-fusion system biased towards noisy conditions resulted, because the larger variances of the noisier 0 and 6 dB SNR α-curves pulled the best α down in comparison to the relatively shallow α-curves
of the cleaner 12 and 18 dB SNR fusion experiments.
6.5 Speech recognition experiments
6.5.1 Results
The speech recognition experiments using the feature-fusion datatypes described ear-
lier were conducted according to the speech processing framework, and therefore in
an identical manner to the individual modality speech recognition experiments re-
ported in Chapter 5. Speaker independent experiments were performed using the background HMMs, and speaker dependent experiments using the HMMs adapted to each speaker being tested; the results are shown in Figures 6.5 and 6.6 respectively. The V-
LDA-MRDCT video speech recognition performance, which was unaffected by acous-
Figure 6.5: Speaker-independent feature-fusion speech recognition performance (word error rate vs. signal-to-noise ratio in dB, for A-PLP, V-LDA-MRDCT, concatenative fusion and discriminative fusion), averaged over all 12 configurations of the XM2VTS database.
tic noise, is also shown for comparative purposes for both plots.
As has already been discussed, output score fusion experiments were not performed
for speech recognition, but similar benefits can be seen in the SHMM models described
later in this thesis.
6.5.2 Discussion
From an examination of the results presented in Figures 6.5 and 6.6, it would appear that the discriminative feature-fusion features do have a benefit over the concatenative
feature-fusion for SNRs above 6dB, although neither of the feature fusion techniques
could provide an improvement over the acoustic-only speech recognition experiments
for the cleaner conditions.
Figure 6.6: Speaker-dependent feature-fusion speech recognition performance (word error rate vs. signal-to-noise ratio in dB, for A-PLP, V-LDA-MRDCT, concatenative fusion and discriminative fusion), averaged over all 12 configurations of the XM2VTS database.
Below 6dB SNR, the discriminative feature fusion speech recognition performance de-
grades compared to concatenative feature fusion for all of the experiments performed
above. This is likely to be due to the effect of extreme train/test mismatch by testing
the speech recognition performance on 0dB SNR acoustic data when the models were
trained in clean conditions. While such a large mismatch also has a similar effect on
the concatenative fusion performance, the mismatch only affects the acoustic features
in the concatenation. By comparison the LDA process has the effect of combining the
acoustic and visual features within every single feature of the discriminative feature
fusion vector, causing the train/test mismatch to have a detrimental effect on the
entire feature vector rather than just the acoustic portion. These results are similar to
those reported by Potamianos et al. [145], where they found that concatenative fusion
began to outperform discriminative for their experiments at around 0 dB SNR.
From these experiments, it can be seen that not one feature-fusion experiment pro-
vided better speech recognition performance than both individual modalities at all
acoustic noise levels. This effect, when a fusion system is outperformed by one of its
individual components, is referred to as catastrophic fusion and should be avoided as
this condition means that better performance could be obtained using an individual
system. For the speaker independent speech recognition experiments the discrimi-
native features were only catastrophic for 0dB SNR, and the concatenative were only
non-catastrophic (compared to audio) for 0 and 6 dB SNR. In the speaker dependent
experiments both the concatenative and discriminative feature fusion experiments
were only non-catastrophic at 6 dB SNR, and were improved upon at all other points
by either the acoustic or visual features. In fact, over all of the speech recognition
experiments reported above, the only point where better or equivalent performance
couldn’t be obtained using an individual modality was at the 6 dB SNR point. At all
other SNRs, the best performance could be obtained using either the acoustic or the
visual modalities alone.
6.6 Speaker verification experiments
6.6.1 Results
Speaker verification experiments were performed according to the speech processing
framework, using both the early and late integration techniques introduced in this
chapter. The results of these experiments were recorded using the EERs for each of
the integration techniques at each level of acoustic noise under test, and are shown in
Figure 6.7.
Figure 6.7: Simple integration strategies for text-dependent speaker verification (equal error rate in % vs. signal-to-noise ratio in dB, for A-PLP, V-LDA-MRDCT, concatenative fusion, discriminative fusion and output score fusion) over noisy acoustic conditions.
6.6.2 Discussion
From a comparison of the three fusion systems shown in Figure 6.7 (concatenative
feature-fusion, discriminative feature-fusion and output-score fusion), it can easily be
seen that the best performance can be obtained using the late integration approach
of output-score fusion. While the discriminative feature-fusion system does perform
slightly better in cleaner conditions, the very poor performance in noisy conditions
makes both feature-fusion systems unsuitable for speaker verification in many real
world conditions.
The late integration approach produced better speaker verification performance,
particularly at high noise levels, due to the ability to model the reliability of the acous-
tic and visual classifiers using the stream weighting parameter α. Even better late
integration performance could be obtained by allowing the weighting parameter to
be varied based on the prevailing environmental conditions, as can be seen by look-
ing back at the best performing points of the α-curves in Figure 6.4. However, being
able to take advantage of these performance increases in an unsupervised manner re-
quires that the noise level can be estimated and an appropriate weighting parameter
determined automatically.
Within the early integration systems, the concatenative approach appears to provide
the best performance in noisy conditions, while the discriminative is slightly better
in clean conditions. Both early integration systems do improve upon the acoustic uni-modal performance, but still remain catastrophic when compared to the visual uni-modal performance in noisy conditions.
6.7 Speech and speaker discussion
In this section, early integration methods of performing both speech and speaker
recognition were investigated. Both concatenative feature fusion and LDA based dis-
criminative feature fusion were considered. In general, the discriminative feature
fusion technique was found to provide an improvement for speech recognition and
speaker recognition in clean conditions, but the discriminative process was found to
remove some robustness to acoustic noise that was present in the concatenative fea-
ture fusion systems.
However, for both the speech and speaker recognition tasks, neither technique was found to provide a major improvement over either the acoustic or visual features across the whole
range of acoustic noise conditions presented here. Assuming that the noise level could
be estimated, in most cases similar or better performance could be obtained by choos-
ing one of the individual acoustic or visual systems over either of the feature fusion
systems.
Additionally, due to the need for the uni-modal classifiers to come to a decision before output score fusion can occur, such an approach first requires that the acoustic and visual information being classified be segmented at the level at which fusion can occur.
This is not a major issue with speaker verification, as the speaker verification occurs
over the entire utterance, but difficulties can arise if events smaller than an utterance
are being considered for classification.
For this reason, late integration systems designed to recognise speech generally only work well with isolated words, and such simple late integration systems are nearly impossible with continuous speech due to the difficulty in isolating the words before isolated-word classification can occur. However, a group of alternative approaches that do allow for the benefits of stream weighting within the continuous speech paradigm will be demonstrated with multi-stream HMMs in the next chapter.
6.8 Chapter summary
In this chapter, two simple integration strategies were introduced that can easily be implemented using existing classifier techniques, by either fusing features before classification or fusing the output scores after classification. After reviewing existing approaches to simple integration in the literature, both concatenative and discriminative feature fusion were introduced as viable early integration strategies for modelling audio-visual speech. Similarly, for late integration, weighted-sum output score fusion was introduced to allow modelling of the reliability of each stream for speaker verification applications.
In the second half of this chapter, these simple integration strategies were imple-
mented for speech and speaker recognition to serve as a comparative baseline for the
middle-integration based SHMM experiments conducted in the remaining portions of
this thesis.
Chapter 7
Synchronous HMMs
7.1 Introduction
In the previous chapter, methods of fusing acoustic and visual information both be-
fore and after classification were introduced. While the early integration approach
was found to work well in clean conditions, performance degraded considerably in
noisy acoustic conditions due to the inability to model the reliability of the acoustic
and visual speech features independently. The late integration approach introduced a
method of combining separate acoustic and visual classifier scores, allowing weights to be applied before combination. This approach allowed for non-catastrophic
fusion at all noise levels provided that appropriate weights are applied, but no de-
cision could be made until both classifiers are complete, limiting the ability of late
integration to easily perform continuous speech recognition.
This chapter will introduce the concept of middle integration methods that combine the close time coupling of the feature-fusion approach with the ability to model stream reliability inherent in the late integration approach. Particular focus will be given to the SHMM, as it can be trained easily using existing techniques derived from the uni-modal HMM training process outlined in Chapter 3, and is known to work well for
speech and speaker recognition applications.
In order to improve understanding of the SHMM model, this chapter will look at novel research into the effect that stream weighting of the acoustic and visual modalities, both in training and testing, has on the final speech recognition performance. While
researchers have studied stream weights of MSHMMs, no consideration has yet been given in the literature to the difference between the training and testing processes for SHMMs under differing stream weights.
Additionally, this chapter will introduce the concept of normalisation, usually used in
a late integration design, within the SHMM model to normalise the differing acous-
tic and visual models within the SHMM states on a frame-by-frame basis. Both full
mean and variance normalisation, and variance-only normalisation will be investi-
gated with both showing similar performance in flattening the performance curve as
the stream weights are varied, allowing for more latitude in choosing appropriate
stream weights.
7.2 Multi-stream HMMs
Multi-stream HMMs are a group of temporally-coupled modelling techniques de-
signed to extend the effectiveness of the uni-modal HMM structure for speech pro-
cessing into the multi-modal domain. A number of variations exist within the broad
label of multi-stream HMMs, with the major difference between each model hinging
upon where the acoustic and visual information is tied, or coupled, together. All these
techniques fall under the even broader umbrella of dynamic Bayesian networks [124],
and examples of the most popular multi-stream HMMs in use for AVSP are shown in
Figure 7.1.
The simplest multi-stream HMM is the SHMM, shown in Figure 7.1(b), which cou-
ples the acoustic and visual observations at every frame. Such an approach results in
an almost identical HMM structure to a uni-modal HMM, but with two observation-
Figure 7.1: Various multi-stream HMM modelling techniques used for AVSP in comparison to the uni-modal HMM: (a) uni-modal (acoustic) HMM, (b) synchronous HMM, (c) state-asynchronous HMM, (d) coupled HMM, (e) product/factorial HMM. Acoustic emission densities are shown in blue and visual in red.
emission GMMs in each state instead of one. Due to the simplicity of the SHMM and
existing implementations for separating differing features in acoustic speech recogni-
tion (such as static and dynamic features), the earliest attempts at middle integration
for AVSP were undertaken with this modelling technique by Potamianos et al. [141]
for speech recognition and Wark et al. [187] for speaker recognition.
While the SHMM did, and still continues to, work well for AVSP, researchers have continued to investigate alternative multi-stream HMMs in which the acoustic and visual information is not coupled as tightly as in the SHMM. Such an approach is
for some time that the visual speech activity tends to precede the acoustic signal it
generates by up to 120 ms [16, 95].
To handle the asynchrony between the audio and visual speech information whilst still maintaining alignments at the model boundaries, a state-asynchronous HMM [12] can be generalised as two uni-modal HMMs tied together at the boundaries of the speech event being modelled. An example of such an approach is shown in Figure 7.1(c).
An alternative approach is taken in the coupled HMM [122] shown in Figure 7.1(d)
where the acoustic and visual states can transition within the asynchronous region,
but remain tied at the model boundaries.
While both the asynchronous and coupled HMMs can be implemented directly, and have been for AVSP by multiple researchers [12, 122], a common simplification is to implement a generalised form of such networks as a product or factorial HMM [145]. While such an approach does require more states (S² compared to 2S) than the dynamic Bayesian network approach, it does allow for implementation using a synchronous HMM with additional states linked as shown in Figure 7.1(e), such that multiple states share the same acoustic or visual state models.
The main difference between the asynchronous and coupled HMMs when implemented as product HMMs arises from the method of calculating the additional state transitions and probabilities. An additional simplification that can be applied is to
limit the permitted asynchrony in the product HMM, which in the extreme limit of
no asynchrony would result in only the diagonal of the product HMM remaining as a
SHMM.
Both the asynchronous and coupled HMMs, whether implemented directly or using
product HMMs, are much more complicated to train and test than the comparatively
simple SHMM, and thus have mostly been limited to small vocabulary recognition
tasks [49, 124, 145] with only limited large vocabulary implementations [126]. The
only middle integration method that has successfully been demonstrated in real-world conditions for large vocabulary speech recognition appears to be Potamianos et al.'s
implementation of the SHMM [145], although a large vocabulary product-HMM ap-
proach has been demonstrated through lattice re-scoring [126].
7.3 Synchronous HMMs
7.3.1 Introduction
A SHMM can be viewed as a regular single-stream continuous HMM, but with two
observation-emission Gaussian mixture models (GMMs) for each state, one for audio
and one for video, as shown in Figure 7.1(b). SHMMs have previously been used
in audio-only speech recognition tasks to consider differing types of audio features
separately, such as separating static features from time-derivative-based features [194]. For AVSP,
audio-visual SHMMs use a different stream for each modality, and this approach has
been used extensively for both speech and speaker recognition research [83, 49, 126,
188, 145].
SHMMs are at an advantage over feature-fusion HMMs primarily because of their
ability to weight each modality on an individual basis. Feature-fusion HMMs are
trained on state models estimated over the entire concatenated or discriminative audio-
visual vector. Because both modalities' features are combined in the one model, it
is not possible within the feature fusion design to consider a situation where one of
the modalities has more weight than the other. By allowing the two modalities to be
treated independently, the SHMM model is more flexible and can generally provide
greater AVSP performance [145].
Given the audio and visual observation vectors oa,t and ov,t, the observation-emission
score of SHMM state u is given as

P(oa,t, ov,t | u) = P(oa,t | u)^α P(ov,t | u)^(1−α)    (7.1)

where α is a single stream weighting parameter, 0 ≤ α ≤ 1, defined identically to that
used in Chapter 6 for output score fusion.
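In implementation, (7.1) is usually evaluated in the log domain, where the weighted product becomes a weighted sum of the two per-stream log-likelihoods. The sketch below illustrates this with a single diagonal Gaussian standing in for each stream's GMM (all function names and parameters are illustrative, not from the thesis):

```python
import math

def gaussian_log_likelihood(obs, mean, var):
    """Log-likelihood of an observation vector under a diagonal Gaussian,
    standing in here for the per-stream GMM of a SHMM state."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (o - m) ** 2 / v)
               for o, m, v in zip(obs, mean, var))

def shmm_log_score(obs_audio, obs_video, state, alpha):
    """Log-domain form of (7.1):
    log P(o_a, o_v | u) = alpha log P(o_a | u) + (1 - alpha) log P(o_v | u)."""
    log_pa = gaussian_log_likelihood(obs_audio, state["mean_a"], state["var_a"])
    log_pv = gaussian_log_likelihood(obs_video, state["mean_v"], state["var_v"])
    return alpha * log_pa + (1.0 - alpha) * log_pv
```

Setting alpha to 1.0 or 0.0 recovers the pure audio or pure video state score respectively.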
The complete SHMM parameter set can then be defined as λ̂av = [λav, α], where λav = [Aav, Ba, Bv].
In the underlying HMM parameters λav, the joint state-transition probabilities are con-
tained in Aav, and Ba and Bv represent the observation-emission probability param-
eters of the audio and video modalities respectively [145]. Training of the SHMM is
the process of estimating these parameters. The parameters in λav can be estimated in
an automatic manner using Baum-Welch re-estimation (see Chapter 3), and the stream
weight parameter α is typically estimated by maximising speech performance on an
evaluation session, although more flexible methods based on the concept of stream
reliability have been developed [145].
7.3.2 SHMM joint-training
In the existing literature [126], the estimation of the underlying HMM parameters λav
has been performed in one of two manners: either the single-stream parameters are
estimated independently and combined, or the entire set of parameters is jointly
estimated using both modalities. Because the combination method makes the incor-
rect assumption that the two HMMs were state-synchronous before combination, bet-
ter performance has been shown to be obtained with the joint-training method [126],
which is used to train the SHMM models evaluated in this chapter.
The Baum-Welch re-estimation algorithm is the iterative process used to calculate the
HMM parameters from a training set of representative speech events. The algorithm
was covered in detail in Chapter 3, but can be briefly outlined as follows:
1. Use the HMM parameters (emission and state transition likelihoods) and the
training data to estimate the state-occupation probability Lj (t) for all states j
and times t.
2. For each stream, use the state-occupation probability and the training data to
re-estimate new HMM parameters.
3. Repeat at Step 1 if the HMM parameters have not converged.
As the Baum-Welch algorithm requires an initial set of HMM parameters to form the
first estimate of Lj (t), the parameters are generally initialised by segmenting the train-
ing observations equally amongst the state models. From these segmented training
observations the initial set of observation-emission parameters are determined for
each state. From this point, the Baum-Welch algorithm can take over to refine the
state-alignments and HMM parameters until they have converged upon a solution.
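The equal segmentation described above can be sketched as follows (an illustrative simplification; the helper name is not from the thesis):

```python
def flat_start_segments(num_frames, num_states):
    """Divide the frames of one training utterance equally amongst the
    HMM states, returning a (start, end) frame range per state.  These
    segments supply the initial observation-emission statistics before
    Baum-Welch re-estimation refines the state alignments."""
    bounds = [round(s * num_frames / num_states) for s in range(num_states + 1)]
    return [(bounds[s], bounds[s + 1]) for s in range(num_states)]
```

For example, a 100-frame utterance over an 11-state model yields 11 contiguous segments of 9 or 10 frames each.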
In an audio-visual SHMM, it can be seen that the choice of the stream weighting pa-
rameter α only has a direct effect on the estimation of the state-occupation probabilities in
Step 1, as this probability is directly based on the observation-emission likelihoods
calculated using (7.1). As the observation-emission likelihoods of each stream are cal-
culated independently in Step 2, they are not directly affected by the stream weighting
parameter, to the extent that they will still be calculated even if the pertinent stream is
weighted to nothing.
The jointly-trained SHMMs used for the experiments in this chapter were trained
according to the speech processing framework, but due to limitations of the HMM
Toolkit [194], only speaker-independent background word SHMMs could be trained.
For this reason, only speaker-independent speech recognition performance of SHMMs
will be evaluated in this chapter.
The SHMM topology was chosen to match the uni-modal HMM topologies used for
Chapter 5, with the number of states taken from the acoustic HMM, and the number
of mixtures for each stream taken from both. The resulting SHMM topology had 11
states, with an 8-mixture acoustic GMM and a 16-mixture visual GMM representing each
state of the speaker-independent models. Training was performed with the HMM
Toolkit [194], which already had built-in support for joint-training of SHMMs.
7.4 Weighting of synchronous HMMs†
7.4.1 Introduction
The primary benefit in choosing a SHMM approach for AVSP over early integration is
the ability to weight the acoustic and visual streams based on the perceived reliability
of each individual modality. Accordingly, knowing what effect the stream weights have
on the final performance of the SHMM model is an important precursor to training
SHMMs to model audio-visual speech. While a number of researchers have studied
the effect of SHMM stream weights during the decoding of audio-visual speech [61,
69], there has been no research in the literature on what effect the stream weights have
during the training process.
In this section, a study will be performed to determine what effect, if any, varying the
stream weights during the SHMM training process has on the final speech recog-
nition performance. The outcome of these experiments will also be compared and
contrasted with varying the stream weights during speech decoding, where they are
already known to have a significant impact on performance [61].
Figure 7.2: Speech recognition performance (word error rate against αtest) using SHMMs as αtest is varied, for 0, 6, 12 and 18 dB SNR. Each point represents a different αtrain and the line is the average over all αtrain values for each αtest.
7.4.2 Results
To investigate the effect of varying the training and testing stream weights indepen-
dently, the single stream weighting parameter α was sub-divided into two separate
parameters, αtrain and αtest, representing the stream weights used during training and
testing of the SHMM respectively. Eleven training alphas, αtrain = 0.0, 0.1, ..., 1.0,
and eleven testing alphas, αtest = 0.0, 0.1, ..., 1.0, were combined to arrive at 121 individual
speech experiments. These experiments were performed over all 4 testing noise levels,
resulting in a total of 484 tests. To limit the processing time, these weighting experi-
ments were only performed on the first configuration of the XM2VTS database under
the speech processing framework.
The resulting WER obtained for each of these experiments against αtest is shown in
Figure 7.2. Each of the points within each αtest is a differing αtrain, and the line shows
Figure 7.3: Speech recognition performance (word error rate against αtrain) using SHMMs as αtrain is varied. αtest is chosen based on the best average performance in Figure 7.2 (0 dB SNR: αtest = 0.7; 6 dB SNR: αtest = 0.8; 12 dB SNR: αtest = 0.9; 18 dB SNR: αtest = 0.9).
the average WER over all αtrain for a particular αtest. To illustrate the relatively static
performance as αtrain is varied, a similar plot of the WERs as αtrain is varied at the best
performing αtest for each noise level is shown in Figure 7.3.
7.4.3 Discussion
From examining both these figures, it can be seen that the variance in WER of the
entire range of αtrain is of little-to-no significance to the final speech recognition per-
formance. The choice of αtest is clearly the major factor in the speech performance,
with the minimum WER achieved with αtest around 0.8− 0.9, at least in the cleaner
conditions. However, there appears to be no significant trend visible in the WER as
αtrain varies from 0.0 to 1.0.
As discussed earlier, the training of a HMM is basically an iterative process of contin-
uously re-estimating state boundaries, and then re-estimating the HMM parameters
based on those boundaries. The value of αtrain has no direct effect on the re-estimation
of the HMM parameters, so the only effect of αtrain comes about when using the es-
timated HMM parameters to arrive at a new set of estimated state boundaries. For
example, if αtrain = 0.0, then only the video parameters will determine the state bound-
aries during training. Similarly αtrain = 1.0 will only use the audio, and values between
those two extremes will use a combination of both modalities for the task.
As the speech transcription is known, training of a HMM is a much more constrained
task than the decoding of unknown speech. The 18 dB SNR results presented in the
previous section show that the decoding WER varies from below 5% for audio-only to
above 35% for video-only, when the testing weight parameter, αtest, is set at
the extremes of 1.0 and 0.0 respectively. That changing the training weight parameter,
αtrain, has no similar effect on the final speech recognition performance suggests that,
at least in this case, the video or audio models perform equally well in estimating the
state boundaries during training, and there appears to be no real benefit to any fusion
of the two.
7.5 Normalisation of synchronous HMMs†
7.5.1 Introduction
Score normalisation is a technique used in multimodal biometric systems to com-
bine scores from multiple different classifiers [78] that may have very different score
distributions. By transforming the output of the classifiers into a common domain,
the scores can be fused through a simple weighted combination of scores, where the
weights can more accurately represent the true dependence of the final score on the
individual classifiers. Normalisation was used previously in the output score fusion
experiments in Chapter 6.
Two approaches were chosen to normalise the acoustic and visual streams within the
SHMM structure: full and variance-only normalisation. The full normalisation ap-
proach allows both the mean and variances of the two modalities to be matched, but
requires access to the internals of the Viterbi decoder. The variance-only normalisation
technique was developed to allow for a similar effect through a simple modification
of the stream weights, allowing implementation in a wider range of circumstances.
Full mean and variance normalisation
For the SHMM normalisation technique, it was chosen to adapt the video-score dis-
tribution to that of the audio-score, rather than perform zero normalisation on both
distributions. This configuration was chosen because zero-normalisation would cause
the state-emission log-likelihood-scores to be much smaller than the state-transition
log-likelihoods, causing the final speech recognition to be overwhelmed by the lat-
ter. By using the audio log-likelihood-score distribution as a template, the final state-
emission scores should be in a similar range to that of the un-normalised SHMM.
To perform the video normalisation, the output of the video-state models was first
transformed to the standard normal distribution, and then to the audio distribution. The
final log-likelihood score sf from the combined SHMM state-model is therefore given
as

sf = α sa + (1 − α) [ ((sv − µv)/σv) σa + µa ]    (7.2)

where (sv − µv)/σv maps the video score distribution onto N(0, 1), and the subsequent
scaling by σa and shift by µa maps it onto the audio distribution N(µa, σa²).
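The normalisation of (7.2) can be sketched directly, mapping the per-frame video log-likelihood onto the audio score distribution before the weighted combination (function and argument names are illustrative, not from the thesis):

```python
def full_normalised_score(s_a, s_v, alpha, mu_a, sigma_a, mu_v, sigma_v):
    """Fused SHMM state score of (7.2).  The video score s_v is first
    mapped to the standard normal distribution and then onto the audio
    score distribution N(mu_a, sigma_a**2) before weighting."""
    s_v_norm = (s_v - mu_v) / sigma_v * sigma_a + mu_a
    return alpha * s_a + (1.0 - alpha) * s_v_norm
```

A video score at its own distribution mean maps exactly onto the audio mean, so the two streams are then fused on a common scale.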
One of the problems with this form of normalisation, however, is that implementing
the full mean and variance normalisation of (7.2) requires access to the interior of the
Viterbi decoder algorithm, which can be difficult with publicly available HMM tools,
such as the HMM Toolkit [194].
Variance-only normalisation
To overcome this difficulty, another option is to only normalise the variances of the two
score log-likelihood distributions, as this can be implemented solely through adjust-
ment of the stream weighting parameter. Provided that the fused score log-likelihood
distribution after normalisation is not too dissimilar to the original, to prevent problems with over-
or under-whelming the state-transition scores, the means of the two score distributions
do not necessarily need to be equal. This is because speech recognition is a compar-
ative task, and a change in mean on a whole-stream basis will have no effect on the
paths through the speech recognition lattice, as each path will be affected similarly.
Therefore, if mean normalisation is not required, normalisation can be more easily per-
formed by considering the final modality weights (γa,final, γv,final) to be a combination
of the intended test weights and the calculated normalisation weights:

γa,final = γa,test × γa,norm    (7.3)
γv,final = γv,test × γv,norm    (7.4)

where the testing and normalisation weights can further be expressed in terms of the
single weighting parameters αtest and αnorm respectively:

γa,final = αtest × αnorm    (7.5)
γv,final = (1 − αtest)(1 − αnorm)    (7.6)
However, to ensure the state-emission scores remain in the same general relationship
with the state-transition scores, the stream weights should sum to 1. Using (7.5) and
(7.6), the final weight parameter becomes:
Figure 7.4: Distribution of per-frame scores for individual A-PLP audio and V-LDA-MRDCT video state-models within the SHMM under different types of normalisation: (a) no normalisation, (b) full normalisation, (c) variance normalisation.
αfinal = γa,final / (γa,final + γv,final)    (7.7)

αfinal = αtest αnorm / (αtest αnorm + (1 − αtest)(1 − αnorm))    (7.8)

To calculate the normalisation weighting parameter αnorm, the following scaling property
of normal distributions can be used,

k N(µ, σ²) ∼ N(kµ, (kσ)²)    (7.9)

to equalise the standard deviations of the two weighted score distribu-
tions:

αnorm σa = (1 − αnorm) σv    (7.10)

αnorm = σv / (σa + σv)    (7.11)
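Equations (7.8) and (7.11) can be transcribed directly; the sketch below is illustrative (function names are not from the thesis):

```python
def alpha_norm(sigma_a, sigma_v):
    """Normalisation weight of (7.11), which equalises the weighted
    standard deviations of the two score distributions."""
    return sigma_v / (sigma_a + sigma_v)

def alpha_final(alpha_test, a_norm):
    """Combined weight of (7.8): the product of the testing and
    normalisation weights, renormalised so the stream weights sum to 1."""
    num = alpha_test * a_norm
    return num / (num + (1.0 - alpha_test) * (1.0 - a_norm))
```

With the evaluation score deviations reported in Table 7.1 (σa = 9.52, σv = 28.71), alpha_norm gives 0.751, and mapping αtest = 0.5 through alpha_final reproduces the 0.75 entry of Table 7.2.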
7.5.2 Determining normalisation parameters
Before the score-normalisation could occur, the normalisation parameters of both dis-
tributions were determined by scoring the known transcriptions on the evaluation
Datatype        µi       σi
A-PLP          −59.81    9.52
V-LDA-MRDCT     −6.23   28.71

Table 7.1: Normalisation parameters determined from the per-frame evaluation score distributions.
session with stream weight parameter, α, set such that only the modality of interest
was being tested (i.e. α = 0 for the video frame-scores and α = 1 for the audio).
A full speech recognition task was also attempted (rather than force alignment with
a transcription) to calculate the distribution parameters, but no major difference was
noted in the final parameters. This was most likely because the difference between the
two modalities’ score distributions was much larger than any difference between the
score distributions of different state models within a particular modality.
The scores of the best path were then recorded on a frame-by-frame basis to determine
the score-distribution of each modality, shown in Figure 7.4(a). The normalisation
parameters, shown in Table 7.1, were then estimated from the score distributions for
each modality.
The effect of full mean-and-variance normalisation using these parameters on the
score-distribution is shown in Figure 7.4(b). It can be seen that the audio score dis-
tributions remain untouched, and the video scores have been transformed into the
same domain as the audio. Because this normalisation occurs within the Viterbi de-
coding process, an in-house HMM decoder was used to implement this functionality,
as it was not possible within the HMM Toolkit [194].
To perform variance-only normalisation, the normalisation parameters shown in Ta-
ble 7.1 and (7.11) were used to arrive at a normalisation weighting parameter of
αnorm = 0.751.
Using these normalisation weighting parameters and the relationship shown in (7.8),
any intended αtest can be mapped to the equivalent α f inal which includes the effects of
variance-normalisation, as shown in Table 7.2.

αtest   0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
αfinal  0.00  0.25  0.43  0.56  0.67  0.75  0.82  0.88  0.92  0.96  1.00

Table 7.2: Final weighting parameter αfinal calculated from the intended weighting parameter αtest using the normalisation parameter αnorm = 0.751.

The outcome of applying these variance-normalisation parameters on the unweighted
score distributions is shown in Figure 7.4(c). It can be seen that the variance of the two score distributions has been
equalised, while the means are still very separate, although both have changed from
the non-normalised score distributions.
7.5.3 Results
To investigate the effect of these two normalisation techniques on speech recognition
performance, a series of tests was conducted at varying levels of αtest for both meth-
ods of score normalisation. These tests were conducted using the models trained in
the previous section with a training weight parameter of αtrain = 0.8. As discussed
previously, the choice of the training weight parameter was fairly arbitrary, but this
value was chosen as it had the lowest average WER over all values of αtest and
all noise levels by a minor margin. The results of the two normalisation methods are
shown in comparison to the un-normalised speech recognition performance in Fig-
ure 7.5.
7.5.4 Discussion
From examining Figure 7.5, it can be seen that both normalisation methods perform very
similarly around the best-performing section of the curve in cleaner conditions. Perform-
ing full mean and variance normalisation of the video state models into the same
domain as the audio state models does appear to give a noticeable improvement in
the video-only (αtest = 0.0) SHMM performance. However, the video-only perfor-
mance of the normalised SHMMs still does not match the uni-modal video HMMs’
Figure 7.5: Speech recognition performance (word error rate against αtest) under no normalisation, variance-only normalisation, and mean-and-variance normalisation: (a) 0 dB SNR, (b) 6 dB SNR, (c) 12 dB SNR, (d) 18 dB SNR.
speaker independent speech recognition WER of 27.9%. Accordingly, it is unlikely that
the SHMMs would be used in this configuration, so the effect of either normalisation
method at αtest = 0.0 is of only minor importance.
From the stream-weighting WER curves in Figure 7.5, and the variance-normalisation
mappings between the intended and final weighting parameters shown in Table 7.2, it
can be seen that normalisation essentially moves the centre of the αtest range closer
to the best-performing non-normalised αtest, while also producing a flatter WER curve
around this point. These results suggest that the best-performing weighting parameter
of αtest ≈ 0.8–0.9 from the earlier weighting experiments pri-
marily served to normalise the two modalities rather than indicate their impact on
the final SHMM performance. The best-performing αtest in the normalised system is
much closer to 0.5, indicating that both modalities are contributing almost equally to
the final performance.
Each of the 12 configurations of the XM2VTS database established in the speech pro-
cessing framework has a different combination of training, testing and evaluation ses-
sions. For this reason, each configuration tested in these experiments calculates and
uses its own normalisation parameters to perform the full normalisation, based upon
the evaluation session of that particular XM2VTS configuration.
7.6 Speech recognition experiments†
7.6.1 Choosing the stream weight parameters
Before the SHMM design can be used to recognise speech over all configurations of
the XM2VTS data defined in the speech processing framework, the training and test-
ing stream weights must be chosen such that the SHMM works best over all acoustic
noise conditions. The stream weighting experiments in Section 7.4 showed that the
training stream weight parameter αtrain has little impact on the final speech recognition
performance, with the major impact arising from the testing stream weighting
parameter αtest.

Figure 7.6: Speaker independent speech recognition using full-normalised word-model SHMMs as αtest is varied on the first configuration of the XM2VTS database, for 0, 6, 12 and 18 dB SNR and the average over all noise levels.
Based on the earlier SHMM normalisation experiments, the final SHMM used for the
speech recognition experiments in this chapter included full normalisation of the vi-
sual means and variances to match the acoustic distribution. The choice of a single, post-normalisation
αtest to perform speech recognition over the full range of acoustic noise levels was
made by looking at the speech recognition performance of the normalised SHMMover
the first configuration of the XM2VTS database under the speech processing frame-
work, shown in Figure 7.6.
Of course, while the choice of αtrain has been demonstrated earlier to be of little import,
some choice did have to be made, and in these cases, αtrain = 0.8 was chosen as it had
the lowest average WER over all values of αtest and all noise levels by a minor margin.
Figure 7.7: Speaker-independent speech recognition performance (word error rate against signal-to-noise ratio) using jointly-trained SHMMs over all 12 configurations of the XM2VTS database, compared against the A-PLP, V-LDA-MRDCT and discriminative feature-fusion baselines.
From the weighting curves shown in Figure 7.6, it can be seen that a testing stream
weighting parameter of αtest = 0.5 worked best when averaged over all noise levels.
Accordingly, this value was chosen for the testing stream weighting parameter of the
final SHMM speech processing experiments conducted in this chapter.
7.6.2 Results
The SHMM speech recognition experiments were conducted using the speech process-
ing framework developed in Chapter 4. However, due to the inability of the HMM
Toolkit [194] to perform adaptation on multi-stream HMMs, only the background
speech models could be trained, and therefore only the speaker independent speech
recognition experiments could be performed with the jointly-trained SHMMs. An al-
ternative method of training speaker-dependent SHMMs will be detailed with FHMM-
adaptation in Chapter 8.
The results of the speaker independent speech recognition experiments, over all con-
figurations of the XM2VTS database, are shown in Figure 7.7. The SHMM speech
recognition WERs are shown against the equivalent discriminative feature-fusion sys-
tem and uni-modal acoustic HMMs. The uni-modal visual HMM results are also
shown as an additional baseline.
7.7 Discussion
In looking at the speech recognition results in Figure 7.7, it can be seen that the ability
to normalise and weight individual modalities provided by the SHMM design gives
a significant benefit over feature-fusion-based designs. While in this case a single
set of stream weights was found to work well over the entire range of acoustic noise
conditions for speaker independent speech recognition, the SHMM design can also
allow for the weights to be changed based on the prevailing recognition conditions.
This would provide for even better speech recognition performance than has been
presented here, but a method of estimating the environmental conditions and deriving
an appropriate weighting parameter would first have to be developed.
In comparison to the uni-modal acoustic HMMs, the SHMM design provides better
or similar performance over the entire range of acoustic noise conditions when nor-
malised and weighted at 50% audio and 50% video (αtest = 0.5). Even in the cleaner
acoustic conditions, the introduction of visual speech information did not noticeably
decrease the SHMM performance relative to the uni-modal acoustic HMM.
One of the more important considerations of an audio-visual speech processing sys-
tem is that such a system remains non-catastrophic in poor acoustic conditions, as
robustness to poor acoustic conditions is one of the main selling points of using audio-visual speech
information over audio alone. Thankfully, the SHMM design does meet these require-
ments at least down to 0 dB SNR, allowing such an approach to be used confidently in
quite degraded conditions without causing catastrophic fusion. Of course, the ability
to dynamically change SHMM stream weights at run-time would allow the SHMM to
be run in even worse acoustic conditions through adjusting the stream weights closer
to αtest = 0.0, or video-only performance as the audio became unusable.
7.8 Chapter summary
In this chapter the middle integration approach was introduced, combining the below-
utterance-level fusion of the early integration approach with the ability to weight the
acoustic and video modalities separately inherent in the late integration approach. A
number of middle-integration multi-stream HMMs have been used in the AVSP liter-
ature, of which the most popular choices were reviewed early in this chapter, compar-
ing and contrasting the strictness and placing of the audio-visual couplings within the
various multi-stream modelling techniques.
The simplest multi-stream HMM, and the subject of this chapter, was the SHMM, which
coupled the acoustic and visual observations together within every state of the HMM.
By keeping the design simple, the SHMM design is much easier to train and test with
limited data than other more complicated multi-stream HMMs in use in the literature.
Because the main benefit of the SHMM design over a simple feature-fusion HMM
is the ability to treat each modality separately, this chapter presented a number of
novel experiments in the weighting and normalisation of the audio and visual streams
within the SHMM design.
By examining the effect of varying the stream weights during training and testing of
the SHMM on the final speech recognition performance, it was determined that the
choice of stream weights used during the training of the SHMM had no real effect
on the final performance of the SHMM, with the main factor in the final performance
being the choice of stream weights during testing.
In order to improve the ability to weight the acoustic and visual modalities accurately
in testing conditions, the concept of classifier normalisation used for output score fu-
sion in the previous chapter was introduced within the HMM decoder to allow nor-
malisation within the SHMM structure. As access to the internals of the decoder
can be difficult with some HMM tools, an alternative variance-only form of nor-
malisation was designed that could be implemented solely through the adjustment of
the stream weighting parameters, with similar performance to the full normalisation
method. SHMM normalisation was found to flatten the speech recognition perfor-
mance of the SHMM as the stream weights are varied, and moved the best-performing
stream weighting parameters close to equal audio and video weights, in comparison
to the 80–90% audio weighting that performed best for un-normalised SHMMs.
Chapter 8
Fused HMM-Adaptation of
Synchronous HMMs
8.1 Introduction
This chapter will introduce a novel method of adapting a SHMM from an already trained
uni-modal acoustic HMM, extended from Pan et al.'s [130] proposed Fused HMM (FHMM)
classifier structure. By adapting the visual state classifiers directly from training seg-
mentations performed by a well-performing acoustic HMM, the FHMM-adaptation
process can produce a SHMM that outperforms the jointly-trained SHMM at all levels
of acoustic noise, with no increase in model complexity. This method can also be used
to create speaker-adapted visual states for use alongside the speaker adapted acoustic
HMM, allowing for speaker dependent speech models for use in speaker depen-
dent speech recognition and speaker verification. Such speaker dependent SHMMs
were not possible within the HMM Toolkit [194] for jointly-trained SHMMs, but the
FHMM-adaptation method will allow for these speaker dependent speech processing
tasks to be demonstrated using FHMM-adapted SHMMs in this chapter.
This chapter will begin with an introduction to Pan et al.’s [130] original theory and
implementation of the FHMM structure, which consisted of a continuous classifier
for the acoustic modality combined with a static vector-quantisation classifier for the
visual modality within each state of a SHMM-like structure. By extending Pan et al.'s
speech processing model to replace the discrete vector-quantisation classifier with a
continuous GMM classifier, it will be demonstrated that this continuous FHMM struc-
ture can be considered identical to a SHMM, and therefore the FHMM training method
can be considered a novel approach to training a SHMM through adaptation based on
the state alignment of a uni-modal acoustic HMM.
Finally, the FHMM-adapted SHMM will be demonstrated for the applicable speech
processing tasks under the speech processing framework developed in Chapter 4, to
demonstrate the improved speech modelling ability of the FHMM-adaptation method
over joint training of SHMMs.
8.2 Discrete fused HMMs
8.2.1 Introduction
The original design of Pan et al.’s FHMM [130] was motivated by an attempt to max-
imise the mutual information between the two tightly coupled acoustic and visual
streams for audio-visual speech processing tasks, while keeping the design of the re-
sulting multi-stream HMM relatively simple when compared to some of the more compli-
cated multi-stream HMM techniques such as coupled or asynchronous HMMs.
In their work on the FHMM structure which is outlined below, Pan et al. showed
that the maximum mutual information was obtained when the observations of one
modality are combined with the states of the other, rather than in a design that links
the hidden states of separate HMMs [130].
This resulting FHMM structure can either be acoustically or visually biased based
upon which modality controls the state transitions during training, and in Pan et
al.'s implementation both biased versions were considered in output decision fusion
for speaker verification. As this implementation modelled the subordinate modal-
ity using discrete vector-quantisation of the subordinate observations, this FHMM
structure will be referred to as discrete FHMMs for the remainder of this thesis.
This section will outline Pan et al.’s theory behind calculating the optimal multi-
streamHMM structure based onmaximising the mutual information between the two
modalities, and will finish by outlining briefly the discrete FHMM implementation
implemented by Pan et al. for audio-visual speaker verification.
8.2.2 Maximising mutual information for audio-visual speech
In their original work on calculating the joint probability of audio-visual speech, Pan
et al. [131] showed that the optimal solution for the joint probability of a particular
sequence of coupled acoustic and visual observations, O_a and O_v, can be calculated
according to the maximum entropy principle [80] as

    p(O_a, O_v) = p(O_a) p(O_v) p(w, v) / [p(w) p(v)]    (8.1)
where w = g_a(O_a) and v = g_v(O_v) are transformations designed such that p(w, v) is
easier to calculate than p(O_a, O_v), but still reflects the statistical dependence between
the two streams. The final term in (8.1) can therefore be viewed as a correlation weight-
ing, which will be high if w and v (and therefore p(O_a) and p(O_v)) are related, and
low if they are mostly independent. In their work, Pan et al. [131] also showed that the
minimum distance between the estimate given by (8.1) and the ground truth p(O_a, O_v)
is established when the mutual information between w and v is maximised:
    (w, v) = argmax_{(w,v) ∈ Θ} I(w, v)    (8.2)
In their audio-visual FHMM paper [130], Pan et al. chose w and v empirically from
the following set (Θ):

    w = U_a, v = O_v    (8.3)
    w = U_a, v = U_v    (8.4)
    w = O_a, v = U_v    (8.5)
where U_x is an estimate of the optimal state sequence of HMM x for output O_x. By
invoking (8.2) over the set Θ, invoking the following inequality from information
theory,

    I(x, f(y)) ≤ I(x, y)    (8.6)

and considering that estimated hidden state sequences can be viewed as a function of
the output (U_x = f_x(O_x)), Pan et al. [130] concluded that
    I(U_a, U_v) = I(U_a, f_v(O_v)) ≤ I(U_a, O_v)    (8.7)

    I(U_a, U_v) = I(f_a(O_a), U_v) ≤ I(O_a, U_v)    (8.8)
Therefore the transforms (8.3) and (8.5) can produce better estimates of p(O_a, O_v)
than (8.4). More generally, this indicates that it is better to fuse two HMMs together
through a combination of the states of one with the observations of the other, rather
than by linking the hidden states of the two HMMs.
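The inequality (8.6) is a form of the data-processing inequality, and can be checked numerically using plug-in estimates on toy discrete sequences. The helper and the example sequences below are illustrative assumptions only, not part of Pan et al.'s derivation:

```python
import numpy as np
from collections import Counter

def mutual_info(xs, ys):
    """Plug-in (empirical) mutual information, in nats, of two discrete sequences."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * np.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Toy sequences: y is informative about x; f is a deterministic function of y,
# analogous to the state-estimation transform U_x = f_x(O_x) in the text.
x = [0, 0, 1, 1, 2, 2, 0, 1]
y = [0, 1, 2, 3, 4, 5, 0, 2]
f = lambda v: v // 2
assert mutual_info(x, [f(v) for v in y]) <= mutual_info(x, y) + 1e-12
```

Because applying f can only merge outcomes of y, the empirical joint of (x, f(y)) is a processed version of the joint of (x, y), so the inequality holds for the plug-in estimates as well.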
By invoking (8.3) in (8.1):

    p_a(O_a, O_v) = p(O_a) p(O_v) p(U_a, O_v) / [p(U_a) p(O_v)]
                  = p(O_a) p(O_v | U_a)    (8.9)
where p(O_a) can be estimated from a regular audio HMM, and p(O_v | U_a) is the like-
lihood of the video output sequence given the estimated audio HMM state sequence
which produced O_a. This equation represents the audio-biased FHMM, as the main
decoding process comes from the audio HMM.
Similarly, invoking (8.5) to arrive at the video-biased FHMM gives:

    p_v(O_a, O_v) = p(O_v) p(O_a | U_v)    (8.10)
The choice between the audio- and video-biased FHMM should be based upon which
individual HMM can more reliably estimate the hidden state sequence for a particular
application. Alternatively, both versions can be used concurrently and combined us-
ing output fusion, as in the speaker verification experiments of Pan et al.'s original
implementation [130].
8.2.3 Discrete implementation
The process of training the acoustically and visually biased FHMM structures is
given by Pan et al. as a three-step process [130]:
1. Two individual HMMs are trained independently by the EM algorithm.
2. The best hidden state sequences of the HMMs are found using the Viterbi algo-
rithm.
3. The coupling parameters are determined.
The coupling parameters are represented by the final conditional probability terms
in (8.9) and (8.10): put simply, the likelihood distribution of a particular subordinate
observation for a particular dominant HMM state. These distributions can be estimated
simply by aligning the hidden state sequences of the dominant HMM with the
subordinate observations, and forming some model of which subordinate
observations are likely to coincide with each of the dominant HMM states.
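For the discrete case, this estimate reduces to counting which subordinate codewords co-occur with each dominant state. The sketch below is an illustrative assumption (the function name, array layout and add-one smoothing are not from Pan et al.'s paper):

```python
import numpy as np

def coupling_table(alignment, codewords, n_states, n_codes):
    """Estimate p(subordinate codeword | dominant state) by counting co-occurrences.

    alignment : dominant-HMM state index for each frame (e.g. from Viterbi).
    codewords : VQ codeword index of the subordinate observation at each frame.
    Add-one (Laplace) smoothing avoids zero probabilities for unseen pairs.
    """
    counts = np.ones((n_states, n_codes))
    for s, c in zip(alignment, codewords):
        counts[s, c] += 1
    return counts / counts.sum(axis=1, keepdims=True)
```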
In Pan et al.'s implementation of their FHMM structure, a vector-quantisation code-
book was used to represent the subordinate modality alongside the continuous HMM
trained on the dominant modality. Pan et al.'s acoustically biased FHMM combined a
linear prediction coding (LPC) cepstral-coefficient-based HMM with a 16-word code-
book based on the raw gray-level pixel values of the ROI, and the visually biased
FHMM combined a raw gray-level pixel visual HMM with a 64-word codebook based
on the acoustic LPC features.
In their speaker verification experiments on a small in-house database, Pan et al.
found that output score fusion of both the acoustically and visually biased FHMM
structures outperformed coupled HMMs, but was itself outperformed by a product
HMM. Pan et al. concluded that although the product HMM did perform better than
their FHMM structure, the simplicity of the FHMM structure gave it a real advantage
in real-world situations [130].
8.3 Fused HMM adaptation of synchronous HMMs†
8.3.1 Continuous FHMMs
Pan et al.'s implementation of their FHMM structure, using a discrete vector-quantisation
codebook for the subordinate modalities, worked well for speaker verification in their
original paper [130]. Similar results were also replicated in a discrete implementation
of the FHMM structure by Dean et al. [44] on the single-session CUAVE [133] database.
However, the use of a vector-quantisation codebook to represent the widely varying
nature of the acoustic and visual speech information does not generalise well over
multiple recording sessions, as the large variability in the acoustic and visual
information between different sessions cannot easily be represented within the limited
confines of a discrete vector-quantisation codebook table.

Figure 8.1: By replacing the discrete secondary representations with continuous representations in Pan et al.'s [130] original FHMM, it can be seen that a SHMM will be created. (a) Discrete (acoustic-biased) FHMM; (b) Synchronous HMM.
By extending the original FHMM design to represent the subordinate modality
using continuous GMMs, a novel extension of Pan et al.'s original FHMM structure
was developed that is more robust to inter-session variability than the original discrete
implementation. However, rather than being a new multi-stream HMM design, this
continuous FHMM structure is in fact equivalent to the SHMM model introduced in
the previous chapter, as shown in Figure 8.1.
Therefore, rather than being seen as a novel model for representing audio-visual speech,
the continuous FHMM training method can be seen as a novel method of training a
SHMM based on the state sequences of a single uni-modal acoustic or visual HMM.
8.3.2 Fused-HMM adaptation
The FHMM adaptation process can be considered identical to the original training
process, but the estimation of the subordinate coupling parameters is performed using
EM training of GMMs rather than through the training of a discrete codebook as in the
original implementation. Therefore the FHMM-adaptation of an acoustically-biased
SHMM from an already trained uni-modal acoustic HMM can be simply defined as a
two step process:
1. Determine the best hidden state sequence of the acoustic HMM over the training
data.

2. For each state of the acoustic HMM:

(a) Train a visual GMM based upon the visual observations that coincide with
the acoustic state in the training data.

(b) Append the visual GMM to the already existing acoustic HMM to produce
a SHMM.
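The two steps above can be sketched as follows. For brevity, this illustrative sketch fits a single-component diagonal-covariance model per state in place of the multi-mixture GMMs used in the thesis, and assumes the Viterbi state alignment of the acoustic HMM has already been computed:

```python
import numpy as np

def fhmm_adapt(alignment, visual_obs, n_states):
    """Train a per-state visual model from a fixed acoustic state alignment.

    alignment  : (T,) array of acoustic-HMM state indices from step 1.
    visual_obs : (T, D) array of visual feature vectors, frame-synchronous
                 with the alignment.
    Returns {state: (mean, variance)} -- the models that would be appended
    to the acoustic HMM states to produce a SHMM.
    """
    models = {}
    for s in range(n_states):
        obs = visual_obs[alignment == s]  # visual frames coinciding with state s
        models[s] = (obs.mean(axis=0), obs.var(axis=0))
    return models
```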
For simplicity of explanation, throughout the remainder of this section the
FHMM-adaptation process will be outlined using the example of FHMM-adaptation
of an acoustically biased SHMM from an already existing uni-modal acoustic HMM.
The equivalent process for FHMM-adaptation of a visually biased SHMM, if needed,
can easily be derived by swapping the roles of the two modalities within the FHMM-
adaptation process.
The FHMM-adaptation process allows the state representations of the two modalities
to be estimated separately, although the visual modality does of course depend upon
the acoustic. In this way, it can be seen to be similar to the separate-estimation method
of producing a SHMM by combining two uni-modal HMMs, which was briefly touched
on in comparison to the joint-training method of SHMM parameter estimation. However,
in the case of FHMM-adaptation there is no concern of the states not being aligned,
because the state alignment of the video models is dictated by the acoustic HMM.
Additionally, training a SHMM using FHMM-adaptation is quicker than jointly
training a SHMM, as the Baum-Welch re-estimation process only has to occur
for the acoustic observations; the state sequences are fixed by the time the
visual state models are estimated.
Of course, the FHMM-adaptation method does require that a sufficiently good esti-
mate of the state sequences can be obtained from the acoustic data alone in training.
But as the SHMM-training stream-weighting experiments in Chapter 7 showed,
the choice of stream weights during training had little effect on the final
speech performance, and therefore the audio-only choice (αtrain = 1.0) dictated by
the FHMM-adaptation process should have no detrimental impact on the final perfor-
mance.
Background SHMM models
The FHMM-adaptation of the background speech SHMMs was performed over the
same training partitions of the XM2VTS database as the jointly-trained SHMMs, in
accordance with the speech processing framework developed in Chapter 4.

However, instead of taking the training observations and transcriptions to jointly train
a background SHMM for each word in the XM2VTS database, the FHMM-adaptation
process starts with the uni-modal acoustic background HMMs and generates a time-
aligned transcription of the words and states of the acoustic HMMs over the entire
training sequence. This time-aligned transcription can then be used to segment all the
video observations that coincide with each state of the original uni-modal HMM, and
to train a state-model video GMM for each of the original acoustic states. By appending
the resulting video GMMs to the acoustic GMMs already existing within the states of
the uni-modal HMM, a new audio-visual SHMM is generated.
Speaker-dependent SHMM models
Adapting the FHMM-adapted models to specific speakers for the purposes of speaker
dependent speech recognition and speaker verification is a simple process. The already
speaker-dependent acoustic HMMs trained in Chapter 5 are used as the time-alignment
basis to MAP-adapt a speaker-dependent video GMM for each word and state from the
existing FHMM-adapted background video GMMs trained previously. By appending
these speaker-dependent video GMMs to the speaker-dependent acoustic HMMs, a
final set of speaker-dependent SHMMs can easily be trained for each client speaker in
the speech processing framework.
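A minimal sketch of the MAP mean-adaptation step referred to above, in the style of the classical relevance-factor formulation; the function name, the relevance factor tau and the pre-computed responsibilities are illustrative assumptions rather than the exact implementation used here:

```python
import numpy as np

def map_adapt_means(bg_means, obs, resp, tau=10.0):
    """MAP-adapt the means of a background GMM towards a speaker's data.

    bg_means : (M, D) background (world) mixture means.
    obs      : (T, D) speaker adaptation frames.
    resp     : (T, M) responsibilities of each mixture for each frame,
               from one E-step against the background model.
    tau      : relevance factor; larger values keep means closer to the
               background model when adaptation data is scarce.
    """
    n_m = resp.sum(axis=0)                         # soft frame counts per mixture
    ml_means = (resp.T @ obs) / np.maximum(n_m, 1e-10)[:, None]
    alpha = (n_m / (n_m + tau))[:, None]           # data-dependent interpolation
    return alpha * ml_means + (1 - alpha) * bg_means
```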
Decoding
As the SHMM trained using the FHMM-adaptation method is a normal SHMM, the
decoding of the model for speech or speaker recognition is conducted identically to
that of a jointly-trained SHMM, and the resulting speech models can be normalised
and weighted in the same manner as their jointly-trained cousins.
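The weighting referred to here combines the per-state log-likelihoods of the two streams; a minimal sketch, assuming the usual log-linear (exponent-weighted) combination, is:

```python
import numpy as np

def shmm_emission_loglik(log_p_audio, log_p_video, alpha):
    """Stream-weighted emission log-likelihood of a SHMM state.

    log_p_audio, log_p_video : log-likelihoods of the synchronous acoustic
    and visual observations under the state's acoustic GMM and visual GMM.
    alpha is the stream weight: 1.0 is audio-only, 0.0 is video-only.
    """
    return alpha * np.asarray(log_p_audio) + (1 - alpha) * np.asarray(log_p_video)
```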
8.4 Biasing of FHMM-adapted SHMMs†
8.4.1 Introduction
Because both background and speaker-dependent SHMM models can be trained us-
ing FHMM-adaptation (unlike joint training of SHMMs using the HMM Toolkit [194]),
speaker independent and speaker dependent speech recognition experiments, as well
as speaker verification experiments, can be performed using the FHMM-adapted SHMMs.
Both the background speech models and the speaker adaptation of those models are
produced according to the speech processing framework developed in Chapter 4, with
the underlying acoustic HMMs identical to those used for the speech processing
experiments in Chapter 5. These acoustic HMMs are FHMM-adapted through the
addition of a 16-mixture GMM for each acoustic state, resulting in an 11-state SHMM
with 8 mixtures for the acoustic and 16 for the visual features, identical in topology to
the jointly-trained SHMMs demonstrated in the previous chapter.
As well as the acoustically biased FHMM-adapted SHMM, which has been the focus
of the FHMM-adaptation process detailed so far, visually biased SHMM background
and speaker-adapted models were also created by appending an 8-mixture acoustic
GMM to each state of the original visual HMMs. Both the acoustically and visually
biased SHMMs were identical in topology, with the only difference being the order of
the two modalities in the concatenative-fusion input vector. However, because the
visually biased SHMM was based on the underlying visual HMM, the acoustic features
were down-sampled to the video frame rate before concatenation of the two feature
vectors.
8.4.2 Acoustically or visually biased
In their original paper on the discrete FHMM structure, Pan et al. suggested that
the best approach is to choose as the dominant modality the one with the best ability
to discriminate speech, or to combine both in an output score fusion of the two FHMM
designs [130]. Because continuous speech recognition cannot easily be modelled
through output score fusion of separate classifiers, a choice had to be made for these
experiments between the acoustically and visually biased FHMM-adaptation
processes.
Taking Pan et al.'s suggestion of choosing the best modality based on its ability to
discriminate speech would generally lead to the acoustically biased choice, as the
speech recognition ability of audio is much higher than that of video, as was clearly
demonstrated in the uni-modal HMM speech recognition experiments in Chapter 5.
However, as FHMM-adaptation is a training procedure, the ability of each modality
to discriminate the boundaries of speech events in training should be more important
than the ability to decode unknown speech. And, as the training stream-weight
experiments in Chapter 7 have shown, there is little difference in the ability of the
acoustic or visual features to discriminate state boundaries during training, making
the choice between acoustically and visually biased FHMM-adaptation less
straightforward.
To come to a decision on the dominant modality for the FHMM-adaptation experi-
ments, the acoustically and visually biased SHMMs trained using the FHMM-adaptation
technique were compared for speaker independent speech recognition, as the testing
stream-weighting parameter αtest was varied, on the first configuration of the XM2VTS
database. The results of these experiments are shown in Figure 8.2.

[Figure: two panels, (a) Acoustically biased and (b) Visually biased, plotting word error rate against αtest at 0, 6, 12 and 18 dB SNR, with the average over all noise levels.]
Figure 8.2: Performance of acoustically and visually biased FHMM-adapted SHMMs as testing stream weights are varied.
8.4.3 Discussion
Looking at the stream-weighting curves in Figure 8.2, the visually biased FHMMs
appear to provide better performance at the extreme points (αtest = 0 or αtest = 1).
However, the acoustically biased versions generally provide similar or better perfor-
mance at the best-performing point of the curves. Additionally, the best-performing
stream weights for each noise condition appear to have less spread around the av-
erage best-performing stream weight of αtest = 0.5, allowing for better unsupervised
speech decoding using the audio-biased SHMMs trained using the FHMM-adaptation
method.
While the performance difference between the acoustically and visually biased FHMM-
adaptation methods is not large, the acoustically biased version was chosen for the
remainder of this thesis, primarily because of its improved speech recognition ability
in noisy acoustic conditions.
[Figure: word error rate against signal-to-noise ratio (dB) for the uni-modal acoustic (A-PLP) and visual (V-LDA-MRDCT) HMMs, discriminative feature fusion, the jointly-trained SHMM and the FHMM-adapted SHMM.]
Figure 8.3: Speaker independent speech recognition performance using FHMM-adapted HMMs over all 12 configurations of the XM2VTS database.
8.5 Speech recognition experiments†
8.5.1 Results
The unsupervised speech recognition experiments using the acoustically biased FHMM-
adaptation process were conducted over all 12 configurations of the XM2VTS database,
according to the speech processing framework developed in Chapter 4. Speaker
independent experiments were performed using the background SHMMs, and speaker
dependent experiments using the SHMMs adapted to each client speaker under test.
Based on the stream-weighting tuning experiments performed in Section 8.4, an un-
supervised stream weighting of αtest = 0.5 was chosen, as it provided the best average
performance over all noise levels amongst all the possible stream weights.
The results of the speaker independent speech recognition experiments using the
FHMM-adaptation process are shown in Figure 8.3. Both the jointly-trained SHMM
and discriminative feature-fusion experiments are also included for comparison, in
addition to the uni-modal acoustic and visual HMM performances.

[Figure: word error rate against signal-to-noise ratio (dB) for the uni-modal acoustic (A-PLP) and visual (V-LDA-MRDCT) HMMs, discriminative feature fusion and the FHMM-adapted SHMM.]
Figure 8.4: Speaker dependent speech recognition performance using FHMM-adapted HMMs over all 12 configurations of the XM2VTS database.
The results of the speaker dependent speech recognition experiments using the FHMM-
adaptation process are shown in Figure 8.4. While jointly-trained SHMMs could not
be included, both the discriminative feature-fusion and uni-modal audio and video
HMM results are shown for comparison.
8.5.2 Discussion
From an examination of the FHMM-adapted versus jointly-trained SHMMs for speaker
independent speech recognition in Figure 8.3, it can be seen that the FHMM-adaptation
method provides a clear improvement over the jointly-trained SHMMs at all levels of
acoustic noise tested in these experiments. This improvement begins at around 1 point
of WER in clean conditions, but rises to around 7 points at 0 dB SNR.

[Figure: two panels, (a) FHMM-adapted (audio-biased) and (b) Jointly trained, plotting word error rate against αtest at 0, 6, 12 and 18 dB SNR, with the average over all noise levels.]
Figure 8.5: Comparing the A-PLP biased FHMM-adapted SHMM with an equivalent jointly-trained SHMM on the first configuration of the XM2VTS database.
To further examine the reasons for the improvement of the FHMM-adaptation method
over joint training, the effect of varying the testing stream weights for each of these
SHMM training methods can be examined, to determine whether any conclusions can
be drawn as to the effectiveness of the two methods.
Such a comparison between acoustically biased FHMM-adaptation and joint training
is shown in Figure 8.5. These test-weighting results were taken from the tuning experi-
ments used to determine the best-performing stream weights in Chapter 7 and Section 8.4,
and therefore were only conducted on the first configuration of the XM2VTS database
under the speech processing framework.
The main difference between the two SHMM training methods evident in Figure 8.5
is the clearly improved performance of the FHMM-adapted SHMM in the 0 and 6 dB
SNR conditions, while the performance increase of around 0.7 points in the cleaner
conditions is barely perceptible at the scale of the graph. It appears that even though
the jointly-trained SHMM produces better video speech recognition performance when
αtest = 0, the FHMM-adaptation method of training the video state models produces
video models that interact better with the pre-existing acoustic models than those
produced through the joint-training method of SHMM parameter estimation.
However, the improvement in the video performance of the FHMM-adapted SHMM
is not due to the audio-only state boundaries estimated during training being better
than video-estimated boundaries, or any combination of the two. In fact, as the
experiments with jointly-trained SHMMs in Chapter 7 have shown, the choice of αtrain
has, in this particular case, no real effect on the final speech recognition performance.
As the FHMM-adapted SHMM is derived from an audio-only HMM, it can be considered
to be training its state models on the same estimated state alignments as a jointly-trained
SHMM with αtrain = 1.0. However, if our FHMM-adapted system is compared with
such a jointly-trained SHMM, the FHMM-adapted performance is still much improved.
The main reason for the improvement of the FHMM-adapted SHMM video models
appears to be related to poor initialisation of the Baum-Welch training algorithm for
video HMMs. Before the training algorithm can begin, a bootstrapped estimate of the
HMM parameters must first be provided. Typically, and in our particular case, these
estimates are based on a uniform segmentation of the training data [194]: each group
of training observations for a particular word model is divided into equal segments,
one for each underlying state. While this segmentation is obviously unlikely to
correspond to the final state alignments of a well-trained HMM, it has been shown
to provide a good initial point for the Baum-Welch estimation process in audio
speech processing tasks [194].
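The uniform segmentation described above can be sketched in a few lines (the function name is an assumption; this mirrors the flat-start behaviour rather than reproducing the HMM Toolkit's code):

```python
import numpy as np

def uniform_segmentation(n_frames, n_states):
    """Assign each training frame to a state by equal-length segments,
    as used to bootstrap HMM parameters before Baum-Welch re-estimation."""
    return np.minimum((np.arange(n_frames) * n_states) // n_frames, n_states - 1)
```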
However, from the results in this chapter, it would appear that this assumption does
not necessarily hold for video speech processing tasks. While the video models for
the baseline jointly-trained SHMM were initialised and trained alongside the audio
models, the FHMM-adapted video models were trained directly on previously deter-
mined state alignments. So while the final state alignment arrived at by the Baum-Welch
algorithm for a jointly-trained SHMM trained at αtrain = 1.0 would match the state
alignment used for FHMM-adaptation, there would still be a clear difference in the re-
sulting speech recognition performance. It would appear that the initialisation of the
video models using the uniform segmentation has a detrimental effect on the final
video models. By determining the video parameters directly on the known-good au-
dio alignments, this detrimental effect can be limited, increasing the overall speech
recognition performance of the SHMM accordingly.
Speaker dependent FHMM-adaptation
The speaker dependent speech recognition results presented in Figure 8.4 unfortu-
nately could not be presented alongside speaker-dependent jointly-trained SHMM
models, due to limitations of the HMM Toolkit used for performing the joint training.
However, similar relative performance is obtained in comparison to the uni-modal
and discriminative feature-fusion systems for speaker dependent speech recognition.
In particular, the use of speaker-dependent SHMMs trained using the FHMM-adaptation
method allows the speech recognition WER to remain below the video-only speech
recognition error rate for the entire range of acoustic conditions under test. The FHMM-
adapted systems can outperform the already well-performing uni-modal acoustic HMMs
in clean conditions through the addition of video state models based on the acoustic
boundaries of the original HMMs.
8.6 Speaker verification experiments†
8.6.1 Introduction
As the FHMM-adaptation process of training SHMMs has allowed speaker-dependent
speech models to be trained, such models can also be used for text-dependent speaker
verification under the speech processing framework developed in Chapter 4. In this
section, the text-dependent FHMM-adapted SHMMs will be used to verify speakers
according to that framework, in comparison with the uni-modal HMMs, discriminative
feature fusion and output fusion for the same task. Results will be reported as EERs
over the full range of acoustic noise conditions under test.

[Figure: equal error rate (in %) against αtest at 0, 6, 12 and 18 dB SNR, with the average over all noise levels.]
Figure 8.6: Tuning the testing stream weight parameter αtest for speaker verification using FHMM-adapted SHMMs.
8.6.2 Stream weighting
In order to conduct the speaker verification experiments using the FHMM-adapted
SHMMs, the response of this structure to the stream-weighting parameter αtest un-
der various noisy conditions must first be evaluated, to allow a suitable choice of the
final stream weights for the unsupervised speaker verification experiments. To choose
this value, speaker verification experiments using the FHMM-adapted SHMMs were
performed over a range of stream weights on the first partition of the XM2VTS database,
the results of which are shown in Figure 8.6.
In performing these tuning experiments to determine the best stream-weighting pa-
rameter, one of the major disadvantages of SHMM-based speaker verification becomes
apparent. As the stream weights are applied within the states of the HMM as each
utterance is verified, rather than as a final stage of fusing classifier scores, evaluating a
different weighting parameter requires that the entire HMM decoding process be re-run
each time. In comparison, changing the weighting parameter in output score fusion
only requires the recalculation of a single mathematical equation.
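The contrast can be made concrete: re-weighting in output score fusion is a single arithmetic operation on already-computed utterance scores (a simple linear combination is assumed here for illustration), whereas the SHMM must re-run Viterbi decoding for every new weight:

```python
def fuse_scores(score_audio, score_video, alpha):
    """Late (output score) fusion of per-utterance classifier scores.
    Evaluating a new alpha only requires re-running this one line."""
    return alpha * score_audio + (1 - alpha) * score_video
```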
Because evaluating a full range of αtest therefore requires a very large number of
HMM-based verifications, the tuning of the stream-weighting parameters shown in
Figure 8.6 was only performed on a single partition of the XM2VTS database, rather
than on all 12 configurations available under the speech processing framework. While
this approach is clearly easier than using all configurations, the resulting performance
curve as αtest is varied is not as finely differentiated as it would have been if more
verification experiments could have been performed for each of the stream-weighting
parameters tested here.
However, these curves should be adequate to broadly show the relative speaker verifi-
cation performance as αtest is varied, and the average performance over all noise levels
was used to choose an αtest of 0.2 for the FHMM-adapted SHMMs in the unsupervised
speaker verification tests. This choice of αtest also had the advantage of being identical
to that used for the late integration experiments, allowing the two integration
strategies to be compared easily.
[Figure: equal error rate (in %) against signal-to-noise ratio (dB) for the uni-modal acoustic (A-PLP) and visual (V-LDA-MRDCT) HMMs, discriminative feature fusion, output fusion and the FHMM-adapted SHMM.]
Figure 8.7: Text-dependent speaker recognition performance using FHMM-adapted HMMs over all 12 configurations of the XM2VTS database.
8.6.3 Results
The results of the unsupervised FHMM-adapted speaker verification experiments are
shown in Figure 8.7, in comparison to the equivalent text-dependent discriminative
feature fusion and output score fusion, as well as the uni-modal HMMs. While it can
be seen that the FHMM-adapted SHMMs perform well in comparison to early
integration, they nonetheless perform catastrophically in noisy conditions, and are
easily bested by the output score fusion of the uni-modal HMMs for all of the acoustic
conditions under test.
8.6.4 Discussion
While it is certainly possible that an adaptive fusion approach could be used to vary
the stream-weighting parameters based on an estimate of the prevailing environmental
conditions, a similar approach could equally be applied to the output score fusion
of the uni-modal HMMs.
Indeed, in order for SHMM approaches to speaker verification to improve over simple
output score fusion of uni-modal HMMs, the SHMM approach would have to show
that it can take advantage of some temporal dependency between the speech features.
Because output-score fusion can only occur at the end of an utterance, or through an
externally determined segmentation process, it cannot take advantage of such differences
on a frame-by-frame basis.
However, at least for text-dependent speaker verification, such a situation is unlikely
to occur, as the main role of the HMM structure is to align the GMMs within the HMM
against the pertinent speech events. As the stream-weighting experiments in the joint
training of SHMMs in Chapter 7 have shown, both the acoustic and visual modalities
are equally good at determining the hidden state boundaries of a known transcription,
and can therefore align state models equally well when evaluating a known phrase for
text-dependent speaker verification.
A possible avenue of future research that may be able to take advantage of the SHMM
structure for speaker verification is the use of the SHMM for limited-vocabulary
text-independent speaker verification. By exploiting the SHMM's ability to find the
correct transcription through the network, as exhibited in the speech recognition exper-
iments in this and the previous chapter, the SHMM may be able to perform well for
speaker verification. While this may provide better performance than two uni-modal
HMMs in a similar configuration, it is not clear whether this approach would be better
than combining the output of two large-vocabulary text-independent speaker
verification models, one in each modality.
8.7 Chapter summary
In this chapter, the FHMM-adaptation method of training a SHMM through the train-
ing of secondary state models for an already existing uni-modal HMM was introduced.
By using this approach to add visual state models to already existing uni-modal
acoustic HMMs, the resulting SHMMs provided improved speaker independent and
speaker dependent speech recognition ability in all noise conditions under test,
particularly so in the 0 dB SNR conditions.
Because this approach to SHMM training could also be used to train speaker depen-
dent models, which was not possible for jointly-trained SHMMs using the HMM
Toolkit [194], speaker verification experiments could also be conducted in compari-
son to the output score and feature fusion speaker verification experiments performed
in earlier chapters. However, the added complexity of the SHMM approach was not
found to improve upon output fusion of uni-modal HMMs.
Chapter 9
Conclusions and Future Work
This chapter concludes the thesis by summarising the conclusions made in each chap-
ter and noting the original contributions made. It also provides suggestions for fur-
ther work that could extend the research reported in this thesis or that was beyond the
scope of the work reported here.
9.1 Conclusions
The central aim of the work presented in this thesis has been the investigation of
speech and speaker recognition using both acoustic and visual speech features, with
a particular focus on SHMM-based methods of fusing the two separate modalities,
to enable automatic speech processing systems to be more robust to acoustic noise
than conventional acoustic systems. Accordingly, the work performed in this thesis
focused on four main areas:
1. To investigate the suitability of existing feature extraction and integration tech-
niques for both speech and speaker recognition.
2. To study and develop techniques to improve the audio-visual speech modelling
ability of SHMMs trained using the state-of-the-art joint-training process.
3. To develop an alternative training technique for SHMMs that can improve the
audio-visual speech modelling ability in comparison to the existing state-of-the-
art joint-training process.
4. To compare and contrast the suitability of SHMMs for speech and speaker recog-
nition in comparison to existing baseline integration techniques.
The work conducted in these four main areas resulted in a variety of speech processing systems and experiments throughout this thesis, each investigating the suitability of a particular type of system for the two tasks of speech and speaker recognition. While a number of novel contributions were presented in earlier chapters, as outlined below, the major novel contribution of this thesis is presented in Chapter 8, where the FHMM-adaptation method of training SHMMs is introduced for speech and speaker recognition. These FHMM-adapted SHMMs demonstrated improved speech modelling ability over jointly-trained SHMMs, as was particularly evident in their improved speech recognition performance at all levels of acoustic noise. However, for speaker verification, the improved temporal coupling provided by the SHMM did not appear to provide a significant improvement over the late integration approach.
The major original contributions resulting from this work are summarised as follows:
1. A novel framework was presented in Chapter 4 for the evaluation of both speech and speaker recognition on the XM2VTS database whilst reusing the same speech models. This framework was used throughout the thesis, and allowed for easy comparison between differing features and fusion techniques for both unimodal and multimodal AVSP applications.
2. Dynamic CAB features, known to work well for visual speech recognition, were tested for visual speaker verification in Chapter 5. These novel experiments showed that the ability to discriminate between speakers also improved as static information was removed from the visual speech features through the successive stages of the feature extraction cascade. These results suggested that the visual recognition of speakers can be improved by treating visual speech as a behavioural rather than a physiological characteristic of a speaker.
3. The effect of varying the stream weights independently during training and testing of SHMMs was investigated in Chapter 7. Previous experiments in the literature on SHMM training have only dealt with the stream weights as a single value that was the same for both the training and testing processes. These novel experiments showed that while varying the testing stream weights had a large impact on final speech recognition performance, similar changes during the training process had a negligible impact. These experiments demonstrated that, as SHMM training is a more constrained task than SHMM decoding during testing, either audio or video features (or any fusion of the two) can segment the training utterances equally well. This result was particularly interesting in comparison to varying the stream weights during testing, where the fusion ratio of the audio and video streams was of paramount importance to the final speech recognition performance.
4. Chapter 7 also introduced two novel techniques for normalising the audio and video streams within the SHMM during decoding. The first was a novel adaptation of zero normalisation that normalised the mean and variance of the video scores to be within a similar range to the acoustic scores, but required access to the Viterbi decoder process for implementation. The second performed variance-only normalisation solely through the adjustment of the stream weights, which allowed normalisation to occur with an unmodified standard Viterbi decoder. Both normalisation techniques performed similarly for audio-visual speech processing and were found to improve robustness to acoustic noise over un-normalised SHMMs.
5. Chapter 8 introduced FHMM-adaptation, a novel alternative training technique for SHMMs that provided improved audio-visual speech modelling ability when compared to the existing state-of-the-art training techniques for SHMMs. Experiments were conducted with the resulting FHMM-adapted SHMMs to compare and contrast this alternative SHMM training technique against jointly-trained SHMMs and earlier fusion methods for both speech and speaker recognition. These experiments showed that FHMM-adaptation can improve performance over jointly-trained SHMMs for speech recognition at all noise levels, with a particular improvement in noisy conditions. However, the additional complexity of SHMMs for speaker verification did not appear to provide any benefit over simple output-score fusion of uni-modal HMMs for the same task.
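The stream-weighting and normalisation ideas summarised in contributions 3 and 4 can be sketched briefly. In a SHMM the per-state audio and video observation log-likelihoods are combined through exponential stream weights, and variance-only normalisation then amounts to rescaling the video weight by the ratio of the score deviations, leaving the Viterbi decoder untouched. The sketch below is illustrative only (the function names and the use of held-out score statistics are assumptions, not the thesis implementation):

```python
from statistics import pstdev

def combine_streams(log_b_audio, log_b_video, lam_audio, lam_video):
    # Weighted-product stream fusion: the per-state observation
    # log-likelihoods of the two streams are combined linearly in
    # the log domain.
    return lam_audio * log_b_audio + lam_video * log_b_video

def variance_normalised_weights(audio_scores, video_scores, lam_audio=1.0):
    # Variance-only normalisation via the stream weights: scale the
    # video weight so the weighted video scores have the same spread
    # as the audio scores (statistics taken from held-out scores).
    lam_video = lam_audio * pstdev(audio_scores) / pstdev(video_scores)
    return lam_audio, lam_video
```

Because only the weights change, this form of normalisation can be applied with any standard decoder that supports stream weights.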
9.2 Future work
A number of avenues for further work have been identified as a result of the work completed in this thesis. These can be summarised as follows:
1. In Chapter 4 a speech processing framework was developed for the XM2VTS [119] database. This framework could fairly easily be extended to other databases to allow similar comparative speech processing experiments to be conducted on other datasets. One promising candidate is the AVICAR [96] database, which would allow audio-visual speech processing to be tested in an automotive environment.
2. The speaker verification experiments in Chapter 5 showed that verification error consistently decreased as static information was removed from the visual speech features. However, as face recognition systems have shown, the static characteristics of faces can clearly be useful for the recognition of people. Whilst it was outside the scope of this thesis, it would be interesting to compare static and dynamic features for whole-face recognition, paying particular attention to the current state of the art in face recognition research.
3. While Chapter 5 demonstrated the extraction of dynamic visual speech features
through a discriminative feature extraction approach on the ROI, alternative
methods of dynamic feature extraction based more directly on movement in the
video, such as optical flow, would allow for an interesting comparison with the
features used in this thesis for visual speech processing.
4. While a single weighting parameter was found to work reasonably well over all noise conditions for the SHMM-based speech recognition experiments in Chapters 7 and 8, a similar approach did not appear to be viable for the output-score fusion and SHMM-based speaker verification experiments in Chapters 6 and 8 respectively. Improved unsupervised performance could be obtained for both speech and speaker recognition systems if the relative reliability of the two modalities could be estimated and the corresponding stream weights adjusted in real time through an adaptive fusion process.
5. This thesis only investigated degradation in the acoustic domain. However, it can be difficult to reliably simulate the types of visual degradation present in real-world conditions; rather than simulating visual degradation, such research may be better focused on collecting and using audio-visual speech data recorded in real-world conditions, such as the AVICAR [96] or BANCA [6] databases.
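As a hedged illustration of the adaptive fusion suggested in point 4, one simple reliability measure is the spread of each modality's N-best scores, which tends to collapse when a stream is degraded; the measure and the mapping to weights below are assumptions for illustration, not a tested design:

```python
def adaptive_stream_weights(audio_nbest, video_nbest):
    # Crude per-utterance reliability estimate: the spread of a
    # modality's N-best scores. A flat (ambiguous) score list
    # suggests an unreliable stream. Weights sum to one.
    rel_a = max(audio_nbest) - min(audio_nbest)
    rel_v = max(video_nbest) - min(video_nbest)
    total = (rel_a + rel_v) or 1.0  # guard: both spreads zero
    return rel_a / total, rel_v / total
```

In a deployed system such weights would be re-estimated for every utterance, replacing the single fixed weighting parameter used in the experiments above.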
Bibliography
[1] A. Adjoudani, T. Guiard-Marigny, B. L. Goff, L. Reveret, and C. Benoit, “A mul-
timedia platform for audio-visual speech processing,” in 5th European Conference
on Speech Communication and Technology. Rhodes, Greece: Institut de la Com-
munication Parlee UPRESA, September 1997, pp. 1671–1674.
[2] K. Alsabti, S. Ranka, and V. Singh, “An efficient k-means clustering algorithm,”
in IPPS/SPDP Workshop on High Performance Data Mining, 1998.
[3] T. Artieres and P. Gallinari, “Stroke level HMMs for on-line handwriting recog-
nition,” in Frontiers in Handwriting Recognition, 2002. Proceedings. Eighth Interna-
tional Workshop on, 2002, pp. 227–232.
[4] B. Atal, “Effectiveness of linear prediction characteristics of the speech wave for
automatic speaker identification and verification,” The Journal of the Acoustical
Society of America, vol. 55, p. 1304, 1974.
[5] R. Auckenthaler, J. Brand, J. Mason, F. Deravi, and C. Chibelushi, “Lip sig-
natures for automatic person recognition,” in Audio- and Video-based Biometric
Person Authentication (AVBPA ’99), 2nd International Conference on, Washington,
D.C., 1999, pp. 142–47.
[6] E. Bailly-Bailliére, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Mariéthoz,
J. Matas, K. Messer, V. Popovici, F. Porée, B. Ruiz, and J.-P. Thiran, “The BANCA
database and evaluation protocol,” in Audio-and Video-Based Biometric Person Au-
thentication (AVBPA 2003), 4th International Conference on, ser. Lecture Notes in
Computer Science, vol. 2688. Guildford, UK: Springer-Verlag Heidelberg, 2003,
pp. 625–638.
[7] J. Barron, D. Fleet, S. Beauchemin, and T. Burkitt, “Performance of optical flow
techniques,” in Computer Vision and Pattern Recognition, 1992. Proceedings CVPR
’92., 1992 IEEE Computer Society Conference on, 1992, pp. 236–242.
[8] P. Belhumeur, J. Hespanha, and D. Kriegman, “Eigenfaces vs. fisherfaces: recog-
nition using class specific linear projection,” Pattern Analysis and Machine Intelli-
gence, IEEE Transactions on, vol. 19, no. 7, pp. 711–720, 1997.
[9] P. Belin, R. Zatorre, P. Lafaille, P. Ahad, and B. Pike, “Voice-selective areas in
human auditory cortex,” Nature, vol. 403, no. 6767, pp. 309–312, 2000.
[10] R. Bellman, Adaptive control processes : a guided tour. Princeton, N. J.: Princeton
University Press, 1961.
[11] S. Bengio, “Multimodal authentication using asynchronous HMMs,” in Audio-
and Video-Based Biometric Person Authentication. 4th International Conference,
AVBPA 2003. Proceedings, J. Kittler and M. Nixon, Eds. Guildford, UK: Springer-
Verlag, 2003, pp. 770–777.
[12] S. Bengio, “Multimodal speech processing using asynchronous hidden Markov
models,” Information Fusion, vol. 5, no. 2, pp. 81–9, June 2004.
[13] J. Bilmes, “A gentle tutorial on the EM algorithm and its application to parame-
ter estimation for Gaussian mixture and hidden Markov models,” International
Computer Science Institute, Tech. Rep., 1997.
[14] J. Brand, J. Mason, and S. Colomb, “Visual speech: A physiological or be-
havioural biometric?” in Audio- and Video-based Biometric Person Authentication
(AVBPA 2001), 3rd International Conference on, Halmstad, Sweden, 2001, pp. 157–
168.
[15] C. Bregler, H. Hild, S. Manke, and A. Waibel, “Improving connected letter recog-
nition by lipreading,” in Acoustics, Speech, and Signal Processing, 1993. ICASSP-
93., 1993 IEEE International Conference on, vol. 1, 1993, pp. 557–560 vol.1.
[16] C. Bregler and Y. Konig, ““Eigenlips” for robust speech recognition,” in Acous-
tics, Speech, and Signal Processing, 1994. ICASSP-94., 1994 IEEE International Con-
ference on, vol. ii, 1994, pp. II/669–II/672 vol.2.
[17] V. Bruce and A. Young, “Understanding face recognition.” Br J Psychol, vol. 77,
no. Pt 3, pp. 305–27, 1986.
[18] J. P. Campbell, Jr., “Speaker recognition: a tutorial,” Proceedings of the IEEE,
vol. 85, no. 9, pp. 1437–1462, 1997.
[19] H. Cetingul, E. Erzin, Y. Yemez, and A. Tekalp, “On optimal selection of lip-
motion features for speaker identification,” in Multimedia Signal Processing, 2004
IEEE 6th Workshop on, 2004, pp. 7–10.
[20] H. Cetingul, E. Erzin, Y. Yemez, and A. Tekalp, “Multimodal speaker/speech
recognition using lip motion, lip texture and audio,” Signal Processing, in press, 2006.
[21] H. Cetingul, Y. Yemez, E. Erzin, and A. Tekalp, “Discriminative lip-motion fea-
tures for biometric speaker identification,” in 2004 International Conference on
Image Processing (ICIP), vol. 3. Singapore: IEEE, 2004, p. 2023.
[22] D. Chandramohan and P. Silsbee, “A multiple deformable template approach
for visual speech recognition,” in Spoken Language, 1996. ICSLP 96. Proceedings.,
Fourth International Conference on, vol. 1, 1996, pp. 50–53 vol.1.
[23] T. Chen, “Audiovisual speech processing,” Signal Processing Magazine, IEEE,
vol. 18, no. 1, pp. 9–21, 2001.
[24] T. Chen and R. Rao, “Audio-visual integration in multimodal communication,”
Proceedings of the IEEE, vol. 86, no. 5, pp. 837–852, 1998.
[25] C. Chibelushi, F. Deravi, and J. Mason, “A review of speech-based bimodal
recognition,”Multimedia, IEEE Transactions on, vol. 4, no. 1, pp. 23–37, 2002.
[26] C. Chibelushi, S. Gandon, J. Mason, F. Deravi, and R. Johnston, “Design issues
for a digital audio-visual integrated database,” in Integrated Audio-Visual Pro-
cessing for Recognition, Synthesis and Communication (Digest No: 1996/213), IEE
Colloquium on, 1996, pp. 7/1–7/7.
[27] C. Chibelushi, J. Mason, and F. Deravi, “Feature-level data fusion for bimodal
person recognition,” in Image Processing and Its Applications, 1997., Sixth Interna-
tional Conference on, vol. 1, 1997, pp. 399–403 vol.1.
[28] G. Chiou and J.-N. Hwang, “Lipreading from color video,” Image Processing,
IEEE Transactions on, vol. 6, no. 8, pp. 1192–1195, 1997.
[29] A. G. Chitu, L. J. Rothkrantz, J. C. Wojdel, and P. Wiggers, “Comparison be-
tween different feature extraction techniques for audio-visual speech recogni-
tion,” Journal on Multimodal User Interfaces, vol. 1, no. 1, 2007.
[30] T. Cootes, G. Edwards, and C. Taylor, “Active appearance models,” Pattern Anal-
ysis and Machine Intelligence, IEEE Transactions on, vol. 23, no. 6, pp. 681–685,
2001.
[31] R. Corkrey and L. Parkinson, “Interactive voice response: Review of studies
1989-2000,” Behavior Research Methods, Instruments, & Computers, vol. 34, no. 3,
pp. 342–353(12), August 2002.
[32] B. Dasarathy, “Sensor fusion potential exploitation-innovative architectures and
illustrative applications,” Proceedings of the IEEE, vol. 85, no. 1, pp. 24–38, 1997.
[33] K. Davis, R. Biddulph, and S. Balashek, “Automatic recognition of spoken dig-
its,” The Journal of the Acoustical Society of America, vol. 24, p. 637, 1952.
[34] S. Davis and P. Mermelstein, “Comparison of parametric representations for
monosyllabic word recognition in continuously spoken sentences,” Acoustics,
Speech, and Signal Processing [see also IEEE Transactions on Signal Processing], IEEE
Transactions on, vol. 28, no. 4, pp. 357–366, 1980.
[35] D. Dean, P. Lucey, and S. Sridharan, “Audio-visual speaker identification us-
ing the CUAVE database,” in Auditory-Visual Speech Processing (AVSP), British
Columbia, Canada, July 24-27 2005, pp. 97–101.
[36] D. Dean, P. Lucey, S. Sridharan, and T. Wark, “Fused HMM-adaptation of syn-
chronous HMMs for audio-visual speech recognition,” Digital Signal Processing
(submitted).
[37] D. Dean, P. Lucey, S. Sridharan, and T. Wark, “Comparing audio and visual
information for speech processing,” in Eighth International Symposium on Signal
Processing and Its Applications (ISSPA), Sydney, Australia, 2005, pp. 58–61.
[38] D. Dean, P. Lucey, S. Sridharan, and T. Wark, “Fused HMM-adaptation of multi-
stream HMMs for audio-visual speech recognition,” in Interspeech, Antwerp,
August 2007, pp. 666–669.
[39] D. Dean, P. Lucey, S. Sridharan, and T. Wark, “Weighting and normalisation
of synchronous HMMs for audio-visual speech recognition,” in Auditory-Visual
Speech Processing, Hilvarenbeek, The Netherlands, September 2007, pp. 110–115.
[40] D. Dean, S. Sridharan, and P. Lucey, “Cascading appearance based features for
visual speaker verification,” in Interspeech 2008 (accepted), 2008.
[41] D. Dean and S. Sridharan, “Dynamic visual features for audio-visual speaker
verification,” Computer Speech and Language (submitted).
[42] D. Dean and S. Sridharan, “Fused HMM adaptation of synchronous HMMs for
audio-visual speaker verification,” inAuditory-Visual Speech Processing (accepted),
2008.
[43] D. Dean, S. Sridharan, and T. Wark, “Audio-visual speaker verification using
continuous fused HMMs,” in HCSNet Workshop on the Use of Vision in HCI
(VisHCI), 2006.
[44] D. Dean, T. Wark, and S. Sridharan, “An examination of audio-visual fused
HMMs for speaker recognition,” in Second Workshop on Multimodal User Authen-
tication (MMUA), Toulouse, France, 2006.
[45] L. Debnath and S. G. Mallat, A wavelet tour of signal processing, 2nd ed. San
Diego: Academic Press, 1999.
[46] L. Debnath, S. G. Mallat, K. R. Rao, and P. Yip, Discrete cosine transform : algo-
rithms, advantages, applications, 2nd ed. Boston: Academic Press, 1990.
[47] P. Duchnowski, U. Meier, and A. Waibel, “See me, hear me: Integrating auto-
matic speech recognition and lip-reading,” in Proc. Int. Conf. Speech Lang. Pro-
cess., Yokohama, 1994.
[48] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. Wiley-
Interscience, 2000.
[49] S. Dupont and J. Luettin, “Audio-visual speech modeling for continuous speech
recognition,”Multimedia, IEEE Transactions on, vol. 2, no. 3, pp. 141–151, 2000.
[50] H. Ellis, D. Jones, and N. Mosdell, “Intra-and inter-modal repetition priming of
familiar faces and voices,” Br J Psychol, vol. 88, no. Pt 1, pp. 143–56, 1997.
[51] N. Eveno, A. Caplier, and P.-Y. Coulon, “Accurate and quasi-automatic lip track-
ing,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 14, no. 5,
pp. 706–715, 2004.
[52] M.-I. Faraj and J. Bigun, “Audio-visual person authentication using lip-motion
from orientation maps,” Pattern Recognition Letters, vol. 28, no. 11, pp. 1368–1382,
Aug. 2007.
[53] N. Fox, R. Gross, J. Cohn, and R. Reilly, “Robust biometric person identification
using automatic classifier fusion of speech, mouth, and face experts,” Multime-
dia, IEEE Transactions on, vol. 9, no. 4, pp. 701–714, 2007.
[54] N. Fox and R. B. Reilly, “Audio-visual speaker identification based on the use
of dynamic audio and visual features,” in Audio-and Video-Based Biometric Person
Authentication (AVBPA 2003), 4th International Conference on, ser. Lecture Notes
in Computer Science, vol. 2688. Guildford, UK: Springer-Verlag Heidelberg,
2003, pp. 743–751.
[55] N. A. Fox, B. A. O’Mullane, and R. B. Reilly, “Audio-visual speaker identifica-
tion via adaptive fusion using reliability estimates of both modalities,” vol. 3546,
2005, pp. 787–796.
[56] H. Frowein, H. Frowein, G. Smoorenburg, L. Pyters, and D. Schinkel, “Improved
speech recognition through videotelephony: experiments with the hard of hear-
ing,” Selected Areas in Communications, IEEE Journal on, vol. 9, no. 4, pp. 611–616,
1991.
[57] T. Fu, X. X. Liu, L. H. Liang, X. Pi, and A. Nefian, “Audio-visual speaker
identification using coupled hidden Markov models,” in Image Processing, 2003.
Proceedings. 2003 International Conference on, vol. 3, 2003, pp. 29–32.
[58] T. Fukuda, M.-J. Jung, M. Najashima, F. Arai, and Y. Hasegawa, “Facial expres-
sive robotic head system for human-robot communication and its application in
home environment,” Proceedings of the IEEE, vol. 92, no. 11, pp. 1851–1865, 2004.
[59] K. Fukunaga, Introduction to statistical pattern recognition, 2nd ed. Boston: Aca-
demic Press, 1990.
[60] L. Girin, A. Allard, and J.-L. Schwartz, “Speech signals separation: a new ap-
proach exploiting the coherence of audio and visual speech,” in Multimedia Sig-
nal Processing, 2001 IEEE Fourth Workshop on, 2001, pp. 631–636.
[61] H. Glotin, D. Vergyri, C. Neti, G. Potamianos, and J. Luettin, “Weight-
ing schemes for audio-visual fusion in speech recognition,” in Proc.
Int. Conf. Acoust. Speech Signal Process., 2001. [Online]. Available:
citeseer.ist.psu.edu/glotin01weighting.html
[62] R. Goecke and J. Millar, “The audio-video Australian English speech data corpus
AVOZES,” in Proceedings of the 8th International Conference on Spoken Language
Processing ICSLP2004, vol. III, Jeju, Korea, Oct. 2004, pp. 2525–2528.
[63] R. Goecke, “A stereo vision lip tracking algorithm and subsequent statistical
analyses of the audio-video correlation in Australian English,” Ph.D. disserta-
tion, The Australian National University, Canberra, Australia, January 2004.
[64] B. Gold and N. Morgan, Speech and audio signal processing : processing and percep-
tion of speech and music. New York: Wiley, 2000.
[65] A. J. Goldschen, O. N. Garcia, and E. D. Petajan, “Continuous optical automatic
speech recognition by lipreading,” in Signals, Systems and Computers, 1994. 1994 Conference
Record of the Twenty-Eighth Asilomar Conference on, vol. 1, 1994, pp. 572–577 vol.1.
[66] Y. Gong, “Speech recognition in noisy environments: A survey,” Speech Commu-
nication, vol. 16, no. 3, pp. 261–291, 1995.
[67] M. Gray, J. Movellan, and T. Sejnowski, “Dynamic features for visual
speechreading: A systematic comparison,” Advances in Neural Information Pro-
cessing Systems, vol. 9, pp. 751–757, 1997.
[68] T. Hazen, K. Saenko, C. La, and J. Glass, “A segment-based audio-visual speech
recognizer: Data collection, development and initial experiments,” in Proc.
ICMI, State College, PA, 2004.
[69] M. Heckmann, F. Berthommier, and K. Kroschel, “Noise adaptive stream
weighting in audio-visual speech recognition,” EURASIP Journal on Applied Sig-
nal Processing, vol. 2002, no. 11, pp. 1260–1273, 2002.
[70] M. Heckmann, F. Berthommier, C. Savariaux, and K. Kroschel, “Effects of im-
age distortion on audio-visual speech recognition,” in ISCA Tutorial and Research
Workshop on Audio Visual Speech Processing. St Jorioz, France: ISCA, 2003, pp.
163–168.
[71] M. Heckmann, K. Kroschel, C. Savariaux, and F. Berthommier, “DCT-based
video features for audio-visual speech recognition,” in International Conf. on Spo-
ken Language Processing, Denver, Colorado, 2002, pp. 92 093–0961.
[72] H. Hermansky, “Perceptual linear predictive (plp) analysis of speech,” The Jour-
nal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.
[73] H. Hermansky, “Should recognizers have ears?” Speech Communication, vol. 25,
no. 1-3, pp. 3–27, August 1998.
[74] H. F. Hollien, Forensic voice identification. San Diego, Calif.: Academic Press,
2002.
[75] T. Ikeda, H. Ishiguro, and M. Asada, “Adaptive fusion of sensor signals based
on mutual information maximization,” in Robotics and Automation, 2003. Proceed-
ings. ICRA ’03. IEEE International Conference on, vol. 3, 2003, pp. 4398–4402 vol.3.
[76] International Phonetic Association, Handbook of the International Phonetic Asso-
ciation : A Guide to the use of the International Phonetic Alphabet. Cambridge:
Cambridge University Press, 1999.
[77] F. Itakura, “Minimum prediction residual principle applied to speech recogni-
tion,” Acoustics, Speech, and Signal Processing, IEEE Transactions on, vol. 23, no. 1,
pp. 67–72, 1975.
[78] A. Jain, K. Nandakumar, and A. Ross, “Score normalization in multimodal bio-
metric systems,” Pattern recognition, vol. 38, no. 12, pp. 2270–2285, 2005.
[79] C. Jankowski, A. Kalyanswamy, S. Basson, and J. Spitz, “NTIMIT: a phonetically
balanced, continuous speech, telephone bandwidth speech database,” in Acous-
tics, Speech, and Signal Processing, 1990. ICASSP-90., 1990 International Conference
on, 1990, pp. 109–112 vol.1.
[80] E. T. Jaynes and G. L. Bretthorst, Probability Theory : The Logic of Science. Cam-
bridge: Cambridge University Press, 2003.
[81] F. Jelinek, “The development of an experimental discrete dictation recognizer,”
Proceedings of the IEEE, vol. 73, no. 11, pp. 1616–1624, 1985.
[82] T. Jordan and P. Sergeant, “Effects of facial image size on visual and audio-visual
speech recognition,” Hearing by eye II: Advances in the psychology of speechreading
and auditory-visual speech, pp. 155–176, 1998.
[83] P. Jourlin, “Word-dependent acoustic-labial weights in HMM-based speech
recognition,” in Proceedings of AVSP’97, 1997, pp. 69–72. [Online]. Available:
citeseer.ist.psu.edu/279607.html
[84] P. Jourlin, J. Luettin, D. Genoud, and H. Wassner, “Acoustic-labial speaker ver-
ification,” in Audio- and Video-based Biometric Person Authentication (AVBPA ’97),
First International Conference on, J. Bigün, G. Chollet, and G. Borgefors, Eds.,
vol. 1. Crans-Montana, Switzerland: Springer, 1997, pp. 319–26.
[85] M. Kamachi, H. Hill, K. Lander, and E. Vatikiotis-Bateson, “‘Putting the face to
the voice’: Matching identity across modality,” Current Biology, vol. 13, no. 19,
pp. 1709–1714, Sep. 2003.
[86] A. Kanak, E. Erzin, Y. Yemez, and A. Tekalp, “Joint audio-video processing for
biometric speaker identification,” inAcoustics, Speech, and Signal Processing, 2003.
Proceedings. (ICASSP ’03). 2003 IEEE International Conference on, vol. 2, 2003, pp.
II–377–80 vol.2.
[87] M. N. Kaynak, Q. Zhi, A. D. Cheok, K. Sengupta, Z. Jian, and K. C. Chung,
“Lip geometric features for human-computer interaction using bimodal speech
recognition: comparison and analysis,” Speech Communication, vol. 43, no. 1-2,
pp. 1–16, 2004.
[88] L. Kersta, “Voiceprint identification,” The Journal of the Acoustical Society of Amer-
ica, vol. 34, p. 725, 1962.
[89] J. Kittler, “Combining classifiers: A theoretical framework,” Pattern Analysis &
Applications, vol. 1, no. 1, pp. 18–27, March 1998.
[90] T. Kleinschmidt, D. Dean, S. Sridharan, and M. Mason, “A continuous speech
recognition evaluation protocol for the AVICAR database,” in International Con-
ference on Signal Processing and Communication Systems (ICSPCS) (accepted), 2007.
[91] B. Knappmeyer, I. M. Thornton, and H. H. Bulthoff, “The use of facial motion
and facial form during the processing of identity,”Vision Research, vol. 43, no. 18,
pp. 1921–1936, Aug. 2003.
[92] S. Kong, J. Heo, B. Abidi, J. Paik, and M. Abidi, “Recent advances in visual and
infrared face recognition–a review,” Computer Vision and Image Understanding,
vol. 97, no. 1, pp. 103–135, 2005.
[93] P. Ladefoged, A Course in Phonetics, 3rd ed. Harcourt Brace College Publishers,
1993.
[94] K. Lander and L. Chuang, “Why are moving faces easier to recognize?” Visual
cognition, vol. 12, no. 3, pp. 429–442, 2005.
[95] F. Lavagetto, “Converting speech into lip movements: a multimedia telephone
for hard of hearing people,” Rehabilitation Engineering, IEEE Transactions on [see
also IEEE Trans. on Neural Systems and Rehabilitation], vol. 3, no. 1, pp. 90–102,
1995.
[96] B. Lee, M. Hasegawa-Johnson, C. Goudeseune, S. Kamdar, S. Borys, M. Liu, and
T. Huang, “AVICAR: An audiovisual speech corpus in a car environment,” in
Interspeech 2004, 2004.
[97] C.-H. Lee and J.-L. Gauvain, “Speaker adaptation based on MAP estimation of
HMM parameters,” in Acoustics, Speech, and Signal Processing, 1993. ICASSP-93.,
1993 IEEE International Conference on, vol. 2, 1993, pp. 558–561 vol.2.
[98] C.-H. Lee and J.-L. Gauvain, “Bayesian adaptive learning and MAP estimation
of HMM,” in Automatic speech and speaker recognition : Advanced topics, C.-H. Lee,
F. K. Soong, and K. K. Paliwal, Eds. Kluwer Academic, 1996, ch. 4, pp. 83–107.
[99] N. Li, S. Dettmer, and M. Shah, “Lipreading using eigensequences,” in Proc. of
Int. Workshop on Automatic Face- and Gesture-Recognition, 1995, pp. 30–34.
[100] L. Liang, X. Liu, Y. Zhao, X. Pi, and A. Nefian, “Speaker independent audio-
visual continuous speech recognition,” in Multimedia and Expo, 2002. ICME ’02.
Proceedings. 2002 IEEE International Conference on, vol. 2, 2002, pp. 25–28 vol.2.
[101] P. Lieberman, Uniquely Human: The Evolution of Speech, Thought, and Selfless Be-
havior. Harvard University Press, 1991.
[102] P. Lucey, “Lipreading across multiple views,” Ph.D. dissertation, Queensland
University of Technology, Brisbane, Australia, 2007.
[103] P. Lucey, D. Dean, and S. Sridharan, “Problems associated with area-based
visual speech feature extraction,” in Auditory-Visual Speech Processing (AVSP),
British Columbia, Canada, 2005, pp. 73–78.
[104] S. Lucey and T. Chen, “Improved audio-visual speaker recognition via the use
of a hybrid combination strategy,” in Audio- and Video-Based Biometric Person Au-
thentication. 4th International Conference, AVBPA 2003. Proceedings, J. Kittler and
M. Nixon, Eds. Guildford, UK: Springer-Verlag, 2003, pp. 929–936.
[105] S. Lucey, “Audio-visual speech processing,” Ph.D. dissertation, Queensland
University of Technology, Brisbane, 2002.
[106] S. Lucey, “An evaluation of visual speech features for the tasks of speech and
speaker recognition,” in Audio-and Video-Based Biometric Person Authentication
(AVBPA 2003), 4th International Conference on, ser. Lecture Notes in Computer
Science, vol. 2688. Guildford, UK: Springer-Verlag Heidelberg, 2003, pp. 260–
267.
[107] J. Luettin and G. Maitre, “Evaluation protocol for the extendedM2VTS database
(XM2VTSDB),” IDIAP, Tech. Rep., 1998.
[108] J. Luettin and N. A. Thacker, “Speechreading using probabilistic models,” Com-
puter Vision and Image Understanding, vol. 65, no. 2, pp. 163–178, 1997, IDIAP-RR
97-12.
[109] J. Luettin, N. Thacker, and S. Beet, “Learning to recognise talking faces,” in Pat-
tern Recognition, 1996., Proceedings of the 13th International Conference on, vol. 4,
1996, pp. 55–59 vol.4.
[110] J. Luettin, N. Thacker, and S. Beet, “Speechreading using shape and intensity in-
formation,” in Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International
Conference on, vol. 1, 1996, pp. 58–61 vol.1.
[111] J. Luettin, “Visual speech and speaker recognition,” Ph.D. dissertation, Univer-
sity of Sheffield, May 1997.
[112] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, “The
DET curve in assessment of detection task performance,” vol. 97, no. 4, 1997,
pp. 1895–1898.
[113] J. Mason and J. Brand, “The role of dynamics in visual speech biometrics,” in
Acoustics, Speech, and Signal Processing, 2002. Proceedings. (ICASSP ’02). IEEE In-
ternational Conference on, vol. 4, 2002, pp. IV–4076–IV–4079 vol.4.
[114] I. Matthews, J. Bangham, and S. Cox, “Audiovisual speech recognition using
multiscale nonlinear image decomposition,” in Spoken Language, 1996. ICSLP 96.
Proceedings., Fourth International Conference on, vol. 1, 1996, pp. 38–41 vol.1.
[115] I. Matthews, T. Cootes, S. Cox, R. Harvey, and J. Bangham, “Lipreading using
shape, shading and scale,” in Proc. of Audio Visual Speech Processing 1998 (AVSP
1998), 1998.
[116] I. Matthews, G. Potamianos, C. Neti, and J. Luettin, “A comparison of model
and transform-based visual features for audio-visual LVCSR,” inMultimedia and
Expo, 2001. ICME 2001. IEEE International Conference on, 2001, pp. 825–828.
[117] H. McGurk and J. MacDonald, “Hearing lips and seeing voices,” Nature, vol.
264, no. 5588, pp. 746–748, Dec. 1976.
[118] U. Meier, R. Stiefelhagen, J. Yang, and A. Waibel, “Towards unrestricted lip read-
ing,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 14,
no. 5, pp. 571–585, 2000.
[119] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, “XM2VTSDB: The ex-
tended M2VTS database,” in Audio and Video-based Biometric Person Authentica-
tion (AVBPA ’99), Second International Conference on, Washington D.C., 1999, pp.
72–77.
[120] P. Motlícek, L. Burget, and J. Cernocký, “Phoneme recognition of meetings using
audio-visual data,” in Joint AMI/PASCAL/IM2/M4 workshop, Martigny, CH, 2004.
[121] J. Movellan, “Visual speech recognition with stochastic networks,” in Advances
in neural information processing systems, G. Tesauro, D. Touretzky, and T. Leen,
Eds. San Mateo, CA: MIT Press Cambridge, 1995, vol. 7, pp. 851–858.
[122] A. V. Nefian and L. H. Liang, “Bayesian networks in multimodal speech recogni-
tion and speaker identification,” Conference Record of the Thirty-Seventh Asilomar
Conference on Signals, Systems and Computers; Conference Record of the Asilomar
Conference on Signals, Systems and Computers, vol. 2, pp. 2004–2008, 2003.
[123] A. V. Nefian, L. H. Liang, T. Fu, and X. X. Liu, “A Bayesian approach to audio-
visual speaker identification,” in Audio- and Video-Based Biometric Person Authen-
tication (AVBPA 2003), 4th International Conference on, ser. Lecture Notes in Com-
puter Science, vol. 2688. Guildford, UK: Springer-Verlag Heidelberg, 2003, pp.
761–769.
[124] A. V. Nefian, L. Liang, X. Pi, and X. Liu, “Dynamic Bayesian networks for audio-
visual speech recognition,” EURASIP Journal on Applied Signal Processing, vol. 11,
pp. 1–15, 2002.
[125] A. Nefian, L. Liang, X. Pi, L. Xiaoxiang, C. Mao, and K. Murphy, “A coupled
HMM for audio-visual speech recognition,” in Acoustics, Speech, and Signal Pro-
cessing, 2002. Proceedings. (ICASSP ’02). IEEE International Conference on, vol. 2,
2002, pp. 2013–2016.
[126] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison,
A. Mashari, and J. Zhou, “Audio-visual speech recognition: Workshop 2000 final
report,” Johns Hopkins University, CLSP, Tech. Rep. WS00AVSR, 2000.
[127] H. Olson and H. Belar, “Phonetic typewriter,” Audio, IRE Transactions on, vol. 5,
no. 4, pp. 90–95, 1957.
[128] A. O’Toole, D. Roark, and H. Abdi, “Recognizing moving faces: A psychological
and neural synthesis,” Trends in Cognitive Sciences, vol. 6, no. 6, pp. 261–266, 2002.
[129] H. Ouyang and T. Lee, “A new lip feature representation method for video-
based bimodal authentication,” in MMUI ’05: Proceedings of the 2005 NICTA-
HCSNet Multimodal User Interaction Workshop. Darlinghurst, Australia:
Australian Computer Society, Inc., 2006, pp. 33–37.
[130] H. Pan, S. Levinson, T. Huang, and Z.-P. Liang, “A fused hidden Markov model
with application to bimodal speech processing,” IEEE Transactions on Signal Pro-
cessing, vol. 52, no. 3, pp. 573–581, 2004.
[131] H. Pan, Z.-P. Liang, and T. S. Huang, “Estimation of the joint probability of mul-
tisensory signals,” Pattern Recognition Letters, vol. 22, no. 13, pp. 1431–1437, 2001.
[132] P. Papamichalis, Practical approaches to speech coding. Englewood Cliffs, NJ:
Prentice-Hall, 1987.
[133] E. Patterson, S. Gurbuz, Z. Tufekci, and J. Gowdy, “CUAVE: a new audio-
visual database for multimodal human-computer interface research,” in Acous-
tics, Speech, and Signal Processing, 2002. Proceedings. (ICASSP ’02). IEEE Interna-
tional Conference on, vol. 2, 2002, pp. 2017–2020.
[134] D. Paul and J. Baker, “The design for the Wall Street Journal-based CSR corpus,”
in Proceedings of the DARPA Speech and Natural Language Workshop, 1992, pp. 357–
362.
[135] E. Petajan, B. Bischoff, D. Bodoff, and N. M. Brooke, “An improved automatic
lipreading system to enhance speech recognition,” in CHI ’88: Proceedings of the
SIGCHI conference on Human factors in computing systems. New York, NY, USA:
ACM, 1988, pp. 19–25.
[136] E. D. Petajan, “Automatic lipreading to enhance speech recognition,” in Proc.
Global Telecomm. Conf., 1984, pp. 265–272.
[137] S. Pigeon. (1998, 5/4/2004) M2VTS multimodal face database. [Online]. Avail-
able: http://www.tele.ucl.ac.be/PROJECTS/M2VTS/m2fdb.html
[138] S. Pigeon. (1998, October) M2VTS project: Multi-
modal biometric person authentication. [Online]. Available:
http://www.tele.ucl.ac.be/PROJECTS/M2VTS/
[139] K. Pilz, I. Thornton, and H. Bülthoff, “A search advantage for faces learned in
motion,” Experimental Brain Research, vol. 171, no. 4, pp. 436–447, 2006.
[140] G. Potamianos, E. Cosatto, H. Graf, and D. Roe, “Speaker independent audio-
visual database for bimodal ASR,” in Proc. Europ. Tut. Work. Audio-Visual Speech
Proc, Rhodes, 1997, pp. 65–68.
[141] G. Potamianos and H. Graf, “Discriminative training of HMM stream exponents
for audio-visual speech recognition,” in Acoustics, Speech, and Signal Processing,
1998. ICASSP ’98. Proceedings of the 1998 IEEE International Conference on, vol. 6,
1998, pp. 3733–3736.
[142] G. Potamianos, H. Graf, and E. Cosatto, “An image transform approach for
HMM based automatic lipreading,” in Image Processing, 1998. ICIP 98. Proceed-
ings. 1998 International Conference on, 1998, pp. 173–177.
[143] G. Potamianos and C. Neti, “Automatic speechreading of impaired speech,” in
International Conference on Auditory-Visual Speech Processing (AVSP 2001). Aal-
borg, Denmark: ISCA, September 7-9 2001.
[144] G. Potamianos and C. Neti, “Improved ROI and within frame discriminant fea-
tures for lipreading,” in Image Processing, 2001. Proceedings. 2001 International
Conference on, vol. 3, 2001, pp. 250–253.
[145] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. Senior, “Recent advances in
the automatic recognition of audiovisual speech,” Proceedings of the IEEE, vol. 91,
no. 9, pp. 1306–1326, 2003.
[146] G. Potamianos, C. Neti, J. Luettin, and I. Matthews, “Audio-visual automatic
speech recognition: An overview,” in Issues in Visual and Audio-Visual Speech
Processing, G. Bailly, E. Vatikiotis-Bateson, and P. Perrier, Eds. MIT Press, 2004.
[147] G. Potamianos, A. Verma, C. Neti, G. Iyengar, and S. Basu, “A cascade image
transform for speaker independent automatic speechreading,” in Multimedia and
Expo, 2000. ICME 2000. 2000 IEEE International Conference on, vol. 2, 2000, pp.
1097–1100.
[148] G. Potamianos and P. Lucey, “Audio-visual ASR from multiple views inside smart
rooms,” in Multisensor Fusion and Integration for Intelligent Systems, 2006 IEEE
International Conference on, 2006, pp. 35–40.
[149] G. Potamianos, C. Neti, G. Iyengar, A. W. Senior, and A. Verma, “A cascade vi-
sual front end for speaker independent automatic speechreading,” International
Journal of Speech Technology, vol. 4, no. 3, pp. 193–208, Jul 2001.
[150] R. Potter, “Visible patterns of sound,” Science, vol. 102, no. 2654, pp. 463–470,
1945.
[151] J. Psutka, L. Müller, and J. V. Psutka, “Comparison of MFCC and PLP parameteri-
zations in the speaker independent continuous speech recognition task,” in 7th
European Conference on Speech Communication and Technology, Aalborg, Denmark,
September 3-7 2001.
[152] L. Rabiner, S. Levinson, A. Rosenberg, and J. Wilpon, “Speaker-independent
recognition of isolated words using clustering techniques,” Acoustics, Speech, and
Signal Processing, IEEE Transactions on, vol. 27, no. 4, pp. 336–349, 1979.
[153] L. R. Rabiner and B. H. Juang, Fundamentals of speech recognition. Englewood
Cliffs, NJ: PTR Prentice Hall, 1993.
[154] L. Rabiner, “The role of voice processing in telecommunications,” in Interactive
Voice Technology for Telecommunications Applications, 1994., Second IEEE Workshop
on, 1994, pp. 1–8.
[155] D. R. Reddy, “Approach to computer speech recognition by direct analysis of
the speech wave,” The Journal of the Acoustical Society of America, vol. 40, no. 5,
pp. 1273–1273, 1966.
[156] D. Reisberg, J. McLean, and A. Goldfield, “Easy to hear but hard to under-
stand: A lip-reading advantage with intact auditory stimuli,” in Hearing by Eye,
B. Dodd and R. Campbell, Eds. London: Lawrence Erlbaum Associates, 1987,
ch. 4, pp. 97–113.
[157] D. Reynolds, “Experimental evaluation of features for robust speaker identifica-
tion,” Speech and Audio Processing, IEEE Transactions on, vol. 2, no. 4, pp. 639–643,
1994.
[158] D. Reynolds and R. Rose, “Robust text-independent speaker identification us-
ing Gaussian mixture speaker models,” Speech and Audio Processing, IEEE Trans-
actions on, vol. 3, no. 1, pp. 72–83, 1995.
[159] M. Roach, J. Brand, and J. Mason, “Acoustic and facial features for speaker
recognition,” in Pattern Recognition, 2000. Proceedings. 15th International Confer-
ence on, vol. 3, 2000, pp. 258–261.
[160] D. Roark, A. J. O’Toole, H. Abdi, and S. Barrett, “Learning the moves: The effect
of familiarity and facial motion on person recognition across large changes in
viewing format,” Perception, vol. 35, pp. 761–773, 2006.
[161] A. A. Ross, K. Nandakumar, and A. K. Jain, Handbook of Multibiometrics, ser.
International Series on Biometrics, D. D. Zhang and A. K. Jain, Eds. Springer,
2006.
[162] L. Rothkrantz, J. Wojdel, and P. Wiggers, “Comparison between different feature
extraction techniques in lipreading applications,” in SPECOM 2006, 2006.
[163] C. Sanderson, “The VidTIMIT database,” IDIAP Communication 02-06, 2002.
[164] P. Scanlon and R. Reilly, “Feature analysis for automatic speechreading,” in
Multimedia Signal Processing, 2001 IEEE Fourth Workshop on, 2001, pp. 625–630.
[165] S. Schweinberger, “Hearing facial identities,” The Quarterly Journal of Experimen-
tal Psychology, 2007.
[166] S. Schweinberger, A. Herholz, and V. Stief, “Auditory long term memory: Rep-
etition priming of voice recognition,” The Quarterly Journal of Experimental Psy-
chology Section A, vol. 50, no. 3, pp. 498–517, 1997.
[167] J. Shepherd, G. Davies, and H. Ellis, “Studies of cue saliency,” in Perceiving and
Remembering Faces, 1981, pp. 105–131.
[168] P. Silsbee and A. Bovik, “Computer lipreading for improved accuracy in au-
tomatic speech recognition,” Speech and Audio Processing, IEEE Transactions on,
vol. 4, no. 5, pp. 337–351, 1996.
[169] D. G. Stork, G. Wolff, and E. Levine, “Neural network lipreading system for
improved speech recognition,” in Neural Networks, 1992. IJCNN., International
Joint Conference on, vol. 2, June 1992, pp. 289–295.
[170] W. Sumby and I. Pollack, “Visual contribution to speech intelligibility in noise,”
The Journal of the Acoustical Society of America, vol. 26, no. 2, pp. 212–215, 1954.
[171] Q. Summerfield, “Lipreading and audio-visual speech perception,” Philosophical
Transactions: Biological Sciences, vol. 335, no. 1273, pp. 71–78, 1992.
[172] Q. Summerfield, “Some preliminaries to a comprehensive account of audio-
visual speech perception,” in Hearing by Eye, B. Dodd and R. Campbell, Eds.
London: Lawrence Erlbaum Associates, 1987, ch. 1, pp. 1–51.
[173] P. Teissier, J.-L. Schwartz, and A. Guerin-Dugue, “Models for audiovisual fusion
in a noisy-vowel recognition task,” in Multimedia Signal Processing, 1997., IEEE
First Workshop on, 1997, pp. 37–44.
[174] R. Tenney and N. Sandell, “Detection with distributed sensors,” Aerospace and
Electronic Systems, IEEE Transactions on, vol. AES-17, no. 4, pp. 501–510, 1981.
[175] M. Turk and A. Pentland, “Face recognition using eigenfaces,” in Computer Vi-
sion and Pattern Recognition, 1991. Proceedings CVPR ’91., IEEE Computer Society
Conference on, 1991, pp. 586–591.
[176] A. Varga and H. J. M. Steeneken, “Assessment for automatic speech recognition
II: NOISEX-92: a database and an experiment to study the effect of additive noise
on speech recognition systems,” Speech Communication, vol. 12, no. 3, pp. 247–
251, 1993.
[177] E. Vatikiotis-Bateson, G. Bailly, and P. Perrier, Eds., Audio-Visual Speech Process-
ing. MIT Press, 2006.
[178] T. K. Vintsyuk, “Speech discrimination by dynamic programming,” Cybernetics
and Systems Analysis, vol. 4, no. 1, pp. 52–57, Jan 1968.
[179] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple
features,” in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceed-
ings of the 2001 IEEE Computer Society Conference on, vol. 1, 2001, pp. I-511–I-518.
[180] K. von Kriegstein, A. Kleinschmidt, P. Sterzer, and A.-L. Giraud, “Interaction
of face and voice areas during speaker recognition,” Journal of Cognitive Neuro-
science, vol. 17, no. 3, pp. 367–376, 2005.
[181] T. Wagner and U. Dieckmann, “Multi-sensorial inputs for the identification of
persons with synergetic computers,” in Image Processing, 1994. Proceedings. ICIP-
94., IEEE International Conference, vol. 2, 1994, pp. 287–291.
[182] T. Wagner and U. Dieckmann, “Sensor-fusion for robust identification of per-
sons: a field test,” in Image Processing, 1995. Proceedings., International Conference
on, vol. 3, 1995, pp. 516–519.
[183] G. K. Wallace, “The JPEG still picture compression standard,” Commun. ACM,
vol. 34, no. 4, pp. 30–44, 1991.
[184] S. Wang, W. Lau, S. Leung, and H. Yan, “A real-time automatic lipreading sys-
tem,” in Circuits and Systems, 2004. ISCAS ’04. Proceedings of the 2004 International
Symposium on, vol. 2, 2004, pp. II-101–II-104.
[185] T. Wark and S. Sridharan, “A syntactic approach to automatic lip feature extrac-
tion for speaker identification,” in Acoustics, Speech, and Signal Processing, 1998.
ICASSP ’98. Proceedings of the 1998 IEEE International Conference on, vol. 6, 1998,
pp. 3693–3696.
[186] T. Wark and S. Sridharan, “Adaptive fusion of speech and lip information for
robust speaker identification,” Digital Signal Processing, vol. 11, no. 3, pp. 169–
186, 2001.
[187] T. Wark, S. Sridharan, and V. Chandran, “Robust speaker verification via asyn-
chronous fusion of speech and lip information,” in Audio- and Video-based Bio-
metric Person Authentication (AVBPA ’99), 2nd International Conference on, Wash-
ington, D.C., 1999, pp. 37–42.
[188] T. Wark, S. Sridharan, and V. Chandran, “The use of temporal speech and lip
information for multi-modal speaker identification via multi-stream HMMs,” in
Acoustics, Speech, and Signal Processing, 2000. ICASSP ’00. Proceedings. 2000 IEEE
International Conference on, vol. 6, 2000, pp. 2389–2392.
[189] T. Wark, D. Thambiratnam, and S. Sridharan, “Person authentication using lip
information,” in TENCON ’97. IEEE Region 10 Annual Conference, Speech and Im-
age Technologies for Computing and Telecommunications, Proceedings of IEEE, vol. 1,
1997, pp. 153–156.
[190] J. Webb and E. Rissanen, “Speaker identification experiments using HMMs,” in
Acoustics, Speech, and Signal Processing, 1993. ICASSP-93., 1993 IEEE International
Conference on, vol. 2, 1993, pp. 387–390.
[191] P. L. Williams, Gray’s anatomy of the human body, 20th ed. Churchill Livingstone,
1918.
[192] M.-H. Yang, D. Kriegman, and N. Ahuja, “Detecting faces in images: a survey,”
Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, no. 1, pp.
34–58, 2002.
[193] Y. Yemez, A. Kanak, E. Erzin, and A. Tekalp, “Multimodal speaker identifica-
tion with audio-video processing,” in Image Processing, 2003. Proceedings. 2003
International Conference on, vol. 3, 2003, pp. 5–8.
[194] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey,
V. Valtchev, and P. Woodland, The HTK Book, 3rd ed. Cambridge, UK: Cam-
bridge University Engineering Department., 2002.
[195] B. P. Yuhas, M. H. Goldstein, and T. J. Sejnowski, “Integration of acoustic and
visual speech signals using neural networks,” Communications Magazine, IEEE,
vol. 27, no. 11, November 1989, pp. 65–71.
[196] A. L. Yuille, P. W. Hallinan, and D. S. Cohen, “Feature extraction from faces using
deformable templates,” International Journal of Computer Vision, vol. 8, no. 2, pp.
99–111, August 1992.
[197] Z. Zeng, J. Tu, B. Pianfetti, M. Liu, T. Zhang, Z. Zhang, T. Huang, and S. Levin-
son, “Audio-visual affect recognition through multi-stream fused HMM for HCI,”
in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer So-
ciety Conference on, vol. 2, 2005, pp. 967–972.
[198] X. Zhang and R. Mersereau, “Lip feature extraction towards an automatic
speechreading system,” in Image Processing, 2000. Proceedings. 2000 International
Conference on, vol. 3, 2000, pp. 226–229.
[199] W. Zhao, R. Chellappa, P. Phillips, and A. Rosenfeld, “Face recognition: A lit-
erature survey,” ACM Computing Surveys (CSUR), vol. 35, no. 4, pp. 399–458,
2003.