Object Tracking and Asynchrony in Audio-Visual Speech Recognition
Mark Hasegawa-Johnson
AIVR Seminar
August 31, 2006
AVICAR is thanks to: Bowon Lee, Ming Liu, Camille Goudeseune, Suketu Kamdar, Carl Press, Sarah Borys
and to the Motorola Communications Center
Some experiments and most good ideas in this talk thanks to
Ming Liu, Karen Livescu, Kate Saenko and Partha Lal
Why AVSR is not like ASR
• Use of classifiers as features: e.g., the output of an AdaBoost lip tracker is a feature in a face constellation
• Obstruction: the tongue is rarely visible, the glottis never
• Asynchrony: visual evidence for a word can start long before the audio evidence
Which digit is she about to say?
Why ASR is like AVSR
• Use of classifiers as features: e.g., neural networks or SVMs transform audio spectra into a phonetic feature space
• Obstruction: lip closure "hides" tongue closure; a glottal stop "hides" lip or tongue position
• Asynchrony: tongue, lips, velum, and glottis can be out of sync, e.g., "every" → "ervy"
Discriminative Features in Face/Lip Tracking: AdaBoost
1. Each wavelet defines a "weak classifier": h_i(x) = 1 iff f_i(x) > threshold, else h_i(x) = 0
2. Start with equal weight for all training tokens: w_m(1) = 1/M, 1 ≤ m ≤ M
3. For each learning iteration t:
• Find the i that minimizes the weighted training error ε_t.
• Decrease w_m if token m was correctly classified, otherwise increase it.
• α_t = log((1 − ε_t)/ε_t)
• The final "strong classifier" is H(x) = 1 iff Σ_t α_t h_t(x) ≥ ½ Σ_t α_t
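A minimal sketch of the boosting loop described above, assuming the wavelet responses f_i(x) have already been computed into a feature matrix (all names here are illustrative, not taken from the original tracker):

    import numpy as np

    def adaboost_train(F, y, n_rounds=50):
        """Train threshold ("stump") weak classifiers on wavelet responses.

        F : (M, D) array of wavelet responses f_i(x) for M training tokens.
        y : (M,) array of labels in {0, 1}.
        Returns a list of (feature index, threshold, alpha) triples.
        """
        M, D = F.shape
        w = np.full(M, 1.0 / M)          # step 2: equal initial weights
        ensemble = []
        for _ in range(n_rounds):        # step 3: boosting iterations
            best = None
            for i in range(D):           # pick feature/threshold with lowest weighted error
                for theta in np.unique(F[:, i]):
                    h = (F[:, i] > theta).astype(int)   # weak classifier h_i(x)
                    err = np.sum(w * (h != y))
                    if best is None or err < best[0]:
                        best = (err, i, theta, h)
            eps, i, theta, h = best
            eps = np.clip(eps, 1e-10, 1 - 1e-10)
            alpha = np.log((1 - eps) / eps)             # alpha_t = log((1 - eps_t) / eps_t)
            w *= np.exp(-alpha * (h == y))              # correctly classified tokens get lighter
            w /= w.sum()
            ensemble.append((i, theta, alpha))
        return ensemble

    def adaboost_classify(ensemble, x):
        """Strong classifier: H(x) = 1 iff sum_t alpha_t h_t(x) >= 0.5 * sum_t alpha_t."""
        score = sum(a for i, th, a in ensemble if x[i] > th)
        return int(score >= 0.5 * sum(a for _, _, a in ensemble))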
Example Haar Wavelet Features Selected by AdaBoost
AdaBoost in a Bayesian Context
• The AdaBoost "margin" M_D(x)
• Guaranteed range: 0 ≤ M_D(x) ≤ 1
• An inverse sigmoid transform yields nearly normal distributions
Prior: Relative Position of Lips in the Face
p(r = r_lips | M_D(x)) ∝ p(r = r_lips) · p(M_D(x) | r = r_lips)
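A hedged sketch of how these pieces might fit together, not taken from the original system: the AdaBoost margin at each candidate lip location is passed through the inverse sigmoid, scored with a Gaussian likelihood, and multiplied by the prior on the relative lip position. The normalized-margin definition below is an assumption; only the logit transform and the Gaussian model come from the slide.

    import numpy as np

    def margin(ensemble, x):
        """Assumed normalized margin in [0, 1]: sum_t alpha_t h_t(x) / sum_t alpha_t."""
        num = sum(a for i, th, a in ensemble if x[i] > th)
        den = sum(a for _, _, a in ensemble)
        return num / den

    def logit(m, eps=1e-6):
        """Inverse sigmoid transform; per the slide this makes the margin nearly Gaussian."""
        m = np.clip(m, eps, 1 - eps)
        return np.log(m / (1 - m))

    def lip_posterior(prior_r, mu, sigma, margins):
        """p(r = r_lips | M_D(x)) ∝ p(r = r_lips) * p(M_D(x) | r = r_lips).

        prior_r   : prior probability of each candidate relative position in the face.
        mu, sigma : Gaussian fit to logit-margins of true lip regions (illustrative values).
        margins   : raw AdaBoost margin at each candidate location.
        """
        z = logit(np.asarray(margins))
        lik = np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        post = np.asarray(prior_r) * lik
        return post / post.sum()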
Lip Tracking: a few results
Pixel-Based Features
Pixel-Based Features: Dimension
Model-Based Correction for Head-Pose Variability
• If the head is an ellipse, its measured width w_F(t) and height h_F(t) are functions of the roll ρ, yaw ψ, pitch φ, true height h̄_F, and true width w̄_F according to …
• … which can usefully be approximated as …
Robust Correction: Linear Regression
• The additive random part of the lip width, w_L(t) = w_L1 + h̄_L·cos ψ(t)·sin ρ(t), is proportional to the similar additive variation in the head width, w_F(t) = w_F1 + h̄_F·cos ψ(t)·sin ρ(t), so we can eliminate it by orthogonalizing w_L(t) to w_F(t).
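A small sketch of that orthogonalization, assuming per-frame lip and face width tracks for one utterance (variable names are illustrative): the component of lip width that is linearly predictable from face width is regressed out, leaving a pose-compensated residual.

    import numpy as np

    def orthogonalize(w_lip, w_face):
        """Remove the head-width-correlated part of the lip width by linear regression.

        w_lip, w_face : 1-D arrays of per-frame lip and face widths.
        Returns the residual of w_lip after projecting out [1, w_face].
        """
        X = np.column_stack([np.ones_like(w_face), w_face])   # intercept + face width
        beta, *_ = np.linalg.lstsq(X, w_lip, rcond=None)       # least-squares fit
        return w_lip - X @ beta                                # pose-compensated lip width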
WER Results from AVICAR (testing on the training data; 34 talkers, continuous digits)
LR = linear regression
Model = model-based head-pose compensation
LLR = log-linear regression
13+d+dd = 13 static features (plus deltas and delta-deltas); 39 = 39 static features
All systems have mean and variance normalization and MLLR.
[Bar chart: WER for each feature set and head-pose correction method]
Audio-Visual Asynchrony
For example, the tongue touches the teeth before acoustic speech onset in the word "three"; the lips are already round in anticipation of the /r/.
[Diagram: acoustic and visual channels unfolding over frames t = 1, 2, 3, …, T]
Audio-Visual Asynchrony: Coupled HMM is a typical Phoneme-Viseme Model
(Chu and Huang, 2002)
A Physical Model of Asynchrony (slide created by Karen Livescu)
Articulatory Phonology [Browman & Goldstein '90]: the following 8 tract variables are independently and asynchronously controlled:
LIP-LOC: Protruded, Labial, Dental
LIP-OP: CLosed, CRitical, Narrow, Wide
TT-LOC: Dental, Alveolar, Palato-Alveolar, Retroflex
TB-LOC: Palatal, Velar, Uvular, Pharyngeal
TT-OP, TB-OP: CLosed, CRitical, Narrow, Mid-Narrow, Mid, Wide
GLO: CLosed (stop), CRitical (voiced), Open (voiceless)
VEL: CLosed (non-nasal), Open (nasal)
[Diagram: vocal tract labeled with LIP-LOC, LIP-OP, TT-LOC, TT-OP, TB-LOC, TB-OP, VELUM, GLOTTIS]
For speech recognition, we collapse these into 3 streams: lips, tongue, and glottis (LTG).
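One way to express the collapse as a simple lookup. The lips/tongue/glottis groupings follow the slide; placing VEL in the glottis stream is an assumption made here only so the table is complete.

    # Collapsing the 8 tract variables into the 3 recognition streams (LTG).
    TRACT_VARIABLE_TO_STREAM = {
        "LIP-LOC": "lips",
        "LIP-OP":  "lips",
        "TT-LOC":  "tongue",
        "TT-OP":   "tongue",
        "TB-LOC":  "tongue",
        "TB-OP":   "tongue",
        "GLO":     "glottis",
        "VEL":     "glottis",   # assumption: grouped with the glottis stream
    }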
Motivation: Pronunciation variation (slide created by Karen Livescu)
Baseform versus surface (actual) pronunciations, with counts:

probably (baseform: p r aa b ax b l iy)
(2) p r aa b iy
(1) p r ay
(1) p r aw l uh
(1) p r ah b iy
(1) p r aa l iy
(1) p r aa b uw
(1) p ow ih
(1) p aa iy
(1) p aa b uh b l iy
(1) p aa ah iy

sense (baseform: s eh n s)
(1) s eh n t s
(1) s ih t s

everybody (baseform: eh v r iy b ah d iy)
(1) eh v r ax b ax d iy
(1) eh v er b ah d iy
(1) eh ux b ax iy
(1) eh r uw ay
(1) eh b ah iy

don't (baseform: d ow n t)
(37) d ow n
(16) d ow
(6) ow n
(4) d ow n t
(3) d ow t
(3) d ah n
(3) ow
(3) n ax
(2) d ax n
(2) ax
(1) n uw
...
[Plot: number of surface pronunciations per word (0–80) versus minimum number of occurrences (0–200)]
Explanation: Asynchrony of tract variables (based on a slide created by Karen Livescu)
[Diagram for "sense": the dictionary form s eh n s is shown as parallel feature tracks, with tongue (T) values crit/alveolar → mid/palatal → closed/alveolar → crit/alveolar, glottis (G) values open → critical → open, and a nasal interval for the /n/. Surface variant #1, s eh n t s, is an example of feature asynchrony; surface variant #2, s ih t s, is an example of feature asynchrony plus substitution.]
Implementation: Multi-stream DBN (slide created by Karen Livescu)
• Phone-based: q (phonetic state), o (observation vector)
• Articulatory feature-based: L (state of lips), T (state of tongue), G (state of glottis), o (observation vector)
Baseline: Audio-only phone-based HMM (slide created by Partha Lal)
• positionInWordA ∈ {0, 1, 2, ...}
• stateTransitionA ∈ {0, 1}
• phoneStateA ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }
• obsA
Baseline: Video-only phone-based HMM (slide created by Partha Lal)
• positionInWordV ∈ {0, 1, 2, ...}
• stateTransitionV ∈ {0, 1}
• phoneStateV ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }
• obsV
Audio-visual HMM without asynchrony (slide created by Partha Lal)
• positionInWord ∈ {0, 1, 2, ...}
• stateTransition ∈ {0, 1}
• phoneState ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }
• obsA, obsV: both observation streams generated by the single phone-state chain
Phoneme-Viseme CHMM (slide created by Partha Lal)
• Audio stream: positionInWordA ∈ {0, 1, 2, ...}, stateTransitionA ∈ {0, 1}, phoneStateA ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }, obsA
• Video stream: positionInWordV ∈ {0, 1, 2, ...}, stateTransitionV ∈ {0, 1}, phoneStateV ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }, obsV
Articulatory Feature CHMM
• Lips stream: positionInWordL ∈ {0, 1, 2, ...}, stateTransitionL ∈ {0, 1}, L ∈ { /OP/1, /OP/2, /RND/1, … }
• Tongue stream: positionInWordT ∈ {0, 1, 2, ...}, stateTransitionT ∈ {0, 1}, T ∈ { /CL-ALV/1, /CL-ALV/2, /MID-UV/1, … }
• Glottis stream: positionInWordG ∈ {0, 1, 2, ...}, stateTransitionG ∈ {0, 1}, G ∈ { /OP/1, /OP/2, /CRIT/1, … }
• Observations: obsA, obsV
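A compact sketch of the asynchrony constraint that separates these coupled models from the synchronous HMM: each stream keeps its own position in the word, and a joint transition is legal only if the streams stay within a fixed number of states of each other. The two-stream simplification and the names below are illustrative, not the original implementation.

    from itertools import product

    def joint_states(n_states, max_async):
        """Enumerate legal (audio, video) state pairs for one word model.

        n_states  : number of per-stream states in the word.
        max_async : maximum allowed |audio position - video position|;
                    0 reproduces the synchronous audio-visual HMM.
        """
        return [(a, v) for a, v in product(range(n_states), repeat=2)
                if abs(a - v) <= max_async]

    def legal_transition(cur, nxt, max_async):
        """Each stream may stay or advance by one state; asynchrony stays bounded."""
        (a0, v0), (a1, v1) = cur, nxt
        return (a1 - a0 in (0, 1) and v1 - v0 in (0, 1)
                and abs(a1 - v1) <= max_async)

    # Example: with 6 states per stream and 1 state of allowed asynchrony,
    # (2, 3) -> (3, 3) is legal, but the pair (2, 4) is not even reachable.
    states = joint_states(6, max_async=1)
    assert (2, 4) not in states
    assert legal_transition((2, 3), (3, 3), max_async=1)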
Asynchrony Experiments: CUAVE
• 169 utterances used, 10 digits each
• NOISEX speech babble added at various SNRs
• Experimental setup
– Training on clean data; number of Gaussians tuned on a clean dev set
– Audio/video weights tuned on noise-specific dev sets
– Uniform ("zero-gram") language model
– Decoding constrained to 10-word utterances (avoids language model scale/penalty tuning)
• Thanks to Amar Subramanya at UW for the video observations
• Thanks to Kate Saenko at MIT for the initial baselines and audio observations
Results, part 1: Should we use video?
Answer: Fusion WER < single-stream WER. (Novelty: none; many authors have reported this.)
[Bar chart: WER for the audio-only, video-only, and audiovisual systems at clean, 12 dB, 10 dB, 6 dB, 4 dB, and −4 dB SNR]
Results, part 2: Should the streams be asynchronous?
Answer: Asynchronous WER < synchronous WER (4% absolute at mid SNRs). (Novelty: first phone-based AVSR with inter-phone asynchrony.)
[Bar chart: WER at clean, 12 dB, 10 dB, 6 dB, 4 dB, and −4 dB SNR for no asynchrony, 1 state of asynchrony, 2 states of asynchrony, and unlimited asynchrony]
Results, part 3: Should asynchrony be modeled using articulatory features?
Answer: Articulatory-feature WER = phoneme-viseme WER. (Novelty: first articulatory feature model for AVSR.)
[Bar chart: WER at clean, 12 dB, 10 dB, 6 dB, 4 dB, and −4 dB SNR for the phone-viseme and articulatory-feature systems]
Results, part 4: Can the AF system help the CHMM correct mistakes?
Answer: The combination AF + PV gives the best results on this database.
Details: Systems vote to determine the label of each word (NIST ROVER). PV = phone-viseme, AF = articulatory features.
[Bar chart: WER on devtest, averaged across SNRs (y-axis 17–23), for ROVER of the best three systems with AF, ROVER of the best three without AF, PV with 2 states of asynchrony, and AF PV with 1 state of asynchrony]
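A much-simplified sketch of the word-level voting idea. Real NIST ROVER first aligns the hypotheses into a word transition network; here the digit strings are assumed to be already aligned, which holds for the fixed-length 10-digit utterances, and the example strings are invented for illustration.

    from collections import Counter

    def vote(hypotheses):
        """Pick the majority word at each position across system outputs.

        hypotheses : list of equal-length word lists, one per system
                     (e.g., the AF and PV CHMMs at different asynchrony settings).
        """
        return [Counter(words).most_common(1)[0][0] for words in zip(*hypotheses)]

    # Illustrative only -- not actual system output.
    print(vote([["one", "two", "tree"],
                ["one", "two", "three"],
                ["one", "too", "three"]]))   # -> ['one', 'two', 'three']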
Conclusions
• Classifiers as features:
– AdaBoost "margin" outputs can be used as features in a Gaussian model of facial geometry
• Head-pose correction in noise:
– The best correction algorithm uses linear regression followed by model-based correction
• Asynchrony matters:
– The best phone-based recognizer is a CHMM with two states of asynchrony allowed between audio and video
• Articulatory feature models complement phone models:
– The two kinds of system have identical WER
– The best result is obtained when systems of both types are combined using ROVER