Joint Processing of Audio and Visual Information for Speech Recognition
CLSP Workshop 2000, 08/24/00
AVSR Team
Chalapathy Neti (IBM)
Gerasimos Potamianos (IBM)
Juergen Luettin (IDIAP)
Iain Matthews (CMU)
Herve Glotin (ICP/IDIAP)
Dimitra Vergyri (Johns Hopkins)
June Sison (UC Santa Cruz)
Azad Mashari (U. Toronto)
Jie Zhou (Johns Hopkins)
We don't just listen, we watch!
Agenda
Chalapathy Neti (Introduction)
Makis Potamianos: Database and visual front end
Iain Matthews: Active Appearance Models
Juergen Luettin: Asynchrony modeling
Dimitra Vergyri: Phone dependent weighting schemes
Herve Glotin: Multistream and stream weighting schemes
June Sison: Visual clustering
Azad Mashari: Visual modeling
Jie Zhou: Visual model adaptation
Chalapathy Neti (Summary, conclusions and discussion)
Featured speaker: Eric Petajan (Face2Face Animation): MPEG-4 visual speech representation
Audio-Visual Speech Recognition by Rescoring

[System diagram: compressed audio-visual speech content from a camera/video clip is demultiplexed (demux); the video stream is decoded, the ROI extracted, and a visual speech feature representation [ ]V computed, giving P(V|AU); the audio stream is decoded and auditory features [ ]A extracted, giving P(A|acoustic unit); a search over the audio features with a language model (LM) produces word lattices, which are then rescored using the joint score P(A,V|AU)]
Combination of Audio and Visual Speech: Research Issues

Goal: improve audio-based LVCSR by using visual information (previous work on isolated digits/letters)

Key research issues:
• Data (IBM)
• Location and tracking of visual speech regions (IBM)
• Specification and representation of visual speech features
  - DCT, discriminant representations (Makis)
  - Active Appearance Models (Iain Matthews)
• Visual models
  - Visually relevant triphone clustering (June)
  - Visual modeling schemes (Azad)
  - Visual model adaptation (Jie)
• Fusion strategies
  - Discriminant feature fusion (Makis/Juergen)
  - Multistream - state synchrony (Juergen/Herve)
  - Multistream - weighting schemes (Herve)
  - Product HMM - asynchronous (Juergen)
  - Unit dependent weighting schemes (Dimitra)
Fusion Research Issues

General fusion problem:
• Model Pr[O_1(t), ..., O_S(t) | j], with S streams; j is a class of interest (e.g. phone, word, etc.)
• Stream independence/dependence
• Type of fusion: early (feature) vs. late (decision)
• Decision fusion: Score[O_1(t), ..., O_S(t) | j] = f( Pr[O_s(t) | j], s = 1, ..., S )
• Asynchrony between streams
• Synchronization level: feature, state, phone, word, utterance
• Classes of interest: visemes vs. phonemes
• Stream confidence estimation: SNR-based, HNR, entropy, etc.
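As a hedged illustration of the decision fusion formula above, a minimal log-linear combiner might look like the sketch below; the stream weights and scores are made-up values, not the workshop system's:

```python
import numpy as np

def fuse_scores(stream_loglikes, weights):
    """Log-linear decision fusion: log Score[O_1..O_S | j] =
    sum_s lambda_s * log Pr[O_s | j].  stream_loglikes is an
    (S, J) array of per-stream log-likelihoods for J classes."""
    return np.asarray(weights) @ np.asarray(stream_loglikes)

# Two streams (audio, video), three candidate phone classes.
loglikes = np.array([[-4.1, -2.3, -7.0],   # audio stream
                     [-3.0, -5.2, -2.8]])  # video stream
weights = [0.7, 0.3]                       # e.g. SNR-based confidences
fused = fuse_scores(loglikes, weights)
print(fused.argmax())                      # class with the best fused score
```

The confidence estimates (SNR, HNR, entropy) listed above would set the weights; with equal weights this reduces to the plain conditional-independence product.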
Baseline

Data:
• SI train (261 spkrs, 35 hrs)
• SI test (26 spkrs, 2.5 hrs)
• Vocabulary: 10,400 words

Experimental method: rescore lattices

Baseline results:
• Audio: rescoring of lattices using matched audio models
• Oracle: pick the path closest to the truth
• Random Path (R. Path): pick a random path through the lattice (both sketched below)
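For concreteness, here is a small sketch of how the Oracle and Random Path bounds could be computed from an n-best expansion of a lattice; it assumes word sequences as Python lists and is illustrative, not the actual scoring tool:

```python
import random

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (subs/ins/del all cost 1)."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        new = [i]
        for j, r in enumerate(ref, 1):
            new.append(min(d[j] + 1,              # deletion
                           new[j - 1] + 1,        # insertion
                           d[j - 1] + (h != r)))  # substitution
        d = new
    return d[-1]

def oracle_wer(paths, ref):
    """Oracle: the path through the lattice closest to the truth."""
    return min(edit_distance(p, ref) for p in paths) / len(ref)

def random_path_wer(paths, ref):
    """R. Path: a random path through the lattice."""
    return edit_distance(random.choice(paths), ref) / len(ref)
```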
[Bar chart: word error rate (%), 0-90, for Audio-IBM, Audio-HTK, Oracle, and R. Path under clean and 10 dB SNR conditions]
Database and Visual Front End
Makis Potamianos
Active Appearance Model Visual Features
Iain Matthews
Acknowledgments
• Cootes, Edwards, Taylor (Manchester)
• Sclaroff (Boston)
AAM Overview

Shape (from landmarks): x = x̄ + P_s b_s
Appearance (region of interest, warped to the reference shape): g = ḡ + P_g b_g
Combined shape & appearance: b = (W b_s, b_g), modeled as b = Q c
Relationship to DCT Features
• External feature detector vs. model-based learned tracking
[Diagram: Face Detector -> DCT features vs. AAM Tracker -> AAM features]
• ROI 'box' vs. explicit shape + appearance modeling
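To make the "ROI box + DCT" branch concrete, here is a hedged sketch of DCT feature extraction from a grayscale mouth ROI; the ROI size, coefficient count, and zig-zag approximation are illustrative assumptions, not the IBM front end:

```python
import numpy as np
from scipy.fft import dctn

def dct_features(roi, n_coeffs=24):
    """2-D DCT of a grayscale mouth ROI; keep the n_coeffs lowest-order
    coefficients (zig-zag order approximated by sorting on index sum)
    as a static visual feature vector."""
    c = dctn(roi.astype(float), norm='ortho')
    order = sorted(((i + j, i, j) for i in range(c.shape[0])
                    for j in range(c.shape[1])))
    return np.array([c[i, j] for _, i, j in order[:n_coeffs]])

roi = np.random.rand(32, 32)   # stand-in for a tracked mouth ROI
feat = dct_features(roi)       # 24-dimensional static feature vector
```

Deltas and LDA projections (the +D+DD and WiLDA variants in the results) would be applied on top of such static vectors.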
Training Data
• 4072 hand-labelled images = 2m 13s (out of 50 h)
Final Model

[Figure: the mean model and ±3 s.d. along the leading modes; the image under the model is warped to the reference frame]
Fitting Algorithm

[Diagram: the image under the current model is warped to the reference frame and compared against the current model projection (appearance); the difference (error) image drives a weighted, predicted update of the parameters c; iterate until convergence]

c is all model parameters: the combined appearance coefficients c_1, c_2, ..., c_n plus the pose parameters (translation, scale, rotation)
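A minimal sketch of this error-driven search loop, assuming a pretrained linear regressor R that maps the texture difference to a predicted parameter update; the callables here are placeholders, not the actual implementation:

```python
import numpy as np

def fit_aam(image_sample, model_texture, R, c0, n_iter=30, tol=1e-6):
    """Iterative AAM search (sketch).  image_sample(c) warps the image
    under the current model to the reference frame and returns its
    texture vector; model_texture(c) synthesizes the model's texture;
    R predicts a parameter update from the difference image."""
    c = c0.copy()
    prev_err = np.inf
    for _ in range(n_iter):
        diff = image_sample(c) - model_texture(c)   # error image
        err = diff @ diff
        if abs(prev_err - err) < tol:
            break                                   # converged
        # Try damped steps; keep the first that reduces the error.
        for step in (1.0, 0.5, 0.25):
            c_new = c - step * (R @ diff)
            d_new = image_sample(c_new) - model_texture(c_new)
            if d_new @ d_new < err:
                c = c_new
                break
        prev_err = err
    return c
```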
Tracking Results
• Worst sequence: mean, mean square error = 548.87
• Best sequence: mean, mean square error = 89.11
Tracking Results
• Full-face AAM tracker on a subset of the VVAV database
• 4,952 sequences
• 1,119,256 images @ 30 fps = 10 h 22 m
• Mean, mean MSE per sentence = 254.21
• Tracking rate (m2p decode): 4 fps
• Beard area and lips-only models will not track
• Regions lack the sharp texture gradients needed to locate the model?
Features
• Use AAM full-face features directly (86 dimensional)
Audio Lattice Rescoring Results

Visual feature                 Word error rate, %
AAM - 86 features              65.69
AAM - 30 features              65.66
AAM - 30 +D+DD                 65.90
AAM - 86 LDA => 24, LDA ±7     64.00
DCT - 18 +D+DD                 61.80
DCT - 24, LDA ±7               58.14
Noise - 30                     61.37

Lattice random path = 78.14%
DCT with LM = 51.08%
DCT no LM = 61.06%
Audio Lattice Rescoring Results
• AAM vs. DCT vs. Noise

[Bar chart "Visual Features": per-speaker word error rate (%) for AAM, DCT, and Noise features, speakers AAK, AEM, AGC, ALD, APM, ASA, ASJ, ATK, ATV, AXA, AXH, AXK, AXP]
Tracking Errors Analysis
• AAM vs. tracking error

[Chart "Tracking vs. Accuracy": per-speaker mean MSE and word error rate (%) for the same speakers; legend: AAM, MMSE]
Analysis and Future Work
• Models are under-trained: little more than face detection on 2 m of training
• Project the face through a more compact model (reproject): retain only useful articulation information?
• Improve the reference shape: minimal information loss through the warping?
Asynchronous Stream Modelling
Juergen Luettin
The Recognition Problem

M* = argmax_M P(M | O_A, O_V)

P(M | O_A, O_V) = P(O_A, O_V | M) P(M) / P(O_A, O_V)

M: word (phoneme) sequence
M*: most likely word sequence
O_A: acoustic observation sequence
O_V: visual observation sequence
Integration at the Feature Level

P(O_A, O_V | M) = P(O_AV | M), where o_AV(t) = [o_A(t), o_V(t)], t = 1, ..., T
(the audio and visual feature vectors are concatenated at each frame)

Assumptions:
• conditional dependence between modalities
• integration at the feature level
Integration at the Decision Level

P(O_A, O_V | M) = P(O_A | M) · P(O_V | M)

Assumptions:
• conditional independence between modalities
• integration at the unit level
Multiple Synchronous Streams

Two streams in each state:

b_j(o_A(t), o_V(t)) = [ Σ_{m=1..M_A} c_jm^A N(o_A(t); μ_jm^A, Σ_jm^A) ] · [ Σ_{n=1..M_V} c_jn^V N(o_V(t); μ_jn^V, Σ_jn^V) ]

P(O_AV | M) = max_X a_{x(0)x(1)} Π_{t=1..T} b_{x(t)}(o_AV(t)) a_{x(t)x(t+1)}

X: state sequence
a_ij: transition probability from i to j
b_j: probability density
c_jm: m-th mixture weight of multivariate Gaussian N

Assumptions:
• conditional independence
• integration at the state level
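A hedged sketch of the per-state emission score for this synchronous two-stream case, with exponential stream weights added in the spirit of the global stream weights described later (the weights and GMMs here are placeholders):

```python
import numpy as np
from scipy.stats import multivariate_normal

def multistream_log_b(o_a, o_v, gmm_a, gmm_v, lam_a=0.7, lam_v=0.3):
    """State emission log-score for one synchronous two-stream state:
    lam_a * log GMM_A(o_A(t)) + lam_v * log GMM_V(o_V(t)).
    Each gmm is a list of (mixture weight, mean, covariance) triples;
    lam_a / lam_v are assumed global stream weights."""
    def log_gmm(o, gmm):
        return np.logaddexp.reduce(
            [np.log(c) + multivariate_normal.logpdf(o, m, cov)
             for c, m, cov in gmm])
    return lam_a * log_gmm(o_a, gmm_a) + lam_v * log_gmm(o_v, gmm_v)

# Single-component GMMs just for illustration:
gmm_a = [(1.0, np.zeros(3), np.eye(3))]
gmm_v = [(1.0, np.zeros(2), np.eye(2))]
print(multistream_log_b(np.zeros(3), np.zeros(2), gmm_a, gmm_v))
```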
Multiple Asynchronous Streams

P̂(O_A, O_V | M) = max_{X_A, X_V} [ a^A_{x_A(0)x_A(1)} Π_{t=1..T} b^A_{x_A(t)}(o_A(t)) a^A_{x_A(t)x_A(t+1)} ] · [ Π_{t=1..T} b^V_{x_V(t)}(o_V(t)) a^V_{x_V(t)x_V(t+1)} ]

Assumptions:
• conditional independence
• integration at the unit level

Decoding: individual best state sequences for audio and video
Composite HMM Definition

[Figure: a 3-state audio HMM and a 3-state video HMM composed into a grid of composite states 1-9]

Composite states pair an audio state with a video state, X_AV = (X_A, X_V), and composite transitions factor into products of the stream transitions:

a^AV_{(i,j)(k,l)} = a^A_{ik} · a^V_{jl}

P̂(O_A, O_V | M) = max_{X_AV} P(O_A, O_V, X_AV | M)

Limiting asynchrony: |x_A − x_V| ≤ L, L = 0, ..., #states

Speech-noise decomposition (Varga & Moore, 1993)
Audio-visual decomposition (Dupont & Luettin, 1998)
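The composite transition structure can be sketched as a Kronecker product of the two stream transition matrices with an asynchrony mask; this is an illustrative reconstruction, not the workshop code:

```python
import numpy as np

def composite_transitions(A_a, A_v, max_async=1):
    """Product-HMM transition matrix: composite state (i, j) pairs audio
    state i with video state j, and a^AV_{(i,j),(k,l)} = a^A_ik * a^V_jl.
    Composite states with |i - j| > max_async are pruned to limit
    asynchrony (max_async = 0 recovers the synchronous model).
    Surviving rows can be renormalized afterwards."""
    n_a, n_v = A_a.shape[0], A_v.shape[0]
    A = np.kron(A_a, A_v)                 # (n_a*n_v) x (n_a*n_v)
    allowed = np.array([abs(i - j) <= max_async
                        for i in range(n_a) for j in range(n_v)])
    A[~allowed, :] = 0.0
    A[:, ~allowed] = 0.0
    return A

# Two 3-state left-to-right streams -> up to 9 composite states.
A_a = np.array([[.6, .4, 0], [0, .6, .4], [0, 0, 1.]])
A_v = np.array([[.5, .5, 0], [0, .5, .5], [0, 0, 1.]])
A_av = composite_transitions(A_a, A_v, max_async=1)
```

With two 3-state streams and max_async=1, the two fully desynchronized corner states are pruned, leaving 7 composite states, consistent with the 7-state composite model quoted on the AVSR System slide.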
Stream Clustering
AVSR System
• 3-state HMMs with 12 mixture components; 7-state HMM for the composite model
• context dependent phone models (silence, short pause), tree-based state clustering
• cross-word context dependent decoding, using lattices computed at IBM
• trigram language model
• global stream weights in multi-stream models, estimated on a held-out set
Speaker Independent Word Recognition

[Bar chart: word error rate (%), 0-60, under three conditions (Clean; Noisy, A lattices; Noisy, AV lattices) for Audio, Video, AV HiLDA, AV 1 Stream, AV 2 Streams synchronous, and AV 2 Streams asynchronous]
Conclusions
• The AV 2-stream asynchronous model beats the other models in noisy conditions

Future directions:
• Transition matrices: context dependent, pruning transitions with low probability, cross-unit asynchrony
• Stream weights: model based, discriminative
• Clustering: taking stream-tying into account
Phone Dependent Weighting
Dimitra Vergyri

Weight Estimation
Hervé Glotin

Visual Clustering
June Sison
Outline
• Motivation for use of visemes in triphone classification
• Definition of visemes
• Goals of viseme usage
• Inspection of phone trees (validity check)
Equivalence Classification
• Combats the problem of data sparseness
• Must be sufficiently refined so that equivalence classification can serve as a basis for prediction
• Use of decision trees to achieve equivalence classification [co-articulation]
• To derive EC: 1) collect speech data realizing each phone; 2) classify [cluster] this speech into appropriately distinct categories, as sketched below
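A minimal sketch of the greedy, likelihood-gain-driven split at one tree node, using a single diagonal Gaussian per cluster as is usual for this style of clustering; the data layout and names are assumptions, not the workshop tool:

```python
import numpy as np

def gaussian_loglike(frames):
    """Log-likelihood of frames under a single diagonal Gaussian fit
    to them (the usual clustering approximation)."""
    var = frames.var(axis=0) + 1e-6
    n, d = frames.shape
    return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

def best_question(states, questions):
    """Greedy split: pick the question whose yes/no partition of the
    pooled triphone frames gives the largest likelihood gain.  states
    maps triphone context -> frame matrix; each question is a predicate
    on the context (e.g. 'is the left context a bilabial viseme?')."""
    base = gaussian_loglike(np.vstack(list(states.values())))
    best_q, best_gain = None, 0.0
    for q in questions:
        yes = [f for ctx, f in states.items() if q(ctx)]
        no = [f for ctx, f in states.items() if not q(ctx)]
        if not yes or not no:
            continue
        gain = (gaussian_loglike(np.vstack(yes))
                + gaussian_loglike(np.vstack(no)) - base)
        if gain > best_gain:
            best_q, best_gain = q, gain
    return best_q, best_gain
```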
Definition of Visemes
• Canonical mouth shapes that accompany speech utterances
• Complements the phonetic stream [examples]
Visual vs. Audio Contexts

Question sets: 276 QS total
• 116 audio QS
• 84 single-phoneme QS
• 76 visual QS

Root nodes: 123 total
• 74 audio
• 33 visual
• 16 single phoneme
Visual Models
Azad Mashari
Visual Speech Recognition
• The Model Trinity
  - Audio-Clustered Model (Question Set 1)
  - Self-Clustered Model (Question Set 1)
  - Self-Clustered Model (Question Set 2)
• The "Results" (from which we learn what not to do)
• The Analysis
• Places to Go, Things to Do ...
The Questions
• Set 1: Original audio questions
  - 202 questions
  - based primarily on voicing and manner
• Set 2: Audio-visual questions
  - 274 questions (includes Set 1)
  - includes questions regarding place of articulation
The Trinity
• Audio-Clustered model: decision trees generated from the audio data using question set 1; visual triphone models clustered using the trees
• Self-Clustered old: decision trees generated from the visual data using question set 1
• Self-Clustered new: decision trees generated from the visual data using question set 2
Experiment I
• 3 major factors:
  - independence / complementarity of the two streams
  - quality of the representation
  - generalization
• Speaker-independent test
• Noisy audio lattices rescored using visual models
Experiment I
• Rescoring noisy audio lattices using the visual models

Visual model            Word error, %
Audio-clustered (AQ)    51.24
Self-clustered (AQ)     51.08
Self-clustered (VQ)     51.16
Experiment I

[Bar chart: per-speaker word error rate (%) on the SI test for the VQ and AQ models across the 26 test speakers, 1AXK01 through 3SXE01]
Experiment I
• Speaker variability of the visual models follows the variability of the audio models (we don't know why; the lattices?)
• This does not mean that they are not "complementary"
• Viseme clustering gives better results for some speakers only; no overall gain (we don't know why)
• Are the new questions being used?
• Over-training? ~7000 clusters in the audio models for ~40 phonemes; the same number in the visual models, but there are only ~12 "visemes" -> experiments with fewer clusters
• Is the greedy clustering algorithm making a less optimal tree with the new questions?
Experiment II
• Several ways to get fewer clusters (or any combination of these):
  - increase the minimum cluster size
  - increase the likelihood gain threshold
  - remove questions (especially those frequently used at higher depths, as well as unused ones)
• Tripling the minimum likelihood gain threshold (single mixture models) -> insignificant increase in error: ~7000 clusters -> 54.24%, ~2500 clusters -> 54.57% (see the sketch below)
• Even fewer clusters (~150-200)? Different reduction strategy?
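Building on the best_question sketch from the Equivalence Classification section, the first two knobs above can be illustrated as stopping criteria in the recursive tree growth; again an assumption-laden sketch, not the actual tool:

```python
def grow_tree(states, questions, min_gain, min_frames):
    """Recursively split tied triphone states; raising min_gain or
    min_frames yields fewer, larger clusters (Experiment II)."""
    q, gain = best_question(states, questions)  # greedy split (above)
    if q is None or gain < min_gain:
        return [list(states)]                   # leaf: one tied cluster
    yes = {c: f for c, f in states.items() if q(c)}
    no = {c: f for c, f in states.items() if not q(c)}
    if min(sum(f.shape[0] for f in d.values())
           for d in (yes, no)) < min_frames:
        return [list(states)]                   # child too small: stop
    return (grow_tree(yes, questions, min_gain, min_frames)
            + grow_tree(no, questions, min_gain, min_frames))
```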
Places to Go, Things to See...
• Finding optimal clustering parameters; current values are optimized for MFCC-based audio models
• Clustering with viseme-based questions only
• Looking at errors in recognition of particular phones/classes
Visual Model Adaptation
Jie Zhou

Visual Model Adaptation
• Problem: the speaker independent system is not sufficient to accurately model each new speaker
• Solution: use adaptation to make the speaker independent system better fit the characteristics of each new speaker
HMM Adaptation

To get a new estimate of the adapted mean, µ, we use the transformation matrix given by:

µ = W ε

where W is the (n × n) transformation matrix, n is the dimensionality of the data, and ε is the original mean vector.

[Diagram: speaker independent data (ε) combined with speaker specific data to estimate the adapted mean µ]
[Diagram: HEAdapt takes the speaker independent AV HMM models (ε, σ) plus speaker specific data, and produces a transformed speaker independent model (µ = W ε) used for recognition on the speaker adapted test data]
Procedure

A speaker adaptation on visual models was performed using:
• MLLR (method of adaptation)
• Global transform
• Single mixture triphones

Adaptation data: average 5 minutes per speaker
Test data: average 6 minutes per speaker
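As a hedged sketch of what a global MLLR mean transform looks like in closed form if the covariances are taken as identity (the real HEAdapt estimation solves a row-wise system using the model covariances; this simplified version only illustrates the µ = Wε idea):

```python
import numpy as np

def mllr_global_transform(obs, gammas, means):
    """Closed-form global MLLR mean transform, assuming identity
    covariances (simplification).
    obs:    (T, n) adaptation frames
    gammas: (T, S) state occupation probabilities per frame
    means:  (S, n) speaker-independent state means (the eps vectors)
    Returns W (n x n) maximizing the adaptation-data likelihood;
    G must be non-singular (enough adaptation data)."""
    n = means.shape[1]
    G = np.zeros((n, n))   # sum_t,s gamma * eps eps^T
    Z = np.zeros((n, n))   # sum_t,s gamma * o   eps^T
    for t in range(obs.shape[0]):
        for s in range(means.shape[0]):
            g = gammas[t, s]
            Z += g * np.outer(obs[t], means[s])
            G += g * np.outer(means[s], means[s])
    return Z @ np.linalg.inv(G)

# Adapted mean for every state s: mu_s = W @ eps_s.
```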
Results (word error, %)

Speaker    Speaker Independent    Speaker Adapted
AXK        44.05%                 41.92%
JFM        61.41%                 59.23%
JXC        62.28%                 60.48%
LCY        31.23%                 29.32%
MBG        83.73%                 83.56%
MDP        30.16%                 29.89%
RTG        57.44%                 55.73%
BAE        36.81%                 36.17%
CNM        84.73%                 83.89%
DJF        71.96%                 71.15%
Average    58.98%                 55.49%
Future

Better adaptation can be achieved by:
• employing multiple transforms instead of a single transform
• attempting other methods of adaptation, such as MAP, with more data
• using mixture Gaussians in the model
Summary and Conclusions
Chalapathy Neti
Summary of Results

Data:
• SI train (261 spkrs, 35 hrs)
• SI test (26 spkrs, 2.5 hrs)
• Vocabulary: 10,400 words

Feature Fusion (AV-1str): concatenated audio-visual features
AV-HiLDA: hierarchical LDA for feature fusion
Multistream (MS, AV-2str-synchronous): state-synchronized decision fusion
Product HMM (PD, AV-2str-asynchronous): state asynchronous (phone synchronous)
HiF (hierarchical fusion):
• HiF-HiLDA: AV lattices rescored using HiLDA models
• HiF-MS: AV lattices rescored using MS models
• HiF-PD: AV lattices rescored using Product models
PDUF: phone specific stream weights, utterance level fusion
[Bar chart: word error rate (%), 0-60, under clean and 10 dB SNR conditions for Audio, Visual, AV-1str, AV-HiLDA, MS, PD, HiF-HiLDA, HiF-MS, HiF-PD, and PDUF]
Conclusions

Small gains on clean audio (9% relative): hierarchical LDA (HiLDA) and phone dependent weighting schemes improve clean audio performance

Significant gains for noisy audio using two-pass schemes (hierarchical fusion) (> 27% relative)

Noisy AV lattices rescored using "asynchronous fusion" (PD) improve the error rate by 27.51% relative to the matched noisy audio model

Visual modeling requires more refinement

The rescoring methodology constrains the relative goodness of models: the LM dominates the lattice best path in the absence of good additional evidence?
Open Issues

Visual feature representation:
• What is the best ROI?
• 3D features?
• Better tracking of the ROI
• Explicit representation of place of articulation?

Visual models:
• Why are visually relevant contexts not doing better?

Fusion:
• Better models of asynchrony (Product Models)
• Automatic estimation of stream confidences
• Unit dependent weights in Multistream/Product HMMs
Acknowledgements
Michael Picheny, David Nahamoo (IBM)
Giri Iyengar, Sunil Sivanandan, Eric Helmuth (IBM)
Asela Gunawardana, Murat Saraclar (CLSP)
Andreas Andreou, Eugenio Culurciello (JHU CE)
CLSP staff (special thanks to Amy for T-shirts)
Fred Jelinek, Sanjeev Khudanpur, Bill Byrne (CLSP)
The End.
Extra Slides…
State Based Clustering

Error rate on DCT features (word error rate on a small multi-speaker test set):

                                Language Model    No Language Model
Lattice Depth 1, Clean Audio    24.79             27.79
Lattice Depth 3, Clean Audio    25.55             34.58
Lattice, Noisy Audio            49.79             55.00
Audio Lattice Rescoring Results

Visual feature                 Word error rate, %
AAM - 86 features              65.69
AAM - 30 features              65.66
AAM - 30 +D+DD                 69.50
AAM - 86 LDA 24, WiLDA ±7      64.00
DCT - 18 +D+DD                 61.80
DCT - 24, WiLDA ±7             58.14
Noise - 30                     61.37

DCT WiLDA no LM = 65.14
Lattice random path = 78.32
Overview
[Figure: shape and appearance components of the model]
Results/Status

Experiment                                       Clean, HTK (IBM)    Noisy, HTK (IBM)
Clean audio models                               14.4 (14.5)         83.0
Visual models (DCT/Discr.)                       24.0*               51.08
Noisy audio models                               -                   48.1 (46.43)
Feature fusion                                   16.0                44.97
Feature fusion (HiLDA)                           13.84               42.86
Multistream                                      14.58               43.80
Multistream (NGW - Hervé)                        13.47               35.26
Multistream (ACVM - Hervé)                       15.15               38.38
Product HMM                                      14.19               43.67
Hierarchical fusion (NAV lats, HiLDA models)     -                   36.99
Hierarchical fusion (NAV lats, MS models)        -                   36.61
Hierarchical fusion (NAV lats, PD models)        -                   35.21
Phone dependent utterance fusion (PDUF)          13.05               -