Joint Processing of Audio and Visual Information for Speech Recognition
CLSP Workshop 2000, 08/24/00
AVSR Team
Chalapathy Neti (IBM)
Gerasimos Potamianos (IBM)
Juergen Luettin (IDIAP)
Iain Matthews (CMU)
Herve Glotin (ICP/IDIAP)
Dimitra Vergyri (Johns Hopkins)
June Sison (UC Santa Cruz)
Azad Mashari (U. Toronto)
Jie Zhou (Johns Hopkins)
We don't just listen, we watch!
Agenda
Chalapathy Neti (Introduction)
Makis Potamianos: Database and visual front end
Iain Matthews: Active Appearance Models
Juergen Luettin: Asynchrony modeling
Dimitra Vergyri: Phone dependent weighting schemes
Herve Glotin: Multistream and stream weighting schemes
June Sison: Visual clustering
Azad Mashari: Visual modeling
Jie Zhou: Visual model adaptation
Chalapathy Neti (Summary, conclusions and discussion)
Featured speaker: Eric Petajan (Face2Face Animation): MPEG-4 visual speech representation
Audio-Visual Speech Recognition by Rescoring

[System diagram: compressed audio-visual speech content from a camera/video clip is demultiplexed (demux); the video stream is decoded, the ROI extracted, and a visual speech feature representation [ ]V computed, giving P(V|AU); the audio stream is decoded and auditory features [ ]A extracted, giving P(A|acoustic unit); a search over the audio features with a language model (LM) produces word lattices, which are then rescored using the joint score P(A,V|AU)]
Combination of Audio and Visual Speech: Research Issues

Goal: improve audio-based LVCSR by using visual information (previous work on isolated digits/letters)

Key research issues:
• Data (IBM)
• Location and tracking of visual speech regions (IBM)
• Specification and representation of visual speech features
  - DCT, discriminant representations (Makis)
  - Active Appearance Models (Iain Matthews)
• Visual models
  - Visually relevant triphone clustering (June)
  - Visual modeling schemes (Azad)
  - Visual model adaptation (Jie)
• Fusion strategies
  - Discriminant feature fusion (Makis/Juergen)
  - Multistream - state synchrony (Juergen/Herve)
  - Multistream - weighting schemes (Herve)
  - Product HMM - asynchronous (Juergen)
  - Unit dependent weighting schemes (Dimitra)
Fusion Research Issues

General fusion problem:
• Model Pr[O_1(t), ..., O_S(t) | j], with S streams; j is a class of interest (e.g. phone, word, etc.)
• Stream independence/dependence
• Type of fusion: early (feature) vs. late (decision)
• Decision fusion: Score[O_1(t), ..., O_S(t) | j] = f( Pr[O_s(t) | j], s = 1, ..., S )
• Asynchrony between streams
• Synchronization level: feature, state, phone, word, utterance
• Classes of interest: visemes vs. phonemes
• Stream confidence estimation: SNR-based, HNR, entropy, etc.
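As a hedged illustration of the decision fusion formula above, a minimal log-linear combiner might look like the sketch below; the stream weights and scores are made-up values, not the workshop system's:

```python
import numpy as np

def fuse_scores(stream_loglikes, weights):
    """Log-linear decision fusion: log Score[O_1..O_S | j] =
    sum_s lambda_s * log Pr[O_s | j].  stream_loglikes is an
    (S, J) array of per-stream log-likelihoods for J classes."""
    return np.asarray(weights) @ np.asarray(stream_loglikes)

# Two streams (audio, video), three candidate phone classes.
loglikes = np.array([[-4.1, -2.3, -7.0],   # audio stream
                     [-3.0, -5.2, -2.8]])  # video stream
weights = [0.7, 0.3]                       # e.g. SNR-based confidences
fused = fuse_scores(loglikes, weights)
print(fused.argmax())                      # class with the best fused score
```

The confidence estimates (SNR, HNR, entropy) listed above would set the weights; with equal weights this reduces to the plain conditional-independence product.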
Baseline

Data:
• SI train (261 spkrs, 35 hrs)
• SI test (26 spkrs, 2.5 hrs)
• Vocabulary: 10,400 words

Experimental method: rescore lattices

Baseline results:
• Audio: rescoring of lattices using matched audio models
• Oracle: pick the path closest to the truth
• Random Path (R. Path): pick a random path through the lattice (both sketched below)
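For concreteness, here is a small sketch of how the Oracle and Random Path bounds could be computed from an n-best expansion of a lattice; it assumes word sequences as Python lists and is illustrative, not the actual scoring tool:

```python
import random

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (subs/ins/del all cost 1)."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        new = [i]
        for j, r in enumerate(ref, 1):
            new.append(min(d[j] + 1,              # deletion
                           new[j - 1] + 1,        # insertion
                           d[j - 1] + (h != r)))  # substitution
        d = new
    return d[-1]

def oracle_wer(paths, ref):
    """Oracle: the path through the lattice closest to the truth."""
    return min(edit_distance(p, ref) for p in paths) / len(ref)

def random_path_wer(paths, ref):
    """R. Path: a random path through the lattice."""
    return edit_distance(random.choice(paths), ref) / len(ref)
```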
[Bar chart: word error rate (%), 0-90, for Audio-IBM, Audio-HTK, Oracle, and R. Path under clean and 10 dB SNR conditions]
Database and Visual Front End
Makis Potamianos
Active Appearance Model Visual Features
Iain Matthews
Acknowledgments
• Cootes, Edwards, Taylor (Manchester)
• Sclaroff (Boston)
AAM Overview

Shape (from landmarks): x = x̄ + P_s b_s
Appearance (region of interest, warped to the reference shape): g = ḡ + P_g b_g
Combined shape & appearance: b = (W b_s, b_g), modeled as b = Q c
Relationship to DCT Features
• External feature detector vs. model-based learned tracking
[Diagram: Face Detector -> DCT features vs. AAM Tracker -> AAM features]
• ROI 'box' vs. explicit shape + appearance modeling
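To make the "ROI box + DCT" branch concrete, here is a hedged sketch of DCT feature extraction from a grayscale mouth ROI; the ROI size, coefficient count, and zig-zag approximation are illustrative assumptions, not the IBM front end:

```python
import numpy as np
from scipy.fft import dctn

def dct_features(roi, n_coeffs=24):
    """2-D DCT of a grayscale mouth ROI; keep the n_coeffs lowest-order
    coefficients (zig-zag order approximated by sorting on index sum)
    as a static visual feature vector."""
    c = dctn(roi.astype(float), norm='ortho')
    order = sorted(((i + j, i, j) for i in range(c.shape[0])
                    for j in range(c.shape[1])))
    return np.array([c[i, j] for _, i, j in order[:n_coeffs]])

roi = np.random.rand(32, 32)   # stand-in for a tracked mouth ROI
feat = dct_features(roi)       # 24-dimensional static feature vector
```

Deltas and LDA projections (the +D+DD and WiLDA variants in the results) would be applied on top of such static vectors.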
Training Data
• 4072 hand-labelled images = 2m 13s (out of 50 h)
Final Model

[Figure: the mean model and ±3 s.d. along the leading modes; the image under the model is warped to the reference frame]
Fitting Algorithm

[Diagram: the image under the current model is warped to the reference frame and compared against the current model projection (appearance); the difference (error) image drives a weighted, predicted update of the parameters c; iterate until convergence]

c is all model parameters: the combined appearance coefficients c_1, c_2, ..., c_n plus the pose parameters (translation, scale, rotation)
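A minimal sketch of this error-driven search loop, assuming a pretrained linear regressor R that maps the texture difference to a predicted parameter update; the callables here are placeholders, not the actual implementation:

```python
import numpy as np

def fit_aam(image_sample, model_texture, R, c0, n_iter=30, tol=1e-6):
    """Iterative AAM search (sketch).  image_sample(c) warps the image
    under the current model to the reference frame and returns its
    texture vector; model_texture(c) synthesizes the model's texture;
    R predicts a parameter update from the difference image."""
    c = c0.copy()
    prev_err = np.inf
    for _ in range(n_iter):
        diff = image_sample(c) - model_texture(c)   # error image
        err = diff @ diff
        if abs(prev_err - err) < tol:
            break                                   # converged
        # Try damped steps; keep the first that reduces the error.
        for step in (1.0, 0.5, 0.25):
            c_new = c - step * (R @ diff)
            d_new = image_sample(c_new) - model_texture(c_new)
            if d_new @ d_new < err:
                c = c_new
                break
        prev_err = err
    return c
```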
Tracking Results
• Worst sequence: mean, mean square error = 548.87
• Best sequence: mean, mean square error = 89.11
Tracking Results
• Full-face AAM tracker on a subset of the VVAV database
• 4,952 sequences
• 1,119,256 images @ 30 fps = 10 h 22 m
• Mean, mean MSE per sentence = 254.21
• Tracking rate (m2p decode): 4 fps
• Beard area and lips-only models will not track
• Regions lack the sharp texture gradients needed to locate the model?
Features
• Use AAM full-face features directly (86 dimensional)
Audio Lattice Rescoring Results

Visual feature                 Word error rate, %
AAM - 86 features              65.69
AAM - 30 features              65.66
AAM - 30 +D+DD                 65.90
AAM - 86 LDA => 24, LDA ±7     64.00
DCT - 18 +D+DD                 61.80
DCT - 24, LDA ±7               58.14
Noise - 30                     61.37

Lattice random path = 78.14%
DCT with LM = 51.08%
DCT no LM = 61.06%
Audio Lattice Rescoring Results
• AAM vs. DCT vs. Noise

[Bar chart "Visual Features": per-speaker word error rate (%) for AAM, DCT, and Noise features, speakers AAK, AEM, AGC, ALD, APM, ASA, ASJ, ATK, ATV, AXA, AXH, AXK, AXP]
Tracking Errors Analysis
• AAM vs. tracking error

[Chart "Tracking vs. Accuracy": per-speaker mean MSE and word error rate (%) for the same speakers; legend: AAM, MMSE]
Analysis and Future Work
• Models are under-trained: little more than face detection on 2 m of training
• Project the face through a more compact model (reproject): retain only useful articulation information?
• Improve the reference shape: minimal information loss through the warping?
Asynchronous Stream Modelling
Juergen Luettin
The Recognition Problem

M* = argmax_M P(M | O_A, O_V)

P(M | O_A, O_V) = P(O_A, O_V | M) P(M) / P(O_A, O_V)

M: word (phoneme) sequence
M*: most likely word sequence
O_A: acoustic observation sequence
O_V: visual observation sequence
Integration at the Feature Level

P(O_A, O_V | M) = P(O_AV | M), where o_AV(t) = [o_A(t), o_V(t)], t = 1, ..., T
(the audio and visual feature vectors are concatenated at each frame)

Assumptions:
• conditional dependence between modalities
• integration at the feature level
Integration at the Decision Level

P(O_A, O_V | M) = P(O_A | M) · P(O_V | M)

Assumptions:
• conditional independence between modalities
• integration at the unit level
Multiple Synchronous Streams

Two streams in each state:

b_j(o_A(t), o_V(t)) = [ Σ_{m=1..M_A} c_jm^A N(o_A(t); μ_jm^A, Σ_jm^A) ] · [ Σ_{n=1..M_V} c_jn^V N(o_V(t); μ_jn^V, Σ_jn^V) ]

P(O_AV | M) = max_X a_{x(0)x(1)} Π_{t=1..T} b_{x(t)}(o_AV(t)) a_{x(t)x(t+1)}

X: state sequence
a_ij: transition probability from i to j
b_j: probability density
c_jm: m-th mixture weight of multivariate Gaussian N

Assumptions:
• conditional independence
• integration at the state level
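A hedged sketch of the per-state emission score for this synchronous two-stream case, with exponential stream weights added in the spirit of the global stream weights described later (the weights and GMMs here are placeholders):

```python
import numpy as np
from scipy.stats import multivariate_normal

def multistream_log_b(o_a, o_v, gmm_a, gmm_v, lam_a=0.7, lam_v=0.3):
    """State emission log-score for one synchronous two-stream state:
    lam_a * log GMM_A(o_A(t)) + lam_v * log GMM_V(o_V(t)).
    Each gmm is a list of (mixture weight, mean, covariance) triples;
    lam_a / lam_v are assumed global stream weights."""
    def log_gmm(o, gmm):
        return np.logaddexp.reduce(
            [np.log(c) + multivariate_normal.logpdf(o, m, cov)
             for c, m, cov in gmm])
    return lam_a * log_gmm(o_a, gmm_a) + lam_v * log_gmm(o_v, gmm_v)

# Single-component GMMs just for illustration:
gmm_a = [(1.0, np.zeros(3), np.eye(3))]
gmm_v = [(1.0, np.zeros(2), np.eye(2))]
print(multistream_log_b(np.zeros(3), np.zeros(2), gmm_a, gmm_v))
```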
Multiple Asynchronous Streams

P̂(O_A, O_V | M) = max_{X_A, X_V} [ a^A_{x_A(0)x_A(1)} Π_{t=1..T} b^A_{x_A(t)}(o_A(t)) a^A_{x_A(t)x_A(t+1)} ] · [ Π_{t=1..T} b^V_{x_V(t)}(o_V(t)) a^V_{x_V(t)x_V(t+1)} ]

Assumptions:
• conditional independence
• integration at the unit level

Decoding: individual best state sequences for audio and video
Composite HMM Definition

[Figure: a 3-state audio HMM and a 3-state video HMM composed into a grid of composite states 1-9]

Composite states pair an audio state with a video state, X_AV = (X_A, X_V), and composite transitions factor into products of the stream transitions:

a^AV_{(i,j)(k,l)} = a^A_{ik} · a^V_{jl}

P̂(O_A, O_V | M) = max_{X_AV} P(O_A, O_V, X_AV | M)

Limiting asynchrony: |x_A − x_V| ≤ L, L = 0, ..., #states

Speech-noise decomposition (Varga & Moore, 1993)
Audio-visual decomposition (Dupont & Luettin, 1998)
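The composite transition structure can be sketched as a Kronecker product of the two stream transition matrices with an asynchrony mask; this is an illustrative reconstruction, not the workshop code:

```python
import numpy as np

def composite_transitions(A_a, A_v, max_async=1):
    """Product-HMM transition matrix: composite state (i, j) pairs audio
    state i with video state j, and a^AV_{(i,j),(k,l)} = a^A_ik * a^V_jl.
    Composite states with |i - j| > max_async are pruned to limit
    asynchrony (max_async = 0 recovers the synchronous model).
    Surviving rows can be renormalized afterwards."""
    n_a, n_v = A_a.shape[0], A_v.shape[0]
    A = np.kron(A_a, A_v)                 # (n_a*n_v) x (n_a*n_v)
    allowed = np.array([abs(i - j) <= max_async
                        for i in range(n_a) for j in range(n_v)])
    A[~allowed, :] = 0.0
    A[:, ~allowed] = 0.0
    return A

# Two 3-state left-to-right streams -> up to 9 composite states.
A_a = np.array([[.6, .4, 0], [0, .6, .4], [0, 0, 1.]])
A_v = np.array([[.5, .5, 0], [0, .5, .5], [0, 0, 1.]])
A_av = composite_transitions(A_a, A_v, max_async=1)
```

With two 3-state streams and max_async=1, the two fully desynchronized corner states are pruned, leaving 7 composite states, consistent with the 7-state composite model quoted on the AVSR System slide.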
Stream Clustering
AVSR System
• 3-state HMMs with 12 mixture components; 7-state HMM for the composite model
• context dependent phone models (silence, short pause), tree-based state clustering
• cross-word context dependent decoding, using lattices computed at IBM
• trigram language model
• global stream weights in multi-stream models, estimated on a held-out set
Speaker Independent Word Recognition

[Bar chart: word error rate (%), 0-60, under three conditions (Clean; Noisy, A lattices; Noisy, AV lattices) for Audio, Video, AV HiLDA, AV 1 Stream, AV 2 Streams synchronous, and AV 2 Streams asynchronous]
Conclusions
• The AV 2-stream asynchronous model beats the other models in noisy conditions

Future directions:
• Transition matrices: context dependent, pruning transitions with low probability, cross-unit asynchrony
• Stream weights: model based, discriminative
• Clustering: taking stream-tying into account
Phone Dependent Weighting
Dimitra Vergyri

Weight Estimation
Hervé Glotin

Visual Clustering
June Sison
Outline
• Motivation for use of visemes in triphone classification
• Definition of visemes
• Goals of viseme usage
• Inspection of phone trees (validity check)
Equivalence Classification
• Combats the problem of data sparseness
• Must be sufficiently refined so that equivalence classification can serve as a basis for prediction
• Use of decision trees to achieve equivalence classification [co-articulation]
• To derive EC: 1) collect speech data realizing each phone; 2) classify [cluster] this speech into appropriately distinct categories, as sketched below
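A minimal sketch of the greedy, likelihood-gain-driven split at one tree node, using a single diagonal Gaussian per cluster as is usual for this style of clustering; the data layout and names are assumptions, not the workshop tool:

```python
import numpy as np

def gaussian_loglike(frames):
    """Log-likelihood of frames under a single diagonal Gaussian fit
    to them (the usual clustering approximation)."""
    var = frames.var(axis=0) + 1e-6
    n, d = frames.shape
    return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

def best_question(states, questions):
    """Greedy split: pick the question whose yes/no partition of the
    pooled triphone frames gives the largest likelihood gain.  states
    maps triphone context -> frame matrix; each question is a predicate
    on the context (e.g. 'is the left context a bilabial viseme?')."""
    base = gaussian_loglike(np.vstack(list(states.values())))
    best_q, best_gain = None, 0.0
    for q in questions:
        yes = [f for ctx, f in states.items() if q(ctx)]
        no = [f for ctx, f in states.items() if not q(ctx)]
        if not yes or not no:
            continue
        gain = (gaussian_loglike(np.vstack(yes))
                + gaussian_loglike(np.vstack(no)) - base)
        if gain > best_gain:
            best_q, best_gain = q, gain
    return best_q, best_gain
```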
Definition of Visemes
• Canonical mouth shapes that accompany speech utterances
• Complements the phonetic stream [examples]
Visual vs. Audio Contexts

Question sets: 276 QS total
• 116 audio QS
• 84 single-phoneme QS
• 76 visual QS

Root nodes: 123 total
• 74 audio
• 33 visual
• 16 single phoneme
Visual Models
Azad Mashari
Visual Speech Recognition
• The Model Trinity
  - Audio-Clustered Model (Question Set 1)
  - Self-Clustered Model (Question Set 1)
  - Self-Clustered Model (Question Set 2)
• The "Results" (from which we learn what not to do)
• The Analysis
• Places to Go, Things to Do ...
The Questions
• Set 1: Original audio questions
  - 202 questions
  - based primarily on voicing and manner
• Set 2: Audio-visual questions
  - 274 questions (includes Set 1)
  - includes questions regarding place of articulation
The Trinity
• Audio-Clustered model: decision trees generated from the audio data using question set 1; visual triphone models clustered using the trees
• Self-Clustered old: decision trees generated from the visual data using question set 1
• Self-Clustered new: decision trees generated from the visual data using question set 2
Experiment I
• 3 major factors:
  - independence / complementarity of the two streams
  - quality of the representation
  - generalization
• Speaker-independent test
• Noisy audio lattices rescored using visual models
Experiment I
• Rescoring noisy audio lattices using the visual models

Visual model            Word error, %
Audio-clustered (AQ)    51.24
Self-clustered (AQ)     51.08
Self-clustered (VQ)     51.16
Experiment I

[Bar chart: per-speaker word error rate (%) on the SI test for the VQ and AQ models across the 26 test speakers, 1AXK01 through 3SXE01]
Experiment I
• Speaker variability of the visual models follows the variability of the audio models (we don't know why; the lattices?)
• This does not mean that they are not "complementary"
• Viseme clustering gives better results for some speakers only; no overall gain (we don't know why)
• Are the new questions being used?
• Over-training? ~7000 clusters in the audio models for ~40 phonemes; the same number in the visual models, but there are only ~12 "visemes" -> experiments with fewer clusters
• Is the greedy clustering algorithm making a less optimal tree with the new questions?
Experiment II
• Several ways to get fewer clusters (or any combination of these):
  - increase the minimum cluster size
  - increase the likelihood gain threshold
  - remove questions (especially those frequently used at higher depths, as well as unused ones)
• Tripling the minimum likelihood gain threshold (single mixture models) -> insignificant increase in error: ~7000 clusters -> 54.24%, ~2500 clusters -> 54.57% (see the sketch below)
• Even fewer clusters (~150-200)? Different reduction strategy?
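Building on the best_question sketch from the Equivalence Classification section, the first two knobs above can be illustrated as stopping criteria in the recursive tree growth; again an assumption-laden sketch, not the actual tool:

```python
def grow_tree(states, questions, min_gain, min_frames):
    """Recursively split tied triphone states; raising min_gain or
    min_frames yields fewer, larger clusters (Experiment II)."""
    q, gain = best_question(states, questions)  # greedy split (above)
    if q is None or gain < min_gain:
        return [list(states)]                   # leaf: one tied cluster
    yes = {c: f for c, f in states.items() if q(c)}
    no = {c: f for c, f in states.items() if not q(c)}
    if min(sum(f.shape[0] for f in d.values())
           for d in (yes, no)) < min_frames:
        return [list(states)]                   # child too small: stop
    return (grow_tree(yes, questions, min_gain, min_frames)
            + grow_tree(no, questions, min_gain, min_frames))
```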
Places to Go, Things to See...
• Finding optimal clustering parameters; current values are optimized for MFCC-based audio models
• Clustering with viseme-based questions only
• Looking at errors in recognition of particular phones/classes
Visual Model Adaptation
Jie Zhou

Visual Model Adaptation
• Problem: the speaker independent system is not sufficient to accurately model each new speaker
• Solution: use adaptation to make the speaker independent system better fit the characteristics of each new speaker
HMM Adaptation

To get a new estimate of the adapted mean, µ, we use the transformation matrix given by:

µ = W ε

where W is the (n × n) transformation matrix, n is the dimensionality of the data, and ε is the original mean vector.

[Diagram: speaker independent data (ε) combined with speaker specific data to estimate the adapted mean µ]
[Diagram: HEAdapt takes the speaker independent AV HMM models (ε, σ) plus speaker specific data, and produces a transformed speaker independent model (µ = W ε) used for recognition on the speaker adapted test data]
Procedure

A speaker adaptation on visual models was performed using:
• MLLR (method of adaptation)
• Global transform
• Single mixture triphones

Adaptation data: average 5 minutes per speaker
Test data: average 6 minutes per speaker
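As a hedged sketch of what a global MLLR mean transform looks like in closed form if the covariances are taken as identity (the real HEAdapt estimation solves a row-wise system using the model covariances; this simplified version only illustrates the µ = Wε idea):

```python
import numpy as np

def mllr_global_transform(obs, gammas, means):
    """Closed-form global MLLR mean transform, assuming identity
    covariances (simplification).
    obs:    (T, n) adaptation frames
    gammas: (T, S) state occupation probabilities per frame
    means:  (S, n) speaker-independent state means (the eps vectors)
    Returns W (n x n) maximizing the adaptation-data likelihood;
    G must be non-singular (enough adaptation data)."""
    n = means.shape[1]
    G = np.zeros((n, n))   # sum_t,s gamma * eps eps^T
    Z = np.zeros((n, n))   # sum_t,s gamma * o   eps^T
    for t in range(obs.shape[0]):
        for s in range(means.shape[0]):
            g = gammas[t, s]
            Z += g * np.outer(obs[t], means[s])
            G += g * np.outer(means[s], means[s])
    return Z @ np.linalg.inv(G)

# Adapted mean for every state s: mu_s = W @ eps_s.
```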
Results (word error, %)

Speaker    Speaker Independent    Speaker Adapted
AXK        44.05%                 41.92%
JFM        61.41%                 59.23%
JXC        62.28%                 60.48%
LCY        31.23%                 29.32%
MBG        83.73%                 83.56%
MDP        30.16%                 29.89%
RTG        57.44%                 55.73%
BAE        36.81%                 36.17%
CNM        84.73%                 83.89%
DJF        71.96%                 71.15%
Average    58.98%                 55.49%
Future

Better adaptation can be achieved by:
• employing multiple transforms instead of a single transform
• attempting other methods of adaptation, such as MAP, with more data
• using mixture Gaussians in the model
Summary and Conclusions
Chalapathy Neti
Summary of Results

Data:
• SI train (261 spkrs, 35 hrs)
• SI test (26 spkrs, 2.5 hrs)
• Vocabulary: 10,400 words

Feature Fusion (AV-1str): concatenated audio-visual features
AV-HiLDA: hierarchical LDA for feature fusion
Multistream (MS, AV-2str-synchronous): state-synchronized decision fusion
Product HMM (PD, AV-2str-asynchronous): state asynchronous (phone synchronous)
HiF (hierarchical fusion):
• HiF-HiLDA: AV lattices rescored using HiLDA models
• HiF-MS: AV lattices rescored using MS models
• HiF-PD: AV lattices rescored using Product models
PDUF: phone specific stream weights, utterance level fusion
[Bar chart: word error rate (%), 0-60, under clean and 10 dB SNR conditions for Audio, Visual, AV-1str, AV-HiLDA, MS, PD, HiF-HiLDA, HiF-MS, HiF-PD, and PDUF]
Conclusions

Small gains on clean audio (9% relative): hierarchical LDA (HiLDA) and phone dependent weighting schemes improve clean audio performance

Significant gains for noisy audio using two-pass schemes (hierarchical fusion) (> 27% relative)

Noisy AV lattices rescored using "asynchronous fusion" (PD) improve the error rate by 27.51% relative to the matched noisy audio model

Visual modeling requires more refinement

The rescoring methodology constrains the relative goodness of models: the LM dominates the lattice best path in the absence of good additional evidence?
Open Issues

Visual feature representation:
• What is the best ROI?
• 3D features?
• Better tracking of the ROI
• Explicit representation of place of articulation?

Visual models:
• Why are visually relevant contexts not doing better?

Fusion:
• Better models of asynchrony (Product Models)
• Automatic estimation of stream confidences
• Unit dependent weights in Multistream/Product HMMs
Acknowledgements
Michael Picheny, David Nahamoo (IBM)
Giri Iyengar, Sunil Sivanandan, Eric Helmuth (IBM)
Asela Gunawardana, Murat Saraclar (CLSP)
Andreas Andreou, Eugenio Culurciello (JHU CE)
CLSP staff (special thanks to Amy for T-shirts)
Fred Jelinek, Sanjeev Khudanpur, Bill Byrne (CLSP)
The End.
Extra Slides…
State Based Clustering

Error rate on DCT features (word error rate on a small multi-speaker test set):

                                Language Model    No Language Model
Lattice Depth 1, Clean Audio    24.79             27.79
Lattice Depth 3, Clean Audio    25.55             34.58
Lattice, Noisy Audio            49.79             55.00
Audio Lattice Rescoring Results

Visual feature                 Word error rate, %
AAM - 86 features              65.69
AAM - 30 features              65.66
AAM - 30 +D+DD                 69.50
AAM - 86 LDA 24, WiLDA ±7      64.00
DCT - 18 +D+DD                 61.80
DCT - 24, WiLDA ±7             58.14
Noise - 30                     61.37

DCT WiLDA no LM = 65.14
Lattice random path = 78.32
Overview
[Figure: shape and appearance components of the model]
Results/Status

Experiment                                       Clean, HTK (IBM)    Noisy, HTK (IBM)
Clean audio models                               14.4 (14.5)         83.0
Visual models (DCT/Discr.)                       24.0*               51.08
Noisy audio models                               -                   48.1 (46.43)
Feature fusion                                   16.0                44.97
Feature fusion (HiLDA)                           13.84               42.86
Multistream                                      14.58               43.80
Multistream (NGW - Hervé)                        13.47               35.26
Multistream (ACVM - Hervé)                       15.15               38.38
Product HMM                                      14.19               43.67
Hierarchical fusion (NAV lats, HiLDA models)     -                   36.99
Hierarchical fusion (NAV lats, MS models)        -                   36.61
Hierarchical fusion (NAV lats, PD models)        -                   35.21
Phone dependent utterance fusion (PDUF)          13.05               -