Database and Visual Front End (Makis Potamianos)


Page 1: Database and Visual Front End Makis Potamianos


Joint Processing of Audio and Visual Information for Speech Recognition

CLSP Workshop 2000, 08/24/00

AVSR Team:
Chalapathy Neti (IBM)
Gerasimos Potamianos (IBM)
Juergen Luettin (IDIAP)
Iain Matthews (CMU)
Herve Glotin (ICP/IDIAP)
Dimitra Vergyri (Johns Hopkins)
June Sison (UC Santa Cruz)
Azad Mashari (U. Toronto)
Jie Zhou (Johns Hopkins)

We don't just listen, we watch!

Page 2: Database and Visual Front End Makis Potamianos

Agenda

Chalapathy Neti: Introduction
Makis Potamianos: Database and visual front end
Iain Matthews: Active Appearance Models
Juergen Luettin: Asynchrony modeling
Dimitra Vergyri: Phone-dependent weighting schemes
Herve Glotin: Multistream and stream weighting schemes
June Sison: Visual clustering
Azad Mashari: Visual modeling
Jie Zhou: Visual model adaptation
Chalapathy Neti: Summary, conclusions and discussion

Featured speaker: Eric Petajan (Face2Face Animation): MPEG-4 visual speech representation

Page 3: Database and Visual Front End Makis Potamianos

Audio-Visual Speech Recognition by Rescoring

[Architecture diagram: a camera produces compressed audio-visual speech content (a video clip), which is demultiplexed. The video stream is decoded, a region of interest (ROI) is extracted, and a visual speech feature representation [V] is computed, giving P(V | AU) for each acoustic unit. The audio stream is decoded and auditory features [A] are extracted, giving P(A | acoustic unit). A search with a language model (LM) produces word lattices, which are then rescored using the joint score P(A, V | AU).]

Page 4: Database and Visual Front End Makis Potamianos

Combination of audio and visual speech: research issues

Goal: improve audio-based LVCSR by using visual information
• Previous work addressed isolated digits/letters

Key research issues:
• Data (IBM)
• Location and tracking of visual speech regions (IBM)
• Specification and representation of visual speech features
  • DCT, discriminant representations (Makis)
  • Active Appearance Models (Iain Matthews)
• Visual models
  • Visually relevant triphone clustering (June)
  • Visual modeling schemes (Azad)
  • Visual model adaptation (Jie)
• Fusion strategies
  • Discriminant feature fusion (Makis/Juergen)
  • Multistream: state synchrony (Juergen/Herve)
  • Multistream: weighting schemes (Herve)
  • Product HMM: asynchronous (Juergen)
  • Unit-dependent weighting schemes (Dimitra)

Page 5: Database and Visual Front End Makis Potamianos

Fusion Research Issues

General fusion problem:
• Model Pr[O_1(t), ..., O_S(t) | j], S streams; j is a class of interest (e.g. phone, word, etc.)
• Stream independence/dependence
• Type of fusion: early (feature) vs. late (decision)
• Decision fusion (see the sketch below): Score[O_1(t), ..., O_S(t) | j] = f( Pr[O_s(t) | j], s = 1, ..., S )
• Asynchrony between streams; synchronization level: feature, state, phone, word, utterance
• Classes of interest: visemes vs. phonemes
• Stream confidence estimation: SNR-based, HNR, entropy, etc.
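As a concrete illustration of the decision-fusion form above, here is a minimal Python sketch assuming a log-linear combining function f with per-stream exponents; the likelihoods and weights are placeholder values (in the systems described here the weights would come from held-out tuning or an SNR/entropy-based confidence estimate):

```python
import numpy as np

def decision_fusion_score(stream_likelihoods, weights):
    """Log-linear decision fusion:
    Score[O_1..O_S | j] = sum_s lambda_s * log Pr[O_s | j]."""
    return float(np.dot(weights, np.log(stream_likelihoods)))

# Hypothetical per-stream likelihoods Pr[O_s | j] for one class j
p_audio, p_video = 0.02, 0.15

# Confidence-based stream weights; must be tuned on held-out data
lam_audio, lam_video = 0.7, 0.3

score = decision_fusion_score([p_audio, p_video], [lam_audio, lam_video])
print(f"fused log-score: {score:.3f}")
```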

Page 6: Database and Visual Front End Makis Potamianos

Baseline

Data:
• SI train (261 speakers, 35 hrs)
• SI test (26 speakers, 2.5 hrs)
• Vocabulary: 10,400 words

Experimental method: rescore lattices.

Baseline results:
• Audio: rescoring of lattices using matched audio models
• Oracle: pick the path closest to the truth
• Random Path (R. Path): pick a random path through the lattice

[Bar chart: word error rate (%, 0-90) for Audio-IBM, Audio-HTK, Oracle, and R. Path in clean and 10 dB SNR conditions.]

Page 7: Database and Visual Front End Makis Potamianos

Database and Visual Front End

Makis Potamianos

Pages 8-14: [slides not transcribed]

Active Appearance Model Visual Features

Iain Matthews

Page 15: Database and Visual Front End Makis Potamianos

Acknowledgments

• Cootes, Edwards, Taylor (Manchester)
• Sclaroff (Boston)

Page 16: Database and Visual Front End Makis Potamianos

AAM Overview

Shape (from landmarks): x = x̄ + P_s b_s
Appearance (region of interest, warped to the reference shape): g = ḡ + P_g b_g
Combined shape & appearance: b = ( W_s b_s ; b_g ), modelled as b = Q c
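The equations above follow the standard Cootes et al. formulation. As a toy illustration (not the workshop code), the following numpy sketch builds the two PCA models and the combined model on synthetic data; the scalar weight Ws stands in for the diagonal weighting matrix that balances shape units against appearance units:

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_basis(data, k):
    """Mean and first k principal components of row-wise samples."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:k].T                     # (d,), (d, k)

# Synthetic training set: 100 examples of 20-point shapes, 64-pixel textures
shapes   = rng.normal(size=(100, 40))         # landmark coordinates x
textures = rng.normal(size=(100, 64))         # shape-normalised appearance g

x_bar, Ps = pca_basis(shapes, 5)              # x = x_bar + Ps @ bs
g_bar, Pg = pca_basis(textures, 8)            # g = g_bar + Pg @ bg

# Combined model: concatenate [Ws*bs ; bg] and run PCA again -> b = Q @ c
bs = (shapes - x_bar) @ Ps
bg = (textures - g_bar) @ Pg
Ws = np.sqrt(bg.var() / bs.var())             # scalar weight balancing units
b  = np.hstack([Ws * bs, bg])
_, Q = pca_basis(b, 6)

def synthesise(c):
    """Generate shape and appearance from combined parameters c."""
    bvec = Q @ c
    bs_c, bg_c = bvec[:5] / Ws, bvec[5:]
    return x_bar + Ps @ bs_c, g_bar + Pg @ bg_c

x, g = synthesise(rng.normal(size=6))
print(x.shape, g.shape)                       # (40,) (64,)
```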

Page 17: Database and Visual Front End Makis Potamianos

Relationship to DCT Features

• External feature detector vs. model-based learned tracking:
  • DCT pipeline: face detector -> ROI -> DCT features
  • AAM pipeline: AAM tracker -> AAM features
• ROI 'box' vs. explicit shape + appearance modeling

Page 18: Database and Visual Front End Makis Potamianos

Training Data

• 4072 hand-labelled images = 2 m 13 s (out of ~50 h of data)

Page 19: Database and Visual Front End Makis Potamianos

Final Model

[Figure: the final model, shown as the mean and modes of variation at ±3 standard deviations.]

Page 20: Database and Visual Front End Makis Potamianos

Fitting Algorithm

• Project the current model into the image; warp the image region under the model to the reference shape.
• Difference: compare the warped image with the current model appearance to form an error image.
• Predicted update: apply a pre-learned weight matrix to the error image to predict a parameter update, c <- c + Δc.
• Iterate until convergence.

c is all model parameters: the combined shape/appearance coefficients c_1, c_2, ..., plus pose (scale, rotation, translation).
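A toy sketch of the additive-update loop described above. The warp-and-sample step is replaced by a linear generative model so the example runs; the update-prediction matrix R stands in for the regression matrix an AAM learns offline from known parameter perturbations:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear stand-in for "warp image to reference and sample":
# an image patch is generated as g0 + A @ c_true (+ noise).
n_pix, n_par = 200, 10
A  = rng.normal(size=(n_pix, n_par))
g0 = rng.normal(size=n_pix)
c_true = rng.normal(size=n_par)
image_sample = g0 + A @ c_true + 0.01 * rng.normal(size=n_pix)

# "Weight" matrix R, learned offline in a real AAM by regressing parameter
# perturbations on residuals; here simply the pseudo-inverse of A.
R = np.linalg.pinv(A)

c = np.zeros(n_par)                        # all model parameters
for it in range(20):
    model_texture = g0 + A @ c             # current model projection
    residual = image_sample - model_texture  # difference image
    dc = R @ residual                      # predicted update
    c = c + dc
    if np.linalg.norm(dc) < 1e-6:          # iterate until convergence
        break

print("iterations:", it + 1, " error:", np.linalg.norm(c - c_true))
```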

Page 21: Database and Visual Front End Makis Potamianos

Tracking Results

• Worst sequence: mean mean-square error = 548.87
• Best sequence: mean mean-square error = 89.11

Page 22: Database and Visual Front End Makis Potamianos

Tracking Results

• Full-face AAM tracker on a subset of the VVAV database
• 4,952 sequences
• 1,119,256 images @ 30 fps = 10 h 22 m
• Mean mean-MSE per sentence = 254.21
• Tracking rate (m2p decode): 4 fps

• Beard-area and lips-only models will not track
• These regions lack the sharp texture gradients needed to locate the model?

Page 23: Database and Visual Front End Makis Potamianos

Features

• Use AAM full-face features directly (86-dimensional)

Page 24: Database and Visual Front End Makis Potamianos

Audio Lattice Rescoring Results

[Bar chart: word error rate (%) after rescoring audio lattices with each visual feature set (AAM 86 features; AAM 30 features; AAM 30 +Δ+ΔΔ; AAM 86 LDA => 24, LDA ±7; DCT 18 +Δ+ΔΔ; DCT 24, LDA ±7; Noise 30); the values are tabulated on Page 90.]

Lattice random path = 78.14%
DCT with LM = 51.08%
DCT no LM = 61.06%
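The "LDA ±7" entries above refer to stacking each frame with its ±7 neighbours before a discriminant projection. A sketch of that construction, with a random matrix standing in for the LDA transform (which in the real system is trained on phonetically labelled frames):

```python
import numpy as np

rng = np.random.default_rng(2)

def stack_context(feats, k=7):
    """Stack each frame with +-k neighbours (edges padded by repetition)."""
    T, d = feats.shape
    padded = np.vstack([np.repeat(feats[:1], k, 0),
                        feats,
                        np.repeat(feats[-1:], k, 0)])
    return np.hstack([padded[i:i + T] for i in range(2 * k + 1)])  # (T, (2k+1)*d)

dct = rng.normal(size=(300, 24))          # 300 frames of 24-dim DCT features
stacked = stack_context(dct, k=7)         # (300, 360)

# Stand-in for the LDA matrix (trained on labelled frames in the real system)
W_lda = rng.normal(size=(stacked.shape[1], 41))
features = stacked @ W_lda                # (300, 41) final visual features
print(features.shape)
```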

Page 25: Database and Visual Front End Makis Potamianos

Audio Lattice Rescoring Results

• AAM vs. DCT vs. Noise visual features

[Bar chart: per-speaker word error rate (%, 0-120) for AAM, DCT, and Noise features across 13 speakers (AAK, AEM, AGC, ALD, APM, ASA, ASJ, ATK, ATV, AXA, AXH, AXK, AXP).]

Page 26: Database and Visual Front End Makis Potamianos

Tracking Errors Analysis

• AAM word error rate vs. tracking error

[Chart ("Tracking vs. Accuracy"): per-speaker mean MSE tracking error and word error rate (%) for the same 13 speakers (AAK ... AXP); series: AAM WER and mean MSE.]

Page 27: Database and Visual Front End Makis Potamianos

Analysis and Future Work

• Models are under-trained
• Little more than face detection on 2 m of training data

Page 28: Database and Visual Front End Makis Potamianos

Analysis and Future Work

• Models are under-trained
• Little more than face detection on 2 m of training data

• Project the face through a more compact model (reproject)
• Retain only useful articulation information?

Page 29: Database and Visual Front End Makis Potamianos

Analysis and Future Work

• Models are under-trained
• Little more than face detection on 2 m of training data

• Project the face through a more compact model (reproject)
• Retain only useful articulation information?

• Improve the reference shape
• Minimal information loss through the warping?

Page 30: Database and Visual Front End Makis Potamianos

Asynchronous Stream Modelling

Juergen Luettin

Page 31: Database and Visual Front End Makis Potamianos

The Recognition Problem

M* = argmax_M P(M | O_A, O_V)

P(M | O_A, O_V) = P(O_A, O_V | M) P(M) / P(O_A, O_V)

M: word (phoneme) sequence
M*: most likely word sequence
O_A: acoustic observation sequence
O_V: visual observation sequence

Page 32: Database and Visual Front End Makis Potamianos

Integration at the Feature Level

P(O_A, O_V | M) = P(O_AV | M),
where O_AV = ( o_AV(1), o_AV(2), ..., o_AV(T) ) and o_AV(t) = [ o_A(t), o_V(t) ].

Assumptions:
• conditional dependence between modalities
• integration at the feature level
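In practice, feature-level integration is per-frame concatenation once the two streams share a frame rate. A minimal sketch, with assumed rates (100 fps audio, 30 fps video) and dimensions, using linear interpolation to upsample the visual stream:

```python
import numpy as np

def upsample(feats, t_src, t_tgt):
    """Linearly interpolate each feature dimension from t_src to t_tgt."""
    return np.stack([np.interp(t_tgt, t_src, feats[:, d])
                     for d in range(feats.shape[1])], axis=1)

audio = np.random.randn(1000, 60)               # 100 fps acoustic features o_A(t)
video = np.random.randn(300, 41)                # 30 fps visual features o_V(t)

t_a = np.arange(1000) / 100.0                   # audio frame times (s)
t_v = np.arange(300) / 30.0                     # video frame times (s)
video_100fps = upsample(video, t_v, t_a)        # align video to audio times

o_av = np.hstack([audio, video_100fps])         # concatenated o_AV(t)
print(o_av.shape)                               # (1000, 101)
```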

Page 33: Database and Visual Front End Makis Potamianos

Integration at the Decision Level

P(O_A, O_V | M) = P(O_A | M) · P(O_V | M)

Assumptions:
• conditional independence between modalities
• integration at the unit level

Page 34: Database and Visual Front End Makis Potamianos

Multiple Synchronous Streams

Two streams in each state:

b_j( o_A(t), o_V(t) ) = [ Σ_{m=1..M_A} c^A_{jm} N_A( o_A(t) ) ] · [ Σ_{n=1..M_V} c^V_{jn} N_V( o_V(t) ) ]

P(O_AV | M) = max_X Π_{t=1..T} a_{x(t-1) x(t)} b_{x(t)}( o_AV(t) ), with initial state x(0)

X: state sequence
a_ij: transition probability from state i to state j
b_j: probability density of state j
c_jm: m-th mixture weight of multivariate Gaussian N

Assumptions:
• conditional independence
• integration at the state level
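A sketch of the two-stream state score b_j above for diagonal-covariance Gaussian mixtures, with stream exponents added to reflect the global stream weights mentioned on Page 38 (setting both exponents to 1 recovers the plain product); all parameters are random placeholders:

```python
import numpy as np

def log_gmm(x, weights, means, variances):
    """Log density of a diagonal-covariance Gaussian mixture at x."""
    diff = x - means                                   # (M, d)
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2 * np.pi * variances)
                               + diff**2 / variances, axis=1))
    m = log_comp.max()
    return m + np.log(np.exp(log_comp - m).sum())      # log-sum-exp

def log_b_state(o_a, o_v, gmm_a, gmm_v, lam_a=0.7, lam_v=0.3):
    """log b_j(o_A(t), o_V(t)) = lam_A log GMM_A + lam_V log GMM_V."""
    return lam_a * log_gmm(o_a, *gmm_a) + lam_v * log_gmm(o_v, *gmm_v)

rng = np.random.default_rng(3)
M, dA, dV = 12, 60, 41                                 # 12 mixture components
gmm_a = (np.full(M, 1 / M), rng.normal(size=(M, dA)), np.ones((M, dA)))
gmm_v = (np.full(M, 1 / M), rng.normal(size=(M, dV)), np.ones((M, dV)))

print(log_b_state(rng.normal(size=dA), rng.normal(size=dV), gmm_a, gmm_v))
```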

Page 35: Database and Visual Front End Makis Potamianos

Multiple Asynchronous Streams

P̂(O_A, O_V | M) = [ max_{X_A} Π_{t=1..T} a^A_{x_A(t-1) x_A(t)} b^A_{x_A(t)}( o_A(t) ) ] · [ max_{X_V} Π_{t=1..T} a^V_{x_V(t-1) x_V(t)} b^V_{x_V(t)}( o_V(t) ) ]

Assumptions:
• conditional independence
• integration at the unit level

Decoding: individual best state sequences for audio and video.

Page 36: Database and Visual Front End Makis Potamianos

Composite HMM Definition

[Diagram: 9-state composite model (states 1-9), the product of a 3-state audio chain and a 3-state video chain, with composite transitions such as a^AV_12, a^AV_13, a^AV_15, a^AV_25, a^AV_27, a^AV_45.]

P̂(O_A, O_V | M) = P(O_A, O_V | M_AV), where the composite state is x_AV = (x_A, x_V) and the composite transition probabilities are products of the per-stream transition probabilities (a^AV = a^A · a^V).

Limiting asynchrony: |x_A - x_V| ≤ L, L = 0, ..., #states.

Speech-noise decomposition (Varga & Moore, 1993)
Audio-visual decomposition (Dupont & Luettin, 1998)
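Under the independence assumption, the composite transition matrix is exactly the Kronecker product of the per-stream matrices, and limiting asynchrony amounts to deleting composite states whose audio and video indices differ by more than L. A toy construction for the 3×3 (9-state) model above:

```python
import numpy as np

# 3-state left-to-right transition matrices for audio and video streams
a_A = np.array([[0.6, 0.4, 0.0],
                [0.0, 0.6, 0.4],
                [0.0, 0.0, 1.0]])
a_V = a_A.copy()

# Composite 9-state model: state (i, j); a^AV = a^A (x) a^V (independence)
a_AV = np.kron(a_A, a_V)                      # (9, 9)

# Limiting asynchrony: forbid composite states with |i - j| > L
L = 1
idx = [(i, j) for i in range(3) for j in range(3)]
allowed = np.array([abs(i - j) <= L for i, j in idx])
a_AV[~allowed, :] = 0.0
a_AV[:, ~allowed] = 0.0

# Renormalise the surviving rows so each is again a distribution
rows = a_AV.sum(axis=1, keepdims=True)
a_AV = np.divide(a_AV, rows, out=np.zeros_like(a_AV), where=rows > 0)
print(a_AV.round(2))
```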

Page 37: Database and Visual Front End Makis Potamianos

Stream Clustering

Page 38: Database and Visual Front End Makis Potamianos

AVSR System

• 3-state HMMs with 12 mixture components; 7-state HMMs for the composite model
• context-dependent phone models (silence, short pause), tree-based state clustering
• cross-word context-dependent decoding, using lattices computed at IBM
• trigram language model
• global stream weights in multi-stream models, estimated on a held-out set

Page 39: Database and Visual Front End Makis Potamianos

Speaker-Independent Word Recognition

[Bar chart: word error rate (%, 0-60) in three conditions (Clean; Noisy, A lattices; Noisy, AV lattices) for Audio, Video, AV HiLDA, AV 1 Stream, AV 2 Streams synchronous, and AV 2 Streams asynchronous.]

Page 40: Database and Visual Front End Makis Potamianos

Conclusions

• The AV 2-stream asynchronous model beats the other models in noisy conditions

Future directions:
• Transition matrices: context dependence, pruning low-probability transitions, cross-unit asynchrony
• Stream weights: model-based, discriminative
• Clustering: taking stream-tying into account

Page 41: Database and Visual Front End Makis Potamianos

Phone Dependent Weighting

Dimitra Vergyri

Pages 42-46: [slides not transcribed]

Weight Estimation

Hervé Glotin

Pages 47-55: [slides not transcribed]

Visual Clustering

June Sison

Page 56: Database and Visual Front End Makis Potamianos

Outline

• Motivation for the use of visemes in triphone classification
• Definition of visemes
• Goals of viseme usage
• Inspection of phone trees (validity check)

Page 57: Database and Visual Front End Makis Potamianos

Equivalence Classification

• Combats the problem of data sparseness
• Must be sufficiently refined that the equivalence classification can serve as a basis for prediction
• Uses decision trees to achieve equivalence classification [co-articulation]

• To derive an EC (see the sketch below):
  1] collect speech data realizing each phone
  2] classify [cluster] this speech into appropriately distinct categories
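A toy sketch of the greedy splitting step behind this procedure: each phonetic question is scored by the gain in single-Gaussian log-likelihood obtained by splitting the pooled data, and the best question becomes the node's split. Data and questions here are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(4)

def pool_loglik(x):
    """Log-likelihood of data under a single diagonal Gaussian fit to it."""
    var = x.var(axis=0) + 1e-6
    n = x.shape[0]
    return -0.5 * n * np.sum(np.log(2 * np.pi * var) + 1.0)

# Frames pooled from several triphone states, tagged with their left context
frames = rng.normal(size=(500, 8))
left_ctx = rng.choice(["p", "b", "f", "aa", "iy"], size=500)

questions = {"L-bilabial?": ["p", "b"], "L-vowel?": ["aa", "iy"]}

base = pool_loglik(frames)
gains = {}
for name, members in questions.items():
    yes = np.isin(left_ctx, members)
    if yes.all() or not yes.any():          # question does not split the pool
        continue
    gains[name] = pool_loglik(frames[yes]) + pool_loglik(frames[~yes]) - base

best = max(gains, key=gains.get)
print("best question:", best, " gain:", round(gains[best], 2))
```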

Page 58: Database and Visual Front End Makis Potamianos
Page 59: Database and Visual Front End Makis Potamianos

Definition of visemes

• Canonical mouth shapes that accompany speech utterances
• Complements the phonetic stream [examples]

Page 60: Database and Visual Front End Makis Potamianos
Page 61: Database and Visual Front End Makis Potamianos

Visual vs Audio Contexts

Question sets: 276 QS total
• 84 single-phoneme QS
• 116 audio QS
• 76 visual QS

Root nodes: 123 total
• 74 audio
• 33 visual
• 16 single phoneme

Page 62: Database and Visual Front End Makis Potamianos
Page 63: Database and Visual Front End Makis Potamianos

Visual Models

Azad Mashari

Page 64: Database and Visual Front End Makis Potamianos

Visual Speech Recognition

• The Model Trinity
  • Audio-Clustered Model (Question Set 1)
  • Self-Clustered Model (Question Set 1)
  • Self-Clustered Model (Question Set 2)
• The "Results" (from which we learn what not to do)
• The Analysis
• Places to Go, Things to Do ...

Page 65: Database and Visual Front End Makis Potamianos

The Questions

• Set 1: original audio questions
  • 202 questions
  • based primarily on voicing and manner
• Set 2: audio-visual questions
  • 274 questions (includes Set 1)
  • includes questions regarding place of articulation

Page 66: Database and Visual Front End Makis Potamianos

The Trinity

• Audio-Clustered model: decision trees generated from the audio data using question set 1; visual triphone models clustered using the trees
• Self-Clustered old: decision trees generated from the visual data using question set 1
• Self-Clustered new: decision trees generated from the visual data using question set 2

Page 67: Database and Visual Front End Makis Potamianos

Experiment I

• 3 major factors:
  • independence / complementarity of the two streams
  • quality of the representation
  • generalization
• Speaker-independent test
• Noisy audio lattices rescored using visual models

Page 68: Database and Visual Front End Makis Potamianos

Experiment I

• Rescoring noisy audio lattices using the visual models

Word error rate (%) by visual model:
• Audio Clustering (AQ): 51.24
• Self-Clustering (AQ): 51.08
• Self-Clustering (VQ): 51.16

Page 69: Database and Visual Front End Makis Potamianos

Experiment I

[Bar chart ("Per-speaker Word Error Rate on SI-test"): per-speaker word error rate (%) for VQ vs. AQ clustering, shown relative to a baseline (vertical axis -4 to 16), across 26 speakers: 1AXK01, 1JFM01, 1JXC01, 1LCY01, 1MBG01, 1MDP01, 1RTG01, 3BAE01, 3CNM01, 3DJF01, 3DLN01, 3DVO01, 3EPH01, 3JLW01, 3JPC01, 3JWL01, 3JXP01, 3KPR01, 3KXK01, 3KXM01, 3LRW01, 3MXE01, 3PJB01, 3RMF01, 3SXA01, 3SXE01.]

Page 70: Database and Visual Front End Makis Potamianos

Experiment I

• Speaker variability of the visual models follows the variability of the audio models (we don't know why; lattices?)
• This does not mean that they are not "complementary".
• Viseme clustering gives better results for some speakers only; no overall gain (we don't know why).
• Are the new questions being used?
• Over-training?
  • ~7000 clusters in the audio models for ~40 phonemes; the same number in the visual models, but there are only ~12 "visemes" -> experiments with fewer clusters
• Is the greedy clustering algorithm making a less optimal tree with the new questions?

Page 71: Database and Visual Front End Makis Potamianos

Experiment II

• Several ways to get fewer clusters:
  • increase the minimum cluster size
  • increase the likelihood-gain threshold
  • remove questions (especially those frequently used at higher depths, as well as unused ones)
  • any combination of the above
• Tripling the minimum likelihood-gain threshold (single-mixture models) -> insignificant increase in error:
  • ~7000 clusters -> 54.24%; ~2500 clusters -> 54.57%
• Even fewer clusters (~150-200)? A different reduction strategy?

Page 72: Database and Visual Front End Makis Potamianos

Places to Go, Things to See...

• Finding optimal clustering parameters; current values are optimized for MFCC-based audio models
• Clustering with viseme-based questions only
• Looking at errors in recognition of particular phones/classes

Page 73: Database and Visual Front End Makis Potamianos

Visual Model Adaptation

Jie Zhou

Page 74: Database and Visual Front End Makis Potamianos

Visual Model Adaptation

• Problem: the speaker-independent system is not sufficient to accurately model each new speaker
• Solution: use adaptation to make the speaker-independent system better fit the characteristics of each new speaker

Page 75: Database and Visual Front End Makis Potamianos

HMM Adaptation

To get a new estimate of the adapted mean, µ, we use the transformation matrix given by:

µ = W ε

where W is the (n × n) transformation matrix, n is the dimensionality of the data, and ε is the original mean vector.
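A minimal sketch of applying the slide's (n × n) transform to a bank of Gaussian means (full MLLR actually uses an extended mean vector with a bias column; W here is a random stand-in for the matrix estimated from adaptation data):

```python
import numpy as np

rng = np.random.default_rng(5)

n = 39                                   # feature dimensionality
means = rng.normal(size=(500, n))        # epsilon: SI Gaussian means, one per row

# Global transform estimated from the speaker's adaptation data (stand-in here)
W = np.eye(n) + 0.05 * rng.normal(size=(n, n))

adapted = means @ W.T                    # mu = W @ epsilon for every Gaussian
print(adapted.shape)                     # (500, 39)
```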

Page 76: Database and Visual Front End Makis Potamianos

[Diagram: speaker-independent data with mean ε mapped onto speaker-specific data with adapted mean µ.]

Page 77: Database and Visual Front End Makis Potamianos

[Diagram: HEAdapt flow. VVAV HMM models and speaker-independent data (ε, σ) feed HEAdapt, which produces a transformed speaker-independent model (µ = W ε); recognition is then run on speaker-adapted test data.]

Page 78: Database and Visual Front End Makis Potamianos

Procedure

Speaker adaptation of the visual models was performed using:
• MLLR (method of adaptation)
• a global transform
• single-mixture triphones

Adaptation data: 5 minutes per speaker on average
Test data: 6 minutes per speaker on average

Page 79: Database and Visual Front End Makis Potamianos

Results

Word error, %:

Speaker  | Speaker Independent | Speaker Adapted
AXK      | 44.05%              | 41.92%
JFM      | 61.41%              | 59.23%
JXC      | 62.28%              | 60.48%
LCY      | 31.23%              | 29.32%
MBG      | 83.73%              | 83.56%
MDP      | 30.16%              | 29.89%
RTG      | 57.44%              | 55.73%
BAE      | 36.81%              | 36.17%
CNM      | 84.73%              | 83.89%
DJF      | 71.96%              | 71.15%
Average  | 58.98%              | 55.49%

Page 80: Database and Visual Front End Makis Potamianos

Future

Better adaptation can be achieved by:
• employing multiple transforms instead of a single transform
• attempting other methods of adaptation, such as MAP, with more data
• using mixture Gaussians in the model

Page 81: Database and Visual Front End Makis Potamianos

Summary and Conclusions

Chalapathy Neti

Page 82: Database and Visual Front End Makis Potamianos

Summary of Results

Data:
• SI train (261 speakers, 35 hrs)
• SI test (26 speakers, 2.5 hrs)
• Vocabulary: 10,400 words

Systems:
• Feature fusion (AV-1str): concatenated audio-visual features
• AV-HiLDA: hierarchical LDA for feature fusion
• Multistream (MS, AV-2str-synchronous): state-synchronized decision fusion
• Product HMM (PD, AV-2str-asynchronous): state-asynchronous (phone-synchronous)
• HiF (hierarchical fusion):
  • HiF-HiLDA: AV lattices rescored using HiLDA models
  • HiF-MS: AV lattices rescored using MS models
  • HiF-PD: AV lattices rescored using Product models
• PDUF: phone-specific stream weights, utterance-level fusion

[Bar chart: word error rate (%, 0-60) in clean and 10 dB SNR conditions for Audio, Visual, AV-1str, AV-HiLDA, MS, PD, HiF-HiLDA, HiF-MS, HiF-PD, and PDUF.]

Page 83: Database and Visual Front End Makis Potamianos

Conclusions

• Small gains on clean audio (9% relative): hierarchical LDA (HiLDA) and phone-dependent weighting schemes improve clean-audio performance
• Significant gains for noisy audio using two-pass schemes (hierarchical fusion) (> 27% relative)
• Noisy AV lattices rescored using "asynchronous fusion" (PD) improve the error rate by 27.51% relative to the matched noisy audio model
• Visual modeling requires more refinement
• The rescoring methodology constrains the relative goodness of models
• The LM dominates the lattice best path in the absence of good additional evidence?

Page 84: Database and Visual Front End Makis Potamianos

Open Issues

Visual feature representation:
• What is the best ROI?
• 3D features?
• Better tracking of the ROI
• Explicit representation of place of articulation?

Visual models:
• Why are visually relevant contexts not doing better?

Fusion:
• Better models of asynchrony (product models)
• Automatic estimation of stream confidences
• Unit-dependent weights in multistream/product HMMs

Page 85: Database and Visual Front End Makis Potamianos

Acknowledgements

Michael Picheny, David Nahamoo (IBM)
Giri Iyengar, Sunil Sivanandan, Eric Helmuth (IBM)
Asela Gunawardana, Murat Saraclar (CLSP)
Andreas Andreou, Eugenio Culurciello (JHU CE)
CLSP staff (special thanks to Amy for the T-shirts)
Fred Jelinek, Sanjeev Khudanpur, Bill Byrne (CLSP)

Page 86: Database and Visual Front End Makis Potamianos

The End.

Page 87: Database and Visual Front End Makis Potamianos

Extra Slides…

Page 88: Database and Visual Front End Makis Potamianos

State based Clustering

Page 89: Database and Visual Front End Makis Potamianos

Error Rate on DCT Features

Word error rate on a small multi-speaker test set:

                               Language Model | No Language Model
Lattice Depth 1, Clean Audio   24.79          | 27.79
Lattice Depth 3, Clean Audio   25.55          | 34.58
Lattice,         Noisy Audio   49.79          | 55.00

Page 90: Database and Visual Front End Makis Potamianos

Audio Lattice Rescoring Results

Visual Feature                 | Word Error Rate, %
AAM - 86 features              | 65.69
AAM - 30 features              | 65.66
AAM - 30 +Δ+ΔΔ                 | 69.50
AAM - 86 LDA 24, WiLDA ±7      | 64.00
DCT - 18 +Δ+ΔΔ                 | 61.80
DCT - 24, WiLDA ±7             | 58.14
Noise - 30                     | 61.37

DCT WiLDA no LM = 65.14
Lattice random path = 78.32

Page 91: Database and Visual Front End Makis Potamianos

Overview

[Figure: shape and appearance components of the model.]

Page 92: Database and Visual Front End Makis Potamianos

Results/Status

Word error rate, %; HTK results, with IBM results in parentheses:

Experiment                                    | Clean Condition | Noisy Condition
Clean Audio Models                            | 14.4 (14.5)     | 83.0
Visual Models (DCT/Discr.)                    | 24.0*           | 51.08
Noisy Audio Models                            | 48.1 (46.43)    |
Feature Fusion                                | 16.0            | 44.97
Feature Fusion (HiLDA)                        | 13.84           | 42.86
Multistream                                   | 14.58           | 43.80
Multistream (NGW - Herve)                     | 13.47           | 35.26
Multistream (ACVM - Herve)                    | 15.15           | 38.38
Product HMM                                   | 14.19           | 43.67
Hierarchical Fusion (NAV lats, HiLDA models)  |                 | 36.99
Hierarchical Fusion (NAV lats, MS models)     |                 | 36.61
Hierarchical Fusion (NAV lats, PD models)     |                 | 35.21
Phone-Dependent Utterance Fusion (PDUF)       | 13.05           |