
Page 1:

Katholieke Universiteit Leuven - ESAT, BELGIUM

Combining Abstract and Exemplar Models in ASR

Dirk Van Compernolle, Kris Demuynck, Mathias De Wachter

S2S Nijmegen Workshop, February 10-14, 2008

Page 2:

Overview

• PART I: Example based models in ASR
  – Motivation
  – Proof-of-Concept
  – Baseline Results
  – Required Extensions
• PART II: Bottom-up vs. Top-down processing in ASR – Do we care?
  – A top-down search engine with bottom-up phonetic scoring
  – A combined template matching and HMM recognizer

Page 3:

PART I: Example Based ASR

Page 4:

Example Based ASR

• Example based ASR was successful in speaker-dependent isolated word recognition. It was abandoned when the technology moved to continuous, speaker-independent recognition.

Why re-activate an approach that quietly died 25 years ago?
• Psycho-linguistics and intuition give evidence of the existence of individual memory traces (spanning many phonemes) in:
  – human speech recognition in general
  – music/song memory & recognition
  – second language learning
• Success of concatenative Text-to-Speech
• Acknowledgement of the limitations of model based (HMM based) ASR
• Computing demands for continuous large vocabulary recognition were essential bottlenecks – that may not be relevant any longer today.

Page 5:

Today's Prototypical HMM based ASR: the Beads-on-a-String Model

Page 6:

Phone Modeling with HMMs

[Figure: TRAINING of a MULTI-STATE MODEL for phone 'j' – many examples of phone 'j' (short-time spectral representations) are used to estimate the states S_j1, S_j2, S_j3 of the sub-phone units ph(j)_1 … ph(j)_N, each characterized by means + variances + a duration model.]
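For reference (standard single-Gaussian HMM notation, added here and not quoted from the slide), each state S_jk of such a phone model carries an emission density whose parameters are exactly the "means + variances" in the figure:

```latex
b_{S_{jk}}(x_t) \;=\; \mathcal{N}\!\left(x_t;\, \mu_{jk}, \Sigma_{jk}\right)
\;=\; \frac{1}{\sqrt{(2\pi)^{d}\,\lvert\Sigma_{jk}\rvert}}
      \exp\!\left(-\tfrac{1}{2}\,(x_t - \mu_{jk})^{\top} \Sigma_{jk}^{-1} (x_t - \mu_{jk})\right)
```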

Page 7:

Iterative HMM Training

[Flowchart: speech database → Feature Extraction; word level transcription + dictionary/phone set → words-to-phones → phone level transcription; starting from a reference HMM, the loop Viterbi Alignment → State (sub-phone) Segmentation → Re-estimation is iterated and yields the Phone HMMs.]
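A schematic rendering of this loop follows; all callables are hypothetical placeholders for the blocks in the flowchart (feature extraction, words-to-phones mapping, Viterbi alignment, re-estimation), not the authors' actual tooling:

```python
def train_phone_hmms(utterances, extract_features, words_to_phones,
                     init_hmms, viterbi_align, reestimate, n_iter=10):
    """Sketch of the iterative HMM training loop in the flowchart above."""
    feats = [extract_features(utt.audio) for utt in utterances]            # Feature Extraction
    phones = [words_to_phones(utt.transcription) for utt in utterances]    # words-to-phones
    hmms = init_hmms()                                                     # reference HMM to start from
    for _ in range(n_iter):
        # Viterbi Alignment -> state (sub-phone) segmentation of every utterance
        segmentations = [viterbi_align(hmms, x, p) for x, p in zip(feats, phones)]
        # Re-estimation: update the per-state means/variances (and duration model)
        hmms = reestimate(feats, segmentations)
    return hmms
```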

Page 8:

HMM Model Building

• Based on a 2-D LDA projection of mel-cepstra optimized for digit recognition (a minimal sketch of such a projection follows this list)
• 'S1,S2,S3' represent the 3 CD HMM-states of the central vowel in "f I ve"
• Ellipses indicate the '1-sigma' boundaries
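A class-supervised 2-D projection of this kind can be produced, for illustration, with scikit-learn's LDA, assuming mel-cepstral frames and their phone-state labels are already available; this only sketches how such a scatter plot is made and is not the authors' code:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def project_frames_2d(frames, state_labels):
    """Supervised 2-D LDA projection of mel-cepstral frames,
    suitable for scatter plots like the one on this slide."""
    lda = LinearDiscriminantAnalysis(n_components=2)
    return lda.fit_transform(np.asarray(frames), np.asarray(state_labels))
```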

Page 9:

HMMs – Strengths

• Strong mathematical framework
  – statistical pattern matching / Bayesian classification
  – optimal strategy under the assumption of a perfect model with sufficient training data !!
  – fully automatic training (inner loop)
  – ability to (optimally) combine the information from thousands of hours of example speech
• Highly scalable: more data leads to better results
  – allows for training a more refined model with more parameters that gets closer to reality (model assumptions)
  – a better trained model that is more robust to intrinsic variability

Page 10:

HMMs - Weaknesses

• The model is intrinsically flawed, because of:
  – the within-state (i.e. short-term) stationarity assumption
  – the 1st order Markov assumption (state independence)
  – the presumed frame-by-frame independence of the observations
  (the factorization below makes these assumptions explicit)
• This implies:
  – no guaranteed optimality for the Bayesian Classification / Maximum Likelihood paradigm
  – a continuous effort to improve (patch) the model
  – best performance with discriminative training procedures
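These assumptions are exactly what the standard HMM likelihood factorization encodes (textbook notation, added for reference): each observation x_t depends only on the current state q_t, and q_t only on q_{t-1}:

```latex
P(X, Q \mid \lambda) \;=\; \pi_{q_1}\, b_{q_1}(x_1)\; \prod_{t=2}^{T} a_{q_{t-1}q_t}\, b_{q_t}(x_t)
```

Any longer-span (segmental) structure of X therefore has to be captured indirectly, e.g. via derivative features or context-dependent states.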

Page 11:

HMMs: 30 yrs of improvements on the basic model

"If the model was correct, then nothing would be better than HMMs and Maximum Likelihood training. So, let's stick to the concept and fix the model."
• Multi-state Context-Dependent models
• Multi-gaussian modeling of the observation densities
• Derivative Features

"For this we only need bigger computers and more data, allowing us"
• to make these complex models with more degrees of freedom
• to do a proper training of these hundreds of millions of parameters
• to perform recognition with them in real-time

Page 12:

HMMs: 30 yrs of improvements on the basic model

…. then we will reach nirvana, unless …

… after 30 yrs the model is still basically flawed
  – because of poor segmental modeling
… more training data no longer seems to result in better models
  – because gains seem to grow only logarithmically with the amount of training data
  – because for smaller languages more data is just not feasible
… so, today computers have more power than we know what to do with.

Page 13:

Example Trajectories and HMM states

• Trajectories contain more information than the HMM state sequence !!

• Trajectories show a very different picture than the ‘cloud’ of points underlying HMM state training

Page 14:

Aligning of individual trajectories to HMMs

Page 15:

HMM vs. Segmental

[Figure: two observation sequences (red and black) passing through HMM states S1 and S2.]

• HMM viewpoint: the red and the black sequence of observations yield identical scores
• Segmental viewpoint: the black trajectory is significantly more plausible than the red one

Page 16:

Segmental Modeling in HMMs

• Segmental properties are obviously important within phonemes and across multiple phonemes
• HMMs lose this longer time-scale view despite the modifications made to the model over the years
• Attempts to make segmental statistical models have not been very successful so far
• Detailed trajectory properties were well preserved in the old template matching DTW systems
• …

Page 17:

Is example based large vocabulary continuous recognition a viable alternative to model state based (HMM) recognition?

Page 18:

Example Based ASR: Research Agenda

• Proof-of-concept phase
  – build a baseline mid/large vocabulary system with medium sized databases
  – show recognition performance similar to HMM systems
• Competitive phase
  – build systems that can handle huge databases
  – build systems that go beyond the naïve extrapolation of today's HMMs
  – improve on performance at acceptable cost

Page 19:

HMM vs. Exemplar

|                          | HMM                                                                             | EXEMPLAR                                                                                  |
| Units                    | Phones/Allophones                                                               | Phone templates                                                                           |
| Local similarity         | Multi-gaussian distributions                                                    | Mahalanobis distance                                                                      |
| Time alignment           | Viterbi on HMM states                                                           | Dynamic Time Warping                                                                      |
| Transition probabilities | Phonetic dictionary, LM                                                         | Phonetic dictionary, LM, long-span speech attributes                                      |
| Search                   | Time synchronous beam search                                                    | Time synchronous beam search                                                              |
| Training                 | HMM parameters (multi-gaussian distributions); type and number of allophonic variants | Labeling and segmentation of the training database; parameter estimation for the distance metric |

Page 20:

Example Based LVCSR: How? Baseline System

• Speech Database [ = "Memory" ]
  – same databases as used for training statistical systems
  – a collection of long stretches of acoustic vectors
  – annotated at multiple levels: phone, syllable, word, …
  – any of the annotations (incl. segmentation) can serve as a "template"
• Recognition Paradigm (a minimal DTW sketch follows this list)
  – find the sequence of templates that best matches a given input, using Dynamic Time Warping (DTW)
  – use the 'Template Transition Cost' concept to control template transitions
• Borrow other components from existing HMM technology
  – token passing Time Synchronous Beam search
  – N-gram language modeling
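A minimal sketch of the frame-level alignment step (one input stretch against one reference template); a real decoder searches over sequences of templates with transition costs on top of this, so this is illustrative only:

```python
import numpy as np

def dtw_template_score(x, template, local_dist):
    """Align input frames x (T, d) to a reference template (R, d) and
    return the accumulated local distance along the best warping path."""
    T, R = len(x), len(template)
    D = np.full((T + 1, R + 1), np.inf)
    D[0, 0] = 0.0
    for t in range(1, T + 1):
        for r in range(1, R + 1):
            c = local_dist(x[t - 1], template[r - 1])
            # classic step pattern: diagonal match, insertion, deletion
            D[t, r] = c + min(D[t - 1, r - 1], D[t - 1, r], D[t, r - 1])
    return D[T, R]
```

The `local_dist` callable could be a plain Euclidean distance or the class-dependent Mahalanobis distance discussed two slides further on.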

Page 21:

Aligning of trajectories

OBSERVE: the “closest matching template” is by no means the sequence of nearest neighbors for each frame

Page 22:

Issues in Example Based ASR: Local Distance Metric

• The utterance based distortion is the sum of local (frame based) distances
• One of the great advances of HMM systems was the use of more complex metrics than previously used in DTW
  – class (phone state) dependent
  – multi-gaussian distributions with many parameters
• It is possible to transfer some of the HMM improvements to the DTW framework, but not all and not in a trivial manner (a minimal sketch follows this list):
  – local Mahalanobis distance
  – further improvements by applying other ideas from non-parametric statistics: outlier correction, data sharpening, adaptive kernel Mahalanobis, … [see papers ICASSP07, INTERSPEECH07]
• Weakness of (our) current system
  – the score is based on a single sequence of reference templates
  – from a KNN perspective the score should be based on group voting
• …. ongoing research
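A minimal sketch of such a class-dependent local distance, assuming one diagonal variance vector per phone class; the metric actually used in the system is more refined (see the papers cited above):

```python
import numpy as np

def mahalanobis_local_distance(x, y, class_var):
    """Frame-level Mahalanobis distance between an input frame x and a
    reference frame y, using the (diagonal) variance of y's phone class."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum(d * d / np.asarray(class_var, dtype=float))))
```

This is the kind of function that would be passed as `local_dist` to the DTW sketch on the previous slide.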

Page 23:

[Figure: an input utterance and its solution as a concatenation of database segments. The database contains 15.3 hours of speech from 84 speakers; only the 14 segments of 2 seconds relevant to the search are shown. Phone string: # # I! t s tI! l @ n klI! r # # #]

Page 24:

[Figure: the input, the concatenation of the chosen templates, and the result after concatenation + dynamic time warping (DTW). Phone string: # # I! t s tI! l @ n klI! r # # #]

Page 25:

Controlling Template Concatenation

Using a concatenation cost model based on:
  – natural successor templates in the reference database
  – phonetic context
  – gender, accent, recording condition, …
has a great impact on:
  – the length of the selected segments
  – the naturalness of the resynthesized reference
  – lowering the error
(an illustrative cost function is sketched after the table)

| Setup            | # phone errors | # reference segments | Avg. # templates in one segment |
| Input            |                | 55                   |                                 |
| No costs         | 12             | 48                   | 1.2                             |
| Optimal settings | 3              | 19                   | 2.9                             |
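An illustrative template-transition cost along these lines; the field names and weights are hypothetical (the real cost model is tuned, not hand-set), and templates are assumed to carry source, timing, phone and speaker attributes:

```python
def concatenation_cost(prev_tpl, next_tpl, w_nat=1.0, w_ctx=1.0, w_spk=0.5):
    """Penalty added when the search jumps from one reference template to
    the next; zero for a natural successor with matching phonetic context
    and matching speaker attributes."""
    cost = 0.0
    # natural successor: same source utterance and contiguous in time
    if not (next_tpl.utt_id == prev_tpl.utt_id and next_tpl.start == prev_tpl.end):
        cost += w_nat
    # phonetic context: the left context of the next template should
    # match the phone identity of the previous one
    if next_tpl.left_context != prev_tpl.phone:
        cost += w_ctx
    # speaker attributes: gender (accent, recording condition, ... analogously)
    if next_tpl.gender != prev_tpl.gender:
        cost += w_spk
    return cost
```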

Page 26:

Experiments - Task descriptions

• TIMIT
  – test and train sets hand-labeled
  – 1.6 hours of training material, 462 speakers
• Resource Management (RM)
  – 991 word lexicon: CMU v0.4 [train and test highly matched !]
  – 3.8 hours of noise-free training speech
  – 150,000 phone templates
• Wall Street Journal (WSJ0)
  – automatically segmented and labeled by an HMM system based on the sentence transcription
  – 15.3 hours of training material, 84 speakers
  – 4986 words
  – 450,000 phone templates (?)

Page 27:

Phone string recognition (TIMIT)

| Setup                      | PER   | % natural successors | % context match | % gender match |
| Baseline                   | 31.3% | 14%                  | 48%             | 82%            |
| Context costs              | 29.1% | 19%                  | 77%             | 83%            |
| Gender costs               | 31.0% | 16%                  | 48%             | 99%            |
| Natural successors         | 30.3% | 49%                  | 66%             | 91%            |
| All costs, development set | 27.6% | 38%                  | 83%             | 98%            |
| All costs, evaluation set  | 29.6% | 38%                  | 83%             | 98%            |
| Reference HMM              | 27.7% | N/A                  | N/A             | N/A            |
| Best in the literature     | 24.8% |                      |                 |                |

Page 28:

Phone string recognition (WSJ0)

| Setup                      | PER   | % natural successors | % context match | % gender match |
| Baseline                   | 20.8% | 7%                   | 54%             | 85%            |
| Context costs              | 16.9% | 14%                  | 86%             | 87%            |
| Gender costs               | 20.3% | 8%                   | 54%             | 99%            |
| Natural successors         | 15.5% | 66%                  | 80%             | 95%            |
| All costs, development set | 13.8% | 60%                  | 88%             | 99%            |
| All costs, evaluation set  | 14.2% | 59%                  | 87%             | 99%            |
| Reference HMM              | 14.9% | N/A                  | N/A             | N/A            |
| Best in the literature     | N/A   |                      |                 |                |

Page 29:

Sentence recognition (Resource Management)

|                      | dev  | oct89 | feb91 | sep92 | avg  |
| HMM WER (%)          | 2.26 | 2.83  | 2.13  | 5.08  | 3.35 |
| DTW WER (%)          | 2.23 | 2.79  | 2.21  | 4.22  | 3.07 |
| % natural successors | 69   | 69    | 67    | 68    | 68   |
| % matching contexts  | 94   | 94    | 94    | 93    | 94   |
| % matching gender    | 99   | 98    | 98    | 98    | 98   |

Decoder speed: ±10 x real time
Bottom-up speed: ±4 x real time

Page 30:

Sentence recognition (WSJ0)

|                      | dev92 | nov92 |
| HMM WER (%)          | 6.74  | 5.10  |
| DTW WER (%)          | 8.69  | 8.11  |
| % natural successors | 38    | 36    |
| % matching contexts  | 84    | 81    |
| % matching gender    | 98    | 98    |

Decoder speed: ±20 x real time
Bottom-up speed: ±17 x real time

Page 31:

Example based ASR: Discussion

• For some tasks (medium sized problems) we were able to build a system that matches or exceeds the performance of state-of-the-art HMM systems
  PROOF OF CONCEPT: OK
• Success is critically dependent on the ability to use multi-phone segments
  – the frame based distance metric is not as powerful (yet) as with HMMs ! [ single nearest paradigm instead of KNN ? ]
  – potentially better modeling of phone transitions than CD-HMMs [ i.e. NO modeling ! ]
• Challenges to move to large vocabulary tasks and very large databases:
  – richness of the database: very many contexts by very many speakers
  – move away from the naive HMM-like top-down search engine
  – make better use of the available data: normalize for speaker (VTLN), acoustics

Page 32:

Issues in Example Based ASR: Search Space Explosion

• any allophone can be represented by any of its examples
• the search space keeps on growing with larger example databases: factor 100, 1,000, 10,000, …
• large amount of redundant information
• hence a large inefficiency
• traditional pruning approaches will not be efficient
• early data driven (bottom-up) pruning is essential [ this was applied in all experiments, but not discussed ]

Page 33:

PART II: ASR Search Techniques

Top-Down and Bottom-Up Combined

Page 34:

Top-Down Search Strategy: Concept

• hypothesize: all possible sentences allowed by the language model
• find: the one that best matches the observed acoustics (spectral-like frame based parameters)

Page 35:

Top-Down Search Strategy: Time-Synchronous Implementation

• initialize: start with the dummy 'start sentence' word
• loop (a schematic sketch follows this list):
  – extend all hypotheses that are at or near word-end positions with all next possible words
    • find the phone/template string equivalents for these extensions
  – fetch a new segment (frame) of data
    • incrementally compute the matching score between all hypotheses active in the search and the observed acoustics
  – order the hypotheses according to score and prune away
    • hypotheses that are 'significantly' worse than the best one
    • hypotheses that fall below the Top-N
• end: accept the Top-1 as the final result (best guess)
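The loop above as a schematic token-passing sketch; the hypothesis objects and the extend/score callables are placeholders, not the actual decoder:

```python
def time_synchronous_beam_search(frames, init_hyps, extend_hypotheses,
                                 frame_score, beam=200.0, top_n=1000):
    """Schematic time-synchronous beam search: lower score = better match."""
    hyps = list(init_hyps)                                     # dummy 'start sentence' hypothesis
    for x in frames:                                           # fetch a new frame of data
        hyps = extend_hypotheses(hyps)                         # expand (near-)word-end hypotheses
        for h in hyps:
            h.score += frame_score(h, x)                       # incremental acoustic match
        best = min(h.score for h in hyps)
        hyps = [h for h in hyps if h.score <= best + beam]     # beam pruning
        hyps = sorted(hyps, key=lambda h: h.score)[:top_n]     # histogram (Top-N) pruning
    return min(hyps, key=lambda h: h.score)                    # Top-1 = final best guess
```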

Page 36:

Top-Down Search Strategy: Why has it been so successful?

• The Language Model constraints are very restrictive
  – therefore it makes sense to apply them first
  – strong (overweighted) language models have been an essential ingredient of many commercial successes in ASR
• The top-down search is very tolerant of errors in one of the weakest links of the beads-on-a-string model:
  – errors in the phonetic dictionary are abundant
    • pronunciation dictionaries don't contain all possible pronunciations
    • people don't talk the way they are supposed to talk
  – but substituted/missing/inserted phone segments are absorbed by forcing 'a few' frames to align with the presumed phone
  – this mismatch cost may not be so big, because
    • HMM scores decay smoothly as points move further from the class centroid
    • HMMs will stretch or compress segments to their own benefit

Page 37:

At the opposite end: The Intuitive Bottom-Up Recognition Paradigm

[Diagram: the layered bottom-up pipeline – speech signal → spectral analysis, feature extraction, speech signal processing, noise suppression, … → phonetic features → phone(tic) recognition → phones → word/sentence recognition → words]

Page 38:

Bottom-Up: Why did it fail?

• Intuitive bottom-up was the paradigm of choice in the early days of speech recognition (1970's)
• The prototypical implementation has been:
  – recognize the next layer on the basis of the layer immediately below
  – acknowledge that recognition will be imperfect, and for this develop statistical pattern matching techniques that allow for insertions/deletions/substitutions
• The biggest failure in this paradigm is to use a single best recognition as the information carrier between two layers
  – errors propagate quickly and proliferate throughout the search network
  – the acoustic-phonetic recognition is not good enough to allow any error correction paradigm to function well

Page 39:

Bottom-Up and Top-Down: Essential Weaknesses

• Bottom-Up
  – difficult to recover from recognition errors in lower layers
  – the correct hypothesis might never get activated
• Top-Down
  – the linguistic universe is limited to a restrictive predefined language model
  – difficult (impossible) to discover new things
  – practically impossible for an LVCSR example based system

Page 40:

What’s in the middle of all this ?

The phoneme concept

Page 41:

Are Phone(me)s Real ?

• speech signal: is 'given', thus unambiguous, but contains massive amounts of non-phonetic information/noise
• phones (allophones, phonemes): a convenient intermediate level both for humans and machines, but ill-defined and highly ambiguous
• words (morphemes): the conceptual level, quite unambiguously recognized on the basis of the acoustics

Page 42:

Recognition Models with Early Abstraction

[Diagram: bottom-up recognition of low-level abstract units yields a phone graph; a top-down search then finds the best possible word sequence (words/morphemes) on the basis of this uncertain phonetic information.]

Page 43:

Recognition Model with Early Abstraction

[Diagram: speech signal → spectral analysis, feature extraction, speech signal processing, noise suppression, … → probabilistic phone recognition → phone graph (phonemes) → top-down search engine (driven by LM + phonetic dict.) with the phone graph as input → words]

Page 44:

Page 45:

It could make sense

• Bottom-up / early abstraction is required for many skills
  – "fast match"
  – new word recognition
  – nonsense word recognition
• Fully top-down was/is an engineering/economic necessity
• Phone recognition is influenced by top-down linguistic processes wrt.
  – recognition speed
  – linguistic overrules

Page 46:

If we can get it to work: Phone Graph Quality

• the phone graph error rate should be low (a few %)
• the phone graph density should be moderate
  – search on the phone graph should not be slower than on the frame data
  – very bad matches should NOT be included
    • as their acoustic scores make little or no sense
    • a more abstract 'substitution/insertion/deletion' score will make more sense
• Error model
  – should serve to overcome genuine phone errors
    • dictionary mistakes
    • gross pronunciation mistakes
  – should be gentle on the search effort

Page 47:

If we can get it to work: Error Model

• should serve to overcome genuine phone errors
  – dictionary mistakes
  – gross pronunciation mistakes
• should be gentle on the search effort
  – generic insertion/deletion/substitution will again make the search explode
  – "single error" model: each error should be embedded between 2 phones found in the graph

|                              | RM    | WSJ 5k | WSJ 20k |
| PhER (1-best)                | 10.1% | 9.42%  | 11.62%  |
| WER (all-in-one)             | 3.08% | 3.96%  | 9.99%   |
| WER (graph – no error model) | 5.00% | 5.53%  | 11.98%  |
| WER (graph – 1-error model)  | 3.14% | 3.89%  | 9.98%   |

Page 48:

New Opportunities

• Assuming a quite dense phone graph with few errors:
  – the total search effort is significantly smaller than in a fully top-down search [ FLAVOR !! ]
  – possibility to model more complex linguistic knowledge sources
  – a way out for controlling the search problem of example based systems !!

Page 49:

Experiments with combined system (constrained lexicon on RM)

Page 50:

Experiments with combined system (constrained lexicon on RM)

|                    | dev. | oct89 | feb91 | sep92 | avg. | impr. |
| HMM baseline       | 2.26 | 2.83  | 2.13  | 5.08  | 3.35 | 0%    |
| Min. graph error   | 0.12 | 0.22  | 0.08  | 0.43  | 0.24 | 93%   |
| DTW bottom-up      | 3.55 | 3.76  | 2.98  | 4.88  | 3.87 | -16%  |
| DTW on graph       | 3.05 | 4.43  | 3.14  | 5.35  | 4.31 | -29%  |
| phone combine      | 1.87 | 2.65  | 1.69  | 3.52  | 2.62 | 22%   |
| phone/bi/tri comb. | 1.68 | 2.24  | 1.65  | 3.32  | 2.40 | 28%   |
| phone no gender    | 2.11 | 2.65  | 2.01  | 3.66  | 2.77 | 17%   |
| phone no concat    | 2.23 | 2.94  | 2.25  | 4.77  | 3.32 | 0%    |
| phone no DTW       | 2.38 | 2.50  | 2.13  | 3.79  | 2.81 | 16%   |

Page 51:

Combined HMM + Exemplar ASR System

Page 52:

Conclusions

• Exemplar based recognition can improve results over HMMs if longer-than-phoneme-length units can be used in a productive way
• Abstractionist bottom-up phonetic recognition can be a very useful component in ASR for
  – fast match, in conjunction with
    • more complex linguistic models
    • a more efficient exemplar system
  – discovery of out-of-vocabulary words

Page 53:

About local distances

• Basic concept: take the shape of the class (phone identity) into account when measuring the distance.

[Figure: a test point x and two class centroids y_A and y_B with their 1-sigma spreads; class B is much tighter (smaller σ) than class A along the direction towards x.]

Euclidean viewpoint:  d(x, y_B) = d(x, y_A)
Mahalanobis viewpoint:  d(x, y_B) = 2 · d(x, y_A)
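In formulas (a sketch added for reference, assuming isotropic class covariances with σ_B = σ_A / 2, which is consistent with the factor 2 above):

```latex
d_M(x, y) \;=\; \sqrt{(x - y)^{\top} \Sigma^{-1} (x - y)}
\qquad\Rightarrow\qquad
d_M(x, y_A) = \frac{\lVert x - y_A \rVert}{\sigma_A},
\quad
d_M(x, y_B) = \frac{\lVert x - y_B \rVert}{\sigma_B}
```

With equal Euclidean distances and σ_B = σ_A / 2, the Mahalanobis distance to y_B is indeed twice the distance to y_A.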

Page 54:

Is this pronunciation variation modeling ?

• It is only part of it; we also assume:
  – phonetic dictionaries containing ALL standard pronunciation variants
  – pronunciation rules for continuous speech
• but acknowledge that:
  – the knowledge sources contain mistakes
  – people don't pronounce the phonemes they are supposed to pronounce according to the dictionary
  – any rule set trying to explain everything gives too much chance to small probability events and makes the search explode (once again)