CMU Robust Vocabulary-Independent Speech Recognition System
Hsiao-Wuen Hon and Kai-Fu Lee
ICASSP 1991
Presenter: Fang-Hui CHU
2005/12/8 NTNU Speech Lab
Reference
Hsiao-Wuen Hon and Kai-Fu Lee, "CMU Robust Vocabulary-Independent Speech Recognition System," ICASSP 1991
Outline
Introduction
Larger Training Database
Between-Word Triphone
Decision Tree Allophone Clustering
Summary of Experiments and Results
Conclusions
Introduction
This paper reports efforts to improve the performance of CMU's robust vocabulary-independent (VI) speech recognition system on the DARPA speaker-independent resource management task
The first improvement involves the incorporation of more dynamic features in the acoustic front-end processing (adding second-order differenced cepstra and power)
The second improvement involves the collection of more general English data, from which more phonetic variabilities can be modeled, such as the word-boundary context
Introduction (cont.)
With more detailed models (such as between-word triphones), coverage on new tasks was reduced
A new decision-tree-based subword clustering algorithm was introduced to find more suitable models for the subword units not covered in the training set
The vocabulary-independent system suffered much more than the vocabulary-dependent system from differences in the recording environments at TI versus CMU
Larger Training Database
The vocabulary-independent results improved dramatically as the amount of vocabulary-independent training increased
Adding 5000 more general English sentences to the vocabulary-independent training set yielded only a small improvement, reducing the error rate from 9.4% to 9.1%
The subword modeling technique may have reached an asymptote, so that additional sentences are not giving much improvement
Between-Word Triphone
Because the subword models are phonetic models, one way to model more acoustic-phonetic detail is to incorporate more context information
Between-word triphones are modeled in the vocabulary-independent system by adding three more contexts:
word-beginning, word-ending, and single-phone word positions
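As a rough sketch of what these extra contexts add, the following Python expands a word's phone string into triphones whose left and right contexts can cross word boundaries, tagging the word-beginning, word-ending, and single-phone positions. The naming convention and the `SIL` boundary marker are illustrative assumptions, not the paper's actual notation.

```python
def triphones(phones, left_ctx="SIL", right_ctx="SIL"):
    """Return (left, phone, right, position) tuples for one word.

    position marks word-beginning (b), word-ending (e), internal (i),
    or single-phone word (s): the extra contexts added for
    between-word modeling.
    """
    n = len(phones)
    out = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else left_ctx
        right = phones[i + 1] if i < n - 1 else right_ctx
        if n == 1:
            pos = "s"            # single-phone word
        elif i == 0:
            pos = "b"            # word beginning
        elif i == n - 1:
            pos = "e"            # word ending
        else:
            pos = "i"            # word internal
        out.append((left, p, right, pos))
    return out

# "the cat": the right context of the last phone of "the" is the first
# phone of "cat", so the models capture cross-word coarticulation.
print(triphones(["DH", "AH"], right_ctx="K"))
```

The point of the position tags is that the same phone sequence gets different models at word boundaries, where coarticulation with the neighboring word matters.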
In the past, it has been argued that between-word triphones might be learning grammatical constraints instead of modeling acoustic-phonetic variations
The results show the contrary, since in vocabulary-independent systems the grammars in training and recognition are completely different
Decision Tree Allophone Clustering
At the root of the decision tree is the set of all triphones corresponding to a phone
Each node has a binary "question" about the triphones' contexts, including left, right, and word-boundary contexts
e.g. "Is the right phoneme a back vowel?"
These questions are created using human speech knowledge and are designed to capture classes of contextual effects
To find the generalized triphone for a triphone, the tree is traversed by answering the questions attached to each node, until a leaf node is reached
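The traversal can be sketched as below; the `Node` structure, the model labels, and the back-vowel set are illustrative assumptions, not the paper's implementation.

```python
# Finding the generalized triphone for a (possibly unseen) triphone by
# walking the decision tree, answering each node's context question
# until a leaf is reached.

BACK_VOWELS = {"AA", "AO", "OW", "UH", "UW"}

class Node:
    def __init__(self, question=None, yes=None, no=None, model=None):
        self.question = question  # predicate on (left, phone, right)
        self.yes, self.no = yes, no
        self.model = model        # generalized-triphone label at a leaf

def generalized_triphone(node, triphone):
    while node.question is not None:
        node = node.yes if node.question(triphone) else node.no
    return node.model

# Tiny example tree for phone /T/ with a single question.
tree = Node(
    question=lambda t: t[2] in BACK_VOWELS,  # "Is the right phoneme a back vowel?"
    yes=Node(model="T_before_back_vowel"),
    no=Node(model="T_other"),
)
print(generalized_triphone(tree, ("S", "T", "AA")))  # right context AA is a back vowel
```

Because every triphone can answer every context question, even a triphone never seen in training always reaches some leaf, which is what makes the scheme vocabulary-independent.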
Decision Tree Allophone Clustering (cont.)
The metric for splitting is an information-theoretic distance measure based on the amount of entropy reduction when splitting a node
Find the question that divides node $m$ into nodes $a$ and $b$ such that

$$P(m)H(m) - P(a)H(a) - P(b)H(b)$$

is maximized, where the entropy of a node $x$ is

$$H(x) = -\sum_{c \in C} p(c \mid x) \log P(c \mid x)$$
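A minimal sketch of this criterion, assuming each node is summarized by counts over some discrete label set $C$; the counts and the relative-frequency probability estimates are illustrative.

```python
import math

def entropy(counts):
    """H(x) = -sum_c p(c|x) log p(c|x), from a node's label counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    h = 0.0
    for n in counts.values():
        if n > 0:
            p = n / total
            h -= p * math.log(p)
    return h

def entropy_reduction(m, a, b, total):
    """P(m)H(m) - P(a)H(a) - P(b)H(b); node probabilities are
    estimated by relative frequency against `total` samples."""
    def mass(counts):
        return sum(counts.values()) / total
    return mass(m) * entropy(m) - mass(a) * entropy(a) - mass(b) * entropy(b)

m = {"c1": 50, "c2": 50}   # mixed parent node
a = {"c1": 50}             # a perfect split ...
b = {"c2": 50}             # ... into two pure children
print(entropy_reduction(m, a, b, total=100))  # log 2: maximal reduction here
```

A question that separates the labels cleanly drives the children's entropies toward zero, so the weighted reduction is largest for the most informative split.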
Decision Tree Allophone Clustering (cont.)
The algorithm to generate a decision tree for a phone is given below:
1. Generate an HMM for every triphone
2. Create a tree with one (root) node, consisting of all triphones
3. Find the best composite question for each node
(a) Generate a tree with simple questions at each node
(b) Cluster leaf nodes into two classes, representing the composite question
4. Split the node with the overall best question
5. Until some convergence criterion is met, go to step 3
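The greedy loop in steps 2-5 can be sketched as follows, simplified to simple (non-composite) questions; the data items, labels, and stopping threshold are illustrative assumptions, not the paper's setup.

```python
import math

def entropy(counts):
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total)
                for n in counts.values() if n)

def gain(items, question, total):
    """Weighted entropy reduction of splitting `items` (a list of
    (triphone, label) pairs) with `question`."""
    def counts(group):
        c = {}
        for _, lab in group:
            c[lab] = c.get(lab, 0) + 1
        return c
    yes = [it for it in items if question(it[0])]
    no = [it for it in items if not question(it[0])]
    def term(group):
        return (len(group) / total) * entropy(counts(group))
    return term(items) - term(yes) - term(no), yes, no

def grow(items, questions, min_gain=1e-6):
    """Repeatedly split the (leaf, question) pair with the best gain
    until no split clears the threshold (the convergence criterion)."""
    total = len(items)
    leaves = [items]
    while True:
        best = None
        for i, leaf in enumerate(leaves):
            for q in questions:
                g, yes, no = gain(leaf, q, total)
                if yes and no and (best is None or g > best[0]):
                    best = (g, i, yes, no)
        if best is None or best[0] < min_gain:
            return leaves
        _, i, yes, no = best
        leaves[i:i + 1] = [yes, no]   # split that leaf in place

# Triphones for /T/, labeled by a (hypothetical) acoustic cluster.
data = [(("S", "T", "AA"), "back"), (("S", "T", "OW"), "back"),
        (("S", "T", "IY"), "front"), (("S", "T", "EH"), "front")]
questions = [lambda t: t[2] in {"AA", "AO", "OW", "UW"}]  # right phone a back vowel?
print(len(grow(data, questions)))
```

On this toy data the single back-vowel question separates the two clusters exactly, so the tree stops after one split with two leaves.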
Decision Tree Allophone Clustering (cont.)
If only simple questions are allowed in the algorithm, the data may be over-fragmented, resulting in similar leaves in different locations of the tree
This problem is dealt with by using composite questions (questions that involve conjunctive and disjunctive combinations of all questions and their negations)
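As an illustration, a composite question can be built as a boolean combination (AND, OR, NOT) of simple context questions, so that one split can gather contexts that simple questions alone would fragment across the tree; all names and phone sets here are hypothetical.

```python
def q_right_back_vowel(t):   # simple questions on (left, phone, right)
    return t[2] in {"AA", "AO", "OW", "UW"}

def q_left_nasal(t):
    return t[0] in {"M", "N", "NG"}

def q_not(q):
    return lambda t: not q(t)

def q_or(*qs):
    return lambda t: any(q(t) for q in qs)

def q_and(*qs):
    return lambda t: all(q(t) for q in qs)

# "right phone is a back vowel OR left phone is a nasal"
composite = q_or(q_right_back_vowel, q_left_nasal)
print(composite(("N", "T", "IY")))  # left nasal, so True
print(composite(("S", "T", "IY")))  # neither condition holds, so False
```

With a disjunction like this, the two context classes land in a single child node instead of two similar leaves in different branches.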
Decision Tree Allophone Clustering (cont.)
The significant improvement here is due to three reasons:
1. improved tree-growing and pruning techniques
2. the models in this study are more detailed and consistent, which makes it easier to find appropriate and meaningful questions
3. triphone coverage is lower in this study, so decision-tree-based clustering is able to find more suitable models
Summary of Experiments and Results
All the experiments are evaluated on the speaker-independent DARPA resource management task
A 991-word continuous speech task and a standard word-pair grammar with perplexity 60 were used throughout
The test set consists of 320 sentences from 32 speakers
For the vocabulary-dependent (VD) system, they used the standard DARPA speaker-independent database, which consisted of 3990 sentences from 109 speakers, to train the system under different configurations
Summary of Experiments and Results (cont.)
The baseline vocabulary-independent (VI) system was trained from a total of 15000 VI sentences; 5000 of these were the TIMIT and Harvard sentences and 10000 were general English sentences recorded at CMU
Conclusions
In this paper, they have presented several techniques that substantially improve the performance of CMU's vocabulary-independent speech recognition system
The subword units can be enhanced by modeling more acoustic-phonetic variations
Future work will refine and constrain the type of questions that can be asked to split the decision tree
There is still a non-negligible degradation for cross-recording conditions