CMU Robust Vocabulary-Independent Speech Recognition System
Hsiao-Wuen Hon and Kai-Fu Lee
ICASSP 1991
Presenter: Fang-Hui CHU
2005/12/8 NTNU Speech Lab
Reference
Hsiao-Wuen Hon and Kai-Fu Lee, "CMU Robust Vocabulary-Independent Speech Recognition System," ICASSP 1991
Outline
Introduction
Larger Training Database
Between-Word Triphone
Decision Tree Allophone Clustering
Summary of Experiments and Results
Conclusions
Introduction
This paper reports efforts to improve the performance of CMU's robust vocabulary-independent (VI) speech recognition system on the DARPA speaker-independent resource management task
The first improvement involves the incorporation of more dynamic features in the acoustic front-end processing (adding second-order differenced cepstra and power)
The second improvement involves the collection of more general English data, from which more phonetic variabilities can be modeled, such as the word-boundary context
Introduction (cont.)
With more detailed models (such as between-word triphones), coverage on new tasks was reduced
A new decision-tree-based subword clustering algorithm was introduced to find more suitable models for the subword units not covered in the training set
The vocabulary-independent system suffered much more than the vocabulary-dependent system from differences in the recording environments at TI versus CMU
Larger Training Database
The vocabulary-independent results improved dramatically as the amount of vocabulary-independent training increased
Adding 5000 more general English sentences to the vocabulary-independent training set yielded only a small improvement, reducing the error rate from 9.4% to 9.1%
The subword modeling technique may have reached an asymptote, so that additional sentences are not giving much improvement
Between-Word Triphone
Because the subword models are phonetic models, one way to model more acoustic-phonetic detail is to incorporate more context information
Between-word triphones are modeled in the vocabulary-independent system by adding three more contexts:
word-beginning, word-ending, and single-phone word positions
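As a rough sketch of what these extra contexts add, the following Python expands a word's phone string into triphones whose left and right contexts can cross word boundaries, tagging the word-beginning, word-ending, and single-phone positions. The naming convention and the `SIL` boundary marker are illustrative assumptions, not the paper's actual notation.

```python
def triphones(phones, left_ctx="SIL", right_ctx="SIL"):
    """Return (left, phone, right, position) tuples for one word.

    position marks word-beginning (b), word-ending (e), internal (i),
    or single-phone word (s): the extra contexts added for
    between-word modeling.
    """
    n = len(phones)
    out = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else left_ctx
        right = phones[i + 1] if i < n - 1 else right_ctx
        if n == 1:
            pos = "s"            # single-phone word
        elif i == 0:
            pos = "b"            # word beginning
        elif i == n - 1:
            pos = "e"            # word ending
        else:
            pos = "i"            # word internal
        out.append((left, p, right, pos))
    return out

# "the cat": the right context of the last phone of "the" is the first
# phone of "cat", so the models capture cross-word coarticulation.
print(triphones(["DH", "AH"], right_ctx="K"))
```

The point of the position tags is that the same phone sequence gets different models at word boundaries, where coarticulation with the neighboring word matters.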
In the past, it has been argued that between-word triphones might be learning grammatical constraints instead of modeling acoustic-phonetic variations
The results show the contrary, since in vocabulary-independent systems the grammars in training and recognition are completely different
Decision Tree Allophone Clustering
At the root of the decision tree is the set of all triphones corresponding to a phone
Each node has a binary "question" about the triphones' contexts, including left, right, and word-boundary contexts
e.g. "Is the right phoneme a back vowel?"
These questions are created using human speech knowledge and are designed to capture classes of contextual effects
To find the generalized triphone for a triphone, the tree is traversed by answering the questions attached to each node, until a leaf node is reached
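The traversal can be sketched as below; the `Node` structure, the model labels, and the back-vowel set are illustrative assumptions, not the paper's implementation.

```python
# Finding the generalized triphone for a (possibly unseen) triphone by
# walking the decision tree, answering each node's context question
# until a leaf is reached.

BACK_VOWELS = {"AA", "AO", "OW", "UH", "UW"}

class Node:
    def __init__(self, question=None, yes=None, no=None, model=None):
        self.question = question  # predicate on (left, phone, right)
        self.yes, self.no = yes, no
        self.model = model        # generalized-triphone label at a leaf

def generalized_triphone(node, triphone):
    while node.question is not None:
        node = node.yes if node.question(triphone) else node.no
    return node.model

# Tiny example tree for phone /T/ with a single question.
tree = Node(
    question=lambda t: t[2] in BACK_VOWELS,  # "Is the right phoneme a back vowel?"
    yes=Node(model="T_before_back_vowel"),
    no=Node(model="T_other"),
)
print(generalized_triphone(tree, ("S", "T", "AA")))  # right context AA is a back vowel
```

Because every triphone can answer every context question, even a triphone never seen in training always reaches some leaf, which is what makes the scheme vocabulary-independent.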
Decision Tree Allophone Clustering (cont.)
The metric for splitting is an information-theoretic distance measure based on the amount of entropy reduction when splitting a node
Find the question that divides node $m$ into nodes $a$ and $b$ such that

$$P(m)H(m) - P(a)H(a) - P(b)H(b)$$

is maximized, where the entropy of a node $x$ is

$$H(x) = -\sum_{c \in C} p(c \mid x) \log P(c \mid x)$$
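A minimal sketch of this criterion, assuming each node is summarized by counts over some discrete label set $C$; the counts and the relative-frequency probability estimates are illustrative.

```python
import math

def entropy(counts):
    """H(x) = -sum_c p(c|x) log p(c|x), from a node's label counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    h = 0.0
    for n in counts.values():
        if n > 0:
            p = n / total
            h -= p * math.log(p)
    return h

def entropy_reduction(m, a, b, total):
    """P(m)H(m) - P(a)H(a) - P(b)H(b); node probabilities are
    estimated by relative frequency against `total` samples."""
    def mass(counts):
        return sum(counts.values()) / total
    return mass(m) * entropy(m) - mass(a) * entropy(a) - mass(b) * entropy(b)

m = {"c1": 50, "c2": 50}   # mixed parent node
a = {"c1": 50}             # a perfect split ...
b = {"c2": 50}             # ... into two pure children
print(entropy_reduction(m, a, b, total=100))  # log 2: maximal reduction here
```

A question that separates the labels cleanly drives the children's entropies toward zero, so the weighted reduction is largest for the most informative split.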
Decision Tree Allophone Clustering (cont.)
The algorithm to generate a decision tree for a phone is given below:
1. Generate an HMM for every triphone
2. Create a tree with one (root) node, consisting of all triphones
3. Find the best composite question for each node
(a) Generate a tree with simple questions at each node
(b) Cluster leaf nodes into two classes, representing the composite question
4. Split the node with the overall best question
5. Until some convergence criterion is met, go to step 3
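The greedy loop in steps 2-5 can be sketched as follows, simplified to simple (non-composite) questions; the data items, labels, and stopping threshold are illustrative assumptions, not the paper's setup.

```python
import math

def entropy(counts):
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total)
                for n in counts.values() if n)

def gain(items, question, total):
    """Weighted entropy reduction of splitting `items` (a list of
    (triphone, label) pairs) with `question`."""
    def counts(group):
        c = {}
        for _, lab in group:
            c[lab] = c.get(lab, 0) + 1
        return c
    yes = [it for it in items if question(it[0])]
    no = [it for it in items if not question(it[0])]
    def term(group):
        return (len(group) / total) * entropy(counts(group))
    return term(items) - term(yes) - term(no), yes, no

def grow(items, questions, min_gain=1e-6):
    """Repeatedly split the (leaf, question) pair with the best gain
    until no split clears the threshold (the convergence criterion)."""
    total = len(items)
    leaves = [items]
    while True:
        best = None
        for i, leaf in enumerate(leaves):
            for q in questions:
                g, yes, no = gain(leaf, q, total)
                if yes and no and (best is None or g > best[0]):
                    best = (g, i, yes, no)
        if best is None or best[0] < min_gain:
            return leaves
        _, i, yes, no = best
        leaves[i:i + 1] = [yes, no]   # split that leaf in place

# Triphones for /T/, labeled by a (hypothetical) acoustic cluster.
data = [(("S", "T", "AA"), "back"), (("S", "T", "OW"), "back"),
        (("S", "T", "IY"), "front"), (("S", "T", "EH"), "front")]
questions = [lambda t: t[2] in {"AA", "AO", "OW", "UW"}]  # right phone a back vowel?
print(len(grow(data, questions)))
```

On this toy data the single back-vowel question separates the two clusters exactly, so the tree stops after one split with two leaves.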
Decision Tree Allophone Clustering (cont.)
If only simple questions are allowed in the algorithm, the data may be over-fragmented, resulting in similar leaves in different locations of the tree
This problem is dealt with by using composite questions (questions that involve conjunctive and disjunctive combinations of all questions and their negations)
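As an illustration, a composite question can be built as a boolean combination (AND, OR, NOT) of simple context questions, so that one split can gather contexts that simple questions alone would fragment across the tree; all names and phone sets here are hypothetical.

```python
def q_right_back_vowel(t):   # simple questions on (left, phone, right)
    return t[2] in {"AA", "AO", "OW", "UW"}

def q_left_nasal(t):
    return t[0] in {"M", "N", "NG"}

def q_not(q):
    return lambda t: not q(t)

def q_or(*qs):
    return lambda t: any(q(t) for q in qs)

def q_and(*qs):
    return lambda t: all(q(t) for q in qs)

# "right phone is a back vowel OR left phone is a nasal"
composite = q_or(q_right_back_vowel, q_left_nasal)
print(composite(("N", "T", "IY")))  # left nasal, so True
print(composite(("S", "T", "IY")))  # neither condition holds, so False
```

With a disjunction like this, the two context classes land in a single child node instead of two similar leaves in different branches.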
Decision Tree Allophone Clustering (cont.)
The significant improvement here is due to three reasons:
1. improved tree-growing and pruning techniques
2. the models in this study are more detailed and consistent, which makes it easier to find appropriate and meaningful questions
3. triphone coverage is lower in this study, so decision-tree-based clustering is able to find more suitable models
Summary of Experiments and Results
All the experiments are evaluated on the speaker-independent DARPA resource management task
A 991-word continuous speech task and a standard word-pair grammar with perplexity 60 were used throughout
The test set consists of 320 sentences from 32 speakers
For the vocabulary-dependent (VD) system, they used the standard DARPA speaker-independent database, which consisted of 3990 sentences from 109 speakers, to train the system under different configurations
Summary of Experiments and Results (cont.)
The baseline vocabulary-independent (VI) system was trained from a total of 15000 VI sentences; 5000 of these were the TIMIT and Harvard sentences and 10000 were general English sentences recorded at CMU
Conclusions
In this paper, they have presented several techniques that substantially improve the performance of CMU's vocabulary-independent speech recognition system
The subword units can be enhanced by modeling more acoustic-phonetic variations
Future work will refine and constrain the type of questions that can be asked to split the decision tree
There is still a non-negligible degradation for cross-recording conditions