
Arthur Kunkle

ECE 5525

Fall 2008

Introduction and Motivation

A Large Vocabulary Speech Recognition (LVSR) system converts speech data into textual transcriptions.

This system will serve as a test-bed for the development of new speech recognition technologies.

This design presentation assumes basic knowledge of the tasks an LVSR must accomplish, as well as some in-depth knowledge of the HTK framework.

System Technologies

- HMM Toolkit (HTK)
- Cygwin UNIX Emulation Environment
- Practical Extraction and Reporting Language (Perl)
- Subversion Configuration Management Tool

System Requirements

The LVSR shall…

1. Be capable of incorporating prepared data that conforms to a standard HTK interface (defined in “System Design”).

2. Automatically generate language and acoustic models of all available conforming input data.

3. Be configurable to use multiple processors and/or remote computers to share workload for model re-estimation and testing.

4. Have a scheduling mechanism to run different configuration profiles and create different results directories for each, containing the acoustic and language models.

5. Record all HTK tool output for a “run” in time-stamped log files.

6. Merge Language Models together and determine the optimum weighting for each model based on measured model perplexity.

7. Email information regarding run errors and completion status to a list of users.
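As a sketch of how requirements 4, 5, and 7 might fit together, the Perl driver could wrap each profile run as below. The script name train_lvsr.pl, the results layout, and the use of the local mail command are assumptions for illustration, not part of the original design:

#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

# Sketch: run one configuration profile, capture all HTK tool output in a
# time-stamped log file, and mail users the completion status (reqs. 4, 5, 7).
sub run_profile {
    my ($profile, @recipients) = @_;
    my $stamp   = strftime("%Y%m%d_%H%M%S", localtime);
    my $run_dir = "results/${profile}_$stamp";   # one results directory per run
    my $log     = "$run_dir/run.log";
    system("mkdir", "-p", $run_dir) == 0 or die "cannot create $run_dir";

    # Redirect stdout/stderr of the entire run into the time-stamped log.
    my $status = system("perl train_lvsr.pl --profile $profile > $log 2>&1");
    my $result = $status == 0 ? "completed" : "FAILED (see $log)";

    # Notify users through the local mail command (assumes a configured MTA).
    if (open my $mail, '|-', 'mail', '-s', "LVSR run '$profile' $result", @recipients) {
        print $mail "Run '$profile' $result at $stamp\nLog: $log\n";
        close $mail;
    }
    return $status;
}

run_profile('Basic', 'user@example.com');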

System Design

The following directory structure captures each stage of the workflow:

[directory-structure diagram omitted from transcript]

Data Preparation Phase 1

HTK needs the following items, which are custom to each corpus:

- Dictionary (OPTIONAL) – The list of all words found in both the testing and training files of the corpus, with their phonetic pronunciations. Should be named “<corpus_name>_dict.txt”.

- Word List – A list of all unique words found in the transcriptions: “<corpus_name>_word_list.txt”.

- Training Data List – A list of all MFCC data files contributed by the source, using their absolute locations on disk. Rename all utterance files to “<corpus_name>_<speaker>_<num>.mfcc”.

- “Plain” MLF’s – These include only the words of each utterance. Always create this, regardless of timing-info availability.

- “Timed” MLF’s (OPTIONAL) – These include the time boundaries of the appearing words/phones. They must be converted to HTK timing as well (HTK uses time units of 100 ns per unit).

- Audio Data – Convert wav/NIST/sphere formats into MFCC using common parameters. Make sure the maximum file length HTK supports is observed, splitting as necessary.

A custom Perl script is used to handle each source:

# Corpus location on disk
Location: F:/CORPORA/TIMIT
# Sound-splitting threshold (in HTK units)
UtteranceSplit: 300
# Coding parameter config reference
CodingConfigFile: standard_mfcc_cfg.txt
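The referenced coding configuration file is not shown in this presentation; a plausible standard_mfcc_cfg.txt, following the HTK Book’s common MFCC_0_D_A parameterization, might look like this (the specific values are assumptions, not taken from the actual file):

# standard_mfcc_cfg.txt – sketch of typical HTK coding parameters
SOURCEFORMAT = NIST          # or WAV, per corpus
TARGETKIND   = MFCC_0_D_A    # MFCCs + C0 energy, deltas, accelerations
TARGETRATE   = 100000.0      # 10 ms frame shift (in 100 ns HTK units)
WINDOWSIZE   = 250000.0      # 25 ms analysis window
USEHAMMING   = T
PREEMCOEF    = 0.97
NUMCHANS     = 26            # filterbank channels
NUMCEPS      = 12            # cepstral coefficients
CEPLIFTER    = 22

The conversion itself is then a single HCopy call over a script file of (source, target) pairs, e.g. driven from the Perl script (codetr.scp is a hypothetical name):

# Code every (audio, MFCC) pair listed in the script file.
system("HCopy -T 1 -C standard_mfcc_cfg.txt -S codetr.scp") == 0
    or die "HCopy failed";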

Data Preparation Phase 2

Data from all sources must be merged together. Common data such as dictionaries is added here:

- Dictionary – The list of all words found in all contributed files across the corpora, with their phonetic pronunciations.

- Indexed Data Files – All files from individual sources are merged into a common area, and their filenames are transformed to a common naming scheme.

- Word List

- Training Data List

- Testing Data List

- “Plain” MLF’s – These include only the words of each utterance. Always create this, regardless of timing-info availability.

- “Timed” MLF’s (OPTIONAL) – These include the time boundaries of the appearing words/phones. They must be converted to HTK timing as well (HTK uses time units of 100 ns per unit).

- Transcription Files – Transcription files formatted for direct use by the Language Modeling process.

- Grammar File – By default, this step generates an “open” grammar from the word list: any word can legally follow any other word in the final word list. This is used to test acoustic models only.

# Phone-set information
PhoneSet: TIMIT
# Coding parameter config reference
CodingConfigFile: standard_mfcc_cfg.txt
# Parameters to determine percentage of input data that is TRAIN/TEST
# (must add to 100)
TrainDataPercent: 80
TestDataPercent: 20
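One way the TrainDataPercent/TestDataPercent split might be implemented in the merge script, choosing random utterances as raised under “Open Issues/Questions” below (a sketch; the list-file names are hypothetical):

#!/usr/bin/perl
use strict;
use warnings;

# Sketch: split the merged utterance list into train/test lists according
# to the TrainDataPercent profile value (80/20 here).
my $train_pct = 80;                                    # TrainDataPercent
open my $all, '<', 'all_data_list.txt' or die $!;      # hypothetical merged list
chomp(my @utts = <$all>);
close $all;

# Shuffle (Fisher-Yates) so the test set is a random subset.
for (my $i = $#utts; $i > 0; $i--) {
    my $j = int rand($i + 1);
    @utts[$i, $j] = @utts[$j, $i];
}
my $n_train = int(@utts * $train_pct / 100);

open my $tr, '>', 'train_data_list.txt' or die $!;
open my $te, '>', 'test_data_list.txt'  or die $!;
for my $i (0 .. $#utts) {
    my $fh = $i < $n_train ? $tr : $te;
    print $fh "$utts[$i]\n";
}
close $tr;
close $te;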

Acoustic Model Generation

The Acoustic Model generation phase generates multiple versions of HMM definition files that model the input utterances at the phone and tri-phone levels.

1. Prototype HMM is created

2. Create first HMM model for all phones

3. Tie the states of the silence model

4. Re-align the models to use all word pronunciations

5. Create tri-phone HMM models

6. Use decision-tree clustering to tie tri-phone model parameters

7. Split the Gaussian Mixtures used for each state.

# Acoustic Training Configuration Profiles
ProfileName: Basic
# settings for pruning and floor values
VarianceFloor: 0.01
PruningThresholds: 250.0 150.0 1000.0
RealignPruneThreshold: 250.0
# which corpus contains bootstrap data for iteration 1
BootstrapCorpus: TIMIT
# how many calls to HERest to make in between major AM steps
ReestimationCount: 2
# file for tree-based clustering logic
TreeEditFile: basic_tree.hed
# determine target mixtures to apply at end of training
GaussianMixtures: 8
MixtureStepSize: 2
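For reference, the numbered steps above map onto the HTK tools roughly as follows, in the spirit of the HTK Book training recipe. This is a sketch: directory names and edit scripts such as proto, mktri.hed, and mix.hed are hypothetical; only basic_tree.hed and the numeric thresholds come from the profile above.

# Steps 1-2: flat-start prototype (VarianceFloor via -f), then
# ReestimationCount passes of HERest with the profile's pruning thresholds.
system("HCompV -C standard_mfcc_cfg.txt -f 0.01 -m -S train.scp -M hmm0 proto");
for my $i (1 .. 2) {   # ReestimationCount: 2
    my $prev = $i - 1;
    system("HERest -C standard_mfcc_cfg.txt -I phones.mlf -t 250.0 150.0 1000.0 "
         . "-S train.scp -H hmm$prev/macros -H hmm$prev/hmmdefs -M hmm$i monophones");
}
# Step 3: tie the silence-model states (HHEd with a silence edit script, e.g. sil.hed).
# Step 4: re-align with HVite so all dictionary pronunciations are considered
# (-t here is the profile's RealignPruneThreshold).
system("HVite -a -m -t 250.0 -C standard_mfcc_cfg.txt -H hmm2/macros -H hmm2/hmmdefs "
     . "-i aligned.mlf -I words.mlf -S train.scp -y lab dict monophones");
# Steps 5-6: clone monophones into tri-phones, then tie states with the
# profile's tree edit file (TreeEditFile: basic_tree.hed).
system("HHEd -H hmm2/macros -H hmm2/hmmdefs -M hmm3 mktri.hed monophones");
system("HHEd -H hmm3/macros -H hmm3/hmmdefs -M hmm4 basic_tree.hed triphones");
# Step 7: grow mixtures toward GaussianMixtures: 8 in MixtureStepSize: 2 steps,
# via HHEd MU commands, re-estimating with HERest after each split.
system("HHEd -H hmm4/macros -H hmm4/hmmdefs -M hmm5 mix.hed tiedlist");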

Language Model Generation

This phase of development creates an n-gram language model that predicts a symbol in a sequence given its n-1 predecessors.

1. Training text is scanned and n-grams are counted and stored in grammar files

2. Out-of-vocabulary words are mapped to an “Out-of-Vocabulary” class. Other class mappings are applied for class-based Language Models

3. The counts of the resulting grammar files are used to compute n-gram probabilities, which are stored in the language model files.

4. The goodness of the language model is measured by calculating perplexity against testing text from the corpus.

# these settings dictate the Language Model generation process for all sources
MaxNewWords: 100000
NGramBufferSize: 200000
# will generate up to N-gram models
NToGenerate: 4
FoFLevels: 32
# must include N-1 cutoff values
Cutoffs: 1, 2, 3
# how much this LM should contribute to the overall model
OverallContribution: 0.5
# class-model configuration items
ClassAmount: 150
ClusterIterations: 1
ClassContribution: 0.7
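A sketch of how these settings might drive the HTK language-modelling tools, following the HLM tutorial in the HTK Book (the word-map, database-directory, and text-file names are hypothetical):

# Step 1: scan training text, counting n-grams into gram files under db/.
system("LNewMap -f WFC TIMIT empty.wmap");             # fresh word map
system("LGPrep -T 1 -a 100000 -b 200000 -n 4 -d db "   # MaxNewWords, NGramBufferSize,
     . "empty.wmap train_text.txt");                   # NToGenerate
# Step 3: turn counts into n-gram probabilities (Cutoffs: 1, 2, 3).
system("LBuild -T 1 -c 2 1 -c 3 2 -c 4 3 -n 4 db/wmap lm_4gram db/gram.*");
# Step 4: measure goodness as perplexity against held-out test text.
system("LPlex -n 4 -t lm_4gram test_text.txt");

Combining the per-source models with the OverallContribution and ClassContribution weights would then amount to interpolating the resulting models and re-checking perplexity with LPlex; the exact merge tooling is left open here.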

Model Testing

The final phase of the system tests the acoustic and language models generated to this point. The results are cataloged according to the timestamp and the profile name. Two recognition configurations are run:

1. Recognition using acoustic models only and the “open” grammar (i.e. no LM applied).

2. Recognition using both AM and LM.

# standard HMM/LM testing parameters
WordInsertionPenalty: 0.0
GrammarScaleFactor: 5.0
HMMNumbersToTest: 19
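These parameters map directly onto HVite and HResults options (a sketch; the network, dictionary, model-list, and MLF names are hypothetical, and hmm19/ stands in for HMMNumbersToTest: 19):

# Sketch: recognition pass 2 (AM + LM), then scoring.
system("HVite -C standard_mfcc_cfg.txt -H hmm19/macros -H hmm19/hmmdefs "
     . "-S test.scp -i recout.mlf -w wdnet "
     . "-p 0.0 -s 5.0 "      # WordInsertionPenalty, GrammarScaleFactor
     . "dict tiedlist");
# Score word accuracy against the reference transcriptions.
system("HResults -I testref.mlf tiedlist recout.mlf");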

Milestones

The following actions are given in order, with time estimates for each:

1. TIMIT Data Prep: 6 hours
2. AMI Data Prep: 10 hours
3. Phase 2 Data Prep Sub-System: 20 hours
4. Acoustic Model Sub-System: 20 hours
5. Model Testing Sub-System: 12 hours
6. Language Model Sub-System: 15 hours
7. RTE ‘06 Data Prep: 14 hours
8. Scheduling / Reporting: 14 hours
9. Extra Features / Refactoring: 16 hours
10. Profile Authoring: 4 hours

Total Effort Estimate: 131 hours

Open Issues/Questions

Can Acoustic and Language Model generation be run in parallel after a common data preparation workflow?

Right now, all data input into the LVSR is tagged as training data. What is the best way to choose a subset of data for Testing only? Have a percentage configuration value and pick random utterances? Have a configurable list of specific utterances set aside? If a source (corpus) specifies a testing set, should we use this by default?

Which workflow makes more sense for multiple-source LM generation?

- Generate a source-specific word-level LM, generate a source-specific class-level LM, and interpolate them together; then combine with the other source-specific LM’s.

- Use all training text to create a single word-level LM, generate a class-level LM, then combine into the final LM.

The proposed architecture is static, requiring the process to be restarted when new data is introduced. What requirements exist for dynamically adding new data to existing models?