The 2000 NRL Evaluation for Recognition of Speech in Noisy Environments

MITRE / MS State - ISIP

Burhan Necioglu

Bryan George

George Shuttic

The MITRE Corporation

Ramasubramanian Sundaram

Joe Picone

Mississippi State U.

Inst. for Signal & Information Processing

INTRODUCTION

Collaboration between The MITRE Corporation and the Mississippi State Institute for Signal and Information Processing (ISIP)
– Primary goal: evaluate the impact of noise pre-processing developed for other DoD applications

MITRE:
– Focus on robust speech recognition using noise reduction techniques, including effects of tactical communications links
– Distributed information access systems for military applications (DARPA Communicator)

Mississippi State:
– Focus on stable, practical, advanced LVCSR technology
– Open source large vocabulary speech recognition tools
– Training, education and dissemination of information related to all aspects of speech research

The ISIP-STT system utilized a combination of technologies from both organizations

OVERVIEW OF THE SYSTEM

Standard MFCC front-end with side-based CMS

Acoustic modeling:
– Left-right model topology
– Skip states for special models like silence
– Continuous density mixture Gaussian HMMs
– Both Baum-Welch and Viterbi training supported
– Phonetic decision tree-based state-tying

Hierarchical search Viterbi decoder
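As a rough illustration of Viterbi search over a left-right topology, the sketch below finds the best state path through a single small HMM. It is not the ISIP-STT decoder: the function name, the log-probability inputs, and the omission of skip transitions and of any hierarchy (word and phone levels) are all simplifying assumptions.

```python
import math

NEG_INF = float("-inf")

def viterbi_left_right(log_emit, log_self, log_next):
    """Best path through a left-right HMM (hypothetical helper).

    log_emit[t][s]: log-likelihood of frame t under state s.
    log_self[s]:    log-probability of staying in state s.
    log_next[s]:    log-probability of moving from state s to s + 1.
    The path must start in state 0 and end in the last state.
    """
    n_frames, n_states = len(log_emit), len(log_emit[0])
    score = [[NEG_INF] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_emit[0][0]
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s] + log_self[s]
            move = score[t - 1][s - 1] + log_next[s - 1] if s > 0 else NEG_INF
            if move > stay:
                best, prev = move, s - 1
            else:
                best, prev = stay, s
            score[t][s] = best + log_emit[t][s]
            back[t][s] = prev
    # Trace back from the final state at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][-1], path
```

A skip state, as used for models like silence, would add a third candidate predecessor (s - 2) in the inner loop.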

STATE-TYING: MOTIVATION

Context-dependent models for better performance

Increased parameter count

Need to reduce computations without degrading performance

FEATURES AND PERFORMANCE

Batch processing

Real-time performance of the training process during various stages:

DECODER: OVERVIEW

Algorithmic features:
– Single-pass decoding
– Hierarchical Viterbi search
– Dynamic network expansion

Functional features:
– Cross-word context-dependent acoustic models
– Word graph rescoring, forced alignments, N-gram decoding

Structural features:
– Word graph compaction
– Multiple pronunciations
– Memory management

EVALUATION SYSTEM - NOISE PREPROCESSING

Using Harsh Environment Noise Pre-Processor (HENPP) front-end to remove noise from input speech

HENPP developed by AT&T to address background noise effects in DoD speech coding environments (see Accardi and Cox, Malah et al., ICASSP 1999)

Multiplicative spectral processing - minimal distortion, eliminates “doodley-doos” (aka “musical noise”)

“Minimum statistics” noise adaptation - handles quasi-stationary additive noise (random and stochastic) without assumptions

Limitations:
– Not designed to address transient noise
– Noise adaptation sensitive to “push-to-talk” effects
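To make the two ideas above concrete, here is a heavily simplified sketch of minimum-statistics noise tracking followed by a multiplicative spectral gain. It is not the HENPP algorithm: the function names, the smoothing constant, the window length, the gain rule, and the gain floor are all assumptions chosen for illustration, and published minimum-statistics methods add bias compensation that is left out here.

```python
from collections import deque

def track_noise_floor(power_frames, window=50, smooth=0.9):
    """Per-bin noise estimate as the minimum of the recursively smoothed
    power over the last `window` frames (hypothetical helper).

    power_frames: list of frames, each a list of power-spectrum values.
    """
    smoothed = None
    history = deque(maxlen=window)  # oldest frame drops out automatically
    estimates = []
    for frame in power_frames:
        if smoothed is None:
            smoothed = list(frame)
        else:
            smoothed = [smooth * p + (1.0 - smooth) * x
                        for p, x in zip(smoothed, frame)]
        history.append(smoothed)
        # Minimum over time, taken separately for each frequency bin.
        estimates.append([min(column) for column in zip(*history)])
    return estimates

def spectral_gain(power, noise, floor=0.1):
    """Multiplicative gain per bin, limited below by `floor` (assumed rule)."""
    return [max(1.0 - n / p, floor) if p > 0 else floor
            for p, n in zip(power, noise)]
```

Each frame's spectrum would then be multiplied bin by bin by the gain before resynthesis or feature extraction. Because the estimate follows the minimum, no explicit speech/noise decision is needed.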

Integrated 2.4 kbps MELP/HENPP demonstrated successfully in low- to moderate-perplexity ASR:

[Figure: LPC-10, MELP, MELP/HENPP]

EVALUATION SYSTEM - DATA AND TRAINING

10 hours of SPINE data used for training - no DRT words

100 frames per second, 25 ms Hamming window

12 base FFT-derived mel cepstra with side-based CMS and log-energy

Delta and acceleration coefficients

44 phone set to cover SPINE data

909 models, 2725 states
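The feature recipe above can be sketched as follows, starting from static vectors that already hold 12 cepstra plus log-energy (13 values per 10 ms frame). The regression formula and its width of 2 frames are assumptions, since the slides do not say how the delta and acceleration coefficients were computed. Whether log-energy was also mean-normalized is not stated either; this sketch normalizes every static dimension.

```python
def cepstral_mean_subtract(frames):
    """Subtract the per-dimension mean over one conversation side."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]

def deltas(frames, width=2):
    """Regression-based time derivative; edge frames are repeated."""
    n, dim = len(frames), len(frames[0])
    denom = 2 * sum(k * k for k in range(1, width + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            num = 0.0
            for k in range(1, width + 1):
                ahead = frames[min(t + k, n - 1)][d]
                behind = frames[max(t - k, 0)][d]
                num += k * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out

def add_dynamic_features(static):
    """Append delta and acceleration coefficients to each static vector."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]
```

With 13 static values per frame, `add_dynamic_features(cepstral_mean_subtract(static))` yields 39-dimensional vectors.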

EVALUATION SYSTEM - LM and LEXICON

5226 words in the SPINE lexicon, provided by CMU

CMU language model

Bigrams obtained by throwing away the trigrams

LM size: 5226 unigrams, 12511 bigrams
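One way to "throw away the trigrams" is sketched below, under the assumption that the language model is stored in the common ARPA text format (the slides do not name the format). The function name is hypothetical. It drops the 3-gram count line and section, and strips the back-off weights from the bigrams, which become the highest order.

```python
def drop_trigrams(arpa_text):
    """Return an ARPA-format LM truncated to unigrams and bigrams."""
    out = []
    section = 0  # 0 = header or footer, otherwise the n-gram order being read
    for line in arpa_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("\\") and stripped.endswith("-grams:"):
            section = int(stripped[1:stripped.index("-")])
            if section <= 2:
                out.append(line)
            continue
        if stripped == "\\end\\":
            section = 0
            out.append(line)
            continue
        if section == 0 and stripped.startswith("ngram "):
            order = int(stripped.split()[1].split("=")[0])
            if order <= 2:
                out.append(line)
            continue
        if section >= 3:
            continue  # discard every trigram (and higher) entry
        if section == 2 and stripped:
            fields = stripped.split()
            # Keep log-probability and the two words; drop the back-off weight.
            out.append(fields[0] + "\t" + " ".join(fields[1:3]))
            continue
        out.append(line)
    return "\n".join(out) + "\n"
```

The bigram probabilities are kept as they appear in the trigram model, so the result may differ from a bigram model estimated directly from the training text.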

EVALUATION SYSTEM - DECODING

Single stage decoding using word-internal acoustic models and bigram LM

RESULTS AND ANALYSIS

Lattice generation/lattice rescoring will improve results.

Informal analysis of evaluation data and results:
– Negative correlation between recognition performance and SNR

Experiment                                       WER (%)   Subs (%)   Dels (%)   Ins (%)
Baseline ISIP-STT                                56.2      26.0       21.1       9.0
Noise pre-processed training & evaluation data   58.4      27.1       24.9       6.5

RESULTS AND ANALYSIS (cont.)

Clean speech: “B” side of spine_eval_033 (281 total words)

Experiment                                       Correct   Subs   Dels   Ins   Tot err
Baseline ISIP-STT                                221       36     24     4     64
Noise pre-processed training & evaluation data   198       37     46     6     89

Low SNR example: “A” side of spine_eval_021 (115 total words)

Experiment                                       Correct   Subs   Dels   Ins   Tot err
Baseline ISIP-STT                                72        25     18     4     47
Noise pre-processed training & evaluation data   80        18     17     3     38
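The per-conversation counts above convert to word error rates with the usual formula, WER = (Subs + Dels + Ins) / total words. The short sketch below does that arithmetic; the function and dictionary names are only illustrative.

```python
def word_error_rate(subs, dels, ins, total_words):
    """Word error rate in percent."""
    return 100.0 * (subs + dels + ins) / total_words

# (subs, dels, ins, total reference words), copied from the two tables.
counts = {
    "clean, baseline": (36, 24, 4, 281),
    "clean, pre-processed": (37, 46, 6, 281),
    "low SNR, baseline": (25, 18, 4, 115),
    "low SNR, pre-processed": (18, 17, 3, 115),
}

rates = {name: word_error_rate(*c) for name, c in counts.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f}% WER")
```

On these two sides, pre-processing raises the error rate on the clean conversation (about 22.8% to 31.7%) and lowers it on the low-SNR conversation (about 40.9% to 33.0%).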

RESULTS AND ANALYSIS (cont.)

HENPP designed for human listening purposes
– Optimized to raise DRT scores in presence of noise and coding
– DRT scores, WER tend to be poorly correlated; minor perceptual distortions often have magnified adverse effect on speech recognizers

Need to retune the HENPP
– Algorithm is very effective for robust recognition of noisy speech at low SNRs
– Too aggressive when applied to clean speech - some information is lost
– Minor adjustments will preserve noisy speech performance and boost clean speech performance

ISSUES

Decoding slow on this task
– 100x real-time (on 600 MHz Pentium)
– Newer version of ISIP-STT decoder will be faster
– Had to use bigram LM in the allowed time frame

Large amount of eval data
– With slow decoding, seriously limited experiments

The devil is in the details:
– Certain training data problematic: “Noise field is <long silence> up”
– Automatic segmentation (having eval segmentations would help)

CONCLUSIONS

MITRE / MS State-ISIP system; standard recognition approach using advanced noise preprocessing front end

Time limitation: could only officially report on the baseline system

Performed initial experiment with noise-preprocessing (AT&T HENPP)
– Overall word error rate did not improve
– Informal analysis suggests that for low SNR conversations, noise pre-processing does help.
– Difficulty with high SNR conversations

There is potential for improvement with application specific tuning of HENPP.

Approach is very promising for coded speech in commercial and military environments
