The 2000 NRL Evaluation for Recognition of Speech in Noisy Environments

MITRE / MS State - ISIP

Burhan Necioglu

Bryan George

George Shuttic

The MITRE Corporation

Ramasubramanian Sundaram

Joe Picone

Mississippi State U.

Inst. for Signal & Information Processing

INTRODUCTION

Collaboration between The MITRE Corporation and Mississippi State Institute for Signal and Information Processing (ISIP)
– Primary goal: Evaluate the impact of noise pre-processing developed for other DoD applications

MITRE:
– Focus on robust speech recognition using noise reduction techniques, including effects of tactical communications links
– Distributed information access systems for military applications (DARPA Communicator)

Mississippi State:
– Focus on stable, practical, advanced LVCSR technology
– Open source large vocabulary speech recognition tools
– Training, education and dissemination of information related to all aspects of speech research

ISIP-STT System utilized combination of technologies from both organizations

OVERVIEW OF THE SYSTEM

Standard MFCC front-end with side-based CMS

Acoustic modeling:
– Left-right model topology
– Skip states for special models like silence
– Continuous density mixture Gaussian HMMs
– Both Baum-Welch and Viterbi training supported
– Phonetic decision tree-based state-tying

Hierarchical search Viterbi decoder
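
The left-right topology with a skip transition and the Viterbi search listed above can be illustrated with a toy example. This is only a minimal sketch, not the ISIP-STT decoder: the three-state model, the transition and emission probabilities, and the names `to_log` and `viterbi` are all assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")


def to_log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF


def viterbi(log_trans, log_emit):
    """Return (best log score, best state path) for one HMM.

    log_trans[i][j]: log probability of moving from state i to state j.
    log_emit[t][j]:  log likelihood of frame t under state j.
    The path is forced to start in state 0 and end in the last state.
    """
    n_frames, n_states = len(log_emit), len(log_trans)
    score = [NEG_INF] * n_states
    score[0] = log_emit[0][0]
    back = []  # back[t - 1][j] = best predecessor of state j at frame t
    for t in range(1, n_frames):
        new_score, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states),
                         key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j]
                             + log_emit[t][j])
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # Trace back from the final state to recover the state sequence.
    path = [n_states - 1]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[-1], path


# Hypothetical left-right model: zeros below the diagonal forbid going
# backwards; the 0 -> 2 entry is a skip transition over state 1.
trans = [[0.5, 0.3, 0.2],
         [0.0, 0.6, 0.4],
         [0.0, 0.0, 1.0]]
# Hypothetical per-frame state likelihoods for four frames.
emit = [[0.9, 0.05, 0.05],
        [0.8, 0.1, 0.1],
        [0.1, 0.1, 0.8],
        [0.1, 0.1, 0.8]]

log_trans = [[to_log(p) for p in row] for row in trans]
log_emit = [[to_log(p) for p in row] for row in emit]
best_score, best_path = viterbi(log_trans, log_emit)
print(best_path, math.exp(best_score))
```

With these toy numbers the best path uses the skip transition and never visits the middle state, which is the behavior a skip arc is meant to allow for short events such as brief silences.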

STATE-TYING: MOTIVATION

Context-dependent models for better performance
Increased parameter count
Need to reduce computations without degrading performance

FEATURES AND PERFORMANCE

Batch processing
Real-time performance of the training process during various stages:

DECODER: OVERVIEW

Algorithmic features:
– Single-pass decoding
– Hierarchical Viterbi search
– Dynamic network expansion

Functional features:
– Cross-word context-dependent acoustic models
– Word graph rescoring, forced alignments, N-gram decoding

Structural features:
– Word graph compaction
– Multiple pronunciations
– Memory management

EVALUATION SYSTEM - NOISE PREPROCESSING

Using Harsh Environment Noise Pre-Processor (HENPP) front-end to remove noise from input speech

HENPP developed by AT&T to address background noise effects in DoD speech coding environments (see Accardi and Cox, Malah et al., ICASSP 1999)

Multiplicative spectral processing - minimal distortion, eliminates “doodley-doos” (aka “musical noise”)

“Minimum statistics” noise adaptation - handles quasi-stationary additive noise (random and stochastic) without assumptions

Limitations:
– Not designed to address transient noise
– Noise adaptation sensitive to “push-to-talk” effects

Integrated 2.4 kbps MELP/HENPP demonstrated successfully in low- to moderate-perplexity ASR:

[Figure not reproduced; labels: LPC-10, MELP, MELP/HENPP]
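
The two ideas named on this slide, a multiplicative spectral gain and a "minimum statistics" noise estimate, can be sketched for a single frequency bin. This is not the HENPP algorithm: it is a heavily simplified illustration that omits the bias compensation and gain smoothing a real system would need, and the function name `suppress_bin` and the parameters `window`, `smooth`, and `floor` are assumptions.

```python
from collections import deque


def suppress_bin(power, window=5, smooth=0.8, floor=0.1):
    """Return one multiplicative gain per frame for a single frequency bin.

    power:  noisy power values for this bin, one per frame.
    window: number of recent smoothed values searched for the minimum.
    smooth: recursive smoothing constant for the power track.
    floor:  lowest gain allowed, which limits how hard a bin is attenuated.
    """
    gains = []
    recent = deque(maxlen=window)  # sliding window of smoothed power
    smoothed = power[0]
    for p in power:
        smoothed = smooth * smoothed + (1.0 - smooth) * p
        recent.append(smoothed)
        # Minimum-statistics idea: the smallest recent smoothed power is
        # taken as the noise level, with no speech/non-speech decision.
        noise = min(recent)
        gain = max(floor, 1.0 - noise / p) if p > 0.0 else floor
        gains.append(gain)
    return gains


# Toy bin: steady noise at power 1.0 with a two-frame burst at power 10.0.
toy_power = [1.0] * 6 + [10.0] * 2 + [1.0] * 2
toy_gains = suppress_bin(toy_power)
for p, g in zip(toy_power, toy_gains):
    print(f"power={p:5.1f}  gain={g:.2f}")
```

In this toy run the noise-only frames are pushed down to the gain floor while the burst is passed nearly unchanged. The same mechanism hints at the limitations listed above: the estimate only follows noise that changes slowly relative to the window, so transients and abrupt channel changes are not tracked.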

EVALUATION SYSTEM - DATA AND TRAINING

10 hours of SPINE data used for training - no DRT words
100 frames per second, 25 msec Hamming window
12 base FFT-derived mel cepstra with side-based CMS and log-energy
Delta and acceleration coefficients
44 phone set to cover SPINE data
909 models, 2725 states
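
The feature pipeline above can be sketched in a few lines. At 100 frames per second the frame shift is 10 msec, so consecutive 25 msec windows overlap by 15 msec. The sketch below assumes that CMS is applied to the 12 cepstra only, that log-energy is appended afterwards, and that delta and acceleration terms are computed for all 13 static values, which would give 39 values per frame; the slide does not state these details. The delta here is a plain two-point difference rather than whatever regression window the evaluation system used, and all function names are illustrative.

```python
def cepstral_mean_subtract(frames):
    """Subtract the per-coefficient mean taken over one conversation side."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]


def deltas(frames):
    """Two-point difference (x[t+1] - x[t-1]) / 2 with edge replication."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def build_features(static):
    """Append delta and acceleration coefficients to each static vector."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Toy data standing in for one conversation side: 4 frames of 12 cepstra
# plus one log-energy value per frame (values are made up).
cepstra = [[float(t + d) for d in range(12)] for t in range(4)]
log_energy = [[-1.0], [-0.5], [-0.2], [-0.8]]

normalized = cepstral_mean_subtract(cepstra)
static = [c + e for c, e in zip(normalized, log_energy)]
features = build_features(static)
print(len(features), len(features[0]))
```

Computing the mean over a whole conversation side, rather than per utterance, gives a more stable estimate of the channel offset when individual utterances are short.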

EVALUATION SYSTEM - LM and LEXICON

5226 words in the SPINE lexicon, provided by CMU
CMU language model
Bigrams obtained by throwing away the trigrams
LM size: 5226 unigrams, 12511 bigrams
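
Deriving a bigram model by discarding trigram entries can be illustrated on a toy backoff LM. The sketch assumes an ARPA-style layout in which each n-gram carries a log probability and a backoff weight; the dictionary representation, the function names, and the words and scores are all made up and are not taken from the CMU model.

```python
def drop_trigrams(ngrams):
    """Keep unigram and bigram entries; discard every higher-order entry."""
    return {words: scores for words, scores in ngrams.items()
            if len(words) <= 2}


def bigram_logprob(lm, prev, word):
    """Score `word` after `prev`, backing off to the unigram when needed."""
    if (prev, word) in lm:
        return lm[(prev, word)][0]
    backoff = lm.get((prev,), (0.0, 0.0))[1]
    return backoff + lm[(word,)][0]


# Toy LM: n-gram tuple -> (log10 probability, log10 backoff weight).
toy_lm = {
    ("alpha",): (-1.0, -0.3),
    ("bravo",): (-1.2, -0.2),
    ("alpha", "bravo"): (-0.4, -0.1),
    ("alpha", "bravo", "alpha"): (-0.2, 0.0),
}

bigram_lm = drop_trigrams(toy_lm)
print(len(bigram_lm))
print(bigram_logprob(bigram_lm, "alpha", "bravo"))
print(bigram_logprob(bigram_lm, "bravo", "alpha"))
```

Once the trigrams are gone, the backoff weights stored on the bigram entries are never consulted, so the truncated model behaves as an ordinary backoff bigram.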

EVALUATION SYSTEM - DECODING

Single stage decoding using word-internal acoustic models and bigram LM

RESULTS AND ANALYSIS

Lattice generation/lattice rescoring will improve results.

Informal analysis of evaluation data and results:
– Negative correlation between recognition performance and SNR

Experiment                                        WER (%)   Subs (%)   Dels (%)   Ins (%)
Baseline ISIP-STT                                 56.2      26.0       21.1       9.0
Noise pre-processed training & evaluation data    58.4      27.1       24.9       6.5
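
Word error rate is the sum of the substitution, deletion, and insertion rates, so the table can be checked against itself. The short sketch below only re-adds the published figures; the row labels are abbreviated and the helper name `component_sum` is illustrative.

```python
# Rows copied from the results table: (WER, Subs, Dels, Ins), all in percent.
rows = {
    "Baseline ISIP-STT": (56.2, 26.0, 21.1, 9.0),
    "Noise pre-processed": (58.4, 27.1, 24.9, 6.5),
}


def component_sum(subs, dels, ins):
    """WER reconstructed from its three error components."""
    return subs + dels + ins


for name, (wer, subs, dels, ins) in rows.items():
    total = component_sum(subs, dels, ins)
    print(f"{name}: reported {wer:.1f}, components sum to {total:.1f}")
```

Each reconstructed total lands within 0.1 of the reported WER, which is consistent with the individual columns having been rounded separately. The breakdown also shows where the change comes from: with pre-processing, insertions fall while deletions rise.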

RESULTS AND ANALYSIS (cont.)

Clean speech: “B” side of spine_eval_033 (281 total words)

Experiment                                        Correct   Subs   Dels   Ins   Tot err
Baseline ISIP-STT                                 221       36     24     4     64
Noise pre-processed training & evaluation data    198       37     46     6     89

Low SNR example: “A” side of spine_eval_021 (115 total words)

Experiment                                        Correct   Subs   Dels   Ins   Tot err
Baseline ISIP-STT                                 72        25     18     4     47
Noise pre-processed training & evaluation data    80        18     17     3     38
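
The per-side counts can be turned into word error rates with the usual definition, errors divided by reference words, where the reference length equals correct plus substituted plus deleted words. The sketch below uses only the counts from the two tables; the dictionary keys and function name are illustrative.

```python
def wer_percent(subs, dels, ins, ref_words):
    """Word error rate in percent: (S + D + I) / N * 100."""
    return 100.0 * (subs + dels + ins) / ref_words


# (correct, subs, dels, ins) copied from the two per-side tables.
sides = {
    "clean, spine_eval_033 B": {
        "baseline": (221, 36, 24, 4),
        "pre-processed": (198, 37, 46, 6),
    },
    "low SNR, spine_eval_021 A": {
        "baseline": (72, 25, 18, 4),
        "pre-processed": (80, 18, 17, 3),
    },
}

side_wer = {}
for side, systems in sides.items():
    for system, (correct, subs, dels, ins) in systems.items():
        ref_words = correct + subs + dels  # reference length for this side
        side_wer[(side, system)] = wer_percent(subs, dels, ins, ref_words)
        print(f"{side:28s} {system:14s} N={ref_words:3d} "
              f"WER={side_wer[(side, system)]:.1f}%")
```

On these counts the clean side goes from roughly 22.8% to 31.7% WER with pre-processing, driven mainly by deletions rising from 24 to 46, while the low SNR side improves from roughly 40.9% to 33.0%.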

RESULTS AND ANALYSIS (cont.)

HENPP designed for human listening purposes
– Optimized to raise DRT scores in presence of noise and coding
– DRT scores, WER tend to be poorly correlated; minor perceptual distortions often have magnified adverse effect on speech recognizers

Need to retune the HENPP
– Algorithm is very effective for robust recognition of noisy speech at low SNRs
– Too aggressive when applied to clean speech - some information is lost
– Minor adjustments will preserve noisy speech performance and boost clean speech performance

ISSUES

Decoding slow on this task
– 100x real-time (on 600 MHz Pentium)
– Newer version of ISIP-STT decoder will be faster
– Had to use bigram LM in the allowed time frame

Large amount of eval data
– With slow decoding, seriously limited experiments

The devil is in the details:
– Certain training data problematic: “Noise field is <long silence> up”
– Automatic segmentation (having eval segmentations would help)

CONCLUSIONS

MITRE / MS State-ISIP system; standard recognition approach using advanced noise preprocessing front end

Time limitation: could only officially report on the baseline system

Performed initial experiment with noise-preprocessing (AT&T HENPP)
– Overall word error rate did not improve
– Informal analysis suggests that for low SNR conversations, noise pre-processing does help.
– Difficulty with high SNR conversations

There is potential for improvement with application specific tuning of HENPP.

Approach is very promising for coded speech in commercial and military environments