Rapid and Accurate Spoken Term Detection

Owen Kimball

BBN Technologies

15 December 2006

15-Dec-06 Rapid and Accurate Spoken Term Detection 2

Overview of Talk

• BBN Levantine system description

• Evaluation results

• Diacritics

• Out-of-vocabulary issues

BBN Evaluation Team

Core Team
• Chia-lin Kao
• Owen Kimball
• Michael Kleber
• David Miller

Additional assistance
• Thomas Colthurst
• Herb Gish
• Steve Lowe
• Rich Schwartz

BBN System Overview

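[Diagram: processing pipeline]

• audio -> Byblos STT -> lattices, phonetic transcripts

• lattices, phonetic transcripts -> indexer -> index

• index + search terms -> detector -> scored detection lists

• scored detection lists + ATWV cost parameters -> decider -> final output with YES/NO decisions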

Levantine STT Configuration

• STT generates a lattice of hypotheses and a phonetic transcript for each input file.

• Word-based system:
– Orthography based on Modern Standard Arabic (MSA), no short vowel diacritics
– Acoustic: 57.3 hours LDC (noise words, no mixture exponents)
– Language: 250 hours of data, 1.3M words

• 38.5K dictionary, grapheme-as-phoneme based plus 100 manual pronunciations

– unknown short vowel (U), 39 phonemes

• 42.32% WER on STD Dev06 CTS data

Levantine CTS Results

Data      ATWV
Dev06     0.515
DryRun    0.410
Eval06    0.3467
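The slides report ATWV without defining it. As a rough sketch, the function below follows the definition in the NIST STD 2006 evaluation plan: one minus the average over terms of P_miss + beta * P_FA, with beta = 999.9 and one non-target trial per second of speech. The function name, the input layout, and the example counts are assumptions made for illustration, not BBN's scoring code.

```python
def atwv(term_stats, speech_seconds, beta=999.9):
    """Actual Term-Weighted Value, sketched from the NIST STD 2006 definition.

    term_stats maps each term to (n_true, n_correct, n_false_alarm), counted
    from the system's YES decisions. The layout is hypothetical.
    """
    total = 0.0
    n_terms = 0
    for n_true, n_correct, n_fa in term_stats.values():
        if n_true == 0:
            continue  # terms with no reference occurrences are left out
        p_miss = 1.0 - n_correct / n_true
        # Non-target trials: one per second of speech, minus the true hits.
        p_fa = n_fa / (speech_seconds - n_true)
        total += p_miss + beta * p_fa
        n_terms += 1
    if n_terms == 0:
        raise ValueError("no term has reference occurrences")
    return 1.0 - total / n_terms


# Made-up counts: a perfect system scores 1.0, an empty output scores 0.0.
perfect = atwv({"term_a": (10, 10, 0)}, speech_seconds=3600.0)
silent = atwv({"term_a": (10, 0, 0)}, speech_seconds=3600.0)
```

Because beta is large, a handful of false alarms on a rare term can cost as much as missing every occurrence of it, which is why the decider stage is tuned against ATWV directly.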

OOV Pipeline: Detector

• Word-based STT produces a 1-best transcript; pronouncing it yields a 1-best phonetic transcript.

• Query is OOV if it contains any OOV word.

• OOV query detection:
– Pronounce query (grapheme-as-phoneme)
– Find minimal edit-distance alignments (agrep)
– Score = % error = 1 - (edit distance / # phonemes)
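The OOV detection steps on this slide can be sketched as follows. This is a minimal illustration, not the BBN detector: it assumes one phoneme per letter for the grapheme-as-phoneme step, replaces agrep with a plain dynamic-programming approximate substring match, and reads the score as 1 - (edit distance / # phonemes in the query). All function names are hypothetical.

```python
def pronounce(word):
    """Grapheme-as-phoneme: treat every letter as one phoneme (assumption)."""
    return list(word)


def best_match(query, transcript):
    """Best approximate match of `query` anywhere in `transcript`.

    Returns (edit_distance, end_index). The match may start at any transcript
    position, so the top row of the DP table is all zeros.
    """
    m = len(query)
    prev = list(range(m + 1))  # column for the empty transcript prefix
    best = (prev[m], 0)
    for j, t in enumerate(transcript, start=1):
        cur = [0]  # free start position
        for i in range(1, m + 1):
            sub = 0 if query[i - 1] == t else 1
            cur.append(min(prev[i] + 1,         # extra phoneme in transcript
                           cur[i - 1] + 1,      # query phoneme missing
                           prev[i - 1] + sub))  # match or substitution
        if cur[m] < best[0]:
            best = (cur[m], j)
        prev = cur
    return best


def oov_score(query_word, transcript):
    """Score = 1 - (edit distance / # phonemes in the query)."""
    q = pronounce(query_word)
    dist, end = best_match(q, transcript)
    return 1.0 - dist / len(q), end


# Toy 1-best phonetic transcript built from three made-up "words".
transcript = [p for w in ["xx", "bHky", "zz"] for p in pronounce(w)]
exact_score, exact_end = oov_score("bHky", transcript)
near_score, near_end = oov_score("b>Hky", transcript)
```

Here the exact query matches with distance 0, while "b>Hky" matches the same region with one missing phoneme out of five. A real detector would return every sufficiently good alignment, not just the best one.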

OOV Pipeline: Decider

• Need different Yes/No decision procedure: IV-decider requires posterior probabilities.

• Simple OOV decision procedure:
– Constant threshold on score (~ 0.7)
– Cap on maximum number of hits (0-3)
– Values set to maximize ATWV on Dev06 data.
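A minimal sketch of such a decision rule is shown below. It assumes the cap applies per query term and that the best-scoring candidates are kept first. The slide gives only the approximate threshold and the 0-3 range for the cap, so the default values and the `decide` name are illustrative.

```python
def decide(hits, threshold=0.7, max_hits=3):
    """Assign YES/NO to candidate detections for one OOV query.

    hits: list of (score, location) pairs from the detector.
    A hit gets YES if its score reaches the threshold and fewer than
    `max_hits` better-scoring hits have already been accepted.
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    decisions = []
    accepted = 0
    for score, location in ranked:
        if score >= threshold and accepted < max_hits:
            decisions.append((location, score, "YES"))
            accepted += 1
        else:
            decisions.append((location, score, "NO"))
    return decisions


# Made-up detector output for a single query.
hits = [(0.90, "a"), (0.75, "b"), (0.60, "c"), (0.95, "d"), (0.80, "e")]
result = decide(hits, threshold=0.7, max_hits=3)
```

With these inputs, "d", "a" and "e" are accepted, "b" is rejected by the cap, and "c" is rejected by the threshold. Tuning the two parameters would amount to a small grid search that rescores a development set for each setting.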

OOV Pipeline: Results

• ATWV remained good:
– 0.3450 IV
– 0.3635 OOV

• Searches take longer: ~10-15x the IV search time on Dev06 and DryRun06, with no attempt at indexing.

OOV Directions for Improvement

• Score substitutions using phoneme confusion matrix instead of flat edit distance

• Speed: indexing phonetic transcripts for approximate matching

• Search lattices beyond 1-best transcripts
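To illustrate the first direction, the sketch below swaps the flat substitution cost for a lookup in a confusion-cost table, so that easily confused phonemes are cheap to substitute. The table entries are placeholders; in practice the costs would have to be estimated from recognizer confusions, which the slides do not describe.

```python
def weighted_distance(query, segment, sub_cost, indel=1.0):
    """Edit distance with per-pair substitution costs.

    sub_cost maps (query_phoneme, transcript_phoneme) to a cost.
    Pairs not listed fall back to a flat cost of 1.0.
    """
    prev = [i * indel for i in range(len(segment) + 1)]
    for q in query:
        cur = [prev[0] + indel]
        for j, s in enumerate(segment, start=1):
            sub = 0.0 if q == s else sub_cost.get((q, s), 1.0)
            cur.append(min(prev[j] + indel,      # query phoneme missing
                           cur[j - 1] + indel,   # extra phoneme in segment
                           prev[j - 1] + sub))   # match or substitution
        prev = cur
    return prev[-1]


# Placeholder costs: pretend "b" and "p" are often confused.
costs = {("b", "p"): 0.3}
cheap = weighted_distance(list("bat"), list("pat"), costs)
flat = weighted_distance(list("bat"), list("pat"), {})
```

With the placeholder table the "b"/"p" mismatch costs 0.3 instead of 1.0, so a near-miss caused by a likely recognizer confusion is penalized less than an arbitrary substitution.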

Levantine Diacritic Issues

• Originally looked at diacritized Levantine

• Trained STT engine using LDC 45 hour set

• Ran STD without knowing WER (no diacritized STT test set to measure WER).
– Found very high false alarm rate

• Examining FAs found hits that were legitimate alternate spellings

Levantine Diacritics - Alternate Spellings

• Examining query words found more of same:
– In first 22 terms of dry run term list, 14 are “alternate diacritic” spellings of 5 underlying words, i.e. there were just 13 unique words in the first 22 terms

– Min~ahumo v Minohumo

– AlHayaApi v AlHayaAp

– Waliko v Walika

– qabilo v qabola v qabolo

• LDC training and STD test set had additional pervasive differences
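One way to surface such alternate spellings automatically is to strip the diacritic symbols and group terms that collapse to the same skeleton. The sketch below assumes the terms are in Buckwalter transliteration, where a, i, u, o, ~, F, N, K and ` mark short vowels, sukun, shadda, nunation and dagger alif. The slides do not name the transliteration scheme, so that is an assumption, as are the function names.

```python
from collections import defaultdict

# Buckwalter diacritic symbols (assumed transliteration scheme).
BUCKWALTER_DIACRITICS = set("aiuoFNK~`")


def strip_diacritics(term):
    """Remove diacritic symbols, leaving the consonant/long-vowel skeleton."""
    return "".join(ch for ch in term if ch not in BUCKWALTER_DIACRITICS)


def spelling_variants(terms):
    """Group terms sharing a skeleton; keep only groups with >1 spelling."""
    groups = defaultdict(set)
    for term in terms:
        groups[strip_diacritics(term)].add(term)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


# The example spellings listed on this slide.
terms = ["Min~ahumo", "Minohumo", "AlHayaApi", "AlHayaAp",
         "Waliko", "Walika", "qabilo", "qabola", "qabolo"]
variants = spelling_variants(terms)
```

On the nine spellings listed here this yields four groups, one per underlying word. Run over a full term list, the same grouping gives a quick count of how many terms are only diacritic variants of each other.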

No-Diacritic Levantine Issues

• A quick look turned up a smaller number of problems for no-diacritic Levantine
– Looking at 7 top-FA terms in dev set, found:

• “bHky” vs “b>Hky” but no other spelling confusions

• One ref instance of term with 0 duration

• It would be interesting to QC test sets for inconsistent spellings and other issues

