
Efficient Computer Interfaces Using Continuous Gestures, Language Models, and Speech

Keith Vertanen

Inference Group

August 4th, 2004

The problem

Speech recognizers make mistakes. Correcting mistakes is inefficient:

  140 WPM   Uncorrected dictation
   14 WPM   Corrected dictation, mouse/keyboard
   32 WPM   Corrected typing, mouse/keyboard

Voice-only correction is even slower and more frustrating

Research overview

Make correction of dictation:
  More efficient
  More fun
  More accessible

Approach:
  Build a word lattice from a recognizer's n-best list
  Expand the lattice to cover likely recognition errors
  Make a language model from the expanded lattice
  Use the model in a continuous gesture interface to perform confirmation and correction

Building lattice

Example n-best list:
  1: jack studied very hard
  2: jack studied hard
  3: jill studied hard
  4: jill studied very hard
  5: jill studied little
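To make the construction concrete, here is a minimal Python sketch that builds a word graph from the n-best list above. The (word, position) node representation and the merge-by-position strategy are illustrative assumptions, not the recognizer's actual lattice format.

```python
from collections import defaultdict

def build_lattice(nbest):
    """Build a simple word lattice: nodes are (word, position) pairs and
    arcs connect words that follow one another in some hypothesis."""
    arcs = defaultdict(set)                 # node -> set of successor nodes
    start, end = ("<s>", -1), ("</s>", -2)
    for hyp in nbest:
        prev = start
        for i, word in enumerate(hyp.split()):
            node = (word, i)
            arcs[prev].add(node)
            prev = node
        arcs[prev].add(end)
    return arcs

nbest = [
    "jack studied very hard",
    "jack studied hard",
    "jill studied hard",
    "jill studied very hard",
    "jill studied little",
]
for node, successors in build_lattice(nbest).items():
    print(node, "->", sorted(successors))
```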

Insertion errors

Acoustic confusions
Given a word, find words that sound similar.
Look up the word's pronunciation in a dictionary:
  studied -> s t ah d iy d
Use observed phone confusions to generate alternative pronunciations:
  s t ah d iy d -> s t ah d iy d, s ao d iy, s t ah d iy, ...
Map the pronunciations back to words:
  s t ah d iy d -> studied
  s ao d iy     -> saudi
  s t ah d iy   -> study

Acoustic confusions: "Jack studied hard"
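A rough Python sketch of the acoustic-confusion step. The pronunciation dictionary and phone-confusion table below are toy stand-ins; a real system would use a CMU-style dictionary and confusion statistics observed from the recognizer, so treat the specific entries as assumptions.

```python
# Toy pronunciation dictionary and phone-confusion table; a real system
# would use a CMU-style dictionary and observed recognizer confusions.
PRON = {
    "studied": ["s", "t", "ah", "d", "iy", "d"],
    "study":   ["s", "t", "ah", "d", "iy"],
    "saudi":   ["s", "ao", "d", "iy"],
}
CONFUSABLE = {"ah": ["ao"], "t": [""]}      # "" = the phone may be deleted

def alternative_prons(phones):
    """Yield pronunciations at most one substitution/deletion away."""
    yield list(phones)
    for i, p in enumerate(phones):
        for alt in CONFUSABLE.get(p, []):
            yield phones[:i] + ([alt] if alt else []) + phones[i + 1:]

def acoustic_confusions(word):
    """Map each alternative pronunciation back onto dictionary words."""
    by_pron = {tuple(v): k for k, v in PRON.items()}
    hits = set()
    for variant in alternative_prons(PRON[word]):
        # also try dropping the final phone (e.g. the trailing 'd')
        for candidate in (tuple(variant), tuple(variant[:-1])):
            if candidate in by_pron:
                hits.add(by_pron[candidate])
    return hits - {word}

print(acoustic_confusions("studied"))       # {'study'} with this toy data
```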

Morphology confusions
Given a word, find words that share the same "root". Using the Porter stemmer:
  jack, jacks, jacking, jacked       -> jack
  study, studying, studied, studies  -> studi

Morphology confusions: "Jack studied hard"
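The same idea in code, using NLTK's Porter stemmer over a toy vocabulary; the vocabulary list is an assumption, and in practice it would be the recognizer's full word list.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer       # requires the nltk package

def morphology_confusions(word, vocabulary):
    """Return vocabulary words that share the given word's Porter stem."""
    stemmer = PorterStemmer()
    by_stem = defaultdict(set)
    for w in vocabulary:
        by_stem[stemmer.stem(w)].add(w)
    return by_stem[stemmer.stem(word)] - {word}

vocab = ["jack", "jacks", "jacking", "jacked",
         "study", "studying", "studied", "studies"]
print(morphology_confusions("studied", vocab))
# {'study', 'studying', 'studies'} -- all stem to 'studi'
```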

Language model confusions:“Jack studied hard”

Look at the words before or after a node and add likely alternate words based on an n-gram LM.
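A sketch of this expansion step using a toy bigram table; the counts and the function name are made up for illustration, and a real system would query the recognizer's n-gram model instead.

```python
# Toy bigram counts standing in for a real n-gram language model.
BIGRAMS = {
    "jack":    {"studied": 10, "worked": 7, "slept": 2},
    "studied": {"hard": 12, "little": 3, "law": 1},
}

def lm_confusions(prev_word, already_in_lattice, top_k=2):
    """Propose likely successors of prev_word that the lattice lacks."""
    followers = BIGRAMS.get(prev_word, {})
    ranked = sorted(followers, key=followers.get, reverse=True)
    return [w for w in ranked if w not in already_in_lattice][:top_k]

# After the node "jack", the lattice already contains "studied":
print(lm_confusions("jack", {"studied"}))   # ['worked', 'slept']
```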

Expansion results (on WSJ1)

[Chart: oracle word accuracy (84%–98%) after each expansion step: Baseline, Insertion, Acoustic, Morphology, Bigram, Trigram, Backward bigram, Backward trigram. Series: Observed, Fully additive, Upper bound.]

Probability model

Our confirmation and correction interface requires the probability of a letter given the prior letters: P(s_i | s_1 s_2 ... s_{i-1})

Probability model

Keep track of the possible paths in the lattice
Prediction is based on the next letter along those paths
Interpolate with a default language model
Example, user has entered "the_cat":

[Lattice diagram: arc probabilities 1.00, 1.00]
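A small sketch of how such a letter-level prediction might be computed: enumerate the path completions still consistent with the typed text, count the next letter on each, and interpolate with a default letter distribution. The interpolation weight and the representation of paths as plain strings are assumptions for illustration.

```python
def next_letter_probs(paths, typed, default_lm, lam=0.9):
    """P(next letter | letters typed so far), mixing lattice paths with a
    default letter-level language model."""
    counts = {}
    live = [p for p in paths if p.startswith(typed) and len(p) > len(typed)]
    for p in live:
        letter = p[len(typed)]
        counts[letter] = counts.get(letter, 0) + 1
    total = sum(counts.values())

    probs = {}
    for letter, p_default in default_lm.items():
        p_paths = counts.get(letter, 0) / total if total else 0.0
        probs[letter] = lam * p_paths + (1 - lam) * p_default
    return probs

paths = ["the_cat_sat", "the_cat_ran", "the_cap_sat"]
uniform = {c: 1 / 27 for c in "abcdefghijklmnopqrstuvwxyz_"}
dist = next_letter_probs(paths, "the_cat", uniform)
print(max(dist, key=dist.get))    # '_' -- both live paths continue with it
```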

Handling word errors
Use the default language model during entry of the erroneous word
Rebuild paths allowing for an additional deletion or substitution error
Example, user has entered "the_cattle_":

[Lattice diagram: arc probabilities 0.25, 0.25, 0.25, 0.0625, 0.0625]
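An illustrative word-level sketch of the path-rebuilding idea: once the typed text matches no path, allow one substituted or deleted word when deciding which paths stay alive. The real model works at the letter level with per-path probabilities; this only shows the tolerance for a single extra error.

```python
def live_paths_with_error(paths, typed):
    """Paths that survive after allowing one substituted or deleted word."""
    words = typed.rstrip("_").split("_")
    survivors = []
    for path in paths:
        pw = path.split("_")
        # substitution: at most one typed word differs from the path word
        mismatches = sum(a != b for a, b in zip(words, pw))
        if len(pw) >= len(words) and mismatches <= 1:
            survivors.append(path)
            continue
        # deletion: the path is missing one of the typed words
        for skip in range(len(words)):
            reduced = words[:skip] + words[skip + 1:]
            if pw[:len(reduced)] == reduced:
                survivors.append(path)
                break
    return survivors

paths = ["the_cat_sat_down", "the_dog_sat_down"]
print(live_paths_with_error(paths, "the_cattle_"))
# both survive: 'cattle' is treated as a substitution of 'cat' / 'dog'
```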

Using the expanded lattice
Paths using arcs added during lattice expansion are penalized.
Example, user has entered "jack_":

[Lattice diagram: arc probabilities 0.04, 0.04, 1.00]
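A toy illustration of the penalty: divide a path's weight by a constant factor for every expansion arc it uses. The penalty value 25 is purely an assumption chosen for this example (it happens to reproduce the 0.04 above), not the value used in the actual system.

```python
EXPANSION_PENALTY = 25.0    # assumed value, for illustration only

def path_weight(arcs):
    """arcs: list of (word, added_by_expansion) pairs along one path."""
    weight = 1.0
    for _, added_by_expansion in arcs:
        if added_by_expansion:
            weight /= EXPANSION_PENALTY
    return weight

original = [("jack", False), ("studied", False), ("hard", False)]
expanded = [("jack", False), ("worked", True), ("hard", False)]
print(path_weight(original), path_weight(expanded))   # 1.0 0.04
```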

Evaluating expansion
Assume a good model requires as little information from the user as possible.

Cross entropy(T) = -\frac{1}{t} \sum_{i=0}^{t-1} \log_2 P(s_i \mid s_1 s_2 \ldots s_{i-1})
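The cross entropy can be computed directly from any letter model, as in this short sketch; the uniform model here is only a sanity check, and a better model yields fewer bits per letter.

```python
import math

def cross_entropy(model, text):
    """Average bits per letter: -(1/t) * sum_i log2 P(s_i | s_1 ... s_{i-1})."""
    bits = 0.0
    for i in range(len(text)):
        p = model(text[:i], text[i])     # P(next letter | prefix)
        bits += -math.log2(p)
    return bits / len(text)

# Sanity check: a uniform model over 27 symbols needs log2(27) ~ 4.75 bits.
uniform = lambda prefix, letter: 1 / 27
print(round(cross_entropy(uniform, "the_cat_sat"), 2))   # 4.75
```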

[Chart: cross entropy (bits), roughly 0.4–0.9, for each expansion step: Baseline, Insertion, Acoustic, Morphology, Bigram, Trigram, Backward bigram, Backward trigram.]

Results on test set
Model evaluated on a held-out test set (Hub1)

Default language model: 2.4 bits/letter (user decides between 5.3 letters)
Best speech-based model: 0.61 bits/letter (user decides between 1.5 letters)
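The "letters the user decides between" figures are the perplexity 2^H of the per-letter cross entropy; a quick check:

```python
for bits in (2.4, 0.61):
    print(f"{bits} bits/letter -> {2 ** bits:.1f} letters")
# 2.4 bits/letter -> 5.3 letters
# 0.61 bits/letter -> 1.5 letters
```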

“To the mouse snow means freedom from want and fear”

“The hibernating skunk curled up in his deep den uncurls himself and ventures forth to prowl the world”

Conclusions
One-third of recognition errors covered by expanding the lattice.
Only insertion error expansion improves efficiency.
Speech-based model significantly improves efficiency (2.4 bits -> 0.61 bits).
A good correction interface is possible using Dasher and an off-the-shelf recognizer.

Future work
Update Speech Dasher to use the lattice-based probability model.
Incorporate hypothesis probabilities into the lattice (or even better, get at the recognizer's lattice).
Improve efficiency on sentences with few or no errors.
User trials to validate numeric results.

Questions?