Characterizing and Processing Robot-Directed Speech

Paulina Varchavskaia • Paul Fitzpatrick • Cynthia Breazeal

MIT AI Lab Humanoid Robotics Group

Speech Recognition

What speech does the robot evoke?

How can we process that speech?

– Support a growing vocabulary

– Without hurting recognition accuracy

Baseline System

Extending Vocabulary


Baseline System

“Magic word” to introduce vocabulary

Separation of introduction from use

Robot confused by unknown/filler words

Study: Characterizing Speech

13 children

5-10 years old

20 minute sessions

Some prompting (Turkle et al.)

[Figure: Number of Utterances. x-axis: children (cumulative %); y-axis: number of utterances per minute (approx)]

[Figure: Single Word Utterances. y-axis: single word utterances (%)]

[Figure: Utterances with Enunciation. y-axis: utterances with enunciation (%)]

Cooperative Speech Exists

Some clean word-learning scenarios

– “Tiffany”

Can we use them as leverage?

– “My name is Tiffany”

What recognition rates are possible?

Study: Processing Speech

“LCSINFO” domain (Glass, Weinstein)

Provide some initial vocabulary

Build language model without transcripts

[Figure: Language Model over word1, word2, ..., wordn plus an OOV entry; the Out Of Vocabulary Model behind the OOV entry is built from phone1, phone2, ..., phonen (Bazzi)]
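The idea in the diagram, a word-level model with one extra OOV entry that is realized as a sequence of phones, can be illustrated with a toy scoring function. This is only a minimal sketch under strong assumptions: the unigram word and phone probabilities, the `oov_prob` parameter, and the encoding of an OOV fragment as a tuple of phones are all hypothetical and are not taken from the slides or from Bazzi's model.

```python
import math


def log_prob(token, word_probs, phone_probs, oov_prob):
    """Toy log-probability of one token.

    Assumed encoding: an in-vocabulary word is a str; an OOV fragment
    is a tuple of phone symbols.
    """
    if isinstance(token, str):
        # Known word: look it up directly in the word-level model.
        return math.log(word_probs[token])
    # OOV path: pay the cost of entering the OOV entry, then score
    # each phone independently (a unigram phone model, for simplicity).
    return math.log(oov_prob) + sum(math.log(phone_probs[p]) for p in token)


# Illustrative numbers only.
word_probs = {"email": 0.5, "phone": 0.4}
phone_probs = {"n": 0.5, "ow": 0.5}
known_score = log_prob("email", word_probs, phone_probs, oov_prob=0.1)
oov_score = log_prob(("n", "ow"), word_probs, phone_probs, oov_prob=0.1)
```

In this sketch an unknown word is never forced onto the nearest known word; it competes as a phone sequence, which is what lets the recognizer emit fragments that can later be added to the vocabulary.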

Hypothesized transcript

N-Best hypotheses

Run recognizer

Extract OOV fragments

Identify competition

Identify rarely-used additions

Remove from lexicon

Add to lexicon

Update lexicon, baseforms

Update Language Model
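One pass of the loop named in the boxes above might look roughly like the following. This is a hedged sketch, not the authors' implementation: the token encoding (known words as strings, OOV fragments as tuples of phones), the thresholds `MIN_COUNT` and `MIN_USES`, the `usage` counts, and the function names are all assumptions made for illustration. The "identify competition" and language-model update steps are left out.

```python
from collections import Counter

MIN_COUNT = 2  # assumed: a fragment must recur this often to be added
MIN_USES = 1   # assumed: additions used fewer times than this are removed


def extract_oov_fragments(hypotheses):
    """Return every OOV fragment found in the N-best hypotheses.

    Assumed encoding: a hypothesis is a list of tokens, where a known
    word is a str and an OOV fragment is a tuple of phone symbols.
    """
    return [tok for hyp in hypotheses for tok in hyp if isinstance(tok, tuple)]


def update_lexicon(lexicon, additions, hypotheses, usage):
    """Prune rarely used additions, then add recurring OOV fragments.

    lexicon maps a word to its baseform (list of phones); additions is
    the set of words added by earlier passes; usage maps a word to how
    often it was recognized since the last pass.
    """
    lexicon = dict(lexicon)      # work on copies, leave inputs untouched
    additions = set(additions)
    for word in list(additions):
        if usage.get(word, 0) < MIN_USES:
            lexicon.pop(word, None)
            additions.discard(word)
    counts = Counter(extract_oov_fragments(hypotheses))
    for phones, count in counts.items():
        word = "_".join(phones)  # placeholder spelling for the new entry
        if count >= MIN_COUNT and word not in lexicon:
            lexicon[word] = list(phones)
            additions.add(word)
    return lexicon, additions


# Illustrative data; the baseform for "email" is a placeholder.
base = {"email": ["iy", "m", "ey", "l"]}
nbest = [
    ["email", ("n", "ah", "m", "b", "er")],
    [("n", "ah", "m", "b", "er"), "email"],
    [("p", "l", "iy", "z")],
]
lex1, added1 = update_lexicon(base, set(), nbest, usage={})
lex2, added2 = update_lexicon(lex1, added1, [], usage={"n_ah_m_b_er": 0})
```

Keeping the set of additions separate from the initial vocabulary means pruning can only ever remove entries the loop itself introduced.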

[Figure: keyword error rate (%) versus coverage (%), comparing baseline performance with performance after clustering]

Qualitative Results

Given: 1600 utterances

“email, phone, room, office, address”

Finds:

1. n ah m b er
2. w eh r ih z
3. w ah t ih z
4. t eh l m iy
5. k ix n y uw
6. p l iy z
7. ae ng k y uw
8. n ow
9. hh aw ax b aw
10. g r uw p

Conclusions

What about acoustic models?

Idiosyncratic vocabulary is important

Pick a good name for your robot!

Acknowledgements

MIT Initiative on Technology and Self

MIT Spoken Language Systems group

DARPA

NTT
