Voice Typing: A New Speech Interaction Model for Dictation on Touchscreen Devices
Anuj Kumar (1, 2), Tim Paek (1), Bongshin Lee (1)
(1) Microsoft Research, Redmond, USA
(2) Carnegie Mellon University, Pittsburgh, USA
Mobile devices have widely penetrated the market
Mobile Devices vs. PCs
Text Input on Mobile Devices
187.7 billion text and email messages were sent in Dec 2010 in North America (Wireless Facts, CTIA 2011)
Share of mobile use (chart): Text Input 28% (emails, messages); Voice Calls 25%; Others 47% (social networking, games, maps)
Source: AppsFire, 1/11
Existing Techniques for Text Input
§ Typing (QWERTY, Half-QWERTY, Multi-tap, T9 predictive text entry): lack of haptic feedback; ergonomic issues, e.g., the “fat finger problem”
§ Recognition-oriented (SWYPE, handwriting recognition, etc.): either slow or inaccurate
Text Input via Speech
Offers several potential advantages
§ With speech, interaction is independent of device size
§ If accurately recognized, speech is three times faster than QWERTY typing (Basapur et al. ’07)
§ Speech is the only plausible input modality for 800 million non-literate users
[Figure: Typing Speeds, comparing Speech, Handwriting, QWERTY, Predictive Text, and Multi-tap]
Problems with Current Dictation Systems
Image Source: Nuance Communications, 2012
Users formulate an utterance, say it aloud, wait a few seconds, and then see the entire output at once
“Voice Recorder”-style interaction
Real-time presentation of output is sacrificed for potential accuracy gains
Users must break their chain of thought to verify the output verbatim
Error identification and correction take 75% of the time (Karat et al., ’99)
Error editing is time-intensive and frustrating
Each error edit requires at least two actions: selection and correction
Real-time Feedback & Speaking Style
§ Traditional Dictation: 1 utterance at a time; no real-time feedback
§ Discrete Recognition: 1 word at a time; pausing after each word does not match users' mental model of speaking
§ Voice Typing: chunks of 2-4 words at a time, each chunk part of a thought; enables real-time error identification and correction
Analogies shown on the slide: “Voice Recorder”, “Typist Secretary”, “Conversation with a foreign accent friend”
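To make the difference in feedback timing concrete, here is a minimal Python sketch. It assumes a fixed recognition latency and hypothetical chunk end times; `Chunk`, `dictation_display_times`, and `voice_typing_display_times` are illustrative names, not part of the system described in these slides.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    end_time_s: float  # hypothetical time (s) at which the speaker finishes this chunk


def dictation_display_times(chunks, latency_s):
    """Traditional Dictation: no text appears until the whole utterance has ended."""
    utterance_end = max(c.end_time_s for c in chunks)
    return [utterance_end + latency_s for _ in chunks]


def voice_typing_display_times(chunks, latency_s):
    """Voice Typing: each chunk appears shortly after the speaker pauses after it."""
    return [c.end_time_s + latency_s for c in chunks]


# Assumed: constant 1.0 s latency, chunk boundaries at the speaker's pauses.
chunks = [Chunk("It's hard", 0.8), Chunk("to recognize", 1.9), Chunk("speech", 2.6)]
print(dictation_display_times(chunks, latency_s=1.0))    # [3.6, 3.6, 3.6]
print(voice_typing_display_times(chunks, latency_s=1.0))  # [1.8, 2.9, 3.6]
```

With these assumed numbers, Voice Typing shows “It's hard” at 1.8 s, while the speaker is still talking, whereas Dictation shows nothing until 3.6 s. The sketch deliberately ignores that real recognition latency varies with chunk length.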
Cognitive Motivations for Voice Typing
§ Real-time feedback not only promotes learning of the interface but also leads to greater satisfaction (Payne, ’09)
§ Similar to back-channel feedback in real conversations, real-time feedback provides “common ground” (Clark et al., ’91)
§ Matches users' current mental model of keyboard typing, in which they typically monitor and correct text as they type
Technical Motivations for Voice Typing
§ Most recognition errors occur due to incorrect segmentation
Utterance: “It’s hard to recognize speech”
Recognition failure: “It’s hard to wreck a nice beach” (/s/ incorrectly attached to “nice” instead of “speech”)
§ With Voice Typing, users are likely to pause where segmentations should occur: “It’s hard <pause> to recognize <pause> speech”
§ Real-time user correction provides the correct context and stops error propagation (Aist et al., ’07)
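Pause-based chunking can be illustrated with a simple energy threshold over audio frames. This Python sketch is an assumption for illustration only (the slides do not describe how the recognizer segments audio); `segment_by_pauses`, the threshold, and the frame values are hypothetical.

```python
def segment_by_pauses(energies, threshold, min_pause_frames):
    """Split frame indices into speech chunks separated by sufficiently long pauses.

    A frame counts as silent when its energy is below `threshold`; a run of at
    least `min_pause_frames` silent frames closes the current chunk.
    Returns (start, end) pairs of frame indices, with `end` exclusive.
    """
    chunks = []
    start = None        # first voiced frame of the open chunk, if any
    last_voiced = None  # most recent voiced frame in the open chunk
    silent_run = 0      # length of the current run of silent frames
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            last_voiced = i
            silent_run = 0
        else:
            silent_run += 1
            if start is not None and silent_run >= min_pause_frames:
                chunks.append((start, last_voiced + 1))
                start = None
    if start is not None:  # close a chunk still open at the end of the input
        chunks.append((start, last_voiced + 1))
    return chunks


# Hypothetical frame energies: three bursts of speech separated by pauses.
energies = [0, 5, 6, 0, 0, 0, 7, 8, 0, 9, 0, 0, 0, 4, 0]
print(segment_by_pauses(energies, threshold=1, min_pause_frames=3))
# [(1, 3), (6, 10), (13, 14)]
```

The one-frame dip inside the second chunk does not split it, because only runs of at least `min_pause_frames` silent frames count as a pause; the three chunks mirror “It’s hard <pause> to recognize <pause> speech”.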
Error Correction: Marking Menu
§ Edit operations are accessed directly from the word via a marking menu or simple gestures
§ A single operation edits each error
§ Delete: swipe left
§ Substitute: swipe up (respeak, spell) or swipe down (alternates)
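The gesture mapping above can be sketched as a small classifier from a swipe vector to an edit operation. This is a hypothetical Python illustration: the `EditOp` names, the 30-pixel minimum distance, and treating rightward swipes as unassigned are assumptions, not details from the study.

```python
import math
from enum import Enum


class EditOp(Enum):
    DELETE = "delete"                     # swipe left
    SUBSTITUTE_RESPEAK = "respeak/spell"  # swipe up
    SUBSTITUTE_ALTERNATES = "alternates"  # swipe down
    NONE = "none"                         # too short, or a direction with no mapping


def classify_swipe(dx, dy, min_distance=30.0):
    """Map a swipe vector in screen coordinates (y grows downward) to an edit."""
    if math.hypot(dx, dy) < min_distance:
        return EditOp.NONE  # assumed: short movements are treated as taps, not swipes
    if abs(dx) > abs(dy):
        # Horizontal swipe: only leftward is mapped on the slide.
        return EditOp.DELETE if dx < 0 else EditOp.NONE
    # Vertical swipe: up opens respeak/spell, down shows alternates.
    return EditOp.SUBSTITUTE_RESPEAK if dy < 0 else EditOp.SUBSTITUTE_ALTERNATES


print(classify_swipe(-50, 5))  # EditOp.DELETE
print(classify_swipe(3, 60))   # EditOp.SUBSTITUTE_ALTERNATES
```

Resolving each gesture directly on the word is what lets a single action replace the separate selection and correction steps of a conventional menu.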
User Study & Hypotheses
§ Controlled experiment to assess correction efficacy and usability of Voice Typing
§ 2 x 2 within-subjects experiment (N = 24)
§ Speech Interaction Model: Voice Typing vs. Traditional Dictation
§ Error Correction Style: Marking Menu vs. Regular
§ Hypotheses: Voice Typing outperforms traditional Dictation, and the Marking Menu outperforms the Regular menu
Task: Compose Emails
§ 5 emails per condition (2 for practice, 3 for analysis)
§ E.g. “Write an email to your friend recommending a restaurant you like. Suggest a plate she should order and why she will like it.”
§ “Explain to your boss why you won’t be able to come into work today.”
Procedure & Types of Data
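Session timeline (time units not shown on the slide): Recognizer Training 10, Condition A 25, Condition B 25, Condition C 25, Condition D 25, then a final 10-unit segment whose label was not recovered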
User Correction Error Rate (UCER)
§ UCER captures the amount of effort users made to correct errors
§ With Voice Typing, users made significantly fewer corrections (10%) than with Dictation (14%), F(1,46) = 4.15, p = 0.04*
§ Users naturally slowed down to monitor real-time text output, which helped accuracy
4 Types of Corrections
§ Substitutions: respeak, spell, or choose from alternates
§ Insertions: insert a word between two existing words
§ Deletions: delete a word, one at a time
§ Uncorrected: words identified as incorrect but left uncorrected
Number of errors per email, by condition:

| Condition | Substitutions | Insertions | Deletions | Uncorrected |
|---|---|---|---|---|
| Dictation, Marking Menu | 7.35 | 2.14 | 3.17 | 0.21 |
| Dictation, Regular | 5.90 | 1.43 | 3.32 | 0.24 |
| Voice Typing, Marking Menu | 7.14 | 0.78 | 2.82 | 0.15 |
| Voice Typing, Regular | 5.10 | 1.36 | 3.10 | 0.25 |
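As a descriptive check on the substitution result, the per-email means above can be averaged within each level of a factor. This Python sketch computes unweighted marginal means from the chart's values; it is simple arithmetic, not the ANOVA behind the reported F statistics, and `marginal_mean` is an illustrative name.

```python
# Mean number of errors per email, copied from the chart above:
# (substitutions, insertions, deletions, uncorrected)
per_email = {
    ("Dictation", "Marking Menu"): (7.35, 2.14, 3.17, 0.21),
    ("Dictation", "Regular"): (5.90, 1.43, 3.32, 0.24),
    ("Voice Typing", "Marking Menu"): (7.14, 0.78, 2.82, 0.15),
    ("Voice Typing", "Regular"): (5.10, 1.36, 3.10, 0.25),
}
SUBSTITUTIONS = 0  # column index of substitutions in each tuple


def marginal_mean(factor_index, level, column):
    """Unweighted mean of one column over the cells where a factor equals `level`.

    factor_index 0 is the speech interaction model; 1 is the error correction style.
    """
    values = [row[column] for cell, row in per_email.items() if cell[factor_index] == level]
    return sum(values) / len(values)


for style in ("Marking Menu", "Regular"):
    print(style, round(marginal_mean(1, style, SUBSTITUTIONS), 3))
# Marking Menu 7.245
# Regular 5.5
```

The Marking Menu's higher marginal mean for substitutions (about 7.2 vs. 5.5 per email) is consistent in direction with the significant correction-style effect reported for substitutions below.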
Difference in Number of Substitutions
§ Significantly more substitutions with the Marking Menu than with the Regular correction style, F(1,46) = 5.9, p = 0.01*
§ Possibly because users preferred to substitute the word rather than leave it uncorrected
Lower Transcription Delay for Voice Typing
§ Average delay: Voice Typing = 1.27 s; Dictation = 12.41 s
§ The Dictation delay includes the time the user took to speak the entire utterance, as well as the system delay
§ Delay in Voice Typing did not vary enough across emails to affect the user experience
§ Most emails fell within one SD of the mean; all fell within two SD
[Figure: histogram of per-email delay in Voice Typing; x-axis: Delay in Voice Typing (ms), 0 to 3000; y-axis: Number of emails]
$$\text{Delay} = \frac{\sum_{\text{all emails}} \left( \text{Time}(\text{text}) - \text{Time}(\text{speech}) \right)}{\text{Total Number of Emails}}$$
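A minimal Python sketch of the delay formula, plus the one- and two-SD check mentioned above. It assumes that Time(text) and Time(speech) are per-email values in the same unit (the slide does not define them further) and uses hypothetical delays, not the study's data; the function names are illustrative.

```python
import statistics


def mean_delay(text_times, speech_times):
    """Delay = sum over emails of (Time(text) - Time(speech)) / number of emails."""
    if not text_times or len(text_times) != len(speech_times):
        raise ValueError("need one text time and one speech time per email")
    return sum(t - s for t, s in zip(text_times, speech_times)) / len(text_times)


def share_within_k_sd(delays, k):
    """Fraction of per-email delays within k sample standard deviations of their mean."""
    mu = statistics.mean(delays)
    sd = statistics.stdev(delays)  # sample SD; requires at least two values
    return sum(abs(d - mu) <= k * sd for d in delays) / len(delays)


# Hypothetical per-email delays in ms (not data from the study).
delays_ms = [1200, 1300, 1250, 1350, 1800]
print(share_within_k_sd(delays_ms, 1))  # 0.8
print(share_within_k_sd(delays_ms, 2))  # 1.0
```

Using the sample SD (`statistics.stdev`) rather than the population SD is a choice made here; the slide does not say which was used.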
Wins for Voice Typing
§ Users reported lower mental demand, effort, and frustration with Voice Typing
§ 18 of 24 participants preferred Voice Typing over Dictation:
“It [Voice Typing] was better because you did not have to worry about finding mistakes later on. You could see the interaction [output] as you say; thereby reassuring you that it was working fine.”
Losses for Voice Typing
§ 6 participants preferred Dictation because incorrect recognition in Voice Typing disrupted their flow of thought:
“I preferred Dictation, because in Voice Typing, if one word was off as I was speaking, it would distract me.”
Wins & Losses for Marking Menu
§ Marking Menu had lower physical and mental demand
§ 21 participants preferred Marking Menu because:
“It [Marking Menu] was great for a beginner. It was easier mentally to see the circle with choices and not have to concern myself with where to select my [correction] choices from.”
“It [Marking Menu] seemed to involve less action.”
§ 3 participants disagreed:
§ They had larger fingers than most, so gestures on short words were difficult (e.g., single-letter words such as “a”)
Discussion
§ Real-time transcription of speech (as in Voice Typing) reduced user corrections
§ Users naturally provided segmentation information through their pauses
§ Plausibly, correcting transcriptions in real time prevents errors from propagating
§ The Marking Menu was preferred by most participants, yet had more substitutions
§ Users preferred to substitute the word rather than leave it uncorrected
§ The Dictation interaction model should not be dismissed
§ It is useful when “eyes-free, hands-free” interaction is required, e.g., while driving
§ Other modalities, such as traditional keypad typing, remain useful in public spaces
Thank You
Questions?
Comments?
Feedback?