designing for voice interactions (uxaustralia)
DESCRIPTION
Voice-controlled interfaces and natural language processing are becoming normal. Major mobile platforms now have voice integrated into the operating system, and Google has been engineering its voice search experience in leaps and bounds recently. Voice interactions will transform how we interact with computers and – just like the touch interfaces that came before them – mobile devices are driving the change.
TRANSCRIPT
source: http://www.flickr.com/photos/altemark/304079314
Designing for Voice Interactions
UX Australia
Designing for Mobility
Melbourne, March 1 2013
Jonny Schneider
Lead Consultant
Mobile Experience Design & Strategy
When you think of voice
recognition, you probably
think of...
‘Understanding Moira’, AAMI TV Commercial, http://www.youtube.com/watch?v=EY_jL38HMy8
inaccurate
too slow
never works
it’s a gimmick
too tedious for me
“I won’t use it until it’s faster and more accurate than typing”
it can’t handle my accent
A lot of those things might be true, but this is default thinking, likely based on many bad experiences. However, there are two sides to every story.
https://twitter.com/bennyg/status/167192535305945088
http://www.flickr.com/photos/av_hire_london/5579125851
IDEA: Experience first-hand what it's like to interact with digital devices using predominantly your voice.
METHOD: A group of colleagues committed to using voice wherever possible, for an entire day.
Day of Voice
Let’s take a more objective look at what it’s like to use voice in our everyday interactions. Today.
✦ Controlling the device is tedious
✦ I’m sorry, I can’t do that for you
✦ Comprehension/recognition
✦ Expression
✦ Privacy
✦ Loss of context/paradigm
Day of Voice: what didn’t work
Control: “Dictation itself was fine, but getting to where notes are taken was very tedious.” “I couldn’t navigate to where I needed to be. It heard the command correctly, but didn’t know what to do with it.”
Limitations: Generally, it’s not pervasive enough to be relied upon.
“I can’t...”
- “play games with voice”!
- attach to email
- dictate an email address ("schneider dot jonny at gmail com")
- edit an address
Recognition. i.e. Pam’s clips.
Expression. Exclamation marks, commas, full stops, slang etc. are possible, but not natural. As a result “I found that everything tends to run together”
Privacy. “On several occasions, I found myself wandering off to a small room or closet so that others couldn’t hear what I was talking about.”
Loss of context. Chat client. Using voice means I have to break out of the normal short-messaging paradigm that I’m used to. It changes to asynchronous audible communication. Without those visual cues, I’m not sure where I’m up to, or what I want to say next.
A lot of this could just be that we’re not used to it.
✦ Google search with auto-suggest
✦ Dictation
✦ Accessibility*
✦ Control by command (XBox Kinect; Dragon for desktop)
Day of Voice: what worked
Examples of some useful and surprising experiences with voice
Google search. “brilliant for rarely used words like 'oesophagus' or 'onomatopoeia', and much faster than guessing letters and typing.”
Dictation. “Recording of notes is easy and I've done it on a number of occasions as I'd much prefer to talk than to type.” It can make light work of the tedious task of typing on a mobile device. Even at 80% accuracy, this is way faster than typing, for longer messages.
Accessibility. Blind person using Instagram [video]
‘How Blind People Use Instagram’, Tommy Edison, 2012. http://bit.ly/YBmBmb
Blind man uses Instagram (video)
http://www.youtube.com/watch?v=P1e7ZCKQfMA
http://www.google.com/nexus/4/
✦ On-board hardware (microphone and speaker)
✦ hands busy + eyes busy context of use
✦ Personal and ‘always with you’ nature of device suits idea of ‘virtual assistants’
Why is this so relevant for mobile?
[Timeline, built up over several slides. Axis: 1983 to 2011, marked every two years (’83, ’85, ... ’11).]
Data: http://isc.org; http://amta.org.au; http://wikipedia.org and various websites
Networks: AMPS (analogue); GSM (2G, WAP/WML/i-mode); 3G (UMTS, NextG)
Devices: Palm Treo; Motorola Brick; Nokia 5110; Motorola RAZR; HTC Dream (1st Android); iPhone 3
Milestones, in the order they are added to the slide: Telecom ‘Walkabout’; Kurzweil Reading Machine ← (1976); 1st commercial large vocabulary speech recogniser; SMS is born; Predictive Text (T9); Telephone Banking; 1st dial-in IVR (DTMF); Dragon Dictate v1 for PC; Google voice search app; Voice control (iOS3); Voice actions (Froyo); Swype; SIRI & Google Now; Visual IVR
The beginnings of speech recognition technology predate mobile telephony. It goes back to the 50s, but let’s look at the last 30 years.
• Ray Kurzweil’s reading machine: a speech synthesiser for blind people.
• +10 years: the first commercial speech recogniser is created. It’s enormous, and very expensive.
• The next decade: mobile devices get smaller and more prolific. The internet starts to take off.
• (early 90s) SMS, then T9 later that decade
• (’95-2000) Dragon dictation, 1st IVR over DTMF, telephone banking
• Touch devices happen
• Google voice search (2008)
• Voice Control for iOS, then Voice Actions a year later
• Swype text input
• Voice controlled virtual assistants (SIRI and Google Now) 2012
• Visual IVR
Ray Kurzweil is now a Director of Engineering at Google, leading a search AI program. http://techcrunch.com/2013/01/06/googles-director-of-engineering-ray-kurzweil-is-building-your-cybernetic-friend/
http://www.flickr.com/photos/carnamah/5859235859
What do people want?
If I had asked people what they wanted,
they would have said faster horses.
Henry Ford, nineteen twenty never
Henry didn’t actually say this... Someone at Harvard Business Review went looking, and got a response from the Henry Ford Museum, who have researched the topic before and found no satisfactory evidence that Ford in fact said it!
The point is... I believe there’s a misconception that people don’t like voice as an interaction method. I would argue that people will use whatever input method gets the job done quickly and with minimum fuss - that can be ‘voice’.
I wonder what people said about:
• T9
• Touch
• Mobile telephony
• or even computers
Used with permission by Kenneth Johnson. http://kennethjohnson.us/
✦ All the robots!
✦ Google glass
Imagine the future...
if machines could understand.
A few examples:
- HAL 9000 (2001: A Space Odyssey)
- T-800 (Terminator)
- Johnny 5 (Short Circuit)
- Data (Star Trek)
- Robocop and ED-209 (Robocop)
Not just movies... CSI and other such shows are riddled with intelligent, understanding, all singing, all dancing, talking computers.
Sci-Fi movies have been spruiking the possibilities for decades. In reality, we’re moving at a much slower pace, but things like Google Glass are coming - in fact, you can participate in the trial study right now if you like.
Voice recognition technology
Main types of voice interaction
Design principles
Let’s talk about Voice
A (very) quick look at the technology
[Diagram, built up over several slides: the device talks to a speech recognition & synthesis service (voice-to-text, text-to-speech) and to other services: search engine, customer database, private APIs, transaction gateway, 3rd party APIs. The builds label three hops A, B and C.]
This is one configuration, that we used on a recent project. There are many other ways this could be done.
• sound clip recorded
• clip sent to VTT (voice-to-text)
• VTT interprets/translates
• sent back as text
• device sends text to other services (e.g. search engine)
• data sent back to the device (often multiples, with a confidence rating)
• device sends text to be voiced over (e.g. a summary of the data presented to user)
• TTS (text-to-speech) creates a voice clip and sends it back to the device
• device presents the data and plays the voice clip
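The round trip in those steps can be sketched in a few lines of Python. This is only an illustration of the flow, not the project's code: the three service functions are hypothetical stubs that return canned data, and the `Result` type, its fields and the summary wording are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Result:
    title: str
    confidence: float  # 0.0 to 1.0, as reported by the (stubbed) back end


def voice_to_text(clip: bytes) -> str:
    """Hypothetical stub for the remote voice-to-text (VTT) call."""
    return "are there any good deals nearby"


def search(query: str) -> List[Result]:
    """Hypothetical stub for the other services, e.g. a search engine."""
    return [Result("2-for-1 movie tickets", 0.64),
            Result("Half-price coffee", 0.82)]


def text_to_speech(text: str) -> bytes:
    """Hypothetical stub for the remote text-to-speech (TTS) call."""
    return text.encode("utf-8")  # stands in for a real audio clip


def handle_utterance(clip: bytes) -> Tuple[List[Result], bytes]:
    text = voice_to_text(clip)          # clip goes to VTT, text comes back
    results = search(text)              # text goes on to the other services
    results.sort(key=lambda r: r.confidence, reverse=True)
    if results:
        summary = f"I found {len(results)} results. The top one is {results[0].title}."
    else:
        summary = "Sorry, I found nothing."
    voice_clip = text_to_speech(summary)  # summary goes to TTS
    return results, voice_clip            # device shows the data, plays the clip
```

Every arrow in the diagram is a network call in the real thing, which is why the latency of each hop matters so much later on.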
http://www.flickr.com/photos/citychiccountrymouse/3856797711
PURPOSE: Measure accuracy and latency of current voice recognition solutions
METHOD:
✦ 4 vendor solutions
✦ 14 test phrases for translation
✦ 12 participants
✦ phrases recorded ‘fast’ and ‘slow’
Let’s Benchmark!
“Are there any good deals nearby”
I’ll get any goodies nearby ✘
Are there any deals near me ✔
Adding any deals any of me ✘
Are there any good deals nearby ✔
Objective (exact) and subjective matching.
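As a rough sketch of the two kinds of matching: exact matching is easy to automate, while subjective matching in a test like this is a human call. The word-overlap score below is only an assumed stand-in for that judgement, and the function names are made up for the example.

```python
import string
from typing import List

# Punctuation to strip before comparing, including curly quotes.
_PUNCTUATION = string.punctuation + "’‘“”"


def normalise(text: str) -> List[str]:
    """Lower-case, strip punctuation and split into words."""
    table = str.maketrans("", "", _PUNCTUATION)
    return text.lower().translate(table).split()


def exact_match(spoken: str, recognised: str) -> bool:
    """Objective match: the same words in the same order."""
    return normalise(spoken) == normalise(recognised)


def word_overlap(spoken: str, recognised: str) -> float:
    """Share of the distinct spoken words that also appear in the result."""
    spoken_words = set(normalise(spoken))
    if not spoken_words:
        return 0.0
    return len(spoken_words & set(normalise(recognised))) / len(spoken_words)


PHRASE = "Are there any good deals nearby"
CANDIDATES = [
    "I’ll get any goodies nearby",
    "Are there any deals near me",
    "Adding any deals any of me",
    "Are there any good deals nearby",
]

if __name__ == "__main__":
    for candidate in CANDIDATES:
        print(candidate, exact_match(PHRASE, candidate),
              round(word_overlap(PHRASE, candidate), 2))
```

A person would still need to decide where "close enough" sits; an overlap score only helps to sort the results before that review.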
Solution                       Average accuracy   People tested   Comments
iSpeech                        10%                4               Discarded after initial testing
Google                         47%                12              Non-supported API
Nuance - high quality audio    56%                12              10x file size
Nuance - low quality audio     50%                12              1x file size
Siri                           64%                12              Not a reusable product
Average accuracy of voice solutions
Average accuracy.
It’s a small number of participants. I’m sure you could find much more comprehensive test results from other sources. Knock yourself out!
[Chart: Accuracy of voice recognition by participant. Y axis: accuracy, 0 to 100. X axis: participants P1 to P12. Series: Google Voice, Nuance Wav, Nuance Speex, SIRI.]
Accuracy by participant. Here’s Google Voice in pink, and now Nuance, and the other two vendors tested.
This tells us there is significant variation in accuracy, from person to person.
[Chart: Average accuracy of voice recognition by accent. Y axis: accuracy, 0 to 100. X axis (number of participants): Australian (2), Indian (3), Singaporean (3), American (1), Hong Kong (1), Malaysian (1), Chinese (1). Series: Google Voice, Nuance Wav, Nuance Speex, SIRI.]
It’s a similar story across the different accents.
A (very) quick look at the technology
[Diagram repeated: speech recognition & synthesis service (voice-to-text, text-to-speech); search engine, customer database, private APIs, transaction gateway, 3rd party APIs; hops labelled A, B and C.]
Remember A, B, C? We’re going to measure latency now.
2 weeks, sampling every 30 mins.
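[Chart: Comparison of latency performance (seconds). Y axis: 0 to 60. X axis: 3G (in Asia), WiFi (private). Series: Nuance, Google. Data labels as extracted: 3, 16, 10, 21, 24.]
[Chart (untitled): latency by step, in seconds. Y axis: 0 to 60. X axis: 3G (in Asia), WiFi (private). Series: Voice-to-Text, ‘Stuff’ in the cloud, Text-to-speech. Data labels as extracted: 3, 18, 10, 22, 4, 16.]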
Let’s measure latency of each of those steps.
Enormous latency! Over 40 seconds over 3G. Absurd.
One important note is that these times represent a whole phrase. The phrases are not broken down and processed a few words at a time, as is the case with products like the Google voice search app.
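Measuring each step separately might look something like the sketch below. It is not the harness used for these tests: the three stage functions are hypothetical stubs that just sleep for a moment, and the stage names are chosen to mirror the chart.

```python
import time
from typing import Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[object], object]]


def time_stages(stages: List[Stage], payload: object) -> Dict[str, float]:
    """Run each stage in order and record how long it took, in seconds."""
    timings: Dict[str, float] = {}
    for name, stage in stages:
        start = time.perf_counter()
        payload = stage(payload)  # the output of one stage feeds the next
        timings[name] = time.perf_counter() - start
    timings["total"] = sum(timings.values())
    return timings


def fake_voice_to_text(clip: object) -> object:
    time.sleep(0.02)  # pretend network round trip
    return "are there any good deals nearby"


def fake_cloud_services(text: object) -> object:
    time.sleep(0.03)
    return "2 deals found"


def fake_text_to_speech(text: object) -> object:
    time.sleep(0.01)
    return str(text).encode("utf-8")


STAGES: List[Stage] = [
    ("voice-to-text", fake_voice_to_text),
    ("cloud services", fake_cloud_services),
    ("text-to-speech", fake_text_to_speech),
]

if __name__ == "__main__":
    for name, seconds in time_stages(STAGES, b"clip").items():
        print(f"{name}: {seconds:.3f}s")
```

Run on a schedule from each network you care about, a loop like this gives a per-step breakdown rather than a single end-to-end figure.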
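[Chart: Comparison of latency performance (seconds), without the ‘other stuff’. Y axis: 0 to 60. X axis: 3G (in Asia), WiFi (private). Series: Nuance, Google. Data labels as extracted: 3, 16, 24.]
[Chart (untitled): latency by step, in seconds. Y axis: 0 to 60. X axis: 3G (in Asia), WiFi (private). Series: Voice-to-Text, Text-to-speech. Data labels as extracted: 3, 18, 4, 16.]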
Even when we cut out the ‘other stuff’, and measure only VTT and TTS services, it’s still really very slow.
Some of this can be improved with colocation of servers and services. This test involved servers that were geographically spread over the globe. However, that isn’t always feasible, depending on the services you are connecting with, and where they are served from.
http://www.flickr.com/photos/lisovy/5415681393/
✦ Even the best recognisers we tested struggle to get much beyond 60% accuracy
✦ Latency is a problem, especially over slower networks
Conclusions
Consider the effect when these compound. It takes ages to get the result, and there’s a high likelihood it will be incorrect.
Not ideal.
My friend Rod Farmer kindly pointed out that it is possible to run concurrent requests - translating a few words at a time - in order to reduce latency significantly. For our limited prototype, this kind of engineering wasn’t feasible. Nonetheless, the recommendations that follow are helpful regardless of latency.
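To give a feel for the concurrent approach, here is a minimal sketch using Python's standard thread pool. It is not what we built. The recogniser is a hypothetical stub, and the "audio chunks" are faked as short strings of words; splitting real audio sensibly (on pauses, say) is the hard part and is not shown.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def recognise_chunk(chunk: str) -> str:
    """Hypothetical stub recogniser: every request costs one round trip."""
    time.sleep(0.05)
    return chunk.lower()


def split_into_chunks(words: List[str], size: int) -> List[str]:
    """Group words into chunks of `size` words each."""
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def recognise_concurrently(chunks: List[str], max_workers: int = 4) -> str:
    """Send all chunks at once and stitch the text back together."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, so the words
        # come back in the order they were spoken.
        return " ".join(pool.map(recognise_chunk, chunks))


if __name__ == "__main__":
    chunks = split_into_chunks("Are there any good deals nearby".split(), 2)
    print(recognise_concurrently(chunks))
```

With the requests in flight at the same time, the wait is closer to one round trip than to the sum of all of them, at the cost of the recogniser seeing less context per request.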
Main ways of interacting with voice
Commands Dictation
Natural Language Identification
http://www.flickr.com/photos/bengrogan/2147048247
Command-based interactions
think of: Selective hearing.
✦ System only hears what it is listening for
✦ Structured/scripted
Command-based systems are like ‘selective hearing’.
The system only knows how to understand things that it is listening for. It’s a structured, generally tedious way of interacting. It often feels scripted and impersonal, which are the kind of attributes that typically offend customers.
This was typically the back-bone of the early IVRs (late 90s-2000s).
AAMI, the Australian insurance company, has built its unique market position on exactly that. You might be familiar with the ‘Moira’ campaign.
Think about any time you’ve called your mobile provider. I know it feels tedious, but ask yourself - would it be any better if you spoke with a person?
Customers hate:
1. repeating themselves (usually because of a routing issue)
2. waiting in queue
Telstra has 2nd biggest call centre
with 600 unique reasons to call
200,000 inbound calls per day
handling 1M transfers per month
I’d like to argue that speaking with a real agent may well be a poorer experience than a machine. Why? Humans aren’t perfect either:
- Attitude
- Accents
- Understanding
- Consistency
There are also times when we might simply prefer a machine. I can think of one or two times when I’ve really hoped to get to voicemail, because the person I was calling is difficult to talk with. Or perhaps you’re five weeks overdue on your invoice, and would prefer not to explain yourself, but instead get it paid through an IVR.
We’re talking about command based interactions - Strictly, most IVRs today have moved beyond simple ‘commands’. They usually begin with an open prompt, before moving to menu mode. We’ll discuss that in more detail in a moment.
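‘Selective hearing’ is simple enough to sketch. The command phrases and action names below are invented for the example, and a real system would match against the recogniser's grammar rather than a finished transcript, but the idea is the same: anything outside the list is simply not heard.

```python
from typing import Optional

# Hypothetical command list: the only phrases the system listens for.
COMMANDS = {
    "next step": "NEXT",
    "go back": "BACK",
    "start timer": "TIMER_START",
    "stop timer": "TIMER_STOP",
}


def match_command(transcript: str) -> Optional[str]:
    """Return the action for the first known phrase found, else None."""
    heard = " ".join(transcript.lower().split())  # tidy case and spacing
    for phrase, action in COMMANDS.items():
        if phrase in heard:
            return action
    return None  # selective hearing: everything else is ignored


if __name__ == "__main__":
    print(match_command("Please go back"))
    print(match_command("What's the weather like?"))
```

The short list is also what makes this style dependable: the fewer things the system is listening for, the less there is to confuse.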
A very clever use of simple voice commands to control an interface - entirely appropriate for the context of use you’d expect for this scenario (sticky fingers etc.)
Other noteworthy examples:
- Xbox Kinect
- Dragon for desktop
✦ Great as a text-input replacement, particularly for mobile, where keypads are tedious
✦ It doesn’t need to ‘understand’
✦ Predictive dictation, based on data
http://www.flickr.com/photos/vivax_imago/5603582392
Dictation
think of: Hearing, but not understanding.
The machine hears what you tell it, but can’t make meaning from it.
I think we all understand how dictation works. The user says something, their speech is ‘recognised’ and then usually converted from voice to text.
If it is reasonably accurate, it’s easy to see how this can be helpful.Driving or walking down the street while composing SMS on a touch screen is hideously difficult. Dangerous, and possibly illegal. Dictation frees you up to focus on other things.
Complex vocabulary often also benefits from dictation. A word like oesophagus is difficult to spell, and you could be left guessing what letter it starts with a few times before T9 kicks in to save the day. Dictating it is likely to be quicker and easier.
Nuance’s Powerscribe360, for medical practitioners, is a great example of that in action.
It’s no coincidence that major mobile operating systems have this embedded right at the core.
Just as it’s no coincidence that Google have just employed Ray Kurzweil as director of Engineering. Are they building SkyNet?
[Slide build, one step per slide: “on a mac”, then “on a mat”, then “onomatopoeia mean?”]
Example of predictive dictation: “What does onomatopoeia mean?”
The machine still doesn’t “understand” in the way we mean it. But just like search engines, it can predict what we mean based on statistical modeling.
Think of how many billions of search queries Google has at hand, that are used to inform these statistical models.
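A toy version of that prediction, to make the idea concrete. None of this is how Google does it: the query counts and the acoustic scores are made up, and the scoring (acoustic likelihood times a smoothed prior from past queries) is just one simple way to combine the two.

```python
import math
from typing import List, Tuple

# Hypothetical counts of how often each phrase has been searched before.
QUERY_COUNTS = {
    "what does onomatopoeia mean": 900,
    "what does on a mac mean": 5,
    "what does on a mat mean": 2,
}


def rank(candidates: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Order (text, acoustic_score) pairs by acoustic score times prior."""
    total = sum(QUERY_COUNTS.values())

    def score(item: Tuple[str, float]) -> float:
        text, acoustic = item
        # Add-one smoothing so unseen phrases are unlikely, not impossible.
        prior = (QUERY_COUNTS.get(text, 0) + 1) / (total + len(QUERY_COUNTS))
        return math.log(acoustic) + math.log(prior)

    return sorted(candidates, key=score, reverse=True)


# The recogniser "hears" the wrong phrase slightly better (scores assumed).
HEARD = [
    ("what does on a mac mean", 0.40),
    ("what does on a mat mean", 0.35),
    ("what does onomatopoeia mean", 0.25),
]

if __name__ == "__main__":
    for text, _ in rank(HEARD):
        print(text)
```

The sound alone favours “on a mac”; it is the weight of past queries that pulls “onomatopoeia” to the top. No understanding required.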
http://bit.ly/XPJ7DC
✦ ‘natural language’ interactions
✦ The machine understands* meaning, and can then respond in a helpful, meaningful and personal way
Virtual Assistants
think of: hearing and understanding*
This is like hearing and understanding.
‘Understanding’ has an asterisk next to it, and you’ll see why over the next few slides.Machines have a really hard time trying to understand meaning - Why...
‘Subliminal: How your unconscious mind rules your behaviour’, p. 34. Leonard Mlodinow, 2012.
The cooking teacher said the students
made good snacks.
Meaning is nuanced
The cannibal said the students made
good snacks.
It’s because human communication is complex and nuanced, and it can’t easily be automated or codified.
Herein lies one of the biggest challenges for ‘intelligent’ or ‘understanding’ voice systems.
“Teachers and Cannibals” is a basic example. As humans, we easily understand the meaning of these two statements that differ by only a single word. And you’re probably alarmed - I hope you’re alarmed - by the latter.
Machines don’t understand this as easily.
‘Subliminal: How your unconscious mind rules your behaviour’, p. 34. Leonard Mlodinow, 2012.
The spirit is willing, but the flesh is weak
The vodka is strong, but the meat is rotten
A common homily,
when programmatically translated
Here’s another example...
http://www.flickr.com/photos/lifementalhealthpics/8384573785
✦ Semantic classification
✦ Statistical probability modeling
✦ Creating a perception of understanding
What is machine ‘understanding’
Documents, conversations, or any kind of content can be manually classified or coded for meaning, and this becomes a model that the machine can use for matching.
Statistical algorithms similar to those used in search engines are also used to help the machine perform better, based on past behaviour of other people.
This creates a perception of understanding or intelligence. You might call that ‘Artificial Intelligence’.
Vocabulary is an important factor in the accuracy of probability modeling. The radiography reader was an early speech recognition system that succeeded because the vocabulary in radiography is constrained and the acoustic signatures of its words are quite distinct, so the algorithms perform better.
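As a rough illustration of manual semantic classification, the sketch below matches an utterance against hand-coded example phrases by counting shared words. The intents, phrases and the classify function are all invented for this example; production systems use far richer models, but the principle of matching against coded content is the same.

```python
# Hypothetical hand-coded model: each intent is labelled with example phrases.
CODED_EXAMPLES = {
    "check_balance": [
        "what is my account balance",
        "how much money do i have",
    ],
    "report_fault": [
        "my internet is not working",
        "the connection keeps dropping",
    ],
}

def classify(utterance):
    """Return the intent whose coded phrase shares the most words, or None."""
    words = set(utterance.lower().split())
    best_intent, best_overlap = None, 0
    for intent, phrases in CODED_EXAMPLES.items():
        for phrase in phrases:
            overlap = len(words & set(phrase.split()))
            if overlap > best_overlap:
                best_intent, best_overlap = intent, overlap
    return best_intent

print(classify("my internet stopped working"))
```

The smaller and more distinct the vocabulary, the fewer phrases compete for the same words, which is one way to see why constrained domains are easier.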
http://www.flickr.com/photos/lifementalhealthpics/8384573785
✦ Can you access data to help do the thinking on behalf of your users?
✦ Prediction of customer needs
✦ Personalisation
System awareness
When a customer interacts with a service, various bits of data may be available:
- identity
- account status
- location of call
- time of day
- device being used
This can be used to predict customer needs.
Example: An engineer cuts a cable that wipes out internet for all of Brunswick. 30,000 customers are affected. For customers calling in from that geographic area, the system has an automated response telling them about the problem. The customer hangs up. Lots of money saved.
Doing this gave a 20% improvement in routing and/or task completion, compared with 2% from ‘tuning’ the semantic and statistical modeling.
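The Brunswick example can be sketched in a few lines. This is only an illustration under assumed names: CallContext, KNOWN_OUTAGES, opening_response and the message wording are hypothetical, and the overdue-account branch is an invented second use of account status.

```python
from dataclasses import dataclass

@dataclass
class CallContext:
    """Data that may already be known when the call connects (hypothetical)."""
    customer_id: str
    suburb: str
    account_overdue: bool

# Hypothetical outage register maintained by the operations team.
KNOWN_OUTAGES = {
    "Brunswick": "We know about an internet outage in Brunswick and are working on it.",
}

def opening_response(ctx, outages=KNOWN_OUTAGES):
    """Choose the first thing the system says, using what it already knows."""
    if ctx.suburb in outages:
        # Answer the likely question before the customer has to ask it.
        return outages[ctx.suburb]
    if ctx.account_overdue:
        return "Are you calling about your overdue account?"
    return "How can I help you today?"

caller = CallContext(customer_id="c-001", suburb="Brunswick", account_overdue=False)
print(opening_response(caller))
```

No speech recognition is involved in the first branch at all, which is the point: the biggest gains come from what the system already knows.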
Blade Runner, 1982. Warner Bros. img: http://replicant976.tumblr.com/image/12757032749
The Uncanny Valley
is not something we need
worry about.
Yet.
The Uncanny Valley is a hypothesis in robotics that suggests that as robots approach human likeness, they provoke feelings of revulsion in humans.
It doesn’t really apply to virtual agents, and so far, our experience has been that there is a long way to go before voice synthesis approaches human likeness - so it’s really nothing to worry about yet.
‘Sneakers’, Universal Studios, 1992. img: http://lat.ms/ZlHtN0
✦ Voice biometrics
Identification
think of: “My voice is my passport, verify me!”
Who remembers the film Sneakers? One of my favourites.
A team of security specialists steals the keycard and vocal codes of Werner Brandes, an unsuspecting employee of the ‘front’ company operated by a bad guy who intends to become wealthy by using a decryption device to defraud companies for his own benefit.
In the end, the good guys win, and in a postscript, they use the Janek decryption device to steal from the rich and give to the poor. A modern-day Robin Hood story.
This is a nice example of using voice biometrics for multi-factor authentication. There are obvious applications for this, particularly for things like banking, where the second factor is often SMS, which has several limitations.
Some 20 years later, we’re starting to see this kind of security for real.
Voice recognition technology
Main types of voice interaction
Design principles
We’ve seen:
- opportunities for humans to interact with computers in helpful ways
- constraints in the capabilities of technology to deliver against this promise
- objectives in business to optimise operating costs and improve customer service
These are essentially the same ingredients as in any design problem, aren’t they? So let’s look at some principles that apply specifically to voice...
AT&T Visual IVR Project http://www.att.com/gen/press-room?pid=23362
✦ High latency, low accuracy...
✦ Help users recover by offering alternatives
Design for failure
This could be a multi-modal interface, or it could be a translated interface like this example of visual IVR, which lets users traverse the IVR tree using a touch menu.
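One way to picture designing for failure is a simple rule that decides what happens after each recognition attempt. This is a sketch under assumptions: the next_step function, the confidence threshold of 0.6 and the limit of two attempts are illustrative values, not figures from any real product.

```python
def next_step(confidence, failed_attempts, threshold=0.6, max_attempts=2):
    """Decide what to do after a recognition attempt.

    confidence: the recogniser's confidence in this attempt, from 0.0 to 1.0.
    failed_attempts: how many earlier attempts at this task already failed.
    """
    if confidence >= threshold:
        return "proceed"
    if failed_attempts + 1 < max_attempts:
        # First miss: ask again, ideally with a more specific prompt.
        return "reprompt"
    # Repeated misses: stop asking and offer another mode, e.g. a touch menu.
    return "offer_touch_menu"

print(next_step(confidence=0.3, failed_attempts=1))
```

The design choice is that the user is never stuck repeating themselves: after a small number of misses the system changes mode instead of asking again.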
✦ Don’t treat voice as a ‘me too’ feature (will your product or your customers actually benefit from voice... really?)
✦ Think twice before introducing redundancy
Would you like voice with that?
Voice is the hot new thing right now, but resist the hype. It’s not trivial to implement, and even if it were, does that validate it as a ‘must have’ feature for your product?
Voice is integrated into the OS of modern devices. Their technology is mature. It can be used with any input field, any interface. The interaction design is polished and extensively tested. Use that! If you can.
✦ Understand the various modes of voice interaction
✦ Be careful about mixing modes (is that a command or a conversation?)
Know when and how to use voice
When you are designing for voice, understand the modes.
Command, dictate, natural language, identity.
✦ Support multi-modal interactions and make it as seamless as possible (voice, gesture, type, other)
✦ Test, iterate, test, iterate...
Let users decide how to interact
Don Norman, 2003: “I believe that voice interfaces hold their greatest promise as an additional component to a multi-modal dialogue, rather than as the only interface channel.”
Dictate and edit is a prime example of this. It’s beautifully crafted.
Voice -> type
gesture -> voice
Test and iterate. Voice still isn’t a common/normal interaction, so you will likely get it a bit wrong the first few times.
Don’t make me think
“A simple voice interface can only be as good as what the customer thinks they want. A better system is one that understands what their needs are likely to be, based on what’s known about them.”
✦ Personalisation
✦ Work on making the system ‘smarter’
Create a perception of understanding
The speech recognition and synthesis tools have become commodities. Focus your energies on helping the system seem smarter.
Jonny Schneider
Lead Consultant
Mobile Experience Design & Strategy
[email protected]
@jonnyschneider
au.linkedin.com/in/jonnyschneider/
All images used by permission