designing for voice interactions (uxaustralia)

source: http://www.flickr.com/photos/altemark/304079314 Designing for Voice Interactions UX Australia Designing for Mobility Melbourne, March 1 2013 Jonny Schneider Lead Consultant Mobile Experience Design & Strategy

Upload: jonny-schneider

Post on 08-May-2015


DESCRIPTION

Voice controlled interfaces and natural language processing are becoming normal. Major mobile platforms now have voice integrated into the operating system, and Google has been engineering its voice search experience in leaps and bounds recently. Voice interactions will transform how we interact with computers and – just like the touch interfaces that came before them – mobile devices are driving the change.

TRANSCRIPT

Page 1: Designing for Voice Interactions (UXAustralia)

source: http://www.flickr.com/photos/altemark/304079314

Designing for Voice Interactions

UX Australia

Designing for Mobility

Melbourne, March 1 2013

Jonny Schneider

Lead Consultant

Mobile Experience Design & Strategy

Page 2: Designing for Voice Interactions (UXAustralia)


When you think of voice recognition, you probably think of...

‘Understanding Moira’, AAMI TV Commercial, http://www.youtube.com/watch?v=EY_jL38HMy8

inaccurate

too slow

never works

it’s a gimmick

too tedious for me

“I won’t use it until it’s faster and more accurate than typing”

it can’t handle my accent

A lot of those things might be true, but this is default thinking, likely based on many bad experiences. However, there are two sides to every story.

Page 3: Designing for Voice Interactions (UXAustralia)

https://twitter.com/bennyg/status/167192535305945088


Page 4: Designing for Voice Interactions (UXAustralia)

http://www.flickr.com/photos/av_hire_london/5579125851

IDEA: Experience first-hand what it's like to interact with digital devices using predominantly your voice.

METHOD: A group of colleagues committed to using voice wherever possible, for an entire day.

Day of Voice

Let’s take a more objective look at what it’s like to use voice in our everyday interactions. Today.

Page 5: Designing for Voice Interactions (UXAustralia)

✦ Controlling the device is tedious

✦ I’m sorry, I can’t do that for you

✦ Comprehension/recognition

✦ Expression

✦ Privacy

✦ Loss of context/paradigm

Day of Voice: what didn’t work

Control: “Dictation itself was fine, but getting to where notes are taken was very tedious.” “I couldn’t navigate to where I needed to be. It heard the command correctly, but didn’t know what to do with it.”

Limitations: Generally, it’s not pervasive enough to be relied upon.

“I can’t...”
- “play games with voice”!
- attach to email
- dictate an email address ("schneider dot jonny at gmail com")
- edit an address

Recognition: e.g. Pam’s clips.

Expression: Exclamation marks, commas, full stops, slang etc. are possible, but not natural. As a result, “I found that everything tends to run together.”

Privacy: “On several occasions, I found myself wandering off to a small room or closet so that others couldn’t hear what I was talking about.”

Loss of context: Chat client. Using voice means I have to break out of the normal short-messaging paradigm that I’m used to. It changes to asynchronous audible communication. Without those visual cues, I’m not sure where I’m up to, or what I want to say next.

A lot of this could just be that we’re not used to it.

Page 6: Designing for Voice Interactions (UXAustralia)

✦ Google search with auto-suggest

✦ Dictation

✦ Accessibility*

✦ Control by command (XBox Kinect; Dragon for desktop)

Day of Voice: what worked

Examples of some useful and surprising experiences with voice

Google search. “Brilliant for rarely used words like 'oesophagus' or 'onomatopoeia', and much faster than guessing letters and typing.”

Dictation. “Recording of notes is easy and I've done it on a number of occasions as I'd much prefer to talk than to type.” It can make light work of the tedious task of typing on a mobile device. Even at 80% accuracy, this is way faster than typing for longer messages.

Accessibility. Blind person using Instagram [video]

Page 7: Designing for Voice Interactions (UXAustralia)

‘How Blind People Use Instagram’, Tommy Edison, 2012. http://bit.ly/YBmBmb

Blind man uses Instagram (video)

http://www.youtube.com/watch?v=P1e7ZCKQfMA

Page 8: Designing for Voice Interactions (UXAustralia)

http://www.google.com/nexus/4/

✦ On-board hardware (microphone and speaker)

✦ hands busy + eyes busy context of use

✦ Personal and ‘always with you’ nature of device suits idea of ‘virtual assistants’

Why is this so relevant for mobile?

Page 9: Designing for Voice Interactions (UXAustralia)

Data: http://isc.org; http://amta.org.au; http://wikipedia.org and various websites

[Timeline, 1983 to 2011. Mobile network generations: AMPS (analogue); GSM (2G, with WAP/WML/i-mode); 3G UMTS and NextG.]

The beginnings of speech recognition technology predate mobile telephony. It goes back to the 50s, but let’s look at the last 30 years.

• Ray Kurzweil’s reading machine: a speech synthesiser for blind people.
• +10 years: the first commercial speech recogniser is created. It’s enormous, and very expensive.
• The next decade: mobile devices get smaller and more prolific. The internet starts to take off.
• (early 90s) SMS, then T9 later that decade
• (’95-2000) Dragon dictation, 1st IVR over DTMF, telephone banking
• Touch devices happen
• Google voice search (2008)
• Voice Control for iOS, then Voice Actions a year later
• Swype text input
• Voice-controlled virtual assistants (Siri and Google Now), 2012
• Visual IVR

Ray Kurzweil is now a Director of Engineering at Google, leading a search AI program. http://techcrunch.com/2013/01/06/googles-director-of-engineering-ray-kurzweil-is-building-your-cybernetic-friend/

Page 19: Designing for Voice Interactions (UXAustralia)

Data: http://isc.org; http://amta.org.au; http://wikipedia.org and various websites

[Timeline, 1983 to 2011]

Networks: AMPS (analogue); GSM (2G, with WAP/WML/i-mode); 3G UMTS and NextG

Devices: Telecom ‘Walkabout’; Motorola Brick; Nokia 5110; Palm Treo; Motorola RAZR; iPhone 3; HTC Dream (1st Android)

Voice and text input: Kurzweil Reading Machine (1976); 1st commercial large-vocabulary speech recogniser; SMS is born; predictive text (T9); telephone banking; 1st dial-in IVR (DTMF); Dragon Dictate v1 for PC; Google voice search app; Voice Control (iOS 3); Voice Actions (Froyo); Swype; Siri & Google Now; Visual IVR


Page 20: Designing for Voice Interactions (UXAustralia)

http://www.flickr.com/photos/carnamah/5859235859

What do people want?

If I had asked people what they wanted, they would have said faster horses.

Henry Ford, nineteen twenty never

Henry didn’t actually say this... Someone at Harvard Business Review went looking, and got a response from the Henry Ford Museum, who have researched the topic before, and had found no satisfactory result to suggest that Ford in fact said it!

The point is... I believe there’s a misconception that people don’t like voice as an interaction method. I would argue that people will use whatever input method gets the job done quickly and with minimum fuss, and that can be ‘voice’.

I wonder what people said about:
• T9
• Touch
• Mobile telephony
• or even computers

Page 21: Designing for Voice Interactions (UXAustralia)

Used with permission by Kenneth Johnson. http://kennethjohnson.us/

✦ All the robots!

✦ Google glass

Imagine the future...

if machines could understand.

A few examples:
- HAL 9000 (2001: A Space Odyssey)
- T-800 (Terminator)
- Johnny 5 (Short Circuit)
- Data (Star Trek)
- Robocop, ED-209 (Robocop)

Not just movies... CSI and other such shows are riddled with intelligent, understanding, all-singing, all-dancing, talking computers.

Sci-fi movies have been spruiking the possibilities for decades. In reality, we’re moving at a much slower pace, but things like Google Glass are coming. In fact, you can participate in the trial study right now if you like.

Page 22: Designing for Voice Interactions (UXAustralia)

Let’s talk about Voice

✦ Voice recognition technology

✦ Main types of voice interaction

✦ Design principles

Page 24: Designing for Voice Interactions (UXAustralia)

A (very) quick look at the technology

[Architecture diagram: the device connects to a speech recognition & synthesis service (voice-to-text, text-to-speech) and to back-end services: search engine, customer database, private APIs, transaction gateway, 3rd party APIs.]

This is one configuration that we used on a recent project. There are many other ways this could be done.

• A sound clip is recorded
• The clip is sent to voice-to-text (VTT)
• VTT interprets/translates it
• The result is sent back as text
• The device sends the text to other services (e.g. a search engine)
• Data is sent back to the device (often multiple results, each with a confidence rating)
• The device sends text to be voiced over (e.g. a summary of the data presented to the user)
• Text-to-speech (TTS) creates a voice clip and sends it back to the device
• The device presents the data and plays the voice clip
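
The round trip in these notes can be sketched roughly as below. This is only an illustration under stated assumptions, not the project's code: `voice_to_text`, `query_services` and `text_to_speech` are hypothetical stand-ins for the remote services, and they return canned data instead of making network calls.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Candidate:
    """One result from a back-end service, with a confidence rating."""
    text: str
    confidence: float


def voice_to_text(clip: bytes) -> str:
    """Hypothetical stand-in for the remote voice-to-text (VTT) service."""
    return "are there any good deals nearby"


def query_services(text: str) -> list[Candidate]:
    """Hypothetical stand-in for a search engine or other back-end API."""
    return [
        Candidate("3 deals within 1 km", 0.82),
        Candidate("Deals in nearby suburbs", 0.41),
    ]


def text_to_speech(summary: str) -> bytes:
    """Hypothetical stand-in for the remote text-to-speech (TTS) service."""
    return summary.encode("utf-8")  # a real service would return audio


def handle_utterance(clip: bytes) -> tuple[list[Candidate], bytes]:
    text = voice_to_text(clip)              # clip goes to VTT, text comes back
    results = query_services(text)          # text goes on to other services
    best = max(results, key=lambda c: c.confidence)
    voice_clip = text_to_speech(best.text)  # summary goes to TTS
    return results, voice_clip              # device shows data, plays clip
```

Each of the three calls is a separate network round trip in the configuration described, which is why the order of the steps matters for latency.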

Page 27: Designing for Voice Interactions (UXAustralia)

A (very) quick look at the technology

[Architecture diagram: speech recognition & synthesis service (voice-to-text, text-to-speech); search engine, customer database, private APIs, transaction gateway, 3rd party APIs; now with labels A, B and C.]


Page 28: Designing for Voice Interactions (UXAustralia)

http://www.flickr.com/photos/citychiccountrymouse/3856797711

PURPOSE: Measure accuracy and latency of current voice recognition solutions

METHOD:

✦ 4 vendor solutions

✦ 14 test phrases for translation

✦ 12 participants

✦ phrases recorded ‘fast’ and ‘slow’

Let’s Benchmark!

Page 29: Designing for Voice Interactions (UXAustralia)

“Are there any good deals nearby”

I’ll get any goodies nearby

Are there any deals near me

Adding any deals any of me

Are there any good deals nearby ✔

Objective (exact) and subjective matching.
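
One way to picture the two kinds of scoring is the sketch below. It is an assumption about how such scoring could be done, not the method used in the benchmark: exact matching compares normalised word lists, and `word_similarity` is only a rough stand-in for what was presumably a human, subjective judgement. The function names are hypothetical.

```python
import difflib
import string


def normalise(phrase: str) -> list[str]:
    """Lower-case, strip punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation + "’‘")
    return phrase.lower().translate(table).split()


def exact_match(expected: str, heard: str) -> bool:
    """Objective match: the same words in the same order."""
    return normalise(expected) == normalise(heard)


def word_similarity(expected: str, heard: str) -> float:
    """Ratio in [0, 1] of matching words, via difflib.SequenceMatcher."""
    matcher = difflib.SequenceMatcher(None, normalise(expected), normalise(heard))
    return matcher.ratio()


expected = "Are there any good deals nearby"
for heard in [
    "I’ll get any goodies nearby",
    "Are there any deals near me",
    "Adding any deals any of me",
    "Are there any good deals nearby",
]:
    print(heard, exact_match(expected, heard), round(word_similarity(expected, heard), 2))
```

A phrase like “Are there any deals near me” fails the exact test but keeps the intent, which is the gap that subjective matching is meant to capture.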

Page 30: Designing for Voice Interactions (UXAustralia)

Average accuracy of voice solutions

Solution                      Average accuracy   People tested   Comments
iSpeech                       10%                4               Discarded after initial testing
Google                        47%                12              Non-supported API
Nuance (high quality audio)   56%                12              10x file size
Nuance (low quality audio)    50%                12              1x file size
Siri                          64%                12              Not a reusable product

It’s a small number of participants. I’m sure you could find much more comprehensive test results from other sources. Knock yourself out!

Page 31: Designing for Voice Interactions (UXAustralia)

Accuracy of voice recognition by participant

[Bar chart: accuracy (0 to 100) for each of 12 participants, P1 to P12. Series: Google Voice, Nuance Wav, Nuance Speex, Siri.]

Accuracy by participant. Here’s Google Voice in pink. And now Nuance. And the other two vendors tested.

This tells us there is significant variation in accuracy, from person to person.

Page 32: Designing for Voice Interactions (UXAustralia)

Average accuracy of voice recognition by accent

[Bar chart: average accuracy (0 to 100) by accent, with the number of participants in brackets: Australian (2), Indian (3), Singaporean (3), American (1), Hong Kong (1), Malaysian (1), Chinese (1). Series: Google Voice, Nuance Wav, Nuance Speex, Siri.]

It’s a similar story across the different accents.

Page 33: Designing for Voice Interactions (UXAustralia)

A (very) quick look at the technology

[Architecture diagram repeated: speech recognition & synthesis service (voice-to-text, text-to-speech); search engine, customer database, private APIs, transaction gateway, 3rd party APIs; labels A, B and C.]

Remember A, B, C? We’re going to measure latency now.

2 weeks, sampling every 30 mins.

Page 34: Designing for Voice Interactions (UXAustralia)

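Comparison of latency performance (seconds)

[Bar chart, 0 to 60 seconds: Nuance vs Google, over 3G (in Asia) and WiFi (private). Data labels, in extraction order: 3, 16, 10, 21, 24.]

[Bar chart, 0 to 60 seconds: latency split into Voice-to-Text, ‘Stuff’ in the cloud, and Text-to-speech, over 3G (in Asia) and WiFi (private). Data labels, in extraction order: 3, 18, 10, 22, 4, 16.]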

Let’s measure latency of each of those steps.

Enormous latency! Over 40 seconds over 3G. Absurd.

One important note is that these times represent a whole phrase. The phrases are not broken down and processed synchronously, as is the case with products like the Google voice search app.

Page 35: Designing for Voice Interactions (UXAustralia)

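Comparison of latency performance (seconds)

[Bar chart, 0 to 60 seconds: Nuance vs Google, over 3G (in Asia) and WiFi (private). Data labels, in extraction order: 3, 16, 24.]

[Bar chart, 0 to 60 seconds: latency of Voice-to-Text and Text-to-speech only, over 3G (in Asia) and WiFi (private). Data labels, in extraction order: 3, 18, 4, 16.]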

Even when we cut out the ‘other stuff’, and measure only VTT and TTS services, it’s still really very slow.

Some of this can be improved with colocation of servers and services. This test involved servers that were geographically spread over the globe. However, that isn’t always feasible, depending on the services you are connecting with, and where they are served from.

Page 36: Designing for Voice Interactions (UXAustralia)

http://www.flickr.com/photos/lisovy/5415681393/

✦ Even the best recognisers struggle to achieve higher than 60% accuracy

✦ Latency is a problem, especially over slower networks

Conclusions

Consider the effect when these compound. It takes ages to get the result, and there’s a high likelihood it will be incorrect.

Not ideal.
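
A rough way to put a number on that compounding, as a back-of-envelope sketch only: assume the user simply retries until the result is correct, and that every attempt is independent with the same accuracy (both are simplifying assumptions, not findings from the tests). The expected number of attempts is then 1 / accuracy, so the expected wait is latency / accuracy. The function name is hypothetical.

```python
def expected_seconds_until_correct(latency_s: float, accuracy: float) -> float:
    """Expected total wait, assuming independent retries until a correct result."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return latency_s / accuracy


# Illustrative inputs only: a 40-second round trip at 60% accuracy.
print(round(expected_seconds_until_correct(40, 0.6), 1))  # 66.7
```

Under those assumptions, a 40-second round trip at 60% accuracy averages out to a little over a minute before the user sees a correct result.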

My friend Rod Farmer kindly pointed out that it is possible to run concurrent requests, translating a few words at a time, in order to reduce latency significantly. For our limited prototype, this kind of engineering wasn’t feasible. Nonetheless, the recommendations that follow are helpful regardless of latency.
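
The concurrent-request idea can be illustrated with a small sketch. It was not part of the prototype: `recognise_chunk` is a hypothetical recogniser call that sleeps to simulate network latency, and the chunks are plain strings standing in for short audio segments (splitting real audio into segments is glossed over here).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def recognise_chunk(chunk: str, delay_s: float = 0.2) -> str:
    """Hypothetical recogniser call; sleeps to simulate network latency."""
    time.sleep(delay_s)
    return chunk.lower()


def recognise_sequential(chunks: list[str]) -> list[str]:
    """One request after another: total wait is roughly the sum of delays."""
    return [recognise_chunk(c) for c in chunks]


def recognise_concurrent(chunks: list[str]) -> list[str]:
    """All requests in flight at once: total wait is roughly one delay."""
    # map() preserves input order, so the words come back in sequence
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as pool:
        return list(pool.map(recognise_chunk, chunks))
```

With three chunks and a simulated 0.2 second delay each, the sequential version waits about 0.6 seconds in total, while the concurrent version waits about as long as a single request.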

Page 37: Designing for Voice Interactions (UXAustralia)

✦ Voice recognition technology

✦ Main types of voice interaction

✦ Design principles

Page 38: Designing for Voice Interactions (UXAustralia)

Main ways of interacting with voice

✦ Commands

✦ Dictation

✦ Natural Language

✦ Identification

Page 39: Designing for Voice Interactions (UXAustralia)

http://www.flickr.com/photos/bengrogan/2147048247

Command-based interactions

think of: Selective hearing.

✦ System only hears what it is listening for

✦ Structured/scripted

Command-based systems are like ‘selective hearing’.

The system only knows how to understand things that it is listening for. It’s a structured, generally tedious way of interacting. It often feels scripted and impersonal, which are the kind of attributes that typically offend customers.

This was typically the backbone of the early IVRs (late 90s-2000s).
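
As a minimal sketch of ‘selective hearing’: the system holds a fixed vocabulary and ignores everything else. The command words and action names below are hypothetical examples, not taken from any real IVR.

```python
from typing import Optional

# Hypothetical vocabulary: the only words the system is listening for.
COMMANDS = {
    "next": "show_next_step",
    "back": "show_previous_step",
    "repeat": "repeat_step",
}


def interpret(utterance: str) -> Optional[str]:
    """Return the action for the first known command word, else None."""
    for word in utterance.lower().split():
        if word in COMMANDS:
            return COMMANDS[word]
    return None  # heard something, but nothing it was listening for
```

Here `interpret("Next please")` returns `"show_next_step"`, while `interpret("what should I cook tonight")` returns `None`, which is the scripted, structured feel described above.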

AAMI, the Australian insurance company, has built its unique market position on exactly that. You might be familiar with the ‘Moira’ campaign.

Page 40: Designing for Voice Interactions (UXAustralia)

Think about any time you’ve called your mobile provider. I know it feels tedious, but ask yourself: would it be any better if you spoke with a person?

Customers hate:
1. repeating themselves (usually because of a routing issue)
2. waiting in a queue

Telstra has the 2nd biggest call centre, with:
- 600 unique reasons to call
- 200,000 inbound calls per day
- 1M transfers per month

I’d like to argue that speaking with a real agent may well be a poorer experience than speaking with a machine. Why? Humans aren’t perfect either:
- Attitude
- Accents
- Understanding
- Consistency

There are also times when we might simply prefer a machine. I can think of one or two times when I’ve really hoped to get to voicemail, because the person I was calling is difficult to talk with. Or perhaps you’re five weeks overdue on your invoice, and would prefer not to explain yourself, but instead get it paid through an IVR.

We’re talking about command-based interactions. Strictly, most IVRs today have moved beyond simple ‘commands’. They usually begin with an open prompt, before moving to menu mode. We’ll discuss that in more detail in a moment.

Page 41: Designing for Voice Interactions (UXAustralia)


A very clever use of simple voice commands to control an interface - entirely appropriate for the context of use you’d expect for this scenario (sticky fingers etc.)

Other noteworthy examples:
- Xbox Kinect
- Dragon for desktop

Page 42: Designing for Voice Interactions (UXAustralia)

✦ Great as a text-input replacement, particularly for mobile, where keypads are tedious

✦ It doesn’t need to ‘understand’

✦ Predictive dictation, based on data

http://www.flickr.com/photos/vivax_imago/5603582392

Dictation

think of: Hearing, but not understanding.

The machine hears what you tell it, but can’t make meaning from it.

I think we all understand how dictation works. The user says something, their speech is ‘recognised’ and then usually converted from voice to text.

If it is reasonably accurate, it’s easy to see how this can be helpful. Composing an SMS on a touch screen while driving or walking down the street is hideously difficult, dangerous, and possibly illegal. Dictation frees you up to focus on other things.

Complex vocabulary often benefits from dictation too. A word like oesophagus is difficult to spell, and you could be left guessing what letter it starts with a few times before T9 kicks in to save the day. Dictating it is likely to be quicker and easier.

Nuance’s Powerscribe360, a dictation product for medical practitioners, is a great example of that in action.

Page 43: Designing for Voice Interactions (UXAustralia)

It’s no co-incidence that major mobile operating systems have this embedded right at the core.

Just as it’s no co-incidence that Google has just employed Ray Kurzweil as Director of Engineering. Are they building SkyNet?

Page 44: Designing for Voice Interactions (UXAustralia)

on a mac

Example of predictive dictation:“What does onomatopoeia mean?”

The machine still doesn’t “understand” in the way we mean it. But just like search engines, it can predict what we mean based on statistical modeling.

Think of how many billions of search queries Google has at hand, that are used to inform these statistical models.
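To make the idea of prediction from statistics concrete, here is a toy sketch. It assumes the recogniser produces several candidate transcriptions with an acoustic score, then weights each by how often the phrase appears in a query log. All counts and scores below are invented for illustration; real systems use far richer language models.

```python
# Made-up counts standing in for how often each phrase appears between
# "what does" and "mean" in a large query log.
PHRASE_COUNTS = {
    "on a mac": 40,
    "on a mat": 5,
    "onomatopoeia": 900,
}


def pick_transcription(candidates: dict[str, float]) -> str:
    """Weight each acoustic score by a usage-based prior and return the best."""
    total = sum(PHRASE_COUNTS.values())

    def score(phrase: str) -> float:
        # Unseen phrases get a small count of 1 rather than zero.
        prior = PHRASE_COUNTS.get(phrase, 1) / total
        return candidates[phrase] * prior

    return max(candidates, key=score)


# Hypothetical acoustic scores: the audio alone slightly favours "on a mac".
acoustic = {"on a mac": 0.40, "on a mat": 0.35, "onomatopoeia": 0.25}
print(pick_transcription(acoustic))
```

The machine has not understood anything here. It has simply noticed that people ask what "onomatopoeia" means far more often than they ask what "on a mac" means.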

Page 45: Designing for Voice Interactions (UXAustralia)

on a mac on a mat

Page 46: Designing for Voice Interactions (UXAustralia)

on a mac on a mat onomatopoeia mean?

Page 47: Designing for Voice Interactions (UXAustralia)

on a mac on a mat onomatopoeia mean?

Page 48: Designing for Voice Interactions (UXAustralia)

http://bit.ly/XPJ7DC

✦ ‘natural language’ interactions

✦ The machine understands* meaning, and can then respond in a helpful, meaningful and personal way

Virtual Assistants

think of: hearing and understanding*

This is like hearing and understanding.

‘Understanding’ has an asterisk next to it, and you’ll see why over the next few slides. Machines have a really hard time trying to understand meaning - Why...

Page 49: Designing for Voice Interactions (UXAustralia)

‘Subliminal: How your unconscious mind rules your behaviour’, p. 34. Leonard Mlodinow, 2012.

The cooking teacher said the students

made good snacks.

Meaning is nuanced

The cannibal said the students made

good snacks.

It’s because human communication is complex and nuanced, and it can’t easily be automated or codified.

Herein lies one of the biggest challenges for ‘intelligent’ or ‘understanding’ voice systems.

“Teachers and Cannibals” is a basic example. As humans, we easily understand the meaning of these two statements, which differ by only a single word. And you’re probably alarmed - I hope you’re alarmed - by the latter.

Machines don’t understand this as easily.

Page 50: Designing for Voice Interactions (UXAustralia)

‘Subliminal: How your unconscious mind rules your behaviour’, p. 34. Leonard Mlodinow, 2012.

A common homily

The spirit is willing, but the flesh is weak

Here’s another example...

Page 51: Designing for Voice Interactions (UXAustralia)

‘Subliminal: How your unconscious mind rules your behaviour’, p. 34. Leonard Mlodinow, 2012.

The spirit is willing, but the flesh is weak

A common homily,

when programmatically translated

Page 52: Designing for Voice Interactions (UXAustralia)

‘Subliminal: How your unconscious mind rules your behaviour’, p. 34. Leonard Mlodinow, 2012.

The spirit is willing, but the flesh is weak

The vodka is strong, but the meat is rotten

A common homily,

when programmatically translated

Page 53: Designing for Voice Interactions (UXAustralia)

http://www.flickr.com/photos/lifementalhealthpics/8384573785

✦ Semantic classification

✦ Statistical probability modeling

✦ Creating a perception of understanding

What is machine ‘understanding’?

Documents, conversations, or any kind of content can be manually classified or coded for meaning, and this becomes a model that the machine can use for matching.

Statistical algorithms similar to those used in search engines are also used to help the machine perform better, based on past behaviour of other people.

This creates a perception of understanding or intelligence. You might call that ‘Artificial Intelligence’.

Vocabulary is an important factor in the accuracy of probability modeling. Radiography reader was an early speech recognition system that succeeded because the vocabulary in radiography is constrained, and the acoustic signatures of the words are quite different from one another. Therefore the algorithms are more successful.
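The manual classification described above can be sketched very roughly as follows. This is only an illustration under simple assumptions: the labelled utterances and label names are hypothetical, and matching is plain word overlap rather than anything a production system would use.

```python
from collections import Counter

# Hypothetical hand-labelled utterances; a real model would use many more.
LABELLED = [
    ("i want to pay my bill", "billing"),
    ("my bill is wrong", "billing"),
    ("my internet is not working", "fault"),
    ("the connection keeps dropping out", "fault"),
]


def build_model(examples):
    """Count how often each word appears under each label."""
    model = {}
    for text, label in examples:
        model.setdefault(label, Counter()).update(text.split())
    return model


def classify(utterance, model):
    """Pick the label whose word counts best overlap the utterance."""
    words = utterance.lower().split()
    scores = {
        label: sum(counts[w] for w in words) for label, counts in model.items()
    }
    best = max(scores, key=scores.get)
    # If no word matched any label, admit that nothing was recognised.
    return best if scores[best] > 0 else "unknown"


model = build_model(LABELLED)
print(classify("why is my bill so high", model))
print(classify("internet dropping out", model))
```

A constrained vocabulary helps because fewer words overlap between labels, so the scores separate more cleanly.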

Page 54: Designing for Voice Interactions (UXAustralia)

http://www.flickr.com/photos/lifementalhealthpics/8384573785

✦ Can you access data to help do the thinking on behalf of your users?

✦ prediction of customer needs

✦ Personalisation

System awareness

When a customer interacts with a service, various bits of data may be available:
- identity
- account status
- location of call
- time of day
- device being used

This can be used to predict customer needs.

Example: An engineer cuts a cable that wipes out internet for all of Brunswick. 30,000 customers are affected. For customers calling in from that geographic area, the system plays an automated response telling them about the problem. The customer hangs up. Lots of money saved.

Doing this can improve routing and/or task completion by 20%, compared with 2% from ‘tuning’ the semantic and statistical modeling.
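The outage example can be sketched as a simple rule that picks the opening prompt from what is already known about the caller. The field names, the outage table and the prompt wording are all assumptions made for this sketch.

```python
from dataclasses import dataclass

# Hypothetical table of areas with a known outage, maintained by operations.
KNOWN_OUTAGES = {"Brunswick": "an internet outage caused by a damaged cable"}


@dataclass
class CallContext:
    customer_id: str
    suburb: str
    account_overdue: bool


def opening_prompt(ctx: CallContext) -> str:
    """Choose the first prompt from what is already known about the caller."""
    if ctx.suburb in KNOWN_OUTAGES:
        issue = KNOWN_OUTAGES[ctx.suburb]
        return f"We are aware of {issue} in {ctx.suburb}. Are you calling about this?"
    if ctx.account_overdue:
        return "Your account has an overdue amount. Would you like to pay it now?"
    # Nothing useful is known, so fall back to an open prompt.
    return "In a few words, how can I help you today?"


print(opening_prompt(CallContext("c1", "Brunswick", False)))
```

The ordering of the checks is a design choice: the outage message comes first because it is the most likely reason for a call from that area.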

Page 55: Designing for Voice Interactions (UXAustralia)

Blade Runner, 1982. Warner Bros. img: http://replicant976.tumblr.com/image/12757032749

The Uncanny Valley

is not something we need

worry about.

Yet.

The Uncanny Valley is a hypothesis in robotics that suggests that as robots approach human likeness, they provoke feelings of revulsion in humans.

It doesn’t really apply to virtual agents, and so far, our experience has been that there is a long way to go before voice synthesis approaches human likeness - so it’s really nothing to worry about yet.

Page 56: Designing for Voice Interactions (UXAustralia)

‘Sneakers’, Universal Studios, 1992. img: http://lat.ms/ZlHtN0

✦ Voice biometrics

Identification

think of: “My voice is my passport, verify me!”

Who remembers the film Sneakers? One of my favourites.

A team of security specialists steal the keycard and vocal codes of Werner Brandes, an unsuspecting employee of the ‘front’ company operated by a bad guy who intends to become wealthy by using a decryption device to defraud companies for his own benefit.

In the end, the good guys win, and in a postscript, they use the Janek decryption device to steal from the rich and give to the poor. A modern day Robin Hood story.

This is a nice example of using voice biometrics for multi-factor authentication. There are obvious applications for this, particularly for things like banking, where the 2nd factor is often SMS, which has several limitations.

20 years later, we’re starting to see this kind of security for real.
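As a loose sketch of voice as a second factor, the example below compares a stored "voiceprint" with a new sample and only authenticates when the PIN is also correct. The feature vectors, the cosine similarity measure and the 0.8 threshold are all assumptions for illustration; real voice biometric systems use specialised models.

```python
import math


def cosine_similarity(a, b):
    """Similarity between two feature vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def authenticate(pin_ok, enrolled_print, sample_print, threshold=0.8):
    """Both factors must pass: something you know and something you are."""
    voice_ok = cosine_similarity(enrolled_print, sample_print) >= threshold
    return pin_ok and voice_ok


# Made-up feature vectors standing in for voiceprints.
enrolled = [0.9, 0.1, 0.4]
same_speaker = [0.88, 0.12, 0.41]
impostor = [0.1, 0.9, 0.2]

print(authenticate(True, enrolled, same_speaker))
print(authenticate(True, enrolled, impostor))
```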

Page 57: Designing for Voice Interactions (UXAustralia)

Voice recognition technology

Main types of voice interaction

Design principles

We’ve seen:
- opportunities for humans to interact with computers in helpful ways
- constraints in the capabilities of technology to deliver against this promise
- objectives in business to optimise operating costs and improve customer service

These are essentially the same ingredients as in any design problem, aren’t they? So let’s look at some principles that apply specifically to voice...

Page 58: Designing for Voice Interactions (UXAustralia)

AT&T Visual IVR Project http://www.att.com/gen/press-room?pid=23362

✦ High latency, low accuracy...

✦ Help users recover by offering alternatives

Design for failure

This could be a multi-modal interface, or it could be a translated interface like this example of visual IVR, which lets users traverse the IVR tree using a touch menu.
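One way to think about designing for failure is as a small decision rule: proceed when recognition is confident, re-prompt once, then offer another way in. The thresholds and step names below are hypothetical, chosen only to show the shape of the logic.

```python
def next_step(confidence: float, failed_attempts: int,
              min_confidence: float = 0.6, max_attempts: int = 2) -> str:
    """Decide what to do after a recognition attempt.

    failed_attempts is the number of failures before this attempt.
    """
    if confidence >= min_confidence:
        return "proceed"
    if failed_attempts + 1 < max_attempts:
        # Still within the retry budget: ask again, ideally rephrased.
        return "reprompt"
    # Stop asking and offer a touch menu, keypad or human agent instead.
    return "offer_alternative"


print(next_step(0.9, 0))
print(next_step(0.3, 0))
print(next_step(0.3, 1))
```

Keeping the retry budget small matters, because each failed attempt costs the user time and patience.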

Page 59: Designing for Voice Interactions (UXAustralia)

✦ Don’t treat voice as a ‘me too’ feature (will your product or your customers actually benefit from voice... really?)

✦ Think twice before introducing redundancy

Would you like voice with that?

Voice is the hot new thing right now, but resist the hype. It’s not trivial to implement, and even if it were, does that validate it as a ‘must have’ feature for your product?

Voice is integrated into the OS of modern devices. Their technology is mature. It can be used with any input field, any interface. The interaction design is polished, and extensively tested. Use that! If you can.

Page 60: Designing for Voice Interactions (UXAustralia)

✦ Understand the various modes of voice interaction

✦ Be careful about mixing modes (is that a command or a conversation?)

Know when and how to use voice

When you are designing for voice, understand the modes.

Command, dictate, natural language, identity.

Page 61: Designing for Voice Interactions (UXAustralia)

✦ Support multi-modal interactions and make it as seamless as possible (voice, gesture, type, other)

✦ test, iterate, test, iterate...

Let users decide how to interact

Don Norman, 2003: “I believe that voice interfaces hold their greatest promise as an additional component to a multi-modal dialogue, rather than as the only interface channel.”

Dictate and edit is a prime example of this. It’s beautifully crafted.
Voice -> type
gesture -> voice

Test and iterate. Voice still isn’t a common/normal interaction, so you will likely get it a bit wrong the first few times.

Page 62: Designing for Voice Interactions (UXAustralia)

Don’t make me think

“A simple voice interface can only be as good as what the customer thinks they want. A better system is one that understands what their needs are likely to be, based on what’s known about them.”

Page 63: Designing for Voice Interactions (UXAustralia)

✦ Personalisation

✦ Work on making the system ‘smarter’

Create a perception of understanding

The speech recognition and synthesis tools have become commodities. Focus your energies on helping the system seem smarter.

Page 64: Designing for Voice Interactions (UXAustralia)

Jonny Schneider
Lead Consultant
Mobile Experience Design & Strategy
[email protected]
@jonnyschneider
au.linkedin.com/in/jonnyschneider/

All images used by permission