Where’s Jarvis? The Future of Voice Recognition and Natural Language User Interfaces
TRANSCRIPT
#UXPA2016 Session Survey: http://www.uxpa2016.org/sessionsurvey?sessionid=321 © 2016 Versay Solutions
Where’s Jarvis? The Future of Voice Recognition and Natural Language User Interfaces
Crispin Reedy, Versay Solutions
@crispinTX crispinreedy.com
#UXPA2016
From the session description
• What is voice recognition?
• What is natural language understanding?
• What are the common technologies in the market today?
• How does this fit with IoT?
• What are design considerations / methods to evaluate these types of interfaces?
• Implied: Should I speech-enable my ___?
• Bonus Q: Why doesn’t it work the way we want it to, and when will it?
Should I Speech-Enable My ___?
Iron Man 2: Marvel Studios, Paramount Pictures
Star Trek Voyager: Paramount Television
“Tomato soup”
“Tomato soup. Ok, what kind?”
“Just plain”
“Coming right up!”
Implicit confirmation
Second-level open-ended prompting
Cultural context: plain = hot
Terms & Technologies
• Speech Recognition
• Natural Language Understanding
• Voice Verification (Biometrics)
• Text to Speech
Speech Recognition “ASR”
“See the cat.”
Natural Language Understanding
• Extracting meaning from natural text
“Hello, yes, I’d like to pay my water bill. Can you help me with that?”
Intent = BillPay
Entity (Bill Type) = Water
Voice Verification
“My voice is my password.”
“Authenticated. Welcome, Mr. Smith.”
✓
Text To Speech
What Is Good TTS?
• Phonemes change based on location
  • “Cat”
  • “Alligator”
• Elision
  • “I’m. Awaiting. You.”
  • “I’m awaiting you.”
• Intonation
  • “Do you want coffee?”
  • “Do you want soda, tea, or coffee?”
• Most TTS isn’t “Movie Quality”
IMDB
SSML Example
SSML
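The slide's SSML sample did not survive in this transcript. As a rough stand-in, the sketch below builds a small SSML string in Python for the list-intonation prompt from the previous slide. The `speak`, `break`, and `emphasis` elements come from the W3C SSML standard; the function name `build_ssml`, the 300 ms pause, and the choice of emphasis level are assumptions for illustration, and individual TTS engines may need extra attributes on `speak` or support only part of the standard.

```python
import xml.etree.ElementTree as ET


def build_ssml(lead_in, options, pause="300ms"):
    """Build an SSML string that pauses before each option and stresses it.

    build_ssml is a hypothetical helper, not part of any TTS product.
    """
    speak = ET.Element("speak")
    speak.text = lead_in
    for option in options:
        # A short pause before each list item helps the listener parse the list.
        ET.SubElement(speak, "break", {"time": pause})
        emphasis = ET.SubElement(speak, "emphasis", {"level": "moderate"})
        emphasis.text = option
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Do you want", ["soda", "tea", "or coffee?"])
print(ssml)
```

The resulting string is what an application would hand to a TTS engine instead of plain text when the default reading sounds flat.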
Speech Recognition
• Hands-free command / control
• Dictation
• Input text
• Small form factor device, etc.
Text To Speech
• Output text dynamically
• Respond to input
• Useful when no display is available
Natural Language Understanding
• Necessary for all language-based input
• Extract meaning
• Parse large volumes of text
Voice Verification
• Security
[Diagram: ASR, NLU, TTS, Verification (Voiceprints), Application, Data]
• Sign-In
• Interaction
• Request
• Action
• Meaning
• Access Data
• Output
[Diagram, second build: adds Touch, Keyboard, and Visual]
• Manage I/O Modality
• Determine Meaning in Context
Context!
[Diagram: layers of speech and language]
• Linguistics: World Knowledge, Semantics, Syntax, Lexicon, Morphology, Phonetics, Acoustics, Physiology
• Units: Concepts, Phrases, Words, Phonemes, Sounds
• Components: ASR, NLU
Speech is ambiguous
Language is ambiguous
Everything is ambiguous
[Chart]
• Speaker Independence: Speaker Dependent, Multiple Speakers, Speaker Independent
• Isolated Words, Connected Words, Natural Speech
• Vocabulary Size: 10 words, 1000 words, 100,000 words, Unlimited
• Humanlike
AUDREY: Automatic Digit Recognizer
Bell Labs 1952
X: states
y: possible observations
a: state transition probabilities
b: output probabilities
"HiddenMarkovModel" by Tdunning, vectorization: Wikimedia
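To make the hidden Markov model legend (states, observations, a, b) concrete, here is a minimal sketch of the forward algorithm in Python. It computes the probability that a model produced a given observation sequence. The two states, two observation symbols, and every probability below are made-up toy values, and `forward` is a hypothetical helper; in a real recognizer the states roughly correspond to pieces of speech sounds and the observations to acoustic feature vectors.

```python
def forward(observations, states, start, a, b):
    """Return P(observations) under the HMM using the forward algorithm.

    start[s]  : probability of starting in state s
    a[p][s]   : state transition probability from p to s
    b[s][o]   : output probability of observation o in state s
    """
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: start[s] * b[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: b[s][obs] * sum(alpha[prev] * a[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


# Toy model: all numbers are illustrative assumptions.
states = ["A", "B"]
start = {"A": 0.5, "B": 0.5}
a = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
b = {"A": {"x": 0.7, "y": 0.3}, "B": {"x": 0.1, "y": 0.9}}

probability = forward(["x", "y"], states, start, a, b)
print(probability)
```

A recognizer compares scores like this across competing models (or, more commonly, finds the single best state path) to decide which words were most likely spoken.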
Training
Speech Recognition Engine
Acoustic Model
SLM and/or Grammar
Pronunciation Model
Utterance
Noise Levels?
Barge-In?
Feature Extraction
Endpointing
Speech Recognition Engine
Grammar or SLM
Probabilities
n-best list
Literal return
Tokens
Recognition Event
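The end of this pipeline, an n-best list of hypotheses with probabilities, is what the application actually sees. The sketch below shows one common way an application might act on it: accept a confident result, confirm a middling one, and reject the rest. The `Hypothesis` class, the `decide` function, and the 0.8 and 0.5 thresholds are hypothetical; real engines expose different result structures and thresholds are tuned per application.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    literal: str       # the words the recognizer thinks it heard
    token: str         # the semantic value the grammar maps them to
    confidence: float  # 0.0 to 1.0, as reported by the engine


def decide(nbest, accept=0.8, confirm=0.5):
    """Turn an n-best list into an (action, token) pair.

    The thresholds are illustrative defaults, not recommended values.
    """
    if not nbest:
        return ("reject", None)
    best = max(nbest, key=lambda h: h.confidence)
    if best.confidence >= accept:
        return ("accept", best.token)
    if best.confidence >= confirm:
        return ("confirm", best.token)
    return ("reject", None)


nbest = [
    Hypothesis("start a session", "START_SESSION", 0.62),
    Hypothesis("start the session", "START_SESSION", 0.30),
]
print(decide(nbest))
```

A "confirm" outcome would typically lead to a prompt such as "Start a session, right?" before the application acts.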
Early Commercial Adoptions
• Interactive Voice Response
  • “Those Phone Menus”
  • Server-based ASR
  • Nuance
  • Microsoft
• Voice-Enabled Handheld Devices
  • Industrial / Productivity applications
  • Device-based ASR
  • Network not needed
Note: Call center is still an important customer touchpoint!
Today’s Speech Agents vs. APIs
• Siri / Apple APIs
• Cortana / Cortana APIs
• Google Now / Google Voice Actions
• Amazon Echo (Alexa) / AVS API
• Jibo
• Ubi / Ubi Kit
• Assistant.ai / Api.ai
Alexa Skill vs. Amazon Voice Service
Amazon.com
Alexa Skill Example
Amazon.com
CapitalOne.com
NLU
Natural Language Understanding
• Parsing input to extract meaning
• Covers a large field
  • Commands
  • Automatic classification of emails
  • Newspaper articles, large chunks of text
  • Bots
  • Conversational agents
  • Messaging apps
  • Personal assistants
• Input could be via speech or via text
Levels of Meaning
“How can I help you?”
• Too Broad / Ambiguous: “I’m having a problem with my account.”
• Too Much: “Well, I was looking at my bill, because I do that every week, and I was reviewing everything on there, and I saw…”
• Just Right: “I’m seeing an unusual charge on my bill.”
NLU Tasks
http://www.conversational-technologies.com/nldemos/nlDemos.html
Intents and Entities
• “I’d like to transfer $50 from my checking account to my savings account.”
  • ACTION = Transfer (Intent)
  • FROM_ACCOUNT = Checking (Entity)
  • TO_ACCOUNT = Savings (Entity)
  • AMOUNT = $50 (Entity)
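The intent and entities above can be pictured as the output of a small parsing function. The sketch below uses a single hand-written regular expression in Python to fill those slots. It is an illustration only: `TRANSFER_PATTERN` and `parse_transfer` are hypothetical names, the pattern covers just this one phrasing, and the NLU services discussed in this talk generally learn from example utterances rather than relying on one regex.

```python
import re

# Hand-written pattern for one phrasing of a transfer request (an assumption).
TRANSFER_PATTERN = re.compile(
    r"transfer \$(?P<amount>\d+(?:\.\d{2})?) "
    r"from my (?P<from_account>\w+) account "
    r"to my (?P<to_account>\w+) account",
    re.IGNORECASE,
)


def parse_transfer(utterance):
    """Return the intent and entities for a transfer request, or None."""
    match = TRANSFER_PATTERN.search(utterance)
    if match is None:
        return None
    return {
        "ACTION": "Transfer",
        "FROM_ACCOUNT": match.group("from_account").capitalize(),
        "TO_ACCOUNT": match.group("to_account").capitalize(),
        "AMOUNT": "$" + match.group("amount"),
    }


frame = parse_transfer(
    "I'd like to transfer $50 from my checking account to my savings account."
)
print(frame)
```

Anything the pattern does not anticipate ("move fifty bucks to savings") returns None, which is exactly the brittleness that trained NLU models are meant to reduce.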
NLU APIs
• API.ai
• Alexa
• Microsoft LUIS
• Wit.ai
• Google Voice Actions
• Etc.
Today’s NLU APIs
• Microsoft LUIS (part of Project Oxford)
Microsoft.com
Today’s NLU APIs
• API.ai
API.ai
The Future Is Here
• DNN (Deep Neural Networks)
• Being applied to both ASR and NLU problems
• Requires large amounts of data to train the models
What’s The Glue Here?
Consistency Across Contexts?
“Omnichannel CX”
Data Is Everywhere
State Chart XML?
ASR vs. NLU: Wrap Up
ASR
• Spoken aloud
• Requires some NLU even if it’s hand-crafted (tagging)
• Useful in hands-free, eyes-free contexts
NLU
• Focuses on meaning extraction
• Could be used for chat bots, etc.
• Machine learning to train models
Design Considerations
• What are you trying to build?
• What’s your platform?
• Existing guidelines / research
• User testing is key
  • Especially if you’re trying to do something complicated
Should I Speech-Enable My ___?
What’s Your ASR/NLU Platform?
Write an app (skill) for an agent such as Cortana / Alexa
Use cloud APIs to add ASR / NLU to your app / device / page / gadget
Download software and use full-featured capabilities for more robust recognition on a specific device
Build your own
Network Availability
• Simply irritating… or totally unusable?
“What’s on my calendar today?”
“Sorry, I can’t complete that request right now.”
Appropriate Modality?
• Voice Only? Voice + Display?
• Is it possible for the user to switch modalities?
• Or would switching potentially be dangerous?
“How long is the flight from Dallas to Seattle?”
“I’ve got a few results to show you.”
Is State Maintained?
• Does your platform support a multiple-stage interaction?
• Does it remember what you did previously?
“Who is Barack Obama?”
“Barack Obama is the 44th president of the United States.”
“How old is he?”
“I’m sorry, I don’t understand your question.”
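The failed follow-up above is what happens when no state is carried between turns. The sketch below shows, in Python, the smallest form of state that would rescue it: remembering the last entity mentioned and substituting it for a pronoun in the next question. `DialogState`, its `resolve` method, and the pronoun list are hypothetical, and real platforms handle reference resolution in far more sophisticated ways, if they expose it at all.

```python
import re


class DialogState:
    """Remembers the last entity mentioned so follow-ups can refer to it."""

    def __init__(self):
        self.last_entity = None

    def resolve(self, utterance, known_entities):
        """Return the utterance with a pronoun replaced by the last entity."""
        for entity in known_entities:
            if entity.lower() in utterance.lower():
                # The user named an entity directly, so remember it.
                self.last_entity = entity
                return utterance
        if self.last_entity is None:
            # Nothing to refer back to; leave the utterance untouched.
            return utterance
        return re.sub(
            r"\b(he|she|it|they)\b",
            self.last_entity,
            utterance,
            flags=re.IGNORECASE,
        )


state = DialogState()
entities = ["Barack Obama"]
first = state.resolve("Who is Barack Obama?", entities)
second = state.resolve("How old is he?", entities)
print(second)
```

When you evaluate a platform, the question to ask is whether it keeps something like `last_entity` for you, or whether your application has to.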
Wake-Up Words
• How many of these “Agents” will we be talking to?
“Jibo, take a picture.”
“Alexa, play music.”
“OK Google, set the temperature to 77 degrees.”
System Personality
• Are you writing for an “Agent” who has an existing style?
• What if your skill or app doesn’t match that style?
• If there is no existing style, should you create one?
“Hi, I’m Julie!”
Context
• Real-world context
• Digital context
• How much does your app know about where you are and what it can do?
“When I get home, remind me to take out the trash.”
“I’m sorry, your calendar doesn’t support location-based reminders.”
What Are You Trying To Recognize?
• Long utterances work better than short ones
• Letter names require extra work
“Start a session”
“Got it”
And So Much More….
• What will you do when the recognizer just can’t get it?
“I want my…. BARK BARK BARK Timmy STOP THAT NOW GET DOWN!”
????
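One common answer to "what will you do when the recognizer just can't get it?" is to escalate: reprompt with more guidance each time, then hand off rather than loop forever. The Python sketch below illustrates that policy. The prompt wording, the two-reprompt limit, and the names `REPROMPTS`, `HANDOFF_MESSAGE`, and `next_action` are all assumptions made for this example.

```python
# Escalating prompts: each one gives the user more guidance (wording is hypothetical).
REPROMPTS = [
    "Sorry, I didn't catch that. What would you like to do?",
    "I still didn't get that. You can say things like 'check my balance'.",
]
HANDOFF_MESSAGE = "Let me connect you with someone who can help."


def next_action(failed_attempts):
    """Choose what to do after the given number of consecutive failures."""
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    if failed_attempts <= len(REPROMPTS):
        return ("reprompt", REPROMPTS[failed_attempts - 1])
    # Out of reprompts: stop asking and move the user somewhere useful.
    return ("handoff", HANDOFF_MESSAGE)


for attempt in (1, 2, 3):
    print(attempt, next_action(attempt))
```

On a device with a screen, the hand-off step might switch to touch input instead of a human agent.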
Existing Guidelines / Research
• Caveat: Best practices evolved in one modality (e.g. voice-only) may not apply the same way in another (e.g. combined voice + touch)
  • But they could be adapted
• Association for Voice Interaction Design (AVIxD.org)
  • Wiki
  • Peer-Reviewed Journal
  • Virtual “Brown Bags”
• Academic Sources, Books
AVIxD.org
CUI Working Group is actively recruiting!
Specific Example: “Help”
VoiceXML Standard (2004): “Help” should be a global command
AVIxD Wiki (2014): Stop using “Help” as a global
Agent API Doc (2015): Offer “Help”
Specific Example: “Help”
• Designers who tune applications have seen that the word “help” is a known “False Attractor”
  • Other short things that you say get recognized as “help”
• People don’t voluntarily come up with “help” unless they are prompted
• Give callers a context-specific command only where help may truly be needed, and call it something besides “help”
  • System: “Say or enter your account number, or say, ‘Where do I find it?’”
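The recommendation above amounts to scoping commands to a dialog state instead of registering one global "help". A minimal Python sketch of that idea follows. The state names, the action name, and the `STATE_COMMANDS` and `match_command` identifiers are hypothetical, and a real application would express this in its platform's grammar format rather than a dictionary lookup.

```python
# Commands available in each dialog state (state and action names are hypothetical).
STATE_COMMANDS = {
    "collect_account_number": {
        "where do i find it": "explain_account_number_location",
    },
}


def match_command(state, utterance):
    """Return the action for a context-specific command, or None."""
    commands = STATE_COMMANDS.get(state, {})
    normalized = utterance.strip().lower().rstrip("?.!")
    return commands.get(normalized)


print(match_command("collect_account_number", "Where do I find it?"))
print(match_command("main_menu", "Where do I find it?"))
```

Because "help" is not in any state's command list, short misrecognized utterances have nothing to be falsely attracted to.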
Special Case: Car
• “Distracted Driver” is a hot topic!
• Richard Young, Wayne State University
  • Paper: “Safe Interaction For Drivers”
  • “Visual-Manual Mode” – What we do today
  • “Auditory-Vocal Mode” – Speech only. NO GUI.
  • “Mixed Mode” – Speech and GUI being used together
• Finding: If you give someone a graphic interface, they’re going to look at it
  • And take their eyes off the road
Design Documents
Usability Studies / Research
• Special Challenges
  • Technical setup
  • Phone tap / Recording both sides
Warner Bros.
Early Stage Voice Only Prototype
Should I Speech-Enable My ___?
What’s the Use Case?
• Enabling application
  • User can’t do it any other way
  • New tasks
• Enhancing application
  • User can do it now
  • But speech makes it better
    • Faster
    • Safer
API-Based
Device-Based
Roll Your Own / Open-Source
• Flexibility
• Power
• Customization
• Time
• Difficulty
Cloud vs. Downloadable / Embedded
Cloud
• Easy to get started
• Lightweight
• Not much specialized knowledge
Downloadable / Embedded
• Customizable
• Probably better recognition
• Can be device-specific
• More features
• Higher powered
• May require specialized knowledge
  – Speech scientist
Open Source ASR
• CMU Sphinx
  • pocketsphinx
• Kaldi
  • http://kaldi-asr.org/
  • GitHub
  • New updates include some pretty interesting stuff (DNN)
• Requires:
  • Corpus
  • Tech know-how
Should I Speech-Enable My ___?
Maybe
Iron Man 2: Marvel Studios, Paramount Pictures
Where’s Jarvis?
Where’s Jarvis?
Gesture Based Interface
Artificial Intelligence
Voice Based Interface
Where’s Jarvis?
ASR
NLU
Voice Design
Context
Resources
• Handout / Web page