![Page 1: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/1.jpg)
Speech Recognition and Conversational
Interfaces
![Page 2: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/2.jpg)
Virtues of Spoken LanguageVirtues of Spoken Language
Natural: Requires no special training Flexible: Leaves hands and eyes freeFlexible: Leaves hands and eyes freeEfficient: Has high data rateEconomical: Can be transmitted and received inexpensively
Speech interfaces are ideal for informationSpeech interfaces are ideal for information
p y
Speech interfaces are ideal for information access and management when:
• The information space is broad and complex,
Speech interfaces are ideal for information access and management when:
• The information space is broad and complex,The information space is broad and complex,• The users are technically naive, or• Speech is the only available modality.
The information space is broad and complex,• The users are technically naive, or• Speech is the only available modality.Speech is the only available modality. Speech is the only available modality.
![Page 3: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/3.jpg)
Information Access via Speech
Read my important
![Page 4: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/4.jpg)
Future UIs for Information Access
• Star Trek style UIy– verbally ask the computer for info or services– may be common in mobile/hands-free situations
h d t t t k ll i it i f t h iti & bi– hard to get to work well since it requires perfect speech recognition & unambiguous language understanding
4
![Page 5: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/5.jpg)
Human Factors in SpeechHuman Factors in Speech
•• High Error Rates
• Speech recognitionSpeech recognition
• Background noise, intonation, pitch, volumeg p
• Grammars (missing words, size limitations)
• “When speech recognition becomes genuinely reliable, this will cause another big change inreliable, this will cause another big change in operating systems.” (Bill Gates, The Road Ahead 1995)
![Page 6: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/6.jpg)
Th S f R itiThe Space of Recognition
DomainDomainDomainDomainSpeaker Dependent Independent
Dependent not interesting Transcription(training)( g)
Independent We are hereUltimate Goal
(requiresIndependent We are here (requires knowledge)
![Page 7: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/7.jpg)
Th S f R itiThe Space of Recognition
• Speaker dependent
• First train the system to recognize your speaking
• Better recognition ratesg
• Domain dependent:
• Only recognize what is in the domain
• Better recognition ratesg
• Domain can be large. How is it specified?
![Page 8: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/8.jpg)
Wh S h?Why Speech?
• No special training -- naive users
• Leaves hands and eyes free -- but must know when to start recognitiong
• High data rate -- assuming low errors
•• Inexpensive I/O -- microphone, speaker, button
• speaker needed for feedbackp
• Some things are easier to specify with speech
![Page 9: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/9.jpg)
Communication via Spoken LanguageInput Output
Human
Speech Speech
Human
Computer
Recognition Synthesis
Computer
TextText
Generation Understanding
Meaning
![Page 10: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/10.jpg)
Components of Conversational Systems
LanguageGeneration
DialogueManagement
SpeechSynthesis
Audio Database
SpeechRecognition
ContextResolution
Language UnderstandingUnderstanding
![Page 11: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/11.jpg)
OverviewOverview
• Components of a speech based interfacep p– ASR: Automated Speech Recognition– NLU: Natural Language Understanding
Generation– Generation– Dialogue Manager
11
![Page 12: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/12.jpg)
Conversational AgentsConversational Agents
• Also known as:• Also known as:– Spoken Language Systems– Dialogue SystemsDialogue Systems– Speech Dialogue Systems
• Applications:– Travel arrangements (Train, hotel, flight)– Telephone call routing– Tutoring– Communicating with robots– Anything with limited screen/keyboardAnything with limited screen/keyboard
12
![Page 13: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/13.jpg)
A travel dialog: CommunicatorA travel dialog: Communicator
13
![Page 14: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/14.jpg)
Call routing: ATT HMIHYCall routing: ATT HMIHY
14
![Page 15: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/15.jpg)
A tutorial dialogue: ITSPOKEA tutorial dialogue: ITSPOKE
15
![Page 16: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/16.jpg)
Dialogue SystemDialogue System Architecture
16
![Page 17: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/17.jpg)
Speech RecognitionSpeech Recognition
Li t ll th A b• Refers to the technologies that
enable computing devices to
List all the Auburn University orders.
p gidentify the sound of human voice.
17
![Page 18: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/18.jpg)
Automated Speech Recognition engineAutomated Speech Recognition engine• Standard ASR engines
– Speech to wordsSpeech to words• But specific characteristics for dialogue
– Language models could depend on where we are in the dialogueLanguage models could depend on where we are in the dialogue– Could make use of the fact that we are talking to the same
human over time.– Barge-in (human will talk over the computer)– Confidence values
18
![Page 19: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/19.jpg)
Speech RecognitionSpeech Recognition
• Continuous RecognitionContinuous Recognition– Allows a user to speak to the system in an everyday manner without using
specific, learned commands.
• Discrete RecognitionRecognizes a limited vocabulary of individual words and phrases spoken by a– Recognizes a limited vocabulary of individual words and phrases spoken by a person.
19
![Page 20: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/20.jpg)
Speech RecognitionSpeech Recognition
• Word SpottingWord Spotting– Recognizes predefined words or phrases.– Used by discrete recognition applications.
* “Computer I want to surf the Web”* “Hey, I would like to surf the Web”
20
![Page 21: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/21.jpg)
Speech RecognitionSpeech Recognition
• Voice Verification or Speaker Identification– Voice verification is the science of verifying a person's identity on the basis of y g p y
their voice characteristics.– Unique features of a person's voice are digitized and compared with the
individual's pre-recorded "voiceprint" sample stored in the database for identityindividual s pre recorded voiceprint sample stored in the database for identity verification.
– It is different from speech recognition because the technology does not recognize the spoken word itselfrecognize the spoken word itself.
21
![Page 22: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/22.jpg)
Speech SynthesisSpeech Synthesis
• Refers to the technologies that enable computing devices to p goutput simulated human speech.
James, here are the Auburn University orders.
22
![Page 23: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/23.jpg)
Speech SynthesisSpeech Synthesis
• Formant SynthesisFormant Synthesis– Uses a set of phonological rules to control an audio waveform that simulates
human speech.– Sounds like a robot, very synthetic, but getting better.
23
![Page 24: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/24.jpg)
Speech SynthesisSpeech Synthesis
• Concatenated SynthesisConcatenated Synthesis– Uses computer assembly of recorded voice sounds to create meaningful
speech output.– Sounds very human, most people can’t tell the difference.
24
![Page 25: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/25.jpg)
Concatenative Speech SynthesisConcatenative Speech Synthesis• Output waveform generated by concatenating segments of
pre-recorded speech corpus.• Concatenation at phrase, word or sub-word level.
Synthesis ExamplesThe third ad is a 1996 black Acura Integra with 45380 miles.The price is 8970 dollars Please call (404) 399-7682
Synthesis Examples
compassiondisputed
compassiondisputedlabyrinth
b d blabyrinth
b d b
The price is 8970 dollars. Please call (404) 399-7682.
l b t disputedcedar city
sincegiant
disputedcedar city
sincegiant
abracadabraobligatory
abracadabraobligatory computer
sciencelaboratory
giantsincegiantsince
Continental flight 4695 from Greensboro is expected in H lif t 10 08 l l ti
25
Halifax at 10:08 pm local time.
![Page 26: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/26.jpg)
Uses of Speech TechnologiesUses of Speech Technologies
• Interactive Voice Response Systems– Call centers
• Medical, Legal, Business, Commercial, Warehouse• Handheld Devices• Toys and Education• Automobile Industry• Universal Access (visual/physical impaired)
26
![Page 27: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/27.jpg)
Segment-Based Speech RecognitionWaveform
Segment Based Speech RecognitionWaveform
F b d t ( 5 )Frame-based measurements (every 5ms)
Segment network created by interconnecting spectral landmarks
ao
-m - ae
dh
- k
p
uw
er
z t k-ax dx ehx
Probabilistic search finds most likely phone & word strings
computers that talk
Probabilistic search finds most likely phone & word strings
![Page 28: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/28.jpg)
Segment-Based Speech RecognitionSegment Based Speech Recognition
![Page 29: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/29.jpg)
Speech Recognition: Dimensions of SuccessSpeec ecog t o e s o s o Success
• Size of vocabulary: A few words to tens of thousands• Accuracy, recognition percentage
– >99%, >95%, >75%
R t bilit f f• Repeatability of performance• Cost
S k d d t k i d d t• Speaker-dependent vs. speaker-independent• Training required or not required
L ti f i h• Location of microphone• Acoustic environment, quiet or noisy environment
Di t d ti h• Discrete words or continuous speech
29
![Page 30: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/30.jpg)
Language ModelLanguage Model
• Language models for dialogue are often based on hand-writtenLanguage models for dialogue are often based on hand written Context-Free or finite-state grammars.
• Why? Because of need for understanding; we need to constrain user to say things that we know what to do with.
30
![Page 31: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/31.jpg)
Language Models for Dialogue (2)g g g ( )
• We can have LM specific to a dialogue stateWe can have LM specific to a dialogue state• If system just asked “What city are you departing from?”• LM can beLM can be
– City names only– FSA: (I want to (leave|depart)) (from) [CITYNAME]
• A LM that is constrained in this way is technically called a “restricted grammar” or “restricted LM”
31
![Page 32: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/32.jpg)
Barge-inBarge in
• Speakers barge in• Speakers barge-in• Need to deal properly with this via speech-detection, etc.
32
![Page 33: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/33.jpg)
Natural Language UnderstandingNatural Language Understanding
• Or “NLU”• There are many ways to represent the meaning of sentencesThere are many ways to represent the meaning of sentences• For speech dialogue systems, most common is “Frame and
slot semantics”.
33
![Page 34: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/34.jpg)
An example of a frameAn example of a frame
• Show me morning flights from Boston to SF on Tuesday.• SHOW:• FLIGHTS:• ORIGIN:• ORIGIN:• CITY: Boston• DATE: Tuesday• TIME: morning• DEST:• CITY: San Francisco• CITY: San Francisco
34
![Page 35: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/35.jpg)
How to generate this semantics?How to generate this semantics?• Many methods, • Simplest: “semantic grammars”Simplest: semantic grammars• CFG in which the left hand side of rules is a semantic
category:– LIST -> show me | I want | can I see|…– DEPARTTIME -> (after|around|before) HOUR | morning |
afternoon | evening– HOUR -> one|two|three…|twelve (am|pm)
FLIGHTS ( ) fli ht|fli ht– FLIGHTS -> (a) flight|flights– ORIGIN -> from CITY
DESTINATION t CITY– DESTINATION -> to CITY– CITY -> Boston | San Francisco | Denver | Washington
35
![Page 36: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/36.jpg)
Semantics for a sentenceSemantics for a sentence• LIST FLIGHTS ORIGIN• Show me flights from BostonShow me flights from Boston
• DESTINATION DEPARTDATEDESTINATION DEPARTDATE• to San Francisco on Tuesday
• DEPARTTIME• morningmorning
36
![Page 37: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/37.jpg)
Frame-fillingFrame filling
• We use a parser to take these rules and apply them to theWe use a parser to take these rules and apply them to the sentence.
• Resulting in a semantics for the sentence• We can then write some simple code• That takes the semantically labeled sentence• And fills in the frame.
37
![Page 38: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/38.jpg)
Problems with any of theseProblems with any of these semantic grammars
• Relies on hand-written grammar– Expensivep– May miss possible ways of saying something if the grammar-writer just doesn’t
think about them
N t b bili ti• Not probabilistic– In practice, every sentence is ambiguous– Probabilities are best way to resolve ambiguitiesProbabilities are best way to resolve ambiguities– Good statistical models can be learned and built using existing techniques.
38
![Page 39: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/39.jpg)
Generation and TTS (Text-to-speech)Generation and TTS (Text to speech)
• Generation component– Chooses concepts to express to userp p– Plans out how to express these concepts in words– Assigns any necessary prosody to the words
• TTS component– Takes words and prosodic annotations
S th i f– Synthesizes a waveform
39
![Page 40: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/40.jpg)
Generation ComponentGeneration Component
• Content Planner– Decides what content to express to user
* (ask a question, present an answer, etc)Often merged with dialogue manager– Often merged with dialogue manager
• Language Generation– Chooses syntactic structures and words to express meaning.– Simplest method
* All words in sentence are prespecified!* “Template based generation”* “Template-based generation”* Can have variables:
• What time do you want to leave CITY-ORIG?• Will you return to CITY-ORIG from CITY-DEST?
40
![Page 41: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/41.jpg)
HCI constraints on generation for dialogue: “C h ”“Coherence”
• Discourse markers and pronouns (“Coherence”):
• (1) Please say the date.• …• Please say the start timePlease say the start time.• …• Please say the duration…• …• Please say the subject…
• (2) First, tell me the date.• …• Next, I’ll need the time it starts.• …• Thanks. <pause> Now, how long is it supposed to last?• …• Last of all, I just need a brief description
41
![Page 42: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/42.jpg)
Dialogue ManagerDialogue Manager
• Controls the architecture and structure of dialogueControls the architecture and structure of dialogue– Takes input from Automated Speech Recognition/Natural Language
Understanding components– Maintains some sort of state– Interfaces with Task Manager– Passes output to Natural Language Generation/Text-to-speech modulesPasses output to Natural Language Generation/Text to speech modules
42
![Page 43: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/43.jpg)
User-centered dialogueUser centered dialogue system design
1. Early focus on users and task:– interviews, study of human-human task, etc., y ,
2. Build prototypes: – Wizard of Oz systems
3. Iterative Design:– iterative design cycle with embedded user testing
43
![Page 44: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/44.jpg)
Dialogue system EvaluationDialogue system Evaluation
• Whenever we design a new algorithm or build a newWhenever we design a new algorithm or build a new application, need to evaluate it
• How to evaluate a dialogue system?• What constitutes success or failure for a dialogue system?
44
![Page 45: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/45.jpg)
Task Completion SuccessTask Completion Success
• % of subtasks completed% of subtasks completed• Correctness of each questions/answer/error msg• Correctness of total solutionCorrectness of total solution
45
![Page 46: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/46.jpg)
Task Completion CostTask Completion Cost
• Completion time in turns/secondsCompletion time in turns/seconds• Number of queries• Turn correction ration: number of system or user turns usedTurn correction ration: number of system or user turns used
solely to correct errors, divided by total number of turns• Inappropriateness (verbose, ambiguous) of system’s
questions, answers, error messages
46
![Page 47: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/47.jpg)
User SatisfactionUser Satisfaction
• Were answers provided quickly enough?• Did the system understand your requests the first time?Did the system understand your requests the first time?• Do you think a person unfamiliar with computers could use the
system easily?
47
![Page 48: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/48.jpg)
Speech interface guidelinesSpeech interface guidelines
• Speech recognition is errorful• System state is often opaque to the userSystem state is often opaque to the user• http://www.speech.cs.cmu.edu/air/papers/SpInGuidelines/SpIn
Guidelines.html
48
![Page 49: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/49.jpg)
1. Errors1. Errors
• Speech based interfaces are error-prone due to speech recognition.g– Humans aren’t perfect speech recognizers, therefore, machines aren’t
either.
• Goal: Reduce the number and severity of errors.
49
![Page 50: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/50.jpg)
1. Errors1. Errors
• Use Specific Error Messages• Limit Background Noise• Limit Background Noise• Allow the User to Turn Off the Input Device
Provide an Undo Capability• Provide an Undo Capability• Use Auditory Icons
U M lti M d l C f E If A li bl• Use Multi-Modal Cues for Errors If Applicable• Don’t Assume People Hear Everything
50
![Page 51: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/51.jpg)
Use Specific Error MessagesUse Specific Error Messages
• Bad ExampleBad Example– System: “Say the departure date.”– User: “Tomorrow.”– System: “Say the departure date.”– User: “I want to travel tomorrow.”
S t “S th d t d t ”– System: “Say the departure date.”
51
![Page 52: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/52.jpg)
Use Specific Error MessagesUse Specific Error Messages
• Good ExampleGood Example– System: “Say the departure date.”– User: “Tomorrow.”– System: “I don’t understand that date. Say the month, date and year. For
example, say October 13th, 2003.”– User: “July 1st 2003 ”User: July 1 , 2003.
52
![Page 53: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/53.jpg)
Limit Background NoiseLimit Background Noise
• Background noise is input.
• Computer hears the background, not the user.
53
![Page 54: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/54.jpg)
All th U t T Off th I t D iAllow the User to Turn Off the Input Device
• This reduces background noise errors.
• For Speech Based Interfaces, allow the user to place the system in an ignore mode– System ignores input until a keyword is spoken, i.e. “I am back”.
54
![Page 55: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/55.jpg)
P id U d C bilitProvide an Undo Capability
• Build in ways for users to cancel out, go back and undo actions.
55
![Page 56: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/56.jpg)
Use an Auditory IconUse an Auditory Icon
• Auditory Icons are sound clips with a messageAuditory Icons are sound clips with a message.
• When errors occur, play an auditory icon to notify the user.
56
![Page 57: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/57.jpg)
U M lti M d l C f E If A li blUse Multi-Modal Cues for Errors If Applicable
• Use more than one mode to signal an error if possibleUse more than one mode to signal an error, if possible.
• Play an auditory icon display a message and speak aPlay an auditory icon, display a message and speak a message.
57
![Page 58: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/58.jpg)
D ’t A P l H E thiDon’t Assume People Hear Everything
• Just because the system spoke it doesn’t mean the userJust because the system spoke it, doesn t mean the user heard it.
• Say important information first or last to improve the likelihood of it being heard.
58
![Page 59: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/59.jpg)
2. Feedback2. Feedback
• During HCI, the user needs feedback from the computer.
• When a user issues a command, the system should acknowledge that the user has been heard.acknowledge that the user has been heard.
• Users also want feedback when the system is busy• Users also want feedback when the system is busy.
59
![Page 60: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/60.jpg)
2. Feedback2. Feedback
• Supply Alternative Guesses• Acknowledge the User’s Speech• Acknowledge the User s Speech• Show When It Is the User’s Turn to Talk
Allow for Verification• Allow for Verification• Use Non-Speech Audio for Transitions
U I P M• Use In-Progress Messages
60
![Page 61: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/61.jpg)
Supply Alternative GuessesSupply Alternative Guesses
• Users may say one word, but the computer hears a different word. (IDEAL SOLUTION)( )– i.e. User says “Boston” and the computer hears “Austin”.– The computer should respond “Did you say Austin or Boston?”
• This is easier said than done because you have to know all ythe words that sound alike in order to accomplish this for a large vocabulary.
61
![Page 62: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/62.jpg)
Supply Alternative GuessesSupply Alternative Guesses
• Repeat what the user said and allow the user to correct what was recognized. (REAL SOLUTION)g ( )– i.e. User says “Boston” and the computer hears “Austin”.– The computer should respond “You said Boston, is that correct?”
62
![Page 63: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/63.jpg)
Acknowlege the User’sAcknowlege the User s Speech
• When the user speaks, provide feedback that she was heard.– Auditory Icon– Go to the next option– If the next option is time consuming, let the user know in advance. i.e. “I heard
you, let me process your request”
63
![Page 64: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/64.jpg)
Sh Wh It I th U ’ T t T lkShow When It Is the User’s Turn to Talk
• In a multi-modal user interface, provide the user with a visual cue that the computer is listening.p g
64
![Page 65: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/65.jpg)
Sh Wh It I th U ’ T t T lkShow When It Is the User’s Turn to Talk
• In a speech based interface, provide the user with:– PromptPrompt– Auditory Icon
65
![Page 66: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/66.jpg)
Allow for Verification
• Users tend to verify more when using a speech interface versus a visual interface.
• Speech based interfaces should allow the user to verifySpeech based interfaces should allow the user to verify what is happening and what has happened.
66
![Page 67: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/67.jpg)
U N S h A di f T itiUse Non-Speech Audio for Transitions
• When the user issues a command that requires a transition, play an auditory icon to acknowledge the transition is under p y y gway.
• Avoid non-speech feedback that sounds like equipment noise.
67
![Page 68: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/68.jpg)
Use In-Progress MessagesUse In Progress Messages
• If there is more than a 3 seconds delay between when the user issues a command and the system responds, issue an y p ,in-progress message.
• For best results, your in-progress messages should be informative.– i.e. tell the user their position in the wait queue when it changes.
68
![Page 69: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/69.jpg)
Use In-Progress MessagesUse In Progress Messages
• Playing a musical auditory icon in the background doesn’t work alone, but it is better than nothing. , g
• Combine the verbal message with music to have the bestCombine the verbal message with music to have the best effect.
69
![Page 70: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/70.jpg)
3. Confirmations3. Confirmations• Confirmations are questions you
ask of the user to be sure that the user has been heard correctly.
70
![Page 71: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/71.jpg)
3. Confirmations3. Confirmations
• Use Confirmations Appropriately• Ask for Clarifying Information• Ask for Clarifying Information• Use Confirmations for Destructive or Predictable Actions
Be Specific• Be Specific
71
![Page 72: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/72.jpg)
Use ConfirmationsUse Confirmations Appropriately
• Don’t over confirm– You could overdo the confirmations by asking for a confirmation for everyYou could overdo the confirmations by asking for a confirmation for every
input.
• You have to balance the cost of making an error with the extra time and annoyance in requiring the user to confirm a l t f t t tlot of statements.
72
![Page 73: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/73.jpg)
Ask for ClarifyingAsk for Clarifying Information
• If the expected response has more than one known responseIf the expected response has more than one known response, then you may want to clarify what the user said.
• i.e. “Do you want to set up an appointment or contact the person by phone”
73
![Page 74: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/74.jpg)
U C fi ti f D t ti P di t blUse Confirmations for Destructive or Predictable Actions
• If the user’s action is destructive, delete files, require a confirmation.
• If the user’s input prone to errors, require a confirmation.– i.e. the grammar has a lot of sound alike words.
74
![Page 75: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/75.jpg)
Be SpecificBe Specific
• If the system doesn’t recognize what was spoken be specificIf the system doesn t recognize what was spoken, be specific about what you need.
• i.e.– “Please repeat the date again” vs. “Please repeat”– “Do you mean December 3rd?” is not a good example, unless you are fairly
confident.
75
![Page 76: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/76.jpg)
4. Command-and-Control4. Command and Control
• Speech based interfaces that recognize a limited vocabulary of individual words and phrases spoken by the user.p p y
76
![Page 77: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/77.jpg)
5. Command-and-Control5. Command and Control
• User Constraints• Be Brief• Be Brief
77
![Page 78: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/78.jpg)
User ConstraintsUser Constraints
• Limit the user’s input through specific prompts.
78
![Page 79: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/79.jpg)
User ConstraintsUser Constraints
• Bad dialouge:– System: “Welcome to the XYZ Company. We look forward to servicing your travel needs.
Wh t th d t f t l th t ld lik t h k f ?”What are the dates of travel that you would like me to check for?”– User: “We are interested in traveling the first week of July, say July 1st to July 5th”.
• The system’s statement is too open. This is a natural dialogue that humans understand.
79
![Page 80: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/80.jpg)
User ConstraintsUser Constraints
• Good dialogue:– System: “Welcome to the XYZ Company. Say the departure date of travel. For y p y y p
example, say October 1st, 2003.”– User: “July 4th, 2003”
System: “Thank you Say the return date ”– System: “Thank you. Say the return date.”
80
![Page 81: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/81.jpg)
Be BriefBe Brief
• People model the length of system speech.– If the system is lengthy, then the user will tend to be lengthy.y g y, g y
• The length of user speech is directly proportional to the number of recognition errors.– The longer you speak, the chances of errors increases.
81
![Page 82: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/82.jpg)
Allow Relative DatesAllow Relative Dates
• For example:– next Friday yesterday tomorrow next week next month etcnext Friday, yesterday, tomorrow, next week, next month, etc.
82
![Page 83: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/83.jpg)
Avoid Long PausesAvoid Long Pauses
• People don’t like dead air in• People don t like dead air in conversation.
• Use auditory icons or speech to avoid long pauses.g p
83
![Page 84: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/84.jpg)
Choose an AppropriateChoose an Appropriate Speed
• If the systems speaks fast, then the users will speak fast.
• Users will mimic the speed of the computer.
84
![Page 85: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/85.jpg)
Use Barge-InUse Barge In
• Allow users to interrupt the computer’s speech. This isAllow users to interrupt the computer s speech. This is barge-in.
85
![Page 86: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/86.jpg)
Use Barge-InUse Barge In
• When a group of users have adapted to a speech based interface and they barge-in, they barge-in with 2 seconds of y g , y gthe introduction (maybe less).
• So, if you have to change the options of the speech based interface, how do you notify the users if they barge-in
ithi 1 d?within 1 second?
86
![Page 87: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/87.jpg)
SUEDE:Low-fi Prototyping for Speech-based UIs
• Built-in iterative design– design – test – analysis– design – test – analysis– fast -> no real recognition
• Support design practiceSupport design practice– example scripts– Wizard of Oz (WoZ)
• Handle needs of real UIs– error simulation
87
![Page 88: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/88.jpg)
script area
88
![Page 89: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/89.jpg)
machine prompt user response
89
![Page 90: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/90.jpg)
90
![Page 91: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/91.jpg)
design area
91
![Page 92: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/92.jpg)
92
![Page 93: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/93.jpg)
SUEDE SummarySUEDE Summary
• Speech is an important mode for info access in the fieldSpeech is an important mode for info access in the field• SUEDE supports speech-based UI design
– moving from concrete examples to abstractionsg p– embeds iterative design w/ design-test-analyze
• Designers using SUEDE need not be experts in speech recognition technology
93
![Page 94: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/94.jpg)
Audio Visual IntegrationAudio Visual Integration
• Audio and visual signals both contain information about:• Audio and visual signals both contain information about:– Identity of the person: Who is talking?– Linguistic message: What’s (s)he saying?g g ( ) y g– Emotion, mood, stress, etc.: How does (s)he feel?
• The two channels of information– Are often inter-related– Are often complementary
M t b i t t• – Must be consistent•Integration of these cues can lead to enhanced capabilities
for future human computer interfacesp
![Page 95: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/95.jpg)
Audio Visual SymbiosisAudio Visual Symbiosis
PersonalIdentityIdentity
RobustSpeaker Face RobustPerson ID
SpeakerID
A ti Vi l
FaceID
AcousticSignal
Visual Signal
RobustASR
SpeechRecognition
Lip/MouthReading
RobustParalinguistic
Detection
AcousticParaling.Detection
VisualParaling.Detection
ParalinguisticInformation
LinguisticMessageg
![Page 96: Speech Recognition and Conversational Interfacesuser.ceng.metu.edu.tr/~tcan/se542_f1314/Schedule/se542_week12.pdfSpeech RecognitionSpeech Recognition • Voice Verification or Speaker](https://reader034.vdocuments.us/reader034/viewer/2022051813/6031108d6ab684667f12b821/html5/thumbnails/96.jpg)
Assignment #7 (Reading #6)
• ICMI’12 Grand Challenge – Haptic Voice Recognition
by Simby Sim et al.