Multimodal Dialog
Intelligent Robot Lecture Note
Multimodal Dialog System
• A system which supports human-computer interaction over multiple different input and/or output modes.
► Input: voice, pen, gesture, facial expression, etc.
► Output: voice, graphical output, etc.
• Applications
► GPS navigation
► Information guide systems
► Smart home control
► Etc.
[Example: the user says "여기에서 여기로 가는 제일 빠른 길 좀 알려 줘." ("Tell me the fastest way to get from here to here.") by voice while indicating the two locations with pen gestures.]
Motivations
• Speech: the ultimate interface?
► + Interaction style: natural (free speech)
◦ Natural repair process for error recovery
► + Richer channel: carries the speaker's disposition and emotional state (if systems knew how to deal with that)
► - Inconsistent input (high error rates), errors hard to correct
◦ e.g., we may get a different result each time we speak the same words
► - Slow (sequential) output style when using TTS (text-to-speech)
• How to overcome these weak points? Multimodal interfaces!
Advantages of Multimodal Interface
• Task performance and user preference
• Migration of human-computer interaction away from the desktop
• Adaptation to the environment
• Error recovery and handling
• Special situations where mode choice helps
Task Performance and User Preference
• Task performance and user preference for multimodal over speech-only interfaces [Oviatt et al., 1997]
► 10% faster task completion,
► 23% fewer words (shorter and simpler linguistic constructions),
► 36% fewer task errors,
► 35% fewer spoken disfluencies,
► 90-100% of users preferred to interact this way.
• Speech-only dialog system
Speech: "Bring the drink on the table to the side of the bed."
• Multimodal dialog system
Speech: "Bring this to here."  Pen gesture: [points at the drink, then at the bedside]
→ An easy, simplified user utterance!
Migration of Human-Computer Interaction away from the Desktop
• Small portable computing devices
► Such as PDAs, organizers, and smart-phones
► Limited screen real estate for graphical output
► Limited input: no keyboard/mouse (arrow keys, thumbwheel)
► Complex GUIs not feasible
► Augment the limited GUI with natural modalities such as speech and pen
◦ Use less space
◦ Rapid navigation over menu hierarchies
• Other devices
► Kiosks, car navigation systems, ...
◦ No mouse or keyboard
→ Speech + pen gesture
Adaptation to the environment
• Multimodal interfaces enable rapid adaptation to changes in the environment
► Allow the user to switch modes
► Mobile devices are used in multiple environments
• Environmental conditions can be either physical or social
► Physical
◦ Noise: increases in ambient noise can degrade speech performance → switch to GUI or stylus/pen input
◦ Brightness: bright light in an outdoor environment can limit the usefulness of a graphical display
► Social
◦ Speech may be easiest for a password, account number, etc., but in public places users may be uncomfortable being overheard → switch to GUI or keypad input
Error Recovery and Handling
• Advantages for recovery and reduction of error:
► Users intuitively pick the mode that is less error-prone.
► Language is often simplified.
► Users intuitively switch modes after an error
◦ The same problem is not repeated.
◦ Multimodal error correction
► Cross-mode compensation (complementarity)
◦ Combining inputs from multiple modalities can reduce the overall error rate.
◦ A multimodal interface thus potentially outperforms either mode alone.
Special Situations Where Mode Choice Helps
• Users with disabilities
• People with a strong accent or a cold
• People with RSI (repetitive strain injury)
• Young children or non-literate users
• Other users who have problems handling the standard devices: mouse and keyboard

• Multimodal interfaces let people choose their preferred interaction style depending on the actual task, the context, and their own preferences and abilities.
Multimodal Dialog System Architecture
• Architecture of QuickSet [Cohen et al., 1997]
► Multi-agent architecture
[Architecture diagram: agents for Speech/TTS, Natural Language, Sketch/Gesture, the Map Interface, Multimodal Integration, and Simulators communicate through Facilitators (routing, triggering, dispatching) using the Inter-agent Communication Language (ICL, Horn clauses); bridges connect to Web services (XML, SOAP, ...), databases, other Facilitators, CORBA, COM objects, Java-enabled Web pages, other user interfaces, and VR/AR interfaces such as MAVEN and BARS.]
Multimodal Language Processing
Multimodal Reference Resolution
• Multimodal Reference Resolution
► Need to resolve references (what the user is referring to) across modalities.
► A user may refer to an item on a display by using speech, by pointing, or both.
► Closely related to multimodal integration.
[Example: "여기에서 여기로 가는 제일 빠른 길 좀 알려 줘." ("Tell me the fastest way to get from here to here.") spoken by voice, with pen gestures marking the two locations.]
Multimodal Reference Resolution
• Multimodal Reference Resolution
► Finds the most proper referents for referring expressions [Chai et al., 2004]
◦ Referring expression
– Refers to a specific entity or entities
– Given by a user's inputs (most likely in the speech input)
◦ Referent
– An entity to which the user refers
◦ A referent can be an object that is not specified by the current utterance.
[Timeline: the speech "여기에서 여기로 가는 가장 빠른 길 좀 알려줘" ("Tell me the fastest way from here to here") contains two occurrences of 여기 ("here"), each aligned with a pen gesture (g1, g2) selecting an object on the map: 버거킹 (Burger King) and 롯데 백화점 (Lotte Department Store).]
Multimodal Reference Resolution
• Multimodal Reference Resolution
► Hard case
◦ Multiple and complex gesture inputs
◦ e.g., in an information guide system
[Timeline: the speech "이거랑 이것들이랑 가격 좀 비교 해 줄래" ("Can you compare the prices of this and these?") contains the expressions 이거 ("this") and 이것들 ("these"), while three gestures g1, g2, g3 occur; which gestures belong to which expression is ambiguous.]

Dialog example:
User: 이건 가격이 얼마지? ("How much is this?"; selects one item)
System: 만 오천원 입니다. ("It is 15,000 won.")
User: 이거랑 이것들이랑 가격 좀 비교 해 줄래 ("Can you compare the prices of this and these?"; selects three items)
Multimodal Reference Resolution
• Multimodal Reference Resolution
► Uses linguistic theories to guide the reference resolution process [Chai et al., 2005]
◦ Conversational implicature
◦ Givenness hierarchy
► Greedy algorithm for finding the best assignment for a referring expression given a cognitive status
◦ Calculates a match score between referring expressions and referent candidates
◦ Finds the best assignments with a greedy algorithm
► Matching score, where $S$ ranges over the cognitive statuses $\{G, F, D\}$:

$\mathrm{Match}(o, e) = \sum_{S \in \{G, F, D\}} P(o \mid S)\, P(S \mid e) \cdot \mathrm{Compatibility}(o, e)$

where $P(o \mid S)$ is the object selectivity, $P(S \mid e)$ the likelihood of the status given the expression, and $\mathrm{Compatibility}(o, e)$ the compatibility measurement.
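The greedy assignment can be sketched in a few lines of Python. This is an illustrative reading of the formula above, not the authors' implementation; the probability tables, the status set {G, F, D}, and the compatibility function are all assumed inputs.

```python
STATUSES = ["G", "F", "D"]   # cognitive statuses from the givenness hierarchy

def match(obj, expr, p_obj_given_status, p_status_given_expr, compatibility):
    """Match(o, e) = sum over S of P(o|S) * P(S|e), times Compatibility(o, e)."""
    return sum(
        p_obj_given_status[(obj, s)] * p_status_given_expr[(s, expr)]
        for s in STATUSES
    ) * compatibility(obj, expr)

def resolve_greedy(expressions, candidates, *score_tables):
    """Greedily assign each referring expression the unused candidate
    with the highest match score."""
    assignment, used = {}, set()
    for expr in expressions:
        best = max(
            (o for o in candidates if o not in used),
            key=lambda o: match(o, expr, *score_tables),
            default=None,
        )
        if best is not None:
            assignment[expr] = best
            used.add(best)
    return assignment
```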
Multimodal Integration
• Combining information from multiple input modalities to understand the user's intention and attention
► Multimodal reference resolution is a special case of multimodal integration
◦ Speech + pen gesture
◦ The case where pen gestures can express only deictic or grouping meanings
[Diagram: the meaning derived from each modality feeds into multimodal integration / fusion, which produces a combined meaning.]
Multimodal Integration
• Issues:
► Nature of the multimodal integration mechanism
◦ Algorithmic (procedural)
◦ Parser / grammars (declarative)
► Does the approach treat one mode as primary?
◦ Is gesture a secondary, dependent mode?
– Multimodal reference resolution
► How temporal and spatial constraints are expressed
► Common meaning representation for speech and gesture
• Two main approaches
► Unification-based multimodal parsing and understanding [Johnston, 1998]
► Finite-state transducers for multimodal parsing and understanding [Johnston et al., 2000]
Unification-based multimodal parsing and understanding
• Parallel recognizers and "understanders"
• Time-stamped meaning fragments for each stream
• Common framework for meaning representation: typed feature structures
• Meaning fusion operation: unification
► Unification is an operation that determines the consistency of two pieces of partial information
► and, if they are consistent, combines them into a single result
◦ Whether a given gestural input is compatible with a given piece of spoken input
◦ And if they are, combine them into a single result
► Semantic and spatiotemporal constraints
• Statistical ranking
• Flexible asynchronous architecture
• Must handle unimodal and multimodal input
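As a minimal sketch, unification over untyped feature structures can be written as a recursive dictionary merge. Real systems use typed feature structures with a type hierarchy and structure sharing; the function and the example structures below are illustrative only.

```python
def unify(fs1, fs2):
    """Return the unification of two feature structures, or None on a clash."""
    if isinstance(fs1, dict) and isinstance(fs2, dict):
        result = dict(fs1)
        for key, val in fs2.items():
            if key in result:
                merged = unify(result[key], val)
                if merged is None:       # inconsistent partial information
                    return None
                result[key] = merged
            else:                        # partiality: absent features just merge in
                result[key] = val
        return result
    return fs1 if fs1 == fs2 else None   # atoms must match exactly

# Toy "draw a line" flavor: speech supplies the command and color,
# the pen supplies the coordinates; unification combines them.
speech = {"type": "create_line", "object": {"color": "green"}}
gesture = {"type": "create_line",
           "location": {"coordlist": [(12143, 12134), (12146, 12134)]}}
print(unify(speech, gesture))
```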
Unification-based multimodal parsing and understanding
• Temporal constraints [Oviatt et al., 1997]
► Speech and gesture overlap, or
► Gesture precedes speech by <= 4 seconds
► Speech does not precede gesture

Given the sequence speech1; gesture; speech2, the only possible grouping is speech1; (gesture; speech2).

Finding [Oviatt et al., 2004, 2005]: users have a consistent temporal integration style, so systems can adapt to it.
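A sketch of this grouping rule, assuming each signal has been reduced to a (start, end) timestamp pair in seconds:

```python
MAX_GESTURE_LEAD = 4.0   # gesture may precede speech by at most 4 seconds

def should_integrate(speech, gesture):
    """True if the gesture may be grouped with the speech segment."""
    s_start, s_end = speech
    g_start, g_end = gesture
    overlaps = g_start <= s_end and s_start <= g_end
    gesture_precedes = 0.0 <= s_start - g_end <= MAX_GESTURE_LEAD
    return overlaps or gesture_precedes   # speech preceding gesture never groups

print(should_integrate((5.0, 7.0), (3.5, 4.2)))   # gesture leads by 0.8 s -> True
print(should_integrate((1.0, 2.0), (3.0, 4.0)))   # speech precedes gesture -> False
```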
Unification-based multimodal parsing and understanding
• Each unimodal input is represented as a feature structure [Holzapfel et al., 2004]
► A very common representation in computational linguistics: FUG, LFG, PATR
◦ e.g., lexical entries, grammar rules, etc.
► e.g., "please switch on the lamp"
• Predefined rules resolve the deictic references and integrate the multimodal inputs
[Figure: a feature structure of type Type with attributes Attr1: val1 and Attr2: val2, and an embedded structure of type Type2 under Attr3 with Attr4: val4.]
Unification-based multimodal parsing and understanding
• An example “Draw a line”
[Figure: unification for "draw a line".

From speech (one of many hypotheses):
create_line
  object: [color: green, label: "draw a line"]
  location: [line]

From pen gesture (command hypotheses):
  location: [line, coordlist: [(12143,12134), (12146,12134), ...]]
  location: [point, xcoord: 15487, ycoord: 19547]

Unifying the speech hypothesis with the compatible line gesture (cross-mode compensation) yields:
create_line
  object: [color: green, label: "draw a line"]
  location: [line, coordlist: [(12143,12134), (12146,12134), ...]]
]
Unification-based multimodal parsing and understanding
• Advantages of multimodal integration via typed feature structure unification
► Partiality
► Structure sharing
► Mutual compensation (cross-mode compensation)
► Multimodal discourse
Unification-based multimodal parsing and understanding
• Mutual Disambiguation (MD)
► Each input mode provides a set of scored recognition hypotheses
► MD derives the best joint interpretation by unification of meaning representation fragments
► $P_{MM} = \alpha P_S + \beta P_G + C$
◦ Learn α, β, and C over a multimodal corpus
► MD stabilizes system performance in challenging environments
[Figure: lattices of speech hypotheses (s1-s3), gesture hypotheses (g1-g4), and object candidates (o1-o3) are combined into a ranked list of multimodal interpretations (mm1-mm4).]
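A sketch of how the scored hypothesis lists might be combined: every unifiable (speech, gesture) pair is scored with the P_MM formula above and the best joint interpretation wins. The unify function is assumed to behave as on the earlier slides, and alpha, beta, c would be learned from a multimodal corpus; the names and the hypothesis format are illustrative.

```python
from itertools import product

def best_joint_interpretation(speech_hyps, gesture_hyps, unify, alpha, beta, c):
    """speech_hyps / gesture_hyps: lists of (meaning_fragment, score) pairs."""
    best, best_score = None, float("-inf")
    for (s_frag, p_s), (g_frag, p_g) in product(speech_hyps, gesture_hyps):
        combined = unify(s_frag, g_frag)
        if combined is None:                   # incompatible fragments are pruned
            continue
        score = alpha * p_s + beta * p_g + c   # P_MM = alpha*P_S + beta*P_G + C
        if score > best_score:
            best, best_score = combined, score
    return best, best_score
```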
Finite-state Multimodal Understanding
• Modeled by a 3-tape finite-state device over
► the speech stream (words) and the gesture stream (gesture symbols)
► and their combined meaning (meaning symbols)
• The device takes speech and gesture as input and produces the meaning output.
• Simulated by two transducers:
► G:W, aligning speech and gesture
► G×W:M, taking the composite alphabet of speech and gesture symbols as input and outputting meaning
• The speech and gesture inputs are first composed with G:W; the result G_W is then composed with G×W:M.
Finite-state Multimodal Understanding
• Representation of the speech input modality
► Lattice of words
• Representation of the gesture input modality
► The range of gesture recognitions is represented as a lattice of gesture symbols
[Figure: a speech lattice over "show phone numbers for these two / ten american / new restaurants" and a gesture lattice from state 0 with paths such as G area loc SEM(points...) and G sel 2 rest SEM(r12,r15), plus a handwriting (hw) branch.]
Finite-state Multimodal Understanding
• Representation of the combined meaning
► Also represented as a lattice
► Paths in the meaning lattice are well-formed XML, e.g.:

<cmd> <info> <type>phone</type> <obj><rest>r12,r15</rest></obj> </info> </cmd>

[Figure: a meaning lattice whose path spells out <cmd> <type> phone </type> <obj> <rest> SEM(r12,r15) </rest> </obj> </cmd>.]
Finite-state Multimodal Understanding
• Multimodal Grammar Formalism
► Multimodal context-free grammar (MCFG)
◦ e.g., HEADPL → restaurants:rest:<rest> ε:SEM:SEM ε:ε:</rest>
► Terminals are multimodal tokens consisting of three components:
◦ speech stream : gesture stream : combined meaning (W:G:M)
► e.g., "put that there"

S → ε:ε:<cmd> PUTV OBJNP LOCNP ε:ε:</cmd>
PUTV → ε:ε:<act> put:ε:put ε:ε:</act>
OBJNP → ε:ε:<obj> that:Gvehicle:ε ε:SEM:SEM ε:ε:</obj>
LOCNP → ε:ε:<loc> there:Garea:ε ε:ε:</loc>

Aligned streams for the example (parse tree S → PUTV OBJNP LOCNP):
Speech:  put | that | there
Gesture: (none) | Gvehicle v1 | Garea a1
Meaning: <cmd> <act>put</act> <obj>v1</obj> <loc>a1</loc> </cmd>
Finite-state Multimodal Understanding
• Multimodal Grammar Example
► Speech: email this person and that organization
► Gesture: Gp SEM Go SEM
► Meaning: email([ person(SEM) , org(SEM) ])

S → V NP ε:ε:])
NP → DET N
NP → NP CONJ NP
CONJ → and:ε:,
V → email:ε:email([
V → page:ε:page([
DET → this:ε:ε
DET → that:ε:ε
N → person:Gp:person( ε:SEM:SEM ε:ε:)
N → organization:Go:org( ε:SEM:SEM ε:ε:)
N → department:Gd:dept( ε:SEM:SEM ε:ε:)
[Figure: the corresponding multimodal FST with states 0-6. Arcs: email:ε:email([ and page:ε:page([ from 0 to 1; this:ε:ε and that:ε:ε from 1 to 2; person:Gp:person(, organization:Go:org(, and department:Gd:dept( from 2 to 3; ε:SEM:SEM from 3 to 4; ε:ε:) from 4 to 5; and:ε:, looping from 5 back to 1; ε:ε:]) from 5 to the final state 6.]
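For illustration, the 3-tape device can be simulated directly: each arc consumes a word from the speech stream and/or a symbol from the gesture stream and emits meaning symbols. The sketch below hard-codes the arcs read off the figure above ("" stands for epsilon); it is a toy depth-first traversal, not a real weighted-FST composition.

```python
ARCS = [
    (0, 1, "email", "", "email(["), (0, 1, "page", "", "page(["),
    (1, 2, "this", "", ""),         (1, 2, "that", "", ""),
    (2, 3, "person", "Gp", "person("),
    (2, 3, "organization", "Go", "org("),
    (2, 3, "department", "Gd", "dept("),
    (3, 4, "", "SEM", "SEM"),
    (4, 5, "", "", ")"),
    (5, 1, "and", "", ","),         # conjunction loops back for the next NP
    (5, 6, "", "", "])"),
]
FINAL = 6

def parse(state, speech, gesture, meaning):
    """Yield meaning-symbol sequences for paths consuming both input streams."""
    if state == FINAL and not speech and not gesture:
        yield meaning
    for src, dst, w, g, m in ARCS:
        if src != state:
            continue
        if w and (not speech or speech[0] != w):
            continue
        if g and (not gesture or gesture[0] != g):
            continue
        yield from parse(dst, speech[1:] if w else speech,
                         gesture[1:] if g else gesture,
                         meaning + [m] if m else meaning)

speech = "email this person and that organization".split()
gesture = ["Gp", "SEM", "Go", "SEM"]
print(" ".join(next(parse(0, speech, gesture, []))))
# email([ person( SEM ) , org( SEM ) ])
```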
Finite-state Multimodal Understanding
[Figure: the speech lattice ("show phone numbers for these two / ten american / new restaurants") and the gesture lattice (G area loc SEM(points...) / G sel 2 rest SEM(r12,r15)) enter the 3-tape multimodal finite-state device; integration processing produces the meaning lattice beginning <cmd> <type> phone ... .]
Finite-state Multimodal Understanding
• An example
[Figure: applying the multimodal grammar FST. Speech lattice: email this person and that organization (states 0-6). Gesture lattice: Gp SEM Go SEM. Composing both with the grammar yields the meaning lattice: email([ person( SEM ) , org( SEM ) ]).]
Robustness in Multimodal Dialog
• Gain robustness via
► Fusion of inputs from multiple modalities
► Using the strengths of one mode to compensate for weaknesses of others, at design time and at run time
► Avoiding/correcting errors
► Statistical architecture
► Confirmation
► Dialogue context
► Simplification of language in a multimodal context
► Output affecting/channeling input
• Example approaches
► Edit machines in FST-based multimodal integration and understanding
► Salience-driven approach to robust input interpretation
► N-best re-ranking for improving speech recognition performance
Edit Machines in FST-based MM Integration
• Problem with FST-based MM integration: mismatch between the user's input and the language encoded in the grammar

ASR: show cheap restaurants thai places in in chelsea
Grammar: show cheap thai places in chelsea

• How to parse it? Determine which in-grammar string it is most like.

Edits: show cheap ε thai places in ε chelsea
("restaurants" and one "in" are deleted)

To find this, employ an edit machine!
Handcrafted Finite-state Edit Machines
• Edit-based Multimodal Understanding – Basic edit
► Transform the ASR output so that it can be assigned a meaning by the FST-based multimodal understanding model
► Find the string with the least costly sequence of edits that can be assigned an interpretation by the grammar:

$s^{*} = \operatorname*{argmin}_{s \in \lambda_s} \; s \circ \lambda_{\mathrm{edit}} \circ \lambda_g$

◦ $\lambda_g$: language encoded in the multimodal grammar
◦ $\lambda_s$: strings encoded in the lattice resulting from ASR
◦ $\circ$: composition of transducers
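As an illustration of the argmin, plain word-level Levenshtein distance can stand in for the weighted edit transducer, with the grammar language enumerated explicitly (a real system composes FSTs instead). The first grammar string comes from the earlier example; the second is an invented distractor.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between token lists a and b."""
    d = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, wb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1,           # delete from a
                                   d[j - 1] + 1,       # insert into a
                                   prev + (wa != wb))  # substitute / match
    return d[-1]

def closest_in_grammar(asr_output, grammar_strings):
    """argmin over a (finitely enumerated) grammar language."""
    return min(grammar_strings, key=lambda s: edit_distance(asr_output, s))

asr = "show cheap restaurants thai places in in chelsea".split()
grammar = ["show cheap thai places in chelsea".split(),
           "show cheap italian places in chelsea".split()]
print(" ".join(closest_in_grammar(asr, grammar)))   # the 2-deletion neighbor wins
```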
Handcrafted Finite-state Edit Machines
• Edit-based Multimodal Understanding – 4-edit
► The basic edit machine is quite large and adds an unacceptable amount of latency (5 s on average).
► 4-edit limits the number of edit operations to at most 4.
Handcrafted Finite-state Edit Machines
• Edit-based Multimodal Understanding – Smart edit
► Smart edit is a 4-edit machine + heuristics + refinements
◦ Deletion of SLM-only words (not found in the grammar)
– thai restaurant listings in midtown → thai restaurant in midtown
◦ Deletion of doubled words
– subway to to the cloisters → subway to the cloisters
◦ Subdivided cost classes (insertion and deletion costs in 3 classes)
– High cost: slot fillers (e.g., chinese, cheap, downtown)
– Low cost: dispensable words (e.g., please, would)
– Medium cost: all other words
◦ Auto-completion of place names
– The algorithm enumerates all possible shortenings of place names
– e.g., Metropolitan Museum of Art ↔ Metropolitan Museum
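The subdivided deletion costs might look like the sketch below; the word lists and cost values are illustrative placeholders, not the system's actual lexicon.

```python
SLOT_FILLERS = {"chinese", "thai", "cheap", "downtown", "chelsea"}
DISPENSABLE = {"please", "would", "uh", "um"}

def deletion_cost(word, grammar_vocab):
    if word not in grammar_vocab:   # SLM-only word: deleted essentially for free
        return 0.0
    if word in DISPENSABLE:
        return 0.5                  # low cost
    if word in SLOT_FILLERS:
        return 3.0                  # high cost: losing a filler changes the meaning
    return 1.0                      # medium cost: all other words
```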
Learning Edit Patterns
• The user's input is considered a "noisy" version of the parsable ("clean") input.

Noisy (S): show cheap restaurants thai places in in chelsea
Clean (T): show cheap ε thai places in ε chelsea

• Goal: translate the user's input into a string that can be assigned a meaning representation by the grammar.
Learning Edit Patterns
• Noisy Channel Model for Error Correction
► Translation probability

$S_g^{*} = \operatorname*{argmax}_{S_g} P(S_u, S_g)$

◦ $S_g$: string that can be assigned a meaning representation by the grammar
◦ $S_u$: user's input utterance
► With a Markov (trigram) assumption:

$S_g^{*} = \operatorname*{argmax}_{S_g} \prod_i P(S_u^{i}, S_g^{i} \mid S_u^{i-1}, S_g^{i-1}, S_u^{i-2}, S_g^{i-2})$

where $S_u = S_u^{1} S_u^{2} \cdots S_u^{n}$ and $S_g = S_g^{1} S_g^{2} \cdots S_g^{m}$
► Word alignments $(S_u^{i}, S_g^{i})$
◦ obtained with GIZA++
Learning Edit Patterns
• Deriving a translation corpus
► The finite-state transducer can generate the input strings for a given meaning.
► These pairs are used for training the translation model.

[Diagram: for each (meaning, string) pair in the corpus, the multimodal grammar generates the strings for that meaning, and the generated string closest to the observed one is selected as the target string.]
Experiments and Results
• 16 first-time users (8 male, 8 female)
• 833 user interactions (218 multimodal / 491 speech-only / 124 pen-only)
• Tasks: finding restaurants of various types and getting their names, phone numbers, and addresses; getting subway directions between locations
• Avg. ASR sentence accuracy: 49%
• Avg. ASR word accuracy: 73.4%
Experiments and Results
• Improvements in concept accuracy

Result of 6-fold cross validation:

| Method | ConAcc | Rel. Impr. |
|---|---|---|
| No edits | 38.9% | 0% |
| Basic edit | 51.5% | 32% |
| 4-edit | 53.0% | 36% |
| Smart edit | 60.2% | 55% |
| Smart edit (lattice) | 63.2% | 62% |
| MT edit | 50.3% | 29% |

Result of 10-fold cross validation:

| Method | ConAcc |
|---|---|
| Smart edit | 67.4% |
| MT edit | 61.1% |
A Salience Driven Approach
• Modify the language model score and rescore the recognized hypotheses
► using information from the gesture input
► Primed language model:

$W^{*} = \operatorname*{argmax}_{W} P(O \mid W)\, P(W)$
A Salience Driven Approach
• "People do not make any unnecessary deictic gestures"
► Cognitive theory of conversational implicature
◦ Speakers tend to make their contribution as informative as is required
◦ and not make it more informative than is required
• "Speech and gesture tend to complement each other"
► When a speech utterance is accompanied by a deictic gesture,
◦ the speech input issues commands or inquiries about properties of objects
◦ the deictic gesture indicates the objects of interest
• Gesture is an early indicator that anticipates the content of the subsequent spoken utterance
► 85% of the time, gestures occurred before the corresponding speech unit
A Salience Driven Approach
• A deictic gesture can activate several objects on the graphical display
► It signals a distribution over the objects that are salient

[Figure: for "Move this to here", the gesture precedes the speech in time; on the graphical display it assigns salience weights to nearby objects, the most salient object being a cup.]
A Salience Driven Approach
• The salient object (the cup) is mapped to the physical world representation
► to indicate a salient part of the representation
◦ such as relevant properties or tasks related to the salient objects
• This salient part of the physical world is likely to be the potential content of the speech

[Figure: timeline for "Move this to here" with the gesture preceding the speech, linking the gesture to the cup in the physical world representation.]
A Salience Driven Approach
• Physical world representation
► Domain Model
◦ Relevant knowledge about the domain
– Domain objects
– Properties of objects
– Relations between objects
– Task models related to objects
◦ Frame-based representation
– Frame: domain object
– Frame elements: attributes and tasks related to the object
► Domain Grammar
◦ Specifies the grammar and vocabulary used to process language inputs
– Semantics-based context-free grammar
– Non-terminals: semantic tags
– Terminals: words (values of semantic tags)
– Annotated user spoken utterances
– Relevant semantic information
– N-grams
Salience Modeling
• Calculating a salience distribution over the entities in the physical world
► The salience value of an entity at time $t_n$ is influenced by the joint effect of the sequence of gestures that happened before $t_n$:

$P_{t_n}(e_k) = \dfrac{\sum_{i=1}^{m} \alpha(g_{t_i})\, P(e_k \mid g_{t_i})}{\sum_{e} \sum_{i=1}^{m} \alpha(g_{t_i})\, P(e \mid g_{t_i})}$

◦ $P(e \mid g_{t_i})$: object selectivity of entity $e$ given gesture $g_{t_i}$
◦ $\alpha(g_{t_i})$: weight of the salience contribution of a gesture at time $t_i$
◦ $P_{t_n}(e_k)$: salience value of entity $e_k$ at time $t_n$
Salience Modeling
Reading the formula:
► Numerator: the sum of $\alpha(g_{t_i})\, P(e_k \mid g_{t_i})$ over all gestures before time $t_n$, weighted by $\alpha$
► Denominator: normalizing factor, the sum of the salience values of all entities at time $t_n$
► Recency weight:

$\alpha(g_{t_i}) = \exp\!\left(-\dfrac{(t_n - t_i)^2}{2000}\right)$

The closer the gesture, the higher its impact on the salience distribution.
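A direct transcription of the salience update into Python might look like this; the gesture format (timestamp plus a selectivity distribution over entities) and the time units inside the decay constant are assumptions made for the sketch.

```python
import math

def alpha(t_now, t_gesture):
    """Recency weight: closer gestures contribute more."""
    return math.exp(-((t_now - t_gesture) ** 2) / 2000.0)

def salience(entities, gestures, t_now):
    """gestures: list of (t_i, {entity: P(entity | gesture_i)}) pairs."""
    raw = {e: sum(alpha(t_now, t_i) * sel.get(e, 0.0) for t_i, sel in gestures)
           for e in entities}
    z = sum(raw.values()) or 1.0          # normalizing factor over all entities
    return {e: v / z for e, v in raw.items()}

gestures = [(0, {"cup": 0.7, "plate": 0.3}), (40, {"cup": 0.5, "bed": 0.5})]
print(salience(["cup", "plate", "bed"], gestures, t_now=50))
# the recent gesture dominates, so the cup and bed outrank the plate
```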
Salience Driven Spoken Language Understanding
• Maps the salience distribution to the physical world representation
• Uses the salient world to influence spoken language understanding
• Primes the language model to facilitate language understanding
► Rescores the speech recognizer's hypotheses with the primed language model score
Primed Language Model
• The primed language model is based on the class-based bigram model
► Classes: semantic and functional classes for the domain
◦ e.g., this → Demonstrative, price → AttrPrice

$P(w_i \mid w_{i-1}) \approx P(w_i \mid c_i)\, P(c_i \mid c_{i-1})$

(word class probability × class transition probability)

► Modify the word class probability
◦ Originally it measures the probability of seeing word $w_i$ given class $c_i$
◦ It is modified so that the choice of word $w_i$ depends on the salient physical world, represented by the salience distribution $P(e)$:

$P(w_i \mid c_i) = \sum_{e_k} \dfrac{P(w_i, c_i \mid e_k)}{P(c_i \mid e_k)}\, P_{t_i}(e_k)$

◦ $P(w_i, c_i \mid e_k)$ and $P(c_i \mid e_k)$ do not depend on the time $t_i$ and can be estimated from the training data
• Speech hypotheses are reordered according to the primed language model.
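A sketch of rescoring the N-best list with the primed class bigram. The lookup tables class_of, p_class_bigram, and p_word_class_entity (the last standing for P(w_i, c_i | e) / P(c_i | e), i.e. the salience-conditioned word class probability) are assumed to have been estimated from training data; salience is the distribution from the previous slides.

```python
import math

def primed_logprob(words, class_of, p_class_bigram, p_word_class_entity, salience):
    logp, prev_class = 0.0, "<s>"
    for w in words:
        c = class_of[w]
        # P(w_i | c_i) summed over salient entities, as in the formula above
        p_w = sum(p_word_class_entity[(w, c, e)] * p_e
                  for e, p_e in salience.items())
        # class transition P(c_i | c_{i-1}) times the primed word class probability
        logp += math.log(p_class_bigram[(c, prev_class)] * max(p_w, 1e-12))
        prev_class = c
    return logp

def rerank(hypotheses, **model):
    """Reorder N-best speech hypotheses by primed language model score."""
    return sorted(hypotheses, key=lambda h: primed_logprob(h, **model), reverse=True)
```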
Evaluation - WER
• Domain: real estate properties
• Interface: speech + pen gesture
• 11 users tested: five non-native speakers and six native speakers
• 226 user inputs with an average of 8 words per utterance
• Average WER reduction: about 12% (t = 4.75, p < 0.001)
| User Index | # of inputs | # inputs w/o gesture | Baseline WER |
|---|---|---|---|
| 1 | 21 | 0 | 0.287 |
| 2 | 31 | 0 | 0.335 |
| 3 | 27 | 0 | 0.399 |
| 4 | 10 | 0 | 0.680 |
| 5 | 8 | 1 | 0.200 |
| 6 | 36 | 0 | 0.387 |
| 7 | 18 | 0 | 0.250 |
| 8 | 25 | 1 | 0.278 |
| 9 | 23 | 0 | 0.482 |
| 10 | 11 | 0 | 0.117 |
| 11 | 16 | 3 | 0.255 |
Evaluation – Concept Identification
• Examples of improved cases
► Transcription: What is the population of this town
  Baseline: What is the publisher of this time
  Salience-based: What is the population of this town
► Transcription: How much is this gray house
  Baseline: How much is this great house
  Salience-based: How much is this gray house

| | Baseline | Salience-based |
|---|---|---|
| Precision | 80.3% | 84.6% |
| Recall | 75.7% | 83.8% |
| F-measure | 77.9% | 84.2% |
N-best Re-ranking for Improving Speech Recognition Performance
• Using multimodal understanding features
[Example: the user says 이것 좀 여기에 갖다 놔 ("Put this over here") with a pen gesture, but the ASR output is the garbled 이다 좀 여기에 갖 다 가. The SLU result (Speech Act: request; Main Goal: move; Component Slots: Target.Loc: 여기 "here") is missing the slot Source.Item: 이것 ("this") because of the speech recognition error.]
N-best Re-ranking for Improving Speech Recognition Performance
• Using N-best ASR hypotheses
► Rescore the hypotheses with information that is not available during speech recognition
► Here, multimodal understanding features are used
[Figure: the N-best ASR list (e.g., 이다 좀 여기 갖 다 가 / 이다 좀 여기 갖 다 줘 / 이것 좀 여기 갖 다 가 / ...) passes through the re-ranking model with many features, which promotes the correct hypothesis 이것 좀 여기 갖 다 가 ("Take this over here") to the top.]
Speech Recognizer Features
• Speech recognizer score: P(W|X)
• Acoustic model score: P(X|W)
• Language model score: P(W)
• N-best word rate: gives more confidence to a word that occurs in many hypotheses
• N-best homogeneity: gives more weight to a word that appears in higher-ranked hypotheses, weighting each word by the score of the hypothesis in which it appears
$\text{N-best word rate}(w_i) = \dfrac{\text{number of hypotheses containing } w_i}{\text{number of hypotheses in the N-best list}}$

$\text{N-best homogeneity}(w_i) = \dfrac{\text{sum of scores of hypotheses containing } w_i}{\text{sum of scores of all hypotheses in the N-best list}}$
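Both features are direct ratios over the N-best list; a sketch, assuming each hypothesis is a (token list, recognizer score) pair:

```python
def nbest_word_rate(word, hyps):
    """Fraction of hypotheses in the N-best list containing the word."""
    return sum(word in words for words, _ in hyps) / len(hyps)

def nbest_homogeneity(word, hyps):
    """Sum of scores of hypotheses containing the word over all scores."""
    total = sum(score for _, score in hyps)
    return sum(score for words, score in hyps if word in words) / total

hyps = [("put this there".split(), 0.5),
        ("put these there".split(), 0.3),
        ("pet this their".split(), 0.2)]
print(nbest_word_rate("this", hyps))    # 0.666...
print(nbest_homogeneity("this", hyps))  # 0.7
```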
SLU features
• CRF confidence score: the confidence score of the SLU results
► Confidence scores of the speech act and main goal: P(speech act | word sequence), P(main goal | word sequence)
◦ Derived from the CRF formulation

$P(y \mid x) = \dfrac{1}{Z_x} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \right)$

◦ $y$: output variable
◦ $x$: input variable
◦ $Z_x$: normalization factor
◦ $f_k(y_{t-1}, y_t, x, t)$: arbitrary linguistic feature function (often binary-valued)
◦ $\lambda_k$: trained parameter associated with feature $f_k$
SLU features
• CRF confidence score (cont.)
► Confidence score of a component slot:

$\mathrm{Conf}(y_t, x_t) = \dfrac{\exp\!\left(\sum_k \lambda_k f_k(y_{t-1}, y_t, x_t)\right)}{\sum_{y_t} \exp\!\left(\sum_k \lambda_k f_k(y_{t-1}, y_t, x_t)\right)}$

◦ $y_t$: component slot
◦ $x_t$: corresponding word
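The slot confidence is a local softmax over candidate slot labels under the trained CRF weights. In the sketch below, feature_score stands in for the inner sum of lambda_k * f_k(y_{t-1}, y_t, x_t); all names are illustrative.

```python
import math

def slot_confidence(y_prev, y, x_t, slot_labels, feature_score):
    """Conf(y_t, x_t): softmax of the feature score over candidate labels."""
    num = math.exp(feature_score(y_prev, y, x_t))
    den = sum(math.exp(feature_score(y_prev, cand, x_t)) for cand in slot_labels)
    return num / den
```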
Multimodal Understanding Features
• Multimodal reference resolution score
► A well-recognized speech hypothesis tends to resolve well.
► (a) Well recognized: "Clean this room and this bathroom" with two pen gestures; both referring expressions resolve.
► (b), (c) Misrecognized: "this bathroom" becomes "this bad noon"
◦ (b) "this bad noon" cannot be a referring expression, so the second pen gesture receives a low reference resolution score.
◦ (c) Treating "this bad noon" as a referring expression still yields a low reference resolution score.

[Figure: three timelines (a)-(c) aligning the utterances "Clean this room and this bathroom" / "Clean this room and this bad noon" with two pen gestures each.]
Experimental Setup
• Corpus
► 617 multimodal inputs
◦ 118 (speech + pen gesture) + 499 (speech only)
◦ 3135 words, 5.08 words per utterance
◦ Vocabulary size: 396
• Speech recognizer
► An HTK-based Korean speech recognizer trained on 39-dimensional MFCC feature vectors
► Outputs 75-best lists
Experimental Result (WER)
• Comparison of word error rates between the baseline and the N-best re-ranking model with varying feature sets
► Relative error reduction rate: 7.95%
► The re-ranking model has a significantly smaller word error rate than the baseline system (p < 0.001).
| Model | WER (%) |
|---|---|
| Baseline | 17.74 |
| + Speech recognizer features | 17.38 |
| + SLU features | 16.43 |
| + Multimodal reference resolution features | 16.33 |
Experimental Results (WER)
• Word error rate of the N-best re-ranking model as the size of N varies
► If N is too large → many noisy hypotheses
► If N is too small → a small candidate set and few clues for re-ranking
Experimental Results (CER)
• Comparison of concept error rates between the baseline and the N-best re-ranking model
► Relative error reduction rate: 10.13%
► The re-ranking model has a significantly smaller concept error rate than the baseline system (p < 0.01).
| Model | CER (%) |
|---|---|
| Baseline | 14.28 |
| + Speech recognizer features | 13.81 |
| + SLU features | 13.11 |
| + Multimodal reference resolution features | 12.83 |
Reading List
• R. A. Bolt, 1980, “Put that there: Voice and gesture at the graphics interface,” Computer Graphics Vol. 14, no. 3, 262-270.
• J. Chai, S. Pan, M. Zhou, and K. Houck, 2002, Context-based Multimodal Understanding in Conversational Systems. Proceedings of the Fourth International Conference on Multimodal Interfaces (ICMI).
• J. Chai, P. Hong, and M. Zhou, 2004, A Probabilistic Approach to Reference Resolution in Multimodal User Interfaces. Proceedings of 9th International Conference on Intelligent User Interfaces (IUI-04), 70-77.
• J. Chai, Z. Prasov, J. Blaim, and R. Jin., 2005, Linguistic Theories in Efficient Multimodal Reference Resolution: an Empirical Investigation. Proceedings of the 10th International Conference on Intelligent User Interfaces (IUI-05), 43-50.
• J. Chai and S. Qu, 2005, A Salience Driven Approach to Robust Input Interpretation in Multimodal Conversational Systems. Proceedings of HLT/EMNLP 2005.
• H. Holzapfel, K. Nickel, and R. Stiefelhagen, 2004, Implementation and Evaluation of a Constraint-Based Multimodal Fusion System for Speech and 3D Pointing Gestures. Proceedings of the International Conference on Multimodal Interfaces (ICMI).
• M. Johnston, 1998, Unification-based multimodal parsing. Proceedings of the International Joint Conference of the Association for Computational Linguistics and the International Committee on Computational Linguistics, 624-630.
• M. Johnston and S. Bangalore, 2000, Finite-state multimodal parsing and understanding. Proceedings of COLING-2000.
• M. Johnston, S. Bangalore, G. Vasireddy, A. Stent, P. Ehlen, M. Walker, S. Whittaker, and P. Maloor, 2002, MATCH: An architecture for multimodal dialogue systems. Proceedings of ACL-2002.
• M. Johnston and S. Bangalore, 2006, Learning Edit Machines for Robust Multimodal Understanding. Proceedings of ICASSP 2006.
• P.R. Cohen, M. Johnston, D.R. McGee, S.L. Oviatt, J.A. Pittman, I. Smith, L. Chen, and J. Clow, 1997, "QuickSet: Multimodal Interaction for Distributed Applications," Intl. Multimedia Conference, 31-40.
• S. L. Oviatt , A. DeAngeli, and K. Kuhn, 1997, Integration and synchronization of input modes during multimodal human-computer interaction. In Proceedings of Conference on Human Factors in Computing Systems: CHI '97.