Multimodal Dialog
Intelligent Robot Lecture Note
Multimodal Dialog System
• A system which supports human-computer interaction over multiple different input and/or output modes.
► Input: voice, pen, gesture, facial expression, etc.
► Output: voice, graphical output, etc.
• Applications
► GPS navigation
► Information guide systems
► Smart home control
► Etc.
[Example: the user says "여기에서 여기로 가는 제일 빠른 길 좀 알려 줘." ("Tell me the fastest way to get from here to here.") by voice while indicating the two locations with pen gestures.]
Motivations
• Speech: the ultimate interface?
► + Interaction style: natural (free speech)
◦ Natural repair process for error recovery
► + Richer channel: carries the speaker's disposition and emotional state (if systems knew how to deal with that)
► - Inconsistent input (high error rates), errors hard to correct
◦ e.g., we may get a different result each time we speak the same words
► - Slow (sequential) output style when using TTS (text-to-speech)
• How to overcome these weak points? Multimodal interfaces!
Advantages of Multimodal Interface
• Task performance and user preference
• Migration of human-computer interaction away from the desktop
• Adaptation to the environment
• Error recovery and handling
• Special situations where mode choice helps
Task Performance and User Preference
• Task performance and user preference for multimodal over speech-only interfaces [Oviatt et al., 1997]
► 10% faster task completion,
► 23% fewer words (shorter and simpler linguistic constructions),
► 36% fewer task errors,
► 35% fewer spoken disfluencies,
► 90-100% of users preferred to interact this way.
• Speech-only dialog system
Speech: "Bring the drink on the table to the side of the bed."
• Multimodal dialog system
Speech: "Bring this to here."  Pen gesture: [points at the drink, then at the bedside]
→ An easy, simplified user utterance!
Migration of Human-Computer Interaction away from the Desktop
• Small portable computing devices
► Such as PDAs, organizers, and smart-phones
► Limited screen real estate for graphical output
► Limited input: no keyboard/mouse (arrow keys, thumbwheel)
► Complex GUIs not feasible
► Augment the limited GUI with natural modalities such as speech and pen
◦ Use less space
◦ Rapid navigation over menu hierarchies
• Other devices
► Kiosks, car navigation systems, ...
◦ No mouse or keyboard
→ Speech + pen gesture
Adaptation to the environment
• Multimodal interfaces enable rapid adaptation to changes in the environment
► Allow the user to switch modes
► Mobile devices are used in multiple environments
• Environmental conditions can be either physical or social
► Physical
◦ Noise: increases in ambient noise can degrade speech performance → switch to GUI or stylus/pen input
◦ Brightness: bright light in an outdoor environment can limit the usefulness of a graphical display
► Social
◦ Speech may be easiest for a password, account number, etc., but in public places users may be uncomfortable being overheard → switch to GUI or keypad input
Error Recovery and Handling
• Advantages for recovery and reduction of error:
► Users intuitively pick the mode that is less error-prone.
► Language is often simplified.
► Users intuitively switch modes after an error
◦ The same problem is not repeated.
◦ Multimodal error correction
► Cross-mode compensation (complementarity)
◦ Combining inputs from multiple modalities can reduce the overall error rate.
◦ A multimodal interface thus potentially outperforms either mode alone.
Special Situations Where Mode Choice Helps
• Users with disabilities
• People with a strong accent or a cold
• People with RSI (repetitive strain injury)
• Young children or non-literate users
• Other users who have problems handling the standard devices: mouse and keyboard

• Multimodal interfaces let people choose their preferred interaction style depending on the actual task, the context, and their own preferences and abilities.
Multimodal Dialog System Architecture
• Architecture of QuickSet [Cohen et al., 1997]
► Multi-agent architecture
[Architecture diagram: agents for Speech/TTS, Natural Language, Sketch/Gesture, the Map Interface, Multimodal Integration, and Simulators communicate through Facilitators (routing, triggering, dispatching) using the Inter-agent Communication Language (ICL, Horn clauses); bridges connect to Web services (XML, SOAP, ...), databases, other Facilitators, CORBA, COM objects, Java-enabled Web pages, other user interfaces, and VR/AR interfaces such as MAVEN and BARS.]
Multimodal Language Processing
Multimodal Reference Resolution
• Multimodal Reference Resolution
► Need to resolve references (what the user is referring to) across modalities.
► A user may refer to an item on a display by using speech, by pointing, or both.
► Closely related to multimodal integration.
[Example: "여기에서 여기로 가는 제일 빠른 길 좀 알려 줘." ("Tell me the fastest way to get from here to here.") spoken by voice, with pen gestures marking the two locations.]
Multimodal Reference Resolution
• Multimodal Reference Resolution
► Finds the most proper referents for referring expressions [Chai et al., 2004]
◦ Referring expression
– Refers to a specific entity or entities
– Given by a user's inputs (most likely in the speech input)
◦ Referent
– An entity to which the user refers
◦ A referent can be an object that is not specified by the current utterance.
[Timeline: the speech "여기에서 여기로 가는 가장 빠른 길 좀 알려줘" ("Tell me the fastest way from here to here") contains two occurrences of 여기 ("here"), each aligned with a pen gesture (g1, g2) selecting an object on the map: 버거킹 (Burger King) and 롯데 백화점 (Lotte Department Store).]
Multimodal Reference Resolution
• Multimodal Reference Resolution
► Hard case
◦ Multiple and complex gesture inputs
◦ e.g., in an information guide system
[Timeline: the speech "이거랑 이것들이랑 가격 좀 비교 해 줄래" ("Can you compare the prices of this and these?") contains the expressions 이거 ("this") and 이것들 ("these"), while three gestures g1, g2, g3 occur; which gestures belong to which expression is ambiguous.]

Dialog example:
User: 이건 가격이 얼마지? ("How much is this?"; selects one item)
System: 만 오천원 입니다. ("It is 15,000 won.")
User: 이거랑 이것들이랑 가격 좀 비교 해 줄래 ("Can you compare the prices of this and these?"; selects three items)
Multimodal Reference Resolution
• Multimodal Reference Resolution
► Uses linguistic theories to guide the reference resolution process [Chai et al., 2005]
◦ Conversational implicature
◦ Givenness hierarchy
► Greedy algorithm for finding the best assignment for a referring expression given a cognitive status
◦ Calculates a match score between referring expressions and referent candidates
◦ Finds the best assignments with a greedy algorithm
► Matching score, where $S$ ranges over the cognitive statuses $\{G, F, D\}$:

$\mathrm{Match}(o, e) = \sum_{S \in \{G, F, D\}} P(o \mid S)\, P(S \mid e) \cdot \mathrm{Compatibility}(o, e)$

where $P(o \mid S)$ is the object selectivity, $P(S \mid e)$ the likelihood of the status given the expression, and $\mathrm{Compatibility}(o, e)$ the compatibility measurement.
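The greedy assignment can be sketched in a few lines of Python. This is an illustrative reading of the formula above, not the authors' implementation; the probability tables, the status set {G, F, D}, and the compatibility function are all assumed inputs.

```python
STATUSES = ["G", "F", "D"]   # cognitive statuses from the givenness hierarchy

def match(obj, expr, p_obj_given_status, p_status_given_expr, compatibility):
    """Match(o, e) = sum over S of P(o|S) * P(S|e), times Compatibility(o, e)."""
    return sum(
        p_obj_given_status[(obj, s)] * p_status_given_expr[(s, expr)]
        for s in STATUSES
    ) * compatibility(obj, expr)

def resolve_greedy(expressions, candidates, *score_tables):
    """Greedily assign each referring expression the unused candidate
    with the highest match score."""
    assignment, used = {}, set()
    for expr in expressions:
        best = max(
            (o for o in candidates if o not in used),
            key=lambda o: match(o, expr, *score_tables),
            default=None,
        )
        if best is not None:
            assignment[expr] = best
            used.add(best)
    return assignment
```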
Multimodal Integration
• Combining information from multiple input modalities to understand the user's intention and attention
► Multimodal reference resolution is a special case of multimodal integration
◦ Speech + pen gesture
◦ The case where pen gestures can express only deictic or grouping meanings
[Diagram: the meaning derived from each modality feeds into multimodal integration / fusion, which produces a combined meaning.]
Multimodal Integration
• Issues:
► Nature of the multimodal integration mechanism
◦ Algorithmic (procedural)
◦ Parser / grammars (declarative)
► Does the approach treat one mode as primary?
◦ Is gesture a secondary, dependent mode?
– Multimodal reference resolution
► How temporal and spatial constraints are expressed
► Common meaning representation for speech and gesture
• Two main approaches
► Unification-based multimodal parsing and understanding [Johnston, 1998]
► Finite-state transducers for multimodal parsing and understanding [Johnston et al., 2000]
Unification-based multimodal parsing and understanding
• Parallel recognizers and "understanders"
• Time-stamped meaning fragments for each stream
• Common framework for meaning representation: typed feature structures
• Meaning fusion operation: unification
► Unification is an operation that determines the consistency of two pieces of partial information
► and, if they are consistent, combines them into a single result
◦ Whether a given gestural input is compatible with a given piece of spoken input
◦ And if they are, combine them into a single result
► Semantic and spatiotemporal constraints
• Statistical ranking
• Flexible asynchronous architecture
• Must handle unimodal and multimodal input
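As a minimal sketch, unification over untyped feature structures can be written as a recursive dictionary merge. Real systems use typed feature structures with a type hierarchy and structure sharing; the function and the example structures below are illustrative only.

```python
def unify(fs1, fs2):
    """Return the unification of two feature structures, or None on a clash."""
    if isinstance(fs1, dict) and isinstance(fs2, dict):
        result = dict(fs1)
        for key, val in fs2.items():
            if key in result:
                merged = unify(result[key], val)
                if merged is None:       # inconsistent partial information
                    return None
                result[key] = merged
            else:                        # partiality: absent features just merge in
                result[key] = val
        return result
    return fs1 if fs1 == fs2 else None   # atoms must match exactly

# Toy "draw a line" flavor: speech supplies the command and color,
# the pen supplies the coordinates; unification combines them.
speech = {"type": "create_line", "object": {"color": "green"}}
gesture = {"type": "create_line",
           "location": {"coordlist": [(12143, 12134), (12146, 12134)]}}
print(unify(speech, gesture))
```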
Unification-based multimodal parsing and understanding
• Temporal constraints [Oviatt et al., 1997]
► Speech and gesture overlap, or
► Gesture precedes speech by <= 4 seconds
► Speech does not precede gesture

Given the sequence speech1; gesture; speech2, the only possible grouping is speech1; (gesture; speech2).

Finding [Oviatt et al., 2004, 2005]: users have a consistent temporal integration style, so systems can adapt to it.
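A sketch of this grouping rule, assuming each signal has been reduced to a (start, end) timestamp pair in seconds:

```python
MAX_GESTURE_LEAD = 4.0   # gesture may precede speech by at most 4 seconds

def should_integrate(speech, gesture):
    """True if the gesture may be grouped with the speech segment."""
    s_start, s_end = speech
    g_start, g_end = gesture
    overlaps = g_start <= s_end and s_start <= g_end
    gesture_precedes = 0.0 <= s_start - g_end <= MAX_GESTURE_LEAD
    return overlaps or gesture_precedes   # speech preceding gesture never groups

print(should_integrate((5.0, 7.0), (3.5, 4.2)))   # gesture leads by 0.8 s -> True
print(should_integrate((1.0, 2.0), (3.0, 4.0)))   # speech precedes gesture -> False
```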
Unification-based multimodal parsing and understanding
• Each unimodal input is represented as a feature structure [Holzapfel et al., 2004]
► A very common representation in computational linguistics: FUG, LFG, PATR
◦ e.g., lexical entries, grammar rules, etc.
► e.g., "please switch on the lamp"
• Predefined rules resolve the deictic references and integrate the multimodal inputs
[Figure: a feature structure of type Type with attributes Attr1: val1 and Attr2: val2, and an embedded structure of type Type2 under Attr3 with Attr4: val4.]
Unification-based multimodal parsing and understanding
• An example “Draw a line”
[Figure: unification for "draw a line".

From speech (one of many hypotheses):
create_line
  object: [color: green, label: "draw a line"]
  location: [line]

From pen gesture (command hypotheses):
  location: [line, coordlist: [(12143,12134), (12146,12134), ...]]
  location: [point, xcoord: 15487, ycoord: 19547]

Unifying the speech hypothesis with the compatible line gesture (cross-mode compensation) yields:
create_line
  object: [color: green, label: "draw a line"]
  location: [line, coordlist: [(12143,12134), (12146,12134), ...]]
]
Unification-based multimodal parsing and understanding
• Advantages of multimodal integration via typed feature structure unification
► Partiality
► Structure sharing
► Mutual compensation (cross-mode compensation)
► Multimodal discourse
Unification-based multimodal parsing and understanding
• Mutual Disambiguation (MD)
► Each input mode provides a set of scored recognition hypotheses
► MD derives the best joint interpretation by unification of meaning representation fragments
► $P_{MM} = \alpha P_S + \beta P_G + C$
◦ Learn α, β, and C over a multimodal corpus
► MD stabilizes system performance in challenging environments
[Figure: lattices of speech hypotheses (s1-s3), gesture hypotheses (g1-g4), and object candidates (o1-o3) are combined into a ranked list of multimodal interpretations (mm1-mm4).]
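A sketch of how the scored hypothesis lists might be combined: every unifiable (speech, gesture) pair is scored with the P_MM formula above and the best joint interpretation wins. The unify function is assumed to behave as on the earlier slides, and alpha, beta, c would be learned from a multimodal corpus; the names and the hypothesis format are illustrative.

```python
from itertools import product

def best_joint_interpretation(speech_hyps, gesture_hyps, unify, alpha, beta, c):
    """speech_hyps / gesture_hyps: lists of (meaning_fragment, score) pairs."""
    best, best_score = None, float("-inf")
    for (s_frag, p_s), (g_frag, p_g) in product(speech_hyps, gesture_hyps):
        combined = unify(s_frag, g_frag)
        if combined is None:                   # incompatible fragments are pruned
            continue
        score = alpha * p_s + beta * p_g + c   # P_MM = alpha*P_S + beta*P_G + C
        if score > best_score:
            best, best_score = combined, score
    return best, best_score
```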
Finite-state Multimodal Understanding
• Modeled by a 3-tape finite-state device over
► the speech stream (words) and the gesture stream (gesture symbols)
► and their combined meaning (meaning symbols)
• The device takes speech and gesture as input and produces the meaning output.
• Simulated by two transducers:
► G:W, aligning speech and gesture
► G×W:M, taking the composite alphabet of speech and gesture symbols as input and outputting meaning
• The speech and gesture inputs are first composed with G:W; the result G_W is then composed with G×W:M.
Finite-state Multimodal Understanding
• Representation of the speech input modality
► Lattice of words
• Representation of the gesture input modality
► The range of gesture recognitions is represented as a lattice of gesture symbols
[Figure: a speech lattice over "show phone numbers for these two / ten american / new restaurants" and a gesture lattice from state 0 with paths such as G area loc SEM(points...) and G sel 2 rest SEM(r12,r15), plus a handwriting (hw) branch.]
Finite-state Multimodal Understanding
• Representation of the combined meaning
► Also represented as a lattice
► Paths in the meaning lattice are well-formed XML, e.g.:

<cmd> <info> <type>phone</type> <obj><rest>r12,r15</rest></obj> </info> </cmd>

[Figure: a meaning lattice whose path spells out <cmd> <type> phone </type> <obj> <rest> SEM(r12,r15) </rest> </obj> </cmd>.]
Finite-state Multimodal Understanding
• Multimodal Grammar Formalism
► Multimodal context-free grammar (MCFG)
◦ e.g., HEADPL → restaurants:rest:<rest> ε:SEM:SEM ε:ε:</rest>
► Terminals are multimodal tokens consisting of three components:
◦ speech stream : gesture stream : combined meaning (W:G:M)
► e.g., "put that there"

S → ε:ε:<cmd> PUTV OBJNP LOCNP ε:ε:</cmd>
PUTV → ε:ε:<act> put:ε:put ε:ε:</act>
OBJNP → ε:ε:<obj> that:Gvehicle:ε ε:SEM:SEM ε:ε:</obj>
LOCNP → ε:ε:<loc> there:Garea:ε ε:ε:</loc>

Aligned streams for the example (parse tree S → PUTV OBJNP LOCNP):
Speech:  put | that | there
Gesture: (none) | Gvehicle v1 | Garea a1
Meaning: <cmd> <act>put</act> <obj>v1</obj> <loc>a1</loc> </cmd>
Finite-state Multimodal Understanding
• Multimodal Grammar Example
► Speech: email this person and that organization
► Gesture: Gp SEM Go SEM
► Meaning: email([ person(SEM) , org(SEM) ])

S → V NP ε:ε:])
NP → DET N
NP → NP CONJ NP
CONJ → and:ε:,
V → email:ε:email([
V → page:ε:page([
DET → this:ε:ε
DET → that:ε:ε
N → person:Gp:person( ε:SEM:SEM ε:ε:)
N → organization:Go:org( ε:SEM:SEM ε:ε:)
N → department:Gd:dept( ε:SEM:SEM ε:ε:)
[Figure: the corresponding multimodal FST with states 0-6. Arcs: email:ε:email([ and page:ε:page([ from 0 to 1; this:ε:ε and that:ε:ε from 1 to 2; person:Gp:person(, organization:Go:org(, and department:Gd:dept( from 2 to 3; ε:SEM:SEM from 3 to 4; ε:ε:) from 4 to 5; and:ε:, looping from 5 back to 1; ε:ε:]) from 5 to the final state 6.]
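For illustration, the 3-tape device can be simulated directly: each arc consumes a word from the speech stream and/or a symbol from the gesture stream and emits meaning symbols. The sketch below hard-codes the arcs read off the figure above ("" stands for epsilon); it is a toy depth-first traversal, not a real weighted-FST composition.

```python
ARCS = [
    (0, 1, "email", "", "email(["), (0, 1, "page", "", "page(["),
    (1, 2, "this", "", ""),         (1, 2, "that", "", ""),
    (2, 3, "person", "Gp", "person("),
    (2, 3, "organization", "Go", "org("),
    (2, 3, "department", "Gd", "dept("),
    (3, 4, "", "SEM", "SEM"),
    (4, 5, "", "", ")"),
    (5, 1, "and", "", ","),         # conjunction loops back for the next NP
    (5, 6, "", "", "])"),
]
FINAL = 6

def parse(state, speech, gesture, meaning):
    """Yield meaning-symbol sequences for paths consuming both input streams."""
    if state == FINAL and not speech and not gesture:
        yield meaning
    for src, dst, w, g, m in ARCS:
        if src != state:
            continue
        if w and (not speech or speech[0] != w):
            continue
        if g and (not gesture or gesture[0] != g):
            continue
        yield from parse(dst, speech[1:] if w else speech,
                         gesture[1:] if g else gesture,
                         meaning + [m] if m else meaning)

speech = "email this person and that organization".split()
gesture = ["Gp", "SEM", "Go", "SEM"]
print(" ".join(next(parse(0, speech, gesture, []))))
# email([ person( SEM ) , org( SEM ) ])
```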
Finite-state Multimodal Understanding
[Figure: the speech lattice ("show phone numbers for these two / ten american / new restaurants") and the gesture lattice (G area loc SEM(points...) / G sel 2 rest SEM(r12,r15)) enter the 3-tape multimodal finite-state device; integration processing produces the meaning lattice beginning <cmd> <type> phone ... .]
Finite-state Multimodal Understanding
• An example
[Figure: applying the multimodal grammar FST. Speech lattice: email this person and that organization (states 0-6). Gesture lattice: Gp SEM Go SEM. Composing both with the grammar yields the meaning lattice: email([ person( SEM ) , org( SEM ) ]).]
Robustness in Multimodal Dialog
• Gain robustness via
► Fusion of inputs from multiple modalities
► Using the strengths of one mode to compensate for weaknesses of others, at design time and at run time
► Avoiding/correcting errors
► Statistical architecture
► Confirmation
► Dialogue context
► Simplification of language in a multimodal context
► Output affecting/channeling input
• Example approaches
► Edit machines in FST-based multimodal integration and understanding
► Salience-driven approach to robust input interpretation
► N-best re-ranking for improving speech recognition performance
Edit Machines in FST-based MM Integration
• Problem with FST-based MM integration: mismatch between the user's input and the language encoded in the grammar

ASR: show cheap restaurants thai places in in chelsea
Grammar: show cheap thai places in chelsea

• How to parse it? Determine which in-grammar string it is most like.

Edits: show cheap ε thai places in ε chelsea
("restaurants" and one "in" are deleted)

To find this, employ an edit machine!
Handcrafted Finite-state Edit Machines
• Edit-based Multimodal Understanding – Basic edit
► Transform the ASR output so that it can be assigned a meaning by the FST-based multimodal understanding model
► Find the string with the least costly sequence of edits that can be assigned an interpretation by the grammar:

$s^{*} = \operatorname*{argmin}_{s \in \lambda_s} \; s \circ \lambda_{\mathrm{edit}} \circ \lambda_g$

◦ $\lambda_g$: language encoded in the multimodal grammar
◦ $\lambda_s$: strings encoded in the lattice resulting from ASR
◦ $\circ$: composition of transducers
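As an illustration of the argmin, plain word-level Levenshtein distance can stand in for the weighted edit transducer, with the grammar language enumerated explicitly (a real system composes FSTs instead). The first grammar string comes from the earlier example; the second is an invented distractor.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between token lists a and b."""
    d = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, wb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1,           # delete from a
                                   d[j - 1] + 1,       # insert into a
                                   prev + (wa != wb))  # substitute / match
    return d[-1]

def closest_in_grammar(asr_output, grammar_strings):
    """argmin over a (finitely enumerated) grammar language."""
    return min(grammar_strings, key=lambda s: edit_distance(asr_output, s))

asr = "show cheap restaurants thai places in in chelsea".split()
grammar = ["show cheap thai places in chelsea".split(),
           "show cheap italian places in chelsea".split()]
print(" ".join(closest_in_grammar(asr, grammar)))   # the 2-deletion neighbor wins
```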
Handcrafted Finite-state Edit Machines
• Edit-based Multimodal Understanding – 4-edit
► The basic edit machine is quite large and adds an unacceptable amount of latency (5 s on average).
► 4-edit limits the number of edit operations to at most 4.
Handcrafted Finite-state Edit Machines
• Edit-based Multimodal Understanding – Smart edit
► Smart edit is a 4-edit machine + heuristics + refinements
◦ Deletion of SLM-only words (not found in the grammar)
– thai restaurant listings in midtown → thai restaurant in midtown
◦ Deletion of doubled words
– subway to to the cloisters → subway to the cloisters
◦ Subdivided cost classes (insertion and deletion costs in 3 classes)
– High cost: slot fillers (e.g., chinese, cheap, downtown)
– Low cost: dispensable words (e.g., please, would)
– Medium cost: all other words
◦ Auto-completion of place names
– The algorithm enumerates all possible shortenings of place names
– e.g., Metropolitan Museum of Art ↔ Metropolitan Museum
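The subdivided deletion costs might look like the sketch below; the word lists and cost values are illustrative placeholders, not the system's actual lexicon.

```python
SLOT_FILLERS = {"chinese", "thai", "cheap", "downtown", "chelsea"}
DISPENSABLE = {"please", "would", "uh", "um"}

def deletion_cost(word, grammar_vocab):
    if word not in grammar_vocab:   # SLM-only word: deleted essentially for free
        return 0.0
    if word in DISPENSABLE:
        return 0.5                  # low cost
    if word in SLOT_FILLERS:
        return 3.0                  # high cost: losing a filler changes the meaning
    return 1.0                      # medium cost: all other words
```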
Learning Edit Patterns
• The user's input is considered a "noisy" version of the parsable ("clean") input.

Noisy (S): show cheap restaurants thai places in in chelsea
Clean (T): show cheap ε thai places in ε chelsea

• Goal: translate the user's input into a string that can be assigned a meaning representation by the grammar.
Learning Edit Patterns
• Noisy Channel Model for Error Correction
► Translation probability

$S_g^{*} = \operatorname*{argmax}_{S_g} P(S_u, S_g)$

◦ $S_g$: string that can be assigned a meaning representation by the grammar
◦ $S_u$: user's input utterance
► With a Markov (trigram) assumption:

$S_g^{*} = \operatorname*{argmax}_{S_g} \prod_i P(S_u^{i}, S_g^{i} \mid S_u^{i-1}, S_g^{i-1}, S_u^{i-2}, S_g^{i-2})$

where $S_u = S_u^{1} S_u^{2} \cdots S_u^{n}$ and $S_g = S_g^{1} S_g^{2} \cdots S_g^{m}$
► Word alignments $(S_u^{i}, S_g^{i})$
◦ obtained with GIZA++
Learning Edit Patterns
• Deriving a translation corpus
► The finite-state transducer can generate the input strings for a given meaning.
► These pairs are used for training the translation model.

[Diagram: for each (meaning, string) pair in the corpus, the multimodal grammar generates the strings for that meaning, and the generated string closest to the observed one is selected as the target string.]
Experiments and Results
• 16 first-time users (8 male, 8 female)
• 833 user interactions (218 multimodal / 491 speech-only / 124 pen-only)
• Tasks: finding restaurants of various types and getting their names, phone numbers, and addresses; getting subway directions between locations
• Avg. ASR sentence accuracy: 49%
• Avg. ASR word accuracy: 73.4%
Experiments and Results
• Improvements in concept accuracy

Result of 6-fold cross validation:

| Method | ConAcc | Rel. Impr. |
|---|---|---|
| No edits | 38.9% | 0% |
| Basic edit | 51.5% | 32% |
| 4-edit | 53.0% | 36% |
| Smart edit | 60.2% | 55% |
| Smart edit (lattice) | 63.2% | 62% |
| MT edit | 50.3% | 29% |

Result of 10-fold cross validation:

| Method | ConAcc |
|---|---|
| Smart edit | 67.4% |
| MT edit | 61.1% |
A Salience Driven Approach
• Modify the language model score and rescore the recognized hypotheses
► using information from the gesture input
► Primed language model:

$W^{*} = \operatorname*{argmax}_{W} P(O \mid W)\, P(W)$
A Salience Driven Approach
• "People do not make any unnecessary deictic gestures"
► Cognitive theory of conversational implicature
◦ Speakers tend to make their contribution as informative as is required
◦ and not make it more informative than is required
• "Speech and gesture tend to complement each other"
► When a speech utterance is accompanied by a deictic gesture,
◦ the speech input issues commands or inquiries about properties of objects
◦ the deictic gesture indicates the objects of interest
• Gesture is an early indicator that anticipates the content of the subsequent spoken utterance
► 85% of the time, gestures occurred before the corresponding speech unit
A Salience Driven Approach
• A deictic gesture can activate several objects on the graphical display
► It signals a distribution over the objects that are salient

[Figure: for "Move this to here", the gesture precedes the speech in time; on the graphical display it assigns salience weights to nearby objects, the most salient object being a cup.]
A Salience Driven Approach
• The salient object (the cup) is mapped to the physical world representation
► to indicate a salient part of the representation
◦ such as relevant properties or tasks related to the salient objects
• This salient part of the physical world is likely to be the potential content of the speech

[Figure: timeline for "Move this to here" with the gesture preceding the speech, linking the gesture to the cup in the physical world representation.]
A Salience Driven Approach
• Physical world representation
► Domain Model
◦ Relevant knowledge about the domain
– Domain objects
– Properties of objects
– Relations between objects
– Task models related to objects
◦ Frame-based representation
– Frame: domain object
– Frame elements: attributes and tasks related to the object
► Domain Grammar
◦ Specifies the grammar and vocabulary used to process language inputs
– Semantics-based context-free grammar
– Non-terminals: semantic tags
– Terminals: words (values of semantic tags)
– Annotated user spoken utterances
– Relevant semantic information
– N-grams
Salience Modeling
• Calculating a salience distribution over the entities in the physical world
► The salience value of an entity at time $t_n$ is influenced by the joint effect of the sequence of gestures that happened before $t_n$:

$P_{t_n}(e_k) = \dfrac{\sum_{i=1}^{m} \alpha(g_{t_i})\, P(e_k \mid g_{t_i})}{\sum_{e} \sum_{i=1}^{m} \alpha(g_{t_i})\, P(e \mid g_{t_i})}$

◦ $P(e \mid g_{t_i})$: object selectivity of entity $e$ given gesture $g_{t_i}$
◦ $\alpha(g_{t_i})$: weight of the salience contribution of a gesture at time $t_i$
◦ $P_{t_n}(e_k)$: salience value of entity $e_k$ at time $t_n$
Salience Modeling
Reading the formula:
► Numerator: the sum of $\alpha(g_{t_i})\, P(e_k \mid g_{t_i})$ over all gestures before time $t_n$, weighted by $\alpha$
► Denominator: normalizing factor, the sum of the salience values of all entities at time $t_n$
► Recency weight:

$\alpha(g_{t_i}) = \exp\!\left(-\dfrac{(t_n - t_i)^2}{2000}\right)$

The closer the gesture, the higher its impact on the salience distribution.
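A direct transcription of the salience update into Python might look like this; the gesture format (timestamp plus a selectivity distribution over entities) and the time units inside the decay constant are assumptions made for the sketch.

```python
import math

def alpha(t_now, t_gesture):
    """Recency weight: closer gestures contribute more."""
    return math.exp(-((t_now - t_gesture) ** 2) / 2000.0)

def salience(entities, gestures, t_now):
    """gestures: list of (t_i, {entity: P(entity | gesture_i)}) pairs."""
    raw = {e: sum(alpha(t_now, t_i) * sel.get(e, 0.0) for t_i, sel in gestures)
           for e in entities}
    z = sum(raw.values()) or 1.0          # normalizing factor over all entities
    return {e: v / z for e, v in raw.items()}

gestures = [(0, {"cup": 0.7, "plate": 0.3}), (40, {"cup": 0.5, "bed": 0.5})]
print(salience(["cup", "plate", "bed"], gestures, t_now=50))
# the recent gesture dominates, so the cup and bed outrank the plate
```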
Salience Driven Spoken Language Understanding
• Maps the salience distribution to the physical world representation
• Uses the salient world to influence spoken language understanding
• Primes the language model to facilitate language understanding
► Rescores the speech recognizer's hypotheses with the primed language model score
Primed Language Model
• The primed language model is based on the class-based bigram model
► Classes: semantic and functional classes for the domain
◦ e.g., this → Demonstrative, price → AttrPrice

$P(w_i \mid w_{i-1}) \approx P(w_i \mid c_i)\, P(c_i \mid c_{i-1})$

(word class probability × class transition probability)

► Modify the word class probability
◦ Originally it measures the probability of seeing word $w_i$ given class $c_i$
◦ It is modified so that the choice of word $w_i$ depends on the salient physical world, represented by the salience distribution $P(e)$:

$P(w_i \mid c_i) = \sum_{e_k} \dfrac{P(w_i, c_i \mid e_k)}{P(c_i \mid e_k)}\, P_{t_i}(e_k)$

◦ $P(w_i, c_i \mid e_k)$ and $P(c_i \mid e_k)$ do not depend on the time $t_i$ and can be estimated from the training data
• Speech hypotheses are reordered according to the primed language model.
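A sketch of rescoring the N-best list with the primed class bigram. The lookup tables class_of, p_class_bigram, and p_word_class_entity (the last standing for P(w_i, c_i | e) / P(c_i | e), i.e. the salience-conditioned word class probability) are assumed to have been estimated from training data; salience is the distribution from the previous slides.

```python
import math

def primed_logprob(words, class_of, p_class_bigram, p_word_class_entity, salience):
    logp, prev_class = 0.0, "<s>"
    for w in words:
        c = class_of[w]
        # P(w_i | c_i) summed over salient entities, as in the formula above
        p_w = sum(p_word_class_entity[(w, c, e)] * p_e
                  for e, p_e in salience.items())
        # class transition P(c_i | c_{i-1}) times the primed word class probability
        logp += math.log(p_class_bigram[(c, prev_class)] * max(p_w, 1e-12))
        prev_class = c
    return logp

def rerank(hypotheses, **model):
    """Reorder N-best speech hypotheses by primed language model score."""
    return sorted(hypotheses, key=lambda h: primed_logprob(h, **model), reverse=True)
```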
Evaluation - WER
• Domain: real estate properties
• Interface: speech + pen gesture
• 11 users tested: five non-native speakers and six native speakers
• 226 user inputs with an average of 8 words per utterance
• Average WER reduction: about 12% (t = 4.75, p < 0.001)
| User Index | # of inputs | # inputs w/o gesture | Baseline WER |
|---|---|---|---|
| 1 | 21 | 0 | 0.287 |
| 2 | 31 | 0 | 0.335 |
| 3 | 27 | 0 | 0.399 |
| 4 | 10 | 0 | 0.680 |
| 5 | 8 | 1 | 0.200 |
| 6 | 36 | 0 | 0.387 |
| 7 | 18 | 0 | 0.250 |
| 8 | 25 | 1 | 0.278 |
| 9 | 23 | 0 | 0.482 |
| 10 | 11 | 0 | 0.117 |
| 11 | 16 | 3 | 0.255 |
Evaluation – Concept Identification
• Examples of improved cases
► Transcription: What is the population of this town
  Baseline: What is the publisher of this time
  Salience-based: What is the population of this town
► Transcription: How much is this gray house
  Baseline: How much is this great house
  Salience-based: How much is this gray house

| | Baseline | Salience-based |
|---|---|---|
| Precision | 80.3% | 84.6% |
| Recall | 75.7% | 83.8% |
| F-measure | 77.9% | 84.2% |
N-best Re-ranking for Improving Speech Recognition Performance
• Using multimodal understanding features
[Example: the user says 이것 좀 여기에 갖다 놔 ("Put this over here") with a pen gesture, but the ASR output is the garbled 이다 좀 여기에 갖 다 가. The SLU result (Speech Act: request; Main Goal: move; Component Slots: Target.Loc: 여기 "here") is missing the slot Source.Item: 이것 ("this") because of the speech recognition error.]
N-best Re-ranking for Improving Speech Recognition Performance
• Using N-best ASR hypotheses
► Rescore the hypotheses with information that is not available during speech recognition
► Here, multimodal understanding features are used
[Figure: the N-best ASR list (e.g., 이다 좀 여기 갖 다 가 / 이다 좀 여기 갖 다 줘 / 이것 좀 여기 갖 다 가 / ...) passes through the re-ranking model with many features, which promotes the correct hypothesis 이것 좀 여기 갖 다 가 ("Take this over here") to the top.]
Speech Recognizer Features
• Speech recognizer score: P(W|X)
• Acoustic model score: P(X|W)
• Language model score: P(W)
• N-best word rate: gives more confidence to a word that occurs in many hypotheses
• N-best homogeneity: gives more weight to a word that appears in higher-ranked hypotheses, weighting each word by the score of the hypothesis in which it appears
$\text{N-best word rate}(w_i) = \dfrac{\text{number of hypotheses containing } w_i}{\text{number of hypotheses in the N-best list}}$

$\text{N-best homogeneity}(w_i) = \dfrac{\text{sum of scores of hypotheses containing } w_i}{\text{sum of scores of all hypotheses in the N-best list}}$
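Both features are direct ratios over the N-best list; a sketch, assuming each hypothesis is a (token list, recognizer score) pair:

```python
def nbest_word_rate(word, hyps):
    """Fraction of hypotheses in the N-best list containing the word."""
    return sum(word in words for words, _ in hyps) / len(hyps)

def nbest_homogeneity(word, hyps):
    """Sum of scores of hypotheses containing the word over all scores."""
    total = sum(score for _, score in hyps)
    return sum(score for words, score in hyps if word in words) / total

hyps = [("put this there".split(), 0.5),
        ("put these there".split(), 0.3),
        ("pet this their".split(), 0.2)]
print(nbest_word_rate("this", hyps))    # 0.666...
print(nbest_homogeneity("this", hyps))  # 0.7
```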
SLU features
• CRF confidence score: the confidence score of the SLU results
► Confidence scores of the speech act and main goal: P(speech act | word sequence), P(main goal | word sequence)
◦ Derived from the CRF formulation

$P(y \mid x) = \dfrac{1}{Z_x} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \right)$

◦ $y$: output variable
◦ $x$: input variable
◦ $Z_x$: normalization factor
◦ $f_k(y_{t-1}, y_t, x, t)$: arbitrary linguistic feature function (often binary-valued)
◦ $\lambda_k$: trained parameter associated with feature $f_k$
SLU features
• CRF confidence score (cont.)
► Confidence score of a component slot:

$\mathrm{Conf}(y_t, x_t) = \dfrac{\exp\!\left(\sum_k \lambda_k f_k(y_{t-1}, y_t, x_t)\right)}{\sum_{y_t} \exp\!\left(\sum_k \lambda_k f_k(y_{t-1}, y_t, x_t)\right)}$

◦ $y_t$: component slot
◦ $x_t$: corresponding word
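The slot confidence is a local softmax over candidate slot labels under the trained CRF weights. In the sketch below, feature_score stands in for the inner sum of lambda_k * f_k(y_{t-1}, y_t, x_t); all names are illustrative.

```python
import math

def slot_confidence(y_prev, y, x_t, slot_labels, feature_score):
    """Conf(y_t, x_t): softmax of the feature score over candidate labels."""
    num = math.exp(feature_score(y_prev, y, x_t))
    den = sum(math.exp(feature_score(y_prev, cand, x_t)) for cand in slot_labels)
    return num / den
```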
Multimodal Understanding Features
• Multimodal reference resolution score
► A well-recognized speech hypothesis tends to resolve well.
► (a) Well recognized: "Clean this room and this bathroom" with two pen gestures; both referring expressions resolve.
► (b), (c) Misrecognized: "this bathroom" becomes "this bad noon"
◦ (b) "this bad noon" cannot be a referring expression, so the second pen gesture receives a low reference resolution score.
◦ (c) Treating "this bad noon" as a referring expression still yields a low reference resolution score.

[Figure: three timelines (a)-(c) aligning the utterances "Clean this room and this bathroom" / "Clean this room and this bad noon" with two pen gestures each.]
Experimental Setup
• Corpus
► 617 multimodal inputs
◦ 118 (speech + pen gesture) + 499 (speech only)
◦ 3135 words, 5.08 words per utterance
◦ Vocabulary size: 396
• Speech recognizer
► An HTK-based Korean speech recognizer trained on 39-dimensional MFCC feature vectors
► Outputs 75-best lists
Experimental Result (WER)
• Comparison of word error rates between the baseline and the N-best re-ranking model with varying feature sets
► Relative error reduction rate: 7.95%
► The re-ranking model has a significantly smaller word error rate than the baseline system (p < 0.001).
| Model | WER (%) |
|---|---|
| Baseline | 17.74 |
| + Speech recognizer features | 17.38 |
| + SLU features | 16.43 |
| + Multimodal reference resolution features | 16.33 |
Experimental Results (WER)
• Word error rate of the N-best re-ranking model as the size of N varies
► If N is too large → many noisy hypotheses
► If N is too small → a small candidate set and few clues for re-ranking
Experimental Results (CER)
• Comparison of concept error rates between the baseline and the N-best re-ranking model
► Relative error reduction rate: 10.13%
► The re-ranking model has a significantly smaller concept error rate than the baseline system (p < 0.01).
| Model | CER (%) |
|---|---|
| Baseline | 14.28 |
| + Speech recognizer features | 13.81 |
| + SLU features | 13.11 |
| + Multimodal reference resolution features | 12.83 |
Reading List
• R. A. Bolt, 1980, “Put that there: Voice and gesture at the graphics interface,” Computer Graphics Vol. 14, no. 3, 262-270.
• J. Chai, S. Pan, M. Zhou, and K. Houck, 2002, Context-based Multimodal Understanding in Conversational Systems. Proceedings of the Fourth International Conference on Multimodal Interfaces (ICMI).
• J. Chai, P. Hong, and M. Zhou, 2004, A Probabilistic Approach to Reference Resolution in Multimodal User Interfaces. Proceedings of 9th International Conference on Intelligent User Interfaces (IUI-04), 70-77.
• J. Chai, Z. Prasov, J. Blaim, and R. Jin., 2005, Linguistic Theories in Efficient Multimodal Reference Resolution: an Empirical Investigation. Proceedings of the 10th International Conference on Intelligent User Interfaces (IUI-05), 43-50.
• J. Chai and S. Qu, 2005, A Salience Driven Approach to Robust Input Interpretation in Multimodal Conversational Systems. Proceedings of HLT/EMNLP 2005.
• H. Holzapfel, K. Nickel, and R. Stiefelhagen, 2004, Implementation and Evaluation of a Constraint-Based Multimodal Fusion System for Speech and 3D Pointing Gestures. Proceedings of the International Conference on Multimodal Interfaces (ICMI).
• M. Johnston, 1998, Unification-based multimodal parsing. Proceedings of the International Joint Conference of the Association for Computational Linguistics and the International Committee on Computational Linguistics, 624-630.
• M. Johnston and S. Bangalore, 2000, Finite-state multimodal parsing and understanding. Proceedings of COLING-2000.
• M. Johnston, S. Bangalore, G. Vasireddy, A. Stent, P. Ehlen, M. Walker, S. Whittaker, and P. Maloor, 2002, MATCH: An architecture for multimodal dialogue systems. Proceedings of ACL-2002.
• M. Johnston and S. Bangalore, 2006, Learning Edit Machines for Robust Multimodal Understanding. Proceedings of ICASSP 2006.
• P.R. Cohen, M. Johnston, D.R. McGee, S.L. Oviatt, J.A. Pittman, I. Smith, L. Chen, and J. Clow, 1997, "QuickSet: Multimodal Interaction for Distributed Applications," Intl. Multimedia Conference, 31-40.
• S. L. Oviatt , A. DeAngeli, and K. Kuhn, 1997, Integration and synchronization of input modes during multimodal human-computer interaction. In Proceedings of Conference on Human Factors in Computing Systems: CHI '97.