Intelligent Multimodal Interaction: Challenges and Promise
Mark T. Maybury, [email protected]
Schloss Dagstuhl, Germany
29 October 2001
MITRE
www.mitre.org/resources/centers/it/maybury/mark.html
This data is the copyright and proprietary data of the MITRE Corporation. It is made available subject to Limited Rights, as defined in paragraph (a) (15) of the clause at
DFAR 252.227-7013. The restrictions governing the use and disclosure of these materials are set forth in the aforesaid clause.
What are we talking about?
- Information, Perception, Cognition, Emotion
- Modalities: visualization, speech, haptics/gesture, facial expression, sight, smell
Image Source: Dr. Nahum Gershon and Ellaine Mullen, Copyright The MITRE Corporation
Why Multimedia? Direct Manipulation vs. Natural Language

Direct Manipulation
Strengths:
1. Intuitive
2. Consistent look and feel
3. Options apparent
4. Fail safe
5. "Direct engagement" with object: (a) point, act; (b) feedback
Weaknesses:
1. Description
2. Anaphora
3. Operation on sets
4. Delayed actions difficult

Natural Language
Strengths:
1. Intuitive
2. Description, e.g., (a) quantification, (b) negation, (c) temporal, (d) motivation/cause
3. Context
4. Anaphora
5. Asynchronous
Weaknesses:
1. Coverage is opaque
2. Overkill for short or frequent queries
3. Difficulty of establishing and navigating context; spatial specification cumbersome
4. Anaphora problematic
5. Error prone (ambiguity, vagueness, incompleteness)

Above modified from Cohen, P. 1992. The role of natural language in a multimodal interface. In Proceedings of the ACM SIGGRAPH Symposium on User Interface Software and Technology (UIST), Monterey, CA, 143-149.
Why Multimedia? Handwriting/Pen vs. Speech

Evidence users prefer both:
- Flexibility (user, task, situation), e.g., speech for text, pen for numbers
- Efficiency and expressive power

Handwriting/Pen
Strengths:
1. Intuitive
2. Visual feedback
3. Persistent/ease of editing
4. Multifunctional (select, sketch, write)
5. Private
Weaknesses:
1. Unrobust recognition
2. Training cost (i.e., writer dependence)
3. Some unnatural interaction (e.g., no beautifying)
4. Slower than speech

Speech
Strengths:
1. Intuitive
2. Fast
3. Hands/eyes-busy tasks
4. Tight interaction (e.g., clarification)
5. High bandwidth (e.g., prosodics, attitude, age, gender)
Weaknesses:
1. Unrobust recognition
2. Training cost (e.g., speaker dependence)
3. Some systems unnatural (e.g., isolated words, small vocabulary)
4. Public
Our Challenges
- Empirical studies of the optimal combination of text, audio, video, and gesture for device, cognitive load and style, task, etc., for both input and output (perceptual, cognitive, and emotional effect)
- Multi* Input: integration of imprecise, ambiguous, and incomplete input; interpretation of uncertain multi* inputs
- Multi* Output: select, design, allocate, and realize coherent, cohesive, and coordinated output
- Interaction Management: natural, joyous, agent-based (?) mixed-initiative interaction
- Integrative Architectures: common components, well-defined interfaces, levels of representation
- Methodology and Evaluation: ethnographic studies, community tasks, corpus-based
Interface Function: State of the Art / Grand Challenges / Benefits of Advances

Input Analysis
- State of the art: Sequential keyboard and two-dimensional mouse or touch screen input. Limited spoken language input.
- Grand challenges: Interpretation of imprecise, ambiguous, and/or partial multimodal input.
- Benefits of advances: Flexibility and choice in alternative media input, synergistic input, robust interpretation.

Output Generation
- State of the art: Canned presentations utilizing primarily graphics and text. Single-document, monolingual summarization.
- Grand challenges: Automated generation of coordinated speech, natural language, gesture, animation, and non-speech audio, possibly delivered via interactive, animated life-like agents.
- Benefits of advances: Mixed media (e.g., text, graphics, video, speech, and non-speech audio) and mode (e.g., linguistic, visual, auditory) displays tailored to the user and context. Life-like animated characters.

Dialog Control
- State of the art: Pre-scripted interactions with standard dialogue presentations (e.g., windows, menus, buttons).
- Grand challenges: Mixed-initiative natural interaction that deals robustly with context shift, interruptions, feedback, and shift of locus of control.
- Benefits of advances: Ability to tailor flow and control of interactions and facilitate interactions, including error detection and correction tailored to individual physical, perceptual, and cognitive differences. Motivational and engaging life-like agents.

Agent/User Modeling
- State of the art: Limited models of user interests (e.g., via explicitly solicited user models). Recommender technology.
- Grand challenges: Unobtrusive learning, representation, and use of models of users/agents, including models of perception, cognition, and emotion.
- Benefits of advances: Enables tracking of user characteristics, skills, and goals in order to adapt and enhance interaction.

API
- State of the art: Variable specification of underlying application functionality. Move toward component-based architectures.
- Grand challenges: Addressing increasingly broad, interdependent, and complex application functionality.
- Benefits of advances: Simplification of functionality, possibly limited by user and/or context models. Automated task completion. Task help tailored to situation, context, and user. Mobile and substitutable interfaces for disabled users.
Multimedia Presentation Generation: "No Presentation without Representation"

DATA
Philosopher | Born   | Died   | Works    | Emphasis
Socrates    | 470 BC | 399 BC | None     | Virtue
Plato       | 428 BC | 348 BC | Republic | Conduct
Aristotle   | 384 BC | 322 BC | Poetics  | Science

The same underlying representation can be realized in different media:
- MAPS: Athens
- VIDEO: Plato, Aristotle
- NATURAL LANGUAGE: "Socrates, Plato, and Aristotle were Greek philosophers ..."
- TABLES: Philosopher/Born/Died (Socrates 470-399, Plato 428-348, Aristotle 384-322)
- GRAPHS: lifespan timeline (500-300 BC) and lifespan-in-years bar chart (age 0-100) for Socrates, Plato, and Aristotle
- ANIMATED AGENTS
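The philosopher data above can drive media allocation: given a representation of the content, simple rules pick a medium. The sketch below is an illustrative assumption in Python; the rule set and attribute schema are invented, not the system on the slide.

```python
# Hypothetical rule-based media allocation over attributes of the
# philosopher data. The rules and schema are illustrative assumptions.

def allocate_medium(attribute):
    """Pick a presentation medium from simple data characteristics."""
    if attribute["type"] == "location":
        return "map"            # spatial data -> maps (e.g., Athens)
    if attribute["type"] == "interval" and attribute["comparable"]:
        return "graph"          # comparable numeric spans -> lifespan chart
    if attribute["type"] in ("number", "date"):
        return "table"          # precise values -> tables
    return "natural language"   # descriptive content -> running text

attributes = [
    {"name": "birthplace", "type": "location", "comparable": False},
    {"name": "lifespan", "type": "interval", "comparable": True},
    {"name": "born/died", "type": "date", "comparable": True},
    {"name": "emphasis", "type": "category", "comparable": False},
]

for a in attributes:
    print(a["name"], "->", allocate_medium(a))
```

The point of the sketch is only that allocation decisions are computed from the representation, never from canned output.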
Common Presentation Design Tasks

Co-constraining, cascaded processes:
Communication Management -> Content Selection -> Presentation Design -> Media Allocation -> Media Realization -> Media Coordination -> Media Layout
- Content selection draws on information, task, user, ...
- Media realization must respect the expressivity of different languages, e.g., the "ven acá" ("come here") gesture
- Length affects layout in space or time (e.g., EYP, audio)
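The cascade can be sketched as a pipeline of stages over a shared design state, with a later stage (layout) constraining an earlier choice (allocation) — the "co-constraining" property. Stage internals, the state dict, and the width threshold below are all illustrative assumptions.

```python
# Minimal sketch of a co-constraining presentation-design cascade.
# Stage names follow the slide; the data they pass is illustrative.

def content_selection(state):
    state["content"] = ["lifespan(Socrates)", "lifespan(Plato)"]
    return state

def media_allocation(state):
    # Allocate each content item to a medium (here: everything to a graph).
    state["allocation"] = {item: "graph" for item in state["content"]}
    return state

def media_layout(state):
    # Downstream layout can constrain upstream choices (co-constraining):
    # if the display is too narrow for a graph, fall back to text.
    if state["display_width"] < 200:
        state["allocation"] = {k: "text" for k in state["allocation"]}
    return state

PIPELINE = [content_selection, media_allocation, media_layout]

def design_presentation(display_width):
    state = {"display_width": display_width}
    for stage in PIPELINE:
        state = stage(state)
    return state["allocation"]

print(design_presentation(640))   # graphs fit on a wide display
print(design_presentation(120))   # layout forces fallback to text
```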
Common Representations: Communicative Acts [Maybury, 1993; Wahlster, André, Rist 1993]

PHYSICAL ACT
- DEICTIC ACT: point, tap, circle; indicate direction
- ATTENTIONAL ACT: pound fist/stomp foot; snap/tap fingers, clap hands
- BODY LANGUAGE ACT: facial expressions, gestures, sign language

LINGUISTIC ACT
- REFERENTIAL/ATTENTIONAL ACT
- ILLOCUTIONARY ACT: inform, request, warn, concede
- LOCUTIONARY ACT: assert (declarative), ask (interrogative), command (imperative), recommend ("should"), exclaim (exclamation)

GRAPHICAL ACT
- DEICTIC ACT: highlight, blink, circle, etc.; indicate direction
- DISPLAY CONTROL ACT: display-region, zoom (in, out), pan (left, right, up, down)
- DEPICT ACT: depict image, draw (line, arc, circle), animate-action
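One possible encoding of this taxonomy is a set of enumerations, one per modality, which makes it easy to ask which modalities can realize a given act. Act names follow the slide; the grouping into Python enums is an assumption for illustration.

```python
# Sketch: the three-way communicative-act taxonomy as Python enums.
from enum import Enum

class PhysicalAct(Enum):
    DEICTIC = ("point", "tap", "circle", "indicate-direction")
    ATTENTIONAL = ("pound-fist", "stomp-foot", "snap-fingers", "clap")
    BODY_LANGUAGE = ("facial-expression", "gesture", "sign-language")

class LinguisticAct(Enum):
    ILLOCUTIONARY = ("inform", "request", "warn", "concede")
    LOCUTIONARY = ("assert", "ask", "command", "recommend", "exclaim")

class GraphicalAct(Enum):
    DEICTIC = ("highlight", "blink", "circle", "indicate-direction")
    DISPLAY_CONTROL = ("display-region", "zoom", "pan")
    DEPICT = ("depict-image", "draw", "animate-action")

def classify(verb):
    """Find which act categories (per modality) realize a given verb."""
    hits = []
    for modality in (PhysicalAct, LinguisticAct, GraphicalAct):
        for act in modality:
            if verb in act.value:
                hits.append((modality.__name__, act.name))
    return hits

print(classify("circle"))  # realizable both physically and graphically
```

A representation like this is what lets a presentation planner substitute, say, a graphical highlight for a physical pointing act when the output medium changes.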
Traditional Architecture
User(s), information, applications, and people interact through three components: Presentation, Dialog Control, and Application Interface.

Architecture of the SmartKom Agent (cf. Maybury/Wahlster 1998)
User(s), information, applications, and people connect through:
- Media Input Processing: media/mode analysis of language, graphics, gesture, and biometrics, followed by media fusion
- Interaction Management: intention recognition, discourse modeling, user modeling, presentation design
- Representation and Inference over shared models: user model, discourse model, domain model, task model, media models
- Media/Mode Design: language, graphics, gesture, and an animated presentation agent, followed by media output rendering
- Application Interface: integration, request, initiation, response
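Media fusion, the stage that merges analyzed inputs before interaction management, can be illustrated by unifying a spoken deictic with a time-aligned pointing gesture. The frame fields, timestamps, and skew threshold below are assumptions for illustration, not SmartKom's actual representation.

```python
# Hedged sketch of late media fusion: resolve the deictic slot in a
# spoken command ("delete that") using the gesture closest in time.

def fuse(speech_frame, gestures, max_skew=1.0):
    """Fill a deictic object slot from time-aligned gesture input."""
    if speech_frame.get("object") != "<deictic>":
        return speech_frame                      # nothing to resolve
    t = speech_frame["time"]
    candidates = [g for g in gestures if abs(g["time"] - t) <= max_skew]
    if not candidates:
        return {**speech_frame, "object": None}  # unresolved reference
    nearest = min(candidates, key=lambda g: abs(g["time"] - t))
    return {**speech_frame, "object": nearest["target"]}

speech = {"act": "command", "verb": "delete", "object": "<deictic>", "time": 12.4}
gestures = [{"type": "point", "target": "file-17", "time": 12.6}]
print(fuse(speech, gestures))  # object slot resolved to "file-17"
```

Real fusion must also handle the imprecise, ambiguous, and partial inputs named in the challenges; this sketch shows only the time-alignment idea.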
DARPA Galaxy Communicator

A central Hub connects servers for: Language Generation, Text-to-Speech Conversion, Audio Server, Dialogue Management, Application Backend, Context Tracking, Frame Construction, and Speech Recognition.

The Galaxy Communicator Software Infrastructure (GCSI) is a distributed, message-based, hub-and-spoke infrastructure optimized for constructing spoken dialogue systems.
Open source and documentation available at fofoca.mitre.org and sourceforge.net/projects/communicator
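The hub-and-spoke idea can be sketched as a router that forwards frames to registered servers by rule. This is an illustration in the spirit of the GCSI, not its actual API; the server names, frame keys, and routing scheme below are all invented.

```python
# Illustrative hub-and-spoke message router (not the real GCSI API).

class Hub:
    def __init__(self):
        self.servers = {}   # name -> handler function
        self.rules = []     # (frame key that triggers routing, server name)

    def register(self, name, handler):
        self.servers[name] = handler

    def add_rule(self, key, server):
        self.rules.append((key, server))

    def process(self, frame):
        # Route the frame through every server whose trigger key is
        # present, merging each server's result back into the frame so
        # later rules can fire on newly produced keys.
        for key, server in self.rules:
            if key in frame:
                frame.update(self.servers[server](frame))
        return frame

hub = Hub()
hub.register("recognizer", lambda f: {"text": "flights to boston"})
hub.register("parser", lambda f: {"query": {"dest": "boston"}})
hub.add_rule("audio", "recognizer")   # audio in -> recognized text
hub.add_rule("text", "parser")        # text in -> semantic frame

print(hub.process({"audio": b"..."}))
```

The key design point mirrored here is that servers never talk to each other directly; all coordination flows through the hub's routing rules.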
An Example: Communicator-Compliant Emergency Management Interface

Components connected through the Hub:
- MITRE I/O podium: displays input and output text
- MIT phone connectivity: connects audio to a telephone line
- Speech recognition: converts speech to text (MIT SUMMIT engine and wrapper)
- Frame construction: extracts information from input text (Colorado Phoenix engine, MITRE wrapper)
- MITRE dialogue management: tracks information, decides what to do, and formulates answers
- MITRE SQL generation: converts abstract requests to SQL against the database (open-source Postgres engine, MITRE wrapper)
- Text-to-speech: converts output text to audio (CMU Festival engine, Colorado wrapper)
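The "converts abstract requests to SQL" step can be illustrated by turning a dialogue-level frame into a parameterized query. The table and column names below are invented for illustration; the slide does not show the actual MITRE component.

```python
# Hypothetical frame-to-SQL translation for an emergency-management query.

def frame_to_sql(frame):
    """Build a parameterized SQL query from an abstract request frame."""
    cols = ", ".join(frame.get("show", ["*"]))
    clauses, params = [], []
    for field, value in frame.get("constraints", {}).items():
        clauses.append(f"{field} = ?")        # placeholder, not raw value
        params.append(value)
    sql = f"SELECT {cols} FROM incidents"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

frame = {"show": ["location", "severity"],
         "constraints": {"type": "flood", "region": "north"}}
print(frame_to_sql(frame))
# -> ('SELECT location, severity FROM incidents WHERE type = ? AND region = ?',
#     ['flood', 'north'])
```

Keeping values as parameters rather than splicing them into the string is what lets a database wrapper execute dialogue-derived queries safely.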
[Figure: brain imaging of auditory vs. visual word processing]
Task conditions (subtractive design):
- SENSORY: passive words vs. fixation stimuli (region 2 = primary auditory cortex; non-speech audio did not activate region 2)
- OUTPUT: repeat words vs. passive words
- ASSOCIATION: generate use vs. repeat words (e.g., cake -> eat)
- SEMANTIC: monitor semantic category vs. passive words
Activated regions (panels a-f, slices relative to the ac-pc line):
- a) temporoparietal, bilateral superior and posterior temporal, inferior anterior cingulate (1.6 cm above the ac-pc line)
- b) occipital cortex (4 cm above the ac-pc line)
- c, d) Rolandic cortex (anterior superior motor cortex)
- e, f) inferior anterior frontal cortex, Brodmann area 47 (8 cm below)
- Left inferior frontal (semantic association); anterior cingulate gyrus (attentional system for action selection, e.g., "pick dangerous animals")
Source: Science or Nature, Univ Washington
Evaluation Techniques

IUI evaluation is harder than HCI evaluation:
- User influences interface behavior (i.e., user model)
- Interface influences user behavior (e.g., critiquing, cooperating, challenging)
- Varying task complexity, environment
- Requires more careful evaluation

Many techniques:
- "Heuristic evaluation," i.e., cognitive walk-through
- Analytic/formal/theoretic (e.g., GOMS, CCT, ICS): model resources required, task complexity, and time to complete in order to predict performance and critique the interface
- Ablation studies
- Wizard-of-Oz, simulations
- Instrumentation of live environments
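Among the analytic techniques, the keystroke-level model in the GOMS family predicts task time by summing per-operator costs. The sketch below uses the classic Card, Moran, and Newell operator estimates; the example task sequence is illustrative.

```python
# Keystroke-level model (KLM) sketch: predict execution time by summing
# per-operator costs (classic Card/Moran/Newell estimates, in seconds).

KLM_SECONDS = {
    "K": 0.28,  # keystroke (average typist)
    "P": 1.10,  # point with mouse
    "B": 0.10,  # mouse button press or release
    "H": 0.40,  # home hands between keyboard and mouse
    "M": 1.35,  # mental preparation
}

def predict_time(operators):
    """Predicted execution time for a sequence of KLM operators."""
    return sum(KLM_SECONDS[op] for op in operators)

# Example: select a menu item -- think, reach for mouse, point, click.
menu_select = ["M", "H", "P", "B", "B"]
print(round(predict_time(menu_select), 2))  # 3.05 seconds
```

Comparing such predictions across candidate designs is how analytic evaluation critiques an interface before any user study.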
Instrumented Evaluation Process

An instrumented interactive application feeds interaction logging, producing an indexed, enriched log. The log supports replay, data visualization, and annotation, which in turn feed analysis and evaluation and, ultimately, corpus-based adaptation.

Source: DARPA IC&V
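Interaction logging of the kind this process relies on can be sketched as a timestamped, indexed event log that replay and analysis tools query. The event schema below is an assumption for illustration.

```python
# Sketch of interaction logging for instrumented evaluation: events
# accumulate into an indexed log for later replay and analysis.
import time
from collections import defaultdict

class InteractionLog:
    def __init__(self):
        self.events = []                 # ordered log, suitable for replay
        self.index = defaultdict(list)   # event type -> positions in log

    def record(self, event_type, **data):
        self.index[event_type].append(len(self.events))
        self.events.append({"t": time.time(), "type": event_type, **data})

    def query(self, event_type):
        """Indexed access: all events of one type, in order."""
        return [self.events[i] for i in self.index[event_type]]

log = InteractionLog()
log.record("utterance", text="show flood reports")
log.record("click", widget="map")
log.record("utterance", text="zoom in")
print(len(log.query("utterance")))  # 2
```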
Instrumentation Software: WOSIT and Collagen
- Tutoring agent: Collagen (MERL)
- Instrumentation: JOSIT
- End-user application: TARGETS
The instrumentation observes and simulates the end-user application; the tutoring agent interprets observations, performs actions, and communicates and interacts with the user.
JOSIT: http://www.mitre.org/tech_transfer/josit/
WOSIT: http://www.mitre.org/technology/wosit/
Summary: Our Challenges
- Empirical studies of the optimal combination of text, audio, video, and gesture for device, cognitive load and style, task, etc., for both input and output (perceptual, cognitive, and emotional effect)
- Multi* Input: integration of imprecise, ambiguous, and incomplete input; interpretation of uncertain multi* inputs
- Multi* Output: select, design, allocate, and realize coherent, cohesive, and coordinated output
- Interaction Management: natural, joyous, agent-based (?) mixed-initiative interaction
- Integrative Architectures: common components, well-defined interfaces, levels of representation
- Methodology and Evaluation: ethnographic studies, community tasks, corpus-based
Conclusion
- Emerging techniques for parsing simultaneous multimedia input, generating coordinated multimedia output, and tailoring interaction to the user, task, and situation
- Laboratory prototypes that integrate these to support multimedia dialogue and agent-based interaction
- Personalization increasing; privacy a concern
- Range of application areas: decision support, information retrieval, education and training, entertainment
- Potential benefits:
  - Increase the raw bit rate of information flow (the right media/modality mix for the job)
  - Increase relevance of information (e.g., information selection, tailored presentation)
  - Simplify and speed task performance via interface agents (e.g., speech inflections, facial expressions, hand gestures, task delegation)