Intelligent Multimodal Interaction: Challenges and Promise
Mark T. Maybury, [email protected]
Schloss Dagstuhl, Germany
29 October 2001
MITRE
www.mitre.org/resources/centers/it/maybury/mark.html
This data is the copyright and proprietary data of the MITRE Corporation. It is made available subject to Limited Rights, as defined in paragraph (a) (15) of the clause at
DFAR 252.227-7013. The restrictions governing the use and disclosure of these materials are set forth in the aforesaid clause.
What are we talking about?
- Information, Perception, Cognition, Emotion
- Modalities: visualization, speech, haptics/gesture, facial expression, sight, smell
Image Source: Dr. Nahum Gershon and Ellaine Mullen, Copyright The MITRE Corporation
Why Multimedia? Direct Manipulation vs. Natural Language

Direct Manipulation
Strengths:
1. Intuitive
2. Consistent look and feel
3. Options apparent
4. Fail safe
5. "Direct engagement" with object: (a) point, act; (b) feedback
Weaknesses:
1. Description
2. Anaphora
3. Operation on sets
4. Delayed actions difficult

Natural Language
Strengths:
1. Intuitive
2. Description, e.g., (a) quantification, (b) negation, (c) temporal, (d) motivation/cause
3. Context
4. Anaphora
5. Asynchronous
Weaknesses:
1. Coverage is opaque
2. Overkill for short or frequent queries
3. Difficulty of establishing and navigating context; spatial specification cumbersome
4. Anaphora problematic
5. Error prone (ambiguity, vagueness, incompleteness)

Above modified from Cohen, P. 1992. The role of natural language in a multimodal interface. In Proceedings of the ACM SIGGRAPH Symposium on User Interface Software and Technology (UIST), Monterey, CA, 143-149.
Why Multimedia? Handwriting/Pen vs. Speech

Evidence users prefer both:
- Flexibility (user, task, situation), e.g., speech for text, pen for numbers
- Efficiency and expressive power

Handwriting/Pen
Strengths:
1. Intuitive
2. Visual feedback
3. Persistent/ease of editing
4. Multifunctional (select, sketch, write)
5. Private
Weaknesses:
1. Unrobust recognition
2. Training cost (i.e., writer dependence)
3. Some unnatural interaction (e.g., no beautifying)
4. Slower than speech

Speech
Strengths:
1. Intuitive
2. Fast
3. Hands/eyes-busy tasks
4. Tight interaction (e.g., clarification)
5. High bandwidth (e.g., prosodics, attitude, age, gender)
Weaknesses:
1. Unrobust recognition
2. Training cost (e.g., speaker dependence)
3. Some systems unnatural (e.g., isolated words, small vocabulary)
4. Public
Our Challenges
- Empirical studies of the optimal combination of text, audio, video, and gesture for device, cognitive load and style, task, etc., for both input and output (perceptual, cognitive, and emotional effect)
- Multi* Input: integration of imprecise, ambiguous, and incomplete input; interpretation of uncertain multi* inputs
- Multi* Output: select, design, allocate, and realize coherent, cohesive, and coordinated output
- Interaction Management: natural, joyous, agent-based (?) mixed-initiative interaction
- Integrative Architectures: common components, well-defined interfaces, levels of representation
- Methodology and Evaluation: ethnographic studies, community tasks, corpus-based
Interface Function: State of the Art / Grand Challenges / Benefits of Advances

Input Analysis
- State of the art: Sequential keyboard and two-dimensional mouse or touch screen input. Limited spoken language input.
- Grand challenges: Interpretation of imprecise, ambiguous, and/or partial multimodal input.
- Benefits of advances: Flexibility and choice in alternative media input, synergistic input, robust interpretation.

Output Generation
- State of the art: Canned presentations utilizing primarily graphics and text. Single-document, monolingual summarization.
- Grand challenges: Automated generation of coordinated speech, natural language, gesture, animation, and non-speech audio, possibly delivered via interactive, animated life-like agents.
- Benefits of advances: Mixed media (e.g., text, graphics, video, speech, and non-speech audio) and mode (e.g., linguistic, visual, auditory) displays tailored to the user and context. Life-like animated characters.

Dialog Control
- State of the art: Pre-scripted interactions with standard dialogue presentations (e.g., windows, menus, buttons).
- Grand challenges: Mixed-initiative natural interaction that deals robustly with context shift, interruptions, feedback, and shift of locus of control.
- Benefits of advances: Ability to tailor flow and control of interactions and facilitate interactions, including error detection and correction tailored to individual physical, perceptual, and cognitive differences. Motivational and engaging life-like agents.

Agent/User Modeling
- State of the art: Limited models of user interests (e.g., via explicitly solicited user models). Recommender technology.
- Grand challenges: Unobtrusive learning, representation, and use of models of users/agents, including models of perception, cognition, and emotion.
- Benefits of advances: Enables tracking of user characteristics, skills, and goals in order to adapt and enhance interaction.

API
- State of the art: Variable specification of underlying application functionality. Move toward component-based architectures.
- Grand challenges: Addressing increasingly broad, interdependent, and complex application functionality.
- Benefits of advances: Simplification of functionality, possibly limited by user and/or context models. Automated task completion. Task help tailored to situation, context, and user. Mobile and substitutable interfaces for disabled users.
Multimedia Presentation Generation: "No Presentation without Representation"

DATA
Philosopher | Born   | Died   | Works    | Emphasis
Socrates    | 470 BC | 399 BC | None     | Virtue
Plato       | 428 BC | 348 BC | Republic | Conduct
Aristotle   | 384 BC | 322 BC | Poetics  | Science

The same underlying representation can be realized in different media:
- MAPS: Athens
- VIDEO: Plato, Aristotle
- NATURAL LANGUAGE: "Socrates, Plato, and Aristotle were Greek philosophers ..."
- TABLES: Philosopher/Born/Died (Socrates 470-399, Plato 428-348, Aristotle 384-322)
- GRAPHS: lifespan timeline (500-300 BC) and lifespan-in-years bar chart (age 0-100) for Socrates, Plato, and Aristotle
- ANIMATED AGENTS
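The philosopher data above can drive media allocation: given a representation of the content, simple rules pick a medium. The sketch below is an illustrative assumption in Python; the rule set and attribute schema are invented, not the system on the slide.

```python
# Hypothetical rule-based media allocation over attributes of the
# philosopher data. The rules and schema are illustrative assumptions.

def allocate_medium(attribute):
    """Pick a presentation medium from simple data characteristics."""
    if attribute["type"] == "location":
        return "map"            # spatial data -> maps (e.g., Athens)
    if attribute["type"] == "interval" and attribute["comparable"]:
        return "graph"          # comparable numeric spans -> lifespan chart
    if attribute["type"] in ("number", "date"):
        return "table"          # precise values -> tables
    return "natural language"   # descriptive content -> running text

attributes = [
    {"name": "birthplace", "type": "location", "comparable": False},
    {"name": "lifespan", "type": "interval", "comparable": True},
    {"name": "born/died", "type": "date", "comparable": True},
    {"name": "emphasis", "type": "category", "comparable": False},
]

for a in attributes:
    print(a["name"], "->", allocate_medium(a))
```

The point of the sketch is only that allocation decisions are computed from the representation, never from canned output.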
Common Presentation Design Tasks

Co-constraining, cascaded processes:
Communication Management -> Content Selection -> Presentation Design -> Media Allocation -> Media Realization -> Media Coordination -> Media Layout
- Content selection draws on information, task, user, ...
- Media realization must respect the expressivity of different languages, e.g., the "ven acá" ("come here") gesture
- Length affects layout in space or time (e.g., EYP, audio)
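The cascade can be sketched as a pipeline of stages over a shared design state, with a later stage (layout) constraining an earlier choice (allocation) — the "co-constraining" property. Stage internals, the state dict, and the width threshold below are all illustrative assumptions.

```python
# Minimal sketch of a co-constraining presentation-design cascade.
# Stage names follow the slide; the data they pass is illustrative.

def content_selection(state):
    state["content"] = ["lifespan(Socrates)", "lifespan(Plato)"]
    return state

def media_allocation(state):
    # Allocate each content item to a medium (here: everything to a graph).
    state["allocation"] = {item: "graph" for item in state["content"]}
    return state

def media_layout(state):
    # Downstream layout can constrain upstream choices (co-constraining):
    # if the display is too narrow for a graph, fall back to text.
    if state["display_width"] < 200:
        state["allocation"] = {k: "text" for k in state["allocation"]}
    return state

PIPELINE = [content_selection, media_allocation, media_layout]

def design_presentation(display_width):
    state = {"display_width": display_width}
    for stage in PIPELINE:
        state = stage(state)
    return state["allocation"]

print(design_presentation(640))   # graphs fit on a wide display
print(design_presentation(120))   # layout forces fallback to text
```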
Common Representations: Communicative Acts [Maybury, 1993; Wahlster, André, Rist 1993]

PHYSICAL ACT
- DEICTIC ACT: point, tap, circle; indicate direction
- ATTENTIONAL ACT: pound fist/stomp foot; snap/tap fingers, clap hands
- BODY LANGUAGE ACT: facial expressions, gestures, sign language

LINGUISTIC ACT
- REFERENTIAL/ATTENTIONAL ACT
- ILLOCUTIONARY ACT: inform, request, warn, concede
- LOCUTIONARY ACT: assert (declarative), ask (interrogative), command (imperative), recommend ("should"), exclaim (exclamation)

GRAPHICAL ACT
- DEICTIC ACT: highlight, blink, circle, etc.; indicate direction
- DISPLAY CONTROL ACT: display-region, zoom (in, out), pan (left, right, up, down)
- DEPICT ACT: depict image, draw (line, arc, circle), animate-action
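One possible encoding of this taxonomy is a set of enumerations, one per modality, which makes it easy to ask which modalities can realize a given act. Act names follow the slide; the grouping into Python enums is an assumption for illustration.

```python
# Sketch: the three-way communicative-act taxonomy as Python enums.
from enum import Enum

class PhysicalAct(Enum):
    DEICTIC = ("point", "tap", "circle", "indicate-direction")
    ATTENTIONAL = ("pound-fist", "stomp-foot", "snap-fingers", "clap")
    BODY_LANGUAGE = ("facial-expression", "gesture", "sign-language")

class LinguisticAct(Enum):
    ILLOCUTIONARY = ("inform", "request", "warn", "concede")
    LOCUTIONARY = ("assert", "ask", "command", "recommend", "exclaim")

class GraphicalAct(Enum):
    DEICTIC = ("highlight", "blink", "circle", "indicate-direction")
    DISPLAY_CONTROL = ("display-region", "zoom", "pan")
    DEPICT = ("depict-image", "draw", "animate-action")

def classify(verb):
    """Find which act categories (per modality) realize a given verb."""
    hits = []
    for modality in (PhysicalAct, LinguisticAct, GraphicalAct):
        for act in modality:
            if verb in act.value:
                hits.append((modality.__name__, act.name))
    return hits

print(classify("circle"))  # realizable both physically and graphically
```

A representation like this is what lets a presentation planner substitute, say, a graphical highlight for a physical pointing act when the output medium changes.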
Traditional Architecture
User(s), information, applications, and people interact through three components: Presentation, Dialog Control, and Application Interface.

Architecture of the SmartKom Agent (cf. Maybury/Wahlster 1998)
User(s), information, applications, and people connect through:
- Media Input Processing: media/mode analysis of language, graphics, gesture, and biometrics, followed by media fusion
- Interaction Management: intention recognition, discourse modeling, user modeling, presentation design
- Representation and Inference over shared models: user model, discourse model, domain model, task model, media models
- Media/Mode Design: language, graphics, gesture, and an animated presentation agent, followed by media output rendering
- Application Interface: integration, request, initiation, response
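Media fusion, the stage that merges analyzed inputs before interaction management, can be illustrated by unifying a spoken deictic with a time-aligned pointing gesture. The frame fields, timestamps, and skew threshold below are assumptions for illustration, not SmartKom's actual representation.

```python
# Hedged sketch of late media fusion: resolve the deictic slot in a
# spoken command ("delete that") using the gesture closest in time.

def fuse(speech_frame, gestures, max_skew=1.0):
    """Fill a deictic object slot from time-aligned gesture input."""
    if speech_frame.get("object") != "<deictic>":
        return speech_frame                      # nothing to resolve
    t = speech_frame["time"]
    candidates = [g for g in gestures if abs(g["time"] - t) <= max_skew]
    if not candidates:
        return {**speech_frame, "object": None}  # unresolved reference
    nearest = min(candidates, key=lambda g: abs(g["time"] - t))
    return {**speech_frame, "object": nearest["target"]}

speech = {"act": "command", "verb": "delete", "object": "<deictic>", "time": 12.4}
gestures = [{"type": "point", "target": "file-17", "time": 12.6}]
print(fuse(speech, gestures))  # object slot resolved to "file-17"
```

Real fusion must also handle the imprecise, ambiguous, and partial inputs named in the challenges; this sketch shows only the time-alignment idea.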
DARPA Galaxy Communicator

A central Hub connects servers for: Language Generation, Text-to-Speech Conversion, Audio Server, Dialogue Management, Application Backend, Context Tracking, Frame Construction, and Speech Recognition.

The Galaxy Communicator Software Infrastructure (GCSI) is a distributed, message-based, hub-and-spoke infrastructure optimized for constructing spoken dialogue systems.
Open source and documentation available at fofoca.mitre.org and sourceforge.net/projects/communicator
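The hub-and-spoke idea can be sketched as a router that forwards frames to registered servers by rule. This is an illustration in the spirit of the GCSI, not its actual API; the server names, frame keys, and routing scheme below are all invented.

```python
# Illustrative hub-and-spoke message router (not the real GCSI API).

class Hub:
    def __init__(self):
        self.servers = {}   # name -> handler function
        self.rules = []     # (frame key that triggers routing, server name)

    def register(self, name, handler):
        self.servers[name] = handler

    def add_rule(self, key, server):
        self.rules.append((key, server))

    def process(self, frame):
        # Route the frame through every server whose trigger key is
        # present, merging each server's result back into the frame so
        # later rules can fire on newly produced keys.
        for key, server in self.rules:
            if key in frame:
                frame.update(self.servers[server](frame))
        return frame

hub = Hub()
hub.register("recognizer", lambda f: {"text": "flights to boston"})
hub.register("parser", lambda f: {"query": {"dest": "boston"}})
hub.add_rule("audio", "recognizer")   # audio in -> recognized text
hub.add_rule("text", "parser")        # text in -> semantic frame

print(hub.process({"audio": b"..."}))
```

The key design point mirrored here is that servers never talk to each other directly; all coordination flows through the hub's routing rules.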
An Example: Communicator-Compliant Emergency Management Interface

Components connected through the Hub:
- MITRE I/O podium: displays input and output text
- MIT phone connectivity: connects audio to a telephone line
- Speech recognition: converts speech to text (MIT SUMMIT engine and wrapper)
- Frame construction: extracts information from input text (Colorado Phoenix engine, MITRE wrapper)
- MITRE dialogue management: tracks information, decides what to do, and formulates answers
- MITRE SQL generation: converts abstract requests to SQL against the database (open-source Postgres engine, MITRE wrapper)
- Text-to-speech: converts output text to audio (CMU Festival engine, Colorado wrapper)
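The "converts abstract requests to SQL" step can be illustrated by turning a dialogue-level frame into a parameterized query. The table and column names below are invented for illustration; the slide does not show the actual MITRE component.

```python
# Hypothetical frame-to-SQL translation for an emergency-management query.

def frame_to_sql(frame):
    """Build a parameterized SQL query from an abstract request frame."""
    cols = ", ".join(frame.get("show", ["*"]))
    clauses, params = [], []
    for field, value in frame.get("constraints", {}).items():
        clauses.append(f"{field} = ?")        # placeholder, not raw value
        params.append(value)
    sql = f"SELECT {cols} FROM incidents"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

frame = {"show": ["location", "severity"],
         "constraints": {"type": "flood", "region": "north"}}
print(frame_to_sql(frame))
# -> ('SELECT location, severity FROM incidents WHERE type = ? AND region = ?',
#     ['flood', 'north'])
```

Keeping values as parameters rather than splicing them into the string is what lets a database wrapper execute dialogue-derived queries safely.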
[Figure: brain imaging of auditory vs. visual word processing]
Task conditions (subtractive design):
- SENSORY: passive words vs. fixation stimuli (region 2 = primary auditory cortex; non-speech audio did not activate region 2)
- OUTPUT: repeat words vs. passive words
- ASSOCIATION: generate use vs. repeat words (e.g., cake -> eat)
- SEMANTIC: monitor semantic category vs. passive words
Activated regions (panels a-f, slices relative to the ac-pc line):
- a) temporoparietal, bilateral superior and posterior temporal, inferior anterior cingulate (1.6 cm above the ac-pc line)
- b) occipital cortex (4 cm above the ac-pc line)
- c, d) Rolandic cortex (anterior superior motor cortex)
- e, f) inferior anterior frontal cortex, Brodmann area 47 (8 cm below)
- Left inferior frontal (semantic association); anterior cingulate gyrus (attentional system for action selection, e.g., "pick dangerous animals")
Source: Science or Nature, Univ Washington
Evaluation Techniques

IUI evaluation is harder than HCI evaluation:
- User influences interface behavior (i.e., user model)
- Interface influences user behavior (e.g., critiquing, cooperating, challenging)
- Varying task complexity, environment
- Requires more careful evaluation

Many techniques:
- "Heuristic evaluation," i.e., cognitive walk-through
- Analytic/formal/theoretic (e.g., GOMS, CCT, ICS): model resources required, task complexity, and time to complete in order to predict performance and critique the interface
- Ablation studies
- Wizard-of-Oz, simulations
- Instrumentation of live environments
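Among the analytic techniques, the keystroke-level model in the GOMS family predicts task time by summing per-operator costs. The sketch below uses the classic Card, Moran, and Newell operator estimates; the example task sequence is illustrative.

```python
# Keystroke-level model (KLM) sketch: predict execution time by summing
# per-operator costs (classic Card/Moran/Newell estimates, in seconds).

KLM_SECONDS = {
    "K": 0.28,  # keystroke (average typist)
    "P": 1.10,  # point with mouse
    "B": 0.10,  # mouse button press or release
    "H": 0.40,  # home hands between keyboard and mouse
    "M": 1.35,  # mental preparation
}

def predict_time(operators):
    """Predicted execution time for a sequence of KLM operators."""
    return sum(KLM_SECONDS[op] for op in operators)

# Example: select a menu item -- think, reach for mouse, point, click.
menu_select = ["M", "H", "P", "B", "B"]
print(round(predict_time(menu_select), 2))  # 3.05 seconds
```

Comparing such predictions across candidate designs is how analytic evaluation critiques an interface before any user study.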
Instrumented Evaluation Process

An instrumented interactive application feeds interaction logging, producing an indexed, enriched log. The log supports replay, data visualization, and annotation, which in turn feed analysis and evaluation and, ultimately, corpus-based adaptation.

Source: DARPA IC&V
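Interaction logging of the kind this process relies on can be sketched as a timestamped, indexed event log that replay and analysis tools query. The event schema below is an assumption for illustration.

```python
# Sketch of interaction logging for instrumented evaluation: events
# accumulate into an indexed log for later replay and analysis.
import time
from collections import defaultdict

class InteractionLog:
    def __init__(self):
        self.events = []                 # ordered log, suitable for replay
        self.index = defaultdict(list)   # event type -> positions in log

    def record(self, event_type, **data):
        self.index[event_type].append(len(self.events))
        self.events.append({"t": time.time(), "type": event_type, **data})

    def query(self, event_type):
        """Indexed access: all events of one type, in order."""
        return [self.events[i] for i in self.index[event_type]]

log = InteractionLog()
log.record("utterance", text="show flood reports")
log.record("click", widget="map")
log.record("utterance", text="zoom in")
print(len(log.query("utterance")))  # 2
```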
Instrumentation Software: WOSIT and Collagen
- Tutoring agent: Collagen (MERL)
- Instrumentation: JOSIT
- End-user application: TARGETS
The instrumentation observes and simulates the end-user application; the tutoring agent interprets observations, performs actions, and communicates and interacts with the user.
JOSIT: http://www.mitre.org/tech_transfer/josit/
WOSIT: http://www.mitre.org/technology/wosit/
Summary: Our Challenges
- Empirical studies of the optimal combination of text, audio, video, and gesture for device, cognitive load and style, task, etc., for both input and output (perceptual, cognitive, and emotional effect)
- Multi* Input: integration of imprecise, ambiguous, and incomplete input; interpretation of uncertain multi* inputs
- Multi* Output: select, design, allocate, and realize coherent, cohesive, and coordinated output
- Interaction Management: natural, joyous, agent-based (?) mixed-initiative interaction
- Integrative Architectures: common components, well-defined interfaces, levels of representation
- Methodology and Evaluation: ethnographic studies, community tasks, corpus-based
Conclusion
- Emerging techniques for parsing simultaneous multimedia input, generating coordinated multimedia output, and tailoring interaction to the user, task, and situation
- Laboratory prototypes that integrate these to support multimedia dialogue and agent-based interaction
- Personalization increasing; privacy a concern
- Range of application areas: decision support, information retrieval, education and training, entertainment
- Potential benefits:
  - Increase the raw bit rate of information flow (the right media/modality mix for the job)
  - Increase relevance of information (e.g., information selection, tailored presentation)
  - Simplify and speed task performance via interface agents (e.g., speech inflections, facial expressions, hand gestures, task delegation)