SmartKom: Fusion and Fission of Speech, Gestures, and Facial Expressions. International Workshop on Man-Machine Symbiotic Systems, Kyoto, 26 November 2002, p. 213. Wolfgang Wahlster, German Research Center for Artificial Intelligence, DFKI GmbH.
German Research Center for Artificial Intelligence, DFKI GmbH
Stuhlsatzenhausweg 3, 66123 Saarbruecken, Germany
phone: (+49 681) 302-5252/4162
fax: (+49 681) 302-5341
e-mail: [email protected]
WWW: http://www.dfki.de/~wahlster
Wolfgang Wahlster
SmartKom: Fusion and Fission of Speech, Gestures, and Facial Expressions
International Workshop on Man-Machine Symbiotic Systems, Kyoto, 26 November 2002, p. 213
© W. Wahlster
SmartKom: Merging Various User Interface Paradigms
Spoken Dialogue
Graphical User Interfaces
Gestural Interaction
Facial Expressions
Biometrics
Multimodal Interaction
Symbolic and Subsymbolic Fusion of Multiple Modes
Speech Recognition
Gesture Recognition
Prosody Recognition
Facial Expression Recognition
Lip Reading
Subsymbolic Fusion
- Neural Networks
- Hidden Markov Models
Symbolic Fusion
- Graph Unification
- Bayesian Networks
Reference Resolution and Disambiguation
Modality-Free Semantic Representation
Outline of the Talk
1. Using all Human Senses for Symbiotic Man-Machine Interaction
2. SmartKom: Multimodal, Multilingual and Multidomain Dialogues
3. Modality Fusion in SmartKom
4. Multimodal Discourse Processing
5. Plan-based Modality Fission in SmartKom
6. Conclusions
SmartKom: A Highly Portable Multimodal Dialogue System
MM Dialogue Back-Bone
Application Layer:
- SmartKom-Home/Office: Consumer Electronics, EPG
- SmartKom-Public: Cinema, Phone, Fax, Mail, Biometrics
- SmartKom-Mobile: Car and Pedestrian Navigation
SmartKom: Intuitive Multimodal Interaction
The SmartKom Consortium:
Main Contractor: DFKI Saarbrücken
Scientific Director: W. Wahlster
Partners named on the slide: MediaInterface, European Media Lab, Univ. of Munich, Univ. of Stuttgart, Univ. of Erlangen
Locations on the map: Saarbrücken, Aachen, Dresden, Berkeley, Stuttgart, Munich, Heidelberg, Ulm
Project Budget: € 25.5 million, funded by BMBF (Dr. Reuse) and industry
Project Duration: 4 years (September 1999 – September 2003)
SmartKom’s SDDP Interaction Metaphor
SDDP = Situated Delegation-oriented Dialogue Paradigm
Anthropomorphic Interface = Dialogue Partner
User: specifies goal, delegates task
User and Personalized Interaction Agent: cooperate on problems
Personalized Interaction Agent: asks questions, presents results
Web services: Service 1, Service 2, Service 3
See: Wahlster et al. 2001, Eurospeech
Multimodal Input and Output in the SmartKom System
Where would you like to sit?
Symbiotic Interaction with a Life-like Character
User Input (Speech, Gesture, and Facial Expressions): I’d like to reserve tickets for this performance.
Smartakus Output (Speech, Gesture, and Facial Expressions): Where would you like to sit?
User Input (Speech, Gesture, and Facial Expressions): I’d like these two seats.
Multimodal Input and Output in SmartKom: Fusion and Fission of Multiple Modalities
Input by the User: Speech + Gesture + Facial Expressions
Output by the Presentation Agent: Speech + Gesture + Facial Expressions
SmartKom’s Data Collection of Multimodal Dialogs
[Figure: recording setup. Labels: User, Side-view Camera, Face-tracking Camera with Microphone, Environmental Noise, Microphone Array, Screen, Projected Webpage, Loudspeaker, Bird’s-eye Camera, LCD Beamer, SIVIT Camera]
Personalized Interaction with WebTVs via SmartKom (DFKI with Sony, Philips, Siemens)
User: Switch on the TV.
Smartakus: Okay, the TV is on.
User: Which channels are presenting the latest news right now?
Smartakus: CNN and NTV are presenting news.
User: Please record this news channel on a videotape.
Smartakus: Okay, the VCR is now recording the selected program.
Example: Multimodal Access to Electronic Program Guides for TV
Using Facial Expression Recognition for Affective Personalization
(1) Smartakus: Here you see the CNN program for tonight.
(2) User: That’s great.
(3) Smartakus: I’ll show you the program of another channel for tonight.
(2’) User: That’s great.
(3’) Smartakus: Which of these features do you want to see?
Processing ironic or sarcastic comments
Recognizing Affect: A Negative Facial Expression of the User
[Figure labels: negative, neutral]
The SmartKom Demonstrator System
Camera for Gestural Input
Microphone
Multimodal Control of TV-Set
Multimodal Control of VCR/DVD Player
Camera for Facial Analysis
Combination of Speech and Gesture in SmartKom
This one I would like to see.
Where is it shown?
Multimodal Input and Output in SmartKom
Please show me where you would like to be seated.
Getting Driving and Walking Directions via SmartKom
User: I want to drive to Heidelberg.
Smartakus: Do you want to take the fastest or the shortest route?
User: The fastest.
Smartakus: Here you see a map with your route from Saarbrücken to Heidelberg.
SmartKom can be used for Multimodal Navigation Dialogues in a Car
Getting Driving and Walking Directions via SmartKom
Smartakus: You are now in Heidelberg. Here is a sightseeing map of Heidelberg.
User: I would like to know more about this church!
Smartakus: Here is some information about St. Peter's Church.
User: Could you please give me walking directions to this church?
Smartakus: In this map, I have highlighted your walking route.
SmartKom: Multimodal Dialogues with a Hybrid Navigation System
• Seamless integration and mutual disambiguation of multimodal input and output on semantic and pragmatic levels
• Situated understanding of possibly imprecise, ambiguous, or incomplete multimodal input
• Context-sensitive interpretation of dialog interaction on the basis of dynamic discourse and context models
• Adaptive generation of coordinated, cohesive and coherent multimodal presentations
• Semi- or fully automatic completion of user-delegated tasks through the integration of information services
• Intuitive personification of the system through a presentation agent
Salient Characteristics of SmartKom
The High-Level Control Flow of SmartKom
SmartKom’s Multimodal Dialogue Back-Bone
Legend: Communication Blackboards, Data Flow, Context Dependencies
Analyzers
• Speech
• Gestures
• Facial Expressions
Dialogue Manager
Modality Fusion
Discourse Modeling
Action Planning
Modality Fission
External Services
Generators
• Speech
• Graphics
• Gestures
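The back-bone can be read as a chain of modules that pass their results along through shared blackboards. The following is a minimal sketch of that idea, not SmartKom's actual middleware: the `Blackboard` class, the `connect` helper and the placeholder processing steps are all hypothetical, and only the stage names are taken from the slide.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List


class Blackboard:
    """Hypothetical publish/subscribe pool shared by all modules."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Any], None]]] = defaultdict(list)
        self.log: List[str] = []  # order in which topics were written

    def subscribe(self, topic: str, handler: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, data: Any) -> None:
        self.log.append(topic)
        for handler in self._subscribers[topic]:
            handler(data)


def connect(board: Blackboard, source: str, target: str,
            step: Callable[[dict], dict]) -> None:
    """Module `target` reads the output of `source`, applies `step`, publishes its own output."""
    board.subscribe(source, lambda data: board.publish(target, step(data)))


board = Blackboard()
# Each topic holds the output of the module it is named after.
# The lambdas are placeholders, not real components.
connect(board, "analyzers", "modality_fusion", lambda d: {**d, "fused": True})
connect(board, "modality_fusion", "discourse_modeling", lambda d: {**d, "in_context": True})
connect(board, "discourse_modeling", "action_planning", lambda d: {**d, "plan": "answer"})
connect(board, "action_planning", "modality_fission",
        lambda d: {**d, "modes": ["speech", "graphics"]})

results: List[dict] = []
board.subscribe("modality_fission", results.append)  # stands in for the generators
board.publish("analyzers", {"speech": "where is it shown", "gesture": "pointing"})
```

Because modules only know topic names, a new analyzer or generator can be attached without touching the others, which is one plausible reason for a blackboard design in a system with this many components.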
Unification of Scored Hypothesis Graphs for Modality Fusion in SmartKom
Inputs to Modality Fusion:
- Word Hypothesis Graph with Acoustic Scores
- Clause and Sentence Boundaries with Prosodic Scores
- Scored Hypotheses about the User’s Emotional State
- Gesture Hypothesis Graph with Scores of Potential Reference Objects
Modality Fusion: Mutual Disambiguation, Reduction of Uncertainty
Intention Hypotheses Graph
Intention Recognizer: Selection of Most Likely Interpretation
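Mutual disambiguation can be illustrated with two small lists of scored hypotheses, one per modality. This is a simplified sketch under stated assumptions: the hypotheses are flat dictionaries rather than graphs, the scores are invented, and multiplying the two scores is just one possible way to rank the combinations. It is not SmartKom's scoring scheme.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

Hypothesis = Tuple[Dict[str, str], float]

# Hypothetical scored hypotheses; all values are made up for illustration.
speech_hyps: List[Hypothesis] = [
    ({"act": "reserve", "object_type": "seat"}, 0.6),
    ({"act": "reserve", "object_type": "film"}, 0.4),
]
gesture_hyps: List[Hypothesis] = [
    ({"object_type": "film", "referent": "film_3"}, 0.3),
    ({"object_type": "seat", "referent": "seat_12"}, 0.7),
]


def unify(a: Dict[str, str], b: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Merge two flat feature structures; return None if any value conflicts."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None
        merged[key] = value
    return merged


def fuse(speech: List[Hypothesis], gesture: List[Hypothesis]) -> List[Hypothesis]:
    """Unify every speech/gesture pair and rank the survivors by joint score."""
    fused = []
    for (s, s_score), (g, g_score) in product(speech, gesture):
        merged = unify(s, g)
        if merged is not None:
            fused.append((merged, s_score * g_score))  # assumed combination rule
    return sorted(fused, key=lambda pair: pair[1], reverse=True)


intention_hyps = fuse(speech_hyps, gesture_hyps)
best, best_score = intention_hyps[0]
```

Incompatible pairs drop out during unification, so each modality prunes the other's alternatives before the most likely interpretation is selected.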
SmartKom’s Computational Mechanisms for Modality Fusion and Fission
M3L: Modality-Free Semantic Representation
Ontological Inferences
Unification
Overlay Operations
Planning
Constraint Propagation
Modality Fusion, Modality Fission
The Overlay Operation Versus the Unification Operation
• Nonmonotonic and noncommutative unification-like operation
• Inherits (non-conflicting) background information
• Two sources of conflicts:
– conflicting atomic values: overwrite background (old) with covering (new)
– type clash: assimilate background to the type of covering; recursion
[Figure labels: Unification, Overlay]
cf. J. Alexandersson, T. Becker 2001
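The two conflict rules can be made concrete with a small sketch of overlay on typed feature structures, here represented as Python dictionaries with a `"type"` entry. The type names are borrowed from the talk's TV/cinema example. The `PARENT` and `FEATURES` tables, and the shortcut of assimilating the background directly to the covering type, are assumptions made for illustration and simplify the operation described by Alexandersson and Becker.

```python
from typing import Any, Dict, List, Optional, Set

# Assumed type hierarchy and the features each type introduces.
PARENT: Dict[str, Optional[str]] = {
    "AVEntertainment": None,
    "Broadcast": "AVEntertainment",
    "Performance": "AVEntertainment",
}
FEATURES: Dict[str, Set[str]] = {
    "AVEntertainment": {"title", "beginTime"},
    "Broadcast": {"channel"},
    "Performance": {"cinema"},
}


def ancestors(type_name: str) -> List[str]:
    """The type itself followed by its supertypes, most specific first."""
    chain, current = [], type_name
    while current is not None:
        chain.append(current)
        current = PARENT.get(current)
    return chain


def allowed_features(type_name: str) -> Set[str]:
    allowed: Set[str] = set()
    for t in ancestors(type_name):
        allowed |= FEATURES.get(t, set())
    return allowed


def assimilate(background: Dict[str, Any], target_type: str) -> Dict[str, Any]:
    """Type clash: keep only the background features the target type admits."""
    allowed = allowed_features(target_type)
    kept = {f: v for f, v in background.items() if f in allowed}
    kept["type"] = target_type
    return kept


def overlay(covering: Dict[str, Any], background: Dict[str, Any]) -> Dict[str, Any]:
    cov_type, bg_type = covering["type"], background["type"]
    if cov_type in ancestors(bg_type):
        result_type = bg_type                 # compatible: keep the more specific type
    else:
        result_type = cov_type
        if bg_type not in ancestors(cov_type):
            background = assimilate(background, cov_type)
    result = dict(background)                 # inherit background information
    for feature, value in covering.items():
        old = result.get(feature)
        if isinstance(value, dict) and isinstance(old, dict):
            result[feature] = overlay(value, old)   # recursion on nested structures
        else:
            result[feature] = value           # covering overwrites background
    result["type"] = result_type
    return result


background = {"type": "Broadcast", "beginTime": "tonight", "channel": "any"}
covering = {"type": "Performance", "cinema": "any"}
merged = overlay(covering, background)
swapped = overlay(background, covering)
```

Swapping the arguments yields a different structure, which is what "noncommutative" means here. Plain unification would simply fail on the `Broadcast`/`Performance` clash.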
Overlay Operations Using the Discourse Model
Augmentation and Validation
– compare with a number of previous discourse states:
• fill in consistent information
• compute a score
– for each hypothesis-background pair:
– Overlay (covering, background)
Covering: Background:
Intention Hypothesis Lattice
Selected Augmented Hypothesis Sequence
An Example of the Overlay Operation
U: What films are shown on TV tonight? .... U: I’d rather go to the movies.
Type hierarchy: AVEntertainment [title: String, beginTime: Time, ...] with subtypes Performance [cinema: Cinema, ...] and Broadcast [channel: Channel, ...]
Films on TV tonight: Broadcast [beginTime: Time "tonight", channel: any, ...]
Go to the movies: Performance [beginTime: Time, cinema: any, ...]
Generalisation and Specialisation: Performance [beginTime: Time "tonight", ...]
Overlay result: Performance [beginTime: Time "tonight", cinema: any, ...]
SmartKom’s Three-Tiered Discourse Model
System: This [] is a list of films showing in Heidelberg.
User: Please reserve a ticket for the first one.
Modality Layer: VO1, GO1, LO2, LO3, . . ., LO4, LO5, LO6 (linguistic labels on the slide: heidelberg, list, reserve, ticket, first)
Discourse Layer: DO1, DO2, DO3, DO9, DO10, DO11, DO12
Domain Layer: DomainObject1, DomainObject2, . . .
DO = Discourse Object, LO = Linguistic Object, GO = Gestural Object, VO = Visual Object
cf. M. Löckelt et al. 2002, N. Pfleger 2002
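A rough way to picture the three tiers is as three kinds of records that point downward: modality objects refer to discourse objects, which in turn refer to domain objects. The sketch below resolves "the first one" against the list the system has just presented. The class layout, the object names and the film titles are hypothetical; only the layer names and the LO/GO/VO distinction come from the slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DomainObject:                      # domain layer
    description: str


@dataclass
class DiscourseObject:                   # discourse layer
    name: str
    domain_object: Optional[DomainObject] = None
    members: List["DiscourseObject"] = field(default_factory=list)


@dataclass
class ModalityObject:                    # modality layer: LO, GO or VO
    kind: str                            # "linguistic", "gestural" or "visual"
    surface: str
    discourse_object: DiscourseObject


# System turn: "This [gesture] is a list of films showing in Heidelberg."
films = [
    DiscourseObject(f"DO_film_{i}", DomainObject(title))
    for i, title in enumerate(["Film A", "Film B"], start=1)   # invented titles
]
film_list = DiscourseObject("DO_list", DomainObject("films showing in Heidelberg"),
                            members=films)
modality_layer = [
    ModalityObject("visual", "list shown on screen", film_list),
    ModalityObject("gestural", "pointing at the list", film_list),
    ModalityObject("linguistic", "a list of films", film_list),
]


def resolve_ordinal(position: int,
                    layer: List[ModalityObject]) -> Optional[DiscourseObject]:
    """Return the position-th member of the most recently mentioned collection."""
    for obj in reversed(layer):
        members = obj.discourse_object.members
        if len(members) >= position:
            return members[position - 1]
    return None


# User turn: "Please reserve a ticket for the first one."
referent = resolve_ordinal(1, modality_layer)
```

Because the visual, gestural and linguistic objects all share one discourse object, the ordinal can be resolved no matter which modality introduced the list.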
The High-Level Control Flow of SmartKom
Smartakus is a Self-Animated Interface Agent
Smartakus uses body language to notify the user that it is waiting for input, that it is listening, that it has trouble understanding the input, or that it is trying hard to find an answer to the question.
Idle Time, Navigation, Presentation, System State
Some Complex Behavioural Patterns of the Interaction Agent Smartakus
M3L Representation of the Multimodal Discourse Context
Blackboard with Presentation Context of the Previous Dialogue Turn
<?xml version="1.0"?>
<presentationContent>
  [...]
  <abstractPresentationContent>
    <movieTheater structId="pid1234">
      <entityKey> cinema_17a </entityKey>
      <name> Europa </name>
      <geoCoordinate>
        <x> 225 </x>
        <y> 230 </y>
      </geoCoordinate>
    </movieTheater>
  </abstractPresentationContent>
  [...]
  <panelElement>
    <map structId="PM23">
      <boundingShape>
        <leftTop> <x> 0.5542 </x> <y> 0.1950 </y> </leftTop>
        <rightBottom> <x> 0.9892 </x> <y> 0.7068 </y> </rightBottom>
      </boundingShape>
      <contentReference> pid1234 </contentReference>
    </map>
  </panelElement>
  [...]
</presentationContent>
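One use of such a presentation context is to map a pointing gesture back to the object that was displayed at that screen position. The sketch below parses an abridged copy of the M3L fragment with Python's standard `xml.etree.ElementTree`. The `referent_at` helper is hypothetical, and treating the bounding shape as normalized screen coordinates with the origin at the top left is an assumption, not something the slide states.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Abridged from the slide; the "[...]" elisions are left out.
M3L = """<?xml version="1.0"?>
<presentationContent>
  <abstractPresentationContent>
    <movieTheater structId="pid1234">
      <entityKey> cinema_17a </entityKey>
      <name> Europa </name>
      <geoCoordinate> <x> 225 </x> <y> 230 </y> </geoCoordinate>
    </movieTheater>
  </abstractPresentationContent>
  <panelElement>
    <map structId="PM23">
      <boundingShape>
        <leftTop> <x> 0.5542 </x> <y> 0.1950 </y> </leftTop>
        <rightBottom> <x> 0.9892 </x> <y> 0.7068 </y> </rightBottom>
      </boundingShape>
      <contentReference> pid1234 </contentReference>
    </map>
  </panelElement>
</presentationContent>"""

root = ET.fromstring(M3L)
# Index every element that carries a structId so references can be followed.
by_id = {el.get("structId"): el for el in root.iter() if el.get("structId")}


def referent_at(x: float, y: float) -> Optional[str]:
    """Name of the content object whose screen region contains the point (x, y)."""
    for panel in root.iter("panelElement"):
        for region in panel:
            shape = region.find("boundingShape")
            ref = region.findtext("contentReference")
            if shape is None or ref is None:
                continue
            left, top = (float(shape.findtext(f"leftTop/{axis}")) for axis in "xy")
            right, bottom = (float(shape.findtext(f"rightBottom/{axis}")) for axis in "xy")
            if left <= x <= right and top <= y <= bottom:
                return by_id[ref.strip()].findtext("name").strip()
    return None


pointed_at = referent_at(0.70, 0.50)   # a gesture inside the map region
missed = referent_at(0.10, 0.10)       # a gesture outside every region
```

The lookup only works because the system keeps a record of what it displayed and where, so a gesture on the screen can be interpreted against the system's own previous output.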
M3L Specification of a Presentation Task
<presentationTask>
  <subTask>
    <presentationGoal>
      <inform> ... </inform>
      <abstractPresentationContent>
        ...
        <result>
          <broadcast id="bc1">
            <channel> <name>EuroSport</name> </channel>
            <beginTime> <time> <at>2000-12-05T14:00:00</at> </time> </beginTime>
            <endTime> <time> <at>2000-12-05T15:00:00</at> </time> </endTime>
            <avMedium>
              <title>Sport News</title>
              <avType>sport</avType>
              ...
      </abstractPresentationContent>
      <interactionMode>leanForward</interactionMode>
      <goalID>APGOAL3000</goalID>
      <source>generatorAction</source>
      <realizationType>GraphicsAndSpeech</realizationType>
SmartKom’s Presentation Planner
The Presentation Planner generates a Presentation Plan by applying a set of Presentation Strategies to the Presentation Goal.
[Figure: presentation plan tree. Nodes: GlobalPresent; Present, AddSmartakus, DoLayout, EvaluatePersonaNode; Inform; TryToPresentTVOverview; ShowTVOverview; SetLayoutData; PersonaAction; SendScreenCommand; GenerateText; Speak; ... Branch labels: Generation of Layout, Smartakus Actions]
cf. J. Müller, P. Poller, V. Tschernomas 2002
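Applying strategies to a goal amounts to a top-down expansion of the goal into subgoals until only primitive actions remain. The sketch below shows that expansion with a plain lookup table. The node names are taken from the slide, but which node expands into which children is an assumption made for illustration, and real presentation strategies would also carry applicability conditions that are omitted here.

```python
from typing import Dict, List

# Hypothetical strategy table: goal -> ordered subgoals.
STRATEGIES: Dict[str, List[str]] = {
    "GlobalPresent": ["Present", "AddSmartakus", "DoLayout", "EvaluatePersonaNode"],
    "Present": ["Inform"],
    "Inform": ["TryToPresentTVOverview"],
    "TryToPresentTVOverview": ["ShowTVOverview"],
    "ShowTVOverview": ["SetLayoutData", "GenerateText"],
    "EvaluatePersonaNode": ["PersonaAction"],
    "PersonaAction": ["SendScreenCommand", "Speak"],
}


def expand(goal: str, depth: int = 0) -> List[str]:
    """Depth-first expansion of a goal into an indented plan outline."""
    lines = ["  " * depth + goal]
    for subgoal in STRATEGIES.get(goal, []):
        lines.extend(expand(subgoal, depth + 1))
    return lines


def primitive_actions(goal: str) -> List[str]:
    """Leaves of the plan tree, in left-to-right order."""
    subgoals = STRATEGIES.get(goal, [])
    if not subgoals:
        return [goal]
    return [action for sub in subgoals for action in primitive_actions(sub)]


plan = expand("GlobalPresent")
actions = primitive_actions("GlobalPresent")
print("\n".join(plan))
```

Keeping the strategies in a table separates what to present from how to realize it, so the same goal could be expanded differently for another output device by swapping the table.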
SmartKom’s Use of Semantic Web Technology
Three Layers of Annotations
Content: M3L (high)
Structure: XML (medium)
Layout: HTML (low)
Personalized Presentation
cf.: Dieter Fensel, James Hendler, Henry Lieberman, Wolfgang Wahlster (eds.): Spinning the Semantic Web, MIT Press, November 2002
Conclusions
Various types of unification, overlay, constraint processing, planning and ontological inferences are the fundamental processes involved in SmartKom’s modality fusion and fission components.
The key function of modality fusion is the reduction of the overall uncertainty and the mutual disambiguation of the various analysis results based on a three-tiered representation of multimodal discourse.
We have shown that a multimodal dialogue system must not only understand and represent the user’s input, but also its own multimodal output.
First International Conference on Perceptive & Multimodal User Interfaces (PMUI’03)
November 5-7, 2003
Delta Pinnacle Hotel, Vancouver, B.C., Canada
Conference Chair: Sharon Oviatt, Oregon Health & Science Univ., USA
Program Chairs: Wolfgang Wahlster, DFKI, Germany; Mark Maybury, MITRE, USA
PMUI’03 is sponsored by ACM, and will be co-located in Vancouver with ACM’s UIST’03. This meeting follows three successful Perceptive User Interface Workshops (with PUI’01 held in Florida) and three International Multimodal Interface Conferences initiated in Asia (with ICMI’02 held in Pittsburgh).