SmartKom: Fusion and Fission of Speech, Gestures, and Facial Expressions

German Research Center for Artificial Intelligence, DFKI GmbH

Stuhlsatzenhausweg 3, 66123 Saarbruecken, Germany

phone: (+49 681) 302-5252/4162, fax: (+49 681) 302-5341, e-mail: [email protected]

WWW: http://www.dfki.de/~wahlster

Wolfgang Wahlster

SmartKom: Fusion and Fission of Speech, Gestures, and Facial Expressions

International Workshop on Man-Machine Symbiotic Systems, Kyoto, 26 November 2002, p. 213

© W. Wahlster

SmartKom: Merging Various User Interface Paradigms

Spoken Dialogue

Graphical User Interfaces

Gestural Interaction

Multimodal Interaction

Facial Expressions

Biometrics

Symbolic and Subsymbolic Fusion of Multiple Modes

Speech Recognition

Gesture Recognition

Prosody Recognition

Facial Expression Recognition

Lip Reading

Subsymbolic Fusion
- Neural Networks
- Hidden Markov Models

Symbolic Fusion
- Graph Unification
- Bayesian Networks

Reference Resolution and Disambiguation

Modality-Free Semantic Representation

Outline of the Talk

1. Using all Human Senses for Symbiotic Man-Machine Interaction

2. SmartKom: Multimodal, Multilingual and Multidomain Dialogues

3. Modality Fusion in SmartKom

4. Multimodal Discourse Processing

5. Plan-based Modality Fission in SmartKom

6. Conclusions

SmartKom: A Highly Portable Multimodal Dialogue System

MM Dialogue Back-Bone

Application Layer
- Home: Consumer Electronics, EPG
- Public: Cinema, Phone, Fax, Mail, Biometrics
- Mobile: Car and Pedestrian Navigation

SmartKom-Mobile

SmartKom-Public

SmartKom-Home/Office

SmartKom: Intuitive Multimodal Interaction

The SmartKom Consortium:

Main Contractor: DFKI Saarbrücken
Scientific Director: W. Wahlster

Partners named on the slide: MediaInterface, European Media Lab, Univ. of Munich, Univ. of Stuttgart, Univ. of Erlangen

Partner sites: Saarbrücken, Aachen, Dresden, Berkeley, Stuttgart, Munich, Heidelberg, Ulm

Project Budget: € 25.5 million, funded by BMBF (Dr. Reuse) and industry
Project Duration: 4 years (September 1999 – September 2003)

SmartKom’s SDDP Interaction Metaphor

SDDP = Situated Delegation-oriented Dialogue Paradigm
Anthropomorphic Interface = Dialogue Partner

User: specifies goal, delegates task

User and Personalized Interaction Agent: cooperate on problems

Personalized Interaction Agent: asks questions, presents results

Web services: Service 1, Service 2, Service 3

See: Wahlster et al. 2001, Eurospeech

Multimodal Input and Output in the SmartKom System

Symbiotic Interaction with a Life-like Character

User: I’d like to reserve tickets for this performance.

Smartakus: Where would you like to sit?

User: I’d like these two seats.

User Input: Speech, Gesture, and Facial Expressions

Smartakus Output: Speech, Gesture, and Facial Expressions

Multimodal Input and Output in SmartKom
Fusion and Fission of Multiple Modalities

Input by the User: Speech + Gesture + Facial Expressions

Output by the Presentation Agent: Speech + Gesture + Facial Expressions

SmartKom’s Data Collection of Multimodal Dialogs

[Figure: recording setup. Labels: User, Side-view Camera, Face-tracking Camera with Microphone, Environmental Noise, Microphone Array, Screen, Projected Webpage, Loudspeaker, Bird’s-eye Camera, LCD Beamer, SIVIT-Camera]

Personalized Interaction with WebTVs via SmartKom (DFKI with Sony, Philips, Siemens)

User: Switch on the TV.

Smartakus: Okay, the TV is on.

User: Which channels are presenting the latest news right now?

Smartakus: CNN and NTV are presenting news.

User: Please record this news channel on a videotape.

Smartakus: Okay, the VCR is now recording the selected program.

Example: Multimodal Access to Electronic Program Guides for TV

Using Facial Expression Recognition for Affective Personalization

(1) Smartakus: Here you see the CNN program for tonight.

(2) User: That’s great.

(3) Smartakus: I’ll show you the program of another channel for tonight.

(2’) User: That’s great.

(3’) Smartakus: Which of these features do you want to see?

Processing ironic or sarcastic comments
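The two dialogue variants above differ only in the user's facial expression. A minimal sketch of that decision, assuming the facial analysis and the speech analysis each deliver a coarse polarity label (the function names, labels, and reaction strings are hypothetical, not SmartKom's actual interfaces):

```python
def fuse_affect(verbal_polarity: str, facial_expression: str) -> str:
    """Return the user's overall attitude: 'positive', 'negative' or 'neutral'."""
    if verbal_polarity == "positive" and facial_expression == "negative":
        # Words and face disagree: read the comment as ironic or sarcastic.
        return "negative"
    # Otherwise the verbal polarity stands (a simplifying assumption).
    return verbal_polarity


# Hypothetical mapping from attitude to the system reactions in the example.
REACTIONS = {
    "negative": "show the program of another channel",
    "positive": "ask which feature the user wants to see",
}


def choose_reaction(verbal_polarity: str, facial_expression: str) -> str:
    attitude = fuse_affect(verbal_polarity, facial_expression)
    return REACTIONS.get(attitude, "ask for clarification")


# "That's great." said with a negative versus a neutral face.
sarcastic = choose_reaction("positive", "negative")
sincere = choose_reaction("positive", "neutral")
```

The same utterance leads to turn (3) or turn (3’) depending only on the recognized facial expression.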

Recognizing Affect: A Negative Facial Expression of the User

[Figure labels: negative, neutral]

The SmartKom Demonstrator System

Camera for Gestural Input

Microphone

Multimodal Control of TV-Set

Multimodal Control of VCR/DVD Player

Camera for Facial Analysis

Combination of Speech and Gesture in SmartKom

This one I would like to see.

Where is it shown?

Multimodal Input and Output in SmartKom

Please show me where you would like to be seated.

Getting Driving and Walking Directions via SmartKom

User: I want to drive to Heidelberg.

Smartakus: Do you want to take the fastest or the shortest route?

User: The fastest.

Smartakus: Here you see a map with your route from Saarbrücken to Heidelberg.

SmartKom can be used for Multimodal Navigation Dialogues in a Car

Getting Driving and Walking Directions via SmartKom

Smartakus: You are now in Heidelberg. Here is a sightseeing map of Heidelberg.

User: I would like to know more about this church!

Smartakus: Here is some information about St. Peter's Church.

User: Could you please give me walking directions to this church?

Smartakus: In this map, I have highlighted your walking route.

SmartKom: Multimodal Dialogues with a Hybrid Navigation System

Salient Characteristics of SmartKom

• Seamless integration and mutual disambiguation of multimodal input and output on semantic and pragmatic levels

• Situated understanding of possibly imprecise, ambiguous, or incomplete multimodal input

• Context-sensitive interpretation of dialog interaction on the basis of dynamic discourse and context models

• Adaptive generation of coordinated, cohesive and coherent multimodal presentations

• Semi- or fully automatic completion of user-delegated tasks through the integration of information services

• Intuitive personification of the system through a presentation agent

The High-Level Control Flow of SmartKom

SmartKom’s Multimodal Dialogue Back-Bone

Legend: Communication Blackboards, Data Flow, Context Dependencies

Analyzers
• Speech
• Gestures
• Facial Expressions

Modality Fusion

Discourse Modeling

Action Planning

Modality Fission

External Services

Generators
• Speech
• Graphics
• Gestures

Dialogue Manager

Unification of Scored Hypothesis Graphs for Modality Fusion in SmartKom

Inputs to Modality Fusion:
• Word Hypothesis Graph with Acoustic Scores
• Clause and Sentence Boundaries with Prosodic Scores
• Scored Hypotheses about the User’s Emotional State
• Gesture Hypothesis Graph with Scores of Potential Reference Objects

Modality Fusion: Mutual Disambiguation, Reduction of Uncertainty

Intention Hypotheses Graph

Intention Recognizer: Selection of Most Likely Interpretation
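Mutual disambiguation can be illustrated with a small sketch: each modality delivers scored hypotheses, incompatible combinations are discarded, and the best joint reading is selected. The data classes, the type-compatibility test, and the multiplication of scores are assumptions made for illustration; the slides describe unification over hypothesis graphs rather than this simplified pairing.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SpeechHypothesis:
    text: str
    expected_type: str  # type of object the deictic phrase should refer to
    score: float        # recognizer confidence in [0, 1]


@dataclass(frozen=True)
class GestureHypothesis:
    object_id: str
    object_type: str
    score: float        # confidence that this object was pointed at


def fuse(speech, gestures):
    """Return (joint_score, speech, gesture) triples, best first.

    Pairs whose types do not match are dropped, so each modality
    filters the hypotheses of the other.
    """
    fused = [
        (s.score * g.score, s, g)
        for s, g in product(speech, gestures)
        if s.expected_type == g.object_type
    ]
    return sorted(fused, key=lambda triple: triple[0], reverse=True)


# Speech alone prefers the wrong reading; the gesture resolves it.
speech = [
    SpeechHypothesis("this seat I would like", "seat", 0.55),
    SpeechHypothesis("this one I would like to see", "movie", 0.45),
]
gestures = [
    GestureHypothesis("movie_3", "movie", 0.7),
    GestureHypothesis("button_ok", "button", 0.3),
]

ranked = fuse(speech, gestures)
best_score, best_speech, best_gesture = ranked[0]
```

Only the "movie" reading survives, although the speech recognizer alone ranked it second.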

SmartKom’s Computational Mechanisms for Modality Fusion and Fission

M3L: Modality-Free Semantic Representation

Mechanisms for Modality Fusion and Modality Fission:
• Ontological Inferences
• Unification
• Overlay Operations
• Planning
• Constraint Propagation

The Overlay Operation Versus the Unification Operation

• Nonmonotonic and noncommutative unification-like operation

• Inherit (non-conflicting) background information

• Two sources of conflicts:

– conflicting atomic values: overwrite background (old) with covering (new)

– type clash: assimilate background to the type of covering; recursion

[Figure panels: Unification, Overlay]

cf. J. Alexandersson, T. Becker 2001
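The contrast between the two operations can be sketched on flat typed feature structures. This is a simplified illustration, not the published algorithm: the type hierarchy, the dictionary layout, and the assimilation rule (keep only background features that the covering type also carries) are assumptions, and the recursion over nested structures mentioned above is omitted.

```python
# Hypothetical type hierarchy and feature declarations.
SUPERTYPE = {"AVEntertainment": None,
             "Broadcast": "AVEntertainment",
             "Performance": "AVEntertainment"}
FEATURES = {"AVEntertainment": {"title", "beginTime"},
            "Broadcast": {"channel"},
            "Performance": {"cinema"}}


def is_subtype(t, ancestor):
    while t is not None:
        if t == ancestor:
            return True
        t = SUPERTYPE[t]
    return False


def allowed_features(t):
    """Features declared by type t or inherited from its supertypes."""
    feats = set()
    while t is not None:
        feats |= FEATURES[t]
        t = SUPERTYPE[t]
    return feats


def unify(a, b):
    """Monotonic and commutative: fails (None) on any conflict."""
    if is_subtype(a["type"], b["type"]):
        result_type = a["type"]
    elif is_subtype(b["type"], a["type"]):
        result_type = b["type"]
    else:
        return None  # type clash
    feats = dict(a["feats"])
    for key, value in b["feats"].items():
        if key in feats and feats[key] != value:
            return None  # conflicting atomic values
        feats[key] = value
    return {"type": result_type, "feats": feats}


def overlay(covering, background):
    """Never fails: covering (new) wins, background (old) is inherited."""
    allowed = allowed_features(covering["type"])
    # Type clash: assimilate the background to the covering's type.
    feats = {k: v for k, v in background["feats"].items() if k in allowed}
    # Conflicting atomic values: covering overwrites background.
    feats.update(covering["feats"])
    return {"type": covering["type"], "feats": feats}


background = {"type": "Broadcast",
              "feats": {"beginTime": "tonight", "channel": "any"}}
covering = {"type": "Performance", "feats": {"cinema": "any"}}

failed = unify(covering, background)
result = overlay(covering, background)
```

Here unification fails on the type clash, while overlay yields a Performance that inherits `beginTime: tonight` from the background. Swapping the arguments gives a Broadcast instead, which shows that overlay is noncommutative.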

Overlay Operations Using the Discourse Model

Augmentation and Validation

– compare with a number of previous discourse states:

• fill in consistent information
• compute a score

– for each hypothesis-background pair:

– Overlay (covering, background)

Covering: Intention Hypothesis Lattice

Background: Selected Augmented Hypothesis Sequence
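The augmentation and validation loop can be sketched as follows: every intention hypothesis is overlaid on every recent discourse state, each result is scored, and the best pair is selected. The flat dictionaries and the scoring formula (features kept minus conflicts, normalized) are assumptions for illustration only; the slides do not specify how the score is computed.

```python
def overlay_with_score(covering, background):
    """Overlay flat feature dicts and score how well they fit together."""
    result = dict(background)
    conflicts = 0
    for key, value in covering.items():
        if key in background and background[key] != value:
            conflicts += 1          # covering overwrites the old value
        result[key] = value
    inherited = sum(1 for key in background if key not in covering)
    kept = len(covering) + inherited
    total = kept + conflicts
    score = (kept - conflicts) / total if total else 0.0
    return result, score


def best_augmentation(hypotheses, discourse_states):
    """Try every hypothesis-background pair and return the best one."""
    candidates = [
        overlay_with_score(hypothesis, state)
        for hypothesis in hypotheses
        for state in discourse_states
    ]
    return max(candidates, key=lambda candidate: candidate[1])


# Hypothetical discourse history and intention hypotheses.
discourse_states = [
    {"domain": "tv", "beginTime": "tonight", "channel": "CNN"},
    {"domain": "cinema", "city": "Heidelberg"},
]
hypotheses = [
    {"domain": "cinema", "genre": "comedy"},
    {"domain": "tv", "channel": "NTV"},
]

augmented, score = best_augmentation(hypotheses, discourse_states)
```

The cinema hypothesis fits the second discourse state without conflicts, so it is selected and augmented with the city from that state.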

An Example of the Overlay Operation

U: What films are shown on TV tonight? .... U: I’d rather go to the movies.

Type hierarchy: AVEntertainment [title: String, beginTime: Time, ...] with the subtypes Performance [cinema: Cinema, ...] and Broadcast [channel: Channel, ...]

Background ("Films on TV tonight"): Broadcast [beginTime: Time "tonight", channel: any, ...]

Covering ("Go to the movies"): Performance [beginTime: Time, cinema: any, ...]

Generalisation and Specialisation: Performance [beginTime: Time "tonight", ...]

Result: Performance [beginTime: Time "tonight", cinema: any, ...]

SmartKom’s Three-Tiered Discourse Model

System: This [] is a list of films showing in Heidelberg.

User: Please reserve a ticket for the first one.

Modality Layer: LO2, LO3, ..., LO4, LO5, LO6, GO1, VO1

Discourse Layer: DO1, DO2, DO3, DO9, DO10, DO11, DO12

Domain Layer: DomainObject1, DomainObject2, ...

[Figure: word labels "heidelberg", "list", "reserve", "ticket" and "first" attached to objects across the three layers]

DO = Discourse Object, LO = Linguistic Object, GO = Gestural Object, VO = Visual Object

cf. M. Löckelt et al. 2002, N. Pfleger 2002
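A minimal data-structure sketch of the three tiers, showing how "the first one" could be resolved through a visual object back to a domain object. Class names, fields, the film titles, and the resolution rule are hypothetical and only illustrate the layering described above.

```python
from dataclasses import dataclass, field


@dataclass
class DomainObject:              # domain layer
    name: str
    items: list = field(default_factory=list)


@dataclass
class DiscourseObject:           # discourse layer (DO)
    label: str
    domain: DomainObject


@dataclass
class ModalityObject:            # modality layer (LO, GO or VO)
    kind: str
    surface: str
    discourse: DiscourseObject


def resolve_ordinal(index, modality_layer):
    """Resolve 'the first one' against the most recently shown list."""
    for obj in reversed(modality_layer):
        if obj.kind == "VO" and obj.discourse.domain.items:
            return obj.discourse.domain.items[index]
    return None


# System turn: "This [] is a list of films showing in Heidelberg."
films = DomainObject("film_list", ["Film A", "Film B", "Film C"])
list_do = DiscourseObject("list", films)
modality_layer = [
    ModalityObject("LO", "a list of films", list_do),
    ModalityObject("GO", "pointing at the list", list_do),
    ModalityObject("VO", "film list on screen", list_do),
]

# User turn: "Please reserve a ticket for the first one."
referent = resolve_ordinal(0, modality_layer)
```

Because the linguistic, gestural, and visual objects all point to the same discourse object, a reference made in one modality can be resolved against content presented in another.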

Smartakus is a Self-Animated Interface Agent

Smartakus uses body language to notify the user that it is waiting for his input, that it is listening to him, that it has problems understanding his input, or that it is trying hard to find an answer to his question.

System State: Idle Time, Navigation, Presentation

M3L Representation of the Multimodal Discourse Context
Blackboard with Presentation Context of the Previous Dialogue Turn

<?xml version="1.0"?>
<presentationContent>
  [...]
  <abstractPresentationContent>
    <movieTheater structId="pid1234">
      <entityKey> cinema_17a </entityKey>
      <name> Europa </name>
      <geoCoordinate>
        <x> 225 </x> <y> 230 </y>
      </geoCoordinate>
    </movieTheater>
  </abstractPresentationContent>
  [...]
  <panelElement>
    <map structId="PM23">
      <boundingShape>
        <leftTop> <x> 0.5542 </x> <y> 0.1950 </y> </leftTop>
        <rightBottom> <x> 0.9892 </x> <y> 0.7068 </y> </rightBottom>
      </boundingShape>
      <contentReference> pid1234 </contentReference>
    </map>
  </panelElement>
  [...]
</presentationContent>
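One plausible use of this presentation context is resolving a pointing gesture: the screen position is matched against the bounding shapes, and the content reference leads back to the abstract object. The sketch below uses Python's standard `xml.etree.ElementTree` on a cleaned copy of the excerpt (elided parts removed); the function `object_at` and the assumption that gesture coordinates share the bounding shape's normalized coordinate system are hypothetical.

```python
import xml.etree.ElementTree as ET

M3L = """<presentationContent>
  <abstractPresentationContent>
    <movieTheater structId="pid1234">
      <entityKey> cinema_17a </entityKey>
      <name> Europa </name>
      <geoCoordinate> <x> 225 </x> <y> 230 </y> </geoCoordinate>
    </movieTheater>
  </abstractPresentationContent>
  <panelElement>
    <map structId="PM23">
      <boundingShape>
        <leftTop> <x> 0.5542 </x> <y> 0.1950 </y> </leftTop>
        <rightBottom> <x> 0.9892 </x> <y> 0.7068 </y> </rightBottom>
      </boundingShape>
      <contentReference> pid1234 </contentReference>
    </map>
  </panelElement>
</presentationContent>"""


def object_at(root, x, y):
    """Return the abstract content element displayed at screen point (x, y)."""
    by_id = {el.get("structId"): el for el in root.iter() if el.get("structId")}
    for panel in root.iter("panelElement"):
        for element in panel:
            box = element.find("boundingShape")
            ref = element.findtext("contentReference")
            if box is None or ref is None:
                continue
            left = float(box.findtext("leftTop/x"))
            top = float(box.findtext("leftTop/y"))
            right = float(box.findtext("rightBottom/x"))
            bottom = float(box.findtext("rightBottom/y"))
            if left <= x <= right and top <= y <= bottom:
                return by_id.get(ref.strip())
    return None


root = ET.fromstring(M3L)
hit = object_at(root, 0.7, 0.5)
name = hit.findtext("name").strip() if hit is not None else None
miss = object_at(root, 0.1, 0.1)
```

A gesture inside the map's bounding shape resolves to the movie theater "Europa"; a gesture outside any bounding shape resolves to nothing.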

M3L Specification of a Presentation Task

<presentationTask>
  <subTask>
    <presentationGoal>
      <inform> ... </inform>
      <abstractPresentationContent>
        ...
        <result>
          <broadcast id="bc1">
            <channel> <name>EuroSport</name> </channel>
            <beginTime> <time> <at>2000-12-05T14:00:00</at> </time> </beginTime>
            <endTime> <time> <at>2000-12-05T15:00:00</at> </time> </endTime>
            <avMedium>
              <title>Sport News</title>
              <avType>sport</avType>
              ...
      </abstractPresentationContent>
      <interactionMode>leanForward</interactionMode>
      <goalID>APGOAL3000</goalID>
      <source>generatorAction</source>
      <realizationType>GraphicsAndSpeech</realizationType>

SmartKom’s Presentation Planner

The Presentation Planner generates a Presentation Plan by applying a set of Presentation Strategies to the Presentation Goal.

[Figure: presentation plan tree. Top node: GlobalPresent. Second level: Present, AddSmartakus, DoLayout, EvaluatePersonaNode. Further nodes: Inform, TryToPresentTVOverview, ShowTVOverview, SetLayoutData, PersonaAction, SendScreenCommand, GenerateText, Speak. Annotated regions: Generation of Layout, Smartakus Actions]

cf. J. Müller, P. Poller, V. Tschernomas 2002
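Applying presentation strategies to a goal can be sketched as hierarchical decomposition: a strategy rewrites a goal into subgoals until only primitive actions remain. The node names are taken from the plan tree on this slide, but the parent-child assignments in `STRATEGIES` below are hypothetical, since the extracted slide does not preserve the tree's edges.

```python
# Hypothetical strategies: each maps a goal to an ordered list of subgoals.
STRATEGIES = {
    "GlobalPresent": ["Present", "AddSmartakus", "DoLayout", "EvaluatePersonaNode"],
    "Present": ["Inform"],
    "Inform": ["TryToPresentTVOverview"],
    "TryToPresentTVOverview": ["ShowTVOverview"],
    "ShowTVOverview": ["SetLayoutData", "GenerateText"],
    "EvaluatePersonaNode": ["PersonaAction"],
    "PersonaAction": ["SendScreenCommand", "Speak"],
}


def expand(goal):
    """Depth-first expansion of a goal into a sequence of primitive actions."""
    subgoals = STRATEGIES.get(goal)
    if subgoals is None:
        return [goal]  # no strategy applies: treat the goal as primitive
    plan = []
    for subgoal in subgoals:
        plan.extend(expand(subgoal))
    return plan


plan = expand("GlobalPresent")
```

A real planner would also check each strategy's applicability conditions and backtrack when one fails; this sketch always takes the single strategy listed for a goal.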

SmartKom’s Use of Semantic Web Technology

Three Layers of Annotations for Personalized Presentation

Content: M3L (high)
Structure: XML (medium)
Layout: HTML (low)

cf.: Dieter Fensel, James Hendler, Henry Lieberman, Wolfgang Wahlster (eds.), Spinning the Semantic Web, MIT Press, November 2002

Conclusions

• Various types of unification, overlay, constraint processing, planning and ontological inferences are the fundamental processes involved in SmartKom’s modality fusion and fission components.

• The key function of modality fusion is the reduction of the overall uncertainty and the mutual disambiguation of the various analysis results based on a three-tiered representation of multimodal discourse.

• We have shown that a multimodal dialogue system must not only understand and represent the user’s input, but also its own multimodal output.

First International Conference on Perceptive & Multimodal User Interfaces (PMUI’03)

November 5-7th, 2003
Delta Pinnacle Hotel, Vancouver, B.C., Canada
Conference Chair: Sharon Oviatt, Oregon Health & Science Univ., USA
Program Chairs: Wolfgang Wahlster, DFKI, Germany; Mark Maybury, MITRE, USA

PMUI’03 is sponsored by ACM, and will be co-located in Vancouver with ACM’s UIST’03. This meeting follows three successful Perceptive User Interface Workshops (with PUI’01 held in Florida) and three International Multimodal Interface Conferences initiated in Asia (with ICMI’02 held in Pittsburgh).
