SmartKom: Fusion and Fission of Speech, Gestures, and Facial Expressions

Wolfgang Wahlster, German Research Center for Artificial Intelligence, DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbruecken, Germany. Phone: (+49 681) 302-5252/4162; fax: (+49 681) 302-5341; e-mail: [email protected]; WWW: http://www.dfki.de/~wahlster. Presented at the International Workshop on Man-Machine Symbiotic Systems, Kyoto, 26 November 2002, p. 213.


Page 1: SmartKom: Fusion and Fission of Speech, Gestures, and Facial Expressions

Wolfgang Wahlster

German Research Center for Artificial Intelligence, DFKI GmbH
Stuhlsatzenhausweg 3, 66123 Saarbruecken, Germany

phone: (+49 681) 302-5252/4162
fax: (+49 681) 302-5341
e-mail: [email protected]
WWW: http://www.dfki.de/~wahlster

International Workshop on Man-Machine Symbiotic Systems, Kyoto, 26 November 2002, p. 213

Page 2: SmartKom: Merging Various User Interface Paradigms

© W. Wahlster

Diagram labels:

• Spoken Dialogue
• Graphical User Interfaces
• Gestural Interaction
• Multimodal Interaction
• Facial Expressions
• Biometrics

Page 3: Symbolic and Subsymbolic Fusion of Multiple Modes

Recognizers: Speech Recognition, Gesture Recognition, Prosody Recognition, Facial Expression Recognition, Lip Reading

Subsymbolic Fusion
- Neural Networks
- Hidden Markov Models

Symbolic Fusion
- Graph Unification
- Bayesian Networks

Reference Resolution and Disambiguation

Modality-Free Semantic Representation

Page 4: Outline of the Talk

1. Using all Human Senses for Symbiotic Man-Machine Interaction
2. SmartKom: Multimodal, Multilingual and Multidomain Dialogues
3. Modality Fusion in SmartKom
4. Multimodal Discourse Processing
5. Plan-based Modality Fission in SmartKom
6. Conclusions

Page 5: SmartKom: A Highly Portable Multimodal Dialogue System

MM Dialogue Back-Bone

Application Layer:

• SmartKom-Mobile: Car and Pedestrian Navigation
• SmartKom-Public: Cinema, Phone, Fax, Mail, Biometrics
• SmartKom-Home/Office: Consumer Electronics, EPG

Page 6: SmartKom: Intuitive Multimodal Interaction

The SmartKom Consortium:

• Main Contractor: DFKI Saarbrücken
• Scientific Director: W. Wahlster
• Partners named on the slide: MediaInterface, European Media Lab, Univ. of Munich, Univ. of Stuttgart, Univ. of Erlangen
• Sites shown on the map: Saarbrücken, Aachen, Dresden, Berkeley, Stuttgart, Munich, Heidelberg, Ulm

Project Budget: € 25.5 million, funded by BMBF (Dr. Reuse) and industry
Project Duration: 4 years (September 1999 – September 2003)

Page 7: SmartKom’s SDDP Interaction Metaphor

SDDP = Situated Delegation-oriented Dialogue Paradigm
Anthropomorphic Interface = Dialogue Partner

• User: specifies goal, delegates task
• User and Personalized Interaction Agent: cooperate on problems
• Personalized Interaction Agent: asks questions, presents results
• Web services: Service 1, Service 2, Service 3

See: Wahlster et al. 2001, Eurospeech

Page 8: Multimodal Input and Output in the SmartKom System

Where would you like to sit?

Page 9: Symbiotic Interaction with a Life-like Character

User: I’d like to reserve tickets for this performance.
Smartakus: Where would you like to sit?
User: I’d like these two seats.

User Input: Speech, Gesture, and Facial Expressions
Smartakus Output: Speech, Gesture, and Facial Expressions

Page 10: Multimodal Input and Output in SmartKom: Fusion and Fission of Multiple Modalities

Input by the User: Speech + Gesture + Facial Expressions
Output by the Presentation Agent: Speech + Gesture + Facial Expressions

Page 11: SmartKom’s Data Collection of Multimodal Dialogs

Recording setup (diagram labels): User, Side-view Camera, Bird’s-eye Camera, Face-tracking Camera with Microphone, Microphone Array, Loudspeaker, Environmental Noise, Screen, Projected Webpage, LCD Beamer, SIVIT-Camera

Page 12: Personalized Interaction with WebTVs via SmartKom (DFKI with Sony, Philips, Siemens)

Example: Multimodal Access to Electronic Program Guides for TV

User: Switch on the TV.
Smartakus: Okay, the TV is on.
User: Which channels are presenting the latest news right now?
Smartakus: CNN and NTV are presenting news.
User: Please record this news channel on a videotape.
Smartakus: Okay, the VCR is now recording the selected program.

Page 13: Using Facial Expression Recognition for Affective Personalization

Processing ironic or sarcastic comments

(1) Smartakus: Here you see the CNN program for tonight.
(2) User: That’s great.
(3) Smartakus: I’ll show you the program of another channel for tonight.

(2’) User: That’s great.
(3’) Smartakus: Which of these features do you want to see?
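The two continuations of this dialogue can be read as the same words receiving opposite interpretations depending on the recognized facial expression. The following is a minimal sketch of that idea, not SmartKom's actual algorithm: the class, the function names, the label sets, and the rule that a negative face overrides positive wording are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Reaction:
    text_polarity: str      # polarity of the words alone: "positive" or "negative"
    facial_expression: str  # assumed recognizer label: "positive", "neutral" or "negative"


def effective_attitude(reaction: Reaction) -> str:
    """Assumed rule: a negative face turns positive wording into irony."""
    if reaction.text_polarity == "positive" and reaction.facial_expression == "negative":
        return "negative"
    return reaction.text_polarity


def next_system_move(reaction: Reaction) -> str:
    """Pick a (hypothetical) dialogue move matching turns (3) and (3') on the slide."""
    if effective_attitude(reaction) == "negative":
        return "show_other_channel"   # turn (3): offer another channel
    return "ask_for_selection"        # turn (3'): continue with the current listing


sarcastic = Reaction(text_polarity="positive", facial_expression="negative")
sincere = Reaction(text_polarity="positive", facial_expression="neutral")
print(next_system_move(sarcastic), next_system_move(sincere))
```

A real system would work with scored hypotheses rather than hard labels; the hard rule is used here only to keep the example short.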

Page 14: Recognizing Affect: A Negative Facial Expression of the User

Image labels: negative, neutral

Page 15: The SmartKom Demonstrator System

• Camera for Gestural Input
• Camera for Facial Analysis
• Microphone
• Multimodal Control of TV-Set
• Multimodal Control of VCR/DVD Player

Page 16: Combination of Speech and Gesture in SmartKom

This one I would like to see. Where is it shown?

Page 17: Multimodal Input and Output in SmartKom

Please show me where you would like to be seated.

Page 18: Getting Driving and Walking Directions via SmartKom

SmartKom can be used for Multimodal Navigation Dialogues in a Car

User: I want to drive to Heidelberg.
Smartakus: Do you want to take the fastest or the shortest route?
User: The fastest.
Smartakus: Here you see a map with your route from Saarbrücken to Heidelberg.

Page 19: Getting Driving and Walking Directions via SmartKom

Smartakus: You are now in Heidelberg. Here is a sightseeing map of Heidelberg.
User: I would like to know more about this church!
Smartakus: Here is some information about St. Peter's Church.
User: Could you please give me walking directions to this church?
Smartakus: In this map, I have highlighted your walking route.

Page 20: SmartKom: Multimodal Dialogues with a Hybrid Navigation System

Page 21: Salient Characteristics of SmartKom

• Seamless integration and mutual disambiguation of multimodal input and output on semantic and pragmatic levels
• Situated understanding of possibly imprecise, ambiguous, or incomplete multimodal input
• Context-sensitive interpretation of dialog interaction on the basis of dynamic discourse and context models
• Adaptive generation of coordinated, cohesive and coherent multimodal presentations
• Semi- or fully automatic completion of user-delegated tasks through the integration of information services
• Intuitive personification of the system through a presentation agent

Page 22: The High-Level Control Flow of SmartKom

Page 23: SmartKom’s Multimodal Dialogue Back-Bone

Legend: Communication Blackboards, Data Flow, Context Dependencies

Components:

• Analyzers: Speech, Gestures, Facial Expressions
• Modality Fusion
• Discourse Modeling
• Action Planning
• External Services
• Modality Fission
• Generators: Speech, Graphics, Gestures
• Dialogue Manager

Page 24: Unification of Scored Hypothesis Graphs for Modality Fusion in SmartKom

Inputs to Modality Fusion:

• Word Hypothesis Graph with Acoustic Scores
• Clause and Sentence Boundaries with Prosodic Scores
• Scored Hypotheses about the User’s Emotional State
• Gesture Hypothesis Graph with Scores of Potential Reference Objects

Modality Fusion: Mutual Disambiguation, Reduction of Uncertainty

Output: Intention Hypotheses Graph

Intention Recognizer: Selection of Most Likely Interpretation
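To illustrate how scored hypotheses from two modalities can disambiguate each other, here is a small sketch under strong simplifying assumptions: hypotheses are flat lists rather than graphs, compatibility is a simple type match, and scores are multiplied as if they were independent probabilities. None of the class or function names come from SmartKom.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SpeechHypothesis:
    text: str
    expected_type: str  # type of object the deictic phrase needs (assumed annotation)
    score: float


@dataclass(frozen=True)
class GestureHypothesis:
    object_id: str
    object_type: str
    score: float


def fuse(speech, gestures):
    """Return compatible (speech, gesture, combined score) triples, best first."""
    fused = []
    for s, g in product(speech, gestures):
        if s.expected_type != g.object_type:
            continue  # incompatible readings rule each other out
        fused.append((s, g, s.score * g.score))
    return sorted(fused, key=lambda triple: triple[2], reverse=True)


# The recognizer slightly prefers the wrong words; the gesture corrects it.
speech = [
    SpeechHypothesis("reserve this sheet", "document", 0.55),
    SpeechHypothesis("reserve this seat", "seat", 0.45),
]
gestures = [
    GestureHypothesis("seat_12", "seat", 0.7),
    GestureHypothesis("exit_button", "button", 0.3),
]
best_speech, best_gesture, _ = fuse(speech, gestures)[0]
print(best_speech.text, "->", best_gesture.object_id)
```

The pointing gesture removes the acoustically preferred but incompatible reading, and the speech hypothesis in turn removes the button as a referent, which is the sense of "mutual disambiguation" on the slide.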

Page 25: SmartKom’s Computational Mechanisms for Modality Fusion and Fission

M3L: Modality-Free Semantic Representation

Mechanisms: Ontological Inferences, Unification, Overlay Operations, Planning, Constraint Propagation

Components: Modality Fusion, Modality Fission

Page 26: The Overlay Operation Versus the Unification Operation

• Nonmonotonic and noncommutative unification-like operation
• Inherit (non-conflicting) background information
• Two sources of conflicts:
  – conflicting atomic values: overwrite background (old) with covering (new)
  – type clash: assimilate background to the type of covering; recursion

Diagram labels: Unification, Overlay

cf. J. Alexandersson, T. Becker 2001
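The contrast between the two operations can be sketched on plain nested dictionaries. This is a deliberately simplified illustration: it ignores types (and therefore type clashes) and only shows the treatment of conflicting atomic values. The function names and the example values are assumptions.

```python
class UnificationFailure(Exception):
    """Raised when two structures carry conflicting atomic values."""


def unify(a: dict, b: dict) -> dict:
    """Monotonic merge: succeeds only if a and b never disagree."""
    result = dict(a)
    for key, b_val in b.items():
        if key not in result:
            result[key] = b_val
        elif isinstance(result[key], dict) and isinstance(b_val, dict):
            result[key] = unify(result[key], b_val)
        elif result[key] != b_val:
            raise UnificationFailure(f"conflict on {key!r}")
    return result


def overlay(covering: dict, background: dict) -> dict:
    """Nonmonotonic merge: covering (new) wins, background (old) fills the gaps."""
    result = dict(background)
    for key, c_val in covering.items():
        b_val = background.get(key)
        if isinstance(c_val, dict) and isinstance(b_val, dict):
            result[key] = overlay(c_val, b_val)  # recurse into substructures
        else:
            result[key] = c_val                  # overwrite old with new
    return result


old = {"channel": "CNN", "beginTime": {"day": "today", "hour": 20}}
new = {"channel": "NTV"}

print(overlay(new, old))  # new channel, inherited beginTime
print(overlay(old, new))  # argument order matters: the channel stays CNN
```

Calling `unify(old, new)` raises `UnificationFailure` because the two channel values conflict, whereas overlay resolves the conflict in favour of the covering.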

Page 27: Overlay Operations Using the Discourse Model

Augmentation and Validation

– compare with a number of previous discourse states:
  • fill in consistent information
  • compute a score
– for each hypothesis-background pair:
  – Overlay (covering, background)

Covering: Intention Hypothesis Lattice
Background: Selected Augmented Hypothesis Sequence
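The augmentation-and-validation loop can be sketched as follows. The slide does not give the scoring formula, so the score used here (agreeing features minus conflicting features) is purely illustrative, as are the flat dictionary representation, the function names, and the `max_depth` parameter.

```python
def overlay_flat(covering: dict, background: dict):
    """Overlay two flat structures and return (result, illustrative score)."""
    result = {**background, **covering}  # covering wins, background fills gaps
    shared = [key for key in covering if key in background]
    agreeing = sum(covering[key] == background[key] for key in shared)
    conflicts = len(shared) - agreeing
    return result, agreeing - conflicts


def augment(hypotheses, history, max_depth=3):
    """Try each hypothesis against recent discourse states; keep the best pair."""
    best = None
    for hypothesis in hypotheses:
        for background in history[-max_depth:]:
            result, score = overlay_flat(hypothesis, background)
            if best is None or score > best[0]:
                best = (score, result)
    return best


# Discourse history, oldest state first (values are made up for the example).
history = [
    {"type": "Performance", "cinema": "Europa"},
    {"type": "Broadcast", "channel": "CNN", "beginTime": "tonight"},
]
hypotheses = [{"type": "Broadcast", "channel": "NTV"}]

print(augment(hypotheses, history))
```

In this example the broadcast state is the better background, so the hypothesis inherits `beginTime` from it rather than `cinema` from the older performance state.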

Page 28: An Example of the Overlay Operation

U: What films are shown on TV tonight? .... U: I’d rather go to the movies.

Type hierarchy shown on the slide:

  AVEntertainment [title: String, beginTime: Time, ...]
    Performance [cinema: Cinema, ...]
    Broadcast [channel: Channel, ...]

Background ("Films on TV tonight"):

  Broadcast [beginTime: Time "tonight", channel: any, ...]

Covering ("Go to the movies"):

  Performance [beginTime: Time, cinema: any, ...]

Generalisation and Specialisation of the background:

  Performance [beginTime: Time "tonight", ...]

Result of the overlay:

  Performance [beginTime: Time "tonight", cinema: any, ...]
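The example on this slide can be replayed with a small typed overlay. This is a sketch under assumptions: the type hierarchy and feature tables below are a reading of the slide, the `TFS` class and helper names are hypothetical, and the hierarchy is assumed to have a single root so that a common supertype always exists.

```python
from dataclasses import dataclass, field

# Assumed hierarchy (child -> parent) and the features each type introduces.
PARENT = {"AVEntertainment": None, "Broadcast": "AVEntertainment", "Performance": "AVEntertainment"}
FEATURES = {
    "AVEntertainment": {"title", "beginTime"},
    "Broadcast": {"channel"},
    "Performance": {"cinema"},
}


@dataclass
class TFS:
    """A minimal typed feature structure."""
    type: str
    feats: dict = field(default_factory=dict)


def ancestors(type_name):
    """The type itself followed by its supertypes up to the root."""
    chain = []
    while type_name is not None:
        chain.append(type_name)
        type_name = PARENT[type_name]
    return chain


def allowed(type_name):
    """Features a type may carry: its own plus everything inherited."""
    return set().union(*(FEATURES[a] for a in ancestors(type_name)))


def assimilate(background, covering_type):
    """Generalise the background to the common supertype, then specialise it."""
    common = next(a for a in ancestors(background.type) if a in ancestors(covering_type))
    kept = {k: v for k, v in background.feats.items() if k in allowed(common)}
    return TFS(covering_type, kept)


def overlay(covering, background):
    if covering.type in ancestors(background.type):
        result_type = background.type   # background is at least as specific
    elif background.type in ancestors(covering.type):
        result_type = covering.type     # covering is more specific
    else:                               # type clash
        background = assimilate(background, covering.type)
        result_type = covering.type
    feats = dict(background.feats)
    for key, value in covering.feats.items():
        old = feats.get(key)
        if isinstance(value, TFS) and isinstance(old, TFS):
            feats[key] = overlay(value, old)
        else:
            feats[key] = value
    return TFS(result_type, feats)


background = TFS("Broadcast", {"beginTime": "tonight", "channel": "any"})
covering = TFS("Performance", {"cinema": "any"})
print(overlay(covering, background))
```

The channel is dropped because it is not defined for the common supertype, while the time restriction "tonight" survives the switch from television to cinema.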

Page 29: SmartKom’s Three-Tiered Discourse Model

System: This [] is a list of films showing in Heidelberg.
User: Please reserve a ticket for the first one.

Modality Layer: LO2, LO3, . . ., VO1, LO4, LO5, LO6, GO1, . . .
Discourse Layer: DO1, DO2, DO3, DO9, DO10, DO11, DO12
Domain Layer: DomainObject1, DomainObject2, . . .

Word labels in the diagram: heidelberg, list, reserve, ticket, first

DO = Discourse Object, LO = Linguistic Object, GO = Gestural Object, VO = Visual Object

cf. M. Löckelt et al. 2002, N. Pfleger 2002
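A compact way to picture the three tiers is as linked records: modality objects point to discourse objects, which point to domain objects. The sketch below uses that picture to resolve "the first one" against the most recently presented list. All class names, field names, and the film placeholders are assumptions for illustration; the actual SmartKom data structures are not shown in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class DomainObject:        # domain layer
    description: str


@dataclass
class DiscourseObject:     # discourse layer
    name: str
    domain: DomainObject
    members: list = field(default_factory=list)  # filled for collections such as lists


@dataclass
class ModalityObject:      # modality layer: linguistic (LO), visual (VO), gestural (GO)
    kind: str
    content: str
    refers_to: DiscourseObject


def resolve_ordinal(mentions, position):
    """Resolve an ordinal such as 'the first one' against the latest collection."""
    for mention in reversed(mentions):
        if mention.refers_to.members:
            return mention.refers_to.members[position]
    return None


# System turn: "This [] is a list of films showing in Heidelberg."
films = [
    DiscourseObject("DO2", DomainObject("film A")),  # placeholder films
    DiscourseObject("DO3", DomainObject("film B")),
]
film_list = DiscourseObject("DO1", DomainObject("films showing in Heidelberg"), members=films)
mentions = [
    ModalityObject("LO", "list", film_list),
    ModalityObject("VO", "on-screen list", film_list),
]

# User turn: "Please reserve a ticket for the first one."
referent = resolve_ordinal(mentions, 0)
print(referent.name, referent.domain.description)
```

Because the spoken word "list" and the displayed list both point to the same discourse object, the ordinal can be resolved regardless of which modality introduced the list.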

Page 30: The High-Level Control Flow of SmartKom

Page 31: Smartakus is a Self-Animated Interface Agent

Smartakus uses body language to notify the user that it is waiting for his input, that it is listening to him, that it has problems understanding his input, or that it is trying hard to find an answer to his question.

Categories shown: Idle Time, Navigation, Presentation, System State

Page 32: Some Complex Behavioural Patterns of the Interaction Agent Smartakus

Page 33: M3L Representation of the Multimodal Discourse Context

Blackboard with Presentation Context of the Previous Dialogue Turn

<?xml version="1.0"?>
<presentationContent>
  [...]
  <abstractPresentationContent>
    <movieTheater structId="pid1234">
      <entityKey> cinema_17a </entityKey>
      <name> Europa </name>
      <geoCoordinate>
        <x> 225 </x> <y> 230 </y>
      </geoCoordinate>
    </movieTheater>
  </abstractPresentationContent>
  [...]
  <panelElement>
    <map structId="PM23">
      <boundingShape>
        <leftTop> <x> 0.5542 </x> <y> 0.1950 </y> </leftTop>
        <rightBottom> <x> 0.9892 </x> <y> 0.7068 </y> </rightBottom>
      </boundingShape>
      <contentReference> pid1234 </contentReference>
    </map>
  </panelElement>
  [...]
</presentationContent>
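One plausible use of this presentation context is to map a screen position, for example from a pointing gesture, back to the content object displayed there via `contentReference` and `structId`. The sketch below does this with Python's standard `xml.etree.ElementTree`; the elided `[...]` parts are left out so the fragment parses, the coordinates are assumed to be normalised screen positions, and `object_at` is a hypothetical helper, not part of SmartKom.

```python
import xml.etree.ElementTree as ET

M3L = """<presentationContent>
  <abstractPresentationContent>
    <movieTheater structId="pid1234">
      <entityKey> cinema_17a </entityKey>
      <name> Europa </name>
      <geoCoordinate> <x> 225 </x> <y> 230 </y> </geoCoordinate>
    </movieTheater>
  </abstractPresentationContent>
  <panelElement>
    <map structId="PM23">
      <boundingShape>
        <leftTop> <x> 0.5542 </x> <y> 0.1950 </y> </leftTop>
        <rightBottom> <x> 0.9892 </x> <y> 0.7068 </y> </rightBottom>
      </boundingShape>
      <contentReference> pid1234 </contentReference>
    </map>
  </panelElement>
</presentationContent>"""


def object_at(root, x, y):
    """Return the name of the content object whose screen region contains (x, y)."""
    by_id = {el.get("structId"): el for el in root.iter() if el.get("structId")}
    for panel in root.iter("panelElement"):
        for region in panel:
            shape = region.find("boundingShape")
            ref = region.findtext("contentReference")
            if shape is None or ref is None:
                continue
            left, top = (float(shape.findtext(f"leftTop/{axis}")) for axis in "xy")
            right, bottom = (float(shape.findtext(f"rightBottom/{axis}")) for axis in "xy")
            if left <= x <= right and top <= y <= bottom:
                return by_id[ref.strip()].findtext("name").strip()
    return None


root = ET.fromstring(M3L)
print(object_at(root, 0.7, 0.4))  # a point inside the map region
```

A point outside every bounding shape yields `None`, which a dialogue system could treat as a gesture without a referent.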

Page 34: M3L Specification of a Presentation Task

<presentationTask>
  <subTask>
    <presentationGoal>
      <inform> ... </inform>
      <abstractPresentationContent>
        ...
        <result>
          <broadcast id="bc1">
            <channel> <name>EuroSport</name> </channel>
            <beginTime> <time> <at>2000-12-05T14:00:00</at> </time> </beginTime>
            <endTime> <time> <at>2000-12-05T15:00:00</at> </time> </endTime>
            <avMedium>
              <title>Sport News</title>
              <avType>sport</avType>
              ...
      </abstractPresentationContent>
      <interactionMode>leanForward</interactionMode>
      <goalID>APGOAL3000</goalID>
      <source>generatorAction</source>
      <realizationType>GraphicsAndSpeech</realizationType>

Page 35: SmartKom’s Presentation Planner

The Presentation Planner generates a Presentation Plan by applying a set of Presentation Strategies to the Presentation Goal.

Plan nodes shown in the diagram: GlobalPresent, Present, AddSmartakus, DoLayout, EvaluatePersonaNode, Inform, TryToPresentTVOverview, ShowTVOverview, SetLayoutData, PersonaAction, SendScreenCommand, GenerateText, Speak, ...

Annotations: Generation of Layout, Smartakus Actions

cf. J. Müller, P. Poller, V. Tschernomas 2002
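The idea of applying presentation strategies to a goal can be sketched as a simple top-down expansion. The node names below are taken from the slide, but the decomposition in the `STRATEGIES` table is an assumption made for illustration, and a real planner would also check applicability conditions and backtrack when a strategy fails.

```python
# Hypothetical strategy table: goal -> ordered subgoals.
STRATEGIES = {
    "GlobalPresent": ["Present", "AddSmartakus", "DoLayout", "EvaluatePersonaNode"],
    "Present": ["Inform"],
    "Inform": ["TryToPresentTVOverview"],
    "TryToPresentTVOverview": ["ShowTVOverview"],
    "ShowTVOverview": ["SetLayoutData", "GenerateText"],
    "EvaluatePersonaNode": ["PersonaAction"],
    "PersonaAction": ["SendScreenCommand", "Speak"],
}


def expand(goal):
    """Depth-first expansion of a goal into a nested (goal, children) plan tree."""
    return (goal, [expand(sub) for sub in STRATEGIES.get(goal, [])])


def leaves(plan):
    """Primitive actions of a plan tree, in execution order."""
    goal, children = plan
    if not children:
        return [goal]
    return [leaf for child in children for leaf in leaves(child)]


plan = expand("GlobalPresent")
print(leaves(plan))
```

Goals without an entry in the table are treated as primitive actions, so the leaves of the tree form the sequence of layout and agent commands to execute.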

Page 36: SmartKom’s Use of Semantic Web Technology

Three Layers of Annotations

Layer       Language   Personalized Presentation
Content     M3L        high
Structure   XML        medium
Layout      HTML       low

cf.: Dieter Fensel, James Hendler, Henry Lieberman, Wolfgang Wahlster (eds.), Spinning the Semantic Web, MIT Press, November 2002

Page 37: Conclusions

• Various types of unification, overlay, constraint processing, planning and ontological inferences are the fundamental processes involved in SmartKom’s modality fusion and fission components.

• The key function of modality fusion is the reduction of the overall uncertainty and the mutual disambiguation of the various analysis results based on a three-tiered representation of multimodal discourse.

• We have shown that a multimodal dialogue system must not only understand and represent the user’s input, but also its own multimodal output.

Page 38: First International Conference on Perceptive & Multimodal User Interfaces (PMUI’03)

November 5-7, 2003
Delta Pinnacle Hotel, Vancouver, B.C., Canada

Conference Chair: Sharon Oviatt, Oregon Health & Science Univ., USA
Program Chairs: Wolfgang Wahlster, DFKI, Germany; Mark Maybury, MITRE, USA

PMUI’03 is sponsored by ACM, and will be co-located in Vancouver with ACM’s UIST’03. This meeting follows three successful Perceptive User Interface Workshops (with PUI’01 held in Florida) and three International Multimodal Interface Conferences initiated in Asia (with ICMI’02 held in Pittsburgh).