CONFUCIUS: an Intelligent MultiMedia storytelling interpretation & presentation system

Minhua Eunice Ma

Supervisor: Prof. Paul Mc Kevitt

School of Computing and Intelligent Systems

Faculty of Informatics

University of Ulster, Magee

Objectives of CONFUCIUS

To interpret natural language story and movie (drama) script input and to extract conceptual semantics from the natural language

To generate 3D animation and virtual worlds automatically from natural language

To integrate 3D animation with speech and non-speech audio, to form an intelligent multimedia storytelling system for presenting multimodal stories

CONFUCIUS’ context diagram

Storywriter/playwright

Story in natural language

Movie/drama script

Tailored menu for script input

CONFUCIUS

3D animation

Speech (dialogue)

Non-speech audio

User/story listener

Previous systems

Schank’s CD Theory (1972)
  Primitives & scripts
  SAM & PAM

Automatic Text-to-Graphics Systems
  WordsEye (Coyne & Sproat, 2001)
  ‘Micons’ and CD-based language animation (Narayanan et al., 1995)
  Spoken Image (Ó Nualláin & Smith, 1994) & its successor SONAS (Kelleher et al., 2000)

MultiModal interactive storytelling
  AesopWorld
  KidsRoom
  Larsen & Petersen’s Interactive Storytelling
  Oz
  Computer games

Virtual humans & embodied agents
  BEAT (Cassell et al., 2000)
  Jack (University of Pennsylvania)
  Improv (Perlin and Goldberg, 1996)
  SimHuman
  Gandalf
  PPP persona

Architecture of CONFUCIUS

3D authoring tools, existing 3D models & character models

visual knowledge (3D graphic library)

Prefabricated objects (knowledge base)

Script writer

Script parser

Natural Language Processing

Text To Speech

Sound effects

Animation generation

Synchronizing & fusion

3D world with audio in VRML

Natural language stories

Language knowledge

mapping

lexicon, grammar, etc.

semantic representations

visual knowledge

Semantic representations

Table columns: Categories, Knowledge representations, Decomposition, Typical applications

(1) General knowledge representation & reasoning
  rule-based representation: expert systems
  FOPC (First Order Predicate Calculus): sentence representation, expert systems
  semantic networks: lexical semantics
  Schank’s scripts: story understanding
  frame-based representations
  XML-based representations: multimodal semantics

(2) Physical knowledge representation & reasoning (incl. spatial/temporal reasoning)
  Conceptual Dependency (CD)
  event-logic truth conditions
  x-schema and f-structure
  Jackendoff’s Lexical Conceptual Structure (LCS)
  decomposite predicate-argument representation
  Typical applications: dynamic vision (movement) recognition & generation

MultiModal semantic representation

Multimodal semantics

Language modality

Visual modality

Non-speech audio modality

Media-independent representation
  High-level multimodal semantic representation: XML/frame-based

Intermediate level
  Visual media-dependent representation
  Audio media-dependent representation
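To make the idea of a high-level, XML/frame-based, media-independent representation more concrete, here is a minimal sketch in Python. The slides do not give the actual schema, so the element and attribute names (`event`, `agent`, `modality`, `predicate`, `type`) and the function `build_event_frame` are purely hypothetical. The sketch only shows how one event frame could carry slots that media-dependent layers fill in later.

```python
import xml.etree.ElementTree as ET


def build_event_frame(predicate, agent, modalities):
    """Build a hypothetical media-independent frame for one story event."""
    frame = ET.Element("event", {"predicate": predicate})
    ET.SubElement(frame, "agent").text = agent
    # One empty slot per modality; a media-dependent layer would fill it later.
    for modality in modalities:
        ET.SubElement(frame, "modality", {"type": modality})
    return frame


# Frame for the sentence "A ball is bouncing" (slot names are assumptions).
frame = build_event_frame("bounce", "ball", ["visual", "non-speech-audio"])
xml_text = ET.tostring(frame, encoding="unicode")
print(xml_text)
```

A single frame per event keeps the language side independent of how the visual and audio modalities eventually render it.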

Mental imagery & meaning processing

Cognition Re-cognition

Communication

Simulation: presentation via language or other modalities

Simulation: image recognition

Simulation: language understanding

Meanings, communicable ideas, thoughts, manifestable messages, proverbs, examples, parables, etc.

Physical world Virtual world

Mental world Mental world

Knowledge base of CONFUCIUS

knowledge base
  Language knowledge
    Semantic knowledge: lexicons (e.g. WordNet)
    Syntactic knowledge: grammars
    Statistical models of language
    Associations between words
  Visual knowledge
    Object model (nouns)
    Functional information
    Internal coordinate axes (for spatial reasoning)
    Associations between objects
    Event model (event verbs, describes the motion of objects)
  World knowledge
  Spatial & qualitative reasoning knowledge

Graphic library
  Simple geometry files: objects/props
  Geometry & joint hierarchy files: characters
  Animation library (key frames): motions

instantiation

Data Flow Diagram

Inputs: story, script

Processes: Script writer, Script parser, Natural language processor, Animation generator, TTS, Sound effect driver, Media coordination

Libraries: Primitives library, Music library

Data flows: script, dialogues, non-speech audio, visual semantics, Scene & Actor descriptions, VRML without sound nodes

Output: Synthesized animation

Animation generator

LCS representation

verb semantic analysis: use lexical relations (WordNet) to replace synonyms, scripts application, etc.

match basic motions in library?
  Y: motion instantiation
  N: motion decomposition

environment placement

animation controller

VRML format of the virtual story world

examples demo
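The decision at the centre of the animation generator (does a verb match a basic motion in the library, or must it be decomposed first?) can be sketched roughly as follows. This is an illustration only. The contents of `MOTION_LIBRARY` and `DECOMPOSITIONS` and the function name `plan_motions` are hypothetical, and the sketch assumes that a match leads to motion instantiation while a miss leads to motion decomposition.

```python
# Hypothetical library of basic motions that already have key-frame animations.
MOTION_LIBRARY = {"walk", "turn", "moveTo"}

# Hypothetical decompositions of verbs into simpler motions (illustrative only).
DECOMPOSITIONS = {
    "bounce": ["moveTo", "moveTo"],
    "pace": ["walk", "turn", "walk"],
}


def plan_motions(verb):
    """Return the list of basic motions to instantiate for a verb.

    A verb found in the library is used directly; otherwise it is
    decomposed and each part is resolved in the same way.
    """
    if verb in MOTION_LIBRARY:
        return [verb]
    if verb not in DECOMPOSITIONS:
        raise KeyError(f"no visual definition for verb: {verb}")
    motions = []
    for part in DECOMPOSITIONS[verb]:
        motions.extend(plan_motions(part))  # recurse until every part is basic
    return motions


print(plan_motions("pace"))
```

The resulting list of basic motions is what a later stage would instantiate and pass to the animation controller.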

Categories of events

Atomic entities
  Change physical location such as position and orientation, e.g. “bounce”, “turn”
  Change intrinsic attributes such as shape, size, color, and texture, e.g. “bend”, and even visibility, e.g. “disappear”, “fade” (in/out)

Non-atomic entities
  Non-character events
    Two or more individual objects fuse together, e.g. “melt” (in)
    One object divides into two or more individual parts, e.g. “break” (into pieces)
    Change sub-components (their position, size, color), e.g. “blossom”
    Environment events (weather verbs), e.g. “snow”, “rain”
  Character events
    Action verbs
      Intransitive verbs
      Transitive verbs
    Non-action verbs (stative, emotion, possession, mental activities, cognition & perception)
    Idioms & metaphor verbs

Categories of action verbs

Intransitive verbs
  Biped kinematics, e.g. “walk”, “swim”, & other motion models like “fly”
  Facial expressions, e.g. “laugh”, “anger”
  Lip movement, e.g. “speak”, “say”

Transitive verbs
  single object, e.g. “throw”, “push”, “kick”
  multiple objects
    direct and indirect objects, e.g. “give”, “pass”, “show”
    indirect object & the instrument, e.g. “cut”, “hammer”

involve speech modality

Visual definition & word sense

mapping: verb word sense to visual definition entry

word sense -- minimal complete unit of meaning in the language modality

visual definition entry -- minimal complete unit of meaning in the visual modality

polysemy

synonymy

Example: “close” (a door)

1. a normal door (rotation on y axis)

2. a sliding door (moving on x axis)

3. a rolling shutter door (a combination of rotation on x axis and moving on y axis)

mapping cardinalities: one-to-many, many-to-many
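The “close” (a door) example shows one word sense mapping to several visual definition entries depending on the object involved. A minimal sketch of such a lookup is given below. The table layout, the `(operation, axis)` step encoding and the function names are assumptions made for illustration, not the system's actual data structures.

```python
# Each visual definition is a list of (operation, axis) steps.
# Keys pair a verb sense with an object type; all names are hypothetical.
VISUAL_DEFINITIONS = {
    ("close", "normal door"): [("rotate", "y")],
    ("close", "sliding door"): [("translate", "x")],
    ("close", "rolling shutter door"): [("rotate", "x"), ("translate", "y")],
}


def visual_definitions_for(verb):
    """Return every visual definition entry registered for one verb sense."""
    return {obj: steps for (v, obj), steps in VISUAL_DEFINITIONS.items() if v == verb}


def lookup(verb, object_type):
    """Pick the single visual definition that fits the given object type."""
    return VISUAL_DEFINITIONS[(verb, object_type)]


print(lookup("close", "rolling shutter door"))
```

Keying on the object type as well as the verb is one simple way to resolve the one-to-many relation at animation time.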

Troponyms & verbs derived from adjectives/nouns

Troponyms
  a troponym elaborates the manner of a base verb (Fellbaum 1998)
  examples: “trot”-“walk” (fast), “gulp”-“eat” (quickly)
  troponym = base verb + adverb
  present the base verb + modify the manner (speed, the agent’s state, duration of the activity, iteration, etc.)
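The troponym idea (present the base verb, then modify its manner) might be sketched as below. Only the speed aspect of manner is modelled. The `Manner` class, the `TROPONYMS` table and the speed factors are hypothetical values chosen for illustration, not figures from the system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manner:
    base_verb: str
    speed_factor: float  # greater than 1 means faster than the base animation


# Hypothetical troponym table; the speed factors are illustrative only.
TROPONYMS = {
    "trot": Manner("walk", 2.0),
    "gulp": Manner("eat", 2.0),
}


def resolve(verb, base_cycle_seconds):
    """Return (base verb, cycle duration), applying the manner if there is one."""
    manner = TROPONYMS.get(verb)
    if manner is None:
        return verb, base_cycle_seconds  # already a base verb
    return manner.base_verb, base_cycle_seconds / manner.speed_factor


print(resolve("trot", 1.0))
```

Other manner dimensions named on the slide (agent's state, duration, iteration) could be added as further fields in the same way.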

Verbs derived from adjectives or nouns
  change objects’ properties (size, color, shape) or the world state
  verbs with affixes such as -en, -ify, or -ize, e.g. “lengthen”
  using predicates scale(), squash() or changing the corresponding property fields of the object in VRML

Representing active & passive voice

active and passive voice

converse verb pairs such as “give/take”, “buy/sell”, “lend/borrow”

same activity from different points of view

use of VRML Viewpoint node
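One way to read the converse verb pairs is that both verbs of a pair select the same animated activity and differ only in the camera placement. The sketch below illustrates that reading. The role names (`source`, `recipient`), the camera positions and the function `viewpoint_node` are hypothetical; only the `Viewpoint` node with its `position` and `description` fields comes from VRML97.

```python
# Hypothetical table: verb -> (shared activity, role whose point of view is shown).
CONVERSE_VERBS = {
    "give": ("give/take", "source"),
    "take": ("give/take", "recipient"),
    "sell": ("buy/sell", "source"),
    "buy": ("buy/sell", "recipient"),
    "lend": ("lend/borrow", "source"),
    "borrow": ("lend/borrow", "recipient"),
}

# Hypothetical camera positions, one per role.
CAMERA_POSITIONS = {"source": (-10, 2, 10), "recipient": (10, 2, 10)}


def viewpoint_node(verb):
    """Return (activity, VRML Viewpoint text) for a converse verb."""
    activity, role = CONVERSE_VERBS[verb]
    x, y, z = CAMERA_POSITIONS[role]
    node = f'Viewpoint {{ position {x} {y} {z} description "{role}" }}'
    return activity, node


print(viewpoint_node("give"))
print(viewpoint_node("take"))
```

With this arrangement the animation itself is generated once per pair, and only the emitted Viewpoint differs.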

Implementation: semantics to VRML

Example: “A ball is bouncing”

(a) Visual definition of “bounce”

  bounce(ball):- [moveTo(ball, [0,0,0]), moveTo(ball,[0,20,0])]L.

(b) VRML code of a static ball

  DEF ball Transform {
    translation 0 0 0
    children [
      Shape {
        appearance Appearance { material Material {} }
        geometry Sphere { radius 5 }
      }
    ]
  }

(c) Output VRML code of a bouncing ball

  DEF ball Transform {
    translation 0 0 0
    children [
      DEF ball-TIMER TimeSensor {
        loop TRUE
        cycleInterval 0.5
      },
      DEF ball-POS-INTERP PositionInterpolator {
        key [0, 0.5, 1]
        keyValue [0 0 0, 0 20 0, 0 0 0]
      },
      Shape {
        appearance Appearance { material Material {} }
        geometry Sphere { radius 5 }
      }
    ]
    ROUTE ball-TIMER.fraction_changed TO ball-POS-INTERP.set_fraction
    ROUTE ball-POS-INTERP.value_changed TO ball.set_translation
  }
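The step from the visual definition of “bounce” to the output VRML can be sketched as a small text generator. The function name `bounce_vrml` and its parameters are assumptions made for this illustration. The slides state that the system modifies VRML code from Java, so this Python version only mirrors the shape of the output, not the actual implementation.

```python
def bounce_vrml(name, low, high, cycle_interval=0.5, radius=5):
    """Emit VRML97 text for a sphere that loops between two positions."""

    def vec(point):
        return " ".join(str(c) for c in point)

    # The interpolator goes low -> high -> low over one timer cycle.
    key_values = ", ".join(vec(p) for p in (low, high, low))
    return f"""DEF {name} Transform {{
  translation {vec(low)}
  children [
    DEF {name}-TIMER TimeSensor {{ loop TRUE cycleInterval {cycle_interval} }}
    DEF {name}-POS-INTERP PositionInterpolator {{
      key [0, 0.5, 1]
      keyValue [{key_values}]
    }}
    Shape {{
      appearance Appearance {{ material Material {{}} }}
      geometry Sphere {{ radius {radius} }}
    }}
  ]
  ROUTE {name}-TIMER.fraction_changed TO {name}-POS-INTERP.set_fraction
  ROUTE {name}-POS-INTERP.value_changed TO {name}.set_translation
}}
"""


# Mirrors moveTo(ball, [0,0,0]) and moveTo(ball, [0,20,0]) from the visual definition.
vrml = bounce_vrml("ball", (0, 0, 0), (0, 20, 0))
print(vrml)
```

The two `moveTo` targets become the interpolator's key values, and the looping timer drives the repetition.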

Categories of adjectives

Visually observable
  Objects’ attributes/states: dark/light, large/small, big/little, white/black (color adj.), long/short, new/old, high/low, full/empty, open/closed
  Observable human attributes
    Feelings: happy/sad, angry, excited, surprised, terrified
    Others: old/young, beautiful/ugly, strong/weak, poor/rich, fat/thin
  Relational adj.: nasal (nose), mural (wall), dental (teeth)

Visually unobservable
  Perceivable by other modalities: wet/dry, warm/cold, coarse/smooth, hard/soft, heavy/light
  Abstract attributes
    Unobservable human attributes (virtue): good/evil, kind, mean, ambitious
    Others: easy/difficult, real, important, particular, right/wrong, early/late
  Reference-modifying adj.: possible/impossible, former, past/present, last, other, different/same

Software Analysis

Java programming language
  parsing intermediate representation
  changing VRML code to create/modify animation
  integrating modules

Natural language processing tools
  GATE (pre-processing)
  PC-PARSE (morphological and syntactic analysis)
  WordNet (lexicon, semantic inference)

3D graphic modelling
  existing 3D models on the Internet
  3D Studio Max (props & stage)
  VRML (Virtual Reality Modelling Language) 97, H-anim 2001 spec.

The Actors: using embodied agents
  Microsoft Agent (the narrator and minor actors)
  Character Studio, Internet Character Animator (protagonists)

Natural Language Processing

Semantic inference

Coreference resolution

Part-of-speech tagger

Syntactic parser

Morphological parser

Temporal reasoning

Pre-processing

PC-PARSER

WordNet 1.6

LEXICON & MORPHOLOGICAL RULES

FEATURES

Contribution & prospective applications

Prospective applications
  Children’s education
  Multimedia presentation
  Movie/drama production
  Script writing
  Computer games
  Virtual Reality

Contributions
  multimodal semantic representation of natural language
  automatic animation generation
  multimodal fusion and coordination

Conclusion

The objectives of CONFUCIUS address challenging problems in language visualisation:

  formalizing the meaning of action verbs and states

  mapping language primitives to visual primitives

  providing a reusable ‘common sense’ knowledge base for other systems

  sophisticated spatial and temporal reasoning

  representing stories by temporal multimedia, which requires significant coordination
