
Workshop Report

Creating Interactive Virtual Humans: Some Assembly Required

Jonathan Gratch, USC Institute for Creative Technologies
Jeff Rickel, USC Information Sciences Institute
Elisabeth André, University of Augsburg
Justine Cassell, MIT Media Lab
Eric Petajan, Face2Face Animation
Norman Badler, University of Pennsylvania

Science fiction has long imagined a future populated with artificial humans—human-looking devices with human-like intelligence. Although Asimov’s benevolent robots and the Terminator movies’ terrible war machines are still a distant fantasy, researchers across a wide range of disciplines are beginning to work together toward a more modest goal—building virtual humans. These software entities look and act like people and can engage in conversation and collaborative tasks, but they live in simulated environments. With the untidy problems of sensing and acting in the physical world thus dispensed with, the focus of virtual human research is on capturing the richness and dynamics of human behavior.

The potential applications of this technology are considerable. History students could visit ancient Greece and debate Aristotle. Patients with social phobias could rehearse threatening social situations in the safety of a virtual environment. Social psychologists could study theories of communication by systematically modifying a virtual human’s verbal and nonverbal behavior. A variety of applications are already in progress, including education and training,1 therapy,2 marketing,3,4 and entertainment.5,6

Building a virtual human is a multidisciplinary effort, joining traditional artificial intelligence problems with a range of issues from computer graphics to social science. Virtual humans must act and react in their simulated environment, drawing on the disciplines of automated reasoning and planning. To hold a conversation, they must exploit the full gamut of natural language processing research, from speech recognition and natural language understanding to natural language generation and speech synthesis. Providing human bodies that can be controlled in real time delves into computer graphics and animation. And because an agent looks like a human, people expect it to behave like one as well and will be disturbed by, or misinterpret, discrepancies from human norms. Thus, virtual human research must draw heavily on psychology and communication theory to appropriately convey nonverbal behavior, emotion, and personality.

This broad range of requirements poses a serious problem. Researchers working on particular aspects of virtual humans cannot explore their component in the context of a complete virtual human unless they can understand results across this array of disciplines and assemble the vast range of software tools (for example, speech recognizers, planners, and animation systems) required to construct one. Moreover, these tools were rarely designed to interoperate and, worse, were often designed with different purposes in mind. For example, most computer graphics research has focused on high-fidelity offline image rendering that does not support the fine-grained interactive control that a virtual human must have over its body.

In the spring of 2002, about 30 international researchers from across disciplines convened at the University of Southern California to begin to bridge this gap in knowledge and tools (see www.ict.usc.edu/~vhumans). Our ultimate goal is a modular architecture and interface standards that will allow researchers in this area to reuse each other’s work. This goal can only be achieved through a close multidisciplinary collaboration. Towards this end, the workshop gathered a collection of experts representing the range of required research areas, including

• Human figure animation
• Facial animation
• Perception
• Cognitive modeling
• Emotions and personality
• Natural language processing
• Speech recognition and synthesis
• Nonverbal communication
• Distributed simulation
• Computer games

Here we discuss some of the key issues that must be addressed in creating virtual humans. As a first step, we overview the issues and available tools in three key areas of virtual human research: face-to-face conversation, emotions and personality, and human figure animation.

Face-to-face conversation

Human face-to-face conversation involves both language and nonverbal behavior. The behaviors during conversation don’t just function in parallel, but interdependently. The meaning of a word informs the interpretation of a gesture, and vice versa. The time scales of these behaviors, however, are different—a quick look at the other person to check that they are listening lasts for less time than it takes to pronounce a single word, while a hand gesture that indicates what the word “caulk” means might last longer than it takes to say, “I caulked all weekend.”

Coordinating verbal and nonverbal conversational behaviors for virtual humans requires meeting several interrelated challenges. How speech, intonation, gaze, and head movements make meaning together, the patterns of their co-occurrence in conversation, and what kinds of goals are achieved by the different channels are all equally important for understanding the construction of virtual humans. Speech and nonverbal behaviors do not always manifest the same information, but what they convey is virtually always compatible.7 In many cases, different modalities serve to reinforce one another through redundancy of meaning. In other cases, semantic and pragmatic attributes of the message are distributed across the modalities.8 The compatibility of meaning between gestures and speech recalls the interaction of words and graphics in multimodal presentations.9 For patterns of co-occurrence, there is a tight synchrony among the different conversational modalities in humans. For example, people accentuate important words by speaking more forcefully, illustrating their point with a gesture, and turning their eyes toward the listener when coming to the end of a thought. Meanwhile, listeners nod within a few hundred milliseconds of when the speaker’s gaze shifts. This synchrony is essential to the meaning of conversation. When it is destroyed, as in low-bandwidth videoconferencing, satisfaction and trust in the outcome of a conversation diminish.10

Regarding the goals achieved by the different modalities, in natural conversation speakers tend to produce a gesture with respect to their propositional goals (to advance the conversation content), such as making the first two fingers look like legs walking when saying “it took 15 minutes to get here,” and speakers tend to use eye movement with respect to interactional goals (to ease the conversation process), such as looking toward the other person when giving up the turn.7 To realistically generate all the different verbal and nonverbal behaviors, then, computational architectures for virtual humans must control both the propositional and interactional structures. In addition, because some of these goals can be equally well met by one modality or the other, the architecture must deal at the level of goals or functions, and not at the level of modalities or behaviors. That is, giving up the turn is often achieved by looking at the listener. But, if the speaker’s eyes are on the road, he or she can get a response by saying, “Don’t you think?”

Constructing a virtual human that can effectively participate in face-to-face conversation requires a control architecture with the following features:4

• Multimodal input and output. Because humans in face-to-face conversation send and receive information through gesture, intonation, and gaze as well as speech, the architecture should also support receiving and transmitting this information.

• Real-time feedback. The system must let the speaker watch for feedback and turn requests, while the listener can send these at any time through various modalities. The architecture should be flexible enough to track these different threads of communication in ways appropriate to each thread. Different threads have different response-time requirements; some, such as feedback and interruption, occur on a sub-second time scale. The architecture should reflect this by allowing different processes to concentrate on activities at different time scales.

• Understanding and synthesis of propositional and interactional information. Dealing with propositional information—the communication content—requires building a model of the user’s needs and knowledge. The architecture must include a static domain knowledge base and a dynamic discourse knowledge base. Presenting propositional information requires a planning module for presenting multi-sentence output and managing the order of presentation of interdependent facts. Understanding interactional information—about the processes of conversation—on the other hand, entails building a model of the current state of the conversation with respect to the conversational process (to determine who the current speaker and listener are, whether the listener has understood the speaker’s contribution, and so on).

• Conversational function model. Functions, such as initiating a conversation or giving up the floor, can be achieved by a range of different behaviors, such as looking repeatedly at another person or bringing your hands down to your lap. Explicitly representing conversational functions, rather than behaviors, provides both modularity and a principled way to combine different modalities. Functional models influence the architecture because the core system modules operate exclusively on functions, while other system modules at the edges translate input behaviors into functions, and functions into output behaviors. This also produces a symmetric architecture because the same functions and modalities are present in both input and output. (A minimal sketch of such a function-to-behavior mapping follows this list.)
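To make the function/behavior distinction concrete, here is a minimal, hypothetical sketch of a symmetric mapping in which edge modules translate observed behaviors into conversational functions and functions back into output behaviors. It is not drawn from any of the systems cited in this report; the function names, behavior vocabulary, and preference ordering are illustrative assumptions.

```python
# Minimal sketch of a conversational function model: the core reasons only
# about functions; edge modules map behaviors <-> functions symmetrically.
# Function and behavior names here are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Behavior:
    modality: str   # "gaze", "speech", "head", ...
    form: str       # concrete realization, e.g. "look_at_listener"

# Input edge: observed behaviors -> conversational functions.
BEHAVIOR_TO_FUNCTION = {
    ("gaze", "look_at_speaker_sustained"): "give_turn",
    ("speech", "tag_question"): "give_turn",        # same function, other modality
    ("head", "nod"): "feedback_understood",
    ("speech", "uh_huh"): "feedback_understood",
}

# Output edge: conversational functions -> candidate behaviors, in preference order.
FUNCTION_TO_BEHAVIORS = {
    "give_turn": [Behavior("gaze", "look_at_listener"),
                  Behavior("speech", "tag_question")],
    "feedback_understood": [Behavior("head", "nod"),
                            Behavior("speech", "uh_huh")],
}

def interpret(behavior: Behavior):
    """Translate an observed behavior into the function it realizes, if any."""
    return BEHAVIOR_TO_FUNCTION.get((behavior.modality, behavior.form))

def realize(function: str, free_modalities: set):
    """Pick the first candidate behavior whose modality is currently free."""
    for candidate in FUNCTION_TO_BEHAVIORS.get(function, []):
        if candidate.modality in free_modalities:
            return candidate
    return None

if __name__ == "__main__":
    print(interpret(Behavior("head", "nod")))        # feedback_understood
    print(realize("give_turn", {"speech"}))          # gaze busy -> spoken tag question
```

Because the core deals only in functions, swapping gaze for a spoken tag question (the “Don’t you think?” example above) is a realization decision at the output edge, not a change to the dialogue logic.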

To capture different time scales and the importance of co-occurrence, input to a virtual human must be incremental and time-stamped. For example, incremental speech recognition lets the virtual human give feedback (such as a quick nod) right as the real human finishes a sentence, therefore influencing the direction the human speaker takes. At the very least, the system should report a significant change in state right away, even if full information about the event has not yet been processed. This means that if speech recognition cannot be incremental, at least the fact that someone is speaking or has finished speaking should be relayed immediately, even in the absence of a fully recognized utterance. This lets the virtual human give up the turn when the real human claims it and signal reception after being addressed. When dealing with multiple modalities, fusing interpretations of the different input events is important to understand what behaviors are acting together to convey meaning.12 For this, a synchronized clock across modalities is crucial so events such as exactly when an emphasis beat gesture occurs can be compared to speech, word by word. This requires, of course, that the speech recognizer supply word onset times.
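As an illustration only, the following sketch assumes a shared clock and a recognizer that reports word onset times, as discussed above; it time-stamps events from several modalities and pairs a beat gesture with the word whose onset is closest in time. The event values and timings are invented.

```python
# Sketch: fuse time-stamped multimodal events against word onset times.
# Assumes all components share one clock (seconds); data are illustrative.
from dataclasses import dataclass

@dataclass
class Event:
    t: float          # timestamp on the shared clock
    modality: str     # "speech_word", "gesture", "gaze", ...
    value: str

def align_gestures_to_words(events):
    """Pair each gesture with the word whose onset is nearest in time."""
    words = [e for e in events if e.modality == "speech_word"]
    gestures = [e for e in events if e.modality == "gesture"]
    pairs = []
    for g in gestures:
        nearest = min(words, key=lambda w: abs(w.t - g.t))
        pairs.append((g.value, nearest.value, round(g.t - nearest.t, 3)))
    return pairs

stream = [
    Event(0.00, "speech_word", "it"),
    Event(0.18, "speech_word", "took"),
    Event(0.21, "gesture", "beat"),          # emphasis beat near "took"
    Event(0.52, "speech_word", "fifteen"),
    Event(0.55, "gaze", "to_listener"),
    Event(0.95, "speech_word", "minutes"),
]

print(align_gestures_to_words(stream))       # [('beat', 'took', 0.03)]
```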

Similarly, for the virtual human to produce a multimodal performance, the output channels also must be incremental and tightly synchronized. Incremental refers to two properties in particular: seamless transitions and interruptible behavior. When producing certain behaviors, such as gestures, the virtual human must reconfigure its limbs in a natural manner, usually requiring that some time be spent on interpolating from a previous posture to a new one. For the transition to be seamless, the virtual human must give the animation system advance notice of events such as gestures, so that it has time to bring the arms into place. Sometimes, however, behaviors must be abruptly interrupted, such as when the real human takes the turn before the virtual human has finished speaking. In that case, the current behavior schedule must be scrapped, the voice halted, and new attentive behaviors initiated—all with reasonable seamlessness.

Synchronicity between modalities is as important in the output as in the input. The virtual human must align a graphical behavior with the uttering of particular words or a group of words. The temporal association between the words and behaviors might have been resolved as part of the behavior generation process, as is done in SPUD (Sentence Planning Using Description),8 but it is essential that the speech synthesizer provide a mechanism for maintaining synchrony through the final production stage. There are two types of mechanisms: event based and time based. A text-to-speech engine can usually be programmed to send events on phoneme and word boundaries. Although this is geared towards supporting lip synch, other behaviors can be executed as well. However, this does not allow any time for behavior preparation. Preferably, the TTS engine can provide exact start times for each word prior to playing back the voice, as Festival does.13 This way, we can schedule the behaviors, and thus the transitions between behaviors, beforehand, and then play them back along with the voice for a perfectly seamless performance.
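Here is a minimal sketch of that time-based scheduling idea, assuming the synthesizer can report each word's start time before playback begins (as the text notes Festival can). The preparation lead time, word timings, and behavior names are assumptions for illustration.

```python
# Sketch: pre-schedule nonverbal behaviors against word start times supplied
# by a TTS engine before audio playback begins. Times are in seconds.

def schedule_behaviors(word_times, behavior_requests, prep_time=0.3):
    """
    word_times: {word_index: start_time} reported by the synthesizer.
    behavior_requests: list of (word_index, behavior) meant to co-occur with that word.
    Returns (start_time, behavior) pairs, starting each behavior early enough
    that its preparation (e.g., bringing the arm into place) finishes on time.
    """
    schedule = []
    for word_index, behavior in behavior_requests:
        stroke_time = word_times[word_index]
        schedule.append((max(0.0, stroke_time - prep_time), behavior))
    return sorted(schedule)

# Illustrative timings for "I caulked all weekend".
word_times = {0: 0.00, 1: 0.22, 2: 0.61, 3: 0.80}
requests = [(1, "iconic_gesture_caulk"), (3, "gaze_to_listener")]

for t, behavior in schedule_behaviors(word_times, requests):
    print(f"{t:4.2f}s  start {behavior}")
```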

On the output side, one tool that provides such tight synchronicity is the Behavior Expression Animation Toolkit (BEAT) system.11 Figure 1 shows BEAT’s architecture. BEAT has the advantage of automatically annotating text with hand gestures, eye gaze, eyebrow movement, and intonation. The annotation is carried out in XML, through interaction with an embedded word ontology module, which creates a set of hypernyms that broadens a knowledge base search of the domain being discussed. The annotation is then passed to a set of behavior generation rules. Output is scheduled so that tight synchronization is maintained among modalities.
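The following is not BEAT's actual code, only a toy illustration of the general idea of rule-based annotation: tag the words of the input, let suggestion rules propose behaviors, and let a filter select among them. The tags, rules, and thresholds are invented for the example.

```python
# Toy illustration of rule-based text-to-behavior annotation (not BEAT itself):
# tag the text, let generator rules suggest behaviors, then filter conflicts.
import re

def tag(text):
    """Very crude 'language tagging': mark words judged emphasizable."""
    words = re.findall(r"\w+", text.lower())
    return [{"word": w, "rheme": w not in {"i", "the", "a", "all"}} for w in words]

def suggest(tokens):
    """Generator set: propose behaviors keyed to word indices."""
    suggestions = []
    for i, tok in enumerate(tokens):
        if tok["rheme"]:
            suggestions.append((i, "beat_gesture"))
            suggestions.append((i, "pitch_accent"))
    return suggestions

def select(suggestions, max_gestures=2):
    """Filter set: keep all accents but cap the number of gestures."""
    kept, gestures = [], 0
    for item in suggestions:
        if item[1] == "beat_gesture":
            if gestures >= max_gestures:
                continue
            gestures += 1
        kept.append(item)
    return kept

tokens = tag("I caulked all weekend")
print(select(suggest(tokens)))
```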

Emotions and personality

People infuse their verbal and nonverbal behavior with emotion and personality, and modeling such behavior is essential for building believable virtual humans. Consequently, researchers have developed computational models for a wide range of applications. Computational approaches might be roughly divided into communication-driven and simulation-based approaches.

In communication-driven approaches, a virtual human chooses its emotional expression on the basis of its desired impact on the user. Catherine Pelachaud and her colleagues use facial expressions to convey affect in combination with other communicative functions.14 For example, making a request with a sorrowful face can evoke pity and motivate an affirmative response from the listener. An interesting feature of their approach is that the agent deliberately plans whether or not to convey a certain emotion. Tutoring applications usually also follow a communication-driven approach, intentionally expressing emotions with the goal of motivating the students and thus increasing the learning effect. The Cosmo system, where the agent’s pedagogical goals drive the selection and sequencing of emotive behaviors, is one example.15 For instance, a congratulatory act triggers a motivational goal to express admiration that is conveyed with applause. To convey appropriate emotive behaviors, agents such as Cosmo need to appraise events not only from their own perspective but also from the perspective of others.

The second category of approaches aims at a simulation of “true” emotion (as opposed to deliberately conveyed emotion). These approaches build on appraisal theories of emotion, the most prominent being Andrew Ortony, Gerald Clore, and Allan Collins’ cognitive appraisal theory—commonly referred to as the OCC model.16 This theory views emotions as arising from a valenced reaction to events and objects in the light of agent goals, standards, and attitudes. For example, an agent watching a game-winning move should respond differently depending on which team is preferred.3
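A bare-bones sketch of OCC-style appraisal follows: an event is evaluated for desirability with respect to the agent's goals, and the valenced reaction becomes joy or distress. Only the "consequences for self" branch of the model is shown, and the goal weights are illustrative assumptions.

```python
# Sketch of OCC-style appraisal: a valenced reaction to an event in light of goals.
# Only the "consequences for self" branch is shown; weights are illustrative.

def desirability(event_effects, goals):
    """Sum how much the event helps (+) or hurts (-) each weighted goal."""
    return sum(goals.get(g, 0.0) * delta for g, delta in event_effects.items())

def appraise(event_effects, goals):
    d = desirability(event_effects, goals)
    if d > 0:
        return ("joy", d)
    if d < 0:
        return ("distress", -d)
    return ("neutral", 0.0)

# The same game-winning move appraises differently for agents whose goals differ.
fan_goals   = {"home_team_wins": 0.9}     # strongly prefers the home team
rival_goals = {"home_team_wins": -0.9}    # prefers the other team
event = {"home_team_wins": 1.0}           # the move makes a home win certain

print(appraise(event, fan_goals))    # ('joy', 0.9)
print(appraise(event, rival_goals))  # ('distress', 0.9)
```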


Figure 1. Behavior Expression Animation Toolkit text-to-nonverbal behavior module. (The figure shows BEAT’s pipeline: language tagging, behavior generation with a behavior suggestion stage driven by a generator set and a behavior selection stage driven by a filter set, and behavior scheduling, supported by a discourse model, knowledge base, and word timing, with a translator feeding the animation system.)

Recent work by Stacy Marsella and Jonathan Gratch integrates the OCC model with coping theories that explain how people cope with strong emotions.17 For example, their agents can engage in either problem-focused coping strategies, selecting and executing actions in the world that could improve the agent’s emotional state, or emotion-focused coping strategies, improving emotional state by altering the agent’s mental state (for example, dealing with guilt by blaming someone else). Further simulation approaches are based on the observation that an agent should be able to dynamically adapt its emotions through its own experience, using learning mechanisms.6,18

Appraisal theories focus on the relationship between an agent’s world assessment and the resulting emotions. Nevertheless, they are rather vague about the assessment process. For instance, they do not explain how to determine whether a certain event is desirable. A promising line of research is integrating appraisal theories with AI-based planning approaches,19 which might lead to a concretization of such theories. First, emotions can arise in response to a deliberative planning process (when relevant risks are noticed, progress assessed, and success detected). For example, several approaches derive an emotion’s intensity from the importance of a goal and its probability of achievement.20,21 Second, emotions can influence decision-making by allocating cognitive resources to specific goals or threats. Plan-based approaches support the implementation of decision and action selection mechanisms that are guided by an agent’s emotional state. For example, the Inhabited Market Place application treats emotions as filters to constrain the decision process when selecting and instantiating dialogue operators.3
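As a sketch of the plan-based idea that intensity follows from a goal's importance and its probability of achievement: the exact combination rule varies from system to system, and the product rule below is just one simple assumption made for illustration.

```python
# Sketch: derive emotion intensity from goal importance and the change in the
# goal's probability of achievement reported by a planner. The product rule
# is one simple choice; the systems cited in the text differ in detail.

def appraise_plan_update(goal_importance, p_before, p_after):
    """Return (emotion, intensity) for a change in achievement probability."""
    delta = p_after - p_before
    intensity = goal_importance * abs(delta)
    if delta > 0:
        return ("hope", intensity)      # progress detected
    if delta < 0:
        return ("fear", intensity)      # a relevant risk noticed
    return ("neutral", 0.0)

# The planner notices a threat: the probability of achieving an important goal drops.
print(appraise_plan_update(goal_importance=0.8, p_before=0.9, p_after=0.4))
# ('fear', 0.4)
```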

In addition to generating affective states, we must also express them in a manner easily interpretable by the user. Effective means of conveying emotions include body gestures, acoustic realization, and facial expressions (see Gary Collier’s work for an overview of studies on emotive expressions22). Several researchers use Bayesian networks to model the relationship between emotion and its behavioral expression. Bayesian networks let us deal explicitly with uncertainty, which is a great advantage when modeling the connections between emotions and the resulting behaviors. Gene Ball and Jack Breese presented an example of such an approach. They constructed a Bayesian network that estimates the likelihood of specific body postures and gestures for individuals with different personality types and emotions.23 For instance, a negative emotion increases the probability that an agent will say “Oh, you again,” as opposed to “Nice to see you!”

Recent work by Catherine Pelachaud and colleagues employs Bayesian networks to resolve conflicts that occur when different communicative functions need to be shown on different channels of the face, such as eyebrows, mouth shape, gaze direction, head direction, and head movements (see Figure 2).14 In this case, the Bayesian network estimates the likelihood that one face movement overrides another. Bayesian networks also offer the possibility of modeling how emotions vary over time. Even though neither Ball and Breese nor Pelachaud and colleagues took advantage of this feature, the extension of the two approaches to dynamic Bayesian networks seems obvious.
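A minimal sketch in the spirit of such models follows: given an emotion valence and a personality trait, sample an utterance from a conditional probability table. The table below is invented for illustration; Ball and Breese's actual network is considerably richer.

```python
# Sketch in the spirit of Bayesian emotion/personality -> behavior models.
# The conditional probability table is invented for illustration.
import random

# P(greeting | valence, dominance): each row sums to 1.
CPT = {
    ("negative", "dominant"):   {"Oh, you again.": 0.7, "Nice to see you!": 0.3},
    ("negative", "submissive"): {"Oh, you again.": 0.5, "Nice to see you!": 0.5},
    ("positive", "dominant"):   {"Oh, you again.": 0.2, "Nice to see you!": 0.8},
    ("positive", "submissive"): {"Oh, you again.": 0.1, "Nice to see you!": 0.9},
}

def sample_greeting(valence, dominance, rng=random.random):
    """Sample an utterance from P(greeting | emotion, personality)."""
    dist = CPT[(valence, dominance)]
    r, cumulative = rng(), 0.0
    for utterance, p in dist.items():
        cumulative += p
        if r <= cumulative:
            return utterance
    return utterance  # numerical safety net

random.seed(0)
print(sample_greeting("negative", "dominant"))
print(sample_greeting("positive", "submissive"))
```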

While significant progress has been made on the visualization of emotive behaviors, automated speech synthesis still has a long way to go. The most natural-sounding approaches rely on a large inventory of human speech units (for example, combinations of phonemes) that are subsequently selected and combined based on the sentence to be synthesized. These approaches do not yet provide much ability to convey emotion through speech (for example, by varying prosody or intensity). Marc Schröder provides an overview of speech manipulations that have been successfully employed to express several basic emotions.25 While the interest in affective speech synthesis is increasing, hardly any work has been done on conveying emotion through sentence structure or word choice. An exception is Eduard Hovy’s pioneering work on natural language generation, which addresses not only the goal of information delivery but also pragmatic aspects, such as the speaker’s emotions.26 Marilyn Walker and colleagues present a first approach to integrating acoustic parameters with other linguistic phenomena, such as sentence structure and wording.27

Obviously, there is a close relationship between emotion and personality. Dave Moffat differentiates between personality and emotion using the two dimensions duration and focus.28 Whereas personality remains stable over a long period of time, emotions are short-lived. Moreover, while emotions focus on particular events or objects, factors determining personality are more diffuse and indirect. Because of this obvious relationship, several projects aim to develop an integrated model of emotion and personality. As an example, Ball and Breese model dependencies between emotions and personality in a Bayesian network.23

To enhance the believability of animated agents beyond reasoning about emotion and personality, Helmut Prendinger and colleagues model the relationship between an agent’s social role and the associated constraints on emotion expression, for example, by suppressing negative emotion when interacting with higher-status individuals.29

Another line of research aims at providing an enabling technology to support affective interactions. This includes both the definition of standardized languages for specifying emotive behaviors, such as the Affective Presentation Markup Language14 or the Emotion Markup Language (www.vhml.org), as well as the implementation of toolkits for affective computing combining a set of components addressing affective knowledge acquisition, representation, reasoning, planning, communication, and expression.30

Figure 2. Pelachaud and colleagues use an MPEG-4 compatible facial animation system to investigate how to resolve conflicts that arise when different communication functions need to be shown on different channels of the face. (The figure's three panels are labeled (a), (b), and (c).)

Human figure animation

By engaging in face-to-face conversation, conveying emotion and personality, and otherwise interacting with the synthetic environment, virtual humans impose fairly severe behavioral requirements on the underlying animation system that must render their physical bodies. Most production work involves animator effort to design or script movements or direct performer motion capture. Replaying movements in real time is not the issue; rather, it is creating novel, contextually sensitive movements in real time that matters. Interactive and conversational agents, for example, will not enjoy the luxury of relying on animators to create human time-frame responses. Animation techniques must span a variety of body systems: locomotion, manual gestures, hand movements, body pose, faces, eyes, speech, and other physiological necessities such as breathing, blinking, and perspiring. Research in human figure animation has addressed all of these modalities, but historically the work focuses either on the animation of complete body movements or on animation of the face.

Body animation methods

In body animation, there are two basic ways to gain the required interactivity: use motion capture and additional techniques to rapidly modify or re-target movements to immediate needs,31 or write procedural code that allows program control over important movement parameters.32 The difficulty with the motion capture approach is maintaining environmental constraints such as solid foot contacts and proper reach, grasp, and observation interactions with the agent’s own body parts and other objects. To alleviate these problems, procedural approaches parameterize target locations, motion qualities, and other movement constraints to form a plausible movement directly. Procedural approaches consist of kinematic and dynamics techniques. Each has its preferred domain of applicability; kinematics is generally better for goal-directed activities and slower (controlled) actions, and dynamics is more natural for movements directed by application of forces, impacts, or high-speed behaviors.33 The wide range of human movement demands that both approaches have real-time implementations that can be procedurally selected as required.

Animating a human body form requires more than just controlling skeletal rotation angles. People are neither skeletons nor robots, and considerable human qualities arise from intelligent movement strategies, soft deformable surfaces, and clothing. Movement strategies include reach or constrained contacts, often achieved with goal-directed inverse kinematics.34 Complex workplaces, however, entail more complex planning to avoid collisions, find free paths, and optimize strength availability. The suppleness of human skin and the underlying tissue biomechanics lead to shape changes caused by internal muscle actions as well as external contact with the environment. Modeling and animating the local, muscle-based deformation of body surfaces in real time is possible through shape morphing techniques,35,36 but providing appropriate shape changes in response to external forces is a challenging problem. “Skin-tight” texture-mapped clothing is prevalent in computer game characters, but animating draped or flowing garments requires dynamic simulation, fast collision detection, and appropriate collision response.37,38
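For a flavor of the goal-directed inverse kinematics mentioned above, here is a standard two-link analytic solution in the plane (think shoulder and elbow). Real limb IK additionally handles joint limits, redundancy, and full 3D, which this sketch ignores; the link lengths and target are illustrative.

```python
# Sketch: analytic inverse kinematics for a planar two-link "arm".
# Given upper-arm and forearm lengths and a reachable target (x, y),
# return shoulder and elbow angles. Real systems add joint limits, 3D, etc.
import math

def two_link_ik(l1, l2, x, y):
    d2 = x * x + y * y
    # Law of cosines gives the elbow angle.
    cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if abs(cos_elbow) > 1.0:
        raise ValueError("target out of reach")
    elbow = math.acos(cos_elbow)                      # elbow-down solution
    # Shoulder angle: direction to target minus the offset from the bent elbow.
    shoulder = math.atan2(y, x) - math.atan2(l2 * math.sin(elbow),
                                             l1 + l2 * math.cos(elbow))
    return shoulder, elbow

s, e = two_link_ik(0.30, 0.25, 0.35, 0.20)
print(f"shoulder {math.degrees(s):.1f} deg, elbow {math.degrees(e):.1f} deg")
```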

Accordingly, animation systems build procedural models of these various behaviors and execute them on human models. The diversity of body movements involved has led to building more consistent agents: procedural animations that affect and control multiple body communication channels in coordinated ways.11,24,39,40 The particular challenge here is constructing computer graphics human models that balance sufficient articulation, detail, and motion generators to effect both gross and subtle movements with realism, real-time responsiveness, and visual acceptability. And if that isn’t enough, consider the additional difficulty of modeling a specific real individual. Computer graphics still lacks effective techniques to transfer even captured motion into features that characterize a specific person’s mannerisms and behaviors, though machine-learning approaches could prove promising.41

Implementing an animated human body is complicated by a relative paucity of generally available tools. Body models tend to be proprietary (for example, Extempo.com, Ananova.com), optimized for real time and thus limited in body structure and features (for example, DI-Guy, BDI.com, illustrated in Figure 3), or constructions for particular animations built with standard animator tools such as Poser, Maya, or 3DSMax. The best attempt to design a transportable, standard avatar is the Web3D Consortium’s H-Anim effort (www.h-anim.org). With well-defined body structure and feature sites, the H-Anim specification has engendered model sharing and testing not possible with proprietary approaches. The present liability is the lack of an application programming interface in the VRML language binding of H-Anim. A general API for human models is a highly desirable next step, the benefits of which have been demonstrated by Norman Badler’s research group’s use of the software API in Jack (www.ugs.com/products/efactory/jack), which allows feature access and provides plug-in extensions for new real-time behaviors.

Figure 3. PeopleShop and DI-Guy are used to create scenarios for ground combat training. This scenario was used at Ft. Benning to enhance situation awareness in experiments to train US Army officers for urban combat. Image courtesy of Boston Dynamics.

Face animation methods

A computer-animated human face can evoke a wide range of emotions in real people because faces are central to human reality. Unfortunately, modeling and rendering artifacts can easily produce a negative response in the viewer. The great complexity and psychological depth of the human response to faces causes difficulty in predicting the response to a given animated face model. The partial or minimalist rendering of a face can be pleasing as long as it maintains quality and accuracy in certain key dimensions. The ultimate goal is to analyze and synthesize humans with enough fidelity and control to pass the Turing test, create any kind of virtual being, and enable total control over its virtual appearance. Eventually, surviving technologies will be combined to increase accuracy and efficiency of the capture, linguistic, and rendering systems. Currently the approaches to animating the face are disjoint and driven by production costs and imperfect technology. Each method presents a distinct “look and feel,” as well as advantages and disadvantages.

Facial animation methods fall into three major categories. The first and earliest method is to manually generate keyframes and then automatically interpolate frames between the keyframes (or use less-skilled animators). This approach is used in traditional cel animation and in 3D animated feature films. Keyframe and morph target animation provides complete artistic control but can be time consuming to perfect.
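A minimal sketch of the interpolation step follows: blend vertex positions (or morph-target weights) linearly between two keyframes. Production systems typically use spline or ease-in/ease-out curves rather than a straight linear blend, and the "face" data here is a toy placeholder.

```python
# Sketch: linear interpolation between two facial keyframes (morph targets),
# represented here as flat lists of vertex coordinates. Production pipelines
# usually prefer spline or eased interpolation to a straight lerp.

def lerp_keyframes(key_a, key_b, t):
    """Blend two keyframes; t = 0 gives key_a, t = 1 gives key_b."""
    if len(key_a) != len(key_b):
        raise ValueError("keyframes must have the same topology")
    return [(1.0 - t) * a + t * b for a, b in zip(key_a, key_b)]

neutral = [0.0, 0.0, 0.0, 0.0]    # toy "neutral face" vertex data
smile   = [0.0, 0.4, 0.0, 0.4]    # toy "smile" morph target

# Generate the in-between frames an animator would otherwise draw by hand.
for frame in range(5):
    t = frame / 4
    print(frame, lerp_keyframes(neutral, smile, t))
```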

The second method is to synthesize facial movements from text or acoustic speech. A TTS algorithm, or an acoustic speech recognizer, provides a translation to phonemes, which are then mapped to visemes (visual phonemes). The visemes drive a speech articulation model that animates the face. The convincing synthesis of a face from text has yet to be accomplished. The state of the art provides understandable acoustic and visual speech and facial expressions.42,43

The third and most recent method for animating a face model is to measure human facial movements directly and then apply the motion data to the face model. The system can capture facial motions using one or more cameras and can incorporate face markers, structured light, laser range finders, and other face measurement modes. Each facial motion capture approach has limitations that might require postprocessing to overcome. The ideal motion-capture data representation supports sufficient detail without sacrificing editability (for example, MPEG-4 Facial Animation Parameters). The choice of modeling and rendering technologies ranges from 2D line drawings to physics-based 3D models with muscles, skin, and bone.44,45 Of course, textured polygons (nonuniform rational b-splines and subdivision surfaces) are by far the most common. A variety of surface deformation schemes exist that attempt to simulate the natural deformations of the human face while driven by external parameters.46,47
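A toy sketch of the phoneme-to-viseme step in the second method: collapse phonemes into a much smaller set of mouth shapes and emit one viseme per phoneme. The grouping below is a deliberately simplified assumption, not the MPEG-4 table or any engine's official mapping.

```python
# Toy sketch: map phonemes to visemes (visual phonemes) to drive a mouth model.
# The grouping is a simplification for illustration, not an official table.

VISEME_GROUPS = {
    "bilabial":    {"p", "b", "m"},        # lips closed
    "labiodental": {"f", "v"},             # lower lip against upper teeth
    "rounded":     {"ow", "uw", "w"},      # rounded lips
    "open":        {"aa", "ae", "ah"},     # open jaw
}

def phoneme_to_viseme(phoneme):
    for viseme, members in VISEME_GROUPS.items():
        if phoneme in members:
            return viseme
    return "rest"                          # fallback for silence or unlisted phonemes

# Phoneme string (e.g., from a TTS front end or a speech recognizer) for "mop".
phonemes = ["m", "aa", "p"]
print([phoneme_to_viseme(p) for p in phonemes])   # ['bilabial', 'open', 'bilabial']
```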

MPEG-4, which was designed for high-quality visual communication at low bit rates coupled with low-cost graphics rendering systems, offers one existing standard for human figure animation. It contains a comprehensive set of tools for representing and compressing content objects and the animation of those objects, and it treats virtual humans (faces and bodies) as a special type of object. The MPEG-4 Face and Body Animation standard provides anatomically specific locations and animation parameters. It defines Face Definition Parameter feature points and locates them on the face (see Figure 4). Some of these points only serve to help define the face’s shape. The rest of them are displaced by Facial Animation Parameters, which specify feature point displacements from the neutral face position. Some FAPs are descriptors for visemes and emotional expressions. Most remaining FAPs are normalized to be proportional to neutral-face mouth width, mouth-nose distance, eye separation, iris diameter, or eye-nose distance. Although MPEG-4 has defined a limited set of visemes and facial expressions, designers can specify two visemes or two expressions with a blend factor between the visemes and an intensity value for each expression. The normalization of the FAPs gives the face model designer freedom to create characters with any facial proportions, regardless of the source of the FAPs. They can embed MPEG-4 compliant face models into decoders, store them on CD-ROM, download them as an executable from a Web site, or build them into a Web browser.
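A small sketch of what that normalization buys the model designer, assuming the usual convention that a FAP unit is a fixed fraction (here taken as 1/1024) of a neutral-face distance such as mouth width: the same encoded FAP stream displaces feature points proportionally on faces with different proportions. The FAP value and mouth widths are illustrative.

```python
# Sketch: apply a normalized Facial Animation Parameter (FAP) to face models
# with different proportions. Assumes the common convention that one FAP unit
# equals 1/1024 of a neutral-face distance (here, mouth width).

def fap_to_displacement(fap_value, neutral_mouth_width, fapu_divisor=1024.0):
    """Convert a mouth-width-normalized FAP value into model units."""
    fapu = neutral_mouth_width / fapu_divisor
    return fap_value * fapu

# The same FAP stream drives a narrow-mouthed and a wide-mouthed model;
# each gets a displacement proportional to its own neutral-face geometry.
fap_stretch_lip_corner = 200           # illustrative FAP value
for name, mouth_width in [("narrow_face", 40.0), ("wide_face", 70.0)]:
    d = fap_to_displacement(fap_stretch_lip_corner, mouth_width)
    print(f"{name}: displace lip corner by {d:.2f} model units")
```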

Integration challenges

Integrating all the various elements described here into a virtual human is a daunting task. It is difficult for any single research group to do it alone. Reusable tools and modular architectures would be an enormous benefit to virtual human researchers, letting them leverage each other’s work. Indeed, some research groups have begun to share tools, and several standards have recently emerged that will further encourage sharing. However, we must confront several difficult issues before we can readily plug-and-play different modules to control a virtual human’s behavior. Two key issues discussed at the workshop were consistency and timing of behavior.

Consistency

When combining a variety of behavioral components, one problem is maintaining consistency between the agent’s internal state (for example, goals, plans, and emotions) and the various channels of outward behavior (for example, speech and body movements). When real people present multiple behavior channels, we interpret them for consistency, honesty, and sincerity, and for social roles, relationships, power, and intention. When these channels conflict, the agent might simply look clumsy or awkward, but it could appear insincere, confused, conflicted, emotionally detached, repetitious, or simply fake. To an actor or an expert animator, this is obvious. Bad actors might fail to control gestures or facial expressions to portray the demeanor of their persona in a given situation. The actor might not have internalized the character’s goals and motivations enough to use the body’s own machinery to manifest these inner drives as appropriate behaviors. A skilled animator (and actor) knows that all aspects of a character must be consistent with its desired mental state because we can control only voice, body shape, and movement for the final product. We cannot open a dialog with a pre-animated character to further probe its mind or its psychological state. With a real-time embodied agent, however, we might indeed have such an opportunity.

One approach to remedying this problem is to explicitly coordinate the agent’s internal state with the expression of body movements in all possible channels. For example, Norman Badler’s research group has been building a system, EMOTE, to parameterize and modulate action performance.24 It is based on Laban Movement Analysis, a human movement observation system. EMOTE is not an action selector per se; it is used to modify the execution of a given behavior and thus change its movement qualities or character. EMOTE’s power arises from the relatively small number of parameters that control or affect a much larger set, and from new extensions to the original definitions that include non-articulated face movements. The same set of parameters controls many aspects of manifest behavior across the agent’s body and therefore permits experimentation with similar or dissimilar settings. The hypothesis is that behaviors manifest in separate channels with similar EMOTE parameters will appear consistent with some internal state of the agent; conversely, dissimilar EMOTE parameters will convey various negative impressions of the character’s internal consistency. Most computer-animated agents provide direct evidence for the latter view (a toy sketch of sharing one parameter set across channels follows the list):

• Arm gestures without facial expressions look odd.
• Facial expressions with neutral gestures look artificial.
• Arm gestures without torso involvement look insincere.
• Attempts at emotions in gait variations look funny without concomitant body and facial affect.
• Otherwise carefully timed gestures and speech fail to register with gesture performance and facial expressions.
• Repetitious actions become irritating because they appear unconcerned about our changing (more negative) feelings about them.
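To illustrate the consistency hypothesis, the toy sketch below (this is not EMOTE, just an illustration of the general idea) scales arm, torso, and face activity from one shared pair of movement-quality parameters, so mismatched channels can arise only if different parameter sets are deliberately supplied. The parameter names and scaling factors are assumptions.

```python
# Toy illustration of driving several output channels from one shared set of
# movement-quality parameters (in the spirit of, but not identical to, EMOTE).
from dataclasses import dataclass

@dataclass
class MovementQualities:
    energy: float      # 0 = limp, 1 = vigorous
    expanse: float     # 0 = contained, 1 = expansive

def modulate_channels(base_gesture_amplitude, q: MovementQualities):
    """Scale arm, torso, and face activity from the same parameter set."""
    return {
        "arm_amplitude": base_gesture_amplitude * (0.5 + q.energy),
        "torso_sway":    0.2 * q.expanse + 0.1 * q.energy,
        "brow_raise":    0.6 * q.energy,
        "gesture_tempo": 0.8 + 0.6 * q.energy,
    }

subdued  = MovementQualities(energy=0.2, expanse=0.1)
animated = MovementQualities(energy=0.9, expanse=0.8)

print(modulate_channels(1.0, subdued))    # all channels consistently restrained
print(modulate_channels(1.0, animated))   # all channels consistently lively
```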

Timing

In working together toward a unifying architecture, timing emerged as a central concern at the workshop. A virtual human’s behavior must unfold over time, subject to a variety of temporal constraints. For example, speech-related gestures must closely follow the voice cadence. It became obvious during the workshop that previous work focused on a specific aspect of behavior (for example, speech, reactivity, or emotion), leading to architectures that are tuned to a subset of timing constraints and cannot straightforwardly incorporate others. During the final day of the workshop, we struggled with possible architectures that might address this limitation.

For example, BEAT schedules speech-related body movements using a pipelined architecture: a text-to-speech system generates a fixed timeline to which a subsequent gesture scheduler must conform. Essentially, behavior is a slave to the timing constraints of the speech synthesis tool. In contrast, systems that try to physically convey a sense of emotion or personality often work by altering the time course of gestures. For example, EMOTE works later in the pipeline, taking a previously generated sequence of gestures and shortening or drawing them out for emotional effect. Essentially, behavior is a slave to the constraints of emotional dynamics. Finally, some systems have focused on making the character highly reactive and embedded in the synthetic environment. For example, Mr. Bubb of Zoesis Studios (see Figure 5) is tightly responsive to unpredictable and continuous changes in the environment (such as mouse movements or bouncing balls). In such systems, behavior is a slave to environmental dynamics. Clearly, if these various capabilities are to be combined, we must reconcile these different approaches.

Figure 5. Mr. Bubb is an interactive character developed by Zoesis Studios that reacts continuously to the user’s social interactions during a cooperative play experience. Image courtesy of Zoesis Studios.

Figure 4. The set of MPEG-4 Face Definition Parameter (FDP) feature points. (The figure marks feature points on the right eye, left eye, nose, mouth, teeth, and tongue, distinguishing feature points affected by FAPs from other feature points.)

One outcome of the workshop was a number of promising proposals for reconciling these competing constraints. At the very least, much more information must be shared between components in the pipeline. For example, if BEAT had more access to timing constraints generated by EMOTE, it could do a better job of up-front scheduling. Another possibility would be to specify all of the constraints explicitly and devise an animation system flexible enough to handle them all, an approach the motion graph technique suggests.48 Norman Badler suggests an interesting pipeline architecture that consists of “fat” pipes with weak uplinks. Modules would send down considerably more information (and possibly multiple options) and could poll downstream modules for relevant information (for example, how long would it take to look at the ball, given its current location). Exploring these and other alternatives is an important open problem in virtual human research.
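A speculative sketch of that "fat pipe with weak uplinks" idea follows; the module and method names are invented, and the duration table stands in for a real animation engine's estimate. An upstream scheduler sends several options downstream and may poll the animation module for a duration estimate before committing to one.

```python
# Speculative sketch of a "fat pipe with weak uplinks": upstream modules pass
# rich, optional information downstream and may poll downstream modules for
# estimates. All names here are invented for illustration.

class AnimationModule:
    def estimate_duration(self, behavior, context):
        """Weak uplink: answer a query without taking control of scheduling."""
        table = {"glance_at_ball": 0.4, "turn_and_look_at_ball": 1.1}
        # A real module would use the ball's current location from `context`.
        return table[behavior]

    def execute(self, behavior):
        print(f"animating: {behavior}")

class GestureScheduler:
    def __init__(self, downstream):
        self.downstream = downstream

    def plan_look_at_ball(self, time_available, context):
        # Fat pipe: consider multiple options rather than a single fixed command.
        options = ["turn_and_look_at_ball", "glance_at_ball"]
        for option in options:
            if self.downstream.estimate_duration(option, context) <= time_available:
                return option
        return None

scheduler = GestureScheduler(AnimationModule())
choice = scheduler.plan_look_at_ball(time_available=0.5, context={"ball": (1.0, 2.0)})
scheduler.downstream.execute(choice)     # falls back to the quick glance
```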

The future of androids remains to be seen, but realistic interactive virtual humans will almost certainly populate our near future, guiding us toward opportunities to learn, enjoy, and consume. The move toward sharable tools and modular architectures will certainly hasten this progress, and, although significant challenges remain, work is progressing on multiple fronts. The emergence of animation standards such as MPEG-4 and H-Anim has already facilitated the modular separation of animation from behavioral controllers and sparked the development of higher-level extensions such as the Affective Presentation Markup Language. Researchers are already sharing behavioral models such as BEAT and EMOTE. We have outlined only a subset of the many issues that arise, ignoring many of the more classical AI issues such as perception, planning, and learning. Nonetheless, we have highlighted the considerable recent progress towards interactive virtual humans and some of the key challenges that remain. Assembling a new virtual human is still a daunting task, but the building blocks are getting bigger and better every day.

Acknowledgments

We are indebted to Bill Swartout for developing the initial proposal to run a Virtual Human workshop, to Jim Blake for helping us arrange funding, and to Julia Kim and Lynda Strand for handling all the logistics and support, without which the workshop would have been impossible. We gratefully acknowledge the Department of the Army for their financial support under contract number DAAD 19-99-D-0046. We also acknowledge the support of the US Air Force F33615-99-D-6001 (D0 #8), Office of Naval Research K-5-55043/3916-1552793, NSF IIS 99-00297, and NASA NRA NAG 9-1279. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of these agencies.

References

1. W.L. Johnson, J.W. Rickel, and J.C. Lester, “Animated Pedagogical Agents: Face-to-Face Interaction in Interactive Learning Environments,” Int’l J. AI in Education, vol. 11, Nov. 2000, pp. 47–78.

2. S.C. Marsella, W.L. Johnson, and C. LaBore, “Interactive Pedagogical Drama,” Proc. 4th Int’l Conf. Autonomous Agents, ACM Press, New York, 2000, pp. 301–308.

3. E. André et al., “The Automated Design of Believable Dialogues for Animated Presentation Teams,” Embodied Conversational Agents, MIT Press, Cambridge, Mass., 2000, pp. 220–255.

4. J. Cassell et al., “Human Conversation as a System Framework: Designing Embodied Conversational Agents,” Embodied Conversational Agents, MIT Press, Cambridge, Mass., 2000, pp. 29–63.

5. J.E. Laird, “It Knows What You’re Going To Do: Adding Anticipation to a Quakebot,” Proc. 5th Int’l Conf. Autonomous Agents, ACM Press, New York, 2001, pp. 385–392.

6. S. Yoon, B. Blumberg, and G.E. Schneider, “Motivation Driven Learning for Interactive Synthetic Characters,” Proc. 4th Int’l Conf. Autonomous Agents, ACM Press, New York, 2000, pp. 365–372.

7. J. Cassell, “Nudge, Nudge, Wink, Wink: Elements of Face-to-Face Conversation for Embodied Conversational Agents,” Embodied Conversational Agents, MIT Press, Cambridge, Mass., 2000, pp. 1–27.

8. J. Cassell, M. Stone, and H. Yan, “Coordination and Context-Dependence in the Generation of Embodied Conversation,” Proc. INLG 2000, 2000.

9. M.T. Maybury, ed., Intelligent Multimedia Interfaces, AAAI Press, Menlo Park, Calif., 1993.

10. B. O’Conaill and S. Whittaker, “Characterizing, Predicting, and Measuring Video-Mediated Communication: A Conversational Approach,” Video-Mediated Communication: Computers, Cognition, and Work, Lawrence Erlbaum Associates, Mahwah, N.J., 1997, pp. 107–131.

11. J. Cassell, H. Vilhjálmsson, and T. Bickmore, “BEAT: The Behavior Expression Animation Toolkit,” to be published in Proc. SIGGRAPH ’01, ACM Press, New York, 2001.

12. M. Johnston et al., “Unification-Based Multimodal Integration,” Proc. 35th Ann. Meeting Assoc. Computational Linguistics (ACL 97/EACL 97), Morgan Kaufmann, San Francisco, 2001, pp. 281–288.

13. P. Taylor et al., “The Architecture of the Festival Speech Synthesis System,” 3rd ESCA Workshop Speech Synthesis, 1998.

14. C. Pelachaud et al., “Embodied Agent in Information Delivering Application,” Proc. Autonomous Agents and Multiagent Systems, ACM Press, New York, 2002.

15. J.C. Lester et al., “Deictic and Emotive Communication in Animated Pedagogical Agents,” Embodied Conversational Agents, MIT Press, Cambridge, Mass., 2000, pp. 123–154.

16. A. Ortony, G.L. Clore, and A. Collins, The Cognitive Structure of Emotions, Cambridge Univ. Press, Cambridge, UK, 1988.

17. S. Marsella and J. Gratch, “A Step Towards Irrationality: Using Emotion to Change Belief,” to be published in Proc. 1st Int’l Joint Conf. Autonomous Agents and Multi-Agent Systems, ACM Press, New York, 2002.

18. M.S. El-Nasr, T.R. Ioerger, and J. Yen, “Peteei: A Pet with Evolving Emotional Intelligence,” Proc. 3rd Int’l Conf. Autonomous Agents, ACM Press, New York, 1999, pp. 9–15.

19. S. Marsella and J. Gratch, “Modeling the Interplay of Emotions and Plans in Multi-Agent Simulations,” Proc. 23rd Ann. Conf. Cognitive Science Society, Lawrence Erlbaum, Mahwah, N.J., 2001, pp. 294–599.

20. A. Sloman, “Motives, Mechanisms and Emotions,” Cognition and Emotion, The Philosophy of Artificial Intelligence, Oxford Univ. Press, Oxford, UK, 1990, pp. 231–247.

21. J. Gratch, “Emile: Marshalling Passions in Training and Education,” Proc. 4th Int’l Conf. Autonomous Agents, ACM Press, New York, 2000.

22. G. Collier, Emotional Expression, Lawrence Erlbaum, Hillsdale, N.J., 1985.

23. G. Ball and J. Breese, “Emotion and Personality in a Conversational Character,” Embodied Conversational Agents, MIT Press, Cambridge, Mass., 2000, pp. 189–219.

24. D. Chi et al., “The EMOTE Model for Effort and Shape,” Proc. ACM SIGGRAPH ’00, ACM Press, New York, 2000, pp. 173–182.

25. M. Schröder, “Emotional Speech Synthesis: A Review,” Proc. Eurospeech 2001, ISCA, Bonn, Germany, 2001, pp. 561–564.

26. E. Hovy, “Some Pragmatic Decision Criteria in Generation,” Natural Language Generation, Martinus Nijhoff Publishers, Dordrecht, The Netherlands, 1987, pp. 3–17.


27. M. Walker, J. Cahn, and S.J. Whittaker, “Improving Linguistic Style: Social and Affective Bases for Agent Personality,” Proc. Autonomous Agents ’97, ACM Press, New York, 1997, pp. 96–105.

28. D. Moffat, “Personality Parameters and Programs,” Creating Personalities for Synthetic Actors, Springer Verlag, New York, 1997, pp. 120–165.

29. H. Prendinger and M. Ishizuka, “Social Role Awareness in Animated Agents,” Proc. 5th Conf. Autonomous Agents, ACM Press, New York, 2001, pp. 270–377.

30. A. Paiva et al., “Satira: Supporting Affective Interactions in Real-Time Applications,” CAST 2001: Living in Mixed Realities, FhG-ZPS, Schloss Birlinghoven, Germany, 2001, pp. 227–230.

31. M. Gleicher, “Comparing Constraint-Based Motion Editing Methods,” Graphical Models, vol. 62, no. 2, Feb. 2001, pp. 107–134.

32. N. Badler, C. Phillips, and B. Webber, Simulating Humans: Computer Graphics, Animation, and Control, Oxford Univ. Press, New York, 1993.

33. M. Raibert and J. Hodgins, “Animation of Dynamic Legged Locomotion,” ACM Siggraph, vol. 25, no. 4, July 1991, pp. 349–358.

34. D. Tolani, A. Goswami, and N. Badler, “Real-Time Inverse Kinematics Techniques for Anthropomorphic Limbs,” Graphical Models, vol. 62, no. 5, May 2000, pp. 353–388.

35. J. Lewis, M. Cordner, and N. Fong, “Pose Space Deformations: A Unified Approach to Shape Interpolation and Skeleton-Driven Deformation,” ACM Siggraph, July 2000, pp. 165–172.

36. P. Sloan, C. Rose, and M. Cohen, “Shape by Example,” Symp. Interactive 3D Graphics, Mar. 2001, pp. 135–144.

37. D. Baraff and A. Witkin, “Large Steps in Cloth Simulation,” ACM Siggraph, July 1998, pp. 43–54.

38. N. Magnenat-Thalmann and P. Volino, Dressing Virtual Humans, Morgan Kaufmann, San Francisco, 1999.

39. K. Perlin, “Real Time Responsive Animation with Personality,” IEEE Trans. Visualization and Computer Graphics, vol. 1, no. 1, Jan. 1995, pp. 5–15.

40. M. Byun and N. Badler, “FacEMOTE: Qualitative Parametric Modifiers for Facial Animations,” Symp. Computer Animation, ACM Press, New York, 2002.

41. M. Brand and A. Hertzmann, “Style Machines,” ACM Siggraph, July 2000, pp. 183–192.

42. E. Cosatto and H.P. Graf, “Photo-Realistic Talking-Heads from Image Samples,” IEEE Trans. Multimedia, vol. 2, no. 3, Mar. 2000, pp. 152–163.

43. T. Ezzat, G. Geiger, and T. Poggio, “Trainable Videorealistic Speech Animation,” to be published in Proc. ACM Siggraph, 2002.

44. D. Terzopoulos and K. Waters, “Analysis and Synthesis of Facial Images Using Physical and Anatomical Models,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15, no. 6, June 1993, pp. 569–579.

45. F.I. Parke and K. Waters, Computer Facial Animation, AK Peters, Wellesley, Mass., 1996.

46. T.K. Capin, E. Petajan, and J. Ostermann, “Efficient Modeling of Virtual Humans in MPEG-4,” IEEE Int’l Conf. Multimedia, 2000, pp. 1103–1106.

47. T.K. Capin, E. Petajan, and J. Ostermann, “Very Low Bitrate Coding of Virtual Human Animation in MPEG-4,” IEEE Int’l Conf. Multimedia, 2000, pp. 1107–1110.

48. L. Kovar, M. Gleicher, and F. Pighin, “Motion Graphs,” to be published in Proc. ACM Siggraph 2002, ACM Press, New York.

Jonathan Gratch is a project leader for the stress and emotion project at the University of Southern California’s Institute for Creative Technologies and is a research assistant professor in the department of computer science. His research interests include cognitive science, emotion, planning, and the use of simulation in training. He received his PhD in computer science from the University of Illinois. Contact him at the USC Inst. for Creative Technologies, 13274 Fiji Way, Marina del Rey, CA 90292; [email protected]; www.ict.usc.edu/~gratch.

Jeff Rickel is a project leader at the University of Southern California’s Information Sciences Institute and a research assistant professor in the department of computer science. His research interests include intelligent agents for education and training, especially animated agents that collaborate with people in virtual reality. He received his PhD in computer science from the University of Texas at Austin. Contact him at the USC Information Sciences Inst., 4676 Admiralty Way, Ste. 1001, Marina del Rey, CA 90292; [email protected]; www.isi.edu/~rickel.

Elisabeth André is a full professor at the University of Augsburg, Germany. Her research interests include embodied conversational agents, multimedia interfaces, and the integration of vision and natural language. She received her PhD from the University of Saarbrücken, Germany. Contact her at Lehrstuhl für Multimedia Konzepte und Anwendungen, Institut für Informatik, Universität Augsburg, Eichleitnerstr. 30, 86135 Augsburg, Germany; [email protected]; www.informatik.uni-augsburg.de/~lisa.

Justine Cassell is an associate professor at the MIT Media Lab, where she directs the Gesture and Narrative Language Research Group. Her research interests are bringing knowledge about verbal and nonverbal aspects of human conversation and storytelling to computational systems design, and the role that technologies—such as a virtual storytelling peer that has the potential to encourage children in creative, empowered, and independent learning—play in children’s lives. She holds a master’s degree in literature from the Université de Besançon (France), a master’s degree in linguistics from the University of Edinburgh (Scotland), and a dual PhD from the University of Chicago in psychology and linguistics. Contact her at [email protected].

Eric Petajan is chief scientist and founder of face2face animation, and chaired the MPEG-4 Face and Body Animation (FBA) group. Prior to forming face2face, he was a Bell Labs researcher where he developed facial motion capture, HD video coding, and interactive graphics systems. Starting in 1989, he was a leader in the development of HDTV technology and standards leading up to the US HDTV Grand Alliance. He received a PhD in EE in 1984 and an MS in physics from the University of Illinois, where he built the first automatic lipreading system. He is also associate editor of the IEEE Transactions on Circuits and Systems for Video Technology. Contact him at [email protected].

Norman Badler is a professor of computer and information science at the University of Pennsylvania and has been on that faculty since 1974. Active in computer graphics since 1968, his research focuses on human figure modeling, real-time animation, embodied agent software, and computational connections between language and action. He received a BA in creative studies mathematics from the University of California at Santa Barbara in 1970, an MSc in mathematics in 1971, and a PhD in computer science in 1975, both from the University of Toronto. He directs the Center for Human Modeling and Simulation and is the Associate Dean for Academic Affairs in the School of Engineering and Applied Science at the University of Pennsylvania. Contact him at [email protected].