

SYNTHESIS AND CONTROL OF HIGH RESOLUTION FACIAL EXPRESSIONS FOR VISUAL INTERACTIONS

Chan-Su Lee, Ahmed Elgammal, and Dimitris Metaxas

Rutgers University, Computer Science Department

New Brunswick, NJ 08854, USA

ABSTRACT

The synthesis of facial expressions with control of intensity and personal style is important for intelligent and affective human-computer interaction, especially in face-to-face interaction between humans and intelligent agents. We present a facial expression animation system that facilitates control of expressiveness and style. We learn a decomposable generative model for the nonlinear deformation of facial expressions by analyzing the mapping space between a low dimensional embedded representation and high resolution tracking data. Bilinear analysis of the mapping space provides a compact representation of the nonlinear generative model for facial expressions. The decomposition allows the synthesis of new facial expressions by controlling geometry and expression style. The generative model provides control of expressiveness while preserving the nonlinear deformation of the expressions with simple parameters, and it allows the synthesis of stylized facial geometry. In addition, we can directly extract MPEG-4 Facial Animation Parameters (FAPs) from the synthesized data, which allows any animation engine that supports FAPs to animate the newly synthesized expressions.

1. INTRODUCTION

In intelligent and affective computing, computers need the ability not only to recognize human emotions but also to express emotions that convey affect to humans [1]. Conversational agents are frequently used in affective computing to express emotions, even though computers with very different bodies can also be used to display emotions [2]. Facial expressions, however, are the main communication channel for emotions in face-to-face interactions. We present a facial animation system for effective emotion communication by intelligent agents.

It is important to generate the subtle details of facial expressions that convey personality and intensity. We want to generate facial expressions whose appearance varies in expressiveness and in the personality of the emotion expression. Our research is related to emotional appearance and to multiple levels of emotion generation in affective computing [3]. The challenging aspects of realistic facial expression synthesis are capturing the details of facial expressions, including small nuances [4], in different persons and generating new stylized expressions with control of expressiveness. We learn a nonlinear mapping of facial motions in order to capture the nonlinear deformations in facial expressions. The decomposition of the mapping space using a bilinear model allows us to parameterize the subtle expression characteristics of different people while achieving realistic facial expression synthesis. We present the representation of facial expressions using a generative model, in addition to the high resolution tracking, in Sec. 2.

We provide control of intensity and personality in the synthesized facial expressions. Exaggerating the differences between an expressive face and a neutral face enhances the intensity of the given expression [5]. Exaggerating the differences between a person's face geometry and a standard face geometry synthesizes a stylized face. We also generate new stylized expressions by combining existing styles. Scaling the facial motion provides control of expressiveness, i.e., the intensity of the facial expression. In Sec. 3.1 and 3.2, we explain the synthesis of facial expressions with expressiveness control and stylized facial geometry, and the estimation of the animation parameters using the proposed generative model.

From the synthesized facial animation, we extract MPEG-4 facial animation parameters (FAPs), low resolution animation parameters that drive facial animation in other animation engines for conversational agents. As we have very high resolution tracking data for the facial expressions and synthesize the facial nodal points, we can extract FAPs after identifying the feature points for parameter extraction on the generic model used for tracking. The extracted FAPs are used to synthesize facial animation in animation software that supports the MPEG-4 FAP format. Synthesis results of facial expressions using FAPs are presented in Sec. 3.3.

Related Work: The synthesis of facial expressions is performed by modeling facial expressions with generative models and controlling the model parameters. Researchers have used linear models (PCA [6]) and variations such as bilinear models [7, 8] and multilinear tensor models for facial expression analysis and synthesis [8, 9]. However, a major limitation of such models is that they model facial expressions in linear subspaces. In this paper, we analyze expressions in a nonlinear mapping space and synthesize new facial expressions based on a decomposition of that mapping space.

Face exaggeration was investigated by Brennan [10] in computer graphics, who developed a caricature generator from the user's line-drawing input with manual correspondence. Exaggeration was controlled by scaling the differences in the corresponding points. Computer vision techniques have been applied to extract the features automatically [11]. There are few works related to 3D facial caricature and facial expression exaggeration. We provide stylized facial expressions that support caricature of face geometry and exaggeration of 3D facial motion.

2. GENERATIVE MODELS FOR HIGH RESOLUTION FACIAL EXPRESSION SYNTHESIS

We achieve high resolution tracking of facial expressions from dense clouds of 3D range data using harmonic maps [12]. The tracking provides efficient non-rigid 3D motion estimates with correspondences between different frames of the same sequence as well as between different sequences.

2.1. Learning the nonlinear mapping space of facial expressions

Facial motion can be described by the displacements of the 3D facial nodal points of the generic face geometry used for tracking. As a result of high resolution tracking with one-to-one correspondence across frames, we can represent facial expressions by motion vectors for the vertices of a generic face mesh. Let v_t ∈ R^{3N×1} be the locations of the 3D points at time instance t, representing N facial nodal points in 3-dimensional space, where N is the number of nodal points in a dense generic facial model. The trajectory of the 3D nodal points is the combination of rigid head motion and facial motion, which can be described as

    v_t = T_{α_t} y_t = T_{α_t}(y_0 + m_t) = T_{α_t}(g + m_t),    (1)

where T_{α_t} is the head motion at time t, y_t contains the 3 × N face nodal point locations at time t in face-centered coordinates, and y_0 = g is the facial geometry in the initial frame. We assume that the captured facial expression starts from a neutral face. The global rigid transformation parameters α_t come from the tracking results.
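To make the representation concrete, the following minimal sketch (our own illustration, not the authors' code) recovers the facial motion vector m_t of Eq. (1) from tracked points, assuming the rigid head motion T_{α_t} is given as a rotation R_t and a translation t_t estimated by the tracker.

```python
import numpy as np

def motion_from_tracking(v_t, R_t, t_t, g):
    """Invert Eq. (1), v_t = T_{alpha_t}(g + m_t), to recover the
    non-rigid facial motion m_t from tracked nodal points.

    v_t : (N, 3) tracked 3D nodal points at time t (world coordinates)
    R_t : (3, 3) head rotation, t_t : (3,) head translation (assumed form of T)
    g   : (N, 3) neutral face geometry in face-centered coordinates
    """
    y_t = (v_t - t_t) @ R_t   # undo the rigid motion: each row is R^T (v - t)
    return y_t - g            # subtract the neutral geometry -> expression motion
```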

Tracking data are collected from multiple people to learn stylized facial expression models. The local-coordinate displacement of every facial nodal point in the tracking data of expression style s can be described as

    v^s_t = T_{α^s}(g^s + m^s_t),    (2)

where T_{α^s} is the global transformation due to head motion, which depends on expression style and type, g^s is the facial geometry of each person, and m^s_t is the facial expression motion.

The main problem in facial expression animation is how to model and control the facial expression motion m^s_t and the facial geometry g^s. Both depend on the person. The facial motion undergoes nonlinear deformations and is of high dimension. We derive a low dimensional representation of the facial motion using conceptual manifold embedding. A unit circle is used as the embedding of a facial expression, which is cyclic in the sense that the expression changes from neutral expression → target expression → neutral expression in the tracking data set. The conceptual manifold is homeomorphic to the actual data-driven manifold obtained by nonlinear dimensionality reduction such as LLE (locally linear embedding) [13], which finds an intrinsic configuration representation in a low dimensional space [14].
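As an illustration of the conceptual embedding, one simple choice (an assumption on our part; the paper only specifies that the embedding manifold is the unit circle) is to place the frames of a neutral → target → neutral sequence at equally spaced angles:

```python
import numpy as np

def embed_on_unit_circle(num_frames):
    """Conceptual embedding of a cyclic expression sequence
    (neutral -> target -> neutral) on the unit circle.
    Equal angular spacing over time is an illustrative assumption."""
    theta = np.linspace(0.0, 2.0 * np.pi, num_frames, endpoint=False)
    return np.stack([np.cos(theta), np.sin(theta)], axis=1)  # (num_frames, 2)
```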

Given a distinctive facial motion sequence M^s = [m^s_1 m^s_2 ⋯ m^s_{N_s}]^T and its embedding X^s = [x^s_1 x^s_2 ⋯ x^s_{N_s}]^T, we can learn a nonlinear mapping function f^s(x) that satisfies f^s(x_i) = m^s_i, i = 1 ⋯ N_s, where N_s is the number of captured motion frames for style s. Generalized radial basis function (GRBF) interpolation [15] is used to learn a mapping of the form f^s(x) = B^s ψ(x), where each row of the matrix B^s contains the interpolation coefficients for the corresponding element of the motion vector, i.e., we have d simultaneous interpolation functions, each from 2D to 1D. The mapping coefficients can be obtained by solving the linear system

    [m^s_1 ⋯ m^s_{N_s}] = B^s [ψ(x^s_1) ⋯ ψ(x^s_{N_s})].    (3)

The mapping function contains all the information needed to generate new interpolated motion for a given x_t. For a given kernel ψ(x), the matrix B^s captures the facial motion characteristics of expression style s. As a result, the facial expression of person style s can be represented by

    v^s_t = T_{α^s}(g^s + B^s ψ(x_t)).    (4)

However, this model requires the high dimensional parameters g^s and B^s for each person in order to generate a new facial expression.
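The sketch below shows how the mapping coefficients B^s of Eq. (3) and the motion term of Eq. (4) could be computed. The Gaussian kernel, the placement of the kernel centers, and the least-squares solver are our assumptions, standing in for the GRBF interpolation of [15].

```python
import numpy as np

def rbf_features(X, centers, sigma=0.5):
    """psi(x): Gaussian radial basis functions of the 2D embedding points
    X (T, 2) with respect to kernel centers (K, 2).  Kernel choice and
    center placement (e.g., points on the unit circle) are illustrative."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (T, K)
    return np.exp(-d2 / (2.0 * sigma ** 2)).T                  # (K, T)

def learn_mapping(M, X, centers):
    """Solve Eq. (3): [m_1 ... m_Ns] = B [psi(x_1) ... psi(x_Ns)].
    M : (T, d) motion vectors (d = 3N), X : (T, 2) embedding points."""
    Psi = rbf_features(X, centers)                       # (K, T)
    B = np.linalg.lstsq(Psi.T, M, rcond=None)[0].T       # (d, K)
    return B

def synthesize_motion(B, x, centers):
    """Motion term of Eq. (4): m(x) = B psi(x) at one embedding point x (2,)."""
    return B @ rbf_features(x[None, :], centers)[:, 0]
```

Applying learn_mapping once per person yields one coefficient matrix B^s per expression style, which is exactly the quantity that is decomposed in Sec. 2.2.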

2.2. Decomposition of facial geometry and facial motion

We achieve a compact and orthogonal representation of the face geometry and of the facial motion mapping space using bilinear analysis. We represent the matrix B^s as a vector b^s by column stacking and collect the mapping vectors as

    F = [b^1 b^2 ⋯ b^{N_s}],    (5)

where N_s is the number of facial expression styles. By applying an asymmetric bilinear model [7], we can represent each mapping vector using a basis E and a motion style vector s^s_b as follows:

    F^s = E s^s_b.    (6)

Page 3: SYNTHESIS AND CONTROL OF HIGH RESOLUTION FACIAL EXPRESSIONS FOR VISUAL INTERACTIONSelgammal/pub/Lee06ICMESynthesis... · 2006-10-09 · SYNTHESIS AND CONTROL OF HIGH RESOLUTION FACIAL

We apply the same bilinear analysis to the facial geometry and obtain a representation of each person's geometry based on a geometry basis D_g and a geometry style vector s^s_g as follows:

    C^s = D_g s^s_g,    (7)

where the dimension of the style vector s^s_g, a compact representation of the person's geometry, is N_s (≪ 3N). Now the decomposable nonlinear generative model can be expressed as

    v^s(t) = T_{α^s_t}( D_g s^s_g + unstacking(E s^s_b) ψ(x(t)) ),    (8)

where unstacking(·) converts the vectorized representation back to the original matrix by unstacking its columns. The generative model captures the nonlinearity of the mapping space and provides compact control parameters for expression style and facial geometry variations in facial expression synthesis. In addition, it supports the generation of stylized geometry and facial expressions.
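A minimal sketch of how the asymmetric bilinear decomposition of Eqs. (5)-(7) could be fitted. Using the SVD of the stacked style matrix follows the standard recipe for asymmetric bilinear models [7]; the truncation choice and variable names are ours.

```python
import numpy as np

def fit_asymmetric_bilinear(F, k=None):
    """Fit F = [b^1 ... b^Ns] (columns are column-stacked mapping matrices,
    Eq. (5)) with an asymmetric bilinear model b^s ~= E s_b^s."""
    U, S, Vt = np.linalg.svd(F, full_matrices=False)
    k = k or F.shape[1]
    E = U[:, :k]                          # orthogonal basis (content factor)
    styles = (np.diag(S[:k]) @ Vt[:k]).T  # row s is the style vector s_b^s
    return E, styles
```

The same routine applied to the matrix of stacked per-person geometries g^s gives the geometry basis D_g and the geometry style vectors s^s_g of Eq. (7).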

3. SYNTHESIS OF STYLIZED AND EXPRESSIVE FACIAL EXPRESSIONS

New facial expressions can be generated from new geometry style vectors and new facial motion style vectors. A linear weighting of the existing style vectors can be used to generate a new style vector for synthesis:

    s^{new}_g = α_1 s^1_g + α_2 s^2_g + ⋯ + α_{N_s} s^{N_s}_g,
    s^{new}_b = β_1 s^1_b + β_2 s^2_b + ⋯ + β_{N_s} s^{N_s}_b,    (9)

where Σ_i α_i = 1 and Σ_j β_j = 1. The weight α_i controls the contribution of geometry style s^i_g, and β_j specifies the weight of facial motion style s^j_b. A new facial expression can then be generated using the new styles as

    v^{new}(t) = T_{α^{new}_t}( D_g s^{new}_g + unstacking(E s^{new}_b) ψ(x(t)) ).    (10)

We can add additional parameters to synthesize more stylized geometry and to control expressiveness.
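The following sketch illustrates Eqs. (9) and (10): blending existing style vectors and synthesizing one frame. The rigid head motion T is omitted, and column-major unstacking is our assumption about how b^s was stacked in Eq. (5).

```python
import numpy as np

def blend_styles(styles, weights):
    """Eq. (9): a new style vector as a convex combination of existing ones.
    styles : (Ns, k) with row i = s^i, weights : (Ns,) summing to one."""
    w = np.asarray(weights, dtype=float)
    return (w / w.sum()) @ styles

def synthesize_frame(D_g, s_g_new, E, s_b_new, psi_x, d):
    """Eq. (10) without the rigid transform: new geometry plus new motion
    evaluated at one embedding point.  psi_x : (K,) kernel vector, d = 3N."""
    K = psi_x.shape[0]
    B_new = (E @ s_b_new).reshape((d, K), order="F")  # unstack the columns
    return D_g @ s_g_new + B_new @ psi_x
```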

3.1. Stylized facial geometry synthesis

A new stylized facial geometry can be synthesized using the mean geometry style. A stylized face geometry exaggerates the facial features that differ from a standard, or average, face. Since the tracking data provide correspondence between different subjects, the mean facial geometry can be computed as the arithmetic mean of the individual face geometries; the same result is obtained by weighting all the style vectors equally. Using the mean style s̄_g and a scale factor γ, we can write

    g^{exaggerated} = D_g( γ(s^s_g − s̄_g) + s̄_g ).    (11)

The parameter γ controls the amount of exaggeration based on the difference from the average face geometry. When γ = 1, the new face is the same as the original face. When γ < 1, the new face approaches the standard face, and it equals the average face at γ = 0. Caricatured faces are usually generated with γ > 1. Fig. 1 shows the mean geometry and style-exaggerated geometries.

Fig. 1. Original images and their 3D caricatures. Col. (a): original images; Col. (b): γ = 0.0 (mean geometry); Col. (c): γ = 0.7; Col. (d): γ = 1.0; Col. (e): γ = 1.3.
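A sketch of the exaggeration in Eq. (11); taking the mean style as the equal-weight average of the learned geometry style vectors follows the description above, while the function and variable names are ours.

```python
import numpy as np

def exaggerate_geometry(D_g, s_g, geometry_styles, gamma):
    """Eq. (11): scale a person's deviation from the mean geometry style.
    gamma = 0 -> average face, gamma = 1 -> original face, gamma > 1 -> caricature."""
    s_mean = geometry_styles.mean(axis=0)  # equal weighting of all style vectors
    return D_g @ (gamma * (s_g - s_mean) + s_mean)
```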

3.2. Intensity control in facial expressions

We can control the expressiveness of the synthesized facial expression by scaling the facial motion. We extend the generative model with a motion scaling parameter τ:

    m^{exaggerated}_t = τ( unstacking(E s^{new}_b) ψ(x(t)) ).    (12)

Scaling the motion vector changes the perceived strength of the expression: reducing the scale parameter τ yields a milder, weaker expression than the original, while increasing it yields a stronger expression. The overall generative model combines stylized facial geometry and facial motion with expressiveness control. Fig. 2 shows examples of the scaling effect, with different expressiveness of the facial motion according to the scale parameter in a smile expression sequence. Facial expressions at different intensities are generated while preserving the details of the expression.
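Putting Eqs. (11) and (12) together, the combined control could look as follows. This is our sketch; the rigid head motion is omitted and the stacking convention is assumed.

```python
import numpy as np

def synthesize_controlled_frame(D_g, s_g, s_mean, E, s_b, psi_x, d,
                                gamma=1.0, tau=1.0):
    """Stylized geometry (Eq. 11) plus expressiveness-scaled motion (Eq. 12).
    tau < 1 gives a milder expression, tau > 1 a stronger one."""
    geometry = D_g @ (gamma * (s_g - s_mean) + s_mean)
    K = psi_x.shape[0]
    B = (E @ s_b).reshape((d, K), order="F")  # assumes column stacking of b^s
    return geometry + tau * (B @ psi_x)
```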

3.3. Extraction of FAPs and animation in intelligent agents

MPEG-4 facial animation parameters (FAPs) are part of the international standard for efficient coding of the shape and animation of human faces and bodies. FAPs are used not only for web and mobile applications but also for emotion recognition and talking-head human-computer interfaces [16]. Computer vision techniques have been used to extract FAP parameters from video, and optical-marker based motion capture systems are used to obtain accurate parameters as in [17]. Our high resolution tracking of facial motion with a generic model and full correspondence allows accurate extraction of FAPs without markers.

As we have high resolution tracking of the 3D facial nodal points with correspondence between frames and across sequences, we can estimate the facial animation parameters (FAPs) once we identify the nodal points corresponding to the FAP feature points.


Fig. 2. Intensity control by scaling factors in facial expressions. Rows: τ = 0.5, 1.0, 1.25, 1.75 from top to bottom. Columns: sampled expression sequence (neutral → target).

Our generic facial geometry model was not created with FAP feature points defined; however, it has 8K 3D nodal points in high resolution tracking and 1K in low resolution tracking, so we can easily identify the nodal points needed to calculate FAPs. In Fig. 3, the first column shows the identified nodal points on a neutral face. The following columns show synthesized facial expressions (a) and the corresponding re-synthesis results (b) of the expressions using the FAPs extracted from the synthesized data. We used software from Visage Technologies AB to convert the FAPs into binary format and to generate facial expressions in agents.

(a) Feature points used for FAP estimation and synthesized expressions

(b) Neutral face and re-synthesized facial expressions in agents

Fig. 3. Feature points and re-synthesis of facial expressions
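To illustrate the parameter extraction, the sketch below computes one displacement-type FAP from an identified feature vertex. The vertex indices, the choice of FAP, and the use of the mouth-nose separation FAPU are illustrative assumptions, and the sign conventions of individual FAPs are omitted.

```python
import numpy as np

def fapu_mns(neutral, upper_lip_idx, nose_bottom_idx):
    """Mouth-nose separation FAPU: 1/1024 of the distance between the bottom
    of the nose and the upper-lip midpoint on the neutral face (the 1/1024
    scaling follows the MPEG-4 convention for distance-based FAPUs)."""
    return np.linalg.norm(neutral[nose_bottom_idx] - neutral[upper_lip_idx]) / 1024.0

def displacement_fap(frame, neutral, vertex_idx, fapu, axis=1):
    """A displacement-type FAP: motion of an identified feature vertex along
    one axis relative to the neutral face, expressed in FAPU units."""
    return (frame[vertex_idx, axis] - neutral[vertex_idx, axis]) / fapu
```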

4. CONCLUSION AND FUTURE WORK

We presented a facial expression synthesis system based on a nonlinear generative model learned from high resolution tracking data of 3D nodal points with correspondences between frames and across different persons. The generative model can synthesize not only personalized facial expressions, including subtle expressions, but also expressions with controlled expressiveness. In addition, we can extract FAPs from the synthesized data and generate facial expressions with any animation software supporting MPEG-4 FAPs. We plan to evaluate our system in human-intelligent-agent interaction to study the effect of expressiveness and personalization on the intelligent agent's appearance.

Acknowledgement: The facial expression tracking data was made available by Dimitris Samaras at SUNY Stony Brook. This research is partially funded by NSF award IIS-0328991.

5. REFERENCES

[1] Rosalind W. Picard, "Affective computing: Challenges," Int. Journal of Human-Computer Studies, vol. 59, no. 1-2, pp. 55-64, 2003.

[2] Karen K. Liu and Rosalind W. Picard, "Subtle expressivity in a robotic computer," in Proc. of the CHI 2003 Workshop on Subtle Expressivity for Characters and Robots, 2003.

[3] Rosalind W. Picard, "What does it mean for a computer to have emotions?," Tech. Rep. 534, MIT Media Lab, 2001.

[4] Bernadette Kiss, Balazs Benedek, Gabor Szijarto, and Barnabas Takacs, "Closed loop dialog model of face-to-face communication with a photo-real virtual human," in Visual Communications and Image Processing, 2004, pp. 765-772.

[5] Andrew J. Calder, Duncan Rowland, Andrew W. Young, Ian Nimmo-Smith, Jill Keane, and David I. Perrett, "Caricaturing facial expressions," Cognition, vol. 76, pp. 105-146, 2000.

[6] V. Blanz and T. Vetter, "A morphable model for the synthesis of 3D faces," in SIGGRAPH '99, 1999, pp. 187-194.

[7] Joshua B. Tenenbaum and William T. Freeman, "Separating style and content with bilinear models," Neural Computation, vol. 12, pp. 1247-1283, 2000.

[8] E. S. Chuang, H. Deshpande, and C. Bregler, "Facial expression space learning," in Pacific Graphics, 2002, pp. 68-76.

[9] Daniel Vlasic, Matthew Brand, Hanspeter Pfister, and Jovan Popovic, "Face transfer with multilinear models," ACM Trans. Graph., vol. 24, no. 3, pp. 426-433, 2005.

[10] Susan E. Brennan, "Caricature generator: The dynamic exaggeration of faces by computer," Leonardo, vol. 18, no. 3, pp. 170-178, 1985.

[11] Rein-Lien Hsu and Anil K. Jain, "Generating discriminating cartoon faces using interacting snakes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 11, 2003.

[12] Yang Wang, Mohit Gupta, Song Zhang, Sen Wang, Xianfeng Gu, Dimitris Samaras, and Peisen Huang, "High resolution tracking of non-rigid 3D motion of densely sampled data using harmonic maps," in Proc. ICCV, 2005, pp. 388-395.

[13] Sam Roweis and Lawrence Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323-2326, 2000.

[14] Ahmed Elgammal and Chan-Su Lee, "Separating style and content on a nonlinear manifold," in CVPR, 2004, pp. 478-485.

[15] Tomaso Poggio and Federico Girosi, "Networks for approximation and learning," Proceedings of the IEEE, vol. 78, no. 9, pp. 1481-1497, 1990.

[16] Igor S. Pandzic and Robert Forchheimer, Eds., MPEG-4 Facial Animation: The Standard, Implementation and Applications, John Wiley, 2002.

[17] Sumedha Kshirsagar, Stephane Garchery, Gael Sannier, and Nadia Magnenat-Thalmann, "Synthetic faces: Analysis and applications," Int. J. Imaging Syst. Technol., pp. 65-73, 2003.