
Video-Based Facial Re-Animation

Wolfgang Paier, Fraunhofer HHI, wolfgang.paier@hhi.fraunhofer.de
Markus Kettern, Fraunhofer HHI, markus.kettern@hhi.fraunhofer.de
Anna Hilsmann, Fraunhofer HHI, anna.hilsmann@hhi.fraunhofer.de
Peter Eisert, Fraunhofer HHI / Humboldt University Berlin, peter.eisert@hhi.fraunhofer.de

ABSTRACT
Generating photorealistic facial animations is still a challenging task in computer graphics, and synthetically generated facial animations often do not meet the visual quality of captured video sequences. Video sequences, on the other hand, need to be captured prior to the animation stage and do not offer the same animation flexibility as computer graphics models. We present a method for video-based facial animation which combines the photorealism of real videos with the flexibility of CGI-based animation by extracting dynamic texture sequences from existing multi-view footage. To synthesize new facial performances, these texture sequences are concatenated in a motion-graph-like way. In order to ensure a realistic appearance, we combine a warp-based optimization scheme with a modified cross dissolve to prevent visual artifacts during the transition between texture sequences. Our approach makes photorealistic facial re-animation from existing video footage possible, which is especially useful in applications like video editing or the animation of digital characters.

Categories and Subject Descriptors
H.5.1 [Multimedia Information Systems]: Animations; I.4.8 [Scene Analysis]: Tracking

Keywords
facial animation, facial texture, geometric proxy, tracking

1. INTRODUCTION
The creation of realistic virtual human characters, and especially of human faces, is one of the most challenging tasks in computer graphics. The geometric and reflective properties of the human face are very complex and hard to model.


CVMP 2015, November 24-25, 2015, London, United Kingdom
© 2015 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ISBN 978-1-4503-3560-7/15/11...$15.00
DOI: http://dx.doi.org/10.1145/2824840.2824843

Deformation is induced by complex interactions of a large number of muscles as well as several layers of tissue and skin. Reflective properties vary from diffuse to highly specular areas and also show subsurface scattering effects. Furthermore, humans are very good at interpreting faces, such that even slight deviations from the expected visual appearance are perceived as wrong or unnatural facial expressions. Therefore, synthetically generated facial expressions are often perceived as less realistic than real video sequences. Video sequences, on the other hand, need to be captured in advance, and editing the captured facial performances is difficult.

In this paper, we present a video-based approach to re-animate an actor's face. Our approach does not try to model the facial appearance with low-dimensional statistical models, as this would drop important details that cannot be represented in a low-dimensional space. We rather base our method on real video footage, which is transformed into dynamic texture sequences to achieve a photorealistic appearance. These dynamic textures are then processed in a way that allows for seamless concatenation of texture sequences according to an animator's input, enabling the creation of novel facial video performances. The presented approach can, for example, be used in computer games or video editing applications, where it allows realistic facial performances to be conveniently synthesized from a database of source sequences, a facial performance to be seamlessly transferred between different video sequences, a facial video sequence to be re-arranged, or a digital character to be re-animated (e.g. in video games).

For the extraction of the dynamic texture sequences containing the facial performance (section 4), we use a pre-created 3D proxy model to consistently track the head pose in one or multiple synchronized video streams (section 4.2). The proxy mesh is created a priori from a single shot of the actor's head captured with a multi-view D-SLR camera rig and a dense image-based 3D reconstruction method (section 4.1). After dynamic texture extraction (section 4.3), the dynamic textures can be used to synthesize novel facial performances by seamlessly concatenating several texture sequences (section 5). We use a combination of warp-based image registration (section 5.1) and cross-dissolve blending (section 5.2) to create seamless transitions between consecutive texture sequences.

2. RELATED WORK


Figure 1: Examples of our results. The proxy head model is rendered with different facial expressions and from different viewpoints.

Performance capture is a popular technique to make the performance of an actor reusable. It can, for example, be exploited to drive the skeletal animation of a human character, to transfer facial expressions by matching the face geometry [27], or to aid the creation of more realistic video dubbings [11]. In this paper, we mainly focus on facial performance capture. A detailed survey on this topic can, for example, be found in [22]. Our approach is related to marker-less facial motion capture like [17, 26, 8, 3, 7], since we use an optical-flow-based technique to accurately track the head pose in the video stream. Borshukov et al. [3] use a highly sophisticated capture setup to capture an actor's facial performance. Based on optical flow and photogrammetry, they drive a Cyberscan model of the actor's face in neutral expression to obtain a time-consistent animation mesh. Missing details like wrinkles and skin pores are added afterwards using a bump map that is extracted from a high-resolution laser scan. A different approach is presented by Garrido et al. [9], where a blend shape model is personalized by matching it to a detailed 3D scan of the actor's face. This blend shape model is then used to track the facial performance using a combination of sparse facial features and optical flow. Alexander et al. [1] describe an approach to create extremely realistic digital characters, but their approach also requires a highly sophisticated capture setup and additional human effort. Most of these approaches are used to transfer facial expressions from a source sequence to an animatable face model, achieving different levels of realism.

In contrast, our approach tries to keep the complexity as low as possible while at the same time aiming at photorealistic animations. We do not rely on a fully animatable 3D model of the actor's face, but on a roughly approximated geometric model with only a few degrees of freedom. In the experiments presented in this paper, we only use a single blend shape to account for mouth opening and closing. The necessary expressivity is achieved by using photorealistic dynamic textures which add fine details and facial movements. Similar strategies have been used for other applications in [18, 4, 8, 15, 25, 6, 19]. Using image-based rendering, these methods create photorealistic renderings from novel viewpoints even though the underlying geometry is only a rough approximation. Pushing this idea further, Xu et al. [28] presented a system for the synthesis of novel full-body performances from multi-view video. They use performance capture to obtain pose and geometry for each video frame. Based on this data, they render a synthetic video performance according to a user-provided query viewpoint and skeleton sequence, even if the exact body pose is not represented in the database. However, they mention that while their approach is appropriate for skeletal animation, facial animation has to be handled separately.

As humans are very sensitive to inconsistencies in the appearance of other human faces, we specifically concentrate on facial re-animation, in contrast to the previously mentioned papers. Inspired by the aforementioned advances in image- and video-based rendering, our approach is based on real video footage to achieve photorealistic results. Video-based facial animation has only recently found attention in the literature. Paier et al. [19] presented a system for facial re-targeting, i.e. transferring short sequences of a facial performance between different videos. This can be used in video editing to fine-tune the timing of facial actions or to exchange similar facial expressions in order to create a flawless shot from already captured shots. In contrast to [19], we focus on synthesizing completely new sequences from short clips of facial expressions of an actor. Furthermore, by extracting dynamic textures and using an approximate head model that allows for jaw movements, novel animations can be rendered from arbitrary viewpoints.

Our re-animation strategy is also related to the idea of motion graphs [16, 13, 5], which have already been successfully used for skeletal or surface-based animation of human characters. We capture several video sequences of a facial performance and split them up into short clips that contain single actions or facial expressions (e.g. smile, talk, looking surprised or angry). These clips are transformed to texture space and are concatenated in order to compose a novel facial performance. Similar to [14, 21], we also find smooth transitions between different facial sequences, because directly switching from one texture sequence to another would create obvious artifacts in the synthesized facial video (e.g. a sudden change of facial expression or illumination). For this purpose, we use a geometric image registration technique to compensate tracking errors and changes in the facial expression, as well as a modified cross dissolve to smoothly blend all remaining color differences.

3. SYSTEM OVERVIEW
The workflow of the presented system consists of two main steps: First, in a pre-processing phase, a database of dynamic textures, each containing a certain facial expression/action, is created. This step has to be done only once in advance.


After this database has been set up, new facial videos can be created in real time according to user input by seamlessly concatenating selected dynamic textures and rendering them using an approximate model of the actor's head.

The input data for the extraction of dynamic textures consists of a multi-view video stream that contains several facial performances of an actor (see figure 5), as well as a calibrated 360° multi-view set of still images showing the actor's head with a neutral expression (see figure 3). First, we reconstruct the head geometry based on the still images and run a semi-automatic mesh unwrapping technique to generate texture coordinates. The extracted head geometry will be used to consistently track the head pose in the multi-view video streams. It is an approximation of the true geometry, since it is almost static and allows for dominant deformations only (in this paper, the only possible deformation is jaw movement). More subtle deformations will be expressed by dynamic textures. The following steps process the multi-view video stream only. We label several facial expressions/actions (e.g. neutral, happy, talking, ...) by storing a tag as well as the first and the last frame of each facial performance. Then, using the extracted head geometry, the face is tracked through all frames in all annotated sequences, and temporally consistent textures are created for each multi-view frame. These pre-processing steps need to be performed only once in advance and are detailed in section 4.

Input to the synthesis of facial videos is a user-defined sequence of facial expression labels. Based on these labels, the processed dynamic texture sequences are combined to re-animate the face, either by directly rendering the mesh with animated textures (e.g. for games or virtual reality applications) or by rendering the sequence to a target video. In order to ensure a seamless concatenation of the texture sequences, we apply a two-stage blending approach. First, in a pre-defined transition window at the end of the current sequence and the beginning of the following one, we adjust the motion using a warp-based optimization technique. Finally, we apply a cross-dissolve-based color adjustment. Details on the synthesis of novel facial expression sequences, i.e. facial re-animation, are given in section 5.
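To make the concatenation idea concrete, the following Python sketch (an illustration written for this text, not the authors' implementation) joins a list of texture clips and blends each transition with a plain linear cross dissolve as a stand-in for the warp-based and anisotropic blending of section 5; the database layout and the helper name concat_with_transitions are hypothetical.

import numpy as np

def concat_with_transitions(clips, n_transition=60):
    # clips: list of clips, each a list of HxWx3 float texture mosaics.
    # Overlap the last/first n_transition frames of neighbouring clips and
    # blend them with a linear cross dissolve (stand-in for the two-stage
    # geometric + photometric blending described in section 5).
    out = list(clips[0])
    for nxt in clips[1:]:
        n = min(n_transition, len(out), len(nxt))
        for i in range(n):
            alpha = (i + 1) / n          # ramps towards 1 over the window
            out[-n + i] = (1.0 - alpha) * out[-n + i] + alpha * nxt[i]
        out.extend(nxt[n:])
    return out

# usage (hypothetical database):
# frames = concat_with_transitions([texture_db[l] for l in ["neutral", "smile", "neutral"]])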

4. CREATING A DATABASE OF DYNAMIC TEXTURES

This section explains how we extract a database of dynamic textures from a multi-view video stream. First, the actor is captured in a neutral pose using a calibrated 360° multi-view setup of D-SLR cameras to generate a 3D head model (section 4.1). Then, several facial performances are captured with a multi-view video setup, and the 3D head model is used to track the 3D pose and orientation of the actor's head as well as jaw movements in each sequence (section 4.2). This allows us to extract temporally consistent textures (section 4.3) from the multi-view video sequences in order to set up a database of dynamic textures.

4.1 Generation of 3D Head Models
For the extraction of dynamic textures from multi-view video, a 3D representation of the actor's head is generated from a calibrated multi-view set of D-SLR images. We employ a state-of-the-art image-based dense stereo matching scheme to estimate depth maps for each D-SLR pair [2]. The resulting 2.5D models are then registered using an iterative closest point approach and merged into a complete 3D head model (see figure 4).

Figure 2: Schematic system overview. Inputs: still images (DSLR) and multi-view video (4K). Stage 1, extraction of dynamic textures: reconstruction of head geometry, tracking of head movements, extraction of texture sequences. Stage 2, seamless composition of facial textures: a user-defined sequence of facial expression labels, geometric blending, composed texture sequences, photometric blending, optimized texture sequences, synthesized facial performance.

This method provides an accurate reconstruction of the facial geometry and a realistic approximation of the actor's hair.

The reconstructed head geometry is almost static and is only used as a geometric proxy in all following processing steps. Typically, a neutral facial expression is used for the proxy mesh, since this provides a reasonable approximation of the head geometry for most facial expressions. Finally, a semi-automatic method for mesh unwrapping is used to create texture coordinates for each vertex. Note that this step needs to be done only once for each actor and can then be used to process all video sequences of this actor.
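As an illustration of the registration step, the following minimal point-to-point ICP sketch (written for this text; the paper does not specify its ICP variant) aligns one 2.5D point set to another using nearest-neighbour correspondences and the Kabsch solution for the rigid transform. A production pipeline would add outlier rejection and a point-to-plane error.

import numpy as np
from scipy.spatial import cKDTree

def icp_point_to_point(src, dst, n_iter=30):
    # Rigid ICP sketch: iteratively find nearest neighbours in dst for the
    # transformed src points and re-estimate the rigid transform (Kabsch).
    tree = cKDTree(dst)
    R, t = np.eye(3), np.zeros(3)
    for _ in range(n_iter):
        moved = src @ R.T + t
        _, idx = tree.query(moved)             # correspondences
        corr = dst[idx]
        mu_s, mu_d = moved.mean(axis=0), corr.mean(axis=0)
        H = (moved - mu_s).T @ (corr - mu_d)   # 3x3 cross-covariance
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R_step = Vt.T @ D @ U.T
        t_step = mu_d - R_step @ mu_s
        R, t = R_step @ R, R_step @ t + t_step
    return R, t                                # maps src into the frame of dst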

4.2 Head Motion Tracking
In order to enable the extraction of temporally consistent textures from video streams displaying facial performances, the 3D model used for texture extraction should follow the head motion as closely as possible. However, correctly tracking the subtle geometric deformations of a face is considered a very hard computer vision problem (e.g. [10]), and even state-of-the-art methods may quickly lose track due to the manifold deformations or large head rotations occurring in natural facial performances, producing visually disturbing artifacts during re-animation.


Figure 3: Samples of the still images used to reconstruct the head geometry

Figure 4: Image of the geometric proxy used for tracking, texture extraction, and rendering

In order to allow for photorealistic animations, the overall idea of our approach is to express all subtle facial deformations by animating the texture rendered upon the geometry, instead of modeling them in 3D. The only type of deformation we consider impossible to represent by texture alone is jaw movement, since it largely deforms the face boundary, where strong depth discontinuities would severely hamper the results of any approach relying on texture alone.

Thus, the rigid head motion and jaw movement have to be separated from deformations due to facial expressions. We achieve this by tracking the actor's face with the original proxy mesh and a single blend shape for downward jaw movement, which is easily created using 3D modeling software. Note that the method described below is not limited to a single blend shape and could also be used to track a full-blown blend shape model.

The tracking procedure is preceded by selecting a set of key-frames from the video stream and matching the proxy mesh to these key-frames via a small set of fiducial points, either obtained from a facial feature extractor (e.g. [24]) or by manual annotation. In order to maximize the semantic consistency of the extracted textures, we minimize the image difference between each frame and a reconstruction of that frame obtained by warping the last key-frame according to the estimated motion and deformation of the proxy mesh, resulting in an analysis-by-synthesis approach.

The rigid motion of the head proxy model in frame s is defined by a rotation R_s around the model's center point and a translation t_s. The jaw movement is parametrized by a blend shape factor \lambda_s. Since we are working in a calibrated multi-view setup, each camera c also has a pose (R_c, t_c), and its projection of a 3D point x_s on the surface of the proxy model in frame s is given by

[u, v]^T = \Psi_c\big( R_c^T (R_s x_s + t_s - t_c) \big)    (1)

\Psi_c\big( [x, y, z]^T \big) = c_c - \mathrm{diag}(f_c) \, [x/z, \; y/z]^T    (2)

where c_c and f_c denote the camera's principal point and scaled focal length, respectively. The position of point x_s in model space is defined by

x_s = v + \lambda_s b    (3)

where v is the position in the original proxy mesh, b is the corresponding offset for blend shape animation, and \lambda_s is the animation coefficient for jaw movement. Note that for a model with more than one blend shape, \lambda_s would be replaced by a vector of coefficients and b by a matrix containing the vertex offsets.
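For clarity, a small Python sketch of equations (1)-(3) follows; the function name project_vertex and the argument layout are illustrative assumptions, and the sign convention of the projection follows equation (2) as printed.

import numpy as np

def project_vertex(v, b, lam, R_s, t_s, R_c, t_c, f_c, c_c):
    # Eq. (3): blend-shape deformation in model space (single jaw blend shape).
    x_s = v + lam * b
    # Model -> world -> camera space, then the projection of Eqs. (1)/(2).
    p_cam = R_c.T @ (R_s @ x_s + t_s - t_c)
    x, y, z = p_cam
    return c_c - f_c * np.array([x / z, y / z])   # sign convention of Eq. (2)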

Since we estimate each frame by modifying a rendered version of its preceding frame, we may assume the rotation update for each new frame to be small enough to be approximated linearly. Thus, we can express the motion of a 3D model point in world coordinates as

w_s = w_{s-1} + \Delta^s_x    (4)
    = \Delta^s_R R_{s-1} x_s + t_{s-1} + \Delta^s_t    (5)

\Delta^s_R = \begin{pmatrix} 1 & -r^s_z & r^s_y \\ r^s_z & 1 & -r^s_x \\ -r^s_y & r^s_x & 1 \end{pmatrix}    (6)

\Delta^s_x = (\Delta^s_R - I) R_{s-1} x_s + \Delta^s_t    (7)

This motion induces an offset for the projected position u = [u, v]^T corresponding to x_s in the image of camera c, which we represent by its first-order approximation

\Delta^s_{u,c}(\theta_s) \approx J_{\Psi_c} R_c^T \Delta^s_x    (8)

where \theta_s = [\Delta^s_r, \Delta^s_t, \lambda_s]^T is the parameter vector consisting of the changes in model rotation, translation and jaw movement. J_{\Psi_c} is the Jacobian of the projection function, given by

J_{\Psi_c} = \mathrm{diag}(f_c) \begin{pmatrix} -1/z & 0 & x/z^2 \\ 0 & -1/z & y/z^2 \end{pmatrix}    (9)


for a 3D point [x, y, z]^T = R_c^T (R_s x_s + t_s - t_c) in camera space.

Substituting (3) and (6) into (8) yields

\Delta^s_{u,c}(\theta_s) = J_{\Psi_c} R_c^T \big[ -[R_{s-1} v]_\times \;\; I_3 \;\; -R_{s-1} b \big] \, \theta_s    (10)

where [a]_\times denotes the skew-symmetric cross-product matrix of vector a. Note that this 2D motion offset is now expressed as linearly dependent on the parameters of the model's motion in 3D space given by \theta_s. We can derive a matrix representing the induced motion of all pixels in an image area \Omega by

\begin{pmatrix} \vdots \\ \Delta^s_{u,c} \\ \vdots \end{pmatrix}_{u \in \Omega} = A_c \theta_s    (11)

with each pair of rows in A_c given by

A_{u,c} = J_{\Psi_c} R_c^T \big[ -[R_{s-1} v]_\times \;\; I_3 \;\; -R_{s-1} b \big].    (12)

The established relation between 3D and 2D motion is used to explain the optical flow between two images I^s_c and J^s_c. This amounts to minimizing the error

E_f(\theta_s) = \sum_c \sum_{u \in I} \big\| (\nabla I^s_{u,c})^T \Delta^s_{u,c}(\theta_s) - (J^s_{u,c} - I^s_{u,c}) \big\|^2    (13)

with \nabla I^s_{u,c} being the image gradient of I^s_c at pixel u.

As described above, to prevent drifting errors, we use a rendered version of frame I^0_c, the last key-frame, as image J^s_c. This rendered version is created by projecting I^0_c onto the texture of the model mesh at its pose in I^0_c and then rendering the mesh with the estimate of its current pose. If \Omega is the area covered by the rendered mesh, E_f can be minimized in closed form by solving the system of linear equations given by

\nabla I^s_c A_c \theta_s = J^s_c - I^s_c    (14)

evaluated in region \Omega. This yields a linearized estimate of the image variations induced by the parameters \theta_s = [(\Delta^s_r)^T \; (\Delta^s_t)^T \; \lambda_s]^T. Since this relation is in truth a non-linear one, we resort to an iterative optimization approach. Observe that (14) represents the set of normal equations for this non-linear problem, so iteratively solving it and updating the rendered image J^s and the depth map for obtaining A_\Omega results in a Gauss-Newton optimization. This process typically converges, yielding very small parameter updates, within fewer than 10 iterations.
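The following sketch illustrates how one such Gauss-Newton step could be assembled from equations (9), (12) and (14); it is an illustrative reconstruction, not the authors' code, and the per-pixel sample structure (image gradient, residual, camera-space point, model vertex and blend-shape offset) is a hypothetical interface. A caller would iterate this step, re-rendering J^s_c and the depth map after each parameter update.

import numpy as np

def skew(a):
    # [a]_x, the skew-symmetric cross-product matrix of Eqs. (10)/(12)
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

def gauss_newton_step(samples, f_c, R_c, R_prev):
    # One linearized solve of Eq. (14) for a single camera.
    # samples: iterable of (g, r, p_cam, v, b) per pixel in the rendered region:
    #   g      image gradient of I^s at the pixel, shape (2,)
    #   r      residual J^s_u - I^s_u (scalar)
    #   p_cam  camera-space 3D point used in Eq. (9), shape (3,)
    #   v, b   model-space vertex and blend-shape offset, shape (3,)
    A_rows, rhs = [], []
    for g, r, p_cam, v, b in samples:
        x, y, z = p_cam
        J_psi = np.diag(f_c) @ np.array([[-1.0 / z, 0.0, x / z**2],
                                         [0.0, -1.0 / z, y / z**2]])   # Eq. (9)
        A_u = J_psi @ R_c.T @ np.hstack([-skew(R_prev @ v),
                                         np.eye(3),
                                         (-R_prev @ b)[:, None]])      # Eq. (12)
        A_rows.append(g @ A_u)   # project the 2D offset onto the image gradient
        rhs.append(r)
    A = np.vstack(A_rows)
    theta, *_ = np.linalg.lstsq(A, np.array(rhs), rcond=None)
    return theta                 # (delta_r, delta_t, lambda): 7 parameters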

4.3 Dynamic Texture Generation
This step uses the results of the tracking procedure to transform the multi-view video streams (see figure 5) into a single stream of texture mosaics. Such a representation has several advantages. First, it can easily be integrated into existing rendering engines. Second, it eases the process of editing facial expressions, as all texture information is merged into a single image. Finally, it reduces redundancy, as in texture space only relevant data is stored and unnecessary data is dropped (e.g. background and overlaps).

Since our setup consists of multiple video cameras, it is necessary to decide for each triangle f_i from which camera it should receive its texture. This can be formulated as a labeling problem, estimating a camera label c_i for each triangle f_i of the proxy mesh.

In order to create an optimal sequence of texture mosaics for each facial expression/facial action, we employ a discrete optimization scheme minimizing an objective function (15) that consists of three terms, each corresponding to one visual quality criterion [19]: high visual quality, low visibility of seams, and no temporal artifacts (e.g. flickering caused by rapidly changing source cameras).

E_t(C) = \sum_t^T \sum_i^N D(f^t_i, c^t_i) + \lambda \sum_{i,j \in N} V_{i,j}(c^t_i, c^t_j) + \eta \, T(c^t_i, c^{t-1}_i)    (15)

where C denotes the set of camera labels for all triangles. The first term D(f_i, c_i) measures the visual quality of a triangle f_i in camera c_i and uses a quality measure W(f_i, c_i), which is the area of f_i projected on the image plane of camera c_i relative to the sum of area(f_i, c_i) over all possible c_i, to ease the choice of the weighting factors \eta and \lambda:

D(f_i, c_i) = \begin{cases} 1 - W(f_i, c_i) & f_i \text{ is visible} \\ \infty & f_i \text{ is occluded} \end{cases}    (16)

W(f_i, c_i) = \frac{\mathrm{area}(f_i, c_i)}{\sum_{c_j} \mathrm{area}(f_i, c_j)}    (17)

The second term V_{i,j}(c_i, c_j) in (15) adds a spatial smoothness constraint to the objective function, which relates to the sum of color differences along the common edge e_{i,j} of two triangles f_i and f_j that are textured from two cameras c_i and c_j:

V_{i,j}(c_i, c_j) = \begin{cases} 0 & c_i = c_j \\ \Pi_{e_{i,j}} & c_i \neq c_j \end{cases}    (18)

\Pi_{e_{i,j}} = \int_{e_{i,j}} \big\| I_{c_i}(x) - I_{c_j}(x) \big\| \, dx    (19)

Finally, a temporal smoothness term T(c^t_i, c^{t-1}_i) is added to the objective function. Without such a term, the resulting dynamic textures are not necessarily temporally consistent, i.e. the source camera of a certain triangle can change arbitrarily between two consecutive texture frames, resulting in visually disturbing flickering in the resulting texture sequence. T increases the overall cost if the source camera c_i of a triangle f_i changes between two consecutive time steps:

T(c^t_i, c^{t-1}_i) = \begin{cases} 0 & c^t_i = c^{t-1}_i \\ 1 & c^t_i \neq c^{t-1}_i \end{cases}    (20)
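To make the objective concrete, the sketch below evaluates equations (15)-(20) for a single texture frame and a candidate labeling; the data layout (per-triangle projected areas, precomputed seam costs) is an assumption, and the actual minimization would be carried out by a discrete optimizer rather than by exhaustive evaluation.

import numpy as np

def texture_labeling_energy(labels, area, edge_color_diff, prev_labels=None,
                            lam=1.0, eta=1.0):
    # labels:          camera label chosen for each triangle
    # area[i, c]:      projected area of triangle i in camera c (0 if occluded)
    # edge_color_diff: dict {(i, j): CxC matrix of seam costs, Eq. (19)}
    # prev_labels:     labels of the previous frame, or None for the first frame
    W = area / np.maximum(area.sum(axis=1, keepdims=True), 1e-12)   # Eq. (17)
    E = 0.0
    for i, c in enumerate(labels):                                  # Eq. (16)
        E += np.inf if area[i, c] <= 0 else 1.0 - W[i, c]
    for (i, j), seam in edge_color_diff.items():                    # Eq. (18)
        if labels[i] != labels[j]:
            E += lam * seam[labels[i], labels[j]]
    if prev_labels is not None:                                     # Eq. (20)
        E += eta * sum(int(labels[i] != prev_labels[i])
                       for i in range(len(labels)))
    return E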

Finally, we employ a simple but effective global color matching [23] together with Poisson blending [20], modified for the usage in texture mosaics, to conceal remaining seams without unnecessarily blurring the resulting texture or adding ghosting artifacts (which can be caused by simpler approaches like alpha blending).


Figure 5: Samples of the multi-view video capture showing different facial expressions

In case the video footage alone is not sufficient to create 360° dynamic textures, we allow the filling of missing regions and areas of low spatial resolution (caused by the viewing angle) with texture data from the D-SLR capture.

5. SYNTHESIS OF FACIAL VIDEOS
In the previous stage (section 4), we created a set of independent texture sequences. Each sequence represents a facial expression or action like smiling, talking, laughing, looking friendly or angry. We can now use the extracted dynamic textures to create photorealistic facial performances by playing the texture sequences on a static 3D model of the head like a video. This creates the photorealistic illusion of a talking mouth, blinking eyes or wrinkles caused by a smile, without the need to model all fine deformations in the geometry. Furthermore, by looping and concatenating several texture sequences, longer and more complex sequences can be synthesized. This type of animation strategy is closely related to motion graphs [16]. In the context of motion graphs, edges in the graph would correspond to facial actions, and vertices to expression states.
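A minimal data structure along these lines is sketched below (illustrative only; the paper does not prescribe an implementation): expression states are vertices, texture clips are edges, and synthesizing a performance amounts to collecting the clips along a requested path of states.

from collections import defaultdict

class ExpressionGraph:
    # Vertices are expression states, edges carry the texture clip that
    # moves the face from one state to the next (motion-graph analogy).
    def __init__(self):
        self.edges = defaultdict(dict)          # edges[a][b] -> clip id

    def add_clip(self, start_state, end_state, clip_id):
        self.edges[start_state][end_state] = clip_id

    def clips_for_path(self, states):
        # Return the clip ids realizing a requested sequence of states.
        return [self.edges[a][b] for a, b in zip(states[:-1], states[1:])]

# usage with hypothetical labels:
# g = ExpressionGraph()
# g.add_clip("neutral", "smile", "clip_smile_01")
# g.add_clip("smile", "neutral", "clip_relax_01")
# g.clips_for_path(["neutral", "smile", "neutral"])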

Since the extracted dynamic textures have been captured separately and in a different order, simple concatenation of independent dynamic textures would create visual artifacts at the transition between two sequences. These artifacts are due to small tracking errors, changing illumination (e.g. caused by head movement) and differences in the facial expression at the end of one sequence and the beginning of another (see figure 6).

Therefore, at this stage, the independent texture sequences from the pre-processing phase (section 4) are brought into connection by defining transition rules between the separate sequences. Between each pair of texture sequences, a two-stage blending strategy is employed: first, the geometric misalignment between the last frame T_{last,t-1} of the previous texture sequence and the first frame T_{first,t} of the next sequence is corrected, before the remaining color differences are blended by a cross dissolve.

5.1 Geometric Blending

Figure 6: Intensity difference at a transition point. Bottom left: previous frame; bottom right: current frame; top: color difference at the transition frame.

The geometric misalignment is compensated by calculating a 2D warp W(T, \Phi) that maps T_{last,t-1} onto T_{first,t}, minimizing

\arg\min_\Phi \; \| T_{first,t} - W(T_{last,t-1}, \Phi) \|^2 + \lambda R(\Phi),    (21)

with R being a regularization term weighted by a scalar factor \lambda. Similar to [12], we model the geometric image deformation of T_{last,t-1} with regard to T_{first,t} as a deforming regular 2D control mesh with barycentric interpolation between vertex positions, i.e. the warping function is parametrized by a vector \Phi containing the control vertex displacements, and the regularization term is based on the mesh Laplacian.
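The sketch below spells out such an objective for a single-channel texture, using a regular control grid with bilinear upsampling as a stand-in for the barycentric control mesh of [12] and a discrete grid Laplacian as regularizer; in practice, a coarse-to-fine Gauss-Newton scheme would be used instead of evaluating this function with a generic optimizer.

import numpy as np
from scipy.ndimage import map_coordinates

def warp_objective(phi, T_last, T_first, grid_shape, lam=0.1):
    # Eq. (21) for a single-channel texture: data term plus a Laplacian
    # regularizer on the control-vertex displacements phi.
    H, W = T_last.shape
    H_g, W_g = grid_shape
    d = phi.reshape(H_g, W_g, 2)                 # control displacements (pixels)
    # bilinear upsampling of the control grid to a dense displacement field
    gy = np.linspace(0.0, H_g - 1.0, H)
    gx = np.linspace(0.0, W_g - 1.0, W)
    grid = np.meshgrid(gy, gx, indexing="ij")
    dy = map_coordinates(d[..., 0], grid, order=1)
    dx = map_coordinates(d[..., 1], grid, order=1)
    yy, xx = np.mgrid[0:H, 0:W].astype(float)
    warped = map_coordinates(T_last, [yy + dy, xx + dx], order=1)
    data = np.sum((T_first - warped) ** 2)
    # discrete Laplacian of the control mesh (interior vertices only)
    lap = (4.0 * d[1:-1, 1:-1] - d[:-2, 1:-1] - d[2:, 1:-1]
           - d[1:-1, :-2] - d[1:-1, 2:])
    return data + lam * np.sum(lap ** 2)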

Based on the estimated warp, the last frames of T_{...,t-1} and the first frames of T_{...,t} are deformed gradually to ensure that the transition frames of both sequences are identical. This deformation process is distributed over several frames. We use a rather high number of frames, n = 60 (at 59 fps), to perform the geometric deformation, because the additional motion per frame should be as low as possible to make it barely noticeable.

5.2 Anisotropic Cross Dissolve


Figure 7: Impact of geometric warping. Bottom left: 50% cross dissolve without geometric warp (artifacts around the lips and the eyes); bottom right: with geometric warp compensation; top: color differences after the geometric image warp. No strong edges are visible around the eyes and mouth.

The geometric texture alignment reduces ghosting artifacts during blending (see figure 7). However, color differences between T_{last,t-1} and T_{first,t} can still exist. These can be caused by changing lighting conditions, as the head usually moves during the capturing process, by surface deformations (e.g. wrinkles that appear or disappear), and by remaining misalignments that could not be fully compensated by the image warping (see figure 7). Though the remaining discrepancies are not disturbing in a still image, they become apparent when re-playing the texture sequences. Therefore, an additional cross-dissolve blending is performed in parallel to the geometric deformation. The cross dissolve is also distributed over a large number of frames in order to achieve a slow and smooth transition. The number of frames has to be chosen carefully: if the number of frames used for the transition is too small, the resulting transition can become apparent due to sudden changes in shading or specularities. On the other hand, if the number of frames is too large, ghosting artifacts can appear, because the cross dissolve adds high-frequency details while the face deforms (e.g. specularities on a closed eye, the lip line on an opened mouth, etc.).

Therefore, we use an anisotropic cross dissolve that allows for multiple blending speeds within the same texture. For example, fast blending (e.g. the blending finishes after 4 frames) is used in regions with high-frequency differences (e.g. eyes and mouth), whereas a slow blending speed (e.g. the blending finishes after 40 frames) is applied in smooth regions with mainly low-frequency differences (e.g. skin regions). The faster cross dissolve does not create disturbing effects, because blending small misalignments with a cross dissolve results in a sensation of movement [14]. This small but fast movement is barely noticeable, in contrast to a slowly appearing or disappearing ghosting effect caused by an isotropic cross dissolve. The anisotropic cross dissolve is realized by providing an additional speed-up factor s for each texel. For this purpose, a static binary map S is used to mark regions of increased blending speed. To ensure a smooth spatial transition between regions of different blending speeds, S is blurred in order to create intermediate regions where s changes gradually from slow to fast. For our experiments, a single binary map was created manually in texture space.
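The per-texel blending can be summarized by the following sketch (an illustration under the stated assumptions, not the authors' code): a blurred speed map scales how quickly each texel reaches the incoming texture within the transition window.

import numpy as np
from scipy.ndimage import gaussian_filter   # used to blur the binary mask S

def anisotropic_dissolve(T_prev, T_next, speed_map, frame_idx, n_frames):
    # speed_map: per-texel speed-up factor s >= 1 (blurred binary mask S);
    # texels with larger s reach the incoming texture after fewer frames.
    alpha = np.clip((frame_idx + 1) * speed_map / n_frames, 0.0, 1.0)
    if T_prev.ndim == 3:
        alpha = alpha[..., None]            # broadcast over colour channels
    return (1.0 - alpha) * T_prev + alpha * T_next

# building the speed map from a hand-painted binary mask S (fast regions = 1):
# speed_map = 1.0 + (s_max - 1.0) * gaussian_filter(S.astype(float), sigma=15)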

6. RESULTS AND DISCUSSION

This section presents still images of our proposed re-animation technique. Note that the results can best be evaluated in motion, and we therefore also refer to the video in the supplementary material.

For our experiments, we captured different facial performances of an actress with 4 calibrated UHD cameras (Sony F55) at 59 Hz. We annotated 18 clips in the captured multi-view video footage and transformed them to dynamic texture sequences. A 3D head geometry proxy of the actress was created a priori using one shot of a multi-view still camera rig consisting of 14 D-SLR cameras (Canon EOS 550D) (see figure 3).

The presented approach was used to optimize the first/last n = 60 frames of each texture sequence to allow for a seamless transition between the different texture sequences or to loop a single sequence multiple times (e.g. idle). A non-optimized implementation of our system tracks up to 4 frames per second, takes several seconds for one texture mosaic, and takes approximately 10 seconds for the generation of an optimized texture sequence transition. This is sufficient, as we consider the main purpose of our presented system to be an offline processing tool (e.g. for video editing or the creation of video-based animations for digital characters in games).

To demonstrate the interactive usability of our re-animation approach, the visual quality of the resulting face sequences, and its use for free-viewpoint rendering, we implemented a free-viewpoint GUI in which a user can arbitrarily change the order of facial performance sequences while at the same time changing the viewpoint (see figures 8 and 9 as well as the accompanying video). This demonstrates that our approach can be used as a video editing tool to seamlessly exchange or recompose the facial performance of an actor in post-production, or to conveniently create an optimized set of dynamic textures that allows for rendering photorealistic facial performances for virtual agents or digital characters (e.g. in computer games).

The use of a low-dimensional model results in very stable tracking, which is important in order to generate realistic dynamic textures.


Figure 8: Different re-animated facial expressions and modified camera orientations demonstrating free-viewpoint capabilities.

Tracking errors would directly influence the visual quality of the synthesized facial performance, because they result in an additional movement of the whole face when the textured model is rendered. The low dimensionality of the model is compensated for by applying warping and blending to the textures at the transition point when concatenating sequences with different facial expressions. This compensation in texture space is possible as long as the deformation can be described as an image warp. This describes the trade-off between geometric and textural changes: geometric changes are needed for large-scale changes, e.g. viewpoint and global illumination changes, jaw movements and strong deformations, whereas textural changes are especially well suited for regions with small-scale, detailed and more subtle movements, e.g. fine wrinkles that form around the eyes or mouth. The geometric model should therefore be of low dimensionality to ensure robust tracking, but have enough freedom to model large-scale changes that cannot be described as a textural warp.

The results show that, by using real video footage to express subtle facial motion and details, highly realistic facial animation sequences of an actor can be achieved. In addition, our approach requires only little additional data: an approximation of the actor's head together with a calibrated video stream is sufficient to perform the presented facial re-animation technique. We aim at keeping the manual effort as low as possible, i.e. the user is only required to select a few fiducial points in a single keyframe once in order to initialize the head motion tracking. All subsequent trackings can then be initialized automatically using standard feature detection and matching (e.g. SIFT).

Figure 9: Screenshot of the implemented user interface for re-animation. Possible texture clips are displayed in the list box on the left side.


Lighting variations present a general limitation of image-based approaches, as these variations are captured in the textures. To address this, we captured under homogeneous lighting conditions, so that mainly intensity changes induced by the facial expressions (e.g. wrinkles) are captured in the texture. Global lighting conditions can then still be modified during rendering using the approximate geometry. While this generally follows the overall concept of our approach (i.e. global geometry and lighting changes can be modeled geometrically while subtle details are captured in the texture), the database could also be extended by different lighting conditions.


Future Work
In our experiments, we manually selected transition points at the beginning and at the end of each dynamic texture. While this successfully demonstrates the visual quality of synthesized facial animations achieved with our approach, it would be desirable to directly switch from one sequence to another (e.g. starting to laugh while talking) without necessarily finishing the first one. This could be achieved, for example, by analyzing the facial video sequences in order to extract optimal transition points where it is possible to directly switch from one sequence to another, similar to a motion-graph-based approach. This would allow the creation of more complex animation graphs. Furthermore, we plan to extend the texture database by capturing facial expressions under different lighting situations. For animation, appropriate textures could then be selected based on the target expression and the desired lighting conditions, and it would be possible to blend between different lighting conditions during the animation. Another possible extension is the definition of local regions in texture space, allowing for an independent animation of multiple face parts (e.g. eyes and mouth) at the same time.

7. CONCLUSION
We presented a method for video-based synthesis of facial performances. The key idea of our approach is to enable the animator to create novel facial performance sequences by simply providing a sequence of facial expression labels that describe the desired facial performance. This makes it easy to create novel facial videos, even for untrained users. Deformations caused by facial expressions are encapsulated in dynamic textures that are extracted from real video footage. New facial animation sequences are then synthesized by clever concatenation of dynamic textures rendered upon the geometry, instead of modeling all fine deformations in 3D. In order to create seamless transitions between consecutive sequences, we perform a geometric and photometric optimization of each sequence. Through the extraction of dynamic textures from real video footage and the definition of transition rules between independent textures, our approach combines the photorealism of real image data with the ability to modify or re-animate recorded performances. Possible applications of our approach range from video editing applications to the animation of digital characters.

8. REFERENCES
[1] O. Alexander, M. Rogers, W. Lambeth, M. Chiang, and P. Debevec. Creating a photoreal digital actor: The Digital Emily project. In European Conference on Visual Media Production (CVMP), London, UK, 2009.
[2] D. Blumenthal-Barby and P. Eisert. High-resolution depth for binocular image-based modelling. Computers & Graphics, 39:89-100, 2014.
[3] G. Borshukov, D. Piponi, O. Larsen, J. P. Lewis, and C. Tempelaar-Lietz. Universal capture - image-based facial animation for "The Matrix Reloaded". In ACM SIGGRAPH Courses, 2005.
[4] J. Carranza, C. Theobalt, M. A. Magnor, and H.-P. Seidel. Free-viewpoint video of human actors. In ACM SIGGRAPH, 2003.
[5] D. Casas, M. Tejera, J.-Y. Guillemaut, and A. Hilton. 4D parametric motion graphs for interactive animation. In Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, I3D '12, pages 103-110, New York, NY, USA, 2012. ACM.
[6] D. Casas, M. Volino, J. Collomosse, and A. Hilton. 4D video textures for interactive character appearance. Computer Graphics Forum, 33(2):371-380, 2014.
[7] J.-X. Chai, J. Xiao, and J. Hodgins. Vision-based control of 3D facial animation. In ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2003.
[8] P. Eisert and J. Rurainsky. Geometry-assisted image-based rendering for facial analysis and synthesis. Signal Processing: Image Communication, 21(6):493-505, 2006.
[9] P. Garrido, L. Valgaerts, C. Wu, and C. Theobalt. Reconstructing detailed dynamic face geometry from monocular video. ACM Transactions on Graphics, 32(6):158:1-158:10, 2013.
[10] P. Garrido, L. Valgaerts, C. Wu, and C. Theobalt. Reconstructing detailed dynamic face geometry from monocular video. ACM Trans. Graph., 32(6):158:1-158:10, November 2013.
[11] P. Garrido, L. Valgaerts, H. Sarmadi, I. Steiner, K. Varanasi, P. Perez, and C. Theobalt. VDub: Modifying face video of actors for plausible visual alignment to a dubbed audio track. In Eurographics 2015, 2015.
[12] A. Hilsmann and P. Eisert. Tracking deformable surfaces with optical flow in the presence of self-occlusions in monocular image sequences. In CVPR Workshops, Workshop on Non-Rigid Shape Analysis and Deformable Image Alignment (NORDIA), pages 1-6. IEEE Computer Society, June 2008.
[13] P. Huang, A. Hilton, and J. Starck. Human motion synthesis from 3D video. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 1478-1485, June 2009.
[14] I. Kemelmacher-Shlizerman, E. Shechtman, R. Garg, and S. M. Seitz. Exploring photobios. ACM Trans. Graph., 30(4):61:1-61:10, July 2011.
[15] J. Kilner, J. Starck, and A. Hilton. A comparative study of free-viewpoint video techniques for sports events. In European Conference on Visual Media Production (CVMP), 2006.
[16] L. Kovar, M. Gleicher, and F. Pighin. Motion graphs. In Proc. of the 29th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '02, pages 473-482, New York, NY, USA, 2002. ACM.
[17] K. Li, Q. Dai, R. Wang, Y. Liu, F. Xu, and J. Wang. A data-driven approach for facial expression retargeting in video. IEEE Transactions on Multimedia, 16:299-310, 2014.
[18] C. Lipski, F. Klose, K. Ruhl, and M. Magnor. Making of "Who Cares" HD stereoscopic free viewpoint video. In European Conference on Visual Media Production (CVMP), 2011.
[19] W. Paier, M. Kettern, and P. Eisert. Realistic retargeting of facial video. In Proc. of the 11th European Conference on Visual Media Production, CVMP '14, pages 2:1-2:10, New York, NY, USA, 2014. ACM.
[20] P. Perez, M. Gangnet, and A. Blake. Poisson image editing. ACM Trans. Graph., 22(3):313-318, July 2003.
[21] F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, and D. H. Salesin. Synthesizing realistic facial expressions from photographs. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '98, pages 75-84, New York, NY, USA, 1998. ACM.
[22] F. Pighin and J. Lewis. Facial motion retargeting. In ACM SIGGRAPH Courses, 2006.
[23] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley. Color transfer between images. IEEE Computer Graphics and Applications, 21(5):34-41, 2001.
[24] J. M. Saragih, S. Lucey, and J. F. Cohn. Deformable model fitting by regularized landmark mean-shift. Int. J. Comput. Vision, 91(2):200-215, January 2011.
[25] A. Schödl and I. A. Essa. Controlled animation of video sprites. In SCA '02: Proceedings of the 2002 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 121-127, New York, NY, USA, 2002. ACM.
[26] D. Sibbing, M. Habbecke, and L. Kobbelt. Markerless reconstruction and synthesis of dynamic facial expressions. Computer Vision and Image Understanding, 115(5):668-680, 2011.
[27] T. Weise, H. Li, L. Van Gool, and M. Pauly. Face/Off: Live facial puppetry. In Proceedings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA '09, pages 7-16, New York, NY, USA, 2009. ACM.
[28] F. Xu, Y. Liu, C. Stoll, J. Tompkin, G. Bharaj, Q. Dai, H.-P. Seidel, J. Kautz, and C. Theobalt. Video-based characters: Creating new human performances from a multi-view video database. In ACM SIGGRAPH 2011 Papers, SIGGRAPH '11, pages 32:1-32:10, New York, NY, USA, 2011. ACM.