semantic parametric body shape estimation from...

SEMANTIC PARAMETRIC BODY SHAPE ESTIMATION FROM NOISY DEPTH

SEQUENCES

Alexandru E. Ichim Federico Tombari

Online Submission ID: 0417

Dynamic 3D Avatar Creation from Hand-held Video Input

Figure 1: Our system creates a fully rigged 3D avatar of the user from uncalibrated video input acquired with a cell-phone camera. Theblendshape models of the reconstructed avatars are augmented with textures and dynamic detail maps, and can be animated in realtime.

Abstract1

We present a complete pipeline for creating fully rigged, personal-2

ized 3D facial avatars from hand-held video. Our system faithfully3

recovers facial expression dynamics of the user by adapting a blend-4

shape template to an image sequence of recorded expressions using5

an optimization that integrates feature tracking, optical flow, and6

shape from shading. Fine-scale details such as wrinkles are captured7

separately in normal maps and ambient occlusion maps. From this8

user- and expression-specific data, we learn a regressor for on-the-fly9

detail synthesis during animation to enhance the perceptual realism10

of the avatars. Our system demonstrates that the use of appropri-11

ate reconstruction priors yields compelling face rigs even with a12

minimalistic acquisition system and limited user assistance. This13

facilitates a range of new applications in computer animation and14

consumer-level online communication based on personalized avatars.15

We present realtime application demos to validate our method.16

CR Categories: I.3.7 [Computer Graphics]: Three-Dimensional17

Graphics and Realism—Animation;18

Keywords: 3D avatar creation, face animation, blendshapes, rig-19

ging20

1 Introduction21

Recent advances in realtime face tracking enable fascinating new ap-22

plications in performance-based facial animation for entertainment23

and human communication. Current realtime systems typically use24

the extracted tracking parameters to animate a set of pre-defined25

characters [Weise et al. 2011; Li et al. 2013; Cao et al. 2013; Bouaziz26

et al. 2013; Cao et al. 2014a]. While this allows the user to enact27

virtual avatars in realtime, personalized interaction requires a custom28

rig that matches the facial geometry, texture, and expression dynam-29

ics of the user. With accurate tracking solutions in place, creating30

compelling user-specific face rigs is currently a major challenge for31

new interactive applications in online communication. In this paper32

we propose a software pipeline for building fully rigged 3D avatars33

from hand-held video recordings of the user.34

Avatar-based interactions offer a number of distinct advantages for35

online communication compared to video streaming. An important36

benefit for mobile applications is the significantly lower demand37

on bandwidth. Once the avatar has been transferred to the target38

device, only animation parameters need to be transmitted during39

live interaction. Bandwidth can thus be reduced by several orders40

of magnitude compared to video streaming, which is particularly41

relevant for multi-person interactions such as conference calls.42

A second main advantage is the increased content flexibility. A 3D43

avatar can be more easily integrated into different scenes, such as44

games or virtual meeting rooms, with changing geometry, illumi-45

nation, or viewpoint. This facilitates a range of new applications,46

in particular on mobile platforms and for VR devices such as the47

Occulus Rift.48

Our goal is to enable users to create fully rigged and textured 3D49

avatars of themselves at home. These avatars should be as realistic50

as possible, yet lightweight, so that they can be readily integrated51

into realtime applications for online communication. Achieving52

this goal implies meeting a number of constraints: the acquisition53

hardware and process need to be simple and robust, precluding54

any custom-build setups that are not easily deployable. Manual55

assistance needs to be minimal and restricted to operations that can56

be easily performed by untrained users. The created rigs need to be57

efficient to support realtime animation, yet accurate and detailed to58

enable engaging virtual interactions.59

These requirements pose significant technical challenges. We maxi-60

mize the potential user base of our system by only relying on simple61

photo and video recording using a hand-held cell-phone camera to62

acquire user-specific data.63

The core processing algorithms of our reconstruction pipeline run64

automatically. To improve reconstruction quality, we integrate a65

simple UI to enable the user to communicate tracking errors with66

simple point clicking. User assistance is minimal, however, and67

required less than 15 minutes of interaction for all our examples.68

Realtime applications are enabled by representing the facial rig as69

a set of blendshape meshes with low polygon count. Blendshapes70

allow for efficient animation and are compatible with all major ani-71

mation tools. We increase perceptual realism by adding fine-scale72

facial features such as dynamic wrinkles that are synthesized on the73

fly during animation based on precomputed normal and ambient74

occlusion maps.75

We aim for the best possible quality of the facial rigs in terms of76

geometry, texture, and expression dynamics. To achieve this goal,77

we formulate dynamic avatar creation as a geometry and texture re-78

construction problem that is regularized through the use of carefully79

designed facial priors. These priors enforce consistency and guaran-80

1

Dynamic 3D Avatar Creation From Hand-held Video Input

SOME OF MY OTHER WORK 1

2

Static Dynamic8 MP, ~100 images HD, ~2 min

Semantic Parametric Body Shape Estimation from Noisy Depth Sequences


3



4

Semantic Parametric Body Shape Estimation from Noisy Depth Sequences 5


Semantic Parametric Body Shape Estimation from Noisy Depth Sequences 6



MOTIVATION

7

Generic framework for tracking and modeling articulated bodies

in our experiments:

human bodies

RGB-D input

can be applied to:

hand tracking

animal tracking


BODY REPRESENTATION

8

Pose variations - skeleton + linear blend skinning

Rest pose variations - blendshapes + bone lengths


REGISTRATION ENERGY

9


REGISTRATION ENERGY

10

Features

Efeatures

=X

j

��tposej

� �j

��22

NiTE landmarks - replaceable

Point-to-plane dense closeness

Esurface =X

i

��n

Ti (xi � vi)

��22

ICP iterations


REGISTRATION ENERGY

11

Contour

n =(r

x

Icontour

,ry

Icontour

, 0)T��(rx

Icontour

,ry

Icontour

, 0)��

extracted from the 2D depth image

can be extended to RGB-segmentation


REGISTRATION ENERGY

12

Statistical Pose Prior

Eprior

= �5Eprior proj

+ �6Eprior dev

Eprior proj

=��(� � µ)�MMT (� � µ)

��2

2distance to projection

Eprior dev

=��⌃�1/2MT (� � µ)

��2

2

std-dev-weighting of the projection


REGISTRATION ENERGY

13

Smoothness term

Esmoothness

=��(t) � �(t�1)

��2

2


MODELING ENERGY

14

Emodeling

= �1Esurface

+ �2Econtour

+ �3Ebs reg

Regularization termsEbs reg L2 = ||s||22Ebs reg L1 = ||s||1

accumulate constraints over windows of frames - performance trade-off

solve and update the body template shape


MODELING ENERGY

15

word cloud visualizations

L1 regularization

L2 regularization


RESULTS

16


Thank you!

17

semantic parametric body shape estimation from...

Documents