Lightweight Photometric Stereo for Facial Details Recovery

Xueying Wang1  Yudong Guo1  Bailin Deng2  Juyong Zhang1*

1University of Science and Technology of China  2Cardiff University

{WXY17719, gyd2011}@mail.ustc.edu.cn  [email protected]  [email protected]

*Corresponding author

Abstract

Recently, 3D face reconstruction from a single image has achieved great success with the help of deep learning and shape prior knowledge, but existing methods often fail to produce accurate geometry details. On the other hand, photometric stereo methods can recover reliable geometry details, but require dense inputs and need to solve a complex optimization problem. In this paper, we present a lightweight strategy that only requires sparse inputs, or even a single image, to recover high-fidelity face shapes from images captured under near-field lights. To this end, we construct a dataset containing 84 different subjects with 29 expressions under 3 different lights. Data augmentation is applied to enrich the data in terms of diversity in identity, lighting, expression, etc. With this constructed dataset, we propose a novel neural network specially designed for photometric stereo based 3D face reconstruction. Extensive experiments and comparisons demonstrate that our method can generate high-quality reconstruction results with one to three facial images captured under near-field lights. Our full framework is available at https://github.com/Juyong/FacePSNet.

1. Introduction

High-quality 3D face reconstruction is an important problem in computer vision and graphics [38] that is related to various applications such as digital actors [3], face recognition [5, 48] and animation [3, 21, 40]. Some works have been devoted to solving this problem at the source, using either multi-view information [15, 43] or controlled illumination conditions [1, 2, 37]. Although some of these methods are capable of reconstructing high-quality 3D face models with both low-frequency structures and high-frequency details like wrinkles and pores, the hardware environment is hard to set up and the underlying optimization problem is not easy to solve. For this reason, 3D face reconstruction from a single image has attracted wide attention, with many works focusing on reconstruction from an "in-the-wild" image [35, 17, 14, 18]. Although most of them can reconstruct accurate low-frequency facial structures, few can recover fine facial details. In this paper, we turn our attention to the photometric stereo technique [42], and consider the near-field point light source setting due to its portability. We aim to reconstruct high-precision 3D face models with sparse inputs using photometric stereo under near point lighting.

State-of-the-art sparse photometric 3D reconstruction methods such as [9, 13] can reconstruct 3D face shapes with fine geometric details. However, they are mainly based on conventional optimization approaches with high computational costs. In recent years, great progress has been made in deep learning-based photometric stereo [22, 12], which can estimate accurate normals. However, these existing methods cannot be directly applied to our problem. First, they mainly focus on general objects with dense inputs, making them unsuitable for our 3D face reconstruction problem with sparse inputs. Second, they assume parallel directional lights, which are difficult to achieve in practice, especially under indoor lighting conditions. To solve the sparse photometric stereo problem both quickly and accurately, we must address the following challenges. First, without the parallel lighting assumption, calibrating the lighting of near-field point light sources is much more complex and requires solving a non-linear optimization problem. Moreover, the reconstruction problem with fewer than three input images is ill-posed, and thus prior knowledge of the reconstructed object is needed.

In this paper, we combine deep learning-based sparse photometric stereo and facial prior information to reconstruct high-accuracy 3D face models. Currently, there is no publicly available dataset of face images captured under near point lighting conditions together with their corresponding 3D geometry. Therefore, we construct such a dataset for network training. We use real face images captured with a system composed of three near point light sources and a fixed camera. Based on this system, we develop an optimization method to recover 3D geometry while calibrating light positions and estimating normals. Using our reconstructed 3D face models and publicly available high-quality 3D face datasets, we augment our dataset by synthesizing a large number of face images with their corresponding 3D shapes. With the real and synthetic data, we design a two-stage convolutional neural network to estimate a high-accuracy normal map from sparse input images. The coarse shape, represented by a parametric 3D face model [6] and the pose parameters, is recovered in the first stage. The face images and the normal map obtained from the first stage are fed into the second-stage network to estimate a more accurate normal map. Finally, a high-quality 3D face model is recovered via a fast surface-from-normal optimization. Fig. 1 shows the pipeline of our method. Comprehensive experiments demonstrate that our network can produce more accurate normal maps than state-of-the-art photometric stereo methods. Our lightweight method can also recover fine facial details better than state-of-the-art single image-based face reconstruction methods.

Figure 1. We propose a convolutional neural network based method for face reconstruction under the photometric stereo scenario. (a) & (b): Our dataset for network training consists of photos with different expressions, captured using a system composed of three near point light sources and a fixed camera. (c): Our proposed method can recover fine details even with a single image input (left). For images captured by a smartphone, with a hand-held light at locations not seen in our training dataset, our method also works well in this casual setup (right).

2. Related work

Photometric Stereo. The photometric stereo (PS) method [42] estimates surface normals from a set of images captured under different lighting conditions. Since the seminal work of [42], different methods have been proposed to recover surfaces in this manner [44, 20]. Many such methods assume directional lights with light sources at infinity. On the other hand, some works focus on reconstruction under near point light sources, using optimization approaches that are often complex and time-consuming [46, 28, 32]. To achieve efficiency for practical applications with near point light sources, we only adopt optimization-based methods to construct the training dataset, and then train a neural network for lightweight photometric stereo based 3D face reconstruction. The work most related to our training data construction step is [9], which proposed an iterative pipeline to reconstruct high-quality 3D face models.

Deep Learning-Based Photometric Stereo. With the development of convolutional neural networks, various deep learning-based approaches have been proposed to solve photometric stereo problems. Most of them can be categorized into two types according to their input. The first type requires images together with corresponding calibrated lighting conditions. Santo et al. [34] proposed a differentiable multi-layer deep photometric stereo network (DPSN) to learn the mapping from a measurement of a pixel to the corresponding surface normal. Chen et al. [12] put forward a fully convolutional network to predict the normal map of a static object from an arbitrary number of images. A physics-based unsupervised neural network was proposed by Taniai et al. [39] with both the surface normal map and synthesized images as output. Ikehata [22] presented an observation map to describe pixel-wise illumination information, and estimated surface normals with the observation map as input to an end-to-end convolutional network. Furthermore, Zheng et al. [47] and Li et al. [26] solved the sparse photometric stereo problem based on the observation map. This type of work assumes lighting directions as a prior and cannot handle unknown lighting directions. The second type directly estimates lighting conditions and normal maps together from the input images. A network named UPS-FCN was introduced in [12] to calibrate lights and predict surface normals. Later, Chen et al. [11] proposed a two-stage deep learning architecture called SDPS-Net to handle this uncalibrated problem. Both types focus on solving photometric stereo problems under directional lights, which are difficult to achieve in practice, and most of these methods do not perform well with sparse inputs. In this paper, we solve the sparse uncalibrated photometric stereo problem under near-field point light sources.

Single Image-Based 3D Face Reconstruction. 3D face reconstruction from a single image has made great progress in recent years. The key to this task is to establish a correspondence map from 2D pixels to 3D points. Jackson et al. [23] proposed to directly regress a volumetric representation of the 3D mesh from a single face image with a convolutional neural network. Feng et al. [16] designed a 2D representation called the UV position map to record the 3D positions of a complete human face. Deng et al. [14] directly regressed a group of parameters based on 3DMM [6, 7, 31]. All these works can reconstruct the 3D face model from a single image but cannot recover geometry details. Recently, this issue has been addressed with a coarse-to-fine reconstruction strategy. Sela et al. [36] first constructed a coarse model based on a depth map and a dense correspondence map, and then recovered details in a geometric refinement process. Richardson et al. [33] developed an end-to-end CNN framework composed of a CoarseNet and a FineNet to reconstruct detailed face models. Jiang et al. [24] designed a three-stage approach based on a bilinear face model and the shape-from-shading (SfS) method. Li et al. [27] recovered face details using SfS along with an albedo prior mask and a depth-image gradient constraint. Tran et al. [41] proposed a bump map to describe face details and used a hole-filling approach to handle occlusions. Chen et al. [10] recovered high-quality face models based on a proxy estimation and a displacement map. For 3D face reconstruction from caricature images, Wu et al. [45] proposed an intrinsic deformation representation for extrapolation from normal 3D face shapes.

Most existing works approximate the human face as a Lambertian surface and simulate the environment light using the spherical harmonics (SH) basis functions, which is not suitable for the near point lighting condition due to the large area of shadows. Based on our constructed dataset, we also design a network that can reconstruct a 3D face model with rich details from a single image captured under the near point lighting condition.

3. Dataset Construction

In this paper, we propose a lightweight method to reconstruct high-quality 3D face models from uncalibrated sparse photometric stereo images. As there is no publicly available dataset that contains face images with near point lighting and their corresponding 3D face shapes, we construct such a dataset ourselves. Given face images captured under different light sources, we would like to solve for the albedos and the normals of the face model such that the intensities of the resulting images under calibrated lights are consistent with the observed intensities from the input images. This problem may be ill-posed with only three input images due to the presence of shadows. Therefore, we utilize a parametric 3D face model as prior knowledge, and propose an optimization method to estimate accurate normal maps. In this section, we first introduce the necessary background, and then present how we construct the real image-based dataset and the synthetic dataset.

3.1. Preliminaries

Imaging Formula. We approximate the human face as a Lambertian surface and model the near point lighting condition in the photometric stereo setting. Given a point light source at position $\mathbf{P}_j \in \mathbb{R}^3$ with illumination $\beta_j \in \mathbb{R}$, the imaging formula for a point $i$ can be expressed as [46]:

$$\mathbf{I}_{ij}(\mathbf{V}_i, \mathbf{N}_i, \boldsymbol{\rho}_i) \triangleq \boldsymbol{\rho}_i \left( \mathbf{N}_i \cdot \frac{\beta_j (\mathbf{P}_j - \mathbf{V}_i)}{\lVert \mathbf{P}_j - \mathbf{V}_i \rVert_2^3} \right), \tag{1}$$

where $\mathbf{V}_i, \mathbf{N}_i \in \mathbb{R}^3$ are the position and normal of the point, and $\mathbf{I}_{ij}, \boldsymbol{\rho}_i \in \mathbb{R}^3$ are the intensity and albedo in the RGB color space, respectively. Given the captured images, the photometric stereo problem with near point light sources is to recover the lighting positions and illuminations, together with the position, albedo and normal of each point on the object.
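To make the inverse-cube fall-off in Eq. (1) concrete, here is a minimal NumPy sketch of the imaging model; the vectorized layout and the zero-clamping of back-facing points are our own illustrative choices:

```python
import numpy as np

def render_intensity(V, N, rho, P, beta):
    """Lambertian intensity under a near point light (Eq. 1).

    V, N : (m, 3) arrays of 3D point positions and unit normals.
    rho  : (m, 3) RGB albedos.
    P    : (3,) light position; beta: scalar light illumination.
    Returns (m, 3) RGB intensities.
    """
    d = P - V                                    # (m, 3) point-to-light vectors
    falloff = beta / np.linalg.norm(d, axis=1, keepdims=True) ** 3
    shading = np.sum(N * d, axis=1, keepdims=True) * falloff  # N . beta(P - V)/||P - V||^3
    return rho * np.clip(shading, 0.0, None)     # attached shadows give zero intensity
```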

Parametric Face Model. 3DMM [6] is a widely used parametric model for human face geometry and albedo. We use 3DMM to build a coarse face model for further optimization. In general, the parametric model represents the face geometry $\mathbf{G} \in \mathbb{R}^{3n_v}$ and albedo $\mathbf{A} \in \mathbb{R}^{3n_v}$ as

$$\mathbf{G} = \overline{\mathbf{G}} + \mathbf{B}_{id}\boldsymbol{\alpha}_{id} + \mathbf{B}_{exp}\boldsymbol{\alpha}_{exp}, \tag{2}$$

$$\mathbf{A} = \overline{\mathbf{A}} + \mathbf{B}_{albedo}\boldsymbol{\alpha}_{albedo}, \tag{3}$$

where $n_v$ is the number of vertices of the face model; $\overline{\mathbf{G}} \in \mathbb{R}^{3n_v}$ and $\overline{\mathbf{A}} \in \mathbb{R}^{3n_v}$ are the mean shape and albedo, respectively; $\boldsymbol{\alpha}_{id} \in \mathbb{R}^{100}$, $\boldsymbol{\alpha}_{exp} \in \mathbb{R}^{79}$ and $\boldsymbol{\alpha}_{albedo} \in \mathbb{R}^{100}$ are the corresponding coefficient vectors specifying an individual; $\mathbf{B}_{id} \in \mathbb{R}^{3n_v \times 100}$, $\mathbf{B}_{exp} \in \mathbb{R}^{3n_v \times 79}$ and $\mathbf{B}_{albedo} \in \mathbb{R}^{3n_v \times 100}$ are principal axes extracted from a collection of 3D face models by PCA. We use the Basel Face Model (BFM) [31] for $\mathbf{B}_{id}$ and $\mathbf{B}_{albedo}$, and FaceWarehouse [8] for $\mathbf{B}_{exp}$.
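The following sketch shows the linear synthesis of Eqs. (2)-(3); the basis matrices here are random placeholders and the vertex count is illustrative, whereas the actual bases come from BFM [31] and FaceWarehouse [8]:

```python
import numpy as np

n_v = 35709                                    # illustrative vertex count
G_mean = np.zeros(3 * n_v)                     # mean shape (placeholder)
A_mean = np.zeros(3 * n_v)                     # mean albedo (placeholder)
B_id  = np.random.randn(3 * n_v, 100) * 1e-3   # placeholders for the PCA bases
B_exp = np.random.randn(3 * n_v, 79) * 1e-3
B_alb = np.random.randn(3 * n_v, 100) * 1e-3

def synthesize_face(alpha_id, alpha_exp, alpha_albedo):
    """Eqs. (2)-(3): linear 3DMM geometry and albedo synthesis."""
    G = G_mean + B_id @ alpha_id + B_exp @ alpha_exp
    A = A_mean + B_alb @ alpha_albedo
    return G.reshape(n_v, 3), A.reshape(n_v, 3)
```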

Camera Model. We use the standard perspective projection to project the 3D face model onto the image plane, which can be expressed as

$$\mathbf{q}_i = \Pi(\mathbf{R}\mathbf{V}_i + \mathbf{t}), \tag{4}$$

where $\mathbf{q}_i \in \mathbb{R}^2$ is the location of vertex $\mathbf{V}_i$ in the image plane, $\mathbf{R} \in \mathbb{R}^{3 \times 3}$ is the rotation matrix constructed from the Euler angles pitch, yaw and roll, $\mathbf{t} \in \mathbb{R}^3$ is the translation vector, and $\Pi : \mathbb{R}^3 \rightarrow \mathbb{R}^2$ is the perspective projection.
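A small sketch of Eq. (4), assuming a pinhole camera; the intrinsics (fx, fy, cx, cy) are hypothetical parameters, as the paper does not list its camera intrinsics here:

```python
import numpy as np

def project(V, R, t, fx, fy, cx, cy):
    """Eq. (4): perspective projection of vertices V (m, 3) onto the image."""
    Vc = V @ R.T + t                    # rigid transform into camera space
    x = fx * Vc[:, 0] / Vc[:, 2] + cx   # Pi: divide by depth, scale, shift
    y = fy * Vc[:, 1] / Vc[:, 2] + cy
    return np.stack([x, y], axis=1)     # (m, 2) pixel locations q_i
```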

3.2. Construction of Real Dataset

Our real dataset is derived from photometric face images captured using a system consisting of three near point light sources (on the front, left and right) and a fixed camera. The dataset contains 84 subjects covering different races, genders and ages, with each subject captured under 29 different expressions. All images are captured at a resolution of 1600×1200. Similar to [9], we design an optimization-based method to reconstruct a 3D face model with rich details from a set of images captured under different near point lighting positions and illuminations. The method in [9] uses the face shape prior for lighting calibration, then estimates the normals and recovers the depths in the image plane. Different from existing photometric stereo methods, which typically need more than three images, we have only three images as input, and there may exist under-determined parts caused by shadows (Fig. 2 (b)). To alleviate this problem, we utilize the parametric model to help recover the normals. From the recovered coarse shape and updated normals, we can recover the 3D face shape with fine details, as shown in Fig. 2 (a). Our algorithm pipeline is shown in Fig. 3.

Figure 2. (a) Some results from our constructed real dataset. From left to right: input images, estimated normals and the reconstructed face models. (b) Updated normals with the method in [9], which chooses at least three reliable lights to update normals after handling shadows. This method can only update part of the normals due to the large area of shadows and only three input images, so it is not suitable for our situation.

Figure 3. The algorithm pipeline of real dataset construction: coarse model fitting, lighting calibration, normal estimation and vertex recovery.

In order to provide a good initial 3D face shape for the subsequent optimization, we first generate the coarse face model from the three input images using the optimization-based inverse rendering method in [24]. Different from the problem setting in [24], which has only one input image, we have three face images that share the same shape, expression and albedo parameters but have different lighting conditions. After recovering the coarse face model, we calibrate the light positions $\mathbf{P} \in \mathbb{R}^{3 \times n}$ and illuminations $\boldsymbol{\beta} \in \mathbb{R}^n$ using the calibration method proposed in [9]. Since the Lambertian surface model is invalid in regions under shadow, we use a simple filter to determine the available light sources $\mathcal{L}_i$ for each triangle of the 3D face mesh:

$$\mathcal{L}_i = \left\{ j \mid \mathbf{N}^f_i \cdot (\mathbf{P}_j - \mathbf{V}^f_i) > 0,\ j = 1, \ldots, n \right\}, \tag{5}$$

where $\mathbf{P}_j \in \mathbb{R}^3$ is the position of the $j$-th light source, and $\mathbf{N}^f_i, \mathbf{V}^f_i \in \mathbb{R}^3$ are the normal and centroid of the $i$-th triangle. We only use the available light sources in $\mathcal{L}_i$ to update the normal of each triangle. During photometric stereo optimization, we first optimize the triangle normals and then recover the vertex positions from the updated normals.
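The visibility filter of Eq. (5) amounts to one dot product per triangle-light pair; a vectorized NumPy sketch might look as follows:

```python
import numpy as np

def available_lights(N_f, V_f, P):
    """Eq. (5): per-triangle light visibility filter.

    N_f, V_f : (F, 3) triangle normals and centroids.
    P        : (n, 3) calibrated light positions.
    Returns a boolean (F, n) mask; entry (i, j) is True when light j is in
    front of triangle i, i.e. N_i^f . (P_j - V_i^f) > 0.
    """
    d = P[None, :, :] - V_f[:, None, :]        # (F, n, 3) centroid-to-light vectors
    return np.einsum('fc,fnc->fn', N_f, d) > 0.0
```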

Normal Update. As the 3D face mesh recovered by the parametric model only contains low-frequency signals, rich geometry details are lost. Thus we refine the normal of each triangle based on photometric stereo. The updated normals $\mathbf{N}$ and albedos $\boldsymbol{\rho}$ are optimized via:

$$\begin{aligned}
\min_{\boldsymbol{\rho}, \mathbf{N}}\ & \sum_{i \in F_v} \sum_{j \in \mathcal{L}_i} \left\lVert \mathbf{I}_{ij} - \mathbf{I}_{ij}(\mathbf{V}^f_i, \mathbf{N}^f_i, \boldsymbol{\rho}^f_i) \right\rVert_2^2 + \mu_1 \left\lVert \mathbf{N} - \overline{\mathbf{N}} \right\rVert_F^2 + \mu_2 \sum_{i \in F_v} \Big\lVert \boldsymbol{\rho}^f_i - \frac{1}{|\Omega_i|} \sum_{j \in \Omega_i} \boldsymbol{\rho}^f_j \Big\rVert_2^2 \\
\text{s.t.}\ & \left\lVert \mathbf{N}^f_i \right\rVert_2 = 1 \quad (i = 1, \ldots, |F_v|). 
\end{aligned} \tag{6}$$

Here the first term penalizes the deviation between the observed intensity $\mathbf{I}_{ij}$ from the input images and the resulting intensity $\mathbf{I}_{ij}(\cdot)$ evaluated with Eq. (1) using the updated albedo $\boldsymbol{\rho}^f_i$ and the updated normal $\mathbf{N}^f_i$ at each triangle centroid $\mathbf{V}^f_i$, with $F_v$ representing the set of visible triangles on the initial face model. $\mathbf{I}_{ij}$ is determined by projecting the centroid $\mathbf{V}^f_i$ onto the image plane and performing bilinear interpolation of its nearest pixels. The second term penalizes the deviation between the updated normals $\mathbf{N} \in \mathbb{R}^{3 \times |F_v|}$ on visible triangles and the corresponding normals $\overline{\mathbf{N}} \in \mathbb{R}^{3 \times |F_v|}$ on the initial face model. The last term regularizes the smoothness of the updated albedo, with $\Omega_i$ denoting the set of visible triangles in the one-ring neighborhood of triangle $i$. We solve Eq. (6) via alternating minimization. Specifically, we optimize $\mathbf{N}$ while fixing $\boldsymbol{\rho}$, and then optimize $\boldsymbol{\rho}$ while fixing $\mathbf{N}$. This process is iterated until convergence.
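The alternating scheme for Eq. (6) can be sketched at a high level as follows; `solve_normals` and `solve_albedos` are hypothetical stand-ins for the two inner solvers, whose details the paper does not spell out:

```python
import numpy as np

def alternating_update(N, rho, solve_normals, solve_albedos, n_iters=10, tol=1e-6):
    """Alternating minimization of Eq. (6) over normals N (3, F) and albedos rho."""
    for _ in range(n_iters):
        N_new = solve_normals(rho)                              # fix rho, update N
        N_new /= np.linalg.norm(N_new, axis=0, keepdims=True)   # enforce ||N_i^f|| = 1
        rho_new = solve_albedos(N_new)                          # fix N, update rho
        converged = (np.linalg.norm(N_new - N) < tol
                     and np.linalg.norm(rho_new - rho) < tol)
        N, rho = N_new, rho_new
        if converged:
            break
    return N, rho
```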

Vertex Recovery. After updating the triangle normals $\mathbf{N}$, we optimize the face shape as a height field $\mathbf{Z} \in \mathbb{R}^m$ over the image plane to match the updated normals, where $m$ is the number of pixels covered by the projection of the coarse face model. We first transfer $\mathbf{N}$ to pixel normals via the standard perspective projection. Then we compute $\mathbf{Z}$ via:

$$\min_{\mathbf{Z}} \left\lVert \mathbf{N} - \mathbf{N}_0 \right\rVert_F^2 + w_1 \left\lVert \mathbf{Z} - \mathbf{Z}_0 \right\rVert_2^2 + w_2 \left\lVert \Delta \mathbf{Z} \right\rVert_2^2. \tag{7}$$

Here $\mathbf{Z}_0 \in \mathbb{R}^m$ is the initial height field obtained from the coarse face model, $\Delta \mathbf{Z} \in \mathbb{R}^m$ denotes the Laplacian of the height field, and the third term in Eq. (7) regularizes the smoothness of the height field. $\mathbf{N}_0, \mathbf{N} \in \mathbb{R}^{3 \times m}$ collect the pixel normals derived from the triangle normals and from the height field $\mathbf{Z}$, respectively. Specifically, to derive the normal $\mathbf{N}_p$ for a pixel $p$ from the height field, we first project the pixel back into its 3D location $\mathbf{V}_p$ by inverting the standard perspective projection. Then $\mathbf{N}_p$ is computed as

$$\mathbf{N}_p = \frac{\mathbf{e}_2 \times \mathbf{e}_1 + \mathbf{e}_3 \times \mathbf{e}_2 + \mathbf{e}_4 \times \mathbf{e}_3 + \mathbf{e}_1 \times \mathbf{e}_4}{\lVert \mathbf{e}_2 \times \mathbf{e}_1 + \mathbf{e}_3 \times \mathbf{e}_2 + \mathbf{e}_4 \times \mathbf{e}_3 + \mathbf{e}_1 \times \mathbf{e}_4 \rVert},$$

where $\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3, \mathbf{e}_4$ denote the vectors from $\mathbf{V}_p$ to the 3D locations of $p$'s four neighboring pixels in counter-clockwise order. This non-linear least-squares problem is solved with the Gauss-Newton algorithm.

Figure 4. The process of our data augmentation methods. (a) We generate different geometries by randomly generating shape and expression parameters from 3DMM [6] and transfer albedos obtained from our real dataset. (b) We use non-rigid ICP [4] to fit the face models in the Light Stage dataset [29] with the mean shape, together with albedos from our real dataset, to generate training data.
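The per-pixel normal formula can be evaluated over the whole height field with array slicing; the sketch below assumes interior pixels only and one particular counter-clockwise ordering of the four neighbors (the paper does not fix the image axes convention):

```python
import numpy as np

def pixel_normals(V):
    """Normal at each pixel from the 3D back-projections of its 4-neighborhood.

    V : (H, W, 3) array of per-pixel 3D locations V_p, obtained by inverting
    the perspective projection. Returns (H-2, W-2, 3) unit normals.
    """
    e1 = V[1:-1, 2:] - V[1:-1, 1:-1]    # vector to right neighbor
    e2 = V[:-2, 1:-1] - V[1:-1, 1:-1]   # vector to upper neighbor
    e3 = V[1:-1, :-2] - V[1:-1, 1:-1]   # vector to left neighbor
    e4 = V[2:, 1:-1] - V[1:-1, 1:-1]    # vector to lower neighbor
    n = (np.cross(e2, e1) + np.cross(e3, e2)
         + np.cross(e4, e3) + np.cross(e1, e4))
    return n / np.linalg.norm(n, axis=-1, keepdims=True)
```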

3.3. Construction of Synthetic Dataset

To improve the coverage of our dataset, we further construct a synthetic dataset. We use albedos and 3D face models obtained from the Light Stage dataset [29], a publicly available dataset containing 23 people with 15 different expressions and their corresponding high-resolution 3D models, as the ground truth. We then render synthetic images using Eq. (1) under three random point light positions and illuminations calibrated from our real dataset.

Data augmentation. To fit the requirements of the subsequent network training, we carry out a data augmentation process along the following two directions. On the one hand, we use the parametric model introduced in Sec. 3.1 to represent different face geometries and albedos by randomly generating parameters $\boldsymbol{\alpha}_{id}, \boldsymbol{\alpha}_{exp}, \boldsymbol{\alpha}_{albedo}$. We transfer the albedos obtained from our real dataset to these randomly generated shape models, since our initial coarse model is based on the same topology. On the other hand, to have accurate parametric models as ground truth for network training on our synthetic dataset, we register a neutral parametric model to the 3D face models obtained from the Light Stage using non-rigid ICP [4], and find closest points between these two types of models as their correspondence. We further transfer albedos from our real dataset according to this correspondence. After generating these models, we render three images for each model with the point light sources calibrated from our real dataset. The process is shown in Fig. 4.

4. Deep Photometric Stereo for 3D Faces

The optimization-based method described in Sec. 3.2 can recover high-quality facial geometry from several face images captured under different point lighting conditions, but the procedure is time-consuming and requires at least three images as input due to the ambiguity of geometry and albedo. To alleviate these problems, we propose a CNN-based method to learn high-quality facial details from an arbitrary number of face images captured under different near point lighting conditions. Similar to the procedure in Sec. 3.2, we use a two-stage network to regress a coarse face model represented with 3DMM and a high-quality normal map, respectively. With the power of CNNs and our well-constructed dataset, our method can efficiently recover high-quality facial geometry even with a single image, which is not possible for optimization-based photometric stereo methods and other deep photometric stereo methods that do not utilize facial priors. Better results can be obtained with more input images. The network structure is shown in Fig. 5.

4.1. Proxy Estimation Network

In the first stage, we learn the 3DMM parameters and pose parameters directly from a single image with a ResNet-18 [19], obtaining a coarse face model as a proxy for the second stage. The set of regressed parameters is represented by $\chi = \{\boldsymbol{\alpha}_{id}, \boldsymbol{\alpha}_{exp}, pitch, yaw, roll, \mathbf{t}\}$. To train the proxy estimation network, we use both the real data and the synthetic data with ground truth parameters as described in Sec. 3. To enrich the data, we also synthesize 5000 images using the data augmentation strategy described in Sec. 3.3.

Figure 5. The architecture of our two-stage network, which consists of (a) the Proxy Estimation Network and (b) the Normal Estimation Network. The connection between the two modules is a rendering layer that generates a coarse normal map from the estimated proxy parameters.
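As a rough PyTorch sketch, the proxy estimation network could be a stock ResNet-18 with a 185-dimensional regression head; the exact head layout below is our reading of the parameter set χ, not a detail given in the paper:

```python
import torch
import torchvision

class ProxyNet(torch.nn.Module):
    """ResNet-18 backbone regressing chi = {alpha_id (100), alpha_exp (79),
    pitch/yaw/roll (3), t (3)} from a single face image."""
    def __init__(self):
        super().__init__()
        self.backbone = torchvision.models.resnet18(num_classes=100 + 79 + 3 + 3)

    def forward(self, img):                        # img: (B, 3, H, W)
        x = self.backbone(img)                     # (B, 185) parameter vector
        alpha_id, alpha_exp, angles, t = torch.split(x, [100, 79, 3, 3], dim=1)
        return alpha_id, alpha_exp, angles, t
```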

We use two loss terms to evaluate the alignment of the dense facial geometry and sparse facial features, respectively. The first term computes the distance between the recovered geometry and the ground truth geometry as follows:

$$E_{geo}(\chi) = \lVert \mathbf{G} - \mathbf{G}_{gt} \rVert_2^2, \tag{8}$$

where $\mathbf{G}$ is the geometry recovered with Eq. (2) and $\mathbf{G}_{gt}$ is the ground truth geometry. As facial landmarks convey the structural information of the human face, we design the second term to measure how close the projected 3D landmark vertices are to the corresponding landmarks in the image:

$$E_{lan}(\chi) = \frac{1}{|L|} \sum_{i \in L} \lVert \mathbf{q}_i - \Pi(\mathbf{R}\mathbf{V}_i + \mathbf{t}) \rVert_2^2, \tag{9}$$

where $L$ is the set of landmarks, $\mathbf{q}_i$ is a detected landmark position in the input image, and $\mathbf{V}_i$ is the corresponding vertex location in the 3D mesh. The final loss function is a combination of the two loss terms:

$$E_{loss}(\chi) = E_{geo}(\chi) + w_{lan} E_{lan}(\chi), \tag{10}$$

where $w_{lan}$ is a tuning weight.
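A hedged PyTorch version of the training loss in Eqs. (8)-(10) might look as follows; the batch handling and the default value of w_lan are our own assumptions:

```python
import torch

def proxy_loss(G, G_gt, q_proj, q_det, w_lan=1.0):
    """Eqs. (8)-(10): geometry loss plus weighted landmark loss.

    G, G_gt : (B, 3*nv) recovered and ground-truth geometry vectors.
    q_proj  : (B, |L|, 2) projected 3D landmark vertices, Pi(R V_i + t).
    q_det   : (B, |L|, 2) detected 2D landmarks q_i.
    """
    e_geo = ((G - G_gt) ** 2).sum(dim=1)                    # Eq. (8)
    e_lan = ((q_proj - q_det) ** 2).sum(dim=2).mean(dim=1)  # Eq. (9)
    return (e_geo + w_lan * e_lan).mean()                   # Eq. (10), batch mean
```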

4.2. Normal Estimation Network

The geometry recovered in the first stage lacks facial details due to the limited representation ability of 3DMM. To recover the facial geometry with fine details, we learn an accurate normal map by utilizing the appearance information from the face images and the geometric information from the proxy model obtained in the first stage. Specifically, the input to our normal estimation network is several face images and the normal map rendered with the parameters obtained from our proxy estimation network, and the output is a refined normal map that contains high-quality facial details. The network architecture is similar to PS-FCN [12], which consists of a shared-weight feature extractor, an aggregation layer, and a normal regression module. One notable difference is that PS-FCN requires lighting information as input, while our normal estimation network requires the proxy geometry as input to utilize facial priors. The loss function for the normal estimation network is:

$$E_{normal} = \frac{1}{|M|} \sum_{i \in M} (1 - \mathbf{n}_i^T \tilde{\mathbf{n}}_i), \tag{11}$$

where $M$ is the set of all pixels in the face region covered by the coarse face model, and $\mathbf{n}_i$ and $\tilde{\mathbf{n}}_i$ are the estimated and ground truth normals at pixel $i$, respectively.

With the estimated accurate normal map, we then obtain a high-quality face model using the vertex recovery method explained in Sec. 3.2.
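Eq. (11) is a masked mean cosine distance; a short PyTorch sketch (the tensor layout is assumed):

```python
import torch

def normal_loss(n_pred, n_gt, mask):
    """Eq. (11): mean cosine distance between predicted and GT unit normals.

    n_pred, n_gt : (B, 3, H, W) unit normal maps.
    mask         : (B, 1, H, W) float binary mask of the face region M.
    """
    cos = (n_pred * n_gt).sum(dim=1, keepdim=True)   # n_i^T n~_i per pixel
    return ((1.0 - cos) * mask).sum() / mask.sum().clamp(min=1.0)
```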

5. Experiments

5.1. Implementation Details

To evaluate the proposed method, we select 77 subjects from our captured dataset and 18 subjects from the Light Stage dataset to train our networks, and use the remaining subjects for testing, yielding 95 subjects with 2503 samples for training and 12 subjects (7 from our constructed dataset and 5 from the Light Stage dataset) with 278 samples for testing. We implement our method in PyTorch [30] and optimize the networks' parameters with the Adam solver [25]. We first train the proxy estimation network for 200 epochs with a batch size of 50. Then we train the normal estimation network for 100 epochs with a batch size of 6 for an arbitrary number of input images. Specifically, we randomly choose one, two or three images as input in every mini-batch during training. It takes about one hour to train the proxy estimation network and 12 hours to train the normal estimation network on a single RTX 2080 Ti GPU. The results on our test set with different inputs are shown in Tab. 1; better results are achieved with more input images.

Table 1. Average angular errors (in degrees) on the test set with different inputs. S1, S2, S3 denote the leftmost, the upper-right and the lower-right image, respectively.

    S1       S2       S3       S1&S2   S2&S3   S3&S1   S1&S2&S3
    10.641   10.635   10.705   8.245   8.476   8.328   6.498

Figure 6. Ablation studies comparing the proposed method with two approaches that exclude the proxy estimation module and data augmentation, respectively, for both three-image and single-image inputs. For each method, we show the estimated normal maps and the corresponding angular error maps (color scale up to 90°). The leftmost images are used for the single image input.
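The per-mini-batch sampling of one to three inputs described above could be as simple as the following sketch (the data pipeline details are not given in the paper):

```python
import random

def sample_inputs(images):
    """Randomly feed one, two or three of the available photometric stereo
    images to the normal estimation network; `images` is a list of three
    image tensors for one subject."""
    k = random.randint(1, 3)            # number of inputs for this mini-batch
    return random.sample(images, k)
```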

5.2. Ablation Study

To validate the design of our architecture, we compare the proposed method with alternative strategies that exclude certain components. First, we demonstrate the necessity of the proxy estimation network by conducting an experiment that excludes the proxy estimation module and estimates the normal map with only face images as input to the normal estimation network. Second, we show the effectiveness of data augmentation for training the proxy estimation network with another experiment that trains the proxy estimation network without the 5000 synthesized images derived from data augmentation. The comparison results on the test set for both experiments are shown in Tab. 2 and Fig. 6. We can see that excluding either component causes a performance drop for both three-image and single-image inputs.

Table 2. Average angular errors (in degrees) on the test set for the ablation studies.

    # Input   w/o Proxy   w/o Augmentation   Ours
    1          14.843        12.342          9.875
    3           9.499         8.694          6.154

Table 3. Average angular errors (in degrees) on the test sets.

                        UPS-FCN   SDPS-Net   Ours
    Real Set             45.708     33.154   6.579
    Light Stage [29]     31.254     15.592   5.007

5.3. Comparisons

Comparison with deep learning-based photometric stereo. We further compare our network with UPS-FCN [12] and SDPS-Net [11], which solve the uncalibrated photometric stereo problem. Both methods estimate normals for general objects under different directional lights, whereas we focus on the human face under different point lighting conditions. We take three images with different uncalibrated lighting conditions from the test set as input, and compare the accuracy of the output normal map according to the angle between the output and the ground truth normal map. We show the results in Tab. 3 and Fig. 7. It can be observed from Tab. 3 that all methods perform better on the Light Stage test data, potentially due to noise in the real captured data. On the other hand, our method performs better than the other two methods both qualitatively and quantitatively, owing to the near point lighting hypothesis and the face prior information.

Figure 7. Estimated normal maps and their corresponding error maps for UPS-FCN [12], SDPS-Net [11] and our network.

Comparison with 3D face reconstruction from a single image. In order to evaluate the quality of our reconstructed 3D face models, we compare our deep learning-based reconstruction method with state-of-the-art detail-preserving reconstruction methods from a single image. Most existing methods focus on reconstruction from an "in-the-wild" image and simulate the environment lighting condition using the spherical harmonics (SH) basis functions, which performs poorly in simulating the near point lighting condition due to the large area of shadows. For a fair comparison, we take only one photometric stereo image as input to our network, and one image captured under normal lighting as input to the compared methods. The results shown in Fig. 8 demonstrate that our method can better recover facial details such as wrinkles and eyes. For quantitative evaluation, we compute a geometric error for each reconstructed model by first applying a transformation with seven degrees of freedom (six for rigid transformation and one for scaling) to align it with the ground-truth model, and then computing its point-to-point distance to the ground-truth model. The average geometric errors of Extreme3D [41], DFDN [10] and our method on the test set are 1.77, 1.54 and 0.86, respectively, with four examples shown in Fig. 9. Our method significantly outperforms the other methods due to our accurate simulation of the near point lighting condition.
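The seven-degree-of-freedom alignment can be computed in closed form with Umeyama's method; the paper does not name its solver, so the following NumPy sketch is one plausible implementation, assuming one-to-one point correspondences:

```python
import numpy as np

def align_similarity(X, Y):
    """Least-squares 7-DoF alignment (rotation R, translation t, scale s)
    of point set X (m, 3) onto Y (m, 3), following Umeyama's closed form."""
    mu_x, mu_y = X.mean(0), Y.mean(0)
    Xc, Yc = X - mu_x, Y - mu_y
    U, D, Vt = np.linalg.svd(Yc.T @ Xc / len(X))   # cross-covariance SVD
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                             # avoid reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / Xc.var(0).sum()
    t = mu_y - s * R @ mu_x
    return s, R, t

def geometric_error(X, Y):
    """Mean point-to-point distance after similarity alignment."""
    s, R, t = align_similarity(X, Y)
    return np.linalg.norm((s * (R @ X.T)).T + t - Y, axis=1).mean()
```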

Figure 8. Qualitative comparison between Pix2Vertex [36], DFDN [10], Extreme3D [41] and our method. The other methods use the left image in the top row as input, while ours uses the right image as input. Our method can reconstruct more accurate face models with fine details such as wrinkles and eyes.

Figure 9. Reconstructed results and geometric error maps (0mm to 10mm) of Extreme3D [41], DFDN [10] and ours. The other methods use the left image in the first column as input, while ours uses the right image as input.

6. Conclusion

We proposed a lightweight photometric stereo algorithm that combines a deep learning method with a face shape prior to reconstruct 3D face models containing fine-scale details. Our two-stage neural network estimates a coarse face shape capturing the overall structure and a normal map capturing the details, followed by an optimization method to recover the final facial geometry. For the network training, we constructed a real dataset across different races, genders and ages, and applied data augmentation to enrich it. Extensive experiments demonstrated that our method outperforms state-of-the-art deep learning-based photometric stereo methods and single image-based 3D face reconstruction methods.

Acknowledgement. This work was supported by the National Natural Science Foundation of China (No. 61672481) and the Youth Innovation Promotion Association CAS (No. 2018495).

References

[1] https://www.di4d.com/.
[2] https://www.eisko.com/.
[3] Oleg Alexander, M. Rogers, W. Lambeth, Jen-Yuan Chiang, Wan-Chun Ma, Chuan-Chang Wang, and Paul E. Debevec. The digital Emily project: Achieving a photorealistic digital actor. IEEE Computer Graphics and Applications, 30(4):20-31, 2010.
[4] Brian Amberg, Sami Romdhani, and Thomas Vetter. Optimal step nonrigid ICP algorithms for surface registration. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1-8. IEEE, 2007.
[5] Volker Blanz and Thomas Vetter. Face recognition based on fitting a 3D morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1063-1074, 2003.
[6] Volker Blanz, Thomas Vetter, et al. A morphable model for the synthesis of 3D faces. In SIGGRAPH, volume 99, pages 187-194, 1999.
[7] James Booth, Anastasios Roussos, Stefanos Zafeiriou, Allan Ponniah, and David Dunaway. A 3D morphable model learnt from 10,000 faces. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5543-5552, 2016.
[8] Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. FaceWarehouse: A 3D facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics, 20(3):413-425, 2013.
[9] Xuan Cao, Zhang Chen, Anpei Chen, Xin Chen, Shiying Li, and Jingyi Yu. Sparse photometric 3D face reconstruction guided by morphable models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4635-4644, 2018.
[10] Anpei Chen, Zhang Chen, Guli Zhang, Ziheng Zhang, Kenny Mitchell, and Jingyi Yu. Photo-realistic facial details synthesis from single image. In IEEE International Conference on Computer Vision (ICCV), 2019.
[11] Guanying Chen, Kai Han, Boxin Shi, Yasuyuki Matsushita, and Kwan-Yee K. Wong. Self-calibrating deep photometric stereo networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8739-8747, 2019.
[12] Guanying Chen, Kai Han, and Kwan-Yee K. Wong. PS-FCN: A flexible learning framework for photometric stereo. In European Conference on Computer Vision (ECCV), pages 3-18, 2018.
[13] Zhang Chen, Yu Ji, Mingyuan Zhou, Sing Bing Kang, and Jingyi Yu. 3D face reconstruction using color photometric stereo with uncalibrated near point lights. In IEEE International Conference on Computer Vision (ICCV), 2019.
[14] Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. Accurate 3D face reconstruction with weakly-supervised learning: From single image to image set. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019.
[15] Pengfei Dou and Ioannis A. Kakadiaris. Multi-view 3D face reconstruction with deep recurrent neural networks. Image and Vision Computing, 80:80-91, 2018.
[16] Yao Feng, Fan Wu, Xiaohu Shao, Yanfeng Wang, and Xi Zhou. Joint 3D face reconstruction and dense alignment with position map regression network. In European Conference on Computer Vision (ECCV), pages 534-551, 2018.
[17] Baris Gecer, Stylianos Ploumpis, Irene Kotsia, and Stefanos Zafeiriou. GANFIT: Generative adversarial network fitting for high fidelity 3D face reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1155-1164, 2019.
[18] Yudong Guo, Juyong Zhang, Jianfei Cai, Boyi Jiang, and Jianmin Zheng. CNN-based real-time dense face reconstruction with inverse-rendered photo-realistic face images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(6):1294-1307, 2019.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016.
[20] Steffen Herbort and Christian Wöhler. An introduction to image-based 3D surface reconstruction and a survey of photometric stereo methods. 3D Research, 2(3):4, 2011.
[21] Alexandru Eugen Ichim, Sofien Bouaziz, and Mark Pauly. Dynamic 3D avatar creation from hand-held video input. ACM Transactions on Graphics (TOG), 34(4):45, 2015.
[22] Satoshi Ikehata. CNN-PS: CNN-based photometric stereo for general non-convex surfaces. In European Conference on Computer Vision (ECCV), pages 3-18, 2018.
[23] Aaron S. Jackson, Adrian Bulat, Vasileios Argyriou, and Georgios Tzimiropoulos. Large pose 3D face reconstruction from a single image via direct volumetric CNN regression. In IEEE International Conference on Computer Vision (ICCV), pages 1031-1039, 2017.
[24] Luo Jiang, Juyong Zhang, Bailin Deng, Hao Li, and Ligang Liu. 3D face reconstruction with geometry details from a single image. IEEE Transactions on Image Processing, 27(10):4756-4770, 2018.
[25] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
[26] Junxuan Li, Antonio Robles-Kelly, Shaodi You, and Yasuyuki Matsushita. Learning to minify photometric stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7568-7576, 2019.
[27] Yue Li, Liqian Ma, Haoqiang Fan, and Kenny Mitchell. Feature-preserving detailed 3D face reconstruction from a single image. In Proceedings of the 15th ACM SIGGRAPH European Conference on Visual Media Production, page 1. ACM, 2018.
[28] Chao Liu, Srinivasa G. Narasimhan, and Artur W. Dubrawski. Near-light photometric stereo using circularly placed point light sources. In IEEE International Conference on Computational Photography (ICCP), pages 1-10. IEEE, 2018.
[29] Wan-Chun Ma, Tim Hawkins, Pieter Peers, Charles-Felix Chabert, Malte Weiss, and Paul Debevec. Rapid acquisition of specular and diffuse normal maps from polarized spherical gradient illumination. In Eurographics Conference on Rendering Techniques, pages 183-194. Eurographics Association, 2007.
[30] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
[31] Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. A 3D face model for pose and illumination invariant face recognition. In Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance, pages 296-301. IEEE, 2009.
[32] Yvain Quéau, Bastien Durix, Tao Wu, Daniel Cremers, François Lauze, and Jean-Denis Durou. LED-based photometric stereo: Modeling, calibration and numerical solution. Journal of Mathematical Imaging and Vision, 60(3):313-340, 2018.
[33] Elad Richardson, Matan Sela, Roy Or-El, and Ron Kimmel. Learning detailed face reconstruction from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1259-1268, 2017.
[34] Hiroaki Santo, Masaki Samejima, Yusuke Sugano, Boxin Shi, and Yasuyuki Matsushita. Deep photometric stereo network. In IEEE International Conference on Computer Vision (ICCV), pages 501-509, 2017.
[35] Soubhik Sanyal, Timo Bolkart, Haiwen Feng, and Michael J. Black. Learning to regress 3D face shape and expression from an image without 3D supervision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7763-7772, 2019.
[36] Matan Sela, Elad Richardson, and Ron Kimmel. Unrestricted facial geometry reconstruction using image-to-image translation. In IEEE International Conference on Computer Vision (ICCV), pages 1576-1585, 2017.
[37] Giota Stratou, Abhijeet Ghosh, Paul Debevec, and Louis-Philippe Morency. Effect of illumination on automatic expression recognition: A novel 3D relightable facial database. In Face and Gesture 2011, pages 611-618. IEEE, 2011.
[38] Georgios Stylianou and Andreas Lanitis. Image based 3D face reconstruction: A survey. International Journal of Image and Graphics, 9(02):217-250, 2009.
[39] Tatsunori Taniai and Takanori Maehara. Neural inverse rendering for general reflectance photometric stereo. In International Conference on Machine Learning (ICML), pages 4864-4873, 2018.
[40] Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2387-2395, 2016.
[41] Anh Tuan Tran, Tal Hassner, Iacopo Masi, Eran Paz, Yuval Nirkin, and Gérard G. Medioni. Extreme 3D face reconstruction: Seeing through occlusions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3935-3944, 2018.
[42] Robert J. Woodham. Photometric method for determining surface orientation from multiple images. Optical Engineering, 19(1):191139, 1980.
[43] Fanzi Wu, Linchao Bao, Yajing Chen, Yonggen Ling, Yibing Song, Songnan Li, King Ngi Ngan, and Wei Liu. MVF-Net: Multi-view 3D face morphable model regression. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 959-968, 2019.
[44] Lun Wu, Arvind Ganesh, Boxin Shi, Yasuyuki Matsushita, Yongtian Wang, and Yi Ma. Robust photometric stereo via low-rank matrix completion and recovery. In Asian Conference on Computer Vision, pages 703-717. Springer, 2010.
[45] Qianyi Wu, Juyong Zhang, Yu-Kun Lai, Jianmin Zheng, and Jianfei Cai. Alive caricature from 2D to 3D. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7336-7345, 2018.
[46] Wuyuan Xie, Chengkai Dai, and Charlie C. L. Wang. Photometric stereo with near point lighting: A solution by mesh deformation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4585-4593, 2015.
[47] Qian Zheng, Yiming Jia, Boxin Shi, Xudong Jiang, Ling-Yu Duan, and Alex C. Kot. SPLINE-Net: Sparse photometric stereo through lighting interpolation and normal estimation networks. In IEEE International Conference on Computer Vision (ICCV), pages 8549-8558, 2019.
[48] Syed Zulqarnain Gilani and Ajmal Mian. Learning from millions of 3D scans for large-scale 3D face recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1896-1905, 2018.