Pattern Recognition 88 (2019) 584–594
Contents lists available at ScienceDirect. Journal homepage: www.elsevier.com/locate/patcog

Cascade learning from adversarial synthetic images for accurate pupil detection

Chao Gou a,c,*, Hui Zhang a, Kunfeng Wang a,c, Fei-Yue Wang a,c, Qiang Ji b

a Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
b Rensselaer Polytechnic Institute, Troy, NY 12180, USA
c Qingdao Academy of Intelligent Industries, Qingdao 266109, China
* Corresponding author at: Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China. E-mail addresses: [email protected], [email protected] (C. Gou).

Article history: Received 24 July 2018; Revised 19 November 2018; Accepted 15 December 2018; Available online 18 December 2018.

Keywords: Cascade regression; GANs; Pupil detection

Abstract

Image-based pupil detection, which aims to find the pupil location in an image, has been an active research topic in the computer vision community. Learning-based approaches can achieve good results given large amounts of training data with eye center annotations. However, publicly available datasets with accurate eye center annotations are limited, and manually labeling large amounts of training data is unreliable and time-consuming. In this paper, inspired by learning from synthetic data in the Parallel Vision framework, we introduce a step of parallel imaging built upon Generative Adversarial Networks (GANs) to generate adversarial synthetic images. In particular, we refine synthetic eye images with an improved SimGAN using an adversarial training scheme. For the computational experiments, we further propose a coarse-to-fine pupil detection framework based on shape augmented cascade regression models learned from the adversarial synthetic images. Experiments on the benchmark databases BioID, GI4E, and LFW show that the proposed work performs significantly better than other state-of-the-art methods by leveraging the power of cascade regression and adversarial image synthesis.

© 2018 Elsevier Ltd. All rights reserved. https://doi.org/10.1016/j.patcog.2018.12.014

1. Introduction

Image-based pupil detection, also named eye center detection, aims to predict the location of the pupil in an image and is becoming an increasingly active area of research in the computer vision community. Accurate pupil detection is useful for gaze prediction, eye state estimation, and human-machine interaction. Recently, cascade regression has emerged as a promising approach for landmark detection [1–4]. It has been further extended to accurate pupil detection and eye state estimation [5,6], where it achieved state-of-the-art pupil detection results. Cascade regression begins with initial semantic key point (e.g. eye corner) locations, followed by iteratively updating the positions through sequential regression models until convergence. The cascade regression models are learned to map the current key point related appearance to the target location increment with respect to the ground truth. Moreover, this allows capturing structural and contextual information for pupil localization.

For pupil detection, there are limited databases with eye related key point annotations, including the eye centers. In addition, manual annotation of eye centers is time-consuming and not reliable. Learning-by-synthesis has been successfully applied to eye center detection, where the model is learned from purely synthetic data with automatic annotations generated by computer graphics techniques [6]. Parallel Vision (PV) [7,8], which extends the ACP methodology [9] to computer vision, was proposed as a unified framework including artificial scenes, computational experiments, and parallel execution for the perception and understanding of complex scenes, and it also emphasizes the significance of synthesis. As pointed out in [10], large amounts of artificial data make it possible to model real distributions. In the case of pupil detection, however, synthetic eyes are limited in covering the variations in appearance of real eyes. SimGAN [11] was proposed to refine synthetic images through adversarial training within the framework of Generative Adversarial Networks (GANs) [12]. Experimental results show that SimGAN can improve gaze estimation.

In this paper, inspired by the PV framework, we propose to learn cascade regression models from adversarial image synthesis for accurate pupil detection. The overall framework is illustrated in Fig. 1. From the perspective of PV, artificial eyes can offer sufficient synthetic eye images with annotations. We first simulate thousands of eye images with accurate eye annotations. Built upon SimGAN, we then perform parallel imaging to add realism to the artificial eyes, generating adversarial synthetic samples that preserve the simulator's annotations and add real appearance information. Finally, we propose to learn shape augmented regression models from the adversarial synthetic images and test them on real ones to perform the computational experiments.


Fig. 1. Overview of the proposed pupil detection framework based on PV.

The major contributions of the proposed work are highlighted below:

• Adversarial image synthesis: We perform parallel imaging with an improved SimGAN for training image synthesis. The generated adversarial images preserve both the accurate annotations of the synthetic eyes and the realistic appearance information of real eyes.
• Shape augmented cascade regression: By leveraging the shape augmented framework, the method captures eye related shape, appearance, and contextual information for final pupil detection.
• Promising experimental results: The proposed method performs better than other state-of-the-art methods on benchmark databases and allows for real-time applications.

In the following, related work is reviewed in Section 2. Details of the proposed method are described in Section 3. Experimental results are discussed in Section 4. We give the conclusion in Section 5.

2. Related work

For image-based pupil detection, we review the recent learning-based work in this section. Earlier prominent reviews of pupil detection can be found in [13,14]. Valenti and Gevers [15] provided a new method to infer the eye center location using circular symmetry based on isophote properties; a voting based method is introduced within this isophote based framework. Timm and Barth [16] proposed to detect the eye center by image gradients, mathematically describing the relationship between the gradient orientations of a possible iris center and contextual regions. In [17], a novel variation of the linear SVM was presented for efficient pupil detection. Araujo et al. [18] introduced the Inner Product Detector based on correlation filters to detect the center of the eye. This method is robust to small variations, but it is limited in modeling the appearance variations of eyes in the wild. A recent work [19] reviewed local structure patterns (LSPs) and extended them to multi-scale block based LSPs for pupil detection; Adaboost feature selection on the combination of intensity-based and gradient-based LSPs was performed to further improve pupil detection performance [19].

Cascade regression has emerged as the leading framework for facial landmark detection [1–3,20–22]. It begins with initial landmark locations, e.g. the mean locations of the training samples, and iteratively updates the locations through sequentially trained regression models. Zhou et al. [20] optimized the original facial landmark detection method SDM [3] by extracting features at different scales and employing a coarse-to-fine strategy to enhance robustness; by using multiple nonlinear features, the approach obtained higher localization accuracy. In [20], the authors trained the model using more than 10K images with manual annotations. Gou et al. [5] extended cascade regression to pupil detection, proposing a joint cascade regression framework for simultaneous pupil detection and eye state estimation. By capturing structural, shape, and contextual information in the cascade regression framework, it achieved state-of-the-art results [5]. The methods in [5,20] employ handcrafted features that are limited in representation power. Recently, deep features have been widely used in fields such as 3-D human pose recovery [23–25], face detection [26,27], privacy setting recommendation for social image sharing [28,29], and landmark detection [21,22]. Hong et al. [30] first introduced locality-sensitive constraints and fed a combination of multi-view hand-crafted features (e.g. HOG, shape context, hierarchical centroid) into sparse coding for 3-D pose recovery from silhouettes. They further proposed a novel deep feature extractor [24] with multi-modal fusion, and experimental results demonstrated its effectiveness for pose recovery [23]. Yu et al. [28] proposed a novel approach termed iPrivacy that jointly learns deep CNNs and a tree classifier for privacy setting recommendation; they further incorporated image content sensitiveness and user trustworthiness, extracted by designed deep models and dictionary learning, respectively [29]. In [21], the authors presented three-level cascaded convolutional networks for facial landmark detection: a reliable initial estimate is obtained at the first level, followed by fine tuning at the following two levels. Zhang et al. [22] proposed deep cascaded convolutional networks that capture the inherent correlation between face detection and landmark localization to boost the performance of both tasks. For all these learning-based approaches, it is unreliable and time-consuming to collect large numbers of such training data. With increasing progress in computer graphics and virtual reality, it is becoming possible to learn from artificial scenes and objects that cover the appearance distribution of real-world targets [6,31–35]. Gou et al. [6] proposed to learn from synthetic images for pupil detection using cascade regression methods and achieved good results on benchmark databases. Learning-by-synthesis is becoming an increasingly active topic for eye related research. In [31], the authors presented UnityEyes, a novel method to rapidly synthesize large amounts of eye region images as training data. The synthesized images can be used to estimate gaze in in-the-wild scenarios, even for large head poses and gaze directions.

However, learning from synthetic images alone cannot achieve desirable performance due to the bias between synthetic and real images: synthetic data is often not realistic enough and cannot cover the real-world variations in appearance. Wang et al. [7,8] proposed a unified framework termed Parallel Vision (PV), extended from the ACP (Artificial systems, Computational experiments, and Parallel execution) methodology [9], for image perception and understanding. From the perspective of PV, conventional vision algorithms learning from synthetic images are part of PV with respect to artificial scenes and computational experiments, and they can be further improved by the parallel execution of online optimization through interaction between synthetic and real images. Ganin and Lempitsky tackled the data bias problem from the viewpoint of domain adaptation [36], proposing to learn domain-invariant features by adversarial training. Yoo et al. [37] presented a pixel level domain converter learned with Generative Adversarial Networks (GANs) [12].

Fig. 2. Parallel vision framework based on ACP [7].

Shrivastava et al. [11] proposed Simulated+Unsupervised (S+U) learning to improve the realism of synthetic images while preserving the annotation information.

3. Proposed method

3.1. ACP methodology and parallel vision framework

The ACP (Artificial systems, Computational experiments, and Parallel execution) methodology was first proposed for effectively modeling and controlling complex systems [9]. Artificial systems aim to capture all the information that real systems have; within such artificial systems, different models can be built to describe the actions and responses of the target real systems. After building the ideal artificial systems, computational experiments are conducted to effectively validate the proposed algorithms. The third part of ACP is parallel execution, referring to one or more artificial systems running in parallel with the real complex system, executed as Parallel Systems. Through parallel execution, we can explore the potential of artificial systems and transform the roles between the artificial systems and the real system. Based on these interactions, online optimization can be achieved, from static to dynamic. The core philosophy of ACP is to combine artificial systems, computational experiments, and parallel execution to turn the virtual artificial space into another space for solving complex problems [38]. ACP is becoming an increasingly important research topic and is widely applied in areas such as social computing, traffic management and control, and ethylene production management [39–41].

Wang et al. [7,8] further extended the ACP framework to computer vision as the PV (Parallel Vision) framework shown in Fig. 2. PV also consists of three major parts: artificial scenes, computational experiments, and parallel execution. Modern machine learning methods require large numbers of labeled images for training, while in real applications we cannot collect sufficient labeled training samples to cover the real distributions. With the increasing development of computer graphics and virtual reality techniques, it becomes possible for artificial scenes to accurately model real scenes. In our case of pupil detection, we can collect large-scale synthetic samples by artificial generation for training the algorithms. Computational experiments can then be conducted to evaluate the proposed pupil detection algorithms, including learning and testing perception on both real and artificial eye images. However, algorithms trained purely on artificial synthetic eye images cannot model the real eye distribution due to the dataset bias problem. Parallel execution, such as the parallel imaging of GANs-based image generation [34], offers a methodology to close the gap between synthetic and real images.

From the perspective of PV, existing learning-by-synthesis work [6,31] is part of PV with respect to artificial scenes and computational experiments. To tackle the data bias problem between synthetic and real data, inspired by the PV framework, we introduce a step of parallel imaging to refine the synthetic images based on an adversarial training scheme. Specifically, following the recent adversarial training based work [11], we perform adversarial image synthesis to add realism to synthetic images while preserving the accurate annotations. We further propose to learn cascade regression models from the adversarial image synthesis to perform the computational experiments. The overview of our proposed method based on PV is illustrated in Fig. 1. In the following, we first describe the generation of synthetic eyes in the artificial scenes part in Section 3.2. Then we present the method for generating adversarial photo-realistic eye images in the parallel imaging part in Section 3.3. Finally, we present our proposed cascade regression framework for accurate eye center detection in Section 3.4.

3.2. Artificial eye generation

Learning-based pupil detection needs a large number of training samples with accurate annotations. However, it is time-consuming and unreliable to label the eye centers of real images. In addition, given the diverse illuminations, ethnicities, eye states, gaze directions, genders and head poses of subjects, manually collected data cannot cover the appearance variations of eyes in the wild. From the perspective of PV, artificial eyes can offer sufficient training samples to cover the general appearance in real scenarios. Recently, learning from synthetic eye images for gaze estimation has achieved promising results [31]; the authors proposed a method called UnityEyes to generate synthetic eye images with gaze annotations. Experimental results show that even KNN based algorithms trained on eyes simulated by UnityEyes achieve state-of-the-art gaze estimation results, outperforming a deep learning-based method when testing on real images [31].

In this paper, we apply UnityEyes to generate artificial eyes as the first step of our pupil detection pipeline. UnityEyes can rapidly generate synthetic eyes with varied appearance with respect to skin color, illumination, and a large range of head poses and gaze directions. It can also automatically label the key point locations, as shown in Fig. 1. To capture the eyeball appearance with respect to different gaze, a 3D mesh is used to model the shape and texture variations of the eyeball; this mesh corresponds to the external surface, whose shape is defined by the cornea and two spheres. A generative model is applied to synthesize facial eye regions covering the appearance of subjects of different gender, age and ethnicity. In particular, a 3D morphable shape model of the eye region is built upon high resolution 3D head scans captured in a professional photogrammetric studio. The generic eye region topology with n vertices is denoted by the 3n-dimensional vector s = [x_1, y_1, z_1, ..., x_n, y_n, z_n]. PCA is performed on the m retopologized scans to extract the orthogonal basis U ∈ R^{3n×m}. A linear parametric shape model M_s(μ, δ, U) is formulated, where μ ∈ R^{3n} is the mean shape of the m retopologized scans and δ are parameters, following a Gaussian, that are estimated by fitting each basis function to the original scans. The morphable model is then

s(θ) = μ + U diag(δ) θ^T,   (1)

where θ ∈ R^m is a vector of coefficients of the basis functions. New eye regions can thus be formed by randomly generating the parameters θ from a Gaussian distribution with zero mean and unit variance.
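To make the sampling step concrete, below is a minimal NumPy sketch of drawing new eye region shapes from Eq. (1). In practice μ, U and δ would come from the PCA described above; here they are random placeholders (as are the dimensions n and m), so the shapes themselves are meaningless and only the mechanics of the formula are illustrated.

```python
import numpy as np

# Minimal sketch of Eq. (1): s(theta) = mu + U diag(delta) theta^T.
n = 229          # number of vertices in the eye region mesh (placeholder)
m = 10           # number of retopologized scans / basis functions (placeholder)
mu = np.random.randn(3 * n)                      # mean shape, in R^{3n}
U = np.linalg.qr(np.random.randn(3 * n, m))[0]   # orthonormal basis, in R^{3n x m}
delta = np.abs(np.random.randn(m))               # per-basis standard deviations

def sample_eye_shape(rng):
    """Draw a new eye region shape by sampling theta ~ N(0, I)."""
    theta = rng.standard_normal(m)               # zero mean, unit variance coefficients
    return mu + U @ (np.diag(delta) @ theta)     # Eq. (1)

rng = np.random.default_rng(0)
s_new = sample_eye_shape(rng)                    # [x1, y1, z1, ..., xn, yn, zn]
print(s_new.shape)                               # (3 * n,)
```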

Another critical factor for appearance-based pupil detection is illumination, since different lighting results in varied appearance. To synthesize eye images under diverse illuminations, light sources pointing in random directions towards the eye regions are simulated, and different panoramic photographs are randomly rotated and exposed to render the reflections and environmental ambient light. Different head poses are achieved by positioning the camera in spherical coordinates and pointing it towards the eyeball center of the 3D model. Some artificial eye images are shown in Fig. 3. Refer to [31] for more details.

Fig. 3. Examples of artificial eye images.

3.3. Adversarial image synthesis from parallel imaging

Synthetic data has been widely used for training vision-based algorithms because it can be automatically generated with varied appearance and accurate annotations [7]. However, learning from synthetic images cannot achieve the desired performance, as the distribution of synthetic images differs from that of real ones. Although large efforts have been undertaken to improve the realism of generated artificial images, they still cannot cover all the appearance variations of real scenes. In this work, as shown in Fig. 1, we introduce a step of parallel imaging to reduce the gap between synthetic and real eye images. Our goal is to refine the artificial eyes with texture and appearance from unlabeled real eyes. Building upon the recent GANs based approach termed SimGAN [11], we learn a generative model that produces adversarial synthetic images preserving both the annotations of the synthetic eyes and the realism of real eyes. The overview of the GANs based model in this paper is illustrated in Fig. 4, where the deep network of the generator G is the desired generative model, whose output makes it impossible for the deep network of the discriminator D to determine the source of the input images. We use r and s to denote real eyes and synthetic eyes, respectively; G and D are alternately optimized.

Fig. 4. Illustration of the adversarial learning for generating the photorealistic eyes.

For the discriminator, following the idea of GANs, the learning procedure aims to minimize the adversarial loss formulated in Eq. (2):

L_D(θ_d) = E_{s∼p_s(s)}[−log(D(G(s; θ_g); θ_d))] + E_{r∼p_r(r)}[−log(1 − D(r; θ_d))],   (2)

where E_{s∼p_s(s)} denotes the expectation over inputs sampled from the synthetic eye distribution p_s(s), p_r(r) is the distribution of real eyes, and D(·; θ_d) denotes the probability that the input is a fake eye. By minimizing this loss, the discriminator pushes D(G(s; θ_g); θ_d) towards 1 and D(r; θ_d) towards 0. Hence, for training the discriminator network D, the ground truth labels are 0 for all real eyes r and 1 for the generator outputs G(s; θ_g).

The purpose of the generator G is to make its output G(s; θ_g) as real as possible in appearance while preserving the accurate structural shape of the input synthetic eye s. Following the recent idea of Shrivastava et al. [11], we formulate its loss as a combination of an adversarial realism loss and a self-regularization loss that preserves the detailed information. Since cascade regression for pupil detection relies on the texture and appearance information around the key points, we need to preserve the accurate key point locations (e.g. eyelids and eye corners) and the related local texture of the synthetic eyes. To this end, unlike Shrivastava et al. [11], who use the pixel-level value difference as the self-regularization loss to preserve annotations such as the gaze vector for gaze estimation, we design an additional ConvNet to preserve the detailed information of the synthetic eyes (e.g. structural shape, annotated key point locations, and related global and local texture), and take the L1 distance of its features as the self-regularization loss. The loss function for learning the generator is formulated in Eq. (3):

L_G(θ_g, θ_c) = E_{s∼p_s(s)}[−log(1 − D(G(s; θ_g); θ_d))] + λ E_{s∼p_s(s)}[ ||ConvNet(G(s; θ_g); θ_c) − ConvNet(s; θ_c)||_1 ],   (3)

where λ is the balance factor, ConvNet is a deep convolutional network that maps image space to a representative feature space, θ_c are the parameters of the ConvNet, and ||·||_1 is the L1 norm.

In this work, the parameters of the discriminator and the generator are alternately learned by minimizing the loss functions in Eqs. (2) and (3), respectively. Specifically, when learning the discriminator parameters θ_d, the parameters θ_g and θ_c are fixed to the result of the generator training, and vice versa. Note that θ_g and θ_c are learned simultaneously when training the generator. For each update of θ_d, we update the generator parameters 10 times.
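As an illustration of this training scheme, the following PyTorch sketch wires up Eqs. (2) and (3) and the 10:1 update schedule. The three networks are tiny placeholders rather than the SimGAN-style architectures actually used, and all names here are illustrative assumptions; only λ = 1.2 (see Section 4.1.3) and the 72 × 42 crop size come from the paper.

```python
import torch
import torch.nn as nn

# Placeholder networks; the paper uses architectures similar to SimGAN [11].
G = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1))          # refiner: synthetic -> refined
D = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1),
                  nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                  nn.Sigmoid())                            # D(x): probability input is fake
feat = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1))       # ConvNet for self-regularization

opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(list(G.parameters()) + list(feat.parameters()), lr=1e-4)
lam, eps = 1.2, 1e-8                                      # lambda = 1.2 as in Section 4.1.3

def d_loss(s, r):
    # Eq. (2): push D(G(s)) -> 1 (labeled fake) and D(r) -> 0 (labeled real).
    refined = G(s).detach()                               # freeze G for this update
    return (-torch.log(D(refined) + eps)).mean() \
         + (-torch.log(1 - D(r) + eps)).mean()

def g_loss(s):
    # Eq. (3): adversarial term pushes D(refined) -> 0 (i.e. D calls it real),
    # plus L1 self-regularization in the ConvNet feature space.
    refined = G(s)
    adv = (-torch.log(1 - D(refined) + eps)).mean()
    reg = (feat(refined) - feat(s)).abs().mean()
    return adv + lam * reg

for step in range(100):                                   # toy training loop
    s = torch.rand(4, 1, 42, 72)                          # synthetic eye batch (72 x 42)
    r = torch.rand(4, 1, 42, 72)                          # unlabeled real eye batch
    for _ in range(10):                                   # 10 generator updates ...
        opt_g.zero_grad(); g_loss(s).backward(); opt_g.step()
    opt_d.zero_grad(); d_loss(s, r).backward(); opt_d.step()   # ... per D update
```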

3.4. Shape augmented cascade regression for computational experiments

In the PV framework, computational experiments are conducted to evaluate the performance of the proposed algorithm. For our task, we first build a novel model based on cascade regression for pupil detection, and then carry out computational experiments to validate the proposed framework. Motivated by our previous work on facial landmark detection with cascade regression models [1,2], we propose to detect the eye center using a shape augmented regression method on the basis of eye-related shape, structural and appearance information. Before introducing the shape augmented regression method, we review the general cascaded framework.

3.4.1. General cascaded regression

General cascaded regression has been widely applied to facial landmark detection, where multiple sequential regressors are learned based on local appearance. The general cascade framework for landmark detection is summarized in Algorithm 1. In this work, the locations of N semantic facial landmarks (such as eyes, nose tip, and mouth corners) in a given image are denoted by p = {(x_1, y_1), ..., (x_N, y_N)} ∈ R^{2N}, and p^t represents the landmark locations at iteration t of the cascade. At cascade level t, a function g^t maps the local appearance features extracted around the landmarks p^{t−1} to the updates Δp^t.

Algorithm 1 General cascade regression for landmark detection.

Input: N landmarks initialized as p^0 by the mean locations of the training samples, given the input image I.
for t = 1, ..., T do
    Estimate the landmark location updates Δp^t given the current locations p^{t−1}:  g^t: (I, p^{t−1}) → Δp^t
    Update the landmark locations:  p^t = p^{t−1} + Δp^t
end for
Output: Landmark locations p^T.

The Supervised Descent Method (SDM) [3] is one of the most popular cascade frameworks for facial landmark detection; it uses a linear mapping function. Hence, the cascade regression model g^t in Algorithm 1 can be formulated as Δp^t = g^t(p^{t−1}, I) = R^t Φ(p^{t−1}, I) + c^t, where Φ(p^{t−1}, I) is the appearance feature and R^t and c^t are the model parameters. For training the cascade regression in this work, we obtain the ground truth of the desired updates by Δp^{t,*} = p^* − p^{t−1}, where p^* is the ground truth location. We then learn the linear regression parameters R^t and c^t by a least-squares formulation with a closed form solution. Adding the updates Δp^t to the current locations p^{t−1} yields the predicted landmark locations.
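A minimal NumPy sketch of training one such level follows; the features and updates are random placeholders standing in for Φ(p^{t−1}, I) and Δp^{t,*}, and the bias c^t is absorbed by appending a constant 1 to the feature vector.

```python
import numpy as np

# Sketch of training one cascade level of the linear model
# Delta p^t = R^t Phi(p^{t-1}, I) + c^t by least squares.
K, F, N2 = 500, 128, 22        # training images, feature dim, 2N coords (placeholders)
Phi = np.random.randn(K, F)    # Phi(p^{t-1}_j, I_j), stacked row-wise (placeholder)
dP = np.random.randn(K, N2)    # ground-truth updates Delta p^{t,*}_j = p*_j - p^{t-1}_j

A = np.hstack([Phi, np.ones((K, 1))])          # constant column absorbs the bias c^t
W, *_ = np.linalg.lstsq(A, dP, rcond=None)     # closed-form least-squares solution
R_t, c_t = W[:-1].T, W[-1]                     # R^t maps features to updates

# Applying the level: Delta p^t = R^t Phi + c^t, then p^t = p^{t-1} + Delta p^t.
phi_new = np.random.randn(F)                   # features at the current estimate
delta_p = R_t @ phi_new + c_t
```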

3.4.2. Shape augmented cascade regression for pupil detection

Shape augmented cascade regression was first proposed in [42] for facial landmark detection; we further explored it for 3D facial landmark alignment to capture shape information [4]. Our proposed coarse-to-fine framework for pupil detection is illustrated in Fig. 5. We first perform face detection, followed by shape augmented cascade regression for 51 facial landmarks. The eye region is then coarsely cropped using the detected facial landmarks. Intuitively, the eye center location is related to the appearance information of the iris region. In addition, eye related shape and contextual information, such as eye corners and eyelids, helps accurate eye center localization. To capture this information, as shown in Fig. 5(b), we choose 11 eye related key points, including eye corners, eyelids and the eye center, for cascade regression.

Fig. 5. Proposed coarse-to-fine pupil detection based on shape augmented regression. (a) Facial landmark detection. (b) Coarsely extracted eye region and initialized key point locations. (c) Output of the first cascade level. (d) Final key point locations including the eye center.

In this paper, we extend shape augmented regression to pupil detection. The objective function for eye related key point detection is given by Eq. (4):

f(p) = (1/2) ||Φ(p, I) − Φ(p^*, I)||^2 + (1/2) ||Ψ(p) − Ψ(p^*)||^2,   (4)

where I is the given eye region image, p are the eye related key point locations, p^* are the annotations, Φ(p, I) are the local appearance features around p, and Ψ(p) are the shape features extracted by computing the differences among pairs of key points. A second order Taylor expansion of Eq. (4) is applied at an initial location p^0:

f(p) = f(p^0 + Δp) ≈ f(p^0) + J_f(p^0)^T Δp + (1/2) Δp^T H_f(p^0) Δp,   (5)

where J_f(p^0) is the Jacobian matrix and H_f(p^0) is the Hessian matrix of f(·) at p^0. We take the derivative of f(p) in Eq. (5) and set it to zero to obtain the optimum of the objective function. Hence, the updates of the eye related key point locations can be estimated as Eq. (6):

Δp = −H_f(p^0)^{−1} J_f(p^0)
   = −H_f(p^0)^{−1} [J_Φ(p^0)(Φ(p^0, I) − Φ(p^*, I)) + J_Ψ(p^0)(Ψ(p^0) − Ψ(p^*))]
   = −H_f(p^0)^{−1} J_Φ(p^0) Φ(p^0, I) − H_f(p^0)^{−1} J_Ψ(p^0) Ψ(p^0) + H_f(p^0)^{−1} (J_Φ(p^0) Φ(p^*, I) + J_Ψ(p^0) Ψ(p^*))   (6)

However, the ground truth locations p^* are unknown (but fixed) constants during inference, and the Hessian matrix is not guaranteed to be positive definite everywhere. Hence, we introduce the parameters

M = −H_f(p^0)^{−1} J_Φ(p^0),
Q = −H_f(p^0)^{−1} J_Ψ(p^0),
b = H_f(p^0)^{−1} (J_Φ(p^0) Φ(p^*, I) + J_Ψ(p^0) Ψ(p^*)).   (7)

By embedding the introduced parameters into iteration t of the cascade regression, Eq. (6) can be formulated linearly as

g^t: Δp^t = M^t Φ(p^{t−1}, I) + Q^t Ψ(p^{t−1}) + b^t,   (8)

where a linear model serves as the mapping function g(·) of the general cascade regression, now incorporating shape information for pupil detection; Φ(p, I) are the local appearance features (e.g. SIFT) and Ψ(p) are the shape features. Our proposed shape augmented cascade regression for pupil detection is summarized in Algorithm 2.

Algorithm 2 Shape augmented cascade regression for pupil detection.

Input: N key point locations initialized as p^0 by the mean locations of the training samples, given the input eye image I.
for t = 1, ..., T do
    Estimate the key point location updates given the current locations p^{t−1}:  Δp^t = M^t Φ(p^{t−1}, I) + Q^t Ψ(p^{t−1}) + b^t
    Update the key point locations:  p^t = p^{t−1} + Δp^t
end for
Output: Locations p^T of the eye related key points, including the eye center.
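For concreteness, the following NumPy sketch runs Algorithm 2 once, with a pairwise-difference shape feature for Ψ and a random stand-in for the SIFT-based appearance feature Φ; the learned parameters M^t, Q^t, b^t are also random placeholders, so this only illustrates the data flow of Eq. (8).

```python
import numpy as np

N, T, F = 11, 4, 128                    # 11 eye key points, 4 cascade levels (placeholder)

def shape_features(p):
    """Psi(p): differences among all pairs of key points (x and y offsets)."""
    pts = p.reshape(N, 2)
    diffs = pts[:, None, :] - pts[None, :, :]       # (N, N, 2) pairwise differences
    iu = np.triu_indices(N, k=1)
    return diffs[iu].ravel()                        # one (dx, dy) per unordered pair

def appearance_features(image, p):
    """Phi(p, I): stand-in for SIFT descriptors extracted around each key point."""
    return np.random.randn(F)                       # placeholder only

S = shape_features(np.zeros(2 * N)).size
M = [np.random.randn(2 * N, F) * 0.01 for _ in range(T)]   # placeholder M^t
Q = [np.random.randn(2 * N, S) * 0.01 for _ in range(T)]   # placeholder Q^t
b = [np.zeros(2 * N) for _ in range(T)]                    # placeholder b^t

image = np.zeros((42, 72))
p = np.random.randn(2 * N)                          # p^0: mean shape in practice
for t in range(T):                                  # Eq. (8) applied level by level
    dp = M[t] @ appearance_features(image, p) + Q[t] @ shape_features(p) + b[t]
    p = p + dp                                      # p^t = p^{t-1} + Delta p^t
# p now holds the final key point locations, including the eye center.
```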

After building the model for the computational experiments, we need to learn the parameters in Eq. (7). Since we have generated sufficient adversarial synthetic images with ground truth eye related key point annotations by GANs in the parallel imaging part, we can model the real eye center location distribution. Given the j-th eye image I_j with ground truth locations p^*_j, we acquire the eye shape updates by subtracting the current locations p^{t−1}_j from p^*_j, i.e. Δp^{t,*}_j = p^*_j − p^{t−1}_j. Given K training images, the initial key point locations p^0_j are the mean locations of the training samples. The learning of M^t, Q^t and the bias b^t at cascade level t can then be formulated


as a standard least-squares problem with a closed form solution:

M^{t*}, Q^{t*}, b^{t*} = argmin_{M^t, Q^t, b^t} Σ_{j=1}^{K} ||Δp^{t,*}_j − M^t Φ(I_j, p^{t−1}_j) − Q^t Ψ(p^{t−1}_j) − b^t||^2   (9)

For inference, given the detected eye region I and beginning with the mean key point initialization p^0, we iteratively estimate the eye locations with the learned cascade models, as shown in Fig. 5 and Algorithm 2.
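A sketch of this closed-form solve, stacking the appearance features, shape features and a constant column into one design matrix, follows the same placeholder conventions as the earlier sketches (all arrays random, dimensions illustrative).

```python
import numpy as np

# Sketch of the closed-form solve of Eq. (9) for one cascade level t.
K, F, S, N2 = 500, 128, 110, 22
Phi = np.random.randn(K, F)        # Phi(I_j, p^{t-1}_j) for j = 1..K (placeholder)
Psi = np.random.randn(K, S)        # Psi(p^{t-1}_j) (placeholder)
dP = np.random.randn(K, N2)        # Delta p^{t,*}_j = p*_j - p^{t-1}_j (placeholder)

A = np.hstack([Phi, Psi, np.ones((K, 1))])      # [appearance | shape | bias]
W, *_ = np.linalg.lstsq(A, dP, rcond=None)      # minimizes Eq. (9) exactly
M_t = W[:F].T                                   # (2N x F) appearance regressor
Q_t = W[F:F + S].T                              # (2N x S) shape regressor
b_t = W[-1]                                     # (2N,)  bias term
```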

4. Experimental results

For the computational experiments, we have described the model learning and inference in detail. In this section, we further conduct experiments to validate the proposed framework on benchmark datasets for pupil detection.

4.1. Experimental setup

4.1.1. Datasets

For the adversarial training in the parallel imaging step, we use around 28K synthetic images simulated by UnityEyes [31] and real eye images from the normalized part of MPIIGaze [32]. We then apply the trained generator network to the 10,730 artificial samples used in [6], which adds realism to the artificial eye images while keeping the original annotations of [6]. We further conduct computational experiments on benchmark datasets for pupil detection to demonstrate the performance of the proposed work.

In this work, three widely used benchmark datasets, BioID [43], GI4E [44], and LFW [45], are used for testing to verify our proposed method. As one of the most widely used datasets for pupil detection, BioID contains 1521 gray-level images of size 384 × 286. It is challenging, with different subjects, various gaze directions, and varying illumination. GI4E contains 1236 images of size 800 × 600 taken by a normal camera, from 103 subjects with 12 different gaze directions. We randomly select 1000 color images of size 250 × 250 from LFW for testing. We evaluate the performance of our proposed framework on GI4E, BioID, and LFW, which correspond to high, medium, and poor image quality, respectively. Sample detection results of our proposed work on each database are shown in Fig. 6.

Fig. 6. Examples of detection results by our proposed work for each benchmark database.

4.1.2. Measure criteria

Similar to [6,43], we use the maximum normalized detection error to measure eye detection capability:

d_eye = max(D_r, D_l) / D_rl,   (10)

where D_r and D_l are the Euclidean distances between the ground truth and the predicted right and left eye centers, respectively, and D_rl is the Euclidean distance between the right and left eye center ground truth. For this measurement, d_eye ≤ 0.1 corresponds to the range of the iris, and d_eye ≤ 0.05 corresponds to the range of the pupil diameter. Commonly, different thresholds on d_eye are set to calculate the detection rate.
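The criterion is straightforward to compute. A small NumPy sketch, assuming arrays of shape (K, 2, 2) holding [left, right] eye centers (random placeholders below), evaluates Eq. (10) and the detection rates at the usual thresholds:

```python
import numpy as np

def detection_rates(gt, pred, thresholds=(0.05, 0.1, 0.15)):
    """Eq. (10): maximum normalized error, then fraction below each threshold."""
    d_l = np.linalg.norm(pred[:, 0] - gt[:, 0], axis=1)   # left-eye error D_l
    d_r = np.linalg.norm(pred[:, 1] - gt[:, 1], axis=1)   # right-eye error D_r
    d_rl = np.linalg.norm(gt[:, 1] - gt[:, 0], axis=1)    # interocular distance D_rl
    d_eye = np.maximum(d_r, d_l) / d_rl                   # Eq. (10)
    return {t: float((d_eye <= t).mean()) for t in thresholds}

gt = np.random.rand(100, 2, 2) * 100     # placeholder ground-truth eye centers
pred = gt + np.random.randn(100, 2, 2)   # placeholder predictions
print(detection_rates(gt, pred))         # e.g. {0.05: ..., 0.1: ..., 0.15: ...}
```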

4.1.3. Implementation details

In this paper, we crop eye regions of size 240 × 140 from the synthetic images and resize them to 72 × 42 for training the generator network in the parallel execution. For face detection, we apply the VJ face detector [46]; for fair comparison on GI4E with other methods, we use the Deformable Part Models (DPM) based face detector [47]. The face detection rates on BioID, GI4E, and LFW are 97.5%, 99.8%, and 100%, respectively. The false positives (as shown in Fig. 12(e)) are not discarded, and the eye localization rate could be further improved by optimizing the face detection. The balance parameter λ in Eq. (3) is set to 1.2. SIFT features are extracted around the key points. For the adversarial synthetic image generation, network architectures similar to Shrivastava et al. [11] are applied. The ConvNet for self-regularization contains 3 convolution and 3 max-pooling layers: (1) Conv3 × 3, 64 feature maps, stride 1; (2) MaxPool2 × 2, stride 2; (3) Conv3 × 3, 128 feature maps, stride 1; (4) MaxPool2 × 2, stride 2; (5) Conv3 × 3, 256 feature maps, stride 1; (6) MaxPool2 × 2, stride 2.
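For reference, a PyTorch sketch of this ConvNet follows. The paper specifies only the kernel sizes, feature map counts and strides, so the padding, the grayscale input channel and the absence of nonlinearities between layers are our assumptions.

```python
import torch
import torch.nn as nn

# Self-regularization ConvNet as listed above; padding=1 and the single
# grayscale input channel are assumptions, not stated in the paper.
conv_net = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1),     # (1) Conv3x3, 64 maps
    nn.MaxPool2d(kernel_size=2, stride=2),                    # (2) MaxPool2x2
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),   # (3) Conv3x3, 128 maps
    nn.MaxPool2d(kernel_size=2, stride=2),                    # (4) MaxPool2x2
    nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),  # (5) Conv3x3, 256 maps
    nn.MaxPool2d(kernel_size=2, stride=2),                    # (6) MaxPool2x2
)

x = torch.zeros(1, 1, 42, 72)      # a 72 x 42 eye crop (grayscale assumed)
print(conv_net(x).shape)           # torch.Size([1, 256, 5, 9])
```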

4.2. Evaluation

4.2.1. Learning from adversarial synthetic images

We use the generator trained in the parallel imaging step to refine the synthetic images into adversarial synthetic images. Some qualitative results are shown in Fig. 7. The adversarial synthetic images capture the texture and appearance of the real eyes while preserving the original key point annotations of the synthetic images, which benefits our proposed learning-based pupil detection.

To further demonstrate the effectiveness of learning from adversarial synthetic images, we evaluate two baselines for comparison. We separately train the model on 3584 real images [5] and on 10,730 synthetic eye images [6], termed learning-from-real and learning-from-synthesis, respectively. We name our proposed work learning-from-adversarial (proposed), in which we learn the shape augmented models from the 10,730 adversarial eye images refined from Gou et al. [6]. We then test on the most widely used database, BioID. Experimental results are listed in Table 1. Compared with learning from pure synthetic data or real images, learning from the adversarial synthetic images generated by the parallel imaging step boosts pupil detection performance: it achieves a detection rate of 92.3% under the normalized error of 0.05, a significant improvement over learning-from-synthesis at 89.2%. Manually collected real eye images contain limited gaze directions with limited pupil locations, and synthetic eye images cannot cover the variations in appearance in the real world.

Fig. 7. Examples of real eyes, synthetic eyes and adversarial synthetic eyes.

Table 1. Eye detection results on BioID with normalized error (d_eye ≤ 0.05) for different training samples.

Training type                           Detection rate
Learning-from-synthesis                 89.2%
Learning-from-real                      90.3%
Learning-from-adversarial (SimGAN)      91.5%
Learning-from-adversarial (proposed)    92.3%

Table 2. Eye detection results and comparison on the GI4E database.

Method                      d_eye ≤ 0.05   d_eye ≤ 0.1    d_eye ≤ 0.15
George et al. [48]          89.3%          92.3%          93.6%
Timm et al. [16]            92.4%          96.0%*         96.9%*
Villanueva et al. [44]      93.9%          97.3%*         98.0%*
Gou et al. [5]              94.2%          99.1%          99.6%
Learning-from-adversarial   98.3%          99.8%          99.8%

* estimated from the accuracy curves in the corresponding paper [44].

Table 3. Eye detection results and comparison on the BioID database.

Method                      d_eye ≤ 0.05   d_eye ≤ 0.10   d_eye ≤ 0.15
Campadelli et al. [49]      80.7%          93.2%          95.3%
Timm et al. [16]            82.5%          93.4%          95.2%
Valenti et al. [15]         86.1%          91.7%          93.5%
Araujo et al. [18]          88.3%          92.7%          94.5%
Ren et al. [50]             77.1%          92.3%          97.2%*
George et al. [48]          85.1%          94.3%          96.7%
Chen et al. [51]            88.8%          95.2%          96.5%*
Choi et al. [19]            91.1%          98.4%          98.7%*
Gou et al. [5]              91.2%          99.4%          99.6%
Learning-from-adversarial   92.3%          99.1%          99.7%

* estimated from the accuracy curves in the corresponding papers.

In this work, our generated adversarial synthetic images take advantage of both synthetic and real eyes. To further verify the effectiveness of our modification of the self-regularization in SimGAN, we conduct comparison experiments on adversarial images generated by the original SimGAN [11], named learning-from-adversarial (SimGAN). For fair comparison, we train SimGAN on the same training samples as our proposed method and likewise generate 10,730 refined synthetic images from Gou et al. [6]. As shown in Table 1, the introduced ConvNet for self-regularization enhances pupil detection performance. Compared with pixel level self-regularization for preserving the annotations of the synthetic images [11], our introduced ConvNet emphasizes the detailed information, such as global shape, texture, and appearance around the eye related key points, which benefits the learning-based shape cascade regression framework.

4.2.2. Comparisons with the state-of-the-art methods

The quantitative results on GI4E are listed in Table 2, with the best results highlighted; all results of the compared methods are taken from the original papers. Some qualitative results are shown in Fig. 8. GI4E contains high quality images taken by a normal camera. Fig. 8 shows that we can accurately locate the eye center even for subjects with glasses or under low illumination. As shown in Table 2, our proposed method achieves a detection rate of 98.3% at d_eye ≤ 0.05. Compared with the other related works, our method significantly improves pupil detection performance, further demonstrating that our proposed framework can be applied to accurate pupil detection tasks. It is worth noting that our method outperforms the similar work [5], whose model is learned from a combination of real and synthetic samples. On the one hand, we learn from adversarial synthetic images, in which artificial eyes are given added realism while keeping accurate annotations; on the other hand, we leverage shape augmented regression to capture more local shape information with more eye related key points for eye center detection.

BioID is one of the most widely used datasets for eye detection. Experimental results on BioID are listed in Table 3; all results of the compared methods are taken from the original papers. Fig. 9 shows some correct detection examples of the proposed method on the BioID database. The proposed pupil detection method can handle low illumination conditions, various gaze directions and subjects with glasses. As listed in Table 3, the proposed algorithm gives a detection rate of 92.3% for d_eye ≤ 0.05 and outperforms the state-of-the-art detection rate of 91.1% in [19]. This demonstrates that the proposed cascade learning from the generated adversarial training samples exploits the data better than the local pattern based methods. For the coarser evaluation at d_eye ≤ 0.1, it performs slightly worse than the state-of-the-art [5] with its detection rate of 99.4%, which indicates that the realism learned from the MPIIGaze training images only partially covers the appearance distribution of BioID. This problem can be tackled by adding more unlabeled images with diverse appearance. In addition, the better performance of our proposed work at the finer evaluation of d_eye ≤ 0.05 demonstrates that the annotations of the adversarial synthetic images are more reliable.

We further conduct an evaluation on LFW, which contains images of low resolution and poor quality. Some detection examples are shown in Fig. 10. The proposed framework achieves promising results on extremely low resolution images: even when the image is blurred with ambiguous edge and contextual information, we can still detect the eye center based on the local pupil appearance information. Quantitative comparisons with other methods are listed in Table 4. As shown in Table 4, our method significantly outperforms a recent work based on multi-scale local structure patterns proposed in [19].

Fig. 8. Example results of eye localization in GI4E. The white dot represents the ground truth and the red dot the predicted eye center.

Fig. 9. Example results of eye localization in BioID. The white dot represents the ground truth and the red dot the predicted eye center.

Fig. 10. Example results of eye localization in LFW. The white dot represents the ground truth and the red dot the predicted eye center.

Table 4. Eye detection results and comparison on the LFW database.

Method                      d_eye ≤ 0.05   d_eye ≤ 0.1    d_eye ≤ 0.15
Fu et al. [52]              24.2%          75.4%          88.0%
Yi et al. [53]              52.0%*         88.2%*         95.3%*
Qian et al. [54]            75.1%          90.6%          94.3%
Choi et al. [19]            54.1%          91.4%          96.3%
Learning-from-adversarial   81.6%          97.2%          98.8%

* estimated from the accuracy curves in the corresponding paper.


This shows that our proposed coarse-to-fine framework is much preferable for pupil detection on low quality images.

In summary, thorough experiments show that our proposed method achieves state-of-the-art pupil detection results. Moreover, the proposed method is effective and robust on both high and low quality images. The reasons our method outperforms the other state-of-the-art methods are threefold. First, we detect the face in the whole image and then apply facial landmark detection to coarsely extract the eye regions, which allows us to capture the global structural information of the eyes with respect to the face. Second, by reducing the gap between the large amount of annotated synthetic data and the unlabeled real data through GANs, we can learn the real eye center locations with respect to different gaze directions more accurately. Third, we exploit cascade regression based on the eye related contextual and local appearance information for eye center detection.

4.2.3. Further discussion

We further conduct comparison experiments with similar works that detect the pupil within a cascade regression framework [20,21]. For fair comparison, similar to [20], we coarsely extract the eye region using the ground truth instead of a facial landmark detection method. Experimental results on BioID are shown in Table 5. Our shape augmented regression outperforms the conventional linear cascade regression methods of SDM [3,20], since we capture both the global structural shape and the local appearance information for eye center detection. In addition, we test the publicly available method from [21], which detects facial landmarks including the eye center, on BioID.

Fig. 12. Examples of qualitative detection results with normalized error d_eye. The white dot represents the ground truth and the red dot the predicted eye center. The green bounding box is a wrongly detected face region.

Table 5. Eye localization comparison with cascade regression methods.

Method               d_eye ≤ 0.05   d_eye ≤ 0.1    d_eye ≤ 0.25
SDM [3]              90.3%*         96.4%*         100.0%*
Cascaded CNN [21]    92.6%+         99.4%+         99.8%+
CF-MF-SDM [20]       93.8%          99.8%          99.9%
Ours                 95.7%          99.9%          100.0%

* results are from Zhou et al. [20]. + produced by the original landmark detection methods.

It achieves comparable results thanks to the nonlinear generalization ability of deep models. Future work will focus on designing nonlinear deep models [55–57] as the shape augmented cascade regression model. Upon further investigation, compared with the results in Table 3, the initialization of the eye related key points from the extracted eye region is important, because the cascade cannot converge to the global optimum when the initialization is far from the ground truth.

In addition, we study the influence of the number of iterations T in the cascade regression. Experimental results for different iteration counts are shown in Fig. 11. The cascade converges quickly within the first two iterations and reaches its optimum at iterations 4 and 6 for BioID and GI4E, respectively. In general, T can be set to 4, and increased to 6 for more challenging tasks such as the BioID dataset with its varying illumination and glasses.

Fig. 11. Eye detection results at each cascaded iteration on the BioID and GI4E databases. The Y coordinate denotes the detection rate with normalized error less than 0.05.

In this work, all testing experiments are conducted with non-optimized Matlab code on a standard PC with an Intel i5 3.47 GHz CPU and 16 GB RAM. Given the eye images, we test our model on the BioID dataset and evaluate the computational cost: 23 FPS is achieved with 6 cascade regression steps, with SIFT feature extraction accounting for most of the testing time. When applied in a real scenario with face detection and eye region extraction, our proposed method achieves 16 FPS, which allows for real time applications such as gaze estimation and human-computer interaction.

Although our proposed method significantly improves over the state-of-the-art, image-based pupil detection remains a challenging task. More qualitative detection examples with respect to the maximum normalized error are shown in Fig. 12. When the whole eye region is under low illumination, as in Fig. 12(a) and (b), the appearance of the pupil and the eye related edge information are ambiguous, leading to normalized errors d_eye larger than 0.05. For the case of Fig. 12(d), our proposed method cannot handle reflections on glasses, which remain very challenging. As investigated previously, face detection is important for coarsely extracting the eye region. As shown in Fig. 12(e), if face detection fails and the eye region initialization is far from the ground truth, the cascade regression cannot converge to the optimal eye centers. The performance can be further improved by optimizing the eye region extraction step of our proposed framework.

5. Conclusion and future work

In this paper, we propose a unified framework to learn shape augmented cascade regression models from adversarial synthetic images for accurate pupil detection. We introduce a step of parallel imaging that exploits adversarial training to refine the artificial eyes with texture and appearance from real images while preserving detailed information, such as structural shape, from the synthetic images. For the step of computational experiments, we learn shape augmented regression models based on the eye-related shape, texture, and appearance for pupil detection. Our proposed work achieves state-of-the-art performance on three publicly available benchmark datasets.

Future work will focus on designing powerful nonlinear architectures (e.g. deep models) to map the appearance to the target updates at each cascade level. More effort will also be devoted to realizing the parallel execution step, which can achieve on-line optimization through real-time inputs of unlabeled images.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 61806198, 61304200, and 61533019, and by the National Science Foundation under Grant 1145152.

References

[1] Y. Wu, C. Gou, Q. Ji, Simultaneous facial landmark detection, pose and deformation estimation under facial occlusion, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3471–3480.
[2] C. Gou, Y. Wu, F.-Y. Wang, Q. Ji, Coupled cascade regression for simultaneous facial landmark detection and head pose estimation, in: Proceedings of the IEEE International Conference on Image Processing (ICIP), IEEE, 2017, pp. 2906–2910.
[3] X. Xiong, F. De la Torre, Supervised descent method and its applications to face alignment, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2013, pp. 532–539.
[4] C. Gou, Y. Wu, F.-Y. Wang, Q. Ji, Shape augmented regression for 3D face alignment, in: Proceedings of the European Conference on Computer Vision Workshops, Springer, 2016, pp. 604–615.
[5] C. Gou, Y. Wu, K. Wang, K. Wang, F.-Y. Wang, Q. Ji, A joint cascaded framework for simultaneous eye detection and eye state estimation, Pattern Recognit. 67 (2017) 23–31.
[6] C. Gou, Y. Wu, K. Wang, F.-Y. Wang, Q. Ji, Learning-by-synthesis for accurate eye detection, in: Proceedings of the International Conference on Pattern Recognition, 2017, pp. 3362–3367.
[7] K. Wang, C. Gou, N. Zheng, J.M. Rehg, F.-Y. Wang, Parallel vision for perception and understanding of complex scenes: methods, framework, and perspectives, Artif. Intell. Rev. (2017) 1–31.
[8] K. Wang, C. Gou, F.-Y. Wang, Parallel vision: an ACP-based approach to intelligent vision computing, Acta Autom. Sin. 42 (10) (2016) 1490–1500.
[9] F.-Y. Wang, Parallel system methods for management and control of complex systems, Control Decis. 5 (2004) 001.



[10] X. Liu, X. Wang, W. Zhang, J. Wang, F.-Y. Wang, Parallel data: from big data to data intelligence, Pattern Recognit. Artif. Intell. 30 (8) (2017) 673–682.
[11] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, R. Webb, Learning from simulated and unsupervised images through adversarial training, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2017, p. 5.
[12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Proceedings of the Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[13] D.W. Hansen, Q. Ji, In the eye of the beholder: a survey of models for eyes and gaze, IEEE Trans. Pattern Anal. Mach. Intell. 32 (3) (2010) 478–500.
[14] F. Song, X. Tan, S. Chen, Z.-H. Zhou, A literature survey on robust and efficient eye localization in real-life scenarios, Pattern Recognit. 46 (12) (2013) 3157–3173.
[15] R. Valenti, T. Gevers, Accurate eye center location through invariant isocentric patterns, IEEE Trans. Pattern Anal. Mach. Intell. 34 (9) (2012) 1785–1798.
[16] F. Timm, E. Barth, Accurate eye centre localisation by means of gradients, in: Proceedings of the VISAPP, 2011, pp. 125–130.
[17] C. Zhang, X. Sun, J. Hu, W. Deng, Precise eye localization by fast local linear SVM, in: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2014, pp. 1–6.
[18] G. Araujo, F. Ribeiro, E. Silva, S. Goldenstein, Fast eye localization without a face model using inner product detectors, in: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2014, pp. 1366–1370.
[19] I. Choi, D. Kim, A variety of local structure patterns and their hybridization for accurate eye detection, Pattern Recognit. 61 (2017) 417–432.
[20] M. Zhou, X. Wang, H. Wang, J. Heo, D. Nam, Precise eye localization with improved SDM, in: Proceedings of the IEEE International Conference on Image Processing (ICIP), IEEE, 2015, pp. 4466–4470.
[21] Y. Sun, X. Wang, X. Tang, Deep convolutional network cascade for facial point detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2013, pp. 3476–3483.
[22] K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and alignment using multitask cascaded convolutional networks, IEEE Signal Process. Lett. 23 (10) (2016) 1499–1503.
[23] C. Hong, J. Yu, J. Wan, D. Tao, M. Wang, Multimodal deep autoencoder for human pose recovery, IEEE Trans. Image Process. 24 (12) (2015) 5659–5670.
[24] C. Hong, X. Chen, X. Wang, C. Tang, Hypergraph regularized autoencoder for image-based 3D human pose recovery, Signal Process. 124 (2016) 132–140.
[25] T. von Marcard, R. Henschel, M.J. Black, B. Rosenhahn, G. Pons-Moll, Recovering accurate 3D human pose in the wild using IMUs and a moving camera, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[26] S. Yang, P. Luo, C.-C. Loy, X. Tang, From facial parts responses to face detection: a deep learning approach, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3676–3684.
[27] S. Yang, P. Luo, C.C. Loy, X. Tang, Faceness-Net: face detection through deep facial part responses, IEEE Trans. Pattern Anal. Mach. Intell. 40 (8) (2018) 1845–1859.
[28] J. Yu, B. Zhang, Z. Kuang, D. Lin, J. Fan, iPrivacy: image privacy protection by identifying sensitive objects via deep multi-task learning, IEEE Trans. Inf. Forens. Secur. 12 (5) (2017) 1005–1016.
[29] J. Yu, Z. Kuang, B. Zhang, W. Zhang, D. Lin, J. Fan, Leveraging content sensitiveness and user trustworthiness to recommend fine-grained privacy settings for social image sharing, IEEE Trans. Inf. Forens. Secur. 13 (5) (2018) 1317–1332.
[30] C. Hong, J. Yu, D. Tao, M. Wang, Image-based three-dimensional human pose recovery by multiview locality-sensitive sparse retrieval, IEEE Trans. Ind. Electron. 62 (6) (2015) 3742–3751.
[31] E. Wood, T. Baltrušaitis, L.-P. Morency, P. Robinson, A. Bulling, Learning an appearance-based gaze estimator from one million synthesised images, in: Proceedings of the Ninth Biennial ACM Symposium on Eye Tracking Research & Applications, ACM, 2016, pp. 131–138.
[32] X. Zhang, Y. Sugano, M. Fritz, A. Bulling, Appearance-based gaze estimation in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4511–4520.
[33] A. Dosovitskiy, J.T. Springenberg, M. Tatarchenko, T. Brox, Learning to generate chairs, tables and cars with convolutional networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (4) (2017) 692–705.
[34] K. Wang, Y. Lu, Y. Wang, Z. Xiong, F.-Y. Wang, Parallel imaging: a new theoretical framework for image generation, Pattern Recognit. Artif. Intell. 30 (7) (2017) 577–587.
[35] L. Cao, C. Gou, K. Wang, G. Xiong, F.-Y. Wang, Gaze-aided eye detection via appearance learning, in: Proceedings of the International Conference on Pattern Recognition, 2018.
[36] Y. Ganin, V. Lempitsky, Unsupervised domain adaptation by backpropagation, in: Proceedings of the International Conference on Machine Learning, 2015, pp. 1180–1189.
[37] D. Yoo, N. Kim, S. Park, A.S. Paek, I.S. Kweon, Pixel-level domain transfer, in: Proceedings of the European Conference on Computer Vision, Springer, 2016, pp. 517–532.
[38] F.-Y. Wang, J.J. Zhang, X. Zheng, X. Wang, Y. Yuan, X. Dai, J. Zhang, L. Yang, Where does AlphaGo go: from Church–Turing thesis to AlphaGo thesis and beyond, IEEE/CAA J. Autom. Sin. 3 (2) (2016) 113–120.


[39] F.-Y. Wang, Parallel control and management for intelligent transportation systems: concepts, architectures, and applications, IEEE Trans. Intell. Transp. Syst. 11 (3) (2010) 630–638.
[40] F.-Y. Wang, P.K. Wong, Intelligent systems and technology for integrative and predictive medicine: an ACP approach, ACM Trans. Intell. Syst. Technol. (TIST) 4 (2) (2013) 32.
[41] F.-Y. Wang, X. Wang, L. Li, L. Li, Steps toward parallel intelligence, IEEE/CAA J. Autom. Sin. 3 (4) (2016) 345–348.
[42] Y. Wu, Q. Ji, Shape augmented regression method for face alignment, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 26–32.
[43] O. Jesorsky, K.J. Kirchberg, R.W. Frischholz, Robust face detection using the Hausdorff distance, in: Proceedings of the Audio- and Video-Based Biometric Person Authentication, Springer, 2001, pp. 90–95.
[44] A. Villanueva, V. Ponz, L. Sesma-Sanchez, M. Ariz, S. Porta, R. Cabeza, Hybrid method based on topography for robust detection of iris center and eye corners, ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 9 (4) (2013) 25.
[45] G.B. Huang, M. Ramesh, T. Berg, E. Learned-Miller, Labeled faces in the wild: a database for studying face recognition in unconstrained environments, Technical Report 07-49, 2, University of Massachusetts, Amherst, 2007.
[46] P. Viola, M.J. Jones, Robust real-time face detection, Int. J. Comput. Vis. 57 (2) (2004) 137–154.
[47] M. Mathias, R. Benenson, M. Pedersoli, L. Van Gool, Face detection without bells and whistles, in: Computer Vision–ECCV 2014, Springer, 2014, pp. 720–735.
[48] A. George, A. Routray, Fast and accurate algorithm for eye localisation for gaze tracking in low-resolution images, IET Comput. Vis. 10 (7) (2016) 660–669.
[49] P. Campadelli, R. Lanzarotti, G. Lipori, Precise eye and mouth localization, Int. J. Pattern Recognit. Artif. Intell. 23 (03) (2009) 359–377.
[50] Y. Ren, S. Wang, B. Hou, J. Ma, A novel eye localization method with rotation invariance, IEEE Trans. Image Process. 23 (1) (2014) 226–239.
[51] S. Chen, C. Liu, Eye detection using discriminatory Haar features and a new efficient SVM, Image Vis. Comput. 33 (2015) 68–77.
[52] Y. Fu, H. Yan, J. Li, R. Xiang, Robust facial features localization on rotation arbitrary multi-view face in complex background, J. Comput. (Taipei) 6 (2) (2011) 337–342.
[53] D. Yi, Z. Lei, S.Z. Li, A robust eye localization method for low quality face images, in: Proceedings of the International Joint Conference on Biometrics (IJCB), IEEE, 2011, pp. 1–6.
[54] Z. Qian, D. Xu, Automatic eye detection using intensity filtering and k-means clustering, Pattern Recognit. Lett. 31 (12) (2010) 1633–1640.
[55] M. Yang, Y. Liu, Z. You, The Euclidean embedding learning based on convolutional neural network for stereo matching, Neurocomputing 267 (2017) 195–200.
[56] J. Zhang, J. Yu, D. Tao, Local deep-feature alignment for unsupervised dimension reduction, IEEE Trans. Image Process. 27 (5) (2018) 2420–2432.
[57] S. Qian, H. Liu, C. Liu, S. Wu, H. San Wong, Adaptive activation functions in convolutional neural networks, Neurocomputing 272 (2018) 204–212.

Chao Gou received his B.S. degree from the University of Electronic Science and Technology of China, Chengdu, China, in 2012 and the Ph.D. degree from the University of Chinese Academy of Sciences (UCAS), Beijing, China, in 2017. From September 2015 to January 2017, he was supported by UCAS as a joint-supervision Ph.D. student at Rensselaer Polytechnic Institute, Troy, NY, USA. He is currently an Assistant Professor with the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include computer vision and machine learning.

Hui Zhang received the B.S. degree from Beijing Jiaotong University, Beijing, China, in 2015. She is currently a Ph.D. candidate at the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, as well as the University of Chinese Academy of Sciences. Her research interests include computer vision, pattern recognition, and intelligent transportation systems.

Kunfeng Wang received his Ph.D. in Control Theory and Control Engineering from the Graduate University of Chinese Academy of Sciences, Beijing, China, in 2008. He joined the Institute of Automation, Chinese Academy of Sciences and became an Associate Professor at the State Key Laboratory of Management and Control for Complex Systems. From December 2015 to January 2017, he was a Visiting Scholar at the School of Interactive Computing, Georgia Institute of Technology, Atlanta, GA, USA. His research interests include intelligent transportation systems, intelligent vision computing, and machine learning.

Fei-Yue Wang received the Ph.D. degree in computer and systems engineering from Rensselaer Polytechnic Institute, Troy, NY, USA, in 1990. He was an Editor-in-Chief of the International Journal of Intelligent Control and Systems, the World Scientific Series in Intelligent Control and Intelligent Automation, IEEE Intelligent Systems, and the IEEE Transactions on Intelligent Transportation Systems. He has served as chair for over 20 IEEE, ACM, INFORMS, and ASME conferences. His current research interests include social computing, complex systems, and intelligent computing and control. He is currently the Director of the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, the Vice President of the ACM China Council, and the Vice President/Secretary-General of the Chinese Association of Automation. He is a member of Sigma Xi and an Elected Fellow of IEEE, INCOSE, IFAC, ASME, and AAAS.



Qiang Ji received his Ph.D. degree in Electrical Engineering from the University of Washington. He is currently a professor with the Department of Electrical, Computer, and Systems Engineering at Rensselaer Polytechnic Institute (RPI). He recently served as director of the Intelligent Systems Laboratory (ISL) at RPI. Prof. Ji's research interests are in computer vision, probabilistic graphical models, information fusion, and their applications in various fields. Prof. Ji is an editor of several related IEEE and international journals and he has served as a general chair, program chair, technical area chair, and program committee member in numerous international conferences/workshops. Prof. Ji is a fellow of IEEE and IAPR.