Pattern Recognition 88 (2019) 584–594
Contents lists available at ScienceDirect
Pattern Recognition
journal homepage: www.elsevier.com/locate/patcog
Cascade learning from adversarial synthetic images for accurate pupil
detection
Chao Gou a,c,∗, Hui Zhang a, Kunfeng Wang a,c, Fei-Yue Wang a,c, Qiang Ji b
a Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
b Rensselaer Polytechnic Institute, Troy, NY 12180, USA
c Qingdao Academy of Intelligent Industries, Qingdao 266109, China
Article info
Article history:
Received 24 July 2018
Revised 19 November 2018
Accepted 15 December 2018
Available online 18 December 2018
Keywords:
Cascade regression
GANs
Pupil detection
Abstract
Image-based pupil detection, which aims to find the pupil location in an image, has been an active research topic in the computer vision community. Learning-based approaches can achieve preferable results given large amounts of training data with eye center annotations. However, few publicly available datasets provide accurate eye center annotations, and manually labeling large amounts of training data is unreliable and time-consuming. In this paper, inspired by learning from synthetic data in the Parallel Vision framework, we introduce a step of parallel imaging built upon Generative Adversarial Networks (GANs) to generate adversarial synthetic images. In particular, we refine synthetic eye images with an improved SimGAN using an adversarial training scheme. For the computational experiments, we further propose a coarse-to-fine pupil detection framework based on shape-augmented cascade regression models learned from the adversarial synthetic images. Experiments on the benchmark databases BioID, GI4E, and LFW show that the proposed method performs significantly better than other state-of-the-art methods by leveraging the power of cascade regression and adversarial image synthesis.
© 2018 Elsevier Ltd. All rights reserved.
∗ Corresponding author at: Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China.
E-mail addresses: [email protected], [email protected] (C. Gou).
https://doi.org/10.1016/j.patcog.2018.12.014
0031-3203/© 2018 Elsevier Ltd. All rights reserved.

1. Introduction

Image-based pupil detection, also named eye center detection, aims to predict the location of the pupil in an image, and it is becoming an increasingly active area of research in the computer vision community. Accurate pupil detection is useful for gaze prediction, eye state estimation, and human-machine interaction. Recently, cascade regression has emerged as a promising approach for landmark detection [1–4]. It has been further extended to accurate pupil detection and eye state estimation [5,6], where it achieved state-of-the-art pupil detection results. Cascade regression begins with initial semantic key point (e.g. eye corner) locations, followed by iteratively updating the positions through sequential regression models until convergence. The cascade regression models are learned to map the current key-point-related appearance to the target location increment with respect to the ground truth. Moreover, this allows capturing structural and contextual information for pupil localization.

For pupil detection, there are limited databases with eye-related key point annotations, including the eye center. In addition, manual annotation of eye centers is time-consuming and not reliable. Learning-by-synthesis has been successfully applied to eye center detection, where the model is learned from purely synthetic data with automatic annotations generated by computer graphics techniques [6]. Parallel Vision (PV) [7,8], which extends the ACP methodology [9] to computer vision, was proposed as a unified framework including Artificial scenes, Computational experiments, and Parallel execution for the perception and understanding of complex scenes, and it also emphasizes the significance of synthesis. As pointed out in [10], large amounts of artificial data make it possible to model the real distributions. In the case of pupil detection, however, synthetic eyes are limited in covering the appearance variations of real eyes. SimGAN [11] was proposed to refine synthetic images through adversarial training in the framework of Generative Adversarial Networks (GANs) [12]. Experimental results show that SimGAN can improve gaze estimation.

In this paper, inspired by the PV framework, we propose to learn cascade regression models from adversarial image synthesis for accurate pupil detection. The overall framework is illustrated in Fig. 1. From the perspective of PV, artificial eyes can offer us sufficient synthetic eye images with annotations. We first simulate thousands of eye images with accurate eye annotations. Built upon SimGAN, we further perform parallel imaging to add realism to the artificial eyes, where we can generate adversarial synthetic
Fig. 1. Overview of the proposed pupil detection framework based on PV.
samples that preserve the simulator's annotations and add real appearance information. Finally, we propose to learn shape-augmented regression models from the adversarial synthetic images and test on real ones to perform the computational experiments. The major contributions of the proposed work are highlighted as below:

• Adversarial image synthesis: We perform parallel imaging by an improved SimGAN for training image synthesis. The generated adversarial images preserve both the accurate annotations of synthetic eyes and the realistic appearance information of real eyes.
• Shape-augmented cascade regression: By leveraging the shape-augmented framework, it captures the eye-related shape, appearance, and contextual information for final pupil detection.
• Promising experimental results: The proposed method performs better than other state-of-the-art methods on benchmark databases and allows for real-time applications.

In the following, related work is reviewed in Section 2. Details of the proposed method are described in Section 3. Experimental results are discussed in Section 4. We give the conclusion in Section 5.

2. Related work

For image-based pupil detection, we review the recent learning-based work in this section. Earlier prominent reviews on pupil detection can be found in [13,14]. Valenti and Gevers [15] provided a new method to infer the eye center location using circular symmetry based on isophote properties. A voting-based method is introduced in this isophote-based framework. Timm and Barth [16] proposed to detect the eye center by image gradients. They mathematically described the relationship of gradient orientations between a possible iris center and contextual regions. In [17], a novel variation of the linear SVM was presented for efficient pupil detection. Araujo et al. [18] introduced the Inner Product Detector based on correlation filters to detect the center of the eye. This method is robust to small variations, but it is limited in modeling the appearance variations of eyes in the wild. A recent work [19] reviewed local structure patterns (LSPs) and extended them to multi-scale block-based LSPs for pupil detection. Adaboost feature selection on the combination of intensity-based LSPs and gradient-based LSPs was performed to further improve the performance of pupil detection in [19].

Cascade regression has emerged as the leading framework for facial landmark detection [1–3,20–22]. It begins with initial landmark locations, e.g. the mean locations of the training samples, and follows by iteratively updating the locations through sequentially trained regression models. Zhou et al. [20] optimized the original facial landmark detection method SDM [3] by extracting features at different scales and employing a coarse-to-fine strategy to enhance the robustness. By using multiple nonlinear features, the approach obtained higher localization accuracy. In [20], the authors trained the model using more than 10K images with manual annotations. Gou et al. [5] extended cascade regression to pupil detection, where they proposed a joint cascade regression framework for simultaneous pupil detection and eye state estimation. By capturing the structural, shape, and contextual information using the cascade regression framework, it achieved the state-of-the-art result [5]. The methods in [5,20] employed handcrafted features that are limited in representation. Recently, deep features have been widely used in fields such as 3-D human pose recovery [23–25], face detection [26,27], privacy setting recommendation for social image sharing [28,29], and landmark detection [21,22]. Hong et al. [30] first introduced locality-sensitive constraints and fed a combination of multi-view hand-crafted features (e.g. HOG, shape context, hierarchical centroid, etc.) into sparse coding for 3-D pose recovery from silhouettes. They further proposed a novel deep feature extractor [24] with multi-modal fusion, and experimental results demonstrated its effectiveness for pose recovery [23]. Yu et al. [28] proposed a novel approach termed iPrivacy that jointly learned deep CNNs and a tree classifier for privacy setting recommendation. They further incorporated image content sensitiveness and user trustworthiness, which were extracted by designing deep models and learning a dictionary, respectively [29]. In [21], the authors presented three-level cascaded convolution networks for facial landmark detection. A reliable initial location estimation was achieved at the first level, followed by fine tuning at the following two levels. Zhang et al. [22] proposed to employ deep cascaded convolution networks to capture the inherent correlation between face detection and landmark localization to boost the performance of these two tasks. For these learning-based approaches, it is unreliable and time-consuming to collect large amounts of such training data. With the increasing progress in computer graphics and virtual reality, it is becoming possible to learn from artificial scenes and objects that cover the distribution of the targets' appearance in the real world [6,31–35]. Gou et al. [6] proposed to learn from synthetic images for pupil detection using cascade regression methods and achieved preferable results on benchmark databases. Learning-by-synthesis is becoming an increasingly active topic for eye-related research. In [31], the authors presented UnityEyes, a novel method to rapidly synthesize large amounts of eye region images as training data. The synthesized images can be used to estimate gaze for in-the-wild scenarios, even for large head poses and gaze directions.

However, learning from synthetic images cannot achieve desirable performance due to the bias between synthetic images and real images. Synthetic data is often not realistic enough and cannot cover the variations in appearance in the real world. Wang et al. [7,8] proposed a unified framework termed Parallel Vision (PV), which was extended from the ACP (Artificial systems, Computational experiments, and Parallel execution) methodology [9], for image perception and understanding. From the perspective of PV, conventional vision algorithms learning from synthetic images are part of PV with respect to artificial scenes and computational experiments, and they can be further improved by the parallel execution of online optimization through interaction between synthetic and real images. Ganin and Lempitsky tried to tackle the data bias problem from the viewpoint of domain adaptation [36]. They proposed to learn domain-invariant features by an adversarial training method. Yoo et al. [37] presented a pixel-level domain converter learned with Generative Adversarial Networks (GANs) [12].
Fig. 2. Parallel vision framework based on ACP [7] .
Shrivastava et al. [11] proposed Simulated+Unsupervised (S+U) learning to improve the realism of synthetic images and preserve the annotation information at the same time.

3. Proposed method

3.1. ACP methodology and parallel vision framework

The ACP (Artificial systems, Computational experiments, and Parallel execution) methodology was first proposed for effectively modeling and controlling complex systems [9]. Artificial systems aim to capture all the information that real systems have. In such artificial systems, different models can be built to describe the actions and responses of the target real systems. After building the ideal artificial systems, computational experiments are conducted to effectively validate the proposed algorithm. The third part of ACP is parallel execution, referring to one or more artificial systems running in parallel with the real complex system, where it is executed in Parallel Systems. Through parallel execution, we can explore the potential of artificial systems and transform the roles between the artificial systems and the real system. Based on the interactions, we can achieve online optimization from static to dynamic. The core philosophy of ACP is to combine Artificial systems, Computational experiments, and Parallel execution to turn the virtual artificial space into another space for solving complex problems [38]. ACP is becoming an increasingly important research topic and is widely applied in areas such as social computing, traffic management and control, and ethylene production management [39–41].

Wang et al. [7,8] further extended the ACP framework to computer vision as the PV (Parallel Vision) framework, as shown in Fig. 2. PV also consists of three major parts: artificial scenes, computational experiments, and parallel execution. Modern machine learning methods require large numbers of labeled images for training, while in real applications we cannot collect sufficient labeled training samples to cover the real distributions. With the increasing development of computer graphics and virtual reality techniques, it becomes possible for artificial scenes to accurately model real scenes. In our case of pupil detection, we can collect large-scale synthetic samples by artificial generation for training algorithms. Computational experiments can be conducted to evaluate the proposed pupil detection algorithms, including learning and testing perception in both real and artificial eye images. However, training algorithms purely on artificial synthetic eye images cannot model the real eye distribution due to the dataset bias problem. Parallel execution, like the parallel imaging of GANs-based image generation [34], offers a methodology to close the gap between synthetic and real images.

From the perspective of PV, existing work on learning-by-synthesis [6,31] is part of PV with respect to artificial scenes and computational experiments. To tackle the data bias problem between synthetic and real data, inspired by the PV framework, we introduce a step of parallel imaging to refine the synthetic images based on an adversarial training scheme. Specifically, following the recent adversarial-training-based work [11], we perform adversarial image synthesis to add realism to synthetic images while preserving the accurate annotations. We further propose to learn cascade regression models from adversarial image synthesis to perform the computational experiments. The overview of our proposed method based on PV is illustrated in Fig. 1. In the following, we first describe the generation of synthetic eyes in the artificial scenes part in Section 3.2. Then we present the method for generating adversarial photo-realistic eye images in the parallel imaging part in Section 3.3. Finally, we present our proposed cascade regression framework for accurate eye center detection in detail in Section 3.4.

3.2. Artificial eye generation

Learning-based pupil detection needs a large number of training samples with accurate annotations. However, it is time-consuming and unreliable to label the eye center in real images. In addition, since there are diverse illuminations, ethnicities, eye states, gaze directions, genders, and head poses of subjects, manually collected data cannot cover the variations of eye appearance in the wild. From the perspective of PV, artificial eyes can offer us sufficient training samples to cover the general appearance in real scenarios. Recently, learning from synthetic eye images for gaze estimation has achieved promising results [31], where the authors proposed a method called UnityEyes to generate synthetic eye images with gaze annotations. Experimental results show that even KNN-based algorithms trained on the eyes simulated by UnityEyes can achieve state-of-the-art results for eye gaze estimation. It also outperforms a deep learning-based method for gaze estimation when testing on real images [31].

In this paper, we apply UnityEyes to generate artificial eyes as the first step of our pupil detection pipeline. UnityEyes can rapidly generate synthetic eyes with varied appearance with respect to different skin colors, various illuminations, and a large range of head poses and gaze directions. It can also automatically label the key point locations, as shown in Fig. 1. To capture the eyeball appearance with respect to different gazes, a 3D mesh is used to model the shape and texture variations of the eyeball. This mesh corresponds to the external surface, and the related shape is defined by the cornea and two sphere representations. A generative model is applied to synthesize facial eye regions to cover the different appearances of subjects with respect to gender, age, and ethnicity. In particular, a 3D morphable shape model of the eye region is built upon high-resolution 3D head scans captured in a professional photogrammetric studio. The generic eye region topology with n vertices is denoted by a 3n-dimensional vector s = [x_1, y_1, z_1, ..., x_n, y_n, z_n]. PCA is performed on the m retopologized scans to extract the orthogonal basis functions U ∈ R^{3n×m}. A linear parametric shape model M_s is formulated as M_s(μ, δ, U), where μ ∈ R^{3n} is the mean shape of the m retopologized scans, and δ are the parameters following the Gaussian that are estimated by fitting each basis function to the original scans. As a result, a morphable model is formulated as below:

s(θ) = μ + U diag(δ) θ^T,  (1)

where θ ∈ R^m is a vector of coefficients of the related basis functions. As a result, a new eye region can be formed by randomly generated parameters θ, which follow the Gaussian distribution with zero mean and unit variance.
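The sampling step above can be sketched in a few lines. This is a toy illustration only: the dimensions, the random orthonormal basis, and the per-mode deviations are placeholders standing in for the statistics of real retopologized head scans.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only): n vertices, m basis shapes.
n, m = 100, 8
mu = rng.normal(size=3 * n)                        # mean shape, R^{3n}
U = np.linalg.qr(rng.normal(size=(3 * n, m)))[0]   # orthonormal basis, R^{3n x m}
delta = np.abs(rng.normal(size=m))                 # per-mode deviations

def sample_eye_shape(theta):
    """Eq. (1): s(theta) = mu + U diag(delta) theta."""
    return mu + U @ (delta * theta)

# A new eye region comes from theta ~ N(0, I).
theta = rng.standard_normal(m)
s_new = sample_eye_shape(theta)
```

Setting theta to zero recovers the mean shape mu, which is a quick sanity check on the linear model.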
Another critical factor for appearance-based pupil detection is illumination, since different lighting results in varied appearance. To synthesize eye images under diverse illuminations, light sources pointing in random directions towards the eye regions are simulated. Different panoramic photographs are randomly chosen, rotated, and exposed to render the reflections and
Fig. 3. Examples of artificial eye images.
Fig. 4. Illustration of the adversarial learning for generating the photorealistic eyes.
environmental ambient light. Different head poses are achieved by using spherical coordinates pointed towards the eyeball center of the 3D model. Some artificial eye images are shown in Fig. 3. Refer to [31] for more details.

3.3. Adversarial image synthesis from parallel imaging

Synthetic data has been widely used for training vision-based algorithms because it can be automatically generated with varied appearance and accurate annotations [7]. However, learning from synthetic images cannot achieve the desired performance, as the distribution of synthetic images is different from that of real ones. Although large efforts have been undertaken to improve the realism of generated artificial images, they are still limited in covering all the variations of appearance in real scenes. In this work, as shown in Fig. 1, we introduce a step of parallel imaging to reduce the gap between synthetic and real eye images. Our goal is to refine the artificial eyes with texture and appearance from unlabeled real eyes. Building upon the recent GANs-based approach termed SimGAN [11], we learn a generative model to generate adversarial synthetic images that preserve the annotations of the synthetic eyes and the realism of the real eyes. The overview of the GANs-based model in this paper is illustrated in Fig. 4, where the deep network of the generator G is the expected generative model, whose output makes it impossible for the deep network of the discriminator D to determine the source of the input images. In this paper, we use r and s to represent real eyes and synthetic eyes, respectively. G and D are alternately optimized.

For the discriminator, following the idea of GANs, the learning procedure aims to minimize the adversarial loss formulated in Eq. (2):

L_D(θ_d) = E_{s∼p_s(s)}[−log(D(G(s; θ_g); θ_d))] + E_{r∼p_r(r)}[−log(1 − D(r; θ_d))],  (2)

where E_{s∼p_s(s)} represents the expectation over inputs sampled from the probability distribution p_s(s) of synthetic eyes, p_r(r) represents the distribution of the real eyes, and D(·; θ_d) denotes the probability of the input being sampled from fake eyes. By minimizing this loss function, the discriminator aims to make D(G(s; θ_g); θ_d) approach 1 and make D(r; θ_d) approach 0. Hence, for the training of the discriminator network D, the ground truth labels are 0 for all the real eyes r and 1 for the outputs of the generator G(s; θ_g).

The purpose of the generator G is to make its output G(s; θ_g) as real as possible in appearance while preserving the accurate structural shape of the input synthetic eye s. Following the recent idea from Shrivastava et al. [11], we formulate its loss as a combination of an adversarial realism loss and a self-regularization loss that preserves the detailed information. Since cascade regression is applied for pupil detection based on the texture and appearance information around the key points, we need to preserve the accurate key point (e.g. eyelid and eye corner) locations and the related local texture information from the synthetic eyes. To this end, different from Shrivastava et al. [11], who calculate the pixel-level value difference as a self-regularization loss to preserve annotations like the gaze vector for gaze estimation, we propose to design another ConvNet to preserve the detailed information (e.g. structural shape, annotated key point locations, and related global and local texture) of the synthetic eyes, followed by calculating the L1 loss as the self-regularization loss. The loss function for learning the generator is formulated in Eq. (3) as below:

L_G(θ_g, θ_c) = E_{s∼p_s(s)}[−log(1 − D(G(s; θ_g); θ_d))] + λ E_{s∼p_s(s)}(||ConvNet(G(s; θ_g); θ_c) − ConvNet(s; θ_c)||_1),  (3)

where λ is the balance factor, ConvNet is a deep convolutional network that transforms the image space into a representative feature space, θ_c are the parameters of the ConvNet, and ||·||_1 is the L1 norm.

In this work, the parameters of the discriminator and the generator are alternately learned by minimizing the loss functions in Eqs. (2) and (3), respectively. Specifically, when learning the parameters θ_d of the discriminator, θ_g and θ_c are fixed to the result of the generator training, and vice versa. It is worth noting that the parameters θ_g and θ_c are learned simultaneously when performing the generator learning. For each update of θ_d, we update the parameters of the generator 10 times.
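The two objectives in Eqs. (2) and (3) can be made concrete with a runnable toy sketch. Everything here is a placeholder, not the paper's architecture: scalars stand in for images, G is a linear refiner, D is a logistic scorer of "fakeness", and feat stands in for the ConvNet of Eq. (3).

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def G(s, theta_g):
    # Toy generator/refiner: scale-and-shift of a synthetic sample.
    return theta_g[0] * s + theta_g[1]

def D(x, theta_d):
    # Toy discriminator: probability that the input is a refined fake.
    return sigmoid(theta_d[0] * x + theta_d[1])

def feat(x, theta_c):
    # Stand-in for the ConvNet feature map used in Eq. (3).
    return theta_c * x

def loss_D(s_batch, r_batch, theta_d, theta_g):
    # Eq. (2): push D(G(s)) toward 1 (fake) and D(r) toward 0 (real).
    fake = D(G(s_batch, theta_g), theta_d)
    real = D(r_batch, theta_d)
    return np.mean(-np.log(fake)) + np.mean(-np.log(1.0 - real))

def loss_G(s_batch, theta_g, theta_d, theta_c, lam=0.1):
    # Eq. (3): fool D, plus an L1 self-regularization in feature space.
    refined = G(s_batch, theta_g)
    adv = np.mean(-np.log(1.0 - D(refined, theta_d)))
    reg = np.mean(np.abs(feat(refined, theta_c) - feat(s_batch, theta_c)))
    return adv + lam * reg

# Toy batches: synthetic samples cluster positive, real samples negative.
s_batch = rng.normal(2.0, 0.5, 64)
r_batch = rng.normal(-2.0, 0.5, 64)
theta_g = np.array([1.0, 0.0])      # identity refiner to start
good_d = np.array([5.0, 0.0])       # separates fake (positive) from real (negative)
blind_d = np.array([0.0, 0.0])      # outputs 0.5 everywhere
ld_good = loss_D(s_batch, r_batch, good_d, theta_g)
ld_blind = loss_D(s_batch, r_batch, blind_d, theta_g)
lg = loss_G(s_batch, theta_g, good_d, theta_c=1.0)
```

In alternating training, each step would fix one set of parameters and take gradient steps on the other loss; a discriminator that separates the two batches attains a lower L_D than one that guesses 0.5 everywhere.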
3.4. Shape augmented cascade regression for computational experiments

In the PV framework, computational experiments should be conducted to evaluate the performance of the proposed algorithm. For our task, we first build a novel model based on cascade regression for pupil detection. Then we carry out computational experiments to validate the proposed cascade regression framework. In this paper, motivated by our previous work on facial landmark detection [1,2], which uses cascade regression models, we propose to detect the eye center using a shape-augmented regression method on the basis of eye-related shape, structural, and appearance information. Before we introduce the shape-augmented regression method, we review the general cascaded framework.

3.4.1. General cascaded regression
The general cascaded regression method has been widely applied to facial landmark detection, where it learns multiple sequential regressors based on the local appearance. The general cascade framework for landmark detection is summarized in Algorithm 1. In this work, the locations of N semantic facial landmarks such as the eyes, nose tip, and mouth corners in a given image are denoted by p = {{x_1, y_1}, ..., {x_N, y_N}} ∈ R^{2N}. p^t represents the locations of the landmarks at iteration t in the cascade regression framework. At cascade regression level t, a function g_t is adopted to map the extracted local appearance features around the landmarks p^{t−1} to the updates Δp^t. The Supervised Descent Method (SDM) [3] is one of the most popular cascade frameworks for facial landmark detection, where they
Algorithm 1 General cascade regression for landmark detection.
Input:
  N landmarks are initialized as p^0 by the mean locations of the training samples given the input image I.
Do cascade regression:
  for t = 1, ..., T do
    Estimate the landmark location updates Δp^t given the current landmark locations p^{t−1}:
      g_t: I, p^{t−1} → Δp^t
    Update the landmark locations:
      p^t = p^{t−1} + Δp^t
  end for
Output:
  Landmark locations p^T.
Fig. 5. Proposed coarse-to-fine pupil detection based on shape augmented regression. (a) Facial landmark detection. (b) Coarsely extracted eye region and initialized key point locations. (c) Output of the first cascade level. (d) Final key point locations including the eye center.
c
w
t
E
f
p
�
c
p
a
M
c
g
w
c
d
f
m
i
A
t
I
O
w
a
r
w
e
e
p
k
l
use a linear function for mapping. Hence, we can formulate the
cascade regression model g t in Algorithm 1 by �p
t = g t ( p
t−1 , I ) =R
t �( p
t−1 , I ) + c t , where �( p
t−1 , I ) is the appearance feature, R
t
and c t are the model parameters. For the training of cascade re-
gression in this work, we acquire the ground truth of desired up-
dates by �p
t, ∗ = p
∗ − p
t−1 , where p
∗ is the ground truth location.
Then we learn the linear regression model parameters R
t and c t
by least-squares formulation with closed form solution. By adding
the updates �p
t to the current locations p
t−1 , the prediction of
landmark location is achieved.
3.4.2. Shape augmented cascade regression for pupil detection
Shape augmented cascade regression is first proposed in
[42] for facial landmark detection. We further explore it to 3D
facial landmark alignment to capture the shape information [4] .
Our proposed coarse-to-fine framework for pupil detection is illus-
trated in Fig. 5 . We first perform face detection followed by apply-
ing shape augmented cascade regression for 51 facial landmarks
detection. The eye region can be further coarsely cropped by the
detected facial landmarks. Intuitively, the eye center location is re-
lated to iris region appearance information. In addition, eye related
shape and contextual information like eye corner and eyelids are of
help for accurate eye center location. To capture these information,
as shown in Fig. 5 (b), we choose 11 eye related key points includ-
ing eye corners, eyelids and eye center for cascade regression.
In this paper, we extend shape augmented regression for pupil
detection. The objective function for eye related key points detec-
tion is denoted by Eq. (4) .
f ( p ) =
1
2
|| �( p , I ) − �( p
∗, I ) || 2 +
1
2
|| �( p ) − �( p
∗) || 2 , (4)
where I is the given eye region image, p are the eye related key
point locations, p
∗ are the annotations, �( p , I ) are the local ap-
pearance features around p , and �( p ) are the extracted shape fea-
tures by calculating the difference among pairs of key points. A
second order Taylor expansion of Eq. (4) is applied at an initial lo-
ation p
0 as below:
f ( p ) = f ( p
0 + �p ) ≈ f ( p
0 ) + J f ( p
0 ) T �p +
1
2
�p
T H f ( p
0 )�p ,
(5)
here J f ( p
0 ) is the Jacobian matrix, and H f ( p
0 ) is the Hessian ma-
rix of function f ( ·) at p
0 . We further take the derivation of f ( p ) in
q. (5) and set it to zero to acquire the optima for the objective
unction. Hence, we can estimate the updates of eye related key
oint locations as Eq. (6) .
p = − H f ( p
0 ) −1 J f ( p
0 )
= − H f ( p
0 ) −1 [ J �( p
0 )(�( p
0 , I ) − �( p
∗, I ))
+ J �( p
0 )(�( p
0 ) − �( p
∗))]
= − H f ( p
0 ) −1 J �( p
0 )�( p
0 , I ) − H f ( p
0 ) −1 J �( p
0 )�( p
0 )
+ H f ( p
0 ) −1 (J �( p
0 )�( p
∗, I ) + J �( p
0 )�( p
∗)) (6)
However, the ground truth locations p
∗ are unknown but fixed
onstants during inference and Hessian matrix is not guaranteed
ositive definite everywhere. Hence, we introduce the parameters
s below:
= −H f ( p
0 ) −1 J �( p
0 ) Q = −H f ( p
0 ) −1 J �( p
0 ) b = H f ( p
0 ) −1 (J �( p
0 )�( p
∗, I ) + J �( p
0 )�( p
∗)) (7)
By embedding the introduced parameters into iteration t of cas-
ade regression, we can linearly formulate Eq. (6) as below:
t : �p
t = M
t �( p
t−1 , I ) + Q
t �( p
t−1 ) + b
t , (8)
here we use a linear model as the map function g ( ·) in general
ascade regression and incorporate the shape information for pupil
etection in cascade regression, �( p , I ) are the local appearance
eatures (e.g. SIFT), and �( p ) are the shape features. Our proposed
ethod of shape augmented cascade regression for pupil detection
s summarized in Algorithm 2 .
Algorithm 2 Shape augmented cascade regression for pupil detection.
Input: N key point locations initialized as p^{0} by the mean locations of the training samples, given the input eye image I.
Do cascade regression:
for t = 1, …, T do
    Estimate the key point location updates given the current key point locations p^{t−1}:
        Δp^{t} = M^{t} Φ(p^{t−1}, I) + Q^{t} Ψ(p^{t−1}) + b^{t}
    Update the key point locations:
        p^{t} = p^{t−1} + Δp^{t}
end for
Output: Locations p^{T} of the eye related key points, including the eye center.
After building the model in the computational experiments part, we need to learn the parameters in Eq. (7). Since we have generated sufficient adversarial synthetic images with ground truth eye related key point annotations by GANs in the parallel imaging part, we can model the real eye center location distribution. Given the j-th eye image I_j with ground truth locations p*_j, we acquire the eye shape updates by subtracting the current locations p_j^{t−1} from p*_j, i.e., Δp_j^{t,*} = p*_j − p_j^{t−1}. Given K training images, the initial key point locations p_j^{0} are the mean locations of the training samples. The learning of M^{t}, Q^{t}, and the bias b^{t} at cascade level t can be formulated
C. Gou, H. Zhang and K. Wang et al. / Pattern Recognition 88 (2019) 584–594 589
Fig. 6. Examples of detection result by our proposed work for each benchmark database.
as a standard least-squares formulation with a closed-form solution:
M^{t*}, Q^{t*}, b^{t*} = arg min_{M^{t}, Q^{t}, b^{t}} Σ_{j=1}^{K} ||Δp_j^{t,*} − M^{t} Φ(p_j^{t−1}, I_j) − Q^{t} Ψ(p_j^{t−1}) − b^{t}||^2   (9)
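Because Eq. (9) is an ordinary least-squares problem, each cascade level can be solved in closed form. A minimal numpy sketch (variable names are illustrative; M^{t}, Q^{t}, and b^{t} are solved jointly as one stacked linear system):

```python
import numpy as np

def learn_cascade_level(dp_targets, phi, psi):
    """Solve Eq. (9) for one cascade level: find M, Q, b minimizing
    sum_j || dp*_j - M phi_j - Q psi_j - b ||^2 by linear least squares.

    dp_targets: (K, D) target shape updates dp^{t,*}_j
    phi:        (K, F) appearance features Phi(p_j^{t-1}, I_j)
    psi:        (K, S) shape features Psi(p_j^{t-1})
    """
    K = dp_targets.shape[0]
    # Stack [phi | psi | 1] so M, Q, and the bias b come out of one solve.
    X = np.hstack([phi, psi, np.ones((K, 1))])
    W, *_ = np.linalg.lstsq(X, dp_targets, rcond=None)
    F, S = phi.shape[1], psi.shape[1]
    M = W[:F].T          # (D, F) appearance weights
    Q = W[F:F + S].T     # (D, S) shape weights
    b = W[-1]            # (D,)   bias
    return M, Q, b
```

On data generated exactly by a linear model, this recovers the true parameters, which is a convenient check that the stacking is consistent with Eq. (8).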
For inference, given the detected eye region I and beginning with the mean key point initialization p^{0}, we iteratively estimate the eye location with the learned cascade model, as shown in Fig. 5 and Algorithm 2.
4. Experimental results

We have described the model learning and inference for the computational experiments in detail. In this section, we further conduct experiments to validate the proposed framework on benchmark datasets for pupil detection.
4.1. Experimental setup

4.1.1. Datasets
For the adversarial training in the parallel imaging step, we use around 28K synthetic images simulated by UnityEye [31] and real eye images from the normalized part of MPIIGaze [32] for training. We then apply the trained generator network to the 10,730 artificial samples used in [6], which adds realism to the artificial eye images while keeping the original annotations of [6]. We further conduct computational experiments on benchmark datasets for pupil detection to demonstrate the performance of the proposed work.
In this work, three widely used benchmark datasets, BioID [43], GI4E [44], and LFW [45], are used for testing to verify our proposed method for pupil detection. As one of the most widely used datasets for pupil detection, BioID contains 1521 gray-level images of size 384 × 286. It is challenging due to its different subjects, various gaze directions, and illumination conditions. GI4E contains 1236 images of size 800 × 600 taken by a normal camera, covering 103 subjects with 12 different gaze directions. We randomly select 1000 color images of size 250 × 250 from LFW for testing. We evaluate the performance of our proposed framework on GI4E, BioID, and LFW, which subjectively correspond to high, medium, and poor image quality, respectively. Sample detection results by our proposed work for each database are shown in Fig. 6.
4.1.2. Measurement criteria
Similar to [6,43], we use the maximum normalized detection error to evaluate the eye detection capability:

d_eye = max(D_r, D_l) / D_rl,   (10)

where D_r and D_l are the Euclidean distances between the ground truth and the predicted right and left eye centers, respectively, and D_rl is the Euclidean distance between the right and left eye center ground truths. For this measurement, d_eye ≤ 0.1 corresponds to the range of the iris, and d_eye ≤ 0.05 corresponds to the range of the pupil diameter. Commonly, different thresholds on d_eye are set to calculate the detection rate.
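A direct implementation of Eq. (10), together with the thresholded detection rate, can be sketched as follows (numpy, with illustrative helper names):

```python
import numpy as np

def normalized_eye_error(pred_l, pred_r, gt_l, gt_r):
    """d_eye of Eq. (10): worst-eye error normalized by interocular distance."""
    d_l = np.linalg.norm(pred_l - gt_l)    # left-eye error D_l
    d_r = np.linalg.norm(pred_r - gt_r)    # right-eye error D_r
    d_rl = np.linalg.norm(gt_l - gt_r)     # ground-truth interocular distance D_rl
    return max(d_l, d_r) / d_rl

def detection_rate(errors, threshold=0.05):
    """Fraction of test images with d_eye at or below the given threshold."""
    errors = np.asarray(errors)
    return float(np.mean(errors <= threshold))
```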
4.1.3. Implementation details
In this paper, we crop eye regions of size 240 × 140 from the synthetic images and resize them to 72 × 42 for training the generator network in the parallel execution. For face detection, we apply the VJ face detector [46]. For a fair comparison with other methods on the GI4E dataset, we use the Deformable Part Models (DPM) based face detector [47]. The face detection rates on BioID, GI4E, and LFW are 97.5%, 99.8%, and 100%, respectively. The false positives (as shown in Fig. 12(e)) are not discarded, and the eye location rate can be further improved by optimizing the face detection. The balance parameter in Eq. (3) is set to 1.2. SIFT features are extracted around the key points in this paper. For the adversarial synthetic image generation, network architectures similar to Shrivastava et al. [11] are applied. The ConvNet for self-regularization in this paper contains 3 convolution and 3 max-pooling layers as follows: (1) Conv3 × 3, feature maps = 64, stride = 1; (2) MaxPool2 × 2, stride = 2; (3) Conv3 × 3, feature maps = 128, stride = 1; (4) MaxPool2 × 2, stride = 2; (5) Conv3 × 3, feature maps = 256, stride = 1; (6) MaxPool2 × 2, stride = 2.
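To make the layer specification concrete, the spatial sizes this ConvNet produces on the 72 × 42 training crops can be traced as below. This is a sketch under the assumption, not stated in the paper, that the 3 × 3 convolutions use 'same' padding and the 2 × 2 pools use floor division:

```python
def convnet_shapes(h, w):
    """Trace (height, width, channels) through the self-regularization ConvNet:
    three Conv3x3 (stride 1, assumed same-padding) / MaxPool2x2 (stride 2) pairs."""
    shapes = []
    for channels in (64, 128, 256):
        # Conv3x3 with stride 1 and same padding keeps the spatial size.
        shapes.append((h, w, channels))
        # MaxPool2x2 with stride 2 halves each spatial dimension (floor).
        h, w = h // 2, w // 2
        shapes.append((h, w, channels))
    return shapes

print(convnet_shapes(42, 72))  # final feature map: (5, 9, 256)
```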
4.2. Evaluation

4.2.1. Learning from adversarial synthetic images
We use the trained generator from the parallel imaging step to refine the synthetic images into adversarial synthetic images. Some qualitative results are shown in Fig. 7. We can see that the adversarial synthetic images capture the texture and appearance of real eyes. In addition, they preserve the original key point annotations of the synthetic images, which benefits our proposed learning-based pupil detection.
To further demonstrate the effectiveness of learning from adversarial synthetic images, we train two baselines for comparison. We separately train the model on 3584 real images [5] and on 10,730 synthetic eye images [6], termed learning-from-real and learning-from-synthesis, respectively. We name our proposed work learning-from-adversarial (proposed), in which we learn the shape augmented models from 10,730 adversarial eye images refined from Gou et al. [6]. We then test on the most widely used database, BioID, for demonstration. Experimental results are listed in Table 1. Compared with learning from pure synthetic data or from real images, learning from adversarial synthetic images generated by the parallel imaging step boosts pupil detection performance: it achieves a detection rate of 92.3% under the normalized error of 0.05 and significantly improves over the 89.2% of learning-from-synthesis. Manually collected real eye images contain limited gaze directions with limited pupil locations, and synthetic eye images cannot cover the variations
Fig. 7. Examples of real eyes, synthetic eyes and adversarial synthetic eyes.
Table 1
Eye detection results on BioID with normalized error
( d eye ≤ 0.05) based on different training samples.
Training types Detection rate
Learning-from-synthesis 89.2%
Learning-from-real 90.3%
Learning-from-adversarial(SimGAN) 91.5%
Learning-from-adversarial(proposed) 92.3%
Table 2
Eye detection result and comparison on GI4E database.
Method d eye ≤ 0.05 d eye ≤ 0.1 d eye ≤ 0.15
George et al. [48] 89.3% 92.3% 93.6%
Timm et al. [16] 92.4% 96.0% ∗ 96.9% ∗
Villanueva et al. [44] 93.9% 97.3% ∗ 98.0% ∗
Gou et al. [5] 94.2% 99.1% 99.6%
Learning-from-adversarial 98.3% 99.8% 99.8%
∗ Estimated from the accuracy curves in the corresponding paper [44].
Table 3
Eye detection result and comparison on BioID database.
Method d eye ≤ 0.05 d eye ≤ 0.10 d eye ≤ 0.15
Campadelli et al. [49] 80.7% 93.2% 95.3%
Timm et al. [16] 82.5% 93.4% 95.2%
Valenti et al. [15] 86.1% 91.7% 93.5%
Araujo et al. [18] 88.3% 92.7% 94.5%
Ren et al. [50] 77.1% 92.3% 97.2% ∗
George et al. [48] 85.1% 94.3% 96.7%
Chen et al. [51] 88.8% 95.2% 96.5% ∗
Choi et al. [19] 91.1% 98.4% 98.7% ∗
Gou et al. [5] 91.2% 99.4% 99.6%
Learning-from-adversarial 92.3% 99.1% 99.7%
∗ Estimated from the accuracy curves in the corresponding papers.
in appearance in the real world. In this work, our generated adversarial synthetic images take advantage of both synthetic and real eyes. To further verify the effectiveness of our modification of the self-regularization in SimGAN, we conduct comparison experiments on adversarial images generated by SimGAN [11], which we name learning-from-adversarial (SimGAN). For a fair comparison, we train SimGAN on the same training samples as our proposed method and likewise generate 10,730 refined synthetic images from Gou et al. [6]. As shown in Table 1, the introduced ConvNet for self-regularization enhances pupil detection performance. Compared with the pixel-level self-regularization used to preserve annotations from synthetic images [11], our introduced ConvNet emphasizes detailed information such as global shape, texture, and appearance around the eye related key points, which is beneficial for the learning-based shape cascade regression framework.
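The contrast between SimGAN's pixel-level self-regularization and the feature-space variant introduced here can be sketched as follows. This is an illustrative comparison, not the paper's exact losses: the L1 norm and the `feature_map` extractor (standing in for the fixed ConvNet above) are assumptions.

```python
import numpy as np

def pixel_self_reg(refined, synthetic):
    """Pixel-level self-regularization in the spirit of SimGAN [11]:
    penalize per-pixel deviation of the refined image from its source."""
    return float(np.abs(refined - synthetic).sum())

def feature_self_reg(refined, synthetic, feature_map):
    """Feature-space variant: penalize deviation between ConvNet feature maps,
    emphasizing shape/texture structure rather than raw pixel values."""
    return float(np.abs(feature_map(refined) - feature_map(synthetic)).sum())
```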
4.2.2. Comparisons with the state-of-the-art methods
The quantitative results on GI4E are listed in Table 2, with the best results highlighted. All results of the methods in Table 2 are from the original papers. Some qualitative results are shown in Fig. 8. GI4E contains high-quality images taken by a normal camera. Fig. 8 shows that we can accurately locate the eye center even for subjects with glasses or under low illumination. As shown in Table 2, our proposed method achieves a detection rate of 98.3% under d_eye ≤ 0.05. Compared with other related works, our method significantly improves pupil detection performance, further demonstrating that our proposed framework can be applied to accurate pupil detection tasks. It is worth noticing that our method outperforms a similar work [5] whose model is learned from a combination of real and synthetic samples. On the one hand, we learn from adversarial synthetic images, in which realism is added to the artificial eyes while the accurate annotations are kept. On the other hand, we leverage shape augmented regression to capture more local shape information with more eye related key points for eye center detection.
BioID is one of the most widely used datasets for eye detection. Experimental results on BioID are listed in Table 3. All results of the methods in Table 3 are from the original papers. Fig. 9 shows some correct detection examples of the proposed method on the BioID database. We can see that the proposed pupil detection method can handle low illumination conditions, various gaze directions, and subjects with glasses. As listed in Table 3, the proposed algorithm gives a detection rate of 92.3% for d_eye ≤ 0.05 and outperforms the state of the art, which reaches 91.1% in [19]. This demonstrates that the proposed work exploits cascade learning from the generated adversarial training samples better than the local pattern based methods. For the coarser evaluation at d_eye ≤ 0.1, it performs slightly worse than the state of the art [5], which reports a detection rate of 99.4%. This means the realism learned from the MPIIGaze training images is too limited to cover the appearance distribution of BioID, a problem that can be tackled by adding more unlabeled images with diverse appearance. In addition, the better performance of our proposed work at the finer evaluation of d_eye ≤ 0.05 also demonstrates that the annotations from adversarial synthetic images are more reliable.
We further conduct an evaluation on LFW, which contains images of low resolution and poor quality. Some detection examples are shown in Fig. 10. We can see that the proposed framework achieves promising results on extremely low-resolution images. Even when the image is blurred with ambiguous edge and contextual information, we can still detect the eye center based on the local pupil appearance. Quantitative comparisons with other methods are listed in Table 4. As shown in Table 4, our method significantly outperforms a recent work based on multi-scale
Fig. 8. Example results of eye localization in GI4E. The white dot represents the ground truth and red dot represents the predicted eye center. (For interpretation of the
references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 9. Example results of eye localization in BioID. The white dot represents the ground truth and red dot represents the predicted eye center. (For interpretation of the
references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 10. Example results of eye localization in LFW. The white dot represents the ground truth and red dot represents the predicted eye center. (For interpretation of the
references to colour in this figure legend, the reader is referred to the web version of this article.)
Table 4
Eye detection result and comparison on LFW database.
Method d eye ≤ 0.05 d eye ≤ 0.1 d eye ≤ 0.15
Fu et al. [52] 24.2% 75.4% 88.0%
Yi et al. [53] 52.0% ∗ 88.2% ∗ 95.3% ∗
Qian et al. [54] 75.1% 90.6% 94.3%
Choi et al. [19] 54.1% 91.4% 96.3%
Learning-from-adversarial 81.6% 97.2% 98.8%
∗ Estimated from the accuracy curves in the corresponding paper.
Fig. 11. Eye detection results at each cascade iteration on the BioID and GI4E databases. The Y coordinate denotes the detection rate with normalized error less than 0.05.
local structure patterns proposed in [19]. This shows that our proposed coarse-to-fine framework is much preferable for pupil detection in low-quality images.

In summary, thorough experiments show that our proposed method achieves state-of-the-art pupil detection results. In addition, the proposed pupil detection method is effective and robust on both high and low quality images. The major reasons our method outperforms the other state-of-the-art approaches are threefold. First, we detect the face in the whole image and then apply facial landmark detection to coarsely extract the eye regions, which allows us to capture the global structural information of the eyes with respect to the face. Second, by reducing the gap between the large amount of annotated synthetic data and the unlabeled real data through GANs, we can learn the real eye center locations with respect to different gaze directions more accurately. Third, we further exploit cascade regression based on the eye related contextual and local appearance information for eye center detection.
4.2.3. Further discussion
We further conduct comparison experiments with similar works that detect the pupil with a cascade regression framework [20,21]. For a fair comparison, similar to [20], we coarsely extract the eye region by the ground truth instead of with the help of a facial landmark detection method. Experimental results on BioID are shown in Table 5. As shown in Table 5, our introduced shape augmented regression outperforms the conventional linear cascade regression methods of SDM [3,20], since we capture both the global structural shape and the local appearance information for eye center detection. In addition, we further test the publicly available method from [21] for
Fig. 12. Examples of qualitative detection results with normalized error d_eye. The white dot represents the ground truth and the red dot represents the predicted eye center. The green bounding box is a wrongly detected face region. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Table 5
Eye localization comparison results with cascade regression
methods.
Method d eye ≤ 0.05 d eye ≤ 0.1 d eye ≤ 0.25
SDM [3] 90.3% ∗ 96.4% ∗ 100.0% ∗
Cascaded CNN [21] 92.6% + 99.4% + 99.8% +
CF-MF-SDM [20] 93.8% 99.8% 99.9%
Ours 95.7 % 99.9 % 100.0 %
∗ Results are from Zhou et al. [20]. + Results are produced by the original landmark detection methods.
facial landmarks, including eye center detection, on BioID. It achieves comparable results due to the deep model's nonlinear generalization ability. Future work will focus on designing nonlinear deep models [55–57] as the shape augmented cascade regression model. Upon further investigation, and compared with the results in Table 3, the initialization of the eye related key points based on the extracted eye region is important, because the regression cannot converge to the global optimum when the initial points are far from the ground truth.
In addition, we further study the influence of the number of iterations, denoted by T, in cascade regression. Experimental results with respect to different numbers of iterations are shown in Fig. 11. We can see that the regression converges quickly in the first two iterations and reaches its optimum at iterations 4 and 6 for BioID and GI4E, respectively. Based on this investigation, the number of iterations T can generally be set to 4, and increased to 6 for more challenging tasks such as the BioID dataset with its various illuminations and glasses.
In this work, all the testing experiments are conducted with non-optimized Matlab code on a standard PC with an Intel i5 3.47 GHz CPU and 16 GB RAM. Given the eye images, we test our model on the BioID dataset and evaluate the computational cost. An FPS of 23 can be achieved with 6 cascade regression steps. Upon further investigation, SIFT feature extraction costs most of the testing time. 16 FPS can be achieved when the method is applied in a real scenario with face detection and eye region extraction, which allows for real-time applications such as gaze estimation and human-computer interaction.
Although our proposed method significantly improves on the state of the art, image-based pupil detection is still a challenging task. More qualitative detection examples with respect to the maximum normalized error are shown in Fig. 12. When the whole eye region is under low illumination, as shown in Fig. 12(a) and (b), the appearance of the pupil and the eye related edge information are ambiguous, which leads to a normalized error d_eye greater than 0.05. For the case of Fig. 12(d), our proposed method cannot handle the reflections on glasses, which remain very challenging. As investigated previously, face detection is important for coarsely extracting the eye region. As shown in Fig. 12(e), if face detection fails and the eye region initialization is far from the ground truth, the cascade regression method cannot converge to the optimal eye centers. The performance can be further improved in the eye region extraction step of our proposed framework.
5. Conclusion and future work
In this paper, we propose a unified framework to learn shape augmented cascade regression models from adversarial synthetic images for accurate pupil detection. We introduce a parallel imaging step that exploits adversarial training to refine the artificial eyes with texture and appearance from real images while preserving detailed information, such as structural shape, from the synthetic images. For the computational experiments step, we learn shape augmented regression models based on the eye related shape, texture, and appearance for pupil detection. Our proposed work achieves state-of-the-art performance on three publicly available benchmark datasets.
Future work will focus on designing powerful nonlinear architectures (e.g., deep models) to map the appearance to the target updates at each cascade level. More effort will also be devoted to realizing the parallel execution so that it can achieve online optimization through real-time inputs of unlabeled images.
Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 61806198, 61304200, and 61533019, and by the National Science Foundation under Grant 1145152.
References
[1] Y. Wu, C. Gou, Q. Ji, Simultaneous facial landmark detection, pose and deformation estimation under facial occlusion, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3471–3480.
[2] C. Gou, Y. Wu, F.-Y. Wang, Q. Ji, Coupled cascade regression for simultaneous facial landmark detection and head pose estimation, in: Proceedings of the IEEE International Conference on Image Processing (ICIP), IEEE, 2017, pp. 2906–2910.
[3] X. Xiong, F. De la Torre, Supervised descent method and its applications to face alignment, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2013, pp. 532–539.
[4] C. Gou, Y. Wu, F.-Y. Wang, Q. Ji, Shape augmented regression for 3D face alignment, in: Proceedings of the European Conference on Computer Vision Workshops, Springer, 2016, pp. 604–615.
[5] C. Gou, Y. Wu, K. Wang, K. Wang, F.-Y. Wang, Q. Ji, A joint cascaded framework for simultaneous eye detection and eye state estimation, Pattern Recognit. 67 (2017) 23–31.
[6] C. Gou, Y. Wu, K. Wang, F.-Y. Wang, Q. Ji, Learning-by-synthesis for accurate eye detection, in: Proceedings of the International Conference on Pattern Recognition, 2017, pp. 3362–3367.
[7] K. Wang, C. Gou, N. Zheng, J.M. Rehg, F.-Y. Wang, Parallel vision for perception and understanding of complex scenes: methods, framework, and perspectives, Artif. Intell. Rev. (2017) 1–31.
[8] K. Wang, C. Gou, F.-Y. Wang, Parallel vision: an ACP-based approach to intelligent vision computing, Acta Autom. Sin. 42 (10) (2016) 1490–1500.
[9] F.-Y. Wang, Parallel system methods for management and control of complex systems, Control Decis. 5 (2004) 001.
[10] X. Liu, X. Wang, W. Zhang, J. Wang, F.-Y. Wang, Parallel data: from big data to data intelligence, Pattern Recognit. Artif. Intell. 30 (8) (2017) 673–682.
[11] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, R. Webb, Learning from simulated and unsupervised images through adversarial training, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2017, p. 5.
[12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Proceedings of the Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[13] D.W. Hansen, Q. Ji, In the eye of the beholder: a survey of models for eyes and gaze, IEEE Trans. Pattern Anal. Mach. Intell. 32 (3) (2010) 478–500.
[14] F. Song, X. Tan, S. Chen, Z.-H. Zhou, A literature survey on robust and efficient eye localization in real-life scenarios, Pattern Recognit. 46 (12) (2013) 3157–3173.
[15] R. Valenti, T. Gevers, Accurate eye center location through invariant isocentric patterns, IEEE Trans. Pattern Anal. Mach. Intell. 34 (9) (2012) 1785–1798.
[16] F. Timm, E. Barth, Accurate eye centre localisation by means of gradients, in: Proceedings of VISAPP, 2011, pp. 125–130.
[17] C. Zhang, X. Sun, J. Hu, W. Deng, Precise eye localization by fast local linear SVM, in: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2014, pp. 1–6.
[18] G. Araujo, F. Ribeiro, E. Silva, S. Goldenstein, Fast eye localization without a face model using inner product detectors, in: Proceedings of the IEEE International Conference on Image Processing (ICIP), 2014, pp. 1366–1370.
[19] I. Choi, D. Kim, A variety of local structure patterns and their hybridization for accurate eye detection, Pattern Recognit. 61 (2017) 417–432.
[20] M. Zhou, X. Wang, H. Wang, J. Heo, D. Nam, Precise eye localization with improved SDM, in: Proceedings of the IEEE International Conference on Image Processing (ICIP), IEEE, 2015, pp. 4466–4470.
[21] Y. Sun, X. Wang, X. Tang, Deep convolutional network cascade for facial point detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2013, pp. 3476–3483.
[22] K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and alignment using multitask cascaded convolutional networks, IEEE Signal Process. Lett. 23 (10) (2016) 1499–1503.
[23] C. Hong, J. Yu, J. Wan, D. Tao, M. Wang, Multimodal deep autoencoder for human pose recovery, IEEE Trans. Image Process. 24 (12) (2015) 5659–5670.
[24] C. Hong, X. Chen, X. Wang, C. Tang, Hypergraph regularized autoencoder for image-based 3D human pose recovery, Signal Process. 124 (2016) 132–140.
[25] T. von Marcard, R. Henschel, M.J. Black, B. Rosenhahn, G. Pons-Moll, Recovering accurate 3D human pose in the wild using IMUs and a moving camera, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[26] S. Yang, P. Luo, C.-C. Loy, X. Tang, From facial parts responses to face detection: a deep learning approach, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3676–3684.
[27] S. Yang, P. Luo, C.C. Loy, X. Tang, Faceness-Net: face detection through deep facial part responses, IEEE Trans. Pattern Anal. Mach. Intell. 40 (8) (2018) 1845–1859.
[28] J. Yu, B. Zhang, Z. Kuang, D. Lin, J. Fan, iPrivacy: image privacy protection by identifying sensitive objects via deep multi-task learning, IEEE Trans. Inf. Forens. Secur. 12 (5) (2017) 1005–1016.
[29] J. Yu, Z. Kuang, B. Zhang, W. Zhang, D. Lin, J. Fan, Leveraging content sensitiveness and user trustworthiness to recommend fine-grained privacy settings for social image sharing, IEEE Trans. Inf. Forens. Secur. 13 (5) (2018) 1317–1332.
[30] C. Hong, J. Yu, D. Tao, M. Wang, Image-based three-dimensional human pose recovery by multiview locality-sensitive sparse retrieval, IEEE Trans. Ind. Electron. 62 (6) (2015) 3742–3751.
[31] E. Wood, T. Baltrušaitis, L.-P. Morency, P. Robinson, A. Bulling, Learning an appearance-based gaze estimator from one million synthesised images, in: Proceedings of the Ninth Biennial ACM Symposium on Eye Tracking Research & Applications, ACM, 2016, pp. 131–138.
[32] X. Zhang, Y. Sugano, M. Fritz, A. Bulling, Appearance-based gaze estimation in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4511–4520.
[33] A. Dosovitskiy, J.T. Springenberg, M. Tatarchenko, T. Brox, Learning to generate chairs, tables and cars with convolutional networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (4) (2017) 692–705.
[34] K. Wang, Y. Lu, Y. Wang, Z. Xiong, F.-Y. Wang, Parallel imaging: a new theoretical framework for image generation, Pattern Recognit. Artif. Intell. 30 (7) (2017) 577–587.
[35] L. Cao, C. Gou, K. Wang, G. Xiong, F.-Y. Wang, Gaze-aided eye detection via appearance learning, in: Proceedings of the International Conference on Pattern Recognition, 2018.
[36] Y. Ganin, V. Lempitsky, Unsupervised domain adaptation by backpropagation, in: Proceedings of the International Conference on Machine Learning, 2015, pp. 1180–1189.
[37] D. Yoo, N. Kim, S. Park, A.S. Paek, I.S. Kweon, Pixel-level domain transfer, in: Proceedings of the European Conference on Computer Vision, Springer, 2016, pp. 517–532.
[38] F.-Y. Wang, J.J. Zhang, X. Zheng, X. Wang, Y. Yuan, X. Dai, J. Zhang, L. Yang, Where does AlphaGo go: from Church-Turing thesis to AlphaGo thesis and beyond, IEEE/CAA J. Autom. Sin. 3 (2) (2016) 113–120.
[39] F.-Y. Wang, Parallel control and management for intelligent transportation systems: concepts, architectures, and applications, IEEE Trans. Intell. Transp. Syst. 11 (3) (2010) 630–638.
[40] F.-Y. Wang, P.K. Wong, Intelligent systems and technology for integrative and predictive medicine: an ACP approach, ACM Trans. Intell. Syst. Technol. (TIST) 4 (2) (2013) 32.
[41] F.-Y. Wang, X. Wang, L. Li, L. Li, Steps toward parallel intelligence, IEEE/CAA J. Autom. Sin. 3 (4) (2016) 345–348.
[42] Y. Wu, Q. Ji, Shape augmented regression method for face alignment, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 26–32.
[43] O. Jesorsky, K.J. Kirchberg, R.W. Frischholz, Robust face detection using the Hausdorff distance, in: Proceedings of Audio- and Video-Based Biometric Person Authentication, Springer, 2001, pp. 90–95.
[44] A. Villanueva, V. Ponz, L. Sesma-Sanchez, M. Ariz, S. Porta, R. Cabeza, Hybrid method based on topography for robust detection of iris center and eye corners, ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 9 (4) (2013) 25.
[45] G.B. Huang, M. Ramesh, T. Berg, E. Learned-Miller, Labeled faces in the wild: a database for studying face recognition in unconstrained environments, Technical Report 07–49, University of Massachusetts, Amherst, 2007.
[46] P. Viola, M.J. Jones, Robust real-time face detection, Int. J. Comput. Vis. 57 (2) (2004) 137–154.
[47] M. Mathias, R. Benenson, M. Pedersoli, L. Van Gool, Face detection without bells and whistles, in: Computer Vision – ECCV 2014, Springer, 2014, pp. 720–735.
[48] A. George, A. Routray, Fast and accurate algorithm for eye localisation for gaze tracking in low-resolution images, IET Comput. Vis. 10 (7) (2016) 660–669.
[49] P. Campadelli, R. Lanzarotti, G. Lipori, Precise eye and mouth localization, Int. J. Pattern Recognit. Artif. Intell. 23 (03) (2009) 359–377.
[50] Y. Ren, S. Wang, B. Hou, J. Ma, A novel eye localization method with rotation invariance, IEEE Trans. Image Process. 23 (1) (2014) 226–239.
[51] S. Chen, C. Liu, Eye detection using discriminatory Haar features and a new efficient SVM, Image Vis. Comput. 33 (2015) 68–77.
[52] Y. Fu, H. Yan, J. Li, R. Xiang, Robust facial features localization on rotation arbitrary multi-view face in complex background, J. Comput. (Taipei) 6 (2) (2011) 337–342.
[53] D. Yi, Z. Lei, S.Z. Li, A robust eye localization method for low quality face images, in: Proceedings of the International Joint Conference on Biometrics (IJCB), IEEE, 2011, pp. 1–6.
[54] Z. Qian, D. Xu, Automatic eye detection using intensity filtering and K-means clustering, Pattern Recognit. Lett. 31 (12) (2010) 1633–1640.
[55] M. Yang, Y. Liu, Z. You, The Euclidean embedding learning based on convolutional neural network for stereo matching, Neurocomputing 267 (2017) 195–200.
[56] J. Zhang, J. Yu, D. Tao, Local deep-feature alignment for unsupervised dimension reduction, IEEE Trans. Image Process. 27 (5) (2018) 2420–2432.
[57] S. Qian, H. Liu, C. Liu, S. Wu, H. San Wong, Adaptive activation functions in convolutional neural networks, Neurocomputing 272 (2018) 204–212.
Chao Gou received his B.S. degree from the University of Electronic Science and Technology of China, Chengdu, China, in 2012 and the Ph.D. degree from the University of Chinese Academy of Sciences (UCAS), Beijing, China, in 2017. From September 2015 to January 2017, he was supported by UCAS as a joint-supervision Ph.D. student at Rensselaer Polytechnic Institute, Troy, NY, USA. He is currently an Assistant Professor with the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include computer vision and machine learning.
Hui Zhang received the B.S. degree from Beijing Jiaotong University, Beijing, China, in 2015. She is currently a Ph.D. candidate at the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, as well as the University of Chinese Academy of Sciences. Her research interests include computer vision, pattern recognition, and intelligent transportation systems.
Kunfeng Wang received his Ph.D. in Control Theory and Control Engineering from the Graduate University of Chinese Academy of Sciences, Beijing, China, in 2008. He joined the Institute of Automation, Chinese Academy of Sciences, and became an Associate Professor at the State Key Laboratory of Management and Control for Complex Systems. From December 2015 to January 2017, he was a Visiting Scholar at the School of Interactive Computing, Georgia Institute of Technology, Atlanta, GA, USA. His research interests include intelligent transportation systems, intelligent vision computing, and machine learning.
Fei-Yue Wang received the Ph.D. degree in computer and systems engineering from Rensselaer Polytechnic Institute, Troy, NY, USA, in 1990. He was an Editor-in-Chief of the International Journal of Intelligent Control and Systems, the World Scientific Series in Intelligent Control and Intelligent Automation, IEEE Intelligent Systems, and the IEEE Transactions on Intelligent Transportation Systems. He has served as chair for over 20 IEEE, ACM, INFORMS, and ASME conferences. His current research interests include social computing, complex systems, and intelligent computing and control. He is currently the Director of the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, the Vice President of the ACM China Council, and the Vice President/Secretary-General of the Chinese Association of Automation. He is a member of Sigma Xi and an Elected Fellow of IEEE, INCOSE, IFAC, ASME, and AAAS.
Qiang Ji received his Ph.D. degree in Electrical Engineering from the University of Washington. He is currently a professor with the Department of Electrical, Computer, and Systems Engineering at Rensselaer Polytechnic Institute (RPI). He recently served as director of the Intelligent Systems Laboratory (ISL) at RPI. Prof. Ji's research interests are in computer vision, probabilistic graphical models, information fusion, and their applications in various fields. Prof. Ji is an editor of several related IEEE and international journals and he has served as a general chair, program chair, technical area chair, and program committee member in numerous international conferences/workshops. Prof. Ji is a fellow of IEEE and IAPR.