Face Verification and Face Image Synthesis
under Illumination Changes
using Neural Networks
by
Tamar Elazari Volcani
Under the supervision of
Prof. Daphna Weinshall
School of Computer Science and Engineering
The Hebrew University of Jerusalem
Israel
Submitted in partial fulfillment of the
requirements of the degree of
Master of Science
December, 2017
Abstract
Following the success neural networks brought to the field of face recognition, we
examine two further issues regarding changes in illumination.
We first examine the possibility of training a face verification algorithm, based on neural
networks, to overcome illumination changes. The common practice in face verification
has been to search for hand-crafted optimal features, such that verification can be performed
by a simple computation, like an inner product. In this work we focus on training a
neural network that removes the need to predefine a feature space.
As an extension of face verification, we explore a more challenging task - the generation of
new face images, rather than only the verification of existing ones. We propose several novel algorithms,
based on neural networks, trained to generate face images having different illuminations.
Face Verification and Face Image Synthesis under Illumination Changes
using Neural Networks
Tamar Elazari Volcani
Advisor: Prof. Daphna Weinshall
The Rachel and Selim Benin School of Computer Science and Engineering
The Hebrew University of Jerusalem
Israel
Submitted in partial fulfillment of the requirements for the degree of Master of Science
Tevet 5778
Abstract (translated from the Hebrew)
Following the success that neural networks brought to the field of face recognition, we examine
two further topics concerning illumination changes in images.
First, we examine whether a face verification algorithm based on neural networks can be trained
to overcome illumination changes. The prevailing approach in the field of face verification advocates
finding an optimal feature space, on the basis of which decisions can be made by a simple
computation, such as an inner product between vectors. In this work we focus on training a
decision maker that overcomes the need to find such a feature space.
We then investigate a more challenging task: whether new face images can be generated, and not
only verified. We define several neural-network-based algorithms, trained to synthesize an image
of a given person under new illumination conditions.
Contents
1 Introduction
2 Tools
   2.1 The YaleB Dataset
   2.2 Code Packages
      2.2.1 Artificial Neural Network
      2.2.2 SVM
3 Illumination Invariant Face Verification
   3.1 Related Work
   3.2 Task Definition
   3.3 Data Preparations
      3.3.1 Training and Testing Samples Definition
      3.3.2 Balancing the Training Sample Set
   3.4 Face Verification Algorithms
      3.4.1 Feature Vector Construction
      3.4.2 Pair Recognition
   3.5 Results and Discussion
      3.5.1 Evaluation of Feature Space
      3.5.2 Evaluation of Pair Recognition
      3.5.3 Conclusions
4 Synthesis of Face Images Under New Illuminations
   4.1 Related Work
   4.2 Task Definition: New Illumination Synthesis
   4.3 Data Preparation: Training and Testing Samples
   4.4 Algorithms and Architectures
      4.4.1 Algo1: An End-to-End Approach
      4.4.2 Algo2: Introducing More Information in Training
      4.4.3 Algo3: Forcing the Logic
   4.5 Computational Validation
   4.6 Results and Discussion
      4.6.1 Conclusions
List of Figures
List of Tables
References
1 Introduction
Face recognition is useful in many applications, ranging from leisure uses like social-network
image tagging all the way to public security and border control. Therefore,
face recognition has been studied extensively. The earliest studies on automatic machine
recognition of faces were published in the 1970’s [Kelly, 1970] [Kanade, 1977] (see a review
in [Bhele and Mankar, 2012]).
Artificial Neural Networks (ANNs) enjoyed increased popularity in the last decade, fueled by
progress in Convolutional Neural Networks (CNNs), [LeCun et al., 1995], together with better
and faster computer hardware. This progress enabled far-reaching achievements in computer
vision, including face recognition. For example, the quality of face classification from a given
data-set using CNNs has surpassed human performance [Schroff et al., 2015] [Sun et al., 2015]
[Taigman et al., 2014].
A related face recognition task is face verification: the recognition of faces of people not
included in the training set.
In this thesis, we address the problem of single sample face verification. Single sample refers
to providing only a single reference image during test time of the algorithm. A scenario such
as this, where the learning algorithm aims to learn information about object categories from
one or only a few images without feedback, is also called One-shot learning. This scenario is
often encountered in real life: a new member joins a social network; a picture of a
criminal as yet unknown to the police is given; and so on. Moreover, it is not reasonable (today)
to keep and search a database of all the people in the world (7.5 × 10^9 ≈ 2^33), or even of a single
country (China and India each have populations over a billion, 10^9 ≈ 2^30). Not to mention training
on such data, as the most successful face recognition algorithms require training on hundreds
or thousands of images per individual. For example, in [Taigman et al., 2014] the classification
algorithm was trained on a data-set of over 4.4 million facial images belonging to more than
4,000 identities, an average of more than a thousand different images per individual.
Our approach to the task of single sample face verification is to train a decision maker: an
ANN that receives two input images, performs computation on both of them, and outputs
a binary decision, "the same person" or "different persons". To simplify our task, we focus on
face images having a uniform resolution, such that the face is at the center of the images. We
also narrow the image differences to be of illumination changes, as in the Extended Yale Face
Database B (YaleB). We train multiple ANNs, which share the same architecture, to recognize
whether two face images in different preset illuminations are of the same individual.
In the second part of the thesis, we address a more complicated prediction task, by asking
"what would a new face image look like?". We show how to synthesize a new face image for a
given individual, under different illumination conditions, in the context of the YaleB data-set.
It is possible that ANN algorithms already implicitly predict images showing faces under
new illuminations. Consider the face verification experiment described above: We would train
a different network for every pair of illuminations. It is possible that given the first image, the
Figure 1: The Extended Yale Face Database B. Example images of 3 individuals under 3 illumination conditions.
network predicts what the image of that individual would look like under the second illumina-
tion, and then compares it with the second input image to determine whether they depict the
same individual. Of course, this kind of process may not occur in image space. Finding the
right space would be one of the challenges we face.
Another challenging aspect of the synthesis task is combining regression networks into the
process, and not just classification ones. This raises the issue of nearly-correct results. To
address this problem, we propose an ANN-based validation system for the resulting images.
2 Tools
In this section, we describe the face data-set used in our experiments and define new concepts
regarding it. We also describe the code packages used.
2.1 The YaleB Dataset
We will use the Extended Yale Face Database B (YaleB) [Georghiades et al., 2001], which is
available to download at:
http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html.
The data-set contains 32×32 pixel gray-scale, centered, near-frontal face images, taken under
different illumination conditions. The images were taken in lab conditions, where the person
sits still and there is only a single light source in the room. To create the different illuminations
the light source location is changed. The set of locations defines the set of illuminations.
Each individual appears in at most one image per illumination condition. Namely, the number
of images of an individual equals the number of illuminations in which they appear. We selected 31
individuals who had (at least) 64 images (different illuminations) each, so that each illumination
is shown in the same number of images.
Super Illuminations
In some cases, we thought it would be useful to have more than one sample image of the same
illumination and individual. We also noticed that some of the illuminations look rather similar,
which motivated the idea of grouping them together.
This is done in the following way: We represent each illumination by the average of all
images taken under that illumination (31 images, one per individual). Then, we create
15 illumination clusters by applying the k-means clustering algorithm (Matlab), using Euclidean
distances. Denote the 15 illumination clusters by C1, ..., C15 and their corresponding centroids
by c1, ..., c15. Each cluster of illuminations is considered one super-illumination.
Still, some illuminations located in different clusters may resemble one another.
We wish our set of illuminations to consist of those which differ most significantly from each
other. Therefore we go further and choose a representative illumination for each cluster:
the illumination which has the greatest average distance from the centroids of all the other
clusters (Eq. 1). Fig. 2 illustrates this process. Fig. 3 shows the resulting representative
illuminations.
K_i = argmax_{k_ij ∈ C_i} mean_{l ≠ i} ( ||k_ij − c_l||_2 )    (1)
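The thesis used Matlab's k-means; the clustering-and-selection procedure can be sketched in Python/NumPy as follows. This is a minimal illustrative re-implementation, not the original code; the function name, iteration count and seeding are our own choices:

```python
import numpy as np

def pick_representatives(illum_means, n_clusters=15, n_iter=50, seed=0):
    """Cluster per-illumination mean images with a plain k-means, then pick
    for each cluster the member whose mean distance to the centroids of all
    OTHER clusters is largest (Eq. 1)."""
    X = np.asarray(illum_means, dtype=float)
    rng = np.random.default_rng(seed)
    # initialise centroids on randomly chosen distinct samples
    centroids = X[rng.choice(len(X), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = X[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    reps = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        if members.size == 0:          # degenerate empty cluster
            reps.append(-1)
            continue
        others = np.delete(np.arange(n_clusters), c)
        # mean distance of each member illumination to all other centroids
        score = np.linalg.norm(
            X[members][:, None, :] - centroids[others][None, :, :], axis=2
        ).mean(axis=1)
        reps.append(int(members[score.argmax()]))
    return labels, reps
```

In the thesis setting, `illum_means` would hold the 64 averaged illumination images (each flattened to a 1024-vector) and `n_clusters=15`.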
Figure 2: An illustration of the representative illumination's relationship with the illumination clusters.
Figure 3: Representative illumination images. Every row shows the images of all individuals under one representative illumination. Every column shows images of the same person under all representative illuminations.
2.2 Code Packages
2.2.1 Artificial Neural Network
The implementation of Artificial Neural Networks (ANN) we used is based on EasyConvNet,
a Convolutional Neural Network (CNN) Matlab code package by Shai Shalev-Shwartz, which
can be found on https://github.com/shaisha/EasyConvNet, [Shalev-Shwartz, 2014].
2.2.2 SVM
We used the Support Vector Machine (SVM) algorithm on several occasions, at times using
Matlab's implementation, http://www.mathworks.com/help/stats/svmtrain.html, and at
times LIBSVM [Chang and Lin, 2011].
Figure 4: From [Taigman et al., 2014]. An outline of the DeepFace face classifier architecture. A front-end of a single convolution-pooling-convolution filtering on the rectified input, followed by three locally-connected layers and two fully-connected layers. Colors illustrate feature maps produced at each layer. The net includes more than 120 million parameters, where more than 95% come from the local and fully connected layers.
3 Illumination Invariant Face Verification
In this section, we describe the experiment of training a decision maker ANN algorithm for
single sample face verification under change of illumination. First, we formally define the task
at hand. Then we address in detail the problem of an unbalanced training set and present
two data augmentation solutions. We apply the experiment's framework to two types of
feature vectors, a high-dimensional and a low-dimensional one, and compare the behaviors of the
corresponding results, discovering a surprising outcome.
3.1 Related Work
The study that best emphasizes the success neural networks brought to the field of face
recognition and face verification is [Taigman et al., 2014]. Employing a deep CNN in their algorithm,
and having the largest labeled face training data-set available at the time, they achieved the
best results to date and even managed to surpass human performance on face verification
(Fig. 5).
Their method is first to train a very good face classifier, and then to use the one-before-last
blob (of length 4,096), which serves as the feature vector for classification, also as the feature
vector for face verification. Given such feature vectors for two images, they simply use the inner
product between the two normalized inputs as a similarity measure.
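This similarity measure can be sketched as follows (illustrative only; the feature vectors and the threshold value are hypothetical, not taken from [Taigman et al., 2014]):

```python
import numpy as np

def verify(f1, f2, threshold=0.5):
    """Cosine similarity (normalized inner product) between two feature
    vectors, thresholded into a same/different-person decision."""
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    sim = float(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2)))
    return sim, sim >= threshold
```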
The face classifier consists of two stages: 1) 3D frontalization and alignment, using analytical
3D modeling of the face based on fiducial points, which is used to warp a detected facial
crop to a 3D frontal mode (frontalization) [Berg and Belhumeur, 2012]; and 2) a deep CNN
(or DNN) whose architecture is shown in Fig. 4. The training data-set for this process is
the SFC data-set, a collection of photos from Facebook of 4.4 million labeled faces from 4,030
people, each with 800 to 1200 faces. This data-set is unfortunately not accessible to the public,
as is the resulting algorithm (at least at the time our experiments were conducted).
As opposed to the approach taken in that work, of finding the "perfect" feature space, such that
Figure 5: From [Taigman et al., 2014]. ROC curve on the LFW data-set, surpassing human performance.
Figure 6: Market-1501 data-set for Re-ID [Zheng et al., 2015]. Images in the first two rows are grouped by individual. Images in the third row are "noise".
all there is left to do is to calculate a simple inner product, we seek a decision maker for face
verification that overcomes the flaws of the representation space of the inputs. And since
our data-set of choice is the YaleB data-set, we need to apply neither frontalization nor
alignment.
The field of Re-Identification
Another field dealing with a related task is the field of Re-Identification, or Re-ID for short.
Similarly to face verification, the task here is to decide whether two input images are of the
same individual or not, except that in Re-ID the input images are of pedestrians, and often of low
resolution, as would be the case if the images came from security cameras.
The nature of the samples in this field makes it a harder problem. The variety of differences
between two images can be much greater than in face verification. In addition to illumination
and orientation, Re-ID deals with pose (e.g. front vs. back, standing vs. riding a bicycle),
and the low resolution makes it nearly impossible to work with facial features, that is, if they are
visible at all. Moreover, Re-ID algorithms suffer from the blinding effect of an outstanding
feature, such as the length and color of hair, gender, and, since most data-sets contain images from the
same day for a single subject, the color of clothes. These features are shared by all appearances
of the same individual, but not uniquely.
The difficulty of this task is reflected in the fact that success rates here are
much lower than in face verification. For example, rank-1 accuracy on the Market-1501 data-
set [Zheng et al., 2015] (Fig. 6) in recent papers is 66% by [Varior et al., 2016], 48% by
[Liu et al., 2016], and 37% by [Wu et al., 2016], compared with the near-human accuracy of
[Taigman et al., 2014] in face verification. But much like [Taigman et al., 2014], these studies
also concentrate on the representation of the input samples, which we will avoid.
3.2 Task Definition
So how do you recognize a face you do not know?
We begin with two main definitions. Recognition, in this context, is the ability to determine
whether two different face images are of the same person or not. And as this is a machine
learning algorithm, an individual is considered unseen, or unknown to the algorithm, if no
image of them is in the training set.
We define the task formally:
Let our set of individuals be P, and the set of possible illuminations be K. And let there
be some representation space; it could be the images themselves or some otherwise chosen (or trained)
function of images, but for simplicity we call it the image space. We denote the image of
individual p ∈ P taken under illumination k ∈ K by im_k(p).
We fix two illuminations: a source illumination s ∈ K and a target illumination t ∈ K. Then,
for two input images, a base image im_s(p) and a query image im_t(p′), where p, p′ ∈ P, the
task is to determine whether p = p′.
In the language of machine learning, our sample set is the set of all ordered pairs of images
taken with the preset illuminations, X = {(im_s(p), im_t(p′)) : p, p′ ∈ P}, and our label set is
Y = {±1}, where the true label for an instance (im_s(p), im_t(p′)) is 1 if and only if p = p′.
As described before, we have 31 different individuals (|P | = 31), each having 64 images in
(slightly) different illuminations, grouped in 15 clusters of significantly different illuminations.
To ensure that the source and target illuminations, s and t, differ not just by index but
by appearance, we limit their choice set K to the 15 representative illuminations (Sec. 2.1)
of the illumination clusters.
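The sample-set construction above can be sketched as follows (illustrative; `ims_s` and `ims_t` are hypothetical mappings from a person id to their image under the fixed illuminations s and t):

```python
from itertools import product

def build_pairs(people, ims_s, ims_t):
    """Ordered-pair sample set for fixed source/target illuminations:
    X = {(im_s(p), im_t(p')) : p, p' in people}, with label +1 iff p == p'."""
    X, y = [], []
    for p, q in product(people, people):
        X.append((ims_s[p], ims_t[q]))
        y.append(1 if p == q else -1)
    return X, y
```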
3.3 Data Preparations
3.3.1 Training and Testing Samples Definition
We wish to test a one-shot learning scenario. Thus a simple partition of the sample set into
training and testing sample sets is not enough, since the individual appearing in the query
image might also have been seen during training and would therefore be known to the algorithm.
We propose instead to first partition the individuals, and then derive the training and testing
sample sets.
We randomly partition the 31 individuals of the YaleB data-set into two disjoint sets, one
for testing and one for training. We assign 20 for training and the remaining 11 for testing,
and maintain the same partition through every choice of source and target illuminations. Thus
for the test inputs, base or query, not only were none of the given images seen at training time,
but no other image of the individuals p and p′ was seen either.
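A minimal sketch of this identity-level split (the function name and seed handling are our own, not the thesis code):

```python
import random

def split_individuals(ids, n_train=20, seed=0):
    """Randomly partition individual ids (not images) into disjoint
    train/test sets, so every test identity is entirely unseen in training."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    return set(ids[:n_train]), set(ids[n_train:])
```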
3.3.2 Balancing the Training Sample Set
Let there be some set A of N different items. Then the set of all ordered pairs of items from
that set, A × A, has exactly N pairs that are of the same object, which is N/N^2 = 1/N of all pairs.
Hence, a function that labels all pairs as being of different objects, i.e. the constant −1 labeling, will
be correct on 1 − 1/N of the samples. In our case, the different objects are the different individuals,
and N = 20. Therefore the constant negative function achieves an accuracy of 95%, even though
it has not learned to distinguish matching from non-matching pairs at all.
To avoid this pitfall, and since we train under the assumption of a uniform distribution of
the data, we need to modify our training set in a way that emphasizes the importance of positive
instances. Popular ways to modify the training set in such cases of unbalanced classes can be
divided into two main categories: over-sampling, adding instances to the under-represented
class; and under-sampling, removing instances from the over-represented class. Due to the
already small size of the data set, under-sampling would leave us with too small a training
set to train on. We therefore choose two forms of over-sampling, train under each of them and
under no augmentation at all separately, and compare the results:
Duplication: We duplicate positive instances multiple times, so that positive instances
are as numerous as negative ones.
Noisy Augmentation: We recall that both illuminations s and t are representative
illuminations. Thus we can augment positive samples with similar-looking
images. Meaning that for every positive instance (im_s(p), im_t(p)), we add to the
training set all pairs of the form (im_u(p), im_v(p)), where u is an illumination from
s's illumination cluster and v is from t's. Note that in this case the classes
of same and different pairs of images are not perfectly equally represented, as
illumination clusters are not equal in size. Nevertheless the ratio is much closer
than in the original state, and a count shows that for the YaleB data-set, for
any s and t, it is no worse than a 2 : 3 ratio.
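The duplication scheme can be sketched as follows (an illustrative sketch, not the thesis code). With N identities there are N positives and N·(N−1) negatives, so each positive is repeated N−1 times:

```python
def duplicate_balance(samples, labels):
    """Over-sample by duplicating positive pairs until the two classes are
    (roughly) equal in size; labels are in {+1, -1}."""
    pos = [(s, l) for s, l in zip(samples, labels) if l == 1]
    neg = [(s, l) for s, l in zip(samples, labels) if l == -1]
    factor = max(1, len(neg) // len(pos))   # N-1 in the balanced case
    mixed = neg + pos * factor
    return [s for s, _ in mixed], [l for _, l in mixed]
```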
As the test sample is also unbalanced, we measure success by accuracy as well as by additional
measures that are more often used for unbalanced data (see the results section).
Figure 7: Feature vector construction: CNN architecture.
3.4 Face Verification Algorithms
We construct a two-stage algorithm: the first stage is feature vector construction for a single image,
using a classification CNN, and the second is a decision-making CNN for two such feature vectors.
3.4.1 Feature Vector Construction
First of all, it is worth noting that our attempts to train a single neural network that takes raw
images and outputs a decision did not converge, perhaps because the task is too difficult given the
small number of instances in the data set. Moreover, that approach ignores additional information
we hold, namely the identity of the individuals in the training set.
These reasons motivated us both to look for a better-suited representation space, and to use a
classification CNN to do so. The feature-constructing network's architecture is described in
Fig. 7.
We use all images of the training individuals group, 20 individuals over 64 illuminations, as
training samples. For the loss function we use the multi-class logistic loss (Eq. 2), as
proposed by [Shalev-Shwartz, 2014],
l(o, Y) = Σ_{i : Y_i = 1} −log( e^{o_i} / Σ_j e^{o_j} ) = Σ_{i : Y_i = 1} log( Σ_j e^{o_j − o_i} )    (2)

where o ∈ R^k is the prediction and Y ∈ {±1}^k is the label.
We will compare the results of using two types of feature vectors extracted from the trained
network: the low-dimensional score (labels) vector, and the one-before-last blob (OBL for short),
of high dimension 1024. The use of the one-before-last blob is the more obvious choice, as it
is in fact the feature vector used for labeling in the classification task itself, and should be more
informative in that sense. The use of the score vector is motivated by the intuition that we,
humans, often identify a stranger by their resemblance to other familiar people ("Alice looks a
lot like Bob, but also like Charlie").
3.4.2 Pair Recognition
For the second stage we train multiple networks, one for every choice of feature vector and
pair of source and target illuminations chosen from the set of representative illuminations.
All networks share the same architecture, described in Fig. 8.
Figure 8: Pair recognition: ANN architecture.
For every training session we use the (feature vectors of) image pairs such that the
individuals depicted in them are from the training individuals group, and the illuminations
are in a predefined order. For the loss function we use the binary logistic loss (Eq. 3) from
[Shalev-Shwartz, 2014],
l(o, y) = log(1 + e^{−y·o})    (3)

where o ∈ R is the prediction and y ∈ {±1} is the label.
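Eq. 3 transcribed in an overflow-safe form (illustrative sketch):

```python
import math

def binary_logistic_loss(o, y):
    """Eq. 3: log(1 + exp(-y*o)), rewritten as
    max(z, 0) + log(1 + exp(-|z|)) with z = -y*o to avoid overflow."""
    z = -y * o
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```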
3.5 Results and Discussion
3.5.1 Evaluation of Feature Space
The classification network was trained to classify the 20 individuals of the training set with
100% accuracy (all training and no testing). As the length of the score vector equals the
number of individuals (20), and we trained the network to perform under the max(·)
operation, there is no point in measuring the classification accuracy over the test sample. To assess
the expressiveness of the resulting feature spaces (score, OBL) on face images of individuals
that the network was not trained on, we check the separability of the test sample images. We
use multi-class linear SVM to do so. For each feature space, we repeat the experiment with
three different types of labeling: face labels, illumination labels, and super illumination labels.
It is evident from the results (Table 1) that the test samples are easily (linearly) separable in
both feature spaces, and also that the super-illumination division still applies.
One might wonder: if the SVM separates the different test sample identities so well, why
bother training a neural network? But the SVM is not a viable alternative, since its results
reflect training on the test individuals themselves, whereas we are looking for a method that will
operate well on a new individual (on which we cannot train at all).
Feature vector type | ID | 64 original illuminations | 15 super illuminations
Score | 99.9% | 98% | 96.4%
OBL | 99.7% | 100% | 100%

Table 1: Accuracy values for separability of the test sample in different feature spaces using multi-class SVM.
Note that while these characteristics show that both feature spaces are relevant for our
experiments, they are somewhat surprising. Since we trained the network to classify identities
only, and not illuminations, we would expect the outcome to be illumination invariant, which it
clearly is not, both for the score vectors and the OBL vectors.
3.5.2 Evaluation of Pair Recognition
As mentioned earlier, for every choice of source and target illumination, we (train and) test the
networks separately. The results in Table 2 are the average results over all these repetitions.
feature space | data augmentation | training accuracy | test TPR | test TNR | test F1 score | test EER
score | none | 100% | 28.4% | 98.2% | 34.8% | 21.2%
score | duplication | 100% | 37.3% | 96.4% | 40% | 21.1%
score | noisy | 100% | 58.8% | 87.2% | 41% | 23.6%
OBL | none | 99.7% | 25.8% | 95.6% | 28.2% | 26.7%
OBL | duplication | (did not converge) | - | - | - | -
OBL | noisy | 97.4% | 32.7% | 78.1% | 17.1% | 37.9%

Table 2: Experiment results by type of feature space and data augmentation. TPR = true positive rate, TNR = true negative rate, EER = equal error rate. The EER measure is calculated with respect to the continuous value, prior to the sign function.
First of all, we note that in all the different experiments the networks indeed learned to
tell positive and negative samples apart, even if not to a full degree. Recall that a constant
negative function would have a 95% training accuracy, and observe that in all cases our training
accuracy values are higher, and in some experiments even reach 100%. Moreover, the test
TPR (= TP/(TP+FN)) is strictly positive, which means TP is strictly positive; namely, positive
samples were labeled correctly too. Still, the difference between the high TNR values and the
low TPR values teaches us that the relatively more common mistake is to label a positive sample
as negative (FN vs. TP), rather than the other way around (FP vs. TN).
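The measures used in Table 2 can be computed as follows (an illustrative sketch; EER is omitted here since it requires the continuous scores rather than the ±1 predictions):

```python
def rates(y_true, y_pred):
    """TPR = TP/(TP+FN), TNR = TN/(TN+FP), F1 = 2TP/(2TP+FP+FN),
    for true labels and predictions in {+1, -1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == -1 for t, p in zip(y_true, y_pred))
    tn = sum(t == -1 and p == -1 for t, p in zip(y_true, y_pred))
    fp = sum(t == -1 and p == 1 for t, p in zip(y_true, y_pred))
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return tpr, tnr, f1
```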
Second, and perhaps a more obvious observation from the table, is the wide gap in all
cases between the training accuracy values, which are 100% or nearly so, and the test
measurement values, which are far lower. This tells us that our models fit the training
sample much better than the testing one; in other words, we have an overfitting situation. A possible and
common explanation for it is the mismatch between the number of parameters in the
algorithm and the number of training samples. Recall we trained for every pair of illuminations
separately, which effectively means we have multiplied the number of networks, and hence
of parameters, by the number of pairs of illuminations, making the number of parameters
orders of magnitude larger than the number of training samples. As we continue to analyze the
results, we need to keep this overfitting effect in mind.
Comparing the different experiments (Table 2), we see that the F1 score (a measure of a
test's accuracy that can be interpreted as a weighted average of precision and recall) results
are altogether higher when using the score vectors as feature vectors than when using the OBL
vectors. Possibly the low dimension of the score vector (20) acts as a form of regularization,
and therefore the OBL vectors are more overfitted to the training sample.
As for data augmentation, looking at the test TPR it is clear that data augmentation
indeed helped emphasize the importance of positive labels: for every input type, training
the network with augmented positive data resulted in better TPR, as expected. Moreover, the
noisy augmentation, which adds more information on top of balancing the data, performed
better than plain duplication. The combination of these two beneficial conditions, a
feature vector of low dimension and data augmentation with extra information, led to the best
F1 score and TPR, and nearly the lowest EER.
method | data augmentation | training accuracy | test TPR | test TNR | test F1 score | test EER
our ANN | duplication | 100% | 37.3% | 96.4% | 40% | 21.1%
our ANN | noisy | 100% | 58.8% | 87.2% | 41% | 23.6%
RBF SVM | duplication | 85.3% | 78.3% | 71.1% | 37.1% | 24.1%
RBF SVM | noisy | 82% | 80.1% | 66.4% | 34.2% | 24.4%

Table 3: Comparing the results of our ANN and an RBF SVM for the score vector feature space and different data augmentation types.
In Table 3, we compare the results under the two best experimental frameworks, score vector
plus noisy data augmentation and score vector plus duplicated positive data, when applying an
off-the-shelf non-ANN ML method: a Support Vector Machine (SVM) with a Radial Basis Function
(RBF) kernel. The inputs are the same as for the ANN, concatenated pairs of feature vectors,
and the labels are in {±1}. The training accuracy alone tells us that the task is too complex
for the SVM to express. And while the SVM seems to fit the testing sample similarly to
the training one (training accuracy vs. test TPR), the more inclusive F1 score
shows that the ANN algorithm still achieves better results for both data augmentation types,
reflecting that the decrease in test TNR is more significant than the increase in test TPR for the
SVM experiments.
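The RBF kernel underlying this baseline can be sketched as follows (illustrative; the `gamma` value is a hypothetical default, and the rows of the inputs would be the concatenated feature-vector pairs described above):

```python
import numpy as np

def rbf_gram(A, B, gamma=0.1):
    """RBF kernel matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2) between
    the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)
```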
3.5.3 Conclusions
The most interesting outcome of the results is the feature space behavior. First, it is clear that
both feature spaces are not illumination invariant, in spite of the fact that they were extracted
from an identity-only classification network. And second, in other studies where the feature space
is defined by a classification network, the feature vectors are never (to our knowledge) defined
by the score vector itself, but rather by the output of one of the hidden layers, much like in the
previously mentioned [Taigman et al., 2014]. Reasons for that include the fact that the score
vector's length is determined by the number of classes and is not flexible like the dimension of
a hidden-layer feature vector for that very same labeling. Moreover, one might think it would be
invariant to characteristics other than the ones it is meant to classify, which, evidently, it is not.
The surprising success of the label vector as a feature vector in our experiments, as well as the lack
of other studies using it, raises some questions: Is the success coincidental? Can it be repeated
with other data-sets? Is it scalable, or does it work well only due to its low dimension? Can
it be applied to class-based classification (like types of cars, or animals), or only to object (ID)
based classification? What other characteristics are preserved?
4 Synthesis of Face Images Under New Illuminations
In this section, we present our work on ANN algorithms for synthesis of face images showing
new illumination conditions. The common approach to this task relies on a human-engineered
model that is invariant to the property in question (here, illumination). Our goal is to fully
exploit the power of ANNs on this task, and to operate without such a model. This opens the
door to endless types of ANN architectures. We characterize the logical structure of such a
synthesis algorithm and propose a number of architectures.
Another challenging aspect of the experiment is the computational evaluation of the synthesized
images. We propose an ANN classifier, trained to evaluate key properties of the resulting
images.
4.1 Related Work
In [Riklin-Raviv and Shashua, 1999] a synthesis task very similar to ours is approached. They
too attempt to synthesize a new face image under given different illumination conditions.
Building on earlier results by the authors on the low dimensionality of the image space under
varying lighting conditions, originally reported in [Shashua, 1992], [Shashua, 1997] for the case
of Lambertian objects (which the natural human face approximately is), the paper takes the
strategy of a well-defined illumination-invariant signature image, called the "quotient" image.
Given two objects (a, b), the quotient image Q is defined by the ratio of their albedo (surface
texture) functions.
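The core of the idea can be illustrated in a few lines: for two aligned images taken under the same illumination, the pixel-wise ratio approximates Q, and multiplying Q by an image of the second object under a new illumination re-renders the first object under that illumination. The following is only a toy sketch under the Lambertian and ideal-class assumptions, not the paper's actual implementation; all arrays are synthetic stand-ins for real, aligned face images.

```python
import numpy as np

def quotient_image(img_y, img_a, eps=1e-8):
    """Pixel-wise ratio of two aligned images of objects y and a taken under
    the SAME illumination; under the Lambertian and ideal-class assumptions
    this approximates the albedo ratio Q."""
    return img_y / (img_a + eps)

def relight(Q, img_a_new_light):
    """Synthesize y under a's new illumination: image = Q * (image of a)."""
    return Q * img_a_new_light

# Toy data: two albedos and two Lambertian shadings shared by both objects.
rng = np.random.default_rng(0)
albedo_a = rng.uniform(0.2, 1.0, (8, 8))
albedo_y = rng.uniform(0.2, 1.0, (8, 8))
shading1 = rng.uniform(0.5, 1.0, (8, 8))   # source illumination
shading2 = rng.uniform(0.5, 1.0, (8, 8))   # target illumination

Q = quotient_image(albedo_y * shading1, albedo_a * shading1)
synth = relight(Q, albedo_a * shading2)     # should match albedo_y * shading2
err = np.abs(synth - albedo_y * shading2).max()
```

In this idealized setting the reconstruction error is negligible; on real faces the method's quality depends on how well the ideal-class and alignment assumptions hold.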
Moreover, the results in the paper rely on an ideal class assumption: an ideal class is a collection
of 3D objects that have the same shape but differ in the surface albedo function. While the
human-face class is definitely not an ideal class, the researchers treat it as one, claiming that
performing pixel-wise dense correspondence between images (e.g., limiting to frontal images)
satisfies the ideal class conditions. In the paper, results are demonstrated on the high quality
database prepared by Thomas Vetter and his associates [Vetter et al., 1997] [Vetter and Poggio, 1996].
In our experiments, we specifically avoid processing the images in that way, to better simulate
natural conditions.
One of our goals is to see the power of ANNs. [Riklin-Raviv and Shashua, 1999] shows success
on our task prior to the age of neural networks. A more recent paper is [Kulkarni et al., 2015].
This work, much like [Riklin-Raviv and Shashua, 1999], is focused on defining an invariant
representation, as a middle step, for the problem of face image synthesis under new conditions,
except that they do it using ANNs. The study aims to learn an interpretable representation of
images, disentangled with respect to three-dimensional scene structure and viewing: a graphics
code for complex transformations such as out-of-plane rotations, lighting variations, pose, and
shape.
Inspired by the various work done in the field of representation learning
[Bengio et al., 2013] [Cohen and Welling, 2014] [Goodfellow et al., 2009], this work proposes
a representation that upholds the following principles: invariance, interpretability,
abstraction, and disentanglement. A disentangled representation is one for which changes in
the encoded data are sparse over real-world transformations.

Figure 9: Results of the Quotient Image algorithm. Top: original images under 3 distinct lighting conditions; bottom: the synthesized images using an N = 10 bootstrap set.

Figure 10: From [Kulkarni et al., 2015]: Structure of the representation vector. φ is the azimuth of the face, α is the elevation of the face with respect to the camera, and φL is the azimuth of the light source.
The graphics code referred to is strictly defined for every property (Fig. 10), simulating a
parametric representation of the represented items: a manually designed feature vector. In
our work we strictly avoid that direction in particular, as we avoid any intermediate invariant
representation.
They propose the Deep Convolutional Inverse Graphics Network (DC-IGN) model, which, as
shown in Fig. 11, consists of two parts: an encoder network which captures a distribution over
graphics codes Z given data x, and a decoder network which learns a conditional distribution
to produce an approximation of x given Z.
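The encode, edit the relevant code entry, decode idea can be sketched schematically as follows. This is only an illustration of the interface, not the DC-IGN model itself: random linear maps stand in for the trained encoder and decoder, and the choice of entry 0 as the "lighting" component is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
D, Z = 64, 6          # image dimension, graphics-code dimension

# Random linear maps standing in for a trained encoder/decoder pair.
W_enc = rng.normal(size=(Z, D)) / np.sqrt(D)
W_dec = np.linalg.pinv(W_enc)   # decoder approximately inverts the encoder

def encode(x):
    return W_enc @ x

def decode(z):
    return W_dec @ z

x = rng.normal(size=D)
z = encode(x)
z_new = z.copy()
z_new[0] = 2.0 * z[0] + 1.0     # edit ONLY the "lighting" entry of the code

x_new = decode(z_new)
# Every other component of the code is untouched by the edit.
unchanged = np.allclose(encode(x_new)[1:], z[1:], atol=1e-6)
```

The point of the disentangled code is exactly this interface: changing one entry of Z changes one scene property of the decoded image, leaving the rest intact.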
One more major difference between this work and ours (as from
[Riklin-Raviv and Shashua, 1999]), is the data-set nature. Here they use synthetic data-
set, and we are set to address real world images. The data-set is of faces generated from a 3D
face model obtained from [Paysan et al., 2009], consisting of faces with random variations on
face identity variables (shape/texture), pose, or lighting.
A few results examples are brought in Fig. 12. Despite the impressive comparison with
Figure 11: Algorithm of [Kulkarni et al., 2015]: the Deep Convolutional Inverse Graphics Network (DC-IGN) has an encoder and a decoder.
“normally-trained network”, the results are not perfect; and although the transformation of
pose is very convincing, it is not entirely clear that the identity is preserved in the synthesized
images.
4.2 Task Definition: New Illumination Synthesis
We wish, then, to synthesize a face image of a given individual under new illumination conditions.
We "give" the identity by providing an image of that individual, in different illumination
conditions of course. As in the previous chapter, let our set of individuals be P, and the set of
possible illuminations be K. We denote the image of face p ∈ P under illumination k ∈ K
by imk(p). We choose and fix a target illumination k∗ ∈ K, and our task is therefore the
following:
Input: A face image of individual p ∈ P in some source illumination k ∈ K, imk(p), where k ≠ k∗. This is the query image.
Output: A face image of the same individual p in the target illumination k∗, imk∗(p). The target image will be (the real) imk∗(p).
4.3 Data Preparation: Training and Testing Samples
To focus on the task of synthesis, we want to eliminate any unnecessary challenges; namely,
the differences between the training and testing sets are kept minimal. Therefore, unlike in
the previous experiments, we allow the individuals in the test sample to appear in the training
sample, but with different images, so that the samples do not overlap. Also, since we preset
the target illumination, it has to appear in both the test and training samples. For simplicity,
and to ensure none of the training images resembles the testing ones too closely, we use the 15
representative illuminations only. Under this framework, we specifically define our samples as
follows:
Figure 12: From [Kulkarni et al., 2015]: Entangled versus disentangled representations. First column: original images. Second column: transformed image using DC-IGN. Third column: transformed image using normally-trained network.
Figure 13: The unseen-sample images.
We set aside a set of images that we will attempt to create at test time; thus they are all in
the target illumination k∗. These images will never be seen during training, and therefore we
call this set the unseen-sample. To still allow training with the target illumination, we limit
the unseen-sample to contain images of only 3 individuals, denoted p1, p2, p3. Then, the
unseen-sample is as in Eq. 4:

{ imk∗(p) : p ∈ {p1, p2, p3} }    (4)

Fig. 13 shows the unseen-sample images. Overall, the unseen-sample contains exactly 3
images (3 faces in 1 illumination). All other images can be seen during training.
We define the query-test-sample to be the set of all query image - target image pairs whose
target images are in the unseen-sample; thus there are only 3 individuals in this set, as defined
in Eq. 5. Specifically, we have 14 query images for every target image in the unseen-sample,
that is, one query for every valid source illumination. Note that we allow the query images to
be seen during training.

{ (imk(p), imk∗(p)) : k ≠ k∗, p ∈ {p1, p2, p3} }    (5)
Conversely, we define the query-training-sample to contain all the query image - target image
pairs whose target images are not in the unseen-sample, as in Eq. 6. This means that the query
samples of the two sets cover disjoint groups of individuals.

{ (imk(p), imk∗(p)) : k ≠ k∗, p ∈ P \ {p1, p2, p3} }    (6)
Some of the algorithms listed below use a separately trained stage for feature vector
construction. For this stage, we used all images that are not in the unseen-sample, including
the query images of the query-test-sample, as the training sample.
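The sample definitions in Eqs. 4-6 translate directly into set comprehensions. The following sketch uses (individual, illumination) index pairs as stand-ins for actual images; im is a hypothetical image loader, and the concrete index choices are illustrative only.

```python
# Sketch of the sample construction in Eqs. 4-6; indices stand in for images.
P = list(range(31))           # individuals
K = list(range(15))           # the 15 representative illuminations
k_star = 0                    # preset target illumination
unseen_ids = [1, 2, 3]        # p1, p2, p3

def im(p, k):
    """Hypothetical image loader; here just an (individual, illumination) tag."""
    return (p, k)

# Eq. 4: the unseen-sample, 3 images in the target illumination.
unseen_sample = {im(p, k_star) for p in unseen_ids}

# Eq. 5: query-test-sample, one query per valid source illumination per target.
query_test = {(im(p, k), im(p, k_star))
              for p in unseen_ids for k in K if k != k_star}

# Eq. 6: query-training-sample, all remaining individuals.
query_train = {(im(p, k), im(p, k_star))
               for p in P if p not in unseen_ids
               for k in K if k != k_star}

counts = (len(unseen_sample), len(query_test), len(query_train))
# 3 unseen images, 3 * 14 = 42 test pairs, (31 - 3) * 14 = 392 training pairs.
```

The counts match the text: 14 query images per unseen target, and 392 query-training pairs.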
4.4 Algorithms and Architectures
In designing our algorithms, we could not ignore the fact that many other studies (such as
[Riklin-Raviv and Shashua, 1999] and [Kulkarni et al., 2015], mentioned above) struggled to
find a feature space better suited to the task than image space. But since we wish to inspect
our results in image space, by looking at them, each algorithm needs to convert its results back,
from the space it works in, to images. In other words, it needs to cover three main logical stages:
1. Constructing the feature space imk(p)→ f(imk(p)).
2. Performing the illumination conversion f(imk(p))→ f(imk∗(p)).
3. Converting the result back to image space f(imk∗(p))→ imk∗(p).
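As a sketch, the three stages compose into a single synthesis function. Here f, convert, and f_inv are hypothetical placeholders operating on (individual, illumination) tags rather than real images:

```python
# The three logical stages as function composition (placeholder maps).
K_STAR = 0  # target illumination

def f(image):
    """Stage 1: image -> feature vector (placeholder: the (p, k) tag itself)."""
    return image

def convert(feature):
    """Stage 2: illumination conversion in feature space."""
    p, _ = feature
    return (p, K_STAR)

def f_inv(feature):
    """Stage 3: feature vector -> image space (placeholder: identity)."""
    return feature

def synthesize(query_image):
    return f_inv(convert(f(query_image)))

result = synthesize(("p7", 5))   # im_k(p) with p = "p7", k = 5
```

The algorithms below differ in how many ANNs cover these stages, and in where the stage boundaries are forced explicitly.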
However, it is not clear that we should force these stages in our algorithm design, as they
might be implicitly covered within the ANNs. That said, taking the road of a single end-to-end
ANN design makes the task harder, by demanding synthesis for yet-unseen individuals: since
the training and testing samples effectively become the training and testing query samples,
respectively, the end-to-end road means training and testing on disjoint groups of individuals.
We therefore define, and compare, three different ANN-based algorithm designs (Fig. 14):
4.4.1 Algo1: An End-to-End Approach
This algorithm consists of a single neural network (ANN1, Fig. 15), which performs the entire
task. Its structure imposes no constraints on the feature space or on the logical stages above.
The network receives an image and outputs an image, while encapsulating the feature space in
which the transformation takes place; therefore we say the illumination conversion is done in
image space.
We use the query-training-sample for ANN1's training, which, as mentioned earlier, means
it is not trained on any of the unseen-sample's individuals. ANN1 was trained using the Huber
loss function (Eq. 7),

L_δ(y, f(x)) =
    (1/2) · mean_{(x,y)∈batch} ||y − f(x)||_2^2          if max_{(x,y)∈batch} ||y − f(x)||_1 ≤ δ
    δ · max_{(x,y)∈batch} ||y − f(x)||_1 − (1/2)δ^2      otherwise        (7)
Figure 14: Image synthesis algorithms logic design.
Figure 15: ANN1 architecture.
where x is the query image, y is the target image, f(x) is the synthesized image, and δ = 0.001.
It approximates the Mean Squared Error (MSE), which is commonly used to evaluate image
regression. The general form of the Huber loss function, as given by [Shalev-Shwartz, 2014], is
(Eq. 8):
L_δ(y, f(x)) =
    (1/2)(y − f(x))^2           if |y − f(x)| ≤ δ
    δ|y − f(x)| − (1/2)δ^2      otherwise        (8)
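Eq. 8 translates directly into a vectorized numpy sketch of the element-wise loss; the batch variant of Eq. 7 replaces the per-element residual with batch statistics:

```python
import numpy as np

def huber_loss(y, f_x, delta=0.001):
    """Element-wise Huber loss (Eq. 8): quadratic for small residuals,
    linear (minus a constant) for large ones."""
    r = np.abs(y - f_x)
    quadratic = 0.5 * r ** 2
    linear = delta * r - 0.5 * delta ** 2
    return np.where(r <= delta, quadratic, linear)

y = np.array([0.0, 0.0, 0.0])
f_x = np.array([0.0005, 0.001, 0.5])   # small, boundary, and large residuals
loss = huber_loss(y, f_x)
```

The linear branch caps the penalty that the quadratic branch would assign to large residuals, which makes training more robust to outlier pixels.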
4.4.2 Algo2: Introducing More Information in Training
To allow training for all individuals, and to examine illumination conversion in a different
feature space, we add a classification network to the process. Algo2 is composed of two ANNs:
a feature construction stage, and a combined illumination conversion and image synthesis
stage.
Figure 16: ANN2.1 architecture.
Figure 17: ANN2.2 architecture.
The feature construction stage (ANN2.1, Fig. 16) is a classification network for both ID
and illumination. This choice of task for the stage allows us not only to train over all training
images (not just the ones in the query-training-sample), but also to use the information in the
ID and illumination labels.
ANN2.1 was trained with multi-class logistic loss function (Eq. 2).
Although the second stage (ANN2.2, Fig. 17) is of a similar nature to ANN1 in Algo1,
having to perform illumination conversion and output an image, its task is more complicated
in a way: being forced to accept yet another input feature space compels the network to learn
the transformation between these spaces in addition to the illumination conversion.
ANN2.2 is trained over the query-training-sample, where the query image of every query-
target pair in that sample is replaced by its feature vector (denoted ANN2.1(imk(p)) = f(imk(p))),
as in Eq. 9.

{ (f(imk(p)), imk∗(p)) : k ≠ k∗, p ∈ P \ {p1, p2, p3} }    (9)
The loss function for ANN2.2 is the Huber loss (Eq. 8), with parameter δ = 0.001.
4.4.3 Algo3: Forcing the Logic
To go further and simplify each network's assignment, we propose Algo3, where we force all
the logical stages by training a different network for each stage: one for feature construction,
one for illumination conversion, and one for the inverse space transformation (from feature
vectors back to images). The advantage of this approach is that it simplifies the task of each
network. It also allows integrating the ID and illumination information, and images of the
unseen-sample's individuals, not only in the first stage as in Algo2, but in the last one as well.

Figure 18: ANN3.2 architecture.

Figure 19: ANN3.3 architecture.
The feature constructing network (ANN3.1) is the same one as in Algo2 (ANN2.1 = ANN3.1).
The second stage network (ANN3.2, Fig. 18), having the mission of illumination conversion
(but not the synthesis), is trained with feature vectors of the query-training-sample (computed
using ANN2.1), both for the query images and the targets (Eq. 10).

{ (f(imk(p)), f(imk∗(p))) : k ≠ k∗, p ∈ P \ {p1, p2, p3} }    (10)
ANN3.2 was trained with Huber loss function (Eq. 8), with parameter δ = 0.001.
The third stage, performed by ANN3.3 (Fig. 19), is basically a transformation back to image
space. Its training sample is constructed with the help of ANN2.1, and contains all images that
are not in the unseen-sample: the input instances are feature vectors of images f(imk(p)), and
the labels are the corresponding images themselves, imk(p) (Eq. 11).

{ (f(imk(p)), imk(p)) : k ∈ K, p ∈ P, k = k∗ → p ∉ {p1, p2, p3} }    (11)
ANN3.3 was trained with the Huber loss function (Eq. 8), with parameter δ = 0.0001.
4.5 Computational Validation
Given a synthesized image imk∗(p), how does one decide whether it is "good" enough, or
"close" enough to the target image? Since we aim for images that demonstrate a specific
individual and specific illumination conditions, using the Mean Squared Error (MSE) from the
target image, as is commonly done for image evaluation, is not sufficient here: as in real life,
a face in a blurry image can still be recognizable. We propose a computational validation
tool, a face and illumination image classification ANN (Fig. 20). We train the validation tool
over the entire YaleB data-set because, much like in classic face recognition tasks, it is used to
evaluate new images (in this case, the synthesized images), not new individuals. We say an
output image, or feature vector, is correct if both its illumination and ID labels, as assigned by
the validation network, are correct. A synthesis network's accuracy is determined by this notion.
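The correctness notion above amounts to a joint check over the two predicted labels. A minimal sketch, with (id, illumination) label pairs standing in for the validation network's two classification outputs:

```python
def is_correct(pred_id, true_id, pred_illum, true_illum):
    """A synthesized image counts as correct only if BOTH labels match."""
    return pred_id == true_id and pred_illum == true_illum

def synthesis_accuracy(predictions, targets):
    """predictions/targets: lists of (id, illumination) label pairs."""
    hits = sum(is_correct(pi, ti, pk, tk)
               for (pi, pk), (ti, tk) in zip(predictions, targets))
    return hits / len(targets)

# Toy check: 3 images, correct illumination everywhere, one wrong ID.
preds = [(1, 0), (2, 0), (9, 0)]
truth = [(1, 0), (2, 0), (3, 0)]
acc = synthesis_accuracy(preds, truth)   # 2 of 3 fully correct
```

When illumination accuracy is 100%, as in our results below, this joint accuracy collapses to the ID accuracy alone.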
We also construct a similar validation tool for feature vectors (Fig. 21), in order to evaluate
Algo3's performance before returning to image space.
Both networks were trained with multi-class logistic loss function (Eq. 2).
Figure 20: Image validation network architecture.
Figure 21: Feature vector validation network architecture.
4.6 Results and Discussion
Overall, and with so little data, all three algorithms were able to produce recognizable results
to some degree: we can see a clear structure of a face, and the illumination seems correct. We
detail more with reference to the quantitative results and the enclosed figures:
           Training   Illumination    ID test    Test
           accuracy   test accuracy   accuracy   accuracy
Algo1      99.5%      100%            35.7%      35.7%
Algo2      100%       100%            19.1%      19.1%
Algo3      100%       100%            23.8%      23.8%
ANN3.2*    100%       100%            23.8%      23.8%

Table 4: Image synthesis algorithms' accuracy values, by our validation tools. Since illumination accuracy is 100%, the overall test accuracy and the ID accuracy are equal. *Evaluation of Algo3 in feature vector space.
Looking at the accuracy results (Table 4), it is clear that we have a case of overfitting to the
training data. This is probably an outcome of the very large gap between the number of
algorithm parameters and the number of training samples, in every one of the algorithms. While
there is no known formula for the number of instances required to train a neural network, it is
somewhat agreed that it should be around one order of magnitude less than the number of
parameters. Our setting does not come close to that relation. Take Algo1 as an example: in
Eq. 12 we calculate the number of parameters in ANN1, and in Eq. 13 the number of samples
in its training set.

20 × (5 × 5 + 1) + 20 × (5 × 5 × 20 + 1) + 2048 × (24 × 24 × 20 + 1) + 1024 × (2048 + 1) = 25,703,724    (12)

(31 − 3) × 14 = 392    (13)
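Both counts are easy to verify programmatically; the terms below follow Eq. 12 (two convolutional layers and two fully connected layers, weights plus biases) and Eq. 13:

```python
# Eq. 12: number of parameters in ANN1, layer by layer (weights + biases).
n_params = (20 * (5 * 5 + 1)                 # first conv layer
            + 20 * (5 * 5 * 20 + 1)          # second conv layer
            + 2048 * (24 * 24 * 20 + 1)      # first fully connected layer
            + 1024 * (2048 + 1))             # second fully connected layer

# Eq. 13: number of samples in the query-training-sample.
n_samples = (31 - 3) * 14

ratio = n_params / n_samples   # parameters per training sample
```

The ratio is roughly 65,000 parameters per training sample, far beyond the rule-of-thumb order of magnitude mentioned above.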
Another immediate observation (Table 4) concerns illumination accuracy: recall that a
synthesized image is considered correct if both its illumination and ID labels are correct. But
since all algorithms achieved an illumination accuracy of 100%, a synthesized image is "correct"
if and only if its ID label, by the validation network, is correct; thus, the total accuracy values
equal the ID accuracy values. Looking at the examples in Table 5, we see that in fact even the
blurriest results demonstrate realistic shadows, and are not just darker on the right side of the
picture. That said, all human faces have similar characteristics when it comes to casting
shadows (the eye socket is shadowed, the eyebrow is illuminated); namely, demonstrating
illumination is an easier task to learn than maintaining the correct face through illumination
changes.
As for ID accuracy, in spite of the seemingly low test accuracy values (Table 4), the task is
a success. Recall that the validation networks were trained over all 31 individuals, and
therefore a random image pool would have achieved an accuracy rate of 1/31 ≈ 3.2%. In other
words, our algorithms achieved 6 to 11 times higher accuracy than chance. Moreover, the
degradation in the resulting images, correct or incorrect, still looks realistic, unlike the examples
in [Kulkarni et al., 2015] (Fig. 12).

Target image    Correct results                 Incorrect results
ID: 1           by Algo3                        by Algo1, by Algo2
ID: 2           by Algo1, by Algo1              by Algo3
ID: 3           by Algo1, by Algo1, by Algo1    by Algo2

Table 5: Examples of correct and incorrect results for the unseen-sample. A synthesized image is considered "correct" if both its ID and illumination labels, by the validation network, are correct. All resulting images have correct illumination labeling; the incorrect results have incorrect ID labeling.
Looking at the resulting images, it seems the accuracy values correlate with the synthesized
image quality. Algo1, which achieved the highest test accuracy of 35.7%, is also responsible for
the best results, in terms of image quality and the ability to recognize the represented individual
by eye, for two out of the three individuals (IDs 2 and 3) of the unseen-sample (Fig. 23). In
second place comes Algo2, with the best-looking result for the third individual (ID 1), for which
neither Algo1 nor Algo2 achieved any correct result. Moreover, although Algo3 has more correct
results for individual #3, Algo1's results look a lot more similar to the target image (Fig. 24).
Another conclusion from Table 4 regards Algo3: the accuracy results of Algo3 before and
after the transformation of the predicted feature vectors back to images (fourth and third rows,
respectively) are the same. This tells us that the space conversion (ANN3.3) did not degrade
the overall task.
Recall that Algo1 was trained on the least amount of data, and not on any image of the
unseen-sample's individuals, and thus could not form any "memory" of their faces. Given that
in this respect it had a more difficult task than the others, the fact that Algo1 achieved the
best test accuracy and the best quality synthesized images is surprising.
Figure 22: All correct test results. All synthesized images shown here were correctly classified by the image validation network. The location of each such image indicates the query image it was synthesized for (top of the same column), the real target image for that query (second from the top of the same column), and the algorithm that synthesized it (row). The query images appearing in the top row are only those for which at least one of the algorithms synthesized a correctly labeled image.
But does that mean image space is a better space for our task than the one produced by the
classification network (ANN2.1)? The answer is complicated. On the one hand, as stated earlier,
the only algorithm that uses unprocessed images (Algo1) performed better in general, both in
terms of accuracy and of quality. But on the other hand, its success is centered mostly on one
individual (#2). On both IDs 1 and 3, Algo3 achieved more correct results (Fig. 22). It is
possible that individual #2 has distinctive facial features in image space, which, if preserved,
will earn a correct labeling; while in feature vector space, which is defined with an equal
representation for every individual in the data set, they are no more unique than anybody else's.
What does remain clear is that, whatever the space, it is better to remain within it: the
transition in Algo2 between feature space and image space, on top of the illumination conversion
in ANN2.2, appears to have hampered its results.
Figure 23: Best looking synthesized images.

Figure 24: Correct results for ID 3, Algo1 vs Algo3. Top: the target image; middle: correct results of Algo1; bottom: correct results of Algo3.
4.6.1 Conclusions
While showing the potential abilities of ANNs to synthesize realistic face images under new
illumination conditions, the experiment's framework requires some improvements, one of which
is the evaluation method.
The way we chose to evaluate our synthesized images is far from perfect. Its shortcomings
raise a number of questions and ideas on the subject of synthesized-image evaluation in
particular, and of non-trivial evaluation in general.
First of all, it seems that if we wish to evaluate an algorithm in a certain way, we should
train it to perform under that evaluation method. In our experiment, we evaluated the synthesis
performance with the Huber loss function (Eq. 7) at training time, and with the validation
networks at test time. But if the Huber loss, or MSE, is not the only way we eventually want
to evaluate our results, why train that way?
If so, how could we use the validation network for training? One possibility is to fix the
validation network's weights and concatenate it to the synthesis network during training. To
avoid synthesizing key distinctive features alone, one could use a combination of the Huber loss
and some loss function over the classification labels of the validation network (Fig. 25). Its
derivatives can be calculated in back-propagation form, with the exception that they are taken
with respect to the input rather than the weights.
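For intuition, here is a minimal numpy sketch of that gradient-with-respect-to-input computation, using a hypothetical one-layer linear-softmax "validation network" with frozen weights (the real validation network is, of course, deeper):

```python
import numpy as np

rng = np.random.default_rng(2)
D, C = 16, 4                      # input dimension, number of classes
W = rng.normal(size=(C, D))       # frozen validation-network weights
b = rng.normal(size=C)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_and_input_grad(x, label):
    """Cross-entropy of the frozen classifier, and its gradient w.r.t. the
    INPUT x (not W): the signal fed back to the synthesis network."""
    p = softmax(W @ x + b)
    loss = -np.log(p[label])
    dz = p.copy()
    dz[label] -= 1.0              # d(loss) / d(logits)
    return loss, W.T @ dz         # chain rule through the frozen layer

x = rng.normal(size=D)
loss0, g = loss_and_input_grad(x, label=1)
loss1, _ = loss_and_input_grad(x - 0.001 * g, label=1)  # one step on x itself
improved = loss1 < loss0
```

A small gradient step on the input decreases the classification loss, which is exactly the signal the synthesis network would receive through the frozen validation head.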
Figure 25: Combining the validation network into training process.
Still, this leaves out the question of eye-evaluated image quality. Had we enough data for
it, we might have been able to train an ANN to evaluate quality, and operate in the same way.
But obtaining the right data is complicated, as adding synthetic noise, blurring, or any other
distortion would only teach the learning method to tell that specific noise apart, and not
necessarily the noise created by the synthesis ANN.
Phrasing our task differently emphasizes that this field is already being explored: we would
like to create a real-looking image, given a description of the desired output. Namely, we need
an evaluation tool that punishes unrealistic appearance. This is, of course, the mission of
generative adversarial networks (GANs); in [Reed et al., 2016], for example, realistic images
are created from a caption only.
List of Figures
1 Extended Yale Face Database B example . . . . . . . . . . . . . . . . . . . . . . 7
2 Illumination clustering and representatives illustration . . . . . . . . . . . . . . . 8
3 Representative illuminations images . . . . . . . . . . . . . . . . . . . . . . . . . 9
4 Outline of the DeepFace [Taigman et al., 2014] face classifier architecture . . . . . . 10
5 [Taigman et al., 2014] ROC curve on LFW data-set. . . . . . . . . . . . . . . . . 11
6 Market-1501 data-set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
7 Feature vector construction: CNN architecture . . . . . . . . . . . . . . . . . . . 14
8 Pair recognition: ANN architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 15
9 Results of the Quotient Image algorithm [Riklin-Raviv and Shashua, 1999] . . . . 20
10 Structure of the representation vector of [Kulkarni et al., 2015] . . . . . . . . . . 20
11 Algorithm of [Kulkarni et al., 2015] . . . . . . . . . . . . . . . . . . . . . . . . . . 21
12 Entangled versus disentangled representations [Kulkarni et al., 2015] . . . . . . . 22
13 The unseen-sample images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
14 Image synthesis algorithms logic design . . . . . . . . . . . . . . . . . . . . . . . 24
15 ANN1 architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
16 ANN2.1 architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
17 ANN2.2 architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
18 ANN3.2 architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
19 ANN3.3 architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
20 Image validation network architecture . . . . . . . . . . . . . . . . . . . . . . . . 27
21 Feature vector validation network architecture . . . . . . . . . . . . . . . . . . . . 27
22 All correct test results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
23 Best looking synthesized images . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
24 Correct results for ID 3, Algo1 vs Algo3 . . . . . . . . . . . . . . . . . . . . . . . 31
25 Combining the validation network into training process . . . . . . . . . . . . . . 32
List of Tables
1 Face verification: Separability of feature spaces . . . . . . . . . . . . . . . . . . . 16
2 Face verification: Experiments results by types of feature space and data aug-
mentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Face verification: Comparing the results with RBF SVM . . . . . . . . . . . . . . 17
4 Image synthesis: Algorithms accuracy . . . . . . . . . . . . . . . . . . . . . . . . 28
5 Image synthesis: Examples of correct and incorrect results for the unseen-sample 29
References
[Bengio et al., 2013] Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learn-
ing: A review and new perspectives. IEEE transactions on pattern analysis and machine
intelligence, 35(8):1798–1828.
[Berg and Belhumeur, 2012] Berg, T. and Belhumeur, P. N. (2012). Tom-vs-pete classifiers and
identity-preserving alignment for face verification. In BMVC, volume 2, page 7.
[Bhele and Mankar, 2012] Bhele, S. G. and Mankar, V. (2012). A review paper on face recog-
nition techniques. International Journal of Advanced Research in Computer Engineering &
Technology (IJARCET), 1(8):pp–339.
[Chang and Lin, 2011] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support
vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27.
Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[Cohen and Welling, 2014] Cohen, T. and Welling, M. (2014). Learning the irreducible rep-
resentations of commutative lie groups. In International Conference on Machine Learning,
pages 1755–1763.
[Georghiades et al., 2001] Georghiades, A. S., Belhumeur, P. N., and Kriegman, D. J. (2001).
From few to many: Illumination cone models for face recognition under variable lighting and
pose. IEEE transactions on pattern analysis and machine intelligence, 23(6):643–660.
[Goodfellow et al., 2009] Goodfellow, I., Lee, H., Le, Q. V., Saxe, A., and Ng, A. Y. (2009).
Measuring invariances in deep networks. In Advances in neural information processing sys-
tems, pages 646–654.
[Kanade, 1977] Kanade, T. (1977). Computer recognition of human faces, volume 47.
Birkhauser Basel.
[Kelly, 1970] Kelly, M. D. (1970). Visual identification of people by computer. Technical report,
stanford univ calif dept of computer science.
[Kulkarni et al., 2015] Kulkarni, T. D., Whitney, W. F., Kohli, P., and Tenenbaum, J. (2015).
Deep convolutional inverse graphics network. In Advances in Neural Information Processing
Systems, pages 2539–2547.
[LeCun et al., 1995] LeCun, Y., Bengio, Y., et al. (1995). Convolutional networks for images,
speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995.
[Liu et al., 2016] Liu, H., Feng, J., Qi, M., Jiang, J., and Yan, S. (2016). End-to-end compar-
ative attention networks for person re-identification. 14(8):1–10.
[Paysan et al., 2009] Paysan, P., Knothe, R., Amberg, B., Romdhani, S., and Vetter, T. (2009).
A 3d face model for pose and illumination invariant face recognition. In Advanced Video and
Signal Based Surveillance, 2009. AVSS’09. Sixth IEEE International Conference on, pages
296–301. IEEE.
[Reed et al., 2016] Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., and Lee, H. (2016).
Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396.
[Riklin-Raviv and Shashua, 1999] Riklin-Raviv, T. and Shashua, A. (1999). The quotient im-
age: Class based recognition and synthesis under varying illumination conditions. In Com-
puter Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on., vol-
ume 2, pages 566–571. IEEE.
[Schroff et al., 2015] Schroff, F., Kalenichenko, D., and Philbin, J. (2015). Facenet: A unified
embedding for face recognition and clustering. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 815–823.
[Shalev-Shwartz, 2014] Shalev-Shwartz, S. (2014). A mini tutorial on convolutional-neural-
networks. Technical report.
[Shashua, 1992] Shashua, A. (1992). Illumination and view position in 3d visual recognition.
In Advances in neural information processing systems, pages 404–411.
[Shashua, 1997] Shashua, A. (1997). On photometric issues in 3d visual recognition from a
single 2d image. International Journal of Computer Vision, 21(1):99–122.
[Sun et al., 2015] Sun, Y., Liang, D., Wang, X., and Tang, X. (2015). Deepid3: Face recognition
with very deep neural networks. arXiv preprint arXiv:1502.00873.
[Taigman et al., 2014] Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014). Deepface:
Closing the gap to human-level performance in face verification. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 1701–1708.
[Varior et al., 2016] Varior, R. R., Haloi, M., and Wang, G. (2016). Gated Siamese Convolu-
tional Neural Network Architecture for Human Re-Identification. pages 1–18.
[Vetter et al., 1997] Vetter, T., Jones, M. J., and Poggio, T. (1997). A bootstrapping algorithm
for learning linear models of object classes. In Computer Vision and Pattern Recognition,
1997. Proceedings., 1997 IEEE Computer Society Conference on, pages 40–46. IEEE.
[Vetter and Poggio, 1996] Vetter, T. and Poggio, T. (1996). Image synthesis from a single
example image. Computer Vision—ECCV’96, pages 652–659.
[Wu et al., 2016] Wu, L., Shen, C., and Hengel, A. V. D. (2016). PersonNet : Person Re-
identification with Deep Convolutional Neural Networks. pages 1–7.
[Zheng et al., 2015] Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., and Tian, Q. (2015).
Scalable person re-identification: A benchmark. In Computer Vision, IEEE International
Conference on.