A Maximum Correlation Feature Descriptor for Heterogeneous Face Recognition
Dihong Gong
Department of Computer Science
Indiana University Purdue University Indianapolis
Indianapolis IN 46202, USA
Jiang Yu Zheng
Department of Computer Science
Indiana University Purdue University Indianapolis
Indianapolis IN 46202, USA
Abstract—Heterogeneous Face Recognition (HFR) refers to matching probe face images to a gallery of face images taken from an alternate imaging modality, for example matching near infrared (NIR) face images to photographs. Matching heterogeneous face images has important practical applications such as surveillance and forensics, yet it remains a challenging problem in the face recognition community due to the large within-class discrepancy incurred by modality differences. In this paper, a novel feature descriptor is proposed in which the features of both gallery and probe face images are extracted with an adaptive feature descriptor that maximizes the correlation of the encoded face images across modalities, so as to reduce the within-class variations at the feature extraction stage. The effectiveness of the proposed approach is demonstrated on the scenario of matching NIR face images to photographs, based on a very large dataset consisting of 2800 different persons.
Keywords-Heterogeneous face recognition; feature descriptor; correlation analysis
I. INTRODUCTION
New challenges for feature-based face recognition arise when we match face images taken from different imaging modalities, e.g., NIR vs. photographs and sketches vs. photographs, which we refer to as Heterogeneous Face Recognition (HFR). The major difficulty of HFR lies in the fact that features extracted from images of different modalities usually exhibit large within-class discrepancy, and thus face pairs can become mismatched even if they belong to the same class (i.e., come from the same subject). Figure 1 shows some example NIR-photograph face images, from which we can see that the probe face images (NIR) appear quite different from their corresponding photographs. The NIR face images are usually blurred and of low contrast compared with the photographs.
A variety of algorithms have been proposed in the literature to combat such modality discrepancy; they can be summarized into the following three categories: 1) Convert images from one modality to the other by synthesizing a pseudo-image from the query image so that matching can be done within the same modality [1-4]. 2) Design an appropriate representation that is insensitive to the modality of the images [5-6]. 3) Compare the heterogeneous images in a common subspace where the modality difference is believed to be minimized [7-9,18].
Fig. 1: Sample NIR-photograph face images.
This paper presents a new approach to enhance the ac-
curacy of matching heterogeneous face images by reducing
the within-class discrepancy at the feature extraction stage.
The basic idea is to learn a feature descriptor that can
adaptively turn heterogeneous face images into face features
whose within-class correlation has been maximized. Although our approach is not limited to a specific HFR scenario, we will demonstrate its effectiveness on the scenario of matching NIR images to photographs. The merits of the proposed approach are summarized as follows:
1. It improves the recognition accuracy over state-of-the-art approaches on the NIR-to-photograph scenario.
2. It can be integrated with other subspace analysis algorithms to further boost performance.
3. It is a natural extension of the existing Local Binary Patterns (LBP) [13] feature descriptor.
II. PROPOSED APPROACH
Given an image, it can be turned into an encoded image by converting each pixel into a specific code using a vector quantization technique. Various algorithms such as mean shift [10], k-means, random projection trees [11] and random forests [12] have been proposed to quantize a continuous space into discrete partition cells for vector quantization.
In this section, we present a learning-based feature descrip-
tor specifically for HFR. The proposed feature descriptor can
turn the heterogeneous face images into common encoded
images in the sense that the correlation between the encoded
images is maximized, and thus reduce the within-class varia-
tions at the feature extraction stage.
Fig. 2: Comparison of the Local Binary Patterns and the proposed model. The proposed model is parameterized with weights for the elements of the pixel vector (which is L2-normalized), and the weights are learned to maximize the correlation between the model outputs of the heterogeneous face images, while the traditional LBP descriptor can be viewed as a special case of our linear model with fixed weights $\vec{w} = [2^7\ 2^6\ 2^5\ 2^4\ 2^3\ 2^2\ 2^1\ 2^0]^T$.
A. The Local Binary Patterns Revisited
The Local Binary Patterns (LBP) [13] descriptor is a powerful feature descriptor for object detection and texture classification, first described in 1994 [15]. The LBP has several variations, e.g., different sampling patterns, and one of the most popular configurations is shown in Figure 2.
The LBP encodes an input image by converting each pixel into a decimal code (ranging from 0 to 255) by:
1) comparing it to each of its 8 neighboring pixels (sampling radius = 1, in counterclockwise order);
2) writing 1 if the neighbor pixel is greater than the center pixel (writing 0 otherwise);
3) converting the resulting binary string into the corresponding decimal value ranging from 0 to 255.
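For reference, a minimal Python sketch of this standard radius-1 LBP encoding is given below; the exact neighbor ordering is our assumption, since any fixed counterclockwise ordering yields an equivalent descriptor.

import numpy as np

def lbp_encode(img):
    """Basic 8-neighbor, radius-1 LBP encoding of a grayscale image.

    img: 2-D numpy array of gray values.
    Returns an array of codes in [0, 255] for the interior pixels.
    """
    # Offsets of the 8 neighbors, traversed counterclockwise
    # (the starting neighbor is a convention, not fixed by the text).
    offsets = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
               (0, -1), (1, -1), (1, 0), (1, 1)]
    center = img[1:-1, 1:-1]
    code = np.zeros_like(center, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:img.shape[0] - 1 + dy,
                       1 + dx:img.shape[1] - 1 + dx]
        # Write 1 where the neighbor is greater than the center pixel.
        code |= (neighbor > center).astype(np.uint8) << bit
    return code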
B. The Proposed Feature Descriptor Model
One of the drawbacks of LBP in the context of HFR is that the encoding scheme is fixed regardless of modality, which
may degrade the recognition performance as facial images
from different modalities have different structures. To adapt
the feature descriptor to HFR, we propose a parameterized
model as illustrated in Figure 2. Formally, the code of each
pixel is determined by:
$$\mathrm{code}(\vec{x}) = \begin{cases} 1, & \text{if } y(\vec{x}) \geq b \\ 0, & \text{otherwise} \end{cases} \quad (1)$$
where $y(\vec{x}) = \vec{w}^T\vec{x}$ is the model output, $b$ is the threshold, $\vec{w}$ is the model parameter and $\vec{x}$ is the pixel vector whose $i$-th element is given by:
$$x_i = np_i^r - cp, \quad i = 1, \ldots, 8r \quad (2)$$
Here $cp$ represents the value of the center pixel and $np_i^r$ is the value of the $i$-th (counterclockwise) neighbor pixel with sampling radius $r$. Figure 3(a) illustrates the sampling pattern with radii $r = 2, 3$ (in total $8r$ neighbor pixels for sampling radius $r$). Note that the pixel vector $\vec{x}$ is then normalized with the L2-norm in our system. We can view the LBP as a special case of our model where $\vec{w} = [2^7\ 2^6\ 2^5\ 2^4\ 2^3\ 2^2\ 2^1\ 2^0]^T$ and $\vec{x}$ is a binary vector, as illustrated in Figure 2.
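To make (2) and the normalization concrete, the following small sketch builds the pixel vector for one interior pixel; the square-ring neighborhood and the raster enumeration order are our assumptions, since the text only fixes the count of $8r$ neighbors and a counterclockwise traversal.

import numpy as np

def pixel_vector(img, row, col, radius):
    """Build the L2-normalized pixel vector of Eq. (2) for one pixel.

    The 8*radius neighbors are taken on the square ring at the given
    radius; (row, col) must be at least `radius` pixels from the border.
    """
    center = float(img[row, col])
    # All offsets with max(|dy|, |dx|) == radius: exactly 8*radius of them.
    ring = [(dy, dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if max(abs(dy), abs(dx)) == radius]
    x = np.array([float(img[row + dy, col + dx]) - center
                  for dy, dx in ring])                     # Eq. (2)
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x                     # L2 normalization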
The preceding encoding mechanism turns any input image into a binary encoded image, as it encodes each pixel into either 0 or 1 according to (1). In many real-world applications, however, a binary encoder is usually not discriminative enough. To extend the model, we encode each pixel with a series of binary classifiers. Specifically, for each pixel, we first divide its pixel vector $\vec{x}$ computed in (2) into eight groups (as illustrated in Figure 3(b)), with each group corresponding to one direction. Then we learn a binary encoder for each direction, and finally the code of each pixel is determined by the outputs of all binary encoders:
$$\mathrm{code} = \sum_{k=0}^{7} 2^k c(k) \quad (3)$$
where $c(k)$ is the binary output in (1) of the $k$-th direction.
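As an illustration of (1) and (3) together, a minimal per-pixel encoder is sketched below, assuming the pixel vector has already been split into eight per-direction sub-vectors and that learned weights and thresholds are available (the argument names are hypothetical).

import numpy as np

def encode_pixel(x_dirs, w_dirs, b_dirs):
    """Combine eight binary encoders into one code as in Eq. (3).

    x_dirs: eight per-direction pixel sub-vectors (Figure 3(b)).
    w_dirs: eight learned weight vectors, one per direction.
    b_dirs: eight learned thresholds.
    Returns an integer code in [0, 255].
    """
    code = 0
    for k in range(8):
        y = float(np.dot(w_dirs[k], x_dirs[k]))  # linear model output, Eq. (1)
        c = 1 if y >= b_dirs[k] else 0           # binary decision c(k)
        code += c << k                           # Eq. (3): sum_k 2^k * c(k)
    return code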
Algorithm 1 Learning the model parameters
Inputs: a set of training image pairs $I = \{(I_n^1, I_n^2)\,|\,n = 1, \ldots, N\}$, and sampling radius $r$.
1. Extract eight sets of pixel vectors as described in Section II.B.
2. For each direction $k = 0$ to $7$:
(i) Solve the generalized eigen-decomposition problem (6), and obtain $(\vec{w}_1^k, \vec{w}_2^k)$ by taking the eigenvector corresponding to the largest eigenvalue.
(ii) Compute $b_1^k$ and $b_2^k$ with (7).
Outputs: the model parameters $(\vec{w}_1^k, \vec{w}_2^k)$ and $(b_1^k, b_2^k)$.
Fig. 3: Illustration of (a) the sampling patterns with radius = 2 and radius = 3; and (b) the split of the long pixel vector into eight associated per-direction pixel vectors.
C. Learning the Linear Model Parameters
In this section, we elaborate on the adaptation of the model parameters $\vec{w}$ and threshold $b$ in (1). Suppose we are given a set of training face pairs $I = \{(I_n^1, I_n^2)\,|\,n = 1, \ldots, N\}$, where $I_n^1$ represents the $n$-th training face image from the first modality (e.g., the photograph) and $I_n^2$ represents the image from the second modality (e.g., the NIR). These training pairs are then turned into eight groups of pixel-vector pairs, denoted as $D_k = \{(\vec{x}_m^{1,k}, \vec{x}_m^{2,k})\,|\,m = 1, \ldots, W \times H \times N\}$, where the superscript $k = 0, \ldots, 7$ indicates the direction and $W$, $H$ are the width and height of the training images respectively, as described in Section II.B. Our training aim is then to learn a set of model parameters $(\vec{w}_1^k, \vec{w}_2^k)$ for each of the groups that empirically maximizes the correlation between the outputs. Formally,
$$(\vec{w}_1^*, \vec{w}_2^*) = \arg\max_{(\vec{w}_1, \vec{w}_2)} \frac{\frac{1}{M}\sum_{m=1}^{M} \vec{w}_1^T \vec{x}_m^1 \vec{x}_m^{2T} \vec{w}_2}{\sqrt{\frac{1}{M}\sum_{m=1}^{M} \vec{w}_1^T \vec{x}_m^1 \vec{x}_m^{1T} \vec{w}_1}\;\sqrt{\frac{1}{M}\sum_{m=1}^{M} \vec{w}_2^T \vec{x}_m^2 \vec{x}_m^{2T} \vec{w}_2}} \quad (4)$$
where $M = W \times H \times N$ is the total number of training pixel pairs, and we have dropped the superscript $k$ for notational simplicity. Solving (4) is equivalent to fixing the denominator to 1 and maximizing the numerator, which leads to the following optimization problem:
$$\begin{aligned} \text{maximize:}\quad & \sum_{m=1}^{M} \vec{w}_1^T \vec{x}_m^1 \vec{x}_m^{2T} \vec{w}_2 \\ \text{s.t.:}\quad & \sum_{m=1}^{M} \vec{w}_1^T \vec{x}_m^1 \vec{x}_m^{1T} \vec{w}_1 = M, \quad \sum_{m=1}^{M} \vec{w}_2^T \vec{x}_m^2 \vec{x}_m^{2T} \vec{w}_2 = M \end{aligned}$$
This constrained optimization problem can be solved by introducing Lagrange multipliers as follows:
$$L(\vec{w}_1, \vec{w}_2) = \vec{w}_1^T C_{12} \vec{w}_2 - \frac{\lambda_1}{2}\left(\vec{w}_1^T C_{11} \vec{w}_1 - M\right) - \frac{\lambda_2}{2}\left(\vec{w}_2^T C_{22} \vec{w}_2 - M\right) \quad (5)$$
where $C_{12} = \sum_{m=1}^{M} \vec{x}_m^1 \vec{x}_m^{2T}$, $C_{11} = \sum_{m=1}^{M} \vec{x}_m^1 \vec{x}_m^{1T}$ and $C_{22} = \sum_{m=1}^{M} \vec{x}_m^2 \vec{x}_m^{2T}$. Taking derivatives w.r.t. $\vec{w}_1$ and $\vec{w}_2$, we obtain the K.K.T. conditions:
$$C_{12} \vec{w}_2 = \lambda_1 C_{11} \vec{w}_1, \qquad C_{12}^T \vec{w}_1 = \lambda_2 C_{22} \vec{w}_2$$
Multiplying the first condition by $\vec{w}_1^T$ and the second by $\vec{w}_2^T$ and subtracting them (the two left-hand sides are the same scalar, while the constraints force the right-hand sides to equal $\lambda_1 M$ and $\lambda_2 M$ respectively), we arrive at $\lambda_1 = \lambda_2 = \lambda$, where
$$\lambda = \frac{\vec{w}_1^T C_{12} \vec{w}_2}{\sqrt{\vec{w}_1^T C_{11} \vec{w}_1}\;\sqrt{\vec{w}_2^T C_{22} \vec{w}_2}}$$
represents the correlation to be maximized. Substituting $\lambda_1$ and $\lambda_2$ with $\lambda$, the K.K.T. conditions can be written compactly as:
$$\lambda\, C_D \begin{bmatrix} \vec{w}_1 \\ \vec{w}_2 \end{bmatrix} = C_O \begin{bmatrix} \vec{w}_1 \\ \vec{w}_2 \end{bmatrix} \quad (6)$$
where $C_D = \begin{bmatrix} C_{11} & 0 \\ 0 & C_{22} \end{bmatrix}$ and $C_O = \begin{bmatrix} 0 & C_{12} \\ C_{12}^T & 0 \end{bmatrix}$. In solving the generalized eigen-decomposition problem (6), we pick the eigenvector corresponding to the largest eigenvalue, as the eigenvalue $\lambda$ corresponds to the correlation that we want to maximize.
By solving (6), we obtain the optimal weight vectors in the sense of maximum correlation. For each binary encoder, the auxiliary threshold parameter $b$ defined in (1) is determined by maximizing the entropy of the binary codes in order to optimize the discriminative ability [20]. We found that setting it to the mean of the model output gives a good approximation:
$$b_1 = \frac{1}{M}\sum_{m=1}^{M} \vec{w}_1^T \vec{x}_m^1, \qquad b_2 = \frac{1}{M}\sum_{m=1}^{M} \vec{w}_2^T \vec{x}_m^2 \quad (7)$$
Algorithm 1 describes the entire procedure for the learning
of model parameters.
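A minimal sketch of steps 2(i)-(ii) of Algorithm 1 for a single direction is given below, assuming the paired L2-normalized pixel vectors of that direction are stacked row-wise in the arrays X1 and X2 (hypothetical names); the small ridge term is added only for numerical stability and is not part of the formulation above.

import numpy as np
from scipy.linalg import eigh

def learn_direction(X1, X2, ridge=1e-6):
    """Learn (w1, b1, w2, b2) for one direction via Eqs. (6) and (7).

    X1, X2: arrays of shape (M, d) holding corresponding pixel vectors
    from the two modalities (row m of X1 is paired with row m of X2).
    """
    M, d = X1.shape
    C11 = X1.T @ X1                               # sum_m x1_m x1_m^T
    C22 = X2.T @ X2                               # sum_m x2_m x2_m^T
    C12 = X1.T @ X2                               # sum_m x1_m x2_m^T
    # Block matrices C_D and C_O of Eq. (6).
    CD = np.block([[C11, np.zeros((d, d))],
                   [np.zeros((d, d)), C22]])
    CO = np.block([[np.zeros((d, d)), C12],
                   [C12.T, np.zeros((d, d))]])
    CD = CD + ridge * np.eye(2 * d)               # stabilizer (our assumption)
    # Generalized eigen-decomposition CO v = lambda CD v; eigh returns
    # eigenvalues in ascending order, so the last column is the maximizer.
    _, vecs = eigh(CO, CD)
    w1, w2 = vecs[:d, -1], vecs[d:, -1]
    # Thresholds of Eq. (7): mean model output on each modality.
    b1 = float(np.mean(X1 @ w1))
    b2 = float(np.mean(X2 @ w2))
    return w1, b1, w2, b2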
D. The Maximum Correlation Feature Descriptor.
In this part we present how to extract facial features based on the proposed model. Given training face pairs and a sampling radius $r$, we first train the model parameters with Algorithm 1 and then encode face images using the scheme introduced in Section II.B. The Maximum Correlation feature is then created in the following manner:
1. Divide the whole encoded image into a set of overlapping patches of size 30x30 pixels (overlap factor = 0.3 in our system).
2. Compute, over each patch, the histogram of the frequency of each code, which gives a feature vector for the patch.
3. Concatenate the outputs of all patches into a long vector to form the final face feature.
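A small sketch of this histogram step follows; the patch size, the 0.3 overlap factor and the 256 possible codes come from the text, while the stride rounding is our assumption.

import numpy as np

def histogram_feature(coded, patch=30, overlap=0.3, n_codes=256):
    """Concatenate per-patch code histograms into one long face feature.

    coded: 2-D array of integer codes (one per pixel) from the encoder.
    """
    step = int(round(patch * (1.0 - overlap)))     # 21-pixel stride for 0.3 overlap
    feats = []
    for top in range(0, coded.shape[0] - patch + 1, step):
        for left in range(0, coded.shape[1] - patch + 1, step):
            block = coded[top:top + patch, left:left + patch]
            hist = np.bincount(block.ravel(), minlength=n_codes)
            feats.append(hist.astype(np.float64))  # frequency of each code
    return np.concatenate(feats)                   # the long face feature

The multi-scale feature described next is then simply the concatenation of such vectors computed from the encoded images at the different sampling radii.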
Fig. 4: Comparison of our descriptor with the popular facial feature descriptors (ROC curves of verification rate versus false acceptance rate for LBP, HOG, BIF, MLBP and the proposed descriptor).
In order to further enhance the discriminative ability of the feature descriptor, we apply the multi-scale sampling
technique, similar to the Multi-scale LBP (MLBP) [14]. Specifically, we first extract the face features with sampling radii = 3, 5, 7, 9 (Figure 3(a) shows radii = 2, 3), and the final face feature is formed by concatenating the features of the different scales. Figure 5 shows example encoded face images of a photograph and an NIR image with sampling radii = 3, 5, 7, 9. We can see that the encoded face images of the photograph and the NIR image appear much more similar to each other (contributing to smaller within-class variation) than their original face images.
III. EXPERIMENT
In this section, we explore the performance of the proposed maximum correlation feature descriptor on the NIR-photograph face recognition task, based on a dataset consisting of both NIR face images and the corresponding photographs from 2800 different persons, with each person having one NIR image and one photograph. The dataset is randomly divided into two non-overlapping parts: 1400 pairs are used for model training, and the remaining 1400 pairs are used for testing (matching 1400 NIR face images to 1400 photographs).
In the preprocessing steps, we normalize the face images as follows: 1) rotate the face images to align the vertical face orientation; 2) scale the face images so that the distance between the two eyes is the same for all images; 3) crop the face images to 120x150 pixels to remove the background and the hair region. Some example images are shown in Figure 1.
Following the same configuration as CITE [6], we use PCA+LDA for subspace analysis. Specifically, during the feature classification stage, we compute the matching score as follows: 1) Project the long feature into a PCA subspace of dimension 400 to reduce the possible noise. 2) Project features in the PCA subspace into another subspace, the LDA subspace of dimension 350, to further minimize the within-class variations while maximizing the between-class distances. 3) Compute the matching score from NIR to photograph in the LDA subspace with the cosine distance.
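A minimal sketch of this matching stage follows, assuming the PCA mean and the PCA and LDA projection matrices have already been fitted on the training split (all variable names below are hypothetical).

import numpy as np

def match_scores(probe_feats, gallery_feats, mean, P_pca, P_lda):
    """Cosine matching scores between projected probe and gallery features.

    probe_feats, gallery_feats: (n, D) arrays of long face features.
    mean: (D,) training mean; P_pca: (D, 400) and P_lda: (400, 350)
    projection matrices.  Higher score means a better match.
    """
    def project(F):
        Z = (F - mean) @ P_pca @ P_lda                  # PCA then LDA subspace
        return Z / np.linalg.norm(Z, axis=1, keepdims=True)
    P, G = project(probe_feats), project(gallery_feats)
    return P @ G.T                                      # cosine similarity matrix

# Rank-1 identification rate when probe i corresponds to gallery i:
# scores = match_scores(probes, gallery, mean, P_pca, P_lda)
# rank1 = np.mean(np.argmax(scores, axis=1) == np.arange(scores.shape[0]))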
Fig. 5: Illustration of the encoded face images. The first row shows the original photograph as well as the encoded images with sampling radii = 3, 5, 7, 9. The second row shows the NIR face image from the same subject and the corresponding encoded face images.
A. Comparison with the popular facial descriptors.
Firstly, we explore the effectiveness of the proposed feature descriptor by comparing it with popular facial descriptors: the LBP with sampling radius = 1 and a patch size of 16x16 pixels with overlap factor 0.5; the MLBP with a similar setting except that the sampling radii = 1, 3, 5, 7; the HOG with 12 discrete orientations; and finally the Biologically Inspired Features (BIF) [16]. For a fair comparison, we use the same configuration for all algorithms, and the parameters of all the feature descriptors are tuned to their best settings according to the respective papers.
Recognition performance is reported as rank-1 identification rates as well as Receiver Operating Characteristic (ROC) curves, as shown in Table I and Figure 4. From these results, we can see that by adaptively maximizing the correlation between photograph and NIR images, our feature descriptor achieves a significant improvement over the existing feature descriptors in both identification rates and verification rates.
TABLE I: Comparison with the popular facial descriptors.
Listed are the Rank-1 identification accuracies.
Feature Descriptors Accuracies
BIF [16] 58.21%
LBP [13] 63.72%
MLBP [14] 69.21%
HOG [19] 62.14%
Proposed 76.43%
B. Comparison with the state-of-the-art algorithms for HFR.
Next, we compare our approach with the following state-of-the-art algorithms for HFR in the literature. The algorithms are tuned to their best settings according to their papers.
• Coupled-Information Tree Encoding (CITE) [6]. It
encodes each pixel with a learned coupled-information
forest which is formed by greedily maximizing the mutual
information between the heterogeneous face images.
• Partial Least Squares (PLS) [9]. It linearly maps images
in different modalities to a common correlated subspace.
• Randomized LDA (RLDA) [17]. It performs linear discriminant analysis on a collection of random subspaces. Multiple feature-based random subspaces are learned and fused by concatenating features from these random subspaces.
• Coupled Discriminant Analysis (CDA) [18]. It makes use of all samples from different modalities to represent the coupled projections, and incorporates the locality information in the kernel space as a smoothness constraint.
Fig. 6: Illustration of face images that cannot be correctly identified by our system. The first row shows the probe NIR face images, the second row shows the photographs retrieved by our system, and the third row shows the ground-truth photographs corresponding to the probe NIR face images.
Experimental results are reported as rank-1 identification rates in Table II, from which we make the following observations: 1) the feature-adaptive approaches (CITE and the proposed one) have noticeable advantages over the subspace-adaptive approaches (PLS, RLDA, CDA); 2) our approach achieves the best identification performance on the task of matching NIR to photograph; 3) Figure 6 shows some face images that cannot be correctly identified by our system, from which we can see that the retrieved photographs are very similar to the probe NIR images, and some ground-truth photographs appear quite different from the corresponding NIR face images.
TABLE II: Comparison with the state-of-the-art algorithms.
Listed are the Rank-1 identification accuracies.
HFR Algorithms Accuracies
CITE [6] 72.53%
PLS [9] 62.72%
Randomized LDA [17] 65.29%
Coupled Discriminant Analysis [18] 71.21%
Proposed 76.43%
IV. CONCLUSION
In this paper we present a new approach for matching heterogeneous face images by designing a new learning-based feature descriptor. Extensive experiments on a large NIR-photograph face dataset clearly show that by maximizing the correlation between the encoded heterogeneous images at the feature extraction stage, we can improve recognition performance over the state-of-the-art.
REFERENCES
[1] X. Tang and X. Wang, "Face Sketch Recognition", IEEE Transactions on Circuits and Systems for Video Technology, 14(1), 50-57, 2004.
[2] B. Xiao, X. Gao, D. Tao, Y. Yuan and J. Li, "Photo-sketch Synthesis and Recognition based on Subspace Learning", Neurocomputing, 73, 840-852, 2010.
[3] X. Wang and X. Tang, "Face Photo-sketch Synthesis and Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11), 1955-1967, 2009.
[4] Q. Liu, X. Tang, H. Jin, H. Lu and S. Ma, "Nonlinear Approach for Face Sketch Synthesis and Recognition", Proceedings of CVPR, 1005-1010, 2005.
[5] B. Klare, Z. Li and A. K. Jain, "Matching Forensic Sketches to Mug Shot Photos", IEEE Transactions on Pattern Analysis and Machine Intelligence, 639-646, 2010.
[6] W. Zhang, X. Wang and X. Tang, "Coupled Information-Theoretic Encoding for Face Photo-Sketch Recognition", Proceedings of CVPR, 513-520, 2011.
[7] A. Li, S. Shan, X. Chen and W. Gao, "Maximizing Intra-individual Correlations for Face Recognition Across Pose Differences", Proceedings of CVPR, 605-611, 2009.
[8] W. Yang, D. Yi, Z. Lei, J. Sang and S. Z. Li, "2D-3D Face Matching Using CCA", Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, 1-6, 2008.
[9] A. Sharma and D. W. Jacobs, "Bypassing Synthesis: PLS for Face Recognition with Pose, Low-Resolution and Sketch", Proceedings of CVPR, 593-600, 2011.
[10] F. Jurie and B. Triggs, "Creating Efficient Codebooks for Visual Recognition", Proceedings of ICCV, Vol. 1, 604-610, 2005.
[11] Z. Cao, Q. Yin, X. Tang and J. Sun, "Face Recognition with Learning-based Descriptor", Proceedings of CVPR, 2707-2714, 2010.
[12] J. Shotton, M. Johnson and R. Cipolla, "Semantic Texton Forests for Image Categorization and Segmentation", Proceedings of CVPR, 1-8, 2008.
[13] T. Ahonen, A. Hadid and M. Pietikäinen, "Face Recognition with Local Binary Patterns", Proceedings of ECCV, 469-481, 2004.
[14] T. Mäenpää and M. Pietikäinen, "Multi-Scale Binary Patterns for Texture Analysis", Proceedings of SCIA, Gothenburg, Sweden, 885-892, 2003.
[15] T. Ojala, M. Pietikäinen and D. Harwood, "Performance evaluation of texture measures with classification based on Kullback discrimination of distributions", Proceedings of the 12th IAPR International Conference on Pattern Recognition, Vol. 1, 582-585, 1994.
[16] Guowang Mu, "Human age estimation using bio-inspired features", Proceedings of CVPR, 112-119, 2009.
[17] B. Klare and A. Jain, "Heterogeneous Face Recognition: Matching NIR to Visible Light Images", Proceedings of ICPR, 2010.
[18] Z. Lei, S. Liao, A. K. Jain and S. Z. Li, "Coupled Discriminant Analysis for Heterogeneous Face Recognition", IEEE Transactions on Information Forensics and Security, 7, 1707-1716, 2012.
[19] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection", Proceedings of CVPR, Vol. 1, 886-893, 2005.
[20] T. Jebara, "Feature selection and dualities in maximum entropy discrimination", Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 291-300, 2000.