A Maximum Correlation Feature Descriptor for Heterogeneous Face Recognition
Dihong Gong
Department of Computer Science
Indiana University Purdue University Indianapolis
Indianapolis IN 46202, USA
Jiang Yu Zheng
Department of Computer Science
Indiana University Purdue University Indianapolis
Indianapolis IN 46202, USA
Abstract—Heterogeneous Face Recognition (HFR) refers to matching probe face images to a gallery of face images taken from an alternate imaging modality, for example matching near infrared (NIR) face images to photographs. Matching heterogeneous face images has important practical applications such as surveillance and forensics, yet it remains a challenging problem in the face recognition community due to the large within-class discrepancy incurred by modality differences. In this paper, a novel feature descriptor is proposed in which the features of both gallery and probe face images are extracted with an adaptive feature descriptor that maximizes the correlation of the encoded face images across modalities, so as to reduce the within-class variations at the feature extraction stage. The effectiveness of the proposed approach is demonstrated on the scenario of matching NIR face images to photographs, based on a very large dataset consisting of 2800 different persons.
Keywords-Heterogeneous face recognition; feature descriptor; correlation analysis
I. INTRODUCTION
New challenges for feature-based face recognition arise when we match face images taken from different imaging modalities, e.g., NIR vs. photographs and sketches vs. photographs, which we refer to as Heterogeneous Face Recognition (HFR). The major difficulty of HFR lies in the fact that features extracted from images of different modalities usually exhibit large within-class discrepancy, and thus face pairs can become mismatched even if they belong to the same class (i.e., come from the same subject). Figure 1 shows some example NIR-photograph face images, from which we can see that the probe face images (NIR) appear quite different from their corresponding photographs. The NIR face images are usually blurred and of low contrast compared with the photographs.
A variety of algorithms have been proposed in the literature to combat such modality discrepancy; they can be summarized into the following three categories: 1) Convert images from one modality to the other by synthesizing a pseudo-image from the query image so that matching can be done within the same modality [1-4]. 2) Design an appropriate representation that is insensitive to the modality of the images [5-6]. 3) Compare the heterogeneous images in a common subspace where the modality difference is believed to be minimized [7-9,18].
Fig. 1: Sample NIR-photograph face images.
This paper presents a new approach to enhance the ac-
curacy of matching heterogeneous face images by reducing
the within-class discrepancy at the feature extraction stage.
The basic idea is to learn a feature descriptor that can
adaptively turn heterogeneous face images into face features
whose within-class correlation has been maximized. Although our approach is not limited to a specific HFR scenario, we will demonstrate its effectiveness on the scenario of matching NIR images to photographs. The merits of the proposed approach are summarized as follows:
1. It improves the recognition accuracy over state-of-the-art approaches on the NIR-to-photograph scenario.
2. It can be integrated with other subspace analysis algorithms to further boost performance.
3. It is a natural extension of the existing Local Binary Patterns (LBP) [13] feature descriptor.
II. PROPOSED APPROACH
Given an image, it can be turned into an encoded image by converting each pixel into a specific code using a vector quantization technique. Various algorithms such as mean shift [10], k-means, random projection trees [11] and random forests [12] have been proposed to quantize a continuous space into discrete partition cells for vector quantization.
In this section, we present a learning-based feature descrip-
tor specifically for HFR. The proposed feature descriptor can
turn the heterogeneous face images into common encoded
images in the sense that the correlation between the encoded
images is maximized, and thus reduce the within-class varia-
tions at the feature extraction stage.
Fig. 2: Comparison of the Local Binary Patterns and the proposed model. The proposed model is parameterized with weights for the elements of the pixel vector (which is L2-normalized), and the weights are learned to maximize the correlation between the model outputs of the heterogeneous face images, while the traditional LBP descriptor can be viewed as a special case of our linear model with fixed weights $\vec{w} = [2^7\ 2^6\ 2^5\ 2^4\ 2^3\ 2^2\ 2^1\ 2^0]^T$.
A. The Local Binary Patterns Revisited
The Local Binary Patterns (LBP) [13] descriptor is a powerful feature descriptor for object detection and texture classification, first described in 1994 [15]. The LBP has several variations, e.g., different sampling patterns, and one of the most popular configurations is shown in Figure 2.
The LBP encodes an input image by converting each pixel into a decimal code (ranging from 0 to 255) by:
1) comparing it to each of its 8 neighboring pixels (sampling radius = 1, in counterclockwise order);
2) writing 1 if the neighbor pixel is greater than the center pixel (writing 0 otherwise);
3) converting the resulting binary string into the corresponding decimal value ranging from 0 to 255.
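For reference, a minimal Python sketch of this standard radius-1 LBP encoding is given below; the exact neighbor ordering is our assumption, since any fixed counterclockwise ordering yields an equivalent descriptor.

import numpy as np

def lbp_encode(img):
    """Basic 8-neighbor, radius-1 LBP encoding of a grayscale image.

    img: 2-D numpy array of gray values.
    Returns an array of codes in [0, 255] for the interior pixels.
    """
    # Offsets of the 8 neighbors, traversed counterclockwise
    # (the starting neighbor is a convention, not fixed by the text).
    offsets = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
               (0, -1), (1, -1), (1, 0), (1, 1)]
    center = img[1:-1, 1:-1]
    code = np.zeros_like(center, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:img.shape[0] - 1 + dy,
                       1 + dx:img.shape[1] - 1 + dx]
        # Write 1 where the neighbor is greater than the center pixel.
        code |= (neighbor > center).astype(np.uint8) << bit
    return code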
B. The Proposed Feature Descriptor Model
One of the drawbacks of LBP in the context of HFR is that the encoding scheme is fixed regardless of modality, which
may degrade the recognition performance as facial images
from different modalities have different structures. To adapt
the feature descriptor to HFR, we propose a parameterized
model as illustrated in Figure 2. Formally, the code of each
pixel is determined by:
$$\mathrm{code}(\vec{x}) = \begin{cases} 1, & \text{if } y(\vec{x}) \geq b \\ 0, & \text{otherwise} \end{cases} \quad (1)$$
where $y(\vec{x}) = \vec{w}^T\vec{x}$ is the model output, $b$ is the threshold, $\vec{w}$ is the model parameter and $\vec{x}$ is the pixel vector whose $i$-th element is given by:
$$x_i = np_i^r - cp, \quad i = 1, \ldots, 8r \quad (2)$$
Here $cp$ represents the value of the center pixel and $np_i^r$ is the value of the $i$-th (counterclockwise) neighbor pixel with sampling radius $r$. Figure 3(a) illustrates the sampling pattern with radii $r = 2, 3$ (in total $8r$ neighbor pixels for sampling radius $r$). Note that the pixel vector $\vec{x}$ is then normalized with the L2-norm in our system. We can view the LBP as a special case of our model where $\vec{w} = [2^7\ 2^6\ 2^5\ 2^4\ 2^3\ 2^2\ 2^1\ 2^0]^T$ and $\vec{x}$ is a binary vector, as illustrated in Figure 2.
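To make (2) and the normalization concrete, the following small sketch builds the pixel vector for one interior pixel; the square-ring neighborhood and the raster enumeration order are our assumptions, since the text only fixes the count of $8r$ neighbors and a counterclockwise traversal.

import numpy as np

def pixel_vector(img, row, col, radius):
    """Build the L2-normalized pixel vector of Eq. (2) for one pixel.

    The 8*radius neighbors are taken on the square ring at the given
    radius; (row, col) must be at least `radius` pixels from the border.
    """
    center = float(img[row, col])
    # All offsets with max(|dy|, |dx|) == radius: exactly 8*radius of them.
    ring = [(dy, dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if max(abs(dy), abs(dx)) == radius]
    x = np.array([float(img[row + dy, col + dx]) - center
                  for dy, dx in ring])                     # Eq. (2)
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x                     # L2 normalization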
The preceding encoding mechanism turns any input image into a binary encoded image, as it encodes each pixel into either 0 or 1 according to (1). In many real-world applications, however, a binary encoder is usually not discriminative enough. To extend the model, we encode each pixel with a series of binary classifiers. Specifically, for each pixel, we first divide its pixel vector $\vec{x}$ computed in (2) into eight groups (as illustrated in Figure 3(b)), with each group corresponding to one direction. Then we learn a binary encoder for each direction, and finally the code of each pixel is determined by the outputs of all binary encoders:
$$\mathrm{code} = \sum_{k=0}^{7} 2^k c(k) \quad (3)$$
where $c(k)$ is the binary output in (1) of the $k$-th direction.
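As an illustration of (1) and (3) together, a minimal per-pixel encoder is sketched below, assuming the pixel vector has already been split into eight per-direction sub-vectors and that learned weights and thresholds are available (the argument names are hypothetical).

import numpy as np

def encode_pixel(x_dirs, w_dirs, b_dirs):
    """Combine eight binary encoders into one code as in Eq. (3).

    x_dirs: eight per-direction pixel sub-vectors (Figure 3(b)).
    w_dirs: eight learned weight vectors, one per direction.
    b_dirs: eight learned thresholds.
    Returns an integer code in [0, 255].
    """
    code = 0
    for k in range(8):
        y = float(np.dot(w_dirs[k], x_dirs[k]))  # linear model output, Eq. (1)
        c = 1 if y >= b_dirs[k] else 0           # binary decision c(k)
        code += c << k                           # Eq. (3): sum_k 2^k * c(k)
    return code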
Algorithm 1 Learning the model parameters
Inputs: a set of training image pairs $I = \{(I_n^1, I_n^2)\,|\,n = 1, \ldots, N\}$, and sampling radius $r$.
1. Extract eight sets of pixel vectors as described in Section II.B.
2. For each direction $k = 0$ to $7$:
(i) Solve the generalized eigen-decomposition problem (6), and obtain $(\vec{w}_1^k, \vec{w}_2^k)$ by taking the eigenvector corresponding to the largest eigenvalue.
(ii) Compute $b_1^k$ and $b_2^k$ with (7).
Outputs: the model parameters $(\vec{w}_1^k, \vec{w}_2^k)$ and $(b_1^k, b_2^k)$.
Fig. 3: Illustration of (a) the sampling patterns with radius = 2 and radius = 3; and (b) the split of the long pixel vector into eight associated per-direction pixel vectors.
C. Learning the Linear Model Parameters
In this section, we elaborate on the adaptation of the model parameters $\vec{w}$ and threshold $b$ in (1). Suppose we are given a set of training face pairs $I = \{(I_n^1, I_n^2)\,|\,n = 1, \ldots, N\}$, where $I_n^1$ represents the $n$-th training face image from the first modality (e.g., the photograph) and $I_n^2$ represents the image from the second modality (e.g., the NIR). These training pairs are then turned into eight groups of pixel-vector pairs, denoted as $D_k = \{(\vec{x}_m^{1,k}, \vec{x}_m^{2,k})\,|\,m = 1, \ldots, W \times H \times N\}$, where the superscript $k = 0, \ldots, 7$ indicates the direction and $W$, $H$ are the width and height of the training images respectively, as described in Section II.B. Our training aim is then to learn a set of model parameters $(\vec{w}_1^k, \vec{w}_2^k)$ for each of the groups that empirically maximizes the correlation between the outputs. Formally,
$$(\vec{w}_1^*, \vec{w}_2^*) = \arg\max_{(\vec{w}_1, \vec{w}_2)} \frac{\frac{1}{M}\sum_{m=1}^{M} \vec{w}_1^T \vec{x}_m^1 \vec{x}_m^{2T} \vec{w}_2}{\sqrt{\frac{1}{M}\sum_{m=1}^{M} \vec{w}_1^T \vec{x}_m^1 \vec{x}_m^{1T} \vec{w}_1}\;\sqrt{\frac{1}{M}\sum_{m=1}^{M} \vec{w}_2^T \vec{x}_m^2 \vec{x}_m^{2T} \vec{w}_2}} \quad (4)$$
where $M = W \times H \times N$ is the total number of training pixel pairs, and we have dropped the superscript $k$ for notational simplicity. Solving (4) is equivalent to fixing the denominator to 1 and maximizing the numerator, which leads to the following optimization problem:
$$\begin{aligned} \text{maximize:}\quad & \sum_{m=1}^{M} \vec{w}_1^T \vec{x}_m^1 \vec{x}_m^{2T} \vec{w}_2 \\ \text{s.t.:}\quad & \sum_{m=1}^{M} \vec{w}_1^T \vec{x}_m^1 \vec{x}_m^{1T} \vec{w}_1 = M, \quad \sum_{m=1}^{M} \vec{w}_2^T \vec{x}_m^2 \vec{x}_m^{2T} \vec{w}_2 = M \end{aligned}$$
This constrained optimization problem can be solved by introducing Lagrange multipliers as follows:
$$L(\vec{w}_1, \vec{w}_2) = \vec{w}_1^T C_{12} \vec{w}_2 - \frac{\lambda_1}{2}\left(\vec{w}_1^T C_{11} \vec{w}_1 - M\right) - \frac{\lambda_2}{2}\left(\vec{w}_2^T C_{22} \vec{w}_2 - M\right) \quad (5)$$
where $C_{12} = \sum_{m=1}^{M} \vec{x}_m^1 \vec{x}_m^{2T}$, $C_{11} = \sum_{m=1}^{M} \vec{x}_m^1 \vec{x}_m^{1T}$ and $C_{22} = \sum_{m=1}^{M} \vec{x}_m^2 \vec{x}_m^{2T}$. Taking derivatives w.r.t. $\vec{w}_1$ and $\vec{w}_2$, we obtain the K.K.T. conditions:
$$C_{12} \vec{w}_2 = \lambda_1 C_{11} \vec{w}_1, \qquad C_{12}^T \vec{w}_1 = \lambda_2 C_{22} \vec{w}_2$$
Multiplying the first condition by $\vec{w}_1^T$ and the second by $\vec{w}_2^T$ and subtracting them (the two left-hand sides are the same scalar, while the constraints force the right-hand sides to equal $\lambda_1 M$ and $\lambda_2 M$ respectively), we arrive at $\lambda_1 = \lambda_2 = \lambda$, where
$$\lambda = \frac{\vec{w}_1^T C_{12} \vec{w}_2}{\sqrt{\vec{w}_1^T C_{11} \vec{w}_1}\;\sqrt{\vec{w}_2^T C_{22} \vec{w}_2}}$$
represents the correlation to be maximized. Substituting $\lambda_1$ and $\lambda_2$ with $\lambda$, the K.K.T. conditions can be written compactly as:
$$\lambda\, C_D \begin{bmatrix} \vec{w}_1 \\ \vec{w}_2 \end{bmatrix} = C_O \begin{bmatrix} \vec{w}_1 \\ \vec{w}_2 \end{bmatrix} \quad (6)$$
where $C_D = \begin{bmatrix} C_{11} & 0 \\ 0 & C_{22} \end{bmatrix}$ and $C_O = \begin{bmatrix} 0 & C_{12} \\ C_{12}^T & 0 \end{bmatrix}$. In solving the generalized eigen-decomposition problem (6), we pick the eigenvector corresponding to the largest eigenvalue, as the eigenvalue $\lambda$ corresponds to the correlation that we want to maximize.
By solving (6), we obtain the optimal weight vectors in the sense of maximum correlation. For each binary encoder, the auxiliary threshold parameter $b$ defined in (1) is determined by maximizing the entropy of the binary codes in order to optimize the discriminative ability [20]. We found that setting it to the mean of the model output gives a good approximation:
$$b_1 = \frac{1}{M}\sum_{m=1}^{M} \vec{w}_1^T \vec{x}_m^1, \qquad b_2 = \frac{1}{M}\sum_{m=1}^{M} \vec{w}_2^T \vec{x}_m^2 \quad (7)$$
Algorithm 1 describes the entire procedure for the learning
of model parameters.
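A minimal sketch of steps 2(i)-(ii) of Algorithm 1 for a single direction is given below, assuming the paired L2-normalized pixel vectors of that direction are stacked row-wise in the arrays X1 and X2 (hypothetical names); the small ridge term is added only for numerical stability and is not part of the formulation above.

import numpy as np
from scipy.linalg import eigh

def learn_direction(X1, X2, ridge=1e-6):
    """Learn (w1, b1, w2, b2) for one direction via Eqs. (6) and (7).

    X1, X2: arrays of shape (M, d) holding corresponding pixel vectors
    from the two modalities (row m of X1 is paired with row m of X2).
    """
    M, d = X1.shape
    C11 = X1.T @ X1                               # sum_m x1_m x1_m^T
    C22 = X2.T @ X2                               # sum_m x2_m x2_m^T
    C12 = X1.T @ X2                               # sum_m x1_m x2_m^T
    # Block matrices C_D and C_O of Eq. (6).
    CD = np.block([[C11, np.zeros((d, d))],
                   [np.zeros((d, d)), C22]])
    CO = np.block([[np.zeros((d, d)), C12],
                   [C12.T, np.zeros((d, d))]])
    CD = CD + ridge * np.eye(2 * d)               # stabilizer (our assumption)
    # Generalized eigen-decomposition CO v = lambda CD v; eigh returns
    # eigenvalues in ascending order, so the last column is the maximizer.
    _, vecs = eigh(CO, CD)
    w1, w2 = vecs[:d, -1], vecs[d:, -1]
    # Thresholds of Eq. (7): mean model output on each modality.
    b1 = float(np.mean(X1 @ w1))
    b2 = float(np.mean(X2 @ w2))
    return w1, b1, w2, b2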
D. The Maximum Correlation Feature Descriptor.
In this part we present how to extract facial features based on the proposed model. Given training face pairs and a sampling radius $r$, we first train the model parameters with Algorithm 1 and then encode face images using the scheme introduced in Section II.B. The Maximum Correlation feature is then created in the following manner:
1. Divide the whole encoded image into a set of overlapping patches of size 30x30 pixels (overlap factor = 0.3 in our system).
2. Compute, over each patch, the histogram of the frequency of each code, which gives a feature vector for the patch.
3. Concatenate the outputs of all patches into a long vector to form the final face feature.
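A small sketch of this histogram step follows; the patch size, the 0.3 overlap factor and the 256 possible codes come from the text, while the stride rounding is our assumption.

import numpy as np

def histogram_feature(coded, patch=30, overlap=0.3, n_codes=256):
    """Concatenate per-patch code histograms into one long face feature.

    coded: 2-D array of integer codes (one per pixel) from the encoder.
    """
    step = int(round(patch * (1.0 - overlap)))     # 21-pixel stride for 0.3 overlap
    feats = []
    for top in range(0, coded.shape[0] - patch + 1, step):
        for left in range(0, coded.shape[1] - patch + 1, step):
            block = coded[top:top + patch, left:left + patch]
            hist = np.bincount(block.ravel(), minlength=n_codes)
            feats.append(hist.astype(np.float64))  # frequency of each code
    return np.concatenate(feats)                   # the long face feature

The multi-scale feature described next is then simply the concatenation of such vectors computed from the encoded images at the different sampling radii.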
Fig. 4: Comparison of our descriptor with the popular facial feature descriptors (ROC curves of verification rate versus false acceptance rate for LBP, HOG, BIF, MLBP and the proposed descriptor).
In order to further enhance the discriminative ability of the feature descriptor, we apply the multi-scale sampling
technique, similar to the Multi-scale LBP (MLBP) [14]. Specifically, we first extract the face features with sampling radii = 3, 5, 7, 9 (Figure 3(a) shows radii = 2, 3), and the final face feature is formed by concatenating the features of the different scales. Figure 5 shows example encoded face images of a photograph and an NIR image with sampling radii = 3, 5, 7, 9. We can see that the encoded face images of the photograph and the NIR image appear much more similar to each other (contributing to smaller within-class variation) than their original face images.
III. EXPERIMENT
In this section, we explore the performance of the proposed maximum correlation feature descriptor on the NIR-photograph face recognition task, based on a dataset consisting of both NIR face images and the corresponding photographs from 2800 different persons, with each person having one NIR image and one photograph. The dataset is randomly divided into two non-overlapping parts: 1400 pairs are used for model training, and the remaining 1400 pairs are used for testing (matching 1400 NIR face images to 1400 photographs).
In the preprocessing steps, we normalize the face images as follows: 1) rotate the face images to align the vertical face orientation; 2) scale the face images so that the distance between the two eyes is the same for all images; 3) crop the face images to 120x150 pixels to remove the background and the hair region. Some example images are shown in Figure 1.
Following the same configuration as CITE [6], we use PCA+LDA for subspace analysis. Specifically, during the feature classification stage, we compute the matching score as follows: 1) Project the long feature into a PCA subspace of dimension 400 to reduce the possible noise. 2) Project features in the PCA subspace into another subspace, the LDA subspace of dimension 350, to further minimize the within-class variations while maximizing the between-class distances. 3) Compute the matching score from NIR to photograph in the LDA subspace with the cosine distance.
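A minimal sketch of this matching stage follows, assuming the PCA mean and the PCA and LDA projection matrices have already been fitted on the training split (all variable names below are hypothetical).

import numpy as np

def match_scores(probe_feats, gallery_feats, mean, P_pca, P_lda):
    """Cosine matching scores between projected probe and gallery features.

    probe_feats, gallery_feats: (n, D) arrays of long face features.
    mean: (D,) training mean; P_pca: (D, 400) and P_lda: (400, 350)
    projection matrices.  Higher score means a better match.
    """
    def project(F):
        Z = (F - mean) @ P_pca @ P_lda                  # PCA then LDA subspace
        return Z / np.linalg.norm(Z, axis=1, keepdims=True)
    P, G = project(probe_feats), project(gallery_feats)
    return P @ G.T                                      # cosine similarity matrix

# Rank-1 identification rate when probe i corresponds to gallery i:
# scores = match_scores(probes, gallery, mean, P_pca, P_lda)
# rank1 = np.mean(np.argmax(scores, axis=1) == np.arange(scores.shape[0]))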
Fig. 5: Illustration of the encoded face images. The first row shows the original photograph as well as the encoded images with sampling radii = 3, 5, 7, 9. The second row shows the NIR face image from the same subject and the corresponding encoded face images.
A. Comparison with the popular facial descriptors.
Firstly, we explore the effectiveness of the proposed feature descriptor by comparing it with popular facial descriptors: the LBP with sampling radius = 1 and a patch size of 16x16 pixels with overlap factor 0.5; the MLBP with a similar setting except that the sampling radii = 1, 3, 5, 7; the HOG with 12 discrete orientations; and finally the Biologically Inspired Features (BIF) [16]. For a fair comparison, we use the same configuration for all algorithms, and the parameters of all the feature descriptors are tuned to their best settings according to the respective papers.
Recognition performance is reported as rank-1 identification rates as well as Receiver Operating Characteristic (ROC) curves, as shown in Table I and Figure 4. From these results, we can see that by adaptively maximizing the correlation between photograph and NIR images, our feature descriptor achieves a significant improvement over the existing feature descriptors in both identification rates and verification rates.
TABLE I: Comparison with the popular facial descriptors.
Listed are the Rank-1 identification accuracies.
Feature Descriptors Accuracies
BIF [16] 58.21%
LBP [13] 63.72%
MLBP [14] 69.21%
HOG [19] 62.14%
Proposed 76.43%
B. Comparison with the state-of-the-art algorithms for HFR.
Next, we compare our approach with the following state-of-the-art algorithms for HFR in the literature. The algorithms are tuned to their best settings according to their papers.
• Coupled-Information Tree Encoding (CITE) [6]. It
encodes each pixel with a learned coupled-information
forest which is formed by greedily maximizing the mutual
information between the heterogeneous face images.
• Partial Least Squares (PLS) [9]. It linearly maps images
in different modalities to a common correlated subspace.
• Randomized LDA (RLDA) [17]. It performs linear discriminant analysis on a collection of random subspaces. Multiple feature-based random subspaces are learned and fused by concatenating features from these random subspaces.
• Coupled Discriminant Analysis (CDA) [18]. It makes use of all samples from different modalities to represent the coupled projections, and incorporates the locality information in the kernel space as a smoothness constraint.
Fig. 6: Illustration of face images that cannot be correctly identified by our system. The first row shows the probe NIR face images, the second row shows the photographs retrieved by our system, and the third row shows the ground-truth photographs corresponding to the probe NIR face images.
Experimental results are reported as rank-1 identification rates in Table II, from which we make the following observations: 1) the feature-adaptive approaches (CITE and the proposed one) have noticeable advantages over the subspace-adaptive approaches (PLS, RLDA, CDA); 2) our approach achieves the best identification performance on the task of matching NIR to photograph; 3) Figure 6 shows some face images that cannot be correctly identified by our system, from which we can see that the retrieved photographs are very similar to the probe NIR images, and some ground-truth photographs appear quite different from the corresponding NIR face images.
TABLE II: Comparison with the state-of-the-art algorithms.
Listed are the Rank-1 identification accuracies.
HFR Algorithms Accuracies
CITE [6] 72.53%
PLS [9] 62.72%
Randomized LDA [17] 65.29%
Coupled Discriminant Analysis [18] 71.21%
Proposed 76.43%
IV. CONCLUSION
In this paper we present a new approach for matching heterogeneous face images by designing a new learning-based feature descriptor. Extensive experiments on a large NIR-photograph face dataset clearly show that by maximizing the correlation between the encoded heterogeneous images at the feature extraction stage, we can improve recognition performance over the state-of-the-art.
REFERENCES
[1] X. Tang and X. Wang, "Face Sketch Recognition", IEEE Transactions on Circuits and Systems for Video Technology, 14(1), 50-57, 2004.
[2] B. Xiao, X. Gao, D. Tao, Y. Yuan and J. Li, "Photo-sketch Synthesis and Recognition based on Subspace Learning", Neurocomputing, 73, 840-852, 2010.
[3] X. Wang and X. Tang, "Face Photo-sketch Synthesis and Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11), 1955-1967, 2009.
[4] Q. Liu, X. Tang, H. Jin, H. Lu and S. Ma, "Nonlinear Approach for Face Sketch Synthesis and Recognition", Proceedings of CVPR, 1005-1010, 2005.
[5] B. Klare, Z. Li and A. K. Jain, "Matching Forensic Sketches to Mug Shot Photos", IEEE Transactions on Pattern Analysis and Machine Intelligence, 639-646, 2010.
[6] W. Zhang, X. Wang and X. Tang, "Coupled Information-Theoretic Encoding for Face Photo-Sketch Recognition", Proceedings of CVPR, 513-520, 2011.
[7] A. Li, S. Shan, X. Chen and W. Gao, "Maximizing Intra-individual Correlations for Face Recognition Across Pose Differences", Proceedings of CVPR, 605-611, 2009.
[8] W. Yang, D. Yi, Z. Lei, J. Sang and S. Z. Li, "2D-3D Face Matching Using CCA", Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, 1-6, 2008.
[9] A. Sharma and D. W. Jacobs, "Bypassing Synthesis: PLS for Face Recognition with Pose, Low-Resolution and Sketch", Proceedings of CVPR, 593-600, 2011.
[10] F. Jurie and B. Triggs, "Creating Efficient Codebooks for Visual Recognition", Proceedings of ICCV, Vol. 1, 604-610, 2005.
[11] Z. Cao, Q. Yin, X. Tang and J. Sun, "Face Recognition with Learning-based Descriptor", Proceedings of CVPR, 2707-2714, 2010.
[12] J. Shotton, M. Johnson and R. Cipolla, "Semantic Texton Forests for Image Categorization and Segmentation", Proceedings of CVPR, 1-8, 2008.
[13] T. Ahonen, A. Hadid and M. Pietikäinen, "Face Recognition with Local Binary Patterns", Proceedings of ECCV, 469-481, 2004.
[14] T. Mäenpää and M. Pietikäinen, "Multi-Scale Binary Patterns for Texture Analysis", Proceedings of SCIA, Gothenburg, Sweden, 885-892, 2003.
[15] T. Ojala, M. Pietikäinen and D. Harwood, "Performance evaluation of texture measures with classification based on Kullback discrimination of distributions", Proceedings of the 12th IAPR International Conference on Pattern Recognition, Vol. 1, 582-585, 1994.
[16] Guowang Mu, "Human age estimation using bio-inspired features", Proceedings of CVPR, 112-119, 2009.
[17] B. Klare and A. Jain, "Heterogeneous Face Recognition: Matching NIR to Visible Light Images", Proceedings of ICPR, 2010.
[18] Z. Lei, S. Liao, A. K. Jain and S. Z. Li, "Coupled Discriminant Analysis for Heterogeneous Face Recognition", IEEE Transactions on Information Forensics and Security, 7, 1707-1716, 2012.
[19] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection", Proceedings of CVPR, Vol. 1, 886-893, 2005.
[20] T. Jebara, "Feature selection and dualities in maximum entropy discrimination", Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 291-300, 2000.