Real Time Face Recognition System
Real time face recognition system using Radon and DCT transform
Neha Rathore
Shivani Pandya
[EE586- Project Report]
Contents
Abstract
Introduction
System Description
Hardware
Software
Selection of Algorithms
Preprocessing
Downsampling
Normalization
Wide-Sense histogram equalization
Full Scale linear Scaling
Feature Selection
PCA
Linear Discriminant Analysis (LDA)
Elastic Bunch Graph Matching
Radon
Discrete Cosine Transform
Feature extraction using Radon transform and DCT
Classification
Euclidean Distance
K-Nearest Neighbor Classifier
Implementation and Results
Database Collection
Dimensionality reduction
Downsampling
Selection of downsample image size
Radon Transform
DCT Transform
Selection of number of coefficients
Selection of Classifier (ED results and KNN)
Selection of value of K
Challenges
Dimensionality
Processing time for Radon for different image sizes
Scatter plot of data
Limitations with REAL TIME RESPONSE
Displaying classification result
Imposter model
Quantitative Results
Conclusion
Future Work
References
Abstract
We propose a real time face recognition system suitable for small businesses and home
security systems. The goal of this project is to build a system that works in real world
situations where the user is not constrained by lighting conditions, slight variations in
pose, or in-plane rotation. The system is designed for good recognition rates and addresses
common problems in face recognition systems such as illumination and rotation.
For a system based purely on face recognition, it is particularly difficult to achieve good
recognition rates without some kind of clustering or linear separation of the data.
The system is a prototype of a real time face recognition system that uses the Radon transform
and the 2D DCT for feature extraction and KNN for classification, giving an acceptable
performance of 86% on a small database. The performance of the system is presented in terms of
recognition rates for various combinations of feature extraction techniques. The system
shows a recognition rate as high as 98% for the offline data, which consisted of 30 test images,
20 validation images and 10 training images.
Introduction
In today’s age of technology, small-scale security systems are in high demand. Such a security
system should be non-obtrusive and require little user interaction. Facial recognition
satisfies these needs, as showing the face does not require any specific action from the user.
Moreover, if a system can recognize a user in an uncontrolled environment, it
becomes a non-obtrusive means of recognition.
Some of the desirable features of such systems include ease of use, low error rates, low cost of
implementation, portability and ease of integration. This report describes a prototype
implementation of face recognition and verification algorithms in a stand-alone system using
the TI TMS320C6713 floating point processor. The system is organized to capture an image
sequence, find the features of the face in the images, and recognize and identify a person from a
database of 20 people in indoor lighting conditions.
For each person, an image database is collected in two sessions, possibly on different days, to
capture the maximum variance in illumination and face pose; 30 images per session were
stored in the database for each person.
One of the main challenges in a face recognition system is finding informative and
discriminative features for each class of images. A 2D-DCT of face images is very sensitive to
pose variations, whereas the most commonly used techniques, such as PCA and LDA, are
computationally very expensive for a small hardware system. Face recognition based on
Gaussian mixture models also poses a difficulty: singularities arise because the feature
set is large compared with the small number of training prototype images. Hence, for this
system we used a combination of the Radon transform and the 2D DCT to obtain features that
capture the low frequency information which is crucial to a face recognition system. The
property of the Radon transform to enhance the low frequency components, which are useful for
face recognition, has been exploited to derive effective facial features. The data compaction
property of the DCT yields a lower-dimensional feature vector. The proposed technique computes
Radon projections in different orientations and captures the directional features of the face
images. Further, the DCT applied to the Radon projections provides frequency features. The
technique is invariant to in-plane rotation (tilt) and robust to zero mean white noise. The
system is also tested for the combinations 2D-DCT only (10 coefficients), 2D-DCT only (5
coefficients) and Radon + DCT (10 coefficients), using a simple Euclidean distance based
classifier and a k-nearest neighbor classifier for different values of K.
System Description
Hardware:
The system is implemented using a floating point DSP processor, the TI TMS320C6713, along with
the DSP STAR TFT LCD Video Daughtercard (VM3224K2) and the Color TeleCamera NCK41CV
camera. The DSP board has 16 KB of internal RAM, 16 MB SDRAM and
512 MB external RAM. The system is designed as a standalone application and does not need
the intervention of a computer once it has been loaded for the first time. The training feature
set is computed offline on the board from the training image set and stored in SDRAM at the
time of loading the program.
Input from the camera: The camera delivers images at a rate of 30, 15 or 7 frames per second
in 16 bit YUV format. This is a low resolution camera and does not perform well in
low lighting conditions, where it introduces a lot of noise which makes recognition very
difficult. Although the system is programmed to achieve good performance even in moderate
lighting conditions, the camera is still operated in good lighting conditions.
Software:
The face recognition algorithm is as shown in the figure:
The system first captures the image of the person through the camera in the 16 bit YUV
format. The gray level image of the person is then extracted from it; this is
simply the Y value of the image obtained from the camera. The Y value is then converted to
8 bit format to obtain the 8 bit gray level input image.
This image is then preprocessed to make it suitable for the recognition step. First, the image is
downsampled from 128x128 to 64x64, followed by normalization to compensate for
illumination changes. Since the most common method, histogram equalization, introduces
artificial grey level values, we preferred contrast enhancement by the linear scaling method.
Once the image is normalized, it is sent to the feature extraction step, where
first the Radon transform of the image is calculated for 32 rotation angles and a 32x44 image is
obtained.
System Layout
Then a 2D-DCT of this image is performed to capture the low frequency components
important for face recognition. During training, this process is followed to collect the feature
set of the training images and store it in a file. This feature file is then loaded into the
system along with the program and used as the database.
In the test stage, this process produces a feature vector which is then compared to the n feature
vectors of the training set, and a distance measure is calculated to each training
feature vector. This is followed by k-nearest neighbor classification, which sorts the
distances and takes the k closest neighbors; the final classification
depends on the frequency of the class indices among these neighbors.
Selection of Algorithms
There are various algorithms for face recognition that use either the eigenfaces approach,
the geometric features approach or the appearance based approach for extracting features from
the given image set.
Preprocessing:
Each method of feature selection has its own limitations: it may be illumination
variant, pose variant or dependent on similarity with the training set. For this reason, before
applying any feature extraction method we first normalize the image to reduce the effect of
lighting and rotations. Similarly, high dimensional data poses a problem in terms of
processing time, redundant information and a very large feature set, leading to the “curse of
dimensionality” during the classification stage. Hence, we downsample the image to reduce
the number of dimensions.
Downsampling
As mentioned above, downsampling is an efficient way of reducing redundant information in
an image, which might otherwise lead to unnecessary feature values that do not contribute much
to the final classification. Downsampling largely maintains the information content (entropy)
of the image while reducing the number of dimensions. There are typically two ways of
downsampling an image: bilinear interpolation and pixel averaging.
Bilinear interpolation leads to a sharper image, as each pixel value is reconstructed from
the values of its neighbors, thereby taking into account the variations among the
neighboring pixels. In image processing tasks, bilinear interpolation is usually the preferred
method for image reconstruction.
In downsampling by pixel averaging, we simply sum the values of all the pixels in a block and
divide by the number of pixels, thereby averaging over those n pixel values. As a
result of averaging, the sharp features of the image are lost, resulting in blurring. The
only other disadvantage of this method is the rounding error that occurs when the average of
the pixels is not an integer value. However, this is a very small error and does not lead to a
significant difference in the pixel values.
EXAMPLE: Downsampling
[Figure panels: Original 128x128 image; Pixel Averaging; Bilinear Interpolation]
The edges of an image are represented as high frequency components in the
frequency domain, while smooth regions of the image, such as the cheek, nose and forehead,
correspond to low frequency components. As we need to capture the low frequency components of
the image for good face recognition, the pixel averaging method is more suitable in our case.
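The pixel averaging step described above can be sketched as follows. This is a minimal illustration rather than the code that ran on the DSP board: the function name `downsample_by_averaging`, the list-of-lists image representation and the round-half-up rule are assumptions made for the example.

```python
def downsample_by_averaging(image, factor=2):
    """Shrink a 2D gray level image by averaging non-overlapping factor x factor blocks.

    `image` is a list of rows of integer gray levels (hypothetical representation).
    """
    rows, cols = len(image), len(image[0])
    if rows % factor or cols % factor:
        raise ValueError("image dimensions must be multiples of factor")
    n = factor * factor  # number of pixels averaged per output pixel
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block_sum = sum(image[r + dr][c + dc]
                            for dr in range(factor) for dc in range(factor))
            # Integer average, rounded half up; this is the small rounding
            # error mentioned in the text (rounding rule is an assumption).
            out_row.append((block_sum + n // 2) // n)
        out.append(out_row)
    return out
```

With `factor=2`, a 128x128 input yields the 64x64 image used in the rest of the pipeline.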
Normalization
Normalization of the image is important to compensate for illumination changes.
HISTOGRAM EQUALIZATION: The most common method to normalize an image is
histogram equalization, which redistributes the grey levels in the image so as to attain
a uniform grey level distribution (pdf) for the image.
Wide-Sense histogram equalization
In this method we stretch the original histogram to cover the whole 0-255 range of gray levels.
This technique does not guarantee an equal number of pixels in each gray level, but gives a
contrast enhanced version of the input image. We use the following formula:
$$O_i = \left( \sum_{j=0}^{i} N_j \right) \times \frac{\text{Max. Intensity Level}}{\text{No. of Pixels}}$$
Here $N_j$ is the number of pixels at gray level $j$ and $O_i$ is the new intensity level
assigned to the old intensity level $i$. Max. Intensity Level is the maximum intensity level
which a pixel can take. For example, if the image is in the grayscale domain, then this value
is 255. If the image is of size N × N, then the No. of Pixels is N^2. The expression in the
bracket, divided by the No. of Pixels, is the CDF value for the input gray level. This is how
the new intensity levels are calculated from the old intensity levels.
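As an illustration of this mapping, the sketch below builds the cumulative pixel count and scales it by the maximum intensity level. The function name `equalize_histogram` and the use of integer floor division are assumptions; the report does not state how fractional levels were rounded.

```python
def equalize_histogram(image, max_level=255):
    """Map each gray level i to (cumulative pixel count up to i) * max_level / total pixels."""
    total = len(image) * len(image[0])
    counts = [0] * (max_level + 1)
    for row in image:
        for p in row:
            counts[p] += 1
    mapping = []
    running = 0
    for n in counts:
        running += n  # running / total is the CDF at this gray level
        mapping.append(running * max_level // total)  # floor rounding (assumption)
    return [[mapping[p] for p in row] for row in image]
```

For a tiny image such as `[[0, 0], [1, 1]]`, half of the pixels lie at or below level 0, so level 0 maps to 127 and level 1 maps to 255.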
LINEAR SCALING: The problem with the histogram equalization method is that it introduces
artificial grey levels at different locations of the image, depending on the gray level
distribution. This is not desirable for our face recognition system. Hence, for our system we
chose to normalize the image by linear scaling, which stretches the pdf of the image so
that it covers the whole gray scale range while keeping the variations in the image intact.
Full Scale linear Scaling
There are three common linear scaling methods, the first one is called Linear Image Scaling,
in which the processed image is linearly mapped over its entire range; the second one is called
Linear Image Scaling with Clipping, where the extreme amplitude values of the processed
image are clipped to maximum and minimum limits. The last one is called Absolute Value
Scaling, which utilizes an absolute value transformation for visualizing an image with
negatively valued pixels. The second technique is often subjectively preferable, especially for
images in which a relatively small number of pixels exceed the limits.
For our purpose, we are going to implement the second method, which is Linear Image
Scaling. The idea of linear scaling is illustrated below.
This process involves mapping the histogram of the input image in such a way that the
histogram of the output image covers the entire range [0-255] of gray scale levels. The
main challenge here is to determine the mapping range. Low contrast images can be the result
of poor illumination or a lack of dynamic range in the imaging sensor. These low contrast
images have a very low dynamic range. Thus the primary idea is to increase the dynamic
range of these images, that is, to stretch the range linearly from low to high. The mapping is
given by the equation
$$G = H(F) = G_{min} + \frac{G_{max} - G_{min}}{F_{max} - F_{min}} \, (F - F_{min})$$
where
(Fmin, Fmax) = minimum and maximum grey levels occupied in the input image,
(Gmin, Gmax) = minimum and maximum grey levels desired in the output image.
This equation has the form of the line y = mx + c, where m is the slope and c is the
intercept on the y axis. In our case the slope is given by the quantity (Gmax - Gmin) / (Fmax -
Fmin). When we set Gmin = 0 and Gmax = 255, we cover the entire range for 8 bit images;
hence the process is called full range linear scaling.
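A minimal sketch of this mapping is given below, assuming 8 bit integer gray levels stored as a list of rows. The function name `full_scale_linear`, the round-half-up integer arithmetic and the handling of a completely flat image are assumptions made for illustration.

```python
def full_scale_linear(image, g_min=0, g_max=255):
    """Stretch the occupied gray range [Fmin, Fmax] linearly onto [g_min, g_max]."""
    f_min = min(min(row) for row in image)
    f_max = max(max(row) for row in image)
    span = f_max - f_min
    if span == 0:
        # Flat image: there is no range to stretch (handling is an assumption).
        return [[g_min for _ in row] for row in image]
    # G = Gmin + (Gmax - Gmin) / (Fmax - Fmin) * (F - Fmin), rounded half up
    return [[g_min + ((p - f_min) * (g_max - g_min) + span // 2) // span
             for p in row] for row in image]
```

For example, an image occupying only the levels 100 to 200 is stretched so that 100 maps to 0, 200 maps to 255 and 150 maps to 128.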
EXAMPLE: NORMALIZATION
[Figure: original bright, dark and midtone images, respectively, each shown with its
histogram equalized and linearly scaled versions]
Analysis: The figure above shows image normalization by the histogram equalization and
linear scaling methods. As mentioned before, we see that the histogram equalization method
introduces unwanted effects like contouring as well as unwanted grey levels. On the
other hand, the linear scaling method enhances the contrast in a way that suppresses any sudden
occurrence of bright light and also compensates for poor lighting conditions. Hence our
selection of the linear scaling method for image normalization is well justified.
Feature Selection
PCA:
PCA is one of the most successful techniques used in face recognition algorithms. The
purpose of PCA is to reduce the large dimensionality of the data space to the smaller intrinsic
dimensionality of the feature space which is needed to describe the data. This is possible when
there is a strong correlation between the observed variables. The main idea of using PCA for
face recognition is to express the large 1D vector of pixels constructed from a 2D facial image
in terms of the compact principal components of the feature space. For a set of D-dimensional
vectors $\{x_i\}_{1}^{n}$, the principal components are the M dominant eigenvectors of the
sample covariance matrix, formulated as follows:
$$C = \sum_{i} (x_i - \mu)(x_i - \mu)^T$$
where $\mu$ is the sample data mean and each $v_j$ is an eigenvector of the covariance matrix
$C$ with associated eigenvalue $\lambda_j$:
$$C v_j = \lambda_j v_j$$
Linear Discriminant Analysis (LDA)
LDA is closely related to PCA in that both look for linear combinations which best explain the
data. LDA models the difference between the classes to make the class clusters more separable.
Suppose that each of C classes has a mean $\mu_i$ and the same covariance $\Sigma$. Then the
between class variability may be defined by the sample covariance of the class means:
$$\Sigma_b = \frac{1}{C} \sum_{i=1}^{C} (\mu_i - \mu)(\mu_i - \mu)^T$$
where $\mu$ is the mean of the class means. The class separation in a direction $\vec{w}$
in this case will be given by:
$$S = \frac{\vec{w}^T \Sigma_b \vec{w}}{\vec{w}^T \Sigma \vec{w}}$$
Linear Discriminant Analysis (LDA) finds the vectors in the underlying space that best
discriminate among classes. For all samples of all classes, the between-class scatter matrix
$S_B$ and the within-class scatter matrix $S_W$ are defined. The goal is to maximize $S_B$
while minimizing $S_W$, in other words, to maximize the ratio
$$\frac{\Delta S_B}{\Delta S_W}$$
Elastic Bunch Graph Matching
The EBGM approach uses the structural information of a face, which reflects the fact that
images of the same subject tend to translate, scale, rotate and deform in the image plane. It
makes use of a labeled graph: edges are labeled with distance information and nodes are
labeled with wavelet coefficients in jets. This model graph can then be used to
generate an image graph. The model graph can be translated, scaled, rotated and deformed during
the matching process. This can make the system robust to large variations in the images.
Radon
The Radon transform has been used to derive enhanced low frequency components, which are
useful in face recognition. The Radon transform of a 2 dimensional function $f(x, y)$ is
defined as:
$$R(r, \theta) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y) \, \delta(r - x\cos\theta - y\sin\theta) \, dx \, dy$$
where $r$ is the distance from the center of the image and $\theta \in [0, \pi]$.
The Radon transform is an efficient way of extracting frequency components in different
directions. The line integrals of the face image, computed during the Radon transform, amplify
low frequency components, which are useful for face recognition. The Radon transform can give
very good dimensionality reduction when a proper number of angles (0-179 orientations) is
chosen. It can achieve lossless compression, and it provides rotation invariance, which
is a very important factor in real time recognition systems.
Discrete Cosine Transform
The 2D DCT is an efficient way of transforming the image such that there is
a good distinction between the high and low frequency components. The DCT
enables a proper selection of low or high frequency coefficients because of the
way the coefficients are distributed spatially. The DCT of an image is
calculated as follows:
$$X_{k_1,k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x_{n_1,n_2} \cos\left[\frac{\pi}{N_1}\left(n_1 + \frac{1}{2}\right) k_1\right] \cos\left[\frac{\pi}{N_2}\left(n_2 + \frac{1}{2}\right) k_2\right]$$
The first DCT coefficient is the DC coefficient and is proportional to the average value of the
image. It mainly carries the illumination information and is hence sometimes discarded to
reduce the effect of illumination changes. The DCT has an excellent energy compaction property
and divides the image into regions of low and high frequency. It also facilitates feature
selection through zig-zag scanning or other methods. An 8x8 block DCT allows capturing of the
local frequency distribution and speeds up the overall computation.
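The block DCT can be sketched directly from the unnormalized 2D DCT-II definition. This is a straightforward, unoptimized illustration; the function name `dct2_block` is hypothetical, and a real-time implementation would normally use precomputed cosine tables or a fast algorithm instead of the quadruple loop.

```python
import math

def dct2_block(block):
    """Unnormalized 2D DCT-II of a small block (list of rows), e.g. 8x8."""
    n1, n2 = len(block), len(block[0])
    out = [[0.0] * n2 for _ in range(n1)]
    for k1 in range(n1):
        for k2 in range(n2):
            acc = 0.0
            for a in range(n1):
                cos_a = math.cos(math.pi / n1 * (a + 0.5) * k1)
                for b in range(n2):
                    acc += block[a][b] * cos_a * math.cos(math.pi / n2 * (b + 0.5) * k2)
            out[k1][k2] = acc
    return out
```

For a constant block, all the energy ends up in the DC coefficient `out[0][0]` and every other coefficient is numerically zero, which illustrates why the DC term mostly reflects overall illumination.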
OUR APPROACH
We are implementing the face recognition system on a DSP board with limited computation and
memory capabilities, which constrains our choice of algorithms. PCA and
LDA, though very standard methods for face recognition, do require computation of a
correlation matrix, which is expensive on such hardware. Hence, we decided to implement the
Radon transform and the DCT as our feature extraction algorithms.
Feature extraction using Radon transform and DCT (RDCT)
Facial features derived in the proposed approach are the frequency components in
different directions. The line integrals of the face image, computed during the Radon
transform, amplify low frequency components, which are useful in face recognition. The Radon
space image for 0-179 orientations is shown in the figure. The DCT is used to derive the
frequency features from the Radon space. The figure reveals the excellent energy compaction
property of the DCT. The significant DCT coefficients (10) are concatenated to form the facial
feature vector.
In our project, we have used 32 Radon orientations ranging from 0-180 degrees in steps of
5 degrees. The image was divided into blocks of 8x8 for the DCT and 10 coefficients from each
block were chosen. The number of DCT coefficients was chosen keeping in mind the
computational requirements, data separability and ease of classification. There are many
ways to choose the DCT coefficients of a block, such as methods based on the maximum
magnitude of the coefficients, maximum energy, or the variance of the coefficients.
There are also published results showing different recognition rates as an effect of
overlapping blocks and the percentage of overlap when calculating the 2D-DCT. In our case, the
DCT coefficients of non-overlapping blocks of an image are computed and ordered using zigzag
scanning. The main reason behind this is to keep the system computationally inexpensive and to
speed up the recognition process. The DCT coefficients extracted from each block are
concatenated to obtain the feature vector.
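The zigzag ordering and concatenation can be sketched as below. The traversal shown is the JPEG-style zigzag, which is an assumption, since the report does not specify the exact scan order; the names `zigzag_indices` and `block_features` are hypothetical.

```python
def zigzag_indices(n=8):
    """(row, col) index pairs of an n x n block in JPEG-style zigzag order (assumed order)."""
    order = []
    for s in range(2 * n - 1):  # s = row + col identifies one anti-diagonal
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        # Even diagonals run from bottom-left to top-right, odd ones the reverse.
        order.extend(diag[::-1] if s % 2 == 0 else diag)
    return order

def block_features(dct_blocks, keep=10):
    """Concatenate the first `keep` zigzag coefficients of every DCT block."""
    order = zigzag_indices(len(dct_blocks[0]))[:keep]
    features = []
    for block in dct_blocks:
        features.extend(block[i][j] for i, j in order)
    return features
```

Because the zigzag scan starts at the top-left corner, keeping the first 10 entries per block selects the lowest frequency coefficients of that block.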
Classification
Euclidean Distance:
The Euclidean distance is the straight-line distance between two points in one or more
dimensions. The Euclidean distance in n dimensional space is given by:
$$d = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + (p_3 - q_3)^2 + \cdots + (p_n - q_n)^2}$$
The classifier assigns the test sample to the class at the smallest distance from it. Each
class prototype is defined as the mean of the feature values of all the prototypes of that
particular class.
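A minimal sketch of this nearest class mean rule follows. The data layout (a dictionary from class label to a list of feature vectors) and the function names are assumptions chosen for illustration.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def class_means(samples):
    """samples: dict mapping class label -> list of feature vectors (hypothetical layout)."""
    means = {}
    for label, vectors in samples.items():
        n = len(vectors)
        means[label] = [sum(column) / n for column in zip(*vectors)]
    return means

def classify_nearest_mean(x, means):
    """Return the label whose class mean is closest to x."""
    return min(means, key=lambda label: euclidean(x, means[label]))
```

Only one distance per class has to be computed at test time, which keeps this classifier cheap compared with comparing against every training vector.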
K-Nearest Neighbor Classifier:
The k-Nearest Neighbor (k-NN) method assumes that all instances correspond to points in
n-dimensional space. The nearest neighbors of an instance are defined in terms of the standard
Euclidean distance. KNN is a form of instance based learning: a test sample is classified by
a majority vote of its k nearest neighbors.
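The voting rule can be sketched as follows, assuming the training set is held as a list of (feature vector, label) pairs. The function name `knn_classify` and the tie-breaking behaviour (ties go to the label seen first among the sorted neighbors) are assumptions, not details taken from the report.

```python
import math
from collections import Counter

def knn_classify(x, training, k=3):
    """Classify x by majority vote among its k nearest training vectors.

    training: list of (feature_vector, label) pairs (hypothetical layout).
    """
    distances = []
    for vector, label in training:
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, vector)))
        distances.append((d, label))
    distances.sort()  # closest neighbors first
    votes = Counter(label for _, label in distances[:k])
    # most_common keeps first-seen order for equal counts, so ties favor closer neighbors
    return votes.most_common(1)[0][0]
```

With `k=1` this reduces to a plain nearest neighbor search over all training vectors.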
Implementation and Results
Database Collection:
The database was collected for 20 people, with a minimum of 30 images each. Each subject was
asked to fit the face in the 128 x 128 box displayed on the LCD. No instructions were given to
the subjects in terms of pose or distance from the camera. The only constraint was to provide a
frontal image with no off-plane rotations, as the selected algorithm does not
support these variations. A subject is likely to adopt the natural distance and
pose at which he/she is most comfortable at the time of both training and testing; hence, no
further instructions were provided.
The database was collected in two sessions for 11 people to capture variance in pose,
illumination and zoom factor. The subject provided a frontal pose while the camera,
capturing images at 15 frames per second, buffered the images for 2 seconds. Once the program
had collected 30 images, it saved them in the database, creating a unique name for each file.
[Figure: Sample Images from Database (training images and test images)]
Dimensionality reduction
Downsampling
The original LCD size was 240x320, i.e. 76.8 Kbytes of data per frame. Taking into
consideration the 16 Kbyte size of the internal RAM, processing 1 frame of this size would
take about 4.8 seconds plus the processing time of the algorithms. In a real time system, such
high processing times are unacceptable, as they make the system very slow. As in our case we
buffer the images and then process them for classification, an image of such large dimensions
is impractical. Hence, we chose a smaller image size of 128x128, i.e. 16.4 Kbytes, which brings
down the processing time to 1 second per frame. However, considering that a video
based recognition system would use more than 1 frame for classification, a processing time
of 1 second per frame is still large. An image size of 128x128, when displayed on the LCD,
provides enough resolution and size for the user to see and position his/her face in front of
the camera. Hence, an image size of 128x128 seems reasonable for display and cropping
purposes.
Selection of downsample image size
The cropped image is buffered and the frames for 2 seconds are saved. This enables the system to
capture variations in the pose of the face. For a frame rate of 15 fps, we get 30 frames, leading
to a total buffer size of 491.5 Kbytes. Processing all the frames of such a large dataset is
time consuming. Hence, a need for further reduction in dimension arises. A
larger image contains a lot of redundant information, and hence reducing the size of the image
does not harm the classification as long as the image is still recognizable. We initially chose
an image size of 24x24, but it posed a lot of difficulty, as the image was too small and
contained much less information. Since the system is based only on face recognition,
we decided to go with a larger size of 64x64 to capture most of the information of the face
while keeping the dimension of the feature vector small enough for the processor
to handle without much delay.
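The buffer sizes quoted in this section follow from simple arithmetic, sketched below under the assumption of one byte per pixel (8 bit gray level images). The helper names `frame_bytes` and `buffer_bytes` are hypothetical.

```python
BYTES_PER_PIXEL = 1  # assumption: 8 bit gray level image

def frame_bytes(width, height):
    """Storage needed for a single frame."""
    return width * height * BYTES_PER_PIXEL

def buffer_bytes(width, height, fps, seconds):
    """Storage needed to buffer `seconds` of video at `fps` frames per second."""
    return frame_bytes(width, height) * fps * seconds

full_lcd = frame_bytes(240, 320)             # 76,800 bytes (76.8 Kbytes)
cropped = frame_bytes(128, 128)              # 16,384 bytes (about 16.4 Kbytes)
buffered = buffer_bytes(128, 128, 15, 2)     # 491,520 bytes (about 491.5 Kbytes)
downsampled = buffer_bytes(64, 64, 15, 2)    # 122,880 bytes after 64x64 downsampling
```

Downsampling each frame to 64x64 therefore cuts the 2 second buffer to a quarter of its size.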
Radon Transform
In-plane rotations can be handled well with the Radon transform, which
projects the image along various orientations and amplifies the low frequency components by
taking line integrals along the columns for each orientation. Since low frequency
components are the most important components for face recognition, the Radon transform is a
good choice for reducing dimensions and compacting the information contained in the image.
Since we used 32 Radon orientations, we obtained a final image size of 32x44. To prevent
clipping of data while rotating, the images were zero padded by 12 pixels on each side to make
the resulting image 88x88. The padding size was decided by taking into consideration the
diagonal length, which is 90. Since the Radon transform is followed by the DCT, where the image
is divided into blocks, we chose 88 as the column width so that it is divisible by 8.
This makes us lose 2 pixels from each side, but considering that the face
of a person is mostly located in the center of the frame, this loss should not cause a serious
error in classification. The Radon transform gives us a final image of 32x44, which is then
sent to the DCT stage.
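A discrete Radon projection can be sketched without explicitly rotating the image: each pixel is projected onto the axis at angle theta and accumulated into the nearest bin. This differs from the rotate-and-sum-columns procedure described above and is offered only as an illustration; the function name `radon_projections`, the binning rule and the parameter `n_bins` are assumptions, and the sketch does not reproduce the 44-bin output of the report.

```python
import math

def radon_projections(image, angles_deg, n_bins):
    """Approximate Radon transform: one projection (list of n_bins sums) per angle."""
    rows, cols = len(image), len(image[0])
    cy, cx = (rows - 1) / 2.0, (cols - 1) / 2.0  # image center
    half = n_bins / 2.0
    sinogram = []
    for angle in angles_deg:
        theta = math.radians(angle)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        projection = [0.0] * n_bins
        for y in range(rows):
            for x in range(cols):
                value = image[y][x]
                if value == 0:
                    continue  # zero padding contributes nothing
                # signed distance of the pixel from the center along direction theta
                r = (x - cx) * cos_t + (y - cy) * sin_t
                b = int(math.floor(r + half))
                if 0 <= b < n_bins:  # pixels beyond the detector are clipped
                    projection[b] += value
        sinogram.append(projection)
    return sinogram
```

Under these assumptions, a call such as `radon_projections(face, [5 * k for k in range(32)], 88)` would produce 32 projections for a 64x64 face image, with `n_bins` playing the role of the padded width.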
DCT Transform
A 2D DCT of an image accumulates the low frequency components of the image in
the top-left corner of the transformed image. A local, block based DCT helps to separate the
frequency components of the image on a local basis. This helps us select the coefficients from
important blocks containing the eyes, lips, nose, etc. without any loss of data. It has also
been shown that a local DCT captures frequency content better for face recognition purposes.
Hence, we decided to go with a local 2D DCT with a block size of 8x8. We divide the Radon image
into blocks of 8x8 and calculate the DCT coefficients for each block. We have 64 coefficients
per block, leading to 4096 total coefficients.
Low frequency vs. high frequency components: When the DCT is chosen as a feature selection
method, we can choose either high frequency components or low frequency
components as features. If we choose the high frequency components, the regions
containing edge information are selected as features. As the edges represent the shape of the
face and its features, this is sometimes considered a good approach, as it imitates the face
very closely. On the other hand, the low frequency components represent an approximation of the
image, so they can be considered a source of classification errors. However, choosing high
frequency components makes the dependency on the training set too high and can lead to high
classification errors if the subject does not present an image similar to his/her training set.
Choosing the low frequency components means that we are capturing most of the energy of the
image, which can in turn lead to better classification results. Hence, we chose to pick the low
frequency components as our feature set. Picking 10 coefficients from each block gives us
a feature vector of 200 coefficients, i.e. 0.8 Kbytes for each image. This is a dimensionality
reduction of 98%.
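The feature vector length can be checked with a short calculation. The sketch assumes that only complete 8x8 blocks of the 32x44 Radon image are used, that each coefficient is stored as a 4 byte float, and that the reduction is measured against the 128x128 cropped frame; none of these three points is stated explicitly in the report, and the function name is hypothetical.

```python
def feature_vector_length(radon_rows, radon_cols, block=8, keep=10):
    """Number of retained DCT coefficients for a Radon image split into block x block tiles."""
    # Partial blocks at the border are dropped (assumption).
    blocks = (radon_rows // block) * (radon_cols // block)
    return blocks * keep

length = feature_vector_length(32, 44)       # 4 x 5 blocks x 10 coefficients = 200
size_bytes = length * 4                       # 800 bytes with 4 byte floats (assumption)
reduction = 1.0 - length / (128 * 128)        # relative to the 128x128 frame (assumption)
```

Under these assumptions the result is 200 coefficients, 800 bytes, and a reduction of a little under 99%, consistent with the figures quoted above.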
Selection of number of coefficients
Number of Radon angles:
Selecting the number of Radon angles was an important decision. We wanted to cover the
whole range of 0-179 degrees while keeping the computational cost, in terms of time and
processing, very low. We noticed that a 128x128 image took 1 second to rotate for each angle.
This was due to the small size of the internal RAM. Hence, each frame took about 32 seconds
in total for the Radon transform. We overcame this hurdle by reducing the size of the image
to 64x64. The image then took only 8 seconds to complete the recognition process, including
the 32 rotations of the Radon transform. Although 8 seconds is quite high for a real-time
system, we decided to go with it and focus on optimizing the loops to reduce the time spent
on rotations. For the 32 angles, we took angles in steps of 5 degrees to cover the 0-179
range of orientations. However, we learnt later that a better approach could have been to
take the angles from 0 to 30 degrees in steps of two and then flip the results to obtain the
corresponding orientations in the negative direction. For example, flipping the orientation
of 10 degrees would give an orientation of -10 degrees, which is 170 degrees in effect. Also,
in a practical situation, a person would only rotate his face by up to about 30 degrees in
each direction. Hence, this approach sounded very intuitive and would probably have given
better results. This approach gives us 60 Radon angles, resulting in a final image of 60x44.
This was not a significant gain in dimensionality reduction in our case and hence, we decided
to go with the original 32 Radon angles.
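The rotate-and-sum form of the Radon transform described here can be sketched as below. This is an illustrative NumPy version under stated assumptions: nearest-neighbour rotation about the image centre, a random 64x64 stand-in image, and an output width equal to the image width. It does not reproduce the 32x44 output size of the report, since the text does not say how the width of 44 is obtained.

```python
import numpy as np

def rotate_nearest(image, angle_deg):
    """Rotate a square image about its centre with nearest-neighbour sampling."""
    n = image.shape[0]
    c = (n - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:n, 0:n]
    # map every output pixel back to its source location in the input
    src_x = np.cos(theta) * (xs - c) + np.sin(theta) * (ys - c) + c
    src_y = -np.sin(theta) * (xs - c) + np.cos(theta) * (ys - c) + c
    sx, sy = np.rint(src_x).astype(int), np.rint(src_y).astype(int)
    inside = (sx >= 0) & (sx < n) & (sy >= 0) & (sy < n)
    out = np.zeros(image.shape, dtype=float)
    out[inside] = image[sy[inside], sx[inside]]   # pixels rotated in from outside stay 0
    return out

def radon_by_rotation(image, angles):
    """One row per angle: column sums (line integrals) of the rotated image."""
    return np.array([rotate_nearest(image, a).sum(axis=0) for a in angles])

image = np.random.default_rng(0).random((64, 64))
angles = np.arange(32) * 5            # 32 angles in steps of 5: 0, 5, ..., 155 degrees
sinogram = radon_by_rotation(image, angles)
print(sinogram.shape)  # (32, 64)
```

Each angle costs one full image rotation, which is why the number of angles and the image size dominate the processing time discussed above.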
Number of DCT coefficients:
Selecting the number of DCT coefficients from each block was another important decision.
Too many coefficients would give a large feature vector, while too few would not describe
the face adequately. Since the images are gray level images, the symbol set consisted of the
0-255 gray level values. After feature extraction, it is very likely that the feature vectors
lie very close to each other. Hence, to differentiate between the data, a correct selection
of coefficients was necessary. A choice of 16 coefficients per block gave a final feature
vector of 1024 coefficients, resulting in a training set of 20 vectors of 1024 values each.
This is a very large number, considering that we need to calculate the distance of the
incoming vector from each vector in the training feature set. Hence, 16 coefficients per
block was an expensive choice for our existing system. Since this is a face-only recognition
system, taking too small a number of coefficients would also be impractical. To capture the
low-frequency components properly, we take 10 DCT coefficients per block in a zigzag manner
such that we have the highest-magnitude coefficients in our feature set. Coefficients with
larger magnitude affect the classification rate more than coefficients with lower magnitude.
Taking 10 DCT coefficients per block resulted in a feature vector of 200 values. This seemed
a reasonable choice. Another option was taking 5 DCT coefficients per block, which resulted
in a feature vector of 100 values. This again is a reasonable number for our existing system.
The system has also been tested for DCT-only classification. For this purpose, DCT
coefficients are extracted from the original 64x64 normalized images, resulting in a feature
vector of size 640 for 10 coefficients per block and 320 for 5 coefficients per block.
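The feature-vector sizes quoted in this section follow from simple block arithmetic, sketched below. The helper names are hypothetical, and the sketch assumes that only full 8x8 blocks are used, which gives 20 blocks for the 32x44 Radon image and 64 blocks for the 64x64 image.

```python
def n_blocks(height, width, block=8):
    """Number of full block x block tiles in an image (partial tiles dropped)."""
    return (height // block) * (width // block)

def feature_length(height, width, per_block, block=8):
    """Feature-vector length when `per_block` coefficients are kept per tile."""
    return n_blocks(height, width, block) * per_block

print(feature_length(32, 44, 10))  # 200  (Radon + DCT, 10 per block)
print(feature_length(32, 44, 5))   # 100  (Radon + DCT, 5 per block)
print(feature_length(64, 64, 10))  # 640  (DCT only, 10 per block)
print(feature_length(64, 64, 5))   # 320  (DCT only, 5 per block)
print(feature_length(64, 64, 16))  # 1024 (DCT only, 16 per block)
```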
Selection of Classifier ( ED results and KNN):
For this system we wanted a classifier that is inexpensive in terms of computation yet
yields a good classification rate. The Euclidean distance classifier is computationally very
light, but it relies entirely on the single minimum distance, which results in
misclassification when the test image varies even slightly from the training images. For the
Euclidean distance classifier, the recognition rate was 97% on the validation data but only
63% on the test data.
The k-nearest-neighbor (k-NN) classifier returns the most frequent class among the k
training vectors with the smallest distances to the test sample. So with k-NN, even if the
test sample happens to lie very close to a prototype of a wrong class, that class occurs only
once among the neighbors and is outvoted, so the sample has a better chance of being
classified correctly. k-NN is computationally heavier than the Euclidean distance classifier,
but it avoids the singularity problem of a GMM, which arises from the small database and the
large number of features; with the number of features low enough to give a non-singular GMM,
the model fails to capture the details of the face.
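The difference between the two classifiers can be illustrated with a small Python sketch. The 2D toy data, the function names, and the tie-break toward the lower class index are assumptions for illustration; this is not the code that ran on the board.

```python
import numpy as np
from collections import Counter

def euclidean_distances(train, x):
    """Distance from x to every row of train."""
    return np.sqrt(((train - x) ** 2).sum(axis=1))

def nearest_prototype(train, labels, x):
    """Minimum Euclidean distance classifier: label of the single closest vector."""
    return int(labels[int(np.argmin(euclidean_distances(train, x)))])

def knn_classify(train, labels, x, k):
    """Majority vote among the k closest vectors; ties go to the lower class index."""
    order = np.argsort(euclidean_distances(train, x), kind="stable")[:k]
    votes = Counter(int(labels[i]) for i in order)
    top = max(votes.values())
    return min(c for c, v in votes.items() if v == top)

# Toy data: class 0 clusters near the origin, class 1 has one stray vector nearby
train = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
                  [0.45, 0.45], [2.0, 2.0], [2.2, 2.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
x = np.array([0.4, 0.4])

print(nearest_prototype(train, labels, x))   # 1
print(knn_classify(train, labels, x, k=3))   # 0
```

In this toy case the single stray vector of class 1 decides the minimum-distance result, while with k=3 it is outvoted by two class 0 neighbors.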
Selection of value of K
A proper selection of the value of k is very important for a k-NN classifier. Too high a
value of k gives noisy classification, whereas too low a value may end up considering only
the very nearest neighbor, which might or might not be the correct class. We noticed by
experimentation that test samples that are very different from the training data generally
produce larger distances than more similar samples. Consequently, they are more likely to
cause a misclassification. The following diagrams show the effect of k on classification.
FIGURE: k-NN classification for different values of k (radius) in increasing order.
Let us say the incoming vector belongs to class RED. As we see, if the vector resembles the
training image it may lie very close to the feature vectors of class RED. Taking a smaller
radius of K helps in this case, as class RED gets the maximum frequency and the unknown
vector is classified as RED. Now, we increase the size of the radius under the assumption
that more vectors of the same class should fall inside the circle. However, we see that the
prototypes for class GREEN are now equal in number to those of class RED. Hence, on the
basis of the class indices, the classifier will classify the unknown vector as RED or GREEN,
whichever has the lower class index. If GREEN is chosen, this is clearly a wrong
classification even though the RED prototypes lie very close to the unknown vector. As we
increase the size of the radius further, the classification area becomes more and more noisy
and might result in more misclassifications than correct classifications.
In our experiments, we noticed that the classification was affected by the value of K in a
similar way. We also sometimes noticed that the correct class was within the first 10
nearest neighbors of the input vector, but the classification was still incorrect. This was
probably due to the reasons mentioned above.
Also, when the test image is very different from the training images, the incoming vector
will lie very far from the training vectors (as shown in the figure) and hence will always be
misclassified. A large variation in the training set can help in this situation. However, in
our case, due to restrictions in data availability, we could not include many variant images
in the training set. Hence, the system gives good classification if the user presents an
image close to the training set.
Following is an example of test set misclassification due to large values of K.

String: ./Database/ashwin/video_ashwin_10.raw
Just finished radon
sorted value of 0 at 13
sorted value of 1 at 13
sorted value of 2 at 13
sorted value of 3 at 13
sorted value of 4 at 10
sorted value of 5 at 10
sorted value of 6 at 10
sorted value of 7 at 10
sorted value of 8 at 10
sorted value of 9 at 18
sorted value of 10 at 18
sorted value of 11 at 11
sorted value of 12 at 18
sorted value of 13 at 11
sorted value of 14 at 18
Classified as 10

String: ./Database/ashwin/video_ashwin_23.raw
Just finished radon
sorted value of 0 at 13
sorted value of 1 at 13
sorted value of 2 at 13
sorted value of 3 at 13
sorted value of 4 at 13
sorted value of 5 at 13
sorted value of 6 at 4
sorted value of 7 at 4
sorted value of 8 at 4
sorted value of 9 at 4
sorted value of 10 at 16
sorted value of 11 at 16
sorted value of 12 at 16
sorted value of 13 at 16
sorted value of 13 at 16
Classified as 13

The classification result here shows the classification of class 13 (“Ashwin”) for
different values of K. We see that Ashwin is misclassified as class 10 in spite of being in
the top 4 neighbors. For k=15 this class is misclassified as 10; however, for k=5 this gives
a correct classification rate.
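The effect of k on these two frames can be checked by replaying the sorted neighbor classes from the logs. The sketch below assumes a plain majority vote with ties going to the lower class index, as described earlier; the `vote` helper is hypothetical.

```python
from collections import Counter

def vote(neighbour_classes, k):
    """Majority vote over the first k sorted neighbours; ties go to the lower index."""
    votes = Counter(neighbour_classes[:k])
    top = max(votes.values())
    return min(c for c, v in votes.items() if v == top)

# Class of each of the 15 nearest neighbours, in sorted order, copied from the logs
frame_10 = [13, 13, 13, 13, 10, 10, 10, 10, 10, 18, 18, 11, 18, 11, 18]
frame_23 = [13, 13, 13, 13, 13, 13, 4, 4, 4, 4, 16, 16, 16, 16, 16]

print(vote(frame_10, 5), vote(frame_10, 15))   # 13 10
print(vote(frame_23, 5), vote(frame_23, 15))   # 13 13
```

For the first frame, class 13 holds the four nearest positions, but with k=15 class 10 collects five votes against four and wins.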
Challenges
Dimensionality
As mentioned before, a large image or feature vector size posed the hurdle of a very slow
recognition process. Also, since we buffer the images before processing them, a very large
image size constantly led to buffer or stack overflow. When the stack overflowed, the LCD
displayed garbage values. To solve this, we had to reduce the size of the image from the
original 240x320 to 64x64 and flush the buffer after every classification result.
For a very large feature vector size, we noticed that the data is noisier and closely
scattered. Hence, for large feature vectors the data for the classes overlapped, resulting in
a lot of misclassifications. This was true even for the case of 640 DCT coefficients. Hence,
for the DCT-only case, we decided to go with 5 DCT coefficients per block, resulting in a
feature vector of 320 coefficients.
Small size of internal RAM and memory issues
Processing time for Radon for different image sizes
Scatter plot of data :
RADON 10images per class
FIGURE: Scatter plot of original feature set of radon and DCT, 200 coefficients and 10 images per class.
FIGURE: Scatter plot of mean image of radon and DCT of 10 training images and feature set of 200 coefficients per class.
The graphs above show the scatter plot for 10 images per class and the scatter plot for the
mean image of each class. We observe that in the scatter plot of 10 images per class the
feature vectors overlap heavily. Since our symbol set consisted of only the 0-255 gray level
values, this translates to closely lying feature values, which in turn contributes to
misclassification. In the mean image graph, however, the values are visible and close to
being distinct, which in turn means a lower misclassification rate.
Median image
FIGURE: Scatter plot of median image of radon and DCT of 10 training images and feature set of 200 coefficients per class.
A general analysis of the median image has been presented above. During experimentation we
noticed that test images were very often being classified as class 11 (“Shivani”). The
probable reason is that, when we take the Euclidean distance from all the vectors in the
original feature set, the values of this class lie very close to those of the other classes;
the original feature vectors are themselves not very distinct and match many classes in the
training set, which influences the final decision. As seen from the median graph, the median
sufficiently separates the data for class 11, which was highly scattered in the mean image.
Means of radon+DCT
FIGURE: A comparative chart for set of one mean per class for radon and DCT of 10 training images.
Median of radon
FIGURE: A comparative chart for set of one median per class for radon and DCT of 10 training images.
FIGURE: comparative chart for set of one mean per class as compared to median for that class for radon and DCT of 10 training
images.
For analysis purposes, we tried projecting the data onto one dimension in terms of means and
medians. If we consider Euclidean distance, then the class that is most likely to be
misclassified is the one whose mean value is very close to that of another class. The bar
graphs above show the mean and median values, so bars at the same level would create
confusion at classification because the input is at the same distance from each of these
feature vectors; hence the misclassification rate would be high.
In most cases the mean and median values are similar for the respective classes. However,
when the variation between training images is too high, the median gives a better
representation of the class values, as it tends to fall among the most frequent values,
whereas the mean is a reconstructed mid value for the variation of the training set.
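The difference between a mean and a median prototype can be illustrated with a few numbers. The values below are made up for illustration: ten training images of one class, two features each, where two of the images act as strong outliers in the first feature.

```python
import numpy as np

def class_prototypes(features):
    """Per-feature mean and median over the training images of one class."""
    return features.mean(axis=0), np.median(features, axis=0)

# rows = 10 training images, columns = 2 features (hypothetical values)
features = np.array([
    [10, 5], [11, 5], [9, 6], [10, 5], [12, 5],
    [10, 6], [11, 5], [9, 5], [40, 5], [45, 6],
], dtype=float)

mean_proto, median_proto = class_prototypes(features)
print(mean_proto)    # [16.7  5.3]
print(median_proto)  # [10.5  5. ]
```

The two outlying images pull the mean of the first feature from about 10 to 16.7, while the median stays at 10.5; for the second feature, which has little variation, the two prototypes nearly agree.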
FIGURE: comparative chart for set of 10 means per class for radon and DCT of 10 training images.
FIGURE: comparative chart for set of one mean, median, and median of 10 means per class for radon and DCT of 10 training
images.
As mentioned above, the bars shown above represent either the median or the mean value.
Classes at the same level are more likely to be misclassified than classes showing some
difference in mean or median values. We have taken the mean of 10 images, the median of 10
images, and the median of 10 means for the 10 training images of each class. From the graph
above, we notice that the median, as a 1D representation of the data, is a better metric for
Euclidean distance.
We also notice that there are certain examples, like class 11 (“shivani”), class 12 (“neha”)
and class 18 (“pranav”), where there is a significant difference between the mean and median
images. The reason for this is that the training sets for these classes have a large
variance. Since the variance is too high, the mean of the images is increased whereas the
median remains unchanged. These classes are more likely to be misclassified if we take the
distance from the mean image as a metric.
We also notice that Radon and DCT together cannot handle such variations in the image set.
When observed carefully, we see that the images differ primarily in zoom and secondarily in
illumination. The second condition is handled by the algorithm, but the first still remains
a problem.
Classification based on only DCT coefficients.
FIGURE: Scatter plot of 10 images per class for feature vector of 640 DCT coefficients for each image.
As mentioned earlier, the scatter plot of 640 DCT coefficients is too dense. Also, we see a
number of coefficients at the same level, which implies less distinctness and more
redundancy in the information provided by the DCT coefficients. This highly overlapped data
poses a great problem for classification rates.
FIGURE: comparative chart for set of one mean per class for DCT of 10 training images.
FIGURE: comparative chart for set of one mean per class for DCT of 10 training images.
FIGURE: comparative chart for set of one mean and one median per class for DCT of 10 training images.
We notice from the above graph that the mean values of the DCT training vectors are quite
distinct, and that the median and mean values are similar even for cases where the training
data is highly variant, as in the cases mentioned above. Since no two bars are at the same
level, this scheme can provide better recognition rates. Both the mean and the median serve
as a sufficient metric for distance calculations. We noticed that class 11 (“shivani”) was
quite often being classified as class 6 (“shengakai”) in real-time testing. The probable
reason is the very small difference between the mean values, which leads to
misclassification under slight variations in the test image. Hence, this provides a good
intuitive explanation for the misclassification results.
FIGURE: comparative chart for set of one median per class for DCT of 10 training images.
FIGURE: comparative chart for set of one median per class for DCT of 10 training images.
Again, as mentioned above, the median of the images provides a good distinction in the
feature set. Intuitively, an incoming feature vector that is to be classified correctly is
more likely to fall near the median values than near the slightly variant images away from
the median.
FIGURE: comparative chart for set of one mean, one median and median of means per class for DCT and radon plus DCT of 10
training images.
RADON_DCT VS DCT
For analysis purposes, we plotted the mean of the Radon and DCT feature vector, the median
of the Radon and DCT feature vector, the mean of the DCT feature vector, the median of the
DCT feature vector, and the medians of means for 10 images in each scenario. We notice that,
as expected, the feature set values of the DCT are much higher than those of the Radon
feature set. We also observe that the DCT feature set presents more distinctness in the
values of the clusters, thereby facilitating correct classification. The Radon transform, on
the other hand, had values lying very close together, making the system very sensitive to
slight changes.
The reason for the smaller variation of the Radon transform could be that, for the 32
rotations, it sums up the values of the columns as line integrals. This in a way suppresses
the variations in the image and brings the values close together. This might be a good
technique for clustering of images. However, in our case we need large variations in the
feature set, as we are not implementing any discriminant algorithm. Radon transform and DCT
in conjunction with a discriminant algorithm would provide excellent results, as the data
for each class would be closely scattered, making the in-class variation small, while the
values for each class would be well separated from the other classes, making the inter-class
variation large. This is a desirable situation for any face recognition system.
Unfortunately, in our case, due to the large computation needed for the covariance matrix,
implementing a discriminant analysis method was not possible. Hence, in our case, DCT alone
provides better recognition rates than the combination of Radon and DCT.
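The in-class and inter-class variation discussed here can be measured with a simple scatter ratio. The sketch below is an illustration on synthetic 2D data and was not part of the implemented system; the function name and the trace-style ratio are assumptions.

```python
import numpy as np

def scatter_ratio(features, labels):
    """Between-class over within-class scatter; larger means better separated classes."""
    overall = features.mean(axis=0)
    within = between = 0.0
    for c in np.unique(labels):
        xc = features[labels == c]
        mu = xc.mean(axis=0)
        within += ((xc - mu) ** 2).sum()                  # spread inside the class
        between += len(xc) * ((mu - overall) ** 2).sum()  # spread of the class means
    return between / within

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 10)                 # two classes, 10 vectors each
noise = rng.normal(size=(20, 2))
centres_far = np.array([[0.0, 0.0], [10.0, 10.0]])
centres_near = np.array([[0.0, 0.0], [0.5, 0.5]])
separated = centres_far[labels] + noise
overlapping = centres_near[labels] + noise

print(scatter_ratio(separated, labels) > scatter_ratio(overlapping, labels))  # True
```

A discriminant method such as LDA seeks a projection that maximizes a ratio of this kind, which is why it would complement features that already have small in-class variation.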
Flow diagram of the code:
VM322K2 video input (16 bit YUV) -> CAM (8 bit Y, 240x320)
CAM (8 bit Y, 240x320) -> CROP (128x128)
CROP (128x128) -> buffer (3 sec) -> Downsample (64x64)
Downsample (64x64) -> Preprocessing (64x64)
Preprocessing (64x64) -> Radon (32x44)
Radon (32x44) -> Block (20)
Block (20) -> 2D-DCT
2D-DCT -> 10 coeff per block -> feature_concat[200]
Send to kNN for classification.
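The image sizes in the first stages of the flow can be traced with a short sketch. It assumes a centered crop and downsampling by 2x2 averaging; the report's own crop position and downsampling method may differ, and the function names are hypothetical.

```python
import numpy as np

def center_crop(frame, size=128):
    """Cut a size x size window from the middle of the frame (assumed position)."""
    h, w = frame.shape
    top, left = (h - size) // 2, (w - size) // 2
    return frame[top:top + size, left:left + size]

def downsample2(img):
    """Halve each dimension by averaging 2x2 neighbourhoods (assumed method)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

# Stand-in for one 8-bit luminance frame of 240x320 pixels
frame = np.random.default_rng(0).integers(0, 256, size=(240, 320)).astype(np.uint8)
cropped = center_crop(frame)
small = downsample2(cropped.astype(float))
print(cropped.shape, small.shape)  # (128, 128) (64, 64)
```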
Limitations with REAL TIME RESPONSE
One of the major problems we encountered was not being able to capture the image properly
during real-time testing. The dataset gave excellent results in offline testing, while for
the same pose the classification in real time was drastically worse. We learnt that the
algorithm was working fine; however, the image was not being captured properly in the first
stage itself. As we were initially taking just one image for classification, we implemented
the downsampling function in the interrupt (while) loop. Once a particular number of frames
had been collected, the last frame stored in the crop array was processed and downsampled.
However, we later learnt that the image in the final frame was most of the time corrupted or
overwritten because the interrupt was still enabled and running while we were performing the
recognition task. Hence, the image we used for classification was sometimes blurred or
distorted due to interlace effects or the sudden halt of the interrupt while the camera was
still storing the image. In the very few cases when the image happened to be captured
properly, the classifier gave satisfactory results. To solve this problem, we went back to
the old scheme of buffering images and then processing them. To avoid any corrupted frame,
we choose a middle frame for classification rather than the first or last frame. Also, we
disabled the interrupt every time we went into the recognition step.
The second problem was the camera input. We noticed that the classification was also
affected by the choice of camera. One of the camera units gave a low-resolution image that
was noisy at the edges, while the other camera gave a clear, high-resolution image. Since we
used the latter camera for collecting the database, we used the same camera for the test
stage.
Displaying classification result
The classification result is displayed on the LCD as the name of the person being
classified. For this, instead of storing the name messages as image files, we stored them
directly as 2D arrays of size 84x320. This was a reasonable decision because, had we saved
them as images, the program would eventually have had to read the messages and store them in
arrays anyway, which would have been extra overhead. This also saved us a considerable
amount of loading time. Messages for the initial user instructions were also saved in the
system.
Imposter model.
Building an imposter model requires a large amount of data with large variations. Due to
constraints and the lack of availability of such large databases, we leave the imposter
model as future work, making the current recognition system a “one out of k classes”
classifier.
Quantitative Results
The quantitative results in terms of recognition rate are presented below:
Feature set extraction method and classification for test database    Correct recognition rate
32 Radon angles + 10 DCT coefficients/block, k-NN (10)                 86%
32 Radon angles + 10 DCT coefficients/block, E.D                       63%
5 DCT coefficients (320) + k-NN (5)                                    98%
5 DCT coefficients (320) + k-NN (15)                                   74%
10 DCT coefficients (640) + k-NN (10)                                  93%
16 DCT coefficients (1024) + E.D                                       60%
We tested the system in real time as well as on the test dataset. The above results show the
correct recognition rates for the various combinations of feature selection methods. As
expected, the Euclidean distance gives a very low recognition rate for both the Radon-based
and the DCT-based methods. For the method based on 16 DCT coefficients, the Euclidean
distance gives recognition rates as low as 60%. This is an expected result due to the
increased overlap in the data and the redundancy in the coefficient values. At the same
time, selecting 5 DCT coefficients per block along with a k-NN classifier with k=5 gives
recognition rates as high as 98%. To see the effect of the value of K, we calculated the
recognition rates for the same 5 DCT coefficients per block with a k-NN classifier using
k=15 nearest neighbors. The recognition rate in this case dropped to 74%. We noticed during
classification that, for most of the misclassified classes, the correct class was present
amongst the first 5-7 neighbors. This explains the good recognition rate for k=5 and the low
recognition rate for k=15. Hence, an optimum value of k for the k-NN classifier is also a
very important issue.
We also noticed that the recognition rates for the test database of 50 images were quite
high considering the training set size of 10 images. Also, as shown above, the training and
test datasets are very distinct and have a lot of pose variation. The algorithm gives good
recognition rates on the test database for offline on-board testing. However, achieving such
high recognition rates for real-time recognition is still a problem due to the variance in
poses from the training set.
Conclusion
The system is designed to work well with a database containing varying images of each
subject. After looking at the real-time classification, we believe that face detection is
needed prior to feature extraction to take the zoom factor into account. Including a
discriminant analysis method would boost the real-time recognition rate significantly, as
explained earlier. We also noticed that Radon + DCT is a good approach for energy compaction
and for decreasing the in-class variance, but it does not contribute towards increasing the
inter-class variance and hence depends on a linear discriminant method to give a good
recognition rate. In contrast, DCT alone provides good inter-class variance but poor
in-class variance, and is very sensitive to changes in pose and training set.
Future Work
We would like to implement a discriminant analysis method or a weighting function so that
we can create a voting scheme for a set of images. We would also like to test different
feature selection techniques such as Haar wavelet transforms, WHT and EBGM.
References
1. Z. Hafed, M. Levine, “Face Recognition Using the Discrete Cosine Transform”,
International Journal of Computer Vision, 43(3), pp. 167-188.
2. W. Zhao et al., “Face Recognition: A Literature Survey”, ACM Computing Surveys,
Vol. 35, No. 4, pp. 399-458, 2003.
3. H.K. Ekenel, R. Stiefelhagen, "Local Appearance based Face Recognition Using
Discrete Cosine Transform", 13th European Signal Processing Conference (EUSIPCO
2005), Antalya, Turkey, September 2005.
4. J. Stallkamp, H.K. Ekenel, R. Stiefelhagen, “Video-based Face Recognition on Real-
World Data”, IEEE 11th International Conference on Computer Vision (ICCV 2007),
2007.
5. P. Viola, M. Jones, “Robust Real-Time Face Detection”, Intl. J. of Computer Vision,
Vol. 57, No. 2, pp. 137-154, May 2004.
6. C. Sanderson, K.K. Paliwal, “Fast features for face authentication under illumination
direction changes”, Pattern Recognition Lett. 24 (14) (2003) 2409–2419.
7. A. Batur, B. Flinchbaugh and M. Hayes III, “A DSP-Based Approach for the
Implementation of Face Recognition Algorithms”, ICASSP 2003, pp. II 253-256.
8. S-W. Lee, Sang Lee and H.C. Jung, “Real-time Implementation of Face Recognition
Algorithms on DSP chips”, Lecture Notes in Computer Science, 2003, pp-1057