Honours Year Project Report
Real-time Face Tracking
By
Yeow Hong Ann
Department of Computer Science
School of Computing
National University of Singapore
2005/2006
Project No: H055200
Advisor: Dr. Terence Sim
Deliverables:
Report: 1 Volume
Program: 1 CD
Abstract
Face detection and recognition have been widely researched topics due to their high
application value in various fields and industries, such as biometrics and image retrieval.
Recently, face recognition has also been deployed to aid counter-terrorism by
recognizing known criminals and terrorists at important installations. In this paper,
we present an approach to the real-time detection and recognition of human faces, and to the
tracking of the face. We have also modified the approach to handle situations when the
face cannot be detected; in that case we track the motion under a changing background in
real-time. Our approach uses the cascade architecture for face detection proposed by
Viola and Jones, which uses a feature selection algorithm based on AdaBoost. This
architecture has been incorporated into the existing OpenCV library. For face
recognition, we use ARENA, a simple, memory-based algorithm.
Because the background changes rapidly during tracking, we use a modified
background subtraction method that can adapt robustly and detect the motion. A
Canon VCC4 pan/tilt/zoom camera is employed to track the face and the motion.
Subject Descriptors:
I.4.3 Enhancement
I.4.6 Segmentation
I.4.8 Scene Analysis
I.5.2 Design Methodology
Keywords:
Face detection, computer vision, face recognition
Implementation Software and Hardware:
Pentium 1.6GHz PC, Euresys Picolo Card, Canon VCC4 camera, VC++, OpenCV beta5
Acknowledgement
I would like to thank everyone who helped me complete this project.
I would like to thank my supervisor, Dr Terence Sim, for his invaluable advice and
guidance throughout this project.
I would like to thank my two mentors, Zhang Sheng and Rajkumar, for their help and for
their patience with my constant requests for advice.
I would like to thank my fellow HYP mate Dennis Ng, for his advice and assistance in
my work.
Lastly, I would like to thank my family for their support.
List of Figures
2.1 Detection of a side view of a face
2.2 Geometrical features (white)
2.3 Facial features mask
3.1 Integral image at location (x, y)
3.2 Different types of Haar-like features
3.3 Cascade of classifiers
3.4(a) Upright frontal face recognition
3.4(b) Database query returns 2 positive results
3.4(c) Orientation of approximately 15 degrees
3.4(d) Database query returns 1 positive result
3.4(e) Orientation of approximately 30 degrees
3.4(f) Database query returns no positive results
3.5(a) Horizontal coverage of 5 squares
3.5(b) Horizontal coverage of 7 squares
3.6 Computer casing falsely identified as a face
3.7(a) Face detected in lower left region of image
3.7(b) Camera moves
3.7(c) Face detected in upper right region of image
3.7(d) Camera moves
3.8 Face detected within no-tracking region
3.9 Blue rectangle not in the centre of the image
3.10(a) Moving object detected by motion detector
3.10(b) Object moves to the right
3.10(c) Object continues to move to the right
3.10(d) Object stopped moving
3.10(e) Object moves to the left
3.11(a) Zoom parameter 0, area 598 pixels
3.11(b) Zoom parameter 8, area 727 pixels
3.11(c) Zoom parameter 16, area 860 pixels
3.11(d) Zoom parameter 24, area 1054 pixels
3.11(e) Zoom parameter 32, area 1332 pixels
3.11(f) Zoom parameter 40, area 1685 pixels
3.11(g) Zoom parameter 48, area 2206 pixels
3.11(h) Zoom parameter 56, area 2861 pixels
3.11(i) Zoom parameter 64, area 3805 pixels
3.12(a) A face detected in the central region
3.12(b) First frame of zooming in
3.12(c) Second frame of zooming in
3.12(d) Third and final frame of zooming in
3.13(a) A face detected in the central region
3.13(b) First frame of zooming out
3.13(c) Second and final frame of zooming out
4.1(a) 75th frame of the video
4.1(b) 85th frame of the video
4.1(c) Color background model of running-average of 75th frame
4.1(d) Color background model of running-average of 85th frame
4.1(e) Grayscale background model of running-average of 75th frame
4.1(f) Grayscale background model of running-average of 85th frame
4.1(g) Background model of GMM of 75th frame
4.1(h) Background model of GMM of 85th frame
4.2 Time in milliseconds needed for each technique
4.3(a) Difference image before applying median filter
4.3(b) Difference image after applying median filter
4.3(c) Binary image of 4.3(a) with threshold of 30
4.3(d) Binary image of 4.3(b) with threshold of 30
4.4(a) Threshold value 20
4.4(b) Threshold value 60
4.4(c) Threshold value 27, returned by Otsu algorithm
4.4(d) Hysteresis threshold
Table of Contents

Title
Abstract
Acknowledgement
List of Figures
1 Introduction
1.1 Background
1.2 Project Objective
2 Related Work
2.1 Face Detection
2.1.1 Linear Subspace Methods
2.1.2 Learning Networks
2.1.3 Statistical Approaches
2.1.4 Feature-based Approach
2.2 Face Recognition
2.2.1 Eigenfaces
2.2.2 Neural Network
2.2.3 Geometric Matching and Template Matching
2.3 Motion Detection
2.3.1 Optical Flow Tracking
2.3.2 Motion-based Tracking
2.3.3 Filtering and Thresholding
3 Real-time Face Tracking
3.1 Face Detection
3.1.1 Integral Image
3.1.2 Haar-like Features
3.1.3 Learning Algorithm
3.1.4 Cascade of Classifiers
3.2 Face Recognition
3.3 Face Tracking
3.4 Motion Tracking
3.5 Zooming
4 Motion Detection
4.1 Background Model
4.2 Difference Image
4.3 Thresholding the Binary Image
5 The Speaking Kiosk
6 Conclusion
6.1 Summary
6.2 Limitations
6.3 Further Works
References
1. Introduction
1.1 Background
Over the last few years, there have been rapid advances in the development of new
algorithms to detect faces, thanks to increasing processing power and the decreasing cost
of video acquisition and processing devices. The demand for more accurate face
processing has also been fueled by the recent terrorist attacks in different parts of the
world.
In our case, we have decided to build a program that is not only able to detect
and recognize faces, but is also able to perform tracking and motion detection. There
are four major areas relating to the functions of the kiosk: 1) face detection, 2) face
tracking, 3) motion tracking in the absence of any detected face, and 4) face recognition,
which is subject to the consent of the user. The Intel OpenCV library has a
multitude of functions that are very useful for image processing, and we will be using the
OpenCV library frequently.
This real-time face tracking application can be employed in many situations, which indicates
its high commercial value. The camera can be mounted on a kiosk,
which can then be installed at critical installations to enhance security.
An alarm system can work hand in hand with the kiosk to alert security personnel
to potentially dangerous intruders. On the fun side, this application can be used as an
interactive system, where a person can allow the machine to
try to match their face with celebrities or other images from a database. Voice
synthesis integrated with the system allows for a more interesting
human-machine interaction.
We therefore approach the area of face tracking through face detection in this
project, focusing on tracking faces, or the moving object in the absence of a detected face.
1.2 Project Objective
The objective of this project is to track a face in real-time: given a live input
from a camera, the system should be able to detect a face and track it. This is achieved by using
simple and robust image processing methods on each individual frame to detect faces and
motion. Fast mathematical methods are used to calculate the angle and zoom level of
the camera in order to focus on the object of interest. Using the Software Development
Kit supplied by Canon for the Canon VCC4 camera, we then communicate with the
camera and control it to focus on our object of interest.
The basic assumptions of this project are as follows:
• The system is able to detect all the faces captured by the camera; however,
it will only try to track the face with the biggest surface area.
• The motion detection works in the background, and will only be
activated when a face detected in the previous frames is no longer
detected in the current frame. We assume that the motion detected is the
motion of the person whose face was detected in the previous frames. Also,
only the motion with the biggest area detected in the frame will be tracked.
2. Related Work
2.1 Face Detection
Over the past two decades, a large number of papers have been published in the field of
face detection, and many approaches have been devised to solve the problem. We will
briefly touch upon a few representative works.
2.1.1 Linear Subspace Methods
In the late 1980s, Sirovich and Kirby managed to develop a technique to represent human
faces efficiently using Principal Component Analysis (PCA) [1]. PCA is performed on a
set of images so that the original data can be represented in terms of the eigenvectors
found from the covariance matrix. The eigenvectors are the principal
components of the distribution of faces, so the more significant ones are used to
approximate each individual face. A new image can then be compared to the
existing images; a good match will be the existing image with the smallest difference from
the new one. However, PCA is not optimal, as the face space can be
separated into subclasses. This is a good example of an image-based approach, which
relies on statistical analysis and machine learning to solve face detection as a general
pattern recognition problem. The intensities of the example images are used as
representations instead of the features of the images. By depending on learned
characteristics that have been gathered, face detection can be treated as a two-class
classification problem and then solved using those characteristics, which are usually in the
form of distributions.
2.1.2 Learning Networks
A neural network is an interconnected assembly of simple processing nodes, whose
processing ability is stored in the inter-unit connection weights, obtained by a process of
learning from a set of training patterns. Neural network is used to create the face database
for face detection. This is possible because face detection can be interpreted as a two-
class pattern detection problem. One significant advanced neural network approach is that
proposed by R. Feraud, O.J. Bernier, J.E. Viallet and M. Collobert [2], which is based a
Constrained Generative Model. This neural network is designed to looks at windows of
size 15 by 20 pixels. A major problem arises with window scanning techniques is
overlapping detection. This problem was dealt with by using several pre-filters and a fast
search algorithm. This also reduced the false alarm rate which in turn leads to a lower
average processing time per image. This face detector is also able to detect faces up to
90 ۫. Fig 2.1 shows the detection of a side view of a face using this face detector.
Fig 2.1 Detection of a side view of a face
2.1.3 Statistical Approaches
There are several statistical approaches to face detection. An excellent example of a
statistical approach to the face detection problem is the Hidden Markov Model (HMM) [3]
approach. This scheme is based on the idea of identifying the face to non-face and non-
face to face transitions in an image. Based on these transitions, an observation sequence
is generated and the HMM parameters are extracted for further analysis and learning for
the HMM model. The window of the face pattern being looked at is small at only the size
of 13 x 13 pixels which allows for a more compact and efficient training set.
2.1.4 Feature-based Approach
Most feature-based approach usually start processing at the pixel level using low level
features, such as edge detection or skin color, as a guide to detect faces. Due to the low
level nature of these features and that the image might be degraded by factors such as
illumination and noise, more complex features will be used at the later stages to reduce
the feature ambiguity. An outstanding feature-based approach by KC Yow and R. Cipolla
[4] utilizes an adaptive skin color model which is able to adapt online to detect skin
pixels, after which a region of interest will be classified and grouped together. By further
analyzing the region’s geometrical shape and information contents such as the ratio of
4
skin color to non-skin color, an informed decision can be made as to whether a face is
present in the region or not. However the processing time is dependent on the number and
size of the skin regions. Also this model may fail to detect faces when illumination
changes significantly.
2.2 Face Recognition
The idea of face recognition is to examine a set of images and try to find a good match
for a given image. Face recognition is a more difficult and complicated problem than face
detection, because face detection only has to deal with two classes of objects, faces and
non-faces. Moreover, the same facial features can be found on different faces, and different
faces can be very similar to each other. Several approaches to face recognition use
Principal Component Analysis (PCA) for pre-processing and feature extraction of the
input images, or use "eigenfaces", which are eigenvectors approximated
from the face image's auto-correlation matrix. Recently, neural-network-based algorithms
have also been used widely due to their real-time computation and adaptation abilities.
The geometrical approach to face recognition takes advantage of the spatial configuration
of facial features. The face is processed and the main geometrical features of the
face, such as the eyes, nose and mouth, are first located. Each face is then
classified based on various factors such as geometrical distances and angles between
features.
2.2.1 Eigenfaces
Using eigenfaces is an example of a statistical approach to face recognition. The concept
of eigenfaces is based on the idea of Turk and Pentland [5] of finding the more
distinctive features in a face and using them to distinguish one face from another. Every
face can be represented as a vector of weights, which are calculated by a
simple inner product operation that projects the image onto the eigenface components. To
recognize a face that has been detected and normalized, its vector of weights is first
calculated and then matched with the faces in a database to find the image with the
closest weights in Euclidean distance. All faces of the same person are likely to be very
near to each other in terms of Euclidean distance, while different persons will form
different clusters of faces. This approach is fairly robust and adaptive to illumination
changes, but it degrades quickly as the size of the image changes. There is also no prior
knowledge of the distribution of the values of the face cluster, which makes it harder to
determine whether a given Euclidean distance is small enough for us to say that two
faces are actually the same.
2.2.2 Neural Network
We have mentioned before that applying a set of neural network based filters to an image
and then combining the outputs using an arbitrator allows us to detect faces. The
neural network architecture can also be utilized in an unsupervised manner to recognize
faces. Basically, this is done by gathering all the common faces from each person and
projecting them as eigenfaces, so that the neural networks can learn to classify them with
the new face descriptor as input. To facilitate training, a neural
net is created for each person. Whenever an input face is available, all the neural nets
evaluate the input face and output a value, and a recognition algorithm analyzes these
values and chooses the net with the highest value. If this value is above a certain
threshold, the net is said to match the input face. Training the neural
network is very important, and new faces are always added to the training examples to
retrain the network. This is because these new faces not only improve the accuracy
of their own network, but also improve the performance of the other networks.
2.2.3 Geometric Matching and Template Matching
Roberto Brunelli and Tomaso Poggio [6] developed two algorithms for face recognition:
geometric feature based matching and template matching. The geometric feature based
matching approach analyzes the face and automatically detects 35 facial features, such as
eyebrow thickness and vertical position, nose vertical position and width, chin shape and
zygomatic breadth. Using these 35 facial features, a 35-D vector is formed, and,
under the assumption of a Gaussian distribution, recognition is performed with a Bayes
classifier. Fig 2.2 shows the facial features that have been detected, marked in
white. In the template matching approach, the face of each person is analyzed and
represented by an image and four masks corresponding to the eyes, nose, mouth and face.
Normalized cross correlation is performed between the input image and the database images,
each of which returns a vector of matching scores (one per feature). The individual scores
are summed up, and the person with the highest cumulative score is chosen.
They also performed recognition based on single features; sorted by
decreasing performance, the features are the eyes, nose, mouth and whole face template. Fig 2.3 shows the
masks detected by template matching.
Fig 2.2 Geometrical features (white) Fig 2.3 Facial features mask
2.3 Motion Detection
Motion detection is an important area of research in computer vision and a very crucial
low-level task for many computer vision applications, such as traffic monitoring and
video surveillance. Generally, two kinds of methods are used for motion-based
tracking systems: optical flow tracking methods and motion-energy methods.
2.3.1 Optical Flow Tracking
Optical flow tracking is also known as computing the motion in an image. It basically
tracks a set of points across multiple images to find out how an object has moved. This
technique has been used extensively in the pyramidal Lucas-Kanade model [15]. First, a set
of good features should be determined for tracking purposes, since tracking every pixel of
the starting image to the destination image would be very wasteful. These features
should be sharp and well-defined. Formally, this means a feature should have large
eigenvalues, which indicate that the feature stands out above the image noise and
can thus be tracked more reliably. Tomasi's [7] method of finding good features to track
is an improvement in that it provides a way to assess the goodness of features for tracking
and accounts for affine transformations of features. Since local accuracy and robustness have
to be taken into account when choosing the integration window size used to track the features,
a pyramidal implementation of the Lucas-Kanade algorithm, that is, an iterative
implementation of the algorithm, is able to optimize the window size and the
tracking results. This algorithm is also able to compute the optical flow at subpixel
accuracy using bilinear interpolation. It can also cater to "lost" features that are either no
longer in the images or have disappeared due to occlusion.
2.3.2 Motion-based Tracking
This is perhaps the most intuitive technique for tracking motion. After acquiring a
background model, we can use background subtraction, subtracting the current frame
from the background image. The resultant pixel values are then examined: those near
zero are classified as background, whereas large values are classified as
foreground. However, acquiring the background model can be quite a daunting task in
itself.
The most straightforward approach is to set up the camera, make sure nobody is in the
image, and then take that image as the constant background image. This is the simplest
and fastest way, but it does not account for situations where the illumination changes
subtly over time or where there is movement in the background (e.g. trees swaying).
There are more complex methods that use a sophisticated statistical model for each pixel.
The multi-modal statistical motion detector from MIT by Stauffer and Grimson [8] has
been extensively used because of its accuracy in modeling the background. This
detector models each pixel with a mixture of 3 to 5 Gaussian distributions, but it is
computationally expensive. A further improvement on this detector has been proposed by
P. KaewTraKulPong and R. Bowden [9] to improve the learning speed and accuracy
of the system. A shadow detection technique has also been introduced, simply by noting
the difference between a foreground pixel and the background model: if the difference in
both the chromatic and brightness components is small, the pixel is considered to be part
of a shadow. However, this requires a color model that can separate chromatic and
brightness components.
2.3.3 Filtering and Thresholding
After acquiring the background model, the difference image between the current frame
and the background image needs to be converted into a grayscale image before being
thresholded into a binary image. The binary image can then be analyzed to
check for moving objects in the foreground. Before conversion, filters are normally
applied to the difference image to remove noise and enhance the detection process.
Some commonly used filters are Gaussian smoothing and simple blurring.
There are several techniques for finding the optimal threshold. The most
straightforward one is to set the threshold to an arbitrary integer, but this
approach is not robust enough: a value set too high will eliminate pixels that are part of
the foreground, whereas a value set too low will introduce a lot of noise and unwanted
pixels that do not belong to the foreground. Calculating the stable Euler number is also a
good way to obtain the threshold, since the Euler number can be determined efficiently
and only requires a single raster scan. Otsu's thresholding method [10] is based on the
idea of creating a histogram of the image and then finding a threshold value that
minimizes the within-class variance of the resulting foreground and background classes and
maximizes the between-class variance.
3. Real-time Face Tracking
We have implemented this system and mounted the camera on a kiosk. Any user can
walk by the kiosk, and the kiosk will detect the face and invite the user to try to
recognize his face against the existing face database on our server. The kiosk will "show" its
sincerity by focusing on the face and zooming in if applicable. The kiosk will then track the
face if the face moves out of the focus of the camera, and again zooming will be done if
applicable. If the face can no longer be detected when the user moves away, the kiosk will
use motion detection to keep focusing on the user while attempting to detect
the face at the same time. If the user or his face moves out of a predefined range, the
camera will go back to its home position. All the tracking and zooming is performed
with a Canon VCC4 pan/tilt/zoom camera.
If the user is willing to try out face recognition on the kiosk, he can instruct
the kiosk to attempt recognition. The kiosk will evaluate the face and search its database
for the 6 best-matching faces according to the kiosk's face recognition
algorithm. If the user's face is not among these 6 faces, the user can add his
own face into the database so that recognition can be performed the next
time he uses the kiosk.
3.1 Face Detection
The face detection system in the kiosk is based on the cascade AdaBoost architecture
developed by P. Viola and M. Jones [11]. R. Lienhart and J. Maydt [12] later
extended this face detector in their paper "An Extended Set of Haar-like
Features" to improve the efficiency of the proposed algorithm. These algorithms have
been incorporated into the Intel OpenCV library, which we use extensively to
perform face detection. There are four important parts to the face detection
algorithm.
3.1.1 Integral Image
The concept of the integral image was introduced to speed up the computation of rectangle
features. The integral image at location (x, y) is defined as the sum of the pixels above and
to the left of that location. Mathematically, it can be expressed as follows:

ii(x, y) = Σ_{x' ≤ x, y' ≤ y} i(x', y')

where i(x, y) is the original image and ii(x, y) is the integral image. Fig 3.1 shows the
value of the integral image at location (x, y) as the sum of the values of the pixels in the
shaded area.
Fig 3.1 Integral image at location (x, y)
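To make the computation concrete, the following is a minimal sketch in plain C++ (not the OpenCV implementation) of building an integral image from a grayscale image stored row-major; the function name and the use of std::vector are illustrative choices only.

#include <vector>

// Build the integral image ii from a W x H grayscale image img (row-major).
// ii(x, y) holds the sum of all pixels at positions (x', y') with x' <= x and y' <= y.
std::vector<long> buildIntegralImage(const std::vector<unsigned char>& img, int W, int H)
{
    std::vector<long> ii(W * H, 0);
    for (int y = 0; y < H; ++y) {
        long rowSum = 0;                                   // cumulative sum of the current row
        for (int x = 0; x < W; ++x) {
            rowSum += img[y * W + x];
            // integral value = cumulative row sum + integral value of the pixel above
            ii[y * W + x] = rowSum + (y > 0 ? ii[(y - 1) * W + x] : 0);
        }
    }
    return ii;
}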
3.1.2 Haar-like Features
The integral image was developed with the purpose of speeding up the computation of
rectangular Haar-like features, which are very simple to compute from it. The proposed
Haar-like features extend the basic rectangular feature set by adding features in the form of
rectangles rotated by 45 degrees. Like the basic feature set, these rotated features can also
be computed in one pass over the image.
Fig 3.2 Different types of Haar-like features
The value of a feature is calculated by subtracting the sum of the pixels in the black
area from the sum of the pixels in the white area, with appropriate scaling.
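As an illustration of why the integral image is useful here, the sum of the pixels inside any upright rectangle can be read off with four lookups, and a two-rectangle feature is then just the difference of two such sums. The sketch below builds on the integral image sketch from Section 3.1.1; it is not the OpenCV feature code, and the horizontal two-rectangle layout is only one of the feature types shown in Fig 3.2.

#include <vector>

// Sum of the pixels in the upright rectangle with top-left corner (x, y),
// width w and height h, using four lookups into the integral image ii
// (W is the image width; lookups that fall outside the image count as 0).
long rectSum(const std::vector<long>& ii, int W, int x, int y, int w, int h)
{
    long A = (x > 0 && y > 0) ? ii[(y - 1) * W + (x - 1)] : 0;  // above-left of the rectangle
    long B = (y > 0) ? ii[(y - 1) * W + (x + w - 1)] : 0;       // above-right
    long C = (x > 0) ? ii[(y + h - 1) * W + (x - 1)] : 0;       // below-left
    long D = ii[(y + h - 1) * W + (x + w - 1)];                 // below-right
    return D - B - C + A;
}

// A horizontal two-rectangle feature: white (left) half minus black (right) half.
long twoRectFeature(const std::vector<long>& ii, int W, int x, int y, int w, int h)
{
    return rectSum(ii, W, x, y, w / 2, h) - rectSum(ii, W, x + w / 2, y, w / 2, h);
}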
3.1.3 Learning Algorithm
In a machine-learning approach to face detection, an efficient learning algorithm is
essential for the system to learn how to differentiate faces from other images. The
algorithm used in our face detector is a modified variant of the original
AdaBoost algorithm proposed by Freund and Schapire [13]. The AdaBoost algorithm
offers formal guarantees on performance, so it is very suitable as an algorithm
for boosting weak classifiers based on all the given features. In the modified version of
the AdaBoost algorithm, the result returned by each weak classifier depends on only one
feature. This allows us to select a small number of significant features which
can then be combined into an accurate classifier, instead of using all of the roughly 180,000
rectangle features associated with each image sub-window, which would be
computationally expensive. The task of each weak classifier is now much simpler: it
determines the optimal threshold value with the minimum rate of misclassification
on the given positive and negative examples. Mathematically, a weak classifier h_j(x)
performs the following thresholding operation:

h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and h_j(x) = 0 otherwise

where x is a 24 x 24 pixel image, f_j is a feature, θ_j is the threshold and p_j is a polarity
specifying the direction of the inequality sign. The AdaBoost algorithm uses the results
from every weak classifier, each thresholded on one feature, to
construct a so-called strong classifier to be used for the face detection system. This
system can be further improved by using a cascade of classifiers, since relying on one
classifier alone is not accurate enough: accuracy can be improved
while the computation time is decreased.
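The following is a minimal sketch of a single-feature weak classifier and the resulting strong-classifier vote. The struct layout and the half-weight voting rule follow the standard AdaBoost formulation of the formula above; the learned parameters themselves (θ, the polarity and the weights α) are placeholders, not values from our trained detector.

#include <vector>

// One weak classifier: returns 1 (face-like) if p * f(x) < p * theta, else 0,
// where f(x) is the value of a single Haar-like feature on a 24 x 24 window x.
struct WeakClassifier {
    double theta;    // learned threshold
    int    polarity; // +1 or -1, direction of the inequality

    int classify(double featureValue) const {
        return (polarity * featureValue < polarity * theta) ? 1 : 0;
    }
};

// Strong classifier: a weighted vote of the selected weak classifiers.
// alpha[t] is the weight learned for weak classifier t during boosting.
int strongClassify(const std::vector<WeakClassifier>& weak,
                   const std::vector<double>& featureValues,
                   const std::vector<double>& alpha)
{
    double vote = 0.0, halfTotal = 0.0;
    for (std::size_t t = 0; t < weak.size(); ++t) {
        vote      += alpha[t] * weak[t].classify(featureValues[t]);
        halfTotal += 0.5 * alpha[t];
    }
    return (vote >= halfTotal) ? 1 : 0;   // face if the vote exceeds half the total weight
}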
3.1.4 Cascade of Classifiers
A cascade of classifiers resembles a decision tree. Starting from the
root classifier, the next classifier is evaluated only when the current classifier returns
a positive result; a negative result at any stage of the cascade leads to immediate
rejection. This form of rejection decreases computation time, since rejected
sub-windows are not processed any further. The cascade is designed to
reject as many negative sub-windows as possible, since the vast majority of sub-windows
are negative. This cascade operation is depicted in Fig 3.3.
Fig 3.3 Cascade of classifiers
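In our system the cascade itself is not re-implemented; it is loaded and evaluated through OpenCV. The sketch below shows roughly how the detector can be called through the OpenCV 1.x C API to keep only the largest face; the exact argument list is an assumption, as the signature of cvHaarDetectObjects differs slightly across early OpenCV releases, and the cascade is assumed to have been loaded beforehand (typically once at start-up with cvLoad on one of the XML cascade files shipped with OpenCV, together with a memory storage created by cvCreateMemStorage).

#include <cv.h>

// Detect faces with a pre-trained Haar cascade and return the largest one
// (sketch only; a zero-sized rectangle means "no face found").
CvRect detectLargestFace(IplImage* frame, CvHaarClassifierCascade* cascade,
                         CvMemStorage* storage)
{
    cvClearMemStorage(storage);
    // Scan the frame at multiple scales; each candidate window must be confirmed
    // by at least 3 neighbouring detections before it is reported.
    CvSeq* faces = cvHaarDetectObjects(frame, cascade, storage,
                                       1.1 /* scale step */, 3 /* min neighbours */,
                                       CV_HAAR_DO_CANNY_PRUNING, cvSize(30, 30));
    CvRect best = cvRect(0, 0, 0, 0);
    for (int i = 0; faces && i < faces->total; ++i) {
        CvRect* r = (CvRect*)cvGetSeqElem(faces, i);
        if (r->width * r->height > best.width * best.height)
            best = *r;                          // keep only the face with the largest area
    }
    return best;
}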
3.2 Face Recognition
For face recognition, our system uses a simple, memory-based technique known as
ARENA. The resolution of the input image is first reduced to 16x16 pixels, and then a
similarity measure, L0*, is used to compute the difference between the input image and the
stored images in the database. To reduce the resolution of the input image quickly, the
image is divided into non-overlapping regions and a simple average is taken over each
region to obtain the reduced image. The key to the outstanding performance of this algorithm
is the similarity measure, which performs better than the Euclidean distance. It takes a
few mathematical steps to arrive at the similarity measure, and the steps are shown below.

Definition of Lp:

Lp(x, y) = ( Σ_i |x_i − y_i|^p )^(1/p)

Each reduced-resolution image is converted into a vector x, where each component of the
vector corresponds to one pixel in the image. Due to the presence of noise in the images,
the definition of the similarity measure is relaxed to

L0*(x, y) = Σ_i 1( |x_i − y_i| > δ )

where δ is a threshold and 1(·) equals 1 when its condition holds and 0 otherwise.
The similarity measure counts the number of components whose differences exceed the
threshold δ between vector x and vector y, and this count is used to
determine whether the input face exists in the database or not. Fig 3.4 shows some results
of the face recognizer with different face orientations; the correct results are circled in
green.
Fig 3.4 (a) Upright frontal face recognition, (b) database query returns 2 positive results,
(c) this is the same face with an orientation of approximately 15 degrees, (d) database
query returns 1 positive result, (e) this is also the same face with an orientation of
approximately 30 degrees, (f) database query returns no positive results
The results show that the face recognizer can recognize frontal faces quickly and
accurately; however, the performance is not as good when recognizing faces with an
orientation of more than 15 degrees. Therefore, to utilize this face recognizer effectively,
we have to make sure the user is facing the camera directly before recognizing the face,
so that a frontal face image of the user can be captured and used to query the face
database.
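To make the recognition step concrete, here is a minimal sketch of the reduced-resolution averaging and the relaxed L0* count described above. The image dimensions, the block-averaging scheme and the vector types are illustrative assumptions rather than the exact ARENA code.

#include <vector>
#include <cstdlib>

// Reduce a W x H grayscale image to 16 x 16 by averaging non-overlapping blocks
// (assumes W and H are divisible by 16 for simplicity).
std::vector<int> reduceTo16x16(const std::vector<unsigned char>& img, int W, int H)
{
    std::vector<int> out(16 * 16, 0);
    int bw = W / 16, bh = H / 16;                          // block size
    for (int by = 0; by < 16; ++by)
        for (int bx = 0; bx < 16; ++bx) {
            long sum = 0;
            for (int y = by * bh; y < (by + 1) * bh; ++y)
                for (int x = bx * bw; x < (bx + 1) * bw; ++x)
                    sum += img[y * W + x];
            out[by * 16 + bx] = (int)(sum / (bw * bh));    // average of the block
        }
    return out;
}

// Relaxed L0* similarity: count the components whose difference exceeds delta.
// Smaller counts mean more similar images.
int l0StarDistance(const std::vector<int>& a, const std::vector<int>& b, int delta)
{
    int count = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (std::abs(a[i] - b[i]) > delta)
            ++count;
    return count;
}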
3.3 Face Tracking
In our system, we are using a Canon VCC4 camera, which is capable of panning, tilting
and zooming. Before we can track a face or a moving object, we must first obtain
the horizontal and vertical field of view of the camera. Since the camera takes 2D
images, we can find the horizontal field of view by using a board to measure the
horizontal distance the camera can cover and the distance of the camera from the board.
To obtain the vertical field of view, the same procedure is repeated, but this time the
vertical extent of the camera's coverage is measured. The camera is at a zoom
level of 1x during these measurements. Fig 3.5 shows some of the images
captured during the measurement of the horizontal field of view.
Fig 3.5(a) and (b) show the horizontal coverage of 5 squares and 7 squares respectively
From each measurement, the horizontal field of view can be calculated with the formula

tan θ = (horizontal distance covered by the camera) / (distance of the board from the camera)

where θ is the horizontal field of view. An extract of the table of values used to calculate
the fields of view is shown below.

Horizontal distance /cm    Distance from the camera /cm    θ /°
14.3                       13.2                            47.1
23.8                       22.2                            47
30.6                       28.5                            47
By taking the average of all the θ values, the horizontal field of view is calculated to be 47°.
Using the same method, the vertical field of view is calculated to be 18°. These two values
are very important for tracking the face and the motion, because the camera accepts pan and
tilt commands in terms of angles rather than the usual coordinate system.
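The field-of-view values come from a one-line trigonometric calculation; the small sketch below reproduces it for the first measurement in the table (the function name and the printout are illustrative, and the result differs from the tabulated 47.1° only by measurement rounding).

#include <cmath>
#include <cstdio>

const double PI = 3.14159265358979323846;

// Field of view (in degrees) from one measurement: the distance covered by the
// camera on the board and the distance of the board from the camera (both in cm).
double fieldOfViewDeg(double coveredDistanceCm, double boardDistanceCm)
{
    return std::atan(coveredDistanceCm / boardDistanceCm) * 180.0 / PI;
}

int main()
{
    // First measurement from the table above: 14.3 cm covered at 13.2 cm away.
    std::printf("horizontal FOV: %.1f degrees\n", fieldOfViewDeg(14.3, 13.2));
    return 0;
}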
For tracking of the face, the face detector will only consider a face to be present in the
foreground when the number of face detections exceeds a certain threshold; this increases
the efficiency of the face tracker by preventing it from wasting time tracking false positive
detections. Fig 3.6 shows an example of a false positive.
Fig 3.6 A computer casing being falsely identified as a face
In our case, when there are 3 or more face detections in 5 consecutive
frames, the face detector acknowledges that a face is present in the foreground, and it
returns a set of coordinates defining the red rectangle bounding the face in the
image. With this set of coordinates, the tracking system can determine the panning and
tilting angles for the camera to focus on the face, so that the face will be in the centre of
the image. It is important to note that although the face detector can detect multiple faces
in an image, only the coordinates of the face with the largest area are used by the
tracking system. The formula used to compute the panning and tilting angles is simple
and efficient and is shown below. All units refer to the (x, y) coordinate system
with the origin at the top-left of the image. Fig 3.7 shows the accuracy of the tracking
system in tracking faces.
Fig 3.7(a) Face detected in the lower left region of the image, (b) the camera moves such
that the face is now in the centre of the image, (c) face detected in the upper right region
of the image, (d) the camera moves such that the face is now in the centre of the image
Formula for calculating the pan and tilt angles:

Pan angle = (x-coordinate of centre of face – x-coordinate of centre of image) / x-coordinate of centre of image * 23.5°

Tilt angle = (y-coordinate of centre of face – y-coordinate of centre of image) / y-coordinate of centre of image * 5.75°
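A minimal sketch of this angle computation is shown below; the rectangle and angle structs are simple placeholders rather than the types used in the actual system, and the constants are the ones appearing in the formulas above.

struct FaceRect { int x, y, width, height; };
struct Angles   { double pan, tilt; };

// Pan and tilt angles (in degrees) needed to bring the centre of the detected
// face rectangle to the centre of the image, following the formulas above.
Angles faceTrackingAngles(const FaceRect& face, int imageWidth, int imageHeight)
{
    double faceCx = face.x + face.width  / 2.0;   // centre of the face rectangle
    double faceCy = face.y + face.height / 2.0;
    double imgCx  = imageWidth  / 2.0;            // centre of the image
    double imgCy  = imageHeight / 2.0;

    Angles a;
    a.pan  = (faceCx - imgCx) / imgCx * 23.5;     // half of the 47 degree horizontal FOV
    a.tilt = (faceCy - imgCy) / imgCy * 5.75;     // vertical scaling constant from the formula above
    return a;
}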
Although we want to track the face accurately, we do not want the camera to keep
moving about just to keep the face in the dead centre of the image. Instead, we specify a
region in the centre of the image within which we do not track the face; this prevents
jerky camera movements which would affect the performance of face detection. We
specified the width of the region to be ¼ of the width of the image and the height of the
region to be ¼ of the height of the image, with the region centred in the image. Fig 3.8
shows the scenario of a face that is slightly off the centre of the image but still considered
by the face tracker to be within the region.
Fig 3.8 Face detected within the no-tracking region
3.4 Motion Tracking
For tracking of a moving object, the motion detector also returns a set of
coordinates defining the blue rectangle bounding the moving object. In the event of
multiple moving objects, the one with the biggest surface area is tracked. However,
the motion detector only starts when a face has been tracked in the recent frames; the
motion tracker works as a supplement to the face tracker and must therefore avoid
unwanted motion tracking. For example, a cat walking past the kiosk or a vehicle
driving past the kiosk will not be tracked. It is thus reasonable to assume that the
moving object is the body of the person whose face was tracked recently. The formula used
to compute the panning and tilting angles is slightly different from that of the face tracker,
due to some minor changes. The panning angle is calculated in the same
way; however, the formula for the tilting angle no longer uses the y-coordinate of the
centre of the motion, but instead the y-coordinate of the topmost level of the motion.
The idea behind this change is that the moving object detected is the body of a person,
so the face will be at the topmost portion. We therefore want the topmost portion of the
object to be in the centre so as to continue tracking the face. Fig 3.9 illustrates this rationale.
Fig 3.9 The blue rectangle is not in the centre of the image, but its topmost
portion is. Notice that the face is in the central region of the image as well.
If ( y-coordinate of topmost level of object < ¼ height of image )
    Tilt angle = (¼ height – y) / ¼ height * 9°
Else if ( y-coordinate of topmost level of object > ½ height of image )
    Tilt angle = ( y – ½ height ) / ½ height * 9°
Since we are tracking a moving object, we have to account for the fact that the object is
likely to traverse a longer distance between consecutive frames than a face, so we
have included a simple but efficient prediction technique to make the tracking smoother.
Since we assume that human beings walk in a roughly linear fashion, the calculated panning
and tilting angles are multiplied by a factor of 1.5 to account for the distance lost during the
movement of the camera. The motion tracking system is then able to track the moving
object competently and not lose the object due to the slow reaction of the camera.
The multiplication factor cannot be too high, as a high factor would cause the camera to
lose focus on the moving object very quickly when the object moves out of its
range. Fig 3.10 shows the tracking of a moving object in a well-lit room with
a constantly changing background.
Fig 3.10 (a) Moving object detected by the motion detector, (b) object moves to the
right, (c) object continues to move to the right, (d) object stops moving, (e) object
moves to the left. Note that the face tracking system has been disabled so as to allow the
motion tracking system to work.
For the tracking of either a face or a moving object, we predefined a range of movement
for the camera. If the calculated movement of the camera would exceed this range, the
camera moves back to its home position and resets all the variables used in detecting
faces and motion. The camera then starts to detect and track faces again. We have set
the range of the pan angle of the camera to its maximum.
3.5 Zooming
After detecting a face, our face tracking system can also zoom in on the face if the face is
too small. Our camera is capable of 16x zoom; however, we do not make use of the
full zooming capability of the camera, because zooming takes a lot of time and the
face tracker would easily lose the face it is tracking. The camera takes 1.5 seconds to
zoom from 1x to 16x, which means many frames of valuable information would be lost.
Also, there is little point in the face tracking system zooming in on a face that is very far
away.
Although a software development kit is provided for the camera, its documentation is
minimal. All it states is that the zoom function takes an integer value from 0 to 128 as
input, so we decided to measure the effect of different parameter values for the zoom
function. Fig 3.11 shows the area of a black object at different zoom parameters.
Fig 3.11 (a) zoom parameter 0, area 598 pixels, (b) zoom parameter 8, area 727 pixels, (c)
zoom parameter 16, area 860 pixels, (d) zoom parameter 24, area 1054 pixels, (e) zoom
parameter 32, area 1332 pixels, (f) zoom parameter 40, area 1685 pixels, (g) zoom
parameter 48, area 2206 pixels, (h) zoom parameter 56, area 2861 pixels, (i) zoom
parameter 64, area 3805 pixels.
From the results, it is clear that the zoom power increases exponentially with respect
to the value of the input parameter. We therefore restrict the input parameter
to a maximum of 64, since the increase in the area is approximately linear over that range
of values.
The face tracker will only zoom in on a face when the face is in the central region of the
image and the camera does not need to pan or tilt. The reasoning is that if the camera
needs to move to track the face, the face is somewhere outside the central region of the
image; since zooming takes time, the face would have moved out of the frame by the
time the camera finished zooming. Conversely, if the face is in the central region, we can
assume that the user has stopped moving and is now looking at the kiosk, so zooming in
will still keep the face in the frame even if the user moves slightly. Fig 3.12
shows an example of the face tracking system zooming in on a face.
Fig 3.12 (a) A face being detected in the central region, (b) first frame of zooming in, (c)
second frame of zooming in, (d) third and final frame of zooming in
The zooming is done incrementally, so as to make sure the face remains in the central
region and also to react robustly to changes such as the user moving closer to the kiosk or
moving away from the scene. If the user moves closer to the kiosk, his
face appears larger in the frame, so the camera has to zoom out accordingly. If the
user starts moving away, the camera has to stop zooming and start tracking the face again.
Fig 3.13 shows an example of the face tracking system zooming out from a face.
Fig 3.13 (a) A face being detected in the central region, (b) first frame of zooming out, (c)
second and final frame of zooming out
4. Motion Detection
Our project is based on real-time face tracking, but there are times when the face detector
cannot detect a face even though one is present in the frame. We therefore need to rely on a
good motion detector to enhance the performance of the face tracker. However, face
detection is computationally expensive on its own, so we cannot afford to allocate too
many resources to detecting motion. Also, the system should not try to detect motion when
a face is detected in the frame, so as to improve the efficiency of the system. Motion
detection against a complex background is a field of study on its own, so we address this
issue here using very simple and robust techniques.
4.1 Background Model
As mentioned in the Related Work section, there are several background modeling
techniques that can model the background very effectively. One good example is the
multi-modal statistical motion detector from MIT by Stauffer and Grimson. However, this
detector is computationally expensive because it models each pixel with a mixture of 3 to 5
Gaussian distributions. So instead of using complex algorithms, we use a simple weighted
running average of frames to model the background.
The new background image is a weighted sum of the existing background image and the
current frame. This background model is robust to small changes such as illumination
change and movements of background objects like trees. The formula is shown below.

New background image = 0.97 * (existing background image) + 0.03 * (current frame)
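The update is a single weighted addition per pixel. A hedged sketch using the OpenCV 1.x accumulation function is shown below; cvRunningAvg keeps B <- (1 - alpha) * B + alpha * I, so alpha = 0.03 corresponds to the 0.97/0.03 weights above, and the floating-point accumulator image is an assumption about how the buffers are allocated.

#include <cv.h>

// Update the running-average background model from the current grayscale frame.
// 'background' must be a floating-point image (e.g. IPL_DEPTH_32F) of the same size.
void updateBackground(IplImage* grayFrame, IplImage* background)
{
    // background <- 0.97 * background + 0.03 * current frame
    cvRunningAvg(grayFrame, background, 0.03);
}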
We took a video with a complex background to test the accuracy of both models. A
fan is running in the background and the user is wearing a pair of shorts that is very
similar in color to a table in the background. Fig 4.1 shows the background image
modeled by both our running-average model and the Gaussian Mixture Model (GMM),
together with the image of the current frame, at the 75th and 85th frame respectively. We
chose later frames of the video to allow both background models to stabilize.
Fig 4.1 (a) 75th frame of the video, (b) 85th frame of the video, (c) color background
model of running-average of the 75th frame, (d) color background model of running-
average of the 85th frame, (e) grayscale background model of running-average of the 75th
frame, (f) grayscale background model of running-average of the 85th frame, (g)
background model of GMM of the 75th frame, (h) background model of GMM of the 85th
frame.
The GMM is only slightly superior to the running-average model; as can be seen
from the background model of the GMM at the 85th frame, the area underneath the table
is more defined and accurate.
Since additions are computed much faster than Gaussian distributions can be modeled, we
can expect the running average to be very much faster than the Gaussian Mixture
Model. Indeed, Fig 4.2 confirms our hypothesis.
Fig 4.2 The table shows the time in milliseconds needed for each technique to compute
the background model.
The grayscale running-average model, which is an optimized version of the color
running-average model, is clearly the fastest and is almost 60 times faster than the
GMM. Moreover, the GMM is unsuitable for our tracking purposes because of the rapidly
changing background, and it takes time to initialize due to its high overheads.
Hence we use the grayscale running-average model for our background
modeling.
4.2 Difference Image
After obtaining a suitable background model, we subtract the background model
from the current frame to get a difference image. The difference image is supposed to
show all objects in the foreground, and it is grayscale because we are using a grayscale
background model. We then apply a threshold to the difference image to
obtain a binary image, so that the blob counter can work on the binary image to detect
movement. However, due to the presence of noise and minor changes in the environment
such as illumination changes, the difference image is not a perfect reflection of the
foreground. An imperfect background modeling tool also contributes to the inaccuracy.
Therefore, we need to process the image to obtain a best-fit model of the foreground.
We perform smoothing on the difference image to remove the random noise present in
the image. We chose the median filter because it is fast and efficient. After applying the
median filter, we threshold the image to get a binary image. Fig 4.3 shows a
difference image after a median filter has been applied to it, and the binary image
obtained after thresholding it with a value of 30. We are using the 75th frame of the video
as the image, as shown in Fig 4.1(a).
Fig 4.3 (a) Difference image before applying the median filter, (b) difference image after
applying the median filter, (c) binary image of (a) with a threshold value of 30, (d) binary
image of (b) with a threshold value of 30. The effect of the median filter on the binary image
is obvious, as the noise present in (c) is no longer found in (d). Also, the white portions
(blobs) are more connected in (d), which makes it easier for the blob counter to work on
them.
To increase the connectivity of the surrounding blobs, the binary image is further dilated.
This joins adjacent blobs that are part of the foreground, increasing the accuracy of the
motion detector. However, dilation cannot be performed too many times, as we do not
want blobs further away that do not belong to the foreground to be included as well.
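A hedged sketch of this pipeline (absolute difference, 3 x 3 median filter, fixed threshold, one dilation) using the OpenCV 1.x C API is shown below; the buffer layout and kernel sizes are illustrative, and the background model is assumed to have already been converted back to an 8-bit image (e.g. with cvConvert) before the subtraction.

#include <cv.h>

// Build a binary foreground mask from the current grayscale frame and the
// background model (all images 8-bit, single channel, same size).
void foregroundMask(IplImage* grayFrame, IplImage* background8u,
                    IplImage* diff, IplImage* diffFiltered, IplImage* mask,
                    int threshold)
{
    cvAbsDiff(grayFrame, background8u, diff);              // |frame - background|
    cvSmooth(diff, diffFiltered, CV_MEDIAN, 3);            // 3 x 3 median filter removes speckle noise
    cvThreshold(diffFiltered, mask, threshold, 255, CV_THRESH_BINARY);
    cvDilate(mask, mask, NULL, 1);                         // join nearby blobs (do not over-dilate)
}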
4.3 Thresholding the Binary Image
Finding the optimum value for the threshold is very important, as we want as much
foreground as possible to be included in the binary image and as little background as
possible. If we set the threshold value too low, much unwanted information, especially
noise, will be captured in the image. If we set the threshold value too high, we risk
omitting pixels that are actually part of the foreground. For example, in Fig 4.1(a) the user,
wearing dark blue shorts, is walking across the room with a dark brown table in the
background. Since dark brown and dark blue look very similar, we need a relatively low
threshold to capture the motion of the shorts. As we can see from the image in Fig 4.3(d),
the window on the right-hand side of the image is considered to be part of the foreground
due to the low threshold value of 30 being used.
There are many algorithms for calculating the optimum threshold, notably the popular
Otsu algorithm. Please refer to the appendix for the VC++ implementation of the
Otsu algorithm. However, the Otsu algorithm uses floating-point values to calculate the
inter-class variance, so it is quite computationally expensive. We also found that
using the Otsu threshold returns a lot of false positives, so we decided to use a hysteresis
method, similar to that used for finding Canny edges [14], to compute the binary image.
The hysteresis method works by thresholding an image with two different threshold values,
one low and one high. Pixels with values above the high threshold are retained
in the final binary image, while those below the low threshold are rejected
outright. Pixels with values between the low and high thresholds are only retained
if they are directly or indirectly connected to a pixel above the
high threshold, through other pixels that are also between the low and high thresholds.
We set the low threshold to 20 and the high threshold to 60.
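A minimal sketch of the hysteresis step is given below, written directly on an 8-bit buffer rather than with a library call: pixels above the high threshold are marked first, and those regions are then grown through pixels above the low threshold using a simple flood fill. The 4-connectivity and the stack-based fill are implementation choices, not a description of any particular library routine.

#include <vector>
#include <stack>
#include <utility>

// Hysteresis thresholding of a W x H grayscale difference image.
// Pixels >= high are kept; pixels >= low are kept only if connected
// (4-connectivity) to a pixel >= high. Returns a binary mask (0 or 255).
std::vector<unsigned char> hysteresisThreshold(const std::vector<unsigned char>& img,
                                               int W, int H, int low, int high)
{
    std::vector<unsigned char> mask(W * H, 0);
    std::stack<std::pair<int, int> > seeds;

    for (int y = 0; y < H; ++y)                     // seed with the strong pixels
        for (int x = 0; x < W; ++x)
            if (img[y * W + x] >= high) {
                mask[y * W + x] = 255;
                seeds.push(std::make_pair(x, y));
            }

    const int dx[4] = { 1, -1, 0, 0 };
    const int dy[4] = { 0, 0, 1, -1 };
    while (!seeds.empty()) {                        // grow into weak but connected pixels
        int x = seeds.top().first, y = seeds.top().second;
        seeds.pop();
        for (int k = 0; k < 4; ++k) {
            int nx = x + dx[k], ny = y + dy[k];
            if (nx < 0 || ny < 0 || nx >= W || ny >= H) continue;
            if (mask[ny * W + nx] == 0 && img[ny * W + nx] >= low) {
                mask[ny * W + nx] = 255;
                seeds.push(std::make_pair(nx, ny));
            }
        }
    }
    return mask;
}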
To denote motion, we draw a blue rectangle bounding the moving object. To do this,
we use a simple blob counter to identify all the blobs. The blob counter also
filters out all blobs that are smaller than 500 pixels in area, as such blobs are usually
interference from the environment. We then find the largest blob and draw a bounding
blue rectangle over it.
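A hedged sketch of this step using the OpenCV 1.x contour functions is shown below; it treats the external contours of the binary mask as the blobs, drops those smaller than 500 pixels and returns the bounding box of the largest remaining one. The contour retrieval parameters are an assumption about one reasonable way to implement the blob counter, not the exact code used in our system.

#include <cv.h>
#include <cmath>

// Bounding box of the largest blob (>= 500 px in area) in a binary mask.
// Note: cvFindContours modifies its input image, so pass a scratch copy.
// Returns a zero-sized rectangle when no blob is large enough.
CvRect largestBlob(IplImage* binaryMask, CvMemStorage* storage)
{
    cvClearMemStorage(storage);
    CvSeq* contours = 0;
    cvFindContours(binaryMask, storage, &contours, sizeof(CvContour),
                   CV_RETR_EXTERNAL, CV_CHAIN_APPROX_SIMPLE, cvPoint(0, 0));

    CvRect best = cvRect(0, 0, 0, 0);
    double bestArea = 0.0;
    for (CvSeq* c = contours; c != 0; c = c->h_next) {
        double area = fabs(cvContourArea(c, CV_WHOLE_SEQ));
        if (area >= 500.0 && area > bestArea) {      // ignore small blobs (interference)
            bestArea = area;
            best = cvBoundingRect(c, 0);
        }
    }
    return best;
}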
Our simple method works very well, as shown in Fig 4.4, where comparisons are made
with images thresholded with the arbitrary values of 20 and 60 and with the threshold
value computed by the Otsu algorithm. The same blob counter with the same parameters
was used during the motion detection. Again, we are using the 75th frame of the video.
Fig 4.4 (a) Threshold value 20, (b) threshold value 60, (c) threshold value 27, as returned
by the Otsu algorithm, (d) hysteresis threshold of 20 and 60. As shown above, our hysteresis
threshold manages to outperform the other thresholding methods.
5. The Speaking Kiosk
The kiosk is able to track faces in real-time, but we decided to add some fun factor to
make the kiosk more interesting. When the kiosk detects a face, it invites the
user to try the face recognition system by saying "Hi there, would you like to try the
kiosk?" If the user leaves the view of the kiosk, the kiosk politely thanks the user for
his time by saying "Thank you for your interest in the kiosk. We hope to see you again.
Have a nice day." When the kiosk is attempting to recognize the face, it informs the
user by saying "Please hold as we attempt to recognize your face." If the kiosk is unable
to detect a face when the user attempts recognition, it responds
"Sorry, we did not detect a face. Would you like to try again?" By introducing speech
into the kiosk system, we hope to make the kiosk more user-friendly and more accessible.
6. Conclusion
6.1 Summary
In this paper, we gave a brief discussion of some of the useful applications in which
face tracking can play an important role. This project presents a system that is able to
detect faces and track them in real-time. The system also has a motion detector
embedded in it to help improve the performance of the face tracking system. We not only
concentrated on the technical aspects of the system, but also attempted to introduce some
affability by making the kiosk "speak" to the user.
Since our real-time face tracking system depends heavily on the face detection system, an
efficient and fast face detector is a must. The face detector based on the cascade
AdaBoost architecture, boosted by the Haar-like features, performs up to our expectations.
Also, the face recognition system, implemented using the ARENA algorithm, utilizes the
nearest-neighbor technique competently, and is able to query the face
database to retrieve the nearest 6 matches.
After precise calibration and meticulous measurements, the camera is able to track faces
very accurately, and for motion tracking a simple prediction system makes sure the
moving object remains in the central region of the camera's view. We discarded the
more computationally expensive algorithms in favor of faster and simpler ones, to make our
face tracking system more robust to changes.
We also placed a high importance on the motion detection system as an improvement to
the face tracking system, and we have developed a straightforward and competent motion
detection system which has facilitated the whole project greatly.
6.2 Limitations
The face detection system, although efficient, is still unable to avoid the problem
of false positives. Even with the addition of a threshold system, the camera might still end
up tracking the wrong object.
Due to the changing background during tracking, a shift from a weakly lit background to
a strongly lit background will corrupt the motion detection system.
The inherent zoom function of the camera takes time, and the system might not be
able to react accordingly if the user walks towards the camera or moves away from it
very quickly.
The face detection system is unable to detect faces with more than 15 degrees of orientation,
and this hampers the performance of the face tracking system.
6.3 Further Works
Since the face detection system forms a vital part of the face tracking system, developing
a faster face detection system will aid the face tracking system. There is much ongoing
research on this topic, and we hope that a faster and more efficient face detection
technique will be proposed.
One limitation of the current system is that it only detects upright frontal faces; therefore,
coming up with a technique to train the system for different head orientations, and then
combining the results for a more powerful face detection system, is a good way
to improve the current system.
We mentioned in this paper that we tried to make the kiosk more user-friendly. The
kiosk can be further improved by adding voice synthesis so as to allow the user to interact
with it. We have also contemplated the idea of gesture recognition, so that the user
can perform face recognition on the kiosk just by making some pre-defined gestures.
However, due to time limitations, we leave these ideas as further
improvements to the project.
References

[1] M. Kirby and L. Sirovich, "Application of the Karhunen-Loeve procedure for the characterization of human faces", IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(1), 1990.

[2] R. Feraud, O.J. Bernier, J.E. Viallet and M. Collobert, "A fast and accurate face detector based on neural networks", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 1, pp. 42-53, 2001.

[3] A. Rajagopalan, K. Kumar, J. Karlekar, R. Manivasakan, M. Patil, U. Desai, P. Poonacha and S. Chaudhuri, "Finding faces in photographs", in Proc. 6th IEEE International Conference on Computer Vision, pp. 640-643, 1998.

[4] KC Yow and R. Cipolla, "Feature-based human face detection", Image and Vision Computing, no. 15, pp. 713-735, 1997.

[5] M. Turk and A. Pentland, "Face recognition using eigenfaces", in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 586-591, 1991.

[6] Roberto Brunelli and Tomaso Poggio, "Face recognition: features versus templates", IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(10), pp. 1042-1052, 1993.

[7] C. Tomasi and T. Kanade, "Detection and tracking of point features", Tech. Rep. CMU-CS-91-132, Carnegie Mellon University, 1991.

[8] C. Stauffer and W.E.L. Grimson, "Adaptive background mixture models for real-time tracking", in Proc. CVPR, 1999.

[9] P. KaewTraKulPong and R. Bowden, "An improved adaptive background mixture model for real-time tracking with shadow detection", in Proc. 2nd European Workshop on Advanced Video Based Surveillance Systems, 2001.

[10] N. Otsu, "A threshold selection method from gray-level histograms", IEEE Transactions on Systems, Man, and Cybernetics, 9:62-66, 1979.

[11] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features", in Proc. CVPR, pp. 511-518, 2001.

[12] Rainer Lienhart and Jochen Maydt, "An extended set of Haar-like features for rapid object detection", IEEE ICIP 2002, vol. 1, pp. 900-903, Sep. 2002.

[13] Yoav Freund and Robert E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting", Journal of Computer and System Sciences, 55(1):119-139, August 1997.

[14] J. Canny, "A computational approach to edge detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, 8:679-714, 1986.

[15] B. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision", in Proc. DARPA Image Understanding Workshop, pp. 121-130, 1981.