Honours Year Project Report
Real-time Face Tracking
By
Yeow Hong Ann
Department of Computer Science
School of Computing
National University of Singapore
2005/2006
Project No: H055200
Advisor: Dr. Terence Sim
Deliverables:
Report: 1 Volume
Program: 1 CD
Abstract
Face detection and recognition have been widely researched topics due to their high
application value in various fields and industries, such as biometrics and image retrieval.
Recently, face recognition has also been deployed to aid counter-terrorism by
recognizing known criminals and terrorists at important installations. In this paper,
we present an approach to the real-time detection and recognition of human faces, and to the
tracking of the face. We have also modified the approach to handle situations when the
face cannot be detected; in that case we track the motion under a changing background in
real-time. Our approach uses the cascade architecture for face detection proposed by
Viola and Jones, which uses a feature selection algorithm based on AdaBoost. This
architecture has been incorporated into the existing OpenCV library. For face
recognition, we use ARENA, a simple, memory-based algorithm.
Because the background changes rapidly during tracking, we use a modified
background subtraction method that can adapt robustly and detect the motion. A
Canon VCC4 pan/tilt/zoom camera is employed to track the face and the motion.
Subject Descriptors:
I.4.3 Enhancement
I.4.6 Segmentation
I.4.8 Scene Analysis
I.5.2 Design Methodology
Keywords:
Face detection, computer vision, face recognition
Implementation Software and Hardware:
Pentium 1.6GHz PC, Euresys Picolo Card, Canon VCC4 camera, VC++, OpenCV beta5
Acknowledgement
I would like to thank everyone who helped me complete this project.
I would like to thank my supervisor, Dr Terence Sim, for his invaluable advice and
guidance throughout this project.
I would like to thank my two mentors, Zhang Sheng and Rajkumar, for their help and for
their patience with my constant requests for advice.
I would like to thank my fellow HYP mate Dennis Ng, for his advice and assistance in
my work.
Lastly, I would like to thank my family for their support.
List of Figures
2.1 Detection of a side view of a face
2.2 Geometrical features (white)
2.3 Facial features mask
3.1 Integral image at location (x, y)
3.2 Different types of Haar-like features
3.3 Cascade of classifiers
3.4(a) Upright frontal face recognition
3.4(b) Database query returns 2 positive results
3.4(c) Orientation of approximately 15 degrees
3.4(d) Database query returns 1 positive result
3.4(e) Orientation of approximately 30 degrees
3.4(f) Database query returns no positive results
3.5(a) Horizontal coverage of 5 squares
3.5(b) Horizontal coverage of 7 squares
3.6 Computer casing falsely identified as a face
3.7(a) Face detected in lower left region of image
3.7(b) Camera moves
3.7(c) Face detected in upper right region of image
3.7(d) Camera moves
3.8 Face detected within no-tracking region
3.9 Blue rectangle not in the centre of the image
3.10(a) Moving object detected by motion detector
3.10(b) Object moves to the right
3.10(c) Object continues to move to the right
3.10(d) Object stopped moving
3.10(e) Object moves to the left
3.11(a) Zoom parameter 0, area 598 pixels
3.11(b) Zoom parameter 8, area 727 pixels
3.11(c) Zoom parameter 16, area 860 pixels
3.11(d) Zoom parameter 24, area 1054 pixels
3.11(e) Zoom parameter 32, area 1332 pixels
3.11(f) Zoom parameter 40, area 1685 pixels
3.11(g) Zoom parameter 48, area 2206 pixels
3.11(h) Zoom parameter 56, area 2861 pixels
3.11(i) Zoom parameter 64, area 3805 pixels
3.12(a) A face detected in the central region
3.12(b) First frame of zooming in
3.12(c) Second frame of zooming in
3.12(d) Third and final frame of zooming in
3.13(a) A face detected in the central region
3.13(b) First frame of zooming out
3.13(c) Second and final frame of zooming out
4.1(a) 75th frame of the video
4.1(b) 85th frame of the video
4.1(c) Color background model of running-average of 75th frame
4.1(d) Color background model of running-average of 85th frame
4.1(e) Grayscale background model of running-average of 75th frame
4.1(f) Grayscale background model of running-average of 85th frame
4.1(g) Background model of GMM of 75th frame
4.1(h) Background model of GMM of 85th frame
4.2 Time in milliseconds needed for each technique
4.3(a) Difference image before applying median filter
4.3(b) Difference image after applying median filter
4.3(c) Binary image of 4.3(a) with threshold of 30
4.3(d) Binary image of 4.3(b) with threshold of 30
4.4(a) Threshold value 20
4.4(b) Threshold value 60
4.4(c) Threshold value 27, returned by Otsu algorithm
4.4(d) Hysteresis threshold
Table of Contents

Title
Abstract
Acknowledgement
List of Figures
1 Introduction
1.1 Background
1.2 Project Objective
2 Related Work
2.1 Face Detection
2.1.1 Linear Subspace Methods
2.1.2 Learning Networks
2.1.3 Statistical Approaches
2.1.4 Feature-based Approach
2.2 Face Recognition
2.2.1 Eigenfaces
2.2.2 Neural Network
2.2.3 Geometric Matching and Template Matching
2.3 Motion Detection
2.3.1 Optical Flow Tracking
2.3.2 Motion-based Tracking
2.3.3 Filtering and Thresholding
3 Real-time Face Tracking
3.1 Face Detection
3.1.1 Integral Image
3.1.2 Haar-like Features
3.1.3 Learning Algorithm
3.1.4 Cascade of Classifiers
3.2 Face Recognition
3.3 Face Tracking
3.4 Motion Tracking
3.5 Zooming
4 Motion Detection
4.1 Background Model
4.2 Difference Image
4.3 Thresholding the Binary Image
5 The Speaking Kiosk
6 Conclusion
6.1 Summary
6.2 Limitations
6.3 Further Works
References
1. Introduction
1.1 Background
Over the last few years, there have been rapid advances in the development of new
algorithms to detect faces, thanks to increasing processing power and the decreasing cost
of video acquisition and processing devices. The demand for more accurate face
processing has also been fueled by the recent terrorist attacks in different parts of the
world.
In our case, we have decided to build a program that is not only able to detect
and recognize faces, but is also able to perform tracking and motion detection. There
are four major areas relating to the functions of the kiosk: 1) face detection, 2) face
tracking, 3) motion tracking in the absence of any detected face, and 4) face recognition,
which is subject to the consent of the user. The Intel OpenCV library has a
multitude of functions that are very useful for image processing, and we will be using the
OpenCV library frequently.
This real-time face tracking application can be employed in many situations, which indicates
its high commercial value. The camera can be mounted on a kiosk,
which can then be installed at critical installations to enhance security.
An alarm system can work hand in hand with the kiosk to alert security personnel
to potentially dangerous intruders. On the fun side, this application can be used as an
interactive system, where a person can allow the machine to
try to match their face with celebrities or other images from a database. Voice
synthesis integrated with the system allows for a more interesting
human-machine interaction.
We therefore approach the area of face tracking through face detection in this
project, focusing on tracking faces, or the moving object in the absence of a detected face.
1.2 Project Objective
The objective of this project is to track a face in real-time: given a live input
from a camera, the system should be able to detect a face and track it. This is achieved by using
simple and robust image processing methods on each individual frame to detect faces and
motion. Fast mathematical methods are used to calculate the angle and zoom level of
the camera in order to focus on the object of interest. Using the Software Development
Kit supplied by Canon for the Canon VCC4 camera, we then communicate with the
camera and control it to focus on our object of interest.
The basic assumptions of this project are as follows:
• The system is able to detect all the faces captured by the camera; however,
it will only try to track the face with the biggest surface area.
• The motion detection works in the background, and will only be
activated when a face detected in the previous frames is no longer
detected in the current frame. We assume that the motion detected is the
motion of the person whose face was detected in the previous frames. Also,
only the motion with the biggest area detected in the frame will be tracked.
2. Related Work
2.1 Face Detection
Over the past two decades, a large number of papers have been published in the field of
face detection, and many approaches have been devised to solve the problem. We will
briefly touch upon a few representative works.
2.1.1 Linear Subspace Methods
In the late 1980s, Sirovich and Kirby managed to develop a technique to represent human
faces efficiently using Principal Component Analysis (PCA) [1]. PCA is performed on a
set of images so that the original data can be represented in terms of the eigenvectors
found from the covariance matrix. The eigenvectors are the principal
components of the distribution of faces, so the more significant ones are used to
approximate each individual face. A new image can then be compared to the
existing images; a good match will be the existing image with the smallest difference from
the new one. However, PCA is not optimal, as the face space can be
separated into subclasses. This is a good example of an image-based approach, which
relies on statistical analysis and machine learning to solve face detection as a general
pattern recognition problem. The intensities of the example images are used as
representations instead of the features of the images. By depending on learned
characteristics that have been gathered, face detection can be treated as a two-class
classification problem and then solved using those characteristics, which are usually in the
form of distributions.
2.1.2 Learning Networks
A neural network is an interconnected assembly of simple processing nodes, whose
processing ability is stored in the inter-unit connection weights, obtained by a process of
learning from a set of training patterns. Neural network is used to create the face database
for face detection. This is possible because face detection can be interpreted as a two-
class pattern detection problem. One significant advanced neural network approach is that
proposed by R. Feraud, O.J. Bernier, J.E. Viallet and M. Collobert [2], which is based a
Constrained Generative Model. This neural network is designed to looks at windows of
size 15 by 20 pixels. A major problem arises with window scanning techniques is
overlapping detection. This problem was dealt with by using several pre-filters and a fast
search algorithm. This also reduced the false alarm rate which in turn leads to a lower
average processing time per image. This face detector is also able to detect faces up to
90 ۫. Fig 2.1 shows the detection of a side view of a face using this face detector.
Fig 2.1 Detection of a side view of a face
2.1.3 Statistical Approaches
There are several statistical approaches to face detection. An excellent example of a
statistical approach to the face detection problem is the Hidden Markov Model (HMM) [3]
approach. This scheme is based on the idea of identifying the face to non-face and non-
face to face transitions in an image. Based on these transitions, an observation sequence
is generated and the HMM parameters are extracted for further analysis and learning for
the HMM model. The window of the face pattern being looked at is small at only the size
of 13 x 13 pixels which allows for a more compact and efficient training set.
2.1.4 Feature-based Approach
Most feature-based approach usually start processing at the pixel level using low level
features, such as edge detection or skin color, as a guide to detect faces. Due to the low
level nature of these features and that the image might be degraded by factors such as
illumination and noise, more complex features will be used at the later stages to reduce
the feature ambiguity. An outstanding feature-based approach by KC Yow and R. Cipolla
[4] utilizes an adaptive skin color model which is able to adapt online to detect skin
pixels, after which a region of interest will be classified and grouped together. By further
analyzing the region’s geometrical shape and information contents such as the ratio of
4
skin color to non-skin color, an informed decision can be made as to whether a face is
present in the region or not. However the processing time is dependent on the number and
size of the skin regions. Also this model may fail to detect faces when illumination
changes significantly.
2.2 Face Recognition
The idea of face recognition is to examine a set of images and try to find a good match
for a given image. Face recognition is a more difficult and complicated problem than face
detection, because face detection only has to deal with two classes of objects, faces and
non-faces. Moreover, the same facial features can be found on different faces, and different
faces can be very similar to each other. Several approaches to face recognition use
Principal Component Analysis (PCA) for pre-processing and feature extraction of the
input images, or use "eigenfaces", which are eigenvectors approximated
from the face image's auto-correlation matrix. Recently, neural-network-based algorithms
have also been used widely due to their real-time computation and adaptation abilities.
The geometrical approach to face recognition takes advantage of the spatial configuration
of facial features. The face is processed and the main geometrical features of the
face, such as the eyes, nose and mouth, are first located. Each face is then
classified based on various factors such as geometrical distances and angles between
features.
2.2.1 Eigenfaces
Using eigenfaces is an example of a statistical approach to face recognition. The concept
of eigenfaces is based on the idea of Turk and Pentland [5] of finding the more
distinctive features in a face and using them to distinguish one face from another. Every
face can be represented as a vector of weights, which are calculated by a
simple inner product operation that projects the image onto the eigenface components. To
recognize a face that has been detected and normalized, its vector of weights is first
calculated and then matched with the faces in a database to find the image with the
closest weights in Euclidean distance. All faces of the same person are likely to be very
near to each other in terms of Euclidean distance, while different persons will form
different clusters of faces. This approach is fairly robust and adaptive to illumination
changes, but it degrades quickly as the size of the image changes. There is also no prior
knowledge of the distribution of the values of the face cluster, which makes it harder to
determine whether a given Euclidean distance is small enough for us to say that two
faces are actually the same.
2.2.2 Neural Network
We have mentioned before that applying a set of neural network based filters to an image
and then combining the outputs using an arbitrator allows us to detect faces. The
neural network architecture can also be utilized in an unsupervised manner to recognize
faces. Basically, this is done by gathering all the common faces from each person and
projecting them as eigenfaces, so that the neural networks can learn to classify them with
the new face descriptor as input. To facilitate training, a neural
net is created for each person. Whenever an input face is available, all the neural nets
evaluate the input face and output a value, and a recognition algorithm analyzes these
values and chooses the net with the highest value. If this value is above a certain
threshold, the net is said to match the input face. Training the neural
network is very important, and new faces are always added to the training examples to
retrain the network. This is because these new faces not only improve the accuracy
of their own network, but also improve the performance of the other networks.
2.2.3 Geometric Matching and Template Matching
Roberto Brunelli and Tomaso Poggio [6] developed two algorithms for face recognition:
geometric feature based matching and template matching. The geometric feature based
matching approach analyzes the face and automatically detects 35 facial features, such as
eyebrow thickness and vertical position, nose vertical position and width, chin shape and
zygomatic breadth. Using these 35 facial features, a 35-D vector is formed, and,
under the assumption of a Gaussian distribution, recognition is performed with a Bayes
classifier. Fig 2.2 shows the facial features that have been detected, marked in
white. In the template matching approach, the face of each person is analyzed and
represented by an image and four masks corresponding to the eyes, nose, mouth and face.
Normalized cross correlation is performed between the input image and the database images,
each of which returns a vector of matching scores (one per feature). The individual scores
are summed up, and the person with the highest cumulative score is chosen.
They also performed recognition based on single features; sorted by
decreasing performance, the features are the eyes, nose, mouth and whole face template. Fig 2.3 shows the
masks detected by template matching.
Fig 2.2 Geometrical features (white) Fig 2.3 Facial features mask
2.3 Motion Detection
Motion detection is an important area of research in computer vision and a very crucial
low-level task for many computer vision applications, such as traffic monitoring and
video surveillance. Generally, two kinds of methods are used for motion-based
tracking systems: optical flow tracking methods and motion-energy methods.
2.3.1 Optical Flow Tracking
Optical flow tracking is also known as computing the motion in an image. It basically
tracks a set of points across multiple images to find out how an object has moved. This
technique has been used extensively in the pyramidal Lucas-Kanade model [15]. First, a set
of good features should be determined for tracking purposes, since tracking every pixel of
the starting image to the destination image would be very wasteful. These features
should be sharp and well-defined. Formally, this means a feature should have large
eigenvalues, which indicate that the feature stands out above the image noise and
can thus be tracked more reliably. Tomasi's [7] method of finding good features to track
is an improvement in that it provides a way to assess the goodness of features for tracking
and accounts for affine transformations of features. Since local accuracy and robustness have
to be taken into account when choosing the integration window size used to track the features,
a pyramidal implementation of the Lucas-Kanade algorithm, that is, an iterative
implementation of the algorithm, is able to optimize the window size and the
tracking results. This algorithm is also able to compute the optical flow at subpixel
accuracy using bilinear interpolation. It can also cater to "lost" features that are either no
longer in the images or have disappeared due to occlusion.
2.3.2 Motion-based Tracking
This is perhaps the most intuitive technique for tracking motion. After acquiring a
background model, we can use background subtraction, subtracting the current frame
from the background image. The resultant pixel values are then examined: those near
zero are classified as background, whereas large values are classified as
foreground. However, acquiring the background model can be quite a daunting task in
itself.
The most straightforward approach is to set up the camera, make sure nobody is in the
image, and then take that image as the constant background image. This is the simplest
and fastest way, but it does not account for situations where the illumination changes
subtly over time or where there is movement in the background (e.g. trees swaying).
There are more complex methods that use a sophisticated statistical model for each pixel.
The multi-modal statistical motion detector from MIT by Stauffer and Grimson [8] has
been extensively used because of its accuracy in modeling the background. This
detector models each pixel with a mixture of 3 to 5 Gaussian distributions, but it is
computationally expensive. A further improvement on this detector has been proposed by
P. KaewTraKulPong and R. Bowden [9] to improve the learning speed and accuracy
of the system. A shadow detection technique has also been introduced, simply by noting
the difference between a foreground pixel and the background model: if the difference in
both the chromatic and brightness components is small, the pixel is considered to be part
of a shadow. However, this requires a color model that can separate chromatic and
brightness components.
2.3.3 Filtering and Thresholding
After acquiring the background model, the difference image between the current frame
and the background image needs to be converted into a grayscale image before being
thresholded into a binary image. The binary image can then be analyzed to
check for moving objects in the foreground. Before conversion, filters are normally
applied to the difference image to remove noise and enhance the detection process.
Some commonly used filters are Gaussian smoothing and simple blurring.
There are several techniques for finding the optimal threshold. The most
straightforward one is to set the threshold to an arbitrary integer, but this
approach is not robust enough: a value set too high will eliminate pixels that are part of
the foreground, whereas a value set too low will introduce a lot of noise and unwanted
pixels that do not belong to the foreground. Calculating the stable Euler number is also a
good way to obtain the threshold, since the Euler number can be determined efficiently
and only requires a single raster scan. Otsu's thresholding method [10] is based on the
idea of creating a histogram of the image and then finding a threshold value that
minimizes the within-class variance of the resulting foreground and background classes and
maximizes the between-class variance.
3. Real-time Face Tracking
We have implemented this system and mounted the camera on a kiosk. Any user can
walk by the kiosk, and the kiosk will detect the face and invite the user to try to
recognize his face against the existing face database on our server. The kiosk will "show" its
sincerity by focusing on the face and zooming in if applicable. The kiosk will then track the
face if the face moves out of the focus of the camera, and again zooming will be done if
applicable. If the face can no longer be detected when the user moves away, the kiosk will
use motion detection to keep focusing on the user while attempting to detect
the face at the same time. If the user or his face moves out of a predefined range, the
camera will go back to its home position. All the tracking and zooming is performed
with a Canon VCC4 pan/tilt/zoom camera.
If the user is willing to try out face recognition on the kiosk, he can instruct
the kiosk to attempt recognition. The kiosk will evaluate the face and search its database
for the 6 best-matching faces according to the kiosk's face recognition
algorithm. If the user's face is not among these 6 faces, the user can add his
own face into the database so that recognition can be performed the next
time he uses the kiosk.
3.1 Face Detection
The face detection system in the kiosk is based on the cascade AdaBoost architecture
developed by P. Viola and M. Jones [11]. R. Lienhart and J. Maydt [12] later
extended this face detector in their paper "An Extended Set of Haar-like
Features" to improve the efficiency of the proposed algorithm. These algorithms have
been incorporated into the Intel OpenCV library, which we use extensively to
perform face detection. There are four important parts to the face detection
algorithm.
3.1.1 Integral Image
The concept of the integral image was introduced to speed up the computation of rectangle
features. The integral image at location (x, y) is defined as the sum of the pixels above and
to the left of that location. Mathematically, it can be expressed as follows:

ii(x, y) = Σ_{x' ≤ x, y' ≤ y} i(x', y')

where i(x, y) is the original image and ii(x, y) is the integral image. Fig 3.1 shows the
value of the integral image at location (x, y) as the sum of the values of the pixels in the
shaded area.
Fig 3.1 Integral image at location (x, y)
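To make the computation concrete, the following is a minimal sketch in plain C++ (not the OpenCV implementation) of building an integral image from a grayscale image stored row-major; the function name and the use of std::vector are illustrative choices only.

#include <vector>

// Build the integral image ii from a W x H grayscale image img (row-major).
// ii(x, y) holds the sum of all pixels at positions (x', y') with x' <= x and y' <= y.
std::vector<long> buildIntegralImage(const std::vector<unsigned char>& img, int W, int H)
{
    std::vector<long> ii(W * H, 0);
    for (int y = 0; y < H; ++y) {
        long rowSum = 0;                                   // cumulative sum of the current row
        for (int x = 0; x < W; ++x) {
            rowSum += img[y * W + x];
            // integral value = cumulative row sum + integral value of the pixel above
            ii[y * W + x] = rowSum + (y > 0 ? ii[(y - 1) * W + x] : 0);
        }
    }
    return ii;
}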
3.1.2 Haar-like Features
The integral image was developed with the purpose of speeding up the computation of
rectangular Haar-like features, which are very simple to compute from it. The proposed
Haar-like features extend the basic rectangular feature set by adding features in the form of
rectangles rotated by 45 degrees. Like the basic feature set, these rotated features can also
be computed in one pass over the image.
Fig 3.2 Different types of Haar-like features
The value of a feature is calculated by subtracting the sum of the pixels in the black
area from the sum of the pixels in the white area, with appropriate scaling.
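As an illustration of why the integral image is useful here, the sum of the pixels inside any upright rectangle can be read off with four lookups, and a two-rectangle feature is then just the difference of two such sums. The sketch below builds on the integral image sketch from Section 3.1.1; it is not the OpenCV feature code, and the horizontal two-rectangle layout is only one of the feature types shown in Fig 3.2.

#include <vector>

// Sum of the pixels in the upright rectangle with top-left corner (x, y),
// width w and height h, using four lookups into the integral image ii
// (W is the image width; lookups that fall outside the image count as 0).
long rectSum(const std::vector<long>& ii, int W, int x, int y, int w, int h)
{
    long A = (x > 0 && y > 0) ? ii[(y - 1) * W + (x - 1)] : 0;  // above-left of the rectangle
    long B = (y > 0) ? ii[(y - 1) * W + (x + w - 1)] : 0;       // above-right
    long C = (x > 0) ? ii[(y + h - 1) * W + (x - 1)] : 0;       // below-left
    long D = ii[(y + h - 1) * W + (x + w - 1)];                 // below-right
    return D - B - C + A;
}

// A horizontal two-rectangle feature: white (left) half minus black (right) half.
long twoRectFeature(const std::vector<long>& ii, int W, int x, int y, int w, int h)
{
    return rectSum(ii, W, x, y, w / 2, h) - rectSum(ii, W, x + w / 2, y, w / 2, h);
}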
3.1.3 Learning Algorithm
In a machine-learning approach to face detection, an efficient learning algorithm is
essential for the system to learn how to differentiate faces from other images. The
algorithm used in our face detector is a modified variant of the original
AdaBoost algorithm proposed by Freund and Schapire [13]. The AdaBoost algorithm
offers formal guarantees on performance, so it is very suitable as an algorithm
for boosting weak classifiers based on all the given features. In the modified version of
the AdaBoost algorithm, the result returned by each weak classifier depends on only one
feature. This allows us to select a small number of significant features which
can then be combined into an accurate classifier, instead of using all of the roughly 180,000
rectangle features associated with each image sub-window, which would be
computationally expensive. The task of each weak classifier is now much simpler: it
determines the optimal threshold value with the minimum rate of misclassification
on the given positive and negative examples. Mathematically, a weak classifier h_j(x)
performs the following thresholding operation:

h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and h_j(x) = 0 otherwise

where x is a 24 x 24 pixel image, f_j is a feature, θ_j is the threshold and p_j is a polarity
specifying the direction of the inequality sign. The AdaBoost algorithm uses the results
from every weak classifier, each thresholded on one feature, to
construct a so-called strong classifier to be used for the face detection system. This
system can be further improved by using a cascade of classifiers, since relying on one
classifier alone is not accurate enough: accuracy can be improved
while the computation time is decreased.
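The following is a minimal sketch of a single-feature weak classifier and the resulting strong-classifier vote. The struct layout and the half-weight voting rule follow the standard AdaBoost formulation of the formula above; the learned parameters themselves (θ, the polarity and the weights α) are placeholders, not values from our trained detector.

#include <vector>

// One weak classifier: returns 1 (face-like) if p * f(x) < p * theta, else 0,
// where f(x) is the value of a single Haar-like feature on a 24 x 24 window x.
struct WeakClassifier {
    double theta;    // learned threshold
    int    polarity; // +1 or -1, direction of the inequality

    int classify(double featureValue) const {
        return (polarity * featureValue < polarity * theta) ? 1 : 0;
    }
};

// Strong classifier: a weighted vote of the selected weak classifiers.
// alpha[t] is the weight learned for weak classifier t during boosting.
int strongClassify(const std::vector<WeakClassifier>& weak,
                   const std::vector<double>& featureValues,
                   const std::vector<double>& alpha)
{
    double vote = 0.0, halfTotal = 0.0;
    for (std::size_t t = 0; t < weak.size(); ++t) {
        vote      += alpha[t] * weak[t].classify(featureValues[t]);
        halfTotal += 0.5 * alpha[t];
    }
    return (vote >= halfTotal) ? 1 : 0;   // face if the vote exceeds half the total weight
}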
3.1.4 Cascade of Classifiers
A cascade of classifiers resembles a decision tree. Starting from the
root classifier, the next classifier is evaluated only when the current classifier returns
a positive result; a negative result at any stage of the cascade leads to immediate
rejection. This form of rejection decreases computation time, since rejected
sub-windows are not processed any further. The cascade is designed to
reject as many negative sub-windows as possible, since the vast majority of sub-windows
are negative. This cascade operation is depicted in Fig 3.3.
Fig 3.3 Cascade of classifiers
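In our system the cascade itself is not re-implemented; it is loaded and evaluated through OpenCV. The sketch below shows roughly how the detector can be called through the OpenCV 1.x C API to keep only the largest face; the exact argument list is an assumption, as the signature of cvHaarDetectObjects differs slightly across early OpenCV releases, and the cascade is assumed to have been loaded beforehand (typically once at start-up with cvLoad on one of the XML cascade files shipped with OpenCV, together with a memory storage created by cvCreateMemStorage).

#include <cv.h>

// Detect faces with a pre-trained Haar cascade and return the largest one
// (sketch only; a zero-sized rectangle means "no face found").
CvRect detectLargestFace(IplImage* frame, CvHaarClassifierCascade* cascade,
                         CvMemStorage* storage)
{
    cvClearMemStorage(storage);
    // Scan the frame at multiple scales; each candidate window must be confirmed
    // by at least 3 neighbouring detections before it is reported.
    CvSeq* faces = cvHaarDetectObjects(frame, cascade, storage,
                                       1.1 /* scale step */, 3 /* min neighbours */,
                                       CV_HAAR_DO_CANNY_PRUNING, cvSize(30, 30));
    CvRect best = cvRect(0, 0, 0, 0);
    for (int i = 0; faces && i < faces->total; ++i) {
        CvRect* r = (CvRect*)cvGetSeqElem(faces, i);
        if (r->width * r->height > best.width * best.height)
            best = *r;                          // keep only the face with the largest area
    }
    return best;
}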
3.2 Face Recognition
For face recognition, our system uses a simple, memory-based technique known as
ARENA. The resolution of the input image is first reduced to 16x16 pixels, and then a
similarity measure, L0*, is used to compute the difference between the input image and the
stored images in the database. To reduce the resolution of the input image quickly, the
image is divided into non-overlapping regions and a simple average is taken over each
region to obtain the reduced image. The key to the outstanding performance of this algorithm
is the similarity measure, which performs better than the Euclidean distance. It takes a
few mathematical steps to arrive at the similarity measure, and the steps are shown below.

Definition of Lp:

Lp(x, y) = ( Σ_i |x_i − y_i|^p )^(1/p)

Each reduced-resolution image is converted into a vector x, where each component of the
vector corresponds to one pixel in the image. Due to the presence of noise in the images,
the definition of the similarity measure is relaxed to

L0*(x, y) = Σ_i 1( |x_i − y_i| > δ )

where δ is a threshold and 1(·) equals 1 when its condition holds and 0 otherwise.
The similarity measure counts the number of components whose differences exceed the
threshold δ between vector x and vector y, and this count is used to
determine whether the input face exists in the database or not. Fig 3.4 shows some results
of the face recognizer with different face orientations; the correct results are circled in
green.
Fig 3.4 (a) Upright frontal face recognition, (b) database query returns 2 positive results,
(c) this is the same face with an orientation of approximately 15 degrees, (d) database
query returns 1 positive result, (e) this is also the same face with an orientation of
approximately 30 degrees, (f) database query returns no positive results
The results show that the face recognizer can recognize frontal faces quickly and
accurately; however, the performance is not as good when recognizing faces with an
orientation of more than 15 degrees. Therefore, to utilize this face recognizer effectively,
we have to make sure the user is facing the camera directly before recognizing the face,
so that a frontal face image of the user can be captured and used to query the face
database.
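To make the recognition step concrete, here is a minimal sketch of the reduced-resolution averaging and the relaxed L0* count described above. The image dimensions, the block-averaging scheme and the vector types are illustrative assumptions rather than the exact ARENA code.

#include <vector>
#include <cstdlib>

// Reduce a W x H grayscale image to 16 x 16 by averaging non-overlapping blocks
// (assumes W and H are divisible by 16 for simplicity).
std::vector<int> reduceTo16x16(const std::vector<unsigned char>& img, int W, int H)
{
    std::vector<int> out(16 * 16, 0);
    int bw = W / 16, bh = H / 16;                          // block size
    for (int by = 0; by < 16; ++by)
        for (int bx = 0; bx < 16; ++bx) {
            long sum = 0;
            for (int y = by * bh; y < (by + 1) * bh; ++y)
                for (int x = bx * bw; x < (bx + 1) * bw; ++x)
                    sum += img[y * W + x];
            out[by * 16 + bx] = (int)(sum / (bw * bh));    // average of the block
        }
    return out;
}

// Relaxed L0* similarity: count the components whose difference exceeds delta.
// Smaller counts mean more similar images.
int l0StarDistance(const std::vector<int>& a, const std::vector<int>& b, int delta)
{
    int count = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (std::abs(a[i] - b[i]) > delta)
            ++count;
    return count;
}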
3.3 Face Tracking
In our system, we are using a Canon VCC4 camera, which is capable of panning, tilting
and zooming. Before we can track a face or a moving object, we must first obtain
the horizontal and vertical field of view of the camera. Since the camera takes 2D
images, we can find the horizontal field of view by using a board to measure the
horizontal distance the camera can cover and the distance of the camera from the board.
To obtain the vertical field of view, the same procedure is repeated, but this time the
vertical extent of the camera's coverage is measured. The camera is at a zoom
level of 1x during these measurements. Fig 3.5 shows some of the images
captured during the measurement of the horizontal field of view.
Fig 3.5(a) and (b) show the horizontal coverage of 5 squares and 7 squares respectively
From each measurement, the horizontal field of view can be calculated with the formula

tan θ = (horizontal distance covered by the camera) / (distance of the board from the camera)

where θ is the horizontal field of view. An extract of the table of values used to calculate
the fields of view is shown below.

Horizontal distance /cm    Distance from the camera /cm    θ /°
14.3                       13.2                            47.1
23.8                       22.2                            47
30.6                       28.5                            47
By taking the average of all the θ values, the horizontal field of view is calculated to be 47°.
Using the same method, the vertical field of view is calculated to be 18°. These two values
are very important for tracking the face and the motion, because the camera accepts pan and
tilt commands in terms of angles rather than the usual coordinate system.
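The field-of-view values come from a one-line trigonometric calculation; the small sketch below reproduces it for the first measurement in the table (the function name and the printout are illustrative, and the result differs from the tabulated 47.1° only by measurement rounding).

#include <cmath>
#include <cstdio>

const double PI = 3.14159265358979323846;

// Field of view (in degrees) from one measurement: the distance covered by the
// camera on the board and the distance of the board from the camera (both in cm).
double fieldOfViewDeg(double coveredDistanceCm, double boardDistanceCm)
{
    return std::atan(coveredDistanceCm / boardDistanceCm) * 180.0 / PI;
}

int main()
{
    // First measurement from the table above: 14.3 cm covered at 13.2 cm away.
    std::printf("horizontal FOV: %.1f degrees\n", fieldOfViewDeg(14.3, 13.2));
    return 0;
}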
For tracking of the face, the face detector will only consider a face to be present in the
foreground when the number of face detections exceeds a certain threshold; this increases
the efficiency of the face tracker by preventing it from wasting time tracking false positive
detections. Fig 3.6 shows an example of a false positive.
Fig 3.6 A computer casing being falsely identified as a face
In our case, when there are 3 or more face detections in 5 consecutive
frames, the face detector acknowledges that a face is present in the foreground, and it
returns a set of coordinates defining the red rectangle bounding the face in the
image. With this set of coordinates, the tracking system can determine the panning and
tilting angles for the camera to focus on the face, so that the face will be in the centre of
the image. It is important to note that although the face detector can detect multiple faces
in an image, only the coordinates of the face with the largest area are used by the
tracking system. The formula used to compute the panning and tilting angles is simple
and efficient and is shown below. All units refer to the (x, y) coordinate system
with the origin at the top-left of the image. Fig 3.7 shows the accuracy of the tracking
system in tracking faces.
Fig 3.7(a) Face detected in the lower left region of the image, (b) the camera moves such
that the face is now in the centre of the image, (c) face detected in the upper right region
of the image, (d) the camera moves such that the face is now in the centre of the image
Formula for calculating the pan and tilt angles:

Pan angle = (x-coordinate of centre of face – x-coordinate of centre of image) / x-coordinate of centre of image * 23.5°

Tilt angle = (y-coordinate of centre of face – y-coordinate of centre of image) / y-coordinate of centre of image * 5.75°
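A minimal sketch of this angle computation is shown below; the rectangle and angle structs are simple placeholders rather than the types used in the actual system, and the constants are the ones appearing in the formulas above.

struct FaceRect { int x, y, width, height; };
struct Angles   { double pan, tilt; };

// Pan and tilt angles (in degrees) needed to bring the centre of the detected
// face rectangle to the centre of the image, following the formulas above.
Angles faceTrackingAngles(const FaceRect& face, int imageWidth, int imageHeight)
{
    double faceCx = face.x + face.width  / 2.0;   // centre of the face rectangle
    double faceCy = face.y + face.height / 2.0;
    double imgCx  = imageWidth  / 2.0;            // centre of the image
    double imgCy  = imageHeight / 2.0;

    Angles a;
    a.pan  = (faceCx - imgCx) / imgCx * 23.5;     // half of the 47 degree horizontal FOV
    a.tilt = (faceCy - imgCy) / imgCy * 5.75;     // vertical scaling constant from the formula above
    return a;
}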
Although we want to track the face accurately, we do not want the camera to keep
moving about just to keep the face in the dead centre of the image. Instead, we specify a
region in the centre of the image within which we do not track the face; this prevents
jerky camera movements which would affect the performance of face detection. We
specified the width of the region to be ¼ of the width of the image and the height of the
region to be ¼ of the height of the image, with the region centred in the image. Fig 3.8
shows the scenario of a face that is slightly off the centre of the image but still considered
by the face tracker to be within the region.
Fig 3.8 Face detected within the no-tracking region
3.4 Motion Tracking
For tracking of a moving object, the motion detector also returns a set of
coordinates defining the blue rectangle bounding the moving object. In the event of
multiple moving objects, the one with the biggest surface area is tracked. However,
the motion detector only starts when a face has been tracked in the recent frames; the
motion tracker works as a supplement to the face tracker and must therefore avoid
unwanted motion tracking. For example, a cat walking past the kiosk or a vehicle
driving past the kiosk will not be tracked. It is thus reasonable to assume that the
moving object is the body of the person whose face was tracked recently. The formula used
to compute the panning and tilting angles is slightly different from that of the face tracker,
due to some minor changes. The panning angle is calculated in the same
way; however, the formula for the tilting angle no longer uses the y-coordinate of the
centre of the motion, but instead the y-coordinate of the topmost level of the motion.
The idea behind this change is that the moving object detected is the body of a person,
so the face will be at the topmost portion. We therefore want the topmost portion of the
object to be in the centre so as to continue tracking the face. Fig 3.9 illustrates this rationale.
Fig 3.9 The blue rectangle is not in the centre of the image, but its topmost
portion is. Notice that the face is in the central region of the image as well.
If ( y-coordinate of topmost level of object < ¼ height of image )
    Tilt angle = (¼ height – y) / ¼ height * 9°
Else if ( y-coordinate of topmost level of object > ½ height of image )
    Tilt angle = ( y – ½ height ) / ½ height * 9°
Since we are tracking a moving object, we have to account for the fact that the object is
likely to traverse a longer distance between consecutive frames than a face, so we
have included a simple but efficient prediction technique to make the tracking smoother.
Since we assume that human beings walk in a roughly linear fashion, the calculated panning
and tilting angles are multiplied by a factor of 1.5 to account for the distance lost during the
movement of the camera. The motion tracking system is then able to track the moving
object competently and not lose the object due to the slow reaction of the camera.
The multiplication factor cannot be too high, as a high factor would cause the camera to
lose focus on the moving object very quickly when the object moves out of its
range. Fig 3.10 shows the tracking of a moving object in a well-lit room with
a constantly changing background.
Fig 3.10 (a) Moving object detected by the motion detector, (b) object moves to the
right, (c) object continues to move to the right, (d) object stops moving, (e) object
moves to the left. Note that the face tracking system has been disabled so as to allow the
motion tracking system to work.
For the tracking of either a face or a moving object, we predefined a range of movement
for the camera. If the calculated movement of the camera would exceed this range, the
camera moves back to its home position and resets all the variables used in detecting
faces and motion. The camera then starts to detect and track faces again. We have set
the range of the pan angle of the camera to its maximum.
3.5 Zooming
After detecting a face, our face tracking system can also zoom in on the face if the face is
too small. Our camera is capable of 16x zoom; however, we do not make use of the
full zooming capability of the camera, because zooming takes a lot of time and the
face tracker would easily lose the face it is tracking. The camera takes 1.5 seconds to
zoom from 1x to 16x, which means many frames of valuable information would be lost.
Also, there is little point in the face tracking system zooming in on a face that is very far
away.
Although a software development kit is provided for the camera, its documentation is
minimal. All it states is that the zoom function takes an integer value from 0 to 128 as
input, so we decided to measure the effect of different parameter values for the zoom
function. Fig 3.11 shows the area of a black object at different zoom parameters.
Fig 3.11 (a) zoom parameter 0, area 598 pixels, (b) zoom parameter 8, area 727 pixels, (c)
zoom parameter 16, area 860 pixels, (d) zoom parameter 24, area 1054 pixels, (e) zoom
parameter 32, area 1332 pixels, (f) zoom parameter 40, area 1685 pixels, (g) zoom
parameter 48, area 2206 pixels, (h) zoom parameter 56, area 2861 pixels, (i) zoom
parameter 64, area 3805 pixels.
From the results, it is clear that the zoom power increases exponentially with respect
to the value of the input parameter. We therefore restrict the input parameter
to a maximum of 64, since the increase in the area is approximately linear over that range
of values.
The face tracker will only zoom in on a face when the face is in the central region of the
image and the camera does not need to pan or tilt. The reasoning is that if the camera
needs to move to track the face, the face is somewhere outside the central region of the
image; since zooming takes time, the face would have moved out of the frame by the
time the camera finished zooming. Conversely, if the face is in the central region, we can
assume that the user has stopped moving and is now looking at the kiosk, so zooming in
will still keep the face in the frame even if the user moves slightly. Fig 3.12
shows an example of the face tracking system zooming in on a face.
Fig 3.12 (a) A face being detected in the central region, (b) first frame of zooming in, (c)
second frame of zooming in, (d) third and final frame of zooming in
The zooming is done incrementally, so as to make sure the face remains in the central
region and also to react robustly to changes such as the user moving closer to the kiosk or
moving away from the scene. If the user moves closer to the kiosk, his
face appears larger in the frame, so the camera has to zoom out accordingly. If the
user starts moving away, the camera has to stop zooming and start tracking the face again.
Fig 3.13 shows an example of the face tracking system zooming out from a face.
Fig 3.13 (a) A face being detected in the central region, (b) first frame of zooming out, (c)
second and final frame of zooming out
4. Motion Detection
Our project is based on real-time face tracking, but there are times when the face detector
cannot detect a face even though one is present in the frame. We therefore need to rely on a
good motion detector to enhance the performance of the face tracker. However, face
detection is computationally expensive on its own, so we cannot afford to allocate too
many resources to detecting motion. Also, the system should not try to detect motion when
a face is detected in the frame, so as to improve the efficiency of the system. Motion
detection against a complex background is a field of study on its own, so we address this
issue here using very simple and robust techniques.
4.1 Background Model
As mentioned in the Related Work section, there are several background modeling
techniques that can model the background very effectively. One good example is the
multi-modal statistical motion detector from MIT by Stauffer and Grimson. However, this
detector is computationally expensive because it models each pixel with a mixture of 3 to 5
Gaussian distributions. So instead of using complex algorithms, we use a simple weighted
running average of frames to model the background.
The new background image is a weighted sum of the existing background image and the
current frame. This background model is robust to small changes such as illumination
change and movements of background objects like trees. The formula is shown below.

New background image = 0.97 * (existing background image) + 0.03 * (current frame)
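The update is a single weighted addition per pixel. A hedged sketch using the OpenCV 1.x accumulation function is shown below; cvRunningAvg keeps B <- (1 - alpha) * B + alpha * I, so alpha = 0.03 corresponds to the 0.97/0.03 weights above, and the floating-point accumulator image is an assumption about how the buffers are allocated.

#include <cv.h>

// Update the running-average background model from the current grayscale frame.
// 'background' must be a floating-point image (e.g. IPL_DEPTH_32F) of the same size.
void updateBackground(IplImage* grayFrame, IplImage* background)
{
    // background <- 0.97 * background + 0.03 * current frame
    cvRunningAvg(grayFrame, background, 0.03);
}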
We took a video with a complex background to test the accuracy of both models. A
fan is running in the background and the user is wearing a pair of shorts that is very
similar in color to a table in the background. Fig 4.1 shows the background image
modeled by both our running-average model and the Gaussian Mixture Model (GMM),
together with the image of the current frame, at the 75th and 85th frame respectively. We
chose later frames of the video to allow both background models to stabilize.
Fig 4.1 (a) 75th frame of the video, (b) 85th frame of the video, (c) color background
model of running-average of the 75th frame, (d) color background model of running-
average of the 85th frame, (e) grayscale background model of running-average of the 75th
frame, (f) grayscale background model of running-average of the 85th frame, (g)
background model of GMM of the 75th frame, (h) background model of GMM of the 85th
frame.
The GMM is only slightly superior to the running-average model; as can be seen
from the background model of the GMM at the 85th frame, the area underneath the table
is more defined and accurate.
Since additions are computed much faster than Gaussian distributions can be modeled, we
can expect the running average to be very much faster than the Gaussian Mixture
Model. Indeed, Fig 4.2 confirms our hypothesis.
Fig 4.2 The table shows the time in milliseconds needed for each technique to compute
the background model.
The grayscale running-average model, which is an optimized version of the color
running-average model, is clearly the fastest and is almost 60 times faster than the
GMM. Moreover, the GMM is unsuitable for our tracking purposes because of the rapidly
changing background, and it takes time to initialize due to its high overheads.
Hence we use the grayscale running-average model for our background
modeling.
4.2 Difference Image
After obtaining a suitable background model, we subtract the background model
from the current frame to get a difference image. The difference image is supposed to
show all objects in the foreground, and it is grayscale because we are using a grayscale
background model. We then apply a threshold to the difference image to
obtain a binary image, so that the blob counter can work on the binary image to detect
movement. However, due to the presence of noise and minor changes in the environment
such as illumination changes, the difference image is not a perfect reflection of the
foreground. An imperfect background modeling tool also contributes to the inaccuracy.
Therefore, we need to process the image to obtain a best-fit model of the foreground.
We perform smoothing on the difference image to remove the random noise present in
the image. We chose the median filter because it is fast and efficient. After applying the
median filter, we threshold the image to get a binary image. Fig 4.3 shows a
difference image after a median filter has been applied to it, and the binary image
obtained after thresholding it with a value of 30. We are using the 75th frame of the video
as the image, as shown in Fig 4.1(a).
Fig 4.3 (a) Difference image before applying the median filter, (b) difference image after
applying the median filter, (c) binary image of (a) with a threshold value of 30, (d) binary
image of (b) with a threshold value of 30. The effect of the median filter on the binary image
is obvious, as the noise present in (c) is no longer found in (d). Also, the white portions
(blobs) are more connected in (d), which makes it easier for the blob counter to work on
them.
To increase the connectivity of the surrounding blobs, the binary image is further dilated.
This joins adjacent blobs that are part of the foreground, increasing the accuracy of the
motion detector. However, dilation cannot be performed too many times, as we do not
want blobs further away that do not belong to the foreground to be included as well.
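A hedged sketch of this pipeline (absolute difference, 3 x 3 median filter, fixed threshold, one dilation) using the OpenCV 1.x C API is shown below; the buffer layout and kernel sizes are illustrative, and the background model is assumed to have already been converted back to an 8-bit image (e.g. with cvConvert) before the subtraction.

#include <cv.h>

// Build a binary foreground mask from the current grayscale frame and the
// background model (all images 8-bit, single channel, same size).
void foregroundMask(IplImage* grayFrame, IplImage* background8u,
                    IplImage* diff, IplImage* diffFiltered, IplImage* mask,
                    int threshold)
{
    cvAbsDiff(grayFrame, background8u, diff);              // |frame - background|
    cvSmooth(diff, diffFiltered, CV_MEDIAN, 3);            // 3 x 3 median filter removes speckle noise
    cvThreshold(diffFiltered, mask, threshold, 255, CV_THRESH_BINARY);
    cvDilate(mask, mask, NULL, 1);                         // join nearby blobs (do not over-dilate)
}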
4.3 Thresholding the Binary Image
Finding the optimum value for the threshold is very important, as we want as much
foreground as possible to be included in the binary image and as little background as
possible. If we set the threshold value too low, much unwanted information, especially
noise, will be captured in the image. If we set the threshold value too high, we risk
omitting pixels that are actually part of the foreground. For example, in Fig 4.1(a) the user,
wearing dark blue shorts, is walking across the room with a dark brown table in the
background. Since dark brown and dark blue look very similar, we need a relatively low
threshold to capture the motion of the shorts. As we can see from the image in Fig 4.3(d),
the window on the right-hand side of the image is considered to be part of the foreground
due to the low threshold value of 30 being used.
There are many algorithms for calculating the optimum threshold, notably the popular
Otsu algorithm. Please refer to the appendix for the VC++ implementation of the
Otsu algorithm. However, the Otsu algorithm uses floating-point values to calculate the
inter-class variance, so it is quite computationally expensive. We also found that
using the Otsu threshold returns a lot of false positives, so we decided to use a hysteresis
method, similar to that used for finding Canny edges [14], to compute the binary image.
The hysteresis method works by thresholding an image with two different threshold values,
one low and one high. Pixels with values above the high threshold are retained
in the final binary image, while those below the low threshold are rejected
outright. Pixels with values between the low and high thresholds are only retained
if they are directly or indirectly connected to a pixel above the
high threshold, through other pixels that are also between the low and high thresholds.
We set the low threshold to 20 and the high threshold to 60.
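A minimal sketch of the hysteresis step is given below, written directly on an 8-bit buffer rather than with a library call: pixels above the high threshold are marked first, and those regions are then grown through pixels above the low threshold using a simple flood fill. The 4-connectivity and the stack-based fill are implementation choices, not a description of any particular library routine.

#include <vector>
#include <stack>
#include <utility>

// Hysteresis thresholding of a W x H grayscale difference image.
// Pixels >= high are kept; pixels >= low are kept only if connected
// (4-connectivity) to a pixel >= high. Returns a binary mask (0 or 255).
std::vector<unsigned char> hysteresisThreshold(const std::vector<unsigned char>& img,
                                               int W, int H, int low, int high)
{
    std::vector<unsigned char> mask(W * H, 0);
    std::stack<std::pair<int, int> > seeds;

    for (int y = 0; y < H; ++y)                     // seed with the strong pixels
        for (int x = 0; x < W; ++x)
            if (img[y * W + x] >= high) {
                mask[y * W + x] = 255;
                seeds.push(std::make_pair(x, y));
            }

    const int dx[4] = { 1, -1, 0, 0 };
    const int dy[4] = { 0, 0, 1, -1 };
    while (!seeds.empty()) {                        // grow into weak but connected pixels
        int x = seeds.top().first, y = seeds.top().second;
        seeds.pop();
        for (int k = 0; k < 4; ++k) {
            int nx = x + dx[k], ny = y + dy[k];
            if (nx < 0 || ny < 0 || nx >= W || ny >= H) continue;
            if (mask[ny * W + nx] == 0 && img[ny * W + nx] >= low) {
                mask[ny * W + nx] = 255;
                seeds.push(std::make_pair(nx, ny));
            }
        }
    }
    return mask;
}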
To denote motion, we draw a blue rectangle bounding the moving object. To do this,
we use a simple blob counter to identify all the blobs. The blob counter also
filters out all blobs that are smaller than 500 pixels in area, as such blobs are usually
interference from the environment. We then find the largest blob and draw a bounding
blue rectangle over it.
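A hedged sketch of this step using the OpenCV 1.x contour functions is shown below; it treats the external contours of the binary mask as the blobs, drops those smaller than 500 pixels and returns the bounding box of the largest remaining one. The contour retrieval parameters are an assumption about one reasonable way to implement the blob counter, not the exact code used in our system.

#include <cv.h>
#include <cmath>

// Bounding box of the largest blob (>= 500 px in area) in a binary mask.
// Note: cvFindContours modifies its input image, so pass a scratch copy.
// Returns a zero-sized rectangle when no blob is large enough.
CvRect largestBlob(IplImage* binaryMask, CvMemStorage* storage)
{
    cvClearMemStorage(storage);
    CvSeq* contours = 0;
    cvFindContours(binaryMask, storage, &contours, sizeof(CvContour),
                   CV_RETR_EXTERNAL, CV_CHAIN_APPROX_SIMPLE, cvPoint(0, 0));

    CvRect best = cvRect(0, 0, 0, 0);
    double bestArea = 0.0;
    for (CvSeq* c = contours; c != 0; c = c->h_next) {
        double area = fabs(cvContourArea(c, CV_WHOLE_SEQ));
        if (area >= 500.0 && area > bestArea) {      // ignore small blobs (interference)
            bestArea = area;
            best = cvBoundingRect(c, 0);
        }
    }
    return best;
}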
Our simple method works very well, as shown in Fig 4.4, where comparisons are made
with images thresholded with the arbitrary values of 20 and 60 and with the threshold
value computed by the Otsu algorithm. The same blob counter with the same parameters
was used during the motion detection. Again, we are using the 75th frame of the video.
Fig 4.4 (a) Threshold value 20, (b) threshold value 60, (c) threshold value 27, as returned
by the Otsu algorithm, (d) hysteresis threshold of 20 and 60. As shown above, our hysteresis
threshold manages to outperform the other thresholding methods.
5. The Speaking Kiosk
The kiosk is able to track faces in real-time, but we decided to add some fun factor to
make the kiosk more interesting. When the kiosk detects a face, it invites the
user to try the face recognition system by saying "Hi there, would you like to try the
kiosk?" If the user leaves the view of the kiosk, the kiosk politely thanks the user for
his time by saying "Thank you for your interest in the kiosk. We hope to see you again.
Have a nice day." When the kiosk is attempting to recognize the face, it informs the
user by saying "Please hold as we attempt to recognize your face." If the kiosk is unable
to detect a face when the user attempts recognition, it responds
"Sorry, we did not detect a face. Would you like to try again?" By introducing speech
into the kiosk system, we hope to make the kiosk more user-friendly and more accessible.
6. Conclusion
6.1 Summary
In this paper, we gave a brief discussion of some of the useful applications in which
face tracking can play an important role. This project presents a system that is able to
detect faces and track them in real-time. The system also has a motion detector
embedded in it to help improve the performance of the face tracking system. We not only
concentrated on the technical aspects of the system, but also attempted to introduce some
affability by making the kiosk "speak" to the user.
Since our real-time face tracking system depends heavily on the face detection system, an
efficient and fast face detector is a must. The face detector based on the cascade
AdaBoost architecture, boosted by the Haar-like features, performs up to our expectations.
Also, the face recognition system, implemented using the ARENA algorithm, utilizes the
nearest-neighbor technique competently, and is able to query the face
database to retrieve the nearest 6 matches.
After precise calibration and meticulous measurements, the camera is able to track faces
very accurately, and for motion tracking a simple prediction system makes sure the
moving object remains in the central region of the camera's view. We discarded the
more computationally expensive algorithms in favor of faster and simpler ones, to make our
face tracking system more robust to changes.
We also placed a high importance on the motion detection system as an improvement to
the face tracking system, and we have developed a straightforward and competent motion
detection system which has facilitated the whole project greatly.
6.2 Limitations
The face detection system, although efficient, is still unable to avoid the problem
of false positives. Even with the addition of a threshold system, the camera might still end
up tracking the wrong object.
Due to the changing background during tracking, a shift from a weakly lit background to
a strongly lit background will corrupt the motion detection system.
The inherent zoom function of the camera takes time, and the system might not be
able to react accordingly if the user walks towards the camera or moves away from it
very quickly.
The face detection system is unable to detect faces with more than 15 degrees of orientation,
and this hampers the performance of the face tracking system.
6.3 Further Works
Since the face detection system forms a vital part of the face tracking system, developing
a faster face detection system will aid the face tracking system. There is much ongoing
research on this topic, and we hope that a faster and more efficient face detection
technique will be proposed.
One limitation of the current system is that it only detects upright frontal faces; therefore,
coming up with a technique to train the system for different head orientations, and then
combining the results for a more powerful face detection system, is a good way
to improve the current system.
We mentioned in this paper that we tried to make the kiosk more user-friendly. The
kiosk can be further improved by adding voice synthesis so as to allow the user to interact
with it. We have also contemplated the idea of gesture recognition, so that the user
can perform face recognition on the kiosk just by making some pre-defined gestures.
However, due to time limitations, we leave these ideas as further
improvements to the project.
References

[1] M. Kirby and L. Sirovich, "Application of the Karhunen-Loeve procedure for the characterization of human faces", IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(1), 1990.

[2] R. Feraud, O.J. Bernier, J.E. Viallet and M. Collobert, "A fast and accurate face detector based on neural networks", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 1, pp. 42-53, 2001.

[3] A. Rajagopalan, K. Kumar, J. Karlekar, R. Manivasakan, M. Patil, U. Desai, P. Poonacha and S. Chaudhuri, "Finding faces in photographs", in Proc. 6th IEEE International Conference on Computer Vision, pp. 640-643, 1998.

[4] KC Yow and R. Cipolla, "Feature-based human face detection", Image and Vision Computing, no. 15, pp. 713-735, 1997.

[5] M. Turk and A. Pentland, "Face recognition using eigenfaces", in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 586-591, 1991.

[6] Roberto Brunelli and Tomaso Poggio, "Face recognition: features versus templates", IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(10), pp. 1042-1052, 1993.

[7] C. Tomasi and T. Kanade, "Detection and tracking of point features", Tech. Rep. CMU-CS-91-132, Carnegie Mellon University, 1991.

[8] C. Stauffer and W.E.L. Grimson, "Adaptive background mixture models for real-time tracking", in Proc. CVPR, 1999.

[9] P. KaewTraKulPong and R. Bowden, "An improved adaptive background mixture model for real-time tracking with shadow detection", in Proc. 2nd European Workshop on Advanced Video Based Surveillance Systems, 2001.

[10] N. Otsu, "A threshold selection method from gray-level histograms", IEEE Transactions on Systems, Man, and Cybernetics, 9:62-66, 1979.

[11] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features", in Proc. CVPR, pp. 511-518, 2001.

[12] Rainer Lienhart and Jochen Maydt, "An extended set of Haar-like features for rapid object detection", IEEE ICIP 2002, vol. 1, pp. 900-903, Sep. 2002.

[13] Yoav Freund and Robert E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting", Journal of Computer and System Sciences, 55(1):119-139, August 1997.

[14] J. Canny, "A computational approach to edge detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, 8:679-714, 1986.

[15] B. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision", in Proc. DARPA Image Understanding Workshop, pp. 121-130, 1981.