

J. Vis. Commun. Image R. 19 (2008) 320–333


Saliency model-based face segmentation and tracking in head-and-shoulder video sequences

Hongliang Li *, King N. Ngan
Department of Electronic Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong


Article history: Received 3 October 2006; Accepted 3 April 2008; Available online 11 April 2008.

Keywords: Video segmentation; Facial saliency map; Boundary saliency map; Video-conferencing; Face tracking; Skin tone; Facial feature extraction; Visual attention; Face segmentation; Head-and-shoulder video.

1047-3203/$ - see front matter © 2008 Elsevier Inc. All rights reserved. doi:10.1016/j.jvcir.2008.04.001

* Corresponding author. E-mail addresses: [email protected] (H. Li), [email protected] (K.N. Ngan).

In this paper, a novel face segmentation algorithm is proposed based on the facial saliency map (FSM) for head-and-shoulder type video applications. This method consists of three stages. The first stage is to generate the saliency map of the input video image by our proposed facial attention model. In the second stage, a geometric model and an eye-map built from chrominance components are employed to localize the face region according to the saliency map. The third stage involves the adaptive boundary correction and the final face contour extraction. Based on the segmented result, an effective boundary saliency map (BSM) is then constructed and applied to the tracking-based segmentation of the successive frames. Experimental evaluation on test sequences shows that the proposed method is capable of segmenting the face area quite effectively.

© 2008 Elsevier Inc. All rights reserved.

1. Introduction

Object segmentation plays an important role in content-based multimedia applications. From content-related image and video segmentation, a higher semantic object can be detected and exploited to provide the user with flexibility in content-based access and manipulation [1,2,4]. As an important key to future advances in human-to-machine communication, the segmentation of a facial region can also be applied in many fields, such as encoding, indexing, and pattern-recognition purposes [3]. In the literature, a large number of face segmentation algorithms based on different assumptions and applications have been reported. According to the primary criterion for segmentation, two categories can be identified: color-based methods [3-22] and facial features-based methods [23-29].

The color-based segmentation methods aim to exploit skin-color information to locate and extract the face region. A universal skin-color map is introduced and used on the chrominance component to detect pixels with skin-color appearance in [3]. In order to overcome the limitations of color segmentation, five operating stages are employed to refine the output result, such as density


and luminance regularization and geometric correction. Based on color clustering and filtering using approximations of the YCbCr and HSV skin-color subspaces, a scheme for human face detection by performing a wavelet packet decomposition was proposed in [4]. The wavelet coefficients of the band-filtered images are used to characterize the face texture and to form compact and meaningful feature vectors. Both of these two methods are based on a linear classifier for skin-color pixels. Using a lighting compensation technique and a nonlinear color transformation, a face detection algorithm [5] for color images was presented to detect skin regions over the entire image. This algorithm extracts facial features by constructing feature maps for the eyes, mouth, and face boundary. Based on the joint processing of color and motion, a nonlinear color transform relevant for hue segmentation is derived from a logarithmic model [6]. A Markov random field (MRF) model that combines hue and motion detection within a spatiotemporal neighborhood is used to realize the hierarchical segmentation. In addition, a hand and face segmentation method using color and motion cues for the content-based representation of sign language videos was proposed in [7]. This method consists of three stages: skin-color segmentation, change detection, and face and hand segmentation mask generation.

The facial features-based methods, on the other hand, utilize facial statistical or other structural features rather than skin-color information to achieve face detection/segmentation


[23-30]. In [23], a probabilistic method for detecting and tracking multiple faces in a video sequence was presented. The proposed method integrates the face probabilities provided by the detector and the temporal information provided by the tracker to improve on the available detection and tracking methods. In [30], a statistical model-based video segmentation for head-and-shoulder type video was addressed. The head is modeled with a "blob", which is segmented based on the assumption that the background scene contains no foreground, in order to satisfy the creation of a background model. Recently, many segmentation works have been developed to extract objects in images/video [31-34]. A bilayer video segmentation method was proposed based on tree classifiers [31,35]. In that work, visual cues such as motion, motion context, colour, contrast, and spatial priors are fused together by means of a Conditional Random Field model, and then segmented by a binary min-cut [36]. In addition, in [34], a random walker segmentation method is proposed for performing multilabel, interactive image segmentation.

In this paper, we concentrate on a specific application domain, namely head-and-shoulder type videos found in many multimedia services such as videophone, video-conferencing, and web chatting. An effective face segmentation algorithm is presented based on skin-color, face position, and eye-map information. Our segmentation method consists of three stages. The first stage is to generate the saliency map of the input video image by our new facial attention model. In the second stage, a geometric model and an eye-map built from chrominance components are employed to localize face regions. The third stage involves the boundary correction and the final face contour extraction. Finally, we employ the proposed BSM to segment the facial region in the successive frames.

The main contribution of this work is a novel face segmentation algorithm developed for head-and-shoulder type video applications based on the facial saliency map. The facial and boundary saliency maps are constructed based on the attention model to achieve face segmentation and tracking, and have been evaluated on a large number of images/videos with good performance. In addition, this work provides a general method to build a saliency model by combining different cues, such as edge, color, and shape, and it can be easily extended to other objects by appropriate design of the object saliency model.

This paper is organized as follows. The face segmentation algorithm is presented in Section 2. Section 3 presents the tracking-based segmentation. Experimental results are provided in Section 4 to support the efficiency of our face segmentation algorithm. Finally, in Section 5, conclusions are drawn and further research is proposed.

2. Face segmentation algorithm

Generally, video segmentation can be decomposed into two subproblems, namely video object segmentation and video object tracking [1,2]. The first stage is used to detect the object and extract it from the input frame. The object in the successive frames is then segmented in the following tracking procedure. In this section, we present our method to locate candidate face areas in the color image. It is known that there are many color space models relevant to different applications, such as RGB for display purposes, HSV for computer graphics and image analysis, and YCbCr. Since the YCbCr color space is usually employed for video storage and coding, and provides an effective basis for analyzing human skin-color [3,7], we use this color space for our input video sources. Namely, the format of the video image is the YCbCr color space with a spatial sampling ratio of 4:2:0 in our work. In addition, we assume that the person in a head-and-shoulder pattern appears in the typical video sequences with front or near-front views.

2.1. Facial saliency map

Motivated by the visual attention model [37], which has been successfully extended to several applications [38-40], we employ a similar idea to construct a facial saliency map that indicates the face location in typical head-and-shoulder video sequences. Let (x, y) represent the spatial position of a pixel in the current image. The corresponding luminance and chrominance components of the pixel are denoted by Y(x, y), Cb(x, y), and Cr(x, y), respectively. The FSM in our work is defined as

S(x, y) = P1(x, y) · P2(x, y) · P3(x, y),    (1)

where P1, P2, and P3 denote the "conspicuity maps" corresponding to the chrominance, position, and luminance components, respectively. These maps are discussed in detail in the following subsections.

2.1.1. Chrominance conspicuity map (CCM) P1

Skin-color can be detected by the presence of a certain range of chrominance values with a narrow and consistent distribution in the YCbCr color space. The empirical ranges for the chrominance values employed are typically Cr_skin = [133, 173] and Cb_skin = [77, 127] [3]. It is known that the face region usually exhibits a similar skin-color feature regardless of skin type. Therefore, using the skin-color information, we can easily construct a facial saliency map to locate the potential face areas.

To investigate the skin-color distribution, we manually segmented the training images into face patches. The data were taken from the California Institute of Technology face database and the CVL Face Database provided by the Computer Vision Laboratory [41], which contain 1248 color human faces. Each person has different poses and emotions, and different lighting conditions and face types can be found in these image sources. It should be noted that none of the test images used in the experiments are included in the training data.

The histogram results for the Cb and Cr components are presented in Fig. 1a and b. We can see that the chrominance values of different facial skin-colors are indeed narrowly distributed, which is consistent with the statistical results in [3,4,7]. The corresponding histograms distinctly exhibit a Gaussian-like distribution rather than a uniform distribution. The larger the offset of a pixel from the mean value, the smaller the probability that the pixel belongs to the face area. In addition, from the distribution of facial pixels in the CbCr plane shown in Fig. 1c, we find that a distinct inclination angle can be observed between the two chrominance components. Let μ and Δ denote the mean and variance, and θ be the angle. Based on the above analysis, we define the CCM P1 as

P1(x, y) = exp{ −[ ω_cr(x, y) · Cr′(x, y)² / (2Δ_cr²) + ω_cb(x, y) · Cb′(x, y)² / (2Δ_cb²) ] },    (2)

where μ_cr = 153, μ_cb = 102, Δ_cr = 20, Δ_cb = 25, and θ = π/4, which are determined from the training data and experimental tests. We believe that a more accurate parameter estimation and update method could further improve these estimates. Cr′ and Cb′ are computed from a rotation of coordinates, i.e., Cr′(x, y) = (Cr(x, y) − μ_cr) · cos(θ) + (Cb(x, y) − μ_cb) · sin(θ), and Cb′(x, y) = −(Cr(x, y) − μ_cr) · sin(θ) + (Cb(x, y) − μ_cb) · cos(θ). The variable ω_v(x, y) (v = cb or cr) is a weight coefficient, which is employed to adjust the decreasing rate of the chrominance distribution curve. Namely, if the chrominance values of a pixel fall outside the facial distribution region, its corresponding CCM obtained by the weight computation will tend toward a smaller probability of being classified as a facial pixel. It is given by

ω_v(x, y) = a^{k_v(x, y)},    (3)

Fig. 1. Sample face skin-color distribution for chrominance Cb (a) and Cr (b). (c) The facial region in the CbCr plane.


k_v(x, y) = sym{(μ_cr + Δ_cr − Cr(x, y)) · (Cr(x, y) − μ_cr + Δ_cr)} if v = cr, and k_v(x, y) = sym{(μ_cb + Δ_cb − Cb(x, y)) · (Cb(x, y) − μ_cb + Δ_cb)} if v = cb.    (4)

Here, a is a constant; the value a = 2 is employed in our work, and sym{k} denotes a sign function:

sym{k} = 1 if k ≥ 0, and 0 otherwise.    (5)
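For concreteness, the chrominance conspicuity map of Eqs. (2)-(5) can be sketched in a few lines of NumPy. This is only an illustration under our own conventions (the function name ccm and the floating-point array layout are not from the paper); the parameter values follow the text above.

import numpy as np

# Sketch of the chrominance conspicuity map (CCM), Eqs. (2)-(5).
MU_CR, MU_CB = 153.0, 102.0
D_CR, D_CB = 20.0, 25.0
THETA = np.pi / 4
A = 2.0                                   # base of the weight in Eq. (3)

def ccm(Cb, Cr):
    dcr, dcb = Cr - MU_CR, Cb - MU_CB
    # rotated coordinates Cr' and Cb' used in Eq. (2)
    cr_p = dcr * np.cos(THETA) + dcb * np.sin(THETA)
    cb_p = -dcr * np.sin(THETA) + dcb * np.cos(THETA)
    # Eqs. (4)-(5): k_v = sym{ range test on the chrominance values }
    k_cr = ((MU_CR + D_CR - Cr) * (Cr - MU_CR + D_CR) >= 0).astype(float)
    k_cb = ((MU_CB + D_CB - Cb) * (Cb - MU_CB + D_CB) >= 0).astype(float)
    w_cr, w_cb = A ** k_cr, A ** k_cb     # Eq. (3)
    return np.exp(-(w_cr * cr_p ** 2 / (2 * D_CR ** 2) +
                    w_cb * cb_p ** 2 / (2 * D_CB ** 2)))   # Eq. (2)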


Fig. 3. The histogram of luminance component Y for the test face data.

2.1.2. Position conspicuity map (PCM) P2

We have found that in typical head-and-shoulder video sequences, most face locations appear at or near the center of the image in order to attract user attention. Few human faces are captured at the boundary of the image, especially at the bottom. Fig. 2 illustrates a statistical result of the face pixel positions (denoted by the white area) from standard video sequences, including Akiyo, Carphone, Claire, and Salesman, with 1200 frames in total. From the obtained result, we can see that the larger the distance between the current position and the center, the smaller the possibility of a face appearing. Hence, it is reasonable to assume that the probability of facial pixels appearing at the center of the image is larger than at other locations. Let H and W denote the height and width of the image, respectively. Based on this characteristic, we define the position conspicuity map P2 as

P2(x, y) = exp{ −(x − H/2)² / (0.8 · (H/2)²) − (y − W/2)² / (2 · (W/3)²) }.    (6)

From (6), we find that the conspicuity decays faster in the vertical orientation than in the horizontal. The rectangular area of H/2 × W/3 at the center holds the larger conspicuity values.

Fig. 2. Statistical result of face position for about 1200 frames.
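As a small illustration, Eq. (6) maps directly onto a NumPy expression. The sketch below is our own helper (the name pcm and the convention that x is the row index and y the column index are assumptions) and builds the map for an H × W plane.

import numpy as np

# Sketch of the position conspicuity map (PCM), Eq. (6), for an H x W image plane.
def pcm(H, W):
    x = np.arange(H).reshape(-1, 1)      # vertical coordinate
    y = np.arange(W).reshape(1, -1)      # horizontal coordinate
    return np.exp(-(x - H / 2) ** 2 / (0.8 * (H / 2) ** 2)
                  - (y - W / 2) ** 2 / (2 * (W / 3) ** 2))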

2.1.3. Luminance and structure conspicuity map (LSCM) P3

Although there is no narrow distribution of the luminance component in the face area, different density values can be found distinctly over the interval (0, 255). As shown in Fig. 3, the region [128 − 50, 128 + 50] tends to contain most of the conspicuity values for the facial skin area. The darker the intensity value of a pixel, the less likely it is to be a skin-tone color. A similar result can also be found for very bright pixels. The reason is that a "head-and-shoulder" region, as the foreground, usually exhibits a more uneven distribution of brightness, and provides a clearer visual result for the user than the background. In addition, it is known that the brightness in the facial area is usually nonuniform, and exhibits a larger deviation than the chrominance components. Based on this analysis, we define the LSCM P3 as

P3(x, y) = [ 1 − 1 / (1 + σ(N_w1(x, y)))^{1/n_e} ] · exp{ −(c(x, y) · Y′(x, y) − μ_L)² / (2Δ_L²) },    (7)

n_e = max{ log2[ max(N_w1(x, y)) / mean(N_w1(x, y)) ], 1 },    (8)

where μ_L = 128, Δ_L = 50, and c(x, y) denotes the luminance compensation coefficient, which is defined as c(x, y) = μ_L / ( (1/K) Σ_{k=1}^{K} Y′(x_k, y_k) ) for (x_k, y_k) ∈ N_w2(x, y) with 0.3|Cb(x_k, y_k) − Cb(x, y)| + 0.7|Cr(x_k, y_k) − Cr(x, y)| < 2. N_w(x, y) represents the w × w neighborhood of pixel (x, y). σ(N_w1(x, y)), max(N_w1(x, y)), and mean(N_w1(x, y)) denote the standard deviation, maximal, and mean values of the w1 × w1 neighborhood, respectively.


Due to the sampling format of 4:2:0, we use Y′(x, y) to denote the mean value of the four luminance values corresponding to the chrominance location (x, y). Here, the first term in (7) represents the structural coefficient, which is employed to characterize the luminance variation in the face region. However, it is observed that a region with large local smooth areas and many sharp edges is likely to exhibit a distinct deviation in brightness. In order to avoid large prediction errors in the presence of edges, we employ the coefficient n_e to approximately estimate the edge character. We can see that only large deviations with small edge strength are taken into account, which means that the influence of high-frequency content such as sharp edges is reduced significantly by (7).

Additionally, it is widely reported that the appearance of the skin-tone color is characterized by the brightness of the color, which is governed by the luminance component of the light [3,5,7,9,14]. We now introduce a local lighting balancing technique, which has a similar objective as the method in [42], to normalize the color appearance. The coefficient c(x, y) given in (7) is proposed to reduce the influence of light variation on the face detection. For the current pixel (x, y), we first consider all the pixels in the window of w2 × w2 centered at (x, y). The distances of the chrominance components between the pixel (x, y) and the others are then computed. Only those pixels with smaller distance values (i.e., < 2) are selected as the light balancing pixels. From (7), we can see that if the current pixel has a lower density value (i.e., lies in a darker area), its luminance level will be increased since the balancing coefficient c(x, y) takes a larger value. It should be noted that this balancing process is only applied to pixels whose chrominance values satisfy Cb(x, y) ∈ [μ_cb − Δ_cb, μ_cb + Δ_cb] and Cr(x, y) ∈ [μ_cr − Δ_cr, μ_cr + Δ_cr], in order to reduce the computational complexity.
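A simplified sketch of Eqs. (7)-(8) is given below. It is an approximation under two assumptions of ours: the lighting-compensation coefficient c(x, y) is taken as 1, and the local statistics are computed with SciPy filters over a w1 × w1 window; Y is a floating-point luminance array already averaged down to the chroma grid (Y′ in the paper).

import numpy as np
from scipy.ndimage import maximum_filter, uniform_filter

# Simplified sketch of the luminance/structure conspicuity map (LSCM), Eqs. (7)-(8),
# with the lighting-balancing coefficient c(x, y) omitted (set to 1) for brevity.
MU_L, D_L = 128.0, 50.0

def lscm(Y, w1=5):
    Y = Y.astype(float)
    mean = uniform_filter(Y, size=w1)
    std = np.sqrt(np.maximum(uniform_filter(Y ** 2, size=w1) - mean ** 2, 0.0))
    loc_max = maximum_filter(Y, size=w1)
    ratio = np.maximum(loc_max, 1e-6) / np.maximum(mean, 1e-6)
    n_e = np.maximum(np.log2(ratio), 1.0)                          # Eq. (8)
    struct = 1.0 - 1.0 / (1.0 + std) ** (1.0 / n_e)                # first factor of Eq. (7)
    return struct * np.exp(-(Y - MU_L) ** 2 / (2 * D_L ** 2))      # Eq. (7) with c = 1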

According to (1), the final FSM can be easily obtained from the three conspicuity maps (2), (6), and (7). As an example, the FSMs of four video frames, namely Claire, Carphone, Foreman, and News, are presented in Fig. 4. It is observed that most face regions in the test frames achieve large saliency values with respect to other objects. Using the facial saliency map, the first face candidate map f(x, y) can be obtained by setting

f(x, y) = 1 if S(x, y) ≥ τ1, and 0 otherwise,    (9)

where τ1 is a threshold. Once the candidate face areas are obtained, we can then begin the binary map regularization. The morphological operators [43], i.e., dilation and erosion, are employed. In our work, we use square structuring elements with widths of two and three pixels for erosion and dilation, respectively.
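The thresholding of Eq. (9) and the subsequent regularization can be sketched with OpenCV morphology. The helper name and the fixed threshold interface are ours; Section 4 suggests choosing the threshold dynamically as a fraction of the maximum saliency.

import numpy as np
import cv2

# Sketch of Eq. (9) followed by the binary regularization described above.
def candidate_map(S, tau1):
    f = (S >= tau1).astype(np.uint8)                 # Eq. (9)
    f = cv2.erode(f, np.ones((2, 2), np.uint8))      # square element of width 2
    f = cv2.dilate(f, np.ones((3, 3), np.uint8))     # square element of width 3
    return f

# Example usage with the dynamic threshold discussed in Section 4:
# f = candidate_map(S, 0.4 * S.max())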

2.2. Geometric verification model

After the previous stage, one or more potential candidate face areas may be obtained. As illustrated in the News sequence in

Fig. 4. The facial saliency map: (a) Claire, (b) Carphone, (c) Foreman, and (d) News sequences.

Fig. 4, some background regions (i.e., the two dancers) may be detected as candidate regions due to similar saliency features. In this section, we construct a simple geometric model based on shape, size, and variance to validate these candidates, and then remove most of the false regions.

Let NoP_k denote the number of pixels of the current candidate face area k. Assuming that the ideal face height for the given pixels is denoted by a, we set the ideal face width to 2a/3. Thus, we have a = √(3 · NoP_k / 2). Let h and w represent the real height and width

of this area, respectively. V_L and V_R are used to denote the standard deviations of the left and right parts. The geometric model is defined as

G(k) = sym(NoP_k − F_min) · [ |(h − a) · (w − 2a/3)| / NoP_k ] · | 1/2 − V_L / (V_L + V_R) |,    (10)

where F_min denotes the minimal size for displaying a face structure. Allowable sizes of 80 × 48 and 50 × 50 for F_min have been employed in the literature [4,7]. We found that a size of 12 × 10 is capable of describing the outline of a face structure. From (10), we can see that three features are taken into account to perform the geometric verification. The first is the minimal number of pixels, which can be employed to remove most small candidate areas. The second is related to shape, and is used to eliminate regions without the face contour characteristic. The third corresponds to the variance, and is based on the assumption that the human face in head-and-shoulder type video initially appears in or near a front view rather than a side view. If the constraint of (10) is satisfied, we label the current region as a candidate face area, namely

GID(k) = 1 if G(k) ≤ τ2, and 0 otherwise,    (11)

where τ2 is a threshold. Based on the performance over a large number of experiments, a value in (0.003 ~ 0.009) is recommended, which provides a better constraint for the candidate face selection.
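The sketch below illustrates Eqs. (10)-(11) for a single candidate region given as a boolean mask. Two details are our own interpretation rather than the paper's: regions smaller than F_min are rejected outright (the role the text assigns to sym(NoP_k − F_min)), and V_L, V_R are computed as the luminance standard deviations of the left and right halves of the region.

import numpy as np

# Sketch of the geometric verification of Eqs. (10)-(11) for one candidate region.
F_MIN = 12 * 10                                      # minimal face size used in the paper

def geometric_score(mask, Y):
    ys, xs = np.nonzero(mask)
    nop = ys.size                                    # NoP_k
    if nop < F_MIN:
        return np.inf                                # too small to show a face structure
    a = np.sqrt(3.0 * nop / 2.0)                     # ideal face height
    h = ys.max() - ys.min() + 1                      # real height of the area
    w = xs.max() - xs.min() + 1                      # real width of the area
    x_mid = (xs.min() + xs.max()) / 2.0
    left, right = xs <= x_mid, xs > x_mid
    v_l = np.std(Y[ys[left], xs[left]])              # our reading of V_L
    v_r = np.std(Y[ys[right], xs[right]])            # our reading of V_R
    return (abs((h - a) * (w - 2.0 * a / 3.0)) / nop
            * abs(0.5 - v_l / (v_l + v_r + 1e-6)))   # Eq. (10)

def is_face_candidate(mask, Y, tau2=0.006):          # Eq. (11); tau2 in (0.003, 0.009)
    return geometric_score(mask, Y) <= tau2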

2.3. Eye-map verification

In this stage, we verify the face detection result generated by the previous stage, i.e., for the case of GID(k) = 1, and remove those false alarms caused by objects with similar skin-color and facial geometric structure. Among the various facial features, the eyes and mouth are the most prominent features for recognition and estimation of the 3D head pose [5]. Next, we concentrate on the detection of eye locations in the marked face areas.

In [5], an eye-map based on the observation of chrominance values is constructed and used to locate the eye area in color images. Unlike high-resolution color images, video-conferencing sequences usually do not have enough chrominance information to generate a good eye-map with the method in [5]. For example, Fig. 5a shows the EyeMap constructed (from (1) and (2)


Fig. 7. The neighborhood structure of the candidate eye area.


in [5]) for the first Carphone video frame. It is observed that the eye locations cannot be easily derived from the obtained eye-map. In order to construct an effective eye-map, we first manually segmented the previous face data into eye patches. Different lighting conditions are considered in 314 experimental samples. Since it is difficult to obtain the exact region around the eyes, a rectangular window is used to extract them. The color distributions are presented in Fig. 6. From the statistical results, it can be observed that values of Cb > 100 and Cr < 155 with respect to the skin-tone color distribution are found around the eyes. Based on this analysis, we propose a simple method to construct the corresponding eye-map.

Let m_Y′(k), m_Cb(k), and m_Cr(k) denote the mean values of the color components Y′, Cb, and Cr for the kth candidate face area, respectively. The eye-map can be written as

Eyemap = sym̄(Y′ − m_Y′(k)) AND sym(Cb − max{100, m_Cb(k)}) AND sym̄(Cr − min{155, m_Cr(k)}),    (12)

where the symbols 'max' and 'min' denote the maximum and minimum operations, and sym̄(·) represents the inverse of sym(·), i.e., sym̄(k) = 1 if k < 0 and 0 otherwise. In addition, it should be noted that the luminance Y′ is the normalized value that has the same dimension as the chrominance components. Fig. 5b shows the eye-map obtained by our proposed method. Compared with the method in [5], the locations of the eyes can be clearly observed. For each candidate eye area, we first select the corresponding rectangular region of size h_e × w_e in the luminance space. Then the neighborhood pixels with a constant offset e are taken into account. As shown in Fig. 7, we compare the mean value of the eye area with those of its neighborhoods, i.e., N1, N2, N3, and N4. If a low value and a corresponding appropriate

Fig. 5. The construction result of the EyeMap for the Carphone video: (a) by [5], (b) by our method.

Fig. 6. The histograms of eye skin-colors. (a) Luminance Y, (b) Cb, and (c) Cr.

location (i.e., in the upper part of the current candidate area but not on the boundary) are detected, we declare that an eye structure exists in the candidate face region.
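Eq. (12) reduces to a few comparisons per pixel. The sketch below is our reading of it (the helper name eye_map and the assignment of the inverse sign function to the Y′ and Cr terms, consistent with the eye statistics quoted above, are assumptions); Y, Cb, Cr are on the chroma grid and mask selects the verified candidate face area.

import numpy as np

# Sketch of the eye-map of Eq. (12) for one candidate face area.
def eye_map(Y, Cb, Cr, mask):
    mY, mCb, mCr = Y[mask].mean(), Cb[mask].mean(), Cr[mask].mean()
    dark = Y < mY                          # inverse sym on (Y' - m_Y')
    blue = Cb >= max(100.0, mCb)           # sym on (Cb - max{100, m_Cb})
    red = Cr < min(155.0, mCr)             # inverse sym on (Cr - min{155, m_Cr})
    return mask & dark & blue & red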

2.4. Adaptive boundary correction

After the analysis of the previous stages, we obtain the final candidate face areas. In this section, we segment the face region using an adaptive boundary correction method. The flow chart of this stage is shown in Fig. 8.

For each face area, we first calculate the mean value m_i and standard deviation σ_i of the chrominance component i (i.e., i = Cb or Cr). We then label all the boundary points. For each point (j, k), we compute the global distance D(j, k) and the local distance d_w^R(j, k), which are defined as

D(j, k) = √[ (Cb(j, k) − m_Cb)² + (Cr(j, k) − m_Cr)² ],    (13)


Fig. 8. Flow chart of the presented contour correction method.


d_w^R(j, k) = √[ (Cb(j, k) − m_Cb^{R,w})² + (Cr(j, k) − m_Cr^{R,w})² + (Y′(j, k) − m_Y′^{R,w})² ],    (14)

where w denotes a window of w × w that is centered at the current point (j, k). R is used to indicate the area property in the window, i.e., R = F (candidate face region) or R = B (background region).
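Assuming the boundary point is not on the image border and the face/background membership of the window pixels is available as a boolean mask, Eqs. (13)-(14) can be sketched as follows; the helper names and the mask interface are ours.

import numpy as np

# Sketch of the global and local boundary distances of Eqs. (13)-(14).
def global_distance(Cb, Cr, j, k, m_cb, m_cr):
    return np.hypot(Cb[j, k] - m_cb, Cr[j, k] - m_cr)           # Eq. (13)

def local_distance(Y, Cb, Cr, j, k, region_mask, w=5):
    # mean colour of the face-side (R = F) or background-side (R = B) pixels
    # inside the w x w window, selected by region_mask
    r = w // 2
    win = (slice(j - r, j + r + 1), slice(k - r, k + r + 1))
    sel = region_mask[win]
    my, mcb, mcr = Y[win][sel].mean(), Cb[win][sel].mean(), Cr[win][sel].mean()
    return np.sqrt((Y[j, k] - my) ** 2 + (Cb[j, k] - mcb) ** 2
                   + (Cr[j, k] - mcr) ** 2)                     # Eq. (14)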

In addition, in order to improve the accuracy of the boundary correction and avoid false detection, especially in the case of blurry regions, we next introduce a constraint on the boundary curvature. It is known that the contour of a human face usually exhibits an elliptical shape and a small curving level. To reduce the computational complexity, we employ a simple scheme to approximately describe the curvature feature at the boundary point (j, k). Assume that (j_n^+, k_n^+) and (j_n^−, k_n^−) denote the boundary points that have offset n relative to the initial point (j, k) in the counterclockwise and clockwise directions, respectively. The corresponding curvature value cv_n(j, k) and direction cd_n(j, k) are defined as

cv_n(j, k) = √[(j_n^+ − j_n^−)² + (k_n^+ − k_n^−)²] / ( √[(j_n^+ − j)² + (k_n^+ − k)²] + √[(j_n^− − j)² + (k_n^− − k)²] ),    (15)

cd_n(j, k) = 1 if f(⌊(j_n^+ + j_n^−)/2⌋, ⌊(k_n^+ + k_n^−)/2⌋) = 1, and −1 otherwise,    (16)

where the symbol ⌊·⌋ denotes the downward truncation (floor) operation. If the condition (17) is satisfied (i.e., T1(j, k) = 1), we remove this point, and then select its neighborhood pixels in the candidate region as new boundary points. On the other hand, if a neighborhood pixel (j_s, k_s) that does not belong to the marked area also tends to exhibit a similar feature as the face area, i.e., the condition (18) (T2(j_s, k_s) = 1) is satisfied, we extend this pixel as a new boundary point. Finally, we employ dilation and erosion with a square structuring element to perform the binary image regularization.

T1(j, k) = 1 if { D(j, k) > 2√(σ_Cb² + σ_Cr²) or d_w^B(j, k)/d_w^F(j, k) < 1 } or { cv_n(j, k) < 0.8 and cd_n(j, k) = −1 for n = 2, 4, 10 }, and T1(j, k) = 0 otherwise.    (17)

T2(j_s, k_s) = 1 if D(j_s, k_s) < 3√(σ_Cb² + σ_Cr²) and d_w^B(j_s, k_s)/d_w^F(j_s, k_s) ≥ 1 and T1(j, k) = 0, and T2(j_s, k_s) = 0 otherwise.    (18)
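The curvature test used inside condition (17) can be sketched as follows, given the contour as an ordered, closed list of boundary points; the array interface and the wrap-around indexing are our own conventions.

import numpy as np

# Sketch of the curvature measures of Eqs. (15)-(16). `contour` is an (L, 2) array of
# (j, k) coordinates along the closed face contour and `face` is the binary candidate
# map f(x, y); indexing wraps around the contour.
def curvature(contour, face, l, n):
    j, k = contour[l]
    jp, kp = contour[(l + n) % len(contour)]     # offset n counterclockwise
    jm, km = contour[(l - n) % len(contour)]     # offset n clockwise
    chord = np.hypot(jp - jm, kp - km)
    arcs = np.hypot(jp - j, kp - k) + np.hypot(jm - j, km - k)
    cv = chord / arcs                            # Eq. (15)
    mid = ((jp + jm) // 2, (kp + km) // 2)
    cd = 1 if face[mid] == 1 else -1             # Eq. (16)
    return cv, cd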

3. Tracking based face segmentation

The face segmentation in the successive frames is achieved by a tracking-based method. Many algorithms have been proposed to track human faces in a video [46-49]. Generally, considerable computational complexity is involved in order to achieve good tracking performance. In our work, the face tracking is used as the initial step for the face segmentation in the subsequent frames, and a simple and efficient tracking method is developed. The flowchart of our algorithm is shown in Fig. 9. The major steps in this stage are the boundary matching and connection, which are enclosed in the rectangular block in Fig. 9. In this section, a novel boundary saliency map is first constructed to determine the face boundary. Then a connection technique between two key points is employed to extract this region.

3.1. Boundary saliency map

The first step in tracking the face region is the projection of the information from the previous frame (i − 1) into the current frame i [2]. By applying motion estimation, we can easily obtain the current position of each candidate face area. Here, a simple coarse-to-fine motion estimation technique is employed to obtain the projected position, as sketched below. We first perform a full-search motion estimation in a sampling space of half the spatial resolution within a given window. Then, a search at the finer scale (i.e., over the eight neighboring points) is performed to obtain the best matched position. The sum of absolute differences (SAD) is used as the similarity measure between two regions. Generally, a large search window can be used to deal with different degrees of face motion so that the face can be tracked over successive frames. In our work, a 16 × 16 search window is considered in the motion estimation stage, which covers the range of face motion in head-and-shoulder videos.
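A minimal sketch of this coarse-to-fine block matching is given below. It assumes grayscale frames as NumPy arrays; the coarse pass is realized here as a stride-two displacement search, which is one possible reading of "half the spatial resolution", and the helper names are ours.

import numpy as np

# Sketch of coarse-to-fine SAD block matching for projecting a face region from the
# previous frame into the current one.
def sad(a, b):
    return np.abs(a.astype(np.int32) - b.astype(np.int32)).sum()

def match_region(prev, curr, top, left, h, w, search=16):
    patch = prev[top:top + h, left:left + w]
    best = (np.inf, 0, 0)                         # (cost, dy, dx)
    def try_offset(dy, dx, best):
        y, x = top + dy, left + dx
        if 0 <= y and y + h <= curr.shape[0] and 0 <= x and x + w <= curr.shape[1]:
            cost = sad(patch, curr[y:y + h, x:x + w])
            if cost < best[0]:
                return (cost, dy, dx)
        return best
    # coarse pass: every second displacement inside the search window
    for dy in range(-search, search + 1, 2):
        for dx in range(-search, search + 1, 2):
            best = try_offset(dy, dx, best)
    # fine pass: the eight neighbours of the coarse optimum
    cy, cx = best[1], best[2]
    for dy in range(cy - 1, cy + 2):
        for dx in range(cx - 1, cx + 2):
            best = try_offset(dy, dx, best)
    return best[1], best[2]                       # motion vector (dy, dx)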

Without loss of generality, we take only one face region into account in the following analysis. Assume f_{i−1} denotes the candidate face region of the (i − 1)th frame in the video sequence, and E_l, l = 1, 2, ..., L, corresponds to its boundary points. We then use f′_i with the boundary points E′_l to denote the face area projected by motion compensation into the ith frame. Let (x, y) be the position of a pixel in the current frame i. We define the boundary saliency map (BSM) as

Bsm(x, y) = Q1(x, y) · Q2(x, y) · Q3(x, y),    (19)


Fig. 9. Flowchart of the proposed tracking based segmentation algorithm.


Q1(x, y) = exp{ −[ (Cr(x, y) − m_Cr^{f_{i−1}})² / (8σ²(Cr_{f_{i−1}})) + (Cb(x, y) − m_Cb^{f_{i−1}})² / (8σ²(Cb_{f_{i−1}})) ] },    (20)

Q2(x, y) = N{Es(Y′(x, y))} · N{Es(Cr(x, y))},    (21)

Q3(x, y) = c / ( c + min{ √[(x − x_{E′_l})² + (y − y_{E′_l})²], l = 1, 2, ..., L } ),    (22)

where Q1, Q2, and Q3 denote the "conspicuity maps" corresponding to the color, edge, and position components, respectively. Here, m_Cr^{f_{i−1}} and m_Cb^{f_{i−1}} correspond to the mean values of the color components of the segmented face area in frame i − 1, while σ(·) denotes the standard deviation. The symbol N is a normalization operator. Es(·) at position (x, y) denotes the edge strength obtained by the Canny algorithm [44]. In (22), (x_{E′_l}, y_{E′_l}) represents the coordinates of boundary point E′_l. The variable c is used to adjust the influence of the distance to the current detected pixel. Generally, a small value of c can be set if the current frame is at a static or small motion level, while a larger value can be employed for a fast-moving face region. From (19), we can see that if the edge points close to the projected boundaries have the same color feature as the segmented face region in the previous frame, large BSM values will be obtained for these points. On the contrary, background points tend to have smaller BSM values due to their inconspicuous color or position features. For example, Fig. 10b and d show the obtained BSMs of the 2nd and 65th frames in the Claire and Carphone video sequences (namely Fig. 10a and c),

Fig. 10. (a and c) Original input images of the 2nd and 65th frames in the Claire and Carphone video sequences, respectively. (b and d) The corresponding BSMs obtained by (19).

respectively. It is observed that most of the facial boundary points exhibit distinctly larger values than the background points. In addition, if all the pixels of the current frame were taken into account, the computational complexity would increase significantly. In fact, most background pixels that are far from the face contour tend to hold very small BSM values. Therefore, in order to save computational load, we only consider the neighboring pixels of the facial boundary points in the following matching procedure, i.e., within a window of w × w. Assuming that the pixels in the given window have the same position conspicuity, the corresponding BSM can be rewritten as

Bsm(x, y) = Q1^w(x, y) · Q2^w(x, y),    (23)

where Q1^w and Q2^w denote the "conspicuity maps" calculated from the given window of w × w.
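The sketch below illustrates Eqs. (19)-(22) over the full frame. Several details are assumptions of ours: the normalization operator N is taken as division by the maximum, the binary Canny response is used in place of a graded edge strength Es, and the interface (boolean mask of the previous segmentation, array of projected boundary points) is our own.

import numpy as np
import cv2

# Sketch of the boundary saliency map, Eqs. (19)-(22). Y, Cb, Cr are the colour planes
# of the current frame; prev_Cb, prev_Cr and prev_mask describe the segmented face of
# frame i-1; boundary is an (L, 2) array of projected boundary points E'_l.
def bsm(Y, Cb, Cr, prev_Cb, prev_Cr, prev_mask, boundary, c=4.0):
    m_cr, s_cr = prev_Cr[prev_mask].mean(), prev_Cr[prev_mask].std()
    m_cb, s_cb = prev_Cb[prev_mask].mean(), prev_Cb[prev_mask].std()
    q1 = np.exp(-((Cr - m_cr) ** 2 / (8 * s_cr ** 2 + 1e-6) +
                  (Cb - m_cb) ** 2 / (8 * s_cb ** 2 + 1e-6)))          # Eq. (20)
    e_y = cv2.Canny(Y.astype(np.uint8), 50, 150).astype(float)
    e_cr = cv2.Canny(Cr.astype(np.uint8), 50, 150).astype(float)
    q2 = (e_y / max(e_y.max(), 1.0)) * (e_cr / max(e_cr.max(), 1.0))   # Eq. (21)
    yy, xx = np.indices(Y.shape)
    d = np.full(Y.shape, np.inf)
    for by, bx in boundary:                                            # distance to nearest E'_l
        d = np.minimum(d, np.hypot(yy - by, xx - bx))
    q3 = c / (c + d)                                                   # Eq. (22)
    return q1 * q2 * q3                                                # Eq. (19)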

3.2. Boundary extraction

After the calculation of the BSM, the next step is to find the boundary points in the current frame i and segment the corresponding face region. Here, we use two stages to realize the extraction process.

The first is the boundary matching, which aims to find the best points, with maximal BSM values, for the projected facial boundary. Let Ê′_l, l = 1, 2, ..., L, denote the estimates of the positions of the previous boundary points E_l in the current frame. We have

Ê′_l = arg max_{(x, y) ∈ w(E′_l)} { [ Σ_{i=−M}^{+M} Bsm(x_i, y_i) · g(x_i, y_i) ]^{p(x, y)} },    (24)


g(x_i, y_i) = [ 1 / √(π(2M + 1)) ] · exp( −i² / (2M + 1) ),    (25)

p(x, y) = ( 1 − [1/(2M + 1 + J)] Σ_{i=−M}^{+M} Σ_{j ∈ N^F_{x_i, y_i}} Q1^j(x_i, y_i) ) · ( 1 − [1/(2M + 1 + J̄)] Σ_{i=−M}^{+M} Σ_{j ∈ N^B_{x_i, y_i}} Q1^j(x_i, y_i) ),    (26)

where (x_{+i}, y_{+i}) and (x_{−i}, y_{−i}) have the same meaning as (x_i^+, y_i^+) and (x_i^−, y_i^−), respectively, which were defined in Section 2.4, and w(E′_l) denotes the window centered at E′_l. Eq. (25) is used to eliminate the impact of singular points by computing the weighted mean value of the BSM at (x, y). The exponential variable p depicts the boundary feature in the given window. In (26), N^F_{x_i, y_i} and N^B_{x_i, y_i} denote the neighborhoods of the point (x_i, y_i) that belong to the projected facial and background areas, respectively. Similarly, J and J̄ are used to indicate the numbers of pixels in the corresponding areas. From (19), it is observed that a large BSM value usually corresponds to the facial boundary. However, in some cases of lighting variation, similar BSM values may be observed for the actual edge and other edges. We can see that p provides an approximate evaluation of the boundary position at the pixel (x, y) within the search region. If p(x, y) holds a small value, the current detected point tends to lie on the actual facial boundary; otherwise, it should be classified as a point inside the face region or in the background. In addition, it should be noted that variable sizes are employed for the parameters M and N_{x_i, y_i} in our work, i.e., 10 and 3 for the static case and 5 and 5 for the motion case, respectively. The reason is that a small M and a large N_{x_i, y_i} are more suitable to accommodate the facial boundary deformation during motion. On the contrary, in order to avoid false detection in the case of small boundary movement, more boundary points and a small window size are employed to impose stricter curvature and moving-step constraints during the search procedure.
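A much-simplified sketch of the matching step of Eqs. (24)-(25) is given below. It is one possible reading only: the 2M+1 points (x_i, y_i) are taken as the projected boundary points around E′_l shifted together with the candidate displacement, and the exponent p(x, y) of Eq. (26) is omitted (treated as 1); the function name and array interface are ours.

import numpy as np

# Simplified sketch of BSM-based boundary matching, Eqs. (24)-(25), with p(x, y) = 1.
def match_boundary(bsm_map, boundary, M=10, win=3):
    L = len(boundary)
    offsets = np.arange(-M, M + 1)
    g = np.exp(-offsets ** 2 / (2 * M + 1)) / np.sqrt(np.pi * (2 * M + 1))   # Eq. (25)
    H, W = bsm_map.shape
    matched = []
    for l in range(L):
        cy, cx = boundary[l]
        best, best_pt = -np.inf, (cy, cx)
        for dy in range(-win, win + 1):
            for dx in range(-win, win + 1):
                score = 0.0
                for m, i in enumerate(offsets):              # Eq. (24) with p = 1
                    ny, nx = boundary[(l + i) % L]
                    y = min(max(ny + dy, 0), H - 1)
                    x = min(max(nx + dx, 0), W - 1)
                    score += bsm_map[y, x] * g[m]
                if score > best:
                    best, best_pt = score, (cy + dy, cx + dx)
        matched.append(best_pt)
    return np.array(matched)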

The second is the postprocessing stage, which is used to connect broken boundary points. This procedure is only applied to nonclosed contours after the boundary matching. It is known that if the face contour deforms during the face movement, such as when the face approaches the camera, some new contour points will be generated, which results in a discontinuity of the matched boundary. Here, a simple linear connection method between two broken boundary points is employed to form a closed loop for the face contour. Finally, for the closed contour, we use a filling-in technique to extract the corresponding area.

Fig. 11. First row: Four test MPEG video sequences (Akiyo, Claire, Carphone, Foreman). Second row: The corresponding segmentation results based on our proposed method.

4. Segmentation experiments

In this section, we evaluate the effectiveness of the proposed face segmentation algorithm. For the evaluation, we use several standard MPEG test sequences in QCIF (176 × 144) format (namely Akiyo, Claire, Carphone, and Foreman) as the input videos.

4.1. FSM-based segmentation

The segmentation results on the first frames of the four video sequences (i.e., Akiyo, Claire, Carphone, and Foreman) are shown in Fig. 11. The first row contains the original input frames, while the second corresponds to the segmented results. We can see that the face areas in the images are segmented successfully for all test sequences. Most of the background regions with similar skin-tone color are removed from the detected images, while the prominent face saliency area is preserved. Only a minimal amount of false alarms, such as the inner boundary of the safety helmet in the Foreman frame, are detected as face regions due to the similar skin-color feature. Additionally, it is observed that some of the hair on the head has been detected as skin, which should not be considered a significant problem in an object-based coding scheme [7].

The comparison results with existing skin-color segmentation methods are illustrated in Fig. 12. The result depicted in Fig. 12a is obtained by the method in [3]. It should be noted that the horizontal and vertical scanning processes in the geometric correction stage are skipped because of the small sample size of the QCIF image. Namely, the operation in which any group of fewer than four connected pixels with the value of one is eliminated and set to zero for a CIF image [3] is not employed in our experiment. Fig. 12b shows the skin-color segmentation result obtained by the method in [5]. This method employs color correction in the RGB color space before skin detection, based on the assumption that the dominant bias color always appears as "real white". If the number of reference-white pixels [5] exceeds 100 and the average color is not similar to skin tone, the color adjustment is performed. From the result, we find that a lot of background pixels are also detected as skin regions. A similar evaluation result



Fig. 12. Segmentation results for the first frame in Carphone sequence: (a) Chai [3], (b) Hsu [5], (c) Jones [9], and (d) Habili [7].


of this method can be found in [45]. Fig. 12c shows the detection result using the mixture-of-Gaussians skin-color model described in [9]. The skin-color segmentation result for the Carphone sequence by the method in [7] can be found in Fig. 12d. It is observed that high miss-detection rates are produced by the last two algorithms. In addition, when we applied the postprocessing method in [3] to the output skin-color map obtained from the linear classifier [4], a similar segmentation result was observed as compared to Fig. 12a. From a large number of experiments, we have found that the skin-color detection results of the existing skin-color detection schemes change significantly with different image sources or lighting conditions.

To further validate our proposed algorithm, three other human objects with different lighting and noise conditions are tested. One image with indoor lighting, selected from the MPEG-7 News1 video sequence, is shown in Fig. 13a, where the face area appears near the top left of the image instead of the center position. An image with outdoor lighting and large illumination variation, selected from [52], is depicted in Fig. 13b. A noisy image (i.e., a movie poster from the WWW) is shown in Fig. 13c. The corresponding segmentation results are illustrated in Fig. 13d-f, respectively. It is found that the face region has been segmented very well, and most of the facial pixels are correctly identified with only

Fig. 13. Face segmentation results for human objects with different lighting and noise conditions. (a and d) Result for indoor lighting. (b and e) Result for outdoor lighting with illumination variation. (c and f) Result for a noisy image.

a minimal amount of false alarms at the boundary points. More examples are shown in Fig. 14, where the first row of images is selected from the widely used Berkeley Segmentation Dataset and Benchmark [50]. The segmentation results of our method are given in the second row. From the experimental results, we can see that good performance can be achieved by the proposed face segmentation method.

We also compare our method with the well-known graph cut method [36], which is considered a successful bilayer segmentation method. The results are shown in Fig. 15, where the first row shows the user's marks for the face (dark lines) and the background (light lines). The second row shows the corresponding results. We can see that our attention-based automatic face segmentation performs equally well, with the distinct advantage of being unsupervised.

4.2. BSM-based segmentation for tracking procedure

After the segmentation of the first frame, the face regions in successive frames are obtained by the tracking-based segmentation procedure. We apply our proposed algorithm to the Carphone sequence, which is characterized by mild-to-fast motion of the face and the background in the video scene. Some


Fig. 15. Face segmentation results for interactive graph cut.

Fig. 14. Face segmentation results of our method for Berkeley database.


original frames are shown in the first and the third rows of Fig. 16. The final segmentation results are shown in the second and final rows, respectively. As opposed to a static video sequence, such as Akiyo or Claire, the Carphone sequence represents a scene with a deformable face contour due to the fast head movement. It is observed that the face contour appears significantly different with the varying postures of the human object. The segmentation result indicates that our proposed method is capable of capturing the deformable face boundary effectively under motion. Moreover, as shown in Fig. 16, the

appearance of the hand in the 178th frame, which also has the skin-color feature, does not alter the segmentation of the tracked face region.

Another experiment with our segmentation algorithm is performed on two handset video sequences in QCIF format. Some representative frames are shown in the first and second rows of Fig. 17. Unlike the previous sequences, such as Carphone, we use low-quality compressed handset videos as the input sequences, which have a very low signal-to-noise ratio. The first sequence, given in the first row of Fig. 17, includes fast movement of the head


Fig. 16. Segmentation results for the Carphone video sequence after face tracking. The first and the third rows: original input frames. The second and the final rows: segmentation results.


and shoulders against a static background. In the second sequence, a moving background is considered as well as varying lighting conditions. The corresponding segmentation results are given in the second and the fourth rows of Fig. 17. We can see that good performance can be achieved using our proposed segmentation algorithm for these videos with noise and varying illumination conditions.

Next, we analyze the behavior of the proposed tracking-based segmentation algorithm in the case of errors in the object segmentation result. First, the appearance of false alarms in the previous segmentation is taken into account. As shown on the top left of Fig. 18, we manually add some background areas to the original segmentation result. The following figures in the top row show the segmentation results obtained by using our proposed BSM-based segmentation algorithm in the tracking procedure. It is observed that the false parts disappear gradually after several frames of segmentation. The boundary points in the person's clothing converge rapidly to the correct boundary. From the result shown in the 9th frame, we can see that most of the false alarms are corrected by our proposed method. Second, the case of miss alarms is also tested in our experiments. In the same way, we manually remove some facial pixels from the

original segmentation, which is shown in the left figure at the bottom of Fig. 18. We can see that a large part of the face region is removed as miss alarms. However, the face contour is correctly identified after the segmentation of the following frames. The corresponding face boundary is restored to the correct status, which can be seen in the result of Frame 14. From the evaluation of this sequence, it is shown that even if the segmentation does not present a satisfactory result for the following tracking stage, the proposed BSM-based segmentation algorithm is capable of correcting these errors effectively and maintaining good segmentation results.

Now, we consider the computational complexity of the proposed FSM-based face segmentation algorithm. In our method, the facial saliency map, geometric model verification, and eye-map detection must be taken into account in order to obtain the correct facial region. However, the computational load of the last two stages is mainly imposed on the output candidate face areas, which are much smaller than the original image. Generally, in order to decrease the computational load, we can directly skip the last two stages if only one candidate face area is detected. In addition, for the tracking-based segmentation algorithm, i.e., the BSM-based segmentation, most of the computation is consumed in the motion estimation


Fig. 17. Face segmentation results for two handset videos.

Fig. 18. Example of robustness of the proposed BSM algorithm in case of errors in the segmentation. Top: the appearance of false alarms. Bottom: the appearance of miss alarms.


process. Only a small number of operations are required for the subsequent BSM computation.

In addition, it should be noted that the proposed method employs an adjustable threshold to obtain the candidate face areas from the FSM. Generally, the smaller τ1 is, the more false alarms are allowed, and vice versa. By varying τ1, we can control the amount of false alarms and false dismissals allowed in the FSM stage. Based on a large number of experiments, a dynamic threshold τ1 (namely (0.3 ~ 0.5) · S_max), which can provide better

candidate outputs, is reasonable for pixel classification. Some examples are illustrated in Fig. 19. Here, S_max denotes the maximal value of the facial saliency map. Of course, it is difficult to obtain an accurate face boundary by the threshold alone. Nevertheless, it can be improved by the subsequent stages of the geometric verification model and the adaptive boundary correction. That is, the false face candidates can be successfully removed by employing the following verification models. Similarly, the false and miss alarms in the selected face


Fig. 19. Experimental results of the Akiyo, Carphone, and Foreman videos for different τ1. The values of τ1 from left to right: 0.3, 0.45, 0.5, 0.6.


regions can be eliminated by using the final adaptive boundary correction.

Moreover, we also test the effect of the local window size on the segmentation performance. Fig. 20 shows the segmentation results of a frame of the Carphone video using different local windows, i.e., 4-neighborhood (diamond window), 3 × 3, 5 × 5, and 7 × 7. Similar results can be obtained for the small windows, such as 4-neighborhood and 3 × 3.

In addition, compared with the Viola and Jones method [51], which detects human faces based on AdaBoost learning, there are distinct differences from our segmentation method. First, our method concentrates on face segmentation based on the attention model rather than on face localization in the input image. The obtained facial saliency map not only provides the candidate face position, but also achieves a coarse segmentation. In contrast, the Viola and Jones method focuses on finding the face location using a detection window at many scales; in order to extract the human face, another segmentation method has to be employed. Second, the Viola and Jones method is designed for grey-level images without consideration of skin-color information, whereas in our work the chrominance components are important

Fig. 20. Experimental results of carphone video for different local w

tant cues in our facial saliency model. Finally, the skin-color andthe gradient values are usually employed in pre-filtering the inputimage to speed up the face detection for the Viola and Jonesmethod.

5. Conclusion

In this paper, an effective face segmentation algorithm is proposed from the facial saliency point of view. The basic idea is motivated by the observation that the face area usually exhibits a special color distribution and structural features, especially in typical "head-and-shoulder" video. Therefore, a novel facial saliency map is first constructed based on color, structure, and position information. Then a geometric model and facial features are used to verify the candidate face areas. Based on the segmented result, an effective boundary saliency map is constructed and applied to the subsequent tracking process. Unlike other approaches, the proposed method puts heavy emphasis on the facial saliency features of head-and-shoulder videos. Experimental results were obtained by applying the proposed method to a large number of video sequences, and they show that our method is capable of segmenting the face quite effectively. In future work, we hope to design a more robust and efficient face segmentation algorithm for the case of lighting variations and complex backgrounds, and to make it suitable for videophone and video-conferencing applications.


Acknowledgments

This work was partially supported by the Shun Hing Institute of Advanced Engineering and the Research Grants Council of the Hong Kong SAR, China (Project CUHK415505).

References

[1] T. Meier, K.N. Ngan, Video segmentation for content-based coding, IEEE Trans. Circ. Syst. Video Technol. 9 (8) (1999) 1190–1203.
[2] A. Cavallaro, O. Steiger, T. Ebrahimi, Tracking video objects in cluttered background, IEEE Trans. Circ. Syst. Video Technol. 15 (4) (2005) 575–584.
[3] D. Chai, K.N. Ngan, Face segmentation using skin-color map in videophone application, IEEE Trans. Circ. Syst. Video Technol. 9 (4) (1999) 551–564.
[4] C. Garcia, G. Tziritas, Face detection using quantized skin color regions merging and wavelet packet analysis, IEEE Trans. Multimedia 1 (3) (1999) 264–277.
[5] R.L. Hsu, M.A. Mottaleb, A.K. Jain, Face detection in color images, IEEE Trans. Pattern Anal. Mach. Intell. 24 (5) (2002) 696–706.
[6] M. Lievin, F. Luthon, Nonlinear color space and spatiotemporal MRF for hierarchical segmentation of face features in video, IEEE Trans. Image Process. 13 (1) (2004) 63–71.
[7] N. Habili, C.C. Lim, A. Moini, Segmentation of the face and hands in sign language video sequences using color and motion cues, IEEE Trans. Circ. Syst. Video Technol. 14 (8) (2004) 1086–1097.
[8] S.L. Phung, A. Bouzerdoum, D. Chai, Skin segmentation using color pixel classification: analysis and comparison, IEEE Trans. Pattern Anal. Mach. Intell. 27 (1) (2005) 148–154.
[9] M.J. Jones, J.M. Rehg, Statistical color model with application to skin detection, Int. J. Comput. Vis. 46 (1) (2002) 81–96.
[10] H. Greenspan, J. Goldberger, I. Eshet, Mixture model for face color modeling and segmentation, Pattern Recogn. Lett. 22 (2001) 1525–1536.
[11] B. Menser, M. Wien, Segmentation and tracking of facial regions in color image sequences, SPIE Vis. Commun. Image Process. 4067 (2000) 731–740.
[12] M.-H. Yang, N. Ahuja, Detecting human faces in color images, IEEE ICIP '98 1 (1998) 127–130.
[13] J.-C. Terrillon, M. David, S. Akamatsu, Detection of human faces in complex scene images by use of a skin color model and of invariant Fourier–Mellin moments, in: Proc. Fourteenth International Conference on Pattern Recognition, vol. 2, 1998, pp. 1350–1355.
[14] H. Wang, S.-F. Chang, A highly efficient system for automatic face region detection in MPEG video, IEEE Trans. Circ. Syst. Video Technol. 7 (4) (1997) 615–628.
[15] C. Chen, S.-P. Chiang, Detection of human faces in colour images, IEE Proc.: Vis. Image Signal Process. 144 (6) (1997) 384–388.
[16] S. Kawato, J. Ohya, Automatic skin-color distribution extraction for face detection and tracking, in: Proc. WCCC-ICSP 2000, 5th International Conference on Signal Processing Proceedings, vol. 2, 2000, pp. 1415–1418.
[17] J. Park, J. Seo, D. An, S. Chung, Detection of human faces using skin color and eyes, in: Proc. IEEE International Conference on Multimedia and Expo, vol. 1, 2000, pp. 133–136.
[18] S.L. Phung, D. Chai, A. Bouzerdoum, Skin colour based face detection, in: Proc. Seventh Australian and New Zealand Intelligent Information Systems Conference, 2001, pp. 171–176.
[19] T.S. Caetano, D.A.C. Barone, A probabilistic model for the human skin color, in: Proc. 11th International Conference on Image Analysis and Processing, 2001, pp. 279–283.
[20] J. Fritsch, S. Lang, A. Kleinehagenbrock, G.A. Fink, G. Sagerer, Improving adaptive skin color segmentation by incorporating results from face detection, in: Proc. 11th IEEE International Workshop on Robot and Human Interactive Communication, 2002, pp. 337–343.
[21] T.S. Caetano, S.D. Olabarriaga, D.A.C. Barone, Performance evaluation of single and multiple-Gaussian models for skin color modeling, in: Proc. XV Brazilian Symposium on Computer Graphics and Image Processing, 2002, pp. 275–282.
[22] J. Yang, Z. Fu, T. Tan, W. Hu, Skin color detection using multiple cues, in: Proc. 17th International Conference on Pattern Recognition, vol. 1, 2004, pp. 632–635.
[23] R.C. Verma, C. Schmid, K. Mikolajczyk, Face detection and tracking in a video by propagating detection probabilities, IEEE Trans. Pattern Anal. Mach. Intell. 25 (10) (2003) 1215–1228.
[24] C. Liu, A Bayesian discriminating features method for face detection, IEEE Trans. Pattern Anal. Mach. Intell. 25 (6) (2003) 725–740.
[25] S.Z. Li, X. Lu, X. Hou, X. Peng, Q. Cheng, Learning multiview face subspaces and facial pose estimation using independent component analysis, IEEE Trans. Image Process. 14 (6) (2005) 705–712.
[26] A.N. Rajagopalan, R. Chellappa, N.T. Koterba, Background learning for robust face recognition with PCA in the presence of clutter, IEEE Trans. Image Process. 14 (6) (2005) 832–843.
[27] K. Sobottka, I. Pitas, A fully automatic approach to facial feature detection and tracking, in: Proc. 1st Int. Conf. on Audio- and Video-based Biometric Person Authentication (AVBPA'97), Crans-Montana, Switzerland, 1997, pp. 77–84.
[28] K.-M. Lam, H. Yan, An analytic-to-holistic approach for face recognition based on a single frontal view, IEEE Trans. Pattern Anal. Mach. Intell. 20 (7) (1998) 673–686.
[29] S. Yan, X. He, Y. Hu, H. Zhang, M. Li, Q. Cheng, Bayesian shape localization for face recognition using global and local textures, IEEE Trans. Circ. Syst. Video Technol. 14 (1) (2004) 102–113.
[30] H. Luo, A. Eleftheriadis, Model-based segmentation and tracking of head-and-shoulder video objects for real time multimedia services, IEEE Trans. Multimedia 5 (3) (2003) 379–389.
[31] A. Criminisi, G. Cross, A. Blake, V. Kolmogorov, Bilayer segmentation of live video, Proc. CVPR (2006) 53–60.
[32] I. Levner, H. Zhang, Classification-driven watershed segmentation, IEEE Trans. Image Process. 16 (5) (2007) 1437–1445.
[33] M. Cooper, T. Liu, E. Rieffel, Video segmentation via temporal pattern classification, IEEE Trans. Multimedia 9 (3) (2007) 610–618.
[34] L. Grady, Random walks for image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 28 (11) (2006) 1768–1783.
[35] P. Yin, A. Criminisi, J. Winn, I. Essa, Tree-based classifiers for bilayer video segmentation, Proc. CVPR (2007).
[36] Y. Boykov, M.-P. Jolly, Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images, in: Proceedings of ICCV, vol. 1, 2001, pp. 105–112.
[37] L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. 20 (11) (1998) 1254–1259.
[38] Y.-F. Ma, H.-J. Zhang, A model of motion attention for video skimming, IEEE ICIP '02 1 (2002) I-129–I-132.
[39] J. Han, K.N. Ngan, M. Li, H.-J. Zhang, Towards unsupervised attention object extraction by integrating visual attention and object growing, IEEE ICIP '04 2 (2004) 941–944.
[40] Z. Lu, W. Lin, X. Yang, E. Ong, S. Yao, Modeling visual attention's modulatory aftereffects on visual sensitivity and quality evaluation, IEEE Trans. Image Process. 14 (11) (2005) 1928–1942.
[41] CVL Face Database. Available from: <http://www.lrv.fri.uni-lj.si/facedb.html/>.
[42] S.C. Hsia, P.S. Tsai, Efficient light balancing techniques for text images in video presentation systems, IEEE Trans. Circ. Syst. Video Technol. 15 (8) (2005) 1026–1031.
[43] P.T. Jackway, M. Deriche, Scale-space properties of the multiscale morphological dilation–erosion, IEEE Trans. Pattern Anal. Mach. Intell. 18 (1) (1996) 38–51.
[44] J. Canny, A computational approach to edge detection, IEEE Trans. Pattern Anal. Mach. Intell. 8 (6) (1986) 679–698.
[45] B. Martinkauppi, M. Soriano, M. Pietikainen, Detection of skin color under changing illumination: a comparative study, in: Proceedings of the 12th IEEE International Conference on Image Analysis and Processing, 2003, pp. 652–657.
[46] R.C. Verma, C. Schmid, K. Mikolajczyk, Face detection and tracking in a video by propagating detection probabilities, IEEE Trans. Pattern Anal. Mach. Intell. 25 (10) (2003) 1215–1228.
[47] B. Froba, C. Kublbeck, Face tracking by means of continuous detection, Proc. CVPR (2004) 65–71.
[48] V.D. Scott, L. Yin, M.J. Ko, T. Hung, Multiple-view face tracking for modeling and analysis based on non-cooperative video imagery, Proc. CVPR (2007).
[49] Y. Li, H. Ai, Y. Takayoshi, S. Lao, K. Masato, Tracking in low frame rate video: a cascade particle filter with discriminative observers of different lifespans, Proc. CVPR (2007) 1–8.
[50] D. Martin, C. Fowlkes, D. Tal, J. Malik, A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics, in: Proc. 8th Int'l Conf. Computer Vision, vol. 2, 2001, pp. 416–423.
[51] P. Viola, M.J. Jones, Robust real-time face detection, Int. J. Comput. Vis. 57 (2) (2004) 137–154.
[52] C. Rother, V. Kolmogorov, A. Blake, GrabCut: interactive foreground extraction using iterated graph cuts, in: Proceedings of SIGGRAPH, 2004.