

628 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 7, NO. 4, AUGUST 2005

Automatic Creation of a Talking Head From a Video Sequence

Kyoung-Ho Choi, Member, IEEE, and Jenq-Neng Hwang, Fellow, IEEE

Abstract—In this paper, a real-time system that creates a talking head from a video sequence without any user intervention is presented. In the proposed system, a probabilistic approach is presented to decide whether or not extracted facial features are appropriate for creating a three-dimensional (3-D) face model. Two-dimensional facial features extracted automatically from a video sequence are fed into the proposed probabilistic framework before a corresponding 3-D face model is built, to avoid generating an unnatural or nonrealistic 3-D face model. To extract the face shape, we also present a face shape extractor based on an ellipse model controlled by three anchor points, which is accurate and computationally cheap. To create a 3-D face model, a least-square approach is presented to find the coefficient vector that is necessary to adapt a generic 3-D model to the extracted facial features. Experimental results show that the proposed system can efficiently build a 3-D face model from a video sequence without any user intervention for various Internet applications, including virtual conferencing and virtual storytelling, that do not require much head movement or high-quality facial animation.

Index Terms—MPEG-4 facial object, probabilistic approach, speech-driven talking heads, talking heads, virtual face.

I. INTRODUCTION

ADVANCES in computing power and numerical algorithms in graphics and image processing make it possible to build a realistic three-dimensional (3-D) face from a video sequence by using a regular PC camera. However, in most reported systems, user intervention is generally required to provide feature points at the initialization stage [1]–[4]. In the initialization stage, feature points in two orthogonal frames or in multiple frames have to be provided carefully to generate a photo-realistic 3-D face model. These techniques can build high-quality face models, but they are computationally expensive and time consuming.

For various multimedia applications such as video conferencing, e-commerce, and virtual anchors, integrating talking heads is highly desirable to enrich the human-computer interface. To provide talking head solutions for these multimedia applications, which do not require high-quality animation, fast and easy ways to build a 3-D face model have been investigated so that many different face models can be generated in a short time period. However, user intervention is still required to provide several

Manuscript received May 29, 2002; revised March 15, 2004. The associate editor coordinating the review of this paper and approving it for publication was Dr. Jean-Luc Dugelay.

K.-H. Choi is with the Electronics Engineering Department, Mokpo National University, Chonnam 534-729, Korea (e-mail: [email protected]).

J.-N. Hwang is with the Information Processing Laboratory, Department of Electrical Engineering, University of Washington, Seattle, WA 98195-2500 (e-mail: [email protected]).

Digital Object Identifier 10.1109/TMM.2005.850964

corresponding points in two frames from a video sequence [4] or feature points in a single frontal image [5], [6]. In this paper, we present a real-time system that extracts facial features automatically and builds a 3-D face model from a video sequence without any user intervention.

Approaches for creating a 3-D face model can be classified into two groups. Methods in the first group [1]–[3] use a generic 3-D model, usually generated by a 3-D scanner, and deform it by calculating the coordinates of all vertices in the 3-D model. Lee et al. [1] treated the deformation of vertices in a 3-D model as an interpolation of the displacements of given control points. They used the Dirichlet Free-Form Deformation technique [7] to calculate the new 3-D coordinates of a deformed model. Pighin et al. [3] also treated model deformation as an interpolation problem and used radial basis functions to find new 3-D coordinates for the vertices of a generic 3-D model. In contrast, methods in the second group use multiple 3-D face models to find the 3-D coordinates of all vertices in a new 3-D model based on given feature points; they combine multiple 3-D models into a new model by calculating combination parameters. Blanz et al. [8] used a laser scanner (Cyberware) to generate a 3-D model database and considered a new face model as a linear combination of the shapes of the 3-D faces in the database. Liu et al. [4] simplified the idea of linearly combining 3-D models by designing key 3-D faces from which a new 3-D model is built as a linear combination, eliminating the need for a large 3-D face database. The merit of the approaches in the second group is that linearly created face objects cannot easily produce an unnatural face, which is a very important property when creating a 3-D face model without user intervention.

Emerging Internet applications equipped with a talking head system, such as merchandise narrators [10], virtual anchors [11], and e-commerce, do not require high-quality facial animation, e.g., that used in Shrek or Toy Story. Furthermore, the movement of a 3-D face model in those applications, i.e., rotation about the x and y axes, can be restricted to within 5-10 degrees. In other words, although the movement of the talking head is limited, users still do not feel uncomfortable in these applications. Recent approaches that create a 3-D face model from a single image are applicable to those Internet applications [5], [6], [10]. Valle et al. [5] used 18 manually extracted feature points and an interpolation technique based on radial basis functions to obtain the coordinates of the polygon mesh of a 3-D model. Kuo et al. [6] used anthropometric and a priori information to estimate the depth of a 3-D face model. Lin et al. [10] used a two-dimensional (2-D) mesh model to animate a talking head by mesh warping. They manually adjust the control points of the mesh

1520-9210/$20.00 © 2005 IEEE


Fig. 1. Modeling of different faces using an ellipse: (a) square; (b) triangular; (c) trapezoid; (d) long-narrow; and (e) same size in width and height.

to fit the eyes, nose, and mouth to an input image. All these approaches, which obtain a 3-D face model from a single image, are computationally cheap and fast, making them suitable for generating multiple face models in a short time. Although the depth information of a 3-D model created by these approaches is not as accurate as that of more labor-intensive approaches, such as [1], [3], the textured 3-D face models are good enough for various Internet applications that do not require high-quality facial animation.

In this paper, we present a real-time system that extracts facial features automatically and builds a 3-D face model without any user intervention. The main contributions of this paper can be summarized as follows. First, we propose a face shape extractor that is simple and accurate for various face shapes. We believe face shape is one of the most important facial features in creating a 3-D face model. Our face shape extractor uses an ellipse model controlled by three anchor points and extracts various face shapes successfully. Second, we present a probabilistic network that maximally uses facial feature evidence in deciding whether extracted facial features are suitable for creating a 3-D face model. To create a 3-D model from a video sequence without any user intervention, we need to keep extracting facial features and checking, in a systematic way, whether the extracted features are good enough to build a 3-D model. We propose a facial feature net, a face shape net, and a topology net to verify the correctness of extracted facial features, which also enables the algorithm to extract facial features more accurately. Third, a least-square approach for creating a 3-D face model based on the extracted facial features is presented. Our approach to 3-D model adaptation is similar to Liu's approach in the sense that a 3-D model is described as a linear combination of a neutral face and deformation vectors. The differences are that we use a least-square approach to find the coefficients for the deformation vectors and that we build the 3-D face model from a video sequence with no user input. Lastly, a talking head system is presented by combining an audio-to-visual conversion technique based on constrained optimization [25] with the proposed automatic scheme of 3-D model creation.

The organization of this paper is as follows. In Section II, the proposed face shape extractor based on an ellipse model controlled by three anchor points is presented. A detailed explanation of the probabilistic networks is given in Section III. In Section IV, the proposed least-square approach to create a 3-D face model is described. In Section V, experimental results as well as the implementation of the proposed real-time talking head system are described. Finally, conclusions and future work are given in Section VI.

II. FACE SHAPE EXTRACTOR

Face shape is one of the most important features in creating a 3-D face model. In this section, we propose a novel method to extract face shape that is simple and accurate for various face shapes. Based on the study of the anthropometry of the head and face [18], a face can be classified into one of several types of shapes, e.g., long-narrow, square, triangular, and trapezoid, as shown in Fig. 1. Our face shape extractor uses an ellipse model controlled by three anchor points, as shown in Fig. 1(d) and (e). The idea behind the proposed model is that we can control the shape of a face by fixing the bottom anchor point and moving the left and right anchor points up and down, as shown in Fig. 1(d) and (e). For the different faces shown in Fig. 1, an ellipse that contains the three anchor points (left, right, and bottom) can describe the various face shapes correctly and smoothly, although each ellipse requires different parameters. For instance, to describe a long-narrow face, the left and right anchor points need to be moved up, as shown in Fig. 1(d). Based on this observation, shape extraction is treated as the problem of finding the parameters of an ellipse that produces maximum boundary energy under the constraint that the ellipse must contain the three anchor points. The detailed steps of the face shape extraction are as follows.

1) Find the three anchor points (left, bottom, and right) in the 180-, 270-, and 0-degree directions.

2) Draw an ellipse through the three detected anchor points. If the ellipse is parameterized by its horizontal and vertical axes a and b, then a is determined by the distance between the x positions of the left and right anchor points, and b by the distance between the y positions of the side anchor points and the bottom anchor point (the face shape is assumed symmetric).

3) Sum the intensities of the edge pixels on the ellipse that lie below the left and right anchor points and record the sum.

4) Move the left and right anchor points up and down to find the parameters of the ellipse that produces the maximum boundary energy for the face shape in an edge image [see Fig. 2(e)], using (1).

Fig. 2. Detecting the three anchor points. (a) Extract facial features first. (b) Calculate the intensity average for the inside of the face: 1) draw lines from the corner of the left eye and from the nose center and find the intersection point C and 2) find the average intensity of pixels within a rectangular window (size = 20 × 20) centered at the point C. (c) The three detected anchor points. (d) An ellipse-shaped search window and the search direction. (e) An edge image.

After the positions of facial components such as the mouth and eyes are known, as shown in Fig. 2(a), using various methods [15], [16], the proposed face shape extractor is ready to start. We assume that a human face has a homogeneous color distribution, which means that statistics, e.g., means and variances, can be used as criteria to decide whether a region is inside or outside the face (once the statistics for the inside of the face are known). The search procedure for the three anchor points therefore calculates these statistics first: it computes an intensity average for the inside of the face using a window, as shown in Fig. 2(b). The three anchor points can then be found by locating points whose intensity averages differ substantially from the previously calculated average of the inside of the face. In our implementation, the threshold on this difference, relative to the average intensity of the inside of the face, is selected experimentally to locate the anchor points. Because the search procedure for the three anchor points depends heavily on color distributions, it is sensitive to the color distributions of background objects. To overcome this weakness, the threshold is adjusted adaptively in our procedure (please refer to Section V for details). To find an optimal face shape, (1) is used to find the parameters a and b of the ellipse

(a*, b*) = arg max_{a,b} Σ_{(x,y)∈S} I_e(x, y)        (1)

where I_e(x, y) is the intensity of an edge image [Fig. 2(e)] and S denotes the subset of pixels on the ellipse that are located lower than the left and right anchor points.
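As a concrete illustration of steps 1)-4) and the search in (1), the following sketch slides the side anchor points up and down and scores each candidate ellipse by its edge energy. The parameterization (semi-axis a from the side anchors, vertical extent b to the bottom anchor), the 10-pixel search range, and the 360-sample ellipse are illustrative assumptions, not values from the paper.

```python
import numpy as np

def ellipse_boundary_energy(edge, cx, cy, a, b, y_cut):
    # Sum edge intensities over the ellipse pixels that lie below the
    # left/right anchor points -- the subset S in (1).
    t = np.linspace(0.0, 2.0 * np.pi, 360)
    xs = np.clip(np.round(cx + a * np.cos(t)).astype(int), 0, edge.shape[1] - 1)
    ys = np.clip(np.round(cy + b * np.sin(t)).astype(int), 0, edge.shape[0] - 1)
    below = ys > y_cut                      # image y grows downward
    return float(edge[ys[below], xs[below]].sum())

def fit_face_ellipse(edge, p_left, p_right, p_bottom, search=10):
    # Step 4): keep the bottom anchor fixed and slide the side anchors
    # up/down, keeping the ellipse with the highest boundary energy.
    best_energy, best_params = -1.0, None
    for dy in range(-search, search + 1):
        y_side = p_left[1] + dy             # symmetric face assumed
        a = abs(p_right[0] - p_left[0]) / 2.0
        b = abs(p_bottom[1] - y_side)
        cx = (p_left[0] + p_right[0]) / 2.0
        energy = ellipse_boundary_energy(edge, cx, y_side, a, b, y_side)
        if energy > best_energy:
            best_energy, best_params = energy, (cx, y_side, a, b)
    return best_params
```

The constraint of (1) is enforced by construction: every candidate ellipse passes through the three anchor points, and only a and b vary with the side anchors.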

III. PROBABILITY NETWORKS

Probabilistic approaches have been used successfully to locate human faces in a scene and to track the deformations of local features [19], [20]. Cipolla et al. [19] proposed a probabilistic framework that combines different facial features and face groups, achieving a high confidence rate for face detection in a complicated scene. Huang et al. [20] used a probabilistic network for local feature tracking by modeling the locations and velocities of selected feature points. In our automated system, a probabilistic framework is adopted to maximally use facial feature evidence in deciding the correctness of extracted facial features before a 3-D face model is built. Fig. 3 shows the FDPs selected for the proposed probabilistic framework. The network hierarchy used in our approach is shown in Fig. 4; it consists of a facial feature net, a face shape net, and a topology net. The facial feature net has a mouth net and an eye net as its subnets. The details of each subnet are shown in Fig. 5. In the networks, each node represents a random variable and each arrow denotes a conditional dependency between two nodes.

In a study of face anthropometry [18], data are collected by measuring distances and angles among selected key points on a human face, e.g., the corners of the eyes, mouth, and ears, to describe the variability of the human face. Based on that study, we characterize a frontal face by measuring the distances, and the covariance between distances, of key points chosen from the study. All nodes in the proposed probability networks are classified into four groups of distances.



Fig. 3. Extracted feature points from MPEG-4 facial definition points (FDPs).

Fig. 4. Network hierarchy used in our approach.

Each group is expressed in terms of D(P1, P2), the distance between FDPs P1 and P2 as defined in the MPEG-4 standard [21], [22]. In our networks, the distance between two feature points is defined as a random variable for each node. For instance, we model D(3.5, 3.6), the distance between the centers of the left and right eyes, and D(2.1, 9.15), the distance between the two selected points FDP 2.1 and FDP 9.15 shown in Fig. 5(b), as a 2-D Gaussian distribution, estimating the means, standard deviations, and correlation coefficient. Fig. 5(c) shows graphical illustrations of the relationship between two nodes in the proposed probability networks. For example, the distance between FDP 3.5 and FDP 3.6 and the length between FDP 8.4 and FDP 8.3 (the width of the mouth) are modeled as a 2-D Gaussian distribution

Fig. 5. Details of each probability network [the distance 3.5-3.6 means the distance between FDP 3.5 and FDP 3.6 shown in (c)]. (a) Mouth net. (b) Topology net. (c) Graphical illustration of the nodes in (a) and (b). (d) Eye net. (e) Face shape net.

f(d_1, d_2) = 1 / (2π σ_1 σ_2 √(1 − ρ²)) · exp{ −1 / (2(1 − ρ²)) [ (d_1 − μ_1)²/σ_1² − 2ρ (d_1 − μ_1)(d_2 − μ_2)/(σ_1 σ_2) + (d_2 − μ_2)²/σ_2² ] }        (2)

where d_i, μ_i, and σ_i denote the distance between two selected FDPs, its mean, and its standard deviation, respectively, and ρ denotes the correlation coefficient between the two nodes d_1 and d_2. To model the 2-D Gaussian distributions of D(3.5, 3.6) and the distances of the selected paired points, a database from [23] is used in our simulations. The reason we model the probability distributions relative to FDP 3.5 and FDP 3.6 is that, in our implementation, the left and right eye centers are the features that can be detected most reliably and accurately from a video sequence.
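Evaluating the density in (2) for a pair of FDP distances can be sketched as follows. The statistics used here (means, standard deviations, and correlation for the eye-center distance and mouth width, in pixels) are hypothetical illustrations; the paper estimates the real values from the database in [23].

```python
import math

def bivariate_gaussian(d1, d2, mu1, mu2, s1, s2, rho):
    # Density of the 2-D Gaussian in (2) for a pair of FDP distances.
    q = ((d1 - mu1) ** 2 / s1 ** 2
         - 2.0 * rho * (d1 - mu1) * (d2 - mu2) / (s1 * s2)
         + (d2 - mu2) ** 2 / s2 ** 2)
    norm = 2.0 * math.pi * s1 * s2 * math.sqrt(1.0 - rho ** 2)
    return math.exp(-q / (2.0 * (1.0 - rho ** 2))) / norm

# Hypothetical statistics for D(3.5, 3.6) (eye-center distance) and
# D(8.4, 8.3) (mouth width), in pixels.
p = bivariate_gaussian(62.0, 48.0, mu1=60.0, mu2=50.0, s1=5.0, s2=6.0, rho=0.7)
```

A pair of measured distances that is consistent with the learned statistics yields a high density, while an implausible pair (e.g., wide-set eyes with a tiny mouth) yields a density near zero, which is what lets the networks reject bad feature extractions.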

The chain rule and conditional independence relationships are applied to calculate the joint probability of each network. For instance, the probability of the face shape net is defined as the joint probability of all three of its nodes, D(3.5, 3.6), D(2.2, 2.1), and D(10.7, 10.8), as follows:

P(Face Shape Net) = P(D(3.5, 3.6), D(2.2, 2.1), D(10.7, 10.8))        (3)

In the same manner, the probabilities of the other networks can be defined as follows:

(4)

(5)

(6)

(7)

(8)

In our implementation, P(Face Shape Net) is used to verify the face shape extracted by our face shape extractor, and P(Mouth Net) is used to check the extracted mouth features. P(Topology Net) is used to decide whether the facial components, i.e., eyes, nose, and mouth, are located correctly along the vertical axis. P(Facial Features, Face Shape, Topology) in (8) is used as the decision criterion for the correctness of the extracted facial features before a 3-D face model is built.

IV. A LEAST-SQUARE APPROACH TO ADAPT A 3-D FACE MODEL

Our system is devoted to creating a 3-D face model from a video sequence without any user intervention, which means we need an algorithm that is robust and stable enough to build a photo-realistic and natural 3-D face model. A recent approach proposed by Liu et al. [4] shows that linearly combining multiple 3-D models is a promising way to generate a photo-realistic 3-D model. In this approach, a new face model is described as a linear combination of key 3-D face models, e.g., big mouth, small eyes, etc. The strength of this approach is that the multiple face models constrain the shape of the new 3-D face, preventing the algorithm from producing an unrealistic 3-D face model. Our approach is similar to Liu's in the sense that a 3-D model is described as a linear combination of a neutral face and deformation vectors. The main differences are that: 1) we use a least-square approach, rather than an iterative one, to find the coefficient vector for creating a new 3-D face model and 2) we build the 3-D face model from a video sequence with no user input.

A. The 3-D Model

Our 3-D model is a modified version of the 3-D face model developed by Parke and Waters [28]. We have developed a 3-D model editor to build a complete head-and-shoulder model including ears and teeth. Fig. 6(a) shows the modified 3-D model used in our system. It has 1294 polygons, which is sufficient for realistic facial animation. Based on this 3-D model and the 3-D model editor, 16 face models have been designed for the proposed system (more face models can be added to make a better 3-D model), because eight position vectors and eight shape vectors (see Section IV-B) are the minimal requirement to describe a 3-D face, in the sense that the shapes and locations of the mouth, nose, and eyes are the most important features of a human frontal face. These face models are combined linearly based on automatically extracted facial features such as the shape of the face and the locations and sizes of the eyes, nose, and mouth. If we denote the face geometry by a vector F = (v_1, v_2, ..., v_n), where the v_i are the vertices, and by D_j a deformation vector that contains the amount of variation in the size and location of the vertices of the 3-D model, the face geometry can be described as

F = F_0 + Σ_j c_j D_j        (9)

where F_0 is the neutral face vector and c = (c_1, ..., c_M) is a coefficient vector that decides the amount of variation to be applied to the vertices of the neutral face model [4].

B. The 3-D Model Adaptation

Finding the optimal 3-D model that best matches the input video sequence can be treated as the problem of finding a coefficient vector that minimizes the mean-square error between the 3-D feature points projected onto 2-D and the feature points of the input face. We assume that all feature points are equally important, because the locations as well as the shapes of facial components


Fig. 6. Examples of 3-D face models to find deformation vectors: (a) neutral; (b) wide face; (c) big mouth; (d) eyes apart; (e) eyes close; and (f) small mouth.

such as the mouth, eyes, and nose are all critical to modeling a 3-D face from a frontal face. In our system, all coefficients are decided at once by solving the following least-square formulation:

min_c Σ_{i=1}^{N} || f_i − ( p_i^0 + Σ_{j=1}^{M} c_j d_i^j ) ||²        (10)

where N denotes the number of extracted features and M is the number of deformation vectors. f_i is an extracted feature from the input image, which has an (x, y) location, p_i^0 is the corresponding vertex of the neutral 3-D model projected onto 2-D, and d_i^j is the corresponding vertex of the deformation vector D_j projected onto 2-D using the current camera parameters. Fig. 6(a) shows the neutral 3-D face model and Fig. 6(b)-(f) show examples of the 3-D face models used to calculate the deformation vectors in our implementation. For instance, by subtracting a wide 3-D face model, as shown in Fig. 6(b), from the neutral 3-D face model, shown in Fig. 6(a), the deformation vector for a wide face is obtained. For the deformation vectors D_j, eight shape vectors (wide face, thin face, and big (and small) mouth, nose, and eyes) and eight position vectors (minimum (and maximum) horizontal and vertical translation for the eyes and minimum (and maximum) vertical translation for the mouth and nose) are designed in our implementation. To solve the least-square problem, the singular value decomposition (SVD) is used.
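Once the model and deformation vertices are projected to 2-D, (10) is an ordinary linear least-square problem in the coefficient vector c. The sketch below uses synthetic random data in place of the real projections (the dimensions N = 20 and M = 16 match our implementation, but the matrices are illustrative assumptions) and solves for c with NumPy's SVD-based np.linalg.lstsq.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 16                     # feature points and deformation vectors

# Stacked (x, y) coordinates: p0 holds the projected neutral-model
# vertices; column j of A holds the projected deformation vector d^j.
p0 = rng.normal(size=2 * N)
A = rng.normal(size=(2 * N, M))

c_true = rng.normal(size=M)       # synthetic "ground-truth" coefficients
f = p0 + A @ c_true               # stand-in for the extracted 2-D features

# Solve (10): min_c ||f - (p0 + A c)||^2 (lstsq uses the SVD internally).
c, residuals, rank, _ = np.linalg.lstsq(A, f - p0, rcond=None)
adapted = p0 + A @ c              # adapted face geometry, cf. (9)
```

With 2N = 40 equations and M = 16 unknowns the system is overdetermined, and the SVD-based solve returns the unique minimum-error coefficient vector in one shot, which is the point of choosing least squares over an iterative fit.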

V. IMPLEMENTATION AND EXPERIMENTAL RESULTS

A. Automatic Creation of a 3-D Face Model

In this section, the detailed implementation of the proposed real-time talking head system is presented. To create a photo-realistic 3-D model from a video sequence without any user intervention, the proposed algorithms have to be integrated carefully. We assume that the user keeps a neutral face, as defined in [22], looks at the camera, and rotates in the x and y directions. The proposed algorithms catch the best facial orientation, i.e., simply a frontal face, by extracting and verifying facial features. Because the input is a video sequence rather than a single image, two requirements for the real-time system have been established by analyzing video sequences. First, face localization should not be invoked in every frame: once the face is located, its location in the following frames is likely to be the same or very close. Second, facial features obtained in previous frames should be exploited to provide a better result in the current frame.

Fig. 7 shows the detailed block diagram of the proposed real-time system. The system starts by finding the face location in a video sequence using a method based on a normalized RG color space [24] and frame differences. After the face location is detected, a valley detection filter, proposed in [12], is used to find rough positions of the facial components. After the valley detection filter is applied, the rough locations of the facial components, i.e., eyes, nose, and mouth, are found by examining the intensity distributions projected in the vertical and horizontal directions. Then, the exact location of the nose is obtained by recursive thresholding, because the nostrils always have the lowest intensity around the nose: a threshold value is increased recursively until the number of selected pixels corresponds to the nostrils. To find the exact locations of the mouth and eyes, several approaches [13]-[17] can be used; we use a pseudo moving difference method, which is simple and computationally cheap. Based on the extracted feature locations, a search area for face shape extraction can be found (readers are referred to [30] for details). Within this search area, we use the face shape extractor to extract the face shape. After feature extraction is done, the extracted features are fed into the proposed probabilistic networks to verify their correctness and suitability before a corresponding 3-D face model is built; the probabilistic networks act as a quality-control agent in creating a 3-D face model. Based on the output of the probability networks, the threshold is adjusted adaptively to extract the face shape more accurately. If only the face shape is bad, i.e., the extracted features are correct except for the face shape, the algorithm adjusts the threshold and extracts the face shape again without moving to the next frame [see Fig. 11(c) and (d)]. If the extracted face shape is bad again, the algorithm moves to the next frame and restarts from the rough localization step, without re-detecting the face location. If all features are bad, the algorithm moves to the next frame, locates the face, and extracts all features again.
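The recursive-thresholding step used above to pin down the nose can be sketched as follows. The target pixel count and threshold step are illustrative assumptions; the paper does not give the exact values.

```python
import numpy as np

def locate_nostrils(region, target_pixels=20, step=2):
    # Raise the threshold until roughly the number of dark pixels
    # expected for the nostrils is selected, then return their centroid.
    # target_pixels and step are illustrative, not values from the paper.
    for t in range(0, 256, step):
        mask = region < t
        if mask.sum() >= target_pixels:
            ys, xs = np.nonzero(mask)
            return int(xs.mean()), int(ys.mean()), t
    return None
```

Because the nostrils are the darkest pixels in the nose region, the loop stops at the lowest threshold that selects a nostril-sized set of pixels, which keeps the detection insensitive to overall lighting level.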

B. Speech-Driven Talking Head System

After the virtual face is built, an audio-to-visual conversion technique based on constrained optimization is combined with the


Fig. 7. Block diagram of the proposed real-time system to create a talking head from a video sequence.

Fig. 8. Block diagram of the encoder for the proposed Talking Head System, which is based on an audio-to-visual conversion technique for generating FAPs and the proposed automatic scheme of feature extraction and 3-D model adaptation for generating FDPs.

virtual face to make a complete talking head system. Several research results are available for audio-to-visual conversion [25]-[27]; in our system, we selected the constrained optimization technique, which is robust in noisy environments [25]. Our talking head system aims at generating FDPs and FAPs for MPEG-4 talking head applications with no user input. The FDPs are obtained automatically from a video sequence, captured by a camera connected to a PC, based on the proposed automatic scheme of facial feature extraction and 3-D model adaptation. The FAPs are generated by audio-to-visual conversion based on the constrained optimization technique. Fig. 8 shows the block diagram of the encoder for the proposed talking head system. The FDPs and FAPs, created without any user intervention, are coded as an MPEG-4 bit stream and sent to a decoder over the Internet. Because the coded bit stream contains FDPs and FAPs, no animation artifacts are expected at the decoder. For transmitting speech over the Internet, G.723.1, a dual-rate speech coder for multimedia communications, is used. G.723.1, the most widely used standard codec for Internet telephony, is selected because of its low-bit-rate coding capability at 5.3 and 6.3 kb/s (please see [29] for a detailed explanation of G.723.1). In the initialization stage, the 3-D coordinates and texture information of the adapted 3-D model are sent to the decoder via TCP. Then, the coded speech and animation parameters are sent to the decoder via UDP in our implementation. Fig. 9(a) and (b) show screen shots of the encoder and decoder implemented in our

Page 8: 01468149

CHOI AND HWANG: AUTOMATIC CREATION OF A TALKING HEAD FROM A VIDEO SEQUENCE 635

Fig. 9. Encoder/decoder windows for the Talking Head System. (a) Self-viewwindow for a created virtual face and an input video window in the encoder.(b) A talking head window driven by decoded animation parameters and speechin the decoder.

Fig. 10. Extracted face shape for different facial orientation.

talking head system. The performance of the proposed talkinghead system has been evaluated subjectively and the results areshown in Section V-C.
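The transport split described above (model geometry and texture sent once reliably over TCP; per-frame FAPs and coded speech streamed over UDP) can be illustrated with a simple packetization sketch. The field names and byte layout below are hypothetical, chosen for illustration; they are not the MPEG-4 or G.723.1 bitstream syntax.

```python
import json
import struct

def pack_init_message(vertices, texture_bytes):
    """One-time initialization payload (sent reliably, e.g. over TCP):
    3-D vertex coordinates plus texture data, with length-prefixed
    header and body so the receiver can frame the message."""
    header = json.dumps({"n_vertices": len(vertices)}).encode()
    body = b"".join(struct.pack("<3f", *v) for v in vertices) + texture_bytes
    return struct.pack("<II", len(header), len(body)) + header + body

def pack_frame_message(frame_no, faps, speech_payload):
    """Per-frame payload (streamed, e.g. over UDP): a frame number for
    loss detection, the animation parameters, and one coded-speech chunk."""
    fap_bytes = struct.pack("<%dh" % len(faps), *faps)
    return struct.pack("<IH", frame_no, len(fap_bytes)) + fap_bytes + speech_payload

# Example: one vertex and one frame carrying three FAP values
init_msg = pack_init_message([(0.0, 1.0, 2.0)], b"texture-bytes")
frame_msg = pack_frame_message(1, [10, -20, 30], b"g723-frame")
```

Sending the bulky model data over TCP and the small, frequent animation frames over UDP mirrors the design choice in the paper: losing a UDP frame only skips one animation update, while the model itself must arrive intact.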

C. Experimental Results

The proposed automatic system, which creates a 3-D face model from a video sequence without any user intervention, produces facial features, including the face shape, at about 9 fps (frames per second) on a Pentium III 600-MHz PC. Twenty feature points, as shown in Fig. 3, and 16 deformation vectors were used in our implementation for (10). Users are required to provide a frontal view with a rotation angle of less than 5 degrees. Twenty video sequences were recorded, making approximately 2000 frames in total. The proposed face shape extractor was tested on the captured video sequences, which contain different types of faces. Fig. 10 shows some examples of extracted face shapes for different face shapes and orientations. The proposed face shape extractor achieved a detection rate of 64% for 1180 selected frames from the testing video sequences. Most errors come from similar color distributions between the face and the background and from failures to detect facial components such as the eyes and mouth.

Fifty frontal face images from the PICS database of the University of Stirling (http://pics.psych.stir.ac.uk/) were used to build the proposed probabilistic network, and the Expectation-Maximization (EM) algorithm was used to model the 2-D Gaussian distributions. The proposed probabilistic network was tested as a quality control agent in our real-time talking head system. Fig. 11(a) and (b) show examples of facial features rejected by the probabilistic network, preventing the creation of unrealistic faces. The threshold value for face shape extraction was adjusted automatically from 0.5 to 1.0 times the average intensity of the inside of the face to improve accuracy, based on the results of the probabilistic network. If only P(Face Shape Net) is low, the threshold is increased to find a clearer boundary of the face [see Fig. 2(c)]. Fig. 11(c) and (d) show examples of feature extraction improved by adjusting threshold values. According to the simulation results, the proposed probabilistic networks were successfully combined with our automatic system to create a 3-D face model. Fig. 12 shows examples of successfully created 3-D face models. By using the probabilistic network approach, the chance of creating unrealistic faces due to wrong facial features was reduced significantly. The performance of the proposed talking head system was evaluated subjectively: twelve people participated in the assessments, using a 5-point scale. Table I shows the results of the subjective test and gives an idea of how good the proposed talking head system is, even though it is created without any user intervention. Participants were asked "how realistic is the adapted 3-D model" and "how natural is its talking head" to assess the performance of the proposed system. They were also asked to rate audio quality, audio-visual synchronization, and overall performance. Overall, the subjective evaluations show that the proposed automatic scheme produces a 3-D model that is quite realistic and good enough for various Internet applications that do not require high-quality facial animation.

Fig. 11. Probabilistic network as a quality control agent. (a), (b) Examples of rejected facial features. (c), (d) Examples of increased extraction accuracy (contour A is the old face shape and contour B is the newly extracted one obtained by adjusting threshold values).

Fig. 12. Examples of created 3-D models. (a) 3-D models wearing cloth 1. (b) 3-D models wearing cloth 2. (c) Side view of the 3-D models.

TABLE I. SUBJECTIVE EVALUATIONS OF THE PROPOSED TALKING HEAD SYSTEM
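The quality-control decision can be sketched as evaluating each extracted measurement under a trained 2-D Gaussian density and rejecting low-probability features. The means, covariance, and acceptance threshold below are illustrative placeholders; in the paper the densities come from EM fitting on the 50 PICS training images.

```python
import numpy as np

def gaussian2d_pdf(x, mean, cov):
    """Density of a 2-D Gaussian at point x."""
    d = np.asarray(x, float) - mean
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
    return float(norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d))

def accept_feature(measurement, mean, cov, p_min=1e-4):
    """Quality-control gate: accept a 2-D feature measurement (e.g. a
    pair of normalized inter-feature distances) only if its density
    under the trained Gaussian exceeds a minimum value."""
    return gaussian2d_pdf(measurement, mean, cov) >= p_min

# Illustrative parameters only; real values would be learned with EM.
mean = np.array([0.46, 0.36])
cov = np.array([[0.002, 0.0], [0.0, 0.002]])
```

A measurement near the trained mean passes the gate, while an outlier (e.g. a mis-detected mouth corner producing an implausible distance pair) is rejected before any 3-D model is built.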

VI. CONCLUSIONS AND FUTURE WORK

We have presented an implementation of an automatic system to create a talking head from a video sequence without any user intervention. In the proposed system, we have presented: 1) a novel scheme to extract the face shape based on an ellipse model controlled by three anchor points; 2) a probabilistic network to verify whether extracted features are good enough to build a 3-D face model; 3) a least-squares approach to adapt a generic 3-D model to features extracted from the input video; and 4) a talking head system that generates FAPs and FDPs without any user intervention for MPEG-4 facial animation systems. Based on an ellipse model controlled by three anchor points, an accurate and computationally cheap method for face shape extraction was developed. A least-squares approach was used to calculate the coefficient vector required to adapt a generic model to fit an input face. Probability networks were successfully combined with our automatic system to maximally use facial feature evidence in deciding whether extracted facial features are suitable for creating a 3-D face model.
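The least-squares adaptation step summarized above can be sketched as solving a linear system for the deformation coefficients: the generic model is deformed by a linear combination of deformation vectors, and the coefficients are chosen to best fit the extracted feature points. The matrix shapes and function name below are illustrative and do not reproduce (10) exactly.

```python
import numpy as np

def fit_deformation_coefficients(generic_pts, target_pts, deform_basis):
    """Solve min_c ||(generic + B c) - target||^2 for the coefficient
    vector c that adapts the generic model to the extracted features.

    generic_pts:  (N, 2) feature positions on the generic model.
    target_pts:   (N, 2) extracted positions from the input face.
    deform_basis: (M, N, 2) deformation vectors, one per coefficient.
    """
    residual = (target_pts - generic_pts).ravel()          # (2N,)
    B = deform_basis.reshape(deform_basis.shape[0], -1).T  # (2N, M)
    c, *_ = np.linalg.lstsq(B, residual, rcond=None)
    return c

# Tiny synthetic check with N = 3 points and M = 2 deformation modes
rng = np.random.default_rng(0)
generic = rng.normal(size=(3, 2))
basis = rng.normal(size=(2, 3, 2))
true_c = np.array([0.5, -1.0])
target = generic + np.tensordot(true_c, basis, axes=1)
c_hat = fit_deformation_coefficients(generic, target, basis)
```

With 20 feature points and 16 deformation vectors, as in our implementation, the system is overdetermined and the least-squares solution gives the best-fitting coefficient vector in one linear solve.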

Creating a 3-D face model with no user intervention is a very difficult task. In this paper, an automatic scheme to build a 3-D face model from a video sequence was presented. Although we assume that the user has a neutral face and is looking at the input camera, we believe this is a basic requirement for building a 3-D face model in an automatic fashion. The created 3-D model is allowed to rotate less than 10 degrees along the x and y directions because the z coordinates of the vertices on the 3-D model are not calculated from the input features. The proposed speech-driven talking head system, which generates FDPs and FAPs for MPEG-4 talking head applications, is suitable for various Internet applications, including virtual conferencing and virtual storytelling, that do not require much head movement or high-quality facial animation. For future research, more accurate mouth and eye extraction schemes can be considered to improve the quality of the created 3-D model and to handle non-neutral faces and faces with mustaches. The current approach, based on a simple parametric curve, has limitations on the shapes of the mouth and eyes. In addition, to build a complete 3-D face model, extracting hair from the head and modeling its style should be considered in future research.

ACKNOWLEDGMENT

The authors wish to thank the anonymous reviewers for their valuable comments.

REFERENCES

[1] W.-S. Lee, M. Escher, G. Sannier, and N. Magnenat-Thalmann, "MPEG-4 compatible faces from orthogonal photos," in Proc. Int. Conf. Computer Animation, 1999, pp. 186–194.

[2] P. Fua and C. Miccio, "Animated heads from ordinary images: a least-squares approach," Comput. Vis. Image Understand., vol. 75, no. 3, pp. 247–259, 1999.

[3] F. Pighin, R. Szeliski, and D. H. Salesin, "Resynthesizing facial animation through 3-D model-based tracking," in Proc. 7th IEEE Int. Conf. Computer Vision, vol. 1, 1999, pp. 143–150.

[4] Z. Liu, Z. Zhang, C. Jacobs, and M. Cohen, "Rapid modeling of animated faces from video," Tech. Rep. MSR-TR-2000-11.

[5] A. C. A. del Valle and J. Ostermann, "3-D talking head customization by adapting a generic model to one uncalibrated picture," in Proc. IEEE Int. Symp. Circuits and Systems, 2001, pp. 325–328.

[6] C. J. Kuo, R.-S. Huang, and T.-G. Lin, "3-D facial model estimation from single front-view facial image," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 3, pp. 183–192, Mar. 2002.

[7] L. Moccozet and N. Magnenat-Thalmann, "Dirichlet free-form deformations and their application to hand simulation," in Proc. Computer Animation '97, 1997, pp. 93–102.

[8] V. Blanz and T. Vetter, "A morphable model for the synthesis of 3-D faces," in Computer Graphics, Annu. Conf. Series (SIGGRAPH 1999), pp. 187–194.

[9] E. Cosatto and H. P. Graf, "Photo-realistic talking-heads from image samples," IEEE Trans. Multimedia, vol. 2, no. 3, pp. 152–163, Jun. 2000.

[10] I.-C. Lin, C.-S. Hung, T.-J. Yang, and M. Ouhyoung, "A speech driven talking head system based on a single face image," in Proc. 7th Pacific Conf. Computer Graphics and Applications, 1999, pp. 43–49.

[11] http://www.ananova.com/ [Online]

[12] R.-S. Wang and Y. Wang, "Facial feature extraction and tracking in video sequences," in Proc. IEEE Int. Workshop on Multimedia Signal Processing, 1997, pp. 233–238.

[13] D. Reisfeld and Y. Yeshurun, "Robust detection of facial features by generalized symmetry," in Proc. 11th IAPR Int. Conf. Pattern Recognition, 1992, pp. 117–120.


[14] M. Zobel, A. Gebhard, D. Paulus, J. Denzler, and H. Niemann, "Robust facial feature localization by coupled features," in Proc. 4th IEEE Int. Conf. Automatic Face and Gesture Recognition, 2000, pp. 2–7.

[15] Y. Tian, T. Kanade, and J. Cohn, "Robust lip tracking by combining shape, color and motion," in Proc. 4th Asian Conf. Computer Vision, 2000.

[16] J. Luettin, N. A. Thacker, and S. W. Beet, "Active shape models for visual speech feature extraction," University of Sheffield, Sheffield, U.K., Electronic Systems Group Rep. 95/44, 1995.

[17] C. Kim and J.-N. Hwang, "An integrated scheme for object-based video abstraction," in Proc. ACM Int. Multimedia Conf., 2000.

[18] L. G. Farkas, Anthropometry of the Head and Face. New York: Raven, 1994.

[19] K. C. Yow and R. Cipolla, "A probabilistic framework for perceptual grouping of features for human face detection," in Proc. IEEE Int. Conf. Automatic Face and Gesture Recognition '96, 1996, pp. 16–21.

[20] H. Tao, R. Lopez, and T. Huang, "Tracking facial features using probabilistic network," in Proc. Automatic Face and Gesture Recognition, 1998, pp. 166–170.

[21] ISO/IEC FDIS 14496-1 Systems, ISO/IEC JTC1/SC29/WG11 N2501, Nov. 1998.

[22] ISO/IEC FDIS 14496-2 Visual, ISO/IEC JTC1/SC29/WG11 N2502, Nov. 1998.

[23] Psychological Image Collection at Stirling (PICS). [Online]. Available: http://pics.psych.stir.ac.uk/

[24] J. Luettin, N. A. Thacker, and S. W. Beet, "Active shape models for visual speech feature extraction," University of Sheffield, Sheffield, U.K., Electronic Systems Group Rep. 95/44, 1995.

[25] K. H. Choi and J.-N. Hwang, "Creating 3-D speech-driven talking heads: a probabilistic approach," in Proc. IEEE Int. Conf. Image Processing, 2002, pp. 984–987.

[26] F. Lavagetto, "Converting speech into lip movement: A multimedia telephone for hard of hearing people," IEEE Trans. Rehabil. Eng., vol. 3, no. 1, pp. 90–102, Jan. 1995.

[27] R. R. Rao, T. Chen, and R. M. Mersereau, "Audio-to-visual conversion for multimedia communication," IEEE Trans. Ind. Electron., vol. 45, no. 1, pp. 15–22, Feb. 1998.

[28] F. I. Parke and K. Waters, Computer Facial Animation. Wellesley, MA: A. K. Peters, 1996.

[29] Dual Rate Speech Coder for Multimedia Communications Transmitting at 5.3 and 6.3 kbit/s, ITU-T Recommendation G.723.1, Mar. 1996.

[30] K. H. Choi and J.-N. Hwang, "A real-time system for automatic creation of 3-D face models from a video sequence," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 2002, pp. 2121–2124.

Kyoung-Ho Choi (M'03) received the B.S. and M.S. degrees in electrical and electronics engineering from Inha University, Korea, in 1989 and 1991, respectively, and the Ph.D. degree in electrical engineering from the University of Washington, Seattle, in 2002.

In January 1991, he joined the Electronics and Telecommunications Research Institute (ETRI), where he was a Leader of the Telematics Content Research Team. He was also a Visiting Scholar at Cornell University, Ithaca, NY, in 1995. In March 2005, he joined the Department of Information and Electronic Engineering, Mokpo National University, Chonnam, Korea. His research interests include telematics, multimedia signal processing and systems, mobile computing, MPEG-4/7/21, multimedia-GIS, and audio-to-visual conversion and audiovisual interaction.

Dr. Choi was selected as an Outstanding Researcher at ETRI in 1992.

Jenq-Neng Hwang (F'03) received the B.S. and M.S. degrees, both in electrical engineering, from the National Taiwan University, Taipei, Taiwan, R.O.C., in 1981 and 1983, respectively, and the Ph.D. degree from the University of Southern California in December 1988.

He spent 1983–1985 in obligatory military service. He was then a Research Assistant in the Signal and Image Processing Institute, Department of Electrical Engineering, University of Southern California. He was also a visiting student at Princeton University, Princeton, NJ, from 1987 to 1989. In the summer of 1989, he joined the Department of Electrical Engineering, University of Washington, Seattle, where he is currently a Professor. He has published more than 180 journal and conference papers and book chapters in the areas of image/video signal processing, computational neural networks, multimedia system integration, and networking. He is the co-author of the Handbook of Neural Networks for Signal Processing (Boca Raton, FL: CRC Press, 2001).

Dr. Hwang served as the Secretary of the Neural Systems and Applications Committee of the IEEE Circuits and Systems Society from 1989 to 1991, and was a member of the Design and Implementation of the SP Systems Technical Committee of the IEEE SP Society. He is also a Founding Member of the Multimedia SP Technical Committee of the IEEE SP Society. He served as the Chairman of the Neural Networks SP Technical Committee of the IEEE SP Society from 1996 to 1998, and as the Society's representative to the IEEE Neural Network Council from 1997 to 2000. He served as Associate Editor for the IEEE TRANSACTIONS ON SIGNAL PROCESSING and the IEEE TRANSACTIONS ON NEURAL NETWORKS, and is currently an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY. He is also on the editorial board of the Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology. He was a Guest Editor for the IEEE TRANSACTIONS ON MULTIMEDIA Special Issue on Multimedia over IP in March/June 2001, the Conference Program Chair for the 1994 IEEE Workshop on Neural Networks for Signal Processing held in Ermioni, Greece, in September 1994, the General Co-Chair of the International Symposium on Artificial Neural Networks held in Hsinchu, Taiwan, R.O.C., in December 1995, the Chair of the Tutorial Committee for the IEEE International Conference on Neural Networks (ICNN'96) held in Washington, DC, in June 1996, and the Program Co-Chair of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) held in Seattle, WA, in 1998. He received the 1995 IEEE Signal Processing (SP) Society's Annual Best Paper Award (with S.-R. Lay and A. Lippman) in the area of Neural Networks for Signal Processing.