


Dynamic Cascades with Bidirectional Bootstrapping for Action Unit Detection in Spontaneous Facial Behavior

Yunfeng Zhu, Fernando De la Torre, Jeffrey F. Cohn, Associate Member, IEEE, and Yu-Jin Zhang, Senior Member, IEEE

Abstract—Automatic facial action unit detection from video is a long-standing problem in facial expression analysis. Research has focused on registration, choice of features, and classifiers. A relatively neglected problem is the choice of training images. Nearly all previous work uses one or the other of two standard approaches. One approach assigns peak frames to the positive class and frames associated with other actions to the negative class. This approach maximizes differences between positive and negative classes, but results in a large imbalance between them, especially for infrequent AUs. The other approach reduces imbalance in class membership by including all target frames from onsets to offsets in the positive class. However, because frames near onsets and offsets often differ little from those that precede them, this approach can dramatically increase false positives. We propose a novel alternative, dynamic cascades with bidirectional bootstrapping (DCBB), to select training samples. Using an iterative approach, DCBB optimally selects positive and negative samples in the training data. Using Cascade AdaBoost as the basic classifier, DCBB exploits the advantages of feature selection, efficiency, and robustness of Cascade AdaBoost. To provide a real-world test, we used the M3 (also known as RU-FACS) database of non-posed behavior recorded during interviews. For all tested action units, DCBB improved AU detection relative to alternative approaches.

Index Terms—Facial expression analysis, action unit detection, FACS, dynamic cascade boosting, bidirectional bootstrapping.


1 INTRODUCTION

The face is one of the most powerful channels of nonverbal communication. Facial expression provides cues about emotional response, regulates interpersonal behavior, and communicates aspects of psychopathology. To make use of the information afforded by facial expression, Ekman and Friesen [1] proposed the Facial Action Coding System (FACS). FACS segments the visible effects of facial muscle activation into "action units" (AUs). Each action unit is related to one or more facial muscles. These anatomic units may be combined to represent more molar facial expressions. Emotion-specified joy, for instance, is represented by the combination of AU6 (cheek raiser, which results from contraction of the orbicularis oculi muscle) and AU12 (lip-corner puller, which results from contraction of the zygomatic major muscle). The FACS taxonomy was developed by manually observing live and recorded facial behavior and, to a lesser extent, by recording the electrical activity of underlying facial muscles [2].

This work was supported in part by U.S. Naval Research Laboratory under Contract No. N00173-07-C-2040 and National Institutes of Health Grant R01 MH 051435. The first author was also partially supported by a scholarship from the China Scholarship Council. Yunfeng Zhu is in the Department of Electronic Engineering at Tsinghua University, Beijing 100084, China. E-mail: [email protected]. Fernando De la Torre is in the Robotics Institute at Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA. E-mail: [email protected]. Jeffrey F. Cohn is in the Department of Psychology at the University of Pittsburgh, Pittsburgh, Pennsylvania 15260, USA. E-mail: [email protected]. Yu-Jin Zhang is in the Department of Electronic Engineering at Tsinghua University, Beijing 100084, China. E-mail: [email protected].


Fig. 1. Example of strong, subtle, and ambiguous samples of FACS action unit 12. Strong samples typically correspond to the peak or maximum intensity, and ambiguous frames to the AU onset and offset and frames proximal to them. Subtle samples are located between strong and ambiguous ones. Using an iterative algorithm, DCBB selects positive samples such that detection accuracy is optimized during training.

Because of its descriptive power, FACS has become the state of the art in manual measurement of facial expression [3] and is widely used in studies of spontaneous facial behavior [4]. For these and related reasons, much effort in automatic facial image analysis seeks to automatically recognize FACS action units [5], [6], [7], [8], [9].

Automatic detection of AUs from video is a challenging problem for several reasons. Non-frontal pose and moderate to large head motion make facial image registration difficult, large variability occurs in the temporal scale of facial gestures, individual differences occur in the shape and appearance of facial features, many facial actions are inherently subtle, and the number of possible combinations of 40+ individual action units is in the thousands. More than 7000 action unit combinations have been observed [4]. Previous efforts have emphasized types of features and classifiers. Features have included shape and various appearance features, such as grayscale pixels, edges, and appearance descriptors (e.g., canonical appearance, Gabor, and SIFT descriptors). Classifiers have included support vector machines (SVM) [10], boosting [11], hidden Markov models (HMM) [12], and dynamic Bayesian networks (DBN) [13]. Little attention has been paid to the assignment of video frames to positive and negative classes. Typically, assignment has been done in one of two ways. One is to assign to the positive class those frames that occur at the peak of each AU or proximal to it. Peaks refer to the maximum intensity of an action unit between the frame at which it begins ("onset") and ends ("offset"). The negative class then is chosen by randomly sampling other AUs, including AU 0 or neutral. This approach suffers at least two drawbacks: (1) the number of training examples will often be small, which results in a large imbalance between positive and negative frames; and (2) peak frames may provide too little variability to achieve good generalization. These problems may be circumvented by following an alternative approach, that is, to include all frames from onset to offset in the positive class. This approach improves the ratio of positive to negative frames and increases the representativeness of positive examples. The downside is confusability of positive and negative classes. Onset and offset frames, and many of those proximal or even further from them, may be indistinguishable from the negative class. As a consequence, the number of false positives may dramatically increase.

To address these issues, we propose an extension of AdaBoost [14], [15], [16] called Dynamic Cascades with Bidirectional Bootstrapping (DCBB). Fig. 1 illustrates the main idea. Having manually annotated FACS data with onset, peak and offset, the question we address is how best to select the AU frames for the positive and negative classes.

In contrast to previous approaches to class assignment, DCBB automatically distinguishes between strong, subtle and ambiguous AU frames for AU events of different intensity. Strong frames correspond to the peaks and the frames proximal to them; ambiguous AU frames are proximal to onsets and offsets; subtle AU frames occur between strong and ambiguous. Strong and subtle frames are assigned to the positive class. By distinguishing between these three types, DCBB maximizes the number of positive frames while reducing confusability between positive and negative classes.

For high intensity AUs, in comparison with low intensity AUs, the algorithm will select more frames for the positive class. Some of these frames may be similar in intensity to low intensity AUs. Similarly, if multiple peaks occur between an onset and offset, DCBB assigns multiple segments to the positive class. See Fig. 1 for an example. Strong and subtle, but not ambiguous, AU frames are assigned to the positive class. For the negative class, DCBB uses a mechanism similar to that of Cascade AdaBoost: the weights of misclassified negative samples are increased during the training of each weak classifier, and each cascade stage is kept from learning too much. Moreover, the positive class changes at each iteration, and the corresponding negative class is reselected accordingly.

In experiments, we evaluated the validity of our approach to class assignment and selection of features. In the first experiment, we illustrate the importance of selecting the right positive samples for action unit detection. In the second, we compare DCBB with standard approaches based on SVM and AdaBoost.

The rest of the paper is organized as follows. Section II reviews previous work on automatic methods for action unit detection. Section III describes pre-processing steps for alignment and feature extraction. Section IV gives details of our proposed DCBB method. Section V provides experimental results in non-posed, naturalistic video. For experimental evaluation, we used FACS-coded interviews from the M3 (also known as RU-FACS) database [17], [18]. For all action units tested, DCBB outperformed alternative approaches.

2 PREVIOUS WORK

This section describes previous work on FACS and on automatic detection of AUs from video.

2.1 FACS

The Facial Action Coding System (FACS) [1] is a comprehensive, anatomically-based system for measuring nearly all visually discernible facial movement. FACS describes facial activity on the basis of 44 unique action units (AUs), as well as several categories of head and eye positions and movements. Facial movement is thus described in terms of constituent components, or AUs. Any facial expression may be represented as a single AU or a combination of AUs. For example, the felt, or Duchenne, smile is indicated by movement of the zygomatic major (AU12) and orbicularis oculi, pars lateralis (AU6). FACS is recognized as the most comprehensive and objective means for measuring facial movement currently available, and it has become the standard for facial measurement in behavioral research in psychology and related fields. FACS coding procedures allow for coding of the intensity of each facial action on a 5-point intensity scale (which provides a metric for the degree of muscular contraction) and for measurement of the timing of facial actions. FACS scoring produces a list of AU-based descriptions of each facial event in a video record. Fig. 2 shows an example for AU12.

2.2 Automatic FACS detection from video

Two main streams in the current research on automatic analysis of facial expressions consider emotion-specified expressions (e.g., happy or sad) and anatomically based facial actions (e.g., FACS).


Fig. 2. FACS coding typically involves frame-by-frame inspection of the video, paying close attention to transient cues such as wrinkles, bulges, and furrows to determine which facial action units have occurred and their intensity. Full labeling requires marking onset, peak and offset and may include annotating changes in intensity as well. Left to right, evolution of an AU 12 (involved in smiling), from onset, peak, to offset.

The pioneering work of Black and Yacoob [19] recognizes facial expressions by fitting local parametric motion models to regions of the face and then feeding the resulting parameters to a nearest neighbor classifier for expression recognition. De la Torre et al. [20] used condensation and appearance models to simultaneously track and recognize facial expression. Chang et al. [21] learned a low dimensional Lipschitz embedding to build a manifold of shape variation across several people and then used I-condensation to simultaneously track and recognize expressions. Lee and Elgammal [22] employed multi-linear models to construct a non-linear manifold that factorizes identity from expression.

Several promising prototype systems were reported that can recognize deliberately produced AUs in either near-frontal view face images (Bartlett et al. [23]; Tian et al. [8]; Pantic & Rothkrantz [24]) or profile view face images (Pantic & Patras [25]). Although high scores have been achieved on posed facial action behavior [13], [26], [27], accuracy tends to be lower in the few studies that have tested classifiers on non-posed facial behavior [28], [11], [29]. In non-posed facial behavior, non-frontal views and rigid head motion are common, and action units are often less intense, have different timing, and occur in complex combinations [30]. These factors have been found to reduce AU detection accuracy [31]. Non-posed facial behavior is more representative of facial actions that occur in real life, which is our focus in the current paper.

Most work in automatic analysis of facial expressions differs in the choice of facial features, representations, and classifiers. Bartlett et al. [11], [23], [18] used SVM and AdaBoost on texture-based image representations to recognize 20 action units in near-frontal posed and non-posed facial behavior. Valstar and Pantic [32], [25], [33] proposed a system that enables fully automated robust facial expression recognition and temporal segmentation of onset, peak and offset from video of mostly frontal faces. They used particle filtering to track facial features, Gabor-based representations, and a combination of SVM and HMM to temporally segment and recognize action units.


Fig. 3. AAM tracking across several frames

Lucey et al. [34], [29] compared the use of different shape and appearance representations and different registration mechanisms for AU detection.

Tong et al. [13] used Dynamic Bayesian Networks with appearance features to detect facial action units in posed facial behavior. The correlation among action units served as priors in action unit detection. Comprehensive reviews of automatic facial coding may be found in [5], [6], [7], [8], [35].

To the best of our knowledge, no previous work has considered strategies for selecting training samples or evaluated their importance in AU detection. This is the first paper to propose an approach to optimize the selection of positive and negative training samples. We show that our approach consistently improves action unit detection relative to previous approaches.

3 FACIAL FEATURE TRACKING AND IMAGE FEATURES

This section describes the system for facial feature tracking using active appearance models (AAMs), and the extraction and representation of shape and appearance features for input to the classifiers.

3.1 Facial tracking and alignment

Over the last decade, appearance models have become increasingly important in computer vision and graphics. Parameterized Appearance Models (PAMs) (e.g., Active Appearance Models [36], [37], [38] and Morphable Models [39]) have proven useful for detection, facial feature alignment, and face synthesis. In particular, Active Appearance Models (AAMs) have proven an excellent tool for aligning facial features with respect to a shape and appearance model. In our case, the AAM is composed of 66 landmarks that deform to fit perturbations in facial features. Person-specific models were trained on approximately 5% of the video [38]. Fig. 3 shows an example of AAM tracking of facial features in a single subject from the M3 (also known as RU-FACS) [17], [18] video data-set.

After tracking facial features using AAM, a similarity transform registers the facial features with respect to an average face (see middle column in Fig. 4). The face is normalized to 212 × 212 pixels in all our experiments.


Fig. 4. Two-step alignment: a similarity transformation followed by backward mapping with a piece-wise affine warp.

To extract appearance representations in areas that have not been explicitly tracked (e.g., the nasolabial furrow), we use a backward piece-wise affine warp with Delaunay triangulation to set up the correspondence. Fig. 4 shows the two-step process for registering the face to a canonical pose for AU detection. Purple squares represent tracked points and blue dots represent non-tracked meaningful points. The dashed blue line shows the mapping between a point in the mean shape and the corresponding point on the original image. This two-step registration proved particularly important for detecting low intensity action units.
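To make the two-step registration concrete, the following is a minimal sketch in Python (using OpenCV, NumPy, and SciPy; the original implementation is in MATLAB). The function names, the mean-shape input, and the nearest-pixel lookup are illustrative assumptions rather than the authors' code.

```python
import numpy as np
import cv2
from scipy.spatial import Delaunay

def similarity_align(landmarks, mean_shape, image, size=212):
    """Step 1: similarity transform registering tracked landmarks to the mean shape."""
    M, _ = cv2.estimateAffinePartial2D(landmarks.astype(np.float32),
                                       mean_shape.astype(np.float32))
    aligned = cv2.warpAffine(image, M, (size, size))
    aligned_pts = landmarks @ M[:, :2].T + M[:, 2]
    return aligned, aligned_pts

def piecewise_affine_backward_warp(src_img, src_pts, dst_pts, size=212):
    """Step 2: backward piece-wise affine warp with Delaunay triangulation,
    mapping every pixel of the canonical (mean-shape) face back to the source image."""
    tri = Delaunay(dst_pts)                         # triangulate the canonical shape
    out = np.zeros((size, size) + src_img.shape[2:], dtype=src_img.dtype)
    ys, xs = np.mgrid[0:size, 0:size]
    pix = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    simplex = tri.find_simplex(pix)                 # triangle containing each canonical pixel
    valid = simplex >= 0
    # barycentric coordinates of each canonical pixel inside its triangle
    T = tri.transform[simplex[valid]]
    bary2 = np.einsum('ijk,ik->ij', T[:, :2, :], pix[valid] - T[:, 2, :])
    bary = np.hstack([bary2, 1 - bary2.sum(axis=1, keepdims=True)])
    # backward map: same barycentric coordinates in the corresponding source triangle
    src_tri = src_pts[tri.simplices[simplex[valid]]]          # (n, 3, 2)
    src_xy = np.einsum('ij,ijk->ik', bary, src_tri)
    x = np.clip(src_xy[:, 0], 0, src_img.shape[1] - 1).astype(int)
    y = np.clip(src_xy[:, 1], 0, src_img.shape[0] - 1).astype(int)
    out.reshape(size * size, -1)[valid] = src_img[y, x].reshape(len(x), -1)
    return out
```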

3.2 Appearance features

Appearance features for AU detection [11], [40] have outperformed shape-only features for some action units; see Lucey et al. [34], [41], [42] for a comparison. In this section, we explore the use of SIFT descriptors [43] as appearance features.

Given feature points tracked with AAMs, SIFT descriptors are first computed around the points of interest. SIFT descriptors are computed from the gradient vector of each pixel in the neighborhood to build a normalized histogram of gradient directions. For each pixel within a subregion, SIFT adds the pixel's gradient vector to a histogram of gradient directions by quantizing each orientation to one of 8 directions and weighting the contribution of each vector by its magnitude.
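A minimal sketch of computing SIFT descriptors at the AAM-tracked points with OpenCV follows; the 48-pixel keypoint support mirrors the 48 × 48 squares used later in the experiments, and the function name and image variables are illustrative assumptions, not the authors' implementation.

```python
import cv2
import numpy as np

def sift_at_landmarks(gray, points, patch_size=48):
    """Compute one 128-D SIFT descriptor per tracked landmark.
    `points` is an (N, 2) array of AAM landmark coordinates on the registered face."""
    sift = cv2.SIFT_create()
    keypoints = [cv2.KeyPoint(float(x), float(y), patch_size) for x, y in points]
    keypoints, desc = sift.compute(gray, keypoints)
    return desc  # shape (N, 128); rows can be concatenated into one frame feature vector

# Hypothetical usage: 20 lower-face points on the 212x212 registered face
# face = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
# feats = sift_at_landmarks(face, lower_face_points).ravel()
```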

4 DYNAMIC CASCADES WITH BIDIRECTIONAL BOOTSTRAPPING (DCBB)

This section explores the use of dynamic boosting techniques to select the positive and negative samples that improve detection performance in AU detection.

Bootstrapping [44] is a resampling method that is compatible with many learning algorithms. During the bootstrapping process, the active set of negative examples is extended with examples that were misclassified by the current classifier.


Fig. 5. Bidirectional Bootstrapping.

In this section, we propose Bidirectional Bootstrapping, a method to bootstrap both positive and negative samples.

Bidirectional Bootstrapping begins by selecting as positive samples only the peak frames and uses a Classification and Regression Tree (CART) [45] as the weak classifier (initial learning in Section 4.1). After this initial training step, Bidirectional Bootstrapping extends the positive samples from the peak frames to proximal frames and redefines new provisional positive and negative training sets (dynamic learning in Section 4.2). The positive set is extended by including samples that are classified correctly by the previous strong classifier (Cascade AdaBoost in our algorithm); the negative set is extended by examples misclassified by the same strong classifier, thus emphasizing negative samples close to the decision boundary. With the bootstrapping of positive samples, the generalization ability of the classifier is gradually enhanced. The active positive and negative sets are then used as input to the CART, which returns a hypothesis; the weights are updated in the manner of Gentle AdaBoost [46], and training continues until the variation between the previous and current Cascade AdaBoost becomes smaller than a defined threshold. Fig. 5 illustrates the process. In Fig. 5, P is the potential positive data set, Q is the negative data set (negative pool), P0 is the positive set in the initial learning step, and Pw is the active positive set in each iteration; the size of each solid circle illustrates the intensity of the AU samples, and the ellipses on the right illustrate the spreading of the dynamic positive set. See Algorithms 1 and 2 for details.
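The following sketch illustrates one bidirectional bootstrapping pass under the simplifying assumption of a classifier exposing a scikit-learn-style decision_function; it is not the authors' implementation, which uses the cascade detailed in Algorithms 1 and 2, and the random top-up of negatives is only one plausible way to fill the working set.

```python
import numpy as np

def bootstrap_step(classifier, potential_pos, negative_pool, n_neg):
    """One bidirectional bootstrapping pass (sketch):
    - the positive working set spreads to potential positives the current
      strong classifier already accepts;
    - the negative working set is refilled with pool samples the classifier
      misclassifies (false positives), i.e. samples near the decision boundary."""
    pos_scores = classifier.decision_function(potential_pos)
    active_pos = potential_pos[pos_scores > 0]            # correctly accepted AU frames

    neg_scores = classifier.decision_function(negative_pool)
    hard_neg = negative_pool[neg_scores > 0]              # false positives: hard negatives
    if len(hard_neg) < n_neg:                             # top up with random pool samples
        extra_idx = np.random.choice(len(negative_pool), n_neg - len(hard_neg), replace=False)
        hard_neg = np.vstack([hard_neg, negative_pool[extra_idx]])
    return active_pos, hard_neg[:n_neg]
```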

4.1 Initial training step

This section explains the initial training step of DCBB. In the initial training step we select the peaks and the two neighboring samples as positive samples, and a random sample of other AUs and non-AUs as negative samples. As in standard AdaBoost [14], we define the target false positive ratio (Fr), the maximum acceptable false positive ratio per cascade stage (fr), and the minimum acceptable true positive ratio per cascade stage (dr). The cascade classifier uses CART as its weak learner. The initial training step applies standard AdaBoost using CART as a weak classifier, as summarized in Algorithm 1.


Input:
• Positive data set P0 (contains AU peak frames p and p ± 1);
• Negative data set Q (contains other AUs and non-AUs);
• Target false positive ratio Fr;
• Maximum acceptable false positive ratio per cascade stage fr;
• Minimum acceptable true positive ratio per cascade stage dr.

Initialize:
• Current cascade stage number t = 0;
• Current overall cascade classifier's true positive ratio Dt = 1.0;
• Current overall cascade classifier's false positive ratio Ft = 1.0;
• S0 = {P0, Q0} is the initial working set, Q0 ⊂ Q. The number of positive samples is Np. The number of negative samples is Nq = Np × R0, R0 = 8.

While Ft > Fr
1) t = t + 1; ft = 1.0; normalize the weights ωt,i for each sample xi to guarantee that ωt = {ωt,i} is a distribution.
2) While ft > fr
   a) For each feature ϕm, train a weak classifier on S0 and find the best feature ϕi (the one with the minimum classification error).
   b) Add the feature ϕi into the strong classifier Ht; update the weights in the Gentle AdaBoost manner.
   c) Evaluate S0 with the current strong classifier Ht and adjust the rejection threshold under the constraint that the true positive ratio does not drop below dr.
   d) Decrease the threshold until dr is reached.
   e) Compute ft under this threshold.
   END While
3) Ft+1 = Ft × ft; Dt+1 = Dt × dr; keep in Q0 the negative samples that the current strong classifier Ht misclassified (current false positive samples); record its size as Kfq.
4) Repeat using detector Ht to bootstrap false positive samples from the negative set Q randomly until the negative working set has Nq samples.
END While

Output:
A t-level cascade where each level has a strong boosted classifier with a set of rejection thresholds for each weak classifier. The final training accuracy figures are Ft and Dt.

Algorithm 1: Initial learning
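As a reference for the weight update in step 2(b), the following is a compact sketch of Gentle AdaBoost with CART regression trees as weak learners (using scikit-learn). It omits the cascade structure, the rejection thresholds, and the per-feature selection of Algorithm 1; it is only meant to illustrate the "Gentle AdaBoost manner" of updating sample weights, with all parameter values being illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gentle_adaboost(X, y, n_rounds=50, max_depth=2):
    """Gentle AdaBoost with shallow CART regression trees as weak learners.
    y must be in {-1, +1}. The strong classifier is sign(sum_m f_m(x))."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # sample weights kept as a distribution
    trees = []
    for _ in range(n_rounds):
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, y, sample_weight=w)     # weighted least-squares fit f_m(x) ~ E_w[y | x]
        f = tree.predict(X)
        w *= np.exp(-y * f)                 # Gentle AdaBoost weight update
        w /= w.sum()
        trees.append(tree)
    return trees

def strong_score(trees, X):
    """Real-valued output of the strong classifier before thresholding."""
    return np.sum([t.predict(X) for t in trees], axis=0)
```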

4.2 Dynamic learning

Once a cascade of peak frame detectors is learned in the initial learning stage (Section 4.1), we are able to enlarge the positive set to increase the discriminative performance of the whole classifier. The AU frame detector will become stronger as new AU positive samples are added during the training steps, and the distribution of positive and negative samples will become more representative of the whole training data. A constraint scheme is applied in dynamic learning to avoid adding ambiguous AU frames to the dynamic positive set. The algorithm is summarized in Algorithm 2.

Input:
• Cascade detector H0 from the initial learning step;
• Dynamic working set SD = {PD, QD};
• All the frames in this action unit as potential positive samples P = {Ps, Pv}. Ps contains the strong positive samples, P0 contains the peak-related samples described above, P0 ∈ Ps. Pv contains obscure positive samples;
• A large negative data set Q, which contains all the other AUs and non-AUs. Its size is Na.

Update the positive working set by spreading in P and the negative working set by bootstrapping in Q during dynamic cascade learning.

Initialize: We set the value of Np as the size of P0. The size of the old positive data set is Np_old = 0. The current diffusing stage is t = 1.

While (Np − Np_old)/Np > 0.1
1) AU positive spreading: Np_old = Np. Use the current detector on the potential positive data set P to pick up more positive samples; Psp are all the positive samples accepted by the cascade classifier Ht−1.
2) Constrain the spreading: k is the index of the current AU event and i is the index of the current frame in this event. Calculate the similarity values (Eq. 1) between the peak frame in event k and all peak frames with the lowest intensity value 'A'; the average similarity value is Sk. Calculate the similarity value between frame i and the peak frame in event k; its value is Ski. If Ski < 0.5 × Sk, frame i is excluded from Psp.
3) After the above step, the remaining positive working set is Pw = Psp, with Np = size of Psp. Use the Ht−1 detector to bootstrap false positive samples from the negative set Q until the negative working set Qw has Nq = β × R0 × Na samples; Na differs across AUs.
4) Train the cascade classifier Ht with the dynamic working set {Pw, Qw}. As Rt varies, the maximum acceptable false positive ratio per cascade stage fmr also becomes smaller (Eq. 2).
5) t = t + 1; empty Pw and Qw.
END While

Algorithm 2: Dynamic learning

In Eq. (1), n is the total number of AU sections with intensity 'A', and m is the length of the AU features. The similarity used in Eq. (1) is a Radial Basis Function between the appearance representations of two frames.

$$
\begin{aligned}
S_k &= \frac{1}{n}\sum_{j=1}^{n}\mathrm{Sim}(\mathbf{f}_k,\mathbf{f}_j), \quad j \in [1:n] \\
S_{ik} &= \mathrm{Sim}(\mathbf{f}_i,\mathbf{f}_k) = e^{-\left(\mathrm{Dist}(i,k)/\max(\mathrm{Dist}(:,k))\right)^{2}} \\
\mathrm{Dist}(i,k) &= \Big[\sum_{j=1}^{m}(f_{kj}-f_{ij})^{2}\Big]^{1/2}, \quad j \in [1:m]
\end{aligned}
\tag{1}
$$

The dynamic positive working set becomes larger but the negative sample pool is finite, so fmr needs to change dynamically, and Nq is determined by Na because different AUs have negative sample pools of different sizes. Some AUs (e.g., AU12) are likely to occur more often than others. Instead of tuning these thresholds one by one, we assume that the false positive ratio fmr changes exponentially with each stage t, which means

$$
f_{mr} = f_r \times \left(1 - e^{-\alpha R_t}\right)
\tag{2}
$$

In our experiments, we set α to 0.2 and β to 0.04, respectively, because those values are suitable for all the AUs and avoid running out of useful negative samples in the M3 database. After the spreading stage, the ratio between positive and negative samples becomes balanced, except for some rare AUs (e.g., AU4, AU10), which remain unbalanced because of the scarcity of positive frames in the database.
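A small sketch of Eq. (1) and Eq. (2) in Python follows; the feature vectors, the precomputed average similarity S_k, and the helper names are assumptions for illustration rather than the authors' code.

```python
import numpy as np

def rbf_similarity(feats, peak_idx):
    """Eq. (1): RBF similarity of every frame in an AU event to the event's peak,
    with the distance normalized by the largest distance to that peak."""
    f_k = feats[peak_idx]
    dist = np.linalg.norm(feats - f_k, axis=1)                 # Dist(i, k)
    return np.exp(-(dist / np.maximum(dist.max(), 1e-12)) ** 2)  # S_ik

def spreading_constraint(feats, peak_idx, s_k, ratio=0.5):
    """Step 2 of dynamic learning: keep frame i only if S_ik >= ratio * S_k,
    where s_k is the precomputed average similarity between this event's peak
    and all peaks of intensity 'A'."""
    s_ik = rbf_similarity(feats, peak_idx)
    return np.where(s_ik >= ratio * s_k)[0]                    # indices kept in Psp

def stage_false_positive_ratio(f_r, alpha, R_t):
    """Eq. (2): the per-stage false positive target f_mr shrinks as R_t grows."""
    return f_r * (1.0 - np.exp(-alpha * R_t))
```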

5 EXPERIMENTS

DCBB iteratively samples training images and then uses Cascade AdaBoost for classification. We evaluated both, as well as the use of additional appearance features. To evaluate the efficacy of iteratively sampling training images, Experiment 1 compared DCBB with the two standard approaches: selecting only peaks, and alternatively selecting all frames between onsets and offsets. Experiment 1 thus evaluated whether iteratively sampling training images increased AU detection. Experiment 2 evaluated the efficacy of Cascade AdaBoost relative to SVM when iteratively sampling training images. In our implementation, DCBB uses Cascade AdaBoost, but other classifiers might be used instead. Experiment 2 informs whether better results might be achieved using SVM. Experiment 3 explored the use of three appearance descriptors (Gabor, SIFT and DAISY) in conjunction with DCBB (with Cascade AdaBoost as classifier).

5.1 Database

The three experiments all used the M3 (also known as RU-FACS) database [17], [18]. M3 consists of video-recorded interviews of 34 men and women of varying ethnicity. Interviews were approximately two minutes in duration. Video from five subjects could not be processed for technical reasons (e.g., noisy video), which resulted in usable data from 29 participants. Meta-data included manual FACS codes for AU onsets, peaks and offsets. Because some AUs occurred too infrequently, we selected the 13 AUs that occur more than 20 times in the database (i.e., 20 or more peaks). These AUs are: AU1, AU2, AU4, AU5, AU6, AU7, AU10, AU12, AU14, AU15, AU17, AU18, and AU23. Fig. 6 shows the number of frames in which each AU occurred and their average duration. Fig. 7 illustrates a representative time series for several AUs from subject S010. Blue asterisks represent onsets, red circles peak frames, and green plus signs the offset frames. In each experiment, we randomly selected 19 subjects for training and the other 10 subjects for testing.

Fig. 6. Top) Number of frames from onset to offset for the 13 most frequent AUs in the M3 dataset. Bottom) The average duration of each AU in frames.

5.2 Experiment 1: Iteratively sampling training images with DCBB

This experiment illustrates the effect of iteratively selecting positive and negative samples (i.e., frames) while training the Cascade AdaBoost. As a sample selection mechanism on top of the Cascade AdaBoost, DCBB could be applied to other classifiers as well. The experiment investigates whether the strategy of iteratively selecting training samples is better than the strategy of using only peak AU frames or the strategy of using all AU frames as positive samples. For the different positive sample assignments, the negative samples are defined by the method used in Cascade AdaBoost.

We apply the DCBB method described in Section IV and use appearance features based on SIFT descriptors (Section III). For all AUs, the SIFT descriptors are built over a square of 48 × 48 pixels at twenty feature points for the lower-face AUs or sixteen feature points for the upper face (see Fig. 4). We trained 13 dynamic cascade classifiers, one for each AU, as described in Section IV, using a one-versus-all scheme for each AU.

The top of Fig. 8 shows the manual labeling of AU12 for subject S015. We can see eight instances of AU12 with varying intensities ranging from A (weak) to E (strong). The black curve in the bottom figures represents the similarity (Eq. 1) between the peak and the neighboring frames. The peak is the maximum of the curve.


Fig. 7. AU characteristics in subject S010 (duration, intensity, onset, offset, peak) for AU1, AU2, AU10, AU12, AU14, and AU17. The X axis is in frames; the Y axis is AU intensity.

Fig. 8. The spreading of positive samples during each dynamic training step for AU12. See text for the explanation of the graphics.

The positive samples in the first step are represented by green asterisks, in the second iteration by red crosses, in the third iteration by blue crosses, and in the final iteration by black circles. Observe that in the case of high peak intensity, subfigures 3 and 8 (top right number in the similarity plots), the final selected positive samples contain areas of low similarity values. When AU intensity is low, subfigure 7, positive samples are selected only if they have a high similarity with the peak, which reduces the number of false positives. The ellipses and rectangles in the figures contain frames that are selected as positive samples, and correspond to strong and subtle AUs in Fig. 1. The triangles correspond to frames between the onset and offset that are not selected as positive samples, and represent ambiguous AUs in Fig. 1.

Table 1 shows the number of frames at each level of intensity and the percentage of each intensity that were selected as positive by DCBB in the training dataset. 'A'-'E' in the left column refer to the levels of AU intensity. 'Num' and 'Pct' in the top rows refer to the number of AU frames at each level of intensity and the percentage of each intensity that were selected as positive examples at the last iteration of the dynamic learning step, respectively.


TABLE 1
Number of frames at each level of intensity and the percentage of each intensity that were selected as positive by DCBB

        AU1           AU2           AU4          AU5          AU6           AU7
     Num    Pct    Num    Pct    Num    Pct   Num    Pct   Num    Pct    Num    Pct
A     986  46.3%    845  43.6%    325  37.1%   293  86.4%   119  40.2%     98  47.9%
B    3513  83.3%   3130  79.4%    912  70.4%   371  97.7%  1511  71.3%    705  93.4%
C    3645  90.9%   3241  91.3%    439  83.8%   110  99.7%  1851  87.8%   1388  97.9%
D    2312  90.3%   1753  95.6%     73  79.6%     0   NaN   1128  84.2%    537  98.7%
E     151  99.3%    601  96.5%      0   NaN     23  96.6%   118  91.4%    153  98.5%

        AU10          AU12          AU14         AU15         AU17          AU18         AU23
     Num    Pct    Num    Pct    Num    Pct   Num    Pct   Num    Pct    Num    Pct   Num    Pct
A     367  61.3%   1561  38.1%    309  49.7%   402  72.6%   445  34.1%    134  87.4%    25  77.0%
B    2131  90.8%   4293  73.3%   2240  93.9%  1053  89.7%  2553  70.1%    665  99.2%   190  87.8%
C    1066  92.8%   5432  82.0%   1040  95.6%   611  97.4%  1644  76.4%    240  98.7%   198  97.3%
D     572  88.6%   2234  83.5%    268  96.8%    68  98.6%   966  85.3%     83  98.6%   166  98.9%
E       0   NaN    1323  92.4%    232  94.3%     0   NaN    226  80.2%      0   NaN    361  99.1%

If there are no AU frames at one level of intensity, the 'Pct' value is 'NaN'. From this table, we can see that the AU frames selected by DCBB are concentrated mainly at levels 'C', 'D' and 'E'; moreover, most of the AU frames at level 'A' were ignored by DCBB during the sampling of positive examples.

Fig. 9 shows the Receiver Operating Characteristic (ROC) curves on the testing data (subjects not in training) using DCBB for the 13 AUs. The ROC is obtained by plotting the true positive ratio against the false positive ratio for different decision threshold values of the classifier. In each figure (representing one AU) there are five or six ROCs: initial learning corresponds to training on only the peaks (the same as Cascade AdaBoost without the DCBB strategy); Spread x corresponds to running DCBB x times; All denotes using all frames between onset and offset. The first number between bars | in Fig. 9 denotes the area under the ROC; the second number is the number of positive samples in the testing dataset, followed after / by the number of negative samples in the testing dataset. The third number denotes the number of positive samples in the training working sets, followed after / by the total frames of the target AU in the training data sets. AU5 has the minimum number of training examples and AU12 has the largest number of examples. We can observe that the area under the ROC for frame-by-frame detection improves gradually during each learning stage, and that the performance improves faster for AU4, AU5, AU10, AU14, AU15, AU17, AU18, and AU23 than for AU1, AU2, AU6 and AU12 during dynamic learning. It is also important to notice that using all the frames between the onset and offset ('All') typically degrades the detection performance. The table in Fig. 9 shows the area under the ROC using only the peak, all frames, or the DCBB strategy. Clearly, DCBB outperforms standard approaches that use the peak or all frames [40], [18], [34], [10], [29].
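For reference, the ROC and its area can be computed from per-frame detector scores with scikit-learn as sketched below; the variable names are illustrative and this is not part of the DCBB training itself.

```python
from sklearn.metrics import roc_curve, auc

def roc_auc(labels, scores):
    """labels: 1 for frames inside a FACS-coded AU event, 0 otherwise.
    scores: real-valued detector outputs for every test frame of one AU."""
    fpr, tpr, _ = roc_curve(labels, scores)   # sweep the decision threshold
    return fpr, tpr, auc(fpr, tpr)            # area under the ROC, as reported in Fig. 9
```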

There are two parameters in the DCBB algorithms thatare manually tuned and remain the same in all experiments.One parameters specifies the minimum similarity valuebelow which no positive sample will be selected. Thisthreshold is based on the similarity equation (eq. 1). It

was set at 0.5 after preliminary testing found results stablewithin a range of 0.2 to 0.6. The second parameter isthe stopping criterion. We consider that the algorithm hasconverged if the number of new positive samples betweeniterations is smaller than 10%. We ran experiments fordifferent values ranging from 5% to 15%, values lower than15% did not change the performance in the ROC curves.In the extreme case of being 0%, the algorithm will notstop but the ROC will not improve. Typically the algorithmconverges within three or four iterations.

5.3 Experiment 2: Comparing Cascade AdaBoost and Support Vector Machines (SVM) when used with iteratively sampled training frames

SVM and AdaBoost are commonly used classifiers for AU detection. In this experiment, we compared AU detection by DCBB (with Cascade AdaBoost as classifier) and SVM using shape and appearance features. We found that for both types of features, DCBB achieved more accurate AU detection.

For compatibility with the previous experiment, data from the same 19 subjects as above were used for training and the other 10 for testing. Results are reported for the 13 most frequently observed AUs. Other AUs occurred too infrequently (i.e., fewer than 20 occurrences) to obtain reliable results and thus were omitted. Classifiers for each AU were trained using a one-versus-all strategy. The ROC curves for the 13 AUs are shown in Fig. 10. For each AU, six curves are shown, one for each combination of training features and classifiers. 'App+DCBB' refers to DCBB using appearance features; 'Peak+Shp+SVM' refers to an SVM using shape features trained on the peak frames (and two adjacent frames) [10]; 'Peak+App+SVM' [10] refers to an SVM using appearance features trained on the peak frame (and two adjacent frames); 'All+Shp+SVM' refers to an SVM using shape features trained on all frames between onset and offset; 'All+PCA+App+SVM' refers to an SVM using appearance features (after PCA processing) trained on all frames between onset and offset.


[Fig. 9 contains one ROC panel per AU, plotting true positive ratio against false positive ratio for Initial Learning (peak frames only), each spreading iteration (Spread 1-4), and All frames between onset and offset. The areas under the ROC from the summary table in the figure are:]

        AU1   AU2   AU4   AU5   AU6   AU7   AU10  AU12  AU14  AU15  AU17  AU18  AU23
Peak    0.75  0.71  0.53  0.49  0.93  0.56  0.52  0.83  0.50  0.52  0.59  0.57  0.34
All     0.69  0.71  0.58  0.75  0.95  0.66  0.64  0.90  0.69  0.83  0.80  0.77  0.40
DCBB    0.76  0.75  0.76  0.77  0.97  0.69  0.72  0.92  0.72  0.86  0.81  0.86  0.75

Fig. 9. The ROCs improve with the spreading of positive samples. See text for the explanation of Peak, Spread x, and All.


[Fig. 10 contains one ROC panel per AU (false positive ratio vs. true positive ratio) for the six methods compared in Experiment 2. The areas under the ROC from the summary table in the figure are:]

                         AU1   AU2   AU4   AU5   AU6   AU7   AU10  AU12  AU14  AU15  AU17  AU18  AU23
Peak+Shp+SVM             0.71  0.62  0.76  0.58  0.93  0.64  0.61  0.89  0.57  0.73  0.66  0.86  0.74
All+Shp+SVM              0.68  0.65  0.78  0.55  0.88  0.64  0.67  0.77  0.72  0.61  0.72  0.88  0.67
Peak+App+SVM             0.43  0.45  0.61  0.86  0.77  0.67  0.63  0.90  0.50  0.69  0.53  0.95  0.62
All+PCA+App+SVM          0.74  0.74  0.85  0.89  0.96  0.63  0.54  0.91  0.82  0.86  0.78  0.94  0.54
Peak+App+Cascade Boost   0.75  0.71  0.53  0.49  0.93  0.56  0.52  0.83  0.50  0.52  0.59  0.57  0.34
App+DCBB                 0.76  0.75  0.76  0.77  0.97  0.69  0.72  0.92  0.72  0.86  0.81  0.86  0.75

Fig. 10. ROC curves for 13 AUs using six different methods: AU peak frames with shape features and SVM (Peak+Shp+SVM), all frames between onset and offset with shape features and SVM (All+Shp+SVM), AU peak frames with appearance features and SVM (Peak+App+SVM), sampling 1 frame in every 4 frames between onset and offset with PCA-reduced appearance features and SVM (All+PCA+App+SVM), AU peak frames with appearance features and Cascade AdaBoost (Peak+App+Cascade Boost), and DCBB with appearance features (DCBB).


Here, in order to scale computationally in memory, we reduced the dimensionality of the appearance features using principal component analysis (PCA), preserving 98% of the energy. 'Peak+App+Cascade Boost' refers to using the peak frame with appearance features and a Cascade AdaBoost [14] classifier (equivalent to the first step in DCBB). As can be observed in the figure, DCBB outperformed SVM for all AUs (except AU18) using either shape or appearance features when training on the peak (and two adjacent frames). When training the SVM with shape features and using all samples, DCBB performed better on eleven out of thirteen AUs. In the SVM training, the negative samples were selected randomly (but the same negative samples were used with the shape and appearance features). The ratio between positive and negative samples was fixed to 30. Compared with Cascade AdaBoost (the first step in DCBB, which only uses the peak and two neighboring samples), DCBB improved performance on all AUs.

Interestingly, the performance for AU4, AU5, AU14, and AU18 using the method 'All+PCA+App+SVM' was better than that of DCBB. 'All+PCA+App+SVM' uses appearance features and all samples between onset and offset. Similarly, we sampled one frame in every four frames between onset and offset. All parameters in the SVM were selected using cross-validation. It is interesting to observe that AU4, AU5, AU14, AU15 and AU18 are the AUs that have very few total training samples (only 1749 total frames for AU4, 797 frames for AU5, 4089 frames for AU14, 2134 frames for AU15, and 1122 frames for AU18); when there is very little training data, the classifier can benefit from using all samples. Another notable result is that the best-performing methods all use appearance features instead of shape features, possibly because head pose variation in the M3 dataset impairs shape features more than SIFT descriptors. Using unoptimized MATLAB code, training DCBB typically requires one to three hours, depending on the AU and the number of training samples.
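A sketch of the 'All+PCA+App+SVM' baseline using scikit-learn is shown below. The kernel choice, the C value, and the feature scaling step are assumptions, since the paper only states that SVM parameters were chosen by cross-validation and that PCA preserves 98% of the energy.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# PCA keeps enough components to preserve 98% of the variance ("energy"),
# then an SVM is trained on the reduced appearance features.
clf = make_pipeline(StandardScaler(),
                    PCA(n_components=0.98, svd_solver='full'),
                    SVC(kernel='linear', C=1.0))
# Hypothetical usage:
# clf.fit(X_train_appearance, y_train)
# scores = clf.decision_function(X_test_appearance)
```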

5.4 Experiment 3: Appearance descriptors for DCBB

Appearance features have proven a more robust representation than shape features for most AUs. In addition to presenting and evaluating DCBB, we explore the representational power of two edge-based appearance descriptors (SIFT and DAISY) for AU detection, and compare them to the traditional Gabor representation [40], [23].

Similar in spirit to SIFT, DAISY is an efficient histogram-based feature descriptor that has been used to match stereo images [47]. DAISY descriptors use circular grids instead of SIFT's regular grids; the former have been found to have better localization properties [48] and to outperform many state-of-the-art feature descriptors for sparse point matching [49]. At each pixel, DAISY builds a vector made of values from the convolved orientation maps located on concentric circles centered on that location, with the amount of Gaussian smoothing proportional to the radius of the circles.
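To make the sampling pattern concrete, the sketch below (illustrative Python/NumPy code under simplifying assumptions, not the descriptor implementation of [49]; the radii, ring count, and bin counts are hypothetical) builds a DAISY-like vector by reading Gaussian-smoothed orientation maps at points on concentric circles, with smoothing that grows with the ring radius:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def daisy_like(image, center, radii=(5, 10, 15), n_points=8, n_orient=8):
    """Toy DAISY-style descriptor: sample smoothed orientation maps
    on concentric circles around `center` (row, col)."""
    gy, gx = sobel(image, axis=0), sobel(image, axis=1)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    # One map per orientation bin: gradient magnitude where the gradient
    # direction falls into that bin.
    bins = (ang / (2 * np.pi) * n_orient).astype(int) % n_orient
    maps = [np.where(bins == o, mag, 0.0) for o in range(n_orient)]

    desc = []
    cy, cx = center
    for r in radii:
        # Gaussian smoothing proportional to the radius of the ring.
        smoothed = [gaussian_filter(m, sigma=0.5 * r) for m in maps]
        for k in range(n_points):
            theta = 2 * np.pi * k / n_points
            y = int(round(cy + r * np.sin(theta)))
            x = int(round(cx + r * np.cos(theta)))
            desc.extend(s[y, x] for s in smoothed)
    v = np.asarray(desc)
    return v / (np.linalg.norm(v) + 1e-8)

patch = np.random.rand(64, 64)
print(daisy_like(patch, center=(32, 32)).shape)  # len(radii) * n_points * n_orient values
```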

In the following experiment, we compare AU detection performance for three appearance descriptors, Gabor, SIFT, and DAISY, while using DCBB with Cascade AdaBoost. We used the same 19 subjects for training and 10 for testing as in the previous experiments. Fig. 11 shows the ROC detection curves for the lower-face AUs using Gabor filters at eight orientations and five scales, DAISY [49], and SIFT [43]. For most AUs, SIFT and DAISY descriptors show similar performance, with DAISY being faster to compute than SIFT. For a fair comparison, the Gabor, DAISY, and SIFT descriptors were computed at the same locations in the image (twenty feature points in the lower face, see Fig. 12) and with the same training and testing data.
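As a rough illustration of the Gabor representation used in this comparison (a sketch in Python/NumPy, not the authors' code; the kernel size, bandwidth, and scale spacing are assumptions), the following builds an 8-orientation, 5-scale filter bank and reads its magnitude responses at a single feature point:

```python
import numpy as np

def gabor_kernel(sigma, theta, wavelength, size=31, gamma=0.5):
    """Real part of a Gabor filter with the given scale and orientation."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * xr / wavelength)
    return envelope * carrier

# 8 orientations x 5 scales, mirroring the configuration described above.
orientations = [k * np.pi / 8 for k in range(8)]
scales = [(2.0 * 1.5 ** s, 4.0 * 1.5 ** s) for s in range(5)]  # (sigma, wavelength)
bank = [gabor_kernel(sig, th, lam) for sig, lam in scales for th in orientations]

def responses_at(image, point, bank, size=31):
    """Filter-bank response magnitudes for the patch centered at `point` (row, col)."""
    half = size // 2
    r, c = point
    patch = image[r - half:r + half + 1, c - half:c + half + 1]
    return np.array([np.abs((patch * k).sum()) for k in bank])

img = np.random.rand(128, 128)
print(responses_at(img, (64, 64), bank).shape)  # (40,) = 8 orientations x 5 scales
```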

Fig. 12. Using different numbers of feature points (ROC areas: 20 points, 0.92; 18 points, 0.91; 12 points, 0.86; 2 points, 0.83).

An important parameter that largely conditions the performance of appearance-based descriptors is the number and location of the features selected to build the representation. For computational reasons it is not practical to build the appearance representation over all possible regions of interest. In the following experiment, we test the robustness of performance while varying the locations where the appearance representation is built (i.e., 2, 12, 18, or 20 points in the mouth area). In Fig. 12, the big red circles mark the selection of two feature descriptors; adding the blue squares and small red circles gives a descriptor with 12 features; including the purple triangles and the small black squares yields descriptors with 18 and 20 points, respectively. The ROC performance is consistent with our intuition: the more features we add, the better the performance. This behavior does not necessarily occur with all classifiers, but boosting provides a feature selection mechanism that avoids overfitting even for large feature spaces.
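The feature-selection behavior of boosting mentioned above can be illustrated with a small sketch (Python with scikit-learn; the synthetic data and all parameter values are hypothetical and this is not the paper's Cascade AdaBoost): each boosting round fits a depth-one decision tree, so the ensemble ends up using only a subset of the available feature dimensions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in for appearance descriptors: 500 frames, 200 feature
# dimensions, only the first 10 of which carry the (fake) AU signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))
y = (X[:, :10].sum(axis=1) > 0).astype(int)

# The default weak learner is a depth-1 tree (a stump), i.e. each round
# picks a single feature and a threshold, so boosting implicitly selects features.
clf = AdaBoostClassifier(n_estimators=50).fit(X, y)

selected = np.flatnonzero(clf.feature_importances_ > 0)
print("feature dimensions used by the ensemble:", selected[:20])
```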

6 CONCLUSIONS

An unexplored problem that is critical to the success of automatic action unit detection is the selection of positive and negative samples. This paper proposed dynamic cascades with bidirectional bootstrapping (DCBB) to select positive and negative training samples. With few exceptions, by selecting optimal positive samples, DCBB achieved better detection performance than the standard approaches of selecting either peak frames or all frames between onsets and offsets. This result held for both SVM and AdaBoost classifiers.


Fig. 11. Comparison of ROC curves for 13 AUs (AU1, AU2, AU4, AU5, AU6, AU7, AU10, AU12, AU14, AU15, AU17, AU18, and AU23) using different appearance representations based on SIFT, DAISY, and Gabor features.

We also compared three commonly used appearance features and different numbers of sampling points; the results show that SIFT and DAISY features are more effective than Gabor features for the same number of sampling points, and that more sampling points yield better performance. Although promising results have been shown, several issues remain unsolved. DCBB ultimately produces a strong Cascade AdaBoost classifier whose features and training samples have been optimized by bidirectional bootstrapping; however, this does not mean that the Cascade AdaBoost classifiers trained during the intermediate iterations of dynamic learning are of no use. In fact, for each AU we obtain 3-5 Cascade AdaBoost classifiers during dynamic learning; when they are applied to AU detection one by one, their detection label sequences overlap, and this overlap reflects the dynamic patterns of AU events. In future work, we plan to model the dynamic patterns around the onsets and offsets of AU events and to exclude false positive samples for all AUs. We also plan to explore how to extend this approach to other classifiers, such as Gaussian processes and support vector machines.


Moreover, we plan to explore the use of these techniques in other computer vision problems, such as activity recognition, where the selection of positive and negative samples may play an important role in the results.

ACKNOWLEDGMENTS

The work was performed while the first author was at Carnegie Mellon University. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the U.S. Naval Research Laboratory or the National Institutes of Health. Thanks to Tomas Simon, Feng Zhou, and Zengyin Zhang for helpful comments.

REFERENCES

[1] P. Ekman and W. Friesen, “Facial action coding system: A technique for the measurement of facial movement.” Consulting Psychologists Press, 1978.

[2] J. F. Cohn, Z. Ambadar, and P. Ekman, Observer-based measurement of facial expression with the Facial Action Coding System. In The handbook of emotion elicitation and assessment. New York: Oxford University Press Series in Affective Science, 2007.

[3] J. F. Cohn and P. Ekman, “Measuring facial action by manual coding, facial EMG, and automatic facial image analysis.” Handbook of nonverbal behavior research methods in the affective sciences, 2005.

[4] P. Ekman and E. Rosenberg, What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS). Oxford University Press, USA, 2005.

[5] M. Pantic, N. Sebe, J. F. Cohn, and T. Huang, “Affective multimodal human-computer interaction,” in ACM International Conference on Multimedia, 2005, pp. 669–676.

[6] W. Zhao and R. Chellappa (Eds.), Face Processing: Advanced Modeling and Methods. Elsevier, 2006.

[7] S. Li and A. Jain, Handbook of face recognition. New York: Springer, 2005.

[8] Y. Tian, J. F. Cohn, and T. Kanade, Facial expression analysis. In S. Z. Li and A. K. Jain (Eds.), Handbook of face recognition. New York: Springer, 2005.

[9] Y. Tian, T. Kanade, and J. F. Cohn, “Recognizing action units for facial expression analysis,” IEEE TPAMI, vol. 23, no. 1, pp. 97–115, 2001.

[10] S. Lucey, A. B. Ashraf, and J. Cohn, “Investigating spontaneous facial action recognition through AAM representations of the face,” in Face Recognition, K. Delac and M. Grgic, Eds. Vienna: I-TECH Education and Publishing, 2007, pp. 275–286.

[11] M. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan, “Recognizing facial expression: Machine learning and application to spontaneous behavior,” in CVPR05, 2005, pp. 568–573.

[12] M. Valstar and M. Pantic, “Combined support vector machines and hidden markov models for modeling facial action temporal dynamics,” in ICCV07, 2007, pp. 118–127.

[13] Y. Tong, W. Liao, and Q. Ji, “Facial action unit recognition by exploiting their dynamic and semantic relationships,” IEEE TPAMI, pp. 1683–1699, 2007.

[14] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in CVPR01, 2001, pp. 511–518.

[15] R. Xiao, H. Zhu, H. Sun, and X. Tang, “Dynamic cascades for face detection,” in ICCV07, 2007, pp. 1–8.

[16] S. Yan, S. Shan, X. Chen, W. Gao, and J. Chen, “Matrix-structural learning (MSL) of cascaded classifier from enormous training set,” in CVPR07, 2007, pp. 1–7.

[17] M. Frank, M. Bartlett, and J. Movellan, “The M3 database of spontaneous emotion expression,” 2011.

[18] M. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan, “Automatic recognition of facial actions in spontaneous expressions,” Journal of Multimedia, 2006.

[19] M. J. Black and Y. Yacoob, “Recognizing facial expressions in image sequences using local parameterized models of image motion,” IJCV97, vol. 25, no. 1, pp. 23–48, 1997.

[20] F. de la Torre, Y. Yacoob, and L. Davis, “A probabilistic framework for rigid and non-rigid appearance based tracking and recognition,” in Proceedings of the Seventh IEEE International Conference on Automatic Face and Gesture Recognition (FG'00), 2000, pp. 491–498.

[21] Y. Chang, C. Hu, R. Feris, and M. Turk, “Manifold based analysis of facial expression,” in CVPRW04, 2004, pp. 81–89.

[22] C. Lee and A. Elgammal, “Facial expression analysis using nonlinear decomposable generative models,” in IEEE International Workshop on Analysis and Modeling of Faces and Gestures, 2005, pp. 17–31.

[23] M. Bartlett, G. Littlewort, I. Fasel, J. Chenu, and J. Movellan, “Fully automatic facial action recognition in spontaneous behavior,” in Proceedings of the Seventh IEEE International Conference on Automatic Face and Gesture Recognition (FG'06), 2006, pp. 223–228.

[24] M. Pantic and L. Rothkrantz, “Facial action recognition for facial expression analysis from static face images,” IEEE Transactions on Systems, Man, and Cybernetics, pp. 1449–1461, 2004.

[25] M. Pantic and I. Patras, “Dynamics of facial expression: Recognition of facial actions and their temporal segments from face profile image sequences,” IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, vol. 36, pp. 433–449, 2006.

[26] P. Yang, Q. Liu, X. Cui, and D. N. Metaxas, “Facial expression recognition using encoded dynamic features,” in CVPR08, 2008.

[27] Y. Sun and L. Yin, “Facial expression recognition based on 3D dynamic range model sequences,” in ECCV08, 2008, pp. 58–71.

[28] B. Braathen, M. S. Bartlett, G. Littlewort, and J. R. Movellan, “First steps towards automatic recognition of spontaneous facial action units,” in Proceedings of the ACM Conference on Perceptual User Interfaces, 2001.

[29] P. Lucey, J. Cohn, S. Lucey, S. Sridharan, and K. Prkachin, “Automatically detecting action units from faces of pain: Comparing shape and appearance features,” CVPRW09, pp. 12–18, 2009.

[30] J. Cohn and T. Kanade, “Automated facial image analysis for measurement of emotion expression,” in The handbook of emotion elicitation and assessment. Oxford University Press Series in Affective Science, J. A. Coan and J. J. B. Allen, Eds., pp. 222–238.

[31] J. F. Cohn and M. A. Sayette, “Spontaneous facial expression in a small group can be automatically measured: An initial demonstration,” 2010.

[32] M. F. Valstar, I. Patras, and M. Pantic, “Facial action unit detection using probabilistic actively learned support vector machines on tracked facial point data,” in CVPRW05, 2005, p. 76.

[33] M. Valstar and M. Pantic, “Fully automatic facial action unit detection and temporal analysis,” in CVPR06, 2006, pp. 149–157.

[34] S. Lucey, I. Matthews, C. Hu, Z. Ambadar, F. De la Torre, and J. Cohn, “AAM derived face representations for robust facial action recognition,” in Proceedings of the Seventh IEEE International Conference on Automatic Face and Gesture Recognition (FG'06), 2006, pp. 155–160.

[35] Z. Zeng, M. Pantic, G. Roisman, and T. Huang, “A survey of affect recognition methods: Audio, visual, and spontaneous expressions,” IEEE TPAMI, pp. 39–58, 2008.

[36] T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance models,” in ECCV98, 1998, pp. 484–498.

[37] F. de la Torre and M. Nguyen, “Parameterized kernel principal component analysis: Theory and applications to supervised and unsupervised image alignment,” in CVPR08, 2008.

[38] I. Matthews and S. Baker, “Active appearance models revisited,” IJCV04, pp. 135–164, 2004.

[39] V. Blanz and T. Vetter, “A morphable model for the synthesis of 3D faces,” in SIGGRAPH99, 1999.

[40] Y. Tian, T. Kanade, and J. F. Cohn, “Evaluation of gabor-wavelet-based facial action unit recognition in image sequences of increasing complexity,” in Proceedings of the Seventh IEEE International Conference on Automatic Face and Gesture Recognition (FG'02). Springer, 2002, pp. 229–234.

[41] A. Ashraf, S. Lucey, T. Chen, K. Prkachin, P. Solomon, Z. Ambadar, and J. Cohn, “The painful face: Pain expression recognition using active appearance models,” in Proceedings of the ACM International Conference on Multimodal Interfaces (ICMI'07), 2007, pp. 9–14.

[42] P. Lucey, J. F. Cohn, S. Lucey, S. Sridharan, and K. M. Prkachin, “Automatically detecting pain using facial actions,” in International Conference on Affective Computing and Intelligent Interaction (ACII2009), 2009.

[43] D. Lowe, “Object recognition from local scale-invariant features,” in ICCV99, 1999, pp. 1150–1157.

[44] K.-K. Sung and T. Poggio, “Example-based learning for view-based human face detection,” IEEE TPAMI, pp. 39–51, 1998.

[45] L. Breiman, Classification and regression trees. Chapman & Hall/CRC, 1998.

[46] R. Schapire and Y. Freund, “Experiments with a new boosting algorithm,” in Machine learning: proceedings of the Thirteenth International Conference (ICML'96). Morgan Kaufmann, 1996, p. 148.

[47] E. Tola, V. Lepetit, and P. Fua, “A fast local descriptor for dense matching,” in CVPR08, 2008.

[48] K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors,” IEEE TPAMI, vol. 27, no. 10, pp. 1615–1630, 2005.

[49] E. Tola, V. Lepetit, and P. Fua, “Daisy: An efficient dense descriptor applied to wide baseline stereo,” IEEE TPAMI, vol. 99, no. 1, pp. 3451–3467, 2009.

Yunfeng Zhu received the B.Sc. degree in Information Engineering and the M.Sc. degree in Pattern Recognition and Intelligent Systems from Xi'an Jiaotong University in 2003 and 2006, respectively. He is currently a Ph.D. candidate at the Image and Graphics Institute, in the Department of Electronic Engineering at Tsinghua University, Beijing, China. In 2008 and 2009, he was a visiting student in the Robotics Institute at Carnegie Mellon University (CMU), Pittsburgh, PA. His research interests include computer vision, machine learning, facial expression recognition, and 3D facial model reconstruction.

Fernando De la Torre received the B.Sc. degree in telecommunications and the M.Sc. and Ph.D. degrees in electronic engineering from Enginyeria La Salle in Universitat Ramon Llull, Barcelona, Spain, in 1994, 1996, and 2002, respectively. In 1997 and 2000, he became an Assistant and an Associate Professor, respectively, in the Department of Communications and Signal Theory at the Enginyeria La Salle. Since 2005, he has been a Research Faculty at the Robotics Institute, Carnegie Mellon University (CMU), Pittsburgh, PA. His research interests include machine learning, signal processing, and computer vision, with a focus on understanding human behavior from multimodal sensors. He is directing the Human Sensing Lab (http://humansensing.cs.cmu.edu) and the Component Analysis Lab at CMU (http://ca.cs.cmu.edu). He has co-organized the first workshop on component analysis methods for modeling, classification, and clustering problems in computer vision in conjunction with CVPR-07, and the workshop on human sensing from video in conjunction with CVPR-06. He has also given several tutorials at international conferences on the use and extensions of component analysis methods.

Jeffrey F. Cohn earned the PhD degree in clinical psychology from the University of Massachusetts in Amherst. He is a professor of psychology and psychiatry at the University of Pittsburgh and an Adjunct Faculty member at the Robotics Institute, Carnegie Mellon University. For the past 20 years, he has conducted investigations in the theory and science of emotion, depression, and nonverbal communication. He has co-led interdisciplinary and interinstitutional efforts to develop advanced methods of automated analysis of facial expression and prosody and applied these tools to research in human emotion and emotion disorders, communication, biomedicine, biometrics, and human-computer interaction. He has published more than 120 papers on these topics. His research has been supported by grants from the US National Institutes of Mental Health, the US National Institute of Child Health and Human Development, the US National Science Foundation, the US Naval Research Laboratory, and the US Defense Advanced Research Projects Agency. He is a member of the IEEE and the IEEE Computer Society.

Yu-Jin Zhang received the Ph.D. degree in Applied Science from the State University of Liège, Liège, Belgium, in 1989. From 1989 to 1993, he was a post-doc fellow and research fellow with the Department of Applied Physics and Department of Electrical Engineering at the Delft University of Technology, Delft, the Netherlands. In 1993, he joined the Department of Electronic Engineering at Tsinghua University, Beijing, China, where he has been a professor of Image Engineering since 1997. In 2003, he spent his sabbatical year as a visiting professor in the School of Electrical and Electronic Engineering at Nanyang Technological University (NTU), Singapore. His research interests are mainly in the area of Image Engineering, which includes image processing, image analysis, and image understanding, as well as their applications. His current interests are in object segmentation from images and video, segmentation evaluation and comparison, moving object detection and tracking, face recognition, facial expression detection/classification, content-based image and video retrieval, and information fusion for high-level image understanding. He is vice president of the China Society of Image and Graphics and director of the academic committee of the Society. He is also a Fellow of SPIE.