spontaneous facial micro-expression analysis using ......efﬁciency for facial expression...

Spontaneous Facial Micro-expression Analysis usingSpatiotemporal Completed Local Quantized Patterns

Xiaohua Huanga,b, Guoying Zhaoa,∗, Xiaopeng Honga, Wenming Zhengb,c, MattiPietikainena

aCenter for Machine Vision Research, Department of Computer Science and Engineering, P.O. Box 4500,FI-90014, University of Oulu, Finland

bResearch Center for Learning Science, Southeast University, Nanjing, JiangSu 210096, ChinacKey Laboratory of Child Development and Learning Science (Ministry of Education), Southeast

University, Nanjing, JiangSu 210096, China

Abstract

Spontaneous facial micro-expression analysis has become an active task for rec-ognizing suppressed and involuntary facial expressions shown on the face of humans.Recently, Local Binary Pattern from Three Orthogonal Planes (LBP-TOP) has beenemployed for micro-expression analysis. However, LBP-TOP suffers from two criti-cal problems, causing a decrease in the performance of micro-expression analysis. Itgenerally extracts appearance and motion features from the sign-based difference be-tween two pixels but not yet considers other useful information. As well, LBP-TOPcommonly uses classical pattern types which may be not optimal for local structure insome applications. This paper proposes SpatioTemporal Completed Local Quantiza-tion Patterns (STCLQP) for facial micro-expression analysis. Firstly, STCLQP extractsthree interesting information containing sign, magnitude and orientation components.Secondly, an efficient vector quantization and codebook selection are developed foreach component in appearance and temporal domains to learn compact and discrimi-native codebooks for generalizing classical pattern types. Finally, based on discrimina-tive codebooks, spatiotemporal features of sign, magnitude and orientation componentsare extracted and fused. Experiments are conducted on three publicly available facialmicro-expression databases. Some interesting findings about the neighboring patternsand the component analysis are concluded. Comparing with the state of art, experimen-tal results demonstrate that STCLQP achieves a substantial improvement for analyzingfacial micro-expressions.

Keywords: Micro-expression, LBP-TOP, Vector Quantization, Discriminative

∗Corresponding authorEmail addresses: [email protected] (Xiaohua Huang), [email protected]

(Guoying Zhao), [email protected] (Xiaopeng Hong), [email protected](Wenming Zheng), [email protected] (Matti Pietikainen)

Preprint submitted to NeuroComputing September 21, 2015

1. Introduction

Micro-expression is a subtle and involuntary facial expression. It usually occurswhen a person is consciously trying to conceal all signs of how he is feeling. Unlikeregular facial expressions, it is difficult to hide micro-expression reaction. As a result,the importance of micro-expression study is apparent in many potential applications forthe security field. Psychological research shows that facial micro-expressions generallyremain less than 0.2 seconds, as well they are very subtle [9]. Short duration and subtlechange causes human difficulties in recognizing facial micro-expressions. In order toimprove this ability, the micro-expression training tool developed by Ekman and histeam was used to train people to better recognize micro-expressions. Even so, humancan achieve just around 40% recognition accuracy [11]. Therefore, there is a great needfor a high-quality automatic system to recognize facial micro-expressions.

Some earlier studies on automatic facial micro-expression analysis have primarilyfocused on posed or synthetic facial micro-expressions [32, 33]. Recently, researchershave conducted spontaneous facial micro-expression analysis [25, 30, 44, 45]. In con-trast to posed facial micro-expressions, spontaneous facial micro-expressions can re-veal genuine emotions that people try to conceal. It is very challenging to extract usefulinformation from subtle changes of micro-expressions. It is noted that geometry-basedor appearance-based feature extraction method has been commonly employed to ana-lyze facial expressions. Specifically, geometric-based features represent the face ge-ometry, such as the shapes and locations of facial landmarks, but they are sensitive toglobal changes, such as pose change and illumination variation. Instead, appearance-based features describe the skin texture of faces. Among these methods, Local BinaryPattern from Three Orthogonal Planes (LBP-TOP) has demonstrated its simplicity andefficiency for facial expression recognition [18, 54]. As a result, LBP-TOP has beenwidely used in micro-expression analysis [30, 45, 7]. Pfister et al. [30] proposed aspontaneous micro-expression recognition method using LBP-TOP. Yan et al. [45] ap-plied LBP-TOP on their CASME 2 database achieving micro-expression recognitionrate of 63.41%. As well, other works used LBP-TOP to investigate whether micro-facial movement sequences can be different from neutral face sequences [7].

However, it should be noted that there is still a gap to achieve a high-quality micro-expression analysis. Consequently, several works have attempted to improve the LBP-TOP. Ruiz-Hernandez and Pietikainen [35] used the re-parameterization of second or-der Gaussian jet on the LBP-TOP achieving promising micro-expression recognitionresult on the SMIC database [30]. Wang et al. [40] extracted Tensor features from Ten-sor Independent Color Space (TICS) for micro-expression recognition, but their resultson the CASME 2 database showed no improvement if we compare their highest achiev-able accuracies with the previous results. Furthermore, Wang et al. [41] used LocalSpatiotemporal Directional Features (LSDF) with robust principal component analysisfor micro-expressions, but yet did not obtain improvement for micro-expression analy-sis. In addition, recent work [42] reduced redundant information of LBP-TOP by usingSix Intersection Points (LBP-SIP) to obtain better performance than the LBP-TOP.Even so, there is still much room for improvement in the recognition performance.

In our preliminary work [19], Completed Local Quantized Pattern (CLQP) wasproposed by using the completed information and vector quantization to improve the

2

performance of the original LQP proposed by Hussain and Triggs [20]. It achievedconsiderable results on texture classification and neonatal facial expression classifica-tion tasks. This paper proposes SpatioTemporal Completed Local Quantized Pattern(STCLQP) by extending our spatial domain approach [19] for micro-expression anal-ysis. In this work, STCLQP exploits three useful information, including sign-based,magnitude-based and orientation-based difference of pixels. Furthermore, STCLQPdesigns compact and discriminative codebooks for spatiotemporal domain. Differentfrom our preliminary work, this work considers a more discriminative codebook andan application in spatiotemporal domain.

To explain the concepts of our approach, the paper is organized as follows. Sec-tion 2 discusses recent related work. Section 3 describes the completed local quan-tized patterns (CLQP) in spatial domain. Section 4 provides the extension of CLQP tospatiotemporal domain and explains its implementation to micro-expression analysis.Section 5 discusses parameter settings, datasets and provides experimental results withrelevant discussion. Finally, Section 6 concludes the paper.

2. Related work

The video feature extraction problem has been addressed from different perspec-tives. Some works describe a video clip from the shape features. In [27], Liu et al. pro-posed a sketch-based method to organize video clip, where they used sketch annota-tions to enhance the narrations. The sketch-based approach achieves context-awaresketch recommendation for video shape extraction. In [2], Belongie et al. proposedthe shape context using a circular local pattern and histogram to measure similaritybetween shapes. In [37], Shin and Chun used eighteen major feature points defined inMPEG 4, and then applied the dense optical flow method to track the feature points forsequential frames. In [23], Jain et al. used the shape feature around the eyebrows, eyes,nose, chin, inner lips and outer lips to describe the feature of each frame in video clip.

On the other hand, there are a few works to use a texture descriptor to describe theappearance and motion features of the video clip. As we know, Gabor and Local Bi-nary Pattern (LBP) are two most representative ones for facial expression recognition.Since Gabor feature is related to the perception in human visual system, some facialexpression recognition systems to date have utilized the Gabor energy filters [26, 51].Instead, LBP is simple to implement, fast to compute and has led to high accuracy intexture-based recognition tasks [29]. In recent years significant progress has been madein using LBP for facial expression recognition [10, 36]. Due to its simplicity, the LBPoperator was further extended to video sequences [54], named the LBP from ThreeOrthogonal Planes (LBP-TOP) of a space time volume. Basically, the LBP-TOP de-scription is formed by calculating the LBP features from the planes and concatenatingthe histograms. In addition, LBP-TOP has efficient computation. Thus, it has becomeattractive for researchers in many fields.

In recent years, many extensions of LBP-TOP are developed in an application suchas human action recognition [28] and lip reading [3]. Local Ternary Pattern from ThreeOrthogonal Planes (LTP-TOP) proposed by Nanni et al. [28] quantized intensity differ-ences of neighboring pixels and center pixel into three levels to increase the robustnessagainst noise. However, the LTP-TOP is sensitive to the quantization levels. In [3],

3

for increasing the robustness against intensity noise, Local Ordinal Contrast Pattern(LOCP) used a pairwise ordinal contrast measurement of pixels from a circular neigh-borhood starting at the center pixel.

Recently, Completed Local Binary Pattern from Three Orthogonal Planes (CLBP-TOP) was proposed to exploit the useful information from intensity level [31]. In theirwork, completed local binary pattern [15] was extended into temporal domain. Magni-tude and center pixel are severed as the complementary to the LBP-TOP, increasing therobustness of the LBP-TOP against noise. However, central pixel intensity informationis very sensitive to noises caused by such things as illumination changes [55]. Alter-natively, several researches [52, 43, 39, 38, 18] have demonstrated that orientation isuseful because of its robustness to illumination changes. In [52], Zhang et al.proposedthe Histogram of Gabor Phase Pattern, in which they combine the spatial histogram andthe Gabor phase information coding schemes. In [43], Xie et al.proposed local GaborXOR patterns, which encodes the Gabor phase by using local XOR pattern. In [39],Vu and Caplier presented multiple features combining patterns of oriented edge magni-tude and patterns of dominant orientations. Therefore, it begs a question if orientationwould be more effective for LBP-TOP.

Furthermore, LBP-TOP and CLBP-TOP inherit from the sparse sampling problem,yielding inadequate spatiotemporal descriptors. In fact, LBP histograms with smalla number of bins tend to fail to provide enough discriminative information about theimage appearance [50]. Instead, as the number of increases, more discriminative in-formation will be provided, although this will cause the number of local patterns toincrease exponentially. For example, with 8 sampling points around each pixel, thereare 256 possible local patterns, while with 16 sampling points the dimensionality ofhistogram is 65,536. More importantly, the histogram becomes extremely sparse givena limited number of pixels. For example, it is observed that only half of the 65,536local patterns in the case of 16-point sampling occur in spontaneous micro-expressiondatabase [25].

To address the sparseness problem, some specific codebooks were designed to re-duce the number of possible codes and make the resulting histogram compact andevenly distributed. For example, in [29], Ojala et al. presented a type of codebook,namely ‘uniform pattern’, which consists of several binary patterns containing at mosttwo bitwise transitions from 0 to 1 or vice versa when the bit pattern is traversed circu-larly. Although similar methods to uniform coding are well-established to reduce thedimensionality, they are at best limited palliatives for a serious problem, in which thecodebook size is exponentially grown with the number of local sampling points andquantization depth. Furthermore, it is uncertain that uniform coding is generalized fornon-circular and multi-circular structures. Thus, it wonders if another compact methodcan be developed to provide appropriate codebook for various sizes and expressivenessof local pattern representations.

3. Completed local quantized pattern in spatial domain

To address the sparseness problem, many typical coding methods, such as uniformpatterns [29], have been designed by considering rotation and scale invariant prop-erties. However, with number of sampling points increasing, these coding methods

4

Figure 1. Overview of completed local binary pattern: (1) Three kinds of informa-tion (local sign, magnitude and orientation patterns) are extracted from the image. (2)Three separate codebooks are learned by using vector quantization, where S, M, andO are referred to sign, magnitude and orientation, respectively. (3) The sign, magni-tude and orientation patterns are mapped into their corresponding codebook by using acodebook, and three histograms are concatenated into one vector.

are not optimal and compact. Our preliminary work [19] proposed Completed Lo-cal Quantization Pattern (CLQP) to create a more robust LBP. This approach consistsof three stages: (1) component extraction, (2) a codebook obtained by using vectorquantization, and (3) local pattern encoding. Figure 1 shows the procedure for CLQP.For briefly description, Table 1 describes the representation of the used mathematicalsymbols in CLQP.

3.1. Component extraction

We suppose that an image can be represented as ξx,y, where ξx,y is the gray-level intensity or orientation angle of pixel (x, y). For a spatial coordinate (x, y), asillustrated in Figure 2, the local pattern can be formulated as follows:

−→x = [f(ξx,y, ξx1,y1), f(ξx,y, ξx2,y2), . . . , f(ξx,y, ξxP ,yP )], (1)

where f(ξx,y, ξxi,yi) is the formula comparing values such as gray-level intensities oftwo pixels, (xi, yi) is the neighbor sampling points of (x, y), and P is the numberof neighbor sampling points. For (xi, yi), it can be sampled as shown in Figure 2.This section will in detail describe the extraction of f(ξx,y, ξxi,yi), which involves theorientation-based, sign-based and magnitude-based difference operators.

3.1.1. Orientation-based difference extractionEssentially, orientation-based difference extraction aims to encode two dominant

orientations between two pixels in an image. This scheme consists of three stages.Firstly, it calculates dominant orientation of the pixel. Secondly, it quantifies dominantorientation into T levels. Finally, it compares dominant orientation of the central pixelwith its neighbor pixels.

5

Table 1. Representation of mathematical symbols in CLQP.

Symbol Representation(x, y) Coordinate of pixelξ The gray-level intensity or orientation angle of a pixel

f(·, ·) The formula comparing values of two pixelsP The number of neighbor sampling pointsG Gaussian kernelθ The dominant orientation of a pixelN The number of Gaussian kernelsT The number of quantification levelK The number of codebook sizeΩ The codebook−→x The local patternH A histogram of one facial blockB The number of blocks in facial imageH The feature of facial imageS The sign componentM The magnitude componentO The orientation component

(1) Dominant orientation of the pixel: To date, several methods, such as an imagegradient [5] and Sobel filters [13], can be used to calculate dominant orientation of eachpixel. Gabor filters and Gaussian recursive transformation have been recently proposedto obtain dominant orientation of each pixel [12, 43, 52]. In their methods, orientation-estimation filters are designed by using a mask with various size.

Following [12], a set of N Gaussian kernels is designed to estimate dominant ori-entation of a pixel. Given a mask patch with D ×D size, the n-th Gaussian kernel isdefined as the difference between two oriented with shifted kernels as follows:

Gθn =G−θn −G

+θn∑

x,y[(G−θn −G+θn

) · h(G−θn −G+θn

)], (2)

where

G−θn =1

2πσ2exp(− (x− σ cos θn)2 + (y − σ sin θn)2

2σ2),

G+θn

=1

2πσ2exp(− (x+ σ cos θn)2 + (y + σ sin θn)2

2σ2),

h(G−θn −G+θn

) =

1, G−θn −G

+θn> 0

0, G−θn −G+θn

6 0,

and σ is a root mean square deviation of Gaussian distribution, (x, y) is the coordinateof cell patch, and θn = n× 2π

N , n = 0, . . . , N − 1.The response of Gaussian kernel defines the contrast magnitude of a local edge at its

pixel location. The dominant orientation of pixel (x, y) is obtained with the orientation

6

(a) (b)

Figure 2. Examples of commonly used local pattern neighborhoods: (a) The circularneighborhoods (R,P ) with P sampling points (xi, yi) and radius R around the centralpoint (x, y), and (b) multi-circle neighborhoods Disc5 with 24 sampling points, wherethe black point represents a central point (x, y) and the red ones are local samplingpoints (xi, yi). The pixel values are bi-linearly interpolated whenever the samplingpoint is not in the center of a pixel.

of a kernel that gave the maximum response,

θx,y = arg maxθn

∑p,q

gx−p,y−qGθn , (3)

where gx−p,y−q is gray-level intensity of a pixel (x − p, y − q) in an image, p, q =−bD2 c,−b

D2 c+ 1, . . . , bD2 c − 1, bD2 c, and bD2 c means the floor of D2 .

(2) Quantification of dominant orientation: With Gaussian kernels, we obtain anew image containing dominant orientation. We suppose that dominant orientationsof the center pixel (x, y) and its neighbors (xi, yi) are θx,y and θxi,yi , respectively.Generally, we could compute the difference of orientation angle [38] and subsequentlycode it as 0/1 by setting a threshold value. However, it is difficult to find the optimalthreshold value. Instead, some works [52, 43] have quantified phase or orientation intoseveral levels, making it easy to compare the relationship between dominant orienta-tions. Thus, the quantification function is used to quantify dominant orientation:

ξ = mod(bθx,y2πT

+ 0.5c, T ), (4)

where T is number of quantification level.(3) Relationship of dominant orientations: Following LBP, we aim to exploit the

relationship of a pixel and its surrounding neighbors on orientation-quantified image.The dominant orientation bin of pixel (x, y) and its surrounding pixels (xi, yi) aredenoted as ξ and ξi, respectively. Thus, their relationship is calculated as follows,

f(ξx,y, ξxi,yi) = ξ⊕

ξi =

0, ξ = ξi1, ξ 6= ξi

, (5)

7

where P is the number of the sampling points.Discussion: In previously mentioned procedure, N and T are important to ori-

entation based difference extraction. The dominant orientation estimation is sensitiveto N . Small N will make orientation estimation inaccurate. Instead, large N willcause expensive computation. Consequently, appropriate N would provide beneficialperformance to orientation estimation. In [43], Xie et al. showed that the appropriatequantification level could achieve a balance between robustness to orientation variationand representation power of local patterns. Additionally, in [39], Vu et al. discussedhow the quantification level could affect the local pattern for the orientation in neigh-bor structure. Thus, T performs a trade-off role between the robustness to orientationvariation and representation power of local patterns. The effect of N and T will beexamined in experiments.

3.1.2. Sign-based and magnitude-based difference extractionThe sign-based and magnitude-based information are important to a face descriptor.

Let ξxi,yi (i = 1, . . . , P ) denote the gray-level intensity of P sampling points around(x, y). Following [15], the difference between the center pixel and its surrounding pixelis calculated as di = ξxi,yi − ξx,y . It is further decomposed into sign and magnitude asfollows,

di = sign(di) ∗ |di|, (6)

where sign(di) =

1, di ≥ 00, di < 0

is the sign of di and |di| is the magnitude of di.

The sign pattern of (x, y) has the same formulation (binary) as the LBP operator.It can be represented as [S1, . . . , SP ]. For a magnitude pattern [M1, . . . ,MP ], it issimply converted into a consistent format with that of a sign pattern by a threshold δ.Here we set it as the mean value of |di| from the whole image. The magnitude pattern

can be denoted as Mi =

1, |di| ≥ δ0, |di| < δ

.

3.1.3. Formulation of componentsThrough the extraction, for a pixel (x, y), three components are obtained as the

formulations of [O1, . . . , OP ], [S1, . . . , SP ] and [M1, . . . ,MP ] for orientation, signand magnitude, respectively.

3.2. Efficient vector quantization

Hussain and Triggs [20] applied vector quantization to learn codebook for variouslocal pattern neighborhoods, such as Disc5 (in Figure 2(b)), since vector quantizationproduces compact codebook to different applications. Motivated by their work, weadopt vector quantization to obtain codebooks for three components. For convenientpurpose, we take the orientation component as an example for interpreting our method.

Given training images, all local orientation patterns are denoted as−→xq (q = 1, . . . , Q),where Q is the total number of pixels from these images. The vector quantization aimsto quantize them intoK quantized depths. K-means clustering is a common way in vec-tor quantization for generating representative “visual words” [48, 20]. Theoretically,

8

the k-means clustering partitions all observations into a set Ψ∗ = Ψ1,Ψ2, . . . ,ΨKas to minimize the within-cluster sum of squares as follows:

Ψ∗ = arg minΨ

K∑k=1

∑−→x q∈Ψk

‖−→x q − µk‖2, (7)

where µk is the mean of local patterns in Ψk, andK is the number of clustering centers.It is found that k-means clustering usually requires much time and huge memory

size in training procedure. In our case, local patterns that occur several times are calcu-lated repeatedly in clustering. For example, ‘00000000’ occurs 20 times in an image.It therefore leads to great redundancy in calculation. To address this problem, we intro-duce a weight W of local patterns for k-means clustering. In specific, we can obtain Junique local patterns Y = [−→y 1,

−→y 2, . . . ,−→y J ] and the number of occurrences of local

pattern W = [w1, w2, . . . , wJ ], where J (J Q) is the number of local patterns, forexample, J = 256 for 8-point sampling. Eq. 7 consequently becomes

Ψ∗ = arg minΨ

K∑k=1

∑−→y j∈Ψk

‖wj−→y j − µk‖2. (8)

Another important issue related to the efficiency of k-means is to initialize cluster-ing centers. Different choices of an initialization substantially affect the speed of di-vergence of training procedure and also the performance. Instead of random sampling,we exploit the dominant local patterns, i.e., the most frequently occurred patterns, asinitialization. In the implementation, we firstly sort local patterns by the descendingorder of occurrence, and secondly select the first K local patterns as the initializationof the clustering centers. Our preliminary work[19] had demonstrated that objectivefunction in Eq. 8 converged very fast.

In order to guarantee fast implementation, a codebook is off-line built by mappingunique local patterns Y = [−→y 1,

−→y 2, . . . ,−→y J ] to the nearest clustering centers. Moti-

vated by [29], we propose to use a ‘miscellaneous’ label to make codebook robust andcompact. In the implementation of this label, we take the following steps: (1) we setthe threshold λ to categorize local patterns into genuine and fake ones; (2) for genuineone, we assign the label of the closest cluster to it; and (3) for fake one, we assign anextra label (K + 1) to it.

3.3. Local pattern encodingTypically, local pattern −→x can be encoded by a predefined codebook Ω, such as

‘uniform pattern’ [29] or our learned codebook by vector quantization. Figure 3 de-scribes the procedure encoding a local pattern by using a codebook Ω on a facial image.

Generally, a facial image is divided intoB blocks. For the b-th block, we can obtainlocal orientation pattern [O1, . . . , OP ] for a pixel (x, y). Based on the codebook Ω, itscorresponding index can be quickly searched. For example, the local pattern ‘0000000’is programmed by its corresponding index ‘5’. The same procedure is applied to allpixels, thus formulating all local orientation patterns into a histogramHb. For the facialimage, histograms Hb(b = 1, . . . , B) from B blocks are concatenated into one featurevectorH.

9

Figure 3. An example of mapping the local pattern into a codebook Ω. Given animage, we use orientation-based difference extraction to obtain local pattern of eachpixel. Based on a codebook Ω, local pattern is assigned into its corresponding index.Finally, All local patterns are formulated as one histogram.

Figure 4. Procedure of LBP-TOP on micro-expression analysis [54]. (a) Block volume;(b) LBP features from three orthogonal planes; (c) Appearance and motion feature forblock volume.

4. Extension to the spatiotemporal domain

In the previous section, the CLQP features are constructed for static-image analysisand obtained acceptable results on texture classification. Recently, LBP-TOP has beenproposed for micro-expression analysis [31, 25]. It combines motion and appearanceinformation together by using three orthogonal planes, as shown in Figure 4. Motivatedby LBP-TOP, we intend to extend CLQP to the spatiotemporal domain for dynamicmicro-expression analysis, named Spatiotemporal Completed Local Quantized Pattern(STCLQP). In this work, STCLQP resolves the problem that the typical coding methodmay be not optimal to the LBP-TOP. On the other hand, STCLQP makes local patternsmore compact and discriminative by developing codebooks based on Fisher criterion.

The method is described in Figure 5. It consists of four stages: (1) componentsextraction, (2) using vector quantization to obtain codebook (including local patternpool and vector quantization), (3) local pattern encoding and (4) discriminative andcompact codebook selection. The component extraction method (in Section 3.1) and anefficient vector quantization (in Section 3.2) are employed to the first and second stages,respectively. In this section, we explain in detail the last two stages. For convenientpurpose, we take the orientation component as an example for interpreting our method.The same procedure is applied to sign and magnitude components.

10

Figure 5. Spatiotemporal completed local binary pattern: (1) We extract the sign, mag-nitude and orientation components for each plane (one of XY , XT and Y T ), and for-mulate the local pattern pool. (2) Fixing the number of clustering K, k-means cluster-ing method is employed to quantize the local pattern pool for obtaining the codebook.(3) Based on specific codebook, the pool can be encoded. (4) The encoded features arefurther used to calculate the Fisher score wKη for the number of clusterings, where ηis XY , XT and Y T . Fisher scores are used to choose the discriminative and compactcodebook for each component in each plane. For convenience, here we take the signcomponent as an example.

4.1. Local pattern encoding in the spatiotemporal domain

In spatiotemporal domain, LBP-TOP considers to use three orthogonal planes forrepresenting an image sequences. In original LBP-TOP, ‘uniform pattern’ [29] is com-monly utilized to encode all three planes. Due to this, it is noted that in the implemen-tation of the STCLQP, three codebooks are required for the orientation component. Wedenote codebooks as ΩXY , ΩXT , and ΩY T for XY , XT and Y T planes.

Assume that a video clip is divided into B blocks, for all pixels in volume, we canobtain local patterns−→x XY forXY plane. With the codebook ΩXY , it is easy to searchthe corresponding index for −→x XY . The same method is utilized to all pixels in XYplane. Finally, the indexes are statistically formulated as one histogram. Histogramsof all blocks are concatenated into one feature HXY , which represents the appearancefeature of facial images.

For two other planes (XT and Y T ), histograms are extracted to the same wayof XY plane. For STCLQP, histograms from three orthogonal planes are concatenatedinto one feature vector. In summary, local pattern encoding procedure in the spatiotem-poral domain is presented in Algorithm 1.

4.2. Discriminative and compact codebook selection

Previously mentioned in Section 3.2, the weighted k-mean method produces an ef-ficient and fast way to vector quantization. However, the weighted k-mean method didnot consider the discriminative information. In order to design discriminative code-book, Fisher criterion and local pattern encoding are employed to each plane. In this

11

Algorithm 1: Local pattern encoding using codebook in the spatiotemporal do-main

input : A video sequence V , codebooks ΩXY ,ΩXT ,ΩY T , the number ofblocks B

output: Histogram of a video sequence H

Partition V into B blocks;for η ← XY , XY , Y T do

for b← 1 to B doOn the plane η, compute local patterns −→x of the b-th block using localsampling points;Map local pattern −→x to the codebook Ωη;Compute the histogram Hj ;ConcatenateHη = [H1, . . . ,HB ];

ConcatenateH = [HXY ,HXT ,HY T ];L1-normalizeH = H∑

H ;

section, we intend to produce ΩK∗,η for the plane η, where η ∈ XY,XT, Y T, andK∗ means the optimal clustering size for the plane η. Here, the procedure onXY planeis taken as an example. The same way are also utilized to two other planes. When thereis no ambiguity, η is omitted for clarity.

We suppose that we have a set of codebook size [K1, . . . ,KV ]. For the codebooksizeKv ,Kv ∈ K1, . . . ,KV , its corresponding codebook ΩKv

is calculated by usingvector quantization. With local pattern encoding method (in Section 4.1), we can obtainits corresponding features:

H = [H1, . . . ,HB ], (9)

whereB is number of blocks, andHb ∈ <Kv . Based on Fisher criterion, the differenceof the same class should be small while the difference of the various classes shouldbe large. Here we take Hb of the b-th block as an example. For a C class problem,let the similarities of different samples of the same expression compose the intra-classsimilarity, and those of samples from different expressions compose the extra-classsimilarity. The mean mI,b and the variance s2

I,b of intra-class similarities for eachblock can be computed by as follows:

mI,b =1

C

C∑c=1

2

Lc(Lc − 1)

Lc∑i=2

i−1∑j=1

d(Hc,ib , Hc,j

b ), (10)

s2I,b =

C∑c=1

Lc∑i=2

i−1∑j=1

(d(Hc,ib , Hc,j

b )−mI,b)2, (11)

where Hc,jb denotes the histogram extracted from the j-th sample and Hc,i

b denotes thehistogram extracted from the i-th sample of the c-th class, Lc is the sample number ofthe c-th class in the training set, and the subsidiary index means the b-th block. In the

12

same way, the mean mE,b and the variance s2E,b of the extra-class similarities for each

block can be computed by:

mE,b =2

C(C − 1)

C−1∑u=1

C∑v=u+1

1

LuLv

Lu∑i=1

Lv∑j=1

d(Hu,ib , Hv,j

b ), (12)

s2E,b =

C−1∑u=1

C∑v=u+1

Lu∑i=1

Lv∑j=1

(d(Hu,ib , Hv,j

b )−mE,b)2. (13)

Finally, the Fisher score for the codebook ΩKvcan be computed by:

w =

B∑b=1

(mI,b −mE,b)2

s2I,b + s2

I,b

. (14)

With Fisher criterion, the local histogram features are discriminative, if the meansof intra and extra classes are far apart and the variances are small. For each codebooksize, we obtain Fisher weight by using the above mentioned method. The optimalcodebook size K∗ and its corresponding codebook ΩK∗ are chosen with respect to themaximal Fisher weights. Our proposed approach is described in Algorithm 2.

4.3. Implementation of STCLQPGiven a video sequence, through component extraction (in Section 3.1), we can

obtain sign, magnitude and orientation components. For the sign component, compactand discriminative codebook Ωη of each plane is obtained by using an efficient vectorquantization and codebook selection. Subsequently, local patterns are mapped intoΩη , and finally generate features Hη . For sign component, features of three planesHXY , HXT and HY T are concatenated into one feature vector H(S). Features frommagnitude and orientation components,H(M) andH(O), are extracted to the same wayof the sign component.

Concatenation of histograms is commonly used to combine histograms of threecomponents. However, it would bring curse of dimensionality to classification. Alter-natively, we intend to use feature-level fusion on feature subspace. In our scheme, weapply supervised locality preserving projection [17] to sign, magnitude and orientationcomponents, thus obtaining feature space US , UM and UO for sign, magnitude and ori-entation components, respectively. Therefore, the final feature vectors from sign (S),magnitude (M) and orientation (O) components can be formulated as,

HSTCLQP = [U′

SH(S), U′

MH(M), U′

OH(O)]. (15)

5. Experiments

In the proposed method, we exploit the sign-based, magnitude-based and orientation-based information (in Section 3.1). For the orientation, it is found that the number ofGaussian kernels and quantification level are two critical parameters. Furthermore, thecodebook size (i.e. the number of clustering K in Section 4.2) and number of local

13

Algorithm 2: Discriminative and compact codebook selection for motion andspatial information

input : L training video sequence, number of local sampling points P ,threshold λ as well as a set of codebook size [K1, . . . ,KV ]

output: The codebooks ΩXY , ΩXT , and ΩY T for XY , XT and Y T planes

Generate unique local patterns Y = [−→y 1,−→y 2, . . . ,

−→y J ], where J = 2P ;for η ← XY , XY , Y T do

Initialize the zero weight Wη for local patterns Y;for l← 1 to L do

Compute local pattern −→x from each pixel on i plane using localsampling points;for j ← 1 to J do

if −→x = −→y j thenWη

j = Wηj + 1;

Compute the occurrence Wη of unique local patterns Y for the plane η;Sort Y corresponding to Wη in descend order;for k ← [K1, . . . ,KV ] do

Initialize the clustering as the first k local patterns from Y;Feed Y and Wη into k-means clustering;Learn the optimal centers Ψk using the Eq. 8;Compute the distance between Ψk and Y;for j ← 1 to J do

if d(Ψk,−→y j) < λ then

Assign the center index to −→y j using the nearest distance;

elseAssign the extra label k + 1 to −→y j ;

Generate the codebook Ωkη;Map the local patterns into Ωkη;Calculate the Fisher score wk using the Eq. 14;

Choose the optimal k∗ and codebook Ωk∗

η corresponding to the maximalFisher score from [wK1

, . . . , wKV];

pattern sampling (i.e. P and R as shown in Figure 2) are very important to STCLQP.In the experiments, Spontaneous Micro-expression Corpus (SMIC) [25], CASME [44]and CASME 2 [45] databases are used to evaluate the performance of STCLQP. Pre-liminary results of CLQP on texture classification and neonatal facial expression recog-nition were presented in [19].

5.1. Micro-expression databaseThe SMIC database consists of 16 subjects (6 females and 10 males) with 164

spontaneous micro-expressions (positive, negative and surprise) which were recorded

14

in a controlled scenario using 100 fps camera with resolution of 640 × 480. Here,two tasks in SMIC database are investigated: micro-expression detection (micro/non-micro) and recognition (positive/negative/surprise).

The CASME dataset contains spontaneous 1,500 facial movements filmed under60 fps camera. Among them, 195 micro-expressions were coded so that the first,peak and last frames were tagged. Referring to the work of [44], we select 171 fa-cial micro-expression videos that contain disgust, surprise, repression and tense micro-expressions.

The CASME 2 database includes 247 spontaneous facial micro-expressions recordedby a 200 fps camera. These samples are coded with the onset and offset frames, as wellas tagged with AUs and emotion. There are five classes of the micro-expressions inthis database: happiness, surprise, disgust, repression and others. One task in CASME2 database is investigated: micro-expression recognition for five classes.

In order to distinguish each task on three databases, we denote micro-expressiondetection and micro-expression recognition on the SMIC database as ‘detection’ and‘pos/neg/sur’, respectively, while for micro-expression recognition on the CASMEand CASME 2 as ‘4-class’ and ‘5-class’, respectively.

5.2. Experimental setup and protocol

We evaluate the proposed approach by using a leave-one-subject-out validation inall databases, in which samples from one subject are used as test and the rest are usedas training. In all experiments, facial landmarks of facial images are detected by us-ing Active Shape Models [6], and then all facial images are normalized and croppedinto the same size. The temporal interpolation method [53] is used to interpolate theframes of the high speed videos into 10 frames. All facial images are divided into8×8 blocks for ‘pos/neg/sur’ (SMIC), ‘4-class’ (CASME) and ‘5-class’ (CASME 2),while 5×5 blocks for ‘detection’ (SMIC). For a fair comparison, we use Support Vec-tor Machine with linear kernel as classifier in all experiments [4]. The optimal valueof the cost parameter was determined using the grid search strategy, where optimalvalues of cost parameter were searched exponentially in the range of [10−6, 106]. Fol-lowing [44], mean recognition accuracy is used to measure the performance. Sincemean recognition accuracy would allow for false bias, Area Under the Curve (AUC)for multi-class [16] and F-measure [24] are also considered as measurements to eval-uate the parameters of STCLQP and compare algorithms. For F-measurement, it isdefined as F = 1

C

∑Ci=1

2pi×ripi+ri

, where pi and ri are the precision and recall of the i-thclass, respectively, and C are the number of classes.

5.3. Analysis of orientation parameters

So far many works can be exploited to approximate dominant orientation θ in imageprocessing. Specifically, θ can be calculated by arctan hx∗I

hy∗I , where hx and hy arefilters used to approximate the differentiation operator along the image horizontal andvertical direction respectively. Possible choices for hx and hy include central differenceestimators of various orders and discrete approximations to the first derivative of theGaussian. In the experiments, we compare Gaussian filter with two commonly used

15

Table 2. Recognition accuracy of local orientation pattern using various filters, wherewe compare Gaussian filter with Gradient and Sobel filters. N and T represent thelevel of orientation estimation and quantization level, respective. Gaussian-Mask3 andGaussian-Mask5 represent Gaussian kernels are designed based on a mask patch with3× 3 and 5× 5 pixel sizes, respectively.

Filter [N , T ]SMIC CASME CASME2

detection pos/neg/sur 4-class 5-class

Gradient

[-, 2] 55.18 41.46 27.49 37.25[-, 4] 60.67 50.61 32.75 43.72[-, 8] 67.38 42.68 33.33 42.92[-, 16] 64.63 41.46 30.99 41.30

Sobel

[-, 2] 56.40 43.29 29.24 36.44[-, 4] 58.84 50.00 27.49 30.37[-, 8] 64.33 43.90 30.99 30.77[-, 16] 64.94 38.41 32.16 38.06

Gaussian-Mask3

[4, 2] 64.02 39.63 29.24 43.72[8, 2] 58.85 40.24 26.32 37.65[8, 4] 66.46 48.78 30.41 45.75

[16, 4] 66.77 49.39 31.58 49.80[16, 8] 70.73 46.34 31.58 40.08

Gaussian-Mask5

[4, 2] 63.72 42.68 25.73 38.46[8, 2] 58.84 42.68 26.32 36.03[8, 4] 68.90 48.78 29.82 47.37

[16, 4] 70.73 51.21 32.16 50.20[16, 8] 67.99 46.34 29.24 43.32

filters for hx and hy . One is an image gradient, where hx = [−101] and hy = [−101]T ,the other is Sobel operator with two 3× 3 kernels [13].

For Gaussian filter, T and N are two important parameters in orientation pattern.T plays a critical role in computing the neighborhood relationship for all filters. It alsodecides the quantification level for dominant orientation. In experiments, we test threequantization levels of 2, 4 and 8. On the other hand,N is the key parameter in Gaussianfilter, since dominant orientation estimation is dependent on N . We evaluate the effectof N on the performance when N = 4, 8 and 16. In order to suppress the influenceof codebook, we use ‘uniform pattern’ [29] to encode local patterns for orientationcomponent.

Table 2 shows recognition accuracy by using different dominant orientation esti-mation methods. These results give some interesting findings as follows:

(1) As seen from this table, Gradient filter’s method obtains the accuracies of67.38%, 50.61%, 33.33% and 43.72% for ‘detection’, ‘pos/neg/sur’, ‘4-class’ and‘5-class’, respectively, while for Sobel filter’s approach, 64.94%, 50%, 32.16% and38.06% for ‘detection’, ‘pos/neg/sur’, ‘4-class’ and ‘5-class’, respectively. Instead, ourmethod using Gaussian filters with Mask5 (Gaussian-Mask5) yields the highest accu-racies of 70.73%, 51.21%, 32.16% and 50.20% for ‘detection’, ‘pos/neg/sur’, ‘4-class’

16

and ‘5-class’, respectively, when the number of Gaussian filters N and quantizationlevel T are set to 16 and 4, respectively.

(2) It is found that using more Gaussian filters can increase recognition rates of alltasks. However, the deeper quantization level T cannot improve the performance, sincefeatures will be more sparse. In the implementation of orientation, it is found that theperformance of Gaussian filters achieves promising results when T is 4.

(3) Using bigger mask patch size (Gaussian-Mask5) for Gaussian filters can con-tribute better performance to ‘detection’, ‘pos/neg/sur’, ‘4-class’ and ‘5-class’ thanGaussian-Mask3, because using bigger mask patch size can capture sufficient pixelinformation for more accurate dominant orientation estimation.

These results demonstrate that Gaussian filter’s strategy achieves better perfor-mance than Gradient filter and Sobel filter. They further show that Gaussian filter’smethod can be appropriately used in estimating domain orientation. In the followingexperiments, the number of Gaussian filters N and the quantization level T are set to16 and 4, respectively.

5.4. Analysis of an efficient vector quantization

The purpose of this experiment aims to investigate how vector quantization (in Sec-tion 3.2) affects the performance of STCLQP in micro-expression analysis. Since thecodebook size is an important factor in vector quantization, we mainly investigate theinfluence of codebook size and compare our proposed efficient vector quantization withthe typical coding method, ‘uniform pattern’. For convenience, we use 8 neighborhoodsampling pixels around the center pixel with radius as 3 (i.e. (3, 8) as shown in Fig-ure 2(a)). Here, we set the same codebook size for all codebooks on three orthogonalplanes.

5.4.1. Influence of codebook sizeFigure 6 clearly shows the performance of three components with different code-

book sizes for ‘detection’, ‘pos/neg/sur’, ‘4-class’ and ‘5-class’, respectively. Fromthis figure, we have some findings as follows:

(1) For the sign component (in Figure 6(a)), the highest accuracy is achieved withthe increasing codebook size. And they are 67.38%, 57.93%, 40.94% and 50.20%when the codebook size is set to 30, 40, 50 and 70 for ‘detection’, ‘pos/neg/sur’, ‘4-class’ and ‘5-class’, respectively.

(2) For the magnitude component (in Figure 6(b)), it is noted that the codebookswith the large codebook size have better performance for ‘detection’ and ‘5-class’,while the codebook with the small codebook size offers promising result for ‘pos/neg/sur’.The accuracies are 65.24%, 41.46%, 44.44% and 44.53% when the codebook size is setto 50, 10, 10 and 80 for ‘detection’, ‘pos/neg/sur’, ‘4-class’ and ‘5-class’, respectively.

(3) For the orientation component (in Figure 6(c)), we find that the influence ofcodebook size is similar to sign component for ‘detection’ and ‘5-class’. For ‘pos/neg/sur’,the performance is achieved when the codebook size is 60; with the increasing code-book size, the performance is remained. From Figure 6(c), it can be seen that the resultsare 72.87%, 56.10%, 36.84% and 48.99% when the codebook size is set to 20, 70, 60and 50 for ‘detection’, ‘pos/neg/sur’, ‘4-class’ and ‘5-class’, respectively.

17

Figure 6. Recognition accuracies of sign, magnitude and orientation components withdifferent codebook size K (in Section 4.2), where ‘detection’ and ‘pos/neg/sur’ areconducted on SMIC, while ‘4-class’ on CASME and ‘5-class’ on CASME 2.

The results using vector quantization are shown in Tables 3 and 4. As seen fromTables 3 and 4, it is found that the optimal codebook size stays in range of [10 80],when sign, magnitude and orientation components achieve the best performance. Itfurther shows that the smallest codebook size cannot represent the statistical propertyof all bins, while large size could provide abundant information.

5.4.2. Comparison to ‘uniform pattern’For demonstrating vector quantization, we compare our method with ‘uniform pat-

tern’ in this section. We use ‘uniform pattern’ [29] to encode the sign, magnitudeand orientation components, respectively. The recognition accuracy and F1 score areshown in Tables 5 and 6, respectively.

Comparing Tables 4 and 5, it is seen that for sign component, the increasing rates ofrecognition accuracies using vector quantization are 0.61%, 4.27%, 3.51% and 3.74%for ‘detection’, ‘pos/neg/sur’, ‘4-class’ and ‘5-class’, respectively. As seen from Ta-ble 4, F1 scores have been increased at the increasing rate of 0.0424, 0.0051, 0.0627

18

Table 3. Recognition accuracy (%) of sign, magnitude and orientation componentsencoded by using vector quantization. The number in the bracket are codebook size.

ComponentSMIC CASME CASME2

detection pos/neg/sur 4-class 5-classSign 67.38 (30) 57.93 (40) 40.94 (50) 50.20 (70)

Magnitude 65.24 (50) 41.46 (10) 44.44 (10) 44.53 (80)Orientation 72.87 (20) 56.10 (70) 36.84 (60) 48.99 (50)

Table 4. F1 score of sign, magnitude and orientation components encoded by usingvector quantization. The number in the bracket are codebook size.


detection pos/neg/sur 4-class 5-classSign 0.6675 (30) 0.5808 (40) 0.386 (50) 0.4696 (70)

Magnitude 0.6506 (50) 0.4183 (10) 0.3997 (10) 0.3982 (80)Orientation 0.7262 (20) 0.5518 (70) 0.2945 (60) 0.4261 (50)

Table 5. Recognition accuracy (%) of sign, magnitude and orientation componentsencoded by using ‘uniform pattern’ [29].


detection pos/neg/sur 4-class 5-classSign 66.77 53.66 37.43 46.46

Magnitude 62.80 39.63 39.18 39.68Orientation 70.73 51.21 32.16 50.20

Table 6. F1 score of sign, magnitude and orientation components encoded by using‘uniform pattern’ [29].




and 0.0455 for ‘detection’, ‘pos/neg/sur’, ‘4-class’ and ‘5-class’, respectively.For magnitude component, the recognition accuracies using vector quantization is

increased at 2.44%, 1.83%, 5.26% and 4.85% for ‘detection’, ‘pos/neg/sur’, ‘4-class’and ‘5-class’, respectively. F1 scores are increased at 0.021, 0.0237, 0.1172 and 0.0476

19

Table 7. Recognition accuracy (%) of sign, magnitude and orientation componentsencoded by using vector quantization, where codebook size is 60.




for ‘detection’, ‘pos/neg/sur’, ‘4-class’ and ‘5-class’, respectively. Specifically for ‘4-class’ F1 score is substantially increased.

For orientation component, using vector quantization obtains promising perfor-mance at the increasing recognition rate of 2.14%, 4.89% and 4.68% for ‘detection’,‘pos/neg/sur’ and ‘4-class’, respectively. F1 scores are increased at 0.0337, 0.0207 and0.0587 for ‘detection’, ‘pos/neg/sur’ and ‘4-class’, respectively.

Furthermore, we show the recognition rate of vector quantization for sign, magni-tude and orientation, when codebook size is 60 in Tables 7. Comparing with Table 5,even using the similar number for both encoding methods, at most of cases, vectorquantization still performs better than ‘uniform pattern’.

These comparisons show that vector quantization can make original LBP-TOP bet-ter. Furthermore, they show STCLQP with small codebook size achieves better per-formance than ‘uniform pattern’. The small codebook size can save storage cost. Forexample, in the implementation, the dimension of sign component using ‘uniform pat-tern’ is 11,328 (8×8×3×59), while using codebook is 7,680 (8×8×3×40). Therefore,an advantage of vector quantization is to reduce the dimension and get more compactfeatures.

5.5. Evaluation of discriminative codebook selectionAs previously mentioned in Section 4.2, the proposed STCLQP can optimally

choose the discriminative codebook for each plane and each component. Next, wefurther examine the performance using codebook selection. The detailed results inthree tasks based on recognition accuracy are described as follows,

(1) In ‘detection’, the results are reported as 68.29%, 65.85%, and 73.78% for sign,magnitude and orientation components, respectively. Using fixed codebook size canobtain 67.38%, 65.24% and 72.87% for sign, magnitude and orientation components,respectively.

(2) In the task ‘pos/neg/sur’, the results are reported as 59.10%, 43.29%, and57.93% for sign, magnitude and orientation components, respectively, while usingfixed codebook size can only achieve the recognition rate of 57.93%, 41.46% and56.10% for sign, magnitude and orientation components, respectively.

(3) In ‘4-class’, the results are reported as 42.11%, 46.78%, and 40.94% for sign,magnitude and orientation components, respectively. One can obtain 40.94%, 44.44%and 36.84% using fixed codebook size for sign, magnitude and orientation components,respectively.

20

Table 8. Recognition accuracy (%) of each component and combination of componentsfor STCLQP.


detection pos/neg/sur 4-class 5-classSign (S) 70.43 59.10 43.86 52.55

Magnitude (M) 67.07 44.51 49.12 44.94Orientation (O) 74.08 59.10 43.27 52.18

S+M 70.04 56.10 55.56 54.01S+O 73.78 61.59 53.8 54.74M+O 68.90 59.15 54.39 53.28

S+M+O 75.31 64.02 57.31 58.39

(4) In ‘5-class’, the results are reported as 52.55%, 44.94%, and 50.20% for sign,magnitude and orientation components, respectively. One can obtain 50.20%, 44.53%and 48.99% using fixed codebook size for sign, magnitude and orientation components,respectively.

From comparative results in four tasks, we can see that discriminative codebookselection can considerably raise the performance of three components. This can partlybe explained by the present codebooks, which can provide discriminative informationto the features.

5.6. Effect of various neighboring sampling

In this experiment, we aim to investigate the effect of local pattern neighborhood,which is shown in Figure 2. For parameter setup, we employ discriminative vectorquantization and take orientation component into consideration.

We use 8 sampling pixels around a center pixel with radius as 3 (i.e. (3, 8) as shownin Figure 2(a)), the recognition accuracies of ‘detection’, ‘pos/neg/sur’, ‘4-class and ‘5-class’ are 73.78%, 57.93%, 40.94% and 50.20%, respectively. While we increase thenumber of sampling points yet keep the same radius (i.e. (3, 16)), the recognition ac-curacies are raised by 0.3%, 0.61%, 0.31% and 1.1% for ‘detection’, ‘pos/neg/sur’,‘4-class and ‘5-class’ respectively. Furthermore, Disc5 (24 sampling points, in Fig-ure 2(b)) is employed. The performance is further improved by 0.56%, 2.02% and0.88% for ‘pos/neg/sur’, ‘4-class’ and ‘5-class’ respectively. Therefore, increasingsampling points boosts the performance of micro-expression analysis. One reason isthat more sampling points provide more compact information and extract sufficientstructure around the center.

5.7. Analysis of components

Tables 8 and 9 show results for different components of STCLQP using Disc5. Itis found that sign and orientation components performs better than magnitude compo-nent in the most of cases except on CASME database, because CASME has distinctcharacteristic of low video rate and little motion intensity change. The results of us-ing only M were constantly lower than S+M in the most of cases. The best results

21

Table 9. F1 score of components and their combination in STCLQP.


detection pos/neg/sur 4-class 5-classSign (S) 0.7005 0.5908 0.4224 0.5143

Magnitude (M) 0.6699 0.4485 0.4534 0.3781Orientation (O) 0.7452 0.5899 0.4122 0.4701

S+M 0.7070 0.5608 0.5324 0.5366S+O 0.7340 0.6186 0.5052 0.5584M+O 0.6867 0.5853 0.53 0.5399

S+M+O 0.7402 0.6381 0.56 0.5835

Table 10. Comparison under recognition accuracy (%) with spatiotemporal featuredescriptors on SMIC, CASME and CASME 2, where we reproduce results of LBP-SIPusing leave-one-subject-out protocol.

MethodsSMIC CASME CASME2

detection pos/neg/sur 4-class 5-classLBP-TOP [54] 66.77 53.66 37.43 46.46STLMBP [18] 67.38 54.88 46.20 47.77LOCP-TOP [3] 61.59 51.22 31.58 48.91

CLBP-TOP (S+M) [31] 69.21 56.10 45.31 53.28Cuboids [8] 60.37 32.32 33.33 36.03

LBP-SIP [42] 55.49 44.51 36.84 46.56STCLQP 75.31 64.02 57.31 58.39

for STCLQP are achieved by using all three components. This finding agrees withPfister [31] and Guo [15] who found magnitude and sign useful for face analysis andtexture classification. It is interesting to note that even though the tasks are quite dif-ferent (texture recognition on visual data and spontaneous micro-expression analysis)the results of the component division experiments follow the same pattern. In general,S+M+O yields the best results, followed closely by S+O, S+M and M+O, and moredistantly by S.

5.8. Algorithm comparison

Previously mentioned, STCLQP achieves the best accuracies of 75.31%, 64.02%,57.31% and 58.39% for ‘detection’, ‘pos/neg/sur’, ‘4-class’ and ‘5-class’, respectively.In this section, we compare STCLQP with LBP-TOP, CLBP-TOP, local ordinary con-trast pattern (LOCP-TOP) [3], spatiotemporal local monogenic binary pattern (STLMBP) [18],spatiotemporal cuboids descriptor (Cuboids) [8] and LBP-SIP [42].

Results on recognition accuracy, average F1-score and mean AUC on three databasesare reported in Table 10, Table 11 and Table 12, respectively. As can be seen from three

22

Table 11. F1 score using spatiotemporal feature descriptors on SMIC, CASME andCASME 2, where we reproduce results of LBP-SIP using leave-one-subject-out proto-col.




LBP-SIP [42] 0.5434 0.4492 0.3327 0.4480STCLQP 0.7402 0.6381 0.5 0.5836

Table 12. Area Under the Curve (%) using spatiotemporal feature descriptors on SMIC,CASME and CASME 2, where we reproduce results of LBP-SIP using leave-one-subject-out protocol.




LBP-SIP [42] 55.49 55.62 60.80 58.29STCLQP 75.31 67.25 68.93 66.57

tables, cuboid feature method performs poorly in four tasks. Instead, spatiotemporalfeature descriptors work better than cuboid features.

(1) We can see that STCLQP outperforms CLBP-TOP, although both methods em-ploy sign and magnitude components. STCLQP alternatively uses codebook to replace‘uniform pattern’, so that it can obtain more compact and discriminative informationthan CLBP-TOP. In addition, orientation component is included in STCLQP.

(2) As well, we can see that STCLQP beats STLMBP. As we know, STLMBP ex-ploits the similar way to obtain orientation and sign features. But one main difference isSTLMBP first used monogenic filter to extract new images. Comparing with STLMBPand CLBP-TOP, we see that using codebook can strengthen micro-expression features.

(3) We compare STCLQP with LBP-SIP [42] on the three databases. It is foundthat our method appears to outperform LBP-SIP on all databases. In STCLQP, it ex-ploits more information while LBP-SIP aims to improve efficiency of LBP-TOP usingintersect neighbourhoods.

We provide confusion matrices for four tasks on three databases using STCLQP, as

23

shown in Figure 7. On micro-expression detection task, STCLQP easily detects mostof micro-expression videos out of non-micro expression videos. For micro-expressionrecognition on SMIC database, STCLQP achieves the lowest accuracy for recognizing‘positive’ class, and STCLQP has the same performance for recognizing ‘negative’ and‘surprise’ classes.

On CASME and CASME 2 databases, micro-expression classes become more com-plicated and diversity, for example, CASME 2 database contains five classes. From theconfusion matrix on CASME dataset, it is found that STCLQP works the best on rec-ognizing ‘disgust’, followed closely by ‘tense’, ‘repression’ and ‘surprise’. Amongfour classes, ‘repression’ and ‘tense’ are two easily confused micro-expressions forSTCLQP. Furthermore, it is found that samples from ‘disgust’ class is falsely classifiedto ‘tense’ class. On CASME 2 dataset, ‘repression’ and ‘disgust’ are the most difficultmicro-expressions to recognize, while ‘surprise’ and ‘others’ classes are easy to rec-ognize. It is found that samples of ‘happiness’, ‘disgust’ and ‘repression’ are falselyclassified into ‘others’ micro-expression category. In [45], Yan et al. annotated micro-expression video clips that cannot be labeled according to FACS into one class. Thisannotation may cause STCLQP falsely classifying ‘happiness’, ‘disgust’ and ‘repres-sion’ into ‘others’ class.

5.9. Discussion

Overall, for spontaneous micro-expressions, the effects of codebook size, sam-pling structure and components are examined on three spontaneous micro-expressiondatabases. Tables 10, 11 and 12 compare the proposed method with other methodsunder different performance evaluation metric. We see that STCLQP provides consid-erably improvement for recognizing spontaneous micro-expressions.

Moreover, the codebook size substantially affects the performance of STCLQP. Itis also dependent on the neighboring structures. Fortunately, less bins in a codebookcan achieve better accuracy than by using uniform pattern (59 bins). The advantage isthat it automatically learns the statistical patterns from the dataset. In other words, thecodebook is specific to the dataset. We further prove that the discriminative codebookselection strategy can substantially raise the performance in recognizing spontaneousmicro-expressions. It also avoids from the manual parameter selection from manyparameters. Additionally, we also see that STCLQP appears to be explicitly improvedby using more local pattern neighborhoods.

Additionally, we report the effect of all components and different combination tomicro-expression analysis. It is interesting to find that all three components can im-prove the performance of spontaneous micro-expression analysis. We also find thatthe orientation component can perform better than sign component in some cases, forexample, in micro-expression detection. This is first time to combine those three com-ponents for LBP-TOP.

6. Conclusion

In recent years, facial micro-expression analysis has been an active and challengingresearch in psychology and computer vision, because the durations of micro-expressions

24

Figure 7. Confusion matrix of each class using STCLQP on (a) ‘detection’ on SMIC,(b) ‘pos/neg/sur’ on SMIC, (c) ‘4-class’ on CASME and (d) ‘5-class’ on CASME 2.

are very short and only subtle changes are involved. We proposed a spatiotemporalcompleted local quantized patterns, which exploits more useful information and learnscompact and discriminative codebook for micro-expression analysis. In our approach,orientation component is considered as complementary to sign and magnitude compo-nents for LBP features. Furthermore, an efficient vector quantization and Fisher crite-rion are utilized to obtain the compact and discriminative codebook. Finally, featuresof three components are concatenated into one feature based on feature subspace. Wedemonstrated the proposed feature descriptor is efficient and can provide advantageousperformance by comparing with other methods on three spontaneous micro-expressiondatabases.

7. Acknowledgment

This work was supported by the Academy of Finland and Infotech Oulu.

8. Reference

[1] Ahonen, T., Hadid, A. and Pietikainen, M.: Face description with local binarypatterns: application to face recognition. IEEE Transactions on Pattern Analysisand Machine Intelligence, 28(12): 2037-2041, 2006.

25

[2] Belongie, S., Malik, J. and Puzicha, J.: Shape matching and object recognitionusing shape contexts. IEEE Transactions on Pattern Analysis and Machine Intel-ligence, 24(4): 509-522, 2002.

[3] Chan, C., Goswami, B., Kittler, J. and Christmas, W.: Local ordinal contrast pat-tern histograms for spatiotemporal, lip-based speaker authentication. IEEE Trans-actions on Information Forensics and Security, 7(2): 602-612, 2012.

[4] Chang, C. and Lin, C.: A library for support vector machines. ACM Transactionson Intelligent Systems and Technology, 2(27): 1-27, 2011.

[5] Chen, J., Shan, C., He, C., Zhao, G., Pietikainen, M., Chen, X. and Gao. W.:WLD: A robust local image descriptor. IEEE Transactions on Pattern Analysisand Machine Intelligence, 32(9): 1705-1720, 2010.

[6] Cootes, T., Taylor, C., Cooper, D. and Graham, J.: Active shape models-theirtraining and application. Computer Vision and Image Understanding. 61(1): 38-59, 1995.

[7] Davison A., Yap, M., Costen, N., Tan, K., Lansley, C. and Leightley, D.: Micro-facial movements: an investigation on spatio-temporal descriptors. In Proceedingof ECCV workshop on Spontaneous Behavior Analysis, pp. 111-123, 2014.

[8] Dollar, P., Rabaud, V., Cottrell, G. and Belongie, S.: Behavior recognition viasparse spatio-temporal features. In Proceedings of IEEE International Workshopon Visual Surveillance and Performance Evaluation of Tracking and Surveillance,pp. 65-72, 2005.

[9] Ekman, P.: Lie catching and micro expressions. In Ed. Clancy Martin, editor. ThePhilosophy of Deception, pp. 118-133, 2009.

[10] Feng, X., Lai, Y., Mao, X., Peng, J., Jiang, X. and Hadid, A.: Extracting localbinary patterns from image key points: Application to automatic facial expressionrecognition. In Proceedings of Scandinavian Conference on Image Analysis, pp.339-348, 2013.

[11] Frank, M., Herbasz, M., Sinuk, K., Keller, A. and Nolan, C.: I see how you feel:Training laypeople and professionals to recognize fleeting emotions. In the annualmeeting of International Communication Association, 2009.

[12] Gizatdinova, Y. and Surakka, V.: Feature-based detection of facial landmarksfrom neutral and expressive facial images. IEEE Transactions on Pattern Analysisand Machine Intelligence, 28(1): 135-139, 2006.

[13] Gonzalez, R. and Woods, R.: Digital Image Processing. Addison Wesley, pp.414-428, 1992.

[14] Gu, Q., Li, Z. and Han, J.: Generalized fisher score for feature selection. In Pro-ceedings of the Conference on Uncertainty in Artificial Intelligence, pp. 266-273,2011.

26

[15] Guo, Z., Zhang, L. and Zhang, D.: A completed modeling of local binary patternoperator for texture classification. IEEE Transactions on Image Processing, 19(6):1657-1663, 2010.

[16] Hand, D. and Till, R.: A Simple Generalisation of the Area Under the ROC Curvefor Multiple Class Classification Problems. Machine Learning, 45(2): 171186,2001.

[17] He, X. and Niyogi, P.: Locality preserving projections. In Proceedings of NeuralInformation Processing Systems, pp. 153-160, (2003).

[18] Huang, X., Zhao, G., Zheng, M. and Pietikainen, M.: Spatiotemporal local mono-genic binary patterns for facial expression recognition. IEEE Signal ProcessingLetter, 19(5): 243-246, 2012.

[19] Huang, X., Zhao, G., Hong, X. and Pietikainen, M.: Texture description withcompleted local quantized patterns. In Proceedings of Scandinavian Conferenceon Image Analysis, pp. 1-10, 2013.

[20] Hussain, S. and Triggs, B.: Visual recognition using local quantized patterns. InProceedings of European Conference on Computer Vision, pp. 716-729, 2012.

[21] Hussain, S., Napoleon, T. and Jurie, F.: Face recognition using local quantizedpatterns. In Proceedings of British Conference on Computer Vision, pp. 1-11,2012.

[22] Iosifidis, A., Tefas, A. and Pitas, I.: Discriminant bag of words based representa-tion for human action recognition. Pattern Recognition Letters, 49(1): 185-192,2014.

[23] Jain, S., Hu, C., and Aggarwal, J.: Facial expression recognition with temporalmodeling of shapes. In Proceedings of IEEE International Conference on Com-puter Vision, pp. 1642-1649, 2011.

[24] Kang, F., Jin, R. and Sukthankar, R.: Correlated label propagation with applica-tion to multi-label learning. In Proceedings of IEEE International Conference onComputer Vision and Pattern Recognition, pp. 1719-1726, 2006.

[25] Li, X., Pfister, T., Huang, X., Zhao, G. and Pietikainen, M.: A spontaneousmicro-expression database: Inducement, collection and baseline. In Proceedingsof IEEE International Conference on Automatic Face and Gesture Recognition,pp. 1-6, 2013.

[26] Littlewort, G., Bartlett, M., Fasel, I., Susskind, J. and Movellan, J.: Dynamics offacial expression extracted automatically from video. Image and Vision Comput-ing, 24(6): 615-625, 2006.

[27] Liu, Y., Ma, C., Fu, Q., Fu, X., Qin, S. and Xie, L.: A sketch-based approach forinteractive organization of video clips. ACM Transactions on Multimedia Com-puting, Communication, and Applications, 11(1): 2:1-2:21, 2014.

27

[28] Nanni, L, Brahnam, S. and Lumini, A.: Local ternary patterns from three orthog-onal planes for human action clssification. Expert Systems with Applications,38(5): 5125-5128, 2011.

[29] Ojala, T., Pietikainen, M. and Maenpaa, T.: Multiresolution gray-scale and rota-tion invariant texture classification with local binary patterns. IEEE Transactionson Pattern Analysis and Machine Intelligence, 24(7): 971-987, 2002.

[30] Pfister, T., Li, X., Zhao, G. and Pietikainen, M.: Recognising spontaneous facialmicro-expressions. In Proceedings of IEEE International Conference on Com-puter Vision, pp. 1449-1456, 2011.

[31] Pfister, T., Li, X., Zhao, G. and Pietikainen, M.: Differentiating spontaneous fromposed facial expressions within a generic facial expression recognition frame-work. In Proceedings of IEEE International Conference on Computer Vision, pp.868-875, 2011.

[32] Polikovsky, S., Kameda, Y. and Ohta, Y.: Facial micro-expressions recognitionusing high speed camera and 3-D gradient descriptor. In Proceedings of Interna-tional Conference on Crime Detection and Prevention, pp. 1-6, 2009.

[33] Polikovsky, S., Kameda, Y. and Ohta, Y.: Detection and measurement of facialmicro-expression characteristics for psychological analysis. IEICE Technical Re-port, 110(98): 57-64, 2009.

[34] Pietikainen, M., Hadid, A., Zhao, G., and Ahonen, T.: Computer Vision usingLocal Binary Patterns. Springer, 2011.

[35] Ruiz-Hernandez, J. and Pietikainen, M.: Encoding local binary patterns using re-parameterization of the second order Gaussian jet. In Proceedings of InternationalConference on Automatic Face and Gesture Recognition, pp. 1-6, 2013.

[36] Shan, C., Gong, S., and McOwan, P.: Facial expression recognition based on lo-cal binary pattern: A comprehensive study. Image and Vision Computing, 27(6):803-816, 2009.

[37] Shin, G., and Chun, J.: Spatio-temporal facial expression recognition using op-tical flow and HMM. In: Software Engineering, Artificial Intelligence, Network,and Parallel/Distributed Computing, pp. 27-38, 2008.

[38] Tzimiropoulos, G., Zafeiriou, S. and Pantic, M.: Subspace learning from imagegradient orientations. IEEE Transactions on Pattern Analysis and Machine Intel-ligence, 34(12): 2454-2466, 2012.

[39] Vu, N.G. and Caplier, A.: Mining patterns of orientations and magnitudes for facerecognition. In Proceedings of International Joint Conference on Biometrics, pp.1-8, 2011.

[40] Wang, S., Yan, W., Li, X., Zhao, G. and Fu, X.: Micro-expression recognitionusing dynamic textures on tensor independent color space. In Proceedings of In-ternational Conference on Pattern Recognition, pp. 4678-4683, 2014.

28

[41] Wang, S., Yan, W., Zhao, G. and Fu, X.: Micro-expression recognition usingrobust principal component analysis and local spatiotemporal directional features.In Proceedings of ECCV workshop on Spontaneous Behavior Analysis, pp. 325-338, 2014.

[42] Wang, Y., See, J., Phan R. and Oh, Y.: LBP with six intersection points: Re-ducing redundant information in LBP-TOP for micro-expression recognition. InProceedings of Asian Conference on Computer Vision, pp. 525-537, 2014.

[43] Xie, S., Shan, S., Chen, X. and Chen, J.: Fusing local patterns of Gabor magnitudeand phase for face recognition. IEEE Transaction on image processing, 19(5):1349-1361, 2010.

[44] Yan, W., Wu, Q., Liu, Y., Wang, S. and Fu, X.: CASME database: A dataset ofspontaneous micro-expressions collected from neutralized faces. In Proceedingsof IEEE International Conference on Automatic Face and Gesture Recognition,pp. 1-7, 2013.

[45] Yan, W., Li, X., Wang, S., Zhao, G., Liu, Y., Chen, Y. and Fu, X.: CASME II:An improved spontaneous micro-expression database and the baseline evaluation.PLOS ONE, 9(1): 1-8, 2014.

[46] Yan, W., Wang, S., Chen, Y., Zhao, G. and Fu, X.: Quantifying micro-expressionswith constraint local model and local binary pattern. In Proceedings of ECCVworkshop on Spontaneous Behavior Analysis, pp. 296-305, 2014.

[47] Yang, Y.: An evaluation of statistical approaches to text categorization. Informa-tion Retrieval, 1(1-2): 69-90, 1999.

[48] Yang, J., Jiang, Y., Hauptmann, A. and Ngo, C.: Evaluating bag-of-the-visual-words representations in scene classification. In Proceedings of the InternationalWorkshop on Multimedia Information Retrieval, pp. 192-206, 2007.

[49] Yang, H. and Wang, Y.: A LBP-based face recognition method with Hammingdistance constraint. In Proceeding of International Conference on Image andGraphics, pp. 645-649, 2007.

[50] Ylioinas, J., Hadid, A., Guo, Y. and Pietikainen, M.: Efficient image appearancedescription using dense sampling based local binary patterns. In Proceedings ofAsian Conference on Computer Vision, pp. 375-388, 2012.

[51] Zhang, Z., Lyons, M., Schuster, M. and Akamatsu S.: Comparison betweengeometry-based and Gabor-wavelets-based facial expression recognition usingmulti-layer perceptron. In Proceedings of International Conference on AutomaticFace and Gesture Recognition, pp. 454-459, 1998.

[52] Zhang, B., Shan, C., Chen, X. and Gao, W.: Histogram of Gabor phase pat-terns (HGPP): A novel object representation appraoch for face recognition. IEEETransactions on Image Processing, 16(1): 57-68, 2007.

29

[53] Zhou, Z., Zhao, G., Guo, Y. and Pietikainen, M.: An image-based visual speechanimation system. IEEE Transactions on Circuits and Systems for Video Tech-nology, 22(10): 1420-1432, 2012.

[54] Zhao, G. and Pietikainen, M.: Dynamic texture recognition using local binarypattern with an application to facial expressions. IEEE Transactions on PatternAnalysis and Machine Intelligence, 29(6): 915-928, 2007.

[55] Zhao, G., Ahonen, T., Matas, J. and Pietikainen, M.: Rotation invariant imageand video description with local binary pattern features. IEEE Transactions onImage Processing, 21(4): 1465-1467, 2012.

30

spontaneous facial micro-expression analysis using ......efﬁciency for facial expression...

Documents