
Universidade Estadual de Campinas
Instituto de Computação


Luciana Barbieri

Facial Microexpression Recognition Based on Descriptor and Classifier Combinations

Reconhecimento de Microexpressões Faciais Baseado em Combinações de Descritores e Classificadores

CAMPINAS
2018


Luciana Barbieri

Facial Microexpression Recognition Based on Descriptor and Classifier Combinations

Reconhecimento de Microexpressões Faciais Baseado em Combinações de Descritores e Classificadores

Thesis presented to the Institute of Computing of the University of Campinas in partial fulfillment of the requirements for the degree of Master in Computer Science.

Dissertação apresentada ao Instituto de Computação da Universidade Estadual de Campinas como parte dos requisitos para a obtenção do título de Mestra em Ciência da Computação.

Supervisor/Orientador: Prof. Dr. Hélio Pedrini

Este exemplar corresponde à versão final da Dissertação defendida por Luciana Barbieri e orientada pelo Prof. Dr. Hélio Pedrini.

CAMPINAS
2018


Agência(s) de fomento e nº(s) de processo(s): Não se aplica.

Ficha catalográfica
Universidade Estadual de Campinas

Biblioteca do Instituto de Matemática, Estatística e Computação Científica
Ana Regina Machado - CRB 8/5467

Barbieri, Luciana, 1971-
B234f    Facial microexpression recognition based on descriptor and classifier combinations / Luciana Barbieri. – Campinas, SP : [s.n.], 2018.

         Orientador: Hélio Pedrini.
         Dissertação (mestrado) – Universidade Estadual de Campinas, Instituto de Computação.

         1. Aprendizado de máquina. 2. Reconhecimento de expressões faciais. 3. Vídeo digital. 4. Fusão de classificadores. 5. Processamento de imagem. I. Pedrini, Hélio, 1963-. II. Universidade Estadual de Campinas. Instituto de Computação. III. Título.

Informações para Biblioteca Digital

Título em outro idioma: Reconhecimento de microexpressões faciais baseado em combinações de descritores e classificadores
Palavras-chave em inglês:
Machine learning
Facial expressions recognition
Digital video
Classifiers fusion
Image processing
Área de concentração: Ciência da Computação
Titulação: Mestra em Ciência da Computação
Banca examinadora:
Hélio Pedrini [Orientador]
Fátima de Lourdes dos Santos Nunes Marques
Sandra Eliza Fontes de Avila
Data de defesa: 19-01-2018
Programa de Pós-Graduação: Ciência da Computação



Universidade Estadual de Campinas
Instituto de Computação


Luciana Barbieri

Facial Microexpression Recognition Based on Descriptor and Classifier Combinations

Reconhecimento de Microexpressões Faciais Baseado em Combinações de Descritores e Classificadores

Banca Examinadora:

• Prof. Dr. Hélio Pedrini
  IC/UNICAMP

• Profa. Dra. Fátima de Lourdes dos Santos Nunes Marques
  EACH/USP

• Profa. Dra. Sandra Eliza Fontes de Avila
  IC/UNICAMP

A ata da defesa com as respectivas assinaturas dos membros da banca encontra-se no processo de vida acadêmica do aluno.

Campinas, 19 de janeiro de 2018


O poeta é um fingidor.
Finge tão completamente
Que chega a fingir que é dor
A dor que deveras sente.

(Fernando Pessoa)


Acknowledgements

• To my beloved husband, Kleber, and my treasured daughters, Letícia and Gabriela, for all their love, patience, encouragement and unfailing support. You are my inspiration.

• To my parents, Natal (in memoriam) and Aldaiza, for their endless love and care, and for being such perfect role models for me and my daughters.

• To my advisor, Prof. Hélio Pedrini, for all the shared knowledge, patience, support and dedication. My sincere gratitude.


Resumo

O reconhecimento de microexpressões em vídeos tem importantes aplicações práticas nas áreas de psicoterapia, investigações forenses, segurança e negociação, entre outros, por fornecer indícios significativos para a identificação de emoções escondidas. Devido a sua curtíssima duração, estas expressões são bastante difíceis de se perceber a olho nu, de forma que seu reconhecimento automático é uma evolução natural em sua área de conhecimento. Pesquisas em aprendizado de máquina aplicadas ao reconhecimento de microexpressões são relativamente recentes, entretanto, os resultados iniciais são promissores, apesar dos desafios impostos. Trabalhos de pesquisa anteriores utilizaram principalmente descritores e classificadores individuais para o reconhecimento de microexpressões. Este trabalho apresenta e avalia uma metodologia que emprega diferentes descritores como entrada para classificadores independentes. Propõe-se também uma extensão a um descritor de geometria da face pré-existente, avaliando-o por meio de múltiplas técnicas de aprendizado de máquina, entre elas os classificadores do tipo Máquinas de Vetores de Suporte (SVM), Florestas Aleatórias (RF) e K-Vizinhos Mais Próximos (KNN). A saída dos classificadores independentes é combinada por meio de técnicas de votação e empilhamento de classificadores. Os resultados experimentais realizados em duas bases de dados públicas mostram uma melhoria significativa nas taxas de acerto dos algoritmos de combinação de classificadores em relação aos classificadores individuais, superando o estado-da-arte no reconhecimento de microexpressões com valores de F1-score de 68,66% e 64,18% para as bases de dados CASME II e SMIC HS, respectivamente.


Abstract

Microexpression recognition in videos has important applications in psychotherapy, forensics, homeland security and negotiation, among others, for providing significant clues for hidden emotion detection. Since these expressions are very brief and difficult to detect with the naked eye, automatic recognition is a natural step forward in the field. Research on machine learning applied to their recognition is relatively new; however, initial results are promising, despite the challenges involved. Previous research works have mostly applied single descriptors and classifiers to recognize microexpressions. This work presents and evaluates a methodology that applies different descriptors as input to standalone classifiers. An extension to an existing facial geometric descriptor is also proposed and evaluated using different machine learning techniques, such as Support Vector Machines (SVM), Random Forests (RF) and K-Nearest Neighbors (KNN) classifiers. The output of the standalone classifiers is combined through voting and stacking techniques. Results obtained on two public datasets indicate that significant improvement is achieved with the combined classification algorithms over the standalone classifiers, with final scores outperforming state-of-the-art microexpression recognition methods, with F1-score values of 68.66% and 64.18% for the CASME II and SMIC HS datasets, respectively.


List of Figures

1.1 Frames of a microexpression video clip from the Spontaneous Micro-Expression (SMIC) Dataset [44].

2.1 Facial Action Unit examples.
2.2 Geometric features introduced by Saeed et al.

3.1 High level flow diagram for the microexpression recognition method.
3.2 Flow diagram for microexpression preprocessing.
3.3 Riesz pyramid phase-based motion magnification results.
3.4 Temporal Interpolation Model applied to a microexpression video segment, doubling its length.
3.5 The 68 facial landmarks detected by DLib and OpenFace (figure adapted from [70]).
3.6 Samples of facial landmarks detected by DLib for the SMIC dataset.
3.7 Samples of facial landmarks detected by OpenFace from individual frames of the SMIC dataset.
3.8 Samples of facial landmarks tracked by OpenFace from frame sequences of the SMIC dataset.
3.9 14 landmarks + 21 distances geometric features.
3.10 18 landmarks + 27 distances geometric features.
3.11 51 landmarks + 35 distances geometric features.
3.12 Lucas-Kanade optical flow and corresponding histograms of optical flow calculated on a microexpression video segment.
3.13 Classifier combination diagram (figure adapted from [65]).
3.14 SMIC HS dataset [44] frame sequence containing a surprise microexpression.
3.15 CASME II dataset [88] frame sequence containing a disgust microexpression.

4.1 ME recognition results on the SMIC HS and CASME II datasets using the geometric descriptor built with different sets of facial landmark locations and distances, SVM classifier and k-fold cross-validation protocol.
4.2 ME recognition results on the SMIC HS and CASME II datasets using the geometric descriptor built with different feature sets and facial landmark detectors, SVM classifier and k-fold cross-validation protocol.
4.3 ME recognition results on the SMIC HS and CASME II datasets using the geometric descriptor computed from magnified video clips, SVM classifier and k-fold cross-validation protocol.
4.4 ME recognition results on the SMIC HS and CASME II datasets using the AU descriptor built with different AU feature sets, different classifiers and k-fold cross-validation protocol.
4.5 ME recognition results on the SMIC HS and CASME II datasets using the AU descriptor built with AU presence and intensity feature sets, different classifiers and k-fold cross-validation protocol.
4.6 ME recognition results on the SMIC HS and CASME II datasets using the AU descriptor with AU intensity feature set, with different classifiers and k-fold cross-validation protocol.
4.7 ME recognition results on the SMIC HS and CASME II datasets using the Action Unit descriptor computed from magnified video clips with k-fold cross-validation protocol.
4.8 ME recognition results on the SMIC HS and CASME II datasets using the HOG3D descriptor calculated with different sets of interest points and with dense sampling, SVM classifier and k-fold cross-validation protocol.
4.9 ME recognition results on the SMIC HS and CASME II datasets using the HOG3D descriptor calculated with different quantization types, SVM classifier and k-fold cross-validation protocol.
4.10 ME recognition results on the SMIC HS and CASME II datasets using the HOG3D descriptor computed from magnified video clips, SVM classifier and k-fold cross-validation protocol.
4.11 ME recognition results on the SMIC HS and CASME II datasets using the original and spatial WLD descriptors, with k-fold cross-validation protocol.
4.12 ME recognition results on the SMIC HS and CASME II datasets using the spatial WLD descriptor built with different block sizes, KNN classifier for SMIC HS and SVM classifier for CASME II and k-fold cross-validation protocol.
4.13 ME recognition results on the SMIC HS and CASME II datasets using the original and spatial WLD descriptors built with different (T, M, S) parameter combinations and k-fold cross-validation protocol.
4.14 ME recognition results on the SMIC HS and CASME II datasets using the WLD descriptor computed from magnified video clips with k-fold cross-validation protocol.
4.15 ME recognition results with and without temporal interpolation on the SMIC HS and CASME II datasets using the LBP-TOP descriptor (single block) with SVM classifier and k-fold cross-validation protocol.
4.16 ME recognition results on the SMIC HS and CASME II datasets using the LBP-TOP descriptor built with different block sizes, SVM classifier and k-fold cross-validation protocol.
4.17 ME recognition results on the SMIC HS and CASME II datasets using the LBP-TOP descriptor computed from magnified video clips with SVM classifier and k-fold cross-validation protocol.
4.18 ME recognition results on the SMIC HS and CASME II datasets using the HOF descriptor calculated by different methods, with k-fold cross-validation protocol.
4.19 ME recognition results on the SMIC HS and CASME II datasets using the sparse and dense HOF descriptor built with different numbers of bins, with k-fold cross-validation protocol.
4.20 ME recognition results on the SMIC HS and CASME II datasets using the HOF descriptor computed from magnified video clips with k-fold cross-validation protocol.


List of Tables

4.1 Best ME recognition results on the SMIC HS dataset using the geometric descriptor built with 14 landmark locations and subsets of the 21 distance set, SVM classifier and k-fold cross-validation protocol.
4.2 Best ME recognition results on the CASME II dataset using the geometric descriptor built with subsets of the 21 distance set, SVM classifier and k-fold cross-validation protocol.
4.3 ME recognition results on the SMIC HS and CASME II datasets using the HOG3D descriptor built from 8 facial landmarks detected by different tools, SVM classifier and k-fold cross-validation protocol.
4.4 ME recognition results on the SMIC HS and CASME II datasets using the HOF descriptor built from 8 facial landmarks detected by different tools, SVM classifier and k-fold cross-validation protocol.
4.5 Best results achieved with each single descriptor for the SMIC HS dataset.
4.6 Best results achieved with each single descriptor for the CASME II dataset.
4.7 Best ME recognition results on the SMIC HS dataset using concatenated descriptors, SVM classifier and k-fold cross-validation protocol.
4.8 Best ME recognition results on the CASME II dataset using concatenated descriptors, SVM classifier and k-fold cross-validation protocol.
4.9 Best results achieved with individual descriptor/classifier pairs for the SMIC HS dataset.
4.10 Best results achieved with individual descriptor/classifier pairs for the CASME II dataset.
4.11 Voting results using 12 descriptor/classifier pairs.
4.12 Best ME recognition results on the SMIC HS dataset using the hard majority voting classifier combination method and k-fold cross-validation.
4.13 Best ME recognition results on the SMIC HS dataset using the hard weighted voting classifier combination method and k-fold cross-validation.
4.14 Best ME recognition results on the CASME II dataset using the hard majority voting classifier combination method and k-fold cross-validation.
4.15 Best ME recognition results on the CASME II dataset using the hard weighted voting classifier combination method and k-fold cross-validation.
4.16 Best ME recognition results on the SMIC HS dataset using the voting classifier combination method and LOSO cross-validation.
4.17 Best ME recognition results on the CASME II dataset using the voting classifier combination method and LOSO cross-validation.
4.18 Stacking results using 12 descriptor/classifier pairs.
4.19 Best ME recognition results on the SMIC HS dataset using the stacking classifier combination method and k-fold cross-validation.
4.20 Best ME recognition results on the CASME II dataset using the stacking classifier combination method and k-fold cross-validation.
4.21 Best ME recognition results on the SMIC HS dataset using the stacking classifier combination method and two-level LOSO cross-validation.
4.22 Best ME recognition results on the CASME II dataset using the stacking classifier combination method and two-level LOSO cross-validation.
4.23 Proposed methods compared to the literature for microexpression recognition.


List of Abbreviations

AU        Action Units
Bi-WOOF   Bi-Weighted Oriented Optical Flow
CASME II  Chinese Academy of Sciences Micro-Expression II Dataset
CNN       Convolutional Neural Network
EVM       Eulerian Video Magnification
FACS      Facial Action Coding System
FPS       Frames per Second
GLCM      Gray-Level Co-occurrence Matrices
HIGO      Histogram of Image Gradient Orientation
HOF       Histograms of Optical Flow
HOG       Histograms of Oriented Gradients
HOG3D     Histograms of Oriented 3D Spatio-Temporal Gradients
HS        High Speed
ISTLMBP   Improved Spatio-Temporal Local Monogenic Binary Pattern
KNN       K-Nearest Neighbors
LBP       Local Binary Patterns
LBP-SIP   Local Binary Patterns with Six Intersection Points
LBP-TOP   Local Binary Patterns on Three Orthogonal Planes
LOO       Leave-One-Out
LOSO      Leave-One-Subject-Out
LSDF      Local Spatio-Temporal Directional Features
MCFI      Motion-Compensated Frame Interpolation
ME        MicroExpression
METT      Micro Expression Training Tool
MKL       Multiple Kernel Learning
NIR       Near Infrared
NN        Nearest Neighbors
PCA       Principal Component Analysis
RF        Random Forests
RGB       Red Green Blue
RPCA      Robust Principal Component Analysis
SIFT      Scale-Invariant Feature Transform
SMIC      Spontaneous Micro-Expression Dataset
STCLQP    Spatio-Temporal Completed Local Quantized Patterns
SVM       Support Vector Machines
TICS      Tensor Independent Color Space
TIM       Temporal Interpolation Model
UFO-MKL   Ultra-Fast Multiple Kernel Learning
VIS       Visual Camera
VLBP      Volume Local Binary Patterns
WLD       Weber Local Descriptor


Contents

1 Introduction
  1.1 Problem and Motivation
  1.2 Objectives and Limitations
  1.3 Research Questions
  1.4 Contributions
  1.5 Text Organization

2 Background
  2.1 Theoretical Concepts
    2.1.1 Microexpressions
    2.1.2 Preprocessing
    2.1.3 Descriptors
    2.1.4 Classifiers
  2.2 Literature Review

3 Methodology
  3.1 Preprocessing
    3.1.1 Grayscale Conversion
    3.1.2 Frame Size Normalization
    3.1.3 Motion Magnification
    3.1.4 Temporal Interpolation
  3.2 Feature Extraction
    3.2.1 Facial Landmark Detection
    3.2.2 Geometric Features
    3.2.3 Action Unit Features
    3.2.4 Histograms of Oriented 3D Spatio-temporal Gradients
    3.2.5 Weber Local Descriptor
    3.2.6 Local Binary Pattern on Three Orthogonal Planes
    3.2.7 Histograms of Optical Flow
    3.2.8 Descriptor Combinations
  3.3 Classification
    3.3.1 Voting
    3.3.2 Stacking
  3.4 Datasets
    3.4.1 SMIC Dataset
    3.4.2 CASME II Dataset

4 Experiments
  4.1 Evaluation Strategy
  4.2 Descriptor Results
    4.2.1 Geometric Descriptor
    4.2.2 Action Unit Descriptor
    4.2.3 HOG3D Descriptor
    4.2.4 WLD Descriptor
    4.2.5 LBP-TOP Descriptor
    4.2.6 HOF Descriptor
    4.2.7 Descriptor Combinations
  4.3 Classifier Combination Results
    4.3.1 Voting
    4.3.2 Stacking
  4.4 Discussion
    4.4.1 Proposed Geometric Descriptor
    4.4.2 Other Descriptors
    4.4.3 Motion Magnification
    4.4.4 Descriptor Combinations
    4.4.5 Classifier Combinations
    4.4.6 Comparison to the Literature

5 Conclusions and Future Work

Bibliography


Chapter 1

Introduction

This chapter introduces the problem to be investigated in this research work. The main objectives and contributions are presented, as well as the text organization.

1.1 Problem and Motivation

Microexpressions are very brief involuntary facial expressions that show emotions that people are trying to conceal [21]. A microexpression lasts from 1/25 to 1/2 of a second [22, 89], which makes it very difficult to notice with the naked eye. It takes the same form as regular facial expressions, including the basic emotion categories (happiness, sadness, fear, anger, disgust and surprise); however, its duration is much shorter and the facial movements involved are usually much less intense. Figure 1.1 shows some frames of a video clip containing a happiness microexpression. The moving left mouth corner is zoomed below each frame for better observation.

Figure 1.1: Frames of a microexpression video clip from the Spontaneous Micro-Expression (SMIC) Dataset [44].

Haggard and Isaacs [30] first observed these brief expressions in 1966 and connected them to repressed emotions. In 1969, Ekman and Friesen [21] found microexpressions while analyzing a video of a psychiatric patient who planned to commit suicide, but hid it from her doctor. They established a relationship between microexpressions and lies. Later on, Ekman developed the Micro Expression Training Tool (METT) to help people develop microexpression recognition abilities [17].

Microexpressions are, however, difficult to detect and recognize even by trained humans. Automatic microexpression detection and recognition have therefore recently gained popularity, since microexpressions are important clues for lie detection. Possible applications include psychotherapeutic diagnosis, forensic interrogation, border control, and security and business negotiations. Nevertheless, unlike regular facial expression recognition, few research works have been published so far on microexpression recognition, since it is a relatively new and very challenging topic.

One of the main difficulties for microexpression recognition is the small number of public datasets. It is very difficult to elicit and capture a large number of spontaneous microexpressions. In addition to that, microexpression detection and recognition are complex tasks due to their short duration and the low intensity of the movements.

In the literature, the term microexpression spotting is used for the task of identifying when a microexpression occurs (starts and ends) in a longer video sequence, while microexpression detection differentiates spotted video clips that contain microexpressions from those that do not. Microexpression recognition, on the other hand, deals with distinguishing what type of microexpression is represented in spotted video clips. This dissertation focuses on the recognition problem. It explores different description techniques, including some well-known descriptors such as Histograms of Oriented Gradients (HOG) [15] and Histograms of Optical Flow (HOF) [73], and proposes an extension to an existing facial geometric feature set. Different machine learning techniques are also explored, such as Support Vector Machines (SVM) [71] and Random Forests (RF) [9]. However, this research work goes beyond the traditional classification method, where a descriptor is built from features extracted from objects (in this case, video sequences) and a single machine learning algorithm is applied for classification. Instead, a combination of classifiers is used, each specialized in classifying microexpressions based on a different descriptor. Results obtained with each standalone classifier are combined using seven different schemes. Experimental results show important improvements in final accuracies when compared to single-classifier results.

1.2 Objectives and Limitations

The main objective of this research work is to propose and analyze an automatic microexpression recognition approach through the use of motion, texture and shape descriptors, as well as the application of multiple machine learning techniques, so that accuracy is improved. Some more specific objectives of this work are:

• Evaluation of existing spatial and spatio-temporal descriptors, such as the Weber Local Descriptor (WLD) [14] and Histograms of Oriented 3D Spatio-temporal Gradients (HOG3D) [41], which have not yet been used, or have only been applied to a limited extent, in the microexpression recognition problem.

• Proposition and analysis of extensions to a geometric descriptor originally proposed for facial (macro) expression recognition [68], with the objective of making it more suitable for the microexpression recognition task.

• Proposition and evaluation of descriptor combinations that lead to improved accuracies when used as input features to a standalone classifier.

• Test and analysis of results obtained from multiple machine learning techniques, including the use of classifier combination techniques, and comparison of them to state-of-the-art approaches.

1.3 Research Questions

This research work aims to answer the following questions:

• Are facial geometric features discriminative for microexpression recognition?

• Is it possible to improve microexpression recognition accuracy by combining different motion, texture and shape descriptors?

• Can classifier combination techniques be applied to microexpression recognition so that final accuracy outperforms that of standalone classifiers?

1.4 Contributions

Descriptors based on the geometry of the face have seen very limited use in microexpression recognition. In this work, we propose three extensions to a geometric descriptor and apply them to the problem with competitive results. Five other descriptors are also evaluated and compared, providing valuable insights for future research. In addition to that, to the best of our knowledge, this is the first attempt to combine motion, texture and shape descriptors in this recognition task.

Similarly, and most importantly, previous microexpression recognition works have used single classifiers such as Support Vector Machines (SVM) or Nearest Neighbors (NN) for classification; this is the first attempt to apply different classifier combination techniques to the problem. Results show that significant improvement is achieved with the combined classification algorithms, with final scores outperforming most of the methods reported in the literature.

1.5 Text Organization

Chapter 2 presents relevant concepts associated with microexpression recognition. Microexpressions are described in more detail, as well as the underlying principles of the descriptors and classification techniques employed in this work. Previous works related to microexpression recognition are also reviewed and discussed. Chapter 3 describes the methodology proposed in this work for microexpression recognition and the benchmarking datasets. Experimental configuration, protocols and results are presented, discussed and compared in Chapter 4. Finally, Chapter 5 presents some concluding remarks and directions for future work.


Chapter 2

Background

This chapter is divided into two sections. The first reviews some relevant concepts related to microexpression recognition, whereas the second describes and discusses the methods proposed in the literature.

2.1 Theoretical Concepts

This section briefly reviews some concepts related to microexpressions, as well as to video processing, feature extraction and machine learning techniques that were explored during the development of this work.

2.1.1 Microexpressions

Microexpressions are very brief involuntary facial expressions that show emotions that people are generally trying to neutralize (simply hide) or mask (replace by a false different emotion) [21, 22]. These expressions were first observed by Haggard and Isaacs [30] in 1966. They called them micro-momentary expressions and stated that these expressions were signs of repressed emotion that could not be seen in real time, but only in slow-motion videos. Three years later, Ekman and Friesen [18, 21] found micro facial expressions when analyzing a video of a psychiatric patient who confessed lying to her doctor to hide her plans to commit suicide. The patient seemed happy and optimistic. However, when examining the video frame by frame, Ekman and Friesen noticed an expression of intense anguish that lasted for only two frames (1/12 of a second) and was quickly replaced by a smile. Other very brief similar expressions were found throughout the video. They concluded that these expressions can occur in two cases: in case of repressed emotions (the person conceals information from him/herself), and when emotions are deliberately suppressed (the person consciously conceals information from another). They also found that, once they knew what to look for, it was possible to see microexpressions when watching the video in real time.

Microexpressions appear in high-stake situations, when true emotions that the person is trying to hide (for instance, fear or guilt) may arise and betray the lie [18]. These expressions take the same form as regular facial expressions (happiness, sadness, fear, anger, disgust and surprise), but with a much shorter duration and often with less intensity. Their actual length, although one of their most important features, is not a consensus, with reported values varying from 1/25 to 1/2 of a second [22, 53, 89].

Due to their short duration and low intensity, microexpressions are very difficult to detect and recognize. Untrained humans, in general, perform only slightly better than chance at detecting them [18, 85] and, although training can be applied to improve accuracy [53], it is time-consuming and expensive.

Facial Action Units

The Facial Action Coding System (FACS) is a widely used system for facial expression description and measurement proposed by Ekman and Friesen [19, 20]. It measures facial expressions based on facial muscle activity: facial expressions are decomposed into Action Units (AUs), each corresponding to activity in a muscle or group of muscles that generates a characteristic facial movement. As a result, virtually all facial expressions can be described using FACS. Figure 2.1 presents some Action Unit examples extracted from the Chinese Academy of Sciences Micro-Expression (CASME) II dataset [88].


Figure 2.1: Facial Action Unit examples from the CASME II Dataset [88]: (a) AU 2 - Outer Brow Raiser; (b) AU 4 - Brow Lowerer; (c) AU 12 - Lip Corner Puller; (d) AU 15 - Lip Corner Depressor.

2.1.2 Preprocessing

A number of video preprocessing techniques have been applied in facial expression and microexpression recognition methods. The most relevant for the purpose of this research work are frame interpolation and motion magnification, whose concepts are reviewed next.

Frame Interpolation

In the past decades, limited bandwidth and storage have motivated the creation of a number of video encoding techniques that sacrifice temporal quality by dropping some of the frames from a video sequence when encoding it, resulting in perceptible visual quality loss after decoding. Various frame interpolation methods have been proposed to recreate dropped frames during decoding, from simple frame repetition and linear interpolation to motion-compensated frame interpolation (MCFI) techniques [35, 50, 54]. MCFI methods estimate motion vectors between consecutive frames and use them to interpolate the intermediate frame.

Frame interpolation is used in a number of other applications where frame rate increase is required, such as temporal upsampling for slow-motion effect creation (from low frame rate videos) and virtual view synthesis for three-dimensional and free viewpoint television. It has also been used in applications that deal with very short video sequences and/or with sequences of variable length, such as speech recognition through lip reading, face expression recognition and microexpression recognition, for normalizing video sequence length. The Temporal Interpolation Model (TIM) introduced by Zhou et al. [94], for example, has been used in such applications for this purpose. It projects visual features extracted from the video frames onto a low-dimensional continuous curve calculated based on a Laplacian matrix. Any arbitrary point on this curve can be mapped back into the image space, allowing unavailable video frames to be interpolated.
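
As a rough illustration of the length-normalization step (assuming a grayscale clip stored as a (T, H, W) array), the sketch below resamples a clip to a fixed number of frames by plain linear interpolation along the time axis. It is a much simpler stand-in for TIM, which instead embeds the frames on a Laplacian-based curve; the function name and defaults are illustrative.

```python
import numpy as np

def normalize_length(frames, target_len):
    """Resample a clip (T, H, W) to target_len frames by linear
    interpolation between the two temporally nearest original frames."""
    frames = np.asarray(frames, dtype=np.float32)
    t_src = np.linspace(0.0, 1.0, num=len(frames))
    t_dst = np.linspace(0.0, 1.0, num=target_len)
    out = np.empty((target_len,) + frames.shape[1:], dtype=np.float32)
    for i, t in enumerate(t_dst):
        j = min(np.searchsorted(t_src, t, side="right") - 1, len(frames) - 2)
        w = (t - t_src[j]) / (t_src[j + 1] - t_src[j])
        out[i] = (1.0 - w) * frames[j] + w * frames[j + 1]
    return out
```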

Motion Magnification

Video sequences may contain information that cannot be detected by the limited human visual system. Low intensity motion, for example, or subtle color variations, may be hardly perceptible or even invisible to the naked eye. Magnification techniques can be applied to videos to reveal these otherwise undetectable variations.

Motion magnification was first introduced by Liu et al. [49]. Their method finds and amplifies small motions in video by analyzing feature point trajectories and segmenting pixels based on similarity. More recently, Wu et al. [87] proposed the Eulerian Video Magnification method, which, in addition to subtle motions, can amplify small color changes. Although unseen by the naked eye, human skin color, for example, slightly changes with blood circulation; this variation can be used to extract pulse rate and in other medical applications. The method eliminates the need for flow computation by first decomposing video sequences into different spatial frequency bands, which might be magnified differently (for example, frequencies corresponding to the human pulse would be of interest for medical applications). Temporal filtering is then applied to each spatial band, which reveals changes in temporal intervals. Bands of interest are multiplied by a magnification factor α and added to the original signal.
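
A heavily simplified, single-band sketch of this Eulerian idea is shown below, assuming a grayscale clip stored as a (T, H, W) NumPy array: the spatial decomposition is reduced to a single Gaussian low-pass band and the temporal filter to an ideal FFT band-pass. The full method (and the Riesz-pyramid variant discussed next) operates on a complete image pyramid; parameter names and defaults are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def magnify(frames, fps, f_lo, f_hi, alpha, sigma=5.0):
    """Amplify temporal variations in the [f_lo, f_hi] Hz band of a
    spatially low-passed copy of the clip and add them back."""
    video = np.asarray(frames, dtype=np.float32)              # (T, H, W)
    lowpass = np.stack([gaussian_filter(f, sigma) for f in video])
    spectrum = np.fft.rfft(lowpass, axis=0)                   # per-pixel FFT over time
    freqs = np.fft.rfftfreq(video.shape[0], d=1.0 / fps)
    spectrum[(freqs < f_lo) | (freqs > f_hi)] = 0.0           # ideal band-pass
    band = np.fft.irfft(spectrum, n=video.shape[0], axis=0)
    return np.clip(video + alpha * band, 0, 255)
```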

When using the Eulerian Video Magnification method, however, noise can also be significantly amplified when α is increased. This problem was solved by Wadhwa et al. [79] with their phase-based video motion processing method, which is based on complex-valued steerable pyramids. The phase variations of these pyramids correspond to local motions in spatial sub-bands of an image. Temporal processing is applied to amplify motion in selected bands and the modified video is reconstructed. The method is significantly less sensitive to noise. Nonetheless, an improved method was proposed by the same authors [80], where Riesz pyramids are applied to the image pyramid representation instead of the complex steerable pyramids. The resulting motion-magnified videos have comparable quality, but can be processed in one quarter of the time.

The range of possible applications of motion magnification techniques is vast, from measuring the vital signs of neonatal babies to monitoring buildings swaying in the wind. Subtle motions, such as the ones found in microexpressions, can also be exaggerated using these techniques, possibly facilitating their recognition.

2.1.3 Descriptors

This section presents some concepts related to the descriptors used in this work for microexpression recognition.

Gradient Orientation

Edges can be detected by identifying local abrupt intensity (gray level) changes in an image [26, 60]. These changes can be described by the gradient (x and y partial derivatives) vector of the image function I(x, y) (which returns the intensity of the pixel at coordinates (x, y)):

$$\nabla I(x, y) = \begin{bmatrix} G_x \\ G_y \end{bmatrix} = \begin{bmatrix} \dfrac{\partial I(x, y)}{\partial x} \\[6pt] \dfrac{\partial I(x, y)}{\partial y} \end{bmatrix} \tag{2.1}$$

The partial derivatives can be approximated by using 1-dimensional filters, such as

$$G_x \approx \begin{bmatrix} -1 & 0 & 1 \end{bmatrix} \circledast I(x, y) = I(x+1, y) - I(x-1, y) \tag{2.2}$$

$$G_y \approx \begin{bmatrix} -1 & 0 & 1 \end{bmatrix}^T \circledast I(x, y) = I(x, y+1) - I(x, y-1) \tag{2.3}$$

where $\circledast$ denotes the convolution operator.

The magnitude of the gradient vector, generally called simply gradient in image processing, indicates the intensity variations that identify edges, which, for computational cost reduction, can be approximated as

$$G = \sqrt{G_x^2 + G_y^2} \approx |G_x| + |G_y| \tag{2.4}$$

The orientation of the gradient vector is also important, as it indicates edge direction. It can be calculated as

$$\theta = \arctan\left(\frac{G_y}{G_x}\right) \tag{2.5}$$

Edge directions can be used to describe object appearance and shape. The concept was used in the Histograms of Oriented Gradients (HOG) introduced by Dalal and Triggs [15], which are widely used as an image descriptor for object detection. The method uniformly divides the image into cells, for which a histogram of gradient orientations is computed. The concatenation of the histograms makes the descriptor. For better invariance to illumination and shadowing, cell histograms are contrast-normalized over larger spatial regions called blocks.
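
A minimal sketch of the cell-level computation follows (the block-level contrast normalization used by Dalal and Triggs is omitted); the cell size, number of bins and function name are illustrative choices.

```python
import numpy as np

def hog_cells(image, cell=8, bins=9):
    """Per-cell histograms of unsigned gradient orientation, weighted
    by gradient magnitude, concatenated into one feature vector."""
    img = np.asarray(image, dtype=np.float32)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]        # [-1 0 1] filter in x
    gy[1:-1, :] = img[2:, :] - img[:-2, :]        # [-1 0 1] filter in y
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    ny, nx = img.shape[0] // cell, img.shape[1] // cell
    feat = np.zeros((ny, nx, bins))
    for i in range(ny):
        for j in range(nx):
            sl = np.s_[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            feat[i, j], _ = np.histogram(ang[sl], bins=bins,
                                         range=(0, 180), weights=mag[sl])
    return feat.ravel()
```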

The concept was generalized to the spatio-temporal domain by Kläser et al. [41], who proposed a 3D video descriptor based on HOG. The method views videos as spatio-temporal volumes, which are divided into a grid of cells for which gradient orientation histograms are computed. 3D gradient orientations are quantized using regular polyhedrons (dodecahedron and icosahedron), as opposed to polar coordinates. Using their faces as histogram bins avoids the problem of singularities at the poles caused by polar bins getting progressively smaller. The resulting HOG3D descriptor was applied to human action recognition.

Texture

The interactions and amplitude variations in the intensity (gray level) of pixels in an image characterize the texture of the objects it depicts [26, 60]. Generally, random interactions and large amplitude variations identify fine textures, while well-defined interactions and uniform regions (with small amplitude variations) indicate coarse textures.

Texture features extracted from images through the analysis of the relations between their pixels have shown high discriminative power in various image classification applications. For example, texture attributes extracted from the Gray-Level Co-occurrence Matrices (GLCM) proposed by Haralick et al. [31] in 1973 are broadly used in many different pattern recognition applications, such as remote sensing, medical image analysis and seismic facies analysis. The Local Binary Patterns (LBP) descriptor introduced by Ojala et al. [57], which characterizes the spatial structure of texture by comparing the intensity of each pixel to the pixels in its local neighborhood, is also a powerful local texture descriptor widely used in texture analysis applications, such as face recognition, facial expression recognition, eye localization and many others.

More recently, Chen et al. [14] proposed an approach to describing texture inspired by Weber's Law, which states that the change in a stimulus that will be just noticeable is a constant proportion of the original stimulus [34]. The proposed Weber Local Descriptor (WLD) represents an image as a histogram of differential excitations (the ratio between the relative intensity differences of a pixel and its neighbors and the intensity of the pixel itself) and gradient orientations. The descriptor outperformed LBP and other widely used descriptors, such as Gabor and SIFT, in texture classification. Later on, Ullah et al. [75] proposed an extension to WLD by introducing local spatial information. In this method, called spatial WLD, the image is divided into blocks, a WLD histogram is calculated for each block, and the histograms are concatenated.
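
The sketch below computes a single-scale version of this descriptor, a joint histogram of differential excitation and gradient orientation, without the multi-scale and block-subdivision refinements; the bin counts and function name are illustrative.

```python
import numpy as np

def wld_histogram(image, t_bins=8, m_bins=6, eps=1e-6):
    """Joint histogram of WLD differential excitation (arctan of the summed
    relative neighbor differences) and gradient orientation."""
    img = np.asarray(image, dtype=np.float32)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    # sum of differences between the 8 neighbors and the center pixel
    diff = sum(img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx] - center
               for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0))
    excitation = np.arctan(diff / (center + eps))        # in (-pi/2, pi/2)
    gy = img[2:, 1:-1] - img[:-2, 1:-1]
    gx = img[1:-1, 2:] - img[1:-1, :-2]
    orientation = np.arctan2(gy, gx)                     # in (-pi, pi]
    hist, _, _ = np.histogram2d(excitation.ravel(), orientation.ravel(),
                                bins=(m_bins, t_bins),
                                range=[[-np.pi / 2, np.pi / 2],
                                       [-np.pi, np.pi]])
    return (hist / max(hist.sum(), 1.0)).ravel()
```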

The concept of texture can be extended to the temporal domain, where it is called dynamic texture. The most common dynamic texture recognition approaches are based on optical flow, such as the ones proposed by Péteri and Chetverikov [61] and by Polana and Nelson [63]. Zhao and Pietikäinen [92] have also introduced the Volume Local Binary Patterns (VLBP) and the Local Binary Patterns on Three Orthogonal Planes (LBP-TOP), temporal extensions to the original LBP descriptor that represent dynamic texture. The first method characterizes motion and appearance by analyzing dynamic texture as volumes, with pixel neighborhoods being circularly defined in three dimensions. The second, computationally simpler, extension considers the co-occurrences of local binary patterns in three orthogonal planes only (XY, XT and YT). The concatenation of the histograms obtained for the three planes makes the LBP-TOP descriptor, which was used in various facial expression and microexpression recognition methods reported in the literature, as described in Section 2.2.
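
A simplified sketch of the idea is shown below using scikit-image's 2D LBP: it computes uniform LBP histograms on just the central XY, XT and YT slices of a clip stored as a (T, H, W) array and concatenates them, whereas the full LBP-TOP descriptor accumulates codes over all positions in the volume.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_top_sketch(volume, p=8, r=1):
    """Uniform LBP histograms on the central XY, XT and YT planes."""
    vol = np.asarray(volume, dtype=np.float64)   # (T, H, W)
    t, h, w = vol.shape
    planes = [vol[t // 2],                       # XY: one middle frame
              vol[:, h // 2, :],                 # XT: one middle row over time
              vol[:, :, w // 2]]                 # YT: one middle column over time
    n_bins = p + 2                               # number of 'uniform' codes
    hists = []
    for plane in planes:
        codes = local_binary_pattern(plane, P=p, R=r, method="uniform")
        hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins))
        hists.append(hist / max(hist.sum(), 1))
    return np.concatenate(hists)
```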

Optical Flow

Optical flow [73] is the pattern of apparent motion between two consecutive video frames caused either by the motion of objects or the camera. It describes motion as two-dimensional displacement vectors that represent the movement of points between one frame and the next.

As the time interval between two consecutive video frames is generally very small, it is assumed that the intensities of the pixels representing an object remain the same between these frames. Therefore, considering a pixel I(x, y, t) that moves by distance (dx, dy) in the next frame, with dt being the time interval between frames, it is possible to write

$$I(x, y, t) = I(x + dx, y + dy, t + dt) \tag{2.6}$$

The right side of the equation can be approximated by a Taylor series, so that

$$I(x, y, t) = I(x, y, t) + \frac{\partial I}{\partial x}dx + \frac{\partial I}{\partial y}dy + \frac{\partial I}{\partial t}dt + \ldots \tag{2.7}$$

Dividing by dt, it follows that

$$\frac{\partial I}{\partial x}\frac{dx}{dt} + \frac{\partial I}{\partial y}\frac{dy}{dt} + \frac{\partial I}{\partial t} = 0 \tag{2.8}$$

where $\frac{\partial I}{\partial x}$, $\frac{\partial I}{\partial y}$ and $\frac{\partial I}{\partial t}$ are the image function gradients in the x and y directions and in time t. The optical flow vector $\left(\frac{dx}{dt}, \frac{dy}{dt}\right)$, however, is unknown. Various methods are available to estimate this vector, among which the one proposed by Lucas and Kanade [51] must be highlighted. It assumes the flow is constant in a local neighborhood of a given pixel and solves the equation for the pixels in this neighborhood with the least-squares method. Another important method was proposed by Farnebäck [24], which estimates the dense optical flow through polynomial expansion.

Histograms computed from the orientation and magnitude of optical flow vectors were used in various research works as a motion descriptor. Laptev et al. [42], for example, combined histograms of optical flow (HOF) and oriented gradient (HOG) computed for space-time volumes in the neighborhood of interest points to recognize human actions. Chaudhry et al. [13] proposed to represent each frame of a video using a histogram of oriented optical flow to perform this same task.
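
As an illustration, the sketch below builds a magnitude-weighted histogram of flow orientations for one pair of consecutive grayscale frames using OpenCV's dense Farnebäck flow; the bin count and flow parameters are illustrative, and a clip-level HOF descriptor would accumulate or concatenate such histograms over all frame pairs (and, typically, over spatial regions).

```python
import cv2
import numpy as np

def hof(prev_gray, next_gray, bins=8):
    """Magnitude-weighted histogram of dense optical flow orientations."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5,
                                        poly_sigma=1.2, flags=0)
    dx, dy = flow[..., 0], flow[..., 1]
    mag = np.hypot(dx, dy)
    ang = np.arctan2(dy, dx)                      # in (-pi, pi]
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi), weights=mag)
    return hist / max(hist.sum(), 1e-8)
```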

Facial Geometric Features

According to Martinez et al. [52], robust computer vision algorithms for face analysis and recognition should be based on configural and shape features. Building on recent results in cognitive science and neuroscience, they argued that these features are most important for the recognition of facial expressions by humans and proposed the use of the coordinates of fiducial points delineating the eyebrows, eyes, nose, mouth and jaw line as feature vectors.

Inspired by their work, Saeed et al. [67, 68] introduced a facial expression recognition method that uses geometric features of the face. It utilizes just 8 facial landmarks representing the shape and location of three facial components (eye, eyebrow and mouth, as depicted in Figure 2.2(a)) to extract two feature sets:

• Facial point location features: the location (x and y coordinates) of the 8 landmarks relative to the face size and position is used to generate a 16-dimensional feature vector.

• Geometric features: 6 distances are calculated from the 8 landmarks, as represented in Figure 2.2(b), which describe the relative positions of the facial landmarks to each other, and are concatenated with the location feature set. Distances d1 and d2 are computed as the average of the corresponding distance values calculated for the left and right sides of the face.


Figure 2.2: Geometric features introduced by Saeed et al. [67, 68]: (a) location of 8 facial landmarks (plotted on an image from the SMIC dataset [44]); (b) 6 distances between 8 facial landmarks (figure adapted from [68]).
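
A sketch of this style of feature extraction is given below, assuming an (N, 2) array of detected landmark coordinates and the face bounding box; the landmark indices chosen for the example distances are illustrative and do not reproduce the exact six distances defined by Saeed et al.

```python
import numpy as np

def geometric_features(landmarks, face_rect,
                       distance_pairs=((0, 1), (2, 3), (4, 5))):
    """Normalized landmark locations concatenated with a few
    inter-landmark Euclidean distances."""
    pts = np.asarray(landmarks, dtype=np.float32)
    x, y, w, h = face_rect
    norm = (pts - np.array([x, y])) / np.array([w, h])   # location features
    dists = [np.linalg.norm(norm[i] - norm[j]) for i, j in distance_pairs]
    return np.concatenate([norm.ravel(), dists])
```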

2.1.4 Classifiers

Machine learning [7] is the subfield of artificial intelligence that studies algorithms which can learn from datasets to subsequently predict properties of new samples. In this context, classification is a particular machine learning task where, given a set of known samples belonging to two or more distinct categories (classes), new samples are assigned (classified) to one of these categories. Algorithms that classify data are known as classifiers.

Among the many existing classification methods, Support Vector Machines (SVM) [71] must be highlighted for their successful application in a number of diverse fields. Given a labeled and linearly separable training dataset, SVM builds an optimal hyperplane that separates the samples belonging to each class. The optimal hyperplane is calculated to maximize the margin (minimum distance to the learning samples). In many cases, however, a hyperplane that adequately separates the data may not exist (the dataset is not linearly separable). To handle these cases, linear SVM is extended by introducing an error parameter in the hyperplane calculation. The optimal hyperplane, in this case, is the one that maximizes the margin and minimizes the error. If, even with the use of an error parameter, the data still remains inseparable in its original dimensional space, non-linear SVMs can be applied. In the latter case, a kernel function is used to calculate distances and angles in a new projected space (typically of higher dimension) where the data is now separable by a hyperplane.
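
In scikit-learn terms (used here only as an example library), the soft-margin error parameter and the kernel choice map directly onto the SVC constructor; feature scaling is included because margin-based classifiers are sensitive to feature ranges.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Linear soft-margin SVM (C is the error/penalty parameter) and a
# non-linear variant that projects the data with an RBF kernel.
linear_svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
```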

The K-Nearest Neighbors (KNN) [16] method has also been successfully applied to various classification problems. The algorithm calculates the distances between a new sample and all training samples. The new sample is then assigned to a class through majority voting of its k nearest neighbors. The parameter k is a positive, typically small, integer number.

Decision trees [10] are yet another classification method, which builds an acyclic connected graph (tree) from the training dataset, such that its leaves represent the classes the data belong to and the internal nodes represent the combinations of data property values that lead to these classes. The tree is traversed to determine the class of a new sample. Random Forests (RF) [9], in turn, are ensemble classifiers that use multiple decision trees, with each tree being built by resampling the original training dataset through random selection with replacement. A subset of randomly selected properties is used to split each tree node. New samples are classified independently by each of these decision trees and the final classification is assigned through majority voting.

The Adaptive Boosting (AdaBoost) [25] machine learning approach combines multiple weak (inaccurate) classifiers to create an accurate predictor. The algorithm assigns weights to training samples, which are adjusted after each weak classifier is trained by increasing the weight of misclassified samples. After all weak classifiers are trained, they are also assigned a weight based on their accuracies. Final classification is the weighted sum of the output of the weak classifiers.

Random Forests and AdaBoost are both classifier combination (ensemble) techniques that use multiple classifier instances of the same weak classification method (e.g., decision trees) on the same input feature vectors (descriptors). An alternative approach is to combine classifiers that implement different classification methods, each specialized in a different descriptor [23, 39, 66, 95]. As each descriptor potentially represents different complementary characteristics of the objects to be classified, and as different methods are used for classification, the sets of correctly classified and misclassified samples obtained from each classifier will not be the same. Combining these potentially complementary outputs to make a final decision can then improve the accuracy of the final predictions. Various rules can be applied to combine the individual classifier predictions. If only the predicted labels are available, a majority voting algorithm can be used. If predicted probabilities for each class are also given, a soft voting rule can be applied by calculating the sum of the probabilities. Weighted voting can also be used by assigning weights to each classifier.
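
A sketch of this per-descriptor combination is shown below with scikit-learn classifiers: one classifier is trained on each descriptor's feature matrix and their class probabilities are summed (optionally weighted) to produce the final label. The particular classifiers and weights are illustrative, not the exact configuration evaluated in this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def fit_per_descriptor(descriptor_sets, y):
    """Train one standalone classifier per descriptor feature matrix."""
    models = [SVC(kernel="linear", probability=True),
              RandomForestClassifier(n_estimators=200),
              KNeighborsClassifier(n_neighbors=5)]
    return [m.fit(X, y) for m, X in zip(models, descriptor_sets)]

def soft_vote(models, descriptor_sets, weights=None):
    """Weighted sum of predicted class probabilities across classifiers."""
    probs = [m.predict_proba(X) for m, X in zip(models, descriptor_sets)]
    weights = np.ones(len(probs)) if weights is None else np.asarray(weights)
    combined = sum(w * p for w, p in zip(weights, probs))
    return models[0].classes_[np.argmax(combined, axis=1)]
```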


A number of other classifier combination techniques are described in the literature [66, 95], among which stacking [1, 66, 86, 95] must be highlighted. In this case, a meta-classifier is trained to combine the output of the standalone classifiers, i.e., the classifications predicted by the individual classifiers are used as the input features for the meta-classifier, which predicts the final classification.
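
A minimal stacking sketch follows, again with scikit-learn and per-descriptor base classifiers (which must expose predict_proba, e.g. SVC(probability=True)): out-of-fold class probabilities from each standalone classifier become the meta-classifier's input features. The logistic-regression meta-classifier and the 5-fold scheme are illustrative choices; scikit-learn's built-in StackingClassifier is not used here because it feeds the same feature matrix to every base estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def fit_stacking(models, descriptor_sets, y, cv=5):
    """Train a meta-classifier on out-of-fold predictions of the base
    classifiers, then refit the base classifiers on all the data."""
    meta_X = np.hstack([cross_val_predict(m, X, y, cv=cv, method="predict_proba")
                        for m, X in zip(models, descriptor_sets)])
    meta_clf = LogisticRegression(max_iter=1000).fit(meta_X, y)
    base = [m.fit(X, y) for m, X in zip(models, descriptor_sets)]
    return base, meta_clf

def stacking_predict(base, meta_clf, descriptor_sets):
    feats = np.hstack([m.predict_proba(X) for m, X in zip(base, descriptor_sets)])
    return meta_clf.predict(feats)
```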

2.2 Literature Review

The first research works on automatic microexpression recognition are relatively recent. In 2009, Polikovsky et al. [64] captured microexpression videos from 10 volunteers using a high-speed camera (200 FPS) and applied histograms of oriented gradients as descriptors for their automatic recognition. The analyzed microexpressions, however, were obtained artificially (posed). Shreve et al. [72] proposed an algorithm for locating microexpressions in video sequences using optical strain patterns. The method was successful, but the dataset used in that research also consisted of posed microexpressions.

In 2011, Pfister et al. [62] conducted a research work using spontaneous microexpressions captured from 6 volunteers. They proposed an algorithm for recognizing them using LBP-TOP for description and Multiple Kernel Learning (MKL) for classification. In 2013, as a continuation of this research, Li et al. [44] collected the first dataset of spontaneous microexpressions: SMIC (Spontaneous Micro-Expression Database), which comprises 164 microexpression video clips collected from 16 participants. A baseline was established for the detection and recognition of 3 classes of microexpressions using the Temporal Interpolation Model (TIM) for video length equalization, LBP-TOP for description and SVM for classification.

The CASME (Chinese Academy of Sciences Micro-Expression) dataset was created by Yan et al. [90] that same year. It contains 195 samples of spontaneous microexpressions captured from 35 participants. Its evolution, CASME II [88], includes more samples captured at a higher frame rate, superseding the original CASME. A baseline was established for CASME II using the LBP-TOP descriptor and SVM classifier on the recognition of 5 classes of microexpressions.

The availability of these spontaneous microexpression datasets allowed a number of other research works to be developed. Guo et al. [29] used the LBP-TOP descriptor combined with the nearest neighbor method for microexpression recognition. The method was tested on the SMIC dataset, outperforming the original baseline. Wang et al. [82] also used the LBP-TOP descriptor, but on a new color space model called Tensor Independent Color Space (TICS), and the SVM classifier. The method was tested on a subset of CASME II and results indicated that higher accuracies are achieved with TICS than with the RGB color space. In their subsequent research work [81], they used Local Spatio-Temporal Directional Features (LSDF) combined with Robust Principal Component Analysis (RPCA) to recognize microexpressions on both SMIC and CASME II. Liong et al. [46, 47] also tested their microexpression recognition methods on both datasets. Both methods are based on optical strain magnitude features, which characterize the relative muscular movements on faces, and the SVM classifier. Results outperformed the original baselines. Later on, they


presented a novel approach to the problem [48], in which Bi-Weighted Oriented Optical Flow (Bi-WOOF) features are extracted from only two frames per video (the apex and onset frames), achieving good performance on SMIC and CASME II.

Huang et al. [32] proposed a new spatio-temporal facial representation for microexpression recognition in which an integral projection method is used to obtain horizontal and vertical projections of facial images. LBP features are extracted from these projections and SVM is used for classification. Good performance was observed in experiments conducted with SMIC and CASME II. In a subsequent work [33], they used the Spatio-Temporal Completed Local Quantized Patterns (STCLQP) descriptor, containing sign, magnitude and orientation components, with the same classifier. The method was also tested on both the SMIC and CASME II datasets. Oh et al. [56] proposed a spatio-temporal feature representation based on monogenic signals for microexpression recognition in which monogenic features (magnitude, phase and orientation) are captured from multiple scales using the Riesz wavelet transform. Two techniques were separately applied to CASME II for classification: ultra-fast Multiple Kernel Learning (UFO-MKL) and linear SVM. The LBP-TOP and ISTLMBP (Improved Spatio-Temporal Local Monogenic Binary Pattern) descriptors were also evaluated for comparison, being surpassed by the proposed Monogenic Riesz Wavelet representation when used with the linear SVM classifier. Subsequently, Oh et al. [55] introduced a new method in which intrinsic two-dimensional local structures are used to represent corners of facial contours for microexpression recognition. Linear SVM is used for classification on both SMIC and CASME II.

Wang et al. [84] used a different variant of the LBP descriptor, the Local Binary Patterns with Six Intersection Points (LBP-SIP), to reduce the redundancy in LBP-TOP patterns and provide a more compact representation for microexpression video clips. Their method extracts and concatenates the patterns across all levels of a Gaussian multi-resolution pyramid. SVM is used for classification on CASME II. In their following work [83], they proposed an approach based on the Eulerian Video Magnification (EVM) technique. LBP-TOP features are extracted from magnified microexpressions and SVM is used for classification. The method was tested on CASME II and results indicated that higher accuracies are achieved when EVM is applied. Li et al. [45] used the same method to magnify motion in microexpression videos. TIM was also applied to the magnified videos before extracting LBP, HOG and HIGO (Histogram of Image Gradient Orientation) features on three orthogonal planes. Classification was done using SVM and results on both SMIC and CASME II indicated that motion magnification increased microexpression recognition accuracy, with the HIGO feature outperforming LBP and HOG. Le Ngo et al. [43] also used motion magnification for microexpression recognition. Their method applied the Riesz Pyramid Motion Magnification technique to microexpression video clips before LBP-TOP feature extraction. SVM was used for classification on CASME II and results indicated improvements in recognition rates of magnified over non-magnified microexpressions.

Patel et al. [58] explored the use of deep learning for microexpression recognition. Due to the small number of samples contained in public microexpression datasets, instead of training a Convolutional Neural Network (CNN) model from microexpression data, they transferred features from models trained on facial expression datasets and used


evolutionary feature selection techniques for microexpression recognition. The method was tested on SMIC and CASME II. Breuer and Kimmel [11] also used the transfer learning methodology for microexpression recognition. They trained a CNN to detect regular facial expressions and combined it with a Long Short-Term Memory (LSTM) recurrent neural network to perform the recognition task on CASME II. Recognition results obtained with these and other microexpression recognition works reviewed in this section are summarized in Table 4.23.

It is possible to observe that most of the proposed microexpression recognition approaches use LBP variants as feature descriptors and SVM for classification. Although some HOF and HOG descriptor variants were also utilized in some methods, motion and shape descriptors can be much further explored, as well as a number of other machine learning techniques.


Chapter 3

Methodology

Microexpression recognition aims to distinguish what type (class) of microexpression is represented in spotted video clips. For this purpose, the methodology proposed and evaluated in this work extracts different features from video sequences to generate descriptors, which are used to train different individual classifiers and predict microexpression classes through cross-validation. Classification combination techniques are then applied to these individual classification results to yield a final classification prediction.

In most cases, preprocessing techniques are applied to video sequences before the actual feature extraction process is done, for example, to normalize their lengths (number of frames). Some well-known descriptors were explored, such as Histograms of Optical Flow (HOF), Histograms of Oriented Gradients (HOG) and the Weber Local Descriptor (WLD). Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) was also explored for comparison, as it is used in most microexpression recognition studies. Additionally, a descriptor based on an extended set of the geometric features presented in [68] is proposed and evaluated as part of this work. A descriptor built from Action Unit occurrence features is also explored. The Principal Component Analysis (PCA) [36] dimensionality reduction technique is applied to the descriptors with high dimensionality.

Different classifiers, such as Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Random Forests (RF) and AdaBoost, are used with each descriptor. Results obtained through these individual descriptor-classifier pairs are then combined to yield a final classification. Figure 3.1 illustrates the high-level flow of the method, which is described in detail in the following sections.

3.1 Preprocessing

In general, the first preprocessing steps in facial expression or microexpression recognition systems are face detection, alignment and cropping. However, both datasets used in this research already provide a preprocessed version of their microexpression video sequences with faces aligned and cropped (please see Section 3.4 for a detailed description of the datasets). These preprocessed versions were used throughout this research and no additional face detection, alignment and/or cropping techniques were applied.

Figure 3.2 shows the sequence of preprocessing steps utilized in the method proposed in this research.


Figure 3.1: High level flow diagram for the microexpression recognition method.

Each step is described in the subsections that follow. It is important to notice that preprocessing steps are not mandatory and may be skipped depending on the features to be extracted or on the objective of each particular experiment.

3.1.1 Grayscale Conversion

Both datasets used in this research provide color (RGB) video sequences. However, some descriptors, such as WLD and LBP-TOP, must be extracted from grayscale videos. In these cases, the first preprocessing step taken by the method is RGB to grayscale conversion. For some other descriptors, although grayscale conversion may not be a requirement, experimentation indicated that better results are achieved when it is done, so it was applied in these cases as well. Conversion is done using color conversion routines implemented through OpenCV [8].

3.1.2 Frame Size Normalization

Another simple preprocessing step that is required for some descriptors, or that can be used to enhance final results for others, is the normalization of frame sizes among all video sequences in the dataset. When HOG3D is extracted with dense sampling, for example, the resulting descriptor dimensionality depends, among other parameters, on the frame size. Frame size normalization is then a required preprocessing step when extracting HOG3D features, to ensure that the descriptors of all video sequences have the same dimensionality.


Figure 3.2: Flow diagram for microexpression preprocessing.

Frame size normalization is done by downsizing the frames of all video sequences to the size of the smallest video frame in the dataset, using resize routines implemented through scikit-image [77].
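The short sketch below illustrates these two preprocessing steps (grayscale conversion and frame size normalization) under the assumption that each video is a list of RGB frames stored as NumPy arrays; function and variable names are illustrative, not the actual implementation used in this work.

# Illustrative preprocessing sketch, assuming each video is a list of RGB frames.
import cv2
from skimage.transform import resize

def to_grayscale(frames):
    # RGB to grayscale conversion with OpenCV (expects 8-bit RGB frames).
    return [cv2.cvtColor(f, cv2.COLOR_RGB2GRAY) for f in frames]

def normalize_frame_sizes(videos):
    # Downsize every frame of every video to the smallest frame size found in
    # the dataset, so dense descriptors end up with equal dimensionality.
    min_h = min(f.shape[0] for v in videos for f in v)
    min_w = min(f.shape[1] for v in videos for f in v)
    return [[resize(f, (min_h, min_w)) for f in v] for v in videos]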

3.1.3 Motion Magnification

One of the challenges for microexpression recognition is the low magnitude of the facial movements they comprise, which may be difficult to capture or describe. Through motion magnification, these movements can be exaggerated, so that they would possibly be more effectively described and, as a result, be more recognizable.

For this purpose, the Riesz Pyramid for Fast Phase-based Video Magnification method is evaluated in this research work. The algorithm is applied to microexpression video clips as a preprocessing step, as indicated in Figure 3.2. Multiple magnification factors (α) are applied before extracting different features and results are compared to the ones obtained when magnification is not used.

Figure 3.3 shows a frame obtained when applying phase-based motion magnification with α = 4, 8 and 16 to a microexpression video clip (containing an eyebrow rise) from the SMIC HS dataset.

3.1.4 Temporal Interpolation

Another challenge for microexpression recognition is the different and short durations of video sequences. Similarly to its relation to frame size, the dimensionality of some descriptors


Figure 3.3: Riesz pyramid phase-based motion magnification: (a) original frame; (b) α = 4; (c) α = 8; (d) α = 16.

also depends on the video sequence length, so that different durations would produce descriptors with different dimensionalities. In addition to that, the short duration of video sequences may limit the application or reduce the effectiveness of some descriptors.

Temporal interpolation can be used to deal with these issues by expanding or shrinking video sequences into normalized lengths. The Temporal Interpolation Model (TIM) [94] is used throughout this work to normalize video sequence lengths, as well as to study the effect of using longer/shorter interpolated video sequences on the results of microexpression recognition, as discussed in Chapter 4.

Figure 3.4 shows the results obtained when applying TIM to a video segment excerpted from an SMIC HS microexpression video clip. In this case, the interpolated sequence length is twice the length of the original video clip.

Figure 3.4: Temporal Interpolation Model applied to a microexpression video segment, doubling its length.


3.2 Feature Extraction

This section describes the feature extraction methods used in this research.

3.2.1 Facial Landmark Detection

In order to extract and build some of the descriptors explored in this work, a set of facial points (landmarks) must be known. Facial landmarks represent the location and shape of facial components, such as eyes, eyebrows, nose, mouth and jawline, and, therefore, can be used to calculate geometric features of the face or as interest points to compute motion descriptors such as HOF and HOG3D.

Two existing facial landmark detection software packages were evaluated as part of this work, so that their effect on the results of microexpression recognition could be assessed, as detailed next.

DLib’s 68 face landmark shape predictor

DLib [38] implements a shape predictor based on an ensemble of regression trees [37] and provides a ready-to-use model for face shape prediction trained on the iBUG 300-W face landmark dataset [69]. It is able to detect 68 facial landmarks from an image, following the Multi-PIE [28] 68-point mark-up, as illustrated in Figure 3.5.

Figure 3.5: The 68 facial landmarks detected by DLib and OpenFace (figure adapted from [70]).

DLib’s 68 face landmark shape predictor was used throughout this research on each frame of the microexpression video clips. Figure 3.6 presents some sample frames with the detected


landmarks. It is possible to observe that results are mostly accurate, although a quick human inspection can easily find video frames for which the algorithm performs poorly, especially for the landmarks representing the edges of the mouth and jawline.

Figure 3.6: Samples of facial landmarks detected by DLib for the SMIC dataset.
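A minimal sketch of how per-frame landmarks can be obtained with DLib's shape predictor is shown below; it assumes the pre-trained 68-point model file is available locally, and the file name and helper function are illustrative rather than the exact code used in this work.

# Illustrative sketch: per-frame 68-landmark detection with DLib.
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_landmarks(gray_frame):
    # Use the first detected face (dataset frames are already cropped to a face).
    rects = detector(gray_frame, 1)
    if len(rects) == 0:
        return None
    shape = predictor(gray_frame, rects[0])
    # Return a (68, 2) array of (x, y) landmark coordinates.
    return np.array([(p.x, p.y) for p in shape.parts()])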

OpenFace facial landmark detection and tracking

OpenFace [5] implements facial landmark detection and tracking functionality using constrained local neural fields [4] and includes a model trained on the Multi-PIE dataset [28]. It is able to detect the same 68 facial landmarks (also following the Multi-PIE mark-up) illustrated in Figure 3.5 from individual images, or to track them from a video or from a sequence of images treated as a single video.

For the purpose of this research, both the facial landmark detection and tracking methods were evaluated, so that their effect on microexpression recognition could be assessed. Some sample frames with the detected and tracked facial landmarks are shown respectively in Figures 3.7 and 3.8. A quick human observation can conclude that results seem generally as precise as or more accurate than the ones obtained with DLib, although some poor results can be found, especially for the landmarks representing the edges of closed eyes.

Figure 3.7: Samples of facial landmarks detected by OpenFace from individual frames of the SMIC dataset.


Figure 3.8: Samples of facial landmarks tracked by OpenFace from frame sequences of the SMIC dataset.

3.2.2 Geometric Features

This work explored the usage of the geometric features introduced by Saeed et al. [67, 68] for microexpression recognition. As the original approach specifies frame-level feature sets, it was extended to handle videos by concatenating all frame-level features into a single video-level descriptor. For each frame, the coordinates of the 8 facial landmarks are normalized (re-scaled) to [0, 1] and the 6 Euclidean distances between them are calculated. The variations in landmark locations and distances along the video frames are expected to characterize the facial movements.
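A minimal sketch of this video-level geometric descriptor is given below: per-frame landmark coordinates are re-scaled to [0, 1], pairwise distances for selected landmark pairs are computed, and all frames are concatenated. The landmark indices and pairs shown are placeholders, not the actual feature sets defined in this work.

# Illustrative geometric-descriptor sketch; landmark IDs and pairs are placeholders.
import numpy as np

LANDMARK_IDS = [36, 39, 42, 45, 48, 54, 57, 8]     # illustrative subset of the 68 points
DISTANCE_PAIRS = [(0, 1), (2, 3), (4, 5), (6, 7)]   # illustrative distance pairs

def frame_features(landmarks, frame_shape):
    pts = landmarks[LANDMARK_IDS].astype(float)
    pts[:, 0] /= frame_shape[1]                      # normalize x by frame width
    pts[:, 1] /= frame_shape[0]                      # normalize y by frame height
    dists = [np.linalg.norm(pts[i] - pts[j]) for i, j in DISTANCE_PAIRS]
    return np.concatenate([pts.ravel(), dists])

def geometric_descriptor(video_landmarks, frame_shape):
    # video_landmarks: one (68, 2) landmark array per (interpolated) frame.
    return np.concatenate([frame_features(l, frame_shape) for l in video_landmarks])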

In addition to that, this work also proposes three extensions to the landmark location and distance feature sets calculated for each frame, as follows. The additional landmarks and distances used in these extended feature sets were empirically selected considering their relationship to facial movements observed in microexpression video sequences.

• Add 6 landmarks to the original 8-landmark set with the intention of representing subtle eye and eyebrow movement, as depicted in Figure 3.9(a). From the resulting set of 14 landmarks, calculate 21 distances between them (see Figure 3.9(b)). Left and right side distances are used instead of computing their averages to avoid loss of information for possible subtle asymmetrical movements.

• Use 4 additional landmarks (totaling 18 points) to further represent the shape of the eyes and eyebrows (refer to Figure 3.10(a)) and calculate 27 distances between them, as shown in Figure 3.10(b).

• Use all 51 detected landmarks representing the shape and location of eyes, eyebrows, mouth and nose (Figure 3.11(a)) and calculate 35 distances between them (Figure 3.11(b)).

This work also evaluated the effect of using only the landmark locations or only their distances to build the geometric descriptor. The objective was to compare their separate contributions to microexpression recognition. Additionally, it investigated whether subsets of the complete 21-distance set could be enough (or better) to geometrically describe microexpressions. As a brute-force experimentation with all possible 2^21 subsets was not feasible, some conditions were established for which subsets would be tested.


Figure 3.9: Geometric features proposed in this research: (a) location of 14 facial landmarks (plotted on an image from the SMIC dataset); (b) 21 distances between 14 facial landmarks.


Figure 3.10: Geometric features proposed in this research: (a) location of 18 facial landmarks (plotted on an image from the SMIC dataset); (b) 27 distances between 18 facial landmarks.


Figure 3.11: Geometric features proposed in this research: (a) location of 51 facial landmarks (plotted on an image from the SMIC dataset); (b) 35 distances between 51 facial landmarks.

For example, it was determined that the distance between the right and left corners of the mouth (d4) should always be used and that every time a right-side (of the face) distance is included in the subset, the corresponding left-side distance should be included as well. The resulting set E of the distance subsets selected for experimentation is defined as

$$\begin{aligned}
E = \{ S \in P(D) \mid\ & d_4 \in S \wedge \neg(d_9 \in S \veebar d_{10} \in S) \wedge \neg(d_1 \in S \veebar d_7 \in S) \wedge {} \\
& \neg(d_{12} \in S \veebar d_{13} \in S) \wedge \neg(d_{14} \in S \veebar d_{15} \in S) \wedge \neg(d_{16} \in S \veebar d_{17} \in S) \wedge {} \\
& \neg(d_{18} \in S \veebar d_{19} \in S) \wedge \neg(d_{20} \in S \veebar d_{21} \in S) \wedge \neg(d_2 \in S \veebar d_8 \in S) \wedge {} \\
& \neg(d_5 \in S \veebar d_6 \in S) \wedge (d_1 \in S \vee d_{12} \in S \vee d_{14} \in S \vee d_{16} \in S) \wedge {} \\
& (d_2 \in S \vee d_{20} \in S) \wedge (d_3 \in S \vee d_5 \in S) \}
\end{aligned} \qquad (3.1)$$

where D is the complete set of 21 distances, P(D) is the power set of D, ⊻ denotes the exclusive or, and the distances are identified according to Figure 3.9(b). A total of 1080 subsets of D satisfy these conditions and, therefore, were evaluated in this work.
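As a sanity check of Equation 3.1, the short sketch below enumerates the admissible subsets by treating each left/right distance pair as a single unit (the negated exclusive-or terms force both members of a pair in or out together) and filtering by the remaining conditions. The names and structure are illustrative, not the implementation used in this work, but the enumeration reproduces the 1080 subsets mentioned above.

# Illustrative enumeration of the distance subsets admitted by Equation 3.1.
from itertools import combinations

PAIRS = [(1, 7), (2, 8), (5, 6), (9, 10), (12, 13),
         (14, 15), (16, 17), (18, 19), (20, 21)]     # left/right distance pairs
SINGLES = [3, 11]                                    # distances without a mirrored pair

def valid_subsets():
    units = [set(p) for p in PAIRS] + [{s} for s in SINGLES]
    for r in range(len(units) + 1):
        for combo in combinations(range(len(units)), r):
            subset = {4}                              # d4 is always included
            for idx in combo:
                subset |= units[idx]
            # "at least one of" constraints from Equation 3.1
            if (subset & {1, 12, 14, 16} and subset & {2, 20}
                    and subset & {3, 5}):
                yield frozenset(subset)

print(len(list(valid_subsets())))                     # -> 1080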

3.2.3 Action Unit Features

Chapter 2 outlines how Action Units objectively describe facial expressions and microexpressions. Based on that, a very straightforward proposition made in this research is to use Action Unit features extracted from images or videos to recognize microexpressions.

The OpenFace [5] software package implements facial Action Unit detection through cross-dataset learning and person-specific normalization [3]. The tool is able to recognize


a subset of Action Units, specifically AU1, AU2, AU4, AU5, AU6, AU7, AU9, AU10, AU12, AU14, AU15, AU17, AU20, AU23, AU25, AU26, AU28 and AU45, and describes them in two ways: (1) by their presence, i.e., whether they are visible in the face, which is encoded as a boolean flag, and (2) by their intensity, i.e., how intense the Action Unit is, on a 0 to 5 continuous scale, with 0 meaning it is not present, 1 indicating that it is present at minimum intensity and 5 meaning it is present at maximum intensity.

OpenFace can extract Action Units from individual images using static models, or from videos or image sequences treated as videos, in which case the tool uses dynamic models that are calibrated by performing person-specific normalization over the video.

This research proposes to use the presence and intensity features detected by OpenFace for each supported Action Unit as a descriptor for microexpression classification. The presence flag is used as provided, while intensity values are re-scaled to [0, 1]. Using presence features only, using intensity features only, and concatenating both feature types are all evaluated. The features detected (and selected) for each frame are concatenated and used as a descriptor for each microexpression video sequence. Both static and dynamic Action Unit prediction models are tested.
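A hedged sketch of how such a descriptor could be assembled from an OpenFace output file is given below. It assumes the per-frame CSV produced by the tool, with intensity columns ending in "_r" (0 to 5 scale) and presence columns ending in "_c" (0/1); column naming may vary between OpenFace versions, and the helper is illustrative only.

# Illustrative Action Unit descriptor built from an OpenFace per-frame CSV.
import numpy as np
import pandas as pd

def au_descriptor(csv_path, use_presence=True, use_intensity=True):
    df = pd.read_csv(csv_path)
    df.columns = [c.strip() for c in df.columns]      # some versions pad column names
    features = []
    if use_intensity:
        intensity = df[[c for c in df.columns if c.endswith("_r")]].values
        features.append(intensity / 5.0)               # re-scale [0, 5] to [0, 1]
    if use_presence:
        presence = df[[c for c in df.columns if c.endswith("_c")]].values
        features.append(presence)
    # Concatenate the selected per-frame features into one video-level descriptor.
    return np.hstack(features).ravel()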

3.2.4 Histograms of Oriented 3D Spatio-temporal Gradients

As a spatio-temporal descriptor employed in human action and gesture recognition, the Histograms of Oriented 3D Spatio-temporal Gradients (HOG3D) can also be used for microexpression recognition. The descriptor is computed in this research work for microexpression video clips using dense sampling, as well as for sparse sets of interest points built from the facial landmarks detected by DLib and OpenFace. Additionally, two different polyhedra are used for gradient orientation quantization (dodecahedron and icosahedron) and compared to the original polar coordinate quantization.

3.2.5 Weber Local Descriptor

The Weber Local Descriptor (WLD) was also evaluated in this work for microexpression recognition. The WLD histogram of differential excitations and gradient orientations is computed for each frame of the microexpression video clips as originally proposed by Chen et al. [14], using three parameters: T, the number of dominant gradient orientations; M, the number of segments in which differential excitations are grouped for each dominant gradient orientation; and S, the number of bins in which these segments are further split.

Furthermore, the spatial WLD extension proposed by Ulla et al. [75] is also applied and compared to the original WLD. In this case, the image is divided into blocks and the WLD histogram is calculated for each block. Histograms are concatenated to form the spatial WLD describing a frame.

For both variants, the descriptor is computed for all frames and the results are concatenated to produce a descriptor for each microexpression video clip.
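A rough per-frame sketch of the idea behind WLD is given below, under simplifying assumptions (a 3x3 neighborhood, orientation taken from the vertical and horizontal neighbor differences, and a joint 2D histogram instead of the exact T/M/S binning of Chen et al.); it is illustrative only and not the implementation used in this work.

# Simplified per-frame WLD-style histogram (illustrative, not the exact T/M/S scheme).
import numpy as np
from scipy.ndimage import convolve

def wld_histogram(gray, t_bins=8, e_bins=24, eps=1e-6):
    gray = gray.astype(float)
    # Differential excitation: arctan of the relative difference between the
    # center pixel and the sum of its 8 neighbors.
    neighbor_diff = convolve(gray, np.array([[1, 1, 1], [1, -8, 1], [1, 1, 1]]))
    excitation = np.arctan(neighbor_diff / (gray + eps))
    # Gradient orientation from the vertical and horizontal neighbor differences.
    v = convolve(gray, np.array([[0, -1, 0], [0, 0, 0], [0, 1, 0]]))
    h = convolve(gray, np.array([[0, 0, 0], [-1, 0, 1], [0, 0, 0]]))
    orientation = np.arctan2(v, h)
    hist, _, _ = np.histogram2d(
        orientation.ravel(), excitation.ravel(),
        bins=(t_bins, e_bins),
        range=((-np.pi, np.pi), (-np.pi / 2, np.pi / 2)))
    return hist.ravel() / hist.sum()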


3.2.6 Local Binary Pattern on Three Orthogonal Planes

The Local Binary Pattern on Three Orthogonal Planes (LBP-TOP) descriptor is used in various microexpression recognition works. For this reason, it is also employed in our tests, such that results can be compared and combinations with other descriptors can be evaluated.

The video sequence volume is divided into blocks and the LBP-TOP histogram is computed for each block. The resulting histograms are concatenated to build the final video descriptor.
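For illustration, the sketch below approximates LBP-TOP by applying scikit-image's 2D LBP slice by slice along each of the three orientations of the video volume and accumulating one histogram per orientation; the actual implementation used in this work is a port of the original Matlab code and also divides the volume into blocks, which this sketch omits.

# Simplified, block-free LBP-TOP sketch using scikit-image's 2D LBP.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_top(volume, p=8, r=1):
    # volume: (T, H, W) grayscale video volume. For each plane orientation
    # (XY, XT, YT), compute LBP codes slice by slice and accumulate a
    # uniform-pattern histogram; the three histograms are concatenated.
    hists = []
    for axis in range(3):
        acc = np.zeros(p + 2)
        for i in range(volume.shape[axis]):
            plane = np.take(volume, i, axis=axis)
            codes = local_binary_pattern(plane, P=p, R=r, method="uniform")
            h, _ = np.histogram(codes, bins=p + 2, range=(0, p + 2))
            acc += h
        hists.append(acc / acc.sum())
    return np.concatenate(hists)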

3.2.7 Histograms of Optical Flow

This research work also makes use of Histograms of Optical Flow (HOF) as a motion descriptor for microexpression recognition. Optical flow vectors are estimated for microexpression video sequences using the Farnebäck [24] and Lucas-Kanade [51] methods. The first computes the dense optical flow (i.e., for all points in the frame), while the second calculates optical flow vectors for a sparse set of interest points. Facial landmarks detected by DLib and OpenFace are used as interest points for the microexpression video sequences. Other optical flow estimation methods, such as the one proposed by Brox et al. [12], might be a subject for future work.

Once optical flow vectors are estimated, histograms are computed based on the method proposed by Chaudhry et al. [13]: optical flow vectors are binned according to their primary angle from the horizontal axis and weighted by their magnitude. Using the primary angle, i.e., the smallest signed angle between the vector and the horizontal axis, makes the histogram invariant to the direction of the motion (left or right).

For each microexpression video sequence, optical flow and histograms are calculated for all frames, normalized and concatenated to produce a descriptor for each microexpression video clip.
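A minimal sketch of the dense-flow variant of this descriptor is shown below: Farnebäck optical flow between consecutive frames followed by a magnitude-weighted histogram of primary angles. The bin count and flow parameters are illustrative, not the values used in the experiments.

# Illustrative dense HOF sketch (Farnebäck flow + primary-angle histogram).
import cv2
import numpy as np

def hof_descriptor(gray_frames, bins=8):
    hists = []
    for prev, curr in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        fx, fy = flow[..., 0], flow[..., 1]
        mag = np.hypot(fx, fy)
        # Primary angle: smallest signed angle to the horizontal axis, which
        # makes the histogram invariant to left/right motion direction.
        theta = np.arctan2(fy, fx)
        theta = np.where(theta > np.pi / 2, theta - np.pi, theta)
        theta = np.where(theta < -np.pi / 2, theta + np.pi, theta)
        h, _ = np.histogram(theta, bins=bins, range=(-np.pi / 2, np.pi / 2),
                            weights=mag)
        hists.append(h / (h.sum() + 1e-9))
    return np.concatenate(hists)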

Figure 3.12 shows the optical flow calculated using the Lucas-Kanade method and the computed histogram of optical flow for a video segment excerpted from an SMIC HS microexpression video clip. Eighteen facial landmarks detected by DLib were used as interest points. The arrowed lines represent the computed optical flow vectors.

3.2.8 Descriptor Combinations

After individual evaluation, descriptors can be combined (concatenated) to be used as input for standalone classifiers. In this case, Principal Component Analysis (PCA) [36] is applied to the single feature vectors before concatenation, so that dimensionalities are equalized, i.e., the number of features introduced by each individual descriptor to the final feature vector is the same. Although dimensionality reduction may cause some variance loss for some descriptors, it is used to avoid concatenating descriptors of different dimensionalities (in some cases differing by orders of magnitude).

All 57 combinations of two or more of the six descriptors (Geometric, Action Unit, HOG3D, WLD, LBP-TOP and HOF) evaluated in this work are tested. The parameter settings that led to the best individual results for each descriptor are applied.
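The combination scheme described above can be sketched as follows, assuming one feature matrix per descriptor; the number of PCA components is a placeholder value, not the setting used in the experiments.

# Illustrative descriptor concatenation with PCA-equalized dimensionality.
import numpy as np
from sklearn.decomposition import PCA

def combine_descriptors(descriptor_matrices, n_components=20):
    # descriptor_matrices: one (n_samples, n_features_i) matrix per descriptor.
    reduced = [PCA(n_components=n_components).fit_transform(X)
               for X in descriptor_matrices]
    return np.hstack(reduced)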


Figure 3.12: Lucas-Kanade optical flow and corresponding histograms of optical flow calculated on a microexpression video segment.

3.3 Classification

The SVM, KNN, RF and AdaBoost classifiers are evaluated in this work with each single descriptor, as well as with descriptor concatenations. The different classifiers are trained and used to predict microexpression classes through cross-validation.

The best classification results obtained for each single descriptor are then combined to yield a final classification. As different motion, texture and shape descriptors are used with different classification methods, the set of samples correctly or incorrectly classified by each descriptor-classifier pair is not necessarily the same. It is expected that combining these different results will improve the final accuracy.

The diagram presented in Figure 3.13 illustrates the classifier combination concept. Seven different combination schemes are evaluated: hard and soft majority voting, hard and soft weighted voting, and three variants of the stacking technique, which are detailed in the following subsections.

3.3.1 Voting

Voting is the most popular classifier combination method, of which four variants are explored in this work. Hard majority voting takes the class labels predicted by the standalone classifiers and simply counts the votes received by each class. The class that received the largest number of votes is selected as the final prediction. Mathematically, the scheme can be written as [66, 95]:

$$c(x) = \arg\max_j \left( \sum_i c_i^j(x) \right) \qquad (3.2)$$

where $c_i^j(x)$ is one if classifier $i$ predicts class $j$ for sample $x$, and zero otherwise.


Figure 3.13: Classifier combination diagram (figure adapted from [65]).

Instead of predicted classes, soft majority voting uses the probabilities predicted by the standalone classifiers for each class to compute the final prediction. This takes into account not only the predictions, but also the confidence level of the individual classifiers [95]. In this case, $c_i^j(x)$ is the probability predicted by classifier $i$ of sample $x$ being of class $j$.

Weighted voting, in turn, assumes that standalone classifiers have different performances and, as such, should be given different power in voting. This is done by assigning them different weights. The accuracy computed for a descriptor-classifier pair from its individual cross-validation results is used in this work as its weight. In this case, the final prediction is calculated as [66, 95]:

$$c(x) = \arg\max_j \left( \sum_i w_i \, c_i^j(x) \right) \qquad (3.3)$$

where $w_i$ is the weight assigned to classifier $i$ and $c_i^j(x)$ can be calculated either from predicted class labels (hard weighted voting) or probabilities (soft weighted voting).
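A compact sketch of Equations 3.2 and 3.3 is given below, assuming each classifier contributes an (n_samples, n_classes) score matrix (one-hot labels for hard voting, predicted probabilities for soft voting); uniform weights reduce the weighted rule to plain voting. Names are illustrative.

# Illustrative implementation of the voting rules in Equations 3.2 and 3.3.
import numpy as np

def combined_vote(per_classifier_scores, weights=None):
    scores = np.stack(per_classifier_scores)          # (n_classifiers, n_samples, n_classes)
    if weights is None:
        weights = np.ones(scores.shape[0])            # plain majority/soft voting (Eq. 3.2)
    weighted = np.tensordot(weights, scores, axes=1)  # weighted sum over classifiers (Eq. 3.3)
    return weighted.argmax(axis=1)                    # class with the largest total vote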

3.3.2 Stacking

Stacking is a meta-learning technique for classifier combination. Classifications predicted by the standalone (first-level) classifiers are used as the input features for a (second-level) meta-classifier, which predicts the final classification. The meta-classifier typically outperforms the individual classifiers by learning when they classify samples correctly or incorrectly, so the technique is best suited for cases in which the standalone classifiers (or descriptor-classifier pairs) are different.


A variant of the basic method is to use class probabilities instead of class labels as input features for the meta-classifier [1, 66, 74, 95]. In this case, the size of the meta-feature vector is multiplied by the number of classes.

Stacking is usually applied using a logistic regression model to predict the final classifications [65]. This algorithm is evaluated in this work for both stacking variants. Additionally, KNN is also explored as another possible meta-classifier algorithm and results are compared.

Stacking Cross-validation

While the voting classifier combination methods used in this work do not require training, in order to obtain final combined classifications through stacking, the meta-classifier needs to be trained and tested through two levels of cross-validation. This is because if the exact same data used to train the first-level classifiers are used to generate the predictions used for training the meta-classifier, there is a risk of overfitting. To avoid that, the following two algorithms were applied:

• In the first algorithm, cross-validation is equally applied in two separate levels. First, the dataset is divided into folds and used for cross-validating the first-level classifiers (the same folds are used with each of them) to generate the meta-features. The exact same folds are then used once more to divide the meta-features for the meta-classifier cross-validation. The procedure is depicted in Algorithm 1. Although mostly used in practical implementations, this approach has a subtle data leakage that could cause the stacking model to overfit [27]. If, for example, the k-fold strategy is used to randomly divide the data into 5 folds (D1, D2, D3, D4, D5), while D2, D3, D4 and D5 are used to train the first-level classifiers to generate the meta-features for D1, these D1 meta-features will be used to train the meta-classifier to predict the final classifications for D2, D3, D4 and D5.

• The second algorithm eliminates this small data leakage by using a nested cross-validation approach. First, the dataset is divided into folds (for example, the same 5 folds D1, D2, D3, D4, D5) and one of the folds (e.g., D1) is separated for testing. The remaining data (D̄1 = D2 ∪ D3 ∪ D4 ∪ D5) is re-divided into smaller folds to cross-validate the first-level classifiers and generate the meta-features. Once this is done, the first-level classifiers are re-trained using the complete D̄1 set, while the meta-features obtained from D̄1 are used to train the meta-classifier. Predictions for the untouched D1 fold are finally obtained using this trained stacking model. The procedure is then repeated for the other folds (D2, D3, D4 and D5), as depicted in Algorithm 2.

In both Algorithms 1 and 2, the k-fold strategy is used to divide the dataset into folds. It is important to notice, though, that any other division strategy could be equally applied in both cases.


Algorithm 1 Stacking with two-level K-fold cross-validation (adapted from [1])
Input: Dataset D = {xi, yi}, descriptor/classifier pairs P = {pj}, where pj = (dj, cj)
Output: Final classification results F
1:  Step 1: Apply cross-validation to get first-level predictions
2:  Randomly split D into K equal-sized subsets D1, D2, ..., DK
3:  for k ← 1 to K do
4:      for each pj in P do
5:          Train descriptor/classifier pair p_j^k from D \ Dk
6:      for each (xi, yi) in Dk do
7:          Create a record {x'i, yi}, where x'i = {p_1^k(xi), p_2^k(xi), ..., p_J^k(xi)}
8:          Accumulate the new record in D'
9:  Step 2: Apply cross-validation to get the second-level predictions
10: Use the same random split from Step 1 on the collection D'
11: for k ← 1 to K do
12:     Train second-level classifier sk from D' \ D'k
13:     for each x'i in D'k do
14:         fi ← sk(x'i)
15:         Accumulate fi in F
16: return F

Algorithm 2 Stacking with nested K-fold cross-validation (adapted from [1])
Input: Dataset D = {xi, yi}, descriptor/classifier pairs P = {pj}, where pj = (dj, cj)
Output: Final classification results F
1:  Step 1: Apply cross-validation to get second-level predictions
2:  Randomly split D into K equal-sized subsets D1, D2, ..., DK
3:  for k ← 1 to K do
4:      Step 1.1: Apply cross-validation to get first-level predictions
5:      Randomly split D̄k = D \ Dk into L equal-sized subsets Dk1, Dk2, ..., DkL
6:      for l ← 1 to L do
7:          for each pj in P do
8:              Train descriptor/classifier pair p_j^kl from D̄k \ Dkl
9:          for each (xi, yi) in Dkl do
10:             Create a record {x'i, yi}, where x'i = {p_1^kl(xi), p_2^kl(xi), ..., p_J^kl(xi)}
11:             Accumulate the new record in D'k
12:     Step 1.2: Re-train first-level classifiers
13:     for each pj in P do
14:         Train descriptor/classifier pair p_j^k from D̄k
15:     Step 1.3: Get second-level predictions for the reserved test set
16:     Train second-level classifier sk from the collection D'k
17:     for each xi in Dk do
18:         x'i ← {p_1^k(xi), p_2^k(xi), ..., p_J^k(xi)}
19:         fi ← sk(x'i)
20:         Accumulate fi in F
21: return F
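A compact Python sketch of Algorithm 1 is shown below, assuming X is a dictionary mapping each descriptor name to its feature matrix and pairs maps descriptor names to classifiers (names are illustrative). A fixed, shared KFold split plays the role of the common fold division used in both levels; classifiers are assumed to expose predict_proba.

# Illustrative two-level stacking (Algorithm 1) using scikit-learn utilities.
import numpy as np
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.linear_model import LogisticRegression

def stacking_two_level(X, y, pairs, n_splits=5):
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    # Step 1: first-level cross-validated class probabilities become meta-features.
    meta = np.hstack([cross_val_predict(clf, X[name], y, cv=cv,
                                        method="predict_proba")
                      for name, clf in pairs.items()])
    # Step 2: the same folds are reused to cross-validate the meta-classifier.
    return cross_val_predict(LogisticRegression(max_iter=1000), meta, y, cv=cv)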


3.4 Datasets

This section describes the datasets used in the experiments carried out as part of this work.

3.4.1 SMIC Dataset

The Spontaneous Micro-Expression Database (SMIC) [44] comprises 164 video clips containing spontaneous microexpressions and 164 clips without microexpressions. Each microexpression is classified as positive (corresponding to happiness expressions), negative (including sadness, fear and disgust expressions) or surprise. The videos were recorded from 16 participants (European and Asian men and women). They were captured at the Center for Machine Vision and Signal Analysis of the University of Oulu, Finland, using a high-speed (HS) camera (100 FPS). Additionally, for 8 out of the 16 participants, a normal-speed visual (VIS, 25 FPS) camera and a near-infrared (NIR) camera were also used, so that three datasets are included in SMIC: SMIC HS, SMIC VIS and SMIC NIR.

Microexpressions were induced in participants in an indoor bunker environment, resembling an interrogation room, where they were shown selected movie clips that can elicit strong emotions. Participants were instructed not to reveal their true feelings under the threat of being punished (by having to fill in a long and boring questionnaire). This setting created a high-stake lie situation very close to real life. After watching every movie clip, participants had to fill in a report according to their true feelings about the film.

Captured videos were segmented and labeled by two annotators, according to the emotions self-reported by the participants. Only labels that were consistent with these self-reports were included. Two versions of the resulting dataset are available: a raw version and a preprocessed version, where faces were aligned and cropped. Both versions contain microexpression clips including only frames from onset to offset. Figure 3.14 depicts the frame sequence of a video clip containing a microexpression captured by the high-speed camera.

Figure 3.14: SMIC HS dataset [44] frame sequence containing a surprise microexpression.

3.4.2 CASME II Dataset

The Chinese Academy of Sciences Micro-Expression II (CASME II) [88] was created in the Chinese Academy of Sciences and contains 247 samples of spontaneous microexpressions


recorded from 26 participants (Asian men and women). Samples are classified into five categories: happiness, disgust, surprise, repression and others. The videos were captured using a high-speed camera (200 FPS).

The spontaneous microexpressions were elicited by submitting the participants to a process similar to the one used to build the SMIC dataset, with the difference that half of the participants were instructed to keep neutral faces (i.e., not to reveal their true feelings), while the other half should try to suppress facial movements only when they realized there was a facial expression. The goal was to elicit two different types of microexpressions, as described by Ekman [21]: those of which the ego is not aware, and those which the ego senses and interrupts in mid-performance.

Two coders were involved in the analysis and labeling of the microexpressions. The resulting dataset is available in three versions: the first contains the raw microexpression video frames, the second contains the microexpression clips including only frames from onset to offset, while the third is a preprocessed version of the second, where faces were aligned and cropped. Figure 3.15 shows a sample from this dataset.

Figure 3.15: CASME II dataset [88] frame sequence containing a disgust microexpression.


Chapter 4

Experiments

This chapter presents the experimental results of the proposed method for microexpression recognition. The evaluation strategy used in the experiments is described and results are discussed and compared to the ones reported in the literature.

All experiments were executed on a 2.6 GHz Intel(R) Core(TM) i5 CPU with 8 GB RAM using OS X version 10.10.2 and the Python programming language version 2.7.11 with the following libraries: SciPy and NumPy [76], scikit-learn [59], scikit-image [77], OpenCV [8] and Dlib [38].

The following tools were compiled under OS X from the original source code:

• OpenFace, a facial behavior analysis toolkit [5]: C++ source code from Baltrušaitis [2], University of Cambridge, Carnegie Mellon University.

• HOG3D, a tool for computing the 3D gradient descriptor [41]: C++ source code from Kläser [40], National Institute for Research in Computer Science and Control (Inria).

Additionally, the following implementations were converted to Python:

• Riesz Pyramids for Fast Phase-Based Video Magnification [80]: pseudocode from Wadhwa et al. [78], Massachusetts Institute of Technology.

• Temporal Interpolation Model [94]: Matlab implementation from the Center for Machine Vision and Signal Analysis [93], University of Oulu.

• Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) [92]: Matlab implementation from the Center for Machine Vision and Signal Analysis [91], University of Oulu.

4.1 Evaluation Strategy

No standard protocols and metrics are specified to evaluate the SMIC [44] or the CASME II [90] datasets, so the following strategy was adopted in this work.


Cross-validation: Videos were partitioned into training and testing sets using two different cross-validation protocols: k-fold, with k = 5, and leave-one-subject-out (LOSO). In the LOSO protocol, video samples are split according to the subject from which they were captured. In each cross-validation iteration, samples from one subject are left out for testing, while samples from all other subjects make up the training set. Experiments with standalone classifiers were done using k-fold cross-validation for faster execution, while classifier combination testing was done using both methods. Although not standardized, the LOSO protocol is the one most commonly employed in facial expression and microexpression recognition works. In all cases, the classifiers were trained and tested using the complete set of microexpression videos available for each dataset.
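The LOSO protocol maps directly onto scikit-learn's group-based splitting, as the hedged sketch below illustrates; it assumes an array holding the subject identifier of each video sample, and the classifier choice is illustrative only.

# Illustrative LOSO evaluation: each fold holds out all samples of one subject.
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.svm import SVC

def loso_predictions(X, y, subjects):
    logo = LeaveOneGroupOut()
    # groups=subjects ensures every test fold contains exactly one subject.
    return cross_val_predict(SVC(kernel="rbf", C=10.0), X, y,
                             cv=logo, groups=subjects)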

Metrics: A confusion matrix was computed for each experiment, from which the accuracy, precision, recall and F1-score metrics were calculated.

Accuracy is the fraction of correct predictions calculated as

$$\text{accuracy} = \frac{\sum_j TP_j}{N} \qquad (4.1)$$

where $TP_j$ is the number of true positive predictions for class $j$, i.e., the number of correctly predicted samples belonging to class $j$, and $N$ is the total number of samples in the dataset.

Precision measures the ability of the classifier not to label as belonging to class $j$ a sample that actually belongs to another class. It is first separately calculated for each class as

$$\text{precision}_j = \frac{TP_j}{TP_j + FP_j} \qquad (4.2)$$

where $FP_j$ is the number of false positive predictions for class $j$, i.e., the number of samples for which the predicted class is $j$ but the actual class is not $j$.

The overall precision is then computed as the average of the class precisions, weighted by the number of samples $N_j$ that actually belong to each class, i.e.

$$\text{precision} = \frac{\sum_j N_j \cdot \text{precision}_j}{N} \qquad (4.3)$$

Recall is the ability of the classifier to correctly predict all samples belonging to class $j$. It is calculated for each class as

$$\text{recall}_j = \frac{TP_j}{TP_j + FN_j} \qquad (4.4)$$

where $FN_j$ is the number of false negative predictions for class $j$, i.e., the number of samples for which the actual class is $j$ but the predicted class is not $j$. The overall recall is then

$$\text{recall} = \frac{\sum_j N_j \cdot \text{recall}_j}{N} \qquad (4.5)$$


Finally, the F1-score is the harmonic mean of precision and recall, i.e.,

$$\text{F1-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (4.6)$$

Comparison between different results obtained as part of this research work is done using the F1-score, unless otherwise stated, while both F1-score and accuracy are used in the comparison to results reported by others.
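These metrics can be computed from the predictions with scikit-learn, as sketched below; the weighted averages correspond to Equations 4.1-4.5, and the F1-score of Equation 4.6 is taken as the harmonic mean of the weighted precision and recall. The helper name is illustrative.

# Illustrative metric computation from true and predicted labels.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

def evaluate(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    acc = accuracy_score(y_true, y_pred)
    prec, rec, _, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted")
    f1 = 2 * prec * rec / (prec + rec)                # Equation 4.6
    return {"confusion_matrix": cm, "accuracy": acc,
            "precision": prec, "recall": rec, "f1": f1}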

4.2 Descriptor Results

In this section, the results of microexpression recognition experiments using different descriptors are reported for the SMIC and CASME II datasets. For both datasets, the microexpression clips including only frames from onset to offset, with cropped and aligned face images, were used. Experiments were organized so that different descriptors are evaluated separately and later combined (concatenated) with each other. For each descriptor, sub-experiments were executed to evaluate the effect of using different parameters, preprocessing techniques or classifiers.

Also for both datasets, the frames from the different video clips with cropped/aligned faces have slightly different sizes. Preliminary experiments indicated that, for most of the evaluated descriptors, better results are achieved if the frames of all video clips have the same size. For this reason, in the experiments described in this chapter, unless otherwise stated, frame size normalization preprocessing is done, meaning that all frames of every video clip are re-scaled to the minimum frame width and height found in the dataset.

It is important to note that, as mentioned in Chapter 3, in addition to the frame size, using video clips with the same length (number of frames) simplifies the descriptor building process and tends to yield better recognition results. This is done in most of the reported experiments by applying the Temporal Interpolation Model [94] to the video clips during preprocessing. Interpolation is done to ten different lengths (10, 20, 30, 40, 50, 60, 70, 80, 90, 100) with the purpose of evaluating its effect on microexpression recognition results using each single descriptor.

In the motion magnification experiments, microexpression video clips are magnified with cut-off frequencies of 0.4 and 3.0 Hz at ten different magnification factors (α = 2, 4, 6, 8, 10, 12, 14, 16, 18, 20) before temporal interpolation is applied and features are extracted.

The SVM, RF, KNN and AdaBoost classifiers were evaluated with all single descriptors. Results obtained by using the classifier that achieved the best scores are presented for each experiment. The optimal values of the classifier parameters were determined by using the exhaustive grid search strategy implemented through scikit-learn [59], as follows (a sketch of this search is given after the list). Other parameter tuning optimizers might be explored in future work.

• For the SVM classifier, the kernel type was searched in {linear, RBF}, C in {1, 10^1, 10^2, 10^3, 5×10^3, 10^4, 5×10^4, 10^5} and γ in {10^-1, 10^-2, 5×10^-3, 10^-3, 5×10^-4, 10^-4}. Additionally, the usage of class weights (inversely proportional to class frequencies in the dataset) was also tested.


• For the RF classifier, the number of trees in the forest was searched in {5, 10, 20, 200, 700}, the maximum depth of the tree in {5, 10, 20, None}, where None indicates that nodes are expanded until all leaves are pure (contain only one sample), and the number of features to consider when looking for the best split was searched in {√n, log2 n}, where n is the number of features.

• For KNN, the number of neighbors k was searched in [1, 9], the weight function used in prediction was searched in {uniform, distance}, where uniform indicates that all neighbors are weighted equally, while distance weights points by the inverse of their distance, and the algorithm used to compute the nearest neighbors was searched in {ball tree, k-d tree, brute force}.

• For AdaBoost, the maximum number of estimators at which boosting is terminated was searched in {10, 50, 100} and the boosting algorithm in {SAMME, SAMME.R}.
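The grid search referenced before the list can be sketched as follows for the SVM case; the grid mirrors the values listed above, while the scoring choice and helper name are illustrative assumptions.

# Illustrative exhaustive grid search for the SVM classifier with scikit-learn.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = [
    {"kernel": ["linear"], "C": [1, 10, 1e2, 1e3, 5e3, 1e4, 5e4, 1e5],
     "class_weight": [None, "balanced"]},
    {"kernel": ["rbf"], "C": [1, 10, 1e2, 1e3, 5e3, 1e4, 5e4, 1e5],
     "gamma": [1e-1, 1e-2, 5e-3, 1e-3, 5e-4, 1e-4],
     "class_weight": [None, "balanced"]},
]

def tune_svm(X, y, cv=5):
    search = GridSearchCV(SVC(), param_grid, cv=cv, scoring="f1_weighted")
    search.fit(X, y)
    return search.best_estimator_, search.best_params_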

4.2.1 Geometric Descriptor

The experiments conducted to evaluate the performance of the geometric descriptor are reported next. The following test configuration/parameters apply:

• Original color frames were used for all experiments, as preliminary tests indicated that conversion to grayscale does not lead to better results.

• As described in Chapter 3, the geometric descriptor is built by concatenating facial landmark locations and distances for each frame in the video clip. Therefore, to build descriptors with the same size for all videos in the dataset, temporal interpolation (or some equivalent method that can equalize video clip lengths) is a required preprocessing step. As a result, temporal interpolation was used in all experiments with the geometric descriptor.

• SVM classifier results are reported for all experiments, as it consistently performed better than the other classifiers when using the geometric descriptor.

Experiment 1: Comparing different sets of facial landmarks and distances and evaluating interpolated video clip length

This experiment explored the use of four different sets of facial landmarks and distances, as described in Chapter 3, to build the geometric descriptor for microexpression recognition. It also evaluated the effect of the interpolated video clip length (number of interpolated frames) on these results. Facial landmark detection was done using Dlib. SVM was used for classification with the k-fold cross-validation protocol.

Results are depicted in Figure 4.1. For both the SMIC HS and CASME II datasets, the 14 landmarks + 21 distances, 18 landmarks + 27 distances and 51 landmarks + 35 distances feature sets yielded similar results, while the original 8 landmarks + 6 distances set [68] scored below. Some conclusions that can be drawn from these results are: (1) the additional landmark locations and distances proposed in this research were relevant


for microexpression (ME) recognition results, and (2) the 14 landmarks + 21 distances feature set, which added only the points and distances considered in this work to be of higher relevance to microexpression recognition, appears to be enough to geometrically describe the microexpressions contained in the video clips, and using additional points and distances does not lead to a performance improvement. For these reasons, the feature set composed of 14 landmarks + 21 distances, which results in a descriptor with smaller dimensionality than the 18 landmarks + 27 distances and 51 landmarks + 35 distances sets, was selected to be used in subsequent experiments.

[Plots: F1-score versus number of interpolated frames for SMIC HS and CASME II, one curve per landmark/distance feature set.]

Figure 4.1: ME recognition results on the SMIC HS and CASME II datasets using the geometric descriptor built with different sets of facial landmark locations and distances, SVM classifier and k-fold cross-validation protocol.

Regarding the different lengths of the interpolated video clips, results do not show a clear tendency that could lead to a conclusion on what is the ideal number of microexpression video clip frames to be used to extract and build the geometric descriptor. For the purpose of this work, empirical values that allow a trade-off between recognition results and interpolated video clip length (smaller lengths generate smaller descriptors) were selected for the following experiments. Deeper studies on this matter are a subject for future work.

It is also important to observe that the selection of best-performing parameter values for use in subsequent experiments is done to narrow down the number of parameter value combinations to be tested, since brute-force experimentation with all possible parameter combinations would not be feasible. Other parameter value search strategies, such as randomized parameter optimization, might be used in future work.


Experiment 2: Comparing facial landmark detectors and using locations or distances only

Facial landmark detection precision is key to the performance of the geometric descriptor, so in addition to Dlib, the OpenFace facial behavior analysis toolkit [5] was also evaluated to detect facial landmarks. As described in Chapter 3, this toolkit is able to detect facial landmarks from individual images (in this case, the individual video clip frames) or to track them from a video or sequence of images treated as a single video. Both methods were tested as part of this experiment.

In addition to comparing facial landmark detectors, this experiment also explored different ways of building the geometric descriptor. Using both the 14 landmark point locations and 21 distances (as done in the previous experiment) is compared to utilizing the 14 locations or the 21 distances only. Video clips were interpolated into 90 frames for both datasets, since this sequence length yielded the best results (when using DLib for SMIC HS and OpenFace for CASME II, as observed next).

Results are presented in Figure 4.2 and show that better scores are achieved when Dlib is used to detect the facial landmarks used to build the geometric descriptor for the SMIC HS dataset, while OpenFace image facial landmark detection leads to better scores for the CASME II dataset. The reasons for these different results may be speculated to be related to the different video frame sizes/quality and illumination conditions between the SMIC HS and CASME II datasets, as well as to the evaluated facial landmark detectors themselves being trained on different datasets (iBUG 300-W for the DLib facial landmark detector and Multi-PIE for OpenFace), but they cannot be precisely determined from the experiments performed as part of this work and, as such, may be a subject for future work.

[Figure 4.2 compares, for the SMIC HS and CASME II panels, the F1-scores obtained with DLib, OpenFace detection and OpenFace tracking for three feature sets: locations + distances, locations only, and distances only.]

Figure 4.2: ME recognition results on the SMIC HS and CASME II datasets using the geometric descriptor built with different feature sets and facial landmark detectors, SVM classifier and k-fold cross-validation protocol.

From Figure 4.2, it is also possible to conjecture that better results are achieved when using both location and distance features calculated from landmarks detected by DLib, whereas distance features alone performed better when landmarks were detected by OpenFace.

Experiment 3: Exploring distance subsets

This experiment explored the use of subsets of distances (from the complete 21-distance set used in the previously reported experiments) to build the geometric descriptor. A brute-force test of all possible 1080 distance subsets described in Chapter 3 was performed. For SMIC HS, DLib was used for facial landmark detection and the distance set was concatenated to landmark locations to build the geometric descriptor. For CASME II, facial landmarks were detected by OpenFace and the descriptor contained distances only. Video clips were interpolated into 90 frames for both datasets.

Distance subsets that led to the best results are presented in Tables 4.1 and 4.2. The results obtained for the complete 21-distance set are also shown at the bottom of each table for comparison. The following observations can be made about these results:

• The d20 and d21 distance pair is present in all top 10 subsets for both datasets, which is strong evidence that it is more discriminative than its alternative pair, d2 and d8, which is absent in most cases.

• Other distances used in most of the top subsets for both datasets are d3, the d5 and d6 pair, and the d1 and d7 pair, which indicates that they are also highly relevant.

• The d18 and d19 pair appears in most of the CASME II top subsets, which also indicates it is a highly relevant feature for this dataset. It is not possible to draw this same conclusion for SMIC HS, however.

• d11 and the d9 and d10 pair are absent in most top 10 subsets for both datasets, which is an indication that these distances do not have much discriminative power when compared to the others.

• The difference between the first and tenth scores is higher for CASME II (0.0152 against 0.0058 for SMIC HS). A similar observation can be made when comparing the first score to the complete 21-distance set score (0.0610 for CASME II and 0.0301 for SMIC HS). This is due to the 14 landmark location features being used together with the distance features for SMIC HS, while distance features alone are used for CASME II, making results more sensitive to variations in the distance subset.

• d4 is present in all subsets for both datasets as specified in Equation 3.1.

It is important to notice that selecting a single general subset that would yield optimal results for any dataset is a difficult task that would require experimentation with various different datasets. For this reason, the different subsets that yielded the best result for each dataset were used to build the geometric descriptor utilized in the concatenated descriptors and classifier combination experiments presented later in this chapter.


[Table 4.1 marks which of the distances d1 to d21 compose each of the ten best-ranked subsets; the corresponding F1-scores are 0.7065, 0.7065, 0.7013, 0.7013, 0.7012, 0.7012, 0.7012, 0.7010, 0.7009 and 0.7007, while the complete 21-distance set ranks 507th with 0.6764.]

Table 4.1: Best ME recognition results on the SMIC HS dataset using the geometric descriptor built with 14 landmark locations and subsets of the 21-distance set, SVM classifier and k-fold cross-validation protocol.

[Table 4.2 marks which of the distances d1 to d21 compose each of the ten best-ranked subsets; the corresponding F1-scores are 0.6508, 0.6482, 0.6400, 0.6396, 0.6387, 0.6382, 0.6375, 0.6368, 0.6363 and 0.6356, while the complete 21-distance set ranks 474th with 0.5898.]

Table 4.2: Best ME recognition results on the CASME II dataset using the geometric descriptor built with subsets of the 21-distance set, SVM classifier and k-fold cross-validation protocol.

Experiment 4: Applying motion magnification

This experiment evaluated the effect of applying phase-based motion magnification with different magnification factors to microexpression video sequences before geometric feature extraction. The feature sets that generated the two best scores in the previous experiment were tested, as well as the complete 21-distance set (with the landmark location and distance features being used for SMIC HS and distance features only for CASME II). Facial landmark detection was done using DLib for SMIC HS and OpenFace for CASME II. Video clips were interpolated into 90 frames for both datasets.

Figure 4.3 shows the results achieved with different magnification factors α, with α = 1 indicating the scenario where no motion magnification is applied. It is possible to observe that microexpression recognition performance is not improved when motion magnification is applied, for either dataset, when using the geometric descriptor.


[Figure 4.3 plots F1-score against the magnification factor (0 to 22) for the SMIC HS and CASME II panels, with curves for the complete feature set, feature set #1 and feature set #2.]

Figure 4.3: ME recognition results on the SMIC HS and CASME II datasets using the geometric descriptor computed from magnified video clips, SVM classifier and k-fold cross-validation protocol.

4.2.2 Action Unit Descriptor

The experiments conducted to assess the Action Unit descriptor proposed in this research are reported next. Test configuration/parameters are:

• Preliminary experiments indicated that using grayscale frames when extracting these features yielded better recognition results, so conversion from color to grayscale was done as a preprocessing step for all experiments.

• As described in Chapter 3, the Action Unit descriptor is built by concatenating Action Unit occurrence and intensity features for each frame in the video clip. Therefore, to build descriptors with the same size for all videos in the dataset, temporal interpolation (or some equivalent method that can equalize video clip lengths) is a required preprocessing step. As a result, temporal interpolation was used in all experiments with the Action Unit descriptor (a sketch of this construction follows this list).

• SVM and KNN classifiers yielded better results for the SMIC HS dataset, while SVM and RF performed better on CASME II, depending on test configuration and parameter values. For this reason, the results achieved using these classifiers are presented.
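The sketch below illustrates the construction referred to in the second item above, assuming OpenFace's per-frame CSV output with AU presence (AUxx_c) and intensity (AUxx_r) columns; the column filtering and the linear resampling are simplifications, not the exact pipeline used in this work.

```python
import numpy as np
import pandas as pd

def build_au_descriptor(openface_csv, target_len, use_presence=True, use_intensity=True):
    """Concatenate per-frame AU features from an OpenFace CSV into one fixed-size vector."""
    df = pd.read_csv(openface_csv)
    cols = [c for c in df.columns
            if c.strip().startswith("AU") and
               ((use_presence and c.strip().endswith("_c")) or
                (use_intensity and c.strip().endswith("_r")))]
    frames = df[cols].to_numpy(dtype=float)             # (n_frames, n_au_features)
    src = np.arange(len(frames))
    dst = np.linspace(0, len(frames) - 1, target_len)   # temporal length equalization
    resampled = np.stack([np.interp(dst, src, frames[:, j])
                          for j in range(frames.shape[1])], axis=1)
    return resampled.flatten()                          # one descriptor per video clip
```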

Experiment 1: Comparing different feature sets and evaluating interpolated video clip length

This experiment explored different ways of building the Action Unit descriptor. Using both Action Unit presence and intensity features is compared to utilizing only one type of feature or the other. It also attempted to evaluate the effect of the interpolated video clip length on the microexpression recognition results. OpenFace was used to extract Action Unit features from individual images (static method). SVM and KNN classifiers were used for the SMIC HS dataset, while SVM and RF did the classification for CASME II, all with the k-fold cross-validation protocol.

Results are shown in Figure 4.4 and indicate that better scores were achieved when both presence and intensity features are used with the KNN classifier for SMIC HS, while using intensity features only with SVM generated better results for CASME II. As results are directly linked to the performance of OpenFace's Action Unit detection functionality, the reasons for the different results obtained for the two datasets cannot be determined from the experiments done as part of this work and might be a subject for future work.

[Figure 4.4 contains four panels plotting F1-score against the number of interpolated frames: SMIC HS and CASME II with the SVM classifier, SMIC HS with KNN and CASME II with RF; each panel has curves for the presence + intensity, presence only and intensity only feature sets.]

Figure 4.4: ME recognition results on the SMIC HS and CASME II datasets using the AU descriptor built with different AU feature sets, different classifiers and k-fold cross-validation protocol.

Similarly to what was observed for the geometric descriptor, results do not show a clear indication of what is the best number of interpolated frames to use for the Action Unit descriptor. Different values of this parameter were also used in the next experiment, with the same behavior being observed. As a result, empirical values that offer a compromise between recognition results and video clip length were selected for the experiments that follow, while a deeper analysis of this matter remains a subject for future work.

Experiment 2: Comparing Action Unit detection methods

The goal of this experiment was to compare the different Action Unit detection methods (static and dynamic) implemented by OpenFace. SVM and KNN classifiers were used for the SMIC HS dataset, while SVM and RF were used for CASME II.

Figures 4.5 and 4.6 respectively depict the results achieved with both methods when using the presence and intensity feature sets and the intensity feature set alone. It is possible to observe that better results are achieved using the static method in most cases, although the dynamic method scored higher when the SVM classifier is utilized with the intensity feature set.

[Figure 4.5 contains four panels plotting F1-score against the number of interpolated frames (SMIC HS/SVM, CASME II/SVM, SMIC HS/KNN and CASME II/RF), each comparing the static and dynamic AU detection methods.]

Figure 4.5: ME recognition results on the SMIC HS and CASME II datasets using the AU descriptor built with AU presence and intensity feature sets, different classifiers and k-fold cross-validation protocol.


[Figure 4.6 contains four panels plotting F1-score against the number of interpolated frames (SMIC HS/SVM, CASME II/SVM, SMIC HS/KNN and CASME II/RF), each comparing the static and dynamic AU detection methods.]

Figure 4.6: ME recognition results on the SMIC HS and CASME II datasets using the AU descriptor with AU intensity feature set, with different classifiers and k-fold cross-validation protocol.

Experiment 3: Applying motion magnification

This experiment evaluated the effect of applying phase-based motion magnification with different magnification factors to microexpression video sequences before Action Unit feature extraction. The feature sets that generated the best scores in the previous experiment were tested, i.e., Action Unit presence and intensity features detected using the static method for SMIC HS and intensity features alone detected using the dynamic method for CASME II. Video clips were interpolated into 30 and 90 frames for the SMIC HS and CASME II datasets, respectively.

Figure 4.7 shows the results achieved with different magnification factors α, with α = 1 indicating the scenario where no motion magnification is applied. It is possible to observe that microexpression recognition performance is again not improved when motion magnification is applied, for either dataset, with the Action Unit descriptor.


[Figure 4.7 plots F1-score against the magnification factor (0 to 22): SMIC HS with presence + intensity features and KNN, and CASME II with intensity features only and SVM.]

Figure 4.7: ME recognition results on the SMIC HS and CASME II datasets using the Action Unit descriptor computed from magnified video clips with k-fold cross-validation protocol.

4.2.3 HOG3D Descriptor

The experiments executed to evaluate the performance of the HOG3D descriptor on microexpression recognition are described next. The following test configuration/parameters apply:

• Preliminary experiments indicated that using grayscale frames when extracting these features yielded better recognition results, so conversion from color to grayscale was done as a preprocessing step for all experiments.

• The dimensionality of the HOG3D descriptor depends on the length of the video clip, so to obtain descriptors of the same size for all entries in the dataset, temporal interpolation (or some equivalent method that can equalize video clip lengths) is a required preprocessing step. As a result, temporal interpolation was used in all experiments with the HOG3D descriptor.

• The following parameter values were used to compute the HOG3D descriptor [40] in all experiments (all parameter values are given for the x × y × t dimensions, respectively): number of cells for histogram computation: 2×2×2; support region around sampling points: 12×12×6; polar bins (for polar coordinate quantization): 5×5×3; stride for dense sampling: 12×12×6, except for the case of video sequence interpolation to only 10 frames, when 6×6×3 was used; maximum scale factor for dense sampling: 12×12×2.

• Due to the high dimensionality of the resulting descriptors (which ranged from approximately 3,500 to 2,500,000 for the datasets used in this research, depending on parameter values and interpolated video clip length), PCA is applied to all HOG3D descriptors before classification.


• The SVM classifier consistently yielded better results for both datasets on all experiments conducted using the HOG3D descriptor. For this reason, the results achieved using this classifier are presented.
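A minimal sketch of this PCA-then-SVM evaluation is given below; the data, the number of retained components and the F1 averaging are illustrative assumptions, and fitting PCA inside the cross-validation pipeline (rather than once on all data) is one defensible way of implementing the step.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical data: one high-dimensional HOG3D descriptor per video clip, plus labels
X = np.random.rand(160, 5000)
y = np.random.randint(0, 3, size=160)

clf = make_pipeline(PCA(n_components=50), SVC(kernel="linear"))
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")  # k-fold cross-validation
print(scores.mean())
```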

Experiment 1: Comparing different sampling methods and evaluating interpolated video clip length

As described in Chapter 3, the HOG3D descriptor can be computed from a set of given interest points or through dense sampling, so the goal of this first experiment is to compare the two methods. For the set of interest points, subsets of 8, 14, 18 and 51 of the facial landmarks detected by DLib were used. The HOG3D descriptor was computed using the polar-coordinates quantization type.

In addition to the different sampling methods, this experiment also evaluated the effect of the interpolated video clip length (number of interpolated frames) on microexpression recognition performance.

Results are presented in Figure 4.8 and show that better scores are achieved when interpolating video clips to shorter lengths for both datasets. This may be because the differences between consecutive frames (resulting from the subtle microexpression movements) become more evident when the video sequences are down-sampled, which possibly makes the extracted HOG3D descriptor more discriminative. As a result, video clips are interpolated to 10 frames in all further experiments with the HOG3D descriptor.

[Figure 4.8 plots F1-score against the number of interpolated frames for the SMIC HS and CASME II panels, with curves for the 8, 14, 18 and 51 landmark sets and for dense sampling.]

Figure 4.8: ME recognition results on the SMIC HS and CASME II datasets using the HOG3D descriptor calculated with different sets of interest points and with dense sampling, SVM classifier and k-fold cross-validation protocol.

From Figure 4.8, it is also possible to observe that this experiment was not conclusive about which sampling method yields the best results on microexpression recognition. Computing the descriptor from the smaller sets of interest points (the 8, 14 and 18 facial landmark sets) led to similar, higher scores for the SMIC HS dataset, while dense sampling and the 8 and 51 facial landmark sets performed better for CASME II, also with similar results. For this reason, all sampling methods were explored once again in the next experiment.

Experiment 2: Evaluating quantization types

The goal of this experiment is to evaluate how the usage of different quantization types (polar-coordinates, dodecahedron and icosahedron) during HOG3D computation affects the performance of microexpression recognition. Evaluation is done for each of the sampling methods tested in the previous experiment, with video clips interpolated to 10 frames.

Figure 4.9 depicts the results obtained for each quantization type. The best scores were mostly obtained when using dodecahedron quantization, while polar-coordinates consistently performed below the others. It is also possible to observe that, unlike the results obtained when using polar-coordinates, both dodecahedron and icosahedron quantizations generated better scores for both datasets when the descriptor is computed from the 8 facial points of interest.

[Figure 4.9 plots F1-score for each sampling method (8, 14, 18 and 51 landmarks, and dense sampling) under polar, dodecahedron and icosahedron quantization, with one panel per dataset (SMIC HS and CASME II).]

Figure 4.9: ME recognition results on the SMIC HS and CASME II datasets using the HOG3D descriptor calculated with different quantization types, SVM classifier and k-fold cross-validation protocol.

Experiment 3: Comparing facial landmark detectors

This experiment explored the usage of OpenFace to detect the facial landmarks used as interest points to compute the HOG3D descriptor. Detection from individual images and tracking from video sequences were experimented with and compared to the results achieved using DLib. Results are depicted in Table 4.3, with the best result obtained for each dataset highlighted in bold. Sparse sampling with 8 facial landmarks was applied to video sequences interpolated to 10 frames for both datasets. Icosahedron and dodecahedron quantization types were used for SMIC HS and CASME II, respectively.

Tool                          F1-score (SMIC HS)   F1-score (CASME II)
OpenFace detection (image)    0.6813               0.6574
OpenFace tracking (video)     0.6395               0.6517
DLib                          0.7296               0.6841

Table 4.3: ME recognition results on the SMIC HS and CASME II datasets using the HOG3D descriptor built from 8 facial landmarks detected by different tools, SVM classifier and k-fold cross-validation protocol.

Unlike what was observed when comparing facial landmark detectors for the geometric descriptor, for HOG3D the best results were achieved when using DLib for both datasets.

Experiment 4: Applying motion magnification

This experiment evaluated the effect of applying phase-based motion magnification with different magnification factors to microexpression video sequences before HOG3D feature extraction. Both sparse and dense HOG3D feature extraction were experimented with. Icosahedron quantization was used for SMIC HS, while dodecahedron was utilized for CASME II, for both sampling methods. Sparse sampling was computed from 8 facial landmarks for both datasets and video sequences were interpolated to 10 frames.

Figure 4.10 shows the results achieved with different magnification factors α, with α = 1 indicating the scenario where no motion magnification is applied. It is possible to observe that microexpression recognition performance is not improved when motion magnification is applied to SMIC HS, for both the sparse and the dense HOG3D descriptor. Improvement was observed, however, for the CASME II dataset with the sparse HOG3D descriptor when the magnification factor is small (between 4 and 8).

4.2.4 WLD Descriptor

The experiments executed to evaluate the performance of the WLD descriptor on microexpression recognition are reported next. Test configuration/parameters used in these experiments are:

• As WLD is computed from pixel intensity, conversion from color to grayscale is the first preprocessing step applied to all experiments.

• Preliminary tests indicated that, for the original WLD descriptor, frame size normalization does not lead to better results, so unlike what is done in other experiments, this preprocessing step is not applied in this case. For spatial WLD, on the other hand, frame size normalization is a required step, so that the number of blocks into which frames are divided is the same for all video sequences.


[Figure 4.10 plots F1-score against the magnification factor (0 to 22) for the SMIC HS and CASME II panels, with curves for sparse and dense sampling.]

Figure 4.10: ME recognition results on the SMIC HS and CASME II datasets using the HOG3D descriptor computed from magnified video clips, SVM classifier and k-fold cross-validation protocol.

• As described in Chapter 3, the WLD descriptor used in this research is built by concatenating the histograms calculated for each frame in the video clip. Therefore, to build descriptors with the same size for all videos in the dataset, temporal interpolation (or some equivalent method that can equalize video clip lengths) is a required preprocessing step, so it was used in all experiments.

• Due to the high dimensionality of the resulting descriptors (which ranged from approximately 1,200 to 5,000,000 for the datasets used in this research, depending on parameter values and interpolated video clip length), PCA is applied to all WLD descriptors before classification.

• The KNN classifier consistently yielded better results for the SMIC dataset in all experiments conducted using the WLD descriptor. For CASME II, however, while KNN still performed better with the original WLD, SVM achieved higher scores with spatial WLD. The results achieved using these best performing classifiers are presented.

• The ranges of the T, M, S and spatial WLD block size parameter values used in these experiments are guided by the values used by Ullah et al. [75].
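As a rough illustration of the descriptor discussed in this section, the sketch below computes a single-frame WLD histogram with T orientation bins and M×S differential-excitation bins, following the general formulation of Chen et al.; the neighbourhood layout and bin boundaries are simplifications and may differ from the implementation evaluated in this work.

```python
import numpy as np

def wld_histogram(gray, T=8, M=4, S=4):
    """Simplified single-frame WLD histogram (T orientation x M*S excitation bins)."""
    g = gray.astype(np.float64) + 1e-6                    # avoid division by zero
    c = g[1:-1, 1:-1]
    neigh = (g[:-2, :-2] + g[:-2, 1:-1] + g[:-2, 2:] +    # sum over the 8 neighbours
             g[1:-1, :-2] + g[1:-1, 2:] +
             g[2:, :-2] + g[2:, 1:-1] + g[2:, 2:])
    xi = np.arctan((neigh - 8.0 * c) / c)                 # differential excitation
    theta = np.arctan2(g[2:, 1:-1] - g[:-2, 1:-1],        # gradient-like orientation
                       g[1:-1, 2:] - g[1:-1, :-2]) % (2 * np.pi)
    t_bin = (theta / (2 * np.pi) * T).astype(int) % T
    e_bin = np.clip(((xi + np.pi / 2) / np.pi * M * S).astype(int), 0, M * S - 1)
    hist = np.zeros((T, M * S))
    np.add.at(hist, (t_bin.ravel(), e_bin.ravel()), 1.0)
    return hist.ravel()                                   # per-frame histograms are concatenated

# Example on a random grayscale frame: 8 * 4 * 4 = 128 bins
print(wld_histogram(np.random.rand(170, 140)).shape)
```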

Experiment 1: Evaluating interpolated video clip length

This experiment evaluates the effect of the interpolated video clip length on microexpression recognition performance when using the original and spatial WLD descriptors. The T, M, S WLD histogram parameters were set to 8, 4, 4 and the spatial WLD block size to 12×12.


Results are presented in Figure 4.11 and indicate that better scores are possibly achieved when interpolating video clips to shorter lengths for both datasets when using the original WLD descriptor. For spatial WLD, however, although the same trend was observed for SMIC HS, on CASME II the interpolated video sequence length does not seem to significantly affect recognition results. Lengths that led to the highest scores in each case are used in the following experiments.

[Figure 4.11 plots F1-score against the number of interpolated frames: SMIC HS with the original and spatial WLD descriptors and KNN, and CASME II with the original WLD and KNN and the spatial WLD and SVM.]

Figure 4.11: ME recognition results on the SMIC HS and CASME II datasets using the original and spatial WLD descriptors, with k-fold cross-validation protocol.

It is remarkable that, although similar results were achieved when using both the original and spatial WLD descriptors for SMIC HS, spatial WLD clearly outperforms the original WLD on CASME II.

Experiment 2: Evaluating spatial WLD block size

The size of the blocks into which video frames are divided for WLD histogram computation is an important parameter to build the spatial WLD descriptor, so this experiment evaluates the effect of using different block sizes on microexpression recognition results. Nine block sizes were used for both datasets: 12×12, 12×24, 24×24, 24×36, 36×36, 36×48, 48×48, 48×60 and 60×60 pixels. The T, M and S parameters were set to 8, 4 and 4, respectively, and video clips were interpolated to 20 frames for SMIC HS and 60 frames for CASME II.

Results are depicted in Figure 4.12 and indicate that higher scores are achieved when using blocks of medium to large size (36×36, 36×48, 48×48 pixels) for both datasets, with 36×48 pixels yielding the highest score for SMIC HS and 48×48 pixels performing better on CASME II.


[Figure 4.12 plots F1-score for block sizes from 12×12 up to 60×60 pixels, with one panel per dataset (SMIC HS and CASME II).]

Figure 4.12: ME recognition results on the SMIC HS and CASME II datasets using the spatial WLD descriptor built with different block sizes, KNN classifier for SMIC HS and SVM classifier for CASME II and k-fold cross-validation protocol.

Experiment 3: Evaluating WLD histogram parameter combinations

This experiment evaluates various combinations of the T, M, S parameters used to build the WLD histogram for both the original and spatial WLD. Some combinations were selected for testing, where T = 4, 6, 8, M = 4, 6 and S = 4, 8, 10. Video clips were interpolated to 30 and 20 frames for SMIC HS and to 10 and 60 frames for CASME II, respectively, for the original and spatial WLD descriptors. For spatial WLD, a 36×48 block size was used on SMIC HS, while 48×48 was utilized on CASME II.

Figure 4.13 presents the results achieved for the experimented combinations. The highest score was reached when using (T, M, S) = (6, 4, 4) for both datasets with the spatial WLD descriptor. For the original WLD, the combination (8, 6, 8) yielded the best result for SMIC HS, while (8, 4, 4) performed better for CASME II. Figure 4.13 also shows that, once all parameters were tuned, spatial WLD performed significantly better than the original WLD for both datasets.

Experiment 4: Applying motion magnification

This experiment evaluated the effect of applying phase-based motion magnification with different magnification factors to microexpression video sequences before WLD feature extraction. Both the original and spatial WLD feature extraction were experimented with. Parameter settings that yielded the best results in the previous experiments were used: (T, M, S) = (8, 6, 8) with video sequences interpolated to 30 frames for the original WLD on SMIC HS; (T, M, S) = (6, 4, 4) with 20-frame interpolation and 36×48 block size for spatial WLD on SMIC HS; (T, M, S) = (8, 4, 4) with 10-frame interpolation for the original WLD on CASME II; and (T, M, S) = (6, 4, 4) with 60 frames and 48×48 block size for spatial WLD on CASME II.

Figure 4.14 shows the results achieved with different magnification factors α.


[Figure 4.13 plots F1-score for the (T, M, S) combinations (4,4,4), (4,6,8), (6,4,4), (6,6,10), (8,4,4), (8,4,8), (8,6,8) and (8,6,10): SMIC HS with the original and spatial WLD and KNN, and CASME II with the original WLD and KNN and the spatial WLD and SVM.]

Figure 4.13: ME recognition results on the SMIC HS and CASME II datasets using the original and spatial WLD descriptors built with different (T, M, S) parameter combinations and k-fold cross-validation protocol.

As in the previous experiments, α = 1 indicates the scenario where no motion magnification is applied. It is possible to observe that microexpression recognition performance is not improved when motion magnification is applied to SMIC HS, for both the original and the spatial WLD descriptor. Improvement was observed, however, for the CASME II dataset with the spatial WLD descriptor for most of the experimented magnification factors, with the best score being achieved when α = 6.

[Figure 4.14 plots F1-score against the magnification factor (0 to 22): SMIC HS with the original and spatial WLD and KNN, and CASME II with the original WLD and KNN and the spatial WLD and SVM.]

Figure 4.14: ME recognition results on the SMIC HS and CASME II datasets using the WLD descriptor computed from magnified video clips with k-fold cross-validation protocol.


4.2.5 LBP-TOP Descriptor

The experiments conducted to evaluate the performance of the LBP-TOP descriptor are reported next. The following test configuration/parameters apply:

• As LBP-TOP is computed from pixel intensity, conversion from color to grayscale is the first preprocessing step applied to all experiments.

• As the LBP-TOP histogram is calculated for the video volume (as opposed to frame by frame), video length equalization is not required. Temporal interpolation is then applied for performance evaluation purposes and results are compared to the scenario where no length equalization method is used.

• The LBP-TOP radius parameter was set to 1 and the number of neighbor points to 8 in all experiments.

• Due to the high dimensionality of the resulting descriptors for some smaller block sizes (up to approximately 600,000 for the datasets used in this research, depending on parameter values), PCA is applied to all LBP-TOP descriptors before classification.

• The SVM classifier yielded better results for both datasets on most experiments conducted using the LBP-TOP descriptor. The results achieved using this classifier are presented next.
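For illustration, the sketch below computes a basic LBP-TOP histogram (radius 1, 8 neighbours, no block division) for a grayscale video volume: ordinary LBP histograms from the XY, XT and YT planes are concatenated. It is a simplified stand-in, not the implementation used in this work.

```python
import numpy as np

def lbp_codes(plane):
    """Basic 8-neighbour, radius-1 LBP codes for a 2D array (no interpolation)."""
    c = plane[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros(c.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neigh = plane[1 + dy:plane.shape[0] - 1 + dy, 1 + dx:plane.shape[1] - 1 + dx]
        code |= (neigh >= c).astype(np.uint8) << np.uint8(bit)
    return code

def lbp_top(volume):
    """LBP histograms from the XY, XT and YT planes of a (t, y, x) grayscale volume."""
    t, y, x = volume.shape
    plane_sets = ([volume[k] for k in range(t)],          # XY planes
                  [volume[:, j, :] for j in range(y)],    # XT planes
                  [volume[:, :, i] for i in range(x)])    # YT planes
    hists = []
    for planes in plane_sets:
        codes = np.concatenate([lbp_codes(p).ravel() for p in planes])
        hists.append(np.bincount(codes, minlength=256) / codes.size)
    return np.concatenate(hists)                          # 3 * 256 = 768 bins

# Example on a random 10-frame clip of 60x60 frames
print(lbp_top(np.random.rand(10, 60, 60)).shape)
```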

Experiment 1: Evaluating interpolated video clip length

This experiment evaluates the effect of the interpolated video clip length on microexpression recognition performance when using the LBP-TOP descriptor for the entire video volume (no block division). Using the original video sequences without temporal interpolation is also experimented with.

Results are presented in Figure 4.15 and indicate that better scores are achieved when interpolating video clips to shorter lengths for both datasets when using the LBP-TOP descriptor. Scores obtained when interpolating to 10 frames not only outperform the ones from longer interpolated sequences, but also the results achieved when no temporal interpolation is applied. Similarly to what was observed for the HOG3D descriptor, this may be due to the differences between consecutive frames becoming more evident when the video sequences are down-sampled.

Experiment 2: Evaluating LBP-TOP block size

The size of the blocks into which video volumes are divided for LBP-TOP histogram computation is an important parameter to build this descriptor, so this experiment evaluates the effect of using different block sizes on microexpression recognition results. Fourteen different block sizes were tested. Video clips were interpolated to 10 frames for both datasets.

Results are depicted in Figure 4.16 and indicate that higher scores are achieved when using blocks of 24×36×5 pixels for SMIC HS and 36×48×5 pixels for CASME II.


[Figure 4.15 plots F1-score against the number of interpolated frames for the SMIC HS and CASME II panels, including the case without temporal interpolation.]

Figure 4.15: ME recognition results with and without temporal interpolation on the SMIC HS and CASME II datasets using the LBP-TOP descriptor (single block) with SVM classifier and k-fold cross-validation protocol.

[Figure 4.16 plots F1-score for a single block and for the fourteen block sizes ranging from 12×12×5 up to 48×48×10, with one panel per dataset (SMIC HS and CASME II).]

Figure 4.16: ME recognition results on the SMIC HS and CASME II datasets using the LBP-TOP descriptor built with different block sizes, SVM classifier and k-fold cross-validation protocol.

Experiment 3: Applying motion magnification

This experiment evaluated the effect of applying phase-based motion magnification with different magnification factors to microexpression video sequences before LBP-TOP feature extraction. Parameter settings that yielded the best results in the previous experiments were used: 24×36×5 and 36×48×5 blocks for the SMIC HS and CASME II datasets, respectively, with video sequences interpolated to 10 frames.

Figure 4.17 shows the results achieved with different magnification factors α, with α = 1 indicating the scenario where no motion magnification is applied. It is possible to observe that microexpression recognition performance is not improved when motion magnification is applied, for either dataset.

[Figure 4.17 plots F1-score against the magnification factor (0 to 22) for the SMIC HS and CASME II panels.]

Figure 4.17: ME recognition results on the SMIC HS and CASME II datasets using the LBP-TOP descriptor computed from magnified video clips with SVM classifier and k-fold cross-validation protocol.

4.2.6 HOF Descriptor

The experiments executed to assess the performance of the HOF descriptor on microexpression recognition are described next. The following test configuration/parameters apply:

• As optical flow is calculated from pixel intensity, conversion from color to grayscale is the first preprocessing step applied to all experiments.

• As described in Chapter 3, the HOF descriptor used in this research is built by concatenating the histograms calculated for each frame in the video clip. Therefore, to build descriptors with the same size for all videos in the dataset, temporal interpolation (or some equivalent method that can equalize video clip lengths) is a required preprocessing step, so it was used in all experiments.

• The following parameter values were used for Lucas-Kanade sparse optical flow computation using OpenCV in all experiments: size of search window: 15×15; maximum pyramid level number: 2. Additionally, the following parameter values were used for Farnebäck dense optical flow calculation using OpenCV: average window size: 40; number of pyramid layers: 3; image scale to build pyramids: 0.5; number of iterations the algorithm does at each pyramid level: 3; size of the pixel neighborhood used to find polynomial expansion: 5; standard deviation of the Gaussian used to smooth derivatives utilized as a basis for the polynomial expansion: 1.2.


• The SVM classifier yielded better results for the SMIC HS dataset on most experiments executed using the HOF descriptor. For the CASME II dataset, the SVM classifier performed better for sparse HOF, whereas RF yielded the best results for dense HOF. Results achieved using these classifiers are presented next.
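For illustration, the sketch below computes a magnitude-weighted orientation histogram from Farnebäck dense optical flow between two consecutive grayscale frames, using the parameter values listed above; whether the histogram bins are weighted by flow magnitude is an assumption of this sketch, not something stated in the text.

```python
import cv2
import numpy as np

def dense_hof(prev_gray, next_gray, bins=18):
    """18-bin histogram of dense optical flow orientations between two uint8 frames."""
    # Farnebäck parameters: pyr_scale=0.5, levels=3, winsize=40, iterations=3,
    # poly_n=5, poly_sigma=1.2, flags=0 (the values listed in the bullet above)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 40, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])   # angle in radians, [0, 2*pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi), weights=mag)
    return hist / (hist.sum() + 1e-9)
```

The per-frame histograms produced this way would then be concatenated over the (temporally interpolated) clip, as described above.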

Experiment 1: Comparing different sampling methods and evaluating interpolated video clip length

As described in Chapter 3, the Farnebäck [24] and Lucas-Kanade [51] methods were used in this work to calculate optical flow. While the first method computes dense optical flow, the second calculates it for a given set of interest points. Facial landmarks, or, more specifically, the subsets of 8, 14, 18 and 51 of the facial landmarks detected by DLib, were used as interest points for Lucas-Kanade optical flow computation in this experiment. The histograms computed with 18 bins from the optical flow calculated through both sampling methods and the different facial landmark subsets were used as descriptors for microexpression recognition.

This experiment also evaluates the effect of the interpolated video clip length on microexpression recognition performance when using the HOF descriptor.

Results are depicted in Figure 4.18 and show that similar scores are achieved when using the dense and sparse optical flow calculation methods, as well as with the different sets of interest points. Sparse HOF calculated from 8 facial landmarks achieved the best score for SMIC HS with video clips interpolated to 50 frames, while dense HOF with 10-frame interpolation achieved the highest score for CASME II. Results obtained from both datasets appear to indicate that scores tend to get lower as the number of interpolated frames increases, although some peaks can be observed around 50 and 60 frames.

[Figure 4.18 plots F1-score against the number of interpolated frames for the SMIC HS and CASME II panels, with curves for sparse HOF from the 8, 14, 18 and 51 landmark sets (SVM) and for dense sampling (SVM for SMIC HS, RF for CASME II).]

Figure 4.18: ME recognition results on the SMIC HS and CASME II datasets using the HOF descriptor calculated by different methods, with k-fold cross-validation protocol.


Experiment 2: Evaluating the number of histogram bins

The goal of this experiment is to evaluate the effect of the number of bins used to build the histogram of optical flow on microexpression recognition. The test configuration and parameter values that yielded the best results for both the sparse and dense HOF in the previous experiment are used for each dataset: sparse HOF calculated from 8 facial landmarks with video sequences interpolated to 50 frames and dense HOF with 90-frame interpolation are used for SMIC HS, while sparse HOF computed from 8 facial landmarks with 20-frame interpolation and dense HOF with video sequences interpolated to 10 frames are used for CASME II.

Results are presented in Figure 4.19 and indicate that higher scores are achieved with smaller numbers of bins for both the sparse and dense HOF descriptors on both datasets, meaning that HOF is possibly more discriminative for microexpression recognition when optical flow orientation is grouped into fewer, more general (less precise) directions.

[Figure 4.19 plots F1-score against the number of HOF bins (18 to 198) for the SMIC HS and CASME II panels, with curves for sparse HOF (SVM) and dense HOF (SVM for SMIC HS, RF for CASME II).]

Figure 4.19: ME recognition results on the SMIC HS and CASME II datasets using the sparse and dense HOF descriptor built with different numbers of bins, with k-fold cross-validation protocol.

Experiment 3: Comparing facial landmark detectors

This experiment explored the usage of OpenFace to detect the facial landmarks used as interest points to calculate sparse HOF. Detection from individual images and tracking from video sequences were experimented with and compared to the results achieved using DLib, as depicted in Table 4.4. The configuration and parameter values that generated the best result using sparse HOF with DLib for each dataset in the previous experiment were utilized: 8 facial landmarks were used as the interest points for both datasets and video sequences were interpolated to 50 frames for SMIC HS and 20 frames for CASME II.

Similarly to what was observed when comparing facial landmark detectors for the geometric descriptor, for HOF the best score was achieved when using DLib for SMIC HS and OpenFace for CASME II.


Tool                          F1-score (SMIC HS)   F1-score (CASME II)
OpenFace detection (image)    0.3637               0.4178
OpenFace tracking (video)     0.4374               0.4649
DLib                          0.5044               0.4563

Table 4.4: ME recognition results on the SMIC HS and CASME II datasets using the HOF descriptor built from 8 facial landmarks detected by different tools, SVM classifier and k-fold cross-validation protocol.


Experiment 4: Applying motion magnification

This experiment evaluated the effect of applying phase-based motion magnification with different magnification factors to microexpression video sequences before HOF feature extraction. Both sparse and dense HOF feature extraction were experimented with. Sparse sampling with 8 facial landmarks was used for both datasets, with landmarks being detected by DLib for SMIC HS and tracked by OpenFace for CASME II. Video sequences were interpolated to 50 and 90 frames for sparse and dense sampling, respectively, for SMIC HS, and to 20 and 10 frames for CASME II.

Figure 4.20 shows the results achieved with different magnification factors α, with α = 1 indicating the scenario where no motion magnification is applied. It is possible to observe that microexpression recognition performance was improved for the SMIC HS dataset when using dense HOF with magnification factor α = 14. No improvement was achieved, however, when motion magnification is applied to CASME II with either the sparse or the dense HOF descriptor.

[Figure 4.20 plots F1-score against the magnification factor (0 to 22) for the SMIC HS and CASME II panels, with curves for sparse HOF (SVM) and dense HOF (SVM for SMIC HS, RF for CASME II).]

Figure 4.20: ME recognition results on the SMIC HS and CASME II datasets using the HOF descriptor computed from magnified video clips with k-fold cross-validation protocol.


4.2.7 Descriptor Combinations

After separate evaluation, descriptors were combined (concatenated) and used as input for standalone classifiers. Parameter settings that led to the best results for each descriptor were applied, which are summarized in Tables 4.5 and 4.6.

Descriptor     TIM   Mag    Parameters                                     Classifier   F1-score
Geometric      90    None   14 locations + 21-distance subset #1; DLib     SVM          0.7065
HOG3D          10    None   Sparse; 8 landmarks; DLib; Icosahedron         SVM          0.7296
WLD            20    None   36×48 block; (T,M,S)=(6,4,4)                   KNN          0.6408
Action Unit    30    None   Presence + intensity                           KNN          0.6556
LBP-TOP        10    None   24×36×5 block                                  SVM          0.6397
HOF            50    None   Sparse; 8 landmarks; DLib; 18 bins             SVM          0.5044

(The preprocessing columns list TIM, the number of temporally interpolated frames, and Mag, the motion magnification factor.)

Table 4.5: Best results achieved with each single descriptor for the SMIC HS dataset.

Descriptor     TIM   Mag    Parameters                                     Classifier   F1-score
Geometric      90    None   21-distance subset #1; OpenFace                SVM          0.6508
HOG3D          10    α=6    Sparse; 8 landmarks; DLib; Dodecahedron        SVM          0.7019
WLD            60    α=6    48×48 block; (T,M,S)=(6,4,4)                   SVM          0.6473
Action Unit    90    None   Intensity only                                 SVM          0.5808
LBP-TOP        10    None   36×48×5 block                                  SVM          0.6992
HOF            20    None   Sparse; 8 landmarks; OpenFace; 18 bins         SVM          0.4621

Table 4.6: Best results achieved with each single descriptor for the CASME II dataset.

PCA was applied to all single feature vectors before concatenation, so that dimensionalities are equalized. All 57 combinations of two or more of the six descriptors (Geometric, Action Units, HOG3D, WLD, LBP-TOP and HOF) were experimented with each classifier (SVM, RF, KNN and AdaBoost), with SVM performing better for both datasets. Combinations that led to the best results are presented in Tables 4.7 and 4.8 and show that using concatenated descriptors did not improve the best score achieved with individual descriptors for SMIC HS. It is also possible to observe that the best performing single descriptor (HOG3D) is present in all ten best performing concatenations. For CASME II, on the other hand, performance improvement was indeed achieved, as expected, with the HOG3D and WLD descriptors being present in all ten best concatenations.
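A minimal sketch of this combination procedure is given below; the feature matrices, the retained PCA dimensionality and the scoring call are illustrative assumptions rather than the exact protocol used in these experiments (in particular, fitting PCA on the full data before cross-validation is a simplification).

```python
from itertools import combinations
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical per-descriptor feature matrices for the same clips, plus the labels
descriptors = {"geometric": np.random.rand(160, 300),
               "hog3d": np.random.rand(160, 4000),
               "lbp_top": np.random.rand(160, 2000)}
y = np.random.randint(0, 3, size=160)

# Equalize dimensionalities with PCA before concatenation
reduced = {name: PCA(n_components=50).fit_transform(X) for name, X in descriptors.items()}

for k in range(2, len(reduced) + 1):                  # every combination of 2+ descriptors
    for combo in combinations(sorted(reduced), k):
        X = np.hstack([reduced[name] for name in combo])
        f1 = cross_val_score(SVC(), X, y, cv=5, scoring="f1_macro").mean()
        print(combo, round(f1, 4))
```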

4.3 Classifier Combination Results

The best classification results obtained for each single descriptor in the previous experiments are combined to yield a final classification. Four different voting combination schemes and three variants of the stacking combination method are experimented with.


[Table 4.7 marks which of the six descriptors (Geometric, HOG3D, WLD, AU, LBP-TOP, HOF) form each of the ten best concatenations; the corresponding F1-scores are 0.7125, 0.7123, 0.7074, 0.7066, 0.7004, 0.6949, 0.6945, 0.6731, 0.6731 and 0.6731.]

Table 4.7: Best ME recognition results on the SMIC HS dataset using concatenated descriptors, SVM classifier and k-fold cross-validation protocol.

[Table 4.8 marks which of the six descriptors (Geometric, HOG3D, WLD, AU, LBP-TOP, HOF) form each of the ten best concatenations; the corresponding F1-scores are 0.7184, 0.7179, 0.7179, 0.7179, 0.7165, 0.7138, 0.7136, 0.7135, 0.7114 and 0.7091.]

Table 4.8: Best ME recognition results on the CASME II dataset using concatenated descriptors, SVM classifier and k-fold cross-validation protocol.

For each descriptor explored in this work, parameter settings that yielded the top two classification results are paired with the best performing classifier to be used as input to the combination algorithm. These pairs are summarized in Tables 4.9 and 4.10.

For the classifier combination experiments, in addition to k-fold, the leave-one-subject-out (LOSO) cross-validation protocol is also used to generate the individual descriptor/classifier results. Instead of partitioning samples into random subsets, LOSO divides the dataset by grouping samples obtained from the same (human) subject. For each cross-validation iteration, the samples captured from one subject are used as the validation subset. As a result, LOSO is generally a more difficult protocol and, as such, is expected to produce lower scores. LOSO scores for the best standalone descriptor/classifier pairs are also presented in Tables 4.9 and 4.10.
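The LOSO protocol can be reproduced with scikit-learn's LeaveOneGroupOut, as in the hypothetical sketch below, where groups holds the subject identifier of each clip; the data and classifier are placeholders.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

# Hypothetical data: descriptors X, microexpression labels y, and the subject id of each clip
X = np.random.rand(160, 50)
y = np.random.randint(0, 3, size=160)
groups = np.random.randint(0, 20, size=160)

logo = LeaveOneGroupOut()
scores = cross_val_score(SVC(), X, y, groups=groups, cv=logo, scoring="f1_macro")
print(scores.mean())   # all clips of one subject are held out in each fold
```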

Predictions obtained from these 12 selected descriptor/classifier pairs were tested with different classifier combination algorithms, as detailed in the following subsections.


Descriptor    TIM  Mag    Parameters                                      Classifier  F1 (k-fold)  F1 (LOSO)
Geometric     90   None   14 locations + 21-distance subset #1; DLib      SVM         0.7065       0.5132
Geometric     90   None   14 locations + 21-distance subset #2; DLib      SVM         0.7065       0.5016
HOG3D         10   None   Sparse; 8 landmarks; DLib; Icosahedron          SVM         0.7296       0.5300
HOG3D         10   None   Sparse; 18 landmarks; DLib; Dodecahedron        SVM         0.7059       0.5414
WLD           20   None   36×48 block; (T,M,S)=(6,4,4)                    KNN         0.6408       0.3391
WLD           20   None   36×48 block; (T,M,S)=(8,4,4)                    KNN         0.6402       0.3798
Action Unit   30   None   Presence + intensity                            KNN         0.6556       0.3702
Action Unit   30   None   Presence only                                   KNN         0.6328       0.3671
LBP-TOP       10   None   24×36×5 block                                   SVM         0.6397       0.4575
LBP-TOP       10   None   24×24×10 block                                  SVM         0.6396       0.5581
HOF           50   None   Sparse; 8 landmarks; DLib; 18 bins              SVM         0.5044       0.4561
HOF           90   α=14   Dense; 18 bins                                  SVM         0.4703       0.3974

Table 4.9: Best results achieved with individual descriptor/classifier pairs for the SMIC HS dataset.

Descriptor    TIM  Mag    Parameters                                      Classifier  F1 (k-fold)  F1 (LOSO)
Geometric     90   None   21-distance subset #1; OpenFace                 SVM         0.6508       0.4807
Geometric     90   None   21-distance subset #2; OpenFace                 SVM         0.6482       0.4957
HOG3D         10   α=6    Sparse; 8 landmarks; DLib; Dodecahedron         SVM         0.7019       0.5689
HOG3D         10   α=8    Sparse; 8 landmarks; DLib; Dodecahedron         SVM         0.6908       0.5949
WLD           60   α=6    48×48 block; (T,M,S)=(6,4,4)                    SVM         0.6473       0.4830
WLD           60   α=16   48×48 block; (T,M,S)=(6,4,4)                    SVM         0.6386       0.4543
Action Unit   90   None   Intensity only                                  SVM         0.5808       0.4858
Action Unit   90   α=2    Intensity only                                  SVM         0.5794       0.3902
LBP-TOP       10   None   36×48×5 block                                   SVM         0.6992       0.5004
LBP-TOP       10   None   36×36×5 block                                   SVM         0.6945       0.4557
HOF           20   None   Sparse; 8 landmarks; OpenFace; 18 bins          SVM         0.4649       0.4167
HOF           10   None   Dense; 18 bins                                  RF          0.4621       0.3940

Table 4.10: Best results achieved with individual descriptor/classifier pairs for the CASME II dataset.

4.3.1 Voting

Four variants of the voting method were explored in this work for classifier combination: hard and soft majority voting, and hard and soft weighted voting. Hard voting takes the class labels predicted by the standalone descriptor/classifier pairs and counts the votes received by each class, while soft voting uses the class probabilities to compute the final prediction. For weighted voting, the accuracies computed for the descriptor/classifier pairs are used as weights.
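A compact sketch of these four schemes is given below; the array layouts and example numbers are hypothetical, and the weights are taken to be the per-pair accuracies, as described above.

```python
import numpy as np

def combine_votes(label_preds, proba_preds, weights=None, hard=True):
    """label_preds: (n_pairs, n_samples) predicted class labels per pair;
    proba_preds: (n_pairs, n_samples, n_classes) class probabilities per pair;
    weights: per-pair accuracies (None gives unweighted majority voting)."""
    n_pairs, n_samples = label_preds.shape
    n_classes = proba_preds.shape[2]
    w = np.ones(n_pairs) if weights is None else np.asarray(weights, dtype=float)
    if hard:                                              # weighted counts of predicted labels
        scores = np.zeros((n_samples, n_classes))
        for p in range(n_pairs):
            scores[np.arange(n_samples), label_preds[p]] += w[p]
    else:                                                 # weighted sum of class probabilities
        scores = np.tensordot(w, proba_preds, axes=1)
    return scores.argmax(axis=1)

# Toy example: 3 pairs, 4 samples, 2 classes, weighted by hypothetical accuracies
labels = np.array([[0, 1, 1, 0], [0, 0, 1, 1], [1, 1, 1, 0]])
probas = np.random.dirichlet(np.ones(2), size=(3, 4))
print(combine_votes(labels, probas, weights=[0.72, 0.70, 0.65], hard=True))
```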

The following experiments were executed to evaluate the performance of these voting classifier combination methods on microexpression recognition.


Experiment 1: Evaluating different voting methods

This experiment applied the four voting method variants to the class labels and probability predictions obtained from the selected 12 best descriptor/classifier pairs. Results are presented in Table 4.11 and indicate that the best results were achieved with the hard voting methods for both datasets (hard weighted voting scored slightly better for SMIC HS, while hard majority voting achieved a minimally higher score for CASME II). It is also possible to observe that significant improvement was obtained in the final score (compared to the individual descriptor/classifier pair scores) for CASME II, while no improvement was achieved for SMIC HS.

Voting Method           F1 k-fold (SMIC HS)  F1 k-fold (CASME II)  F1 LOSO (SMIC HS)  F1 LOSO (CASME II)
Hard Majority Voting    0.7231               0.7482                0.5921             0.6353
Soft Majority Voting    0.6979               0.6808                0.4708             0.4365
Hard Weighted Voting    0.7239               0.7424                0.5923             0.6186
Soft Weighted Voting    0.7042               0.6887                0.4708             0.4431

Table 4.11: Voting results using 12 descriptor/classifier pairs.

Experiment 2: Exploring prediction subsets

This experiment explored the usage of subsets of predictions (from the 12 best predictions used in the previous experiment) as the input to the voting algorithm. All 4017 subsets of the predictions obtained from 3 or more of the 12 descriptor/classifier pairs were tested with the voting algorithms that achieved the best scores in the previous experiment (hard majority and hard weighted voting).

Subsets that yielded the top 10 results using k-fold cross-validation are presented in Tables 4.12, 4.13, 4.14 and 4.15. Results obtained for the complete 12-prediction set are also shown at the bottom of each table for comparison.

The following observations can be made about these results:

• Significant improvement was obtained in the final score when using the top subsets (compared to the complete 12-prediction set, as well as to the standalone descriptor/classifier pair scores) for both datasets.

• Hard weighted voting yielded the highest scores for both datasets, significantly outperforming hard majority voting for SMIC HS.

• Both HOG3D descriptor/classifier pairs are present in most of the top 10 subsets, while at least one of them appears in all of these subsets. This was expected, as HOG3D is the best performing standalone descriptor/classifier pair for both datasets and, as such, provides the most solid individual contribution to the final combination results.


Table 4.12: Best ME recognition results on the SMIC HS dataset using the hard majority voting classifier combination method and k-fold cross-validation. The subsets were drawn from the pairs Geo#1/SVM, Geo#2/SVM, HOG3D#1/SVM, HOG3D#2/SVM, WLD#1/KNN, WLD#2/KNN, AU#1/KNN, AU#2/KNN, LBP-TOP#1/SVM, LBP-TOP#2/SVM, HOF#1/SVM and HOF#2/SVM; the ten best subsets (of 5 to 7 pairs each) reached F1-scores from 0.7589 to 0.7639, while the complete 12-pair set ranked 1038th with an F1-score of 0.7231.

Table 4.13: Best ME recognition results on the SMIC HS dataset using the hard weighted voting classifier combination method and k-fold cross-validation. The subsets were drawn from the same 12 descriptor/classifier pairs as in Table 4.12; the ten best subsets (of 5 to 7 pairs each) reached F1-scores from 0.7661 to 0.7838, while the complete 12-pair set ranked 1476th with an F1-score of 0.7239.

• At least one LBP-TOP and one HOF pair are present in most of the top 10 subsets. While LBP-TOP is an average and a top performing descriptor/classifier pair for SMIC HS and CASME II, respectively, HOF pairs yielded the lowest scores among the descriptors evaluated in this work. This is evidence that poorly performing descriptors and individual classifiers may also provide valuable information that improves the final results of combination algorithms.


Table 4.14: Best ME recognition results on the CASME II dataset using the hard majority voting classifier combination method and k-fold cross-validation. The subsets were drawn from the pairs Geo#1/SVM, Geo#2/SVM, HOG3D#1/SVM, HOG3D#2/SVM, WLD#1/SVM, WLD#2/SVM, AU#1/SVM, AU#2/SVM, LBP-TOP#1/SVM, LBP-TOP#2/SVM, HOF#1/SVM and HOF#2/RF; the ten best subsets (of 6 to 11 pairs each) reached F1-scores from 0.7689 to 0.7738, while the complete 12-pair set ranked 245th with an F1-score of 0.7482.

Table 4.15: Best ME recognition results on the CASME II dataset using the hard weighted voting classifier combination method and k-fold cross-validation. The subsets were drawn from the same 12 descriptor/classifier pairs as in Table 4.14; the ten best subsets (of 6 to 10 pairs each) reached F1-scores from 0.7687 to 0.7769, while the complete 12-pair set ranked 522nd with an F1-score of 0.7424.



• Geometric and WLD descriptor/classifier pairs are present in most of the top 10 subsets for CASME II, which indicates they are valuable contributors to the final combination results for this dataset. However, they do not appear in several of the top SMIC HS subsets. The opposite observation can be made for the Action Unit descriptor/classifier pairs: while they are present in most of the top 10 subsets for SMIC HS, they are absent from several of the top CASME II subsets.

The same 4017 subsets were also evaluated using the LOSO cross-validation protocol, for which the best results are summarized in Tables 4.16 and 4.17. As with k-fold cross-validation, the top subsets also yield a significant improvement in the final score in comparison to the complete 12-prediction set and to the individual descriptor/classifier pairs. However, instead of hard weighted voting, hard majority voting yielded the highest scores for both datasets.

Table 4.16: Best ME recognition results on the SMIC HS dataset using the voting classifier combination method and LOSO cross-validation. Over the same 12 descriptor/classifier pairs as in Table 4.12, the best subset found for hard majority voting (9 pairs) reached an F1-score of 0.6418, and the best subset for hard weighted voting (9 pairs) reached 0.6409.

Table 4.17: Best ME recognition results on the CASME II dataset using the voting classifier combination method and LOSO cross-validation. Over the same 12 descriptor/classifier pairs as in Table 4.14, the best subset found for hard majority voting (8 pairs) reached an F1-score of 0.6866, and the best subset for hard weighted voting (6 pairs) reached 0.6784.

4.3.2 Stacking

Three variants of the stacking classifier combination method were evaluated in this work. The first variant uses the class labels predicted by the standalone descriptor/classifier pairs as input (meta) features for a second-level meta-classifier, while the second and third utilize, respectively, class probabilities and both class labels and probabilities. Logistic Regression and KNN were tested for the role of meta-classifier, with their optimal parameter values being determined through grid search, as follows:

• For Logistic Regression, C was searched in {1, 10, 10², 10³, 5×10³, 10⁴, 5×10⁴, 10⁵}, the optimization algorithm was searched in {liblinear, newton-cg, l-bfgs}, the penalty was searched in {l1, l2} (the newton-cg and l-bfgs algorithms support only the l2 penalty), and the multi-class scheme was searched in {one-vs-rest, cross-entropy loss} (liblinear supports only one-vs-rest). Additionally, the usage of class weights (inversely proportional to class frequencies in the dataset) was also tested.

• For KNN, the number of neighbors k was searched in [1, 9], the weight function used in prediction was searched in {uniform, distance}, where uniform indicates that all neighbors are weighted equally, while distance weights points by the inverse of their distance, and the algorithm used to compute the nearest neighbors was searched in {ball tree, k-d tree, brute force}. A minimal sketch of this grid search is shown below.
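
The following sketch outlines how such a grid search can be set up with scikit-learn; the parameter grids mirror the values listed above, while the scoring function, number of folds and variable names are illustrative assumptions:

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

Cs = [1, 1e1, 1e2, 1e3, 5e3, 1e4, 5e4, 1e5]

# Two sub-grids keep solver/penalty/multi-class combinations compatible.
lr_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "multi_class": ["ovr"],
     "C": Cs, "class_weight": [None, "balanced"]},
    {"solver": ["newton-cg", "lbfgs"], "penalty": ["l2"],
     "multi_class": ["ovr", "multinomial"], "C": Cs, "class_weight": [None, "balanced"]},
]
knn_grid = {
    "n_neighbors": list(range(1, 10)),
    "weights": ["uniform", "distance"],
    "algorithm": ["ball_tree", "kd_tree", "brute"],
}

lr_search = GridSearchCV(LogisticRegression(max_iter=1000), lr_grid, scoring="f1_macro", cv=5)
knn_search = GridSearchCV(KNeighborsClassifier(), knn_grid, scoring="f1_macro", cv=5)
# lr_search.fit(meta_features, labels)
# knn_search.fit(meta_features, labels)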

The experiments executed to assess the performance of these methods on microexpression recognition are described next.

Experiment 1: Evaluating different stacking variants

This experiment applied the three stacking variants to the class labels and probabilities predicted by the 12 best descriptor/classifier pairs. Two-level and nested cross-validation schemes were tested.

Results are presented in Table 4.18. It is possible to observe that higher scores are achieved with two-level than with nested cross-validation regardless of which meta-classifier and meta-features (class labels and/or probabilities) are used, with a significant improvement being achieved (in comparison to individual descriptor/classifier pair scores) for the former, but not for the latter. This may be an indication that the small data leakage that occurs when using the two-level cross-validation scheme is indeed causing the model to overfit. However, there is another, more evident and directly impacting, cause for these different final scores: first-level cross-validation results may be significantly less accurate in the nested scheme because it is done on a smaller dataset. In the first (inner) level of nested cross-validation, training is not only done with a smaller number of samples, but also with fold divisions in which certain classes may be underrepresented.
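
To make the two-level scheme concrete, the sketch below (an illustration under assumed names, not the exact pipeline used here) builds class-probability meta-features from out-of-fold predictions of one first-level classifier per descriptor and then cross-validates the meta-classifier on them; because the same samples are reused at both levels, the small leakage mentioned above can occur, whereas the nested scheme repeats the inner step within each outer training split:

import numpy as np
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def two_level_stacking_score(descriptor_features, labels, n_folds=5):
    # descriptor_features: list of (n_samples, n_dims) arrays, one per descriptor.
    meta_blocks = []
    for X in descriptor_features:
        base = SVC(kernel="linear", probability=True)
        # First level: out-of-fold class probabilities become meta-features.
        proba = cross_val_predict(base, X, labels, cv=n_folds, method="predict_proba")
        meta_blocks.append(proba)
    meta_features = np.hstack(meta_blocks)
    # Second level: cross-validate the meta-classifier on the meta-features.
    meta_clf = KNeighborsClassifier(n_neighbors=3)
    scores = cross_val_score(meta_clf, meta_features, labels, cv=n_folds, scoring="f1_macro")
    return scores.mean()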

Another important point to notice about the results of this experiment is that using class probabilities instead of (or in addition to) class labels as meta-features for the second-level classifier yielded better results than using only the class labels in most cases (the exception being the results obtained using the KNN meta-classifier with two-level cross-validation for the SMIC HS dataset). This was expected, as the class probabilities contain more information than class labels and, as such, result in more discriminative meta-features.

One last observation is related to the different results obtained when using Logistic Regression or KNN as the meta-classifier. Although Logistic Regression yielded superior results in most of the cases, the best stacking score for each dataset was obtained when using KNN (with class labels for SMIC HS and class probabilities for CASME II, both with two-level cross-validation).

CV Scheme   Meta-classifier  Meta-features          k-fold                LOSO
                                                    SMIC HS   CASME II    SMIC HS   CASME II
Two-level   LR               Labels                 0.7199    0.4880      0.4999    0.3568
Two-level   LR               Probabilities          0.7342    0.7329      0.5416    0.5910
Two-level   LR               Labels+Probabilities   0.7335    0.6989      0.5097    0.5963
Two-level   KNN              Labels                 0.7426    0.5744      0.5142    0.4659
Two-level   KNN              Probabilities          0.7041    0.7464      0.4889    0.5413
Two-level   KNN              Labels+Probabilities   0.7089    0.5988      0.4881    0.4826
Nested      LR               Labels                 0.6839    0.3897      0.4519    0.4075
Nested      LR               Probabilities          0.6957    0.7293      0.3577    0.5544
Nested      LR               Labels+Probabilities   0.6944    0.6759      0.3379    0.4292
Nested      KNN              Labels                 0.6154    0.5130      0.4526    0.3629
Nested      KNN              Probabilities          0.6664    0.6912      0.3789    0.5265
Nested      KNN              Labels+Probabilities   0.6769    0.5647      0.4302    0.5007

Table 4.18: Stacking results (F1-score) using 12 descriptor/classifier pairs.


Experiment 2: Exploring prediction subsets

This experiment explored the usage of subsets of the predictions obtained from the 12 best descriptor/classifier pairs as the meta-features for the stacking model. All 4017 subsets of the predictions obtained from 3 or more of the 12 first-level classifiers were tested with all stacking variants explored in this work.

Subsets that yielded the best results using two-level and nested k-fold cross-validation are presented in Tables 4.19 and 4.20.

Significant improvement was achieved in the final scores (compared to the complete 12-prediction sets used in the previous experiment, as well as to the individual descriptor/classifier pairs) when using both the two-level and the nested cross-validation approaches, for both datasets. As observed for the voting combination methods, it is noticeable that at least one of the HOG3D descriptor/classifier pairs is present in all of the top subsets for two-level cross-validation and in most of them for nested cross-validation.

The same 4017 subsets were also evaluated using two-level LOSO cross-validation with the stacking variants that yielded the best scores (class labels and class probabilities meta-features, respectively, for SMIC HS and CASME II, both with the KNN meta-classifier). Subsets that yielded the best results are depicted in Tables 4.21 and 4.22.

4.4 Discussion

This work evaluated a number of different descriptors and machine learning techniques for microexpression recognition [6]. More than 200,000 tests were executed, from which various important lessons were learned, making significant contributions to this research field, as described in the following subsections.


CV Scheme   Meta-classifier   Labels   Probabilities   Labels+Probabilities
Two-level   LR                0.7614   0.7582          0.7791
Two-level   KNN               0.8027   0.7651          0.7710
Nested      LR                0.7314   0.7394          0.7572
Nested      KNN               0.7294   0.7253          0.7238

Table 4.19: Best ME recognition results (F1-score) on the SMIC HS dataset using the stacking classifier combination method and k-fold cross-validation. Each entry corresponds to the best-scoring subset of the 12 descriptor/classifier pairs for that configuration.

CV Scheme   Meta-classifier   Labels   Probabilities   Labels+Probabilities
Two-level   LR                0.5201   0.7534          0.7527
Two-level   KNN               0.7089   0.7714          0.7093
Nested      LR                0.5078   0.7653          0.7360
Nested      KNN               0.6857   0.7293          0.6769

Table 4.20: Best ME recognition results (F1-score) on the CASME II dataset using the stacking classifier combination method and k-fold cross-validation. Each entry corresponds to the best-scoring subset of the 12 descriptor/classifier pairs for that configuration.


Table 4.21: Best ME recognition results on the SMIC HS dataset using the stacking classifier combination method and two-level LOSO cross-validation. The best configuration used the KNN meta-classifier with class label meta-features over a 5-pair subset, reaching an F1-score of 0.6295.

Table 4.22: Best ME recognition results on the CASME II dataset using the stacking classifier combination method and two-level LOSO cross-validation. The best configuration used the KNN meta-classifier with class probability meta-features over an 8-pair subset, reaching an F1-score of 0.6104.

4.4.1 Proposed Geometric Descriptor

The extensions proposed in this work to the geometric features introduced by Saeed et al. [67, 68] outperformed the original descriptor on microexpression recognition. This indicates that the added features (landmark locations and distances) contain important information about the geometry of microexpressions that makes it possible to distinguish between their emotion classes. As all three extended feature sets yielded similar scores, it is possible to conclude that the most discriminative among the added features are the ones selected to build the (smaller) 14 landmarks + 21 distances set. The other additional landmarks and distances used to build the 18 landmarks + 27 distances and 51 landmarks + 35 distances feature sets seem less relevant.
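
As a rough illustration of this kind of feature, the sketch below builds a frame-level vector from selected landmark coordinates plus Euclidean distances between chosen landmark pairs; the indices are hypothetical placeholders, not the exact 14-landmark/21-distance selection used in this work:

import numpy as np

# Hypothetical selections (NOT the exact sets proposed in this work).
SELECTED_LANDMARKS = [17, 21, 22, 26, 36, 39, 42, 45, 48, 51, 54, 57]
DISTANCE_PAIRS = [(17, 36), (26, 45), (21, 22), (48, 54), (51, 57), (39, 42)]

def geometric_features(landmarks):
    # landmarks: (68, 2) array of facial landmark coordinates for one frame.
    points = landmarks[SELECTED_LANDMARKS].ravel()  # selected (x, y) locations
    dists = np.array([np.linalg.norm(landmarks[i] - landmarks[j])
                      for i, j in DISTANCE_PAIRS])  # selected pairwise distances
    return np.concatenate([points, dists])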

Due to facial landmark detection playing a key role in geometric feature extraction, it should be considered as an important topic for future work. Results obtained in the geometric descriptor experiments indicate that small differences in detected point locations can have a high impact on classification results. This can be particularly observed when contrasting the different results obtained when using DLib and OpenFace on the same video sequences, so much so that the best results observed for SMIC HS and CASME II were achieved when using different detectors (DLib and OpenFace, respectively).

Compared to the other single descriptors evaluated in this work, the extended geometric descriptor performed well, achieving the second and third best scores for SMIC HS and CASME II, respectively, as presented in Tables 4.5 and 4.6. Additionally, it showed up as an important individual contributor to the final scores achieved with the classifier combination methods.

4.4.2 Other Descriptors

Of the other five descriptors evaluated in this work (Action Units, HOG3D, WLD, LBP-TOP and HOF), HOG3D stands out as the best single descriptor for a standalone microexpression classifier. In addition to yielding the best single descriptor scores for both datasets, it demonstrated its high discriminative power by being present in all but one of the best results obtained by each of the voting and stacking variants explored in this work.

The proposed use of Action Unit presence and intensity features extracted by OpenFace for microexpression recognition has shown intermediate, yet competitive, results as a single descriptor. It also turned out to be a valuable single descriptor in many of the best performing classifier combinations. Given these results, it is possible to conclude that Action Unit features are of great importance to microexpression recognition, and that further research could be done on their development and application. Among possible research lines, one can highlight the study of Action Unit detection algorithms and their potential adaptations to microexpression recognition.

4.4.3 Motion Magnification

Unlike the results reported by Le Ngo et al. [43], Wang et al. [83] and Li et al. [45], this work did not observe an overall improvement in microexpression recognition scores when motion magnification was applied as a preprocessing step. Nonetheless, improvement was perceived in some specific cases, such as with the HOG3D and WLD descriptors for the CASME II dataset and occasionally with the HOF descriptor for SMIC HS.

One can only speculate about why the results of applying motion magnification to microexpression recognition differ between this work and others. One possibility resides in differences in the implementation of the magnification algorithm itself (in this case, the Riesz Pyramid for Fast Phase-based Video Magnification method). The application of different preprocessing steps and techniques before and after motion magnification is another possible cause.

4.4.4 Descriptor Combinations

This is one of the most comprehensive studies of the application of different descriptors and classifiers to the microexpression recognition problem. Not only has it tested six different descriptors, namely Geometric, Action Units, HOG3D, WLD, LBP-TOP and HOF, but it has also run experiments with all 57 combinations (concatenations) of two or more of them as input to four different standalone classifiers (SVM, RF, KNN and AdaBoost).
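
A sketch of how these concatenated combinations can be enumerated is shown below (illustrative naming); all subsets of two or more of the six descriptors give C(6,2) + C(6,3) + C(6,4) + C(6,5) + C(6,6) = 57 combined feature matrices:

import numpy as np
from itertools import combinations

def descriptor_combinations(features_by_descriptor):
    # features_by_descriptor: dict mapping descriptor name -> (n_samples, n_dims) array.
    names = sorted(features_by_descriptor)
    for k in range(2, len(names) + 1):
        for subset in combinations(names, k):
            # Concatenate the selected descriptors along the feature axis.
            yield subset, np.hstack([features_by_descriptor[n] for n in subset])

# With the six descriptors (Geometric, Action Units, HOG3D, WLD, LBP-TOP, HOF),
# this generator yields the 57 combined feature matrices fed to each classifier.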

In general, the conclusion is that combining descriptors does not lead to much improvement on its own when strictly associated with a single classifier. However, as discussed later, different descriptors do help when individually associated with a standalone classifier and then combined using techniques such as stacking or voting.

4.4.5 Classifier Combinations

The best results presented in this work were achieved by combining different pairs of single descriptors/classifiers. As shown in Section 4.3, the best performing combinations depend on the tested dataset, the cross-validation protocol and the combination technique itself.

For SMIC HS, the best performance in the k-fold cross-validation tests was achieved with the class label stacking scheme when using the KNN meta-classifier with the Geometric/SVM, HOG3D/SVM, Action Units/KNN and LBP-TOP/SVM descriptor/classifier pairs. For LOSO cross-validation, on the other hand, hard majority voting with the Geometric/SVM, HOG3D/SVM, WLD/KNN, Action Units/KNN and LBP-TOP/SVM pairs was the top performer.

For the CASME II dataset, the best results in the k-fold cross-validation tests were achieved with the hard weighted voting technique with the Geometric/SVM, HOG3D/SVM, WLD/SVM, Action Units/SVM and LBP-TOP/SVM pairs, while, for LOSO cross-validation, hard majority voting with the Geometric/SVM, HOG3D/SVM, WLD/SVM, Action Units/SVM and HOF/SVM outperformed all other schemes.

As previously mentioned, HOG3D and Action Units proved to be the descriptors most frequently present in the winning combination sets for all the tested methods on both datasets, while, among the different combination methods, hard voting was the most common top performer. Given its simplicity, execution performance and overall results achieved in these experiments, it is fair to conclude that hard voting methods are the best options for combining classifiers for microexpression recognition on the evaluated datasets.

4.4.6 Comparison to the Literature

Comparing results against other works is not straightforward. In microexpression recognition, in particular, there are substantial differences in the use of cross-validation protocols such as k-fold, leave-one-subject-out (LOSO) and leave-one-out (LOO), among others, as well as in the testing scope (some researchers used only subsets of the benchmarking datasets) and evaluation metrics. For that reason, the results achieved in this work are first compared to its own implementation of the descriptor/classifier pair most commonly used in the literature: LBP-TOP/SVM. This comparison is a good indicator of the potential performance of other descriptors and classifiers against existing methods.

As shown earlier in Table 4.5, three different descriptors performed better than LBP-TOP for SMIC HS using k-fold cross-validation, including the Geometric and Action Unit descriptors proposed in this work. For CASME II, on the other hand, LBP-TOP was outperformed by HOG3D when using both k-fold and LOSO cross-validation, as depicted in Tables 4.6 and 4.10. These results indicate that other descriptors might actually add value when building more complex descriptor/classifier combinations.

This is corroborated when comparing the results achieved in this research against others in the literature. Particularly, for the CASME II dataset under the LOSO cross-validation protocol, the results achieved with the hard majority voting method outperformed the other works analyzed in this study, as shown in Table 4.23.

For the SMIC HS dataset, this work ranked among the top three LOSO results considered from the literature, as reported in Table 4.23. According to the data therein, the present work would rank in third place in the comparison. However, it is worth noting that the results presented by Wang et al. [81] are reported by Li et al. [45] as having been obtained using leave-one-out instead of leave-one-subject-out cross-validation.

In general, the results achieved herein outperform or are on par with those presented in the literature. Similarly, combined descriptor/classifiers outperform single descriptor/classifiers in this implementation. Therefore, it is fair to state that combining different descriptors with different classifiers leads to better microexpression recognition methods.

CV Protocol  Method                          SMIC HS               CASME II
                                             F1-score  Accuracy    F1-score  Accuracy
LOSO         Li et al. [44]                  N/A       0.4878      N/A       N/A
             Guo et al. [29]                 N/A       0.5372      N/A       N/A
             Wang et al. [82]                N/A       N/A         N/A       0.6176
             Wang et al. [81]                N/A       0.7134      N/A       0.6545
             Liong et al. [46]               N/A       0.5356      N/A       N/A
             Liong et al. [48]               0.6200    N/A         0.6100    N/A
             Huang et al. [32]               N/A       0.5793      N/A       0.5951
             Huang et al. [33]               0.6381    0.6402      0.5835    0.5839
             Oh et al. [56]                  N/A       N/A         0.4307    0.4615
             Oh et al. [55]                  0.4400    N/A         0.4100    N/A
             Li et al. [45]                  N/A       0.6829      N/A       0.6721
             Le Ngo et al. [43]              N/A       N/A         0.4700    0.5100
             Patel et al. [58]               N/A       0.5360      N/A       0.4730
             Breuer and Kimmel [11]          N/A       N/A         N/A       0.5947
             Proposed geometric descriptor   0.5132    0.5122      0.4807    0.4816
             Proposed voting method          0.6418    0.6402      0.6866    0.6857
             Proposed stacking method        0.6295    0.6341      0.6104    0.6163

k-fold       Guo et al. [29]                 N/A       0.6300      N/A       N/A
             Proposed geometric descriptor   0.7065    0.7073      0.6508    0.6490
             Proposed voting method          0.7838    0.7866      0.7769    0.7755
             Proposed stacking method        0.8027    0.8049      0.7714    0.7714

LOO          Yan et al. [88]                 N/A       N/A         N/A       0.6341
             Liong et al. [47]               N/A       0.5771      N/A       0.6640
             Wang et al. [84]                N/A       0.6402      N/A       0.6721
             Wang et al. [83]                N/A       N/A         N/A       0.7530

Table 4.23: Proposed methods compared to the literature for microexpression recognition.


Chapter 5

Conclusions and Future Work

Ever since the initial research works, detection and recognition of microexpressions have proven to be an important tool for psychotherapy, forensics, homeland security and negotiation. Given the fact that a microexpression does not last longer than 1/2 of a second, using computers to perform its automatic recognition is a natural step forward in the field. Research on machine learning applied to this field is relatively new; however, initial results are promising, despite the challenges involved.

Previous research works have applied different descriptors and machine learning algorithms to recognize microexpressions. This work contributes to this research field by:

• Exploring the application of different descriptors as input to classifiers, including HOG3D, WLD, LBP-TOP and HOF.

• Extending an existing geometric descriptor to improve classification results.

• Exploring the use of Action Unit features as input to classifiers.

• Combining descriptors to test their associated discriminative power.

• Exploring the use of different machine learning classification algorithms, such as SVM, RF, KNN and AdaBoost.

• Combining the output of classifiers using the voting and stacking techniques to build combined classification algorithms that outperform the individual ones.

After exploring different combinations of descriptors and classifiers, the achieved results are compared with others presented in the literature. Despite the inherent difficulty in making such comparisons, given the different validation protocols and scopes of various research works, the hard voting classifier combination method presented a solid performance when contrasted with results from other works. The method achieved competitive results for the SMIC HS dataset, while outperforming all compared methods for CASME II.

The application of these different descriptors and classifiers, as well as their combination through various schemes, provided some insights that could help future research in this field:


• HOG3D turned out to be the best single descriptor for a standalone microexpression classifier. Its ability to describe the key aspects of microexpression video clips and the predictions resulting from its use made it the most valuable single descriptor in the best performing classifier combinations.

• Combinations of classifiers outperform standalone classifiers on microexpression recognition. Combination algorithms have proven to aggregate the information contained in the individual classification results obtained from the standalone classifiers specialized in different descriptors. Not only the best, but also the poorly performing descriptors (such as HOF) and standalone classifiers have proven valuable in increasing the score of stacked classifiers and voting schemes. Diversity in descriptors and classifiers seems to matter significantly in building a robust classifier combination algorithm.

This is a young and evolving field of study. Among the promising techniques that could be addressed in future work, the following could be highlighted:

• Further studying facial landmark detectors to better understand their weaknesses and improve their effectiveness, given their importance to the facial geometric descriptors and as interest point detectors for sparse descriptors.

• Further studying Action Unit detection algorithms and their potential use in microexpression recognition.

• Applying statistical significance tests to help interpret the experimental results obtained with different configurations of descriptors and classifiers.

• Applying deep learning algorithms for classification, either directly on top of the microexpression video clips or on combinations of descriptors.

Microexpression recognition is an active research field and the possibilities seem endless at the moment. We may not be too far away from these techniques going mainstream into commercial software and effectively delivering on their promises.


Bibliography

[1] C. C. Aggarwal. Data Classification: Algorithms and Applications. CRC Press, 2014.

[2] T. Baltrušaitis. OpenFace: A Facial Behavior Analysis Toolkit, C++ Source Code. https://github.com/TadasBaltrusaitis/OpenFace/wiki. [Online; accessed on Aug 24, 2017].

[3] T. Baltrušaitis, M. Mahmoud, and P. Robinson. Cross-dataset Learning and Person-specific Normalisation for Automatic Action Unit Detection. In 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, volume 6, pages 1–6. IEEE, 2015.

[4] T. Baltrušaitis, P. Robinson, and L.-P. Morency. Constrained Local Neural Fields for Robust Facial Landmark Detection in the Wild. In IEEE International Conference on Computer Vision Workshops, pages 354–361, 2013.

[5] T. Baltrušaitis, P. Robinson, and L.-P. Morency. OpenFace: An Open Source Facial Behavior Analysis Toolkit. In IEEE Winter Conference on Applications of Computer Vision, pages 1–10. IEEE, 2016.

[6] L. Barbieri and H. Pedrini. Facial Microexpression Recognition Based on Descriptor and Classifier Combinations. IEEE Transactions on Affective Computing, 2017 (submitted).

[7] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[8] G. Bradski. The OpenCV Library. Dr. Dobb's Journal, 25(11):120–126, 2000.

[9] L. Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.

[10] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and Regression Trees. CRC Press, 1984.

[11] R. Breuer and R. Kimmel. A Deep Learning Perspective on the Origin of Facial Expressions. arXiv preprint arXiv:1705.01842, 2017.

[12] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High Accuracy Optical Flow Estimation Based on a Theory for Warping. In European Conference on Computer Vision, pages 25–36. Springer, 2004.


[13] R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal. Histograms of Oriented Optical Flow and Binet-Cauchy Kernels on Nonlinear Dynamical Systems for the Recognition of Human Actions. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1932–1939. IEEE, 2009.

[14] J. Chen, S. Shan, C. He, G. Zhao, M. Pietikainen, X. Chen, and W. Gao. WLD: A Robust Local Image Descriptor. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1705–1720, 2010.

[15] N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 886–893. IEEE, 2005.

[16] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer, 1996.

[17] P. Ekman. METT. Micro Expression Training Tool. CD-ROM. Oakland, 2003.

[18] P. Ekman. Lie Catching and Microexpressions. The Philosophy of Deception, pages 118–133, 2009.

[19] P. Ekman and W. Friesen. Facial Action Coding System: a Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto, 1978.

[20] P. Ekman, W. Friesen, and J. C. Hager. Facial Action Coding System. CD-ROM. Salt Lake City, 2002.

[21] P. Ekman and W. V. Friesen. Nonverbal Leakage and Clues to Deception. Psychiatry, 32(1):88–106, 1969.

[22] P. Ekman and W. V. Friesen. Unmasking the Face: A Guide to Recognizing Emotions from Facial Clues. Prentice Hall, 1975.

[23] T. Fagni, F. Falchi, and F. Sebastiani. Image Classification via Adaptive Ensembles of Descriptor-specific Classifiers. Pattern Recognition and Image Analysis, 20(1):21–28, 2010.

[24] G. Farnebäck. Two-frame Motion Estimation based on Polynomial Expansion. In Image Analysis, pages 363–370. Springer, 2003.

[25] Y. Freund and R. E. Schapire. A Decision-theoretic Generalization of On-line Learning and an Application to Boosting. In European Conference on Computational Learning Theory, pages 23–37. Springer, 1995.

[26] R. Gonzalez and R. Woods. Digital Image Processing. Prentice Hall, 2007.

[27] B. Gorman. A Kaggler's Guide to Model Stacking in Practice. http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/, 2016. [Online; accessed on Nov 15, 2017].


[28] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 28(5):807–813, 2010.

[29] Y. Guo, Y. Tian, X. Gao, and X. Zhang. Micro-expression Recognition based on Local Binary Patterns from Three Orthogonal Planes and Nearest Neighbor Method. In International Joint Conference on Neural Networks, pages 3473–3479. IEEE, 2014.

[30] E. A. Haggard and K. S. Isaacs. Micromomentary Facial Expressions as Indicators of Ego Mechanisms in Psychotherapy. In Methods of Research in Psychotherapy, pages 154–165. Springer, 1966.

[31] R. M. Haralick, K. Shanmugam, and I. H. Dinstein. Textural Features for Image Classification. IEEE Transactions on Systems, Man and Cybernetics, 3(6):610–621, 1973.

[32] X. Huang, S.-J. Wang, G. Zhao, and M. Pietikäinen. Facial Micro-expression Recognition using Spatiotemporal Local Binary Pattern with Integral Projection. In IEEE International Conference on Computer Vision Workshops, pages 1–9, 2015.

[33] X. Huang, G. Zhao, X. Hong, W. Zheng, and M. Pietikäinen. Spontaneous Facial Micro-expression Analysis using Spatiotemporal Completed Local Quantized Patterns. Neurocomputing, 175:564–578, 2016.

[34] A. Jain. Fundamentals of Digital Image Processing. Prentice-Hall, 1989.

[35] S.-G. Jeong, C. Lee, and C.-S. Kim. Motion-compensated Frame Interpolation based on Multihypothesis Motion Estimation and Texture Optimization. IEEE Transactions on Image Processing, 22(11):4497–4509, 2013.

[36] I. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 2002.

[37] V. Kazemi and J. Sullivan. One Millisecond Face Alignment with an Ensemble of Regression Trees. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1867–1874, 2014.

[38] D. E. King. Dlib-ml: A Machine Learning Toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.

[39] J. Kittler, M. Hatef, R. P. Duin, and J. Matas. On Combining Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226–239, 1998.

[40] A. Kläser. Tool for Computing 3D Gradient Descriptor in Videos, C++ Source Code. http://lear.inrialpes.fr/people/klaeser/software_3d_video_descriptor. [Online; accessed on Aug 24, 2017].

[41] A. Kläser, M. Marszałek, and C. Schmid. A Spatio-temporal Descriptor based on 3D-Gradients. In 19th British Machine Vision Conference, pages 275–1. British Machine Vision Association, 2008.


[42] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning Realistic Human Actions from Movies. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.

[43] A. C. Le Ngo, Y.-H. Oh, R. C.-W. Phan, and J. See. Eulerian Emotion Magnification for Subtle Expression Recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1243–1247. IEEE, 2016.

[44] X. Li, T. Pfister, X. Huang, G. Zhao, and M. Pietikainen. A Spontaneous Micro-Expression Database: Inducement, Collection and Baseline. In 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, pages 1–6. IEEE, 2013.

[45] X. Li, H. Xiaopeng, A. Moilanen, X. Huang, T. Pfister, G. Zhao, and M. Pietikainen. Towards Reading Hidden Emotions: A Comparative Study of Spontaneous Micro-expression Spotting and Recognition Methods. IEEE Transactions on Affective Computing, 2017.

[46] S.-T. Liong, R. C.-W. Phan, J. See, Y.-H. Oh, and K. Wong. Optical Strain based Recognition of Subtle Emotions. In International Symposium on Intelligent Signal Processing and Communication Systems, pages 180–184. IEEE, 2014.

[47] S.-T. Liong, J. See, R. C.-W. Phan, A. C. Le Ngo, Y.-H. Oh, and K. Wong. Subtle Expression Recognition using Optical Strain Weighted Features. In 12th Asian Conference on Computer Vision - Workshops, pages 644–657. Springer, 2014.

[48] S.-T. Liong, J. See, R. C.-W. Phan, and K. Wong. Less is More: Micro-expression Recognition from Video using Apex Frame. arXiv preprint arXiv:1606.01721, 2016.

[49] C. Liu, A. Torralba, W. T. Freeman, F. Durand, and E. H. Adelson. Motion Magnification. ACM Transactions on Graphics, 24(3):519–526, 2005.

[50] S. Liu, Z. Yan, J. W. Kim, and C.-C. J. Kuo. Global/local Motion-compensated Frame Interpolation for Low Bitrate Video. In The International Society for Optical Engineering (Proc. SPIE), volume 3974, pages 223–234, 2000.

[51] B. D. Lucas and T. Kanade. An Iterative Image Registration Technique with an Application to Stereo Vision. In International Joint Conference on Artificial Intelligence, volume 81, pages 674–679, 1981.

[52] A. M. Martinez. Deciphering the Face. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 7–12. IEEE, 2011.

[53] D. Matsumoto and H. S. Hwang. Evidence for Training the Ability to Read Microexpressions of Emotion. Motivation and Emotion, 35(2):181–191, 2011.

[54] H. G. Musmann, P. Pirsch, and H.-J. Grallert. Advances in Picture Coding. Proceedings of the IEEE, 73(4):523–548, 1985.


[55] Y.-H. Oh, A. C. Le Ngo, R. C.-W. Phan, J. See, and H.-C. Ling. Intrinsic Two-dimensional Local Structures for Micro-expression Recognition. In 41st IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, Mar. 2016.

[56] Y.-H. Oh, A. C. Le Ngo, J. See, S.-T. Liong, R. C.-W. Phan, and H.-C. Ling. Monogenic Riesz Wavelet Representation for Micro-expression Recognition. In IEEE International Conference on Digital Signal Processing, pages 1237–1241. IEEE, 2015.

[57] T. Ojala, M. Pietikäinen, and D. Harwood. A Comparative Study of Texture Measures with Classification based on Featured Distributions. Pattern Recognition, 29(1):51–59, 1996.

[58] D. Patel, X. Hong, and G. Zhao. Selective Deep Features for Micro-expression Recognition. In 23rd International Conference on Pattern Recognition, pages 2258–2263. IEEE, 2016.

[59] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[60] H. Pedrini and W. Schwartz. Análise de Imagens Digitais: Princípios, Algoritmos e Aplicações. Editora Thomson Learning, 2007.

[61] R. Péteri and D. Chetverikov. Dynamic Texture Recognition using Normal Flow and Texture Regularity. In Pattern Recognition and Image Analysis, pages 223–230. Springer, 2005.

[62] T. Pfister, X. Li, G. Zhao, and M. Pietikäinen. Recognising Spontaneous Facial Micro-expressions. In IEEE International Conference on Computer Vision, pages 1449–1456. IEEE, 2011.

[63] R. Polana and R. Nelson. Temporal Texture and Activity Recognition. Springer, 1997.

[64] S. Polikovsky, Y. Kameda, and Y. Ohta. Facial Micro-expressions Recognition using High Speed Camera and 3D-gradient Descriptor. In 3rd International Conference on Crime Detection and Prevention, pages 1–6. IET, 2009.

[65] S. Raschka. Python Machine Learning. Packt Publishing, 2015.

[66] L. Rokach. Pattern Classification using Ensemble Methods, volume 75. World Scientific, 2010.

[67] A. Saeed, A. Al-Hamadi, R. Niese, and M. Elzobi. Effective Geometric Features for Human Emotion Recognition. In IEEE 11th International Conference on Signal Processing, volume 1, pages 623–627. IEEE, 2012.


[68] A. Saeed, A. Al-Hamadi, R. Niese, and M. Elzobi. Frame-based Facial Expression Recognition using Geometrical Features. Advances in Human-Computer Interaction, 2014:4, 2014.

[69] C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 Faces in-the-wild Challenge: Database and Results. Image and Vision Computing, 47:3–18, 2016.

[70] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 Faces in-the-wild Challenge: The First Facial Landmark Localization Challenge. In IEEE International Conference on Computer Vision Workshops, pages 397–403, 2013.

[71] B. Scholkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.

[72] M. Shreve, S. Godavarthy, V. Manohar, D. Goldgof, and S. Sarkar. Towards Macro- and Micro-expression Spotting in Video using Strain Patterns. In Workshop on Applications of Computer Vision, pages 1–6. IEEE, 2009.

[73] R. Szeliski. Computer Vision: Algorithms and Applications. Springer, 2010.

[74] K. M. Ting and I. H. Witten. Issues in Stacked Generalization. Journal of Artificial Intelligence Research, 10:271–289, 1999.

[75] I. Ullah, M. Hussain, G. Muhammad, H. Aboalsamh, G. Bebis, and A. M. Mirza. Gender Recognition from Face Images with Local WLD Descriptor. In 19th International Conference on Systems, Signals and Image Processing, pages 417–420. IEEE, 2012.

[76] S. van der Walt, S. C. Colbert, and G. Varoquaux. The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science and Engineering, 13(2):22–30, 2011.

[77] S. van der Walt, J. L. Schönberger, J. Nunez-Iglesias, F. Boulogne, J. D. Warner, N. Yager, E. Gouillart, and T. Yu. Scikit-image: Image Processing in Python. PeerJ, 2:e453, June 2014.

[78] N. Wadhwa, M. Rubinstein, F. Durand, and W. Freeman. Pseudocode for Riesz Pyramids for Fast Phase-Based Video Magnification. http://people.csail.mit.edu/nwadhwa/riesz-pyramid/. [Online; accessed on Aug 24, 2017].

[79] N. Wadhwa, M. Rubinstein, F. Durand, and W. T. Freeman. Phase-Based Video Motion Processing. ACM Transactions on Graphics (Proc. SIGGRAPH), 32(4), 2013.

[80] N. Wadhwa, M. Rubinstein, F. Durand, and W. T. Freeman. Riesz Pyramids for Fast Phase-Based Video Magnification. In IEEE International Conference on Computational Photography. IEEE, 2014.


[81] S. Wang, W.-J. Yan, G. Zhao, X. Fu, and C. Zhou. Micro-Expression Recognition Using Robust Principal Component Analysis and Local Spatiotemporal Directional Features. In European Conference on Computer Vision Workshops, pages 325–338, 2014.

[82] S.-J. Wang, W.-J. Yan, X. Li, G. Zhao, and X. Fu. Micro-expression Recognition using Dynamic Textures on Tensor Independent Color Space. In 22nd International Conference on Pattern Recognition, pages 4678–4683. IEEE, 2014.

[83] Y. Wang, J. See, Y.-H. Oh, R. C.-W. Phan, Y. Rahulamathavan, H.-C. Ling, S.-W. Tan, and X. Li. Effective Recognition of Facial Micro-expressions with Video Motion Magnification. Multimedia Tools and Applications, 76(20):21665–21690, 2017.

[84] Y. Wang, J. See, R. C.-W. Phan, and Y.-H. Oh. LBP with Six Intersection Points: Reducing Redundant Information in LBP-TOP for Micro-expression Recognition. In Asian Conference on Computer Vision, pages 525–537. Springer, 2014.

[85] G. Warren, E. Schertler, and P. Bull. Detecting Deception from Emotional and Unemotional Cues. Journal of Nonverbal Behavior, 33(1):59–69, 2009.

[86] D. H. Wolpert. Stacked Generalization. Neural Networks, 5(2):241–259, 1992.

[87] H.-Y. Wu, M. Rubinstein, E. Shih, J. Guttag, F. Durand, and W. T. Freeman. Eulerian Video Magnification for Revealing Subtle Changes in the World. ACM Transactions on Graphics (Proc. SIGGRAPH), 31(4), 2012.

[88] W.-J. Yan, X. Li, S.-J. Wang, G. Zhao, Y.-J. Liu, Y.-H. Chen, and X. Fu. CASME II: An Improved Spontaneous Micro-expression Database and the Baseline Evaluation. PloS One, 9(1):e86041, 2014.

[89] W.-J. Yan, Q. Wu, J. Liang, Y.-H. Chen, and X. Fu. How Fast are the Leaked Facial Expressions: The Duration of Micro-expressions. Journal of Nonverbal Behavior, 37(4):217–230, 2013.

[90] W.-J. Yan, Q. Wu, Y.-J. Liu, S.-J. Wang, and X. Fu. CASME Database: A Dataset of Spontaneous Micro-Expressions Collected from Neutralized Faces. In 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, pages 1–7. IEEE, 2013.

[91] G. Zhao and M. Pietikäinen. Spatio-temporal LBP: VLBP and LBP-TOP, Matlab Implementation. http://www.cse.oulu.fi/CMV/Downloads/LBPMatlab. [Online; accessed on Aug 24, 2017].

[92] G. Zhao and M. Pietikainen. Dynamic Texture Recognition using Local Binary Patterns with an Application to Facial Expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):915–928, 2007.


[93] Z. Zhou, G. Zhao, and M. Pietikäinen. Implementation of Temporal Interpolation Model for Video Normalization, Matlab. http://www.cse.oulu.fi/CMV/Downloads. [Online; accessed on Aug 24, 2017].

[94] Z. Zhou, G. Zhao, and M. Pietikäinen. Towards a Practical Lipreading System. In IEEE Conference on Computer Vision and Pattern Recognition, pages 137–144. IEEE, 2011.

[95] Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC, 2012.