AUTOMATED FACIAL ACTION CODING 2
The Promises and Perils of Automated Facial Action Coding in Studying Children’s Emotions
Aleix M Martinez
The Ohio State University
Author Note
Aleix M Martinez is with the Dept. Electrical and Computer Engineering, and the Center
for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210.
The author and the research described in this paper were supported by the National
Institutes of Health, grants R01-DC-014498 and R01-EY-020834, the Human Frontier Science
Program, grant RGP0036/2016, and by the Center for Cognitive and Brain Sciences at The Ohio
State University. The author thanks Qianli Feng, Fabian Benitez-Quiroz, Ramprakash
Srinivasan, and Shichuan Du for discussion. The Ohio State University is licensing some of the
computational tools developed in the author's lab.
© 2019, American Psychological Association. This paper is not the copy of record and may not
exactly replicate the final, authoritative version of the article. Please do not copy or cite without
authors' permission. The final article will be available, upon publication, via its DOI:
10.1037/dev0000728
Abstract
Computer vision algorithms have made tremendous advances in recent years. We now have
algorithms that can detect and recognize objects, faces and even facial actions in still images and
video sequences. This is wonderful news for researchers who need to code facial articulations in
large datasets of images and videos, since this task is time consuming and can only be completed
by expert coders, making it very expensive. The availability of computer algorithms that can
automatically code facial actions in extremely large datasets also opens the door to studies in
psychology and neuroscience that were not previously possible, e.g., to study the development of
the production of facial expressions from infancy to adulthood within and across cultures.
Unfortunately, there is a lack of methodological understanding on how these algorithms should
and should not be used, and on how to select the most appropriate algorithm for each study. This
paper aims to address this gap in the literature. Specifically, we present several methodologies
for use in hypothesis-based and exploratory studies, explain how to select the computer
algorithms that best fit the requirements of our experimental design, and detail how to evaluate
whether the automatic annotations provided by existing algorithms are trustworthy.
Keywords: facial action coding, FACS, facial expression, emotion, computer vision, machine
learning
The Promises and Perils of Automated Facial Action Coding in Studying Children’s
Emotions
Facial expressions of emotion (i.e., facial configurations assumed to have emotion
meaning) are believed to be important in the understanding of emotion and, thus, have played an
important role in many developmental studies (e.g., Holodynski & Seeger, 2019; Leitzke &
Pollak, 2016; Castro et al., 2017; Reeb-Sutherland et al., 2015; Gaspar & Esteves, 2012; Bennett
et al., 2005). This hypothesis is based largely on adult recognition studies involving facial
configurations described by contemporary researchers based largely on early writings of Darwin
and Duchenne (Barrett et al., submitted; Martinez, 2017a; Ekman, 2016). Yet relatively few
studies have tested these hypotheses by examining production differences of these configurations
in a wide range of real-life social interactions in children and adults or characterized their
evolution over development.
These studies have been hampered by the labor-intensiveness of manual facial behavior
coding (Oster, 2003, 2006). Thankfully, in recent years, a number of commercial products have
become available that purport to allow for the automated coding of both emotional and non-
emotional facial configurations. Automated coding offers the promise of radically reducing the
labor-costs of studying facial expression production and allowing for important scientific
breakthroughs in our understanding of the relationship between facial expression and emotion
and its development from infancy to adulthood. At the same time, indiscriminate use of such
systems may lead to inadequate studies and inaccurate conclusions, a problem that might have
already started.
It is thus urgent for developmental psychologists to understand how to use and not to use
these systems and how to select the best algorithm for each study. The purpose of this paper is to
inform potential users of these promises and perils. In doing so, it will provide guidelines that
researchers may use to help them determine whether and how they can effectively employ such
systems and which system works best on each research study. In addition, it will provide a
beneath-the-hood look at how automated coding systems operate and a behind-the-scenes look at
how they are developed.
Before we begin, it is imperative to understand the difference between algorithms that
recognize facial articulations and those that identify prototypical facial expressions of emotion,
e.g., the prototypical facial expressions of joy, surprise, sadness, anger, disgust, fear, typically
called “basic” emotions (Ekman, 2016). There is mounting evidence suggesting that these
prototypical expressions are no different than other expressions of emotion (Martinez, 2017b;
Martinez & Du, 2012). For example, Du et al. (2014), Du & Martinez (2015) and Srinivasan &
Martinez (2019) demonstrate there are multiple facial expressions of joy, and that AU variability
across subjects is relatively common. Furthermore, not every expression of happiness indicates a
person is cheerful, nor does its absence indicate they are not joyful (Barrett et al., in press). For
example, most people can easily fake a spontaneous (Duchenne) smile, and not everyone smiles
at everything they find funny. Hence, computer systems that aim to solely recognize these
prototypical expressions should be avoided. Instead, developmental psychologists should use
algorithms that produce an automatic coding of facial muscle articulations. Equally important is
to note that only the externalized facial articulations are observable, not the actual (internal)
affective state experienced by the subject. In fact, the internal affective state may be unknowable
and subject dependent (Barrett et al., in press), i.e., not everyone who smiles or claims to be happy
may be experiencing the same affective state.
What we can ask is whether in specific contexts, facial articulations carry affective
meaning on average across the population. For example, multiple studies have evaluated the
neonatal imitation of adult facial articulations and the narrowing of infant facial expressions
toward those produced by caregivers (Camras, 2019; Oostenbroek et al., 2013). Such hypotheses
can be studied with the help of automatic facial action coding as we discuss below. Here, it is
important to keep the context fixed, since context may affect our facial articulations and
interpretation (Barrett et al., in press), e.g., a person making an angry expression while telling a
joke is not interpreted to mean the same as one making the same expression in a bar fight, and
the interpretation of an infant’s frowning when tired or when pressed to eat its lunch is probably
different too.
A related example is in the study of the communication of emotion categories versus
affect in infants and toddlers. For example, Castro et al. (2017) found a clear distinction of
valence in the expression of 7- to 9-year-old children, but not of emotion categories. Automatic
facial action coding provides a mechanism to study this and related hypotheses in
hundreds or even thousands of subjects in multiple cultures and contexts, which would provide a
new window to the development and learning of facial communication.
Facial Articulations
There are 28 distinct facial muscle articulations that clearly fit in the above-defined
framework. These are called Action Units (or AUs for short), each with a unique number
between 1 and 44 (Ekman, Friesen, & Hager, 2002; Figure 1a). Four additional action units, AUs
55-58, specify head pose, and another four, AUs 61-64, denote the direction of eye gaze (Ekman
& Rosenberg, 2005).
Of the 28 facial AUs, about 10 cannot be performed independently of others; for
example, AU 18 (lips pucker) cannot be performed in unison with AU 28 (lip suck). That still
leaves us with 18 AUs that may be co-articulated. Assuming one can move these AUs without
affecting others, people can produce 262,144 facial configurations, an astronomical number. We
define facial configurations as a face with any combination of AUs, while a facial expression is a
facial configuration that carries some biological or sociological meaning, e.g., an emotion or a
grammatical marker (Martinez, 2017a).
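The count of 262,144 follows from treating each of the 18 co-articulable AUs as either active or inactive. A minimal check of that arithmetic is sketched below; it assumes binary activation (no intensity levels) and counts the neutral face, with no active AU, as one configuration. The function name is illustrative only.

```python
# Each co-articulable AU is modeled as a binary switch (active / inactive).
NUM_COARTICULABLE_AUS = 18


def count_configurations(num_aus: int) -> int:
    """Number of on/off combinations of num_aus AUs, neutral face included."""
    return 2 ** num_aus


total = count_configurations(NUM_COARTICULABLE_AUS)
print(total)  # 262144
```

If AU intensity were also taken into account, the number of distinguishable configurations would be larger still.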
Infants are especially adept at moving their facial muscles independently of one another; in
adulthood only trained actors may be able to perform this miraculous number of facial
configurations. The hypothesis is that culture modulates the expressions we perform, establishing
a set of dependencies between AUs (Martinez, 2017b). This is similar to language. Infants are
capable of understanding the sounds of all human languages (Werker & Tees, 1984; Gervain &
Mehler, 2010), but as we grow and specialize in one or two languages, we lose the ability to
produce and hear sounds used by languages other than ours. Native Spanish speakers, for
example, generally add the ‘e’ sound to English words starting with an ‘s,’ thus student is
pronounced /estü-d(ə)nt/ rather than /stü-d(ə)nt/. The reason for this is simple: Spanish words
may start with ‘es’ but almost never with ‘s’ alone. Thus, native Spanish speakers have engraved
this dependency in their speech production and breaking the pattern in adulthood requires
endless practice. It appears the same is true for AUs (Martinez, 2017b; Chu et al., 2019; Camras,
2019; Oostenbroek et al., 2013). As we become more adept at communicating non-verbally with
our peers, we learn AU dependencies that are difficult to break later in life.
This means that the production and perception of facial expressions evolve with age and,
hence, we wish to understand this developmental process in humans. A few important, yet
unanswered questions are as follows. How many of the 262,144 facial configurations do infants
really produce? Of the 𝑛 facial configurations typically used by infants, which ones survive to
adulthood? Of the 𝑚 that survive to adulthood, how many are cross-cultural and how many
culture-specific? Are the cross-cultural ones the most typically produced by infants, indicating a
biological origin? Or are they learned? Under what circumstances are these expressions produced
by infants, children, and/or adults? Is there evidence suggesting that some of these are related to
some form of affect or emotion? And so on and on.
Why have these questions not been addressed by the research community yet? The reason
is simple. To answer them, we would need to:
1. collect hundreds of thousands, or perhaps millions of hours of video of facial
configurations in humans of all ages, from infancy to adulthood,
2. code the action units in each frame of these videos,
3. and perform a statistical analysis of the data.
Cameras are now so ubiquitous and people so eager to help science that our first point
above may be finally solvable. There are indeed privacy issues that need to be addressed, but
with proper care, availability of data should no longer be a major hurdle. We urgently need a
consortium of research teams that will collect such a dataset, and I hope funding entities will
support this effort.
Advances in statistical pattern analysis and machine learning also make the third of our
points listed above a solvable one. Machine learning is a set of computer algorithms derived to
extract useful information from data, while statistical pattern analysis identifies patterns in that
data. Older machine learning algorithms were unable to analyze large amounts of data, a problem
that is known as saturation, meaning that after a certain number of samples, the algorithms are
unable to improve their analysis. But recent algorithms, especially deep learning (Goodfellow et
al., 2016), are capable of working with very large datasets, what is now commonly termed big data
(Martinez, 2017a).
But how about point 2 above? How can we code action units in thousands or even
millions of hours of video? As a simple example, consider the analysis of a full year of video for
each of 100 subjects. That is 876,000 hours of video or roughly 95 billion frames (assuming 30
frames per second).
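The scale of this example can be checked with a few lines of arithmetic. The sketch below assumes continuous, around-the-clock recording and a 365-day year; the constant names are illustrative.

```python
SUBJECTS = 100
DAYS_PER_YEAR = 365        # assumption: non-leap year
HOURS_PER_DAY = 24         # assumption: continuous recording
SECONDS_PER_HOUR = 3600
FRAMES_PER_SECOND = 30

total_hours = SUBJECTS * DAYS_PER_YEAR * HOURS_PER_DAY
total_frames = total_hours * SECONDS_PER_HOUR * FRAMES_PER_SECOND

print(total_hours)   # 876000
print(total_frames)  # 94608000000
```

Under these assumptions the frame count comes to about 94.6 billion, far beyond what any team of human coders could annotate frame by frame.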
If we are ever to complete a study of this magnitude, our only hope is to have computer
vision algorithms that do the coding of AUs in each video sequence, video frame or still image
automatically, with minimal or no human intervention. Computer vision is an area of artificial
intelligence devoted to the design of algorithms capable of automatically analyzing images and
videos; for example, identifying AUs in images of facial configurations. Computer vision
algorithms capable of coding the action units of faces in images and videos are already available,
Figure 1b. How do these algorithms achieve this feat? How and under which conditions can
developmental psychologists use them? Are the results of these algorithms trustworthy?
This paper provides concrete answers to these and related questions. I summarize the
different types of computer vision algorithms that are available for coding AUs. I also put
forward a proposal of best practices for those wanting to use these algorithms. Therefore, this
paper will be primarily useful to researchers who want to use these automated systems to answer
some of the fundamental scientific questions listed above. I will list the dos and don’ts and
explain how to design experiments to maximize the likelihood of success and reproducibility.
Note that the scientific questions I enumerated earlier may be answered using exploratory
experiments and, hence, I will describe how these should be properly conducted. But, I will also
describe how these computer vision algorithms can be added to a hypothesis-based experiment,
since these tend to be preferred by researchers.
I cannot emphasize enough how important it is to use well-designed methodologies and
follow the best practices outlined below when using computer vision algorithms. Computer
vision algorithms are no substitute for well-designed experiments, nor are they good enough to
fully substitute human experts, at least not yet. There is much these algorithms can help us
achieve, but only if we take care of the important methodological details and limitations
described below.
Finally, some of the computer algorithms described below are available from companies.
These require little effort from the researcher, but, as we explain below, these may not be the
most appropriate algorithms to study the scientific question of interest. Thus, it may be necessary
to use the algorithms provided by some computer vision research groups. Using these may
require hiring a computer scientist (ideally a computer vision specialist) to run the experiments
for us. The details provided below will help you decide whether this is necessary.
The research conducted in my research lab was approved by the Office of Responsible
Research Practices at The Ohio State University, and subjects provided written consent; study
title “Face Recognition: Data collection, recognition of identity and expression,” study number
2002B0258.
Proper Experimental Design
To properly incorporate automatic AU coding in our experiments, we must first carefully
define what we wish to accomplish. This is important, because some computer vision algorithms
are best suited for some specific tasks more than others. Also, existing algorithms may or may
not adapt to our experimental requirements, or may lead to incorrect predictions or
irreproducibility (Poursabzi-Sangdeh et al., 2018).
Look at the taxonomy of experimental settings given in Figure 2. We start with a well-
defined experiment. A properly defined experiment will specify which type of data collection we
intend to assemble, or which one was made available to us. Generally speaking, there are three
conditions, which I have termed: i. idealized conditions, ii. in-lab-like conditions, and iii. in the
wild. The first of these groups includes images and videos filmed indoors under good, non-
changing illumination, with a frontal view of the subject’s face, and no occlusions. The second
group includes images and videos in which illumination may vary but the face is almost always
well-lit; they may be filmed indoors or outdoors; if not frontal, faces appear at an angle that allows
humans to distinguish the AUs of interest (usually between -20° and +20° in each of the
three axes of rotation); and any occlusions are minor and do not block the AUs of interest. The third
group are images collected under completely unconstrained conditions, which is typically termed
“in the wild,” analogous to studying biological systems in the wild, rather than in the lab. While
we generally prefer to study expressions in the wild, this is obviously the most challenging
setting for computer vision algorithms.
Idealized conditions
As the reader may suspect, computer vision algorithms perform best with still images and
videos in the first group – images and videos collected under ideal conditions. However,
collecting and curating these images and videos requires a big effort by the experimenter. For
example, it is extremely unlikely one will always obtain a frontal view of every filmed facial
configuration. If that is the requirement of our computer vision algorithm, then human annotators
will have to sweep through every still image or video frame to determine which can be used in
our study and which cannot. This may be problematic for a variety of reasons. First, it may be a
monumental task that takes months or even years to complete. It is of course much simpler than,
and preferable to, manually annotating AUs, but it is still a time-consuming task. Second,
studying only frontal faces may eliminate important fundamental variables of interest in our
study. Maybe head pose is a much better determinant of what we wish to study than AUs, or
maybe some AUs of interest are only produced when the head is tilted in specific directions
away from the camera; additionally, head pose is known to influence our reading of a face
(Witkower & Tracy, in press; Lyons et al., 2000). Third, the need for non-occluded, frontal faces
may require placing the camera in a location that prevents natural, spontaneous expressions from
occurring. And so on. The important conclusion is that computer vision algorithms can
accurately annotate our images and videos if we have selected the images and videos carefully to
almost exclusively include frontal, non-occluded faces that are well illuminated with a non-
changing light source, but these images and videos may be unfit to answer our scientific
questions.
If your research does fit within this first group, the next question you need to ask is
whether you are interested in analyzing still images or video sequences, Figure 2. You will use
images when you wish to study facial configurations at specific time points, and you will use
video when change over time is a variable of interest. If your experiment requires that you study
facial configurations at specific time points, you need to ask one final question: do the images in
these time periods correspond to the apex of the facial configuration? If the answer is yes, this
means that you or your colleagues have been very careful when collecting your images to make
sure that they correspond to the apex of each facial configuration, or that you have curated the
images to select those that show only the apex. In these cases, there are several computer vision
algorithms that may be adapted to your experiment, as discussed later in this paper.
If, on the other hand, you do not know whether the images correspond to the apex or not,
then no computer vision algorithm exists that can find the apex for you. Most non-computer vision experts
will be surprised by this. Detecting the apex of a facial configuration seems such a simple
problem. If a video sequence is available, all that needs to be done is to identify the frame where
the AUs are at maximum activation. Note that AUs can be activated at multiple intensities. For
example, AU 12, which indicates the outer pulling of the corners of the lips (i.e., the corners of
the lips move away from the center of the face), can be activated at different intensities (i.e.,
increasing the distance from the center of the face as a function of intensity). Thus, one may
think that detecting the apex simplifies to identifying the point of maximum extension of the
AUs. There are a few problems with this hypothesis though. First, current computer vision
algorithms are very good at detecting AUs in the idealized imaging conditions described above,
but not as good at detecting the intensity of activation. Some algorithms do detect the intensity
but not with enough precision to allow us to accurately specify the apex of the facial
configuration. Second, the maximum extension of each AU is subject dependent. What may be
maximum activation for me, may be only half active for you. Facial muscle differences between
people constitute a barrier for a mathematical definition of apex. Third, and arguably most
important, AUs come and go rapidly, generating a large number of activation peaks, but not all
of these peaks define the apex of a meaningful expression. Actually, most of these activation
peaks correspond to transitional facial movements unrelated to any variable of interest to the
researcher. Consider, as a simple example, a subject yawning or sneezing. How will the
computer vision system know this is not a facial expression as defined above?
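The third problem can be illustrated with a naive peak detector. The sketch below is not taken from any published system: the intensity values are hypothetical, the 0-to-5 scale is an arbitrary choice for illustration, and `local_peaks` is a made-up helper. It shows that a rule of "maximum activation" returns several candidate peaks even for a short sequence, and that any threshold used to discard some of them is a decision made by the researcher, not by the algorithm.

```python
from typing import List


def local_peaks(intensity: List[float], min_height: float = 0.0) -> List[int]:
    """Return indices of frames that are strict local maxima of at least min_height."""
    peaks = []
    for i in range(1, len(intensity) - 1):
        is_local_max = intensity[i] > intensity[i - 1] and intensity[i] > intensity[i + 1]
        if is_local_max and intensity[i] >= min_height:
            peaks.append(i)
    return peaks


# Hypothetical per-frame intensity estimates for AU 12 (0 = absent, 5 = maximal).
au12_intensity = [0, 1, 2, 1, 0, 1, 3, 5, 4, 2, 3, 2, 0]

print(local_peaks(au12_intensity))                  # [2, 7, 10]
print(local_peaks(au12_intensity, min_height=4.0))  # [7]
```

Whether frames 2 and 10 are transitional movements or the apex of a weaker but meaningful expression cannot be decided from the intensity curve alone.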
To further clarify the last point in the preceding paragraph, consider the following. First
assume the data has been carefully curated to include only video sequences with a single facial
expression. Here, detecting the apex of an expression is relatively easy. Now consider a video of
a long conversation between two subjects. This includes a barrage of facial configurations. Some
are expressions produced in isolation (e.g., a surprised reaction to a comment), which are
easy to detect. But one expression may overlap another, e.g., a subject starts to express surprise
to the comment of another person when she realizes that it was a joke and starts to laugh before
the surprise expression reaches the apex. How would you code for this? Is the maximum
extension of the surprised expression to be considered the new apex? Even if that does not reflect
the maximum intensity of the AUs in those frames of the video sequence? If so, do we impose a
minimum intensity of the AUs to consider it an apex? This opens a fundamental, recursive
problem in our experimental design: In most cases, we are interested in identifying facial
expressions unknown to us and, hence, we do not wish to constrain what defines the apex of that
expression. But without such a definition, the number of meaningless facial configurations will
be unmanageable, e.g., are the AUs of a sneeze relevant? How about those due to breathing,
babbling, speech, etc.? At present the only solution is to have a human expert in-the-loop who
defines what is meant by apex, evaluates the outcomes of the computer vision algorithm
carefully, or both.
A human-in-the-loop means that the computer vision system and the human work in
unison at each step in the experiment. For example, after curating our images as much as
possible, we still do not know which images define the apex of a potentially meaningful facial
configuration and which do not. To solve this, we first use an appropriate computer vision
system to automatically annotate the presence of AUs. AU patterns (sequential or otherwise) that
repeat multiple times over time and across subjects are selected. This selection needs to be done
carefully and must be based on some grounded assumptions; for instance, what is the minimum
number of AUs we are interested in and why? And, how many times does a pattern of AUs need
to occur to be significant? These and other questions must be carefully answered by the research
team, not by the algorithm. This exercise will give the research team a set of potentially
interesting facial configurations. Careful analysis of these facial configurations must follow. Do
any of them correspond to autonomous or semi-autonomous body movements (e.g., a sneeze) or
babbling or anything else we are not interested in? If so, we need to identify the properties of
these facial configurations and ask the software to redo the analysis with these additional
constraints added. This process must be repeated until we identify the facial expressions we were
looking for in our study.
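The selection step in this loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any specific study: the automatic annotations are assumed to arrive as (subject, set of active AUs) pairs, and the function name, thresholds, and example data are all hypothetical. The thresholds and the exclusion list are precisely the choices that the research team, not the algorithm, must justify.

```python
from collections import Counter, defaultdict
from typing import Dict, FrozenSet, Iterable, Optional, Set, Tuple

Pattern = FrozenSet[int]          # set of AU numbers active in one frame
Annotation = Tuple[str, Pattern]  # (subject id, active AUs)


def recurring_patterns(
    annotations: Iterable[Annotation],
    min_aus: int = 2,
    min_count: int = 3,
    min_subjects: int = 2,
    exclude: Optional[Set[Pattern]] = None,
) -> Dict[Pattern, int]:
    """Keep AU patterns that recur often enough and across enough subjects."""
    exclude = exclude or set()
    counts: Counter = Counter()
    subjects = defaultdict(set)
    for subject, aus in annotations:
        if len(aus) < min_aus or aus in exclude:
            continue
        counts[aus] += 1
        subjects[aus].add(subject)
    return {
        pattern: n
        for pattern, n in counts.items()
        if n >= min_count and len(subjects[pattern]) >= min_subjects
    }


# Hypothetical automatic annotations for three subjects.
annotations = [
    ("s1", frozenset({6, 12})), ("s1", frozenset({6, 12})), ("s2", frozenset({6, 12})),
    ("s2", frozenset({1, 2, 5})), ("s3", frozenset({1, 2, 5})), ("s3", frozenset({12})),
]

first_pass = recurring_patterns(annotations)
print(first_pass == {frozenset({6, 12}): 3})  # True

# After a human reviewer flags a pattern as irrelevant, rerun with it excluded.
second_pass = recurring_patterns(annotations, exclude={frozenset({6, 12})})
print(second_pass)  # {}
```

Each pass hands the surviving patterns back to a human expert, who decides which constraints to add before the next pass.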
The problems enumerated above (e.g., specifying the apex or the minimum number of
AUs) can be alleviated when we perform a hypothesis-based experiment. If we wish to test an
accepted or well-reasoned hypothesis, all we need to do is to determine if the process described
in the preceding paragraph yields the expected facial configurations. For example, a famous
hypothesis is that humans of all cultures share six facial expressions – joy, surprise, sadness,
anger, disgust and fear – sometimes called “basic” emotions (Ekman, 2016; Jack et al. 2012). If
we studied millions of still images of facial configurations across cultures would we identify
these expressions and these expressions only as universal? Or would we identify a much larger
number of expressions, as hypothesized by other authors (Du et al., 2014)? We recently used the
above experimental design to test this hypothesis (Srinivasan & Martinez, 2019), and identified a
much larger number of cross-cultural expressions, 35 to be precise, demonstrating that people
regularly use many more facial expressions of emotion than previously believed. This study
included a carefully collected dataset of about 7.2 million images and 10,000 hours of video
which were collected online using web search tools in 30 different countries.
Let us now go back to Figure 2 and consider the case where we wish to automatically
annotate AUs in video sequences collected in idealized conditions. We see that this becomes a
solved problem again. However, this statement needs to be qualified. What is meant here is that
computer vision algorithms will be able to automatically annotate AUs accurately (Benitez-
Quiroz et al., 2019), not that this will give us any meaningful scientific analysis. If we wish to
identify expressions in this newly annotated video set, we will need to use the approach defined
above for still images all over again. A human will need to carefully evaluate the results of the
algorithm to identify any expression of interest.
In-lab-like conditions
Most likely, your still images and video sequences will be collected in somewhat
controlled conditions but not idealized ones; maybe indoors, with some 3D head rotation and
minor occlusions. Current computer vision algorithms can mostly deal with these imaging
conditions and, hence, the procedures to follow do not deviate much from the ones already
described in the preceding section.
If you are interested in working with images that have been curated to show the apex of a
facial articulation, and/or wish to work with video, then, there are several computer vision
algorithms you may use to automatically annotate AUs. As shown in Figure 2, still images are
easier: because you have already indicated that they display the apex of the facial configuration, you
will only need to select the right computer vision algorithm, a topic we will discuss in detail in
later sections of this paper. If you can show that the selected algorithm works as expected on
your database of images, you will be able to trust your annotations.
As for video analysis, you will need to check that the analyses given by the computer
vision algorithm provide reliable results. The amount of human involvement will depend on the
task you are interested in solving. As stated above, if your goal is to detect the apex of all facial
configurations, then you will need to provide additional information to the algorithm to define
what this really means. Make sure your model or assumptions are based on well accepted
theories and/or experimental results, or you run the risk of falling into a circular trap. As an
example, consider the following problem: We are interested in identifying the number of facial
expressions infants produce. To address this, we use a computer vision algorithm to analyze
thousands of hours of videos of infants interacting with their parents. We define a facial
expression as the point at which the number of AUs is maximal for every small interval of 𝑡
seconds. After completion of our study, we conclude infants produce 𝑞 facial expressions. But
closer inspection shows that some expressions have been missed, because some important
expressions overlap with others yielding a monotonically increasing number of AUs (e.g., a
frown started before the conclusion of a smile), but our definition only allowed us to detect the
one with the larger number of AUs in each interval of 𝑡 seconds (e.g., a smiley frown that had
no intentional meaning). Computer vision and machine learning algorithms will compute what
we define, not necessarily what is needed. But if we do not have a mathematical definition of
what we wish to uncover, we cannot ask the computer vision algorithm to solve it. In fact, many
scientific studies are performed to identify that definition, but, in these cases, the intrinsic
definition coded in the algorithm will be the one we identify, whether we are aware of it or not.
Similarly, in hypothesis-based experiments we will most likely find results that support our
hypothesis if that definition is what was given to the computer vision algorithm. One solution to
this problem is to run a permutation test, as described later in this paper.
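The circular trap in this example can be made concrete with a short sketch. It assumes the automatic annotations are available as one set of active AUs per frame and implements the definition literally: within each window of frames, keep the frame with the most active AUs. The function name and the frame data are hypothetical, and the AU combinations (AUs 6 and 12 for a smile, AUs 4 and 15 for a frown) are used for illustration only.

```python
from typing import List, Set, Tuple


def max_au_frame_per_window(frames: List[Set[int]], window: int) -> List[Tuple[int, List[int]]]:
    """For each window of frames, return (frame index, sorted AUs) of the frame with most AUs."""
    picks = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        best = max(range(len(chunk)), key=lambda i: len(chunk[i]))
        picks.append((start + best, sorted(chunk[best])))
    return picks


# Hypothetical sequence: a frown (AUs 4 + 15) begins before a smile (AUs 6 + 12) has ended.
frames = [{12}, {6, 12}, {4, 6, 12}, {4, 6, 12, 15}, {4, 15}, {4}]

print(max_au_frame_per_window(frames, window=6))  # [(3, [4, 6, 12, 15])]
```

The only frame selected is the transitional blend at index 3. Neither the smile at index 1 nor the frown at index 4 is reported, even though the code did exactly what the definition asked.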
As shown above, if we wish to analyze still images of faces that may or may not display
the apex of a facial configuration, then the problem can only be solved with a human in the loop.
One way is by specifying what AUs and intensities we are interested in and whether there is a
requirement on the number of AUs per expression. Another solution mentioned above is to ask a
human expert coder to verify that the automatic annotations provided by the computer vision
system are accurate (Srinivasan & Martinez, 2019).
Images and videos in the wild
In some instances, we may wish to analyze images and videos collected in the real world,
under completely unconstrained conditions. Here, images and videos may be low quality, have
large variations in illumination, pose, ethnicity, skin color, have major occlusions, etc. Even in
the lab, developmental scientists working with children (who often have difficulty remaining
relatively still) may find large variations in these factors that will pose challenges. Can we use
computer vision algorithms to automatically code AUs in these still images and video
sequences? While extra care will need to be taken when doing this, the answer is yes, at least in
some cases.
As we will detail in the section to follow, assuming our still images have decent quality,
the major AUs of interest are not occluded, and images represent the apex of facial
configurations, computer vision algorithms already exist that can provide a reasonable (useful)
annotation of AUs (Srinivasan & Martinez, 2019; Benitez-Quiroz et al., 2016). In these cases, we
will still need to verify the results manually, but that is a much-preferred task over that of
manually providing the annotations ourselves.
However, when the still images may or may not represent the apex, or when we use video
sequences, the problem can only be solved with a human in the loop (Figure 2). As above,
the amount of human work will depend on the goal of the project, but, at a minimum, a human
expert will have to carefully monitor and evaluate the performance of the algorithm at every step
of our study.
Automatic Coding of Action Units
Spatial versus dynamic representation
As mentioned above, one of the most important decisions we need to make when
selecting a computer vision system to automatically code facial action units is whether we are
interested in their functional change of activation over time or in their discrete activation at
several time points. The former requires the analysis of video, while the latter can be performed
on both video and images. Let us first explain the basic differences between these two
analyses.
Recognition of action units in still images and video frames
This is the most typical analysis seen in scientific studies to date. The goal is to identify
which AUs are present in each of a set of available still images or video frames (Martinez,
2017a; Du & Martinez, 2012). If we are given a set of images, our goal is to select an algorithm
that can tell us which (if any) AUs are active in each of the faces that appear in the images. If we
are given a video sequence instead, then the algorithm must list the active AUs in each of the
frames of the video sequence. In some instances, we may also be interested in an algorithm that
can specify the intensity of activation of each AU. The standard way to specify intensity is to
categorize each AU into one of five values (Ekman & Rosenberg, 2005), namely: a (meaning
there is only a trace of the presence of this AU), b (indicating a slight activation of the AU), c
(meaning the AU is clearly marked), d (specifying the activation of the AU is extreme), or e
(indicating the activation of the AU is maximal). Figure 3a shows an example.
Functional representation
Another way to study facial actions is to uncover the underlying function of muscle
articulations, Figure 3b. While the methods described in the previous paragraph provide a
qualitative analysis of the presence of AUs in each image or frame of a video sequence, the
methods used here yield a quantitative analysis of the activation change over time (Simon et al.,
2010). This means that each AU is defined as a function (i.e., a curve) over time, rather than a set
of discrete letters as above (Figure 3). As shown in this figure, one may be interested in the
intensity of activation $f_l$ or some other variable, e.g., co-articulation of AUs, frequency or
probability of activation, etc.
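As a rough illustration of how such variables can be derived once each AU is available as a curve over time, the following Python sketch computes the probability of activation, the number of activation onsets, and the co-articulation of AUs. The intensity scale in [0, 1], the 0.5 activation threshold, the function name and the toy curves are all assumptions made for this example; they are not taken from any of the cited systems.

```python
import numpy as np

def activation_summary(curves, threshold=0.5):
    """Summarize AU curves given as a (frames, aus) array of intensities.

    The [0, 1] intensity scale and the threshold are illustrative assumptions.
    """
    active = curves >= threshold
    # Probability of activation: fraction of frames in which each AU is active.
    prob_active = active.mean(axis=0)
    # Frequency: number of inactive-to-active transitions, plus one if the
    # AU is already active in the first frame.
    onsets = (~active[:-1] & active[1:]).sum(axis=0) + active[0]
    # Co-articulation: fraction of frames in which each pair of AUs is co-active.
    a = active.astype(int)
    co_active = (a.T @ a) / len(active)
    return prob_active, onsets, co_active

# Hypothetical curves for two AUs over six frames.
curves = np.array([
    [0.0, 0.9],
    [0.6, 0.9],
    [0.8, 0.1],
    [0.2, 0.1],
    [0.7, 0.6],
    [0.1, 0.6],
])
prob_active, onsets, co_active = activation_summary(curves)
```

Other summaries (e.g., rise time or duration of each activation) can be computed from the same curves in a similar way.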
Once we have determined which variables are of interest to us and whether a qualitative
or a quantitative measurement is needed, we ought to identify a computer vision system that can
provide the values of these variables by analyzing a large dataset of images or videos. This
means we need to understand which systems are available and what they can and cannot do, and
what is the degree of accuracy of their analyses.
Computer Vision Methods
The computer vision algorithms that have been designed to label AUs in still images and
video sequences use either computer vision approaches or machine learning techniques or a
combination of the two. Figure 4 summarizes the main techniques used by the algorithms. The
professional computer systems made available by companies as well as those made available by
researchers fit within one of these groups. It is important to understand the approach used by the
selected algorithm for two reasons. First, we want to select a method that has been demonstrated
to perform well under imaging conditions similar to those of our data. To do this, we need to
know the approach used by the selected algorithm by reviewing the papers or reports where the
system is defined. Second, this same report should provide a description of the images used to
evaluate the algorithm. Use this to determine whether the algorithm is a good fit. After this, we
will need to test whether the selected algorithm works on our dataset. If not, we should generally avoid
algorithms that use the same approach and move on to algorithms that employ distinct strategies.
Let us briefly summarize these distinct approaches.
a) Template matching. Template matching is a classical approach in computer vision. As
its name indicates, given a template, the goal is to find whether it is present in an image
and, if so, where (Martinez & Kak, 2001). When applied to automatic detection of AUs,
we first need to generate a template of each AU. For example, one can define a window
$w_i$ of $p \times q$ pixels centered at the place of articulation of AU $i$ on a number of sample
images of faces with that AU active. Statistics are then extracted from these sample
windows, e.g., the mean, standard deviations, covariances, etc. If we use the mean and
covariance matrix, we define the image variability of that AU template using a Gaussian
distribution. Given a test image, we extract the corresponding window of $p \times q$ pixels centered at the
location of AU $i$ and calculate its distance to the computed Normal distribution, e.g.,
using the Mahalanobis distance or Bayes error (Zhu & Martinez, 2006; Hamsici &
Martinez, 2007). If that distance is below a threshold, we say that AU $i$ was detected in
the image. We can make this system better by using a mixture of Gaussians instead
(Martinez & Vitria, 2001). Alternatively, we can compute the distance to the subspace of
principal or independent components, given by Principal Components Analysis (PCA)
and Independent Components Analysis (ICA), respectively. The PCA and ICA
representations are linear, meaning the statistical model that represents AUs is given by a
linear equation (Draper et al., 2003). Nonlinear manifolds allow us to define more
flexible models. This is achieved by either changing the metric of the feature space
representing $w_i$, a technique called kernel mapping (You et al., 2011), or with nonlinear
regression (Rivera & Martinez, 2012).
b) Optical flow. Template matching can also be used to determine the movement of fiducial
points. This is called optical flow and is defined as the apparent motion of the brightness
pattern of a set of images (Baker et al., 2011). However, while the template matching
method described above is used to detect the presence of an AU, here its purpose is to
uncover the movement of a fiducial point across a number of video frames or images,
e.g., the outer pulling of the corners of the mouth of AU 12. This can be readily
computed from video sequences that start at a neutral face followed by the activation of a
number of AUs (Lien et al., 1998; Donato et al., 1999; Martinez, 2003a; Liu et al., 2016).
When only an image is available, we ought to compare it to a neutral face, ideally of the
same individual, but, if unavailable, of a norm facial identity (Martinez, 2003b; Du &
Martinez, 2012). Figure 5 shows some examples of the optical flow estimated on a single
image and the mean neutral face of a large number of individuals. As can be seen in this
figure, optical flow provides a direct measure of the perception of apparent movement of
the facial muscles, which can be used to identify AUs.
c) Image filters (Gabors, Wavelets). A window of 𝑝 × 𝑞 pixels, called a kernel, is
centered at the location of each AU and convolved (see footnote 1) with that local region of the image.
This process typically yields distinct results when the AU is active than when it is not,
allowing computer vision algorithms to identify its presence or absence. Kernels that
have been shown to yield this distinction are variants of the Gabor kernel and wavelets
(Lyons et al., 1999; Tian et al., 2002; Yang et al., 2007; Savran et al., 2012). This
approach is usually applied on several pixels around the center point of each AU to add
robustness to its detection. One of the arguments for using Gabor filters is their
resemblance to the computations executed by our own early visual cortex (Martinez &
Du, 2012), which may aid in the classification of AUs thought to occur in a nearby brain
region called the posterior Superior Temporal Sulcus (pSTS) (Srinivasan et al., 2016).
d) 2D and 3D shape analysis. Above we defined a way to detect facial movements with
optical flow. Another popular approach is to use Procrustes analysis (Hamsici &
Martinez, 2008, 2009a; Sun et al., 2008; Garg et al., 2013; Jin & Tan, 2017). Procrustes
analysis is an algorithm to align a set of fiducial points (e.g., corners of the mouth, center
of the eyes) across multiple images. This registration allows the algorithm to compute the
deformation of the shape of the face using a statistical model, as, for example, Principal
Components Analysis (PCA) (Martinez & Kak, 2001; Martinez & Zhu, 2005). Using
PCA implies we compute the mean and covariance matrix of the deformation of the face,
meaning the facial expression is modeled using a Normal distribution (Todorov et al.,
2016). This can be extended to more complex distributions by changing the norm of the
space (Hamsici & Martinez, 2009b), which also allows us to recover the 3D shape of the
facial expression, as shown in Figure 6a, as well as other transformation functions
(Agudo & Moreno-Noguer, 2018a). An alternative approach is structure from motion,
which estimates the movement of the face (scale, translation, and rotation with respect to the
camera) as well as its 3D shape (Jia & Martinez, 2009; Gotardo & Martinez, 2011a,b;
Agudo et al., 2014; Agudo & Moreno-Noguer, 2018a). As with Procrustes analysis, a
change in the metric (called a kernel mapping) is typically used to improve the results
and, in addition, it allows us to recover the 3D shape of the face (Hamsici et al., 2012;
Gotardo & Martinez, 2011c). Figure 6b shows an example sequence. An alternative to
kernel maps is the use of sparse representations, which reduce the number of
unknowns to be solved to yield robust results (Li et al., 2015). And, finally, some models
combine the formulation of Procrustes analysis and structure-from-motion to compute the
shape of the face (Lee et al., 2013), while others use deep learning (Zhao et al., 2018;
Albiero et al., 2018; Chang et al., 2018).
1 A convolution is given by adding the elements of the image within a determined window, weighted by the kernel.
e) Isoluminant color and shading. The shading of a face is given by the luminance, or
quantity of light per unit area at each point on the surface of the skin; this is 1-
dimensional and can be readily computed by mapping a color image into grayscale (i.e.,
from three color channels to one). Isoluminant color is what is left in the image once the
luminance has been factored out. Thus, isoluminant color is 2-dimensional. We believe
the human visual system uses two opponent color channels – yellow-blue and red-green –
to represent images and objects (Gegenfurtner, 2003). It has been recently shown
(Benitez-Quiroz et al., 2018) that facial color in this isoluminant color space changes as a
function of the emotion experienced by the expresser. The assumption is that hormonal
changes have an effect on facial blood flow and/or composition that is visible through
color variations on the surfaces of the skin of the face. This information can be combined
with shading cues, which define the 3D shape of the face, to detect AUs with greater
accuracy than before (Benitez-Quiroz et al., 2019).
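To make the Gaussian template matching of item (a) more concrete, the following Python sketch fits a mean and covariance to sample windows of one AU and then thresholds the Mahalanobis distance of a test window. It is a minimal sketch under stated assumptions, not the implementation of any cited system: the function names, the small ridge term added to the covariance, the threshold value and the synthetic windows are all hypothetical.

```python
import numpy as np

def fit_au_template(windows, ridge=1e-3):
    """Fit a Gaussian template from sample windows of shape (n, p, q).

    The ridge term (an assumption of this sketch) keeps the covariance
    invertible when few sample windows are available.
    """
    X = windows.reshape(len(windows), -1).astype(float)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + ridge * np.eye(X.shape[1])
    return mean, np.linalg.inv(cov)

def mahalanobis(window, mean, cov_inv):
    """Mahalanobis distance between a p x q window and the template."""
    diff = window.ravel().astype(float) - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

def au_detected(window, mean, cov_inv, threshold):
    """Declare the AU present when the distance falls below the threshold."""
    return mahalanobis(window, mean, cov_inv) < threshold

# Synthetic stand-ins for 4 x 4 windows cropped where the AU is active.
rng = np.random.default_rng(0)
train = rng.normal(loc=0.5, scale=0.1, size=(200, 4, 4))
mean, cov_inv = fit_au_template(train)

threshold = 6.0                      # hypothetical; tune on annotated data
similar = np.full((4, 4), 0.5)       # resembles the training windows
different = np.full((4, 4), 5.0)     # far from the training windows
```

In practice the threshold would be chosen on a manually annotated validation set, and a mixture of Gaussians or a subspace model could replace the single Gaussian used here.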
Machine Learning Methods
The computer vision systems defined above are formulated based on our understanding
of the physics of the world (e.g., light, geometry) and existing computational models of the
human visual system (Martinez, 2017b; Martinez & Du, 2012). Another solution is to learn the
representation that is best suited for a specific dataset. This is the goal of machine learning.
f) Classifiers. Deep feature representations have become commonplace (Benitez-Quiroz et
al., 2017b; Bai et al., 2018; Pons & Masip, 2018; Corneanu et al., 2018). Given a large
dataset of images of facial configurations, we use a deep neural network and train it to
identify action units. A deep neural network is composed of a number of layers generally
represented as a directed acyclic graph (Goodfellow et al., 2016). The outputs in the last
layer correspond to the classification of AUs, and the previous layers correspond to the
so-called “deep features.” We can use these deep features in lieu of the computer vision
features defined above. While this approach has yielded top results in other computer
vision problems, computer vision features yield equally good and, in many cases, better
results than these deep representations (Benitez-Quiroz et al., 2017a). Discriminant
analysis, a statistical pattern recognition approach (Hamsici & Martinez, 2008; Zhu &
Martinez, 2006; Deng et al., 2018; Wan et al., 2018), is also used to uncover the best
predictors of AUs (Benitez-Quiroz et al., 2016), while other algorithms use Support
Vector Machines (SVMs) over the computer vision or deep representations described
above (Bartlett et al., 2005; Kotsia & Pitas, 2007; Zhang et al., 2014; Du & Martinez,
2014; Girard et al., 2015).
g) Unsupervised methods. Machine learning algorithms are tasked to find the functional
mapping $\mathbf{y}_i = f(\mathbf{x}_i)$, where $\mathbf{x}$ is the input feature vector (which may define one or more
of the computer vision features given in a-e, or be a set of deep features as explained in f)
and $\mathbf{y}$ is the desired output, e.g., $\mathbf{y}_i = (y_{i1}, \dots, y_{ip})^T$, with $y_{ij} \in \{-1, +1\}$ indicating
that AU $j$ is present (+1) or not present (-1); alternatively, $y_{ij}$ may define the intensity of
activation of AU $j$. The machine learning algorithms described above use a labeled
training set, $\mathcal{Y} = \{\mathbf{x}_i, \mathbf{y}_i\}_{i=1}^{d}$, to find a possible mapping $f(\cdot)$, where $d$ is the number of
training sample pairs. The algorithms using this labeled dataset are called supervised
methods, because the task of finding $f(\cdot)$ is determined (supervised) by $\mathcal{Y}$. The problem
with this approach is that a human expert must provide a large training set $\mathcal{Y}$, and, as we
know, manually annotating AUs in a large number of images or video frames is costly;
that is why we wish to use automated computer vision systems instead. Therefore, the
main goal in modern computer vision and machine learning is to define algorithms that
can learn from a large set of unlabeled data, $\mathcal{X} = \{\mathbf{x}_i\}_{i=1}^{d}$. This is called unsupervised
learning, because the labels ($\mathbf{y}_i$) are not given. As of this writing, unsupervised learning
of action units is still an open area of research. Recently, Zhao et al. (2018) have derived
an algorithm that can learn from a large set of unlabeled internet face images. The key
idea is to group images as a function of image feature similarity and image description
similarity using techniques from graph theory. Some other recent methods (Wiles et al.,
2018) are not specifically defined to detect AUs in faces, but may be adapted to achieve
this goal in the near future.
h) Generative models. If the functional mapping 𝑓(. ) is given by a probabilistic model
defined by an underlying but unknown density function, we can use probabilistic
algorithms to estimate it. The most classical approach to density estimation is mixture
models (Reynolds, 2015), with a long tradition in modeling a variety of visual stimuli
(Martinez & Vitria, 2001). Most algorithms model the activation of AUs using a mixture
of Gaussians (Song et al., 2015), with variants using a mixture of PCs and ICs (Draper et
al., 2003). When adding time to these models (i.e., in video analysis), we have a Hidden
Markov Model, which has also been successfully used to model facial expressions
(Corneanu et al., 2016; Cohen et al., 2003; Martinez, 1999). Deep learning methods can
also be used to estimate the underlying distribution, with the most tested approach being
Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). Pumarola et al.
(2018) have used this approach to learn the underlying distribution of the image changes
of every AU. This means that given any arbitrary image, this algorithm can edit it to add
or subtract any AU from the image. As seen in Figure 7, the results are so convincing that
they easily trick human subjects into believing that the generated images are in fact real.
Thus, this approach can be used to detect AUs in images as well as to generate new
stimuli for our experiments. A variety of applications of this approach and extensions are
already underway (Romero et al., 2018; Vielzeuf et al., 2018).
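The supervised mapping from feature vectors to AU labels in {-1, +1} described in item (g) can be illustrated with a very small Python sketch. Regularized least squares is used here only as a brief stand-in for the classifiers cited above (deep networks, discriminant analysis, SVMs); the function names, the ridge value and the synthetic data are assumptions of this example.

```python
import numpy as np

def fit_linear_au_classifier(X, Y, ridge=1e-2):
    """Learn a linear mapping from features X (d, m) to labels Y (d, p).

    Labels are +1 (AU present) or -1 (AU not present). Returns a weight
    matrix of shape (m + 1, p); the last row is the bias term.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)

def predict_aus(X, W):
    """Predict +1/-1 for every AU of every sample."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.where(Xb @ W >= 0, 1, -1)

# Synthetic labeled training set: d = 500 samples, m = 5 features, p = 3 AUs.
rng = np.random.default_rng(0)
true_W = rng.normal(size=(5, 3))          # hypothetical ground-truth mapping
X_train = rng.normal(size=(500, 5))
Y_train = np.where(X_train @ true_W >= 0, 1, -1)
W = fit_linear_au_classifier(X_train, Y_train)

# Independent samples drawn from the same synthetic distribution.
X_test = rng.normal(size=(200, 5))
Y_test = np.where(X_test @ true_W >= 0, 1, -1)
predictions = predict_aus(X_test, W)
agreement = float((predictions == Y_test).mean())
```

The labels in this toy problem are balanced, so simple agreement is informative here; with real AU data, where most AUs are rarely present, precision, recall and the F1-score should be reported instead.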
Evaluations
To date, the computer vision and machine learning algorithms described above have been
mostly evaluated on data collected in the laboratory, under constrained conditions, and only a
handful of algorithms have been tested with still images captured in unconstrained conditions
outside the lab. Here, constrained conditions may refer to illumination, pose and other image
collection mechanisms or to the restricted way in which subjects are asked to behave; in general,
people do not act naturally in the lab, while illumination and pose are at least somewhat
constrained.
It is important that we understand under which conditions each algorithm has been tested.
When selecting one of the computer vision or machine learning algorithms described in the
preceding section, we need to check how it was tested and evaluated, to make sure it was
evaluated on the same type of images and videos we wish to automatically
analyze.
There are three types of data on which algorithms are typically evaluated.
i. Posed expressions: These are still images or video frames of typically hypothesized facial
expressions. Subjects are asked to pose the expressions by either imitating the expression
in an image, following a cue (e.g., smile, frown), or giving subjects a situation and asking
them to produce the expression that would be expected in it. Examples are the CK+ and
the Compound Emotions datasets (Lucey et al., 2010; Du et al., 2014).
ii. Spontaneous facial configurations: Here videos and images are collected while subjects
watch a video or interact with another person, yielding several spontaneous facial
configurations. Examples are DISFA (Denver Intensity of Spontaneous Facial Action)
and Shoulder Pain datasets (Lucey et al., 2011; Mavadati et al., 2013).
iii. Images and videos in the wild: These are images and videos collected outside the lab, in
completely unconstrained environments/conditions (Martinez, 2017a). The term “in the
wild” refers to the fact that they are collected outside controlled, in-lab conditions. The
largest dataset is called EmotioNet (Benitez-Quiroz et al., 2016), which includes 1 million
images; the recent extension of Srinivasan & Martinez (2019) contains over 7 million
images and 10,000 hours of video with more than 1 billion frames.
Obviously, when selecting a computer vision or machine learning algorithm to
automatically annotate AUs, one must make sure that it has been tested under conditions similar to
those of one's own data. For example, if your experiment only includes posed expressions, has the
algorithm you wish to use been extensively tested using well-documented datasets of posed
expressions? If not, then you either need to test it yourself, or you ought to find a different
algorithm.
Let us assume you have now selected an algorithm that has been extensively tested on
images and/or videos collected under the same imaging conditions as those of your data. And, let
us further assume that these studies prove the selected algorithm performs wonderfully on those
images and/or video sequences. We can now use the selected algorithm with confidence, right?
Unfortunately, the answer is no. Why not? Because, most likely, these tests have been run by the
same team that designed the experiment and the algorithm might be overfitted to the testing data.
Let me explain what that means. In computer vision and machine learning we typically divide
our dataset into two subsets. The first is used to train our algorithm, e.g., to identify the
parameters of the algorithm that make it work well. Then, the tuned algorithm is run on the
testing data, yielding the results that are to be expected when using a similar, yet independent
dataset. The problem is that the people who designed the algorithm we wish to use had access to
both the training and testing data during development, and, most likely, they modified their
algorithm until it worked on both the training and testing datasets. That is, they overfitted their
algorithm to the training and testing data. This means their testing results may not be a good
representation of what you might expect to see on truly independent, previously unseen data.
How can we solve this problem? We have three options. One is to manually annotate
AUs in a number of images or videos of our dataset, use the selected algorithm, and compute
how well it does on it. The more annotations we use, the better. A second option is to use the
selected algorithm to annotate our data, randomly select a number of images or video frames
(say, 5% of them), and manually check the accuracy of these annotations. The third option is to
find a number of annotated datasets that were not used by the developers of the selected
algorithm and use these to check how well the algorithm performs on these novel datasets.
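The second option can be sketched in a few lines of Python: draw a reproducible random subset of the automatically annotated frames, have a human coder annotate that subset, and count agreements and disagreements per AU. The function names, the fixed seed, and the toy annotations are assumptions of this sketch; the 5% fraction is the example value given above.

```python
import random

def sample_for_manual_check(frame_ids, fraction=0.05, seed=0):
    """Pick a reproducible random subset of frames for manual verification."""
    frame_ids = list(frame_ids)
    k = max(1, round(fraction * len(frame_ids)))
    rng = random.Random(seed)
    return sorted(rng.sample(frame_ids, k))

def confusion_counts(auto, manual):
    """Compare automatic and manual codes on the checked frames.

    auto, manual: dicts mapping frame id -> set of active AU numbers.
    Returns (true positives, false positives, false negatives).
    """
    tp = fp = fn = 0
    for frame_id, manual_aus in manual.items():
        auto_aus = auto.get(frame_id, set())
        tp += len(auto_aus & manual_aus)   # AUs both coders marked
        fp += len(auto_aus - manual_aus)   # AUs only the algorithm marked
        fn += len(manual_aus - auto_aus)   # AUs the algorithm missed
    return tp, fp, fn

# 5% of a hypothetical set of 2,000 frames.
checked = sample_for_manual_check(range(2000))

# Toy annotations for three checked frames.
auto = {0: {4, 12}, 1: {6}, 2: set()}
manual = {0: {4}, 1: {6, 12}, 2: set()}
counts = confusion_counts(auto, manual)
```

Fixing the seed makes the selected subset reproducible, so the same frames can be re-checked by a second coder.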
When evaluating an algorithm or looking at evaluations performed by others, do not pay
much attention to the accuracy of the algorithm. Accuracy is defined as the number of correctly
labeled images, divided by the total number of images used in the test. The problem is that most
images do not have AU 𝑖 present, and a simple algorithm which always says AU 𝑖 is not present
would have a very high accuracy but would be useless. As an example, consider AU 4. Imagine
AU 4 appears in 0.2% of the images in a database of 1 million samples. If our algorithm says that
AU 4 is not present in any of these 1 million images, its accuracy would be
$\frac{998{,}000}{1{,}000{,}000} = .998$, i.e.,
the accuracy of this algorithm is 99.8%, even though the algorithm is unable to code AU 4 at all.
To address this issue, we need to compute the precision and recall of the algorithm.
Precision is the fraction of the images labeled by the algorithm as containing AU $i$ that truly
contain it, while recall (also called sensitivity) is the fraction of all the images with AU $i$
that the algorithm detects. These two measures are typically combined in a single value called the $F_1$-score,
$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$
$F_1$ takes values between 0 and 1, with 0 indicating the algorithm is useless at the
task and 1 designating perfect performance. In our example above, $F_1 = 0$, even though
accuracy = .998.
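The AU 4 example can be reproduced with a short Python sketch that computes accuracy, precision, recall and the F1-score from the four counts of a detection test. The function name is hypothetical, and returning 0 when precision or recall is undefined (no detections, or no positive images) is a convention assumed here.

```python
def evaluate(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from detection counts.

    Undefined ratios (zero denominators) are reported as 0.0, which is a
    convention assumed for this sketch.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1

n_images = 1_000_000
n_with_au4 = 2_000  # AU 4 present in 0.2% of the images

# An algorithm that always answers "AU 4 not present".
always_absent = evaluate(tp=0, fp=0, fn=n_with_au4, tn=n_images - n_with_au4)

# A hypothetical detector that finds 1,500 of the 2,000 images with AU 4
# and raises 500 false alarms.
detector = evaluate(tp=1_500, fp=500, fn=500, tn=n_images - 2_500)
```

Both algorithms have an accuracy above 99%, but only the F1-score separates the useless one (0) from the useful one (.75).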
All the above will be necessary unless the authors of the algorithm have validated their
algorithm on a large and truly independent dataset they had no access to. This typically means
the authors have participated in a challenge or competition, where the testing data was
sequestered and, thus, not available to the team designing the algorithm. The most extensive one
is the EmotioNet challenge (Benitez-Quiroz et al., 2017a), but this challenge exclusively uses
images in the wild. Thus, it is highly recommended to test the selected algorithm on
your data before you proceed; some algorithms perform very well on some datasets and very
poorly on others. Having a computer vision expert on your team who can perform these
evaluations is also advisable. Additionally, whenever you need to evaluate a computer vision
algorithm on your data, you will also need to add a certified FACS coder to your team (e.g., to
manually annotate a subset of the data for reliability purposes).
It is important to note that the more your data deviates from that of previous tests, the
more likely it is for the selected algorithm to fail. For example, to the author's knowledge, the
highest $F_1$-scores on posed and spontaneous expressions are those achieved by the algorithm of
Benitez-Quiroz et al. (2016), with $F_1 > .94$ when testing on CK+, DISFA and Shoulder Pain,
which is as good as human annotations (Girard et al., 2015). However, these results were
obtained by training the algorithm on a subset of each of these databases and testing it on an
independent subset of the same dataset, a method called cross-validation. $F_1$-scores drop to about
.6 when training on some of these datasets and testing on very different databases. This is
because the imaging conditions in these datasets are extremely different, making the
classification of AUs database specific. One solution to this problem is to retrain these
algorithms with a portion of your database. For this though, you will need to manually annotate a
portion of your data, which is time consuming.
EmotioNet Challenge
The most challenging problem is, of course, the detection of AUs in the wild, where the
algorithm needs to adapt to any possible imaging condition. The only large-scale test that assesses
computer vision algorithms in these challenging conditions is the EmotioNet Challenge. Thus
far, there have been two challenges, a first one in 2017 and a second in 2018. Of the dozens of
participants that registered, only 10 have completed the challenge. The top $F_1$-scores are about
.64 on the moderately difficult set and about .56 on the most difficult one (Benitez-Quiroz et al.,
2017a, 2017b; see footnote 2). These results improve when the facial color features associated with emotion are
also considered (Benitez-Quiroz et al., 2019).
2 See also the results in the EmotioNet website: http://cbcsl.ece.ohio-state.edu/EmotionNetChallenge/index.html
A clear limitation of AU detection in the wild is in pose invariance (Benitez-Quiroz et al.,
2017a); that is, how can we recognize AUs when the face is not observed frontally? One solution
is to recover the 3D shape, shading and discriminant colors of the face from a single 2D image
(Zhao et al., 2016). This is an ill-posed problem, meaning that for any 2D image of a face there is
an infinite number of possible 3D faces that could have generated this 2D observation. A small
number of algorithms have recently solved this problem by learning the mapping function
between 3D and 2D images of faces (Zhao et al., 2018; Zhao et al., 2016), with extensions and
variants of these algorithms improving over previous results (Tome et al., 2017; Jourabloo et al.,
2017; Rad et al., 2018).
Another problem with the above algorithms is the intrinsic biases of the databases used to
learn to discriminate between AU present versus not present. As we saw above, computer vision
algorithms do not perform as well when our training dataset is not a good representation of what
we will be using in testing. A major issue is that most databases used to train these systems do
not have a large number of images of certain ethnicities and races. Hence, algorithms trained
with these databases provide subpar performance on the poorly represented groups (e.g., black
subjects) (Buolamwini & Gebru, 2018). It is imperative to test your system on the demographics
you will be using it with, before you decide whether that algorithm will perform as expected.
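One simple way to run such a check is to compute the F1-score of the algorithm separately for each demographic group in a manually annotated subset of your data. The Python sketch below assumes each record holds a group label, the algorithm's decision for one AU, and the manual code; the function name, the record layout and the toy data are hypothetical.

```python
from collections import defaultdict

def f1_by_group(records):
    """Compute the F1-score of one AU separately for each group.

    records: iterable of (group, predicted_present, truly_present) tuples.
    Groups with no true positives, false positives or false negatives
    receive 0.0 (a convention assumed for this sketch).
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for group, predicted, actual in records:
        c = counts[group]
        if predicted and actual:
            c["tp"] += 1
        elif predicted:
            c["fp"] += 1
        elif actual:
            c["fn"] += 1
    scores = {}
    for group, c in counts.items():
        # F1 written in terms of counts: 2TP / (2TP + FP + FN).
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        scores[group] = 2 * c["tp"] / denom if denom else 0.0
    return scores

# Toy records for two hypothetical groups, "A" and "B".
records = [
    ("A", True, True), ("A", True, True), ("A", True, True), ("A", True, False),
    ("B", True, True), ("B", True, False), ("B", False, True), ("B", False, True),
    ("B", False, False),
]
scores = f1_by_group(records)
```

A large gap between groups, as in this toy example, indicates that the algorithm should be retrained or replaced before it is used with the under-served group.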
Exploratory and Hypothesis-based Designs
Let us see how we can use the information detailed above to design our experiments.
First of all, we must decide whether we will perform an exploratory or a hypothesis-based study.
Since researchers (and funders) typically prefer hypothesis-based studies, let us define those first.
Hypothesis-based
Knowledge is the pillar on which research rests. Typically, we will want to design an
experiment to test whether an established model or hypothesis is true or not; in other cases, we
may wish to test a novel hypothesis. To this end, we need to design an experiment that
challenges our hypothesis. Suppose that, after careful thought, we determine that an analysis of facial action
units in a large number of images or videos is necessary, and hope we can complete this analysis
automatically, using a computer vision system. How should we proceed?
First, we ought to know whether such a system is available. Using Figure 2, we can easily
determine how we can proceed and how much care one needs to take when collecting and
curating the database of face images/videos to be used in our study. Above we provided a
detailed explanation of Figure 2, which we can now use to design our experiment.
Second, we need to select a computer vision algorithm from those listed in Figure 4. This
selection needs to be directed by the needs of our study and the design we have already defined.
If we are to analyze still images showing the apex of an expression, then we will select one of the
algorithms specifically designed to work with these. If the images sometimes do and sometimes
do not show the apex of an expression, then, we will need to use an algorithm that can detect
which images correspond to an apex and which ones do not by providing a specific definition of
what we call the apex; e.g., we may define apex as the frame of a video sequence with the
maximal number of AUs, or as the point at which each AU is at maximum activation, or as the point
at which the AU activation first increases and then decreases by a specified amount (i.e., a
threshold), etc. This definition will be part of our hypothesis. On the other hand, if we wish to
analyze the temporal information of AUs, then we need to select one of the algorithms that can
provide a quantitative analysis over a video sequence, rather than a qualitative analysis on
images. Similarly, if we are interested in the intensity of activation of an AU, we need to
determine if we wish to recover a category (i.e., a set of levels of intensity), or a continuous
value, and then select the appropriate algorithm.
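Because the definition of apex becomes part of the hypothesis, it helps to write it down explicitly. The Python sketch below gives one possible reading of each of the three example definitions mentioned above; the function names, the array layouts and the toy data are assumptions made for illustration, and other readings of these definitions are equally valid.

```python
import numpy as np

def apex_by_au_count(active):
    """Frame with the maximal number of active AUs.

    active: (frames, aus) boolean array.
    """
    return int(np.argmax(active.sum(axis=1)))

def apex_by_peak_intensity(intensity):
    """Frame at which one AU reaches its maximum activation.

    intensity: (frames,) array with the activation curve of a single AU.
    """
    return int(np.argmax(intensity))

def apex_by_rise_and_fall(intensity, threshold):
    """First peak that follows a rise of at least `threshold` and is
    followed by a fall of at least `threshold`; None if there is none."""
    low = peak = intensity[0]
    peak_idx = 0
    for t, value in enumerate(intensity):
        if value > peak:
            peak, peak_idx = value, t
        elif peak - low < threshold and value < low:
            # No sufficient rise yet: restart from the new baseline.
            low = peak = value
            peak_idx = t
        elif peak - low >= threshold and peak - value >= threshold:
            return int(peak_idx)
    return None

# Toy data: five frames and three AUs; one AU curve over seven frames.
active = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1], [1, 0, 0]], dtype=bool)
curve = np.array([0.0, 0.1, 0.5, 0.9, 0.8, 0.3, 0.1])

apex_count = apex_by_au_count(active)
apex_peak = apex_by_peak_intensity(curve)
apex_rise_fall = apex_by_rise_and_fall(curve, threshold=0.4)
flat = apex_by_rise_and_fall(np.array([0.2, 0.2, 0.2]), threshold=0.4)
```

On real data these definitions will often select different frames, which is why the chosen definition should be fixed and reported before the analysis is run.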
Once we have selected our algorithm, we will need to test it on our data, as described
above, to make sure it will yield an accurate analysis of our data. We are now ready to use the
selected algorithm to test our hypothesis.
When testing established hypotheses, we may not want to stop here. Once we have used
the selected algorithm to evaluate our hypothesis, we can modify it to accommodate the new
results. This will give us a new hypothesis, which can be retested on a different dataset of images
or videos and using the same approach described above. Alternatively, and preferably, we can
test the new hypothesis using behavioral or imaging studies. We can repeat this process until our
hypothesis has been modified to justify the new data. In fact, this is one of the main advantages
of using computational analysis in big data – it allows us to modify, extend and tune currently
accepted hypotheses, which will then serve as the seed of novel scientific studies (Martinez,
2017a). For example, Izard, Dougherty, & Hembree (1983) identify multiple expressions in
infants, and Du et al. (2014) hypothesized the existence of compound facial expressions of
emotion, like happily surprised and happily disgusted in adults. Du et al. then went a step further
by presenting a computational analysis that supported their hypothesis. Then, in Du & Martinez
(2015), we tested this novel hypothesis by identifying spontaneous expressions of compound
emotion in the wild. Later, these studies were used to define a new model of the production and
perception of facial expressions (Martinez, 2017b).
Exploratory design
Although hypothesis-based design has been the method of choice for decades, computational
analyses now provide a mechanism to answer questions that were previously impossible to
tackle. Some of the basic scientific questions I listed in the introduction of this paper, for
instance, may not be properly addressed using a hypothesis-based approach. For example, a
fundamental question in the study of the production and perception of facial expressions is to
determine the number of expressions used across cultures (Martinez, 2017b; Barrett et al., in
press). A hypothesis-based experimental design is unsuited to answer this question. We could
define a study to test whether the six so-called “basic” emotions are used across cultures, but this
would still not give us the actual number of cross-cultural expressions we want to know. An
exploratory approach, however, does offer a method to properly address this. A very large
database of images and videos of facial expressions collected in many cultures around the world,
for instance, can be automatically analyzed to identify the facial configurations that are common
across cultures. These results can then be used in behavioral experiments to test whether these
facial configurations do indeed have a common interpretation across languages (Srinivasan &
Martinez, 2019).
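The exploratory analysis outlined above can be sketched as follows, assuming that an AU detector has already been run on every image and that each image carries a culture label. The record format, the function name common_configurations, and the min_count threshold are hypothetical choices for illustration; a real study would also need to account for detector errors and unequal sample sizes across cultures.

```python
from collections import Counter, defaultdict


def common_configurations(records, min_count=2):
    """Return AU configurations seen at least min_count times in every culture.

    records: iterable of (culture, aus) pairs, where aus is an iterable of
    the AUs that the detector marked as active in one image.
    """
    counts = defaultdict(Counter)  # culture -> Counter of configurations
    for culture, aus in records:
        counts[culture][frozenset(aus)] += 1
    if not counts:
        return set()
    # For each culture, keep the configurations that occur often enough.
    frequent = [
        {cfg for cfg, n in per_culture.items() if n >= min_count}
        for per_culture in counts.values()
    ]
    return set.intersection(*frequent)


# Hypothetical toy data: detected AUs for images from two cultures.
records = [
    ("A", [6, 12]), ("A", [12, 6]), ("A", [4, 7]),
    ("B", [6, 12]), ("B", [6, 12]), ("B", [1, 2]), ("B", [1, 2]),
]
common = common_configurations(records)
```

Here only the configuration with AUs 6 and 12 is frequent in both cultures. The configurations returned by such an analysis are candidates to be tested in behavioral experiments, not conclusions in themselves.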
Exploratory experiments may be especially valuable for developmental studies, since
these allow us to explore the evolution of our variables of interest over time. For example, we
can use computer vision algorithms to delineate the narrowing of facial configurations as we age,
or the acquisition of expertise on the production of expressions used in non-verbal
communication.
Computing statistical significance is particularly important in exploratory experiments.
Using the evaluation methods defined above does not remove the need to compute statistical
significance. Given a large dataset of images or videos, a t-test can be readily computed, in
which case we should aim for p<.001, or use confidence intervals (Cumming, 2013).
Alternatively, we can compute the likelihood that the results we observed cannot be obtained
from permuted data. To perform this test, we permute the labels of our AU detector and run the
same statistical analysis we used to complete our study. If we can still obtain results, regardless
of whether these are the same or different than those obtained before permuting the AU labels,
then our results should not be trusted, since any meaningless assignment of AUs yields a possible
result.
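One common way to carry out the permutation check described above is sketched below. It assumes a simple study in which each image has a binary detection for a single AU and belongs to one of two groups (for instance, two age groups), and in which the result of interest is the difference in activation rate between the groups. The function names, the test statistic, and the number of permutations are assumptions for illustration; in practice the statistic should be whatever analysis was used in the study itself.

```python
import random


def rate_difference(detections, groups):
    """Absolute difference in AU activation rate between groups 0 and 1."""
    g0 = [d for d, g in zip(detections, groups) if g == 0]
    g1 = [d for d, g in zip(detections, groups) if g == 1]
    return abs(sum(g0) / len(g0) - sum(g1) / len(g1))


def permutation_p_value(detections, groups, n_permutations=2000, seed=0):
    """Estimate how often shuffled AU labels give a result as large as ours."""
    rng = random.Random(seed)
    observed = rate_difference(detections, groups)
    shuffled = list(detections)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any link between AU labels and groups
        if rate_difference(shuffled, groups) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical toy data: 20 images per group, AU active (1) or not (0).
groups = [0] * 20 + [1] * 20
strong = [1] * 18 + [0] * 2 + [0] * 18 + [1] * 2   # clear group difference
null = [1, 0] * 20                                  # no group difference
p_strong = permutation_p_value(strong, groups)
p_null = permutation_p_value(null, groups)
```

A small value indicates that meaningless assignments of AU labels rarely reproduce the observed result. A large value means the shuffled labels yield comparable results, in which case the finding should not be trusted.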
Which computer vision or machine learning approach?
We defined the different approaches to automatic coding of AUs earlier for a good reason:
doing so now facilitates the selection of our algorithm. In general,
algorithms that use discriminant analysis (You et al., 2011) and deep learning (Benitez-Quiroz et
al., 2017a, 2017b) work best for images. If there is large 3D head movement, algorithms that
utilize structure from motion and 3D shape analysis, or deep neural networks that can recover
the 3D shape of the face from a single 2D image (Zhao et al., 2018), may be preferred. And, if we
wish to identify facial movements of non-salient facial components, then 3D dense shape
recovery methods would be preferred. In some cases, it may also be appropriate to select other
algorithms. For example, if we wish to study the perception of implied motion in still images
with AUs, then an algorithm that computes the optical flow or uses a template matching
procedure to determine the movement of a set of fiducial points defining that AU may be the
most appropriate.
If we are interested in video, the latest algorithm is that of Benitez-Quiroz et al. (2019),
which has been shown to outperform other methods on standard datasets. But, as always, you
will need to evaluate whether this (or some other algorithm) works best on your data.
We may also want to compare the results of an AU analysis with those of the changes in
facial color due to emotional experiences. Facial articulations and color are believed to be
controlled by (at least partially) dissociated neural mechanisms (Benitez-Quiroz et al., 2018),
suggesting parallel ways of studying emotion. Having multiple means of investigating the same
scientific question can add robustness to our studies.
Finally, if we select an algorithm, test it, and determine it does not work well on our data,
we should then select an algorithm that uses a distinct approach. The reason for this is simple.
Although multiple algorithms have been derived for each approach, if one of them does not work
on the type of data we have, it is unlikely another algorithm in the same group will work much
better. We have a better chance with an algorithm that uses a distinct methodology.
As detailed earlier in this paper, some of the computer vision and all the machine learning
algorithms can be retrained with a portion of your data. That means you will need to manually
annotate a portion of your still images and/or video sequences and then use them to retrain the
available algorithm. This should only be done if none of the existing (pre-trained) algorithms
worked and we have a computer vision expert that can help us perform this technical step, but it
is an option to consider and one that typically yields excellent results.
Also note that some of these algorithms are available from companies and may be easy
to use as out-of-the-box tools. Others are available from researchers and require basic
computer science knowledge to operate. As above, the best course of action
is to add to your group a computer vision researcher who knows how to run and test these
algorithms. But do not leave all the work to them. You should discuss which of the algorithms
and approaches described in this paper are available to you and why one is believed to be a better
choice than the others. Make sure you discuss how to evaluate the selected algorithms too.
Conclusions
Automatic facial action coding has the potential to be of major help to researchers
studying the role faces play in a number of verbal and non-verbal social interactions (Martinez,
2017a; Benitez-Quiroz et al., 2014, 2016). This is especially useful for developmental
psychologists interested in probing the role of facial configurations in a number of infant and
developmental studies. Herein, I have summarized the main computer vision algorithms
available to researchers and, most importantly, how to properly use them in scientific studies.
Several researchers are already using these algorithms in their research studies (Zanette et
al., 2016; Martinez, 2017a; Sikka et al., 2015; De la Torre & Cohn, 2011), a number that is
expected to grow rapidly in the next few years. While such systems are welcomed and a good
opportunity to advance research in facial expressions, emotion, affect, sign language, and
developmental psychology, they also need to be used and tested properly before being embraced
as a universal solution to each and every facial analysis we might need. This paper provides a
guide on how to achieve that. Specifically, I have presented a methodology to help researchers
select the most appropriate computer vision algorithm for a given task and provided details of the
distinct algorithms that are available to researchers. Taxonomies of the analyses and computer
vision algorithms were presented in Figures 2 and 4.
It is also important to note that this paper provides a guide on how to use available facial
action coding algorithms, not systems that purport to automatically detect emotion categories
and valence. There is a good reason for this: the latter systems do not recognize all emotion
categories or valence in images, but, instead, analyze images based on preconceived ideas of
emotion that are most likely inaccurate (Barrett et al., in press). For example, Srinivasan &
Martinez (2019) recently showed there are at least 17 facial expressions of happiness with
varying AUs, and these are not accounted for in computer algorithms designed to categorize
emotion in images. The same is true for valence. Additionally, facial color is a marker of affect
that has until very recently been omitted (Benitez-Quiroz et al., 2018) and omitting it can readily
result in misinterpretations of the observed expressions.
In summary, the time is right to move to an automatic analysis of facial expressions. This
is likely to revolutionize the study of nonverbal communication and emotion and will surely be a
fundamental tool for developmental psychologists for years to come. However, when using these
computational tools, care needs to be taken in both the selection of the computer vision
algorithms and the experimental design, otherwise we run the risk of uncovering nonexistent
features of our social and cognitive development.
References
Agudo, A., Agapito, L., Calvo, B., & Montiel, J. M. (2014). Good vibrations: A modal analysis
approach for sequential non-rigid structure from motion. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (pp. 1558-1565). DOI:
10.1109/CVPR.2014.202
Agudo, A., & Moreno-Noguer, F. (2018a). A scalable, efficient, and accurate solution to non-
rigid structure from motion. Computer Vision and Image Understanding, 167, 121-133.
DOI: 10.1016/j.cviu.2018.01.002
Agudo, A., & Moreno-Noguer, F. (2018b). Deformable Motion 3D Reconstruction by Union of
Regularized Subspaces. In 2018 25th IEEE International Conference on Image
Processing (ICIP) (pp. 2930-2934). IEEE. DOI: 10.1109/ICIP.2018.8451235
Albiero, V., Bellon, O. R., & Silva, L. (2018). Multi-label action unit detection on multiple head
poses with dynamic region learning. In Proc. IEEE International Conference on Image
Processing (pp. 2037-2041). DOI: 10.1109/ICIP.2018.8451267
Bai, Y., Fu, J., Zhao, T., & Mei, T. (2018, September). Deep attention neural tensor network for
visual question answering. In Proc. European Conference on Computer Vision, Munich,
Germany, Part XII (p. 20). Springer. DOI: 10.1007/978-3-030-01258-8_2
Baker, S., Scharstein, D., Lewis, J. P., Roth, S., Black, M. J., & Szeliski, R. (2011). A database
and evaluation methodology for optical flow. International Journal of Computer Vision,
92(1), 1-31. DOI: 10.1007/s11263-010-0390-2
Bartlett, M. S., Littlewort, G., Frank, M., Lainscsek, C., Fasel, I., & Movellan, J. (2005).
Recognizing facial expression: machine learning and application to spontaneous
behavior. In Proc. IEEE Computer Vision and Pattern Recognition, (Vol. 2, pp. 568-573).
DOI: 10.1109/CVPR.2005.297
Barrett, L. F., Adolphs, R., Marsella, S., Martinez, A. M., & Pollak, S. (in press). Emotional
Expressions Reconsidered: Challenges to Inferring Emotion in Human Facial
Movements. Psychological Science in the Public Interest.
Benitez-Quiroz, C. F., Gökgöz, K., Wilbur, R. B., & Martinez, A. M. (2014). Discriminant
features and temporal structure of nonmanuals in American Sign Language. PloS one,
9(2), e86268. DOI: 10.1371/journal.pone.0086268
Benitez-Quiroz, C. F., Srinivasan, R., & Martinez, A. M. (2016). Emotionet: An accurate, real-
time algorithm for the automatic annotation of a million facial expressions in the wild. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp.
5562-5570). DOI: 10.1109/CVPR.2016.600
Benitez-Quiroz, C. F., Srinivasan, R., & Martinez, A. M. (2018). Facial color is an efficient
mechanism to visually transmit emotion. Proceedings of the National Academy of
Sciences, 201716084. DOI: 10.1073/pnas.1716084115
Benitez-Quiroz, F., Srinivasan, R., & Martinez, A. M. (2019). Discriminant Functional Learning
of Color Features for the Recognition of Facial Action Units and their Intensities. IEEE
Transactions on pattern analysis and machine intelligence. DOI:
10.1109/TPAMI.2018.2868952
Benitez-Quiroz, C. F., Srinivasan, R., Feng, Q., Wang, Y., & Martinez, A. M. (2017a).
EmotioNet Challenge: Recognition of facial expressions of emotion in the wild. arXiv
preprint arXiv:1703.01210.
Benitez-Quiroz, C. F., Wang, Y., & Martinez, A. M. (2017b). Recognition of action units in the
wild with deep nets and a new global-Local loss. In Proceedings of the International
Conference on Computer Vision. DOI: 10.1109/ICCV.2017.428
Benitez-Quiroz, C. F., Wilbur, R. B., & Martinez, A. M. (2016). The not face: A
grammaticalization of facial expressions of emotion. Cognition, 150, 77-84. DOI:
10.1016/j.cognition.2016.02.004
Bennett, D. S., Bendersky, M., & Lewis, M. (2005). Does the organization of emotional
expression change over time? Facial expressivity from 4 to 12 months. Infancy, 8(2),
167-187. DOI: 10.1207/s15327078in0802_4
Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in
commercial gender classification. In Conference on Fairness, Accountability and
Transparency (pp. 77-91).
Castro, V. L., Camras, L. A., Halberstadt, A. G., & Shuster, M. (2017). Children’s Prototypic
Facial Expressions During Emotion-Eliciting Conversations with Their Mothers.
Emotion, 18(2), 260-276. DOI: 10.1037/emo0000354
Chang, F. J., Tran, A. T., Hassner, T., Masi, I., Nevatia, R., & Medioni, G. (2018). ExpNet:
Landmark-free, deep, 3D facial expressions. In Proc. IEEE International Conference on
Automatic Face & Gesture Recognition (pp. 122-129). DOI: 10.1109/FG.2018.00027
Chu, W. S., De la Torre, F., & Cohn, J. F. (2019). Learning facial action units with
spatiotemporal cues and multi-label sampling. Image and Vision Computing, 81, 1-14.
DOI: 10.1016/j.imavis.2018.10.002
Cohen, I., Sebe, N., Garg, A., Chen, L. S., & Huang, T. S. (2003). Facial expression recognition
from video sequences: temporal and static modeling. Computer Vision and Image
Understanding, 91(1-2), 160-187. DOI: 10.1016/S1077-3142(03)00081-X
Corneanu, C. A., Madadi, M., & Escalera, S. (2018). Deep Structure Inference Network for
Facial Action Unit Recognition. In Proc. European Conference on Computer Vision.
Corneanu, C. A., Simón, M. O., Cohn, J. F., & Guerrero, S. E. (2016). Survey on rgb, 3d,
thermal, and multimodal approaches for facial expression recognition: History, trends,
and affect-related applications. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 38(8), 1548-1568. DOI: 10.1109/TPAMI.2016.2515606
Cumming, G. (2013). Understanding the new statistics: Effect sizes, confidence intervals, and
meta-analysis. Routledge. DOI: 10.4324/9780203807002
De la Torre, F., & Cohn, J. F. (2011). Facial expression analysis. In Visual analysis of humans
(pp. 377-409). Springer, London. DOI: 10.1007/978-0-85729-997-0_19
Deng, W., Hu, J., & Guo, J. (2018). Face recognition via collaborative representation: its
discriminant nature and superposed representation. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 40(10), 2513-2521. DOI:
10.1109/TPAMI.2017.2757923
Donato, G., Bartlett, M. S., Hager, J. C., Ekman, P., & Sejnowski, T. J. (1999). Classifying facial
actions. IEEE Transactions on pattern analysis and machine intelligence, 21(10), 974.
DOI: 10.1109/34.799905
Draper, B. A., Baek, K., Bartlett, M. S., & Beveridge, J. R. (2003). Recognizing faces with PCA
and ICA. Computer vision and image understanding, 91(1-2), 115-137. DOI:
10.1016/S1077-3142(03)00077-8
Du, S., & Martinez, A. M. (2015). Compound facial expressions of emotion: from basic research
to clinical applications. Dialogues in Clinical Neuroscience, 17(4), 443–455.
Du, S., Tao, Y., & Martinez, A. M. (2014). Compound facial expressions of emotion.
Proceedings of the National Academy of Sciences, 111 (15) E1454-E1462. DOI:
10.1073/pnas.1322355111
Ekman, P. (2016). What scientists who study emotion agree about. Perspectives on
Psychological Science, 11(1), 31-34. DOI: 10.1177/1745691615596992
Ekman, P., & Rosenberg, E. L. (Eds.). (2005). What the face reveals: Basic and applied studies
of spontaneous expression using the Facial Action Coding System (FACS). 2nd edition.
Oxford University Press, USA.
Garg, R., Roussos, A., & Agapito, L. (2013). Dense variational reconstruction of non-rigid
surfaces from monocular video. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (pp. 1272-1279). DOI: 10.1109/CVPR.2013.168
Gaspar, A., & Esteves, F. G. (2012). Preschooler’s faces in spontaneous emotional contexts—
How well do they match adult facial expression prototypes? International Journal of
Behavioral Development, 36(5), 348-357. DOI: 10.1177/0165025412441762
Gegenfurtner, K. R. (2003). Cortical mechanisms of colour vision. Nature Reviews
Neuroscience, 4(7), 563. DOI: 10.1038/nrn1138
Gervain, J., & Mehler, J. (2010). Speech perception and language acquisition in the first year of
life. Annual review of psychology, 61, 191-218. DOI:
10.1146/annurev.psych.093008.100408
Girard, J. M., Cohn, J. F., Jeni, L. A., Lucey, S., & De la Torre, F. (2015). How much training
data for facial action unit detection? In Proc. IEEE International Conference Automatic
Face and Gesture Recognition. DOI: 10.1109/FG.2015.7163106
Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning. Cambridge: MIT
press. DOI: 10.4258/hir.2016.22.4.351
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.
& Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information
Processing Systems (pp. 2672-2680).
Gotardo, P. F., & Martinez, A. M. (2011a). Computing smooth time trajectories for camera and
deformable shape in structure from motion with occlusion. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 33(10), 2051-2065. DOI:
10.1109/TPAMI.2011.50
Gotardo, P. F., & Martinez, A. M. (2011b). Non-rigid structure from motion with
complementary rank-3 spaces. In Proc. IEEE Conf. Computer Vision and Pattern
Recognition. DOI: 10.1109/CVPR.2011.5995560
Gotardo, P. F., & Martinez, A. M. (2011c). Kernel non-rigid structure from motion. IEEE
International Conference on Computer Vision (pp. 802-809). DOI:
10.1109/ICCV.2011.6126319
Hamsici, O. C., Gotardo, P. F., & Martinez, A. M. (2012). Learning spatially-smooth mappings
in non-rigid structure from motion. In European Conference on Computer Vision (pp.
260-273). Springer, Berlin, Heidelberg. DOI: 10.1007/978-3-642-33765-9_19
Hamsici, O. C., & Martinez, A. M. (2009a). Rotation invariant kernels and their application to
shape analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11),
1985-1999. DOI: 10.1109/TPAMI.2008.234
Hamsici, O. C., & Martinez, A. M. (2009b). Active appearance models with rotation invariant
kernels. In Proc. IEEE 12th International Conference on Computer Vision (pp. 1003-
1009). DOI: 10.1109/ICCV.2009.5459365
Hamsici, O. C., & Martinez, A. M. (2007). Spherical-homoscedastic distributions: The
equivalency of spherical and normal distributions in classification. Journal of Machine
Learning Research, 8(Jul), 1583-1623.
Holodynski, M. & Seeger, D. (2019). Expressions as Signs and Their Significance for Emotional
Development. Developmental Psychology (this issue).
Jack, R. E., Garrod, O. G., Yu, H., Caldara, R., & Schyns, P. G. (2012). Facial expressions of
emotion are not culturally universal. Proceedings of the National Academy of Sciences,
109(19), 7241-7244. DOI: 10.1073/pnas.1200155109
Jia, H., & Martinez, A. M. (2009). Low-rank matrix fitting based on subspace perturbation
analysis with applications to structure from motion. IEEE transactions on pattern analysis
and machine intelligence, 31(5), 841-854. DOI: 10.1109/TPAMI.2008.122
Jin, X., & Tan, X. (2017). Face alignment in-the-wild: A survey. Computer Vision and Image
Understanding, 162, 1-22. DOI: 10.1016/j.cviu.2017.08.008
Jourabloo, A., Ye, M., Liu, X., & Ren, L. (2017). Pose-invariant face alignment with a single
CNN. In Proc. IEEE International Conference on Computer Vision (pp. 3219-3228).
DOI: 10.1109/ICCV.2017.347
Kotsia, I., & Pitas, I. (2007). Facial expression recognition in image sequences using geometric
deformation features and support vector machines. IEEE transactions on image
processing, 16(1), 172-187. DOI: 10.1109/TIP.2006.884954
Lee, M., Cho, J., Choi, C. H., & Oh, S. (2013). Procrustean normal distribution for non-rigid
structure from motion. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (pp. 1280-1287). DOI: 10.1109/TPAMI.2016.2596720
Leitzke, B. T., & Pollak, S. D. (2016). Developmental changes in the primacy of facial cues for
emotion recognition. Developmental Psychology, 52(4), 572. DOI: 10.1037/a0040067
Li, K., Yang, J., & Jiang, J. (2015). Nonrigid structure from motion via sparse representation.
IEEE Transactions on Cybernetics, 45(8), 1401-1413. DOI:
10.1109/TCYB.2014.2351831
Lien, J. J., Kanade, T., Cohn, J. F., & Li, C. C. (1998). Automated facial expression recognition
based on FACS action units. In IEEE Face & Recognition Workshop (p. 390). DOI:
10.1109/AFGR.1998.670980
Liu, Y. J., Zhang, J. K., Yan, W. J., Wang, S. J., Zhao, G., & Fu, X. (2016). A main directional
mean optical flow feature for spontaneous micro-expression recognition. IEEE
Transactions on Affective Computing, 7(4), 299-310. DOI:
10.1109/TAFFC.2015.2485205
Lucey, P., Cohn, J. F., Kanade, T., Saragih, J., Ambadar, Z., & Matthews, I. (2010). The
extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-
specified expression. In Proc. IEEE Computer Vision and Pattern Recognition,
Workshops (pp. 94-101). IEEE. DOI: 10.1109/CVPRW.2010.5543262
Lucey, P., Cohn, J. F., Prkachin, K. M., Solomon, P. E., & Matthews, I. (2011). Painful data: The
UNBC-McMaster shoulder pain expression archive database. In Proc. IEEE International
Conference on Automatic Face & Gesture Recognition (pp. 57-64). IEEE. DOI:
10.1109/FG.2011.5771462
Lyons, M. J., Budynek, J., & Akamatsu, S. (1999). Automatic classification of single facial
images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(12), 1357-
1362. DOI: 10.1109/34.817413
Lyons, M. J., Campbell, R., Plante, A., Coleman, M., Kamachi, M., & Akamatsu, S. (2000). The
Noh mask effect: vertical viewpoint dependence of facial expression perception.
Proceedings of the Royal Society of London B: Biological Sciences, 267(1459), 2239-
2245. DOI: 10.1098/rspb.2000.1274
Martinez, A. (1999). Face image retrieval using HMMs. In Proc. IEEE Workshop on
Content-Based Access of Image and Video Libraries (pp. 35-39). DOI:
10.1109/IVL.1999.781120
Martinez, A. M. (2003a). Recognizing expression variant faces from a single sample image per
class. In Proc. IEEE Computer Vision and Pattern Recognition, Madison, WI. DOI:
10.1109/CVPR.2003.1211375
Martinez, A. M. (2003b). Matching expression variant faces. Vision Research, 43(9), 1047-1060.
DOI: 10.1016/S0042-6989(03)00079-8
Martinez, A. M. (2017a). Computational models of face perception. Current Directions in
Psychological Science, 26(3), 263-269. DOI: 10.1177/0963721417698535
Martinez, A. M. (2017b). Visual perception of facial expressions of emotion. Current Opinion in
Psychology, 17:27-33. DOI: 10.1016/j.copsyc.2017.06.009
Martinez, A., & Du, S. (2012). A model of the perception of facial expressions of emotion by
humans: Research overview and perspectives. Journal of Machine Learning Research,
13(May), 1589-1608.
Martinez, A. M., & Kak, A. C. (2001). PCA versus LDA. IEEE Transactions on Pattern Analysis &
Machine Intelligence, (2), 228-233. DOI: 10.1109/34.908974
Martinez, A. M., & Vitria, J. (2001). Clustering in image space for place recognition and visual
annotations for human-robot interaction. IEEE Transactions on Systems, Man, and
Cybernetics, Part B (Cybernetics), 31(5), 669-682. DOI: 10.1109/3477.956029
Martinez, A. M., & Zhu, M. (2005). Where are linear feature extraction methods applicable?.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12), 1934-1944.
DOI: 10.1109/TPAMI.2005.250
Matias, R., & Cohn, J. F. (1993). Are max-specified infant facial expressions during face-to-face
interaction consistent with differential emotions theory?. Developmental Psychology,
29(3), 524.
Mavadati, S. M., Mahoor, M. H., Bartlett, K., Trinh, P., & Cohn, J. F. (2013). Disfa: A
spontaneous facial action intensity database. IEEE Transactions on Affective Computing,
4(2), 151-160. DOI: 10.1109/T-AFFC.2013.4
Neth, D., & Martinez, A. M. (2009). Emotion perception in emotionless face images suggests a
norm-based representation. Journal of vision, 9(1), 5. DOI: 10.1167/9.1.5
Neth, D., & Martinez, A. M. (2010). A computational shape-based model of anger and sadness
justifies a configural representation of faces. Vision research, 50(17), 1693-1711. DOI:
10.1016/j.visres.2010.05.024
Oster, H. (2003). Emotion in the Infant's Face. Annals of the New York Academy of Sciences,
1000(1), 197-204. DOI: 10.1196/annals.1280.024
Oster, H. (2006). Baby FACS: Facial Action Coding System for infants and young children.
Monograph and coding manual. New York University.
Pons, G., & Masip, D. (2018). Multi-task, multi-label and multi-domain learning with residual
convolutional networks for emotion recognition. arXiv preprint arXiv:1802.06664.
Poursabzi-Sangdeh, F., Goldstein, D. G., Hofman, J. M., Vaughan, J. W., & Wallach, H. (2018).
Manipulating and measuring model interpretability. arXiv preprint arXiv:1802.07810.
Pumarola, A., Agudo, A., Martinez, A. M., Sanfeliu, A., & Moreno-Noguer, F. (2018, July).
Ganimation: Anatomically-aware facial animation from a single image. In Proceedings of
the European Conference on Computer Vision (pp. 818-833). DOI: 10.1007/978-3-030-
01249-6_50
Rad, M., Oberweger, M., & Lepetit, V. (2018). Domain Transfer for 3D Pose Estimation from
Color Images without Manual Annotations. arXiv preprint arXiv:1810.03707.
Reeb-Sutherland, B. C., Rankin Williams, L., Degnan, K. A., Pérez-Edgar, K., Chronis-Tuscano,
A., Leibenluft, E., ... & Fox, N. A. (2015). Identification of emotional facial expressions
among behaviorally inhibited adolescents with lifetime anxiety disorders. Cognition and
Emotion, 29(2), 372-382. DOI: 10.1080/02699931.2014.913552
Reynolds, D. (2015). Gaussian mixture models. Encyclopedia of Biometrics, pp. 827-832.
Rivera, S., & Martinez, A. M. (2012). Learning deformable shape manifolds. Pattern
Recognition, 45(4), 1792-1801. DOI: 10.1016/j.patcog.2011.09.023
Romero, A., Arbeláez, P., Van Gool, L., & Timofte, R. (2018). SMIT: Stochastic Multi-Label
Image-to-Image Translation. arXiv preprint arXiv:1812.03704.
Savran, A., Sankur, B., & Bilge, M. T. (2012). Regression-based intensity estimation of facial
action units. Image and Vision Computing, 30(10), 774-784. DOI:
10.1016/j.imavis.2011.11.008
Sikka, K., Ahmed, A. A., Diaz, D., Goodwin, M. S., Craig, K. D., Bartlett, M. S., & Huang, J. S.
(2015). Automated assessment of children’s postoperative pain using computer vision.
Pediatrics, 136(1), e124-e131. DOI: 10.1542/peds.2015-0029
Simon, T., Nguyen, M. H., De La Torre, F., & Cohn, J. F. (2010). Action unit detection with
segment-based svms. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE
Conference on (pp. 2737-2744). IEEE. DOI: 10.1109/CVPR.2010.5539998
Song, Y., McDuff, D., Vasisht, D., & Kapoor, A. (2015). Exploiting sparsity and co-occurrence
structure for action unit recognition. In Proc. IEEE International Conference on
Automatic Face and Gesture Recognition (Vol. 1, pp. 1-8). DOI:
10.1109/FG.2015.7163081
Srinivasan, R., & Martinez, A. M. (2019). Cross-Cultural and Cultural-Specific Production and
Perception of Facial Expressions of Emotion in the Wild. IEEE Transactions on Affective
Computing. DOI: 10.1109/TAFFC.2018.2887267
Srinivasan, R., Golomb, J. D., & Martinez, A. M. (2016). A neural basis of facial action
recognition in humans. The Journal of Neuroscience, 36, 4434-4442. DOI:
10.1523/JNEUROSCI.1704-15.2016
Sun, Y., Reale, M., & Yin, L. (2008). Recognizing partial facial action units based on 3D
dynamic range data for facial expression recognition. In Proc. IEEE International
Conference on Automatic Face & Gesture Recognition (pp. 1-8). DOI:
10.1109/AFGR.2008.4813336
Tian, Y. L., Kanade, T., & Cohn, J. F. (2002). Evaluation of Gabor-wavelet-based facial action
unit recognition in image sequences of increasing complexity. In Proc. IEEE
International Conference on Automatic Face and Gesture Recognition (pp. 229-234).
DOI: 10.1109/AFGR.2002.1004159
Todorov, A., Dotsch, R., Porter, J. M., Oosterhof, N. N., & Falvello, V. B. (2013). Validation of
data-driven computational models of social perception of faces. Emotion, 13(4), 724.
DOI: 10.1037/a0032335
Tome, D., Russell, C., & Agapito, L. (2017). Lifting from the deep: Convolutional 3d pose
estimation from a single image. In Proc. IEEE Conf. Computer Vision and Pattern
Recognition, pp. 2500-2509. DOI: 10.1109/CVPR.2017.603
Vielzeuf, V., Kervadec, C., Pateux, S., & Jurie, F. (2018). The Many Moods of Emotion. arXiv
preprint arXiv:1810.13197.
Wan, H., Wang, H., Guo, G., & Wei, X. (2018). Separability-oriented subclass discriminant
analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(2), 409-
422. DOI: 10.1109/TPAMI.2017.2672557
Werker, J. F., & Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual
reorganization during the first year of life. Infant behavior and development, 7(1), 49-63.
DOI: 10.1016/S0163-6383(84)80022-3
Wiles, O., Koepke, A., & Zisserman, A. (2018). Self-supervised learning of a facial attribute
embedding from video. arXiv preprint arXiv:1808.06882.
Witkower, Z., & Tracy, J. L. (in press). A facial action imposter: How head tilt influences
perceptions of dominance from a neutral face. Psychological Science.
Yang, P., Liu, Q., & Metaxas, D. N. (2007). Boosting coded dynamic features for facial action
units and facial expression recognition. In Proc. IEEE Conf. Computer Vision and Pattern
Recognition. DOI: 10.1109/CVPR.2007.383059
You, D., Hamsici, O. C., & Martinez, A. M. (2011). Kernel optimization in discriminant
analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3), 631-
638. DOI: 10.1109/TPAMI.2010.173
Zanette, S., Gao, X., Brunet, M., Bartlett, M. S., & Lee, K. (2016). Automated decoding of facial
expressions reveals marked differences in children when telling antisocial versus
prosocial lies. Journal of Experimental Child Psychology, 150, 165-179. DOI:
10.1016/j.jecp.2016.05.007
Zhang, X., Mahoor, M. H., Mavadati, S. M., & Cohn, J. F. (2014). A lp-norm MTMKL framework
for simultaneous detection of multiple facial action units. In Proc. IEEE Winter
Conference on Applications of Computer Vision (pp. 1104-1111). DOI:
10.1109/WACV.2014.6835735
Zhao, K., Chu, W. S., & Martinez, A. M. (2018). Learning Facial Action Units From Web
Images With Scalable Weakly Supervised Clustering. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (pp. 2090-2099).
Zhao, R., Wang, Y., Benitez-Quiroz, C. F., Liu, Y., & Martinez, A. M. (2016). Fast and precise
face alignment and 3D shape reconstruction from a single 2D image. In Proc. European
Conference on Computer Vision (pp. 590-603). Springer. DOI: 10.1007/978-3-319-48881-3_41
Zhao, R., Wang, Y., & Martinez, A. M. (2018). A simple, fast and highly-accurate algorithm to
recover 3D shape from 2D landmarks on a single image. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 40(12), 3059-3066. DOI:
10.1109/TPAMI.2017.2772922
Zhu, M., & Martinez, A. M. (2006). Selecting principal components in a two-stage LDA
algorithm. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (pp. 132-137).
DOI: 10.1109/CVPR.2006.271
Figure 1. a. Action Units (AUs). ©Dirk W. Eilert, Eilert-Academy, Germany. Reprinted with
permission. b. Automatic annotation of AUs with the algorithm of Benitez-Quiroz et al. (2016).
The individual whose face appears here gave signed consent for his likeness to be published in
this article.
Figure 2. Can I use a computer vision system to automatically annotate action units in my
images and videos? Shown here is a taxonomy of what computer vision algorithms can and
cannot do at the moment. Follow the arrows to determine to what degree algorithms can help in
your studies. If you reach the blue box, you will most likely be able to identify an algorithm that
can do most (if not all) of the job (see text for how to identify the algorithm). If you reach the
green box, there is likely an algorithm you can use, but you will need an expert AU coder to
determine how well the algorithm works on your data and verify the results of your experiment.
But, if you reach the red box, computer vision algorithms will provide only minimal help and
you will require a human expert to aid, adjust and verify the algorithm and data at each stage of
the experiment.
Figure 3. What coding is needed for our study? a. Qualitative coding of AUs with or without
intensities. b. Quantitative analysis of AU activation over time. The individuals whose faces
appear here gave signed consent for their likenesses to be published in this article.
Figure 4. A taxonomy of the most popular techniques used to automatically annotate action units
in face images. Several algorithms use more than one of these techniques to detect AUs in face
images.
Figure 5. The lines in the left image indicate the apparent movement of fiducial points (i.e.,
optical flow) needed to move from a norm (average) neutral face to the facial expression shown
on the right. Note how this optical flow specifies the outward pulling of the corners of the lips (AU 12)
and the parting of the lips (AU 25). The individual whose face appears here gave signed consent
for his likeness to be published in this article.
Figure 6. 2D and 3D shape automatically extracted from a video sequence using: a. Procrustes
analysis with rotation invariant kernels, and b. non-rigid structure from motion. The individual
whose face appears here gave signed consent for their likeness to be published in this article.
Figure 7. Given the image shown on the left (indicated with a green frame), the algorithm of
Pumarola et al. (2018) is able to edit the image to illustrate what the face would look like with
distinct AUs active at different intensities. Here, AUs 12 and 25 are added, with their intensities
increasing from left to right. Note that the only real image is the one on the left; all others are
computer generated, i.e., fake. Adapted with permission from Pumarola, A., Agudo, A.,
Martinez, A. M., Sanfeliu, A., & Moreno-Noguer, F. (2018). GANimation: Anatomically-aware
facial animation from a single image. In Proceedings of the European Conference on Computer
Vision (pp. 818-833).