International Conference on Communication and Signal Processing, April 3-5, 2014, India
A Study on Emotion Recognition from Body Gestures Using Kinect Sensor
Sriparna Saha, Shreyasi Datta, Amit Konar and Ramadoss Janarthanan
Abstract- This novel work is aimed at the study of emotion
recognition from gestures using Kinect sensor. The Kinect sensor
along with Software Development Kit (SDK) generates the
human skeleton represented by 3-dimensional coordinates
corresponding to twenty body joints. Using the co-ordinates of
eleven such joints from the upper body and the hands, a set of
nine features based on the distances, accelerations and angles
between the different joints have been extracted. These features
are able to uniquely identify gestures corresponding to five basic
human emotional states, namely, 'Anger', 'Fear', 'Happiness',
'Sadness' and 'Relaxation'. The goal of the proposed system is to
classify an emotion based on body gesture. A comparison of
classification using binary decision tree, ensemble decision tree,
k-nearest neighbour, support vector machine with radial basis
function kernel and neural network classifier based on back
propagation learning is made, in terms of average classification
accuracy and computation time. A high overall recognition rate
of 90.83% is obtained from the ensemble decision tree.
Index Terms- Emotion Recognition, Gesture Classification,
Kinect Sensor
I. INTRODUCTION
GESTURES constitute an important medium for human
beings to interact and communicate with the
surroundings. Gestures are expressive body movements,
involving the face and hands in majority of the cases. Gestures
are an important indication of a person's mental state and
his/her emotions and hence human emotions can be classified
based on their gestures.
Emotion recognition is a very important aspect of human
computer interaction [1]. Recognizing emotions from gestures
will enable humans to efficiently communicate with machines
using only sign language according to one's mental state. It is
interesting to study how human emotions can be recognized
from the corresponding gestures so that such gestures can be
utilized to control a machine according to the human
emotional state. For example, if a sad person enters a room
where loud music is being played, the music player will be
automatically switched off or turned mute.
Emotions are mainly classified from facial expressions, i.e.,
Sriparna Saha, Shreyasi Datta and Amit Konar are with the Department of Electronics and Telecommunication Engineering, Jadavpur University, Kolkata, India (e-mails: [email protected], [email protected], [email protected]).
R. Janarthanan is with the department of Computer Science and Engineering, T J S Engineering College, Chennai, India (e-mail: srmjana [email protected]).
978-1-4799-3358-7/14/$31.00 ©2014 IEEE
the movements of cheek, chin, and wrinkles [2], but very few of them correlate the facial expression with the body movements [3]. Another important principle of emotion recognition is based on EEG signals from brain activity [4].
With reference to human-computer interaction, gesture
recognition plays an important role. In [5], a condensation
algorithm implements a visual tracking system. In [6], real
time MPEG video inputs are provided and AC-DCT
coefficients are used to obtain Eigen space representation of
human silhouettes. Two standard cameras, each with a video processing board, are required for this work. In [7], a time-of-flight depth camera along with a pulse coupled neural network (PCNN) is used for recognition of
human body in distorted background. Here hierarchical
decision tree is used for the classification purpose.
For the present work, human emotions are recognized from
the corresponding gestures captured using a Kinect Sensor [8], [9]. A Kinect sensor has been utilized to detect the skeleton of
a person while making gestures according to his/her emotional
state, using 20 body joint co-ordinates. The work focuses on
the recognition of five basic types of emotions from body
gestures. They are 'Anger', 'Fear', 'Happiness', 'Sadness',
and 'Relaxation'. A total of nine unique features based on the
angles and displacements of the body joints have been
extracted for a sequence of gestures for depicting a particular
emotion. The information regarding movements of the feet
and the lower parts of the body has not been taken into
account in the present work as these five particular emotions
are decoded efficiently from the movements and gestures
made using the upper body alone.
Classification of emotions from the extracted features is
carried out using a binary decision tree based classifier and a
tree classifier based on ensemble learning. A comparison of
these methods is made with three standard pattern classifiers,
namely, k-nearest neighbour (k-NN), support vector machine
(SVM) with radial basis function kernel and neural network
classifier based on back-propagation learning, in terms of
average classification accuracy and computation time. The
best result is obtained for ensemble decision tree with 90.83%.
The required time is least for binary decision tree with 0.9 sec
and largest for ensemble tree with 15.592 sec for recognition
of each gesture sequence of 60 seconds duration. The codes
are executed in Intel Core2Duo Processor with Matlab
R2012b.
Section II provides a discussion on the Kinect Sensor. The
methodology followed has been described in Section III with
explanations of features extracted from the subjects' skeleton
as well as the scheme of classification of emotions from
gestures using the extracted features with different classifiers.
Section IV presents the experiments and results. Finally,
Section V provides the conclusions with directions for future
work.
II. KINECT SENSOR
The Kinect [8], [9] is a sensor device with a set of IR and RGB cameras. It appears as a long horizontal bar with a motorized base, as shown in Fig. 1. It detects
the 3D image of an object and tracks the skeleton of the
person standing in front of it within a finite distance. The Kinect sensor, with the help of the corresponding
Software Development Kit (SDK) senses the skeleton and the
body postures irrespective of the color of the skin or the
individual's dress. In Fig. 1, an RGB image of a person and the corresponding skeleton generated by the Kinect are shown. The Kinect Sensor produces the human skeleton
represented by twenty body joints in the 3-D space. Out of
these twenty joints, eleven from the upper parts of the body
and both the hands are useful for the present work, denoted by
squares. These joints are the head, shoulder center, spine, hand
left, wrist left, elbow left, shoulder left, hand right, wrist right,
elbow right and shoulder right. The rest of the joints are
denoted by stars, which are not needed here, as gestures are
mainly depicted by the hand movements.
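For concreteness, the joint selection described above can be sketched in Python (the paper's own processing was done in Matlab). The joint indices below follow the Kinect SDK v1 skeleton enumeration; they are an assumption for illustration, not values stated in the paper, and should be checked against the SDK version in use.

```python
import numpy as np

# Eleven upper-body joints used in this work, keyed by their assumed
# Kinect SDK v1 NUI_SKELETON_POSITION_INDEX values (0..19 overall).
UPPER_BODY = {
    'Spine': 1, 'ShoulderCenter': 2, 'Head': 3,
    'ShoulderLeft': 4, 'ElbowLeft': 5, 'WristLeft': 6, 'HandLeft': 7,
    'ShoulderRight': 8, 'ElbowRight': 9, 'WristRight': 10, 'HandRight': 11,
}

def select_upper_body(skeleton):
    """Keep only the eleven upper-body joints from a (20, 3) skeleton frame."""
    skeleton = np.asarray(skeleton, dtype=float)
    return skeleton[sorted(UPPER_BODY.values())]
```

The remaining nine joints (hips, knees, ankles, feet) are simply discarded before feature extraction.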
III. METHODOLOGY
As in case of any pattern recognition problem, in order to
classify the emotions from gestures, these gestures need to be
effectively represented by suitable features. The feature spaces
thus obtained are classified using a number of different
classifiers. The entire course of work including data
acquisition from Kinect Sensor, feature extraction and
subsequent classification and hence emotion recognition is
illustrated in Fig. 1.
A. Data Acquisition
Experiments are conducted with 10 subjects in the age
group of 25±5 years with their consent. Data is acquired by
stimulating the environment in such a way so as to excite the
subjects to five particular emotions: anger, fear, happiness,
sadness, and relaxation. The subjects are instructed to make
gestures in accordance with their emotional state. The Kinect
acquires data at a sampling rate of 30 frames per second. Data
for a total duration of 60 seconds of each emotion is acquired
from each subject. The body joint co-ordinates thus obtained
as depicted by squares, represented in Cartesian form, are
processed in the next stage. The RGB and skeleton images for
the five body gestures are shown in Table 1.
Fig. 1. A flowchart illustrating the course of work: Kinect sensor, skeleton, classification, recognized emotion from gesture.
TABLE I
RGB IMAGES AND SKELETONS OF BODY GESTURES
Emotions
Anger
Fear
Happiness
Sadness
Relaxation
B. Feature Extraction
Gestures corresponding to the different emotions are
effectively coded from hand, head, shoulder and spine
positions and angles in terms of the following nine features.
The features requiring the hand joints are calculated for both
the hands.
1) Distance of hand with respect to spine (DH_left, DH_right)
The Euclidean distance between each hand and the spine is considered, giving two features. Gestures corresponding to 'Anger'
are characterized by rapid 'to and fro' hand movements.
Therefore, the distances of the hands and elbows with respect to the spine repeatedly increase and decrease. In case of 'sadness'
as well as 'fear' gestures, hands come closer to the body.
Hence the distances of the hands with respect to the spine
decrease. The Euclidean distance Dist between two points with co-ordinates (x1, y1, z1) and (x2, y2, z2) is calculated using (1).

Dist = sqrt((x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2)  (1)
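The distance feature can be sketched in Python as follows (a sketch, not the authors' Matlab code; the spine and hand coordinates below are made-up values for one frame).

```python
import numpy as np

def joint_distance(p1, p2):
    """Euclidean distance between two 3-D joint positions, as in (1)."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    return float(np.linalg.norm(p1 - p2))

# Hypothetical spine and left-hand coordinates (metres) for one frame.
spine = (0.0, 0.0, 2.0)
hand_left = (0.3, -0.4, 2.0)
dh_left = joint_distance(hand_left, spine)  # ~ 0.5
```

Tracking dh_left and dh_right over the frames gives the rising/falling pattern that separates 'Anger' from 'Sadness' and 'Fear'.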
2) Maximum Acceleration of hand and elbow with respect to spine (AccH_left, AccH_right, AccE_left, AccE_right)
The hand movements in 'anger' are very fast. They are
accompanied with vibrations of the palm or the hands.
Therefore, the accelerations of the hands for 'anger' are much
greater than that in any other gesture. Velocity of a joint is
calculated from the difference in the displacements of the
joints in two consecutive frames. Acceleration is computed
from the change in velocity over two consecutive frames.
Maximum acceleration is computed among all the frames of a
particular dataset for each second.
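A minimal sketch of this frame-differencing scheme, assuming positions sampled at the Kinect's 30 frames per second (here the maximum is taken over all supplied frames; the paper takes it per one-second window):

```python
import numpy as np

def max_acceleration(positions, fps=30):
    """Velocity from frame-to-frame displacement, acceleration from
    frame-to-frame change in velocity, then the maximum acceleration
    magnitude over the supplied frames."""
    pos = np.asarray(positions, dtype=float)  # shape (n_frames, 3)
    vel = np.diff(pos, axis=0) * fps          # m/s
    acc = np.diff(vel, axis=0) * fps          # m/s^2
    if len(acc) == 0:
        return 0.0
    return float(np.max(np.linalg.norm(acc, axis=1)))
```

Applying this to the hand and elbow trajectories (relative to the spine) yields the four acceleration features.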
3) Angle between head, shoulder center and spine (A_Head,SC,Spine)
The hand movements to express 'sadness' and 'fear'
emotions are quite similar as in both the cases hands and
elbows come closer to the body. A possible feature to
distinguish these two emotions is the calculation of the angle
between head, shoulder center and spine. Let the co-ordinates of head, shoulder center and spine be (x1, y1, z1), (x2, y2, z2) and (x3, y3, z3) respectively. The vectors vec1 (from shoulder center to head) and vec2 (from shoulder center to spine) are given by (2) and (3), and the angle between these two vectors is calculated by (4).

vec1 = (x1 - x2)i + (y1 - y2)j + (z1 - z2)k  (2)

vec2 = (x3 - x2)i + (y3 - y2)j + (z3 - z2)k  (3)

angle = atan2(|cross(vec1, vec2)|, dot(vec1, vec2)) x (180 / pi)  (4)
where atan2 denotes the arctangent.
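The angle computation of (2)-(4) can be sketched in Python; the same function also serves for the shoulder-elbow-wrist angle of the next feature (a sketch, with the magnitude of the cross product used inside atan2).

```python
import numpy as np

def joint_angle_deg(a, b, c):
    """Angle at joint b formed by points a-b-c, following (2)-(4):
    vec1 = a - b, vec2 = c - b, angle = atan2(|vec1 x vec2|, vec1 . vec2)."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    v1, v2 = a - b, c - b
    cross = np.linalg.norm(np.cross(v1, v2))
    dot = float(np.dot(v1, v2))
    return float(np.degrees(np.arctan2(cross, dot)))
```

For the head / shoulder-center / spine feature, a is the head, b the shoulder center and c the spine; a perfectly upright posture gives an angle near 180 degrees.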
4) Angle between shoulder, elbow and wrist (ASEW_left, ASEW_right)
This feature is essential to distinguish the gestures
corresponding to 'Relaxation' from those of any other
emotion.
C. Classification
Classification is carried out using the following classifiers.
1) Binary Decision Tree Classifier
A decision tree based classifier [10] uses a recursive tree-like structure with a number of nodes and edges. The starting
node is the root of the classifier. Each of the interior nodes
denotes a test on an attribute, each edge denotes an outcome
and each of the exterior or leaf node denotes a class. Each
interior node splits the instance space into two subspaces
depending on some criterion. The process continues till
classification is done at the leaves upon testing with all the
attributes. In the present work, the starting node or root for
the decision tree classifier is the set of gestures corresponding
to the five emotions. The complete scheme of decision tree
based classification is depicted in Fig. 2. From experimental
observations, the threshold values of maximum acceleration
are set to 20m/s2 and 70m/s2 for the elbows and the hands
respectively. It has been experimentally found that the
structure of human body shows considerable variations from
person to person. Therefore, a 20 degree margin for error is
provided for angle between shoulder, elbow and wrist.
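A hypothetical rule cascade in the spirit of Fig. 2 is sketched below. The thresholds (70 m/s^2 for the hands, 20 m/s^2 for the elbows, a 20-degree margin around a 160-degree shoulder-elbow-wrist test) are the ones quoted in the text, but the ordering of the tests, the dh_trend input, and which of 'Sadness'/'Fear' falls on each side of the head-angle test are assumptions for illustration, not the paper's exact tree.

```python
def classify_gesture(max_acc_hand, max_acc_elbow, angle_sew,
                     angle_head_sc_spine, dh_trend):
    """Hypothetical threshold cascade (not the exact tree of Fig. 2)."""
    # Rapid hand/elbow movements characterize 'Anger'.
    if max_acc_hand > 70 and max_acc_elbow > 20:
        return 'Anger'
    # Nearly straight arms (160 deg test with a 20 deg margin): 'Relaxation'.
    if angle_sew > 160 - 20:
        return 'Relaxation'
    # Hands moving away from the body: 'Happiness' (assumed rule).
    if dh_trend == 'increasing':
        return 'Happiness'
    # Hands close to the body: sadness vs. fear resolved by the head angle
    # (which emotion falls on which side is an assumption here).
    return 'Sadness' if angle_head_sc_spine < 160 else 'Fear'
```

In the actual system these tests are learned thresholds on the nine extracted features rather than a hand-written cascade.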
2) Ensemble Tree Classifier
An ensemble classifier [11], or a multiple classifier system, is made up of a number of 'base' classifiers or 'weak learners', each of which classifies the training data set
separately. The outcome of the ensemble classifier is a
combination of the decisions of these base classifiers and
hence its performance is usually better than that of the base
classifiers, provided the individual classifier errors are
uncorrelated. A 'tree' classifier has been used as the weak
learner in this work. Each node of the tree classifier decides on
each of the features in the dataset to predict the class of a
sample. Two popular methods to implement the ensemble
classifiers are bagging and boosting [12]. The present work
has utilized AdaBoost [13] or adaptive boosting algorithm to
implement a boosting ensemble classifier based on tree
learning. Initially the weights of all the samples in the dataset
are equal. In each iteration, a new weak classifier is created
using the whole dataset and the weights for all samples are
updated by increasing the weights of the samples misclassified
and decreasing the weights of the samples correctly classified.
Hence it is of 'adaptive' nature. A weighted voting mechanism
determines the class of a new sample. For our work, the
process is iterated 100 times.
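A minimal sketch of such a boosted tree ensemble, using scikit-learn's AdaBoostClassifier with its default decision-stump weak learner and 100 boosting rounds as in the text (the paper's experiments were in Matlab, and the data below is synthetic, not the Kinect feature set):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in: 200 samples with nine features, two toy classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 100 boosting rounds over decision-tree weak learners; sample weights
# are re-weighted each round toward the misclassified examples.
clf = AdaBoostClassifier(n_estimators=100)
clf.fit(X, y)
print(clf.score(X, y))
```

A weighted vote over the 100 weak learners then assigns the class of each new sample.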
Fig. 2. Scheme of Decision Tree based classification of emotions from gestures.
3) K-Nearest Neighbour Classifier
The k-nearest neighbours (kNN) algorithm [14], [15] finds
the k-nearest neighbours among the training set, and places an
object into the class that is most frequent among its k nearest
neighbours, k being a small positive integer. In the present
work, we have used k=3 with Euclidean Distance as the
distance measure to determine the nearest neighbours and
majority voting to determine the class of the test data from its
k nearest neighbours.
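This setup (k = 3, Euclidean distance, majority voting) can be sketched with scikit-learn; the one-dimensional training data below is a synthetic stand-in for the nine-dimensional Kinect features.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated toy clusters standing in for two emotion classes.
X_train = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# k = 3 with Euclidean distance; predict() takes the majority vote
# among the three nearest training samples.
knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn.fit(X_train, y_train)
print(knn.predict([[0.15], [1.05]]))  # → [0 1]
```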
4) Support Vector Machine
A linear SVM [15] separates two classes of data by constructing a hyperplane, specified by 'support vectors',
within the training data points, such that the distance margin
between the support vectors, and hence the two classes is
maximized. However, linear SVM can be successfully used
only where the data are linearly separable. This limitation can
be overcome by mapping the data into a larger dimensional
space using a kernel function. The RBF or Gaussian kernel
with the width of the Gaussian as 1 has been used in the present work. Binary SVM classification is carried out for
every class of emotion in a One-against-all approach and then
the mean recognition accuracy of each emotion with respect to
the other classes of emotion is computed.
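The one-against-all RBF scheme can be sketched with scikit-learn's OneVsRestClassifier around an RBF-kernel SVC; with a Gaussian width sigma = 1, the corresponding gamma is 1/(2*sigma^2) = 0.5. The three-class data below is synthetic, for illustration only.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Three well-separated synthetic clusters standing in for emotion classes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 30)

# One binary RBF-SVM per class against the rest; gamma = 1/(2*sigma^2).
clf = OneVsRestClassifier(SVC(kernel='rbf', gamma=0.5))
clf.fit(X, y)
print(clf.score(X, y))
```

Note that SVC on its own would train one-vs-one pairs internally; the explicit OneVsRestClassifier wrapper matches the one-against-all scheme described above.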
5) Neural Network with Back-Propagation Learning
The back-propagation algorithm [16] is one of the most
popular supervised learning algorithms for artificial neural networks. In the present work, the adaptation of the network weights is carried out using Levenberg-Marquardt
Optimization [17]. This technique incorporates a mixture of
the simple gradient descent learning and quadratic
approximations that helps to achieve convergence while
minimizing the error at a faster pace.
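As an illustrative sketch only: scikit-learn has no Levenberg-Marquardt trainer (Matlab's trainlm would match the paper exactly), so the example below substitutes L-BFGS, another curvature-aware quasi-Newton solver, and uses synthetic stand-in data.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in: nine features, two toy classes.
rng = np.random.default_rng(2)
X = rng.normal(size=(120, 9))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Small feed-forward network; 'lbfgs' stands in for Levenberg-Marquardt.
net = MLPClassifier(hidden_layer_sizes=(10,), solver='lbfgs',
                    max_iter=500, random_state=0)
net.fit(X, y)
print(net.score(X, y))
```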
IV. EXPERIMENTS AND RESULTS
The patterns generated for the nine features, as mentioned
in Section III. B, with respect to frame number are depicted in
Fig. 3. The Kinect sensor captures 30 frames per second; thus, for 60 seconds of data stream, we have a total of 1800 frames. It is very prominent from the patterns that the hand movements for the five gestures vary widely from each other.
The comparison of accuracy and time requirement between all the classifiers is described in Fig. 4 and Fig. 5. The average classification accuracies, rounded to two decimal places, obtained for binary decision tree, ensemble tree, k-NN, SVM with radial basis function kernel and neural network classifier with back-propagation learning are 76.63%, 90.83%, 86.77%, 87.74% and 89.26% respectively. The y-axis in Fig. 4 depicts the accuracy rates in percentage. The best result is obtained for the ensemble decision tree classifier, but it takes the longest time of 15.592 sec for a gesture sequence of 60 seconds duration. The time requirement is least for the binary decision tree, with 0.9 sec for recognition of each gesture sequence of 60 sec duration on an Intel Core 2 Duo Processor with Matlab R2012b.
The time required to recognize a gesture varies with the number of frames processed. In Fig. 5, the x-axis represents the duration of the data stream (in seconds) processed and the y-axis denotes the time requirement in seconds. Each reported time is the average of the times required to recognize the gestures.
Fig. 3. Patterns obtained for (i) Anger, (ii) Fear, (iii) Happiness, (iv) Sadness, (v) Relaxation; time in seconds is plotted along the x-axis and the mean feature value for 1 second along the y-axis.
Fig. 4. Comparison of accuracies for the five classifiers (binary decision tree, ensemble tree, k-nearest neighbour, support vector machine, and neural network with back-propagation learning).
Fig. 5. Comparison of time requirement for the five classifiers.
V. CONCLUSION AND FUTURE SCOPE
The work focuses on recognition of five basic human
emotions based on gestures using a cost effective Kinect
Sensor. Gesture distinctions for recognizing human emotional
states can be applied for control of devices automatically
according to a person's emotion thereby bringing about
efficient human computer interactions through the use of only
gestures. An overall percentage accuracy of 90.83% with a
computation time of 15.592 seconds is achieved for emotion
recognition using the ensemble decision tree on an Intel Core 2 Duo processor running Matlab R2012b.
The present work is concerned with identification of
gestures corresponding to the five basic emotional states viz.
anger, happiness, sadness, fear and relaxation only. Future
scope of work in this direction includes working with gestures
for more complex emotions like 'Surprise', 'Shyness', 'Pride',
'Embarrassment' etc. which is currently in progress.
ACKNOWLEDGMENT
The study is supported by the University Grants
Commission, India, University of Potential Excellence
Programme (Phase II) in Cognitive Science, Jadavpur
University.
REFERENCES
[1] S. Brave and C. Nass, "Emotion in human-computer interaction," The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies and Emerging Applications, pp. 81-96, 2002.
[2] A. Halder, A. Konar, R. Mandal, A. Chakraborty, P. Bhowmik, N. R. Pal, and A. K. Nagar, "General and Interval Type-2 Fuzzy Face-Space Approach to Emotion Recognition."
[3] G. Castellano, S. D. Villalba, and A. Camurri, "Recognising human emotions from body movement and gesture dynamics," in Affective Computing and Intelligent Interaction, Springer, 2007, pp. 71-82.
[4] D. O. Bos, "EEG-based emotion recognition," The Influence of Visual and Auditory Stimuli, pp. 1-17, 2006.
[5] A. Chella, H. Dindo, and I. Infantino, "A system for simultaneous people tracking and posture recognition in the context of human-computer interaction," in Computer as a Tool, 2005. EUROCON 2005. The International Conference on, 2005, vol. 2, pp. 991-994.
[6] L. B. Ozer and W. Wolf, "Real-time posture and activity recognition," in Motion and Video Computing, 2002. Proceedings. Workshop on, 2002, pp. 133-138.
[7] H. Zhuang, B. Zhao, Z. Ahmad, S. Chen, and K. S. Low, "3D depth camera based human posture detection and recognition using PCNN circuits and learning-based hierarchical classifier," in Neural Networks (IJCNN), The 2012 International Joint Conference on, 2012, pp. 1-5.
[8] T. Leyvand, C. Meekhof, Y.-C. Wei, J. Sun, and B. Guo, "Kinect identity: Technology and experience," Computer, vol. 44, no. 4, pp. 94-96, 2011.
[9] J. Solaro, "The Kinect Digital Out-of-Box Experience," Computer, pp. 97-99, 2011.
[10] L. Rokach, Data Mining with Decision Trees: Theory and Applications, vol. 69. World Scientific, 2007.
[11] R. Polikar, "Ensemble based systems in decision making," Circuits and Systems Magazine, IEEE, vol. 6, no. 3, pp. 21-45, 2006.
[12] T. G. Dietterich, "An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139-157, 2000.
[13] G. Rätsch, T. Onoda, and K.-R. Müller, "Soft margins for AdaBoost," Machine Learning, vol. 42, no. 3, pp. 287-320, 2001.
[14] P. Cunningham and S. J. Delany, "k-Nearest neighbour classifiers," Multiple Classifier Systems, pp. 1-17, 2007.
[15] T. M. Mitchell, "Machine learning and data mining," Communications of the ACM, vol. 42, no. 11, pp. 30-36, 1999.
[16] A. Konar, Computational Intelligence: Principles, Techniques and Applications. Springer, 2005.
[17] S. Roweis, "Levenberg-Marquardt optimization," Univ. of Toronto, unpublished, 1996.