International Conference on Communication and Signal Processing, April 3-5, 2014, India
A Study on Emotion Recognition from Body Gestures Using Kinect Sensor
Sriparna Saha, Shreyasi Datta, Amit Konar and Ramadoss Janarthanan
Abstract- This novel work is aimed at the study of emotion
recognition from gestures using Kinect sensor. The Kinect sensor
along with Software Development Kit (SDK) generates the
human skeleton represented by 3-dimensional coordinates
corresponding to twenty body joints. Using the co-ordinates of
eleven such joints from the upper body and the hands, a set of
nine features based on the distances, accelerations and angles
between the different joints have been extracted. These features
are able to uniquely identify gestures corresponding to five basic
human emotional states, namely, 'Anger', 'Fear', 'Happiness',
'Sadness' and 'Relaxation'. The goal of the proposed system is to
classify an emotion based on body gesture. A comparison of
classification using binary decision tree, ensemble decision tree,
k-nearest neighbour, support vector machine with radial basis
function kernel and neural network classifier based on back
propagation learning is made, in terms of average classification
accuracy and computation time. A high overall recognition rate
of 90.83% is obtained from the ensemble decision tree.
Index Terms- Emotion Recognition, Gesture Classification,
Kinect Sensor
I. INTRODUCTION
GESTURES constitute an important medium for human
beings to interact and communicate with the
surroundings. Gestures are expressive body movements,
involving the face and hands in majority of the cases. Gestures
are an important indication of a person's mental state and
his/her emotions and hence human emotions can be classified
based on their gestures.
Emotion recognition is a very important aspect of human
computer interaction [1]. Recognizing emotions from gestures
will enable humans to efficiently communicate with machines
using only sign language according to one's mental state. It is
interesting to study how human emotions can be recognized
from the corresponding gestures so that such gestures can be
utilized to control a machine according to the human
emotional state. For example, if a sad person enters a room
where loud music is being played, the music player will be
automatically switched off or turned mute.
Emotions are mainly classified from facial expressions, i.e.,
Sriparna Saha, Shreyasi Datta and Amit Konar are with the Department of Electronics and Telecommunication Engineering, Jadavpur University, Kolkata, India (e-mails: [email protected], [email protected], [email protected]).
R. Janarthanan is with the department of Computer Science and Engineering, T J S Engineering College, Chennai, India (e-mail: srmjana [email protected]).
978-1-4799-3358-7/14/$31.00 ©2014 IEEE
the movements of cheek, chin, and wrinkles [2], but very few of them correlate the facial expression with the body movements [3]. Another important principle of emotion recognition is based on EEG signals from brain activity [4].
With reference to human-computer interaction, gesture
recognition plays an important role. In [5], a condensation
algorithm implements a visual tracking system. In [6], real
time MPEG video inputs are provided and AC-DCT
coefficients are used to obtain Eigen space representation of
human silhouettes. Two standard cameras, each with a video processing board, are required for this work. In [7], a time-of-flight depth camera along with a pulse coupled neural network (PCNN) is used for recognition of
human body in distorted background. Here hierarchical
decision tree is used for the classification purpose.
For the present work, human emotions are recognized from
the corresponding gestures captured using a Kinect Sensor [8], [9]. A Kinect sensor has been utilized to detect the skeleton of
a person while making gestures according to his/her emotional
state, using 20 body joint co-ordinates. The work focuses on
the recognition of five basic types of emotions from body
gestures. They are 'Anger', 'Fear', 'Happiness', 'Sadness',
and 'Relaxation'. A total of nine unique features based on the
angles and displacements of the body joints have been
extracted for a sequence of gestures for depicting a particular
emotion. The information regarding movements of the feet
and the lower parts of the body has not been taken into
account in the present work as these five particular emotions
are decoded efficiently from the movements and gestures
made using the upper body alone.
Classification of emotions from the extracted features is
carried out using a binary decision tree based classifier and a
tree classifier based on ensemble learning. A comparison of
these methods is made with three standard pattern classifiers,
namely, k-nearest neighbour (k-NN), support vector machine
(SVM) with radial basis function kernel and neural network
classifier based on back-propagation learning, in terms of
average classification accuracy and computation time. The
best result is obtained for ensemble decision tree with 90.83%.
The required time is least for binary decision tree with 0.9 sec
and largest for ensemble tree with 15.592 sec for recognition
of each gesture sequence of 60 seconds duration. The codes
are executed in Intel Core2Duo Processor with Matlab
R2012b.
Section II provides a discussion on the Kinect Sensor. The
methodology followed has been described in Section III with
explanations of features extracted from the subjects' skeleton
as well as the scheme of classification of emotions from
gestures using the extracted features with different classifiers.
Section IV presents the experiments and results. Finally,
Section V provides the conclusions with directions for future
work.
II. KINECT SENSOR
The Kinect [8], [9] is a sensor device with a set of IR and RGB cameras. It appears as a long horizontal bar with a motorized base, as shown in Fig. 1. It detects
the 3D image of an object and tracks the skeleton of the
person standing in front of it within a finite distance. The Kinect sensor, with the help of the corresponding
Software Development Kit (SDK) senses the skeleton and the
body postures irrespective of the color of the skin or the
individual's dress. In Fig. 1, an RGB image of a person and the corresponding skeleton generated by the Kinect are shown. The Kinect Sensor produces the human skeleton
represented by twenty body joints in the 3-D space. Out of
these twenty joints, eleven from the upper parts of the body
and both the hands are useful for the present work, denoted by
squares. These joints are the head, shoulder center, spine, hand
left, wrist left, elbow left, shoulder left, hand right, wrist right,
elbow right and shoulder right. The rest of the joints are
denoted by stars, which are not needed here, as gestures are
mainly depicted by the hand movements.
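For concreteness, the joint selection described above can be sketched in Python (the paper's own processing was done in Matlab). The joint indices below follow the Kinect SDK v1 skeleton enumeration; they are an assumption for illustration, not values stated in the paper, and should be checked against the SDK version in use.

```python
import numpy as np

# Eleven upper-body joints used in this work, keyed by their assumed
# Kinect SDK v1 NUI_SKELETON_POSITION_INDEX values (0..19 overall).
UPPER_BODY = {
    'Spine': 1, 'ShoulderCenter': 2, 'Head': 3,
    'ShoulderLeft': 4, 'ElbowLeft': 5, 'WristLeft': 6, 'HandLeft': 7,
    'ShoulderRight': 8, 'ElbowRight': 9, 'WristRight': 10, 'HandRight': 11,
}

def select_upper_body(skeleton):
    """Keep only the eleven upper-body joints from a (20, 3) skeleton frame."""
    skeleton = np.asarray(skeleton, dtype=float)
    return skeleton[sorted(UPPER_BODY.values())]
```

The remaining nine joints (hips, knees, ankles, feet) are simply discarded before feature extraction.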
III. METHODOLOGY
As in case of any pattern recognition problem, in order to
classify the emotions from gestures, these gestures need to be
effectively represented by suitable features. The feature spaces
thus obtained are classified using a number of different
classifiers. The entire course of work including data
acquisition from Kinect Sensor, feature extraction and
subsequent classification and hence emotion recognition is
illustrated in Fig. 1.
A. Data Acquisition
Experiments are conducted with 10 subjects in the age
group of 25±5 years with their consent. Data is acquired by
stimulating the environment in such a way so as to excite the
subjects to five particular emotions: anger, fear, happiness,
sadness, and relaxation. The subjects are instructed to make
gestures in accordance with their emotional state. The Kinect
acquires data at a sampling rate of 30 frames per second. Data
for a total duration of 60 seconds of each emotion is acquired
from each subject. The body joint co-ordinates thus obtained
as depicted by squares, represented in Cartesian form, are
processed in the next stage. The RGB and skeleton images for
the five body gestures are shown in Table 1.
Fig. 1. A flowchart illustrating the course of work: Kinect sensor, skeleton, classification, recognized emotion from gesture.
TABLE I
RGB IMAGES AND SKELETONS OF BODY GESTURES
Emotions
Anger
Fear
Happiness
Sadness
Relaxation
B. Feature Extraction
Gestures corresponding to the different emotions are
effectively coded from hand, head, shoulder and spine
positions and angles in terms of the following nine features.
The features requiring the hand joints are calculated for both
the hands.
1) Distance of hand with respect to spine (DH_left, DH_right)
The Euclidean distance between each hand and the spine is considered, giving two features. Gestures corresponding to 'Anger'
are characterized by rapid 'to and fro' hand movements.
Therefore, the distances of the hands and elbows with respect to the spine repeatedly increase and decrease. In case of 'sadness'
as well as 'fear' gestures, hands come closer to the body.
Hence the distances of the hands with respect to the spine
decrease. The Euclidean distance Dist between two points with co-ordinates (x1, y1, z1) and (x2, y2, z2) is calculated using (1).

Dist = sqrt((x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2)  (1)
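The distance feature can be sketched in Python as follows (a sketch, not the authors' Matlab code; the spine and hand coordinates below are made-up values for one frame).

```python
import numpy as np

def joint_distance(p1, p2):
    """Euclidean distance between two 3-D joint positions, as in (1)."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    return float(np.linalg.norm(p1 - p2))

# Hypothetical spine and left-hand coordinates (metres) for one frame.
spine = (0.0, 0.0, 2.0)
hand_left = (0.3, -0.4, 2.0)
dh_left = joint_distance(hand_left, spine)  # ~ 0.5
```

Tracking dh_left and dh_right over the frames gives the rising/falling pattern that separates 'Anger' from 'Sadness' and 'Fear'.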
2) Maximum Acceleration of hand and elbow with respect to spine (AccH_left, AccH_right, AccE_left, AccE_right)
The hand movements in 'anger' are very fast. They are
accompanied with vibrations of the palm or the hands.
Therefore, the accelerations of the hands for 'anger' are much
greater than that in any other gesture. Velocity of a joint is
calculated from the difference in the displacements of the
joints in two consecutive frames. Acceleration is computed
from the change in velocity over two consecutive frames.
Maximum acceleration is computed among all the frames of a
particular dataset for each second.
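A minimal sketch of this frame-differencing scheme, assuming positions sampled at the Kinect's 30 frames per second (here the maximum is taken over all supplied frames; the paper takes it per one-second window):

```python
import numpy as np

def max_acceleration(positions, fps=30):
    """Velocity from frame-to-frame displacement, acceleration from
    frame-to-frame change in velocity, then the maximum acceleration
    magnitude over the supplied frames."""
    pos = np.asarray(positions, dtype=float)  # shape (n_frames, 3)
    vel = np.diff(pos, axis=0) * fps          # m/s
    acc = np.diff(vel, axis=0) * fps          # m/s^2
    if len(acc) == 0:
        return 0.0
    return float(np.max(np.linalg.norm(acc, axis=1)))
```

Applying this to the hand and elbow trajectories (relative to the spine) yields the four acceleration features.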
3) Angle between head, shoulder center and spine (A_Head,SC,Spine)
The hand movements to express 'sadness' and 'fear'
emotions are quite similar as in both the cases hands and
elbows come closer to the body. A possible feature to
distinguish these two emotions is the calculation of the angle
between head, shoulder center and spine. Let the co-ordinates of head, shoulder center and spine be (x1, y1, z1), (x2, y2, z2) and (x3, y3, z3) respectively. The vectors vec1 (from shoulder center to head) and vec2 (from shoulder center to spine) are given by (2) and (3), and the angle between these two vectors is calculated by (4).

vec1 = (x1 - x2)i + (y1 - y2)j + (z1 - z2)k  (2)

vec2 = (x3 - x2)i + (y3 - y2)j + (z3 - z2)k  (3)

angle = atan2(|cross(vec1, vec2)|, dot(vec1, vec2)) x (180 / pi)  (4)
where atan2 denotes the arctangent.
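The angle computation of (2)-(4) can be sketched in Python; the same function also serves for the shoulder-elbow-wrist angle of the next feature (a sketch, with the magnitude of the cross product used inside atan2).

```python
import numpy as np

def joint_angle_deg(a, b, c):
    """Angle at joint b formed by points a-b-c, following (2)-(4):
    vec1 = a - b, vec2 = c - b, angle = atan2(|vec1 x vec2|, vec1 . vec2)."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    v1, v2 = a - b, c - b
    cross = np.linalg.norm(np.cross(v1, v2))
    dot = float(np.dot(v1, v2))
    return float(np.degrees(np.arctan2(cross, dot)))
```

For the head / shoulder-center / spine feature, a is the head, b the shoulder center and c the spine; a perfectly upright posture gives an angle near 180 degrees.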
4) Angle between shoulder, elbow and wrist (ASEW_left, ASEW_right)
This feature is essential to distinguish the gestures
corresponding to 'Relaxation' from those of any other
emotion.
C. Classification
Classification is carried out using the following classifiers.
1) Binary Decision Tree Classifier
A decision tree based classifier [10] uses a recursive tree-like structure with a number of nodes and edges. The starting
node is the root of the classifier. Each of the interior nodes
denotes a test on an attribute, each edge denotes an outcome
and each of the exterior or leaf node denotes a class. Each
interior node splits the instance space into two subspaces
depending on some criterion. The process continues till
classification is done at the leaves upon testing with all the
attributes. In the present work, the starting node or root for
the decision tree classifier is the set of gestures corresponding
to the five emotions. The complete scheme of decision tree
based classification is depicted in Fig. 2. From experimental
observations, the threshold values of maximum acceleration
are set to 20m/s2 and 70m/s2 for the elbows and the hands
respectively. It has been experimentally found that the
structure of human body shows considerable variations from
person to person. Therefore, a 20 degree margin for error is
provided for angle between shoulder, elbow and wrist.
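A hypothetical rule cascade in the spirit of Fig. 2 is sketched below. The thresholds (70 m/s^2 for the hands, 20 m/s^2 for the elbows, a 20-degree margin around a 160-degree shoulder-elbow-wrist test) are the ones quoted in the text, but the ordering of the tests, the dh_trend input, and which of 'Sadness'/'Fear' falls on each side of the head-angle test are assumptions for illustration, not the paper's exact tree.

```python
def classify_gesture(max_acc_hand, max_acc_elbow, angle_sew,
                     angle_head_sc_spine, dh_trend):
    """Hypothetical threshold cascade (not the exact tree of Fig. 2)."""
    # Rapid hand/elbow movements characterize 'Anger'.
    if max_acc_hand > 70 and max_acc_elbow > 20:
        return 'Anger'
    # Nearly straight arms (160 deg test with a 20 deg margin): 'Relaxation'.
    if angle_sew > 160 - 20:
        return 'Relaxation'
    # Hands moving away from the body: 'Happiness' (assumed rule).
    if dh_trend == 'increasing':
        return 'Happiness'
    # Hands close to the body: sadness vs. fear resolved by the head angle
    # (which emotion falls on which side is an assumption here).
    return 'Sadness' if angle_head_sc_spine < 160 else 'Fear'
```

In the actual system these tests are learned thresholds on the nine extracted features rather than a hand-written cascade.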
2) Ensemble Tree Classifier
An ensemble classifier [11], or a multiple classifier system, is made up of a number of 'base' classifiers or 'weak learners', each of which classifies the training data set
separately. The outcome of the ensemble classifier is a
combination of the decisions of these base classifiers and
hence its performance is usually better than that of the base
classifiers, provided the individual classifier errors are
uncorrelated. A 'tree' classifier has been used as the weak
learner in this work. Each node of the tree classifier decides on
each of the features in the dataset to predict the class of a
sample. Two popular methods to implement the ensemble
classifiers are bagging and boosting [12]. The present work
has utilized AdaBoost [13] or adaptive boosting algorithm to
implement a boosting ensemble classifier based on tree
learning. Initially the weights of all the samples in the dataset
are equal. In each iteration, a new weak classifier is created
using the whole dataset and the weights for all samples are
updated by increasing the weights of the samples misclassified
and decreasing the weights of the samples correctly classified.
Hence it is of 'adaptive' nature. A weighted voting mechanism
determines the class of a new sample. For our work, the
process is iterated 100 times.
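A minimal sketch of such a boosted tree ensemble, using scikit-learn's AdaBoostClassifier with its default decision-stump weak learner and 100 boosting rounds as in the text (the paper's experiments were in Matlab, and the data below is synthetic, not the Kinect feature set):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in: 200 samples with nine features, two toy classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 100 boosting rounds over decision-tree weak learners; sample weights
# are re-weighted each round toward the misclassified examples.
clf = AdaBoostClassifier(n_estimators=100)
clf.fit(X, y)
print(clf.score(X, y))
```

A weighted vote over the 100 weak learners then assigns the class of each new sample.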
Fig. 2. Scheme of Decision Tree based classification of emotions from gestures.
3) K-Nearest Neighbour Classifier
The k-nearest neighbours (kNN) algorithm [14], [15] finds
the k-nearest neighbours among the training set, and places an
object into the class that is most frequent among its k nearest
neighbours, k being a small positive integer. In the present
work, we have used k=3 with Euclidean Distance as the
distance measure to determine the nearest neighbours and
majority voting to determine the class of the test data from its
k nearest neighbours.
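This setup (k = 3, Euclidean distance, majority voting) can be sketched with scikit-learn; the one-dimensional training data below is a synthetic stand-in for the nine-dimensional Kinect features.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated toy clusters standing in for two emotion classes.
X_train = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# k = 3 with Euclidean distance; predict() takes the majority vote
# among the three nearest training samples.
knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn.fit(X_train, y_train)
print(knn.predict([[0.15], [1.05]]))  # → [0 1]
```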
4) Support Vector Machine
A linear SVM [15] separates two classes of data by constructing a hyperplane, specified by 'support vectors',
within the training data points, such that the distance margin
between the support vectors, and hence the two classes is
maximized. However, linear SVM can be successfully used
only where the data are linearly separable. This limitation can
be overcome by mapping the data into a larger dimensional
space using a kernel function. The RBF or Gaussian kernel
with the width of the Gaussian as 1 has been used in the present work. Binary SVM classification is carried out for
every class of emotion in a One-against-all approach and then
the mean recognition accuracy of each emotion with respect to
the other classes of emotion is computed.
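The one-against-all RBF scheme can be sketched with scikit-learn's OneVsRestClassifier around an RBF-kernel SVC; with a Gaussian width sigma = 1, the corresponding gamma is 1/(2*sigma^2) = 0.5. The three-class data below is synthetic, for illustration only.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Three well-separated synthetic clusters standing in for emotion classes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 30)

# One binary RBF-SVM per class against the rest; gamma = 1/(2*sigma^2).
clf = OneVsRestClassifier(SVC(kernel='rbf', gamma=0.5))
clf.fit(X, y)
print(clf.score(X, y))
```

Note that SVC on its own would train one-vs-one pairs internally; the explicit OneVsRestClassifier wrapper matches the one-against-all scheme described above.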
5) Neural Network with Back-Propagation Learning
The back-propagation algorithm [16] is one of the most
popular supervised learning algorithms for artificial neural networks. In the present work, the adaptation of the network weights is carried out using Levenberg-Marquardt
Optimization [17]. This technique incorporates a mixture of
the simple gradient descent learning and quadratic
approximations that helps to achieve convergence while
minimizing the error at a faster pace.
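As an illustrative sketch only: scikit-learn has no Levenberg-Marquardt trainer (Matlab's trainlm would match the paper exactly), so the example below substitutes L-BFGS, another curvature-aware quasi-Newton solver, and uses synthetic stand-in data.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in: nine features, two toy classes.
rng = np.random.default_rng(2)
X = rng.normal(size=(120, 9))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Small feed-forward network; 'lbfgs' stands in for Levenberg-Marquardt.
net = MLPClassifier(hidden_layer_sizes=(10,), solver='lbfgs',
                    max_iter=500, random_state=0)
net.fit(X, y)
print(net.score(X, y))
```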
IV. EXPERIMENTS AND RESULTS
The patterns generated for the nine features, as mentioned
in Section III. B, with respect to frame number are depicted in
Fig. 3. The Kinect sensor captures 30 frames per second; thus, for 60 seconds of data stream, we have a total of 1800 frames. It is very prominent from the patterns that the hand movements for the five gestures vary widely from each other.
The comparison of accuracy and time requirement between all the classifiers is described in Fig. 4 and Fig. 5. The average classification accuracies, rounded to two decimal places, obtained for binary decision tree, ensemble tree, k-NN, SVM with radial basis function kernel and neural network classifier with back-propagation learning are 76.63%, 90.83%, 86.77%, 87.74% and 89.26% respectively. The y-axis in Fig. 4 depicts the accuracy rates in percentage. The best result is obtained for the ensemble decision tree classifier, but it takes the longest time of 15.592 sec for a gesture sequence of 60 seconds duration. The time requirement is least for the binary decision tree, with 0.9 sec for recognition of each gesture sequence of 60 sec duration on an Intel Core 2 Duo Processor with Matlab R2012b.
The time required to recognize a gesture varies with the number of frames processed. In Fig. 5, the x-axis represents the duration of the data stream (in seconds) processed and the y-axis denotes the time requirement in seconds. Each reported time is the average of the times required to recognize the gestures.
Fig. 3. Patterns obtained for (i) Anger, (ii) Fear, (iii) Happiness, (iv) Sadness, (v) Relaxation; time in seconds is plotted along the x-axis and the mean feature value for 1 second along the y-axis.
Fig. 4. Comparison of accuracies for the five classifiers (binary decision tree, ensemble tree, k-nearest neighbour, support vector machine, and neural network with back-propagation learning).
Fig. 5. Comparison of time requirement for the five classifiers.
V. CONCLUSION AND FUTURE SCOPE
The work focuses on recognition of five basic human
emotions based on gestures using a cost effective Kinect
Sensor. Gesture distinctions for recognizing human emotional
states can be applied for control of devices automatically
according to a person's emotion thereby bringing about
efficient human computer interactions through the use of only
gestures. An overall percentage accuracy of 90.83% with a
computation time of 15.592 seconds is achieved for emotion
recognition using the ensemble decision tree on an Intel Core 2 Duo processor running Matlab R2012b.
The present work is concerned with identification of
gestures corresponding to the five basic emotional states viz.
anger, happiness, sadness, fear and relaxation only. Future
scope of work in this direction includes working with gestures
for more complex emotions like 'Surprise', 'Shyness', 'Pride',
'Embarrassment' etc. which is currently in progress.
ACKNOWLEDGMENT
The study is supported by the University Grants
Commission, India, University of Potential Excellence
Programme (Phase II) in Cognitive Science, Jadavpur
University.
REFERENCES
[1] S. Brave and C. Nass, "Emotion in human-computer interaction," The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies and Emerging Applications, pp. 81-96, 2002.
[2] A. Halder, A. Konar, R. Mandal, A. Chakraborty, P. Bhowmik, N. R. Pal, and A. K. Nagar, "General and Interval Type-2 Fuzzy Face-Space Approach to Emotion Recognition."
[3] G. Castellano, S. D. Villalba, and A. Camurri, "Recognising human emotions from body movement and gesture dynamics," in Affective Computing and Intelligent Interaction, Springer, 2007, pp. 71-82.
[4] D. O. Bos, "EEG-based emotion recognition," The Influence of Visual and Auditory Stimuli, pp. 1-17, 2006.
[5] A. Chella, H. Dindo, and I. Infantino, "A system for simultaneous people tracking and posture recognition in the context of human-computer interaction," in Computer as a Tool, 2005. EUROCON 2005. The International Conference on, 2005, vol. 2, pp. 991-994.
[6] L. B. Ozer and W. Wolf, "Real-time posture and activity recognition," in Motion and Video Computing, 2002. Proceedings. Workshop on, 2002, pp. 133-138.
[7] H. Zhuang, B. Zhao, Z. Ahmad, S. Chen, and K. S. Low, "3D depth camera based human posture detection and recognition using PCNN circuits and learning-based hierarchical classifier," in Neural Networks (IJCNN), The 2012 International Joint Conference on, 2012, pp. 1-5.
[8] T. Leyvand, C. Meekhof, Y.-C. Wei, J. Sun, and B. Guo, "Kinect identity: Technology and experience," Computer, vol. 44, no. 4, pp. 94-96, 2011.
[9] J. Solaro, "The Kinect Digital Out-of-Box Experience," Computer, pp. 97-99, 2011.
[10] L. Rokach, Data Mining with Decision Trees: Theory and Applications, vol. 69. World Scientific, 2007.
[11] R. Polikar, "Ensemble based systems in decision making," Circuits and Systems Magazine, IEEE, vol. 6, no. 3, pp. 21-45, 2006.
[12] T. G. Dietterich, "An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization," Machine Learning, vol. 40, no. 2, pp. 139-157, 2000.
[13] G. Rätsch, T. Onoda, and K.-R. Müller, "Soft margins for AdaBoost," Machine Learning, vol. 42, no. 3, pp. 287-320, 2001.
[14] P. Cunningham and S. J. Delany, "k-Nearest neighbour classifiers," Multiple Classifier Systems, pp. 1-17, 2007.
[15] T. M. Mitchell, "Machine learning and data mining," Communications of the ACM, vol. 42, no. 11, pp. 30-36, 1999.
[16] A. Konar, Computational Intelligence: Principles, Techniques and Applications. Springer, 2005.
[17] S. Roweis, "Levenberg-Marquardt optimization," Univ. of Toronto, unpublished, 1996.