
Building Emotional Machines: Recognizing Image Emotions through Deep Neural Networks

Hye-Rin Kim, Yeong-Seok Kim, Seon Joo Kim, In-Kwon Lee

Abstract—An image is a very effective tool for conveying emotions. Many researchers have investigated computing image emotions using various features extracted from images. In this paper, we focus on two high-level features, the object and the background, and assume that the semantic information of an image is a good cue for predicting its emotion. An object is one of the most important elements that define an image, and we find through experiments that there is a high correlation between the object and the emotion of an image. Even with the same object, the emotion may differ slightly depending on the background, and we use the semantic information of the background to improve the prediction performance. By combining these different levels of features, we build an emotion-based feed-forward deep neural network that produces the emotion values of a given image. The output emotion values in our framework are continuous values in a two-dimensional space (Valence and Arousal), which are more effective than a small number of emotion categories for describing emotions. Experiments confirm the effectiveness of our network in predicting the emotion of images.

I. INTRODUCTION

Images are very powerful tools for conveying moods and emotions, as shown in Figure 1. Through images, people can express their feelings and communicate with other people. With the recent developments in deep learning technology, computers have become better at recognizing objects, faces, and actions. Computers have also started to write image captions and answer questions about images. But how about emotions? Can we teach computers to have feelings similar to those of humans when looking at images? Predicting the emotion evoked by an image is a difficult task and is still in its early stage.

As deep learning shows remarkable performance in various computer vision tasks such as image classification [1]–[4], segmentation [5], [6], and image processing [7], [8], several studies have recently applied deep learning to emotion prediction [9]–[13]. Those works mostly use the convolutional neural network (CNN) [14], which has shown better prediction results for emotion classification than models that use a shallow network such as a linear model.

The CNN has had a big impact especially on image classification, and it is an effective network model for learning filters that capture shapes that repeatedly appear in images. However, we argue that the learning process for image emotion prediction should be different from that of image classification. This is because some images with different appearances can evoke the same emotions, and some images with similar appearances can evoke different emotions.

Fig. 1. Images with different emotions.

For example, an image of a person riding a bike and an image of a person surfing in the ocean may evoke the same feelings, even though they look different. From this point of view, the performance may be limited if a CNN is directly applied to an emotion prediction system.

In addition, one of the main issues in emotion recognition is the affective gap. The affective gap is the lack of coincidence between the measurable signal properties, commonly referred to as features, and the expected affective state in which the user is brought by perceiving the signal [15]. To narrow this affective gap, several works proposed emotion classification systems based on high-level features derived from psychology and art theory, such as harmony, movement, and the rule of thirds [16]–[18]. While those features help to improve emotion recognition, a better set of features is still necessary, and distinguishing the effective features among the many candidates is also important. For example, images with objects such as guns or sharks arouse scary feelings, while images with babies or flowers lead to happier ones. It can be speculated that certain objects affect the determination of emotions. Based on this observation, we assume that the main object appearing in an image plays an important role in determining its emotion; that is, object categories can be good cues for emotions.


Fig. 2. Two-dimensional emotion model (Valence and Arousal). The valence axis runs from 1 (negative) to 9 (positive) and the arousal axis from 1 (calm) to 9 (exciting).

Through experiments, we show that objects appearing in images are related to emotion values, and we use the object as one of the features of our model. In addition, even if images contain the same object, the emotion can vary depending on the background. We therefore also use the semantic information of the background as a feature to improve the prediction performance.

Predicting emotions from images is a complex task that is quite different from object detection or image classification. In this paper, we combine high-level features, namely the object and background information extracted from deep network models pre-trained for image classification and segmentation, with low-level features such as color statistics to obtain a set of features for emotion prediction. Using this feature set, we design a feed-forward deep neural network (FFNN) [19] that produces the emotion value of a given image.

In previous works on emotion prediction, emotions are categorized into a small number of classes such as happy, awe, sad, and fear. In contrast, we opt for the dimensional model for expressing emotions [20], which is widely used in the field of psychology. Specifically, the dimensional model consists of two parameters, Valence and Arousal (Figure 2). Valence represents pleasure on a scale from 1 (negative) to 9 (positive). Arousal is the level of excitement, which also ranges from 1 (calm) to 9 (excited). Using this model, any emotion can be represented by these two values in a 2D space. As the dimensional model can express more emotions than a discrete set of emotion categories, we train our emotion prediction system based on the V-A model.

The contributions of our work are summarized as follows:

• We propose an image emotion recognition system that outputs valence-arousal values for expressing emotion. As far as we know, this is the first system to build a deep learning model for the dimensional emotion model (V-A model).

• We build a new image emotion dataset through crowdsourcing. A feed-forward neural network is used to learn emotion features from this database.

• We propose a novel idea of using pre-trained CNNs to relate the main object and the background of an image to its emotional features.

• We show the effectiveness of the object for estimating the emotion of an image using correlations between the emotional values of words and images.

II. RELATED WORK

The emotion of an image can be evoked by various factors. To figure out significant features for the emotion prediction problem, many researchers have considered various types of features, from color statistics to art and psychological features. Machajdik et al. [17] introduced an affective image classification system using psychology and art theory based features such as Itten's color contrast and the rule of thirds. Zhao et al. [18] proposed extracting principles-of-art features for emotion classification. Similarly, Lu et al. [16] computed shape-based features in natural images. As a high-level concept for representing the sentiment of an image, adjective-noun pairs were introduced by Borth et al. [21].

Despite the rise of deep learning studies, relatively few studies have attempted to address the emotion prediction of images using deep networks. With the dataset from Borth et al. [21], Chen et al. [22] classified the adjective-noun pairs using a CNN and achieved more accurate classification than Borth et al. [21].

Several studies utilized a model pre-trained for image classification and transferred the learned parameters. By changing the number of outputs to match the number of labels in their dataset, a classifier can be trained; for example, for binary classification with a positive and a negative label, the number of outputs would be two. Some researchers [9]–[13] used AlexNet's structure [1] with pre-trained weights to train their sentiment prediction frameworks with different numbers of outputs. Binary classification, which gives either a positive or a negative label, was considered in [11], [13], [23], [24]. Peng et al. [10] and You et al. [12] trained classifiers for seven and eight classes, respectively. In most studies, an emotion classification model was created through transfer learning from a model pre-trained for image classification; in general, such fine-tuned models achieve better emotion classification performance than earlier studies with shallow networks. In this paper, we instead use a feed-forward neural network with low-level and high-level features rather than transfer learning, and we compare our method with transfer learning using the same training and test datasets.

Compared to previous CNN-based emotion recognition systems, which are limited to outputting only a few emotional states, we propose to use the valence-arousal model to represent emotions.


By using these two parameters, which lie on a continuous space, we can represent emotions much better than the previous works. To enable learning of the V-A values with a deep network, we build a large set of images with various emotions and obtain the V-A labels through a crowdsourcing survey.

Fig. 3. Image representation of the Self-Assessment Manikin (SAM) [25] used for V-A value assignment in the user study. The upper row illustrates valence (negative to positive) and the lower row arousal (calm to exciting).

III. IMAGE EMOTION DATASET

In order to build a dataset with emotion values (valence and arousal), we collected a large set of images from two resources. We first searched Flickr [27] for emotional images using the 22 keywords from [26]. Those words include what we call basic emotions (happy, angry, afraid, sad), as well as less prototypical emotions (gloomy, bored) and affective states (sleepy, serene).

Using the keywords, we first collected over twenty thousand images. Since the goal is to collect emotional images, we manually eliminated non-emotional images from the collection. Each candidate image was assessed by three human subjects, and only the images that were determined to be emotional by more than two subjects were included in the database. The search words are listed along with the number of images in the dataset: afraid (113), alarmed (14), angry (217), annoyed (179), distressed (150), frustrated (77), tense (8), aroused (4), delighted (28), excited (734), glad (315), happy (102), astonished (13), at ease (16), content (656), satisfied (33), serene (1917), pleased (42), depressed (179), bored (312), tired (942), gloomy (793). As a result, the total number of images is 6,844.

Second, we collected more images from [12], which includes eight emotion categories (Amusement, Awe, Contentment, Excitement, Anger, Disgust, Fear, Sad). We took 3,236 of the 23,308 images, selecting images evenly across the eight classes. As a result, a total of 10,766 images were acquired for our dataset.

Next, we used Amazon Mechanical Turk (AMT) to assign V-A emotion values to all images in our database. Given an image, a worker rates its emotion values using the Self-Assessment Manikin (SAM) [25] representation of the V-A scale (Figure 3). In each question, two images are shown to the worker: the image whose emotion value is to be measured, and the image from the previous question together with the emotion value the worker recorded for it.

Fig. 4. The histograms of the valence and arousal values of the collected images.

By showing the previous image, the worker can rate the given image relative to the value chosen in the previous question, which also alleviates the difficulty of having to choose absolute numbers. Each image is presented to workers in random order and is evaluated only once by the same worker. To ensure the quality of the answers, we only allow workers with an approval rate of 95% or higher to participate in our Human Intelligence Task (HIT). We also limit the number of questions per worker to 200 to keep them focused throughout the test.

A total of 1,339 workers were recruited to assign the emotion values of the 10,766 images in the dataset. The evaluation time per image varied from 10 seconds to a minute depending on the worker. Each image was evaluated by at least five workers, and the average of the acquired values is assigned as the emotion value of the image. Figure 4 shows the overall distribution of the V-A values of the images. In the case of valence, there are relatively more positive images than negative images; people usually share positive images rather than negative ones, and this tendency is well represented in the distribution. In the case of arousal, a wide range of emotion values was obtained except for the extreme values. Figure 5 shows some of the images in our database.
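As a minimal sketch of this aggregation step (assuming the raw AMT responses are exported to a hypothetical ratings.csv with one row per worker response and columns image_id, valence, arousal), the per-image labels could be computed as follows:

import pandas as pd

# Hypothetical export of the AMT responses: one row per (worker, image)
# rating, with columns image_id, valence, arousal on the 1-9 SAM scale.
ratings = pd.read_csv("ratings.csv")

# Keep images rated by at least five workers, then average their ratings.
counts = ratings.groupby("image_id").size()
valid_ids = counts[counts >= 5].index
labels = (ratings[ratings["image_id"].isin(valid_ids)]
          .groupby("image_id")[["valence", "arousal"]]
          .mean())
labels.to_csv("image_va_labels.csv")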

We compare the emotion distribution of our dataset with that of the IAPS dataset, an emotion dataset introduced in [28]. The IAPS consists of 1,182 images, each of which has valence, arousal, and dominance values. Although the numbers of images differ, our data is spread more widely over the V-A emotion space than the IAPS distribution (Figure 6).

We provide additional analysis to validate our new dataset. We split the Valence-Arousal space into a 4 by 4 grid: most negative (1 to 3), negative (3 to 5), positive (5 to 7), and most positive (7 to 9) along valence, and most calm (1 to 3), calm (3 to 5), exciting (5 to 7), and most exciting (7 to 9) along arousal. We then place the images in the emotion space according to the obtained emotion values and extract the most used words (using the image tags) in each section. The results are shown in Figure 7. In the negative sections of the V-A space, most words have low valence values, such as afraid, angry, annoyed, depressed, and gloomy. On the other hand, words such as content, serene, excited, and glad appear in the positive sections. In the case of arousal, the lowest arousal side (from 1 to 3) consists of words such as serene, tired, and gloomy.


Fig. 5. Example images in our database. From left to right, the valence value of the images increases; the images on the left/right have negative/positive emotion. From bottom to top, the arousal value increases; the images on the bottom/top have calm/exciting emotion.

Fig. 6. Emotion distribution of our database and the IAPS [28] in the V-A emotion space.

The highest arousal side (from 7 to 9) includes words such as excited, angry, and distressed. This analysis indicates that the choice of words for image collection and the user study is appropriate for obtaining emotion values that are as diverse and well distributed over the space as possible. The new image emotion dataset can be used for emotion recognition in two ways: a classifier can be trained to output either continuous V-A values or discrete emotion categories using the words in Figure 7.
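A rough sketch of this grid analysis is shown below; the file name, the column names, and the choice of three top tags per section are illustrative assumptions rather than part of the original pipeline:

from collections import Counter
import pandas as pd

# Hypothetical table: one row per image with its averaged V-A values and
# the Flickr keyword (tag) used to collect it.
df = pd.read_csv("image_va_labels_with_tags.csv")  # columns: valence, arousal, tag

edges = [1, 3, 5, 7, 9]  # most negative/calm, negative/calm, positive/exciting, most positive/exciting
df["v_bin"] = pd.cut(df["valence"], edges, include_lowest=True)
df["a_bin"] = pd.cut(df["arousal"], edges, include_lowest=True)

# Most frequent tags in each of the 4x4 sections of the V-A space.
for (v_bin, a_bin), group in df.groupby(["v_bin", "a_bin"], observed=True):
    print(v_bin, a_bin, Counter(group["tag"]).most_common(3))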

IV. FEATURES

We extract various types of features for emotion prediction, including color, local, object, and semantic features.

A. Color features

Fig. 7. The most used words in each section.

Color is the most basic and powerful element for expressing emotions and can be used effectively by artists to induce emotional effects. Many studies have been conducted on changing the color of an image as a means of changing its emotion [29]–[32]. Color cannot directly resolve the affective gap because it can be viewed as a low-dimensional feature, but it is still a crucial factor in emotion recognition. We extract the mean values of the RGB and HSV color spaces as basic color characteristics. We also calculate the HSV histogram and extract the label and value of the bin with the largest count. As a related color descriptor, we compute how much of each of the 11 basic colors is present in the image [33]. According to [20], saturation and brightness can have a direct effect on pleasure, arousal, and dominance. Using the saturation and brightness values, Valdez and Mehrabian [34] introduced formulas, derived through experiments, for measuring pleasure, arousal, and dominance. The formulas for computing these values are as follows:

Pleasure = 0.69Y + 0.22S,      (1)
Arousal = 0.31Y + 0.60S,       (2)
Dominance = 0.76Y + 0.32S,     (3)

where Y and S denote brightness and saturation, respectively. We measure these three values for each image and use them as features.
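The sketch below illustrates one way these color features could be computed with OpenCV; the histogram bin count, the [0, 1] rescaling of brightness and saturation, and the omission of the 11-basic-color descriptor of [33] are simplifications of our own rather than the exact implementation:

import cv2
import numpy as np

def color_features(path):
    bgr = cv2.imread(path)
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)

    # Mean RGB and HSV values.
    mean_rgb = rgb.reshape(-1, 3).mean(axis=0)
    mean_hsv = hsv.reshape(-1, 3).mean(axis=0)

    # Hue histogram: index and count of the largest bin (18 bins assumed here).
    hist = cv2.calcHist([hsv], [0], None, [18], [0, 180]).flatten()
    peak_bin, peak_value = float(hist.argmax()), float(hist.max())

    # Pleasure/arousal/dominance from mean brightness (Y) and saturation (S),
    # rescaled to [0, 1], following Eqs. (1)-(3).
    s = mean_hsv[1] / 255.0
    y = mean_hsv[2] / 255.0
    pleasure = 0.69 * y + 0.22 * s
    arousal = 0.31 * y + 0.60 * s
    dominance = 0.76 * y + 0.32 * s

    return np.concatenate([mean_rgb, mean_hsv,
                           [peak_bin, peak_value, pleasure, arousal, dominance]])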

B. Local features

We exploit two kinds of local features used in [21]: a 512-dimensional GIST descriptor, which is effective for describing scenes, and a 59-dimensional local binary pattern (LBP) descriptor, which is effective for describing textures.

Fig. 8. Correlation between image emotion and object emotion. Image emotion values are taken from the IAPS dataset [28] and object emotion values are taken from the word emotion dictionary [35].

C. Object features

Identifying the emotion of an image from low-level features, such as color statistics and texture-related features, is difficult even for a human subject. Many researchers have stated the need for features at a higher, affective level and have designed various types of features for emotion prediction. In our study, we assume that the object is one of the most important factors contributing to the emotion of an image. We conducted an experiment to show that the objects in images are relevant to the emotions elicited by the images. Each image in the IAPS dataset, an emotion dataset introduced in [28], includes a tag representing the primary object (e.g., baby, snake, and shark) together with V-A emotion values. To convert the object tag into an emotion-level representation, we adopt the word emotion dictionary [35], which provides valence and arousal values for each word, and map each tag to V-A values by looking it up in the dictionary. By measuring the Pearson correlation coefficient, we can then quantify the relationship between the object and the image emotion. Figure 8 shows the results. The correlation between the object emotion and the image emotion is significantly high, which means the object affects the emotion of the image. In particular, the correlation for valence is higher than that for arousal, which means that valence is more strongly affected by the object than arousal.
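A minimal sketch of this correlation experiment is given below, assuming hypothetical CSV exports of the IAPS tags and values and of the word emotion dictionary [35]; the file and column names are illustrative:

import pandas as pd
from scipy.stats import pearsonr

# Hypothetical exports: IAPS images with their object tag and V-A values,
# and the word emotion dictionary of [35] with per-word V-A values.
images = pd.read_csv("iaps_images.csv")       # columns: tag, img_valence, img_arousal
words = pd.read_csv("word_emotion_dict.csv")  # columns: word, word_valence, word_arousal

merged = images.merge(words, left_on="tag", right_on="word", how="inner")

r_valence, _ = pearsonr(merged["word_valence"], merged["img_valence"])
r_arousal, _ = pearsonr(merged["word_arousal"], merged["img_arousal"])
print("valence correlation:", r_valence, "arousal correlation:", r_arousal)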

Based on this observation, we add object-based features to our system for predicting emotions. In recent years, many studies have used CNN models of various structures trained on the ImageNet dataset [36] for image classification. We use three of the most popular models (AlexNet, VGG16, and ResNet) to extract object features from our image dataset and examine the effect of the features extracted from each model on the emotion prediction results. Our object feature is the output of the final layer, which represents the probabilities of the 1,000 object categories.
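For example, the 1000-D object feature could be extracted with a pre-trained model as sketched below (tf.keras' VGG16 is used here for illustration; the paper does not prescribe a particular implementation):

import numpy as np
import tensorflow as tf

# Pre-trained VGG16 with its 1000-way ImageNet classifier; the softmax
# output is used directly as the 1000-D object feature.
model = tf.keras.applications.VGG16(weights="imagenet")

def object_feature(path):
    img = tf.keras.utils.load_img(path, target_size=(224, 224))
    x = tf.keras.utils.img_to_array(img)[np.newaxis]
    x = tf.keras.applications.vgg16.preprocess_input(x)
    return model.predict(x)[0]  # 1000-D probability vector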

D. Semantic features

As a high-level feature similar in concept to the object, we consider another type of semantic information that describes the background of the image. What the object in the image is matters, but what the background consists of is just as important. For example, if the main object of an image is a person, the emotion may differ depending on whether the background is a city with many buildings or nature such as a mountain or the sea. The proportions of sky, sea, or buildings in the background can also affect the emotion. To use the semantic information of the background as a feature, we perform scene parsing on all images. Wu et al. [37] proposed a semantic segmentation method based on a deep network, which classifies each pixel of an image into one of 150 semantic categories. Given the resulting semantic map, we determine which of the 150 categories each pixel belongs to, obtaining a 150-dimensional vector that is used as one of the input features.
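One plausible reading of this 150-D feature is a normalized per-category pixel count, sketched below under the assumption that a scene-parsing model (such as [37]) has already produced an integer label map:

import numpy as np

def semantic_feature(label_map, num_classes=150):
    """label_map: HxW integer array of per-pixel scene-parsing labels
    (0..149), e.g. produced by an ADE20K-style model such as [37]."""
    counts = np.bincount(label_map.ravel(), minlength=num_classes)[:num_classes]
    # Fraction of the image covered by each of the 150 semantic categories.
    return counts / counts.sum()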

These object- and semantic-based high-level features are combined with low-level features, such as the color and local features, to train our network for emotion prediction.

V. LEARNING EMOTION MODEL

In this section, we describe the details of our emotion prediction framework. The overall architecture is shown in Figure 9. Our model is a fully connected feed-forward neural network. In general, a neural network consists of an input layer, an output layer, and one or more hidden layers; when there are two or more hidden layers, the network is called a deep network. Each layer is made up of multiple neurons, and the edges that connect neurons in adjacent layers carry weights. These weights, together with the biases, are learned during the training phase.

Our network F(X, θ), which consists of an input layer, three hidden layers, and an output layer, is defined as

F(X, θ) = f^4 ∘ g^3 ∘ f^3 ∘ g^2 ∘ f^2 ∘ g^1 ∘ f^1(X),   (4)

where X is the input feature vector, θ is the set of parameters (connection weights w and biases b), and f^4 produces the final output of our neural network (the valence or arousal value).

Specifically, given an input vector x^l = [x^l_1, ..., x^l_n]^T in layer l, the preactivation value p^{l+1}_j of neuron j in layer l+1 is obtained through the function f^{l+1}:

p^{l+1}_j = f^{l+1}(x^l) = Σ_{i=1}^{n} w^{l+1}_{ij} x^l_i + b^{l+1}_j,   (5)

where w^{l+1}_{ij} is the weight connecting x^l_i in layer l to neuron j in layer l+1, b^{l+1}_j is the bias of neuron j in layer l+1, and n is the number of neurons in layer l. The output value x^{l+1}_j of neuron j in layer l+1 (which also serves as an input to layer l+2) is then obtained through the nonlinear activation function g^{l+1}:

x^{l+1}_j = g^{l+1}(p^{l+1}_j).   (6)

Note that the rectified linear unit (ReLU) [1], max(0, x), is used as the nonlinear activation function throughout the network.

The number of neurons in each layer is given in Table I. We set the loss function of our network to

L = Σ_{k=1}^{K} (f^4_k − T_k)^2,

where f^4_k and T_k are the value predicted by our model and the ground-truth emotion value of the k-th image, and K is the number of images.


Fig. 9. The overall architecture of the proposed emotion prediction model. The object classification output (1000-D vector) and the scene segmentation output (150-D vector) are combined with the low-level features and fed to the emotion prediction model, which outputs the V-A values.

TABLE I
THE NUMBER OF NEURONS IN EACH LAYER

                  Input   H1     H2     H3     Output
Num. of neurons   1588    3000   1000   1000   1

In the training phase, the network parameters θ are updated by backpropagating the gradients through all layers; by minimizing the loss function, we optimize the weights of our network. We set the learning rate to 0.0001 and train the network using stochastic gradient descent (SGD) with a momentum of 0.9. We set the batch size to 1,000 and train our model until the error no longer decreases. All experiments are implemented using the open-source deep learning framework TensorFlow [38].
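A minimal sketch of this network and training setup in tf.keras is given below, using the layer sizes of Table I; the epoch count is an assumption, and the actual implementation details may differ:

import tensorflow as tf

# Feed-forward network following Table I: 1588-D input, three hidden
# layers with ReLU, and a single output (valence or arousal).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1588,)),
    tf.keras.layers.Dense(3000, activation="relu"),
    tf.keras.layers.Dense(1000, activation="relu"),
    tf.keras.layers.Dense(1000, activation="relu"),
    tf.keras.layers.Dense(1),
])

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9),
    loss="mse",  # squared error against the ground-truth emotion value
)

# X_train: (N, 1588) features, y_train: (N,) valence (or arousal) labels.
# The epoch count below is an assumption; the paper trains until the error
# stops decreasing.
# model.fit(X_train, y_train, batch_size=1000, epochs=100)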

VI. EXPERIMENT

A. Model performance

We first evaluate the performance of our model. The entire dataset is divided into five groups: four groups are used for training and the remaining group is used for testing, with each group serving as the test set exactly once. In each split, the number of training images is 8,600 and the number of test images is 2,166. All splits are trained with the same network structure.
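The 5-fold protocol could look like the following sketch; the feature matrix, the labels, and the mean predictor are random or trivial stand-ins that only keep the example self-contained, and in practice the FFNN sketched above would be trained on each training fold:

import numpy as np
from sklearn.model_selection import KFold

# Random stand-ins for the real feature matrix and labels (10,766 images).
rng = np.random.default_rng(0)
X = rng.random((10766, 1588))
y = rng.uniform(1, 9, 10766)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_errors = []
for train_idx, test_idx in kf.split(X):
    # A fresh model would be trained on X[train_idx], y[train_idx] here;
    # a trivial mean predictor keeps this sketch runnable on its own.
    prediction = np.full(len(test_idx), y[train_idx].mean())
    fold_errors.append(np.mean((prediction - y[test_idx]) ** 2))
print("per-fold test MSE:", fold_errors, "average:", np.mean(fold_errors))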

The results are presented in Table II. The labels in the first row denote the group numbers; for example, column g1 shows the training and test errors when groups 2 to 5 are used as training data and group 1 as test data. The dimension of the input feature vector is as described in Table I, and the output is the valence or arousal value. To compare the performance of the object features obtained from the three CNNs, we build several models by combining each object feature with the other features and compare their performance. The 'A', 'V', and 'R' in the third column denote AlexNet, VGG16, and ResNet, respectively. Each entry is the mean squared error between the ground-truth emotion value and the value predicted by the corresponding model.

TABLE II
RESULTS OF 5-FOLD VALIDATION EXPERIMENTS

                        g1     g2     g3     g4     g5     Avg.
Valence  Training   A   1.37   1.37   1.35   1.37   1.38   1.37
                    V   1.31   1.31   1.30   1.30   1.33   1.31
                    R   1.29   1.30   1.28   1.31   1.30   1.30
                    O   1.47   1.47   1.46   1.48   1.48   1.47
         Test       A   1.72   1.68   1.67   1.69   1.61   1.67
                    V   1.68   1.66   1.63   1.63   1.59   1.64
                    R   1.70   1.66   1.65   1.62   1.60   1.65
                    O   1.80   1.81   1.73   1.73   1.69   1.75
Arousal  Training   A   1.22   1.24   1.20   1.21   1.23   1.22
                    V   1.16   1.21   1.17   1.16   1.16   1.17
                    R   1.18   1.22   1.18   1.18   1.18   1.19
                    O   1.26   1.30   1.25   1.27   1.29   1.27
         Test       A   1.50   1.52   1.50   1.48   1.44   1.49
                    V   1.48   1.49   1.46   1.48   1.44   1.47
                    R   1.49   1.49   1.47   1.47   1.45   1.48
                    O   1.56   1.53   1.54   1.50   1.48   1.52

We also extracted category-level features from [39], which are not derived from a CNN-based model, and included them as input features for emotion prediction; Borth et al. [21] also used this feature to predict the sentiment of images. The dimension of this feature is 2,000, so the number of input neurons in our model changes to 2,588 in this case. The rows labeled 'O' show the corresponding prediction results. Overall, the features from VGG16 achieve the best performance for both the valence and arousal models (valence: 1.64, arousal: 1.47).

Figures 10 to 14 show a qualitative analysis of the emotion values and accuracies predicted by our model. In Figure 10, the images are placed so that the predicted values match the emotion values in the V-A space. Figures 11 and 12 show the results of the valence model, and Figures 13 and 14 show the results of the arousal model.

B. Feature performance

We also investigate the effects of the various features we proposed. First, we combine the color features and the local features into a low-level feature. To construct a network model for each individual feature, we use the structure of our model and change the number of neurons: our proposed model consists of three hidden layers with 3000, 1000, and 500 neurons, and for each feature-specific model the number of nodes in each layer follows the ratio between the numbers of nodes in adjacent layers of our model (Table III, bottom). As a result, among the individual features, the object feature shows the best emotion prediction performance: for valence, the object feature extracted from VGG16 gives the best result (MSE 1.92), and for arousal, the object feature extracted from AlexNet gives the best result (MSE 1.61). The semantic feature also shows lower error than the low-level features. Our model combining all the features achieves the best prediction performance, reflecting the synergy between features extracted from various sources and the expressive power of the deep neural network.

TABLE III
PERFORMANCE OF EACH FEATURE (TOP) AND THE NUMBER OF NEURONS IN EACH LAYER OF THE FEATURE-SPECIFIC NETWORKS (BOTTOM)

           Low    Object                      Semantic   All
                  AlexNet   VGG16   ResNet
Valence    1.98   1.93      1.92    1.98      1.97       1.64
Arousal    1.66   1.61      1.62    1.68      1.64       1.47

input      438    1000      1000    1000      150        1588
h1         900    2000      2000    2000      300        3000
h2         300    700       700     700       100        1000
h3         150    350       350     350       50         500
output     1      1         1       1         1          1

C. Comparison with CNN

Some previous studies learned emotion classification models using weights pre-trained for image classification [9]–[13]. We compare our emotion prediction model with CNN-based emotion prediction models obtained by transfer learning. Two CNN structures are used for comparison: AlexNet and VGG19. We initialize the weights of AlexNet and VGG19 with the weights learned for image classification. Except for the final output layer, the convolutional layers and the fully connected layers use the existing model structures. Since the ImageNet-based CNN models have 1,000 outputs, we change the number of outputs to 1 for our purpose (valence or arousal).

Transfer learning can be carried out in several ways. AlexNet has five convolutional layers and three fully connected layers, and VGG19 has 16 convolutional layers and three fully connected layers. In transfer learning, we can decide which layers to freeze and which layers to train, and we experiment with two settings: in the first, the convolutional layers are frozen and only the fully connected layers are trained (conv-frozen); in the second, all layers are trained together (conv-train). The learning environment for the CNN-based models is almost the same as for our FFNN model except for the batch size and learning rate: the learning rate of both the conv-frozen and conv-train networks is 0.5 × 10^-4, and the batch size of both CNN models is 50. When training the CNN models, as well as our model, we use the same training and test datasets.
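The sketch below illustrates the two transfer-learning settings with tf.keras' VGG19; how the original experiments were implemented is not specified, so this is only one possible realization:

import tensorflow as tf

# Pre-trained VGG19, including its fully connected layers.
base = tf.keras.applications.VGG19(weights="imagenet")

# Replace the 1000-way softmax with a single valence (or arousal) output.
features = base.layers[-2].output          # last fully connected layer (fc2)
output = tf.keras.layers.Dense(1)(features)
model = tf.keras.Model(inputs=base.input, outputs=output)

# conv-frozen: train only the fully connected layers.
for layer in model.layers:
    if isinstance(layer, tf.keras.layers.Conv2D):
        layer.trainable = False
# conv-train: simply leave every layer trainable instead.

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.5e-4, momentum=0.9),
    loss="mse",
)
# model.fit(images, labels, batch_size=50, ...)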

The results are shown in Table IV. The second column shows the training range: conv-frozen means that only the fully connected layers were trained, and conv-train means that all layers were trained.

TABLE IV
COMPARISON WITH OTHER LEARNING METHODS

Emotion   Train range   AlexNet   VGG19   Linear   SVR    Ours
Valence   conv-frozen   2.76      2.64    2.42     1.75   1.64
          conv-train    2.64      2.60
Arousal   conv-frozen   1.95      1.91    2.65     1.54   1.47
          conv-train    1.89      1.87

From the training-range perspective, the error of transfer learning over the entire network is smaller than that of transfer learning over only the fully connected layers. This result implies that the low-level filters, as well as the high-level ones, must be learned to achieve a better emotion prediction model; in other words, the convolutional layers and the fully connected layers must be trained together. On the model side, the model based on the VGG19 structure performs better than the one based on AlexNet. However, both models show considerably larger errors than our proposed emotion-based FFNN. CNN-based models learn well and predict accurately when images of the same class share similar appearance, as in object detection or image classification; however, as mentioned earlier, images with different appearances can have the same emotions, and images with similar appearances can have different emotions. We also conducted tests using other machine learning methods, linear regression and support vector regression, on the same dataset. These models also perform better than the CNN-based models in most cases, but our model still shows the best performance (see Table IV).

VII. CONCLUSION

In this paper, we presented a new emotion recognition system based on a deep learning framework. To reduce the affective gap, we designed and extracted object and background semantic features as high-level features and showed that these features are effective for emotion prediction. The high-level and low-level features complement each other well, which leads to better emotion recognition performance. As expected, the accuracy of the object recognition has an impact on the performance of the emotion prediction: object features from incorrect recognitions may lead to incorrect emotion predictions, and there is also a problem when the main object of the image is not included in the existing 1,000 classes. However, with the rapid progress of deep learning on large datasets, both the recognition accuracy and the number of classes will increase, which in turn will also benefit our emotion recognition system.

As interesting future work, one can consider the presence of a person in the image and the facial expression. A facial expression is one of the features that can greatly affect emotion prediction: even if the overall mood of an image is dark, a smiling face can mitigate the negative emotion a little (Figure 15), and when the face occupies most of the photograph, facial expression and emotion are directly connected. We will consider enhancing emotion recognition performance by adding a facial expression recognition framework.


Fig. 10. Prediction results. The values below each image show the prediction results of the FFNN and the ground-truth emotion values, respectively. The prediction accuracy results (V/A) are also shown.


Several studies have demonstrated that biometric data has a positive effect on emotion recognition [40], [41]. We can also consider using biometric data or an observer's facial features as additional inputs. However, deep networks generally require thousands or tens of thousands of samples, and collecting such biometric and facial data is not an easy task. Finding a way to improve the performance of the model with only a small amount of biometric data is left as another direction for future work.

We also built a database for emotion estimation with the V-A model and will continue to collect more data. We expect our dataset to be widely used in the field of affective computing.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.

[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.

[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint arXiv:1512.03385, 2015.

[5] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.

[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.

[7] Z. Cheng, Q. Yang, and B. Sheng, "Deep colorization," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 415–423.

[8] C. Dong, C. C. Loy, K. He, and X. Tang, "Learning a deep convolutional network for image super-resolution," in European Conference on Computer Vision. Springer, 2014, pp. 184–199.

[9] V. Campos, A. Salvador, X. Giro-i Nieto, and B. Jou, "Diving deep into sentiment: Understanding fine-tuned cnns for visual sentiment prediction," in Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia. ACM, 2015, pp. 57–62.

[10] K.-C. Peng, T. Chen, A. Sadovnik, and A. C. Gallagher, "A mixed bag of emotions: Model, predict, and transfer emotion distributions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 860–868.

[11] Q. You, J. Luo, H. Jin, and J. Yang, "Robust image sentiment analysis using progressively trained and domain transferred deep networks," in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.


Fig. 11. Valence prediction results of images with low arousal. The values below each image represent the predicted valence value and the ground-truth value (prediction/ground truth). The prediction accuracy is also shown. From the top left to the bottom right, the valence value increases. All images in this example have low arousal values.

Fig. 12. Valence prediction results of images with high arousal. The values below each image represent the predicted valence value and the ground-truth value (prediction/ground truth). The prediction accuracy is also shown. From the top left to the bottom right, the valence value increases. All images in this example have high arousal values.


Fig. 13. Arousal prediction results of images with low valence. The values below each image represent the predicted arousal value and the ground-truth value (prediction/ground truth). The prediction accuracy is also shown. From the top left to the bottom right, the arousal value increases. All images in this example have low valence values.

Fig. 14. Arousal prediction results of images with high valence. The values below each image represent the predicted arousal value and the ground-truth value (prediction/ground truth). The prediction accuracy is also shown. From the top left to the bottom right, the arousal value increases. All images in this example have high valence values.


Fig. 15. Different emotions with different facial expressions.

[12] ——, "Building a large scale dataset for image emotion recognition: The fine print and the benchmark," 2016.

[13] C. Xu, S. Cetintas, K.-C. Lee, and L.-J. Li, "Visual sentiment prediction with deep convolutional neural networks," arXiv preprint arXiv:1411.5731, 2014.

[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[15] A. Hanjalic, "Extracting moods from pictures and sounds: Towards truly personalized tv," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 90–100, 2006.

[16] X. Lu, P. Suryanarayan, R. B. Adams Jr, J. Li, M. G. Newman, and J. Z. Wang, "On shape and the computability of emotions," in Proceedings of the 20th ACM International Conference on Multimedia. ACM, 2012, pp. 229–238.

[17] J. Machajdik and A. Hanbury, "Affective image classification using features inspired by psychology and art theory," in Proceedings of the 18th ACM International Conference on Multimedia. ACM, 2010, pp. 83–92.

[18] S. Zhao, Y. Gao, X. Jiang, H. Yao, T.-S. Chua, and X. Sun, "Exploring principles-of-art features for image emotion recognition," in Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014, pp. 47–56.

[19] S. Haykin, Neural Networks: A Comprehensive Foundation. Prentice Hall, 1999. [Online]. Available: https://books.google.co.kr/books?id=3-1HPwAACAAJ

[20] C. E. Osgood, "The nature and measurement of meaning," Psychological Bulletin, vol. 49, no. 3, p. 197, 1952.

[21] D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang, "Large-scale visual sentiment ontology and detectors using adjective noun pairs," in Proceedings of the 21st ACM International Conference on Multimedia. ACM, 2013, pp. 223–232.

[22] T. Chen, D. Borth, T. Darrell, and S.-F. Chang, "Deepsentibank: Visual sentiment concept classification with deep convolutional neural networks," arXiv preprint arXiv:1410.8586, 2014.

[23] V. Campos, B. Jou, and X. Giro-i Nieto, "From pixels to sentiment: Fine-tuning cnns for visual sentiment prediction," Image and Vision Computing, 2017.

[24] J. Islam and Y. Zhang, "Visual sentiment analysis for social images using transfer learning approach," in Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom) (BDCloud-SocialCom-SustainCom), 2016 IEEE International Conferences on. IEEE, 2016, pp. 124–130.

[25] M. M. Bradley and P. J. Lang, "Measuring emotion: the self-assessment manikin and the semantic differential," Journal of Behavior Therapy and Experimental Psychiatry, vol. 25, no. 1, pp. 49–59, 1994.

[26] J. A. Russell, "A circumplex model of affect," 1980.

[27] Flickr, "http://www.flickr.com."

[28] P. J. Lang, M. M. Bradley, and B. N. Cuthbert, "International affective picture system (iaps): Affective ratings of pictures and instruction manual," Technical Report A-8, 2008.

[29] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley, "Color transfer between images," IEEE Computer Graphics and Applications, no. 5, pp. 34–41, 2001.

[30] G. Csurka, S. Skaff, L. Marchesotti, and C. Saunders, "Learning moods and emotions from color combinations," in Proceedings of the Seventh Indian Conference on Computer Vision, Graphics and Image Processing. ACM, 2010, pp. 298–305.

[31] L. He, H. Qi, and R. Zaretzki, "Image color transfer to evoke different emotions based on color combinations," Signal, Image and Video Processing, pp. 1–9, 2014.

[32] H.-R. Kim, H. Kang, and I.-K. Lee, "Image recoloring with valence-arousal emotion model," in Computer Graphics Forum, vol. 35, no. 7. Wiley Online Library, 2016, pp. 209–216.

[33] J. Van De Weijer, C. Schmid, J. Verbeek, and D. Larlus, "Learning color names for real-world applications," IEEE Transactions on Image Processing, vol. 18, no. 7, pp. 1512–1523, 2009.

[34] P. Valdez and A. Mehrabian, "Effects of color on emotions," Journal of Experimental Psychology: General, vol. 123, no. 4, p. 394, 1994.

[35] A. B. Warriner, V. Kuperman, and M. Brysbaert, "Norms of valence, arousal, and dominance for 13,915 english lemmas," Behavior Research Methods, vol. 45, no. 4, pp. 1191–1207, 2013.

[36] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.

[37] Z. Wu, C. Shen, and A. v. d. Hengel, "Wider or deeper: Revisiting the resnet model for visual recognition," arXiv preprint arXiv:1611.10080, 2016.

[38] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.

[39] F. X. Yu, L. Cao, R. S. Feris, J. R. Smith, and S.-F. Chang, "Designing category-level attributes for discriminative visual recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 771–778.

[40] L. Aftanas, N. Reva, A. Varlamov, S. Pavlov, and V. Makhnev, "Analysis of evoked eeg synchronization and desynchronization in conditions of emotional activation in humans: temporal and topographic characteristics," Neuroscience and Behavioral Physiology, vol. 34, no. 8, pp. 859–867, 2004.

[41] G. Chanel, J. Kronegg, D. Grandjean, and T. Pun, "Emotion assessment: Arousal evaluation using EEG's and peripheral physiological signals," Multimedia Content Representation, Classification and Security, pp. 530–537, 2006.