

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 165–176, Hong Kong, China, November 3–7, 2019. © 2019 Association for Computational Linguistics.


Knowledge-Enriched Transformer for Emotion Detection in Textual Conversations

Peixiang Zhong (1,2), Di Wang (1), Chunyan Miao (1,2,3)

1 Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly
2 Alibaba-NTU Singapore Joint Research Institute
3 School of Computer Science and Engineering
Nanyang Technological University, Singapore

[email protected], {wangdi, ascymiao}@ntu.edu.sg

Abstract

Messages in human conversations inherently convey emotions. The task of detecting emotions in textual conversations leads to a wide range of applications such as opinion mining in social networks. However, enabling machines to analyze emotions in conversations is challenging, partly because humans often rely on the context and commonsense knowledge to express emotions. In this paper, we address these challenges by proposing a Knowledge-Enriched Transformer (KET), where contextual utterances are interpreted using hierarchical self-attention and external commonsense knowledge is dynamically leveraged using a context-aware affective graph attention mechanism. Experiments on multiple textual conversation datasets demonstrate that both context and commonsense knowledge are consistently beneficial to the emotion detection performance. In addition, the experimental results show that our KET model outperforms the state-of-the-art models on most of the tested datasets in F1 score.

1 Introduction

Emotions are “generated states in humans that reflect evaluative judgments of the environment, the self and other social agents” (Hudlicka, 2011). Messages in human communications inherently convey emotions. With the prevalence of social media platforms such as Facebook Messenger, as well as conversational agents such as Amazon Alexa, there is an emerging need for machines to understand human emotions in natural conversations. This work addresses the task of detecting emotions (e.g., happy, sad, angry) in textual conversations, where the emotion of an utterance is detected within the conversational context. Being able to effectively detect emotions in conversations leads to a wide range of applications ranging from opinion mining in social media platforms

[Figure 1: a four-utterance conversation between a child and their mother about birthday plans. The first three utterances form the context and the fourth the response; each utterance is annotated as “No emotion” or “Happiness”, and the word “friends” is linked to the knowledge entities “socialize”, “party”, and “movie”. The utterances are: “What do you plan to do for your birthday?” / “I want to have a picnic with my friends, Mum.” / “How about a party at home? That way we can get together and celebrate it.” / “OK, Mum. I'll invite my friends home.”]

Figure 1: An example conversation with annotated labels from the DailyDialog dataset (Li et al., 2017). By referring to the context, “it” in the third utterance is linked to “birthday” in the first utterance. By leveraging an external knowledge base, the meaning of “friends” in the fourth utterance is enriched by associated knowledge entities, namely “socialize”, “party”, and “movie”. Thus, the implicit “happiness” emotion in the fourth utterance can be inferred more easily via its enriched meaning.

(Chatterjee et al., 2019) to building emotion-aware conversational agents (Zhou et al., 2018a).

However, enabling machines to analyze emotions in human conversations is challenging, partly because humans often rely on the context and commonsense knowledge to express emotions, both of which are difficult for machines to capture. Figure 1 shows an example conversation demonstrating the importance of context and commonsense knowledge in understanding conversations and detecting implicit emotions.

There are several recent studies that model contextual information to detect emotions in conversations. Poria et al. (2017) and Majumder et al. (2019) leveraged recurrent neural networks (RNN) to model the contextual utterances in sequence, where each utterance is represented by a feature vector extracted by convolutional neural networks (CNN) at an earlier stage. Similarly, Hazarika et al. (2018a,b) proposed to use extracted CNN


features in memory networks to model contextual utterances. However, these methods require separate feature extraction and tuning, which may not be ideal for real-time applications. In addition, to the best of our knowledge, no attempts have been made in the literature to incorporate commonsense knowledge from external knowledge bases to detect emotions in textual conversations. Commonsense knowledge is fundamental to understanding conversations and generating appropriate responses (Zhou et al., 2018b).

To this end, we propose a Knowledge-Enriched Transformer (KET) to effectively incorporate contextual information and external knowledge bases to address the aforementioned challenges. The Transformer (Vaswani et al., 2017) has been shown to be a powerful representation learning model in many NLP tasks such as machine translation (Vaswani et al., 2017) and language understanding (Devlin et al., 2018). The self-attention (Cheng et al., 2016) and cross-attention (Bahdanau et al., 2014) modules in the Transformer capture intra-sentence and inter-sentence correlations, respectively. The shorter path of information flow in these two modules compared to gated RNNs and CNNs allows KET to model contextual information more efficiently. In addition, we propose a hierarchical self-attention mechanism allowing KET to model the hierarchical structure of conversations. Our model separates context and response into the encoder and decoder, respectively, which is different from other Transformer-based models, e.g., BERT (Devlin et al., 2018), which directly concatenate context and response and then train language models using only the encoder part.

Moreover, to exploit commonsense knowledge, we leverage external knowledge bases to facilitate the understanding of each word in the utterances by referring to related knowledge entities. The referring process is dynamic and balances between relatedness and affectiveness of the retrieved knowledge entities using a context-aware affective graph attention mechanism.

In summary, our contributions are as follows:

• For the first time, we apply the Transformer to analyze conversations and detect emotions. Our hierarchical self-attention and cross-attention modules allow our model to exploit contextual information more efficiently than existing gated RNNs and CNNs.

• We derive dynamic, context-aware, and emotion-related commonsense knowledge from external knowledge bases and emotion lexicons to facilitate emotion detection in conversations.

• We conduct extensive experiments demonstrating that both contextual information and commonsense knowledge are beneficial to the emotion detection performance. In addition, our proposed KET model outperforms the state-of-the-art models on most of the tested datasets across different domains.

2 Related Work

Emotion Detection in Conversations: Early studies on emotion detection in conversations focused on call center dialogs using lexicon-based methods and audio features (Lee and Narayanan, 2005; Devillers and Vidrascu, 2006). Devillers et al. (2002) annotated and detected emotions in call center dialogs using unigram topic modelling. In recent years, there has been an emerging research trend on emotion detection in conversational videos and multi-turn Tweets using deep learning methods (Hazarika et al., 2018b,a; Zahiri and Choi, 2018; Chatterjee et al., 2019; Zhong and Miao, 2019; Poria et al., 2019). Poria et al. (2017) proposed a long short-term memory network (LSTM) (Hochreiter and Schmidhuber, 1997) based model to capture contextual information for sentiment analysis in user-generated videos. Majumder et al. (2019) proposed the DialogueRNN model, which uses three gated recurrent units (GRU) (Cho et al., 2014) to model the speaker, the context from the preceding utterances, and the emotions of the preceding utterances, respectively. They achieved the state-of-the-art performance on several conversational video datasets.

Knowledge Base in Conversations: Recently there has been a growing number of studies on incorporating knowledge bases in generative conversation systems, such as open-domain dialogue systems (Han et al., 2015; Asghar et al., 2018; Ghazvininejad et al., 2018; Young et al., 2018; Parthasarathi and Pineau, 2018; Liu et al., 2018; Moghe et al., 2018; Dinan et al., 2019; Zhong et al., 2019), task-oriented dialogue systems (Madotto et al., 2018; Wu et al., 2019; He et al., 2019) and question answering systems (Kiddon et al., 2016; Hao et al., 2017; Sun et al., 2018; Mihaylov and Frank, 2018). Zhou et al. (2018b) adopted structured


knowledge graphs to enrich the interpretation of input sentences and help generate knowledge-aware responses using graph attention. The graph attention in their knowledge interpreter (Zhou et al., 2018b) is static and only related to the recognized entity of interest. By contrast, our graph attention mechanism is dynamic and selects context-aware knowledge entities that balance relatedness and affectiveness.

Emotion Detection in Text: There is a trend moving from traditional machine learning methods (Pang et al., 2002; Wang and Manning, 2012; Seyeditabari et al., 2018) to deep learning methods (Abdul-Mageed and Ungar, 2017; Zhang et al., 2018b) for emotion detection in text. Khanpour and Caragea (2018) investigated emotion detection in health-related posts from online health communities using both deep learning features and lexicon-based features.

Incorporating Knowledge in Sentiment Analysis: Traditional lexicon-based methods detect emotions or sentiments in a piece of text based on the emotions or sentiments of the words or phrases that compose it (Hu et al., 2009; Taboada et al., 2011; Bandhakavi et al., 2017). Few studies have investigated the usage of knowledge bases in deep learning methods. Kumar et al. (2018) proposed to use knowledge from WordNet (Fellbaum, 2012) to enrich the text representations produced by an LSTM and obtained improved performance.

Transformer: The Transformer has been applied to many NLP tasks due to its rich representations and fast computation, e.g., document-level machine translation (Zhang et al., 2018a), response matching in dialogue systems (Zhou et al., 2018c), language modelling (Dai et al., 2019) and understanding (Radford et al., 2018). A very recent work (Koncel-Kedziorski et al., 2019) extends the Transformer to graph inputs and proposes a model for graph-to-text generation.

3 Our Proposed KET Model

In this section we present the task definition and our proposed KET model.

3.1 Task Definition

Let $\{X^i_j, Y^i_j\}$, $i = 1, \ldots, N$, $j = 1, \ldots, N_i$, be a collection of {utterance, label} pairs in a given dialogue dataset, where $N$ denotes the number of conversations and $N_i$ denotes the number of utterances in the $i$th conversation. The objective of the task is to maximize the following function:

$$\Phi = \prod_{i=1}^{N} \prod_{j=1}^{N_i} p(Y^i_j \mid X^i_j, X^i_{j-1}, \ldots, X^i_1; \theta), \qquad (1)$$

where $X^i_{j-1}, \ldots, X^i_1$ denote contextual utterances and $\theta$ denotes the model parameters we want to optimize.

We limit the number of contextual utterances to $M$. Discarding early contextual utterances may cause information loss, but this loss is negligible because they contribute the least amount of information (Su et al., 2018). This phenomenon can be further observed in our model analysis regarding context length (see Section 5.2). Similar to (Poria et al., 2017), we clip and pad each utterance $X^i_j$ to a fixed number of $m$ tokens. The overall architecture of our KET model is illustrated in Figure 2.
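To make this setup concrete, the following minimal Python sketch windows a single (context, response) training example under these constraints. The helper name `build_example` and the padding scheme for missing context turns are our assumptions for illustration, not details from the paper.

```python
PAD = "<pad>"

def build_example(conversation, j, M=6, m=30):
    """Window one (context, response) pair out of a conversation.

    conversation: list of tokenized utterances (lists of tokens);
    j: index of the response utterance X^i_j. The paper specifies only
    that the context is limited to M utterances and each utterance is
    clipped/padded to m tokens; everything else here is illustrative.
    """
    def clip_pad(tokens):
        tokens = tokens[:m]                        # clip to m tokens
        return tokens + [PAD] * (m - len(tokens))  # pad to m tokens

    context = [clip_pad(u) for u in conversation[max(0, j - M):j]]
    while len(context) < M:                        # pad missing early turns
        context.insert(0, [PAD] * m)
    response = clip_pad(conversation[j])
    return context, response
```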

3.2 Knowledge Retrieval

We use a commonsense knowledge base, ConceptNet (Speer et al., 2017), and an emotion lexicon, NRC VAD (Mohammad, 2018a), as knowledge sources in our model.

ConceptNet is a large-scale multilingual semantic graph that describes general human knowledge in natural language. The nodes in ConceptNet are concepts and the edges are relations. Each ⟨concept1, relation, concept2⟩ triplet is an assertion, and each assertion is associated with a confidence score. An example assertion is ⟨friends, CausesDesire, socialize⟩ with a confidence score of 3.46. Assertion confidence scores usually lie in the [1, 10] interval. Currently, for English, ConceptNet comprises 5.9M assertions, 3.1M concepts and 38 relations.

NRC VAD is a list of English words and their VAD scores, i.e., valence (negative-positive), arousal (calm-excited), and dominance (submissive-dominant) scores in the [0, 1] interval. The VAD measure of emotion is culture-independent and widely adopted in psychology (Mehrabian, 1996). Currently, NRC VAD comprises around 20K words.

In general, for each non-stopword token $t$ in $X^i_j$, we retrieve a connected knowledge graph $g(t)$ comprising its immediate neighbors from ConceptNet. For each $g(t)$, we remove concepts that are stopwords or not in our vocabulary. We further remove concepts with confidence scores less

than 1 to reduce annotation noise. For each remaining concept, we retrieve its VAD values from NRC VAD. The final knowledge representation for each token $t$ is a list of tuples $(c_1, s_1, \mathrm{VAD}(c_1)), (c_2, s_2, \mathrm{VAD}(c_2)), \ldots, (c_{|g(t)|}, s_{|g(t)|}, \mathrm{VAD}(c_{|g(t)|}))$, where $c_k \in g(t)$ denotes the $k$th connected concept, $s_k$ denotes the associated confidence score, and $\mathrm{VAD}(c_k)$ denotes the VAD values of $c_k$. The treatment of tokens that are not associated with any concept, and of concepts that are not included in NRC VAD, is discussed in Section 3.4. We leave the treatment of relations as future work.

[Figure 2: block diagram of KET. For each utterance, word embeddings are combined with concept embeddings retrieved from the knowledge base (Section 3.2) through the dynamic context-aware affective graph attention (Section 3.4) to form concept-enriched embeddings (Section 3.3). Contextual utterances pass through hierarchical self-attention (Section 3.5) and the response through multi-head self-attention, followed by context-response cross-attention, max pooling, a linear layer, and softmax (Section 3.6).]

Figure 2: Overall architecture of our proposed KET model. The positional encoding, residual connection, and layer normalization are omitted in the illustration for brevity.
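To make the retrieval step above concrete, here is a minimal sketch under stated assumptions: `concept_net` is a hypothetical dict mapping a token to (concept, confidence) neighbor pairs, `vad` a dict mapping words to (valence, arousal, dominance) tuples, and `stopwords`/`vocab` plain sets. These names are ours, not from the paper's released code, and the example output values are illustrative only.

```python
def retrieve_knowledge(token, concept_net, vad, stopwords, vocab):
    """Return the knowledge representation of one token: a list of
    (concept, confidence_score, VAD) tuples, following Section 3.2."""
    tuples = []
    for concept, score in concept_net.get(token, []):
        if concept in stopwords or concept not in vocab:
            continue            # drop stopword / out-of-vocabulary concepts
        if score < 1.0:
            continue            # drop low-confidence assertions
        # concepts missing from NRC VAD get VAD=None here and are
        # handled later by setting aff_k = 0.5 (Section 3.4)
        tuples.append((concept, score, vad.get(concept)))
    return tuples

# e.g. retrieve_knowledge("friends", ...) might yield something like
# [("socialize", 3.46, (0.7, 0.5, 0.6)), ...]   # VAD numbers illustrative
```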

3.3 Embedding Layer

We use a word embedding layer to convert each token $t$ in $X^i_j$ into a vector representation $\mathbf{t} \in \mathbb{R}^d$, where $d$ denotes the size of the word embedding. To encode positional information, the positional encoding (Vaswani et al., 2017) is added as follows:

$$\mathbf{t} = \mathrm{Embed}(t) + \mathrm{Pos}(t). \qquad (2)$$

Similarly, we use a concept embedding layer to convert each concept $c$ into a vector representation $\mathbf{c} \in \mathbb{R}^d$, but without positional encoding.
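A minimal PyTorch sketch of these two embedding layers, assuming the standard sinusoidal positional encoding of Vaswani et al. (2017) (the paper does not state which variant it uses) and an even embedding size $d$; the class name is ours.

```python
import math
import torch
import torch.nn as nn

class KETEmbeddings(nn.Module):
    """Word embedding with sinusoidal positional encoding (Eq. 2), plus a
    concept embedding without positions. A sketch, not the released code."""
    def __init__(self, vocab_size, concept_vocab_size, d, max_len=30):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d)
        self.concept_emb = nn.Embedding(concept_vocab_size, d)
        pe = torch.zeros(max_len, d)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d, 2).float() * (-math.log(10000.0) / d))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def embed_tokens(self, token_ids):            # (batch, m) -> (batch, m, d)
        x = self.word_emb(token_ids)
        return x + self.pe[: token_ids.size(1)]   # t = Embed(t) + Pos(t)

    def embed_concepts(self, concept_ids):        # no positional encoding
        return self.concept_emb(concept_ids)
```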

3.4 Dynamic Context-Aware Affective Graph Attention

To enrich word embeddings with concept representations, we propose a dynamic context-aware affective graph attention mechanism to compute the concept representation for each token. Specifically, the concept representation $\mathbf{c}(t) \in \mathbb{R}^d$ for token $t$ is computed as

$$\mathbf{c}(t) = \sum_{k=1}^{|g(t)|} \alpha_k \mathbf{c}_k, \qquad (3)$$

where $\mathbf{c}_k \in \mathbb{R}^d$ denotes the concept embedding of $c_k$ and $\alpha_k$ denotes its attention weight. If $|g(t)| = 0$, we set $\mathbf{c}(t)$ to the average of all concept embeddings. The attention weight $\alpha_k$ in Equation 3 is computed as

$$\alpha_k = \mathrm{softmax}(w_k), \qquad (4)$$

where $w_k$ denotes the weight of $c_k$.

The derivation of $w_k$ is crucial because it regulates the contribution of $c_k$ towards enriching $t$. A standard graph attention mechanism (Veličković et al., 2018) computes $w_k$ by feeding $\mathbf{t}$ and $\mathbf{c}_k$ into a single-layer feedforward neural network. However, not all related concepts are equally useful for detecting emotions given the conversational context. In our model, we make the assumption that important concepts are those that relate to the conversational context and have strong emotion intensity. To this end, we propose a context-aware affective graph attention mechanism that incorporates two factors when computing $w_k$, namely relatedness and affectiveness.


Relatedness: Relatedness measures the strength of the relation between $c_k$ and the conversational context. The relatedness factor in $w_k$ is computed as

$$\mathrm{rel}_k = \text{min-max}(s_k) \cdot \mathrm{abs}(\cos(\mathrm{CR}(X^i), \mathbf{c}_k)), \qquad (5)$$

where $s_k$ is the confidence score introduced in Section 3.2, min-max denotes min-max scaling over the concepts of each token $t$, abs denotes the absolute value function, cos denotes the cosine similarity function, and $\mathrm{CR}(X^i) \in \mathbb{R}^d$ denotes the context representation of the $i$th conversation $X^i$. Here we compute $\mathrm{CR}(X^i)$ as the average of all sentence representations in $X^i$:

$$\mathrm{CR}(X^i) = \mathrm{avg}(\mathrm{SR}(X^i_{j-M}), \ldots, \mathrm{SR}(X^i_j)), \qquad (6)$$

where $\mathrm{SR}(X^i_j) \in \mathbb{R}^d$ denotes the sentence representation of $X^i_j$. We compute $\mathrm{SR}(X^i_j)$ via hierarchical pooling (Shen et al., 2018), where n-gram ($n \leq 3$) representations in $X^i_j$ are first computed by max-pooling and then all n-gram representations are averaged. The hierarchical pooling mechanism preserves word order information to a certain degree and has demonstrated superior performance to average pooling or max-pooling on sentiment analysis tasks (Shen et al., 2018).

Affectiveness: Affectiveness measures the emotion intensity of $c_k$. The affectiveness factor in $w_k$ is computed as

$$\mathrm{aff}_k = \text{min-max}(\|[\mathrm{V}(c_k) - 1/2,\ \mathrm{A}(c_k)/2]\|_2), \qquad (7)$$

where $\|\cdot\|_2$ denotes the $l_2$ norm, and $\mathrm{V}(c_k) \in [0, 1]$ and $\mathrm{A}(c_k) \in [0, 1]$ denote the valence and arousal values of $\mathrm{VAD}(c_k)$, respectively. Intuitively, $\mathrm{aff}_k$ considers the deviation of valence from neutral and the level of arousal above calm. There is no established method in the literature to compute emotion intensity from VAD values, but empirically we found that our method correlates better with an emotion intensity lexicon comprising 6K English words (Mohammad, 2018b) than alternatives such as taking dominance into consideration or taking the $l_1$ norm. For a concept $c_k$ not in NRC VAD, we set $\mathrm{aff}_k$ to the mid value of 0.5.
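As a concrete reading of the hierarchical pooling step in Equation 6, the sketch below computes SR from a matrix of token embeddings by max-pooling each window of up to $n$ tokens and then averaging; whether the paper pools 1-, 2-, and 3-grams jointly or only 3-gram windows is not spelled out, so the sliding-window choice here is our assumption.

```python
import torch

def sentence_representation(token_embs: torch.Tensor, n: int = 3) -> torch.Tensor:
    """Hierarchical pooling (Shen et al., 2018): max-pool each n-gram
    window, then average all window representations. token_embs: (m, d)."""
    m, _ = token_embs.shape
    windows = [token_embs[i : i + n].max(dim=0).values   # max-pool one window
               for i in range(m - n + 1)]
    return torch.stack(windows).mean(dim=0)              # average windows
```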

Combining both $\mathrm{rel}_k$ and $\mathrm{aff}_k$, we define the weight $w_k$ as follows:

$$w_k = \lambda_k \cdot \mathrm{rel}_k + (1 - \lambda_k) \cdot \mathrm{aff}_k, \qquad (8)$$

where $\lambda_k$ is a model parameter balancing the impacts of relatedness and affectiveness on computing concept representations. Parameter $\lambda_k$ can be fixed or learned during training. The analysis of $\lambda_k$ is discussed in Section 5.2.

Finally, the concept-enriched word representation $\hat{\mathbf{t}}$ is obtained via a linear transformation:

$$\hat{\mathbf{t}} = \mathbf{W}[\mathbf{t}; \mathbf{c}(t)], \qquad (9)$$

where $[;]$ denotes concatenation and $\mathbf{W} \in \mathbb{R}^{d \times 2d}$ denotes a model parameter. All $m$ tokens in each $X^i_j$ then form a concept-enriched utterance embedding $\mathbf{X}^i_j \in \mathbb{R}^{m \times d}$.
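Putting Equations 3-9 together, here is a minimal PyTorch sketch of the attention over one token's concept graph. Tensor shapes and helper names are ours, not the released implementation's, and λ is shown fixed (the paper also allows learning it).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def min_max(x: torch.Tensor) -> torch.Tensor:
    """Min-max scaling over one token's concepts; zeros if all equal."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else torch.zeros_like(x)

def concept_representation(concept_embs, scores, valence, arousal,
                           context_rep, lam=0.5):
    """Eqs. 3-8: attention over a token's |g(t)| retrieved concepts.

    concept_embs: (K, d); scores, valence, arousal: (K,);
    context_rep: (d,) from Eq. 6; lam plays the role of lambda_k."""
    rel = min_max(scores) * torch.abs(
        F.cosine_similarity(concept_embs, context_rep.unsqueeze(0), dim=1))  # Eq. 5
    aff = min_max(torch.sqrt((valence - 0.5) ** 2 + (arousal / 2) ** 2))     # Eq. 7
    w = lam * rel + (1 - lam) * aff                                          # Eq. 8
    alpha = torch.softmax(w, dim=0)                                          # Eq. 4
    return alpha @ concept_embs                                              # Eq. 3

class ConceptEnrichment(nn.Module):
    """Eq. 9: concatenate a word embedding with its concept representation."""
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(2 * d, d, bias=False)  # W in R^{d x 2d}

    def forward(self, t, c_t):                       # t, c_t: (d,)
        return self.proj(torch.cat([t, c_t], dim=-1))
```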

3.5 Hierarchical Self-Attention

We propose a hierarchical self-attention mechanism to exploit the structural representation of conversations and learn a vector representation for the contextual utterances $X^i_{j-1}, \ldots, X^i_{j-M}$. Specifically, the hierarchical self-attention follows two steps: 1) each utterance representation is computed using an utterance-level self-attention layer, and 2) a context representation is computed from the $M$ learned utterance representations using a context-level self-attention layer.

At step 1, for each utterance $X^i_n$, $n = j-1, \ldots, j-M$, its representation $\mathbf{X}'^i_n \in \mathbb{R}^{m \times d}$ is learned as follows:

$$\mathbf{X}'^i_n = \mathrm{FF}(\mathrm{L}'(\mathrm{MH}(\mathrm{L}(\mathbf{X}^i_n), \mathrm{L}(\mathbf{X}^i_n), \mathrm{L}(\mathbf{X}^i_n)))), \qquad (10)$$

where $\mathrm{L}(\mathbf{X}^i_n) \in \mathbb{R}^{m \times h \times d_s}$ is linearly transformed from $\mathbf{X}^i_n$ to form $h$ heads ($d_s = d/h$), $\mathrm{L}'$ linearly transforms the $h$ heads back to one head, and

$$\mathrm{MH}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_s}}\right)V, \qquad (11)$$

$$\mathrm{FF}(x) = \max(0, xW_1 + b_1)W_2 + b_2, \qquad (12)$$

where $Q$, $K$, and $V$ denote sets of queries, keys and values, respectively, $W_1 \in \mathbb{R}^{d \times p}$, $b_1 \in \mathbb{R}^p$, $W_2 \in \mathbb{R}^{p \times d}$ and $b_2 \in \mathbb{R}^d$ denote model parameters, and $p$ denotes the hidden size of the point-wise feedforward layer (FF) (Vaswani et al., 2017). The multi-head self-attention layer (MH) enables our model to jointly attend to information from different representation subspaces (Vaswani et al., 2017). The scaling factor $1/\sqrt{d_s}$ ensures that the dot products of vectors do not become overly large. Similar to (Vaswani et al., 2017), both the MH and FF layers are followed by a residual connection and layer normalization, which are omitted in Equation 10 for brevity.


| Dataset | Domain | #Conv. (Train/Val/Test) | #Utter. (Train/Val/Test) | #Classes | Evaluation |
| EC | Tweet | 30160/2755/5509 | 90480/8265/16527 | 4 | Micro-F1 |
| DailyDialog | Daily Communication | 11118/1000/1000 | 87170/8069/7740 | 7 | Micro-F1 |
| MELD | TV Show Scripts | 1038/114/280 | 9989/1109/2610 | 7 | Weighted-F1 |
| EmoryNLP | TV Show Scripts | 659/89/79 | 7551/954/984 | 7 | Weighted-F1 |
| IEMOCAP | Emotional Dialogues | 100/20/31 | 4810/1000/1523 | 6 | Weighted-F1 |

Table 1: Dataset descriptions.

At step 2, to effectively combine all utterance representations in the context, a context-level self-attention layer is used to hierarchically learn the context-level representation $\mathbf{C}^i \in \mathbb{R}^{M \times m \times d}$ as follows:

$$\mathbf{C}^i = \mathrm{FF}(\mathrm{L}'(\mathrm{MH}(\mathrm{L}(\mathbf{X}^i), \mathrm{L}(\mathbf{X}^i), \mathrm{L}(\mathbf{X}^i)))), \qquad (13)$$

where $\mathbf{X}^i$ denotes $[\mathbf{X}'^i_{j-M}; \ldots; \mathbf{X}'^i_{j-1}]$, the concatenation of all learned utterance representations in the context.
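The two-step structure can be sketched with PyTorch's built-in multi-head attention standing in for the L, MH, L' composition of Eqs. 10-13; using `nn.MultiheadAttention` rather than the paper's exact layer decomposition is our simplification, and the class/function names are ours.

```python
import torch
import torch.nn as nn

class SelfAttnBlock(nn.Module):
    """Multi-head attention + point-wise FF, with the residual connections
    and layer norms that Eq. 10 omits for brevity."""
    def __init__(self, d, h, p):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, h, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d, p), nn.ReLU(), nn.Linear(p, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, q, kv):
        x = self.norm1(q + self.attn(q, kv, kv, need_weights=False)[0])
        return self.norm2(x + self.ff(x))

def hierarchical_context(utterances, utter_block, ctx_block):
    """utterances: (batch, M, m, d) concept-enriched context embeddings.
    Step 1: per-utterance self-attention (Eq. 10).
    Step 2: self-attention over the concatenated context (Eq. 13)."""
    b, M, m, d = utterances.shape
    x = utterances.reshape(b * M, m, d)
    x = utter_block(x, x)                  # step 1
    x = x.reshape(b, M * m, d)             # [X'_{j-M}; ...; X'_{j-1}]
    return ctx_block(x, x)                 # step 2 -> (b, M*m, d)
```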

3.6 Context-Response Cross-Attention

Finally, a context-aware concept-enriched response representation $\mathbf{R}^i \in \mathbb{R}^{m \times d}$ for conversation $X^i$ is learned by cross-attention (Bahdanau et al., 2014), which selectively attends to the concept-enriched context representation as follows:

$$\mathbf{R}^i = \mathrm{FF}(\mathrm{L}'(\mathrm{MH}(\mathrm{L}(\mathbf{X}'^i_j), \mathrm{L}(\mathbf{C}^i), \mathrm{L}(\mathbf{C}^i)))), \qquad (14)$$

where the response utterance representation $\mathbf{X}'^i_j \in \mathbb{R}^{m \times d}$ is obtained via the MH layer:

$$\mathbf{X}'^i_j = \mathrm{L}'(\mathrm{MH}(\mathrm{L}(\mathbf{X}^i_j), \mathrm{L}(\mathbf{X}^i_j), \mathrm{L}(\mathbf{X}^i_j))). \qquad (15)$$

The resulting representation $\mathbf{R}^i \in \mathbb{R}^{m \times d}$ is then fed into a max-pooling layer to learn discriminative features among the positions in the response and derive the final representation $\mathbf{O} \in \mathbb{R}^d$:

$$\mathbf{O} = \mathrm{max\_pool}(\mathbf{R}^i). \qquad (16)$$

The output probability $\mathbf{p}$ is then computed as

$$\mathbf{p} = \mathrm{softmax}(\mathbf{O}W_3 + b_3), \qquad (17)$$

where $W_3 \in \mathbb{R}^{d \times q}$ and $b_3 \in \mathbb{R}^q$ denote model parameters, and $q$ denotes the number of classes. The entire KET model is optimized in an end-to-end manner as defined in Equation 1. Our model is publicly available.[1]

[1] https://github.com/zhongpeixiang/KET
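A minimal sketch of the decoder side (Eqs. 14-17), reusing `SelfAttnBlock` from the sketch in Section 3.5; the class name `KETHead` is ours and the layer decomposition is simplified relative to the paper's L, MH, L' notation.

```python
import torch
import torch.nn as nn

class KETHead(nn.Module):
    """Response self-attention (Eq. 15), cross-attention to the context
    (Eq. 14), max-pooling (Eq. 16), and softmax classification (Eq. 17)."""
    def __init__(self, d, h, p, q_classes):
        super().__init__()
        self.resp_attn = nn.MultiheadAttention(d, h, batch_first=True)
        self.cross = SelfAttnBlock(d, h, p)     # defined in the sketch above
        self.out = nn.Linear(d, q_classes)      # W3, b3

    def forward(self, response, context):
        # response: (b, m, d) concept-enriched response embedding X^i_j
        # context:  (b, M*m, d) context representation C^i from Eq. 13
        r, _ = self.resp_attn(response, response, response)  # X'^i_j, Eq. 15
        r = self.cross(r, context)                           # R^i, Eq. 14
        o = r.max(dim=1).values                              # O, Eq. 16
        return torch.softmax(self.out(o), dim=-1)            # p, Eq. 17
```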

4 Experimental Settings

In this section we present the datasets, evaluation metrics, baselines, our model variants, and other experimental settings.

4.1 Datasets and Evaluations

We evaluate our model on the following five emotion detection datasets of various sizes and domains. Their statistics are reported in Table 1.

EC (Chatterjee et al., 2019): Three-turn Tweets. The emotion labels include happiness, sadness, anger and other.

DailyDialog (Li et al., 2017): Human-written daily communications. The emotion labels include neutral and Ekman's six basic emotions (Ekman, 1992), namely happiness, surprise, sadness, anger, disgust and fear.

MELD (Poria et al., 2018): TV show scripts collected from Friends. The emotion labels are the same as the ones used in DailyDialog.

EmoryNLP (Zahiri and Choi, 2018): TV show scripts also collected from Friends; however, its size and annotations differ from MELD. The emotion labels include neutral, sad, mad, scared, powerful, peaceful, and joyful.

IEMOCAP (Busso et al., 2008): Emotional dialogues. The emotion labels include neutral, happiness, sadness, anger, frustrated, and excited.

In terms of the evaluation metric, for EC and DailyDialog we follow (Chatterjee et al., 2019) and use the micro-averaged F1 excluding the majority class (neutral), due to their extremely unbalanced labels (the majority class accounts for over 80% of the test set). For the remaining, relatively balanced datasets, we follow (Majumder et al., 2019) and use the weighted macro-F1.
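The metric for EC and DailyDialog can be computed with scikit-learn as sketched below; the helper name and the assumption that the neutral class is encoded as label 0 are ours.

```python
from sklearn.metrics import f1_score

def micro_f1_excluding_majority(y_true, y_pred, majority_label=0):
    """Micro-averaged F1 over all classes except the majority (neutral)
    class, as used for EC and DailyDialog."""
    labels = sorted(set(y_true) - {majority_label})
    return f1_score(y_true, y_pred, labels=labels, average="micro")
```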

4.2 Baselines and Model Variants

For a comprehensive performance evaluation, we compare our model with the following baselines:

cLSTM: A contextual LSTM model. An utterance-level bidirectional LSTM is used to encode each utterance, and a context-level unidirectional LSTM is used to encode the context.


| Model | EC | DailyDialog | MELD | EmoryNLP | IEMOCAP |
| cLSTM | 0.6913 | 0.4990 | 0.4972 | 0.2601 | 0.3484 |
| CNN (Kim, 2014) | 0.7056 | 0.4934 | 0.5586 | 0.3259 | 0.5218 |
| CNN+cLSTM (Poria et al., 2017) | 0.7262 | 0.5024 | 0.5687 | 0.3289 | 0.5587 |
| BERT BASE (Devlin et al., 2018) | 0.6946 | 0.5312 | 0.5621 | 0.3315 | 0.6119 |
| DialogueRNN (Majumder et al., 2019) | 0.7405 | 0.5065 | 0.5627 | 0.3170 | **0.6121** |
| KET SingleSelfAttn (ours) | 0.7285 | 0.5192 | 0.5624 | 0.3251 | 0.5810 |
| KET StdAttn (ours) | **0.7413** | 0.5254 | 0.5682 | 0.3353 | 0.5861 |
| KET (ours) | 0.7348 | **0.5337** | **0.5818** | **0.3439** | 0.5956 |

Table 2: Performance comparisons on the five test sets. Best values are highlighted in bold.

| Dataset | M | m | d | p | h |
| EC | 2 | 30 | 200 | 100 | 4 |
| DailyDialog | 6 | 30 | 300 | 400 | 4 |
| MELD | 6 | 30 | 200 | 100 | 4 |
| EmoryNLP | 6 | 30 | 100 | 200 | 4 |
| IEMOCAP | 6 | 30 | 300 | 400 | 4 |

Table 3: Hyper-parameter settings for KET. M: context length. m: number of tokens per utterance. d: word embedding size. p: hidden size in the FF layer. h: number of heads.

CNN (Kim, 2014): A single-layer CNN with strong empirical performance. This model is trained at the utterance level without context.

CNN+cLSTM (Poria et al., 2017): A CNN is used to extract utterance features. A cLSTM is then applied to learn context representations.

BERT BASE (Devlin et al., 2018): Base version of the state-of-the-art model for sentiment classification. We treat each utterance together with its context as a single document. We limit the document length to the last 100 tokens to allow a larger batch size. We do not experiment with the large version of BERT due to the memory constraint of our GPU.

DialogueRNN (Majumder et al., 2019): The state-of-the-art model for emotion detection in textual conversations. It models both context and speaker information. The CNN features used in DialogueRNN are extracted from the carefully tuned CNN model. For datasets without speaker information, i.e., EC and DailyDialog, we use two speakers only. For MELD and EmoryNLP, which have 260 and 255 speakers, respectively, we additionally experimented with clipping the number of speakers to the most frequent ones (6 main speakers plus a universal speaker representing all other speakers) and report the best results.

KET SingleSelfAttn: We replace the hierarchical self-attention with a single self-attention layer to learn context representations. Contextual utterances are concatenated prior to the single self-attention layer.

KET StdAttn: We replace the dynamic context-aware affective graph attention with the standard graph attention (Veličković et al., 2018).

4.3 Other Experimental Settings

We preprocessed all datasets by lower-casing and tokenization using spaCy.[2] We keep all tokens in the vocabulary.[3] We use the released code for BERT BASE and DialogueRNN. For each dataset, all models are fine-tuned based on their performance on the validation set.

For our model on all datasets, we use Adam optimization (Kingma and Ba, 2014) with a batch size of 64 and a learning rate of 0.0001 throughout the training process. We use GloVe embeddings (Pennington et al., 2014) to initialize the word and concept embedding layers.[4] For the class weights in the cross-entropy loss for each dataset, we set them as the ratio of the class distribution in the validation set to the class distribution in the training set, which alleviates the problem of unbalanced datasets. The detailed hyper-parameter settings for KET are presented in Table 3.
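The class-weight scheme can be sketched as follows; how the released code handles classes absent from a split is not specified, so this version simply assumes every class appears in the training set.

```python
import numpy as np
import torch
import torch.nn as nn

def class_weights(train_labels, val_labels, num_classes):
    """Class weights as the ratio of the validation-set class distribution
    to the training-set class distribution (Section 4.3)."""
    train_dist = np.bincount(train_labels, minlength=num_classes) / len(train_labels)
    val_dist = np.bincount(val_labels, minlength=num_classes) / len(val_labels)
    return torch.tensor(val_dist / train_dist, dtype=torch.float)

# usage: criterion = nn.CrossEntropyLoss(weight=class_weights(tr, va, q))
```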

5 Result Analysis

In this section we present the model evaluation results, model analysis, and error analysis.

5.1 Comparison with Baselines

We compare the performance of KET against that of the baseline models on the five aforementioned datasets. The results are reported in Table 2. Note that our results for CNN, CNN+cLSTM and DialogueRNN on EC, MELD and IEMOCAP differ slightly from the results reported in (Majumder et al., 2019; Poria et al., 2019).

[2] https://spacy.io/
[3] We keep tokens with a minimum frequency of 2 for DailyDialog due to its large vocabulary size.
[4] We use GloVe embeddings from Magnitude Medium: https://github.com/plasticityai/magnitude


[Figure 3: two rows of line plots, one panel per dataset (EC, DailyDialog, MELD, EmoryNLP, IEMOCAP).]

Figure 3: Validation performance of KET. Top: varying context length (M). Bottom: varying sizes of random fractions of ConceptNet.

cLSTM performs reasonably well on short conversations (i.e., EC and DailyDialog), but worst on long conversations (i.e., MELD, EmoryNLP and IEMOCAP). One major reason is that learning long dependencies using gated RNNs may not be effective enough: the gradients must propagate back through a huge number of utterances and tokens in sequence, which easily leads to the vanishing gradient problem (Bengio et al., 1994). In contrast, when the utterance-level LSTM in cLSTM is replaced by features extracted by a CNN, i.e., in CNN+cLSTM, the model performs significantly better than cLSTM on long conversations, which further indicates that modelling long conversations using only RNNs may not be sufficient. BERT BASE achieves very competitive performance on all datasets except EC, due to its strong representational power from bi-directional context modelling with the Transformer. Note that BERT BASE has considerably more parameters than the other baselines and our model (110M for BERT BASE versus 4M for our model), which can be a disadvantage when deployed to devices with limited computing power and memory. The state-of-the-art DialogueRNN model performs best overall among all baselines. In particular, DialogueRNN performs better than our model on IEMOCAP, which may be attributed to its detailed speaker information for modelling the emotion dynamics of each speaker as the conversation flows.

It is encouraging to see that our KET model outperforms the baselines on most of the tested datasets. This finding indicates that our model is robust across datasets with varying training sizes, context lengths and domains. Our KET variants KET SingleSelfAttn and KET StdAttn perform comparably with the best baselines on all datasets except IEMOCAP. However, both variants perform noticeably worse than KET on all datasets except EC, validating the importance of our proposed hierarchical self-attention and dynamic context-aware affective graph attention mechanisms. One observation worth mentioning is that these two variants perform on a par with the full KET model on EC. Possible explanations are that 1) hierarchical self-attention may not be critical for modelling the short conversations in EC, and 2) the informal linguistic style of Tweets in EC, e.g., misspelled words and slang, hinders the context representation learning in our graph attention mechanism.

5.2 Model Analysis

We analyze the impact of different settings on the validation performance of KET. All results in this section are averaged over 5 random seeds.

Analysis of context length: We vary the context length M and plot model performance in Figure 3 (top). Note that EC has a maximum of only 2 contextual utterances. It is clear that incorporating context into KET improves performance on all datasets. However, adding more context contributes diminishing performance gains, or even has a negative impact, on some datasets. This phenomenon has been observed in a prior study (Su et al., 2018). One possible explanation is that incorporating long contextual information may introduce additional noise, e.g., polysemes expressing different meanings in different utterances of the same context. A more thorough investigation of this diminishing-return phenomenon is a worthwhile future direction.

Analysis of the size of ConceptNet: We vary the size of ConceptNet by randomly keeping only a fraction of its concepts when training and evaluating our model. The results are illustrated in Figure 3 (bottom). Adding more concepts consistently improves model performance before reaching a plateau, validating the importance of commonsense knowledge in detecting emotions. We may expect the performance of our KET model to improve further as ConceptNet grows.

Analysis of the relatedness-affectiveness tradeoff: We experiment with different values of λk ∈ [0, 1] (see Equation 8) for all k and report the results in Table 4. It is clear that λk has a noticeable impact on model performance. Discarding relatedness or affectiveness completely causes a significant performance drop on all datasets, with the exception of IEMOCAP. One possible reason is that conversations in IEMOCAP are emotional dialogues; therefore, the affectiveness factor in our proposed graph attention mechanism provides more discriminative power there.

| Dataset | λk = 0 | λk = 0.3 | λk = 0.7 | λk = 1 |
| EC | 0.7345 | 0.7397 | 0.7426 | 0.7363 |
| DailyDialog | 0.5365 | 0.5432 | 0.5451 | 0.5383 |
| MELD | 0.5321 | 0.5395 | 0.5366 | 0.5306 |
| EmoryNLP | 0.3528 | 0.3624 | 0.3571 | 0.3488 |
| IEMOCAP | 0.5344 | 0.5367 | 0.5314 | 0.5251 |

Table 4: Analysis of the relatedness-affectiveness tradeoff on the validation sets. Each column corresponds to a fixed λk for all concepts (see Equation 8).

Ablation study: We conduct an ablation study to investigate the contributions of context and knowledge, as reported in Table 5. It is clear that both context and knowledge are essential to the strong performance of KET on all datasets. Note that removing context has a greater impact on long conversations than on short ones, which is expected because more contextual information is lost in long conversations.

| Dataset | KET | -context | -knowledge |
| EC | 0.7451 | 0.7343 | 0.7359 |
| DailyDialog | 0.5544 | 0.5282 | 0.5402 |
| MELD | 0.5401 | 0.5177 | 0.5248 |
| EmoryNLP | 0.3712 | 0.3564 | 0.3553 |
| IEMOCAP | 0.5389 | 0.4976 | 0.5217 |

Table 5: Ablation study for KET on the validation sets.

5.3 Error Analysis

Despite the strong performance of our model, it still fails to detect certain emotions on certain datasets. We rank the F1 score of each emotion per dataset and investigate the emotions with the worst scores. We found that disgust and fear are generally difficult to detect and differentiate. For example, the F1 score of the fear emotion in MELD is as low as 0.0667. One possible cause is that these two emotions are intrinsically similar: the VAD values of both have low valence, high arousal and low dominance (Mehrabian, 1996). Another cause is the small amount of data available for these two emotions. How to differentiate intrinsically similar emotions and how to effectively detect emotions from limited data are two challenging directions in this field.

6 Conclusion

We present a Knowledge-Enriched Transformer to detect emotions in textual conversations. Our model learns structured conversation representations via hierarchical self-attention and dynamically refers to external, context-aware, and emotion-related knowledge entities from knowledge bases. Experimental analysis demonstrates that both contextual information and commonsense knowledge are beneficial to model performance. The tradeoff between relatedness and affectiveness plays an important role as well. In addition, our model outperforms the state-of-the-art models on most of the tested datasets of varying sizes and domains.

Given that emotion lexicons similar to NRC VAD exist in other languages and that ConceptNet is a multilingual knowledge base, our model can be easily adapted to other languages. In addition, given that NRC VAD is the only emotion-specific component, our model can be adapted into a generic model for conversation analysis.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments. This research is supported, in part, by the National Research Foundation, Prime Minister's Office, Singapore under its AI Singapore Programme (Award Number: AISG-GC-2019-003) and under its NRF Investigatorship Programme (NRFI Award No. NRF-NRFI05-2019-0002). This research is also supported, in part, by the Alibaba-NTU Singapore Joint Research Institute, Nanyang Technological University, Singapore.


References

Muhammad Abdul-Mageed and Lyle Ungar. 2017. EmoNet: Fine-grained emotion detection with gated recurrent neural networks. In ACL, volume 1, pages 718–728.

Nabiha Asghar, Pascal Poupart, Jesse Hoey, Xin Jiang, and Lili Mou. 2018. Affective neural response generation. In ECIR, pages 154–166. Springer.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Anil Bandhakavi, Nirmalie Wiratunga, Stewart Massie, and Deepak Padmanabhan. 2017. Lexicon generation for emotion detection from text. IEEE Intelligent Systems, 32(1):102–108.

Yoshua Bengio, Patrice Simard, Paolo Frasconi, et al. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335.

Ankush Chatterjee, Umang Gupta, Manoj Kumar Chinnakotla, Radhakrishnan Srikanth, Michel Galley, and Puneet Agrawal. 2019. Understanding emotions in text using deep learning and big data. Computers in Human Behavior, 93:309–317.

Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. In EMNLP, pages 551–561.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, pages 1724–1734.

Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.

Laurence Devillers, Ioana Vasilescu, and Lori Lamel. 2002. Annotation and detection of emotion in a task-oriented human-human dialog corpus. In Proceedings of ISLE Workshop.

Laurence Devillers and Laurence Vidrascu. 2006. Real-life emotions detection with lexical and paralinguistic cues on human-human call center dialogs. In Ninth International Conference on Spoken Language Processing.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of Wikipedia: Knowledge-powered conversational agents. In ICLR.

Paul Ekman. 1992. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200.

Christiane Fellbaum. 2012. WordNet. The Encyclopedia of Applied Linguistics.

Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. 2018. A knowledge-grounded neural conversation model. In AAAI.

Sangdo Han, Jeesoo Bang, Seonghan Ryu, and Gary Geunbae Lee. 2015. Exploiting knowledge base to generate responses for natural language dialog listening agents. In Proceedings of the 16th SIGDIAL, pages 129–133.

Yanchao Hao, Yuanzhe Zhang, Kang Liu, Shizhu He, Zhanyi Liu, Hua Wu, and Jun Zhao. 2017. An end-to-end model for question answering over knowledge base with cross-attention combining global knowledge. In ACL, pages 221–231.

Devamanyu Hazarika, Soujanya Poria, Rada Mihalcea, Erik Cambria, and Roger Zimmermann. 2018a. ICON: Interactive conversational memory network for multimodal emotion detection. In EMNLP, pages 2594–2604.

Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe Morency, and Roger Zimmermann. 2018b. Conversational memory network for emotion recognition in dyadic dialogue videos. In NAACL, volume 1, pages 2122–2132.

Junqing He, Bing Wang, Mingming Fu, Tianqi Yang, and Xuemin Zhao. 2019. Hierarchical attention and knowledge matching networks with information enhancement for end-to-end task-oriented dialog systems. IEEE Access, 7:18871–18883.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Yajie Hu, Xiaoou Chen, and Deshun Yang. 2009. Lyric-based song emotion detection with affective lexicon and fuzzy clustering method. In ISMIR, pages 123–128.

Eva Hudlicka. 2011. Guidelines for designing computational models of emotions. International Journal of Synthetic Emotions, 2(1):26–79.

Hamed Khanpour and Cornelia Caragea. 2018. Fine-grained emotion detection in health-related online posts. In EMNLP, pages 1160–1166.

Chloe Kiddon, Luke Zettlemoyer, and Yejin Choi. 2016. Globally coherent text generation with neural checklist models. In EMNLP, pages 329–339.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan, Mirella Lapata, and Hannaneh Hajishirzi. 2019. Text generation from knowledge graphs with graph transformers. In NAACL.

Abhishek Kumar, Daisuke Kawahara, and Sadao Kurohashi. 2018. Knowledge-enriched two-layered attention network for sentiment analysis. In NAACL, volume 2, pages 253–258.

Chul Min Lee and Shrikanth S. Narayanan. 2005. Toward detecting emotions in spoken dialogs. IEEE Transactions on Speech and Audio Processing, 13(2):293–303.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In IJCNLP, volume 1, pages 986–995.

Shuman Liu, Hongshen Chen, Zhaochun Ren, Yang Feng, Qun Liu, and Dawei Yin. 2018. Knowledge diffusion for neural dialogue generation. In ACL, pages 1489–1498.

Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. 2018. Mem2Seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In ACL, volume 1, pages 1468–1478.

Navonil Majumder, Soujanya Poria, Devamanyu Hazarika, Rada Mihalcea, Alexander Gelbukh, and Erik Cambria. 2019. DialogueRNN: An attentive RNN for emotion detection in conversations. In AAAI.

Albert Mehrabian. 1996. Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in temperament. Current Psychology, 14(4):261–292.

Todor Mihaylov and Anette Frank. 2018. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In ACL, pages 821–832.

Nikita Moghe, Siddhartha Arora, Suman Banerjee, and Mitesh M. Khapra. 2018. Towards exploiting background knowledge for building conversation systems. In EMNLP, pages 2322–2332.

Saif Mohammad. 2018a. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In ACL, pages 174–184.

Saif M. Mohammad. 2018b. Word affect intensities. In LREC.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up?: Sentiment classification using machine learning techniques. In EMNLP, pages 79–86. Association for Computational Linguistics.

Prasanna Parthasarathi and Joelle Pineau. 2018. Extending neural generative conversational model using external knowledge sources. In EMNLP, pages 690–695.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.

Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. 2017. Context-dependent sentiment analysis in user-generated videos. In ACL, volume 1, pages 873–883.

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2018. MELD: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508.

Soujanya Poria, Navonil Majumder, Rada Mihalcea, and Eduard Hovy. 2019. Emotion recognition in conversation: Research challenges, datasets, and recent advances. arXiv preprint arXiv:1905.02947.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Armin Seyeditabari, Narges Tabari, and Wlodek Zadrozny. 2018. Emotion detection in text: A review. arXiv preprint arXiv:1806.00674.

Dinghan Shen, Guoyin Wang, Wenlin Wang, Martin Renqiang Min, Qinliang Su, Yizhe Zhang, Chunyuan Li, Ricardo Henao, and Lawrence Carin. 2018. Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms. In ACL, pages 440–450.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI.

Shang-Yu Su, Pei-Chieh Yuan, and Yun-Nung Chen. 2018. How time matters: Learning time-decay attention for contextual spoken language understanding in dialogues. In NAACL, volume 1, pages 2133–2142.

Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, and William Cohen. 2018. Open domain question answering using early fusion of knowledge bases and text. In EMNLP, pages 4231–4242.

Maite Taboada, Julian Brooke, Milan Tofiloski, Kimberly Voll, and Manfred Stede. 2011. Lexicon-based methods for sentiment analysis. Computational Linguistics, 37(2):267–307.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, pages 5998–6008.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In ICLR.

Sida Wang and Christopher D. Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In ACL, pages 90–94. Association for Computational Linguistics.

Chien-Sheng Wu, Richard Socher, and Caiming Xiong. 2019. Global-to-local memory pointer networks for task-oriented dialogue. In ICLR.

Tom Young, Erik Cambria, Iti Chaturvedi, Hao Zhou, Subham Biswas, and Minlie Huang. 2018. Augmenting end-to-end dialogue systems with commonsense knowledge. In AAAI.

Sayyed M. Zahiri and Jinho D. Choi. 2018. Emotion detection on TV show transcripts with sequence-based convolutional neural networks. In Workshops at AAAI.

Jiacheng Zhang, Huanbo Luan, Maosong Sun, Feifei Zhai, Jingfang Xu, Min Zhang, and Yang Liu. 2018a. Improving the Transformer translation model with document-level context. In EMNLP, pages 533–542.

Yuxiang Zhang, Jiamei Fu, Dongyu She, Ying Zhang, Senzhang Wang, and Jufeng Yang. 2018b. Text emotion distribution learning via multi-task convolutional neural network. In IJCAI, pages 4595–4601.

Peixiang Zhong and Chunyan Miao. 2019. ntuer at SemEval-2019 task 3: Emotion classification with word and sentence representations in RCNN. In SemEval, pages 282–286.

Peixiang Zhong, Di Wang, and Chunyan Miao. 2019. An affect-rich neural conversational model with biased attention and weighted cross-entropy loss. In AAAI, pages 7492–7500.

Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2018a. Emotional chatting machine: Emotional conversation generation with internal and external memory. In AAAI.

Hao Zhou, Tom Young, Minlie Huang, Haizhou Zhao, Jingfang Xu, and Xiaoyan Zhu. 2018b. Commonsense knowledge aware conversation generation with graph attention. In IJCAI, pages 4623–4629.

Xiangyang Zhou, Lu Li, Daxiang Dong, Yi Liu, Ying Chen, Wayne Xin Zhao, Dianhai Yu, and Hua Wu. 2018c. Multi-turn response selection for chatbots with deep attention matching network. In ACL, volume 1, pages 1118–1127.