
Neural Response Generation with Meta-Words

Can Xu1, Wei Wu1*, Chongyang Tao2, Huang Hu1, Matt Schuerman3, and Ying Wang3

1 Microsoft Corporation, Beijing, China
2 Institute of Computer Science and Technology, Peking University, Beijing, China
3 Microsoft Corporation, Redmond, Washington

1,3 {caxu,wuwei,huahu,matthes,Ying.Wang}@microsoft.com
2 [email protected]

Abstract

We present open domain response generation with meta-words. A meta-word is a structured record that describes various attributes of a response, and thus allows us to explicitly model the one-to-many relationship within open domain dialogues and perform response generation in an explainable and controllable manner. To incorporate meta-words into generation, we enhance the sequence-to-sequence architecture with a goal tracking memory network that formalizes meta-word expression as a goal and manages the generation process to achieve the goal with a state memory panel and a state controller. Experimental results on two large-scale datasets indicate that our model can significantly outperform several state-of-the-art generation models in terms of response relevance, response diversity, accuracy of one-to-many modeling, accuracy of meta-word expression, and human evaluation.

1 Introduction

Human-machine conversation is a fundamental problem in NLP. Traditional research focuses on building task-oriented dialog systems (Young et al., 2013) to achieve specific user goals such as restaurant reservation through limited turns of dialogues within specific domains. Recently, building a chatbot for open domain conversation (Vinyals and Le, 2015) has attracted increasing attention, not only owing to the availability of a large amount of human-human conversation data on the internet, but also because of the success of such systems in real products such as the social bot XiaoIce (Shum et al., 2018) from Microsoft.

A common approach to implementing a chatbot is to learn a response generation model within an encoder-decoder framework (Vinyals and Le,

∗Corresponding author.

Message: last week I have a nice trip to New York!
Meta-word: Act: yes-no question | Len: 8 | Copy: true | Utts: false | Spe: medium
Response 1: Is New York more expensive than California?
Meta-word: Act: wh-question | Len: 17 | Copy: false | Utts: true | Spe: high
Response 2: Cool, sounds great! What is the tallest building in this city, Chrysler building?
Meta-word: Act: statement | Len: 13 | Copy: false | Utts: true | Spe: low
Response 3: I don't know what you are talking about. But it seems good.

Table 1: An example of response generation with meta-words. An underlined word is copied from the message, and a word in bold corresponds to high specificity.

2015; Shang et al., 2015). Although the architecture can naturally model the correspondence between a message and a response, and is easy to extend to handle conversation history (Serban et al., 2016; Xing et al., 2018) and various constraints (Li et al., 2016; Zhou et al., 2018), it is notorious for generating safe responses such as "I don't know" and "me too" in practice. A plausible reason for the "safe response" issue is that there exists a one-to-many relationship between messages and responses: one message could correspond to many valid responses and vice versa (Zhang et al., 2018a). The vanilla encoder-decoder architecture is prone to memorizing high-frequency patterns in the data, and thus tends to generate similar and trivial responses for different messages. A typical method for modeling the relationship between messages and responses is to introduce latent variables into the encoder-decoder framework (Serban et al., 2017; Zhao et al., 2017; Park et al., 2018). It is, however, difficult to explain what relationship a latent variable represents, nor can one control the responses to generate by manipulating the latent variable. Although a recent study (Zhao et al., 2018) replaces continuous latent variables with discrete ones, it still needs a lot of post-hoc human effort to explain the meaning of the variables.

In this work, we aim to model the one-to-many relationship in open domain dialogues in an explainable and controllable way. Instead of using


latent variables, we consider explicitly representing the relationship between a message and a response with meta-words1. A meta-word is a structured record that characterizes the response to generate. The record consists of a group of variables, each describing an attribute of the response. Each variable takes the form (key, type, value), where "key" defines the attribute, "value" specifies the attribute, and "type" ∈ {r, c} indicates whether the variable is real-valued (r) or categorical (c). Given a message, a meta-word corresponds to one kind of relationship between the message and a response, and by manipulating the meta-word (e.g., values of variables or combinations of variables), one can control responses in a broad way. Table 1 gives an example of response generation with various meta-words, where "Act", "Len", "Copy", "Utts", and "Spe" are variables of a meta-word and refer to dialogue act, response length (including punctuation marks), whether content is copied from the message, whether the response is made up of multiple utterances, and specificity level (Zhang et al., 2018a), respectively2. Advantages of response generation with meta-words are four-fold: (1) the generation model is explainable, as the meta-words inform the model, developers, and even end users what responses they will have before the responses are generated; (2) the generation process is controllable: the meta-word system acts as an interface that allows developers to customize responses by tailoring the set of attributes; (3) the generation approach is general: by taking dialogue acts (Zhao et al., 2017), personas (Li et al., 2016), emotions (Zhou et al., 2018), and specificity (Zhang et al., 2018a) as attributes, our approach can address the problems in the existing literature in a unified form; and (4) generation-based open domain dialogue systems now become scalable, since the model supports feature engineering on meta-words.
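To make the record concrete, the sketch below writes the meta-word of Response 2 in Table 1 as a list of (key, type, value) variables; the class and field names are illustrative only and are not part of the model.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class MetaVariable:
    """One variable of a meta-word: (key, type, value)."""
    key: str                                # attribute name, e.g. "Act" or "Len"
    type: str                               # "c" for categorical, "r" for real-valued
    value: Union[str, int, float, bool]     # e.g. "wh-question" or 0.24

# Meta-word of Response 2 in Table 1 (values shown as categorical, following footnote 2).
meta_word: List[MetaVariable] = [
    MetaVariable("Act", "c", "wh-question"),
    MetaVariable("Len", "c", 17),
    MetaVariable("Copy", "c", False),
    MetaVariable("Utts", "c", True),
    MetaVariable("Spe", "c", "high"),
]
```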

The challenge of response generation with meta-words lies in how to simultaneously ensure relevance of a response to the message and fidelity of the response to the meta-word. To tackle the challenge, we propose equipping the vanilla sequence-to-sequence architecture with a novel goal tracking memory network (GTMN) and crafting a new loss item for learning GTMN. GTMN sets meta-word expression as a goal of generation and dynamically monitors the expression of each variable in the meta-word during the decoding process. Specifically, GTMN consists of a state memory panel and a state controller, where the former records the status of meta-word expression and the latter manages information exchange between the state memory panel and the decoder. In decoding, the state controller updates the state memory panel according to the generated sequence, and reads out difference vectors that represent the residual of the meta-word. The next word from the decoder is predicted based on attention over the message representations, attention over the difference vectors, and the word predicted in the last step. In learning, besides the negative log likelihood, we further propose minimizing a state update loss that can directly supervise the learning of the memory network under the ground truth. We also propose a meta-word prediction method to make the proposed approach complete in practice.

1 We start from single messages. It is easy to extend the proposed approach to handle conversation history.

2 For ease of understanding, we transformed "copy ratio" and "specificity" used in our experiments into categorical variables.

We test the proposed model on two large-scale open domain conversation datasets built from Twitter and Reddit, and compare the model with several state-of-the-art generation models in terms of response relevance, response diversity, accuracy of one-to-many modeling, accuracy of meta-word expression, and human judgment. Evaluation results indicate that our model can significantly outperform the baseline models on most of the metrics on both datasets.

Our contributions in this paper are three-fold: (1) proposal of explicitly modeling the one-to-many relationship and explicitly controlling response generation in open domain dialogues with multiple variables (a.k.a., a meta-word); (2) proposal of a goal tracking memory network that naturally allows a meta-word to guide response generation; and (3) empirical verification of the effectiveness of the proposed model on two large-scale datasets.

2 Related Work

Neural response generation models are built upon the encoder-decoder framework (Sutskever et al., 2014). Starting from the basic sequence-to-sequence with attention architecture (Vinyals and Le, 2015; Shang et al., 2015), extensions under the framework have been made to combat the "safe response" problem (Mou et al., 2016; Tao et al., 2018); to model the hierarchy of conversation history (Serban et al., 2016, 2017; Xing et al., 2018);


to generate responses with specific personas or emotions (Li et al., 2016; Zhou et al., 2018); and to speed up response decoding (Wu et al., 2018). In this work, we also aim to tackle the "safe response" problem, but in an explainable, controllable, and general way. Rather than learning with a different objective (e.g., (Li et al., 2015)), generating from latent variables (e.g., (Zhao et al., 2017)), or introducing extra content (e.g., (Xing et al., 2017)), we explicitly describe the relationship between message-response pairs by defining meta-words and express the meta-words in responses through a goal tracking memory network. Our method allows developers to manipulate the generation process by playing with the meta-words and provides a general solution to response generation with specific attributes such as dialogue acts.

Recently, controlling specific aspects in text generation is drawing increasing attention (Hu et al., 2017; Logeswaran et al., 2018). In the context of dialogue generation, Wang et al. (2017) propose steering response style and topic with human-provided topic hints and fine-tuning on small "scenting" data; Zhang et al. (2018a) propose learning to control the specificity of responses; and very recently, See et al. (2019) investigate how controllable attributes of responses affect human engagement with methods of conditional training and weighted decoding. Our work is different in that (1) rather than playing with a single variable like specificity or topics, our model simultaneously controls multiple variables and can take controlling with specificity or topics as special cases; and (2) we manage attribute expression in response generation with a principled approach rather than simple heuristics as in (See et al., 2019), and thus our model can achieve better accuracy in terms of attribute expression in generated responses.

3 Problem Formalization

Suppose that we have a dataset $D = \{(X_i, M_i, Y_i)\}_{i=1}^{N}$, where $X_i$ is a message, $Y_i$ is a response, and $M_i = (m_{i,1}, \ldots, m_{i,l})$ is a meta-word with $m_{i,j} = (m_{i,j}.k, m_{i,j}.t, m_{i,j}.v)$ the $j$-th variable, and $m_{i,j}.k$, $m_{i,j}.t$, and $m_{i,j}.v$ the key, the type, and the value of the variable, respectively. Our goal is to estimate a generation probability $P(Y \mid X, M)$ from $D$, so that given a new message $X$ with a pre-defined meta-word $M$, one can generate responses for $X$ according to $P(Y \mid X, M)$. In this work, we assume that $M$ is given as input for response generation. Later, we will describe how to obtain $M$ from $X$.

4 Response Generation with Meta-Words

In this section, we present our model for response generation with meta-words. We start from an overview of the model, and then dive into details of the goal tracking memory enhanced decoding.

4.1 Model Overview

Figure 1 illustrates the architecture of our goal tracking memory enhanced sequence-to-sequence model (GTMNES2S). The model equips the encoder-decoder structure with a goal tracking memory network that comprises a state memory panel and a state controller. Before response decoding, the encoder represents an input message as a hidden sequence through a bi-directional recurrent neural network with gated recurrent units (biGRU) (Chung et al., 2014), and the goal tracking memory network is initialized by a meta-word. Then, during response decoding, the state memory panel tracks the expression of the meta-word and gets updated by the state controller. The state controller manages the process of decoding at each step by reading out the status of meta-word expression from the state memory panel and informing the decoder of the difference between the status and the target of meta-word expression. Based on the message representation, the information provided by the state controller, and the generated word sequence, the decoder predicts the next word of the response.

In the following section, we will elaborate on the goal tracking memory enhanced decoding, which is the key to having a response that is relevant to the message and at the same time accurately reflects the meta-word.

4.2 Goal Tracking Memory Network

The goal tracking memory network (GTMN) dynamically controls response generation according to the given meta-word via cooperation of the state memory panel and the state controller. It keeps the decoder informed of how much of the meta-word has been expressed so far.

Figure 1: Architecture of the goal tracking memory enhanced sequence-to-sequence model. (The diagram shows the biGRU encoder over the input message, the state memory panel with attribute keys and values for Len, Act, Copy, Spe, and Utts, the state controller with its state update and difference reading operations, and the decoder; graphic omitted.)

For local attributes such as response length3, this dynamic control strategy is more reasonable than static strategies such as feeding the embedding of attributes to the decoder, as in the conditional training of (See et al., 2019). This is because if the goal is to generate a response with 5 words and 2 words have been decoded, then the decoder needs to know that there are 3 words left, rather than always memorizing that 5 words should be generated.

3 Local attributes refer to attributes whose values are location sensitive during response generation. For example, the length of the remaining sequence varies after each step of decoding. In contrast, some attributes, such as dialogue acts, are global attributes, as they are reflected by the entire response.

4.2.1 State Memory Panel

Suppose that the given meta-word $M$ consists of $l$ variables. The state memory panel $\mathcal{M}$ is then made up of $l$ memory cells $\{\mathcal{M}_i\}_{i=1}^{l}$, where $\forall i \in \{1, \ldots, l\}$, $\mathcal{M}_i$ takes the form (key, goal, value), denoted as $\mathcal{M}_i.k$, $\mathcal{M}_i.g$, and $\mathcal{M}_i.v$ respectively. We define $Rep(\cdot)$ as a representation-getting function, formulated as

$$Rep(m_i.k) = B(m_i.k), \qquad Rep(m_i.v) = \begin{cases} \sigma(B(m_i.v)), & m_i.t = c \\ m_i.v \times \sigma(B(m_i.k)), & m_i.t = r, \end{cases} \qquad (1)$$

where $m_i$ is the $i$-th variable of $M$, $\sigma(\cdot)$ is a sigmoid function, and $B(\cdot)$ returns the bag-of-words representation for a piece of text. $\mathcal{M}_i$ is then initialized as:

$$\mathcal{M}_i.k = Rep(m_i.k), \qquad \mathcal{M}_i.g = Rep(m_i.v), \qquad \mathcal{M}_i.v_0 = \mathbf{0}. \qquad (2)$$

$\mathcal{M}_i.k \in \mathbb{R}^d$ stores the key of $m_i$, and $\mathcal{M}_i.g \in \mathbb{R}^d$ stores the goal for the expression of $m_i$ in generation. Thus, these two items are frozen during decoding. $\mathcal{M}_i.v \in \mathbb{R}^d$ refers to the gray part of the progress bar in Figure 1, and represents the progress of the expression of $m_i$ during decoding. Hence, it is updated by the state controller after each step of decoding.
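A minimal PyTorch sketch of Equations (1) and (2) follows; the paper's own implementation is in TensorFlow, and the bag-of-words function, the id-list encoding of keys and values, and the dictionary layout of a memory cell are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def bow(word_ids, emb: nn.Embedding):
    """B(.): bag-of-words representation, here the sum of word embeddings."""
    return emb(torch.tensor(word_ids)).sum(dim=0)                # shape (d,)

def rep_value(var, emb):
    """Eq. (1): Rep(m_i.v) for categorical (c) and real-valued (r) variables."""
    if var["type"] == "c":
        return torch.sigmoid(bow(var["value_ids"], emb))
    return var["value"] * torch.sigmoid(bow(var["key_ids"], emb))

def init_state_memory(meta_word, emb):
    """Eq. (2): one cell (key, goal, value) per variable; the value starts at zero."""
    cells = []
    for var in meta_word:
        cells.append({
            "k": bow(var["key_ids"], emb),          # Rep(m_i.k), frozen during decoding
            "g": rep_value(var, emb),               # Rep(m_i.v), the expression goal, frozen
            "v": torch.zeros(emb.embedding_dim),    # expression status, updated each step
        })
    return cells
```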

4.2.2 State Controller

As illustrated by Figure 1, the state controller sits between the encoder and the decoder, and manages the interaction between the state memory panel and the decoder. Let $s_t$ be the hidden state of the decoder at step $t$. The state controller first updates $\mathcal{M}_i.v_{t-1}$ to $\mathcal{M}_i.v_t$ based on $s_t$ with a state update operation. It then obtains the difference between $\mathcal{M}_i.g$ and $\mathcal{M}_i.v_t$ from the state memory panel via a difference reading operation, and feeds the difference to the decoder to predict the $t$-th word of the response.

State Update Operation. The operation includes SUB and ADD as two sub-operations. Intuitively, when the status of expression surpasses the goal, the state controller should execute the SUB operation (which stands for "subtract") to trim the status representation; when the status of expression is inadequate, the state controller should use the ADD operation to enhance the status representation. Technically, rather than comparing $\mathcal{M}_i.v_{t-1}$ with $\mathcal{M}_i.g$ and adopting an operation accordingly, we propose a soft way to update the state memory panel with SUB and ADD, since (1) it is difficult to identify over-expression or under-expression by comparing two distributed representations; and (2) the hard way would break the differentiability of the model. Specifically, we define $g_t \in \mathbb{R}^{d \times l}$ as a gate to control the use of SUB or ADD, where $g_t(i) \in \mathbb{R}^d$ is the $i$-th element of $g_t$. Let $\Delta^{SUB}_t(i) \in \mathbb{R}^d$ and $\Delta^{ADD}_t(i) \in \mathbb{R}^d$ be the changes from the SUB operation and the ADD operation respectively. $\mathcal{M}_i.v_{t-1}$ is then updated as

$$\hat{V}_t(i) = \mathcal{M}_i.v_{t-1} - g_t(i) \circ \Delta^{SUB}_t(i), \qquad \mathcal{M}_i.v_t = \hat{V}_t(i) + (1 - g_t(i)) \circ \Delta^{ADD}_t(i), \qquad (3)$$

where $\circ$ denotes element-wise multiplication, and $g_t(i)$, $\Delta^{SUB}_t(i)$, and $\Delta^{ADD}_t(i)$ are defined as

$$g_t(i) = \sigma(W_g S_t(i) + b_g) \qquad (4)$$

and

$$\begin{bmatrix} \Delta^{SUB}_t(i) \\ \Delta^{ADD}_t(i) \end{bmatrix} = \sigma\!\left(\begin{bmatrix} W_{SUB} \\ W_{ADD} \end{bmatrix} S_t(i) + \begin{bmatrix} b_{SUB} \\ b_{ADD} \end{bmatrix}\right) \qquad (5)$$

respectively, with $W_g \in \mathbb{R}^{d \times 3d}$, $b_g \in \mathbb{R}^d$, $W_{SUB}, W_{ADD} \in \mathbb{R}^{d \times 3d}$, and $b_{SUB}, b_{ADD} \in \mathbb{R}^d$ parameters, and $S_t(i) = \mathcal{M}_i.k \oplus \mathcal{M}_i.v_{t-1} \oplus s_t$, where $\oplus$ is the concatenation operator.
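The soft SUB/ADD update can be sketched as follows (PyTorch, continuing the cell layout above); treating each of $W_g$, $W_{SUB}$, and $W_{ADD}$ as a linear layer over the concatenated $S_t(i)$ is an assumption consistent with Equations (3)-(5), not the authors' code.

```python
import torch
import torch.nn as nn

class StateUpdate(nn.Module):
    """Soft SUB/ADD update of one memory cell's value (Eqs. 3-5), a sketch."""
    def __init__(self, d):
        super().__init__()
        self.gate = nn.Linear(3 * d, d)   # W_g, b_g
        self.sub = nn.Linear(3 * d, d)    # W_SUB, b_SUB
        self.add = nn.Linear(3 * d, d)    # W_ADD, b_ADD

    def forward(self, cell, s_t):
        # S_t(i) = M_i.k ⊕ M_i.v_{t-1} ⊕ s_t
        S = torch.cat([cell["k"], cell["v"], s_t], dim=-1)
        g = torch.sigmoid(self.gate(S))             # Eq. (4)
        delta_sub = torch.sigmoid(self.sub(S))      # Eq. (5)
        delta_add = torch.sigmoid(self.add(S))
        v_hat = cell["v"] - g * delta_sub           # Eq. (3), SUB part
        cell["v"] = v_hat + (1.0 - g) * delta_add   # Eq. (3), ADD part
        return cell["v"]
```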

Difference Reading Operation. For each variable in the meta-word $M$, this operation represents the difference between the status of expression and the goal of expression as a vector, and then applies an attention mechanism to the vectors to indicate to the decoder the importance of each variable in the generation of the next word. Formally, let $d^t_i \in \mathbb{R}^{2d}$ be the difference vector for $m_i \in M$ at step $t$; $d^t_i$ is defined as

$$d^t_i = (\mathcal{M}_i.g - \mathcal{M}_i.v_t) \oplus (\mathcal{M}_i.g \circ \mathcal{M}_i.v_t). \qquad (6)$$

With $(d^t_1, \ldots, d^t_l)$ as a difference memory, the difference reading operation then takes $s_t$ as a query vector and calculates attention over the memory as

$$o_t = \sum_{i=1}^{l} a^t_i \cdot (U d^t_i), \qquad a^t_i = \mathrm{softmax}\big((s_t)^{\top} (U d^t_i)\big), \qquad (7)$$

where $(a^t_1, \ldots, a^t_l)$ are attention weights, and $U \in \mathbb{R}^{d \times 2d}$ is a parameter.
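Below is a matching sketch of the difference reading operation; stacking the per-variable difference vectors and realizing $U$ as a single bias-free linear layer are the only assumptions beyond Equations (6)-(7).

```python
import torch
import torch.nn as nn

class DifferenceReading(nn.Module):
    """Attention over per-variable difference vectors (Eqs. 6-7), a sketch."""
    def __init__(self, d):
        super().__init__()
        self.U = nn.Linear(2 * d, d, bias=False)   # U in Eq. (7)

    def forward(self, cells, s_t):
        # Eq. (6): d_i^t = (goal - value) ⊕ (goal ∘ value), one vector per variable.
        diffs = torch.stack([torch.cat([c["g"] - c["v"], c["g"] * c["v"]], dim=-1)
                             for c in cells])          # (l, 2d)
        proj = self.U(diffs)                           # (l, d)
        attn = torch.softmax(proj @ s_t, dim=0)        # Eq. (7), attention weights
        return (attn.unsqueeze(-1) * proj).sum(dim=0)  # o_t, shape (d,)
```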

4.3 Response Decoding

In decoding, the hidden state $s_t$ is calculated by $\mathrm{GRU}(s_{t-1}, [e(y_{t-1}) \oplus C_t])$, where $e(y_{t-1}) \in \mathbb{R}^d$ is the embedding of the word predicted at step $t-1$, and $C_t$ is a context vector obtained from attention over the hidden states of the input message $X$ given by the biGRU-based encoder. Let $H_X = (h_{X,1}, \ldots, h_{X,T_x})$ be the hidden states of $X$; $C_t$ is then calculated via

$$C_t = \sum_{j=1}^{T_x} \alpha_{t,j}\, h_{X,j}, \qquad \alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{k=1}^{T_x} \exp(e_{t,k})}, \qquad e_{t,j} = U_d^{\top} \tanh(W_s s_{t-1} + W_h h_{X,j} + b_d), \qquad (8)$$

where $U_d$, $W_s$, $W_h$, and $b_d$ are parameters, and $s_{t-1}$ is the hidden state of the decoder at step $t-1$.

With the hidden state $s_t$ and the difference vector $o_t$ returned by the state controller, the probability distribution for predicting the $t$-th word of the response is given by

$$p(y_t) = \mathrm{softmax}(W_p\,[e(y_t) \oplus o_t \oplus s_t] + b_p), \qquad (9)$$

where $y_t$ is the $t$-th word of the response with $e(y_t)$ its embedding, and $W_p$ and $b_p$ are parameters.
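A single decoding step can then be sketched as below (PyTorch, batching omitted for clarity); the sketch feeds the embedding of the previously predicted word into the projection of Equation (9), which is how the word-embedding term would be realized at inference time, and the layer names are assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One decoding step (Eqs. 8-9): message attention, GRU update, output distribution."""
    def __init__(self, d, vocab_size):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)
        self.cell = nn.GRUCell(2 * d, d)        # input is e(y_{t-1}) ⊕ C_t
        self.Ws = nn.Linear(d, d)
        self.Wh = nn.Linear(d, d)
        self.Ud = nn.Linear(d, 1, bias=False)
        self.Wp = nn.Linear(3 * d, vocab_size)  # output projection in Eq. (9)

    def attention(self, s_prev, H):
        # Eq. (8): additive attention over the encoder states H of shape (T_x, d).
        e = self.Ud(torch.tanh(self.Ws(s_prev) + self.Wh(H))).squeeze(-1)
        alpha = torch.softmax(e, dim=0)
        return (alpha.unsqueeze(-1) * H).sum(dim=0)      # context vector C_t

    def forward(self, y_prev, s_prev, H, o_t):
        # y_prev: 0-dim LongTensor holding the previously predicted word id.
        C_t = self.attention(s_prev, H)
        gru_in = torch.cat([self.emb(y_prev), C_t], dim=-1).unsqueeze(0)
        s_t = self.cell(gru_in, s_prev.unsqueeze(0)).squeeze(0)
        logits = self.Wp(torch.cat([self.emb(y_prev), o_t, s_t], dim=-1))  # Eq. (9)
        return torch.softmax(logits, dim=-1), s_t
```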

5 Learning Method

To perform online response generation with meta-words, we need to (1) estimate the parameters of GTMNES2S by minimizing a loss function; and (2) learn a model to predict meta-words for online messages.

5.1 Loss for Model Learning

The first loss item is the negative log likelihood (NLL) of $D$, which is formulated as

$$L_{NLL}(\Theta) = -\frac{1}{N}\sum_{i=1}^{N} \log P(Y_i \mid X_i, M_i), \qquad (10)$$

where $\Theta$ is the set of parameters of GTMNES2S. By minimizing NLL alone, the supervision signals in $D$ may not sufficiently flow to GTMN, as GTMN is nested within response decoding. Thus, besides NLL, we propose a state update loss that directly supervises the learning of GTMN with $D$. The idea is to minimize the distance between the ground truth status of meta-word expression and the status stored in the state memory panel. Suppose that $y_{1:t}$ is the segment of response $Y$ generated up to step $t$. Then $\forall m_i \in M$, we consider two cases: (1) there exists an $F_i(\cdot)$ such that $F_i(y_{1:t})$ maps $y_{1:t}$ to the space of $m_i.v$; as an example, response length belongs to this case with $F_i(y_{1:t}) = t$. (2) It is hard to define an $F_i(\cdot)$ that can map $y_{1:t}$ to the space of $m_i.v$; for instance, dialogue acts belong to this case, since it is often difficult to judge the dialogue act from part of a response. For case (1), we define the state update loss as

$$L^{1}_{SU}(m_i) = \sum_{t=1}^{T} \left\| \mathcal{M}_i.v_t - Rep(F_i(y_{1:t})) \right\|, \qquad (11)$$

where $T$ is the length of $Y$ and $\|\cdot\|$ refers to the L2 norm. For case (2), the loss is defined as

$$L^{2}_{SU}(m_i) = \left\| \mathcal{M}_i.v_T - Rep(m_i.v) \right\|. \qquad (12)$$

The full state update loss $L_{SU}(\Theta)$ for $D$ is then given by

$$L_{SU}(\Theta) = \sum_{i=1}^{N} \sum_{j=1}^{l} \mathbb{I}[m_{i,j} \in \mathcal{C}_1]\, L^{1}_{SU}(m_{i,j}) + \mathbb{I}[m_{i,j} \in \mathcal{C}_2]\, L^{2}_{SU}(m_{i,j}), \qquad (13)$$

where $\mathcal{C}_1$ and $\mathcal{C}_2$ represent the sets of variables belonging to case (1) and case (2) respectively, and $\mathbb{I}(\cdot)$ is an indicator function. The loss function for learning GTMNES2S is finally defined as

$$L(\Theta) = L_{NLL}(\Theta) + \lambda L_{SU}(\Theta), \qquad (14)$$

where $\lambda$ acts as a trade-off between the two items.
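The state update loss can be sketched as follows; memory_trace, rep_fn, and map_fns are illustrative containers for the decoded memory values, the $Rep(\cdot)$ function, and the case-(1) mappings $F_i$ (e.g., length), and are assumptions made for this sketch.

```python
import torch

def state_update_loss(memory_trace, response_tokens, meta_word, rep_fn, map_fns):
    """Sketch of Eqs. (11)-(13). memory_trace[t][i] holds M_i.v after step t+1;
    map_fns[i] is F_i for case-(1) variables (None for case-(2) variables)."""
    T = len(response_tokens)
    loss = torch.tensor(0.0)
    for i, var in enumerate(meta_word):
        F_i = map_fns.get(i)
        if F_i is not None:                                   # case (1), Eq. (11)
            for t in range(1, T + 1):
                target = rep_fn(var, F_i(response_tokens[:t]))
                loss = loss + torch.norm(memory_trace[t - 1][i] - target)
        else:                                                 # case (2), Eq. (12)
            loss = loss + torch.norm(memory_trace[T - 1][i] - rep_fn(var, var["value"]))
    return loss

# Eq. (14): total_loss = nll_loss + lam * state_update_loss(...)
```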

5.2 Meta-word Prediction

We assume that values of meta-words are given beforehand. In training, the values can be extracted from the ground truth. In test, however, since only a message is available, we propose sampling the values of a meta-word for the message from probability distributions estimated from $\{(X_i, M_i)\}_{i=1}^{N} \subset D$. The sampling approach not only provides meta-words to GTMNES2S, but also keeps meta-words diverse for similar messages. Formally, let $h^{p}_X$ be the last hidden state of a message $X$ processed by a biGRU. Then $\forall m_i \in M$, if $m_i.t = c$, we assume that $m_i.v$ obeys a multinomial distribution with the probability $\vec{p}_i$ parameterized as $\mathrm{softmax}(W^{mul}_i h^{p}_X + b^{mul}_i)$; otherwise, $m_i.v$ obeys a normal distribution with $\mu_i$ and $\log(\sigma^2_i)$ parameterized as $W^{\mu}_i h^{p}_X + b^{\mu}_i$ and $W^{\sigma}_i h^{p}_X + b^{\sigma}_i$ respectively. In distribution estimation, we assume that the variables in a meta-word are independent, and jointly maximize the log likelihood of $\{(M_i \mid X_i)\}_{i=1}^{N}$ and the entropy of the distributions as regularization.
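A sketch of the per-variable prediction heads described above is given below; the head layout and the reparameterized sampling for real-valued variables are assumptions, and the entropy regularization used in estimation is omitted.

```python
import torch
import torch.nn as nn

class MetaWordPredictor(nn.Module):
    """Per-variable heads on top of the message encoding h_X (Section 5.2), a sketch.
    num_values maps a variable name to its number of categories, or None if real-valued."""
    def __init__(self, d, num_values):
        super().__init__()
        self.num_values = num_values
        self.heads = nn.ModuleDict({
            name: nn.Linear(d, n if n is not None else 2)   # logits, or (mu, log sigma^2)
            for name, n in num_values.items()
        })

    def sample(self, h_x):
        out = {}
        for name, n in self.num_values.items():
            params = self.heads[name](h_x)
            if n is not None:                                # categorical: multinomial
                out[name] = torch.multinomial(torch.softmax(params, dim=-1), 1).item()
            else:                                            # real-valued: Gaussian
                mu, log_var = params[0], params[1]
                out[name] = (mu + torch.exp(0.5 * log_var) * torch.randn(())).item()
        return out
```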

6 Experiments

We test GTMNES2S on two large-scale datasets.

6.1 Datasets

We mine 10 million message-response pairs from the Twitter FireHose, covering the two-month period from June 2016 to July 2016, and sample 10 million pairs from the full Reddit data4. As pre-processing, we remove duplicate pairs, pairs with a message or a response having more than 30 words, and messages that correspond to more than 20 responses, to prevent them from dominating learning. After that, there are 4,759,823 pairs left for Twitter and 4,246,789 pairs left for Reddit. On average, each message contains 10.78 words in the Twitter data and 12.96 words in the Reddit data. The average lengths of responses in the Twitter data and the Reddit data are 11.03 and 12.75 respectively. From the pairs left after pre-processing, we

4 https://redd.it/3bxlg7

randomly sample 10k pairs as a validation set and 10k pairs as a test set for each dataset, and make sure that there is no overlap between the two sets. After excluding pairs in the validation sets and the test sets, the remaining pairs are used for model training. The test sets are built for calculating automatic metrics. Besides, we randomly sample 1000 distinct messages from each of the two test sets and recruit human annotators to judge the quality of responses generated for these messages. For both the Twitter data and the Reddit data, the top 30,000 most frequent words in messages and responses in the training sets are kept as message vocabularies and response vocabularies. In the Twitter data, the message vocabulary and the response vocabulary cover 99.17% and 98.67% of the words appearing in messages and responses respectively. The two ratios are 99.52% and 98.8% respectively in the Reddit data. Other words are marked as "UNK".

6.2 Meta-word Construction

As a showcase of the GTMNES2S framework, we consider the following variables as a meta-word: (1) Response Length (RL): the number of words and punctuation marks in a response. We restrict the range of the variable to {1, ..., 25} (i.e., responses longer than 25 are normalized as 25), and treat it as a categorical variable. (2) Dialog Act (DA): we employ the 42 dialogue acts based on the DAMSL annotation scheme (Core and Allen, 1997). The dialogue act of a given response is obtained by the state-of-the-art dialogue act classifier in (Liu et al., 2017) learned from the Switchboard (SW) 1 Release 2 Corpus (Godfrey and Holliman, 1997). DA is a categorical variable. (3) Multiple Utterances (MU): whether a response is made up of multiple utterances. We split a response into utterances according to ".", "?" and "!", and remove utterances that have fewer than 3 words. The variable is "true" if more than 1 utterance is left, otherwise it is "false". (4) Copy Ratio (CR): inspired by COPY-NET (Gu et al., 2016), which indicates that humans may repeat entity names or even long phrases in conversation, we incorporate a "copy mechanism" into our model by using the copy ratio as a soft implementation of COPY-NET. We compute the ratio of unigrams shared by a message and its response (divided by the length of the response), with stop words and the top 1000 most frequent words in training excluded. CR is a real-valued variable. (5) Specificity (S): following SC-Seq2Seq (Zhang et al., 2018a), we calculate normalized inverse word frequency as a specificity variable. The variable is real-valued. Among the five variables, RL, CR, and S correspond to the state update loss given by Equation (11), and the others correspond to Equation (12).
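The directly computable variables can be extracted along the lines of the sketch below; the tokenization, stop-word set, and frequent-word set are placeholders, and dialogue act and specificity are omitted since they require an external classifier and corpus statistics.

```python
import re

def extract_meta_word(message, response, stop_words, frequent_words):
    """Sketch of RL, MU, and CR extraction from Section 6.2."""
    resp_tokens = response.split()
    # Response Length: words and punctuation marks, capped at 25.
    rl = min(len(resp_tokens), 25)
    # Multiple Utterances: split on . ? !, drop utterances with fewer than 3 words.
    utts = [u for u in re.split(r"[.?!]", response) if len(u.split()) >= 3]
    mu = len(utts) > 1
    # Copy Ratio: shared unigrams over response length, excluding stop words
    # and the top-1000 most frequent training words.
    excluded = stop_words | frequent_words
    msg_tokens = set(message.split()) - excluded
    shared = [w for w in resp_tokens if w in msg_tokens]
    cr = len(shared) / max(len(resp_tokens), 1)
    return {"RL": rl, "MU": mu, "CR": cr}
```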

6.3 Baselines

We compare GTMNES2S with the following baseline models: (1) MMI-bidi: the sequence-to-sequence model with response re-ranking in (Li et al., 2015), learned with a maximum mutual information objective; (2) SC-Seq2Seq: the specificity-controlled Seq2Seq model in (Zhang et al., 2018a); (3) kg-CVAE: the knowledge-guided conditional variational autoencoders in (Zhao et al., 2017); and (4) CT: the conditional training method in (See et al., 2019) that feeds the embedding of pre-defined response attributes to the decoder of a sequence-to-sequence model. Among the baselines, CT exploits the same attributes as GTMNES2S, SC-Seq2Seq utilizes specificity, and kg-CVAE leverages dialogue acts. All models are implemented with the recommended parameter configurations in the existing papers; for kg-CVAE, we use the code shared at https://github.com/snakeztc/NeuralDialog-CVAE, and for the other models without officially published code, we code with TensorFlow. Besides the baselines, we also compare GTMNES2S learned from the full loss given by Equation (14) with a variant learned only from the NLL loss, in order to check the effect of the proposed state update loss. We denote the variant as GTMNES2S w/o SU.

6.4 Evaluation Metrics

We conduct both automatic evaluation and human evaluation. In terms of automatic evaluation, we evaluate models from four aspects: relevance, diversity, accuracy of one-to-many modeling, and accuracy of meta-word expression. For relevance, besides BLEU (Papineni et al., 2002), we follow (Serban et al., 2017) and employ Embedding Average (Average), Embedding Extrema (Extrema), and Embedding Greedy (Greedy) as metrics. To evaluate diversity, we follow (Li et al., 2015) and use Distinct-1 (Dist1) and Distinct-2 (Dist2) as metrics, which are calculated as the ratios of distinct unigrams and bigrams in the generated responses. For accuracy of one-to-many modeling, we utilize A-bow precision (A-prec), A-bow recall (A-rec),

E-bow precision (E-prec), and E-bow recall (E-rec) proposed in (Zhao et al., 2017) as metrics. For accuracy of meta-word expression, we measure accuracy for categorical variables and square deviation for real-valued variables. Metrics of relevance, diversity, and accuracy of meta-word expression are calculated on the 10k test data based on the top 1 responses from beam search. To measure the accuracy of meta-word expression for a generated response, we extract the values of the meta-word of the response with the methods described in Section 6.2, and compare these values with the oracle ones sampled from the distributions. Metrics of accuracy of one-to-many modeling require a test message to have multiple reference responses. Thus, we filter the test sets by picking out messages that have at least 2 responses, and form two subsets with 166 messages for Twitter and 135 messages for Reddit respectively. On average, each message corresponds to 2.8 responses in the Twitter data and 2.92 responses in the Reddit data. For each message, 10 responses from a model are used for evaluation. In kg-CVAE, we follow (Zhao et al., 2017) and sample 10 times from the latent variable; in SC-Seq2Seq, we vary the specificity in {0.1, 0.2, ..., 1}; and in both CT and GTMNES2S, we sample 10 times from the distributions. The top 1 response from beam search under each sampling or specificity setting is collected into the set for evaluation.
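For reference, the Distinct-n metric used above can be computed as in the following sketch (simple whitespace tokenization assumed).

```python
def distinct_n(responses, n):
    """Dist-n: ratio of distinct n-grams over all n-grams in the generated responses."""
    total, unique = 0, set()
    for response in responses:
        tokens = response.split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

# Dist1 = distinct_n(generated, 1); Dist2 = distinct_n(generated, 2)
```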

In terms of human evaluation, we recruit 3 native speakers to label the top 1 responses of beam search from different models. Responses from all models for all 1000 test messages in both datasets are pooled, randomly shuffled, and presented to each of the annotators. The annotators judge the quality of the responses according to the following criteria: +2: the response is not only relevant and natural, but also informative and interesting; +1: the response can be used as a reply, but might not be informative enough (e.g., "Yes, I see", etc.); 0: the response makes no sense, is irrelevant, or is grammatically broken. Each response receives 3 labels. Agreement among the annotators is measured by Fleiss' kappa (Fleiss and Cohen, 1973).

6.5 Implementation Details

In test, we fix the specificity variable at 0.5 in SC-Seq2Seq, since in (Zhang et al., 2018a) the authors conclude that the model achieves the best overall performance under this setting.


| Dataset | Models | BLEU | Average | Greedy | Extrema | Dist1 | Dist2 | A-prec | A-rec | E-prec | E-rec |
| Twitter | MMI-bidi | 2.92 | 0.787 | 0.181 | 0.394 | 6.35 | 20.6 | 0.853 | 0.810 | 0.601 | 0.554 |
| Twitter | kg-CVAE | 1.83 | 0.766 | 0.175 | 0.373 | 8.65 | 29.7 | 0.862 | 0.822 | 0.597 | 0.545 |
| Twitter | SC-Seq2Seq | 2.57 | 0.776 | 0.182 | 0.387 | 6.87 | 22.5 | 0.857 | 0.815 | 0.594 | 0.551 |
| Twitter | CT | 3.32 | 0.792 | 0.181 | 0.402 | 8.04 | 26.9 | 0.859 | 0.813 | 0.596 | 0.550 |
| Twitter | GTMNES2S w/o SU | 3.25 | 0.793 | 0.183 | 0.405 | 7.59 | 28.4 | 0.861 | 0.819 | 0.598 | 0.554 |
| Twitter | GTMNES2S | 3.39 | 0.810 | 0.182 | 0.413 | 8.41 | 30.5 | 0.886 | 0.839 | 0.610 | 0.560 |
| Reddit | MMI-bidi | 1.82 | 0.752 | 0.171 | 0.369 | 6.12 | 20.3 | 0.821 | 0.775 | 0.587 | 0.542 |
| Reddit | kg-CVAE | 1.89 | 0.745 | 0.171 | 0.357 | 8.47 | 28.7 | 0.827 | 0.781 | 0.583 | 0.531 |
| Reddit | SC-Seq2Seq | 1.95 | 0.752 | 0.176 | 0.362 | 5.94 | 19.2 | 0.823 | 0.778 | 0.581 | 0.536 |
| Reddit | CT | 2.43 | 0.751 | 0.172 | 0.383 | 8.62 | 33.4 | 0.827 | 0.783 | 0.587 | 0.540 |
| Reddit | GTMNES2S w/o SU | 2.75 | 0.757 | 0.174 | 0.382 | 8.47 | 32.6 | 0.832 | 0.791 | 0.594 | 0.548 |
| Reddit | GTMNES2S | 2.95 | 0.760 | 0.172 | 0.386 | 10.35 | 36.3 | 0.841 | 0.795 | 0.602 | 0.554 |

Table 2: Results on relevance, diversity, and accuracy of one-to-many modeling. Numbers in bold mean that the improvement over the best baseline is statistically significant (t-test, p-value < 0.01).

| Dataset | Meta-word | Type | SC-Seq2Seq | kg-CVAE | CT | GTMNES2S w/o SU | GTMNES2S |
| Twitter | RL | c | - | - | 97% | 95.6% | 98.6% |
| Twitter | DA | c | - | 58.2% | 60.9% | 61.2% | 62.6% |
| Twitter | MU | c | - | - | 98.8% | 99.5% | 99.4% |
| Twitter | CR | r | - | - | 0.176 | 0.178 | 0.164 |
| Twitter | S | r | 0.195 | - | 0.130 | 0.158 | 0.103 |
| Reddit | RL | c | - | - | 94.5% | 95.1% | 96.7% |
| Reddit | DA | c | - | 55.7% | 59.9% | 55.9% | 61.2% |
| Reddit | MU | c | - | - | 99.2% | 98.7% | 99.4% |
| Reddit | CR | r | - | - | 0.247 | 0.253 | 0.236 |
| Reddit | S | r | 0.143 | - | 0.118 | 0.112 | 0.084 |

Table 3: Results on accuracy of meta-word expression. Numbers in bold mean that the improvement over the best baseline is statistically significant (t-test, p-value < 0.01).

For kg-CVAE, we follow (Zhao et al., 2017) and predict a dialogue act for a message with an MLP. GTMNES2S and CT leverage the same set of attributes; thus, for a fair comparison, we let them exploit the same sampled values in generation. In GTMNES2S, the size of the hidden units of the encoder and the decoder, and the size of the vectors in the memory cells (i.e., d), are 512. Word embeddings are randomly initialized with a size of 512. We adopt the Adadelta algorithm (Zeiler, 2012) in optimization with a batch size of 200. Gradients are clipped when their norms exceed 5. We stop training when the perplexity of a model on the validation data does not drop in two consecutive epochs. Beam sizes are 200 in MMI-bidi (i.e., the size used in (Li et al., 2015)) and 5 in the other models.

6.6 Evaluation Results

Table 2 and Table 3 report the evaluation results on automatic metrics. On most of the metrics, GTMNES2S outperforms all baseline methods, and the improvements are significant in a statistical sense (t-test, p-value < 0.01). The results demonstrate that with meta-words, our model can represent the relationship between messages and responses in a more effective and more accurate way, and thus can generate more diverse responses

without sacrificing relevance. Despite leveraging the same attributes for response generation, GTMNES2S achieves better accuracy than CT on both one-to-many modeling and meta-word expression, indicating the advantages of the dynamic control strategy over the static control strategy, as analyzed at the beginning of Section 4.2. Without the state update loss, there is a significant performance drop for GTMNES2S; the results verify the effect of the proposed loss in learning. Table 4 summarizes the human evaluation results. Compared with the baseline methods and the variant, the full GTMNES2S model generates many more excellent responses (labeled as "2") and many fewer inferior responses (labeled as "0"). Kappa values of all models exceed 0.6, indicating substantial agreement among the annotators. The results further demonstrate the value of the proposed model for real human-machine conversation. kg-CVAE gives more informative responses, but also more bad responses, than MMI-bidi and SC-Seq2Seq. Together with its contrast between diversity and relevance in Table 2, the results indicate that the latent variable is a double-edged sword: the randomness may bring interesting content to responses, but may also make responses go out of control. On the other hand, there are no random variables in our model, and thus it can enjoy a well-trained language model.


| Dataset | Models | 2 | 1 | 0 | Avg | Kappa |
| Twitter | MMI-bidi | 16.6% | 51.7% | 31.7% | 0.85 | 0.65 |
| Twitter | kg-CVAE | 23.1% | 40.9% | 36% | 0.87 | 0.78 |
| Twitter | SC-Seq2Seq | 21.2% | 48.5% | 30.3% | 0.91 | 0.61 |
| Twitter | CT | 27.6% | 38.4% | 34% | 0.94 | 0.71 |
| Twitter | GTMNES2S w/o SU | 27% | 39.1% | 33.9% | 0.93 | 0.64 |
| Twitter | GTMNES2S | 33.2% | 37.7% | 29.1% | 1.04 | 0.71 |
| Reddit | MMI-bidi | 4.4% | 58.1% | 37.5% | 0.67 | 0.79 |
| Reddit | kg-CVAE | 13.7% | 44.6% | 41.7% | 0.72 | 0.68 |
| Reddit | SC-Seq2Seq | 9.9% | 51.2% | 38.9% | 0.71 | 0.78 |
| Reddit | CT | 16.5% | 48.2% | 35.3% | 0.81 | 0.73 |
| Reddit | GTMNES2S w/o SU | 15.7% | 47.3% | 37% | 0.79 | 0.66 |
| Reddit | GTMNES2S | 19.2% | 47.5% | 33.3% | 0.86 | 0.76 |

Table 4: Results of the human evaluation. Ratios are calculated by combining labels from the three judges.

| Dataset | Multiple Utterances | Dialog Act | Length | Copy Ratio | Specificity | PPL |
| Twitter | × | × | × | × | × | 70.19 |
| Twitter | ✓ | × | × | × | × | 67.23 |
| Twitter | ✓ | ✓ | × | × | × | 62.13 |
| Twitter | ✓ | ✓ | ✓ | × | × | 50.36 |
| Twitter | ✓ | ✓ | ✓ | ✓ | × | 42.05 |
| Twitter | ✓ | ✓ | ✓ | ✓ | ✓ | 38.57 |
| Reddit | × | × | × | × | × | 72.43 |
| Reddit | ✓ | × | × | × | × | 65.17 |
| Reddit | ✓ | ✓ | × | × | × | 61.92 |
| Reddit | ✓ | ✓ | ✓ | × | × | 49.67 |
| Reddit | ✓ | ✓ | ✓ | ✓ | × | 41.78 |
| Reddit | ✓ | ✓ | ✓ | ✓ | ✓ | 37.96 |

Table 5: Contribution of different attributes.


6.7 Discussions

In this section, we examine the effect of different attributes by adding them one by one to the generation model. Besides, we also illustrate how GTMNES2S tracks attribute expression in response generation with test examples.

Contribution of attributes. Table 5 shows the perplexity (PPL) of GTMNES2S with different sets of attributes on the validation data. We can see that the more attributes are involved in learning, the lower the PPL we get. By leveraging all 5 attributes, we can reduce PPL by almost 50% compared with the vanilla encoder-decoder model (i.e., the one without any attributes). The results not only indicate the contribution of different attributes to model fitting, but also suggest the potential of the proposed framework, since it allows further improvement with more well-designed attributes involved.

Case Study. Figure 2 illustrates how our model controls attributes of responses with the goal tracking mechanism.

Example 1
Message: mm so should i just pull the ring out than?
kg-CVAE: where is the ring?
MMI-bidi: you don't want to that
SC-Seq2Seq: you should not do such things
Meta-word: MU=False, DA=Statement-non-opinion, RL=8, CR=0.24, S=0.5
GTMNES2S: i 'll just pull the ring out creepier
GTMNES2S w/o SU: i 'll pull the ring on the ring

Example 2
Message: i will not give up until you take an actual guess
kg-CVAE: open your mouth
MMI-bidi: what you 're talking about ???
SC-Seq2Seq: i 'm not sure about that.
Meta-word: MU=True, DA=Wh-Question, RL=12, CR=0.08, S=0.4
GTMNES2S: why are you so mean to me ? i 'm pretty special
GTMNES2S w/o SU: what do you mean ? you 're not a normal person .

Figure 2: Examples of response generation from the Twitter test data. Up: the heat map is defined by $\|\mathcal{M}_i.v_t - \mathcal{M}_i.g\|$ normalized to [0, 1], where $\mathcal{M}_i$ refers to CR. Below: $\mathcal{M}_i$ in the heat map refers to MU. (Heat maps omitted.)

In the figure, the distance between the value of a memory cell (i.e., $\mathcal{M}_i.v_t$) during generation and the goal of the memory cell (i.e., $\mathcal{M}_i.g$) is visualized via heat maps. In the first example, the full model gradually reduces the distance between the value and the goal of copy ratio expression as generation moves on. As a result, it just copies "pull the ring out" from the message, which makes the response informative and coherent. On the other hand, without the state update loss, GTMNES2S w/o SU makes a mistake by copying "ring" twice, and the distance between the value and the goal goes out of control. In the second example, we visualize the expression of MU, a categorical attribute. Compared with real-valued attributes, categorical attributes are easier to express. Therefore, both the full model and GTMNES2S w/o SU successfully generate a response with multiple utterances, although the distance between the value and the goal of MU expression in GTMNES2S w/o SU is still in a mess.

7 Conclusions

We present a goal tracking memory enhanced sequence-to-sequence model for open domain response generation with meta-words, which explicitly define characteristics of responses. Evaluation results on two datasets indicate that our model significantly outperforms several state-of-the-art generative architectures in terms of both response quality and accuracy of meta-word expression.


References

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Mark G. Core and James Allen. 1997. Coding dialogs with the DAMSL annotation scheme. In AAAI Fall Symposium on Communicative Action in Humans and Machines, volume 56. Boston, MA.

Joseph L. Fleiss and Jacob Cohen. 1973. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33(3):613-619.

John J. Godfrey and Edward Holliman. 1997. Switchboard-1 Release 2. Linguistic Data Consortium, Philadelphia, 926:927.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In International Conference on Machine Learning, pages 1587-1596.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110-119.

Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 994-1003.

Yang Liu, Kun Han, Zhao Tan, and Yun Lei. 2017. Using context information for dialog act classification in DNN framework. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2170-2178.

Lajanugen Logeswaran, Honglak Lee, and Samy Bengio. 2018. Content preserving text generation with attribute controls. In Advances in Neural Information Processing Systems, pages 5108-5118.

Lili Mou, Yiping Song, Rui Yan, Ge Li, Lu Zhang, and Zhi Jin. 2016. Sequence to backward and forward sequences: A content-introducing approach to generative short-text conversation. In Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers, pages 3349-3358.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318.

Yookoon Park, Jaemin Cho, and Gunhee Kim. 2018. A hierarchical latent structure for variational conversation modeling. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1792-1801.

Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? How controllable attributes affect human judgments. arXiv preprint arXiv:1902.08654.

Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. End-to-end dialogue systems using generative hierarchical neural network models. In AAAI, pages 3776-3784.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, pages 3295-3301.

Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In ACL, pages 1577-1586.

Heung-Yeung Shum, Xiaodong He, and Di Li. 2018. From Eliza to XiaoIce: Challenges and opportunities with social chatbots. Frontiers of Information Technology & Electronic Engineering, 19(1):10-26.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 3104-3112, Cambridge, MA, USA. MIT Press.

Chongyang Tao, Shen Gao, Mingyue Shang, Wei Wu, Dongyan Zhao, and Rui Yan. 2018. Get the point of my utterance! Learning towards effective responses with multi-head attention mechanism. In IJCAI, pages 4418-4424.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.

Di Wang, Nebojsa Jojic, Chris Brockett, and Eric Nyberg. 2017. Steering output style and topic in neural response generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2140-2150.

Yu Wu, Wei Wu, Dejian Yang, Can Xu, Zhoujun Li, and Ming Zhou. 2018. Neural response generation with dynamic vocabularies. In AAAI, pages 5594-5601.

Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. 2017. Topic aware neural response generation. In AAAI, pages 3351-3357.

Chen Xing, Wei Wu, Yu Wu, Ming Zhou, Yalou Huang, and Wei-Ying Ma. 2018. Hierarchical recurrent attention network for response generation. In AAAI, pages 5610-5617.

Stephanie Young, Milica Gasic, Blaise Thomson, and John D. Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160-1179.

Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Ruqing Zhang, Jiafeng Guo, Yixing Fan, Yanyan Lan, Jun Xu, and Xueqi Cheng. 2018a. Learning to control the specificity in neural response generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1108-1117.

Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, and Bill Dolan. 2018b. Generating informative and diverse conversational responses via adversarial information maximization. In Advances in Neural Information Processing Systems, pages 1815-1825.

Tiancheng Zhao, Kyusong Lee, and Maxine Eskenazi. 2018. Unsupervised discrete sentence representation learning for interpretable neural dialog generation. arXiv preprint arXiv:1804.08069.

Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654-664.

Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2018. Emotional chatting machine: Emotional conversation generation with internal and external memory. In AAAI, pages 730-738.