
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1525–1534, June 6–11, 2021. ©2021 Association for Computational Linguistics

Generating Negative Samples by Manipulating Golden Responses for Unsupervised Learning of a Response Evaluation Model

ChaeHun Park, Eugene Jang, Wonsuk Yang, Jong C. Park*

School of Computing, Korea Advanced Institute of Science and Technology

{ddehun,eugenej,derrick0511,park}@nlp.kaist.ac.kr

Abstract

Evaluating the quality of responses generated by open-domain conversation systems is a challenging task. This is partly because there can be multiple appropriate responses to a given dialogue history. Reference-based metrics that rely on comparisons to a set of known correct responses often fail to account for this variety, and consequently correlate poorly with human judgment. To address this problem, researchers have investigated the possibility of assessing response quality without using a set of known correct responses. Tao et al. (2018) demonstrated that an automatic response evaluation model could be made using unsupervised learning for the next-utterance prediction (NUP) task. For unsupervised learning of such a model, we propose a method of manipulating a golden response to create a new negative response that is designed to be inappropriate within the context while maintaining high similarity with the original golden response. We find, from our experiments on English datasets, that using the negative samples generated by our method alongside random negative samples can increase the model's correlation with human evaluations. The process of generating such negative samples is automated and does not rely on human annotation.¹

1 Introduction

Automatic evaluation of responses can be difficult because multiple answers could be suitable for a single context. Well-known metrics often used in machine translation or text summarization, such as BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), or ROUGE (Lin, 2004), are based on measuring n-gram overlap with a set of human-annotated golden answers. Compared to machine translation or text summarization systems, conversational systems have a wider range of acceptable responses to a given situation (dialogue history). This could explain the low correlation between n-gram-based evaluations and human-conducted evaluations for responses generated by conversation systems, as reported by Liu et al. (2016). They also suggested calculating the embedding similarities between responses and correct answers, and showed that these metrics had a higher correlation with human evaluations than n-gram-based metrics. As this method only rewards responses similar to ones in the fixed set of answer candidates, however, it still fails to account for other possible answers that are dissimilar to the known answers.

* Corresponding author
¹ The code is available at https://github.com/nlpcl-lab/dialog-eval-hard-negative

Figure 1: Example of the three different types of responses for a given dialogue history. Our method manipulates the original response "What's wrong with heading out with Mark for vacation?" to generate the negative sample "What's wrong? Go out with Mark for dinner."

To solve this problem, Lowe et al. (2017) proposed a supervised regression model that makes predictions independent of correct answer candidates. Although they were able to achieve better correlation with human evaluations, their method depends on procuring a human-annotated dataset to learn from. Tao et al. (2018) used the Next-Utterance Prediction (NUP) task to learn for automatic response evaluation. Their model, which is unsupervised, learned to distinguish an appropriate response from random negative samples (responses randomly taken from the training corpus). The model can evaluate the response quality by estimating the probability that the response occurs directly after the dialogue history. They also demonstrated that the probability-based evaluations highly correlated with human evaluations of response quality.

In this paper, we propose a method to create a negative sample by manipulating a golden response. The manipulation is carried out in three steps: (1) scoring each word, (2) selecting words to replace, and (3) replacing the selected words. In the first step, each word is assigned a score designed to determine how dependent the word is on the context. In the second step, we select all the words with a score above a threshold value, where higher scores indicate higher dependency on the dialogue history. In the third step, all previously selected words are masked and replaced with words predicted in their place by a pretrained language model (LM). Figure 1 shows an example of a negative sample generated by our method. When "What's wrong with heading out with Mark for vacation?" is the golden response, the tokens "with", "heading", "vacation", and "?" were selected and replaced with "?", "Go", "dinner", and ".", in that order.

We find that the model trained with our negative samples alongside random negative samples shows a higher correlation with human evaluations than the models trained only on random negative samples, in experiments using two datasets (Zhao et al., 2020). We also find evidence that automatic evaluation systems trained with the negative samples generated by our proposed method can make decisions closer to human judgment than those without.

The contributions of this paper are as follows:

(1) We introduce a method that automatically generates negative samples from the golden responses.

(2) We show, with experimental results, that the negative samples can boost unsupervised learning of an automatic response evaluation model.

(3) We conducted crowdsourcing and used its results to examine whether the negative samples generated by our method are actually negative.

2 Related Work

Liu et al. (2016) pointed out that the traditional n-gram overlap based metrics such as BLEU, METEOR, and ROUGE show low correlation with human evaluations when used to evaluate the results of an open-domain conversation system. They suggested measuring the similarity by comparing embeddings of a generated response to those of the golden response. Li et al. (2016) explored a dialog system with textual feedback. Ghandeharioun et al. (2019) suggested the necessity of interactive human evaluation for dialogue systems and proposed a self-play scenario to reduce the burden of human effort. Hashimoto et al. (2019) proposed a method to combine human assessments with the predictions of an evaluation model.

Lowe et al. (2017) proposed a supervised learning method to predict the quality of a response directly, rather than measuring the similarities with golden responses. Tao et al. (2018) showed that a model trained on the NUP task in an unsupervised manner can be used to predict the quality of a response that is generated by a system. Ghazarian et al. (2019) improved the previous work by using contextualized word embeddings. Mehri and Eskenazi (2020) proposed two unsupervised evaluation models, one based on masked language modeling (MLM) and another based on the response retrieval task using a pretrained LM. Pang et al. (2020) predicted the coherence and fluency of a response by estimating its likelihood using an LM.

Sai et al. (2020) emphasized the importance of adversarial negative samples for learning response evaluation and released a dataset with human-curated adversarial negative responses. Their negative samples were manually curated, however, a process that can be both time-consuming and expensive. Wu et al. (2020) attempted to improve the performance of evaluation models for abstractive summarization by corrupting the golden summary and using it as a negative sample. In the machine translation task, Sellam et al. (2020) created paired data with synthetic examples through methods such as back-translation and mask-filling with BERT (Devlin et al., 2019), and they used the paired data to pretrain the evaluation models. Our work introduces a method to create negative samples by manipulating the golden response to the dialogue history, and also suggests that the negative samples generated by the proposed method could be used to improve the unsupervised response evaluation model. The proposed method can be performed automatically without human effort.


3 Method

In this section, we describe our method to generate negative samples. The proposed method creates a negative sample by selecting and replacing specific word(s) in a golden response. The word selection is based on the difference between (a) the estimated probability that a word would appear in the response considering the dialogue history, and (b) the estimated probability that the word would appear in the response when the dialogue history is not considered. An LM that can perform MLM can be used to estimate these probabilities. Words that have large differences in probability are selected and replaced with other words. When replacing a word with another word, an LM that can perform MLM can be used to predict the word that is most likely to appear in the position of the original word when the dialogue history is not given.

3.1 Scoring

The proposed method includes a scoring process to determine which words in the golden response are affected the most by the dialogue history. The score of a word is calculated by taking the difference between (a) the estimated probability of the word appearing in its position when the dialogue history is given, and (b) the estimated probability of the word appearing in its position when the dialogue history is not given. This scoring process is performed independently for all words in the target response.

Specifically, to calculate the score of the i-th word (x_i) in the golden response, we first replace x_i with the [mask] token. Then the likelihood that the original word x_i appears in place of the masked token is calculated twice: once with the dialogue history and once without. The difference in the log-likelihood is used as the final score of each word, which is defined as

score(x_i | c, r_i, θ) = log(P(x_i | [c, r_i], θ)) − log(P(x_i | r_i, θ))    (1)

where x_i denotes the word to be scored, and r_i denotes the sequence of words in the golden response where x_i is masked. c denotes the dialogue history of the golden response, and [·, ·] the concatenation of two pieces of text. P(x_i | [c, r_i], θ) denotes the estimated probability that x_i would occur when the dialogue history is considered; P(x_i | r_i, θ) denotes the estimated probability that x_i would occur when the dialogue history is not considered. θ denotes the parameters of the LM.

Figure 2: An illustration of our proposed method for generating a negative sample. The original response to the dialogue history, "What's wrong with heading out with Mark for vacation?", is manipulated through the steps of scoring, selecting, and replacing.

Figure 2 shows an example of our proposed scoring process. The word "vacation" in the original response received the highest score among the words in the response. The words "with" and "heading" also scored higher than other words.
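As an illustration, the scoring step of Equation (1) can be sketched with a masked LM from the HuggingFace transformers library. This is a minimal sketch under our assumptions (a BERT MLM fine-tuned on the dialogue corpus and simple concatenation of history and response); the function and variable names are ours, not the authors' code, and the names `tokenizer` and `mlm` are reused by the later sketches.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def masked_log_prob(context_ids, response_ids, position):
    """Log-probability of the original token at `position` of the response,
    with that token replaced by [MASK]; `context_ids` may be empty."""
    original_id = response_ids[position]
    masked = list(response_ids)
    masked[position] = tokenizer.mask_token_id
    input_ids = torch.tensor([[tokenizer.cls_token_id] + context_ids
                              + masked + [tokenizer.sep_token_id]])
    mask_index = 1 + len(context_ids) + position  # offset by [CLS] and the context
    with torch.no_grad():
        logits = mlm(input_ids).logits[0, mask_index]
    return torch.log_softmax(logits, dim=-1)[original_id].item()

def score_tokens(history, response):
    """Eq. (1): log P(x_i | [c, r_i]) - log P(x_i | r_i) for every response token."""
    c = tokenizer.encode(history, add_special_tokens=False)
    r = tokenizer.encode(response, add_special_tokens=False)
    return [masked_log_prob(c, r, i) - masked_log_prob([], r, i)
            for i in range(len(r))]
```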

3.2 Selecting

For each sentence, we select words that scored higher than the threshold t. For example, in the case seen in Figure 2, if the threshold is 0.5, the words "with", "heading", "vacation", and "?" will be selected. If none of the words receive a score higher than the threshold value, no words will be selected, and in this case a negative sample cannot be generated.

We set the threshold t to 0.5 for our experiments. Using this threshold, in our dataset, an average of 27.28% of the tokens were selected for each response. Also, 94.89% of the responses contained at least one selected word, which means a negative sample could be generated for 94.89% of the cases.
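Given the per-token scores from the sketch above, the selection step reduces to a threshold filter; a minimal sketch (the helper name is ours):

```python
def select_positions(scores, t=0.5):
    # Keep the positions of tokens whose score exceeds the threshold t;
    # if the result is empty, no negative sample is generated for this response.
    return [i for i, s in enumerate(scores) if s > t]
```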


3.3 Replacing

The selected words are then replaced using an LM. All selected words are replaced with [mask] tokens in the original response. Then the LM predicts, without considering the dialogue history, the words that are most likely to occur in the location of each masked word. If the LM predicts the original word, the second most likely word is used instead.
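Continuing the sketch above (same `tokenizer` and `mlm`, which are our assumptions), the replacement step can look roughly as follows: mask every selected position, predict each one without the dialogue history, and fall back to the second-most-likely token whenever the top prediction equals the original word.

```python
def replace_selected(response, positions):
    ids = tokenizer.encode(response, add_special_tokens=False)
    masked = [tokenizer.mask_token_id if i in positions else tok
              for i, tok in enumerate(ids)]
    # No dialogue history is prepended here, by design of the replacing step.
    input_ids = torch.tensor([[tokenizer.cls_token_id] + masked
                              + [tokenizer.sep_token_id]])
    with torch.no_grad():
        logits = mlm(input_ids).logits[0]
    for i in positions:
        top2 = torch.topk(logits[1 + i], k=2).indices.tolist()  # 1 + i skips [CLS]
        ids[i] = top2[1] if top2[0] == ids[i] else top2[0]
    return tokenizer.decode(ids)
```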

4 Experiments

4.1 Setting

4.1.1 Dataset

To measure the correlation between model predictions and human evaluations, we use the response-evaluation dataset proposed by Zhao et al. (2020). The dataset contains dialogue histories, machine-generated responses, golden responses, and appropriateness scores evaluated by human annotators. The scores were on a 5-point Likert scale, and each response was scored by four annotators. Six generative models, S2S (Sutskever et al., 2014), attentional S2S, HRED (Serban et al., 2016), VHRED (Serban et al., 2017), GPT2-sm, and GPT2-md (Wolf et al., 2018), with three decoding algorithms, greedy decoding, ancestral decoding, and nucleus sampling (Holtzman et al., 2020), were used to generate the responses. They used DailyDialog (Li et al., 2017) and PersonaChat (Zhang et al., 2018). For each dataset, they trained a set of generative conversation models. Each of the 900 context-response pairs was randomly selected from the test set of the two datasets, and the annotators evaluated the appropriateness of each response to the context to construct two different evaluation datasets. The Krippendorff's alpha for this dataset was 0.815, suggesting reasonable inter-annotator agreement.

The DailyDialog dataset consists of 13,118 multi-turn open-domain conversations written by human workers, and the PersonaChat dataset consists of 12,875 multi-turn open-domain conversations written by human workers.

4.1.2 Models

The evaluation models used in the experiment are listed below. Among them, BLEU, ROUGE, METEOR, Embedding Average/Extrema/Greedy, and BERTScore are reference-based metrics that evaluate the quality of a response based on its similarity to the golden response. BERT-MLM, GPT2-coherence, BERT-retrieval (random-N), and BERT-retrieval (ours) are unreferenced metrics that do not require golden responses. RUBER can be viewed as a hybrid metric that includes both reference-based and unreferenced approaches. Some of the reference-based metrics are simple comparison methods rather than trainable models, but they are presented along with other models because they can also be used to estimate the quality of responses. It should be noted that we do not compare the unsupervised approaches listed below with supervised approaches, such as the ones proposed by Lowe et al. (2017) and Zhao et al. (2020), which require human-annotated response-evaluation pairs for training.

BLEU is a widely used metric for the machine translation task that measures n-gram precision between multiple references and a hypothesis (Papineni et al., 2002).

ROUGE is a widely used metric for text summarization, which measures the n-gram recall (Lin, 2004). We use the F-score of ROUGE-L as an appropriateness score.

METEOR is a metric for the machine translation task, which considers both n-gram precision and n-gram recall of a hypothesis (Banerjee and Lavie, 2005).

Embedding Average/Greedy/Extrema calculate the similarity between golden and generated responses using the embedding similarity, to account for the diverse ways in which the golden response could be stated (Liu et al., 2016).

BERTScore is a recently proposed unsupervised metric based on the contextualized BERT embeddings (Zhang et al., 2020).
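For reference, the reference-based scores above can be computed with common Python packages; the following is a sketch assuming the nltk, rouge-score, and bert-score packages, which are our assumption rather than the implementations used in the paper.

```python
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def reference_based_scores(golden, generated):
    # BLEU over whitespace tokens, ROUGE-L F-score, and BERTScore F1 for one pair.
    bleu = sentence_bleu([golden.split()], generated.split())
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(golden, generated)["rougeL"].fmeasure
    _, _, f1 = bert_score([generated], [golden], lang="en")  # returns (P, R, F1)
    return {"BLEU": bleu, "ROUGE-L (F)": rouge_l, "BERTScore (F1)": f1.item()}
```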

RUBER calculates the scores of reference-based and unreferenced metrics individually, then uses them to predict the final score (Tao et al., 2018). The reference-based metric measures the similarity between golden responses and generated responses based on their embedding similarity. The unreferenced metric is trained on the NUP task.

BERT-MLM sums the log-likelihood of each token in a response after masking it, using an LM that is fine-tuned on a corpus, then uses the aggregated likelihood as the final score of the response (Mehri and Eskenazi, 2020).
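A rough sketch of this metric, reusing `masked_log_prob` from the scoring sketch in Section 3.1 (this is our simplified reading, conditioned on the dialogue history; the exact formulation in Mehri and Eskenazi (2020) may differ):

```python
def bert_mlm_score(history, response):
    # Sum of masked log-likelihoods over all response tokens.
    c = tokenizer.encode(history, add_special_tokens=False)
    r = tokenizer.encode(response, add_special_tokens=False)
    return sum(masked_log_prob(c, r, i) for i in range(len(r)))
```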

GPT2-coherence measures the coherence between the dialogue history and a response by using a fine-tuned GPT2 model (Radford et al., 2019) to compute the averaged log-likelihood of the response (Pang et al., 2020).
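A sketch of a GPT2-coherence-style score under our assumptions (HuggingFace GPT2; in the actual experiments the model is fine-tuned on the dialogue corpus): the average log-likelihood of the response tokens conditioned on a non-empty dialogue history.

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def coherence_score(history, response):
    h = gpt2_tok.encode(history)   # assumed non-empty
    r = gpt2_tok.encode(response)
    with torch.no_grad():
        logits = gpt2(torch.tensor([h + r])).logits[0]
    log_probs = torch.log_softmax(logits, dim=-1)
    # The token at absolute position p is predicted by the logits at p - 1.
    scores = [log_probs[len(h) + j - 1, tok].item() for j, tok in enumerate(r)]
    return sum(scores) / len(scores)
```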


BERT-retrieval (random-N) is a BERT-based model that is trained to distinguish a golden response from a negative sample (Mehri and Eskenazi, 2020) using the dialogue history. We refer to the original model by Mehri and Eskenazi (2020) as BERT-retrieval (random-1), since they used one random response as a negative sample for a dialogue history. We refer to a variation of the model that uses two random negative samples for a dialogue history as BERT-retrieval (random-2). This is to fairly compare with our model, which uses two negative samples for a dialogue history, as explained below.

BERT-retrieval (ours) is a model that has the same structure as the BERT-retrieval model. The difference is that our model utilizes the negative samples generated by the method that we propose. The model uses both the generated negative samples and the random negative samples. Specifically, during training, the model learns to distinguish a golden response from two negative samples: one generated from our method and one randomly sampled from the corpus.
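A sketch of one training step for this model, assuming a binary appropriate-vs-not classification head on BERT and cross-entropy loss (our reading of the retrieval setup; the authors' exact training code may differ):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

rtv_tok = BertTokenizer.from_pretrained("bert-base-uncased")
rtv = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.Adam(rtv.parameters(), lr=2e-5)

def train_step(history, golden, generated_negative, random_negative):
    # One golden response against two negatives: our generated one and a random one.
    responses = [golden, generated_negative, random_negative]
    labels = torch.tensor([1, 0, 0])
    batch = rtv_tok([history] * 3, responses, return_tensors="pt",
                    padding=True, truncation=True)
    out = rtv(**batch, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```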

4.1.3 Implementation Details

We trained the unreferenced models on the original DailyDialog dataset and then evaluated them on the two response-evaluation datasets (Section 4.1.1). We split the conversations in the DailyDialog dataset in a sliding window manner to construct pairs of dialogue histories and corresponding responses. The maximum turn of the dialogue history was set to 5, following Zhao et al. (2020).
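A minimal sketch of this sliding-window construction (the helper name and turn separator are ours):

```python
def build_pairs(conversation, max_history_turns=5):
    # conversation: list of utterance strings in order.
    pairs = []
    for i in range(1, len(conversation)):
        history = conversation[max(0, i - max_history_turns):i]
        pairs.append((" ".join(history), conversation[i]))
    return pairs
```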

We use the pretrained BERT and GPT2 released by Wolf et al. (2018) for all of our relevant experiments.² A BERT model, fine-tuned on the DailyDialog train set with MLM for 1 epoch, was used for the scoring step of our proposed method (Section 3.1). The same model was used for the replacing step (Section 3.3). We used the threshold³ of 0.5 for the selecting step (Section 3.2). We used the Adam optimizer (Kingma and Ba, 2015) for training. We searched for hyperparameters for the BERT-retrieval (random-1) model that maximize the (Pearson) correlation between human evaluations and model predictions on the response-evaluation dataset made from the DailyDialog dataset (Section 4.1.1). The values found in this search (epoch=3, batch size=64, and learning rate=2e-5) were used for all the BERT-retrieval models (random-N, ours). The random seed was fixed for all experiments.

² bert-base-uncased and gpt2-12layer are used.

³ We tested threshold values of 0, 0.5, 1, and 2, and found that using 0.5 as the threshold achieved the highest correlation with human evaluations; therefore, we report only the experiment results with this value.

Model               DailyDialog       Persona
                    r       ρ         r       ρ
BLEU                .08     .05       .19     .23
METEOR              .11     .33       .23     .18
ROUGE               .12     .09       .25     .21
Embed Average       .09     .08       .16     .17
Embed Greedy        .18     .18       .25     .24
Embed Extrema       .16     .15       .28     .27
BERTScore           .13     .12       .28     .26
RUBER               .28     .26       .07     .04
BERT-MLM            .32     .38       .35     .35
GPT2-coherence      .47     .47       .48     .48
BERT-rtv (rand1)    .47     .47       .56     .60
BERT-rtv (rand2)    .49     .48       .55     .58
BERT-rtv (ours)     .55     .56       .64     .66

Table 1: The correlations between model predictions and human evaluations for each model, based on the two response-evaluation datasets. The highest score on each metric is highlighted in bold. All values with p > 0.001 are marked. DailyDialog and Persona denote the response-evaluation datasets made from the DailyDialog and PersonaChat datasets, respectively. BERT-rtv denotes the BERT-retrieval model. r and ρ mean the Pearson correlation and Spearman's rank correlation coefficient, respectively.

Model               DailyDialog       Persona
                    r       ρ         r       ρ
drop-golden         .49     .48       .54     .57
shuffle-golden      .46     .45       .55     .59
score-wo-history    .51     .52       .58     .58
select-random       .54     .55       .57     .59
replace-w-history   .52     .52       .57     .57

Table 2: The correlations between model predictions and human evaluations for each of the variations of our model, based on the two response-evaluation datasets.

4.2 Results

In Section 4.2.1, we check the correlations between the results of each evaluation model and human evaluations. In Section 4.2.2, an in-depth analysis of our proposed method is shown. In Section 4.2.3, we present examples that may suggest that automatic evaluation systems that have been trained with the proposed method can make decisions closer to human judgment than models that have not.

Figure 3: The scatter plots that show in detail the correlations between model predictions and human evaluations. Each of the plots contains 800 system-generated responses in the response-evaluation dataset made from the DailyDialog dataset (Section 4.1.1). Each point indicates a response. Its x-value indicates the human evaluation score for the quality of the response, given on a 5-point Likert scale. Its y-value indicates the model prediction for the quality of the response, normalized into the range of [0, 1]. The orange line is a linear regression. We add noise sampled from N(0, 0.09) to the human score for better visualization, following previous studies (Lowe et al., 2017; Bak and Oh, 2020; Pang et al., 2020).

4.2.1 Correlation with Human Judgment

Table 1 shows the correlation between model predictions and human evaluations for each model based on the two datasets. Pearson correlation (r) and Spearman's rank correlation coefficient (ρ) were used to measure the correlation between human score and model prediction. It should be noted that we excluded the scores of golden responses from the response-evaluation datasets and extracted 800 and 750 response-evaluation pairs from the DailyDialog and PersonaChat datasets, respectively. The model incorporating our negative sample method made predictions with higher correlation with human evaluations than the predictions made by BERT-retrieval (random-2), which uses the same number of negative samples for training. Among the baseline models, most of the reference-based metrics showed comparatively low performances. It is thought that these results support the observations made by previous studies, suggesting that using the golden response as the "one and only" correct answer to evaluate responses can be ineffective. RUBER showed better performance than other reference-based models for the DailyDialog dataset, but showed low performance in evaluating PersonaChat responses. The GPT2-coherence model showed similar performance to the BERT-retrieval (random-1) model on the DailyDialog dataset, but relatively low performance in the PersonaChat dataset. It should also be noted that the hybrid and unreferenced models were trained on the DailyDialog dataset and not on the PersonaChat dataset.
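The reported correlations can be computed, for instance, with SciPy (an assumption; the paper does not name its tooling), where `human` and `pred` are parallel score lists over the evaluated responses:

```python
from scipy.stats import pearsonr, spearmanr

def correlations(human, pred):
    r, _ = pearsonr(human, pred)      # Pearson correlation (second value is the p-value)
    rho, _ = spearmanr(human, pred)   # Spearman rank correlation
    return r, rho
```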

Figure 3 shows a scatter plot visualizing the human scores and model predictions for the response-evaluation dataset on DailyDialog. BLEU tended to predict low scores. This may suggest that there were only a few n-gram overlaps between the golden responses and the generated responses. The predictions of embedding-based metrics (Emb. Greedy and BERTScore) were concentrated on a specific range and showed low correlation with human scores. The unreferenced or hybrid metrics (RUBER, BERT-MLM, GPT2-coherence, and BERT-retrieval (random-1)) show relatively higher correlations than the reference-based metrics. We can see that BERT-retrieval (ours) shows the greatest correlation among the models, with a correlation coefficient of 0.1974. The scatter plots suggest that false-positive predictions, which frequently occurred in the BERT-retrieval (random-1) predictions, occurred less frequently in our model's predictions. However, the scatter plot for our model has a step-function-like appearance. Most of the responses received a score near 0 or near 1, and this is problematic because an ideal model should be able to match human scores even when the scores are moderate. This tendency is considered a limitation of our model that must be addressed in future work.

4.2.2 Model Analysis

We analyze our model by performing experiments with some variations in making the negative samples to be used with the random negative sample. (1) drop-golden: Instead of following the steps of scoring, selecting, and replacing, we randomly drop some of the words in the golden response to create a negative sample and use it with the random negative sample. (2) shuffle-golden: Instead of following the three steps, we randomly shuffle the words in the golden response to create a negative sample and use it with the random negative sample. (3) score-wo-history: We use the scoring function in Equation 1 without the first term, so that it only considers the probabilities within the sentence without the dialogue history. (4) select-random: Instead of using the scoring function proposed in Equation 1, we randomly select the words to be replaced. (5) replace-w-history: When replacing a word, we concatenate the dialogue history with the response, so that the LM considers the dialogue history when replacing the masked words.

Table 2 shows the correlations between model predictions and human evaluations for the modified models above. Dropping or shuffling words in the golden response to make a negative sample shows similar or lower performance compared to using random responses (BERT-retrieval (random-1, random-2)). The correlation was lower when the dialogue history was not considered in the scoring process than when it was considered. We speculate that this is because it gives high scores not only to words important for the consistency of a conversation but also to the words with low likelihoods in general. Randomly selecting the tokens shows lower correlation than using our proposed scoring function. Considering the dialogue history in the replacing process gives lower performance than when it is not considered. We speculate that providing the dialogue history makes predictions on the masked words that are more appropriate to the context, making the reconstructed response less appropriate as a negative sample.

Figure 4: Some examples of cases in which our model predicted scores similar to human evaluations. GPT2 denotes the GPT2-coherence model; BERT-rtv denotes the BERT-retrieval (random-1) model. All scores are normalized into the range of [0, 1].

4.2.3 Case Study

Figure 4 shows some of the evaluation results of each model on the DailyDialog dataset. The responses in the first and second examples are appropriate to the given dialogue history, as suggested by the high human score. BLEU-2 gives a score of 0 because the response has no bi-grams shared with the golden response. RUBER and GPT2-coherence did not recognize the utterances as appropriate responses. BERT-retrieval (random-1) and BERT-retrieval (ours) gave relatively high scores to the responses, evaluating them as appropriate utterances. In the third example, the system response appears to be somewhat relevant to the given context because it includes some words ("chance", "future") relevant to the phrase "take part in the finals". A repetition of a phrase in this example ("to get a chance") is believed to have contributed to the low human evaluation score (0.12). The RUBER and BERT-retrieval (random) models appear to lack this intuition and instead evaluate the response as appropriate, possibly because some words appear relevant. Our proposed model scored the response with a relatively low score of 0.15, which was close to the human score. In the fourth example, the response is not coherent, but because it begins with the sentence "Let me get a peek", it could have appeared as a coherent response to the previous dialogue about parking tickets. For this case, our proposed model and GPT2-coherence gave scores similar to human scores.

Figure 5: The POS tag distribution of the words in the original training corpus (left) and of the words selected by our method (right).

4.3 POS-tag distribution of selected words

We compute the Part-of-Speech (POS) tag distribution of the words selected by our method and compare it with the original distribution of the DailyDialog corpus (Figure 5).⁴ As we can see, the VERB and NOUN tags are the most frequently selected (21.9% and 20.5%, respectively), and their ratio is increased compared with the original corpus (18.3% and 16.7%, respectively). Meanwhile, the ratio of the punctuation tag (.) is greatly decreased (from 21.3% to 12.1%). We suspect that the likelihood of the punctuation tag is more affected by local information from a response rather than the dialog history.

⁴ We use the NLTK POS tagger (https://www.nltk.org/book/ch05.html) with the universal tagset.
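A sketch of this distribution computation with NLTK's universal tagset (per footnote 4; the helper name is ours):

```python
from collections import Counter
import nltk  # requires the 'averaged_perceptron_tagger' and 'universal_tagset' data

def pos_distribution(words):
    # Tag a list of word tokens and return the relative frequency of each POS tag.
    tags = [tag for _, tag in nltk.pos_tag(words, tagset="universal")]
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}
```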

4.3.1 Are the generated samples actually inappropriate?

To see whether the negative samples generated by our method are actually inappropriate, we conducted a survey through Amazon Mechanical Turk (AMT). We selected 40 dialogue history examples and prepared three types of responses for each dialogue: 1) the golden response, 2) a negative sample generated by our method, and 3) a randomly selected negative sample from the corpus. For each dialog, 4 annotators were asked to score the quality of the three responses. Following Lowe et al. (2017), we asked the question "How appropriate is the response overall?" for each context-response pair, and the evaluation was conducted on a 5-point Likert scale. The Fleiss' kappa and Krippendorff's alpha for the annotations were 0.63 and 0.63, respectively.

Figure 6: Box plot of scores for each type of response. The average scores of each type are 4.65, 2.51, and 1.19. The standard deviations for the scores of each type are 0.67, 1.27, and 0.41 (from left to right).

Figure 6 shows the survey results. The mean scores of golden and random responses were 4.65 and 1.19, respectively. The mean score of our negative samples was 2.51. The standard deviations for the scores of each response type were 0.67, 1.27, and 0.41 for the golden response, our negative sample, and the random response, respectively. We see that these results do not guarantee that all the generated negative samples are inappropriate. What we can assume, however, is that our method of manipulating a golden response generates a negative sample that is more inappropriate than the golden response. Table 3 shows two examples of the three different types of responses for a given dialog history with their survey results.

Dialog History:
A: Sir, would you like some dessert now?
B: Please show me the menu again.
A: Here you are, sir. The chocolate cake is very delicious.

Responses:
Golden: No thanks. I don't like chocolate. I'd like strawberry pie. (5)
Ours: No thanks. I don't have chocolate. I'll like some one. (1.5)
Random: I basically believe in science over theology. I mean, I (...) (1)

Dialog History:
A: Could you tell me something about your family?

Responses:
Golden: Ok. There are five people in my family: father, mother, elder brother, younger sister, and I. (5)
Ours: Ok. There are five children in my family: father, mother and brother and father my me. (3.25)
Random: When do you want to move in? (1.25)

Table 3: Examples of three different types of responses for a given dialog history, with their survey results. The highlighted words are newly generated by our method. The score of each response is underlined.

For a model learning to find the difference between appropriate and inappropriate responses, we speculate that the task of distinguishing the negative samples generated by our method from the golden responses would be more difficult than the task of distinguishing the randomly selected negative samples from the golden responses. We believe that this is because the generated negative samples can be inappropriate in more subtle ways than completely unrelated responses are. We suspect that learning with this more challenging setting has resulted in the performance gain that we discussed in Section 4.2.1. However, we believe that a more in-depth semantic analysis of individual cases is still needed, such as a more quantitative analysis (through an extensive human study, for instance) and further interpretation of the semantic relationships between the original golden responses and the negative samples modified by the proposed method. We leave this as future work.

5 Conclusion

In this paper, we proposed an automatic method for generating negative samples that can be used to train an unsupervised and unreferenced response evaluation model. We performed experiments to demonstrate that the proposed method can boost the unsupervised training of a response evaluation model. We analyzed the experiment results quantitatively and examined some examples that show the distinct characteristics of our proposed method.

Acknowledgments

This work was supported by an Institute for Information and communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00582, Prediction and augmentation of the credibility distribution via linguistic analysis and automated evidence document collection).

References

JinYeong Bak and Alice Oh. 2020. Speaker sensitive response evaluation model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6376–6385.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.

Asma Ghandeharioun, Judy Hanwen Shen, Natasha Jaques, Craig Ferguson, Noah Jones, Agata Lapedriza, and Rosalind Picard. 2019. Approximating interactive human evaluation with self-play for open-domain dialog systems. In Advances in Neural Information Processing Systems, pages 13658–13669.

Sarik Ghazarian, Johnny Wei, Aram Galstyan, and Nanyun Peng. 2019. Better automatic evaluation of open-domain dialogue systems with contextualized embeddings. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 82–89.

Tatsunori Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1689–1701.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In 8th International Conference on Learning Representations.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations.

Jiwei Li, Alexander H. Miller, Sumit Chopra, Marc'Aurelio Ranzato, and Jason Weston. 2016. Dialogue learning with human-in-the-loop. arXiv preprint arXiv:1611.09823.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, pages 986–995.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132.

Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic Turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1116–1126.

Shikib Mehri and Maxine Eskenazi. 2020. USR: An unsupervised and reference free evaluation metric for dialog generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 681–707.

Bo Pang, Erik Nijkamp, Wenjuan Han, Linqi Zhou, Yixian Liu, and Kewei Tu. 2020. Towards holistic and automatic evaluation of open-domain dialogue generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3619–3629.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog.

Ananya B. Sai, Akash Kumar Mohankumar, Siddhartha Arora, and Mitesh M. Khapra. 2020. Improving dialog evaluation with a multi-reference adversarial dataset and large scale pretraining. arXiv preprint arXiv:2009.11321.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892.

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 3776–3783.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 3295–3301.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2018. RUBER: An unsupervised method for automatic evaluation of open-domain dialog systems. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 722–729.

Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2018. TransferTransfo: A transfer learning approach for neural network based conversational agents. In NeurIPS Workshop on Conversational AI.

Hanlu Wu, Tengfei Ma, Lingfei Wu, Tariro Manyumwa, and Shouling Ji. 2020. Unsupervised reference-free summary quality evaluation via contrastive learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 3612–3621.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 2204–2213.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In 8th International Conference on Learning Representations.

Tianyu Zhao, Divesh Lala, and Tatsuya Kawahara. 2020. Designing precise and robust dialogue response evaluators. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 26–33.

Page 2: Generating Negative Samples by Manipulating Golden ...the golden response.Li et al.(2016) explored dia-log system with textual feedback.Ghandeharioun et al.(2019) suggested the necessity

1526

Utterance Prediction (NUP) task to learn for auto-matic response evaluation Their model which isunsupervised learned to distinguish an appropriateresponse from random negative samples (responsesrandomly taken from the training corpus) Themodel can evaluate the response quality by estimat-ing the probability that the response occurs directlyafter the dialogue history They also demonstratedthat the probability-based evaluations highly corre-lated with human evaluations of response quality

In this paper we propose a method to create anegative sample by manipulating a golden responseThe manipulation is carried out in three steps (1)scoring each word (2) selecting words to replaceand (3) replacing the selected words In the firststep each word is assigned a score designed to de-termine how dependent the word is on the contextIn the second step we select all the words with ascore above a threshold value where higher scoresindicate higher dependency to the dialogue historyIn the third step all previously selected words aremasked and replaced with words predicted in theirplace by a pretrained language model (LM) Fig-ure 1 shows an example of a negative sample gen-erated by our method When Whatrsquos wrong withheading out with Mark for vacation is the goldenresponse the tokens with heading vacationand were selected and replaced with Godinner and in that order

We find that the model trained with our negativesamples alongside random negative samples showsa higher correlation with human evaluations thanthe models trained only on random negative sam-ples in experiments using two datasets (Zhao et al2020) We also find evidence that automatic eval-uation systems trained with the negative samplesgenerated by our proposed method can make deci-sions closer to human judgment than those without

The contributions of this paper are as follows(1) We introduce a method that automatically

generates negative samples from the golden re-sponses

(2) We show that the negative samples can boostunsupervised learning of an automatic responseevaluation model with experiment results

(3) We conducted crowdsourcing and used itsresults to examine whether the negative samplesgenerated by our method are actually negative

2 Related Work

Liu et al (2016) pointed out that the traditional

n-gram overlap based metrics such as BLEU ME-TEOR and ROUGE show low correlation withhuman evaluations when used to evaluate the re-sults of an open-domain conversation system Theysuggested measuring the similarity by comparingembeddings of a generated response to those ofthe golden response Li et al (2016) explored dia-log system with textual feedback Ghandehariounet al (2019) suggested the necessity of interactivehuman evaluation for dialogue systems and pro-posed a self-play scenario to reduce the burden ofhuman effort Hashimoto et al (2019) proposed amethod to combine human assessments with thepredictions of an evaluation model

Lowe et al (2017) proposed a supervised learn-ing method to predict the quality of a responsedirectly rather than measuring the similarities withgolden responses Tao et al (2018) showed thata model trained on the NUP task in an unsuper-vised manner can be used to predict the quality ofa response that is generated by a system Ghazar-ian et al (2019) improved the previous work byusing contextualized word embeddings Mehri andEskenazi (2020) proposed two unsupervised evalu-ation models one based on masked language mod-eling (MLM) and another based on the responseretrieval task using a pretrained LM Pang et al(2020) predicted the coherence and fluency of aresponse by estimating its likelihood using a LM

Sai et al (2020) emphasized the importance ofadversarial negative samples for learning responseevaluation and released a dataset with human-curated adversarial negative responses Their neg-ative samples were manually curated howeverwhose process can be both time-consuming andexpensive Wu et al (2020) attempted to improvethe performance of evaluation models for abstrac-tive summarization by corrupting the golden sum-mary and using it as a negative sample In themachine translation task Sellam et al (2020) cre-ated paired data with synthetic examples throughmethods such as back-translation and mask-fillingwith BERT (Devlin et al 2019) and they usedthe paired data to pretrain the evaluation modelsOur work introduces a method to create negativesamples by manipulating the golden response to thedialogue history and also suggests that the negativesamples generated by the proposed method couldbe used to improve the unsupervised response eval-uation model The proposed method can be per-formed automatically without human effort

1527

3 Method

In this section we describe our method to gener-ate negative samples The proposed method cre-ates a negative sample by selecting and replacingspecific word(s) in a golden response The wordselection is based on the difference between (a) theestimated probability that a word would appear inthe response considering the dialogue history and(b) the estimated probability that the word wouldappear in the response when the dialogue history isnot considered An LM that can perform MLM canbe used to estimate these probabilities Words thathave large differences in probability are selectedand replaced with other words When replacing aword with another word an LM that can performMLM can be used to predict the word that is mostlikely to appear in the position of the original wordwhen the dialogue history is not given

31 Scoring

The proposed method includes a scoring processto determine which words in the golden responseare affected the most by the dialogue history Thescore of a word is calculated by taking the differ-ence between (a) the estimated probability of theword appearing in its position when the dialoguehistory is given and (b) the estimated probabilityof the word appearing in its position when the dia-logue history is not given This scoring process isperformed independently for all words in the targetresponse

Specifically to calculate the score of the i-thword (xi) in the golden response we first replacexi with the [mask] token Then the likelihood thatthe original word xi appears in place of the maskedtoken is calculated twice once with the dialoguehistory and once without The difference in the log-likelihood is used as the final score of each wordwhich is defined as

score(xi | c ri θ) = log(P(xi | [c ri] θ))minuslog(P(xi | ri θ))

(1)

where xi denotes the word to be scored and ri de-notes the sequence of words in the golden responsewhere xi is masked c denotes the dialogue historyof the golden response and [ ] the concatenationof two pieces of text P (xi|[c ri] θ) denotes theestimated probability that xi would occur when thedialog history is considered P (xi|ri θ) denotesthe estimated probability that xi would occur when

Figure 2 An illustration of our proposed method forgenerating a negative sample The original response tothe dialogue history Whatrsquos wrong with heading outwith Mark for vacation is manipulated through thesteps of scoring selecting and replacing

the dialog history is not considered θ denotes theparameters of the LM

Figure 2 shows an example of our proposed scor-ing process The word vacation in the original re-sponse received the highest score among the wordsin the response The words with and headingalso scored higher than other words

32 Selecting

For each sentence we select words that scoredhigher than the threshold t For example in thecase seen in Figure 2 if the threshold is 05 thewords with heading vacation and willbe selected If none of the words receive a scorehigher than the threshold value no words will beselected and in this case a negative sample cannotbe generated

We set the threshold t to 05 for our experimentsUsing this threshold in our dataset an average of2728 of tokens were selected for each responseAlso 9489 of the responses contained at leastone selected word which means a negative samplecould be generated for 9489 of the cases

1528

33 ReplacingThe selected words are then replaced using an LMAll selected words are replaced with [mask] tokensin the original response Then the LM predictswithout considering the dialogue history the wordsthat are most likely to occur in the location of eachmasked word If the LM predicts the original wordthe second most likely word is used instead

4 Experiments

41 Setting411 DatasetTo measure the correlation between model predic-tions and human evaluations we use the response-evaluation dataset proposed by Zhao et al (2020)The dataset contains dialogue histories machine-generated responses golden responses and ap-propriateness scores evaluated by human annota-tors The scores were on a 5-point Likert scaleand each response was scored by four annota-tors Six generative models S2S (Sutskever et al2014) attentional S2S HRED (Serban et al 2016)VHRED (Serban et al 2017) GPT2-sm and GPT2-md (Wolf et al 2018) with three decoding algo-rithms greedy decoding ancestral decoding andnucleus sampling (Holtzman et al 2020) wereused to generate the responses They used Daily-Dialog (Li et al 2017) and PersonaChat (Zhanget al 2018) For each dataset they trained a set ofgenerative conversation models Each of the 900context-response pairs was randomly selected fromthe test set of the two datasets and the annotatorsevaluated the appropriateness of each response tothe context to construct two different evaluationdatasets The Krippendorffrsquos alpha for this datasetwas 0815 suggesting reasonable inter-annotatoragreement

DailyDialog dataset consists of 13118 multi-turn open-domain conversations written by hu-man workers and PersonaChat dataset consists of12875 multi-turn open-domain conversations writ-ten by human workers

412 ModelsThe evaluation models used in the experiment arelisted below Among them BLEU ROUGE ME-TEOR Embedding AverageExtremaGreedy andBERTScore are reference-based metrics that eval-uate the quality of a response based on its similar-ity to the golden response BERT-MLM GPT2-coherence BERT-retrieval (random-N) BERT-

retrieval (ours) are unreferenced metrics that do notrequire golden responses RUBER can be viewedas a hybrid metric that includes both reference-based and unreferenced approaches Some of thereference-based metrics are simple comparisonmethods rather than trainable models but are pre-sented along with other models because they canalso be used to estimate the quality of responsesIt should be noted that we do not compare theunsupervised approaches listed below with super-vised approaches such as the ones proposed byLowe et al (2017) Zhao et al (2020) which re-quire human-annotated response-evaluation pairsfor training

BLEU is a widely used metric for the machinetranslation task by measuring n-gram precision be-tween multiple references and a hypothesis (Pap-ineni et al 2002)

ROUGE is a widely used metric for text sum-marization which measures the n-gram recall (Lin2004) We use the F-score of ROUGE-L as anappropriateness score

METEOR is a metric for the machine transla-tion task which considers both n-gram precisionand n-gram recall of a hypothesis (Banerjee andLavie 2005)

Embeddding AverageGreedyExtrema calcu-late the similarity between golden and generatedresponses using the embedding similarity to ac-count for the diverse ways in which the goldenresponse could be stated (Liu et al 2016)

BERTScore is a recently proposed unsuper-vised metric based on the contextualized BERTembeddings (Zhang et al 2020)

RUBER calculates the scores of reference-basedand unreferenced metrics individually then usesthem to predict the final score (Tao et al 2018)The reference-based metric measures the similaritybetween golden responses and generated responsesbased on their embedding similarity The unrefer-enced metric is trained on the NUP task

BERT-MLM sums the log-likelihood of each to-ken in a response after masking it using an LM thatis fine-tuned on a corpus then uses the aggregatedlikelihood as the final score of the response (Mehriand Eskenazi 2020)

GPT2-coherence measures the coherence be-tween the dialogue history and a response by usinga fine-tuned GPT2 model (Radford et al 2019)to compute the averaged log-likelihood of the re-sponse (Pang et al 2020)

1529

BERT-retrieval (random-N) is a BERT-basedmodel that is trained to distinguish a golden re-sponse from a negative sample (Mehri and Eske-nazi 2020) using the dialogue history We refer tothe original model by Mehri and Eskenazi (2020)as BERT-retrieval (random-1) since they used onerandom response as a negative sample for a dia-logue history We refer to a variation of the modelthat uses two random negative samples for a dia-logue history as BERT-retrieval (random-2) Thisis to fairly compare with our model which usestwo negative samples for a dialogue history as ex-plained below

BERT-retrieval (ours) is a model that has thesame structure as the BERT-retrieval model Thedifference is that our model utilizes the negativesamples generated by the method that we proposeThe model uses both the generated negative sam-ples and the random negative samples Specificallyduring training the model learns to distinguish agolden response from two negative samples onegenerated from our method and one randomly sam-pled from the corpus

413 Implementation DetailsWe trained the unreferenced models on the origi-nal DailyDialog dataset and then evaluated themon the two response-evaluation datasets (Sec-tion 411) We split the conversations in the Daily-Dialog dataset in a sliding window manner to con-struct pairs of dialogue histories and correspondingresponses The maximum turn of the dialogue his-tory was set to 5 following Zhao et al (2020)

We use the pretrained BERT and GPT2 releasedby Wolf et al (2018) for all of our relevant exper-iments2 A BERT model fine-tuned on the Dai-lyDialog train set with MLM for 1 epoch wasused for the scoring step of our proposed method(Section 31) The same model was used for thereplacing step (Section 33) We used the thresh-old3 of 05 for the selecting step (Section 32) Weused Adam optimizer (Kingma and Ba 2015) fortraining We searched for hyperparameters forthe BERT-retrieval (random-1) model that max-imize the (Pearson) correlation between humanevaluations and model predictions on the response-evaluation dataset made from DailyDialog dataset

² bert-base-uncased and gpt2 (12-layer) are used.

³ We tested threshold values of 0, 0.5, 1, and 2 and found that using 0.5 as the threshold achieved the highest correlation with human evaluations; therefore, we report only the experiment results with this value.

Model                  DailyDialog        Persona
                       r      ρ           r      ρ
BLEU                   .08    .05         .19    .23
METEOR                 .11    .33         .23    .18
ROUGE                  .12    .09         .25    .21
Embed Average          .09    .08         .16    .17
Embed Greedy           .18    .18         .25    .24
Embed Extrema          .16    .15         .28    .27
BERTScore              .13    .12         .28    .26
RUBER                  .28    .26         .07    .04
BERT-MLM               .32    .38         .35    .35
GPT2-coherence         .47    .47         .48    .48
BERT-rtv (random-1)    .47    .47         .56    .60
BERT-rtv (random-2)    .49    .48         .55    .58
BERT-rtv (ours)        .55    .56         .64    .66

Table 1: The correlations between model predictions and human evaluations for each model, based on the two response-evaluation datasets. The highest score on each metric is highlighted in bold. All values with p > 0.001 are marked. DailyDialog and Persona denote the response-evaluation datasets made from the DailyDialog and PersonaChat datasets, respectively. BERT-rtv denotes the BERT-retrieval model. r and ρ denote the Pearson correlation and Spearman's rank correlation coefficient, respectively.

Model                DailyDialog        Persona
                     r      ρ           r      ρ
drop-golden          .49    .48         .54    .57
shuffle-golden       .46    .45         .55    .59
score-wo-history     .51    .52         .58    .58
select-random        .54    .55         .57    .59
replace-w-history    .52    .52         .57    .57

Table 2: The correlations between model predictions and human evaluations for each of the variations of our model, based on the two response-evaluation datasets.

The values found in this search (epoch = 3, batch size = 64, and learning rate = 2e-5) were used for all the BERT-retrieval models (random-N, ours). The random seed was fixed for all experiments.

4.2 Results

In Section 4.2.1, we check the correlations between the results of each evaluation model and human evaluations. In Section 4.2.2, an in-depth analysis of our proposed method is shown. In Section 4.2.3, we present examples that may suggest that automatic evaluation systems trained with the proposed method can make decisions closer to human judgment than models that have not.

Figure 3: Scatter plots that show in detail the correlations between model predictions and human evaluations. Each of the plots contains 800 system-generated responses in the response-evaluation dataset made from the DailyDialog dataset (Section 4.1.1). Each point indicates a response. Its x-value indicates the human evaluation score for the quality of the response, given on a 5-point Likert scale. Its y-value indicates the model prediction for the quality of the response, normalized into the range of [0, 1]. The orange line is a linear regression. We add noise sampled from N(0, 0.09) to the human scores for better visualization, following previous studies (Lowe et al., 2017; Bak and Oh, 2020; Pang et al., 2020).
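The jittering described in the caption can be reproduced roughly as follows; the data values are toy placeholders and the plotting details are assumptions for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt

human = np.array([1, 2, 3, 3, 4, 5, 5], dtype=float)        # toy human scores
pred = np.array([0.1, 0.2, 0.5, 0.4, 0.8, 0.9, 0.95])       # toy model predictions

# N(0, 0.09) is read as variance 0.09, i.e., standard deviation 0.3.
jittered = human + np.random.normal(0.0, np.sqrt(0.09), size=human.shape)
slope, intercept = np.polyfit(human, pred, deg=1)            # regression on the raw scores

plt.scatter(jittered, pred, s=10)
plt.plot([1, 5], [slope * 1 + intercept, slope * 5 + intercept], color="orange")
plt.xlabel("Human score (jittered)")
plt.ylabel("Model prediction")
plt.show()
```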


4.2.1 Correlation with Human Judgment

Table 1 shows the correlation between model predictions and human evaluations for each model based on the two datasets. Pearson correlation (r) and Spearman's rank correlation coefficient (ρ) were used to measure the correlation between human scores and model predictions. It should be noted that we excluded the scores of golden responses from the response-evaluation datasets and extracted 800 and 750 response-evaluation pairs from the DailyDialog and PersonaChat datasets, respectively. The model incorporating our negative-sample method made predictions with a higher correlation with human evaluations than the predictions made by BERT-retrieval (random-2), which uses the same number of negative samples for training. Among the baseline models, most of the reference-based metrics showed comparatively low performance. It is thought that these results support the observations made by previous studies suggesting that using the golden response as the "one and only" correct answer to evaluate responses can be ineffective. RUBER showed better performance than other reference-based models for the DailyDialog dataset but showed low performance in evaluating PersonaChat responses. The GPT2-coherence model showed performance similar to the BERT-retrieval (random-1) model on the DailyDialog dataset but relatively low performance on the PersonaChat dataset. It should also be noted that the hybrid and unreferenced models were trained on the DailyDialog dataset and not on the PersonaChat dataset.
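For reference, the two correlation measures reported above can be computed with scipy; the package choice and the toy values below are illustrative assumptions, not the paper's evaluation code.

```python
from scipy.stats import pearsonr, spearmanr

human_scores = [4.0, 1.5, 3.0, 5.0, 2.0]          # toy values
model_scores = [0.82, 0.10, 0.55, 0.91, 0.30]

r, r_pvalue = pearsonr(human_scores, model_scores)
rho, rho_pvalue = spearmanr(human_scores, model_scores)
print(f"Pearson r = {r:.2f} (p = {r_pvalue:.3f}), Spearman rho = {rho:.2f}")
```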

Figure 3 shows scatter plots visualizing the human scores and model predictions for the response-evaluation dataset on DailyDialog. BLEU tended to predict low scores. This may suggest that there were only a few n-gram overlaps between the golden responses and the generated responses. The predictions of the embedding-based metrics (Embed Greedy and BERTScore) were concentrated in a specific range and showed low correlation with human scores. The unreferenced or hybrid metrics (RUBER, BERT-MLM, GPT2-coherence, and BERT-retrieval (random-1)) show relatively higher correlations than the reference-based metrics. We can see that BERT-retrieval (ours) shows the greatest correlation among the models, with a correlation coefficient of 0.1974. The scatter plots suggest that false-positive predictions, which frequently occurred in the BERT-retrieval (random-1) predictions, occurred less frequently in our model's predictions. However, the scatter plot for our model has a step-function-like appearance. Most of the responses received a score near 0 or near 1, and this is problematic because an ideal model should be able to match human scores even when the scores are moderate. This tendency is considered a limitation of our model that must be addressed in future work.

4.2.2 Model Analysis

We analyze our model by performing experiments with some variations in making the negative samples to be used alongside the random negative sample: (1) drop-golden: instead of following the steps of scoring, selecting, and replacing, we randomly drop some of the words in the golden response to create a negative sample and use it with the random negative sample; (2) shuffle-golden: instead of following the three steps, we randomly shuffle the words in the golden response to create a negative sample and use it with the random negative sample; (3) score-wo-history: we use the scoring function in Equation 1 without the first term, so that it only considers the probabilities within the sentence, without the dialogue history; (4) select-random: instead of using the scoring function proposed in Equation 1, we randomly select the words to be replaced; (5) replace-w-history: when replacing a word, we concatenate the dialogue history with the response so that the LM considers the dialogue history when replacing the masked words.

Table 2 shows the correlations between model predictions and human evaluations for the modified models above. Dropping or shuffling words in the golden response to make a negative sample shows similar or lower performance compared to using random responses (BERT-retrieval (random-1, random-2)). The correlation was lower when the dialogue history was not considered in the scoring process than when it was considered. We speculate that this is because it gives high scores not only to words important for the consistency of a conversation but also to words with low likelihoods in general. Randomly selecting the tokens shows lower correlation than using our proposed scoring function. Considering the dialogue history in the replacing process gives lower performance than when it is not considered. We speculate that providing the dialogue history makes the predictions for the masked words more appropriate to the context, making the reconstructed response less appropriate as a negative sample.

Figure 4: Some examples of cases in which our model predicted scores similar to human evaluations. GPT2 denotes the GPT2-coherence model; BERT-rtv denotes BERT-retrieval (random-1). All scores are normalized into the range of [0, 1].


4.2.3 Case Study

Figure 4 shows some of the evaluation results of each model on the DailyDialog dataset. The responses in the first and second examples are appropriate to the given dialogue history, as suggested by the high human scores. BLEU-2 gives a score of 0 because the response has no bi-grams shared with the golden response.


Figure 5: The POS tag distribution of the words in the original training corpus (left, "CORPUS") and of the words selected by our method (right, "OURS").

RUBER and GPT2-coherence did not recognize the utterances as appropriate responses. BERT-retrieval (random-1) and BERT-retrieval (ours) gave relatively high scores to the responses, evaluating them as appropriate utterances. In the third example, the system response appears to be somewhat relevant to the given context because it includes some words ("chance", "future") relevant to the phrase "take part in the finals". A repetition of a phrase in this example ("to get a chance") is believed to have contributed to the low human evaluation score (0.12). The RUBER and BERT-retrieval (random) models appear to lack this intuition and instead evaluate the response as appropriate, possibly because some words appear relevant. Our proposed model scored the response with a relatively low score of 0.15, which was close to the human score. In the fourth example, the response is not coherent, but because it begins with the sentence "Let me get a peek", it could have appeared as a coherent response to the previous dialogue about parking tickets. For this case, our proposed model and GPT2-coherence gave scores similar to the human scores.

4.3 POS-tag distribution of selected words

We compute the Part-of-Speech (POS) tag distribution of the words selected by our method and compare it with the original distribution of the DailyDialog corpus (Figure 5).⁴ The VERB and NOUN tags are the most frequently selected (21.9% and 20.5%, respectively), and their ratios are higher than in the original corpus (18.3% and 16.7%, respectively). Meanwhile, the ratio of the punctuation tag (".") is greatly decreased (from 21.3% to 12.1%).

⁴ We use the NLTK POS tagger (https://www.nltk.org/book/ch05.html) with the universal tagset.

Figure 6: Box plot of scores for each type of response. The average scores of each type are 4.65, 2.51, and 1.19. The standard deviations for the scores of each type are 0.67, 1.27, and 0.41 (from left to right).

We suspect that the likelihood of the punctuation tag is affected more by local information from the response than by the dialogue history.
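The tag distribution mentioned in footnote 4 can be computed along the following lines; the sentences and any preprocessing here are illustrative placeholders rather than the exact analysis script.

```python
# Requires nltk data: punkt, averaged_perceptron_tagger, universal_tagset.
from collections import Counter
import nltk

def pos_distribution(sentences):
    """Percentage of each universal POS tag over all tokens in the given sentences."""
    counts = Counter()
    for sent in sentences:
        tokens = nltk.word_tokenize(sent)
        counts.update(tag for _, tag in nltk.pos_tag(tokens, tagset="universal"))
    total = sum(counts.values())
    return {tag: 100.0 * n / total for tag, n in counts.most_common()}

print(pos_distribution(["What's wrong with heading out with Mark for vacation?"]))
```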

4.3.1 Are the generated samples actually inappropriate?

To see whether the negative samples generated by our method are actually inappropriate, we conducted a survey through Amazon Mechanical Turk (AMT). We selected 40 dialogue history examples and prepared three types of responses for each dialogue: 1) the golden response, 2) a negative sample generated by our method, and 3) a randomly selected negative sample from the corpus. For each dialogue, 4 annotators were asked to score the quality of the three responses. Following Lowe et al. (2017), we asked the question "How appropriate is the response overall?" for each context-response pair, and the evaluation was conducted on a 5-point Likert scale. The Fleiss' kappa and Krippendorff's alpha for the annotations were 0.63 and 0.63, respectively.
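For reference, both agreement statistics can be computed with off-the-shelf packages; the package choices (statsmodels and krippendorff) and the ratings below are assumptions for illustration, not the paper's analysis code.

```python
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa, aggregate_raters
import krippendorff

# Toy ratings: one row per item, one column per annotator (5-point scale).
ratings = np.array([[5, 5, 4, 5],
                    [2, 3, 2, 2],
                    [1, 1, 2, 1]])

table, _ = aggregate_raters(ratings)                 # items x categories counts
print("Fleiss' kappa:", fleiss_kappa(table))
print("Krippendorff's alpha:",
      krippendorff.alpha(reliability_data=ratings.T,  # coders x units
                         level_of_measurement="ordinal"))
```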

Figure 6 shows the survey results. The mean scores of the golden and random responses were 4.65 and 1.19, respectively. The mean score of our negative samples was 2.51. The standard deviations for the scores of each response type were 0.67, 1.27, and 0.41 for the golden response, our negative sample, and the random response, respectively. We see that these results do not guarantee that all the generated negative samples are inappropriate. What we can assume, however, is that our method of manipulating a golden response generates a negative sample that is more inappropriate than the golden response. Table 3 shows two examples of the three different types of responses for a given dialogue history with their survey results.


Dialog History
A: Sir, would you like some dessert now?
B: Please show me the menu again.
A: Here you are, sir. The chocolate cake is very delicious.

Responses
Golden: No thanks. I don't like chocolate. I'd like strawberry pie. (5)
Ours: No thanks. I don't have chocolate. I'll like some one. (1.5)
Random: I basically believe in science over theology. I mean I (...). (1)

Dialog History
A: Could you tell me something about your family?

Responses
Golden: Ok. There are five people in my family: father, mother, elder brother, younger sister, and I. (5)
Ours: Ok. There are five children in my family: father, mother and brother and father my me. (3.25)
Random: When do you want to move in? (1.25)

Table 3: Examples of the three different types of responses for a given dialogue history, with their survey results. The highlighted words are those newly generated by our method, and the score of each response is given in parentheses.

For a model learning to find the difference between appropriate and inappropriate responses, we speculate that the task of distinguishing the negative samples generated by our method from the golden responses would be more difficult than the task of distinguishing the randomly selected negative samples from the golden responses. We believe that this is because the generated negative samples can be inappropriate in more subtle ways than completely unrelated responses are. We suspect that learning in this more challenging setting has resulted in the performance gain discussed in Section 4.2.1. However, we believe that a more in-depth semantic analysis of individual cases is needed, such as a more quantitative analysis (through an extensive human study, for instance) and further interpretation of the semantic relationships between the original golden responses and the negative samples modified by the proposed method. We leave this as future work.

5 Conclusion

In this paper, we proposed an automatic method for generating negative samples that can be used to train an unsupervised and unreferenced response evaluation model. We performed experiments to demonstrate that the proposed method can boost the unsupervised training of a response evaluation model. We analyzed the experiment results quantitatively and examined some examples that show the distinct characteristics of our proposed method.

Acknowledgments

This work was supported by an Institute for Information and communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00582, Prediction and augmentation of the credibility distribution via linguistic analysis and automated evidence document collection).

References

JinYeong Bak and Alice Oh. 2020. Speaker sensitive response evaluation model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6376–6385.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.

Asma Ghandeharioun, Judy Hanwen Shen, Natasha Jaques, Craig Ferguson, Noah Jones, Agata Lapedriza, and Rosalind Picard. 2019. Approximating interactive human evaluation with self-play for open-domain dialog systems. In Advances in Neural Information Processing Systems, pages 13658–13669.

Sarik Ghazarian, Johnny Wei, Aram Galstyan, and Nanyun Peng. 2019. Better automatic evaluation of open-domain dialogue systems with contextualized embeddings. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 82–89.

Tatsunori Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1689–1701.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In 8th International Conference on Learning Representations.


Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations.

Jiwei Li, Alexander H. Miller, Sumit Chopra, Marc'Aurelio Ranzato, and Jason Weston. 2016. Dialogue learning with human-in-the-loop. arXiv preprint arXiv:1611.09823.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, pages 986–995.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132.

Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic Turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1116–1126.

Shikib Mehri and Maxine Eskenazi. 2020. USR: An unsupervised and reference free evaluation metric for dialog generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 681–707.

Bo Pang, Erik Nijkamp, Wenjuan Han, Linqi Zhou, Yixian Liu, and Kewei Tu. 2020. Towards holistic and automatic evaluation of open-domain dialogue generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3619–3629.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog.

Ananya B. Sai, Akash Kumar Mohankumar, Siddhartha Arora, and Mitesh M. Khapra. 2020. Improving dialog evaluation with a multi-reference adversarial dataset and large scale pretraining. arXiv preprint arXiv:2009.11321.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892.

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 3776–3783.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 3295–3301.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2018. RUBER: An unsupervised method for automatic evaluation of open-domain dialog systems. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 722–729.

Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2018. TransferTransfo: A transfer learning approach for neural network based conversational agents. In NeurIPS Workshop on Conversational AI.

Hanlu Wu, Tengfei Ma, Lingfei Wu, Tariro Manyumwa, and Shouling Ji. 2020. Unsupervised reference-free summary quality evaluation via contrastive learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 3612–3621.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 2204–2213.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In 8th International Conference on Learning Representations.

Tianyu Zhao, Divesh Lala, and Tatsuya Kawahara. 2020. Designing precise and robust dialogue response evaluators. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 26–33.



BLEU is a widely used metric for the machinetranslation task by measuring n-gram precision be-tween multiple references and a hypothesis (Pap-ineni et al 2002)

ROUGE is a widely used metric for text sum-marization which measures the n-gram recall (Lin2004) We use the F-score of ROUGE-L as anappropriateness score

METEOR is a metric for the machine transla-tion task which considers both n-gram precisionand n-gram recall of a hypothesis (Banerjee andLavie 2005)

Embeddding AverageGreedyExtrema calcu-late the similarity between golden and generatedresponses using the embedding similarity to ac-count for the diverse ways in which the goldenresponse could be stated (Liu et al 2016)

BERTScore is a recently proposed unsuper-vised metric based on the contextualized BERTembeddings (Zhang et al 2020)

RUBER calculates the scores of reference-basedand unreferenced metrics individually then usesthem to predict the final score (Tao et al 2018)The reference-based metric measures the similaritybetween golden responses and generated responsesbased on their embedding similarity The unrefer-enced metric is trained on the NUP task

BERT-MLM sums the log-likelihood of each to-ken in a response after masking it using an LM thatis fine-tuned on a corpus then uses the aggregatedlikelihood as the final score of the response (Mehriand Eskenazi 2020)

GPT2-coherence measures the coherence be-tween the dialogue history and a response by usinga fine-tuned GPT2 model (Radford et al 2019)to compute the averaged log-likelihood of the re-sponse (Pang et al 2020)

1529

BERT-retrieval (random-N) is a BERT-basedmodel that is trained to distinguish a golden re-sponse from a negative sample (Mehri and Eske-nazi 2020) using the dialogue history We refer tothe original model by Mehri and Eskenazi (2020)as BERT-retrieval (random-1) since they used onerandom response as a negative sample for a dia-logue history We refer to a variation of the modelthat uses two random negative samples for a dia-logue history as BERT-retrieval (random-2) Thisis to fairly compare with our model which usestwo negative samples for a dialogue history as ex-plained below

BERT-retrieval (ours) is a model that has thesame structure as the BERT-retrieval model Thedifference is that our model utilizes the negativesamples generated by the method that we proposeThe model uses both the generated negative sam-ples and the random negative samples Specificallyduring training the model learns to distinguish agolden response from two negative samples onegenerated from our method and one randomly sam-pled from the corpus

413 Implementation DetailsWe trained the unreferenced models on the origi-nal DailyDialog dataset and then evaluated themon the two response-evaluation datasets (Sec-tion 411) We split the conversations in the Daily-Dialog dataset in a sliding window manner to con-struct pairs of dialogue histories and correspondingresponses The maximum turn of the dialogue his-tory was set to 5 following Zhao et al (2020)

We use the pretrained BERT and GPT2 releasedby Wolf et al (2018) for all of our relevant exper-iments2 A BERT model fine-tuned on the Dai-lyDialog train set with MLM for 1 epoch wasused for the scoring step of our proposed method(Section 31) The same model was used for thereplacing step (Section 33) We used the thresh-old3 of 05 for the selecting step (Section 32) Weused Adam optimizer (Kingma and Ba 2015) fortraining We searched for hyperparameters forthe BERT-retrieval (random-1) model that max-imize the (Pearson) correlation between humanevaluations and model predictions on the response-evaluation dataset made from DailyDialog dataset

2bert-base-uncased and gpt2-12layer areused

3We tested threshold values of 0 05 1 and 2 and foundthat using 05 as the threshold achieved the highest correla-tion with human evaluations therefore we report only theexperiment results with this value

DailyDialog PersonaModel r ρ r ρ

BLEU 08 05 19 23METEOR 11 33 23 18ROUGE 12 09 25 21Embed Average 09 08 16 17Embed Greedy 18 18 25 24Embed Extrema 16 15 28 27BERTScore 13 12 28 26RUBER 28 26 07 04BERT-MLM 32 38 35 35GPT2-coherence 47 47 48 48BERT-rtv (rand1) 47 47 56 60BERT-rtv (rand2) 49 48 55 58BERT-rtv (ours) 55 56 64 66

Table 1 The correlations between model predictionsand human evaluations for each model based on thetwo response-evaluation datasets The highest score oneach metric is highlighted in bold All values with p gt0001 are marked with DailyDialog and Persona de-note the response-evaluation datasets made from Daily-Dialog and PersonaChat datasets respectively BERT-rtv denotes the BERT-retrieval model r and ρ meanthe Pearson correlation and Spearmanrsquos rank correla-tion coefficient respectively

DailyDialog PersonaModel r ρ r ρ

drop-golden 49 48 54 57shuffle-golden 46 45 55 59score-wo-history 51 52 58 58select-random 54 55 57 59replace-w-history 52 52 57 57

Table 2 The correlations between model predictionsand human evaluations for each of the variations of ourmodel based on the two response-evaluation datasets

(Section 411) The values found in this search(epoch=3 batch size=64 and learning rate=2e-5) were used for all the BERT-retrieval models(random-N ours) The random seed was fixed forall experiments

42 Results

In Section 421 we check the correlations betweenthe results of each evaluation model and humanevaluations In Section 422 an in-depth anal-ysis of our proposed method is shown In Sec-tion 423 we present examples that may suggestthat automatic evaluation systems that have beentrained with the proposed method can make deci-

1530

Figure 3 The scatter plots that show in detail the correlations between model predictions and human evaluationsEach of the plots contains 800 system-generated responses in the response-evaluation dataset made from Daily-Dialog dataset (Section 411) Each point indicates a response Its x-value indicates the human evaluation scorefor the quality of the response given on a 5-point Likert scale Its y-value indicates the model prediction for thequality of the response normalized into the range of [0 1] The orange line is a linear regression We add a noisesampled from N (0 009) into human score for better visualization following previous studies (Lowe et al 2017Bak and Oh 2020 Pang et al 2020)

sions closer to human judgment than models thathave not

421 Correlation with Human Judgment

Table 1 shows the correlation between model pre-dictions and human evaluations for each modelbased on the two datasets Pearson correlation (r)and Spearmanrsquos rank correlation coefficient (ρ)were used to measure the correlation between hu-man score and model prediction It should benoted that we excluded the scores of golden re-sponses from the response-evaluation datasets andextracted 800 and 750 response-evaluation pairsfrom the DailyDialog and PersonaChat datasetsrespectively The model incorporating our nega-tive sample method made predictions with highercorrelation with human evaluations than the predic-tions made by BERT-retrieval (random-2) whichuses the same number of negative samples fortraining Among the baseline models most ofthe reference-based metrics showed comparativelylow performances It is thought that these resultssupport the observations made by previous stud-ies suggesting that using the golden response asthe ldquoone and onlyrdquo correct answer to evaluate re-sponses can be ineffective RUBER showed bet-ter performance than other reference-based mod-

els for the DailyDialog dataset but showed lowperformance in evaluating PersonaChat responsesThe GPT2-coherence model showed similar perfor-mance to the BERT-retrieval (random-1) model onthe DailyDialog dataset but relatively low perfor-mance in the PersonaChat dataset It should alsobe noted that the hybrid and unreferenced modelswere trained on the DailyDialog dataset and noton the PersonaChat dataset

Figure 3 shows a scatter plot visualizing the hu-man scores and model predictions for the response-evaluation dataset on DailyDialog BLEU tendedto predict low scores This may suggest thatthere were only a few n-gram overlaps betweenthe golden responses and the generated responsesThe predictions of embedding-based metrics (EmbGreedy and BERTScore) were concentrated on aspecific range and showed low correlation withhuman scores The unreferenced or hybrid met-rics (RUBER BERT-MLM GPT2-coherence andBERT-retrieval (random-1)) show relatively highercorrelations than the reference-based metrics Wecan see that BERT-retrieval (ours) shows the great-est correlation among the models with a correla-tion coefficient of 01974 The scatter plots suggestthat false-positive predictions which frequentlyoccurred in the BERT-retrieval (random-1) predic-

1531

tions occurred less frequently in our modelrsquos pre-dictions However the scatter plot for our modelhas a step-function-like appearance Most of theresponses received a score near 0 or near 1 and thisis problematic because an ideal model should beable to match human scores even when the scoresare moderate This tendency is considered as a lim-itation of our model that must be addressed in thefuture work

422 Model AnalysisWe analyze our model by performing experimentswith some variations in making the negative sam-ples to be used with the random negative sample(1) drop-golden Instead of following the stepsof scoring selecting and replacing we randomlydrop some of the words in the golden response tocreate a negative sample and use it with the ran-dom negative sample (2) shuffle-golden Instead offollowing the three steps we randomly shuffle thewords in the golden response to create a negativesample and use it with the random negative sample(3) score-wo-history We use the scoring functionin Equation 1 without the first term so that it onlyconsiders the probabilities within the sentence with-out the dialogue history (4) select-random Insteadof using the scoring function proposed in Equation1 we randomly select the words to be replaced(5) replace-w-history When replacing a word weconcatenate the dialogue history with the responseso that the LM considers the dialogue history whenreplacing the masked words

Table 2 shows the correlations between modelpredictions and human evaluations for the modi-fied models above Dropping or shuffling wordsin the golden response to make a negative sampleshows similar or lower performance compared tousing random responses (BERT-retrieval (random-1 random-2)) The correlation was lower when thedialogue history was not considered in the scoringprocess than when it was considered We specu-late that this is because it gives high scores notonly to words important for the consistency of aconversation but also to the words with low likeli-hoods in general Randomly selecting the tokensshows lower correlation than using our proposedscoring function Considering the dialogue historyin the replacing process gives lower performancethan when it is not considered We speculate thatproviding the dialogue history makes predictionson the masked words that are more appropriate tothe context making the reconstructed response less

Figure 4 Some examples of cases in which our modelpredicted scores similar to human evaluations GPT2denotes the GPT2-coherence model BERT-rtv de-notes the BERT-retrieval (random-1) All scores arenormalized into the range of [0 1]

appropriate as a negative sample

423 Case StudyFigure 4 shows some of the evaluation results ofeach model on the DailyDialog dataset The re-sponses in the first and second examples are appro-priate to the given dialogue history as suggested bythe high human score BLEU-2 gives a score of 0

1532

2130

VERB1830

NOUN 1670

PRON 1020

ADJ 750

ADP 730

ADV 660

etc 1210

CORPUS

VERB2190

NOUN 2050

ADJ 12

1210

PRON 900

ADP 780

ADV 560

etc 1090

OURS

Figure 5 The POS tag distribution of the words in theoriginal training corpus (left) and the selected words byour method (right)

because the response has no bi-grams shared withthe golden response RUBER and GPT2-coherencedid not recognize the utterances as appropriate re-sponses BERT-retrieval (random-1) and BERT-retrieval (ours) gave relatively high scores to the re-sponses evaluating them as appropriate utterancesIn the third example the system response appearsto be somewhat relevant to the given context be-cause it includes some words (chance future)relevant to the phrase take part in the finals Arepetition of a phrase in this example (to get achance) is believed to have contributed to the lowhuman evaluation score (012) The RUBER andBERT-retrieval (random) models appear to lackthis intuition and instead evaluate the response asappropriate possibly because some words appearrelevant Our proposed model scored the responsewith a relatively low score of 015 which was closeto the human score In the fourth example the re-sponse is not coherent but because it begins witha sentence Let me get a peek it could have ap-peared as a coherent response to the previous di-alogue about parking tickets For this case ourproposed model and GPT2-coherence gave scoressimilar to human scores

43 POS-tag distribution of selected words

We compute the Part-of-Speech (POS) tag distribu-tion of selected words by our method and compareit with the original distribution of the DailyDialogcorpus (Figure 5) 4 As we can see the VERBand NOUN tags are the most frequently selected(219 and 205 respectively) and their ratio isincreased than in the original corpus (183 and167 respectively) Meanwhile the ratio of punc-tuation tag () is highly decreased (from 213 to

4We use the NLTK POS tagger (httpswwwnltkorgbookch05html) with universal tagset

Figure 6 Box plot of scores for each type of responsesThe average scores of each type are 465 251 and 119The standard deviations for the scores of each type are067 127 and 041 (from left to right)

121) We suspect that the likelihood of the punc-tuation tag is more affected by local informationfrom a response rather than dialog history

431 Are the generated samples actuallyinappropriate

To see whether the negative samples generatedby our method are actually inappropriate we con-ducted a survey through Amazon Mechanical Turk(AMT) We selected 40 dialogue history examplesand prepared three types of responses for each dia-logue 1) the golden response 2) a negative samplegenerated by our method and 3) a randomly se-lected negative sample from the corpus For eachdialog 4 annotators were asked to score the qual-ity of the three responses Following Lowe et al(2017) we asked the question ldquoHow appropriate isthe response overallrdquo for each context-responsepair and the evaluation was conducted on a 5-pointLikert scale The Fleissrsquo kappa and Krippendorffrsquosalpha for the annotations were 063 and 063 re-spectively

Figure 6 shows the survey results The meanscores of golden and random responses were 465and 119 respectively The mean score of our neg-ative samples was 251 The standard deviationsfor the scores of each response type were 067127 and 041 for the golden response our nega-tive sample and the random response respectivelyWe see that these results do not guarantee that allthe generated negative samples are inappropriateWhat we can assume however is that our methodof manipulating a golden response generates a neg-ative sample that is more inappropriate than thegolden response Table 3 shows two examples ofthe three different types of responses for a givendialog history with their survey results

1533

Dialog HistoryA Sir would you like some dessert nowB Please show me the menu againA Here you are sir The chocolate cake is very delicious

ResponsesGolden No thanks I donrsquot like chocolate Irsquod likestrawberry pie (5)Ours No thanks I donrsquot have chocolate Irsquoll likesome one (15)Random I basically believe in science over theologyI mean I () (1)

Dialog HistoryA Could you tell me something about your family

ResponsesGolden Ok There are five people in my family fathermother elder brother younger sister and I (5)Ours Ok There are five children in my family fathermother and brother and father my me (325)Random When do you want to move in (125)

Table 3 Examples of three different types of responsesfor a given dialog history with their survey results Thehighlighted words are newly generated by our methodThe score of each response is underlined

For a model learning to find the difference be-tween appropriate and inappropriate responses wespeculate that the task of distinguishing the neg-ative samples generated by our method from thegolden responses would be more difficult than thetask of distinguishing the randomly selected nega-tive samples from the golden responses We believethat this is because the generated negative samplescan be inappropriate in more subtle ways than com-pletely unrelated responses are We suspect thatlearning with this more challenging setting haveresulted in the performance gain that we discussedin Section 421 However we believe that it willneed a more in-depth semantic analysis on each ofthe cases such as performing a more quantitativeanalysis (through an extensive human study forinstance) and further interpretation of the semanticrelationships between the original golden responsesand the modified negative samples according to theproposed method We leave it as a future work

5 Conclusion

In this paper we proposed an automatic methodfor generating negative samples that can be used totrain an unsupervised and unreferenced responseevaluation model We performed experiments todemonstrate that the proposed method can boostthe unsupervised training of a response evaluationmodel We analyzed the experiment results quan-titatively and examined some examples that show

the distinct characteristics of our proposed method

Acknowledgments

This work was supported by Institute for Infor-mation and communications Technology Promo-tion (IITP) grant funded by the Korea governmentMSIT) (No 2018-0-00582 Prediction and augmen-tation of the credibility distribution via linguisticanalysis and automated evidence document collec-tion)

References

JinYeong Bak and Alice Oh 2020 Speaker sensitiveresponse evaluation model In Proceedings of the58th Annual Meeting of the Association for Compu-tational Linguistics pages 6376ndash6385

Satanjeev Banerjee and Alon Lavie 2005 Meteor Anautomatic metric for mt evaluation with improvedcorrelation with human judgments In Proceedingsof the ACL Workshop on Intrinsic and Extrinsic Eval-uation Measures for Machine Translation andorSummarization pages 65ndash72

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conference ofthe North American Chapter of the Association forComputational Linguistics Human Language Tech-nologies pages 4171ndash4186

Asma Ghandeharioun Judy Hanwen Shen NatashaJaques Craig Ferguson Noah Jones AgataLapedriza and Rosalind Picard 2019 Approximat-ing interactive human evaluation with self-play foropen-domain dialog systems In Advances in Neu-ral Information Processing Systems pages 13658ndash13669

Sarik Ghazarian Johnny Wei Aram Galstyan andNanyun Peng 2019 Better automatic evaluation ofopen-domain dialogue systems with contextualizedembeddings In Proceedings of the Workshop onMethods for Optimizing and Evaluating Neural Lan-guage Generation pages 82ndash89

Tatsunori Hashimoto Hugh Zhang and Percy Liang2019 Unifying human and statistical evaluation fornatural language generation In Proceedings of the2019 Conference of the North American Chapter ofthe Association for Computational Linguistics Hu-man Language Technologies pages 1689ndash1701

Ari Holtzman Jan Buys Li Du Maxwell Forbes andYejin Choi 2020 The curious case of neural textdegeneration In 8th International Conference onLearning Representations

1534

Diederik P Kingma and Jimmy Ba 2015 Adam Amethod for stochastic optimization In 3rd Interna-tional Conference on Learning Representations

Jiwei Li Alexander H Miller Sumit ChopraMarcrsquoAurelio Ranzato and Jason Weston 2016Dialogue learning with human-in-the-loop arXivpreprint arXiv161109823

Yanran Li Hui Su Xiaoyu Shen Wenjie Li ZiqiangCao and Shuzi Niu 2017 DailyDialog A manu-ally labelled multi-turn dialogue dataset In Proceed-ings of the Eighth International Joint Conference onNatural Language Processing pages 986ndash995

Chin-Yew Lin 2004 ROUGE A package for auto-matic evaluation of summaries In Text Summariza-tion Branches Out pages 74ndash81

Chia-Wei Liu Ryan Lowe Iulian Serban Mike Nose-worthy Laurent Charlin and Joelle Pineau 2016How NOT to evaluate your dialogue system An em-pirical study of unsupervised evaluation metrics fordialogue response generation In Proceedings of the2016 Conference on Empirical Methods in NaturalLanguage Processing pages 2122ndash2132

Ryan Lowe Michael Noseworthy Iulian Vlad Ser-ban Nicolas Angelard-Gontier Yoshua Bengio andJoelle Pineau 2017 Towards an automatic Turingtest Learning to evaluate dialogue responses InProceedings of the 55th Annual Meeting of the Asso-ciation for Computational Linguistics pages 1116ndash1126

Shikib Mehri and Maxine Eskenazi 2020 USR Anunsupervised and reference free evaluation metricfor dialog generation In Proceedings of the 58th An-nual Meeting of the Association for ComputationalLinguistics pages 681ndash707

Bo Pang Erik Nijkamp Wenjuan Han Linqi ZhouYixian Liu and Kewei Tu 2020 Towards holisticand automatic evaluation of open-domain dialoguegeneration In Proceedings of the 58th Annual Meet-ing of the Association for Computational Linguisticspages 3619ndash3629

Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu 2002 Bleu a method for automatic eval-uation of machine translation In Proceedings of the40th Annual Meeting of the Association for Compu-tational Linguistics pages 311ndash318

Alec Radford Jeffrey Wu Rewon Child David LuanDario Amodei and Ilya Sutskever 2019 Languagemodels are unsupervised multitask learners OpenAIblog

Ananya B Sai Akash Kumar Mohankumar SiddharthaArora and Mitesh M Khapra 2020 Improving di-alog evaluation with a multi-reference adversarialdataset and large scale pretraining arXiv preprintarXiv200911321

Thibault Sellam Dipanjan Das and Ankur Parikh2020 BLEURT Learning robust metrics for textgeneration In Proceedings of the 58th Annual Meet-ing of the Association for Computational Linguisticspages 7881ndash7892

Iulian V Serban Alessandro Sordoni Yoshua BengioAaron Courville and Joelle Pineau 2016 Buildingend-to-end dialogue systems using generative hier-archical neural network models In Proceedings ofthe 30th AAAI Conference on Artificial Intelligencepages 3776ndash3783

Iulian Vlad Serban Alessandro Sordoni Ryan LoweLaurent Charlin Joelle Pineau Aaron Courville andYoshua Bengio 2017 A hierarchical latent variableencoder-decoder model for generating dialogues InProceedings of the 31st AAAI Conference on Artifi-cial Intelligence pages 3295ndash3301

Ilya Sutskever Oriol Vinyals and Quoc V Le 2014Sequence to sequence learning with neural networksIn Advances in Neural Information Processing Sys-tems pages 3104ndash3112

Chongyang Tao Lili Mou Dongyan Zhao and RuiYan 2018 Ruber An unsupervised method for au-tomatic evaluation of open-domain dialog systemsProceedings of the 32nd AAAI Conference on Artifi-cial Intelligence pages 722ndash729

Thomas Wolf Victor Sanh Julien Chaumond andClement Delangue 2018 Transfertransfo A trans-fer learning approach for neural network based con-versational agents In NeurIPS Workshop on Con-versational AI

Hanlu Wu Tengfei Ma Lingfei Wu TariroManyumwa and Shouling Ji 2020 Unsuper-vised reference-free summary quality evaluationvia contrastive learning In Proceedings of the2020 Conference on Empirical Methods in NaturalLanguage Processing pages 3612ndash3621

Saizheng Zhang Emily Dinan Jack Urbanek ArthurSzlam Douwe Kiela and Jason Weston 2018 Per-sonalizing dialogue agents I have a dog do youhave pets too In Proceedings of the 56th AnnualMeeting of the Association for Computational Lin-guistics pages 2204ndash2213

Tianyi Zhang Varsha Kishore Felix Wu Kilian QWeinberger and Yoav Artzi 2020 Bertscore Eval-uating text generation with BERT In 8th Interna-tional Conference on Learning Representations

Tianyu Zhao Divesh Lala and Tatsuya Kawahara2020 Designing precise and robust dialogue re-sponse evaluators In Proceedings of the 58th An-nual Meeting of the Association for ComputationalLinguistics pages 26ndash33

Page 4: Generating Negative Samples by Manipulating Golden ...the golden response.Li et al.(2016) explored dia-log system with textual feedback.Ghandeharioun et al.(2019) suggested the necessity

1528

33 ReplacingThe selected words are then replaced using an LMAll selected words are replaced with [mask] tokensin the original response Then the LM predictswithout considering the dialogue history the wordsthat are most likely to occur in the location of eachmasked word If the LM predicts the original wordthe second most likely word is used instead

4 Experiments

41 Setting411 DatasetTo measure the correlation between model predic-tions and human evaluations we use the response-evaluation dataset proposed by Zhao et al (2020)The dataset contains dialogue histories machine-generated responses golden responses and ap-propriateness scores evaluated by human annota-tors The scores were on a 5-point Likert scaleand each response was scored by four annota-tors Six generative models S2S (Sutskever et al2014) attentional S2S HRED (Serban et al 2016)VHRED (Serban et al 2017) GPT2-sm and GPT2-md (Wolf et al 2018) with three decoding algo-rithms greedy decoding ancestral decoding andnucleus sampling (Holtzman et al 2020) wereused to generate the responses They used Daily-Dialog (Li et al 2017) and PersonaChat (Zhanget al 2018) For each dataset they trained a set ofgenerative conversation models Each of the 900context-response pairs was randomly selected fromthe test set of the two datasets and the annotatorsevaluated the appropriateness of each response tothe context to construct two different evaluationdatasets The Krippendorffrsquos alpha for this datasetwas 0815 suggesting reasonable inter-annotatoragreement

DailyDialog dataset consists of 13118 multi-turn open-domain conversations written by hu-man workers and PersonaChat dataset consists of12875 multi-turn open-domain conversations writ-ten by human workers

412 ModelsThe evaluation models used in the experiment arelisted below Among them BLEU ROUGE ME-TEOR Embedding AverageExtremaGreedy andBERTScore are reference-based metrics that eval-uate the quality of a response based on its similar-ity to the golden response BERT-MLM GPT2-coherence BERT-retrieval (random-N) BERT-

retrieval (ours) are unreferenced metrics that do notrequire golden responses RUBER can be viewedas a hybrid metric that includes both reference-based and unreferenced approaches Some of thereference-based metrics are simple comparisonmethods rather than trainable models but are pre-sented along with other models because they canalso be used to estimate the quality of responsesIt should be noted that we do not compare theunsupervised approaches listed below with super-vised approaches such as the ones proposed byLowe et al (2017) Zhao et al (2020) which re-quire human-annotated response-evaluation pairsfor training

BLEU is a widely used metric for the machinetranslation task by measuring n-gram precision be-tween multiple references and a hypothesis (Pap-ineni et al 2002)

ROUGE is a widely used metric for text sum-marization which measures the n-gram recall (Lin2004) We use the F-score of ROUGE-L as anappropriateness score

METEOR is a metric for the machine transla-tion task which considers both n-gram precisionand n-gram recall of a hypothesis (Banerjee andLavie 2005)

Embeddding AverageGreedyExtrema calcu-late the similarity between golden and generatedresponses using the embedding similarity to ac-count for the diverse ways in which the goldenresponse could be stated (Liu et al 2016)

BERTScore is a recently proposed unsuper-vised metric based on the contextualized BERTembeddings (Zhang et al 2020)

RUBER calculates the scores of reference-basedand unreferenced metrics individually then usesthem to predict the final score (Tao et al 2018)The reference-based metric measures the similaritybetween golden responses and generated responsesbased on their embedding similarity The unrefer-enced metric is trained on the NUP task

BERT-MLM sums the log-likelihood of each to-ken in a response after masking it using an LM thatis fine-tuned on a corpus then uses the aggregatedlikelihood as the final score of the response (Mehriand Eskenazi 2020)

GPT2-coherence measures the coherence be-tween the dialogue history and a response by usinga fine-tuned GPT2 model (Radford et al 2019)to compute the averaged log-likelihood of the re-sponse (Pang et al 2020)

1529

BERT-retrieval (random-N) is a BERT-basedmodel that is trained to distinguish a golden re-sponse from a negative sample (Mehri and Eske-nazi 2020) using the dialogue history We refer tothe original model by Mehri and Eskenazi (2020)as BERT-retrieval (random-1) since they used onerandom response as a negative sample for a dia-logue history We refer to a variation of the modelthat uses two random negative samples for a dia-logue history as BERT-retrieval (random-2) Thisis to fairly compare with our model which usestwo negative samples for a dialogue history as ex-plained below

BERT-retrieval (ours) is a model that has thesame structure as the BERT-retrieval model Thedifference is that our model utilizes the negativesamples generated by the method that we proposeThe model uses both the generated negative sam-ples and the random negative samples Specificallyduring training the model learns to distinguish agolden response from two negative samples onegenerated from our method and one randomly sam-pled from the corpus

4.1.3 Implementation Details

We trained the unreferenced models on the original DailyDialog dataset and then evaluated them on the two response-evaluation datasets (Section 4.1.1). We split the conversations in the DailyDialog dataset in a sliding-window manner to construct pairs of dialogue histories and corresponding responses. The maximum number of turns in the dialogue history was set to 5, following Zhao et al. (2020).
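A minimal sketch of this sliding-window construction is shown below; only the 5-turn history cap comes from the text, and the rest (e.g., starting from the second utterance) is an assumption.

```python
# Sliding-window construction of (history, response) pairs from a conversation.
MAX_HISTORY_TURNS = 5

def make_pairs(conversation):
    """conversation: list of utterance strings in order."""
    pairs = []
    for i in range(1, len(conversation)):
        history = conversation[max(0, i - MAX_HISTORY_TURNS):i]
        response = conversation[i]
        pairs.append((history, response))
    return pairs

# Example: a 4-turn conversation yields 3 (history, response) pairs.
```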

We use the pretrained BERT and GPT2 released by Wolf et al. (2018) for all of our relevant experiments.2 A BERT model fine-tuned on the DailyDialog train set with MLM for 1 epoch was used for the scoring step of our proposed method (Section 3.1). The same model was used for the replacing step (Section 3.3). We used a threshold3 of 0.5 for the selecting step (Section 3.2). We used the Adam optimizer (Kingma and Ba, 2015) for training. We searched for hyperparameters for the BERT-retrieval (random-1) model that maximize the (Pearson) correlation between human evaluations and model predictions on the response-evaluation dataset made from the DailyDialog dataset (Section 4.1.1).

2 bert-base-uncased and gpt2-12layer are used.

3 We tested threshold values of 0, 0.5, 1, and 2, and found that using 0.5 as the threshold achieved the highest correlation with human evaluations; therefore, we report only the experiment results with this value.

Model                 DailyDialog        Persona
                      r      ρ           r      ρ
BLEU                  .08    .05         .19    .23
METEOR                .11    .33         .23    .18
ROUGE                 .12    .09         .25    .21
Embed. Average        .09    .08         .16    .17
Embed. Greedy         .18    .18         .25    .24
Embed. Extrema        .16    .15         .28    .27
BERTScore             .13    .12         .28    .26
RUBER                 .28    .26         .07    .04
BERT-MLM              .32    .38         .35    .35
GPT2-coherence        .47    .47         .48    .48
BERT-rtv (rand-1)     .47    .47         .56    .60
BERT-rtv (rand-2)     .49    .48         .55    .58
BERT-rtv (ours)       .55    .56         .64    .66

Table 1: The correlations between model predictions and human evaluations for each model on the two response-evaluation datasets. The highest score on each metric is highlighted in bold. All values with p > 0.001 are marked. DailyDialog and Persona denote the response-evaluation datasets made from the DailyDialog and PersonaChat datasets, respectively. BERT-rtv denotes the BERT-retrieval model. r and ρ denote the Pearson correlation and Spearman's rank correlation coefficient, respectively.

Model                 DailyDialog        Persona
                      r      ρ           r      ρ
drop-golden           .49    .48         .54    .57
shuffle-golden        .46    .45         .55    .59
score-wo-history      .51    .52         .58    .58
select-random         .54    .55         .57    .59
replace-w-history     .52    .52         .57    .57

Table 2: The correlations between model predictions and human evaluations for each variation of our model on the two response-evaluation datasets.

The values found in this search (3 epochs, a batch size of 64, and a learning rate of 2e-5) were used for all the BERT-retrieval models (random-N, ours). The random seed was fixed for all experiments.

4.2 Results

In Section 4.2.1, we examine the correlations between the predictions of each evaluation model and human evaluations. In Section 4.2.2, we present an in-depth analysis of our proposed method. In Section 4.2.3, we present examples suggesting that automatic evaluation systems trained with the proposed method can make decisions closer to human judgment than models that have not been.

Figure 3: Scatter plots showing in detail the correlations between model predictions and human evaluations. Each plot contains 800 system-generated responses from the response-evaluation dataset made from the DailyDialog dataset (Section 4.1.1). Each point indicates a response. Its x-value is the human evaluation score for the quality of the response, given on a 5-point Likert scale. Its y-value is the model prediction for the quality of the response, normalized into the range [0, 1]. The orange line is a linear regression. Following previous studies (Lowe et al., 2017; Bak and Oh, 2020; Pang et al., 2020), we add noise sampled from N(0, 0.09) to the human scores for better visualization.

4.2.1 Correlation with Human Judgment

Table 1 shows the correlation between model predictions and human evaluations for each model on the two datasets. The Pearson correlation (r) and Spearman's rank correlation coefficient (ρ) were used to measure the correlation between human scores and model predictions. It should be noted that we excluded the scores of golden responses from the response-evaluation datasets and extracted 800 and 750 response-evaluation pairs from the DailyDialog and PersonaChat datasets, respectively. The model incorporating our negative-sample method made predictions that correlate with human evaluations more strongly than those of BERT-retrieval (random-2), which uses the same number of negative samples for training. Among the baseline models, most of the reference-based metrics showed comparatively low performance. We believe these results support the observations of previous studies suggesting that using the golden response as the "one and only" correct answer to evaluate responses can be ineffective. RUBER showed better performance than the other reference-based models for the DailyDialog dataset, but showed low performance in evaluating PersonaChat responses. The GPT2-coherence model showed performance similar to the BERT-retrieval (random-1) model on the DailyDialog dataset but relatively low performance on the PersonaChat dataset. It should also be noted that the hybrid and unreferenced models were trained on the DailyDialog dataset and not on the PersonaChat dataset.
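Both coefficients can be computed with SciPy; the scores in the sketch below are placeholders, not the paper's data.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-response scores: human Likert ratings and model predictions.
human_scores = [4.5, 1.0, 3.0, 2.5]
model_scores = [0.92, 0.11, 0.60, 0.35]

r, r_pvalue = pearsonr(human_scores, model_scores)
rho, rho_pvalue = spearmanr(human_scores, model_scores)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```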

Figure 3 shows scatter plots visualizing the human scores and model predictions for the response-evaluation dataset made from DailyDialog. BLEU tended to predict low scores. This may suggest that there were only a few n-gram overlaps between the golden responses and the generated responses. The predictions of the embedding-based metrics (Emb. Greedy and BERTScore) were concentrated in a narrow range and showed low correlation with human scores. The unreferenced and hybrid metrics (RUBER, BERT-MLM, GPT2-coherence, and BERT-retrieval (random-1)) show relatively higher correlations than the reference-based metrics. BERT-retrieval (ours) shows the greatest correlation among the models, with a correlation coefficient of 0.1974. The scatter plots suggest that false-positive predictions, which frequently occurred in the BERT-retrieval (random-1) predictions, occurred less frequently in our model's predictions. However, the scatter plot for our model has a step-function-like appearance: most of the responses received a score near 0 or near 1. This is problematic, because an ideal model should be able to match human scores even when those scores are moderate. We consider this tendency a limitation of our model that must be addressed in future work.

4.2.2 Model Analysis

We analyze our model by performing experiments with variations in how the negative samples (to be used alongside the random negative sample) are made: (1) drop-golden: instead of following the scoring, selecting, and replacing steps, we randomly drop some of the words in the golden response to create a negative sample and use it with the random negative sample; (2) shuffle-golden: instead of following the three steps, we randomly shuffle the words in the golden response to create a negative sample and use it with the random negative sample; (3) score-wo-history: we use the scoring function in Equation 1 without the first term, so that it only considers the probabilities within the sentence, without the dialogue history; (4) select-random: instead of using the scoring function proposed in Equation 1, we randomly select the words to be replaced; (5) replace-w-history: when replacing a word, we concatenate the dialogue history with the response so that the LM considers the dialogue history when replacing the masked words.
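For concreteness, the first two ablation baselines can be sketched as follows; the drop probability is an arbitrary assumption.

```python
import random

def drop_golden(tokens, drop_prob=0.3, rng=random):
    """drop-golden: randomly drop words from the golden response."""
    kept = [t for t in tokens if rng.random() > drop_prob]
    return kept if kept else tokens[:1]   # keep at least one token

def shuffle_golden(tokens, rng=random):
    """shuffle-golden: randomly shuffle the words of the golden response."""
    shuffled = tokens[:]
    rng.shuffle(shuffled)
    return shuffled
```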

Table 2 shows the correlations between model predictions and human evaluations for the modified models above. Dropping or shuffling words in the golden response to make a negative sample shows similar or lower performance compared to using random responses (BERT-retrieval (random-1, random-2)). The correlation was lower when the dialogue history was not considered in the scoring process than when it was considered. We speculate that this is because, without the history, the scoring function gives high scores not only to words important for the consistency of a conversation but also to words with low likelihoods in general. Randomly selecting the tokens shows lower correlation than using our proposed scoring function. Considering the dialogue history in the replacing process gives lower performance than when it is not considered. We speculate that providing the dialogue history leads to predictions for the masked words that are more appropriate to the context, making the reconstructed response less appropriate as a negative sample.

Figure 4: Examples of cases in which our model predicted scores similar to human evaluations. GPT2 denotes the GPT2-coherence model; BERT-rtv denotes BERT-retrieval (random-1). All scores are normalized into the range [0, 1].

4.2.3 Case Study

Figure 4 shows some of the evaluation results of each model on the DailyDialog dataset. The responses in the first and second examples are appropriate to the given dialogue history, as suggested by the high human score. BLEU-2 gives a score of 0 because the response has no bi-grams shared with the golden response. RUBER and GPT2-coherence did not recognize the utterances as appropriate responses. BERT-retrieval (random-1) and BERT-retrieval (ours) gave relatively high scores to the responses, evaluating them as appropriate utterances. In the third example, the system response appears to be somewhat relevant to the given context because it includes some words ("chance", "future") relevant to the phrase "take part in the finals". A repetition of a phrase in this example ("to get a chance") is believed to have contributed to the low human evaluation score (0.12). The RUBER and BERT-retrieval (random) models appear to lack this intuition and instead evaluate the response as appropriate, possibly because some words appear relevant. Our proposed model scored the response with a relatively low score of 0.15, which was close to the human score. In the fourth example, the response is not coherent, but because it begins with the sentence "Let me get a peek", it could have appeared to be a coherent response to the previous dialogue about parking tickets. For this case, our proposed model and GPT2-coherence gave scores similar to human scores.

Figure 5: The POS tag distribution of the words in the original training corpus (left) and of the words selected by our method (right).

4.3 POS-tag distribution of selected words

We compute the Part-of-Speech (POS) tag distribution of the words selected by our method and compare it with the original distribution of the DailyDialog corpus (Figure 5).4 The VERB and NOUN tags are selected most frequently (21.9% and 20.5%, respectively), and their ratios are higher than in the original corpus (18.3% and 16.7%, respectively). Meanwhile, the ratio of the punctuation tag (.) is greatly decreased (from 21.3% to 12.1%). We suspect that this is because the likelihood of punctuation is affected more by local information within a response than by the dialogue history.

4 We use the NLTK POS tagger (https://www.nltk.org/book/ch05.html) with the universal tagset.

Figure 6: Box plot of scores for each type of response. The average scores for each type are 4.65, 2.51, and 1.19, and the standard deviations of the scores are 0.67, 1.27, and 0.41 (from left to right).
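The tagging step described above can be reproduced with NLTK's default tagger and the universal tagset (footnote 4); the example sentence below is hypothetical.

```python
import nltk

# Resources required by NLTK's default tokenizer and tagger.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("universal_tagset", quiet=True)

tokens = nltk.word_tokenize("What's wrong with heading out with Mark for vacation?")
print(nltk.pos_tag(tokens, tagset="universal"))
# e.g. [('What', 'PRON'), ("'s", 'VERB'), ('wrong', 'ADJ'), ...]
```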

4.3.1 Are the generated samples actually inappropriate?

To see whether the negative samples generated by our method are actually inappropriate, we conducted a survey through Amazon Mechanical Turk (AMT). We selected 40 dialogue history examples and prepared three types of responses for each dialogue: 1) the golden response, 2) a negative sample generated by our method, and 3) a randomly selected negative sample from the corpus. For each dialogue, 4 annotators were asked to score the quality of the three responses. Following Lowe et al. (2017), we asked the question "How appropriate is the response overall?" for each context-response pair, and the evaluation was conducted on a 5-point Likert scale. The Fleiss' kappa and Krippendorff's alpha for the annotations were 0.63 and 0.63, respectively.
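These agreement statistics can be computed, for example, with statsmodels and the krippendorff package; the rating matrix below is a hypothetical placeholder, and the choice of libraries and of the ordinal level of measurement are assumptions.

```python
import numpy as np
import krippendorff
from statsmodels.stats import inter_rater as irr

# Hypothetical (num_items x num_annotators) matrix of 1-5 Likert scores.
ratings = np.array([
    [5, 4, 5, 5],
    [2, 3, 2, 2],
    [1, 1, 2, 1],
])

# Fleiss' kappa expects per-item category counts.
counts, _ = irr.aggregate_raters(ratings)
print("Fleiss' kappa:", irr.fleiss_kappa(counts))

# Krippendorff's alpha over the same annotations (annotators as rows).
print("Krippendorff's alpha:",
      krippendorff.alpha(reliability_data=ratings.T, level_of_measurement="ordinal"))
```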

Figure 6 shows the survey results. The mean scores of the golden and random responses were 4.65 and 1.19, respectively. The mean score of our negative samples was 2.51. The standard deviations of the scores for each response type were 0.67, 1.27, and 0.41 for the golden response, our negative sample, and the random response, respectively. We see that these results do not guarantee that all the generated negative samples are inappropriate. What we can assume, however, is that our method of manipulating a golden response generates a negative sample that is more inappropriate than the golden response. Table 3 shows two examples of the three different types of responses for a given dialogue history, together with their survey results.


Dialog History
A: Sir, would you like some dessert now?
B: Please show me the menu again.
A: Here you are, sir. The chocolate cake is very delicious.

Responses
Golden: No, thanks. I don't like chocolate. I'd like strawberry pie. (5)
Ours: No, thanks. I don't have chocolate. I'll like some one. (1.5)
Random: I basically believe in science over theology. I mean, I (...) (1)

Dialog History
A: Could you tell me something about your family?

Responses
Golden: Ok. There are five people in my family: father, mother, elder brother, younger sister, and I. (5)
Ours: Ok. There are five children in my family father, mother and brother and father my me. (3.25)
Random: When do you want to move in? (1.25)

Table 3: Examples of three different types of responses for a given dialog history, with their survey results. The highlighted words are newly generated by our method. The score of each response is underlined.

For a model learning to find the difference between appropriate and inappropriate responses, we speculate that the task of distinguishing the negative samples generated by our method from the golden responses would be more difficult than the task of distinguishing the randomly selected negative samples from the golden responses. We believe that this is because the generated negative samples can be inappropriate in more subtle ways than completely unrelated responses are. We suspect that learning in this more challenging setting has resulted in the performance gain discussed in Section 4.2.1. However, we believe that a more in-depth semantic analysis of each of the cases is needed, such as a more quantitative analysis (through an extensive human study, for instance) and further interpretation of the semantic relationships between the original golden responses and the negative samples modified by the proposed method. We leave this as future work.

5 Conclusion

In this paper, we proposed an automatic method for generating negative samples that can be used to train an unsupervised and unreferenced response evaluation model. We performed experiments to demonstrate that the proposed method can boost the unsupervised training of a response evaluation model. We analyzed the experiment results quantitatively and examined some examples that show the distinct characteristics of our proposed method.

Acknowledgments

This work was supported by an Institute for Information and communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00582, Prediction and augmentation of the credibility distribution via linguistic analysis and automated evidence document collection).

References

JinYeong Bak and Alice Oh. 2020. Speaker sensitive response evaluation model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6376–6385.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.

Asma Ghandeharioun, Judy Hanwen Shen, Natasha Jaques, Craig Ferguson, Noah Jones, Agata Lapedriza, and Rosalind Picard. 2019. Approximating interactive human evaluation with self-play for open-domain dialog systems. In Advances in Neural Information Processing Systems, pages 13658–13669.

Sarik Ghazarian, Johnny Wei, Aram Galstyan, and Nanyun Peng. 2019. Better automatic evaluation of open-domain dialogue systems with contextualized embeddings. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 82–89.

Tatsunori Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1689–1701.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In 8th International Conference on Learning Representations.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations.

Jiwei Li, Alexander H. Miller, Sumit Chopra, Marc'Aurelio Ranzato, and Jason Weston. 2016. Dialogue learning with human-in-the-loop. arXiv preprint arXiv:1611.09823.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, pages 986–995.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132.

Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic Turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1116–1126.

Shikib Mehri and Maxine Eskenazi. 2020. USR: An unsupervised and reference free evaluation metric for dialog generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 681–707.

Bo Pang, Erik Nijkamp, Wenjuan Han, Linqi Zhou, Yixian Liu, and Kewei Tu. 2020. Towards holistic and automatic evaluation of open-domain dialogue generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3619–3629.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog.

Ananya B. Sai, Akash Kumar Mohankumar, Siddhartha Arora, and Mitesh M. Khapra. 2020. Improving dialog evaluation with a multi-reference adversarial dataset and large scale pretraining. arXiv preprint arXiv:2009.11321.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892.

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 3776–3783.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 3295–3301.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2018. RUBER: An unsupervised method for automatic evaluation of open-domain dialog systems. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 722–729.

Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2018. TransferTransfo: A transfer learning approach for neural network based conversational agents. In NeurIPS Workshop on Conversational AI.

Hanlu Wu, Tengfei Ma, Lingfei Wu, Tariro Manyumwa, and Shouling Ji. 2020. Unsupervised reference-free summary quality evaluation via contrastive learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 3612–3621.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 2204–2213.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In 8th International Conference on Learning Representations.

Tianyu Zhao, Divesh Lala, and Tatsuya Kawahara. 2020. Designing precise and robust dialogue response evaluators. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 26–33.

Page 5: Generating Negative Samples by Manipulating Golden ...the golden response.Li et al.(2016) explored dia-log system with textual feedback.Ghandeharioun et al.(2019) suggested the necessity

1529

BERT-retrieval (random-N) is a BERT-basedmodel that is trained to distinguish a golden re-sponse from a negative sample (Mehri and Eske-nazi 2020) using the dialogue history We refer tothe original model by Mehri and Eskenazi (2020)as BERT-retrieval (random-1) since they used onerandom response as a negative sample for a dia-logue history We refer to a variation of the modelthat uses two random negative samples for a dia-logue history as BERT-retrieval (random-2) Thisis to fairly compare with our model which usestwo negative samples for a dialogue history as ex-plained below

BERT-retrieval (ours) is a model that has thesame structure as the BERT-retrieval model Thedifference is that our model utilizes the negativesamples generated by the method that we proposeThe model uses both the generated negative sam-ples and the random negative samples Specificallyduring training the model learns to distinguish agolden response from two negative samples onegenerated from our method and one randomly sam-pled from the corpus

413 Implementation DetailsWe trained the unreferenced models on the origi-nal DailyDialog dataset and then evaluated themon the two response-evaluation datasets (Sec-tion 411) We split the conversations in the Daily-Dialog dataset in a sliding window manner to con-struct pairs of dialogue histories and correspondingresponses The maximum turn of the dialogue his-tory was set to 5 following Zhao et al (2020)

We use the pretrained BERT and GPT2 releasedby Wolf et al (2018) for all of our relevant exper-iments2 A BERT model fine-tuned on the Dai-lyDialog train set with MLM for 1 epoch wasused for the scoring step of our proposed method(Section 31) The same model was used for thereplacing step (Section 33) We used the thresh-old3 of 05 for the selecting step (Section 32) Weused Adam optimizer (Kingma and Ba 2015) fortraining We searched for hyperparameters forthe BERT-retrieval (random-1) model that max-imize the (Pearson) correlation between humanevaluations and model predictions on the response-evaluation dataset made from DailyDialog dataset

2bert-base-uncased and gpt2-12layer areused

3We tested threshold values of 0 05 1 and 2 and foundthat using 05 as the threshold achieved the highest correla-tion with human evaluations therefore we report only theexperiment results with this value

DailyDialog PersonaModel r ρ r ρ

BLEU 08 05 19 23METEOR 11 33 23 18ROUGE 12 09 25 21Embed Average 09 08 16 17Embed Greedy 18 18 25 24Embed Extrema 16 15 28 27BERTScore 13 12 28 26RUBER 28 26 07 04BERT-MLM 32 38 35 35GPT2-coherence 47 47 48 48BERT-rtv (rand1) 47 47 56 60BERT-rtv (rand2) 49 48 55 58BERT-rtv (ours) 55 56 64 66

Table 1 The correlations between model predictionsand human evaluations for each model based on thetwo response-evaluation datasets The highest score oneach metric is highlighted in bold All values with p gt0001 are marked with DailyDialog and Persona de-note the response-evaluation datasets made from Daily-Dialog and PersonaChat datasets respectively BERT-rtv denotes the BERT-retrieval model r and ρ meanthe Pearson correlation and Spearmanrsquos rank correla-tion coefficient respectively

DailyDialog PersonaModel r ρ r ρ

drop-golden 49 48 54 57shuffle-golden 46 45 55 59score-wo-history 51 52 58 58select-random 54 55 57 59replace-w-history 52 52 57 57

Table 2 The correlations between model predictionsand human evaluations for each of the variations of ourmodel based on the two response-evaluation datasets

(Section 411) The values found in this search(epoch=3 batch size=64 and learning rate=2e-5) were used for all the BERT-retrieval models(random-N ours) The random seed was fixed forall experiments

42 Results

In Section 421 we check the correlations betweenthe results of each evaluation model and humanevaluations In Section 422 an in-depth anal-ysis of our proposed method is shown In Sec-tion 423 we present examples that may suggestthat automatic evaluation systems that have beentrained with the proposed method can make deci-

1530

Figure 3 The scatter plots that show in detail the correlations between model predictions and human evaluationsEach of the plots contains 800 system-generated responses in the response-evaluation dataset made from Daily-Dialog dataset (Section 411) Each point indicates a response Its x-value indicates the human evaluation scorefor the quality of the response given on a 5-point Likert scale Its y-value indicates the model prediction for thequality of the response normalized into the range of [0 1] The orange line is a linear regression We add a noisesampled from N (0 009) into human score for better visualization following previous studies (Lowe et al 2017Bak and Oh 2020 Pang et al 2020)

sions closer to human judgment than models thathave not

421 Correlation with Human Judgment

Table 1 shows the correlation between model pre-dictions and human evaluations for each modelbased on the two datasets Pearson correlation (r)and Spearmanrsquos rank correlation coefficient (ρ)were used to measure the correlation between hu-man score and model prediction It should benoted that we excluded the scores of golden re-sponses from the response-evaluation datasets andextracted 800 and 750 response-evaluation pairsfrom the DailyDialog and PersonaChat datasetsrespectively The model incorporating our nega-tive sample method made predictions with highercorrelation with human evaluations than the predic-tions made by BERT-retrieval (random-2) whichuses the same number of negative samples fortraining Among the baseline models most ofthe reference-based metrics showed comparativelylow performances It is thought that these resultssupport the observations made by previous stud-ies suggesting that using the golden response asthe ldquoone and onlyrdquo correct answer to evaluate re-sponses can be ineffective RUBER showed bet-ter performance than other reference-based mod-

els for the DailyDialog dataset but showed lowperformance in evaluating PersonaChat responsesThe GPT2-coherence model showed similar perfor-mance to the BERT-retrieval (random-1) model onthe DailyDialog dataset but relatively low perfor-mance in the PersonaChat dataset It should alsobe noted that the hybrid and unreferenced modelswere trained on the DailyDialog dataset and noton the PersonaChat dataset

Figure 3 shows a scatter plot visualizing the hu-man scores and model predictions for the response-evaluation dataset on DailyDialog BLEU tendedto predict low scores This may suggest thatthere were only a few n-gram overlaps betweenthe golden responses and the generated responsesThe predictions of embedding-based metrics (EmbGreedy and BERTScore) were concentrated on aspecific range and showed low correlation withhuman scores The unreferenced or hybrid met-rics (RUBER BERT-MLM GPT2-coherence andBERT-retrieval (random-1)) show relatively highercorrelations than the reference-based metrics Wecan see that BERT-retrieval (ours) shows the great-est correlation among the models with a correla-tion coefficient of 01974 The scatter plots suggestthat false-positive predictions which frequentlyoccurred in the BERT-retrieval (random-1) predic-

1531

tions occurred less frequently in our modelrsquos pre-dictions However the scatter plot for our modelhas a step-function-like appearance Most of theresponses received a score near 0 or near 1 and thisis problematic because an ideal model should beable to match human scores even when the scoresare moderate This tendency is considered as a lim-itation of our model that must be addressed in thefuture work

422 Model AnalysisWe analyze our model by performing experimentswith some variations in making the negative sam-ples to be used with the random negative sample(1) drop-golden Instead of following the stepsof scoring selecting and replacing we randomlydrop some of the words in the golden response tocreate a negative sample and use it with the ran-dom negative sample (2) shuffle-golden Instead offollowing the three steps we randomly shuffle thewords in the golden response to create a negativesample and use it with the random negative sample(3) score-wo-history We use the scoring functionin Equation 1 without the first term so that it onlyconsiders the probabilities within the sentence with-out the dialogue history (4) select-random Insteadof using the scoring function proposed in Equation1 we randomly select the words to be replaced(5) replace-w-history When replacing a word weconcatenate the dialogue history with the responseso that the LM considers the dialogue history whenreplacing the masked words

Table 2 shows the correlations between modelpredictions and human evaluations for the modi-fied models above Dropping or shuffling wordsin the golden response to make a negative sampleshows similar or lower performance compared tousing random responses (BERT-retrieval (random-1 random-2)) The correlation was lower when thedialogue history was not considered in the scoringprocess than when it was considered We specu-late that this is because it gives high scores notonly to words important for the consistency of aconversation but also to the words with low likeli-hoods in general Randomly selecting the tokensshows lower correlation than using our proposedscoring function Considering the dialogue historyin the replacing process gives lower performancethan when it is not considered We speculate thatproviding the dialogue history makes predictionson the masked words that are more appropriate tothe context making the reconstructed response less

Figure 4 Some examples of cases in which our modelpredicted scores similar to human evaluations GPT2denotes the GPT2-coherence model BERT-rtv de-notes the BERT-retrieval (random-1) All scores arenormalized into the range of [0 1]

appropriate as a negative sample

423 Case StudyFigure 4 shows some of the evaluation results ofeach model on the DailyDialog dataset The re-sponses in the first and second examples are appro-priate to the given dialogue history as suggested bythe high human score BLEU-2 gives a score of 0

1532

2130

VERB1830

NOUN 1670

PRON 1020

ADJ 750

ADP 730

ADV 660

etc 1210

CORPUS

VERB2190

NOUN 2050

ADJ 12

1210

PRON 900

ADP 780

ADV 560

etc 1090

OURS

Figure 5 The POS tag distribution of the words in theoriginal training corpus (left) and the selected words byour method (right)

because the response has no bi-grams shared withthe golden response RUBER and GPT2-coherencedid not recognize the utterances as appropriate re-sponses BERT-retrieval (random-1) and BERT-retrieval (ours) gave relatively high scores to the re-sponses evaluating them as appropriate utterancesIn the third example the system response appearsto be somewhat relevant to the given context be-cause it includes some words (chance future)relevant to the phrase take part in the finals Arepetition of a phrase in this example (to get achance) is believed to have contributed to the lowhuman evaluation score (012) The RUBER andBERT-retrieval (random) models appear to lackthis intuition and instead evaluate the response asappropriate possibly because some words appearrelevant Our proposed model scored the responsewith a relatively low score of 015 which was closeto the human score In the fourth example the re-sponse is not coherent but because it begins witha sentence Let me get a peek it could have ap-peared as a coherent response to the previous di-alogue about parking tickets For this case ourproposed model and GPT2-coherence gave scoressimilar to human scores

43 POS-tag distribution of selected words

We compute the Part-of-Speech (POS) tag distribu-tion of selected words by our method and compareit with the original distribution of the DailyDialogcorpus (Figure 5) 4 As we can see the VERBand NOUN tags are the most frequently selected(219 and 205 respectively) and their ratio isincreased than in the original corpus (183 and167 respectively) Meanwhile the ratio of punc-tuation tag () is highly decreased (from 213 to

4We use the NLTK POS tagger (httpswwwnltkorgbookch05html) with universal tagset

Figure 6 Box plot of scores for each type of responsesThe average scores of each type are 465 251 and 119The standard deviations for the scores of each type are067 127 and 041 (from left to right)

121) We suspect that the likelihood of the punc-tuation tag is more affected by local informationfrom a response rather than dialog history

431 Are the generated samples actuallyinappropriate

To see whether the negative samples generatedby our method are actually inappropriate we con-ducted a survey through Amazon Mechanical Turk(AMT) We selected 40 dialogue history examplesand prepared three types of responses for each dia-logue 1) the golden response 2) a negative samplegenerated by our method and 3) a randomly se-lected negative sample from the corpus For eachdialog 4 annotators were asked to score the qual-ity of the three responses Following Lowe et al(2017) we asked the question ldquoHow appropriate isthe response overallrdquo for each context-responsepair and the evaluation was conducted on a 5-pointLikert scale The Fleissrsquo kappa and Krippendorffrsquosalpha for the annotations were 063 and 063 re-spectively

Figure 6 shows the survey results The meanscores of golden and random responses were 465and 119 respectively The mean score of our neg-ative samples was 251 The standard deviationsfor the scores of each response type were 067127 and 041 for the golden response our nega-tive sample and the random response respectivelyWe see that these results do not guarantee that allthe generated negative samples are inappropriateWhat we can assume however is that our methodof manipulating a golden response generates a neg-ative sample that is more inappropriate than thegolden response Table 3 shows two examples ofthe three different types of responses for a givendialog history with their survey results

1533

Dialog HistoryA Sir would you like some dessert nowB Please show me the menu againA Here you are sir The chocolate cake is very delicious

ResponsesGolden No thanks I donrsquot like chocolate Irsquod likestrawberry pie (5)Ours No thanks I donrsquot have chocolate Irsquoll likesome one (15)Random I basically believe in science over theologyI mean I () (1)

Dialog HistoryA Could you tell me something about your family

ResponsesGolden Ok There are five people in my family fathermother elder brother younger sister and I (5)Ours Ok There are five children in my family fathermother and brother and father my me (325)Random When do you want to move in (125)

Table 3 Examples of three different types of responsesfor a given dialog history with their survey results Thehighlighted words are newly generated by our methodThe score of each response is underlined

For a model learning to find the difference be-tween appropriate and inappropriate responses wespeculate that the task of distinguishing the neg-ative samples generated by our method from thegolden responses would be more difficult than thetask of distinguishing the randomly selected nega-tive samples from the golden responses We believethat this is because the generated negative samplescan be inappropriate in more subtle ways than com-pletely unrelated responses are We suspect thatlearning with this more challenging setting haveresulted in the performance gain that we discussedin Section 421 However we believe that it willneed a more in-depth semantic analysis on each ofthe cases such as performing a more quantitativeanalysis (through an extensive human study forinstance) and further interpretation of the semanticrelationships between the original golden responsesand the modified negative samples according to theproposed method We leave it as a future work

5 Conclusion

In this paper we proposed an automatic methodfor generating negative samples that can be used totrain an unsupervised and unreferenced responseevaluation model We performed experiments todemonstrate that the proposed method can boostthe unsupervised training of a response evaluationmodel We analyzed the experiment results quan-titatively and examined some examples that show

the distinct characteristics of our proposed method

Acknowledgments

This work was supported by Institute for Infor-mation and communications Technology Promo-tion (IITP) grant funded by the Korea governmentMSIT) (No 2018-0-00582 Prediction and augmen-tation of the credibility distribution via linguisticanalysis and automated evidence document collec-tion)

References

JinYeong Bak and Alice Oh 2020 Speaker sensitiveresponse evaluation model In Proceedings of the58th Annual Meeting of the Association for Compu-tational Linguistics pages 6376ndash6385

Satanjeev Banerjee and Alon Lavie 2005 Meteor Anautomatic metric for mt evaluation with improvedcorrelation with human judgments In Proceedingsof the ACL Workshop on Intrinsic and Extrinsic Eval-uation Measures for Machine Translation andorSummarization pages 65ndash72

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conference ofthe North American Chapter of the Association forComputational Linguistics Human Language Tech-nologies pages 4171ndash4186

Asma Ghandeharioun Judy Hanwen Shen NatashaJaques Craig Ferguson Noah Jones AgataLapedriza and Rosalind Picard 2019 Approximat-ing interactive human evaluation with self-play foropen-domain dialog systems In Advances in Neu-ral Information Processing Systems pages 13658ndash13669

Sarik Ghazarian Johnny Wei Aram Galstyan andNanyun Peng 2019 Better automatic evaluation ofopen-domain dialogue systems with contextualizedembeddings In Proceedings of the Workshop onMethods for Optimizing and Evaluating Neural Lan-guage Generation pages 82ndash89

Tatsunori Hashimoto Hugh Zhang and Percy Liang2019 Unifying human and statistical evaluation fornatural language generation In Proceedings of the2019 Conference of the North American Chapter ofthe Association for Computational Linguistics Hu-man Language Technologies pages 1689ndash1701

Ari Holtzman Jan Buys Li Du Maxwell Forbes andYejin Choi 2020 The curious case of neural textdegeneration In 8th International Conference onLearning Representations

1534

Diederik P Kingma and Jimmy Ba 2015 Adam Amethod for stochastic optimization In 3rd Interna-tional Conference on Learning Representations

Jiwei Li Alexander H Miller Sumit ChopraMarcrsquoAurelio Ranzato and Jason Weston 2016Dialogue learning with human-in-the-loop arXivpreprint arXiv161109823

Yanran Li Hui Su Xiaoyu Shen Wenjie Li ZiqiangCao and Shuzi Niu 2017 DailyDialog A manu-ally labelled multi-turn dialogue dataset In Proceed-ings of the Eighth International Joint Conference onNatural Language Processing pages 986ndash995

Chin-Yew Lin 2004 ROUGE A package for auto-matic evaluation of summaries In Text Summariza-tion Branches Out pages 74ndash81

Chia-Wei Liu Ryan Lowe Iulian Serban Mike Nose-worthy Laurent Charlin and Joelle Pineau 2016How NOT to evaluate your dialogue system An em-pirical study of unsupervised evaluation metrics fordialogue response generation In Proceedings of the2016 Conference on Empirical Methods in NaturalLanguage Processing pages 2122ndash2132

Ryan Lowe Michael Noseworthy Iulian Vlad Ser-ban Nicolas Angelard-Gontier Yoshua Bengio andJoelle Pineau 2017 Towards an automatic Turingtest Learning to evaluate dialogue responses InProceedings of the 55th Annual Meeting of the Asso-ciation for Computational Linguistics pages 1116ndash1126

Shikib Mehri and Maxine Eskenazi 2020 USR Anunsupervised and reference free evaluation metricfor dialog generation In Proceedings of the 58th An-nual Meeting of the Association for ComputationalLinguistics pages 681ndash707

Bo Pang Erik Nijkamp Wenjuan Han Linqi ZhouYixian Liu and Kewei Tu 2020 Towards holisticand automatic evaluation of open-domain dialoguegeneration In Proceedings of the 58th Annual Meet-ing of the Association for Computational Linguisticspages 3619ndash3629

Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu 2002 Bleu a method for automatic eval-uation of machine translation In Proceedings of the40th Annual Meeting of the Association for Compu-tational Linguistics pages 311ndash318

Alec Radford Jeffrey Wu Rewon Child David LuanDario Amodei and Ilya Sutskever 2019 Languagemodels are unsupervised multitask learners OpenAIblog

Ananya B Sai Akash Kumar Mohankumar SiddharthaArora and Mitesh M Khapra 2020 Improving di-alog evaluation with a multi-reference adversarialdataset and large scale pretraining arXiv preprintarXiv200911321

Thibault Sellam Dipanjan Das and Ankur Parikh2020 BLEURT Learning robust metrics for textgeneration In Proceedings of the 58th Annual Meet-ing of the Association for Computational Linguisticspages 7881ndash7892

Iulian V Serban Alessandro Sordoni Yoshua BengioAaron Courville and Joelle Pineau 2016 Buildingend-to-end dialogue systems using generative hier-archical neural network models In Proceedings ofthe 30th AAAI Conference on Artificial Intelligencepages 3776ndash3783

Iulian Vlad Serban Alessandro Sordoni Ryan LoweLaurent Charlin Joelle Pineau Aaron Courville andYoshua Bengio 2017 A hierarchical latent variableencoder-decoder model for generating dialogues InProceedings of the 31st AAAI Conference on Artifi-cial Intelligence pages 3295ndash3301

Ilya Sutskever Oriol Vinyals and Quoc V Le 2014Sequence to sequence learning with neural networksIn Advances in Neural Information Processing Sys-tems pages 3104ndash3112

Chongyang Tao Lili Mou Dongyan Zhao and RuiYan 2018 Ruber An unsupervised method for au-tomatic evaluation of open-domain dialog systemsProceedings of the 32nd AAAI Conference on Artifi-cial Intelligence pages 722ndash729

Thomas Wolf Victor Sanh Julien Chaumond andClement Delangue 2018 Transfertransfo A trans-fer learning approach for neural network based con-versational agents In NeurIPS Workshop on Con-versational AI

Hanlu Wu Tengfei Ma Lingfei Wu TariroManyumwa and Shouling Ji 2020 Unsuper-vised reference-free summary quality evaluationvia contrastive learning In Proceedings of the2020 Conference on Empirical Methods in NaturalLanguage Processing pages 3612ndash3621

Saizheng Zhang Emily Dinan Jack Urbanek ArthurSzlam Douwe Kiela and Jason Weston 2018 Per-sonalizing dialogue agents I have a dog do youhave pets too In Proceedings of the 56th AnnualMeeting of the Association for Computational Lin-guistics pages 2204ndash2213

Tianyi Zhang Varsha Kishore Felix Wu Kilian QWeinberger and Yoav Artzi 2020 Bertscore Eval-uating text generation with BERT In 8th Interna-tional Conference on Learning Representations

Tianyu Zhao Divesh Lala and Tatsuya Kawahara2020 Designing precise and robust dialogue re-sponse evaluators In Proceedings of the 58th An-nual Meeting of the Association for ComputationalLinguistics pages 26ndash33

Page 6: Generating Negative Samples by Manipulating Golden ...the golden response.Li et al.(2016) explored dia-log system with textual feedback.Ghandeharioun et al.(2019) suggested the necessity

1530

Figure 3 The scatter plots that show in detail the correlations between model predictions and human evaluationsEach of the plots contains 800 system-generated responses in the response-evaluation dataset made from Daily-Dialog dataset (Section 411) Each point indicates a response Its x-value indicates the human evaluation scorefor the quality of the response given on a 5-point Likert scale Its y-value indicates the model prediction for thequality of the response normalized into the range of [0 1] The orange line is a linear regression We add a noisesampled from N (0 009) into human score for better visualization following previous studies (Lowe et al 2017Bak and Oh 2020 Pang et al 2020)

sions closer to human judgment than models thathave not

421 Correlation with Human Judgment

Table 1 shows the correlation between model pre-dictions and human evaluations for each modelbased on the two datasets Pearson correlation (r)and Spearmanrsquos rank correlation coefficient (ρ)were used to measure the correlation between hu-man score and model prediction It should benoted that we excluded the scores of golden re-sponses from the response-evaluation datasets andextracted 800 and 750 response-evaluation pairsfrom the DailyDialog and PersonaChat datasetsrespectively The model incorporating our nega-tive sample method made predictions with highercorrelation with human evaluations than the predic-tions made by BERT-retrieval (random-2) whichuses the same number of negative samples fortraining Among the baseline models most ofthe reference-based metrics showed comparativelylow performances It is thought that these resultssupport the observations made by previous stud-ies suggesting that using the golden response asthe ldquoone and onlyrdquo correct answer to evaluate re-sponses can be ineffective RUBER showed bet-ter performance than other reference-based mod-

els for the DailyDialog dataset but showed lowperformance in evaluating PersonaChat responsesThe GPT2-coherence model showed similar perfor-mance to the BERT-retrieval (random-1) model onthe DailyDialog dataset but relatively low perfor-mance in the PersonaChat dataset It should alsobe noted that the hybrid and unreferenced modelswere trained on the DailyDialog dataset and noton the PersonaChat dataset

Figure 3 shows a scatter plot visualizing the hu-man scores and model predictions for the response-evaluation dataset on DailyDialog BLEU tendedto predict low scores This may suggest thatthere were only a few n-gram overlaps betweenthe golden responses and the generated responsesThe predictions of embedding-based metrics (EmbGreedy and BERTScore) were concentrated on aspecific range and showed low correlation withhuman scores The unreferenced or hybrid met-rics (RUBER BERT-MLM GPT2-coherence andBERT-retrieval (random-1)) show relatively highercorrelations than the reference-based metrics Wecan see that BERT-retrieval (ours) shows the great-est correlation among the models with a correla-tion coefficient of 01974 The scatter plots suggestthat false-positive predictions which frequentlyoccurred in the BERT-retrieval (random-1) predic-

1531

tions occurred less frequently in our modelrsquos pre-dictions However the scatter plot for our modelhas a step-function-like appearance Most of theresponses received a score near 0 or near 1 and thisis problematic because an ideal model should beable to match human scores even when the scoresare moderate This tendency is considered as a lim-itation of our model that must be addressed in thefuture work

422 Model AnalysisWe analyze our model by performing experimentswith some variations in making the negative sam-ples to be used with the random negative sample(1) drop-golden Instead of following the stepsof scoring selecting and replacing we randomlydrop some of the words in the golden response tocreate a negative sample and use it with the ran-dom negative sample (2) shuffle-golden Instead offollowing the three steps we randomly shuffle thewords in the golden response to create a negativesample and use it with the random negative sample(3) score-wo-history We use the scoring functionin Equation 1 without the first term so that it onlyconsiders the probabilities within the sentence with-out the dialogue history (4) select-random Insteadof using the scoring function proposed in Equation1 we randomly select the words to be replaced(5) replace-w-history When replacing a word weconcatenate the dialogue history with the responseso that the LM considers the dialogue history whenreplacing the masked words

Table 2 shows the correlations between modelpredictions and human evaluations for the modi-fied models above Dropping or shuffling wordsin the golden response to make a negative sampleshows similar or lower performance compared tousing random responses (BERT-retrieval (random-1 random-2)) The correlation was lower when thedialogue history was not considered in the scoringprocess than when it was considered We specu-late that this is because it gives high scores notonly to words important for the consistency of aconversation but also to the words with low likeli-hoods in general Randomly selecting the tokensshows lower correlation than using our proposedscoring function Considering the dialogue historyin the replacing process gives lower performancethan when it is not considered We speculate thatproviding the dialogue history makes predictionson the masked words that are more appropriate tothe context making the reconstructed response less

Figure 4 Some examples of cases in which our modelpredicted scores similar to human evaluations GPT2denotes the GPT2-coherence model BERT-rtv de-notes the BERT-retrieval (random-1) All scores arenormalized into the range of [0 1]

appropriate as a negative sample

423 Case StudyFigure 4 shows some of the evaluation results ofeach model on the DailyDialog dataset The re-sponses in the first and second examples are appro-priate to the given dialogue history as suggested bythe high human score BLEU-2 gives a score of 0

1532

2130

VERB1830

NOUN 1670

PRON 1020

ADJ 750

ADP 730

ADV 660

etc 1210

CORPUS

VERB2190

NOUN 2050

ADJ 12

1210

PRON 900

ADP 780

ADV 560

etc 1090

OURS

Figure 5 The POS tag distribution of the words in theoriginal training corpus (left) and the selected words byour method (right)

because the response has no bi-grams shared withthe golden response RUBER and GPT2-coherencedid not recognize the utterances as appropriate re-sponses BERT-retrieval (random-1) and BERT-retrieval (ours) gave relatively high scores to the re-sponses evaluating them as appropriate utterancesIn the third example the system response appearsto be somewhat relevant to the given context be-cause it includes some words (chance future)relevant to the phrase take part in the finals Arepetition of a phrase in this example (to get achance) is believed to have contributed to the lowhuman evaluation score (012) The RUBER andBERT-retrieval (random) models appear to lackthis intuition and instead evaluate the response asappropriate possibly because some words appearrelevant Our proposed model scored the responsewith a relatively low score of 015 which was closeto the human score In the fourth example the re-sponse is not coherent but because it begins witha sentence Let me get a peek it could have ap-peared as a coherent response to the previous di-alogue about parking tickets For this case ourproposed model and GPT2-coherence gave scoressimilar to human scores

43 POS-tag distribution of selected words

We compute the Part-of-Speech (POS) tag distribu-tion of selected words by our method and compareit with the original distribution of the DailyDialogcorpus (Figure 5) 4 As we can see the VERBand NOUN tags are the most frequently selected(219 and 205 respectively) and their ratio isincreased than in the original corpus (183 and167 respectively) Meanwhile the ratio of punc-tuation tag () is highly decreased (from 213 to

4We use the NLTK POS tagger (httpswwwnltkorgbookch05html) with universal tagset

Figure 6 Box plot of scores for each type of responsesThe average scores of each type are 465 251 and 119The standard deviations for the scores of each type are067 127 and 041 (from left to right)

121) We suspect that the likelihood of the punc-tuation tag is more affected by local informationfrom a response rather than dialog history

431 Are the generated samples actuallyinappropriate

To see whether the negative samples generatedby our method are actually inappropriate we con-ducted a survey through Amazon Mechanical Turk(AMT) We selected 40 dialogue history examplesand prepared three types of responses for each dia-logue 1) the golden response 2) a negative samplegenerated by our method and 3) a randomly se-lected negative sample from the corpus For eachdialog 4 annotators were asked to score the qual-ity of the three responses Following Lowe et al(2017) we asked the question ldquoHow appropriate isthe response overallrdquo for each context-responsepair and the evaluation was conducted on a 5-pointLikert scale The Fleissrsquo kappa and Krippendorffrsquosalpha for the annotations were 063 and 063 re-spectively

Figure 6 shows the survey results The meanscores of golden and random responses were 465and 119 respectively The mean score of our neg-ative samples was 251 The standard deviationsfor the scores of each response type were 067127 and 041 for the golden response our nega-tive sample and the random response respectivelyWe see that these results do not guarantee that allthe generated negative samples are inappropriateWhat we can assume however is that our methodof manipulating a golden response generates a neg-ative sample that is more inappropriate than thegolden response Table 3 shows two examples ofthe three different types of responses for a givendialog history with their survey results

1533

Dialog HistoryA Sir would you like some dessert nowB Please show me the menu againA Here you are sir The chocolate cake is very delicious

ResponsesGolden No thanks I donrsquot like chocolate Irsquod likestrawberry pie (5)Ours No thanks I donrsquot have chocolate Irsquoll likesome one (15)Random I basically believe in science over theologyI mean I () (1)

Dialog HistoryA Could you tell me something about your family

ResponsesGolden Ok There are five people in my family fathermother elder brother younger sister and I (5)Ours Ok There are five children in my family fathermother and brother and father my me (325)Random When do you want to move in (125)

Table 3 Examples of three different types of responsesfor a given dialog history with their survey results Thehighlighted words are newly generated by our methodThe score of each response is underlined

For a model learning to find the difference be-tween appropriate and inappropriate responses wespeculate that the task of distinguishing the neg-ative samples generated by our method from thegolden responses would be more difficult than thetask of distinguishing the randomly selected nega-tive samples from the golden responses We believethat this is because the generated negative samplescan be inappropriate in more subtle ways than com-pletely unrelated responses are We suspect thatlearning with this more challenging setting haveresulted in the performance gain that we discussedin Section 421 However we believe that it willneed a more in-depth semantic analysis on each ofthe cases such as performing a more quantitativeanalysis (through an extensive human study forinstance) and further interpretation of the semanticrelationships between the original golden responsesand the modified negative samples according to theproposed method We leave it as a future work

5 Conclusion

In this paper, we proposed an automatic method for generating negative samples that can be used to train an unsupervised and unreferenced response evaluation model. We performed experiments to demonstrate that the proposed method can boost the unsupervised training of a response evaluation model. We analyzed the experimental results quantitatively and examined examples that show the distinct characteristics of our proposed method.

Acknowledgments

This work was supported by an Institute for Information and communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00582, Prediction and augmentation of the credibility distribution via linguistic analysis and automated evidence document collection).

References

JinYeong Bak and Alice Oh. 2020. Speaker sensitive response evaluation model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6376–6385.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.

Asma Ghandeharioun, Judy Hanwen Shen, Natasha Jaques, Craig Ferguson, Noah Jones, Agata Lapedriza, and Rosalind Picard. 2019. Approximating interactive human evaluation with self-play for open-domain dialog systems. In Advances in Neural Information Processing Systems, pages 13658–13669.

Sarik Ghazarian, Johnny Wei, Aram Galstyan, and Nanyun Peng. 2019. Better automatic evaluation of open-domain dialogue systems with contextualized embeddings. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 82–89.

Tatsunori Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1689–1701.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In 8th International Conference on Learning Representations.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations.

Jiwei Li, Alexander H. Miller, Sumit Chopra, Marc'Aurelio Ranzato, and Jason Weston. 2016. Dialogue learning with human-in-the-loop. arXiv preprint arXiv:1611.09823.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, pages 986–995.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132.

Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic Turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1116–1126.

Shikib Mehri and Maxine Eskenazi. 2020. USR: An unsupervised and reference free evaluation metric for dialog generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 681–707.

Bo Pang, Erik Nijkamp, Wenjuan Han, Linqi Zhou, Yixian Liu, and Kewei Tu. 2020. Towards holistic and automatic evaluation of open-domain dialogue generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3619–3629.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog.

Ananya B. Sai, Akash Kumar Mohankumar, Siddhartha Arora, and Mitesh M. Khapra. 2020. Improving dialog evaluation with a multi-reference adversarial dataset and large scale pretraining. arXiv preprint arXiv:2009.11321.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892.

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 3776–3783.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 3295–3301.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2018. RUBER: An unsupervised method for automatic evaluation of open-domain dialog systems. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 722–729.

Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2018. TransferTransfo: A transfer learning approach for neural network based conversational agents. In NeurIPS Workshop on Conversational AI.

Hanlu Wu, Tengfei Ma, Lingfei Wu, Tariro Manyumwa, and Shouling Ji. 2020. Unsupervised reference-free summary quality evaluation via contrastive learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 3612–3621.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 2204–2213.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In 8th International Conference on Learning Representations.

Tianyu Zhao, Divesh Lala, and Tatsuya Kawahara. 2020. Designing precise and robust dialogue response evaluators. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 26–33.

