
Source Awareness Memory End-to-End

for Task-oriented Dialogue Learning

Bachelor thesis
Credits: 18 EC

Tyler Cools

11004851

Supervisor

Jiahuan Pei

BSc Artificial Intelligence
University of Amsterdam

Faculty of Science
Science Park 904

1098 XH Amsterdam

July 15, 2019


Abstract

Task-oriented Dialogue Systems are widely used nowadays. They aim to make human life easier by answering the questions users ask the system. This thesis uses data of restaurant bookings. Performance improvements of Task-oriented Dialogue Systems are reached by using End-to-End networks. Whereas the traditional method uses handcrafted rules and is domain specific, End-to-End networks learn from previous data and can thus be scaled to multiple domains. Despite the many advantages this type of network offers, the system still struggles with retrieving the right information from the Knowledge Base. This problem is addressed by applying Source Awareness to the existing Memory End-to-End network. This technique splits up the dialogue and is an important extension because it enables the model to store data more efficiently, which ensures better use of the memory's attention. The experimental results show that the model outperforms the baseline model on almost every task. The mistakes the system makes are analysed to give a good overview of the model's strengths and weaknesses.


Acronyms

DST Dialogue State Tracker
DSTC2 Dialog State Tracking Challenge 2
GMemN2N Gated Memory End-to-End
KB Knowledge Base
MemN2N Memory End-to-End
N2N End-to-End
NLG Natural Language Generation
NLU Natural Language Understanding
NUC Next-Utterance-Classification
OOV Out-Of-Vocabulary
PL Policy Learner
QA Question Answering
RNN Recurrent Neural Network
SA Source Awareness
SAMemN2N Source Awareness Memory End-to-End
TDS Task-oriented Dialogue System


Contents

1 Introduction
  1.1 Motivation
  1.2 Research question
  1.3 Contributions and main findings
  1.4 Overview

2 Problem definition

3 Related work
  3.1 Pipeline method
  3.2 End-to-End Networks

4 Background
  4.1 Memory End-to-End network
  4.2 Source Awareness

5 Methods

6 Experiments
  6.1 Datasets
  6.2 Evaluation

7 Results

8 Discussion

9 Conclusion

10 Future work

11 Appendix


1 Introduction

1.1 Motivation

In recent years, making reservations or appointments by telling a computer program to do so has become common. This is done by Task-oriented Dialogue Systems (TDSs). A dialogue is a conversation between two or more agents, which in a TDS are a computer and a human. A TDS is a computer program that communicates with a human in a natural way and is widely used in virtual assistants like Apple's Siri or Google's Assistant (Arora et al. [1]).

The traditional method to construct a TDS is the pipeline method, for which extensive human effort is necessary because the rules are handcrafted. In the last few years, neural networks have emerged as a solution to this problem. End-to-End (N2N) networks use only one module and learn from given training data to find patterns in the dialogues.

Weston et al. [17] concluded that although N2N networks use previous data to produce an output, their memory is typically too small and not compartmentalized enough to accurately remember facts from the past. This makes neural networks hard to use for Question Answering (QA), in which a lot of inference is needed. Zaremba and Sutskever [21] showed that a Recurrent Neural Network (RNN) has difficulties outputting the same sequence as the input it has just read.

Another problem arises when using N2N networks to compose a TDS. Because the pipeline method works extremely well on domain-specific tasks, the N2N network not only needs to achieve high accuracy over all domains, but must also beat the traditional model on domain-specific data. This increases the amount of data, which makes it harder for a neural network to process it all.

To resolve the problems discussed above, Sukhbaatar et al. [16] demonstrated a promising method that uses a novel RNN based on the Memory Network implemented by Weston et al. [17]. Whereas that network model was not easy to train using backpropagation and required supervision at each layer of the network, the MemN2N network can be trained from input-output pairs and is applicable to tasks where supervision is not available (Sukhbaatar et al. [16]).

To access the data more efficiently, Stienstra [15] uses Source Awareness (SA). This technique splits up the dialogue into three different parts, which ensures a more efficient way of storing the information. By using this technique, information can be retrieved more efficiently and the results achieve higher accuracy.

In this thesis, data of restaurant bookings is used, in which users ask the TDS to book a table in a restaurant for a specific location, price range and number of persons. The model is tested with two Knowledge Bases (KBs), containing only known entities and both known and unknown entities, respectively.


1.2 Research question

In this thesis Source Awareness (SA) is implemented in the Memory End-to-End (MemN2N) network to ensure information is retrieved more efficiently. The research question (RQ) for this thesis is: How will the addition of Source Awareness affect the results of the current Memory End-to-End network? To answer this question, three sub-questions are formulated:

• SQ1 How does Source Awareness affect the performance of a TDS?

• SQ2 Is the model a good fit or is it over-fitting or under-fitting?

• SQ3 When and why does the model outperform the regular MemN2N network, and when does the model fail?

1.3 Contributions and main findings

The contributions of this thesis mainly include:

1. A split Memory End-to-End network, which can capture Source Awareness for better performance. This network is called Source Awareness Memory End-to-End (SAMemN2N).

2. A case study on error types.

The main findings are:

1. Source Awareness is a useful technique for a TDS because it can store a dialogue more efficiently, with the results, user utterances and system utterances stored in different parts of the model.

2. The strength of the model (according to the cases where the MemN2N model is outperformed) lies in the tasks where the regular KB is used. The model shows significant improvements on tasks 3 and 4, displaying options and providing extra information, respectively.

3. The drawbacks of the proposed model lie in the cases in which unknown entities occur. This is implied by the results of the tasks using the KB containing entities that were never seen by the network.

1.4 Overview

The rest of this thesis is structured as follows. Section 2 describes the problem formulation. Section 3 covers previous work related to End-to-End networks. Section 4 formally defines how Memory End-to-End networks work and explains how Source Awareness works. Section 5 describes the methods used in this thesis and section 6 explains the setup of the experiments. Section 7 gives the results, which are discussed in section 8. Section 9 concludes and section 10 proposes future research.


2 Problem definition

This section gives the formal definition of a TDS and specifies how dialogues are composed.

Dialogues in a TDS are conversations between a human and a computer. They take turns, also called utterances, which are defined as a unit of text without an interruption from the other speaker. A turn can therefore contain multiple sentences (Stienstra [15]).

To collect information from outside the dialogue, an external KB is used, which for the restaurant booking domain stores the cuisine type and contact information of restaurants. During a dialogue, the system can access this KB to look for the restaurant matching the demands of the user.

A TDS uses the KB together with the dialogue to generate a response $R_t$, as shown in equation 1, in which $D_t$ stands for the dialogue at time $t$ in the conversation (Stienstra [15]):

$$\mathrm{TDS}_\Theta(D_t, \mathrm{KB}) \rightarrow R_t \quad (1)$$

The KB is accessed when the system issues an utterance containing the words 'api call', which triggers the system to look for information in the KB.
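For illustration, the trigger can be as simple as a prefix check on the utterance. The following sketch assumes the 'api call' token format shown in the bAbI example dialogues quoted in the appendix; the parsing helper is a hypothetical illustration, not code from the actual system:

```python
# The KB lookup is triggered by utterances starting with 'api call'.
# Assumed format (from the bAbI examples later in this thesis):
#   api call <cuisine> <location> <party_size> <price>
def is_api_call(utterance: str) -> bool:
    return utterance.strip().lower().startswith("api call")

def parse_api_call(utterance: str) -> dict:
    # Split off the 'api' and 'call' tokens, then read the four fields.
    _, _, cuisine, location, size, price = utterance.strip().lower().split()
    return {"cuisine": cuisine, "location": location,
            "party_size": size, "price": price}

call = "api call spanish bombay four moderate"
if is_api_call(call):
    print(parse_api_call(call))
# {'cuisine': 'spanish', 'location': 'bombay', 'party_size': 'four', 'price': 'moderate'}
```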

Looking at the dataset, a dialogue typically consists of five different types of utterances, which leads to the five different tasks proposed by Sakai et al. [13]. In Figure 1, the tasks are visualised using a conversation.

• Task 1 issuing API calls. For the system to find the best answer for the user, the system asks questions to fill in all the required fields for generating the API call.

• Task 2 updating API calls. Users can decide that they have different demands and can adjust their question up to four times. The system then updates the API call.

• Task 3 displaying options. When the first two tasks are executed, the system queries the KB to find restaurants that fit the user's demands.

• Task 4 providing extra information. When the user agrees to a restaurant, the system queries the knowledge base to give additional facts about the restaurant.

• Task 5 conducting full dialogues. This task is a full dialogue and thus combines tasks 1-4.


Figure 1: Five different tasks explained in a sample dialogue (Stienstra [15]).

Section 1.2 presented the main RQ together with three SQs. Below is described how this thesis aims to answer those three sub-questions in order to answer the main RQ.

To answer SQ1, the values of â and a are used, being the predicted and true label, respectively. These values are listed in Table 1 and their equations are explained in section 6.

To provide an answer to SQ2, the loss value L is used. Section 5 explains the meaning of this value and discusses its importance.

The answer to SQ3 is given by inspecting the mistakes made by the system; the corresponding table is given in section 7.

Table 1 lists all the symbols that are used throughout this thesis.


Symbol   Meaning
TDS_Θ    Formal definition of a Task-oriented Dialogue System
u_t      User utterance at time t
s_t      System utterance at time t
r_t      Result at time t
U        Utterances of the conversation spoken by the user, u_1, ..., u_{t-1}
S        Utterances of the conversation spoken by the system, s_1, ..., s_{t-1}
R        Result history of the conversation, r_1, ..., r_{t-1}
D_t      Dialogue at time t
R_t      Response at time t
Φ        Maps an utterance to a bag-of-words dimension
µ        Embedded bag-of-words vector of user utterances
η        Embedded bag-of-words vector of system utterances
A        Embedding matrix
p        Probability vector
m        Embedded bag-of-words vector
â        Predicted label
a        True label
q        User's last utterance (query)
u        Embedded version of query q
o        Output vector
k        The current hop the system is in
KB       Knowledge Base
d̄        Mean difference
T        T-statistic
S_d      Standard deviation
n        Number of samples
L        Loss

Table 1: All symbols used throughout this thesis

3 Related work

3.1 Pipeline method

The traditional method of building a TDS is the pipeline approach. This approach consists of four components (Chen et al. [3]).

1. Natural Language Understanding (NLU)

2. Dialogue State Tracker (DST)

3. Policy Learner (PL)

4. Natural Language Generation (NLG)


The four components run separately and are interdependent (Stienstra [15]). This means that all the components need to be pre-trained, and if one module changes, the whole model needs to be retrained. Also, the NLU requires a lot of human effort because the system relies on handcrafted rules (Chen et al. [3]). Another drawback of the traditional approach is the use of slot filling in the NLU. In slot filling, different slots are chosen to fill during a conversation (Bordes et al. [2]). For example, in the choice of a restaurant, slots can be the price, the number of people and the city in which you want to dine. In the sentence "I want to book a table at an Italian restaurant for four persons in New York", the slot-value pairs are {City: New York; Cuisine: Italian; Number of people: 4}. Slot filling works very well on domain-specific tasks, but is difficult to scale to other domains: all slots for all possible domains would have to be encoded to achieve this (Stienstra [15]). Recent work by Liu and Lane [7] shows an attention network that improves the performance of a regular RNN on slot filling, proving that a network with an attention mechanism improves on the slot filling method.
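As a minimal sketch of the slot-filling idea with handcrafted keyword rules (the slot names and vocabularies below are hypothetical examples, not taken from any system discussed here), note how every slot value for every domain must be spelled out in advance, which is exactly the scaling problem described above:

```python
# Minimal illustration of slot filling with hand-crafted keyword rules.
# Slot names and vocabularies are hypothetical examples.
SLOT_VOCAB = {
    "cuisine": {"italian", "spanish", "british", "french"},
    "city": {"new york", "london", "paris", "rome"},
    "party_size": {"two", "four", "six", "eight"},
}

def fill_slots(utterance: str) -> dict:
    """Scan the utterance for known slot values and return slot-value pairs."""
    text = utterance.lower()
    slots = {}
    for slot, values in SLOT_VOCAB.items():
        for value in values:
            if value in text:
                slots[slot] = value
    return slots

print(fill_slots("I want to book a table at an Italian restaurant "
                 "for four persons in New York"))
# {'cuisine': 'italian', 'city': 'new york', 'party_size': 'four'}
```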

In the DST, a representation of the dialogue so far is composed. Whereas this method uses the previous dialogue state and the output of the NLU module, Young et al. [20] use a process where dialogues behave as a Markov process in which each state is modelled by a probability function.

The PL generates an action based on the output of the DST. Whereas in simple cases a rule-based system is used to create a mapping from the dialogue state (Yan et al. [19]), Williams et al. [18] use an RNN to provide that mapping (Stienstra [15]).

In the final module, the answer is generated using the output of the Policy Learner.

3.2 End-to-End Networks

End-to-End (N2N) models use a neural network that learns from previous conversations. These networks can therefore be used on multiple domains, which results in a network that is no longer focused on specific domains (Bordes et al. [2]). The focus on multiple domains sounds promising, but this approach still has its flaws. By making the model applicable to multiple domains, the model requires a large amount of data and, more importantly, should be capable of processing all this data. Furthermore, because the traditional method works extremely well on domain-specific tasks, N2N networks must be able to perform as well as the traditional methods in order to be useful.

N2N dialogue systems are generally categorised as generative or retrieval-based models.

Generative methods. In this method, a response is generated word for word given the dialogue history (Stienstra [15]). These approaches are mainly non-task-oriented; however, Eric et al. [4] do use this method in combination with a KB.


Retrieval methods. This method generates a response by selecting the answer with maximum probability out of a list of candidates, which makes it a classification problem. This method of evaluation is proposed by Lowe et al. [9], is called Next-Utterance-Classification (NUC) and is used in this thesis.

Because N2N networks look very promising due to their ability to scale to multiple domains, many recent models exploit this type of network to improve the performance of a TDS.

The model proposed in this thesis is also based on N2N networks. It builds on the MemN2N network by Sakai et al. [13], a retrieval-based N2N network that contains an attention function for reading the memory. Further details about this network are given in section 4.1.

4 Background

4.1 Memory End-to-End network

Sakai et al. [13] designed the MemN2N network, which uses an explicit memory to which, during the conversation, the utterances of the user and the computer are appended. Thereafter an attention function is used for reading the memory. This is achieved with multiple layers, also called hops, in which the output generated by a layer is taken as input by the next layer. Moreover, the network can use the previously stored historical dialogues and short-term context to reason about the required response.

MemN2N networks perform extremely well on factual reasoning and deduction, but on multi-fact question answering, positional reasoning and dialogue-related tasks the network faces some difficulties (Liu and Perez [8]).

Input memory representation. When going through the conversation, at every time step t the previous utterance of the user u_t and the response of the system s_t are appended to the memory. The goal at time t is thus to choose s_t (Sakai et al. [13]). Every utterance u_1, ..., u_{t-1} and s_1, ..., s_{t-1} is converted to a vector using embedding matrix A, where Φ maps an utterance to a bag-of-words dimension. The result is shown in equation 2. In the original Memory Network model presented by Weston et al. [17], the utterances do not contain information on which utterance they are. This is added in Φ by extending the vocabulary to encode the index i into the bag of words (Sukhbaatar et al. [16]).

$$m = (A\Phi(u_1), A\Phi(s_1), \ldots, A\Phi(u_{t-1}), A\Phi(s_{t-1})) \quad (2)$$

The last utterance of the user is called the query q and is embedded using the same embedding matrix A to obtain state u.


In equation 3 the match between each memory part and the query is computed by taking their inner product, followed by a softmax, which returns a probability p (Sukhbaatar et al. [16]). This step is an important change because it produces an attention over the memory and ensures reasoning over previous utterances.

$$p_i = \mathrm{softmax}(u^T m_i) \quad (3)$$

Output memory representation. The output vector of the model is computed from the embedded bag-of-words vectors m (equation 2) and their probabilities p (equation 3) by summing each input vector weighted by its probability:

$$o = \sum_i p_i m_i \quad (4)$$

Generating the final prediction. In equation 5, the sum of the output vector o and the input embedding u is computed, in which k is the current hop the system is in (Sukhbaatar et al. [16]).

$$u^{k+1} = o^k + u^k \quad (5)$$

The result of equation 5 is used to produce the predicted label (Raunak [12]).

$$\hat{a} = \mathrm{argmax}(\mathrm{softmax}(W u^{k+1})) \quad (6)$$

As can be derived from equations 5 and 6, the output of a hop is used as input in the following hop.
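To make equations 2 to 6 concrete, here is a minimal NumPy sketch of one forward pass under simplifying assumptions: the bag-of-words vectors are random toy data, a single embedding matrix A is shared by memories and query, and the final weight matrix W scores the whole vocabulary as a stand-in for the candidate list. It is an illustration, not the exact training setup of the thesis:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
V, d = 50, 20          # vocabulary size, embedding dimension
n_mem, hops = 6, 3     # number of stored utterances, number of hops

A = rng.normal(size=(V, d))      # embedding matrix A
W = rng.normal(size=(V, d))      # output weight matrix W

phi = rng.integers(0, 2, size=(n_mem, V)).astype(float)  # bag-of-words Φ(u_i), Φ(s_i)
q   = rng.integers(0, 2, size=V).astype(float)           # bag-of-words of the query

m = phi @ A        # eq. 2: memory vectors m_i = A Φ(·)
u = q @ A          # embedded query u

for k in range(hops):
    p = softmax(m @ u)          # eq. 3: attention over the memory
    o = p @ m                   # eq. 4: sum of memory vectors weighted by p
    u = o + u                   # eq. 5: hop update u^{k+1} = o^k + u^k

a_hat = int(np.argmax(softmax(W @ u)))   # eq. 6: predicted candidate index
print("predicted label:", a_hat)
```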

Figure 2 gives a more detailed visualisation of the composition of the network, with Figure 2a showing a single-layered network and Figure 2b a multi-layered one. Figure 2b shows that the output of the previous layer is passed to the next layer, which happens through equation 5.


Figure 2: A more detailed visualisation of the composition of the Memory Network, consisting of one layer in (a) and of three layers in (b) [16].

The entire model is trained using stochastic gradient descent (SGD), minimizing a standard cross-entropy loss between â and the true label a (Sakai et al. [13]).

4.2 Source Awareness

When looking for entities in the dialogue history, different types of entities are needed. For instance, when making a reservation for a restaurant, relevant entity types are the cuisine type, location, contact details and the name of the restaurant. This information is stored in different parts of the dialogue. In the recently published thesis by Stienstra [15], Source Awareness is applied. This new technique separates the dialogue into three parts: user history, system history and result history. In this way, differences between parts of the dialogue are exploited. This provides a better focus on the different parts of the dialogue and can thus extract information more precisely. Whereas the cuisine type and location are stored in the user history, the contact details and name of the restaurant can be found in the result history.

Using this technique thus requires less memory, given the possibility to look up the information in smaller parts of the dialogue. Figure 3 shows an example of the three different dialogue parts.


Figure 3: Splitting the dialogue into parts. The dialogue is split up into a user, system and result history, which consist of u_1, ..., u_{t-1}, s_1, ..., s_{t-1} and r_1, ..., r_{t-1}, respectively [15].

The language used in the three different parts differs strongly (Stienstra [15]). The vocabulary of the result history is very small and its structure is roughly always the same, containing dense information. In the user history, however, there is a wide variety in word use and words are often redundant. With Source Awareness, the importance of utterances can be better determined. The existing MemN2N model implemented by Raunak [12] is used in this thesis and extended with Source Awareness.¹

5 Methods

This section describes the methods used in this thesis. The methods enhance the regular MemN2N network with Source Awareness, which results in the Source Awareness Memory End-to-End (SAMemN2N) network.

Taking a more detailed look at the composition of a dialogue in general, several things catch the eye. Typically, a dialogue consists of responses from the user, responses from the dialogue system itself and the query results. This means that some pieces of information only ever occur in specific parts of the dialogue. When the system is, for instance, looking for the type of cuisine, this is exclusively found in the responses of the user, whereas the contact details of the restaurant can be found in the query results.

¹ The code for the SAMemN2N network is available at https://github.com/TylerCools/thesis


The Source Awareness technique uses this information to split up the dialogue history into three different parts (a code sketch follows the list):

• User-history (U), all the utterances from the user, u_1, ..., u_{t-1}.

• System-history (S), all the utterances from the system, s_1, ..., s_{t-1}.

• Result-history (R), which contains all the output from the knowledge base queries, r_1, ..., r_{t-1}.
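A minimal sketch of this split; representing each turn as a (speaker, text) tuple is an assumption made for illustration, not the storage format of the actual implementation:

```python
# Split a dialogue history into user, system and result histories.
# Each turn is a (speaker, text) tuple; 'result' marks KB output rows.
def split_history(turns):
    user_history   = [t for s, t in turns if s == "user"]
    system_history = [t for s, t in turns if s == "system"]
    result_history = [t for s, t in turns if s == "result"]
    return user_history, system_history, result_history

turns = [
    ("user",   "can you book a table with spanish cuisine?"),
    ("system", "where should it be?"),
    ("user",   "bombay please"),
    ("result", "resto bombay cheap spanish 1stars R location bombay"),
]
U, S, R = split_history(turns)
print(len(U), len(S), len(R))  # 2 1 1
```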

Section 4.1 described the formal definition of the MemN2N and showed how an attention function is implemented in the network. To extend this network with Source Awareness, some adjustments must be made. Whereas in equation 2 the embedded bag-of-words vector contains the dialogue of both the user and the system, this model splits that equation in two. This results in the following two equations, with equation 7 containing the bag-of-words vector of the user utterances and equation 8 the bag-of-words vector of the system utterances.

$$\mu = A\Phi(U) \quad (7)$$

$$\eta = A\Phi(S) \quad (8)$$

These two equations are both used to compute the probability vector:

$$p_i = \mathrm{softmax}(u^T \mu_i \eta_i) \quad (9)$$

Finally, equation 4 is adapted to handle multiple bag-of-words vectors:

$$o = \sum_i p_i \mu_i \eta_i \quad (10)$$
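Concretely, the per-source attention and output of equations 7 to 10 might look as follows, continuing the NumPy sketch from section 4.1. Combining µ_i and η_i element-wise before the inner product with u is one possible reading of equation 9, since the thesis does not spell out the operator; toy random data stands in for real utterances:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(1)
V, d, n_mem = 50, 20, 6

A = rng.normal(size=(V, d))

phi_user   = rng.integers(0, 2, size=(n_mem, V)).astype(float)  # Φ(U)
phi_system = rng.integers(0, 2, size=(n_mem, V)).astype(float)  # Φ(S)
q          = rng.integers(0, 2, size=V).astype(float)

mu  = phi_user @ A     # eq. 7: µ = AΦ(U)
eta = phi_system @ A   # eq. 8: η = AΦ(S)
u   = q @ A

# eq. 9: attention computed from both sources; µ_i and η_i combined
# element-wise before the inner product with u (one possible reading).
p = softmax((mu * eta) @ u)

# eq. 10: output vector built from both embedded histories.
o = p @ (mu * eta)
print(o.shape)  # (20,)
```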

Loss function. In neural networks, the loss is an indication of how well the model is doing on the training and validation sets. Unlike accuracy, which is expressed as a percentage, the loss is measured as a summation of the errors made on each example in the training or validation set (Peng et al. [11]). In this thesis the loss is defined as the softmax cross-entropy between the predicted answer and the true output, calculated using equation 11, in which a is the encoded true label and â the predicted label; â is first normalized using softmax.

$$L = -\sum_n a_n \log \hat{a}_n \quad (11)$$


The cross-entropy tends toward zero as the neural network improves, but never reaches zero, which makes it a good measure for interpreting the loss of the network (Nielsen [10]).
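A sketch of equation 11 on a toy candidate list; the logits are arbitrary numbers chosen for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# One-hot encoded true label a over four candidate responses.
a = np.array([0.0, 0.0, 1.0, 0.0])

# Raw network scores for the candidates, normalized with softmax to get â.
logits = np.array([1.2, 0.3, 2.5, -0.7])
a_hat = softmax(logits)

# eq. 11: L = -Σ_n a_n log â_n
L = -np.sum(a * np.log(a_hat))
print(float(L))  # shrinks toward 0 as â concentrates on the true label
```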

6 Experiments

This section describes the design of the experiments, gives information about the datasets and explains how the results are computed.

6.1 Datasets

All the frameworks are tested with data of restaurant bookings. Two types of data are used: the bAbI dialogues and the Dialog State Tracking Challenge 2 (DSTC2) dataset.

bAbI dialogues.² This data consists of dialogues simulated by a computer. The dialogues of this dataset are generated with the framework introduced by Bordes et al. [2]. A big challenge in dialogues is handling entities. Because the data contains a lot of different entities, it is inevitable that at some point the system sees an entity it has never seen before. Therefore, the KBs are split in half. The first KB is used to create the standard train, validation and test sets. The second KB is used to generate test dialogues, termed Out-Of-Vocabulary (OOV) test sets (Bordes et al. [2]). The entities in this set are thus unseen in any training dialogue and expected to be harder to handle.

Dialogue State Tracking Challenge.³ Whereas the bAbI dialogues are simulated by a computer, the DSTC2 dataset consists of real user-system data (Henderson et al. [6]). This data also contains restaurant booking data and has three fields: type of cuisine, location and price range, which have 91, five and three choices, respectively. Because this dataset is real user-system data, it contains more noise.

6.2 Evaluation

Accuracy. The performance of the model is measured as turn-level accuracy, which is defined as the number of correct responses out of all responses (Stienstra [15]). Because the MemN2N is a retrieval-based method, the response is generated by selecting from a list of candidates instead of generating an output (Lowe et al. [9]).

² https://research.fb.com/downloads/babi/
³ http://camdial.org/~mh521/dstc/
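As a sketch, turn-level accuracy for a retrieval model reduces to ranking the candidate list and comparing the top candidate with the gold response. The candidate strings and scores below are stand-ins for the network's real softmax output:

```python
# Turn-level accuracy: fraction of turns where the top-ranked candidate
# equals the gold response. Scores stand in for the model's softmax output.
def turn_accuracy(predictions, gold):
    correct = sum(1 for p, g in zip(predictions, gold) if p == g)
    return correct / len(gold)

def top_candidate(scores, candidates):
    return candidates[max(range(len(scores)), key=scores.__getitem__)]

candidates = ["i'm on it",
              "where should it be?",
              "ok let me look into some options for you"]
scores_per_turn = [[0.1, 0.7, 0.2], [0.2, 0.1, 0.7]]
gold = ["where should it be?", "ok let me look into some options for you"]

preds = [top_candidate(s, candidates) for s in scores_per_turn]
print(turn_accuracy(preds, gold))  # 1.0
```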


Paired t-test. Once the results are obtained, a check must be conducted to see whether they are significant. A paired t-test is used to evaluate the mean difference between matched data points, in this case the results of the MemN2N and SAMemN2N networks (Hedberg and Ayers [5]). First, the difference between the two data points x_i and y_i is calculated: d_i = x_i − y_i. Then the mean difference d̄ is calculated. Next, the standard error of the mean difference is calculated, where S_d is the standard deviation and n is the number of samples:

$$SE(\bar{d}) = \frac{S_d}{\sqrt{n}} \quad (12)$$

Lastly, the t-statistic is calculated:

$$T = \frac{\bar{d}}{SE(\bar{d})} \quad (13)$$

This value can be looked up in the t-distribution to obtain the p-value, which indicates the significance of the result.
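A sketch of this procedure; the paired accuracy arrays are made-up numbers, and scipy's paired test, which computes the same statistic, is printed as a cross-check:

```python
import numpy as np
from scipy import stats

# Made-up paired accuracies of the two models over matched runs.
memn2n   = np.array([74.9, 75.3, 74.5, 75.1, 74.8])
samemn2n = np.array([99.6, 99.7, 99.4, 99.8, 99.7])

d = samemn2n - memn2n                 # per-pair differences d_i = x_i - y_i
d_bar = d.mean()                      # mean difference d̄
se = d.std(ddof=1) / np.sqrt(len(d))  # eq. 12: SE(d̄) = S_d / √n
T = d_bar / se                        # eq. 13: t-statistic

print(T)
print(stats.ttest_rel(samemn2n, memn2n))  # same statistic, plus the p-value
```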

7 Results

This section describes the results of the experiments and shows the types of mistakes the system makes.

Accuracy. Table 2 shows the average turn accuracy on the five bAbI dialogue tasks and on DSTC2 (Task 6). Tasks 1 to 5 are tested with both the OOV and the regular KB. Because Task 6 consists of real conversations between human and system, no OOV KB exists for it. The model outperforms the regular MemN2N network on almost every task. A more elaborate analysis is given later in this section.

            MemN2N   SAMemN2N
Task 1      99.90    99.90 ±0.060
Task 2      100.00   100.00 ±0.014
Task 3      74.90    99.69 ±0.288
Task 4      59.50    93.09 ±1.220
Task 5      96.10    87.60 ±0.451

Task 1 OOV  78.86    79.55 ±2.621
Task 2 OOV  74.67    78.87 ±0.010
Task 3 OOV  75.20    72.33 ±0.918
Task 4 OOV  56.98    56.98 ±0.000
Task 5 OOV  64.29    60.41 ±0.462

Task 6      40.60    65.78 ±0.501

Table 2: Results of the MemN2N and SAMemN2N networks. Both networks are tested on Tasks 1 to 5 with the regular KB and the OOV KB (mean and standard deviation over 20 runs).


Loss. To get more insight into the performance of the model, the validation and training losses are examined. Figures 4 and 5 plot the training and validation losses of the tasks without and with the OOV KB, respectively.⁴ All tasks in both figures show a decrease in loss as the number of iterations increases. Also, the training and validation losses have roughly the same value. The difference between the two figures is in the noise of the curves: whereas in Figure 4 the values show very little noise, Figure 5 shows more noise in its curves. The difference in noise is explained in section 8.

The only panel that stands out is Figure 4f, in which the training loss is significantly larger than the validation loss. This is discussed in section 8.

The losses of the regular MemN2N network are plotted to compare the results. These losses are presented in Figure 6.

Figure 4: Training and validation losses on the different tasks using the SAMemN2N network without the OOV KB. Panels (a)-(f) correspond to Tasks 1-6.

⁴ Figure 5 does not contain a panel for Task 6 due to the lack of OOV KB data for that task.


Figure 5: Training and validation losses on the different tasks using the SAMemN2N network with the OOV KB. Panels (a)-(e) correspond to Tasks 1-5.

Figure 6: Training and validation losses of the regular MemN2N network. Panels (a)-(f) correspond to Tasks 1-6.


Comparing Figures 4 and 5 with Figure 6, the losses in the first two figures show more stable curves and the values are much lower. This shows a significant improvement of the Source Awareness model in terms of loss.

Significance. With the results in Table 2 and the equations in section 6.2, the t-statistic can be calculated to find its corresponding p-value. This resulted in a p-value of 0.0001.⁵

⁵ https://www.socscistatistics.com/pvalues/tdistribution.aspx was used to calculate the p-value corresponding to the t-statistic.

Mistake inspection. As mentioned earlier, the model improves on almost every task. On Task 5, however, the model decreases in accuracy relative to the regular MemN2N. To provide more insight into the outcome of the model, all the cases in which the system predicted a wrong answer were divided into the following three mistake categories, each subdivided by severity. In total, this leads to seven types of mistakes, which are listed in Table 3 together with their occurrences.

1. Wrong follow-up. This can occur when, for instance, the system asks for the price range, but this was already mentioned earlier in the conversation.

Predicted answer: which price range are you looking for?
Real answer: ok let me look into some options for you

2. API call mistakes. When the system makes an API call, it can make mistakes. This type of mistake is divided into three sub-mistakes, distinguished by the number of mistakes in a single API call.

• API call with one mistake

Predicted answer: api call spanish madrid six cheap
Real answer: api call spanish paris six cheap

• API call with two mistakes

Predicted answer: api call spanish rome two cheap
Real answer: api call spanish london six cheap

• API call with more than two mistakes

Predicted answer: api call italian rome eight cheap
Real answer: api call british bombay six moderate

3. System answers. After the system makes an API call, it returns an answer, and it occurs that this answer is wrong. This mistake is divided into three sub-mistakes, distinguished by the number of mistakes made in a single system answer.

• Answer with one mistake

Predicted answer: what do you think of this option: resto bombay cheap spanish 7stars
Real answer: what do you think of this option: resto bombay cheap spanish 6stars

• Answer with two mistakes

Predicted answer: what do you think of this option: resto madrid moderate italian 7stars
Real answer: what do you think of this option: resto bombay moderate italian 2stars

• Answer with more than two mistakes

Predicted answer: what do you think of this option: resto rome expensive french 1stars
Real answer: what do you think of this option: resto bombay expensive british 3stars

As can be seen in Table 3, the type of mistake that occurs the most is the 'follow up' mistake. This mistake occurs when the system asks a question to retrieve information that is already in the conversation. Another mistake that occurs often is the system proposing a restaurant with exactly one mistake in the answer. The worst case is an API call with more than two mistakes; in such cases, the system has almost every detail of the API call wrong. Although few mistakes are made on the first four tasks, it is interesting to see where those mistakes are made. More details on dialogues and their mistakes are given in the appendix in section 11.

Tasks    Follow up    1-API     2-API       >2-API      1-system    2-system   >2-system

T1       3 (0.45)     0 (0)     0 (0)       0 (0)       4 (0.55)    0 (0)      0 (0)
T2       0 (0.20)     0 (0.30)  0 (0)       0 (0)       1 (0.50)    0 (0)      0 (0)
T3       8 (0.21)     0 (0)     0 (0)       0 (0)       25 (0.64)   0 (0)      6 (0.15)
T4       1 (0)        0 (0)     0 (0)       0 (0)       17 (0.08)   40 (0.17)  172 (0.75)
T5       1055 (0.46)  88 (0.04) 180 (0.08)  373 (0.16)  472 (0.20)  48 (0.02)  99 (0.04)

T1 OOV   132 (0.11)   0 (0)     959 (0.77)  40 (0.03)   107 (0.09)  0 (0)      0 (0)
T2 OOV   0 (0)        2 (0)     1647 (0.82) 350 (0.17)  0 (0)       0 (0)      0 (0)
T3 OOV   258 (0.09)   0 (0)     0 (0)       0 (0)       77 (0.03)   351 (0.13) 2068 (0.74)
T4 OOV   0 (0)        0 (0)     0 (0)       0 (0)       0 (0)       65 (0.04)  1445 (0.96)
T5 OOV   1059 (0.14)  0 (0.0)   457 (0.06)  1543 (0.21) 415 (0.06)  419 (0.06) 3489 (0.47)

T6       2210 (0.53)  37 (0.01) 12 (0)      103 (0.03)  1694 (0.43) 18 (0)     0 (0)

Table 3: Occurrence of each mistake type in Tasks 1 to 6, with the relative proportion in parentheses (mean over 10 runs).


Hyperparameter investigation. Because the algorithm takes some time to run a task (about two hours per task), some tuning of the hyperparameters was performed. All of the above results were obtained using 200 epochs. But because some of the test accuracies come close to, or even reach, 100%, it is interesting to investigate whether 200 epochs is perhaps too many for some tasks. The accuracy plotted against the number of epochs is shown in Figures 7a and 7b, showing the results without and with the OOV Knowledge Base, respectively. In Figure 7a the accuracies converge very quickly to a stable value, but in Figure 7b the values are much more unstable.

Figure 7: Accuracies of Tasks 1 to 6 using the SAMemN2N network plotted against the number of epochs, using the regular KB in (a) and the OOV KB in (b).

8 Discussion

This thesis extended the Memory End-to-End network proposed by Weston et al. [17] by implementing Source Awareness. Although the results show large overall improvements, some results are not as expected. These results are discussed in this section.

The results of using the KB with and without OOV entities are quite different. This can be seen in three performance measurements. Table 2 shows that using the regular KB yields larger improvements than using the OOV KB. Furthermore, there is a noticeable difference between Figures 4 and 5: whereas the diagrams in Figure 4 show very little noise, Figure 5 shows more noise in its curves. This can be explained by the unknown entities in the OOV test sets introducing more noise.


More differences between using the regular and the OOV KB can be seen in Tasks 1 and 2. Table 3 lists the types of mistakes with their occurrences. In Tasks 1 and 2, a large number of mistakes in the API calls are made when using the OOV KB. With the regular KB, however, the system hardly makes any mistakes. Making many mistakes in API calls is not surprising, since these two tasks involve issuing and updating API calls.

Table 3 also shows that Task 3 OOV produces more mistakes in the system responses. Whereas Task 3 with the regular KB makes at most one mistake in its system responses, Task 3 with the OOV KB mostly generates responses with more than two mistakes.

When examining the difference between Task 5 with the regular KB and with the OOV KB, the mistakes being made are quite different. Whereas the former makes half of its mistakes by asking the wrong follow-up question, the latter mostly makes mistakes in generating a response. Using the OOV KB has a positive effect on asking the follow-up question, since the relative proportion of this mistake is much lower than with the regular KB.

The last difference between the results of using the regular and the OOV KB is in Figure 7. Whereas in Figure 7a the values converge to a stable value very quickly, Figure 7b shows that when using the OOV KB, the values are not stable. From these results it can be inferred that the system struggles with unknown entities.

The mistakes on Tasks 3 and 4 with both KBs show that the model struggles with generating the right response. This is in line with the expectations, because these two tasks generally involve displaying options and dealing with extra information.

The high loss values in Figure 4f compared to the other panels in Figure 4 arise because the DSTC2 dataset contains real data between human and system. Another struggle of this model regards the word embeddings: when the system knows it must output a phone number, word embeddings make it hard to distinguish between phone numbers and addresses (Weston et al. [17]).

The number in Table 2 that draws attention is the standard deviation of Task 4 OOV for the SAMemN2N. Despite the fact that the model has been tested 20 times, the mean of this task is always exactly the same. Table 3, however, shows that the mistakes are not always exactly the same. It seems that on this task the model somehow reaches a maximum, which is also supported by Figure 7b, in which the line of Task 4 is perfectly straight. An explanation for these results has not yet been found.

The most significant improvement in performance is on Task 6, which improved from 40.60 to 65.78 compared to the original model. This is promising because this is the dataset that uses real conversations between human and system instead of the simulated data of the first five tasks.

To place the results discussed above into perspective, some strengths and weaknesses of this study should be noted. This thesis used many measurements to be able to say as much as possible about the data. Whereas Sukhbaatar et al. [16] only measure turn-level accuracy, this thesis used three other


measurements on top of the accuracy to check the results. First, the paired t-test was used to check the significance of the results. Furthermore, the losses of the model were studied to ensure the model is not over- or underfitting. Lastly, a study of mistakes was conducted to find the strengths and flaws of the model. This mistake study could have been extended so that more could be said about specific flaws of this model.

9 Conclusion

This thesis introduced the SAMemN2N model, which outperforms the regular MemN2N on almost all of Tasks 1 to 6.

One of the decreases in accuracy is on Task 5, with both the OOV and the non-OOV KB. Looking at the types of mistakes shows that the mistakes the system makes on this task involve suggesting a restaurant; most of the time it makes more than two mistakes in its suggestion. This implies that the network has much difficulty dealing with extra information. This answers SQ1, "how is the accuracy affected?"

The answer to SQ2, "is the model a good fit or is it over- or underfitting?", can be found in the differences between Figures 4 and 6. From these figures it can be derived that the network behaves as expected, having a lower loss than the original MemN2N and showing the expected decrease.

The model has been tested 20 times, and the small standard deviations, together with the results of the paired t-test, indicate that the results of this thesis are significant.

The decrease in performance when using the OOV Knowledge Base is due to the unknown entities. On the follow-up questions, however, the system improves compared to the tasks using the regular KB. This answers SQ3, "When and why does our model win (compared with the baseline) and fail (compared with the golden labels)?"

In conclusion, it can be stated that the Source Awareness Memory End-to-End model outperforms the existing Memory End-to-End model. Measurements show that the results are significant and that the model is a good fit. An error inspection was conducted, and one of the model's flaws lies in asking the right follow-up question. This answers the research question "How will the addition of Source Awareness affect the results of the current Memory End-to-End network?"

10 Future work

Despite the results showing the value of this thesis, some improvements could still be made.

Firstly, section 4 discussed word embeddings. Whereas word embeddings are useful for handling synonyms, words denoting entities can make the system struggle.


Entities like a new restaurant are often not seen in the training data, so no word embedding is available for them. To address this problem, Sakai et al. [13] extended the MemN2N network with "match type" features. Each entity type (e.g. cuisine type, location) is added to the vocabulary. Any word that matches a specific entity type can then be matched even though it has never been seen in the training data. This resolves the problem of the lacking word embedding and thus prevents the resulting failures. The results of this technique were promising, but unfortunately the code for it was not available; contact was made with the authors of the paper, but the code was not released. For future research, match type features could be added to extend the SAMemN2N network.
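A sketch of the match-type idea; the entity lists and type-token names below are hypothetical, since the real feature construction by Sakai et al. [13] was not available, as noted above:

```python
# Match-type features: append a type token for every word that matches a
# known entity list, so unseen entities can still be matched by type.
# Entity lists and type-token names are hypothetical examples.
ENTITY_TYPES = {
    "cuisine":  {"italian", "spanish", "british"},
    "location": {"rome", "paris", "bombay"},
}

def add_match_type_features(utterance: str) -> list:
    tokens = utterance.lower().split()
    out = []
    for tok in tokens:
        out.append(tok)
        for etype, values in ENTITY_TYPES.items():
            if tok in values:
                out.append(f"<{etype}>")  # type token added to the vocabulary
    return out

print(add_match_type_features("api call spanish bombay"))
# ['api', 'call', 'spanish', '<cuisine>', 'bombay', '<location>']
```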

A second addition to the MemN2N implemented by Sakai et al. [13] is the measurement of per-dialogue accuracy. This could be an important measure of the performance of the model and can thus be added in future work.

Lastly, Liu and Perez [8] extended the MemN2N to regulate the interactions between the memory and the controller modules. Whereas in the MemN2N the interactions needed to be decided a priori or with the use of supervised techniques, Liu and Perez [8] use highway networks to ensure dynamic regulation of memory interaction (Srivastava et al. [14]). This Gated Memory End-to-End (GMemN2N) network is thus capable of dynamically conditioning the memory reading operation on the controller state at each hop, which ensures that only the relevant information flows through each hop.

11 Appendix

This section shows dialogues with their predicted answers. The tables contain columns for time, speaker and dialogue, indicating the point in the conversation, the speaker and the sentence, respectively. In some tables a big jump in time is made (for example in Table 5, in which the dialogue jumps from 3 to 42). In these cases, all the turns in between are results from the KB and not relevant enough to show in the table. A dialogue from Task 2 is left out because this task reaches 100%.

Time  Speaker  Dialogue
1     User     hi
2     System   hello what can I help you with today?
3     User     can you book a table for four people in a moderate price range with spanish cuisine in bombay?
4     System   i'm on it
5     System   <SILENCE> ok let me look into some options for you

SAMemN2N        api call spanish bombay four moderate
MemN2N          api call british bombay four moderate
Correct answer  api call spanish bombay four moderate [Correct]

Table 4: Task 1 sample dialogue. In this example the original MemN2N network predicts the answer wrong, whereas the SAMemN2N predicts the right answer.


Time  Speaker  Dialogue
1     System   resto paris expensive french 2stars R location paris
2     System   resto paris expensive french 2stars R number six
3     System   resto paris expensive french 2stars R price expensive
..    ..       ..
42    System   resto paris expensive french 2stars R rating 2
43    User     good morning
44    System   hello what can I help you with today?
45    User     can you book a table?
46    System   I'm on it
47    System   <SILENCE> any preference on a type of cuisine
48    User     with french cuisine
49    System   where should it be?
50    User     paris please
51    System   how many people would be in your party?
52    User     for six please
53    System   which price range are looking for?
54    User     in a expensive price range please
55    System   ok let me look into some options for you
56    System   <SILENCE> what do you think of this option: resto paris expensive french 7stars
57    User     no this does not work for me
58    System   sure let me find an other option for you
59    System   <SILENCE> what do you think of this option: resto paris expensive french 5stars
60    User     do you have something else?
61    System   sure let me find an other option for you
62    System   <SILENCE> what do you think of this option: resto paris expensive french 4stars
63    User     no this does not work for me
64    System   sure let me find an other option for you

SAMemN2N        what do you think of this option: resto paris expensive french 2stars
MemN2N          what do you think of this option: resto rome expensive italian 1stars
Correct answer  what do you think of this option: resto paris expensive french 2stars [Correct]

Table 5: Task 3 sample dialogue, in which the SAMemN2N network outputs the correct answer in contrast to the regular MemN2N network.

Time  Speaker  Dialogue
1     System   resto rome cheap spanish 1stars R phone
2     System   resto rome cheap spanish 1stars R cuisine spanish
3     System   resto rome cheap spanish 1stars R address
4     System   resto rome cheap spanish 1stars R location rome
5     System   resto rome cheap spanish 1stars R number eight
6     System   resto rome cheap spanish 1stars R price cheap
7     System   resto rome cheap spanish 1stars R rating 1
8     User     hello
9     System   hello what can I help you with today?
10    User     can you book a table at resto rome cheap spanish 1stars?
11    System   great let me do the reservation
12    User     may I have the address of the restaurant?

SAMemN2N        here it is resto rome cheap spanish 1stars address
MemN2N          here it is resto bombay expensive italian 3stars phone
Correct answer  here it is resto rome cheap spanish 1stars address [Correct]

Table 6: Dialogue on Task 4. The SAMemN2N network outputs the correct answer for this dialogue in contrast to the regular MemN2N network.


Time  Speaker  Dialogue
1     User     hi
2     System   hello what can I help you with today?
3     User     can you book a table with british cuisine?
4     System   I'm on it
5     System   <SILENCE> where should it be?
6     User     rome please
7     System   how many people would be in your party?
8     User     we will be six
9     System   which price range are looking for?
10    User     in a expensive price range please
11    System   ok let me look into some options for you
12    System   <SILENCE> api call british rome six expensive
13    User     instead could it be in bombay?
14    System   sure is there anything else to update?
15    User     actually I would prefer with italian food
16    System   sure is there anything else to update?
17    User     actually I would prefer in a cheap price range
18    System   sure is there anything else to update?
19    User     no
20    System   ok let me look into some options for you
21    System   <SILENCE> api call italian bombay six cheap
22    System   resto bombay cheap italian 4stars
..    ..       ..
47    System   resto bombay cheap italian 1stars R rating 1
48    System   <SILENCE> what do you think of this option: resto bombay cheap italian 3stars
49    User     no I don't like that
50    System   sure let me find an other option for you
51    System   <SILENCE> what do you think of this option: resto bombay cheap italian 2stars
52    User     do you have something else?
53    System   sure let me find an other option for you
54    System   <SILENCE> what do you think of this option: resto bombay cheap italian 1stars
55    User     I love that
56    System   great let me do the reservation
57    User     can you provide the address?
58    System   here it is resto bombay cheap italian 1stars address
59    User     thank you
60    System   is there anything I can help you with
61    User     no thank you
62    System   you're welcome

MemN2N       what do you think of this option: resto bombay cheap italian 2stars
SAMemN2N     what do you think of this option: resto bombay expensive british 5stars
Real answer  what do you think of this option: resto bombay cheap italian 2stars [Incorrect]

Table 7: Dialogue of Task 5, in which the SAMemN2N network outputs the wrong answer in contrast to the regular MemN2N network.


Time  Speaker  Dialogue
1     System   <SILENCE> Hello, welcome to the Cambridge restaurant system. You can ask for restaurants by area, price range or food type. How may I help you?
2     User     I need to find a cheap restaurant
3     System   There are restaurants in the cheap price range. What type of food do you want?
3     User     any
4     System   You are looking for a restaurant serving any kind of food right?
4     User     yes
5     System   What part of town do you have in mind?
5     User     south
6     System   api call R cuisine R location cheap
6     System   thanh binh R post code thanh binh post code
..    ..       ..
144   System   the gandhi R price cheap
145   System   the gandhi R rating 0

MemN2N       the missing sock is a nice place in the east of town and the prices are cheap
SAMemN2N     nandos is a nice place in the south of town and the prices are cheap
Real answer  nandos is a nice place in the south of town and the prices are cheap [Correct]

Table 8: Dialogue of Task 6 (DSTC2), in which the SAMemN2N network outputs the correct answer in contrast to the regular MemN2N network.


References

[1] Arora, S., Batra, K., and Singh, S. (2013). Dialogue system: A brief review. arXiv preprint arXiv:1306.4134.

[2] Bordes, A., Boureau, Y.-L., and Weston, J. (2016). Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683.

[3] Chen, H., Liu, X., Yin, D., and Tang, J. (2017). A survey on dialogue systems: Recent advances and new frontiers. ACM SIGKDD Explorations Newsletter, 19(2):25–35.

[4] Eric, M., Krishnan, L., Charette, F., and Manning, C. D. (2017). Key-value retrieval networks for task-oriented dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 37–49.

[5] Hedberg, E. and Ayers, S. (2015). The power of a paired t-test with a covariate. Social Science Research, 50:277–291.

[6] Henderson, M., Thomson, B., and Williams, J. D. (2014). The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 263–272.

[7] Liu, B. and Lane, I. (2016). Attention-based recurrent neural network models for joint intent detection and slot filling. arXiv preprint arXiv:1609.01454.

[8] Liu, F. and Perez, J. (2017). Gated end-to-end memory networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1–10.

[9] Lowe, R., Serban, I. V., Noseworthy, M., Charlin, L., and Pineau, J. (2016). On the evaluation of dialogue systems with next utterance classification. In 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 264–269.

[10] Nielsen, M. A. (2015). Neural Networks and Deep Learning, volume 25. Determination Press.

[11] Peng, B., Lu, Z., Li, H., and Wong, K.-F. (2015). Towards neural network-based reasoning. arXiv preprint arXiv:1508.05508.

[12] Raunak, V. (2017). Tensorflow implementation of learning end-to-end goal-oriented dialog. https://github.com/vyraun/chatbot-MemN2N-tensorflow.

[13] Sakai, A., Shi, H., Ushio, T., and Endo, M. (2017). End-to-end memory networks with word abstraction and contextual numbering for goal-oriented tasks. Dialog System Technology Challenges, 6.

[14] Srivastava, R. K., Greff, K., and Schmidhuber, J. (2015). Highway networks. arXiv preprint arXiv:1505.00387.

[15] Stienstra (2018). Role-aware recurrent entity networks for task-oriented dialogue systems (unpublished Master thesis). University of Amsterdam, Amsterdam, Netherlands.

[16] Sukhbaatar, S., Szlam, A., Weston, J., and Fergus, R. (2015). End-to-end memory networks. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates, Inc.

[17] Weston, J., Chopra, S., and Bordes, A. (2015). Memory networks. In International Conference on Learning Representations (ICLR).

[18] Williams, J. D., Asadi, K., and Zweig, G. (2017). Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 665–677.

[19] Yan, Z., Duan, N., Chen, P., Zhou, M., Zhou, J., and Li, Z. (2017). Building task-oriented dialogue systems for online shopping. In AAAI, pages 4618–4626.

[20] Young, S., Gasic, M., Thomson, B., and Williams, J. D. (2013). POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.

[21] Zaremba, W. and Sutskever, I. (2014). Learning to execute. arXiv preprint arXiv:1410.4615.
