
arXiv:1903.06164v3 [cs.LG] 9 Jun 2019

Episodic Memory Reader: Learning What to Remember for Question Answering from Streaming Data

Moonsu Han1∗ Minki Kang1∗ Hyunwoo Jung1 Sung Ju Hwang1,2

KAIST1, Daejeon, South Korea
AITRICS2, Seoul, South Korea

{mshan92, zzxc1133, hyunwooj, sjhwang82}@kaist.ac.kr

Abstract

We consider a novel question answering (QA) task where the machine needs to read from large streaming data (long documents or videos) without knowing when the questions will be given, which is difficult to solve with existing QA methods due to their lack of scalability. To tackle this problem, we propose a novel end-to-end deep network model for reading comprehension, which we refer to as Episodic Memory Reader (EMR), that sequentially reads the input contexts into an external memory while replacing memories that are less important for answering unseen questions. Specifically, we train an RL agent to replace a memory entry when the memory is full, in order to maximize its QA accuracy at a future timepoint, while encoding the external memory using either the GRU or the Transformer architecture to learn representations that consider the relative importance between the memory entries. We validate our model on a synthetic dataset (bAbI) as well as real-world large-scale textual QA (TriviaQA) and video QA (TVQA) datasets, on which it achieves significant improvements over rule-based memory scheduling policies and an RL-based baseline that independently learns the query-specific importance of each memory.

1 Introduction

The question answering (QA) problem is one of the most important challenges in Natural Language Understanding (NLU). In recent years, there has been drastic progress on the topic, owing to the success of deep learning based QA models (Sukhbaatar et al., 2015; Seo et al., 2016; Xiong et al., 2016; Hu et al., 2018; Back et al., 2018; Devlin et al., 2018).

* Equal contribution

On certain tasks such as machine reading comprehension (MRC), where the problem is to find the span of the answer within a given paragraph (Rajpurkar et al., 2016), deep learning based QA models have even surpassed human-level performance.

Despite such impressive achievements, it is still challenging to model question answering with document-level context (Joshi et al., 2017), where the context may include a long document with a large number of paragraphs, due to problems such as the difficulty of modeling long-term dependency and the computational cost. To overcome such scalability problems, researchers have proposed pipelining or confidence-based selection methods that combine paragraph-level models to obtain a document-level model (Joshi et al., 2017; Chen et al., 2017; Clark and Gardner, 2018; Wang et al., 2018b). Yet, such models are applicable only when questions are given beforehand and all sentences in the document can be stored in memory.

However, in realistic settings, the amount of context may be too large to fit into the system memory. We may consider query-based context selection methods such as the ones proposed in Indurthi et al. (2018) and Min et al. (2018), but in many cases the question may not be given when reading in the context, and thus it would be difficult to select the context based on the question. For example, a conversation agent may need to answer a question after numerous conversations over a long time period, and a video QA model may need to watch an entire movie, a sports game, or days of streaming video from security cameras before answering a question. In such cases, existing QA models will fail to solve the problem due to the memory limitation.

In this paper, we target the novel problem of solving question answering with streaming data as context, where the size of the context could be significantly larger than what the memory can accommodate (see Figure 1).



Figure 1: Concept: We consider a novel problem of learning from streaming data, where the QA model may need to answer a question that is given after reading in an unlimited amount of context. To solve this problem, our Episodic Memory Reader (EMR) learns to retain the most important context vectors in an external memory, while replacing the memory entries in order to maximize its accuracy on an unseen question given at a future timestep.

In such a case, the model needs to carefully manage what to remember from this streaming data such that the memory contains the most informative context instances in order to answer an unseen question in the future. We pose this memory management problem as a learning problem and train both the memory representation and the scheduling agent using reinforcement learning.

Specifically, we propose to train the memory module itself using reinforcement learning to replace the most uninformative memory entry in order to maximize its reward on a given task. However, this is a seemingly ill-posed problem since, most of the time, the scheduling should be performed without knowing which question will arrive next. To tackle this challenge, we implement the policy network and the value network to learn not only the relation between the sentences and the query but also the relative importance among the sentences, in order to maximize the question answering accuracy at a future timepoint. We refer to this network as Episodic Memory Reader (EMR). EMR can perform selective memorization to keep a compact set of important context that will be useful for future tasks in lifelong learning scenarios.

We validate our proposed memory network on a large-scale QA task (TriviaQA) and a video question answering task (TVQA), where the context is too large to fit into the external memory, against rule-based scheduling methods and an RL-based scheduling method that does not consider the relative importance between memories. The results show that our model significantly outperforms the baselines, due to its ability to preserve the most important pieces of information from the streaming data.

Our contribution is threefold:

• We consider a novel task of learning to remember important instances from streaming data for question answering, where the size of the memory is significantly smaller than the length of the data stream.

• We propose a novel end-to-end memory-augmented neural architecture for solving QA from streaming data, where we train a scheduling agent via reinforcement learning to store the most important memory entries for solving future QA tasks.

• We validate the efficacy of our model on real-world large-scale text and video QA datasets, on which it obtains significantly improved performances over baseline methods.

2 Related Work

Question-answering   There has been rapid progress in question answering (QA) in recent years, thanks to the advancement of deep learning as well as the availability of large-scale datasets. One of the most popular large-scale QA datasets is the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016), which contains 100K question-answering pairs. Unlike Richardson et al. (2013) and Hermann et al. (2015), which provide multiple-choice QA pairs, SQuAD provides and requires predicting the exact locations of the answers. On this span prediction task, attentional models (Pan et al., 2017; Cui et al., 2017; Hu et al., 2018) have achieved impressive performances, with Bi-Directional Attention Flow (BiDAF) (Seo et al., 2016), which uses a bi-directional attention mechanism over the context and query, being one of the best performing models. TriviaQA (Joshi et al., 2017) is another large-scale QA dataset that includes 950K QA pairs. Since the length of each document in TriviaQA is much longer than in SQuAD, with an average of 3K sentences per document, existing span prediction models (Joshi et al., 2017; Back et al., 2018; Yu et al., 2018) fail to work due to memory limitations and simply resort to document truncation.


Video question answering (Tapaswi et al., 2016; Lei et al., 2018), where video frames are given as the context for QA, is another important topic where scalability is an issue. Several models (Kim et al., 2017, 2018; Na et al., 2017; Wang et al., 2018a) propose to solve video QA using attention and memory-augmented networks to perform composite reasoning over both videos and texts; however, they only focus on short videos. Most existing work on QA focuses on small-scale problems due to memory limitations. Our work, on the other hand, considers a challenging scenario where the context is orders of magnitude larger than the memory.

Context selection   A few recent models propose to select a minimal context from the given document when answering questions, rather than using the full context, for scalability. Min et al. (2018) proposed a context selector that generates attentions on the context vectors, in order to achieve scalability and robustness against adversarial inputs. Choi et al. (2017) and Indurthi et al. (2018) propose a similar method, but they use REINFORCE (Williams, 1992) instead of linear classifiers. Chen et al. (2017) select the most relevant documents out of the Wikipedia database with respect to the query using TF-IDF matching, and Wang et al. (2018b) propose to tackle the document ranking problem with RL agents. While these context/document selection methods share our motivation of achieving scalability and selecting out the most informative pieces of information to solve the QA task, our problem setting is completely different from theirs, since we consider the challenging problem of learning from streaming data without knowing when the question will be given, where the size of the context is much larger than the memory and the question is unseen when training the selection module.

Memory-augmented neural networks   Our Episodic Memory Reader is essentially a memory-augmented neural network (MANN) (Graves et al., 2014; Sukhbaatar et al., 2015; Xiong et al., 2016; Kumar et al., 2016) with an RL-based scheduler. While most existing work on MANNs assumes that the memory is sufficiently large to hold all the data instances, a few have tried to consider memory scheduling for better scalability. Gulcehre et al. (2016) propose to train an addressing agent using reinforcement learning in order to dynamically decide which memory to overwrite based on the query. This query-specific importance is similar to our motivation, but in our case the query is given after reading in all the context and is thus unusable for scheduling, and we perform hard replacement instead of overwriting. The Differentiable Neural Computer (DNC) (Graves et al., 2016) extends the Neural Turing Machine (NTM) to address this issue by introducing a temporal link matrix, replacing the least used memory when the memory is full. However, this method is rule-based and cannot maximize the performance on a given task.

3 Learning What to Remember from Streaming Data

We now describe how to solve question answering tasks with streaming data as context. In a more general sense, this is a problem of learning from a long data stream that contains a large portion of unimportant, noisy data (e.g. routine greetings in dialogs, uninformative video frames) with limited memory. The data stream is episodic, where an unlimited amount of data instances may arrive at one time interval and become inaccessible afterward. Additionally, we consider that it is not possible for the model to know in advance what tasks (a question, in the case of the QA problem) will be given at which timestep in the future (see Figure 1 for more details). To solve this problem, the model needs to identify important data instances from the data stream and store them into the external memory. Formally, given a data stream (e.g. sentences or images) $X = \{x^{(1)}, \dots, x^{(T)}\}$ with $x^{(t)} \in \mathbb{R}^d$ as input, the model should learn a function $F: X \mapsto M$ that maps it to the set of memory entries $M = \{m_1, \dots, m_N\}$, where $m_i \in \mathbb{R}^k$ and $T \gg N$. How can we then learn such a function that maximizes the performance on unseen future tasks without knowing what problems will be given at what time? We formulate this as a reinforcement learning problem to train a memory scheduling agent.

3.1 Model Overview

We now describe our model, Episodic Memory Reader (EMR), which solves the previously described problem. Our model has three components: (1) an agent $\mathcal{A}$ based on EMR, (2) an external memory $M = [m_1, \dots, m_N]$, and (3) a solver that solves the given task (e.g. QA) using the external memory. Figure 2 shows the overview of our model.


Figure 2: The overview of our Episodic Memory Reader (EMR). EMR learns the policy and the value network to select a memory entry to replace, in order to maximize the reward, defined as the performance on future QA tasks (F1-score, accuracy).

Basically, given a sequence of data instances $X = \{x^{(1)}, \dots, x^{(T)}\}$ that streams through the system, the agent learns to retain the most useful subset in the memory by interacting with the external memory that encodes the relative importance of each memory entry. When $t \le N$, the agent simply maps $x^{(t)}$ to $m^{(t)}$. However, when $t > N$ and the memory becomes full, it selects an existing memory entry to delete. Specifically, it outputs an action based on $\pi(i \mid S^{(t)})$, which denotes the selection of the $i$-th memory entry for deletion. Here, the state is the concatenation of the memory and the data instance: $S^{(t)} = [M^{(t)}, e^{(t)}]$, where $e^{(t)}$ is the encoded input at timestep $t$. To maximize the performance on the future QA task, the agent should replace the least important memory entry. When the agent encounters the task $\mathcal{T}$ (a QA problem) at timestep $T+1$, it leverages both the memory at timestep $T$, $M^{(T)}$, and the task information (e.g. the question) to solve the task. For each action, the environment (the QA module) provides the reward $R^{(t)}$, which is given either as the F1-score or the accuracy.
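The following is a minimal sketch of this interaction loop in Python, assuming hypothetical `encode`, `policy`, and `solve_qa` callables that stand in for the data encoder, the scheduling agent, and the QA solver; none of these names come from the released code:

```python
def read_stream(stream, question, encode, policy, solve_qa, memory_size=20):
    """Sequentially read a data stream into a bounded external memory.

    encode(x)           -> vector encoding e of one data instance
    policy(memory, e)   -> index of the memory entry to delete (learned scheduler)
    solve_qa(memory, q) -> answer produced by the QA module from the retained memory
    """
    memory = []
    for x in stream:
        e = encode(x)
        if len(memory) < memory_size:
            memory.append(e)           # memory not yet full: simply append
        else:
            i = policy(memory, e)      # choose the least useful entry ...
            del memory[i]              # ... and hard-replace it (no overwriting)
            memory.append(e)
    # the question is only revealed after the stream has been read
    return solve_qa(memory, question)
```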

3.2 Episodic Memory Reader

Episodic Memory Reader (EMR) is composed of three components: (1) a Data Encoder that encodes each data instance into a memory vector representation, (2) a Memory Encoder that generates the replacement probability for each memory entry, and (3) a Value Network that estimates the value of the memory as a whole. In some cases, we may use policy gradient methods, in which case the value network becomes unnecessary.

3.2.1 Data Encoder

The data instance $x^{(t)}$ that arrives at time $t$ can be in any data format, and thus we transform it into a $k$-dimensional memory vector representation $e^{(t)} \in \mathbb{R}^k$ using an encoder:

$$e^{(t)} = \psi(x^{(t)})$$

where $\psi(\cdot)$ is the data encoder, which could be any neural architecture suited to the type of the input data. For example, we could use an RNN if $x^{(t)}$ is composed of sequential data (e.g. a sentence composed of words $x^{(t)} = \{w_1, w_2, w_3, \dots, w_s\}$) or a CNN if $x^{(t)}$ is an image. After deleting a memory entry $m^{(t)}_i$, we append $e^{(t)}$ at the end of the memory, where it becomes $m^{(t+1)}_N$.
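As an illustration, a minimal data encoder for textual instances could be a GRU over word embeddings; the sketch below is one possible $\psi(\cdot)$ under our own assumptions (class name, dimensions, and the single-layer GRU are not taken from the paper):

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Encodes one data instance (a sentence given as word ids) into a k-dim memory vector."""
    def __init__(self, vocab_size, embed_dim=300, k=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, k, batch_first=True)

    def forward(self, word_ids):          # word_ids: (batch, seq_len)
        emb = self.embed(word_ids)        # (batch, seq_len, embed_dim)
        _, h = self.gru(emb)              # h: (1, batch, k), last hidden state
        return h.squeeze(0)               # e^(t): (batch, k)
```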

3.2.2 Memory Encoder

Using the memory vector representations $M^{(t)} = [m^{(t)}_1, \dots, m^{(t)}_N]$ and $e^{(t)}$ generated by the data encoder, the memory encoder outputs a probability for each memory entry by considering its relative importance, and then replaces the most unimportant entry. This component corresponds to the policy network of the actor-critic method. We now describe our EMR models.

EMR-Independent   Since there is no existing work for our novel problem setting, as a baseline we first consider a memory encoder that only captures the importance of each memory entry independently, relative to the new data instance, which we refer to as EMR-Independent. This scheduling mechanism is adopted from the Dynamic Least Recently Used (LRU) addressing introduced in Gulcehre et al. (2016), but differs from LRU in that it replaces the memory entry rather than overwriting it, and is trained without the query in order to maximize the performance on unseen future queries. EMR-Independent outputs the importance of each memory entry by comparing it with an embedding of the new data instance $x^{(t)}$ as $a^{(t)}_i = \mathrm{softmax}(m^{(t)}_i \psi(x^{(t)})^\top)$. To compute the overall importance of each memory entry, as done in Gulcehre et al. (2016), we compute the exponential moving average $v^{(t)}_i = 0.1\, v^{(t-1)}_i + 0.9\, a^{(t)}_i$. Then, we compute the replacement probability of each memory entry with the LRU factor $\gamma^{(t)}$ as follows:

$$\gamma^{(t)}_i = \sigma(W_\gamma^\top m^{(t)}_i + b_\gamma)$$
$$g^{(t)}_i = a^{(t)}_i - \gamma^{(t)}_i v^{(t-1)}_i$$
$$\pi(i \mid [M^{(t)}, e^{(t)}]; \theta) = \mathrm{softmax}(g^{(t)}_i)$$

where $i \in [1, N]$ is the memory index, $W_\gamma \in \mathbb{R}^{1 \times d}$ and $b_\gamma \in \mathbb{R}$ are the weight matrix and the bias term, $\sigma(\cdot)$ and $\mathrm{softmax}(\cdot)$ are the sigmoid and softmax functions respectively, and $\pi$ is the policy of the memory scheduling agent.
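A minimal sketch of one EMR-Independent scheduling step, following the equations above (the tensor shapes and the function signature are our assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def emr_independent_policy(M, e, v_prev, W_gamma, b_gamma):
    """One scheduling step of EMR-Independent.

    M:       (N, k) memory entries m_i
    e:       (k,)   encoding psi(x^(t)) of the new data instance
    v_prev:  (N,)   exponential moving average of past importances v^(t-1)
    W_gamma: (k,)   weight vector of the LRU factor; b_gamma: scalar bias
    """
    a = F.softmax(M @ e, dim=0)                   # importance w.r.t. the new instance
    v = 0.1 * v_prev + 0.9 * a                    # exponential moving average
    gamma = torch.sigmoid(M @ W_gamma + b_gamma)  # per-entry LRU factor
    g = a - gamma * v_prev
    pi = F.softmax(g, dim=0)                      # replacement probability over entries
    return pi, v
```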

EMR-biGRU   A major drawback of EMR-Independent is that the evaluation of each memory entry depends only on the input $x^{(t)}$. In other words, the importance is computed between each memory entry and the new data instance regardless of the other entries in the memory. However, this scheme cannot model the relative importance of each memory entry with respect to the other memory entries, which matters more when deciding on the least important memory. One way to consider the relative relationships between memory entries is to encode them using a bidirectional GRU (biGRU) as follows:

$$\overrightarrow{h}^{(t)}_i = \mathrm{GRU}_{\theta_{fw}}(m^{(t)}_i, \overrightarrow{h}^{(t)}_{i-1})$$
$$\overleftarrow{h}^{(t)}_i = \mathrm{GRU}_{\theta_{bw}}(m^{(t)}_i, \overleftarrow{h}^{(t)}_{i+1})$$
$$h^{(t)}_i = [\overrightarrow{h}^{(t)}_i, \overleftarrow{h}^{(t)}_i]$$
$$\pi(i \mid [M^{(t)}, e^{(t)}]; \theta) = \mathrm{softmax}(\mathrm{MLP}(h^{(t)}_i))$$

where $i \in [1, N+1]$ is the memory index, including the index of the encoded input $m^{(t)}_{N+1} = e^{(t)}$, $\mathrm{GRU}_\theta$ is a Gated Recurrent Unit parameterized by $\theta$, $[\overrightarrow{h}^{(t)}_i, \overleftarrow{h}^{(t)}_i]$ is a concatenation of features, $\pi$ is the policy of the agent, and MLP is a multi-layer perceptron with three layers and ReLU activation functions. Thus, EMR-biGRU learns the general importance of each memory entry in relation to its neighbors rather than independently computing the importance of each entry with respect to the query, which is useful when selecting the most important entries among highly similar data instances (e.g. video frames). However, the model may not effectively capture long-range relationships between memory entries in far-away slots due to the inherent limitation of RNNs.

Figure 3: Detailed architecture of the memory encoder in EMR-Independent and EMR-biGRU/Transformer.
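A sketch of the EMR-biGRU memory encoder / policy head in PyTorch (hyperparameters and module layout are assumptions; the biGRU processes the memory plus the new encoding as one sequence, as described above):

```python
import torch
import torch.nn as nn

class BiGRUMemoryEncoder(nn.Module):
    """Scores every memory entry (plus the new encoding) with a biGRU and an MLP."""
    def __init__(self, k=128, hidden=128):
        super().__init__()
        self.bigru = nn.GRU(k, hidden, batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(            # 3-layer MLP with ReLU activations
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, M, e):
        # M: (N, k) memory entries, e: (k,) new encoding treated as the (N+1)-th entry
        entries = torch.cat([M, e.unsqueeze(0)], dim=0).unsqueeze(0)  # (1, N+1, k)
        h, _ = self.bigru(entries)            # (1, N+1, 2*hidden) contextual encodings
        logits = self.mlp(h).squeeze(-1)      # (1, N+1) one score per entry
        return torch.softmax(logits, dim=-1)  # replacement policy pi over entries
```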

EMR-Transformer   To overcome this suboptimality of RNN-based modeling, we further adopt the self-attention mechanism from Vaswani et al. (2017). With the query $Q^{(t)}$, key $K^{(t)}$, and value $V^{(t)}$, we generate the relative importance of the entries with a linear layer that takes $m^{(t)}$ with the position encoding proposed in Vaswani et al. (2017) as input. With multi-headed attention, each component is projected to a multi-dimensional space; the dimensions of the components are $Q^{(t)} \in \mathbb{R}^{H \times N \times k/H}$, $K^{(t)} \in \mathbb{R}^{H \times N \times k/H}$, and $V^{(t)} \in \mathbb{R}^{H \times N \times k/H}$, where $N$ is the size of the memory and $H$ is the number of attention heads. Using these, we can formulate the retrieved output using self-attention and the memory encoding as follows:

$$A^{(t)} = \mathrm{softmax}\!\left(\frac{Q^{(t)} {K^{(t)}}^\top}{\sqrt{k/H}}\right)$$
$$o^{(t)} = A^{(t)} V^{(t)}$$
$$h^{(t)} = W_o^\top [o^{(t)}_1, o^{(t)}_2, \dots, o^{(t)}_H]$$
$$\pi(i \mid [M^{(t)}, e^{(t)}]; \theta) = \mathrm{softmax}(\mathrm{MLP}(h^{(t)}_i))$$

where $i$ is the memory index, $o^{(t)}_i \in \mathbb{R}^{N \times k/H}$, $[o^{(t)}_1, o^{(t)}_2, \dots, o^{(t)}_H] \in \mathbb{R}^{N \times k}$ is a concatenation of the $o^{(t)}_i$, $\pi$ is the policy of the agent, and MLP is the same 3-layer multi-layer perceptron used in EMR-biGRU. The memory encoding $h^{(t)}$ is computed by applying the linear map $W_o \in \mathbb{R}^{d \times d}$ to the concatenated attention outputs. Figure 3 illustrates the architecture of the memory encoder for EMR-Independent and EMR-biGRU/Transformer.
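A rough PyTorch sketch of the EMR-Transformer memory encoder using multi-head self-attention (for brevity, the position encoding is omitted and `nn.MultiheadAttention` is used in place of the paper's explicit Q/K/V projections; these choices are assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn

class SelfAttentionMemoryEncoder(nn.Module):
    """Scores memory entries with multi-head self-attention followed by an MLP head."""
    def __init__(self, k=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(k, heads, batch_first=True)
        self.w_o = nn.Linear(k, k)            # output map over the concatenated heads
        self.mlp = nn.Sequential(             # same 3-layer MLP head as EMR-biGRU
            nn.Linear(k, k), nn.ReLU(),
            nn.Linear(k, k), nn.ReLU(),
            nn.Linear(k, 1),
        )

    def forward(self, M, e):
        entries = torch.cat([M, e.unsqueeze(0)], dim=0).unsqueeze(0)  # (1, N+1, k)
        o, _ = self.attn(entries, entries, entries)                   # self-attention output
        h = self.w_o(o)                                               # memory encoding h^(t)
        logits = self.mlp(h).squeeze(-1)                              # (1, N+1)
        return torch.softmax(logits, dim=-1), h                       # policy pi and h^(t)
```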

3.2.3 Value Network

For solving certain QA problems, we need to consider the future importance of each memory entry.


Figure 4: Example context and QA pair from TriviaQA.

Especially in textual QA tasks (e.g. TriviaQA), storing the evidence sentences that precede the answer span may be useful, as they may provide useful context. However, using only a discrete policy gradient method, we cannot preserve such context instances. To overcome this issue, we use an actor-critic RL method (A3C) (Mnih et al., 2016) to estimate the sum of future rewards at each state using the value network. The difference between the policy and the value is that the value may be estimated differently at each time step and needs to consider the memory as a whole. To obtain a holistic representation of our memory, we use Deep Sets (Zaheer et al., 2017). Following Zaheer et al. (2017), we sum up all $h^{(t)}_i$ and feed them into an MLP $\rho$, which consists of two linear layers and a ReLU activation function, to obtain a set representation. Then, we further process the set representation $\rho(\sum_{i=1}^{N} h^{(t)}_i)$ with a GRU whose hidden state is carried over from the previous time step. Finally, we feed the output of the GRU to a multi-layer perceptron to estimate the value $V^{(t)}$ for the current timestep.
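A minimal sketch of this value branch (Deep Sets pooling, a GRU over timesteps, and an MLP head); layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class DeepSetValueNetwork(nn.Module):
    """Estimates V^(t) from the set of per-entry encodings h_i^(t)."""
    def __init__(self, k=128, hidden=128):
        super().__init__()
        self.rho = nn.Sequential(             # Deep Sets: rho(sum_i h_i)
            nn.Linear(k, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.gru_cell = nn.GRUCell(hidden, hidden)
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, h_entries, prev_state):
        # h_entries: (N, k) per-entry encodings; prev_state: (1, hidden) GRU state
        pooled = self.rho(h_entries.sum(dim=0, keepdim=True))  # permutation-invariant summary
        state = self.gru_cell(pooled, prev_state)              # carry context across timesteps
        value = self.head(state)                               # scalar value estimate V^(t)
        return value, state
```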

3.3 Training and test

Training   Our model learns the memory scheduling policy jointly with the model that solves the task. For training EMR, we choose A3C (Mnih et al., 2016) or REINFORCE (Williams, 1992). At training time, since the tasks are given, we provide the question to the agent at every timestep. At each step, the agent samples its action stochastically from a multinomial distribution based on $\pi(i \mid [M^{(t)}, e^{(t)}]; \theta)$ to explore various states, and takes the action. Then, the QA model provides the agent the reward $R^{(t)}$. We use the asynchronous multiprocessing method described in Mnih et al. (2016) to train several models at once.

Test   At test time, the agent deletes the memory entry by following the learned policy $\pi$: $\arg\max_i \pi(i \mid [M^{(t)}, e^{(t)}]; \theta)$. In contrast to training, the model observes the question only at the end of the data stream. When encountering the question, the model solves the task using the data instances kept in the external memory.
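For reference, a minimal REINFORCE-style update for the scheduling policy could look as follows (a sketch under the assumption of a single episode of stored log-probabilities and per-step rewards; A3C additionally subtracts the value network's estimate as a baseline):

```python
import torch

def reinforce_update(log_probs, rewards, optimizer, gamma=0.99):
    """One policy-gradient update from a finished episode.

    log_probs: list of log pi(a_t | s_t) tensors for the actions that were taken
    rewards:   list of per-step rewards R^(t) provided by the QA module
    """
    returns, running = [], 0.0
    for r in reversed(rewards):            # discounted return-to-go G_t
        running = r + gamma * running
        returns.insert(0, running)
    returns = torch.tensor(returns)
    loss = -(torch.stack(log_probs) * returns).sum()   # maximize expected return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```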

4 Experiment

We evaluate our EMR-biGRU and EMR-Transformer against several baselines:

1) FIFO (First-In First-Out). A rule-based memory scheduling policy that replaces the oldest memory entry.

2) Uniform. A policy that replaces a memory entry selected uniformly at random at each time step.

3) LIFO (Last-In First-Out). A policy that replaces the newest data instance. That is, it first fills in the memory and then discards all following data instances.

4) EMR-Independent. A baseline EMR which learns the importance of each memory entry only relative to the new data instance.

5) EMR-biGRU. An EMR implemented using a biGRU, which considers the relative importance of each memory entry to its neighbors when learning the memory replacement policy.

6) EMR-Transformer. An EMR that utilizes the Transformer to model the global relative importance between memory entries.

The code for the baseline models and our models is available at https://github.com/h19920918/emr. In the next subsections, we present experimental results on the bAbI, TriviaQA, and TVQA datasets. For more experimental results, please see the supplementary file.

4.1 bAbI

Dataset   bAbI (Weston et al., 2015) is a synthetic dataset for episodic question answering consisting of 20 tasks with a small vocabulary, which can be solved by remembering a person or an object. Among the 20 tasks, we select Task 2, which requires remembering two supporting facts, to evaluate our model. Additionally, we generate a noisy version of this task using the open-source template provided by Weston et al. (2015). Each episode of both tasks contains five questions, where all questions share the same context sentences. For the Noisy task, we inject noise sentences that have nothing to do with the given task, to validate the effectiveness of our model. In this dataset, 60% of the episodes have no noise sentences, 10% have approximately 30% noise sentences, 10% have approximately 45% noise sentences, and 10% have approximately 60% noise sentences. The length of each episode is fixed to 45 (5 questions + 40 facts), and a question appears after the arrival of 8 factual sentences.

Figure 5: QA performance of the baseline models and EMR variants on the (a) Original and (b) Noisy tasks, as a function of the number of memory entries. The reported results are averages over 3 runs.

Experiment Details   We adopt MemN2N (Sukhbaatar et al., 2015) for this experiment, which consists of an embedding layer and a multi-hop mechanism that extracts high-level inference. We use MemN2N with position encoding representation, 3 hops, and adjacent weight tying. We set the dimension of the memory representations to $k = 20$ and compare our model and the baselines on the Original and Noisy tasks. To generate the memory representation $m_i$, we use the sum of the three hop value memories from the base MemN2N. We experiment with varying memory sizes: 5, 10, and 15. We train our model and the baselines using the ADAM optimizer (Kingma and Ba, 2014) with a learning rate of 0.0005 for 400K steps on both tasks.

Results and Analysis   In Figure 5, we report the experimental results for the Original and Noisy tasks. Both of our models (EMR-biGRU and EMR-Transformer) outperform the baselines, with a higher gain on the Noisy dataset. EMR-Independent, which does not consider the relative importance among the memory entries, performs worse than or similar to the FIFO baseline. The results suggest that our methods are able to retain the supporting facts even with a small number of memory entries. For further analysis of the experiments on the bAbI dataset, please see the supplementary file.

4.2 TriviaQA

Dataset   TriviaQA (Joshi et al., 2017) is a realistic text-based question answering dataset which includes 950K question-answer pairs from 662K documents collected from Wikipedia and the web. This dataset is more challenging than standard QA benchmark datasets such as the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016), as the answers for a question may not be directly obtained by span prediction and the context is very long (Figure 4). Since conventional QA models (Seo et al., 2016; Back et al., 2018; Yu et al., 2018; Devlin et al., 2018) are span prediction models, on TriviaQA they only train on QA pairs whose answers can be found in the given context. In such a setting, TriviaQA becomes highly biased, with the answers mostly spanned in the earlier part of the document (Figure 6). We evaluate our work only on the Wikipedia domain since most previous work reports similar results on both domains. While the TriviaQA dataset consists of both human-verified and machine-generated QA subsets, we use the human-verified subset only, since the machine-generated QA pairs are unreliable. We use the validation set for testing since the test set does not contain labels.

Figure 6: Histogram of the number of answers at each document length for the TriviaQA dataset.

Experiment Details   We employ the pre-trained model from Deep Bidirectional Transformers (BERT) (Devlin et al., 2018), the current state-of-the-art model for the SQuAD challenge, which pretrains stacked Transformers (Vaswani et al., 2017) and is fine-tuned to predict the indices of the exact location of an answer. We embed 20 words into each memory cell using a GRU and set the number of cells to 20, so the memory can hold 400 words at maximum. This is a reasonable restriction since BERT limits the maximum number of word tokens to 512, including both the context and the query.
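To illustrate the memory layout described above, a document can be split into fixed-size word chunks, one chunk per memory cell (a trivial sketch; the helper name is ours):

```python
def chunk_document(words, words_per_cell=20):
    """Split a tokenized document into 20-word chunks, one per memory cell.

    With 20 cells of 20 words each, the memory holds at most 400 words,
    which stays below BERT's 512-token limit once the query is appended.
    """
    return [words[i:i + words_per_cell] for i in range(0, len(words), words_per_cell)]
```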

Results and Analysis   We report the performance of our model on TriviaQA using both ExactMatch and F1-score in Table 1. We see that the EMR models which consider the relative importance between the memory entries (EMR-biGRU and EMR-Transformer) outperform both the rule-based baselines and EMR-Independent.


Figure 7: An example of how our model operates in order to solve the memory retention problem. The Episodic Memory Reader (EMR) sequentially reads the sentences one by one while replacing the least important memories. State 1 and State T represent the memory entries in the initial state and the last state, respectively. Our EMR retains the sentences which contain the word 'France' (in bold red fonts) in order to answer the given question.

Table 1: QA accuracy on the TriviaQA dataset.

Model              ExactMatch   F1
FIFO               24.53        27.22
Uniform            28.30        34.39
LIFO               46.23        50.10
EMR-Independent    38.05        41.15
EMR-biGRU          52.20        57.57
EMR-Transformer    48.43        53.81

One interesting observation is that LIFO performs quite well, unlike the other rule-based scheduling policies, but this is due to the dataset bias (see Figure 6), where most answers appear in the earlier part of the documents. To further see whether the improvement of our models comes from their ability to remember important context, we examine the sentences that remain in the memory after EMR finishes reading all the sentences in Figure 7. We see that EMR remembered the sentences that contain the key words required to answer the future question.

4.3 TVQA

Dataset   TVQA (Lei et al., 2018) is a localized, compositional video question-answering dataset that contains 153K question-answer pairs from 22K clips spanning over 460 hours of video. The questions are multiple-choice questions on the video contents, where the task is to find the single correct answer out of five candidate answers. The questions can be answered by examining the annotated clip segments, which span around 30 frames per clip (see Figure 8). The average number of frames per clip is 229. In addition to the video frames, the dataset also provides subtitles for each video frame. Thus, solving the questions requires compositional reasoning capability over both a large number of images and texts.

Experiment Details   As the QA module, we use the Multi-stream model for Multi-Modal Video QA, which is the attention-based baseline model provided in Lei et al. (2018). For efficient training, we use features extracted from a ResNet-101 pretrained on the ImageNet dataset. For embedding subtitles and question-answer pairs, we use GloVe (Pennington et al., 2014). For training, we restrict the number of memory entries for our episodic reader to 20, where each memory entry contains the encoding of a video frame and the subtitle associated with the frame; the former is encoded using a CNN and the latter using a GRU. We train our model and the baseline models using the ADAM optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.0001. Unlike in the experiments on TriviaQA, we use REINFORCE (Williams, 1992) to train the policy. This is because TVQA is composed of consecutive image frames captured within a short time interval, which tend to contain redundant information, so the value network of the actor-critic model fails to estimate a good value for the given state, since deleting a good frame will not result in a loss of QA accuracy. We therefore compute the reward $R^{(t)}$ as the accuracy difference between time steps $t$ and $t-1$ and train only the policy with non-episodic REINFORCE. With this method, if the QA module fails to solve the question after a certain frame is deleted, that frame is considered important, and unimportant otherwise.
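A sketch of this per-step reward, under the assumption of a hypothetical `qa_accuracy(memory)` probe that returns the QA module's accuracy on the training question from a given memory state:

```python
def tvqa_step_rewards(memory_states, qa_accuracy):
    """Non-episodic rewards for TVQA: the change in QA accuracy after each update.

    memory_states: list of memory contents M^(0), M^(1), ..., one per scheduling step
    qa_accuracy:   callable returning the QA accuracy (e.g. 0 or 1) for a memory state
    """
    rewards = []
    prev = qa_accuracy(memory_states[0])
    for M in memory_states[1:]:
        cur = qa_accuracy(M)
        rewards.append(cur - prev)   # R^(t) = acc_t - acc_(t-1); a drop flags an important deletion
        prev = cur
    return rewards
```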

Results and Analysis   We report the accuracy on TVQA as a function of memory size in Figure 9. We observe that the EMR variants significantly outperform all baselines, including EMR-Independent. We also observe that the models perform well even when the size of the memory is increased to as large as 60, which was never encountered during the training stage, where the number of memory entries was fixed at 20.


Figure 8: An example from the TVQA dataset and a visualization of how our model operates. The answer in red is the ground truth and the underlined answer is the prediction of our QA model. At state 1, since the memory is empty, our model encodes every video frame into the memory until the memory becomes full. At state T, after reading in all the video frames, it retains the most informative frames to answer the question. Note that this retention of the important frames is done without knowing the question in advance.

Figure 9: QA accuracy of various memory scheduling policies on the TVQA dataset, reported as a function of the number of memory entries.

When the size of the memory is small, the gap between the different models is larger, with EMR-Transformer obtaining the best accuracy, which may be due to its ability to capture the global relative importance of each memory entry. However, the gap between EMR-Transformer and EMR-biGRU diminishes as the size of the memory increases, since the memory then becomes large enough to contain all the frames necessary to answer the question.

As a qualitative analysis, we further examine which frames and subtitles were preserved in the external memory after the model has read through the entire sequence in Figure 8. To answer the question in this example, the model should consider the relationship between two frames, where the first frame shows Ross showing the paper to the others, and the second frame shows Monica entering the coffee shop. We see that our model kept both frames, although it did not know what the question would be.

5 Conclusion

We proposed a novel problem of question answering from streaming data, where the model needs to answer a question that is given after reading through an unlimited amount of context (e.g. documents, videos) that cannot fit into the system memory. To handle this problem, we proposed the Episodic Memory Reader (EMR), which is essentially a memory-augmented network with an RL-based memory scheduler that learns the relative importance among memory entries and replaces the entries with the lowest importance to maximize the QA performance on future tasks. We validated EMR on three QA datasets against rule-based memory scheduling as well as an RL-based baseline that does not model the relative importance among memory entries, both of which it significantly outperforms. Further qualitative analysis of the memory contents after learning confirms that this good performance comes from its ability to retain important instances for future QA tasks.

Acknowledgements

This work was supported by Clova, NAVER Corp. and the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion).


References

Seohyun Back, Seunghak Yu, Sathish Reddy Indurthi, Jihie Kim, and Jaegul Choo. 2018. MemoReader: Large-scale reading comprehension through neural memory controller. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2131–2140.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1870–1879.

Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Berant. 2017. Coarse-to-fine question answering for long documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 209–220.

Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 845–855.

Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. 2017. Attention-over-attention neural networks for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 593–602.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. CoRR, abs/1410.5401.

Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio Gomez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adria Puigdomenech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. 2016. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476.

Caglar Gulcehre, Sarath Chandar, Kyunghyun Cho, and Yoshua Bengio. 2016. Dynamic neural Turing machine with soft and hard addressing schemes. CoRR, abs/1607.00036.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1693–1701.

Minghao Hu, Yuxing Peng, Zhen Huang, Xipeng Qiu, Furu Wei, and Ming Zhou. 2018. Reinforced mnemonic reader for machine reading comprehension. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pages 4099–4106.

Sathish Reddy Indurthi, Seunghak Yu, Seohyun Back, and Heriberto Cuayahuitl. 2018. Cut to the chase: A context zoom-in network for reading comprehension. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 570–575.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1601–1611.

Kyung-Min Kim, Seong-Ho Choi, Jin-Hwa Kim, and Byoung-Tak Zhang. 2018. Multimodal dual attention memory for video story question answering. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XV, pages 698–713.

Kyung-Min Kim, Min-Oh Heo, Seong-Ho Choi, and Byoung-Tak Zhang. 2017. DeepStory: Video story QA by deep embedded memory networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pages 2016–2022.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1378–1387.

Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L. Berg. 2018. TVQA: Localized, compositional video question answering. CoRR, abs/1809.01696.


Sewon Min, Victor Zhong, Richard Socher, and Caiming Xiong. 2018. Efficient and robust question answering from minimal context over documents. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 1725–1735.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1928–1937.

Seil Na, Sangho Lee, Jisung Kim, and Gunhee Kim. 2017. A read-write memory network for movie story understanding. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 677–685.

Boyuan Pan, Hao Li, Zhou Zhao, Bin Cao, Deng Cai, and Xiaofei He. 2017. MEMEN: Multi-layer embedding with memory networks for machine comprehension. CoRR, abs/1707.09098.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, pages 1532–1543.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2383–2392.

Matthew Richardson, Christopher J. C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, pages 193–203.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2440–2448.

Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2016. MovieQA: Understanding stories in movies through question-answering. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 4631–4640.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6000–6010.

Bo Wang, Youjiang Xu, Yahong Han, and Richang Hong. 2018a. Movie question answering: Remembering the textual cues for layered visual contents. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 7380–7387.

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. 2018b. R3: Reinforced ranker-reader for open-domain question answering. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5981–5988.

Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. CoRR, abs/1502.05698.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256.

Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2397–2406.

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. CoRR, abs/1804.09541.

Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R. Salakhutdinov, and Alexander J. Smola. 2017. Deep sets. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 3394–3404.


Figure 10: Example of the (a) Original task and (b) Noisy task. Sentences in green are noise sentences and ones in blue are supporting facts of each question.

A bAbI dataset

We provide examples of the Original and Noisy datasets, as well as visualizations of the memorized examples, to show what our EMR models have remembered for the bAbI (Weston et al., 2015) dataset.

Dataset   We visualize an example of the Original and Noisy tasks in Figure 10.

Results and Analysis   As shown in Figure 11, we further report, for the baseline models and our EMR variants, how many supporting facts they retain in the memory (denoted as Solvable), by considering the QA performance with a perfect QA model. We observe that both EMR variants, EMR-biGRU and EMR-Transformer, significantly outperform the rule-based memory scheduling agents as well as EMR-Independent.

B TriviaQA

We provide more experiment details and additional examples for analyzing what our EMR models have remembered for TriviaQA (Joshi et al., 2017).

Dataset   The objective of our model is to learn general importance in situations where the question is not known while reading the streaming data. In terms of scalability, our model is able to sequentially access a large amount of streaming data by replacing the most uninformative memory entry in the external memory. Compared with common question-answering datasets (Rajpurkar et al., 2016; Weston et al., 2015), TriviaQA is an appropriate dataset for demonstrating the efficiency of our model, since its average number of words is approximately 3K, which cannot be accessed using conventional models that predict answer indices using a pointer network (Seo et al., 2016; Back et al., 2018; Yu et al., 2018).

To preprocess TriviaQA for our problem setting, we truncate all documents to 1,200 words for the training set, in order to reduce the cost of the training process. Unlike the training set, the test set uses all words in the documents. Although TriviaQA does not provide the answer indices in a document, we extract the documents in which the answer can be found as a span, so that we can adopt Deep Bidirectional Transformers (BERT) (Devlin et al., 2018), a state-of-the-art reading comprehension model that uses a pointer network. Additionally, we lowercase all letters and remove all special characters.

Experiment Details   As described in the main paper, we employed the pre-trained BERT to solve TriviaQA. A more specific implementation is described here. We encode the current input $x^{(t)}$ into $m^{(t)}_i$ using the BERT encoding layer and a bidirectional GRU, whose output sizes are 768 and 128, respectively. The reason for this is to convert the words into a sentence representation; by doing this, the model can access a chunk of words at a time and the computation cost is reduced. We use $m^{(t)}_i$ to output the relative importance between the memory entries $\{m^{(t)}_1, \dots, m^{(t)}_i\}$, where $i$ indicates an address in the memory, as described in the main paper.
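A rough sketch of this chunk encoder using the Hugging Face transformers package (the package choice and the 64+64 split of the biGRU hidden size are assumptions for illustration, not the authors' code):

```python
import torch
import torch.nn as nn
from transformers import BertModel

class ChunkEncoder(nn.Module):
    """Encodes a 20-word chunk into one 128-d memory vector m_i via BERT token
    features (768-d) followed by a bidirectional GRU."""
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.gru = nn.GRU(768, 64, batch_first=True, bidirectional=True)  # 2 * 64 = 128

    def forward(self, input_ids, attention_mask):
        # input_ids, attention_mask: (batch, chunk_len) produced by a BERT tokenizer
        token_feats = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        _, h = self.gru(token_feats)              # h: (2, batch, 64) final states per direction
        return torch.cat([h[0], h[1]], dim=-1)    # (batch, 128) memory vector m_i
```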


Figure 11: The results for our models (EMR-biGRU and EMR-Transformer) and the baselines on the (a) Original and (b) Noisy tasks, reported as the Solvable accuracy against the number of memory entries. The reported results are averages over 3 runs. Solvable denotes the accuracy with which the memory contains the supporting facts needed to solve a question at the moment the model encounters it.

Figure 12: Examples of the Original and Noisy tasks for EMR-biGRU and EMR-Transformer: (a) EMR-biGRU (Original), (b) EMR-biGRU (Noisy), (c) EMR-Transformer (Original), (d) EMR-Transformer (Noisy). Sentences in blue are supporting facts of each question. The Index in the figure represents the order of sentences in the context.

In addition to using the pre-trained BERT, we fine-tune it with truncated documents (up to 400 words) in the same way as LIFO (Last-In First-Out), since we want our model to focus on learning what to remember in the external memory and to generalize well even after seeing limited contents of the documents. We train our model and the baseline models using the ADAM optimizer (Kingma and Ba, 2014), with an initial learning rate of 1e-5 and a dropout probability of 0.1, for 1M steps. For A3C (Mnih et al., 2016), we set the discount factor to 0.1 and the entropy regularization to 0.01 for all experiments.

Results and Analysis   As shown in Figure 13, we report the score of each method under a perfect QA model, to see how many of the important facts are remembered by each method. We see that on the TriviaQA dataset, LIFO retains a similar amount of the important words as EMR-biGRU and EMR-Transformer. This is mostly due to the dataset bias, where most of the answers are found in the earlier parts of the documents (Figure 6). However, our models outperform LIFO on the QA task, since they observe more sentences during training, which helps the QA model perform better, whereas LIFO observes fewer training examples because its Last-In First-Out policy discards all words that arrive after the memory is filled.

C TVQA

We provide more experimental details and examples to show what our EMR models have remembered for the TVQA dataset.


Figure 13: Oracle score: the accuracy of each memory scheduling policy under a perfect QA model on TriviaQA, as a function of the number of memory entries.

Each frame illustrated in the figures is a frame present in the external memory at the last time step. The stars with different colors denote the supporting frames for the different questions.

Experiment Details   As described in the main paper, we use the Multi-stream model for Multi-Modal Video QA suggested in Lei et al. (2018). We also pretrain the QA model to carefully assess the performance of our EMR model. We use only the annotated frames when training the QA model. Since we use only the subtitle and the frame image feature as input, we pretrain the QA model until it reaches the reported performance of the S+V model with the annotated time stamps in Lei et al. (2018).

Below, we describe the implementation of our EMR model in more detail. Since we have two kinds of input $x^{(t)}$ in TVQA, we need to blend them into one memory feature. For the subtitle input, we use GloVe (Pennington et al., 2014) to embed words into 300-dimensional vectors, and then use a bidirectional GRU to produce a 128-dimensional sentence vector from the word vectors. For the video frame input, we use 2048-dimensional feature vectors extracted from a ResNet-101 pretrained on the ImageNet dataset, and compress them to 128-dimensional vectors using a linear layer. We then add the two 128-dimensional feature vectors from the subtitle and the video frame to form the 128-dimensional memory feature vector $m^{(t)}_i$. Other details, including the optimizer and the reinforcement learning settings, are described in the main paper.
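A sketch of this fusion step (the bidirectional GRU's 64+64 hidden split and the module layout are assumptions; only the stated dimensions come from the text above):

```python
import torch
import torch.nn as nn

class TVQAMemoryFeature(nn.Module):
    """Builds one 128-d memory feature from a subtitle and a video frame."""
    def __init__(self):
        super().__init__()
        self.subtitle_gru = nn.GRU(300, 64, batch_first=True, bidirectional=True)  # 2 * 64 = 128
        self.frame_proj = nn.Linear(2048, 128)    # compress ResNet-101 features

    def forward(self, subtitle_glove, frame_feat):
        # subtitle_glove: (1, num_words, 300) GloVe vectors; frame_feat: (1, 2048)
        _, h = self.subtitle_gru(subtitle_glove)           # h: (2, 1, 64) final states
        subtitle_vec = torch.cat([h[0], h[1]], dim=-1)     # (1, 128) sentence vector
        return subtitle_vec + self.frame_proj(frame_feat)  # (1, 128) memory feature m_i
```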


Figure 14: An example visualization of the memory. The answer word 'belgium' (red / thick) arrives at the first timestep, and our model retains the relevant sentences at state T, i.e., after reading all the contexts. The star shape (*) indicates which memory entry our model selects for deletion.


Figure 15: An example visualization of the memory. The answer word 'ely' (red / thick) arrives at the first timestep, and our model retains it after reading in all the context sentences. The star shape (*) indicates which memory entry our model selects for deletion.

Figure 16: An example visualization of the memory. The answer word 'alaska' (red / thick) arrives at timestep 10, and our model retains it after reading in all the context sentences. The star shape (*) indicates which memory entry our model selects for deletion.


Figure 17: An example clip from the drama 'House'. Each frame marked with a star corresponds to the question with the star of the same color.

Figure 18: An example clip from the drama 'Friends'. Each frame marked with a star corresponds to the question with the star of the same color.

Figure 19: An example clip from the drama 'Castle'. Each frame marked with a star corresponds to the question with the star of the same color.


Figure 20: An example clip from the drama 'How I Met Your Mother'. Each frame marked with a star corresponds to the question with the star of the same color.