
Playing Text-Adventure Games with Graph-Based Deep Reinforcement Learning

Prithviraj Ammanabrolu
School of Interactive Computing
Georgia Institute of Technology

Atlanta, GA
[email protected]

Mark O. Riedl
School of Interactive Computing
Georgia Institute of Technology

Atlanta, GA
[email protected]

Abstract

Text-based adventure games provide a platform on which to explore reinforcement learning in the context of a combinatorial action space, such as natural language. We present a deep reinforcement learning architecture that represents the game state as a knowledge graph which is learned during exploration. This graph is used to prune the action space, enabling more efficient exploration. The question of which action to take can be reduced to a question-answering task, a form of transfer learning that pre-trains certain parts of our architecture. In experiments using the TextWorld framework, we show that our proposed technique can learn a control policy faster than baseline alternatives. We have also open-sourced our code at https://github.com/rajammanabrolu/KG-DQN.

1 Introduction

Natural language communication can be used to effect change in the real world. Text adventure games, in which players must make sense of the world through text descriptions and declare actions through natural language, can provide a stepping stone toward more real-world environments where agents must communicate to understand the state of the world and indirectly effect change in the world. Text adventure games are also useful for developing and testing reinforcement learning algorithms that must deal with the partial observability of the world (Narasimhan et al., 2015; He et al., 2016).

In text adventure games, the agent receives an incomplete textual description of the current state of the world. From this information, and previous interactions with the world, a player must determine the next best action to take to achieve some quest or goal. The player must then compose a textual description of the action they intend to take and receive textual feedback on the effects of that action. Formally, a text-based game is a partially observable Markov decision process (POMDP), represented as a 7-tuple 〈S, T, A, Ω, O, R, γ〉 denoting the set of environment states, conditional transition probabilities between states, words used to compose text commands, observations, observation conditional probabilities, the reward function, and the discount factor, respectively (Côté et al., 2018).
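To keep the 7-tuple concrete, the sketch below (Python, purely illustrative) collects the components into a single container; the field names are our own and are not part of the TextWorld API or the method itself.

```python
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class TextGamePOMDP:
    """Illustrative container for the 7-tuple above; field names are ours,
    not part of any TextWorld or paper API."""
    states: Set[str]            # S: hidden environment states
    transition: Callable        # T: conditional transition probabilities between states
    command_words: Set[str]     # A: words used to compose text commands
    observations: Set[str]      # Omega: textual observations
    observation_prob: Callable  # O: observation conditional probabilities
    reward: Callable            # R: reward function
    gamma: float                # discount factor
```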

In text-based games, the agent never has access to the true underlying world state and has to reason about how to act in the world based only on the textual observations. Additionally, the agent's actions must be expressed through natural language commands, ensuring that the action space is combinatorially large. Thus, text-based games pose a different set of challenges than traditional video games. Text-based games require a greater understanding of previous context to be able to explore the state-action space more effectively. Such games have historically proven difficult for AI agents to play, and the more complex variants such as Zork still remain firmly out of the reach of existing approaches.

We introduce three contributions to text-based game playing to deal with the combinatorially large state and action spaces. First, we show that a state representation in the form of a knowledge graph gives us the ability to effectively prune an action space. A knowledge graph captures the relationships between entities as a directed graph. The knowledge graph provides a persistent memory of the world over time and enables the agent to have a prior notion of what actions it should not take at a particular stage of the game.

Our second contribution is a deep reinforcement learning architecture, Knowledge Graph DQN (KG-DQN), that effectively uses this state representation to estimate the Q-value for a state-action pair. This architecture leverages recent advances in graph embedding and attention techniques (Guan et al., 2018; Velickovic et al., 2018) to learn which portions of the graph to pay attention to given an input state description, in addition to having a mechanism that allows for natural language action inputs. Finally, we take initial steps toward framing the POMDP as a question-answering (QA) problem wherein a knowledge graph can be used not only to prune actions but also to answer the question of what action is most appropriate. Previous work has shown that many NLP tasks can be framed as instances of question-answering and that we can transfer knowledge between these tasks (McCann et al., 2017). We show how pre-training certain parts of our KG-DQN network using existing QA methods improves performance and allows knowledge to be transferred from different games.

We provide results on ablative experiments comparing our knowledge-graph-based approach to strong baselines. Results show that incorporating a knowledge graph into a reinforcement learning agent lets it converge to the highest reward more than 40% faster than the best baseline. With pre-training using a question-answering paradigm, we achieve this fast convergence rate while also achieving high-quality quest solutions as measured by the number of steps required to complete the quests.

2 Related Work

A growing body of research has explored the challenges associated with text-based games (Bordes et al., 2010; Narasimhan et al., 2015; He et al., 2016; Fulda et al., 2017; Haroush et al., 2018; Côté et al., 2018; Tao et al., 2018). Narasimhan et al. (2015) attempt to solve parser-based text games by encoding the observations using an LSTM. This encoding vector is then used by an action scoring network that determines the scores for the action verb and each of the corresponding argument objects. The two scores are then averaged to determine the Q-value for the state-action pair. He et al. (2016) present the Deep Reinforcement Relevance Network (DRRN), which uses two separate deep neural networks to encode the state and actions. The Q-value for a state-action pair is then computed by a pairwise interaction function between the two encoded representations. Neither of these methods is conditioned on previous observations, so both are at a disadvantage when dealing with complex partially observable games. Additionally, neither approach prunes the action space, so both end up wasting trials exploring state-action pairs that are likely to have low Q-values, leading to slower convergence times for combinatorially large action spaces.

Haroush et al. (2018) introduce the Action Eliminating Network (AEN), which attempts to restrict the actions in each state to the top-k most likely ones, using the emulator's feedback. The network learns which actions should not be taken given a particular state. Their work shows that reducing the size of the action space allows for more effective exploration, leading to better performance. Their network is also not conditioned on previous observations.

Knowledge graphs have been demonstrated to improve natural language understanding in other domains outside of text adventure games. For example, Guan et al. (2018) use commonsense knowledge graphs such as ConceptNet (Speer and Havasi, 2012) to significantly improve the ability of neural networks to predict the end of a story. They represent the graph in terms of a knowledge context vector using features from ConceptNet and graph attention (Velickovic et al., 2018). The state representation that we have chosen, as well as our method of action pruning, builds on the strengths of existing approaches while simultaneously avoiding the shortcomings of ineffective exploration and lack of long-term context.

3 Knowledge Graph DQN

In this section we introduce our knowledge graph representation, action pruning, and deep Q-network architecture.

3.1 Knowledge Graph Representation

In our approach, our agent learns a knowledge graph, stored as a set of RDF triples, i.e. 3-tuples of 〈subject, relation, object〉. These triples are extracted from the observations using Stanford's Open Information Extraction (OpenIE) (Angeli et al., 2015). OpenIE is not optimized to the regularities of text adventure games, and there are a lot of relations that can be inferred from the typical structure of descriptive texts. For example, from a phrase such as "There is an exit to the north" one can infer a has relation between the current location and the direction of the exit. These additional rules fill in the information not provided by OpenIE. The resultant knowledge graph gives the agent what essentially amounts to a mental map of the game world.

Figure 1: Graph state update example given two observations.

The knowledge graph is updated after every agent action (see Figure 1). The update rules are defined such that there are portions of the graph offering short-term and long-term context. A special node, designated "you", represents the agent, and relations out of this node are updated after every action, with the exception of relations denoting the agent's inventory. Other relations persist after each action. We intend for the update rules to be applied to text-based games in different domains and so only hand-craft a minimal set of rules that we believe apply generally. They are:

• Linking the current room type (e.g. "basement", "chamber") to the items found in the room with the relation "has", e.g. 〈chamber, has, bed stand〉

• Extracting information regarding entrances and exits and linking them to the current room, e.g. 〈basement, has, exit to north〉

• Removing, after every action, all relations involving the "you" node, with the exception of inventory relations, e.g. 〈you, have, cubical key〉

• Linking rooms with directions based on the action taken to move between the rooms, e.g. 〈chamber, east of, basement〉 after the action "go east" is taken to go from the basement to the chamber

All other RDF triples generated are taken from OpenIE.
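As a rough illustration of these rules, the following sketch (our own simplification, with illustrative helper names; the real triples come from OpenIE plus the rules above) stores the graph as a set of triples and applies the "you"-node reset on every update:

```python
# Minimal sketch of the knowledge-graph update rules described above.
# Triple extraction is stubbed out; in the paper it comes from Stanford OpenIE
# plus the hand-crafted rules. All names here are illustrative.
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def update_graph(graph: Set[Triple], new_triples: Set[Triple]) -> Set[Triple]:
    # Relations out of the "you" node are short-term context: drop them every
    # step, except inventory relations such as ("you", "have", "cubical key").
    persistent = {t for t in graph if t[0] != "you" or t[1] == "have"}
    return persistent | new_triples

graph: Set[Triple] = set()
graph = update_graph(graph, {("basement", "has", "exit to north"),
                             ("you", "in", "basement")})
graph = update_graph(graph, {("chamber", "east of", "basement"),
                             ("chamber", "has", "bed stand"),
                             ("you", "in", "chamber"),
                             ("you", "have", "cubical key")})
print(sorted(graph))  # ("you", "in", "basement") has been dropped
```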

3.2 Action Pruning

The number of actions available to an agent in a text adventure game can be quite large: A = O(|V| × |O|²), where V is the number of action verbs and O is the number of distinct objects in the world that the agent can interact with, assuming that verbs can take two arguments. Some actions, such as movement, inspecting inventory, or observing the room, do not have arguments.

The knowledge graph is used to prune the combinatorially large space of possible actions available to the agent as follows. Given the current state graph representation G_t, the action space is pruned by ranking the full set of actions and selecting the top-k. Our action scoring function is:

• +1 for each object in the action that is present in the graph; and

• +1 if there exists a valid directed path between the two objects in the graph.

We assume that each action has at most two objects (for example, inserting a key in a lock).
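A minimal sketch of this scoring rule follows, assuming the graph is stored as a set of triples as above; the use of networkx and the either-direction path check are our own stand-ins for however the paper computes a valid directed path, and k is left as a free hyperparameter.

```python
# Rough sketch of the two-part scoring rule; graph and helper names are ours.
import networkx as nx

def score_action(action_objects, triples):
    """+1 for each action object present in the graph, +1 if a directed path
    connects the two objects (checked in either direction here, an assumption)."""
    g = nx.DiGraph()
    g.add_edges_from((s, o) for s, _, o in triples)
    score = sum(1 for obj in action_objects if obj in g)
    if len(action_objects) == 2 and all(obj in g for obj in action_objects):
        a, b = action_objects
        if nx.has_path(g, a, b) or nx.has_path(g, b, a):
            score += 1
    return score

def prune_actions(actions, triples, k):
    """Rank every (action string, objects) pair and keep the top-k as A_t."""
    ranked = sorted(actions, key=lambda act: score_action(act[1], triples), reverse=True)
    return [act[0] for act in ranked[:k]]

triples = {("basement", "has", "exit to north"), ("you", "have", "cubical key")}
actions = [("unlock chest with cubical key", ("chest", "cubical key")),
           ("take cubical key", ("cubical key",)),
           ("eat sandwich", ("sandwich",))]
print(prune_actions(actions, triples, k=2))
```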

3.3 Model Architecture and Training

Following Narasimhan et al. (2015), all actions A that will be accepted by the game's parser are available to the agent at all times. When playing the game, the agent chooses an action and receives an observation o_t from the simulator, which is a textual description of the current game state. The state graph G_t is updated according to the given observation, as described in Section 3.1.

We use the Q-learning technique (Watkins and Dayan, 1992) to learn a control policy π(a_t|s_t), a_t ∈ A, which gives us the probability of taking action a_t given the current state s_t. The policy is determined by the Q-value of a particular state-action pair, which is updated using the Bellman equation (Sutton and Barto, 2018):

Q_{t+1}(s_{t+1}, a_{t+1}) = \mathbb{E}[r_{t+1} + \gamma \max_{a \in A_t} Q_t(s, a) \mid s_t, a_t]    (1)

where γ refers to the discount factor and r_{t+1} is the observed reward. The policy is thus to take the action that maximizes the Q-value in a particular state, which will correspond to the action that maximizes the reward expectation given that the agent has taken action a_t at the current state s_t and followed the policy π(a|s) thereafter.

The architecture in Figure 2 is responsible for computing the representations for both the state s_t and the actions a^(i) ∈ A and coming to an estimate of the Q-value for a particular state and action. During the forward activation, the agent uses the observation to update the graph G_t using the rules outlined in Section 3.1.

The graph is then embedded into a single vector g_t. We use Graph Attention (Velickovic et al., 2018) with an attention mechanism similar to that described in Bahdanau et al. (2014). Formally, the multi-headed graph attention component receives a set of node features H = {h_1, h_2, ..., h_N}, h_i ∈ ℝ^F, where N is the number of nodes and F the number of features in each node, and the adjacency matrix of G_t. Each of the node features consists of the averaged word embeddings for the tokens in that node, as determined by the preceding graph embedding layer. The attention mechanism is set up using self-attention on the nodes after a learnable linear transformation W ∈ ℝ^{2F×F} applied to all the node features:

e_{ij} = \mathrm{LeakyReLU}(p \cdot W(h_i \oplus h_j))    (2)

where p ∈ ℝ^{2F} is a learnable parameter. The attention coefficients α_{ij} are then computed by normalizing over the choices of k ∈ N using the softmax function. Here N refers to the neighborhood in which we compute the attention coefficients. This is determined by the adjacency matrix for G_t and consists of all third-order neighbors of a particular node.

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in N} \exp(e_{ik})}    (3)

Multi-head attention is then used, calculating multiple independent attention coefficients. The resulting features are then concatenated and passed into a linear layer to determine g_t:

g_t = f\big(W_g \big( \Vert_{k=1}^{K} \sigma\big( \sum_{j \in N} \alpha_{ij}^{(k)} W^{(k)} h_j \big) \big) + b_g\big)    (4)

where k refers to the parameters of the k-th independent attention mechanism, W_g and b_g are the weights and biases of this component's output linear layer, and ‖ represents concatenation.
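The multi-headed graph attention of Eqs. 2-4 can be sketched roughly as below. This is a simplified PyTorch version under our own assumptions, not the authors' released code: it follows the standard GAT convention of transforming each node with W before concatenation, masks attention with the full adjacency matrix rather than third-order neighborhoods, takes σ to be a sigmoid, and mean-pools the per-node features into the single graph vector g_t.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionSketch(nn.Module):
    """Simplified multi-head graph attention in the spirit of Eqs. 2-4."""
    def __init__(self, in_dim, out_dim, num_heads=2):
        super().__init__()
        self.W = nn.ModuleList([nn.Linear(in_dim, out_dim, bias=False)
                                for _ in range(num_heads)])        # W^(k)
        self.p = nn.ParameterList([nn.Parameter(torch.randn(2 * out_dim))
                                   for _ in range(num_heads)])     # p (Eq. 2)
        self.out = nn.Linear(num_heads * out_dim, out_dim)         # W_g, b_g (Eq. 4)

    def forward(self, h, adj):
        # h: (N, F) node features (averaged word embeddings); adj: (N, N) 0/1 mask.
        heads = []
        for W, p in zip(self.W, self.p):
            Wh = W(h)                                              # (N, F')
            n = Wh.size(0)
            pairs = torch.cat([Wh.unsqueeze(1).expand(n, n, -1),
                               Wh.unsqueeze(0).expand(n, n, -1)], dim=-1)
            e = F.leaky_relu(pairs @ p)                            # Eq. 2: (N, N) scores
            e = e.masked_fill(adj == 0, float("-inf"))             # restrict to neighborhood
            alpha = torch.softmax(e, dim=-1)                       # Eq. 3
            heads.append(torch.sigmoid(alpha @ Wh))                # sum_j alpha_ij W h_j
        g_nodes = self.out(torch.cat(heads, dim=-1))               # Eq. 4
        return g_nodes.mean(dim=0)                                 # pooled graph vector g_t (assumption)

layer = GraphAttentionSketch(in_dim=50, out_dim=32)
print(layer(torch.randn(5, 50), torch.ones(5, 5)).shape)  # torch.Size([32])
```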

Simultaneously, an encoded representation of the observation o_t is computed using a Sliding Bidirectional LSTM (SB-LSTM). The final state representation s_t is computed as:

s_t = f(W_l(g_t \oplus o_t) + b_l)    (5)

where W_l, b_l represent the final linear layer's weights and biases, and o_t is the result of encoding the observation with the SB-LSTM.

The entire set of possible actions A is pruned by scoring each a ∈ A according to the mechanism previously described, using the newly updated G_{t+1}. We then embed and encode all of these action strings using an LSTM encoder (Sutskever et al., 2014). The dashed lines in Figure 2 denote non-differentiable processes.

The final Q-value for a state-action pair is:

Q(s_t, a_t) = s_t \cdot a_t    (6)

This method of separately computing the representations for the state and action is similar to the approach taken in the DRRN (He et al., 2016).
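A minimal sketch of Eqs. 5-6 follows, assuming the action encoding is linearly projected to the same dimension as s_t so the dot product is well defined (that projection is our assumption, not stated in the paper):

```python
import torch
import torch.nn as nn

class QScorer(nn.Module):
    """Fuse the graph vector g_t with the SB-LSTM observation encoding o_t into
    s_t (Eq. 5), then dot it with the LSTM action encoding a_t (Eq. 6)."""
    def __init__(self, graph_dim, obs_dim, action_dim, state_dim):
        super().__init__()
        self.fuse = nn.Linear(graph_dim + obs_dim, state_dim)   # W_l, b_l (Eq. 5)
        self.act_proj = nn.Linear(action_dim, state_dim)        # assumption: match dims

    def forward(self, g_t, o_t, a_t):
        s_t = torch.relu(self.fuse(torch.cat([g_t, o_t], dim=-1)))  # Eq. 5
        return (s_t * self.act_proj(a_t)).sum(dim=-1)               # Eq. 6: Q = s_t . a_t

scorer = QScorer(graph_dim=32, obs_dim=64, action_dim=64, state_dim=64)
print(scorer(torch.randn(32), torch.randn(64), torch.randn(64)).item())
```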

We train the network using experience replay (Lin, 1993) with prioritized sampling (cf. Moore and Atkeson, 1993) and a modified version of the ε-greedy algorithm (Sutton and Barto, 2018) that we call the ε_1, ε_2-greedy learning algorithm. The experience replay strategy finds paths in the game, which are then stored as transition tuples in an experience replay buffer D. The ε_1, ε_2-greedy algorithm explores with probability ε_1, choosing actions randomly from the full set A with probability ε_2 and from the pruned set A_t otherwise. The second threshold is needed to account for situations where an action must be chosen to advance the quest for which the agent has no prior in G_t. That is, action pruning may remove actions essential to quest completion because those actions involve combinations of entities that have not been encountered before.
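The exploration rule can be sketched as follows (a simplification of Algorithm 1; whether the greedy step ranges over the full set A or the pruned set A_t is abstracted away here):

```python
import random

def select_action(full_actions, pruned_actions, q_values, eps1, eps2):
    """Sketch of the eps_1, eps_2-greedy rule: explore with probability eps_1,
    sampling from the full set A with probability eps_2 (so actions pruned away
    by the graph stay reachable) and from the pruned set A_t otherwise; exploit
    the highest-Q pruned action the rest of the time. Names are illustrative."""
    if random.random() < eps1:
        if random.random() < eps2:
            return random.choice(full_actions)
        return random.choice(pruned_actions)
    return max(pruned_actions, key=lambda a: q_values[a])
```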

Figure 2: KG-DQN architecture; blue shading indicates components that can be pre-trained and red indicates no pre-training. The solid lines indicate gradient flow for learnable components.

We then sample a mini-batch of transition tuples consisting of 〈s_k, a_k, r_{k+1}, s_{k+1}, A_{k+1}, p_k〉 from D and compute the temporal difference loss as:

L(\theta) = r_{k+1} + \gamma \max_{a \in A_{k+1}} Q(s_{k+1}, a; \theta) - Q(s_k, a_k; \theta)    (7)

Replay sampling from D is done by sampling a fraction ρ from transition tuples with a positive reward and 1 − ρ from the rest. As shown by Narasimhan et al. (2015), prioritized sampling from experiences with a positive reward helps the deep Q-network more easily find the sparse set of transitions that advance the game. The exact training mechanism is described in Algorithm 1.
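A rough sketch of the prioritized sampling and the temporal-difference target is given below, under the assumption that transitions are stored as dictionaries with a "priority" field; the names are illustrative.

```python
import random

def sample_minibatch(replay_buffer, batch_size, rho):
    """Prioritized sampling sketch: roughly a fraction rho of the batch comes
    from transitions with positive reward (priority 1), the rest from the others.
    Assumes the buffer already holds enough transitions of each kind."""
    pos = [t for t in replay_buffer if t["priority"] == 1]
    neg = [t for t in replay_buffer if t["priority"] == 0]
    n_pos = min(len(pos), int(rho * batch_size))
    batch = random.sample(pos, n_pos) + random.sample(neg, batch_size - n_pos)
    random.shuffle(batch)
    return batch

def td_error(q_sa, reward, next_q_values, gamma, terminal):
    """Eq. 7 / Algorithm 1: TD error against the target y = r + gamma * max Q(s', a'),
    with y = r at terminal states; the squared error is what gets minimized."""
    target = reward if terminal else reward + gamma * max(next_q_values)
    return target - q_sa
```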

4 Game Play as Question Answering

Previous work has shown that many NLP tasks can be framed as instances of question-answering and that, in doing so, one can transfer knowledge between these tasks (McCann et al., 2017). In the abstract, an agent playing a text adventure game can be thought of as continuously asking the question "What is the right action to perform in this situation?" When appropriately trained, the agent may be able to answer the question for itself and select a good next move to execute. Treating the problem as question-answering will not replace the need for exploration in text-adventure games. However, we hypothesize that it will cut down on the amount of exploration needed during testing time, theoretically allowing the agent to complete quests faster; one of the challenges of text adventure games is that the quests are puzzles, and even after training, execution of the policy requires a significant amount of exploration.

To teach the agent to answer the question of what action is best to take given an observation, we use an offline pre-training approach. The data for the pre-training approach is generated using an oracle, an agent capable of finishing a game perfectly in the least number of steps possible. Specifically, the oracle knows exactly what action to take given the state observation in order to advance the game in the most optimal manner possible. Through this process, we generate a set of traces consisting of state observations and actions such that the state observation provides the context for the implicit question of "What action should be taken?" and the oracle's correct action is the answer. We then use the DrQA (Chen et al., 2017) question-answering technique to train a paired question encoder and answer encoder that together predict the answer (action) from the question (text observation). The weights from the SB-LSTM in the document encoder of the DrQA system are then used to initialize the weights of the SB-LSTM in KG-DQN. Similarly, the embedding layers of both the graph and the LSTM action encoder are initialized with the weights from the embedding layer of the same document encoder. Since the DrQA embedding layers are initialized with GloVe, we are transferring word embeddings that are tuned during the training of the QA architecture.
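A sketch of this weight transfer is below; the module names and the assumption that shapes match exactly are ours, and the snippet is not the authors' or DrQA's actual code.

```python
import torch.nn as nn

def transfer_qa_weights(qa_doc_encoder: nn.Module, qa_embedding: nn.Embedding,
                        kg_sb_lstm: nn.Module, kg_graph_embedding: nn.Embedding,
                        kg_action_embedding: nn.Embedding) -> None:
    """Copy the trained QA document-encoder (SB-LSTM) weights into KG-DQN's
    SB-LSTM, and the QA embedding layer (GloVe-initialized, then tuned during
    QA training) into both the graph and action-encoder embedding layers.
    Module names are illustrative and assumed to have matching shapes."""
    kg_sb_lstm.load_state_dict(qa_doc_encoder.state_dict())
    kg_graph_embedding.load_state_dict(qa_embedding.state_dict())
    kg_action_embedding.load_state_dict(qa_embedding.state_dict())
```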

The game traces used to train the question-answering system come from a set of games of the same domain but with different specific configurations of the environment and different quests. We use the TextWorld framework (Côté et al., 2018), which uses a grammar to generate random worlds and quests. The types of rooms are the same, but their relative spatial configuration, the types of objects, and the specific sequence of actions needed to complete the quest are different each time.


Table 1: Generated game details.

                                    Small    Large
Rooms                                  10       20
Total objects                          20       40
Quest length                            5       10
Branching factor                      143      562
Vocab size                            746      819
Average words per obs.               67.5     94.0
Average new RDF triples per obs.      7.2     10.5

This means that the agent cannot simply memorize quests. For pre-training to work, the agent must develop a general question-answering competence that can transfer to new quests. Our approach to question-answering in the context of text adventure game playing thus represents a form of transfer learning.

5 Experiments

We conducted experiments in the TextWorld framework (Côté et al., 2018) using their "home" theme. TextWorld uses a grammar to randomly generate game worlds and quests with given parameters. Games generated with TextWorld start with a zero-th observation that gives instructions for the quest; we do not allow our agent to access this information. The TextWorld API also provides a list of admissible actions at each state, i.e., the actions that can be performed based on the objects that are present. We do not allow our agent to access the admissible actions.

We generated two sets of games with different random seeds, representing different game difficulties, which we denote as small and large. Small games have ten rooms and quests of length five, and large games have twenty rooms and quests of length ten. Statistics on the games are given in Table 1. Quest length refers to the number of actions that the agent is required to perform in order to finish the quest; more actions are typically necessary to move around the environment and find the objects that need to be interacted with. The branching factor is the size of the action set A for that particular game.

The reward function provided by TextWorld is as follows: +1 for each action taken that moves the agent closer to finishing the quest; -1 for each action taken that extends the minimum number of steps needed to finish the quest from the current stage; 0 for all other situations. The maximum achievable rewards for the small and large sets of games are 5 and 10, respectively.

Table 2: Pre-training accuracy.

          EM    Precision    Recall      F1
Small   46.20       56.57     63.38   57.94
Large   34.13       52.53     64.72   55.06

This allows for a large amount of variance in the quality of quests, as measured by the number of steps taken to complete them, that still receive the maximum reward.

The following procedure for pre-training was done separately for each set of games. Pre-training of the SB-LSTM within the question-answering architecture is conducted by generating 200 games from the same TextWorld theme. The QA system was then trained on data from walkthroughs of a randomly chosen subset of 160 of these generated games, tuned on a dev set of 20 games, and evaluated on the held-out set of 20 games. Table 2 provides details on the Exact Match (EM), precision, recall, and F1 scores of the QA system after training for the small and large sets of games. Precision, recall, and F1 scores are calculated by counting the number of overlapping tokens between the predicted answer and the ground truth. An Exact Match occurs when the entire predicted answer matches the ground truth. This score is used to tune the model based on the dev set of games.
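For reference, a minimal sketch of SQuAD-style token-overlap metrics of this kind follows (the paper's exact tokenization and normalization may differ):

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()

def token_prf(pred: str, gold: str):
    """Token-overlap precision, recall, and F1 between a predicted action
    string and the oracle's action."""
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return precision, recall, 2 * precision * recall / (precision + recall)

print(token_prf("take cubical key", "take the cubical key"))  # (1.0, 0.75, ~0.857)
```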

A random game was chosen from the test set of games and used as the environment for the agent to train its deep Q-network on. Thus, at no time did the QA system see the final testing game prior to the training of the KG-DQN network.

We compare our technique to three baselines:

• Random command, which samples from the list of admissible actions returned by the TextWorld simulator at each step.

• LSTM-DQN, developed by Narasimhan et al. (2015).

• Bag-of-Words DQN, which uses a bag-of-words encoding with a multi-layer feed-forward network instead of an LSTM.

To achieve the most competitive baselines, we used a randomized grid search to choose the best hyperparameters (e.g., hidden state size, γ, ρ, final ε, update frequency, learning rate, replay buffer size) for the BOW-DQN and LSTM-DQN baselines.

We tested three versions of our KG-DQN:

1. Un-pruned actions with pre-training


Algorithm 1: ε_1, ε_2-greedy learning algorithm for KG-DQN

1:  for episode = 1 to M do
2:      Initialize action dictionary A and graph G_0
3:      Reset the game simulator
4:      Read initial observation o_1
5:      G_1 ← updateGraph(G_0, o_1); A_1 ← pruneActions(A, G_0)            ▷ Sections 3.1, 3.2
6:      for step t = 1 to T do
7:          if random() < ε_1 then
8:              if random() < ε_2 then
9:                  Select random action a_t ∈ A
10:             else
11:                 Select random action a_t ∈ A_t
12:         else
13:             Compute Q(s_t, a^(i); θ) for a^(i) ∈ A for network parameters θ   ▷ Section 3.3, Eq. 6
14:             Select a_t based on π(a|s_t)
15:         Execute action a_t in the simulator and observe reward r_t
16:         Receive next observation o_{t+1}
17:         G_{t+1} ← updateGraph(G_t, o_{t+1}); A_{t+1} ← pruneActions(A, G_{t+1})   ▷ Sections 3.1, 3.2
18:         Compute s_{t+1} and encode a'^(i) for all a'^(i) ∈ A                  ▷ Section 3.3
19:         Set priority p_t = 1 if r_t > 0, else p_t = 0
20:         Store transition (s_t, a_t, r_t, s_{t+1}, A_{t+1}, p_t) in replay buffer D
21:         Sample mini-batch of transitions (s_k, a_k, r_k, s_{k+1}, A_{k+1}, p_k) from D, with fraction ρ having p_k = 1
22:         Set y_k = r_k + γ max_{a ∈ A_{k+1}} Q(s_{k+1}, a; θ), or y_k = r_k if s_{k+1} is terminal
23:         Perform gradient descent step on loss function L(θ) = (y_k − Q(s_k, a_k; θ))²

2. Pruned actions without pre-training

3. Pruned actions with pre-training (full)

Our models use 50-dimensional word embeddings, 2 heads on the graph attention layers, and a mini-batch size of 16, and perform a gradient descent update every 5 steps taken by the agent.

All models are evaluated by observing (a) the time to reward convergence, and (b) the average number of steps required for the agent to finish the game with ε = 0.1 over 5 episodes after training has completed. Following Narasimhan et al. (2015), we set ε to a non-zero value because text adventure games, by nature, require exploration to complete the quests. All results are reported based on multiple independent trials. For the large set of games, we only perform experiments on the best performing models found on the small set of games. Also note that for experiments on large games, we do not display the entire learning curve for the LSTM-DQN baseline, as it converges significantly more slowly than KG-DQN. We run each experiment 5 times and average the results.

Additionally, human performance on both sets of games was measured by counting the number of steps taken to finish the game, with and without instructions on the exact quest. We modified TextWorld to give the human players reward feedback in the form of a score; the reward function itself is identical to that received by the deep reinforcement learning agents. In one variation of this experiment, the human was given instructions on the potential sequence of steps required to finish the game, in addition to the reward in the form of a score; in the other variation, the human received no instructions.

6 Results and Discussion

Recall that the number of steps required to finish the game for the oracle agent is 5 and 10 for the small and large maps, respectively. It is impossible to achieve this ideal performance due to the structure of the quest: the player needs to interact with objects and explore the environment in order to figure out the exact sequence of actions required to finish the quest. To help benchmark our agent's performance, we observed people unaffiliated with the research playing through the same TextWorld "home" quests as the other models. Those who did not receive instructions on how to finish the quest never finished a single quest and gave up after an average of 184 steps on the small map and an average of 190 steps on the large map. When given instructions, human players completed the quest on the large map in an average of 23 steps, finishing the game with the maximum reward possible. Also note that none of the deep reinforcement learning agents received instructions.

On both small and large maps, all versions of KG-DQN tested converge faster than the baselines (see Figure 3 for the small game and Figure 4 for the large game). We do not show BOW-DQN because it is strictly inferior to LSTM-DQN in all situations. KG-DQN converges 40% faster than the baseline on the small game; both KG-DQN and the LSTM-DQN baseline reach the maximum reward of five.


Figure 3: Reward learning curve for select experiments with the small games.

Table 3: Average number of steps (and standard deviation) taken to complete the small game.

Model                              Steps
Random Command                     319.8
BOW-DQN                            83.1 ± 8.0
LSTM-DQN                           72.4 ± 4.6
Unpruned, pre-trained KG-DQN       131.7 ± 7.7
Pruned, non-pre-trained KG-DQN     97.3 ± 9.0
Full KG-DQN                        73.7 ± 8.5

On the large game, no agents achieve the maximum reward of 10, and the LSTM-DQN requires more than 300 episodes to converge to the same level as KG-DQN. Since all versions of KG-DQN converge at approximately the same rate, we conclude that the knowledge graph, i.e., persistent memory, is the main factor helping convergence time, since it is the common element across all experiments.

After training is complete, we measure the number of steps each agent needs to complete each quest. Full KG-DQN requires a number of steps equivalent to LSTM-DQN in both the small game (Table 3) and the large game (Table 4). Differences between LSTM-DQN and full KG-DQN are not statistically significant (p = 0.199 on an independent t-test). The ablated versions of KG-DQN (unpruned KG-DQN and non-pre-trained KG-DQN) require many more steps to complete quests. TextWorld's reward function allows for a lot of exploration of the environment without penalty, so it is possible for a model that has converged on reward to complete quests in as few as five steps or in many hundreds of steps.

Figure 4: Reward learning curve for select experiments with the large games.

Table 4: Average number of steps (and standard deviation) taken to complete the large game.

Model                              Steps
Random Command                     2054.8
LSTM-DQN                           260.3 ± 4.5
Pruned, non-pre-trained KG-DQN     340 ± 6.4
Full KG-DQN                        265.9 ± 9.4

From these results, we conclude that pre-training using our question-answering paradigm allows the agent to develop a general understanding of how to pick good actions even when it has never seen the final test game. LSTM-DQN also learns how to choose actions efficiently, but this knowledge is captured in the LSTM's cell state, whereas in KG-DQN this knowledge is made explicit in the knowledge graph and retrieved effectively by graph attention. Taken together, KG-DQN converges faster without loss of quest solution quality.

7 Conclusions

We have shown that incorporating knowledge graphs into a deep Q-network can reduce training time for agents playing text-adventure games of various lengths. We speculate that this is because the knowledge graph provides a persistent memory of the world as it is being explored. While the knowledge graph allows the agent to reach optimal reward more quickly, it does not by itself ensure a high-quality solution to quests. Action pruning using the knowledge graph and pre-training of the embeddings used in the deep Q-network result in shorter action sequences needed to complete quests.

The insight into pre-training portions of the agent's architecture is based on converting text-adventure game playing into a question-answering activity. That is, at every step, the agent is asking, and trying to answer, what is the most important thing to try. The pre-training acts as a form of transfer learning from different, but related, games. However, question-answering alone cannot solve the text-adventure playing problem because there will always be some trial and error required.

By addressing the challenges of partial observability and combinatorially large action spaces through persistent memory, our work on playing text-adventure games addresses a critical need for reinforcement learning for language. Text-adventure games can be seen as a stepping stone toward more complex, real-world tasks; the human world is one of partial understanding through communication and acting on the world using language.

References

Gabor Angeli, Melvin Jose Johnson Premkumar, and Christopher D. Manning. 2015. Leveraging linguistic structure for open domain information extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473.

Antoine Bordes, Nicolas Usunier, Ronan Collobert, and Jason Weston. 2010. Towards understanding situated natural language. In Proceedings of the 2010 International Conference on Artificial Intelligence and Statistics.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL).

Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Emery Fine, James Moore, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, Wendy Tay, and Adam Trischler. 2018. TextWorld: A learning environment for text-based games. In Proceedings of the ICML/IJCAI 2018 Workshop on Computer Games, page 29.

Nancy Fulda, Daniel Ricks, Ben Murdoch, and David Wingate. 2017. What can you do with a rock? Affordance extraction via word embeddings. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 1039-1045.

Jian Guan, Yansen Wang, and Minlie Huang. 2018. Story ending generation with incremental encoding and commonsense knowledge. arXiv:1808.10113v1.

Matan Haroush, Tom Zahavy, Daniel J. Mankowitz, and Shie Mannor. 2018. Learning how not to act in text-based games. In Workshop Track at ICLR 2018, pages 1-4.

Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. 2016. Deep reinforcement learning with a natural language action space. In Association for Computational Linguistics (ACL).

Long-Ji Lin. 1993. Reinforcement learning for robots using neural networks. Ph.D. thesis, Carnegie Mellon University.

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2017. The natural language decathlon: Multitask learning as question answering. arXiv:1806.08730.

Andrew W. Moore and Christopher G. Atkeson. 1993. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13(1):103-130.

Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. 2015. Language understanding for text-based games using deep reinforcement learning. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

Robert Speer and Catherine Havasi. 2012. Representing general relational knowledge in ConceptNet 5. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC).

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.

Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.

Ruo Yu Tao, Marc-Alexandre Côté, Xingdi Yuan, and Layla El Asri. 2018. Towards solving text-based games by producing adaptive action spaces. In Proceedings of the 2018 NeurIPS Workshop on Wordplay: Reinforcement and Language Learning in Text-based Games.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In International Conference on Learning Representations (ICLR).

Christopher J. C. H. Watkins and Peter Dayan. 1992. Q-learning. Machine Learning, 8(3):279-292.