

Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning

Khanh Nguyen and Hal Daumé III
University of Maryland, College Park; Microsoft Research, New York

{kxnguyen,hal}@umiacs.umd.edu

Abstract

Mobile agents that can leverage help from humans can potentially accomplish more complex tasks than they could entirely on their own. We develop "Help, Anna!" (HANNA), an interactive photo-realistic simulator in which an agent fulfills object-finding tasks by requesting and interpreting natural language-and-vision assistance. An agent solving tasks in a HANNA environment can leverage simulated human assistants, called ANNA (Automatic Natural Navigation Assistants), which, upon request, provide natural language and visual instructions to direct the agent towards the goals. To address the HANNA problem, we develop a memory-augmented neural agent that hierarchically models multiple levels of decision-making, and an imitation learning algorithm that teaches the agent to avoid repeating past mistakes while simultaneously predicting its own chances of making future progress. Empirically, our approach is able to ask for help more effectively than competitive baselines and, thus, attains higher task success rate on both previously seen and previously unseen environments. We publicly release code and data at https://github.com/khanhptnk/hanna.

1 Introduction

The richness and generalizability of natural language makes it an effective medium for directing mobile agents in navigation tasks, even in environments they have never encountered before (Anderson et al., 2018b; Chen et al., 2019; Misra et al., 2018; de Vries et al., 2018; Qi et al., 2019). Nevertheless, even with language-based instructions, such tasks can be overly difficult for agents on their own, especially in unknown environments. To accomplish tasks that surpass their knowledge and skill levels, agents must be able to actively seek out and leverage assistance in the environment. Humans are rich external knowledge sources but, unfortunately, they may not be available all the time to provide guidance, or may be unwilling to help too frequently. To reduce the effort needed from human assistants, it is essential to design research platforms for teaching agents to request help mindfully.

In natural settings, human assistance is often:
- derived from interpersonal interaction (a lost tourist asks a local for directions);
- reactive to the situation of the receiver, based on the assistant's knowledge (the local may guide the tourist to the goal, or may redirect them to a different source of assistance);
- delivered via a multimodal communication channel (the local uses a combination of language, images, maps, gestures, etc.).

We introduce the "Help, Anna!" (HANNA) problem (§3), in which a mobile agent has to navigate (without a map) to an object by interpreting its first-person visual perception and requesting help from Automatic Natural Navigation Assistants (ANNA). HANNA models a setting in which a human is not always available to help; rather, human assistants are scattered throughout the environment and provide help upon request (modeling the interpersonal aspect). The assistants are not omniscient: they are only familiar with certain regions of the environment and, upon request, provide subtasks, expressed in language and images (modeling the multimodal aspect), for getting closer to the goal, not necessarily for fully completing the task (modeling the reactive aspect).

In HANNA, when the agent gets lost and becomes unable to make progress, it has the option of requesting assistance from ANNA. At test time, the agent must decide where to go and whether to request help from ANNA without additional supervision. At training time, we leverage imitation learning to learn an effective agent, both in terms of navigation and in terms of being able to decide when it is most worthwhile to request assistance.

Figure 1: An example HANNA task. Initially, the agent stands in the bedroom at A and is asked by a human requester to "find a mug." The agent begins, but gets lost somewhere in the bathroom. It gets to the start location of a route (B) to request help from ANNA. Upon request, ANNA assigns the agent a navigation subtask described by a natural language instruction that guides the agent to a target location, together with an image of the view at that location. The agent follows the language instruction and arrives at C, where it observes a match between the target image and the current view, and thus decides to depart the route. After that, it resumes the main task of finding a mug. From this point, the agent gets lost one more time and has to query ANNA for another subtask that helps it follow a second route and enter the kitchen. The agent successfully completes the task if it finally stops within ε meters of an instance of the requested object. Here, the ANNA feedback is simulated using two pre-collected language-assisted routes.

This paper has two primary contributions:

1. Constructing the HANNA simulator by augmenting an indoor photo-realistic simulator with simulated human assistance, mimicking a scenario where a mobile agent finds objects by asking for directions along the way (§3).

2. An effective model and training algorithm for the HANNA problem, which include a hierarchical memory-augmented recurrent architecture that models human assistance as sub-goals (§5), and an imitation learning objective that enhances exploration of the environment and the interpretability of the agent's help-request decisions (§4).

We embed the HANNA problem in the photo-realistic Matterport3D environments (Chang et al., 2017) with no extra annotation cost by reusing the pre-existing Room-to-Room dataset (Anderson et al., 2018b). Empirical results (§7) show that our agent can effectively learn to request and interpret language and vision instructions, given a training set of 51 environments and less than 9,000 language instructions. Even in new environments, where the scenes and the language instructions are previously unseen, the agent successfully accomplishes 47% of its tasks. Our methods for training the navigation and help-request policies outperform competitive baselines by large margins.

2 Related work

Simulated environments provide an inexpensive platform for quickly prototyping and evaluating new ideas before deploying them in the real world. Video-game and physics simulators are standard benchmarks in reinforcement learning (Todorov et al., 2012; Mnih et al., 2013; Kempka et al., 2016; Brockman et al., 2016; Vinyals et al., 2017). Nevertheless, these environments under-represent the complexity of the world. Realistic simulators play an important role in sim-to-real approaches, in which an agent is trained with arbitrarily many samples provided by the simulators, then transferred to real settings using sample-efficient transfer learning techniques (Kalashnikov et al., 2018; Andrychowicz et al., 2018; Karttunen et al., 2019). While modern techniques are capable of simulating images that are convincing to human perception (Karras et al., 2017, 2018), simulating language interaction remains challenging. There are efforts in building complex interactive text-based worlds (Cote et al., 2018; Urbanek et al., 2019), but the lack of a graphical component makes them unsuitable for visually grounded learning. On the other hand, experimentation with real humans and robots, though expensive and time-consuming, is important for understanding the true complexity of real-world scenarios (Chai et al., 2018, 2016; Rybski et al., 2007; Mohan and Laird, 2014; Liu et al., 2016; She et al., 2014).


Recent navigation tasks in photo-realistic simulators have accelerated research on teaching agents to execute human instructions. Nevertheless, modeling human assistance in these problems remains simplistic (Table 1): they either do not incorporate the ability to request additional help while executing tasks (Misra et al., 2014, 2017; Anderson et al., 2018b; Chen et al., 2019; Das et al., 2018; Misra et al., 2018; Wijmans et al., 2019; Qi et al., 2019), or mimic human verbal assistance with primitive, highly scripted language (Nguyen et al., 2019; Chevalier-Boisvert et al., 2019). HANNA improves the realism of the VNLA setup (Nguyen et al., 2019) by using fully natural language instructions.

Imitation learning algorithms are a great fit for training agents in simulated environments: access to ground-truth information about the environments allows optimal actions to be computed in many situations. The "teacher" in standard imitation learning algorithms (Daumé III et al., 2009; Ross et al., 2011; Ross and Bagnell, 2014; Chang et al., 2015; Sun et al., 2017; Sharaf and Daumé III, 2017) does not take into consideration the agent's capability and behavior. He et al. (2012) present a coaching method where the teacher gradually increases the complexity of its demonstrations over time. Welleck et al. (2019) propose an "unlikelihood" objective, which, similar to our curiosity-encouraging objective, penalizes the likelihoods of candidate negative actions to avoid repeating mistakes. Our approach takes into account the agent's past and future behavior to determine the actions that are most and least beneficial to it, combining the advantages of both model-based and progress-estimating methods (Wang et al., 2018; Ma et al., 2019a,b).

3 The HANNA Simulator

Problem. HANNA simulates a scenario where a human requester asks a mobile agent via language to find an object in an indoor environment. The task request is only a high-level command ("find [object(s)]"), modeling the general case in which the requester does not need to know how to accomplish a task when requesting it. We assume the task is always feasible: there is at least one instance of the requested object in the environment.

Figure 1, to which references in this section will be made, illustrates an example where the agent is asked to "find a mug." The agent starts at a random location (A), is given a task request, and is allotted a budget of T time steps to complete the task. The agent succeeds if its final location is within ε_success meters of the location of any instance of the requested object. The agent is not given any sensors that help determine its location or the object's location and must navigate only with a monocular camera that captures its first-person view as an RGB image (e.g., the image in the upper right of Figure 1).

Table 1: Comparing HANNA with other photo-realistic navigation problems. VLN (Anderson et al., 2018b) does not allow the agent to request help. VNLA (Nguyen et al., 2019) models an advisor who is always present to help but speaks simple, templated language. CVDN (Thomason et al., 2019) contains natural conversations in which a human assistant aids another human in navigation tasks but offers limited language interaction simulation, as language assistance is not available when the agent deviates from the collected trajectories and tasks. HANNA simulates human assistants that provide language-and-vision instructions that adapt to the agent's current position and goal.

Problem           | Request assistance | Multimodal instructions | Simulated humans
VLN               | ✗                  | ✗                       | ✗
VNLA              | ✓                  | ✗                       | ✓
CVDN              | ✓                  | ✗                       | ✗
HANNA (this work) | ✓                  | ✓                       | ✓

The only source of help the agent can leverage in the environment is the assistants, who are present at both training and evaluation time. The assistants are not aware of the agent unless it enters their zones of attention, which include all locations within ε_attn meters of their locations. When the agent is in one of these zones, it has the option to request help from the corresponding assistant. The assistant helps the agent by giving a subtask, described by a natural language instruction that guides the agent to a specific location, and an image of the view at that location.

In our example, at B, the assistant says "Enter the bedroom and turn left immediately. Walk straight to the carpet in the living room. Turn right, come to the coffee table." and provides an image of the destination in the living room. Executing the subtask may not fulfill the main task, but is guaranteed to get the agent to a location closer to a goal than where it was before (e.g., C).

Photo-realistic Navigation Simulator. HANNA uses the Matterport3D simulator (Chang et al., 2017; Anderson et al., 2018b) to photo-realistically emulate a first-person view while navigating in indoor environments. HANNA features 68 Matterport3D environments, each of which is a residential building consisting of multiple rooms and floors. Navigation is modeled as traversing an undirected graph G = (V, E), where each location corresponds to a node v ∈ V with 3D coordinates x_v, and edges are weighted by their lengths (in meters). The state of the agent is fully determined by its pose τ = (v, ψ, ω), where v is its location, ψ ∈ (0, 2π] is its heading (horizontal camera angle), and ω ∈ [−π/6, π/6] is its elevation (vertical camera angle). The agent does not know v, and the angles are constrained to multiples of π/6. In each step, the agent can either stay at its current location, or it can rotate toward and go to a location adjacent to it in the graph; we use the "panoramic action space" of Fried et al. (2018). Every time the agent moves (and thus changes pose), the simulator recalculates the image to reflect the new view.
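To make the graph-based setup concrete, here is a minimal Python sketch of the environment graph, the agent pose, and the shortest-path distance d(·,·) used throughout the paper. The class and function names are illustrative assumptions, not taken from the released code.

```python
import heapq
import math
from dataclasses import dataclass

@dataclass
class Pose:
    """Agent pose: node id, heading and elevation (radians)."""
    v: str
    heading: float    # in (0, 2*pi], multiples of pi/6
    elevation: float  # in [-pi/6, pi/6], multiples of pi/6

class EnvGraph:
    """Undirected navigation graph with edge lengths in meters."""
    def __init__(self):
        self.adj = {}  # node -> {neighbor: edge length}

    def add_edge(self, u, v, length):
        self.adj.setdefault(u, {})[v] = length
        self.adj.setdefault(v, {})[u] = length

    def shortest_path_distance(self, source, target):
        """Dijkstra's algorithm: returns d(source, target) in meters."""
        dist = {source: 0.0}
        frontier = [(0.0, source)]
        while frontier:
            d, u = heapq.heappop(frontier)
            if u == target:
                return d
            if d > dist.get(u, math.inf):
                continue
            for w, length in self.adj.get(u, {}).items():
                nd = d + length
                if nd < dist.get(w, math.inf):
                    dist[w] = nd
                    heapq.heappush(frontier, (nd, w))
        return math.inf

# Usage on a toy 3-node graph.
g = EnvGraph()
g.add_edge("bedroom", "hallway", 2.3)
g.add_edge("hallway", "kitchen", 3.1)
print(g.shortest_path_distance("bedroom", "kitchen"))  # ~5.4
```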

Automatic Natural Navigation Assistants (ANNA). ANNA is a simulation of human assistants who do not necessarily know themselves how to optimally accomplish the agent's goal: they are only familiar with scenes along certain paths in the environment, and thus give advice to help the agent make partial progress. Specifically, the assistance from ANNA is modeled by a set of language-assisted routes R = {r_1, r_2, ..., r_|R|}. Each route r = (ψ^r, ω^r, p^r, l^r) is defined by initial camera angles (ψ^r, ω^r), a path p^r in the environment graph, and a natural language instruction l^r. A route becomes enterable when its start location is adjacent to and within ε_attn meters of the agent's location. When the agent enters a route, it first adjusts its camera angles to (ψ^r, ω^r), then attempts to interpret the language instruction l^r to traverse along p^r. At any time, the agent can depart the route by stopping following l^r. An example of a route in Figure 1 is the combination of the initial camera angles at B, the corresponding path, and the language instruction "Enter the bedroom and turn left immediately..."

The set of all routes starting from a location simulates a human assistant who can recall scenes along these routes' paths. The zone of attention of the simulated human is the set of all locations from which the agent can enter one of the routes; when the agent is in this zone, it may ask the human for help. Upon receiving a help request, the human selects a route r* for the agent to enter and a location v_d on the route where it wants the agent to depart (e.g., C). It then replies to the agent with a multimedia message (l^{r*}, I_{v_d}), where l^{r*} is the selected route's language instruction and I_{v_d} is an image of the panoramic view at the departure location. The message describes a subtask that requires the agent to follow the directions described by l^{r*} and to stop if it reaches the location referenced by I_{v_d}. The route r* and the departure node v_d are selected to get the agent as close to a goal location as possible. Concretely, let R_curr be the set of all routes associated with the requested human. The selected route minimizes the distance to the goal locations among all routes in R_curr:

r* = argmin_{r ∈ R_curr} d(r, V_goal)    (1)

where d(r, V_goal) := min_{g ∈ V_goal, v ∈ p^r} d(g, v)    (2)

Here d(·,·) returns the (shortest-path) distance between two locations, and V_goal is the set of all goal locations. The departure location minimizes the distance to the goal locations among all locations on the selected route:

v_d = argmin_{v ∈ p^{r*}} min_{g ∈ V_goal} d(g, v),    (3)

which, by Eq. 2, attains the minimum distance d(r*, V_goal).

When the agent chooses to depart the route (not necessarily at the departure node), the human further assists it by providing I_{g*}, an image of the panoramic view at the goal location closest to the departure node:

g* = argmin_{g ∈ V_goal} d(g, v_d)    (4)
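The selection procedure in Eqs. 1-4 is easy to express on top of a shortest-path distance function. The sketch below is illustrative Python under that assumption; the function and field names are hypothetical, not from the simulator's released code.

```python
def select_route_and_departure(routes, goal_nodes, dist):
    """Sketch of Eqs. 1-4. `routes`: list of dicts with a "path" key (list of nodes);
    `goal_nodes`: the goal locations V_goal; `dist(u, v)`: shortest-path distance."""
    def d_route(route):
        # Eq. 2: d(r, V_goal) = min over goals g and path nodes v of d(g, v)
        return min(dist(g, v) for g in goal_nodes for v in route["path"])

    best_route = min(routes, key=d_route)                                  # Eq. 1
    depart_node = min(best_route["path"],                                  # Eq. 3
                      key=lambda v: min(dist(g, v) for g in goal_nodes))
    goal_for_image = min(goal_nodes, key=lambda g: dist(g, depart_node))   # Eq. 4
    return best_route, depart_node, goal_for_image

# Usage on a toy line graph a-b-c-d with unit-length edges and the goal at d.
nodes = ["a", "b", "c", "d"]
dist = lambda u, v: abs(nodes.index(u) - nodes.index(v))
route_1 = {"name": "route 1", "path": ["a", "b"]}
route_2 = {"name": "route 2", "path": ["b", "c"]}
print(select_route_and_departure([route_1, route_2], goal_nodes=["d"], dist=dist))
# -> route 2 is selected, the departure node is 'c', and the goal image is taken at 'd'
```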

The way the agent leverages ANNA to accomplish tasks is analogous to how humans travel using public transportation systems (e.g., bus, subway). For example, passengers of a subway system utilize fractions of pre-constructed routes to make progress toward a destination. They execute travel plans consisting of multiple subtasks, each of which requires entering a start stop, following a route (typically described by its name and last stop), and exiting at a departure stop (e.g., "Enter Penn Station, hop on the Red line in the direction of South Ferry, get off at the World Trade Center"). Occasionally, travelers walk short distances (at a lower speed) to switch routes. Our setup follows the same principle, but instead of having physical vehicles and railways, we employ low-level language-and-vision instructions as the "high-speed means" to accelerate travel.

Constructing the ANNA route system. Given a photo-realistic simulator, the primary cost of constructing the HANNA problem comes from crowd-sourcing the natural language instructions. Ideally, we want to collect sufficient instructions to simulate humans at any location in the environment. Let N = |V| be the number of locations in the environment. Since each simulated human is familiar with at most N locations, in the worst case we need to collect O(N²) instructions to connect all location pairs. However, we theoretically prove that, assuming the agent executes instructions perfectly, it is possible to guide the agent between any location pair by collecting only Θ(N log N) instructions. The key idea is to use O(log N) instructions instead of a single instruction to connect each pair, and to reuse an instruction for multiple routes.

Lemma 1. (proof in Appendix A) To guide the agent between any two locations using O(log N) instructions, we need to collect instructions for Θ(N log N) location pairs.

In our experiments, we leverage the pre-existing Room-to-Room dataset (Anderson et al., 2018b) to construct the route system. This dataset contains 21,567 natural language instructions crowd-sourced from humans and is originally intended for the Vision-Language Navigation task, where an agent executes a language instruction (such as those in Figure 1) to go to a location. We exclude instructions of the test split and their corresponding environments because ground-truth paths are not given. We use (on average) 211 routes to connect (on average) 125 locations per environment. Even though the routes were selected randomly in the original dataset, our experiments show that they are sufficient for completing the tasks (assuming perfect assistance interpretation).

4 Retrospective Curiosity-Encouraging Imitation Learning

Agent Policies. Let s be a fully-observed state that contains ground-truth information about the environment and the agent (e.g., object locations, environment graph, agent parameters, etc.). Let o_s be the corresponding observation given to the agent, which only encodes the current view, the current task, and extra information that the agent keeps track of (e.g., time, action history, etc.).

Algorithm 1 Task episode, given agent help-request policy π_ask and navigation policy π_nav
 1: agent receives task request e
 2: initialize the agent mode: m ← main task
 3: initialize the language instruction: l_0 ← e
 4: initialize the target image: I^tgt_0 ← None
 5: for t = 1 ... T do
 6:   let s_t be the current state, o_t the current observation, and τ_t = (v_t, ψ_t, ω_t) the current pose
 7:   agent makes a help-request decision a^ask_t ∼ π_ask(o_t)
 8:   carry on the task from the previous step: l_t ← l_{t−1}, I^tgt_t ← I^tgt_{t−1}
 9:   if a^ask_t = request_help then
10:     set mode: m ← sub task
11:     request help: (r, I^depart, I^goal) ← ANNA(s_t)
12:     set the language instruction: l_t ← l^r
13:     set the target image: I^tgt_t ← I^depart
14:     set the navigation action: a^nav_t ← (p^r_0, ψ^r − ψ_t, ω^r − ω_t), where p^r_0 is the start location of route r
15:   else
16:     agent chooses a navigation action: a^nav_t ∼ π_nav(o_t)
17:     if a^nav_t = stop then
18:       if m = main task then
19:         break
20:       else
21:         set mode: m ← main task
22:         set the language instruction: l_t ← e
23:         set the target image: I^tgt_t ← I^goal
24:         set the navigation action: a^nav_t ← (v_t, 0, 0)
25:       end if
26:     end if
27:   end if
28:   agent executes a^nav_t to go to the next location
29: end for

The agent maintains two stochastic policies: a navigation policy π_nav and a help-request policy π_ask. Each policy maps an observation to a probability distribution over its action space. Navigation actions are tuples (v, Δψ, Δω), where v is a next location adjacent to the current location and (Δψ, Δω) is the camera angle change. A special stop action is added to the set of navigation actions to signal that the agent wants to terminate the main task or a subtask (by departing a route). The action space of the help-request policy contains two actions: request_help and do_nothing. The request_help action is only available when the agent is in a zone of attention. Algorithm 1 describes the effects of these actions during a task episode.

Imitation Learning Objective. The agent is trained with imitation learning to mimic behaviors suggested by a navigation teacher π*_nav and a help-request teacher π*_ask, who have access to the fully-observed states. In general, imitation learning (Daumé III et al., 2009; Ross et al., 2011; Ross and Bagnell, 2014; Chang et al., 2015; Sun et al., 2017) finds a policy π that minimizes the expected imitation loss L with respect to a teacher policy π* under the agent-induced state distribution D_π:

min_π  E_{s ∼ D_π} [L(s, π, π*)]    (5)

We frame the HANNA problem as an instance of Imitation Learning with Indirect Intervention (I3L) (Nguyen et al., 2019). Under this framework, assistance is viewed as augmenting the current environment with new information. Interpreting the assistance is cast as finding the optimal acting policy in the augmented environment. Formally, I3L searches for policies that optimize:

min_{π_ask, π_nav}  E_{s ∼ D^state_{π_nav, E}, E ∼ D^env_{π_ask}} [L(s)]    (6)

L(s) = L_nav(s, π_nav, π*_nav) + L_ask(s, π_ask, π*_ask)

where L_nav and L_ask are the navigation and help-request loss functions, respectively, D^env_{π_ask} is the environment distribution induced by π_ask, and D^state_{π_nav, E} is the state distribution induced by π_nav in environment E. A common choice for the loss functions is the agent-estimated negative log-likelihood of the reference action:

L_NL(s, π, π*) = − log π(a* | o_s)    (7)

where a* is the reference action suggested by π*. We introduce novel loss functions that enforce more complex behaviors than simply mimicking reference actions.

Reference Actions. The navigation teacher suggests a reference action a^nav* that takes the agent to the next location on the shortest path from its location to the target location. Here, the target location refers to the nearest goal location (if no target image is available), or the location referenced by the target image (provided by ANNA). If the agent is already at the target location, a^nav* = stop. To decide whether the agent should request help, the help-request teacher verifies the following conditions (a decision-rule sketch follows the list):

1. lost: the agent will not get (strictly) closer to the target location in the future;
2. uncertain_wrong: the entropy of the navigation action distribution is greater than or equal to a threshold γ (precisely, we use efficiency, i.e., entropy of base |A^nav| = 37, where A^nav is the navigation action space), and the highest-probability predicted navigation action is not the one suggested by the navigation teacher;
3. never_asked: the agent has never previously requested help at the current location.

If condition (1) or (2) is satisfied, and condition (3) is also satisfied, we set a^ask* = request_help; otherwise, a^ask* = do_nothing.
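As a concrete illustration of this teacher's decision rule, here is a minimal Python sketch; the function name, argument layout, and the threshold value are illustrative assumptions, while the normalized-entropy base of 37 follows the parenthetical note above.

```python
import math

def help_request_teacher(dists_from_now, nav_probs, reference_action,
                         asked_here_before, gamma=0.5, num_actions=37):
    """Return ("request_help" or "do_nothing", reason vector) from conditions (1)-(3).
    `dists_from_now` holds the agent's distance to the target at the current step and
    at all future steps (available because labeling happens retrospectively)."""
    # (1) lost: the agent never gets strictly closer to the target later on.
    lost = min(dists_from_now[1:], default=dists_from_now[0]) >= dists_from_now[0]
    # (2) uncertain_wrong: high normalized entropy and wrong argmax action.
    entropy = -sum(p * math.log(p, num_actions) for p in nav_probs if p > 0)
    argmax_action = max(range(len(nav_probs)), key=lambda a: nav_probs[a])
    uncertain_wrong = entropy >= gamma and argmax_action != reference_action
    # (3) never_asked: no previous request at this location for this task.
    never_asked = not asked_here_before
    reason = [lost, uncertain_wrong, never_asked]
    ask = (lost or uncertain_wrong) and never_asked
    return ("request_help" if ask else "do_nothing"), reason
```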

Curiosity-Encouraging Navigation Teacher. In addition to a reference action, the navigation teacher returns A^nav⊗, the set of all non-reference actions that the agent took at the current location while executing the same language instruction:

A^nav⊗_t = {a ∈ A^nav : ∃ t′ < t such that v_t = v_{t′}, l_t = l_{t′}, a = a^nav_{t′} ≠ a^nav*_{t′}}    (8)

where A^nav is the navigation action space. We devise a curiosity-encouraging loss L_curious, which minimizes the log-likelihoods of actions in A^nav⊗. This loss prevents the agent from repeating past mistakes and motivates it to explore untried actions. The navigation loss is:

L_nav(s, π_nav, π*_nav) = −log π_nav(a^nav* | o_s) + α · (1/|A^nav⊗|) Σ_{a ∈ A^nav⊗} log π_nav(a | o_s)    (9)

where the first term is L_NL(s, π_nav, π*_nav), the second term is L_curious(s, π_nav, π*_nav), and α ∈ [0, ∞) is a weight hyperparameter.
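For clarity, here is a minimal PyTorch-style sketch of the per-step navigation loss in Eq. 9; it is an illustrative implementation under our reading of the equation, not the released training code, and `bad_actions` stands for the set A^nav⊗.

```python
import torch

def navigation_loss(log_probs, reference_action, bad_actions, alpha=1.0):
    """Eq. 9 (sketch): negative log-likelihood of the teacher action plus an
    alpha-weighted term that pushes down previously tried non-reference actions."""
    nll = -log_probs[reference_action]                      # L_NL
    if bad_actions:                                         # L_curious
        curious = log_probs[list(bad_actions)].mean()
    else:
        curious = torch.zeros((), dtype=log_probs.dtype)
    return nll + alpha * curious

# Usage on a toy 4-action distribution: action 0 is the reference,
# actions 2 and 3 were already tried (and wrong) at this location.
log_probs = torch.log_softmax(torch.tensor([1.0, 0.5, -0.2, 0.3]), dim=0)
loss = navigation_loss(log_probs, reference_action=0, bad_actions={2, 3})
```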

Retrospective Interpretable Help-Request Teacher. In deciding whether the agent should ask for help, the help-request teacher must consider the agent's future situations. Standard imitation learning algorithms (e.g., DAgger) employ an online mode of interaction, which queries the teacher at every time step. This mode of interaction is not suitable for our problem: the teacher would have to predict the agent's future actions if it were queried before the episode is finished. To overcome this challenge, we introduce a more efficient retrospective mode of interaction, which waits until the agent completes an episode and then queries the teacher for reference actions for all time steps at once. With this approach, because the future actions at each time step are now fully observed, they can be taken into consideration when computing the reference action. In fact, we prove that the retrospective teacher is optimal for teaching the agent to determine the lost condition, which is the only condition that requires knowing the agent's future.

Lemma 2. (proof in Appendix B) At any time step, the retrospective help-request teacher suggests the action that results in the agent getting closer to the target location in the future under its current navigation policy (if such an action exists).

Figure 2: Our hierarchical recurrent model architecture (the navigation network). The help-request network is mostly similar, except that the navigation action distribution is fed as an input to compute the "state features".

To help the agent better justify its help-request decisions, we train a reason classifier Φ to predict which conditions are satisfied. To train this classifier, the teacher provides a reason vector ρ* ∈ {0, 1}³, where ρ*_i = 1 indicates that the i-th condition is met. We formulate this prediction problem as multi-label binary classification and employ a binary logistic loss for each condition. Learning to predict the conditions helps the agent make more accurate and interpretable decisions. The help-request loss is:

L_ask(s, π_ask, π*_ask) = −log π_ask(a^ask* | o_s) − (1/3) Σ_{i=1}^{3} [ρ*_i log ρ_i + (1 − ρ*_i) log(1 − ρ_i)]    (10)

where the first term is L_NL(s, π_ask, π*_ask), the second term is L_reason(s, π_ask, π*_ask), (a^ask*, ρ*) = π*_ask(s), and ρ = Φ(o_s) are the agent-estimated likelihoods of the conditions.
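A similar sketch can be written for the help-request loss in Eq. 10. It assumes the reason classifier Φ produces logits, so the binary logistic loss reduces to a standard library call; it is illustrative, not the released code.

```python
import torch
import torch.nn.functional as F

def help_request_loss(ask_log_probs, ref_ask_action, reason_logits, ref_reasons):
    """Eq. 10 (sketch): imitation NLL for the help-request action plus an averaged
    binary logistic loss over the three teacher-provided reason bits."""
    nll = -ask_log_probs[ref_ask_action]                              # L_NL
    reason_loss = F.binary_cross_entropy_with_logits(                 # L_reason
        reason_logits, ref_reasons.float(), reduction="mean")
    return nll + reason_loss

# Usage: 2 help-request actions, 3 reason conditions.
ask_log_probs = torch.log_softmax(torch.tensor([0.2, 1.1]), dim=0)
loss = help_request_loss(ask_log_probs, ref_ask_action=1,
                         reason_logits=torch.tensor([0.3, -1.2, 2.0]),
                         ref_reasons=torch.tensor([1, 0, 1]))
```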

5 Hierarchical Recurrent Architecture

We model the navigation policy and the help-request policy as two separate neural networks. The two networks have similar architectures, which consist of three main components: the text-encoding component, the inter-task component, and the intra-task component (Figure 2). We use self-attention instead of recurrent neural networks to better capture long-term dependencies, and develop a novel cosine-similarity attention and a ResNet-based time encoding. Details on the computations in each module are in the Appendix.

The text-encoding component computes a text memory M^text, which stores the hidden representation of the current language instruction.

Table 2: Data split.

Split          | Environments | Tasks  | ANNA Instructions
Train          | 51           | 82,484 | 8,586
Val SeenEnv    | 51           | 5,001  | 4,287
Val UnseenAll  | 7            | 5,017  | 2,103
Test SeenEnv   | 51           | 5,004  | 4,287
Test UnseenAll | 10           | 5,012  | 2,331

The inter-task module computes a vector h^inter_t representing the state of the current task's execution. During the episode, every time the current task is altered (because the agent requests help or departs a route), the agent re-encodes the new language instruction to generate a new text memory and resets the inter-task state to a zero vector. The intra-task module computes a vector h^intra_t representing the state of the entire episode. To compute this state, we first calculate ĥ^intra_t, a tentative current state, and h̄^intra_t, a weighted combination of the past states at nearly identical situations. h^intra_t is computed as:

h^intra_t = ĥ^intra_t − β · h̄^intra_t    (11)

β = σ(W_gate · [ĥ^intra_t ; h̄^intra_t])    (12)

Eq. 11 creates a context-sensitive dissimilarity between the current state and the past states at nearly identical situations. The scale vector β determines how large the dissimilarity is based on the inputs. This formulation incorporates related past information into the current state, thus enabling the agent to optimize the curiosity-encouraging loss effectively. Finally, h^intra_t is passed through a softmax layer to produce an action distribution.
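A minimal PyTorch-style sketch of the gated combination in Eqs. 11-12; the module and variable names are illustrative.

```python
import torch
import torch.nn as nn

class GatedStateCombiner(nn.Module):
    """Eqs. 11-12 (sketch): subtract a gated fraction of the attended past
    states from the tentative current state."""
    def __init__(self, hidden_size):
        super().__init__()
        self.w_gate = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, h_tentative, h_past):
        # beta = sigma(W_gate [h_tentative ; h_past])          (Eq. 12)
        beta = torch.sigmoid(self.w_gate(torch.cat([h_tentative, h_past], dim=-1)))
        # h_intra = h_tentative - beta * h_past                (Eq. 11)
        return h_tentative - beta * h_past

# Usage with a batch of 2 and hidden size 8.
combiner = GatedStateCombiner(hidden_size=8)
h_intra = combiner(torch.randn(2, 8), torch.randn(2, 8))
```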

6 Experimental Setup

Dataset. We generate a dataset of object-finding tasks in the HANNA environments to train and evaluate our agent. Table 2 summarizes the dataset split. Our dataset features 289 object types; the language instruction vocabulary contains 2,332 words. The numbers of locations on the shortest paths to the requested objects are restricted to be between 5 and 15. With an average edge length of 2.25 meters, the agent has to travel about 9 to 32 meters to reach its goals. We evaluate the agent in environments that are seen during training (SEENENV) and in environments that are not seen (UNSEENALL). Even in the case of SEENENV, the tasks and the ANNA language instructions given during evaluation were never given in the same environments during training.

Table 3: Results on test splits. The agent with "perfect assistance interpretation" uses the teacher navigation policy (π*_nav) to make decisions when executing a subtask from ANNA. Results of our final system are in bold.

                                       | SEENENV                                      | UNSEENALL
Agent                                  | SR↑ (%) | SPL↑ (%) | Nav. Err.↓ (m) | Req/task↓ | SR↑ (%) | SPL↑ (%) | Nav. Err.↓ (m) | Req/task↓
Non-learning agents
  RANDOMWALK                           | 0.54    | 0.33     | 15.38          | 0.0       | 0.46    | 0.23     | 15.34          | 0.0
  FORWARD10                            | 5.98    | 4.19     | 14.61          | 0.0       | 6.36    | 4.78     | 13.81          | 0.0
Learning agents
  No assistance                        | 17.21   | 13.76    | 11.48          | 0.0       | 8.10    | 4.23     | 13.22          | 0.0
  Learn to interpret assistance (ours) | 88.37   | 63.92    | 1.33           | 2.9       | 47.45   | 25.50    | 7.67           | 5.8
Skylines
  SHORTEST                             | 100.00  | 100.00   | 0.00           | 0.0       | 100.00  | 100.00   | 0.00           | 0.0
  Perfect assistance interpretation    | 90.99   | 68.87    | 0.91           | 2.5       | 83.56   | 56.88    | 1.83           | 3.2

Hyperparameters. See Appendix.

Baselines and Skylines. We compare our agent against the following non-learning agents: 1. SHORTEST: uses the navigation teacher policy to make decisions (this is a skyline); 2. RANDOMWALK: randomly chooses a navigation action at every time step; 3. FORWARD10: navigates to the next location closest to the center of the current view, advancing for 10 time steps. We compare our learned help-request policy with the following heuristics: 1. NOASK: does not request help; 2. RANDOMASK: randomly chooses to request help with a probability of 0.2, which is the average help-request ratio of our learned agent; 3. ASKEVERY5: requests help as soon as it has walked at least 5 time steps.

Evaluation metrics. Our main metrics are: success rate (SR), the fraction of examples on which the agent successfully solves the task; navigation error, the average (shortest-path) distance between the agent's final location and the nearest goal from that location; and SPL (Anderson et al., 2018a), which weights task success rate by travel distance as follows:

SPL = (1/N) Σ_{i=1}^{N} S_i · L_i / max(P_i, L_i)    (13)

where N is the number of tasks, S_i indicates whether task i is successful, P_i is the agent's travel distance, and L_i is the shortest-path distance to the goal nearest to the agent's final location.
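For reference, SPL in Eq. 13 can be computed directly from per-task records; the sketch below is a straightforward illustrative implementation.

```python
def spl(successes, path_lengths, shortest_lengths):
    """Eq. 13: Success weighted by (normalized inverse) Path Length."""
    total = 0.0
    for s_i, p_i, l_i in zip(successes, path_lengths, shortest_lengths):
        total += s_i * l_i / max(p_i, l_i)
    return total / len(successes)

# Usage: two successful tasks and one failure.
print(spl(successes=[1, 1, 0],
          path_lengths=[12.0, 30.0, 8.0],
          shortest_lengths=[10.0, 20.0, 9.0]))  # (10/12 + 20/30 + 0) / 3 ≈ 0.5
```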

7 Results

Main results. From Table 3, we see that our problem is challenging: simple heuristic-based baselines such as RANDOMWALK and FORWARD10 attain success rates of less than 7%. An agent that learns to accomplish tasks without additional assistance from ANNA succeeds only 17.21% of the time on TEST SEENENV, and 8.10% on TEST UNSEENALL.

Leveraging help from ANNA dramatically boosts the success rate, by 71.16% on TEST SEENENV and by 39.35% on TEST UNSEENALL, over not requesting help. Given the small size of our dataset (e.g., the agent has fewer than 9,000 subtask instructions to learn from), it is encouraging that our agent is successful in nearly half of its tasks. On average, the agent takes paths that are 1.38 and 1.86 times longer than the optimal paths on TEST SEENENV and TEST UNSEENALL, respectively. In unseen environments, it issues on average twice as many requests as it does in seen environments. To understand how well the agent interprets the ANNA instructions, we also provide results where our agent uses the optimal navigation policy to make decisions while executing subtasks. The large gaps on TEST UNSEENALL indicate there is still much room for future improvement, purely in learning to execute language instructions.

Table 4: Success rates (%) of agents on test splits with different types of assistance.

Assistance type        | SEENENV | UNSEENALL
Target image only      | 84.95   | 31.88
+ Language instruction | 88.37   | 47.45

Table 5: Success rates (%) of different help-request policies on test splits.

π_ask          | SEENENV SR↑ (%) | Requests/task↓ | UNSEENALL SR↑ (%) | Requests/task↓
NOASK          | 17.21           | 0.0            | 8.10              | 0.0
RANDOMASK      | 82.71           | 4.3            | 37.05             | 6.8
ASKEVERY5      | 87.39           | 3.4            | 34.42             | 7.1
Learned (ours) | 88.37           | 2.9            | 47.45             | 5.8

Does understanding language improve generalizability? Our agent is assisted with both language and visual instructions; here we disentangle the usefulness of these two modes of assistance. As seen in Table 4, the improvement from language on TEST UNSEENALL (+15.57%) is substantially larger than that on TEST SEENENV (+3.42%), largely because the agent can simply memorize the seen environments. This confirms that understanding language-based assistance effectively enhances the agent's capability of accomplishing tasks in novel environments.

Is learning to request help effective? Table 5 compares our learned help-request policy with the baselines. We find that ASKEVERY5 provides a surprisingly strong baseline for this problem, leading to an improvement of +26.32% over not requesting help on TEST UNSEENALL. Nevertheless, our learned policy, with the ability to predict the future and access to the agent's uncertainty, outperforms all baselines by at least 10.40% in success rate on TEST UNSEENALL, while making fewer help requests. The small gap between the learned policy and ASKEVERY5 on TEST SEENENV is expected because, on this split, performance is mostly determined by the model's memorization capability and is mostly insensitive to the help-request strategy.

Is the proposed model architecture effective? We implement an LSTM-based encoder-decoder model based on the architecture proposed by Wang et al. (2019). To incorporate the target image, we add an attention layer that uses the image's vector set as the attention memory. We train this model with imitation learning using the standard negative log-likelihood loss (Eq. 7), without the curiosity-encouraging and reason-prediction losses. As seen in Table 6, our hierarchical recurrent model outperforms this model by a large margin on TEST UNSEENALL (+28.2%).

Table 6: Results on TEST UNSEENALL of our model, trained with and without the curiosity-encouraging loss, and an LSTM-based encoder-decoder model (both models have about 15M parameters). "Navigation mistake repeat" is the fraction of time steps on which the agent repeats a non-optimal navigation action at a previously visited location while executing the same task. "Help-request repeat" is the fraction of help requests made at a previously visited location while executing the same task.

Model             | SR↑ (%) | Nav. mistake repeat↓ (%) | Help-request repeat↓ (%)
LSTM-ENCDEC       | 19.25   | 31.09                    | 49.37
Our model (α = 0) | 43.12   | 25.00                    | 40.17
Our model (α = 1) | 47.45   | 17.85                    | 21.10

Does the proposed imitation learning algorithm achieve its goals? The curiosity-encouraging training objective is proposed to prevent the agent from repeating mistakes in previously encountered situations. Table 6 shows that training with the curiosity-encouraging objective reduces the chance of the agent looping and making the same decisions repeatedly. As a result, its success rate is greatly boosted (+4.33% on TEST UNSEENALL) over training without this objective.

8 Conclusion

In this work, we present a photo-realistic simulator that mimics primary characteristics of real-life human assistance. We develop effective imitation learning techniques for learning to request and interpret the simulated assistance, coupled with a hierarchical neural network model for representing subtasks. Future work aims to provide more natural, linguistically realistic interaction between the agent and humans (e.g., giving the agent the ability to ask a natural question rather than just signal for help), and to establish a theoretical framework for modeling human assistance. We are also exploring ways to deploy and evaluate our methods on real-world platforms.


References

Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. 2018a. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757.

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sunderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. 2018b. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683.

Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. 2018. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540.

Joyce Y Chai, Rui Fang, Changsong Liu, and Lanbo She. 2016. Collaborative language grounding toward situated human-robot dialogue. AI Magazine, 37(4):32–45.

Joyce Y Chai, Qiaozi Gao, Lanbo She, Shaohua Yang, Sari Saba-Sadiya, and Guangyue Xu. 2018. Language to action: Towards interactive task learning with physical agents. In International Joint Conference on Artificial Intelligence.

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. 2017. Matterport3D: Learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV).

Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. 2015. Learning to search better than your teacher. In Proceedings of the International Conference on Machine Learning.

Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, and Yoav Artzi. 2019. Touchdown: Natural language navigation and spatial reasoning in visual street environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. 2019. BabyAI: A platform to study the sample efficiency of grounded language learning. In Proceedings of the International Conference on Learning Representations.

Marc-Alexandre Cote, Akos Kadar, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, Wendy Tay, and Adam Trischler. 2018. TextWorld: A learning environment for text-based games. In Computer Games Workshop at ICML/IJCAI.

Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. 2018. Embodied question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Hal Daumé III, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Machine Learning, volume 75, pages 297–325. Springer.

Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. 2018. Speaker-follower models for vision-and-language navigation. In Proceedings of Advances in Neural Information Processing Systems.

He He, Jason Eisner, and Hal Daumé III. 2012. Imitation learning by coaching. In Proceedings of Advances in Neural Information Processing Systems, pages 3149–3157.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. 2018. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In Proceedings of the Conference on Robot Learning.

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2017. Progressive growing of GANs for improved quality, stability, and variation. In Proceedings of the International Conference on Learning Representations.

Tero Karras, Samuli Laine, and Timo Aila. 2018. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Janne Karttunen, Anssi Kanervisto, Ville Hautamaki, and Ville Kyrki. 2019. From video game to real robot: The transfer between action spaces. arXiv preprint arXiv:1905.00741.


Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaskowski. 2016. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In Computational Intelligence and Games (CIG), 2016 IEEE Conference on, pages 1–8. IEEE.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Changsong Liu, Shaohua Yang, Sari Saba-Sadiya, Nishant Shukla, Yunzhong He, Song-Chun Zhu, and Joyce Chai. 2016. Jointly learning grounded task structures from language instruction and visual demonstration. In Proceedings of Empirical Methods in Natural Language Processing, pages 1482–1492.

Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of Empirical Methods in Natural Language Processing.

Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan Al-Regib, Zsolt Kira, Richard Socher, and Caiming Xiong. 2019a. Self-monitoring navigation agent via auxiliary progress estimation. In Proceedings of the International Conference on Learning Representations.

Chih-Yao Ma, Zuxuan Wu, Ghassan AlRegib, Caiming Xiong, and Zsolt Kira. 2019b. The regretful agent: Heuristic-aided navigation through progress estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6732–6740.

Dipendra Misra, Andrew Bennett, Valts Blukis, Eyvind Niklasson, Max Shatkhin, and Yoav Artzi. 2018. Mapping instructions to actions in 3D environments with visual goal prediction. In Proceedings of Empirical Methods in Natural Language Processing, pages 2667–2678. Association for Computational Linguistics.

Dipendra Misra, John Langford, and Yoav Artzi. 2017. Mapping instructions and visual observations to actions with reinforcement learning. In Proceedings of Empirical Methods in Natural Language Processing.

Dipendra K Misra, Jaeyong Sung, Kevin Lee, and Ashutosh Saxena. 2014. Tell me Dave: Context-sensitive grounding of natural language to mobile manipulation instructions. In Robotics: Science and Systems.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop.

Shiwali Mohan and John Laird. 2014. Learning goal-oriented hierarchical tasks from situated interactive instruction. In Association for the Advancement of Artificial Intelligence.

Khanh Nguyen, Debadeepta Dey, Chris Brockett, and Bill Dolan. 2019. Vision-based navigation with language-based assistance via imitation learning with indirect intervention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12527–12537.

Yuankai Qi, Qi Wu, Peter Anderson, Marco Liu, Chunhua Shen, and Anton van den Hengel. 2019. RERERE: Remote embodied referring expressions in real indoor environments. arXiv preprint arXiv:1904.10151.

Stephane Ross and J Andrew Bagnell. 2014. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979.

Stephane Ross, Geoffrey Gordon, and Drew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of Artificial Intelligence and Statistics, pages 627–635.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252.

Paul E Rybski, Kevin Yoon, Jeremy Stolarz, and Manuela M Veloso. 2007. Interactive robot task training through dialog and demonstration. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, pages 49–56. ACM.

Amr Sharaf and Hal Daumé III. 2017. Structured prediction via learning to search under bandit feedback. In Proceedings of the 2nd Workshop on Structured Prediction for Natural Language Processing, pages 17–26.

Lanbo She, Shaohua Yang, Yu Cheng, Yunyi Jia, Joyce Chai, and Ning Xi. 2014. Back to the blocks world: Learning new actions through situated human-robot dialogue. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 89–97.

Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. 2017. Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction. In Proceedings of the International Conference on Machine Learning.

Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. 2019. Vision-and-dialog navigation. arXiv preprint arXiv:1907.04957.


Emanuel Todorov, Tom Erez, and Yuval Tassa. 2012. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE.

Jack Urbanek, Angela Fan, Siddharth Karamcheti, Saachi Jain, Samuel Humeau, Emily Dinan, Tim Rocktaschel, Douwe Kiela, Arthur Szlam, and Jason Weston. 2019. Learning to speak and act in a fantasy text adventure game. arXiv preprint arXiv:1903.03094.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Kuttler, John Agapiou, Julian Schrittwieser, et al. 2017. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782.

Harm de Vries, Kurt Shuster, Dhruv Batra, Devi Parikh, Jason Weston, and Douwe Kiela. 2018. Talk the Walk: Navigating New York City through grounded dialogue. arXiv preprint arXiv:1807.03367.

Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, and Lei Zhang. 2019. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Xin Wang, Wenhan Xiong, Hongmin Wang, and William Yang Wang. 2018. Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In Proceedings of the European Conference on Computer Vision.

Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2019. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319.

Erik Wijmans, Samyak Datta, Oleksandr Maksymets, Abhishek Das, Georgia Gkioxari, Stefan Lee, Irfan Essa, Devi Parikh, and Dhruv Batra. 2019. Embodied question answering in photorealistic environments with point cloud perception. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6659–6668.

Acknowledgement

We would like to thank Debadeepta Dey for helpful discussions. We thank Jesse Thomason for the thorough review and suggestions on related work and future directions. Many thanks to the reviewers for their meticulous and insightful comments. Special thanks to Reviewer 2 for addressing our author response in detail, and to Reviewer 3 for the helpful suggestions and the careful typing-error correction.

A Proof of Lemma 1

Lemma 1. To guide the agent between any twolocations using O(logN) instructions, we needto collect instructions for Θ(N logN) locationpairs.

Proof. Consider a spanning tree of the environment graph that is rooted at $v_r$. For each node $v$, let $d(v)$ be the distance from $v$ to $v_r$. For $i = 0, \cdots, \lfloor \log_2 d(v) \rfloor - 1$, collect an instruction for the path from $v$ to its $2^i$-th ancestor, and an instruction for the reverse path. In total, no more than $2 \cdot N \log_2 N = \Theta(N \log N)$ instructions are collected.

A language-assisted path is a path that is associated with a natural language instruction. It is easy to show that, under this construction, $O(\log N)$ language-assisted paths are sufficient to traverse between $v$ and any ancestor of $v$: writing the distance to the ancestor in binary, we follow at most one collected path per bit. For any two arbitrary nodes $u$ and $v$, let $w$ be their least common ancestor. To traverse from $u$ to $v$, we first go from $u$ to $w$, then from $w$ to $v$. The total number of language-assisted paths is still $O(\log N)$.
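For concreteness, the following Python sketch (ours, not the authors' data-collection code) enumerates the location pairs prescribed by this construction on a spanning tree; `parent` is a hypothetical dict mapping each node to its parent (the root maps to None). To keep the sketch simple, we collect the $2^i$-th ancestors for every $2^i \le d(v)$, which still yields $\Theta(N \log N)$ pairs.

```python
def collect_instruction_pairs(parent, root):
    """Return (start, goal) pairs to annotate with instructions (both directions)."""
    depth = {root: 0}

    def get_depth(v):
        # distance from v to the root along the spanning tree
        if v not in depth:
            depth[v] = get_depth(parent[v]) + 1
        return depth[v]

    pairs = set()
    for v in parent:
        if v == root:
            continue
        d = get_depth(v)
        anc, i = v, 0
        # visit the 2^i-th ancestors of v (binary lifting); at most
        # floor(log2 d) + 1 of them exist, so |pairs| is O(N log N) overall
        while 2 ** i <= d:
            climb = 1 if i == 0 else 2 ** (i - 1)  # from the 2^(i-1)-th to the 2^i-th ancestor
            for _ in range(climb):
                anc = parent[anc]
            pairs.add((v, anc))   # instruction for the upward path
            pairs.add((anc, v))   # instruction for the reverse path
            i += 1
    return pairs
```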

B Proof of Lemma 2

Lemma 2. At any time step, the retrospective help-request teacher suggests the action that results in the agent getting closer to the target location in the future under its current navigation policy (if such an action exists).

The retrospective mode of interaction allows the teacher to observe the agent's full trajectory when computing the reference actions. Note that this trajectory is generated by the agent's own navigation policy ($\pi^{nav}$) because, during training, we use the agent's own policies to act. During a time step, after observing the future trajectory, the teacher determines whether the lost condition is satisfied or not, i.e., whether the agent will get closer to the target location or not.

1. If the condition is not met, the teacher suggests the do nothing action: it is observed that the agent will get closer to the target location without requesting help by following the current navigation policy.

2. If the condition is met, the teacher suggests the request help action. First of all, in this case, it is observed that the do nothing action will lead to no progress toward the target location. If the request help action also leads to no progress, then it does not matter which action the teacher suggests. In contrast, if the request help action helps the agent get closer to the target location, the teacher has suggested the desired action.
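This decision rule can be summarized by a small Python sketch (assumed pseudocode, not the released implementation). Here `current_dist(t)` and `future_dist(t, action)` are hypothetical oracles giving, respectively, the agent's current distance to the target and the distance it will eventually reach if it takes `action` at step t and then follows its current policies.

```python
def teacher_action(t, current_dist, future_dist):
    """Reference help-request action suggested by the retrospective teacher."""
    # Will the agent get closer to the target on its own (without requesting help)?
    progresses_alone = future_dist(t, "do_nothing") < current_dist(t)
    if progresses_alone:
        # Case 1: the lost condition is not met, so no help is needed.
        return "do_nothing"
    # Case 2: do_nothing leads to no progress.  Suggesting request_help is correct
    # whenever requesting help does lead to progress; if neither action helps,
    # either suggestion is acceptable, so this is still safe.
    return "request_help"
```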

C Features

At time step t, the model makes use of the following features:

• $I^{cur}_t$: set of visual feature vectors representing the current panoramic view;
• $I^{tgt}_t$: set of visual feature vectors representing the target location's panoramic view;
• $a_{t-1}$: embedding of the last action;
• $\delta_t$: a vector representing how similar the target view is to the current view;
• $\eta^{loc}_t$: local-time embedding encoding the number of time steps since the last time the inter-task module was reset;
• $\eta^{glob}_t$: global-time embedding encoding the number of time steps since the beginning of the episode;
• $P^{nav}_t$: the navigation action distribution output by $\pi^{nav}$. Each action corresponding to a view angle is mapped to a static index to ensure that the order of the actions is independent of the view angle. This feature is only used by the help-request network.

Visual features. To embed the first-person view images, we use the visual feature vectors provided by Anderson et al. (2018b), which are extracted from a ResNet-152 (He et al., 2016) pretrained on ImageNet (Russakovsky et al., 2015). Following the Speaker-Follower model (Fried et al., 2018), at time step t, we provide the agent with a feature set $I^{cur}_t$ representing the current panoramic view. The set consists of visual feature vectors that represent all 36 possible first-person view angles (12 headings × 3 elevations). Similarly, the panoramic view at the target location is given by a feature set $I^{tgt}_t$. Each next location is associated with the view angle whose center is closest to it (in angular distance). The embedding of a navigation action $(v, \Delta\psi, \Delta\omega)$ is constructed by concatenating the feature vector of the corresponding view with an orientation feature vector $[\sin\Delta\psi; \cos\Delta\psi; \sin\Delta\omega; \cos\Delta\omega]$, where $\Delta\psi$ and $\Delta\omega$ are the camera angle changes needed to shift from the current view to the view associated with $v$. The stop action is mapped to a zero vector. The action embedding set $E^{nav}_t$ contains embeddings of all navigation actions at time step t.
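To make the construction concrete, here is a small NumPy sketch of the action embedding (our reading of the description above, not the released code); in particular, we assume the fourth orientation component is $\cos\Delta\omega$.

```python
import numpy as np

def action_embedding(view_feature, delta_psi, delta_omega, is_stop=False):
    """view_feature: ResNet feature of the associated view angle (e.g. 2048-d);
       delta_psi, delta_omega: heading/elevation change (radians) to face that view."""
    if is_stop:
        # the stop action is mapped to a zero vector of the same size
        return np.zeros(view_feature.shape[0] + 4)
    orientation = np.array([np.sin(delta_psi), np.cos(delta_psi),
                            np.sin(delta_omega), np.cos(delta_omega)])
    return np.concatenate([view_feature, orientation])
```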

Target similarity features. The vector $\delta_t$ represents the similarity between $I^{cur}_t$ and $I^{tgt}_t$. To compute this vector, we first construct a matrix $C$, where $C_{i,j}$ is the cosine similarity between $I^{cur}_i$ and $I^{tgt}_j$. $\delta_t$ is then obtained by taking the maximum value of each row of $C$. The intuition is, for each current view angle, we find the most similar target view angle. If the current view perfectly matches the target view, $\delta_t$ is a vector of ones.
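A minimal NumPy sketch of this computation (function name and shapes are ours):

```python
import numpy as np

def target_similarity(I_cur, I_tgt, eps=1e-8):
    """I_cur: (36, d) current panoramic view features; I_tgt: (36, d) target view features."""
    cur = I_cur / (np.linalg.norm(I_cur, axis=1, keepdims=True) + eps)
    tgt = I_tgt / (np.linalg.norm(I_tgt, axis=1, keepdims=True) + eps)
    C = cur @ tgt.T        # C[i, j] = cosine similarity between I_cur[i] and I_tgt[j]
    return C.max(axis=1)   # delta_t: best-matching target angle for each current angle
```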

Time features. The local-time embedding is computed by a residual neural network (He et al., 2016) that learns a time-incrementing operator

$\eta^{loc}_t = \text{INCTIME}(\eta^{loc}_{t-1})$  (14)

where $\text{INCTIME}(x) = \text{LAYERNORM}(x + \text{RELU}(\text{LINEAR}(x)))$. The global-time embedding is computed similarly but also incorporates the current local-time embedding

$\eta^{glob}_t = \text{INCTIME}([\eta^{loc}_t; \eta^{glob}_{t-1}])$  (15)

The linear layers of the local and global time modules do not share parameters. Our learned time features generalize to previously unseen numbers of steps. They allow us to evaluate the agent on longer episodes than during training, significantly reducing training cost. We use the sinusoid encoding (Vaswani et al., 2017) for the text-encoding module, and the ResNet-based encoding for the decoder modules. In our preliminary experiments, we also experimented with using the sinusoid encoding in all modules, but doing so significantly degraded success rate.
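A PyTorch sketch of the INCTIME operator (one plausible reading, not the released module). The global-time module is analogous but reads the concatenation $[\eta^{loc}_t; \eta^{glob}_{t-1}]$ and has its own linear layer.

```python
import torch
import torch.nn as nn

class IncTime(nn.Module):
    """INCTIME(x) = LayerNorm(x + ReLU(Linear(x)))."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(x + torch.relu(self.linear(x)))

# Usage: roll the local-time embedding forward one step per time step.
inc_local = IncTime(256)
eta_loc = torch.zeros(1, 256)   # reset to zeros whenever the inter-task module resets
eta_loc = inc_local(eta_loc)    # eta_loc_t = INCTIME(eta_loc_{t-1})
```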

D Model

Modules. The building blocks of our architecture are the Transformer modules (Vaswani et al., 2017):

• TRANSENCODER(l) is a Transformer-style encoder, which generates a set of feature vectors representing an input sequence l;
• MULTIATTEND(q, K, V) is a multi-headed attention layer that takes as input a query vector q, a set of key vectors K, and a set of value vectors V. The output is combined with a residual connection (He et al., 2016) and is layer-normalized (Lei Ba et al., 2016);
• FFN(x) is a Transformer-style feed-forward layer, which consists of a regular feed-forward layer with a hidden layer and the RELU activation function, followed by a residual connection and layer normalization. The hidden size of the feed-forward layer is four times larger than the input/output size (a sketch follows this list).
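As referenced above, the FFN module can be sketched in a few lines of PyTorch (our reading, not the released code):

```python
import torch.nn as nn

class FFN(nn.Module):
    """Feed-forward layer with a 4x hidden size, residual connection, and layer norm."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.ReLU(),
            nn.Linear(4 * dim, dim),
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(x + self.net(x))
```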

In addition, we devise the following new attention modules:

• SELFATTEND(q, K, V) implements a multi-headed attention layer internally. It calculates an output h = MULTIATTEND(q, K, V). After that, the input q and output h are appended to K and V, respectively. This module is different from the Transformer's self-attention in that the keys and values are distinct.
• SIMATTEND(q, K, V) computes a weighted value $h = \sum_i a_i V_i$, where each weight $a_i$ is defined as

$a_i = \mathbb{I}\{\hat{a}_i > 0.9\} \frac{\hat{a}_i}{\sum_j \hat{a}_j}$  (16)

$\hat{a}_i = \text{COSINESIMILARITY}(q, K_i)$  (17)

where COSINESIMILARITY(·, ·) returns the cosine similarity between two vectors, and $\mathbb{I}\{\cdot\}$ is an indicator function. Intuitively, this module finds keys that are nearly identical to the query and returns the weighted average of the values corresponding to those keys. We use this module to obtain a representation of the related past, which is crucial to enforcing curiosity-encouraging training; a sketch is given below.
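A sketch of SIMATTEND (our reading of Eqs. 16-17, not the released code); we assume the weights are renormalized over the selected keys so the output is a weighted average of their values.

```python
import torch
import torch.nn.functional as F

def sim_attend(q, K, V, threshold=0.9, eps=1e-8):
    """q: (d,) query; K: (n, d) keys; V: (n, d) values."""
    sims = F.cosine_similarity(q.unsqueeze(0), K, dim=1)   # a_i for every key
    weights = (sims > threshold).float() * sims             # keep only nearly identical keys
    denom = weights.sum()
    if denom < eps:
        # no nearly identical past situation: return a zero vector
        return torch.zeros_like(q)
    return (weights / denom) @ V                             # weighted average of matching values
```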

We now describe the navigation network in detail. For brevity of notation, we omit the nav superscripts in all variables.

Text-encoding component. The agent maintains a text memory $M^{text}$, which stores the hidden representation of the current language instruction $l_t$. At the beginning of time, the agent encodes the task request to generate an initial text memory. During time step t, if the current task is altered (due to the agent requesting help or departing a route), the agent encodes the new language instruction to generate a new text memory:

Hyperparameter                               Value

Common
  Hidden size                                256
  Navigation action embedding size           256
  Help-requesting action embedding size      128
  Word embedding size                        256
  Number of self-attention heads             8
  Number of instruction-attention heads      8
  ResNet-extracted visual feature size       2048

Help-request teacher
  Uncertainty threshold ($\gamma$)           0.25

Training
  Optimizer                                  Adam
  Number of training iterations              $3 \times 10^4$
  Learning rate                              $10^{-4}$
  Batch size                                 32
  Dropout ratio                              0.3
  Training time steps                        20
  Maximum instruction length                 50
  Curiosity-encouraging weight ($\alpha$)    1.0 (*)

Evaluation
  Success radius ($\epsilon_{success}$)      2 meters
  Attention radius ($\epsilon_{attn}$)       2 meters
  Evaluation time steps                      50

Table 7: Hyperparameters. (*) Training collapses when using $\alpha = 1$ to train the agent with the help-request policy baselines (NOASK, RANDOMASK, ASKEVERY5). Instead, we use $\alpha = 0.5$ in those experiments.

$M^{text} = \text{TRANSENCODER}(l_t), \quad \text{if } l_t \neq l_{t-1} \text{ or } t = 0$  (18)
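A tiny sketch of this update rule (ours); `trans_encoder` stands in for TRANSENCODER and is assumed to return the per-token representations that form the text memory.

```python
def update_text_memory(trans_encoder, M_text, l_t, l_prev, t):
    """Re-encode the instruction only at t = 0 or when the instruction changes."""
    if t == 0 or l_t != l_prev:
        M_text = trans_encoder(l_t)
    return M_text
```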

Inter-task component. The inter-task module computes a vector $h^{inter}_t$ representing the state of the current task's execution. This state and the local-time embedding are reset to zero vectors every time the agent switches tasks.³ Otherwise, a new state is computed as follows:

$h^{inter}_t = \text{SELFATTEND}(q^{inter}_t, M^{inter}_{in}, M^{inter}_{out})$  (19)

$q^{inter}_t = W_{inter}[c^{inter}_t; a_{t-1}; \delta_t] + \eta^{loc}_t$  (20)

$c^{inter}_t = \text{DOTATTEND}(h^{inter}_{t-1}, I^{cur}_t)$  (21)

where $M^{inter}_{in} = \{q^{inter}_0, \cdots, q^{inter}_{t-1}\}$ and $M^{inter}_{out} = \{h^{inter}_0, \cdots, h^{inter}_{t-1}\}$ are the input and output inter-task memories, and DOTATTEND(q, M) is the dot-product attention (Luong et al., 2015).

³ In practice, every time we reset the state and the local time embedding in an episode, we also reset all states and local time embeddings in all episodes in the same batch.
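DOTATTEND can be sketched as standard dot-product attention over a memory (a minimal PyTorch version, ours):

```python
import torch

def dot_attend(q, M):
    """q: (d,) query; M: (n, d) memory.  Returns an attention-weighted sum of memory entries."""
    scores = M @ q                            # dot-product score for each memory entry
    weights = torch.softmax(scores, dim=0)    # attention distribution over the memory
    return weights @ M                        # weighted combination of memory entries
```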

Next, $h^{inter}_t$ is used to select which part of the language instruction should be interpreted in this step:

$c^{text}_t = \text{FFN}(\tilde{c}^{text}_t)$  (22)

$\tilde{c}^{text}_t = \text{MULTIATTEND}(h^{inter}_t, M^{text}_t, M^{text}_t)$  (23)

Finally, $c^{text}_t$ is used to select which areas of the current view and the target view the agent should focus on:

$c^{cur}_t = \text{DOTATTEND}(c^{text}_t, I^{cur}_t)$  (24)

$c^{tgt}_t = \text{DOTATTEND}(c^{text}_t, I^{tgt}_t)$  (25)

Intra-task component. The intra-task module computes a vector $h^{intra}_t$ representing the state of the entire episode. To compute this state, we first calculate $\bar{h}^{intra}_t$, a tentative current state, and $\hat{h}^{intra}_t$, a weighted combination of the past states at nearly identical situations:

$\bar{h}^{intra}_t = \text{FFN}\big(\text{SELFATTEND}(q^{intra}_t, M^{intra}_{in}, M^{intra}_{out})\big)$  (26)

$\hat{h}^{intra}_t = \text{SIMATTEND}(q^{intra}_t, M^{intra}_{in}, M^{intra}_{out})$  (27)

$q^{intra}_t = W_{intra}\big[c^{text}_t; c^{cur}_t; c^{tgt}_t; \delta_t\big] + \eta^{glob}_t$  (28)

where $M^{intra}_{in} = \{q^{intra}_0, \cdots, q^{intra}_{t-1}\}$ and $M^{intra}_{out} = \{h^{intra}_0, \cdots, h^{intra}_{t-1}\}$ are the input and output intra-task memories. The final state is determined by subtracting a scaled version of $\hat{h}^{intra}_t$ from $\bar{h}^{intra}_t$:

$h^{intra}_t = \bar{h}^{intra}_t - \beta \cdot \hat{h}^{intra}_t$  (29)

$\beta = \sigma\big(W_{gate} \cdot [\bar{h}^{intra}_t; \hat{h}^{intra}_t]\big)$  (30)

Finally, the navigation action distribution is computed as follows:

$P^{nav}_t = \text{SOFTMAX}\big(W_{out}\, h^{intra}_t\, W_{act}\, E^{nav}_t\big)$  (31)

where $E^{nav}_t$ is a matrix containing embeddings of the navigation actions. The computations in the help-request network are almost identical, except that (a) the navigation action distribution $P^{nav}_t$ is fed as an extra input to the intra-task component (Eq. 28), and (b) the help-request and reason distributions are calculated as follows:

$P^{ask}_t = \text{SOFTMAX}\big(E^{ask}_t\, E^{reason}_t\, h^{ask,intra}_t\big)$  (32)

$P^{reason}_t = \text{SOFTMAX}\big(E^{reason}_t\, h^{ask,intra}_t\big)$  (33)

where $E^{ask}_t$ contains embeddings of the help-request actions and $E^{reason}_t$ contains embeddings of the reason labels.

Figure 3: Help-request behavior on TEST UNSEENALL: (a) fraction of time steps where the agent requests help and (b) predicted and true condition distributions. The already asked condition is the negation of the never asked condition.

Figure 4: Accuracy, precision, recall, and F-1 scores in predicting the help-request conditions on TEST UNSEENALL. The already asked condition is the negation of the never asked condition.

E Hyperparameters

Table 7 summarizes all training and evaluation hyperparameters. Training our agent takes approximately 9 hours on a single Titan Xp GPU.

F Analysis

Ablation Study. Table 8 shows an ablation study of the techniques we propose in this paper. We observe that the performance on VAL SEENENV of the model trained without the curiosity-encouraging loss (α = 0) is slightly higher than that of our final model, indicating that training with the curiosity-encouraging loss slightly hurts memorization. This is expected because the model is relatively small but has to devote part of its parameters to learning the curiosity-encouraging behavior. On the other hand, optimizing with the curiosity-encouraging loss helps our agent generalize better to unseen environments.

                                          VAL SEENENV                                 VAL UNSEENALL
Agent                                     SR↑(%)  SPL↑(%)  Nav.Err.↓(m)  Requests/task↓  SR↑(%)  SPL↑(%)  Nav.Err.↓(m)  Requests/task↓
Final agent                               87.24   63.02    1.21          2.9             45.64   22.68    7.72          5.9
No inter-task reset                       83.62   60.44    1.44          3.2             40.60   19.89    8.26          7.3
No condition prediction                   72.33   47.10    2.64          4.7             47.88   24.68    6.61          7.9
No cosine-similarity attention (β = 0)    83.08   59.52    1.68          3.0             43.69   22.44    7.94          6.9
No curiosity-encouraging loss (α = 0)     87.36   70.18    1.25          2.0             39.23   20.78    8.88          7.5

Table 8: Ablation study on our proposed techniques. Models are evaluated on the validation splits. Best results in each column are in bold.

Predicting help-request conditions produces contrasting effects on the agent in seen and unseen environments, boosting performance in VAL SEENENV while slightly degrading performance in VAL UNSEENALL. We investigate the help-request patterns and find that, in both types of environments, the agent makes significantly more requests when not learning to predict the conditions. Specifically, the no-condition-prediction agent meets the uncertain wrong condition considerably more often (+5.2% on VAL SEENENV and +6.4% on VAL UNSEENALL), and also requests help at a previously visited location while executing the same task more frequently (+7.22% on VAL SEENENV and +8.59% on VAL UNSEENALL). This phenomenon highlights a challenge in training the navigation and help-request policies jointly: as the navigation policy gets better, there are fewer positive examples (i.e., situations where the agent needs help) for the help-request policy to learn from. In this specific case, learning a navigation policy that is less accurate in seen environments is beneficial to the agent when it encounters unseen environments, because such a navigation policy creates more positive examples for training the help-request policy. Devising learning strategies that learn both efficient navigation and help-request policies is an exciting future direction.

Help-request Behavior on TEST UNSEENALL. Our final agent requests help on about 18% of the total number of time steps (Figure 3a). Overall, it learns a conservative help-request policy (Figure 3b). Figure 4 shows how accurately our agent predicts the help-request conditions. The agent achieves high precision in predicting the lost and uncertain wrong conditions (76.9% and 86.6%), but achieves lower recall in all conditions (less than 50%). Surprisingly, it predicts the already asked condition less accurately than the other two, even though this condition is intuitively fairly simple to recognize.