
Answering Complex Open-domain Questions Through Iterative Query Generation

Peng Qi†  Xiaowen Lin∗†  Leo Mehr∗†  Zijian Wang∗‡  Christopher D. Manning†
† Computer Science Department  ‡ Symbolic Systems Program

Stanford University
{pengqi, veralin, leomehr, zijwang, manning}@cs.stanford.edu

Abstract

It is challenging for current one-step retrieve-and-read question answering (QA) systems to answer questions like “Which novel by the author of ‘Armada’ will be adapted as a feature film by Steven Spielberg?” because the question seldom contains retrievable clues about the missing entity (here, the author). Answering such a question requires multi-hop reasoning, where one must gather information about the missing entity (or facts) to proceed with further reasoning. We present GOLDEN (Gold Entity) Retriever, which iterates between reading context and retrieving more supporting documents to answer open-domain multi-hop questions. Instead of using opaque and computationally expensive neural retrieval models, GOLDEN Retriever generates natural language search queries given the question and available context, and leverages off-the-shelf information retrieval systems to query for missing entities. This allows GOLDEN Retriever to scale up efficiently for open-domain multi-hop reasoning while maintaining interpretability. We evaluate GOLDEN Retriever on the recently proposed open-domain multi-hop QA dataset, HOTPOTQA, and demonstrate that it outperforms the best previously published model despite not using pretrained language models such as BERT.

1 Introduction

Open-domain question answering (QA) is an important means for us to make use of knowledge in large text corpora and enables diverse queries without requiring a knowledge schema ahead of time. Enabling such systems to perform multi-step inference can further expand our capability to explore the knowledge in these corpora (e.g., see Figure 1).

∗ Equal contribution, order decided by a random number generator.

Figure 1: An example of an open-domain multi-hop question from the HOTPOTQA dev set (“Which novel by the author of ‘Armada’ will be adapted as a feature film by Steven Spielberg?”, answered by “Ready Player One”), where “Ernest Cline” is the missing entity. Note from the search results that it cannot be easily retrieved based on merely the question.

Fueled by the recently proposed large-scale QA datasets such as SQuAD (Rajpurkar et al., 2016, 2018) and TriviaQA (Joshi et al., 2017), much progress has been made in open-domain question answering. Chen et al. (2017) proposed a two-stage approach of retrieving relevant content with the question, then reading the paragraphs returned by the information retrieval (IR) component to arrive at the final answer. This “retrieve and read” approach has since been adopted and extended in various open-domain QA systems (Nishida et al., 2018; Kratzwald and Feuerriegel, 2018), but it is inherently limited to answering questions that do not require multi-hop/multi-step reasoning. This is because for many multi-hop questions, not all the relevant context can be obtained in a single retrieval step (e.g., “Ernest Cline” in Figure 1).

More recently, the emergence of multi-hop question answering datasets such as QAngaroo (Welbl et al., 2018) and HOTPOTQA (Yang et al., 2018) has sparked interest in multi-hop QA in the research community. Designed to be more challenging than SQuAD-like datasets, they feature questions that require context from more than one document to answer, testing QA systems’ abilities to infer the answer in the presence of multiple pieces of evidence and to efficiently find the evidence in a large pool of candidate documents. However, since these datasets are still relatively new, most of the existing research focuses on the few-document setting where a relatively small set of context documents is given, which is guaranteed to contain the “gold” context documents, all those from which the answer comes (De Cao et al., 2019; Zhong et al., 2019).

In this paper, we present GOLDEN (Gold Entity) Retriever. Rather than relying purely on the original question to retrieve passages, the central innovation is that at each step the model also uses IR results from previous hops of reasoning to generate a new natural language query and retrieve new evidence to answer the original question. For the example in Figure 1, GOLDEN would first generate a query to retrieve Armada (novel) based on the question, then query for Ernest Cline based on newly gained knowledge in that article. This allows GOLDEN to leverage off-the-shelf, general-purpose IR systems to scale open-domain multi-hop reasoning to millions of documents efficiently, and to do so in an interpretable manner. Combined with a QA module that extends BiDAF++ (Clark and Gardner, 2017), our final system outperforms the best previously published system on the open-domain (fullwiki) setting of HOTPOTQA without using powerful pretrained language models like BERT (Devlin et al., 2019).

The main contributions of this paper are: (a) a novel iterative retrieve-and-read framework capable of multi-hop reasoning in open-domain QA;[1] (b) a natural language query generation approach that guarantees interpretability in the multi-hop evidence gathering process; (c) an efficient training procedure to enable query generation with minimal supervision signal that significantly boosts recall of gold supporting documents in retrieval.

2 Related Work

Open-domain question answering (QA)  Inspired by the series of TREC QA competitions,[2] Chen et al. (2017) were among the first to adapt neural QA models to the open-domain setting. They built a simple inverted index lookup with TF-IDF on the English Wikipedia, and used the question as the query to retrieve the top 5 results for a reader model to produce answers with. Recent work on open-domain question answering largely follows this retrieve-and-read approach, and focuses on improving the information retrieval component with question answering performance in consideration (Nishida et al., 2018; Kratzwald and Feuerriegel, 2018; Nogueira et al., 2019). However, these one-step retrieve-and-read approaches are fundamentally ill-equipped to address questions that require multi-hop reasoning, especially when necessary evidence is not readily retrievable with the question.

[1] Code and pretrained models available at https://github.com/qipeng/golden-retriever

[2] http://trec.nist.gov/data/qamain.html

Multi-hop QA datasets  QAngaroo (Welbl et al., 2018) and HOTPOTQA (Yang et al., 2018) are among the largest-scale multi-hop QA datasets to date. While the former is constructed around a knowledge base and the knowledge schema therein, the latter adopts a free-form question generation process in crowdsourcing and span-based evaluation. Both datasets feature a few-document setting where the gold supporting facts are provided along with a small set of distractors to ease the computational burden. However, researchers have shown that this sometimes results in gameable contexts, and thus does not always test the model’s capability of multi-hop reasoning (Chen and Durrett, 2019; Min et al., 2019a). Therefore, in this work, we focus on the fullwiki setting of HOTPOTQA, which features a truly open-domain setting with more diverse questions.

Multi-hop QA systems  At a broader level, the need for multi-step searches, query task decomposition, and subtask extraction has been clearly recognized in the IR community (Hassan Awadallah et al., 2014; Mehrotra et al., 2016; Mehrotra and Yilmaz, 2017), but multi-hop QA has only recently been studied closely with the release of large-scale datasets. Much research has focused on enabling multi-hop reasoning in question answering models in the few-document setting, e.g., by modeling entity graphs (De Cao et al., 2019) or scoring answer candidates against the context (Zhong et al., 2019). These approaches, however, suffer from scalability issues when the number of supporting documents and/or answer candidates grows beyond a few dozen. Ding et al. (2019) apply entity graph modeling to HOTPOTQA, where they expand a small entity graph starting from the question to arrive at the context for the QA model. However, centered around entity names, this model risks missing purely descriptive clues in the question. Das et al. (2019) propose a neural retriever trained with distant supervision to bias towards paragraphs containing answers to the given questions, which is then used in a multi-step reader-reasoner framework. This does not fundamentally address the discoverability issue in open-domain multi-hop QA, however, because usually not all the evidence can be directly retrieved with the question. Besides, the neural retrieval model lacks explainability, which is crucial in real-world applications. Talmor and Berant (2018), instead, propose to answer multi-hop questions at scale by decomposing the question into sub-questions and performing iterative retrieval and question answering, which shares very similar motivations with our work. However, the questions studied in that work are based on logical forms of a fixed schema, which yields additional supervision for question decomposition but limits the diversity of questions. More recently, Min et al. (2019b) apply a similar idea to HOTPOTQA, but this approach similarly requires additional annotations for decomposition, and the authors did not apply it to iterative retrieval.

3 Model

In this section, we formally define the problem of open-domain multi-hop question answering, and motivate the architecture of the proposed GOLDEN (Gold Entity) Retriever model. We then detail the query generation components as well as how to derive supervision signal for them, before concluding with the QA component.

3.1 Problem Statement

We define the problem of open-domain multi-hop QA as one involving a question q, and S relevant (gold) supporting context documents d_1, ..., d_S which contain the desired answer a.[3] These supporting documents form a chain of reasoning necessary to arrive at the answer, and they come from a large corpus of documents D where |D| ≫ S. In this chain of reasoning, the supporting documents are usually connected via shared entities or textual similarities (e.g., they describe similar entities or events), but these connections do not necessarily conform to any predefined knowledge schema.

[3] In this work, we only consider extractive, or span-based, QA tasks, but the problem statement and the proposed method apply to generative QA tasks as well.

We contrast this to what we call the few-document setting of multi-hop QA, where the QA system is presented with a small set of documents D_few-doc = {d_1, ..., d_S, d'_1, ..., d'_D}, where d'_1, ..., d'_D comprise a small set of “distractor” documents that test whether the system is able to pick out the correct set of supporting documents in the presence of noise. This setting is suitable for testing QA systems’ ability to perform multi-hop reasoning given the gold supporting documents with a bounded computational budget, but we argue that it is far from a realistic one. In practice, an open-domain QA system has to locate all gold supporting documents from D on its own, and as shown in Figure 1, this is often difficult for multi-hop questions based on the original question alone, as not all gold documents are easily retrievable given the question.

To address this gold context discoverability issue, we argue that it is necessary to move away from a single-hop retrieve-and-read approach where the original question serves as the search query. In the next section, we introduce GOLDEN Retriever, which addresses this problem by iterating between retrieving more documents and reading the context for multiple rounds.

3.2 Model Overview

Essentially, the challenge of open-domain multi-hop QA lies in the fact that the information need of the user (q → a) cannot be readily satisfied by any information retrieval (IR) system that models merely the similarity between the question q and the documents. This is because the true information need will only unfold with progressive reasoning and discovery of supporting facts. Therefore, one cannot rely solely on a similarity-based IR system for such iterative reasoning, because the potential pool of relevant documents grows exponentially with the number of hops of reasoning.

To this end, we propose GOLDEN (Gold Entity) Retriever, which makes use of the gold document[4] information available in the QA dataset at training time to iteratively query for more relevant supporting documents during each hop of reasoning. Instead of relying on the original question as the search query to retrieve all supporting facts, or building computationally expensive search engines that are less interpretable to humans, we propose to leverage text-based IR engines for interpretability, and generate different search queries as each reasoning step unfolds. In the very first hop of reasoning, GOLDEN Retriever is presented the original question q, from which it generates a search query q_1 that retrieves supporting document d_1.[5] Then for each of the subsequent reasoning steps (k = 2, ..., S), GOLDEN Retriever generates a query q_k from the question and the available context, (q, d_1, ..., d_{k-1}). This formulation allows the model to generate queries based on information revealed in the supporting facts (see Figure 2, for example).

[4] In HOTPOTQA, documents usually describe entities, thus we use “documents” and “entities” interchangeably.

Figure 2: Model overview of GOLDEN Retriever. Given an open-domain multi-hop question, the model iteratively retrieves more context documents, and concatenates all retrieved context for a QA model to answer from.

We note that GOLDEN Retriever is much more efficient, scalable, and interpretable at retrieving gold documents compared to its neural retrieval counterparts. This is because GOLDEN Retriever does not rely on a QA-specific IR engine tuned to a specific dataset, where adding new documents or question types into the index can be extremely inefficient. Further, GOLDEN Retriever generates queries in natural language, making it friendly to human interpretation and verification. One core challenge in GOLDEN Retriever, however, is to train query generation models in an efficient manner, because the search space for potential queries is enormous and off-the-shelf IR engines are not end-to-end differentiable. We outline our solution to this challenge in the following sections.

3.3 Query Generation

For each reasoning step, we need to generate the search query given the original question q and some context of documents we have already retrieved (initially empty). This query generation problem is conceptually similar to the QA task in that they both map a question and some context to a target, only instead of an answer, the target here is a search query that helps retrieve the desired supporting document for the next reasoning step. Therefore, we formulate the query generation process as a question answering task.

[5] For notational simplicity, d_k denotes the supporting document needed to complete the k-th step of reasoning. We also assume that the goal of each IR query is to retrieve one and only one gold supporting document in its top n results.

To reduce the potentially large space of possible queries, we favor a QA model that extracts text spans from the context over one that generates free-form text as search queries. We therefore employ DrQA’s Document Reader model (Chen et al., 2017), which is a relatively light-weight recurrent neural network QA model that has demonstrated success in few-document QA. We adapt it to query generation as follows.

For each reasoning step k = 1, ..., S, given the question q and some retrieval context C_k which ideally contains the gold supporting documents d_1, ..., d_{k-1}, we aim to generate a search query q_k that helps us retrieve d_k for the next reasoning step. A Document Reader model is trained to select a span from C_k as the query

    q_k = G_k(q, C_k),

where G_k is the query generator. This query is then used to search for supporting documents, which are concatenated with the current retrieval context to update it:

    C_{k+1} = C_k ⊕ IR_n(q_k),

where ⊕ denotes concatenation, IR_n(q_k) is the top n documents retrieved from the search engine using q_k, and C_1 = q.[6]
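To make the iteration concrete, the following is a minimal sketch of the retrieve-and-read loop at inference time; query_generators, search, and qa_model are hypothetical stand-ins for the trained query generators G_k, the IR engine IR_n, and the QA component of Section 3.5, not the released implementation.

```python
# Iterative retrieval in GOLDEN Retriever (sketch): at each hop, generate a query from
# the question and the current retrieval context, retrieve the top-n documents, and
# append them to the context; finally answer from the accumulated context.
def golden_retriever(question, query_generators, search, qa_model, n=5):
    context = question                                  # C_1 = q
    for generator in query_generators:                  # one generator per hop (S = 2 for HOTPOTQA)
        query = generator(question, context)            # q_k = G_k(q, C_k), a span of C_k
        retrieved = search(query, n=n)                  # IR_n(q_k), a list of document strings
        context = context + " " + " ".join(retrieved)   # C_{k+1} = C_k ⊕ IR_n(q_k)
    return qa_model(question, context)                  # answer from q and C_S
```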

At the end of the retrieval steps, we provide q as the question along with C_S as the context to the final few-document QA component detailed in Section 3.5 to obtain the final answer to the original question.

[6] In the query result, the title of each document is delimited with special tokens <t> and </t> before concatenation.

Question | Hop 1 Oracle | Hop 2 Oracle
What government position was held by the woman who portrayed Corliss Archer in the film Kiss and Tell? | Corliss Archer in the film Kiss and Tell | Shirley Temple
Scott Parkin has been a vocal critic of Exxonmobil and another corporation that has operations in how many countries? | Scott Parkin | Halliburton
Are Giuseppe Verdi and Ambroise Thomas both Opera composers? | Giuseppe Verdi | Ambroise Thomas

Table 1: Example oracle queries on the HOTPOTQA dev set.

To train the query generators, we follow the steps above to construct the retrieval contexts, but during training time, when d_k is not part of the IR result, we replace the lowest ranking document with d_k before concatenating it with C_k to make sure the downstream models have access to the necessary context.
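A minimal sketch of this training-time context construction is shown below; the search helper and document representation are hypothetical placeholders rather than the released implementation.

```python
# Build the retrieval context for hop k during training: query with the oracle query,
# and if the gold document d_k is missing from the top-n results, swap it in for the
# lowest-ranked result so downstream models always see the necessary context.
def build_training_context(context_k, oracle_query_k, gold_doc_k, search, n=5):
    results = search(oracle_query_k, n=n)   # hypothetical IR call, returns top-n document strings
    if gold_doc_k not in results:
        results[-1] = gold_doc_k            # replace the lowest-ranking document with d_k
    # Titles would be wrapped in <t> ... </t> before concatenation (see footnote [6]).
    return context_k + " " + " ".join(results)
```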

3.4 Deriving Supervision Signal for Query Generation

When deriving supervision signal to train our query generators, the potential search space is enormous for each step of reasoning even if we constrain ourselves to predicting spans from the context. This is aggravated by the multiple hops of reasoning required by the question. One solution to this issue is to train the query generators with reinforcement learning (RL) techniques (e.g., REINFORCE (Sutton et al., 2000)), of which Nogueira and Cho (2017) and Buck et al. (2018) are examples for one-step query generation. However, this is computationally inefficient, and has high variance especially from the second reasoning step onward, because the context depends greatly on what queries have been chosen previously and their search results.

Instead, we propose to leverage the limited supervision we have about the gold supporting documents d_1, ..., d_S to narrow down the search space. The key insight we base our approach on is that at any step of open-domain multi-hop reasoning, there is some semantic overlap between the retrieval context and the next document(s) we wish to retrieve. For instance, in our Armada example, when the retrieval context contains only the question, this overlap is the novel itself; after we have expanded the retrieval context, this overlap becomes the name of the author, Ernest Cline. Finding this semantic overlap between the retrieval context and the desired documents not only reveals the chain of reasoning naturally, but also allows us to use it as the search query for retrieval.

Because off-the-shelf IR systems generally optimize for shallow lexical similarity between query and candidate documents in favor of efficiency, a good proxy for this overlap is locating spans of text that have high lexical overlap with the intended supporting documents. To this end, we propose a simple yet effective solution, employing several heuristics to generate candidate queries: computing the longest common string/sequence between the current retrieval context and the title/text of the intended paragraph ignoring stop words, then taking the contiguous span of text that corresponds to this overlap in the retrieval context. This allows us to not only make use of entity names, but also textual descriptions that better lead to the gold entities. It is also more generally applicable than question decomposition approaches (Talmor and Berant, 2018; Min et al., 2019b), and does not require additional annotation for decomposition.

Applying various heuristics results in a handful of candidate queries for each document, and we use our IR engine (detailed next) to rank them based on recall of the intended supporting document to choose one as the final oracle query we train our query generators to predict. This allows us to train the query generators in a fully supervised manner efficiently. Some examples of oracle queries on the HOTPOTQA dev set can be found in Table 1. We refer the reader to Appendix B for more technical details about our heuristics and how the oracle queries are derived.
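A minimal sketch of this oracle selection step is shown below, assuming a hypothetical search helper and candidate queries produced by the heuristics of Appendix B; tie-breaking by query length mirrors the secondary metrics mentioned there.

```python
# Choose the oracle query for one hop (sketch): rank each candidate query by where it
# places the intended gold document in the search results, breaking ties by preferring
# shorter queries (cf. Appendix B).
def select_oracle_query(candidate_queries, gold_doc, search, n=5):
    def rank_of_gold(query):
        results = search(query, n=n)        # hypothetical IR call, returns top-n documents
        return results.index(gold_doc) if gold_doc in results else n
    return min(candidate_queries, key=lambda q: (rank_of_gold(q), len(q)))
```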

Oracle Query vs Single-hop Query  We evaluate the oracle query against the single-hop query, i.e., querying with the original question, on the HOTPOTQA dev set. Specifically, we compare the recall of gold paragraphs, because the greater the recall, the fewer documents we need to pass into the expensive neural multi-hop QA component.

Figure 3: Recall comparison between single-hop queries and GOLDEN Retriever oracle queries for both supporting paragraphs on the HOTPOTQA dev set (dev recall in % vs. number of retrieved documents; curves for Single-hop d1, Single-hop d2, Oracle d1, Oracle d2). Note that the oracle queries are much more effective than the original question (single-hop query) at retrieving target paragraphs in both hops.

We index the English Wikipedia dump with introductory paragraphs provided by the HOTPOTQA authors[7] with Elasticsearch 6.7 (Gormley and Tong, 2015), where we index the titles and document text in separate fields with bigram indexing enabled. This results in an index with 5,233,329 total documents. At retrieval time, we boost the scores of any search result whose title matches the search query better; this results in a better recall for entities with common names (e.g., “Armada” the novel). For more details about how the IR engine is set up and the effect of score boosting, please refer to Appendix A.

In Figure 3, we compare the recall of the two gold paragraphs required for each question in HOTPOTQA at various numbers of retrieved documents (R@n) for the single-hop query and the multi-hop queries generated from the oracle. Note that the oracle queries are much more effective at retrieving the gold paragraphs than the original question in both hops. For instance, if we combine R@5 of both oracles (which effectively retrieves 10 documents from two queries) and compare that to R@10 for the single-hop query, the oracle queries improve recall for d_1 by 6.68%, and that for d_2 by a significant margin of 49.09%.[8]

[7] https://hotpotqa.github.io/wiki-readme.html

[8] Since HOTPOTQA does not provide the logical order in which its gold entities should be discovered, we simply call the document that is more easily retrievable with the queries d_1, and the other d_2.

Figure 4: Question answering component in GOLDEN Retriever.

This means that the final QA model will need to consider far fewer documents to arrive at a decent set of supporting facts that lead to the answer.

3.5 Question Answering Component

The final QA component of GOLDEN Retriever is based on the baseline model presented in Yang et al. (2018), which is in turn based on BiDAF++ (Clark and Gardner, 2017). We make two major changes to this model. Yang et al. (2018) concatenated all context paragraphs into one long string to predict span begin and end offsets for the answer, which is potentially sensitive to the order in which these paragraphs are presented to the model. We instead process them separately with shared encoder RNN parameters to obtain paragraph order-insensitive representations for each paragraph. Span offset scores are predicted from each paragraph independently before finally being aggregated and normalized with a global softmax operation to produce probabilities over spans. The second change is that we replace all attention mechanisms in the original model with self-attention layers over the concatenated question and context. To differentiate context paragraph representations from question representations in this self-attention mechanism, we indicate question and context tokens by concatenating a 0/1 feature at the input layer. Figure 4 illustrates the QA model architecture.
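As an illustration of the global normalization described above, here is a minimal sketch (not the authors' code) showing how independently computed per-paragraph span logits could be aggregated with a single softmax; tensor shapes and function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def global_span_probabilities(start_logits_per_paragraph, end_logits_per_paragraph):
    """Each argument is a list of 1-D tensors, one per paragraph."""
    # Concatenate the independently computed per-paragraph logits, then normalize
    # with one softmax over all paragraphs so span probabilities are comparable
    # across paragraphs regardless of the order in which they are presented.
    start = F.softmax(torch.cat(start_logits_per_paragraph), dim=0)
    end = F.softmax(torch.cat(end_logits_per_paragraph), dim=0)
    return start, end
```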


System | Ans EM | Ans F1 | Sup Fact EM | Sup Fact F1 | Joint EM | Joint F1
Baseline (Yang et al., 2018) | 25.23 | 34.40 | 5.07 | 40.69 | 2.63 | 17.85
GRN + BERT | 29.87 | 39.14 | 13.16 | 49.67 | 8.26 | 25.84
MUPPET (Feldman and El-Yaniv, 2019) | 30.61 | 40.26 | 16.65 | 47.33 | 10.85 | 27.01
CogQA (Ding et al., 2019) | 37.12 | 48.87 | 22.82 | 57.69 | 12.42 | 34.92
PR-Bert | 43.33 | 53.79 | 21.90 | 59.63 | 14.50 | 39.11
Entity-centric BERT Pipeline | 41.82 | 53.09 | 26.26 | 57.29 | 17.01 | 39.18
BERT pip. (contemporaneous) | 45.32 | 57.34 | 38.67 | 70.83 | 25.14 | 47.60
GOLDEN Retriever | 37.92 | 48.58 | 30.69 | 64.24 | 18.04 | 39.13

Table 2: End-to-end QA performance of baselines and our GOLDEN Retriever model on the HOTPOTQA fullwiki test set. Among systems that were not published at the time of submission of this paper, “BERT pip.” was submitted to the official HOTPOTQA leaderboard on May 15th (thus contemporaneous), while “Entity-centric BERT Pipeline” and “PR-Bert” were submitted after the paper submission deadline.

4 Experiments

4.1 Data and Setup

We evaluate our models in the fullwiki setting of HOTPOTQA (Yang et al., 2018). HOTPOTQA is a question answering dataset collected on the English Wikipedia, containing about 113k crowd-sourced questions that are constructed to require the introduction paragraphs of two Wikipedia articles to answer. Each question in the dataset comes with the two gold paragraphs, as well as a list of sentences in these paragraphs that crowdworkers identify as supporting facts necessary to answer the question. A diverse range of reasoning strategies are featured in HOTPOTQA, including questions involving missing entities in the question (our Armada example), intersection questions (What satisfies property A and property B?), and comparison questions, where two entities are compared by a common attribute, among others. In the few-document distractor setting, the QA models are given ten paragraphs in which the gold paragraphs are guaranteed to be found; in the open-domain fullwiki setting, which we focus on, the models are only given the question and the entire Wikipedia. Models are evaluated on their answer accuracy and explainability, where the former is measured as overlap between the predicted and gold answers with exact match (EM) and unigram F1, and the latter concerns how well the predicted supporting fact sentences match human annotation (Supporting Fact EM/F1). A joint metric is also reported on this dataset, which encourages systems to perform well on both tasks simultaneously.
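As a concrete reference for the answer-level metrics, here is a minimal sketch of exact match and unigram F1 under simple lowercasing and whitespace tokenization; the official evaluation script applies additional answer normalization, so this is illustrative rather than the exact scorer.

```python
from collections import Counter

def exact_match(prediction, gold):
    # 1.0 if the normalized strings are identical, else 0.0.
    return float(prediction.strip().lower() == gold.strip().lower())

def unigram_f1(prediction, gold):
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)   # overlapping unigrams
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```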

We use the Stanford CoreNLP toolkit (Manning et al., 2014) to preprocess Wikipedia, as well as to generate POS/NER features for the query generators, following Yang et al. (2018) and Chen et al. (2017). We always detokenize a generated search query before sending it to Elasticsearch, which has its own preprocessing pipeline. Since all questions in HOTPOTQA require exactly two supporting documents, we fix the number of retrieval steps of GOLDEN Retriever to S = 2. To accommodate arbitrary steps of reasoning in GOLDEN Retriever, a stopping criterion is required to determine when to stop retrieving more supporting documents and perform few-document question answering, which we leave to future work. During training and evaluation, we set the number of retrieved documents added to the retrieval context to 5 for each retrieval step, so that the total number of paragraphs our final QA model considers is 10, for a fair comparison to Yang et al. (2018).

We include hyperparameters and training details in Appendix C for reproducibility.

4.2 End-to-end Question Answering

We compare the end-to-end performance of GOLDEN Retriever against several QA systems on the HOTPOTQA dataset: (1) the baseline presented in Yang et al. (2018), (2) CogQA (Ding et al., 2019), the top-performing previously published system, and (3) other high-ranking systems on the leaderboard. As shown in Table 2, GOLDEN Retriever is much better at locating the correct supporting facts from Wikipedia compared to CogQA, as well as most of the top-ranking systems. However, the QA performance is handicapped because we do not make use of pretrained contextualization models (e.g., BERT) that these systems use. We expect a boost in QA performance from adopting these more powerful question answering models, especially ones that are tailored to perform few-document multi-hop reasoning. We leave this to future work.

Setting | Ans F1 | Sup F1 | R@10*
GOLDEN Retriever | 49.79 | 64.58 | 75.46
Single-hop query | 38.19 | 54.82 | 62.38
HOTPOTQA IR | 36.34 | 46.78 | 55.71

Table 3: Question answering and IR performance amongst different IR settings on the dev set. We observe that although improving the IR engine is helpful, most of the performance gain results from the iterative retrieve-and-read strategy of GOLDEN Retriever. (*: for GOLDEN Retriever, the 10 paragraphs are combined from both hops, 5 from each hop.)

To understand the contribution of GOLDEN Retriever’s iterative retrieval process compared to that of the IR engine, we compare the performance of GOLDEN Retriever against two baseline systems on the dev set: one that retrieves 10 supporting paragraphs from Elasticsearch with the original question, and one that uses the IR engine presented in HOTPOTQA.[9] In all cases, we use the QA component in GOLDEN Retriever for the final question answering step. As shown in Table 3, replacing the hand-engineered IR engine in Yang et al. (2018) with Elasticsearch does result in some gain in recall of the gold documents, but that does not translate to a significant improvement in QA performance. Further inspection reveals that despite Elasticsearch improving overall recall of gold documents, it is only able to retrieve both gold documents for 36.91% of the dev set questions, in comparison to 28.21% from the IR engine in Yang et al. (2018). In contrast, GOLDEN Retriever improves this percentage to 61.01%, almost doubling the recall over the single-hop baseline, providing the QA component a much better set of context documents to predict answers from.

Lastly, we perform an ablation study in which we replace our query generator models with our query oracles and observe the effect on end-to-end performance. As can be seen in Table 4, replacing G1 with the oracle only slightly improves end-to-end performance, but further substituting G2 with the oracle yields a significant improvement. This illustrates that the performance loss is largely attributed to G2 rather than G1, because G2 solves a harder span selection problem from a longer retrieval context. In the next section, we examine the query generation models more closely by evaluating their performance without the QA component.

[9] For the latter, we use the fullwiki test input file the authors released, which contains the top-10 IR output from that retrieval system with the question as the query.

System | Ans F1 | Sup F1 | Joint F1
GOLDEN Retriever | 49.79 | 64.58 | 40.21
w/ Hop 1 oracle | 52.53 | 68.06 | 42.68
w/ Hop 1 & 2 oracles | 62.32 | 77.00 | 52.18

Table 4: Pipeline ablative analysis of GOLDEN Retriever end-to-end QA performance by replacing each query generator with a query oracle.

4.3 Analysis of Query Generation

To evaluate the query generators, we begin by determining how well they emulate the oracles. We evaluate them using exact match (EM) and F1 on the span prediction task, as well as compare their queries’ retrieval performance against the oracle queries. As can be seen in Table 6, the performance of G2 is worse than that of G1 in general, confirming our findings on the end-to-end pipeline. When we combine them into a pipeline, the generated queries perform only slightly better on d_1 when a total of 10 documents are retrieved (89.91% vs 87.85%), but are significantly more effective for d_2 (61.01% vs 36.91%). If we further zoom in on the retrieval performance on non-comparison questions, for which finding the two entities involved is less trivial, we can see that the recall on d_2 improves from 27.88% to 53.23%, almost doubling the number of questions for which we have the complete gold context to answer. We note that the IR performance we report on the full pipeline is different to that when we evaluate the query generators separately. We attribute this difference to the fact that the generated queries sometimes retrieve both gold documents in one step.

Model | Span EM | Span F1 | R@5
G1 | 51.40 | 78.75 | 85.86
G2 | 52.29 | 63.07 | 64.83

Table 6: Span prediction and IR performance of the query generator models for Hop 1 (G1) and Hop 2 (G2) evaluated separately on the HOTPOTQA dev set.

Question | Predicted q1 | Predicted q2
(1) What video game character did the voice actress in the animated film Alpha and Omega voice? | voice actress in the animated film Alpha and Omega (animated film Alpha and Omega voice) | Hayden Panettiere
(2) What song was created by the group consisting of Jeffrey Jey, Maurizio Lobina and Gabry Ponte and released on 15 January 1999? | Jeffrey Jey (group consisting of Jeffrey Jey, Maurizio Lobina and Gabry Ponte) | Gabry Ponte and released on 15 January 1999 (“Blue (Da Ba Dee)”)
(3) Yau Ma Tei North is a district of a city with how many citizens? | Yau Ma Tei North | Yau Tsim Mong District of Hong Kong (Hong Kong)
(4) What company started the urban complex development that included the highrise building, The Harmon? | highrise building, The Harmon | CityCenter

Table 5: Examples of predicted queries from the query generators on the HOTPOTQA dev set. The oracle query is shown in parentheses when it differs from the predicted one.

To better understand model behavior, we also randomly sampled some examples from the dev set to compare the oracle queries and the predicted queries. Aside from exact matches, we find that the predicted queries are usually small variations of the oracle ones. In some cases, the model selects spans that are more natural and informative (Example (1) in Table 5). When they differ a bit more, the model is usually overly biased towards shorter entity spans and misses out on informative context (Example (2)). When there are multiple entities in the retrieval context, the model sometimes selects the wrong entity, which suggests that a more powerful query generator might be desirable (Example (3)). Despite these issues, we find that these natural language queries make the reasoning process more interpretable, and easier for a human to verify or intervene in as needed.

Limitations  Although we have demonstrated that generating search queries with span selection works in most cases, it also limits the kinds of queries we can generate, and in some cases leads to undesired behavior. One common issue is that the entity of interest has a name shared by too many Wikipedia pages (e.g., “House Rules” the 2003 TV series). This sometimes results in the inclusion of extra terms in our oracle query to expand it (e.g., Example (4) specifies that “The Harmon” is a highrise building). In other cases, our span oracle makes use of too much information from the gold entities (Example (2), where a human would likely query for “Eiffel 65 song released 15 January 1999” because “Blue” is not the only song mentioned in d_1). We argue, though, that these are due to the simplifying choice of span selection for query generation and the fixed number of query steps. We leave extensions to future work.

5 Conclusion

In this paper, we presented GOLDEN (Gold Entity) Retriever, an open-domain multi-hop question answering system for scalable multi-hop reasoning. Through iterative reasoning and retrieval, GOLDEN Retriever greatly improves the recall of gold supporting facts, thus providing the question answering model a much better set of context documents to produce an answer from, and demonstrates competitive performance to the state of the art. Generating natural language queries for each step of reasoning, GOLDEN Retriever is also more interpretable to humans compared to previous neural retrieval approaches and affords better understanding and verification of model behavior.

Acknowledgements

The authors would like to thank Robin Jia among other members of the Stanford NLP Group, as well as the anonymous reviewers, for discussions and comments on earlier versions of this paper. Peng Qi would also like to thank Suprita Shankar, Jamil Dhanani, and Suma Kasa for early experiments on BiDAF++ variants for HOTPOTQA. This research is funded in part by Samsung Electronics Co., Ltd. and in part by the SAIL-JD Research Initiative.

References

Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Wojciech Gajewski, Andrea Gesmundo, Neil Houlsby, and Wei Wang. 2018. Ask the right questions: Active question reformulation with reinforcement learning. In Proceedings of the International Conference on Learning Representations.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL).

Jifan Chen and Greg Durrett. 2019. Understanding dataset design choices for multi-hop reasoning. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Christopher Clark and Matt Gardner. 2017. Simple and effective multi-paragraph reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, and Andrew McCallum. 2019. Multi-step retriever-reader interaction for scalable open-domain question answering. In International Conference on Learning Representations.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2019. Question answering by reasoning across documents with graph convolutional networks. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Ming Ding, Chang Zhou, Qibin Chen, Hongxia Yang, and Jie Tang. 2019. Cognitive graph for multi-hop reading comprehension at scale. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Yair Feldman and Ran El-Yaniv. 2019. Multi-hop paragraph retrieval for open-domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Clinton Gormley and Zachary Tong. 2015. Elasticsearch: The definitive guide: A distributed real-time search and analytics engine. O'Reilly Media, Inc.

Ahmed Hassan Awadallah, Ryen W. White, Patrick Pantel, Susan T. Dumais, and Yi-Min Wang. 2014. Supporting complex search tasks. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 829-838. ACM.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).

Bernhard Kratzwald and Stefan Feuerriegel. 2018. Adaptive document retrieval for deep question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55-60.

Rishabh Mehrotra, Prasanta Bhattacharya, and Emine Yilmaz. 2016. Deconstructing complex search tasks: A Bayesian nonparametric approach for extracting sub-tasks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 599-605.

Rishabh Mehrotra and Emine Yilmaz. 2017. Extracting hierarchies of search tasks & subtasks via a Bayesian nonparametric approach. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 285-294. ACM.

Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019a. Compositional questions do not necessitate multi-hop reasoning. In Proceedings of the Annual Conference of the Association for Computational Linguistics.

Sewon Min, Victor Zhong, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019b. Multi-hop reading comprehension through question decomposition and rescoring. In Proceedings of the Annual Conference of the Association for Computational Linguistics.

Kyosuke Nishida, Itsumi Saito, Atsushi Otsuka, Hisako Asano, and Junji Tomita. 2018. Retrieve-and-read: Multi-task learning of information retrieval and reading comprehension. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 647-656. ACM.

Rodrigo Nogueira and Kyunghyun Cho. 2017. Task-oriented query reformulation with reinforcement learning.

Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document expansion by query prediction. arXiv preprint arXiv:1904.08375.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Stephen E. Robertson, Steve Walker, Susan Jones, Micheline M. Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication SP, 109:109.

Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057-1063.

Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287-302.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Victor Zhong, Caiming Xiong, Nitish Shirish Keskar, and Richard Socher. 2019. Coarse-grain fine-grain coattention network for multi-evidence question answering. In Proceedings of the International Conference on Learning Representations.


A Elasticsearch Setup

A.1 Setting Up the Index

We start from the Wikipedia dump file containing the introductory paragraphs used in HOTPOTQA that Yang et al. (2018) provide,[10] and add the fields corresponding to Wikipedia page titles and the introductory paragraphs (text) into the index.

For the title, we use Elasticsearch's simple analyzer, which performs basic tokenization and lowercasing of the content. For the text, we join all the sentences and use the standard analyzer, which further allows for removal of punctuation and stop words. For both fields, we index an auxiliary field with bigrams using the shingle filter,[11] and perform basic asciifolding to map non-ASCII characters to a similar ASCII character (e.g., "é" → "e").

At search time, we launch a multi_match query against all fields with the same query, which performs a full-text query employing the BM25 ranking function (Robertson et al., 1995) with all fields in the index, and returns the score of the best field for ranking by default. To promote documents whose title matches the search query, we boost the search score of all title-related fields by 1.25 in this query.
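As a rough illustration, the query described above might look like the following with the elasticsearch-py client; the index and field names here are hypothetical placeholders, not the exact ones used in the released code.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # defaults to localhost:9200

def search(query, n=10):
    body = {
        "query": {
            "multi_match": {
                "query": query,
                # Boost title-related fields by 1.25 to promote title matches;
                # best-field scoring with BM25 is Elasticsearch's default behavior.
                "fields": ["title^1.25", "title.bigram^1.25", "text", "text.bigram"],
            }
        },
        "size": n,
    }
    return es.search(index="enwiki_intro", body=body)["hits"]["hits"]
```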

A.2 Reranking Search Results

In Wikipedia, it is common that pages or entity names share rare words that are important to search engines, and a naive full-text search IR system will not be able to pick the one that matches the query the best. For instance, if one set up Elasticsearch according to the instructions above and searched for "George W. Bush", he/she would be surprised to see that the actual page is not even in the top-10 search results, which contain entities such as "George W. Bush Childhood Home" and "Bibliography of George W. Bush".

To this end, we propose to rerank these query results with a simple but effective heuristic that alleviates this issue. We first retrieve at least 50 candidate documents for each query for consideration, and boost the query scores of documents whose title exactly matches the search query, or is a substring of the search query. Specifically, we multiply the document score by a heuristic constant between 1.05 and 1.5, depending on how well the document title matches the query, before reranking all search results. This results in a significant improvement in these cases. For the query "George W. Bush", the page for the former US president is ranked at the top after reranking. In Table 7, we also provide results from the single-hop query to show the improvement from the title score boosting introduced in the previous section and from reranking.

[10] https://hotpotqa.github.io/wiki-readme.html

[11] https://www.elastic.co/guide/en/elasticsearch/reference/6.7/analysis-shingle-tokenfilter.html

IR System | R@10 for d1 | R@10 for d2
Final system | 87.85 | 36.91
w/o Title Boosting | 86.85 | 32.64
w/o Reranking | 86.32 | 34.77
w/o Both | 84.67 | 29.55

Table 7: IR performance (recall in percentages) of various Elasticsearch setups on the HOTPOTQA dev set using the original question.
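A minimal sketch of this reranking heuristic is shown below; the specific boost constants per matching condition are illustrative, since the paper only states that they lie between 1.05 and 1.5, and the title is assumed to be stored in each hit's _source.

```python
# Rerank Elasticsearch hits by boosting documents whose title exactly matches the
# search query, or is a substring of it, then re-sorting by the boosted score.
def rerank(query, hits):
    query_lower = query.lower()
    for hit in hits:
        title = hit["_source"]["title"].lower()
        if title == query_lower:
            hit["_score"] *= 1.5       # exact title match gets the largest boost (illustrative)
        elif title in query_lower:
            hit["_score"] *= 1.2       # title is a substring of the query (illustrative)
    return sorted(hits, key=lambda h: h["_score"], reverse=True)
```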

B Oracle Query Generation

We mainly employ three heuristics to find the semantic overlap between the retrieval context and the desired documents: longest common subsequence (LCS), longest common substring (LCSubStr), and overlap merging, which generalizes the two. Specifically, the overlap merging heuristic looks for contiguous spans in the retrieval context that have high rates of overlapping tokens with the desired document, determined by the total number of overlapping tokens divided by the total number of tokens considered in the span.

In all heuristics, we ignore stop words and lowercase the rest in computing the spans to capture more meaningful overlaps, and finally take the span in the retrieval context that contains all the overlapping words. For instance, if the retrieval context contains "the GOLDEN Retriever model on HOTPOTQA" and the desired document contains "GOLDEN Retriever on the HOTPOTQA dataset", we will identify the overlapping terms as "GOLDEN", "Retriever", and "HOTPOTQA", and return the span "GOLDEN Retriever model on HOTPOTQA" as the resulting candidate query.
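The following is a minimal sketch of this overlap heuristic on the example above; the stop word list and whitespace tokenization are simplified placeholders rather than the released implementation.

```python
# Find the content-word overlap between the retrieval context and the target document,
# then return the contiguous span of the retrieval context that covers all overlapping words.
STOP_WORDS = {"the", "a", "an", "of", "on", "in", "and", "by", "for", "to"}

def candidate_query(retrieval_context, target_document):
    ctx_tokens = retrieval_context.split()
    target_tokens = {t.lower() for t in target_document.split()} - STOP_WORDS
    overlap_positions = [
        i for i, tok in enumerate(ctx_tokens)
        if tok.lower() not in STOP_WORDS and tok.lower() in target_tokens
    ]
    if not overlap_positions:
        return None
    start, end = overlap_positions[0], overlap_positions[-1]
    return " ".join(ctx_tokens[start:end + 1])

# With the example from the text, this returns "GOLDEN Retriever model on HOTPOTQA".
print(candidate_query("the GOLDEN Retriever model on HOTPOTQA",
                      "GOLDEN Retriever on the HOTPOTQA dataset"))
```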

To generate candidates for the oracle query, we apply the heuristics between combinations of {cleaned question, cleaned question without punctuation} × {cleaned document title, cleaned paragraph}, where cleaning means stop word removal and lowercasing.


Hyperparameter | Values
Learning rate | 5 × 10^-4, 1 × 10^-3
Finetune embeddings | 0, 200, 500, 1000
Epochs | 25, 40
Batch size | 32, 64, 128
Hidden size | 64, 128, 256, 512, 768
Max sequence length | 15, 20, 30, 50, 100
Dropout rate | 0.3, 0.35, 0.4, 0.45

Table 8: Hyperparameter settings for the query generators. The final hyperparameters for the Hop 1 query generator are shown in bold, and those for the Hop 2 query generator are shown in underlined italic.

Once oracle queries are generated, we launch these queries against Elasticsearch to determine the rank of the desired paragraph. If multiple candidate queries are able to place the desired paragraph in the top 5 results, we further rank the candidate queries by other metrics (e.g., length of the query) to arrive at the final oracle query used to train the query generators. We refer the reader to our released code for further details.

C Training Details

C.1 Query Generators

Once the oracle queries are generated, we train our query generators to emulate them on the training set, and choose the best model by span selection F1 on the dev set. We experiment with hyperparameters such as the learning rate, number of training epochs, batch size, and number of word embeddings to finetune, among others, and report the final hyperparameters for both query generators (G1 and G2) in Table 8.

C.2 Question Answering Model

Our final question answering component is trained with the paragraphs produced by the oracle queries (5 from each hop, 10 in total), with d_1 and d_2 inserted to replace the lowest ranking paragraph in each hop if they are not in the set already.

We develop our model based on the baseline model of Yang et al. (2018), and reuse the same default hyperparameters whenever possible. The main differences in the hyperparameters are: we optimize our QA model with Adam (with default hyperparameters) (Kingma and Ba, 2015) instead of stochastic gradient descent, with a larger batch size of 64; we anneal the learning rate by 0.5 with a patience of 3 instead of 1, that is, we multiply the learning rate by 0.5 after three consecutive failures to improve dev F1; we clip the gradient down to a maximum l2 norm of 5; we apply a 10% dropout to the model, for which we have increased the hidden size to 128; and we use 10 as the coefficient by which we multiply the supporting facts loss before mixing it with the span prediction loss. We configure the model to read 10 context paragraphs, and limit each paragraph to at most 400 tokens including the title.
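For concreteness, the optimization settings described above roughly correspond to the following PyTorch sketch; the model and the loss terms are placeholders, not the released training code.

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the actual QA model

optimizer = torch.optim.Adam(model.parameters())  # Adam with default hyperparameters
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=3)  # halve the LR after 3 epochs without dev F1 gain

def training_step(span_loss, sup_fact_loss):
    # Mix the span prediction loss with the supporting facts loss (coefficient 10).
    loss = span_loss + 10.0 * sup_fact_loss
    optimizer.zero_grad()
    loss.backward()
    # Clip the gradient to a maximum l2 norm of 5 before updating.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()

# After each epoch, step the scheduler on the dev answer F1: scheduler.step(dev_f1)
```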