
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 875–886, Copenhagen, Denmark, September 7–11, 2017. ©2017 Association for Computational Linguistics

Learning to Paraphrase for Question Answering

Li Dong† and Jonathan Mallinson† and Siva Reddy‡ and Mirella Lapata†
† ILCC, School of Informatics, University of Edinburgh
‡ Computer Science Department, Stanford University
[email protected], [email protected], [email protected], [email protected]

Abstract

Question answering (QA) systems are sensitive to the many different ways natural language expresses the same information need. In this paper we turn to paraphrases as a means of capturing this knowledge and present a general framework which learns felicitous paraphrases for various QA tasks. Our method is trained end-to-end using question-answer pairs as a supervision signal. A question and its paraphrases serve as input to a neural scoring model which assigns higher weights to linguistic expressions most likely to yield correct answers. We evaluate our approach on QA over Freebase and answer sentence selection. Experimental results on three datasets show that our framework consistently improves performance, achieving competitive results despite the use of simple QA models.

1 Introduction

Enabling computers to automatically answer questions posed in natural language on any domain or topic has been the focus of much research in recent years. Question answering (QA) is challenging due to the many different ways natural language expresses the same information need. As a result, small variations in semantically equivalent questions may yield different answers. For example, a hypothetical QA system must recognize that the questions “who created microsoft” and “who started microsoft” have the same meaning and that they both convey the founder relation in order to retrieve the correct answer from a knowledge base.

Given the great variety of surface forms for semantically equivalent expressions, it should come as no surprise that previous work has investigated the use of paraphrases in relation to question answering. There have been three main strands of research. The first applies paraphrasing to match natural language and logical forms in the context of semantic parsing. Berant and Liang (2014) use a template-based method to heuristically generate canonical text descriptions for candidate logical forms, and then compute paraphrase scores between the generated texts and input questions in order to rank the logical forms. Another strand of work uses paraphrases in the context of neural question answering models (Bordes et al., 2014a,b; Dong et al., 2015). These models are typically trained on question-answer pairs, and employ question paraphrases in a multi-task learning framework in an attempt to encourage the neural networks to output similar vector representations for the paraphrases.

The third strand of research uses paraphrases more directly. The idea is to paraphrase the question and then submit the rewritten version to a QA module. Various resources have been used to produce question paraphrases, such as rule-based machine translation (Duboue and Chu-Carroll, 2006), lexical and phrasal rules from the Paraphrase Database (Narayan et al., 2016), as well as rules mined from Wiktionary (Chen et al., 2016) and large-scale paraphrase corpora (Fader et al., 2013). A common problem with the generated paraphrases is that they often contain inappropriate candidates. Hence, treating all paraphrases as equally felicitous and using them to answer the question could degrade performance. To remedy this, a scoring model is often employed, albeit independently of the QA system used to find the answer (Duboue and Chu-Carroll, 2006; Narayan et al., 2016). Problematically, the separate paraphrase models used in previous work do not fully utilize the supervision signal from the training data, and as such cannot be properly tuned to the question answering tasks at hand.



Figure 1: We use three different methods to generate candidate paraphrases for input q. The question and its paraphrases are fed into a neural model which scores how suitable they are. The scores are normalized and used to weight the results of the question answering model. The entire system is trained end-to-end using question-answer pairs as a supervision signal.

Based on the large variety of possible transformations that can generate paraphrases, it seems likely that the kinds of paraphrases that are useful would depend on the QA application of interest (Bhagat and Hovy, 2013). Fader et al. (2014) use features that are defined over the original question and its rewrites to score paraphrases. Examples include the pointwise mutual information of the rewrite rule, the paraphrase's score according to a language model, and POS tag features. In the context of semantic parsing, Chen et al. (2016) also use the ID of the rewrite rule as a feature. However, most of these features are not informative enough to model the quality of question paraphrases, or cannot easily generalize to unseen rewrite rules.

In this paper, we present a general framework for learning paraphrases for question answering tasks. Given a natural language question, our model estimates a probability distribution over candidate answers. We first generate paraphrases for the question, which can be obtained by one or several paraphrasing systems. A neural scoring model predicts the quality of the generated paraphrases, while learning to assign higher weights to those which are more likely to yield correct answers. The paraphrases and the original question are fed into a QA model that predicts a distribution over answers given the question. The entire system is trained end-to-end using question-answer pairs as a supervision signal. The framework is flexible: it does not rely on specific paraphrase or QA models. In fact, this plug-and-play functionality allows us to learn specific paraphrases for different QA tasks and to explore the merits of different paraphrasing models for different applications.

We evaluate our approach on QA over Freebase and text-based answer sentence selection. We employ a range of paraphrase models based on the Paraphrase Database (PPDB; Pavlick et al. 2015), neural machine translation (Mallinson et al., 2016), and rules mined from the WikiAnswers corpus (Fader et al., 2014). Results on three datasets show that our framework consistently improves performance; it achieves state-of-the-art results on GraphQuestions and competitive performance on two additional benchmark datasets using simple QA models.

2 Problem Formulation

Let q denote a natural language question, and a its answer. Our aim is to estimate p(a|q), the conditional probability of candidate answers given the question. We decompose p(a|q) as:

p(a|q) = \sum_{q' \in H_q \cup \{q\}} \underbrace{p_\psi(a \mid q')}_{\text{QA Model}} \; \underbrace{p_\theta(q' \mid q)}_{\text{Paraphrase Model}} \qquad (1)

where Hq is the set of paraphrases for question q, ψ are the parameters of the QA model, and θ are the parameters of the paraphrase scoring model.

As shown in Figure 1, we first generate candidate paraphrases Hq for question q. Then, a neural scoring model predicts the quality of the generated paraphrases, and assigns higher weights to the paraphrases which are more likely to obtain the correct answers.



Input: what be the zip code of the largest car manufacturer
what be the zip code of the largest vehicle manufacturer    (PPDB)
what be the zip code of the largest car producer    (PPDB)
what be the postal code of the biggest automobile manufacturer    (NMT)
what be the postcode of the biggest car manufacturer    (NMT)
what be the largest car manufacturer 's postal code    (Rule)
zip code of the largest car manufacturer    (Rule)

Table 1: Paraphrases obtained for an input question from different models (PPDB, NMT, Rule). Words are lowercased and stemmed.

These paraphrases and the original question simultaneously serve as input to a QA model that predicts a distribution over answers for a given question. Finally, the results of these two models are fused to predict the answer. In the following we will explain how pθ(q'|q) and pψ(a|q') are estimated.
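To make the decomposition in Equation (1) concrete, the following is a minimal Python sketch of how the answer distribution could be assembled from a paraphrase scorer and a QA model; the function names and the stub models below are illustrative placeholders rather than the authors' implementation.

```python
import math

def answer_distribution(question, paraphrases, qa_model, paraphrase_score):
    """Combine the QA and paraphrase models as in Equation (1):
    p(a|q) = sum over q' in Hq ∪ {q} of p_psi(a|q') * p_theta(q'|q)."""
    candidates = paraphrases + [question]
    # Normalize the paraphrase scores with a softmax (cf. Equation (6)).
    scores = [paraphrase_score(c, question) for c in candidates]
    z = sum(math.exp(s) for s in scores)
    weights = [math.exp(s) / z for s in scores]
    # Weight each candidate's answer distribution and sum.
    p_answer = {}
    for c, w in zip(candidates, weights):
        for answer, p in qa_model(c).items():
            p_answer[answer] = p_answer.get(answer, 0.0) + w * p
    return p_answer

# Toy usage with stub models standing in for the neural components.
qa = lambda q: {"bill gates": 0.7, "paul allen": 0.3} if "founded" in q else {"bill gates": 0.4, "paul allen": 0.6}
score = lambda c, q: 1.0 if "founded" in c else 0.0
print(answer_distribution("who created microsoft", ["who founded microsoft"], qa, score))
```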

2.1 Paraphrase Generation

As shown in Equation (1), the term p(a|q) is the sum over q and its paraphrases Hq. Ideally, we would generate all the paraphrases of q. However, since this set could quickly become intractable, we restrict the number of candidate paraphrases to a manageable size. In order to increase the coverage and diversity of paraphrases, we employ three methods based on: (1) lexical and phrasal rules from the Paraphrase Database (Pavlick et al., 2015); (2) neural machine translation models (Sutskever et al., 2014; Bahdanau et al., 2015); and (3) paraphrase rules mined from clusters of related questions (Fader et al., 2014). We briefly describe these models below; however, there is nothing inherent in our framework that is specific to them, and any other paraphrase generator could be used instead.

2.1.1 PPDB-based Generation

Bilingual pivoting (Bannard and Callison-Burch, 2005) is one of the most well-known approaches to paraphrasing; it uses bilingual parallel corpora to learn paraphrases based on techniques from phrase-based statistical machine translation (SMT; Koehn et al. 2003). The intuition is that two English strings that translate to the same foreign string can be assumed to have the same meaning. The method first extracts a bilingual phrase table and then obtains English paraphrases by pivoting through foreign language phrases.

Drawing inspiration from syntax-based SMT, Callison-Burch (2008) and Ganitkevitch et al. (2011) extended this idea to syntactic paraphrases, leading to the creation of PPDB (Ganitkevitch et al., 2013), a large-scale paraphrase database containing over a billion paraphrase pairs in 24 different languages.

Figure 2: Overview of NMT-based paraphrase generation. NMT1 (green) translates question q into pivots g1 . . . gK, which are then back-translated by NMT2 (blue), where K decoders jointly predict tokens at each time step, rather than only conditioning on one pivot and independently predicting outputs.

Pavlick et al. (2015) further used a supervised model to automatically label paraphrase pairs with entailment relationships based on natural logic (MacCartney, 2009). In our work, we employ bidirectionally entailing rules from PPDB. Specifically, we focus on lexical (single-word) and phrasal (multiword) rules, which we use to paraphrase questions by replacing words and phrases in them. An example is shown in Table 1, where we substitute car with vehicle and manufacturer with producer.
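As a rough illustration of how such lexical and phrasal substitution rules can be applied to a stemmed question, the sketch below matches rule sources against token spans and rewrites them; the rule list is hypothetical and the matching strategy is deliberately simplified.

```python
def apply_ppdb_rules(question, rules, max_paraphrases=10):
    """Generate paraphrases by substituting spans that match lexical
    (single-word) or phrasal (multiword) rewrite rules."""
    tokens = question.split()
    paraphrases = []
    for source, target in rules:
        src = source.split()
        for i in range(len(tokens) - len(src) + 1):
            if tokens[i:i + len(src)] == src:
                rewritten = tokens[:i] + target.split() + tokens[i + len(src):]
                paraphrases.append(" ".join(rewritten))
    # Deduplicate and drop trivial rewrites identical to the input.
    unique = [p for p in dict.fromkeys(paraphrases) if p != question]
    return unique[:max_paraphrases]

# Hypothetical bidirectionally entailing rules in the spirit of Table 1.
rules = [("car", "vehicle"), ("manufacturer", "producer"), ("zip code", "postal code")]
print(apply_ppdb_rules("what be the zip code of the largest car manufacturer", rules))
```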

2.1.2 NMT-based Generation

Mallinson et al. (2016) revisit bilingual pivoting in the context of neural machine translation (NMT; Sutskever et al. 2014; Bahdanau et al. 2015) and present a paraphrasing model based on neural networks. At its core, NMT is trained end-to-end to maximize the conditional probability of a correct translation given a source sentence, using a bilingual corpus. Paraphrases can be obtained by translating an English string into a foreign language and then back-translating it into English. NMT-based pivoting models offer advantages over conventional methods, such as the ability to learn continuous representations and to consider wider context while paraphrasing.

In our work, we select German as our pivot following Mallinson et al. (2016), who show that it outperforms other languages in a wide range of paraphrasing experiments, and pretrain two NMT systems, English-to-German (EN-DE) and German-to-English (DE-EN).



Source                          Target
the average size of what be    average size
be locate on which continent   what continent be a part of
language speak in what be      the official language of
what be the money in           what currency do use

Table 2: Examples of rules used in the rule-based paraphrase generator.

A naive implementation would translate a question to a German string and then back-translate it to English. However, using only one pivot can lead to inaccuracies, as it places too much faith in a single translation which may be wrong. Instead, we translate from multiple pivot sentences (Mallinson et al., 2016). As shown in Figure 2, question q is translated into the K-best German pivots, Gq = {g1, . . . , gK}. The probability of generating paraphrase q' = y1 . . . y|q'| is decomposed as:

p(q' \mid G_q) = \prod_{t=1}^{|q'|} p(y_t \mid y_{<t}, G_q) = \prod_{t=1}^{|q'|} \sum_{k=1}^{K} p(g_k \mid q) \, p(y_t \mid y_{<t}, g_k) \qquad (2)

where y<t = y1, . . . , yt−1, and |q'| is the length of q'. The probabilities p(gk|q) and p(yt|y<t, gk) are computed by the EN-DE and DE-EN models, respectively. We use beam search to decode tokens by conditioning on multiple pivoting sentences. The results with the best decoding scores are considered candidate paraphrases. Examples of NMT paraphrases are shown in Table 1.
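A minimal sketch of the multi-pivot mixture in Equation (2) is given below; greedy decoding stands in for the beam search used in the paper, and de_en_step is a toy stand-in for the pretrained DE-EN model, which is assumed rather than implemented.

```python
def pivot_token_distribution(prefix, pivots, pivot_probs, de_en_step):
    """Mixture over pivots from Equation (2):
    p(y_t | y_<t, Gq) = sum_k p(g_k|q) * p(y_t | y_<t, g_k)."""
    mixed = {}
    for pivot, p_pivot in zip(pivots, pivot_probs):
        for token, p_tok in de_en_step(prefix, pivot).items():
            mixed[token] = mixed.get(token, 0.0) + p_pivot * p_tok
    return mixed

def greedy_backtranslate(pivots, pivot_probs, de_en_step, max_len=20):
    """Greedy decoding stand-in for the beam search used in the paper."""
    prefix = ["<s>"]
    for _ in range(max_len):
        dist = pivot_token_distribution(prefix, pivots, pivot_probs, de_en_step)
        token = max(dist, key=dist.get)
        if token == "</s>":
            break
        prefix.append(token)
    return " ".join(prefix[1:])

def de_en_step(prefix, pivot):
    # Toy next-token distribution that simply walks through a fixed output.
    sequence = ["who", "founded", "microsoft", "</s>"]
    return {sequence[min(len(prefix) - 1, len(sequence) - 1)]: 1.0}

print(greedy_backtranslate(["wer hat microsoft gegruendet"], [1.0], de_en_step))
```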

Compared to PPDB, NMT-based paraphrases are syntax-agnostic, operating on the surface level without knowledge of any underlying grammar. Furthermore, paraphrase rules are captured implicitly and cannot be easily extracted, e.g., from a phrase table. As mentioned earlier, the NMT-based approach has the potential of performing major rewrites, as paraphrases are generated while considering wider contextual information, whereas PPDB paraphrases are more local and mainly handle lexical variation.

2.1.3 Rule-Based Generation

Our third paraphrase generation approach uses rules mined from the WikiAnswers corpus (Fader et al., 2014), which contains more than 30 million question clusters labeled as paraphrases by WikiAnswers¹ users. This corpus is a large resource (the average cluster size is 25), but is relatively noisy due to its collaborative nature: 45% of question pairs are merely related rather than genuine paraphrases. We therefore followed the method proposed by Fader et al. (2013) to harvest paraphrase rules from the corpus. We first extracted question templates (i.e., questions with at most one wild-card) that appear in at least ten clusters. Any two templates co-occurring (more than five times) in the same cluster and with the same arguments were deemed paraphrases. Table 2 shows examples of rules extracted from the corpus. During paraphrase generation, we consider substrings of the input question as arguments, and match them with the mined template pairs. For example, the stemmed input question in Table 1 can be paraphrased using the rules (“what be the zip code of ”, “what be ’s postal code”) and (“what be the zip code of ”, “zip code of ”). If no exact match is found, we perform fuzzy matching by ignoring stop words in the question and templates.
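The mining step can be sketched roughly as follows, assuming each question in a cluster has already been split into a (template, argument) pair; the wildcard symbol, the toy cluster, and the relaxed thresholds in the usage example are illustrative.

```python
from collections import Counter
from itertools import combinations

def mine_template_rules(clusters, min_template_clusters=10, min_cooccur=5):
    """Mine paraphrase rules from question clusters, roughly in the spirit of
    Fader et al. (2013): two templates that co-occur in the same cluster with
    the same argument often enough become a paraphrase rule."""
    template_clusters = Counter()
    pair_counts = Counter()
    for cluster in clusters:
        template_clusters.update({t for t, _ in cluster})
        for (t1, a1), (t2, a2) in combinations(cluster, 2):
            if a1 == a2 and t1 != t2:
                pair_counts[tuple(sorted((t1, t2)))] += 1
    return [
        pair for pair, count in pair_counts.items()
        if count > min_cooccur
        and all(template_clusters[t] >= min_template_clusters for t in pair)
    ]

# Tiny example with relaxed thresholds (the paper requires templates in at
# least ten clusters and pairs co-occurring more than five times).
toy = [[("what be the zip code of _", "seattle"), ("zip code of _", "seattle")]] * 6
print(mine_template_rules(toy, min_template_clusters=1, min_cooccur=5))
```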

2.2 Paraphrase Scoring

Recall from Equation (1) that pθ(q'|q) scores the generated paraphrases q' ∈ Hq ∪ {q}. We estimate pθ(q'|q) using neural networks, given their successful application to paraphrase identification tasks (Socher et al., 2011; Yin and Schütze, 2015; He et al., 2015). Specifically, the input question and its paraphrases are encoded as vectors. Then, we employ a neural network to obtain the score s(q'|q), which after normalization becomes the probability pθ(q'|q).

Encoding Let q = q1 . . . q|q| denote an input question. Every word is initially mapped to a d-dimensional vector. In other words, vector qt is computed via qt = Wq e(qt), where Wq ∈ R^{d×|V|} is a word embedding matrix, |V| is the vocabulary size, and e(qt) is a one-hot vector. Next, we use a bi-directional recurrent neural network with long short-term memory units (LSTM; Hochreiter and Schmidhuber 1997) as the question encoder, which is shared by the input questions and their paraphrases. The encoder recursively processes tokens one by one, and uses the encoded vectors to represent questions. We compute the hidden vectors at the t-th time step via:

¹ wiki.answers.com



\overrightarrow{h}_t = \mathrm{LSTM}(\overrightarrow{h}_{t-1}, \mathbf{q}_t), \quad t = 1, \ldots, |q|
\overleftarrow{h}_t = \mathrm{LSTM}(\overleftarrow{h}_{t+1}, \mathbf{q}_t), \quad t = |q|, \ldots, 1 \qquad (3)

where \overrightarrow{h}_t, \overleftarrow{h}_t ∈ R^n. In this work we follow the LSTM function described in Pham et al. (2014). The representation of q is obtained by:

\mathbf{q} = [\overrightarrow{h}_{|q|}, \overleftarrow{h}_1] \qquad (4)

where [·, ·] denotes concatenation, and q ∈ R^{2n}.

Scoring After obtaining vector representations for q and q', we compute the score s(q'|q) via:

s(q' \mid q) = \mathbf{w}_s \cdot [\mathbf{q}, \mathbf{q}', \mathbf{q} \odot \mathbf{q}'] + b_s \qquad (5)

where ws ∈ R^{6n} is a parameter vector, [·, ·, ·] denotes concatenation, ⊙ is element-wise multiplication, and bs is the bias. Alternative ways to compute s(q'|q), such as a dot product or a bilinear term, were not empirically better than Equation (5), and we omit them from further discussion.
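A minimal PyTorch sketch of the scoring model defined by Equations (3)-(5) is shown below; the embedding layer, the hyperparameters, and the use of the standard nn.LSTM cell (rather than the Pham et al. (2014) variant) are simplifying assumptions.

```python
import torch
import torch.nn as nn

class ParaphraseScorer(nn.Module):
    """Shared BiLSTM question encoder (Equations (3)-(4)) and the scoring
    function of Equation (5)."""

    def __init__(self, vocab_size, emb_dim=100, hidden_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        # w_s lives in R^{6n}: the input is [q, q', q ⊙ q'], each part in R^{2n}.
        self.w_s = nn.Linear(6 * hidden_dim, 1)

    def encode(self, token_ids):
        # h_n stacks the final forward state and the final backward state
        # (which corresponds to the first time step), matching Equation (4).
        _, (h_n, _) = self.encoder(self.embed(token_ids))
        return torch.cat([h_n[0], h_n[1]], dim=-1)

    def forward(self, question_ids, paraphrase_ids):
        q = self.encode(question_ids)
        q_prime = self.encode(paraphrase_ids)
        return self.w_s(torch.cat([q, q_prime, q * q_prime], dim=-1)).squeeze(-1)

# Toy usage: scores for two candidate paraphrases of one question, normalized
# with a softmax as in Equation (6) further below.
scorer = ParaphraseScorer(vocab_size=1000)
question = torch.randint(0, 1000, (2, 6))    # the question, repeated per candidate
candidates = torch.randint(0, 1000, (2, 6))  # two candidate paraphrases
print(torch.softmax(scorer(question, candidates), dim=0))
```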

Normalization For paraphrases q' ∈ Hq ∪ {q}, the probability pθ(q'|q) is computed via:

p_\theta(q' \mid q) = \frac{\exp\{s(q' \mid q)\}}{\sum_{r \in H_q \cup \{q\}} \exp\{s(r \mid q)\}} \qquad (6)

where the paraphrase scores are normalized over the set Hq ∪ {q}.

2.3 QA Models

The framework defined in Equation (1) is relatively flexible with respect to the QA model being employed, as long as it can predict pψ(a|q'). We illustrate this by performing experiments across different tasks and describe below the models used for these tasks.

Knowledge Base QA In our first task we use the Freebase knowledge base to answer questions. Query graphs for the questions typically contain more than one predicate. For example, to answer the question “who is the ceo of microsoft in 2008”, we need to use one relation to query “ceo of microsoft” and another relation for the constraint “in 2008”. For this task, we employ the SIMPLEGRAPH model described in Reddy et al. (2016, 2017), and follow their training protocol and feature design. In brief, their method uses rules to convert questions to ungrounded logical forms, which are subsequently matched against Freebase subgraphs. The QA model learns from question-answer pairs: it extracts features for pairs of questions and Freebase subgraphs, and uses a logistic regression classifier to predict the probability that a candidate answer is correct. We perform entity linking using the Freebase/KG API on the original question (Reddy et al., 2016, 2017), and generate candidate Freebase subgraphs. The QA model estimates how likely it is for a subgraph to yield the correct answer.
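The sketch below illustrates the kind of logistic-regression scoring this involves; the feature names and weights are invented for illustration and do not correspond to the published feature set.

```python
import math

def subgraph_probability(features, weights, bias=0.0):
    """Logistic regression over features extracted from a
    (question, Freebase subgraph) pair, as a stand-in for the
    SIMPLEGRAPH-style scorer described above."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical features for one candidate subgraph of "who founded microsoft".
features = {"word_overlap": 2.0, "relation=organization.founders": 1.0, "answer_type_match": 1.0}
weights = {"word_overlap": 0.8, "relation=organization.founders": 1.5, "answer_type_match": 0.5}
print(subgraph_probability(features, weights))
```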

Answer Sentence Selection Given a question and a collection of relevant sentences, the goal of this task is to select sentences which contain an answer to the question. The assumption is that correct answer sentences have high semantic similarity to the questions (Yu et al., 2014; Yang et al., 2015; Miao et al., 2016). We use two bi-directional recurrent neural networks (BILSTM) to separately encode questions and answer sentences as vectors (Equation (4)). Similarity scores are computed as shown in Equation (5), and then squashed to (0, 1) by a sigmoid function in order to predict pψ(a|q').

2.4 Training and Inference

We use a log-likelihood objective for training, which maximizes the likelihood of the correct answer given a question:

\text{maximize} \sum_{(q,a) \in D} \log p(a \mid q) \qquad (7)

where D is the set of all question-answer training pairs, and p(a|q) is computed as shown in Equation (1). For the knowledge base QA task, we predict how likely it is that a subgraph obtains the correct answer, and the answers of some candidate subgraphs are only partially correct. We therefore use the binary cross entropy between the candidate subgraph's F1 score and the prediction as the objective function. The RMSProp algorithm (Tieleman and Hinton, 2012) is employed to solve this non-convex optimization problem. Moreover, dropout is used for regularizing the recurrent neural networks (Pham et al., 2014).
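For a single training question, the objective of Equation (7) combined with Equation (1) could be computed as in the sketch below; for the knowledge base task the paper instead uses a binary cross-entropy against each subgraph's F1 score, and the tensor shapes and toy inputs here are illustrative.

```python
import torch

def negative_log_likelihood(paraphrase_scores, answer_probs, gold_index):
    """One-question loss: paraphrase_scores holds s(q'|q) for Hq ∪ {q},
    answer_probs[i, j] is p_psi(a_j | q'_i), and gold_index marks the
    correct answer."""
    weights = torch.softmax(paraphrase_scores, dim=0)   # p_theta(q'|q), Equation (6)
    p_answers = weights @ answer_probs                  # p(a|q), Equation (1)
    return -torch.log(p_answers[gold_index])            # Equation (7), negated

# Toy example: 3 candidate paraphrases, 4 candidate answers, gold answer index 2.
scores = torch.randn(3, requires_grad=True)
answer_probs = torch.softmax(torch.randn(3, 4), dim=1)
loss = negative_log_likelihood(scores, answer_probs, gold_index=2)
loss.backward()
print(float(loss), scores.grad)
```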

At test time, we generate paraphrases for the question q, and then predict the answer by:

a = \arg\max_{a' \in C_q} p(a' \mid q) \qquad (8)



where Cq is the set of candidate answers, and p(a'|q) is computed as shown in Equation (1).

3 Experiments

We compared our model, which we call PARA4QA (as shorthand for learning to paraphrase for question answering), against multiple previous systems on three datasets. In the following we introduce these datasets, provide implementation details for our model, describe the systems used for comparison, and present our results.

3.1 Datasets

Our model was trained on three datasets, representative of different types of QA tasks. The first two datasets focus on question answering over a structured knowledge base, whereas the third one is specific to answer sentence selection.

WEBQUESTIONS This dataset (Berant et al., 2013) contains 3,778 training instances and 2,032 test instances. Questions were collected by querying the Google Suggest API. A breadth-first search beginning with wh- was conducted and the answers were crowd-sourced using Freebase as the backend knowledge base.

GRAPHQUESTIONS The dataset (Su et al., 2016) contains 5,166 question-answer pairs (evenly split into a training and a test set). It was created by asking crowd workers to paraphrase 500 Freebase graph queries in natural language.

WIKIQA This dataset (Yang et al., 2015) has 3,047 questions sampled from Bing query logs. The questions are associated with 29,258 candidate answer sentences, 1,473 of which contain the correct answers to the questions.

3.2 Implementation Details

Paraphrase Generation Candidate paraphrases were stemmed (Minnen et al., 2001) and lowercased. We discarded duplicate or trivial paraphrases which only rewrite stop words or punctuation. For the NMT model, we followed the implementation² and settings described in Mallinson et al. (2016), and used English↔German as the language pair. The system was trained on data released as part of the WMT15 shared translation task (4.2 million sentence pairs). We also had access to back-translated monolingual training data (Sennrich et al., 2016a).

² github.com/sebastien-j/LV_groundhog

Rare words were split into subword units (Sennrich et al., 2016b) to handle out-of-vocabulary words in questions. We used the top 15 decoding results as candidate paraphrases. We used the S size package of PPDB 2.0 (Pavlick et al., 2015) for high precision. At most 10 candidate paraphrases were considered. We mined paraphrase rules from WikiAnswers (Fader et al., 2014) as described in Section 2.1.3. The extracted rules were ranked using the pointwise mutual information between template pairs in the WikiAnswers corpus. The top 10 candidate paraphrases were used.

Training For the paraphrase scoring model, we used GloVe (Pennington et al., 2014) vectors³ pretrained on Wikipedia 2014 and Gigaword 5 to initialize the word embedding matrix. We kept this matrix fixed across datasets. Out-of-vocabulary words were replaced with a special unknown symbol. We also augmented questions with start-of- and end-of-sequence symbols. Word vectors for these special symbols were updated during training. Model hyperparameters were validated on the development set. The dimensions of hidden vectors and word embeddings were selected from {50, 100, 200} and {100, 200}, respectively. The dropout rate was selected from {0.2, 0.3, 0.4}. The BILSTM for the answer sentence selection QA model used the same hyperparameters. Parameters were randomly initialized from a uniform distribution U(−0.08, 0.08). The learning rate and decay rate of RMSProp were 0.01 and 0.95, respectively. The batch size was set to 150. To alleviate the exploding gradient problem (Pascanu et al., 2013), the gradient norm was clipped to 5. Early stopping was used to determine the number of epochs.

3.3 Paraphrase Statistics

Table 3 presents descriptive statistics on the paraphrases generated by the various systems across datasets (training set). As can be seen, the average paraphrase length is similar to the average length of the original questions. The NMT method generates more paraphrases and has wider coverage, while the average number and coverage of the other two methods varies per dataset. As a way of quantifying the extent to which rewriting takes place, we report BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) scores between the original questions and their paraphrases.

³ nlp.stanford.edu/projects/glove



Metric         GRAPHQ (NMT / PPDB / Rule)   WEBQ (NMT / PPDB / Rule)    WIKIQA (NMT / PPDB / Rule)
avg(|q|)       10.87                        7.71                        6.47
avg(|q'|)      10.87 / 12.40 / 10.51        8.13 / 8.55 / 7.54          6.60 / 7.85 / 7.15
avg(#q')       13.85 / 3.02 / 2.50          13.76 / 0.71 / 7.74         13.95 / 0.62 / 5.64
Coverage (%)   99.67 / 73.52 / 31.16        99.87 / 35.15 / 83.61       99.89 / 31.04 / 63.12
BLEU (%)       42.33 / 67.92 / 54.23        35.14 / 56.62 / 42.37       32.40 / 54.24 / 40.62
TER (%)        39.18 / 14.87 / 38.59        45.38 / 19.94 / 43.44       46.10 / 17.20 / 48.59

Table 3: Statistics of generated paraphrases across datasets (training set). avg(|q|): average question length; avg(|q'|): average paraphrase length; avg(#q'): average number of paraphrases; coverage: the proportion of questions that have at least one candidate paraphrase.

The NMT method and the rules extracted from WikiAnswers tend to paraphrase more (i.e., have lower BLEU and higher TER scores) compared to PPDB.

3.4 Comparison Systems

We compared our framework to previous work and several ablation models which either do not use paraphrases or paraphrase scoring, or are not jointly trained.

The first baseline only uses the base QA models described in Section 2.3 (SIMPLEGRAPH and BILSTM). The second baseline (AVGPARA) does not take advantage of paraphrase scoring: the paraphrases for a given question are used, but the QA model's results are simply averaged to predict the answers. The third baseline (DATAAUGMENT) employs paraphrases for data augmentation during training. Specifically, we use the question, its paraphrases, and the correct answer to automatically generate new training samples.

In the fourth baseline (SEPPARA), the paraphrase scoring model is separately trained on paraphrase classification data, without taking question-answer pairs into account. In the experiments, we used the Quora question paraphrase dataset⁴, which contains question pairs and labels indicating whether they constitute paraphrases or not. We removed questions with more than 25 tokens and sub-sampled to balance the dataset. We used 90% of the resulting 275K examples for training, and the remaining for development. The paraphrase score s(q'|q) (Equation (5)) was wrapped by a sigmoid function to predict the probability of a question pair being a paraphrase. A binary cross-entropy loss was used as the objective. The classification accuracy on the dev set was 80.6%.
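A minimal sketch of the SEPPARA training signal, assuming the raw paraphrase scores have already been computed with Equation (5); the toy scores and labels are illustrative.

```python
import torch
import torch.nn.functional as F

def separate_paraphrase_loss(scores, labels):
    """SEPPARA-style objective: the paraphrase score s(q'|q) is squashed by a
    sigmoid and trained with binary cross-entropy against paraphrase labels."""
    return F.binary_cross_entropy(torch.sigmoid(scores), labels)

# Toy batch: raw scores for four question pairs, two of which are paraphrases.
scores = torch.tensor([2.1, -0.5, 0.3, -1.7])
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(float(separate_paraphrase_loss(scores, labels)))
```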

⁴ goo.gl/kMP46n

Method                                 GRAPHQ    WEBQ
SEMPRE (Berant et al., 2013)           10.8      35.7
JACANA (Yao and Van Durme, 2014)       5.1       33.0
PARASEMP (Berant and Liang, 2014)      12.8      39.9
SUBGRAPH (Bordes et al., 2014a)        –         40.4
MCCNN (Dong et al., 2015)              –         40.8
YAO15 (Yao, 2015)                      –         44.3
AGENDAIL (Berant and Liang, 2015)      –         49.7
STAGG (Yih et al., 2015)               –         48.4 (52.5)
MCNN (Xu et al., 2016)                 –         47.0 (53.3)
TYPERERANK (Yavuz et al., 2016)        –         51.6 (52.6)
BILAYERED (Narayan et al., 2016)       –         47.2
UDEPLAMBDA (Reddy et al., 2017)        17.6      49.5
SIMPLEGRAPH (baseline)                 15.9      48.5
AVGPARA                                16.1      48.8
SEPPARA                                18.4      49.6
DATAAUGMENT                            16.3      48.7
PARA4QA                                20.4      50.7
−NMT                                   18.5      49.5
−PPDB                                  19.5      50.4
−RULE                                  19.4      49.1

Table 4: Model performance (average F1, %) on GRAPHQUESTIONS and WEBQUESTIONS. Results with additional task-specific resources are shown in parentheses. The base QA model is SIMPLEGRAPH. Best results in each group are shown in bold.

Finally, in order to assess the individual contribution of different paraphrasing resources, we compared the PARA4QA model against versions of itself with one paraphrase generator removed (−NMT/−PPDB/−RULE).

3.5 Results

We first discuss the performance of PARA4QA on GRAPHQUESTIONS and WEBQUESTIONS. The first block in Table 4 shows a variety of systems previously described in the literature, using average F1 as the evaluation metric (Berant et al., 2013). Among these, PARASEMP, SUBGRAPH, MCCNN, and BILAYERED utilize paraphrasing resources. The second block compares PARA4QA against various related baselines (see Section 3.4). SIMPLEGRAPH results on WEBQUESTIONS and GRAPHQUESTIONS are taken from Reddy et al. (2016) and Reddy et al. (2017), respectively.

Overall, we observe that PARA4QA outperforms baselines which either do not employ paraphrases (SIMPLEGRAPH) or paraphrase scoring (AVGPARA, DATAAUGMENT), or are not jointly trained (SEPPARA). On GRAPHQUESTIONS, our model PARA4QA outperforms the previous state of the art by a wide margin.



Method                                 MAP       MRR
BIGRAMCNN (Yu et al., 2014)            0.6190    0.6281
BIGRAMCNN+CNT (Yu et al., 2014)        0.6520    0.6652
PARAVEC (Le and Mikolov, 2014)         0.5110    0.5160
PARAVEC+CNT (Le and Mikolov, 2014)     0.5976    0.6058
LSTM (Miao et al., 2016)               0.6552    0.6747
LSTM+CNT (Miao et al., 2016)           0.6820    0.6988
NASM (Miao et al., 2016)               0.6705    0.6914
NASM+CNT (Miao et al., 2016)           0.6886    0.7069
KVMEMNET+CNT (Miller et al., 2016)     0.7069    0.7265
BILSTM (baseline)                      0.6456    0.6608
AVGPARA                                0.6587    0.6753
SEPPARA                                0.6613    0.6765
DATAAUGMENT                            0.6578    0.6736
PARA4QA                                0.6759    0.6918
−NMT                                   0.6528    0.6680
−PPDB                                  0.6613    0.6767
−RULE                                  0.6553    0.6756
BILSTM+CNT (baseline)                  0.6722    0.6877
PARA4QA+CNT                            0.6978    0.7131

Table 5: Model performance on WIKIQA. +CNT: word matching features introduced in Yang et al. (2015). The base QA model is BILSTM. Best results in each group are shown in bold.

Ablation experiments with one of the paraphrase generators removed show that performance drops most when the NMT paraphrases are not used on GRAPHQUESTIONS, whereas on WEBQUESTIONS removal of the rule-based generator hurts performance most. One reason is that the rule-based method has higher coverage on WEBQUESTIONS than on GRAPHQUESTIONS (see Table 3).

Results on WIKIQA are shown in Table 5. We report MAP and MRR, which evaluate the relative ranks of correct answers among the candidate sentences for a question. Again, we observe that PARA4QA outperforms related baselines (see BILSTM, DATAAUGMENT, AVGPARA, and SEPPARA). Ablation experiments show that performance drops most when NMT paraphrases are removed. When word matching features are used (see +CNT in the third block), PARA4QA achieves state-of-the-art performance.

Examples of paraphrases and their probabilities pθ(q'|q) (see Equation (6)) learned by PARA4QA are shown in Table 6. The two examples are taken from the development set of GRAPHQUESTIONS and WEBQUESTIONS, respectively. We also show the Freebase relations used to query the correct answers. In the first example, the original question cannot yield the correct answer because of the mismatch between the question and the knowledge base. The paraphrase contains “role” in place of “sort of part”, increasing the chance of overlap between the question and the predicate words.

Examples                                            pθ(q'|q)
(music.concert_performance.performance_role)
what sort of part do queen play in concert          0.0659
what role do queen play in concert                  0.0847
what be the role play by the queen in concert       0.0687
what role do queen play during concert              0.0670
what part do queen play in concert                  0.0664
which role do queen play in concert concert         0.0652
(sports.sports_team_roster.team)
what team do shaq play 4                            0.2687
what team do shaq play for                          0.2783
which team do shaq play with                        0.0671
which team do shaq play out                         0.0655
which team have you play shaq                       0.0650
what team have we play shaq                         0.0497

Table 6: Questions and their top-five paraphrases with probabilities learned by the model. The Freebase relations used to query the correct answers are shown in brackets. The original question is underlined. Questions with incorrect predictions are in red.

The second question contains an informal expression, “play 4”, which confuses the QA model. The paraphrase model generates “play for” and predicts a high paraphrase score for it. More generally, we observe that the model tends to give higher probabilities pθ(q'|q) to paraphrases biased towards delivering appropriate answers.

We also analyzed which structures were most often paraphrased within a question. We manually inspected 50 (randomly sampled) questions from the development portion of each dataset, and their three top scoring paraphrases (Equation (5)). We grouped the most commonly paraphrased structures into the following categories: a) question words, i.e., wh-words and “how”; b) question focus structures, i.e., cue words or cue phrases for an answer with a specific entity type (Yao and Van Durme, 2014); c) verbs or noun phrases indicating the relation between the question topic entity and the answer; and d) structures requiring aggregation or imposing additional constraints the answer must satisfy (Yih et al., 2015). In the example “which year did Avatar release in UK”, the question word is “which”, the question focus is “year”, the verb is “release”, and “in UK” constrains the answer to a specific location.

Figure 3 shows the degree to which different types of structures are paraphrased. As can be seen, most rewrites affect Relation Verb, especially on WEBQUESTIONS. Question Focus, Relation NP, and Constraint & Aggregation are more often rewritten in GRAPHQUESTIONS compared to the other datasets.



Figure 3: Proportion of linguistic phenomena (Question Word, Question Focus, Relation Verb, Relation NP, Constraint & Aggregation) subject to paraphrasing within a question, for GraphQ, WebQ, and WikiQA.

Method         Simple         Complex
SIMPLEGRAPH    20.9           12.2
PARA4QA        27.4 (+6.5)    16.0 (+3.8)

Table 7: We group GRAPHQUESTIONS into simple and complex questions and report average F1 (%) in each split. Best results in each group are shown in bold. The values in brackets are absolute improvements of average F1 scores.


Finally, we examined how our method fares on simple versus complex questions. We performed this analysis on GRAPHQUESTIONS, as it contains a larger proportion of complex questions. We consider questions that contain a single relation as simple. Complex questions have multiple relations or require aggregation. Table 7 shows how our model performs in each group. We observe improvements for both types of questions, with the impact on simple questions being more pronounced. This is not entirely surprising, as it is easier to generate paraphrases and predict the paraphrase scores for simpler questions.

4 Conclusions

In this work we proposed a general framework for learning paraphrases for question answering. Paraphrase scoring and QA models are trained end-to-end on question-answer pairs, which results in learning paraphrases with a purpose. The framework is not tied to a specific paraphrase generator or QA system. In fact, it allows us to incorporate several paraphrasing modules, and can serve as a testbed for exploring their coverage and rewriting capabilities. Experimental results on three datasets show that our method improves performance across tasks. There are several directions for future work. The framework can be used for other natural language processing tasks which are sensitive to the variation of input (e.g., textual entailment or summarization). We would also like to explore more advanced paraphrase scoring models (Parikh et al., 2016; Wang and Jiang, 2016) as well as additional paraphrase generators, since improvements in the diversity and the quality of paraphrases could also enhance QA performance.

Acknowledgments The authors gratefully acknowledge the support of the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1; Mallinson) and the European Research Council (award number 681760; Lapata).

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 597–604, Ann Arbor, Michigan. Association for Computational Linguistics.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA. Association for Computational Linguistics.

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1415–1425, Baltimore, Maryland. Association for Computational Linguistics.

Jonathan Berant and Percy Liang. 2015. Imitation learning of agenda-based semantic parsers. Transactions of the Association for Computational Linguistics, 3:545–558.

Rahul Bhagat and Eduard Hovy. 2013. What is a paraphrase? Computational Linguistics, 39(3):463–472.

Antoine Bordes, Sumit Chopra, and Jason Weston. 2014a. Question answering with subgraph embeddings. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 615–620, Doha, Qatar. Association for Computational Linguistics.

Antoine Bordes, Jason Weston, and Nicolas Usunier. 2014b. Open question answering with weakly supervised embedding models. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases - Volume 8724, ECML PKDD 2014, pages 165–180, New York, NY, USA. Springer-Verlag New York, Inc.

Chris Callison-Burch. 2008. Syntactic constraints on paraphrases extracted from parallel corpora. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 196–205, Honolulu, Hawaii. Association for Computational Linguistics.

Bo Chen, Le Sun, Xianpei Han, and Bo An. 2016. Sentence rewriting for semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 766–777, Berlin, Germany. Association for Computational Linguistics.

Li Dong, Furu Wei, Ming Zhou, and Ke Xu. 2015. Question answering over Freebase with multi-column convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 260–269, Beijing, China. Association for Computational Linguistics.

Pablo Duboue and Jennifer Chu-Carroll. 2006. Answering the question you wish they had asked: The impact of paraphrasing for question answering. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 33–36, New York City, USA. Association for Computational Linguistics.

Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2013. Paraphrase-driven learning for open question answering. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1608–1618, Sofia, Bulgaria. Association for Computational Linguistics.

Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2014. Open question answering over curated and extracted knowledge bases. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 1156–1165, New York, NY, USA. ACM.

Juri Ganitkevitch, Chris Callison-Burch, Courtney Napoles, and Benjamin Van Durme. 2011. Learning sentential paraphrases from bilingual parallel corpora for text-to-text generation. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1168–1179, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 758–764, Atlanta, Georgia. Association for Computational Linguistics.

Hua He, Kevin Gimpel, and Jimmy Lin. 2015. Multi-perspective sentence similarity modeling with convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1576–1586, Lisbon, Portugal. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54, Sapporo, Japan.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, pages 1188–1196.

Bill MacCartney. 2009. Natural Language Inference. Ph.D. thesis, Stanford University.

Jonathan Mallinson, Rico Sennrich, and Mirella Lapata. 2016. Paraphrasing revisited with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain.

Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural variational inference for text processing. In Proceedings of the 33rd International Conference on Machine Learning, pages 1727–1736.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1400–1409, Austin, Texas. Association for Computational Linguistics.

Guido Minnen, John Carroll, and Darren Pearce. 2001. Applied morphological processing of English. Natural Language Engineering, 7(3):207–223.

Shashi Narayan, Siva Reddy, and Shay B. Cohen. 2016. Paraphrase generation from latent-variable PCFGs for semantic parsing. In Proceedings of the 9th International Natural Language Generation Conference, pages 153–162, Edinburgh, Scotland.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255, Austin, Texas. Association for Computational Linguistics.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, pages 1310–1318, Atlanta, Georgia.

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 425–430, Beijing, China. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

V. Pham, T. Bluche, C. Kermorvant, and J. Louradour. 2014. Dropout improves recurrent neural networks for handwriting recognition. In 2014 14th International Conference on Frontiers in Handwriting Recognition, pages 285–290.

Siva Reddy, Oscar Täckström, Michael Collins, Tom Kwiatkowski, Dipanjan Das, Mark Steedman, and Mirella Lapata. 2016. Transforming dependency structures to logical forms for semantic parsing. Transactions of the Association for Computational Linguistics, 4:127–140.

Siva Reddy, Oscar Täckström, Slav Petrov, Mark Steedman, and Mirella Lapata. 2017. Universal semantic parsing. arXiv preprint arXiv:1702.03196.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223–231.

Richard Socher, Eric H. Huang, Jeffrey Pennin, Christopher D. Manning, and Andrew Y. Ng. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, pages 801–809.

Yu Su, Huan Sun, Brian Sadler, Mudhakar Srivatsa, Izzeddin Gur, Zenghui Yan, and Xifeng Yan. 2016. On generating characteristic-rich question sets for QA evaluation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 562–572, Austin, Texas. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS'14, pages 3104–3112, Cambridge, MA, USA. MIT Press.

T. Tieleman and G. Hinton. 2012. Lecture 6.5 - RmsProp: Divide the gradient by a running average of its recent magnitude. Technical report.

Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1442–1451, San Diego, California. Association for Computational Linguistics.

Kun Xu, Siva Reddy, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2016. Question answering on Freebase via relation extraction and textual evidence. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2326–2336, Berlin, Germany. Association for Computational Linguistics.

Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2013–2018, Lisbon, Portugal. Association for Computational Linguistics.

Xuchen Yao. 2015. Lean question answering over Freebase from scratch. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 66–70, Denver, Colorado. Association for Computational Linguistics.

Xuchen Yao and Benjamin Van Durme. 2014. Information extraction over structured data: Question answering with Freebase. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 956–966, Baltimore, Maryland. Association for Computational Linguistics.

Semih Yavuz, Izzeddin Gur, Yu Su, Mudhakar Srivatsa, and Xifeng Yan. 2016. Improving semantic parsing via answer type inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 149–159, Austin, Texas. Association for Computational Linguistics.

Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1321–1331, Beijing, China. Association for Computational Linguistics.

Wenpeng Yin and Hinrich Schütze. 2015. Convolutional neural network for paraphrase identification. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 901–911, Denver, Colorado. Association for Computational Linguistics.

Lei Yu, Karl Moritz Hermann, Phil Blunsom, and Stephen Pulman. 2014. Deep learning for answer sentence selection. In NIPS Deep Learning Workshop.