

Story Ending Generation with Incremental Encoding and Commonsense Knowledge

Jian Guan2*, Yansen Wang1*, Minlie Huang1†
1Dept. of Computer Science & Technology, Tsinghua University, Beijing 100084, China

1Institute for Artificial Intelligence, Tsinghua University (THUAI), China
1Beijing National Research Center for Information Science and Technology, China

2Dept. of Physics, Tsinghua University, Beijing 100084, China
[email protected]; [email protected];

[email protected]

Abstract

Generating a reasonable ending for a given story context, i.e., story ending generation, is a strong indication of story comprehension. This task requires not only understanding the context clues which play an important role in planning the plot, but also handling implicit knowledge to make a reasonable, coherent story. In this paper, we devise a novel model for story ending generation. The model adopts an incremental encoding scheme to represent the context clues which span the story context. In addition, commonsense knowledge is applied through multi-source attention to facilitate story comprehension, and thus to help generate coherent and reasonable endings. Through building context clues and using implicit knowledge, the model is able to produce reasonable story endings. Automatic and manual evaluation shows that our model can generate more reasonable story endings than state-of-the-art baselines.1

Introduction

Story generation is an important but challenging task because it requires dealing with logic and implicit knowledge (Li et al. 2013; Soo, Lee, and Chen 2016; Ji et al. 2017; Jain et al. 2017; Martin et al. 2018; Clark, Ji, and Smith 2018). Story ending generation aims at concluding a story and completing the plot given a story context. We argue that solving this task involves addressing the following issues: 1) representing the context clues which contain key information for planning a reasonable ending; and 2) using implicit knowledge (e.g., commonsense knowledge) to facilitate understanding of the story and better predict what will happen next.

Compared to textual entailment or reading comprehension (Dagan, Glickman, and Magnini 2006; Hermann et al. 2015), story ending generation must deal more with the logic and causality information that may span multiple sentences in a story context. The logic information in a story can be captured by the appropriate sequence of events2 or entities occurring in a sequence of sentences, and by the chronological order or causal relationship between those events or entities.

* Authors contributed equally to this work.
† Corresponding author: Minlie Huang.

Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1 Our code and data are available at https://github.com/JianGuanTHU/StoryEndGen.

Figure 1: A story example. Story context: "Today is Halloween. Jack is so excited to go trick or treating tonight. He is going to dress up like a monster. The costume is real scary." Ending: "He hopes to get a lot of candy." Words in blue/purple are events and entities. The bottom-left graph is retrieved from ConceptNet (e.g., candy, scare, hallow day, beast, and dress connected to Halloween, trick or treat, monster, costume, and dress up via RelatedTo/MannerOf edges) and the bottom-right graph represents how events and entities form the context clue.

The ending should be generated from the whole context clue rather than merely inferred from a single entity or the last sentence. It is thus important for story ending generation to represent the context clues for predicting what will happen in an ending.

However, deciding a reasonable ending not only depends on representing the context clues properly, but also on the ability to understand language with implicit knowledge that is beyond the text surface. Humans use their own experiences and implicit knowledge to help understand a story. As shown in Figure 1, the ending mentions candy, which can be viewed as commonsense knowledge about Halloween. Such knowledge can be crucial for story ending generation.

Figure 1 shows an example of a typical story in the ROCStories corpus (Mostafazadeh et al. 2016b). In this example, the events or entities in the story context constitute the context clues, which reveal the logical or causal relationships between events or entities.

2 Event in this paper refers to a verb or a simple action, such as dress up or trick or treating, as shown in Figure 1.


These concepts, including Halloween, trick or treat, and monster, are connected as a graph structure. A reasonable ending should consider all the connected concepts rather than just an individual one. Furthermore, with the help of commonsense knowledge retrieved from ConceptNet (Speer and Havasi 2012), it is easier to infer a reasonable ending with the knowledge that candy is highly related to Halloween.

To address the two issues in story ending generation, we devise a model that is equipped with an incremental encoding scheme to encode context clues effectively, and a multi-source attention mechanism to use commonsense knowledge. The representation of the context clues is built through incremental reading (or encoding) of the sentences in the story context one by one. When encoding the current sentence in a story context, the model can attend not only to the words in the preceding sentence, but also to the knowledge graphs which are retrieved from ConceptNet for each word. In this manner, commonsense knowledge can be encoded in the model through graph representation techniques, and therefore be used to facilitate understanding the story context and inferring coherent endings. By integrating the context clues and commonsense knowledge, the model can generate more reasonable endings than state-of-the-art baselines.

Our contributions are as follows:

• To assess the ability of machines to comprehend a story, we investigate story ending generation from two new angles: one is modeling logic information via incremental context clues, and the other is the utility of implicit knowledge in this task.

• We propose a neural model which represents context clues by incremental encoding, and leverages commonsense knowledge by multi-source attention, to generate logical and reasonable story endings. Results show that the two techniques are effective in capturing the coherence and logic of stories.

Related Work

The corpus we used in this paper was first designed for the Story Cloze Test (SCT) (Mostafazadeh et al. 2016a), which requires selecting a correct ending from two candidates given a story context. Feature-based (Chaturvedi, Peng, and Dan 2017; Lin, Sun, and Han 2017) or neural (Mostafazadeh et al. 2016b; Wang, Liu, and Zhao 2017) classification models have been proposed to measure the coherence between a candidate ending and a story context from various aspects such as event, sentiment, and topic. However, story ending generation (Li, Ding, and Liu 2018; Zhao et al. 2018; Peng et al. 2018) is more challenging in that the task requires modeling context clues and implicit knowledge to produce reasonable endings.

Story generation, moving a step further toward complete story comprehension, is approached as selecting a sequence of events to form a story by satisfying a set of criteria (Li et al. 2013). Previous studies can be roughly categorized into two lines: rule-based methods and neural models. Most of the traditional rule-based methods for story generation (Li et al. 2013; Soo, Lee, and Chen 2016) retrieve events from a knowledge base with some pre-specified semantic relations.

Neural models for story generation have been widely studied with sequence-to-sequence (seq2seq) learning (Roemmele 2016), and various contents such as photos and independent descriptions have been used to inspire the story (Jain et al. 2017). To capture the deep meaning of key entities and events, Ji et al. (2017) and Clark, Ji, and Smith (2018) explicitly modeled the entities mentioned in a story with dynamic representations, and Martin et al. (2018) decomposed the problem into planning successive events and generating sentences from given events. Fan, Lewis, and Dauphin (2018) adopted a hierarchical architecture to generate a whole story from some given keywords.

Commonsense knowledge is beneficial for many natural language tasks such as semantic reasoning and text entailment, and it is particularly important for story generation. LoBue and Yates (2011) characterized the types of commonsense knowledge most involved in recognizing textual entailment. Afterwards, commonsense knowledge was used in natural language inference (R. Bowman et al. 2015; Zhang et al. 2017) and language generation (Zhou et al. 2018). Mihaylov and Frank (2018) incorporated external commonsense knowledge into a neural cloze-style reading comprehension model. Rashkin et al. (2018) performed commonsense inference on people's intents and the reactions of an event's participants given a short text. Similarly, Knight et al. (2018) introduced a new annotation framework to explain the psychology of story characters with commonsense knowledge. Commonsense knowledge has also been shown to be useful for choosing a correct story ending from two candidate endings (Lin, Sun, and Han 2017; Li et al. 2018).

Methodology

Overview

The task of story ending generation can be stated as follows: given a story context consisting of a sentence sequence $X = \{X_1, X_2, \cdots, X_K\}$3, where $X_i = x_1^{(i)} x_2^{(i)} \cdots x_{l_i}^{(i)}$ represents the $i$-th sentence containing $l_i$ words, the model should generate a one-sentence ending $Y = y_1 y_2 \cdots y_l$ which is reasonable in logic, formally:

$$Y^* = \mathop{\arg\max}_{Y} \mathcal{P}(Y|X). \quad (1)$$

As aforementioned, context clues and commonsense knowledge are important for modeling the logical and causal information in story ending generation. To this end, we devise an incremental encoding scheme based on the general encoder-decoder framework (Sutskever, Vinyals, and Le 2014). As shown in Figure 2, the scheme encodes the sentences in a story context incrementally with a multi-source attention mechanism: when encoding sentence $X_i$, the encoder obtains a context vector which is an attentive read of the hidden states and the graph vectors of the preceding sentence $X_{i-1}$.

3 Adjacent sentences in a story context have strong logical connections, temporal dependencies, and causal relationships.


Figure 2: Model overview. The model is equipped with incremental encoding (IE) and multi-source attention (MSA). $x_j^{(i)}$: the $j$-th word in sentence $i$; $c_{h_j}^{(i)}$: state context vector; $c_{x_j}^{(i)}$: knowledge context vector; $g_j^{(i)}$: graph vector of word $x_j^{(i)}$; $h_j^{(i)}$: $j$-th hidden state of sentence $i$. The state (knowledge) context vectors are attentive reads of the hidden states (graph vectors) in the preceding sentence.

In this manner, the relationship between words (some of which are entities or events) in sentence $X_{i-1}$ and those in $X_i$ is built incrementally, and therefore the chronological order or causal relationship between entities (or events) in adjacent sentences can be captured implicitly. To leverage commonsense knowledge, which is important for generating a reasonable ending, a one-hop knowledge graph for each word in a sentence is retrieved from ConceptNet, and each graph can be represented by a vector in two ways. The incremental encoder not only attends to the hidden states of $X_{i-1}$, but also to the graph vectors at each position of $X_{i-1}$. By this means, our model can generate more reasonable endings by representing context clues and encoding commonsense knowledge.

Background: Encoder-Decoder Framework

The encoder-decoder framework is a general framework widely used in text generation. Formally, the model encodes the input sequence $X = x_1 x_2 \cdots x_m$ into a sequence of hidden states, as follows:

$$h_t = \mathrm{LSTM}(h_{t-1}, e(x_t)), \quad (2)$$

where $h_t$ denotes the hidden state at step $t$ and $e(x)$ is the word vector of $x$.

At each decoding position, the framework generates a word by sampling from the word distribution $\mathcal{P}(y_t|y_{<t}, X)$ (where $y_{<t} = y_1 y_2 \cdots y_{t-1}$ denotes the sequence that has been generated before step $t$), which is computed as follows:

$$\mathcal{P}(y_t|y_{<t}, X) = \mathrm{softmax}(\mathbf{W}_0 s_t + \mathbf{b}_0), \quad (3)$$
$$s_t = \mathrm{LSTM}(s_{t-1}, e(y_{t-1}), c_{t-1}), \quad (4)$$

where $s_t$ denotes the decoder state at step $t$. When an attention mechanism is applied, $c_{t-1}$ is an attentive read of the context, which is a weighted sum of the encoder's hidden states, $c_{t-1} = \sum_{i=1}^{m} \alpha_{(t-1)i} h_i$, where $\alpha_{(t-1)i}$ measures the association between the decoder state $s_{t-1}$ and the encoder state $h_i$. Refer to (Bahdanau, Cho, and Bengio 2014) for more details.
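For illustration only, one decoding step of Eqs. (3)-(4) can be sketched in PyTorch as below; this is not the released implementation, and the dot-product scoring, module names, and default dimensions are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveDecoderStep(nn.Module):
    """One decoding step: s_t = LSTM(s_{t-1}, [e(y_{t-1}); c_{t-1}]), then a softmax over the vocabulary."""
    def __init__(self, vocab_size, emb_dim=200, hidden_dim=512):
        super().__init__()
        self.cell = nn.LSTMCell(emb_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)   # W_0, b_0 in Eq. (3)

    def forward(self, prev_word_emb, prev_state, enc_states):
        # prev_word_emb: e(y_{t-1}), [batch, emb_dim]
        # prev_state: (s_{t-1}, cell_{t-1}), each [batch, hidden_dim]
        # enc_states: encoder hidden states h_1..h_m, [batch, m, hidden_dim]
        s_prev, c_prev = prev_state
        # alpha_{(t-1)i} from a dot-product score between s_{t-1} and each h_i
        scores = torch.bmm(enc_states, s_prev.unsqueeze(2)).squeeze(2)      # [batch, m]
        alpha = F.softmax(scores, dim=-1)
        context = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)      # c_{t-1}
        s_t, c_t = self.cell(torch.cat([prev_word_emb, context], dim=-1), (s_prev, c_prev))
        log_probs = F.log_softmax(self.out(s_t), dim=-1)                    # log of Eq. (3)
        return log_probs, (s_t, c_t)
```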

Incremental Encoding Scheme

Straightforward solutions for encoding the story context can be: 1) concatenating the $K$ sentences into a long sentence and encoding it with an LSTM; or 2) using a hierarchical LSTM with hierarchical attention (Yang et al. 2016), which first attends to the hidden states of a sentence-level LSTM and then to the states of a word-level LSTM. However, these solutions are not effective at representing the context clues which may capture the key logic information. Such information is revealed by the chronological order or causal relationship between events or entities in adjacent sentences.

To better represent the context clues, we propose an incremental encoding scheme: when encoding the current sentence $X_i$, the encoder obtains a context vector which is an attentive read of the preceding sentence $X_{i-1}$. In this manner, the order/relationship between the words in adjacent sentences can be captured implicitly.

This process can be stated formally as follows:

$$h_j^{(i)} = \mathrm{LSTM}(h_{j-1}^{(i)}, e(x_j^{(i)}), c_{l_j}^{(i)}), \quad i \ge 2, \quad (5)$$

where $h_j^{(i)}$ denotes the hidden state at the $j$-th position of the $i$-th sentence, $e(x_j^{(i)})$ denotes the word vector of the $j$-th word $x_j^{(i)}$, and $c_{l_j}^{(i)}$ is the context vector which is an attentive read of the preceding sentence $X_{i-1}$, conditioned on $h_{j-1}^{(i)}$. We will describe the context vector in the next section.

Page 4: arXiv:1808.10113v3 [cs.CL] 2 Dec 2018 · 2018-12-04 · Cloze Test (SCT) (Mostafazadeh et al. 2016a), which re-quires to select a correct ending from two candidates given a story

During the decoding process, the decoder obtains a context vector from the last sentence $X_K$ in the context to utilize the context clues. The hidden state is obtained as below:

$$s_t = \mathrm{LSTM}(s_{t-1}, e(y_{t-1}), c_{l_t}), \quad (6)$$

where $c_{l_t}$ is the context vector which is an attentive read of the last sentence $X_K$, conditioned on $s_{t-1}$. More details of the context vector are presented in the next section.
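A minimal sketch of the incremental encoder of Eq. (5), assuming PyTorch and a callable `attend_previous` that stands in for the multi-source attention described in the next section (names and dimensions are illustrative, not from the released code):

```python
import torch
import torch.nn as nn

class IncrementalEncoder(nn.Module):
    """Sketch of Eq. (5): the LSTM cell for X_i consumes the word embedding and a context vector over X_{i-1}."""
    def __init__(self, emb_dim=200, hidden_dim=512, ctx_dim=512):
        super().__init__()
        self.cell = nn.LSTMCell(emb_dim + ctx_dim, hidden_dim)

    def encode_sentence(self, word_embs, prev_sent_states, attend_previous):
        # word_embs: [len_i, batch, emb_dim], embeddings of sentence X_i
        # prev_sent_states: hidden states (and graph vectors) of X_{i-1}
        # attend_previous(h, prev_sent_states) -> context vector c_{l_j}^{(i)}
        batch = word_embs.size(1)
        h = word_embs.new_zeros(batch, self.cell.hidden_size)
        c = word_embs.new_zeros(batch, self.cell.hidden_size)
        states = []
        for emb in word_embs:                               # step over positions j
            ctx = attend_previous(h, prev_sent_states)      # multi-source attention
            h, c = self.cell(torch.cat([emb, ctx], dim=-1), (h, c))
            states.append(h)
        return torch.stack(states)                          # hidden states of X_i
```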

Multi-Source Attention (MSA)

The context vector ($c_l$) plays a key role in representing the context clues because it captures the relationship between words (or states) in the current sentence and those in the preceding sentence. As aforementioned, story comprehension sometimes requires access to implicit knowledge that is beyond the text. Therefore, the context vector consists of two parts, computed with multi-source attention. The first one, $c_{h_j}^{(i)}$, is derived by attending to the hidden states of the preceding sentence, and the second one, $c_{x_j}^{(i)}$, by attending to the knowledge graph vectors which represent the one-hop graphs in the preceding sentence. The MSA context vector is computed as follows:

$$c_{l_j}^{(i)} = \mathbf{W}_l([c_{h_j}^{(i)}; c_{x_j}^{(i)}]) + \mathbf{b}_l, \quad (7)$$

where $[\cdot;\cdot]$ indicates vector concatenation. Hereafter, $c_{h_j}^{(i)}$ is called the state context vector, and $c_{x_j}^{(i)}$ is called the knowledge context vector.

The state context vector is a weighted sum of the hidden states of the preceding sentence $X_{i-1}$ and can be computed as follows:

$$c_{h_j}^{(i)} = \sum_{k=1}^{l_{i-1}} \alpha_{h_{k,j}}^{(i)} h_k^{(i-1)}, \quad (8)$$

$$\alpha_{h_{k,j}}^{(i)} = \frac{e^{\beta_{h_{k,j}}^{(i)}}}{\sum_{m=1}^{l_{i-1}} e^{\beta_{h_{m,j}}^{(i)}}}, \quad (9)$$

$$\beta_{h_{k,j}}^{(i)} = h_{j-1}^{(i)\top} \mathbf{W}_s h_k^{(i-1)}, \quad (10)$$

where $\beta_{h_{k,j}}^{(i)}$ can be viewed as a weight between hidden state $h_{j-1}^{(i)}$ in sentence $X_i$ and hidden state $h_k^{(i-1)}$ in the preceding sentence $X_{i-1}$.

Similarly, the knowledge context vector is a weighted sum of the graph vectors for the preceding sentence. Each word in a sentence is used as a query to retrieve a one-hop commonsense knowledge graph from ConceptNet, and then each graph is represented by a graph vector. After obtaining the graph vectors, the knowledge context vector can be computed by:

$$c_{x_j}^{(i)} = \sum_{k=1}^{l_{i-1}} \alpha_{x_{k,j}}^{(i)} g(x_k^{(i-1)}), \quad (11)$$

$$\alpha_{x_{k,j}}^{(i)} = \frac{e^{\beta_{x_{k,j}}^{(i)}}}{\sum_{m=1}^{l_{i-1}} e^{\beta_{x_{m,j}}^{(i)}}}, \quad (12)$$

$$\beta_{x_{k,j}}^{(i)} = h_{j-1}^{(i)\top} \mathbf{W}_k g(x_k^{(i-1)}), \quad (13)$$

where $g(x_k^{(i-1)})$ is the graph vector for the graph retrieved for word $x_k^{(i-1)}$. Different from $e(x_k^{(i-1)})$, which is the word vector, $g(x_k^{(i-1)})$ encodes commonsense knowledge and extends the semantic representation of a word through neighboring entities and relations.

During the decoding process, the knowledge context vectors are similarly computed by attending to the last input sentence $X_K$. There is no need to attend to all the context sentences because the context clues have been propagated within the incremental encoding scheme.
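Putting Eqs. (7)-(13) together, a hedged PyTorch sketch of multi-source attention could look as follows; the bilinear scores are implemented with linear layers `W_s`/`W_k`, and all names and default dimensions are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSourceAttention(nn.Module):
    """c_l = W_l [c_h ; c_x] + b_l, with c_h over hidden states and c_x over graph vectors."""
    def __init__(self, hidden_dim=512, graph_dim=400, ctx_dim=512):
        super().__init__()
        self.W_s = nn.Linear(hidden_dim, hidden_dim, bias=False)   # bilinear score of Eq. (10)
        self.W_k = nn.Linear(graph_dim, hidden_dim, bias=False)    # bilinear score of Eq. (13)
        self.fuse = nn.Linear(hidden_dim + graph_dim, ctx_dim)     # W_l, b_l of Eq. (7)

    def forward(self, h_query, prev_hidden, prev_graph):
        # h_query: h_{j-1}^{(i)},          [batch, hidden_dim]
        # prev_hidden: h_k^{(i-1)},        [batch, len_prev, hidden_dim]
        # prev_graph:  g(x_k^{(i-1)}),     [batch, len_prev, graph_dim]
        beta_h = torch.bmm(self.W_s(prev_hidden), h_query.unsqueeze(2)).squeeze(2)
        c_h = torch.bmm(F.softmax(beta_h, dim=-1).unsqueeze(1), prev_hidden).squeeze(1)
        beta_x = torch.bmm(self.W_k(prev_graph), h_query.unsqueeze(2)).squeeze(2)
        c_x = torch.bmm(F.softmax(beta_x, dim=-1).unsqueeze(1), prev_graph).squeeze(1)
        return self.fuse(torch.cat([c_h, c_x], dim=-1))            # c_{l_j}^{(i)}
```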

Knowledge Graph Representation

Commonsense knowledge can facilitate language understanding and generation. To retrieve commonsense knowledge for story comprehension, we resort to ConceptNet4 (Speer and Havasi 2012). ConceptNet is a semantic network which consists of triples $R = (h, r, t)$, meaning that head concept $h$ has the relation $r$ with tail concept $t$. Each word in a sentence is used as a query to retrieve a one-hop graph from ConceptNet. The knowledge graph for a word extends (encodes) its meaning by representing the graph of neighboring concepts and relations.
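For illustration, a one-hop neighborhood can be pulled from the public ConceptNet 5 REST API roughly as below. This sketch is ours, not taken from the authors' released code, and the endpoint and JSON fields are those documented at conceptnet.io.

```python
import requests

def one_hop_graph(word, limit=20):
    """Return a list of (head, relation, tail) triples for `word` from ConceptNet."""
    url = f"http://api.conceptnet.io/c/en/{word.lower()}"
    edges = requests.get(url, params={"limit": limit}).json().get("edges", [])
    triples = []
    for e in edges:
        start, rel, end = e["start"], e["rel"], e["end"]
        # keep English-only edges so the graph stays in the story's language
        if start.get("language") == "en" and end.get("language") == "en":
            triples.append((start["label"], rel["label"], end["label"]))
    return triples

# e.g. one_hop_graph("halloween") may contain a triple like ("halloween", "RelatedTo", "candy")
```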

There have been a few approaches to representing commonsense knowledge. Since our focus in this paper is on using knowledge to benefit story ending generation, instead of devising new methods for representing knowledge, we adopt two existing methods: 1) graph attention (Zhou et al. 2018), and 2) contextual attention (Mihaylov and Frank 2018). We compare the two means of knowledge representation in the experiments.

Graph Attention. Formally, the knowledge graph of word (or concept) $x$ is represented by a set of triples, $G(x) = \{R_1, R_2, \cdots, R_{N_x}\}$ (where each triple $R_i$ has the same head concept $x$), and the graph vector $g(x)$ for word $x$ can be computed via graph attention, as below:

$$g(x) = \sum_{i=1}^{N_x} \alpha_{R_i} [\mathbf{h}_i; \mathbf{t}_i], \quad (14)$$

$$\alpha_{R_i} = \frac{e^{\beta_{R_i}}}{\sum_{j=1}^{N_x} e^{\beta_{R_j}}}, \quad (15)$$

$$\beta_{R_i} = (\mathbf{W}_r \mathbf{r}_i)^\top \tanh(\mathbf{W}_h \mathbf{h}_i + \mathbf{W}_t \mathbf{t}_i), \quad (16)$$

where $(h_i, r_i, t_i) = R_i \in G(x)$ is the $i$-th triple in the graph. We use word vectors to represent concepts, i.e., $\mathbf{h}_i = e(h_i)$, $\mathbf{t}_i = e(t_i)$, and learn a trainable vector $\mathbf{r}_i$ for relation $r_i$, which is randomly initialized.

Intuitively, the above formulation assumes that the knowledge meaning of a word can be represented by its neighboring concepts (and corresponding relations) in the knowledge base. Note that entities in ConceptNet are common words (such as tree, leaf, animal); we thus use word vectors to represent $h$/$r$/$t$ directly, instead of using geometric embedding methods (e.g., TransE) to learn entity and relation embeddings.

4 https://conceptnet.io


In this way, there is no need to bridge the representation gap between geometric embeddings and text-contextual embeddings (i.e., word vectors).
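A minimal sketch of the graph attention of Eqs. (14)-(16), with assumed dimensions and a trainable relation embedding table (illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

class GraphAttention(nn.Module):
    """g(x) = sum_i alpha_{R_i} [h_i; t_i], with relation-aware scores beta_{R_i}."""
    def __init__(self, emb_dim=200, num_relations=40, att_dim=200):
        super().__init__()
        self.rel_emb = nn.Embedding(num_relations, emb_dim)     # trainable r_i
        self.W_r = nn.Linear(emb_dim, att_dim, bias=False)
        self.W_h = nn.Linear(emb_dim, att_dim, bias=False)
        self.W_t = nn.Linear(emb_dim, att_dim, bias=False)

    def forward(self, head_vecs, rel_ids, tail_vecs):
        # head_vecs/tail_vecs: word vectors of h_i / t_i, [N_x, emb_dim]
        # rel_ids: relation indices of r_i, [N_x]
        r = self.rel_emb(rel_ids)
        beta = (self.W_r(r) * torch.tanh(self.W_h(head_vecs) +
                                         self.W_t(tail_vecs))).sum(-1)   # Eq. (16)
        alpha = torch.softmax(beta, dim=0)                               # Eq. (15)
        # weighted sum of [h_i; t_i] gives the graph vector of size 2*emb_dim
        return (alpha.unsqueeze(1) * torch.cat([head_vecs, tail_vecs], dim=-1)).sum(0)
```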

Contextual Attention. When using contextual attention, the graph vector $g(x)$ can be computed as follows:

$$g(x) = \sum_{i=1}^{N_x} \alpha_{R_i} \mathbf{M}_{R_i}, \quad (17)$$

$$\mathbf{M}_{R_i} = \mathrm{BiGRU}(\mathbf{h}_i, \mathbf{r}_i, \mathbf{t}_i), \quad (18)$$

$$\alpha_{R_i} = \frac{e^{\beta_{R_i}}}{\sum_{j=1}^{N_x} e^{\beta_{R_j}}}, \quad (19)$$

$$\beta_{R_i} = \mathbf{h}_{(x)}^\top \mathbf{W}_c \mathbf{M}_{R_i}, \quad (20)$$

where $\mathbf{M}_{R_i}$ is the final state of a BiGRU connecting the elements of triple $R_i$, which can be seen as the knowledge memory of the $i$-th triple, while $\mathbf{h}_{(x)}$ denotes the hidden state at the encoding position of word $x$.
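Contextual attention (Eqs. 17-20) can be sketched in the same spirit, with a BiGRU reading each (h, r, t) sequence to produce the triple memory $\mathbf{M}_{R_i}$; again, module names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ContextualAttention(nn.Module):
    """g(x) = sum_i alpha_{R_i} M_{R_i}, with M_{R_i} the final BiGRU state over (h_i, r_i, t_i)."""
    def __init__(self, emb_dim=200, hidden_dim=512):
        super().__init__()
        self.bigru = nn.GRU(emb_dim, emb_dim, bidirectional=True, batch_first=True)
        self.W_c = nn.Linear(2 * emb_dim, hidden_dim, bias=False)   # Eq. (20)

    def forward(self, triple_embs, h_x):
        # triple_embs: embeddings of (h_i, r_i, t_i) sequences, [N_x, 3, emb_dim]
        # h_x: encoder hidden state at the position of word x, [hidden_dim]
        _, final = self.bigru(triple_embs)             # final: [2, N_x, emb_dim]
        M = torch.cat([final[0], final[1]], dim=-1)    # knowledge memory, [N_x, 2*emb_dim]
        beta = self.W_c(M) @ h_x                       # Eq. (20), [N_x]
        alpha = torch.softmax(beta, dim=0)             # Eq. (19)
        return (alpha.unsqueeze(1) * M).sum(0)         # Eq. (17): graph vector of size 2*emb_dim
```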

Loss Function

As aforementioned, the incremental encoding scheme is central for story ending generation. To better model the chronological order and causal relationship between adjacent sentences, we impose supervision on the encoding network. At each encoding step, we also generate a distribution over the vocabulary, very similar to the decoding process:

$$\mathcal{P}(x_j^{(i)}|x_{<j}^{(i)}, X_{<i}) = \mathrm{softmax}(\mathbf{W}_0 h_j^{(i)} + \mathbf{b}_0). \quad (21)$$

Then, we calculate the negative data likelihood as the loss function:

$$\Phi = \Phi_{en} + \Phi_{de}, \quad (22)$$

$$\Phi_{en} = \sum_{i=2}^{K} \sum_{j=1}^{l_i} -\log \mathcal{P}(x_j^{(i)}|x_{<j}^{(i)}, X_{<i}), \quad (23)$$

$$\Phi_{de} = \sum_{t} -\log \mathcal{P}(y_t|y_{<t}, X), \quad (24)$$

where $x_j^{(i)}$ denotes the reference word used for encoding at the $j$-th position in sentence $i$, and $y_t$ the $t$-th word in the reference ending. Such an approach does not imply that at each step there is only one correct next sentence, exactly as in many other generation tasks. Experiments show that it yields better logic than merely imposing supervision on the decoding network.
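Since both $\Phi_{en}$ and $\Phi_{de}$ are token-level negative log-likelihoods, the combined objective reduces to two cross-entropy terms; a sketch (which averages rather than sums over tokens, and assumes pre-computed logits) is shown below.

```python
import torch
import torch.nn.functional as F

def story_loss(enc_logits, enc_targets, dec_logits, dec_targets, pad_id=0):
    """Phi = Phi_en + Phi_de (Eqs. 22-24), averaged over non-padding tokens.

    enc_logits: [num_enc_steps, vocab] scores from Eq. (21) for sentences X_2..X_K
    enc_targets: [num_enc_steps] reference word ids at those positions
    dec_logits/dec_targets: the same, for the reference ending Y
    """
    phi_en = F.cross_entropy(enc_logits, enc_targets, ignore_index=pad_id)
    phi_de = F.cross_entropy(dec_logits, dec_targets, ignore_index=pad_id)
    return phi_en + phi_de
```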

Experiments

Dataset

We evaluated our model on the ROCStories corpus (Mostafazadeh et al. 2016a). The corpus contains 98,162 five-sentence stories for evaluating story understanding and script learning. The original task is designed to select a correct story ending from two candidates, while our task is to generate a reasonable ending given a four-sentence story context. We randomly selected 90,000 stories for training and the remaining 8,162 for evaluation. The average number of words in X1/X2/X3/X4/Y is 8.9/9.9/10.1/10.0/10.5, respectively. The training data contains 43,095 unique words, and 11,192 words appear more than 10 times. For each word, we retrieved a set of triples from ConceptNet and stored those whose head entity and tail entity are nouns or verbs and both occur in SCT. Moreover, we retained at most 10 triples if there were too many. The average number of triples for each query word is 3.4.
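The triple filtering described above (keep triples whose head and tail are nouns or verbs and occur in the SCT vocabulary, and retain at most 10 triples per query word) can be sketched as follows; `pos_of` and the triple source are hypothetical stand-ins for the actual preprocessing pipeline.

```python
def filter_triples(triples, sct_vocab, pos_of, max_per_word=10):
    """Keep (head, rel, tail) triples whose head/tail are nouns or verbs in the SCT vocabulary."""
    kept = []
    for head, rel, tail in triples:
        if head in sct_vocab and tail in sct_vocab \
                and pos_of(head) in {"NOUN", "VERB"} \
                and pos_of(tail) in {"NOUN", "VERB"}:
            kept.append((head, rel, tail))
        if len(kept) == max_per_word:
            break
    return kept
```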

Baselines

We compared our models with the following state-of-the-art baselines:

Sequence to Sequence (Seq2Seq): A simple encoder-decoder model with an attention mechanism which concatenates the four sentences into a long sentence (Luong, Pham, and Manning 2015).

Hierarchical LSTM (HLSTM): The story context is represented by a hierarchical LSTM: a word-level LSTM for each sentence and a sentence-level LSTM connecting the four sentences (Yang et al. 2016). A hierarchical attention mechanism is applied, which attends to the states of the two LSTMs respectively.

HLSTM+Copy: The copy mechanism (Gu et al. 2016) is applied to hierarchical states to copy words in the story context for generation.

HLSTM+Graph Attention (GA): Multi-source attention is applied to HLSTM, where commonsense knowledge is encoded by graph attention.

HLSTM+Contextual Attention (CA): Contextual attention is applied to represent commonsense knowledge.

Experiment Settings

The parameters are set as follows: GloVe.6B (Pennington, Socher, and Manning 2014) is used for word vectors, the vocabulary size is set to 10,000, and the word vector dimension to 200. We applied 2-layer LSTM units with 512-dimensional hidden states. These settings were applied to all the baselines.

The parameters of the LSTMs (Eq. 5 and 6) are shared by the encoder and the decoder.

Automatic Evaluation

We conducted the automatic evaluation on the 8,162 stories of the entire test set. We generated endings from all the models for each story context.

Evaluation Metrics. We adopted perplexity (PPL) and BLEU (Papineni et al. 2002) to evaluate generation performance. Smaller perplexity scores indicate better performance. BLEU evaluates n-gram overlap between a generated ending and a reference ending. However, since there is only one reference ending for each story context, BLEU scores become extremely low for larger n. We thus experimented with n = 1, 2. Note also that there may exist multiple reasonable endings for the same story context.
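For reference, BLEU-1 and BLEU-2 against the single reference ending can be computed with NLTK roughly as below; the paper does not state its exact BLEU implementation or smoothing, so this is only one plausible setup.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_scores(references, hypotheses):
    """references/hypotheses: lists of token lists; one reference ending per story."""
    refs = [[r] for r in references]            # corpus_bleu expects a list of reference sets
    smooth = SmoothingFunction().method1        # avoid zero scores for missing n-grams
    bleu1 = corpus_bleu(refs, hypotheses, weights=(1.0,), smoothing_function=smooth)
    bleu2 = corpus_bleu(refs, hypotheses, weights=(0.5, 0.5), smoothing_function=smooth)
    return bleu1, bleu2
```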


Model | PPL | BLEU-1 | BLEU-2 | Gram. | Logic.
Seq2Seq | 18.97 | 0.1864 | 0.0410 | 1.74 | 0.70
HLSTM | 17.26 | 0.2459 | 0.0771 | 1.57 | 0.84
HLSTM+Copy | 19.93 | 0.2469 | 0.0783 | 1.66 | 0.90
HLSTM+MSA(GA) | 15.75 | 0.2588 | 0.0809 | 1.70 | 1.06
HLSTM+MSA(CA) | 12.53 | 0.2514 | 0.0825 | 1.72 | 1.02
IE (ours) | 11.04 | 0.2514 | 0.0813 | 1.84 | 1.10
IE+MSA(GA) (ours) | 9.72 | 0.2566 | 0.0854 | 1.68 | 1.26
IE+MSA(CA) (ours) | 8.79 | 0.2682 | 0.0936 | 1.66 | 1.24

Table 1: Automatic and manual evaluation results.

Results. The results of the automatic evaluation are shown in Table 1,5 where IE denotes a simple incremental encoding framework that ablates the knowledge context vector from $c_l$ in Eq. (7). Our models have lower perplexity and higher BLEU scores than the baselines. IE and IE+MSA have remarkably lower perplexity than the other models. As for BLEU, IE+MSA(CA) obtained the highest BLEU-1 and BLEU-2 scores. This indicates that multi-source attention leads to story endings that have more overlap with the reference endings.

Manual Evaluation

Manual evaluation is indispensable for assessing the coherence and logic of generated endings. For manual evaluation, we randomly sampled 200 stories from the test set and obtained 1,600 endings from the eight models. Then, we resorted to Amazon Mechanical Turk (MTurk) for annotation. Each ending was scored by three annotators, and majority voting was used to select the final label.

Evaluation Metrics. We defined two metrics, grammar and logicality, for manual evaluation. A score of 0/1/2 is assigned for each metric during annotation.

Grammar (Gram.): Whether an ending is natural and fluent. Score 2 is for endings without any grammar errors, 1 for endings with a few errors but still understandable, and 0 for endings with severe errors that are incomprehensible.

Logicality (Logic.): Whether an ending is reasonable and coherent with the story context in logic. Score 2 is for reasonable endings that are coherent in logic, 1 for relevant endings with some discrepancy between the ending and the given context, and 0 for totally incompatible endings.

Note that the two metrics are scored independently. To produce high-quality annotations, we prepared guidelines and typical examples for each metric score.

Results. The results of the manual evaluation are also shown in Table 1. Note that the difference between IE and IE+MSA lies in that IE does not attend to the knowledge graph vectors in a preceding sentence, and thus it does not use any commonsense knowledge. The incremental encoding scheme without MSA obtained the best grammar score, and our full model IE+MSA(GA) has the best logicality score.

5 The BLEU-2 results in the AAAI version are corrected in this version. The conclusions are not affected.

All the models have fairly good grammar scores (the maximum is 2.0), while the logicality scores differ remarkably and are much lower than the maximum score, indicating the challenge of this task.

More specifically, incremental encoding is effective due to the following facts: 1) IE is significantly better than Seq2Seq and HLSTM in grammar (Sign Test, 1.84 vs. 1.74/1.57, p-value=0.046/0.037, respectively) and in logicality (1.10 vs. 0.70/0.84, p-value<0.001/0.001). 2) IE+MSA is significantly better than HLSTM+MSA in logicality (1.26 vs. 1.06, p-value=0.014 for GA; 1.24 vs. 1.02, p-value=0.022 for CA). This indicates that incremental encoding is more powerful than traditional (Seq2Seq) and hierarchical (HLSTM) encoding/attention in utilizing context clues. Furthermore, using commonsense knowledge leads to significant improvements in logicality. The comparisons in logicality between IE+MSA and IE (1.26/1.24 vs. 1.10, p-value=0.028/0.042 for GA/CA, respectively), HLSTM+MSA and HLSTM (1.06/1.02 vs. 0.84, p-value<0.001/0.001 for GA/CA, respectively), and HLSTM+MSA and HLSTM+Copy (1.06/1.02 vs. 0.90, p-value=0.044/0.048, respectively) all support this claim. In addition, the similar results between GA and CA show that commonsense knowledge is useful, but multi-source attention is not sensitive to the knowledge representation scheme.

More detailed results are listed in Table 2. Compared to other models, IE+MSA has a much larger proportion of endings that are good in both grammar and logicality (2-2). The proportion of good logicality (score=2.0) from IE+MSA is much larger than that from IE (45.0%+5.0%/41.0%+4.0% vs. 36.0%+2.0% for GA/CA, respectively), and also remarkably larger than those from the other baselines. Further, HLSTM equipped with MSA is better than HLSTM without MSA, indicating that commonsense knowledge is helpful. The kappa measuring inter-rater agreement is 0.29 for three annotators, which implies fair agreement.

Gram.-Logic. Score | 2-2 | 2-1 | 1-2 | 1-1
Seq2Seq | 20.0% | 22.0% | 6.5% | 1.5%
HLSTM | 21.0% | 17.0% | 10.0% | 3.5%
HLSTM+Copy | 28.0% | 19.0% | 7.0% | 5.5%
HLSTM+MSA(GA) | 33.5% | 25.0% | 5.0% | 4.0%
HLSTM+MSA(CA) | 30.0% | 26.0% | 2.0% | 8.0%
IE (ours) | 36.0% | 34.0% | 2.0% | 4.0%
IE+MSA(GA) (ours) | 45.0% | 24.0% | 5.0% | 2.0%
IE+MSA(CA) (ours) | 41.0% | 27.0% | 4.0% | 2.0%

Table 2: Data distribution over Gram.-Logic. scores. a-b denotes that the grammar score is a and the logicality score is b. Each cell denotes the proportion of endings with score a-b.

Examples and Attention Visualization

We present an example of generated story endings in Table 3. Our model generates more natural and reasonable endings than the baselines.


Context: Martha is cooking a special meal for her family. She wants everything to be just right for when they eat. Martha perfects everything and puts her dinner into the oven. Martha goes to lay down for a quick nap.
Golden Ending: She oversleeps and runs into the kitchen to take out her burnt dinner.
Seq2Seq: She was so happy to have a new cake.
HLSTM: Her family and her family are very happy with her food.
HLSTM+Copy: Martha is happy to be able to eat her family.
HLSTM+GA: She is happy to be able to cook her dinner.
HLSTM+CA: She is very happy that she has made a new cook.
IE: She is very happy with her family.
IE+GA: When she gets back to the kitchen, she sees a burning light on the stove.
IE+CA: She realizes the food and is happy she was ready to cook.

Table 3: Generated endings from different models. Bold words denote the key entity and event in the story. Improper words in endings are in italic and proper words are underlined.

In this example, the baselines predicted wrong events in the ending. The baselines (Seq2Seq, HLSTM, and HLSTM+Copy) predicted improper entities (cake), generated repetitive content (her family), or copied wrong words (eat). The models equipped with incremental encoding or with knowledge through MSA (GA/CA) perform better in this example. The ending by IE+MSA is more coherent in logic and fluent in grammar. We can also see that there may exist multiple reasonable endings for the same story context.

In order to verify the ability of our model to utilize the context clues and implicit knowledge when planning the story plot, we visualized the attention weights of this example, as shown in Figure 3. Note that this example is produced with graph attention.

In Figure 3, phrases in boxes are key events of the sentences that are manually highlighted. Words in blue or purple are entities that can be retrieved from ConceptNet, in the story context or in the ending, respectively. An arrow indicates that the words in the current box (e.g., they eat in $X_2$) all have their largest attention weights to some words in the box of the preceding sentence (e.g., cooking a special meal in $X_1$). Black arrows are for the state context vector (see Eq. 25) and blue for the knowledge context vector (see Eq. 26). For instance, eat has the largest knowledge attention to meal through the knowledge graph (<meal, AtLocation, dinner>, <meal, RelatedTo, eat>). Similarly, eat also has the second largest attention weight to cooking through the knowledge graph. For the attention weights of the state context vector, both words in perfects everything have their largest weights to some words of everything to be just right (everything→everything, perfect→right).

Figure 3: An example illustrating how incremental encoding builds connections between context clues. Story: X1: "Martha is cooking a special meal for her family." X2: "She wants everything to be just right for when they eat." X3: "Martha perfects everything and puts her dinner into the oven." X4: "Martha goes to lay down for a quick nap." Y: "When she gets back to the kitchen, she sees a burning light on the stove." Retrieved entity commonsense knowledge: cook: (cook, AtLocation, kitchen), (cook, HasLastSubevent, eat); meal: (meal, AtLocation, dinner), (meal, RelatedTo, eat); eat: (eat, AtLocation, dinner); oven: (oven, AtLocation, stove), (oven, RelatedTo, kitchen), (oven, UsedFor, burn).

The example illustrates how the connections between context clues are built through incremental encoding and the use of commonsense knowledge. The chain of context clues, such as be cooking → want everything be right → perfect everything → lay down → get back, and the commonsense knowledge, such as <cook, AtLocation, kitchen> and <oven, UsedFor, burn>, are useful for generating reasonable story endings.

Conclusion and Future Work

We present a story ending generation model that builds context clues via incremental encoding and leverages commonsense knowledge with multi-source attention. It encodes a story context incrementally with a multi-source attention mechanism to utilize not only context clues but also commonsense knowledge: when encoding a sentence, the model obtains a multi-source context vector which is an attentive read of the words and the corresponding knowledge graphs of the preceding sentence in the story context. Experiments show that our models can generate more coherent and reasonable story endings.

As future work, our incremental encoding and multi-source attention for using commonsense knowledge may be applicable to other language generation tasks.

Refer to the Appendix for more details.

Acknowledgements

This work was jointly supported by the National Science Foundation of China (Grant No. 61876096/61332007) and the National Key R&D Program of China (Grant No. 2018YFC0830200).


We would like to thank Prof. Xiaoyan Zhu for her generous support.

References

[Bahdanau, Cho, and Bengio 2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[Chaturvedi, Peng, and Dan 2017] Chaturvedi, S.; Peng, H.; and Dan, R. 2017. Story comprehension for predicting what happens next. In EMNLP, 1603–1614.

[Clark, Ji, and Smith 2018] Clark, E.; Ji, Y.; and Smith, N. A. 2018. Neural text generation in stories using entity representations as context. In NAACL, 1631–1640.

[Dagan, Glickman, and Magnini 2006] Dagan, I.; Glickman, O.; and Magnini, B. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, 177–190.

[Fan, Lewis, and Dauphin 2018] Fan, A.; Lewis, M.; and Dauphin, Y. 2018. Hierarchical neural story generation. In ACL, 889–898.

[Gu et al. 2016] Gu, J.; Lu, Z.; Li, H.; and Li, V. O. K. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In ACL, 1631–1640.

[Hermann et al. 2015] Hermann, K. M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; and Blunsom, P. 2015. Teaching machines to read and comprehend. In NIPS, 1693–1701.

[Jain et al. 2017] Jain, P.; Agrawal, P.; Mishra, A.; Sukhwani, M.; Laha, A.; and Sankaranarayanan, K. 2017. Story generation from sequence of independent short descriptions. arXiv preprint arXiv:1707.05501.

[Ji et al. 2017] Ji, Y.; Tan, C.; Martschat, S.; Choi, Y.; and Smith, N. A. 2017. Dynamic entity representations in neural language models. In EMNLP, 1830–1839.

[Knight et al. 2018] Knight, K.; Choi, Y.; Sap, M.; Rashkin, H.; and Bosselut, A. 2018. Modeling naive psychology of characters in simple commonsense stories. In ACL, 2289–2299.

[Li et al. 2013] Li, B.; Lee-Urban, S.; Johnston, G.; and Riedl, M. O. 2013. Story generation with crowdsourced plot graphs. In AAAI, 598–604.

[Li et al. 2018] Li, Q.; Li, Z.; Wei, J.-M.; Gu, Y.; Jatowt, A.; and Yang, Z. 2018. A multi-attention based neural network with external knowledge for story ending predicting task. In COLING, 1754–1762.

[Li, Ding, and Liu 2018] Li, Z.; Ding, X.; and Liu, T. 2018. Generating reasonable and diversified story ending using sequence to sequence model with adversarial training. In COLING, 1033–1043.

[Lin, Sun, and Han 2017] Lin, H.; Sun, L.; and Han, X. 2017. Reasoning with heterogeneous knowledge for commonsense machine comprehension. In EMNLP, 2032–2043.

[LoBue and Yates 2011] LoBue, P., and Yates, A. 2011. Types of common-sense knowledge needed for recognizing textual entailment. In ACL-HLT, 329–334.

[Luong, Pham, and Manning 2015] Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. In EMNLP, 1412–1421.

[Martin et al. 2018] Martin, L.; Ammanabrolu, P.; Wang, X.; Hancock, W.; Singh, S.; Harrison, B.; and Riedl, M. 2018. Event representations for automated story generation with deep neural nets. In AAAI, 868–875.

[Mihaylov and Frank 2018] Mihaylov, T., and Frank, A. 2018. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In ACL, 821–832.

[Mostafazadeh et al. 2016a] Mostafazadeh, N.; Chambers, N.; He, X.; Parikh, D.; Batra, D.; Vanderwende, L.; Kohli, P.; and Allen, J. 2016a. A corpus and cloze evaluation for deeper understanding of commonsense stories. In NAACL, 839–849.

[Mostafazadeh et al. 2016b] Mostafazadeh, N.; Vanderwende, L.; Yih, W. T.; Kohli, P.; and Allen, J. 2016b. Story cloze evaluator: Vector space representation evaluation by predicting what happens next. In The Workshop on Evaluating Vector-Space Representations for NLP, 24–29.

[Papineni et al. 2002] Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W. J. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL, 311–318.

[Peng et al. 2018] Peng, N.; Ghazvininejad, M.; May, J.; and Knight, K. 2018. Towards controllable story generation. In The Workshop on Storytelling, 43–49.

[Pennington, Socher, and Manning 2014] Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In EMNLP, 1532–1543.

[R. Bowman et al. 2015] R. Bowman, S.; Angeli, G.; Potts, C.; and Manning, C. 2015. A large annotated corpus for learning natural language inference. In EMNLP, 632–642.

[Rashkin et al. 2018] Rashkin, H.; Sap, M.; Allaway, E.; Smith, N. A.; and Choi, Y. 2018. Event2Mind: Commonsense inference on events, intents, and reactions. In ACL, 463–473.

[Roemmele 2016] Roemmele, M. 2016. Writing stories with help from recurrent neural networks. In AAAI, 4311–4312.

[Soo, Lee, and Chen 2016] Soo, V.; Lee, C.; and Chen, T. 2016. Generate believable causal plots with user preferences using constrained Monte Carlo tree search. In AAAI, 218–224.

[Speer and Havasi 2012] Speer, R., and Havasi, C. 2012. Representing general relational knowledge in ConceptNet 5. In LREC, 3679–3686.

[Sutskever, Vinyals, and Le 2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In NIPS, 3104–3112.

[Wang, Liu, and Zhao 2017] Wang, B.; Liu, K.; and Zhao, J. 2017. Conditional generative adversarial networks for commonsense machine comprehension. In IJCAI, 4123–4129.

[Yang et al. 2016] Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; and Hovy, E. 2016. Hierarchical attention networks for document classification. In NAACL, 1480–1489.

[Zhang et al. 2017] Zhang, S.; Rudinger, R.; Duh, K.; and Van Durme, B. 2017. Ordinal common-sense inference. In ACL, 379–395.

[Zhao et al. 2018] Zhao, Y.; Liu, L.; Liu, C.; Yang, R.; and Yu, D. 2018. From plots to endings: A reinforced pointer generator for story ending generation. In Zhang, M.; Ng, V.; Zhao, D.; Li, S.; and Zan, H., eds., NLPCC, 51–63.

[Zhou et al. 2018] Zhou, H.; Yang, T.; Huang, M.; Zhao, H.; Xu, J.; and Zhu, X. 2018. Commonsense knowledge aware conversation generation with graph attention. In IJCAI.

Appendix A: Annotation Statistics

We present the statistics of annotation agreement in Table 4. The proportion of annotations in which at least two annotators (3/3+2/3) assigned the same score to an ending is 96% for grammar and 94% for logicality. We can also see that the 3/3 agreement for logicality is much lower than that for grammar, indicating that logicality is more complicated to annotate than grammar.

Metric | 3/3 | 2/3 | 1/3
Gram. | 58% | 38% | 4%
Logic. | 38% | 57% | 6%

Table 4: Annotation statistics. n/3 agreement means that n out of 3 annotators assigned the same score to an ending.

Appendix B: Error Analysis

We analyzed error types by manually checking all 46 bad endings generated by our model, where bad means that the average score for at least one metric is not greater than 1.

There are three typical error types: bad grammar (BG), bad logicality (BL), and other errors. The distribution of the types is shown in Table 5.

Error Type | BG | BL | Others
IE+MSA(GA) | 32.6% | 37.0% | 30.4%

Table 5: Distribution of errors.

We also present some typical cases for each error type in Table 6. Note that we only take graph attention as the example. The first case shows an instance of bad grammar due to repetitive generation. The second case shows that our model predicted a wrong entity at the last position, where car is obviously more appropriate than daughter. This happens when the attention focuses on the wrong position, but in more cases it happens due to the noise of the commonsense knowledge base.6

Context: Sean practiced driving every day. He studied hard for the written part of the driving test. Sean felt ready to take his driving test after a couple months. Sean was confident that he would pass his driving test.
Golden Ending: Sean passed his driving test on the first attempt.
IE+MSA(GA): He failed his test and failed his test.

Context: Danny wanted to give his daughter a sweet gift. DJ was turning 18 years old and ready to drive. Danny decided it was time for a new car. Danny went down to the dealership and bought a new car.
Golden Ending: He surprised DJ on her 18th birthday with a new car.
IE+MSA(GA): He was so happy with his new daughter.

Context: There was a big pothole near Bill's house. He had to swerve to avoid it all the time. He kept calling the city to fix it. Eventually someone returned his calls.
Golden Ending: It took three more months for it to be fixed.
IE+MSA(GA): He was relieved to find that he had to work on his own.

Table 6: Typical cases of errors. Bold words denote the key entity and event in the story. Improper words in endings are in italic and proper words are underlined.

The ending of the third case contains a relevant event, work on his own, but the event is not consistent with the preceding word relieved. Other cases show that our model is not good at dealing with rare words. However, this can be further improved by applying a copy mechanism, which we leave as future work. These errors also indicate that story ending generation is challenging, and that logic and implicit knowledge play a central role in this task.

Appendix C: Attention Visualization

The multi-source attention mechanism computes the state context vectors and the knowledge context vectors respectively as follows:

$$c_{h_j}^{(i)} = \sum_{k=1}^{l_{i-1}} \alpha_{h_{k,j}}^{(i)} h_k^{(i-1)}, \quad (25)$$

$$c_{x_j}^{(i)} = \sum_{k=1}^{l_{i-1}} \alpha_{x_{k,j}}^{(i)} g(x_k^{(i-1)}). \quad (26)$$

The visualization analysis in the section "Examples and Attention Visualization" is based on the attention weights ($\alpha_{h_{k,j}}^{(i)}$ and $\alpha_{x_{k,j}}^{(i)}$), as presented in Figure 4. As before, we take the graph attention method as the example for representing commonsense knowledge.

6 Note that ConceptNet contains many low-quality triples.


Figure 4: Attention weight visualization by multi-source attention. Green graphs are for state context vectors (Eq. 25) and blue for knowledge context vectors (Eq. 26). Each graph represents the attention weight matrix for two adjacent sentences ($X_i$ in row and $X_{i+1}$ in column). The five graphs in the left column show the state attention weights in the incremental encoding process.


The figure illustrates how the incremental encoding scheme with multi-source attention utilizes context clues and implicit knowledge.

1) The left column: for utilizing context clues, when the model encodes $X_2$, cooking in $X_1$ obtains the largest state attention weight ($\alpha_{h_{k,j}}^{(i)}$), which illustrates that cooking is an important word (or event) for the context clue. Similarly, the key events in each sentence have their largest attention weights to some entities or events in the preceding sentence, which forms the context clue (e.g., perfects in $X_3$ to right in $X_2$, lay/down in $X_4$ to perfect/everything in $X_3$, get/back in $Y$ to lay/down in $X_4$, etc.).

2) The right column: for the use of commonsense knowledge, each sentence has attention weights ($\alpha_{x_{k,j}}^{(i)}$) to the knowledge graphs of the preceding sentence (e.g., eat in $X_2$ to meal in $X_1$, dinner in $X_3$ to eat in $X_2$, etc.). In this manner, the knowledge information is added into the encoding process of each sentence, which helps story comprehension for better ending generation (e.g., kitchen in $Y$ to oven in $X_2$, etc.).