
Proceedings of the 8th International Joint Conference on Natural Language Processing, pages 462–472, Taipei, Taiwan, November 27 – December 1, 2017. © 2017 AFNLP

Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation

Nasrin Mostafazadeh1*, Chris Brockett2, Bill Dolan2, Michel Galley2, Jianfeng Gao2, Georgios P. Spithourakis3*, Lucy Vanderwende2

1 BenevolentAI, 2 Microsoft Research,

3 University College London

[email protected], [email protected]

Abstract

The popularity of image sharing on social media and the engagement it creates between users reflect the important role that visual context plays in everyday conversations. We present a novel task, Image-Grounded Conversations (IGC), in which natural-sounding conversations are generated about a shared image. To benchmark progress, we introduce a new multiple-reference dataset of crowd-sourced, event-centric conversations on images. IGC falls on the continuum between chit-chat and goal-directed conversation models, where visual grounding constrains the topic of conversation to event-driven utterances. Experiments with models trained on social media data show that the combination of visual and textual context enhances the quality of generated conversational turns. In human evaluation, the gap between human performance and that of both neural and retrieval architectures suggests that multi-modal IGC presents an interesting challenge for dialog research.

1 Introduction

Bringing together vision & language in one intelligent conversational system has been one of the longest running goals in AI (Winograd, 1972). Advances in image captioning (Fang et al., 2014; Chen et al., 2015; Donahue et al., 2015) have enabled much interdisciplinary research in vision and language, from video transcription (Rohrbach et al., 2012; Venugopalan et al., 2015), to answering questions about images (Antol et al., 2015; Malinowski and Fritz, 2014), to storytelling around series of photographs (Huang et al., 2016).

* This work was performed at Microsoft.

User1: My son is ahead and surprised!
User2: Did he end up winning the race?
User1: Yes he won, he can't believe it!

Figure 1: A naturally-occurring Image-Grounded Conversation.

Most recent work on vision & language focuses on either describing (captioning) an image or answering questions about its visible content. Observing how people naturally engage with one another around images on social media, it is evident that this engagement often takes the form of conversational threads. On Twitter, for example, uploading a photo with an accompanying tweet has become increasingly popular: in June 2015, 28% of tweets reportedly contained an image (Morris et al., 2016). Moreover, across social media, the conversations around shared images range beyond what is explicitly visible in the image. Figure 1 illustrates such a conversation. As this example shows, the conversation is grounded not only in the visible objects (e.g., the boys, the bikes) but, more importantly, in the events and actions (e.g., the race, winning) implicit in the image that is accompanied by the textual utterance. To humans, these latter aspects are likely to be the most interesting and most meaningful components of a natural conversation, and for systems, inferring such implicit aspects can be the most challenging.

In this paper we shift the focus from the image as an artifact (as in the existing vision & language work, described in Section 2) to the image as the context for interaction: we introduce the task of Image-Grounded Conversation (IGC), in which a system must generate conversational turns to proactively drive the interaction forward. IGC thus falls on a continuum between chit-chat


(open-ended) and goal-oriented task-completion dialog systems, where the visual context in IGC naturally serves as a detailed topic for a conversation. As conversational agents gain increasing ground in commercial settings (e.g., Siri and Alexa), they will increasingly need to engage humans in ways that seem intelligent and anticipatory of future needs. For example, a conversational agent might engage in a conversation with a user about a camera-roll image in order to elicit background information from the user (e.g., special celebrations, favorite food, the names of friends and family, etc.).

This paper draws together two threads of investigation that have hitherto remained largely unrelated: vision & language and data-driven conversation modeling. Its contributions are threefold: (1) we introduce multimodal conversational context for formulating questions and responses around images, and support benchmarking with a publicly-released, high-quality, crowd-sourced dataset of 4,222 multi-turn, multi-reference conversations grounded on event-centric images. We analyze various characteristics of this IGC dataset in Section 3.1. (2) We investigate the application of deep neural generation and retrieval approaches to the question and response generation tasks (Section 5), trained on 250K 3-turn naturally-occurring image-grounded conversations found on Twitter. (3) Our experiments suggest that the combination of visual and textual context improves the quality of generated conversational turns (Sections 6-7). We hope that this novel task will spark new interest in multimodal conversation modeling.

2 Related Work

2.1 Vision and Language

Visual features combined with language modeling have shown good performance both in image captioning (Devlin et al., 2015; Xu et al., 2015; Fang et al., 2014; Donahue et al., 2015) and in question answering on images (Antol et al., 2015; Ray et al., 2016; Malinowski and Fritz, 2014), when trained on large datasets such as the COCO dataset (Lin et al., 2014). In Visual Question Answering (VQA) (Antol et al., 2015), a system is tasked with answering a question about a given image, where the questions are constrained to be answerable directly from the image. In other words, the VQA task primarily serves to evaluate the extent to which the system has recognized the explicit content of the image.

IGC (left):
  "Place near my house is getting ready for Halloween a little early."
  "Don't you think Halloween should be year-round, though?"
  "That'd be fun since it's my favorite holiday!"
  "It's my favorite holiday as well!"
  "I never got around to carving a pumpkin last year even though I bought one."
  "Well, it's a good thing that they are starting to sell them early this year!"

VisDial (right):
  "Is the photo in color?" "Yes"
  "Is the photo close up?" "No"
  "Is this at a farm?" "Possibly"
  "Do you think it's for Halloween?" "That is possible"
  "Do you see anyone?" "No"
  "Do you see trees?" "No"
  "Any huge pumpkins?" "No"

Figure 2: Typical crowdsourced conversations in IGC (left) and VisDial (right).

Das et al. (2017a) extend the VQA scenario by collecting sequential questions from people who are shown only an automatically generated caption, not the image itself. The utterances in this dataset, called 'Visual Dialog' (VisDial), are best viewed as simple one-sided QA exchanges in which humans ask questions and the system provides answers. Figure 2 contrasts an example IGC conversation with the VisDial dataset. As this example shows, IGC involves natural conversations with the image as the grounding, where the literal objects (e.g., the pumpkins) may not even be mentioned in the conversation at all, whereas VisDial targets explicit image understanding. More recently, Das et al. (2017b) have explored the VisDial dataset with richer models that incorporate deep reinforcement learning.

Mostafazadeh et al. (2016b) introduce the task of visual question generation (VQG), in which the system itself outputs questions about a given image. Questions are required to be 'natural and engaging', i.e., questions a person would find interesting to answer, but they need not be answerable from the image alone. In this work, we introduce multimodal context, recognizing that images commonly come associated with a verbal commentary that can affect the interpretation. This is thus a broader, more complex task that involves implicit commonsense reasoning around both image and text.


2.2 Data-Driven Conversational Modeling

This work is also closely linked to research on data-driven conversation modeling. Ritter et al. (2011) posed response generation as a machine translation task, learning conversations from parallel message-response pairs found on social media. Their work has been successfully extended with the use of deep neural models (Sordoni et al., 2015; Shang et al., 2015; Serban et al., 2015a; Vinyals and Le, 2015; Li et al., 2016a,b). Sordoni et al. (2015) introduce a context-sensitive neural language model that selects the most probable response conditioned on the conversation history (i.e., a text-only context). In this paper, we extend the contextual approach with multimodal features to build models that are capable of asking questions on topics of interest to a human, which might allow a conversational agent to proactively drive a conversation forward.

3 Image-Grounded Conversations

3.1 Task Definition

We define the current scope of IGC as the following two consecutive conversational steps:
• Question Generation: Given a visual context I and a textual context T (e.g., the first statement in Figure 1), generate a coherent, natural question Q about the image as the second utterance in the conversation. It has been shown that humans achieve greater consensus on what constitutes a natural question to ask given an image (the task of VQG) than on captioning or asking a visually verifiable question (VQA) (Mostafazadeh et al., 2016b). As seen in Figure 1, the question is not directly answerable from the image. Here we emphasize questions as a way of potentially engaging a human in continuing the conversation.
• Response Generation: Given a visual context I, a textual context T, and a question Q, generate a coherent, natural response R to the question as the third utterance in the conversation. In the interests of feasible multi-reference evaluation, we pose question and response generation as two separate tasks. However, all the models presented in this paper can be fed with their own generated question to generate a response.

3.2 The IGC Dataset

The majority of the available corpora for developing data-driven dialogue systems contain task-oriented and goal-driven conversational data (Serban et al., 2015b). For instance, the Ubuntu dialogue corpus (Lowe et al., 2015) is the largest corpus of dialogues (almost 1 million, mainly 3-turn dialogues) for the specific topic of troubleshooting Ubuntu problems. On the other hand, for open-ended conversation modeling (chitchat), now a high-demand application in AI, shared datasets with which to track progress are severely lacking. The IGC task presented here lies nicely on the continuum between the two, where the visual grounding of event-centric images constrains the topic of conversation to contentful utterances.

To enable benchmarking of progress on the IGC task, we constructed the IGCCrowd dataset for validation and testing purposes. We first sampled eventful images from the VQG dataset (Mostafazadeh et al., 2016b), which has been extracted by querying a search engine using event-centric query terms. These were then served in a photo gallery of a crowd-sourcing platform we developed using the Turkserver toolkit (Mao et al., 2012), which enables synchronous and real-time interactions between crowd workers on Amazon Mechanical Turk. Multiple workers wait in a virtual lobby to be paired with a conversation partner. After being paired, one of the workers selects an image from the large photo gallery, after which the two workers enter a chat window in which they conduct a short conversation about the selected image. We prompted the workers to naturally drive the conversation forward without using informal/IM language. To enable multi-reference evaluation (Section 6), we crowd-sourced five additional questions and responses for the IGCCrowd contexts and initial questions.

Table 1 shows three full conversations found in the IGCCrowd dataset. These examples show that eventful images lead to conversations that are semantically rich and appear to involve commonsense reasoning. Table 2 summarizes basic dataset statistics. The IGCCrowd dataset has been released as the Microsoft Research Image-Grounded Conversation dataset (https://www.microsoft.com/en-us/download/details.aspx?id=55324&751be11f-ede8).

4 Task Characteristics

In this section, we analyze the IGC dataset to highlight a range of phenomena specific to this task. (Additional material pertaining to the lexical distributions of this dataset can be found in the Supplementary Material.)


Example 1
  Textual Context: This wasn't the way I imagined my day starting.
  Question: do you think this happened on the highway?
  Response: Probably not, because I haven't driven anywhere except around town recently.
  VQG Question: What caused that tire to go flat?

Example 2
  Textual Context: I checked out the protest yesterday.
  Question: Do you think America can ever overcome its racial divide?
  Response: I can only hope so.
  VQG Question: Where was the protest?

Example 3
  Textual Context: A terrible storm destroyed my house!
  Question: OH NO, what are you going to do?
  Response: I will go live with my Dad until the insurance company sorts it out.
  VQG Question: What caused the building to fall over?

Table 1: Example full conversations in our IGCCrowd dataset. For comparison, we also include VQG questions in which the image is the only context.

IGCCrowd (val and test sets, split: 40% and 60%)
  # conversations = # images: 4,222
  total # utterances: 25,332
  # workers who participated: 308
  max # conversations by one worker: 20
  average payment per worker: 1.8 dollars
  median work time per worker (min): 10.0

IGCCrowd-multiref (val and test sets, split: 40% and 60%)
  # additional references per question/response: 5
  total # multi-reference utterances: 42,220

Table 2: Basic dataset statistics.

4.1 The Effectiveness of Multimodal Context

The task of IGC emphasizes modeling of not only visual but also textual context. We presented human judges with a random sample of 600 triplets of image, textual context, and question (I, T, Q) from each of the IGCTwitter and IGCCrowd datasets and asked them to rate the effectiveness of the visual and the textual context. We define 'effectiveness' to be "the degree to which the image or text is required in order for the given question to sound natural". The workers were prompted to make this judgment based on whether or not the question already makes sense without either the image or the text. As Figure 3 demonstrates, both visual and textual contexts are generally highly effective, and understanding of both would be required for the question that was asked. By way of comparison, Figure 3 also shows the effectiveness of image and text for a sample taken from the Twitter data presented in Section 4.4. We note that the crowd-sourced dataset is more heavily reliant on understanding the textual context than is the Twitter set.

Figure 3: The effectiveness of textual and visual context for asking questions.

4.2 Frame Semantic Analysis of Questions

The grounded conversations starting with questions contain a considerable amount of stereotypical commonsense knowledge. To get a better sense of the richness of our IGCCrowd dataset, we manually annotated a random sample of 330 (I, T, Q) triplets in terms of Minsky's Frames. Minsky defines 'frame' as follows: "When one encounters a new situation, one selects from memory a structure called a Frame" (Minsky, 1974). A frame is thus a commonsense knowledge representation data structure for representing stereotypical situations, such as a wedding ceremony. Minsky further connects frames to the nature of questions: "[A Frame] is a collection of questions to be asked about a situation". These questions can ask about the cause, intention, or side-effects of a presented situation.

We annotated1 the FrameNet (Baker et al., 1998) frame evoked by the image I, to be called (I_FN), and the textual context T, (T_FN).

1 These annotations can be accessed through https://goo.gl/MVyGzP


Visual Context: [image]  FN: Food
Textual Context: "Look at all this food I ordered!"  FN: Request-Entity
Question: "Where is that from?"  FN: Supplier

Table 3: FrameNet (FN) annotation of an example.

Figure 4: An example causal and temporal (CaTeRS) annotation on the conversation presented in Figure 1. The rectangular nodes show the event entities and the edges are the semantic links. For simplicity, we show the 'identity' relation between events using gray nodes. The coreference chain is depicted by the underlined words.

Then, for the question asked, we annotated the frame slot (Q_FN-slot) associated with a context frame (Q_FN). For 17% of cases, we were unable to identify a corresponding Q_FN-slot in FrameNet. As the example in Table 3 shows, the image in isolation often does not evoke any uniquely contentful frame, whereas the textual context frequently does. In only 14% of cases does I_FN = T_FN, which further supports the complementary effect of our multimodal contexts. Moreover, Q_FN = I_FN for 32% of our annotations, whereas Q_FN = T_FN for 47% of the triplets, again showing the effectiveness of textual context in determining the question.

4.3 Event Analysis of Conversations

To further investigate the representation of events, and any stereotypical causal and temporal relations between them, in the IGCCrowd dataset, we manually annotated a sample of 20 conversations with their causal and temporal event structures. Here, we followed the Causal and Temporal Relation Scheme (CaTeRS) (Mostafazadeh et al., 2016a) for event entity and event-event semantic relation annotations. Our analysis shows that the IGC utterances are indeed rich in events. On average, each utterance in IGC has 0.71 event entity mentions, such as 'win' or 'remodel'. The semantic link annotation reflects commonsense relations between event mentions in the context of the ongoing conversation. Figure 4 shows an example CaTeRS annotation.

Figure 5: The frequency of event-event semantic links in a random sample of 20 IGC conversations.

The distribution of semantic links in the annotated sample can be found in Figure 5. These numbers further suggest that, in addition to jointly understanding the visual and textual context (including multimodal anaphora resolution, among other challenges), capturing causal and temporal relations between events will likely be necessary for a system to perform the IGC task.

4.4 IGCTwitter Training Dataset

Previous work in neural conversation modeling (Ritter et al., 2011; Sordoni et al., 2015) has successfully used Twitter as the source of natural conversations. As training data, we sampled 250K quadruples of {visual context, textual context, question, response} tweet threads from a larger dataset of 1.4 million, extracted from the Twitter Firehose over a 3-year period beginning in May 2013 and filtered to select just those conversations in which the initial turn was associated with an image and the second turn was a question.

Regular expressions were used to detect questions. To improve the likelihood that the authors are experienced Twitter conversationalists, we further limited extraction to those exchanges where users had actively engaged in at least 30 conversational exchanges during a 3-month period.

Twitter data is noisy: we performed simple normalizations, and filtered out tweets that contained mid-tweet hashtags, were longer than 80 characters2, or contained URLs not linking to the image. (Sample conversations from this dataset can be found in the Supplementary Material.) A random sample of tweets suggests that about 46% of the Twitter conversations are affected by prior history between the users, making response generation particularly difficult. In addition, the abundance of screenshots and non-photograph graphics is potentially a major source of noise in extracting features for neural generation, though we did not attempt to exclude these from the training set.

2 Pilot studies showed that an 80-character limit more effectively retains one-sentence utterances that are to the point.
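The exact normalizations and regular expressions are not given in the paper; the following Python sketch only illustrates the kind of question detection and filtering described above. All patterns, thresholds in code, and helper names are our own assumptions.

```python
import re

# Illustrative patterns only; the paper does not specify the exact expressions used.
QUESTION_RE = re.compile(
    r"\?\s*$|^(who|what|when|where|why|how|did|do|does|is|are|can|will)\b",
    re.IGNORECASE)
MID_TWEET_HASHTAG_RE = re.compile(r"\S+\s+#\w+")   # a hashtag preceded by other text
URL_RE = re.compile(r"https?://\S+")


def looks_like_question(text: str) -> bool:
    """Rough question detector: ends with '?' or starts with a question word."""
    return bool(QUESTION_RE.search(text.strip()))


def keep_tweet(text: str, image_urls: frozenset = frozenset()) -> bool:
    """Apply the simple filters described above: 80-character limit,
    no mid-tweet hashtags, and no URLs other than the shared image's."""
    if len(text) > 80:
        return False
    if MID_TWEET_HASHTAG_RE.search(text):
        return False
    return all(url in image_urls for url in URL_RE.findall(text))
```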


5 Models

5.1 Generation Models

Figure 6 gives an overview of our three generation models. Across all the models, we use the VGGNet architecture (Simonyan and Zisserman, 2015) for computing deep convolutional image features. We use the 4096-dimensional output of the last fully connected layer (fc7) as the input to all the models sensitive to visual context.

Visual Context Sensitive Model (V-Gen). Similar to Recurrent Neural Network (RNN) models for image captioning (Devlin et al., 2015; Vinyals et al., 2015), V-Gen transforms the image feature vector into a 500-dimensional vector that serves as the initial recurrent state of a 500-dimensional one-layer Gated Recurrent Unit (GRU), which is the decoder module. The output sentence is generated one word at a time until the <EOS> (end-of-sentence) token is generated. We set the vocabulary size to 6000, which yielded the best results on the validation set. For this model, we obtained better results with greedy decoding. Unknown words are mapped to an <UNK> token during training, which is not allowed to be generated at decoding time.
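As a concrete illustration, here is a minimal sketch of an image-conditioned GRU decoder of this kind. The framework (PyTorch), the class name, and the tanh on the projection are our own assumptions; the paper specifies only the dimensions, the fc7 input, and the one-layer GRU decoder.

```python
import torch
import torch.nn as nn

class VGenDecoder(nn.Module):
    """Illustrative sketch in the spirit of V-Gen: fc7 initializes a GRU decoder."""
    def __init__(self, vocab_size=6000, hidden_size=500, fc7_dim=4096):
        super().__init__()
        self.img_proj = nn.Linear(fc7_dim, hidden_size)     # fc7 -> initial recurrent state
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, num_layers=1, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, fc7, input_tokens):
        # fc7: (batch, 4096); input_tokens: (batch, seq_len) of word ids
        h0 = torch.tanh(self.img_proj(fc7)).unsqueeze(0)     # (1, batch, hidden)
        emb = self.embed(input_tokens)
        outputs, _ = self.gru(emb, h0)
        return self.out(outputs)                             # logits over the vocabulary
```

At decoding time, greedy decoding would simply take the argmax token at each step and feed it back in until <EOS> is produced.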

Textual Context Sensitive Model (T-Gen). This is a neural Machine Translation-like model that maps an input sequence to an output sequence (a Seq2Seq model (Cho et al., 2014; Sutskever et al., 2014)) using an encoder and a decoder RNN. The decoder module is the same as the one described above, except that the initial recurrent state is the 500-dimensional encoding of the textual context. For consistency, we use the same vocabulary size and number of layers as in the V-Gen model.

Visual & Textual Context Sensitive Model (V&T-Gen). This model fully leverages both textual and visual contexts. The vision feature is transformed into a 500-dimensional vector, and the textual context is likewise encoded into a 500-dimensional vector. The textual feature vector can be obtained using either a bag-of-words representation (V&T.BOW-Gen) or an RNN (V&T.RNN-Gen), as depicted in Figure 7. The textual feature vector is then concatenated with the vision vector and fed into a fully connected (FC) feed-forward neural network. As a result, we obtain a single 500-dimensional vector encoding both visual and textual context, which then serves as the initial recurrent state of the decoder RNN.

In order to generate the response (the third utterance in the conversation), we need to represent the conversational turns in the textual context input. There are various ways to represent conversational history, including a bag-of-words model or a concatenation of all textual utterances into one sentence (Sordoni et al., 2015). For response generation, we implement a more complex treatment in which utterances are fed into an RNN one word at a time (Figure 7), following their temporal order in the conversation. An <UTT> marker designates the boundary between successive utterances.
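A hedged sketch of this fusion step, again in PyTorch: the projected fc7 vector and an RNN encoding of the <UTT>-delimited textual context are concatenated and passed through a fully connected layer to produce the decoder's initial state. The layer names and the choice of tanh are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class VTContextEncoder(nn.Module):
    """Sketch of V&T-style context fusion: projected fc7 + RNN text encoding
    are concatenated and mapped to the decoder's initial state via an FC layer."""
    def __init__(self, vocab_size=6000, hidden_size=500, fc7_dim=4096):
        super().__init__()
        self.img_proj = nn.Linear(fc7_dim, hidden_size)
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.text_rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, fc7, context_tokens):
        # context_tokens: the textual context as one sequence of word ids,
        # with successive utterances separated by an <UTT> marker token.
        img_vec = torch.tanh(self.img_proj(fc7))              # (batch, hidden)
        emb = self.embed(context_tokens)
        _, h_text = self.text_rnn(emb)                        # (1, batch, hidden)
        fused = torch.tanh(self.fuse(torch.cat([img_vec, h_text.squeeze(0)], dim=-1)))
        return fused.unsqueeze(0)                             # initial state for the decoder GRU
```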

Decoding and Reranking. For all generation models, at decoding time we generate N-best lists using left-to-right beam search with beam size 25. We set the maximum number of tokens to 13 for the generated partial hypotheses. Any partial hypothesis that reaches the <EOS> token becomes a viable full hypothesis for reranking. The first few hypotheses at the top of the N-best lists generated by Seq2Seq models tend to be very generic,3 disregarding the input context. In order to address this issue, we rerank the N-best list using the following score function:

log p(h|C) + λ idf(h, D) + µ |h| + κ V(h)    (1)

where p(h|C) is the probability of the generated hypothesis h given the context C. The function V counts the number of verbs in the hypothesis, and |h| denotes the number of tokens in the hypothesis. The function idf is the inverse document frequency, computing how common a hypothesis is across all the generated N-best lists. Here D is the set of all N-best lists and d is a specific N-best list. We define idf(h, D) = log(|D| / |{d ∈ D : h ∈ d}|), where we set N=10 to cut short each N-best list. These parameters were selected following reranking experiments on the validation set. We optimize all the parameters of the scoring function towards maximizing the smoothed-BLEU score (Lin and Och, 2004) using the Pairwise Ranking Optimization algorithm (Hopkins and May, 2011).

5.2 Retrieval Models

In addition to generation, we implemented two retrieval models customized for the tasks of question and response generation. Work in vision and language has demonstrated the effectiveness of retrieval models, where one uses the annotation (e.g., caption) of a nearest neighbor in the training image set to annotate a given test image (Mostafazadeh et al., 2016b; Devlin et al., 2015; Hodosh et al., 2013; Ordonez et al., 2011; Farhadi et al., 2010).

3 An example generic question is "where is this?" and a generic response is "I don't know."


Figure 6: Question generation using the Visual Context Sensitive Model (V-Gen), Textual Context Sensitive Model (T-Gen), and the Visual & Textual Context Sensitive Model (V&T.BOW-Gen), respectively.

Question Generation
  Example 1
    Textual Context: The weather was amazing at this baseball game.
    Gold Question: Nice, which team won?
    V&T-Ret: U at the game? or did someone take that pic for you?
    V-Gen: Where are you?
    V&T-Gen: Who's winning?
  Example 2
    Textual Context: I got in a car wreck today!
    Gold Question: Did you get hurt?
    V&T-Ret: You driving that today?
    V-Gen: Who's is that?
    V&T-Gen: What happened?
  Example 3
    Textual Context: My cousins at the family reunion.
    Gold Question: What is the name of your cousin in the blue shirt?
    V&T-Ret: U had fun?
    V-Gen: Who's that guy?
    V&T-Gen: Where's my invite?

Response Generation
  Example 1
    Textual Context: The weather was amazing at this baseball game. <UTT> Nice, which team won?
    Gold Response: My team won this game.
    V&T-Ret: 10 for me and 28 for my dad.
    V&T-Gen: ding ding ding!
  Example 2
    Textual Context: I got in a car wreck today! <UTT> Did you get hurt?
    Gold Response: No it wasn't too bad of a bang up.
    V&T-Ret: Yes.
    V&T-Gen: Nah, I'm at home now.
  Example 3
    Textual Context: My cousins at the family reunion. <UTT> What is the name of your cousin in the blue shirt?
    Gold Response: His name is Eric.
    V&T-Ret: lords cricket ground . beautiful.
    V&T-Gen: He's not mine!

Table 4: Example question and response generations on the IGCCrowd test set. All the generation models use beam search with reranking. In the textual context, <UTT> separates different utterances.


Figure 7: The visual & textual context sensitive model with RNN encoding (V&T.RNN-Gen).


Visual Context Sensitive Model (V-Ret). This model uses only the provided image for retrieval. First, we find a set of K nearest training images for the given test image based on the cosine similarity of the fc7 vision feature vectors. Then we retrieve those K annotations as our pool of K candidates. Finally, we compute the textual similarity among the questions in the pool according to a Smoothed-BLEU (Lin and Och, 2004) similarity score, then emit the sentence that has the highest similarity to the rest of the pool.
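A sketch of this retrieval scheme using numpy and nltk. nltk's SmoothingFunction is used here as a stand-in for the Smoothed-BLEU of Lin and Och (2004), and the function and variable names are ours, not the paper's.

```python
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def retrieve_question(test_fc7, train_fc7, train_questions, k=10):
    """V-Ret-style sketch: nearest neighbors by fc7 cosine similarity, then a
    consensus pick by sentence-level smoothed BLEU against the rest of the pool."""
    # cosine similarity between the test image and every training image
    train_norm = train_fc7 / np.linalg.norm(train_fc7, axis=1, keepdims=True)
    test_norm = test_fc7 / np.linalg.norm(test_fc7)
    sims = train_norm @ test_norm
    pool = [train_questions[i] for i in np.argsort(-sims)[:k]]

    smooth = SmoothingFunction().method1
    def consensus(candidate):
        others = [q.split() for q in pool if q is not candidate]
        return sum(sentence_bleu([o], candidate.split(), smoothing_function=smooth)
                   for o in others) / max(len(others), 1)

    # emit the candidate most similar, on average, to the rest of the pool
    return max(pool, key=consensus)
```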

Visual & Textual Context Sensitive Model (V&T-Ret). This model uses a linear combination of fc7 and word2vec feature vectors for retrieving similar training instances.

6 Evaluation Setup

We provide both human (Table 5) and automatic (Table 6) evaluations for our question and response generation tasks on the IGCCrowd test set. We crowdsourced our human evaluation on an AMT-like crowdsourcing system, asking seven crowd workers to each rate the quality of candidate questions or responses on a three-point Likert-like scale, ranging from 1 to 3 (the highest). To ensure a calibrated rating, we showed the human judges all system hypotheses for a particular test case at the same time. System outputs were randomly ordered to prevent judges from guessing which systems were which on the basis of position. After collecting judgments, we averaged the scores throughout the test set for each model. We discarded any annotators whose ratings varied from the mean by more than 2 standard deviations.
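The aggregation and outlier filtering can be sketched as follows. The paper does not state whether the 2-standard-deviation filter was applied per item or per annotator, so this version, which drops whole annotators, is only one plausible reading; the function name and array layout are assumptions.

```python
import numpy as np

def aggregate_ratings(ratings):
    """ratings: array of shape (num_annotators, num_items), values in {1, 2, 3}.
    Drop annotators whose mean rating is more than 2 standard deviations from
    the overall annotator mean, then average the remaining ratings."""
    annotator_means = ratings.mean(axis=1)
    mu, sigma = annotator_means.mean(), annotator_means.std()
    keep = np.abs(annotator_means - mu) <= 2 * sigma
    return ratings[keep].mean()
```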

Although human evaluation is to be preferred, and currently essential in open-domain generation tasks involving intrinsically diverse outputs, it is useful to have an automatic metric for day-to-day evaluation. For ease of replicability, we use the standard Machine Translation metric, BLEU (Papineni et al., 2002), which captures n-gram overlap between hypotheses and multiple references. Results reported in Table 6 employ BLEU with equal weights up to 4-grams at the corpus level on the multi-reference IGCCrowd test set. Although Liu et al. (2016) suggest that BLEU fails to correlate with human judgment at the sentence level, correlation increases when BLEU is applied at the document or corpus level (Galley et al., 2015; Przybocki et al., 2008).
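For example, corpus-level multi-reference BLEU with equal weights up to 4-grams could be computed with nltk's corpus_bleu; the paper does not name the toolkit it used, so this is only one possible implementation.

```python
from nltk.translate.bleu_score import corpus_bleu

def multi_reference_bleu(hypotheses, references):
    """hypotheses: list of generated strings.
    references: list of lists of reference strings (multiple references per test case)."""
    hyp_tokens = [h.split() for h in hypotheses]
    ref_tokens = [[r.split() for r in refs] for refs in references]
    # equal weights up to 4-grams, computed at corpus level
    return corpus_bleu(ref_tokens, hyp_tokens, weights=(0.25, 0.25, 0.25, 0.25))
```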

7 Experimental Results

We experimented with all the models presented in Section 5. For question generation, we used a visual & textual context sensitive model with a bag-of-words representation of the textual context (V&T.BOW-Gen), which achieved better results. Earlier vision & language work such as VQA (Antol et al., 2015) has shown that a bag-of-words baseline outperforms LSTM-based models for representing textual input when visual features are available (Zhou et al., 2015). In response generation, which needs to account for textual input consisting of two turns, we used the V&T.RNN-Gen model as the visual & textual context sensitive model for the response rows of Tables 5 and 6. Since generating a response solely from visual context is unlikely to be successful, we did not use the V-Gen model in response generation. All models are trained on the IGCTwitter dataset, except for VQG, which shares the same architecture as the V-Gen model but is trained on 7,500 questions from the VQG dataset (Mostafazadeh et al., 2016b), as a point of reference. We also include the gold human references from the IGCCrowd dataset in the human evaluation to set a bound on human performance.

Table 4 presents example generations by our best-performing systems. In the human evaluation shown in Table 5, the model that encodes both visual and textual context outperforms the others. We note that human judges preferred the top generation in the n-best list over the reranked best, likely owing to the tradeoff between a safe and generic utterance and a riskier but contentful one. The human gold references are consistently favored throughout the table. We take this as evidence that the IGCCrowd test set provides a robust and challenging test set for benchmarking progress.



Question:
  Human Gold: 2.68
  Generation (Greedy): Textual 1.46, Visual 1.58, V&T 1.86
  Generation (Beam, best): Textual 1.07, Visual 1.86, V&T 2.28, VQG 2.24
  Generation (Reranked, best): Textual 1.03, Visual 2.06, V&T 2.13
  Retrieval: Visual 1.59, V&T 1.54

Response:
  Human Gold: 2.75
  Generation (Greedy): Textual 1.24, Visual –, V&T 1.40
  Generation (Beam, best): Textual 1.12, Visual –, V&T 1.49, VQG –
  Generation (Reranked, best): Textual 1.04, Visual –, V&T 1.44
  Retrieval: Visual –, V&T 1.48

Table 5: Human judgment results on the IGCCrowd test set. The maximum score is 3. Per model, the human score is computed by averaging across multiple images. The overall highest scores are the human gold standards.

Question:
  Generation: Textual 1.71, Visual 3.23, V&T 4.41, VQG 8.61
  Retrieval: Visual 0.76, V&T 1.16

Response:
  Generation: Textual 1.34, Visual –, V&T 1.57, VQG –
  Retrieval: Visual –, V&T 0.66

Table 6: Results of evaluation using multi-reference BLEU.

As shown in Table 6, BLEU scores are low, as is characteristic for language tasks with intrinsically diverse outputs (Li et al., 2016b,a). On BLEU, the multimodal V&T model outperforms all the other models across test sets, except for the VQG model, which does significantly better. We attribute this to two issues: (1) the VQG training dataset contains event-centric images similar to the IGCCrowd test set; (2) training on a high-quality crowd-sourced dataset with controlled parameters can, to a significant extent, produce better results on similarly crowd-sourced test data than training on data found "in the wild", such as Twitter. However, crowd-sourcing multi-turn conversations between paired workers at large scale is prohibitively expensive, a factor that favors the use of readily available, naturally-occurring but noisier data.

Overall, in both automatic and human evaluation, the question generation models are more successful than the response generation models. This disparity might be overcome by (1) implementation of more sophisticated systems for richer modeling of long contexts across multiple conversational turns, and (2) training on larger, higher-quality datasets.

8 Conclusions

We have introduced a new task of multimodal image-grounded conversation, in which, given an image and a natural language text, the system must generate meaningful conversational turns, the second turn being a question. We are releasing to the research community a crowd-sourced dataset of 4,222 high-quality, multiple-turn, multiple-reference conversations about eventful images. Inasmuch as this dataset is not tied to the characteristics of any specific social media resource, e.g., Twitter or Reddit, we expect it to remain stable over time, as it is less susceptible to attrition in the form of deleted posts or accounts.

Our experiments provide evidence that capturing multimodal context improves the quality of question and response generation. Nonetheless, the performance gap between our best models and humans opens opportunities for further research along the continuum from casual chit-chat conversation to more topic-oriented dialog. We expect that the addition of other forms of grounding, such as the temporal and geolocation information often embedded in images, will further improve performance.

In this paper, we illustrated the application of this dataset using simple models trained on conversations on Twitter. In the future, we expect that more complex models and richer datasets will permit the emergence of intelligent human-like agent behavior that can engage in implicit commonsense reasoning around images and proactively drive the interaction forward.

Acknowledgments

We are grateful to Piali Choudhury, Rebecca Hanson, Ece Kamar, and Andrew Mao for their assistance with crowd-sourcing. We would also like to thank our reviewers for their helpful comments.

References

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proc. ICCV.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet Project. In Proc. COLING.

Jianfu Chen, Polina Kuznetsova, David Warren, and Yejin Choi. 2015. Deja image-captions: A corpus of expressive descriptions in repetition. In Proc. NAACL-HLT.

Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. EMNLP.


Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jose M. F. Moura, Devi Parikh, and Dhruv Batra. 2017a. Visual dialog. In Proc. CVPR.

Abhishek Das, Satwik Kottur, Jose M. F. Moura, Stefan Lee, and Dhruv Batra. 2017b. Learning cooperative visual dialog agents with deep reinforcement learning. In Proc. ICCV.

Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig, and Margaret Mitchell. 2015. Language models for image captioning: The quirks and what works. In Proc. ACL-IJCNLP.

Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proc. CVPR.

Hao Fang, Saurabh Gupta, Forrest N. Iandola, Rupesh Srivastava, Li Deng, Piotr Dollar, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, and Geoffrey Zweig. 2014. From captions to visual concepts and back. In Proc. CVPR.

Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In Proc. ECCV.

Michel Galley, Chris Brockett, Alessandro Sordoni, Yangfeng Ji, Michael Auli, Chris Quirk, Margaret Mitchell, Jianfeng Gao, and Bill Dolan. 2015. deltaBLEU: A discriminative metric for generation tasks with intrinsically diverse targets. In Proc. ACL-IJCNLP.

Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Int. Res., 47(1):853–899.

Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proc. EMNLP.

Ting-Hao (Kenneth) Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende, Michel Galley, and Margaret Mitchell. 2016. Visual storytelling. In Proc. NAACL-HLT.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A diversity-promoting objective function for neural conversation models. In Proc. NAACL-HLT.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016b. A persona-based neural conversation model. In Proc. ACL.

Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proc. ACL.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proc. ECCV.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proc. EMNLP.

Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Proc. SIGDIAL.

Mateusz Malinowski and Mario Fritz. 2014. A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS.

Andrew Mao, David Parkes, Yiling Chen, Ariel D. Procaccia, Krzysztof Z. Gajos, and Haoqi Zhang. 2012. Turkserver: Enabling synchronous and longitudinal online experiments. In Workshop on Human Computation (HCOMP).

Marvin Minsky. 1974. A framework for representing knowledge. Technical report, Cambridge, MA, USA.

Meredith Ringel Morris, Annuska Zolyomi, Catherine Yao, Sina Bahram, Jeffrey P. Bigham, and Shaun K. Kane. 2016. "With most of it being pictures now, I rarely use it": Understanding Twitter's evolving accessibility to blind users. In Proc. CHI.

Nasrin Mostafazadeh, Alyson Grealish, Nathanael Chambers, James F. Allen, and Lucy Vanderwende. 2016a. CaTeRS: Causal and temporal relation scheme for semantic annotation of event structures. In Proceedings of the 4th Workshop on EVENTS: Definition, Detection, Coreference, and Representation.

Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Margaret Mitchell, Xiaodong He, and Lucy Vanderwende. 2016b. Generating natural questions about an image. In Proc. ACL.

Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2Text: Describing images using 1 million captioned photographs. In Proc. NIPS.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proc. ACL.

M. Przybocki, K. Peterson, and S. Bronsart. 2008. Official results of the NIST 2008 metrics for machine translation challenge. In Proc. MetricsMATR08 workshop.


Arijit Ray, Gordon Christie, Mohit Bansal, Dhruv Batra, and Devi Parikh. 2016. Question relevance in VQA: Identifying non-visual and false-premise questions. In Proc. EMNLP.

Alan Ritter, Colin Cherry, and William B. Dolan. 2011. Data-driven response generation in social media. In Proc. EMNLP.

Marcus Rohrbach, Sikandar Amin, Mykhaylo Andriluka, and Bernt Schiele. 2012. A database for fine grained activity detection of cooking activities. In Proc. CVPR.

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2015a. Hierarchical neural network generative models for movie dialogues. arXiv preprint arXiv:1507.04808.

Iulian Vlad Serban, Ryan Lowe, Laurent Charlin, and Joelle Pineau. 2015b. A survey of available corpora for building data-driven dialogue systems. CoRR, abs/1512.05742.

Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In Proc. ACL-IJCNLP.

K. Simonyan and A. Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proc. ICLR.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. In Proc. NAACL-HLT.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.

Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2015. Translating videos to natural language using deep recurrent neural networks. In Proc. NAACL-HLT.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. In Proc. Deep Learning Workshop, ICML.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proc. CVPR.

Terry Winograd. 1972. Understanding Natural Language. Academic Press, New York.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proc. ICML.

Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. 2015. Simple baseline for visual question answering. CoRR, abs/1512.02167.
