

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 2109–2118, Hong Kong, China, November 3–7, 2019. © 2019 Association for Computational Linguistics.

Video Dialog via Progressive Inference and Cross-Transformer

Weike Jin¹, Zhou Zhao¹*, Mao Gu¹, Jun Xiao¹, Furu Wei², Yueting Zhuang¹

¹Zhejiang University, Hangzhou   ²Microsoft Research Asia, Beijing

{weikejin, zhaozhou, 21821134, junx, yzhuang}@zju.edu.cn, [email protected]

Abstract

Video dialog is a new and challenging task, which requires the agent to answer questions by combining video information with the dialog history. Different from single-turn video question answering, the additional dialog history is important for video dialog, since it often contains contextual information for the question. Existing visual dialog methods mainly use an RNN to encode the dialog history as a single vector representation, which can be rough and simplistic. More advanced methods utilize hierarchical structures, attention and memory mechanisms, but they still lack an explicit reasoning process. In this paper, we introduce a novel progressive inference mechanism for video dialog, which progressively updates the query information based on the dialog history and video content until the agent considers the information sufficient and unambiguous. To tackle the multi-modal fusion problem, we propose a cross-transformer module, which can learn more fine-grained and comprehensive interactions both inside and between the modalities. Besides answer generation, we also consider question generation, which is more challenging but significant for a complete video dialog system. We evaluate our method on two large-scale datasets, and extensive experiments show the effectiveness of our method.

1 Introduction

Visual dialog can be seen as an extension of visual question answering (VQA) (Antol et al., 2015; Yu et al., 2017). Unlike visual question answering, in which each question is asked independently, visual dialog requires the agent to answer multi-round questions, and the previous rounds of question-answer pairs form the dialog history. Currently, most existing visual dialog approaches focus on image dialog (Das et al., 2017a; Seo et al., 2017; Kottur et al., 2018; Lee et al., 2018), while video dialog is still less explored.

∗Zhou Zhao is the corresponding author.

Figure 1: An illustration of our video dialog model with progressive inference and cross-transformer. The example dialog history is: Q1: What does the person add to the water? A1: Pepper. Q2: What color is the plate with meat? A2: Yellow. Q3: What does the person make in the plastic container? A3: He makes a sauce. Q4: What color is it? A4: Green. The current question is "What does the person do with it?", and the answer is "He mixes the sauce with meat."

For video dialog, given a video, a dialog history and a question about the video content, the agent has to combine the video information with contextual information from the dialog history to infer the answer accurately. Due to the complexity of video information, this is more challenging than image dialog.

There are two major problems in video dialog: (1) How to obtain valuable information from the dialog history. We find that the content of a dialog usually has a certain continuity, and a subsequent turn is likely to continue what was discussed before. Thus, it is important to infer valuable information from the dialog history to help answer the question; sometimes, the question is ambiguous without it. For instance, as illustrated in Figure 1, the question "What does the person do with it?"


contains the pronoun "it", which makes the question ambiguous. We cannot figure out what "it" really refers to if we rely on the question alone. Since the question carries no other key information, it is also hard to get help from the video content. In such a situation, the dialog history is necessary. Existing visual dialog methods (Das et al., 2017a) mainly use a recurrent neural network (such as an LSTM) to encode the dialog history as a single vector representation, which we consider rough and simplistic. More advanced methods (Seo et al., 2017; Zhao et al., 2018) utilize hierarchical structures, attention and memory mechanisms to refine the dialog history representation, but they still lack an explicit reasoning process. Recently, Kottur et al. (2018) proposed a neural module network architecture including two novel modules that perform coreference resolution at the word level. However, their work is based on images, which means they only take static visual characteristics into consideration, such as objects and attributes. Video dialog additionally involves dynamic characteristics, such as actions and state transitions. Thus, we employ multi-stream video information in our model, and, exploiting the continuity of the dialog history, we design a reverse-order progressive reasoning process to extract useful information and eliminate ambiguity. (2) How to answer the question related to the video content. This is also the key problem in video question answering. A mainstream approach is to use a recurrent neural network to learn the hidden states of the video sequence, owing to its temporal structure. The encoded query information is then used to select relevant information from the hidden states through an attention mechanism or a memory network. Finally, the query information is combined with the refined video information by multi-modal fusion to generate the answer.

In this paper, in order to make better use of the dialog history and video information, we propose a progressive inference mechanism for video dialog, which progressively updates the query information based on the dialog history and video content, as shown in Figure 1. The progressive inference stops when the agent considers the query information sufficient and unambiguous, or when it has reached the beginning of the dialog history. Since the query information is a sequence of vector representations containing key information from the question, dialog history and video content, we propose a cross-transformer module to learn a stronger joint representation of them. In addition to answer generation, we also attempt to generate questions according to the dialog history, which is more challenging but significant for a complete video dialog system. The generator we use is an extension of the Transformer (Vaswani et al., 2017) decoder that can process multi-stream inputs. The main contributions of this paper are as follows:

• Unlike previous studies, we propose a novel progressive inference mechanism for video dialog, which progressively updates query information based on the dialog history and video content.

• We design a cross-transformer module to tackle the multi-modal fusion problem, which can learn more fine-grained and comprehensive interactions both inside and between the visual and textual information.

• Besides answer generation, we also realize question generation under the same framework to construct a complete video dialog system.

• Our method achieves state-of-the-art performance on two large-scale datasets.

2 Related Work

In this section, we briefly review related work on visual dialog, including image dialog and video dialog.

As first proposed in Das et al. (2017a), visual dialog requires the agent to predict the answer to a given question based on an image and a dialog history. While dialog systems (Serban et al., 2016, 2017) have been widely explored, visual dialog is still a young task, and approaches have been proposed only recently. Das et al. (2017a) propose three models based on late fusion, an attentional hierarchical LSTM and memory networks, respectively. They also propose the VisDial dataset by pairing two subjects on Amazon Mechanical Turk to chat about an image. Lu et al. (2017) propose a generator-discriminator architecture where the outputs of the generator are updated using a perceptual loss from a pre-trained discriminator. De Vries et al. (2017) propose a GuessWhat game-style dataset, in which one person


asks questions about an image to guess which object has been selected and the second person answers the questions. Das et al. (2017b) utilize deep reinforcement learning to learn the policies of a 'Questioner-Bot' and an 'Answerer-Bot', with the goal of selecting the right image that the two agents are talking about. Seo et al. (2017) resolve visual references in the question with a new attention mechanism equipped with an attention memory, and the model indirectly resolves coreferences of expressions through the attention retrieval process. Jain et al. (2018) propose a simple symmetric discriminative baseline that can be applied both to predicting an answer and to predicting a question. Kottur et al. (2018) propose introspective reasoning about visual coreferences, which links coreferences and grounds them in the image at the word level, rather than at the sentence level as in prior visual dialog work. Massiceti et al. (2018) propose FLIPDIAL, a generative convolutional model for visual dialogue that can generate answers as well as questions based on a visual context. Lee et al. (2018) propose a practical goal-oriented dialog framework using an information-theoretic approach, in which the questioner figures out the answerer's intention by selecting a plausible question based on the information gain of the candidate intentions and the possible answers to each question. Wu et al. (2018) propose a sequential co-attention generative model that jointly learns the image and dialog history information together with the question, and a discriminator that can dynamically access the attention memories with an intermediate reward.

The aforementioned work mainly focuses on image dialog; video dialog is still less explored. One similar work is proposed by Zhao et al. (2018). They study the problem of multi-turn video question answering by employing a hierarchical attention context learning method with recurrent neural networks for context-aware question understanding, and a multi-stream attention network that learns the joint video representation. They also propose two large-scale multi-turn video question answering datasets, which are employed in our experiments. Recently, Hori et al. (2018) propose a model that incorporates technologies for multimodal attention-based video description into an end-to-end dialog system, and Pasunuru and Bansal (2018) propose a new game-chat based, video-context, many-speaker dialogue task. In this work, we utilize a more explicit progressive inference mechanism and a novel cross-transformer module to generate both answers and questions for video dialog.

3 Our Approach

3.1 Problem Formulation

Before introducing our method, we first introduce some basic notation. We denote the video by v ∈ V, the dialog history by c ∈ C, the new question by q ∈ Q and the corresponding answer by a ∈ A. Since a video is a sequence of frames, the frame-level representation of video v is denoted by v^f = (v^f_1, v^f_2, ..., v^f_{T_1}), where T_1 is the number of frames in video v. Besides the frame-level representation, we also employ a segment-level representation to bring in more video information, given by v^s = (v^s_1, v^s_2, ..., v^s_{T_2}), where T_2 is the number of segments and v^s_j is the vector representation of the j-th segment. The dialog history c ∈ C is given by c = (c_1, c_2, ..., c_N), where c_i is the i-th round of dialog, consisting of a question q_i and an answer a_i. Using this notation, the task of video dialog can be formulated as follows: given a set of videos V and the associated dialog histories C, the goal is to train a model that learns to generate human-like answers when a new question about the visual content is asked. Similar to video question answering, there are two kinds of methods to produce the answer: generative and discriminative. In the generative setting, a word sequence generator (normally an RNN) is employed to fit the ground-truth answer sequences. In the discriminative setting, an additional candidate answer vocabulary is provided and the problem is reformulated as multi-class classification. In this paper, we tackle the generative version of video dialog, since it is more challenging than the discriminative version and more valuable in practical systems.
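To make the shapes concrete, the inputs can be pictured as the following tensors; this is a sketch of our own with illustrative sizes, and only the 512-dimensional feature size is taken from the implementation details in Section 4.1.

```python
import torch

d = 512          # shared feature dimension for words and video features (Section 4.1)
T1, T2 = 60, 16  # number of frames / segments (illustrative values)
N, L = 5, 20     # rounds of dialog history / padded length of a question-answer pair

v_f = torch.randn(T1, d)   # frame-level representation v^f = (v^f_1, ..., v^f_{T1})
v_s = torch.randn(T2, d)   # segment-level representation v^s = (v^s_1, ..., v^s_{T2})
c = torch.randn(N, L, d)   # dialog history c = (c_1, ..., c_N), word embeddings per round
q = torch.randn(12, d)     # the new question q, embedded word by word
```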

3.2 Progressive Inference

We first introduce our dialog encoder with the progressive inference mechanism. Figure 2 shows an overview of the dialog encoder, in which the inference process has been unrolled for intuitive understanding. As shown in the figure, the inputs of this module are a sequence of dialog history c = (c_1, c_2, ..., c_N) and the query x. The query x can differ according to the purpose.


Figure 2: The overview of the dialog encoder with progressive inference.

For answer generation, the query x is the question q to be answered. For question generation, we combine the latest round of dialog c_N in the dialog history with the ground-truth answer a of that question as the query x, considering the topic continuity of the dialog. It is intuitive that a new round of dialog is more likely to be relevant to the latest dialog history, especially in video dialog, where the dialog focuses on a specific scenario rather than chit-chat. Each round of dialog consists of a question q_i = (w^q_{i1}, w^q_{i2}, ..., w^q_{il}) and an answer a_i = (w^a_{i1}, w^a_{i2}, ..., w^a_{il'}), where w^q, w^a are word embeddings and l, l' are the corresponding sentence lengths. For the i-th round of dialog c_i, instead of using two LSTM-like neural networks to learn sentence-level representations individually, we concatenate the question q_i and the answer a_i end to end, giving c_i = (w^q_{i1}, w^q_{i2}, ..., w^q_{il}, w^a_{i1}, w^a_{i2}, ..., w^a_{il'}), and then employ a self-attention mechanism to extract more fine-grained word-level interactions between the question-answer pair. The query x is encoded by a similar self-attention mechanism. Specifically, the attention processing units (APU) we employ are based on the Transformer model (Vaswani et al., 2017), which has achieved great success in natural language processing. The first type of attention processing unit (APU-1) is the same as the encoder module of the Transformer model. It consists of a multi-head self-attention layer and a transition layer (we omit residual connections and layer normalization in the figure for conciseness). Here, the transition layer is a fully-connected network consisting of two linear transformations with a rectified-linear activation in between. The second type of attention processing unit (APU-2) is similar to the decoder module of the Transformer model, but differs in the middle multi-head attention layer. Let the outside input be I_o and the output of the multi-head self-attention layer be O_a; the normal operation in the Transformer decoder is

    Attention(O_a, I_o, I_o) = softmax(O_a I_o^T / √d) I_o,    (1)

where d is the dimension of the sequence elements. In our method, the order of the inputs is swapped:

    Attention(I_o, O_a, O_a) = softmax(I_o O_a^T / √d) O_a.    (2)

The reason for this swap is that we want the outside input, not the inside information, to guide the inner attention operation. In the video dialog scenario, we expect the query information to filter related and helpful information from the dialog history.
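As a minimal single-head sketch of this swap (our own illustration for clarity, not the released code; multi-head projections, residual connections, layer normalization and the transition layer are omitted), the only difference between Eq. (1) and Eq. (2) is which sequence supplies the queries:

```python
import torch
import torch.nn.functional as F

def attention(query, key, value):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d ** 0.5
    return F.softmax(scores, dim=-1) @ value

O_a = torch.randn(20, 512)  # output of APU-2's self-attention (e.g., a dialog round)
I_o = torch.randn(12, 512)  # outside guidance (e.g., the current query sequence)

out_eq1 = attention(O_a, I_o, I_o)  # standard decoder attention, Eq. (1): shape (20, 512)
out_eq2 = attention(I_o, O_a, O_a)  # swapped attention used in APU-2, Eq. (2): shape (12, 512)
```

Note that in Eq. (2) the output has the length of the outside input I_o, so the filtered dialog information is aligned with the query, which is consistent with the element-wise update in Eq. (5) below.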

Having introduced the basic modules of our dialog encoder, we now describe the progressive inference mechanism, shown in Algorithm 1. First, we add a positional encoding to the input query x to bring in order information, similar to (Vaswani et al., 2017). Then, we employ the first type of attention processing unit (APU-1) to encode the initial query information; the encoded query is denoted q. After that, we progressively update the query information from each round of dialog. Due to the continuity of the dialog history, the inference proceeds from the latest round back to the first round. For the i-th round, the cross-transformer module is first used to capture both query-aware video information and video-aware query information, which are merged into the updated queries {O_fq, O_sq}, one per visual feature stream. Then, the information of the current round of dialog history c_i is encoded and filtered by the second type of attention processing unit (APU-2), using the latest updated query. In this way, we obtain the query-related information c_i from the current round of dialog, which completes the query information and eliminates ambiguity.


Figure 3: The overview of the cross-transformer module for multi-modal fusion.

Finally, we utilize an inference gate to update the query information again and to decide whether reasoning should stop:

    S = σ([c^1_i; O_fq] W_1 + b_1),    (3)
    S' = σ([c^1_i; O_fq] W_2 + b_2),    (4)
    O_fq = S' ⊙ c^1_i + S ⊙ O_fq,    (5)
    µ_1 = σ(W_3 (O_fq W_4 + b_3) + b_4),    (6)

where σ is the sigmoid function, [;] denotes concatenation, ⊙ is element-wise multiplication, W_1, W_2, W_3, W_4 are weight matrices, b_1, b_2, b_3, b_4 are biases, S, S' are the gate's score vectors, and µ_1 (analogously µ_2 for the segment-level stream) is a scalar information score. If µ = (µ_1 + µ_2)/2 exceeds a threshold τ, the inference gate outputs the current {O_fq, O_sq} as the final output of the encoder and stops the inference, which means the agent considers the current query information sufficient and unambiguous for answer or question generation. The inference also stops when it reaches the beginning of the dialog history.
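A compact sketch of the gate in Eqs. (3)-(6) might look as follows; this is our own reading of the equations (in particular, the mean pooling that reduces the sequence to the scalar score µ is our simplification, since the text does not spell that step out):

```python
import torch
import torch.nn as nn

class InferGate(nn.Module):
    """Sketch of the inference gate, Eqs. (3)-(6)."""

    def __init__(self, d):
        super().__init__()
        self.w1 = nn.Linear(2 * d, d)  # produces gate S
        self.w2 = nn.Linear(2 * d, d)  # produces gate S'
        self.w4 = nn.Linear(d, d)      # inner linear of Eq. (6)
        self.w3 = nn.Linear(d, 1)      # outer linear of Eq. (6)

    def forward(self, c_i, o_q):
        # c_i: filtered dialog information c^1_i, o_q: refined query O_fq,
        # both of shape (query_len, d) thanks to the swapped attention.
        joint = torch.cat([c_i, o_q], dim=-1)
        s = torch.sigmoid(self.w1(joint))                  # Eq. (3)
        s_prime = torch.sigmoid(self.w2(joint))            # Eq. (4)
        o_q = s_prime * c_i + s * o_q                      # Eq. (5)
        mu = torch.sigmoid(self.w3(self.w4(o_q))).mean()   # Eq. (6), pooled to a scalar
        return mu, o_q

mu, o_fq = InferGate(512)(torch.randn(12, 512), torch.randn(12, 512))
```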

3.3 Cross-Transformer Module

In this section, we introduce the cross-transformer (CT) module for multi-modal fusion, which learns more fine-grained and comprehensive interactions both inside and between the input modalities. As shown in Figure 3, there are two input channels, one for video information and one for query information. We utilize two kinds of video information, the frame-level information v^f and the segment-level information v^s; here we take v^f as an example, and v^s is processed in the same way. There are two attention processing units (APU-2) in the CT module. Instead of encoding the two channels individually, we design a cross connection between the APU-2 units to learn interactions both inside and between the modalities. Specifically, we use the updated query O_fq to guide the middle attention layer of the video-channel APU-2, in order to learn the query-aware video information v^f_q. The frame-level information v^f is likewise used to guide the same attention layer of the query-channel APU-2, to learn the frame-aware query information q_uf. We then fuse the output of each APU-2 with its input guidance by concatenation and a linear layer:

    F_v = Linear([v^f; q_uf]),    (7)
    F_q = Linear([v^f_q; O_fq]).    (8)

Before outputting, we swap the output streams again to restore the original input order for the next layer. By stacking M such layers, we expect to strengthen the fusion. Finally, we employ another multi-head attention layer to fuse F_v and F_q, and the output is denoted O_fq. For the segment-level video information v^s, we analogously obtain the output O_sq.
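The cross connection can be sketched as below for a single layer and a single head; this is our own simplification, since the real APU-2 units also contain self-attention, transition layers, residual connections and layer normalization, the layer is stacked M times, and a final multi-head attention fuses F_v and F_q into O_fq.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def guided_attention(guide, inner):
    """Swapped attention of Eq. (2): the guide supplies the queries."""
    d = guide.size(-1)
    scores = guide @ inner.transpose(-2, -1) / d ** 0.5
    return F.softmax(scores, dim=-1) @ inner

class CrossTransLayer(nn.Module):
    """One cross-transformer layer: each stream guides the other."""

    def __init__(self, d):
        super().__init__()
        self.fuse_v = nn.Linear(2 * d, d)
        self.fuse_q = nn.Linear(2 * d, d)

    def forward(self, v, q):
        v_q = guided_attention(q, v)  # query-aware video information, length of q
        q_v = guided_attention(v, q)  # frame-aware query information, length of v
        f_v = self.fuse_v(torch.cat([v, q_v], dim=-1))  # Eq. (7)
        f_q = self.fuse_q(torch.cat([v_q, q], dim=-1))  # Eq. (8)
        # The "swap" keeps F_v on the video stream and F_q on the query
        # stream, so the next stacked layer sees the original input order.
        return f_v, f_q

v, q = torch.randn(60, 512), torch.randn(12, 512)
f_v, f_q = CrossTransLayer(512)(v, q)  # shapes (60, 512) and (12, 512)
```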

Algorithm 1: Progressive Inference
Input: c, x, v^f, v^s, N
Parameters: c, q, i, µ, µ_1, µ_2
Output: O_fq, O_sq

1: Let i = N, µ = 0
2: q ← APU_1(x + position encoding)
3: O_fq ← O_sq ← q
4: while µ < τ and i > 0 do
5:   O_fq ← Cross_Trans(O_fq, v^f)
6:   O_sq ← Cross_Trans(O_sq, v^s)
7:   c^1_i ← APU_2(c_i, O_fq)
8:   c^2_i ← APU_2(c_i, O_sq)
9:   µ_1, O_fq ← Infer_Gate(c^1_i, O_fq)
10:  µ_2, O_sq ← Infer_Gate(c^2_i, O_sq)
11:  µ = (µ_1 + µ_2)/2
12:  i = i - 1
13: end while
14: return O_fq, O_sq
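Expressed as Python, Algorithm 1 might look like the sketch below; it assumes the component modules sketched above (apu1, apu2, cross_trans, infer_gate) are passed in as callables, so it is illustrative rather than the released implementation.

```python
def progressive_inference(c, x, v_f, v_s, apu1, apu2, cross_trans,
                          infer_gate, tau=0.85):
    """Transcription of Algorithm 1: walk the dialog history backwards,
    refining the frame-level and segment-level queries until the
    information score mu exceeds tau or the history is exhausted."""
    o_fq = o_sq = apu1(x)          # encode the query (APU-1 adds the position encoding)
    mu, i = 0.0, len(c) - 1        # start from the latest round c_N
    while mu < tau and i >= 0:
        o_fq = cross_trans(o_fq, v_f)   # fuse with frame-level video features
        o_sq = cross_trans(o_sq, v_s)   # fuse with segment-level video features
        c1 = apu2(c[i], o_fq)           # query-guided filtering of round i
        c2 = apu2(c[i], o_sq)
        mu1, o_fq = infer_gate(c1, o_fq)
        mu2, o_sq = infer_gate(c2, o_sq)
        mu = (float(mu1) + float(mu2)) / 2.0
        i -= 1
    return o_fq, o_sq
```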


Figure 4: The overview of the answer and question decoder.

3.4 Answer & Question Decoder

The decoder we use is shown in Figure 4. It employs the third type of attention processing unit (APU-3), which is the same as the Transformer decoder. Because the dialog encoder has two outputs, O_fq and O_sq, we utilize two separate APU-3 units to decode them. We thus obtain two different answer (or question) distributions, based on the appearance and motion information respectively. These two distributions are multiplied element-wise and a softmax layer produces the final output probabilities. We use a teacher-forcing strategy during training, and at generation time the answer (or question) is generated autoregressively.
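Taking the text literally, the final step of the decoder at each generation position can be sketched as follows (our own illustration; the names are ours, and the exact normalization in the released model may differ):

```python
import torch
import torch.nn.functional as F

vocab_size = 10000
# Per-step vocabulary scores from the two APU-3 decoders, one driven by
# the appearance-based query O_fq and one by the motion-based query O_sq.
logits_f = torch.randn(vocab_size)
logits_s = torch.randn(vocab_size)

p_f = F.softmax(logits_f, dim=-1)   # appearance-based distribution
p_s = F.softmax(logits_s, dim=-1)   # motion-based distribution
p = F.softmax(p_f * p_s, dim=-1)    # element-wise product, then softmax
next_word_id = int(p.argmax())      # greedy choice at this decoding step
```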

4 Experiments

4.1 Setup

Dataset. The datasets we use were proposed in (Zhao et al., 2018) and are built on the YouTubeClips (Chen and Dolan, 2011) and TACoS-MultiLevel (Rohrbach et al., 2014) datasets, which contain 1,987 and 1,303 videos respectively. Each video in YouTubeClips is composed of 60 frames; for TACoS-MultiLevel, the number of frames is 80. Each video has five different dialogs. There are 6,515 video dialogs in the YouTubeClips dataset and 9,935 in the TACoS-MultiLevel dataset, with 66,806 and 37,228 question-answer pairs respectively. Statistically, most video dialogs have five rounds of conversation for TACoS-MultiLevel and between three and twelve rounds for YouTubeClips. For both datasets, the training, validation and test splits are 90%, 5% and 5% of the constructed video dialogs.

Implementation. We use the pre-trained GloVe (Pennington et al., 2014) model to obtain word embeddings for the dialog; the dimension of the word vectors is 512. We employ a pre-trained VGGNet to learn the appearance feature of each frame, and the motion features of the video are extracted by a pre-trained 3D-ConvNet (Tran et al., 2015). We use transition layers (Section 3.2) to transform both appearance and motion features into the same dimension as the word vectors to ease later processing. After extensive tuning, we set the circulation steps T to 5 and the number of stacked layers M to 4. The threshold τ in the progressive inference is also important, since it influences the inference process: if it is too small, inference stops early and misses important information; if it is too large, inference stops too late, introducing interfering information and extra time cost. After many experiments and adjustments we settled on a balanced threshold of 0.85, though this value may differ slightly in other scenarios. For training, we use the Adam optimizer with an initial learning rate of 0.0005, and we adopt a warm-up strategy to improve the effectiveness of network learning.

Evaluation Metrics. We employ several metrics to evaluate the quality of the generated answers and questions: BLEU-N (N = 1, 2), ROUGE-L and METEOR, which are widely used in natural language generation tasks.
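For reference, the reported setup can be summarized as follows; the field names are ours, and the values are taken from the paragraphs above.

```python
# Summary of the reported setup; a convenience sketch, not a released config file.
config = {
    "word_embeddings": "GloVe (pre-trained)",
    "embedding_dim": 512,
    "appearance_features": "VGGNet (per frame)",
    "motion_features": "3D-ConvNet",
    "circulation_steps_T": 5,
    "stacked_ct_layers_M": 4,
    "inference_threshold_tau": 0.85,
    "optimizer": "Adam",
    "initial_learning_rate": 5e-4,
    "lr_schedule": "warm-up",
    "metrics": ["BLEU-1", "BLEU-2", "ROUGE-L", "METEOR"],
}
```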

4.2 Comparisons and Analysis

Because video dialog is still less explored, especially for generative results, we extend several existing image dialog and video question answering models as baselines for video dialog:

• ESA+, STAN+ and CDMN+ extend three video QA models, namely the ESA model (Zeng et al., 2017), STAN (Zhao et al., 2017) and the CDMN model (Gao et al., 2018), by adding a hierarchical LSTM network to model the dialog history.

• LF+, HRE+ and MN+ extend three image dialog models (Das et al., 2017a), based on late fusion, an attention-based hierarchical LSTM, and memory networks respectively, by utilizing an LSTM network to encode the video information.


Table 1: Experimental results of answer generation on the TACoS-MultiLevel and YoutubeClip datasets.

Method        TACoS-MultiLevel                     YoutubeClip
              BLEU-1  BLEU-2  ROUGE   METEOR       BLEU-1  BLEU-2  ROUGE   METEOR
ESA+          0.356   0.244   0.422   0.109        0.268   0.151   0.276   0.082
STAN+         0.408   0.312   0.449   0.133        0.315   0.185   0.306   0.090
CDMN+         0.429   0.341   0.460   0.142        0.293   0.161   0.311   0.094
LF+           0.404   0.290   0.465   0.135        0.284   0.183   0.307   0.083
HRE+          0.438   0.320   0.502   0.153        0.293   0.172   0.308   0.094
MN+           0.430   0.326   0.472   0.149        0.306   0.185   0.290   0.086
SFQIH+        0.438   0.334   0.481   0.153        0.326   0.202   0.319   0.085
HACRN         0.451   0.346   0.499   0.161        0.307   0.174   0.331   0.104
RICT (ours)   0.464   0.361   0.527   0.178        0.333   0.194   0.332   0.104

Table 2: Experimental results of question generation on the TACoS-MultiLevel and YoutubeClip datasets.

Method        TACoS-MultiLevel                     YoutubeClip
              BLEU-1  BLEU-2  ROUGE   METEOR       BLEU-1  BLEU-2  ROUGE   METEOR
ESA+          0.693   0.582   0.718   0.341        0.497   0.333   0.565   0.212
STAN+         0.706   0.599   0.730   0.354        0.483   0.322   0.559   0.208
CDMN+         0.707   0.603   0.740   0.357        0.507   0.341   0.567   0.219
LF+           0.704   0.598   0.728   0.349        0.512   0.346   0.574   0.218
HRE+          0.694   0.592   0.729   0.348        0.515   0.350   0.571   0.223
MN+           0.698   0.589   0.718   0.345        0.488   0.324   0.556   0.204
SFQIH+        0.694   0.592   0.729   0.349        0.503   0.339   0.563   0.217
HACRN         0.715   0.616   0.741   0.358        0.524   0.352   0.577   0.229
RICT (ours)   0.733   0.625   0.748   0.367        0.536   0.375   0.593   0.234


• SFQIH+ extends the SF-QIH-se model (Jain et al., 2018), which concatenates all of the input embeddings for each possible answer option, by employing an LSTM network to encode the video information.

Besides the baseline models, we also compare our method with the HACRN model (Zhao et al., 2018), which addresses the similar task of multi-turn video question answering. It uses an LSTM and an attention mechanism to encode the dialog history and question into a joint question representation, combines this representation with video features through a multi-stream attention network, and applies a multi-step reasoning strategy to enhance its reasoning ability. For all of the above models, a standard LSTM recurrent sentence generator is employed to generate the answer or question.

Table 1 shows the experimental results of answer generation on the TACoS-MultiLevel and YoutubeClip datasets, and Table 2 shows the question generation results on the same datasets. Our method (RICT) outperforms all of the above models on almost all metrics, which demonstrates the effectiveness of our overall network architecture. We also find that the image dialog models perform better than the video QA models on answer generation but worse on question generation on both datasets. This may indicate that, for these two datasets, answer generation depends more on the dialog, while question generation depends more on the video content.

We perform an ablation study to evaluate the impact of each component of our model. The results are listed in Table 3, where RICT(lstm), RICT(wo.ct) and RICT(wo.pi) denote our model with an LSTM sentence generator, with the cross-transformer module replaced by baseline modules, and with the progressive inference replaced by baseline modules, respectively. The results on the YoutubeClip dataset are similar and are not shown in the paper. The fact that the variants perform worse than the full RICT model but still better than baseline models with similar modules demonstrates the effectiveness of each part of our model.


Figure 5: Visualization examples of experimental results. (a) is an answer generation result on the YoutubeClip dataset, and (b) is a question generation result on the TACoS-MultiLevel dataset. Each example lists the dialog-history rounds with their µ-scores, the input, the ground truth, and the outputs of HACRN and of our model.

Table 3: Ablation study results on the TACoS-MultiLevel dataset. The upper part shows answer generation results and the lower part shows question generation results.

Method        BLEU-1  ROUGE   METEOR
RICT(lstm)    0.456   0.508   0.167
RICT(wo.ct)   0.449   0.497   0.159
RICT(wo.pi)   0.433   0.475   0.151
RICT (ours)   0.464   0.527   0.178

RICT(lstm)    0.724   0.745   0.362
RICT(wo.ct)   0.705   0.733   0.352
RICT(wo.pi)   0.711   0.740   0.357
RICT (ours)   0.733   0.748   0.367


We also show some visualization examples of experimental results in Figure 5. In example (a), the progressive inference successfully stops at the second round of dialog, since the current µ-score already exceeds the threshold and the earlier rounds of dialog cannot bring more valuable information. Example (b) shows a similar result for question generation; this time the inference stops at the first round, which still contains the related entity 'cutting board'. Some of the words attended to most strongly by our model are shown in different colors in the figure. In both examples, the answer and question generated by our model are better than those of the compared model and much closer to the ground truth.

5 Conclusion

In this paper, in order to better understand both the dialog history and the video content, we propose a novel progressive inference mechanism for video dialog, which progressively updates the query information until it is sufficient and unambiguous, or until it has reached the beginning of the dialog history. We also design a cross-transformer module to tackle the multi-modal fusion problem, which learns more fine-grained interactions between the visual and textual information. In addition to answer generation, we also consider question generation based on the dialog history and video content under the same framework, which is more challenging but significant for a complete video dialog system. Qualitative and quantitative experimental results on two large-scale video dialog datasets show the effectiveness of our method.


Acknowledgments

This work was supported by the Zhejiang Natural Science Foundation (LR19F020002, LZ17F020001), the National Natural Science Foundation of China (61976185, 61572431), the Fundamental Research Funds for the Central Universities, the Chinese Knowledge Center for Engineering Sciences and Technology, and Microsoft Research Asia.

References

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.

David L. Chen and William B. Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 190–200.

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jose M. F. Moura, Devi Parikh, and Dhruv Batra. 2017a. Visual dialog. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1080–1089.

Abhishek Das, Satwik Kottur, Jose M. F. Moura, Stefan Lee, and Dhruv Batra. 2017b. Learning cooperative visual dialog agents with deep reinforcement learning. In International Conference on Computer Vision, pages 2970–2979.

Harm De Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron C. Courville. 2017. GuessWhat?! Visual object discovery through multi-modal dialogue. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4466–4475.

Jiyang Gao, Runzhou Ge, Kan Chen, and Ram Nevatia. 2018. Motion-appearance co-memory networks for video question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Chiori Hori, Huda Alamri, Jue Wang, Gordon Winchern, Takaaki Hori, Anoop Cherian, Tim K. Marks, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, et al. 2018. End-to-end audio visual scene-aware dialog using multimodal attention-based video features. arXiv preprint arXiv:1806.08409.

Unnat Jain, Svetlana Lazebnik, and Alexander G. Schwing. 2018. Two can play this game: Visual dialog with discriminative question generation and answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5754–5763.

Satwik Kottur, Jose M. F. Moura, Devi Parikh, Dhruv Batra, and Marcus Rohrbach. 2018. Visual coreference resolution in visual dialog using neural module networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 153–169.

Sang-Woo Lee, Yu-Jung Heo, and Byoung-Tak Zhang. 2018. Answerer in questioner's mind: Information theoretic approach to goal-oriented visual dialog. In Advances in Neural Information Processing Systems, pages 2580–2590.

Jiasen Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, and Dhruv Batra. 2017. Best of both worlds: Transferring knowledge from discriminative learning to a generative visual dialog model. In Advances in Neural Information Processing Systems, pages 314–324.

Daniela Massiceti, N. Siddharth, Puneet K. Dokania, and Philip H. S. Torr. 2018. FlipDial: A generative model for two-way visual dialogue. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6097–6105.

Ramakanth Pasunuru and Mohit Bansal. 2018. Game-based video-context dialogue. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Anna Rohrbach, Marcus Rohrbach, Wei Qiu, Annemarie Friedrich, Manfred Pinkal, and Bernt Schiele. 2014. Coherent multi-sentence video description with variable level of detail. In German Conference on Pattern Recognition, pages 184–195.

Paul Hongsuck Seo, Andreas Lehrmann, Bohyung Han, and Leonid Sigal. 2017. Visual reference resolution using attention memory for visual dialog. In Advances in Neural Information Processing Systems, pages 3719–3729.

Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, pages 3776–3784.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, pages 3295–3301.

Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497.


Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Qi Wu, Peng Wang, Chunhua Shen, Ian Reid, and Anton van den Hengel. 2018. Are you talking to me? Reasoned visual dialog generation through adversarial learning. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6106–6115.

Dongfei Yu, Jianlong Fu, Tao Mei, and Yong Rui. 2017. Multi-level attention networks for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 8.

Kuo-Hao Zeng, Tseng-Hung Chen, Ching-Yao Chuang, Yuan-Hong Liao, Juan Carlos Niebles, and Min Sun. 2017. Leveraging video descriptions to learn video question answering. In AAAI, pages 4334–4340.

Zhou Zhao, Xinghua Jiang, Deng Cai, Jun Xiao, Xiaofei He, and Shiliang Pu. 2018. Multi-turn video question answering via multi-stream hierarchical attention context network. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 3690–3696.

Zhou Zhao, Qifan Yang, Deng Cai, Xiaofei He, and Yueting Zhuang. 2017. Video question answering via hierarchical spatio-temporal attention networks. In International Joint Conference on Artificial Intelligence (IJCAI).