Visual QA based on Attributes and External Knowledge

Sarah Radzihovsky ∗†[email protected]

Weiying Goh ∗†[email protected]

Jing Lim ∗†[email protected]

Abstract

Visual Question Answering (VQA) is a challenging task that combines computer vision and natural language processing (NLP) into one system, with the goal of teaching a computer to infer from what it sees and generate textual observations about the image. For this project, we analyze and reimplement a model proposed by Wu et al. [2], which leverages captioning using image attributes and incorporates external knowledge to expand past the successful and popular CNN-RNN approach. In our evaluation, we show that using the image attributes alone as input can produce decent results for simple questions that primarily rely on semantic features from language, but fails to capture enough information to understand semantic features from images. We also show that our proposed model has difficulty generalizing from the image attributes extracted from the CNN, which impedes its ability to generate substantive answers.

1 Introduction

In recent years, the maturation of both computer vision and natural language processing (NLP) and the increasing availability of relevant large-scale datasets have led to a growing body of work in the intersection between the two domains. Vision-to-language problems are unique because, not only do these systems face the challenge of acquiring, processing, and understanding images to discern what is happening in a given scene, they must also learn to read and interact with human language, then complete a task accordingly. Although the two fields were developed separately, together they are enabling significant advances in complex machine learning systems, driven by the growth of textual and visual data.

Among the emerging areas in this intersection is Visual Question Answering (VQA), where, in its most common form, the computer tries to generate the correct answer when presented with an image and a question about the image. The answer can range from a few words to binary yes/no responses and multiple-choice settings. Although it is often said that a picture is worth a thousand words, images are much noisier and higher dimensional than text, and the added complexity required to understand an image pushes VQA far past counterpart areas like textual question answering, which only answers questions about a passage instead of an image. Not only do images lack the structure and grammatical rules that help us understand text, they are a lower form of abstraction than text and therefore much richer in content, making images more difficult to represent as features.

An important distinction that sets VQA apart from several other tasks in computer vision is that it specifically addresses free-form, open-ended questions, where the questions to be answered are not determined until run time. For instance, in segmentation or object detection, a single question is predetermined by the algorithm and applied to each new image during execution to try to determine an answer. Because the question is only given at run time, it must truly be understood in the context of the image.

∗Department of Computer Science, Stanford University, Stanford, CA 94305. †Equal contribution.

32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.

In addition, VQA frequently faces questions that require information not provided in the image. This required information can range from common sense to encyclopedic knowledge about a specific element from the image, which can make VQA a significantly more complex problem.

We address these problems by implementing a model proposed by Wu et al. [2], which uses a CNN to extract an attribute-based representation of the input image, then generates an image caption and queries an external knowledge base with this attribute-based representation, producing three representations of the given image. All three representations are formed before the question is parsed and applied to the image to construct an answer. There are two primary contributions of this model. The first is the use of a fully trainable attribute-based neural net, which helps the model yield better performance: by inserting an explicit representation of image attributes that are meaningful to humans, we enable the computer to more easily interpret what it sees in an image as a scene that a human would see. The second contribution addresses the aforementioned challenge of answering questions that require information not provided in the image itself, by providing a method of incorporating external knowledge about the image into the VQA system.

2 Related Work

Over the past few years, the effort to combine visual and textual information for joint learning has produced a wealth of research on various vision + language tasks [17, 18, 19, 20, 21, 22]. Image and video captioning have become popular tasks where, given an image or video, the goal is to generate a short text that describes what is seen [35, 36, 37, 38, 39]. Visual question answering is a natural and more interactive extension to these captioning tasks.

One of the first studies of the VQA problem was conducted by Malinowski et al. [23], who proposed a method that combines image segmentation and semantic parsing with a Bayesian approach that samples from nearest neighbors in the training set.

More recently, an architecture which combines a CNN and RNN to learn the mapping from images to sentences has become a dominant trend. The popularity of this architecture has been driven by the significant progress achieved using deep neural network models in both computer vision and natural language processing. Both Gao et al. [8] and Malinowski et al. [12] used RNNs to encode the question and output the answer, but with slight variants. While Gao et al. [8] used two networks, a separate encoder and decoder, Malinowski et al. [12] used a single network for both encoding and decoding.

Many modern-day approaches begin with this popular and successful CNN-RNN combination and extend their model in various directions to better tackle different weaknesses within the VQA problem. Wu et al. [1] suggest that modern approaches to VQA can be organized nicely into the following four categories based on the nature of their primary contribution:

Joint embedding approaches map both images and questions to a common embedding space, using CNNs to learn the representation of images and RNNs to learn embeddings of sentences in the same feature space. Representing the two in a common space allows for learning interactions and performing inference over image and question contents together. The output stage is either a classifier that outputs answers from a predefined set, or an RNN that generates naturalistic variable-length responses. Joint embeddings form the basis of most modern approaches to VQA. For instance, Kim et al. [40] propose a multimodal residual learning framework (MRN) to learn the joint representation of images and language, while Fukui et al. [41] perform the joint embedding of visual and text features with a pooling method called multimodal compact bilinear pooling.

Attention mechanisms eliminate irrelevant or noisy information from the prediction stage by assigning importance weights to different regions of the image. The attention weights, indicating "where to look" in the image, are commonly derived from the image and/or the question, and allow the output stage to focus on relevant parts of the image and question when generating responses. Attention mechanisms have improved performance on open-ended questions that only require focusing on one or two aspects of the image and question, rather than compositional reasoning. However, performance on binary questions, which tend to require compositional chains of reasoning, has not improved significantly. Ever since it was shown that encoding visual attention can greatly improve performance in image captioning [24], numerous others have proposed to use spatial attention to help answer visual questions [25], [26], [27], [28], [29], [30].

Composition models address the limitations of the monolithic nature of the RNNs and CNNs used to extract representations of sentences and images. By connecting distinct modules designed for specific capabilities such as memory or reasoning, composition models allow inferences to be tailored to the problem instance. This facilitates both transfer learning and 'deep supervision', as modules can be used and trained in different tasks, and these internal modules will be optimized over an objective task. Wu et al. [1] mention Neural Module Networks and Dynamic Memory Networks as two significant models within this group of methods.

Knowledge base-enhanced approaches query structured knowledge bases to retrieve information external to the common visual datasets. This external information enhances reasoning by providing prior knowledge to contextualize the question or the image. Zhu et al. [4] used a hand-crafted KB primarily containing image-related information such as category labels, attribute labels, affordance labels, and even specifics such as GPS coordinates. While this proved to be fairly effective, hand-crafted knowledge bases can be costly to create and are inevitably very domain specific. At the same time, large and generic databases such as DBpedia [9] struggle with sparse and inconsistent information. [31], [32] showed that using visually-sourced information to query a general database has a lot of promise, but this is less relevant to our approach, as our model targets the comment fields of the database after recognizing that the comments are most reliably informative about an attribute. It is worth noting that evaluation of current knowledge base-enhanced approaches is limited, as there are only a few small datasets with questions that require more external knowledge.

It has also been noted that using attribute-based representations as a high-level representation of an image has shown potential in many computer vision tasks, such as identifying familiar objects to describe unfamiliar objects (Farhadi et al. [33]) or characterizing image regions to globally describe an image (Vogel and Schiele [34]).

With the amalgam of learned experience gained from the survey of modern approaches in [1], Wu et al. propose in [2] their own model that specifically incorporates an attribute-based representation of the input image and queries an external knowledge base to enhance their VQA model's performance. In this paper we reimplement this model and discuss the contribution of each component to the VQA problem.

3 Approach

Our model can be segmented into two phases. In phase one, given an input image I, we compute Vatt, the attribute representation of the image, Vcap, the aggregation of hidden states from caption generation for the image, and Vknow, an external knowledge vector for the image. In phase two, we combine Vatt, Vcap, and Vknow and feed them as input to the VQA LSTM that encodes the question and generates an answer. We follow the implementation details of the VQA model proposed by Wu et al. [2].

3.1 Computing components for input to VQA LSTM

3.1.1 Vatt

We adopt YOLOv3 as our region proposal network to generate attribute probabilities on the image. Given an image, YOLOv3 proposes a set of regions, and runs independent logistic regressions on each region to generate detection probabilities for 80 class labels.

To generate Vatt, [2] retains the top k = 5 label proposals for m = 10 regions for each image, and uses cross-hypothesis max-pooling over the mk + 1 hypotheses. Since YOLOv3 runs multi-label logistic regression rather than softmax classification, we instead retained the top 20 (a hyperparameter) detections ranked by confidence and augmented each with its bounding box and probability. This made Vatt20 a vector of length 20 × (1 class label + 1 probability + 4 bounding box coordinates) = 120.
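As a concrete illustration, the sketch below (in Python, assuming NumPy and a detector wrapper that returns (class_id, confidence, x, y, w, h) tuples, which is not the exact interface of our code) shows how the top-20 detections could be flattened into the 120-dimensional Vatt20 vector:

import numpy as np

# Hypothetical helper: flatten YOLOv3-style detections into V_att20.
# Each detection is assumed to be (class_id, confidence, x, y, w, h);
# the real detector interface may differ.
def build_v_att20(detections, k=20):
    top = sorted(detections, key=lambda d: d[1], reverse=True)[:k]
    rows = [list(d) for d in top]
    rows += [[0.0] * 6] * (k - len(rows))   # zero-pad if fewer than k detections
    return np.asarray(rows, dtype=np.float32).reshape(-1)  # length 20 * 6 = 120

# Example: two toy detections still yield a fixed-length vector.
v_att20 = build_v_att20([(7, 0.91, 0.50, 0.50, 0.30, 0.40),
                         (56, 0.62, 0.20, 0.10, 0.10, 0.20)])
assert v_att20.shape == (120,)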

In our baseline evaluations we found that the VQA LSTM was unable to learn image features from Vatt20, even though it had learned to generate naturalistic language responses. For example, it learned to identify and respond to yes/no queries, but did not learn features from the image context. To retain more information about the image, we extracted the output of the convolutional layer before the smallest detection layer in the CNN, then average-pooled this output to form Vatt_conv. The convolutional layer is 13×13×255; after average pooling, we have a layer of shape 2×2×255 that is reshaped to form a vector of length 1020. We used both Vatt20 and Vatt_conv as inputs to the VQA LSTM.

Figure 1: Proposed model. For clarity: Vatt is in red, Vknow in blue, and Vcap in green. Modified version of a figure taken from [2].
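A minimal sketch of the pooling step described above, assuming PyTorch and using adaptive average pooling as a stand-in for the pooling we describe:

import torch
import torch.nn.functional as F

# Sketch: average-pool the 13x13x255 feature map taken before YOLOv3's
# smallest detection layer down to 2x2x255, then flatten to a 1020-d vector.
feature_map = torch.randn(1, 255, 13, 13)            # dummy activations (batch, C, H, W)
pooled = F.adaptive_avg_pool2d(feature_map, (2, 2))  # (1, 255, 2, 2)
v_att_conv = pooled.flatten(start_dim=1)             # (1, 1020), since 2 * 2 * 255 = 1020
assert v_att_conv.shape == (1, 1020)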

3.1.2 Vknow

To generate Vknow, we take the top 5 predicted attributes from Vatt and use SPARQL to query textual context from DBpedia [9], a structured knowledge database based on Wikipedia. The DBpedia dataset describes approximately 4.58 million entities, of which 4.22 million are classified in a consistent ontology. Wu et al. [2] note that, due to the sparsity of information in DBpedia and similar databases, the "comment" field is most consistently informative, hence we retrieve the "comment" section for each query.

A sample query-comment retrieved pair is as follows:

Query: "Truck"

Retrieved: "A truck (also called a lorry in the United Kingdom, Ireland, South Africa, and India) is a motor vehicle designed to transport cargo. Trucks vary greatly in size, power, and configuration, with the smallest being mechanically similar to and larger than an automobile."

For each of the 5 attributes, we retrieve the comment text for the query term, and combine the 5 paragraphs into a single large paragraph. We then use Doc2Vec with pretrained weights from [15] to generate Vknow, the vector representation of the paragraph, of length 500. Doc2Vec is an unsupervised algorithm that learns fixed-length vector representations from variable-length pieces of text [10].
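The sketch below illustrates this retrieval-and-embedding step, assuming the SPARQLWrapper client for the public DBpedia endpoint and a gensim Doc2Vec model; the resource URI pattern, the attribute list, and the pretrained-model path are illustrative rather than taken from our code:

from SPARQLWrapper import SPARQLWrapper, JSON
from gensim.models.doc2vec import Doc2Vec
from gensim.utils import simple_preprocess

def dbpedia_comment(term):
    # Fetch the English rdfs:comment for a DBpedia resource (illustrative URI form).
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        SELECT ?comment WHERE {
          <http://dbpedia.org/resource/%s> rdfs:comment ?comment .
          FILTER (lang(?comment) = "en")
        }""" % term)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return rows[0]["comment"]["value"] if rows else ""

# Combine the comments for the top-5 attributes and embed the paragraph.
attributes = ["Truck", "Car", "Person", "Dog", "Bicycle"]    # example attribute labels
paragraph = " ".join(dbpedia_comment(a) for a in attributes)
doc2vec = Doc2Vec.load("doc2vec_pretrained.bin")             # hypothetical path to weights from [15]
v_know = doc2vec.infer_vector(simple_preprocess(paragraph))  # 500-d with a 500-d pretrained model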

3.1.3 Vcap

For caption generation, we use a standard LSTM which takes Vatt as input. During training, we embed the training caption as a sequence of words S1, . . . , SL, where each word is represented as a one-hot vector of dimension equal to the size of the word dictionary. The LSTM learns to generate the caption by maximizing the log-likelihood of the caption S given the corresponding Vatt:

\[
\log p(S \mid V_{att}(I)) = \sum_{t=1}^{L} \log p(S_t \mid S_{1:t-1}, V_{att}(I))
\]

Since the COCO train-2014 dataset has 5 captions for each image, on each training iteration we randomly pick 1 of the 5 captions for each image as S. At each time step t, we use beam search, selecting the best 5 sentences as candidates to generate sentences at time t + 1 and keeping only the best 5 results. After training, we use the final hidden state of the caption-LSTM as Vcap.
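A minimal sketch of the caption-LSTM training objective (teacher-forced negative log-likelihood conditioned on Vatt), assuming PyTorch; the dimensions and the linear projection of Vatt into the first LSTM input are illustrative, and beam-search decoding is omitted:

import torch
import torch.nn as nn

class CaptionLSTM(nn.Module):
    # Minimal caption-LSTM sketch: V_att seeds the sequence, then the model is
    # trained with teacher forcing to maximize log p(S | V_att).
    def __init__(self, vocab_size, att_dim=120, embed_size=256, hidden_size=512):
        super().__init__()
        self.att_proj = nn.Linear(att_dim, embed_size)      # projects V_att to the first input
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, v_att, caption_ids):
        # Input sequence: V_att at t = 0, then the ground-truth words S_1..S_{L-1}.
        x0 = self.att_proj(v_att).unsqueeze(1)
        words = self.embed(caption_ids[:, :-1])
        states, _ = self.lstm(torch.cat([x0, words], dim=1))
        logits = self.out(states)                            # predicts S_1..S_L
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), caption_ids.reshape(-1))

# Example usage with dummy data (batch of 2 captions of length 6).
model = CaptionLSTM(vocab_size=1000)
loss = model(torch.randn(2, 120), torch.randint(0, 1000, (2, 6)))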


Figure 2: Caption-LSTM which takes Vatt as input and generates captions. Figure taken from [2].

3.2 VQA-LSTM

We combine Vatt, Vcap, and Vknow as Vin, the input to the VQA LSTM. Note that, for our baseline, we used Vatt as the only VQA-LSTM input.

Given the semantic image information Vin and a question, the VQA-LSTM is trained to maximize the likelihood of generating the ground-truth answers in the training set. We want our model to be able to generate multiple-word answers, so we formulate the answering process as generating the sequence of words by maximizing the following log-likelihood:

\[
\log p(A \mid V_{in}, Q) = \sum_{t=1}^{l} \log p(a_t \mid a_{1:t-1}, V_{in}, Q)
\]

where p(at|a1:t−1, Vin, Q) is the probability of generating at given question Q, image information Vin, and previous words a1:t−1, and A is the answer sequence A = a1, ..., al.

We encode the question Q and Vin with an encoder LSTM. At the start of training, t = 0, we set the LSTM input to x0 = [WeaVatt(I), WecVcap(I), WekVknow(I)], where Wea, Wec, Wek are embedding weights for the respective vectors. This forms hidden state h0. At each following time step t = 1, 2, ..., n, we embed each question word q1, q2, ..., qn and feed it into the encoder, generating hidden states h1, h2, ..., hn. Thus, hn encodes all information about the question Q and the image's semantic information Vin. From time step t = n + 1 onwards, the decoder LSTM computes the probability distribution of generating target answer word aw+1 based on the hidden state hn+w and embedded answer word aw. The word embeddings Wes are shared between the question words and the answer words. The goal during training is to learn the embeddings Wea, Wec, Wek, Wes and all weights in the LSTM through the above log-likelihood maximization.
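The sketch below shows one reading of this encoder-decoder formulation, assuming PyTorch and using a single LSTM for both encoding and decoding; the dimensions are illustrative, and the three embedded image vectors are fed as three initial time steps, which is one possible interpretation of the x0 construction above:

import torch
import torch.nn as nn

class VQALSTM(nn.Module):
    # Sketch of the VQA-LSTM: the three embedded image vectors form the initial
    # input, the question words are encoded, and the answer words are decoded
    # with a shared word embedding W_es. Dimensions are illustrative.
    def __init__(self, vocab_size, att_dim=1140, cap_dim=512, know_dim=500,
                 embed_size=256, hidden_size=512):
        super().__init__()
        self.w_ea = nn.Linear(att_dim, embed_size)          # W_ea
        self.w_ec = nn.Linear(cap_dim, embed_size)          # W_ec
        self.w_ek = nn.Linear(know_dim, embed_size)         # W_ek
        self.w_es = nn.Embedding(vocab_size, embed_size)    # shared word embedding W_es
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, v_att, v_cap, v_know, question_ids, answer_ids):
        # Sequence: embedded image vectors, then question words q_1..q_n,
        # then answer words a_1..a_{l-1} as decoder inputs.
        x0 = torch.stack([self.w_ea(v_att), self.w_ec(v_cap),
                          self.w_ek(v_know)], dim=1)
        q = self.w_es(question_ids)
        a = self.w_es(answer_ids[:, :-1])
        states, _ = self.lstm(torch.cat([x0, q, a], dim=1))
        logits = self.out(states[:, -answer_ids.size(1):])  # predicts a_1..a_l
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), answer_ids.reshape(-1))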

Since the image-question-answer triples in the MS COCO dataset comprise 10 answers for each question, for each iteration we pick any image-question-answer pair with equal likelihood.

4 Experiments

4.1 Data

We use the training and validation dataset from VQA [16], which consists of 120k images from COCO 2014 [7] and 650k human-generated questions, each with 10 different answers from human annotators.

4.2 Training

We embed Vatt, Vcap, Vknow, and the question and answer word dictionaries Q, A, using an embedding size of 256. The learning rate is set to 0.001. The baseline model was trained with Vatt as Vin, and the final model is trained with Vatt, Vknow, and Vcap as Vin.


4.3 Evaluation Methods

4.3.1 VQA Dataset Metric

We use a quantitative metric defined by the VQA dataset,

\[
\mathrm{Acc}(ans) = \min\left\{ \frac{\#\ \text{humans that said}\ ans}{3},\ 1 \right\}
\]

where an accuracy of 100% means that at least 3 of the 10 humans who answered the question gave the same answer.
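A minimal implementation of this metric, with a simple lower-casing and stripping step standing in for the full answer cleaning described in [16]:

def vqa_accuracy(generated, human_answers):
    # Acc(ans) = min(#humans that said ans / 3, 1). Simple normalization only;
    # the full cleaning in [16] is more involved.
    norm = lambda s: s.lower().strip()
    matches = sum(norm(a) == norm(generated) for a in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 2 of 10 annotators agree, giving an accuracy of 2/3.
print(vqa_accuracy("Grass", ["grass", "grass", "hay"] + ["leaves"] * 7))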

We use the guidelines specified in [16] to process and clean the generated answers before evaluation, and use the question-type classification defined by [2] (Object, Number, Color, Where, Why) to break down the performance of the model by question type.

4.3.2 Qualitative Evaluation

Finally, we perform qualitative evaluation by generating answers for randomly selected images and questions, and using human evaluation to assess the performance of the model.

4.4 Baseline Results

            VQAEval by Question Type                 VQAEval by Answer Type        Overall
Model       Object   Number   Color   Where   Why    Yes/No   Number   Others
Baseline     13.89    34.34   29.25    0.23   1.14    62.58    28.34    16.39      35.31

Figure 3: Question: 'Where exactly is this?' Target Answer: 'Living Room' Generated Answer: 'On'

Figure 4: Question: 'What kind of food is this?' Target Answer: 'Chinese, Rice, Asian' Generated Answer: 'Pizza'

Figure 5: Question: 'What is the giraffe eating?' Target Answer: 'Grass' Generated Answer: 'Grass'

Figure 6: Question: 'Can you see a hairdryer?' Target Answer: 'Yes' Generated Answer: 'Yes'


4.5 Towards a VQA LSTM with Vatt20, Vatt_conv, Vknow, Vcap

For our baseline, we used only Vatt20, the top 20 (a hyperparameter) detections ranked by confidence, augmented with bounding boxes and probabilities, as input to the VQA LSTM. While we achieved above-random accuracy on Yes/No questions (62.58), the baseline did not seem to be able to understand semantic features of the image. However, it was able to understand semantic features from language (in Figure 4, it generates a food item, "Pizza", in response to the question "What kind of food is this?", even though the image is clearly not of pizza). We hypothesize that the poor performance of the model in the Object, Number, and Color categories was due to the lack of semantic image features in Vatt20. We also hypothesized that performance on Where and Why questions could be improved by including external knowledge.

Following the successful VQA approach proposed by [2], we tried to use a combined VQA-LSTM input of Vatt20, Vatt_conv, Vcap, and Vknow as described in the Approach section above. With all these components representing the input image, we anticipated higher results on questions requiring context and a better understanding of the semantic features in each image. We also batched our implementation of the captioning and VQA LSTMs to improve computational speed.

4.6 Issues Faced and Error Analysis

We found that caption generation from Vatt_conv did not produce strong results. Instead, the generated captions consisted of a stream of only '8's, which we hypothesized was a result of the model being unable to generalize well to Vatt_conv because the convolutional input was too large (1×1020).

Based on this, we left out the captioning attributes Vcap from the training of our VQA LSTM. Instead, we used Vknow, Vatt_conv, and Vatt20 as input to our VQA LSTM, hoping that the combined attribute-based representations would help our model retain more image information and construct more accurate responses to the input questions. Since each attribute vector is embedded before it is fed into the VQA LSTM, we hypothesized that the model would be able to reduce the weight of the complex Vatt_conv while still learning from it.

Despite this, we found that only <EOS> tokens were produced in the answers. We investigated the model thoroughly to understand why this may have occurred.

The error did not come from our testing function: when we checked the encoders' and decoders' hidden states during testing, the encoders' final hidden states were all different, but the decoders' final hidden states were all the same, causing the output of all <EOS> tokens. This led us to believe that the error was in the training of our model.

Since both our captioning LSTM and VQA LSTM showed similar bugs, and Vknow was not fed into the captioning LSTM, we ruled out errors from Vknow; the bug was therefore likely in Vatt_conv or in our batching.

Hence, we believe that the limited performance of both LSTMs was due either to their inability to generalize to the large input Vatt_conv or to errors in how we parsed the training questions and target answers in our batched model, causing the model to learn to output only '<EOS>' in response to input questions.

5 Conclusion

While the model showed promise, we were unable to investigate it fully because of errors in training and/or input choices. We performed error analysis and hypothesize that the cause lies either in errors in our batching or in the difficulty of generalizing to the input Vatt_conv. Time and computation costs both limited our ability to further explore fixes to the model.

We believe further study of our model could produce valuable information about the type of image information necessary to generalize and generate answers about an image, as well as shed insight on the performance of the captioning and external knowledge inputs separately, so further work in this direction would still be valuable.


Acknowledgments

We would like to thank Suvadip Paul for his mentorship and feedback throughout this project. We would also like to thank Professor Chris Manning for a phenomenal class that equipped us with the understanding and tools necessary to delve into challenging NLP projects.

References

[1] Wu, Q., Teney, D., Wang, P., Shen, C., Dick, A. & den Hengel, A.V. (2016). Visual Question Answering: A Survey of Methods and Datasets. CoRR, abs/1607.05910.

[2] Wu, Q., Teney, D., Wang, P., Shen, C., Dick, A. & den Hengel, A.V. (2016). Image Captioning and Visual Question Answering Based on Attributes and Their Related External Knowledge. CoRR, abs/1603.02814.

[3] Zhu, Y., Groth, O., Bernstein, M. & Fei-Fei, L. (2015). Visual7W: Grounded Question Answering in Images. CoRR, abs/1511.03416.

[4] Zhu, Y., Zhang, C., Re, C. & Fei-Fei, L. (2015). Building a Large-scale Multimodal Knowledge Base for Visual Question Answering. CoRR, abs/1507.05670.

[5] Johnson, J., Karpathy, A. & Fei-Fei, L. (2016). DenseCap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4565–4574.

[6] Malinowski, M. & Fritz, M. (2014). Towards a Visual Turing Challenge.

[7] Lin, T.Y., Maire, M., Belongie, S.J., Bourdev, L.D., Girshick, R.B., Hays, J., Perona, P., Ramanan, D., Dollár, P. & Zitnick, C.L. (2015). Microsoft COCO: Common Objects in Context. CoRR, abs/1405.0312.

[8] Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L. & Xu, W. (2015). Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 2.

[9] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R. & Ives, Z. (2007). DBpedia: A nucleus for a web of open data. Springer.

[10] Le, Q.V. & Mikolov, T. (2014). Distributed representations of sentences and documents. In Proceedings of the 28th International Conference on Machine Learning.

[11] Redmon, J. & Farhadi, A. (2018). YOLOv3: An Incremental Improvement. arXiv.

[12] Malinowski, M., Rohrbach, M. & Fritz, M. (2015). Ask Your Neurons: A Neural-based Approach to Answering Questions about Images. In Proc. IEEE Int. Conf. Computer Vision.

[13] Ma, L., Lu, Z. & Li, H. (2015). Learning to Answer Questions From Image using Convolutional Neural Network. abs/1506.00333.

[14] Yang, Z., He, X., Gao, J., Deng, L. & Smola, A. (2016). Stacked Attention Networks for Image Question Answering. In Proc. IEEE Conf. Computer Vision and Pattern Recognition.

[15] Lau, J.H. & Baldwin, T. (2016). An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. In Proceedings of the 1st Workshop on Representation Learning for NLP, 2016.

[16] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. In ICCV, 2015.

[17] K. Barnard, P. Duygulu, D. Forsyth, N. De Freitas, D. M. Blei, and M. I. Jordan. Matching words and pictures. The Journal of Machine Learning Research, 3:1107–1135, 2003.

[18] C. Kong, D. Lin, M. Bansal, R. Urtasun, and S. Fidler. What are you talking about? Text-to-image coreference. In CVPR, 2014.

[19] H. Pirsiavash, C. Vondrick, and A. Torralba. Inferring the why in images. arXiv preprint arXiv:1406.5472, 2014.

[20] V. Ramanathan, A. Joulin, P. Liang, and L. Fei-Fei. Linking people with "their" names using coreference resolution. In ECCV, 2014.

[21] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2:207–218, 2014.

[22] C. L. Zitnick, D. Parikh, and L. Vanderwende. Learning the visual interpretation of sentences. In ICCV, 2013.

[23] Malinowski, M. and Fritz, M. (2014). "A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input," in Proc. Advances in Neural Inf. Process. Syst., pp. 1682–1690.

[24] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention," in Proc. Int. Conf. Mach. Learn., 2015.

[25] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei, "Visual7W: Grounded Question Answering in Images," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.

[26] H. Xu and K. Saenko, "Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering," arXiv preprint arXiv:1511.05234, 2015.

[27] K. Chen, J. Wang, L.-C. Chen, H. Gao, W. Xu, and R. Nevatia, "ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering," arXiv preprint arXiv:1511.05960, 2015.

[28] A. Jiang, F. Wang, F. Porikli, and Y. Li, "Compositional Memory for Visual Question Answering," arXiv preprint arXiv:1511.05676, 2015.

[29] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, "Deep Compositional Question Answering with Neural Module Networks," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.

[30] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, "Stacked Attention Networks for Image Question Answering," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.

[31] X. Lin and D. Parikh, "Don't Just Listen, Use Your Imagination: Leveraging Visual Common Sense for Non-Visual Tasks," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.

[32] F. Sadeghi, S. K. Kumar Divvala, and A. Farhadi, "VisKE: Visual Knowledge Extraction and Question Answering by Visual Verification of Relation Phrases," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June 2015.

[33] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, "Describing objects by their attributes," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2009.

[34] J. Vogel and B. Schiele, "Semantic modeling of natural scenes for content-based image retrieval," Int. J. Comput. Vision, vol. 72, no. 2, pp. 133–157, 2007.

[35] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, "Every picture tells a story: Generating sentences from images," in Proc. Eur. Conf. Comp. Vis., 2010.

[36] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi, "Composing simple image descriptions using web-scale n-grams," in Proc. Conf. Computational Natural Language Learning, 2011.

[37] B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu, "I2t: Image parsing to text description," Proc. IEEE, vol. 98, no. 8, pp. 1485–1508, 2010.

[38] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, "Babytalk: Understanding and generating simple image descriptions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, pp. 2891–2903, 2013.

[39] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. Platt et al., "From captions to visual concepts and back," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.

[40] J.-H. Kim, S.-W. Lee, D.-H. Kwak, M.-O. Heo, J. Kim, J.-W. Ha, and B.-T. Zhang. Multimodal residual learning for visual QA. arXiv preprint arXiv:1606.01455, 2016.

[41] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
