Open-ended Visual Question-Answering

[thesis][web][code]

Issey Masuda Mora, Santiago Pascual de la Puente, Xavier Giró i Nieto

Roadmap

● Introduction
● Related Work
● Methodology
● Results
● Conclusions
● Future Work

Introduction

Visual Question-Answering

Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2425-2433).

Predict the answer to a given question about an image

Visual Question-Answering: Types

Real images Abstract scenes

Multi-Choice

Open-ended

Q: Does it appear to be rainy?

Q: What is just under the tree?

A: a ball

Q: How many slices of pizza are there?

A: 1, 2, 3, 4

Q: What is for dessert?

A: cake, ice cream, cheesecake, pie

Example

Question: What is bobbing in the water other than the boats?

Answer: buoys

Motivation

New visual Turing test

Motivation: AI research

● Multidisciplinary tasks
● Models able to perform more complex activities
● Different sub-problems tackled at

Computer Vision

Knowledge Representation and Reasoning

Natural Language Processing

Related Work

Deep Learning

Credit: Google

VQA: Common approach

Visual representation

Textual representation

Merge

Predict answer

Question

What object is flying?

Answer

Kite

Word/sentence embedding + LSTM
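The common approach outlined on this slide (visual representation, textual representation, merge, predict answer) can be sketched in a few lines. This is a minimal illustration, not the model presented in this work: the toy dimensions, the answer list, the random weights standing in for trained parameters, and the choice of concatenation as the merge operation are all assumptions made for the example.

```python
import math
import random

random.seed(0)  # fixed seed so the toy example is repeatable

# Hypothetical toy sizes and answer list, chosen only for illustration.
IMAGE_DIM, QUESTION_DIM, JOINT_DIM = 8, 6, 4
ANSWERS = ["kite", "plane", "bird", "ball"]

def random_matrix(rows, cols):
    """Random weights standing in for trained parameters."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def linear(vector, matrix):
    """Multiply a vector of length `rows` by a `rows x cols` matrix."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

W_image = random_matrix(IMAGE_DIM, JOINT_DIM)
W_question = random_matrix(QUESTION_DIM, JOINT_DIM)
W_answer = random_matrix(2 * JOINT_DIM, len(ANSWERS))

def predict_answer(image_features, question_features):
    # Project each modality into a space of the same size.
    visual = [math.tanh(x) for x in linear(image_features, W_image)]
    textual = [math.tanh(x) for x in linear(question_features, W_question)]
    # Merge step: concatenation (an assumption; other merges are possible).
    merged = visual + textual
    # Predict step: a distribution over the candidate answers.
    probabilities = softmax(linear(merged, W_answer))
    best = max(range(len(ANSWERS)), key=lambda k: probabilities[k])
    return ANSWERS[best], probabilities

# Stand-ins for a CNN image descriptor and an LSTM question embedding.
image_features = [random.random() for _ in range(IMAGE_DIM)]
question_features = [random.random() for _ in range(QUESTION_DIM)]
answer, probabilities = predict_answer(image_features, question_features)
print(answer, [round(p, 3) for p in probabilities])
```

With untrained weights the predicted answer is arbitrary; the point is only the data flow from two feature vectors to one answer distribution.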

Tools: Convolutional Neural Networks (CNN)

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).

AlexNet

Tools: Word and Sentence embeddings

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).

Experiments from: Socher et al. (2013b) and Collobert et al. (2011)

King - Man + Woman = Queen
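The analogy above is plain vector arithmetic followed by a nearest-neighbour search. The sketch below uses hand-made 3-dimensional vectors, invented for this example (real embeddings are learned and have hundreds of dimensions), to show the mechanics.

```python
import math

# Toy vectors invented for illustration; axes loosely mean (royalty, male, female).
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in
              zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = [w for w in EMBEDDINGS if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, EMBEDDINGS[w]))

result = analogy("king", "man", "woman")
print(result)  # queen
```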

Tools: Long Short-Term Memory networks (LSTM)

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
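For reference, one step of an LSTM cell can be written out directly. The sketch below follows the widely used formulation with input, forget, and output gates plus a candidate cell update (the forget gate is a later addition to the 1997 design); the sizes, the random initialisation, and the names `init_gate` and `lstm_step` are assumptions for illustration only.

```python
import math
import random

random.seed(0)  # fixed seed so the toy example is repeatable

INPUT_DIM, HIDDEN_DIM = 3, 2  # hypothetical toy sizes

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_gate(input_dim, hidden_dim):
    """Random weights (W for the input, U for the previous state) and a zero bias."""
    W = [[random.uniform(-0.5, 0.5) for _ in range(input_dim)] for _ in range(hidden_dim)]
    U = [[random.uniform(-0.5, 0.5) for _ in range(hidden_dim)] for _ in range(hidden_dim)]
    b = [0.0] * hidden_dim
    return W, U, b

def affine(W, U, b, x, h):
    """Compute W x + U h + b for one gate."""
    return [sum(w * xi for w, xi in zip(W[k], x))
            + sum(u * hi for u, hi in zip(U[k], h))
            + b[k]
            for k in range(len(b))]

def lstm_step(params, x, h, c):
    """Advance the hidden state h and cell state c by one time step."""
    i = [sigmoid(v) for v in affine(*params["input"], x, h)]        # input gate
    f = [sigmoid(v) for v in affine(*params["forget"], x, h)]       # forget gate
    o = [sigmoid(v) for v in affine(*params["output"], x, h)]       # output gate
    g = [math.tanh(v) for v in affine(*params["candidate"], x, h)]  # candidate values
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new

params = {name: init_gate(INPUT_DIM, HIDDEN_DIM)
          for name in ("input", "forget", "output", "candidate")}

# A toy "question" of three word vectors; the final h acts as a sentence embedding.
sequence = [[0.1, 0.2, 0.3], [0.0, -0.4, 0.5], [0.7, 0.1, -0.2]]
h, c = [0.0] * HIDDEN_DIM, [0.0] * HIDDEN_DIM
for x in sequence:
    h, c = lstm_step(params, x, h, c)
print(h)
```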

Methodology

First steps: Text-based QA

Extending text-based QA for VQA

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Substitute VGG-16 with KCNN

Liu, Z. (2015). Kernelized Deep Convolutional Neural Network for Describing Complex Images. arXiv preprint arXiv:1509.04581.

Sentence embedding and image projection

Question

Answer

Results

VQA Dataset: Real Images, Open-ended questions

Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2425-2433).

1 (image) x 3 (questions) x 10 (answers)

Evaluation

Metric: Script:

● Characters to lowercase
● Remove periods (unless decimal periods)
● Number words to digits
● Remove articles
● Add apostrophe to contractions
● Replace punctuation with space
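The normalization steps listed above, together with the accuracy metric defined by Antol et al. (2015), where a predicted answer scores min(number of humans who gave that answer / 3, 1), can be sketched as follows. This is a simplified illustration and not the official evaluation script: the word lists are small assumed subsets, the steps are applied in a slightly different order, and the function names are hypothetical.

```python
import re

# Small illustrative subsets; the official script uses longer lists.
NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
                "five": "5", "six": "6", "seven": "7", "eight": "8",
                "nine": "9", "ten": "10"}
ARTICLES = {"a", "an", "the"}
CONTRACTIONS = {"dont": "don't", "cant": "can't", "isnt": "isn't"}

def normalize_answer(answer):
    """Apply the normalization steps to a single answer string."""
    text = answer.lower()
    # Remove periods unless they sit between two digits (decimal periods).
    text = re.sub(r"(?<!\d)\.|\.(?!\d)", "", text)
    # Replace remaining punctuation (except apostrophes and periods) with a space.
    text = re.sub(r"[^\w\s'.]", " ", text)
    tokens = []
    for token in text.split():
        if token in ARTICLES:
            continue                                # remove articles
        token = NUMBER_WORDS.get(token, token)      # number words to digits
        token = CONTRACTIONS.get(token, token)      # add apostrophe to contractions
        tokens.append(token)
    return " ".join(tokens)

def vqa_accuracy(predicted, human_answers):
    """min(#humans that gave the predicted answer / 3, 1), after normalization."""
    target = normalize_answer(predicted)
    matches = sum(1 for h in human_answers if normalize_answer(h) == target)
    return min(matches / 3.0, 1.0)

# Ten human answers for one question, as in the dataset layout (toy values).
humans = ["buoys"] * 7 + ["buoy", "buoy", "floats"]
print(normalize_answer("The Two buoys."))   # 2 buoys
print(vqa_accuracy("Buoys.", humans))        # 1.0
print(vqa_accuracy("boat", humans))          # 0.0
```

An answer given by at least three of the ten annotators counts as fully correct, which is why partially agreed answers such as "buoy" above still receive partial credit (2/3).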

VQA Challenge

53.62%

CVPR 2016 VQA Challenge

Real Images Open-ended, test-standard dataset partition

Results in detail

          VALIDATION SET                     TEST SET
Model     Yes/No  Number  Other  Overall     Yes/No  Number  Other  Overall
Model 1   71.82   23.79   27.99  43.87       71.62   28.76   29.32  46.70
Model 3   75.02   28.60   29.30  46.32       -       -       -      -
Model 2   75.62   31.81   28.11  46.36       -       -       -      -
Model 5   78.15   32.79   33.91  50.32       78.15   36.20   35.26  53.03
Model 4   78.73   32.82   35.5   51.34       78.02   35.68   36.54  53.62

Results in context

Accuracy on a scale from 0% to 100%:

● Humans: 83.30%
● UC Berkeley & Sony: 66.47%
● Baseline LSTM&CNN: 54.06%
● Our model: 53.62%
● Baseline Nearest neighbor: 42.85%
● Baseline Prior per question type: 37.47%
● Baseline All yes: 29.88%

Comparison with the baseline

Our model

● Single word answer
● Generate answers

Baseline

● Multi word answers (hardcoded)
● Classify over the 1000 most common answers
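The baseline's classification setup implies a fixed answer vocabulary: only the K most frequent training answers can ever be predicted. The sketch below shows how such a vocabulary might be built and how much of the data it covers. The function names and the toy answer list are assumptions for illustration, and K is set to 3 here instead of 1000.

```python
from collections import Counter

def build_answer_vocabulary(training_answers, top_k=1000):
    """Keep the top_k most frequent answers as the classifier's output classes."""
    counts = Counter(training_answers)
    return [answer for answer, _ in counts.most_common(top_k)]

def coverage(vocabulary, answers):
    """Fraction of answers that the classifier could predict at all."""
    known = set(vocabulary)
    return sum(1 for a in answers if a in known) / len(answers)

# Toy training answers, invented for this example.
training_answers = ["yes"] * 5 + ["no"] * 4 + ["2"] * 2 + ["kite"]
vocabulary = build_answer_vocabulary(training_answers, top_k=3)
covered = coverage(vocabulary, training_answers)
print(vocabulary)  # ['yes', 'no', '2']
print(covered)     # 11 of the 12 answers are covered
```

Any answer outside the vocabulary ("kite" in this toy case) is unreachable for a classifier, which is the limitation a generative decoder avoids.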

Qualitative results: I

Qualitative results: II

Deep Python Project

https://github.com/imatge-upc/vqa-2016-cvprw

Research contribution: Extended abstract

VQA workshop, CVPR 2016

Research contribution: Extended abstract - Poster

… ticket to Las Vegas

Presenting our poster and extended abstract at CVPR 2016, Las Vegas, USA

VQA Challenge statistics: Answering method

Conclusions

Conclusion

Goals accomplished

✓ Present to VQA Challenge, CVPR 2016
✓ First GPI project using text processing techniques
✓ Create a scalable VQA model
✓ Build a modular and reusable software package
✓ Extended abstract accepted to VQA workshop CVPR 2016

Personal overview

● Submission to VQA Challenge
● VQA, hot topic at CVPR 2016
● Model designed to generate answers instead of classifying them
● Question-Answer pair generation proposal

Future Work

Future work

Next steps

● Decoder for multiple word answers
● Character embedding
● Attention mechanisms
● Question-Answer pairs generation

Automatic Question-Answer Pairs Generation

Thank You!

Do you have any questions?

Project resource links

● Thesis: https://imatge.upc.edu/web/sites/default/files/pub/xMasuda-Mora_0.pdf
● Web page: http://imatge-upc.github.io/vqa-2016-cvprw/
● Source code: https://github.com/imatge-upc/vqa-2016-cvprw

Motivation: First steps towards QA Generation

AI System

Question

What is the man doing?

Answer

Surf

VQA: Counterexample

Dynamic Parameter Prediction Network (DPPnet)

Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with dynamic parameter prediction. CVPR 2016

Experiments: Batch Normalization

Losses I

Losses II

Losses III

VQA Challenge statistics: Image modelling

VQA Challenge statistics: Question modelling
