
What’s in that picture? VQA system
Juanita Ordóñez

Department of Computer Science, Stanford [email protected]

Introduction

Visual Question Answering is a complex task which aims at answering a question about an image. This task requires a model that can analyze the actions within a visual scene and express answers about such a scene in natural language. This project focuses on building a model that answers open-ended questions.

Dataset

• Used the Visual Question Answering (VQA) [1] dataset

• 204,721 images

• 3 questions per image

• 10 candidate answers per question

• Wide variety of image dimensions, RGB and grayscale

Figure 1: VQA dataset image, question, and candidate-answer examples.
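
The dataset's questions and answers ship as JSON. As a minimal, hedged sketch of how the questions and the 10 candidate answers pair up, the snippet below assumes the VQA v1 file names and field names, which are not stated on the poster; check both against the actual download.

# Sketch of loading VQA question/annotation JSON and pairing records by
# question_id. File names and field names are assumptions based on the
# VQA v1 release, not taken from the poster.
import json

with open("OpenEnded_mscoco_train2014_questions.json") as f:
    questions = {q["question_id"]: q for q in json.load(f)["questions"]}

with open("mscoco_train2014_annotations.json") as f:
    annotations = json.load(f)["annotations"]

for ann in annotations[:3]:
    q = questions[ann["question_id"]]
    candidates = [a["answer"] for a in ann["answers"]]  # 10 per question
    print(q["image_id"], q["question"], candidates)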

Feature Extraction

• Images – Used the VGG [2] CNN pre-trained on ImageNet; scaled images to 224x224x3 before feeding them into the network, and extracted features from FC-7.

• Text – Removed all punctuation, converted to lowercase, and built the vocabulary on the training set.

• Answers – Extracted the top 1,000 most frequent answers from the training set; the model predicts a score for each.
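
A minimal sketch of this preprocessing pipeline, assuming TensorFlow/Keras (the poster cites TensorFlow [3]); the function names, the use of Keras' VGG16 (whose "fc2" layer corresponds to FC-7 in the original VGG numbering), and the tokenization details are illustrative assumptions, not the project's exact code.

# Sketch of the preprocessing pipeline; assumes a recent TensorFlow 2.x.
import re
from collections import Counter
import numpy as np
import tensorflow as tf

# Images: VGG16 pre-trained on ImageNet; Keras names the FC-7 layer 'fc2'.
vgg = tf.keras.applications.VGG16(weights="imagenet", include_top=True)
fc7 = tf.keras.Model(vgg.input, vgg.get_layer("fc2").output)

def image_features(path):
    # Scale to 224x224x3 before feeding the network, then take FC-7.
    img = tf.keras.utils.load_img(path, target_size=(224, 224))
    x = tf.keras.utils.img_to_array(img)[np.newaxis]   # (1, 224, 224, 3)
    x = tf.keras.applications.vgg16.preprocess_input(x)
    return fc7.predict(x)[0]                           # 4096-d vector

# Text: remove punctuation, lowercase, build the vocabulary on train only.
def normalize(question):
    return re.sub(r"[^\w\s]", "", question.lower()).split()

def build_vocab(train_questions):
    counts = Counter(w for q in train_questions for w in normalize(q))
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}  # 0 = pad

# Answers: the top-1000 most frequent training answers become the classes.
def answer_classes(train_answers, k=1000):
    return [a for a, _ in Counter(train_answers).most_common(k)]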

Figure 2: Visual representation of the preprocessing step.


Approach

• Encode question information using an LSTM network: each word vector is fed into an LSTM cell, and the last hidden state is concatenated with the VGG features.

• Keep image and question information throughout the MLP; this is done by concatenating each FC layer's output with the question-image context vector.

• Softmax layer over the 1,000 answer classes.
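
A minimal sketch of this LSTM-RMLP architecture, assuming TensorFlow/Keras; the layer sizes, vocabulary size, question length, and depth of the recursive MLP are illustrative assumptions, since the poster does not list its hyper-parameters.

# Sketch of the LSTM-RMLP model; sizes below are assumptions, not the
# poster's actual hyper-parameters.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 10000   # assumption: built from the training questions
MAX_Q_LEN = 22       # assumption: questions padded/truncated to this length
NUM_ANSWERS = 1000   # top-1000 training answers

image_feats = layers.Input(shape=(4096,), name="vgg_fc7")
question = layers.Input(shape=(MAX_Q_LEN,), name="question_ids")

# Each word embedding is fed into an LSTM cell; keep the last hidden state.
q_emb = layers.Embedding(VOCAB_SIZE, 300, mask_zero=True)(question)
q_vec = layers.LSTM(512)(q_emb)

# Question-image context vector: last LSTM state + VGG FC-7 features.
context = layers.Concatenate()([q_vec, image_feats])

# Recursive MLP: concatenate each FC layer's output with the context vector
# so image and question information is kept throughout the network.
h = context
for _ in range(2):   # assumption: two hidden FC layers
    h = layers.Dense(1024, activation="relu")(h)
    h = layers.Concatenate()([h, context])

# Softmax over the 1,000 answer classes.
scores = layers.Dense(NUM_ANSWERS, activation="softmax")(h)

model = tf.keras.Model([image_feats, question], scores)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

Under this sketch, the 50-epoch training runs reported in Table 1 would be a single model.fit call over precomputed FC-7 features and padded question ids.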

Figure 3: High-level visual representation of the model.

Results

The model’s performance was evaluated using the VQA score metric, which measures how well the model’s predicted answer matches the candidate responses collected for each question.
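
For concreteness, the standard VQA accuracy from Antol et al. [1] scores an answer against the 10 human-provided candidates; the sketch below gives the commonly quoted form, leaving out the official evaluation script's answer normalization and averaging details.

# Standard VQA accuracy [1]: an answer gets min(#matching humans / 3, 1).
def vqa_score(predicted, candidates):
    matches = sum(1 for c in candidates if c == predicted)
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators answered "yes", so "yes" earns full credit.
print(vqa_score("yes", ["yes"] * 4 + ["no"] * 6))  # 1.0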

Table 1: Results evaluated on the val-test dataset. Each model was trained for a total of 50 epochs with the same hyper-parameters. We show evaluations for the following models: the MLP baseline, the recursive MLP with bag of words, and LSTM-RMLP. We also show the results of a language-only LSTM-RMLP model in which no image information is used.

Qualitative Results

Figure 4: Qualitative results of model predictions; red indicates the model got the incorrect answer and green indicates the model got the correct answer.

Discussion & Future Work

Our results show that by encoding the question using an LSTM, as we do in the LSTM-RMLP module, our VQA score went up by 3.87%. The language-only model did only around 4% worse in comparison to the full-information LSTM-RMLP. This result is extremely surprising, as it means the model does quite well at answering questions about an image without ever seeing it. For my next steps, I will remove the softmax and generate a response in a way similar to sequence-to-sequence models. I would also like to explore reinforcement learning training techniques. Finally, I want to experiment with training the VGG-16 model end-to-end.


References

[1] Antol, Stanislaw, et al. "VQA: Visual Question Answering." Proceedings of the IEEE International Conference on Computer Vision. 2015.
[2] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
[3] https://www.tensorflow.org/get_started/summaries_and_tensorboard

