
What’s in that picture? VQA system

Juanita Ordóñez
Department of Computer Science, Stanford University
ordonez2@stanford.edu

Introduction

Dataset

Feature Extraction

Approach

Qualitative Results

Metric

Discussion & Future Work

References

• Images – Used a VGG [2] CNN pre-trained on ImageNet; scaled images to 224×224×3 before feeding them into the network and extracted features from the FC-7 layer.

• Text – Removed all punctuation, converted to lowercase, and built the vocabulary on the training set.

• Answers – Extracted the top 1,000 most frequent answers from the training set; the model predicts a score for each.
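The text and answer preprocessing steps above can be sketched as follows. This is a minimal illustration, not the author's code; `preprocess_text` and `build_answer_vocab` are hypothetical helper names.

```python
import re
from collections import Counter

def preprocess_text(question):
    # Remove all punctuation and lowercase, as in the Text step.
    return re.sub(r"[^\w\s]", "", question).lower().split()

def build_answer_vocab(train_answers, k=1000):
    # Keep the top-k most frequent training answers; the model
    # predicts one score per entry in this list.
    counts = Counter(train_answers)
    return [ans for ans, _ in counts.most_common(k)]

print(preprocess_text("What's in that picture?"))
# ['whats', 'in', 'that', 'picture']
```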

• Used the Visual Question Answering (VQA) [1] dataset

• 204,721 images
• 3 questions per image
• 10 candidate answers per question
• Wide variety of image dimensions, RGB and grayscale

Figure 2: Visual representation of the preprocessing step.

Visual Question Answering is a complex task which aims at answering a question about an image. It requires a model that can analyze the actions within a visual scene and express answers about that scene in natural language. This project focuses on building a model that answers open-ended questions.

Figure 1: VQA dataset image, question, and candidate answer examples.

Softmax layer over the 1,000 answer classes.
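The classification step can be illustrated with a numerically stable softmax over the 1,000 answer scores (a generic sketch, not the author's implementation):

```python
import numpy as np

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

probs = softmax(np.random.randn(1000))
# probs sums to 1; argmax gives the predicted answer's index
```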

Encode question information using an LSTM network.

Kept image and question information throughout the MLP by concatenating each FC layer's output with the question-image context vector.

Figure 3: High-level visual representation of the model.
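One way to read the "kept throughout the MLP" step is that every hidden layer's input includes the original question-image context vector. A minimal NumPy sketch under that reading; the layer structure and shapes here are illustrative assumptions, not the poster's actual architecture:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def recursive_mlp(context, layers):
    # layers is a list of (W, b) pairs. After the first layer, each
    # layer's input is the previous output concatenated with the
    # original context vector, so the context is never lost.
    W0, b0 = layers[0]
    h = relu(W0 @ context + b0)
    for W, b in layers[1:]:
        x = np.concatenate([h, context])  # re-inject the context
        h = relu(W @ x + b)
    return h

rng = np.random.default_rng(0)
d, hdim = 8, 16
layers = [(rng.standard_normal((hdim, d)), np.zeros(hdim)),
          (rng.standard_normal((hdim, hdim + d)), np.zeros(hdim))]
out = recursive_mlp(rng.standard_normal(d), layers)
# out has shape (16,)
```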

Table 1: Results evaluated on the val-test dataset. Each model was trained for a total of 50 epochs with the same hyper-parameters. We show evaluations for the following models: MLP baseline, recursive MLP with bag of words, and LSTM-RMLP. We also show the results of a language-only LSTM-RMLP model in which no image information is used.

Our results show that encoding the question using an LSTM, as we do in the LSTM-RMLP model, raised our VQA score by 3.87%. The language-only model did only around 4% worse than the full-information LSTM-RMLP. This result is extremely surprising, as it means the model does quite well at answering questions about an image without ever seeing it. For my next steps I will remove the softmax and generate a response in a way similar to sequence-to-sequence models. I would also like to explore reinforcement learning training techniques. Finally, I want to experiment with training the VGG-16 model end-to-end.

Figure 4: Qualitative results of model predictions; red indicates the model gave an incorrect answer and green a correct one.

[1] Antol, Stanislaw, et al. "VQA: Visual question answering." Proceedings of the IEEE International Conference on Computer Vision. 2015.
[2] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
[3] https://www.tensorflow.org/get_started/summaries_and_tensorboard

Each word vector is fed into an LSTM cell, and the last hidden state is concatenated with the VGG features.
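The encoding-and-fusion step can be sketched with a bare-bones LSTM loop in NumPy. The weight layout and names are hypothetical; the poster's actual implementation is not shown:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_encode(word_vectors, Wx, Wh, b):
    # Run a single-layer LSTM over the question's word vectors and
    # return the last hidden state. Stacked gate order: i, f, g, o.
    hdim = Wh.shape[1]
    h, c = np.zeros(hdim), np.zeros(hdim)
    for x in word_vectors:
        z = Wx @ x + Wh @ h + b
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)
        h = o * np.tanh(c)
    return h

# Fuse with the image, as described above (vgg_fc7 is assumed given):
# context = np.concatenate([lstm_encode(...), vgg_fc7])
```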

Results

The model's performance was evaluated using the VQA score metric, which measures whether the model's answer matches the question's human-provided candidate responses.
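Concretely, the VQA accuracy defined in [1] scores an answer by how many of the question's ten human-provided candidates it matches, saturating at three matches:

```python
def vqa_score(predicted, candidate_answers):
    # VQA accuracy [1]: an answer counts as fully correct if at
    # least 3 of the 10 human annotators gave it.
    matches = sum(a == predicted for a in candidate_answers)
    return min(matches / 3.0, 1.0)

vqa_score("cat", ["cat"] * 4 + ["dog"] * 6)  # 1.0
```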
