visual attention reading - mitweb.mit.edu/zoya/www/visual_attention_reading.pdf · visual attention...

46
Visual attention [with and for] deep neural nets June 16, 2016 Adobe CTL Vision and Learning Reading Group Zoya Bylinskii

Upload: others

Post on 29-Jul-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

Visualattention[withandfor]

deepneuralnets

June16,2016AdobeCTLVisionandLearningReadingGroup

ZoyaBylinskii

Page 2: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

Visualattentionfor deepneuralnets[captioning]Paperdiscussion:“Show,AttendandTell:NeuralImageCaptionGenerationwithVisualAttention”K.Xu,J.Ba,R.Kiros,K.Cho,A.Courville,R.Salakhutdinov,R.Zemel,Y.Bengio[ICML2015]

Page 3: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

? Abirdflyingoverabodyofwater

VisualAttentionforImageCaptioning

Page 4: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforImageCaptioning

? Abirdflyingoverabodyofwater

CNNforimagefeatureextraction

RNNwithattentionoverimagefeaturesforcaptiongeneration

1 2

Page 5: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforImageCaptioning

?

CNNforimagefeatureextraction

Lastconvolutional layerbeforemaxpoolinge.g.VGGnet:14x14x512featuremap->196imagefeaturevectorsofdimension 512

1

Page 6: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforImageCaptioning

?

CNNforimagefeatureextraction

a1 a2 a196…

1

Page 7: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforImageCaptioning

?

CNNforimagefeatureextraction

a1 a2 a196…

1Wanttocompute“weights”α1α2α196

Page 8: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforImageCaptioning

a1 a2 a196…

Equationsonthisslidecourtesyof:http://people.ee.duke.edu/~lcarin/Yunchen9.25.2015.pdf

Wanttocompute“weights”α1α2α196

Page 9: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforImageCaptioning

a1 a2 a196…

Equationsonthisslidecourtesyof:http://people.ee.duke.edu/~lcarin/Yunchen9.25.2015.pdf

Wanttocompute“weights”α1α2α196

Page 10: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforImageCaptioning

Equationsonthisslidecourtesyof:http://people.ee.duke.edu/~lcarin/Yunchen9.25.2015.pdf

Probability thatlocationiisrightplacetofocusforproducing nextword

Relativeimportanceoflocationi forproducingnextword

Page 11: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforImageCaptioning

• LSTMnetworkgeneratesonewordytateverytimestepconditionedonacontextvector,theprevioushiddenstateandthepreviouslygeneratedwords• Thecontextvectorzt isadynamicrepresentationoftherelevantpartoftheimageinputattimet

Page 12: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforImageCaptioning

? Abirdflyingoverabodyofwater

CNNforimagefeatureextraction

RNNwithattentionoverimagefeaturesforcaptiongeneration

1 2

Page 13: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforImageCaptioning

“Show,AttendandTell:NeuralImageCaptionGenerationwithVisualAttention”K.Xu,J.Ba,R.Kiros,K.Cho,A.Courville,R.Salakhutdinov,R.Zemel,Y.Bengio [ICML2015]

Page 14: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforImageCaptioning

“Show,AttendandTell:NeuralImageCaptionGenerationwithVisualAttention”K.Xu,J.Ba,R.Kiros,K.Cho,A.Courville,R.Salakhutdinov,R.Zemel,Y.Bengio [ICML2015]

Page 15: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforImageCaptioning

“Show,AttendandTell:NeuralImageCaptionGenerationwithVisualAttention”K.Xu,J.Ba,R.Kiros,K.Cho,A.Courville,R.Salakhutdinov,R.Zemel,Y.Bengio [ICML2015]

Page 16: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

Visualattentionfor deepneuralnets[questionanswering]Paperdiscussion:“StackedAttentionNetworksforImageQuestionAnswering”,Z.Yang,X.He,J.Gao,L.Deng,A.Smola [arXiv,Jan2016]

Page 17: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforQuestionAnswering

? Answer:Dogs

Question:Whataresittinginthebasketonabicycle?

Page 18: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforQuestionAnswering

? Answer:Dogs

Question:Whataresittinginthebasketonabicycle?

1 2

CNNforimagefeatureextraction

CNNorLSTMforquestion representation

3stackedattentionmodel

Page 19: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

?

CNNforimagefeatureextraction

f1 f2 a196…

1

VisualAttentionforQuestionAnswering

Page 20: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforQuestionAnswering

?

LSTMforquestionrepresentation

2

Question:Whataresittinginthebasketonabicycle?

Page 21: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforQuestionAnswering

?

LSTMforquestionrepresentation

2

Question:Whataresittinginthebasketonabicycle?

Page 22: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforQuestionAnswering

?

CNNforquestionrepresentation

2

Question:Whataresittinginthebasketonabicycle?

Page 23: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

LSTMforquestionrepresentation

CNNforquestionrepresentation

2a

2b

VisualAttentionforQuestionAnswering

Page 24: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

Questionsfordiscussion

• Whenmightwepreferoneoftheselanguagerepresentationsoveranother?• (2a)anLSTMthatbuildsuparepresentationovermultipletimesteps

• Dowekeepenoughinformationfromearlyoninthesentence?Shouldwefavorlatterpartsofasentence?

• (2b)aCNNthatstaticallycombinesword/sentencefeaturesatafewscales• TowhatextentdoestheNusedforN-gramsaffecttheresultingrepresentation?

Page 25: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforQuestionAnswering

CNN f1 f2 f196…

1

Question:Whataresittinginthebasketonabicycle?2

CNNor

LSTMvQ

Page 26: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforQuestionAnswering

v1 v2 v196

Question:Whataresittinginthebasketonabicycle?2

CNNor

LSTMvQ

CNN1 f1 f2 f196…

Page 27: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforQuestionAnswering

?

Question:Whataresittinginthebasketonabicycle?

vQvI

Page 28: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforQuestionAnswering

?

Question:Whataresittinginthebasketonabicycle?

vQvI

Page 29: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

VisualAttentionforQuestionAnswering

?

Question:Whataresittinginthebasketonabicycle?

vQvI

Page 30: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

Questionsfordiscussion

• Whenmightwewanttomodulatethequestion/languagerepresentationovertime,andwhenmightweprefertomodulatethevisualfeaturerepresentation?• Correspondstorecursing onvQ orvI

Page 31: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

Visualattentionwith deepneuralnetsPaperdiscussions:“SALICON:ReducingtheSemanticGapinSaliencyPredictionbyAdaptingDeepNeuralNetworks”,X.Huang,C.Shen,X.Boix,Q.Zhao[CVPR2015]“DeepFix:AFullyConvolutionalNeuralNetworkforpredictingHumanEyeFixations”,S.Kruthiventi,K.Ayush,R.Babu [arXiv Oct2015]“DeepGazeI:BoostingSaliencyPredictionwithFeatureMapsTrainedonImageNet”,M.Kümmerer,L.Theis,M.Bethge [ICLR2015workshop]“PredictingEyeFixationsusingConvolutionalNeuralNetworks”,N.Liu,J.Han,D.Zhang,S.Wen,T.Liu[CVPR2015]

Page 32: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

SaliencyPredictionwithNeuralNets

?

Page 33: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

SaliencyPredictionwithNeuralNets

• Bottom-uppop-out• Semanticobjectsofinterest• Salientnon-objectregions(”abstractconcepts”)• Multi-scale,context-sensitive

• Challenge:verysmalldatasets

Page 34: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

SaliencyPredictionwithNeuralNets

“SALICON:ReducingtheSemanticGapinSaliencyPredictionbyAdaptingDeepNeuralNetworks”,X.Huang,C.ShenX.Boix,Q.Zhao.[ICCV2015]

Page 35: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

SaliencyPredictionwithNeuralNets

Capturemulti-scalefeatures- Imagedown-sampled- SameDNNapplied

“SALICON:ReducingtheSemanticGapinSaliencyPredictionbyAdaptingDeepNeuralNetworks”,X.Huang,C.ShenX.Boix,Q.Zhao.[ICCV2015]

Page 36: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

SaliencyPredictionwithNeuralNets

AlexNet,VGG,orGoogleNet- RemoveallFClayers- Adddepth-1convolutional layerfor

saliencyprediction (aftercombiningresponses frombothscales)

“SALICON:ReducingtheSemanticGapinSaliencyPredictionbyAdaptingDeepNeuralNetworks”,X.Huang,C.ShenX.Boix,Q.Zhao.[ICCV2015]

Page 37: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

SaliencyPredictionwithNeuralNets

- fine-tuning pretrained networks- optimizesaliencyevaluation

metricsdirectly“SALICON:ReducingtheSemanticGapinSaliencyPredictionbyAdaptingDeepNeuralNetworks”,X.Huang,C.ShenX.Boix,Q.Zhao.[ICCV2015]

Page 38: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

SaliencyPredictionwithNeuralNets

“SALICON:ReducingtheSemanticGapinSaliencyPredictionbyAdaptingDeepNeuralNetworks”,X.Huang,C.ShenX.Boix,Q.Zhao.[ICCV2015]

Page 39: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

SaliencyPredictionwithNeuralNets

• First5layersinitializedwithVGG-16weights• Note:aschanneldepthdoubles,spatialdimensionsarehalvedwithstride-2poolinglayers

“DeepFix:AFullyConvolutionalNeuralNetworkforpredictingHumanEyeFixations”,S.Kruthiventi,K.Ayush,R.Babu [arXivOct2015]

Page 40: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

SaliencyPredictionwithNeuralNets

• First5layersinitializedwithVGG-16weights• Note:aschanneldepthdoubles,spatialdimensionsarehalvedwithstride-2poolinglayers

• Holesofsize2introducedinkernelsof5th layertoincreasereceptivefieldwithoutincreasingmemoryfootprint

“DeepFix:AFullyConvolutionalNeuralNetworkforpredictingHumanEyeFixations”,S.Kruthiventi,K.Ayush,R.Babu [arXivOct2015]

Page 41: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

SaliencyPredictionwithNeuralNets

• First5layersinitializedwithVGG-16weights• Note:aschanneldepthdoubles,spatialdimensionsarehalvedwithstride-2poolinglayers

• Holesofsize2introducedinkernelsof5th layertoincreasereceptivefieldwithoutincreasingmemoryfootprint• Twoinception-styleconvolutionalmodulestocapturemulti-scalesemanticstructure• Convolutionallayersin7th layerwithholesofsize6operateonlargereceptivefieldsformoreglobalcontext

“DeepFix:AFullyConvolutionalNeuralNetworkforpredictingHumanEyeFixations”,S.Kruthiventi,K.Ayush,R.Babu [arXivOct2015]

Page 42: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

SaliencyPredictionwithNeuralNets

• First5layersinitializedwithVGG-16weights• Note:aschanneldepthdoubles,spatialdimensionsarehalvedwithstride-2poolinglayers

• Holesofsize2introducedinkernelsof5th layertoincreasereceptivefieldwithoutincreasingmemoryfootprint• Twoinception-styleconvolutionalmodulestocapturemulti-scalesemanticstructure• Convolutionallayersin7th layerwithholesofsize6operateonlargereceptivefieldsformoreglobalcontext• Finallayerup-sampledtodepth-1saliencymap

“DeepFix:AFullyConvolutionalNeuralNetworkforpredictingHumanEyeFixations”,S.Kruthiventi,K.Ayush,R.Babu [arXivOct2015]

Page 43: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

SaliencyPredictionwithNeuralNets

“DeepFix:AFullyConvolutionalNeuralNetworkforpredictingHumanEyeFixations”,S.Kruthiventi,K.Ayush,R.Babu [arXivOct2015]

Introducing location-biasedbehaviorwithoutdrasticallyincreasingnumberofnetworkparameters

Page 44: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

SaliencyPredictionwithNeuralNets

• RemovefinalFClayers• Rescaleresponsesofallotherlayerstolargestsize(->3712filterresponsesperimagelocation)• Eachfilterindividuallynormalizedacrossdataset,thenGaussianblurredwithsomesigma• Saliencymapisweightedcombinationbetweenthesepost-processedfiltersandacenterbias• L1regularizationonweightstoencouragesparsity• Softmax producesfinaloutputmap

“DeepGazeI:BoostingSaliencyPredictionwithFeatureMapsTrainedonImageNet”,M.Kümmerer,L.Theis,M.Bethge [ICLR2015workshop]

Page 45: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

FixationPredictionwithNeuralNets

“PredictingEyeFixationsusingConvolutional NeuralNetworks”,N.Liu,J.Han,D.Zhang,S.Wen,T.Liu[CVPR2015]

Page 46: visual attention reading - MITweb.mit.edu/zoya/www/visual_attention_reading.pdf · Visual attention with deep neural nets Paper discussions: “SALICON: Reducing the Semantic Gap

Questionsfordiscussion

• Candirectlyoptimizingforvisualattention/saliencyleadtobenefitsforothercomputervisionapplicationsor shouldvisualattentionnaturallycomeoutofthespecificapplication?• Canmodelsofvisualattention/saliencyhelpbootstrapindividualtasksorleadtogeneralizationacrosstasks?