
Page 1: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
by Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio, ICML 2015

Presented by Eun-ji Lee
2015.10.14
Data Mining Research Lab, Sogang University

Page 2: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Contents
1. Introduction
2. Image Caption Generation with Attention Mechanism
   a. LSTM Tutorial
   b. Model Details: Encoder & Decoder
3. Learning Stochastic "Hard" vs Deterministic "Soft" Attention
   a. Stochastic "Hard" Attention
   b. Deterministic "Soft" Attention
   c. Training Procedure
4. Experiments

Page 3: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

1. Introduction: "Scene understanding"

"Rather than compress an entire image into a static representation, attention allows for salient features to dynamically come to the forefront as needed."

Two variants are considered: "hard" attention and "soft" attention.

Page 4: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

2-a. LSTM tutorial (1)

• x_t : the input to the memory cell layer at time t
• W_i, W_f, W_c, W_o, U_i, U_f, U_c, U_o : weight matrices
• b_i, b_f, b_c, b_o : bias vectors

1. (Input gate)                 i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
2. (Candidate state)            \tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
3. (Forget gate)                f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
4. (Memory cells' new state)    C_t = i_t \odot \tilde{C}_t + f_t \odot C_{t-1}
5. (Output gate)                o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
6. (Outputs, or hidden states)  h_t = o_t \odot \tanh(C_t)

http://deeplearning.net/tutorial/lstm.html#lstm

[Figure: LSTM memory cell diagram with input x_t, gates i_t, f_t, o_t, previous cell state C_{t-1}, new state C_t, and hidden state h_t]
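To make the six equations concrete, here is a minimal NumPy sketch of a single LSTM step; the dictionary-of-gates parameter layout and the function names are illustrative assumptions, not the tutorial's actual Theano code.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, U, b):
    # W, U, b each hold one matrix/vector per gate: "i", "f", "c", "o".
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # 1. input gate
    C_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # 2. candidate state
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # 3. forget gate
    C_t = i_t * C_tilde + f_t * C_prev                          # 4. new memory state
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # 5. output gate
    h_t = o_t * np.tanh(C_t)                                    # 6. hidden state
    return h_t, C_t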

Page 5: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

2-a. LSTM tutorial (2)

• x_t : the input to the memory cell layer at time t
• W_*, U_* : weight matrices
• b_* : bias vectors

(Equations 1-6 from the previous slide, stepped through on the cell diagram.)

http://deeplearning.net/tutorial/lstm.html#lstm

[Figure: the same LSTM cell diagram, annotated with equations 1-6]

Page 6: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

2-a. LSTM tutorial (3)

http://deeplearning.net/tutorial/lstm.html#lstm

[Figure: the complete LSTM cell diagram with x_t, i_t, f_t, o_t, C_{t-1}, C_t, and h_t]

Page 7: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

2-b. Model Details: Encoder

The model takes a single raw image and generates a caption encoded as a sequence of 1-of-K encoded words.

• Caption: y = \{y_1, \ldots, y_C\}, \; y_i \in \mathbb{R}^K   (K: vocabulary size, C: caption length)

• Image: a = \{a_1, \ldots, a_L\}, \; a_i \in \mathbb{R}^D   (D: dimension of the representation corresponding to a part of the image)

[Figure: image regions mapped to annotation vectors a_1, ..., a_L and generated caption words y_i]

Page 8: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

2-b. Model Details: Encoder

• Caption: y = \{y_1, \ldots, y_C\}, \; y_i \in \mathbb{R}^K   (K: vocabulary size, C: caption length)

• Image: a = \{a_1, \ldots, a_L\}, \; a_i \in \mathbb{R}^D   (D: dimension of the representation corresponding to a part of the image)

"We extract features from a lower convolutional layer unlike previous work which instead used a fully connected layer."

[Figure: annotation vectors a_1, ..., a_L taken from a convolutional feature map]

Page 9: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

2-b. Model Details: Decoder (LSTM)

• We use an LSTM [1] that produces a caption by generating one word at every time step, conditioned on a context vector, the previous hidden state, and the previously generated words.

[Figure: decoder LSTM cell emitting y_t from the context vector z_t, the previous hidden state h_{t-1}, and the previous word y_{t-1}]

[1] Hochreiter & Schmidhuber, 1997

Page 10: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

2-b. LSTM

• i_t, f_t, c_t, o_t, h_t are the input gate, forget gate, memory cell, output gate, and hidden state of the LSTM:

  \begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix}
  = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}
  T_{D+m+n,\,n} \begin{pmatrix} E y_{t-1} \\ h_{t-1} \\ \hat{z}_t \end{pmatrix}

  c_t = f_t \odot c_{t-1} + i_t \odot g_t
  h_t = o_t \odot \tanh(c_t)

• T_{s,t} : \mathbb{R}^s \to \mathbb{R}^t denotes a learned affine transformation (weight matrices and biases). E \in \mathbb{R}^{m \times K} is an embedding matrix. m: embedding dimension, n: LSTM dimension, \sigma: logistic sigmoid activation.
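As a sketch of how this decoder step could be implemented, assuming a single affine map T over the concatenated inputs (the parameter names T_W, T_b and the slicing layout are illustrative, not the authors' code):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decoder_lstm_step(Ey_prev, h_prev, z_t, c_prev, T_W, T_b, n):
    # Affine map T_{D+m+n, n} over [E y_{t-1}; h_{t-1}; z_t], split into
    # the four pre-activations for the gates i, f, o and the candidate g.
    pre = T_W @ np.concatenate([Ey_prev, h_prev, z_t]) + T_b   # (4n,)
    i_t, f_t, o_t = (sigmoid(pre[k * n:(k + 1) * n]) for k in range(3))
    g_t = np.tanh(pre[3 * n:4 * n])
    c_t = f_t * c_prev + i_t * g_t      # new memory cell state
    h_t = o_t * np.tanh(c_t)            # new hidden state
    return h_t, c_t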

Page 11: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

2-b. Context vector \hat{z}_t

• A dynamic representation of the relevant part of the image input at time t: \hat{z}_t = \phi(\{a_i\}, \{\alpha_i\}).

• The weight \alpha_{ti} of each annotation vector a_i is computed by an attention model f_{att}, for which we use a multilayer perceptron conditioned on the previous hidden state h_{t-1}:

  e_{ti} = f_{att}(a_i, h_{t-1}), \qquad \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}

- (Stochastic attention) \alpha_{ti} : the probability that location i is the right place to focus for producing the next word.

- (Deterministic attention) \alpha_{ti} : the relative importance to give to location i in blending the a_i's together.

a = \{a_1, \ldots, a_L\}, \; a_i \in \mathbb{R}^D
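A minimal sketch of one possible form for the attention MLP f_att; the single hidden layer and the parameter names (W_a, W_h, v) are assumptions for illustration, since the slide does not specify the architecture.

import numpy as np

def attention_weights(a, h_prev, W_a, W_h, v):
    # a: (L, D) annotation vectors; h_prev: (n,) previous hidden state.
    # Score every location with a small MLP, then normalize by softmax.
    scores = np.tanh(a @ W_a.T + W_h @ h_prev) @ v   # e_{ti}, shape (L,)
    scores -= scores.max()                           # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()    # alpha_{ti}, sums to 1
    return alpha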

Page 12: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

2-b. Initialization (LSTM)

• The initial memory state and hidden state of the LSTM are predicted by an average of the annotation vectors fed through two separate MLPs (f_{init,c} and f_{init,h}):

  c_0 = f_{init,c}\Big(\frac{1}{L}\sum_{i=1}^{L} a_i\Big), \qquad h_0 = f_{init,h}\Big(\frac{1}{L}\sum_{i=1}^{L} a_i\Big)
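A sketch of this initialization, under the assumption that each init MLP is a single tanh layer (the slide leaves the depth unspecified):

import numpy as np

def init_lstm_state(a, W_c, b_c, W_h, b_h):
    # a: (L, D) annotation vectors. Average over locations, then feed the
    # mean through two separate MLPs to predict c_0 and h_0.
    a_mean = a.mean(axis=0)              # (1/L) * sum_i a_i
    c0 = np.tanh(W_c @ a_mean + b_c)     # f_{init,c}
    h0 = np.tanh(W_h @ a_mean + b_h)     # f_{init,h}
    return c0, h0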

Page 13: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

2-b. Output word probability

• We use a deep output layer (Pascanu et al., 2014) to compute the output word probability from the LSTM state, the context vector, and the previous word:

  p(y_t \mid a, y_1^{t-1}) \propto \exp\big(L_o (E y_{t-1} + L_h h_t + L_z \hat{z}_t)\big)

• The exponential is applied element-wise to the vector of logits; L_o, L_h, L_z, and E are learned parameters.
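A sketch of the deep output layer as a NumPy softmax; the variable names mirror the formula above:

import numpy as np

def word_probabilities(Ey_prev, h_t, z_t, L_o, L_h, L_z):
    # Combine previous word embedding, hidden state, and context vector,
    # then apply the element-wise exponential and normalize (softmax).
    logits = L_o @ (Ey_prev + L_h @ h_t + L_z @ z_t)   # shape (K,)
    logits -= logits.max()                             # numerical stability
    p = np.exp(logits)
    return p / p.sum()                                 # p(y_t | a, y_1^{t-1})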

Page 14: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

3-a. Stochastic "Hard" Attention

• We represent by s_t the location variable where the model decides to focus attention when generating the t-th word. s_{t,i} is an indicator one-hot variable which is set to 1 if the i-th location (out of L) is the one used to extract visual features.

• Binary vs. one-hot encoding of four locations:

  Binary | One-hot
  00     | 0001
  01     | 0010
  10     | 0100
  11     | 1000

• s_t : the attention location variable, i.e., the part of the image to attend to at time t.
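A minimal sketch of drawing the one-hot location s_t from the Multinoulli distribution parameterized by the attention weights alpha:

import numpy as np

def sample_location(alpha, rng=np.random.default_rng()):
    # alpha: (L,) attention weights summing to 1. Sample the attended
    # location i and return it as a one-hot indicator vector s_t.
    i = rng.choice(len(alpha), p=alpha)
    s_t = np.zeros(len(alpha))
    s_t[i] = 1.0
    return s_t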

Page 15: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

3-a. A new objective function

• A variational lower bound L_s on the marginal log-likelihood of observing the sequence of words y given image features a:

  L_s = \sum_s p(s \mid a) \log p(y \mid s, a) \le \log \sum_s p(s \mid a)\, p(y \mid s, a) = \log p(y \mid a)

  \frac{\partial L_s}{\partial W} = \sum_s p(s \mid a) \left[ \frac{\partial \log p(y \mid s, a)}{\partial W} + \log p(y \mid s, a)\, \frac{\partial \log p(s \mid a)}{\partial W} \right]

Page 16: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

3-a. Approximation of the gradient

• Monte Carlo based sampling approximation of the gradient with respect to the model parameters, using sampled attention sequences \tilde{s}^n = (s_1^n, s_2^n, \ldots):

  \tilde{s}_t^n \sim \text{Multinoulli}_L(\{\alpha_i^n\})

  \frac{\partial L_s}{\partial W} \approx \frac{1}{N} \sum_{n=1}^{N} \left[ \frac{\partial \log p(y \mid \tilde{s}^n, a)}{\partial W} + \log p(y \mid \tilde{s}^n, a)\, \frac{\partial \log p(\tilde{s}^n \mid a)}{\partial W} \right]

Monte Carlo method
• An algorithm that computes the value of a function stochastically using random numbers.
• Used for approximate computation when the target quantity has no closed form or is too complex to evaluate exactly, e.g., estimating \pi.
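As a concrete instance of the Monte Carlo idea in the note above, here is the classic sketch of estimating \pi by sampling random points in the unit square:

import numpy as np

def estimate_pi(n_samples=1_000_000, rng=np.random.default_rng(0)):
    # The fraction of uniform points in the unit square that land inside
    # the quarter circle of radius 1 approximates pi / 4.
    x, y = rng.random(n_samples), rng.random(n_samples)
    return 4.0 * np.mean(x**2 + y**2 <= 1.0)

print(estimate_pi())  # ~3.14, and the estimate improves with more samples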

Page 17: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

3-a. Variance Reduction

• A moving average baseline: upon seeing the k-th mini-batch, the moving average baseline is estimated as an accumulated sum of the previous log-likelihoods with exponential decay:

  b_k = 0.9\, b_{k-1} + 0.1\, \log p(y \mid \tilde{s}_k, a)

• An entropy term on the Multinoulli distribution, H[s], is added.
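A one-line sketch of the exponentially decayed baseline update, with the 0.9 / 0.1 coefficients from the formula above:

def update_baseline(b_prev, log_likelihood, decay=0.9):
    # Moving average of past log-likelihoods; subtracting it from the
    # reward in the sampled gradient reduces the estimator's variance.
    return decay * b_prev + (1.0 - decay) * log_likelihood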

Page 18: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

3-a. Stochastic "Hard" Attention

• In making a hard choice at every point, \phi(\{a_i\}, \{\alpha_i\}) is a function that returns a sampled a_i at every point in time, based upon a Multinoulli distribution parameterized by \alpha.

Page 19: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

3-b. Deterministic "Soft" Attention

• Take the expectation of the context vector directly,

  \mathbb{E}_{p(s_t \mid a)}[\hat{z}_t] = \sum_{i=1}^{L} \alpha_{t,i}\, a_i,

and formulate a deterministic attention model by computing a soft attention weighted annotation vector \phi(\{a_i\}, \{\alpha_i\}) = \sum_{i=1}^{L} \alpha_i a_i.

• This corresponds to feeding in a soft-weighted context into the system.
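The expected context vector is just a weighted sum, which one line of NumPy makes explicit:

import numpy as np

def soft_context(a, alpha):
    # a: (L, D) annotation vectors; alpha: (L,) attention weights.
    # Deterministic "soft" context vector: sum_i alpha_i * a_i.
    return alpha @ a   # shape (D,)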

Page 20: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

3-b. Deterministic "Soft" Attention

• Learning the deterministic attention can be understood as approximately optimizing the marginal likelihood over the attention locations s_t.

• The hidden activation h_t of the LSTM is a linear projection of the stochastic context vector \hat{z}_t followed by a tanh non-linearity.

• To the first-order Taylor approximation, the expected value \mathbb{E}_{p(s_t \mid a)}[h_t] is equal to computing h_t using a single forward propagation with the expected context vector \mathbb{E}_{p(s_t \mid a)}[\hat{z}_t].

Page 21: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

3-b. Deterministic "Soft" Attention

• Let n_t = L_o (E y_{t-1} + L_h h_t + L_z \hat{z}_t) be the logits of the deep output layer, and let n_{t,i} denote n_t computed with the context vector set to a_i.

• Define the normalized weighted geometric mean (NWGM) for the softmax word prediction:

  \text{NWGM}[p(y_t = k \mid a)] = \frac{\prod_i \exp(n_{t,k,i})^{p(s_{t,i}=1 \mid a)}}{\sum_j \prod_i \exp(n_{t,j,i})^{p(s_{t,i}=1 \mid a)}} = \frac{\exp(\mathbb{E}_{p(s_t \mid a)}[n_{t,k}])}{\sum_j \exp(\mathbb{E}_{p(s_t \mid a)}[n_{t,j}])}

Page 22: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

3-b. Deterministic "Soft" Attention

• The NWGM can be computed from the expected logits \mathbb{E}[n_t]. (It shows that the NWGM of a softmax unit is obtained by applying softmax to the expectations of the underlying linear projections.)

• Also, from the results in (Baldi & Sadowski, 2014), \text{NWGM}[p(y_t = k \mid a)] \approx \mathbb{E}[p(y_t = k \mid a)] under softmax activation.

• This means the expectation of the outputs over all possible attention locations induced by the random variable s_t is computed by simple feedforward propagation with the expected context vector \mathbb{E}[\hat{z}_t].

• In other words, the deterministic attention model is an approximation to the marginal likelihood over the attention locations:

  p(X \mid \alpha) = \int_\theta p(X \mid \theta)\, p(\theta \mid \alpha)\, d\theta \qquad \text{(marginal likelihood over } \theta)

Page 23: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

3-b-1. Doubly Stochastic Attention

• By construction, \sum_i \alpha_{ti} = 1, as the weights are the output of a softmax:

  \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}

• In training the deterministic version of our model, we introduce a form of doubly stochastic regularization, where we also encourage \sum_t \alpha_{ti} \approx 1. (This can be interpreted as encouraging the model to pay equal attention to every part of the image over the course of generation.)

• This penalty was important to improve the overall BLEU score, and it leads to richer and more descriptive captions.
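A sketch of the doubly stochastic penalty over a full caption, with lam standing in for the \lambda that a later slide fixes to 1:

import numpy as np

def doubly_stochastic_penalty(alpha, lam=1.0):
    # alpha: (C, L) attention weights, one row per generated word. Rows
    # already sum to 1 (softmax); this term also pushes each column
    # (image location) to sum to ~1 across the C time steps.
    return lam * np.sum((1.0 - alpha.sum(axis=0)) ** 2)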

Page 24: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

3-b-1. Doubly Stochastic Attention

• In addition, the soft attention model predicts a gating scalar \beta from the previous hidden state h_{t-1} at each time step t, s.t.

  \phi(\{a_i\}, \{\alpha_i\}) = \beta \sum_{i=1}^{L} \alpha_i a_i, \qquad \text{where } \beta_t = \sigma(f_\beta(h_{t-1})).

• This gating variable lets the decoder decide whether to put more emphasis on language modeling or on the context at each time step.

• Qualitatively, we observe that the gating variable \beta is larger when the decoder describes an object in the image.
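A sketch of the gated context computation, assuming for illustration that f_beta is a single linear layer (the slide does not specify its form):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_soft_context(a, alpha, h_prev, w_beta, b_beta):
    # beta_t = sigmoid(f_beta(h_{t-1})) scales the soft context vector,
    # shifting emphasis between language modeling and image context.
    beta_t = sigmoid(w_beta @ h_prev + b_beta)
    return beta_t * (alpha @ a)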

Page 25: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

3-b. Soft Attention Model

• The soft attention model is trained end-to-end by minimizing the following penalized negative log-likelihood:

  L_d = -\log p(y \mid a) + \lambda \sum_{i=1}^{L} \Big(1 - \sum_{t=1}^{C} \alpha_{ti}\Big)^2,

  where we simply fixed \lambda to 1.

Page 26: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

3-c. Training

• Both variants of our attention model were trained with SGD using adaptive learning-rate algorithms.

• To create the annotation vectors a_i, we used the Oxford VGGnet pretrained on ImageNet without finetuning. We use the 14 x 14 x 512 feature map of the convolutional layer before max pooling. This means our decoder operates on the flattened 196 x 512 (L x D) encoding.

• (MS COCO) The soft attention model took less than 3 days to train (NVIDIA Titan Black GPU).

• GoogLeNet or Oxford VGG can give a boost in performance over using AlexNet.

Page 27: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

4. Experiments

• Data:

  Dataset   | Images | Reference sentences per image
  Flickr8k  | 8,000  | 5
  Flickr30k | 30,000 | 5
  MS COCO   | 82,738 | more than 5

• Metric: BLEU (Bilingual Evaluation Understudy), an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU.
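For a feel of the metric, here is a small sentence-level BLEU computation using NLTK; the captions are toy data, and smoothing is applied because short sentences often have no 4-gram overlap:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a man is riding a horse on the beach".split(),
    "a person rides a horse along the shore".split(),
]
candidate = "a man rides a horse on the beach".split()

score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")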

Page 28: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

4. Experiments

Page 29: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

4. Experiments

• We are able to significantly improve the state-of-the-art METEOR performance on MS COCO, which we speculate is connected to some of the regularization techniques and our lower-level representation.

• Our approach is much more flexible, since the model can attend to "non-object" salient regions.

Page 30: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Reference

• Papers
  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Kelvin Xu et al., ICML 2015
  BLEU: a Method for Automatic Evaluation of Machine Translation, Papineni et al., ACL 2002: http://www.aclweb.org/anthology/P02-1040.pdf

• Useful websites
  Deep learning library overview and RNN tutorial (in Korean): http://aikorea.org/
  LSTM tutorial: http://deeplearning.net/tutorial/lstm.html#lstm