Show, Attend and Tell: Neural Image Caption Generation...
TRANSCRIPT
Source slides: fidler/slides/2017/CSC2539/Katherine_slides.pdf
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio
Presented by Kathy Ge
Motivation: Attention
• “attention allows for salient features to dynamically come to the forefront as needed”
Image Caption Generation with Attention Mechanism
• Encoder: lower convolutional layer of a CNN
• Decoder: LSTM, which generates a caption one word at a time
• Attention mechanism
– Deterministic “soft” mechanism
– Stochastic “hard” mechanism
• Output: a caption y, a sequence of 1-of-K encoded words
Encoder: CNN
• A lower convolutional layer of a CNN is used, to capture the spatial information encoded in images
• Each image location i yields an annotation vector ai
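The encoder step can be sketched with NumPy stand-ins: a lower conv layer yields a spatial feature map, which is flattened into L annotation vectors. The 14×14 grid with D = 512 channels mirrors the paper's VGGnet setup; the random array is a placeholder for real CNN features.

```python
import numpy as np

# Stand-in for the encoder CNN's lower conv-layer output: a 14x14
# spatial grid with D = 512 channels (random values are a placeholder
# for real features from a pretrained network).
feature_map = np.random.default_rng(0).standard_normal((14, 14, 512))

# Flatten the spatial grid into L = 196 annotation vectors a_i, one per
# image location; the attention mechanism later weights these at each
# decoding step.
L, D = 14 * 14, 512
annotations = feature_map.reshape(L, D)
```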
Decoder: LSTM
• where it, ft, ct, ot, ht are the input gate, forget gate, memory cell, output gate, and hidden state of the LSTM at time t
• ẑt is the context vector, which captures the visual information associated with a particular input location
• E is the embedding matrix
• Define a function φ that computes the context vector ẑt given the annotation vectors and their corresponding weights
• Given the previous word, previous hidden state, and context vector, compute the output word probability
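One decoder step can be sketched as follows, under assumptions not in the slides: toy dimensions, a single fused weight matrix over the concatenated input of previous word embedding, previous hidden state, and context vector, and a simple linear-softmax output (the paper uses separate learned matrices and a deep output layer).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical sizes: m-dim word embedding, n-dim LSTM state,
# D-dim context vector, V-word vocabulary.
m, n, D, V = 8, 16, 32, 50
rng = np.random.default_rng(0)

# One fused weight matrix over [E y_{t-1}; h_{t-1}; z_t] producing all
# four gate pre-activations; random values stand in for learned weights.
W = rng.standard_normal((4 * n, m + n + D)) * 0.1
L_o = rng.standard_normal((V, n)) * 0.1  # output projection to vocabulary

def decoder_step(Ey_prev, h_prev, c_prev, z_t):
    x = np.concatenate([Ey_prev, h_prev, z_t])
    pre = W @ x
    i, f, o = sigmoid(pre[:n]), sigmoid(pre[n:2*n]), sigmoid(pre[2*n:3*n])
    g = np.tanh(pre[3*n:])       # candidate memory
    c = f * c_prev + i * g       # memory cell c_t
    h = o * np.tanh(c)           # hidden state h_t
    p = softmax(L_o @ h)         # simplified p(y_t | previous word, h, z)
    return h, c, p

h, c, p = decoder_step(rng.standard_normal(m), np.zeros(n), np.zeros(n),
                       rng.standard_normal(D))
```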
Learning Stochastic “Hard” vs. Deterministic “Soft” Attention
• Given an annotation vector ai, i = 1, …, L for each location i, the attention mechanism generates a positive weight αi
• The weight of each annotation vector is computed by an attention model fatt, a multilayer perceptron conditioned on the previous hidden state ht-1
Deterministic “Soft” Attention
• Compute the expectation of the context vector directly
• Then compute a soft-attention-weighted annotation vector ẑt = Σi αi ai
• This model is smooth and differentiable, so it can be trained using standard backpropagation
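The soft attention computation can be sketched as below; fatt is approximated by a one-hidden-layer MLP with hypothetical sizes, and the random weight matrices stand in for learned parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
L, D, n = 196, 32, 16

a = rng.standard_normal((L, D))     # annotation vectors a_i
h_prev = rng.standard_normal(n)     # previous hidden state h_{t-1}

# f_att as a one-hidden-layer MLP (hypothetical sizes; random weights
# stand in for learned parameters).
W_a = rng.standard_normal((n, D)) * 0.1
W_h = rng.standard_normal((n, n)) * 0.1
w = rng.standard_normal(n) * 0.1

e = np.tanh(a @ W_a.T + h_prev @ W_h.T) @ w  # unnormalized scores per location
alpha = softmax(e)                           # positive weights summing to 1
z = alpha @ a                                # expected context vector E[z_t]
```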
Doubly Stochastic Attention
• When training the deterministic version of the model, a doubly stochastic regularization can be introduced, encouraging Σt αti ≈ 1 for every location i
• This encourages the model to pay equal attention to every part of the image over the course of caption generation
• In experiments, this improved the overall BLEU score and led to richer, more descriptive captions
• The model is trained by minimizing the negative log likelihood with the penalty λ Σi (1 − Σt αti)²
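The penalty can be sketched directly from the definition; alpha[t, i] holds the attention weight on location i at step t, and lam stands in for the hyperparameter λ (the NLL term is omitted).

```python
import numpy as np

rng = np.random.default_rng(2)
C, L = 10, 196                              # caption length, image locations
# Each decoding step's weights sum to 1 over locations (softmax output);
# a Dirichlet draw is a stand-in for real attention weights.
alpha = rng.dirichlet(np.ones(L), size=C)   # shape (C, L), rows sum to 1

# Doubly stochastic penalty: push sum_t alpha[t, i] toward 1 for every i,
# so each location receives roughly equal total attention.
lam = 1.0                                   # hyperparameter lambda
penalty = lam * np.sum((1.0 - alpha.sum(axis=0)) ** 2)

# Training loss = negative log likelihood + penalty (NLL omitted here).
```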
Stochastic “Hard” Attention
• Let st represent the random variable corresponding to the location where the model decides to focus attention when generating the tth word
• ẑt is then a random variable, and the st are intermediate latent variables
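The sampling step can be sketched as: draw a location st from a multinomial over the attention weights and use that single annotation vector as the context, instead of the soft expectation over all locations.

```python
import numpy as np

rng = np.random.default_rng(3)
L, D = 196, 32
a = rng.standard_normal((L, D))    # annotation vectors (random stand-ins)
alpha = rng.dirichlet(np.ones(L))  # attention weights at step t, sum to 1

# Hard attention: sample a single location s_t ~ Multinoulli(alpha) and
# take z_t = a_{s_t}, rather than the weighted average of all locations.
s_t = rng.choice(L, p=alpha)
z_t = a[s_t]
```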
Stochastic “Hard” Attention
• Define an objective function Ls, a variational lower bound on the marginal log likelihood
• Take the gradient with respect to the model parameters W, approximated by Monte Carlo sampling of st
Stochastic “Hard” Attention
• Reduce estimator variance by using a moving-average baseline b and introducing an entropy term H[s]
• Final learning rule: gradient w.r.t. the model parameters W, where λr, λe are hyperparameters and b is a moving-average baseline updated with exponential decay
• At each point in time, the model samples an annotation vector ai from a multinomial distribution parametrized by the attention weights α
• This learning rule is equivalent to the REINFORCE rule
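The variance-reduced rule can be sketched at the level of its scalar reward term; reinforce_weight and update_baseline are hypothetical helpers, and the entropy term is omitted for brevity.

```python
# REINFORCE-style weight on the score-function gradient: the sampled
# caption's log-likelihood acts as the reward, with a moving-average
# baseline b subtracted to reduce variance. (Hypothetical helpers;
# entropy bonus lambda_e * H[s] omitted.)
def reinforce_weight(log_p_y, baseline, lam_r):
    return lam_r * (log_p_y - baseline)

def update_baseline(b, log_p_y, decay=0.9):
    # Exponentially decayed moving average of previous log-likelihoods.
    return decay * b + (1.0 - decay) * log_p_y

b = 0.0
for log_p in [-3.0, -2.5, -2.0]:   # toy per-sample caption log-likelihoods
    w = reinforce_weight(log_p, b, lam_r=1.0)
    b = update_baseline(b, log_p)
```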
Experiments
• Evaluated performance on Flickr8K, Flickr30K, and MS COCO
• Optimized using RMSProp for Flickr8K and Adam for Flickr30K/MS COCO
• Used Oxford VGGnet pretrained on ImageNet
• Quantitative results measured using BLEU and METEOR metrics
Qualitative Results
Mistakes
“Soft” attention model
A woman is throwing a frisbee in a park.
“Hard” attention model
A man and a woman playing frisbee in a field.
“Soft” attention model
A woman holding a clock in her hand.
“Hard” attention model
A woman is holding a donut in his hand.
Conclusion
• Xu et al. introduce an attention-based model that is able to describe the contents of an image
• The model is able to fix its gaze on salient objects while generating the words of the caption
• They compare a stochastic “hard” attention mechanism, trained by maximizing a variational lower bound, with a deterministic “soft” attention mechanism trained using standard backpropagation
• The learned attention gives interpretability to the generation process; qualitative analysis shows that the alignments of words to image locations correspond well to human intuition
Thanks! Any questions?