Deep Visual-Semantic Alignments for Generating Image Descriptions
Andrej Karpathy, Fei-Fei Li
Overview
Goal
“To generate dense, free-form descriptions of images”
Figure from (Karpathy and Li 2014)
Not to generate Flickr-like descriptions!
Motivation
• It’s hard
• Humans can do it
Challenges
• Big challenge: build a model that reasons jointly about vision and language
• Deep Learning Manifesto: “The model should be free of assumptions about specific hard-coded templates, rules or categories, and instead rely primarily on training data”
• Immediate challenge: how do we learn a model that generates dense, region-level descriptions from training data of sparse, image-level descriptions?
• Training data is extremely noisy
“trampolines are fun way to exercise”
Contributions
1. Infer region-word alignments (R-CNN + BRNN + MRF)
2. Generative model of image descriptions (new RNN architecture)
3. Generate region-level descriptions
Technical Approach
Inferring Alignments
Figure from (Karpathy and Li 2014)
Figure from (Girshick et al 2014) - http://www.cs.berkeley.edu/~rbg/papers/r-cnn-cvpr.pdf
Inferring Word Alignments: Ranking Loss
(Diagram labels: R-CNN, Embedding, Learned)
Figure adapted from (Karpathy and Li 2014)
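The ranking loss over the learned embedding can be sketched as a max-margin objective on a batch of image-sentence scores (a minimal NumPy sketch; the margin value and score normalization here are assumptions, not the paper's exact hyperparameters):

```python
import numpy as np

def ranking_loss(scores, margin=1.0):
    """Max-margin ranking loss over a batch score matrix.

    scores[k, l] holds the alignment score of image k with sentence l;
    the diagonal holds the true (matching) pairs. Every mismatched pair
    is pushed at least `margin` below the matching score, in both the
    rank-sentences and rank-images directions."""
    diag = np.diag(scores)
    cost_s = np.maximum(0.0, margin + scores - diag[:, None])  # rank sentences per image
    cost_i = np.maximum(0.0, margin + scores - diag[None, :])  # rank images per sentence
    np.fill_diagonal(cost_s, 0.0)
    np.fill_diagonal(cost_i, 0.0)
    return float(cost_s.sum() + cost_i.sum())
```

When every matching pair already beats every mismatched pair by the margin, the loss is zero, so gradients only flow from pairs that are still ranked wrongly.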
Inferring Segment Alignments
Smoothing via MRF?
Figure generated by http://cs.stanford.edu/people/karpathy/deepimagesent/rankingdemo/
Smoothing with an MRF
Let $a_j = t$ mean that the $j$-th word $w_j$ is aligned to the $t$-th region $r_t$.
Then, to independently align each word to its best region, minimize

$$E(a_1, \dots, a_N) = \sum_{j=1}^{N} -\mathrm{similarity}(w_j, r_{a_j})$$

But to encourage nearby words to point to the same region, add a penalty $\beta$ when nearby words point to different regions:

$$E(a_1, \dots, a_N) = \sum_{j=1}^{N} -\mathrm{similarity}(w_j, r_{a_j}) \;+\; \beta \sum_{j=1}^{N-1} [a_j \neq a_{j+1}]$$
The argmin can be found with dynamic programming.
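That dynamic program can be sketched directly as a Viterbi-style pass over the word chain (the value of β and the toy similarities below are illustrative, not the paper's):

```python
import numpy as np

def align_words(similarity, beta=1.0):
    """Minimize E(a) = sum_j -similarity[j, a_j] + beta * #{j : a_j != a_{j+1}}
    over the chain of words with a Viterbi-style dynamic program.

    similarity[j, t] is the similarity of word j to region t;
    beta > 0 is the smoothing strength."""
    n_words, n_regions = similarity.shape
    unary = -similarity
    cost = unary[0].copy()                         # best energy ending at each region
    back = np.zeros((n_words, n_regions), dtype=int)
    switch = beta * (1.0 - np.eye(n_regions))      # +beta whenever the region changes
    for j in range(1, n_words):
        trans = cost[None, :] + switch             # trans[new_region, old_region]
        back[j] = trans.argmin(axis=1)
        cost = unary[j] + trans.min(axis=1)
    a = [int(cost.argmin())]                       # backtrack the argmin path
    for j in range(n_words - 1, 0, -1):
        a.append(int(back[j][a[-1]]))
    return a[::-1]
```

With β = 0 each word independently picks its best region; a positive β pulls a word with only a weak preference toward its neighbors' region.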
Generating Descriptions
• RNN architecture
• doesn’t try to embed images and descriptions into a common space
• uses a “context” and the previous word to determine the probability distribution over the next word
• adds image features to the initial context
Figure from (Karpathy and Li 2014)
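A minimal sketch of that generator: a one-hot previous word plus the previous hidden state produce the next-word distribution, and the image features enter only as an additive term at the first step (all sizes and weights here are toy values, not the paper's trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, D = 10, 8, 6                      # toy vocab, hidden, image-feature sizes

Wxh = rng.normal(0, 0.1, (H, V))        # previous word -> hidden
Whh = rng.normal(0, 0.1, (H, H))        # previous hidden -> hidden
Whi = rng.normal(0, 0.1, (H, D))        # image features -> hidden (first step only)
Woh = rng.normal(0, 0.1, (V, H))        # hidden -> next-word logits

def step(word_id, h, image_feat=None):
    """One RNN step: returns (next-word distribution, new hidden state)."""
    x = np.zeros(V)
    x[word_id] = 1.0
    ctx = Whi @ image_feat if image_feat is not None else 0.0
    h = np.maximum(0.0, Wxh @ x + Whh @ h + ctx)       # ReLU hidden state
    logits = Woh @ h
    p = np.exp(logits - logits.max())
    return p / p.sum(), h                              # softmax over next word

h = np.zeros(H)
p, h = step(0, h, image_feat=rng.normal(size=D))       # START token sees the image
word = int(p.argmax())
p, h = step(word, h)                                   # later steps: words only
```

Sampling (or taking the argmax) from `p` at each step and feeding the result back in yields a description, with generation stopped at an END token.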
Region-Level Descriptions
• Train the description-generating RNN on the aligned regions + sentence fragments!
• At test time, run it on many bounding boxes separately
Figure from (Karpathy and Li 2014)
Evaluation
Evaluation
1. Infer region-word alignments (R-CNN + BRNN + MRF), evaluated with image-sentence ranking (Flickr8k, Flickr30k, MSCOCO)
2. Generative model of image descriptions (new RNN architecture), evaluated with BLEU and perplexity (Flickr8k, Flickr30k, MSCOCO)
3. Generate region-level descriptions, evaluated with BLEU on a new region dataset
Alignment Evaluation
• Ranking evaluation as a proxy for alignment quality
  • Image annotation: given an image, find the closest training description
  • Image search: given a description, find the closest image
• Uses the image-sentence alignment score (a sum of word-region alignments) instead of distance in a multimodal space
• Outperforms prior work across the board, but is this a fair comparison?
Figure from (Karpathy and Li 2014)
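That additive score can be sketched as: each word contributes the dot product with its best-matching region, summed over the sentence (a simplified form; the paper computes these dot products in the learned multimodal embedding space):

```python
import numpy as np

def image_sentence_score(word_vecs, region_vecs):
    """Sum over words of the best word-region dot product.

    word_vecs:   (n_words, d) embedded sentence fragments
    region_vecs: (n_regions, d) embedded R-CNN regions"""
    sims = word_vecs @ region_vecs.T        # (n_words, n_regions) dot products
    return float(sims.max(axis=1).sum())    # best region per word, summed
```

Because the score decomposes per word, each word's contribution can be traced back to a single region, which is what makes the ranking evaluation double as an alignment check.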
Description Evaluation
• BLEU scores surprisingly close to human agreement
• Perplexity (how well the model predicts test descriptions) is surprisingly low at 20 bits per sentence
• vs ~8 bits per word for the best language models
Figure from (Karpathy and Li 2014)
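For reference, bits per word and perplexity are two views of the same quantity: perplexity = 2^(bits per word), where bits per word is the average negative log2 probability the model assigns (a small sketch of the conversion, not the paper's evaluation code):

```python
import math

def perplexity(log2_probs):
    """Perplexity from a list of per-word log2 probabilities."""
    bits_per_word = -sum(log2_probs) / len(log2_probs)
    return 2 ** bits_per_word

# a model assigning every word probability 1/16 spends 4 bits per word,
# which corresponds to a perplexity of 16
uniform_16 = [math.log2(1 / 16)] * 5
```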
Region-Level Evaluation
• Used a new dataset of region-level annotations
  • 1469 annotations across 237 images
  • average annotation length: 4.13 words
• The model performs much worse relative to humans
  • the brevity of annotations makes BLEU penalties sharper
  • the model was trained on extremely noisy data
Figure from (Karpathy and Li 2014)
Discussion
Priors (Deep Learning Heresies)
• descriptions are object-centric (hence the use of R-CNN detections)
Figure generated by http://cs.stanford.edu/people/karpathy/deepimagesent/rankingdemo/
Priors (Deep Learning Heresies)
• image-sentence scores are sums of word-region scores
Best image result for “small dog sleeping in living room chair”
Figure generated by http://cs.stanford.edu/people/karpathy/deepimagesent/rankingdemo/
Summary
• New goal: dense descriptions
• New model to infer word-region alignments
  • BRNN to embed words
  • R-CNN to embed images
  • trained using a ranking loss over alignment scores
• New description generation model
  • RNN with additive image context
• New region-level annotation dataset