VISUAL LANGUAGE PROCESSING:
IMAGE/VIDEO CAPTIONING AND VISUAL QA
Yu Huang
Sunnyvale, California
OUTLINE
Part 1: Image/Video Captioning and Description
Part 2: Visual Question Answering
Part 3: Image/Video Generation
PART 1: IMAGE/VIDEO CAPTIONING AND DESCRIPTION
• Computer vision + Natural language processing;
• One must not only correctly recognize what appears in an image, but also
incorporate knowledge of spatial relationships and interactions between objects,
and then generate a description that is relevant and grammatically correct;
• Treated as a “Retrieval” task: retrieving the sentences given the query
image or retrieving the images given the query sentences;
• Assumes specific rules of the language grammar, parses the sentence and divides
it into several parts;
• Learns a probability density over space of multimodal inputs (i.e. sentences,
images);
• It is natural to think of image caption generation as a translation problem;
• Transform a sentence S written in a source language into its translation T in the
target language, by maximizing p(T|S).
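As a brief illustration (the notation below is assumed here rather than taken from the slides), the same idea applied to captioning trains a model to maximize the log-likelihood of a caption S given an image I:

\log p(S \mid I) = \sum_{t=1}^{N} \log p(S_t \mid I, S_1, \ldots, S_{t-1})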
INTRODUCTION
• A language model is needed in addition to visual understanding;
• Visual primitive recognizers combined with a structured formal language,
e.g. AND-OR graphs or logic systems, converted to natural language via
rule-based systems;
• Generation of image descriptions:
• Template-based methods, filling in sentence templates, such as triplets,
based on the results of object detections and spatial relationships;
• Composition-based methods, harness existing image-caption databases
by extracting components of related captions and composing them
together to generate novel descriptions;
• Neural network methods, generate descriptions by sampling from
conditional neural language models (multimodal).
INTRODUCTION
DEEP LEARNING FACE ATTRIBUTES IN THE WILD
• It cascades two CNNs (LNet and ANet) for face localization and attribute prediction;
• Trained in a cascade manner with attribute labels, but pre-trained differently.
• LNet pretrained with general object categories, ANet pre-trained with face identities.
• This not only outperforms the state of the art by a large margin, but also reveals several
valuable facts about learning face representations.
DEEP LEARNING FACE ATTRIBUTES IN THE WILD
EXPLAIN IMAGES WITH MULTIMODAL RECURRENT NEURAL NETWORKS
• A multimodal Recurrent Neural Network (m-RNN) model for generating
novel sentence descriptions to explain the content of images.
• Models the probability distribution of generating a word given previous words
and the image; image descriptions are generated by sampling from this
distribution.
• Two sub-networks: a deep recurrent NN for sentences and a deep
convolutional network for images, interacting with each other in a
multimodal layer.
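A minimal sketch, assuming PyTorch and illustrative dimensions, of how such a multimodal layer might fuse the word embedding, the recurrent state and the image feature at each time step (the projection sizes and the sum-then-nonlinearity fusion are assumptions, not the paper's exact configuration):

import torch
import torch.nn as nn

class MultimodalLayer(nn.Module):
    # Sketch of an m-RNN-style multimodal fusion layer.
    def __init__(self, word_dim=256, rnn_dim=256, img_dim=4096, mm_dim=512, vocab=10000):
        super().__init__()
        self.proj_word = nn.Linear(word_dim, mm_dim)   # word embedding projection
        self.proj_rnn = nn.Linear(rnn_dim, mm_dim)     # recurrent state projection
        self.proj_img = nn.Linear(img_dim, mm_dim)     # CNN image feature projection
        self.out = nn.Linear(mm_dim, vocab)            # next-word logits

    def forward(self, w_t, r_t, img_feat):
        # Fuse the three inputs by summed projections followed by a non-linearity.
        m_t = torch.tanh(self.proj_word(w_t) + self.proj_rnn(r_t) + self.proj_img(img_feat))
        return self.out(m_t)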
EXPLAIN IMAGES WITH MULTIMODAL RECURRENT NEURAL NETWORKS
The simple RNN
m-RNN model
unfolded m-RNN
SHOW AND TELL: A NEURAL IMAGE CAPTION GENERATOR
• A generative model based on a deep recurrent architecture that combines
recent advances in computer vision and machine translation and that can
be used to generate natural sentences describing an image.
• Trained to maximize the likelihood of the target description sentence.
SHOW AND TELL: A NEURAL IMAGE CAPTION GENERATOR
Long Short-Term Memory (LSTM) net
LSTM model combined with a CNN
image embedder and word
embeddings. All LSTMs share the
same parameters.
SHOW, ATTEND AND TELL: A NEURAL IMAGE CAPTION
GENERATOR WITH VISUAL ATTENTION
An LSTM cell; lines with bolded squares imply projections
with a learnt weight vector. Each cell learns how to
weigh its input components (input gate), while learning
how to modulate that contribution to the memory (input
modulator). It also learns weights which erase the
memory cell (forget gate), and weights which control
how this memory should be emitted (output gate).
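For reference, the standard LSTM update that this caption describes is usually written as (notation assumed here):

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \quad f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c), \quad h_t = o_t \odot \tanh(c_t)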
Examples of attending to the correct object
(white indicates the attended regions,
underlines indicate the corresponding word).
VISUAL-SEMANTIC EMBEDDINGS WITH MULTIMODAL NEURAL LANGUAGE MODELS
• An encoder-decoder pipeline learns a multimodal joint embedding space
with images and text and a novel language model for decoding distributed
representations.
• Unifies joint image-text embedding models with multimodal neural language
models;
• The structure-content neural language model disentangles the structure of a
sentence from its content; the encoder allows one to rank images and
sentences while the decoder can generate novel descriptions from scratch.
Encoder: A deep CNN and long short-term memory recurrent network (LSTM) for learning a
joint image-sentence embedding. Decoder: A new neural language model that combines
structure and content vectors for generating words one at a time in sequence.
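Joint image-text embeddings of this kind are typically trained with a pairwise ranking loss; one common form (symbols assumed here: x an image embedding, v its matching sentence, x_k and v_k contrastive examples, s cosine similarity, \alpha a margin) is

\min_{\theta} \sum_{x}\sum_{k} \max\{0,\; \alpha - s(x, v) + s(x, v_k)\} + \sum_{v}\sum_{k} \max\{0,\; \alpha - s(v, x) + s(v, x_k)\}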
VISUAL-SEMANTIC EMBEDDINGS WITH MULTIMODAL NEURAL LANGUAGE MODELS
(a): multiplicative neural language
model.
(b): Structure-content neural
language model (SC-NLM).
(c): The prediction problem of an
SC-NLM.
FROM CAPTIONS TO VISUAL CONCEPTS AND BACK
• Generating image descriptions: visual
detectors and language models are learned
directly from image captions.
• Multiple Instance Learning to train visual
detectors for words in captions, including
many different parts of speech such as
nouns, verbs, and adjectives.
• The word detector outputs serve as
conditional inputs to a maximum-entropy
language model that learns from a set of
image descriptions to capture statistics of
word usage.
• Capture global semantics by re-ranking
caption candidates using sentence-level
features and a deep multimodal similarity
model.
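Under the noisy-OR variant of Multiple Instance Learning commonly used for such word detectors (notation assumed: b_i the set of regions in image i, b_{ij} an individual region), the probability that word w appears in the image is

p(w \mid b_i) = 1 - \prod_{j \in b_i} \big(1 - p(w \mid b_{ij})\big)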
FROM CAPTIONS TO VISUAL CONCEPTS AND BACK
DEEP VISUAL-SEMANTIC ALIGNMENTS FOR GENERATING IMAGE DESCRIPTIONS
• A model that generates free-form
natural language descriptions of
image regions and leverages
datasets of images and their
sentence descriptions to learn about
the inter-modal correspondences
between text and visual data.
• Combination of CNN over image
regions, bidirectional RNN over
sentences, and a structured
objective that aligns the two
modalities through a multimodal
embedding.
• RNN architecture that uses the
inferred alignments to learn to
generate novel descriptions of
image regions.
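In its simplified form, the alignment score between image k (with region vectors v_i) and sentence l (with word vectors s_t) lets every word align to its best region (notation assumed here):

S_{kl} = \sum_{t \in g_l} \max_{i \in g_k} v_i^{\top} s_t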
DEEP VISUAL-SEMANTIC ALIGNMENTS FOR GENERATING IMAGE DESCRIPTIONS
Diagram for evaluating the
image-sentence score
Diagram of the multimodal
Recurrent Neural Network
generative model
LEARNING A RECURRENT VISUAL REPRESENTATION FOR
IMAGE CAPTION GENERATION
Bi-directional mapping between images and their sentence-based
descriptions, learned by a recurrent neural network.
Applies a recurrent visual memory that automatically learns to
remember long-term visual concepts to aid in both sentence
generation and visual feature reconstruction.
LEARNING A RECURRENT VISUAL REPRESENTATION FOR
IMAGE CAPTION GENERATION
(a) shows the full model used for training. (b) and (c) show the parts
of the model needed for generating sentences from visual features
and generating visual features from sentences respectively.
PHRASE-BASED IMAGE CAPTIONING
A model to generate descriptive sentences given a sample image.
Strong focus on the syntax of the descriptions.
Train a bilinear model that learns a metric between an image representation (from a previously trained CNN) and the phrases used to describe it.
Based on caption syntax statistics, a language model can produce relevant descriptions for a given test image using the inferred phrases.
The constrained language
model for generating
description given the predicted
phrases for an image.
PHRASE-BASED IMAGE CAPTIONING
Schematic illustration of the phrase-based model for image descriptions.
FAST NOVEL VISUAL CONCEPT LEARNING FROM
SENTENCE DESCRIPTIONS OF IMAGES
Learning novel visual concepts, and their interactions with other concepts, from a few images with sentence descriptions.
Using linguistic context and visual features, hypothesize the semantic meaning of new words and add them to the word dictionary, so they can be used to describe images which contain these novel concepts.
A transposed weight sharing scheme, which not only improves performance on image captioning, but also makes the model more suitable for the novel concept learning task.
Using a few “quidditch”
images with sentence
descriptions, the
method is able to learn
that “quidditch” is played
by people with a ball.
FAST NOVEL VISUAL CONCEPT LEARNING FROM
SENTENCE DESCRIPTIONS OF IMAGES
(a) The image captioning model. (b) The transposed weight sharing of UD and UM.
FAST NOVEL VISUAL CONCEPT LEARNING FROM
SENTENCE DESCRIPTIONS OF IMAGES
Training novel concepts: only the
sub-matrix UDn of UD that is connected to
the nodes of the new words in the One-Hot
layer and the SoftMax layer is updated during
training for novel concepts.
Organization of the novel concept datasets
LANGUAGE MODELS FOR IMAGE CAPTIONING:
THE QUIRKS AND WHAT WORKS
Method 1 uses a pipelined process where a set of candidate words is generated by a CNN trained on images, and then a maximum entropy (ME) language model is used to arrange these words into a coherent sentence.
Method 2 uses the penultimate activation layer of the CNN as input to a RNN that then generates the caption sequence.
Compare the merits of these different language modeling approaches by using the same state-of-the-art CNN as input.
Examine linguistic irregularities, caption repetition, and data set overlap.
Combine key aspects of the ME and RNN methods.
WHAT VALUE DO EXPLICIT HIGH LEVEL CONCEPTS
HAVE IN VISION TO LANGUAGE PROBLEMS?
The CNN-RNN approach does not explicitly represent high-level semantic concepts; rather, it progresses directly from image features to text.
Incorporate high-level concepts into successful CNN-RNN approach.
Attribute based V2L framework:
The image analysis module learns a
mapping between an image and the
semantic attributes through a CNN.
The language module learns a
mapping from the attributes vector to
a sequence of words using an LSTM.
WHAT VALUE DO EXPLICIT HIGH LEVEL CONCEPTS
HAVE IN VISION TO LANGUAGE PROBLEMS?
Attribute prediction CNN: the
model is initialized from VggNet
pre-trained on ImageNet. The
model is then fine-tuned on the
target multi-label dataset. Given a
test image, a set of proposal
regions are selected and passed
to the shared CNN, and finally the
CNN outputs from different
proposals are aggregated with
max pooling to produce the final
multi-label prediction, which gives
us the high-level image
representation, Vatt(I).
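A minimal sketch, assuming PyTorch and that per-proposal attribute scores have already been produced by the shared CNN, of the cross-proposal max pooling that yields the image-level attribute vector Vatt(I):

import torch

num_proposals, num_attributes = 10, 256                 # illustrative sizes
proposal_scores = torch.sigmoid(torch.randn(num_proposals, num_attributes))

# Max pooling across proposals: one score per attribute for the whole image.
v_att, _ = proposal_scores.max(dim=0)                   # shape: (num_attributes,)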
WHAT VALUE DO EXPLICIT HIGH LEVEL CONCEPTS
HAVE IN VISION TO LANGUAGE PROBLEMS?
Language generators for
different types of tasks:
(a) Image Captioning, (b)
VQA-single word, (c)
VQA-sentence. The red
arrow indicates the attribute
input Vatt(I) while the blue
dashed arrow shows the
baseline method input
CNN(I).
VIDEO CAPTIONING & DESCRIPTION
Tagging videos with metadata;
Clustering captions and videos;
Retrieval and predicting event tags rather than generating descriptive
sentences;
Two stages of description generation: first, identify the semantic content, i.e.
train classifiers to identify candidate objects, actions and scenes;
second, generate a sentence based on a template,
combining visual confidences with a language model in a probabilistic graphical model;
Pro: detaches content generation from surface realization;
Con: a set of relevant objects, actions and scenes must be pre-selected for recognition, and
the templates lose the richness of human language;
Deep learning: create the visual-semantic embedding, learn the spatio-temporal visual feature and also the temporal context model.
YOUTUBE2TEXT: RECOGNIZING AND DESCRIBING ARBITRARY ACTIVITIES
USING SEMANTIC HIERARCHIES AND ZERO-SHOT RECOGNITION
Takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object.
Small portions of the Hierarchies learned over
Subjects, Verbs and Objects.
ALIGNING BOOKS AND MOVIES: TOWARDS STORY-LIKE VISUAL
EXPLANATIONS BY WATCHING MOVIES AND READING BOOKS
reason about visual and dialog (text) alignment between a movie and a book;
exploit a neural sentence embedding that is trained in an unsupervised way
from a large corpus of books, and a video-text neural embedding for
computing similarities between movie clips and sentences in the book;
a simple pairwise CRF that smooths the alignments by encouraging them to
follow a linear timeline, both in the video and book domains.
Sentence neural embedding
DESCRIBING MULTIMEDIA CONTENT USING ATTENTION-
BASED ENCODER–DECODER NETWORKS
Translating a (short) video clip to
its natural language description;
CNN + GRU (RNN) + Attention;
DESCRIBING VIDEOS BY EXPLOITING TEMPORAL STRUCTURE
Incorporate models of the local temporal dynamics of videos and of their global temporal structure;
The local structure is modeled using the temporal feature maps of a 3-D CNN, while a
temporal attention mechanism is used to combine information across the entire video;
Encoder-decoder to generate video description: encoded by CNN, decoded by RNN;
Spatio-temporal CNN;
LSTM;
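Such a temporal attention mechanism typically weights each temporal feature v_i at decoding step t roughly as follows (the exact parameterization below is an assumption):

e_i^{(t)} = w^{\top}\tanh(W_a h_{t-1} + U_a v_i + b_a), \quad \alpha_i^{(t)} = \frac{\exp(e_i^{(t)})}{\sum_j \exp(e_j^{(t)})}, \quad \varphi_t(V) = \sum_i \alpha_i^{(t)} v_i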
TRANSLATING VIDEOS TO NATURAL LANGUAGE
USING DEEP RECURRENT NEURAL NETWORKS
A unified DNN with both
convolutional and
recurrent structure;
Create sentence
descriptions of open-
domain videos with large
vocabularies;
An end-to-end deep
model for video-to-text
generation that
simultaneously learns a
latent “meaning” state,
and a fluent grammatical
model of the associated
language.
Video Description Network
LONG-TERM RECURRENT CONVOLUTIONAL NETWORKS
FOR VISUAL RECOGNITION AND DESCRIPTION
Long-term Recurrent Convolutional
Networks (LRCNs), a class of
architectures leveraging the
strengths of rapid progress in CNNs
for visual recognition problem, and
the growing desire to apply such
models to time-varying inputs and
outputs;
LRCN is directly connected to
modern visual CNN models and
can be jointly trained to
simultaneously learn temporal
dynamics and convolutional
perceptual representations.
LONG-TERM RECURRENT CONVOLUTIONAL NETWORKS
FOR VISUAL RECOGNITION AND DESCRIPTION
Instantiations of the LRCN model for activity recognition, image description,
and video description.
LONG-TERM RECURRENT CONVOLUTIONAL NETWORKS
FOR VISUAL RECOGNITION AND DESCRIPTION
Three variations of the LRCN image captioning architecture to evaluate.
LONG-TERM RECURRENT CONVOLUTIONAL NETWORKS
FOR VISUAL RECOGNITION AND DESCRIPTION
Video description in LRCN. (a) LSTM encoder & decoder with CRF max (b) LSTM decoder
with CRF max (c) LSTM decoder with CRF probabilities.
THE LONG-SHORT STORY OF MOVIE DESCRIPTION
Train the visual classifiers for verbs, objects and places, using
different visual features: DT (dense trajectories), LSDA (large scale
object detector) and PLACES (Places-CNN);
Next, concatenate the scores from a subset of selected robust
classifiers and use them as input to the LSTM.
JOINTLY MODELING EMBEDDING AND TRANSLATION TO
BRIDGE VIDEO AND LANGUAGE
A unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding; it locally maximizes the probability of generating the next word given
previous words and visual content;
it creates a visual-semantic embedding space for enforcing the relationship between the semantics of the entire sentence and the visual content;
Includes three parts: 2-D and/or 3-D deep convolutional neural networks for learning a
powerful video representation;
a deep RNN for generating sentences;
a joint embedding model for exploring the relationships between visual content and sentence semantics.
JOINTLY MODELING EMBEDDING AND TRANSLATION TO
BRIDGE VIDEO AND LANGUAGE
LSTM-E framework with a language generating LSTM and a visual-
semantic embedding model.
SEQUENCE TO SEQUENCE – VIDEO TO TEXT
End-to-end sequence-to-sequence model to generate captions for videos;
Exploit recurrent neural networks, specifically LSTMs, which have demonstrated state-of-the-art performance in image caption generation;
The LSTM is trained on video-sentence pairs and learns to associate a sequence of video frames with a sequence of words in order to generate a description of the event in the video clip;
SEQUENCE TO SEQUENCE – VIDEO TO TEXT
A stack of two LSTMs that learn a representation of a sequence of frames in order to
decode it into a sentence that describes the event in the video. The top LSTM layer
models visual feature inputs. The second LSTM layer models language given the text
input and the hidden representation of the video sequence.
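A minimal sketch of this two-layer encode-then-decode scheme, assuming PyTorch, pre-extracted frame features, and illustrative sizes (the padding convention and dimensions are assumptions, not the exact published configuration):

import torch
import torch.nn as nn

class S2VTSketch(nn.Module):
    def __init__(self, feat_dim=4096, hid=512, vocab=5000):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hid)
        self.embed = nn.Embedding(vocab, hid)
        self.lstm_video = nn.LSTM(hid, hid, batch_first=True)     # top layer: frame inputs
        self.lstm_lang = nn.LSTM(2 * hid, hid, batch_first=True)  # second layer: language
        self.out = nn.Linear(hid, vocab)

    def forward(self, frames, captions):
        # frames: (B, T_v, feat_dim); captions: (B, T_w) word ids
        B, T_v, feat_dim = frames.shape
        T_w = captions.shape[1]
        pad_feat = torch.zeros(B, T_w, feat_dim, device=frames.device)
        pad_word = torch.zeros(B, T_v, self.embed.embedding_dim, device=frames.device)

        # Encoding steps see real frames and padded words; decoding steps see
        # padded frames and real words, conditioned on the video representation.
        top_out, _ = self.lstm_video(self.feat_proj(torch.cat([frames, pad_feat], dim=1)))
        words = torch.cat([pad_word, self.embed(captions)], dim=1)
        lang_out, _ = self.lstm_lang(torch.cat([top_out, words], dim=2))
        return self.out(lang_out[:, T_v:, :])                     # word logits for decoding steps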
TEMPORAL TESSELLATION: A UNIFIED
APPROACH FOR VIDEO ANALYSIS
A general approach to video understanding, inspired by semantic transfer that is used for 2D image analysis.
A video is a 1D sequence of clips, each one associated with its own semantics.
The semantics – natural language captions – depends on the task at hand.
A test video is processed by forming correspondences btw its clips and the clips of reference videos with known semantics, following which, reference semantics can be transferred to the test video.
Matching methods, both designed to ensure that (a) reference clips appear similar to test clips and (b), taken together, the semantics of the selected reference clips is consistent and maintains temporal coherence.
TEMPORAL TESSELLATION: A UNIFIED
APPROACH FOR VIDEO ANALYSIS
Tessellation for temporal coherence. Given a query video, seek reference
video clips with similar semantics. Tessellation ensures that the semantics
assigned to the test clip are not only the most relevant (the five options for
each clip) but also preserve temporal coherence.
TEMPORAL TESSELLATION: A UNIFIED
APPROACH FOR VIDEO ANALYSIS
Two non-local tessellations. Left: Tessellation by restricted Viterbi. For a
query video, find visually similar videos and select the clips that preserve
temporal coherence using the Viterbi method. Right: Tessellation by predicting
the dynamics of semantics. Given a query video and a previous clip selection,
use an LSTM to predict the most accurate semantics for the next clip.
PART 2: VISUAL QUESTION ANSWERING
INTRODUCTION
• NLP, knowledge representation and visual image understanding;
• Answer natural language questions on real world visual images;
• Interaction between humans and computers;
• Task in QA: Given the question, learn the relevant visual and text
representation to infer the answer;
• Feature extraction from visual images: CNN;
• Question encoding in NLP: LSTM or CNN;
• Answer generation by the learned model;
• Attention or memory network.
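A minimal sketch of this common pipeline, assuming PyTorch, a pre-extracted CNN image feature, and answer prediction posed as classification over a fixed answer set (the dimensions and the element-wise fusion are assumptions):

import torch
import torch.nn as nn

class VQABaseline(nn.Module):
    def __init__(self, img_dim=4096, vocab=10000, emb=300, hid=512, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hid, batch_first=True)    # question encoder
        self.img_proj = nn.Linear(img_dim, hid)             # project CNN feature
        self.classifier = nn.Linear(hid, num_answers)       # answer scores

    def forward(self, img_feat, question_ids):
        _, (h, _) = self.lstm(self.embed(question_ids))     # h: (1, B, hid)
        q = h.squeeze(0)
        fused = torch.tanh(self.img_proj(img_feat)) * q     # element-wise fusion
        return self.classifier(fused)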
VISUAL TURING TEST
• An operator-assisted device that produces a stochastic sequence of
binary questions from a given test image;
• VQA is a good task for visual Turing test;
• DAQUAR: A dataset for Visual Turing Challenge;
• It contains 1088 different nouns in the questions, 803 in the answers, and
1586 altogether (573 categories);
• It includes questions that can be reliably answered using common-sense
knowledge (reaching about 4 million items to account for different interpretations of
the external world), with questions of substantial length (10.5 words on
average with variance 5.5; the longest question has 30 words);
• The question answering task is also about understanding hidden
intentions of the questioner with grounding as a sub-goal to solve.
CNN FOR VISUAL QA
• The CNN learns not only the image representation and the compositional model for the question, but also the intermodal interaction between the image and the question, for answer generation;
• an image CNN to extract the image representation;
• one sentence CNN to encode the question;
• one multimodal convolution layer to fuse the multimodal input of the image and question to obtain the joint representation for the classification in the space of candidate answer words.
• Test on DAQUAR and COCO-QA datasets;
CNN FOR VISUAL QA
MEMORY NETWORK
• Reason with inference components combined with a long-term
memory component;
• The long-term memory can be read and written to, with the goal of
using it for prediction (as a dynamic knowledge base);
• A memory m (an array of objects indexed by mi) and four (potentially
learned) components I (input feature map), G (generalization), O (output
feature map) and R (response);
• Given input x, the flow of the model is:
1. Convert x to an internal feature representation I(x).
2. Update memories mi given the new input: mi = G(mi , I(x),m).
3. Compute output features o given the new input and the memory: o = O(I(x),m).
4. Finally, decode output features o to give the final response: r = R(o).
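A schematic Python sketch of this flow, treating I, G, O and R as given callables (illustrative scaffolding rather than a trained model):

def memory_network_step(x, memory, I, G, O, R):
    u = I(x)                                          # 1. internal feature representation
    memory = [G(m_i, u, memory) for m_i in memory]    # 2. update each memory slot
    o = O(u, memory)                                  # 3. output features from input + memory
    return R(o), memory                               # 4. decode output features to a response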
VISUAL ATTENTION IN RNN
• Extract info. from an image or video by adaptively selecting a
sequence of regions or locations and only processing the selected
regions at high resolution;
• RAM (Recurrent Attention Model): Translation invariance built-in;
• Computation cost is independent of the input image size;
• Can be trained using reinforcement learning methods (task specific);
• It processes inputs sequentially, attending to different locations
within the images (or video frames) one at a time, and incrementally
combines information from these fixations to build up a dynamic internal
representation of the scene or environment;
VISUAL ATTENTION IN RNN
A) Glimpse Sensor: given the coordinates of a glimpse and an image, extracts a representation that
contains multiple patches. B) Glimpse Network: given the location and image, uses the
glimpse sensor to extract a representation. The representation and the glimpse location are mapped into a
hidden space using independent linear layers. The glimpse network defines a trainable
bandwidth-limited sensor for the attention network, producing the glimpse representation. C) RNN
Model: takes the glimpse representation as input and, combining it with the internal representation at the
previous time step, produces the new internal state. The location and action networks use the
internal state to produce the next location to attend to and the action/classification, respectively.
This basic RNN iteration is repeated for a variable number of steps.
VQA: COMBINATION OF NLP AND CV
• Visual questions selectively target
different areas of an image, including
background details and underlying context.
• A VQA system typically needs a more
detailed understanding of the image
and complex reasoning than a system
producing generic image captions.
• VQA is amenable to automatic
evaluation, since many open-ended
answers contain only a few words or a
closed set of answers that can be
provided in a multiple-choice format;
• Benchmark model: MLP + LSTM.
Examples of free-form, open-ended
questions via Amazon Mechanical Turk
EXPLORING MODELS AND DATA FOR IMAGE
QUESTION ANSWERING
Image-based question answering (QA) with new
models and datasets.
Use neural networks and visual-semantic
embeddings, without intermediate stages (object
detection and image segmentation), to predict
answers to simple questions about images.
A question generation algorithm that converts
image descriptions, which are widely available,
into QA form.
EXPLORING MODELS AND DATA FOR IMAGE
QUESTION ANSWERING
VIS+LSTM Model
EXPLORING MODELS AND DATA FOR IMAGE
QUESTION ANSWERING
MULTILINGUAL IMAGE QUESTION ANSWERING
The mQA model can answer questions about the content of an image.
The answer can be a sentence, a phrase or a single word.
Four components: a LSTM to extract the question representation, a CNN to extract the visual representation, an LSTM for storing the linguistic context in an answer, and a fusing component to combine the information from the first three components and generate the answer.
A Freestyle Multilingual Image Question Answering (FM-IQA) dataset to train and evaluate the mQA model. It contains over 150,000 images and 310,000 freestyle Chinese
question-answer pairs and their English translations.
The quality of the generated answers of the mQA model on this dataset is evaluated by human judges through a Turing Test.
http://idl.baidu.com/FM-IQA.html.
MULTILINGUAL IMAGE QUESTION ANSWERING
mQA model architecture. Input an image and a question about the image (e.g. “What is
the cat doing?”) to the model. The model is trained to generate the answer to the question
(e.g. “Sitting on the umbrella”). The weight matrix in the word embedding layers of the two
LSTMs (one for the question and one for the answer) is shared. In addition, this weight
matrix is also shared, in a transposed manner, with the weight matrix in the Softmax layer.
MULTILINGUAL IMAGE QUESTION ANSWERING
Sample answers to the visual question generated by our model on the newly
proposed Freestyle Multilingual Image Question Answering (FM-IQA) dataset.
IMAGE QUESTION ANSWERING USING CNN
WITH DYNAMIC PARAMETER PREDICTION
Learning a CNN with a dynamic parameter layer whose weights are determined adaptively based on questions.
For the adaptive parameter prediction, employ a separate parameter prediction network, which consists of a GRU taking a question as its input and a fully connected layer generating a set of candidate weights as its output.
Incorporate a hashing technique, where the candidate weights given by the parameter prediction network are selected using a predefined hash function to determine individual weights in the dynamic parameter layer.
The proposed network—the joint network with the CNN for ImageQA and the parameter prediction network—is trained end-to-end through backpropagation, where its weights are initialized using a pre-trained CNN and GRU.
IMAGE QUESTION ANSWERING USING CNN
WITH DYNAMIC PARAMETER PREDICTION
Overall architecture of Dynamic Parameter Prediction network (DPPnet),
composed of classification network and parameter prediction network. The
weights in the dynamic parameter layer are mapped by a hashing trick from
the candidate weights from the parameter prediction network.
A NEURAL-BASED METHOD FOR VISUAL QA
• Set as a Visual Turing Test;
• An end-to-end formulation of this problem in which all parts are trained jointly;
• All CNN models are first pre-trained on the ImageNet dataset, and then the last
layer is fine-tuned together with the full training of the LSTM network.
COMBINE NN AND VISUAL SEMANTIC EMBEDDINGS
• Without intermediate stages such as
object detection and image
segmentation;
• Builds directly on top of the LSTM
sentence model and is called the
“VIS+LSTM” model.
• Idea: treat the image as a word,
following caption generation work;
• The LSTM(s) outputs are fed into
a softmax layer at the last time
step to generate answers.
Use the last hidden layer of the 19-layer
Oxford VGG Conv Net trained on the ImageNet
2014 Challenge as visual embeddings.
STACKED ATTENTION NETWORKS FOR VISUAL QA
• Stacked attention networks (SANs);
• Multi-stage reasoning: a multiple-layer SAN queries an
image multiple times to infer the answer progressively;
• The semantic representation of a question is used as a query to search for
the regions in an image that are related to the answer;
• (1) the image model, which uses a CNN to extract high level image
representations, e.g. one vector for each region of the image;
• (2) the question model, which uses a CNN or a LSTM to extract a
semantic vector of the question;
• (3) the stacked attention model, which locates, via multi-step
reasoning, the image regions that are relevant to the question for
answer prediction.
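One attention layer can be summarized roughly as follows (notation assumed: v_I the matrix of region vectors, v_Q the question vector, \oplus addition of a vector to each column of a matrix); the refined query u then drives the next attention layer:

h_A = \tanh\big(W_{I,A} v_I \oplus (W_{Q,A} v_Q + b_A)\big), \quad p_I = \mathrm{softmax}(W_P h_A + b_P)
\tilde{v}_I = \sum_i p_i v_i, \quad u = \tilde{v}_I + v_Q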
STACKED ATTENTION NETWORKS FOR VISUAL QA
The stacked attention network first
focuses on all referred concepts, e.g.,
bicycle, basket and objects in the
basket (dogs) in the first attention layer
and then further narrows down the
focus in the second layer and finds out
the answer dog.
DYNAMIC MEMORY NETWORKS FOR VISUAL
AND TEXTUAL QUESTION ANSWERING
NN architectures with memory and attention
mechanisms exhibit certain reasoning capabilities
required for QA.
The dynamic memory network (DMN) obtained high accuracy on
a variety of language tasks.
Based on an analysis of the DMN, some improvements
to its memory and input modules are made.
Together with these changes, an input module for
images is built so the model can answer visual questions.
DYNAMIC MEMORY NETWORKS FOR VISUAL
AND TEXTUAL QUESTION ANSWERING
Question Answering (text+image) using Dynamic Memory Network.
DYNAMIC MEMORY NETWORKS FOR VISUAL
AND TEXTUAL QUESTION ANSWERING
The input module with a “fusion layer”, where
the sentence reader encodes the sentence
and the bi-directional GRU allows information to
flow between sentences.
VQA input module to represent images for the DMN
DYNAMIC MEMORY NETWORKS FOR VISUAL
AND TEXTUAL QUESTION ANSWERING
The episodic memory module of the DMN+
when using two passes. F is the output
of the input module.
(a) The traditional GRU model, and
(b) the attention-based GRU model
MULTIMODAL RESIDUAL LEARNING
FOR VISUAL QA
Multimodal Residual Networks (MRN) for the multimodal residual learning of visual question-answering, which extends the idea of the deep residual learning.
Unlike the deep residual learning, MRN effectively learns the joint representation from vision and language information.
The main idea is to use element-wise multiplication for the joint residual mappings exploiting the residual learning of the attentional models.
Various alternative models by multimodality are explored.
Visualize the attention effect of the joint representations for each learning block using the back-propagation algorithm, even though the visual features are collapsed without spatial information.
MULTIMODAL RESIDUAL LEARNING
FOR VISUAL QA
Inference flow of Multimodal Residual Networks (MRN). A schematic diagram of
MRNs with 3-block layers.
MULTIMODAL RESIDUAL LEARNING
FOR VISUAL QA
Alternative models are explored to justify the model. (a) The base model. (b)
extra embedding for visual modality. (c) extra embeddings for both modalities.
(d) identity mappings for shortcuts. (e) two shortcuts for both modalities.
Eventually, (b) was chosen for its best performance and relative simplicity.
MULTIMODAL COMPACT BILINEAR POOLING FOR VISUAL
QUESTION ANSWERING AND VISUAL GROUNDING
Utilize Multimodal Compact Bilinear pooling (MCB) to
efficiently and expressively combine multimodal features.
An architecture which uses MCB twice, once for predicting
attention over spatial features and again to combine the
attended representation with the question representation.
Multimodal Compact Bilinear Pooling for visual question answering.
MULTIMODAL COMPACT BILINEAR POOLING FOR VISUAL
QUESTION ANSWERING AND VISUAL GROUNDING
Multimodal Compact Bilinear Pooling (MCB)
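A minimal NumPy sketch of the compact bilinear idea via Count Sketch and FFT (dimensions, seeding and names are illustrative; in practice the hash functions are sampled once and kept fixed for each modality):

import numpy as np

def count_sketch(x, h, s, d):
    # Project x to d dimensions using a fixed index hash h and sign hash s.
    y = np.zeros(d)
    np.add.at(y, h, s * x)
    return y

def mcb(v, q, d=16000, seed=0):
    # Sketch both vectors, then compute their circular convolution
    # (approximating outer-product pooling) as an element-wise product in the FFT domain.
    rng = np.random.RandomState(seed)
    h_v, s_v = rng.randint(d, size=v.size), rng.choice([-1, 1], size=v.size)
    h_q, s_q = rng.randint(d, size=q.size), rng.choice([-1, 1], size=q.size)
    sv, sq = count_sketch(v, h_v, s_v, d), count_sketch(q, h_q, s_q, d)
    return np.real(np.fft.ifft(np.fft.fft(sv) * np.fft.fft(sq)))

phi = mcb(np.random.randn(2048), np.random.randn(1024))   # fused multimodal feature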
MULTIMODAL COMPACT BILINEAR POOLING FOR VISUAL
QUESTION ANSWERING AND VISUAL GROUNDING
Architecture for VQA: Multimodal Compact Bilinear (MCB) with Attention.
Conv denotes convolutional layers and FC denotes fully connected layers.
MULTIMODAL COMPACT BILINEAR POOLING FOR VISUAL
QUESTION ANSWERING AND VISUAL GROUNDING
Architecture for VQA: MCB with
Attention and Answer Encoding.
Architecture for
Grounding with MCB.
TRAINING RECURRENT ANSWERING UNITS
WITH JOINT LOSS MINIMIZATION FOR VQA
Visual question answering based on a RNN, where every module corresponds to a complete answering unit with attention mechanism by itself.
The network is optimized by minimizing the loss aggregated from all the units, which share model parameters while receiving different information to compute attention probabilities.
For training, the model attends to a region within image feature map, updates its memory based on the question and attended image feature, and answers the question based on its memory state.
Observations: multi-step inferences are often required to answer questions while each problem may have a unique desirable number of steps.
Strategy: make the first unit in the network solve problems, but allow it to learn knowledge from the rest of the units by backpropagation unless this degrades the model; early-stop training each unit as soon as it starts to overfit; the last answering unit in the unfolded RNN is typically stopped first while the first one remains last; a single-step prediction is then made for a new question using the shared model.
This strategy works better than the other options within the framework since the selected model is trained effectively from all units without overfitting.
TRAINING RECURRENT ANSWERING UNITS
WITH JOINT LOSS MINIMIZATION FOR VQA
Answering unit comprising subtask embedding, attention and prediction operations.
TRAINING RECURRENT ANSWERING UNITS
WITH JOINT LOSS MINIMIZATION FOR VQA
Overall architecture of the network. The network is a RNN, where each
recurrent unit corresponds to a complete module for visual QA. For training,
unfold the network to predict answers and give supervision at every step. For
testing, use a single answering unit to answer a question about an image.
HADAMARD PRODUCT FOR LOW-RANK
BILINEAR POOLING
Bilinear models provide richer representations than linear models.
However, bilinear representations are high-dimensional, limiting the applicability to computationally complex tasks.
Low-rank bilinear pooling using Hadamard product for an efficient attention mechanism of multimodal learning.
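The resulting low-rank bilinear fusion of a question vector q and a visual vector v can be written roughly as (notation assumed; \sigma a non-linearity such as tanh, \circ the Hadamard product):

f = P^{\top}\big(\sigma(U^{\top} q) \circ \sigma(V^{\top} v)\big) + b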
A schematic diagram of MLB. The Replicate
module copies a question embedding
vector to match the S² visual feature
vectors. Conv modules indicate 1 × 1
convolutions that transform a given channel
space, which is computationally equivalent
to a linear projection over channels.
DATASET
• DAQUAR (question answering on real world images);
• COCO-QA;
• VQA: 0.25M images, 0.76M questions, 10M answers;
PART 3: IMAGE/VIDEO GENERATION
INTRODUCTION
Statistical natural image modeling remains a fundamental problem in computer vision and image understanding;
Earlier work defined image pixel distributions that were restricted to being either unconditioned or conditioned on classification labels;
Learning generative models conditioned on text allows a better understanding of the generalization performance of the model;
Generating high dimensional realistic images from their descriptions combines the two challenging components of language modeling and image generation;
Variational Auto-Encoder (VAE) can be seen as a neural network with continuous latent variables; The encoder is used to approximate a posterior distribution and the decoder is
used to stochastically reconstruct the data from latent variables;
Generative Adversarial Networks (GANs) are generative models that use noise-contrastive estimation to avoid calculating an intractable partition function. The model consists of a generator that generates samples from a uniform noise
distribution and a discriminator that discriminates between real and generated images.
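For reference, the standard GAN objective pits the two networks against each other in a minimax game:

\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]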
CONDITIONAL IMAGE GENERATION WITH
PIXELCNN DECODERS
Conditional image generation with an image density model based on the PixelCNN architecture.
The model can be conditioned on any vector, including descriptive labels or tags, or latent embeddings created by other networks.
When conditioned on class labels from the ImageNet database, the model is able to generate diverse, realistic scenes representing distinct animals, objects, landscapes and structures.
When conditioned on an embedding produced by a convolutional net given a single image of an unseen face, it generates a variety of new portraits of the same person with different facial expressions, poses and lighting conditions.
CONDITIONAL IMAGE GENERATION WITH
PIXELCNN DECODERS
The basic idea of the architecture is to use AR connections to model images pixel by pixel, decomposing the joint image distribution as a product of conditionals.
PixelRNN: the pixel distributions are modeled with two-dimensional LSTM; PixelRNNs generally give better performance.
PixelCNN: the pixel distributions are modeled with convolutional networks.
PixelCNNs are much faster to train because convolutions are inherently easier to parallelize; given the vast number of pixels present in large image datasets this is an important advantage.
Gated PixelCNN: a gated variant of PixelCNN that matches the log-likelihood of PixelRNN.
Conditional PixelCNN: a conditional variant of the Gated PixelCNN that allows us to model the complex conditional distributions of natural images given a latent vector embedding.
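The autoregressive factorization underlying these models, with the conditioning vector written as h, generates pixels one at a time in raster-scan order:

p(x \mid h) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1}, h)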
CONDITIONAL IMAGE GENERATION WITH
PIXELCNN DECODERS
A single layer in the Gated PixelCNN architecture. Convolution operations
are shown in green, element-wise multiplications and additions are shown
in red. The convolutions are combined into a single operation shown in
blue, which splits the 2p feature maps into two groups of p.
GENERATING IMAGES FROM CAPTIONS
WITH ATTENTION
Generate images from natural language descriptions.
Iteratively draws patches on a canvas, while attending to the relevant words.
AlignDRAW model for generating images by learning an alignment between the input
captions and the generated canvas. The caption is encoded using the Bidirectional
RNN (left). The generative RNN takes a latent sequence sampled from the prior
and the dynamic caption representation to generate the canvas matrix, used to
generate the final image x (right).
DEEP GENERATIVE IMAGE MODELS USING A LAPLACIAN
PYRAMID OF ADVERSARIAL NETWORKS
A generative parametric model, LAPGAN, capable of producing high quality samples
of natural images.
Uses a cascade of convnets within a Laplacian pyramid framework to generate
images in a coarse-to-fine fashion.
At each level of the pyramid, a separate generative convnet model is trained using the
Generative Adversarial Nets (GAN) approach.
Samples drawn from the model are of higher quality than alternate approaches.
DEEP GENERATIVE IMAGE MODELS USING A LAPLACIAN
PYRAMID OF ADVERSARIAL NETWORKS
GENERATIVE ADVERSARIAL TEXT TO IMAGE SYNTHESIS
A deep architecture and GAN formulation to effectively bridge state-of-the-art techniques in
text and image modeling, translating visual concepts from characters to pixels.
Trains a deep convolutional generative adversarial network (DC-GAN)
conditioned on text features encoded by a hybrid character-level convolutional-recurrent network.
Both the generator network G and the discriminator network D perform feed-
forward inference conditioned on the text feature.
GENERATIVE ADVERSARIAL TEXT TO IMAGE SYNTHESIS
UNSUPERVISED REPRESENTATION LEARNING WITH DEEP
CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORKS
Bridge the gap between the success of CNNs for supervised learning and
unsupervised learning.
A class of CNNs called Deep Convolutional Generative Adversarial
Networks (DCGANs) that have certain architectural constraints, and
demonstrate that they are a strong candidate for unsupervised learning.
Via training, the deep convolutional adversarial pair learns a hierarchy of
representations from object parts to scenes in both generator and discriminator.
Additionally, use the learned features for general image representations.
UNSUPERVISED REPRESENTATION LEARNING WITH DEEP
CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORKS
UNSUPERVISED REPRESENTATION LEARNING WITH DEEP
CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORKS
F-GAN: TRAINING GENERATIVE NEURAL SAMPLERS USING
VARIATIONAL DIVERGENCE MINIMIZATION
Generative neural samplers are probabilistic models that implement
sampling using feed-forward neural networks;
These models are expressive and allow efficient computation of
samples and derivatives, but cannot be used for computing likelihood
or for marginalization;
The generative adversarial training method allows such models to be trained
through the use of an auxiliary discriminative neural network;
The generative-adversarial approach is a special case of an existing
more general variational divergence estimation approach;
Any f-divergence can be used for training generative neural samplers.
F-GAN: TRAINING GENERATIVE NEURAL SAMPLERS USING
VARIATIONAL DIVERGENCE MINIMIZATION
Definition:
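For distributions P and Q with densities p and q, and a convex, lower-semicontinuous function f with f(1) = 0 (restating the standard formulation):

D_f(P \,\|\, Q) = \int_{\mathcal{X}} q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx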
F-GAN: TRAINING GENERATIVE NEURAL SAMPLERS USING
VARIATIONAL DIVERGENCE MINIMIZATION
Variational Divergence Minimization (VDM):
Use the variational lower bound on the f-divergence Df(P‖Q) in order to estimate a
generative model Q given a true distribution P;
Use two NNs, a generative model Q and a variational function T: Q takes as input a
random vector and outputs a sample of interest, parametrized through a
vector θ and written Qθ; T takes as input a sample and returns a scalar,
parametrized by a vector ω and written Tω.
Learn a generative model Qθ by finding a saddle point of the following f-GAN
objective function, where we minimize with respect to θ and maximize with respect to ω:
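With f^{*} denoting the Fenchel conjugate of f (restating the f-GAN objective):

F(\theta, \omega) = \mathbb{E}_{x \sim P}\big[T_\omega(x)\big] + \mathbb{E}_{x \sim Q_\theta}\big[-f^{*}(T_\omega(x))\big]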
F-GAN: TRAINING GENERATIVE NEURAL SAMPLERS USING
VARIATIONAL DIVERGENCE MINIMIZATION
Samples from three different divergences
ENERGY-BASED GANS
It views the discriminator as an energy function that attributes low energies
to the regions near the data manifold and higher energies to other regions;
A generator is seen as being trained to produce contrastive samples with
minimal energies, while the discriminator is trained to assign high energies
to these generated samples;
Using the discriminator as an energy function allows various
architectures and loss functionals to be used, in addition to a binary classifier with logistic
output;
An instantiation of the EBGAN framework uses an auto-encoder architecture,
with the energy being the reconstruction error, in place of the discriminator;
A single-scale architecture can be trained to generate high-resolution
images.
ENERGY-BASED GANS
EBGAN architecture with an auto-encoder discriminator
o Propose the idea of a “repelling regularizer” which fits well into the EBGAN auto-
encoder model, to keep the model from producing samples that are clustered in one
or a few modes of p_data (similar to “mini-batch discrimination” by Salimans et al.);
o Implementing the “repelling regularizer” has a pulling-away (PT) effect at a
representation level;
o The PT term is defined as
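(with S denoting a batch of N encoder representations S_1, \ldots, S_N; restated from the EBGAN formulation)

f_{PT}(S) = \frac{1}{N(N-1)} \sum_{i} \sum_{j \neq i} \left( \frac{S_i^{\top} S_j}{\|S_i\|\,\|S_j\|} \right)^{2}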
ENERGY-BASED GANS
Generation from LSUN bedroom full-images. Left (a): DCGAN generation. Right (b): EBGAN-PT generation.
UNSUPERVISED IMAGE-TO-IMAGE TRANSLATION
NETWORKS
UNsupervised Image-to-image Translation (UNIT) framework, which is
based on variational autoencoders and generative adversarial networks.
The framework can learn the translation function without any corresponding
images in two domains.
Combining a weight-sharing constraint and an adversarial training objective.
The UNsupervised Image-to-image Translation (UNIT) network has two encoders E1 and E2,
two generators G1 and G2, and two adversarial discriminators D1 and D2, where E1, G1, and
D1 are for the first domain while E2, G2, and D2 are for the second domain. They are all realized as
CNNs. The encoder and generator pair in the same domain forms a VAE, while the generator
and adversarial discriminator pair in the same domain forms a GAN.
UNSUPERVISED IMAGE-TO-IMAGE TRANSLATION
NETWORKS
IMAGE-TO-IMAGE TRANSLATION WITH CONDITIONAL
ADVERSARIAL NETS
Conditional adversarial networks as a general-purpose solution to image-to-image translation problems.
These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping.
It is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks.
IMAGE-TO-IMAGE TRANSLATION WITH CONDITIONAL
ADVERSARIAL NETS
Training a conditional GAN to predict aerial photos from maps. The
discriminator, D, learns to classify between real and synthesized pairs.
The generator learns to fool the discriminator. Unlike an unconditional
GAN, both the generator and discriminator observe an input image.
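The conditional GAN objective used for such image-to-image translation, typically combined with an L1 reconstruction term weighted by \lambda, is

\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]
G^{*} = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathbb{E}_{x,y,z}\big[\|y - G(x, z)\|_1\big]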
IMAGE-TO-IMAGE TRANSLATION WITH
CONDITIONAL ADVERSARIAL NETS
PLUG & PLAY GENERATIVE NETWORKS: CONDITIONAL
ITERATIVE GENERATION OF IMAGES IN LATENT SPACE
Synthesize novel images by performing gradient ascent in the latent space of
a generator network to maximize the activations of one or multiple neurons in
a separate classifier network.
Introducing an additional prior on the latent code, improving both sample
quality and sample diversity, leading to a state-of-the-art generative model
that produces high quality images at higher resolutions (227x227) than
previous generative models, and does so for all 1000 ImageNet categories.
A unified probabilistic interpretation of related activation maximization
methods and call the general class of models "Plug and Play Generative
Networks".
PPGNs are composed of 1) a generator network G that is capable of
drawing a wide range of image types and 2) a replaceable "condition"
network C that tells the generator what to draw.
Improves the state of the art of Multifaceted Feature Visualization, which
generates the set of synthetic inputs that activate a neuron in order to better
understand how deep neural networks operate.
PLUG & PLAY GENERATIVE NETWORKS: CONDITIONAL
ITERATIVE GENERATION OF IMAGES IN LATENT SPACE
Deep Generator Network-based Activation Maximization (DGN-AM) involves training a
generator G to create realistic images from compressed features extracted from a
pretrained classifier network E;
To generate images conditioned on a class, an optimization process is launched to
find a hidden code h that G maps to an image that highly activates a neuron in
another classifier C (not necessarily the same as E);
A major limitation of DGN-AM is the lack of diversity in the generated samples;
Idea: adding a prior on the latent code that keeps optimization along the
manifold of realistic-looking images; to unify and interpret activation maximization
approaches as a type of energy-based model where the energy function is a sum of
multiple constraint terms: (a) priors and (b) conditions;
Metropolis-adjusted Langevin sampling repeatedly adds noise and the gradient of log p(x,
y) to generate samples (a Markov chain);
Denoising autoencoders estimate the required gradient;
Use a special denoising autoencoder that has been trained with multiple losses,
including a GAN loss, to obtain the best results.
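The resulting sampler update is roughly of the following form (the decoupled step sizes \epsilon_1, \epsilon_2 and noise scale \epsilon_3 are assumptions about the exact parameterization):

x_{t+1} = x_t + \epsilon_1 \frac{\partial \log p(x_t)}{\partial x_t} + \epsilon_2 \frac{\partial \log p(y = y_c \mid x_t)}{\partial x_t} + \mathcal{N}(0, \epsilon_3^2)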
PLUG & PLAY GENERATIVE NETWORKS: CONDITIONAL
ITERATIVE GENERATION OF IMAGES IN LATENT SPACE
Different variants of PPGN models tested. The Noiseless Joint PPGN-h (e) empirically
produces the best images. In all variants, perform iterative sampling following the gradients
of two terms: the condition (red arrows) and the prior (black arrows). (a) PPGN-x: a p(x)
prior modeled via a DAE for images. (b) DGN-AM. (c) PPGN-h: a learned p(h) prior modeled
via a multi-layer perceptron DAE for h. (d) Joint PPGN-h: treating G + E1 + E2 as a DAE
that models h via x. (e) Noiseless Joint PPGN-h. (f) A pre-trained image classification
network (here, AlexNet trained on ImageNet) serves as the encoder network E component.
(g) attaching a recurrent, image-captioning network to the output layer of G.
PLUG & PLAY GENERATIVE NETWORKS: CONDITIONAL
ITERATIVE GENERATION OF IMAGES IN LATENT SPACE
VIDEO GENERATION
Understanding object motions and scene dynamics is a core problem in computer vision.
Work on generative video models has focused mostly on small patches, evaluated for video clustering.
A GAN for video with a spatio-temporal convolutional architecture that untangles a scene's foreground from its background.
THANKS!