visual language processing: image/video captioning and question answering


Page 1: Visual language processing: image/video captioning and question answering

VISUAL LANGUAGE PROCESSING:

IMAGE/VIDEO CAPTIONING AND VISUAL QA

Yu Huang

Sunnyvale, California

[email protected]

Page 2: Visual language processing: image/video captioning and question answering

OUTLINE

Part 1: Image/Video Captioning and Description

Part 2: Visual Question Answering

Part 3: Image/Video Generation

Page 3: Visual language processing: image/video captioning and question answering

PART 1: IMAGE/VIDEO CAPTIONING AND DESCRIPTION

Page 4: Visual language processing: image/video captioning and question answering

INTRODUCTION

• Computer vision + natural language processing;

• One not only needs to correctly recognize what appears in an image, but also to incorporate knowledge of spatial relationships and interactions btw objects, and then generate a description that is relevant and grammatically correct;

• Treated as a “retrieval” task: retrieving the sentences given the query image, or retrieving the images given the query sentences;

• Assumes a specific rule of the language grammar, parses the sentence and divides it into several parts;

• Learns a probability density over the space of multimodal inputs (i.e. sentences and images);

• It is natural to think of image caption generation as a translation problem;

• Transform a sentence S written in a source language into its translation T in the target language by maximizing p(T|S).

Page 5: Visual language processing: image/video captioning and question answering

INTRODUCTION

• A language model is needed in addition to visual understanding;

• Visual primitive recognizers combined with a structured formal language, e.g. AND-OR graphs or logic systems, converted to natural language via rule-based systems;

• Generation of image descriptions:

• Template-based methods: fill in sentence templates, such as triplets, based on the results of object detection and spatial relationships;

• Composition-based methods: harness existing image-caption databases by extracting components of related captions and composing them together to generate novel descriptions;

• Neural network methods: generate descriptions by sampling from conditional (multimodal) neural language models.

Page 6: Visual language processing: image/video captioning and question answering

DEEP LEARNING FACE ATTRIBUTES IN THE WILD

• It cascades two CNNs (LNet and ANet) for face localization and attribute prediction;

• Trained in a cascade manner with attribute labels, but pre-trained differently:

• LNet is pre-trained with general object categories, ANet is pre-trained with face identities.

• This not only outperforms the state of the art by a large margin, but also reveals multiple valuable facts about learning face representations.

Page 7: Visual language processing: image/video captioning and question answering

DEEP LEARNING FACE ATTRIBUTES IN THE WILD

Page 8: Visual language processing: image/video captioning and question answering

EXPLAIN IMAGES WITH MULTIMODAL RECURRENT NEURAL NETWORKS

• A multimodal Recurrent Neural Network (m-RNN) model for generating novel sentence descriptions to explain the content of images.

• Models the probability distribution of generating a word given the previous words and the image; image descriptions are generated by sampling from this distribution.

• Two sub-networks: a deep recurrent NN for sentences and a deep convolutional network for images, interacting with each other in a multimodal layer.
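A minimal PyTorch-style sketch of the multimodal-layer idea described above: the word embedding, the recurrent state and the CNN image feature are projected into a joint space and fused before predicting the next word. Layer sizes, the additive tanh fusion and all names here are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class MultimodalLayer(nn.Module):
    """Fuse word embedding, recurrent state and image feature into a joint space."""
    def __init__(self, word_dim=256, rnn_dim=256, img_dim=4096, mm_dim=512, vocab=10000):
        super().__init__()
        self.w_proj = nn.Linear(word_dim, mm_dim)   # word-embedding branch
        self.r_proj = nn.Linear(rnn_dim, mm_dim)    # recurrent-state branch
        self.v_proj = nn.Linear(img_dim, mm_dim)    # CNN image-feature branch
        self.out = nn.Linear(mm_dim, vocab)         # scores over the next word

    def forward(self, word_emb, rnn_state, img_feat):
        m = torch.tanh(self.w_proj(word_emb) + self.r_proj(rnn_state) + self.v_proj(img_feat))
        return self.out(m)  # logits over the vocabulary
```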

Page 9: Visual language processing: image/video captioning and question answering

EXPLAIN IMAGES WITH MULTIMODAL RECURRENT NEURAL NETWORKS

The simple RNN

m-RNN model

unfolded m-RNN

Page 10: Visual language processing: image/video captioning and question answering

SHOW AND TELL: A NEURAL IMAGE CAPTION GENERATOR

• A generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation, and that can be used to generate natural sentences describing an image.

• Trained to maximize the likelihood of the target description sentence given the image.

Page 11: Visual language processing: image/video captioning and question answering

SHOW AND TELL: A NEURAL IMAGE CAPTION GENERATOR

Long-Short Term Memory (LSTM) net

LSTM model combined with a CNN image embedder and word embeddings. All LSTMs share the same parameters.
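To make the figure concrete, here is a minimal sketch of a CNN-encoder / LSTM-decoder captioner with teacher forcing, in the spirit of the model above; all dimensions, module names and the single-layer LSTM are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class NeuralImageCaptioner(nn.Module):
    """CNN feature fed as the first LSTM input; predict each caption word in turn."""
    def __init__(self, feat_dim=2048, emb=512, hid=512, vocab=10000):
        super().__init__()
        self.img_embed = nn.Linear(feat_dim, emb)     # image embedder
        self.word_embed = nn.Embedding(vocab, emb)    # word embeddings
        self.lstm = nn.LSTM(emb, hid, batch_first=True)
        self.vocab_proj = nn.Linear(hid, vocab)

    def forward(self, img_feat, captions):
        # img_feat: (B, feat_dim); captions: (B, T) token ids
        v = self.img_embed(img_feat).unsqueeze(1)      # image as the first input step
        w = self.word_embed(captions[:, :-1])          # previous words (teacher forcing)
        out, _ = self.lstm(torch.cat([v, w], dim=1))
        logits = self.vocab_proj(out[:, 1:])           # predictions for caption words
        # training maximizes likelihood, i.e. cross-entropy against captions[:, 1:]
        return logits
```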

Page 12: Visual language processing: image/video captioning and question answering

SHOW, ATTEND AND TELL: A NEURAL IMAGE CAPTION

GENERATOR WITH VISUAL ATTENTION

An LSTM cell; lines with bolded squares imply projections with a learnt weight vector. Each cell learns how to weigh its input components (input gate), while learning how to modulate that contribution to the memory (input modulator). It also learns weights which erase the memory cell (forget gate), and weights which control how this memory should be emitted (output gate).
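For reference, the gates described in this caption correspond to the standard LSTM update (σ is the logistic sigmoid, ⊙ the element-wise product; the notation is the common convention rather than the paper's exact symbols):

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) && \text{(input modulator)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t, \qquad h_t = o_t \odot \tanh(c_t)
\end{aligned}
```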

Examples of attending to the correct object (white indicates the attended regions, underlines indicate the corresponding word).

Page 13: Visual language processing: image/video captioning and question answering

VISUAL-SEMANTIC EMBEDDINGS WITH MULTIMODAL NEURAL LANGUAGE MODELS

• An encoder-decoder pipeline that learns a multimodal joint embedding space with images and text, and a novel language model for decoding distributed representations.

• Unifies joint image-text embedding models with multimodal neural language models;

• The structure-content neural language model disentangles the structure of a sentence from its content; the encoder allows one to rank images and sentences while the decoder can generate novel descriptions from scratch.

Encoder: a deep CNN and a long short-term memory recurrent network (LSTM) for learning a joint image-sentence embedding. Decoder: a new neural language model that combines structure and content vectors for generating words one at a time in sequence.
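A minimal sketch of the pairwise ranking objective commonly used for such joint image-sentence embeddings (the exact form is an assumption: s(·,·) is the similarity in the joint space, α a margin, and x_k, v_k are mismatched contrastive sentences and images):

```latex
\min_{\theta} \sum_{(x,v)} \Big[ \sum_{k} \max\{0,\ \alpha - s(x,v) + s(x,v_k)\}
\;+\; \sum_{k} \max\{0,\ \alpha - s(x,v) + s(x_k,v)\} \Big]
```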

Page 14: Visual language processing: image/video captioning and question answering

VISUAL-SEMANTIC EMBEDDINGS WITH MULTIMODAL NEURAL LANGUAGE MODELS

(a): multiplicative neural language

model.

(b): Structure-content neural

language model (SC-NLM).

(c): The prediction problem of an

SC-NLM.

Page 15: Visual language processing: image/video captioning and question answering

FROM CAPTIONS TO VISUAL CONCEPTS AND BACK

• Generating image descriptions: visual detectors and language models learned directly from a dataset of image captions.

• Multiple Instance Learning to train visual

detectors for words in captions, including

many different parts of speech such as

nouns, verbs, and adjectives.

• The word detector outputs serve as

conditional inputs to a maximum-entropy

language model that learns from a set of

image descriptions to capture statistics of

word usage.

• Capture global semantics by re-ranking

caption candidates using sentence-level

features and a deep multimodal similarity

model.

Page 16: Visual language processing: image/video captioning and question answering

FROM CAPTIONS TO VISUAL CONCEPTS AND BACK

Page 17: Visual language processing: image/video captioning and question answering

DEEP VISUAL-SEMANTIC ALIGNMENTS FOR GENERATING IMAGE DESCRIPTIONS

• A model that generates free-form

natural language descriptions of

image regions and leverages

datasets of images and their

sentence descriptions to learn about

the inter-modal correspondences

between text and visual data.

• Combination of CNN over image

regions, bidirectional RNN over

sentences, and a structured

objective that aligns the two

modalities through a multimodal

embedding.

• RNN architecture that uses the

inferred alignments to learn to

generate novel descriptions of

image regions.

Page 18: Visual language processing: image/video captioning and question answering

DEEP VISUAL-SEMANTIC ALIGNMENTS FOR GENERATING IMAGE DESCRIPTIONS

Diagram for evaluating the image-sentence score.

Diagram of the multimodal Recurrent Neural Network generative model.

Page 19: Visual language processing: image/video captioning and question answering

LEARNING A RECURRENT VISUAL REPRESENTATION FOR

IMAGE CAPTION GENERATION

Bi-directional mapping btw images and their sentence-based descriptions, learned by a recurrent neural network.

Applies a recurrent visual memory that automatically learns to remember long-term visual concepts to aid in both sentence generation and visual feature reconstruction.

Page 20: Visual language processing: image/video captioning and question answering

LEARNING A RECURRENT VISUAL REPRESENTATION FOR

IMAGE CAPTION GENERATION

(a) shows the full model used for training. (b) and (c) show the parts

of the model needed for generating sentences from visual features

and generating visual features from sentences respectively.

Page 21: Visual language processing: image/video captioning and question answering

PHRASE-BASED IMAGE CAPTIONING

A model to generate descriptive sentences given a sample image.

Strong focus on syntax of the descriptions.

Train a bilinear model that learns a metric btw an image representation (from a previously trained CNN) and the phrases that are used to describe it.

Based on caption syntax statistics, a language model that can produce relevant descriptions for a given test image using the phrases inferred.

The constrained language

model for generating

description given the predicted

phrases for an image.

Page 22: Visual language processing: image/video captioning and question answering

PHRASE-BASED IMAGE CAPTIONING

Schematic illustration of the phrase-based model for image descriptions.

Page 23: Visual language processing: image/video captioning and question answering

FAST NOVEL VISUAL CONCEPT LEARNING FROM

SENTENCE DESCRIPTIONS OF IMAGES

Learning novel visual concepts, and their interactions with other concepts, from a few images with sentence descriptions.

Using linguistic context and visual features, hypothesize the semantic meaning of new words and add them to the word dictionary, so they can be used to describe images which contain these novel concepts.

A transposed weight sharing scheme, which not only improves performance on image captioning, but also makes the model more suitable for the novel concept learning task.

Using a few “quidditch”

images with sentence

descriptions, the

method is able to learn

that “quidditch” is played

by people with a ball.

Page 24: Visual language processing: image/video captioning and question answering

FAST NOVEL VISUAL CONCEPT LEARNING FROM

SENTENCE DESCRIPTIONS OF IMAGES

(a). The image captioning model (b). The transposed weight sharing of UD and UM.

Page 25: Visual language processing: image/video captioning and question answering

FAST NOVEL VISUAL CONCEPT LEARNING FROM

SENTENCE DESCRIPTIONS OF IMAGES

Training novel concepts. Only update the

sub-matrix UDn in UD that is connected to

the node of new words in the One-Hot

layer and the SoftMax layer during the

training for novel concepts.

Organization of the novel concept datasets

Page 26: Visual language processing: image/video captioning and question answering

LANGUAGE MODELS FOR IMAGE CAPTIONING:

THE QUIRKS AND WHAT WORKS

Method 1 uses a pipelined process where a set of candidate words is generated by a CNN trained on images, and then a maximum entropy (ME) language model is used to arrange these words into a coherent sentence.

Method 2 uses the penultimate activation layer of the CNN as input to a RNN that then generates the caption sequence.

Compare the merits of these different language modeling approaches by using the same state-of-the-art CNN as input.

Examine linguistic irregularities, caption repetition, and data set overlap.

Combine key aspects of the ME and RNN methods.

Page 27: Visual language processing: image/video captioning and question answering

WHAT VALUE DO EXPLICIT HIGH LEVEL CONCEPTS

HAVE IN VISION TO LANGUAGE PROBLEMS?

The CNN-RNN method does not explicitly represent high-level semantic concepts, but rather progresses directly from image features to text.

Incorporate high-level concepts into the successful CNN-RNN approach.

Attribute based V2L framework:

The image analysis module learns a

mapping btw an image and the

semantic attributes through a CNN.

The language module learns a

mapping from the attributes vector to

a sequence of words using an LSTM.

Page 28: Visual language processing: image/video captioning and question answering

WHAT VALUE DO EXPLICIT HIGH LEVEL CONCEPTS

HAVE IN VISION TO LANGUAGE PROBLEMS?

Attribute prediction CNN: the

model is initialized from VggNet

pre-trained on ImageNet. The

model is then fine-tuned on the

target multi-label dataset. Given a

test image, a set of proposal

regions are selected and passed

to the shared CNN, and finally the

CNN outputs from different

proposals are aggregated with

max pooling to produce the final

multi-label prediction, which gives

us the high-level image

representation, Vatt(I).

Page 29: Visual language processing: image/video captioning and question answering

WHAT VALUE DO EXPLICIT HIGH LEVEL CONCEPTS

HAVE IN VISION TO LANGUAGE PROBLEMS?

Language generators for

different types of tasks:

(a) Image Captioning, (b)

VQA-single word, (c)

VQA-sentence. red

arrow indicates attributes

input Vatt(I) while blue

dash arrow shows the

baseline method input

CNN(I).

Page 30: Visual language processing: image/video captioning and question answering

VIDEO CAPTIONING & DESCRIPTION

Tagging videos with metadata;

Clustering captions and videos;

Retrieval and predicting event tags rather than generating descriptive sentences;

Two stages of description generation:

Identify the semantic content: train classifiers to identify candidate objects, actions and scenes;

Generate a sentence based on a template: combine visual confidences with a language model in a probabilistic graphical model;

Pro: detaches content generation from surface realization;

Con: must select a set of relevant objects, actions and scenes to recognize, and loses the richness of human language in the templates;

Deep learning: create a visual-semantic embedding; learn spatio-temporal visual features and also a temporal context model.

Page 31: Visual language processing: image/video captioning and question answering

YOUTUBE2TEXT: RECOGNIZING AND DESCRIBING ARBITRARY ACTIVITIES

USING SEMANTIC HIERARCHIES AND ZERO-SHOT RECOGNITION

Takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object.

Small portions of the Hierarchies learned over

Subjects, Verbs and Objects.

Page 32: Visual language processing: image/video captioning and question answering

ALIGNING BOOKS AND MOVIES: TOWARDS STORY-LIKE VISUAL

EXPLANATIONS BY WATCHING MOVIES AND READING BOOKS

reason about visual and dialog (text) alignment btw a movie and a book;

exploit a neural sentence embedding that is trained in an unsupervised way

from a large corpus of books, and a video-text neural embedding for

computing similarities btw movie clips and sentences in the book;

a simple pairwise CRF that smooths the alignments by encouraging them to follow a linear timeline, both in the video and book domains.

Sentence neural embedding

Page 33: Visual language processing: image/video captioning and question answering

DESCRIBING MULTIMEDIA CONTENT USING ATTENTION-

BASED ENCODER–DECODER NETWORKS

Translating a (short) video clip to

its natural language description;

CNN + GRU (RNN) + Attention;

Page 34: Visual language processing: image/video captioning and question answering

DESCRIBING VIDEOS BY EXPLOITING TEMPORAL STRUCTURE

Incorporate models of the local temporal dynamics of videos and of global temporal structure;

The local structure is modeled using the temporal feature maps of a 3-D CNN, while a

temporal attention mechanism is used to combine information across the entire video;

Encoder-decoder to generate video description: encoded by CNN, decoded by RNN;

Spatio-temporal CNN;

LSTM;

Page 35: Visual language processing: image/video captioning and question answering

TRANSLATING VIDEOS TO NATURAL LANGUAGE

USING DEEP RECURRENT NEURAL NETWORKS

A unified DNN with both convolutional and recurrent structure;

Creates sentence descriptions of open-domain videos with large vocabularies;

An end-to-end deep model for video-to-text generation that simultaneously learns a latent “meaning” state and a fluent grammatical model of the associated language.

Video Description Network

Page 36: Visual language processing: image/video captioning and question answering

LONG-TERM RECURRENT CONVOLUTIONAL NETWORKS

FOR VISUAL RECOGNITION AND DESCRIPTION

Long-term Recurrent Convolutional Networks (LRCNs), a class of architectures leveraging the strengths of rapid progress in CNNs for visual recognition problems, and the growing desire to apply such models to time-varying inputs and outputs;

LRCN is directly connected to modern visual CNN models and can be jointly trained to simultaneously learn temporal dynamics and convolutional perceptual representations.

Page 37: Visual language processing: image/video captioning and question answering

LONG-TERM RECURRENT CONVOLUTIONAL NETWORKS

FOR VISUAL RECOGNITION AND DESCRIPTION

Instantiations of the LRCN model for activity recognition, image description,

and video description.

Page 38: Visual language processing: image/video captioning and question answering

LONG-TERM RECURRENT CONVOLUTIONAL NETWORKS

FOR VISUAL RECOGNITION AND DESCRIPTION

Three variations of the LRCN image captioning architecture to evaluate.

Page 39: Visual language processing: image/video captioning and question answering

LONG-TERM RECURRENT CONVOLUTIONAL NETWORKS

FOR VISUAL RECOGNITION AND DESCRIPTION

Video description in LRCN. (a) LSTM encoder & decoder with CRF max (b) LSTM decoder

with CRF max (c) LSTM decoder with CRF probabilities.

Page 40: Visual language processing: image/video captioning and question answering

THE LONG-SHORT STORY OF MOVIE DESCRIPTION

Train the visual classifiers for verbs, objects and places, using

different visual features: DT (dense trajectories), LSDA (large scale

object detector) and PLACES (Places-CNN );

Next, concatenate the scores from a subset of selected robust

classifiers and use them as input to our LSTM.

Page 41: Visual language processing: image/video captioning and question answering

JOINTLY MODELING EMBEDDING AND TRANSLATION TO

BRIDGE VIDEO AND LANGUAGE

A unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of the LSTM and the visual-semantic embedding: locally maximize the probability of generating the next word given previous words and visual content;

Create a visual-semantic embedding space for enforcing the relationship between the semantics of the entire sentence and the visual content;

It includes three parts: 2-D and/or 3-D deep convolutional neural networks for learning a powerful video representation;

a deep RNN for generating sentences;

a joint embedding model for exploring the relationships between visual content and sentence semantics.

Page 42: Visual language processing: image/video captioning and question answering

JOINTLY MODELING EMBEDDING AND TRANSLATION TO

BRIDGE VIDEO AND LANGUAGE

LSTM-E framework with a language generating LSTM and a visual-

semantic embedding model.

Page 43: Visual language processing: image/video captioning and question answering

SEQUENCE TO SEQUENCE – VIDEO TO TEXT

End-to-end sequence-to-sequence model to generate captions for videos;

Exploit recurrent neural networks, specifically LSTMs, which have demonstrated state-of-the-art performance in image caption generation;

Train the LSTM on video-sentence pairs; it learns to associate a sequence of video frames with a sequence of words in order to generate a description of the event in the video clip;

Page 44: Visual language processing: image/video captioning and question answering

SEQUENCE TO SEQUENCE – VIDEO TO TEXT

A stack of two LSTMs that learn a representation of a sequence of frames in order to

decode it into a sentence that describes the event in the video. The top LSTM layer

models visual feature inputs. The second LSTM layer models language given the text

input and the hidden representation of the video sequence.

Page 45: Visual language processing: image/video captioning and question answering

TEMPORAL TESSELLATION: A UNIFIED

APPROACH FOR VIDEO ANALYSIS

A general approach to video understanding, inspired by semantic transfer that is used for 2D image analysis.

A video is a 1D sequence of clips, each one associated with its own semantics.

The semantics – natural language captions – depends on the task at hand.

A test video is processed by forming correspondences btw its clips and the clips of reference videos with known semantics, following which, reference semantics can be transferred to the test video.

Two matching methods, both designed to ensure that (a) reference clips appear similar to test clips and (b), taken together, the semantics of the selected reference clips are consistent and maintain temporal coherence.

Page 46: Visual language processing: image/video captioning and question answering

TEMPORAL TESSELLATION: A UNIFIED

APPROACH FOR VIDEO ANALYSIS

Tessellation for temporal coherence. Given a query video, seek reference

video clips with similar semantics. Tessellation ensures that the semantics

assigned to the test clip are not only the most relevant (the five options for

each clip) but also preserve temporal coherence.

Page 47: Visual language processing: image/video captioning and question answering

TEMPORAL TESSELLATION: A UNIFIED

APPROACH FOR VIDEO ANALYSIS

Two non-local tessellations. Left: tessellation by restricted Viterbi. For a query video, find visually similar videos and select the clips that preserve temporal coherence using the Viterbi method. Right: tessellation by predicting the dynamics of semantics. Given a query video and a previous clip selection, use an LSTM to predict the most accurate semantics for the next clip.

Page 48: Visual language processing: image/video captioning and question answering

PART 2: VISUAL QUESTION ANSWERING

Page 49: Visual language processing: image/video captioning and question answering

INTRODUCTION

• NLP, knowledge representation and visual image understanding;

• Answer natural language questions on real world visual images;

• Interaction btw humans and computers;

• Task in QA: Given the question, learn the relevant visual and text

representation to infer the answer;

• Feature extraction from visual images: CNN;

• Question encoding in NLP: LSTM or CNN;

• Answer generation by the learned model;

• Attention or memory network.

Page 50: Visual language processing: image/video captioning and question answering

VISUAL TURING TEST

• An operator-assisted device that produces a stochastic sequence of

binary questions from a given test image;

• VQA is a good task for visual Turing test;

• DAQUAR: A dataset for Visual Turing Challenge;

• It contains 1088 different nouns in the question, 803 in the answers, and

1586 altogether (573 categories);

• It includes questions that can be reliably answered using common-sense knowledge (reaching about 4 million to account for different interpretations of the external world), with questions of substantial length (10.5 words on average with variance 5.5; the longest question has 30 words);

• The question answering task is also about understanding hidden

intentions of the questioner with grounding as a sub-goal to solve.

Page 51: Visual language processing: image/video captioning and question answering

CNN FOR VISUAL QA

• The CNN learns not only the image representation and the composition model for the question, but also the intermodal interaction between the image and the question, for generating the answer;

• an image CNN to extract the image representation;

• one sentence CNN to encode the question;

• one multimodal convolution layer to fuse the multimodal input of the image and question to obtain the joint representation for the classification in the space of candidate answer words.

• Test on DAQUAR and COCO-QA datasets;

Page 52: Visual language processing: image/video captioning and question answering

CNN FOR VISUAL QA

Page 53: Visual language processing: image/video captioning and question answering

MEMORY NETWORK

• Reason with inference components combined with a long-term

memory component;

• The long-term memory can be read and written to, with the goal of

using it for prediction (as a dynamic knowledge base);

• A memory m (an array of objects indexed by mi) and four (potentially

learned) components I (input feature map), G (generalization), O (output

feature map) and R (response);

• Given input x, the flow of the model is:

1. Convert x to an internal feature representation I(x).

2. Update memories mi given the new input: mi = G(mi , I(x),m).

3. Compute output features o given the new input and the memory: o = O(I(x),m).

4. Finally, decode output features o to give the final response: r = R(o).
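A framework-agnostic sketch of the I/G/O/R flow listed above; the component implementations are placeholders (assumptions), and only the control flow mirrors the four steps.

```python
# Minimal sketch of one Memory Network step; I, G, O, R are user-supplied callables.

def memory_network_step(x, memory, I, G, O, R):
    """Input -> update memories -> output features -> decoded response."""
    feat = I(x)                                        # 1. internal feature representation I(x)
    memory = [G(m_i, feat, memory) for m_i in memory]  # 2. update each memory slot m_i
    o = O(feat, memory)                                # 3. output features from input and memory
    return R(o), memory                                # 4. decode output features into response r
```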

Page 54: Visual language processing: image/video captioning and question answering

VISUAL ATTENTION IN RNN

• Extract info. from an image or video by adaptively selecting a

sequence of regions or locations and only processing the selected

regions at high resolution;

• RAM (Recurrent Attention Model): Translation invariance built-in;

• Computation cost is independent of the input image size;

• Can be trained using reinforcement learning methods (task specific);

• It processes inputs sequentially, attending to different locations

within the images (or video frames) one at a time, and incrementally

combines info. from these fixations to build up a dynamic internal

representation of the scene or environ.;

Page 55: Visual language processing: image/video captioning and question answering

VISUAL ATTENTION IN RNN

A) Glimpse Sensor: given the coordinates of the glimpse and an image, extracts a representation that contains multiple patches. B) Glimpse Network: given the location and image, uses the glimpse sensor to extract a representation. The representation and glimpse location are mapped into a hidden space using independent linear layers. The glimpse network defines a trainable bandwidth-limited sensor for the attention network, producing the glimpse representation. C) RNN Model: takes the glimpse representation as input and, combining it with the internal representation at the previous time step, produces the new internal state. The location and action networks use the internal state to produce the next location to attend to and the action/classification respectively. This basic RNN iteration is repeated for a variable number of steps.

Page 56: Visual language processing: image/video captioning and question answering

VQA: COMBINATION OF NLP AND CV

• Visual questions selectively target

different areas of an image, including

BG details and underlying context.

• A VQA system typically needs a more

detailed understanding of the image

and complex reasoning than a system

producing generic image captions.

• VQA is amenable to automatic

evaluation, since many open-ended

answers contain only a few words or a

closed set of answers that can be

provided in a multiple-choice format;

• Benchmark model: MLP + LSTM.

Examples of free-form, open-ended

questions via Amazon Mechanical Turk

Page 57: Visual language processing: image/video captioning and question answering

EXPLORING MODELS AND DATA FOR IMAGE

QUESTION ANSWERING

image-based question-answering (QA) with new

models and datasets.

use neural networks and visual semantic

embeddings, without intermediate stages (object

detection and image segmentation), to predict

answers to simple questions about images.

a question generation algorithm that converts

image descriptions, which are widely available,

into QA form.

Page 58: Visual language processing: image/video captioning and question answering

EXPLORING MODELS AND DATA FOR IMAGE

QUESTION ANSWERING

VIS+LSTM Model

Page 59: Visual language processing: image/video captioning and question answering

EXPLORING MODELS AND DATA FOR IMAGE

QUESTION ANSWERING

Page 60: Visual language processing: image/video captioning and question answering

MULTILINGUAL IMAGE QUESTION ANSWERING

The mQA model, can answer questions about content of an image.

The answer can be a sentence, a phrase or a single word.

Four components: a LSTM to extract the question representation, a CNN to extract the visual representation, an LSTM for storing the linguistic context in an answer, and a fusing component to combine the information from the first three components and generate the answer.

A Freestyle Multilingual Image Question Answering (FM-IQA) dataset to train and evaluate the mQA model. It contains over 150,000 images and 310,000 freestyle Chinese

question-answer pairs and their English translations.

The quality of the generated answers of the mQA model on this dataset is evaluated by human judges through a Turing Test.

http://idl.baidu.com/FM-IQA.html.

Page 61: Visual language processing: image/video captioning and question answering

MULTILINGUAL IMAGE QUESTION ANSWERING

mQA model architecture. Input an image and a question about the image (i.e. “What is

the cat doing?”) to the model. The model is trained to generate the answer to the question

(i.e. “Sitting on the umbrella”). The weight matrix in the word embedding layers of the two

LSTMs (one for the question and one for the answer) are shared. In addition, this weight

matrix is also shared, in a transposed manner, with the weight matrix in the Softmax layer.

Page 62: Visual language processing: image/video captioning and question answering

MULTILINGUAL IMAGE QUESTION ANSWERING

Sample answers to the visual question generated by our model on the newly

proposed Freestyle Multilingual Image Question Answering (FM-IQA) dataset.

Page 63: Visual language processing: image/video captioning and question answering

IMAGE QUESTION ANSWERING USING CNN

WITH DYNAMIC PARAMETER PREDICTION

learning a CNN with a dynamic parameter layer whose weights are determined adaptively based on questions.

For the adaptive parameter prediction, employ a separate parameter prediction network, which consists of GRU taking a question as its input and a FCL generating a set of candidate weights as its output.

incorporate a hashing technique, where the candidate weights given by the parameter prediction network are selected using a predefined hash function to determine individual weights in the dynamic parameter layer.

The proposed network—the joint network with the CNN for ImageQA and the parameter prediction network—is trained end-to-end through back-propagation, where its weights are initialized using a pre-trained CNN and GRU.

Page 64: Visual language processing: image/video captioning and question answering

IMAGE QUESTION ANSWERING USING CNN

WITH DYNAMIC PARAMETER PREDICTION

Overall architecture of Dynamic Parameter Prediction network (DPPnet),

composed of classification network and parameter prediction network. The

weights in the dynamic parameter layer are mapped by a hashing trick from

the candidate weights from the parameter prediction network.

Page 65: Visual language processing: image/video captioning and question answering

A NEURAL-BASED METHOD FOR VISUAL QA

• Set as a Visual Turing Test;

• An end-to-end formulation of this problem for which all parts are trained jointly;

• All CNN models are first pre-trained on the ImageNet dataset, and next fine-

tune the last layer together with the full training of the LSTM network.

Page 66: Visual language processing: image/video captioning and question answering

COMBINE NN AND VISUAL SEMANTIC EMBEDDINGS

• Without intermediate stages such as object detection and image segmentation;

• Builds directly on top of the LSTM sentence model and is called the “VIS+LSTM” model.

• Idea: treat the image as a word, following caption generation work;

• The LSTM outputs are fed into a softmax layer at the last time step to generate answers.

Uses the last hidden layer of the 19-layer Oxford VGG ConvNet trained on the ImageNet 2014 Challenge as visual embeddings.
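A minimal PyTorch-style sketch of the "image as first word" idea above: the CNN feature is projected into the word-embedding space and prepended to the question tokens, and the last LSTM state is classified over candidate answers. All sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisLSTM(nn.Module):
    """Treat the projected image feature as an extra first 'word' of the question."""
    def __init__(self, vocab=10000, emb=300, hid=512, img_dim=4096, n_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.img_proj = nn.Linear(img_dim, emb)   # image embedded like a word
        self.lstm = nn.LSTM(emb, hid, batch_first=True)
        self.classifier = nn.Linear(hid, n_answers)

    def forward(self, img_feat, question_ids):
        q = self.embed(question_ids)                 # (B, T, emb)
        v = self.img_proj(img_feat).unsqueeze(1)     # (B, 1, emb)
        seq = torch.cat([v, q], dim=1)               # image first, then the question
        _, (h, _) = self.lstm(seq)
        return self.classifier(h[-1])                # answer scores at the last step
```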

Page 67: Visual language processing: image/video captioning and question answering

STACKED ATTENTION NETWORKS FOR VISUAL QA

• Stacked attention networks (SANs);

• Multi-stage reasoning: a multiple-layer SAN in which it queries an

image multiple times to infer the answer progressively;

• Semantic representation of a question as query to search for

the regions in an image that are related to the answer;

• (1) the image model, which uses a CNN to extract high level image

representations, e.g. one vector for each region of the image;

• (2) the question model, which uses a CNN or a LSTM to extract a

semantic vector of the question;

• (3) the stacked attention model, which locates, via multi-step

reasoning, the image regions that are relevant to the question for

answer prediction.
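A sketch of a single attention layer in the spirit of the stacked attention model above (dimensions, the tanh fusion and the additive query refinement are assumptions for illustration; the full model stacks several such layers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLayer(nn.Module):
    """One attention step: score regions against the question, pool, refine the query."""
    def __init__(self, d=512, k=256):
        super().__init__()
        self.w_img = nn.Linear(d, k)
        self.w_q = nn.Linear(d, k)
        self.w_att = nn.Linear(k, 1)

    def forward(self, regions, query):
        # regions: (B, R, d), one vector per image region; query: (B, d) question vector
        h = torch.tanh(self.w_img(regions) + self.w_q(query).unsqueeze(1))
        p = F.softmax(self.w_att(h).squeeze(-1), dim=1)     # attention over regions
        attended = (p.unsqueeze(-1) * regions).sum(dim=1)   # weighted image summary
        return attended + query                             # refined query for the next layer
```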

Page 68: Visual language processing: image/video captioning and question answering

STACKED ATTENTION NETWORKS FOR VISUAL QA

The stacked attention network first

focuses on all referred concepts, e.g.,

bicycle, basket and objects in the

basket (dogs) in the first attention layer

and then further narrows down the

focus in the second layer and finds out

the answer dog.

Page 69: Visual language processing: image/video captioning and question answering

DYNAMIC MEMORY NETWORKS FOR VISUAL

AND TEXTUAL QUESTION ANSWERING

NN architectures with memory and attention

mechanisms exhibit certain reasoning capabilities

required for QA.

The dynamic memory network (DMN) has obtained high accuracy on a variety of language tasks.

Based on an analysis of the DMN, several improvements to its memory and input modules are proposed.

Together with these changes, an input module for images is built so that the model is able to answer visual questions.

Page 70: Visual language processing: image/video captioning and question answering

DYNAMIC MEMORY NETWORKS FOR VISUAL

AND TEXTUAL QUESTION ANSWERING

Question Answering (text+image) using Dynamic Memory Network.

Page 71: Visual language processing: image/video captioning and question answering

DYNAMIC MEMORY NETWORKS FOR VISUAL

AND TEXTUAL QUESTION ANSWERING

The input module with a “fusion layer”, where

the sentence reader encodes the sentence

and the bi-directional GRU allows info. to

flow between sentences.

VQA input module to represent images for the DMN

Page 72: Visual language processing: image/video captioning and question answering

DYNAMIC MEMORY NETWORKS FOR VISUAL

AND TEXTUAL QUESTION ANSWERING

The episodic memory module of the DMN+

when using two passes. The Ḟ is the output

of the input module.

(a) The traditional GRU model, and

(b) the attention-based GRU model

Page 73: Visual language processing: image/video captioning and question answering

MULTIMODAL RESIDUAL LEARNING

FOR VISUAL QA

Multimodal Residual Networks (MRN) for the multimodal residual learning of visual question-answering, which extends the idea of the deep residual learning.

Unlike the deep residual learning, MRN effectively learns the joint representation from vision and language information.

The main idea is to use element-wise multiplication for the joint residual mappings exploiting the residual learning of the attentional models.

Various alternative multimodal models are explored.

The attention effect of the joint representations for each learning block is visualized using back-propagation, even though the visual features are collapsed without spatial information.
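A rough sketch of one multimodal residual block as described above: an element-wise product of transformed question and visual features serves as the joint residual mapping, added to a question shortcut. The exact nonlinearities, depths and the extra shortcut embedding are assumptions.

```python
import torch
import torch.nn as nn

class MRNBlock(nn.Module):
    """Element-wise multiplicative fusion as the residual, plus a question shortcut."""
    def __init__(self, d=1024):
        super().__init__()
        self.q_map = nn.Sequential(nn.Linear(d, d), nn.Tanh())
        self.v_map = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, d), nn.Tanh())
        self.shortcut = nn.Linear(d, d)   # extra embedding on the shortcut path (assumed)

    def forward(self, h, v):
        residual = self.q_map(h) * self.v_map(v)   # joint residual mapping
        return self.shortcut(h) + residual          # next joint representation
```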

Page 74: Visual language processing: image/video captioning and question answering

MULTIMODAL RESIDUAL LEARNING

FOR VISUAL QA

Inference flow of Multimodal Residual Networks (MRN). A schematic diagram of

MRNs with 3-block layers.

Page 75: Visual language processing: image/video captioning and question answering

MULTIMODAL RESIDUAL LEARNING

FOR VISUAL QA

Alternative models are explored to justify the model. (a) The base model. (b)

extra embedding for visual modality. (c) extra embeddings for both modalities.

(d) identity mappings for shortcuts. (e) two shortcuts for both modalities.

Eventually, (b) was chosen for its best performance and relative simplicity.

Page 76: Visual language processing: image/video captioning and question answering

MULTIMODAL COMPACT BILINEAR POOLING FOR VISUAL

QUESTION ANSWERING AND VISUAL GROUNDING

Utilize Multimodal Compact Bilinear pooling (MCB) to

efficiently and expressively combine multimodal features.

An architecture which uses MCB twice, once for predicting

attention over spatial features and again to combine the

attended representation with the question representation.

Multimodal Compact Bilinear Pooling for visual question answering.
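A sketch of compact bilinear pooling via count sketch and FFT, which is the usual way MCB is realized; the projection dimension, hash handling and the absence of normalization layers are assumptions.

```python
import torch

def count_sketch(x, h, s, d_out):
    """Count-sketch projection of x (B, d_in) to (B, d_out) with fixed hash h and signs s."""
    sketch = torch.zeros(x.size(0), d_out, device=x.device)
    sketch.index_add_(1, h, x * s)   # accumulate signed features into hashed bins
    return sketch

def mcb(v, q, h_v, s_v, h_q, s_q, d_out=16000):
    """Fuse visual and question features: sketch both, multiply in the frequency domain."""
    fv = torch.fft.rfft(count_sketch(v, h_v, s_v, d_out))
    fq = torch.fft.rfft(count_sketch(q, h_q, s_q, d_out))
    return torch.fft.irfft(fv * fq, n=d_out)

# Example fixed random hashes (chosen once per model), for feature dimension d_in:
# h_v = torch.randint(0, 16000, (d_in,)); s_v = torch.randint(0, 2, (d_in,)).float() * 2 - 1
```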

Page 77: Visual language processing: image/video captioning and question answering

MULTIMODAL COMPACT BILINEAR POOLING FOR VISUAL

QUESTION ANSWERING AND VISUAL GROUNDING

Multimodal Compact Bilinear Pooling (MCB)

Page 78: Visual language processing: image/video captioning and question answering

MULTIMODAL COMPACT BILINEAR POOLING FOR VISUAL

QUESTION ANSWERING AND VISUAL GROUNDING

Architecture for VQA: Multimodal Compact Bilinear (MCB) with Attention.

Conv implies convol. layers and FC implies fully connected layers.

Page 79: Visual language processing: image/video captioning and question answering

MULTIMODAL COMPACT BILINEAR POOLING FOR VISUAL

QUESTION ANSWERING AND VISUAL GROUNDING

Architecture for VQA: MCB with

Attention and Answer Encoding.

Architecture for

Grounding with MCB.

Page 80: Visual language processing: image/video captioning and question answering

TRAINING RECURRENT ANSWERING UNITS

WITH JOINT LOSS MINIMIZATION FOR VQA

Visual question answering based on a RNN, where every module corresponds to a complete answering unit with attention mechanism by itself.

The network is optimized by minimizing loss aggregated from all the units, which share model parameters while receiving different info. to compute attention prob.

For training, the model attends to a region within image feature map, updates its memory based on the question and attended image feature, and answers the question based on its memory state.

Observations: multi-step inferences are often required to answer questions while each problem may have a unique desirable number of steps.

Strategy: make the first unit in the network solve problems, but allow it to learn knowledge from the rest of the units by back-propagation unless it degrades the model; early-stop training of each unit as soon as it starts to overfit; the last answering unit in the unfolded RNN is typically killed first while the first one remains last; for a new question, a single-step prediction is made using the shared model.

This strategy works better than the other options within the framework since the selected model is trained effectively from all units without overfitting.

Page 81: Visual language processing: image/video captioning and question answering

TRAINING RECURRENT ANSWERING UNITS

WITH JOINT LOSS MINIMIZATION FOR VQA

Answering unit comprising subtask embedding, attention and predict operation.

Page 82: Visual language processing: image/video captioning and question answering

TRAINING RECURRENT ANSWERING UNITS

WITH JOINT LOSS MINIMIZATION FOR VQA

Overall architecture of the network. The network is an RNN, where each recurrent unit corresponds to a complete module for visual QA. For training, unfold the network to predict the answer and give supervision at every step. For testing, use a single answering unit to answer a question about an image.

Page 83: Visual language processing: image/video captioning and question answering

HADAMARD PRODUCT FOR LOW-RANK

BILINEAR POOLING

Bilinear models provide richer representations than linear models.

However, bilinear representations are high-dimensional, limiting their applicability to computationally complex tasks.

Low-rank bilinear pooling using the Hadamard product for an efficient attention mechanism in multimodal learning.

A schematic diagram of MLB. The Replicate module copies a question embedding vector to match the S2 visual feature vectors. Conv modules indicate 1 × 1 convolutions to transform a given channel space, which is computationally equivalent to a linear projection over channels.
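Up to the nonlinearities and regularization the paper may use, the low-rank bilinear fusion with the Hadamard product can be written as the following sketch, where q is the question embedding, v a visual feature vector, U and V project both modalities into a shared low-rank space, ∘ is the element-wise product, and P maps the fused vector to the output:

```latex
f \;=\; P^{\top}\!\big(\sigma(U^{\top} q)\,\circ\,\sigma(V^{\top} v)\big)
```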

Page 84: Visual language processing: image/video captioning and question answering

DATASET

• DAQUAR (question answering on real-world images);

• COCO-QA;

• VQA: 0.25M images, 0.76M questions, 10M answers.

Page 85: Visual language processing: image/video captioning and question answering

PART 3: IMAGE/VIDEO GENERATION

Page 86: Visual language processing: image/video captioning and question answering

INTRODUCTION

Statistical natural image modeling remains a fundamental problem in computer vision and image understanding;

Prior work defined image pixel distributions that were restricted to being either unconditioned or conditioned on classification labels;

Learning generative models conditioned on text allows a better understanding of the generalization performance of the model;

Generating high dimensional realistic images from their descriptions combines the two challenging components of language modeling and image generation;

A Variational Auto-Encoder (VAE) can be seen as a neural network with continuous latent variables; the encoder is used to approximate a posterior distribution and the decoder is used to stochastically reconstruct the data from latent variables;

Generative Adversarial Networks (GANs) are generative models that use noise-contrastive estimation to avoid calculating an intractable partition function. The model consists of a generator that generates samples using a uniform distribution and a discriminator that discriminates btw real and generated images.
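For reference, the standard GAN minimax objective behind the generator/discriminator pair mentioned above:

```latex
\min_{G}\max_{D}\; \mathbb{E}_{x\sim p_{\text{data}}}\big[\log D(x)\big] \;+\; \mathbb{E}_{z\sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]
```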

Page 87: Visual language processing: image/video captioning and question answering

CONDITIONAL IMAGE GENERATION WITH

PIXELCNN DECODERS

Conditional image generation with an image density model based on the PixelCNN architecture.

The model can be conditioned on any vector, including descriptive labels or tags, or latent embeddings created by other networks.

When conditioned on class labels from the ImageNet database, the model is able to generate diverse, realistic scenes representing distinct animals, objects, landscapes and structures.

When conditioned on an embedding produced by a convol. net given a single image of an unseen face, it generates a variety of new portraits of the same person with different facial expressions, poses and lighting conditions.

Page 88: Visual language processing: image/video captioning and question answering

CONDITIONAL IMAGE GENERATION WITH

PIXELCNN DECODERS

The basic idea of the architecture is to use AR connections to model images pixel by pixel, decomposing the joint image distribution as a product of conditionals.

PixelRNN: the pixel distributions are modeled with two-dimensional LSTM; PixelRNNs generally give better performance.

PixelCNN: the pixel distributions are modeled with convolutional networks.

PixelCNNs are much faster to train because convolutions are inherently easier to parallelize; given the vast number of pixels present in large image datasets this is an important advantage.

Gated PixelCNN: a gated variant of PixelCNN that matches the log-likelihood of PixelRNN.

Conditional PixelCNN: a conditional variant of the Gated PixelCNN that allows us to model the complex conditional distributions of natural images given a latent vector embedding.
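The pixel-by-pixel factorization referred to above, and its conditional variant given a latent embedding h:

```latex
p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1,\ldots,x_{i-1}), \qquad
p(\mathbf{x}\mid \mathbf{h}) = \prod_{i=1}^{n^2} p(x_i \mid x_1,\ldots,x_{i-1},\, \mathbf{h})
```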

Page 89: Visual language processing: image/video captioning and question answering

CONDITIONAL IMAGE GENERATION WITH

PIXELCNN DECODERS

A single layer in the Gated PixelCNN architecture. Convolution operations are shown in green, element-wise multiplications and additions are shown in red. The convolutions are combined into a single operation shown in blue, which splits the 2p feature maps into two groups of p.

Page 90: Visual language processing: image/video captioning and question answering

GENERATING IMAGES FROM CAPTIONS

WITH ATTENTION

Generate images from natural language descriptions.

Iteratively draws patches on a canvas, while attending to the relevant words.

The alignDRAW model generates images by learning an alignment btw the input captions and the generated canvas. The caption is encoded using a bidirectional RNN (left). The generative RNN takes a latent sequence sampled from the prior and the dynamic caption representation to generate the canvas matrix, which is used to generate the final image x (right).

Page 91: Visual language processing: image/video captioning and question answering

DEEP GENERATIVE IMAGE MODELS USING A LAPLACIAN

PYRAMID OF ADVERSARIAL NETWORKS

A generative parametric model, LAPGAN, capable of producing high quality samples

of natural images.

Uses a cascade of convnets within a Laplacian pyramid framework to generate

images in a coarse-to-fine fashion.

At each level of the pyramid, a separate generative convnet model is trained using the

Generative Adversarial Nets (GAN) approach.

Samples drawn from the model are of higher quality than alternate approaches.

Page 92: Visual language processing: image/video captioning and question answering

DEEP GENERATIVE IMAGE MODELS USING A LAPLACIAN

PYRAMID OF ADVERSARIAL NETWORKS

Page 93: Visual language processing: image/video captioning and question answering

GENERATIVE ADVERSARIAL TEXT TO IMAGE SYNTHESIS

A deep architecture and GAN formulation to effectively bridge SoA techniques in

text and image modeling, translating visual concepts from characters to pixels.

To train a deep convolutional generative adversarial network (DC-GAN)

conditioned on text features encoded by a hybrid character-level CRNN.

Both the generator network G and the discriminator network D perform feed-

forward inference conditioned on the text feature.

Page 94: Visual language processing: image/video captioning and question answering

GENERATIVE ADVERSARIAL TEXT TO IMAGE SYNTHESIS

Page 95: Visual language processing: image/video captioning and question answering

UNSUPERVISED REPRESENTATION LEARNING WITH DEEP

CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORKS

Bridge the gap between the success of CNNs for supervised learning and

unsupervised learning.

A class of CNNs called Deep Convolutional Generative Adversarial

Networks (DCGANs), that have certain architectural constraints, and

demonstrate that they are a strong candidate for unsupervised learning.

Via training, the deep convolutional adversarial pair learns a hierarchy of

representations from object parts to scenes in both generator and discriminator.

Additionally, use the learned features for general image representations.

Page 96: Visual language processing: image/video captioning and question answering

UNSUPERVISED REPRESENTATION LEARNING WITH DEEP

CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORKS

Page 97: Visual language processing: image/video captioning and question answering

UNSUPERVISED REPRESENTATION LEARNING WITH DEEP

CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORKS

Page 98: Visual language processing: image/video captioning and question answering

F-GAN: TRAINING GENERATIVE NEURAL SAMPLERS USING

VARIATIONAL DIVERGENCE MINIMIZATION

Generative neural samplers are probabilistic models that implement

sampling using feed-forward neural networks;

These models are expressive and allow efficient computation of

samples and derivatives, but cannot be used for computing likelihood

or for marginalization;

The generative adversarial training method allows to train such

models through the use of an auxiliary discriminative neural network;

The generative-adversarial approach is a special case of an existing

more general variational divergence estimation approach;

Any f-divergence can be used for training generative neural samplers.

Page 99: Visual language processing: image/video captioning and question answering

F-GAN: TRAINING GENERATIVE NEURAL SAMPLERS USING

VARIATIONAL DIVERGENCE MINIMIZATION

[26] F. Nielsen and R. Nock. On the chi-square and higher-order chi distances for approximating f-divergences. IEEE Signal Processing Letters, 21(1):10–13, 2014.

[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pp. 2672–2680, 2014.

Definition:
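The f-divergence definition referred to here, in its standard form for distributions P and Q with densities p and q and a convex function f with f(1) = 0:

```latex
D_f(P \,\|\, Q) \;=\; \int_{\mathcal{X}} q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx
```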

Page 100: Visual language processing: image/video captioning and question answering

F-GAN: TRAINING GENERATIVE NEURAL SAMPLERS USING

VARIATIONAL DIVERGENCE MINIMIZATION

Variational Divergence Minimization (VDM):

Use the variational lower bound on the f-divergence Df(P|Q) in order to estimate a

generative model Q given a true distribution P;

Use two NNs, generative model Q and variational function T: Q taking as input a

random vector and outputting a sample of interest, parametrizing Q through a

vector θ and write Qθ; T taking as input a sample and returning a scalar,

parametrizing T using a vector ω and write Tω.

Learn a generative model Qθ by finding a saddle point of the following f-GAN objective function, where we minimize with respect to θ and maximize with respect to ω:
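The f-GAN objective referred to above (with f* denoting the Fenchel conjugate of the chosen f):

```latex
F(\theta,\omega) \;=\; \mathbb{E}_{x\sim P}\big[T_{\omega}(x)\big] \;-\; \mathbb{E}_{x\sim Q_{\theta}}\big[f^{*}\!\big(T_{\omega}(x)\big)\big]
```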

Page 101: Visual language processing: image/video captioning and question answering

F-GAN: TRAINING GENERATIVE NEURAL SAMPLERS USING

VARIATIONAL DIVERGENCE MINIMIZATION

Samples from three different divergences

Page 102: Visual language processing: image/video captioning and question answering

ENERGY-BASED GANS

It views the discriminator as an energy function that attributes low energies

to the regions near the data manifold and higher energies to other regions;

A generator is seen as being trained to produce contrastive samples with

minimal energies, while the discriminator is trained to assign high energies

to these generated samples;

Using the discriminator as an energy function allows the use of various architectures and loss functionals in addition to a binary classifier with logistic output;

Instantiation of EBGAN framework as using an auto-encoder architecture,

with the energy being the reconstruction error, in place of the discriminator;

A single-scale architecture can be trained to generate high-resolution

images.

Page 103: Visual language processing: image/video captioning and question answering

ENERGY-BASED GANS

EBGAN architecture with an auto-encoder discriminator.

• Proposes a “repelling regularizer” which fits well into the EBGAN auto-encoder model, to keep the model from producing samples that are clustered in one or a few modes of pdata (similar to “minibatch discrimination” by Salimans et al.);

• Implementing the “repelling regularizer” has a pulling-away (PT) effect at the representation level;

• The PT term is defined as follows.
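A sketch of the PT term, reconstructed from the EBGAN formulation (S_i denotes the i-th sample representation in a batch of N):

```latex
f_{PT}(S) \;=\; \frac{1}{N(N-1)} \sum_{i}\sum_{j\neq i} \left( \frac{S_i^{\top} S_j}{\|S_i\|\,\|S_j\|} \right)^{2}
```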

Page 104: Visual language processing: image/video captioning and question answering

ENERGY-BASED GANS

Generation from LSUN bedroom full images. Left (a): DCGAN generation. Right (b): EBGAN-PT generation.

Page 105: Visual language processing: image/video captioning and question answering

UNSUPERVISED IMAGE-TO-IMAGE TRANSLATION

NETWORKS

UNsupervised Image-to-image Translation (UNIT) framework, which is

based on variational autoencoders and generative adversarial networks.

The framework can learn the translation function without any corresponding

images in two domains.

Combining a weight-sharing constraint and an adversarial training objective.

The UNsupervised Image-to-image Translation (UNIT) network has two encoders E1 and E2, two generators G1 and G2, and two adversarial discriminators D1 and D2, where E1, G1, and D1 are for the 1st domain while E2, G2, and D2 are for the 2nd domain. They are all realized as CNNs. The encoder and generator pair in the same domain forms a VAE, while the generator and adversarial discriminator pair in the same domain forms a GAN.

Page 106: Visual language processing: image/video captioning and question answering

UNSUPERVISED IMAGE-TO-IMAGE TRANSLATION

NETWORKS

Page 107: Visual language processing: image/video captioning and question answering

IMAGE-TO-IMAGE TRANSLATION WITH CONDITIONAL

ADVERSARIAL NETS

Conditional adversarial networks as a general-purpose solution to image-to-image translation problems.

These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping.

It is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks.
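For reference, the conditional adversarial objective this describes is usually written as below, with an L1 reconstruction term weighted by λ (the exact weighting scheme is an assumption here):

```latex
\mathcal{L}_{cGAN}(G,D) = \mathbb{E}_{x,y}[\log D(x,y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x,z)))],\qquad
G^{*} = \arg\min_{G}\max_{D}\ \mathcal{L}_{cGAN}(G,D) + \lambda\,\mathbb{E}_{x,y,z}\big[\|y - G(x,z)\|_{1}\big]
```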

Page 108: Visual language processing: image/video captioning and question answering

IMAGE-TO-IMAGE TRANSLATION WITH CONDITIONAL

ADVERSARIAL NETS

Training a conditional GAN to predict aerial photos from maps. The

discriminator, D, learns to classify between real and synthesized pairs.

The generator learns to fool the discriminator. Unlike an unconditional

GAN, both the generator and discriminator observe an input image.

Page 109: Visual language processing: image/video captioning and question answering

IMAGE-TO-IMAGE TRANSLATION WITH

CONDITIONAL ADVERSARIAL NETS

Page 110: Visual language processing: image/video captioning and question answering

PLUG & PLAY GENERATIVE NETWORKS: CONDITIONAL

ITERATIVE GENERATION OF IMAGES IN LATENT SPACE

Synthesize novel images by performing gradient ascent in the latent space of

a generator network to maximize the activations of one or multiple neurons in

a separate classifier network.

Introducing an additional prior on the latent code, improving both sample

quality and sample diversity, leading to a state-of-the-art generative model

that produces high quality images at higher resolutions (227x227) than

previous generative models, and does so for all 1000 ImageNet categories.

A unified probabilistic interpretation of related activation maximization

methods and call the general class of models "Plug and Play Generative

Networks".

PPGNs are composed of 1) a generator network G that is capable of

drawing a wide range of image types and 2) a replaceable "condition"

network C that tells the generator what to draw.

Improves the state of the art of Multifaceted Feature Visualization, which

generates the set of synthetic inputs that activate a neuron in order to better

understand how deep neural networks operate.

Page 111: Visual language processing: image/video captioning and question answering

PLUG & PLAY GENERATIVE NETWORKS: CONDITIONAL

ITERATIVE GENERATION OF IMAGES IN LATENT SPACE

Deep Generator Network-based Activation Maximization (DGN-AM) involves training a

generator G to create realistic images from compressed features extracted from a

pretrained classifier network E;

To generate images conditioned on a class, an optimization process is launched to

find a hidden code h that G maps to an image that highly activates a neuron in

another classifier C (not necessarily the same as E);

A major limitation with DGN-AM, is the lack of diversity in the generated samples;

Idea: adding a prior on the latent code that keeps optimization along the

manifold of realistic-looking images; to unify and interpret activation maximization

approaches as a type of energy-based model where the energy function is a sum of

multiple constraint terms: (a) priors and (b) conditions;

Metropolis-adjusted Langevin sampling repeatedly adds noise and gradient of log p(x,

y) to generate samples (Markov chain);

Denoising autoencoders estimate required gradient;

Use a special denoising autoencoder that has been trained with multiple losses, including a GAN loss, to obtain the best results.

Page 112: Visual language processing: image/video captioning and question answering

PLUG & PLAY GENERATIVE NETWORKS: CONDITIONAL

ITERATIVE GENERATION OF IMAGES IN LATENT SPACE

Different variants of PPGN models tested. The Noiseless Joint PPGN-h (e) empirically

produces the best images. In all variants, perform iterative sampling following the gradients

of two terms: the condition (red arrows) and the prior (black arrows). (a) PPGN-x: a p(x)

prior modeled via a DAE for images.(b) DGN-AM. (c) PPGN-h: a learned p(h) prior modeled

via a multi-layer perceptron DAE for h. (d) Joint PPGN-h: treating G + E1 + E2 as a DAE

that models h via x. (e) Noiseless Joint PPGN-h. (f) A pre-trained image classification

network (here, AlexNet trained on ImageNet) serves as the encoder network E component.

(g) attaching a recurrent, image-captioning network to the output layer of G.

Page 113: Visual language processing: image/video captioning and question answering

PLUG & PLAY GENERATIVE NETWORKS: CONDITIONAL

ITERATIVE GENERATION OF IMAGES IN LATENT SPACE

Page 114: Visual language processing: image/video captioning and question answering

VIDEO GENERATION

Understanding object motions and scene dynamics is a core problem in computer vision.

Generative video modeling has focused mostly on small patches and has been evaluated for video clustering.

A GAN for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from its background.

Page 115: Visual language processing: image/video captioning and question answering

THANKS!