deep learning for computer vision - graduate center, cuny… · deep learning models are a...

Deep Learning Models for Multimodal Sensing and Processing: A Survey

Presented by: Farnaz Abtahi

Committee Members: Professor Zhigang Zhu (Advisor)

Professor YingLi Tian

Professor Tony Ro

Overview

Multimodal sensing and processing have shown promising results in detection, recognition and identification in various applications.

There are two different ways to generate multiple modalities: via sensor diversity or via feature diversity.

We will focus on deep learning models for multimodal sensing and processing, including:

Deep Belief Networks (DBNs),

Deep Boltzmann Machines (DBMs),

Deep Autoencoders, and

Convolutional Neural Networks (CNNs).

Some of the above models are compared to more traditional multimodal learning approaches. We will review two of them:

Support Vector Machines (SVMs), and

Linear Discriminant Analysis (LDA).

Contents

Introduction: Problems and Solutions

Traditional Models

SVM

SVM for Multimodal Data

LDA

LDA for Multimodal Data

SVM vs. LDA: Comparison

RBM-based Deep Learning Models

RBM

DBNs

DBN for Multimodal Data

DBMs

DBM for Multimodal Data

Deep Autoencoders

Deep Autoencoders for Multimodal Data

Summary of the RBM-based Models

CNNs

CNNs for Multimodal Data

Summary and Conclusions


Introduction: Problems and Solutions

Until a few years ago, most machine learning and signal processing techniques were based on shallow-structured architectures.

These architectures typically contain a single layer of nonlinear feature transformation, which makes them effective in solving many simple or well-constrained problems.

However, their limited modeling and representational power can cause difficulties when dealing with more complicated real-world applications.

Deep learning models are a solution to these problems.

These models are able to automatically extract task-specific features from the data.

Introduction: Problems and Solutions (contd.)

In most real-world applications, dealing with multimodal data is inevitable due to the nature of the task.

This requires machine learning methods capable of efficiently combining knowledge from multiple modalities.

Traditional methods such as SVM handle the task by training a separate SVM on each individual modality and combining the results.

What is missed: the ability to learn the association between different modalities.

Deep learning methods can learn this association through a shared representation of the data.

This shared representation, which reveals the association between different modalities, makes the trained structure a generative model.



Traditional Models: SVM

SVM was introduced in 1992 and has been widely used in classification tasks since then [Boser et al., 1992].

For input/output sets X/Y, the goal is to learn the function y = f(x, α), where α denotes the parameters of the function, in such a way that the margin (the distance between the decision boundary and the closest training samples) is maximized.
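To make the margin criterion concrete, the following minimal sketch compares two separating hyperplanes on a toy dataset. It assumes a linear decision function f(x) = sign(w·x + b); the function name `geometric_margin` and the data are hypothetical and chosen only for illustration.

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Smallest signed distance from any sample to the hyperplane w.x + b = 0."""
    return float(np.min(y * (X @ w + b)) / np.linalg.norm(w))

# Toy two-class data (labels are +1 / -1); values are illustrative only.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])

# Two hyperplanes that both separate the classes.
margin_a = geometric_margin(np.array([1.0, 1.0]), -2.0, X, y)
margin_b = geometric_margin(np.array([1.0, 0.0]), -1.0, X, y)

print(margin_a, margin_b)
```

Both hyperplanes classify every sample correctly, but the first one keeps the closest samples farther away, so a margin-maximizing learner would prefer it.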

Traditional Models: SVM (contd.)

For classes that are not linearly separable, the function f is nonlinear and hard to find.

In this case, the trick is to map the data into a richer feature space and then construct a hyperplane in that space to separate the classes.

This is called the “kernel trick”: a kernel function computes inner products in the richer space without carrying out the mapping explicitly.
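As a small illustration of the idea, the sketch below uses the degree-2 homogeneous polynomial kernel in two dimensions, an example chosen here rather than taken from the slides. It checks that the kernel value equals an inner product in an explicit richer feature space, and that XOR-like data becomes linearly separable there. The names `phi` and `poly_kernel` are hypothetical.

```python
import numpy as np

def phi(x):
    """Explicit feature map matching the kernel k(x, z) = (x.z)^2 in 2-D."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly_kernel(x, z):
    """Same inner product, computed without building the feature map."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
explicit = float(np.dot(phi(x), phi(z)))
implicit = poly_kernel(x, z)

# XOR-like data: not linearly separable in the original 2-D space.
X_xor = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y_xor = np.array([1, 1, -1, -1])

# In the mapped space, the hyperplane with normal (0, 1, 0) separates the classes.
w_feature = np.array([0.0, 1.0, 0.0])
predictions = np.sign(np.array([phi(p) @ w_feature for p in X_xor]))
```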

SVM for Multimodal Data

Using SVM for multi-biometric fusion [Dinerstein et al., 2007]:

Biometric modalities: face, fingerprint, and DNA profile data.

A mediator agent controls the fusion of the individual biometric match scores, using a “bank” of SVMs that covers all possible subsets of the biometric modalities being considered.

The agent selects an appropriate SVM for fusion, based on which modality classifiers are currently available.

This fusion technique differs from a traditional SVM ensemble: rather than combining the output of all of the SVMs, they apply only the SVM that best corresponds to the available modalities.
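The selection logic of such a bank can be sketched as a lookup keyed by the set of available modalities. This is only an illustration of the selection step: the fusers below are equal-weight linear scorers with a fixed threshold standing in for trained SVMs, and all names (`build_bank`, `fuse`) and values are hypothetical, not taken from the cited work.

```python
from itertools import combinations

import numpy as np

MODALITIES = ("face", "fingerprint", "dna")

def build_bank(modalities):
    """One fusion model per non-empty subset of modalities."""
    bank = {}
    for k in range(1, len(modalities) + 1):
        for subset in combinations(modalities, k):
            # Stand-in for a trained SVM: equal weights and a fixed threshold.
            weights = np.full(len(subset), 1.0 / len(subset))
            bank[frozenset(subset)] = (subset, weights, 0.5)
    return bank

def fuse(bank, scores):
    """scores: dict mapping each available modality to a match score in [0, 1]."""
    subset, weights, threshold = bank[frozenset(scores)]
    x = np.array([scores[m] for m in subset])
    return bool(weights @ x > threshold)

bank = build_bank(MODALITIES)
accept_two = fuse(bank, {"face": 0.9, "dna": 0.8})   # only face and DNA available
accept_one = fuse(bank, {"fingerprint": 0.2})        # only fingerprint available
```

Only the model matching the available modalities is evaluated; no outputs are combined across the bank.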

Traditional Models: LDA

LDA [Fisher, 1936] transforms the data into a new space in which the ratio of between-class variance to within-class variance is maximized, thereby guaranteeing maximal separability.
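For two classes, the direction that maximizes this ratio has a closed form: the inverse within-class scatter matrix applied to the difference of the class means. The sketch below computes it on synthetic Gaussian data; the helper names `fisher_direction` and `fisher_ratio` and the data parameters are assumptions made for illustration.

```python
import numpy as np

def fisher_direction(X0, X1):
    """Unit vector maximizing between-class over within-class scatter (two classes)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

def fisher_ratio(w, X0, X1):
    """Between-class to within-class scatter of the data projected onto w."""
    p0, p1 = X0 @ w, X1 @ w
    between = (p1.mean() - p0.mean()) ** 2
    within = p0.var() * len(p0) + p1.var() * len(p1)
    return float(between / within)

rng = np.random.default_rng(0)
cov = [[3.0, 2.5], [2.5, 3.0]]  # strongly correlated features
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X1 = rng.multivariate_normal([2.0, 0.0], cov, size=200)

w_lda = fisher_direction(X0, X1)
mean_diff = X1.mean(axis=0) - X0.mean(axis=0)
w_naive = mean_diff / np.linalg.norm(mean_diff)

ratio_lda = fisher_ratio(w_lda, X0, X1)
ratio_naive = fisher_ratio(w_naive, X0, X1)
```

Projecting onto the plain mean-difference direction ignores the within-class correlation, so its ratio cannot exceed that of the Fisher direction.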


LDA for Multimodal Data

Multimodal biometric user identification based on voice and facial information [Khan et al., 2012]:

Two built-in modules:

Visual recognition system: this module attempts to match the facial features of a user to the user's template in the database. It uses Principal Component Analysis (PCA), LDA and K-Nearest Neighbor (KNN).

Audio recognition system: Mel Frequency Cepstrum Coefficients (MFCCs) are extracted from the raw data.


SVM vs. LDA: Comparison

Two main differences [Gokcen et al., 2002]:

SVM is a classifier, but LDA is often used as a data transformation method.

LDA can be considered a sub-category of SVM: LDA always draws linear boundaries, whereas SVM can also draw nonlinear curves, which could give better performance.



RBM-based Deep Learning Models

We are going to introduce three models:

DBN

DBM

Deep Autoencoder

All these models use Restricted Boltzmann Machines (RBMs) [Fischer et al., 2012] as their building blocks.

An RBM is an undirected probabilistic energy-based graphical model that assigns a scalar energy value to each variable configuration.

The model is trained in such a way that plausible configurations are associated with lower energies (higher probabilities).

The model is called “restricted” because there are no connections between the visible units or between the hidden units.

The energy function (bias terms omitted): $E(x, h) = -h^\top W x$

Probability distribution: $p(x, h) \propto \exp(-E(x, h))$
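For a very small RBM, the two formulas above can be evaluated exhaustively. The following sketch assumes binary units, two visible and two hidden, and an arbitrary weight matrix `W` invented for illustration; it enumerates every configuration, computes its energy, and normalizes to obtain the joint distribution.

```python
import itertools

import numpy as np

# Hypothetical weights, shape (n_hidden, n_visible).
W = np.array([[2.0, -1.0],
              [0.5, 1.5]])

def energy(x, h, W):
    """E(x, h) = -h' W x, with bias terms omitted as in the text."""
    return -float(h @ W @ x)

# All 16 joint configurations of two binary visible and two binary hidden units.
configs = [(np.array(x, dtype=float), np.array(h, dtype=float))
           for x in itertools.product([0, 1], repeat=2)
           for h in itertools.product([0, 1], repeat=2)]

energies = np.array([energy(x, h, W) for x, h in configs])
unnormalized = np.exp(-energies)
probs = unnormalized / unnormalized.sum()  # divide by the partition function

best_x, best_h = configs[int(np.argmax(probs))]
```

The configuration with the lowest energy receives the highest probability. Exhaustive enumeration is only feasible at this toy size, which is why training relies on sampling.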

RBM-based Deep Learning Models (contd.)

Training algorithm (simplified):

1. Set $x$ equal to a training sample.

2. Generate a sample $\tilde{h}$ from $p(h \mid x)$, which depends on $Wx$.

3. Use $\tilde{h}$ to generate a reconstruction $\tilde{x}$ from $p(x \mid h)$, which depends on $W^\top \tilde{h}$.

4. Update $W$ based on the difference between $x$ and $\tilde{x}$.

5. Go back to step 2 unless the difference is below some threshold (convergence).
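The training steps can be sketched as follows. The slide only says that W is updated from the difference between the sample and its reconstruction; the concrete update used here is a one-step contrastive-divergence rule, which is an assumption, as are the binary units, the sigmoid conditionals, the learning rate and the function name `cd1_update`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(W, x, rng, lr=0.1):
    """One simplified training step for a binary RBM without bias terms.

    W has shape (n_hidden, n_visible); x is a binary vector of length n_visible.
    """
    # Step 2: sample the hidden units from p(h | x), which depends on W x.
    p_h = sigmoid(W @ x)
    h_sample = (rng.random(p_h.shape) < p_h).astype(float)
    # Step 3: sample a reconstruction from p(x | h), which depends on W' h.
    p_x = sigmoid(W.T @ h_sample)
    x_recon = (rng.random(p_x.shape) < p_x).astype(float)
    # Step 4: move W toward the data statistics and away from the reconstruction's.
    p_h_recon = sigmoid(W @ x_recon)
    W_new = W + lr * (np.outer(p_h, x) - np.outer(p_h_recon, x_recon))
    error = float(np.mean((x - x_recon) ** 2))
    return W_new, error

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((3, 6))
x = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])  # step 1: a training sample

# Step 5, using a fixed iteration budget instead of a convergence threshold.
for _ in range(100):
    W, error = cd1_update(W, x, rng)
```

Because the reconstruction is sampled, the error fluctuates from step to step; a practical stopping rule would average it over many samples.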



Deep Belief Networks [Hinton, 2009]

Training has two phases:

Unsupervised pretraining:

Every two consecutive layers form an RBM.

Each RBM is trained using the algorithm explained earlier.

Labels are not taken into account.

This phase extracts information hidden in the data.

Hidden layers can be used as features.

Supervised training (refining the parameters):

An extra layer is added to the model for labels.

The DBN is fine-tuned as if it were a traditional multi-layered neural network, using error backpropagation on the labels.
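The unsupervised pretraining phase can be sketched as a loop that trains one RBM per pair of consecutive layers and feeds each layer's hidden activations to the next. This is a minimal sketch under several assumptions: binary RBMs without biases, a contrastive-divergence-style update with a mean-field reconstruction, and hypothetical names (`train_rbm`, `pretrain_stack`), layer sizes and hyperparameters. The supervised fine-tuning phase is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, rng, epochs=5, lr=0.1):
    """Train one bias-free RBM on the rows of `data`; labels are never used."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_hidden, n_visible))
    for _ in range(epochs):
        for x in data:
            p_h = sigmoid(W @ x)
            h = (rng.random(n_hidden) < p_h).astype(float)
            x_recon = sigmoid(W.T @ h)          # mean-field reconstruction
            p_h_recon = sigmoid(W @ x_recon)
            W += lr * (np.outer(p_h, x) - np.outer(p_h_recon, x_recon))
    return W

def pretrain_stack(data, layer_sizes, rng):
    """Greedy layer-wise pretraining: each RBM sees the previous layer's output."""
    weights, layer_input = [], data
    for n_hidden in layer_sizes:
        W = train_rbm(layer_input, n_hidden, rng)
        weights.append(W)
        layer_input = sigmoid(layer_input @ W.T)  # hidden activations as features
    return weights, layer_input

rng = np.random.default_rng(0)
data = (rng.random((20, 8)) < 0.5).astype(float)  # toy binary data
weights, features = pretrain_stack(data, [6, 4], rng)
```

The returned `features` are the top-layer activations, which is the sense in which hidden layers can be used as features.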

Deep Belief Networks (contd.)

Testing the model on a sample also has two steps:

Generating the label:

Make a single pass through all layers except the last two.

Sample from the last two layers by going up and down until convergence.

Pass the generated sample to the last “classification” layer to find the label.

Comparing the generated label with the ground truth.

The same idea is used for generating a sample from the model.

DBNs for Multimodal Data

DBNs for learning the joint representation of the data [Srivastava et al., 2012]:

Two modalities:

Text

Image

The model can deal with missing modalities and can be used for both image retrieval and image annotation.

The joint representation learned by the DBN showed superior results compared to SVM and LDA.

DBNs for Multimodal Data (contd.)

Some results from this work:

[Figure: examples of data from the MIR Flickr dataset, along with text generated from the DBN by sampling from $P(v_{txt} \mid v_{img}, \theta)$]

DBNs for Multimodal Data (contd.)

[Figure: examples of data from the MIR Flickr dataset, along with text generated from the DBN by sampling from $P(v_{img} \mid v_{txt}, \theta)$]



Deep Boltzmann Machines [Salakhutdinov, 2009]

The main difference between a DBM and a DBN is that in a DBM, all links between the layers are undirected (in other words, bidirectional).

Training is very similar to that of DBNs.

To test the model on a sample, or to generate a sample from the model, sampling is done for every two consecutive layers (every RBM) until convergence.

DBMs for Multimodal Data

DBMs for learning the joint representation of data [Srivastava et al., 2012]:

Two modalities:

Text

Image

Similar to the previous work from the same group [Srivastava et al., 2012], the multimodal DBM is constructed using an image-text bimodal DBM.

DBMs for Multimodal Data (contd.)

Some results from this work:



Deep Autoencoders [Bengio, 2009]

A deep autoencoder is composed of two symmetrical DBNs, each of which typically has four or five layers.

The first DBN represents the encoding half of the autoencoder, and the second DBN makes up the decoding half.

The goal is to optimize the weights in both blocks in order to minimize the reconstruction error.

The “denoising autoencoder” (dA) is an extension of the classical autoencoder [Ngiam et al., 2011].

The idea behind denoising autoencoders is that, in order to force the hidden layer to discover more robust features and prevent it from simply learning the identity, we train the autoencoder to reconstruct the input from a corrupted version of it.
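The denoising idea can be sketched with a single-hidden-layer autoencoder, a deliberately shallow stand-in for the deep model described above. The sketch assumes masking noise (randomly zeroing inputs), a sigmoid hidden layer, a linear output layer, squared error and plain gradient descent; the function names, sizes and hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X_in):
    W1, b1, W2, b2 = params
    H = sigmoid(X_in @ W1.T + b1)      # encoder
    return H, H @ W2.T + b2            # decoder (linear output)

def loss(params, X_in, X_clean):
    _, Y = forward(params, X_in)
    return 0.5 * float(np.sum((Y - X_clean) ** 2)) / len(X_clean)

def train_step(params, X_clean, rng, drop_prob=0.3, lr=0.1):
    W1, b1, W2, b2 = params
    # Corrupt the input, but keep the clean input as the reconstruction target.
    X_noisy = X_clean * (rng.random(X_clean.shape) >= drop_prob)
    H, Y = forward(params, X_noisy)
    dY = (Y - X_clean) / len(X_clean)
    dZ = (dY @ W2) * H * (1.0 - H)     # backpropagate through the sigmoid
    return (W1 - lr * dZ.T @ X_noisy, b1 - lr * dZ.sum(axis=0),
            W2 - lr * dY.T @ H, b2 - lr * dY.sum(axis=0))

rng = np.random.default_rng(0)
X = (rng.random((50, 6)) < 0.5).astype(float)   # toy binary data
params = (0.1 * rng.standard_normal((4, 6)), np.zeros(4),
          0.1 * rng.standard_normal((6, 4)), np.zeros(6))

X_eval = X * (rng.random(X.shape) >= 0.3)       # fixed corrupted copy for evaluation
loss_before = loss(params, X_eval, X)
for _ in range(300):
    params = train_step(params, X, rng)
loss_after = loss(params, X_eval, X)
```

Because the target is always the clean input, copying the corrupted input through unchanged is penalized, which is what discourages the identity mapping.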


Deep Autoencoders for Multimodal Data

Different settings for employing deep autoencoders to learn the multimodal data representation [Ngiam et al., 2011]:

Two modalities:

Speech audio

Video of the lips

Three learning settings are considered:

Multimodal fusion

Cross modality learning

Shared representation learning

Deep Autoencoders for Multimodal Data (contd.)

The bimodal deep autoencoder was trained in a denoising fashion, using an augmented dataset with examples that require the network to reconstruct both modalities given only one.
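One way to build such an augmented dataset is sketched below. It assumes that a missing modality is represented by zeros and that every example is included three times (both modalities, audio only, video only); the function name `augment_bimodal` and the feature sizes are hypothetical.

```python
import numpy as np

def augment_bimodal(audio, video):
    """Return (inputs, targets) where targets always contain both modalities."""
    zeros_a, zeros_v = np.zeros_like(audio), np.zeros_like(video)
    inputs = np.vstack([
        np.hstack([audio, video]),     # both modalities present
        np.hstack([audio, zeros_v]),   # audio only, video zeroed out
        np.hstack([zeros_a, video]),   # video only, audio zeroed out
    ])
    targets = np.tile(np.hstack([audio, video]), (3, 1))
    return inputs, targets

rng = np.random.default_rng(0)
audio = rng.random((4, 3))   # 4 examples, 3 audio features
video = rng.random((4, 2))   # 4 examples, 2 video features
inputs, targets = augment_bimodal(audio, video)
```

Training a network on these pairs forces it to reconstruct the absent modality from the present one.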

Deep Autoencoders for Multimodal Data (contd.)

They also tried the “Hearing to see” and “Seeing to hear” ideas by combining the shared representation with a classifier.

The figure shows the “Hearing to see” setting.

The classification accuracies were 29.4% and 27.5%, respectively.



Summary of the RBM-based Models [Deng, 2012]


Convolutional Neural Networks [LeCun et al., 1995]

CNNs are biologically inspired multi-layer neural networks specifically adapted for computer vision problems and visual object recognition.

The idea of CNNs is similar to the mechanism of the visual cortex: features are extracted consecutively by convolving with filter banks and reducing the resolution by subsampling.

Convolutional Neural Networks (contd.)

[Figure: consecutive convolution and subsampling]
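One convolution stage followed by one subsampling stage can be sketched in plain NumPy. The sketch assumes a single hand-picked filter, "valid" borders, the cross-correlation form of convolution used by most CNN implementations, and average subsampling over 2x2 blocks; the function names and the toy image are hypothetical.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def subsample(feature_map, size=2):
    """Reduce resolution by averaging non-overlapping size x size blocks."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size          # drop rows/columns that do not fit
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.mean(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)   # toy image with a horizontal ramp
edge_filter = np.array([[1.0, -1.0]])              # responds to horizontal changes

feature_map = conv2d_valid(image, edge_filter)
pooled = subsample(feature_map)
```

In a full network, a learned bank of filters replaces the single hand-picked one, and the two stages are repeated several times.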

CNNs for Multimodal Data

CNN-based multimodal learning method for RGB-D object recognition [Wang et al., 2015]:

Two CNNs are built to learn feature representations for color and depth separately.

The CNNs are then connected with a final multimodal layer.

This layer is designed not only to discover the most discriminative features for each modality, but also to harness the complementary relationship between the two modalities.
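The overall two-stream layout can be sketched as a forward pass. This is only a structural illustration: each CNN stream is replaced by a single linear map with a ReLU, and the final multimodal layer is a plain linear layer on the concatenated features, not the large-margin formulation of the cited work. All names, sizes and weights are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def stream_features(x, W):
    """Stand-in for one CNN stream: a linear map followed by ReLU."""
    return relu(x @ W)

def multimodal_scores(rgb, depth, params):
    """Extract features per modality, then combine them in one joint layer."""
    f_rgb = stream_features(rgb, params["W_rgb"])
    f_depth = stream_features(depth, params["W_depth"])
    joint = np.concatenate([f_rgb, f_depth], axis=1)
    return joint @ params["W_fusion"]

rng = np.random.default_rng(0)
params = {
    "W_rgb": rng.standard_normal((12, 8)),     # 12 color inputs -> 8 features
    "W_depth": rng.standard_normal((10, 8)),   # 10 depth inputs -> 8 features
    "W_fusion": rng.standard_normal((16, 3)),  # 16 joint features -> 3 classes
}
rgb = rng.standard_normal((5, 12))
depth = rng.standard_normal((5, 10))
scores = multimodal_scores(rgb, depth, params)
predicted = scores.argmax(axis=1)
```

Because the fusion weights see both feature vectors at once, they can weight each modality's features and exploit combinations of the two.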



Summary: Shallow vs. Deep

Problems of shallow-structured architectures:

They lack multiple layers of adaptive nonlinear features.

Features are extracted with traditional engineered feature extraction methods and are manually obtained from the raw data.

Their limited modeling and representational power can cause difficulties when dealing with more complicated real-world applications involving natural signals such as human speech, natural sound and language, and natural images and visual scenes.

In such cases, methods that are able to automatically extract task-specific features from the data are much more desirable.

Solutions: Deep learning models

In this survey, we reviewed DBNs, DBMs, Deep Autoencoders and CNNs, and their applications in multimodal data processing.

Summary: Multimodality and Solutions

In most real-world applications, dealing with multimodal data is inevitable due to the nature of the task.

This requires machine learning methods that are capable of efficiently combining their learned knowledge from multiple modalities.

In traditional machine learning methods such as SVM, multimodal learning is performed by training a separate SVM on each individual modality and combining the results by voting, weighted averaging or other probabilistic methods. This does not combine information efficiently.

A very important aspect of multimodal learning that is missed in these approaches is the ability to learn the association between different modalities.

This can be achieved by utilizing deep learning methods, as they are capable of extracting task-specific features from the data and learning the relationship between modalities through a shared representation.

This shared representation of the data, which reveals the association between different modalities, makes the trained structure a generative model.
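For comparison, the traditional combination step (voting or weighted averaging over per-modality classifier outputs) can be sketched as follows. The per-modality predictions are made-up numbers standing in for the outputs of separately trained classifiers, and the function names are hypothetical.

```python
import numpy as np

def majority_vote(labels):
    """labels: int array (n_modalities, n_samples) of predicted class indices."""
    n_classes = int(labels.max()) + 1
    counts = np.stack([(labels == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)       # ties go to the lowest class index

def weighted_average(probs, weights):
    """probs: (n_modalities, n_samples, n_classes); weights: (n_modalities,)."""
    fused = np.tensordot(weights, probs, axes=1) / weights.sum()
    return fused.argmax(axis=1)

# Three modalities voting on three samples.
labels = np.array([[0, 1, 1],
                   [0, 0, 1],
                   [1, 0, 1]])
voted = majority_vote(labels)

# Two modalities, two samples, two classes; the first modality is trusted more.
probs = np.array([[[0.9, 0.1], [0.4, 0.6]],
                  [[0.2, 0.8], [0.3, 0.7]]])
averaged = weighted_average(probs, np.array([3.0, 1.0]))
```

Each classifier is trained in isolation, so nothing in this scheme can capture how the modalities relate to one another; only their final outputs meet.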

References

1. Boser, Bernhard E., Isabelle M. Guyon, and Vladimir N. Vapnik. "A training algorithm for optimal margin classifiers." Proceedings of the Fifth Annual Workshop on Computational Learning Theory. ACM, 1992.

2. Dinerstein, Sabra, Jonathan Dinerstein, and Dan Ventura. "Robust multi-modal biometric fusion via multiple SVMs." Systems, Man and Cybernetics, 2007. ISIC. IEEE International Conference on. IEEE, 2007.

3. Fisher, Ronald A. "The use of multiple measurements in taxonomic problems." Annals of Eugenics 7.2 (1936): 179-188.

4. Khan, Aamir, et al. "A Multimodal Biometric System Using Linear Discriminant Analysis For Improved Performance." arXiv preprint arXiv:1201.3720 (2012).

5. Gokcen, Ibrahim, and Jing Peng. "Comparing linear discriminant analysis and support vector machines." Advances in Information Systems. Springer Berlin Heidelberg, 2002. 104-113.

6. Fischer, Asja, and Christian Igel. "An introduction to restricted Boltzmann machines." Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. Springer Berlin Heidelberg, 2012. 14-36.

7. Hinton, Geoffrey E. "Deep belief networks." Scholarpedia 4.5 (2009): 5947.

8. Srivastava, Nitish, and Ruslan Salakhutdinov. "Learning representations for multimodal data with deep belief nets." International Conference on Machine Learning Workshop. 2012.

9. Salakhutdinov, Ruslan, and Geoffrey E. Hinton. "Deep Boltzmann machines." International Conference on Artificial Intelligence and Statistics. 2009.

10. Srivastava, Nitish, and Ruslan R. Salakhutdinov. "Multimodal learning with deep Boltzmann machines." Advances in Neural Information Processing Systems. 2012.

11. Bengio, Yoshua. "Learning deep architectures for AI." Foundations and Trends® in Machine Learning 2.1 (2009): 1-127.

12. Ngiam, Jiquan, et al. "Multimodal deep learning." Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.

13. Deng, Li. "Three classes of deep learning architectures and their applications: a tutorial survey." APSIPA Transactions on Signal and Information Processing (2012).

14. LeCun, Yann, and Yoshua Bengio. "Convolutional networks for images, speech, and time series." The Handbook of Brain Theory and Neural Networks 3361.10 (1995).

15. Wang, Anran, et al. "Large-Margin Multi-Modal Deep Learning for RGB-D Object Recognition." Multimedia, IEEE Transactions on 17.11 (2015): 1887-1898.

Thank you!
