Deep Learning for Computer Vision - Graduate Center, CUNY…
Deep Learning Models for Multimodal
Sensing and Processing: A Survey
Presented by: Farnaz Abtahi
Committee Members: Professor Zhigang Zhu (Advisor)
Professor YingLi Tian
Professor Tony Ro
Overview
Multimodal sensing and processing have shown promising results in detection, recognition and identification in various applications.
There are two different ways to generate multiple modalities: via sensor diversity, or via feature diversity.
We will focus on deep learning models for multimodal sensing and processing, including:
Deep Belief Networks (DBNs),
Deep Boltzmann Machines (DBMs),
Deep Autoencoders, and
Convolutional Neural Networks (CNNs).
Some of the above models are compared to more traditional multimodal learning approaches. We will review a couple of them, including:
Support Vector Machines (SVMs), and
Linear Discriminant Analysis (LDA).
Contents
Introduction: Problems and Solutions
Traditional Models
  SVM
  SVM for Multimodal Data
  LDA
  LDA for Multimodal Data
  SVM vs. LDA: Comparison
RBM-based Deep Learning Models
  RBM
  DBNs
  DBN for Multimodal Data
  DBMs
  DBM for Multimodal Data
  Deep Autoencoders
  Deep Autoencoders for Multimodal Data
  Summary of the RBM-based Models
CNNs
  CNNs for Multimodal Data
Summary and Conclusions
Introduction: Problems and Solutions
Until a few years ago, most machine learning and signal processing techniques were based on shallow-structured architectures.
These architectures typically contain a single layer of nonlinear feature transformations, which makes them effective in solving many simple or well-constrained problems.
However, their limited modeling and representational power can cause difficulties when dealing with more complicated real-world applications.
Deep learning models are a solution to the above problems.
These models are able to automatically extract task-specific features from the data.
Introduction: Problems and Solutions (contd.)
In most real-world applications, dealing with multimodal data is inevitable due to the nature of the task.
This requires machine learning methods capable of efficiently combining knowledge from multiple modalities.
Traditional methods such as SVM do the task by training a separate SVM on each individual modality and combining the results.
What is missed: the ability to learn the association between different modalities.
Deep learning methods can learn this association through a shared representation of the data.
This shared representation, which reveals the association between different modalities, makes the trained structure a generative model.
Traditional Models: SVM
SVM was introduced in 1992 and has been widely used in
classification tasks since then [Boser et al., 1992].
For input/output sets X/Y, the goal is to learn the function
y = f(x, α), where α are the parameters of the function, in
such a way that the margin between the two classes is maximized.
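To make the margin criterion concrete, here is a minimal sketch that compares two separating hyperplanes on a toy data set. The data points, the two candidate hyperplanes and the helper name `geometric_margin` are illustrative assumptions, not part of the original slides; a real SVM would search for the maximizing hyperplane instead of comparing two fixed ones.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any labelled point to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Toy 2-D data (hypothetical): class -1 on the left, class +1 on the right.
points = [(-2.0, 0.0), (-1.0, 1.0), (1.0, -1.0), (2.0, 0.0)]
labels = [-1, -1, 1, 1]

# Both hyperplanes separate the classes, but with different margins.
margin_a = geometric_margin((1.0, 0.0), 0.0, points, labels)   # boundary x1 = 0
margin_b = geometric_margin((1.0, 0.5), 0.0, points, labels)   # tilted boundary
print(margin_a, margin_b)
```

Both candidates classify every point correctly; the maximum-margin criterion prefers the first one because its closest point lies farther from the boundary.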
Traditional Models: SVM (contd.)
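For classes that are not linearly separable, the function f is nonlinear and hard to find.
In this case, the trick is to map the data into a richer feature space and then construct a hyperplane in that space to separate the classes.
Computing inner products in that space through a kernel function, without building the mapping explicitly, is called “the kernel trick”.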
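A small sketch of the idea, assuming a degree-2 polynomial kernel on scalar inputs (the kernel choice, the data and all function names are illustrative assumptions). It checks two things: the classes become separable after the mapping, and the kernel value equals the inner product of the explicitly mapped points.

```python
import math

def poly_kernel(x, z):
    """Degree-2 polynomial kernel on scalars: k(x, z) = (x*z + 1)^2."""
    return (x * z + 1.0) ** 2

def feature_map(x):
    """Explicit feature map whose inner product equals poly_kernel."""
    return (1.0, math.sqrt(2.0) * x, x * x)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# 1-D toy data (hypothetical): the positive class lies on both sides of the
# negative class, so no single threshold on x separates them.
xs = [-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0]
ys = [1, 1, -1, -1, -1, 1, 1]

# In the mapped space, the plane x^2 = 1 (third coordinate) separates the classes.
separable_in_feature_space = all(
    (feature_map(x)[2] > 1.0) == (y == 1) for x, y in zip(xs, ys)
)

# The kernel returns the feature-space inner product without building the features.
max_gap = max(
    abs(poly_kernel(x, z) - dot(feature_map(x), feature_map(z)))
    for x in xs for z in xs
)
print(separable_in_feature_space, max_gap)
```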
SVM for Multimodal Data
Using SVM for multi-biometric fusion [Dinerstein et al., 2007]:
Biometric modalities: face, fingerprint, and DNA profile data.
A mediator agent controls the fusion of the individual biometric match scores, using a “bank” of SVMs that covers all possible subsets of the biometric modalities being considered.
The agent selects an appropriate SVM for fusion, based on which modality classifiers are currently available.
This fusion technique differs from a traditional SVM ensemble: rather than combining the output of all of the SVMs, the authors apply only the SVM that best corresponds to the available modalities.
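The selection logic of the mediator agent can be sketched as a lookup keyed by the set of available modalities. This is only an illustration of the selection idea: the linear scoring functions stand in for trained SVMs, and the weights, bias and function names are made-up assumptions, not the models from [Dinerstein et al., 2007].

```python
from itertools import combinations

MODALITIES = ("face", "fingerprint", "dna")

def make_linear_fuser(weights, bias):
    """Stand-in for a trained SVM: a linear decision function over match scores."""
    def decide(scores):
        return sum(weights[m] * scores[m] for m in weights) + bias
    return decide

# One fusion model per non-empty subset of modalities (weights are made up).
bank = {}
for k in range(1, len(MODALITIES) + 1):
    for subset in combinations(MODALITIES, k):
        weights = {m: 1.0 / k for m in subset}
        bank[frozenset(subset)] = make_linear_fuser(weights, bias=-0.5)

def fuse(scores):
    """Mediator: apply only the model trained for exactly the available modalities."""
    model = bank[frozenset(scores)]
    return model(scores) > 0.0

# The fingerprint matcher is unavailable, so the {face, dna} model is selected.
accept = fuse({"face": 0.9, "dna": 0.7})
print(len(bank), accept)
```

Keying the bank by `frozenset` means a missing modality changes which model is used, instead of feeding a placeholder score into a model trained on all three.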
Traditional Models: LDA
LDA [Fisher, 1936] transforms the data into a new space in which the ratio of between-class variance to within-class variance is maximized, which yields the best class separability under this criterion.
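For two classes, the direction that maximizes this ratio has a closed form, proportional to Sw^{-1}(m1 - m0), where Sw is the within-class scatter matrix and m0, m1 are the class means. The sketch below computes it for hypothetical 2-D data and compares the ratio along that direction with the ratio along the coordinate axes; the data and helper names are assumptions made for illustration.

```python
def mean(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

def scatter(points, m):
    """2x2 within-class scatter matrix: sum of (p - m)(p - m)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_ratio(w, m0, m1, sw):
    """Between-class over within-class variance of the data projected onto w."""
    between = (w[0] * (m1[0] - m0[0]) + w[1] * (m1[1] - m0[1])) ** 2
    within = sum(w[i] * sw[i][j] * w[j] for i in range(2) for j in range(2))
    return between / within

class0 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (4.0, 5.0)]
class1 = [(2.0, 0.0), (3.0, 1.0), (4.0, 1.0), (5.0, 3.0)]

m0, m1 = mean(class0), mean(class1)
s0, s1 = scatter(class0, m0), scatter(class1, m1)
sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]

# Closed-form two-class solution: w is proportional to Sw^{-1} (m1 - m0).
det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
diff = [m1[0] - m0[0], m1[1] - m0[1]]
w = [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
     (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]

ratio_lda = fisher_ratio(w, m0, m1, sw)
ratio_x_axis = fisher_ratio([1.0, 0.0], m0, m1, sw)
ratio_y_axis = fisher_ratio([0.0, 1.0], m0, m1, sw)
print(ratio_lda, ratio_x_axis, ratio_y_axis)
```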
LDA for Multimodal Data
Multimodal biometric user identification based on voice and facial information [Khan et al., 2012]:
Two built-in modules:
Visual recognition system: this module attempts to match the facial features of a user to the user's template in the database. It uses Principal Component Analysis (PCA), LDA and K-Nearest Neighbor (KNN).
Audio recognition system: Mel Frequency Cepstrum Coefficients (MFCCs) are extracted from the raw data.
SVM vs. LDA: Comparison
Two main differences [Gokcen et al., 2002]:
SVM is a classifier, but LDA is often used as a data transformation method.
LDA is more restricted than SVM: LDA always produces linear decision boundaries, whereas SVM with a nonlinear kernel can produce curved ones, which could give better performance.
RBM-based Deep Learning Models
We are going to introduce three models:
DBN
DBM
Deep Autoencoder
All these models use Restricted Boltzmann Machines (RBMs) [Fischer et al., 2012] as their building blocks.
An RBM is an undirected probabilistic energy-based graphical model that assigns a scalar energy value to each variable configuration.
The model is trained so that plausible configurations are associated with lower energies (higher probabilities).
The model is called “restricted” because there are no connections between the visible units or between the hidden units.
The energy function: $E(x, h) = -h^\top W x$
Probability distribution: $p(x, h) \propto \exp(-E(x, h))$
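For a very small RBM, the energy and the probability distribution can be computed exactly by enumerating every binary configuration. The sketch below does this for two visible and two hidden units; the weight matrix is a made-up assumption, and biases are left out to match the energy function above.

```python
import math
from itertools import product

# Row j holds the weights from hidden unit j to each visible unit (made-up values).
W = [[2.0, -2.0],
     [-2.0, 3.0]]

def energy(x, h):
    """E(x, h) = -h' W x for binary vectors x (visible) and h (hidden)."""
    return -sum(h[j] * W[j][i] * x[i] for j in range(len(h)) for i in range(len(x)))

states = [(x, h) for x in product((0, 1), repeat=2) for h in product((0, 1), repeat=2)]
unnormalized = {s: math.exp(-energy(*s)) for s in states}
Z = sum(unnormalized.values())                  # partition function
p = {s: v / Z for s, v in unnormalized.items()}

lowest_energy_state = min(states, key=lambda s: energy(*s))
most_probable_state = max(p, key=p.get)
print(lowest_energy_state, most_probable_state)
```

The configuration with the lowest energy is also the most probable one, which is the sense in which training "lowers the energy" of plausible configurations. Exact enumeration is only feasible for toy sizes because the number of states doubles with every unit.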
RBM-based Deep Learning Models (contd.)
Training algorithm (simplified):
1. Set $x$ equal to a training sample
2. Generate a sample $\tilde{h}$ from $p(h \mid x)$, which depends on $Wx$
3. Use $\tilde{h}$ to generate $\tilde{x}$ from $p(x \mid h)$, which depends on $W^\top \tilde{h}$
4. Update $W$ based on the difference between $x$ and $\tilde{x}$
5. Go back to step 2 unless the difference is below some threshold (convergence)
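The steps above can be sketched as one-step contrastive divergence (CD-1) for a bias-free binary RBM. Several details are assumptions of this sketch and not stated on the slide: the update uses the difference between data-driven and reconstruction-driven correlations, the loop runs for a fixed number of epochs instead of testing a convergence threshold, and the learning rate, data and function names are invented for illustration.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sample(probs, rng):
    """Draw one binary value per unit from its activation probability."""
    return [1 if rng.random() < p else 0 for p in probs]

def hidden_probs(W, x):
    return [sigmoid(sum(w_ji * x_i for w_ji, x_i in zip(row, x))) for row in W]

def visible_probs(W, h):
    n_visible = len(W[0])
    return [sigmoid(sum(W[j][i] * h[j] for j in range(len(W)))) for i in range(n_visible)]

def cd1_step(W, x, lr, rng):
    """One CD-1 update of W in place; returns the squared reconstruction error."""
    ph = hidden_probs(W, x)                  # step 2: p(h | x)
    h_tilde = sample(ph, rng)
    px = visible_probs(W, h_tilde)           # step 3: p(x | h~)
    x_tilde = sample(px, rng)
    ph_tilde = hidden_probs(W, x_tilde)
    for j in range(len(W)):                  # step 4: data minus reconstruction statistics
        for i in range(len(x)):
            W[j][i] += lr * (ph[j] * x[i] - ph_tilde[j] * x_tilde[i])
    return sum((a - b) ** 2 for a, b in zip(x, x_tilde))

rng = random.Random(0)
data = [[1, 1, 0, 0], [0, 0, 1, 1]]          # two toy binary patterns
W = [[rng.uniform(-0.01, 0.01) for _ in range(4)] for _ in range(2)]

for epoch in range(200):                     # fixed epoch count (assumption)
    for x in data:
        cd1_step(W, x, lr=0.1, rng=rng)

print(hidden_probs(W, data[0]))
```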
Deep Belief Networks [Hinton, 2009]
Training has two phases:
Unsupervised pretraining:
Every two consecutive layers form an RBM.
Each RBM is trained using the algorithm
explained earlier.
Labels are not taken into account.
This phase extracts information hidden in the
data.
Hidden layers can be used as features.
Supervised training (refining the parameters):
An extra layer is added to the model for labels.
The DBN is fine-tuned as if it were a traditional multi-layer neural network, using error backpropagation on the labels.
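The unsupervised pretraining phase can be sketched as a greedy loop in which each RBM is trained on the hidden activations of the RBM below it. To keep the sketch short, it uses a mean-field variant of CD-1 (probabilities instead of binary samples) and omits biases and the supervised fine-tuning phase; these simplifications, the layer sizes and the function names are all assumptions.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def up(W, x):
    """Hidden activation probabilities p(h | x) for a bias-free RBM."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W]

def down(W, h):
    """Visible activation probabilities p(x | h)."""
    return [sigmoid(sum(W[j][i] * h[j] for j in range(len(W)))) for i in range(len(W[0]))]

def train_rbm(data, n_hidden, rng, epochs=50, lr=0.1):
    """Mean-field CD-1: probabilities are propagated instead of samples."""
    n_visible = len(data[0])
    W = [[rng.uniform(-0.01, 0.01) for _ in range(n_visible)] for _ in range(n_hidden)]
    for _ in range(epochs):
        for x in data:
            h = up(W, x)
            x_rec = down(W, h)
            h_rec = up(W, x_rec)
            for j in range(n_hidden):
                for i in range(n_visible):
                    W[j][i] += lr * (h[j] * x[i] - h_rec[j] * x_rec[i])
    return W

def pretrain_dbn(data, layer_sizes, rng):
    """Greedy layer-wise pretraining; labels are never used."""
    weights, features = [], data
    for n_hidden in layer_sizes:
        W = train_rbm(features, n_hidden, rng)
        weights.append(W)
        features = [up(W, x) for x in features]   # hidden activations feed the next RBM
    return weights, features

rng = random.Random(0)
data = [[1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0], [0, 0, 0, 0, 1, 1]]
weights, top_features = pretrain_dbn(data, layer_sizes=[4, 2], rng=rng)
print(top_features)
```

The returned `top_features` are the hidden-layer activations that the slides describe as usable features; fine-tuning would add a label layer on top and backpropagate through `weights`.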
Deep Belief Networks (contd.)
Testing the model on a sample also has
two steps:
Generating the label:
Single pass through all layers except the last
two.
Sample from the last two layers by going up
and down until convergence.
Pass the generated sample to the last
“classification” layer to find the label.
Comparing the generated label with the
ground truth.
Same idea for generating a sample from the model.
DBNs for Multimodal Data
DBNs for learning the joint representation of the data [Srivastava et al., 2012]:
Two modalities:
Text
Image
The model could deal with missing modalities and could be used for both image retrieval and image annotation.
The joint representation using the DBN showed superior results compared to SVM and LDA.
DBNs for Multimodal Data (contd.)
Some results from this work:
Examples of data from the MIR Flickr Dataset, along with text generated from the DBN by sampling from $P(v_{txt} \mid v_{img}, \theta)$
DBNs for Multimodal Data (contd.)
Examples of data from the MIR Flickr Dataset, along with text generated from the DBN by sampling from $P(v_{img} \mid v_{txt}, \theta)$
Deep Boltzmann Machines [Salakhutdinov, 2009]
The main difference between DBM and
DBN is that in DBM, all links between
the layers are undirected (or in other
words, bi-directional).
Training is very similar to DBNs.
To test the model on a sample, or to
generate a sample from the model,
sampling is done for every two
consecutive layers (every RBM) until
convergence.
DBMs for Multimodal Data
DBMs for learning the joint representation of data [Srivastava et al., 2012]:
Two modalities:
Text
Image
Similar to the previous work from the same group [Srivastava et al.,
2012], the multimodal DBM is constructed using an image-text bi-modal DBM.
DBMs for Multimodal Data (contd.)
Some results from this work:
Deep Autoencoders [Bengio, 2009]
A deep autoencoder is composed of two symmetrical DBNs, each typically having four or five layers.
The first DBN represents the encoding half of the autoencoder, and the second DBN makes up the decoding half.
The goal is to optimize the weights in both blocks in order to minimize the reconstruction error.
The “denoising autoencoder” (dA) is an extension of the classical autoencoder [Ngiam et al., 2011].
The idea behind denoising autoencoders is to force the hidden layer to discover more robust features and to prevent it from simply learning the identity. To do so, we train the autoencoder to reconstruct the clean input from a corrupted version of it.
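The corruption step can be sketched as follows. Masking noise (randomly zeroing inputs) is one common choice and is an assumption here, as are the drop probability and the function names; the slides do not say which corruption is used.

```python
import random

def corrupt(x, drop_prob, rng):
    """Masking noise: each input component is set to zero with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else xi for xi in x]

def denoising_pairs(data, drop_prob, rng):
    """Training pairs (corrupted input, clean target) for a denoising autoencoder."""
    return [(corrupt(x, drop_prob, rng), list(x)) for x in data]

rng = random.Random(0)
data = [[0.2, 0.9, 0.4, 0.7], [0.8, 0.1, 0.6, 0.3]]
pairs = denoising_pairs(data, drop_prob=0.3, rng=rng)

for noisy, clean in pairs:
    print(noisy, "->", clean)
```

The autoencoder sees `noisy` as input but its reconstruction error is measured against `clean`, so copying the input through unchanged no longer minimizes the loss.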
Deep Autoencoders for Multimodal Data
Different settings for employing deep autoencoders to learn the multimodal data representation [Ngiam et al., 2011]:
Two modalities:
Speech audio
Video of the lips
Three learning settings are considered:
Multimodal fusion
Cross modality learning
Shared representation learning
Deep Autoencoders for Multimodal Data (contd.)
The bimodal deep autoencoder was trained in a denoising
fashion, using an augmented dataset with examples that
require the network to reconstruct both modalities given
only one.
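One way to build such an augmented dataset is sketched below. Each original sample yields three training pairs whose target always contains both modalities. Representing a missing modality by an all-zero vector, the feature sizes and the function name are assumptions of this sketch.

```python
def augment_bimodal(samples):
    """Build (input, target) pairs where the target always holds both modalities.

    Each sample is a pair (audio_features, video_features).
    """
    augmented = []
    for audio, video in samples:
        target = list(audio) + list(video)
        zero_audio = [0.0] * len(audio)
        zero_video = [0.0] * len(video)
        augmented.append((list(audio) + list(video), target))   # both modalities present
        augmented.append((list(audio) + zero_video, target))    # audio only
        augmented.append((zero_audio + list(video), target))    # video only
    return augmented

# Hypothetical feature vectors: 2 audio features and 3 video features per sample.
samples = [([0.1, 0.2], [0.3, 0.4, 0.5]), ([0.6, 0.7], [0.8, 0.9, 1.0])]
augmented = augment_bimodal(samples)

for inputs, target in augmented[:3]:
    print(inputs, "->", target)
```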
Deep Autoencoders for Multimodal Data (contd.)
The authors also tried the “Hearing to see” and “Seeing to hear”
idea by combining the shared representation with a
classifier.
The figure shows the “Hearing to see” setting.
The classification accuracies were 29.4% and 27.5%, respectively.
Summary of the RBM-based Models [Deng, 2012]
Convolutional Neural Networks [LeCun et al., 1995]
CNNs are biologically inspired multi-layer neural networks specifically adapted for computer vision problems and visual object recognition.
The idea of CNNs is similar to the mechanism of the visual cortex: features are extracted consecutively by convolving with filter banks, and the resolution is reduced by subsampling.
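One convolution stage followed by one subsampling stage can be sketched in plain Python. The filter here is hand-picked, whereas a CNN learns its filters, and 2x2 max pooling is used as the subsampling step; both choices, along with the toy image and function names, are assumptions for illustration.

```python
def conv2d_valid(image, kernel):
    """Valid cross-correlation (what CNN layers call convolution) of a 2-D image with a filter."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
             for c in range(out_w)] for r in range(out_h)]

def max_pool_2x2(fmap):
    """Subsample by keeping the maximum of each non-overlapping 2x2 block."""
    return [[max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)] for r in range(0, len(fmap) - 1, 2)]

# 5x5 toy image with a bright vertical bar in columns 2 and 3.
image = [[0, 0, 1, 1, 0],
         [0, 0, 1, 1, 0],
         [0, 0, 1, 1, 0],
         [0, 0, 1, 1, 0],
         [0, 0, 1, 1, 0]]

# Responds positively to a dark-to-bright step from left to right.
vertical_edge = [[-1, 1],
                 [-1, 1]]

feature_map = conv2d_valid(image, vertical_edge)   # 4x4 feature map
pooled = max_pool_2x2(feature_map)                 # 2x2 after subsampling
print(feature_map)
print(pooled)
```

The feature map is positive at the left edge of the bar and negative at the right edge; pooling halves the resolution while keeping the strongest response in each block.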
Convolutional Neural Networks (contd.)
Consecutive convolution and subsampling
CNNs for Multimodal Data
CNN-based multimodal learning method for RGB-D object recognition [Wang et al., 2015]:
Two CNNs are built to learn feature representations for color and depth separately.
The CNNs are then connected with a final multimodal layer.
This layer is designed to not only discover the most discriminative features for each modality, but also harness the complementary relationship between the two modalities.
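The general shape of such a two-stream architecture can be sketched with a plain concatenation followed by one shared layer. This is a simplified stand-in and not the large-margin multimodal layer of [Wang et al., 2015]; the feature vectors, weights and function names are made up for illustration.

```python
def linear(weights, bias, x):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def fuse(color_features, depth_features, weights, bias):
    """Concatenate per-modality features and apply one shared (multimodal) layer."""
    joint_input = list(color_features) + list(depth_features)
    return linear(weights, bias, joint_input)

# Hypothetical outputs of the color CNN and the depth CNN for one object.
color_features = [0.5, 1.0, 0.0]
depth_features = [2.0, 0.5]

# Made-up fusion weights: 2 class scores from 3 + 2 = 5 input features.
weights = [[1.0, 0.0, 0.0, 0.5, 0.0],
           [0.0, 1.0, 1.0, 0.0, -1.0]]
bias = [0.0, 0.25]

scores = fuse(color_features, depth_features, weights, bias)
predicted_class = max(range(len(scores)), key=scores.__getitem__)
print(scores, predicted_class)
```

Because each output row has weights on both color and depth features, the shared layer can combine evidence across modalities, which separate per-modality classifiers cannot do.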
Summary: Shallow vs. Deep
Problems of shallow-structured architectures:
They lack multiple layers of adaptive non-linear features.
Features are extracted with traditional engineered feature extraction methods and are manually obtained from the raw data.
Their limited modeling and representational power can cause difficulties when dealing with more complicated real-world applications involving natural signals such as human speech, natural sound and language, and natural images and visual scenes.
In such cases, methods that are able to automatically extract task-specific features from the data are much more desirable.
Solutions: Deep learning models
In this survey, we reviewed DBNs, DBMs, Deep Autoencoders and CNNs, and their applications in multimodal data processing.
Summary: Multimodality and Solutions
In most real-world applications, dealing with multimodal data is inevitable due to the nature of the task.
This requires machine learning methods that are capable of efficiently combining their learned knowledge from multiple modalities.
In traditional machine learning methods such as SVM, multimodal learning is performed by training a separate SVM on each individual modality and combining the results by voting, weighted average or other probabilistic methods. This does not efficiently combine information.
A very important aspect of multimodal learning that is missed in these approaches is the ability to learn the association between different modalities.
This can be achieved by utilizing deep learning methods, as they are capable of extracting task-specific features from the data and learning the relationship between modalities through a shared representation.
This shared representation of the data, which reveals the association between different modalities, makes the trained structure a generative model.
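The traditional late-fusion scheme mentioned above (separate per-modality classifiers combined by voting or weighted averaging) can be sketched as follows. The modality names, predictions, scores and weights are hypothetical; in practice they would come from independently trained classifiers such as per-modality SVMs.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most per-modality classifiers."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_average(scores, weights):
    """Combine per-modality confidence scores with fixed weights."""
    total_weight = sum(weights[m] for m in scores)
    return sum(weights[m] * scores[m] for m in scores) / total_weight

# Hypothetical outputs of three independently trained per-modality classifiers.
predictions = {"audio": "cat", "image": "dog", "text": "dog"}
scores = {"audio": 0.2, "image": 0.9, "text": 0.6}      # confidence for "dog"
weights = {"audio": 1.0, "image": 2.0, "text": 1.0}     # made-up reliability weights

voted = majority_vote(predictions.values())
fused_score = weighted_average(scores, weights)
print(voted, fused_score)
```

Each classifier only ever sees its own modality, so nothing in this scheme models how the modalities relate to one another; the combination happens purely at the decision level.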
References
1. Boser, Bernhard E., Isabelle M. Guyon, and Vladimir N. Vapnik. "A training algorithm for optimal margin classifiers." Proceedings of the fifth annual workshop on Computational learning theory. ACM, 1992.
2. Dinerstein, Sabra, Jonathan Dinerstein, and Dan Ventura. "Robust multi-modal biometric fusion via multiple SVMs." Systems, Man and Cybernetics, 2007. ISIC. IEEE International Conference on. IEEE, 2007.
3. Fisher, Ronald A. "The use of multiple measurements in taxonomic problems." Annals of eugenics 7.2 (1936): 179-188.
4. Khan, Aamir, et al. "A Multimodal Biometric System Using Linear Discriminant Analysis For Improved Performance." arXiv preprint arXiv:1201.3720 (2012).
5. Gokcen, Ibrahim, and Jing Peng. "Comparing linear discriminant analysis and support vector machines." Advances in Information Systems. Springer Berlin Heidelberg, 2002. 104-113.
6. Fischer, Asja, and Christian Igel. "An introduction to restricted Boltzmann machines." Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. Springer Berlin Heidelberg, 2012. 14-36.
7. Hinton, Geoffrey E. "Deep belief networks." Scholarpedia 4.5 (2009): 5947.
8. Srivastava, Nitish, and Ruslan Salakhutdinov. "Learning representations for multimodal data with deep belief nets." International Conference on Machine Learning Workshop. 2012.
9. Salakhutdinov, Ruslan, and Geoffrey E. Hinton. "Deep boltzmann machines." International Conference on Artificial Intelligence and Statistics. 2009.
10. Srivastava, Nitish, and Ruslan R. Salakhutdinov. "Multimodal learning with deep boltzmann machines." Advances in neural information processing systems. 2012.
11. Bengio, Yoshua. "Learning deep architectures for AI." Foundations and trends® in Machine Learning 2.1 (2009): 1-127.
12. Ngiam, Jiquan, et al. "Multimodal deep learning." Proceedings of the 28th international conference on machine learning (ICML-11). 2011.
13. Deng, Li. "Three classes of deep learning architectures and their applications: a tutorial survey." APSIPA transactions on signal and information processing (2012).
14. LeCun, Yann, and Yoshua Bengio. "Convolutional networks for images, speech, and time series." The handbook of brain theory and neural networks 3361.10 (1995).
15. Wang, Anran, et al. "Large-Margin Multi-Modal Deep Learning for RGB-D Object Recognition." Multimedia, IEEE Transactions on 17.11 (2015): 1887-1898.
Thank you!