Multimodal Machine Learning
Main Goal
Define a common taxonomy for multimodal machine learning and provide an overview of research in this area.
Introduction: Preliminary Terms
Modality: the way in which something happens or is experienced
Multimodal machine learning (MML): building models that process and relate information from multiple modalities
History of MML
● Audio-Visual Speech Recognition (AVSR)
○ The McGurk effect
○ Visual information improved performance when the speech signal was noisy
● Multimedia content indexing and retrieval
○ Searching visual and multimodal content directly
● Multimodal interaction
○ Understanding human multimodal behaviors (facial expressions, speech, etc.) during social interactions
● Media description
○ Image captioning, a challenging problem to evaluate
Five Main Challenges of MML
1. Representation – representing and summarizing multimodal data
2. Translation – mapping from one modality to another (e.g., image captioning)
3. Alignment – identifying the corresponding elements between modalities (e.g., recipe steps to the correct video frame)
4. Fusion – joining information from multiple modalities to predict (e.g., using lip motion and speech to predict spoken words)
5. Co-learning – transferring knowledge between modalities, their representation, and their predictive models
These challenges need to be tackled for the field to progress.
Representation
Multimodal representation: a representation of data using information from multiple entities (an image, word/sentence, audio sample, etc.)
We need to represent multimodal data in a meaningful way to have good models.
This is challenging because multimodal data are heterogeneous.
Two types of representation: Joint and Coordinated
Joint: modalities are combined into a single shared representation.
Coordinated: each modality keeps its own representation, linked by constraints.
Example constraints: minimize cosine distance, maximize correlation
Joint Representation
Mostly used when multimodal data is present during both training and inference.
Methods:
● Simple concatenation
● Neural networks
● Probabilistic graphical models
● Sequential representation
Neural networks are often pre-trained using an autoencoder on unsupervised data.
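A minimal sketch of the simplest joint representation, concatenation, followed by a single projection layer standing in for the neural network. The weights are random and untrained, and the feature dimensions (512-d image, 300-d text) are hypothetical, not from the survey:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unimodal features for one sample (dimensions are arbitrary).
image_feat = rng.standard_normal(512)   # e.g., a CNN image embedding
text_feat = rng.standard_normal(300)    # e.g., a sentence embedding

# Simple concatenation: the most basic joint representation.
joint = np.concatenate([image_feat, text_feat])

# An untrained single-layer projection stands in for the neural network
# that would normally map the concatenation into a shared space.
W = rng.standard_normal((128, 812)) / np.sqrt(812)
hidden = np.tanh(W @ joint)             # 128-d joint embedding
```

In a real system, `W` (and typically several such layers) would be learned, often pre-trained with an autoencoder as the slide notes.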
Coordinated Representation
Similarity models: enforce similarity between representations by minimizing the distance between modalities in the coordinated space.
Structured coordinated space models: enforce additional constraints between modalities.
Example: cross-modal hashing. Additional constraints are:
● An N-dimensional Hamming space
● The same object from different modalities must have a similar hash code
● The hash function is similarity-preserving
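A toy sketch of the cross-modal hashing idea. Real systems learn the hash functions; here an untrained random projection plus sign thresholding stands in for them, and the 64-d features and 16-bit codes are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy features for the same object seen in two modalities; we pretend the
# text embedding already lies close to the image embedding.
img = rng.standard_normal(64)
txt = img + 0.05 * rng.standard_normal(64)

# Untrained stand-in for a learned hash function: a shared random projection
# followed by sign thresholding gives an N-bit code in Hamming space.
P = rng.standard_normal((16, 64))
code_img = (P @ img > 0).astype(int)
code_txt = (P @ txt > 0).astype(int)

# Similar inputs should map to codes with small Hamming distance.
hamming = int(np.sum(code_img != code_txt))
```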
Translation: Mapping from one modality to another (e.g., image captioning)
Example-based: use a dictionary to translate between modalities.
Generative: construct a model that translates between modalities.
Example-Based Translation
Combination-based: combine retrievals from the dictionary in a meaningful way to create a better translation. Rules are often hand-crafted or heuristic.
Retrieval-based: use the retrieved translation without modification.
Problem: often requires an extra processing step (e.g., re-ranking of retrieved translations), since similarity in the unimodal space does not always mean a good translation.
Solution: use an intermediate semantic space for similarity comparison. This performs better because the space reflects both modalities and allows bi-directional translation, but it requires manual construction or learning of the space, which needs large training dictionaries.
Generative Translation
Constructing models that perform multimodal translation from a unimodal source.
Grammar-based: detect high-level concepts from the source and generate a target using a pre-defined grammar.
● More likely to generate logically correct targets
● Produces formulaic translations and needs complex pipelines for concept detection
● Example: video description of who did what to whom, where, and how
Encoder-decoder: encode the source modality into a latent representation, then decode that representation into the target modality (one pass).
● Encoders: RNNs, DBNs, CNNs; decoders: RNNs, LSTMs
● May be memorizing the data; requires lots of data for training
Continuous generation: generate the target modality at every timestep based on a stream of source modality inputs.
● Models: HMMs, RNNs, encoder-decoders
● Requires the ability to understand the source and generate the target
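The encoder-decoder pattern can be sketched in a few lines. This is not a real captioning model: the weights are random and untrained (a real system would learn a CNN encoder and an LSTM decoder), and all dimensions are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Untrained stand-in weights for the encoder and decoder.
src_dim, latent_dim, tgt_dim = 512, 64, 300
W_enc = rng.standard_normal((latent_dim, src_dim)) / np.sqrt(src_dim)
W_dec = rng.standard_normal((tgt_dim, latent_dim)) / np.sqrt(latent_dim)

def translate(source_feat):
    """One-pass encoder-decoder: source -> latent -> target modality."""
    latent = np.tanh(W_enc @ source_feat)   # encode into a latent representation
    return W_dec @ latent                   # decode into the target space

image_feat = rng.standard_normal(src_dim)          # source modality (image)
caption_embedding = translate(image_feat)          # target modality (text)
```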
Translation Evaluation: A Major Challenge
There are often multiple correct translations.
Evaluation methods
● Human evaluation – impractical and biased
● BLEU, ROUGE, Meteor, CIDEr – low correlation with human judgment; require a high number of reference translations
● Retrieval – better reflects human judgments
○ Rank the available captions and assess whether the correct captions get a high rank
● Visual question-answering for image captioning – ambiguity in questions and answers, question bias
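The retrieval-style evaluation can be made concrete with a recall@k sketch over a toy similarity matrix (the matrix values are invented for illustration):

```python
import numpy as np

def recall_at_k(similarity, k):
    """similarity[i, j] = score of caption j for image i; by convention the
    correct caption for image i is caption i. Returns the fraction of images
    whose correct caption ranks in the top k."""
    ranks = np.argsort(-similarity, axis=1)          # captions, best first
    hits = [i in ranks[i, :k] for i in range(len(similarity))]
    return float(np.mean(hits))

# Toy 3x3 similarity matrix (rows: images, cols: candidate captions).
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.1],
                [0.2, 0.7, 0.4]])   # image 2's correct caption ranks 2nd

print(recall_at_k(sim, 1))  # 2/3 of correct captions ranked first
print(recall_at_k(sim, 2))  # all correct captions within the top 2
```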
Alignment
“Finding relationships and correspondences between sub-components of instances from two or more modalities.”
Examples:
● Given an image and caption, find the areas of the image corresponding to the caption.
● Given a movie, align it to the book chapters it was based on.
Explicit Alignment (unsupervised and supervised)
Unsupervised: no direct alignment labels. Supervised: direct alignment labels.
Most approaches are inspired by work on statistical machine translation and genome sequence alignment.
If there is no similarity metric between modalities, canonical correlation analysis (CCA) is used to map the modalities to a shared space.
CCA finds linear combinations of each modality's features that maximize their correlation.
Example applications:
● Spoken words ↔ visual objects in images
● Movie shots and scenes ↔ screenplay
● Recipes ↔ cooking videos
● Speakers ↔ videos
● Sentences ↔ video frames
● Image regions ↔ phrases
● Speakers in audio ↔ locations in video
● Objects in 3D scenes ↔ nouns in text
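A compact NumPy sketch of CCA itself, using the standard whitening-plus-SVD formulation on synthetic two-view data (the data, dimensions, and regularization constant are all invented for the demo):

```python
import numpy as np

def cca_first_pair(X, Y, reg=1e-6):
    """First pair of canonical directions for views X (n x dx) and Y (n x dy)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = len(X)
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Whiten each view, then take the SVD of the whitened cross-covariance.
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy.T)
    a = Wx.T @ U[:, 0]      # projection direction for view X
    b = Wy.T @ Vt[0, :]     # projection direction for view Y
    return a, b, s[0]       # s[0] = first canonical correlation

rng = np.random.default_rng(0)
z = rng.standard_normal((200, 1))                 # shared latent signal
X = np.hstack([z, rng.standard_normal((200, 2))])
Y = np.hstack([-z, rng.standard_normal((200, 3))])
a, b, corr = cca_first_pair(X, Y)
# The shared latent makes the first canonical correlation close to 1.
```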
Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion
Implicit Alignment
Used as an intermediate step for another task.
Does not rely on supervised alignment examples
Data is latently aligned during model training
Useful for speech recognition, machine translation, media description, visual question-answering
Example: alignment of words and image regions before performing image retrieval based on text descriptions
Difficulties in alignment:
● Few datasets with explicitly annotated alignments
● Difficult to design similarity metrics
● There may exist zero, one, or many correct alignments
Fusion
Early fusion - features integrated immediately (concatenation)
Late fusion - each modality makes an independent decision (averaging, voting schemes, weighted combinations, other ensemble techniques)
Hybrid fusion - exploits advantages of both
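Early versus late fusion in a few lines. The per-modality classifiers here are untrained random linear models with invented dimensions (40-d audio, 30-d video, 5 classes); they only illustrate where in the pipeline the modalities are combined:

```python
import numpy as np

rng = np.random.default_rng(0)

audio = rng.standard_normal(40)   # hypothetical audio features
video = rng.standard_normal(30)   # hypothetical lip-motion features

def scores(w, x):
    """Softmax class scores from a linear model."""
    logits = w @ x
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Early fusion: concatenate features, then one classifier on the joint vector.
w_joint = rng.standard_normal((5, 70))
early = scores(w_joint, np.concatenate([audio, video]))

# Late fusion: independent per-modality decisions, then average the scores
# (a simple weighted combination; voting is another option).
w_audio = rng.standard_normal((5, 40))
w_video = rng.standard_normal((5, 30))
late = 0.5 * scores(w_audio, audio) + 0.5 * scores(w_video, video)
```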
Fusion Techniques
Multiple kernel learning (MKL):
● An extension of kernel support vector machines
● Kernels act as similarity functions between data points
● Modality-specific kernels allow for better fusion
MKL Application: performing musical artist similarity ranking from acoustic, semantic, and social view data.
(McFee et al., Learning Multi-modal Similarity)
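The core MKL idea, one kernel per modality combined into a single valid kernel, can be sketched directly. The features, kernel choice, and mixture weights below are invented; MKL would learn the weights rather than fix them:

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gram matrix of the RBF kernel over the rows of X."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
acoustic = rng.standard_normal((10, 5))   # hypothetical per-artist features
semantic = rng.standard_normal((10, 8))

# One kernel per modality; MKL learns the mixture weights (fixed here).
K = 0.7 * rbf_kernel(acoustic, 0.1) + 0.3 * rbf_kernel(semantic, 0.1)

# A convex combination of positive semi-definite kernels is still a valid
# (positive semi-definite) kernel, so K can be fed to a kernel SVM.
eigvals = np.linalg.eigvalsh(K)
```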
Neural networks (RNN/LSTM) can learn the multimodal representation and fusion component end-to-end. They achieve good performance but require large datasets and are less interpretable.
LSTM Applications:
● Audio-visual emotion classification
● Neural image captioning
Co-learning
Modeling a resource-poor modality by exploiting a resource-rich modality.
Used to address lack of annotated data, noisy data, and unreliable labels.
Can generate more labeled data, but also can lead to overfitting.
Co-learning Examples
Transfer learning application: using text to improve visual representations for image classification by coordinating CNN features with word2vec features.
Conceptual grounding: learning meanings/concepts based on vision, sound, or smell (not just on language)
Zero-shot learning (ZSL): recognizing a class without having seen a labeled example of it.
ZSL Example: using an intermediate semantic space to predict unseen words people are thinking about from fMRI data
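The intermediate-semantic-space trick reduces to nearest-neighbor search over class codes. The class names and 3-d semantic codes below are invented for illustration; a real system would use learned attribute or word vectors:

```python
import numpy as np

# Hypothetical semantic codes (e.g., attribute or word vectors) per class.
semantic_codes = {
    "dog":   np.array([1.0, 0.9, 0.1]),
    "horse": np.array([0.8, 0.2, 0.9]),
    "truck": np.array([0.0, 0.1, 0.8]),   # unseen: no labeled examples
}

def predict_class(predicted_code, codes):
    """1-nearest-neighbor in the semantic space: a trained model maps the
    input (image, fMRI scan, ...) to a semantic code, and we return the
    closest class, which may be one never seen during training."""
    return min(codes, key=lambda c: np.linalg.norm(codes[c] - predicted_code))

# Suppose the trained regressor maps an fMRI scan to this semantic code:
output = np.array([0.1, 0.0, 0.9])
print(predict_class(output, semantic_codes))  # "truck", with no examples of it
```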
Zero-Shot Learning with Semantic Output Codes
Grounding Semantics in Olfactory Perception
“This work opens up interesting possibilities in analyzing smell and even taste. It could be applied in a variety of settings beyond semantic similarity, from chemical information retrieval to metaphor interpretation to cognitive modelling. A speculative blue-sky application based on this, and other multi-modal models, would be an NLG application describing a wine based on its chemical composition, and perhaps other information such as its color and country of origin.”
Paper Critique
This paper is very thorough in its survey of MML challenges and what researchers have done to approach them.
MML is central to the advancement of AI; thus, this area must be studied in order to make progress.
Future research directions include any MML projects that make headway in the five challenge areas.
Questions?