end-to-end learning of visual representations from ... · motivation - related work - mil-nce...

41
End-to-End Learning of Visual Representations from Uncurated Instructional Videos Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman Julia Guerrero Viu Seminar on Current Works in Computer Vision Summer Semester 2020 1

Upload: others

Post on 04-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

End-to-End Learning of Visual Representations from Uncurated Instructional Videos

Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman

Julia Guerrero ViuSeminar on Current Works in Computer Vision

Summer Semester 2020

1

Page 2: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

1. Motivation

2. Related Work 3. The MIL-NCE objective

4. Experiments

5. Conclusion and discussion impulses

Outline

2

Page 3: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Motivation

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion

● Objective: Learning video representations

● Challenge: Most current video representation models require extensive annotations. Annotating videos is expensive and not scalable

● Possible solution: Leveraging narrated videos that are available at scale on the web

3

Page 4: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Motivation

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion

● Narrated / instructional videos: Videos that include an oral description of what it is happening

● HowTo100M: dataset of 100 million clips-narrations from YouTube

● Challenges: Narration supervision is weak and noisy. In particular, weak alignment between text and image (~50% misalignment)

4

Page 5: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Motivation

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion

In this work...

● Learning embeddings of video and text in a self-supervised manner directly from uncurated instructional videos

● MIL-NCE objective: New specific loss to address misalignments in narrated videos

5

MILMultiple Instance Learning

NCENoise Contrastive Estimation

MIL-NCE

Page 6: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Related Work

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion

1. Self-supervised learning on videos

2. Joint video-language

3. Multiple Instance Learning

4. Noise Contrastive Estimation

6

Page 7: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Self-supervised learning on videos

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion

● Use metadata from social media videos as labels [Ghadiyaram et al. 2019]

● Self-supervised: learn a proxy task with labels taken directly from videos

...

● Domain gap between curated and uncurated videos [Caron et al. 2019]

7

Geometric transformations [Jing et al. 2018]

Predicting the future [Han et al. 2019]

Page 8: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Joint video-language

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion

● Learn a joint embedding space for visual and textual data- Supervised: Manual annotated datasets- Self-Supervised: Exploit semantic information from natural language

(audio speech or Automatic Speech Recognition)

● [Miech et al. HowTo100M 2019] and [Sun et al. CBT 2019]: Self-supervised using ASR but leverage pre-trained visual representations on ImageNet and Kinetics

8

Page 9: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Multiple Instance Learning (MIL)

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion

● Weakly supervised learning with labeled sets of many samples instead of individual labels per sample

● Multiple instances of the same object, where the label refers to the object

● MIL deals with problems with incomplete knowledge

9

Page 10: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Multiple Instance Learning (MIL)

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion

● MIL applied to video understanding- Max-pooling: MIL-SVM - Discriminative clustering: DIFFRAC

10

Person recognition in movies [Miech et al. 2017]Action localization [Weinzaepfe et al. 2016]

Page 11: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Noise Contrastive Estimation (NCE)

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion

● Discriminate between samples from a ‘real’ distribution and an artificially generated noise distribution

● Used to train classifiers with a very large number of classes

11

Conventional Classifier NCEPositive

Negative

Negative

Page 12: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Noise Contrastive Estimation (NCE)

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion

● [Hénaff et al. 2019] and [Van den Oord et al. 2018] apply NCE to self-supervised learning using InfoNCE loss

● [Sun et al. CBT 2019] apply NCE loss to video-text representation learning:

- Different way of constructing the negative samples

12

Page 13: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

MIL-NCE objective

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion

● Learning joint embedding space from video and text● Embedding similarity when text and video content semantically similar

13

Embedding spacevideo model

text model

[Miech et al. 2019]

Page 14: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

MIL-NCE objective

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion 14

Page 15: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

MIL-NCE objective

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion

● “Maximizing ratio of the sum of positive candidate scores to the sum of negative samples scores, where score is exponentiated dot product of the embeddings”

15

x: video-clip y: narrationf and g: embedding functions

P: positive candidates N: negative candidates

Page 16: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

MIL-NCE objective

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion 16

Building the MIL-NCE objective step by step…

1. Simple joint probabilistic model

2. MIL contribution: Multiple options for matching video with narration

3. NCE contribution

Page 17: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Joint probabilistic model

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion

● Input: Set of n video-text pairs from the joint data distribution

● Output: Two parametrized functions f and g that map video and text to a d-dimensional vector space

● Probability of a matching pair (x , y) can be estimated up to a constant as:

17

Page 18: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

MIL contribution

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion

● Key idea: Consider multiple options to match a video with a narration

● Given a video-clip x, K positive narrations that are close in time Joint probability of x happening with any of the yk (mutually exclusive):

● Symmetric joint probability

18

Page 19: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

NCE contribution

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion

● A lot of all possible pairs of video-text: intractable for generative loss

● Discriminative loss: softmax version of NCE [Jozefowicz et al. 2016]

19

Page 20: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

NCE contribution

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion

● A lot of all possible pairs of video-text: intractable for generative loss

● Discriminative loss: softmax version of NCE [Jozefowicz et al. 2016]

20

MIL-NCE

Page 21: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

MIL-NCE objective

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion 21

● NCE: Discriminate between positive and negative candidates

● Model a symmetric joint probability between text and video

● MIL: Solve the temporal misalignments between text and video

Page 22: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Experiments

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion 22

1. Implementation details

2. Downstream tasks

3. Ablation studies

4. Comparison to state-of-the-art

Page 23: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Implementation details

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion 23

● Video-model:I3D [Carreira et al. 2017] / S3D [Xie et al. 2018]

● Text-model:word2Vec pre-trained on Google News [Mikolov et al. 2013]

● Train on HowTo100M dataset- 120M pairs 15 years - 3.2 seconds video (32 frames)- Automatic Speech Recognition- max. 16 words narration

Page 24: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Implementation details

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion 24

● Positive samples

Size of the positive set: |P| = 3

Page 25: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Implementation details

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion 25

...

● Negative samples Size of the negative set: |N| = 8

Symmetric

Page 26: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Downstream tasks

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion 26

Action Recognition Temporal Action Localization Action Step Localization

Temporal Action Segmentation Text-to-Video Retrieval

datasetsHMDB-51UCF-101Kinetics700

metricaccuracy

datasetYoutube8M Segments

metricmAP

datasetCrossTask(CTR)

metricaverage recall

Add egg

Pour mixture

datasetCOIN

metricframe accuracy (FA)

datasetsYouCook2(YR10)

MSR-VTT(MR10)

metricrecall@Kpred GT

Output:

Input: cut tomato

playing guitar

Page 27: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Ablation Studies

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion 27

Page 28: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Ablation Studies

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion 28

Page 29: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Ablation Studies

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion 29

Page 30: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Ablation Studies

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion 30

Page 31: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Ablation Studies

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion 31

Page 32: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion 32

● Importance of Multiple Instance Learning: trade-off between likelihood to align and noise

● Many symmetric negative candidates

● Simple language models

Ablation Studies

Page 33: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion 33

Ablation Studies

Page 34: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion 34

Comparison to SOTA● Video-only representation:

Action recognition

Page 35: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion 35

Comparison to SOTA● Video-only representation:

Action Segmentation pred GT

Page 36: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion 36

Comparison to SOTA● Joint text-video representation:

Video-to-text retrieval for Action Step Localization

without fine-tuning!

Add egg

Pour mixture

Page 37: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion 37

Comparison to SOTA● Joint text-video representation: Text-to-video retrieval

without fine-tuning!

Output:

Input: cut tomato

Page 38: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion 38

Experiments https://www.di.ens.fr/willow/research/mil-nce/

Page 39: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Conclusion

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion

“ Use the novel MIL-NCE objective to learn video representations without annotations, by dealing with misalignments from uncurated instructional videos ”

39

MILMultiple Instance Learning

NCENoise Contrastive Estimation

MIL-NCE

Page 40: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Conclusion

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion

“ Use the novel MIL-NCE objective to learn video representations without annotations, by dealing with misalignments in uncurated instructional videos ”

40

MILMultiple Instance Learning

NCENoise Contrastive Estimation

MIL-NCE

Thank you!

Page 41: End-to-End Learning of Visual Representations from ... · Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion Objective: Learning video representations Challenge:

Discussion impulses

Motivation - Related Work - MIL-NCE objective - Experiments - Conclusion

● How does their self-supervised fine-tuned version compare to supervised SOTA?

● Further explanation about differences among datasets in ablation studies

● In which other applications could it be interesting to use MIL-NCE loss?

41