
Page 1: Video Paragraph Captioning Using Hierarchical Recurrent ...cseweb.ucsd.edu/classes/sp17/cse252C-a/CSE252C_20170508.pdf · Gated Recurrent Unit Simplification of the Long Short-Term

Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks

Presented by Yuting Wang, Yuwei Wang

Acknowledgement: some figures are from Haonan Yu’s slides. http://upplysingaoflun.ecn.purdue.edu/~yu239/slides/cvpr2016slides.pdf


Motivation: Video Captioning

• Generating one or more sentences describing a video.

• A critical step toward AI.

• Applications

  • Video retrieval

  • Automatic video subtitling

  • Blind navigation

The cat sat on a roomba driver as it moved around the floor.

https://www.youtube.com/watch?v=LQ-jv8g1YVI


Motivation: Video Captioning

• Comparison with image captioning

  • Video has temporal information.

  • Captions should describe appearance and actions.

  • A paragraph is fundamental for long videos.


Motivation: Video Captioning

For longer videos, a sentence is not enough.


The person is cooking.


Motivation: Video Captioning

For longer videos, a sentence is not enough.

The person walked into the kitchen.

The person took out a cutting board and a knife.

The person took a cucumber out of the refrigerator.

The person took a plate out of the cabinet.

The person washed the cucumber at the sink.

The person sliced the cucumber with the knife.


Outline

• Motivation

• Related Work

• Model

  • Sentence Generator

  • Paragraph Generator

  • Training and Generation

• Experiments

  • YouTubeClips

  • TACoS-MultiLevel

• Conclusion


Rule-based

• In 2002, Kojima et al. designed simple methods to identify objects in video and a set of rules to construct sentences.

• Too many manually established rules.

• Limited types of objects and actions.

Kojima, A., Tamura, T., & Fukunaga, K. (2002). Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision, 50(2), 171-184.


Statistical Models

Krishnamoorthy, Niveda, et al. "Generating Natural-Language Video Descriptions Using Text-Mined Knowledge." AAAI. Vol. 1. 2013.

• Identify subject-verb-object (SVO) triplets using text-mined knowledge.

• Generate candidate sentences.

• Rank sentences using a statistical language model trained on web-scale data.

• Does not perform well on large-scale datasets such as YouTubeClips and TACoS-MultiLevel.


Deep Neural Network

• A general approach

  • Encoder-Decoder Framework

• Encoder

  • Convolutional Neural Network (CNN)

  • Object Detection

• Decoder

  • LSTM unit or variant

  • Generates a sentence, i.e. a sequence of words

• Plus

  • Attention Mechanism

[Figure: Input video -> Encoder (CNN) -> Feature vector -> Decoder (RNN)]

Ground Truth: the person was cooking.

Generated Sentence: The person is cutting carrot

Yao, Li, et al. "Describing videos by exploiting temporal structure." Proceedings of the IEEE international conference on computer vision. 2015.


Hierarchy

• Video

  • Video temporal structures are layered

  • Action: Blowing candles -> cutting cake -> eating cake

  • Hierarchical Recurrent Neural Encoder (HRNE)

• Text

  • A sentence is hierarchical

Yao, Li, et al. "Describing videos by exploiting temporal structure." Proceedings of the IEEE international conference on computer vision. 2015.


Model Overview

A Hierarchical Neural Network


Gated Recurrent Unit

Simplification of the Long Short-Term Memory Architecture

Captures long-term temporal information.

Simpler than the LSTM and requires less computation.


Gated Recurrent Unit

Simplification of the Long Short-Term Memory Architecture

Layers:

• Reset gate: 𝐫

• Update gate: 𝐳

• Hidden state: 𝐡

• Input: 𝐱

Gated Recurrent Unit

Simplification of the Long Short-Term Memory Architecture

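Layers:

\[
\begin{aligned}
\mathbf{r}^t &= \sigma\left(\mathbf{W}_r \mathbf{x}^t + \mathbf{U}_r \mathbf{h}^{t-1} + \mathbf{b}_r\right) \\
\mathbf{z}^t &= \sigma\left(\mathbf{W}_z \mathbf{x}^t + \mathbf{U}_z \mathbf{h}^{t-1} + \mathbf{b}_z\right) \\
\tilde{\mathbf{h}}^t &= \phi\left(\mathbf{W}_h \mathbf{x}^t + \mathbf{U}_h \left(\mathbf{r}^t \odot \mathbf{h}^{t-1}\right) + \mathbf{b}_h\right) \\
\mathbf{h}^t &= \mathbf{z}^t \odot \mathbf{h}^{t-1} + \left(1 - \mathbf{z}^t\right) \odot \tilde{\mathbf{h}}^t
\end{aligned}
\]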
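The four GRU update equations on this slide can be traced in a few lines of plain Python. This is a minimal sketch, not the presenters' implementation: it assumes the activation φ is tanh, and the helper names (`gate`, `gru_step`, `init_params`) and the small random initialization are hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def gate(W, U, b, x, h, activation):
    # Element-wise activation(W x + U h + b).
    return [activation(wx + uh + bi)
            for wx, uh, bi in zip(matvec(W, x), matvec(U, h), b)]


def gru_step(x, h_prev, p):
    """One GRU update: reset gate r, update gate z, candidate h_tilde, new state h."""
    r = gate(p["W_r"], p["U_r"], p["b_r"], x, h_prev, sigmoid)
    z = gate(p["W_z"], p["U_z"], p["b_z"], x, h_prev, sigmoid)
    # The reset gate scales the previous state before it enters the candidate.
    reset_h = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = gate(p["W_h"], p["U_h"], p["b_h"], x, reset_h, math.tanh)  # phi = tanh (assumption)
    # z close to 1 keeps the old state, z close to 0 takes the candidate.
    return [zi * hp + (1.0 - zi) * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]


def init_params(input_dim, hidden_dim, seed=0):
    # Hypothetical initialization: small uniform weights, zero biases.
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    params = {}
    for name in ("r", "z", "h"):
        params["W_" + name] = mat(hidden_dim, input_dim)
        params["U_" + name] = mat(hidden_dim, hidden_dim)
        params["b_" + name] = [0.0] * hidden_dim
    return params


params = init_params(input_dim=3, hidden_dim=4)
state = [0.0] * 4
for frame in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):
    state = gru_step(frame, state, params)
```

With the last equation written as z ⊙ h_prev + (1 − z) ⊙ h_tilde, an update gate saturated at 1 copies the previous state unchanged, which is how the unit can carry information across many time steps.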


Sentence Generator

[Figure: sentence generator, composed of a Video Feature Pool, an Attention Model, and a Recurrent Network]


Sentence Generator

• A sentence is generated word by word by an RNN with Gated Recurrent Units.

• The embedding layer and the softmax layer share one weight matrix: one uses W and the other its transpose W^T.
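The weight sharing between the embedding and softmax layers can be illustrated as follows. This is a minimal sketch under stated assumptions: the matrix is stored as vocabulary-size rows by embedding-size columns, the vector fed to the softmax has the same size as a word embedding, and the function names `embed` and `word_distribution` are hypothetical.

```python
import math


def embed(word_id, W):
    # Embedding lookup: multiplying W^T by a one-hot vector selects row `word_id` of W.
    return W[word_id]


def word_distribution(hidden, W):
    # Output projection reuses the same matrix: one score per vocabulary row of W.
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in W]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of three words with two-dimensional embeddings.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
probs = word_distribution(embed(2, W), W)
```

Sharing the matrix roughly halves the number of vocabulary-sized parameters, since only one table of word vectors has to be learned.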


Video Features

• Appearance

  • VGG-16 (Simonyan et al., 2015), pre-trained on the ImageNet dataset

• Action

  • C3D (Tran et al., 2015), pre-trained on the Sports-1M dataset

  • Dense Trajectories + Fisher Vector (Wang et al., 2011)


Attention Model

In captioning, we should model both spatial and temporal attention.


Attention Model

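• Video Feature Pool: a series of feature vectors \(\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_{KM}\)

• Attention Layer I and II: project each vector to a scalar

\[
q_m^t = \mathbf{w}^\top \phi\left(\mathbf{W}_q \mathbf{v}_m + \mathbf{U}_q \mathbf{h}^{t-1} + \mathbf{b}_q\right)
\]

• Softmax Layer:

\[
\beta_m^t = \frac{\exp\left(q_m^t\right)}{\sum_{m'=1}^{KM} \exp\left(q_{m'}^t\right)}
\]

• Weighted Average Layer:

\[
\mathbf{u}^t = \sum_{m=1}^{KM} \beta_m^t \mathbf{v}_m
\]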
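The three attention layers on this slide (scoring, softmax, weighted average) can be sketched in plain Python. This is an illustration under assumptions rather than the presenters' code: φ is taken to be tanh, and the function name `attend` and its argument names are hypothetical.

```python
import math


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def attend(features, h_prev, w, W_q, U_q, b_q):
    """Return attention weights (beta) and the weighted average feature (u)."""
    # The previous recurrent state contributes the same term to every score.
    state_term = matvec(U_q, h_prev)
    scores = []
    for v in features:
        projected = [math.tanh(a + b + c)  # phi = tanh (assumption)
                     for a, b, c in zip(matvec(W_q, v), state_term, b_q)]
        # Second attention layer: project the vector to a scalar score q_m.
        scores.append(sum(wi * pi for wi, pi in zip(w, projected)))
    # Softmax layer: normalize the scores into weights that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    betas = [e / total for e in exps]
    # Weighted average layer: u = sum over m of beta_m * v_m.
    dim = len(features[0])
    u = [sum(b * v[d] for b, v in zip(betas, features)) for d in range(dim)]
    return betas, u


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
betas, u = attend(
    features,
    h_prev=[0.5, -0.5],
    w=[1.0, -1.0],
    W_q=[[1.0, 0.0], [0.0, 1.0]],
    U_q=[[0.0, 0.0], [0.0, 0.0]],
    b_q=[0.0, 0.0],
)
```

Because the weights depend on the previous recurrent state, the pooled feature can change at every word, so different words can draw on different parts of the video.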


Attention Model

[Figure: attention model. Labels: Feature pool, Attention weights, Previous recurrent state, Average feature.]

The same procedure is applied to each feature channel.


Paragraph Generator Unrolled

[Figure: unrolled paragraph generator. Labels: Sentence 𝑁, Sentence 𝑁 + 1, Paragraph Generator, Visual features, Encode semantic context.]


Paragraph Generator

• Input

  • The average of all the word embeddings of sentence 𝑁.

  • The last state of recurrent layer I.

• Sentence Embedding Layer (512)

• 2nd Gated RNN

• Paragraph State Layer (512)

• Output

  • The initial hidden state of the 1st gated RNN for sentence 𝑁 + 1.

[Figure: Word Embeddings -> Average Word Embedding, together with the Last State of RNN 1 -> Sentence Embedding Layer (512) -> 2nd Gated RNN -> Paragraph State Layer (512) -> Initial hidden state for 1st Gated RNN, Sentence 𝑁 + 1]


Model Unrolled

[Figure: the model unrolled over Sentence 𝑁 − 1, Sentence 𝑁, and Sentence 𝑁 + 1. Labels: Recurrent I (Sentence Generator), Recurrent II (Paragraph Generator), Paragraph State Layer. Example sentence: "Today is Monday".]


Training: Cost function

• The likelihood P of generating a word (= its activation value in the softmax layer)

• The cost of generating a word = −log(P)

• The cost of generating a paragraph

  • N sentences in the paragraph, T_n words in sentence s_n
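The paragraph-level formula is not present in the slide text. A minimal sketch, assuming the paragraph cost is the sum of the per-word costs −log(P) over all N sentences, divided by the total number of words: the normalization by word count is an assumption, and the function name `paragraph_cost` and its input layout are hypothetical.

```python
import math


def paragraph_cost(word_probs):
    """Average negative log-likelihood of a paragraph.

    word_probs[n][t] is the softmax probability the model assigned to the
    t-th ground-truth word of sentence n.
    """
    total_cost = 0.0
    total_words = 0
    for sentence in word_probs:          # n = 1 .. N
        for p in sentence:               # t = 1 .. T_n
            total_cost += -math.log(p)   # cost of generating one word
        total_words += len(sentence)
    return total_cost / total_words      # normalization is an assumption


# Two sentences: the first has two words, the second has one.
cost = paragraph_cost([[0.5, 0.5], [0.25]])
```

A perfectly confident and correct model (every probability equal to 1) has cost 0, and the cost grows as the probability of any ground-truth word shrinks.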


Generation with Beam Search

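[Figure: beam search expanding candidate sentences from BOS, with candidate scores where shown]

Step 1: The (0.95), A (0.80), This (0.9)

Step 2: The person (5.0), A cat (1.5), The cat (4.0), A car, A dog

Step 3: The person is, A cat is, The cat sits, A car runs, A dog runs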
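The expansion shown on this slide can be sketched as a generic beam search over word sequences. This is an illustration under assumptions, not the presenters' decoder: candidates are scored by summed log-probability, and `toy_model` with its probability table is a made-up stand-in for the sentence generator.

```python
import math


def beam_search(next_word_probs, beam_width, max_len, bos="BOS", eos="EOS"):
    """Keep the `beam_width` best partial sentences at every step."""
    beams = [([bos], 0.0)]   # (words so far, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            for word, p in next_word_probs(words).items():
                candidates.append((words + [word], score + math.log(p)))
        # Prune: only the highest-scoring candidates survive.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for words, score in candidates[:beam_width]:
            if words[-1] == eos:
                finished.append((words, score))
            else:
                beams.append((words, score))
        if not beams:
            break
    finished.extend(beams)   # keep unfinished beams if max_len was reached
    return sorted(finished, key=lambda c: c[1], reverse=True)


def toy_model(words):
    # Hypothetical next-word distribution; any unlisted prefix ends the sentence.
    table = {
        ("BOS",): {"The": 0.5, "A": 0.4, "This": 0.1},
        ("BOS", "The"): {"person": 0.6, "cat": 0.4},
        ("BOS", "A"): {"cat": 0.9, "dog": 0.1},
    }
    return table.get(tuple(words), {"EOS": 1.0})


results = beam_search(toy_model, beam_width=2, max_len=3)
best_words, best_score = results[0]
```

In this toy table a greedy decoder would commit to "The" at the first step, while a beam of width 2 also keeps "A" alive and ends up with the more probable sentence "A cat".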


Beam Search

[Figure: beam search at the paragraph level. Labels: Sentence Generator; 1st Sentence Pool (Sentences A, B, C, D); Paragraph Generator; 2nd Sentence Pool (Sentences W, X, Y, Z); resulting Paragraph (Sentence A, Sentence Y); BOS-EOS.]


YouTubeClips

• Short video clips (9 seconds on average) downloaded from YouTube

• 1,970 videos, ~80k video-sentence pairs, 12k unique words

• Generate one sentence for a video


YouTubeClips

• Features

  • Object appearance: VGG-16, pre-trained on ImageNet

  • Action: C3D, pre-trained on the Sports-1M dataset

• Evaluation Metrics

  • BLEU: n-gram overlap with the reference sentences

  • METEOR: word alignment with the reference sentences

  • CIDEr: cosine similarity of TF-IDF-weighted n-grams
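To make the "overlap" idea behind BLEU concrete, the sketch below computes only the clipped n-gram precision, which is one ingredient of the metric. Full BLEU also combines several n-gram orders and applies a brevity penalty, both omitted here. The function names are hypothetical, and the example sentences are the ground-truth and generated captions from the encoder-decoder slide.

```python
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams that also occur in a reference.

    Each n-gram is credited at most as many times as it appears in any
    single reference (the clipping step).
    """
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    overlap = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return overlap / sum(cand_counts.values())


generated = "the person is cutting carrot".split()
ground_truth = ["the person was cooking".split()]
unigram_precision = clipped_precision(generated, ground_truth, 1)
bigram_precision = clipped_precision(generated, ground_truth, 2)
```

Here two of the five generated words ("the", "person") and one of the four generated bigrams ("the person") appear in the reference.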


TACoS-MultiLevel

• 185 long video clips (6 minutes on average) of daily cooking scenarios

• Manually annotated <interval, sentence> pairs

• 16,145 intervals, 52,478 sentences


TACoS-MultiLevel

• Small Object Detection

  • Optical flow to detect actors

  • Extract K (= 3~5) image patches near the actor

  • Compute a VGGNet feature for each patch

• Motion and Activity

  • Fine-grained cooking activity

  • Dense Trajectories, Fisher vector


Result

RNN-sent: the initial state of the sentence generator is set to zero for each sentence.

RNN-cat: the initial state of the sentence generator is set to zero once per paragraph.


Result

Human Evaluation on Amazon Mechanical Turk


Summary

• A hierarchical-RNN framework for video paragraph captioning.

• Models inter-sentence dependency to generate a sequence of sentences given video data.

• Experimentally shown to generate a paragraph for a long video.

• Outperforms state-of-the-art results.


Limitations & Future Work

• Information flows only from the beginning to the end, not in the reverse direction.

  • Future work: use a bidirectional RNN for sentence generation.

• Suffers from a known discrepancy between the objective function used in training and the one used in generation.

  • Future work: use Scheduled Sampling in training.

  • Future work: directly optimize the metric used at test time.


Limitations & Future Work

• The model has difficulty handling very small objects.

  • This remains a difficult problem.

Example: "… The person sliced the orange …" (slide annotation: mango)