Xiaodong Yang, Pavlo Molchanov, Jan Kautz

Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification

INTELLIGENT VIDEO ANALYTICS

Surveillance event detection

Human-computer interaction

Multimedia search and indexing

Video Classification

Local feature extraction

Global feature representation

Temporal modeling

INTELLIGENT VIDEO ANALYTICS

Related Work

Dense trajectories, H. Wang et al., ICCV 2013

Bag-of-visual-words, J. Gemert et al., TPAMI 2009

Fisher vector, F. Perronnin et al., ECCV 2010

Spatio-temporal pyramid, X. Yang et al., ECCV 2014


2D-CNN, A. Karpathy et al., CVPR 2014

C3D, D. Tran et al., ICCV 2015

Two-stream networks, K. Simonyan et al., NIPS 2014

LSTM, J. Ng et al., CVPR 2015

OUR CONTRIBUTIONS

Overview of multilayer and multimodal fusion for video classification

Local feature extraction:

• Multilayer representations from CNN

Global feature representation:

• Multimodal representations

• Fusion by boosting

Temporal modeling:

• Structure of FC-RNN

MULTILAYER REPRESENTATIONS

Dense image prediction

FCN by Long et al.

FlowNet by Fischer et al.

Features of conv layers

Poses, parts, articulations, objects, etc.

Visualization by Zeiler et al.


Convert feature maps to feature descriptors

Feature maps of dimension 28×28×5

28×28 feature descriptors of dimension 5
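
The conversion above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the feature maps are stored channel-first as 5 maps of size 28×28; the function name `maps_to_descriptors` and the placeholder values are hypothetical and not taken from the slides.

```python
def maps_to_descriptors(feature_maps):
    """Turn C feature maps of size H x W into H*W descriptors of length C."""
    num_channels = len(feature_maps)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    descriptors = []
    for row in range(height):
        for col in range(width):
            # One descriptor per spatial location, one entry per channel.
            descriptors.append(
                [feature_maps[c][row][col] for c in range(num_channels)]
            )
    return descriptors


# 5 feature maps of size 28 x 28, filled with placeholder values (assumption).
feature_maps = [[[float(c) for _ in range(28)] for _ in range(28)] for c in range(5)]
descriptors = maps_to_descriptors(feature_maps)
print(len(descriptors), len(descriptors[0]))  # 784 5
```

Each spatial location of the conv layer contributes one descriptor, so a 28×28 layer yields 784 descriptors per frame.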


Learn spatial discriminative weights of conv layers

Spatial information of conv layers to enhance representations

Video frames

Feature maps of a conv layer over time

Spatial weights of a conv layer (importance)


Aggregate feature descriptors by Fisher vector (FV)

Gaussian mixture model

Feature maps of a conv layer over time


Represent conv layers by improved Fisher vector (iFV)

Gaussian mixture model

Feature maps of a conv layer over time

Spatial weights of a conv layer (importance)
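
A rough sketch of the aggregation step follows. It computes a first-order Fisher vector from a diagonal-covariance Gaussian mixture model. How the spatial weights enter the iFV is not spelled out in the slide text, so the optional per-descriptor `weights` argument (each descriptor scaled by the weight of its spatial location, normalized by the weight sum) is an assumption, as are all function and variable names.

```python
import math


def gmm_posteriors(x, priors, means, variances):
    """Posterior probability of each diagonal Gaussian component for x."""
    log_p = []
    for pi_k, mu, var in zip(priors, means, variances):
        ll = math.log(pi_k)
        for xd, md, vd in zip(x, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * vd) + (xd - md) ** 2 / vd)
        log_p.append(ll)
    top = max(log_p)  # subtract the maximum for numerical stability
    exp_p = [math.exp(v - top) for v in log_p]
    total = sum(exp_p)
    return [v / total for v in exp_p]


def fisher_vector(descriptors, priors, means, variances, weights=None):
    """First-order Fisher vector; `weights` (assumed) scales each descriptor."""
    if weights is None:
        weights = [1.0] * len(descriptors)
    num_comp, dim = len(means), len(means[0])
    acc = [[0.0] * dim for _ in range(num_comp)]
    for x, w in zip(descriptors, weights):
        gamma = gmm_posteriors(x, priors, means, variances)
        for k in range(num_comp):
            for d in range(dim):
                diff = (x[d] - means[k][d]) / math.sqrt(variances[k][d])
                acc[k][d] += w * gamma[k] * diff
    norm = sum(weights)
    return [
        acc[k][d] / (norm * math.sqrt(priors[k]))
        for k in range(num_comp)
        for d in range(dim)
    ]


# Toy example: one Gaussian component, two 2-D descriptors.
descriptors = [[1.0, -1.0], [3.0, 1.0]]
priors, means, variances = [1.0], [[0.0, 0.0]], [[1.0, 1.0]]
plain = fisher_vector(descriptors, priors, means, variances)
weighted = fisher_vector(descriptors, priors, means, variances, weights=[1.0, 0.0])
```

With a weight of zero, the second descriptor no longer contributes, which is the intended effect of down-weighting unimportant spatial locations in this sketch.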


Represent conv layers by improved Fisher vector (iFV)

Represent fc layers by temporal max pooling

Overview of multilayer representation
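
Temporal max pooling of fc-layer activations can be sketched as an element-wise maximum over time. The function name and the toy activations below are illustrative assumptions.

```python
def temporal_max_pool(frame_features):
    """Element-wise maximum over time of per-frame (or per-clip) fc activations."""
    return [max(values) for values in zip(*frame_features)]


# Two time steps, three fc units each (toy values).
fc_activations = [
    [1.0, 5.0, 2.0],
    [3.0, 0.0, 4.0],
]
video_feature = temporal_max_pool(fc_activations)
print(video_feature)  # [3.0, 5.0, 4.0]
```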

FC-RNN STRUCTURE

Modeling Temporal Dynamics

Don’t be a hero—use pre-trained models


Images/Snippets

Videos


Many pre-trained models from ImageNet and Sports1M

VGG/C3D

Standard RNN: VGG/C3D, fc layer, RNN

FC-RNN: VGG/C3D, FC-RNN


RNN

FC-RNN

Pre-trained CNN, fc layer:

Transfer to recurrent layers

Comparison of standard RNN and FC-RNN
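
The transfer from an fc layer to a recurrent layer can be illustrated as follows. The sketch assumes that the recurrent layer keeps the pre-trained input-to-output weights `W_io` and bias `b` of the fc layer and adds only a hidden-to-hidden matrix `W_hh`, so that y_t = H(W_io x_t + W_hh y_{t-1} + b). The slide text does not preserve the equations, so this form, the choice of ReLU for H, and all names are assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, v) for v in vector]


def fc_layer(x, W_io, b):
    """Pre-trained fully connected layer: y = H(W_io x + b)."""
    return relu([a + c for a, c in zip(matvec(W_io, x), b)])


def fc_rnn(inputs, W_io, W_hh, b):
    """Recurrent layer reusing W_io and b; only W_hh is new (assumed form)."""
    y = [0.0] * len(b)  # initial state
    outputs = []
    for x in inputs:
        pre = [a + r + c for a, r, c in zip(matvec(W_io, x), matvec(W_hh, y), b)]
        y = relu(pre)
        outputs.append(y)
    return outputs


W_io = [[1.0, 0.0], [0.0, 1.0]]   # stands in for pre-trained fc weights
b = [0.0, 0.0]
W_hh = [[0.5, 0.0], [0.0, 0.5]]   # new recurrent weights, trained from scratch
sequence = [[1.0, 2.0], [1.0, 2.0]]
outputs = fc_rnn(sequence, W_io, W_hh, b)
print(outputs)  # [[1.0, 2.0], [1.5, 3.0]]
```

With `W_hh` set to zero, the layer reduces to the original fc layer applied to each time step, which is why the pre-trained weights remain a useful starting point under this assumption.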

MULTIMODAL REPRESENTATIONS

Static and dynamic information

2D-CNN/3D-CNN with video frames/optical flow maps

A single frame

A single flow map

A buffer of frames

A buffer of flow maps

FUSION BY BOOSTING

Optimize a linear combination of predictions of multiple layers from multiple modalities

LPBoost:

boost-u: learn uniform weights for all classes

boost-c: learn class-specific weights

4 layers and 4 modalities: M = 16
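
The difference between the two weighting schemes can be sketched at prediction time. The sketch only shows how already-learned weights combine the M component scores; learning the weights with LPBoost (a linear program) is not shown. The function names, the data layout `scores[m][c]`, and the toy numbers are assumptions.

```python
def fuse_uniform(scores, weights):
    """boost-u style: one weight per component, shared by all classes.

    scores[m][c] is the score of class c from component m
    (a component is one layer of one modality); weights[m] is its weight.
    """
    num_classes = len(scores[0])
    return [
        sum(weights[m] * scores[m][c] for m in range(len(scores)))
        for c in range(num_classes)
    ]


def fuse_class_specific(scores, weights):
    """boost-c style: weights[m][c] is specific to component m and class c."""
    num_classes = len(scores[0])
    return [
        sum(weights[m][c] * scores[m][c] for m in range(len(scores)))
        for c in range(num_classes)
    ]


# Toy example with M = 2 components and 2 classes (the slides use M = 16).
scores = [[1.0, 3.0], [2.0, 0.0]]
fused_u = fuse_uniform(scores, [0.25, 0.75])
fused_c = fuse_class_specific(scores, [[1.0, 0.0], [0.0, 1.0]])
print(fused_u)  # [1.75, 0.75]
print(fused_c)  # [1.0, 0.0]
```

The predicted class is the index of the largest fused score in either scheme.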

EXPERIMENTS

Benchmark datasets

UCF101: 13,320 videos in 101 classes

HMDB51: 6,766 videos in 51 classes

Skiing

Kissing

EXPERIMENTS

FC-RNN

FC-RNN outperforms RNN and LSTM by 3.0% and 2.9%, respectively

Comparison of standard RNN and FC-RNN in training and testing of 3D-CNN-SF on UCF101 (axes: error rate vs. epochs)

Up to 3% improvement

EXPERIMENTS

Feature Aggregation

Comparison of FV and iFV to represent conv layers of different modalities

[Figure panels: a single frame, a single flow map, a buffer of frames; scale label: importance]

Up to 2.5% improvement

EXPERIMENTS

Multilayer Fusion

Classification accuracy of single layers over different modalities and multilayer fusion results

Up to 8% improvement

EXPERIMENTS

Multimodal Fusion

Classification accuracy of different modalities and various combinations

Comparison to the state-of-the-art results

Up to 6% improvement

EXPERIMENTS

LPBoost

[Figure: pie charts for Modalities and Layers. Shares: 17%, 31%, 23%, 29%; 0%, 38%, 12%, 50%. Layer labels: fc7, conv5, fc6, conv4.]

EXPERIMENTS

Effect of Multimodal Fusion

SKIING

SKIJET

2D-CNN-SF: skijet : (

Multimodal Fusion: skiing : )

EXPERIMENTS

Effect of Multimodal Fusion

BOXING PUNCHING BAG

BOXING SPEEDING BAG

2D-CNN-OF: boxing speeding bag : (

Multimodal Fusion: boxing punching bag : )

OUR CONTRIBUTIONS

Local feature extraction:

• Multilayer representations from CNN

Global feature representation:

• Multimodal representations

• Fusion by boosting

Temporal modeling:

• Structure of FC-RNN

Overview of multilayer and multimodal fusion for video classification
