Xiaodong Yang, Pavlo Molchanov, Jan Kautz
Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification
INTELLIGENT VIDEO ANALYTICS
Surveillance event detection
Human-computer interaction
Multimedia search and indexing
Video Classification
INTELLIGENT VIDEO ANALYTICS: Related Work
Local feature extraction:
• Dense trajectories, H. Wang et al., ICCV 2013
Global feature representation:
• Bag-of-visual-words, J. Gemert et al., TPAMI 2009
• Fisher vector, F. Perronnin et al., ECCV 2010
Temporal modeling:
• Spatio-temporal pyramid, X. Yang et al., ECCV 2014
INTELLIGENT VIDEO ANALYTICS: Related Work
2D-CNN, A. Karpathy et al., CVPR 2014
C3D, D. Tran et al., ICCV 2015
Two-stream networks, K. Simonyan et al., NIPS 2014
LSTM, J. Ng et al., CVPR 2015
OUR CONTRIBUTIONS
Overview of multilayer and multimodal fusion for video classification
Local feature extraction:
• Multilayer representations from CNN
Global feature representation:
• Multimodal representations
• Fusion by boosting
Temporal modeling:
• Structure of FC-RNN
MULTILAYER REPRESENTATIONS
Dense image prediction
FCN by Long et al.
FlowNet by Fischer et al.
MULTILAYER REPRESENTATIONS
Features of conv layers
Poses, parts, articulations, objects, etc.
Visualization by Zeiler et al.
MULTILAYER REPRESENTATIONS
Convert feature maps to feature descriptors
Feature maps of dimension 28×28×5
28×28 feature descriptors of dimension 5
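The conversion on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a channel-last layout (height, width, channels); the function name `maps_to_descriptors` and the random input are hypothetical, and frameworks that store maps channel-first would need a transpose first.

```python
import numpy as np

def maps_to_descriptors(feature_maps: np.ndarray) -> np.ndarray:
    """Turn an (H, W, C) stack of feature maps into H*W descriptors of dimension C."""
    h, w, c = feature_maps.shape
    # Each spatial location (i, j) becomes one row holding its C channel responses.
    return feature_maps.reshape(h * w, c)

# Hypothetical conv-layer output matching the slide's example: 28x28 maps, 5 channels.
feature_maps = np.random.default_rng(0).standard_normal((28, 28, 5))
descriptors = maps_to_descriptors(feature_maps)
print(descriptors.shape)  # (784, 5), i.e. 28x28 descriptors of dimension 5
```

Row `i * 28 + j` of `descriptors` is the descriptor at spatial location `(i, j)`, so the spatial position of every descriptor stays recoverable from its index.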
MULTILAYER REPRESENTATIONS
Learn spatial discriminative weights of conv layers
Spatial information of conv layers to enhance representations
Video frames; feature maps of a conv layer over time
Spatial weights of a conv layer (importance)
MULTILAYER REPRESENTATIONS
Aggregate feature descriptors by Fisher vector (FV)
Gaussian mixture model; feature maps of a conv layer over time
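A minimal sketch of Fisher vector aggregation, assuming the commonly used formulation with a diagonal-covariance GMM and gradients with respect to the means and standard deviations. The function name `fisher_vector`, the parameter names, and the random GMM parameters are illustrative assumptions; in practice the GMM would be fitted to training descriptors beforehand, and the slides do not state which normalization the authors apply.

```python
import numpy as np

def fisher_vector(x, weights, means, variances):
    """Fisher vector of descriptors x (N, D) under a diagonal GMM with K components.

    weights: (K,), means: (K, D), variances: (K, D). Returns a vector of length 2*K*D.
    """
    n = x.shape[0]
    diff = x[:, None, :] - means[None, :, :]                  # (N, K, D)
    # Log-density of each descriptor under each Gaussian component.
    log_prob = -0.5 * (np.sum(diff**2 / variances, axis=2)
                       + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_prob                     # (N, K)
    log_post -= log_post.max(axis=1, keepdims=True)           # numerical stability
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)                 # soft assignments
    z = diff / np.sqrt(variances)                             # normalized deviations
    # Gradients w.r.t. the means and the standard deviations of each component.
    g_mu = np.einsum("nk,nkd->kd", gamma, z) / (n * np.sqrt(weights)[:, None])
    g_sigma = np.einsum("nk,nkd->kd", gamma, z**2 - 1.0) / (n * np.sqrt(2.0 * weights)[:, None])
    return np.concatenate([g_mu.ravel(), g_sigma.ravel()])

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 5))      # hypothetical descriptors from a conv layer
k = 3
fv = fisher_vector(x, np.full(k, 1.0 / k), rng.standard_normal((k, 5)), np.ones((k, 5)))
print(fv.shape)  # (30,)
```

The output length is 2 x K x D regardless of how many descriptors a video contributes, which is what makes the representation usable for videos of different lengths.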
MULTILAYER REPRESENTATIONS
Represent conv layers by improved Fisher vector (iFV)
Gaussian mixture model; feature maps of a conv layer over time
Spatial weights of a conv layer (importance)
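The slide combines a GMM with a spatial importance map but does not spell out how the weights enter the aggregation. The sketch below rests on one plausible assumption: each descriptor's contribution to the mean-gradient statistics is scaled by the weight of its spatial location. Only the first-order (mean) part is shown for brevity, and `weighted_fv_means` and all parameter names are hypothetical.

```python
import numpy as np

def weighted_fv_means(maps, spatial_w, gmm_w, means, variances):
    """First-order (mean-gradient) Fisher vector statistics with spatial weights.

    maps: (T, H, W, C) feature maps over time; spatial_w: (H, W) importance map.
    gmm_w: (K,), means: (K, C), variances: (K, C) of a diagonal GMM.
    """
    t, h, w, c = maps.shape
    x = maps.reshape(t * h * w, c)                    # one descriptor per frame and location
    omega = np.tile(spatial_w.reshape(-1), t)         # weight of each descriptor's location
    diff = x[:, None, :] - means[None, :, :]          # (N, K, C)
    log_post = (np.log(gmm_w)
                - 0.5 * np.sum(diff**2 / variances + np.log(2.0 * np.pi * variances), axis=2))
    log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)         # soft assignments
    z = diff / np.sqrt(variances)
    # Assumption: each descriptor's contribution is scaled by its spatial weight.
    g_mu = np.einsum("n,nk,nkd->kd", omega, gamma, z)
    return (g_mu / (x.shape[0] * np.sqrt(gmm_w)[:, None])).ravel()

rng = np.random.default_rng(0)
maps = rng.standard_normal((4, 28, 28, 5))            # hypothetical conv maps for 4 frames
spatial_w = rng.random((28, 28))                      # hypothetical importance map
gmm_w, means, variances = np.full(3, 1.0 / 3), rng.standard_normal((3, 5)), np.ones((3, 5))
ifv = weighted_fv_means(maps, spatial_w, gmm_w, means, variances)
print(ifv.shape)  # (15,)
```

With an all-ones weight map this reduces to the unweighted mean-gradient statistics, so the spatial weights act purely as a re-weighting of locations.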
MULTILAYER REPRESENTATIONS
Represent conv layers by improved Fisher vector (iFV)
Represent fc layers by temporal max pooling
Overview of multilayer representation
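Temporal max pooling of fc-layer activations can be illustrated as follows. This is a minimal sketch, assuming one activation vector per frame (or per clip) stacked along the first axis; the function name and the toy values are hypothetical.

```python
import numpy as np

def temporal_max_pool(fc_features: np.ndarray) -> np.ndarray:
    """Element-wise max over time of fc activations with shape (T, D)."""
    return fc_features.max(axis=0)

# Toy example: 3 time steps, 3-dimensional fc activations.
fc = np.array([[0.1, 0.9, 0.0],
               [0.7, 0.2, 0.3],
               [0.4, 0.5, 0.8]])
pooled = temporal_max_pool(fc)
print(pooled)  # [0.7 0.9 0.8]
```

The pooled vector has a fixed dimension D whatever the number of time steps, so it can be fed to a classifier directly.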
FC-RNN STRUCTURE: Modeling Temporal Dynamics
Don’t be a hero: use pre-trained models
Many pre-trained models from ImageNet and Sports1M
Images/Snippets: VGG/C3D
Videos (standard RNN): VGG/C3D → fc layer → RNN
Videos (FC-RNN): VGG/C3D → FC-RNN
FC-RNN STRUCTURE: Modeling Temporal Dynamics
Pre-trained CNN, fc layer: transfer to recurrent layers
Comparison of standard RNN and FC-RNN (panels: RNN, FC-RNN)
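The comparison can be sketched as two forward passes. The equations from the slide are not reproduced here, so this sketch rests on an assumption about the structure: the standard RNN stacks a new recurrent layer (new input and hidden weights) on top of the pre-trained fc layer, while FC-RNN reuses the pre-trained fc weights as the input-to-hidden weights and adds only a hidden-to-hidden matrix. The function names, the ReLU and tanh activations, and the zero initial state are illustrative choices.

```python
import numpy as np

def standard_rnn(x_seq, w_fc, b_fc, w_ih, w_hh, b_h):
    """Pre-trained fc layer followed by a separate recurrent layer (w_ih, w_hh, b_h are new)."""
    h = np.zeros(w_hh.shape[0])
    for x in x_seq:
        y = np.maximum(0.0, w_fc @ x + b_fc)      # pre-trained fc layer
        h = np.tanh(w_ih @ y + w_hh @ h + b_h)    # recurrent layer on top
    return h

def fc_rnn(x_seq, w_fc, b_fc, w_hh):
    """fc layer turned into a recurrent layer: only w_hh is new."""
    h = np.zeros(w_fc.shape[0])
    for x in x_seq:
        h = np.maximum(0.0, w_fc @ x + w_hh @ h + b_fc)
    return h

rng = np.random.default_rng(0)
x_seq = rng.standard_normal((6, 10))              # 6 time steps of 10-d inputs to the fc layer
w_fc, b_fc = rng.standard_normal((8, 10)), rng.standard_normal(8)
h_fc_rnn = fc_rnn(x_seq, w_fc, b_fc, 0.1 * rng.standard_normal((8, 8)))
h_rnn = standard_rnn(x_seq, w_fc, b_fc, rng.standard_normal((4, 8)),
                     rng.standard_normal((4, 4)), np.zeros(4))
print(h_fc_rnn.shape, h_rnn.shape)  # (8,) (4,)
```

Under this assumption, setting `w_hh` to zero makes `fc_rnn` reproduce the pre-trained fc layer exactly, so training can start from the pre-trained behavior and only has to learn the recurrent connection.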
MULTIMODAL REPRESENTATIONS
Static and dynamic information
2D-CNN/3D-CNN with video frames/optical flow maps
A single frame (2D-CNN-SF)
A single flow map (2D-CNN-OF)
A buffer of frames (3D-CNN-SF)
A buffer of flow maps (3D-CNN-OF)
FUSION BY BOOSTING
Optimize a linear combination of predictions of multiple layers from multiple modalities
LPBoost:
boost-u: learn uniform weights for all classes
boost-c: learn class specific weights
4 layers and 4 modalities: M = 16
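The two weighting schemes can be illustrated on the prediction side. This sketch only applies given weights; it does not solve the LPBoost linear program that learns them. The array layout, the function names, and the random scores and weights are assumptions made for illustration.

```python
import numpy as np

def fuse_uniform(scores, beta):
    """boost-u style: one weight per representation, shared by all classes.

    scores: (M, N, C) predictions of M representations for N videos and C classes.
    beta: (M,) combination weights.
    """
    return np.tensordot(beta, scores, axes=(0, 0))     # (N, C)

def fuse_class_specific(scores, beta):
    """boost-c style: a separate weight per representation and class, beta: (M, C)."""
    return np.einsum("mc,mnc->nc", beta, scores)       # (N, C)

rng = np.random.default_rng(0)
m, n, c = 16, 7, 101                                   # 4 layers x 4 modalities, 101 classes
scores = rng.random((m, n, c))                         # hypothetical per-representation scores
beta_u = rng.random(m)
beta_u /= beta_u.sum()                                 # weights normalized to sum to 1
fused_u = fuse_uniform(scores, beta_u)
fused_c = fuse_class_specific(scores, rng.random((m, c)))
print(fused_u.shape, fused_c.shape)  # (7, 101) (7, 101)
```

The class-specific scheme has C times as many weights, and it reduces to the uniform scheme when every class shares the same weight vector.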
EXPERIMENTS
Benchmark datasets
UCF101: 13,320 videos in 101 classes
HMDB51: 6,766 videos in 51 classes
Skiing
Kissing
EXPERIMENTS: FC-RNN
FC-RNN outperforms standard RNN and LSTM by 3.0% and 2.9%, respectively
Comparison of standard RNN and FC-RNN in training and testing of 3D-CNN-SF on UCF101 (plot: error rate vs. epochs)
Up to 3% improvement
EXPERIMENTS: Feature Aggregation
Comparison of FV and iFV to represent conv layers of different modalities
Modalities: a single frame, a single flow map, a buffer of frames, a buffer of flow maps
Spatial weights of a conv layer (importance)
Up to 2.5% improvement
EXPERIMENTS: Multilayer Fusion
Classification accuracy of single layers over different modalities and multilayer fusion results
Up to 8% improvement
EXPERIMENTS: Multimodal Fusion
Classification accuracy of different modalities and various combinations
Comparison to the state-of-the-art results
Up to 6% improvement
EXPERIMENTS: LPBoost
Pie chart values: 17%, 31%, 23%, 29%; 0%, 38%, 12%, 50%
Pie chart labels: Modalities; Layers (fc7, conv5, fc6, conv4)
EXPERIMENTS: Effect of Multimodal Fusion
SKIING, SKIJET
2D-CNN-SF: skijet :(
Multimodal Fusion: skiing :)
EXPERIMENTS: Effect of Multimodal Fusion
BOXING PUNCHING BAG, BOXING SPEEDING BAG
2D-CNN-OF: boxing speeding bag :(
Multimodal Fusion: boxing punching bag :)
OUR CONTRIBUTIONS
Local feature extraction:
• Multilayer representations from CNN
Global feature representation:
• Multimodal representations
• Fusion by boosting
Temporal modeling:
• Structure of FC-RNN
Overview of multilayer and multimodal fusion for video classification