Deep End2End Voxel2Voxel Prediction
Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri
Presented by: Ahmed Osman
Outline
• Problems
  – Video Semantic Segmentation
  – Optical Flow Estimation
  – Video Coloring
• Related Work
• Contribution
• Method
• Experiments and Results
• Conclusion
Video Semantic Segmentation
• Semantic Segmentation
http://jamie.shotton.org/work/images/resear6.png
Video Semantic Segmentation
• Video Semantic Segmentation
http://jamie.shotton.org/work/images/resear6.png
Optical Flow Estimation
http://www.cvlibs.net/projects/objectsceneflow/showcase.jpg
"A Filter Formulation for Computing Real Time Optical Flow", Adarve et al. https://www.youtube.com/watch?v=_oW1vMdBMuY
Video Coloring
http://images.mentalfloss.com/sites/default/files/styles/article_640x430/public/colorizing-movies_6.jpg
Traditional Computer Vision Pipeline
Convolutional Neural Networks
• Motivation
  – "Convolutional Neural Networks (CNN) are biologically-inspired variants of MLPs."
  – "Revolutionized the traditional computer vision pipeline"
  – Re-popularized by Krizhevsky et al. in 2012 by producing state-of-the-art results on the ImageNet dataset (image classification).
  – Why was AlexNet successful?
    • Large labeled datasets
    • GPU computing
ConvNets
ConvNets
• Convolution
https://developer.apple.com/library/ios/documentation/Performance/Conceptual/vImage/ConvolutionOperations/ConvolutionOperations.html
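To make the convolution operation concrete, here is a minimal sketch of a valid-mode 2D convolution in plain NumPy (the function and array names are illustrative, not from the slides; CNNs actually compute cross-correlation, i.e. the kernel is not flipped):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D convolution: slide the kernel over the image
    and take a weighted sum at every position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge_kernel = np.array([[1.0, -1.0]])  # simple horizontal-difference filter
print(conv2d(image, edge_kernel))      # every horizontal step of 1 gives -1
```

In a ConvNet the kernel values are learned, and the same kernel is applied at every position (weight sharing), which is what makes the layer cheap compared to a fully connected one.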
ConvNets
• Convolution Layer
http://cs231n.github.io/convolutional-networks/
ConvNets
• Activation function
ConvNets
• Activation function
  – Rectified Linear Unit (ReLU)
    • No vanishing gradient problem for positive inputs
    • Non-linear
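As a quick illustration (my own sketch, not from the slides), ReLU and its gradient fit in a few lines of NumPy; the gradient is exactly 1 for every positive input, which is why it does not shrink as it is propagated backward through many layers:

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```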
ConvNets
• Pooling
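A minimal sketch of non-overlapping max pooling (illustrative code, names my own): each 2x2 window is reduced to its largest value, halving the spatial resolution while keeping the strongest activations:

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Non-overlapping max pooling: keep the largest value in each
    size x size window, shrinking the spatial resolution."""
    h, w = x.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max()
    return out

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [9., 0., 1., 2.],
              [1., 1., 3., 4.]])
print(max_pool(x))  # [[4. 8.]
                    #  [9. 4.]]
```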
ConvNets
• Fully Connected Layer
ConvNets
• How to determine the weights?
  – Learn them using backpropagation
ConvNets
• Loss Function
  – Softmax
  – Huber
  – L2
ConvNets
• Loss Function (plot: green = Huber, blue = L2)
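A minimal sketch (my own code, not the authors') comparing the two regression losses: Huber matches L2 near zero but grows only linearly for large residuals, which makes it less sensitive to outliers in the targets:

```python
import numpy as np

def l2_loss(r):
    """Squared error: grows quadratically with the residual r."""
    return 0.5 * r ** 2

def huber_loss(r, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond: robust to outliers."""
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))

r = np.array([0.5, 1.0, 4.0])
print(l2_loss(r))     # [0.125 0.5   8.   ]
print(huber_loss(r))  # [0.125 0.5   3.5  ]
```

Note how the two losses agree for small residuals and diverge at r = 4, which is exactly the shape the green (Huber) and blue (L2) curves show.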
ConvNets
• How to determine the weights?
  – Learn them using backpropagation
  – Chain Rule
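The chain rule in action, as a hand-coded sketch (illustrative toy example, not the network from the paper): for the loss (w*x - y)^2, the gradient with respect to w is assembled by multiplying the local derivative of each step backward through the computation:

```python
# Forward pass: p = w * x, r = p - y, loss = r^2
x, w, y = 2.0, 3.0, 5.0
p = w * x          # 6.0
r = p - y          # 1.0
loss = r ** 2      # 1.0

# Backward pass: one local derivative per step, multiplied together
dloss_dr = 2 * r                      # d(r^2)/dr = 2r
dr_dp = 1.0                           # d(p - y)/dp = 1
dp_dw = x                             # d(w * x)/dw = x
dloss_dw = dloss_dr * dr_dp * dp_dw   # 2 * 1 * 2 = 4.0
print(dloss_dw)  # 4.0
```

Backpropagation applies exactly this pattern layer by layer, caching forward values (here p and r) so each local derivative is cheap to compute.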
Backpropagation
Slides from Stanford University course CS231n: http://cs231n.stanford.edu/slides/winter1516_lecture4.pdf
Related Work
• Fully Convolutional Network
• FlowNet
• Depth Map Prediction from a Single Image using a Multi-Scale Deep Network
Related Work
• Fully Convolutional Network (FCN)
Related Work
• FlowNet
Related Work
• Eigen et al. [2014]
Contribution
• 3D CNN end-to-end voxel-wise prediction
• Same network architecture for all three challenges.
• Introduces an approach for training with limited data.
Recap: Problem
• Input: Channels x # of Frames x Height x Width
• Output: K x # of Frames x Height x Width
Segmentation done by http://segmentit.sourceforge.net/
http://barkpost.com/wp-content/uploads/2013/03/oie_5181838bU3HJXJp.gif
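The voxel-to-voxel tensor shapes above can be sketched with NumPy arrays (the sizes here are illustrative; K is the number of output values predicted per voxel, e.g. one score per semantic class):

```python
import numpy as np

C, T, H, W = 3, 16, 112, 112  # RGB channels, frames, height, width
K = 8                         # e.g. 8 semantic classes per voxel

clip = np.zeros((C, T, H, W))        # input video clip
prediction = np.zeros((K, T, H, W))  # K-way score for every voxel

# Voxel-wise prediction: every (frame, row, col) voxel gets a label
labels = prediction.argmax(axis=0)
print(labels.shape)  # (16, 112, 112)
```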
Method
• Adapted from C3D
• Main Difference: Added deconvolution layers
"Learning Spatiotemporal Features with 3D Convolutional Networks", Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri
Deconvolution
"Visualizing and Understanding Convolutional Networks", Matthew D Zeiler, Rob Fergus
(Figure: Layer 1 and Layer 2 visualizations)
Deconvolution
• Upsampling
• Learnable Deconvolution
• Visualization Deconvolution
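A minimal sketch of what a learnable deconvolution (transposed convolution) does, in plain NumPy (my own illustrative code, 1D for brevity): each input value scatters a scaled copy of the kernel into a larger output, so the layer upsamples while its kernel weights stay learnable:

```python
import numpy as np

def deconv1d(x, kernel, stride=2):
    """1D transposed convolution: place kernel * x[i] at position
    i * stride and sum overlaps, producing an upsampled output."""
    k = len(kernel)
    out = np.zeros((len(x) - 1) * stride + k)
    for i, v in enumerate(x):
        out[i * stride:i * stride + k] += v * kernel
    return out

x = np.array([1.0, 2.0, 3.0])
kernel = np.array([1.0, 1.0])   # learnable in a real network
print(deconv1d(x, kernel))      # [1. 1. 2. 2. 3. 3.]
```

With this particular kernel the result is nearest-neighbor upsampling; training lets the network learn a better interpolation, which is what distinguishes it from fixed bilinear upsampling.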
Experiments and Results
• Video Semantic Segmentation
• Optical Flow Estimation
• Video Coloring
Experiments: Video Semantic Segmentation
• Dataset:
  – GATECH dataset
  – Training set: 63 videos
  – Test set: 38 sequences
  – 8 classes
"Geometric Context from Videos", Hussain Raza, Matthias Grundmann, Irfan Essa
Experiments: Video Semantic Segmentation
• Experiment:
  – Training: Split each video into all possible clips of length 16 frames (i.e. stride 1).
  – Testing: Performed on all non-overlapping clips (i.e. stride 16).
"Geometric Context from Videos", Hussain Raza, Matthias Grundmann, Irfan Essa
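The clip-splitting scheme above can be sketched as follows (illustrative code, not the authors'):

```python
def extract_clips(num_frames, clip_len=16, stride=1):
    """Return (start, end) frame-index pairs for every clip of
    clip_len frames; stride 1 yields all overlapping clips
    (training), stride 16 yields non-overlapping clips (testing)."""
    return [(s, s + clip_len)
            for s in range(0, num_frames - clip_len + 1, stride)]

video_frames = 64
print(len(extract_clips(video_frames, stride=1)))   # 49 training clips
print(len(extract_clips(video_frames, stride=16)))  # 4 test clips
```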
Experiments: Video Semantic Segmentation
• Network details (V2V):
  – Loss layer: Softmax
  – Weights initialized from C3D. New layers are randomly initialized.
  – Initial learning rate: 10^-4, divided by 10 every 30K iterations
Results: Video Semantic Segmentation
(Figure: V2V compared against bilinear upsampling)
Results: Video Semantic Segmentation
(Figure: smooth vs. noisy predictions, network depth)
Experiments: Optical Flow Estimation
• Training:
  – Problem: No large dataset with optical flow ground truth.
  – Solution? Fabricate "semi-truth" from an existing optical flow method; Brox's method was used.
  – Dataset:
    • (V2V) UCF101 (partial: test split 1)
    • (Fine-tuned V2V) MPI-Sintel
• Network:
  – Loss function: Huber loss
  – Initial learning rate: 10^-8, divided by 10 every 200K iterations
Results: Optical Flow Estimation
• Testing: MPI-Sintel
(Figure columns: Input, V2V, Brox, Ground truth)
Results: Optical Flow Estimation
• Fine-tuning from C3D does not improve the results much.
• Same architecture, different purpose
Experiments: Video Coloring
• Dataset:
  – UCF101
  – Convert color videos to grayscale.
• Experiment (Training):
  – Loss function: L2
  – Initial learning rate: 10^-8, divided by 10 every 200K iterations
Results: Video Coloring

Network | Average Distance Error (ADE)
------- | ----------------------------
2D-V2V  | 0.1495
V2V     | 0.1375
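The slides do not spell out how ADE is computed; a plausible reading (my assumption, not the paper's stated formula) is the Euclidean distance between predicted and ground-truth color vectors, averaged over all voxels:

```python
import numpy as np

def average_distance_error(pred, truth):
    """Mean Euclidean distance between predicted and ground-truth
    color vectors over all voxels (assumed definition of ADE)."""
    return np.mean(np.linalg.norm(pred - truth, axis=-1))

# Two voxels: one colored perfectly, one off by 0.3 in one channel
pred  = np.array([[0.2, 0.4, 0.6], [0.1, 0.1, 0.1]])
truth = np.array([[0.2, 0.4, 0.6], [0.1, 0.1, 0.4]])
print(average_distance_error(pred, truth))  # 0.15
```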
Results: Video Coloring
• V2V learns "common sense" colors
(Figure: Input, V2V, Ground Truth)
Conclusion
• Contributions:
  – 3D CNN end-to-end voxel-wise prediction
  – "Same" network architecture for all three challenges.
  – Utilizes a well-established method to generate training data.
• Criticisms:
  – Fine-tuning did not noticeably improve the optical flow result in comparison with Brox's method.
  – No mention of the activation function, even in C3D.
Thank You for Listening
Questions?
References
• "Deep End2End Voxel2Voxel Prediction" – Tran et al., 2015
• "FlowNet: Learning Optical Flow with Convolutional Networks" – Fischer et al., 2015
• "ImageNet Classification with Deep Convolutional Neural Networks" – Krizhevsky et al., 2012
• "Learning Spatiotemporal Features with 3D Convolutional Networks" – Tran et al., 2015
• "Visualizing and Understanding Convolutional Networks" – Zeiler et al., 2014
• "Fully Convolutional Networks for Semantic Segmentation" – Long et al., 2015
• "Depth Map Prediction from a Single Image using a Multi-Scale Deep Network" – Eigen et al., 2014
• "Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation" – Brox et al., 2011
Backup Slides
Multi-layer Perceptron
• A perceptron is a linear classifier that utilizes a set of weights to predict an output for a feature vector.
https://blog.dbrgn.ch/images/2013/3/26/perceptron.png
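The perceptron's decision rule can be sketched in a few lines (illustrative code; the weights here are chosen by hand rather than learned):

```python
import numpy as np

def perceptron(x, w, b):
    """Linear classifier: output 1 if the weighted sum of the
    inputs plus bias is positive, else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

w = np.array([1.0, 1.0])   # hand-picked weights realizing an AND gate
b = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x, dtype=float), w, b))
# only (1, 1) outputs 1
```

Stacking layers of such units with non-linear activations gives the multi-layer perceptron, whose weights are then learned by backpropagation.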