Deep End2End Voxel2Voxel Prediction
Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri
Presented by: Ahmed Osman
Outline
• Problems
  – Video Semantic Segmentation
  – Optical Flow Estimation
  – Video Coloring
• Related Work
• Contribution
• Method
• Experiments and Results
• Conclusion
Video Semantic Segmentation
• Semantic Segmentation
http://jamie.shotton.org/work/images/resear6.png
Video Semantic Segmentation
• Video Semantic Segmentation
http://jamie.shotton.org/work/images/resear6.png
Optical Flow Estimation
http://www.cvlibs.net/projects/objectsceneflow/showcase.jpg
"A Filter Formulation for Computing Real Time Optical Flow", Adarve et al. https://www.youtube.com/watch?v=_oW1vMdBMuY
Video Coloring
http://images.mentalfloss.com/sites/default/files/styles/article_640x430/public/colorizing-movies_6.jpg
Traditional Computer Vision Pipeline
Convolutional Neural Networks
• Motivation
  – "Convolutional Neural Networks (CNN) are biologically-inspired variants of MLPs."
  – "Revolutionized the traditional computer vision pipeline"
  – Re-popularized by Krizhevsky et al. in 2012 by producing state-of-the-art results on the ImageNet dataset (image classification).
  – Why was AlexNet successful?
    • Large labeled datasets
    • GPU computing
ConvNets
ConvNets
• Convolution
https://developer.apple.com/library/ios/documentation/Performance/Conceptual/vImage/ConvolutionOperations/ConvolutionOperations.html
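To make the convolution operation concrete, here is a minimal sketch of a valid-mode 2D convolution in plain NumPy (the function and array names are illustrative, not from the slides; CNNs actually compute cross-correlation, i.e. the kernel is not flipped):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D convolution: slide the kernel over the image
    and take a weighted sum at every position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge_kernel = np.array([[1.0, -1.0]])  # simple horizontal-difference filter
print(conv2d(image, edge_kernel))      # every horizontal step of 1 gives -1
```

In a ConvNet the kernel values are learned, and the same kernel is applied at every position (weight sharing), which is what makes the layer cheap compared to a fully connected one.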
ConvNets
• Convolution Layer
http://cs231n.github.io/convolutional-networks/
ConvNets
• Activation function
ConvNets
• Activation function
  – Rectified Linear Unit (ReLU)
    • No vanishing gradient problem for positive inputs
    • Non-linear
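As a quick illustration (my own sketch, not from the slides), ReLU and its gradient fit in a few lines of NumPy; the gradient is exactly 1 for every positive input, which is why it does not shrink as it is propagated backward through many layers:

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```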
ConvNets
• Pooling
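A minimal sketch of non-overlapping max pooling (illustrative code, names my own): each 2x2 window is reduced to its largest value, halving the spatial resolution while keeping the strongest activations:

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Non-overlapping max pooling: keep the largest value in each
    size x size window, shrinking the spatial resolution."""
    h, w = x.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max()
    return out

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [9., 0., 1., 2.],
              [1., 1., 3., 4.]])
print(max_pool(x))  # [[4. 8.]
                    #  [9. 4.]]
```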
ConvNets
• Fully Connected Layer
ConvNets
• How to determine the weights?
  – Learn them using backpropagation
ConvNets
• Loss Function
  – Softmax
  – Huber
  – L2
ConvNets
• Loss Function (plot: green = Huber, blue = L2)
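A minimal sketch (my own code, not the authors') comparing the two regression losses: Huber matches L2 near zero but grows only linearly for large residuals, which makes it less sensitive to outliers in the targets:

```python
import numpy as np

def l2_loss(r):
    """Squared error: grows quadratically with the residual r."""
    return 0.5 * r ** 2

def huber_loss(r, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond: robust to outliers."""
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))

r = np.array([0.5, 1.0, 4.0])
print(l2_loss(r))     # [0.125 0.5   8.   ]
print(huber_loss(r))  # [0.125 0.5   3.5  ]
```

Note how the two losses agree for small residuals and diverge at r = 4, which is exactly the shape the green (Huber) and blue (L2) curves show.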
ConvNets
• How to determine the weights?
  – Learn them using backpropagation
  – Chain Rule
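The chain rule in action, as a hand-coded sketch (illustrative toy example, not the network from the paper): for the loss (w*x - y)^2, the gradient with respect to w is assembled by multiplying the local derivative of each step backward through the computation:

```python
# Forward pass: p = w * x, r = p - y, loss = r^2
x, w, y = 2.0, 3.0, 5.0
p = w * x          # 6.0
r = p - y          # 1.0
loss = r ** 2      # 1.0

# Backward pass: one local derivative per step, multiplied together
dloss_dr = 2 * r                      # d(r^2)/dr = 2r
dr_dp = 1.0                           # d(p - y)/dp = 1
dp_dw = x                             # d(w * x)/dw = x
dloss_dw = dloss_dr * dr_dp * dp_dw   # 2 * 1 * 2 = 4.0
print(dloss_dw)  # 4.0
```

Backpropagation applies exactly this pattern layer by layer, caching forward values (here p and r) so each local derivative is cheap to compute.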
Backpropagation
Slides from Stanford University course CS231n: http://cs231n.stanford.edu/slides/winter1516_lecture4.pdf
Related Work
• Fully Convolutional Network
• FlowNet
• Depth Map Prediction from a Single Image using a Multi-Scale Deep Network
Related Work
• Fully Convolutional Network (FCN)
Related Work
• FlowNet
Related Work
• Eigen et al. [2014]
Contribution
• 3D CNN end-to-end voxel-wise prediction
• Same network architecture for all three challenges.
• Introduces an approach for training with limited data.
Recap: Problem
• Input: Channels x # of Frames x Height x Width
• Output: K x # of Frames x Height x Width
Segmentation done by http://segmentit.sourceforge.net/
http://barkpost.com/wp-content/uploads/2013/03/oie_5181838bU3HJXJp.gif
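The voxel-to-voxel tensor shapes above can be sketched with NumPy arrays (the sizes here are illustrative; K is the number of output values predicted per voxel, e.g. one score per semantic class):

```python
import numpy as np

C, T, H, W = 3, 16, 112, 112  # RGB channels, frames, height, width
K = 8                         # e.g. 8 semantic classes per voxel

clip = np.zeros((C, T, H, W))        # input video clip
prediction = np.zeros((K, T, H, W))  # K-way score for every voxel

# Voxel-wise prediction: every (frame, row, col) voxel gets a label
labels = prediction.argmax(axis=0)
print(labels.shape)  # (16, 112, 112)
```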
Method
• Adapted from C3D
• Main Difference: Added deconvolution layers
"Learning Spatiotemporal Features with 3D Convolutional Networks", Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri
Deconvolution
"Visualizing and Understanding Convolutional Networks", Matthew D Zeiler, Rob Fergus
(Figure: Layer 1 and Layer 2 visualizations)
Deconvolution
• Upsampling
• Learnable Deconvolution
• Visualization Deconvolution
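A minimal sketch of what a learnable deconvolution (transposed convolution) does, in plain NumPy (my own illustrative code, 1D for brevity): each input value scatters a scaled copy of the kernel into a larger output, so the layer upsamples while its kernel weights stay learnable:

```python
import numpy as np

def deconv1d(x, kernel, stride=2):
    """1D transposed convolution: place kernel * x[i] at position
    i * stride and sum overlaps, producing an upsampled output."""
    k = len(kernel)
    out = np.zeros((len(x) - 1) * stride + k)
    for i, v in enumerate(x):
        out[i * stride:i * stride + k] += v * kernel
    return out

x = np.array([1.0, 2.0, 3.0])
kernel = np.array([1.0, 1.0])   # learnable in a real network
print(deconv1d(x, kernel))      # [1. 1. 2. 2. 3. 3.]
```

With this particular kernel the result is nearest-neighbor upsampling; training lets the network learn a better interpolation, which is what distinguishes it from fixed bilinear upsampling.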
Experiments and Results
• Video Semantic Segmentation
• Optical Flow Estimation
• Video Coloring
Experiments: Video Semantic Segmentation
• Dataset:
  – GATECH dataset
  – Training set: 63 videos
  – Test set: 38 sequences
  – 8 classes
"Geometric Context from Videos", Hussain Raza, Matthias Grundmann, Irfan Essa
Experiments: Video Semantic Segmentation
• Experiment:
  – Training: Split each video into all possible clips of length 16 frames (i.e. stride 1).
  – Testing: Performed on all non-overlapping clips (i.e. stride 16).
"Geometric Context from Videos", Hussain Raza, Matthias Grundmann, Irfan Essa
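The clip-splitting scheme above can be sketched as follows (illustrative code, not the authors'):

```python
def extract_clips(num_frames, clip_len=16, stride=1):
    """Return (start, end) frame-index pairs for every clip of
    clip_len frames; stride 1 yields all overlapping clips
    (training), stride 16 yields non-overlapping clips (testing)."""
    return [(s, s + clip_len)
            for s in range(0, num_frames - clip_len + 1, stride)]

video_frames = 64
print(len(extract_clips(video_frames, stride=1)))   # 49 training clips
print(len(extract_clips(video_frames, stride=16)))  # 4 test clips
```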
Experiments: Video Semantic Segmentation
• Network details (V2V):
  – Loss layer: Softmax
  – Weights initialized from C3D. New layers are randomly initialized.
  – Initial learning rate: 10^-4, divided by 10 every 30K iterations
Results: Video Semantic Segmentation
(Figure: V2V compared against bilinear upsampling)
Results: Video Semantic Segmentation
(Figure: smooth vs. noisy predictions, network depth)
Experiments: Optical Flow Estimation
• Training:
  – Problem: No large dataset with optical flow ground truth.
  – Solution? Fabricate "semi-truth" from an existing optical flow method; Brox's method was used.
  – Dataset:
    • (V2V) UCF101 (partial: test split 1)
    • (Fine-tuned V2V) MPI-Sintel
• Network:
  – Loss function: Huber loss
  – Initial learning rate: 10^-8, divided by 10 every 200K iterations
Results: Optical Flow Estimation
• Testing: MPI-Sintel
(Figure columns: Input, V2V, Brox, Ground truth)
Results: Optical Flow Estimation
• Fine-tuning from C3D does not improve the results much.
• Same architecture, different purpose
Experiments: Video Coloring
• Dataset:
  – UCF101
  – Convert color videos to grayscale.
• Experiment (Training):
  – Loss function: L2
  – Initial learning rate: 10^-8, divided by 10 every 200K iterations
Results: Video Coloring

Network | Average Distance Error (ADE)
------- | ----------------------------
2D-V2V  | 0.1495
V2V     | 0.1375
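The slides do not spell out how ADE is computed; a plausible reading (my assumption, not the paper's stated formula) is the Euclidean distance between predicted and ground-truth color vectors, averaged over all voxels:

```python
import numpy as np

def average_distance_error(pred, truth):
    """Mean Euclidean distance between predicted and ground-truth
    color vectors over all voxels (assumed definition of ADE)."""
    return np.mean(np.linalg.norm(pred - truth, axis=-1))

# Two voxels: one colored perfectly, one off by 0.3 in one channel
pred  = np.array([[0.2, 0.4, 0.6], [0.1, 0.1, 0.1]])
truth = np.array([[0.2, 0.4, 0.6], [0.1, 0.1, 0.4]])
print(average_distance_error(pred, truth))  # 0.15
```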
Results: Video Coloring
• V2V learns "common sense" colors
(Figure: Input, V2V, Ground Truth)
Conclusion
• Contributions:
  – 3D CNN end-to-end voxel-wise prediction
  – "Same" network architecture for all three challenges.
  – Utilizes a well-established method to generate training data.
• Criticisms:
  – Fine-tuning did not noticeably improve the optical flow result in comparison with Brox's method.
  – No mention of the activation function, even in C3D.
Thank You for Listening
Questions?
References
• "Deep End2End Voxel2Voxel Prediction" – Tran et al., 2015
• "FlowNet: Learning Optical Flow with Convolutional Networks" – Fischer et al., 2015
• "ImageNet Classification with Deep Convolutional Neural Networks" – Krizhevsky et al., 2012
• "Learning Spatiotemporal Features with 3D Convolutional Networks" – Tran et al., 2015
• "Visualizing and Understanding Convolutional Networks" – Zeiler et al., 2014
• "Fully Convolutional Networks for Semantic Segmentation" – Long et al., 2015
• "Depth Map Prediction from a Single Image using a Multi-Scale Deep Network" – Eigen et al., 2014
• "Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation" – Brox et al., 2011
Backup Slides
Multi-layer Perceptron
• A perceptron is a linear classifier that utilizes a set of weights to predict an output for a feature vector.
https://blog.dbrgn.ch/images/2013/3/26/perceptron.png
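The perceptron's decision rule can be sketched in a few lines (illustrative code; the weights here are chosen by hand rather than learned):

```python
import numpy as np

def perceptron(x, w, b):
    """Linear classifier: output 1 if the weighted sum of the
    inputs plus bias is positive, else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

w = np.array([1.0, 1.0])   # hand-picked weights realizing an AND gate
b = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x, dtype=float), w, b))
# only (1, 1) outputs 1
```

Stacking layers of such units with non-linear activations gives the multi-layer perceptron, whose weights are then learned by backpropagation.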