Bidirectional Recurrent Convolutional Networks for Video Super-Resolution
Qi Zhang & Yan Huang
May 10, 2017
Center for Research on Intelligent Perception and Computing (CRIPAC), National Laboratory of Pattern Recognition (NLPR)
Institute of Automation, Chinese Academy of Sciences (CASIA)
CRIPAC
CRIPAC mainly focuses on the following research topics related to national public security:
• Biometrics
• Image and Video Analysis
• Big Data and Multi-modal Computing
• Content Security and Authentication
• Sensing and Information Acquisition
CRIPAC receives regular funding from various government departments and agencies. It is also supported by R&D project funds from many other national and international sources.
CRIPAC members publish widely in leading national and international journals
and conferences such as IEEE Transactions on PAMI, IEEE Transactions on
Image Processing, International Journal of Computer Vision, Pattern
Recognition, Pattern Recognition Letters, ICCV, ECCV, CVPR, ACCV, ICPR, ICIP, etc.
http://cripac.ia.ac.cn/en/EN/volumn/home.shtml
NVAIL
Artificial Intelligence Laboratory
Research on artificial intelligence and deep learning
Outline
1 Deep Learning
2 Recurrent Convolutional Networks
3 Application to Video Super-Resolution
4 Future Work
Deep Neural Networks (DNN)
• Originate from:
– 1962: simple/complex cell, Hubel and Wiesel
– 1970: efficient error backpropagation, Linnainmaa
– 1979: deep neocognitron, convolution, Fukushima
– 1987: autoencoder, Ballard
– 1989: backpropagation for CNN, LeCun
– 1991: fundamental deep learning problem, Hochreiter
– 1991: deep recurrent neural network, Schmidhuber
– 1997: supervised LSTM RNN, Schmidhuber
• Two drawbacks:
– Large numbers of parameters → High computational cost
– Small training set → Over-fitting problem
Two Recent Developments
Big Data
Figure: Video surveillance data size (PB) by year. 2009: 10,000; 2010: 13,000; 2011: 18,200; 2012: 27,300; 2013: 54,600; 2014: 87,360.
Cheap Computation
DNN can thus be fitted efficiently
Breakthrough in 2006
ImageNet: 74% vs. 85%
Deep Learning: The Resurgence of DNN
Timeline: 2006, 2012, 2014
• Representation learning
• CNN for visual tasks: DeepFace (CVPR 2014), RCNN for detection (CVPR 2014), ...
• RNN for sequence analysis: activity recognition (CVPR 2015), video captioning (CVPR 2015)
Deep learning promotes the fast development of various visual computing areas
Deep Neural Networks (DNN)
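Diagram: input 𝐱, hidden layer 𝐡 and output 𝐲, with weights 𝐖 between 𝐱 and 𝐡
• $\mathbf{x} \in \mathbb{R}^{d}$, $\mathbf{h} \in \mathbb{R}^{n}$, $\mathbf{W} \in \mathbb{R}^{d \times n}$
• $\mathbf{h} = \sigma(\mathbf{x}\mathbf{W})$, $\sigma(t) = \frac{1}{1 + e^{-t}}$
Plot: sigmoid function $\sigma(t)$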
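As a minimal sketch of the layer above, assuming NumPy and illustrative sizes d = 4 and n = 3 (the slide gives no concrete sizes, and the function names are hypothetical), the hidden activation 𝐡 = 𝜎(𝐱𝐖) could be computed like this:

```python
import numpy as np

def sigmoid(t):
    """Logistic sigmoid, sigma(t) = 1 / (1 + exp(-t)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-t))

def dense_layer(x, W):
    """Hidden activation h = sigma(x W) for a row vector x of length d."""
    return sigmoid(x @ W)

d, n = 4, 3                       # illustrative sizes (assumed, not from the slides)
rng = np.random.default_rng(0)
x = rng.standard_normal(d)        # input x in R^d
W = rng.standard_normal((d, n))   # weights W in R^{d x n}
h = dense_layer(x, W)             # hidden units h in R^n, each strictly between 0 and 1
print(h.shape)
```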
Recurrent Neural Networks (RNN)
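Diagram: a DNN (𝐱, 𝐡, 𝐲, with weights 𝐖) beside an RNN unrolled over three time steps (inputs 𝐱₁, 𝐱₂, 𝐱₃; hidden states 𝐡₁, 𝐡₂, 𝐡₃; input weights 𝐖 shared across steps; recurrent weights 𝐔 between consecutive hidden states; output 𝐲)
• $\mathbf{h}_{t} = \sigma(\mathbf{x}_{t}\mathbf{W} + \mathbf{h}_{t-1}\mathbf{U})$
• $\mathbf{x}_{t} \in \mathbb{R}^{d}$, $\mathbf{h}_{t} \in \mathbb{R}^{n}$, $\mathbf{W} \in \mathbb{R}^{d \times n}$, $\mathbf{U} \in \mathbb{R}^{n \times n}$
DNN vs. RNN: temporal dependency modeling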
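The recurrence above can be sketched in a few lines of NumPy. The sizes, the zero initial state and the function names are assumptions made for illustration; the slides do not specify them.

```python
import numpy as np

def sigmoid(t):
    """Logistic sigmoid applied elementwise."""
    return 1.0 / (1.0 + np.exp(-t))

def rnn_forward(xs, W, U):
    """Apply h_t = sigma(x_t W + h_{t-1} U) over a sequence, with h_0 = 0 (assumed)."""
    h = np.zeros(U.shape[0])
    hs = []
    for x_t in xs:
        h = sigmoid(x_t @ W + h @ U)   # W and U are shared across all time steps
        hs.append(h)
    return np.stack(hs)

d, n, T = 4, 3, 3                      # illustrative sizes (assumed)
rng = np.random.default_rng(0)
xs = rng.standard_normal((T, d))       # inputs x_1 .. x_T
W = rng.standard_normal((d, n))        # input-to-hidden weights
U = rng.standard_normal((n, n))        # hidden-to-hidden (recurrent) weights
hs = rnn_forward(xs, W, U)             # hidden states h_1 .. h_T, shape (T, n)
```

Because each hidden state feeds the next one through 𝐔, a later state such as 𝐡₃ depends on all earlier inputs, which is the temporal dependency the slide refers to.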
Recurrent Convolutional Networks (RCN)
• DNN + convolutional → CNN
• DNN + sequential → RNN
• DNN + sequential + convolutional → RCN
DNN: Deep Neural Networks; RNN: Recurrent Neural Networks; CNN: Convolutional Neural Networks; RCN: Recurrent Convolutional Networks
Applications of RCN
• Scene Labeling, NIPS15
• Action Recognition, ICLR15
• Video SR, NIPS15 & TPAMI17
• Weather Nowcasting, NIPS15
• Person ReID, CVPR16
• Object Recognition, CVPR15
Video Super-Resolution
A great need for super-resolving low-resolution videos
Diagram: low-resolution videos → super-resolution (denoising, deblurring, upscaling) → high-resolution videos → display on high-resolution devices
Two Main Approaches (1/2)
1. Single-Image super-resolution [1-6]
– One-to-One scheme: super-resolve each video frame independently
– Ignores the intrinsic temporal dependency among video frames
– Low computational complexity, fast
[1] Dong et al., Learning a deep convolutional network for image super resolution. ECCV, 2014.
[2] Timofte et al., Anchored neighborhood regression for fast example-based super resolution. ICCV, 2013.
[3] Zeyde et al., On single image scale-up using sparse-representations. Curves and Surfaces, 2012.
[4] Yang et al., Image super-resolution via sparse representation. IEEE TIP, 2010.
[5] Bevilacqua et al., Low-complexity single-image super resolution. BMVC, 2012.
[6] Chang et al., Super-resolution through neighbor embedding. CVPR, 2004.
Two Main Approaches (2/2)
2. Multi-Frame super-resolution [7-11]
– Many-to-One scheme: use multiple adjacent frames to super-resolve a frame
– Models the temporal dependency among frames by motion estimation
– High computational complexity, slow
[7] Liu and Sun, On Bayesian adaptive video super resolution. IEEE TPAMI, 2014.
[8] Takeda et al., Super-resolution without explicit subpixel motion estimation. IEEE TIP, 2009.
[9] Mitzel et al., Video super resolution using duality based TV-L1 optical flow. PR, 2009.
[10] Protter et al., Generalizing the nonlocal-means to super-resolution reconstruction. IEEE TIP, 2009.
[11] Fransens et al., Optical flow based super-resolution: A probabilistic approach. CVIU, 2007.
Motivation
• RNN can model long-term contextual information of temporal sequences well
• Convolutional operation can scale to full videos of any spatial size and temporal step
➢ Propose bidirectional recurrent convolutional networks, different from vanilla RNN:
1. Commonly-used full connections are replaced with weight-sharing convolutions
2. Conditional convolutions are added for learning visual-temporal dependency relation
RNN: Recurrent Neural Networks; SR: Super-Resolution
Bidirectional Recurrent Convolutional Networks
• Learn spatial dependency between a low-resolution frame and its high-resolution result
• Model long-term temporal dependency across video frames
• Enhance visual-temporal dependency modeling
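The following is a rough sketch of how the three kinds of connection might fit together, not the authors' implementation. It assumes the three roles above correspond to feedforward, recurrent and conditional convolutions respectively. It also assumes single-channel frames, one hidden layer per direction, a ReLU activation, zero initial states, and an output that simply sums one convolution of each direction's hidden maps. All function and parameter names are hypothetical.

```python
import numpy as np

def conv2d_same(img, kernel):
    """Single-channel 2-D cross-correlation with zero padding ('same' output size)."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def relu(a):
    return np.maximum(a, 0.0)

def directional_pass(frames, Wv, Wr, Wt, reverse=False):
    """Hidden maps for one temporal direction of the recurrence."""
    order = range(len(frames))
    if reverse:
        order = reversed(order)
    hidden = [None] * len(frames)
    prev_x = np.zeros_like(frames[0])   # zero initial input (assumed)
    prev_h = np.zeros_like(frames[0])   # zero initial hidden map (assumed)
    for i in order:
        hidden[i] = relu(
            conv2d_same(frames[i], Wv)  # feedforward: current input frame
            + conv2d_same(prev_h, Wr)   # recurrent: neighbouring hidden map
            + conv2d_same(prev_x, Wt)   # conditional: neighbouring input frame
        )
        prev_x, prev_h = frames[i], hidden[i]
    return hidden

def brcn_like(frames, params_fwd, params_bwd, Vf, Vb):
    """Combine a forward and a backward pass into one output per frame."""
    hf = directional_pass(frames, *params_fwd)
    hb = directional_pass(frames, *params_bwd, reverse=True)
    return np.stack([conv2d_same(f, Vf) + conv2d_same(b, Vb)
                     for f, b in zip(hf, hb)])

rng = np.random.default_rng(0)

def small_kernel():
    """Random 3 x 3 kernel, for illustration only."""
    return 0.1 * rng.standard_normal((3, 3))

frames = rng.random((5, 8, 8))          # 5 synthetic frames of 8 x 8 pixels
out = brcn_like(frames,
                (small_kernel(), small_kernel(), small_kernel()),
                (small_kernel(), small_kernel(), small_kernel()),
                small_kernel(), small_kernel())
print(out.shape)  # (5, 8, 8)
```

Since every connection is a convolution, the same kernels apply to frames of any spatial size and to sequences of any length, which matches the motivation given earlier.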
Learning
• Define an end-to-end mapping $O(\cdot)$ from low-resolution frames $\mathcal{X}$ to high-resolution frames $\mathcal{Y}$
• Learning proceeds by optimizing the Mean Square Error (MSE) between predicted frames $O(\mathcal{X})$ and $\mathcal{Y}$
$L = \| O(\mathcal{X}) - \mathcal{Y} \|^{2}$
– stochastic gradient descent
– small learning rate in the output layer: 1e-4
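To make the objective concrete, here is a toy sketch of an MSE loss and a plain gradient descent update. The stand-in model O(X) = w · X with a single learnable gain w is an assumption chosen only to keep the example short, and the toy learning rate of 0.5 is unrelated to the 1e-4 quoted for the output layer.

```python
import numpy as np

def mse_loss(pred, target):
    """Mean squared error between predicted and ground-truth frames."""
    return np.mean((pred - target) ** 2)

def sgd_step(w, grad, lr=1e-4):
    """One plain gradient descent update: w <- w - lr * grad."""
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.random((10, 32, 32))   # one synthetic volume: 10 frames of 32 x 32
Y = 2.0 * X                    # pretend ground truth, so the ideal gain is 2
w = 0.0                        # toy model O(X) = w * X (assumed stand-in)

loss_before = mse_loss(w * X, Y)
for _ in range(200):
    grad = np.mean(2.0 * (w * X - Y) * X)   # dL/dw for the toy model
    w = sgd_step(w, grad, lr=0.5)           # large rate is fine for this toy problem
loss_after = mse_loss(w * X, Y)
print(loss_before > loss_after)
```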
Experiments
• Train the model on 25 YUV format video sequences
– volume-based training
– number of volumes: roughly 41,000
– volume size: 32 × 32 × 10
• Test on a variety of real world videos
– severe motion blur
– motion aliasing
– complex motions
Training videos
Testing videos
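Volume-based training, as listed above, could be prepared along these lines. This is a sketch under assumptions: the video is a NumPy array laid out as (frames, height, width), and the non-overlapping strides are a guess, since the slides state only the volume size (32 × 32 × 10) and the approximate count.

```python
import numpy as np

def extract_volumes(video, size=32, depth=10, spatial_stride=32, temporal_stride=10):
    """Cut a (T, H, W) video into volumes of `depth` frames, each `size` x `size`."""
    T, H, W = video.shape
    volumes = []
    for t in range(0, T - depth + 1, temporal_stride):
        for y in range(0, H - size + 1, spatial_stride):
            for x in range(0, W - size + 1, spatial_stride):
                volumes.append(video[t:t + depth, y:y + size, x:x + size])
    return np.stack(volumes)

video = np.zeros((30, 96, 128), dtype=np.float32)  # synthetic single-channel clip
vols = extract_volumes(video)
print(vols.shape)  # (36, 10, 32, 32)
```

Smaller strides would produce overlapping volumes and therefore more training samples from the same videos.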
PSNR Comparison
Table 1: The results of PSNR (dB) and test time (sec) on the test video sequences.
PSNR: peak signal-to-noise ratio
[1] Video enhancer. http://www.infognition.com/videoenhancer/, version 1.9.10. 2014.
[4] Bevilacqua et al., Low-complexity single-image super resolution. BMVC, 2012.
[5] Chang et al., Super-resolution through neighbor embedding. CVPR, 2004.
[6] Dong et al., Learning a deep convolutional network for image super resolution. ECCV, 2014.
[20] Takeda et al., Super-resolution without explicit subpixel motion estimation. IEEE TIP, 2009.
[22] Timofte et al., Anchored neighborhood regression for fast example-based super resolution. ICCV, 2013.
[24] Yang et al., Image super-resolution via sparse representation. IEEE TIP, 2010.
[25] Zeyde et al., On single image scale-up using sparse-representations. Curves and Surfaces, 2012.
Surpasses state-of-the-art methods in PSNR, due to the effective temporal dependency modelling
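For reference, PSNR is commonly computed from the mean squared error between a reference frame and its estimate. The sketch below assumes 8-bit frames with a peak value of 255. The slides do not say which channel or peak value was used in the evaluation, so treat those as assumptions.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")      # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.zeros((4, 4), dtype=np.uint8)
est = ref + 1                    # every pixel off by one, so MSE = 1
print(round(psnr(ref, est), 2))  # 48.13
```

A higher PSNR means the super-resolved frame is closer to the ground truth in the mean-squared sense.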
Model Architecture
• Investigate the impact of our model architecture on the performance
• Take a simplified network containing only feedforward (𝑣) convolution as a benchmark
• Study its variants by successively adding the bidirectional (𝑏), recurrent (𝑟) and conditional (𝑡) schemes
Table 1: The results of PSNR (dB) by variants of BRCN on the testing video sequences.
Running Time
Figure: Speed vs. PSNR for all the comparison methods.
Outperform both single-image and multi-frame SR methods
Achieve comparable speed with the fastest single-image SR methods
Closeup Comparison
Figure: Comparison among original frames (2nd, 3rd and 4th frames, from the top row to the bottom) of the Dancing video and super-resolved results by Bicubic, 3DSKR, ANR and BRCN, respectively.
Our method is able to recover more image details than the others under severe motion conditions
Example
Upscaling factor: 4
87 × 157 → 348 × 628
Comparison: Bicubic (top), Ours (bottom)
Conclusion
• Bidirectional Recurrent Convolutional Networks
– bidirectional recurrent and conditional convolutions
– an end-to-end framework, without pre/post-processing
– good performance and fast speed
For more details, please refer to the following papers:
1. Yan Huang, Wei Wang, and Liang Wang, Bidirectional Recurrent Convolutional Networks for Multi-Frame Super-Resolution. Advances in Neural Information Processing Systems (NIPS), pp. 235-243, 2015
2. Yan Huang, Wei Wang, and Liang Wang, Video Super-Resolution via Bidirectional Recurrent Convolutional Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017, Accepted
Future Work
➢ For performance improvement
– extend our model to a deeper architecture, e.g., based on the 19-layer VGG net
– incorporate some effective strategies, e.g., motion ensemble and residual connection
➢ For speed acceleration
– replace the pre-upsampling step with diverse upsampling filters learned by deconvolution layers
➢ Others
– collect a large-scale high-resolution video dataset, and try to learn our model directly from raw videos
Acknowledgement
NVAIL Artificial Intelligence Laboratory
Sponsorship of excellent hardware resources
THANK YOU