Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)
TRANSCRIPT
![Page 1: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/1.jpg)
@DocXavi
Module 4 - Lecture 6
Video Analysis with CNNs
31 January 2017
Xavier Giró-i-Nieto
[http://pagines.uab.cat/mcv/]
![Page 2: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/2.jpg)
Acknowledgments
2
Víctor Campos Alberto Montes
![Page 3: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/3.jpg)
Linked slides
![Page 4: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/4.jpg)
Outline
1. Recognition
2. Optical Flow
3. Object Tracking
4. Audio and Video
5. Generative models
4
![Page 5: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/5.jpg)
Recognition
Demo: Clarifai
MIT Technology Review: “A start-up’s Neural Network Can Understand Video” (3/2/2015)
![Page 6: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/6.jpg)
Figure: Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE.
6
Recognition
![Page 7: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/7.jpg)
7
Recognition
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
![Page 8: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/8.jpg)
8
Recognition
Previous lectures
![Page 9: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/9.jpg)
9
Recognition
![Page 10: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/10.jpg)
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE.
Slides extracted from the ReadCV seminar by Víctor Campos
Recognition: DeepVideo
![Page 11: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/11.jpg)
Recognition: DeepVideo: Demo
![Page 12: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/12.jpg)
Recognition: DeepVideo: Architectures
![Page 13: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/13.jpg)
Unsupervised learning [Le et al. ’11] vs. supervised learning [Karpathy et al. ’14]
Recognition: DeepVideo: Features
![Page 14: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/14.jpg)
Recognition: DeepVideo: Multiscale
![Page 15: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/15.jpg)
Recognition: DeepVideo: Results
![Page 16: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/16.jpg)
16
Recognition
![Page 17: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/17.jpg)
17
Recognition: C3D
![Page 18: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/18.jpg)
Recognition: C3D: Demo
![Page 19: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/19.jpg)
K. Simonyan, A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," ICLR 2015.
Recognition: C3D: Spatial dimension
The spatial dimensions (XY) of the kernels are fixed to 3x3, following Simonyan & Zisserman (ICLR 2015).
![Page 20: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/20.jpg)
Recognition: C3D: Temporal dimension
3D ConvNets are more suitable for spatiotemporal feature learning than 2D ConvNets.
Temporal depth
2D ConvNets
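The difference can be sketched in a few lines of NumPy (an illustrative sketch with made-up shapes, not the paper's code): a 2D convolution absorbs the frames into channels, so the temporal axis is gone after the first layer, while a 3x3x3 convolution keeps a (shorter) temporal axis for later layers to work on.

```python
import numpy as np

clip = np.random.rand(16, 112, 112, 3)  # 16 frames of 112x112 RGB

# 2D view: frames stacked as channels -> one 2D feature map, time is collapsed.
as_channels = clip.transpose(1, 2, 0, 3).reshape(112, 112, 16 * 3)
print(as_channels.shape)  # (112, 112, 48): no temporal axis left

# 3D view: a 3x3x3 kernel slides over (time, height, width), so the output
# still has a temporal axis: 16 - 3 + 1 = 14 valid temporal positions.
t, h, w = 16 - 3 + 1, 112 - 3 + 1, 112 - 3 + 1
print((t, h, w))  # (14, 110, 110)
```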
![Page 21: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/21.jpg)
A homogeneous architecture with small 3 x 3 x 3 convolution kernels in all layers is among the best-performing architectures for 3D ConvNets.
Recognition: C3D: Temporal dimension
![Page 22: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/22.jpg)
No gain when varying the temporal depth across layers.
Recognition: C3D: Temporal dimension
![Page 23: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/23.jpg)
Recognition: C3D: Architecture
Feature vector
![Page 24: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/24.jpg)
Recognition: C3D: Feature vector
Video sequence
16-frame clips
8-frame overlap
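The clip extraction described above (16-frame clips, 8-frame overlap, i.e. a stride of 8 frames between clip starts) can be sketched as:

```python
# Sketch of the clip extraction: 16-frame windows with an 8-frame overlap.
def split_into_clips(n_frames, clip_len=16, stride=8):
    starts = range(0, n_frames - clip_len + 1, stride)
    return [(s, s + clip_len) for s in starts]

print(split_into_clips(40))  # [(0, 16), (8, 24), (16, 32), (24, 40)]
```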
![Page 25: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/25.jpg)
Recognition: C3D: Feature vector
Each 16-frame clip is passed through C3D; the clip features are averaged over the video and L2-normalized, yielding a 4096-dim video descriptor.
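The descriptor pipeline (clip features, then average, then L2 norm) can be sketched as follows; the random features here just stand in for real 4096-dim C3D activations.

```python
import numpy as np

clip_features = np.random.rand(5, 4096)  # 5 clips, 4096-dim features each
video_descriptor = clip_features.mean(axis=0)       # average over clips
video_descriptor /= np.linalg.norm(video_descriptor)  # L2-normalize

print(video_descriptor.shape)  # (4096,)
print(round(float(np.linalg.norm(video_descriptor)), 6))  # 1.0
```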
![Page 26: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/26.jpg)
Recognition: C3D: Visualization
Based on Deconvnets by Zeiler and Fergus [ECCV 2014]. See [ReadCV Slides] for more details.
![Page 27: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/27.jpg)
Recognition: C3D: Compactness
![Page 28: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/28.jpg)
Convolutional 3D (C3D) features combined with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with the state of the art on the other 2 benchmarks.
Recognition: C3D: Performance
![Page 29: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/29.jpg)
Recognition: C3D: Software
Implementation by Michael Gygli (GitHub)
![Page 30: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/30.jpg)
Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in videos." NIPS 2014.
Recognition: Two-stream
Two CNNs in parallel:
● One for RGB images
● One for optical flow (hand-crafted features)
Fusion after the softmax layer
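Late fusion after the softmax layer can be sketched as averaging the two streams' class posteriors; the logits below are made up for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rgb_logits = np.array([2.0, 0.5, 0.1])   # spatial stream (RGB)
flow_logits = np.array([1.0, 1.5, 0.2])  # temporal stream (optical flow)

# Late fusion: average the per-stream softmax outputs.
fused = 0.5 * (softmax(rgb_logits) + softmax(flow_logits))
print(int(fused.argmax()))  # 0: the spatial stream's confident vote wins
```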
![Page 31: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/31.jpg)
31Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Convolutional two-stream network fusion for video action recognition." CVPR 2016. [code]
Recognition: Two-stream
Two CNNs in parallel:
● One for RGB images
● One for optical flow (hand-crafted features)
Fusion at a convolutional layer
![Page 32: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/32.jpg)
Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage CNNs." CVPR 2016. (Slidecast and slides by Alberto Montes)
Recognition: Localization
![Page 33: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/33.jpg)
33
Recognition: Localization
![Page 34: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/34.jpg)
34
Recognition: Localization
![Page 35: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/35.jpg)
Outline
1. Recognition
2. Optical Flow
3. Object Tracking
4. Audio and Video
5. Generative models
35
![Page 36: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/36.jpg)
Optical Flow
Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE.
![Page 37: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/37.jpg)
Optical Flow: DeepFlow
Andrei Bursuc
Postdoc, INRIA
@abursuc
![Page 38: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/38.jpg)
Optical Flow: DeepFlow
● Deep (hierarchy) ✔
● Convolution ✔
● Learning ❌
![Page 39: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/39.jpg)
Optical Flow: Small vs Large
![Page 40: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/40.jpg)
Optical Flow
Classic approach: rigid matching of HOG or SIFT descriptors.
Deep Matching: allow each subpatch to move:
● independently
● in a limited range depending on its size
![Page 41: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/41.jpg)
Optical Flow: Deep Matching
![Page 42: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/42.jpg)
Source: MATLAB R2015b documentation for normxcorr2 by MathWorks
Optical Flow: 2D correlation
Image
Sub-Image
Offset of the sub-image with respect to the image [0,0].
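The idea behind normxcorr2 (here sketched with a plain, unnormalized correlation for brevity) is to slide the sub-image over the image and take the offset with the highest match score:

```python
import numpy as np

# Slide a small template over the image; the argmax of the score map
# recovers the template's offset. (Unnormalized correlation for simplicity.)
image = np.zeros((8, 8))
image[3:5, 4:6] = 1.0          # place a 2x2 "pattern" at row 3, col 4
template = np.ones((2, 2))

scores = np.zeros((7, 7))      # one score per valid offset
for r in range(7):
    for c in range(7):
        scores[r, c] = (image[r:r+2, c:c+2] * template).sum()

print(np.unravel_index(scores.argmax(), scores.shape))  # (3, 4)
```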
![Page 43: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/43.jpg)
Instead of pre-trained filters, a convolution is defined between each:
● patch of the reference image
● target image
...and as a result, a correlation map is generated for each reference patch.
Optical Flow: Deep Matching
![Page 44: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/44.jpg)
Optical Flow: Deep Matching
The most discriminative response map vs. the least discriminative response map.
![Page 45: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/45.jpg)
Key idea: Build (bottom-up) a pyramid of correlation maps to run an efficient (top-down) search.
Optical Flow: Deep Matching
Pyramid levels: 4x4 → 8x8 → 16x16 → 32x32 patches, built by bottom-up extraction (BU) and searched by top-down matching (TD).
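A heavily simplified sketch of the bottom-up step: the correlation maps of the four sub-patches are max-pooled (tolerating small independent shifts) and averaged into the parent patch's map. The real algorithm also shifts each child map toward its quadrant; that detail is omitted here.

```python
import numpy as np

def max_pool2(m):
    # 2x2 max pooling: each parent cell tolerates a 1-cell shift of the child.
    h, w = m.shape
    return m[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2).max(axis=(1, 3))

children = [np.random.rand(8, 8) for _ in range(4)]  # 4 sub-patch corr. maps
parent = np.mean([max_pool2(c) for c in children], axis=0)
print(parent.shape)  # (4, 4): one level coarser than the children
```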
![Page 46: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/46.jpg)
Key idea: Build (bottom-up) a pyramid of correlation maps to run an efficient (top-down) search.
Optical Flow: Deep Matching
Pyramid levels: 4x4 → 8x8 → 16x16 → 32x32 patches (bottom-up extraction, BU).
![Page 47: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/47.jpg)
Optical Flow: Deep Matching (BU)
![Page 48: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/48.jpg)
Key idea: Build (bottom-up) a pyramid of correlation maps to run an efficient (top-down) search.
Optical Flow: Deep Matching (TD)
Pyramid levels: 4x4 → 8x8 → 16x16 → 32x32 patches (top-down matching, TD).
![Page 49: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/49.jpg)
Optical Flow: Deep Matching (TD)
Each local maximum in the top layer corresponds to a shift of one of the biggest (32x32) patches. Focusing on a local maximum, we can retrieve the corresponding responses one scale below and recover the shifts of the sub-patches that generated it.
![Page 50: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/50.jpg)
Optical Flow: Deep Matching (TD)
![Page 51: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/51.jpg)
Optical Flow: Deep Matching
![Page 52: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/52.jpg)
Ground truth
Dense HOG [Brox & Malik 2011]
Deep Matching
Optical Flow: Deep Matching
![Page 53: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/53.jpg)
Optical Flow: Deep Matching
![Page 54: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/54.jpg)
Optical Flow
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766).
![Page 55: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/55.jpg)
Optical Flow: FlowNet
![Page 56: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/56.jpg)
Optical Flow: FlowNet
End-to-end supervised learning of optical flow.
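The supervised objective commonly used for optical flow is the endpoint error (EPE): the mean Euclidean distance between predicted and ground-truth flow vectors at every pixel. A minimal sketch:

```python
import numpy as np

def endpoint_error(pred, gt):
    # pred, gt: (H, W, 2) arrays of (u, v) displacements per pixel
    return float(np.sqrt(((pred - gt) ** 2).sum(axis=-1)).mean())

gt = np.zeros((4, 4, 2))
pred = np.zeros((4, 4, 2))
pred[..., 0] = 3.0  # predict 3 px horizontal motion everywhere, gt is zero
print(endpoint_error(pred, gt))  # 3.0
```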
![Page 57: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/57.jpg)
Optical Flow: FlowNet (contracting)
Option A: Stack both input images together and feed them through a generic network.
![Page 58: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/58.jpg)
Optical Flow: FlowNet (contracting)
Option B: Create two separate, yet identical processing streams for the two images and combine them at a later stage.
![Page 59: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/59.jpg)
Optical Flow: FlowNet (contracting)
Option B: Create two separate, yet identical processing streams for the two images and combine them at a later stage.
Correlation layer: a convolution between data patches from the two streams' layers to be combined.
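A correlation layer can be sketched as follows: for each spatial position in feature map A, compute dot products with positions of feature map B within a small search window. The shapes and window size below are illustrative, not FlowNet's exact configuration.

```python
import numpy as np

def correlation(fa, fb, max_disp=1):
    # fa, fb: (H, W, C) feature maps; output: (H, W, (2*max_disp+1)**2)
    h, w, c = fa.shape
    d = 2 * max_disp + 1
    out = np.zeros((h, w, d * d))
    fb_pad = np.pad(fb, ((max_disp, max_disp), (max_disp, max_disp), (0, 0)))
    for i in range(d):
        for j in range(d):
            shifted = fb_pad[i:i+h, j:j+w, :]          # fb shifted by (i, j)
            out[:, :, i * d + j] = (fa * shifted).sum(axis=-1)  # dot products
    return out

fa = np.random.rand(6, 6, 8)
corr = correlation(fa, fa)  # correlating a map with itself, as a smoke test
print(corr.shape)           # (6, 6, 9): one channel per displacement
```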
![Page 60: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/60.jpg)
Optical Flow: FlowNet (expanding)
Upconvolutional layers: unpooling feature maps + convolution. Upconvolved feature maps are concatenated with the corresponding maps from the contracting part.
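The expanding step can be sketched as upsampling a deep feature map (nearest-neighbour here, standing in for unpooling) and concatenating it channel-wise with the same-resolution map from the contracting part; the learned convolution that would follow is omitted.

```python
import numpy as np

def upsample2(x):
    # Nearest-neighbour 2x upsampling along both spatial axes.
    return x.repeat(2, axis=0).repeat(2, axis=1)

coarse = np.random.rand(4, 4, 16)  # deep, low-resolution features
skip = np.random.rand(8, 8, 8)     # matching map from the contracting part
merged = np.concatenate([upsample2(coarse), skip], axis=-1)
print(merged.shape)  # (8, 8, 24): channels from both paths, full resolution
```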
![Page 61: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/61.jpg)
Optical Flow: FlowNet
Since existing ground-truth datasets are not sufficiently large to train a ConvNet, a synthetic Flying Chairs dataset is generated... and augmented (translation, rotation and scaling transformations; additive Gaussian noise; changes in brightness, contrast, gamma and color).
ConvNets trained on these unrealistic data generalize well to existing datasets such as Sintel and KITTI.
Data augmentation
![Page 62: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/62.jpg)
Optical Flow: FlowNet
![Page 63: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/63.jpg)
Outline
1. Recognition
2. Optical Flow
3. Object Tracking
4. Audio and Video
5. Generative models
63
![Page 64: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/64.jpg)
Object tracking: MDNet
64
Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop, 2015.
![Page 65: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/65.jpg)
Object tracking: MDNet
65
Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop, 2015.
![Page 66: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/66.jpg)
Object tracking: MDNet: Architecture
66
Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop, 2015.
Domain-specific layers are used during training, one per training sequence, but are replaced by a single new branch at test time.
![Page 67: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/67.jpg)
Object tracking: MDNet: Online update
67
Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop, 2015.
MDNet is updated online at test time with hard negative mining, that is, selecting negative samples with the highest positive score.
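Hard negative mining reduces to sorting: among the negative samples, keep the ones the classifier scores most "positive". A minimal sketch (batch sizes and k are illustrative):

```python
import numpy as np

def hard_negatives(scores, labels, k):
    """Return the indices of the k negative samples with the highest
    positive score -- the hard negatives used in MDNet's online update."""
    neg_idx = np.where(labels == 0)[0]
    order = np.argsort(-scores[neg_idx])   # highest positive score first
    return neg_idx[order[:k]]

scores = np.array([0.9, 0.2, 0.8, 0.1, 0.7])   # classifier's positive score
labels = np.array([1,   0,   0,   0,   0])     # 1 = positive, 0 = negative
print(hard_negatives(scores, labels, k=2))     # -> [2 4]
```

These confusing negatives are the ones worth spending the scarce online update steps on.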
![Page 68: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/68.jpg)
Object tracking: FCNT
68
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015 [code].
![Page 69: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/69.jpg)
Object tracking: FCNT
69
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015 [code].
FCNT focuses on the conv4-3 and conv5-3 layers of a VGG-16 network pre-trained for ImageNet image classification.
conv4-3 conv5-3
![Page 70: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/70.jpg)
Object tracking: FCNT: Specialization
70
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015 [code].
Most feature maps in VGG-16 conv4-3 and conv5-3 are not related to the foreground regions in a tracking sequence.
![Page 71: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/71.jpg)
Object tracking: FCNT: Localization
71
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015 [code].
Although trained for image classification, the feature maps in conv5-3 enable object localization... but they are not discriminative enough to distinguish between different objects of the same category.
![Page 72: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/72.jpg)
Object tracking: Localization
72
Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. "Object detectors emerge in deep scene CNNs." ICLR 2015.
Other works have shown how feature maps in convolutional layers allow object localization.
![Page 73: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/73.jpg)
Object tracking: FCNT: Localization
73
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015 [code].
On the other hand, feature maps from conv4-3 are more sensitive to intra-class appearance variation…
conv4-3 conv5-3
![Page 74: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/74.jpg)
Object tracking: FCNT: Architecture
74
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015 [code].
SNet = Specific Network (updated online)
GNet = General Network (fixed)
![Page 75: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/75.jpg)
Object tracking: FCNT: Results
75
Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015 [code].
![Page 76: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/76.jpg)
Outline
1. Recognition
2. Optical Flow
3. Object Tracking
4. Audio and Video
5. Generative models
76
![Page 77: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/77.jpg)
77
Audio and Video
Audio Vision
![Page 78: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/78.jpg)
78
Audio and Video: Soundnet
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
Object and scene recognition in videos by analyzing the audio track only.
![Page 79: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/79.jpg)
79
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
The training videos are unlabeled: supervision comes from CNNs pre-trained on labeled images, whose predictions on the video frames are transferred to the audio network.
Audio and Video: Soundnet
![Page 80: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/80.jpg)
80
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
Audio and Video: Soundnet
![Page 81: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/81.jpg)
81
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
Audio and Video: Soundnet
![Page 82: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/82.jpg)
82
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
Hidden-layer activations of SoundNet are used to train a standard SVM classifier that outperforms the state of the art.
Audio and Video: Soundnet
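The features-plus-shallow-classifier recipe can be sketched as follows. The data is synthetic and a simple perceptron stands in for the SVM (both are linear classifiers); everything here is an illustrative assumption, not SoundNet's actual features:

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend these are hidden-layer activations of SoundNet for two sound
# classes; in reality they would be extracted from the trained network.
feats_a = rng.normal(0.0, 1.0, (50, 16))
feats_b = rng.normal(2.0, 1.0, (50, 16))
X = np.vstack([feats_a, feats_b])
y = np.array([0] * 50 + [1] * 50)

# Train a linear classifier (perceptron stand-in for the SVM).
w, b = np.zeros(16), 0.0
for _ in range(20):
    for xi, yi in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        w += 0.1 * (yi - pred) * xi
        b += 0.1 * (yi - pred)

acc = np.mean([(1 if xi @ w + b > 0 else 0) == yi for xi, yi in zip(X, y)])
print(acc)  # high accuracy: the features are linearly separable
```

The point is that a good learned representation makes the final classifier almost trivial.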
![Page 83: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/83.jpg)
83
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Audio and Video: Soundnet
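What a learned 1-D filter over raw audio does can be illustrated with a hand-built bandpass filter (a windowed sine, an assumption for illustration only): it responds strongly when the waveform contains its frequency.

```python
import numpy as np

sr = 8000                                   # sample rate (Hz)
t = np.arange(0, 0.5, 1 / sr)
audio = np.sin(2 * np.pi * 440 * t)         # raw waveform: a 440 Hz tone

# Two 64-tap "conv1-style" filters: one matched to 440 Hz, one not.
matched = np.sin(2 * np.pi * 440 * t[:64]) * np.hanning(64)
other = np.sin(2 * np.pi * 2000 * t[:64]) * np.hanning(64)

resp_matched = np.max(np.abs(np.convolve(audio, matched, mode='valid')))
resp_other = np.max(np.abs(np.convolve(audio, other, mode='valid')))
print(resp_matched > resp_other)  # the matched filter fires much harder
```

SoundNet learns a bank of such filters from data instead of designing them by hand.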
![Page 84: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/84.jpg)
84
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
Audio and Video: Soundnet
![Page 85: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/85.jpg)
85
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
Audio and Video: Soundnet
![Page 86: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/86.jpg)
86
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7):
Audio and Video: Soundnet
![Page 87: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/87.jpg)
87
Audio and Video: Sonorization
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.
Learn to synthesize sounds from videos of people hitting objects with a drumstick.
![Page 88: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/88.jpg)
88
Audio and Video: Visual Sounds
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.
Not trained end-to-end.
![Page 89: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/89.jpg)
89
Audio and Video: Visual Sounds
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.
![Page 90: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/90.jpg)
90
Learn more
Russ Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop)
![Page 91: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/91.jpg)
Generative models for Video
91
Slides D2L5 by Santi Pascual.
![Page 92: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/92.jpg)
92
What are Generative Models?
We want a model with parameters θ = {weights, biases} whose outputs are distributed like Pmodel, and we want Pmodel to estimate the distribution of our training data, Pdata.
Example: for y = f(x) with scalar y, we make Pmodel similar to Pdata by training the parameters θ to maximize their similarity.
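A minimal concrete example, with a Gaussian standing in for the model family (an assumption for illustration): maximizing the likelihood of the data under Pmodel has a closed form here, the sample mean and standard deviation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Samples from P_data (in practice: the training set).
data = rng.normal(loc=1.5, scale=0.4, size=10_000)

# Fit theta = (mu, sigma) of a Gaussian P_model by maximum likelihood.
mu_hat = data.mean()       # MLE of the mean
sigma_hat = data.std()     # MLE of the standard deviation

print(round(mu_hat, 1), round(sigma_hat, 1))  # close to the true (1.5, 0.4)
```

Deep generative models chase the same objective, matching Pmodel to Pdata, but with distributions far too complex for a closed-form fit.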
![Page 93: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/93.jpg)
Key Idea: our model cares about what distribution generated the input data
points, and we want to mimic it with our probabilistic model. Our learned
model should be able to make up new samples from the distribution, not
just copy and paste existing samples!
93
What are Generative Models?
Figure from NIPS 2016 Tutorial: Generative Adversarial Networks (I. Goodfellow)
![Page 94: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/94.jpg)
94
Video Frame Prediction
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016
![Page 95: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/95.jpg)
95
Video Frame Prediction
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016
![Page 96: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/96.jpg)
96
Video Frame Prediction
Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016
![Page 97: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/97.jpg)
Adversarial Training analogy
Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake.
It’s not even green
![Page 98: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/98.jpg)
Adversarial Training analogy
There is no watermark
![Page 99: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/99.jpg)
Adversarial Training analogy
Watermark should be rounded
![Page 100: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/100.jpg)
Adversarial Training analogy
After enough iterations, and if the counterfeiter is good enough (for the G network this means "has enough parameters"), the police should be confused.
![Page 101: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/101.jpg)
Adversarial Training (batch update)
● Pick a sample x from the training set
● Show x to D and update weights to output 1 (real)
![Page 102: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/102.jpg)
Adversarial Training (batch update)
● G maps sample z to ẍ
● Show ẍ to D and update weights to output 0 (fake)
![Page 103: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/103.jpg)
Adversarial Training (batch update)
● Freeze D weights
● Update G weights to make D output 1 (just G weights!)
● Unfreeze D weights and repeat
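The batch-update loop above can be run end to end on a toy 1-D problem. This is a numpy stand-in, not a real architecture: D is a logistic classifier, G an affine map of noise z, the data distribution, learning rate and iteration count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

w, b = 0.1, 0.0     # discriminator parameters
s, m = 1.0, 0.0     # generator parameters: G(z) = s*z + m

def D(x):
    """Probability that x is real."""
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

lr = 0.05
for _ in range(1000):
    x_real = rng.normal(3.0, 0.5, 64)   # real data ~ N(3, 0.5)
    z = rng.normal(0.0, 1.0, 64)
    x_fake = s * z + m

    # Steps 1-2: update D to output 1 on real samples, 0 on fakes.
    d_real, d_fake = D(x_real), D(x_fake)
    w += lr * np.mean((1 - d_real) * x_real - d_fake * x_fake)
    b += lr * np.mean((1 - d_real) - d_fake)

    # Step 3: D frozen, update only G so that D outputs 1 on fakes.
    d_fake = D(x_fake)
    s += lr * np.mean((1 - d_fake) * w * z)
    m += lr * np.mean((1 - d_fake) * w)

print(round(m, 2))  # the generator mean drifts towards the real mean (3.0)
```

When D can no longer separate real from fake, the "police" is confused, which is exactly the intended equilibrium.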
![Page 104: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/104.jpg)
104
Generative Adversarial Networks (GANs)
Slide credit: Víctor Garcia
DiscriminatorD(·)
GeneratorG(·)
Real World
Randomseed (z)
Real/Synthetic
![Page 105: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/105.jpg)
105Slide credit: Víctor Garcia
Conditional Adversarial Networks
Real World
Real/Synthetic
Condition
DiscriminatorD(·)
GeneratorG(·)
Generative Adversarial Networks (GANs)
![Page 106: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/106.jpg)
Generating images/frames
(Radford et al. 2015)
The Deep Convolutional GAN (DCGAN) effectively generated 64x64 RGB images in a single shot, for example bedrooms from the LSUN dataset.
![Page 107: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/107.jpg)
Generating images/frames conditioned on captions
(Reed et al. 2016b) (Zhang et al. 2016)
![Page 108: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/108.jpg)
Unsupervised feature extraction/learning representations
Similarly to word2vec, GANs learn a distributed representation that disentangles concepts such that we can perform operations on the data manifold:
v(Man with glasses) - v(man) + v(woman) = v(woman with glasses)
(Radford et al. 2015)
![Page 109: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/109.jpg)
Image super-resolution
Bicubic: does not use data statistics. SRResNet: trained with MSE. SRGAN understands that there are multiple correct answers and commits to one, rather than averaging them.
(Ledig et al. 2016)
![Page 110: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/110.jpg)
Image super-resolution
Averaging is a serious problem we face when dealing with complex distributions.
(Ledig et al. 2016)
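The averaging failure is easy to see numerically: when a pixel has two equally valid values (say 0.0 or 1.0), the MSE-optimal prediction is their mean, a value that matches neither mode. A toy check:

```python
import numpy as np

# Bimodal ground truth: each sample is equally likely 0.0 or 1.0.
targets = np.array([0.0, 1.0] * 500)

# Sweep constant predictions and find the MSE-optimal one.
candidates = np.linspace(0.0, 1.0, 101)
mse = [np.mean((targets - c) ** 2) for c in candidates]
best = candidates[int(np.argmin(mse))]

print(best)  # 0.5: the blurry average, belonging to neither mode
```

An adversarial loss avoids this because a discriminator easily rejects the 0.5 "average" as belonging to neither mode.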
![Page 111: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/111.jpg)
Manipulating images and assisted content creation
https://youtu.be/9c4z6YsBGQ0?t=126 https://youtu.be/9c4z6YsBGQ0?t=161
(Zhu et al. 2016)
![Page 112: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/112.jpg)
112
Adversarial Networks
Slide credit: Víctor Garcia
Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. "Image-to-image translation with conditional adversarial networks." arXiv preprint arXiv:1611.07004 (2016).
Generator
Discriminator
Generated Pairs
Real World
Ground Truth Pairs
Loss → BCE
![Page 113: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/113.jpg)
113
Víctor Garcia and Xavier Giró-i-Nieto (work in progress)
Generator
Discriminator Loss2 GAN {Binary Crossentropy}
1/0
Generative Adversarial Networks (GANs)
![Page 114: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/114.jpg)
Generative models for video
114
Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." NIPS 2016.
![Page 115: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/115.jpg)
Generative models for video
115
Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." NIPS 2016.
![Page 116: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/116.jpg)
116
Adversarial Networks
Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative adversarial nets." NIPS 2014.
Goodfellow, Ian. "NIPS 2016 Tutorial: Generative Adversarial Networks." arXiv preprint arXiv:1701.00160 (2016).
F. Van Veen, “The Neural Network Zoo” (2016)
![Page 117: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/117.jpg)
Outline
1. Recognition
2. Optical Flow
3. Object Tracking
4. Audio and Video
5. Generative models
117
![Page 118: Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)](https://reader033.vdocuments.us/reader033/viewer/2022042908/58ecee3f1a28ab71728b4673/html5/thumbnails/118.jpg)
118
Thank you!
https://imatge.upc.edu/web/people/xavier-giro
https://twitter.com/DocXavi
https://www.facebook.com/ProfessorXavi
Xavier Giró-i-Nieto
[Part B: Video and audio]