towards efficientend-to-end architecturesfor actionrecognition...
TRANSCRIPT
![Page 1: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/1.jpg)
||
Limin WangComputer Vision Laboratory, ETH Zurich
Workshop on frontiers of video technology -- 2017
17/7/27Limin Wang (CVL ETHZ) 1
Towards efficient end-to-end architectures foraction recognition and detection in videos
![Page 2: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/2.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 2
Action recognition in videos
§ 1. Action recognition “in the lab”: KTH, Weizmann etc.§ 2. Action recognition “in TV, Movies”: UCF Sports, Holloywood etc.§ 3. Action recognition “in Web Videos”: HMDB, UCF101, THUMOS,
ActivityNet etc.Haroon Idrees et al. The THUMOS Challenge on Action Recognition for Videos "in the Wild”, in Computer Vision and Image Understanding (CVIU), 2017.
![Page 3: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/3.jpg)
||
§ Action Recognition: classify the short clip or untrimmedvideo into pre-defined class.
§ Action Temporal Localization: detect starting and ending times of action instances in untrimmed video.
§ Action Spatial Detection: detect the bounding boxes ofactors in trimmed videos.
§ Action Spatial-Temporal Detection: combine temporaland spatial localization in untrimmed videos.
17/7/27Limin Wang (CVL ETHZ) 3
Action Understanding Tasks
![Page 4: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/4.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 4
Action recognition -- deep networks (2014)
Andrej Karpathy et al., Large-scale Video Classification with Convolutional Neural Networks, in CVPR, 2014.
![Page 5: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/5.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 5
Action recognition – two stream CNN (2014)
Karen Simonyan and Andrew Zisserman, Two-Stream Convolutional Networks for Action Recognition in Videos,in NIPS, 2014.
![Page 6: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/6.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 6
Action recognition – 3D CNN (2015)
Du Tran et al. Learning Spatiotemporal Features with 3D Convolutional Networks, in ICCV, 2015.
![Page 7: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/7.jpg)
||
§ Opportunities§ Videos provide huge and rich data for visual learning§ Action is important in motion perception and has many applications
§ Challenges§ Temporal models and representations§ High computational and memory cost§ Noisy and weakly labels
17/7/27Limin Wang (CVL ETHZ) 7
Opportunities and Challenges
![Page 8: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/8.jpg)
||
Action Recognition(AR)
• TemporalSegment Net(TSN)
Action Detection(AD)
• StructuredSegment Net(SSN)
Weakly SupervisedAR & AD
• UntrimmedNet
17/7/27Limin Wang (CVL ETHZ) 8
Overview
§ [1] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, in ECCV, 2016.
§ [2] L. Wang, Y. Xiong, D. Lin, and L. Van Gool, UntrimmedNets for Weakly Supervised Action Recognition and Detection, in CVPR 2017.
§ [3] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, D. Lin, and X. Tang, Temporal Action Detection with Structured Segment Networks, in ICCV 2017.
![Page 9: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/9.jpg)
||
§ Towards end-to-end and video-level architecture.§ Modeling issue: mainstream CNN frameworks focus on
appearance and short-term motion.
17/7/27Limin Wang (CVL ETHZ) 9
Motivation of TSN
![Page 10: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/10.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 10
Modeling Long-Range Structure
Stacking multiple frames: dense and local
[1] Joe Yue-Hei Ng et al. Beyond Short Snippets: Deep Networks for video classification, in CVPR 2015.[2] C. Feichtenhofer et al. Convolutional Two-Stream Network Fusion for Video Action Recognition, in CVPR 2016.[3] Gul Varol et al., Long-term Temporal Convolutions for Action Recognition in PAMI 2017.
![Page 11: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/11.jpg)
||
§ There are high data redundancy in video.§ High-level semantics vary slowly (slowness).§ Our segment sampling share two properties:
§ Sparse: processing efficiency§ Global: duration invariant and modeling the entire video content.
17/7/27Limin Wang (CVL ETHZ) 11
Segment Based Sampling
![Page 12: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/12.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 12
Modeling Long-Range Structure
Segment based sampling: sparse and global
![Page 13: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/13.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 13
Overview of TSN
TSN is a video-level framework based on simple strategies of segment samplingand consensus aggregation.
![Page 14: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/14.jpg)
||
§ Aggregation function aims to summarize the predictions ofdifferent snippet to yield the video-level prediction.
§ Simple aggregation functions:§ Mean pooling, max pooling, weighted average
§ Advanced aggregation functions:§ Top-k pooling, attention weighting
17/7/27Limin Wang (CVL ETHZ) 14
Aggregation Function
![Page 15: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/15.jpg)
||
Stacking RGB differenceStacking warped optical field
17/7/27Limin Wang (CVL ETHZ) 15
Input modalities
![Page 16: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/16.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 16
Experiment result -- input modality
S. Ioffe et al., Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, inICML 2015.
![Page 17: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/17.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 17
Exploration on TSN
92.4
94.2 94.7 94.9 94.9
8586.5 86.7 86.4 86.2
88.389.8 90.1 90.1 89.7
K=1 K=3 K=5 K=7 K=9
Accuracy on UCF101 (%)
Two stream Spatial stream Temporal stream
![Page 18: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/18.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 18
Evaluation on TSN
81.8 84 83.1 84.7 84.1
62
72.8 70.573.6 71.8
85.4 87.8 86.4 88.1 88.2
50556065707580859095
MAX POOLING AVERAGE POOLING WEIGHTED AVERAGE TOP-k POOLING ATTENTION WEIGHTING
mAP on ActivityNet (%)
Spatial stream Temporal Stream Two stream
84.986.4 86.4 85.5 86.1
83.5
90.1 89.7 88.8 89.1
92.494.9 94.8 94.2 94.6
7678808284868890929496
MAX POOLING AVERAGE POOLING WEIGHTED AVERAGE TOP-k POOLING ATTENTION WEIGHTING
Accuracy on UCF101 (%)
Spatial stream Temporal Stream Two stream
![Page 19: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/19.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 19
Comparison on Trimmed Datasets
57.2
61.759.4
56.8
63.264.8
71
50
55
60
65
70
75Accuracy on HMDB51 (%)
85.988.3 88
85.2
90.391.7
94.9
75
80
85
90
95
100Accuracy on UCF101 (%)
[1] L. Soomro et al., UCF101: A datasetof 101 human action classes from videos in the wild, in arXiv 1212.0402,2012.[2] H. Kuehne et al., HMDB: A large video database for human motion recognition, in ICCV, 2011.
![Page 20: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/20.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 20
Comparison on Untrimmed Datasets
63.166.1
71.6
61.5
80.1
5055606570758085
mAPonTHUMOS(%)
66.5
71.974.1
78.1
89.6
50556065707580859095
mAP onActivityNet (%)
[1] H. Idrees et al., The THUMOS Challenge on Action Recognition for Videos “in the Wild”, in CVIU, 2017.[2] F. C. Heilbron et al., ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding, in CVPR, 2015.
![Page 21: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/21.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 21
CVPR ActivityNet Challenge -- 2016
![Page 22: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/22.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 22
CVPR ActivityNet Challenge -- 2016
![Page 23: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/23.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 23
CVPR ActivityNet Challenge -- 2016
![Page 24: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/24.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 24
Kinetics dataset
![Page 25: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/25.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 25
Results on Kinetics Dataset
RGB, Pretrain on ImageNet, TSN: top-1 70.28%, 89.13%RGB, Train from scratch, TSN: top-1 69.55%, 88.68%
![Page 26: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/26.jpg)
||
Action Recognition(AR)
• TemporalSegment Net(TSN)
Action Detection(AD)
• StructuredSegment Net(SSN)
Weakly SupervisedAR & AD
• UntrimmedNet
17/7/27Limin Wang (CVL ETHZ) 26
Overview
§ [1] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, in ECCV, 2016.
§ [2] L. Wang, Y. Xiong, D. Lin, and L. Van Gool, UntrimmedNets for Weakly Supervised Action Recognition and Detection, in CVPR 2017.
§ [3] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, D. Lin, and X. Tang, Temporal Action Detection with Structured Segment Networks, in ICCV 2017.
![Page 27: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/27.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 27
Motivation of Structured Segment Network
1. Action detection in untrimmedvideo is an important problem.
2. Snippet-level classifier is difficult toaccurately localize the temporalextent of action instance.
Context and Structure Modeling!
![Page 28: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/28.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 28
Temporal Region Proposal
Bottom up proposal generation based on actionness map
![Page 29: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/29.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 29
Structured Segment Network (SSN)
![Page 30: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/30.jpg)
||
§ To model the class classes and completeness ofinstances, we design a two classifier loss
P(c,b|p) = P(c|p)P(b|c,p)§ Action class classifier measure the likelihood of action
class distribution: P(c|p)§ Completeness classifier measure the likelihood of
instance completeness: P(b|c,p)§ A joint loss to optimize these two classifiers:
17/7/27Limin Wang (CVL ETHZ) 30
Two Classifier Design
![Page 31: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/31.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 31
Experiment result -- Action Proposal
[1] V. Escorcia, F. Caba Heilbron, J. C. Niebles, and B. Ghanem. Daps: Deep action proposals for action understanding. In, ECCV, pages 768–784, 2016. [2] Fabian Caba Heilbron et al.. Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In CVPR, 2016.
![Page 32: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/32.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 32
Experiment result -- Component Analysis
![Page 33: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/33.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 33
Experiment result -- Comparison
[1] H. Idrees et al., The THUMOS Challenge on Action Recognition for Videos “in the Wild”, in CVIU, 2017.[2] F. C. Heilbron et al., ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding, in CVPR, 2015.
![Page 34: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/34.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 34
Detection example (1)
Green: correct detectionRed: bad localizationYellow: multiple detections
![Page 35: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/35.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 35
Detection example (2)
Green: correct detectionRed: bad localizationYellow: multiple detections
![Page 36: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/36.jpg)
||
Action Recognition(AR)
• TemporalSegment Net(TSN)
Action Detection(AD)
• StructuredSegment Net(SSN)
Weakly SupervisedAR & AD
• UntrimmedNet
17/7/27Limin Wang (CVL ETHZ) 36
Overview of temporal modeling
§ [1] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, in ECCV, 2016.
§ [2] L. Wang, Y. Xiong, D. Lin, and L. Van Gool, UntrimmedNets for Weakly Supervised Action Recognition and Detection, in CVPR 2017.
§ [3] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, D. Lin, and X. Tang, Temporal Action Detection with Structured Segment Networks, in ICCV 2017.
![Page 37: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/37.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 37
Motivation of UntrimmedNet
1. Labeling untrimmed videois expensive and timeconsuming
2. Temporal annotation issubjective and notconsistent across personsand datasets
![Page 38: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/38.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 38
Overview of UntrimmedNet
![Page 39: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/39.jpg)
||
§ Uniform Sampling§ Uniform sampling of fixed duration
§ Shot based Sampling§ First shot detection based HOG difference§ For each shot, perform uniform sampling.
17/7/27Limin Wang (CVL ETHZ) 39
Clip Proposal
![Page 40: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/40.jpg)
||
§ Following TSN framework:§ Sampling a few snippets from each clip.§ Aggregating snippet-level predictions with average pooling
§ In practice, we use two stream input: RGB and OpticalFlow
17/7/27Limin Wang (CVL ETHZ) 40
Clip Classification
![Page 41: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/41.jpg)
||
§ Selection aims to select discriminative clips or rank themwith attention weights.
§ Two selection methods:§ Hard selection: top-k pooling over clip-level prediction§ Soft selection: learning attention weights for different clips
17/7/27Limin Wang (CVL ETHZ) 41
Clip Selection
![Page 42: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/42.jpg)
||
Top-k Pooling
Washing Dishes
![Page 43: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/43.jpg)
||
-1.3 0.8 -0.9 1.5 0.4
0.03 0.25 0.05 0.50 0.17
X X X X X
Attention weighting
![Page 44: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/44.jpg)
||
§ UntrimmedNet is an end-to-end learning architecture,combing three modules: feature extraction, classificationmodule, selection module.
§ Video-level prediction: a bilinear model over classificationscore and selection weights.
§ The whole pipeline could be optimized with standard backpropagation algorithm.
17/7/27Limin Wang (CVL ETHZ) 44
UntrimmedNet
![Page 45: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/45.jpg)
||
§ Action Recognition:§ In practice, we sample a single frame (or 5 frame stacking of
optical flow) every 30 frames.§ The recognition from sampled frames are aggregated with top-k
pooling (k set to 20) to yield the final video-level prediction. § Action Detection:
§ we sample frames every 15 frame and for each frame, we get both prediction scores and attention weights.
§ we remove background by thresholding (set to 0.0001) on the attention weights .
§ we produce the final detection results by thresholding (set to 0.5) on the classification scores.
17/7/27Limin Wang (CVL ETHZ) 45
Weakly supervised AR and AD
![Page 46: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/46.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 46
Exploration Study
![Page 47: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/47.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 47
Exploration Study
![Page 48: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/48.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 48
Exploration Study
![Page 49: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/49.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 49
Experiment Results -- Action recognition
[1] H. Idrees et al., The THUMOS Challenge on Action Recognition for Videos “in the Wild”, in CVIU, 2017.[2] F. C. Heilbron et al., ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding, in CVPR, 2015.
![Page 50: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/50.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 50
Experiment Results -- Action recognition
[1] H. Idrees et al., The THUMOS Challenge on Action Recognition for Videos “in the Wild”, in CVIU, 2017.[2] F. C. Heilbron et al., ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding, in CVPR, 2015.
![Page 51: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/51.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 51
Experiment Results -- Action detection
[1] H. Idrees et al., The THUMOS Challenge on Action Recognition for Videos “in the Wild”, in CVIU, 2017.
![Page 52: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/52.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 52
Examples of Attention
![Page 53: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/53.jpg)
||
§ Temporal modeling is important for action understanding.§ Segment based sampling shares two properties: global
and sparse.§ TSN is a general and flexible framework for action
modeling.§ SSN extends TSN for action detection with context and
structure modeling.§ UntrimmedNet extends TSN for weakly supervised
setting with attention modeling.
17/7/27Limin Wang (CVL ETHZ) 53
Summary
![Page 54: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/54.jpg)
||
§ [1] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, in ECCV, 2016.
§ [2] L. Wang, Y. Xiong, D. Lin, and L. Van Gool, UntrimmedNets for Weakly Supervised Action Recognition and Detection, in CVPR 2017.
§ [3] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, D. Lin, and X. Tang, Temporal Action Detection with Structured Segment Networks, in ICCV 2017.
17/7/27Limin Wang (CVL ETHZ) 54
Code and References
§ Temporal segment network:https://github.com/yjxiong/temporal-segment-network
§ Structured segment network:https://github.com/yjxiong/action-detection§ UntrimmedNet:https://github.com/wanglimin/UntrimmedNet
§ Video Caffe:https://github.com/yjxiong/caffe
![Page 55: Towards efficientend-to-end architecturesfor actionrecognition ...wanglimin.github.io/fvt_slide.pdf · Du Tranet al. Learning Spatiotemporal Features with 3D Convolutional Networks,inICCV,](https://reader036.vdocuments.us/reader036/viewer/2022081405/5f0aee8b7e708231d42e0c29/html5/thumbnails/55.jpg)
|| 17/7/27Limin Wang (CVL ETHZ) 55
Collaborators