A DISTRIBUTION BASED VIDEO REPRESENTATION
FOR HUMAN ACTION RECOGNITION
Yan Song, Sheng Tang, Yan-Tao Zheng, Tat-Seng Chua, Yongdong Zhang, Shouxun Lin

1 Laboratory of Advanced Computing Research, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
2 Graduate School of Chinese Academy of Sciences, Beijing, China
3 Institute for Infocomm Research, A*STAR, Singapore
4 School of Computing, National University of Singapore, Singapore
Outline
• Introduction
• Steps Overview
• Experiments
• Conclusion
Introduction
• Recently, researchers have turned their attention to local spatial-temporal features for human action recognition
• The Bag-of-Words (BoW) representation has some drawbacks:
– It partitions the local feature space into discrete parts, which introduces ambiguity and uncertainty into the video representation
– Re-training is required when a new category is added to the database or when the model is applied to a new database
Steps Overview
• Extract the spatial-temporal (local) features
– Apply a Gaussian filter in the spatial domain
– Apply a 1-D Gabor filter in the temporal domain
– Find interest points as the local maxima of the response function below

• R = [I * g_σ(x,y) * h_ev(t)]² + [I * g_σ(x,y) * h_od(t)]²
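A minimal NumPy/SciPy sketch of this response function (the parameter values `sigma`, `tau` and `omega` below are illustrative choices, not the authors' settings; `video` is assumed to be a grayscale array of shape (T, H, W)):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def response(video, sigma=2.0, tau=1.5, omega=None):
    """R = (I * g_sigma * h_ev)^2 + (I * g_sigma * h_od)^2."""
    if omega is None:
        omega = 4.0 / tau  # common choice tying frequency to the temporal scale
    # Spatial smoothing of every frame with a 2-D Gaussian g_sigma(x, y);
    # sigma 0 along axis 0 means no smoothing over time here.
    smoothed = gaussian_filter(video.astype(float), sigma=(0, sigma, sigma))
    # Quadrature pair of 1-D temporal Gabor filters h_ev(t), h_od(t).
    t = np.arange(-int(3 * tau), int(3 * tau) + 1)
    envelope = np.exp(-t**2 / tau**2)
    h_ev = -np.cos(2 * np.pi * omega * t) * envelope
    h_od = -np.sin(2 * np.pi * omega * t) * envelope
    even = convolve1d(smoothed, h_ev, axis=0)
    odd = convolve1d(smoothed, h_od, axis=0)
    return even**2 + odd**2

# Interest points are then the local maxima of the returned R volume.
```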
Generating feature vectors (Behavior Recognition via Sparse Spatio-Temporal Features) [3]
• Gradients can be computed not only along x and y, but also along t
• Spatio-temporal corners are defined as regions where the local gradient vectors point in orthogonal directions spanning x, y and t
• Intuitively, a spatio-temporal corner is an image region containing a spatial corner whose velocity vector is reversing direction
Visualization of cuboid based behavior recognition
Interest points belonging to different Gaussian components
Example of interest points belonging to different Gaussian components in 8 sampled frames from the action “running”. Different colors denote different Gaussian components.
Steps Overview

• Represent feature vectors with a Gaussian Mixture Model (GMM)
– Takes into account the fact that human motion patterns are continuously distributed
– Attempts to reveal the probabilistic structure of the local ST features
– Uses the MDL (Minimum Description Length) criterion to select the number of mixture components and prevent over-fitting
– Estimates the GMM with the Expectation-Maximization (EM) algorithm
Probabilistic Generative Models: Gaussian Mixture Model
• Mixture model: p(x) = Σ_{k=1}^{K} w_k N(x | μ_k, Σ_k), with mixture weights w_k summing to 1
• Mixture example: http://www.csse.monash.edu.au/~lloyd/Archive/2005-06-Mixture/
Using MDL to generate initial parameters for EM

• GMM model: p(x | Θ) = Σ_{k=1}^{K} w_k N(x | μ_k, Σ_k)
• Log-likelihood function: log L(Θ) = Σ_{n=1}^{N} log Σ_{k=1}^{K} w_k N(x_n | μ_k, Σ_k)
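The component-selection step can be sketched with scikit-learn's EM implementation. The MDL form used below, MDL(K) = −log L + (P/2) log N with P the number of free parameters of a full-covariance GMM, is the standard one and is an assumption here, not taken from the paper; `X` is assumed to be an (N, d) array of local ST descriptors:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mdl_score(gm, X):
    """MDL(K) = -log L + (P/2) log N for a fitted full-covariance GMM."""
    N, d = X.shape
    K = gm.n_components
    # free parameters: (K-1) weights + K*d means + K*d*(d+1)/2 covariances
    P = (K - 1) + K * d + K * d * (d + 1) // 2
    log_likelihood = gm.score(X) * N  # score() is the mean per-sample log L
    return -log_likelihood + 0.5 * P * np.log(N)

def fit_gmm_mdl(X, k_range=range(1, 11), seed=0):
    """Fit a GMM with EM for each K and keep the one minimising MDL."""
    best = None
    for k in k_range:
        gm = GaussianMixture(n_components=k, covariance_type='full',
                             random_state=seed).fit(X)
        score = mdl_score(gm, X)
        if best is None or score < best[0]:
            best = (score, gm)
    return best[1]
```

The penalty term grows with K, so over-fitted mixtures with many components lose to more compact ones even when their raw likelihood is slightly higher.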
Optimal number of components
Number of GMM components automatically selected by the MDL criterion on the KTH dataset.
Steps Overview
• Compute the distance between two videos as the KL (Kullback-Leibler) divergence between their GMMs
– Estimating it with Monte-Carlo simulation is computationally too expensive
– Instead, a variational lower bound [12] is used to estimate the KL divergence
KL divergence (Approximating the Kullback-Leibler Divergence Between Gaussian Mixture Models [12])
• Definition of the KL divergence: D(f ‖ g) = ∫ f(x) log( f(x) / g(x) ) dx
• The KL divergence between two GMMs has no closed form
• A variational lower bound [12] is therefore used to estimate it
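The variational approximation of [12] combines closed-form KL divergences between individual Gaussian components. A sketch under the assumption that each GMM is passed as a `(weights, means, covariances)` tuple with full covariance matrices (this container format is illustrative, not from the paper):

```python
import numpy as np

def kl_gauss(m0, S0, m1, S1):
    """Closed-form KL divergence between two Gaussians N(m0,S0) and N(m1,S1)."""
    d = m0.shape[0]
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def kl_gmm_variational(f, g):
    """Variational approximation of D(f || g) for two GMMs."""
    wf, mf, Sf = f
    wg, mg, Sg = g
    total = 0.0
    for a in range(len(wf)):
        # self-similarity of component a inside f ...
        num = sum(wf[ap] * np.exp(-kl_gauss(mf[a], Sf[a], mf[ap], Sf[ap]))
                  for ap in range(len(wf)))
        # ... versus its similarity to the components of g
        den = sum(wg[b] * np.exp(-kl_gauss(mf[a], Sf[a], mg[b], Sg[b]))
                  for b in range(len(wg)))
        total += wf[a] * np.log(num / den)
    return total
```

For identical GMMs the approximation is exactly zero, and for single-component mixtures it reduces to the closed-form Gaussian KL divergence.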
Experiments
• Employ the KTH dataset and the UCF sports dataset
• Use the average recognition accuracy as the evaluation criterion
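Average recognition accuracy is the mean of the per-class accuracies, which can be read off a confusion matrix; a small sketch (assuming rows index the true classes and columns the predicted ones):

```python
import numpy as np

def average_accuracy(cm):
    """Mean per-class accuracy from a confusion matrix (rows = true classes)."""
    cm = np.asarray(cm, dtype=float)
    per_class = np.diag(cm) / cm.sum(axis=1)  # recall of each action class
    return per_class.mean()

cm = np.array([[9, 1],
               [2, 8]])
print(round(average_accuracy(cm), 4))  # 0.85
```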
Average accuracies of three tests on KTH
Average recognition accuracies of three tests on KTH.
Average accuracies of three approaches on UCF sports
Confusion Matrices
Confusion matrices on (a) KTH and (b) UCF sports.
Conclusion
• Exploited the probabilistic distribution to encode local ST features
• Makes the representation compatible with most discriminative classifiers