ICIAP
TRANSCRIPT
Daily Living Activities Recognition via Efficient High- and Low-Level Cues Combination and Fisher Kernel Representation
Negar Rostamzadeh1
Gloria Zen1
Ionut Mironica2
Jasper Uijlings1
Nicu Sebe1
1 DISI, University of Trento, Trento, Italy; 2 LAPI, University Politehnica of Bucharest, Bucharest, Romania
Outline
• Daily Living Action Recognition
• State-of-the-art
• Our approach
• Results
• Conclusion
Motivation – State of the art – Our approach – Results – Conclusion
Action Recognition in videos
Answer phone or dial phone?
Difficulties in fine-grained activities:
1. Slightly different activities in motion and appearance.
2. Different manners of performing the same task.
Object-centric approaches (SoA)
Based on tracking and trajectories
Brendel et al., ICCV 2011 [5]; Han et al., CVPR 2004 [6]; de Campos et al., WACV 2011 [23]; Liu et al., CVPR 2009 [16]
Advantages
- Provide semantic/high-level information about the scene
Limitations
- Handling occlusions in object interactions
- Broken and missed trajectories
- The curse of dimensionality
Non-object-centric approaches (SoA)
Bag-of-words approaches relying on low-level features: foreground pixels, HoF, STIP, HoG
Laptev et al., CVPR 2008 [1]; Willems et al., ECCV 2008 [2]; Hospedales et al., ICCV 2009 [3]; Zen et al., CVPR 2011 [4]; Wong et al., CVPR 2007 [15]; Chang et al., ICCV 2011 [17]; Gilbert et al., ICCV 2009 [19]; Zelniker et al., 2008 [20]; Gehrig et al., 2009 [21]; Mahbub et al., ICIEV 2012 [25]
Advantages
- Robustness to noise & occlusions
- Computational efficiency
Limitations
1. Discard semantic & high-level information about the scene.
2. Discard relationships among spatio-temporal local features.
Enhanced descriptors (SoA)
Which body-part causes what motion?
6/23
Messing et al, ICCV, 2009 [7], Fathi et al, 2008 [8], Zhang et al, 2012 [9], Matikainen et al, 2010 [10], : Gaur et al, 2011 [11], Savarese et al, 2008 [12], Kovashka et al, Gireddy et al, 2011 [14], CVPR 2007[18] , Shechtman et al, CVPR 2011 [24]
Motivation – State of the art – Our approach – Results - Conclusion
1. Relations between local features: pair-wise [10,11,12,18], local space or time neighborhoods [11,18], spatio-temporal phrases [9]
2. Combining different local features, such as local motion, appearance, and positions [14,24]
3. Enriching the combination of low-level features with high-level information: detecting and localizing faces [7], STIP volumes [8,9]
Approach in a glance
Input video → body-part detector + low-level cues → fusing information to produce an enriched descriptor → accumulation over each video → feature representation (Fisher Kernel to model the temporal variation) → classifier → recognized activities
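As a minimal sketch of the fusion and accumulation steps above (the function names and the concatenation-based fusion are illustrative assumptions, not the paper's exact scheme):

```python
import numpy as np

def fuse_frame_descriptor(pose_feat, lowlevel_feat):
    # Enriched per-frame descriptor: high-level body-pose cues
    # combined with low-level cues (concatenation is an
    # illustrative fusion choice).
    return np.concatenate([pose_feat, lowlevel_feat])

def video_descriptor(frames_pose, frames_low):
    # Accumulate the enriched descriptors over all frames of a video;
    # the temporal variation of these rows is what the Fisher Kernel
    # representation later models.
    per_frame = [fuse_frame_descriptor(p, l)
                 for p, l in zip(frames_pose, frames_low)]
    return np.stack(per_frame)  # shape: (n_frames, dim)

# Toy example: 5 frames, 4-D pose features, 6-D low-level features.
rng = np.random.default_rng(0)
X = video_descriptor(rng.normal(size=(5, 4)), rng.normal(size=(5, 6)))
print(X.shape)  # (5, 10)
```

The per-video matrix X is then turned into a fixed-length vector by the feature representation before classification.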
Body-pose estimation
What is the problem with an off-the-shelf detector?
Our solution: an enhanced pose estimator. We employ the already-trained off-the-shelf classifier, but we provide additional information from the new dataset (ADL vs. BUFFY).
Body-pose estimation, built on Yang and Ramanan (PAMI 2012, CVPR 2011) [29]
1. Model the body as a pictorial structure (Felzenszwalb, CVPR 2010).
2. Model the body as a tree.
3. Each possible body configuration has a score.
Score = local score (HoG appearance) + pair-wise score
The score obtained by employing the off-the-shelf detector is S_initial.
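In standard pictorial-structure notation (symbols follow Yang and Ramanan's formulation, not the slide), the score of a body configuration p on image I sums local appearance terms over the parts and pair-wise deformation terms over the tree edges:

```latex
S(I, p) = \sum_{i \in V} w_i \cdot \phi(I, p_i)
        + \sum_{(i,j) \in E} w_{ij} \cdot \psi(p_i - p_j)
```

Here \phi(I, p_i) is the HoG appearance feature at part location p_i, and \psi(p_i - p_j) encodes the relative placement of connected parts i and j; the tree structure makes exact maximization by dynamic programming possible.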
Enhanced pose estimator
New Score = S_initial + w_fg · (Foreground Score) + w_of · (Optical Flow Score)
The weights w_fg and w_of encode the relative importance of the foreground and optical-flow scores.
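The weighted combination above can be sketched as follows (the weight values here are placeholders for illustration, not the tuned ones):

```python
def enhanced_score(s_initial, s_foreground, s_opticalflow,
                   w_fg=0.5, w_of=0.5):
    # Sketch of the enhanced pose score: the off-the-shelf detector
    # score is augmented with foreground and optical-flow evidence,
    # weighted by their relative importance.
    return s_initial + w_fg * s_foreground + w_of * s_opticalflow

print(enhanced_score(1.0, 0.4, 0.2))  # 1.3
```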
Enhanced pose estimator
New Score = S_initial + w_fg · (Foreground Score) + w_of · (Optical Flow Score)
[Qualitative comparison of pose estimates: SoA foreground and SoA optical-flow methods vs. our approach]
Tuning the weights in New Score = S_initial + w_fg · (Foreground Score) + w_of · (Optical Flow Score)
Enhanced pose estimator used to enrich the action recognition approach
Approach in a glance
Input video → body-part detector + low-level cues → fusing information to produce an enriched descriptor → accumulation over each video → feature representation (Fisher Kernel to model the temporal variation) → classifier → recognized activities
Fisher Kernel (FK) Theory
1. Introduced by Jaakkola (NIPS'99 [26]) for protein detection.
2. Applied to web audio classification (Moreno, 2000).
3. Introduced in computer vision for image categorization by Perronnin (CVPR'07).
Fisher Kernel in image categorization vs. video analysis
1. Modeling: spatial variation → temporal variation
2. Visual documents: small patches → frames of the video
3. Initial feature vectors: SIFT → our novel descriptors for action recognition
Fisher Kernel in the state-of-the-art
Fisher Kernel (FK) Theory
- Combines the benefits of generative and discriminative approaches.
- Represents a signal as the gradient of the log-likelihood under a generative model learned for that signal.
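As a sketch, the gradient representation can be computed with respect to the means of a diagonal GMM (the mean-only variant is a common simplification; the paper's exact formulation may also include weight and variance gradients, and the GMM parameters below are toy values):

```python
import numpy as np

def fisher_vector(frames, mu, sigma, pi):
    # Mean-gradient Fisher vector of a set of per-frame descriptors
    # w.r.t. a diagonal GMM with means mu (K, D), std devs sigma (K, D)
    # and mixing weights pi (K,).
    T, D = frames.shape
    diff = (frames[:, None, :] - mu[None]) / sigma[None]      # (T, K, D)
    # log N(x_t | mu_k, diag(sigma_k^2)) for every frame and component
    log_prob = (-0.5 * (diff ** 2).sum(-1)
                - np.log(sigma).sum(-1) - 0.5 * D * np.log(2 * np.pi))
    log_w = np.log(pi) + log_prob                             # (T, K)
    gamma = np.exp(log_w - log_w.max(1, keepdims=True))
    gamma /= gamma.sum(1, keepdims=True)                      # posteriors
    fv = (gamma[:, :, None] * diff).sum(0) / (T * np.sqrt(pi)[:, None])
    return fv.ravel()                                         # (K * D,)

rng = np.random.default_rng(0)
mu = np.array([[-1.0, 0.0], [1.0, 0.0]])   # toy 2-component, 2-D GMM
sigma = np.ones((2, 2))
pi = np.array([0.5, 0.5])
video = rng.normal(size=(30, 2))           # per-frame descriptors
print(fisher_vector(video, mu, sigma, pi).shape)   # (4,)
```

Every video thus maps to a fixed-length vector regardless of its number of frames, which a standard discriminative classifier can consume.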
Results on the ADL Rochester dataset
Conclusion
We proposed a novel descriptor that combines high-level semantic information with low-level cues.
We proposed an enhanced body-pose estimator.
We model the temporal variation with the Fisher Kernel representation.
Thank you!
References
1. Laptev, I., Marszalek, M., Schmid, C., & Rozenfeld, B. (2008). Learning realistic human actions from movies. In CVPR 2008 (pp. 1–8). IEEE.
2. Willems, G., Tuytelaars, T., & Van Gool, L. (2008). An efficient dense and scale-invariant spatio-temporal interest point detector. In ECCV 2008 (pp. 650–663).
3. Hospedales, T., Gong, S., & Xiang, T. (2009). A Markov clustering topic model for mining behaviour in video. In ICCV 2009 (pp. 1165–1172). IEEE.
4. Zen, G., & Ricci, E. (2011). Earth mover's prototypes: A convex learning approach for discovering activity patterns in dynamic scenes. In CVPR 2011. IEEE.
5. Brendel, W., & Todorovic, S. (2011). Learning spatiotemporal graphs of human activities. In ICCV 2011 (pp. 778–785). IEEE.
6. Han, M., Xu, W., Tao, H., & Gong, Y. (2004). An algorithm for multiple object trajectory tracking. In CVPR 2004 (Vol. 1, pp. I-864). IEEE.
7. Messing, R., Pal, C., & Kautz, H. (2009). Activity recognition using the velocity histories of tracked keypoints. In ICCV 2009 (pp. 104–111). IEEE.
8. Fathi, A., & Mori, G. (2008). Action recognition by learning mid-level motion features. In CVPR 2008 (pp. 1–8). IEEE.
9. Zhang, Y., Liu, X., Chang, M. C., Ge, W., & Chen, T. (2012). Spatio-temporal phrases for activity recognition. In ECCV 2012 (pp. 707–721).
10. Matikainen, P., Hebert, M., & Sukthankar, R. (2010). Representing pairwise spatial and temporal relations for action recognition. In ECCV 2010 (pp. 508–521).
11. Gaur, U., Zhu, Y., Song, B., & Roy-Chowdhury, A. (2011). A "string of feature graphs" model for recognition of complex activities in natural videos. In ICCV 2011 (pp. 2595–2602). IEEE.
12. Savarese, S., DelPozo, A., Niebles, J. C., & Fei-Fei, L. (2008). Spatial-temporal correlatons for unsupervised action classification. In WMVC 2008 (pp. 1–8). IEEE.
13. Taralova, E., De la Torre, F., & Hebert, M. (2011). Source constrained clustering. In ICCV 2011 (pp. 1927–1934). IEEE.
14. Malgireddy, M., Nwogu, I., & Govindaraju, V. (2011). A generative framework to investigate the underlying patterns in human activities. In ICCV Workshops 2011.
15. Wong, S. F., Kim, T. K., & Cipolla, R. (2007). Learning motion categories using both semantic and structural information. In CVPR 2007 (pp. 1–6). IEEE.
16. Liu, J., Luo, J., & Shah, M. (2009). Recognizing realistic actions from videos "in the wild". In CVPR 2009 (pp. 1996–2003). IEEE.
17. Chang, M. C., Krahnstoever, N., & Ge, W. (2011). Probabilistic group-level motion analysis and scenario recognition. In ICCV 2011 (pp. 747–754). IEEE.
18. Kovashka, A., & Grauman, K. (2010). Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In CVPR 2010 (pp. 2046–2053). IEEE.
19. Gilbert, A., Illingworth, J., & Bowden, R. (2009). Fast realistic multi-action recognition using mined dense spatio-temporal features. In ICCV 2009 (pp. 925–931). IEEE.
20. Zelniker, E., Gong, S., & Xiang, T. (2008). Global abnormal behaviour detection using a network of CCTV cameras. In The Eighth International Workshop on Visual Surveillance (VS2008).
21. Gehrig, D., Kuehne, H., Woerner, A., & Schultz, T. (2009). HMM-based human motion recognition with optical flow data. In Humanoids 2009 (pp. 425–430). IEEE.
22. Sadanand, S., & Corso, J. J. (2012). Action bank: A high-level representation of activity in video. In CVPR 2012 (pp. 1234–1241). IEEE.
23. de Campos, T., Barnard, M., Mikolajczyk, K., Kittler, J., Yan, F., Christmas, W., & Windridge, D. (2011). An evaluation of bags-of-words and spatio-temporal shapes for action recognition. In WACV 2011 (pp. 344–351). IEEE.
24. Shechtman, E., & Irani, M. (2011). Space-time behavior based correlation. In CVPR.
25. Mahbub, U., Imtiaz, H., Ahad, M., & Rahman, A. (2012). Motion clustering-based action recognition technique using optical flow. In ICIEV 2012 (pp. 919–924). IEEE.
26. Jaakkola, T., & Haussler, D. (1999). Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems (pp. 487–493).