ICIAP
TRANSCRIPT
Daily Living Activities Recognition via Efficient High- and Low-Level Cues Combination and Fisher Kernel Representation
Negar Rostamzadeh1
Gloria Zen1
Ionut Mironica2
Jasper Uijlings1
Nicu Sebe1
1 DISI, University of Trento, Trento, Italy; 2 LAPI, University Politehnica of Bucharest, Bucharest, Romania
Outline
• Daily Living Action Recognition
• State-of-the-art
• Our approach
• Results
• Conclusion
Motivation – State of the art – Our approach – Results – Conclusion
Action Recognition in videos
Answer phone or dial phone?
Difficulties in fine-grained activities:
1. Slightly different activities in motion and appearance.
2. Different manners of performing the same task.
Object-centric approaches (SoA)
Based on tracking and trajectories
Brendel et al., ICCV 2011 [5]; Han et al., CVPR 2004 [6]; de Campos et al., WACV 2011 [23]; Liu et al., CVPR 2009 [16]
Advantages
- Provide semantic/high-level information about the scene
Limitations
- Handling occlusions in object interactions
- Broken and missed trajectories
- The curse of dimensionality
Non-object-centric approaches (SoA)
Bag-of-words approaches relying on low-level features: foreground pixels, HoF, STIP, HoG
Laptev et al., CVPR 2008 [1]; Willems et al., ECCV 2008 [2]; Hospedales et al., ICCV 2009 [3]; Zen et al., CVPR 2011 [4]; Wong et al., CVPR 2007 [15]; Chang et al., ICCV 2011 [17]; Gilbert et al., ICCV 2009 [19]; Zelniker et al., 2008 [20]; Gehrig et al., 2009 [21]; Mahbub et al., ICIEV 2012 [25]
Advantages
- Robustness to noise & occlusions
- Computational efficiency
Limitations
1. Discard semantic & high-level information about the scene.
2. Discard relationships among spatio-temporal local features.
Enhanced descriptors (SoA)
Which body-part causes what motion?
6/23
Messing et al, ICCV, 2009 [7], Fathi et al, 2008 [8], Zhang et al, 2012 [9], Matikainen et al, 2010 [10], : Gaur et al, 2011 [11], Savarese et al, 2008 [12], Kovashka et al, Gireddy et al, 2011 [14], CVPR 2007[18] , Shechtman et al, CVPR 2011 [24]
Motivation – State of the art – Our approach – Results - Conclusion
1. Relations between local features: pair-wise [10,11,12,18], local space or time neighborhoods [11,18], spatio-temporal phrases [9]
2. Combining different local features, such as local motion, appearance, and positions [14,24]
3. Enriching the combination of low-level features with high-level information: detecting and localizing faces [7], STIP volumes [8,9]
Approach in a glance
Input video → body-part detector + low-level cues → fusing information to produce an enriched descriptor → accumulation over each video → feature representation (Fisher Kernel to model the temporal variation) → classifier → recognized activities
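As a minimal sketch of the fusion and accumulation steps above (the function names and the concatenation-based fusion are illustrative assumptions, not the paper's exact scheme):

```python
import numpy as np

def fuse_frame_descriptor(pose_feat, lowlevel_feat):
    # Enriched per-frame descriptor: high-level body-pose cues
    # combined with low-level cues (concatenation is an
    # illustrative fusion choice).
    return np.concatenate([pose_feat, lowlevel_feat])

def video_descriptor(frames_pose, frames_low):
    # Accumulate the enriched descriptors over all frames of a video;
    # the temporal variation of these rows is what the Fisher Kernel
    # representation later models.
    per_frame = [fuse_frame_descriptor(p, l)
                 for p, l in zip(frames_pose, frames_low)]
    return np.stack(per_frame)  # shape: (n_frames, dim)

# Toy example: 5 frames, 4-D pose features, 6-D low-level features.
rng = np.random.default_rng(0)
X = video_descriptor(rng.normal(size=(5, 4)), rng.normal(size=(5, 6)))
print(X.shape)  # (5, 10)
```

The per-video matrix X is then turned into a fixed-length vector by the feature representation before classification.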
Body-pose estimation
What is the problem with an off-the-shelf detector?
Our solution: an enhanced pose estimator. We employ the already-trained off-the-shelf classifier, but we provide additional information from the new dataset (ADL vs. BUFFY).
Body-pose estimation, built on Yang and Ramanan (PAMI 2012, CVPR 2011) [29]
1. Model the body as a pictorial structure (Felzenszwalb, CVPR 2010).
2. Model the body as a tree.
3. Each possible body configuration has a score.
Score = local score (HoG appearance) + pair-wise score
The score obtained by employing the off-the-shelf detector is S_initial.
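In standard pictorial-structure notation (symbols follow Yang and Ramanan's formulation, not the slide), the score of a body configuration p on image I sums local appearance terms over the parts and pair-wise deformation terms over the tree edges:

```latex
S(I, p) = \sum_{i \in V} w_i \cdot \phi(I, p_i)
        + \sum_{(i,j) \in E} w_{ij} \cdot \psi(p_i - p_j)
```

Here \phi(I, p_i) is the HoG appearance feature at part location p_i, and \psi(p_i - p_j) encodes the relative placement of connected parts i and j; the tree structure makes exact maximization by dynamic programming possible.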
Enhanced pose estimator
New Score = S_initial + w_fg · (Foreground Score) + w_of · (Optical Flow Score)
The weights w_fg and w_of encode the relative importance of the foreground and optical-flow scores.
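The weighted combination above can be sketched as follows (the weight values here are placeholders for illustration, not the tuned ones):

```python
def enhanced_score(s_initial, s_foreground, s_opticalflow,
                   w_fg=0.5, w_of=0.5):
    # Sketch of the enhanced pose score: the off-the-shelf detector
    # score is augmented with foreground and optical-flow evidence,
    # weighted by their relative importance.
    return s_initial + w_fg * s_foreground + w_of * s_opticalflow

print(enhanced_score(1.0, 0.4, 0.2))  # 1.3
```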
Enhanced pose estimator
New Score = S_initial + w_fg · (Foreground Score) + w_of · (Optical Flow Score)
[Qualitative comparison of pose estimates: SoA foreground and SoA optical-flow methods vs. our approach]
Tuning the weights in New Score = S_initial + w_fg · (Foreground Score) + w_of · (Optical Flow Score)
Enhanced pose estimator used to enrich the action recognition approach
Approach in a glance
Input video → body-part detector + low-level cues → fusing information to produce an enriched descriptor → accumulation over each video → feature representation (Fisher Kernel to model the temporal variation) → classifier → recognized activities
Fisher Kernel (FK) Theory
1. Introduced by Jaakkola (NIPS'99 [26]) for protein detection.
2. Applied to web audio classification (Moreno, 2000).
3. Introduced in computer vision for image categorization by Perronnin (CVPR'07).
Fisher Kernel in image categorization vs. video analysis
1. Modeling: spatial variation → temporal variation
2. Visual documents: small patches → frames of the video
3. Initial feature vectors: SIFT → our novel descriptors for action recognition
Fisher Kernel in the state-of-the-art
Fisher Kernel (FK) Theory
- Combines the benefits of generative and discriminative approaches.
- Represents a signal as the gradient of the log-likelihood under a generative model learned for that signal.
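As a sketch, the gradient representation can be computed with respect to the means of a diagonal GMM (the mean-only variant is a common simplification; the paper's exact formulation may also include weight and variance gradients, and the GMM parameters below are toy values):

```python
import numpy as np

def fisher_vector(frames, mu, sigma, pi):
    # Mean-gradient Fisher vector of a set of per-frame descriptors
    # w.r.t. a diagonal GMM with means mu (K, D), std devs sigma (K, D)
    # and mixing weights pi (K,).
    T, D = frames.shape
    diff = (frames[:, None, :] - mu[None]) / sigma[None]      # (T, K, D)
    # log N(x_t | mu_k, diag(sigma_k^2)) for every frame and component
    log_prob = (-0.5 * (diff ** 2).sum(-1)
                - np.log(sigma).sum(-1) - 0.5 * D * np.log(2 * np.pi))
    log_w = np.log(pi) + log_prob                             # (T, K)
    gamma = np.exp(log_w - log_w.max(1, keepdims=True))
    gamma /= gamma.sum(1, keepdims=True)                      # posteriors
    fv = (gamma[:, :, None] * diff).sum(0) / (T * np.sqrt(pi)[:, None])
    return fv.ravel()                                         # (K * D,)

rng = np.random.default_rng(0)
mu = np.array([[-1.0, 0.0], [1.0, 0.0]])   # toy 2-component, 2-D GMM
sigma = np.ones((2, 2))
pi = np.array([0.5, 0.5])
video = rng.normal(size=(30, 2))           # per-frame descriptors
print(fisher_vector(video, mu, sigma, pi).shape)   # (4,)
```

Every video thus maps to a fixed-length vector regardless of its number of frames, which a standard discriminative classifier can consume.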
Results on the ADL Rochester dataset
Conclusion
We proposed a novel descriptor that combines high-level semantic information with low-level cues.
We proposed an enhanced body-pose estimator.
We model the temporal variation with the Fisher Kernel representation.
Thank you!
References
1. Laptev, I., Marszalek, M., Schmid, C., & Rozenfeld, B. (2008). Learning realistic human actions from movies. In CVPR 2008 (pp. 1–8). IEEE.
2. Willems, G., Tuytelaars, T., & Van Gool, L. (2008). An efficient dense and scale-invariant spatio-temporal interest point detector. In ECCV 2008 (pp. 650–663).
3. Hospedales, T., Gong, S., & Xiang, T. (2009). A Markov clustering topic model for mining behaviour in video. In ICCV 2009 (pp. 1165–1172). IEEE.
4. Zen, G., & Ricci, E. (2011). Earth mover's prototypes: A convex learning approach for discovering activity patterns in dynamic scenes. In CVPR 2011. IEEE.
5. Brendel, W., & Todorovic, S. (2011). Learning spatiotemporal graphs of human activities. In ICCV 2011 (pp. 778–785). IEEE.
6. Han, M., Xu, W., Tao, H., & Gong, Y. (2004). An algorithm for multiple object trajectory tracking. In CVPR 2004 (Vol. 1, pp. I-864). IEEE.
7. Messing, R., Pal, C., & Kautz, H. (2009). Activity recognition using the velocity histories of tracked keypoints. In ICCV 2009 (pp. 104–111). IEEE.
8. Fathi, A., & Mori, G. (2008). Action recognition by learning mid-level motion features. In CVPR 2008 (pp. 1–8). IEEE.
9. Zhang, Y., Liu, X., Chang, M. C., Ge, W., & Chen, T. (2012). Spatio-temporal phrases for activity recognition. In ECCV 2012 (pp. 707–721).
10. Matikainen, P., Hebert, M., & Sukthankar, R. (2010). Representing pairwise spatial and temporal relations for action recognition. In ECCV 2010 (pp. 508–521).
11. Gaur, U., Zhu, Y., Song, B., & Roy-Chowdhury, A. (2011). A "string of feature graphs" model for recognition of complex activities in natural videos. In ICCV 2011 (pp. 2595–2602). IEEE.
12. Savarese, S., DelPozo, A., Niebles, J. C., & Fei-Fei, L. (2008). Spatial-temporal correlatons for unsupervised action classification. In WMVC 2008 (pp. 1–8). IEEE.
13. Taralova, E., De la Torre, F., & Hebert, M. (2011). Source constrained clustering. In ICCV 2011 (pp. 1927–1934). IEEE.
14. Malgireddy, M., Nwogu, I., & Govindaraju, V. (2011). A generative framework to investigate the underlying patterns in human activities. In ICCV Workshops 2011.
15. Wong, S. F., Kim, T. K., & Cipolla, R. (2007). Learning motion categories using both semantic and structural information. In CVPR 2007 (pp. 1–6). IEEE.
16. Liu, J., Luo, J., & Shah, M. (2009). Recognizing realistic actions from videos "in the wild". In CVPR 2009 (pp. 1996–2003). IEEE.
17. Chang, M. C., Krahnstoever, N., & Ge, W. (2011). Probabilistic group-level motion analysis and scenario recognition. In ICCV 2011 (pp. 747–754). IEEE.
18. Kovashka, A., & Grauman, K. (2010). Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In CVPR 2010 (pp. 2046–2053). IEEE.
19. Gilbert, A., Illingworth, J., & Bowden, R. (2009). Fast realistic multi-action recognition using mined dense spatio-temporal features. In ICCV 2009 (pp. 925–931). IEEE.
20. Zelniker, E., Gong, S., & Xiang, T. (2008). Global abnormal behaviour detection using a network of CCTV cameras. In The Eighth International Workshop on Visual Surveillance (VS2008).
21. Gehrig, D., Kuehne, H., Woerner, A., & Schultz, T. (2009). HMM-based human motion recognition with optical flow data. In Humanoids 2009 (pp. 425–430). IEEE.
22. Sadanand, S., & Corso, J. J. (2012). Action bank: A high-level representation of activity in video. In CVPR 2012 (pp. 1234–1241). IEEE.
23. de Campos, T., Barnard, M., Mikolajczyk, K., Kittler, J., Yan, F., Christmas, W., & Windridge, D. (2011). An evaluation of bags-of-words and spatio-temporal shapes for action recognition. In WACV 2011 (pp. 344–351). IEEE.
24. Shechtman, E., & Irani, M. (2011). Space-time behavior based correlation. In CVPR.
25. Mahbub, U., Imtiaz, H., Ahad, M., & Rahman, A. (2012). Motion clustering-based action recognition technique using optical flow. In ICIEV 2012 (pp. 919–924). IEEE.
26. Jaakkola, T., & Haussler, D. (1999). Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems (pp. 487–493).