Collaborative Activities Understanding
from 3D Data
ALCOR LAB, DEPARTMENT OF COMPUTER, CONTROL AND
MANAGEMENT ENGINEERING “ANTONIO RUBERTI”
Fabrizio Natola, Valsamis Ntouskos, Prof. Fiora Pirri
DOCTORAL CONSORTIUM
F. Natola, V. Ntouskos, F. Pirri Collaborative Activities Understanding
from 3D Data
Page 2
Introducing myself
• Name: Fabrizio Natola
• Bachelor’s Degree in Computer Engineering at Sapienza, University of Rome.
– Title of the thesis: Gestione Remota di un Database Multimediale con Applicativo Mobile su Piattaforma Android (Remote Management of a Multimedia Database by means of a Mobile Application on the Android Platform) (Supervisor: Prof. Gianni Orlandi)
• Master’s Degree in Artificial Intelligence and Robotics at Sapienza, University of Rome.
– Title of the thesis: Activity Understanding from 3D Data (Supervisor: Prof. Fiora Pirri)
• Research Statement: Human Action Recognition
Introduction
• Goal: Recognizing Collaborative Human Activities from 3D Data;
Problems
• General
– Occlusions
– Background variations
– Scale
– Inter/intra-person variations
• Collaboration
– Recognition of only simple actions
– Atomic actions + interactions
[1] - Kong, Yu, Yunde Jia, and Yun Fu. "Learning human interaction by interactive phrases." (ECCV, 2012)
[2] - Choi, Wongun, Khuram Shahid, and Silvio Savarese. "What are they doing?: Collective activity classification using spatio-temporal relationship among people." (ICCV Workshops, 2009)
[3] - Ryoo, M. S., and J. K. Aggarwal. "UT-Interaction Dataset, ICPR contest on Semantic Description of Human Activities (SDHA)." (2010)
[4] - Ionescu, Catalin, et al. "Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments." (PAMI, 2013)
Datasets: (Kong et al., 2012) [1], (Choi et al., 2009) [2], (Ryoo et al., 2010) [3], (Ionescu et al., 2013) [4]
State of the Art
• Single Human Action Recognition, Video Sequences
– (Junejo et al., 2011): Self Similarity Matrices [1]
– (Weinland et al., 2010): 3D HOG Descriptors [2]
– (Kantorov et al., 2014): Local features representations [3]
– (Raptis et al., 2012): Mid-level representations [4]
• Single Human Action Recognition, MOCAP
– (Lv and Nevatia, 2007): Action Net [5]
– (Lv and Nevatia, 2006): HMM + AdaBoost [6]
– (Gong et al., 2014): Kernelized Temporal Cut + Dynamic Manifold Warping [7]
– (Zhang and Fan, 2011): Joint gait-pose manifold [8]
– (Ntouskos et al., 2013): Discriminative sequence back-constrained GP-LVM [9]
• Collaborative Action Recognition, Video Sequences
– (Choi et al., 2014): Crowd Context Descriptor [10]
– (Kong et al., 2014): Interactive Phrases [11]
• Collaborative Action Recognition, MOCAP
– ????
[1] - Junejo, I., Dexter, E., Laptev, I., and Perez, P. (2011). View-independent action recognition from temporal self-similarities. (PAMI, 2011)
[2] - Weinland, D., Özuysal, M., and Fua, P. (2010). Making action recognition robust to occlusions and viewpoint changes. (ECCV, 2010)
[3] - Kantorov, Vadim, and Ivan Laptev. "Efficient feature extraction, encoding and classification for action recognition." (CVPR, 2014)
[4] - Raptis, Michalis, Iasonas Kokkinos, and Stefano Soatto. "Discovering discriminative action parts from mid-level video representations." (CVPR, 2012)
[5] - Lv, F. and Nevatia, R. (2007). Single view human action recognition using key pose matching and viterbi path searching. (CVPR, 2007)
[6] - Lv, F. and Nevatia, R. (2006). Recognition and segmentation of 3-d human action using hmm and multi-class adaboost. (ECCV, 2006)
[7] - Gong, D., Medioni, G., and Zhao, X. (2014). Structured time series analysis for human action segmentation and recognition. (PAMI, 2014)
[8] - Zhang, X. and Fan, G. (2011). Joint gait-pose manifold for video-based human motion estimation. (CVPR Workshops, 2011)
[9] - Ntouskos, V., Papadakis, P., Pirri, F., et al. (2013). Discriminative sequence back-constrained GP-LVM for mocap based action recognition. (ICPRAM, 2013)
[10] - Choi, Wongun, and Silvio Savarese. "Understanding Collective Activities of People from Videos." (PAMI, 2013)
[11] - Yu Kong; Yunde Jia; Yun Fu, "Interactive Phrases: Semantic Descriptions for Human Interaction Recognition," (PAMI, 2014)
Stage of our Research
• Action recognition of a single subject: Temporal Segmentation + Recognition
– Structure of the sequence: structured time series
– Representation: spatio-temporal manifolds
– Temporal segmentation: searching for temporal cuts
– Recognition: Dynamic Manifold Warping
Gong, D., Medioni, G., and Zhao, X. (2014). Structured time series analysis for human action segmentation and recognition. (PAMI, 2014)
MOCAP Action Sequence Representation
[Figure: Frame 1, Frame 2, Frame 3]
• Human motion sequence = multivariate structured time series
• $X_{1:L_x} = [x_1, \dots, x_{L_x}] \in \mathbb{R}^{D \times L_x}$
• In 3D MOCAP each frame $x_t$ is a column vector $x_t = [p_1, \dots, p_M]^T$, where $p_i$ is the xyz position of the i-th marker
• M is the number of markers (joints), so each frame is a high-dimensional point with $D = 3 \times M$
• Yi Li; Fermuller, C.; Aloimonos, Y.; Hui Ji, "Learning shift-invariant sparse representation of actions," (CVPR, 2010)
• Jiajia Luo; Wei Wang; Hairong Qi, "Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps,"
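As a rough illustration of the representation on this slide, the sketch below stacks a MOCAP clip into the D × L matrix described above, with one frame per column and D = 3 × M. The function name `frames_to_matrix`, the (L, M, 3) input layout, and the random toy clip are assumptions made for the example and do not come from the slides.

```python
import numpy as np

def frames_to_matrix(markers: np.ndarray) -> np.ndarray:
    """Stack a MOCAP clip of shape (L, M, 3) into X of shape (D, L), D = 3 * M."""
    if markers.ndim != 3 or markers.shape[2] != 3:
        raise ValueError("expected an array of shape (L, M, 3)")
    n_frames, n_markers, _ = markers.shape
    # Each frame becomes one column x_t = [p_1, ..., p_M]^T of length 3 * M.
    return markers.reshape(n_frames, 3 * n_markers).T

# Toy clip (illustrative data only): 4 frames, 15 markers, random positions.
rng = np.random.default_rng(0)
clip = rng.normal(size=(4, 15, 3))
X = frames_to_matrix(clip)
```

With this layout, rows 3i to 3i+2 of column t hold the xyz position of marker i at frame t (counting markers from zero).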
Single Person Action Recognition Approach
• KTC-H = Kernelized Temporal Cut, Hierarchical;
• KTC-R = Kernelized Temporal Cut, Real Time;
• KAC = Kernelized Alignment Cut.
• PROs
– Few labelled examples are needed for each action category in the training stage
– Intra/inter-person variations are handled by the spatio-temporal alignment algorithm
– View invariance thanks to the spatio-temporal manifold and the spatio-temporal alignment
– Good performance in segmentation and recognition
[Figure: Dynamic Manifold Warping; action units]
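To give a feel for what searching for a temporal cut with a kernel can look like, here is a generic sliding-window sketch: it compares the frames just before and just after each candidate position with a kernel two-sample statistic (maximum mean discrepancy) and picks the position where they differ most. This is not the Kernelized Temporal Cut algorithm of Gong et al.; the function names, the Gaussian kernel, the window size, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(A, B, gamma):
    """Biased estimate of the squared maximum mean discrepancy."""
    return (rbf_kernel(A, A, gamma).mean()
            + rbf_kernel(B, B, gamma).mean()
            - 2.0 * rbf_kernel(A, B, gamma).mean())

def best_cut(X, window, gamma):
    """Return the frame index where the windows before and after differ most.

    X has shape (D, L), one frame per column.
    """
    frames = X.T
    scores = {t: mmd2(frames[t - window:t], frames[t:t + window], gamma)
              for t in range(window, frames.shape[0] - window + 1)}
    return max(scores, key=scores.get)

# Toy sequence (illustrative only): 30 frames near 0, then 30 frames near 5.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(0.0, 0.1, size=(6, 30)),
               rng.normal(5.0, 0.1, size=(6, 30))])
cut = best_cut(X, window=10, gamma=0.5)
```

On the toy sequence the selected index should fall at or next to the true change at frame 30. A real segmentation method would also need a decision threshold and a way to handle several cuts, which this sketch leaves out.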
Single Human Action Recognition - Results
• Recognition accuracy on 35 sequences performed by 5 different subjects: 86%
• Problems to be solved
– Instance-based learning: no explicit generalization
– To recognize an action, the action itself must already be in the dataset considered (no prediction)
[Figure: temporal segmentation; temporal alignment]
Research problem
• Our project aims at solving a two-fold problem: Collaborative Actions Modelling + Collaborative Actions Recognition.
• We are currently working on:
1. Representation of MOCAP sequences;
2. Public database construction;
3. Temporal segmentation and alignment of sequences, both MOCAP to MOCAP and MOCAP to RGBD videos;
4. Modelling the metric space for the motion distance functions, so as to introduce a learning algorithm (as in some frameworks based on hierarchical approaches [7], [8], [9]). Several works compute distance measures by means of canonical correlation analysis [1], [2], [3], dynamic time warping [4], [5], or canonical time warping [6].
[1] - Bach, Francis R., and Michael I. Jordan. "Kernel independent component analysis." (JMLR, 2003)
[2] - Kim, Tae-Kyun, and Roberto Cipolla. "Canonical correlation analysis of video volume tensors for action categorization and detection." (PAMI, 2009)
[3] - Loy, Chen Change, Tao Xiang, and Shaogang Gong. "Multi-camera activity correlation analysis." (CVPR, 2009)
[4] - Ukrainitz, Yaron, and Michal Irani. "Aligning sequences and actions by maximizing space-time correlations." (ECCV, 2006)
[5] - Carceroni, Rodrigo L., et al. "Linear sequence-to-sequence alignment." (CVPR, 2004)
[6] - Zhou, Feng, and Fernando De la Torre. "Canonical time warping for alignment of human behavior." (NIPS, 2009)
[7] - Fei-Fei, Li, Robert Fergus, and Pietro Perona. "One-shot learning of object categories." (PAMI, 2006)
[8] - Salakhutdinov, Ruslan, Joshua B. Tenenbaum, and Antonio Torralba. "Learning with hierarchical-deep models." (PAMI, 2013)
[9] - Rodriguez, Abel, David B. Dunson, and Alan E. Gelfand. "The nested Dirichlet process." (Journal of the American Statistical Association, 2008)
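Of the distance measures listed in item 4, dynamic time warping is the simplest to write down. The sketch below is the textbook dynamic-programming version applied to two sequences stored as D × L matrices, and is meant only as an illustration of a motion distance that tolerates differences in speed. The function name `dtw_distance`, the Euclidean frame cost, and the toy sequences are assumptions; it is not the alignment method used in the project.

```python
import numpy as np

def dtw_distance(X, Y):
    """Dynamic time warping cost between X (D, Lx) and Y (D, Ly)."""
    Lx, Ly = X.shape[1], Y.shape[1]
    # Pairwise Euclidean distances between frames of the two sequences.
    cost = np.linalg.norm(X[:, :, None] - Y[:, None, :], axis=0)
    # acc[i, j] = cheapest cost of aligning the first i frames of X
    # with the first j frames of Y.
    acc = np.full((Lx + 1, Ly + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Lx + 1):
        for j in range(1, Ly + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[Lx, Ly]

# Toy example (illustrative only): the same 1-D motion played at two speeds.
slow = np.array([[0.0, 0.0, 1.0, 1.0, 2.0, 2.0]])
fast = np.array([[0.0, 1.0, 2.0]])
d_same = dtw_distance(slow, fast)
d_diff = dtw_distance(slow, fast + 3.0)
```

Because the warping path may repeat frames, the two toy sequences align with zero cost even though their lengths differ, while the shifted copy yields a strictly positive cost.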
Recognition of activities performed by two collaborating subjects
• Recognition of the interleaving process between two action sequences: spatio-temporal alignment of the two sequences + identifying when they intertwine (collaboration)
• Collaborative actions, e.g. maintenance and repair operations in working environments
• Need for databases of MOCAP collaborative activities
• Nodal Points
• Human-Robot Interactions
Thanks for your attention