Collaborative Activities Understanding

from 3D Data

ALCOR LAB, DEPARTMENT OF COMPUTER, CONTROL AND

MANAGEMENT ENGINEERING “ANTONIO RUBERTI”

Fabrizio Natola, Valsamis Ntouskos, Prof. Fiora Pirri

DOCTORAL CONSORTIUM

F. Natola, V. Ntouskos, F. Pirri Collaborative Activities Understanding

from 3D Data

Introducing myself

• Name: Fabrizio Natola

• Bachelor’s Degree in Computer Engineering at Sapienza, University of Rome.

– Title of the thesis: Gestione Remota di un Database Multimediale con Applicativo Mobile su Piattaforma Android (Remote Management of a Multimedia Database by means of a Mobile Application on Android Platform) (Supervisor: Prof. Gianni Orlandi)

• Master’s Degree in Artificial Intelligence and Robotics at Sapienza, University of Rome.

– Title of the thesis: Activity Understanding from 3D Data (Supervisor: Prof. Fiora Pirri)

• Research Statement: Human Action Recognition



Introduction

• Goal: Recognizing Collaborative Human Activities from 3D Data;

Problems

Collaboration

General

• Occlusions

• Background Variations

• Scale

• Inter/Intra person variations

• Recognition of only simple actions

Atomic Actions

+

Interactions

[1] - Kong, Yu, Yunde Jia, and Yun Fu. "Learning human interaction by interactive phrases." (ECCV, 2012)

[2] - Choi, Wongun, Khuram Shahid, and Silvio Savarese. "What are they doing?: Collective activity classification using spatio-temporal relationship among people." (ICCV Workshops, 2009)

[3] - Ryoo, M. S., and J. K. Aggarwal. "UT-Interaction Dataset, ICPR contest on Semantic Description of Human Activities (SDHA)." (2010).

[4] - Ionescu, Catalin, et al. "Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments." (PAMI, 2013)

Datasets

(Kong et al., 2012) [1],

(Choi et al., 2009) [2],

(Ryoo et al., 2010) [3],

(Ionescu et al., 2013) [4]



State of the Art

Single Human Action Recognition

Video Sequences

• (Junejo et al., 2011) – Self-Similarity Matrices [1]

• (Weinland et al., 2010) – 3D HOG Descriptors [2]

• (Kantorov et al., 2014) – Local feature representations [3]

• (Raptis et al., 2012) – Mid-level representations [4]

MOCAP

• (Lv and Nevatia, 2007) – Action Net [5]

• (Lv and Nevatia, 2006) – HMM + AdaBoost [6]

• (Gong et al., 2014) – Kernelized Temporal Cut + Dynamic Manifold Warping [7]

• (Zhang and Fan, 2011) – Joint gait-pose manifold [8]

• (Ntouskos et al., 2013) – Discriminative sequence back-constrained GP-LVM [9]

Collaborative Action Recognition

Video Sequences

• (Choi et al., 2014) – Crowd Context Descriptor [10]

• (Kong et al., 2014) – Interactive Phrases [11]

MOCAP

• ????

[1] - Junejo, I., Dexter, E., Laptev, I., and Perez, P. (2011). View-independent action recognition from temporal self-similarities. (PAMI, 2011)

[2] - Weinland, D., Özuysal, M., and Fua, P. (2010). Making action recognition robust to occlusions and viewpoint changes. (ECCV, 2010)

[3] - Kantorov, Vadim, and Ivan Laptev. "Efficient feature extraction, encoding and classification for action recognition." (CVPR, 2014)

[4] - Raptis, Michalis, Iasonas Kokkinos, and Stefano Soatto. "Discovering discriminative action parts from mid-level video representations." (CVPR, 2012)

[5] - Lv, F. and Nevatia, R. (2007). Single view human action recognition using key pose matching and viterbi path searching. (CVPR, 2007)

[6] - Lv, F. and Nevatia, R. (2006). Recognition and segmentation of 3-d human action using hmm and multi-class adaboost. (ECCV, 2006)

[7] - Gong, D., Medioni, G., and Zhao, X. (2014). Structured time series analysis for human action segmentation and recognition. (PAMI, 2014)

[8] - Zhang, X. and Fan, G. (2011). Joint gait-pose manifold for video-based human motion estimation. (CVPR Workshops, 2011)

[9] - Ntouskos, V., Papadakis, P., Pirri, F., et al. (2013). Discriminative sequence back-constrained GP-LVM for mocap based action recognition. (ICPRAM, 2013)

[10] - Choi, Wongun, and Silvio Savarese. "Understanding Collective Activities of People from Videos." (PAMI, 2013)

[11] - Kong, Yu, Yunde Jia, and Yun Fu. "Interactive Phrases: Semantic Descriptions for Human Interaction Recognition." (PAMI, 2014)



Stage of our Research

• Action Recognition of a single subject: Temporal Segmentation + Recognition

Structure of the Sequence

Representation

Temporal Segmentation

Structured Time Series

Spatio-Temporal Manifolds

Searching for temporal cuts

Gong, D., Medioni, G., and Zhao, X. (2014). Structured time series analysis for human action segmentation and recognition. (PAMI, 2014)

Recognition

Dynamic Manifold Warping



MOCAP Action Sequence Representation

Figure: three frames of a MOCAP sequence (Frame 1, Frame 2, Frame 3)

• Human motion sequence = multivariate structured time series

• $X_{1:L_x} = [x_1 \dots x_{L_x}] \in \mathbb{R}^{D \times L_x}$

• In 3D MOCAP each frame $x_t$ is a column vector $x_t = [p_1, \dots, p_M]^T$, where $p_i$ is the xyz position of the i-th marker

• M is the number of joints, so each frame is considered as a high-dimensional point with $D = 3 \times M$

• Yi Li; Fermuller, C.; Aloimonos, Y.; Hui Ji, "Learning shift-invariant sparse representation of actions," (CVPR, 2010)

• Jiajia Luo; Wei Wang; Hairong Qi, "Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps,"
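The representation above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `frames_to_matrix` and the input layout (L frames, M markers, xyz coordinates) are hypothetical and not taken from the slides.

```python
import numpy as np

def frames_to_matrix(frames):
    """Stack per-frame marker positions into a D x L matrix.

    frames: array-like of shape (L, M, 3), i.e. L frames, M markers, xyz each
            (assumed layout, chosen for this sketch).
    Returns X of shape (3*M, L), where column t is x_t = [p_1, ..., p_M]^T.
    """
    frames = np.asarray(frames, dtype=float)
    if frames.ndim != 3 or frames.shape[2] != 3:
        raise ValueError("expected shape (L, M, 3)")
    L, M, _ = frames.shape
    # Flatten each frame to (x1, y1, z1, ..., xM, yM, zM), then put frames in columns.
    return frames.reshape(L, 3 * M).T

# Toy example: 4 frames of a hypothetical 2-marker skeleton.
toy = np.arange(4 * 2 * 3).reshape(4, 2, 3)
X = frames_to_matrix(toy)
print(X.shape)  # (6, 4)
```

Each column of `X` is one frame, so the number of rows is D = 3 × M, as in the definition above.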



Single Person Action Recognition Approach

• KTC-H = Kernelized Temporal Cut – Hierarchical;

• KTC-R = Kernelized Temporal Cut – Real Time;

• KAC = Kernelized Alignment Cut.

• PROs

– Few labelled examples are needed for each action category in the training stage

– Intra/inter-person variations are handled thanks to the spatio-temporal alignment algorithm

– View invariance thanks to the Spatio-Temporal Manifold and spatio-temporal alignment

– Good performance in segmentation and recognition

Dynamic Manifold Warping

ACTION UNITS
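To give a feel for what searching for a temporal cut with a kernel can look like, here is a minimal sketch. It is not the KTC algorithm of Gong et al.; it swaps in a generic sliding-window two-sample test (squared maximum mean discrepancy with a Gaussian kernel). The names `rbf_kernel`, `mmd2`, `best_cut` and the parameters `window` and `gamma` are assumptions made for this example.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A * A).sum(1)[:, None] + (B * B).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(A, B, gamma):
    """Biased estimate of the squared maximum mean discrepancy between two frame sets."""
    return (rbf_kernel(A, A, gamma).mean()
            + rbf_kernel(B, B, gamma).mean()
            - 2.0 * rbf_kernel(A, B, gamma).mean())

def best_cut(X, window, gamma=1.0):
    """Return the frame index t whose left window X[:, t-window:t] differs most
    from its right window X[:, t:t+window]. X is D x L, frames as columns."""
    frames = X.T  # one row per frame
    L = frames.shape[0]
    scores = {t: mmd2(frames[t - window:t], frames[t:t + window], gamma)
              for t in range(window, L - window + 1)}
    return max(scores, key=scores.get)

# Toy sequence: 20 frames near 0 followed by 20 frames near 5 (D = 3, L = 40).
t = np.arange(40)
signal = np.where(t < 20, 0.0, 5.0) + 0.01 * t
X_toy = np.vstack([signal, signal, signal])
print(best_cut(X_toy, window=8))  # 20
```

A single fixed-size window pair finds only one cut; a hierarchical or streaming variant would have to apply such a test repeatedly, which this sketch does not attempt.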



Single Human Action Recognition - Results

• Recognition accuracy on 35 sequences performed by 5 different subjects: 86%

• Problems to be solved

– Instance-based learning: no explicit generalization

– For an action to be recognized, the action itself must be in the dataset considered (no prediction)

TEMPORAL SEGMENTATION TEMPORAL ALIGNMENT
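The limitation of instance-based learning noted above can be illustrated with a nearest-neighbour sketch. Everything here is hypothetical: the labels, the toy data, and the placeholder `frobenius` distance, which stands in for whatever alignment-based distance is actually used.

```python
import numpy as np

def nearest_neighbor_label(query, examples, distance):
    """examples: list of (sequence, label) pairs.
    Returns the label of the stored sequence closest to the query."""
    if not examples:
        raise ValueError("instance-based recognition needs at least one stored example")
    _, best_label = min(examples, key=lambda ex: distance(query, ex[0]))
    return best_label

def frobenius(X, Y):
    # Placeholder distance for equal-length sequences (assumption for this sketch).
    return float(np.linalg.norm(X - Y))

# Two stored D x L examples with made-up labels.
examples = [(np.zeros((6, 5)), "stand"), (np.ones((6, 5)), "walk")]

query = np.full((6, 5), 0.9)
print(nearest_neighbor_label(query, examples, frobenius))   # walk

# A sequence unlike anything stored is still forced onto a stored label.
unseen = np.full((6, 5), 10.0)
print(nearest_neighbor_label(unseen, examples, frobenius))  # walk
```

No model is learned: the classifier can only return labels that already appear among the stored examples.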



Research problem

• Our project aims at solving a two-fold problem: Collaborative Actions Modelling + Collaborative Actions Recognition.

• We are currently working on:

1. Representation of MOCAP sequences;

2. Public database construction;

3. Temporal segmentation and alignment of sequences, both MOCAP-to-MOCAP and MOCAP-to-RGBD videos;

4. Modelling the metric space for the motion distance functions (several works compute distance measures by means of canonical correlation analysis [1], [2], [3]; dynamic time warping [4], [5]; canonical time warping [6]) so as to introduce a learning algorithm (as in some frameworks based on hierarchical approaches [7], [8], [9]).

[1] - Bach, Francis R., and Michael I. Jordan. "Kernel independent component analysis." (JMLR, 2003)

[2] - Kim, Tae-Kyun, and Roberto Cipolla. "Canonical correlation analysis of video volume tensors for action categorization and detection." (PAMI, 2009)

[3] - Loy, Chen Change, Tao Xiang, and Shaogang Gong. "Multi-camera activity correlation analysis." (CVPR, 2009)

[4] - Ukrainitz, Yaron, and Michal Irani. "Aligning sequences and actions by maximizing space-time correlations." (ECCV, 2006)

[5] - Carceroni, Rodrigo L., et al. "Linear sequence-to-sequence alignment." (CVPR, 2004)

[6] - Zhou, Feng, and Fernando De la Torre. "Canonical time warping for alignment of human behavior." (NIPS, 2009)

[7] - Fei-Fei, Li, Robert Fergus, and Pietro Perona. "One-shot learning of object categories." (PAMI, 2006)

[8] - Salakhutdinov, Ruslan, Joshua B. Tenenbaum, and Antonio Torralba. "Learning with hierarchical-deep models." (PAMI, 2013)

[9] - Rodriguez, Abel, David B. Dunson, and Alan E. Gelfand. "The nested Dirichlet process." (Journal of the American Statistical Association , 2008)
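As a point of reference for the motion distance functions mentioned in item 4, a textbook dynamic time warping cost between two sequences can be sketched as follows. This is the standard dynamic-programming formulation with a Euclidean frame distance, not any of the cited methods; `dtw_distance` is a name chosen for this example.

```python
import numpy as np

def dtw_distance(X, Y):
    """Dynamic time warping cost between X (D x Lx) and Y (D x Ly), frames as columns."""
    Lx, Ly = X.shape[1], Y.shape[1]
    # cost[i, j] = best cumulative cost aligning the first i frames of X
    # with the first j frames of Y.
    cost = np.full((Lx + 1, Ly + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Lx + 1):
        for j in range(1, Ly + 1):
            d = np.linalg.norm(X[:, i - 1] - Y[:, j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[Lx, Ly])

a = np.array([[0.0, 1.0, 2.0, 3.0]])
b = np.array([[0.0, 0.0, 1.0, 2.0, 2.0, 3.0]])  # same motion, executed more slowly
c = np.array([[3.0, 2.0, 1.0, 0.0]])            # reversed motion
print(dtw_distance(a, b))  # 0.0
print(dtw_distance(a, c) > 0.0)  # True
```

The zero cost between `a` and `b` shows why warping is useful for comparing executions at different speeds. Note that DTW cost does not satisfy the triangle inequality in general, which is one reason the structure of the resulting metric space needs care.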



Recognition of activities performed by two collaborating subjects

• Recognition of the interleaving process between two action sequences: spatio-temporal alignment of the two sequences + identifying when they intertwine (collaboration)

• Collaborative actions, e.g. maintenance and repair operations in a working environment

• Necessity of databases for MOCAP collaborative activities

• Nodal Points

• Human-Robot Interactions
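One very simple way to propose candidate intervals where two subjects might be interacting is to threshold their closest marker-to-marker distance over time. This is a hypothetical heuristic written for illustration only, not the method proposed in these slides; the function names, the `(L, M, 3)` layout and the threshold are all assumptions, and the two sequences are assumed to be already time-aligned.

```python
import numpy as np

def min_inter_subject_distance(A, B):
    """A, B: arrays of shape (L, M, 3) and (L, N, 3) for two time-aligned subjects.
    Returns, per frame, the smallest distance between any marker of A and any marker of B."""
    diff = A[:, :, None, :] - B[:, None, :, :]   # (L, M, N, 3)
    dist = np.linalg.norm(diff, axis=-1)         # (L, M, N)
    return dist.min(axis=(1, 2))

def close_intervals(d, threshold):
    """Return [start, end) frame intervals where d stays below threshold."""
    intervals, start = [], None
    for t, close in enumerate(d < threshold):
        if close and start is None:
            start = t
        elif not close and start is not None:
            intervals.append((start, t))
            start = None
    if start is not None:
        intervals.append((start, len(d)))
    return intervals

# Toy data: subject A stays at the origin, subject B approaches and then leaves.
A_toy = np.zeros((7, 1, 3))
B_toy = np.zeros((7, 1, 3))
B_toy[:, 0, 0] = [3.0, 2.0, 1.0, 0.2, 0.2, 1.0, 2.0]
d = min_inter_subject_distance(A_toy, B_toy)
print(close_intervals(d, threshold=0.5))  # [(3, 5)]
```

Proximity alone says nothing about whether the subjects are actually collaborating; it only narrows down where to look.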



Thanks for your attention
