
Page 1:

Mutually Reinforcing Motion-Pose Framework for Pose Invariant Action Recognition

22nd November 2016, Tuesday

Manoj Ramanathan

Research Engineer, IMI

IMI Research Seminar

Page 2:

Contents

• Introduction
• Literature Review
  – Motion
  – Pose
  – Motion + Pose
• Proposed Framework
  – Propagate Motion Forward (PMF) Path
  – Canonical Pose Feedback (CPF) Path
• Experimental Results & Discussion
• Conclusion

Page 3:

Introduction

• For several applications, it is necessary for a device to understand its environment and the humans in it.
• Recognition of human actions is essential.
• RGB-camera-based action recognition is not easy.

Page 4:

Introduction & Motivation

Motivating challenges & factors:
• Occlusion
• Background clutter
• View invariance
• Execution rate
• Anthropometric variations
• Moving cameras
• Generalizability
• Action localization

Page 5:

Introduction & Motivation

• Objectives:
  – RGB-camera action recognition that can handle the following challenges:
    • View angle changes
    • Occlusion
    • Pose variations
    • Background clutter
  – Generalized to handle actions performed in non-upright human postures.

Page 6:

Literature Review

Page 7:

Motion Based Approaches

• Motion History Images & Motion Energy Images [1] – indicate the presence and recency of motion
• Trajectories [2]
• Optical Flow [3]
• Kinematic Features [4,5]

[1] A. F. Bobick and J. W. Davis, "The recognition of human movement using temporal templates," IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(3): 257-267, March 2001.
[2] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu, "Dense trajectories and motion boundary descriptors for action recognition," Intl. Journal on Computer Vision, vol. 103, pp. 60-79, May 2013.
[3] L. Liu, L. Shao, and P. Rockett, "Boosted key-frame selection and correlated pyramidal motion-feature representation for human action recognition," Pattern Recognition 46, Elsevier, pp. 1810-1818, July 2013.
[4] M. Jain, H. Jégou, and P. Bouthemy, "Better exploiting motion for better action recognition," in IEEE Conf. on Computer Vision and Pattern Recognition, June 2013.
[5] S. Ali and M. Shah, "Human action recognition in videos using kinematic features and multiple instance learning," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 32, pp. 288-303, February 2010.
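As a quick illustration of the temporal-template idea in [1], here is a minimal sketch of the MHI/MEI update (the standard formulation from the cited paper; the source of the per-frame motion mask is assumed):

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau=30.0):
    # Temporal template of Bobick & Davis [1]: pixels where motion is
    # detected are set to tau; elsewhere the history decays by 1 toward 0,
    # so brighter pixels indicate more recent motion.
    return np.where(motion_mask, tau, np.maximum(mhi - 1.0, 0.0))

# The Motion Energy Image is the binary union of where motion occurred:
# mei = mhi > 0
```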

Page 8:

Pose Based Approaches

• Shape [1], Contours [2]
• Extraction and representation of key poses [5]
• Silhouette [4]
• Poselets [3] – body part detectors in 3D appearance space

[1] H. Zhang and L. E. Parker, "4-dimensional local spatio-temporal features for human activity recognition," in IEEE Intl. Conf. on Intelligent Robots and Systems, pp. 2044-2049, September 2011.
[2] S. Cheema, A. Eweiwi, C. Thurau, and C. Bauckhage, "Action recognition by learning discriminative key poses," in IEEE Intl. Conf. on Computer Vision Workshops, pp. 1302-1309, November 2011.
[3] M. Raptis and L. Sigal, "Poselet key-framing: A model for human activity recognition," in IEEE Conf. on Computer Vision and Pattern Recognition, pp. 2650-2657, October 2013.
[4] F. Lv and R. Nevatia, "Single view human action recognition using key pose matching and Viterbi path searching," in IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1-8, June 2007.
[5] N. Ikizler-Cinbis and S. Sclaroff, "Web-based classifiers for human action recognition," IEEE Trans. on Multimedia, vol. 14, pp. 1031-1045, August 2012.


Page 9:

Motion + Pose Based Approaches

• Shape-Motion Prototypes [1]
• Motionlets [2] – mid-level spatio-temporal parts that form tight clusters in motion and appearance space, corresponding to individual body part movements

[1] Z. Jiang, Z. Lin, and L. S. Davis, "Recognizing human actions by learning and matching shape-motion prototype trees," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 34, pp. 533-547, March 2012.
[2] L. Wang, Y. Qiao, and X. Tang, "Motionlets: mid-level 3D parts for human motion recognition," in IEEE Conf. on Computer Vision and Pattern Recognition, June 2013.


Page 10:

Proposed Pose Invariant Action Recognition Framework

Consists of two components, namely a motion component and a pose component, in a mutually reinforcing framework.

Page 11:

Framework for Action Recognition

• Actions are manifested as movements of body parts.
• Detecting body parts and analyzing their motion provides a good framework.
• Mutually assistive components improve each other's performance.
• The motion of each body part is represented with respect to the body (for pose-invariance).

Page 12:

[Framework block diagram]

Propagate Motion Forward (PMF) Path:
Input Video → Preprocessing (foreground detection) → Propagation Mechanism (grid division) → Human-body-centric space conversion → Kinematic features (Div, Curl, Proj, Rot) → ELM Classifier (with action model from training videos) → Recognized Action

Canonical Pose Feedback (CPF) Path:
Canonical Pose Hypothesis (identify the pose in the frame, using canonical sticks from training videos) → Temporal Stick Features (implicitly capture the dynamics of pose evolution) → Realign grids based on head size → Kinematic features (Div, Curl, Proj, Rot, BodyProj, BodyRot) → back into the ELM Classifier

Page 13:

Propagate Motion Forward Path

Parameters assumed to be available or estimated:
• Foreground
• Neck point
• Major viewing direction

Page 14:

Propagate Motion Forward Path

• Propagation Mechanism (grid division): requires neck point and foreground
• Human-body-centric space conversion: requires viewing direction
• Kinematic features (Div, Curl, Proj, Rot): require neck point

[Figure: divide into grids based on body proportion, with region extents labelled 2x, 5x, and 6x]
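To make the grid-division step concrete, here is a hypothetical sketch; the six-grid layout, the exact proportions, and the helper's signature are illustrative assumptions, not the slide's specification:

```python
import numpy as np

def body_grids(foreground_mask, neck, head_size):
    # Hypothetical grid division: split the foreground bounding box into
    # grids whose extents are proportional to the head size (the slide's
    # 2x/5x/6x labels suggest head-relative proportions; values assumed).
    ys, xs = np.nonzero(foreground_mask)
    top, bottom = int(ys.min()), int(ys.max())
    left, right = int(xs.min()), int(xs.max())
    nx, ny = neck
    # Row bands: above the neck (head), torso, legs -- proportions assumed.
    row_edges = [top, int(ny), min(int(ny + 2.5 * head_size), bottom), bottom]
    col_edges = [left, int(nx), right]  # split left/right of the body axis
    grids = [(r0, r1, c0, c1)
             for r0, r1 in zip(row_edges[:-1], row_edges[1:])
             for c0, c1 in zip(col_edges[:-1], col_edges[1:])]
    return grids  # six (row_start, row_end, col_start, col_end) rectangles
```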


Page 15:

Propagate Motion Forward Path

• Optical Flow [1] is used in the framework.
• Kinematic features [2] are extracted from the optical flow to represent and characterize actions:
  – Divergence
  – Vorticity (Curl)
  – Projection
  – Rotation

[1] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, "High accuracy optical flow estimation based on a theory for warping," ECCV, vol. 3024, pp. 25-36, 2004.
[2] S. Shojaeilangari, W. Y. Yau, K. Nandakumar, J. Li, and E. K. Teoh, "Robust representation and recognition of facial emotions using extreme sparse learning," IEEE Trans. on Image Processing, vol. 24, no. 7, pp. 2140-2152, March 2015.
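A minimal sketch of computing such kinematic features from a dense flow field; divergence and curl follow the standard definitions, while the radial/tangential decomposition about the neck point for Proj and Rot is an assumption about the slide's intent:

```python
import numpy as np

def kinematic_features(u, v, neck):
    # u, v: optical-flow components per pixel, each of shape (H, W)
    du_dy, du_dx = np.gradient(u)          # np.gradient returns axis 0 (y) first
    dv_dy, dv_dx = np.gradient(v)
    divergence = du_dx + dv_dy             # local expansion / contraction
    vorticity = dv_dx - du_dy              # local rotation (curl)
    # Unit vectors pointing from the neck point to each pixel.
    H, W = u.shape
    ys, xs = np.mgrid[0:H, 0:W]
    rx, ry = xs - neck[0], ys - neck[1]
    norm = np.hypot(rx, ry) + 1e-8
    rx, ry = rx / norm, ry / norm
    projection = u * rx + v * ry           # flow toward / away from the neck
    rotation = -u * ry + v * rx            # flow circulating around the neck
    return divergence, vorticity, projection, rotation
```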


Page 16:

Propagate Motion Forward Path

• Weighted & unweighted histograms of the motion features were used for pose-invariant emotion recognition [1].
• That work assumed a frontal face and dealt only with 2D motion.
• For action recognition:
  – The method should handle 3D motion.
  – The human performing the action need not be frontal.

[1] S. Shojaeilangari, W. Y. Yau, K. Nandakumar, J. Li, and E. K. Teoh, "Robust representation and recognition of facial emotions using extreme sparse learning," IEEE Trans. on Image Processing, vol. 24, no. 7, pp. 2140-2152, March 2015.


Page 17:

Propagate Motion Forward Path

[Figure: human-body-centric space with axes Up, Left, and Front; grids 1-6 shown around the body]

• Convert to the human-body-centric space.
• Encode the grids based on the view.

Page 18:

Canonical Pose Feedback Path

• Initially, the head size is assumed in order to divide the frame into grids.
• The initial motion features are used to recognize an initial action.

Page 19:

Canonical Pose Feedback Path

• Canonical Pose Hypothesis: identify the pose in the frame, using the initial action and a body part detector
• Canonical sticks from training videos: offline training, done only once
• Temporal Stick Features: implicitly capture the dynamics of pose evolution
• Realign grids based on head size
• Kinematic features: Div, Curl, Proj, Rot, BodyProj, BodyRot

Page 20:

Canonical Pose Feedback Path

Canonical Stick Extraction (from available training videos):

1) Crop the foreground region in each frame
2) Convert to grayscale
3) Resize to a fixed dimension
4) Collect all resized images across all videos
5) Apply NNMF to this data and extract the top N (= 100) basis components (the NNMF analogue of eigenvectors / principal components)

• Manually mark the sticks of each basis component.
• Use the neck point and head size to obtain a normalized stick representation that can be compared with test frames.
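A sketch of steps 1-5 using scikit-learn's NMF; the OpenCV preprocessing calls and the masks argument are assumptions about the pipeline, and the manual stick marking on the returned components is not shown:

```python
import cv2
import numpy as np
from sklearn.decomposition import NMF

def canonical_components(frames, masks, n_components=100, size=(64, 64)):
    # Steps 1-4: crop the foreground, convert to grayscale, resize, and
    # stack one flattened image per row of the data matrix.
    rows = []
    for frame, mask in zip(frames, masks):
        ys, xs = np.nonzero(mask)
        crop = frame[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
        rows.append(cv2.resize(gray, size).ravel().astype(np.float64))
    X = np.vstack(rows)                      # non-negative, as NNMF requires
    # Step 5: factorize and keep the top N basis components.
    model = NMF(n_components=n_components, init='nndsvd', max_iter=500)
    model.fit(X)
    return model.components_.reshape(n_components, size[1], size[0])
```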


Page 21:

Canonical Pose Feedback Path

[Figure: example canonical stick poses]
• Weizmann: Bend, Wave2, Run
• KTH: Boxing, Hand Clapping, Running
• UCF Sports: Golf Swing, Kicking, Skateboarding

Page 22:

Canonical Pose Feedback Path

1) Compare each canonical stick of the action with the image to identify the most likely pose.
2) A hypothesis for each canonical pose is computed as follows.

Canonical pose hypothesis:

PH_n = ∑_i ∑_j L(i,j)

where
L(i,j) = l(i,j), if d_j ≤ T_d and (Θ_j − Θ_s) ≤ T_Θ
       = 0, otherwise

• l(i,j) – likelihood score of body part i in segment j
• T_d – distance threshold
• T_Θ – orientation threshold
• i – index over body parts
• j – index over motion-consistent segments

Page 23:

Canonical Pose Feedback Path

Algorithm:
1. Start with likelihood score L_i = 0 for each part i in the stick pose.
2. Using the kinematic motion features, obtain an initial segmentation of the foreground region.
3. Pass each segment through the body part detector [1] to determine whether the segment is a body part i.
4. If segment m is detected as body part i, an associated likelihood score l_{i,m} is obtained.
5. If segment m satisfies the distance and orientation constraints, the likelihood score L_i for body part i in the stick pose is accumulated by l_{i,m}. (Distance constraints are imposed in normalized stick coordinates.)
6. Repeat steps 3-5 for every segment to obtain the final L_i for the canonical stick pose.
7. The pose hypothesis PH_n for canonical stick pose n is the sum of all L_i.
8. Repeat steps 1-7 for every canonical pose n and compute its pose hypothesis.
9. Choose the top 3 poses with the highest pose hypothesis and compute the mean pose.
10. Perform a pixel-wise segmentation into one of the body parts based on the distance from each body part's stick in the mean pose.
11. Compute the body orientation using the obtained torso region and neck point.
12. Compute the head size using the obtained head region and body orientation.
13. Repeat steps 5-12 if the computed head size and the initial approximate head size differ.

[1] M. Ramanathan, W.-Y. Yau, and E. K. Teoh, "Human body part detection using likelihood score computations," IEEE Symposium on Computational Intelligence in Biometrics and Identity Management (CIBIM), pp. 160-166, December 2014.
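A hedged sketch of the hypothesis computation (steps 1-8 above, matching the PH_n formula on the previous slide); the segment/stick dictionary fields and the part_likelihood callable stand in for the body part detector [1] and are assumptions:

```python
import numpy as np

def pose_hypothesis(segments, sticks, part_likelihood,
                    t_dist=0.5, t_orient=np.deg2rad(30.0)):
    # PH_n = sum over body parts i and segments j of L(i, j), where the
    # detector score l(i, j) is accumulated only if segment j lies close
    # enough to stick i (in normalized stick coordinates) and is
    # similarly oriented.
    ph = 0.0
    for i, stick in enumerate(sticks):          # body parts of pose n
        L_i = 0.0
        for seg in segments:                    # motion-consistent segments
            l_ij = part_likelihood(seg, i)      # likelihood score l(i, j)
            d = np.linalg.norm(np.subtract(seg['center'], stick['center']))
            dtheta = abs(seg['orientation'] - stick['orientation'])
            if d <= t_dist and dtheta <= t_orient:
                L_i += l_ij
        ph += L_i                               # step 7: sum all L_i
    return ph

# Steps 8-9 (sketch): evaluate pose_hypothesis for every canonical pose,
# then average the top-3 scoring poses to obtain the mean pose.
```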


Page 24:

Temporal Stick Features

• The input video is divided into T temporal segments; for each segment, the average stick pose and neck point are computed (T stick poses in total).
• The motion of each stick joint between consecutive segments is computed.
• Proj & Rot features are computed with respect to the neck point.
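A minimal sketch under the assumption that each stick pose is a (J, 2) array of joint coordinates; Proj/Rot are taken as the radial/tangential components of joint displacement about the neck point, mirroring the motion features:

```python
import numpy as np

def temporal_stick_features(stick_poses, neck_points):
    # stick_poses: (T, J, 2) average joint positions per temporal segment
    # neck_points: (T, 2) average neck point per temporal segment
    feats = []
    for t in range(len(stick_poses) - 1):
        motion = stick_poses[t + 1] - stick_poses[t]   # joint displacement
        radial = stick_poses[t] - neck_points[t]       # joints rel. to neck
        norm = np.linalg.norm(radial, axis=1, keepdims=True) + 1e-8
        r_hat = radial / norm
        t_hat = np.stack([-r_hat[:, 1], r_hat[:, 0]], axis=1)
        proj = np.sum(motion * r_hat, axis=1)   # toward / away from the neck
        rot = np.sum(motion * t_hat, axis=1)    # around the neck
        feats.append(np.concatenate([proj, rot]))
    return np.asarray(feats)                    # (T-1, 2J) feature matrix
```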


Page 25:

Canonical Pose Feedback Path

• The pose component helps the motion component:
  – Re-align the grids according to the canonical pose identified for each frame.
  – Compute body-part-referenced kinematic features using the pixel-wise segmentation.
  – The action is recognized based on the original motion features and the newly computed features.

• The framework forms a loop-like structure that can be repeated until the action recognition converges (see the sketch below).
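A structural sketch of that loop; classify, pmf_features, and cpf_refine are placeholder callables standing in for the framework's components, not the authors' API:

```python
def recognize_action(video, classify, pmf_features, cpf_refine, max_iters=3):
    # PMF pass: grid division + kinematic features -> initial action.
    feats = pmf_features(video)
    action = classify(feats)
    # CPF passes: identified canonical poses realign the grids and add
    # body-part-referenced features; repeat until recognition converges.
    for _ in range(max_iters):
        new_feats = cpf_refine(video, action)
        new_action = classify(new_feats)
        if new_action == action:
            break
        action = new_action
    return action
```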


Page 26:

NUAD Dataset

• Focus on non-upright actions (NUAD) instead of the usual set of upright actions
• 35 actors
• 8 actions: Bend, Squat, Push-up, Climber, Knee bending, Single-hand wave, Double-hand wave, Lying-down wave
• 3 views (Front, Left, Right)
• Ground-truth markings indicate all body parts and neck points in the frames

Page 27:

Experiments & Discussion

• Datasets:
  – Simple: KTH & Weizmann
  – Challenging: UCF Sports & Hollywood
  – Cross-dataset: MSR Action
  – Posture variation: NUAD
• Tolerance range for neck markings

Page 28:

Experiments

Weizmann Dataset
• 9 actors, 10 actions, simple background
• Leave-one-actor-out evaluation

Method                     | Performance (%)
Proposed (only PMF)        | 92.47
Proposed (PMF + CPF)       | 100
Shape-Motion Prototype [1] | 100
Kinematic Features [2]     | 95.75
MHI & MEI based [3]        | 93

KTH Dataset
• 25 actors, 6 actions, 4 scenarios
• Leave-one-out (LOO) and (16+9) test/validation protocols

Method                     | 16+9 (%) | LOO (%)
Proposed (only PMF)        | 87       | 90
Proposed (PMF + CPF)       | 90       | 93.32
Shape-Motion Prototype [1] |          | 95.77
Kinematic Features [2]     |          | 87.77
Motionlets [4]             |          | 93.3

[1] Z. Jiang, Z. Lin, and L. S. Davis, "Recognizing human actions by learning and matching shape-motion prototype trees," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 34, pp. 533-547, March 2012.
[2] S. Ali and M. Shah, "Human action recognition in videos using kinematic features and multiple instance learning," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 32, pp. 288-303, February 2010.
[3] Y. Lu, Y. Li, Y. Chen, F. Ding, X. Wang, J. Hu, and S. Ding, "A human action recognition method based on Tchebichef moment invariants and temporal templates," in Intl. Conf. on Intelligent Human-Machine Systems and Cybernetics, pp. 76-79, August 2012.
[4] L. Wang, Y. Qiao, and X. Tang, "Motionlets: mid-level 3D parts for human motion recognition," in IEEE Conf. on Computer Vision and Pattern Recognition, June 2013.

Page 29:

Experiments

UCF Sports Dataset
• 152 videos of 10 sports-based actions
• Dynamic backgrounds, view changes, camera motion, and pose variations
• Confusions: Skateboarding vs. Walk and Run vs. Kick

Method                          | Performance (%)
Proposed (only PMF)             | 76.4
Proposed (PMF + CPF)            | 87.4
Shape-Motion Prototype [1]      | 88
Grassmann kernel analysis [2]   | 96.6
Spatiotemporal orientations [3] | 81.7

Hollywood Dataset
• 8 actions (interactions included)
• Dynamic backgrounds, view changes, camera motion, and pose variations
• Occlusion handling: stick limits & grid based
• Test & train sets provided

Method                             | Performance (%)
Proposed (only PMF)                | 54.12
Proposed (PMF + CPF)               | 56.87
Hierarchical compound features [4] | 53.3
Laptev et al. [5]                  | 38.2

[1] Z. Jiang, Z. Lin, and L. S. Davis, "Recognizing human actions by learning and matching shape-motion prototype trees," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 34, pp. 533-547, March 2012.
[2] M. T. Harandi, C. Sanderson, S. Shirazi, and B. C. Lovell, "Kernel analysis on Grassmann manifolds for action recognition," Pattern Recognition Letters, vol. 34, pp. 1906-1913, November 2013.
[3] K. G. Derpanis, M. Sizintsev, K. J. Cannons, and R. P. Wildes, "Action spotting and recognition based on spatiotemporal orientation analysis," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 35, pp. 527-540, March 2013.
[4] A. Gilbert, J. Illingworth, and R. Bowden, "Action recognition using mined hierarchical compound features," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 33, pp. 883-897, May 2011.
[5] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," in IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1-8, June 2008.

Page 30:

Experiments

Testing pose effectiveness
• Reducing the available canonical stick poses for each action
• Adding sticks extracted from other datasets for certain actions

Dataset    | All canonical sticks | Fewer sticks
UCF Sports | 87.4%                | 83.4%
Hollywood  | 56.87%               | 51.89%

NUAD Dataset
• View-invariance only with mirror-image cases (around 92.5%)
• For frontal & side cases, the accuracy is much lower (around 58%), because the human-body-centric space created from 2D images is not the same across different views

Cross-dataset applicability of canonical sticks
• Tested using the MSR Action Dataset
• Only 3 actions, the same as in the KTH dataset
• Canonical sticks extracted from KTH were used

Method               | Performance (%)
Proposed (only PMF)  | 90.2
Proposed (PMF + CPF) | 92.46

Method               | Performance (%)
Proposed (only PMF)  | 90.1
Proposed (PMF + CPF) | 91.4

Page 31:

Experiments

[Figure: failure-case examples]
• Errors when the neck is not available or the action is not visible
• Erroneous pose identification

Page 32:

Discussion

• Availability of different canonical sticks for each action.
• Based on the available 2D stick, project to a 3D stick so that it can be compared with any view.
• Estimation of the neck point and viewing direction.
• Tolerance for the neck region: only a 3% performance drop.
• Foreground estimated using background averaging.
• Body part detection errors result in wrong pose estimation.

Page 33:

Conclusion

• A mutually reinforcing motion-pose framework for action recognition:
  – Pose-invariant
  – Partially view-invariant
  – Partial occlusion handling
• The forward path handles motion; the feedback path handles pose.
• The motion of each body part is represented in a body-centric space, which allows pose-invariance.
• Motion determines an initial action, which determines the canonical stick poses to be used.
• The identified canonical stick poses help realign the grids and add motion features for each body part's motion.

Page 34:

Thank you!!

Q & A ??