Human Action Recognition in Videos Employing 2DPCA on 2DHOOF and Radon Transform

Presented in Partial Fulfillment of the Requirements of the Degree of Master of Science in the School of Communication and Information Technology

Fadwa Fawzy Fouad

Supervisor: Dr. Moataz M. Abdelwahab

Agenda

Introduction

Quick overview

2DHOOF/2DPCA Contour Based Optical Flow Algorithm

Human Gesture Recognition Employing Radon Transform/2DPCA

Introduction

• Importance & Applications

• Action vs. Activity

• Challenges & characteristics of the domain

Importance & Applications

Human action/activity recognition is one of the most promising applications of computer vision. Interest in this topic is motivated by its many potential applications, including:

• character animation for games and movies

• advanced intelligent user interfaces

• biomechanical analysis of actions for sports and medicine

• automatic surveillance

Action vs. Activity

Action

Simple motion pattern

Single person

Short time duration

Activity

Complex sequence of actions

Single/ multiple person(s)

Long time duration

Challenges and characteristics of the domain

The difficulty of the recognition process is associated with multiple sources of variation:

Inter- and intra-class variations

Environmental Variations and Capturing conditions

Temporal variations

• Intra-class variations (variations within a single class)

Variations in how a given action is performed, due to anthropometric differences between individuals. For example, running movements can differ in speed and stride length.

• Inter-class variations (variations between different classes)

Overlap between different action classes due to the similarity in how the actions are performed.

• Environmental variations

Distractions originating from the actor's surroundings, including dynamic or cluttered environments, illumination variation, and body occlusion.

• Capturing conditions

These depend on the method used to capture the scene: whether a single camera or multiple cameras are used, and whether the cameras are static or dynamic.

• Temporal variations

These include changes in the performance rate from one person to another, as well as changes in the recording rate (frames/sec).




Overview

The main structure of the action recognition system

The structure of the action recognition system is typically hierarchical.

Start → Capture the input video → Human detection & segmentation → Extraction of the action descriptors → Action classification → End

Capture the input video

With a single camera, the scene is captured from only one viewpoint, so it cannot provide enough information about the performed action when the viewpoint is poor. Besides, it cannot handle the occlusion problem.

[Figure: Video 1, Video 2, Video 3, Video 4]

Multi-camera systems can capture the same scene from different viewpoints, so they provide sufficient information to alleviate the occlusion problem.

[Figure: views from Camera 0, Camera 1, Camera 2, and Camera 3]

The new Kinect depth camera technology can be used to capture the performed actions. The device has an RGB camera, a depth sensor, and a multi-array microphone.

It provides full-body 3D motion capture, facial recognition and voice recognition capabilities. Furthermore, depth information can be used for segmentation.

[Figure: Kinect depth camera output, showing RGB information and depth information]

Human detection & segmentation

This is the first step of the full process of human sequence evaluation.

Techniques can be divided into:

• Background Subtraction techniques

• Motion Based techniques

• Appearance Based techniques

• Depth Based Segmentation

Extraction of the action descriptors

Input videos consist of massive amounts of information in the form of spatio-temporal pixel intensity variations. But most of this information is not directly relevant to the task of understanding and identifying the activity occurring in the video.

In this work we used Non-Parametric approaches in which a set of features are extracted per video frame, then these features are accumulated and matched to stored templates.

Example: Motion Energy Image & Motion History Image
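As an illustration of accumulating per-frame features into a single template, the following is a minimal sketch of a Motion Energy Image (MEI) and Motion History Image (MHI) built from frame differencing. The function name, the `diff_threshold` value, and the choice of memory length `tau` are assumptions made for this sketch, not values taken from the slides.

```python
import numpy as np

def mei_mhi(frames, diff_threshold=10.0, tau=None):
    """Return (MEI, MHI) for a sequence of equally sized grayscale frames.

    diff_threshold and tau are illustrative assumptions.
    """
    frames = [np.asarray(f, dtype=float) for f in frames]
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    # Default memory length: remember motion across the whole clip.
    tau = len(frames) - 1 if tau is None else tau
    mhi = np.zeros_like(frames[0])
    for prev, cur in zip(frames, frames[1:]):
        moving = np.abs(cur - prev) > diff_threshold   # binary motion mask
        # Moving pixels are reset to tau; all other pixels decay by one step.
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
    mei = mhi > 0             # pixels where motion is still remembered
    return mei, mhi / tau     # MHI scaled to [0, 1]; recent motion is brighter
```

The MEI records where motion occurred, while the MHI also encodes how recently it occurred.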

Action classification

When the extracted features are available for an input video, human action recognition becomes a classification problem.

Dimensionality reduction is a common step before the actual classification and is discussed first.

Dimensionality reduction

Image representations are often high-dimensional, which makes the matching task computationally more expensive. The representation might also contain noisy features. This problem triggered the idea of obtaining a more compact, robust feature representation by projecting the image representation into a lower-dimensional space.

Example: one- or two-dimensional Principal Component Analysis (PCA / 2DPCA)
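For reference, the standard 2DPCA formulation can be written as follows. The symbols (M, A_i, G, X, Y_i, d) are notation chosen for this sketch and do not appear on the slides.

```latex
% M training matrices A_i, each of size m x n, with mean \bar{A}
\bar{A} = \frac{1}{M} \sum_{i=1}^{M} A_i
% Image covariance (scatter) matrix, of size n x n
G = \frac{1}{M} \sum_{i=1}^{M} (A_i - \bar{A})^{T} (A_i - \bar{A})
% X = [x_1, \dots, x_d]: eigenvectors of G with the d largest eigenvalues
Y_i = A_i X
```

Unlike 1D PCA, the matrices are not flattened into vectors, so G is only n x n and each feature matrix Y_i has size m x d.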

Nearest neighbor classification

k-Nearest neighbor (NN) classifiers use the distance between the features of an observed sequence and those in a training set. The most common label among the k closest training sequences is chosen as the classification.

NN classification can be either performed at the frame level, or for the whole video sequences. In the latter case, issues with different frame lengths need to be resolved.

In our work we used 1-NN with Euclidean distance to classify the tested actions.
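A minimal sketch of 1-NN classification with Euclidean distance is shown below. It assumes each sample has already been reduced to a flat feature vector of equal length; the function name and the toy labels in the usage line are hypothetical.

```python
import math

def classify_1nn(train_features, train_labels, query):
    """Return (label, distance) of the training sample closest to query."""
    best_label, best_dist = None, math.inf
    for features, label in zip(train_features, train_labels):
        dist = math.dist(features, query)   # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist

# Usage with toy 2-D features:
label, dist = classify_1nn([[0.0, 0.0], [10.0, 10.0]], ["bend", "wave"], [1.0, 0.0])
```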





2DHOOF/2DPCA Contour Based Optical Flow Algorithm

• Dense vs. Sparse OF

• Alignment issues with OF

• The Calculation of 2D Histogram of Optical Flow (2DHOOF)

• Overall System Description

• Experimental Results

Dense vs. Sparse OF

In practice, dense OF is not the best choice for computing the OF. Besides its high computational complexity, it is not accurate for homogeneous moving objects (the aperture problem).

Alignment issues with OF

We had two choices for the order of actor alignment:

• Align the actor, then calculate the OF

• Calculate the OF, then align it

[Figure: jumping and transition effects in the Running action, comparing "align actor then calculate OF" with "calculate OF then align OF"]

The Calculation of 2D Histogram of Optical Flow (2DHOOF)

[Figure: an example of obtaining the n-layer 2DHOOF for any two successive frames, from the calculated OF to histogram layers of size W/m x H/m x n]

[Figure: accumulated 2DHOOF that represents the whole video]
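The slides give only the histogram size (W/m x H/m x n), so the following is a hedged sketch of one plausible construction: the frame is split into m x m pixel blocks, flow directions are quantized into n angular bins (one layer per bin), and each layer counts the flow vectors of that direction in each block. The function names, the counting (rather than magnitude weighting), and the `min_mag` cutoff are assumptions.

```python
import numpy as np

def hoof_2d(u, v, m=8, n=8, min_mag=1e-3):
    """2-D histogram of optical flow for one frame pair.

    u, v: H x W arrays of horizontal and vertical flow.
    Returns an array of shape (n, H // m, W // m): one layer per direction bin.
    """
    h, w = u.shape
    hb, wb = h // m, w // m
    mag = np.hypot(u, v)
    ang = np.arctan2(v, u) % (2 * np.pi)                  # direction in [0, 2*pi)
    layer = np.minimum((ang / (2 * np.pi) * n).astype(int), n - 1)
    hist = np.zeros((n, hb, wb))
    # Keep only pixels with noticeable motion inside the area covered by blocks.
    ys, xs = np.nonzero(mag[:hb * m, :wb * m] > min_mag)
    np.add.at(hist, (layer[ys, xs], ys // m, xs // m), 1.0)
    return hist

def accumulate_hoof(flows, m=8, n=8):
    """Sum the per-pair histograms of (u, v) flow fields over a whole video."""
    return sum(hoof_2d(u, v, m, n) for u, v in flows)
```

Keeping the block position in each layer preserves where the motion happened, which a single 1-D direction histogram discards.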

1DHOOF vs. 2DHOOF

[Figure: confusion between the Wave and Bend actions when using 1DHOOF]

Overall System Description

Training Mode: Segmentation & Contour Extraction → Sparse OF → 2DHOOF → 2DPCA → Extract the dominant vectors → Store extracted features

Testing Mode: Segmentation & Contour Extraction → Sparse OF → 2DHOOF → Projection on the dominant vectors → Classification and Voting Scheme

Training Mode





Segmentation & Contour Extraction (Method 1)

• Geodesic segmentation

Input Video Frame → Face Detection → Initial Stroke → Blob Extraction → Final Contour

The blob is extracted using the geodesic distance (GD), where:

xi : stroke pixels (black)

x : other pixels (white)

I : image intensity

Segmentation & Contour Extraction (Method 2)

• Contour extraction from the magnitude of the dense OF

An edge pixel must satisfy specific criteria based on its 3 x 3 neighboring pixels.

These edge criteria are applied to the magnitude of the dense OF.

[Figure: steps of contour extraction from dense OF]

Training Mode





2DHOOF-2DPCA Features Extraction

2DHOOF of Training Videos → Mean/Layer → Covariance/Layer → Dominant Vectors/Layer → Projection → Final Features
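The per-layer steps (mean, covariance, dominant vectors, projection) can be sketched with standard 2DPCA as below. This is an illustrative sketch rather than the exact implementation behind the slides: the function names and the number of kept vectors `d` are assumptions, and the functions would be applied to each 2DHOOF layer separately.

```python
import numpy as np

def fit_2dpca(samples, d):
    """Fit 2DPCA on one histogram layer.

    samples: array of shape (M, rows, cols), one matrix per training video.
    Returns (mean, X), where X has shape (cols, d) and holds the dominant vectors.
    """
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)
    centered = samples - mean
    # Image covariance: average of (A - mean)^T (A - mean) over all samples.
    cov = np.einsum('kij,kil->jl', centered, centered) / len(samples)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]         # indices of the d largest
    return mean, eigvecs[:, order]

def project(sample, dominant_vectors):
    """Project one matrix onto the dominant vectors: features of shape (rows, d)."""
    return np.asarray(sample, dtype=float) @ dominant_vectors
```

In training mode the projected features are stored; in testing mode the same `project` step is applied to the test video before classification.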






Testing Mode


Sparse OF → 2DHOOF → Projection on the dominant vectors → Classification and Voting Scheme

Classification: after projection on the dominant vectors, the distances D1, D2, D3, ..., Dj are computed, and the final decision is based on the minimum D value.

Experimental Results

Two experiments were conducted to evaluate the performance of the proposed algorithm.

• For the first experiment, the Weizmann dataset was used to measure the performance of low-resolution, single-camera operation.

• For the second experiment, the IXMAS multi-view dataset was used to evaluate the performance of the parallel camera structure.

Both experiments were conducted using the Leave-One-Actor-Out (LOAO) technique to be consistent with the most recent algorithms.

Both datasets provide RGB frames and the actors' silhouettes.
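The Leave-One-Actor-Out protocol mentioned above can be sketched as follows: in each round, all videos of one actor are held out for testing and the remaining actors are used for training. The tuple layout `(actor_id, label, features)` and the function name are assumptions made for this sketch.

```python
def leave_one_actor_out(samples):
    """Yield (held_out_actor, train, test) splits.

    samples: list of (actor_id, label, features) tuples.
    """
    actors = sorted({actor for actor, _, _ in samples})
    for held_out in actors:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test
```

The reported accuracy is then the fraction of correctly classified test videos accumulated over all rounds.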

Weizmann dataset

The Weizmann dataset consists of 90 low-resolution video sequences showing 9 different actors, each performing 10 natural actions such as walk, run, jump forward, gallop sideways, bend, wave with one hand (wave1), wave with two hands (wave2), jump in place (Pjump), jump-jack, and skip.

[Figure: sample frames for the Bend, Run, Jump, Jump-jack, and Gallop actions]

The confusion matrix for this experiment shows that the average recognition accuracy is 97.78%, and eight actions were 100% accurate.

[Figure: confusion matrix for 2DHOOF/2DPCA]

On the other hand, using 1DHOOF with 1DPCA decreases the accuracy to 63.34% because of the large confusion between actions (as discussed before).

[Figure: confusion matrix for 1DHOOF/1DPCA]

Comparison with the most recent algorithms:

• Recognition Accuracy

Method                  Accuracy
Previous Contribution   98.89%
Our Algorithm           97.78%
Shah et al.             95.57%
Yang et al.             92.8%
Yuan et al.             92.22%

• Average Testing Time

Method                  Average Runtime
Our Algorithm           66.11 msec
(unnamed in source)     113.00 msec
Shah et al.             18.65 sec
Blank et al.            30 sec

[Figure: samples from the calculated contour OF for the Walk, Skip, and P-jump actions]

IXMAS Dataset

The proposed parallel-structure algorithm was applied to the IXMAS multi-view dataset. Each camera is treated as an independent system, and a voting scheme is then carried out between the four cameras to obtain the final decision.

Camera 0, Camera 1, Camera 2, Camera 3 → Our Algorithm (one instance per camera) → Voting Scheme → Final Decision
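A minimal sketch of such a voting step is given below, assuming a simple majority vote over the per-camera decisions. The slides do not state how ties are resolved; breaking ties in favor of the lowest-numbered camera is an assumption of this sketch, as is the function name.

```python
from collections import Counter

def vote(camera_decisions):
    """Return the majority label among the per-camera decisions.

    camera_decisions: list of labels ordered by camera index.
    Ties go to the label that appears first (assumed tie-break rule).
    """
    counts = Counter(camera_decisions)
    best = max(counts.values())
    for label in camera_decisions:
        if counts[label] == best:
            return label

# Usage: decisions from Camera 0 to Camera 3
final = vote(["walk", "kick", "walk", "wave"])
```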

In this dataset, 5 cameras capture the scene. There are 12 actors, each performing 13 natural actions 3 times, and the actors are free to change their orientation in each scenario.

The actions include: check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave, punch, kick, and pick up and throw.

[Figure: example from the IXMAS multi-camera dataset, action "Pick up and Throw", as seen from Camera 0, Camera 1, Camera 2, and Camera 3]

The confusion matrix for the IXMAS dataset shows that the average accuracy is 87.12%, where SH = Scratch head, CW = Check watch, CA = Cross arms, SD = Sit down, GU = Get up, TA = Turn around, PU = Pick up.

Method                Actors #  Cam(0) %  Cam(1) %  Cam(2) %  Cam(3) %  Overall Vote %
Proposed Algorithm    12        97.29     79.04     72.47     78.53     87.12
(unnamed in source)   12        78.9      78.61     80.93     77.38     84.59
Weinland et al.       10        65.04     70.00     54.30     66.00     81.30
Srivastava et al.     10        N/A       N/A       N/A       N/A       81.40
Shah et al.           12        72.00     53.00     68.00     63.00     78.00

Comparison with the best reported accuracies shows that we achieved the highest overall accuracy, an improvement of roughly 3 percentage points.

Bold indicates the best performance; N/A = not available in published reports.

[Figure: samples from the calculated contour OF for the Walk, Sit down, and Kick actions]

Published Paper

F. Fawzy, M. Abdelwahab, and W. Mikhael. 2DHOOF-2DPCA Contour Based Optical Flow Algorithm for Human Activity Recognition. IEEE International Midwest Symposium on Circuits and Systems (MWSCAS 2013), Ohio, USA.




Human Gesture Recognition Employing Radon Transform/2DPCA

• Radon Transform (RT)

• Overall system description

Radon Transform

The RT computes projections of an image matrix along specified directions. A projection of a two-dimensional function f(x,y) is a set of line integrals along parallel paths, or beams.

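Projections can be computed along any angle θ by using the general equation of the Radon Transform:

$$R(\rho, \theta) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, \delta(\rho - x\cos\theta - y\sin\theta)\, dx\, dy$$

where δ(·) is the delta function, which is nonzero only when its argument equals 0, ρ is the position along the projection direction, and θ is the orientation of this direction.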
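As a rough illustration of the Radon Transform, the sketch below discretizes the delta-function form directly: every pixel is assigned to the nearest offset bin x cos(angle) + y sin(angle), and intensities falling into the same bin are summed. The function name, the image-centered coordinates, and the nearest-bin rounding are assumptions of this sketch; library implementations typically interpolate instead.

```python
import numpy as np

def radon_transform(image, angles_deg):
    """Nearest-bin discrete Radon transform.

    Returns an array of shape (n_bins, len(angles_deg)); column k is the
    projection of the image at angle angles_deg[k].
    """
    image = np.asarray(image, dtype=float)
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xc = xs - (w - 1) / 2.0                 # coordinates relative to image center
    yc = ys - (h - 1) / 2.0
    half = int(np.ceil(np.hypot(h, w) / 2.0))
    n_bins = 2 * half + 1
    sinogram = np.zeros((n_bins, len(angles_deg)))
    for k, angle in enumerate(angles_deg):
        t = np.deg2rad(angle)
        offset = xc * np.cos(t) + yc * np.sin(t)        # signed distance of each pixel
        idx = np.rint(offset).astype(int) + half        # nearest offset bin
        sinogram[:, k] = np.bincount(idx.ravel(), weights=image.ravel(),
                                     minlength=n_bins)
    return sinogram
```

With this convention, the 0 degree column is the projection onto the x-axis (column sums) and the 90 degree column is the projection onto the y-axis (row sums).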

Overall system description

The proposed system is designed and tested for gesture recognition and can be extended to regular action recognition.

We have two modes for this algorithm:

• Training Mode

• Testing Mode

Both have a pre-processing step before feature extraction.

Training Mode

Pre-processing Step: 1) Input videos

The One Shot Learning ChaLearn Gesture Dataset was used for this experiment. In this dataset, a single user facing a fixed Kinect™ camera was recorded while interacting with a computer by performing gestures.

Videos are represented by RGB and depth images.

Each actor has from 8 to 15 different gestures (the vocabulary) for training, and 47 input videos, each containing from 1 to 5 gestures, for testing.

We applied our algorithm to a subset of this dataset consisting of 37 different actors.

The dataset can be divided into two main groups: standing actors and sitting actors. In this experiment we used a subset of the standing-actor group, in which actors use their whole body to perform the gesture and make motion significant enough to be captured by the MEI and MHI.

[Figure: standing actors and sitting actors]

Also, we used only the depth videos as input. Depth information makes the segmentation task easier than with RGB or gray videos, especially when the actor's clothes have the same color as the background or when the background is textured.

Training Mode

Pre-processing Step: 2) Segmentation & Blob extraction

We used the Basic Global Thresholding algorithm to extract the actor's blob.

1. Select an initial estimate for T (typically the average grey level in the image).

2. Segment the image using T into two groups of pixels: G1, consisting of pixels with grey levels > T, and G2, consisting of pixels with grey levels ≤ T.

3. Compute the average grey levels of the pixels in G1 to give μ1 and in G2 to give μ2.

4. Compute a new threshold value: T = (μ1 + μ2) / 2.

Repeat steps 2-4 until the change in T between iterations is less than 1 or the total number of iterations is more than 10.
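The thresholding steps above can be sketched as follows. The stopping values (a change below 1, at most 10 iterations) come from the slide; the function name and the handling of an empty pixel group are assumptions.

```python
import numpy as np

def basic_global_threshold(image, max_iter=10, min_change=1.0):
    """Iteratively estimate a global threshold T for a grey-level image."""
    pixels = np.asarray(image, dtype=float).ravel()
    t = pixels.mean()                          # step 1: initial estimate
    for _ in range(max_iter):
        high = pixels[pixels > t]              # step 2: split into two groups
        low = pixels[pixels <= t]
        if high.size == 0 or low.size == 0:    # constant image: nothing to split
            break
        new_t = 0.5 * (high.mean() + low.mean())   # steps 3 and 4
        converged = abs(new_t - t) < min_change
        t = new_t
        if converged:
            break
    return t

# Usage: a binary mask of the pixels above the threshold
# mask = np.asarray(image) > basic_global_threshold(image)
```

Whether the actor's blob is the group above or below T depends on how depth is encoded in the input video.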

In some cases the resulting blob includes extra objects. This noise comes from objects that were at the same depth as the actor.

[Figure: Case 1, Case 2, and Case 3 before noise elimination]

In this situation we perform a noise elimination step.

[Figure: Case 1, Case 2, and Case 3 after noise elimination]

Training Mode

Alignment using RT of the First Frame

• Vertical alignment using the projection on the y-axis (90° from RT)

• Horizontal alignment using the projection on the x-axis (0° from RT)
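The slides do not detail how the two projections are turned into a shift, so the following is only one possible sketch: the center of mass of each projection is computed and the blob is shifted so that it lands on the frame center. The function name, the center-of-mass criterion, and the use of a circular shift are assumptions.

```python
import numpy as np

def align_to_center(mask):
    """Shift a binary blob so its projections are centered in the frame.

    Returns (aligned_mask, (dy, dx)).
    """
    mask = np.asarray(mask, dtype=float)
    h, w = mask.shape
    proj_x = mask.sum(axis=0)      # projection on the x-axis (0 degrees)
    proj_y = mask.sum(axis=1)      # projection on the y-axis (90 degrees)
    total = proj_x.sum()
    if total == 0:                 # empty frame: nothing to align
        return mask.copy(), (0, 0)
    cx = (np.arange(w) * proj_x).sum() / total
    cy = (np.arange(h) * proj_y).sum() / total
    dx = int(round((w - 1) / 2.0 - cx))    # horizontal shift
    dy = int(round((h - 1) / 2.0 - cy))    # vertical shift
    # np.roll wraps around the border; acceptable when the blob fits the frame.
    return np.roll(mask, (dy, dx), axis=(0, 1)), (dy, dx)
```

Since the alignment is computed from the first frame, the same shift would then be applied to the remaining frames of the video.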

Training Mode

Calculate the MEI and MHI

[Figure: MEI and MHI for the whole body, and MEI and MHI for the body parts]

Training Mode

Get Radon Transform for MEI and MHI

Basically, the difference between the RT of the whole body and the RT of the body parts is the white portion in the center, which represents the projection of the actor's body.


Testing Mode

Video Chopping

As mentioned, the testing videos may contain from 1 to 5 different gestures per video, so we need to split each video into single-gesture clips before testing the system. We do this in two main steps:

1. Calculate the plot that represents the moving area per frame.

2. Apply the local minima criteria to this plot.

We are searching for a frame i that satisfies the following conditions:

a) The number of frames before i is greater than or equal to the Frame Threshold.

b) The decrease in the area at i is greater than 50% of the peak value.

c) The areas at i-1 and i+1 are greater than the area at i, to ensure that i is a local minimum between two peaks.
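Conditions (a) to (c) can be sketched as a single scan over the moving-area curve. Several details are assumptions of this sketch: frames are counted from the previous cut, the peak value is taken as the largest area since the previous cut, and the default `frame_threshold` is a placeholder value.

```python
def find_cut_frames(area, frame_threshold=10, drop_ratio=0.5):
    """Return the frame indices where a multi-gesture video should be split.

    area: moving area per frame (a list of numbers).
    """
    cuts = []
    start = 0                                   # first frame of the current gesture
    for i in range(1, len(area) - 1):
        if i - start < frame_threshold:         # (a) enough frames before i
            continue
        peak = max(area[start:i])
        if peak - area[i] <= drop_ratio * peak: # (b) drop of more than 50% of the peak
            continue
        if area[i - 1] > area[i] and area[i + 1] > area[i]:   # (c) local minimum
            cuts.append(i)
            start = i
    return cuts

# Usage with a toy area curve containing two gestures:
cuts = find_cut_frames([0, 5, 10, 5, 1, 5, 10, 5, 0], frame_threshold=3)
```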

[Figure: examples of good and bad video chopping results]

Experimental Results

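We ran four One Shot Learning (OSL) experiments:

• Experiment I: Radon Transform features with 2DPCA

• Experiment II: Radon Transform features with direct correlation

• Experiment III: MEI/MHI features with 2DPCA

• Experiment IV: MEI/MHI features with direct correlation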

Recognition accuracy (%) of the four experiments

Features  Experiment  Whole Body MEI  Whole Body MHI  Body Parts MEI  Body Parts MHI
RT        I           71              69              82              81.5
RT        II          70              70              81.7            81.6
MEI/MHI   III         70              68              82              81.7
MEI/MHI   IV          71.24           68.7            83.33           82.9

Comparison between using RT and using MEI/MHI directly without RT

Features  % Maintained Energy  Storage Requirements
RT        99%                  72 Mbytes
MEI/MHI   88%                  102 Mbytes

With 2DPCA, RT is the better choice: it maintains more energy and needs about 30% less storage.

Thank You