
HOPC: Histogram of Oriented Principal Components of 3D Pointclouds for Action Recognition

Hossein Rahmani, Arif Mahmood, Du Q Huynh, and Ajmal Mian

Computer Science and Software Engineering, The University of Western Australia, 35 Stirling Highway, Crawley, WA 6009, Australia

Abstract. Existing techniques for 3D action recognition are sensitive to viewpoint variations because they extract features from depth images which change significantly with viewpoint. In contrast, we directly process the pointclouds and propose a new technique for action recognition which is more robust to noise, action speed and viewpoint variations. Our technique consists of a novel descriptor and keypoint detection algorithm. The proposed descriptor is extracted at a point by encoding the Histogram of Oriented Principal Components (HOPC) within an adaptive spatio-temporal support volume around that point. Based on this descriptor, we present a novel method to detect Spatio-Temporal Key-Points (STKPs) in 3D pointcloud sequences. Experimental results show that the proposed descriptor and STKP detector outperform state-of-the-art algorithms on three benchmark human activity datasets. We also introduce a new multiview public dataset and show the robustness of our proposed method to viewpoint variations.

Keywords: Spatio-temporal keypoints, multiview action dataset

1 Introduction

Human action recognition has many applications in smart surveillance, human-computer interaction and sports. The Kinect and other depth cameras have become popular for this task because depth sequences do not suffer from the problems induced by variations in illumination and clothing texture. However, the presence of occlusion, sensor noise and most importantly viewpoint variations still make action recognition a challenging task.

Designing an efficient depth sequence representation is an important task in many computer vision problems. Most existing action recognition techniques (e.g., [4,21,38]) treat depth sequences the same way as color videos and use color-based action recognition methods. However, while these methods are suitable for color video sequences, simply extending them to depth sequences may not be optimal [19]. Information captured by depth cameras actually allows geometric features to be extracted to form rich descriptors. For instance, Tang et al. [27] used histograms of the normal vectors for object recognition in depth images. Given a depth image, they computed spatial derivatives, transformed them to polar coordinates and used the 2D histograms as object descriptors. Recently, Oreifej and Liu [19] extended the same technique to the temporal dimension by adding the time derivative. A downside of treating depth sequences this way is that the noise in the depth images is enhanced by the differential operations [31]. Histogramming, on the other hand, is analogous to integration and is more resilient to the effect of noise.



Fig. 1. Two sequences of 3D pointclouds of a subject performing the holding head action: (a) front view (0° horizontal rotation) and (b) side view at 50° horizontal rotation, shown over time with colour encoding depth (low Z to high Z). Notice how the depth values (colours) have significantly changed with the change in viewpoint. Simple normalization cannot compensate for such depth variations. Existing depth based action recognition algorithms will not be accurate in such cases

Furthermore, viewpoint variations are unavoidable in real scenarios. However, none of the existing 3D sensor based techniques is designed for cross-view action recognition, where training is performed on sequences acquired from one view and testing is performed on sequences acquired from a significantly different view (> 25°).

We directly process the 3D pointcloud sequences (Fig. 1) and extract point descriptors which are robust to noise and viewpoint variations. We propose a novel descriptor, the Histogram of Oriented Principal Components (HOPC), to capture the local geometric characteristics around each point within a sequence of 3D pointclouds. To extract HOPC at a point p, PCA is performed on an adaptive spatio-temporal support volume around p (see Fig. 2), which gives us a 3 × 3 matrix of eigenvectors and the corresponding eigenvalues. Each eigenvector is projected onto m directions derived from a regular m-sided polyhedron and scaled by its eigenvalue. HOPC is formed by concatenating the projected eigenvectors in decreasing order of their eigenvalues.

HOPC is used in a holistic and local setting. In the former approach, the sequence of 3D pointclouds is divided into spatio-temporal cells and HOPC descriptors of all points within a cell are accumulated and normalized to form a single cell descriptor. All cell descriptors are concatenated to form a holistic HOPC descriptor. In the latter approach, local HOPC are extracted at candidate spatio-temporal keypoints (STKP) and a HOPC quality factor is defined to rank the STKPs. Only high quality STKPs are retained. All points within the adaptive spatio-temporal support volume of each STKP are aligned along the eigenvectors of the spatial support around the STKP. Thus the support volume is aligned with a local object centered coordinate basis and extracting HOPC, or any other feature, at the STKP will be view invariant. See Section 4.2 for details. Since humans may perform the same action at different speeds, to achieve speed invariance we propose automatic temporal scale selection by minimizing the eigenratios over a varying temporal window size. The main contributions of this paper include:

– A HOPC descriptor for 3D pointclouds.
– A spatio-temporal key-point (STKP) detector and a view invariant descriptor.


– A technique for speed normalization of actions.

Moreover, we introduce a new 3D action dataset which has scale variations of subjects and viewpoint variations. It contains thirty actions, which is a larger number than any existing 3D action dataset. This dataset will be made public. Experimental comparison on four datasets, including three benchmark ones [13,19,32], with eight state-of-the-art methods [4,10,19,21,31,32,35,36] shows the efficacy of our algorithms. Data and code of our technique are available [1].

2 Related Work

Based on the input data, human action recognition methods can be divided into three categories including RGB-based, skeleton-based and depth-based methods. In RGB videos, in order to recognize actions across viewpoint changes, mostly view independent representations are proposed, such as view invariant spatio-temporal features [2,20,22,23,25,33]. Some methods infer the 3D scene structure and use geometric transformations to achieve view invariance [5,9,15,26,39]. Another approach is to find a view independent latent space [7,8,12,14] in which features extracted from the actions captured at different viewpoints are directly comparable. Our proposed approach also falls in this category. However, our approach is only for 3D pointclouds captured by depth sensors. To the best of our knowledge, we are the first to propose cross-view action recognition using 3D pointclouds. We propose to normalize the spatio-temporal support volume of each candidate keypoint in the 3D pointcloud such that the feature extracted from the normalized support volume becomes view independent.

In skeleton based methods, 3D joint positions are used for action recognition. Multi-camera motion capture (MoCap) systems [3] have been used for human action recognition, but such special equipment is marker-based and expensive. Moreover, due to the different quality of the motion data, action recognition methods designed for MoCap are not suitable for 3D pointcloud sequences, which is the focus of this paper [32].

On the other hand, some methods [36,31,37] use the human joint positions extracted by the OpenNI tracking framework [24] as interest points. For example, Yang and Tian [37] proposed pairwise 3D joint position differences in each frame and temporal differences across frames to represent an action. Since 3D joints cannot capture all the discriminative information, the action recognition accuracy is compromised. Wang et al. [32] extended the previous approach by computing the histogram of occupancy pattern of a fixed region around each joint in a frame. In the temporal dimension, they used low frequency Fourier components as features and an SVM to find a discriminative set of joints. It is important to note that the estimated joint positions are not reliable and can fail when the human subject is not in an upright and frontal view position (e.g., lying on a sofa) or when there is clutter around the subject.

Action recognition methods based on depth maps can be divided into holistic [19,21,13,38,29] and local approaches [32,35,11,31]. Holistic methods use global features such as silhouettes and space-time volume information. For example, Li et al. [13] sampled boundary pixels from 2D silhouettes as a bag of features. Yang et al. [38] added temporal derivative of 2D projections to get Depth Motion Maps (DMM). Vieira et al. [29] computed silhouettes in 3D by using the space-time occupancy patterns. Recently, Oreifej and


Liu [19] extended the histogram of oriented 3D normals [27] to 4D by adding the time derivative. The gradient vector was normalized to unit magnitude and projected on a refined basis of a 600-cell polychoron to make histograms. The last component of the normalized gradient vector was the inverse of the gradient magnitude. As a result, information from very strong derivative locations, such as edges and silhouettes, may get suppressed [21]. The proposed HOPC descriptor is more informative than HON4D as it captures the spread of data in the three principal directions. Thus, HOPC achieves higher action recognition accuracy than existing methods on three benchmark datasets.

Depth based local methods use local features where a set of interest points are extracted from the depth sequence and a feature descriptor is computed for each interest point. For example, Cheng et al. [4] used the interest point detector proposed by Dollar et al. [6] and proposed a Comparative Coding Descriptor (CCD). Due to the presence of noise in depth sequences, simply extending color-based interest point detectors such as [6] and [11] may degrade the efficiency of these detectors [19].

Motion trajectory based action recognition methods [30,34] are also not reliable in depth sequences [19]. Therefore, recent depth based action recognition methods resorted to alternative ways to extract more reliable interest points. Wang et al. [31] proposed Haar features to be extracted from each random subvolume. Xia and Aggarwal in [35] proposed a filtering method to extract spatio-temporal interest points. Their approach fails when the action execution speed is faster than the flip of the signal caused by the sensor noise. Both techniques are sensitive to viewpoint variations.

In contrast to previous interest point detection methods, the proposed STKP detector is robust to variations in action execution speed, sensor viewpoint and the spatial scale of the actor. Since the proposed HOPC descriptor is not strictly based on the depth derivatives, it is more robust to noise. Moreover, our methods do not require skeleton data, which may be noisy or unavailable, especially in the case of side views.

3 Histogram of Oriented Principal Components (HOPC)

Let Q = {Q1, Q2, · · · , Qt, · · · , Qnf} represent a sequence of 3D pointclouds captured by a 3D sensor, where nf denotes the number of frames (i.e., the number of 3D pointclouds in the sequence) and Qt is the 3D pointcloud at time t. We make a spatio-temporal accumulated 3D pointcloud by merging the sequence of individual pointclouds in the time interval [t − τ, t + τ]. Consider a point p = (xt yt zt)^⊤, 1 ≤ t ≤ nf, in Qt. We define the spatio-temporal support of p, Ω(p), as the 3D points which are in a sphere of radius r centered at p (Fig. 2). We propose a point descriptor based on the eigenvalue decomposition of the scatter matrix C of the points q ∈ Ω(p):

C = (1/np) ∑_{q∈Ω(p)} (q − µ)(q − µ)^⊤,  where  µ = (1/np) ∑_{q∈Ω(p)} q,    (1)

and np = |Ω(p)| denotes the number of points in the spatio-temporal support of p. Performing PCA on the scatter matrix C gives us CV = VE, where E is a diagonal matrix of the eigenvalues λ1 ≥ λ2 ≥ λ3, and V contains the three orthogonal eigenvectors [v1 v2 v3] arranged in the order of decreasing magnitude of their associated eigenvalues. We propose a new descriptor, the Histogram of Oriented Principal Components


(HOPC), by projecting each eigenvector onto m directions obtained from a regular m-sided polyhedron. We use m = 20, which gives a regular icosahedron composed of 20 regular triangular facets, and each facet corresponds to a histogram bin. Let U ∈ R^{3×m} be the matrix of the center positions u1, u2, · · · , um of the facets:

U = [u1, u2, · · · , ui, · · · , um].    (2)

For a regular icosahedron with center at the origin, these normalized vectors are(±1Lu

,±1Lu

,±1Lu

),

(0,±ϕ−1

Lu,±ϕLu

),

(±ϕ−1

Lu,±ϕLu

, 0

),

(±ϕLu

, 0,±ϕ−1

Lu

), (3)

where ϕ = (1 +√5)/2 is the golden ratio, and Lu =

√ϕ2 + 1/ϕ2 is the length of

vector ui, 1 ≤ i ≤ m. The eigenvectors are basically directions of maximum varianceof the points in 3D space. Thus, they have a 180 ambiguity. To overcome this problem,we consider the distribution of vector directions and their magnitudes within the supportvolume of p. We determine the sign of each eigenvector vj from the sign of the innerproducts of vj and all vectors within the support of p:

vj = vj · sign( ∑_{q∈Ω(p)} sign(o^⊤ vj)(o^⊤ vj)² ),    (4)

where o = q − p and the sign function returns the sign of an input number. Note that the squared projection ensures the suppression of small projections, which could be due to noise. If the signs of the eigenvectors v1, v2, and v3 disagree, i.e., v1 × v2 ≠ v3, we switch the sign of the eigenvector whose |∑_{w=1}^{np} sign(o_w^⊤ vj)(o_w^⊤ vj)²| value is the smallest. We then project each eigenvector vj onto U to give us:

bj = U^⊤ vj ∈ R^m, for 1 ≤ j ≤ 3.    (5)

In case vj is perfectly aligned with some ui ∈ U, it should vote into only the i-th bin. However, the ui's are not all orthogonal, therefore bj will have non-zero projections in other bins as well. To overcome this effect, we quantize the projection bj. For this purpose, a threshold value ψ is computed by projecting any two neighbouring vectors uk and ul,

ψ = uk^⊤ ul = (ϕ + ϕ⁻¹)/Lu²,  uk, ul ∈ U.    (6)

Note that for any uk ∈ U, we can find a ul ∈ U such that ψ = (ϕ + ϕ⁻¹)/Lu². The quantized vector is given by

bj(z) = 0 if bj(z) ≤ ψ,  and  bj(z) − ψ otherwise,

where 1 ≤ z ≤ m. We define hj to be bj scaled by the corresponding eigenvalue λj,

hj = λj · bj / ||bj||₂ ∈ R^m, for 1 ≤ j ≤ 3.    (7)


We concatenate the histograms of oriented principal components of all three eigenvectors in decreasing order of their eigenvalues to form a descriptor of point p:

hp = [h1^⊤ h2^⊤ h3^⊤]^⊤ ∈ R^{3m}.    (8)

The spatio-temporal HOPC descriptor at point p encodes information from both shape and motion in the support volume around it. Since the smallest principal component of the local surface is in fact the total least squares estimate of the surface normal [18], our descriptor, which inherently encodes the surface normal, is more robust to noise than the gradient-based surface normal used in [27,19]. Using this descriptor, we propose two different action recognition algorithms in the following section.
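To make the construction in Eqs. (1)-(8) concrete, the following is a minimal NumPy sketch of the per-point HOPC computation. It is our own illustration rather than the authors' released code [1]; the function names and the zero-norm guards are our additions, and the support set is assumed to be given as an array of 3D points.

import numpy as np

PHI = (1 + np.sqrt(5)) / 2            # golden ratio (phi)
L_U = np.sqrt(PHI**2 + PHI**-2)       # = sqrt(3), length of each unnormalized direction u_i

def polyhedron_directions():
    # The 20 unit directions of Eq. (3) (hypothetical helper name).
    dirs = [(sx, sy, sz) for sx in (1, -1) for sy in (1, -1) for sz in (1, -1)]
    for s1 in (1, -1):
        for s2 in (1, -1):
            dirs += [(0, s1 / PHI, s2 * PHI),
                     (s1 / PHI, s2 * PHI, 0),
                     (s1 * PHI, 0, s2 / PHI)]
    return np.asarray(dirs, dtype=float).T / L_U   # shape 3 x 20, unit length

U = polyhedron_directions()                        # m = 20 directions
PSI = (PHI + 1 / PHI) / L_U**2                     # quantization threshold, Eq. (6)

def hopc(p, support):
    # HOPC descriptor of point p from its support points Omega(p), Eqs. (1)-(8).
    # p: (3,) array; support: (n_p, 3) array; returns a (3*m,) descriptor.
    mu = support.mean(axis=0)
    X = support - mu
    C = X.T @ X / len(support)                     # scatter matrix, Eq. (1)
    evals, evecs = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1]                # lambda1 >= lambda2 >= lambda3
    evals, evecs = evals[order], evecs[:, order]

    O = support - p                                # vectors o = q - p
    strength = np.zeros(3)
    for j in range(3):                             # sign disambiguation, Eq. (4)
        proj = O @ evecs[:, j]
        strength[j] = np.sum(np.sign(proj) * proj**2)
        if strength[j] < 0:
            evecs[:, j] = -evecs[:, j]
    # enforce a consistent right-handed basis by flipping the least reliable eigenvector
    if np.dot(np.cross(evecs[:, 0], evecs[:, 1]), evecs[:, 2]) < 0:
        evecs[:, np.argmin(np.abs(strength))] *= -1

    parts = []
    for j in range(3):
        b = U.T @ evecs[:, j]                      # projection onto the m directions, Eq. (5)
        b = np.where(b <= PSI, 0.0, b - PSI)       # quantization with threshold psi
        n = np.linalg.norm(b)
        parts.append(evals[j] * b / n if n > 0 else b)   # scaling by eigenvalue, Eq. (7)
    return np.concatenate(parts)                   # h_p, Eq. (8)

In the holistic setting a function like this would be applied to every point of the accumulated pointcloud; in the local setting only to the support volumes of candidate keypoints.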

4 Action Recognition

We propose a holistic and a local approach for human action recognition. Our holistic method is suitable for actions with occlusions or high inter-class similarity of local motions, and for cases where the subjects do not change their spatial locations. Our local method, on the other hand, is more suitable for cross-view action recognition and for cases where the subjects change their spatial locations.

4.1 Action Recognition with Holistic HOPC

A sequence of 3D pointclouds is divided into γ = nx × ny × nt spatio-temporal cells along the X, Y, and T dimensions. We use cs, where s = 1, · · · , γ, to denote the s-th cell. The spatio-temporal HOPC descriptor hp in (8) is computed for each point p within the sequence. The cell descriptor hcs is computed by accumulating hcs = ∑_{p∈cs} hp and then normalizing hcs ← hcs/||hcs||₂. The final descriptor hv for the given sequence is a concatenation of the hcs obtained from all the cells, hv = [hc1^⊤ hc2^⊤ · · · hcs^⊤ · · · hcγ^⊤]^⊤. We use hv as the holistic HOPC descriptor and use SVM for classification.
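A sketch of this accumulation step follows (our own illustrative helper, assuming per-point descriptors and cell indices have already been computed, e.g. with a hopc-style function as above):

import numpy as np

def holistic_hopc(point_descriptors, cell_ids, n_cells):
    # Accumulate per-point HOPC descriptors h_p into the holistic descriptor h_v.
    # point_descriptors: (N, 3m) array; cell_ids: (N,) cell index in [0, n_cells).
    d = point_descriptors.shape[1]
    h = np.zeros((n_cells, d))
    np.add.at(h, cell_ids, point_descriptors)                  # h_cs = sum of h_p over cell s
    norms = np.linalg.norm(h, axis=1, keepdims=True)
    h = np.where(norms > 0, h / np.maximum(norms, 1e-12), h)   # per-cell L2 normalization
    return h.ravel()                                           # concatenation over all gamma cells

The resulting vector hv would then be fed to an SVM classifier, e.g. a linear SVM.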

Computing a Discriminative Cell Descriptor: The HOPC descriptor is highly correlated to the order of the eigenvalues of the spatio-temporal support volume around p. Therefore, a pruning approach is introduced to eliminate the ambiguous eigenvectors of each point. For this purpose, we define two eigenratios:

δ12 = λ1/λ2,  δ23 = λ2/λ3.    (9)

For 3D symmetrical surfaces, the values of δ12 or δ23 will be equal to 1. The principal components of symmetrical surfaces are ambiguous. To get a discriminative hp, the values of δ12 and δ23 must be greater than 1. However, to manage noise we choose a threshold value θ > 1 + ε, where ε is a margin, and select only the discriminative eigenvectors as follows:

1. If δ12 > θ and δ23 > θ: hp = [h1^⊤ h2^⊤ h3^⊤]^⊤.
2. If δ12 ≤ θ and δ23 > θ: hp = [0^⊤ 0^⊤ h3^⊤]^⊤.
3. If δ12 > θ and δ23 ≤ θ: hp = [h1^⊤ 0^⊤ 0^⊤]^⊤.
4. If δ12 ≤ θ and δ23 ≤ θ: in this case, we discard p.
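These four cases map directly onto a small helper; a minimal sketch, assuming strictly positive eigenvalues (the function name is ours, and θ = 1.12 is the value reported in the experiments):

import numpy as np

def prune_descriptor(h1, h2, h3, lam, theta=1.12):
    # Keep only the unambiguous parts of h_p based on the eigenratios of Eq. (9).
    # h1, h2, h3: the three m-bin histograms of Eq. (7); lam: eigenvalues in decreasing order.
    d12, d23 = lam[0] / lam[1], lam[1] / lam[2]
    zero = np.zeros_like(h1)
    if d12 > theta and d23 > theta:
        return np.concatenate([h1, h2, h3])      # all three principal directions are reliable
    if d12 <= theta and d23 > theta:
        return np.concatenate([zero, zero, h3])  # only v3 is unambiguous
    if d12 > theta and d23 <= theta:
        return np.concatenate([h1, zero, zero])  # only v1 is unambiguous
    return None                                  # fully ambiguous: discard the point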


4.2 STKP: Spatio-Temporal Key-Point Detection

Consider a point p = (xt yt zt)^⊤ within a sequence of 3D pointclouds. In addition to the spatio-temporal support volume around p defined in Section 3, we further define a spatial-only support volume around p as the 3D points of Qt that fall inside a sphere of radius r centered at p. Thus, we perform PCA on both the spatial and the spatio-temporal scatter matrices, C′ and C.

Let λ′1 ≥ λ′2 ≥ λ′3 and λ1 ≥ λ2 ≥ λ3 represent the eigenvalues of the spatial (C′) and spatio-temporal (C) scatter matrices, respectively. We define the following ratios:

δ′12 = λ′1/λ′2,  δ′23 = λ′2/λ′3,  δ12 = λ1/λ2,  δ23 = λ2/λ3.    (10)

For a point to be identified as a potential keypoint, the condition δ12, δ23, δ′12, δ′23 > θ must be satisfied. This process prunes ambiguous points and produces a subset of candidate keypoints. It reduces the computational burden of the subsequent steps. Let h′p ∈ R^{3m} represent the spatial HOPC and hp ∈ R^{3m} represent the spatio-temporal HOPC. A quality factor is computed at each candidate keypoint p as follows:

ηp = (1/2) ∑_{i=1}^{3m} (h′p(i) − hp(i))² / (h′p(i) + hp(i)).    (11)

When h′p = hp, the quality factor has the minimum value of ηp = 0. It means that the candidate keypoint p has a stationary spatio-temporal support volume with no motion.

We define a locality as a sphere of radius r′ (with r′ ≤ r) and a time interval 2τ′ + 1 (with τ′ ≤ τ). We sort the candidate STKPs according to their quality values and, starting from the highest quality keypoint, remove all STKPs within its locality. The same process is repeated on the remaining STKPs. Fig. 2 shows the steps of our STKP detection algorithm. Fig. 3-a shows the extracted STKPs from three different views for a sequence of 3D pointclouds corresponding to the holding head action.
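A vectorized sketch of this ranking and locality-suppression step (our own illustration; the parameter names r_loc and tau_loc stand in for r′ and τ′):

import numpy as np

def stkp_select(positions, times, h_spatial, h_spatiotemporal, r_loc, tau_loc):
    # Rank candidate keypoints by the quality factor of Eq. (11), then keep the
    # highest-quality STKP in each locality and suppress the rest.
    num = (h_spatial - h_spatiotemporal) ** 2
    den = np.maximum(h_spatial + h_spatiotemporal, 1e-12)
    eta = 0.5 * np.sum(num / den, axis=1)          # quality of every candidate

    keep, suppressed = [], np.zeros(len(eta), dtype=bool)
    for i in np.argsort(eta)[::-1]:                # highest quality first
        if suppressed[i]:
            continue
        keep.append(i)
        close_space = np.linalg.norm(positions - positions[i], axis=1) <= r_loc
        close_time = np.abs(times - times[i]) <= tau_loc
        suppressed |= close_space & close_time     # remove candidates inside the locality
    return keep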

4.3 View-Invariant Key-Point Descriptor

Let p = (xt yt zt)^⊤ represent an STKP. All points within the spatio-temporal support volume of p, i.e., Ω(p), are aligned along the eigenvectors of its spatial scatter matrix, B = PV′, where P ∈ R^{np×3} is the matrix of points within Ω(p) and V′ = [v′1 v′2 v′3] denotes the 3 × 3 matrix of eigenvectors of the spatial scatter matrix C′. Recall that the signs of these eigenvectors have a 180° ambiguity. As mentioned earlier, we use the sign disambiguation method to overcome this problem. As a result, any feature (e.g., raw depth values or HOPC) extracted from the aligned spatio-temporal support volume around p will be view invariant.

In order to describe the points within the spatio-temporal support volume of keypoint p, the spatio-temporal support of p is represented as a 3D hyper-surface in the 4D space (X, Y, Z) and T. We fit a 3D hyper-surface to the aligned points within the spatio-temporal support volume of p. A uniform mx × my × mt grid is used to sample the hyper-surface, and its raw values are used as the descriptor of keypoint p.
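The alignment and grid sampling can be sketched as below; this is our own simplified stand-in, which averages the aligned Z values on the mx × my × mt grid instead of fitting a hyper-surface as the paper does, and the function name is hypothetical.

import numpy as np

def view_invariant_patch(points_xyzt, V_spatial, grid=(20, 20, 3)):
    # Align the support of a keypoint with its local PCA basis (B = P V') and
    # sample Z over an (X, Y, T) grid as a crude substitute for the hyper-surface fit.
    xyz = points_xyzt[:, :3] @ V_spatial          # rotate into the object-centred frame
    t = points_xyzt[:, 3]
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]

    def to_bins(v, m):                            # map values to grid indices 0..m-1
        lo, hi = v.min(), v.max()
        span = hi - lo if hi > lo else 1.0
        return np.minimum(((v - lo) / span * m).astype(int), m - 1)

    mx, my, mt = grid
    ix, iy, it = to_bins(x, mx), to_bins(y, my), to_bins(t, mt)
    acc = np.zeros(grid)
    cnt = np.zeros(grid)
    np.add.at(acc, (ix, iy, it), z)
    np.add.at(cnt, (ix, iy, it), 1)
    return (acc / np.maximum(cnt, 1)).ravel()     # raw sampled values as the descriptor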


Fig. 2. STKP: Spatio-Temporal Key-Point detection algorithm. For each point p, the spatial HOPC (from the spatial support of p) and the spatio-temporal HOPC (from the spatio-temporal support of p) are computed; ambiguous points are discarded, a quality factor is computed for the remaining candidates, and the L highest-quality points are selected as STKPs

We use the bag-of-words approach to represent each 3D pointcloud sequence and build a codebook by clustering the keypoint descriptors using K-means. Codewords are defined by the cluster centers and descriptors are assigned to codewords using Euclidean distance. For classification, we use SVM with the histogram intersection kernel [16].
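A sketch of this encoding and classification pipeline with scikit-learn (our own illustration; k = 1000 is the codebook size used in the experiments, and the variable names are ours):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def bow_histograms(descriptor_sets, k=1000, seed=0):
    # Build a k-word codebook from all keypoint descriptors and encode each
    # sequence as an L1-normalized bag-of-words histogram.
    codebook = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(np.vstack(descriptor_sets))
    hists = []
    for D in descriptor_sets:                      # one (n_i, d) array per sequence
        words = codebook.predict(D)                # nearest codeword (Euclidean distance)
        h = np.bincount(words, minlength=k).astype(float)
        hists.append(h / max(h.sum(), 1.0))
    return codebook, np.vstack(hists)

def histogram_intersection(A, B):
    # Histogram intersection kernel K(a, b) = sum_i min(a_i, b_i).
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

# Usage sketch: train an SVM on the precomputed intersection kernel.
# clf = SVC(kernel="precomputed").fit(histogram_intersection(H_train, H_train), y_train)
# pred = clf.predict(histogram_intersection(H_test, H_train))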

5 Adaptive Support Volume

So far we have used a fixed spatial (r) and temporal (τ) support volume to detect and describe each keypoint p. However, subjects can have different scales (in height and width) and perform actions with different speeds. Therefore, simply using a fixed spatial (r) and temporal (τ) support volume is not optimal. Large values of r and τ enable the proposed descriptors to encapsulate more information about the shape and motion of a subject. However, this also increases sensitivity to occlusion and action speed.

A simple approach to finding the optimal spatial scale (r) for an STKP is based on the subject's height (hs), e.g., r = e × hs, where e is a constant that is empirically chosen to make a trade-off between descriptiveness and occlusion. This approach is unreliable and may fail when a subject touches the background or is not in an upright position. Several automatic spatial scale detection methods [28] have been proposed for 3D object recognition.


Fig. 3. (a) STKPs projected onto the XY dimensions on top of all points within a sequence of 3D pointclouds corresponding to the holding head action (from three different views). Note that a large number of STKPs are detected only where movement is performed. (b) Sample pointclouds at different views from the UWA3D Multiview Activity dataset

In this paper, we use the automatic spatial scale detection method proposed by Mian et al. [17] to determine the optimal spatial scale for each keypoint. The optimal spatial scale (rb) is selected as the one for which the ratio between the first two eigenvalues of the spatial support of a keypoint reaches a local maximum. Our results show that the automatic spatial scale selection [17] achieves the same accuracy as the fixed scale when the height (hs) of each subject is available.
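A minimal sketch of this spatial scale selection, in the spirit of [17] (the candidate radius list and function name are our own assumptions):

import numpy as np

def spatial_scale(p, cloud, radii):
    # Pick the radius at which lambda1/lambda2 of the spatial support of p
    # reaches its first local maximum over a list of candidate radii.
    ratios = []
    for r in radii:
        nbrs = cloud[np.linalg.norm(cloud - p, axis=1) <= r]
        if len(nbrs) < 4:
            ratios.append(0.0)
            continue
        X = nbrs - nbrs.mean(axis=0)
        lam = np.sort(np.linalg.eigvalsh(X.T @ X / len(nbrs)))[::-1]
        ratios.append(lam[0] / max(lam[1], 1e-12))
    for i in range(1, len(radii) - 1):
        if ratios[i] > ratios[i - 1] and ratios[i] > ratios[i + 1]:
            return radii[i]                        # first local maximum of the eigenratio
    return None                                    # no clear local maximum found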

For temporal scale selection, most previous works [19,35,21,29,6] used a fixed number of frames. However, we propose automatic temporal scale selection to make our descriptor robust to action speed variations. Our proposed method follows the automatic spatial scale detection method by Mian et al. [17]. Let Q = {Q1, Q2, · · · , Qt, · · · , Qnf} represent a sequence of 3D pointclouds. For a point p = [xt yt zt]^⊤, we start with the points in [Qt−τ, · · · , Qt+τ] for τ = 1 which are within its spatial scale r (note that we assume r is the optimal spatial scale for p) and calculate the sum of the ratio between the first two eigenvalues (λ2/λ1) and the last two eigenvalues (λ3/λ2) as:

A^τ_p = λ2/λ1 + λ3/λ2,    (12)

where λ1 ≥ λ2 ≥ λ3. This process continues for all τ = 1, · · · , ∆, and the optimal temporal scale τ corresponds to the local minimum value of Ap found for point p. A point which does not have a local minimum is not considered as a candidate keypoint.
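The temporal counterpart can be sketched as follows (our own illustration; frames is assumed to be a list of per-frame point arrays and delta the maximum half-window ∆):

import numpy as np

def temporal_scale(p, frames, t, r, delta):
    # Grow the temporal window and keep the first local minimum of
    # A_p = lambda2/lambda1 + lambda3/lambda2, as in Eq. (12).
    A = []
    for tau in range(1, delta + 1):
        window = np.vstack(frames[max(0, t - tau): t + tau + 1])
        nbrs = window[np.linalg.norm(window - p, axis=1) <= r]
        if len(nbrs) < 4:
            A.append(np.inf)
            continue
        X = nbrs - nbrs.mean(axis=0)
        lam = np.sort(np.linalg.eigvalsh(X.T @ X / len(nbrs)))[::-1]
        A.append(lam[1] / max(lam[0], 1e-12) + lam[2] / max(lam[1], 1e-12))
    for i in range(1, len(A) - 1):
        if A[i] < A[i - 1] and A[i] < A[i + 1]:
            return i + 1                            # tau values start at 1
    return None                                     # no local minimum: p is not a candidate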

6 Experiments

The proposed algorithms were evaluated on three benchmark datasets including MSRAction3D [13], MSRGesture3D [31], and ActionPairs3D [19]. We also developed a new “UWA3D Multiview Activity” dataset to evaluate the proposed cross-view action recognition algorithm. This dataset consists of 30 daily activities of ten subjects performed at different scales and viewpoints (Subsection 6.4). For our algorithms, we used k = 1000, θ = 1.12, mx = my = 20 and mt = 3 in all experiments.


Fig. 4. Sample 3D pointclouds from the MSRAction3D, MSRGesture3D, ActionPairs3D, and UWA3D Multiview Activity datasets

To test the performance of our holistic approach, each sequence of 3D pointclouds was divided into 6 × 5 × 3 spatio-temporal cells along the X, Y, and T dimensions, respectively.

The performance of the proposed algorithms was compared with seven state-of-the-art methods including Histogram of Oriented Gradient (HOG3D) [10], Random Occupancy Pattern (ROP) [31], Histogram of 3D Joints (HOJ3D) [36], Actionlet Ensemble [32], Histogram of 4D Oriented Normals (HON4D) [19], Depth Spatio-Temporal Interest Points (DSTIP) [35], and Histograms of Depth Gradient (HDG) [21]. The accuracy is reported from the original papers or from the authors' implementations of DSTIP [35], HDG [21], HOG3D [10], and HON4D [19]. The implementation of HOJ3D [36] is not available, therefore we used our own implementation.

6.1 MSRAction3D dataset

MSRAction3D dataset [13] consists of 20 actions each performed by 10 subjects 2-3 times (Fig. 4). The dataset is challenging due to high inter-action similarities. To test our holistic approach, we used five subjects for training and five for testing and repeated the experiments 252 folds exhaustively as proposed by [19]. To show the effectiveness of our automatic spatio-temporal scale selection, we used four different settings using fixed and varying values of r and τ. Table 1 compares our algorithms with existing state-of-the-art. Note that the proposed algorithm outperformed all techniques under all four settings. The maximum accuracy was achieved using constant r and adaptive τ. Adaptive r did not improve results since there is little scale variation in this dataset. Note that HOJ3D [36], Moving Pose [40] and Actionlet [32] use skeleton data which is not always available.

We also evaluated our local method with automatic spatial and temporal scale selection and achieved 90.90% accuracy (subjects 1,3,5,7,9 used for training and the rest for testing). This is higher than 89.30% of DSTIP [35] and 88.36% of HON4D [19].


Table 1. Accuracy comparison on the MSRAction3D dataset. Mean ± STD is computed over 252 folds. Fold 5/5 means subjects 1,3,5,7,9 used for training and the rest for testing. (a) Moving Pose [40] used a different setting

Method                                   Mean±STD      Max     Min     5/5
HOJ3D [36]                               63.55±5.23    75.91   44.05   75.80
HOG3D [10]                               70.38±4.40    82.78   55.26   82.78
ROP [31]                                 -             -       -       86.50
Moving Pose [40]                         -             -       -       91.70(a)
Actionlet [32]                           -             -       -       88.20
HON4D [19]                               81.88±4.45    90.61   69.31   88.36
DSTIP [35]                               -             89.30   -       -
HDG [21]                                 77.68±4.97    86.13   60.55   83.70
Holistic HOPC (constant r, constant τ)   85.45±2.31    92.39   73.54   91.64
Holistic HOPC (adaptive r, constant τ)   84.78±2.89    91.64   72.41   90.90
Holistic HOPC (constant r, adaptive τ)   86.49±2.28    92.39   74.36   91.64
Holistic HOPC (adaptive r, adaptive τ)   85.01±2.44    92.39   72.94   91.27

Note that DSTIP [35] only reported the accuracy of the best fold and used additional steps such as mining discriminative features which can be applied to improve the accuracy of any descriptor. We did not include such steps in our method.

6.2 MSRGesture3D dataset

The MSRGesture3D dataset [31] contains 12 American sign language gestures performed 2-3 times by 10 subjects. For comparison with previous techniques, we use the leave-one-subject-out cross validation scheme proposed by [31]. Because of the absence of full body subjects (only hands are visible), we evaluate our methods in two settings only. Table 2 compares our method to existing state-of-the-art methods excluding HOJ3D [36] and Actionlet [32], since they require 3D joint positions which are not present in this dataset. Note that both variants of our method outperform all techniques by a significant margin, achieving an average accuracy of 96.23%, which is 3.5% higher than the nearest competitor HDG [21]. We also tested our local method with automatic spatial and temporal scale selection and obtained an accuracy of 93.61%.

6.3 ActionPairs3D dataset

The ActionPairs3D dataset [19] consists of depth sequences of six pairs of actions (Fig. 4) performed by 10 subjects. This dataset is challenging as each action pair has similar motion and shape. We used half of the subjects for training and the rest for testing as recommended by [19] and repeated the experiments over 252 folds. Table 3 compares the proposed holistic HOPC descriptor in two settings with existing state-of-the-art methods. Our algorithms outperformed all techniques, with a 2.23% improvement over the nearest competitor. Adaptive τ provides better improvement on this dataset compared to the previous two.


Table 2. Comparison with state-of-the-art methods on the MSRGesture3D dataset

Method                                   Mean±STD       Max    Min
HOG3D [10]                               85.23±12.12    100    50.00
ROP [31]                                 88.50          -      -
HON4D [19]                               92.45±8.00     100    75
HDG [21]                                 92.76±8.80     100    77.78
Holistic HOPC (adaptive r, constant τ)   95.29±6.24     100    83.67
Holistic HOPC (adaptive r, adaptive τ)   96.23±5.29     100    88.33

Table 3. Accuracy comparisons on the ActionPairs3D dataset. Mean±STD are computed over 252 folds. 5/5 means subjects 6,7,8,9,10 used for training and the rest for testing

Method                                   Mean±STD      Max     Min     5/5
HOJ3D [36]                               63.81±5.94    67.22   50.56   66.67
HOG3D [10]                               85.76±4.66    85.56   65.00   82.78
Actionlet [32]                           -             -       -       82.22
HON4D [19]                               96.00±1.74    100     91.11   96.67
Holistic HOPC (constant r, constant τ)   97.15±2.21    100     88.89   97.22
Holistic HOPC (constant r, adaptive τ)   98.23±2.19    100     88.89   98.33

We also evaluated our local method with automatic spatial and temporal scale selection and obtained 98.89% accuracy using subjects 6,7,8,9,10 for training and the rest for testing.

6.4 UWA3D Multiview Activity dataset

We collected a new dataset using the Kinect to emphasize three factors: (1) scale variations between subjects; (2) viewpoint variations; (3) all actions were performed in a continuous manner with no breaks or pauses. Thus, the start and end positions of the body for the same actions are different. Our dataset consists of 30 activities performed by 10 human subjects of varying scales: one hand waving, one hand punching, sitting down, standing up, holding chest, holding head, holding back, walking, turning around, drinking, bending, running, kicking, jumping, mopping floor, sneezing, sitting down (chair), squatting, two hand waving, two hand punching, vibrating, falling down, irregular walking, lying down, phone answering, jumping jack, picking up, putting down, dancing, and coughing (Fig. 4). To capture depth videos from the front view, each subject performed two or three random permutations of the 30 activities in a continuous manner. For cross-view action recognition, 5 subjects performed 15 activities from 4 different side views (see Fig. 3-b). We organized our dataset by segmenting the continuous sequences. The dataset is challenging due to self-occlusions and high similarity. For example, the drinking and phone answering actions have very similar motion and only the hand location in these actions is slightly different.


Table 4. Accuracy comparison on the UWA3D Activity dataset for same-view action recognition

Method                                   Mean±STD      Max     Min
HOJ3D [36]                               48.59±5.77    58.70   28.93
HOG3D [10]                               70.09±4.40    82.78   51.60
HON4D [19]                               79.28±2.68    88.89   70.14
HDG [21]                                 75.54±3.64    85.07   61.90
Holistic HOPC (constant r, constant τ)   83.77±3.09    92.18   74.67
Holistic HOPC (constant r, adaptive τ)   84.93±2.75    93.11   74.67

As another example, the lying down and falling down actions have very similar motion, but the speed of action execution is different. Moreover, some actions such as holding back, holding head, and answering phone have self-occlusions. The videos were captured at 30 frames per second at a spatial resolution of 640 × 480.

We evaluate our proposed methods in the same-view and cross-view action recognition settings. The holistic approach is used to classify actions captured from the same view, and the local approach is used for cross-view action recognition where the training videos are captured from the front view and the test videos from side views.

Same-view Action Recognition. We selected half of the subjects as training and the rest as testing and evaluated our holistic method in two settings: (1) constant r, constant τ; (2) constant r, adaptive τ. Table 4 compares our methods with existing state of the art. Both variants of our algorithm outperform all methods, achieving a maximum of 84.93% accuracy. The adaptive τ provides minor improvement because there is no explicit action speed variation in the dataset. To further test the robustness of our temporal scale selection (adaptive τ) to action speed variations, we use depth videos of actions performed by half of the subjects captured at 30 frames per second as training data and depth videos of actions performed by the remaining subjects captured at 15 frames per second as test data. The average accuracy of our method using automatic temporal scale selection was 84.64%, which is higher than the 81.92% accuracy achieved by our method using constant temporal scale and the 76.43% accuracy achieved by HON4D. Next, we swap the frame rates of the test and training data. The average accuracy of our method using automatic temporal scale selection was 84.70%, which is higher than the 81.01% accuracy achieved by our method using constant temporal scale. The accuracy of HON4D was 75.81% in this case.

Cross-view Action Recognition. In order to evaluate the STKP detector and HOPC descriptor for cross-view action recognition, we used front views of five subjects as training and side views of the remaining five subjects as test. Table 5 compares our method with existing state-of-the-art holistic and local methods for cross-view action recognition. Note that the performance of all other methods degrades when the subjects perform actions at different viewing angles. This is not surprising as existing methods assume that actions are observed from the same viewpoint, i.e., frontal.


Table 5. Cross-view action recognition on the UWA3D Multiview Activity dataset. Depth sequences of five subjects at 0° are used for training and the remaining subjects at 0° and 4 different side views are used for testing. Average accuracy is computed only for the cross-view scenario

                             View angle
Method                       0°      −25°    +25°    −50°    +50°    Average
Holistic Methods
HON4D [19]                   86.55   62.22   60.00   35.56   37.78   48.89
HDG [21]                     79.13   60.00   64.44   33.33   35.56   48.33
Local Methods
HOJ3D [36]                   63.34   60.00   62.22   37.78   40.00   50.00
DSTIP+DCSF [35]              80.80   66.67   71.11   35.56   40.00   53.33
STKP+hyper-surface fitting   87.39   81.33   82.67   71.11   71.11   76.56
STKP+HOPC                    91.79   86.67   88.89   75.56   77.78   82.23

For example, HON4D achieved 86.55% accuracy when the training and test samples were in the same view (frontal). The average accuracy of HON4D dropped to 48.89% when the training samples were captured from the front view and the test samples were captured from four different side views. We also observed that the performance of existing methods did not degrade only for actions like standing up, sitting down, and turning around. This is due to the distinctness of these actions regardless of the viewpoint.

We test two variants of our method. First, we apply our STKP detector on 3D pointcloud sequences and use the raw values of the fitted hyper-surface as features. The average accuracy obtained over the four different side views (±25° and ±50°) was 76.56% in this case. Next, we use the STKP detector combined with the proposed HOPC descriptor. This combination achieved the best average accuracy, i.e., 82.23%. Comparison with other methods and the accuracy of each method on different side views are shown in Table 5. These experiments demonstrate that our STKP detector in conjunction with the HOPC descriptor significantly outperforms state-of-the-art methods for cross-view as well as same-view action recognition.

7 Conclusion

Performance of current 3D action recognition techniques degrades in the presence of viewpoint variations across the test and the training data. We proposed a novel technique for action recognition which is more robust to action speed and viewpoint variations. A new descriptor, the Histogram of Oriented Principal Components (HOPC), and a keypoint detector are presented. The proposed descriptor and detector were evaluated for activity recognition on three benchmark datasets. We also introduced a new multiview public dataset and showed the robustness of our proposed method to viewpoint variations.

Acknowledgment

This research was supported by ARC Discovery Grant DP110102399.


References

1. UWA3D Multiview Activity dataset and Histogram of Oriented Principal Components Matlab code. http://www.csse.uwa.edu.au/∼ajmal/code.html (2014)
2. Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. In: ICCV (2005)
3. Campbell, L., Bobick, A.: Recognition of human body motion using phase space constraints. In: ICCV (1995)
4. Cheng, Z., Qin, L., Ye, Y., Huang, Q., Tian, Q.: Human daily action analysis with multi-view and color-depth data. In: ECCVW (2012)
5. Darrell, T., Essa, I., Pentland, A.: Task-specific gesture analysis in real-time using interpolated views. In: PAMI (1996)
6. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: ICCV (2005)
7. Farhadi, A., Tabrizi, M.K.: Learning to recognize activities from the wrong view point. In: ECCV (2008)
8. Farhadi, A., Tabrizi, M.K., Endres, I., Forsyth, D.A.: A latent model of discriminative aspect. In: ICCV (2009)
9. Gavrila, D., Davis, L.: 3D model-based tracking of humans in action: a multi-view approach. In: CVPR (1996)
10. Klaeser, A., Marszalek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: BMVC (2008)
11. Laptev, I.: On space-time interest points. In: IJCV (2005)
12. Li, R.: Discriminative virtual views for cross-view action recognition. In: CVPR (2012)
13. Li, W., Zhang, Z., Liu, Z.: Action recognition based on a bag of 3D points. In: CVPRW (2010)
14. Liu, J., Shah, M., Kuipers, B., Savarese, S.: Cross-view action recognition via view knowledge transfer. In: CVPR (2011)
15. Lv, F., Nevatia, R.: Single view human action recognition using key pose matching and Viterbi path searching. In: CVPR (2007)
16. Maji, S., Berg, A.C., Malik, J.: Classification using intersection kernel support vector machines is efficient. In: CVPR (2008)
17. Mian, A., Bennamoun, M., Owens, R.: On the repeatability and quality of keypoints for local feature-based 3D object retrieval from cluttered scenes. In: IJCV (2010)
18. Mitra, N.J., Nguyen, A.: Estimating surface normals in noisy point cloud data. In: SCG (2003)
19. Oreifej, O., Liu, Z.: HON4D: histogram of oriented 4D normals for activity recognition from depth sequences. In: CVPR (2013)
20. Parameswaran, V., Chellappa, R.: View invariance for human action recognition. In: IJCV (2006)
21. Rahmani, H., Mahmood, A., Mian, A., Huynh, D.: Real time action recognition using histograms of depth gradients and random decision forests. In: WACV (2014)
22. Rao, C., Yilmaz, A., Shah, M.: View-invariant representation and recognition of actions. In: IJCV (2002)
23. Seitz, S., Dyer, C.: View-invariant analysis of cyclic motion. In: IJCV (1997)
24. Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., Blake, A.: Real-time human pose recognition in parts from single depth images. In: CVPR (2011)
25. Syeda-Mahmood, T., Vasilescu, A., Sethi, S.: Recognizing action events from multiple viewpoints. In: IEEE Workshop on Detection and Recognition of Events in Video (2001)


26. Syeda-Mahmood, T., Vasilescu, A., Sethi, S.: Action recognition from arbitrary views using 3D exemplars. In: ICCV (2007)
27. Tang, S., Wang, X., Lv, X., Han, T., Keller, J., He, Z., Skubic, M., Lao, S.: Histogram of oriented normal vectors for object recognition with a depth sensor. In: ACCV (2012)
28. Tombari, F., Stefano, L.D.: Performance evaluation of 3D keypoint detectors. In: IJCV (2013)
29. Vieira, A.W., Nascimento, E., Oliveira, G., Liu, Z., Campos, M.: STOP: space-time occupancy patterns for 3D action recognition from depth map sequences. In: CIARP (2012)
30. Wang, H., Klaser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajectories. In: CVPR (2011)
31. Wang, J., Liu, Z., Chorowski, J., Chen, Z., Wu, Y.: Robust 3D action recognition with random occupancy patterns. In: ECCV (2012)
32. Wang, J., Liu, Z., Wu, Y., Yuan, J.: Mining actionlet ensemble for action recognition with depth cameras. In: CVPR (2012)
33. Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using motion history volumes. In: CVIU (2006)
34. Wu, S., Oreifej, O., Shah, M.: Action recognition in videos acquired by a moving camera using motion decomposition of Lagrangian particle trajectories. In: ICCV (2011)
35. Xia, L., Aggarwal, J.: Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera. In: CVPR (2013)
36. Xia, L., Chen, C.C., Aggarwal, J.K.: View invariant human action recognition using histograms of 3D joints. In: CVPRW (2012)
37. Yang, X., Tian, Y.: EigenJoints-based action recognition using Naive Bayes nearest neighbor. In: CVPRW (2012)
38. Yang, X., Zhang, C., Tian, Y.: Recognizing actions using depth motion maps-based histograms of oriented gradients. In: ACM ICM (2012)
39. Yilmaz, A., Shah, M.: Action sketch: a novel action representation. In: CVPR (2005)
40. Zanfir, M., Leordeanu, M., Sminchisescu, C.: The moving pose: an efficient 3D kinematics descriptor for low-latency action recognition and detection. In: ICCV (2013)