Combining Sparse and Dense Descriptors with Temporal Semantic Structures for Robust Human Action Recognition

Jie Chen, Guoying Zhao, Vili-Petteri Kellokumpu and Matti Pietikäinen
Machine Vision Group, Department of Computer Science and Engineering, P. O. Box 4500, FI-90014 University of Oulu, Finland
E-mail: {jiechen, gyzhao, kello, mkp}@ee.oulu.fi

2011 IEEE International Conference on Computer Vision Workshops



Abstract

Automatic categorization of human actions in the real world is very challenging due to the large intra-class differences. In this paper, we present a new method for robust recognition of human actions. We first cluster each video in the training set into temporal semantic segments using a dense descriptor. Each segment in the training set is represented by a concatenated histogram of sparse and dense descriptors. These segment histograms are used to train a classifier. In the recognition stage, a query video is also divided into temporal semantic segments by clustering. Each segment obtains a confidence score from the trained classifier. Combining the confidences of all segments, we classify the query video. To evaluate our approach, we perform experiments on two challenging datasets, i.e., the Olympic Sports Dataset (OSD) and the Hollywood Human Action dataset (HOHA). We also test our method on the benchmark KTH human action dataset. Experimental results confirm that our algorithm performs better than the state-of-the-art methods.

1. Introduction

Human action recognition has been widely applied in video surveillance, human computer interaction, video games and multimedia retrieval. It is challenging because of the large range of possible human motions and the variations in scene, clothing, scale, illumination, background, viewpoint and occlusion [1]. These issues make action recognition a complicated problem. Some example frames illustrating the variations of actions are shown in Fig. 1. These videos contain complex motions (e.g., basketball layup, platform diving, getting out of a car) and variations in scene, clothing, viewpoint and occlusion.

Most state-of-the-art approaches for human action recognition can be divided into two categories based on the frame-sequence descriptors used. One category of methods is based on sparse interest points [3, 6, 8, 9, 12, 13, 15, 19, 25]. The other is based on dense holistic features [7, 10, 24, 26, 27]. Both combine information over space and time to form a global representation, e.g., a bag of words or a space-time volume, and use a classifier to label the resulting representation.

The sparse descriptor is relatively robust to variations in viewpoint and scale and is suitable for modeling the actions themselves [8]. The dense descriptor, in contrast, is better suited to modeling the background, which is also informative for recognizing human actions because it carries contextual information [5, 21]. Taking advantage of both kinds of descriptors, we combine them to obtain a more discriminative and descriptive feature. Specifically, in our case, the sparse descriptors are used to characterize the motion and appearance of local features of actions, while the dense descriptors are used to model the background of a video, which provides contextual information for actions.

Fig. 1. Some example frames of human action videos from OSD [13] (the top two rows) and HOHA [9] (the bottom two rows).

Many interesting human activities are characterized by a complex temporal composition of simple actions. For instance, a sequence from the triple-jump class shows an approach phase, a hop and step phase, and a jump and landing phase. Automatic recognition of such complex actions can benefit from a good understanding of the temporal units of a video. Niebles et al. [13] achieve good performance by modeling the temporal structure of decomposable motion segments for activity classification. To divide a video into segments, they use a Latent Support Vector Machine (LSVM) to compute the best locations of the motion segment classifiers on each training video in an iterative manner. In most cases, this iterative algorithm requires careful initialization [13]. Instead, we use a simpler approach that divides each video into segments by clustering frames described by a dense descriptor.

In summary, in this paper we present a new method for robust human action recognition. The main steps are as follows:

Divide each video into temporal semantic segments by clustering instead of treating it as a whole. Before clustering, we represent each video by a dense descriptor. That is, we employ regular sampling of local patches and use the local binary pattern (LBP) to represent each patch due to its high discriminative power and computational efficiency [14].

Represent each temporal semantic segment by a combination of sparse and dense descriptors. For the sparse descriptors, we use the spatiotemporal extension of the Harris operator to detect the interest points of a video, and use the Histograms of Oriented Gradients (HOG) and Histograms of Optical Flow (HOF) descriptors to represent each interest point [9]. For the dense descriptors, we employ the LBP descriptor to represent each patch.

Train a classifier by the temporal semantic segments in the training set. For the query video, we also divide it into temporal segments and classify the video according to the confidence of each segment measured by the trained classifier.

The main contributions of our paper are as follows: 1) we use a simple LBP descriptor to divide each video into motion segments; 2) we combine LBP and HOG+HOF to represent each segment for human action recognition; 3) we obtain very promising results on OSD (80.0%) compared to Laptev et al. [9] (62.1%); and on HOHA (53.3%) compared to [9] (38.4%).

2. Descriptors for video

2.1 Sparse descriptor

For the sparse feature, we adopt spatiotemporal interest points. Each interest point is detected by the Harris3D operator [8] and described by the HOG and HOF descriptors [9]. The descriptors of the interest points of the training videos are clustered with the k-means method to obtain a descriptor codebook. Afterwards, each descriptor is indexed by the cluster center nearest to it in Euclidean distance. Using these quantized descriptors, we compute the histograms of each video and denote them as H_HOG and H_HOF, respectively. Likewise, for the test set, we compute the same descriptors (HOG/HOF) of each interest point and vector quantize them. In our case, for the descriptors (HOG/HOF) of each interest point, we use Laptev's publicly available code [9].
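As a concrete illustration, a minimal sketch of this bag-of-words step is given below. It is not the authors' implementation: it uses scikit-learn's k-means and hard (nearest-center) assignment, whereas Section 2.4 replaces the hard assignment with soft vector quantization; the input is assumed to be a NumPy array with one HOG or HOF descriptor per row.

import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_descriptors, k=100):
    # Cluster the HOG/HOF descriptors of the training videos with k-means;
    # the cluster centers are the codewords.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(train_descriptors)
    return km.cluster_centers_

def hard_vq_histogram(video_descriptors, codebook):
    # Index each descriptor by its nearest codeword (Euclidean distance)
    # and count the occurrences of each codeword over the video.
    dists = np.linalg.norm(video_descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)   # normalized histogram (H_HOG or H_HOF)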

2.2 Dense descriptor

For the dense feature, we use the LBP descriptor [14]. The basic form of LBP is illustrated in Fig. 2. The operator takes as input a local neighborhood around each pixel and thresholds the neighborhood pixels at the value of the central pixel. The resulting binary-valued string is then weighted as follows:

LBP(I_c) = \sum_{i=0}^{p-1} 2^i \, s(I_i - I_c),   (1)

where the index i runs over the p neighbors of the central pixel I_c, I_c and I_i are the gray-level values at c and i, and s(A) is 1 if A ≥ 0 and 0 otherwise. In our case, we set p = 8.

To compute the LBP descriptor for each video, we first compute the LBP feature for each pixel of each frame. We then divide each frame into patches with 50% overlap between neighboring patches, and set the width (w_p) and height (h_p) of each patch to 32 pixels. For each patch, we compute its 256-bin LBP histogram descriptor. These LBP descriptors are vector quantized by computing memberships with respect to a descriptor codebook, which is obtained by k-means clustering of the LBP descriptors in the training set. Using these vector quantized descriptors, we compute a histogram for each video counting the occurrences of each codebook element and denote it as H_LBP.

Fig. 2. The basic idea of LBP.
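For concreteness, a minimal sketch of Eq. (1) with p = 8 and of the 256-bin patch histogram is shown below. It is an illustrative implementation rather than the authors' code, and it assumes grayscale frames stored as 2-D NumPy arrays.

import numpy as np

def lbp_image(gray):
    # Basic 8-neighbor LBP of Eq. (1): compare each neighbor with the central
    # pixel and weight the resulting bits by powers of two.
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]                                    # central pixels (interior only)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code += (neighbor >= c).astype(np.int32) << bit  # s(I_i - I_c) * 2^i
    return code                                          # values in [0, 255]

def patch_lbp_histogram(gray_patch):
    # 256-bin LBP histogram of one 32x32 patch, normalized to sum to one.
    hist = np.bincount(lbp_image(gray_patch).ravel(), minlength=256).astype(float)
    return hist / max(hist.sum(), 1.0)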

2.3 Combining the sparse and dense descriptors

To make the feature more discriminative, we concatenate the (HOG/HOF) and LBP histograms into one histogram for each video:

H = [H_{HOG}, H_{HOF}, H_{LBP}].   (2)

Note that, before concatenation, the HOG/HOF and LBP histograms are normalized (to sum to one) respectively. This normalization balances the different temporal lengths of different videos.
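A minimal sketch of this concatenation (illustrative only) is:

import numpy as np

def combined_histogram(h_hog, h_hof, h_lbp):
    # Eq. (2): normalize each component histogram to sum to one, then concatenate.
    def normalize(h):
        h = np.asarray(h, dtype=float)
        return h / max(h.sum(), 1e-12)
    return np.concatenate([normalize(h_hog), normalize(h_hof), normalize(h_lbp)])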

2.4 Codebook and soft vector quantization

One challenge for the clustering of the HOG/HOF/LBP descriptors is that the number of descriptors is too large to be clustered directly (for example, the number of LBP descriptors might be about 20,000,000 or even more for a dataset). Given a dataset containing q classes {ω1, ω2, …, ωq}, to compute the codebook in this case, we use the following approximate method:

(1) Normalize each video to be maximally 320 pixels in width or height;

(2) Randomly select maximally 200,000 descriptors of each class ωi to perform the clustering process;

(3) Concatenate the cluster centers of each class together and generate the codebook for the whole dataset.

After obtaining the codebook, we use soft vector quantization (SVQ) [2] to represent the descriptors by computing memberships. During model training and matching, we compute histograms of codebook memberships over a given video. For example, we denote by H the histogram of codebook memberships for HOG/HOF/LBP of a video. For a descriptor x (of an interest point for HOG/HOF, or of a local patch for LBP), we compute its K_vq nearest cluster centers in the codebook, i.e., x_{c_m}, m = 1, …, K_vq; the similarity between x and x_{c_m} is s_m. The normalized similarity sn_m is computed as follows:


sn_m = \frac{s_m}{\sum_{j=1,\ldots,K_{vq}} s_j}.   (3)

The binning of the descriptor x over H is as follows:

H_m = H_m + sn_m, \quad m = 1, \ldots, K_{vq},   (4)

where H_m is the m-th bin of H. Eq. (4) means that the descriptor x is binned into K_vq bins with the different values sn_m. We set K_vq = 10 experimentally.
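The sketch below illustrates Eqs. (3) and (4). Since the paper does not spell out the similarity function s_m, a Gaussian of the Euclidean distance is assumed here purely for illustration.

import numpy as np

def svq_histogram(descriptors, codebook, k_vq=10, sigma=1.0):
    # Soft vector quantization: each descriptor votes into its K_vq nearest
    # codebook bins with normalized similarities (Eqs. (3)-(4)).
    hist = np.zeros(len(codebook))
    for x in descriptors:
        dists = np.linalg.norm(codebook - x, axis=1)
        nearest = np.argsort(dists)[:k_vq]                       # K_vq nearest centers
        sims = np.exp(-dists[nearest] ** 2 / (2 * sigma ** 2))   # assumed similarity s_m
        hist[nearest] += sims / max(sims.sum(), 1e-12)           # sn_m, binned into H
    return hist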

3. Semantic segments for video

3.1 Exploiting LBP for semantic segments

A complex human action can be divided into several sub-actions. An example video is shown in Fig. 3 (a). It is an instance of the basketball-layup class, and we divide it into three sub-actions: dribbling the ball, the layup, and getting the ball after the shot. To divide a video into sub-actions, we use a very simple clustering-based method.

Specifically, we compute a feature for each frame and then cluster the frames. For the per-frame feature, we use the same method as presented in Section 2.2. Given a video v = {f_1, …, f_m}, where f_i denotes the i-th frame of v and m is the number of frames, we compute the LBP feature of every pixel of f_i and divide f_i into patches f_i = {p_1, …, p_n}, where n is the number of patches and neighboring patches overlap by 50%. Both the width (w_p) and height (h_p) of a patch are set to 32 pixels. For each patch p_j, we compute its LBP histogram. Combining the LBP histograms of all patches of all frames, we compute a codebook for the video v by k-means clustering. After obtaining the codebook for the video, we quantize the LBP histogram of each patch p_j of f_i by SVQ. Using the quantized codes of the p_j, we build a histogram H_i for f_i.

After computing the histogram H_i for each frame f_i of the video v, we cluster the frames using H_i (i = 1, …, m) with the k-means method to divide the video v into segments. An example clustering for one video of the basketball-layup class is shown in Fig. 3 (a). The first row illustrates the clustering result, where the x-axis is the frame index and the y-axis is the cluster index. It shows that neighboring frames are clustered into one segment because they are more similar to each other than to non-neighboring frames. The following three rows show the three semantic segments, corresponding to the three sub-actions: dribbling the ball, the layup, and getting the ball after the shot. In Fig. 3 (b) and (c), we show two more examples of temporal semantic segments. Although the obtained motion segments in Fig. 3 do not match the sub-actions of the sports exactly (because some sub-actions, e.g., the hop and step phase of the triple-jump, are temporally much shorter than others, e.g., the approach phase), the resulting motion segments are sensible from the point of view of segmentation.

Note that the LBP codebook used for segmentation in this section is different from the LBP codebook used for classification in Section 2.4. In Section 2.4, the codebook is computed over all the videos of one class. In this section, the codebook is computed over only the current video. This per-video codebook is much simpler, because the videos within one class can be very different, which would introduce noise into the vector quantization of the current video.
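As a minimal sketch (not the authors' implementation), the per-frame histograms can be clustered with k-means and runs of consecutive frames sharing a cluster label can then be grouped into temporal segments, assuming frame_histograms is an (m, d) array with one row per frame:

import numpy as np
from sklearn.cluster import KMeans

def temporal_segments(frame_histograms, n_segments):
    # Cluster the frame histograms H_i and split the video wherever the
    # cluster label of consecutive frames changes.
    labels = KMeans(n_clusters=n_segments, n_init=10,
                    random_state=0).fit_predict(frame_histograms)
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start, i - 1))   # inclusive 0-based frame range
            start = i
    return segments   # e.g. [(0, 47), (48, 85), (86, 129)] for a video like Fig. 3 (a)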

Fig. 3. Example videos from [13] divided into semantic segments. (a) A video of the basketball-layup class is divided into three segments (frames 1-48, 49-86 and 87-130): dribbling the ball, the layup, and getting the ball after the shot. The numbers in the left column give the frame range of each segment. (b) and (c) Two videos of the triple-jump class are divided into four segments each (frames 1-77, 78-151, 152-225, 226-293 and 1-106, 107-209, 210-237, 238-305, respectively), corresponding roughly to a slow approach phase, a fast approach phase, a hop and step phase, and a jump phase.


3.2 Number of segments

One issue for the clustering is the number of cluster centers, which corresponds to the number of semantic segments of a video. To determine the number of segments, we plot the within-cluster sums of point-to-centroid distances against the number of clusters. One example is shown in Fig. 4, where the x-axis is the number of clusters (segments) and the y-axis is the within-cluster sum of point-to-centroid distances over all clusters. We take the elbow of this curve as the number of segments of the video.

Fig. 4. The within-cluster sums of point-to-centroid distances over all clusters as a function of the number of clusters (one video of the basketball-layup class [13]).

Fig. 5. Results of clustering the frames of one video several times (cluster index versus frame index for four runs).

Here, we use a simple method to obtain a proxy of the elbow of the curve. Specifically, we approximate the point of maximum curvature by the absolute second derivative. Given a set of discrete points {(x_i, y_i)}, the second derivative can be approximated with the central difference y''_i = y_{i+1} + y_{i-1} - 2 y_i, and the index of the elbow point is

i^{*} = \arg\max_i |y''_i|.

Although there are other methods to compute this point, e.g., [20], this approach is very fast and works well, as shown in Section 4.
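A minimal sketch of this elbow estimate is shown below; it is illustrative only and uses scikit-learn's inertia_ (the within-cluster sum of squared point-to-centroid distances) as a proxy for the distance sums plotted in Fig. 4.

import numpy as np
from sklearn.cluster import KMeans

def number_of_segments(frame_histograms, k_max=10):
    # Run k-means for k = 1..k_max, record the within-cluster distance sums,
    # and return the k with the largest absolute central second difference.
    ks = list(range(1, min(k_max, len(frame_histograms)) + 1))
    if len(ks) < 3:
        return len(ks)        # too few frames to estimate an elbow
    y = np.array([KMeans(n_clusters=k, n_init=10, random_state=0)
                  .fit(frame_histograms).inertia_ for k in ks])
    second_diff = y[2:] + y[:-2] - 2.0 * y[1:-1]   # y''_i = y_{i+1} + y_{i-1} - 2 y_i
    return ks[1 + int(np.argmax(np.abs(second_diff)))]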

Another issue of k-means clustering is that the results may differ slightly between runs because the initial centroids are selected randomly. As shown in Fig. 5, we perform the clustering several times on one video. From this figure, one can see that the video sequence is consistently divided into three semantic segments. The slightly different start and end points of each segment across runs are due to the inherent temporal ambiguity of actions [16].

3.3 Classification

Intuitively, the classification in our case proceeds as follows: divide the video samples in both the training and testing sets into semantic segments by clustering, and then use the segments in the training set to train a classifier. The trained classifier is used to evaluate the video segments in the testing set with probability estimates. Combining the probability estimates of the segments from one video, we classify that video.

Specifically, given a training set S_train = {(v_i, y_i) | v_i ∈ R^n, y_i ∈ R, i = 1, …, N}, where N is the number of videos and y_i ∈ Y = {y_k | k = 1, …, K} is the class label of v_i, we divide each video v_i into a set of segments v_i = {π_{i,1}, π_{i,2}, …, π_{i,L_i}}, where L_i is the number of segments of video v_i, computed by the elbow-of-the-curve method shown in Fig. 4. For each segment π_{i,t}, we compute the feature H_{i,t} (i = 1, …, N; t = 1, …, L_i) of space-time word occurrences as shown in Eq. (2). Combining all the {H_{i,t}}, we train an SVM classifier.

In our case, we use a non-linear Support Vector Machine (SVM) with an RBF kernel and the LIBSVM library [4]. The best parameters C and γ are chosen by a 5-fold cross-validated grid search on the training data, and the one-against-one approach is used for multi-class classification.
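A minimal sketch of this training step using scikit-learn's SVC (which wraps LIBSVM) rather than the LIBSVM command-line tools is given below; the parameter grid is an assumption made for illustration.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_segment_classifier(segment_histograms, segment_labels):
    # 5-fold cross-validated grid search over C and gamma for an RBF-kernel SVM;
    # probability estimates are enabled so that Eq. (5) can be evaluated later.
    param_grid = {'C': 2.0 ** np.arange(-3, 8), 'gamma': 2.0 ** np.arange(-10, 2)}
    grid = GridSearchCV(SVC(kernel='rbf', probability=True), param_grid, cv=5)
    grid.fit(np.asarray(segment_histograms), np.asarray(segment_labels))
    return grid.best_estimator_   # multi-class handled one-against-one internally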

For each video u_j in the testing set S_test = {u_j | j = 1, …, M}, with M being the number of videos, we also divide u_j into segments u_j = {π_{j,r} | r = 1, …, L_j}, where L_j is the number of segments of video u_j. Likewise, we compute the feature H_{j,r} (r = 1, …, L_j) as shown in Eq. (2). We then compute P(y_{j,r} | H_{j,r}, S_train) for classification:

P(y_{j,r} \mid H_{j,r}, S_{train}) = F_{SVM}(H_{j,r}),   (5)

where F_SVM is the trained SVM classifier above, and the output P(y_{j,r} | H_{j,r}, S_train) consists of the probability estimates for H_{j,r}.

Thus, for u_j = {π_{j,r} | r = 1, …, L_j}, we have Y_j = {y_{j,r} | r = 1, …, L_j} and P_j = {P(y_{j,r}) | r = 1, …, L_j}. Let

Y'_j = F_{unique}(Y_j) = \{y'_{j,q} \mid q = 1, \ldots, Q\},   (6)

where F_unique returns the values of Y_j without repetitions and Q is the number of elements of Y'_j, i.e., the number of different class labels among the segments of u_j, with 0 < Q ≤ K.

For each y'_{j,q}, we compute its confidence as the sum of probability estimates over all the segments of u_j = {π_{j,r}}:

\Xi(y'_{j,q}) = \sum_{r=1}^{L_j} P(y'_{j,q} \mid H_{j,r}),   (7)

where H_{j,r} is the histogram of π_{j,r} and P(y'_{j,q} | H_{j,r}) is the probability estimate for π_{j,r} computed by Eq. (5). The predicted class of u_j is then

\hat{y}_j = \arg\max_{y'_{j,q}} \Xi(y'_{j,q}).   (8)

As discussed above, we use the video segments in the training and testing sets for training and action classification. For a video u_j ∈ S_test with u_j = {π_{j,r} | r = 1, …, L_j}, we argue that different segments of u_j play different roles in classification. For example, both the triple-jump and the long-jump contain an approach-run phase, which plays a less important role for classification than the characteristic phase (the hop and step phase of the triple-jump). Correspondingly, the probability estimate P(y | H) of the approach-run phase of a triple-jump is smaller than that of its hop and step phase. That is to say, Eqs. (7) and (8) mean that we classify the query video u_j ∈ S_test by the weighted votes of the different sub-actions of u_j, where


the characteristic sub-actions play more important roles for this task.
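A minimal sketch of this voting step is shown below. For simplicity it accumulates the probability estimates of all classes rather than only the distinct per-segment predictions of Eq. (6); taking the argmax of the accumulated confidences then realizes Eqs. (7) and (8).

import numpy as np

def classify_video(classifier, segment_histograms):
    # segment_histograms: one combined histogram (Eq. (2)) per segment of u_j.
    probs = classifier.predict_proba(np.asarray(segment_histograms))  # shape (L_j, K)
    confidence = probs.sum(axis=0)                                    # Eq. (7)
    return classifier.classes_[int(np.argmax(confidence))]           # Eq. (8)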

3.4 Discussion

In this section, we discuss two issues. The first is why we use LBP to divide a video into semantic segments and why we combine the dense descriptor (LBP) with the sparse descriptor (HOG+HOF) to represent the video sequences. The second is the difference between related work and our method.

To divide a video into semantic segments, one option is to use the sparse descriptor (HOG+HOF) as in [13]. However, the number of interest points of a video might be too small for this task. For example, the video sequence UdE8qAHGi9U_00154_00228.avi of the basketball-layup class in the OSD has a resolution of 352×262 pixels and 75 frames. Using the code of [9], we obtain 164 interest points, i.e., about 2 interest points per frame. An intuitive idea is to find a different descriptor for this task. One option is LBP, which has shown high discriminative power and computational efficiency [14]. As mentioned in Section 2.2, by dividing each frame into patches we obtain 247 patches per frame, i.e., 247 LBP descriptors, which is much larger than the number of sparse descriptors computed from interest points (about 2 per frame). Furthermore, LBP performs this task very well, as shown in Section 3.2. Having computed the LBP descriptors to divide each video into semantic segments, a straightforward idea is to combine the sparse descriptor (HOG+HOF) with the already available dense descriptor (LBP). Similarly, by combining HOG and LBP, Wang et al. obtained good performance in human detection [23].

One question about using LBP to divide a video into segments is how to guarantee that consecutive frames are clustered together. The difference between neighboring frames is usually very small, so they tend to fall into the same segment. However, consecutive frames may not cluster together if there is periodic motion or noise in the video, e.g., when new objects appear. Non-consecutive frames showing periodic motion that fall into the same cluster do not change the performance. If, due to noise, a resulting video segment consists of a run of consecutive frames plus several non-consecutive frames, the LBP histogram descriptor does not change significantly, and so neither does the performance. If the interruption caused by an object is long enough, those frames are divided into a separate segment, which also does not change the performance according to Eqs. (7) and (8).

The methods in [13], [16] and [11] are related to the proposed method. Niebles et al. [13] used an LSVM to compute the best locations of the motion segment classifiers. In most cases, this iterative algorithm requires careful initialization [13]. In addition, for each video they used K motion segment classifiers with a fixed value of K, manually predefined before training. Furthermore, Niebles et al. used histograms of vector-quantized spatiotemporal interest points to represent the appearance of a video for the similarity measure when modeling the temporal structure with the LSVM; a problem is that the number of interest points might be too small, as mentioned above. In contrast, we use a simple method that divides each video into segments by clustering with the LBP descriptor. Moreover, the number of motion segments adapts to each video and is computed from the elbow of the curve as shown in Fig. 4. Furthermore, by dividing each frame into patches, we obtain a sufficient number of LBP descriptors per frame compared to the number of sparse descriptors computed from interest points. We also combine the sparse descriptor (HOG+HOF) and the dense descriptor (LBP) to represent a video, whereas Niebles et al. represented a video by the sparse descriptor (HOG+HOF) only.

Satkin and Hebert [16] presented a framework for estimating which portions of videos are most discriminative for action recognition. They explored the impact of the temporal cropping of training videos on the overall accuracy of an action recognition system and formalized what makes a set of croppings optimal. Different from Satkin and Hebert [16], we divide each video into several semantic segments adaptively. In other words, our method provides a much finer description granularity (several semantic segments) than that of Satkin and Hebert (one optimal cropping). Furthermore, in our case, the different sub-actions (i.e., different segments) of the query video play different roles in classification through the weighted votes of Eqs. (7) and (8).

Liu et al. [11] also used multiple features for human action recognition. Specifically, they combined gradient-based descriptors for the interest points of spatio-temporal volumes with spin-image features for the 3D action volume generated from the contours obtained by background subtraction. Different from them, we use a different sparse descriptor (HOG+HOF) and a different dense descriptor (LBP) computed over whole frames instead of 3D action volumes generated from contours.

4. Experiments

To test our method, we consider three experimental scenarios. First, we test the robustness of our approach on the challenging OSD, which is collected from YouTube sequences [13] and contains complex motions. Second, we test the effectiveness of our model on HOHA [9], which is collected from thirty-two movies. Last, we evaluate our method on the KTH human action dataset [17] to test the ability of our method to classify simple motions on a benchmark dataset.

4.1 Datasets

The OSD [13] contains 16 sport classes: high jump, long jump, triple jump, pole vault, discus throw, hammer throw, javelin throw, shot put, basketball layup, bowling, tennis serve, platform (diving), springboard (diving), snatch (weightlifting), clean and jerk (weightlifting) and vault (gymnastics). The sequences, obtained from YouTube, contain severe occlusions, camera movements, compression artifacts, etc. The sport activities depicted in the dataset contain complex motions. Some example frames are shown in Fig. 1. Following the setup in [13], we randomly select 10 video sequences from each class for testing and use the other sequences for training. In this fashion, the videos belonging to the training set and to the testing set are disjoint. The results are reported as the average value and standard deviation over ten runs to avoid bias.

The HOHA dataset [9] is collected from thirty-two movies, e.g., “American Beauty”, “Big Fish” and “Forrest Gump”. It contains eight types of actions: “AnswerPhone”, “GetOutCar”, “HandShake”, “HugPerson”, “Kiss”, “SitDown”, “SitUp” and “StandUp”. The dataset provides two types of annotations, the “automatic annotation” and the “clean annotation”. The “automatic annotation” labels the clips using the movie scripts, while the “clean annotation” labels the clips manually. In the experiments, we use the “clean annotation”, which contains 219 training clips and 211 testing clips. Some example frames are shown in Fig. 1.

The KTH dataset [17] is used to test the ability of our method to classify simple motions. The dataset contains 2,396 video sequences covering 6 types of actions performed by 25 actors. We follow the experimental settings described in [17].

4.2 Parameters of our method

There are two parameters involved in our method. The first is the dimensionality of the codebook for each descriptor. The second is the number of segments. In addition, we also discuss the time cost of dividing a video into segments with LBP.

Dimensionality of codebook. For each video class, we obtain 100 words by clustering. Thus, the OSD codebook for each descriptor has a dimensionality of 16×100 = 1600. Combining the three descriptors (HOG+HOF+LBP), the dimensionality of each histogram for each video (or segment) is 4,800. Likewise, the codebooks of the HOHA and KTH datasets have dimensionalities of 2,400 and 1,800, respectively.

Number of segments. We test the influence of the number of segments on the recognition performance over OSD. The performance comparison is shown in Fig. 6. Here, "original" means that we do not divide the videos and use the original whole videos for training and testing; the other two curves show the performance variations as the number of segments changes. The performance on the training set is obtained by the cross-validation of LIBSVM, and the performance on the test set is obtained by evaluating the trained classifier on the test set.

In Fig. 6, the zero point on the x-axis denotes the elbow of the curve. Moving leftward from this point, the number of segments decreases by one per step; moving rightward, it increases by one per step. For OSD, we obtain on average four segments (2,560 motion segments for the 633 videos in the training set). For instance, "-4" means we have one segment on average (in other words, only a few videos are divided into two segments and the others are not divided). From this figure one can see that the performance of our method changes little when the number of segments per video varies around the elbow of the curve.

Fig. 6. Accuracy on the training set, on the test set and for the original (undivided) videos with respect to the number of segments.

Table 1. Comparison of time cost (per frame)

Feature    LBP feature   LBP encoding (SVQ)   HOG   HOF
Time (s)   0.028         0.012                0.4   0.38

One interesting observation from Fig. 6 is that when the number of segments is at "-4", the performance on the test set is about 2.3% higher than the "original" one. An explanation is that a few videos are divided into two segments, so the discriminative parts of these videos are separated out and play a more important role in recognition. This in turn improves the performance of the classifier according to Eqs. (7) and (8). Satkin and Hebert reached a similar conclusion [16].

In addition, Table 1 shows the time cost of using LBP for dividing videos. We again use OSD as the test set, and each video is normalized to be maximally 320 pixels in width or height. The experiments are performed on a 2.39 GHz Intel Core machine with 2 GB RAM, executing C/C++ code. Computing the LBP feature over one frame takes 0.028 s; the encoding (SVQ) costs 0.012 s per frame using heap sorting; computing the HOG/HOF features for one frame takes 0.4 s/0.38 s, respectively. From this table, one can see that using LBP for dividing videos is quite fast.

4.3 Experiments on the Olympic Sports Dataset

Table 2 shows the classification results of our algorithm. We compare the performance of our method to the multi-channel method of [9] and the temporal structure modeling method of [13]. In [9], Laptev et al. incorporated rigid spatiotemporal binnings and captured a rough temporal ordering of features. In [13], Niebles et al. presented a framework for modeling motion by exploiting the temporal structure of human activities. From this table, one can see that our method outperforms the state-of-the-art methods.

In Fig. 7, we show the performance of the different components tested on the OSD dataset. In this figure, HOG-no-SVQ and HOG+HOF-no-SVQ denote that the vector quantization of HOG and HOG+HOF uses only the closest codeword (i.e., K_vq = 1). The remaining variants all use the SVQ method for vector quantization (i.e., K_vq = 10). Specifically, HOG, HOF and HOG+HOF denote that we use the descriptors computed from the interest points (i.e., sparse descriptors) for classification. LBP denotes that we use the LBP descriptor computed from each patch of each frame (i.e., the dense descriptor) for classification. HOG+HOF+LBP denotes that we combine the sparse and dense descriptors for classification. Finally, HOG+HOF+LBP+Seg denotes that we use LBP to divide each video into segments and then use HOG+HOF+LBP as descriptors for classification.

One can see that the SVQ method improves the performance of HOG by 5% and that of HOG+HOF by 5.9%, and that LBP achieves performance comparable to HOG+HOF. Combining LBP and HOG+HOF, we obtain a gain of about 8%. Furthermore, HOG+HOF+LBP+Seg improves the performance by a further 5% or so. This means that the sparse descriptors (HOG+HOF) and the dense descriptor (LBP) are complementary for the task of action recognition. In addition, dividing each video into semantic segments is also helpful for this task.

From Fig. 7, one can see that the LBP descriptor alone also achieves good performance. One explanation is that the dense descriptor (LBP) models the background well, which is informative for recognizing human actions because it provides useful context information. Another reason is that the LBP descriptor captures the texture variations in the video sequence: similar LBP histograms of local patches fall into the same cluster, and LBP histograms close to them fall into nearby clusters. Using the codebook and SVQ methods of Section 2.4, each descriptor x is binned into K_vq bins with different weights, so the bins corresponding to close clusters in the clustering space capture the texture variations.

Although the videos of different actions can have the same background and the same action can happen in completely different environments, the dense LBP descriptor in our case models both the background and the foreground (containing the human action). In Fig. 7, LBP performs very well (68.2%) compared to HOG+HOF (67.7%) on OSD [13], which is exactly such a dataset. This is because each frame is divided into patches as described in Section 2.2: some patches come from the background and others from the foreground, and the same holds for the words in the codebook. In the final LBP histogram descriptor, some bins describe the foreground and others the background. In addition, as shown in the second and third rows of Fig. 3 (b), the background changes but those frames (78-151 and 152-225) are still assigned to the same segments, respectively. This also provides evidence that the LBP descriptor models both background and foreground, allowing LBP to handle the case where the background changes but the sub-action does not. Moreover, dense descriptors and context modeling have previously been found useful for action recognition, e.g., in [18, 21, 26].

Fig. 7. Accuracy of the different components of our method over OSD (mean %, standard deviation in parentheses): HOG-no-SVQ 57.5 (0.4), HOG+HOF-no-SVQ 61.8 (0.3), HOG 62.5 (0.4), HOF 56.3 (0.5), LBP 68.2 (0.4), HOG+HOF 67.7 (0.3), HOG+HOF+LBP 75.2 (0.3), HOG+HOF+LBP+Seg 80.0 (0.2).

4.4 Hollywood and KTH human action datasets

The average recognition accuracies of our method and of the state-of-the-art methods on HOHA are shown in Table 3. From this table, one can see that our method obtains state-of-the-art results on this dataset as well.

Experimental results on the KTH dataset are shown in Table 4. A direct comparison is possible with the methods that follow the same experimental setup [17]. We note that our method also obtains very competitive results.

5. Conclusion

In this paper we present a new method for human action recognition that combines sparse descriptors (HOG+HOF) over interest points with dense descriptors (LBP) over local patches. Each video is divided into several semantic segments according to its LBP descriptors. Through this segmentation, we obtain the characteristic motion segments of a video, which play an important role in the voting for classification and in turn improve the performance of our method. We test our method on two challenging datasets collected from the real world, i.e., the Olympic Sports Dataset and the Hollywood Human Action dataset. The experimental results show that our method outperforms the state-of-the-art methods. We also test our method on the benchmark KTH dataset and find that it works very well there too. Future work includes learning weights for the different motion segments for classification and developing a method to measure the consistency of the segments obtained by clustering.


Table 2. Average Precision of action classification on the Olympic Sports Dataset

Sport class          Our method   Niebles et al. [13]   Laptev et al. [9]
high-jump            85           68.9                  52.4
long-jump            94           74.8                  66.8
triple-jump          51           52.3                  36.1
pole-vault           85           82.0                  47.8
gymnastics-vault     85           86.1                  88.6
shot-put             100          62.1                  56.2
snatch               66           69.2                  41.8
clean-jerk           65           84.1                  82.3
javelin-throw        55           74.6                  61.1
hammer-throw         60           77.5                  65.1
discus-throw         85           58.5                  37.4
diving-platform      100          87.2                  91.5
diving-springboard   85           77.2                  80.7
basketball-layup     95           77.9                  75.8
bowling              85           72.7                  66.7
tennis-serve         85           49.1                  39.6
Average (AP)         80.0         72.1                  62.0

Table 3. Comparison of our method to some existing methods on the HOHA dataset

Method                    Accuracy (%)
Laptev et al. [9]         38.4
Rapantzikos et al. [15]   33.6
Sun et al. [19]           47.1
Wang et al. [22]          39.5
Yeffet and Wolf [26]      43.8
Our method                53.3

Table 4. Comparison of our method to some existing methods on the KTH dataset

Method                     Accuracy (%)
Kovashka and Grauman [6]   94.5
Laptev et al. [9]          91.8
Mattivi and Shao [12]      91.3
Niebles et al. [13]        91.3
Schuldt et al. [17]        71.5
Wang et al. [22]           92.1
Yeffet and Wolf [26]       90.1
Our method                 95.3

Acknowledgements

This work was supported by Infotech Oulu and the Academy of Finland.

References

[1] J. K. Aggarwal and M. S. Ryoo, Human activity analysis: A review, ACM Comput. Surv., 2011.

[2] E. Alpaydın, Soft vector quantization and the EM algorithm, Neural Networks, 1998.

[3] L. Cao, Z. Liu, T. Huang, Cross-dataset Action Detection, CVPR 2010.

[4] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2001.

[5] A. Kläser, M. Marszalek, I. Laptev, C. Schmid, Will person detection help bag-of-features action recognition? INRIA technical report, 2010.

[6] A. Kovashka, K. Grauman, Learning a Hierarchy of Discriminative Space-Time Neighborhood Features for Human Action Recognition, CVPR 2010.

[7] N. Ikizler-Cinbis and S. Sclaroff, Object, Scene and Actions: Combining Multiple Features for Human Action Recognition, ECCV 2010.

[8] I. Laptev, On Space-Time Interest Points, IJCV 2005.

[9] I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies. CVPR 2008.

[10] M. Lewandowski, D. Makris, and J.-C. Nebel, View and Style-Independent Action Manifolds for Human Activity Recognition, ECCV 2010.

[11] J. Liu, S. Ali, M. Shah, Recognizing Human Actions Using Multiple Features, CVPR 2008.

[12] R. Mattivi and L. Shao, Human Action Recognition Using LBP-TOP as Sparse Spatio-Temporal Feature Descriptor, CAIP 2009.

[13] J. C. Niebles, C.-W. Chen, and L. Fei-Fei, Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification, ECCV 2010.

[14] T. Ojala, M. Pietikäinen and T. Mäenpää, Multiresolution gray-scale and rotation invariant texture classification with Local Binary Patterns, IEEE TPAMI, 2002.

[15] K. Rapantzikos, Y. Avrithis, and S. Kollias, Dense saliency based spatiotemporal feature points for action recognition. CVPR 2009.

[16] S. Satkin and M. Hebert, Modeling the Temporal Extent of Actions, ECCV 2010.

[17] C. Schuldt, I. Laptev, B. Caputo, Recognizing human actions: A local SVM approach. ICPR 2004.

[18] L. Shao, R. Mattivi, Feature detector and descriptor evaluation in human action recognition. CIVR 2010.

[19] J. Sun, X. Wu, S. Yan, L. Cheong, T. Chua, and J. Li. Hierarchical spatio-temporal context modeling for action recognition. CVPR 2009.

[20] R. Tibshirani, G. Walther, T. Hastie, Estimating the number of clusters in a data set via the gap statistic. J. Roy. Statist. Soc. B, 2001.

[21] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, C. Schmid, Evaluation of local spatio-temporal features for action recognition, BMVC 2009.

[22] L. Wang, Y. Wang, W. Gao, Mining Layered Grammar Rules for Action Recognition, IJCV 2010.

[23] X. Wang, T. X. Han, S. Yan, An HOG-LBP human detector with partial occlusion handling, ICCV 2009.

[24] D. Weinland, M. Özuysal, P. Fua, Making Action Recognition Robust to Occlusions and Viewpoint Changes, ECCV 2010.

[25] A. Yao, J. Gall, L. V. Gool, A Hough Transform-Based Voting Framework for Action Recognition, CVPR 2010.

[26] L. Yeffet and L. Wolf, Local Trinary Patterns for Human Action Recognition, ICCV 2009.

[27] Z. Zeng and Q. Ji, Knowledge Based Activity Recognition with Dynamic Bayesian Network, ECCV 2010.
