Human Action Recognition
A project report
submitted in partial fulfillment of the
requirements for the Degree of
Master of Technology
in
Computational Science
by
Rajendra Kumar
SUPERCOMPUTER EDUCATION AND RESEARCH CENTER
INDIAN INSTITUTE OF SCIENCE
Bangalore - 560012
July 2012
Dedicated to All teachers, Family and
Friends
Abstract
Human Action Recognition
Human action recognition is an important topic of computer vision research and
applications. The goal of action recognition is the automated analysis of ongoing
events from video data. A reliable system capable of recognizing various human
actions has many important applications, including surveillance systems, health-care
systems, and a variety of systems that involve interactions between persons and
electronic devices, such as human-computer interfaces.
In this project, the problem of human action recognition from video sequences
is addressed. Human Action Recognition (HAR) for both depth and RGB video
sequences was analysed. In depth-based HAR, we propose two methods: the first
uses local features and the second uses global features. In both methods, an
l1-minimization framework was employed for classification. Experiments were
performed on the Video Analytics Lab (VAL) dataset. For RGB-based HAR, Latent
Dirichlet Allocation (LDA) was used with Space Time Interest Point (STIP) feature
descriptors. STIP effectively captures the local structure in the spatio-temporal
dimensions of the video sequence. Each video sequence was represented as a
'bag of visual words'. Experiments were performed on the WEIZMANN, KTH and Video
Analytics Lab (VAL) databases in two scenarios: one in which the number of topics
for all actions was constant and manually chosen, and another in which the number
of topics for each action varied depending on the human action category.
Keywords: Human Action Recognition, l1 minimization, Latent Dirichlet Allocation,
Space Time Interest Points, Kinect depth sensor.
Acknowledgements
I am extremely lucky to have worked with my advisor Dr. R. Venketesh Babu, and
I would like to thank him for his guidance, encouragement and invaluable inputs
throughout the project. I am grateful to him for inspiring me to learn and for
being very supportive of me.
I am extremely thankful to Sreekanth, Priti, Naresh, Avinash and Sovan for their
valuable time, encouragement and support.
I thank Bhuvnesh and his team, who collected the human action depth database.
I also want to show my deepest gratitude to all my friends who have helped me in
many ways. I am highly obliged to my parents, brothers and sisters, who have been
extremely understanding and supportive of my studies.
Last but not least, I thank God.
Contents
Abstract iii
Acknowledgements iv
List of Figures vii
List of Tables ix
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Outline of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Related Work 5
3 Methodology 7
3.1 Depth Based HAR: . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.1.1 Local Approach: . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.1.1 Average with Overlap and Slice . . . . . . . . . . . 10
3.1.1.2 Most distinct frames and slice . . . . . . . . . . . 13
3.1.2 Global Approach: . . . . . . . . . . . . . . . . . . . . . . 14
3.1.2.1 Average of difference with overlap . . . . . . . . . . 14
3.2 RGB Based HAR: . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4 Experiments and Results 21
4.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2.1 Depth Based HAR: . . . . . . . . . . . . . . . . . . . . . 23
4.2.1.1 Average with overlap and slice: . . . . . . . . 24
4.2.1.2 Most distinct frames and slice: . . . . . . . . . 24
4.2.1.3 Average of difference with overlap: . . . . . . 25
4.2.2 RGB Based HAR: . . . . . . . . . . . . . . . . . . . . . . 26
4.2.2.1 Weizmann: . . . . . . . . . . . . . . . . . . . . . 27
5 Conclusion and Future Work 33
Appendices
A SPArse Modeling Software 35
B Space Time Interest Points(STIP) 37
C Latent Dirichlet Allocation 39
Bibliography 47
List of Figures
2.1 RGB and Depth Frame . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1 Our Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Slicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 Average with overlap . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.4 Average with overlap and slice : Training . . . . . . . . . . . . . . . 9
3.5 Recognition of a feature vector into a class action . . . . . . . . . . 12
3.6 Most distinct frames and slice : Training . . . . . . . . . . . . . . . 13
3.7 Average of difference with overlap : Training . . . . . . . . . . . . . 15
3.8 RGB based HAR : Training . . . . . . . . . . . . . . . . . . . . . . 17
3.9 RGB based HAR : Testing . . . . . . . . . . . . . . . . . . . . . . . 18
4.1 Confusion matrix corresponding to 73.9% . . . . . . . . . . . . . 23
4.2 Confusion matrix corresponding to 77.8% . . . . . . . . . . . . . 25
4.3 Confusion matrix corresponding to 87.2% . . . . . . . . . . . . . 26
4.4 Weizmann after removing 7th action and 5th subject (confusion matrix), k = 10, k = 15 . . . . . . . . . . . . . . . . . . . . . . 28
4.5 KTH, variable topics (confusion matrix), subjects per action = 15 . 29
4.6 VAL, fixed topics (confusion matrix), removed "Boxing" and 5th subject . . . 31
C.1 The generative LDA Process . . . . . . . . . . . . . . . . . . . . . . 39
C.2 Representation of the LDA model . . . . . . . . . . . . . . . . . . . 39
List of Tables
4.1 Confused Actions . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2 Recognition in % on different parameters value . . . . . . . . 24
4.3 Average Recognition in % . . . . . . . . . . . . . . . . . . . . . 24
4.4 Recognition in % on different parameters value . . . . . . . . 24
4.5 Average Recognition in % . . . . . . . . . . . . . . . . . . . . . 25
4.6 Confused Actions . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.7 Recognition in % on different parameters value . . . . . . . . 26
4.8 Confused Actions . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.9 Average Recognition in % . . . . . . . . . . . . . . . . . . . . . 26
4.10 Confused Actions . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.11 Recognition in % on different parameter’s value . . . . . . . 27
4.12 Recognition in % on different parameter’s value . . . . . . . 28
4.13 Recognition in % on different parameter’s value . . . . . . . 28
4.14 Confused Actions . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.15 Recognition in % on different parameter’s value . . . . . . . 29
4.16 Confused Actions . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.17 Recognition in percentage for KTH . . . . . . . . . . . . . . . 30
4.18 Recognition in % on different parameter’s value . . . . . . . 30
4.19 Confused Actions . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.20 Recognition in percentage For VAL . . . . . . . . . . . . . . . 30
Chapter 1
Introduction
In recent years, human action recognition (HAR) has evoked considerable interest
in various research areas due to its potential use in proactive computing. A
reliable system capable of recognizing various human actions has many important
applications, such as automated surveillance systems, human-computer interaction,
smart-home health-care systems and controller-free gaming systems. The problem
of human action recognition from video sequences was addressed in this project.
The aim is to develop an algorithm which can recognize low-level actions such as
Bending, Bowling, Boxing, Jogging, Jumping and Kicking from the input video
sequences. Human Action Recognition (HAR) for both depth and RGB video
sequences was analysed.
In depth-based HAR, two methods were proposed: one uses local features and the
other uses global features. In both methods, an l1-minimization framework was
employed for classification; the SPAMS software [1] was used for this.
Action recognition algorithms which use grey-scale images are sensitive to
illumination changes, but the depth information (in depth frames) obtained using
Kinect sensors is independent of these variations. The performance of these
methods was tested on the data collected at the Video Analytics Lab (VAL
database). A 'leave-one-out' strategy was used for performance evaluation. The
results indicate that the features of the global approach give better recognition
results than those of the local approach. Parameters such as the number of frames
to be averaged and the number of frames to be overlapped need to be set during
the training and testing phases.
Note: the performance of the algorithm depends on the values chosen for these
parameters; the values used are documented with the corresponding results.
For RGB-based HAR, Latent Dirichlet Allocation (LDA) was used to analyze the
performance. Space Time Interest Point (STIP) [2][3] feature descriptors of the
videos were used in LDA. STIP effectively captures the local structure in the
spatio-temporal dimensions of the video sequence. Each video sequence is
represented as a 'bag of visual words', where the visual words are the key STIP
features, obtained by clustering all STIP features of the training videos of all
actions using K-means clustering. A model for each action was obtained with the
LDA latent topic model, which learns the spatio-temporal distribution of visual
words.
The most important cue for recognizing human low-level actions is these
space-time interest points, which detect local structures in space-time where
the image values have significant local variations in both dimensions.
Experiments were done on the WEIZMANN, KTH and VAL databases in two scenarios:
one in which the number of topics for all actions was constant and manually
chosen, and a second in which the number of topics for each action was
automatically chosen depending on the human action category.
Databases:
• Weizmann
• KTH
• VAL Database
Training and testing framework: leave-one-out; the average of the results is the
final result.
Challenges: discrimination between similar actions such as Jumping-Sitting,
Jogging-Walking and Boxing-Stretching.
Keywords: Depth frame, HAR, l1 minimization, SPAMS, LDA, STIP, Kinect depth
sensor.
1.1 Motivation
As mentioned earlier, human action recognition (HAR) has evoked considerable
interest in various research areas and applications due to its potential use in
proactive computing. Proactive computing is technology that proactively
anticipates people's needs, in situations such as health care or life care, and
takes appropriate actions on their behalf. A system capable of recognizing
various human actions has many important applications, such as automated
surveillance systems, human-computer interaction, smart-home health-care systems
and controller-free gaming systems. Human Action Recognition (HAR) is thus a very
fertile domain with many promising applications, and it draws the attention of
several researchers, institutions and commercial companies.
1.2 Aim
Our aim is to analyze this problem and develop a robust HAR algorithm using the
l1-minimization and Latent Dirichlet Allocation (LDA) frameworks, which can
recognize low-level actions such as Bending, Bowling, Boxing, Jogging and Kicking
from the input video sequences.
1.3 Outline of thesis
The thesis is organized in the following way:
1. Chapter 1: Introduction introduces the context and motivation of the
research presented in this thesis.
2. Chapter 2: Related Work describes related work and our contribution.
3. Chapter 3: Methodology explains the different approaches used in detail
and how HAR is achieved.
4. Chapter 4: Experiments and Results describes the datasets used for HAR,
briefly presents the bag-of-visual-words model used for evaluating our
algorithm, and describes the parameter values and scenarios in which the
experiments were performed.
5. Chapter 5: Conclusion and Future Work concludes the thesis along with
future plans.
Chapter 2
Related Work
Several approaches for human action recognition have been proposed; a survey on
HAR can be found in [4]. A variety of approaches use features which describe the
motion and/or shape of the entire human body to perform human action
recognition.
Efros et al. [5] recognize the actions of small-scale figures using features
derived from blurred optical flow estimates. Blank et al. [6] represent an action
by considering the shape carved by its silhouette in time: local shape
descriptors based on the Poisson equation are computed, then aggregated into a
global descriptor by computing moments.
Another group of methods uses features derived from small-scale patches, usually
computed at a set of interest points. Schuldt et al. [7] computed local
space-time features at locations selected in a scale-space representation. These
features are used in an SVM classification scheme.
Traditional approaches for motion analysis mainly involve the computation of
optical flow (Barron et al., 1994) [8] or feature tracking (Smith and Brady,
1995; Blake and Isard, 1998) [9][10]. Although very effective for many tasks,
both of these techniques have limitations. Optical flow approaches mostly
capture first-order motion and may fail when the direction of motion changes
suddenly. Feature trackers often assume a constant appearance of image patches
over time and may hence fail when the appearance changes, for example when two
objects in the image merge or split. Model-based solutions for this problem have
been presented by Black and Jepson (1998) [11].
Figure 2.1: RGB and Depth Frame
Image structures in videos are not restricted to constant velocity and/or constant
appearance over time. On the contrary, many interesting events in videos are char-
acterized by strong variations in the data along both the spatial and the temporal
dimensions.
In the spatial domain, points with a significant local variation of image
intensities have been extensively investigated in the past (Forstner and Gulch,
1987 [12]; Harris and Stephens, 1988 [13]; Lindeberg, 1998 [14]; Schmid et al.,
2000 [15]). Such image points are frequently referred to as interest points and
are attractive due to
to their high information content and relative stability with respect to perspective
transformations of the data.
In this work, Human Action Recognition (HAR) for both depth sequences and RGB
video sequences is analysed. In depth-based HAR, the depth information
(Figure 2.1) of the video sequence is used, whereas in RGB-based HAR the notion
of interest points, extended into the spatio-temporal domain (proposed by Ivan
Laptev, 2004) [16], is used for a compact representation of video data as well
as for the interpretation of spatio-temporal events. Latent Dirichlet Allocation
(LDA) is then used for modeling the human actions.
Chapter 3
Methodology
We have proposed two different approaches (Figure 3.1) for human action
recognition: one based on depth information and the other based on STIP features.
3.1 Depth Based HAR:
In this approach two methods were proposed: one uses local features whereas the
other uses global features. In both methods, an l1-minimization framework was
employed for classification. Performance was evaluated on the Video Analytics
Lab (VAL) database. The dataset has M = 9 subjects, each performing N = 11
actions, viz. Bending, Bowling, Boxing, Jogging, Jumping, Kicking, Sitting,
Figure 3.1: Our Approaches
Figure 3.2: Slicing
Figure 3.3: Average with overlap
Stretching, Swimming, Walking and Waving.
Before going into the details of each method, there are some terms which need to
be explained.
1) Slicing: A process which takes a single depth frame and gives p binary
frames by chopping the depth range of the input frame into p equal sub-ranges.
In Figure 3.2 the input frame has a depth range from 0 to 60, and the output is
three frames covering the ranges 0-20, 21-40 and 41-60. Non-zero values are
replaced with 1 to obtain binary images.
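The slicing step can be sketched as follows (a minimal NumPy sketch; the equal-width sub-ranges and the treatment of depth 0 as "no reading" are assumptions based on the description above):

```python
import numpy as np

def slice_depth_frame(frame, p):
    """Chop the depth range of `frame` into p equal sub-ranges and return
    p binary frames: 1 where a (non-zero) pixel falls in that sub-range."""
    edges = np.linspace(0, frame.max(), p + 1)
    valid = frame > 0                       # depth 0 = no reading, not the figure
    slices = []
    for i in range(p):
        # include the upper edge only for the last sub-range
        upper = frame <= edges[i + 1] if i == p - 1 else frame < edges[i + 1]
        slices.append((valid & (frame >= edges[i]) & upper).astype(np.uint8))
    return slices
```

For the example in Figure 3.2 (depth range 0-60, p = 3), this yields binary frames for the sub-ranges 0-20, 20-40 and 40-60.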
Figure 3.4: Average with overlap and slice : Training
2) Average with overlap: This is best explained with an example. Suppose we
have 5 depth frames for a single action, labelled a, b, c, d, e, and we want to
average N = 3 frames with n = 2 frames of overlap. Then resultant frame 1 is the
average of a, b and c; similarly, resultant frames 2 and 3 are the averages of
b, c, d and of c, d, e respectively. Refer to Figure 3.3.
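In code, this windowed averaging looks like the following sketch (the stride N − n follows from averaging N frames with n of them overlapping between successive windows):

```python
import numpy as np

def average_with_overlap(frames, N, n):
    """Average N consecutive frames with n frames of overlap between
    successive windows, i.e. a sliding window of stride N - n."""
    stride = N - n
    return [np.mean(frames[s:s + N], axis=0)
            for s in range(0, len(frames) - N + 1, stride)]
```

With 5 frames, N = 3 and n = 2 this produces the three resultant frames (a,b,c), (b,c,d), (c,d,e) of the example.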
3.1.1 Local Approach:
In this approach, local features are used for recognition. For training and
testing, the framework used is leave-one-out (i.e. training with M − 1 subjects
and testing with the remaining subject), and the average of the results is the
final result. The parameters corresponding to the number of frames to be
averaged, the overlap and the number of distinct frames are tunable.
There are two schemes; refer to Figure 3.1.
3.1.1.1 Average with Overlap and Slice
In this approach, features are extracted from local grid blocks of
human-figure-centric frames.
Training:
For training, x depth frames were averaged with y frames of overlap, and the
resultant frames were then sliced.
In our scenario there are (M − 1) = 8 training subjects per action. For a
particular action, let there be n resultant depth frames in total, each of which
is sliced into p binary frames; so there are n*p binary slice frames per action
in total.
For each of the p binary slices of a resultant depth frame:
• Chop the frame into a finite number of small blocks, i.e. convert the frame
into a grid.
• For each grid block, find the number of 1's it contains divided by the
grid-block size. This numerical value is the local feature for that particular
grid block.
• Make a column vector (the local feature vector) of the numerical values
obtained in the previous step.
Thus p column vectors are obtained; the concatenation of these p column vectors
into a single column vector is the feature vector corresponding to a resultant
depth frame. Similarly, the local feature vectors for all the training actions
(in our case N = 11) are obtained. Refer to Figure 3.4.
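The feature-extraction steps above can be sketched as follows (the fraction-of-ones feature follows the description; the 4x4 default grid mirrors the grid size used in Chapter 4):

```python
import numpy as np

def grid_features(binary_frame, grid=(4, 4)):
    """Fraction of 1's in each grid block of a binary slice frame."""
    h, w = binary_frame.shape
    bh, bw = h // grid[0], w // grid[1]
    return np.array([binary_frame[r*bh:(r+1)*bh, c*bw:(c+1)*bw].mean()
                     for r in range(grid[0]) for c in range(grid[1])])

def frame_feature_vector(slices, grid=(4, 4)):
    """Concatenate the grid features of all p binary slices of one
    resultant depth frame into a single feature (column) vector."""
    return np.concatenate([grid_features(s, grid) for s in slices])
```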
Dictionary building: Put all the feature vectors of an action together as
columns and normalize the dictionary. The normalized dictionary is:
D = [A11....A1n, A21....A2n, ..., AN1....ANn]; (3.1)
of size (no. of grid blocks*p) x (N*n), where
Xi = [Ai1Ai2...Ain] (3.2)
is the set of feature vectors for an action Ai .
n = Total resultant depth frames for an action class
p = Output of slicing
N = Number of Actions in dataset
Testing
For testing each action video of the test subject, the user is allowed to provide
the values of the parameters corresponding to the number of frames to be averaged
and overlapped. These values may differ from those chosen during training.
Assume n' resultant depth frames were obtained; the number of binary slices per
resultant depth frame must be p. Suppose that after completing the averaging and
slicing process we get
Xi = [Ai1Ai2...Ain′ ] (3.3)
which is of size (no. of grid blocks*p) x n', the set of feature vectors for a
test action Ai.
To classify Xi, we need to solve the lasso problem. Let
X = [x1, x2, ..., xn] (3.4)
of dimension m x n be a response matrix, and let D of dimension m x p be a
matrix of predictors; then the lasso problem is commonly written as
min_{α ∈ R^p} ||α||_1  subject to  ||x − Dα||_2^2 ≤ λ (3.5)
that is, find the α corresponding to each column x of the input response matrix
X which solves this minimization problem. To solve this problem, the function
mexLasso, available in the open-source SPAMS [1] software, was used.
Function mexLasso
This is a fast implementation of the LARS algorithm for solving the lasso. It is
optimized for solving a large number of small or medium-sized decomposition
problems. It first computes the Gram matrix
GM = D^T D (3.6)
and then performs a Cholesky-based Orthogonal Matching Pursuit (OMP) of the
input signals in parallel.
Figure 3.5: Recognition of a feature vector into a class action
It takes X and D as inputs and, depending on the input parameters, returns a
matrix of coefficients
A = [α1, α2, ..., αn] (3.7)
of dimension p x n, such that for every column x of X the corresponding column α
of A is the solution of the above l1-minimization (lasso) problem.
In our case A is of dimension (N*n) x n', corresponding to our test matrix X of
dimension (no. of grid blocks*p) x n' and dictionary D of dimension
(no. of grid blocks*p) x (N*n).
Recognition:
Since the test matrix's columns are feature vectors of the test action, for a
particular feature vector there is a corresponding coefficient column vector in
the A matrix, of size (N*n) x 1.
Divide the N*n coefficients into N equal-size groups; then either find the sum
of coefficients of each group and pick the group with the maximum sum, or choose
the group containing the highest peak, as the recognized action label. Refer to
Figure 3.5, where the vector is classified to action class A1.
So for a particular test action we get n' recognitions corresponding to the n'
columns of the test matrix Xi. We report the result as a percentage accuracy,
i.e. how many of the n' are recognized correctly.
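The group-sum rule can be sketched as follows (assuming equal group sizes, i.e. n coefficients per action class):

```python
import numpy as np

def classify_by_group_sum(alpha, N):
    """Split the (N*n)-long coefficient vector into N equal groups and
    return the index of the group with the largest coefficient sum."""
    sums = alpha.reshape(N, -1).sum(axis=1)
    return int(np.argmax(sums))
```

The highest-peak variant would replace the group sum with the group maximum.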
Figure 3.6: Most distinct frames and slice : Training
3.1.1.2 Most distinct frames and slice
Training
In this method, the x most distinct frames are collected from the total frames
of a video sequence of an action (in our case the total is the last 48 of the 58
depth frames, leaving out the first 10).
• Pick the 1st frame as the first distinct frame, put it in the set X (initially
empty) and remove it from the set Y (initially containing all 48 depth frames).
• Pick the next frame from the set Y which is most distinct from all the depth
frames available in the set X (in our case the distance measure is the l2-norm
of the difference frame).
Repeat the second step until X contains x frames; do this for all videos of a
particular action and collect the results in X.
Thus X, the set of ((M−1)*x) = n most distinct depth frames for a particular
action, is obtained. After that, perform slicing on each frame in the set X, as
described in Figure 3.6, and build a normalized dictionary matrix D as
D = [A11....A1n, A21....A2n..., AN1...ANn]; (3.8)
of size (no. of grid blocks*p) x (N*n), where
Xi = [Ai1Ai2...Ain] (3.9)
is the set of feature vectors of an action Ai,
n = total most distinct depth frames for an action class,
p = number of slices per distinct depth frame,
N = number of actions.
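The greedy selection above can be sketched as follows; "most distinct from all frames in X" is read here as the farthest-point (maximum of minimum l2 distance) rule, which is one plausible interpretation of the description:

```python
import numpy as np

def most_distinct_frames(frames, x):
    """Greedily pick x frame indices: start with the first frame, then
    repeatedly add the frame whose minimum l2 distance to the already
    chosen set is largest."""
    chosen = [0]
    while len(chosen) < x:
        best, best_d = None, -1.0
        for i in range(len(frames)):
            if i in chosen:
                continue
            d = min(np.linalg.norm(frames[i] - frames[j]) for j in chosen)
            if d > best_d:
                best, best_d = i, d
        chosen.append(best)
    return chosen
```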
Testing
• For the given test action, find the, say, x1 (which may differ from x) most
distinct frames.
• Slice the frames and form the grid.
• The procedure for classification is the same as described in method 3.1.1.1.
3.1.2 Global Approach:
3.1.2.1 Average of difference with overlap
In this approach, features are extracted from the complete human-figure-centric
frame.
Training:
Find the difference frames from the consecutive frames (in our case 48 depth
frames) of an action, making sure to avoid subtraction from or to a zero depth
value, because this leads to high values caused by slight oscillations of the
body, which are not actually part of the human action. On these difference
frames, average x frames with y overlap to get x resultant frames per subject.
Since in our case there are 8 training subjects per action, the total number of
resultant frames for a particular action is (8*x) = n. Reshape these n resultant
depth frames into column vectors as described in Figure 3.7. Build the
dictionary D and normalize it:
Figure 3.7: Average of difference with overlap : Training
D = [A11....A1n, A21....A2n, ..., AN1....ANn]; (3.10)
of size (size of frame) x (N*n), where
Xi = [Ai1Ai2...Ain] (3.11)
is the set of feature vectors for an action Ai,
n = total resultant difference frames for an action class,
N = number of actions.
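The zero-depth guard described above can be sketched as follows (masking the affected pixels to zero, rather than clipping, is an assumption):

```python
import numpy as np

def difference_frames(frames):
    """Consecutive frame differences, zeroing any pixel where either
    frame has depth 0 (no reading), so that missing readings do not
    create spurious large differences."""
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        valid = (a > 0) & (b > 0)
        out.append(np.where(valid, b.astype(np.int32) - a.astype(np.int32), 0))
    return out
```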
Testing:
• For a given test action, find the resultant frames (say x', which may differ
from x) by taking the average of, say, x1 frames with y1 overlap.
• Reshape each of the frames into a column vector.
• The procedure for classification is the same as described in method 3.1.1.1 or
method 3.1.1.2.
3.2 RGB Based HAR:
This approach uses Space Time Interest Points [2][3], which detect local
structures in space-time where the image values have significant local
variations in both dimensions; this is an important cue for recognizing human
low-level actions. Each video sequence is represented as a bag of visual words,
where the visual words are the key STIP features obtained by clustering all STIP
features of the training videos of all actions with a clustering algorithm.
Latent Dirichlet Allocation (LDA) was used for action-model formation.
Performance was evaluated on the Weizmann, KTH and VAL databases. For training
and testing, the framework used is leave-one-out.
Training:
The training of a classifier to distinguish among the different action classes
consists of two layers, as described in Figure 3.8:
1) Clustering to obtain the visual words (key descriptors):
• Obtain all STIP features from the videos of the training subjects' actions.
• Arrange the STIP feature descriptors of a training subject's video sequence in
ascending order according to the temporal information (optional).
• Concatenate these arranged STIP feature descriptors into a thin matrix.
• Use a clustering algorithm to group them into different clusters.
The cluster centers (centroids) are the visual words, or key features, whose
combination represents an action.
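A plain K-means over the stacked descriptors can be sketched as follows (a minimal version with a simple deterministic initialisation; the replicated runs and best-of selection used in Chapter 4 are omitted here):

```python
import numpy as np

def kmeans_visual_words(descriptors, K, iters=20):
    """K-means on stacked STIP descriptors; the K centroids returned
    are the visual words (key descriptors)."""
    X = np.asarray(descriptors, dtype=float)
    centers = X[:K].copy()                  # first K points as initial centroids
    for _ in range(iters):
        # assign each descriptor to its nearest centroid (squared l2)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(K):
            pts = X[labels == k]
            if len(pts):
                centers[k] = pts.mean(0)
    return centers
```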
2) Action modeling by the LDA topic model
The Latent Dirichlet Allocation topic model, which learns the spatio-temporal
distribution of visual words, was used for action modeling.
To get the topic models for an action, for each training subject of that action:
Figure 3.8: RGB based HAR : Training
• Compute all STIP features from the video sequence.
• Arrange them in ascending order according to the temporal information.
Bag-of-words representation:
• Replace each STIP feature descriptor with one of the key STIP feature
descriptors based on the minimum l2-norm distance measure.
• Compute the frequency of words in this bag-of-words representation of the
video sequence (in word:frequency format, written as a single line of a text
file).
Perform the above steps for all training subject videos to obtain a text file
whose number of rows equals the number of training subjects.
• Use LDA on the text file to get the action model (in terms of the beta matrix).
Figure 3.9: RGB based HAR : Testing
The columns of beta are the topic models for that particular action.
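The bag-of-words step and the word:frequency line written to the LDA text file can be sketched as:

```python
import numpy as np

def bag_of_words(stip_descriptors, visual_words):
    """Map each STIP descriptor to its nearest visual word (l2 distance)
    and return the word-frequency histogram for the video."""
    hist = np.zeros(len(visual_words), dtype=int)
    for f in stip_descriptors:
        d = np.linalg.norm(np.asarray(visual_words) - f, axis=1)
        hist[int(d.argmin())] += 1
    return hist

def to_lda_line(hist):
    """One video as a single 'word:frequency' line for the LDA text file."""
    return " ".join(f"{w}:{c}" for w, c in enumerate(hist) if c)
```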
Testing:
Each action of a test subject was tested on the basis of its STIP descriptors in
some range.
• Compute all STIP features of the video sequence.
• Arrange them in ascending order according to the temporal information.
Bag-of-words representation:
• Take all STIP descriptors in some range and map each STIP feature descriptor
to one of the key STIP feature descriptors based on the minimum l2-norm distance
measure.
• Compute the frequency of words in this bag-of-words representation to obtain a
test column vector of size equal to the number of clusters (containing only
frequencies).
Recognition:
• Compute the distances from the test vector to all the models of each action.
• Find the index of the minimum of these distances; it corresponds to the action
class to which the test is classified.
Refer to Figure 3.9.
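The nearest-model rule above can be sketched as follows (assuming each action's model is the set of its beta columns):

```python
import numpy as np

def recognize(test_vec, action_models):
    """action_models[i] holds the topic-model columns (beta) of action i;
    classify the test vector by its closest model column overall."""
    best_action, best_d = -1, float("inf")
    for i, models in enumerate(action_models):
        for m in models:
            d = np.linalg.norm(np.asarray(test_vec) - np.asarray(m))
            if d < best_d:
                best_action, best_d = i, d
    return best_action
```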
Chapter 4
Experiments and Results
4.1 Experiment Setup
In this section we first describe the datasets used for action recognition. We
then briefly present the bag-of-visual-words model used for evaluating our
algorithm, and finally we describe the parameter values and scenarios in which
the experiments were performed.
Experiments for depth-based HAR were performed on the VAL database, and
experiments for RGB-based HAR on the Weizmann, KTH and VAL datasets.
KTH dataset: The KTH dataset consists of N = 6 human action classes: walking,
jogging, running, boxing, handwaving and handclapping. Each action is performed
several times by M = 25 subjects. The sequences were recorded in four different
scenarios: outdoors, outdoors with scale variation, outdoors with different
clothes, and indoors. The background is homogeneous and static in most
sequences. In total, the data consists of 2391 video samples. The experiment
setup was leave-one-out: only one instance out of the four instances of a
subject was left out as the test set, and the average over all subjects'
test-set results was taken as the final performance measure.
Weizmann dataset: The Weizmann human action dataset contains N = 10 action
classes: bend, jack, jump, pjump, run, side, skip, walk, wave1, wave2. Each
action is performed by M = 9 subjects. The dataset contains 93 low-resolution
(180 x 144 pixels) video sequences. The experiment setup was leave-one-out; the
average over all test subjects' results was the final classification result. Our
method achieved approximately 98.30% average classification after removing the
7th action (skip), which was mostly confused with jack and pjump. The fifth
subject was also removed from all actions to improve performance.
Video Analytics Lab (VAL) dataset: The VAL dataset was collected using the
Kinect depth sensor and contains eleven action classes for depth-based HAR:
bending, bowling, boxing, jogging, jumping, kicking, sitting, stretching,
swimming, walking and waving, while for RGB-based HAR there are only ten action
classes: bending, bowling, boxing, jogging, kicking, sitting, stretching,
swimming, walking and waving. Each action is performed by nine subjects. The
dataset was collected in an indoor lab scenario. It contains depth frame
sequences along with RGB frame sequences; the depth frame sequences were used
for depth-based HAR, whereas the RGB frame sequences were used for RGB-based
HAR. Our approach for RGB-based HAR achieved approximately 95.56% average
classification.
Bag of visual words:
To evaluate the performance of our RGB-based HAR, where STIP features were used,
we used a standard bag-of-words approach. We clustered all STIP features of the
actions into different groups using a clustering algorithm, and the centroids
were taken as the key features, or visual words. For a given video sequence, we
compute the STIP features and, using the l2-norm distance measure, map each and
every STIP feature to one of the key features. This representation of a video
sequence in visual words is the so-called bag of visual words.
Parameters:
In RGB-based HAR, there were two scenarios in which performance was evaluated:
one in which the number of topics for all actions is constant and manually
chosen (in our case it was 8), and a second in which the number of topics for
each action is automatically set depending on the human action category. The
idea behind the variable number of topics was that every action has a different
number of atomic action labels. We simply clustered all STIP features of an
action into k clusters and found the particular k at which the within-cluster
sum of point-to-centroid distances is relatively minimal. For finding the key
descriptors, or visual words, from all STIP features of all actions, we
clustered them into K = 500 clusters. To increase precision, the K-means
algorithm was replicated three times and the result with the lowest
within-cluster sum was kept. The distance measure used for the K-means algorithm
was the square of the Euclidean distance.
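One way to pick the per-action k at which the within-cluster sum is "relatively minimum" is an elbow heuristic; the 10% threshold below is an assumption for illustration, not the exact rule used here:

```python
def choose_k(within_sums):
    """within_sums[i] is the within-cluster sum of point-to-centroid
    distances for k = i + 1.  Return the k at which the marginal
    improvement first falls below 10% of the initial improvement."""
    drops = [within_sums[i] - within_sums[i + 1]
             for i in range(len(within_sums) - 1)]
    for i, d in enumerate(drops):
        if d < 0.1 * drops[0]:
            return i + 1
    return len(within_sums)
```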
Figure 4.1: Confusion matrix corresponding to 73.9% average classification
4.2 Results
4.2.1 Depth Based HAR:
Experiments were performed on the VAL dataset, using only the depth frames,
not the RGB frames. Each action folder has 58 depth images; in our experiment
we used the last 48 of these, leaving out the first 10 depth frames because with
high probability they were almost identical. The depth frames of each action were
preprocessed with the following steps:
1) Normalization
2) Tight bounding box
3) Resizing
4) Reshaping into a vector
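The four steps above can be sketched for a single frame as follows (a hypothetical Python illustration on a tiny nested-list "image"; the actual frame sizes, normalization and output grid of the project differ):

```python
def preprocess(depth, out_h=4, out_w=4):
    """Hypothetical sketch of the four preprocessing steps for one depth
    frame: normalize, crop to the tight bounding box of nonzero pixels,
    nearest-neighbour resize, flatten to a vector."""
    # 1) Normalization to [0, 1]
    m = max(max(row) for row in depth) or 1
    img = [[v / m for v in row] for row in depth]
    # 2) Tight bounding box around nonzero (foreground) pixels
    rows = [i for i, r in enumerate(img) if any(r)]
    cols = [j for j in range(len(img[0])) if any(r[j] for r in img)]
    img = [r[cols[0]:cols[-1] + 1] for r in img[rows[0]:rows[-1] + 1]]
    # 3) Nearest-neighbour resize to a fixed grid
    h, w = len(img), len(img[0])
    img = [[img[i * h // out_h][j * w // out_w] for j in range(out_w)]
           for i in range(out_h)]
    # 4) Reshape into a single feature vector
    return [v for row in img for v in row]

frame = [[0, 0, 0, 0],
         [0, 2, 4, 0],
         [0, 4, 8, 0],
         [0, 0, 0, 0]]
vec = preprocess(frame, out_h=2, out_w=2)
print(len(vec), vec)
```

The sketch assumes the foreground is the nonzero region of the depth map; the flattened vectors are what the l1-minimization classifier then operates on.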
Table 4.1: Confused Actions
Actions Confused %
Bowling and Waving 9.5
Jogging and Sitting 18.5
Jogging and Walking 18.5
Waving and Bowling 20.4
4.2.1.1 Average with overlap and slice:
Experiments were performed with various parameter values; refer to Table 4.2,
and see Figure 4.1 for the confusion matrix corresponding to 73.9% average
classification. The most confused actions, which degrade the average percentage
of classification, are shown in Table 4.1. The average percentage of classification
per action is shown in Table 4.3.
Table 4.2: Recognition in % on different parameters value
frames Overlap frames Overlap Slice Gridsize Avg.% Recog.
16(train) 4(train) 30(test) 25(test) 5 4x4 72.4
30(train) 25(train) 30(test) 25(test) 5 4x4 73.9
Table 4.3: Average Recognition in %
Action % Action %
Bending 94.4 Bowling 81.5
Boxing 55.6 Jogging 25.9
Jumping 55.6 Kicking 81.5
Sitting 100 Stretching 72.2
Swimming 94.4 Walking 72.2
Waving 79.6 - -
4.2.1.2 Most distinct frames and slice:
Experiments were performed with various parameter values; refer to Table 4.4,
and see Figure 4.2 for the confusion matrix corresponding to 77.8% average
classification. The most confused actions, which degrade the average percentage
of classification, are shown in Table 4.6. The average percentage of classification
per action is shown in Table 4.5.
Table 4.4: Recognition in % on different parameters value
Distinct frames Slice Grid size Avg.% Recog.
15 6 4x4 77.8
30 5 4x4 72.9
Figure 4.2: Confusion matrix corresponding to 77.8% average classification
Table 4.5: Average Recognition in %
Action % Action %
Bending 94.8 Bowling 77.7
Boxing 82.2 Jogging 53.3
Jumping 61.5 Kicking 91.5
Sitting 79.3 Stretching 80.0
Swimming 68.1 Walking 82.9
Waving 84.4 - -
Table 4.6: Confused Actions
Actions Confused %
Jogging and Walking 22.9
Jumping and Jogging 20.0
Jogging and Sitting 18.5
Swimming and Bowling 17.0
4.2.1.3 Average of difference with overlap:
Experiments were performed with various parameter values; refer to Table 4.7,
and see Figure 4.3 for the confusion matrix corresponding to 87.2% average
classification. The most confused actions degrade the average percentage of
classification. The confused actions for the experiment with 87.2% average
recognition are shown in Table 4.8, and those for the experiment with 72.2%
average recognition in Table 4.10. The average percentage of classification per
action is shown in Table 4.9.
Table 4.7: Recognition in % on different parameters value
Removed Testframes Overlap Avg.% Recog.
None 20 10 72.2
None 30 20 80.1
5th(Jumping) 30 20 87.2
Figure 4.3: Confusion matrix corresponding to 87.2% average classification
Table 4.8: Confused Actions
Actions Confused %
Jogging and Sitting 16.7
Jogging and Kicking 11.1
Boxing and Swimming 11.1
Walking and Sitting 11.1
Table 4.9: Average Recognition in %
Action % Action %
Bending 88.8 Bowling 100
Boxing 66.7 Jogging 72.2
Kicking 94.4 Sitting 88.9
Stretching 88.9 Swimming 100
Walking 77.8 Waving 94.4
4.2.2 RGB Based HAR:
Experiments were performed on three datasets, i.e., Weizmann, KTH and the
Video Analytics Lab (VAL) dataset.
Table 4.10: Confused Actions
Actions Confused %
Bowling and Waving 11.1
Jogging and Jumping 25.0
Jogging and Walking 22.9
Jumping and Sitting 33.3
By performing various experiments with different values of K, different distance
measures for the K-means algorithm and different numbers of topics in training,
we found that performance was good with K = 500 (number of clusters), the
square of the Euclidean distance as the distance measure, and 8 topics (in the
fixed-topic scenario).
4.2.2.1 Weizmann:
kparam: the k parameter of the Harris function, corresponding to the sensitivity
factor, generally in the range (0, 0.25). The smaller the value of k, the more likely
the algorithm is to detect sharp corners.
thresh: the intensity comparison threshold, used for omitting weak points. The
larger the value of thresh, the more weak points are omitted.
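Assuming the standard Harris corner response R = det(M) − k · trace(M)², the effect of kparam can be illustrated on a toy 2x2 second-moment matrix: a smaller k yields a larger response, so more candidate corners survive a fixed threshold.

```python
def harris_response(m, k):
    """Standard Harris response R = det(M) - k * trace(M)^2 for a
    2x2 second-moment matrix M given as nested lists."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    trace = m[0][0] + m[1][1]
    return det - k * trace ** 2

M = [[2.0, 0.5], [0.5, 1.5]]   # toy second-moment matrix at one pixel
for k in (0.00010, 0.00050, 0.04):
    print(k, harris_response(M, k))  # response shrinks as k grows
```

The matrix and k values here are illustrative only; 0.00010 and 0.00050 are the kparam values actually varied in the experiments below.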
Experiment 1: Several measures were taken to achieve a better recognition
rate. Experiments were performed in the fixed-topics-per-action-class scenario;
refer to Table 4.11.
Table 4.11: Recognition in % on different parameter’s value
Removed Cluster Topics Testdescriptors Overlap Avg.classifi.
None 500 8 12 6 88.95
5th subject 500 8 12 6 89.56
”Skip” 500 8 12 6 92.64
None 500 8 16 8 90.46
5th subject 500 8 16 8 91.30
”Skip” 500 8 16 8 93.28
Experiment 2: We changed kparam's value from 0.00050 to 0.00010 to obtain
more STIP feature descriptors. Experiments were performed in the
fixed-topics-per-action-class scenario; refer to Table 4.12.
Table 4.12: Recognition in % on different parameter’s value
Removed Cluster Topics Testdescrip. Overlap Avg.% classifi.
None 500 5 16 8 88.64
5th subject 500 5 16 8 94.65
None 500 8 16 8 92.47
5th subject 500 8 16 8 96.13
Figure 4.4: Confusion matrices for Weizmann after removing the 7th action and the 5th subject, for k = 10 and k = 15
Experiment 3: Here we have 1) changed kparam's value from 0.00050 (default)
to 0.00010, to obtain more STIP feature descriptors, and 2) taken a variable
number of topics for each action class based on k = 10 and k = 15. Refer to
Table 4.13 for the experiments, and to Figure 4.4 for the confusion matrices
corresponding to 98.33% and 96.99%, respectively. Confused actions
corresponding to 96.99% average recognition are shown in Table 4.14.
Table 4.13: Recognition in % on different parameter’s value
Removed Cluster Testdescr. Overlap k Avg.% classifi.
None 500 16 8 10 90.24
5th subject 500 16 8 10 94.38
5th sub.& ”Skip” 500 16 8 10 98.33
5th subject 500 16 8 15 94.18
5th sub. & ”Skip” 500 16 8 15 96.99
Table 4.14: Confused Actions
Actions Confused %
Jump and Jack 12.5
Wave1 and Jack 25.0
Jogging and Walking 8.3
Run and Walk 6.2
Figure 4.5: KTH variable topics (Confusion Matrix), subjects per action = 15
4.2.2.2 KTH:
Here we 1) changed kparam's value from 0.00050 (default) to 0.00080 and
thresh's value from 1.000e-009 (default) to 1.000e-008, and 2) used frames 50 to
200, except for the 1st, 2nd and 3rd instances of the 9th subject, where we used
frames 201 to 351 (since there were no STIP features in the 50 to 200 frame
range). This captures fewer STIP feature descriptors and makes the computation
easier and faster. We took all actions and 5, 9 or 15 out of 25 subjects, with all
four instances.
Experiment: We evaluated our method in two scenarios: one with a fixed,
manually selected number of topics (8), and another where the number of topics
was selected automatically. Refer to Table 4.15 for the experiments, and to
Figure 4.5 for the confusion matrix corresponding to 82.84%. Confused actions
are shown in Table 4.16.
Table 4.15: Recognition in % on different parameter’s value
Sub.Taken Clusters Topics Testdescr. Overlap k Avg. % classifi.
5 500 8 16 8 10 71.92
5 500 variable 16 8 10 80.40
9 500 8 16 8 10 76.92
9 500 variable 16 8 10 77.10
15 500 8 16 8 10 80.36
15 500 variable 16 8 10 82.84
4.2.2.3 VAL:
Table 4.16: Confused Actions
Actions Confused %
Jogging and Running 27.4
Handclap and Handwave 51.5
Handwave and Boxing 42.0
Running and Jogging 24.6
Table 4.17: Recognition in percentage for KTH
Boxing Handclap Handwave Jog Run Walk
90.37 91.73 92.68 62.57 71.69 88.02
Experiment: We evaluated our method in two scenarios: one with a fixed,
manually selected number of topics (8), and another where the number of topics
was selected automatically. Refer to Table 4.18 for the experiments, and to
Figure 4.6 for the confusion matrix corresponding to 95.56%. Confused actions
are shown in Table 4.19.
Table 4.18: Recognition in % on different parameter’s value
Removed Clusters Topics Testdescr. Overlap k Avg. % classifi.
None 500 8 20 10 10 91.55
None 500 variable 20 10 10 89.35
”Boxing” 500 8 20 10 10 93.93
”Boxing” 500 variable 20 10 10 92.59
5th sub.& ”Boxing” 500 8 20 10 10 95.56
Table 4.19: Confused Actions
Actions Confused %
Bowl, Kick, Sit – Walk 4.3
Sitting and Bending 5.7
Handwave and Boxing 42.0
Waving and Stretching 10.8
Table 4.20: Recognition in percentage for VAL
Bend Bowling Jog Kick Sit
100 95.83 97.22 94.95 91.74
Stretch Swim Walk Wave -
98.21 100 95.77 86.29 -
Figure 4.6: Confusion matrix for VAL with fixed topics, after removing "Boxing" and the 5th subject
Chapter 5
Conclusion and Future Work
In this report we have analysed the performance of human action recognition
based on depth and RGB video sequences. In depth based HAR, we took
advantage of depth information in the feature description because of its
insensitivity to illumination changes; the method based on global features
performs better than the one based on local features. In RGB based HAR, we
used Space-Time Interest Points [2][3] with a latent topic model for classification.
STIP effectively captures the local structure in the spatio-temporal dimensions of
a video sequence. Each video sequence was represented as a "bag-of-visual-words".
The results show that our approaches perform satisfactorily.
Exploring new features for HAR in the sparse (l1) framework, and exploring the
possibility of combining depth and grayscale videos for HAR, are directions for
future work.
Appendix A
SPArse Modeling Software
SPAMS (SPArse Modeling Software) is an open-source optimization toolbox under
licence GPLv3. It implements algorithms for solving various machine learning and
signal processing problems involving sparse regularizations.
The function mexLasso was used in our project to solve the l1 minimization
problem for the classification of human actions.
mexLasso: This is a fast implementation of the LARS algorithm [8] (the variant
for solving the Lasso) for solving the Lasso or Elastic-Net. Given a matrix of
signals X = [x1, ..., xn] in R^(m x n) and a dictionary D in R^(m x p),
depending on the input parameters the algorithm returns a matrix of coefficients
A = [α1, ..., αn] in R^(p x n) such that for every column x of X, the
corresponding column α of A is the solution of

minimize_{α ∈ R^p} ‖α‖_1  s.t.  ‖x − Dα‖_2^2 ≤ λ     (A.1)
For efficiency reasons, the method first computes the covariance matrix D^T D;
then, for each signal, it computes D^T x and performs the decomposition with a
Cholesky-based algorithm. The implementation also has an option to add
positivity constraints on the solutions. When the solution is very sparse and the
problem size is reasonable, this approach can be very efficient. Moreover, it gives
the solution with exact precision, and its performance does not depend on the
correlation of the dictionary elements, except when the solution is not unique (the
algorithm breaks down in this case). Note that mexLasso can return the whole
regularization path of the first signal x1, and can handle the matrix D implicitly
if the quantities D^T D and D^T x are passed as arguments; see below:
Usage: [A [path]]=mexLasso(X,D,param); or [A [path]]=mexLasso(X,Q,q,param);
Name: mexLasso
Description: mexLasso is an efficient implementation of the homotopy-LARS
algorithm for solving the Lasso.
See http://www.di.ens.fr/willow/spams/index.html for more information.
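As a rough stand-in for mexLasso (which is MATLAB/C), the penalized form of problem (A.1) can be solved with iterative soft-thresholding (ISTA); the sketch below also shows a simple coefficient-energy classification rule in the spirit of the l1 framework. The dictionary, the test signal and the lam value are all made up for illustration:

```python
import math

def soft(x, t):
    """Soft-thresholding operator, the proximal map of the l1 norm."""
    return math.copysign(max(abs(x) - t, 0.0), x)

def ista(x, D, lam=0.05, iters=500):
    """Minimise 0.5*||x - D a||_2^2 + lam*||a||_1 by iterative
    soft-thresholding (a rough stand-in for the LARS solver in SPAMS)."""
    m, p = len(D), len(D[0])
    a = [0.0] * p
    # Frobenius norm squared upper-bounds the gradient's Lipschitz constant
    L = sum(D[i][j] ** 2 for i in range(m) for j in range(p))
    for _ in range(iters):
        r = [sum(D[i][j] * a[j] for j in range(p)) - x[i] for i in range(m)]
        g = [sum(D[i][j] * r[i] for i in range(m)) for j in range(p)]
        a = [soft(a[j] - g[j] / L, lam / L) for j in range(p)]
    return a

# Toy dictionary: columns 0-1 belong to class A, columns 2-3 to class B.
D = [[1.0, 0.9, 0.0, 0.1],
     [0.0, 0.1, 1.0, 0.9],
     [0.5, 0.4, 0.0, 0.0]]
x = [0.95, 0.05, 0.45]                       # close to the class A atoms
a = ista(x, D)
energy = {"A": abs(a[0]) + abs(a[1]), "B": abs(a[2]) + abs(a[3])}
print(max(energy, key=energy.get))           # coefficient energy picks a class
```

Classifying by per-class coefficient energy is one simple choice; residual-based rules are equally common in sparse-representation classification.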
Appendix B
Space Time Interest Points(STIP)
In the spatial domain, points with a significant local variation of image
intensities are frequently denoted as "interest points" and are attractive due to
their high information content. These interest point detectors are used in various
applications such as image indexing [15], stereo matching [17][18], optical flow
estimation and tracking [19], and recognition [20][21].
Laptev and Lindeberg [3] extended this idea of interest points to the
spatio-temporal domain and illustrated how the resulting space-time features
often correspond to interesting events in video data. The idea of detecting
spatio-temporal interest points builds upon the Harris and Förstner interest point
operators [12][13], which capture large variations in both the spatial and
temporal dimensions.
We used the code for STIP computation from:
— STIP implementation v1.0 — (18-06-2008)
http://www.irisa.fr/vista/Equipe/People/Laptev/download/stip-1.0.zip
It was developed in 2006-2008 jointly at INRIA Rennes (http://www.irisa.fr/vista)
and IDIAP (www.idiap.ch) under the supervision of Ivan Laptev and Barbara Caputo.
General: The code detects Space-Time Interest Points (STIPs) and computes
corresponding local space-time descriptors. The currently implemented detector
resembles the extended space-time Harris detector described in [Laptev IJCV05].
The code does not implement scale selection; instead it selects scales that roughly
correspond to the size of the detected events in space and to their duration in
time, and detects points for a set of multiple combinations of spatial and
temporal scales. This simplification appears to produce similar (or better) results
in applications (e.g. action recognition) while resulting in a considerable speed-up
and close-to-video-rate run time.
The currently implemented types of descriptors are HOG (Histograms of
Oriented Gradients) and HOF (Histograms of Optical Flow), computed on a 3D
video patch in the neighborhood of each detected STIP. The patch is partitioned
into a grid with 3x3x2 spatio-temporal blocks; 4-bin HOG descriptors and 5-bin
HOF descriptors are then computed for all blocks and concatenated into
72-element and 90-element descriptors, respectively.
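The descriptor sizes quoted above follow directly from the grid layout:

```python
# A 3x3x2 spatio-temporal grid gives 18 blocks; 4-bin HOG and 5-bin HOF
# histograms per block concatenate into 72- and 90-element descriptors.
blocks = 3 * 3 * 2
hog_len = blocks * 4
hof_len = blocks * 5
print(blocks, hog_len, hof_len)  # 18 72 90
```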
Appendix C
Latent Dirichlet Allocation
Latent Dirichlet allocation (LDA) is a generative model that allows sets of
observations to be explained by unobserved groups that explain why some parts
of the data are similar. It is a powerful learning algorithm for automatically and
jointly clustering words into "topics" and documents into mixtures of topics. A
topic model is, roughly, a hierarchical Bayesian model that associates with each
document a probability distribution over "topics", which are in turn distributions
over words.
Figure C.1: The generative LDA Process
Figure C.2: Representation of the LDA model
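The generative process of Figure C.1 can be sketched in a few lines (a toy Python illustration; alpha, beta and the vocabulary are made up):

```python
import random

def generate_doc(alpha_topics, beta, n_words, rng):
    """One pass of the generative LDA process: draw a topic mixture theta,
    then for each word draw a topic z ~ theta and a word w ~ beta[z]."""
    # Dirichlet draw via normalised Gamma samples
    g = [rng.gammavariate(a, 1.0) for a in alpha_topics]
    theta = [v / sum(g) for v in g]
    doc = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]
        doc.append(w)
    return doc

rng = random.Random(42)
beta = [[0.7, 0.2, 0.1, 0.0],    # topic 0 favours words 0-1
        [0.0, 0.1, 0.2, 0.7]]    # topic 1 favours words 2-3
print(generate_doc([1.0, 1.0], beta, 10, rng))
```

In our setting a "document" is a video sequence, the "words" are visual words, and the "topics" play the role of atomic action components.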
We used the freely available code for latent Dirichlet allocation, developed by
Daichi Mochihashi (Computational Linguistics Laboratory, NAIST, Japan / ATR
Spoken Language Translation Research Laboratories, Kyoto, Japan,
[email protected]).
Overview
lda is a Latent Dirichlet Allocation (Blei et al., 2001) package written both in
MATLAB and C (command line interface). This package provides only the
standard variational Bayes estimation that was first proposed, but has a simple
textual data format that is almost the same as that of SVMlight or TinySVM.
The package can be used as an aid to understanding LDA, or simply as a
regularized alternative to PLSI, which has a severe overfitting problem due to its
maximum likelihood structure. Advanced users who wish to benefit from the
latest results may consider using npbayes or MPCA, though they have non-trivial
data structures.
Requirements
C version:
* ANSI C compiler.
Systems below are confirmed to compile.
- Linux 2.4.20, Redhat 9, gcc 3.2.2
- Linux 2.6.5, Fedora core release 2, gcc 3.3.3
- FreeBSD 4.8-STABLE, gcc 2.95.4 (GNU make)
- SunOS 5.8, gcc 2.95.3 (GNU make)
MATLAB version:
* A MATLAB environment. The Statistics Toolbox may be needed for the psi()
function (but if it is not installed, consider using Minka's Lightspeed MATLAB
toolbox).
* Octave is not supported.
Install
C version:
1. Take a glance at the Makefile and type make.
2. The C version is not intended for those who are not familiar with C. The
Makefile and source files are very simple, so you can modify them as needed if the
package does not compile. (If severe problems are found, please contact the author.)
MATLAB version:
Simply add the directory where you have unpacked the *.m files to the MATLAB
path. For example:
1) cd /work
2) tar xvfz lda-0.1-matlab.tar.gz
3) cd lda-0.1-matlab/
4) matlab
addpath /work/lda-0.1-matlab
Download
* C version: lda-0.1.tar.gz
* MATLAB version: lda-0.1-matlab.tar.gz
Performance
* The C version runs about 8 times or more faster than MATLAB (even though
the MATLAB code is fully vectorized).
* However, the MATLAB version stays within the MATLAB environment, so it
is easy to investigate and manipulate the parameters (especially graphically,
using plot or surf). Moreover, the MATLAB code is simple and easy to understand.
* To estimate the parameters of a 50-class LDA decomposition of the standard
Cranfield collection (1397 documents, 5177 unique terms),
- the C version took 1 minute 32 seconds,
- the MATLAB version took 38 minutes 55 seconds, on a Xeon 2.8GHz.
It runs efficiently in low memory: in the experiment above, it uses only 6.8MB
(C) and 29MB (MATLAB) of memory.
Getting Started
This package contains a sample data file, train, compiled from the first 100
documents of the Cranfield collection. Each feature id corresponds to the
respective line of the file train.lex; that is, feature 20 means the word "accuracy",
feature 21 means "accurate", and so on. After compilation, you can test the
package on the train data as follows.
C version:
lda -N 20 train model
MATLAB version:
matlab
[alpha, beta] = ldamain('train', 20);
The C version creates two files, model.alpha and model.beta; the MATLAB
version creates a 1x20-dimensional vector alpha and a 1324x20-dimensional
matrix beta. The parameters of the resulting models are explained in the sections
below.
Data Format
The data format is common to the C and MATLAB versions and is almost the
same as the widely used SVMlight format, except that there are no labels, since
LDA is an unsupervised method.
A data file is an ASCII text file in which each line represents a document. (NB:
document is simply a synonym for a group of data, so you can interpret it
however you like whenever it means a group of data.) A typical data file looks as
follows:
1:1 2:4 5:2
1:2 3:3 5:1 6:1 7:1
2:4 5:1 7:1
* Each line can be at most 65535 bytes (about 820 lines of 80-column text) by
default. For a standard document this is sufficient, but if you wish to increase the
limit, modify BUFSIZE in feature.c as you like.
* Each line consists of pairs of <featureid>:<count>. Here, featureid is an
integer starting from 1 (the same as in SVMlight); count can be an integer or a
real number, and must be positive.
* <featureid>:<count> pairs are separated by (possibly multiple) white spaces.
The program is coded to work even if there are empty lines, but it is preferable
to have no such unnecessary lines.
* For a complete specification, please refer to SVMlight's page.
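A minimal parser for this line format might look as follows (a hypothetical Python sketch, not part of the lda package):

```python
def parse_line(line):
    """Parse one document line of the SVMlight-style LDA input format:
    whitespace-separated <featureid>:<count> pairs, ids starting at 1."""
    doc = {}
    for pair in line.split():
        fid, count = pair.split(":")
        doc[int(fid)] = doc.get(int(fid), 0) + float(count)
    return doc

print(parse_line("1:1 2:4 5:2"))
print(parse_line("1:2 3:3 5:1 6:1 7:1"))
```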
Command Line Syntax
C version
lda is typically invoked simply as: lda -N 100 train model
Here train is a data file in the format described above, and model is the
basename of the output files of model parameters. Specifically, lda writes two
outputs, model.alpha and model.beta, which represent alpha and beta in the
LDA model described in (Blei et al., 2001); that is, alpha is the parameter of the
prior Dirichlet distribution over the latent classes, and beta is the set of class
unigrams for each latent class. -N 100 is the number of latent classes to assume
in the data. For the standard LDA model, this is the only parameter that must
be provided in advance. In this case, 100 latent classes are assumed.
Besides, there are several rarely-used options:
lda -h
lda, a Latent Dirichlet Allocation package. Copyright (c) 2004 Daichi Mochihashi,
All rights reserved. usage: lda -N classes [-I emmax -D demmax -E epsilon] train
model
-I emmax
Maximum number of iterations of the outer VB-EM algorithm, which exits when
converged. (default 100)
-D demmax
Maximum number of iterations of the inner VB-EM algorithm for each document,
which exits when converged. (default 20)
-E epsilon
A threshold that determines overall convergence of the estimation. It is a lower
threshold on the relative increase in the total data likelihood. (default 0.0001)
-h
Displays help.
MATLAB version
First, you must load a data file into a MATLAB data structure:
matlab
d = fmatrix('train');
Then run the function lda to estimate the parameters. The second argument is
the number of latent classes you assume (20 in the example below).
help lda
Latent Dirichlet Allocation, standard model. Copyright (c) 2004 Daichi
Mochihashi, all rights reserved.
$Id: index.html,v 1.3 2004/12/04 12:47:35 daiti-m Exp $
[alpha,beta] = lda(d,k,[emmax,demmax])
d : data of documents
k : # of classes to assume
emmax : # of maximum VB-EM iterations (default 100)
demmax : # of maximum VB-EM iterations for a document (default 20)
[alpha,beta] = lda(d,20);
The optional parameters emmax and demmax can be fed into lda, with the same
meaning as in the C version. If you find loading text data into a MATLAB
structure in advance troublesome, there is a wrapper function, ldamain, that
works exactly like the C version:
[alpha,beta] = ldamain('train.dat',20);
Output Format
MATLAB version
In the example above, alpha is an N-dimensional row vector of alpha for the
corresponding latent topics, and beta is a [V,N]-dimensional matrix of beta,
where beta(v,n) = p(v|n) (n = 1 .. N, v = 1 .. V; V is the size of the lexicon).
You can save them to file using the standard MATLAB function save, for
example as:
[alpha,beta] = ldamain('train.dat',20);
number of latent classes = 20
number of documents = 100
number of words = 1324
iteration 26/100.. likelihood = 339.167
ETA: 0:01:03 (1 sec/step)
converged.
save('alpha.dat', 'alpha', '-ascii');
save('beta.dat', 'beta', '-ascii');
C version
If you invoke lda as follows,
lda -N 100 train model
two files, model.alpha and model.beta, are created. These two files have exactly
the same format as those saved from MATLAB: model.alpha is a space-separated
N-dimensional vector of alpha, and model.beta is a space-separated V x N matrix
of beta. These parameters can be loaded into MATLAB in the standard way:
beta = load('model.beta');
You can then manipulate these parameters within MATLAB.
Bibliography
[1] “SPArse Modeling Software (SPAMS),” http://www.di.ens.fr/willow/spams/index.html.
[2] “Space Time Interest Points (STIP),” http://www.di.ens.fr/~laptev/interestpoints.html.
[3] I. Laptev and T. Lindeberg, “Space-time interest points,” in Proceedings of the
Ninth IEEE International Conference on Computer Vision, vol. 1, pp. 432–439,
2003.
[4] T. B. Moeslund, A. Hilton, and V. Kruger, “A survey of advances in vision-
based human motion capture and analysis,” Computer Vision and Image Un-
derstanding, vol. 104, no. 2-3, pp. 90–126, 2006.
[5] A. Efros, A. Berg, G. Mori, and J. Malik, “Recognizing action at a distance,”
in Proc. 9th Int. Conf. Computer Vision, vol. 2, pp. 726–733, 2003.
[6] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, “Actions as
space-time shapes,” IEEE Conference on Computer Vision, pp. 1395–1402,
2005.
[7] C. Schüldt, I. Laptev, and B. Caputo, “Recognizing human actions: A local
SVM approach,” in International Conference on Pattern Recognition, 2004.
[8] J. L. Barron, D. J. Fleet, and S. S. Beauchemin, “Performance of optical flow
techniques,” International Journal of Computer Vision, vol. 12, no. 1, pp. 43–
77, 1994.
[9] S. M. Smith and J. M. Brady, “Susan - a new approach to low level image
processing,” International Journal of Computer Vision, vol. 23, no. 1, pp. 45–
78, 1997.
[10] M. Isard and A. Blake, “Icondensation: Unifying low-level and high-level track-
ing in a stochastic framework,” in ECCV (1) (H. Burkhardt and B. Neumann,
eds.), vol. 1406 of Lecture Notes in Computer Science, pp. 893–908, Springer,
1998.
[11] M. J. Black and A. D. Jepson, “Recognizing temporal trajectories using the
condensation algorithm,” in FG, pp. 16–21, IEEE Computer Society, 1998.
[12] W. Förstner and E. Gülch, “A fast operator for detection and precise location
of distinct points, corners and centres of circular features,” in ISPRS
Intercommission Workshop, Interlaken, pp. 281–305, 1987.
[13] C. Harris and M. Stephens, “A combined corner and edge detector,” in
Proceedings of the Alvey Vision Conference, pp. 147–151, Manchester, UK, 1988.
[14] T. Lindeberg, “Feature detection with automatic scale selection,” International
Journal of Computer Vision, vol. 30, no. 2, pp. 79–116, 1998.
[15] C. Schmid and R. Mohr, “Local grayvalue invariants for image retrieval,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 5,
pp. 530–535, 1997.
[16] I. Laptev and T. Lindeberg, “Local descriptors for spatio-temporal recogni-
tion,” in SCVMA (W. J. MacLean, ed.), vol. 3667 of Lecture Notes in Computer
Science, pp. 91–103, Springer, 2004.
[17] T. Tuytelaars and L. V. Gool, “Wide baseline stereo matching based on
local, affinely invariant regions,” in Proc. British Machine Vision Conference,
pp. 412–425, 2000.
[18] K. Mikolajczyk and C. Schmid, “An affine invariant interest point detector,”
in Proc. European Conference on Computer Vision, pp. 128–142, 2002.
[19] D. Tell and S. Carlsson, “Combining appearance and topology for wide baseline
matching,” Proc ECCV, pp. 68–81, 2002.
[20] D. G. Lowe, “Object recognition from local scale-invariant features,” in
Proceedings of the Seventh IEEE International Conference on Computer Vision,
vol. 2, pp. 1150–1157, 1999.
[21] D. Hall, J. L. Crowley, and V. C. D. Verdi, “View invariant object recognition
using coloured receptive fields,” Machine Graphics And Vision, pp. 1–12, 2000.