A Multiple Kernel Learning Based Fusion Framework for Real-Time Multi-View Action Recognition
TRANSCRIPT
Outline · Introduction · Framework Overview · Experimental Conditions · Results and Analysis · Conclusions and Future Work
A MKL Based Fusion Framework for Real-Time Multi-View Action Recognition
Feng Gu, Francisco Florez-Revuelta, Dorothy Monekosso and Paolo Remagnino
Digital Imaging Research Centre, Kingston University, London, UK
December 3rd, 2014
1 Introduction
2 Framework Overview
3 Experimental Conditions
4 Results and Analysis
5 Conclusions and Future Work
Background and Motivations
Real-time multi-view action recognition:
Has gained increasing interest in video surveillance, human-computer interaction, multimedia retrieval, etc.
Multiple cameras provide complementary fields of view (FOVs) of a monitored scene
Multiple heterogeneous video streams lead to more robust decision making
Real-time capability enables continuous long-term monitoring
Where possible, multiple cameras should be deployed to monitor human behaviour, so that data fusion techniques can be applied.
Illustration of the Monitored Scenario
[Diagram omitted: the monitored scene observed by four cameras, C1 to C4]
Motion-Based Person Detector
We use a state-of-the-art motion-based tracker [6]:
Each pixel modelled as a mixture of Gaussians in RGB space
Background model to find foreground pixels in a new frame
Foreground pixels grouped into large regions associated with the person of interest
Kalman filters used to track foreground detections
Person detections generated for every frame
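The per-pixel background model can be sketched as follows. This is a simplified single-Gaussian-per-pixel variant rather than the full mixture of Gaussians of [6], and the learning rate `alpha` and threshold `k` are hypothetical parameters:

```python
import numpy as np

def update_background(mean, var, frame, alpha=0.05, k=2.5):
    """Single-Gaussian-per-pixel background model (a simplification of
    the per-pixel mixture of Gaussians used in the tracker of [6])."""
    # Pixels further than k standard deviations from the mean are foreground.
    dist = np.abs(frame - mean)
    foreground = dist > k * np.sqrt(var)
    # Update the model only where the pixel matched the background.
    bg = ~foreground
    mean[bg] += alpha * (frame[bg] - mean[bg])
    var[bg] += alpha * ((frame[bg] - mean[bg]) ** 2 - var[bg])
    return foreground

# Synthetic example: a static grey background with one bright blob.
h, w = 24, 32
mean = np.full((h, w), 50.0)
var = np.full((h, w), 4.0)
frame = np.full((h, w), 50.0)
frame[5:10, 5:10] = 200.0           # the "person" of interest
fg = update_background(mean, var, frame)
print(int(fg.sum()))                # 25 foreground pixels
```

The foreground mask produced per frame is what gets grouped into regions and handed to the Kalman-filter tracker.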
Feature Representation of Videos
Use STIP and improved dense trajectories (IDT) [7] as local descriptors to extract visual features from a video
Person detections and frame spans define an XYT cuboid associated with an action performed by the monitored person
Apply bag of words (BOWs) to compute the feature vector of a cuboid, where K-Means clustering is used to generate the codebook
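The bag-of-words step can be sketched with a nearest-codeword assignment; the random `codebook` below stands in for one learned by K-Means, and all the sizes are illustrative rather than the talk's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 16))   # stand-in for a K-Means codebook (8 words)

def bow_vector(descriptors, codebook):
    """Quantise a cuboid's local descriptors against the codebook and
    return an l1-normalised word histogram."""
    # Assign each descriptor to its nearest codeword (Euclidean distance).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()          # l1 normalisation

x = bow_vector(rng.normal(size=(40, 16)), codebook)
```

One such histogram is computed per XYT cuboid and becomes the feature vector fed to the SVMs.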
Discriminative Models for Classification
Let $x_i^k \in \mathbb{R}^D$, where $i \in \{1, 2, \ldots, N\}$ is the index of a feature vector corresponding to an XYT cuboid and $k \in \{1, 2, \ldots, K\}$ is the index of a camera view. We learn an SVM classifier as
$$f(x) = \sum_{i=1}^{N} \alpha_i y_i k(x_i, x) + b \qquad (1)$$
We then compute a classification score via a sigmoid function as
$$p(y = 1 \mid x) = \frac{1}{1 + \exp(-f(x))} \qquad (2)$$
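Equations (1) and (2) can be written out directly; the linear kernel and the toy support vectors below are placeholders for illustration only:

```python
import numpy as np

def decision_function(x, support_vectors, alpha_y, b, kernel):
    """f(x) = sum_i alpha_i * y_i * k(x_i, x) + b, as in Eq. (1);
    alpha_y holds the products alpha_i * y_i."""
    return sum(ay * kernel(sv, x) for ay, sv in zip(alpha_y, support_vectors)) + b

def classification_score(f):
    """p(y = 1 | x) = 1 / (1 + exp(-f(x))), as in Eq. (2)."""
    return 1.0 / (1.0 + np.exp(-f))

# Tiny illustration with a linear kernel stand-in.
linear = lambda a, b: float(np.dot(a, b))
svs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
f = decision_function(np.array([1.0, 1.0]), svs, [0.7, -0.3], 0.1, linear)
p = classification_score(f)   # f = 0.7 - 0.3 + 0.1 = 0.5, so p > 0.5
```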
Simple Fusion Strategies
Concatenation of Features: concatenate the feature vectors of all views into a single feature vector, $x_i = [x_i^1, \ldots, x_i^K]$
Sum of Classification Scores: compute a classification score $p(y = 1 \mid x^k)$ for each camera view as in (2), then average them: $\frac{1}{K} \sum_{k=1}^{K} p(y = 1 \mid x^k)$
Product of Classification Scores: apply the product rule to the classification scores of all the camera views: $\prod_{k=1}^{K} p(y = 1 \mid x^k)$
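The two score-level rules can be sketched as below (feature concatenation happens earlier, at the feature level, before any SVM is trained); the per-view scores are made up:

```python
import numpy as np

def fuse_scores(scores, strategy="sum"):
    """Combine per-view classification scores p(y = 1 | x^k), k = 1..K."""
    scores = np.asarray(scores, dtype=float)
    if strategy == "sum":        # average of the K per-view scores
        return scores.mean()
    if strategy == "product":    # product rule over the K views
        return scores.prod()
    raise ValueError(f"unknown strategy: {strategy}")

views = [0.9, 0.8, 0.6]          # hypothetical scores from three cameras
avg = fuse_scores(views, "sum")
prod = fuse_scores(views, "product")
```

Note how the product rule penalises a single low-confidence view much more heavily than the average does.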
Multiple Kernel Learning
Combine multiple kernels corresponding to different data sources (e.g. camera views) via a convex combination such as
$$K(x_i, x_j) = \sum_{k=1}^{K} \beta_k k_k(x_i, x_j) \qquad (3)$$
where $\beta_k \geq 0$, $\sum_{k=1}^{K} \beta_k = 1$, and each kernel $k_k$ only uses a distinct set of features from one data source.
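Equation (3) amounts to a weighted sum of per-view Gram matrices; a minimal sketch with made-up 2x2 kernels:

```python
import numpy as np

def combined_kernel(per_view_grams, beta):
    """K = sum_k beta_k * K_k with beta_k >= 0 and sum_k beta_k = 1,
    as in Eq. (3)."""
    beta = np.asarray(beta, dtype=float)
    assert np.all(beta >= 0.0) and abs(beta.sum() - 1.0) < 1e-9
    return sum(b * K for b, K in zip(beta, per_view_grams))

K1 = np.array([[1.0, 0.2], [0.2, 1.0]])   # Gram matrix from view 1
K2 = np.array([[1.0, 0.6], [0.6, 1.0]])   # Gram matrix from view 2
K = combined_kernel([K1, K2], beta=[0.75, 0.25])
```

A convex combination of valid (positive semi-definite) kernels is itself a valid kernel, which is what lets a single SVM be trained on the fused Gram matrix.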
Two-Step Optimisation
We need to learn the per-view SVM parameters, namely the weights $\alpha_k$ and biases $b_k$, together with the combination parameters $\beta_k$ in (3). This can be solved as follows:
Step 1: optimise over the kernel parameters $\alpha_k$ and $b_k$ while fixing the combination parameters $\beta_k$ (quadratic programming)
Step 2: optimise over the combination parameters $\beta_k$ while fixing the kernel parameters $\alpha_k$ and $b_k$ (gradient descent)
Alternate between the two steps iteratively until the system converges to an optimal solution
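The step-2 update must keep β on the probability simplex of Eq. (3). The sketch below stubs out the SVM QP of step 1 and uses a made-up constant gradient, so it shows only the projected-gradient mechanics, not the actual MKL objective:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {beta : beta_k >= 0, sum_k beta_k = 1},
    so the constraints of Eq. (3) hold after every gradient step."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

beta = np.array([0.5, 0.3, 0.2])
for _ in range(20):
    # Step 1: here one would solve the SVM QP with the combined kernel fixed.
    grad = np.array([0.1, -0.2, 0.4])          # placeholder gradient dJ/dbeta
    beta = project_simplex(beta - 0.1 * grad)  # Step 2: descent + projection
```

After any number of iterations, β remains non-negative and sums to one.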
IXMAS Multi-View Dataset
Created for view-invariant human action recognition [8]
Includes 13 daily actions, each performed 3 times by 12 actors
Video sequences collected via 5 cameras, at 23 frames per second and 390 × 291 resolution
We use all 12 actors and 5 cameras and evaluate 11 actions, as in [9]
Leave-one-subject-out cross-validation is used in the experiments
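Leave-one-subject-out splitting over the 12 IXMAS actors can be sketched in a few lines (the actor names here are placeholders, not the dataset's real subject names):

```python
# One fold per actor: test on the held-out actor, train on the other 11.
subjects = [f"actor{i:02d}" for i in range(12)]
folds = [(test, [s for s in subjects if s != test]) for test in subjects]
```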
Implementation Details
A codebook of size 4,000, quantised from 100,000 randomly selected descriptor features of the training set
The STIP descriptor uses the entire image plane and the frame span of an action given in the ground truth to define a cuboid
The IDT descriptor relies on the person detections in addition to the frame span
All the SVM models use $\ell_1$ normalisation and the $\chi^2$ kernel
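The ℓ1 normalisation and χ² kernel pairing can be sketched as follows; the exponential form and the `gamma` bandwidth are one common convention, assumed here rather than taken from the talk:

```python
import numpy as np

def l1_normalise(x):
    # l1 normalisation: histogram entries sum to one.
    return x / np.abs(x).sum()

def chi2_kernel(x, z, gamma=1.0):
    """Exponential chi-squared kernel, commonly paired with l1-normalised
    BOW histograms; the small epsilon guards against division by zero."""
    return float(np.exp(-gamma * ((x - z) ** 2 / (x + z + 1e-12)).sum()))

a = l1_normalise(np.array([3.0, 1.0, 0.0, 4.0]))
b = l1_normalise(np.array([1.0, 1.0, 1.0, 1.0]))
k_ab = chi2_kernel(a, b)   # in (0, 1], and equal to 1 only when a == z
```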
Person Detection Results
[Image panels omitted: detection bounding boxes for cam0, cam1, cam2, cam3 and cam4]
Figure: Detection results of the motion-based tracker on the first run of the subject 'Alba', for all the camera views.
Results of STIP (Internal Comparison)
[Bar chart omitted: per-class recognition rates of SVM-COM, SVM-SUM, SVM-PRD and SVM-MKL for check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave, punch, kick and pick up]
Figure: Class-wise mean recognition rates over all folds of the compared methods using the STIP descriptor, where µSVM-COM = 0.819, µSVM-SUM = 0.820, µSVM-PRD = 0.815, and µSVM-MKL = 0.842.
Results of IDT (Internal Comparison)
[Bar chart omitted: per-class recognition rates of SVM-COM, SVM-SUM, SVM-PRD and SVM-MKL for check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave, punch, kick and pick up]
Figure: Class-wise mean recognition rates over all folds of the compared methods using the IDT descriptor, where µSVM-COM = 0.915, µSVM-SUM = 0.927, µSVM-PRD = 0.921, and µSVM-MKL = 0.950.
Comparison with State-of-the-Art (External Comparison)
Method                   Actions  Actors  Views  Rate   FPS
Cilla et al. [3]           11       12      5    0.913  N/A
Weinland et al. [10]       11       10      5    0.933  N/A
Cilla et al. [4]           11       10      5    0.940  N/A
Holte et al. [5]           13       12      5    1.000  N/A
Weinland et al. [9]        11       10      5    0.835  500
Chaaraoui et al. [1]       11       12      5    0.859   26
Chaaraoui et al. [2]       11       12      5    0.914  207
SVM-MKL (IDT+BOWs)         11       12      5    0.950   25

Table: Comparison of the proposed MKL method using the IDT descriptor and BOWs; the methods with 'N/A' in the FPS column are offline.
Conclusions and Future Work
The proposed MKL based framework outperforms the simple fusion techniques and the state-of-the-art methods
The IDT descriptor is superior to the STIP descriptor for feature representation in action recognition
The proposed framework is capable of performing real-time action recognition at 25 frames per second
In the future: apply the framework to other similar vision problems, and study alternative feature representations and fusion techniques.
Thank you very much! Any questions?
[1] A. A. Chaaraoui, P. Climent-Perez, and F. Florez-Revuelta. Silhouette-based human action recognition using sequences of key poses. Pattern Recognition Letters, 34:1799-1807, 2013.
[2] A. A. Chaaraoui, J. R. Padilla-Lopez, F. J. Ferrandez-Pastor, M. Nieto-Hidalgo, and F. Florez-Revuelta. A vision-based system for intelligent monitoring: human behaviour analysis and privacy by context. Sensors, 14:8895-8925, 2014.
[3] R. Cilla, M. A. Patricio, and A. Berlanga. A probabilistic, discriminative and distributed system for the recognition of human actions from multiple views. Neurocomputing, 75:78-87, 2012.
[4] R. Cilla, M. A. Patricio, A. Berlanga, and J. M. Molina.
Human action recognition with sparse classification and multiple-view learning. Expert Systems, DOI: 10.1111/exsy.12040, 2013.
[5] M. Holte, B. Chakraborty, J. Gonzalez, and T. Moeslund. A local 3-D motion descriptor for multi-view human action recognition from 4-D spatio-temporal interest points. IEEE Journal of Selected Topics in Signal Processing, 6:553-565, 2012.
[6] C. Stauffer and W. Grimson. Learning patterns of activity using real time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 22(8):747-767, 2000.
[7] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid.
Evaluation of local spatio-temporal features for action recognition. In British Machine Vision Conference (BMVC), 2009.
[8] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views using 3d exemplars. In IEEE International Conference on Computer Vision (ICCV), pages 1-7, 2007.
[9] D. Weinland, M. Ozuysal, and P. Fua. Making action recognition robust to occlusions and viewpoint changes. In European Conference on Computer Vision (ECCV), 2010.
[10] D. Weinland, R. Ronfard, and E. Boyer. Free viewpoint action recognition using motion history volumes.