A Multiple Kernel Learning Based Fusion Framework for Real-Time Multi-View Action Recognition
TRANSCRIPT
Outline · Introduction · Framework Overview · Experimental Conditions · Results and Analysis · Conclusions and Future Work
A MKL Based Fusion Framework for Real-Time Multi-View Action Recognition
Feng Gu, Francisco Florez-Revuelta, Dorothy Monekosso and Paolo Remagnino
Digital Imaging Research Centre, Kingston University, London, UK
December 3rd, 2014
1 Introduction
2 Framework Overview
3 Experimental Conditions
4 Results and Analysis
5 Conclusions and Future Work
Background and Motivations
Real-time multi-view action recognition:
Has gained increasing interest in video surveillance, human-computer interaction, multimedia retrieval, etc.
Multiple cameras provide complementary fields of view (FOVs) of a monitored scene
Multiple heterogeneous video streams lead to more robust decision making
Real-time capability enables continuous long-term monitoring
Where possible, multiple cameras should be deployed to monitor human behaviour, so that data fusion techniques can be applied.
Illustration of the Monitored Scenario
[Diagram omitted: the monitored scene observed by four cameras, C1 to C4]
Motion-Based Person Detector
We use a state-of-the-art motion-based tracker [6]:
Each pixel modelled as a mixture of Gaussians in RGB space
Background model to find foreground pixels in a new frame
Foreground pixels grouped into large regions associated with the person of interest
Kalman filters used to track foreground detections
Person detections generated for every frame
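The per-pixel background model can be sketched as follows. This is a simplified single-Gaussian-per-pixel variant rather than the full mixture of Gaussians of [6], and the learning rate `alpha` and threshold `k` are hypothetical parameters:

```python
import numpy as np

def update_background(mean, var, frame, alpha=0.05, k=2.5):
    """Single-Gaussian-per-pixel background model (a simplification of
    the per-pixel mixture of Gaussians used in the tracker of [6])."""
    # Pixels further than k standard deviations from the mean are foreground.
    dist = np.abs(frame - mean)
    foreground = dist > k * np.sqrt(var)
    # Update the model only where the pixel matched the background.
    bg = ~foreground
    mean[bg] += alpha * (frame[bg] - mean[bg])
    var[bg] += alpha * ((frame[bg] - mean[bg]) ** 2 - var[bg])
    return foreground

# Synthetic example: a static grey background with one bright blob.
h, w = 24, 32
mean = np.full((h, w), 50.0)
var = np.full((h, w), 4.0)
frame = np.full((h, w), 50.0)
frame[5:10, 5:10] = 200.0           # the "person" of interest
fg = update_background(mean, var, frame)
print(int(fg.sum()))                # 25 foreground pixels
```

The foreground mask produced per frame is what gets grouped into regions and handed to the Kalman-filter tracker.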
Feature Representation of Videos
Use STIP and improved dense trajectories (IDT) [7] as local descriptors to extract visual features from a video
Person detections and frame spans define an XYT cuboid associated with an action performed by the monitored person
Apply bag of words (BOWs) to compute the feature vector of a cuboid, where K-Means clustering is used to generate the codebook
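The bag-of-words step can be sketched with a nearest-codeword assignment; the random `codebook` below stands in for one learned by K-Means, and all the sizes are illustrative rather than the talk's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 16))   # stand-in for a K-Means codebook (8 words)

def bow_vector(descriptors, codebook):
    """Quantise a cuboid's local descriptors against the codebook and
    return an l1-normalised word histogram."""
    # Assign each descriptor to its nearest codeword (Euclidean distance).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()          # l1 normalisation

x = bow_vector(rng.normal(size=(40, 16)), codebook)
```

One such histogram is computed per XYT cuboid and becomes the feature vector fed to the SVMs.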
Discriminative Models for Classification
Let $x_i^k \in \mathbb{R}^D$, where $i \in \{1, 2, \ldots, N\}$ is the index of a feature vector corresponding to an XYT cuboid and $k \in \{1, 2, \ldots, K\}$ is the index of a camera view. We learn an SVM classifier as
$$f(x) = \sum_{i=1}^{N} \alpha_i y_i k(x_i, x) + b \qquad (1)$$
We then compute a classification score via a sigmoid function as
$$p(y = 1 \mid x) = \frac{1}{1 + \exp(-f(x))} \qquad (2)$$
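Equations (1) and (2) can be written out directly; the linear kernel and the toy support vectors below are placeholders for illustration only:

```python
import numpy as np

def decision_function(x, support_vectors, alpha_y, b, kernel):
    """f(x) = sum_i alpha_i * y_i * k(x_i, x) + b, as in Eq. (1);
    alpha_y holds the products alpha_i * y_i."""
    return sum(ay * kernel(sv, x) for ay, sv in zip(alpha_y, support_vectors)) + b

def classification_score(f):
    """p(y = 1 | x) = 1 / (1 + exp(-f(x))), as in Eq. (2)."""
    return 1.0 / (1.0 + np.exp(-f))

# Tiny illustration with a linear kernel stand-in.
linear = lambda a, b: float(np.dot(a, b))
svs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
f = decision_function(np.array([1.0, 1.0]), svs, [0.7, -0.3], 0.1, linear)
p = classification_score(f)   # f = 0.7 - 0.3 + 0.1 = 0.5, so p > 0.5
```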
Simple Fusion Strategies
Concatenation of Features: concatenate the feature vectors of all views into a single feature vector, $x_i = [x_i^1, \ldots, x_i^K]$
Sum of Classification Scores: compute a classification score $p(y = 1 \mid x^k)$ for each camera view as in (2), then average them: $\frac{1}{K} \sum_{k=1}^{K} p(y = 1 \mid x^k)$
Product of Classification Scores: apply the product rule to the classification scores of all the camera views: $\prod_{k=1}^{K} p(y = 1 \mid x^k)$
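The two score-level rules can be sketched as below (feature concatenation happens earlier, at the feature level, before any SVM is trained); the per-view scores are made up:

```python
import numpy as np

def fuse_scores(scores, strategy="sum"):
    """Combine per-view classification scores p(y = 1 | x^k), k = 1..K."""
    scores = np.asarray(scores, dtype=float)
    if strategy == "sum":        # average of the K per-view scores
        return scores.mean()
    if strategy == "product":    # product rule over the K views
        return scores.prod()
    raise ValueError(f"unknown strategy: {strategy}")

views = [0.9, 0.8, 0.6]          # hypothetical scores from three cameras
avg = fuse_scores(views, "sum")
prod = fuse_scores(views, "product")
```

Note how the product rule penalises a single low-confidence view much more heavily than the average does.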
Multiple Kernel Learning
Combine multiple kernels corresponding to different data sources (e.g. camera views) via a convex combination such as
$$K(x_i, x_j) = \sum_{k=1}^{K} \beta_k k_k(x_i, x_j) \qquad (3)$$
where $\beta_k \geq 0$, $\sum_{k=1}^{K} \beta_k = 1$, and each kernel $k_k$ only uses a distinct set of features from one data source.
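Equation (3) amounts to a weighted sum of per-view Gram matrices; a minimal sketch with made-up 2x2 kernels:

```python
import numpy as np

def combined_kernel(per_view_grams, beta):
    """K = sum_k beta_k * K_k with beta_k >= 0 and sum_k beta_k = 1,
    as in Eq. (3)."""
    beta = np.asarray(beta, dtype=float)
    assert np.all(beta >= 0.0) and abs(beta.sum() - 1.0) < 1e-9
    return sum(b * K for b, K in zip(beta, per_view_grams))

K1 = np.array([[1.0, 0.2], [0.2, 1.0]])   # Gram matrix from view 1
K2 = np.array([[1.0, 0.6], [0.6, 1.0]])   # Gram matrix from view 2
K = combined_kernel([K1, K2], beta=[0.75, 0.25])
```

A convex combination of valid (positive semi-definite) kernels is itself a valid kernel, which is what lets a single SVM be trained on the fused Gram matrix.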
Two-Step Optimisation
We need to learn the per-view SVM parameters, namely the weights $\alpha_k$ and biases $b_k$, together with the combination parameters $\beta_k$ in (3). This can be solved as follows:
Step 1: optimise over the kernel parameters $\alpha_k$ and $b_k$ while fixing the combination parameters $\beta_k$ (quadratic programming)
Step 2: optimise over the combination parameters $\beta_k$ while fixing the kernel parameters $\alpha_k$ and $b_k$ (gradient descent)
Alternate between the two steps iteratively until the system converges to an optimal solution
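The step-2 update must keep β on the probability simplex of Eq. (3). The sketch below stubs out the SVM QP of step 1 and uses a made-up constant gradient, so it shows only the projected-gradient mechanics, not the actual MKL objective:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {beta : beta_k >= 0, sum_k beta_k = 1},
    so the constraints of Eq. (3) hold after every gradient step."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

beta = np.array([0.5, 0.3, 0.2])
for _ in range(20):
    # Step 1: here one would solve the SVM QP with the combined kernel fixed.
    grad = np.array([0.1, -0.2, 0.4])          # placeholder gradient dJ/dbeta
    beta = project_simplex(beta - 0.1 * grad)  # Step 2: descent + projection
```

After any number of iterations, β remains non-negative and sums to one.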
IXMAS Multi-View Dataset
Created for view-invariant human action recognition [8]
Includes 13 daily actions, each performed 3 times by 12 actors
Video sequences collected via 5 cameras, at 23 frames per second and 390 × 291 resolution
We use all 12 actors and 5 cameras and evaluate 11 actions, as in [9]
Leave-one-subject-out cross-validation is used in the experiments
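Leave-one-subject-out splitting over the 12 IXMAS actors can be sketched in a few lines (the actor names here are placeholders, not the dataset's real subject names):

```python
# One fold per actor: test on the held-out actor, train on the other 11.
subjects = [f"actor{i:02d}" for i in range(12)]
folds = [(test, [s for s in subjects if s != test]) for test in subjects]
```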
Implementation Details
A codebook of size 4,000, quantised from 100,000 randomly selected descriptor features of the training set
The STIP descriptor uses the entire image plane and the frame span of an action given in the ground truth to define a cuboid
The IDT descriptor relies on the person detections in addition to the frame span
All the SVM models use $\ell_1$ normalisation and the $\chi^2$ kernel
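The ℓ1 normalisation and χ² kernel pairing can be sketched as follows; the exponential form and the `gamma` bandwidth are one common convention, assumed here rather than taken from the talk:

```python
import numpy as np

def l1_normalise(x):
    # l1 normalisation: histogram entries sum to one.
    return x / np.abs(x).sum()

def chi2_kernel(x, z, gamma=1.0):
    """Exponential chi-squared kernel, commonly paired with l1-normalised
    BOW histograms; the small epsilon guards against division by zero."""
    return float(np.exp(-gamma * ((x - z) ** 2 / (x + z + 1e-12)).sum()))

a = l1_normalise(np.array([3.0, 1.0, 0.0, 4.0]))
b = l1_normalise(np.array([1.0, 1.0, 1.0, 1.0]))
k_ab = chi2_kernel(a, b)   # in (0, 1], and equal to 1 only when a == z
```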
Person Detection Results
[Image panels omitted: detection bounding boxes for cam0, cam1, cam2, cam3 and cam4]
Figure: Detection results of the motion-based tracker on the first run of the subject 'Alba', for all the camera views.
Results of STIP (Internal Comparison)
[Bar chart omitted: per-class recognition rates of SVM-COM, SVM-SUM, SVM-PRD and SVM-MKL for check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave, punch, kick and pick up]
Figure: Class-wise mean recognition rates over all folds of the compared methods using the STIP descriptor, where µSVM-COM = 0.819, µSVM-SUM = 0.820, µSVM-PRD = 0.815, and µSVM-MKL = 0.842.
Results of IDT (Internal Comparison)
[Bar chart omitted: per-class recognition rates of SVM-COM, SVM-SUM, SVM-PRD and SVM-MKL for check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave, punch, kick and pick up]
Figure: Class-wise mean recognition rates over all folds of the compared methods using the IDT descriptor, where µSVM-COM = 0.915, µSVM-SUM = 0.927, µSVM-PRD = 0.921, and µSVM-MKL = 0.950.
Comparison with State-of-the-Art (External Comparison)
Method                   Actions  Actors  Views  Rate   FPS
Cilla et al. [3]           11       12      5    0.913  N/A
Weinland et al. [10]       11       10      5    0.933  N/A
Cilla et al. [4]           11       10      5    0.940  N/A
Holte et al. [5]           13       12      5    1.000  N/A
Weinland et al. [9]        11       10      5    0.835  500
Chaaraoui et al. [1]       11       12      5    0.859   26
Chaaraoui et al. [2]       11       12      5    0.914  207
SVM-MKL (IDT+BOWs)         11       12      5    0.950   25

Table: Comparison of the proposed MKL method using the IDT descriptor and BOWs; the methods with 'N/A' in the FPS column are offline.
Conclusions and Future Work
The proposed MKL based framework outperforms the simple fusion techniques and the state-of-the-art methods
The IDT descriptor is superior to the STIP descriptor for feature representation in action recognition
The proposed framework is capable of performing real-time action recognition at 25 frames per second
In the future: apply the framework to other similar vision problems, and study alternative feature representations and fusion techniques.
Thank you very much! Any questions?
[1] A. A. Chaaraoui, P. Climent-Perez, and F. Florez-Revuelta. Silhouette-based human action recognition using sequences of key poses. Pattern Recognition Letters, 34:1799-1807, 2013.
[2] A. A. Chaaraoui, J. R. Padilla-Lopez, F. J. Ferrandez-Pastor, M. Nieto-Hidalgo, and F. Florez-Revuelta. A vision-based system for intelligent monitoring: human behaviour analysis and privacy by context. Sensors, 14:8895-8925, 2014.
[3] R. Cilla, M. A. Patricio, and A. Berlanga. A probabilistic, discriminative and distributed system for the recognition of human actions from multiple views. Neurocomputing, 75:78-87, 2012.
[4] R. Cilla, M. A. Patricio, A. Berlanga, and J. M. Molina.
Human action recognition with sparse classification and multiple-view learning. Expert Systems, DOI: 10.1111/exsy.12040, 2013.
[5] M. Holte, B. Chakraborty, J. Gonzalez, and T. Moeslund. A local 3-D motion descriptor for multi-view human action recognition from 4-D spatio-temporal interest points. IEEE Journal of Selected Topics in Signal Processing, 6:553-565, 2012.
[6] C. Stauffer and W. Grimson. Learning patterns of activity using real time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 22(8):747-767, 2000.
[7] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid.
Evaluation of local spatio-temporal features for action recognition. In British Machine Vision Conference (BMVC), 2009.
[8] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views using 3d exemplars. In IEEE International Conference on Computer Vision (ICCV), pages 1-7, 2007.
[9] D. Weinland, M. Ozuysal, and P. Fua. Making action recognition robust to occlusions and viewpoint changes. In European Conference on Computer Vision (ECCV), 2010.
[10] D. Weinland, R. Ronfard, and E. Boyer. Free viewpoint action recognition using motion history volumes.