A general survey of previous works on action recognition

Sobhan Naderi Parizi, September 2009


List of papers

Statistical Analysis of Dynamic Actions
On Space-Time Interest Points
Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words
What, where and who? Classifying events by scene and object recognition
Recognizing Actions at a Distance
Recognizing Human Actions: A Local SVM Approach
Retrieving Actions in Movies
Learning Realistic Human Actions from Movies
Actions in Context
Selection and Context for Action Recognition

Non-parametric Distance Measure for Action Recognition

Paper info:
Title:
▪ Statistical Analysis of Dynamic Actions
Authors:
▪ Lihi Zelnik-Manor
▪ Michal Irani
TPAMI 2006
A preliminary version appeared in CVPR 2001
▪ “Event-Based Video Analysis”

“Statistical Analysis of Dynamic Actions”

Overview:
Introduces a non-parametric distance measure
Video matching (no action model): given a reference video, similar sequences are found
Dense features from multiple temporal scales (only corresponding scales are compared)
The temporal extent of videos in each category should be the same (fast dancing and slow dancing count as different actions)
A new database is introduced
▪ Periodic activities (walk)
▪ Non-periodic activities (punch, kick, duck, tennis)
▪ Temporal textures (water)
▪ www.wisdom.weizmann.ac.il/~vision/EventDetection.html

“Statistical Analysis of Dynamic Actions”

Feature description:
Space-time gradient $(S_x, S_y, S_t)$ of each pixel
Threshold the gradient magnitudes
Normalization (ignoring appearance)
Absolute value (invariant to dark/light transitions)
▪ Direction invariant
▪ $(N_x, N_y, N_t) = \dfrac{(|S_x|, |S_y|, |S_t|)}{\sqrt{S_x^2 + S_y^2 + S_t^2}}$
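As a rough illustration of the feature description above, the sketch below normalizes one space-time gradient vector and takes absolute values. The function name, the threshold value, and the choice to return None for weak gradients are assumptions made for illustration; they are not taken from the paper.

```python
import math


def normalized_abs_gradient(sx, sy, st, threshold=1e-3):
    """Map a space-time gradient to a normalized, direction-invariant feature.

    Returns None when the gradient magnitude is below the (assumed) threshold.
    """
    magnitude = math.sqrt(sx * sx + sy * sy + st * st)
    if magnitude < threshold:
        return None  # weak gradients are discarded
    # Dividing by the magnitude removes contrast; abs() removes the sign.
    return (abs(sx) / magnitude, abs(sy) / magnitude, abs(st) / magnitude)


# A gradient and its negation map to the same feature.
print(normalized_abs_gradient(3.0, -4.0, 0.0))
print(normalized_abs_gradient(-3.0, 4.0, 0.0))
```

Because of the absolute value, a dark-to-light edge and a light-to-dark edge moving the same way produce identical features.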












“Statistical Analysis of Dynamic Actions”

Comments:
Actions are represented by 3L independent 1D distributions (L being the number of temporal scales)
The frames are blurred first
▪ Robust to changes of appearance, e.g. highly textured clothing
Action recognition/localization
▪ For a test video sequence S and a reference sequence of T frames:
▪ Each consecutive sub-sequence of length T in S is compared to the reference sequence
▪ In case of multiple reference videos:
▪ Mahalanobis distance
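The localization step above can be sketched as a sliding-window search. This is a minimal sketch under stated assumptions: each length-T window has already been summarized as a list of 1D histograms (the 3L distributions), and a chi-square distance is used as the non-parametric measure. The function names and the toy histograms are hypothetical.

```python
def chi_square(h1, h2):
    """Chi-square distance between two histograms of equal length."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def sequence_distance(hists_a, hists_b):
    """Average chi-square distance over corresponding histograms."""
    pairs = list(zip(hists_a, hists_b))
    return sum(chi_square(a, b) for a, b in pairs) / len(pairs)


def best_window(window_descriptors, reference_descriptor):
    """Return (start_index, distance) of the window closest to the reference."""
    distances = [
        sequence_distance(w, reference_descriptor) for w in window_descriptors
    ]
    start = min(range(len(distances)), key=lambda i: distances[i])
    return start, distances[start]


# Toy data: three windows, each described by three 2-bin histograms.
reference = [[0.5, 0.5], [1.0, 0.0], [0.2, 0.8]]
windows = [
    [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2]],
    [[0.5, 0.5], [1.0, 0.0], [0.2, 0.8]],
    [[0.4, 0.6], [0.9, 0.1], [0.3, 0.7]],
]
print(best_window(windows, reference))
```

Only histograms at corresponding positions are compared, which mirrors the rule that only corresponding temporal scales are matched.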

Space-Time Interest Points (STIP)

Paper info:
Title:
▪ On Space-Time Interest Points
Authors:
▪ Ivan Laptev: INRIA / IRISA
IJCV 2005

“On Space-Time Interest Points”

Extends the Harris detector to 3D (space-time)
Local space-time points with non-constant motion:
Points with accelerated motion: physical forces
Independent space and time scales
Automatic scale selection

“On Space-Time Interest Points”

Automatic scale selection procedure:
Detect interest points
Move in the direction of the optimal scale
Repeat until a locally optimal scale is reached (iterative)

The procedure cannot be used in real time:
Frames from future time are needed
There exist estimation approaches to solve this problem
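The iterative idea behind the scale selection can be sketched as a hill climb over a discrete grid of spatial and temporal scales. This is a simplified illustration, not the paper's procedure: the `response` callable is a hypothetical stand-in for the detector's scale-normalized response, and the re-detection of interest points at each step is omitted.

```python
def select_scales(response, sigmas, taus, start=(0, 0)):
    """Climb a (spatial, temporal) scale grid until no neighbour improves.

    `response(sigma, tau)` is an assumed scoring function; larger is better.
    """
    i, j = start
    while True:
        current = response(sigmas[i], taus[j])
        neighbours = [
            (i + di, j + dj)
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
            if 0 <= i + di < len(sigmas) and 0 <= j + dj < len(taus)
        ]
        if not neighbours:
            return sigmas[i], taus[j]
        best = max(neighbours, key=lambda n: response(sigmas[n[0]], taus[n[1]]))
        if response(sigmas[best[0]], taus[best[1]]) <= current:
            return sigmas[i], taus[j]  # locally optimal scale reached
        i, j = best


# Toy response that peaks at sigma = 4, tau = 2.
toy = lambda s, t: -((s - 4) ** 2 + (t - 2) ** 2)
print(select_scales(toy, [1, 2, 4, 8], [1, 2, 4]))
```

Space and time scales are stepped independently here, reflecting the independent space and time scales noted on the previous slide.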

Unsupervised Action Recognition

Paper info:
Title:
▪ Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words
Authors:
▪ Juan Carlos Niebles: University of Illinois
▪ Hongcheng Wang: University of Illinois
▪ Li Fei-Fei: University of Illinois
BMVC 2006

“Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words”

Generative graphical model (pLSA)
STIP detector is used (Piotr Dollár et al.)
Laptev’s STIP detector is too sparse
A dictionary of video words is created
The method is unsupervised
Simultaneous action recognition/localization

Evaluations on:
KTH action database
Skating actions database (4 action classes)

“Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words”

Overview of the method:
$P(w_i, d_j) = \sum_k P(w_i, d_j, z_k) = P(d_j) \sum_k P(z_k \mid d_j)\, P(w_i \mid z_k)$

w: video word
d: video sequence
z: latent topic (action category)
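To make the roles of w, d and z concrete, the sketch below mixes per-topic word distributions into a per-video word distribution, following the standard pLSA decomposition P(w | d) = sum over z of P(z | d) P(w | z). The probability tables are made-up toy values and the function name is hypothetical.

```python
def word_given_document(p_z_given_d, p_w_given_z):
    """P(w_i | d_j) = sum_k P(z_k | d_j) * P(w_i | z_k), for every word i."""
    n_words = len(p_w_given_z[0])
    return [
        sum(p_z * p_w_given_z[k][i] for k, p_z in enumerate(p_z_given_d))
        for i in range(n_words)
    ]


# Toy model: 2 topics (action categories), 3 video words.
p_w_given_z = [
    [0.7, 0.2, 0.1],  # word distribution of topic 0
    [0.1, 0.3, 0.6],  # word distribution of topic 1
]
p_z_given_d = [0.25, 0.75]  # topic mixture of one video

p_w_given_d = word_given_document(p_z_given_d, p_w_given_z)
print(p_w_given_d)
```

The result is again a probability distribution over video words, so its entries sum to one.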

“Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words”

Feature descriptor:
Brightness gradient + PCA
Brightness gradient found equivalent to optical flow for capturing motion

Multiple actions can be localized in the video:
$P(z_k \mid w_i, d_j) = \dfrac{P(w_i \mid z_k)\, P(z_k \mid d_j)}{\sum_l P(w_i \mid z_l)\, P(z_l \mid d_j)}$

Average classification accuracy:
KTH action database: 81.5%
Skating dataset: 80.67%
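Localizing several actions in one video amounts to assigning each detected video word to its most probable topic. The sketch below assumes the standard pLSA posterior, P(z | w, d) proportional to P(w | z) P(z | d); the tables are toy values, the function names are hypothetical, and the word is assumed to have non-zero probability under at least one topic.

```python
def topic_posterior(word, p_z_given_d, p_w_given_z):
    """Posterior over topics for one video word observed in one video."""
    weights = [p_z * p_w_given_z[k][word] for k, p_z in enumerate(p_z_given_d)]
    total = sum(weights)
    return [w / total for w in weights]


def label_word(word, p_z_given_d, p_w_given_z):
    """Index of the most probable topic (action category) for a word."""
    posterior = topic_posterior(word, p_z_given_d, p_w_given_z)
    return max(range(len(posterior)), key=lambda k: posterior[k])


# Toy model: 2 topics, 3 video words, one video with a known topic mixture.
p_w_given_z = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]
p_z_given_d = [0.25, 0.75]

print([label_word(w, p_z_given_d, p_w_given_z) for w in range(3)])
```

Interest points whose words receive different labels can then be grouped spatially to mark where each action occurs.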






Event recognition in sport images

Paper info:
Title:
▪ What, where and who? Classifying events by scene and object recognition
Authors:
▪ Li-Jia Li: University of Illinois
▪ Li Fei-Fei: Princeton University
ICCV 2007

“What, where and who? Classifying events by scene and object recognition”

Goal of the paper:
Event classification in still images
Scene labeling
Object labeling

Approach:
Generative graphical model
Assumes that objects and scenes are independent given the event category
Ignores spatial relationships between objects

“What, where and who? Classifying events by scene and object recognition”

Information channels:
Scene context (holistic representation)
Object appearance
Geometrical layout (sky at infinity/vertical structure/ground plane)

Feature extraction:
12x12 patches obtained by grid sampling (10x10)
For each patch:
▪ SIFT feature (used for both scene and object models)
▪ Layout label (used only for object model)

“What, where and who? Classifying events by scene and object recognition”

The graphical model
E: event
S: scene
O: object
X: scene feature
A: appearance feature
G: geometry layout

“What, where and who? Classifying events by scene and object recognition”

A new database is compiled:
8 sport event categories (downloaded from the web)
Bocce, croquet, polo, rowing, snowboarding, badminton, sailing, rock climbing

Average classification accuracy over all 8 event classes = 74.3%

“What, where and who? Classifying events by scene and object recognition”

Sample results:

Action recognition in medium resolution regimes

Paper info:
Title:
▪ Recognizing Actions at a Distance
Authors:
▪ Alexei A. Efros: UC Berkeley
▪ Alexander C. Berg: UC Berkeley
▪ Greg Mori: UC Berkeley
▪ Jitendra Malik: UC Berkeley
ICCV 2003

“Recognizing Actions at a Distance”

Overall review:
Actions in medium resolution (30 pixels tall)
Proposes a new motion descriptor
KNN for classification
A consistent tracking bounding box of the actor is required
Action recognition is done only on the tracking bounding box
Motion is described in terms of the relative movement of body parts
No information about movements is given by the tracker

“Recognizing Actions at a Distance”

Motion feature:
For each frame, a local temporal neighborhood is considered
Optical flow (OF) is extracted (other alternatives: image pixel values, temporal gradients)
OF is noisy:
▪ Half-wave rectifying + blurring
To preserve motion info:
▪ The OF vector is decomposed into its vertical/horizontal components
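A minimal sketch of the half-wave rectification step, assuming the flow field is given as two 2D lists (horizontal and vertical components). The function name is hypothetical, and the blurring that follows rectification is left out for brevity.

```python
def half_wave_rectify(flow_x, flow_y):
    """Split a flow field into four non-negative channels: Fx+, Fx-, Fy+, Fy-."""

    def positive(field):
        return [[max(v, 0.0) for v in row] for row in field]

    def negative(field):
        return [[max(-v, 0.0) for v in row] for row in field]

    return positive(flow_x), negative(flow_x), positive(flow_y), negative(flow_y)


# Toy 1x2 flow field.
flow_x = [[1.5, -2.0]]
flow_y = [[0.0, 3.0]]
fx_pos, fx_neg, fy_pos, fy_neg = half_wave_rectify(flow_x, flow_y)
print(fx_pos, fx_neg, fy_pos, fy_neg)
```

Keeping opposite directions in separate channels means that blurring afterwards does not let leftward and rightward motion cancel each other out.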

“Recognizing Actions at a Distance”

Similarity measure:
$S(i, j) = \sum_{t \in T} \sum_{c=1}^{4} \sum_{x, y \in I} a_c^{i+t}(x, y)\, b_c^{j+t}(x, y)$

i, j: frame indices
T: temporal extent
I: spatial extent
A: 1st video sequence, with channels $\{a_1^i, a_2^i, a_3^i, a_4^i\}$ for frame i
B: 2nd video sequence, with channels $\{b_1^i, b_2^i, b_3^i, b_4^i\}$ for frame i
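The similarity between frame i of one sequence and frame j of another can be sketched as a sum of channel-wise products over a temporal window. This is an illustrative sketch: sequences are nested lists indexed as [frame][channel][row][column], the window half-width is an assumed parameter, and offsets that fall outside either sequence are simply skipped.

```python
def frame_similarity(seq_a, seq_b, i, j, half_window=1):
    """Sum of per-pixel products over temporal offsets, channels and pixels."""
    total = 0.0
    for t in range(-half_window, half_window + 1):
        if not (0 <= i + t < len(seq_a) and 0 <= j + t < len(seq_b)):
            continue  # offset falls outside one of the sequences
        for chan_a, chan_b in zip(seq_a[i + t], seq_b[j + t]):
            for row_a, row_b in zip(chan_a, chan_b):
                total += sum(a * b for a, b in zip(row_a, row_b))
    return total


# Toy sequences: 2 frames, 1 channel per frame (for brevity), 1x2 pixels.
seq_a = [[[[1.0, 0.0]]], [[[0.0, 2.0]]]]
seq_b = [[[[1.0, 0.0]]], [[[0.0, 2.0]]]]
print(frame_similarity(seq_a, seq_b, 0, 0))
print(frame_similarity(seq_a, seq_b, 0, 1))
```

With four rectified flow channels per frame, the same loop covers the channel sum without modification.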

“Recognizing Actions at a Distance”

New dataset:
Ballet (stationary camera):
▪ 16 action classes
▪ 2 men + 2 women
▪ Easy dataset (controlled environment)

Tennis (real action, stationary camera):
▪ 6 action classes (stand, swing, move-left, …)
▪ Different days/locations/camera positions
▪ 2 players (man + woman)

Football (real action, moving camera):
▪ 8 action classes (run-left 45˚, run-left, walk-left, …)
▪ Zoom in/out

“Recognizing Actions at a Distance”

Average classification accuracy:
Ballet: 87.44% (5NN)
Tennis: 64.33% (5NN)
Football: 65.38% (1NN)

What can be done?

“Recognizing Actions at a Distance”

Applications:
Do as I Do:
▪ Replace actors in videos

Do as I Say:
▪ Develop real-world motions in computer games

2D/3D skeleton transfer

Figure Correction:
▪ Remove occlusion/clutter in movies

KTH Action Dataset

Paper info:
Title:
▪ Recognizing Human Actions: A Local SVM Approach
Authors:
▪ Christian Schuldt: KTH university
▪ Ivan Laptev: KTH university
ICPR 2004

“Recognizing Human Actions: A Local SVM Approach”

New dataset (KTH action database):
2391 video sequences
6 action classes (Walking, Jogging, Running, Handclapping, Boxing, Hand-waving)
25 persons
Static camera
4 scenarios:
▪ Outdoors (s1)
▪ Outdoors + scale variation (s2): the hardest scenario
▪ Outdoors + clothing variation (s3)
▪ Indoors (s4)

“Recognizing Human Actions: A Local SVM Approach”

Features:
Sparse (STIP detector)
Spatio-temporal jets of order 4

Different feature representations:
Raw jet feature descriptors
Exponential kernel on the histogram of jets
Spatial HoG with temporal pyramid

Different classifiers:
SVM
NN


“Recognizing Human Actions: A Local SVM Approach”

Experimental results:
Local features (jets) + SVM perform the best
SVM outperforms NN
HistLF (histogram of jets) is slightly better than HistSTG (histogram of spatio-temporal gradients)

Average classification accuracy on all scenarios = 71.72%

Action Recognition in Real Scenarios

Paper info:
Title:
▪ Retrieving Actions in Movies
Authors:
▪ Ivan Laptev: INRIA / IRISA
▪ Patrick Pérez: INRIA / IRISA
ICCV 2007

“Retrieving Actions in Movies”

A new action database from real movies
Experiments only on the Drinking action vs. random/Smoking

Main contributions:
Recognizing unrestricted real actions
Key-frame priming

Configuration of experiments:
Action recognition (on pre-segmented sequences)
Comparing different features
Action detection (using key-frame priming)

“Retrieving Actions in Movies”

Real movie action database:
105 drinking actions
141 smoking actions
Different scenes/people/views
www.irisa.fr/vista/Equipe/People/Laptev/actiondetection.html

Action representation:
R = (P, ΔP)
P = (X, Y, T): space-time coordinates
ΔP = (ΔX, ΔY, ΔT):
▪ ΔX: 1.6 × width of head bounding box
▪ ΔY: 1.3 × height of head bounding box

“Retrieving Actions in Movies”

Learning scheme:
Discrete AdaBoost + FLD (Fisher Linear Discriminant)
All action cuboids are normalized to 14x14x8 cells of 5x5x5 pixels (needed for boosting)
Slightly temporally randomized sequences are added to the training set
HoG (4 bins) / OF (5 bins) is used
Local features:
▪ Θ = (x, y, t, δx, δy, δt, β, Ψ)
▪ β ∈ {plain, temp-2, spat-4}
▪ Ψ ∈ {OF5, Grad4}

“Retrieving Actions in Movies”

HoG captures shape, OF captures motion
Informative motions: start & end of action

Key-frame:
When the hand reaches the head
Boosted histogram on HoG
No motion info around the key-frame
Integration of motion & key-frame should help

“Retrieving Actions in Movies”

Experiments:
OF / OF+HoG / STIP+NN / key-frame only
OF and OF+HoG work best on the hard test (drinking vs. smoking)
Extension of OF5 to OFGrad9 does not help!

Key-frame priming:
The number of false positives decreases significantly (different information channels)
Overall accuracy improves significantly:
▪ It’s better to model motion and appearance separately

Speed of the key-frame-primed version: 3 seconds per frame

“Retrieving Actions in Movies”

Possible extensions:
Extend the experiments to more action classes
Make it real-time

Automatic Video Annotation

Paper info:
Title:
▪ Learning Realistic Human Actions from Movies
Authors:
▪ Ivan Laptev: INRIA / IRISA
▪ Marcin Marszalek: INRIA / LEAR
▪ Cordelia Schmid: INRIA / LEAR
▪ Benjamin Rozenfeld: Bar-Ilan university
CVPR 2008

“Learning Realistic Human Actions from Movies”

Overview:
Automatic movie annotation:
▪ Alignment of movie scripts
▪ Text classification

Classification of real actions
Provides a new dataset
Beats state-of-the-art results on the KTH dataset
Extends spatial pyramids to space-time


“Learning Realistic Human Actions from Movies”

Movie script:
Publicly available textual description about:
▪ Scene description
▪ Characters
▪ Transcribed dialogs
▪ Actions (descriptive)

Limitations:
▪ No exact timing alignment
▪ No guarantee for correspondence with real actions
▪ Actions are expressed literally (diverse descriptions)
▪ Actions may be missed due to lack of conversation

“Learning Realistic Human Actions from Movies”

Automatic annotation:
Subtitles include exact time alignment
Timing of scripts is matched by subtitles
Textual descriptions of actions are identified by a text classifier

New dataset:
8 action classes (AnswerPhone, GetOutCar, SitUp, …)
Two training sets (automatically/manually annotated)
60% of the automatic training set is correctly annotated
http://www.irisa.fr/vista/actions

“Learning Realistic Human Actions from Movies”

Action classification approach:
BoF framework (k=4000)
Space-time pyramids
▪ 6 spatial grids: {1x1, 2x2, 3x3, 1x3, 3x1, o2x2}
▪ 4 temporal grids: {t1, t2, t3, ot2}

STIP with multiple scales
HoG and HoF

“Learning Realistic Human Actions from Movies”

Feature extraction:
A volume of (2kσ x 2kσ x 2kτ) is taken around each STIP, where σ/τ is the spatial/temporal extent (k=9)
The volume is divided into an $n_x \times n_y \times n_t$ grid, with $n_x = n_y = 3$ and $n_t = 2$
HoG and HoF are calculated for each grid cell and concatenated
These concatenated features are concatenated once more according to the pattern of the spatio-temporal pyramid
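A small worked example of the sizes involved in the feature extraction above. It is a sketch under assumptions: the grid is taken to be 3 x 3 x 2 cells, and the histogram sizes (4 bins for HoG, 5 bins for HoF) are borrowed from the bin counts quoted elsewhere in this survey rather than stated on this slide. The function names are hypothetical.

```python
def stip_volume(sigma, tau, k=9):
    """Size (dx, dy, dt) of the volume around a point: 2k*sigma, 2k*sigma, 2k*tau."""
    return (2 * k * sigma, 2 * k * sigma, 2 * k * tau)


def descriptor_length(bins, grid=(3, 3, 2)):
    """Length of a descriptor that concatenates one histogram per grid cell."""
    nx, ny, nt = grid
    return nx * ny * nt * bins


print(stip_volume(sigma=2, tau=1))
print(descriptor_length(bins=4))  # HoG, assuming 4 orientation bins
print(descriptor_length(bins=5))  # HoF, assuming 5 flow bins
```

Under these assumptions each interest point yields 18 cells, so the per-point descriptor grows linearly with the number of histogram bins.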

“Learning Realistic Human Actions from Movies”

Different channels:
Each spatio-temporal template: one channel
Greedy search to find the best channel combination
Kernel function based on the Chi2 distance:
$\mathrm{Kernel} = \exp\left(-\sum_{\mathrm{channel}} \frac{1}{A_{\mathrm{channel}}}\, \mathrm{Dist}_{\mathrm{channel}}\right)$, where $A_{\mathrm{channel}}$ is a per-channel normalization factor

Observations:
HoG performs better than HoF
No temporal subdivision is preferred (temporal grid = t1)
Combination of channels improves classification in the real scenario
Mean AP on KTH action database = 91.8%
Mean AP on real movies database:
▪ Trained on manually annotated dataset: 39.5%
▪ Trained on automatically annotated dataset: 22.9%
▪ Random classifier (chance): 12.5%
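The multi-channel kernel can be sketched as an exponential of summed, normalized Chi2 distances. This is an illustrative sketch only: the per-channel normalizers are passed in as plain numbers (how they are estimated is outside this sketch), the 0.5 factor in the Chi2 distance is one common convention, and all names and toy histograms are hypothetical.

```python
import math


def chi2_distance(h1, h2):
    """Chi2 distance between two histograms: 0.5 * sum (a - b)^2 / (a + b)."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def multichannel_kernel(x, y, normalizers):
    """exp(-sum over channels of Chi2(x[c], y[c]) / normalizers[c])."""
    total = sum(
        chi2_distance(x[c], y[c]) / a_c for c, a_c in normalizers.items()
    )
    return math.exp(-total)


# Toy samples with two channels each.
x = {"hog": [1.0, 0.0], "hof": [0.5, 0.5]}
y = {"hog": [0.0, 1.0], "hof": [0.5, 0.5]}
normalizers = {"hog": 1.0, "hof": 1.0}

print(multichannel_kernel(x, x, normalizers))
print(multichannel_kernel(x, y, normalizers))
```

Dropping a channel from `normalizers` removes it from the kernel, which is how a greedy search over channel combinations could evaluate candidate subsets.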

“Learning Realistic Human Actions from Movies”

Future works:
Increase robustness to annotation noise
Improve script-to-video alignment
Learn on a larger database of automatic annotations
Experiment with more low-level features
Move from BoF to detector-based methods

The table shows:
▪ The effect of temporal division when combining channels (HMM-based methods should work)
▪ The pattern of the spatio-temporal pyramid changes so that context is best captured when the action is scene-dependent

Image Context in Action Recognition

Paper info:
Title:
▪ Actions in Context
Authors:
▪ Marcin Marszalek: INRIA / LEAR
▪ Ivan Laptev: INRIA / IRISA
▪ Cordelia Schmid: INRIA / LEAR
CVPR 2009

“Actions in Context”

Contributions:
Automatic learning of scene classes from video
Improves action recognition using image context and vice versa
Movie scripts are used for automatic training
For both action and scene: BoF + SVM

New large database:
12 action classes
69 movies involved
10 scene classes
www.irisa.fr/vista/actions/hollywood2

“Actions in Context”

For automatic annotation, scenes are identified only from text

Features:
SIFT (modeling scene) on 2D-Harris
HoG and HoF (motion) on 3D-Harris (STIP)

“Actions in Context”

Features:
SIFT: extracted from 2D-Harris detector
▪ Captures static appearance
▪ Used for modeling scene context
▪ Calculated for a single frame (every 2 seconds)

HoG/HoF: extracted from 3D-Harris detector
▪ HoG captures dynamic appearance
▪ HoF captures motion pattern

One video dictionary per channel is created
A histogram of video words is created for each channel

Classifier:
SVM using chi2 distance
Exponential kernel (RBF)
Sum over multiple channels:
$K(x_i, x_j) = \exp\left(-\sum_{\mathrm{channel}} \frac{1}{A_{\mathrm{channel}}}\, D_{\mathrm{channel}}(x_i, x_j)\right)$, where $A_{\mathrm{channel}}$ is a per-channel normalization factor

“Actions in Context”

Evaluations:
SIFT: better for context
HoG/HoF: better for action
Context alone can also classify actions fairly well!
Combination of the 3 channels works best

“Actions in Context”

Observations:
Context is not always good
▪ Idea: The model should control the contribution of context for each action class individually

Overall, the gain in accuracy from using context is not significant:
▪ Idea: other types of context should work better

Object Co-occurrence in Action Recognition

Paper info:
Title:
▪ Selection and Context for Action Recognition
Authors:
▪ Dong Han: University of Bonn
▪ Liefeng Bo: TTI-Chicago
▪ Cristian Sminchisescu: University of Bonn
ICCV 2009

“Selection and Context for Action Recognition”

Main contributions:
Contextual scene descriptors based on:
▪ Presence/absence of objects (bag-of-detectors)
▪ Structural relations between objects and their parts

Automatic learning of multiple features
▪ Multiple Kernel Gaussian Process Classifier (MKGPC)

Experimental results on:
KTH action dataset
Hollywood1,2 Human Action database (INRIA)

“Selection and Context for Action Recognition”

Main message:
Detection of a Car and a Person in its proximity increases the probability of the Get-Out-Car action

Provides a framework to train a classifier based on a combination of multiple features (not necessarily relevant), e.g. HOG + HOF + histogram intersection, …

Similar to MKL, but here:
Parameters are learnt automatically, i.e. (weights + hyper-parameters)
A Gaussian Process scheme is used for learning

[Kernel equation not recoverable from the extracted text; only the symbols k, e, x_i, x_j, m and t survive.]




“Selection and Context for Action Recognition”

Descriptors:
Bag of Detectors
▪ Deformable part models are used (Pedro Felzenszwalb et al.)
▪ Once object bounding boxes are detected, 3 descriptors are built:
▪ ObjPres (4D)
▪ ObjCount (4D)
▪ ObjDist (21D): pairwise distances between the 7 parts of the Person detector

HOG (4D) + HOF (5D) from the STIP detector (Ivan Laptev)
▪ Spatial grids: 1x1, 2x1, 3x1, 4x1, 2x2, 3x3
▪ Temporal grids: t1, t2, t3

3D gradient features
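The 21 dimensions of ObjDist follow from the number of unordered pairs among 7 parts (7 choose 2 = 21). The sketch below illustrates this with Euclidean distances between part centres; the coordinates are made up, the function name is hypothetical, and the actual descriptor may normalize or order the distances differently.

```python
import math
from itertools import combinations


def pairwise_part_distances(part_centers):
    """Euclidean distance for every unordered pair of part centres."""
    return [math.dist(p, q) for p, q in combinations(part_centers, 2)]


# Seven hypothetical part centres (x, y) of one person detection.
parts = [(0, 0), (3, 4), (6, 8), (0, 1), (1, 0), (2, 2), (5, 5)]
obj_dist = pairwise_part_distances(parts)
print(len(obj_dist))
print(obj_dist[0])
```

The same function gives a descriptor of length n(n-1)/2 for any number n of parts.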

“Selection and Context for Action Recognition”

Experimental results:
KTH dataset
▪ 94.1% mean AP vs. 91.8% reported by Laptev
▪ Superior to state-of-the-art in all but the Running class

HOHA1 dataset
▪ Trained on the clean set only
▪ The optimal subset of features is found greedily (addition/removal) based on test error
▪ 47.5% mean AP vs. 38.4% reported by Laptev

HOHA2 dataset
▪ 43.12% mean AP vs. 35.1% reported by Marszalek

“Selection and Context for Action Recognition”

Best feature combination