An Introduction to Action Recognition/Detection
Sami Benzaid
November 17, 2009
What is action recognition?
The process of identifying actions that occur in video sequences (in this case, by humans).
Example Videos
Videos from http://www.nada.kth.se/cvap/actions/
Why perform action recognition?
Surveillance footage
User-interfaces
Automatic video organization / tagging
Search-by-video?
Complications
Different scales: People may appear at different scales in different videos, yet perform the same action.
Movement of the camera: The camera may be handheld, and the person holding it can cause it to shake. The camera may also be mounted on something that moves.
Complications, continued
Movement with the camera: The subject performing an action (e.g., skating) may be moving with the camera at a similar speed.
Figure from Niebles et al.
Complications, continued
Occlusions: The action may not be fully visible.
Figure from Ke et al.
Complications, continued
Background "clutter": Other objects/humans present in the video frame.
Human variation: Humans are of different sizes/shapes.
Action variation: Different people perform the same action in different ways.
Etc.
Why have I chosen this topic?
I wanted an opportunity to learn about it. (I knew nothing of it beforehand).
I will likely be incorporating it into my research in the future, but I'm not there yet.
Why have I chosen this topic?
Specifically:
Want to know if someone is on the phone: hence not interruptible; online status can be set to "busy."
Want to know if someone is present (look for characteristic human actions): things other than humans can cause motion, so this makes for a more reliable presence detector; online status can change accordingly.
Want to know immediately when someone leaves: can accurately set status to unavailable.
1st Paper Overview
Recognizing Human Actions: A Local SVM Approach (2004)
Christian Schuldt, Ivan Laptev and Barbara Caputo
Uses local space-time features to represent video sequences that contain actions.
Classification is done via an SVM. Results are also computed for KNN for comparison.
The Dataset
Video dataset with a few thousand instances. 25 people each:
perform 6 different actions: walking, jogging, running, boxing, hand waving, hand clapping
in 4 different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, indoors (several times)
Backgrounds are mostly free of clutter. Only one person performs a single action per video.
The Dataset
Figure from Schuldt et al.
Local Space-time features
Figure from Schuldt et al.
Representation of Features
Spatial-temporal "jets" (4th order) are computed at each feature center:
$$l = (L_x, L_y, L_t, L_{xx}, \ldots, L_{tttt})$$
Using k-means clustering, a vocabulary consisting of words $h_i$ is created from the jet descriptors.
Finally, a given video is represented by a histogram of counts of occurrences of features corresponding to each $h_i$ in that video:
$$H = (h_1, \ldots, h_n)$$
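As a concrete illustration of this bag-of-words pipeline, here is a minimal sketch. It assumes the jet descriptors have already been extracted per video; the vocabulary size and descriptor dimension are placeholders, and the helper names are mine, not the paper's.

```python
# Sketch: cluster jet descriptors into a vocabulary, then histogram each video.
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(jet_descriptors, n_words=128, seed=0):
    """Cluster jet descriptors from all training videos into a visual vocabulary."""
    kmeans = KMeans(n_clusters=n_words, random_state=seed, n_init=10)
    kmeans.fit(jet_descriptors)          # rows: one descriptor per detected feature
    return kmeans

def video_histogram(kmeans, jet_descriptors):
    """Represent one video as a histogram H = (h_1, ..., h_n) of word counts."""
    words = kmeans.predict(jet_descriptors)
    return np.bincount(words, minlength=kmeans.n_clusters)

# Example with random stand-ins for real descriptors (dimension chosen for illustration):
train_descriptors = np.random.rand(5000, 34)           # all features from the training set
vocab = build_vocabulary(train_descriptors)
h = video_histogram(vocab, np.random.rand(80, 34))     # one video's features
```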
Recognition Methods
2 representations of data:
[1] Raw jet descriptors (LF), with a "local feature" kernel (Wallraven, Caputo, and Graf, 2003)
[2] Histograms (HistLF), with a $\chi^2$ kernel
2 classification methods:
SVM
K Nearest Neighbor
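A hedged sketch of the two classifiers on the histogram (HistLF) representation, using an SVM with a precomputed $\chi^2$ kernel and a KNN baseline; the data and hyperparameters here are illustrative stand-ins, not the paper's settings.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics.pairwise import chi2_kernel

# Toy stand-ins for real word-count histograms and action labels
train_hists = np.random.rand(60, 128)
train_labels = np.random.randint(0, 6, size=60)
test_hists = np.random.rand(10, 128)

# SVM with a precomputed chi-squared kernel (HistLF)
gamma = 0.5
K_train = chi2_kernel(train_hists, gamma=gamma)              # train-vs-train kernel
svm = SVC(kernel='precomputed').fit(K_train, train_labels)
K_test = chi2_kernel(test_hists, train_hists, gamma=gamma)   # test-vs-train kernel
svm_pred = svm.predict(K_test)

# KNN baseline on the same histograms (Euclidean distance for simplicity)
knn = KNeighborsClassifier(n_neighbors=5).fit(train_hists, train_labels)
knn_pred = knn.predict(test_hists)
```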
Results
Figure from Schuldt et al.
Results
Some categories can be confused with others (running vs. jogging vs. walking / waving vs. boxing) due to different ways that people perform these tasks.
Local Features (raw jet descriptors without histograms) combined with SVMs was the best-performing technique for all tested scenarios.
A Potentially Prohibitive Cost
The method we just saw was entirely supervised. All videos used in the training set had labels attached to them.
Sometimes, labeling videos can be a very expensive operation.
What if we want to be able to train on a training set in which the videos are not necessarily labeled, and recognize the occurrences of their actions in a test set?
2nd Paper Overview
Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words (2008)
Juan Carlos Niebles, Hongcheng Wang and Li Fei-Fei
Unsupervised approach to classifying actions that occur in video.
Uses pLSA (Probabilistic Latent Semantic Analysis) to learn a model.
Data
Training set: Set of videos in which a single person performs a single action. Videos are unlabeled.
Test set (relaxed requirement): Set of videos which can contain multiple people performing multiple actions simultaneously.
Space-time Interest Points
Figure from Niebles et al.
Space-time Interest Points
Figure from Niebles et al.
Feature Descriptors
Brightness gradients calculated for each interest point "cube".
Image gradients computed for each cube (at different scales of smoothing).
Gradients concatenated to form a feature vector. Length of vector: [# of pixels in interest "point"] × [# of smoothing scales] × [# of gradient directions].
Vector is projected to lower dimensions using PCA.
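A rough sketch of this descriptor step, assuming each interest "point" has already been cut out as a small space-time cube of pixels; the cube size, smoothing scales, and PCA dimension are illustrative values, not the paper's.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.decomposition import PCA

def cube_descriptor(cube, scales=(1.0, 2.0)):
    """Concatenate space-time brightness gradients at several smoothing scales."""
    parts = []
    for s in scales:
        smoothed = gaussian_filter(cube.astype(float), sigma=s)
        gx, gy, gt = np.gradient(smoothed)        # gradients along the three axes
        parts.append(np.concatenate([gx.ravel(), gy.ravel(), gt.ravel()]))
    # length = (# pixels in cube) * (# smoothing scales) * (# gradient directions)
    return np.concatenate(parts)

# Project all training descriptors to a lower dimension using PCA.
descriptors = np.stack([cube_descriptor(np.random.rand(9, 9, 5)) for _ in range(200)])
pca = PCA(n_components=50)
reduced = pca.fit_transform(descriptors)
```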
Codebook Formation
Codebook of spatial-temporal words.
K-means clustering of all space-time interest points in the training set.
Clustering metric: Euclidean distance.
Videos represented as collections of spatial-temporal words.
Representation
Space-time interest points:
Figure from Niebles et al.
pLSA: Learning a Model
Probabilistic Latent Semantic Analysis
Generative model
Variables:
wi = spatial-temporal word
dj = video
n(wi, dj) = co-occurrence table (# of occurrences of word wi in video dj)
z = topic, corresponding to an action
Probabilistic Latent Semantic Analysis
Unsupervised technique
Two-level generative model: a video is a mixture of topics, and each topic has its own characteristic "word" distribution.
Graphical model: video (d) → topic (z) → word (w), with P(z|d) and P(w|z).
T. Hofmann, Probabilistic Latent Semantic Analysis, UAI 1999
Slide: Lana Lazebnik
Probabilistic Latent Semantic Analysis
$$p(w_i \mid d_j) = \sum_{k=1}^{K} p(w_i \mid z_k)\, p(z_k \mid d_j)$$
T. Hofmann, Probabilistic Latent Semantic Analysis, UAI 1999
Slide: Lana Lazebnik
The pLSA model
$$p(w_i \mid d_j) = \sum_{k=1}^{K} p(w_i \mid z_k)\, p(z_k \mid d_j)$$
$p(w_i \mid d_j)$: probability of word $i$ in video $j$ (known)
$p(w_i \mid z_k)$: probability of word $i$ given topic $k$ (unknown)
$p(z_k \mid d_j)$: probability of topic $k$ given video $j$ (unknown)
Slide: Lana Lazebnik
The pLSA model
$$p(w_i \mid d_j) = \sum_{k=1}^{K} p(w_i \mid z_k)\, p(z_k \mid d_j)$$
In matrix form, the observed codeword distributions $p(w_i \mid d_j)$ (words × videos, M×N) factor into the codeword distributions per topic (class) $p(w_i \mid z_k)$ (words × topics, M×K) times the class distributions per video $p(z_k \mid d_j)$ (topics × videos, K×N).
Slide: Lana Lazebnik
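A tiny numeric illustration of this factorization (made-up numbers, not real data): the M×N matrix of per-video word distributions is exactly the product of an M×K topic-word matrix and a K×N video-topic matrix.

```python
import numpy as np

M, K, N = 6, 2, 3                                          # codewords, topics, videos
p_w_given_z = np.random.dirichlet(np.ones(M), size=K).T    # M x K, columns sum to 1
p_z_given_d = np.random.dirichlet(np.ones(K), size=N).T    # K x N, columns sum to 1
p_w_given_d = p_w_given_z @ p_z_given_d                    # M x N observed distributions

# Each video's word distribution is a valid probability distribution.
assert np.allclose(p_w_given_d.sum(axis=0), 1.0)
```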
Learning pLSA parameters
Maximize the likelihood of the data:
$$\max \prod_{i=1}^{M} \prod_{j=1}^{N} p(w_i \mid d_j)^{\,n(w_i, d_j)}$$
where $n(w_i, d_j)$ are the observed counts of word $i$ in video $j$,
M … number of codewords
N … number of videos
Slide credit: Josef Sivic
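For reference, a compact EM sketch for fitting these parameters from the co-occurrence table $n(w_i, d_j)$; this is an illustrative implementation of the standard pLSA updates, not the authors' code.

```python
import numpy as np

def plsa_em(counts, n_topics, n_iters=100, seed=0):
    """counts: M x N co-occurrence table n(w_i, d_j). Returns p(w|z), p(z|d)."""
    rng = np.random.default_rng(seed)
    M, N = counts.shape
    p_w_z = rng.random((M, n_topics)); p_w_z /= p_w_z.sum(axis=0)   # p(w|z)
    p_z_d = rng.random((n_topics, N)); p_z_d /= p_z_d.sum(axis=0)   # p(z|d)
    for _ in range(n_iters):
        # E-step: posterior p(z|w,d), shape M x N x K
        joint = p_w_z[:, None, :] * p_z_d.T[None, :, :]             # p(w|z) p(z|d)
        post = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
        # M-step: re-estimate p(w|z) and p(z|d) from weighted counts
        weighted = counts[:, :, None] * post                        # n(w,d) p(z|w,d)
        p_w_z = weighted.sum(axis=1); p_w_z /= p_w_z.sum(axis=0) + 1e-12
        p_z_d = weighted.sum(axis=0).T; p_z_d /= p_z_d.sum(axis=0) + 1e-12
    return p_w_z, p_z_d
```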
Inference
Finding the most likely topic (class) for a video:
$$z^{*} = \arg\max_{z} p(z \mid d)$$
Slide: Lana Lazebnik
Inference
Finding the most likely topic (class) for a video:
$$z^{*} = \arg\max_{z} p(z \mid d)$$
Finding the most likely topic (class) for a visual word in a given video:
$$z^{*} = \arg\max_{z} p(z \mid w, d) = \arg\max_{z} \frac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}$$
Slide: Lana Lazebnik
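Given fitted p(w|z) and p(z|d) (e.g., from the EM sketch above), both inference rules reduce to simple argmax operations:

```python
import numpy as np

def most_likely_topic_for_video(p_z_d, j):
    """argmax_k p(z_k | d_j)"""
    return int(np.argmax(p_z_d[:, j]))

def most_likely_topic_for_word(p_w_z, p_z_d, i, j):
    """argmax_k p(w_i | z_k) p(z_k | d_j); the denominator does not affect the argmax."""
    return int(np.argmax(p_w_z[i, :] * p_z_d[:, j]))
```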
Datasets
KTH human motion dataset (Schuldt et al. 2004) (Dataset introduced in previous paper.)
Weizmann human action dataset (Blank et al. 2005)
Figure skating dataset (Wang et al. 2006)
Example of Testing (KTH)
Figure from Niebles et al.
KTH Dataset Results
Figure from Niebles et al.
Results Compared to Prev. Paper
Figures from Niebles et al., Schuldt et al.
Results Compared to Prev. Paper
First Paper (Supervised Technique), confusion matrix in %, columns in the same action order as the rows:
              Walk   Run    Jog    Wave   Clap   Box
Walking       83.8    0    16.2     0      0      0
Running        6.3   54.9  38.9     0      0      0
Jogging       22.9   16.7  60.4     0      0      0
Handwaving     0.7    0     0      73.6    4.9   20.8
Handclapping   1.4    0     0       3.5   59.7   35.4
Boxing         0.7    0     0       0.7    0.7   97.9

This Paper (Unsupervised Technique), confusion matrix in %:
              Walk   Run    Jog    Wave   Clap   Box
Walking        82     0    17       0      0      1
Running         1    88    11       0      0      0
Jogging         9    37    53       0      1      0
Handwaving      0     0     0      93      0      7
Handclapping    0     0     0       0     86     14
Boxing          0     0     0       0      2     98
Weizmann Dataset
10 action categories, 9 different people performing each category.
90 videos. Static camera, simple background.
Weizmann Examples
Figure from Niebles et al.
Example of Testing (Weizmann)
Figure from Niebles et al.
Weizmann Dataset Results
Figure from Niebles et al.
Figure Skating Dataset
32 video sequences
7 people
3 actions: stand-spin, camel-spin, sit-spin
Camera motion, background clutter, viewpoint changes.
Example Frames
Figure from Niebles et al.
Example of Testing
Figure from Niebles et al.
Figure Skating Results
Figure from Niebles et al.
A Distinction
Action Recognition vs. Event Detection
Action Recognition = classify a video of an actor performing a certain class of action.
Event Detection = detect instance(s) of event(s) from predefined classes, occurring in video that more closely resembles real-life footage (e.g., cluttered background, multiple humans, occlusions).
3rd Paper Overview
Event Detection in Cluttered Videos (2007)
Y. Ke, R. Sukthankar, and M. Hebert
Represents actions as spatial-temporal volumes.
Detection is done via a distance threshold between a template action volume and a video sequence.
Example images
Cluttered background + occlusion: hand-waving
Cluttered background: picking something up
Images from Ke et al.
Representation of an event
Spatio-temporal volume:
Example: hand-waving
Image from Ke et al.
Preprocessing
A video in which event detection is to be performed is first segmented into space-time regions using mean-shift.
The video is treated as a volume, and not individually by frames.
Objects in the video are over-segmented in space and time.
Authors state this is equivalent to “superpixels”, in the space-time context.
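A very rough sketch of this idea on a toy clip, assuming mean-shift clustering of per-voxel (x, y, t, intensity) features stands in for the authors' segmentation; sklearn's MeanShift is used here purely for illustration, not as the paper's algorithm.

```python
import numpy as np
from sklearn.cluster import MeanShift

clip = np.random.rand(8, 8, 4)                        # tiny grayscale video: H x W x T
ys, xs, ts = np.meshgrid(*[np.arange(s) for s in clip.shape], indexing='ij')
# Each voxel described by position in space-time plus (scaled) intensity
features = np.stack([xs.ravel(), ys.ravel(), ts.ravel(), 10 * clip.ravel()], axis=1)
labels = MeanShift(bandwidth=3.0).fit_predict(features)
segments = labels.reshape(clip.shape)                 # space-time region id per voxel
```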
Preprocessing
Figure from Ke et al.
Detecting an Event in Video
Want to detect the event corresponding to template T (e.g., a hand-waving volume) in video volume V (which has been oversegmented).
Slide the template along all locations in the video volume V. Measure the shape-matching distance between T and the relevant subset of V.
Shape-matching distance
Appropriate region intersection distance:
$$d(T, V) = \left| (T \cap \bar{V}) \cup (\bar{T} \cap V) \right|$$
(the volume of mismatch between the template and the matched region)
The authors point out that enumerating through all subsets of segmented objects in V is very inefficient.
They detail an optimization to reduce run-time computation to table lookups, as well as a different distance metric.
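If the template and the candidate region (the union of selected over-segmented regions of V) are represented as aligned binary space-time volumes, the basic mismatch measure can be sketched as a voxel-wise symmetric difference. This illustrates the idea only, not the paper's optimized table-lookup computation.

```python
import numpy as np

def shape_distance(template_vol, region_vol):
    """Both arguments: boolean arrays of shape (height, width, frames)."""
    return int(np.logical_xor(template_vol, region_vol).sum())

# Example: a toy 4x4x2 template vs. a candidate region
T = np.zeros((4, 4, 2), dtype=bool); T[1:3, 1:3, :] = True
R = np.zeros((4, 4, 2), dtype=bool); R[1:3, 2:4, :] = True
print(shape_distance(T, R))   # number of voxels where T and R disagree
```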
Shape-matching distance
Basic Idea:
Figure from Ke et al.
Flow Distance
Additionally, a flow distance metric is used.
For a spatial-temporal patch P1 in T and P2 in V, calculate flow correlation distance (whether or not the same flow could have generated both patches).
Flow correlation distance algorithm by Shechtman and Irani (2005)
Breaking up the template
Why is this useful?
Figure from Ke et al.
Matching template parts to regions
Individual template parts may not match the oversegmented regions as well as the full template does:
Figure from Ke et al.
Detection with parts
Use a cutting plane (where the template was split) when calculating distance.
Encode relationship between multiple parts of a template as a graph, with information about distances.
Energy function to minimize:
$$L^{*} = \arg\min_{L} \left[ \sum_{i=1}^{n} a_i(l_i) + \sum_{(v_i, v_j) \in E} d_{ij}(l_i, l_j) \right]$$
Appearance distance:
$$a_i(l_i) = d_N(T_i, V; l_i) + d_F(T_i, V; l_i)$$
Detection occurs when the distance is below a chosen threshold.
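An illustrative brute-force minimizer for this kind of part-based energy, with two parts, a handful of candidate placements, and stand-in cost functions (not the paper's actual shape/flow distances or optimization):

```python
import itertools
import numpy as np

candidates = [np.array(p) for p in [(0, 0, 0), (2, 1, 0), (4, 3, 1)]]  # possible l_i

def appearance_cost(part_index, location):
    # placeholder for a_i(l_i); real system uses shape + flow distances
    return float(np.sum(location)) * 0.1 + part_index

def pairwise_cost(loc_i, loc_j):
    # penalize placements that drift from the template's relative part offset
    expected_offset = np.array([2, 0, 0])
    return float(np.sum(np.abs((loc_j - loc_i) - expected_offset)))

best = min(
    itertools.product(candidates, repeat=2),
    key=lambda L: appearance_cost(0, L[0]) + appearance_cost(1, L[1]) + pairwise_cost(L[0], L[1]),
)
print(best)   # lowest-energy placement of the two parts
```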
The Data
Roughly 20 minutes of video, containing approximately 110 events for which templates exist.
Templates (the "training set") are created from a single instance of each action being performed; they are manually segmented and split into parts.
Example Detections
Figure from Ke et al.
Results
Figure from Ke et al.
Reference Links
Recognizing Human Actions: A Local SVM Approach. Christian Schuldt, Ivan Laptev and Barbara Caputo. http://www.nada.kth.se/cvap/actions/
Event Detection in Cluttered Videos. Y. Ke, R. Sukthankar, and M. Hebert. http://www.cs.cmu.edu/~yke/video/
Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. Juan Carlos Niebles, Hongcheng Wang and Li Fei-Fei. http://vision.stanford.edu/niebles/humanactions.htm
Extra Slides
Extra Slides after this point.
Detection of features (Schuldt et al.)
Video is an image sequence $f(x, y, t)$. A scale-space representation $L$ of $f$ is constructed by convolution with a Gaussian kernel:
$$L(\cdot;\, \sigma^2, \tau^2) = f * g(\cdot;\, \sigma^2, \tau^2)$$
A second-moment matrix is computed using spatio-temporal image gradients:
$$\mu = g(\cdot;\, s\sigma^2, s\tau^2) * \left( \nabla L\, (\nabla L)^T \right)$$
($\sigma$ and $\tau$ are spatial / temporal scale parameters.)
Detection of features (Schuldt et al.)
The positions of the features are the maxima of $H = \det(\mu) - k\, \mathrm{trace}^3(\mu)$ over all $(x, y, t)$.
Size/neighborhood of features determined by the scale parameters.
More detailed info about space-time interest points: Ivan Laptev and Tony Lindeberg. Space-Time Interest Points. In Proc. ICCV 2003, Nice, France, pp. I:432-439.
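A simplified sketch of this detector: smooth the clip, build the per-voxel 3×3 second-moment matrix of the space-time gradients, and evaluate H; proper scale selection is collapsed here to fixed sigma/tau values for brevity, so treat it as an illustration rather than a faithful reimplementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def space_time_harris(f, sigma=1.5, tau=1.5, s=2.0, k=0.005):
    """Return the response H = det(mu) - k * trace(mu)^3 for each voxel of clip f."""
    L = gaussian_filter(f.astype(float), sigma=(sigma, sigma, tau))
    grads = np.gradient(L)                       # gradients along y, x, t
    mu = np.empty(f.shape + (3, 3))
    for a in range(3):
        for b in range(3):
            # mu[a][b] = g(.; s*scales) * (L_a L_b), smoothed outer product of gradients
            mu[..., a, b] = gaussian_filter(grads[a] * grads[b],
                                            sigma=(s * sigma, s * sigma, s * tau))
    return np.linalg.det(mu) - k * np.trace(mu, axis1=-2, axis2=-1) ** 3

# Feature positions would be local maxima of this response over (x, y, t).
response = space_time_harris(np.random.rand(16, 16, 8))
```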
Space-time Interest Points (Niebles et al.)
Detection algorithm based on: Dollár, Rabaud, Cottrell, & Belongie (2005)
Response function:
$$R = (I * g * h_{ev})^2 + (I * g * h_{od})^2$$
where $g(x, y; \sigma)$ is a 2-D Gaussian kernel applied only in the spatial dimensions and $h_{ev}, h_{od}$ are 1-D "Gabor filters" applied over time.
Point locations are local maxima of the response function. Size is determined by spatial/temporal scale factors.
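A hedged sketch of this response function, using the common parameter choice ω = 4/τ for the quadrature pair of temporal Gabor filters; the filter support and scale values are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def periodic_response(video, sigma=2.0, tau=2.0):
    """R = (I*g*h_ev)^2 + (I*g*h_od)^2 over a video of shape (H, W, T)."""
    t = np.arange(-int(4 * tau), int(4 * tau) + 1)
    omega = 4.0 / tau
    h_ev = -np.cos(2 * np.pi * t * omega) * np.exp(-t ** 2 / tau ** 2)
    h_od = -np.sin(2 * np.pi * t * omega) * np.exp(-t ** 2 / tau ** 2)
    smoothed = gaussian_filter(video.astype(float), sigma=(sigma, sigma, 0))  # spatial only
    even = convolve1d(smoothed, h_ev, axis=2)      # temporal filtering, even phase
    odd = convolve1d(smoothed, h_od, axis=2)       # temporal filtering, odd phase
    return even ** 2 + odd ** 2                    # interest points: local maxima of R

R = periodic_response(np.random.rand(16, 16, 20))
```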
Recognition of Multiple Actions (Niebles et al.)
Figure from Niebles et al.