
Image-based Activity Recognition using UAVs

Master Thesis of

Holger Caesar

At the Department of Informatics - Vision and Fusion Laboratory (IES)

Karlsruhe Institute of Technology (KIT)

In cooperation with
Fraunhofer Institute of Optronics, System Technologies and Image Exploitation (IOSB), Ettlingen
Airbus Defence and Space (AirbusDS), Manching

Reviewer: Prof. Dr.-Ing. Jürgen Beyerer, IOSB, IES
Advisor: Dr. Holger Leuck, AirbusDS
Second advisor: Dr. Wolfgang Hübner, IOSB

Duration: 01 October 2013 – 31 March 2014

KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu


I declare that I have developed and written the enclosed thesis completely by myself, and have not used sources or means without declaration in the text.

Ingolstadt, 31 March 2014

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .(Holger Caesar)


Contents

1 Introduction
   1.1 Motivation
   1.2 Problem Statement
   1.3 Requirements

2 Fundamentals
   2.1 Definitions
   2.2 Activity Recognition Approaches
   2.3 Overview
   2.4 Preprocessing
   2.5 Features
   2.6 Dimensionality Reduction
   2.7 Learning Machines

3 Datasets
   3.1 Categorization
   3.2 Real-world Datasets
   3.3 Simulated Datasets

4 Experiments
   4.1 System Description
   4.2 Methodology
   4.3 Results

5 Discussion
   5.1 Segmentation Problem
   5.2 Temporal Resolution
   5.3 Image Resolution
   5.4 Key Findings
   5.5 Fulfillment of Requirements
   5.6 Recommendations

6 Conclusion
   6.1 Summary
   6.2 Comparison
   6.3 Future Work

Bibliography



1. Introduction

In this chapter we introduce the reader to the topic and contents of this work. Section 1.1 explains what activity recognition is and where it can be used. The general questions that are to be answered are stated in Section 1.2, which narrows down the field and limits the scope of the work. In Section 1.3 we set up the requirements that a solution should fulfill from an engineering perspective. These requirements will be evaluated in Section 5.5.

1.1 Motivation

The purpose of this Master thesis is to assess whether recent developments in computer vision can be exploited to automatically detect human activities in videos. These activities can include anything from walking to jumping, digging or throwing (see Fig. 1.1 for a few examples).

During the last few years such techniques have gained popularity in human-machine interfaces that are used to control video games. Their applications, however, go far beyond entertainment and span a multitude of scenarios, including affordable healthcare and perimeter surveillance.

Figure 1.1: Examples of activities that we want to recognize.

We focus on videos that are gathered from flying platforms such as Unmanned Aerial Vehicles (UAVs). The combination of activity recognition and UAVs opens up a completely new field. Currently used techniques are able to detect objects and persons and track them around the scene. They are not able to understand what happens on the ground and therefore require regular input from a human controller. State-of-the-art techniques cannot replace a human being in the loop, and from an ethical point of view that might not even be desirable. However, they can greatly reduce the burden on the people who have to watch computer monitors for several hours.


This is useful in any application that requires us to observe wide areas. Public sports events can be analyzed to detect signs of human stampedes. Earthquake sites can be searched for humans signaling for help. Railway companies can detect suspicious human behavior along their tracks, and borders can be secured. In military scenarios activity recognition can help to improve overall situational awareness in combat. Early warning systems could detect suspicious behavior of terrorists before they attack their victims.

Before we can go on to develop high-level solutions that use complex rules to decide which persons are interacting with each other and to what end, we first need a reliable estimate of the simpler gestures that make up the more complicated activities. This is what we focus on in this work.

1.2 Problem Statement

The field of static object detection and recognition is a well-studied discipline that dates back several decades. The problems encountered there, such as different lighting, different viewpoints and angles, or scale invariance, have been dealt with extensively [Lindeberg, 1997][Laptev et al., 2007]. Human activity recognition is a more recent discipline that shares many of these issues. It can actually be seen as a form of dynamic object detection, since the geometry of the human body deforms with the movement of its muscles. The extension to a third dimension (time) introduces additional problems that have their counterparts in 2D. These include the time-warping problem, which occurs when the same activity is executed at varying speeds, and the segmentation problem, which describes the difficulty of not knowing when an activity starts in a video.

This work is not limited to a single approach. Instead we conduct a literature review and extract some of the most promising approaches for our scenario. We specifically choose those techniques that are suited to a more vertical camera perspective (see Chapter 3) and alter them where necessary. We look at gradient-based, trajectory-based, motion-based and similarity-based features and compare their performance. Both sparse local interest point descriptors and dense features are examined. For each setup we try to find a suitable learning machine. The research on the theoretical capabilities is to be complemented by concrete engineering requirements, which are listed in Section 1.3 and evaluated in Section 5.5.

We also analyze requirements that are specific to the application on UAVs. These include the separation of camera movement and scene movement, which in general is not trivial. Further analysis is to be done on the required spatial and temporal resolution of the camera. Applications should be analyzed for their real-time capabilities, and further potential for parallelization should be indicated.

1.3 Requirements

In this section we briefly describe the requirements that the final implementation of our activity recognition system should fulfill.

1. Accuracy: The system shall have the ability to recognize a person's activity conceivably better than by random guessing. This requirement will be tested by evaluating the system's recognition performance.

2. Robustness: Given the dynamic environments that UAVs are deployed in, the recognition system needs to be extremely robust to avoid a deterioration in performance. Robustness is especially important with regard to view, scale, temporal and rotational invariance. The system should not be limited to fixed cameras, but should also be able to compensate for the motion of the camera itself.


3. Real-Time: The final system requires real-time performance in order to run online during a flight. Algorithms should allow for parallel execution to maximize throughput.

4. Compatibility: Our algorithm should not be limited to any special hardware, but should be portable from a standard personal computer to the embedded system used in the aircraft.

5. Universality: Activity recognition should be applicable to different recording conditions and light spectra, including visible light as well as infrared (EO/IR).


2. Fundamentals

This chapter covers the fundamental definitions and techniques that our approaches are based on. It serves as a general overview of the literature on the topic. It also helps to define ambiguous terms and to establish the common level of background knowledge on which the following chapters build.

2.1 Definitions

In this section we briefly describe the terms and definitions that are used throughout this work.

2.1.1 Activity

To describe what an activity is, we follow the classification in [Aggarwal and Ryoo, 2011]. Activities can be divided into four categories. However, the distinction between these categories is not always clear and should only be seen as a guideline.

• Gesture: An elementary movement of a single body part. Examples include raising one's arm or turning the head.

• Action: A combined activity of a single person, possibly consisting of one or more gestures and body parts. Examples include walking, digging and throwing an object.

• Interaction: An activity involving two persons or objects, without which the interaction would be meaningless. An example would be handing over an object to another person.

• Group action: An activity of a group of multiple persons or objects interacting with each other. It often involves strategic acting, and one interaction usually depends on another. An example would be a group of persons marching along a street.

For this work we focus on lower-level activities such as gestures and actions. Once reliable systems exist for these low-level activities, more complex systems can be built on top of them to provide situational awareness and automatic reasoning.


2.1.2 Activity Recognition

The term activity recognition refers to a field in computer science. It “aims to recognize the actions and goals of one or more agents from a series of observations”¹ using computer systems. In this work we focus on vision-based activity recognition, which uses cameras as sensors to collect video data. In human activity recognition the agents of interest are human beings rather than animals or machines.

In accordance with the aforementioned definition of an activity, activity recognition can take place at different levels, ranging from atomic activities to complex group interactions. Since the field is still developing and no out-of-the-box solutions exist yet, most of the literature focuses on the question of what an individual is doing rather than what he or she intends to achieve. The latter would also require substantially more “intelligence” and background knowledge.

Applications include human-computer interaction, robot learning, surveillance and healthcare. The cameras can be standard RGB cameras, depth cameras or infrared cameras. Multiple spectra can be combined and several cameras can be used in stereo vision setups.

2.1.3 Unmanned Aerial Vehicle

An Unmanned Aerial Vehicle (UAV) is an aircraft that does not require a human pilot on board. It can be controlled either directly by a computer system or via remote control by a pilot or a stationary computer. Depending on the application, UAVs typically carry cameras or weapons as payload. Applications include military use (reconnaissance, combat and logistics) as well as civil use (public surveillance, research, logistics, fire detection and archeology). UAVs are typically classified in accordance with the US military UAV tier system as tier N/A (micro UAV), tier I (low altitude, long endurance), tier II (medium altitude, long endurance, MALE) or tier II+ and III- (high altitude, long endurance, HALE). Fig. 2.1 shows a picture of the Barracuda MALE UAV.

Figure 2.1: A picture of the Barracuda UAV, Copyright by Airbus Defence and Space

2.2 Activity Recognition Approaches

Before presenting our approach to activity recognition, we would like to give an overview of the existing literature. Several papers have discussed how different approaches to activity recognition can be categorized. In this chapter we follow the taxonomy given by [Aggarwal and Ryoo, 2011] (see Fig. 2.2).

Human activity recognition techniques can be categorized as either single-layered or hierarchical approaches. Single-layered approaches operate directly on the images and are used to detect low-level activities such as gestures and actions. Hierarchical approaches build a bottom-up hierarchy combining a set of so-called subevents (low-level activities). They are used to detect high-level activities like interactions and group actions.

¹ http://en.wikipedia.org/wiki/Activity_recognition, last accessed on 09.01.2014


Figure 2.2: Categorization of different activity recognition approaches, Copyright by [Aggarwal and Ryoo, 2011]

To combine these subevents, three different techniques exist. They can be statistical, i.e. combining the evidence of several subevents in a layered Hidden Markov Model (HMM). They can also use syntactic descriptions such as stochastic context-free grammars. Finally, description-based approaches describe the temporal, spatial or logical structure of an activity.

Single-layered approaches can be split into space-time and sequential approaches. Space-time techniques regard the complete space-time volume as a single video cube, whereas sequential approaches work on a frame basis. Space-time approaches either operate directly on the space-time volume, on the trajectory (of a person) in the volume, or on local interest point descriptors (“space-time features”). Compared to sequential approaches, these approaches have the advantage that they can directly compare different frames in the space-time volume.

Sequential approaches either compare a frame to a set of known examples (“exemplar-based”) or they aggregate the estimates of single frames in a state-based model, such as an HMM. These approaches can be very simple, since the classification of a single frame might be easier than that of a whole video. Techniques such as Dynamic Time Warping or HMMs can be used to overcome the time-warping problem, i.e. the problem which occurs during recognition when the same activity is executed at a different speed.

Due to our focus on low-level activities we only regard single-layered approaches. We use all three subcategories of space-time approaches: space-time volumes (optical flow, HOOF, self-similarity), trajectories (blob trajectories) and space-time features (SIFT3D). We also aggregate single-frame estimates in state-based approaches (HOG in an HMM). While we do not directly use exemplar-based approaches, using kNN as a learning machine results in similar behavior due to its lazy learning property. These features and their abbreviations will be explained in detail in Section 2.5.

2.3 Overview

Fig. 2.3 gives an overview of the activity recognition “pipeline”. It describes the different steps that have to be taken to recognize the activity class in a video. These steps make up the structure of the second half of this chapter. Section 2.4 describes what has to be done before one can perform activity recognition. Section 2.5 describes in detail which kinds of features are useful. Section 2.6 explains how to reduce the number of features and select only relevant entries, and the final classification step is described in Section 2.7.


Figure 2.3: The activity recognition pipeline as described in this work

2.4 Preprocessing

In this section we describe the preprocessing steps that are needed before one can perform activity recognition. The order of these steps is not fixed, and for some applications several steps can be skipped, such as the normalization of the video data or the image stabilization process. We do not go into detail since most of these topics are outside the focus of this work. For our implementation of these preprocessing steps please refer to Section 4.1.2. The different steps can be described as follows:

1. Data selection: The first step is to select only the subset of data that seems appropriate for the given task. This means that we filter out video sequences that are not representative of their respective action. This applies especially to low-quality recordings and scenes where the actor is temporarily not in the video or occluded by other objects. However, we need to take into account that real-world data will also include these imperfections and make sure that we do not oversimplify the task. Obviously, data selection is the only preprocessing step that is not executed during evaluation. Some approaches suggest training a system on high-quality data and evaluating it on more realistic data to get a more reliable impression of the generalization error.

2. Normalize videos: In some cases the quality will differ between videos as well as within a video. To remove these differences, we can equalize fluctuations in brightness resulting from different aperture settings. Image contrast can be adapted globally as well as in local regions. This is important for static surveillance cameras, where we know in advance which regions might be more or less lit. For UAVs, the varying angle of the camera's line of sight towards the sun can also result in varying brightness.

3. Stabilization: It is very important to stabilize the video to remove camera motion. This motion generally includes translation and rotation of the camera. If the camera uses automatic focusing techniques or the user manually changes the zoom, then scaling and blurring might also occur. Whereas blurring might not be removable in software post-processing, the other factors can be attenuated if we determine the underlying transformation. The transformation is often assumed to be affine or projective and can then be estimated from multiple point correspondences between successive frames (a small sketch of this estimation follows after this list). This is related to optical flow (see Section 2.5.3.1). Stabilization can be applied as weak stabilization, which reduces swift movements of the camera, or as strong stabilization, which produces a view as if the camera had not been moved at all. Both approaches have their advantages: whereas weak stabilization is still able to follow an actor who moves out of the viewing range of the first frame, strong stabilization will remain fixed to this initial position. The frame width and height need to be adapted, or the pixels that correspond to regions which are not visible to the camera must be filled with values that indicate that the information is not available.

4. Object Detection: Object detection, or more specifically object-class detection, refers to assigning the correct label (“human”, “car”, etc.) to a region in an image.


In this work we use person detection to extract figure-centric space-time volumes. This narrows the data down to only the regions of interest and increases the activity recognition performance. In some cases the object detection is triggered by a mechanism called change detection. This means that we compare successive frames (e.g. by differencing) and extract local changes. If the camera is sufficiently stabilized and the data is normalized, then we can assume that the changes are caused by object motion. However, object detection should also work if no motion occurs. In this case pattern recognition techniques can be used to detect previously learned pixel patterns or feature functions applied to these pixels. Indeed, the whole field of activity recognition is historically derived from object detection: most of the features presented in this work are classical object detection techniques extended to a third dimension (time). Due to the breadth of this topic, we cannot cover it in full detail here. An obvious alternative to automatic object detection is to let a person decide where an object can be found in the scene.

5. Tracking: Tracking is the process of following a previously detected object through the scene over time. For a perfect object detector and only one person per video this would not be necessary. However, if there are multiple persons, possibly occluding and interacting with each other, then tracking a single person becomes important for activity recognition. Since object detectors are not perfect on real-world data, good tracking is essential to maintain a good hypothesis of a person's position for as long as possible. Examples of such techniques are the Kalman filter and particle filters.

6. Blob Extraction: Once we have tracked the object we simply need to extract the figure-centric space-time volume. This is the data on which we will compute the features in the next section. An alternative to stabilization, object detection, tracking and blob extraction would be to use latent variables. Latent variables can be used to model additional degrees of freedom (e.g. the translation). They can be included in the learning machine and determined automatically during training (see [Felzenszwalb et al., 2010]).
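To make the stabilization step (step 3) more concrete, the following sketch estimates an affine transformation from point correspondences between two frames via least squares and warps the current frame back onto the previous view. This is a minimal illustration with NumPy only; the correspondences are assumed to come from an external keypoint matcher or tracker, and all variable names are placeholders.

```python
import numpy as np

def estimate_affine(src, dst):
    """Least-squares 2x3 affine transform A mapping src points (N, 2) onto dst points (N, 2)."""
    n = src.shape[0]
    X = np.hstack([src, np.ones((n, 1))])     # (N, 3) design matrix [x, y, 1]
    # Two independent least-squares problems, one per output coordinate.
    A, _, _, _ = np.linalg.lstsq(X, dst, rcond=None)
    return A.T                                # (2, 3) affine matrix

def warp_affine_nearest(frame, A, out_shape):
    """Tiny nearest-neighbour warp; a real system would use bilinear sampling."""
    h, w = out_shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    sx, sy = (A @ coords).round().astype(int)
    valid = (sx >= 0) & (sx < frame.shape[1]) & (sy >= 0) & (sy < frame.shape[0])
    out = np.zeros(out_shape, dtype=frame.dtype)   # pixels outside the view stay 0 ("not available")
    out.ravel()[valid] = frame[sy[valid], sx[valid]]
    return out

# Hypothetical usage: prev_pts/curr_pts from a feature matcher, frame is the current grayscale image.
# Here A maps stabilized (previous-frame) coordinates to coordinates in the current frame.
# A = estimate_affine(prev_pts, curr_pts)
# stabilized = warp_affine_nearest(frame, A, frame.shape)
```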

2.5 Features

To perform activity recognition we need to compute a set of features with favorable properties: they should represent the true nature of the movement while having a low dimensionality that allows for efficient processing.

Depending on the learning machine, one approach could be to use the whole video (i.e. a concatenation of the pixels of each image) as input. However, this is usually not feasible, as it would result in very memory-consuming learning machines and time-intensive training and testing. It is also very unlikely that any known learning approach would be able to sufficiently abstract from the data and extract the small amount of information that we are interested in (the color of a person's shirt, for instance, is not of interest to us, whereas the kind of activity he performs is). Another problem is that the raw video data includes a lot of noise, which can deteriorate the recognition results. This is especially true for image data, since minor fluctuations in the camera (aperture size) or the environment (lighting of the scene) can have major effects on image intensities.

We therefore present a selection of more advanced features that are extracted either from single images or from the space-time volume of a video. As we will see below, some of these features are dense, which means that they are extracted from the whole image. Other features are sparse and describe only the environment of local keypoints.


2.5.1 Gradient-based Features

One of the most important types of features uses image gradients. Gradients represent the directional change in the intensity or color of an image. Given a 2D grayscale image and assuming that the underlying intensity function is continuous, the gradient can be estimated in a local pixel environment as the difference of neighboring pixels. Gradients are typically used for edge detection, since visual edges in an image have a high gradient magnitude. As such they can be used to recognize objects even under different lighting conditions, which would not be possible if we looked only at the intensities or color values.

2.5.1.1 Histogram of Oriented Gradients

The Histogram of Oriented Gradients (HOG) was introduced by [Dalal and Triggs, 2005]. Over the last few years it has become the single most important feature in computer vision fields such as object detection, object recognition and activity recognition. The idea is that the distribution of gradient directions is more important than the actual position of the edges in the image. Although extensions to 3D exist (see [Klaser et al., 2008]), we concentrate only on 2D. The steps of the algorithm are as follows:

1. Normalize gamma and color: Normalization can be applied to different color spaces such as RGB, LAB and grayscale. RGB and LAB give comparable performance according to the authors, whereas grayscale results in considerably lower performance. Normalization is done using power-law gamma equalization, which can be implemented as follows:

GE(i) = (i − min)^r / (max − min)^r,   r ∈ [0, 1]   (2.1)

Here i is a pixel value, min and max are the lowest and highest pixel values in the image, and r determines the degree of normalization. Dalal and Triggs mention that the effect of this normalization is rather modest, since the contrast normalization described below has a similar impact.

2. Compute gradients: Various methods exist to compute local image gradients, such as discrete derivative masks combined with Gaussian smoothing of different standard deviations σ. These masks can be 1D as well as 2D and centered ([−1, 0, 1]) or uncentered ([−1, 1]). Despite the large number of possibilities, the authors achieve the highest performance using a 1D centered mask with σ = 0. They find that uncentered derivative masks suffer because the filters in x and y direction are positioned at different centers, and that Gaussian smoothing or larger kernels deteriorate performance because they destroy fine-scaled details. In the case of multi-channel images (such as RGB), the gradient is computed on each channel and the channel with the largest norm is chosen as the gradient vector.

3. Orientation binning: The image is now divided into spatial regions that are called cells. These cells can be rectangular or radial. Experiments show that the cell width should be approximately equivalent to the width of the object that is to be recognized, such as a person's arm or leg. For each cell a histogram of gradient directions is created, with each pixel weighted by its gradient magnitude. Orientation bins are evenly spaced and each vote is bilinearly interpolated between neighboring bin centers to reduce aliasing.

4. Contrast normalization: In the final step, multiple cells are grouped into blocks. Just like the cells, blocks can be arranged in a rectangular (R-HOG) or circular (C-HOG) layout. These blocks have two functions: (1) they allow us to apply contrast normalization to a local set of cells; (2) different blocks can overlap, which results in significantly higher performance.


The authors cannot explain why this is the case, but we assume that the overlap increases invariance to minor translations. The normalization reshapes the block (with its cells) into a vector v and applies a standard norm to it. This can be any norm such as L1 (v → v/(‖v‖₁ + ε)) or L2 (v → v/√(‖v‖₂² + ε²)), possibly clipped and renormalized. Except for the L1 norm, the resulting performance is equally high for all choices.

According to Dalal and Triggs, this dense grid approach is preferable to keypoint-based approaches, as the latter are not currently capable of reliably detecting human body structures. Local pooling also adds an invariance to “geometric and photometric transformations” [Dalal and Triggs, 2005] such as rotation and translation. This is because minor translations will still assign a pixel's gradient to the same grid cell, and rotations will only lead to an offset in the orientation histogram but not otherwise change the histogram proportions.
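As a rough illustration of steps 2 and 3 above, the following NumPy sketch computes centered gradients and per-cell orientation histograms; block normalization (step 4) is reduced to its simplest per-cell L2 form. Cell size, bin count and the unsigned-orientation convention are simplifying assumptions and do not reproduce every detail of [Dalal and Triggs, 2005] (no bilinear vote interpolation, no overlapping blocks).

```python
import numpy as np

def hog_cells(img, cell=8, bins=9):
    """Per-cell histograms of unsigned gradient orientations, magnitude-weighted."""
    img = img.astype(np.float64)
    # Step 2: 1D centered derivative masks [-1, 0, 1], no smoothing (sigma = 0).
    gx = np.zeros_like(img); gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0           # unsigned orientation in [0, 180)

    h, w = img.shape
    ch, cw = h // cell, w // cell
    hist = np.zeros((ch, cw, bins))
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    # Step 3: each pixel votes into its cell's histogram, weighted by its gradient magnitude.
    for i in range(ch * cell):
        for j in range(cw * cell):
            hist[i // cell, j // cell, bin_idx[i, j]] += mag[i, j]
    return hist

def l2_cell_norm(hist, eps=1e-6):
    """Step 4 (simplified): treat each cell as its own block and L2-normalize it."""
    norm = np.sqrt((hist ** 2).sum(axis=2, keepdims=True) + eps ** 2)
    return hist / norm

# Hypothetical usage on a grayscale, figure-centric crop:
# descriptor = l2_cell_norm(hog_cells(crop)).ravel()
```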

2.5.1.2 Scale Invariant Feature Transform 3D

This section describes the Scale Invariant Feature Transform (SIFT) 3D features introduced by [Scovanner et al., 2007]. SIFT3D features are local interest point features that are inspired by the popular SIFT (2D) image features invented by David Lowe. Note that other authors introduced different extensions of SIFT that are also sometimes referred to as SIFT3D [Filipe and Alexandre, 2013][Flint et al., 2007].

Traditional SIFT features are placed at remarkable points in an image (typically corners) and can be used for various applications such as tracking, matching, object detection and recognition. SIFT3D extends the definition of SIFT to three dimensions. It can be used to analyze key positions in real 3D data, such as the data resulting from Magnetic Resonance Imaging (MRI), or in video, which can be interpreted as 3-dimensional (with 2 dimensions for space and 1 dimension for time). This 3D cube of XYT data is referred to as the space-time volume.

In contrast to other similar approaches such as [Niebles et al., 2006], the authors of SIFT3D do not choose interest points according to a specific criterion (such as local gradient maxima), but rather select keypoints randomly from the space-time volume. One reason for this is that SIFT3D can be faster and even used for online learning. It is also questionable whether the most important points in an image are actually located on corners and whether traditional SIFT can always find them. The SIFT3D descriptor uses a bag of (spatio-temporal) words approach and analyzes the relationships between the words to form word groupings.

The computation of SIFT3D descriptors can be divided into 4 steps, as described in [Scovanner et al., 2007]. The first step is to choose a keypoint from the space-time volume at random. The second step is to assign a principal direction to the keypoint's 3D neighborhood. After that we build orientation sub-histograms, and finally the concatenation of these histograms is clustered using unsupervised learning techniques. Each step is described more closely below.

1. Keypoint selection: As mentioned above, n keypoints are randomly sampled from the space-time volume. The authors recommend n = 200 for their application. For each keypoint we look at a local 3D neighborhood of dx × dy × dt pixels, where dx, dy and dt are the predefined width, height and depth of the keypoint (typically 13 × 13 × 13).

2. Orientation assignment: In the next step the gradient magnitude and the 3D orientation of each pixel are computed. The local gradients Lx, Ly and Lt are approximated using finite differences:

Lx = L(x+1, y, t) − L(x−1, y, t)   (2.2)
Ly = L(x, y+1, t) − L(x, y−1, t)   (2.3)
Lt = L(x, y, t+1) − L(x, y, t−1)   (2.4)

The magnitude m in 3D corresponds to the length of the gradient vector, i.e. its Euclidean norm (L2):

m(x, y, t) = √(Lx² + Ly² + Lt²)   (2.5)

The orientation in 3D can be described using the scalars θ and φ. θ describes the planar angle in the XY plane, whereas φ describes the angle away from that plane:

θ(x, y, t) = arctan(Lx / Ly)   (2.6)

φ(x, y, t) = arctan(Lt / √(Lx² + Ly²))   (2.7)

Each pixel's orientation can be described by a unique pair (θ, φ). The next step is to build a histogram of these directions weighted by their magnitudes. The authors present two approaches for dividing the sphere into bins. One approach is to divide θ and φ into equally sized histogram bins, which can be thought of as a sphere divided into meridians and parallels. Another way would be to construct an icosahedron (a regular polyhedron with 20 equally sized faces). The first approach has the advantage of being simpler and faster, but to estimate the orientation more precisely we need to use interpolation to find the peak magnitude. Another disadvantage of this approach is that we need to normalize the cumulative magnitudes by their solid angle ω. The authors explain this requirement by the fact that any 2D map of the earth (which is approximately a sphere) either stretches areas disproportionately (such as the Mercator projection) or creates discontinuities (such as the Sinusoidal projection). The actual magnitude that is added to the current (θ, φ) bin is described by the following equation:

hist(iθ, iφ) += (1/ω) · m(x′, y′, t′) · exp(−((x−x′)² + (y−y′)² + (t−t′)²) / (2σ²))   (2.8)

Here (x, y, t) are the keypoint coordinates, whereas (x′, y′, t′) refers to the pixel added to the histogram. We can see that the contribution depends on the solid angle ω and on the magnitude at the current position, weighted by a Gaussian function of the pixel's distance to the keypoint. The dominant orientation is stored to create rotationally invariant features in the next step (a small sketch of this orientation computation follows after this list).

3. Descriptor representation: The previously processed pixels are now rotated such that their dominant direction (θ, φ) points in a predefined direction (θ′, φ′) = (0, 0), using the rotation matrix

[ cos θ cos φ    −sin θ    −cos θ sin φ ]
[ sin θ cos φ     cos θ    −sin θ sin φ ]   (2.9)
[ sin φ           0          cos φ      ]

A local neighborhood (typically 4 × 4 × 4) of each interest point is extracted, where each pixel is described by (θ, φ, m). For each region the orientations are accumulated into a sub-histogram. Finally we concatenate the sub-histograms to create our descriptor.


4. Pre-classification: At this point we have a set of n keypoints and their descriptors. Due to the random placement of keypoints, it would not be very meaningful to learn the descriptors directly with the learning machines introduced in Section 2.7. Instead we first cluster similar descriptors using hierarchical k-means clustering with a predefined number of clusters. Note that this step is an unsupervised learning procedure, which means that we do not know in advance which cluster a certain descriptor will be associated with. The cluster centers are referred to as “words”. For each video we build a histogram of word frequencies. Since some words might co-occur with others, we transform the word frequency histogram into a feature grouping histogram as follows: We build a co-occurrence matrix and fill it using the word frequency histograms. To quantify the similarity between two words, we compute the correlation of their distribution vectors. If the correlation is above a given threshold, we know that both words represent something similar and can regard them as a single word in the word frequency histogram. This modified histogram is the feature grouping histogram that can be learned by a learning machine.
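The following sketch makes the orientation computation of step 2 concrete: it evaluates Eqs. 2.2-2.7 for every inner voxel of a small space-time neighborhood and accumulates a (θ, φ) histogram using the meridian/parallel binning. The Gaussian weighting and solid-angle normalization of Eq. 2.8 are reduced to a plain magnitude vote, and the bin counts are illustrative assumptions.

```python
import numpy as np

def orientation_histogram(vol, theta_bins=8, phi_bins=4):
    """Magnitude-weighted (theta, phi) histogram over a 3D neighborhood vol[x, y, t]."""
    # Finite-difference gradients (Eqs. 2.2-2.4); boundary voxels are ignored.
    Lx = vol[2:, 1:-1, 1:-1] - vol[:-2, 1:-1, 1:-1]
    Ly = vol[1:-1, 2:, 1:-1] - vol[1:-1, :-2, 1:-1]
    Lt = vol[1:-1, 1:-1, 2:] - vol[1:-1, 1:-1, :-2]

    m = np.sqrt(Lx**2 + Ly**2 + Lt**2)                     # Eq. 2.5
    theta = np.arctan2(Lx, Ly)                             # Eq. 2.6 (planar angle in the XY plane)
    phi = np.arctan2(Lt, np.sqrt(Lx**2 + Ly**2))           # Eq. 2.7 (angle away from the XY plane)

    # Meridian/parallel binning of the orientation sphere.
    t_idx = ((theta + np.pi) / (2 * np.pi) * theta_bins).astype(int) % theta_bins
    p_idx = ((phi + np.pi / 2) / np.pi * phi_bins).clip(0, phi_bins - 1).astype(int)

    hist = np.zeros((theta_bins, phi_bins))
    np.add.at(hist, (t_idx.ravel(), p_idx.ravel()), m.ravel())
    return hist

# Hypothetical usage: a 13x13x13 neighborhood around a randomly drawn keypoint.
# rng = np.random.default_rng(0)
# hist = orientation_histogram(rng.random((13, 13, 13)))
```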

From the algorithm description we can see that SIFT and SIFT3D are more involved than HOG features in terms of computation and overall complexity, even though HOG was invented after SIFT. However, it should be noted that SIFT was primarily invented for image registration and not for activity recognition. This is what SIFT3D tries to overcome, but some aspects remain controversial, such as (1) the use of sparse keypoints, which might not find the most important areas of the image, and (2) the focus on invariance to rotation, which might remove a very important source of information.

This type of feature is typically referred to as a Bag-of-Features or Bag-of-Words (BoW) model. We find local keypoints (SIFT3D), extract codewords (k-means) and store them in a codebook (histogram). The simplicity of this approach, which effectively ignores any spatial or temporal relations between the keypoints, can be an advantage as well as a disadvantage. For object recognition, BoW models are very successful because the occurrence of a few remarkable keypoints can already indicate the object class. We are not aware of any systematic studies that analyze the usefulness of BoW models for activity recognition.
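To illustrate the bag-of-words step (descriptors → codewords → codebook histogram), here is a compact NumPy sketch with a plain k-means in place of the hierarchical variant used by the authors; the co-occurrence-based word grouping is omitted and the number of words is an arbitrary assumption.

```python
import numpy as np

def kmeans(descriptors, k=50, iters=20, seed=0):
    """Plain k-means; the cluster centers play the role of the visual 'words'."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each descriptor to its nearest center.
        d = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers; empty clusters keep their previous center.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = descriptors[labels == j].mean(axis=0)
    return centers

def bow_histogram(descriptors, centers):
    """Normalized word-frequency histogram of one video's descriptors against the codebook."""
    d = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / max(hist.sum(), 1.0)

# Hypothetical usage: train_desc stacks the SIFT3D descriptors of all training videos.
# codebook = kmeans(train_desc, k=200)
# video_feature = bow_histogram(video_desc, codebook)
```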

2.5.2 Trajectory-based Features

The easiest and most obvious feature types can be extracted from the bounding boxes in the stabilized images. For example, a running person is quickly recognized by the speed of the movement of the center of its bounding box, without even looking at the box contents. The path of the moving center points is called a trajectory. This feature type only makes sense with static cameras, or with moving cameras and a strong stabilization of the video frames. Note also that, since the camera sensor and the ground plane are usually not parallel, a distance in pixels does not generally correspond to the distance on the ground. This can be addressed by computing the homography between both planes (assuming level ground) or by computing the height profile of the scene. Other distortions stem from aberrations of the optical system.

The features that can be extracted from the bounding boxes are functions of position and time. Examples are statistics of distance, speed and acceleration between frames (such as mean, standard deviation, discontinuities etc.). However, we have to consider carefully whether these features are meaningful and do not lead to overfitting of our data. For example, if we were to include the absolute position in pixels, our system might learn that a certain activity always takes place in a certain region of the image. This would make it impossible for the system to generalize from the activity and apply the knowledge to other data. The same holds for features that depend on the duration of the video of the activity.


We cannot generally assume that the video duration is proportional to the duration of the activity, especially when activities are performed multiple times in a video and we do not know the number of repetitions. Whether such high-level features are relevant depends mostly on the types of activities to detect. On a sign language database the bounding box of the whole person would not be of interest; if we want to distinguish between running and standing activities, this feature might prove to be the most useful one. However, this work does not focus on such features, as we want to recognize activities from a person-centric camera view and focus on the way that different body part gestures combine to create an activity.

2.5.3 Motion-based Features

Apart from trajectory-based global movements there are also local movements taking place in the video. The comparison of two frames can give us important clues as to which parts have changed. These features can help us learn how an activity develops over time.

2.5.3.1 Optical Flow

Optical flow is the most common motion-based feature. For each pair of frames, it computes the direction and velocity of each pixel between both frames. This means that we try to approximate the 2D motion field. The explanations here are based on [Fleet and Weiss, 2006]. We assume that pixels are translated from one frame to the next:

I(x, y, t) = I(x + u, y + v, t + 1)   (2.10)

Here I(x, y, t) is the intensity value of an image at position (x, y) at time t, and (u, v) is the 2D velocity. This is only an approximation, since brightness constantly changes throughout typical videos. In fact, only Lambertian surfaces with very distant point light sources have constant brightness values even with changing camera positions. We can now approximate the translated image by a first-order Taylor series,

I(x + u, y + v, t + 1) = I(x, y, t) + (u, v)^T · ∇I(x, y, t) + It   (2.11)

where ∇I = (Ix, Iy) and It are the spatial (in x and y direction) and temporal derivatives of I(x, y, t). By substituting Eq. 2.11 into Eq. 2.10 we get the gradient constraint equation

Ix · u + Iy · v + It = 0.   (2.12)

This equation has two unknown variables u and v and therefore cannot be solved without further constraints. This is referred to as the aperture problem: we can only compute the flow component parallel to the gradient direction, but not the tangential component. There are two classical solutions to this problem that introduce further constraints. They are named after their inventors, Lucas-Kanade and Horn-Schunck.

Horn-Schunck

The solution suggested by Horn and Schunck makes the assumption of global smoothness in the image. It takes the left-hand side of Eq. 2.12, adds a smoothness term to it and minimizes the resulting energy functional with the smoothness regularization constant λ:

min_{u,v} ∫∫ (Ix · u + Iy · v + It)² + λ(‖∇u‖² + ‖∇v‖²) dx dy.   (2.13)

λ can be used to give more or less weight to the smoothing compared to minimizing the aperture equation. The advantage of such a global approach is that values are propagated, especially into homogeneous areas of the image. Local methods, in contrast, might lead to singular systems of equations if the intensity values are very uniform. To compute the solution we can solve the corresponding Euler-Lagrange partial differential equations. Global methods for computing optical flow are generally slower than local approaches.
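As a hedged illustration, the sketch below implements the classic Jacobi-style iteration that results from discretizing the Euler-Lagrange equations of Eq. 2.13, with the Laplacian approximated by the difference between each pixel's flow and the average flow of its four neighbors. The derivative scheme, the neighbor averaging and the fixed iteration count are simplifying assumptions.

```python
import numpy as np

def horn_schunck(I1, I2, lam=100.0, iters=100):
    """Dense optical flow (u, v) between two grayscale frames, Horn-Schunck style."""
    I1 = I1.astype(np.float64); I2 = I2.astype(np.float64)
    # Simple spatial/temporal derivatives; more careful stencils are used in practice.
    Ix = np.gradient(I1, axis=1)
    Iy = np.gradient(I1, axis=0)
    It = I2 - I1

    def local_avg(f):
        """Average of the 4-neighborhood, used to approximate the local mean flow."""
        p = np.pad(f, 1, mode='edge')
        return 0.25 * (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:])

    u = np.zeros_like(I1); v = np.zeros_like(I1)
    for _ in range(iters):
        u_bar, v_bar = local_avg(u), local_avg(v)
        # Update derived from the Euler-Lagrange equations of the functional in Eq. 2.13.
        num = Ix * u_bar + Iy * v_bar + It
        den = lam + Ix**2 + Iy**2
        u = u_bar - Ix * num / den
        v = v_bar - Iy * num / den
    return u, v

# Hypothetical usage on two consecutive stabilized frames:
# u, v = horn_schunck(frame_t, frame_t1)
```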


Lucas-Kanade

Lucas and Kanade suggested using a local least squares method (see [Thota et al., 2013] for more information). They assume that the optical flow is identical within the local environment of each pixel. If that is the case, we can use the known image gradients of each of the n pixels in the neighborhood and join them in a single system of equations

Ix1 · u + Iy1 · v = −It1
Ix2 · u + Iy2 · v = −It2
...
Ixn · u + Iyn · v = −Itn   (2.14)

where the indices of x, y and t denote the n pixels in the local neighborhood. This system can be solved using a standard least squares method

(u, v)^T = (A^T A)^(−1) A^T b   (2.15)

where A is the matrix with elements Ai1 = Ixi and Ai2 = Iyi, and b is a vector such that bj = −Itj. Alternatively, the system can be solved using iterative approaches. To put a stronger weight on the central pixels of the neighborhood, we can additionally weight each pixel by a Gaussian kernel centered at the central pixel. Hierarchical Gaussian pyramid solutions exist that extract optical flow at different scales, going from coarse to fine image scale. In addition to being efficient, this approach is also relatively robust to noise and image defects.
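The following sketch solves the least-squares system of Eqs. 2.14-2.15 for a single pixel neighborhood; in practice this is repeated for every pixel (or every tracked corner), often with Gaussian weighting and image pyramids as mentioned above. The window size and derivative scheme are illustrative assumptions.

```python
import numpy as np

def lucas_kanade_at(Ix, Iy, It, x, y, win=7):
    """Flow (u, v) at pixel (x, y) from precomputed derivative images, via Eqs. 2.14-2.15."""
    r = win // 2
    # Gather the n = win*win constraints of Eq. 2.14 from the local neighborhood.
    A = np.stack([Ix[y - r:y + r + 1, x - r:x + r + 1].ravel(),
                  Iy[y - r:y + r + 1, x - r:x + r + 1].ravel()], axis=1)
    b = -It[y - r:y + r + 1, x - r:x + r + 1].ravel()
    # Least-squares solution of A (u, v)^T = b; lstsq avoids forming (A^T A)^(-1) explicitly.
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v

# Hypothetical usage with simple derivatives of two stabilized frames I1, I2:
# Ix, Iy = np.gradient(I1, axis=1), np.gradient(I1, axis=0)
# It = I2 - I1
# u, v = lucas_kanade_at(Ix, Iy, It, x=50, y=60)
```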

2.5.3.2 Histogram of Oriented Optical Flow

In the section on HOG features we discussed that histograms are a very promising approach to average over noisy measurements and allow minor translations while keeping directional information. This approach has been applied to optical flow by [Chaudhry et al., 2009], who call it the Histogram of (Oriented) Optical Flow (HOF or HOOF). The authors note that optical flow is generally prone to “background noise, scale changes as well as directionality of movement” [Chaudhry et al., 2009].

The idea is to accumulate a single histogram over all optical flow vectors. Note that this stands in contrast to HOG, where we have one histogram per cell. Each vector v = (x, y)^T is binned “according to its primary angle from the horizontal axis and weighted according to its magnitude” [Chaudhry et al., 2009]. The primary angle θ is computed as θ = tan⁻¹(y/x). v contributes √(x² + y²) to bin b out of B total bins if

−π/2 + π(b−1)/B ≤ θ < −π/2 + πb/B.   (2.16)

The final histogram is normalized such that it sums to 1. This makes HOOF scale-invariant, since otherwise the magnitudes would be larger for zoomed-in images. We assume that the invariance to scale only holds if the background is static; otherwise, using a larger image region would change the proportion of optical flow that is attributable to the person in the video. Since [Chaudhry et al., 2009] suggest binning optical flow vectors according to their primary angle (and not the angle to the x-axis, which is extracted by the function usually referred to as “atan2”), the resulting histogram is symmetric along the vertical axis. This is visualized for B = 4 in Fig. 2.4. Apparently the authors had only horizontal camera settings in mind, where this approach achieves invariance to mirroring between leftward and rightward movement.


Figure 2.4: Histogram formation with four bins, Copyright by [Chaudhry et al., 2009]
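A minimal sketch of the binning rule in Eq. 2.16, assuming a dense flow field (u, v) as produced by one of the methods above; the normalization to unit sum is included, while magnitude thresholds or any spatial pooling are left out.

```python
import numpy as np

def hoof(u, v, n_bins=4):
    """Histogram of Oriented Optical Flow (Eq. 2.16), normalized to sum to 1."""
    mag = np.hypot(u, v).ravel()
    # Primary angle theta = atan(y/x) in (-pi/2, pi/2); mirrors leftward and rightward motion.
    theta = np.arctan(np.divide(v, u, out=np.zeros_like(v, dtype=float), where=u != 0)).ravel()
    # 0-based bin b covers [-pi/2 + pi*b/B, -pi/2 + pi*(b+1)/B).
    idx = ((theta + np.pi / 2) / np.pi * n_bins).astype(int).clip(0, n_bins - 1)
    hist = np.bincount(idx, weights=mag, minlength=n_bins)
    total = hist.sum()
    return hist / total if total > 0 else hist

# Hypothetical usage on the flow field of one figure-centric frame pair:
# descriptor = hoof(u, v, n_bins=30)
```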

2.5.4 Similarity-based Features

The term similarity has different meanings in different fields. We know concepts such as geometric similarity (two objects having the same shape) and matrix similarity in linear algebra. The two similarity features presented in this section have in common that a representation of an object or activity is compared to the representation of another one.

2.5.4.1 Self-Similarity

In this context, self-similarity refers to a similarity between parts of the activities in our video. Certain action types, e.g. walking, often result in movements with a strongly periodic nature. But non-periodic activities, such as digging a hole, are also often repeated in a very similar manner. These regularities can be exploited to retrieve a very different kind of feature compared to the other approaches mentioned above. [Cutler and Davis, 2000] define periodic motion as

X(t + p) = X(t) + T(t),   (2.17)

where X(t) is a point at time t and T(t) is a translation of the point. The period is then the smallest p > 0 that satisfies the above equation. Periodic motion has a temporal symmetry. To compute self-similarity, two assumptions have to be made: 1) the orientation and size of an object do not change during a video, or at least not without periodicity; 2) the frame rate of the video is high enough to capture the periodic motion (at least twice the highest frequency). The authors of [Cutler and Davis, 2000] state that their approach requires motion segmentation and tracking as preprocessing steps. This is important to obtain robust results, particularly if each object can move with a different period. The self-similarity between two frames t1 and t2 for an object Ot with bounding box Bt can be computed using an arbitrary image similarity metric R(·, ·):

S(t1, t2) = Σ_{(x,y)∈Bt1} R(Ot1(x, y), Ot2(x, y)).   (2.18)

The authors suggest metrics such as absolute correlation, normalized cross-correlation, Hausdorff distance or color indexing.


To compensate for minor translations, the minimum distance is found by translating the object over a search radius r:

S′(t1, t2) = min_{|dx,dy|<r} Σ_{(x,y)∈Bt1} R(Ot1(x + dx, y + dy), Ot2(x, y)).   (2.19)
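A small sketch of Eqs. 2.18-2.19, using the sum of absolute differences as the image similarity metric R and a brute-force search over the translation window; the tracked, figure-centric crops are assumed to be given and to share the same size.

```python
import numpy as np

def ssm(crops, r=2):
    """Temporal self-similarity matrix S'(t1, t2) over a list of equally sized object crops."""
    def sad(a, b):                         # absolute-difference similarity metric R
        return np.abs(a.astype(float) - b.astype(float)).sum()

    n = len(crops)
    S = np.zeros((n, n))
    for t1 in range(n):
        for t2 in range(n):
            best = np.inf
            # Eq. 2.19: search over small translations |dx|, |dy| < r of the first crop.
            for dx in range(-r + 1, r):
                for dy in range(-r + 1, r):
                    shifted = np.roll(np.roll(crops[t1], dy, axis=0), dx, axis=1)
                    best = min(best, sad(shifted, crops[t2]))
            S[t1, t2] = best
    return S

# Hypothetical usage: crops[i] is the stabilized bounding-box image of the person in frame i.
# S = ssm(crops)   # low-valued bands parallel to the diagonal indicate periodic motion
```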

Plotting the matrix S′(t1, t2) for each frame pair at times t1 and t2 gives us what is referred to as a temporal Self-Similarity Map (SSM) [Korner and Denzler, 2013]. Such an SSM is shown in Fig. 2.5 for a periodic activity. Since high similarity is represented by black and low similarity by white, we get a black diagonal, because each frame is completely similar to itself. Periodic motions are indicated by dark lines parallel to the diagonal. To estimate the spectral power for a given frequency, we can average over several columns of the SSM. Only significant peaks of the spectrum are used.

Figure 2.5: Similarity of images T1 and T2, Copyright by [Cutler and Davis, 2000]

The authors go on to describe different techniques that can be used depending on whether the columns of S′ are stationary and whether periodicity exists in one or more frequencies. For non-stationary periodicities they propose using time-frequency analysis and a Short-Time Fourier Transform (or windowed Fourier Transform)

F_x(t, v; h) = ∫_{−∞}^{∞} x(u) h*(u − t) e^{i2πvu} du,   (2.20)

with a short-time window h*(u − t) at position t, such as a Hanning window.

[Korner and Denzler, 2013] extend this approach even further. Instead of using the direct pixel intensity values of the image, they compute SSMs over diverse low-level features such as pixel intensities, HOG, HOOF and Fourier coefficients. Fig. 2.6 visualizes these low-level features applied to different activities. They apply PCA for dimensionality reduction. The resulting features are learned using a Bag of Visual Words approach that assigns each descriptor to a prototype using a clustering algorithm. The final feature histogram is similar to what we saw in Section 2.5.1.2. It is learned using a Gaussian Process classifier combined with a histogram intersection kernel.

2.5.4.2 Symmetry

[Haritaoglu et al., 1999] have exploited spatial symmetry to detect people carrying objects.


Figure 2.6: SSMs extracted from recordings of the activities of a robot dog using different features. Red colors indicate high similarity whereas dark blue indicates very low similarity. Copyright by [Korner and Denzler, 2013]

Spatial symmetry is actually a form of self-similarity under a class of transformations (such as translations, rotations and reflections). Since the authors focus only on the silhouettes of a person in a video, this is essentially an appearance-based approach which requires a good person segmentation. The authors assume that an upright standing person is symmetric along the main body axis. During tracking, the regions of the silhouette which violate the symmetry constraints are found. The object can then be tracked to decide whether it is handed over from one person to another. The natural motion of a person's limbs can also cause outliers in the symmetry. The difference between carried objects and body parts is that the latter exhibit periodic motion whereas the former do not.

To compute the silhouette model, Haritaoglu et al. apply a PCA (see Section 2.6.2.1) to the silhouette pixels. The vector of the main body axis is the eigenvector with the largest corresponding eigenvalue of the covariance matrix. The shape of the silhouette is represented by two 1D projection histograms for the vertical and the horizontal axis, where vertical means along the main body axis and horizontal means perpendicular to that axis. Both histograms are normalized by scale, and their median coordinate is subtracted to make them comparable between frames.

To analyze the periodicity of a movement, the authors compute the self-similarity of the silhouettes over time. Note that here self-similarity does not refer to the same concept as in the section above. A similarity plot S(t1, t2) between two projection histograms p^{ti} at times ti is computed using

S(t1, t2) = min_{|dx|<q} Σ_{L<i<R} |p^{t1}_{i+dx} − p^{t2}_{i}|,   (2.21)

where L and R are the minimum and maximum values of the projection histogram. To account for tracking errors, we vary the translation of p^{t1} over a search window q. The next step is similar to the self-similarity features: we compute the SSM and find the peak of each row. The most frequently occurring peak corresponds to the fundamental frequency of the motion; the confidence is determined by the number of rows that vote for this peak. The symmetry analysis is done by declaring any pixel which is further away from the main body axis than any pixel along the horizontal line that runs through it as non-symmetric. All other pixels are symmetric.
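The following sketch computes the main body axis of a binary silhouette via PCA and the two normalized 1D projection histograms that enter Eq. 2.21; the bin count and the exact normalization details are simplifying assumptions.

```python
import numpy as np

def projection_histograms(silhouette, n_bins=32):
    """Along-axis and perpendicular projection histograms of a binary silhouette (H, W)."""
    ys, xs = np.nonzero(silhouette)
    pts = np.stack([xs, ys], axis=1).astype(float)
    pts -= pts.mean(axis=0)
    # Main body axis = eigenvector of the covariance matrix with the largest eigenvalue.
    vals, vecs = np.linalg.eigh(np.cov(pts.T))
    axis = vecs[:, np.argmax(vals)]
    ortho = np.array([-axis[1], axis[0]])

    hists = []
    for direction in (axis, ortho):
        coord = pts @ direction
        coord = coord - np.median(coord)               # subtract the median coordinate
        coord = coord / (np.abs(coord).max() + 1e-9)   # normalize by scale
        h, _ = np.histogram(coord, bins=n_bins, range=(-1, 1))
        hists.append(h / max(h.sum(), 1))
    return hists   # [vertical (along the body axis), horizontal (perpendicular)]

# Hypothetical usage: mask is a 0/1 foreground mask from person segmentation.
# p_vert, p_horz = projection_histograms(mask)
```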


2.6 Dimensionality Reduction

The goal of dimensionality reduction is to reduce the number of features that have to be processed. Most learning machines converge faster in a low-dimensional space, so dimensionality reduction can dramatically speed up the learning process and reduce the required resources such as CPU runtime or RAM. Another key advantage is that it can mitigate the curse of dimensionality and therefore even increase accuracy and avoid overfitting.

Two different approaches exist: feature selection and feature extraction. Whereas the former tries to select a subset of the available features according to some criterion, the latter creates new and simplified features that represent the original ones.

2.6.1 Feature Selection

Feature selection is the process of selecting a subset of the available features. This can be helpful when the data contains irrelevant or redundant features. It leads to a shorter training time of the learning machine and a better generalization, since it avoids overfitting.

Feature selection typically consists of two steps: the search for a subset of features and the evaluation of its performance using a given evaluation metric. Search strategies include exhaustive search, greedy selection, genetic algorithms and simulated annealing. The evaluation metrics can be consistency-based, correlation-based or based on how well different classes are separated. Class separability metrics include error probability, inter-class distance, entropy, Gini and twoing (described in Section 2.7.3.2). The performance is often cross-validated on a validation dataset to avoid overfitting to the training data.

As we will see in Section 2.7.3.2, some learning machines such as Random Forests integrate feature selection mechanisms.

2.6.2 Feature Extraction

Feature extraction techniques seek to compress features to a lower dimension. They are often referred to as projective methods since they project a high-dimensional feature space to a lower-dimensional one. An optimal reduction technique does so without removing any information, which is comparable to lossless compression. However, this is usually not possible and therefore we investigate methods that remove undesired noise and keep the essential information.

In Section 2.5.1.2 we described how clustering and histograms are part of the SIFT3D algorithm to cluster space-time volumes. Both of these approaches are unsupervised dimensionality reduction techniques. A disadvantage of k-means clustering is that the quality of the result depends on the (usually randomly chosen) initial values. We also have to know the number of clusters (k) in advance. Histograms are completely deterministic and therefore do not suffer from randomness, but a standard histogram is only capable of gathering 1D data, although extensions to multiple dimensions exist. If our features have a large number of dimensions or a high covariance, a histogram might not be an optimal choice for feature extraction. We therefore present a more involved approach, known as the Principal Component Analysis.

2.6.2.1 Principal Component Analysis

The Principal Component Analysis (PCA) was introduced by Karl Pearson in 1901. The following explanations are taken from [Shlens, 2005]. Each of the n samples of our data X is an m-dimensional vector, where m is the number of features. The vector lies in an m-dimensional vector space that is spanned by an orthonormal basis. If X has non-zero mean we need to subtract the mean from each component. What PCA does is to change to a new basis with more favorable properties. We need to select our transformation so


Figure 2.7: Redundancy of 2-dimensional data with the best linear fit indicated by a dashed line. Copyright by [Shlens, 2005]

as to deal with the noise, redundancy and rotation. These issues are directly linked with highly correlated feature dimensions, as can be seen in Fig. 2.7. Redundancy occurs when one feature dimension depends on another (c). Noise mitigates the redundancy by blurring the dependency (a). Finally, rotation is also a form of redundancy since the data in (c) could be represented by just one dimension if it were rotated such that the dashed line corresponded to our x-axis. The linear transformation P that we are searching for can be computed using one of the following approaches:

• Compute eigenvectors: The dependence between different feature dimensions is represented in the covariance matrix of our data

  C_X = \frac{1}{n-1} X X^T .    (2.22)

  This matrix has the properties that large numbers on the diagonal correspond to either important dynamics or noise, whereas large values in all other elements represent a high degree of redundancy. Therefore we need to diagonalize the matrix to extract only the diagonal elements and reduce redundancy. The matrix can be diagonalized by computing the eigenvectors and sorting them by the size of their eigenvalues. Each row of P then corresponds to an eigenvector of C_X.

• Singular Value Decomposition: A computationally more efficient way to perform PCA is by using Singular Value Decomposition (SVD). Singular values are defined as \sigma_i = \sqrt{\lambda_i}. By definition, the eigenvectors v_i and eigenvalues \lambda_i of the covariance matrix fulfill the equation (X^T X) v_i = \lambda_i v_i. This is equivalent to X v_i = \sigma_i u_i for a vector u_i. U = [u_1, u_2, \dots, u_n] and V = [v_1, v_2, \dots, v_m] are the orthogonal matrices that represent an orthonormal basis and are spanned by the respective vectors. Note that U is identical to P. Let \Sigma be the matrix that has the rank-ordered singular values \sigma_i on its diagonal. This matrix is filled with zeros to have m \times n size. The previous equation is then equivalent to the matrix multiplication X V = U \Sigma. Since V^{-1} = V^T for an orthogonal matrix, we can rewrite the SVD as

  X = U \Sigma V^T .    (2.23)

  We can compute this decomposition using standard techniques. If \Sigma is a square matrix, then the SVD can be interpreted as a rotation, a scaling and another rotation.


The final basis change is done using

Y = PX , (2.24)

where Y is the new data representation. Each row of P is an eigenvector of C_X and is called a Principal Component (PC). Up to this point we have not reduced the dimension at all. We know that the PCs are sorted by their eigenvalues in decreasing order and that the eigenvalues correspond to the variance that the respective PC describes. To reduce the dimensionality without taking away too much variance, we can use only the first k < m dimensions.

Despite its great success, the PCA has two weaknesses: (1) It has only a single parameter k. This means that we have no influence on the output. It is essentially an unsupervised reduction technique, and even if we know the labels of our training data, we cannot use them to improve the results. (2) It is linear. There are various situations where the data could be arranged along non-linear curves such as a perfect circle. However, techniques exist that first apply a predefined transformation (such as switching to polar coordinates) and then apply the PCA. These techniques are known as Kernel PCA [Scholkopf et al., 1999].
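As a minimal sketch of Eqs. 2.22-2.24, the following Python/NumPy function performs PCA on an m x n data matrix (features x samples) via the eigendecomposition of the covariance matrix; the number of retained components k is the only parameter, and all names are illustrative.

```python
import numpy as np

def pca(X, k):
    """Project the m x n data matrix X (features x samples) onto its first
    k principal components (minimal sketch of Eqs. 2.22-2.24)."""
    Xc = X - X.mean(axis=1, keepdims=True)     # subtract a non-zero mean
    n = Xc.shape[1]
    C = Xc @ Xc.T / (n - 1)                    # covariance matrix C_X, Eq. 2.22
    eigvals, eigvecs = np.linalg.eigh(C)       # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1][:k]      # keep the k largest eigenvalues
    P = eigvecs[:, order].T                    # rows of P are the first k PCs
    return P @ Xc                              # Y = P X, Eq. 2.24
```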

Several extensions to the PCA have been proposed, two of which will be presented here:

• Random sampling: Random sampling is a very simple approach. If our data exceeds the RAM or runtime requirements during training, we can select a random subset of the data points and compute the PCA on them.

• The transpose trick: The “transpose trick”2 refers to using the transpose of the data matrix X for PCA. After we have computed the eigenvectors e^T_i and eigenvalues \lambda^T_i of the covariance matrix of the transposed data, we need to retransform them to get the eigenvectors e_i and eigenvalues \lambda_i of the original matrix. This is done using3

  \lambda^T_i = \lambda_i \frac{n-1}{m-1}    (2.25)

  and

  e^T_i = \frac{X e_i}{\sqrt{2 \lambda_i \frac{n-1}{m-1}}} .    (2.26)

  If we compute the eigenvectors directly, the covariance matrix C_X of our m × n data matrix is a symmetric m × m matrix. If we use the transposed matrix, the covariance matrix is a symmetric n × n matrix. This has a significant influence on the required memory consumption and runtime: for n = 10^2 samples and m = 10^5 features, the resulting covariance matrix is 6 orders of magnitude smaller using the transpose trick. As mentioned in ([Hsieh, 2009], p. 44), this approach does not change the result. A code sketch of this approach is given after the list.

2.7 Learning Machines

In this section we present some of the most successful learning machines. All of them are supervised learning techniques which we use as classifiers for activities. Our choice reflects the trend of the last decade, which has seen a shift in popularity from Neural Networks to Support Vector Machines and more recently to Random Forests. k-Nearest Neighbors are mentioned as a simple, but surprisingly effective baseline system to compare the performance against.

2 This method does not seem to have a clearly defined name, although several people refer to it as “transpose trick”. Others call it “economized PCA” ([Altman et al., 2006], p. 465), compare it to rotated PCA ([Hsieh, 2009], p. 44) or do not use a name at all ([Solem, 2012], p. 14).

3http://www.statistics4u.com/fundstat_germ/ee_pca_transposed.html


2.7.1 Support Vector Machines

A Support Vector Machine (SVM) is a non-probabilistic binary linear classifier. It was initially invented by Vladimir Vapnik and its standard formulation was presented by [Cortes and Vapnik, 1995]. To perform classification, each data sample is represented by an m-dimensional data point. The SVM learns a hyperplane in the feature space that separates the data into two classes. The “side” of the hyperplane that a point lies on represents its estimated class label. See Fig. 2.8 for a visualization of the 2-dimensional case, where the feature space is divided by a line into a left and a right side.

Figure 2.8: The separating hyperplane of an SVM in a 2-dimensional feature space. The two classes (white, black) are linearly separable. Public Domain - Created by Cyc

Let us assume that the classes are linearly separable, which means that a hyperplane exists for which all samples of a class are found on the same side of the hyperplane. It is chosen such that the margin on both sides is maximized. For a hyperplane with normal vector w and offset b, the equation is w · x − b = 0. The Support Vectors (SV) are the data points that are closest to the hyperplane. To make sure that the SVs of both classes are equally far away, we choose the parameters such that positive examples fulfill w · x − b ≥ 1 and negative examples fulfill w · x − b ≤ −1. Hence the minimum distance between both classes is 2/‖w‖. We want to maximize this distance by minimizing ‖w‖.

2.7.1.1 Soft Margin in Primal Form

Since we have made the rather unrealistic assumption that both classes are perfectly linearly separable, our previous equations cannot be solved in general. Instead we need to extend our definition to the so-called “soft margin”. By introducing a degree of misclassification ξ_i for each data point x_i, we can extend our hyperplane equation to

yi(w · xi + b) ≥ 1− ξi, 1 ≤ i ≤ n, (2.27)


where y_i denotes the class label +1 or −1 of a data point. To minimize the amount of imperfections ξ_i we include them in the minimization problem as

\operatorname*{argmin}_{w, b} \; \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \xi_i .    (2.28)

The value C in this equation is a parameter that has to be tuned; it represents the trade-off between a large hyperplane margin and a small error ξ_i. Different solutions exist to solve the minimization problem 2.28 under the constraints of 2.27. A direct approach is to introduce Lagrange multipliers α_i and β_i:

\operatorname*{argmin}_{w, \xi, b} \; \max_{\alpha, \beta} \left\{ \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i(w \cdot x_i - b) - 1 + \xi_i \right] - \sum_{i=1}^{n} \beta_i \xi_i \right\} .    (2.29)

This problem statement is also known as the primal form. The stationary Karush-Kuhn-Tucker condition states that the result can be expressed as a linear combination of data samples x_i:

w = \sum_{i=1}^{n} \alpha_i y_i x_i .    (2.30)

Only a few of these α_i will be non-zero; these correspond to the SVs. The offset of the hyperplane is computed as

b = \frac{1}{z} \sum_{i=1}^{z} \left( w \cdot x_i - y_i \right) ,    (2.31)

where z denotes the number of SVs.

2.7.1.2 Soft Margin in Dual Form

There exists a very popular alternative formulation which is referred to as the “dual form”. Its solution is identical to that of the primal form:

\operatorname*{argmax}_{\alpha} \; \left\{ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j) \right\} ,    (2.32)

subject to

\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \qquad i = 1, \dots, n .    (2.33)

The dual form includes the term k(·, ·), which is just the dot product k(x_i, x_j) = x_i · x_j for linear SVMs. It is called the kernel function.

2.7.1.3 Kernel Trick

The most important extension to SVMs is the kernel trick. We have seen that a standard SVM is a linear classifier which tries to split the feature space into two regions using a hyperplane. If we replace the kernel function by a non-linear function, we transform the feature space to a higher dimensional space. It might be possible that the data can be linearly separated in this higher dimensional space even though this was not possible in the original feature space. The result is then mapped back to the original space. Various such kernel functions exist, with the most popular one being the Gaussian Radial Basis Function (RBF) kernel

k(x_i, x_j) = \exp\left( -\gamma \|x_i - x_j\|_2^2 \right), \qquad \gamma > 0 .    (2.34)


The constant γ = 1/(2σ²) describes the extent of the RBF and needs to be tuned to find the best possible value. This kernel is visualized for a 2D feature space in Fig. 2.9. As a simplification, the difference x_i − x_j is replaced by the vector [x, y]^T. Other possible kernels are polynomials of degree d and hyperbolic tangent (sigmoid) functions with 2 free parameters each. The kernel trick has the advantage that it changes our solution from a global representation (a separating hyperplane) into a solution that adapts to the local spatial structure of the training data.


Figure 2.9: Visualization of a Gaussian RBF kernel in 2D feature space

2.7.1.4 Details

Until this point we have only treated binary SVMs. This means that we can only solve two-class classification problems. Obviously this can be extended to multi-class problems. Although more direct approaches exist, the most commonly used approach is a combination of several two-class SVMs. We can create one SVM for each pair of classes (one-versus-one) and then take the class that receives the majority of the votes of the single SVMs. Another approach is to create one SVM per class (one-vs-all). In this case the SVM with the highest score determines the result. It is important that we calibrate the output scores because otherwise they are not comparable between different SVMs.

One benefit of the SVM approach is that the separating hyperplane is not represented by some abstract threshold as in decision trees (see Section 2.7.3.1), but rather by the SVs, which are data samples. Therefore they can be visualized to get an impression of the quality of the SVM and of the data samples that are hard to classify.
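A compact sketch of a one-versus-one multi-class SVM with an RBF kernel, assuming scikit-learn is available; the randomly generated data, the parameter grid for C and gamma and the number of cross-validation folds are purely illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative feature matrix (n samples x m features) and activity labels.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 40)), rng.integers(0, 10, size=200)

# One-versus-one multi-class SVM with an RBF kernel; C and gamma are tuned
# on a small grid via 5-fold cross-validation.
grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
clf = GridSearchCV(SVC(kernel="rbf", decision_function_shape="ovo"), grid, cv=5)
clf.fit(X, y)
print(clf.best_params_, clf.predict(X[:5]))
```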

2.7.2 k-Nearest Neighbor

The k-Nearest Neighbor (kNN) algorithm was introduced by [Friedman et al., 1977]. It is a non-parametric lazy learning algorithm. Lazy means that we do not compute an internal representation of the data during training, such as in an SVM, but rather do the computation on-the-fly during evaluation. kNN is able to do classification and regression, as well as multivariate density estimation.

In this work we focus only on the classification task, in which a new sample is compared to the k “closest” samples in the training data and a majority vote of the k labels decides on the classification of the new sample. The notion of a sample being close to another is expressed by a low dissimilarity or distance. By definition, such a distance function has to satisfy the properties of symmetry and monotonicity. We are typically interested in metric


Figure 2.10: Unit circles for the p-norms with p = 1, 2 and ∞, Copyright by Esmil

distances where additionally the triangle inequality holds. Examples for such metrics include the Minkowski metrics or p-norms, such as the city block metric (p = 1), the Euclidean distance (p = 2) and the Chebyshev distance (p → ∞). Here d_{st} refers to the distance between sample s of the training data X and sample t of the evaluation data Y. One index indicates a vector and two indices indicate an element of the matrix; m is the number of features:

d_{st} = \sqrt[p]{\sum_{i=1}^{m} \left| X_{si} - Y_{ti} \right|^p}    (2.35)

Another very important metric is the Mahalanobis distance, which has the great advantage that it lowers the effect of less important feature dimensions by multiplying with the inverse of the covariance matrix C. As such it is invariant to scale and translation.

d_{st} = \sqrt{(X_s - Y_t)^T C^{-1} (X_s - Y_t)}    (2.36)

The choice of the metric directly affects the search space for the k nearest neighbors. This becomes obvious when we visualize the unit circles of different metrics as seen in Fig. 2.10.

Once we have agreed on a metric, we must find an appropriate data structure for rapid nearest-neighbor queries. Friedman et al. propose an optimized kd tree. A kd tree is a generalization of a binary tree where the leaves hold mutually exclusive subsets of data points. The inner nodes store the index of a discriminating key and its discriminator value, which is the median in the current dimension of all data points below this node. When querying for nearest neighbors, we start at the root, extract the feature dimension of the discriminating key of our sample and compare the value to the stored discriminator value. If it is less than or equal to the discriminator, we continue with the left subtree, otherwise we continue with the right subtree. Once we reach a leaf node, we can retrieve a list of d ≥ k data points that are close to our query point (using the respective metric). We can then compare each point in the list with our query, sort the result by distance and take the k nearest neighbors.

In the last step we do a majority vote of the k data points. The class label that occurs most often is taken as the predicted label for our query point. If the class distribution is skewed, we can do a weighted vote of the data points by their inverse distance to the query point.
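The following brute-force sketch (Python/NumPy, names are illustrative) classifies a query point by a majority vote of its k nearest training samples and supports both the Minkowski metrics of Eq. 2.35 and the Mahalanobis distance of Eq. 2.36; in practice a kd tree would replace the exhaustive search.

```python
import numpy as np

def knn_classify(X_train, y_train, x_query, k=5, p=2, C_inv=None):
    """Majority vote over the k nearest training samples. p selects the
    Minkowski metric (Eq. 2.35); if C_inv (the inverse covariance matrix)
    is given, the Mahalanobis distance (Eq. 2.36) is used instead."""
    diff = X_train - x_query
    if C_inv is not None:
        dist = np.sqrt(np.einsum("ij,jk,ik->i", diff, C_inv, diff))
    else:
        dist = np.sum(np.abs(diff) ** p, axis=1) ** (1.0 / p)
    nearest = np.argsort(dist)[:k]            # brute force instead of a kd tree
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```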

2.7.3 Random Forests

Random Forests are an ensemble learning method used for classification and regression. Ensemble method means that multiple instances of the same kind of model are combined to improve their performance; here these models are single decision trees. The method was invented by Leo Breiman and Adele Cutler, who hold the trademark4. In this section

4See http://www.stat.berkeley.edu/~breiman/RandomForests


we describe how decision trees work, how data points can be split at each node, details on Random Forests and more advanced techniques such as Hough Forests.

2.7.3.1 Decision Tree

Decision trees are predictive modeling approaches that have a tree structure. The leaf nodes hold the labels for classification, whereas the inner nodes are used for threshold comparisons. For the sake of simplicity we concentrate on binary trees with a maximum of two children per node.

The evaluation of the tree works as follows. For a data point x with n dimensions and elements x_i, we start at the root node and at each node execute a threshold comparison. The node stores the feature index i and the threshold value t for comparison. If x(i) ≤ t, we continue along the left branch, otherwise we continue to the right.

The training is more complicated. We usually start at the root node and recursively follow the tree branches for the data points that are associated with them. At each node we find and store the best feature and threshold value for the current comparison. This is done until a stopping criterion is fulfilled. Typically these are a maximum tree depth, a minimum number of children per node or qualitative requirements. The quality of the split at each node can be assessed using the splitting criteria in the next section. To find the best split we have to try out all possible combinations of features and thresholds. Since this is computationally expensive, it is usually recommended to use only a random subsample of those tests if the number of tests is large.

This kind of training is called top-down tree growing. An alternative, though not very common, would be to use bottom-up techniques to aggregate similar features. The opposite of growing is pruning. It is often used as a post-processing technique to avoid overfitting of a tree to the training data. The idea is to successively remove the splits with the highest depth in the tree. This is done as long as the performance evaluated on verification data improves. It is thus a cross-validation technique. We will see below why pruning is not necessary for the decision trees that are used in Random Forests.
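The evaluation of a binary decision tree reduces to a sequence of threshold comparisons. The sketch below uses a hypothetical dictionary representation of the nodes (feature index, threshold, children, class label at the leaves); it is only an illustration, not the representation used in this thesis.

```python
def evaluate_tree(node, x):
    """Descend a binary decision tree for sample x: inner nodes hold a feature
    index and a threshold, leaves hold the predicted class label."""
    while "label" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["label"]

# Tiny hand-built example: split on feature 0 at 0.5, then on feature 1 at 2.0.
tree = {"feature": 0, "threshold": 0.5,
        "left": {"label": "walking"},
        "right": {"feature": 1, "threshold": 2.0,
                  "left": {"label": "running"}, "right": {"label": "digging"}}}
print(evaluate_tree(tree, [0.8, 3.1]))   # -> digging
```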

2.7.3.2 Splitting Criteria

As described above there are several ways to split a set of data points in two parts. To find the best split, we define a measure of impurity and compute the impurity at both newly created child nodes. The two impurity scores are then combined into a value v. For all possible features and threshold values, we take the split with the lowest v. This means that we minimize the impurity of our split.

The following is a non-exhaustive list of splitting criteria that are especially recommended for use in Random Forests. Each measure has a different background and it is not possible to know which one will perform best without regarding the underlying dataset. Note that all of these criteria are univariate splitting criteria. This means that they only regard a single feature for each split. In contrast, multivariate criteria may utilize multiple features in one split. This is typically done via a linear combination of the best features, which can be found by greedy search or linear discriminant analysis [Rokach and Maimon, 2005]. We are however not aware of any papers that use multivariate splitting criteria in Random Forests. This is probably due to the much more complicated training and evaluation procedure that becomes necessary for multivariate criteria. It is also questionable whether features are linear or can be combined via a linear combination.

• Gini index: The Gini index or Gini coefficient describes the purity of a node. It is a measure of statistical dispersion that is often used by sociologists to describe the income distribution of a country. A Gini coefficient of 0 means that each person has the same income, whereas a Gini coefficient of 1 means that one person receives all


the income. In our case it means that for a given node, all classes receive the same ratio of labels or one class is assigned all labels. It is computed as

Gini = 1 - \sum_{c} p(c)^2 ,    (2.37)

where c is a label class and p(c) is the probability that an element at this node belongs to class c. As described above, we need to aggregate the impurity values for the two child nodes. This is done via the linear combination

v = Gini_l \cdot p_l + Gini_r \cdot p_r .    (2.38)

Here pl and pr are the ratios of data points that fall into the left or right branch.

• Twoing: The twoing splitting criterion is less known than the Gini index. It differs from the other measures presented here by directly defining the combined impurity v, without computing it for both child nodes first:

v = p_l \cdot p_r \cdot \sum_{c} \left| p_l(c) - p_r(c) \right|^2    (2.39)

Here p_l(c) and p_r(c) describe the probability that an element at the left or right sub-node belongs to class c. The formula can be divided into two parts. The left part is just the product of the ratios of data points at the left and right child node. The right part looks at each class separately and punishes unequal distributions to both sides. This way the measure puts more stress on equally sized splits and results in more balanced trees.

• Entropy: Entropy is a well-known measure in information theory of the uncertainty of a random variable. The uncertainty of a random variable is equivalent to the information content that it holds. The (Shannon) entropy measures the information content of a message, usually in bits. It represents the average number of bits per symbol that are needed to encode a message (using an optimal encoding):

Entropy = -\sum_{c} p(c) \log_2(p(c))    (2.40)

This means that we choose the split which reduces the information content over all classes by the highest number of bits. If the entropy is 0, then the encoding needs an average of 0 bits per symbol. This corresponds to a split in which all elements belong to one side and therefore no information needs to be stored. If on the other hand it is 1, our encoding needs exactly one bit per symbol, which corresponds to equal probabilities for both results (left or right):

  v = Entropy_l \cdot p_l + Entropy_r \cdot p_r    (2.41)

  As with the Gini impurity, we define v as a linear combination of the ratios of the data points to the left and right branch.

Several studies have tried to compare the performance of these splitting criteria on different datasets in tree-based classification [Zambon et al., 2006], [de Sa et al., 2011]. Most studies find that these measures perform comparably in classification, but that Gini impurity performs slightly better.
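The three criteria can be written down in a few lines. The sketch below (Python/NumPy, names are illustrative; inputs are NumPy arrays) computes the combined impurity v of a candidate split from the class distributions and the sample ratios of the two child nodes, following Eqs. 2.37-2.41 and Eq. 2.39 for twoing as stated in the text.

```python
import numpy as np

def gini(p):
    """Gini impurity of a class-probability vector, Eq. 2.37."""
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy in bits, Eq. 2.40 (0 * log 0 is treated as 0)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def split_value(probs_l, probs_r, ratio_l, ratio_r, criterion="gini"):
    """Combined impurity v of a split: probs_l/probs_r are the class
    distributions and ratio_l/ratio_r the fractions of samples per branch."""
    if criterion == "gini":
        return gini(probs_l) * ratio_l + gini(probs_r) * ratio_r        # Eq. 2.38
    if criterion == "entropy":
        return entropy(probs_l) * ratio_l + entropy(probs_r) * ratio_r  # Eq. 2.41
    return ratio_l * ratio_r * np.sum(np.abs(probs_l - probs_r) ** 2)   # twoing, Eq. 2.39
```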


2.7.3.3 Details

As mentioned above, Random Forests are an ensemble of decision trees. Each tree is trained with a subset of the available features and data samples. The classification is aggregated over all trees using the most frequent class label. This is a combination of several successful ideas that was first presented by Leo Breiman in [Breiman, 2001].

One idea is Bootstrap Aggregating or bagging [Breiman, 1996]. Bagging means that in order to grow each tree we repeatedly use a subset of n data samples (drawn at random with replacement5). The aggregation is done by combining all trees to vote for a common output. This technique performs better than single decision trees in terms of accuracy and it reduces variance and avoids overfitting.

Another influential idea is boosting as presented by [Freund and Schapire, 1997]. Here we combine several weak learners (that are slightly better than random guessing) into one strong learner of possibly arbitrary accuracy. This can obviously only work if the outputs of the weak learners are not strongly correlated. The most popular boosting scheme is Adaptive Boosting (AdaBoost). It constructs a strong classifier H(x) as a linear combination of weak classifiers h(x). This is done in each iteration 1 ≤ t ≤ T. The weights α_t of the linear combination are computed by repeatedly taking the best current weak classifier h_t(x) and measuring the number of correct classifications. For the next iteration, the weight of each correctly classified example is decreased and the weight of each misclassified example is increased. The final classifier can be described by

H(x) = \operatorname{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right) .    (2.42)
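A simplified sketch of the AdaBoost training loop for labels in {−1, +1}; the weak-learner callback train_weak is hypothetical and stands for whatever base classifier is boosted.

```python
import numpy as np

def adaboost(X, y, train_weak, T=50):
    """Simplified discrete AdaBoost. train_weak(X, y, w) is assumed to return
    the best weak classifier (a callable) under the sample weights w."""
    w = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(T):
        h = train_weak(X, y, w)
        pred = h(X)
        err = np.sum(w[pred != y]) / np.sum(w)
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))
        w *= np.exp(-alpha * y * pred)       # shrink weights of correct samples,
        w /= w.sum()                         # grow those of misclassified ones
        ensemble.append((alpha, h))
    # Final strong classifier H(x), Eq. 2.42.
    return lambda X: np.sign(sum(a * h(X) for a, h in ensemble))
```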

Random Forests are a combination of bagging and boosting. For each tree the data is selected at random with replacement and the final result of our Random Forests is a majority vote of multiple weakly correlated trees. This leads to a number of advantages. Random Forests are usually as good as AdaBoost regarding accuracy. They are more robust to outliers and noise compared to AdaBoost or other learning machines. This is because they do not put a higher weight on previously misclassified items and therefore do not overfit mislabeled data. It is not necessary to prune decision trees in Random Forests because the aggregation over multiple trees avoids overfitting. The execution speed of Random Forests is very high because simple trees (weak learners) are sufficient and each tree can be trained and evaluated in parallel.

Random Forests have a number of nice mathematical properties. For a proof of each we refer the reader to [Breiman, 2001]. One such property is that Random Forests do not overfit as more trees are added, but rather converge to a limiting value of the generalization error GE for “almost surely all sequences”. Let us assume a two-class problem. If we know the mean correlation p of the tree outputs and the expected strength s of a set of classifiers, we can estimate an upper bound for the generalization error

GE \le p (1 - s^2) / s^2 .    (2.43)

It is important to notice that the number of features m and the number of samples n chosen per tree influences both the mean correlation and the strength in this equation. If we choose more features or samples per tree, the strength of each single tree increases. However, the correlation also increases until each tree uses all data and hence boosting will have no effect. Therefore n and m are the only parameters in Random Forests that need to be tuned. Strength and correlation can be estimated using the out-of-bag data. For each

5 The original paper writes with replacement, whereas the paper on Random Forests states without replacement. We assume that the original paper is correct.


tree, these are the data samples that were not chosen during random sample selection. We can even use the out-of-bag error estimates as an approximation to the total generalization error. This way we do not need to create special datasets for cross-validation as described in Section 4.2.

Apart from mathematical properties, Random Forests also have the property that, compared to black-box learning machines, they help us understand which features are important and how close different data samples are to each other. To estimate variable importance, we do the following. In each tree we evaluate the out-of-bag data and count the number of correct classifications. As a comparison we put nonsense data into a single feature by permuting its values for all samples. Again we count the number of correct classifications. The difference of both numbers gives us an estimate of the variable importance. This estimate can be averaged over all trees.

The term proximity refers to the similarity of two data points. We can create an N × N matrix for the proximity values of each pair of data points. Now for each tree, we classify the training and out-of-bag data and if both data points arrive at the same leaf node, we increase the value of the pair in the proximity matrix.

A large number of extensions to Random Forests exist. These include classification of categorical variables, regression, novelty detection and handling of missing values6.
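Variable importance as described above can be estimated with a short loop over the trees. The sketch below assumes each tree exposes a predict() method and that the out-of-bag indices of every tree are known; both are assumptions about the surrounding implementation, and all names are illustrative.

```python
import numpy as np

def variable_importance(trees, oob_indices, X, y, feature, seed=0):
    """Average drop in the number of correct out-of-bag classifications after
    permuting one feature ("nonsense data"), estimated over all trees."""
    rng = np.random.default_rng(seed)
    drops = []
    for tree, oob in zip(trees, oob_indices):
        X_oob, y_oob = X[oob], y[oob]
        correct = np.sum(tree.predict(X_oob) == y_oob)
        X_perm = X_oob.copy()
        X_perm[:, feature] = rng.permutation(X_perm[:, feature])
        drops.append(correct - np.sum(tree.predict(X_perm) == y_oob))
    return float(np.mean(drops))
```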

2.7.3.4 Hough Forests

Hough Forests are an extension to Random Forests that was initially described by [Gall and Lempitsky, 2009]. In addition to object or activity recognition they are also capable of estimating the position of the object or actor. Hough Forests are trained similarly to Random Forests. While constructing the trees, at each node we need to find the best split according to some impurity measure (see Section 2.7.3.2). The impurity is computed on both the left and right child node and both values are combined in some way. Hough Forests typically use the class-label uncertainty measure

U_1(A) = |A| \cdot Entropy(\{c_i\}) ,    (2.44)

where A is the set of data samples at that node and {c_i} describes the set of class labels for a binary classification problem (c ∈ {0, 1}). Additionally, a second measure is used that is called the offset uncertainty

U_2(A) = \sum_{i : c_i = 1} \left( d_i - d_A \right)^2 ,    (2.45)

where d_i denotes the distance of the data sample i to the action center and d_A is the average of all distances at the current node. Since Hough Forests optimize our data structure such that the classification result has common class labels and similar distances, we need to minimize both U_1 and U_2. We do this by randomly choosing between both criteria at each node. The final split is the one with

\operatorname*{argmin}_{k} \left( U_\star(\{p_i \mid t_k(I_i) = 0\}) + U_\star(\{p_i \mid t_k(I_i) = 1\}) \right) ,    (2.46)

where ⋆ = 1 or 2 and the t_k are the k possible binary feature comparisons. Hough Forests typically use small image patches as features and compare two pixels (p, q) and (r, s) from image channel a. This pixel test t_{a,p,q,r,s,τ} of image patch I is defined as

t_{a,p,q,r,s,\tau}(I) = \begin{cases} 0, & \text{if } I_a(p, q) < I_a(r, s) + \tau \\ 1, & \text{otherwise} \end{cases}    (2.47)

6See http://www.stat.berkeley.edu/~breiman/RandomForests/


and τ is an offset or handicap value. All parameters of t are chosen at random such that they lie inside the constraints of the data. The values that minimize Eq. 2.46 are stored at the current node.

It should be noted that each distance d_i can be a vector of distances to various reference points in multiple dimensions. Typical examples include the corners and centers of an object's bounding box. A completely different example is the distance in the temporal dimension: we can use Hough Forests to estimate the distance to the start, center or end of our action. Hence they can be used to solve the segmentation problem in spatial and temporal dimensions. For more information we refer the reader to [Yao et al., 2010] and the discussion in Section 5.1.

During evaluation we start as with Random Forests. We pass down each test data sample from the root node of the forest and at each inner node send the data sample to the left or right child node, depending on the result of the binary test. Once we arrive at a leaf node, there are n ≥ 1 data samples remaining. As in Random Forests, the class labels of these samples can be used in a majority vote to estimate the output label. The distance however is generally a continuous measure and therefore we cannot do a majority vote. Instead we use a Parzen-window estimate in the Hough space to accumulate the single hypotheses with respect to their spatial or temporal relation.

The Hough space is just the dual space that has the same number of dimensions as our distance vector. It is a generalization of the concept of the Hough transform that is used to detect shapes such as lines or circles. In the Hough transform, a line is represented by its angle θ to the x-axis and the distance r to the origin. Each point (x, y) on a line casts a vote for all parameter values (r, θ) in a dual space that represent the same point. Finally the parameters which accumulated most of the votes are chosen.

In the Hough Forest case, each sample casts a vote in the Hough space using a Gaussian window. The probability that for a tree T the object is centered at position x, with pixel position y and distance d, is

p(E(x) \mid I(y); T) = \frac{C_L}{|D_L|} \sum_{d \in D_L} \frac{1}{2\pi\sigma^2} \exp\left( -\frac{\|(y - x) - d\|^2}{2\sigma^2} \right) ,    (2.48)

where E(x) denotes the event of the object being centered at position x, I(y) the appearance of patch y, C_L the proportion of object versus all patches at train time and |D_L| the number of distance vectors collected at train time. σ is a constant that describes the standard deviation of the Gaussian. Finally we sum up the Hough images from each tree T_t into

V(x) = \sum_{y \in B(x)} p\left( E(x) \mid I(y); \{T_t\}_{t=1}^{T} \right)    (2.49)

and output the maxima location.
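Two of the building blocks above can be sketched directly: the binary pixel test of Eq. 2.47 and the Gaussian (Parzen-window) vote accumulation of Eq. 2.48 for a two-dimensional Hough image. Array layouts, names and the weight argument (corresponding to C_L/|D_L|) are assumptions made for illustration.

```python
import numpy as np

def pixel_test(patch, a, p, q, r, s, tau):
    """Binary test of Eq. 2.47 on a patch with layout (channels, height, width)."""
    return 0 if patch[a, p, q] < patch[a, r, s] + tau else 1

def accumulate_votes(hough, y, offsets, weight, sigma=2.0):
    """Cast a Gaussian vote around y - d for every stored offset d (Eq. 2.48);
    hough is a 2D float accumulator image, y the patch position (row, col)."""
    rows, cols = np.indices(hough.shape)
    for d in offsets:
        cy, cx = y[0] - d[0], y[1] - d[1]
        g = np.exp(-((rows - cy) ** 2 + (cols - cx) ** 2) / (2.0 * sigma ** 2))
        hough += weight * g / (2.0 * np.pi * sigma ** 2)
    return hough
```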


3. Datasets

In this chapter we describe several activity recognition datasets that are available for public use. We do not include datasets for depth images as they are not relevant for our purpose. Datasets are grouped into real-world datasets, which are camera recordings of real persons, and simulation-based datasets, which model the behavior of a person using computer graphics.

3.1 Categorization

Activity recognition datasets can be categorized depending on the viewing angle of the camera on the scene:

• Horizontal camera settings: This is the most common type of dataset. Videos are recorded such that the line between camera and actor lies in the horizontal plane. The setting is described as a “low-angle shot” in cinematography. The scenes are typically close-up shots where the distance from camera to person is very short, with one or two persons performing an activity. This is usually the simplest perspective for activity recognition as all body parts of the actor are visible and clearly separable. The actors are often oriented towards the camera and the background texture has very low variance (i.e. a white wall). Applications are mostly academic, but also include human-machine interfaces (such as Microsoft Kinect).

• Vertical camera settings: This category includes most of the recordings taken from flying platforms such as planes, balloons or helicopters. The setting is referred to as a “high-angle shot” in cinematography. Applications include public surveillance and military operations, tracking of large crowds, and scientific experiments. Traditionally this seems to be the camera perspective that incurs the highest costs of data collection, since it requires a flying platform to operate from. This becomes apparent from the fact that the most important datasets are gathered by military research organizations such as DARPA. Recent developments of low-priced unmanned aerial vehicles have however broken this “monopoly” and led to a large number of free-to-use research datasets. For this camera perspective, stabilization of images becomes a very important factor. Since most datasets are gathered from flying platforms which are affected by wind and their own movements (such as a plane, which cannot remain stationary), a fixed camera perspective is generally not possible. Rotor-based systems such as helicopters and their small-scale remotely controlled equivalents dramatically improve the quality of the recordings since they can counteract external forces on the system.


Dataset                        | Setting             | Res.    | FPS    | Acts. | Recs. | Height
KTH                            | Horizontal          | 160x120 | 25     | 6     | 600   | 1m
SDHA2010 Aerial View Challenge | Vertical (tower)    | 360x240 | 10     | 9     | 108   | 94m
UCF Aerial Action              | Vertical (balloon)  | 720x480 | 25     | 7     | 4     | 130m
UCF-ARG (A)                    | Vertical (balloon)  | 960x540 | 29     | 10    | 477   | ?
UCF-ARG (R)                    | Angular (roof)      | 960x540 | 29     | 10    | 471   | 30m
UCF-ARG (G)                    | Horizontal          | 960x540 | 29     | 10    | 468   | 1m
UCF Lockheed Martin UAV        | Vertical (balloon)  | 960x540 | 23     | ?     | 3     | ?
VIRAT (Ground)                 | Angular (surv. cam) | varies  | varies | 12    | 11    | 2-5m
VIRAT (Aerial)                 | Vertical (plane)    | 720x480 | 29     | ?     | 24    | ?
Weizmann Action Recognition    | Horizontal          | 180x144 | 50     | 10    | 90    | 1-2m

Table 3.1: Comparison of real-world datasets. Res. is the resolution in pixels, FPS means Frames Per Second and Acts. and Recs. refer to the total number of activities and recordings.

• Angular camera settings: This category mostly includes surveillance camera footage and is referred to as a “medium shot” in film. The camera is placed above a scene to have a good overview at an angle of 5° to 30° to the horizontal plane. Applications include sports broadcasting and compound surveillance. Due to the stationary nature of surveillance cameras, most of these recordings are perfectly stable and allow for very robust activity recognition.

3.2 Real-world Datasets

Table 3.1 gives an overview of real-world datasets that can be found on the web. Fields with “?” indicate that the information is not available or that we cannot estimate it. In the following sections we show a few examples of datasets with a vertical camera setting, since this is the relevant setting for our application.

3.2.1 UCF-ARG

The UCF-ARG corpus, published by K. Reddy from the University of Central Florida, consists of 3 different views on the same scene. The views are aerial, rooftop and ground (hence the name). Rooftop and ground cameras are completely static, whereas the aerial camera is mounted onto the platform of a 4m wide helium balloon. Therefore there is considerable camera motion, especially in the form of rotations around the vertical axis. We also see a varying flying height between different recordings. These effects make the recognition task more difficult, but also more realistic. The choice of a balloon has the advantage of being much more stable compared to a plane. It also allows for additional sensors (Global Positioning System and Inertial Measurement Unit) to provide metadata about the current position and angles as described in1. However, the authors do not provide this metadata. Fig. 3.1 shows the setup of the recordings. The dataset is particularly

1http://spie.org/x41092.xml?ArticleID=x41092


Figure 3.1: Schematic setup of the UCF-ARG recordings, Copyright by UCF CRCV

interesting for studies on the effects of different viewing perspectives. It allows us to investigate whether actions in an angular setup are comparable to actions in a vertical setup as described above. In the corpus 12 actors perform 10 different actions, each of them 4 times. These actions involve walking around (walking, jogging, running, carrying) and standing at a fixed position (boxing, clapping, open/close trunk, throwing). Most of the activities can be classified as actions, whereas only the open/close trunk activity is an interaction due to the relation between the person and the vehicle. Fig. 3.2 shows sample video frames recorded at the same time from three different perspectives. UCF-ARG covers only the visible light spectrum. Therefore it cannot be used to train activity recognition systems for infrared (IR) spectra. Due to the enormous height of the plane and the ability to see through thick layers of clouds, IR sensors are typically used in military applications.

Figure 3.2: Different perspectives in the UCF-ARG dataset, Copyright by UCF CRCV

3.2.2 VIRAT

In this section we take a closer look at the VIRAT dataset [Oh et al., 2011]. VIRAT refers to the “Video and Image Retrieval and Analysis Tool” and is funded by the Information Processing Technology Office (IPTO) of the Defense Advanced Research Projects Agency (DARPA). According to the publisher, the VIRAT dataset includes about 8.5 hours of high-definition videos from ground and aerial recordings. The videos cover different frame rates from 2 to 30 Hz as well as different zoom levels corresponding to effective sizes of 10-200 px per person. The current version is VIRAT 2.0. Videos are annotated with the ground truth. Unfortunately this does not seem to be true for the aerial recordings, which have neither ground truth nor documentation. The resolution is


typically 720x480 with very low quality and heavy interlacing. Actions that we recognize in the videos include walking, running, digging or cleaning, and getting in and out of a car. Fig. 3.3 shows a high quality scene where multiple persons are recognizable. In contrast to this, Fig. 3.4 shows a scene where humans are barely recognizable at all. Watching the video repeatedly, we discovered 11 persons as shown by the red boxes. It becomes clear that this dataset is not usable for activity recognition.

Figure 3.3: A frame from the VIRAT dataset taken from medium altitude. Copyright by VIRAT

Figure 3.4: A frame from the VIRAT dataset taken from high altitude. Persons are marked with red boxes and hard to recognize. Copyright by VIRAT


3.3 Simulated Datasets

An alternative to using real-world datasets is to simulate and render a virtual environment using computer graphics and then record a video of the scene. However, very few appropriate simulators exist and we are not aware of any papers that try to use simulated data for activity recognition. In the following section we present a simulator that can be used for such a task. Compared to real-world field studies, simulated data has the advantage of providing a completely controlled environment. This is a major aspect that should not be underestimated. It allows a systematic study of the influence of factors such as viewing angle, resolution, image scale, temporal scale, texture and type of movement. There are also benefits regarding the behavior of the actor in the video. A human being that is not appropriately incentivized might act irrationally or otherwise uncommonly according to experimental economics. If we replace the human being by an appropriately modeled machine, we can ensure rationality.

3.3.1 Virtual Battlespace 2

Virtual Battlespace 2 (VBS-2) is a military simulator. It allows us to model arbitrary scenarios including persons, vehicles and infrastructure. These scenarios can then be recorded as high-resolution video using the internal scripting language. It is possible to automatically extract the relevant metadata of the scenario, which lowers the costs of data acquisition compared to humans annotating each video sequence.

We can insert custom vehicle models, such as the Barracuda UAV, and create video data that is based on the actual physical properties of the UAV (such as its length and wingspan). The video data can be recorded in red, green and blue (RGB) to represent the visible light spectrum. We can also use infrared (IR) imagery, which is much more common in medium to high altitude surveillance aircraft. The simulator allows us to model the activities of the persons using a wide range of activities, such as walking, jogging, running, entering or exiting a car, throwing stones and several military activities, including salutation and drawing a weapon.

However, a big disadvantage of using simulation-based datasets for activity recognition is the low visual variability.

Figure 3.5: Screenshot of a scene created in VBS-2

The most up-to-date simulator dates back to 2007 and, compared to today's advances in computer graphics, the visual quality is not state of the art. Textures are generally very monotone and consist of repeated strips of similar patterns, as can be seen in Fig. 3.5. Preliminary studies have shown that this can be a big issue for several preprocessing steps. Keypoint matching processes, used e.g. for image stabilization, will be problematic if the same remarkable feature points are repeated all over the image. Then the match between a keypoint and its transformed equivalent in the next image will be ambiguous and it will not be possible to compute the affine transformation between both. We should also consider the monotonicity in the actions of the simulated persons. Their


movements are generally modeled by a character designer and only allow for a few hard-coded alternatives. This is not comparable to a human being whose motion is unique in every movement and can be quite different from that of another person. If we base our training on simulated data, we cannot generally assume that the resulting system will be able to generalize a given activity beyond the few hard-coded alternatives that our simulator provides. Hence the detection performance might be much lower.


4. Experiments

Whereas earlier chapters focused on the general problem statement, a presentation of existing approaches in the literature and an overview of the datasets, this chapter will explain our approaches to activity recognition and the most important design decisions. We will give detailed descriptions of the systems that we designed, which methodology we chose and how well our systems perform. For a discussion of the advantages and disadvantages of our approaches please refer to Chapter 2.

4.1 System Description

In this section we describe which dataset we have chosen and how we processed it. We present our implementation of the features, dimensionality reduction techniques and learning machines that were most successful in our application.

4.1.1 Datasets

In Chapter 3 we gave an overview of the most important video datasets for activity recognition. We showed that they can be categorized by the camera angle on the scene. Due to our focus on Medium Altitude Long Endurance (MALE) UAVs, we run our experiments on a dataset with a vertical camera setting. Since a multitude of activities and high quality recordings are most important, we focus our experiments on the UCF-ARG (A) dataset, which is the aerial subset of the UCF-ARG dataset.

4.1.2 Preprocessing

We apply a series of preprocessing steps to the data before we use it to perform activity recognition. The processing is done on both training and testing data. As described in Section 2.4, these steps include image stabilization, data selection and blob extraction. Fig. 4.1 shows an example frame taken from the original data corpus. Fig. 4.2 shows the same frame stabilized according to the motion in the video. Finally, Fig. 4.3 shows the resulting blob that we extracted from the stabilized frame sequence. Note that this work does not seek to evaluate the quality and performance of our preprocessing approaches. These topics have been dealt with extensively in the respective literature. The capabilities are already integrated into commercial products, such as the Image Chain project at Airbus Defence and Space. Therefore we use common stabilization approaches and a semi-automatic blob extraction to focus on activity recognition. However, since the nature of our preprocessing steps is important for the discussion of the system performance in Section 4.3, we give an overview of the steps in the following sections.


Figure 4.1: The original frame of a digging action.

Figure 4.2: The stabilized frame of the digging action. Note that areas that are not visible to the camera are filled with black pixels.

Figure 4.3: The blob extracted from the stabilized image.


4.1.2.1 Stabilization

Our image stabilization procedure is based on finding keypoints in subsequent images and matching each keypoint to its transformed correspondence. The overall process is based on the tutorial found at1 and extended to fit our requirements. Using at least 3 points in 2D, we can create an affine transformation and warp the current image to the perspective of the previous one. The steps are as follows (a code sketch of the pipeline is given after the list):

1. Keypoint detection: We use Features from Accelerated Segment Test (FAST) [Rosten and Drummond, 2006] to find keypoints in both images. We have chosen these keypoints since they use a machine learning approach to be extracted 5 times faster than several other approaches (see [Rosten and Drummond, 2006]). Our comparisons to the more time-consuming Difference of Gaussian (DoG) keypoints, as used in the Scale Invariant Feature Transform (SIFT) [Lowe, 1999], show no visual improvement, but a runtime that was several times higher for SIFT. The number of detected keypoints depends on the image size and its content. We use a threshold for the minimum intensity difference between a corner and its surrounding region. This decreases the number of detected corner points, but increases their quality and leads to a lower error rate of the keypoint matching.

2. Extract features: The next step is to extract the features of the corner's neighborhood, which can then be used to compare different keypoints. We use the Fast RetinA Keypoint (FREAK) features. [Alahi et al., 2012] show that these keypoints result in a higher proportion of correct keypoint matches compared to BRISK, SURF and SIFT features. They also perform better under rotation, minor scaling, change of viewpoint, negative brightness shifts and minor Gaussian blur. This is despite the observation that FREAK keypoints are extracted about 140 times faster and matched about 40 times faster than SIFT keypoints.

3. Match features: To find point correspondences, we match the features that describe the local neighborhoods. We use the Sum of Squared Distances metric and choose only those point correspondences whose metric is above a given threshold. [Lowe, 2004]

4. Estimate affine transformation: The next step is to compute the affine transformation from one image to another. Since our point correspondences might include false matches, we use a RANdom Sample Consensus (RANSAC) algorithm. This class of algorithms is used to eliminate outliers in the data (such as false point correspondences) while fitting a model to an overdetermined system. More specifically, we use the Maximum Likelihood Estimation SAmple Consensus (MLESAC) [Torr and Zisserman, 2000], which iteratively maximizes the likelihood of the model instead of the number of inliers. It has been shown by Torr and Zisserman that this algorithm is more robust than other RANSAC approaches.

5. Warp image: In the final step we simply transform each video frame t to the previous frame t − 1. We force the output images to be of equal size and set each pixel that is not overlaid by the transformed image to black. This way we can be sure that, despite some problems due to badly conditioned systems of equations in step 4), each stationary point in the scene is mapped to the same pixel throughout the video, as long as it is still within viewing range of the camera. This strong stabilization has the advantage of removing the eigenmotion of the camera. It is however not feasible if the camera is translated or rotated to a certain degree and the scene moves out of the viewing range of the first video frame (which all other frames are warped to).

1 http://www.mathworks.de/de/help/vision/examples/video-stabilization-using-point-feature-matching.html
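The following sketch strings the five steps together for one frame pair using OpenCV (FAST and FREAK require the opencv-contrib package). It is an illustration under assumptions: Hamming matching is used because FREAK is a binary descriptor, OpenCV's RANSAC-based affine estimation stands in for MLESAC, and the detector threshold is illustrative.

```python
import cv2
import numpy as np

def stabilize_pair(prev_gray, curr_gray):
    """Warp curr_gray to the perspective of prev_gray (steps 1-5 for one pair)."""
    detector = cv2.FastFeatureDetector_create(threshold=25)      # 1) FAST keypoints
    freak = cv2.xfeatures2d.FREAK_create()                       # 2) FREAK descriptors
    kp1, des1 = freak.compute(prev_gray, detector.detect(prev_gray))
    kp2, des2 = freak.compute(curr_gray, detector.detect(curr_gray))

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)   # 3) match descriptors
    matches = matcher.match(des2, des1)

    pts_curr = np.float32([kp2[m.queryIdx].pt for m in matches])
    pts_prev = np.float32([kp1[m.trainIdx].pt for m in matches])
    M, _ = cv2.estimateAffine2D(pts_curr, pts_prev, method=cv2.RANSAC)  # 4) robust fit
    h, w = prev_gray.shape
    return cv2.warpAffine(curr_gray, M, (w, h))                  # 5) warp, pad with black
```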


Name            | boxing | carrying | clapping | digging | jogging | open/close trunk | running | throwing | walking | waving
Activity ID     | 1      | 2        | 3        | 4       | 5       | 6                | 7       | 8        | 9       | 10
Recordings      | 40     | 45       | 46       | 48      | 45      | 34               | 45      | 48       | 43      | 46
Frames (in K)   | 9.1    | 10.4     | 12.2     | 15.7    | 6.2     | 28.0             | 4.0     | 17.3     | 11.0    | 13.6
Duration (in s) | 314    | 359      | 421      | 541     | 214     | 966              | 138     | 597      | 379     | 469

Table 4.1: Video statistics of the UCF-ARG-AR dataset


4.1.2.2 Data Selection

After stabilizing the videos, we need to manually check whether the quality of the videos and their stabilized counterparts is acceptable. Remember that these preprocessing steps are done for both training and testing datasets, to train and evaluate the performance only on actual activities and not just random background scenes. In our case we rejected several original videos where the duration of the activity is less than 1 second or where on average less than 50% of the pixels corresponding to the human body are inside the viewing range.

Due to the strong stabilization that was performed in the previous section, we also rejected some videos that could have been used with a weak stabilization scheme. However, since both the camera and the actor moved away from the position of the first frame, these videos had to be rejected. This is a design choice, which depends mostly on the characteristics of the chosen UCF-ARG dataset, where the camera is supposed to be fixed to the balloon. Therefore we chose to reject 8 videos out of the total of 470 to improve quality at the cost of the quantity of our dataset. Altogether a total of 440 video recordings remained, which include 10 different activities with 127,504 frames.

Table 4.1 shows the number of recordings, frames and the accumulated video duration per activity. We can see that activity class 6 (“open/close trunk”) has more than twice the amount of frames (hence the longer duration) compared to other classes, even though it has the lowest number of recordings. It is also easy to see that classes 5 and 7 are under-resourced in terms of frame count and will therefore probably be harder to detect during evaluation.

Fig. 4.4 shows a histogram of the number of frames per recording. We can see that only a few recordings have more than 500 frames, whereas the median is at 259 frames and the mean at 290. The shortest recording is only 34 frames long, which is equivalent to 1.2 seconds of video data.

During the course of this thesis, we decided that we also wanted to run experiments on temporally segmented data. This means that we manually cut each recording into its repeated parts. We call these parts instances. This is very important since the temporally aligned instances can be used to find out when an action starts or ends. Note that the start and end point are not clearly defined. This temporal segmentation problem is very similar to the well-known spatial segmentation problem that occurs, for instance, in object recognition.

We chose the start and end points such that the most important motion (such as a leg moving forward) is not split into two parts. This is the case for remarkable stationary positions such as a foot standing on the ground. Some gestures such as digging are executed


Figure 4.4: This histogram shows for each duration of a recording (in frames) how many recordings exist that have this duration. We can see that most recordings have a duration of approximately 200 frames, but a few recordings are several times as long.

in very different ways and make the alignment very complicated. We differentiate between an execution of an activity with the right or the left arm or leg. Therefore each instance is assigned to one of 16 classes, compared to 10 classes at the level of whole recordings.

4.1.2.3 Blob Extraction

In this step we describe the semi-automatic system that we use to extract blobs (that is, image regions with specific content) that contain the human beings in the video.

1. Specify bounding box: The first step is to show the user the first video frame and let him draw a rectangle on the image where the human being can be found.

2. Track bounding box: In the subsequent frames we try to automatically track the movements of the person and its bounding box in the video. To do this we compute the normalized cross-correlation of the first blob to the current frame. The position (x, y) where the cross-correlation has its global maximum becomes the new position of our bounding box. This approach has the advantage that fluctuations in brightness do not affect our tracking. It is also important that we compare the current frame not with the previous one, but with the first frame selected by the user. Otherwise the tracking gradually deteriorates and moves away from the actual person, because more and more background pixels are included in the cross-correlation based matching. We use three termination criteria for our tracking (see the sketch after this list). 1) If the maximum normalized cross-correlation is below a given threshold t (t = 0.4), we abort the tracking. 2) If the position of the next blob is more than d pixels away from the last position (d = 5), we reject it as a coincidental match to the background. 3) The user can push a button at any time to abort the tracking if he is not pleased with the


current blob position. If any of these criteria holds, we go back to step 1, starting from the previous frame. Additionally, the current values of t and d are displayed and the user can dynamically increase or decrease them to achieve a higher or lower termination rate.

3. Remove invalid blobs: After processing all frames in step 2, we manually check the quality of the resulting blobs. We delete all unusable blobs, such as those which include only parts of the person due to rapid movements of the camera or the person.

4. Correct positions: To fill the gaps for the previously removed frames, we linearly interpolate the blob positions from the correct blobs. When the user repositioned the bounding box, this led to discontinuities in the blob positions. We remove these outliers using a median filter. Note that it is important that we first interpolate and then filter; otherwise we would cut away keyframes which we already confirmed to be correct in step 3. Finally we use a polynomial of degree 3 to approximate the T → X and T → Y functions separately to achieve a smooth interpolation of the bounding boxes. This step is very important as the normalized cross-correlation based tracking leads to a high-frequency jitter between neighboring pixels in the blob positions, which needs to be removed to allow for reliable feature computation (such as for optical flow) later on. This is appropriate for most of the recordings because the people are either stationary or walk along a low-degree curve. Only a few recordings had to be corrected with degree 5, since their walking pattern in one dimension consisted of sharp curves. For a few recordings, a polynomial was not able to describe the blob positions properly. This was due to errors in the stabilization, such as when there was movement directly in front of the camera (due to the balloon rotating around its vertical axis, which was fixed to the ground by a rope). In these cases we used a sliding window to smooth the blob positions.
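The core of the tracking in step 2 can be sketched as follows, under simplifying assumptions: 'template' is the blob cut out of the first frame by the user, 'frame' is the current grayscale frame, and (x0, y0) is the last accepted top-left blob position; all variable names are only illustrative.

    t = 0.4;  d = 5;                        % termination thresholds as in the text
    c = normxcorr2(template, frame);        % normalized cross-correlation map
    [cmax, imax] = max(c(:));
    [ypeak, xpeak] = ind2sub(size(c), imax);
    x = xpeak - size(template, 2) + 1;      % convert correlation peak to top-left corner
    y = ypeak - size(template, 1) + 1;
    if cmax < t || hypot(x - x0, y - y0) > d
        trackingOk = false;                 % abort: weak match or implausible jump
    else
        trackingOk = true;  x0 = x;  y0 = y;
    end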

4.1.3 Features

Since each dataset has its own characteristics and there is generally no single feature that fulfills all our expectations, it is common to combine a number of features, apply feature selection and feature extraction, and learn the combined features. This is the same for object recognition, where a single feature (such as image gradients) is often not as powerful for discrimination as a whole set of features (such as gradients, color, shape and symmetry). We cover gradient-based features, optical flow-based features, trajectories and features based on similarities between different frames in the video.

4.1.3.1 Histogram of Oriented Gradients

Our code uses the implementation of HOG 2D provided in “Piotr’s Image & Video Toolbox” [Dollar, 2013]. It is based on Felzenszwalb’s HOG features presented in [Felzenszwalb et al., 2010] and the authors claim a 4-fold speedup over traditional HOG. The implementation makes use of a simplified HOG alternative that uses Integral Channel Features [Dollar et al., 2009].

4.1.3.2 Scale Invariant Feature Transform 3D

We use the SIFT3D reference implementation for Matlab from Scovanner et al.2 to compute keypoints. Keypoint positions are randomly placed throughout the space-time volume. Typical amounts of keypoints per video are 200 to 500. However, the reference implementation is very slow and only processes about 1 keypoint

2 See http://crcv.ucf.edu/source/3D


per second. To allow for faster processing, we vectorize the code to avoid loops and favor parallel processing. This leads to a speedup of 50 to 100. Further improvements might be possible by porting the computation to the GPU. As recommended by [Scovanner et al., 2007] we use unsupervised k-means clustering on the SIFT3D descriptors as a dimensionality reduction technique on a frame-level (from e.g. 645 to 1 dimension). The word histograms have as many bins as we used clusters (typically 4 to 12). These histograms form the final features that are learned. Contrary to Scovanner et al., we do not compute the grouping histograms, since our distribution vectors do not seem to be strongly correlated.
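A minimal sketch of this bag-of-words step, assuming 'descr' holds the SIFT3D descriptors of one video as rows and using the kmeans function of the Statistics Toolbox (names and the cluster count are illustrative):

    k = 8;                                        % number of visual words (4 to 12)
    wordIdx = kmeans(descr, k, 'EmptyAction', 'singleton');
    bow = histc(wordIdx, 1:k);                    % one bin per visual word
    bow = bow / sum(bow);                         % normalized word histogram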

4.1.3.3 Trajectories

The blob trajectories are extracted from the log-files of our stabilization. As a distance metric between consecutive frames we use the mean distance along the X and Y dimension separately, the magnitude of the average distance vector and the mean and standard deviation of the distances. To account for different speeds, we additionally use the distance relative to the time between two frames, as well as the mean and standard deviation of the speed. This results in 6 features per frame.

4.1.3.4 Optical Flow

As described in Chapter 2, several algorithms to compute optical flow are known, such as Lucas-Kanade (LK) or Horn-Schunck (HS). Since all of these operations are computationally expensive, we choose to use an existing implementation that is optimized for speed. Several C++ and Matlab implementations have been evaluated and a HS solver performed best. Our experiments with LK and HS also showed that LK often led to undesired results. This might be attributable to bad implementations or the small image sizes (70×70 pixels) that invalidate the assumption that the optical flow is uniform in local neighborhoods. Therefore we chose to use HS with a Conjugate Gradients Squared (CGS) solver as implemented in the tutorial referenced below3. The mean optical flow is subtracted from the motion field to compensate for shifts of all pixels due to camera motion.

4.1.3.5 Spatial Histogram of Oriented Optical Flow

We try to combine the ideas of HOG and HOOF. Essentially we implement HOOF as suggested by [Chaudhry et al., 2009], but instead of using a single histogram for the whole image, we use one histogram per cell for an equally-spaced grid of m × n cells. We call this feature Spatial Histogram of Oriented Optical Flow (SHOOF). Note that this is a generalization of HOOF features. Several parameters have to be tuned to improve the quality, as well as the convergence rate of the Conjugate Gradient algorithm. We chose the best parameters according to visual appeal and runtime and evaluate the final performance in activity recognition. We also analyze whether forced symmetries have an influence on the result.
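As an illustration, a minimal sketch of the per-cell binning for one frame, assuming 'u' and 'v' hold the horizontal and vertical flow components and using the grid and bin counts reported later in Section 4.3.1.3 (the names are ours, not those of our implementation):

    m = 12;  n = 12;  nbins = 8;                      % grid size and orientation bins
    [H, W] = size(u);
    ang = atan2(v, u);                                % flow direction in (-pi, pi]
    mag = hypot(u, v);                                % flow magnitude
    bin = min(floor((ang + pi) / (2*pi) * nbins) + 1, nbins);
    rowCell = min(ceil((1:H)' / (H/m)), m);           % cell row index per pixel
    colCell = min(ceil((1:W)  / (W/n)), n);           % cell column index per pixel
    [C, R] = meshgrid(colCell, rowCell);
    idx = sub2ind([m n nbins], R(:), C(:), bin(:));   % linear bin index per pixel
    shoof = accumarray(idx, mag(:), [m*n*nbins 1]);   % magnitude-weighted histograms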

4.1.3.6 Self-Similarity

We follow the approach described in [Cutler and Davis, 2000] to compute a SSM for each video. In the algorithm, various distance functions can be used to compute the dissimilarity between frames. A fast and simple choice would be to compute the Sum of Absolute Differences or Sum of Squared Differences. However, since these measures are sensitive to changes in lighting or contrast, they are not preferable. We therefore use the correlation of two images, which can be computed very quickly using the Fast Fourier Transform (FFT).

3 https://www.ceremade.dauphine.fr/~peyre/numerical-tour/tours/multidim_5_opticalflow


The correlation is invariant to absolute changes in brightness, since it uses zero-mean images. However, it does not take into account relative changes in brightness, which is why we achieve better results using normalized cross-correlation. It is also important that we do not directly compare two frames pixel by pixel. If two identical images are shifted by only a few pixels, this can lead to considerable distances between them if they include non-smooth edges. Therefore we use the maximum of the normalized cross-correlation over all positions of frame i in frame j. This allows us to eliminate minor shifts in the image which can result from bad image stabilization or quick movements of the persons.

We simplify the frequency analysis as follows: Once we have computed the SSM, we want to extract lines along the minor diagonals. To do that we compute the mean of all values along minor diagonals d = 1 to n (for n frames). The resulting 1-dimensional function v(d) needs to be analyzed to extract peaks. Since subsequent frames are usually very similar, the function is also very smooth and needs only minor smoothing with a 1-dimensional Gaussian kernel (σ = 2, width = 5). We remove the peak at d = 1 and its monotonically declining neighborhood since it does not contain any new information. We now iteratively search for peaks as follows:

1. Find the global maximum and mark it as peak.

2. Remove the values in the monotonically declining neighborhood (left and right) until we reach a minimum.

3. Repeat until enough peaks are found.

We use p peaks, although most of the peaks are integer multiples of others. The peak intensities and positions are used as an input for the learning machines described below.
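A compact sketch of this analysis, assuming 'S' is the n × n similarity matrix and p = 2 peaks are kept (the removal of the trivial peak at d = 1 is omitted; variable names are illustrative):

    n = size(S, 1);
    v = zeros(n-1, 1);
    for d = 1:n-1
        v(d) = mean(diag(S, d));            % mean similarity along minor diagonal d
    end
    g = exp(-(-2:2)'.^2 / (2*2^2));  g = g / sum(g);   % Gaussian kernel, sigma = 2, width 5
    v = conv(v, g, 'same');                 % light smoothing
    p = 2;  peaks = zeros(p, 2);  w = v;
    for k = 1:p
        [val, pos] = max(w);                % 1. global maximum becomes the next peak
        peaks(k, :) = [pos, val];
        l = pos; while l > 1        && w(l-1) <= w(l), l = l - 1; end
        r = pos; while r < numel(w) && w(r+1) <= w(r), r = r + 1; end
        w(l:r) = -inf;                      % 2. remove its declining neighborhood
    end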

4.1.4 Dimensionality Reduction

As described above, dimensionality reduction is essential to reduce computational overhead and resource consumption. This is especially true if we combine several kinds of features by concatenation. We focus on feature extraction and do not explicitly apply feature selection, even though Random Forests internally apply feature selection using the impurity metrics discussed in Section 2.7.3.

4.1.4.1 Principal Component Analysis

Numerous implementations of the PCA exist. We chose the implementation from “Piotr’s Image & Video Toolbox”, which uses random feature sampling and the transpose trick. The solution is found via Singular Value Decomposition. The authors claim that their solution is particularly robust, since after each iteration they increment a random dimension by ε, presumably to avoid bad conditioning. Finally, principal components of very low variance are neglected.

The order of concatenation and dimension reduction often makes a difference. We discovered that especially the low-dimensional and human-readable features (similarity peaks and trajectories) should be concatenated after PCA. The reason for this is not entirely clear. It could be due to random feature sampling, which might sometimes remove important features. If this were the case, we should see a higher fluctuation in performance, which we do not. Another possible explanation could be that the PCA is an unsupervised technique which only regards the variance of a feature dimension. Features such as the similarity peaks might be easily separable, but not have a high variance.
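A minimal sketch of this order of operations, assuming 'X' holds the high-dimensional features (one row per sample) and 'lowDim' the low-dimensional features that are appended after the projection; this plain SVD version omits the random sampling and the ε trick of the toolbox implementation:

    mu = mean(X, 1);
    Xc = bsxfun(@minus, X, mu);             % center the data
    [~, ~, V] = svd(Xc, 'econ');            % principal directions in the columns of V
    k = 16;                                 % number of principal components to keep
    scores = Xc * V(:, 1:k);                % projected high-dimensional features
    feats = [scores, lowDim];               % concatenate low-dimensional features after PCA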

4.1.5 Learning Machines

To learn our features we make use of the following supervised learning techniques.


4.1.5.1 Support Vector Machines

We use LibSVM [Chang and Lin, 2011] as our implementation of Support Vector Machines. It is an open-source framework implemented in ISO C89 that supplies implementations or wrappers for various programming languages such as Java, Matlab, .Net and Python. We use the Matlab wrapper. It supports multi-class SVMs using one-vs-all comparisons. This means that it creates n SVMs internally for n classes and then uses the result with the highest score. It also means that we always have an output, even if the correct answer would be “no activity”. We will discuss this problem further in Section 4.2.

LibSVM supports ν-SVMs as well as C-SVMs, which are effectively different parametrizations of the same algorithm (see [Chen et al., 2005] for more information). It can do classification (SVC) as well as regression (SVR). To compute the separating hyperplane, the Sequential Minimal Optimization (SMO) algorithm is used. LibSVM supports a wide range of kernels including linear, polynomial and radial basis function kernels.

To tune our system, we apply a grid search over the most important parameters such as the kernel type and the soft margin parameter C and use the parameters that yield the highest performance during cross-validation. This is a very time consuming process which cannot always be repeated. We find that the recommended C = 1/m, where m is the number of features, provides reasonable performance for linear kernels.
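A sketch of such a grid search with LibSVM's Matlab interface (svmtrain/svmpredict); the variables 'labels', 'feats', 'testLabels' and 'testFeats' are placeholders, and the search range is only an example:

    bestC = 1;  bestAcc = 0;
    for logC = -5:2:5
        C = 2^logC;
        % '-v 5' makes svmtrain return the 5-fold cross-validation accuracy
        acc = svmtrain(labels, feats, sprintf('-s 0 -t 0 -c %g -v 5 -q', C));
        if acc > bestAcc, bestAcc = acc; bestC = C; end
    end
    model = svmtrain(labels, feats, sprintf('-s 0 -t 0 -c %g -q', bestC));
    [pred, acc, ~] = svmpredict(testLabels, testFeats, model);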

4.1.5.2 k-Nearest Neighbor

For k-Nearest Neighbor classification we use a very simple Matlab implementation provided by Yi Cao from Cranfield University in 20084. It applies efficient linear kNN search using the just-in-time compiler. The distance metric used is the Euclidean distance. The only parameter that we specify is the number of neighbors k. We vary it from 1 to 50 and choose the k with the highest performance using cross-validation (typically k = 45).

4.1.5.3 Random Forests

We use the Random Forests implementation from “Piotr’s Image & Video Toolbox”. It supports various parameters such as the number of trees, the number of classes, the number of data points used for each tree, the splitting criterion (Gini, entropy or twoing), the minimum number of data points that are still allowed to be split or used at a child node, the maximum tree depth, the sampling weights and a manually specifiable discretization function for the labels.

We found that the number of trees and the number of features per tree have the highest effect. However, parameter tuning is not as important as with SVMs and we usually get a good baseline result with just one tree and a fixed number of features per tree. The results rarely improve with more than 100 trees, since the majority of the trees converges on one result. For huge feature dimensions the number of features per tree becomes important, since otherwise the performance of a single tree becomes too close to chance level and combining the trees cannot improve the results any further.

4.1.5.4 Hough Forests

We implement Hough Forests by extending Random Forests. The code design is inspired by the Random Forest implementation in “Piotr’s Image & Video Toolbox”. The interfaces and user settings are specified in Matlab, whereas the most important routine for finding the optimal split is implemented in C++ to allow for better performance. Hough Forests require more storage space than Random Forests because for each leaf node we have to

4 http://www.mathworks.com/matlabcentral/fileexchange/19345-efficient-k-nearest-neighbor-search-using-jit/content/knnsearch.m


Figure 4.5: Recognition of activity blocks: HOGs are extracted from images, concatenated and learned.

Figure 4.6: Recognition of activity instances: HOGs are extracted from images, learned and the frame hypotheses are aggregated into a single activity output.

store a complete list of all distances of each sample. This lazy learning approach (for the distances) is a major difference compared to Random Forests, which are eager learning machines that make the decision during training. It also dramatically slows down the evaluation, because for each distance t Gaussian Parzen windows have to be computed, where t is the number of trees.

We evaluate Hough Forests on our standard features (i.e. HOG), as well as on image patches as explained in Section 2.7.3.4. We measure the quality of the label and the distance estimates using the gain metric presented in Section 4.2.2.2.

4.1.6 Hypothesis Aggregation

In our work, we follow two different approaches. We learn temporally segmented as well as unsegmented data. This means that in some experiments we use video blocks of a fixed duration, whereas in others we manually segmented the data into single instances (such as a single punch in the boxing activity).

Fig. 4.5 shows a diagram of the system that recognizes blocks of fixed length. The features (HOG) are concatenated and learned by a single learner that outputs the video hypothesis. In contrast, Fig. 4.6 shows the system that learns single frame features, outputs a frame hypothesis and aggregates the hypotheses into a single video-level hypothesis. In the instance case, each video has a different duration. As aggregation methods we used one of the following (a sketch of the first three rules follows the list):

• Mode: The mode is simply the class that occurs most often during classification. If multiple classes appear equally often, the lowest index is chosen.

• Relative frequency: The former approach has the disadvantage of weighting the different classes disproportionately. E.g. if we have 3 observations of class 1 and 2 observations of class 2, the mode is class 1, even though class 1 might be 10 times more frequent in the training data than class 2, so that, relative to its prior frequency, class 2 is actually the stronger match. We can alleviate this weakness by computing the relative histogram of class frequencies in our observations and on the whole training set. We then element-wise divide the former by the latter to compute the relative frequency of the classes compared to the overall distribution. The resulting class is then the one with the maximum value in the relative frequency histogram.


• Mean probability: Some learning machines such as Random Forests can output a measure of confidence of the result, such as class posterior probabilities. In the case of a Random Forest or kNN, this is simply the ratio of tree or neighbor votes for a particular class. For an SVM this is not as easy, since the decision values (relative distance from the hyperplane) are not normed and therefore cannot be compared between different SVMs. Other techniques which estimate SVM posterior probabilities are beyond the scope of this work. Once we have the probabilities, we can compute the class with the highest mean probability over all frames and use it as the result of our video. The quality of this aggregation method depends very much on the confidence score of the learning machine.

• Hidden Markov Model: A more profound solution is to use a Hidden Markov Model (HMM). An HMM is a stochastic Markov model that describes time-discrete random processes. It can help us to aggregate single observations with respect to their order. Each activity has its own HMM with an architecture that we specify in advance (such as an ergodic structure or a forward-directed structure). The HMM is formally defined as λ = (S, V, A, B, π) with states S, observations V, state transition probabilities A, output probabilities B and the initial state distribution π. We use the class outputs or the class posterior probabilities from our learning machine as observations for the HMM. The forward algorithm is used for evaluation. Each HMM outputs the probability that it produced the observation sequence, and we can take the most likely HMM and the corresponding activity as the result (see [Caesar, 2012] for more information). Fig. 4.7 shows an example of a forward-directed HMM for a single action class. S_i and Final are the states, a_ij is the transition probability from state i to j and b_io represents the output probability of output o in state i. Some transition probabilities are emphasized in the graphic to indicate that ideally these probabilities should equal 1, whereas the other transition probabilities should equal 0. Compared to the other approaches, the HMM has several advantages. The most important advantage is that it can help to overcome the time-warping problem by dynamically estimating the sequence of subactions. If we take for example the “digging” activity, then state S_1 could be the initial standing position, state S_2 could correspond to lowering the shovel, state S_3 would correspond to throwing the mud aside and the final state would represent the person moving back to its original position. This approach is however much more complicated and the models have to be trained and evaluated every time using the same training data and test data as for our learning machines. Essentially we are applying two learning machines after one another. In the first learning machine, we need to evaluate our classifier on training and test data to produce the required input for the second learning machine. It is important that the first learning machine does not overfit the training data too much, because then the output is not comparable between testing and training and the second classifier will perform very poorly. The activities must be composed of atomic sub-activities or discrete states which the HMM can switch between.
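A short sketch of the first three aggregation rules, assuming 'frameLabels' holds the per-frame class outputs of one video, 'frameProbs' the per-frame posterior estimates (frames × classes) and 'trainFreq' the relative class frequencies of the training set (all names are illustrative):

    nClasses = numel(trainFreq);
    h = histc(frameLabels(:), 1:nClasses);           % class histogram of the frame hypotheses
    [~, modeClass]     = max(h);                     % mode: most frequent class
    [~, relFreqClass]  = max((h / sum(h)) ./ trainFreq(:));   % normalized by the training prior
    [~, meanProbClass] = max(mean(frameProbs, 1));   % class with the highest mean posterior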

4.1.7 Activity Subclasses

We want to recognize 10 different activities, as can be seen in the right column of Fig. 4.8. Most of these activities are executed by either the left or the right body side (e.g. boxing with the right or the left arm). This is why we split these categories into two sub-categories. For the open/close trunk activity we now differentiate between opening and closing the trunk. Some activities are symmetric (e.g. clapping and waving) and do not have two different


Figure 4.7: Hidden Markov Model for a single action class

sub-categories. We do this because we suspect that the learning machine can train a more discriminative model if it does not “average” over the different sub-categories. For evaluation we join the sub-categories again since we are only interested in the performance on an activity level.

We go one step further and implement sub-sub-categories by including the orientation of the person towards the camera (frontal, left, rear, right). The decision for 4 different orientations was somewhat arbitrary, but the reasoning is the same as for the body sides. Considering different orientations also has the advantage of allowing us to analyze rotational invariance. The different body sides can be interpreted as a form of symmetrical invariance.

Figure 4.8: This diagram shows how each activity (right) is split into one or two body sides (middle), each of which can be executed in one of four orientations (left).


4.1.8 Live System

The majority of our experiments are conducted offline, which means that they are not run in real-time and therefore no time constraints are put on them. To reveal the real-time potential of our approaches and to get an idea of which activities are recognized incorrectly, we implement a live system in Matlab. For that we train the learning machines and PCA offline and store the results (learning model and coefficients of the PCA) on the hard disk. We then load each single frame i out of n total frames, extract the required features and store them in a cyclic buffer. If there are enough frames in our buffer, we apply PCA to the buffer, classify the PCA result and output the recognized class. This is displayed in pseudocode in Listing 4.1.

Listing 4.1: Pseudocode of the live system

init(buffer);
for i = 1:n,
    buffer = shift_down(buffer);          % make room for the newest frame
    frame = load_frame(i);
    buffer(1) = extract_features(frame);
    if full(buffer),                      % enough frames collected?
        data = pca(buffer);               % project onto the stored PCA coefficients
        class = classify(data);           % apply the trained learning machine
        print(class);
    end;
end;

To further improve speed we can skip every kth frame in recognition. In addition to that, we can also skip the classification every lth time. Note the difference between the two: whereas k decreases the recognition performance, l only determines how often we run our classifier. We typically use k = 4 and l = 1. While running the system we measure the current recognition performance and the number of Frames Per Second (FPS) relative to the original video. FPS are computed from the second frame on using

\mathrm{FPS}_i = \mathrm{FPS}_{i-1} \cdot \alpha + \frac{l}{t_i - t_{i-1}} \cdot (1 - \alpha), \qquad \mathrm{FPS}_2 = \frac{l}{t_2 - t_1}, \qquad (4.1)

where t_i denotes the time at frame i ≥ 2 and α ∈ [0, 1] is a constant which determines how fast the FPS estimate is updated. We also implemented a live system for the instance-level recognition. It follows the same procedure, with the difference being that the buffer has a size of only 1 frame.
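Eq. (4.1) is a simple exponential moving average of the instantaneous frame rate; a minimal sketch, assuming 't' holds the wall-clock time of each processed frame and using example values for α and l:

    alpha = 0.9;  l = 1;                    % smoothing constant and classification stride
    fps = l / (t(2) - t(1));                % initialization (FPS_2)
    for i = 3:numel(t)
        fps = fps * alpha + l / (t(i) - t(i-1)) * (1 - alpha);
    end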


4.2 Methodology

In this section we take a closer look at the methodology of our experiments. We present the techniques that we use for cross-validation and a number of metrics to assess the performance of our systems.

4.2.1 Cross-Validation

Each of the experiments that we run results in a number of performance metrics. The supervised learning machines are trained using a training dataset. The parameters are optimized using a verification dataset (if necessary) and the final performance is estimated on the testing or evaluation dataset. Depending on how we choose these datasets, we might achieve different results. This is why we do cross-validation: we average over a series of measurements to retrieve a mean performance that is not subject to random fluctuations.

It is important that each activity is proportionally represented in each dataset. Otherwise there might not be enough data to train a certain activity and we therefore cannot recognize it during evaluation. We have to take care that different frames from one video are all included in the same dataset. Otherwise the indicated performance would be much higher, since for each frame in the test set there might be a frame in the training set that was captured only milliseconds after the other and that is therefore almost identical. It is also important to make sure that videos of one person are only present in either the training or the testing set. This is because each person always wears the same clothes and is often standing at the same position in the scene. Our experiments have shown that mixing persons across the sets makes the task much easier and yields a much higher performance, but the results do not generalize to previously unseen persons.

We use two popular techniques for cross-validation:

• Random resampling: Random resampling is the simpler of the two approaches. During each run we randomly split our n data samples into n · β training samples and n · (1 − β) test samples for a specified ratio β. The split is independent of any previous split and therefore we cannot rule out that there are overlaps in the datasets between different runs. After several executions we can compute the mean of the results to have a reliable measure of performance. The standard deviation is also measured to give us an estimate of the confidence. Note that we assume an approximately Gaussian distribution of results here.

• k-fold cross-validation: k-fold cross-validation is very similar to random resampling. However, instead of using random datasets in each run, we split the data into k datasets once. During each of the k runs, we take one previously unused part as our test set and all others for training. The result is the mean of the single measurements. This approach has the advantage that each data point is evaluated only once and therefore all data points account equally for the result. This stands in contrast to random resampling, where one data point might be used multiple times or not at all (subject to randomness). To avoid fluctuations from disproportionate splits, we run the whole procedure several times and compute the mean and standard deviation of each k-fold run.

The choice of our cross-validation method depends on how quickly a single experiment can be run. Random resampling results in a high standard deviation of the result, but k-fold cross-validation takes k times longer.
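A minimal sketch of a person-disjoint k-fold split, assuming 'personId' gives the actor of each recording; all recordings of one actor end up in the same fold, so no person appears in both the training and the test set (names and the fold count are illustrative):

    k = 5;
    persons = unique(personId);
    foldOfPerson = mod(randperm(numel(persons)), k) + 1;   % random fold per actor
    for f = 1:k
        testPersons = persons(foldOfPerson == f);
        testIdx  = ismember(personId, testPersons);
        trainIdx = ~testIdx;
        % ... train on trainIdx, evaluate on testIdx, accumulate the accuracy ...
    end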

4.2.2 Performance Metrics

In this section we want to briefly cover the different performance metrics that we use in this work.


4.2.2.1 Accuracy

Accuracy is the most important performance metric used in our work due to its simplicity. It is computed as

\text{Accuracy} = \frac{TP + TN}{P + N}, \qquad (4.2)

where TP is the number of true positives, TN the number of true negatives, P the sum of true positives and false negatives, and N the sum of false positives and true negatives. Accuracy measures the ratio of correctly classified data points to all data points. As such it is very simple to use and compare, but it has important limitations which we will consider in the next sections.

4.2.2.2 Gain

In Section 4.1.5.4 on Hough Forests we pointed out that we need appropriate metrics to evaluate the classification result as well as the segmentation result. A common way would be to use the accuracy for the labels and compute the mean squared error compared to the ground truth distance. These numbers are however hard to interpret if we do not have an anchor point for minimum and maximum performance. Instead we introduce a “Gain” metric that scales the performance between chance-level and an optimal solution. We define it as

\text{Gain} = \frac{\text{Output} - \text{Chance}}{\text{Optimum} - \text{Chance}}. \qquad (4.3)

This formula expresses the performance for distances and labels in percent. However, the Output, Chance and Optimum values are defined differently in both cases. In the following, t denotes the ground truth and o the system's output; the index l indicates labels and d indicates distances. The optima are defined as

\text{Optimum}_l = 100\% \qquad (4.4)

and

\text{Optimum}_d = 0\ \text{frames}. \qquad (4.5)

Now we need to define the Output as a single scalar. In the label case this is simply the classification accuracy over the n test samples:

\text{Output}_l = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[t_{l_i} = o_{l_i}]. \qquad (4.6)

For the output distances we use the mean absolute deviation from the ground truth distances:

\text{Output}_d = \frac{1}{n} \sum_{i=1}^{n} |o_{d_i} - t_{d_i}|. \qquad (4.7)

The chance level for labels is defined as the relative frequency of the most common label:

\text{Chance}_l = \max_{c} \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[t_{l_i} = c]. \qquad (4.8)

In case of the distances, the chance level is defined as the mean absolute deviation from the average distance \bar{t}_d of all training samples:

\text{Chance}_d = \frac{1}{n} \sum_{i=1}^{n} |\bar{t}_d - t_{d_i}|. \qquad (4.9)

We call Gain = 0 the chance level, since this is the case for the trivial solutions o_{d_i} = \bar{t}_d and o_{l_i} = \arg\max_{c} \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[t_{l_i} = c]. If Gain is positive, then our system performs better than chance; for negative values it performs worse. Gain = 1 indicates maximum performance.
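A small sketch of the Gain computation, assuming 'tl'/'ol' hold the ground-truth and predicted labels, 'td'/'od' the ground-truth and predicted distances of the test samples, and 'tdMeanTrain' the mean distance of the training samples (names are illustrative):

    outputL = mean(ol == tl);                            % Eq. (4.6)
    chanceL = max(histc(tl, unique(tl)) / numel(tl));    % Eq. (4.8)
    gainL   = (outputL - chanceL) / (1 - chanceL);       % Optimum_l = 100%
    outputD = mean(abs(od - td));                        % Eq. (4.7)
    chanceD = mean(abs(tdMeanTrain - td));               % Eq. (4.9)
    gainD   = (outputD - chanceD) / (0 - chanceD);       % Optimum_d = 0 frames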


4.2.2.3 Receiver Operating Characteristic

The accuracy metric only makes sense if our world is limited to a finite number of known classes. Otherwise we do not even know whether an activity was performed at all when we try to assess which activity it was. A possible workaround used in speech and character recognition is to introduce a rejection class. This class includes examples of persons that do not perform any action. However, we do not have such data and therefore each one-vs-all classifier needs a threshold above which it triggers the activity output. This threshold does not have a single correct value. It depends on how many false positives we are willing to accept. For example, if we do not want to unnecessarily distract the controller of a UAV, we can choose a low false positive rate for each activity. Subsequently the true positive rate will also decrease.

The Receiver Operating Characteristic (ROC) curve is a plot of the false positive rate on the x-axis and the true positive rate on the y-axis [Fawcett, 2006]. It shows how many correct results we get for any number of false alarms. The diagonal line in this plot represents a system that is at chance level, which means that the answers are random. Results to the upper left of the diagonal are better, results to the lower right are worse. If a classifier is located below the diagonal, its answers can be inverted and will therefore lie above the diagonal. Note that ROC curves only exist for binary problems (such as one-vs-one). For our multi-class classification problems there is not a single ROC curve, but one for each pair or combination of classes. Therefore we only show the one-vs-all ROC curves to avoid showing too many plots.

To create a ROC curve we need to know a score that the classifier assigns to each classification. This score can in general be a posterior probability (see [Platt, 1999] for SVMs) or the distance to the hyperplane in an SVM. For each learning machine we have to use a tailored approach to extract such a score. The scores also need to be distributed evenly. If all data points output a score of either 0 or 1, then the ROC curve would be inaccurate due to a “jump” from 0 to 1 at a certain false positive rate. This would be the case for a very simple learner such as linear kNN with k = 1.

Another problem is that we cannot average over several ROC curves collected with different learning machine instances if we use scores from an ordinal scale, but not from an interval scale. This is the case for SVM hyperplane distances, which can only be compared to other distances of that SVM, but not to distances of other SVMs or even other learning machines. This is why ROC curves are difficult to use in cross-validated setups such as ours. One possible solution is to plot the whole range of resulting ROC curves. We chose to use an even proportion of data in the training and testing dataset and run the experiment only once. This way the results might be slightly worse, but there is less fluctuation.

Another metric that makes the comparison simpler is the Area Under the Curve (AUC). The AUC is simply the integral of the ROC curve. As such it takes into account all levels of the false positive rate. It can be interpreted as the probability that the classifier assigns a higher score to a positive example than to a negative example. The use of AUC is however disputed throughout the literature, since it summarizes the performance over regions that might be irrelevant and it “does not give information about the spatial distribution of model errors” [Lobo et al., 2008].
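A short sketch of how a one-vs-all ROC curve and its AUC can be computed from per-sample scores, assuming 'scores' holds the classifier scores for the positive class and 'isPos' is a logical vector marking the positive ground-truth samples (names are illustrative):

    [~, order] = sort(scores, 'descend');
    posSorted = isPos(order);
    tpr = cumsum(posSorted)  / sum(isPos);      % true positive rate per threshold
    fpr = cumsum(~posSorted) / sum(~isPos);     % false positive rate per threshold
    auc = trapz([0; fpr(:)], [0; tpr(:)]);      % area under the ROC curve
    plot(fpr, tpr); xlabel('False positive rate'); ylabel('True positive rate');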


4.3 Results

This section shows the results of our experiments. It explains how different features, dimensionality reduction techniques and learning machines affect the result.

4.3.1 Features

Here we look at the different features and analyze their importance. Some features are more informative than others and can be visualized for a deeper understanding of their strengths and weaknesses.

4.3.1.1 Histogram of Oriented Gradients

Fig. 4.9 shows an image from our dataset overlaid by a representation of the HOG features. The wireframe displays the cell grid. To stress the importance of edges in the image, only the strongest gradient direction is displayed by a green line pointing away from a cell center in the direction normal to the gradient direction. Although there are only a few cells, we can see that the green lines follow the outline of the person. Note that we only display one out of 3 · o + 5 channels for o orientations. The coarse-grained structure of the image is due to a nearest neighbor upsampling which we applied to scale the image for presentation. Whereas bilinear or bicubic filtering looks more visually appealing, nearest neighbor upsampling has the advantage of not blurring the pixels too much.

Figure 4.9: Visualization of a HOG image

We achieve the best results for a cell width and height of 10 pixels and a total of 8 orientations. Dalal and Triggs mentioned that the cell size should be approximately equal to the width of a human limb. In our case the cell size is about twice as large as usually recommended. This might be attributable to the fact that our blobs are often not very well centered and therefore we must use a larger region to “find” the relevant body parts.


4.3.1.2 Optical Flow

For optical flow and the features derived from it, we achieve the best results when down-sampling our original frames from 70 × 70 to 35 × 35 pixels. Above this point the computational overhead is too high and we have too many noisy flow vectors. Below this point human limbs are not covered any more and only camera motion is captured. Since the optical flow between two successive frames is very small in magnitude compared to the image noise, we typically compute it with a delay of 5 frames (≈ 0.17 s).

Fig. 4.10 visualizes our result on a video frame. Each arrow represents one of our down-sampled pixels and the direction and magnitude of its optical flow. We can clearly see that the person is moving both arms downward. We infer that the person is currently performing the “waving” activity. Complications arise from the observation that the shadow is moving as well.

Figure 4.10: Example of a video frame overlaid by arrows indicating the magnitude and direction of the optical flow.

4.3.1.3 Spatial Histogram of Oriented Optical Flow

As mentioned above, we use a generalization of HOOF that we call Spatial Histogram of Oriented Optical Flow (SHOOF). If we use just one bin in x and y direction, it becomes equivalent to HOOF. However, we find that the best results are achieved with 12 × 12 cells, which results in a cell size of 6 × 6 pixels. Each histogram has 8 directions. We also analyzed binning according to the primary angle. The idea of [Chaudhry et al., 2009] was to represent movement from left to right as equal to movement from right to left (but not for up and down). However, this sort of invariance to symmetry only makes sense if the axes of the image are parallel to the horizontal and vertical world axes. This is generally not the case in a vertical camera setting. Our experiments have shown that symmetric binning significantly reduces the recognition performance.


4.3.1.4 Self-Similarity

Fig. 4.11 shows an SSM of a boxing activity. We can see a very strong periodicity at periods of 20, 40, 60, . . . frames. Note that all periods are integer multiples of 20 frames. We verify this information by checking the underlying frame sequence for repeated actions and see that the time step between a key position and a similar position (here: arm in front while boxing) is about 20 frames. Note that sometimes in actions such as boxing, which is performed by both hands, one after another, there can be 2 overlapping periodicities: one for the period between the same arm being in front and one for the period between alternating arms being in front. We can also see that prior to frame number 30 there


Figure 4.11: Self-Similarity Map for a video of the boxing action

are no periodicities visible in this matrix. This is due to the person standing still at the beginning. Fig. 4.12 shows a plot of the mean intensities along the diagonals (we only look at the upper right triangular part of the similarity matrix). We can clearly see and extract the peaks that correspond to the periods. Further analysis of other recordings shows that similarity matrices can be analyzed to detect irregularities in the video that may carry important information. Fig. 4.13 is an example of such irregularities, since there are periodicities at the beginning and end, but not in the middle. We can even see the distortion of the rectangular grid towards frame number 90. A look into the actual video frames shows us that starting from frame 89 the person stops its boxing action to turn its body around by 90°, before it continues the periodic action at about frame number 120. This shows us that this method can extract very high-level information about the actions present.

The final features that are output are the x and y values of each peak. However, these can be sorted in various ways. We achieve much better activity recognition results during evaluation if we sort the peaks by their energy rather than sorting them by periodicity. This is because most peak positions are simply multiples of the first peak. Therefore, by sorting the peaks by the y dimension (energy), we compare the most informative peaks. The information gain for more than 2 peaks is usually negligible, and due to the curse of dimensionality the average recognition performance even decreases.


Figure 4.12: Self-similarity energy plot for a video of the boxing action. The diagonal number is equivalent to the period length in frames. Similarity is taken relative to the maximum similarity and therefore does not have a unit.

The confidence of the periodicities decreases with lower frequency. Whereas almost all frames have neighbors with distance 1, only the “middle” frames (of long recordings) have neighbors with distance 500. Therefore the measurements get less accurate, since we average over fewer elements in the similarity matrix. This should be considered.

During preprocessing, we also tried to use the periodic information from the activity blocks to automatically segment the blocks into instances. Whereas the frequency of an activity can be estimated reliably over repeated action sequences, the phase (i.e. the start of each action instance) is hard to estimate. A possible approach would be to let the user annotate a subset of the actions and let a learning machine decide on the rest. We chose to manually annotate all instances.

4.3.1.5 Trajectories

Fig. 4.14 shows an example of a running trajectory. The person positioned at (x, y) ≈ (650, 430) moves through the scene to the pixel located at (500, 220). Since the ground plane is not parallel to the image plane, neighboring pixels in the upper part of the image represent objects that are further apart in the scene than in the lower part (see also the discussion on Ground Sample Distance in Section 5.3). We find that trajectories are useful to distinguish the different kinds of movement activities (such as carrying, jogging, running, walking) by their speed, which is something that stationary features such as HOG cannot do.

4.3.1.6 Feature Comparison

Fig. 4.15 shows a plot of 6 different feature sets and their classification performance after learning them with Random Forests. The results are validated using repeated setups of 5-fold cross-validation. There is no rejection class. We varied the number of frames and hence the video duration from 8 to 150 frames (0.3 s to 5.2 s). We see that for each feature


Figure 4.13: Self-Similarity Map for a video of the boxing action. Note that the actor turned around during the second third of the video, which is clearly visible in the similarity plots for each third (right).


Figure 4.14: Example of a video frame overlaid by the running trajectory of a person.


performance increases with a higher number of frames. Only for frame counts of 150 does the performance decrease again, which is due to a slight imbalance of training data, as some activities have fewer frames in our dataset.

For 8 and 15 frames, the performance of Similarity, SIFT3D, OFHS and SHOOF is not above chance level, which means that they have no discriminative power at all. HOG and Trajectory features have the most stable performance with regard to the number of frames. HOGs learn static gradients in the image which are very similar between neighboring frames. Trajectories only slightly gain in confidence if the duration of the video block is increased. Similarity features show the strongest increase in performance. This is explainable since most activities have a period of about 30 frames (1 s). Furthermore, due to averaging the similarity energy over all frames, the confidence increases with a higher number of frames.

Further analysis has shown that Trajectory features are only capable of discriminating the different walking activities (carrying, walking, jogging, running). HOG features represent image edges and are therefore best used on standing activities (open/close trunk, boxing, clapping, digging, waving). Hence this is a very good example that fulfills the requirements for successful boosting approaches: different weak classifiers are weakly correlated to each other and can be combined to create a much better classifier. This can be observed in the “All” feature set, which is a concatenation of all other features. It achieves the highest performance of up to 66%. Since the SIFT3D and flow-based features (OFHS, SHOOF) are computationally expensive and have a rather low performance, we also try a combination of HOG, Trajectory and Similarity features which we call the “Combined” feature set. Its performance stays within a 5% range of the All feature set and the difference is negligible for ≥ 90 frames. This is the most promising block-based approach, which we use in the following chapters.

[Plot: accuracy in % over the number of frames (log scale) for the feature sets All, Combined, Similarity, SHOOF, HOG, Trajectory, OFHS and SIFT3D, together with the chance level.]

Figure 4.15: Recognition performance relative to video duration. We use Random Forests to classify activity blocks of varied length.


4.3.2 Learning Machines

This section describes the most important characteristics of the learning machines that we used.

4.3.2.1 Dimensionality

We have conducted several experiments to measure the influence of the number of principal components used. Fig. 4.16 shows the results for different learning machines using only HOG features. For a low dimensionality we see that SVMs are inferior to kNN and two different Random Forests.

The performance of standard Random Forests drops with more than 30 PCs. Our default implementation (“Forest”) uses √m features and 5m/t samples for each tree in the forest, where m is the number of available features and t the number of trees. This means that for more trees, each tree uses fewer samples. This allows the Random Forests to scale to very large numbers of features. To make sure that the decrease in performance is not due to the square root in this formula, we also use Random Forests with a linear number of features and samples for each tree, in this case 50% each (“Forest50N50”).

We can see that the result of the standard Random Forests is comparable to kNN, but not to SVM. The variant with a linear number of features and samples shows a much better behavior for a higher number of PCs, although it is still below SVMs. We therefore claim that Random Forests perform better on low-dimensional features compared to other learning machines. They are also favorable due to their efficiency and scalability. SVMs, on the contrary, are especially slow in training and only work on greater feature dimensions, although not arbitrarily large ones.

The good performance of kNN has been rather surprising, since we used it only as a very simple baseline system for comparison. Apparently the different recordings of one class are very “similar” according to the Euclidean distance metric. The fact that the different activities are performed with different viewing directions (relative to the camera) might deteriorate the results of other learning machines, whereas they do not necessarily affect kNN if k is smaller than the number of available data samples per activity and direction (assuming that directions can be discretized without much loss of accuracy).

Fig. 4.17 shows a similar plot using the Combined feature set. The observations are quite different from the HOG plot. This time Forest50N50 clearly dominates SVM and kNN. Apparently this is because these learning machines cannot cope with combined features of different scale and dimension, and additional normalization would be required. kNN is now barely above chance level. Random Forests have the advantage of treating each dimension separately and therefore do not need any kind of scaling. Even if we normalize the data to have zero mean and unit variance, the results do not change dramatically. The values at PC = 10^4 are created by not using PCA. They show that in most cases the accuracy even increases if we use PCA. This is despite an approximately 100-fold reduction in training time.

4.3.2.2 Support Vector Machines

Here we look at the SVMs and describe the experience that we have made with them. We have seen that for some settings, with a very fine-tuned set of parameters and typically very high-dimensional features, SVMs perform very well. However, the time needed to train the model is usually several times larger compared to Random Forests. The parameter tuning for C-SVMs is very tedious and the effect of different C values is not intuitive. For non-linear SVM kernels, even more parameters have to be found.

We discovered two phenomena: 1) When learning the frames in the instance-based setup, the number of support vectors was very large (i.e. 50% of all data samples). Apparently


[Plot: accuracy in % over the number of principal components (log scale) for SVM, Forest50N50, kNN and Forest.]

Figure 4.16: Recognition performance relative to the number of principal components using the HOG features on blocks of activities.

our features do not abstract sufficiently from the data and hence the learning machine has to store a lot of samples as support vectors. 2) In our experiments, linear kernels consistently performed better than non-linear kernels (i.e. Radial Basis Functions). This observation seems to contradict the previous one. We would have expected that, due to the high number of support vectors, a locally non-linear solution performs better than a global linear one. This is not the case.

4.3.2.3 Random Forests

In this section we want to briefly analyze the behavior of our Random Forests. As mentioned in the Fundamentals chapter, the parameters that have to be selected for Random Forests form a trade-off between the correlation and the performance of each tree. For our dataset, we found that a high average tree performance is more important than a low correlation. This became obvious in Section 4.3.2.1, where Random Forests with a linear number of samples and features per tree performed better than a sub-linear function for either.

Fig. 4.18 shows the recognition accuracy with respect to the number of trees in our HOG Block Forest setup. We can see that the accuracy converges towards an upper bound, as the theory suggests. This upper bound is at an accuracy of about 63% on a block level. As can be seen in the graph, the average accuracy of a single tree is 46%. The average correlation between the outputs of the trees is 26%. The standard deviation is 5% for the performances and 11% for the correlations if we use random resampling. Apparently all trees are quite similar. This is probably because we use so many features and samples for each tree (50% each). The theoretical lower bound for the generalization accuracy is 6%, which is much lower than the actual accuracy.

Figure 4.17: Recognition performance relative to the number of principal components using the Combined features on blocks of activities. Values shown at PC = 10^4 do not use PCA. (Axes: principal components vs. accuracy in %; curves: Forest50N50, SVM, Forest, kNN.)

Figure 4.18: Video-level accuracy relative to the number of trees t used in a Random Forest. The result converges towards an upper bound for t → ∞. (Axes: number of trees in RF vs. accuracy in %.)

Fig. 4.19 shows the overfitting behavior of our Combined Block Forest setup relative to a parameter which defines the minimum number of data samples at which a node may still be split into sub-nodes ("minCount"). This parameter helps us to limit the depth of a tree and to avoid overfitting. Our best systems always use trees of maximum depth and hence minCount = 1. We can see that the accuracy on the training and the test dataset improves with a lower minCount. Interestingly, for minCount ≤ 4 we achieve an almost perfect fit on the training data, although we still use 200 different trees. This is however not a problem, since the accuracy on the test set does not deteriorate for a smaller minCount (which would otherwise indicate overfitting).

We have conducted a similar analysis for the Random Forest parameter maxDepth, which limits the maximum tree depth. The result for an increasing maxDepth is almost exactly the same as for a decreasing minCount, since a minimum number of samples for a split is the local equivalent of a global maximum depth.

Figure 4.19: Frame-level accuracy relative to the minimum number of samples at a tree node which may still be split into sub-nodes. (Axes: minimum number of samples to allow a split vs. accuracy in %; curves: Train, Test.)
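The effect of minCount can be reproduced with a small sweep like the following sketch (scikit-learn, where the corresponding parameter is called min_samples_split; illustrative only, with X_pca and y assumed as in the earlier sketches).

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X_pca, y, test_size=0.3, random_state=0)
for min_count in (2, 4, 16, 64, 256):
    rf = RandomForestClassifier(n_estimators=200, min_samples_split=min_count)
    rf.fit(X_tr, y_tr)
    print(f"minCount={min_count}: train accuracy {rf.score(X_tr, y_tr):.2f}, "
          f"test accuracy {rf.score(X_te, y_te):.2f}")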

4.3.2.4 Hough Forests

In this section we want to discuss the behavior of Hough Forests. Since our instance approach performs only slightly better than chance level (see Section 4.3.5), we intentionally simplify the task to show the potential of Hough Forests. We use the HOG Inst features with 16 PCs, but now we do not take care that videos of the same person are only present in either the train or the test set. This setup is much simpler, since different videos of the same person are quite similar due to the same clothing and a similar background. Since our Matlab implementation of Hough Forests is very slow, we only use two activity classes: "boxing left" and "carrying right".

Table 4.2 shows the results of our experiments for 4 trees. We vary the Hough ratio parameter, which defines for how many of the tree splits we choose the offset uncertainty measure instead of the class-label uncertainty measure. The presented values are output, chance and gain for class labels and distances (see Section 4.2.2.2 for more details). Note that the chance-level fluctuates due to random compositions of train and test set during k-fold cross-validation. We can see that the label output, which is the accuracy, is at 96.8% for standard Random Forests with a Hough ratio of 0%. Since this two-class problem is not very difficult, the label gain is at 93.0%. Since we do not minimize the offset uncertainty with this setting, the distance gain is close to 0% and even worse than chance-level. If we increase the Hough ratio, we see that the label output departs from the optimum of 100%, whereas the distance output approaches its optimum of 0 frames. Hence the label gain decreases and the distance gain increases, just as the theory predicts. These findings are visualized in Fig. 4.20. Here we additionally vary the number of trees used in the Hough Forests between 4 and 50. The higher number of trees can significantly improve the results.

The results show that our implementation of Hough Forests can be used to estimate the temporal segmentation of an activity. However, the results are still far from perfect and we do not think that activity recognition can benefit from them at this stage. This is due to the features, which cannot sufficiently discriminate between different frames of a video. For datasets with more activities the results are often not above chance level. Image patches, which are usually used with Hough Forests, are also not above chance level. Apparently the segmentation problem continues to be an issue for datasets that are as complex as ours.

                      Labels                        Distances
Hough Ratio   Output    Chance    Gain       Output    Chance    Gain
0%            96.8%     54.1%     93.0%      4.2F      4.2F      -1.4%
25%           96.9%     53.5%     93.3%      3.8F      4.1F       6.1%
50%           95.5%     54.8%     90.0%      3.7F      4.1F      10.9%
75%           94.2%     54.3%     87.3%      3.8F      4.3F      11.3%
100%          85.2%     55.0%     67.1%      3.7F      4.2F      12.2%

Table 4.2: Hough Forest results on a simplified dataset with only 2 activities. We use HOG features with 16 PCs and 4 trees on instances. We can see how the Hough ratio parameter forms a trade-off between a better accuracy and a lower distance between the estimated and the actual segmentation.
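The role of the Hough ratio can be summarized in a conceptual sketch like the following (Python; not the thesis's Matlab Hough Forest, and all function names are illustrative): at every node split the trainer picks either the class-label uncertainty or the offset uncertainty as the measure to minimize, with probability equal to the Hough ratio.

import numpy as np

def class_uncertainty(labels):
    # Shannon entropy of the class labels reaching a node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def offset_uncertainty(offsets):
    # variance of the temporal offsets (frames to the activity start)
    return float(np.var(offsets))

def split_objective(labels, offsets, hough_ratio, rng):
    # with probability hough_ratio, optimize the split for consistent offsets,
    # otherwise optimize it for pure class labels
    if rng.random() < hough_ratio:
        return offset_uncertainty(offsets)
    return class_uncertainty(labels)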

4.3.3 Hypothesis Aggregation vs. Single Hypothesis

In this section we want to analyze whether our system can benefit from aggregating different hypotheses into a joint hypothesis over a larger block of time. The alternative is to use a single fixed-size block of larger size. Hypothesis aggregation has the benefit of being more flexible: we can already get a (weak) hypothesis from the first block (possibly a single frame) and improve the confidence of the hypothesis over time.

Fig. 4.21 shows a plot of the recognition accuracy relative to the size of the blocks in frames and the number of blocks that we aggregate. For that we use the mode of the outputs. We can see that there is a slight improvement if we aggregate more blocks (the colors in the lower rows are more red than in the upper rows). However, it also becomes obvious that hypothesis aggregation is inferior to choosing a bigger block size. If we look at the values (block size, aggregation size) = (30, 9) and (60, 1), the former uses 30 ∗ 9 = 270 frames and the latter only 60 ∗ 1 = 60 frames. So the accuracy of the former should be greater than the accuracy of the latter, but it is 9% worse. In fact this is true for almost all values and only capped by the maximum performance of our system. This shows that the flexibility of hypothesis aggregation comes at the cost of accuracy.
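The aggregation itself is simple; the following minimal sketch (illustrative Python, not the thesis code) joins the per-block class hypotheses by taking the mode (majority vote) over the most recent blocks.

from collections import Counter

def aggregate_hypotheses(block_labels, aggregation_size):
    # majority class over the last `aggregation_size` block-level outputs
    window = block_labels[-aggregation_size:]
    return Counter(window).most_common(1)[0][0]

print(aggregate_hypotheses(["box", "box", "wave", "box", "clap"], aggregation_size=5))  # -> "box"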

Figure 4.20: The effect of different Hough ratios and different numbers of trees on recognition accuracy and distance estimation. The gain measure is defined in Section 4.2.2.2. (Axes: Hough ratio in % vs. gain in %; curves: Accuracy with 50 trees, Accuracy with 4 trees, Distance with 50 trees, Distance with 4 trees.)

Figure 4.21: The effect of hypothesis aggregation relative to the block size in frames and the number of blocks that are aggregated. The colors indicate the recognition accuracy in %. (Axes: block size in frames, 8 to 120, vs. aggregation size in blocks, 1 to 9.)


4.3.4 Activity Alignment

We want to find out whether an alignment of our data to the start of an activity instance can help to improve our results. In an experiment we compare the activity recognition performance on blocks of a fixed length that either start from the beginning of an instance or from a random starting frame. If the aligned blocks do not have enough frames, we fill them by replicating the last frame's features. This is necessary since the learning machine requires features with a fixed dimensionality. For the unaligned blocks such a filling scheme is not required, since we can just choose a continuous sequence that is long enough. This is obviously an advantage for the unaligned version, which should manifest in a higher performance.

We use the HOG and Trajectory features. The Similarity features would only be helpful for unaligned blocks, as there is no remarkable periodicity in a single activity instance. Hence their use would make a direct comparison more difficult.

Fig. 4.22 shows the plot of the accuracy relative to the block size s in frames. We can see that for s ≤ 70 the aligned blocks outperform the unaligned blocks. The observation that for s > 70 the aligned blocks perform worse is only due to filling them with dummy frames, which is not done for the unaligned blocks. This shows that the alignment gives a performance gain compared to the same feature sets without alignment.

However, we should not forget that we left out the Similarity features, which are only useful for unaligned blocks. Additionally, if aligned blocks are easier to classify, the question remains how to retrieve the alignment during evaluation. This segmentation problem will be treated in Section 5.1. It is also questionable whether the performance gain is worth the cost of having to segment the training data into single instances.

Figure 4.22: Comparison of recognition accuracy with aligned and unaligned feature blocks. We use Random Forests with HOG and Trajectory features. The chance-level is at 16.1%. (Axes: block size in frames vs. accuracy in %; curves: Unaligned, Aligned.)
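The filling scheme for aligned blocks mentioned above can be sketched as follows (illustrative Python, not the thesis code): if an instance is shorter than the block size, the features of its last frame are replicated so that the learning machine always receives a fixed-dimensional input.

import numpy as np

def make_aligned_block(frame_features, block_size):
    # frame_features: (num_frames, dim) array of per-frame features of one instance
    if len(frame_features) >= block_size:
        block = frame_features[:block_size]          # take frames from the instance start
    else:
        pad = np.repeat(frame_features[-1:], block_size - len(frame_features), axis=0)
        block = np.vstack([frame_features, pad])     # pad with copies of the last frame
    return block.ravel()                             # one fixed-length feature vector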


4.3.5 Blocks vs. Instances

As mentioned before, we differentiate between unaligned blocks of a fixed size and aligned instances of varying size that correspond to a single gesture (e.g. a punch). Both approaches have their advantages. Blocks are more useful for a focus on implementation, since these are the inputs to a system in a realistic environment. Without special approaches, such as Hough Forests, the segmentation problem cannot be overcome and we do not know when an activity starts.

Instances are very useful for an analysis of the system properties. Evaluating the accuracy on instances answers the question "How many of the activity instances can be successfully recognized?". They also allow us to look at the performance of subclasses (e.g. left- or right-arm punching) and to analyze the rotational invariance of the system by assigning each instance a direction. However, due to their varying duration (in our case between 4 and 100 frames with a median of 27 frames) the block size has to be very small (a maximum of 4 frames). Since frames typically look very similar to their 3 successors, we chose to use single-frame blocks.

Table 4.3 shows a comparison of the best accuracy that we achieved on blocks and instances. We refer to a system as "<feature> Block/Inst <learning-machine>" (e.g. Combined Block SVM). We can see that the block approach performs very well with 63% accuracy for Random Forests and still 53% for SVMs. Instances however barely surpass chance-level and are therefore not recommendable.

                          Accuracy
System                    Frame     Block     Chance    Duration
Combined Block Forest     -         63.1%     26.1%     120F
Combined Block SVM        -         52.6%     26.1%     120F
HOG Inst Forest           24.9%     23.1%     16.1%     ≈ 27F
HOG Inst SVM              23.7%     26.0%     16.1%     ≈ 27F

Table 4.3: Comparison of system accuracy of blocks and instances on frame and block level, the chance-level on blocks and the video duration in frames. Combined refers to HOG, Trajectory and Similarity features. Trajectory and Similarity features are not useful for single-frame estimates, which is why we only use HOG features for instances.

Fig. 4.23 shows the setup of our recommended Combined Block Forest system. Note the similarity to our general activity recognition pipeline in Fig. 2.3. The video is preprocessed and cut into blocks. Trajectories are a by-product of the preprocessing, and HOG and Similarity features are extracted from the video block. We apply PCA on the HOG features and concatenate all features. Random Forests are used to classify each activity.

Figure 4.23: Activity recognition pipeline for our recommended Combined Block Forest system.
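The pipeline of Fig. 4.23 can be summarized as in the following sketch (illustrative Python; the feature extractors are placeholders, not the actual implementations): HOG, Similarity and Trajectory features are computed per block, the dense HOG part is reduced with PCA, everything is concatenated and classified with a Random Forest.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

def extract_features(video_block):
    # placeholder extractors; each returns one feature vector per block
    hog = np.zeros(2000)        # dense HOG descriptor (illustrative dimension)
    similarity = np.zeros(30)   # self-similarity / periodicity features
    trajectory = np.zeros(10)   # trajectory statistics from the preprocessing
    return hog, similarity, trajectory

def build_feature_matrix(blocks, pca=None):
    hogs, rest = [], []
    for block in blocks:
        hog, sim, traj = extract_features(block)
        hogs.append(hog)
        rest.append(np.concatenate([sim, traj]))
    hogs = np.asarray(hogs)
    if pca is None:
        pca = PCA(n_components=50).fit(hogs)   # fit PCA on the training blocks only
    return np.hstack([pca.transform(hogs), np.asarray(rest)]), pca

# X_train, pca = build_feature_matrix(train_blocks)
# clf = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)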


4.3.6 Activity Subclasses

Here we show the results of the HOG Inst SVM system applied to all possible combinations of activities, body sides and angles. Table 4.4 gives an overview of the selected setup, the number of classes that the system was trained on, the (fixed) number of activities that it was evaluated on, as well as the frame-level and video-level accuracy. We can see that the use of orientation sub-categories alone does not increase the video-level performance, body sides increase it by 5.6% and the combination of both leads to an increase of 6.0%.

We can therefore say that splitting classes into subclasses is to some extent a good idea. It should however be noted that there must still be enough data to train each class and that the training and evaluation time can dramatically increase for some learning machines (e.g. SVM), but not for others (e.g. Random Forests).

                             Classes                    Accuracy
Body Sides   Orientations    Training   Evaluation     Frame-level   Video-level
yes          yes             64         10             41.0%         32.0%
yes          no              16         10             29.4%         31.6%
no           yes             40         10             23.7%         26.0%
no           no              10         10             28.9%         26.0%

Table 4.4: HOG Inst SVM recognition accuracy given different types of label classes. The features are reduced to 50 PCs. The last row shows the simplest case of 10 actions, whereas the other rows are trained on different body sides and orientations. The performance increases with more distinct label classes.

4.3.7 Live System

As mentioned above, we also implemented several live versions of our systems in Matlab. Most of them run in real-time. Fig. 4.24 shows a screenshot of such an implementation. On the right we see the first frame of the original video overlaid with the bounding boxes that indicate the trajectory. Below that we see the contents of the bounding box that we work on, and at the bottom we see the SSM, which is blurred since we only use every fourth frame. On the left we see the classification results per activity class. We store a result buffer for the last 30 results and count the ratio of each class. Red colors indicate misclassifications and green colors indicate correct outputs. In this example we see that 10 out of 30 frames were misclassified.

Table 4.5 shows the real-time performance of different systems and the amount of data that needs to be stored for the learning machine model and the PCA coefficients, preferably in RAM. Note that 29FPS corresponds to real-time. This shows that some of our systems can easily be used in real environments. Only the HOG Inst SVM runs very slowly with less than 1FPS. This is because the SVM trained on single frames holds between 40,000 and 60,000 support vectors, which corresponds to 40% to 60% of all training frames. We therefore need to multiply each query sample with 50,000 vectors of dimensionality 800. There are techniques which lower the number of required support vectors [Koggalage and Halgamuge, 2004] [Lin and Lin, 2003]. These are however beyond the scope of this work.

Additionally, SVMs should not be used for a large number of classes in time-critical systems. Since the multi-class SVM is essentially a collection of n single SVMs for n classes, each support vector will be used for each SVM and the classification of a single point takes up to 2s on our system. For 64 classes (using the activity subclasses in Section 4.3.6) the real-time performance even decreases to 0.5FPS. Random Forests have the huge advantage of scaling to any number of classes without an increase in runtime, assuming the same tree structure. They operate locally, which means that even though the whole data structure must be kept in storage, only a very small part of it will be used.

The storage requirements of Random Forests with deep trees could however pose a problem. The model is very large since there are 200 trees. Pruning the trees or finding splitting criteria that are optimized for more even splits (such as twoing) could reduce the model size, but also reduce accuracy (see Section 4.3.2.3).

If we take into account the accuracy of the systems, only the Combined Block X approaches fulfill our requirements. Compared to the instance-based systems, they have a much lower model size. This is because we only train and test them with a total of 831 video blocks compared to 3868 video instances. The blocks are chosen such that they do not overlap. To create more training data, future work could use overlapping blocks, which would increase the number of video blocks to approximately 77,573. Obviously, care needs to be taken that overlapping video blocks (and even videos from the same person) are not in both the train and the test set.

Figure 4.24: A screenshot of the live system implemented in Matlab. (The bar plot shows the classification results in % for the classes box, carry, clap, dig, jog, trunk, run, throw, walk and wave.)

System                  Accuracy   Model Size   PC Size    RT
Combined Block Forest   60.1%      1.1MB        283.7MB    30FPS
Combined Block SVM      55.1%      7.5MB        283.7MB    30FPS
HOG Block Forest        36.7%      2.0MB        283.7MB    700FPS
HOG Block SVM           45.0%      85.4KB       283.7MB    30FPS
HOG Inst Forest         23.1%      2.9GB        16.2MB     70FPS
HOG Inst SVM            26.0%      490.1MB      16.2MB     0.7FPS

Table 4.5: Overview of our live systems. The columns show the accuracy, the model size of the learning machine, the size of the principal components that must be stored and the real-time (RT) performance in frames per second for 6 different systems. Block systems use every 4th frame, whereas instance systems use every 8th. Accuracies of blocks and instances are not directly comparable.


4.3.8 Problem Simplification

In previous sections we have seen that our system performs better on some activities than on others. In this section we analyze how the result changes when we leave out some of the classes. The idea is that we can improve the relative system performance by removing classes with low performance. Due to the large number of possible combinations of 10 activities, we decide to follow a fixed scheme instead of showing all combinations. We start with 10 activities and successively remove the class with the lowest performance. Table 4.6 shows the results on blocks of 120 frames.

             Activity Indices    Accuracy   Chance
Simplified   (6,9)               99.9%      76.4%
Simplified   (5,6,9)             98.5%      69.8%
Simplified   (5,6,8,9)           91.3%      50.1%
Simplified   (3,5,6,8,9)         92.2%      42.3%
Simplified   (2,3,5,6,8,9)       89.1%      37.9%
Simplified   (2-6,8,9)           76.9%      31.2%
Simplified   (2-9)               77.4%      31.5%
Simplified   (2-10)              67.2%      28.0%
Original     (1-10)              63.5%      26.1%

Table 4.6: Recognition accuracy of the Combined Block Forest system using the original set of 10 activities as well as smaller subsets.

For a lower number of activity classes, the accuracy and the chance-level increase. The effect of leaving out a single class differs strongly due to the different amounts of data that are available for this class. This is also why activity 6 remains in the dataset: it makes up 26.1% of the data. The effects are very systematic, except for class 7, which accounts for only 1% of the data. Hence the gain in performance when leaving it out is negligible, and due to random fluctuation the result is even worse than before.

Now we want to analyze the performance per activity class. For that we create an accuracy plot in Fig. 4.25. We can see that mostly stationary activities of short duration perform badly. Fig. 4.26 shows the corresponding confusion matrix, but with permuted rows and columns. We applied a sparse reverse Cuthill-McKee ordering to the confusion matrix. This algorithm reorders a sparse matrix into a band matrix form with a small bandwidth. This way we can directly see "clusters" of confused activities. Colors indicate the accuracy in % as indicated by the colorbar on the right. Rows indicate the activities that should be output, whereas columns show the activities that are output. We can see a strong confusion between box, wave and clap, three stationary and highly periodic actions that are performed mostly by hand. Dig and throw are also confused. These are two stationary activities that involve the movement of the whole body and are therefore more irregular. The last visual block is that of carry, jog, run and walk. These are all movement activities and can therefore easily be confused. They can even be separated into sub-blocks of slow movement (carry and walk) and fast movement (jog and run).

Fig. 4.27 and Fig. 4.28 show the same kind of plots for the subset of activities 2, 3, 5, 6, 8 and 9. We can see that most of the confusion is removed by taking only the less problematic classes and only one or two per cluster.
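The reverse Cuthill-McKee reordering mentioned above is available in SciPy; the following minimal sketch (illustrative, with a random stand-in matrix) permutes the rows and columns of a confusion matrix so that clusters of mutually confused activities appear as blocks near the diagonal.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

conf = np.random.rand(10, 10)                      # stand-in for the real confusion matrix
edges = csr_matrix((conf > 0.2).astype(int))       # treat strong confusions as graph edges
perm = reverse_cuthill_mckee(edges, symmetric_mode=False)
conf_reordered = conf[np.ix_(perm, perm)]          # apply the same permutation to rows and columns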

Figure 4.25: Accuracy plot for the Combined Block Forest system applied to 10 activities (box, carry, clap, dig, jog, trunk, run, throw, walk, wave). The names of the activities are shortened for visualization. (Axis: accuracy in %.)

Figure 4.26: Confusion matrix for the Combined Block Forest system applied to 10 activities (reordered: trunk, throw, box, wave, clap, dig, carry, walk, jog, run). Colors indicate the recognition accuracy in percent. The names of the activities are shortened for visualization.


Figure 4.27: Accuracy plot for the Combined Block Forest system applied to 6 activities (carry, clap, jog, trunk, throw, walk). The names of the activities are shortened for visualization. (Axis: accuracy in %.)

Figure 4.28: Confusion matrix for the Combined Block Forest system applied to 6 activities (reordered: trunk, throw, clap, walk, carry, jog). Colors indicate the recognition accuracy in percent. The names of the activities are shortened for visualization.


5. Discussion

In this chapter we use our results to discuss important aspects of our system and the underlying fundamental problems.

5.1 Segmentation Problem

In this section we want to cover the previously mentioned segmentation problem. This means that we want to find out when an activity starts and ends, or where its temporal center is located. Such knowledge can help us to improve our understanding of the underlying motion and can increase the classification performance if we use aligned feature blocks (see Section 4.3.4). It is also essential if we want to count occurrences of activities or identify recurring patterns. Solving the segmentation problem is however not a trivial task. Difficulties stem from the "large intra-person physical variability, wide range of temporal scales, irregularity in the periodicity of human actions, and the exponential nature of possible movement combinations" [Spriggs et al., 2009].

The approaches to temporal segmentation that we suggest can only be analyzed qualitatively on our data. We do not have annotated continuous video data available which we could use for a quantitative analysis. The approaches are as follows:

• Threshold: A simple yet effective approach for segmentation is to create an output threshold for each activity. This means that we only output classification results to a user that are significant. This can be done by analyzing the probabilities (Random Forests) or scores (SVMs) of the learning machine. The approach cannot give us an exact estimate of the start and the end of an activity, but it might give us an impression of the most relevant part of an activity (depending on the behavior of the classifier).

• Hough Forests: We have seen that Hough Forests can be used to estimate the distance to a point in time or space, such as the activity's start or end point. They can therefore be used to solve the segmentation problem. The advantage of Hough Forests is that they already combine classification and segmentation. However, the trade-off ratio between classification and segmentation (Hough ratio) might be unacceptable. Perhaps it would be favorable to train and evaluate a Hough Forest with Hough ratio = 0 (that is, a standard Random Forest) for classification and a Hough Forest with Hough ratio = 1 for segmentation separately. This way we do not have to sacrifice discrimination for segmentation or vice-versa.


It is also questionable whether Hough Forests can handle previously unseen temporal scales (fast or slow executions of an activity). Hough Forests apply a lazy-learning technique based on local search (such as k-Nearest Neighbors). Unseen behavior can therefore not be handled.

• Self-Similarity: A method that we propose is to use the self-similarity maps for temporal segmentation. We are not aware of any papers that follow this approach, although [Cutler and Davis, 2000] use SSMs for spatial segmentation. The idea is to use the energy plot extracted from a local SSM (such as Fig. 4.12). The most prominent peak corresponds to the major period (a minimal sketch of this step follows below).

Now we know the duration of an activity instance, but we still have to find out about its starting point. This point is not fixed to an amplitude in the sine-like curve (such as the maximum, minimum or zero intercept). It is not even defined unambiguously. Instead we need a user to define where an action starts.

In Section 4.1.2.2 we originally tried to follow this approach and let the user decide when an activity starts. The system then proceeds to the next period, annotates it and updates the local SSM. This is repeated until the video ends and done for each video separately. The approach requires a lot of user input. It might be improved for larger datasets by learning only a few annotated start frames and applying a classifier to the other frames. The (reliable) recognition of a single activity starting frame in a video of periodic activities would then be sufficient to segment the whole video.
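A minimal sketch of the self-similarity idea (illustrative Python with a synthetic signal, not the thesis code): the dominant period is taken from the strongest non-zero frequency of the spectrum of the SSM-derived energy signal.

import numpy as np

fps = 29.0
t = np.arange(300) / fps
energy = np.sin(2 * np.pi * 1.2 * t) + 0.1 * np.random.randn(t.size)   # synthetic ~1.2 Hz activity

spectrum = np.abs(np.fft.rfft(energy - energy.mean()))
freqs = np.fft.rfftfreq(energy.size, d=1.0 / fps)
dominant = freqs[np.argmax(spectrum[1:]) + 1]   # skip the DC component
print(f"dominant period: {1.0 / dominant:.2f} s ({fps / dominant:.1f} frames)")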

5.2 Temporal Resolution

In this section we analyze the required temporal resolution of our system. The knowledge retrieved from this analysis can influence the choice of the video camera, as well as the real-time performance of the system. Since today's computers usually run more threads than they have computing cores, processing fewer FPS can speed up other tasks executed on the same system.

We use our Combined Block Forest detector on unaligned blocks. The recognition accuracy is measured on blocks of 120 frames without a rejection class. Fig. 5.1 shows the accuracy relative to the number of frames (p − 1) that we skip between successive frames. This means that for p = 4 we use every 4th frame for our recognition. The observation is exemplified in Fig. 5.2.

It shows that consecutive frames are very hard to differentiate and therefore do not contribute further to the recognition, whereas the underlying activity seems to change slightly at every 4th frame and more significantly at about every 16th frame. Beyond that we see a rapid deterioration of performance towards the single-frame performance of 31%. We conclude that for human activity recognition a camera with a capture rate of at least 3Hz (≈ every 10th frame at 29FPS) is sufficient. This is a requirement which almost all standard cameras easily fulfill.
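The frame-skip experiment boils down to subsampling the frame sequence before feature extraction; a trivial sketch (illustrative) is shown below.

def subsample_frames(frames, p):
    # keep every p-th frame; p = 1 keeps all frames, p = 4 keeps every 4th
    return frames[::p]

video = list(range(120))                 # stand-in for 120 decoded frames
print(len(subsample_frames(video, 4)))   # -> 30 frames, i.e. an effective rate of about 7 Hz at 29 FPS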

Figure 5.1: Recognition performance relative to the frame skip between successive frames. (Axes: frame skip vs. accuracy in %.)

Figure 5.2: A sequence of images in a video. For the human eye the difference between neighboring frames is barely recognizable, and therefore the temporal resolution can be reduced.


5.3 Image Resolution

The analysis of the required image resolution is at least as important as the temporal resolution. When we refer to image resolution we mean the effective image resolution on the ground. This quantity can be measured via the Ground Sample Distance (GSD). The GSD is the distance between neighboring pixel centers in meters. As such it is inversely proportional to the image resolution. It depends directly on the flying height of the UAV, the magnification of the optical system and the image sensor. This is why the assessment of a maximum required GSD is important, to find out whether our algorithm is actually applicable to a given system. In the case that the image plane is parallel to the object plane, the GSD can be easily computed as:

\mathrm{GSD} = \frac{p}{f} \, R \quad [\text{meters}] \qquad (5.1)

where p is the detector pitch, f is the focal length and R is the range or object distance (see pp. 30-31 in [Leachtenauer and Driggers, 2001]). The GSD in x and y direction can differ. If this is the case, we compute the 2D GSD as the geometric mean of both GSDs. In the case where the two planes are not parallel, the GSD must be corrected for the angle θ between ground and sensor line-of-sight (see p. 32 in [Leachtenauer and Driggers, 2001]) by using the following formula:

\mathrm{GSD} = \frac{p \, R}{f \cos\theta} \quad [\text{meters}] \qquad (5.2)

If the object in our image is not planar, the computation becomes more involved. For an accurate solution we would need a 3-dimensional map of the image pixels. Such a map might be constructed using "structure from motion" approaches. Since this is outside the focus of this work, we manually estimate the GSD as follows: We assume an average height of 1.7 m and a shoulder width of 45 cm for adult men. Then we randomly sample 100 frames from the dataset and measure the width and height of the person in pixels (corrected where the person is not standing upright). Despite the different zoom levels in the images, we approximately get a normal distribution with a mean at GSD = 3.8 cm. Since it is identical in both directions, it suffices to mention a single GSD value. Fig. 5.3 shows a video frame bilinearly downscaled to different GSDs and upscaled again using nearest neighbor interpolation to have the same size.
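A small numerical sketch of Equations (5.1) and (5.2) (illustrative sensor parameters, not measured values of any particular system):

import math

def gsd(p, f, R, theta_deg=0.0):
    # Ground Sample Distance in meters for detector pitch p, focal length f,
    # object distance R and angle theta between ground and sensor line-of-sight
    return p * R / (f * math.cos(math.radians(theta_deg)))

p, f, R = 5e-6, 0.05, 400.0    # 5 micrometer pitch, 50 mm focal length, 400 m distance (assumed)
print(f"nadir GSD:   {gsd(p, f, R) * 100:.1f} cm")       # Equation (5.1), theta = 0
print(f"oblique GSD: {gsd(p, f, R, 40) * 100:.1f} cm")   # Equation (5.2), theta = 40 degrees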

Now we can run our X Block Forest recognition systems on downscaled versions of the data and analyze the performance. The result can be seen in Fig. 5.4. We get an accuracy of 63% for the original GSD of 4 cm. The performance decreases quickly to about 50% at a GSD of 8 cm, which corresponds to an image size of 35 × 35 pixels. The values remain surprisingly constant until a GSD of 32 cm. Beyond that they deteriorate to 32% at a GSD of 136 cm, which corresponds to a 2 × 2 pixel image. This is still significantly above the chance level of 26%, computed by guessing the activities using the most frequent activity. If we take a look at the single features that make up the Combined feature set, we can see that the change in accuracy is dominated by the change in accuracy of the Similarity features. The drastic reduction from a GSD of 4 cm to 8 cm and the decline in performance for high GSDs are apparent in both setups.

Contrary to the analysis of the temporal resolution, we cannot say that we can decrease the effective image resolution without reducing the activity recognition performance. It should be noted that low resolution is not the only effect that can deteriorate image quality. Further research could analyze how white noise or even bad weather conditions (fog, clouds or rain) affect the system.

Figure 5.3: A frame from our dataset rescaled to different resolutions. The Ground Sample Distances (GSD) in cm are estimated. (Panels: GSD = 4, 8, 13, 17, 25, 34, 50, 76 and 101 cm.)

Figure 5.4: Recognition performance relative to the Ground Sample Distance in cm. Accuracies are measured on blocks of 120 frames learned with Random Forests. (Axes: GSD in cm vs. accuracy in %; curves: Combined, HOG, Trajectory, Similarity.)


5.4 Key Findings

We have conducted a large number of experiments in this work. Here we want to summarize the key findings:

• Aerial activity recognition is more difficult than in horizontal camera settings: This is due to the large number of possible perspectives. Whereas horizontal camera settings only allow for a 360° orientation, vertical camera settings also vary in the camera angle on the scene. Other aspects are the difficult spatial and temporal segmentation, small image scales and generally low image quality. This finding is directly related to the following findings.

• Features need to be tailored for the application: We have seen that some features such as HOOF were introduced in a way that is not usable for aerial activity recognition (i.e. by exploiting forced symmetric invariances). Other generally more successful recognition clues such as optical flow become very hard to interpret for the learning machine, given the bad spatial and temporal alignment and the high amount of different camera perspectives and person orientations. The recently popular local interest-point approaches, represented by SIFT3D in this work, have proven to be neither fast nor accurate enough. We assume that the low accuracy is due to the relatively small size of our codebook, since our dataset is too small for its difficulty.

• A combination of different features is preferable: In Sections 4.3.1.6 and 4.3.8 we showed that a combination of different features yields an accuracy that is superior to each single feature. This is because the different types of activities, such as movement and stationary or periodic and non-periodic, can only be discriminated by different features.

• We should not rely on temporal or spatial segmentation: We have seen that spatial and temporal ("alignment") segmentation can improve the result and allow the use of completely different features (such as appearance-based approaches). However, for our application the quality of the image material is not good enough to implement reliable segmentation techniques, despite the advances in both spatial and temporal segmentation (e.g. Hough Forests). Our approach even harnesses the missing temporal alignment using self-similarity features.

• Dimensionality reduction is indispensable: Dimensionality reduction techniques such as the PCA are very important for dense features like HOG. In our case we typically reduced the dimension from 170,520 to between 5 and 800. This lowers the training time from hours to seconds. During evaluation a lower dimension is equally important for real-time behavior. We showed in Section 4.3.2.1 that despite this gain in time the PCA can even increase performance in some cases, and it decreases performance in none of our experiments.

• Each learning machine has its advantages: The choice of a learning machine should be made dependent on the feature type, dimension and real-time requirements. In Section 4.3.2.1 we have seen that SVMs work better on HOGs and Random Forests work better on a combination of features. The real-time performance is also an important decision factor. If we have tens of thousands of support vectors, Random Forests can be 100 times faster. If we have only a few, both are comparable. On the other hand, large Random Forests can come with memory requirements that might be unacceptable for embedded systems.

• Block-based approaches are better than instance-based approaches: In this work we have compared activity recognition based on space-time volumes of a fixed size to flexible "instance-based" approaches that aggregate single-frame hypotheses. Due to the difficulties of our dataset, such as camera movement, missing segmentation and the variety of actions and persons, we were not able to create a classifier that worked sufficiently well on a frame-level, and therefore the hypothesis aggregation techniques were also not very successful. We were even able to show that aggregation techniques always perform worse than single-block detection on our dataset. The temporal alignment of activities led to a slight increase in accuracy for HOG features, but did not justify the immense work of manual segmentation.

• A real-time implementation is possible: We have shown in a Matlab implementation that our system can achieve a real-time performance of about 30FPS. There are various parameters that can be tuned for a higher execution speed, however all of them come with a slight deterioration of the system's accuracy. The most important ones are the image scale, the temporal scale and the frameskip.

• The segmentation problem remains very difficult: Despite our use of state-of-the-art techniques such as Hough Forests, we were not able to fully solve the segmentation problem. Future work will have to focus on finding features and techniques that are especially suited for aerial video data and the issues that arise from the low quality and the multitude of camera and person orientations.

5.5 Fulfillment of Requirements

In Section 1.3 we presented a list of requirements that our solution should fulfill from an engineering perspective. In this section we want to analyze the degree to which these requirements were met.

1. Accuracy: We have seen in Section 4.3.5 that our system can achieve a block-level activity recognition accuracy of about 63% on a dataset of 10 activities. Whereas this result still leaves room for improvement, Section 4.3.8 shows that we can dramatically improve this result by removing a few problematic activities. The result of 89% on 6 activity classes of our choice shows that the system has enough discriminative power for some types of activities, but not for others. At this level of performance the system can be used for applications that alert the user (such as a Forward Air Controller) of an unmanned aerial system when an activity occurs. It is however not usable for a fully automatic processing and decision-making.

2. Robustness: The stabilization of our video input was not a focus of this work and the data that we used was neither well-centered nor entirely stabilized. Nevertheless the system performs sufficiently well and the features that we use are able to compensate for minor translations. In Sections 5.2 and 5.3 we have seen that the system's temporal resolution can be reduced with only a slight decrease in performance, whereas the image resolution should not be further reduced.

3. Real-Time: By implementing a live system we were able to show that we can achieve real-time performance. The choice of features like HOG and learning machines like Random Forests leaves further room for improvement using parallel implementations. Porting key parts of the code to the GPU might increase the runtime performance even further.

4. Compatibility: Our solutions were implemented on standard hardware using a combination of Matlab and C/C++. Key libraries such as LibSVM are written in ISO C89 and are therefore easily portable to embedded systems that are typically used in UAVs.

5. Universality: Our system was successfully tested on the visible light spectrum. Due to the lack of sufficient amounts of data, we were not able to evaluate our system on infrared videos. This analysis has to be delayed until such data has been collected. An alternative would be the use of a military simulator (see Section 3.3) that is able to simulate infrared lighting conditions.

We have shown that all requirements except for universality have been met. The applicability to different light spectra still has to be analyzed in future work.

5.6 Recommendations

We have seen that activity recognition is possible, but that it is still subject to some limitations. The underlying data and the choice of activities to be classified seem to be almost as important as the features and the classifier. To develop a commercial product from these solutions, we stress that the supported activities should be carefully selected with respect to the discriminative power of the supported features. Instead of using four different walking activities, such as in the UCF-ARG dataset, one or two should be chosen, and only if we use trajectory features. The similarity features should be chosen for activities with a strong period such as "boxing". HOG features seem to have a weak generalization behavior, but they are useful to learn recurring image regions. In our case these were the cars in the "open/close trunk" activity and the box in the "carrying" activity.

At the moment there is no publicly available dataset that can be used to train a system that works under real environments. Therefore a new dataset should be created that is similar to UCF-ARG, but has more diverse backgrounds, clothing and activities. The most important decision is which kind of flying platform one should use. Balloons, airships and quadcopters have their advantages, since they can create very stable videos if there are no strong winds. Planes can record videos under more realistic conditions, but at higher costs for flight time and data post-processing. The dataset should include at least 10 actors with different combinations of clothing and items such as backpacks. Each action should be performed several times in varying directions and at different flying heights. Actions should take place in realistic urban or natural environments with varied textures and partial occlusion of the person in the video. The system might also benefit from rejection classes that show persons who are just standing, sitting or lying, but not performing a particular action. We generally think that the conditions should be as close to the final application as possible. The data should be gathered from a well-stabilized flying platform with a frame rate of at least 8FPS and a GSD of 4cm or preferably less.

The stabilization of the video should be as smooth as possible. In our case we sometimes noticed a shaking video when the cross-correlation-based tracking jumped between neighboring pixels or regions. Blob extraction should leave sufficient space around the person to make sure that even during swift movements all body parts are included in the blob. We recommend a time horizon of at least 3s and the use of unaligned blocks of fixed size. Due to real-time limitations, HOG, Similarity and Trajectory features should be used. The integration of additional meta-data from the plane or sensor is a topic for future work.

We recommend Random Forests due to their multi-class support and their ability to scale to large feature dimensions and numbers of samples. Real-time requirements could be enforced by using early stopping after evaluating only a few trees. However, with the current setup, the computation of the Similarity features consumes most of the time. The frame rate can be doubled if we give up the translational invariance and just compute the correlation of an image pair once (without maximizing the correlation over all possible translations).
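The trade-off in the last point can be sketched as follows (illustrative Python, not the thesis code): a single normalized correlation of an image pair is much cheaper than maximizing the correlation over all small translations, which is what provides the translational invariance.

import numpy as np

def ncc(a, b):
    # normalized cross-correlation of two equally sized image patches
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float((a * b).mean())

def ncc_max_shift(a, b, max_shift=2):
    # maximize the correlation over all translations of up to +/- max_shift pixels
    best = -1.0
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(b, dy, axis=0), dx, axis=1)
            best = max(best, ncc(a, shifted))
    return best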


6. Conclusion

This chapter summarizes our findings from the previous chapters. We compare them to similar findings in the literature and give an outlook on possible future work.

6.1 Summary

In this work we tackle the topic of activity recognition both from a scientific viewpoint as well as from an engineering perspective. We state the relevant problems, their relation to the well-studied discipline of object detection and the unique challenges that they add. We compile a list of should and shall requirements for our implementation. Then we develop definitions for the essential terms used in this work, such as activity, activity recognition and unmanned aerial vehicles. After conducting a review of the relevant literature, we introduce what we call the activity recognition pipeline. It describes the different steps that we take and serves as an orientation throughout this work.

In Chapter 2 we present the theory behind the techniques that we use in our experiments. We describe different preprocessing steps, features, dimensionality reduction techniques and learning machines. Chapter 3 gives an overview of the most important datasets for activity recognition, and we develop a distinction between real-world datasets and simulated datasets. The most important aspect for us is the camera setting, which we describe as (near) vertical, angular or (near) horizontal.

In Chapter 4 we describe in detail which approaches we take and how we implement them. After a short interlude on the chosen methodology, we present our observations and the results. We show that a combination of Histograms of Oriented Gradients, Self-Similarity Maps and Trajectories yields a high recognition accuracy of 63% on 10 activities, combined with relatively low computational requirements such as RAM or CPU runtime. The result depends mostly on the duration and dimensionality of the video blocks that we use to classify activities. Although SVMs and Random Forests work equally well in some experiments, we ultimately choose Random Forests for slightly higher accuracy, more stable run-time performance, direct multi-class support and their simplicity.

We try to solve the segmentation problem of unaligned blocks using Hough Forests, but due to the already low performance in single-frame classification, we do not see results that are remarkably better than chance-level. We also discover that hypothesis aggregation generally performs worse than single-block detection and that activity alignment, despite slight improvements on HOG features, is not worth the extra effort of manual annotation. These observations, as well as the much lower accuracy of instance-based setups, make it clear that our block-based approaches are superior to their instance-based counterparts. We were able to implement a live system of our proposed approach that runs in real-time. For a real-world application we suggest using only the six best performing activities. In our case they lead to an accuracy of 89%.

Finally, the discussion chapter deals with special issues that we face in activity recognition. It suggests solutions for the previously mentioned segmentation problem, and analyzes the temporal and spatial resolution required for our approach. The key findings are summarized, and to conclude our engineering task, we describe which requirements have been fulfilled. Except for the support for infrared videos, which is part of our future work, all requirements have been met. In the end we give a set of recommendations for the actual implementation of a complete activity recognition system.

6.2 Comparison

We try to compare our results to other publications. Such a comparison is hard to do, since the dataset that we use is relatively recent. It is also not delivered in a state that can be used directly. Instead, preprocessing steps have to be executed by the user, although they are not directly related to the task of activity recognition.

We have found two other works that use the same dataset. [Bilinski and Bremond, 2012] use local spatio-temporal features on the ground subset of the UCF-ARG dataset¹. Although the ground dataset is much simpler than the aerial dataset, they describe it as very challenging due to "different shapes, sizes and ethnicities of people, scale changes, shadows, cloth variations, inter and intra action class speed variations, and different scenarios". The comparison of their system performance on UCF-ARG and the very popular KTH dataset shows that the generalization error is almost 5 times higher on the UCF-ARG dataset, which confirms its difficulty.

Reilly's dissertation ([Reilly, 2012]) uses a bag-of-words model of local spatio-temporal features on the aerial subset of the UCF-ARG dataset. They also suggest reducing the dataset to six activities, although they do not mention which these are. Preprocessing steps like registration (what we call stabilization), motion detection, tracking and blob extraction are performed just as in our work. Although their work concentrates on one-shot learning, they provide baseline results for a regular multi-shot scenario. They do not use simple performance metrics like accuracy, but define three metrics based on confusion matrices. The simplest of these metrics (D1i) is just the average of the diagonal elements of the confusion matrix. It is 57.5% using their method and about 86.6% using our system on the simplified 6-class dataset. However, they do not mention how many frames they use per video, only that they use a 1000-dimensional histogram of clustered word frequencies. We therefore assume that they are far from real-time processing. It also needs to be taken into account that purely academic papers typically do not include trajectories as a feature, since these are not visual clues from the blob itself. However, from an engineering perspective this is a perfectly valid and reasonable simplification.

Other authors apply approaches that are similar to ours on datasets with horizontal camera settings. [Baumann, 2013] uses simple HOGs and optical flow trajectories. The HOG features are learned on single frames using Random Forests. A majority vote (what we call "mode") is used to aggregate frame-level decisions. The optical flow is extracted from consecutive frames and the trajectories are created by concatenating the flow of all frames in a video. The final class probabilities are obtained by multiplying HOG and OF class probabilities. They show that these features yield state-of-the-art results on the KTH dataset. Note that this approach is very similar to some of our experiments. The difference lies in the amount of different features and in the frame concatenation and hypothesis aggregation techniques. Whereas we concatenate features directly and then classify them, they classify them first, aggregate the results and combine the output probabilities.

¹ They do not mention this explicitly, but judging from the images on page 6 and the statistics of the dataset we assume so.

6.3 Future Work

In this work we have studied possible approaches for airborne activity recognition. This isan open field of study and for datasets that are as realistic as ours, more robust techniqueshave to be found.We have treated preprocessing and activity recognition as two relatively independenttasks. This has its advantages since it allows us to focus on the relatively new fieldof activity recognition and identify promising techniques. However, in the long runboth systems will have to be integrated. Some approaches like weakly stabilized motionfeatures [Park et al., 2013] can even exploit (intentionally) weak stabilization. others likelatent SVMs [Felzenszwalb et al., 2010] are able to model the translation due to badstabilization as a degree of freedom. Even if we think of our proposed pipeline, there ispotential for further integration of preprocessing and activity recognition. Keypoints thatare extracted for stabilization might be reused as local keypoints in activity recognition.The same features and classifiers (such as HOG and Random Forests) that are used forperson detection might also be used for activity recognition. A one-step approach mightreduce the accumulated errors of a multi-step approach. This means that instead ofdiscriminating persons and objects and then detecting the person’s activity, one coulddirectly learn activity-performing persons versus objects.Regarding our features we think that more information can be extracted from the Self-Similarity Maps. [Korner and Denzler, 2013] go deeper into extracting the structure of theSSMs, compared to just using the dominant frequencies of the spectrum. This could leadto further improvement. Despite the bad results that we achieve with optical flow-basedfeatures, we still believe that they should be some of the most important clues in activityrecognition. In our case since we use dense optical flow and dimensionality reduction,the learning machine seems to be looking for the famous needle in the haystack. Insteadboosting techniques used on mid-level motion features [Fathi and Mori, 2008] could beused to extract only the relevant optical flow vectors of arms and legs.Above we mentioned the closed-world assumption that underlies our use of the accuracyas a performance metric. In reality we must assume that most of the time no activityis being executed. The workaround here is to use a threshold above which the systemis triggered. Future work should analyze whether it makes sense to use a rejection classfor non-activities like standing, sitting or lying. Even different kinds of walking (withoutparallel action such as carrying) could be interpreted as a non-action.In this work we often mentioned the problem of invariance to symmetry, scale and rotation.It influenced our choice of the very robust Similarity features. We also introduced activitysubclasses to allow for a more precise training of different articulations of an activity. Wethink that Hidden Markov Models can be used to dynamically estimate the timing of anactivity, but for that we need to define atomic sub-actions, which we were not able to.Further research should be conducted on how to improve the invariance of our system.Our analysis was done on videos captured from the visual light spectrum. For aerialactivity recognition infrared light is equally important. Future work will have to verifywhether the same techniques can be used on different light spectra. 
In this work we often mentioned the problem of invariance to symmetry, scale and rotation. It influenced our choice of the very robust Similarity features. We also introduced activity subclasses to allow for a more precise training of different articulations of an activity. We think that Hidden Markov Models could be used to dynamically estimate the timing of an activity, but this requires the definition of atomic sub-actions, which we were not able to provide. Further research should be conducted on how to improve the invariance of our system.

Our analysis was done on videos captured in the visible light spectrum. For aerial activity recognition, infrared light is equally important. Future work will have to verify whether the same techniques can be used on different light spectra. We assume that some problems, such as separating the person from the background (which is done implicitly in our system), activity recognition on partially occluded persons and ambiguities caused by shadows, should become easier to solve, whereas other problems, like finding salient image gradients or distinguishing different heat sources, might arise.

We think that it is very important for the community to create and publish common datasets for aerial activity recognition. These datasets should include stabilized and ready-to-use videos of annotated activities. The current state of this discipline, with only a handful of papers, each based on its own manually created dataset, limits scientific progress. Direct comparability using the same performance metrics is very important.

Finally, the existing low-level solutions (such as ours) and high-level solutions need to be integrated to serve real-world applications. We need to decide what happens when an activity is triggered. How do we define suspicious behavior, danger and everyday situations? How are the actions of different people related to each other? What are the rules of complex group interaction? Once we are able to answer these questions, activity recognition systems might be able to improve our daily lives.

Bibliography

[Aggarwal and Ryoo, 2011] Aggarwal, J. and Ryoo, M. (2011). Human activity analysis: A review. ACM Comput. Surv., 43(3):16:1–16:43.

[Alahi et al., 2012] Alahi, A., Ortiz, R., and Vandergheynst, P. (2012). FREAK: Fast retina keypoint. In Computer Vision and Pattern Recognition, 2012. CVPR 2012. IEEE Conference on, pages 510–517.

[Altman et al., 2006] Altman, R., Dunker, A., Altman, R., and Hunter, L. (2006). Biocomputing 2007. World Scientific.

[Baumann, 2013] Baumann, F. (2013). Action recognition with HOG-OF features. In Weickert, J., Hein, M., and Schiele, B., editors, Pattern Recognition, volume 8142 of Lecture Notes in Computer Science, pages 243–248. Springer Berlin Heidelberg.

[Bilinski and Bremond, 2012] Bilinski, P. and Bremond, F. (2012). Statistics of pairwise co-occurring local spatio-temporal features for human action recognition. In Computer Vision – ECCV 2012. Workshops and Demonstrations, pages 311–320. Springer.

[Breiman, 1996] Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140.

[Breiman, 2001] Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

[Caesar, 2012] Caesar, H. (2012). Integrating language identification to improve multilingual speech recognition. Idiap-RR Idiap-RR-24-2012, Idiap.

[Chang and Lin, 2011] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[Chaudhry et al., 2009] Chaudhry, R., Ravichandran, A., Hager, G., and Vidal, R. (2009). Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1932–1939.

[Chen et al., 2005] Chen, P.-H., Lin, C.-J., and Scholkopf, B. (2005). A tutorial on nu-support vector machines. Applied Stochastic Models in Business and Industry.

[Cortes and Vapnik, 1995] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

[Cutler and Davis, 2000] Cutler, R. and Davis, L. (2000). Robust real-time periodic motion detection, analysis, and applications. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8):781–796.

[Dalal and Triggs, 2005] Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. IEEE Computer Society Conference on, volume 1, pages 886–893.

[de Sa et al., 2011] de Sa, J. M., Sebastiao, R., and Gama, J. (2011). Tree classifiers based on minimum error entropy decisions. Canadian Journal on Artificial Intelligence, Machine Learning & Pattern Recognition, 2(3).

[Dollar, 2013] Dollar, P. (2013). Piotr's image and video Matlab toolbox (PMT). http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html. Version 3.25.

[Dollar et al., 2009] Dollar, P., Tu, Z., Perona, P., and Belongie, S. (2009). Integral channel features. In BMVC.

[Fathi and Mori, 2008] Fathi, A. and Mori, G. (2008). Action recognition by learning mid-level motion features. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE.

[Fawcett, 2006] Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874.

[Felzenszwalb et al., 2010] Felzenszwalb, P., Girshick, R., McAllester, D., and Ramanan, D. (2010). Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627–1645.

[Filipe and Alexandre, 2013] Filipe, S. and Alexandre, L. (2013). A comparative evaluation of 3D keypoint detectors. In Conf. on Telecommunications - ConfTele, volume 1, pages 145–148.

[Fleet and Weiss, 2006] Fleet, D. and Weiss, Y. (2006). Optical flow estimation. In Paragios, N., Chen, Y., and Faugeras, O., editors, Handbook of Mathematical Models in Computer Vision, pages 237–257. Springer US.

[Flint et al., 2007] Flint, A., Dick, A., and Hengel, A. v. d. (2007). Thrift: Local 3D structure recognition. In Digital Image Computing Techniques and Applications, 9th Biennial Conference of the Australian Pattern Recognition Society on, pages 182–188.

[Freund and Schapire, 1997] Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139.

[Friedman et al., 1977] Friedman, J. H., Bentley, J. L., and Finkel, R. A. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Softw., 3(3):209–226.

[Gall and Lempitsky, 2009] Gall, J. and Lempitsky, V. (2009). Class-specific Hough forests for object detection. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1022–1029.

[Haritaoglu et al., 1999] Haritaoglu, I., Cutler, R., Harwood, D., and Davis, L. (1999). Backpack: detection of people carrying objects using silhouettes. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 1, pages 102–107.

[Hsieh, 2009] Hsieh, W. W. (2009). Machine Learning Methods in the Environmental Sciences: Neural Networks and Kernels. Cambridge University Press, 1 edition.

[Klaser et al., 2008] Klaser, A., Marszałek, M., and Schmid, C. (2008). A spatio-temporal descriptor based on 3D-gradients. In British Machine Vision Conference, pages 995–1004.

[Koggalage and Halgamuge, 2004] Koggalage, R. and Halgamuge, S. (2004). Reducing the number of training samples for fast support vector machine classification. Neural Information Processing - Letters and Reviews, 2(3):57–65.

[Korner and Denzler, 2013] Korner, M. and Denzler, J. (2013). Temporal self-similarity for appearance-based action recognition in multi-view setups. In Wilson, R., Hancock, E., Bors, A., and Smith, W., editors, Computer Analysis of Images and Patterns, volume 8047 of Lecture Notes in Computer Science, pages 163–171. Springer Berlin Heidelberg.

[Laptev et al., 2007] Laptev, I., Caputo, B., Schuldt, C., and Lindeberg, T. (2007). Local velocity-adapted motion events for spatio-temporal recognition. Computer Vision and Image Understanding, 108(3):207–229. Special Issue on Spatiotemporal Coherence for Visual Motion Analysis.

[Leachtenauer and Driggers, 2001] Leachtenauer, J. and Driggers, R. (2001). Surveillance and Reconnaissance Imaging Systems: Modeling and Performance Prediction. Artech House optoelectronics library. Artech House.

[Lin and Lin, 2003] Lin, K.-M. and Lin, C.-J. (2003). A study on reduced support vector machines. Neural Networks, IEEE Transactions on, 14(6):1449–1459.

[Lindeberg, 1997] Lindeberg, T. (1997). Linear spatio-temporal scale-space. In Haar Romeny, B., Florack, L., Koenderink, J., and Viergever, M., editors, Scale-Space Theory in Computer Vision, volume 1252 of Lecture Notes in Computer Science, pages 113–127. Springer Berlin Heidelberg.

[Lobo et al., 2008] Lobo, J. M., Jimenez-Valverde, A., and Real, R. (2008). AUC: a misleading measure of the performance of predictive distribution models. Global Ecology and Biogeography, 17(2):145–151.

[Lowe, 1999] Lowe, D. (1999). Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1150–1157.

[Lowe, 2004] Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110.

[Niebles et al., 2006] Niebles, J. C., Wang, H., and Fei-Fei, L. (2006). Unsupervised learning of human action categories using spatial-temporal words. In Proc. BMVC.

[Oh et al., 2011] Oh, S., Hoogs, A., Perera, A., Cuntoor, N., Chen, C.-C., Lee, J. T., Mukherjee, S., Aggarwal, J. K., Lee, H., Davis, L., Swears, E., Wang, X., Ji, Q., Reddy, K., Shah, M., Vondrick, C., Pirsiavash, H., Ramanan, D., Yuen, J., Torralba, A., Song, B., Fong, A., Roy-Chowdhury, A., and Desai, M. (2011). A large-scale benchmark dataset for event recognition in surveillance video. In CVPR.

[Park et al., 2013] Park, D., Zitnick, C. L., Ramanan, D., and Dollar, P. (2013). Exploring weak stabilization for motion feature extraction. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2882–2889. IEEE.

[Platt, 1999] Platt, J. C. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press.

[Reilly, 2012] Reilly, V. (2012). Detecting, Tracking and Recognizing Activities in Aerial Video. PhD thesis, University of Central Florida.

[Rokach and Maimon, 2005] Rokach, L. and Maimon, O. (2005). Decision trees. In Maimon, O. and Rokach, L., editors, Data Mining and Knowledge Discovery Handbook, pages 165–192. Springer US.

[Rosten and Drummond, 2006] Rosten, E. and Drummond, T. (2006). Machine learning for high-speed corner detection. In Proceedings of the 9th European Conference on Computer Vision - Volume Part I, ECCV'06, pages 430–443, Berlin, Heidelberg. Springer-Verlag.

[Scholkopf et al., 1999] Scholkopf, B., Smola, A., and Müller, K.-R. (1999). Kernel principal component analysis. In Advances in Kernel Methods - Support Vector Learning, pages 327–352. MIT Press.

[Scovanner et al., 2007] Scovanner, P., Ali, S., and Shah, M. (2007). A 3-dimensional SIFT descriptor and its application to action recognition. In Proceedings of the 15th International Conference on Multimedia, pages 357–360.

[Shlens, 2005] Shlens, J. (2005). A tutorial on principal component analysis. Systems Neurobiology Laboratory, Salk Institute for Biological Studies.

[Solem, 2012] Solem, J. (2012). Programming Computer Vision with Python: Tools and algorithms for analyzing images. O'Reilly Media.

[Spriggs et al., 2009] Spriggs, E. H., De La Torre, F., and Hebert, M. (2009). Temporal segmentation and activity classification from first-person sensing. In Computer Vision and Pattern Recognition Workshops, 2009. CVPR Workshops 2009. IEEE Computer Society Conference on, pages 17–24. IEEE.

[Thota et al., 2013] Thota, S. D., Vemulapalli, K. S., Chintalapati, K., and Gudipudi, P. S. S. (2013). Comparison between the optical flow computational techniques. International Journal of Engineering Trends and Technology, 4:4507.

[Torr and Zisserman, 2000] Torr, P. H. S. and Zisserman, A. (2000). MLESAC: A new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding, 78(1):138–156.

[Yao et al., 2010] Yao, A., Gall, J., and Van Gool, L. (2010). A Hough transform-based voting framework for action recognition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2061–2068.

[Zambon et al., 2006] Zambon, M., Lawrence, R., Bunn, A., and Powell, S. (2006). Effect of alternative splitting rules on image processing using classification tree analysis. Photogrammetric Engineering & Remote Sensing, 72(1):25–30.
