Pedestrian Detection and Tracking

Papers:

• Pfinder: Real-Time Tracking of the Human Body, Wren, C., Azarbayejani, A., Darrell, T., and Pentland, A.

• Tracking and Labelling of Interacting Multiple Targets, J. Sullivan and S. Carlsson

This talk will cover two distinct tracking algorithms:

• Pfinder: Real-Time Tracking of the Human Body

• Multi-target tracking and labeling

For each of them we will present:

• Motivation and previous approaches

• Review of relevant techniques

• Algorithm details

• Applications and demos

There is always a major trade-off between generality and accuracy.

Because we know we are trying to identify and track human beings, we can start making assumptions about our objects.

If we have more specific information (example: tracking players in a football game), we can add even more specific assumptions.

These kinds of assumptions help us track more accurately.

Tracking Algorithm #1

Pfinder: Real-Time Tracking of the Human Body

Motivation

Introduction

• Pfinder is a tracking algorithm that:

– Detects human motion in real time.

– Segments the person’s body.

– Analyzes internal features (head, body, hands, and feet).

Many tracking algorithms use a static model: for each frame, similar pixels are searched for in the vicinity of the bounding box from the previous frame. We will use a dynamic model, one that learns over time.

Most tracking algorithms need some user input for initialization. The presented algorithm initializes automatically.

Covariance

For a domain of dimension $n$, we define the sampling domain’s variables $x_1, \dots, x_n$.

The covariance of two variables $x_i, x_j$ is defined:

$\operatorname{cov}(x_i, x_j) = E\left[(x_i - \mu_i)(x_j - \mu_j)\right]$, where $\mu_i = E[x_i]$.

The covariance of two variables is a measure of how much the two variables change together.
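As a small illustration of the covariance definition, here is a minimal Python sketch. The function names and the sample values are made up for the example, and it uses the population form (dividing by the number of samples).

```python
def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance E[(x - mu_x)(y - mu_y)] of two equal-length samples."""
    mu_x, mu_y = mean(xs), mean(ys)
    return sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / len(xs)

# Illustrative samples: b grows with a, c shrinks as a grows.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]
c = [8.0, 6.0, 4.0, 2.0]

cov_ab = covariance(a, b)   # positive: the variables rise together
cov_ac = covariance(a, c)   # negative: one rises while the other falls
var_a = covariance(a, a)    # covariance of a variable with itself is its variance
```

A positive value means the two variables tend to move in the same direction, a negative value means they move in opposite directions.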

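The covariance matrix (marked $\Sigma$) is defined:

$\Sigma_{ij} = \operatorname{cov}(x_i, x_j)$

The normal distribution of a variable $x$ with mean $\mu$ and variance $\sigma^2$ is defined:

$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$

The more general multivariate distribution is defined:

$p(x_1, \dots, x_N) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$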

Mahalanobis distance: the distance measured from a sample vector $x = (x_1, \dots, x_N)^T$ to a group of samples with mean $\mu$ and covariance matrix $S$ is defined:

$D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}$
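To make the Mahalanobis distance concrete, the following sketch computes it for 2D samples, using the closed-form inverse of a 2x2 covariance matrix. It is only an illustration: the function names and sample points are invented here, and a real implementation would handle any dimension.

```python
import math

def mean_vector(samples):
    n = len(samples)
    return [sum(s[d] for s in samples) / n for d in range(len(samples[0]))]

def covariance_matrix(samples):
    """Entry [i][j] is the population covariance of coordinates i and j."""
    mu, n = mean_vector(samples), len(samples)
    dim = len(mu)
    return [[sum((s[i] - mu[i]) * (s[j] - mu[j]) for s in samples) / n
             for j in range(dim)] for i in range(dim)]

def mahalanobis_2d(x, mu, cov):
    """sqrt((x - mu)^T S^-1 (x - mu)) for a 2x2 covariance matrix S."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # The inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    quad = (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det
    return math.sqrt(quad)

# A cloud that is twice as wide along x as along y.
samples = [(0.0, 0.0), (4.0, 0.0), (0.0, 2.0), (4.0, 2.0)]
mu = mean_vector(samples)
S = covariance_matrix(samples)

along_x = mahalanobis_2d((4.0, 1.0), mu, S)  # 2 units from the mean along the wide axis
along_y = mahalanobis_2d((2.0, 3.0), mu, S)  # 2 units from the mean along the narrow axis
```

Both test points are at Euclidean distance 2 from the mean, but the one displaced along the narrow axis gets the larger Mahalanobis distance, because the samples vary less in that direction.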

1. (Automatic) initialization

Background is modeled from a few seconds of video in which the person does not appear. When the person enters the scene, he is detected and modeled.

2. The analysis loop

After the background and person models are initialized, each pixel in the next frame is checked against all models.

The first step in the algorithm is to build a preliminary representation of the person and the surrounding scene.

First we need to acquire a video sequence of the scene that does not contain a person, in order to model the background.

The algorithm assumes a mostly-static background.

However, it needs to be robust to illumination changes and able to recover from changes in the scene (e.g. a book that was moved from one place to another).

The images in the video use the YUV color representation (Y = luminance component, UV = chrominance components). A transformation matrix converts the RGB representation to YUV.

The algorithm models the background by fitting to each pixel a Gaussian that describes the pixel’s mean and distribution.
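The RGB-to-YUV transformation can be sketched as a 3x3 matrix multiplication. The slides do not say which matrix is used; the coefficients below are the common BT.601 ones and are an assumption for this example, as is the function name.

```python
# Assumed BT.601 analog YUV coefficients; the talk does not specify the matrix.
RGB_TO_YUV = (
    (0.299, 0.587, 0.114),         # Y: luminance
    (-0.14713, -0.28886, 0.436),   # U: blue-difference chrominance
    (0.615, -0.51499, -0.10001),   # V: red-difference chrominance
)

def rgb_to_yuv(rgb):
    """Convert one (R, G, B) pixel with components in [0, 1] to (Y, U, V)."""
    return tuple(sum(coef * value for coef, value in zip(row, rgb))
                 for row in RGB_TO_YUV)

white = rgb_to_yuv((1.0, 1.0, 1.0))
gray = rgb_to_yuv((0.5, 0.5, 0.5))
```

White and gray differ only in Y while their chrominance stays near zero. Separating brightness from color in this way is what lets a per-pixel color model tolerate some changes in illumination.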

We do this by measuring the pixel’s YUV mean and distribution over time.

A pixel has some YUV value in this frame, and in the next frame it might change, so we mark its mean as $\mu_0(x, y)$ and its covariance matrix as $K_0(x, y)$.

After the scene has been modeled, Pfinder watches for large deviations from this model.

This is done by measuring the Mahalanobis distance in color space between the new pixel’s value and the scene model values at the corresponding location.

If the distance is large enough and the change is visible over a sufficient number of pixels, we begin to build a model of a person.
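The background modeling and deviation test described above could look roughly like the sketch below. It is a simplification and not Pfinder's actual code: it assumes a diagonal covariance per pixel (independent Y, U, V), and the variance floor, threshold and pixel-count parameters are hypothetical values chosen for the example.

```python
MIN_VARIANCE = 1.0  # hypothetical floor so a perfectly constant pixel cannot divide by zero

def fit_background(frames):
    """Fit a per-pixel (mean, variance) model from person-free frames.
    Each frame is a list of rows; each pixel is a (Y, U, V) tuple."""
    count = len(frames)
    model = []
    for r in range(len(frames[0])):
        model_row = []
        for c in range(len(frames[0][0])):
            samples = [frame[r][c] for frame in frames]
            mean = [sum(s[k] for s in samples) / count for k in range(3)]
            var = [max(sum((s[k] - mean[k]) ** 2 for s in samples) / count, MIN_VARIANCE)
                   for k in range(3)]
            model_row.append((mean, var))
        model.append(model_row)
    return model

def squared_distance(pixel, mean, var):
    """Squared Mahalanobis distance under the assumed diagonal covariance."""
    return sum((p - m) ** 2 / v for p, m, v in zip(pixel, mean, var))

def changed_pixels(frame, model, threshold=9.0):
    """Coordinates of pixels that deviate strongly from the background model."""
    return [(r, c) for r, row in enumerate(frame) for c, pixel in enumerate(row)
            if squared_distance(pixel, *model[r][c]) > threshold]

def person_entered(frame, model, min_pixels=1, threshold=9.0):
    """True when enough pixels deviate from the background model."""
    return len(changed_pixels(frame, model, threshold)) >= min_pixels

# Two tiny 2x2 person-free frames, then a frame where one pixel changes a lot.
empty_a = [[(100.0, 0.0, 0.0)] * 2 for _ in range(2)]
empty_b = [[(102.0, 0.0, 0.0)] * 2 for _ in range(2)]
model = fit_background([empty_a, empty_b])
new_frame = [[(150.0, 20.0, 0.0), (101.0, 0.0, 0.0)],
             [(101.0, 0.0, 0.0), (101.0, 0.0, 0.0)]]
```

In practice `min_pixels` would be much larger than 1, so that isolated noisy pixels do not trigger the person model.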

The algorithm represents the detected person’s body parts using blobs.

Blobs are 2D representations of a Gaussian distribution of the spatial statistics.

Also, a support map is built for each blob $k$:

$s_k(x, y) = \begin{cases} 1 & (x, y) \in k \\ 0 & \text{otherwise} \end{cases}$

To initialize the blob models, Pfinder uses a 2D contour shape analysis that attempts to identify the head, hands, and feet location.

A blob is created for each identified location.

The class analyzer finds the location of body features by using statistics from their position and color in the previous frames.

Because no statistics have been gathered yet (these are the first frames in which the person appears), the algorithm uses ready-made statistical priors.

Hand and face blobs have strong flesh-colored color priors (it appears that normalized skin color is constant across different skin pigmentation levels).

The other blobs are initialized to cover the clothing regions

The contour analyzer can find features in a single frame, but the results tend to be noisy.

The class analyzer produces accurate results, but it depends on the stability of the underlying models (i.e. no occlusion).

A blend of contour analysis and class model is used to find the features in the next frame.

[Figure: original image and its extracted contour]

After the initialization step of the algorithm, the information is now divided into scene and person models.

• The scene (background) model consists of the color space distribution for each pixel.

• The person model consists of the spatial space and color space distribution for each blob.

– The spatial space determines the blob’s location and size.

– The color space determines the distribution of color in the blob.

Given a person model and a scene model, we can now acquire a new image, interpret it, and update the scene and person models.

1. Update the spatial model associated with each blob using the blob’s measured statistics, to yield the blob’s predicted spatial distribution for the current image. This is done with a Kalman filter assuming simple Newtonian dynamics.

Measurements taken from a video sequence can be very inaccurate.

Without some kind of filtering it would be impossible to make any short-term forward predictions.

Also, each measurement is used as a seed for the tracking algorithm at the next frame.

Some kind of filtering is needed to make the measurements more accurate.

Each tracked object is represented with a state vector (usually location)

With each new frame, a linear operator is applied to the state to generate the new state, with some noise mixed in, and some information from the controls on the system.

Usually, Newton’s laws are applied.

The noise added is a Gaussian noise with mean 0 and a covariance matrix.

The predicted state is then updated with the real measurement to create the estimate for the next frame.
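The predict/update cycle described above can be sketched for a single coordinate with a constant-velocity model. This is an illustrative filter, not the one used in Pfinder: the class name, the noise values, and the restriction to one axis with no control input are all assumptions made for the example.

```python
class ConstantVelocityKalman:
    """Kalman filter with state [position, velocity] along one axis."""

    def __init__(self, position, velocity=0.0, process_noise=1e-2, measurement_noise=1.0):
        self.x = [position, velocity]
        self.P = [[1.0, 0.0], [0.0, 1.0]]  # state covariance
        self.q = process_noise             # added to the diagonal of P at each prediction
        self.r = measurement_noise         # variance of the position measurement

    def predict(self, dt=1.0):
        """Apply the linear (Newtonian) operator F = [[1, dt], [0, 1]]."""
        p, v = self.x
        (a, b), (c, d) = self.P
        self.x = [p + v * dt, v]
        # P <- F P F^T + q * I
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.x[0]

    def update(self, measured_position):
        """Correct the prediction with a position measurement (H = [1, 0])."""
        (a, b), (c, d) = self.P
        s = a + self.r              # innovation variance
        k0, k1 = a / s, c / s       # Kalman gain
        residual = measured_position - self.x[0]
        self.x = [self.x[0] + k0 * residual, self.x[1] + k1 * residual]
        # P <- (I - K H) P
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]
        return self.x[0]

# Track an object that moves 2 pixels per frame, starting from a wrong velocity guess.
kf = ConstantVelocityKalman(position=0.0, velocity=0.0)
for frame in range(1, 51):
    kf.predict()
    kf.update(2.0 * frame)
estimated_position, estimated_velocity = kf.x
```

After a few frames the velocity estimate approaches the true 2 pixels per frame, so the prediction for the next frame becomes a good seed for the search.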

2. Now, when a new image is acquired, we measure the likelihood of each pixel being a member of each of the blob models and the scene model. The vector $p = (x, y, Y, U, V)$ is defined as the location and color of each pixel. For each class $k$, with mean $\mu_k$ and covariance matrix $K_k$, the log likelihood is measured:

$d_k = -\frac{1}{2}(p - \mu_k)^T K_k^{-1} (p - \mu_k) - \frac{1}{2}\ln|K_k| - \frac{m}{2}\ln(2\pi)$

where $m$ is the dimension of $p$.

3. Each pixel is now assigned to a particular class: either one of the blobs or the background. A support map is built which indicates which pixel belongs to which class:

$s(x, y) = \arg\max_k d_k(x, y)$
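The log likelihood and the arg-max assignment can be sketched as follows. To keep the example short it assumes a diagonal covariance for every class, so the determinant and inverse reduce to per-component variances. The class names and all numeric values are invented, and the scene is treated as just another class with a very wide spatial spread, which is a simplification.

```python
import math

def log_likelihood(p, mean, variances):
    """d_k for a Gaussian class with a diagonal covariance (an assumption of this sketch)."""
    m = len(p)
    mahalanobis = sum((pi - mi) ** 2 / vi for pi, mi, vi in zip(p, mean, variances))
    log_det = sum(math.log(v) for v in variances)
    return -0.5 * mahalanobis - 0.5 * log_det - 0.5 * m * math.log(2 * math.pi)

def classify(p, classes):
    """Return the name of the class with the highest log likelihood for pixel p."""
    return max(classes, key=lambda name: log_likelihood(p, *classes[name]))

# Each class is (mean, variances) over the vector (x, y, Y, U, V).
classes = {
    "head": ((50.0, 20.0, 180.0, 10.0, 20.0), (25.0, 25.0, 100.0, 25.0, 25.0)),
    "hand": ((30.0, 80.0, 180.0, 10.0, 20.0), (25.0, 25.0, 100.0, 25.0, 25.0)),
    "scene": ((64.0, 48.0, 90.0, 0.0, 0.0), (1e4, 1e4, 100.0, 25.0, 25.0)),
}

near_head = classify((48.0, 22.0, 178.0, 9.0, 21.0), classes)
near_hand = classify((31.0, 79.0, 181.0, 11.0, 19.0), classes)
background = classify((10.0, 10.0, 92.0, 1.0, -1.0), classes)
```

The head and hand classes share the same skin color here, so it is the spatial part of the vector that separates them.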

Connectivity constraints are enforced by iterative morphological growing from a single central point, to produce a connected region.

First, a foreground region comprised of all the blob classes is grown.

Then, each of the individual blobs is grown, with the constraint that it remains confined to the foreground region.
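The idea of growing a connected region from a single seed can be sketched with a breadth-first flood fill. This is a stand-in for the iterative morphological growing mentioned above, not the same operation; the function name and the 4-neighbour connectivity are choices made for this example.

```python
from collections import deque

def grow_region(mask, seed):
    """Return the set of (row, col) cells reachable from the seed through
    4-neighbour steps over cells where mask is truthy."""
    rows, cols = len(mask), len(mask[0])
    if not mask[seed[0]][seed[1]]:
        return set()
    region, frontier = {seed}, deque([seed])
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and mask[nr][nc] and (nr, nc) not in region):
                region.add((nr, nc))
                frontier.append((nr, nc))
    return region

# 1 marks pixels classified as foreground; the right-hand column is a detached fragment.
foreground = [
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
]
person = grow_region(foreground, (0, 0))
```

Growing an individual blob would use the same routine with a mask that is true only where the pixel belongs both to that blob's class and to the grown foreground region.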

4. Now the statistical model for each class is updated. For the blob classes, the new mean and covariance are calculated:

$\mu_k = E[p]$

$K_k = E\left[(p - \mu_k)(p - \mu_k)^T\right]$

The Kalman filter statistics are also updated at this time.

Background pixels are also updated, so that the model can recover from changes in the scene.
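One simple way to let background pixels recover from scene changes is an exponential running average of each pixel's mean. The slides do not give the update rule, so the blending form and the `alpha` parameter below are assumptions for illustration.

```python
def update_background_mean(mean, observed, alpha=0.1):
    """Blend the newly observed (Y, U, V) value into the stored mean.
    alpha is a hypothetical adaptation rate: larger values adapt faster."""
    return tuple(alpha * o + (1 - alpha) * m for o, m in zip(observed, mean))

# A book is moved: the pixel now shows a brighter surface in every frame.
mean = (100.0, 0.0, 0.0)
for _ in range(3):
    mean = update_background_mean(mean, (200.0, 0.0, 0.0), alpha=0.5)
```

Only pixels currently assigned to the scene class would be updated this way, so that the person is not gradually absorbed into the background.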

The algorithm employs several domain-specific assumptions in order to track accurately.

• If one of the assumptions breaks, the system degrades.

• However, the system can recover after a few frames if the assumptions hold again.

• The system can track only a single person.

RMS (Root Mean Square) errors were found to be on the order of a few pixels:

Test               | Hand                        | Arm
Translation (X, Y) | 0.7 pixels (0.2% relative)  | 2.1 pixels (0.8% relative)
Rotation (θ)       | 4.8 degrees (5.2% relative) | 3.0 degrees (3.1% relative)

A Modular Interface - An application that provides programmers tracking, segmentation and feature detection.

The ALIVE application places 3D animated characters that interact with the person according to his gestures.

Here, Rexy!

The SURVIVE application uses the recorded movement of the person to navigate a 3D virtual game environment.

I guess you can’t get any nerdier than this.

Recognition of American Sign Language: Pfinder was used as a pre-processing step for recognizing a 40-word subset of ASL. It had 99% sign accuracy.

Avatars and Telepresence: The model of the person is translated to several blobs, which can be used to model 2D characters.

Tracking Algorithm #2

Multi-Target Tracking and Labeling

Uses slides by Josephine Sullivan, from http://www.csc.kth.se/~sullivan/

Motivation

Introduction

• The multi-target tracking and labeling algorithm:

– Tracks multiple targets over long periods of time.

– Recovers robustly from collisions.

– Labels targets even when they are interacting.

Multi Tracking and Labeling

Sometimes Easy Sometimes Hard

The algorithm addresses the problem of the surveillance and tracking of multiple persons over a wide area.

Previous multi-target tracking algorithms are based on Kalman filtering and advanced techniques of particle filtering.

Tracking algorithms often fail if occlusion or interaction between the targets occurs.

This work’s specific goal is to track and label the players in a football game.

This is especially hard when players collide and interact

The researchers used a wide-screen video which was produced using the video from four calibrated cameras.

The images were stitched after the homography between the images was computed.

This produces a high-resolution video which gives good tracking results

1. Background modeling and subtraction

2. Build an interaction graph

3. Resolve split/merge situations

4. Recover identities of temporally separated player trajectories

A probabilistic model of the image gradient of each pixel in the background is obtained.

The gradient is used to handle situations where the player’s uniform has the same color as the background.

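Let $g_t(x)$ denote the image gradient at pixel $x$ in frame $t$.

Each background pixel is modeled by a mixture of three bivariate normal distributions with means $\mu_i(x)$ and covariance matrices $\Sigma_i(x)$:

$p(g(x)) = \sum_{i=1}^{3} \pi_i(x)\, \mathcal{N}\left(g(x);\, \mu_i(x),\, \Sigma_i(x)\right)$

where $\pi_i(x) \ge 0$ and $\sum_{i=1}^{3} \pi_i(x) = 1$.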

A pixel $x$ in frame $t$ is considered a foreground pixel if the distance $(g_t(x) - \mu(x))^T \Sigma(x)^{-1} (g_t(x) - \mu(x))$ between its gradient and the background model is larger than a threshold. Let $FG_t$ be the set of foreground pixels at time $t$. Let $BG_t$ be the set of background pixels at time $t$.

Connected components are then identified and processed by deleting small connected components or joining them to larger neighboring ones. This is done to make sure that each connected component corresponds to at least one whole player.

The set of ellipses representing the connected components detected (marked by bounding boxes) is defined:

$E_t = \{E_t^i\}_{i=1}^{n_t}$

with $n_t$ being the number of ellipses detected in frame $t$.

The first aim is to put the ellipses in $E_t$ and $E_{t+1}$ in correspondence.

Definition: ellipses $E_t^i$ and $E_{t+1}^j$ are an exact match if their size and orientation are sufficiently similar and the distance between their centers is sufficiently small.

Define a relation $\sim$:

• $E_t^i \sim E_{t+1}^j$ if $E_t^i$ and $E_{t+1}^j$ are an exact match.

• If no such exact match exists for $E_t^i$ in $E_{t+1}$, then $E_t^i \sim E_{t+1}^j$ if $\operatorname{Area}(E_t^i \cap E_{t+1}^j) > 0$ and $E_{t+1}^j$ has no exact match in $E_t$.

Define forward and backward mappings:

• Forward mapping: $j \in F_t(i) \iff E_t^i \sim E_{t+1}^j$

• Backward mapping: $k \in B_t(i) \iff i \in F_t(k)$
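The correspondence step can be sketched as below. To keep the geometry short, each ellipse is approximated by an axis-aligned box (center, width, height) and orientation is ignored. That approximation, the `Ellipse` type, and the `max_shift` and `size_tol` thresholds are assumptions of this example, not values from the paper.

```python
import math
from typing import NamedTuple

class Ellipse(NamedTuple):
    cx: float
    cy: float
    width: float
    height: float

def exact_match(a, b, max_shift=5.0, size_tol=0.2):
    """Centers are close and sizes are similar (orientation ignored in this sketch)."""
    close = math.hypot(a.cx - b.cx, a.cy - b.cy) <= max_shift
    similar = (abs(a.width - b.width) <= size_tol * a.width
               and abs(a.height - b.height) <= size_tol * a.height)
    return close and similar

def overlaps(a, b):
    """Stand-in for Area(a ∩ b) > 0, using the bounding boxes."""
    return (abs(a.cx - b.cx) < (a.width + b.width) / 2
            and abs(a.cy - b.cy) < (a.height + b.height) / 2)

def forward_mapping(current, following):
    """F_t: index in frame t -> set of related indices in frame t+1."""
    has_exact = [any(exact_match(e, f) for e in current) for f in following]
    mapping = {}
    for i, e in enumerate(current):
        exact = {j for j, f in enumerate(following) if exact_match(e, f)}
        mapping[i] = exact or {j for j, f in enumerate(following)
                               if overlaps(e, f) and not has_exact[j]}
    return mapping

def backward_mapping(forward, n_following):
    """B_t: index in frame t+1 -> set of indices in frame t that map to it."""
    return {j: {i for i, targets in forward.items() if j in targets}
            for j in range(n_following)}

# Two players run into each other while a third keeps moving on his own.
frame_t = [Ellipse(10, 10, 4, 8), Ellipse(30, 10, 4, 8), Ellipse(60, 40, 4, 8)]
frame_t1 = [Ellipse(20, 10, 26, 8), Ellipse(61, 41, 4, 8)]
F = forward_mapping(frame_t, frame_t1)
B = backward_mapping(F, len(frame_t1))
```

Here the first two ellipses both map to the large merged ellipse, while the third has an exact match in the next frame.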

With the forward and backward mappings, we can define events at each frame. Each event is identified by a signal in the mappings:

• Split

• Merge

• Disappear

• Appear

• Stable, signaled by $|F_t(i)| = |B_t(F_t(i))| = 1$
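Given forward and backward mappings stored as dictionaries of sets, the events could be detected roughly as follows. Only the stable signal survives in the slides; the conditions used here for split, merge, appear and disappear are one natural reading of the event names and should be treated as assumptions.

```python
def classify_events(forward, backward):
    """forward: index in frame t -> set of indices in frame t+1.
    backward: index in frame t+1 -> set of indices in frame t.
    Returns a list of (event, index) pairs."""
    events = []
    for i, targets in sorted(forward.items()):
        if not targets:
            events.append(("disappear", i))  # assumed: nothing follows ellipse i
        elif len(targets) > 1:
            events.append(("split", i))      # assumed: ellipse i continues as several
    for j, sources in sorted(backward.items()):
        if not sources:
            events.append(("appear", j))     # assumed: ellipse j has no predecessor
        elif len(sources) > 1:
            events.append(("merge", j))      # assumed: several ellipses end up in j
    for i, targets in sorted(forward.items()):
        if len(targets) == 1:
            (j,) = targets
            if len(backward[j]) == 1:
                events.append(("stable", i))  # |F(i)| = |B(F(i))| = 1
    return events

forward = {0: {0}, 1: {0}, 2: {1}, 3: set()}
backward = {0: {0, 1}, 1: {2}, 2: set()}
events = classify_events(forward, backward)
```

Split and disappear are reported with the index of the ellipse in frame t, merge and appear with the index in frame t+1.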

A maximal sequence of stable events sandwiched between non-stable events is termed a track.

A player track is a track that corresponds to exactly one player

If the event sequence is track → split or merge → track, then the track involves multiple players.

If the event sequence is {split, appear} → track → {merge, disappear}, then the track may be a player track.

If such track is long enough and ellipse size is not too big, it is considered a player track.

Other tracks are called multiple-player tracks.

Because we’re dealing with a football game, we know that players are divided into 3 categories: Team A, Team B and officials.

This will help us in cases where players from different teams appear in multiple-player tracks.

Given the labeling of the tracks and their interactions through merging and splitting, the game can be summarized by a graph structure called target interaction graph.

White and gray nodes correspond to team A / team B player tracks. Black nodes correspond to multiple-player tracks. This graph is a small section of the ~5000-node graph describing 10 minutes of analyzed gameplay.

By examining the player interaction graph, it is possible to isolate situations where n player tracks merge and then split into n player tracks.

These merge-split situations are resolved by finding correspondence between input and output tracks.

Input and output tracks are each a set of n tracks.

We wish to find the assignment of the inputs to the outputs. It is a bijective mapping

$M : \{1, \dots, n\} \to \{1, \dots, n\}$,

where $M(i) = j$ implies that input track $i$ and output track $j$ are the same player.

Not all assignments are physically possible.

For each valid assignment, we estimate the intermediate tracks by exploiting the properties of maintaining continuity of motion and relative depth ordering.

We investigate whether any of the intermediary tracks can be described by a constant-velocity motion model. This is done by linearly interpolating between the last ellipse of the input track $T_i$ and the first ellipse of the output track $T_{M(i)}$.

If there is sufficient image data to support this, the penalty for this estimation is 0.

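The overall estimate for each assignment is scored:

$Sc(M) = \sum_i \left[ Dist(T_i, T_{M(i)}) + Pen(T_i, T_{M(i)}) \right]$

where $Dist(T_i, T_{M(i)})$ is the distance traveled during the hypothesized trajectory, and

$Pen(T_i, T_{M(i)}) = \begin{cases} 1 & \text{if the hypothesized trajectory is not consistent with the relevant image data} \\ 0 & \text{otherwise} \end{cases}$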
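Scoring every bijective assignment can be sketched by enumerating permutations. In this illustration each track is reduced to one point (the last position of an input track, the first position of an output track), the distance is the straight line between them, and the penalty is supplied by the caller. The function names and the `max_jump` validity test are assumptions of the example.

```python
import math
from itertools import permutations

def score(assignment, inputs, outputs, penalty):
    """Sum of straight-line distance plus penalty over all pairs i -> assignment[i]."""
    return sum(math.dist(inputs[i], outputs[j]) + penalty(i, j)
               for i, j in enumerate(assignment))

def best_assignment(inputs, outputs, penalty, max_jump=float("inf")):
    """Return (score, assignment) with the minimum score, or None if none is valid.
    Assignments needing a jump longer than max_jump are treated as physically impossible."""
    best = None
    for assignment in permutations(range(len(outputs))):
        if any(math.dist(inputs[i], outputs[j]) > max_jump
               for i, j in enumerate(assignment)):
            continue
        s = score(assignment, inputs, outputs, penalty)
        if best is None or s < best[0]:
            best = (s, assignment)
    return best

def no_penalty(i, j):
    return 0.0

def image_disagrees(i, j):
    # Pretend the image data contradicts input 0 continuing as output 1.
    return 100.0 if (i, j) == (0, 1) else 0.0

inputs = [(0.0, 0.0), (10.0, 0.0)]    # last positions before the merge
outputs = [(12.0, 1.0), (1.0, 1.0)]   # first positions after the split
plain = best_assignment(inputs, outputs, no_penalty)
penalized = best_assignment(inputs, outputs, image_disagrees)
```

Exhaustive enumeration grows as n!, which is only practical for the small number of targets involved in a typical merge.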

If the minimum-score assignment is explained solely by linear interpolation, and its score is lower than a threshold, then we accept this assignment.

Otherwise, we repeat this process at constant time intervals. This is called relative depth ordering.

Intermediary tracks that cannot be explained by simple linear interpolation are analyzed every $m$th frame in the interval between the merge and the split.

Starting with the first interval, we define a region as the union of all ellipses and try to interpolate over smaller distances.

The aim at each interval is to maximize the intersection of this region with the foreground pixels and minimize its intersection with the background pixels.

Again, the penalty is set to 1 if the mentioned intersection is not consistent.

Then the score is re-calculated and the minimum scored assignment is chosen.

This process was found to work when the number of merging targets was at most 5.

Nonetheless, the examined sequence contained roughly 200 merge-split situations, of varying complexity, all resolved.

At this step, it is interesting to see how frequently a player was assigned a player track.

Not all split/merge situations were accurately resolved.

Usually, other features can be used to resolve the identity of player tracks.

In a football game, a player’s identity can be obtained from his position relative to his teammates.

The easiest example is the goalkeeper who is always behind his teammates.

We can look at the problem as a partitioning problem

This is specific to a football game, but a variation can be used for other applications.

The feature vector for each player $i \in \{1, \dots, 11\}$ at frame $t$ is $v_t^i = (r_t^i, l_t^i, f_t^i, b_t^i)$, which counts the number of players in the team to the right of, to the left of, in front of and behind the player.
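Computing the configuration vector from teammate positions can be sketched as below. The coordinate convention is an assumption made for this example: the team attacks toward increasing x, so "in front" means larger x, and with image-style coordinates "right" means larger y.

```python
def configuration(player, teammates):
    """Return (r, l, f, b): teammates to the right, left, in front and behind.
    Assumes the team attacks toward +x and that +y is to the player's right."""
    px, py = player
    r = sum(1 for x, y in teammates if y > py)
    l = sum(1 for x, y in teammates if y < py)
    f = sum(1 for x, y in teammates if x > px)
    b = sum(1 for x, y in teammates if x < px)
    return (r, l, f, b)

# A goalkeeper near his own goal line with three teammates further up the pitch.
keeper = (5.0, 30.0)
others = [(20.0, 10.0), (25.0, 50.0), (40.0, 30.0)]
keeper_config = configuration(keeper, others)
striker_config = configuration((40.0, 30.0), [keeper, (20.0, 10.0), (25.0, 50.0)])
```

The goalkeeper has every teammate in front of him and nobody behind, which is what makes his configuration easy to recognize.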

We assign an index to every possible configuration (feature vector) and for each unlabeled player track, we make a histogram of the configuration over the track’s ellipses.

We start by considering only long player tracks (over 40 seconds).

Build their distance matrix:

The distance between every pair of player tracks is shown. Darker values indicate smaller distances.
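A distance matrix between tracks can be sketched by comparing their configuration histograms. The slides do not state which histogram distance is used, so the L1 distance between normalized histograms below is an assumption, as are the function names and the toy tracks.

```python
from collections import Counter

def histogram(configurations):
    """Normalized histogram of the configurations observed along one track."""
    counts = Counter(configurations)
    total = len(configurations)
    return {cfg: n / total for cfg, n in counts.items()}

def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms (0 = identical, 2 = disjoint)."""
    return sum(abs(h1.get(k, 0.0) - h2.get(k, 0.0)) for k in set(h1) | set(h2))

def distance_matrix(tracks):
    hists = [histogram(t) for t in tracks]
    return [[histogram_distance(a, b) for b in hists] for a in hists]

# Each track is the sequence of (r, l, f, b) configurations over its ellipses.
keeper_early = [(0, 0, 10, 0)] * 4
keeper_late = [(0, 0, 10, 0)] * 3 + [(1, 0, 9, 0)]
striker = [(5, 5, 0, 10)] * 4
D = distance_matrix([keeper_early, keeper_late, striker])
```

The two goalkeeper tracks end up close to each other and far from the striker track, which is the property the clustering relies on when it joins temporally separated tracks of the same player.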

We grow and merge clusters by using player tracks of decreasing lengths. This clustering considers tracks 750 frames long.

Clustering with tracks 250 frames long:

Errors begin to occur.

We’ve seen two algorithms:

• One deals with single-person tracking, the other with multi-target tracking.

• Both algorithms make specific assumptions: the first about the human body and its motion, the other about motion and the conditions of a football game.
