Pedestrian detection and tracking
Post on 11-Jan-2016
Papers:
• Pfinder: Real-Time Tracking of the Human Body, C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland
• Tracking and Labelling of Interacting Multiple Targets, J. Sullivan and S. Carlsson
This talk will cover two distinct tracking algorithms:
• Pfinder: Real-Time Tracking of the Human Body
• Multi-target tracking and labeling
For each of them we will present:
• Motivation and previous approaches
• Review of relevant techniques
• Algorithm details
• Applications and demos
There is always a major trade-off between genericity and accuracy.
Because we know we are trying to identify and track human beings, we can start making assumptions about our objects.
If we have more specific information (for example, when tracking players in a football game), we can add even more specific assumptions.
These kinds of assumptions help us track more accurately.
Tracking Algorithm #1
Pfinder: Real-Time Tracking of the Human Body
Motivation
Introduction
• Pfinder is a tracking algorithm that:
– Detects human motion in real time.
– Segments the person’s body.
– Analyzes internal features (head, body, hands, and feet).
Many tracking algorithms use a static model: for each frame, similar pixels are searched for in the vicinity of the previous frame’s bounding box. We will use a dynamic model, one that learns over time.
Most tracking algorithms need some user input for initialization. The presented algorithm initializes automatically.
Covariance
For a domain of dimension $n$, we define the sampling domain’s variables $x_1, \dots, x_n$.
The covariance of two variables $x_i, x_j$ is defined:
$$\operatorname{cov}(x_i, x_j) = E\left[(x_i - \mu_i)(x_j - \mu_j)\right], \quad \text{where } \mu_i = E[x_i]$$
The covariance of two variables is a measure of how much they change together.
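The covariance definition above can be sketched in a few lines of plain Python. This is only an illustration: the function names are made up for this example, and the expectation is approximated by the average over a finite sample (the population form, dividing by the number of samples).

```python
def mean(values):
    """Sample estimate of E[x]."""
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sample estimate of E[(x - E[x]) * (y - E[y])] for paired observations."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
```

A positive value means the two variables tend to increase together, a negative value means one tends to decrease as the other increases.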
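The covariance matrix (marked $\Sigma$) is defined:
$$\Sigma_{ij} = \operatorname{cov}(x_i, x_j)$$
The normal distribution of a variable $x$ with mean $\mu$ and standard deviation $\sigma$ is defined:
$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$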
The more general multivariate normal distribution is defined:
$$p(x_1, \dots, x_N) = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$
Mahalanobis distance: the distance from a sample vector $x = (x_1, \dots, x_N)^T$ to a group of samples with mean $\mu = (\mu_1, \dots, \mu_N)^T$ and covariance matrix $S$ is defined:
$$D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}$$
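As a rough illustration of the Mahalanobis distance, the sketch below evaluates it in plain Python. The helper names are hypothetical. Instead of inverting $S$ explicitly, it solves the linear system $S z = x - \mu$ by Gauss-Jordan elimination, which gives the same result.

```python
import math


def solve(a, b):
    """Solve a * z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def mahalanobis(x, mu, s):
    """sqrt((x - mu)^T S^-1 (x - mu)) for a sample x, mean mu and covariance s."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    z = solve(s, diff)  # z = S^-1 (x - mu)
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))
```

With the identity matrix as covariance the result reduces to the Euclidean distance. A larger variance along some axis shrinks the distance measured along that axis.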
1. (Automatic) initialization: the background is modeled from a few seconds of video in which the person does not appear. When the person enters the scene, he is detected and modeled.
2. The analysis loop: after the background and person models are initialized, each pixel in the next frame is checked against all models.
The first step in the algorithm is to build a preliminary representation of the person and the surrounding scene.
First, we need to acquire a video sequence of the scene that does not contain a person, in order to model the background.
The algorithm assumes a mostly static background.
However, it must be robust to illumination changes and able to recover from changes in the scene (e.g. a book that was moved from one place to another).
The images in the video use the YUV color representation (Y = luminance component, UV = chrominance components). There exists a transformation matrix which transforms the RGB representation to YUV.
The algorithm models the background by fitting each pixel with a Gaussian that describes the pixel’s mean and distribution.
We do this by measuring the pixel’s YUV mean and distribution over time.
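The RGB-to-YUV transformation mentioned above is a plain matrix product. The sketch below assumes the commonly quoted BT.601-derived coefficients and RGB values scaled to the range 0 to 1. The slides do not say which YUV variant Pfinder uses, so the matrix and the function name are illustrative assumptions.

```python
# Assumed BT.601-style coefficients; rows produce Y, U and V respectively.
RGB_TO_YUV = (
    (0.299, 0.587, 0.114),
    (-0.14713, -0.28886, 0.436),
    (0.615, -0.51499, -0.10001),
)


def rgb_to_yuv(rgb):
    """Multiply the 3x3 matrix by an (R, G, B) triple with components in [0, 1]."""
    return tuple(sum(coef * value for coef, value in zip(row, rgb))
                 for row in RGB_TO_YUV)
```

Gray pixels (equal R, G and B) map to chrominance values close to zero, which is why luminance changes such as shadows mostly affect the Y component.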
This pixel has some YUV value in this frame; in the next frame it might change, so we mark its mean over the color components $(y, u, v)$ as $\mu_0(x, y)$ and its covariance matrix as $K_0(x, y)$.
After the scene has been modeled, Pfinder watches for large deviations from this model.
This is done by measuring the Mahalanobis distance in color space between the new pixel’s value and the scene model’s values at the corresponding location.
If the distance is large enough and the change is visible over a sufficient number of pixels, we begin to build a model of a person.
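The background test described above could look roughly like the following sketch. It makes several simplifying assumptions that are not in the slides: each pixel's covariance is treated as diagonal (one variance per Y, U, V channel), the threshold value is arbitrary, and all function names are hypothetical.

```python
def fit_background(samples):
    """Fit a per-channel mean and variance to one pixel's (Y, U, V) samples."""
    n = len(samples)
    mean = [sum(s[c] for s in samples) / n for c in range(3)]
    # A small floor keeps the variance non-zero for perfectly constant pixels.
    var = [max(sum((s[c] - mean[c]) ** 2 for s in samples) / n, 1e-6)
           for c in range(3)]
    return mean, var


def squared_mahalanobis(value, model):
    """Squared Mahalanobis distance under the diagonal-covariance assumption."""
    mean, var = model
    return sum((v - m) ** 2 / s for v, m, s in zip(value, mean, var))


def count_foreground(frame, models, threshold=9.0):
    """Count pixels whose color deviates strongly from their background model."""
    return sum(1 for value, model in zip(frame, models)
               if squared_mahalanobis(value, model) > threshold)
```

A person model would only be started once `count_foreground` exceeds some minimum pixel count, so that isolated noisy pixels do not trigger detection.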
The algorithm represents the detected person’s body parts using blobs.
Blobs are 2D representations of a Gaussian distribution of the spatial statistics.
Also, a support map is built for each blob $k$:
$$s_k(x, y) = \begin{cases} 1 & (x, y) \in k \\ 0 & \text{otherwise} \end{cases}$$
To initialize the blob models, Pfinder uses a 2D contour shape analysis that attempts to identify the locations of the head, hands, and feet.
A blob is created for each identified location.
The class analyzer finds the location of body features by using statistics of their position and color in the previous frames.
Because no statistics have been gathered yet (these are the first frames in which the person appears), the algorithm uses ready-made statistical priors.
Hand and face blobs have strong flesh-colored color priors (it appears that normalized skin color is constant across different skin pigmentation levels).
The other blobs are initialized to cover the clothing regions.
The contour analyzer can find features in a single frame, but the results tend to be noisy.
The class analyzer produces accurate results, but it depends on the stability of the underlying models (i.e. no occlusion).
A blend of contour analysis and class model is used to find the features in the next frame.
[Figure: original image and extracted contour]
After the initialization step of the algorithm, the information is divided into scene and person models.
• The scene (background) model consists of the color space distribution for each pixel.
• The person model consists of the spatial space and color space distributions for each blob.
– The spatial space determines the blob’s location and size.
– The color space determines the distribution of color in the blob.
Given a person model and a scene model, we can now acquire a new image, interpret it, and update the scene and person models.
1. Update the spatial model associated with each blob using the blob’s measured statistics, to yield the blob’s predicted spatial distribution for the current image. This is done with a Kalman filter assuming simple Newtonian dynamics.
Measurements taken from a video sequence can be very inaccurate.
Without some kind of filtering it would be impossible to make any short-term forward predictions.
Also, each measurement is used as a seed for the tracking algorithm at the next frame.
Some kind of filtering is needed to make the measurements more accurate.
Each tracked object is represented by a state vector (usually its location).
With each new frame, a linear operator is applied to the state to generate the new state, with some noise mixed in, and some information from the controls on the system.
Usually, Newton’s laws are applied.
The added noise is Gaussian with mean 0 and a given covariance matrix.
The predicted state is then updated with the real measurement to create the estimate for the next frame.
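The predict/update cycle described above can be sketched for a single coordinate with a constant-velocity (Newtonian) model. This is a minimal illustration, not Pfinder's actual filter: the state is assumed to be [position, velocity], only the position is measured, and the noise levels `q` and `r` are hypothetical tuning values.

```python
def kalman_predict(x, p, dt=1.0, q=1e-3):
    """Propagate the state [position, velocity] and its 2x2 covariance one frame."""
    pos, vel = x
    x_pred = [pos + dt * vel, vel]  # constant-velocity motion
    a = p[0][0] + dt * p[1][0]
    b = p[0][1] + dt * p[1][1]
    p_pred = [  # F P F^T + Q, written out for F = [[1, dt], [0, 1]]
        [a + dt * b + q, b],
        [p[1][0] + dt * p[1][1], p[1][1] + q],
    ]
    return x_pred, p_pred


def kalman_update(x, p, z, r=1.0):
    """Correct the prediction with a measured position z of noise variance r."""
    innovation = z - x[0]
    s = p[0][0] + r
    k0, k1 = p[0][0] / s, p[1][0] / s  # Kalman gain
    x_new = [x[0] + k0 * innovation, x[1] + k1 * innovation]
    p_new = [  # (I - K H) P for H = [1, 0]
        [(1 - k0) * p[0][0], (1 - k0) * p[0][1]],
        [p[1][0] - k1 * p[0][0], p[1][1] - k1 * p[0][1]],
    ]
    return x_new, p_new


# Track a target that moves 2 units per frame, starting from a vague prior.
x, p = [0.0, 0.0], [[100.0, 0.0], [0.0, 100.0]]
for t in range(1, 31):
    x, p = kalman_predict(x, p)
    x, p = kalman_update(x, p, z=2.0 * t)
```

After a few frames the velocity component of `x` settles near the true speed, which is what allows a short-term forward prediction of where the blob will be in the next frame.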
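2. Now, when a new image is acquired, we measure the likelihood of each pixel being a member of each of the blob models and the scene model. The vector $p = (x, y, Y, U, V)$ is defined as the location and color of each pixel. For each class $k$, the log likelihood is measured:
$$d_k = -\frac{1}{2}(p - \mu_k)^T K_k^{-1} (p - \mu_k) - \frac{1}{2}\ln|K_k| - \frac{m}{2}\ln(2\pi)$$
where $\mu_k$ and $K_k$ are the mean and covariance matrix of class $k$, and $m$ is the dimension of $p$.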
3. Each pixel is now assigned to a particular class: either one of the blobs or the background. A support map is built which indicates which pixel belongs to which class:
$$s(x, y) = \arg\max_k\, d_k(x, y)$$
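Steps 2 and 3 amount to evaluating a Gaussian log likelihood per class and picking the best class for each pixel. The sketch below illustrates this under a simplifying assumption that is not in the slides: each class covariance is diagonal, so the determinant is the product of the variances. Class names and function names are hypothetical.

```python
import math


def log_likelihood(p, mean, var):
    """Gaussian log likelihood of vector p for a class with diagonal covariance."""
    m = len(p)
    maha = sum((pi - mi) ** 2 / vi for pi, mi, vi in zip(p, mean, var))
    log_det = sum(math.log(v) for v in var)  # ln|K| for a diagonal K
    return -0.5 * maha - 0.5 * log_det - 0.5 * m * math.log(2 * math.pi)


def classify(p, classes):
    """Return the name of the class with the highest log likelihood."""
    return max(classes, key=lambda name: log_likelihood(p, *classes[name]))


def support_map(pixels, classes):
    """Assign every pixel vector (x, y, Y, U, V) to its most likely class."""
    return [classify(p, classes) for p in pixels]
```

Here `classes` maps a class name to a `(mean, variances)` pair. Two blobs with identical color statistics are then separated purely by the spatial part of the vector.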
Connectivity constraints are enforced by iterative morphological growing from a single central point, to produce a connected region.
First, a foreground region is grown, comprising all the blob classes.
Then, each individual blob is grown with the constraint that it remains confined to the foreground region.
4. Now the statistical model for each class is updated. For the blob classes, the new mean and covariance are calculated:
$$\mu_k = E[p]$$
$$K_k = E\left[(p - \mu_k)(p - \mu_k)^T\right]$$
The Kalman filter statistics are also updated at this time.
Background pixels are also updated, so that the model can recover from changes in the scene.
The algorithm employs several domain-specific assumptions in order to track accurately. If one of the assumptions breaks, the system degrades.
However, the system can recover after a few frames if the assumptions hold again.
The system can track only a single person.
RMS (Root Mean Square) errors were found to be on the order of a few pixels:
Test                  Hand                           Arm
Translation (X, Y)    0.7 pixels (0.2% relative)     2.1 pixels (0.8% relative)
Rotation ($\theta$)   4.8 degrees (5.2% relative)    3.0 degrees (3.1% relative)
A Modular Interface: an application that provides programmers with tracking, segmentation and feature detection.
The ALIVE application places 3D animated characters that interact with the person according to his gestures.
Here, Rexy!
The SURVIVE application records the movement of the person and uses it to navigate a 3D virtual game environment.
I guess you can’t get any nerdier than this.
Recognition of American Sign Language: Pfinder was used as a pre-processing step for detecting a 40-word subset of ASL. It had 99% sign accuracy.
Avatars and Telepresence: the model of the person is translated to several blobs, which can be used to model 2D characters.
Tracking Algorithm #2
Multi-Target Tracking and Labeling
uses slides by Josephine Sullivan
from http://www.csc.kth.se/~sullivan/
Motivation
Introduction
• The multi-target tracking and labeling algorithm:
– Tracks multiple targets over long periods of time.
– Recovers robustly from collisions.
– Labels targets even when they are interacting.
Multi Tracking and Labeling
Sometimes Easy Sometimes Hard
The algorithm addresses the problem of the surveillance and tracking of multiple persons over a wide area.
Previous multi-target tracking algorithms are based on Kalman filtering and advanced techniques of particle filtering.
Tracking algorithms often fail when occlusion or interaction between the targets occurs.
This work’s specific goal is to track and label the players in a football game.
This is especially hard when players collide and interact.
The researchers used a wide-screen video which was produced using the video from four calibrated cameras.
The images were stitched after the homography between the images was computed.
This produces a high-resolution video which gives good tracking results.
1. Background modeling and subtraction
2. Build an interaction graph
3. Resolve split/merge situations
4. Recover identities of temporally separated player trajectories
A probabilistic model of the image gradient at each background pixel is obtained.
The gradient is used to handle situations where the player’s uniform has the same color as the background.
Let $g_x^t$ denote the image gradient at pixel $x$ in frame $t$.
Each background pixel’s gradient $g_x^b$ is modeled by a mixture of three bivariate normal distributions with means $\mu_x^i$ and covariance matrices $\Sigma_x^i$:
$$g_x^b \sim \sum_{i=1}^{3} \pi_x^i \, N_2(\mu_x^i, \Sigma_x^i)$$
where $0 \le \pi_x^i \le 1$ and $\sum_{i=1}^{3} \pi_x^i = 1$.
A pixel $x$ in frame $t$ is considered a foreground pixel if $(g_x^t - \mu_x)^T \Sigma_x^{-1} (g_x^t - \mu_x)$ is larger than a threshold.
Let $F^t$ be the set of foreground pixels at time $t$.
Let $B^t$ be the set of background pixels at time $t$.
Connected components are then identified and are processed by deleting small connected components or joining them to neighboring larger ones. This ensures that each connected component corresponds to at least one whole player.
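The connected-component step can be illustrated with a breadth-first flood fill over a binary foreground mask. This is a simplified sketch: it assumes 4-connectivity, uses a hypothetical `min_pixels` cutoff, and only deletes small components (joining them to larger neighbors is omitted).

```python
from collections import deque


def connected_components(mask):
    """Group foreground cells (truthy values) of a 2D grid into 4-connected sets."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            component = []
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(component)
    return components


def drop_small(components, min_pixels):
    """Discard components that are too small to contain a whole player."""
    return [comp for comp in components if len(comp) >= min_pixels]
```

Each surviving component would then be summarized by an ellipse (for example from the mean and covariance of its pixel coordinates) before the matching step.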
The set of ellipses representing the connected components detected (marked by bounding boxes) is defined:
$$\mathcal{E}_t = \{E_t^i\}_{i=1}^{n_t}$$
with $n_t$ being the number of ellipses detected in frame $t$.
The first aim is to put the ellipses in $\mathcal{E}_t$ and $\mathcal{E}_{t+1}$ in correspondence.
Definition: ellipses $E_1$ and $E_2$ are an exact match if their sizes and orientations are sufficiently similar and the distance between their centers is sufficiently small.
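Define a relation $\sim$:
• $E_t^i \sim E_{t+1}^j$ if $E_t^i$ and $E_{t+1}^j$ are an exact match.
• If no such exact match exists for $E_t^i$ in $\mathcal{E}_{t+1}$, then $E_t^i \sim E_{t+1}^j$ if $\mathrm{Area}(E_t^i \cap E_{t+1}^j) > 0$ and $E_{t+1}^j$ has no exact match in $\mathcal{E}_t$.
Define forward and backward mappings:
• Forward mapping: $j \in F_t(i) \iff E_t^i \sim E_{t+1}^j$
• Backward mapping: $k \in B_t(i) \iff i \in F_t(k)$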
With the forward and backward mappings, we can define events at each frame:
Event       Signal
Split       $|F_t(i)| > 1$
Merge       $|B_t(j)| > 1$
Disappear   $|F_t(i)| = 0$
Appear      $|B_t(j)| = 0$
Stable      $|F_t(i)| = |B_t(F_t(i))| = 1$
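The event table can be turned into a small sketch. It assumes the relation between ellipses in frames $t$ and $t+1$ has already been computed and is given as a set of index pairs `(i, j)`. The function names and the output format are illustrative choices, not taken from the paper.

```python
def build_mappings(n_t, n_next, related):
    """Build the forward map (frame t to t+1) and backward map (t+1 to t)."""
    forward = {i: set() for i in range(n_t)}
    backward = {j: set() for j in range(n_next)}
    for i, j in related:
        forward[i].add(j)
        backward[j].add(i)
    return forward, backward


def detect_events(forward, backward):
    """List (event, index) pairs; split/disappear/stable index frame t ellipses,
    merge/appear index frame t+1 ellipses."""
    events = []
    for i, targets in sorted(forward.items()):
        if not targets:
            events.append(("disappear", i))
        elif len(targets) > 1:
            events.append(("split", i))
        elif len(backward[next(iter(targets))]) == 1:
            events.append(("stable", i))
    for j, sources in sorted(backward.items()):
        if not sources:
            events.append(("appear", j))
        elif len(sources) > 1:
            events.append(("merge", j))
    return events
```

An ellipse with exactly one successor that is shared with another ellipse produces no event of its own here; it is accounted for by the merge event reported on the successor.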
A maximal sequence of stable events sandwiched between non-stable events is termed a track.
A player track is a track that corresponds to exactly one player.
If the event sequence is track → split or merge → track, then the track involves multiple players.
If the event sequence is {split, appear} → track → {merge, disappear}, then the track may be a player track.
If such a track is long enough and its ellipse size is not too big, it is considered a player track.
Other tracks are called multiple-player tracks.
Because we’re dealing with a football game, we know that players are divided into 3 categories: Team A, Team B and officials.
This will help us in cases where players from different teams appear in multiple-player tracks.
Given the labeling of the tracks and their interactions through merging and splitting, the game can be summarized by a graph structure called the target interaction graph.
White and gray nodes correspond to team A / team B player tracks. Black nodes correspond to multiple-player tracks. This graph is a small section of the ~5000-node graph describing 10 minutes of analyzed gameplay.
By examining the player interaction graph, it is possible to isolate situations where n player tracks merge and then split into n player tracks.
These merge-split situations are resolved by finding correspondence between input and output tracks.
Input and output tracks are each a set of n tracks.
We wish to find the assignment $M$ of the input to the output. It is a bijective mapping $M : \{1, \dots, n\} \to \{1, \dots, n\}$, where $M(i) = j$ implies that input track $T_i$ and output track $T_j$ are the same player.
Not all assignments are physically possible.
For each valid assignment, we estimate the intermediate tracks by exploiting the properties of maintaining continuity of motion and relative depth ordering.
We investigate whether any of the intermediary tracks can be described by a constant-velocity motion model. This is done by linearly interpolating between the last ellipse of $T_i$ and the first ellipse of $T_{M(i)}$.
If there is sufficient image data to support this, the penalty for this estimate is 0.
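The overall estimate for each assignment is scored:
$$Sc_M = \sum_{i=1}^{n} \left[ \mathrm{Dist}(T_i, T_{M(i)}) + \mathrm{Pen}(T_i, T_{M(i)}) \right]$$
where $\mathrm{Dist}(T_i, T_{M(i)})$ is the distance traveled during the hypothesized trajectory, and
$$\mathrm{Pen}(T_i, T_{M(i)}) = \begin{cases} 1 & \text{if } T_i \text{ is not consistent with the relevant } T_{M(i)} \\ 0 & \text{otherwise} \end{cases}$$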
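Scoring every bijective assignment can be sketched by brute force over permutations, which is feasible for the small numbers of merging players considered here. Several parts of this sketch are assumptions: the distance is taken as the straight line between the last position of the input track and the first position of the output track, consistency is supplied by a hypothetical callback `is_consistent(i, j)`, distance and penalty are combined additively, and `penalty_weight` is an illustrative parameter that is not in the slides.

```python
import itertools
import math


def score_assignment(assignment, ends, starts, is_consistent, penalty_weight=1.0):
    """Sum of straight-line distances plus a penalty per inconsistent pairing."""
    total = 0.0
    for i, j in enumerate(assignment):
        total += math.dist(ends[i], starts[j])
        if not is_consistent(i, j):
            total += penalty_weight
    return total


def best_assignment(ends, starts, is_consistent, penalty_weight=1.0):
    """Try every bijection from input tracks to output tracks, keep the cheapest."""
    n = len(ends)
    return min(
        itertools.permutations(range(n)),
        key=lambda m: score_assignment(m, ends, starts, is_consistent, penalty_weight),
    )
```

The returned tuple `m` encodes the mapping as `m[i] = j`, meaning input track `i` continues as output track `j`.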
If the minimum-score assignment is explained solely by linear interpolation, and its score is lower than a threshold, then we accept this assignment.
Otherwise, we repeat this process at constant time intervals. This is called relative depth ordering.
Intermediary tracks that cannot be explained by simple linear interpolation are analyzed every $m$-th frame in the interval between the merge and the split.
Starting with the first interval, we define the region $R_t^{kj}$ as the union of all ellipses and try to interpolate over smaller distances.
The aim at each interval is to maximize the intersection of $R_t^{kj}$ with the foreground pixels and minimize the intersection with the background pixels.
Again, the penalty is set to 1 if the mentioned intersection is not consistent.
Then the score is recalculated and the assignment with the minimum score is chosen.
This process was found to work when at most 5 targets merged.
Nonetheless, the examined sequence contained roughly 200 merge-split situations, of varying complexity, all resolved.
At this step, it is interesting to see how frequently a player was assigned a player track.
Not all split/merge situations were accurately resolved.
Usually, other features can be used to resolve the identity of player tracks.
In a football game, a player’s identity can be obtained from his position relative to his teammates.
The easiest example is the goalkeeper, who is always behind his teammates.
We can look at the problem as a partitioning problem.
This is specific to a football game, but a variation can be used for other applications.
The feature vector for each player $i \in \{1, \dots, 11\}$ at frame $t$ is $v_t^i = (r_t^i, l_t^i, f_t^i, b_t^i)$, which counts the number of teammates to the right of, to the left of, in front of, and behind the player.
We assign an index to every possible configuration (feature vector) and for each unlabeled player track, we make a histogram of the configuration over the track’s ellipses.
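The relative-position feature and its per-track histogram can be sketched as follows. The coordinate convention is an assumption made for this example: positions are `(x, y)` pitch coordinates, larger `x` means further right and larger `y` means further forward. Instead of assigning an integer index to each configuration, the sketch uses the feature tuple itself as the histogram key.

```python
from collections import Counter


def feature_vector(player, teammates):
    """Count teammates to the right, left, in front of and behind the player."""
    px, py = player
    right = sum(1 for x, _ in teammates if x > px)
    left = sum(1 for x, _ in teammates if x < px)
    front = sum(1 for _, y in teammates if y > py)
    behind = sum(1 for _, y in teammates if y < py)
    return (right, left, front, behind)


def configuration_histogram(track, teammates_per_frame):
    """Histogram of feature vectors over all frames of one player track."""
    return Counter(
        feature_vector(position, teammates)
        for position, teammates in zip(track, teammates_per_frame)
    )
```

A goalkeeper's histogram concentrates on configurations with no teammates behind him, which is what makes such a track easy to tell apart from the others.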
We start by considering only long player tracks (over 40 seconds).
Build their distance matrix:
The distance between every pair of player tracks is shown. Darker values indicate smaller distances.
We grow and merge clusters by using player tracks of decreasing lengths. This clustering considers tracks 750 frames long.
Clustering with tracks 250 frames long:
Errors begin to occur.
We’ve seen two algorithms.
One deals with single-person tracking, the other with multi-target tracking.
Both algorithms make specific assumptions: the first about the human body and its motion, the other about motion and the conditions of a football game.