Image and Vision Computing 14 (1996) 609-615

Learning the distribution of object trajectories for event recognition

Neil Johnson*, David Hogg*

School of Computer Studies, The University of Leeds, Leeds LS2 9JT, UK

Abstract

The advent in recent years of robust, real-time, model-based tracking techniques for rigid and non-rigid moving objects has made automated surveillance and event recognition a possibility. A statistically based model of object trajectories is presented which is learnt from the observation of long image sequences. Trajectory data is supplied by a tracker using Active Shape Models, from which a model of the distribution of typical trajectories is learnt. Experimental results are included to show the generation of the model for trajectories within a pedestrian scene. We indicate how the resulting model can be used for the identification of atypical events.

Keywords: Object motion analysis; Event recognition; Neural networks

1. Introduction

Existing vision systems for surveillance and event recognition rely on known scenes where objects tend to move in predefined ways (see, for example, Howarth and Buxton [1]). Hand generated models of the scene and object behaviours, incorporating high-level qualitative reasoning, are used to recognise events. We wish to approach the surveillance problem from a lower level, using learnt models of typical object behaviour to identify incidents, recognise events and predict object trajectories within unknown scenes where object behaviour is not predefined. An open pedestrian scene is used as an example of such a situation, since pedestrians are free to walk wherever they wish.

Unsupervised learning techniques are used in an attempt to develop models which can be acquired with a minimum of human intervention, and which can be made to adapt over time to long term changes in object behaviour.

We develop a model of the probability density functions of both instantaneous movements and partial trajectories within a scene. Trajectories are initially represented as sequences of flow vectors, which allows the modelling of instantaneous movements. This model of instantaneous movements, together with a type of memory mechanism, is then used to derive vector representations of partial trajectories from the sequences of flow vectors. This representation is of fixed size, is unique for a particular trajectory, encodes trajectories of different lengths whilst maintaining a similar level of detail, and ensures similarity between vectors close in their feature space (and vice versa).

* Email: {neilj, dch}@scs.leeds.ac.uk

0262-8856/96/$15.00 © 1996 Elsevier Science B.V. All rights reserved. PII: S0262-8856(96)01101-8

The model is automatically generated by tracking objects over long image sequences. The pdfs are represented by the distribution of prototype vectors, which are positioned by a neural network implementing vector quantisation. The temporal nature of trajectories is modelled using a type of neuron with short-term memory capabilities to form vector representations of trajectories.

We indicate how the model can be used to recognise atypical movements, and thus flag possible incidents of interest, and how attaching 'meaning' to areas of the distributions representing similar instantaneous movements and partial trajectories would allow event recognition to be performed.

2. Data

It is assumed that raw data is available giving the 2D image trajectories of moving objects within the scene. For our experiments, an object tracker [2], based on Active Shape Models [3] acquired automatically from observing long image sequences [4], is used. This system provides efficient real time tracking of multiple articulated non-rigid objects in motion, and copes with moderate levels of occlusion. In our experiments, pedestrians are tracked in a real world scene using a fixed camera (e.g. see Fig. 1(a)).

Fig. 1. Trajectory data. (a) Pedestrian scene; (b) smoothed trajectory data.

There is a one way flow of data from the tracker consisting of frame by frame updates to the position in the image plane of the centroid of uniquely labelled objects. The detection of atypical activity and the recognition of events is feasible within the image plane, although it can also be carried out with trajectories that have been back projected onto the ground plane. The use of the image plane avoids introducing errors associated with the transformation of coordinates from the image to ground plane.

Since each new object being tracked is allocated a unique identifier, it is possible to maintain a history of the path taken by each object from frame to frame. Due to inaccuracies in the tracking process, the raw data will contain high frequency noise. This noise is minimised by smoothing image positions over a moving window, width w. The tracker processes frames at a fixed rate and thus, for an object i, we have a sequence Ti of n smoothed 2D image coordinates, uniformly spaced in time:

Ti = {(x1, y1), (x2, y2), (x3, y3), ..., (x(n-2), y(n-2)), (x(n-1), y(n-1)), (xn, yn)}.  (1)
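The moving-window smoothing described above can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and the edge-padding choice at the trajectory ends are our own assumptions.

```python
import numpy as np

def smooth_trajectory(raw_xy, w=5):
    """Smooth raw per-frame centroid positions with a moving window of width w.

    raw_xy: (n, 2) array of (x, y) centroids supplied by the tracker.
    Returns an (n, 2) array of smoothed positions, uniformly spaced in time.
    """
    raw_xy = np.asarray(raw_xy, dtype=float)
    kernel = np.ones(w) / w
    pad = w // 2
    # Pad with edge values so the output keeps the same length as the input.
    padded = np.pad(raw_xy, ((pad, pad), (0, 0)), mode="edge")
    smoothed = np.column_stack([
        np.convolve(padded[:, d], kernel, mode="valid") for d in range(2)
    ])
    return smoothed[:len(raw_xy)]
```

A simple box average suffices here because the goal is only to suppress high frequency tracker noise, not to fit a trajectory model.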

In our experiments, trajectories are sampled at a rate of two frames per second and smoothing is performed using a window of width w = 5. Fig. 1 shows a large number of these smoothed trajectories with centroid positions connected with lines (Fig. 1(b)) alongside an image of the 'empty' pedestrian scene from which they were obtained (Fig. 1(a)).

An object's trajectory is described in terms of a sequence of flow vectors,¹ where a flow vector f represents both the position of the object and its instantaneous velocity in the image plane:

f = (x, y, δx, δy).  (2)

Flow vectors are estimated from the change in object centroid coordinates between successive frames.

Flow vectors are transformed so that each component lies in the interval [0, 1] (i.e. x, y, δx, δy ∈ [0, 1]). This involves using a factor of 10 when scaling velocity components relative to positional components; the scaling factor being derived from the maximum observed object speed. The relative scaling of velocity and positional components balances their relative contribution when using the Euclidean distance between flow vectors as a measure of their similarity. Thus an object i is, after preprocessing, represented by a set Qi of n flow vectors, all of which lie within a unit hypercube in 4D phase space:

Qi = {f1, f2, f3, ..., f(n-2), f(n-1), fn}.  (3)
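A sketch of this preprocessing, assuming a known image size and the paper's factor-of-10 velocity scaling (the function name and the handling of negative velocity components are our own; the paper only states that all components end up in [0, 1]):

```python
import numpy as np

def flow_vectors(smoothed_xy, width, height, vel_scale=10.0):
    """Build flow vectors (x, y, dx, dy) from smoothed centroid positions.

    Positions are divided by the image dimensions to land in [0, 1];
    velocities are scaled by vel_scale (the paper derives a factor of 10
    from the maximum observed object speed) so that position and velocity
    contribute comparably to Euclidean distances between flow vectors.
    Note: negative velocity components would still need an offset to lie
    strictly in [0, 1]; that detail is glossed over here.
    """
    p = np.asarray(smoothed_xy, dtype=float)
    pos = p / np.array([width, height])        # x, y scaled to [0, 1]
    vel = np.diff(pos, axis=0) * vel_scale     # dx, dy from successive frames
    # Pair each position with the velocity estimated from the next frame.
    return np.hstack([pos[:-1], vel])
```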

3. Modelling probability density functions

This section describes the method used to model the pdf of an arbitrary set of vectors in ℝN using an unsupervised learning technique. The following two sections detail the modelling of flow vectors and partial trajectories respectively using this method.

In modelling the probability density function of vectors in ℝN, we have two main aims:

• to form as concise and accurate a model as possible, and

¹ The term flow vector usually refers to a 2D velocity vector positioned in the image plane, which is thus effectively a 4D vector such as those referred to here.


• to enable 'meaning' to be attached to areas of the distribution.

One way of modelling the pdf would be to divide the feature space into an N-dimensional grid structure and increment a count for each cell whenever a vector falls within that cell. This would not be a concise model, and meaning would have to be attached to all cells. Instead, we model the pdf by the point distribution of prototype vectors using vector quantisation.

3.1. Vector quantisation

Vector quantisation is a classical technique from signal processing, originally used for data compression, which provides a method for modelling pdfs by the point distribution of prototype vectors. We implement the technique using a competitive learning neural network which is taught in an unsupervised manner (see, for example, [5,6]).

Our network consists of a set of N input nodes (one for each component of the N-dimensional feature vectors), k output nodes (one for each prototype), and implements the following algorithm:

1. Randomly place the k prototypes within the unit hypercube feature space in ℝN.
2. Initialise α, a monotonically decreasing gain coefficient in the interval (0, 1).
3. Select z(t), the input feature vector,² from the distribution to be modelled.
4. Find the prototype pc(t) which is nearest to this input by the Euclidean metric:
   ||z(t) − pc(t)|| = min_i ||z(t) − pi(t)||.  (4)
5. Update prototypes as follows:
   pc(t + 1) = pc(t) + α(t)[z(t) − pc(t)]
   pi(t + 1) = pi(t) for i ≠ c.  (5)
6. Decrease α(t) in line with a 'cooling schedule'.
7. Repeat steps 3-6 for many iterations.
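Steps 1-7 can be sketched as follows. This is a minimal sketch without the sensitivity mechanism of the appendix; the linear cooling schedule matches the experiments described later, but the parameter defaults and function name are illustrative.

```python
import numpy as np

def competitive_vq(data, k=100, iterations=10000, seed=0):
    """Position k prototypes to approximate the pdf of `data` (steps 1-7).

    data: (m, N) array of feature vectors inside the unit hypercube.
    Returns a (k, N) array of prototype vectors.
    """
    rng = np.random.default_rng(seed)
    n_dims = data.shape[1]
    prototypes = rng.random((k, n_dims))              # step 1: random placement
    for t in range(iterations):
        alpha = 1.0 - t / iterations                  # steps 2/6: linear cooling
        z = data[rng.integers(len(data))]             # step 3: sample an input
        dists = np.linalg.norm(prototypes - z, axis=1)
        c = int(np.argmin(dists))                     # step 4: nearest prototype
        prototypes[c] += alpha * (z - prototypes[c])  # step 5: move winner only
    return prototypes
```

Only the winning prototype moves on each iteration, which is what makes the final point density track the input pdf but also what strands prototypes that never win (the problem the appendix addresses).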

After learning, each prototype will represent an approximately equal number of training feature vectors and the point density of the prototypes within the feature space will approximate the pdf of the feature vectors [6]. The model is thus more accurate in areas of high probability density, and so the representation is both concise and accurate.

A modification to this algorithm to deal with sensitivity to the initial placement of prototypes and to improve density matching is detailed in the Appendix.

In our network implementation, each output node represents one of the prototypes and is said to 'win' if its prototype is the nearest to the feature vector being presented on the inputs. The output of a node i is calculated as follows:

oi(t) = 1 − ||z(t) − pi(t)|| / √N.  (6)

Thus oi(t) decreases linearly from one to zero as the distance from z(t) to pi(t) increases from zero to √N. The form of this output is not important until later when we use it to form a representation of a trajectory.

² To improve the performance of the algorithm, input feature vectors are chosen randomly from a buffer.

The number of prototypes used to describe the distribution is essentially arbitrary; the more prototypes used, the greater the accuracy of the model.

4. Modelling the pdf of flow vectors

A competitive learning network is used to model the pdf of flow vectors generated from the raw input data stream. Before the flow vectors can be presented to the network, some further preprocessing is necessary.

As an object moves it sweeps out a continuous path in 4D phase space. This path is sampled at regular time instants to generate the sequence of vectors which is the result of preprocessing. When the speed at which the path is swept out is low, the sampled vectors are densely distributed, and when it is high, the vectors are sparsely distributed. This will result in a higher probability density in areas where the rate of movement along the path is low.

To avoid this problem, the path is resampled with a constant arc length, δs. This generates a new sequence of flow vectors which are evenly distributed along the path. The value of δs is chosen to be approximately equal to the average distance between time sampled flow vectors.

A four input network can now be trained by sequentially presenting flow vectors generated in this way from a large number of object trajectories.
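The constant arc-length resampling can be sketched as piecewise-linear interpolation along the cumulative arc length of the path (a sketch under the assumption that linear interpolation between time samples is adequate; the function name is ours):

```python
import numpy as np

def resample_arc_length(path, ds=0.025):
    """Resample a polyline path at constant arc-length spacing ds.

    path: (n, N) array of points swept out in phase space.
    Returns points spaced ds apart along the path, so that sample density
    no longer depends on how fast the path was traversed.
    """
    path = np.asarray(path, dtype=float)
    seg = np.linalg.norm(np.diff(path, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])   # cumulative arc length
    targets = np.arange(0.0, s[-1], ds)
    # Interpolate each coordinate as a function of arc length.
    return np.column_stack([
        np.interp(targets, s, path[:, d]) for d in range(path.shape[1])
    ])
```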

4.1. Experimental results

The trajectories shown in Fig. 1(b) were used to train a network consisting of 4 input nodes and 1000 output nodes/prototypes. A value of δs = 0.025 was used for the generation of corrected flow vectors. The network was trained for 2,000,000 iterations with the gain coefficient α decreasing linearly from 1 to 0 over this period. A value of β = 0.01 was used for sensitivity adjustments (see Appendix).

The results of this experiment are shown in Fig. 2. The prototype for each of the 1000 output nodes is displayed as an arrow, the position of which represents the (x, y) components, and the size and direction of which represents the (δx, δy) components. Comparison between these prototypes and the raw trajectories shows the results to be plausible.


Fig. 2. Distribution of prototypes in a 4 input, 1000 output node network trained using the trajectories shown in Fig. 1(b).

5. Modelling the pdf of partial trajectories

To model the pdf of sequences of flow vectors using a competitive learning network, we need to form a representation of sequences with the following properties:

• sequences of different lengths are modelled.
• sequences which are similar should be close in the vector space of the representation, and vice versa.

From Eq. (6), the outputs of our competitive learning network produce an activation which decreases linearly from one to zero as the distance between the node's prototype and the input vector increases from zero to √N. As the flow vectors produced by a trajectory are presented to a trained network, the output of certain nodes (whose prototype the trajectory comes close to) will first increase and then decrease.

Sequences of flow vectors are represented by modelling the sequence of activations they cause on the outputs of the previously trained network. This is achieved by using a layer of 'leaky neurons' which acts as a memory mechanism to record a history of activations. Using this layer, a feature vector representation of trajectories is formed which is of fixed size, is unique for a particular trajectory, encodes trajectories of different lengths whilst maintaining a similar level of detail, and ensures similarity between vectors close in their feature space (and vice versa).

By feeding the output of each of these nodes into the input of a corresponding leaky neuron with a slow decay rate, a trace of the trajectory will be formed in the activations of the leaky neurons. By using the complete output vector of a trained network as the input vector to a layer of leaky neurons (i.e. making a one-to-one connection between the output nodes and a set of leaky neurons), a representation of the sequence of activations is formed on the leaky neuron layer as the sequence of flow vectors is presented. When the entire sequence has been presented, the output vector of the leaky neuron layer can be used to represent the sequence of flow vectors and thus the trajectory.

5.1. Leaky neurons

The leaky neurons used are similar to the Leaky Integrators of Reiss and Taylor [7] or the neurons of Wang and Arbib [8]. Leaky neurons are different to the neurons in most neural networks in that they hold a certain amount of their activation from previous iterations. This leaky characteristic is present in biological neurons, where electrical potential on the neuron's surface decays according to a time constant. In this way, the leaky neurons have a memory of previous activations.

Sequences of any length can be represented up to a maximum defined by the number of prototypes and the memory span of the leaky neurons. Sequences must be simple (i.e. the trajectory must not pass each prototype more than once), but this is almost always the case in reality. Since nodes whose prototypes are close in phase space will have similar outputs, the vector representations of similar trajectories will be close in the representation's feature space, and vice versa. Further work is required to assess the distortion to the pdf of trajectories caused by representing the trajectories in this manner.

A leaky neuron has a single input and a single output. The activation at iteration t + 1 is calculated from the previous activation a(t) and the current input I:

a(t + 1) = I if I > γa(t), γa(t) otherwise,  (7)

where γ is a coefficient in the interval (0, 1) which governs the rate of decay and thus the memory span of the neuron.

Such a neuron will mimic its input over a number of iterations, unless the input decreases at a rate which is greater than the rate of decay of the neuron's activation. A leaky neuron with a slow decay rate (high value of γ) will thus retain a 'trace' of its highest input.

5.2. Method

To approximate the pdf of partial trajectories we use vector quantisation to position prototypes within the feature space of the trajectory representation. Partial trajectory vectors are produced for each trajectory from the sequence of time sampled flow vectors and a network of k nodes previously trained on flow vectors as follows:

1. Zero the leaky neuron layer, ai = 0.
2. For each flow vector f(t) in the sequence:
   (a) Let f(t) be the input to the previously trained network.
   (b) Calculate the output from the previously trained network:
       oi(t) = 1 − ||f(t) − pi(t)|| / √N.  (8)
   (c) Let the oi(t) be the inputs to the leaky neuron layer.
   (d) Calculate the new leaky neuron activations:
       ai(t + 1) = oi(t) if oi(t) > γai(t), γai(t) otherwise.  (9)
   (e) Let the ai(t + 1) be the components of the partial trajectory vector representing the trajectory up to this point.
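The leaky-neuron update and the method steps above can be sketched together. Note that Eq. (9) is exactly an elementwise maximum of the new output and the decayed activation; the function name and array layout are our own.

```python
import numpy as np

def partial_trajectory_vector(flow_seq, prototypes, gamma=0.998):
    """Form a partial trajectory vector from a flow-vector sequence (Eqs. 8-9).

    flow_seq: (n, N) sequence of flow vectors for one trajectory.
    prototypes: (k, N) prototypes of the previously trained network.
    Returns the (k,) leaky-neuron activation vector after the sequence.
    """
    flow_seq = np.asarray(flow_seq, dtype=float)
    n_dims = prototypes.shape[1]
    a = np.zeros(len(prototypes))              # step 1: zero the layer
    for f in flow_seq:                         # step 2: present each vector
        dists = np.linalg.norm(prototypes - f, axis=1)
        o = 1.0 - dists / np.sqrt(n_dims)      # Eq. (8): node outputs
        a = np.maximum(o, gamma * a)           # Eq. (9): leaky update
    return a
```

A slow decay (γ close to 1) means each neuron retains a trace of its highest activation, so the final vector records which prototypes the trajectory passed near and how closely.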

Before the partial trajectory vectors generated in this way can be presented to a second competitive learning network to model their pdf, some further preprocessing is necessary. In the same way that trajectories were represented earlier by a sequence of flow vectors defining a time sampled path through phase space, so trajectories are now represented by a sequence of partial trajectory vectors defining a time sampled path through partial trajectory space. Thus, it is again necessary to resample the path with a constant arc length δs to ensure that partial trajectory vectors are evenly distributed along the path.

5.3. Experimental results

A layer of 1000 leaky neurons was used to generate partial trajectory vectors from the outputs of the previously trained network as described above. A second competitive learning network consisting of 1000 input nodes and 500 output nodes was used to model the pdf of the partial trajectory vectors. A value of δs = 0.2 was used for the generation of corrected partial trajectory vectors, and a value of γ = 0.998 used to govern the decay of activation in the leaky neurons. The second network was trained for 1,000,000 iterations with the gain coefficient α decreasing linearly from 1 to 0 over this period. A value of β = 0.1 was used for sensitivity adjustments (see Appendix).

Some results from this experiment are shown in Fig. 3. Fig. 3(a) shows a representation of a prototype from the second network where the value of each component is displayed as a shaded arrow. The arrow indicates which prototype from the first network the component corresponds to and the shade represents the activation (white being zero and black one). Fig. 3(b) shows smoothed partial trajectories from the data set which cause the prototype represented in Fig. 3(a) to win. Figs. 3(c) and (d) are as Figs. 3(a) and (b), but for another prototype.

Fig. 3. Trajectory learning results. (a), (c) Representations of two prototypes; (b), (d) smoothed trajectories which the prototypes represent.

Examination of prototypes and the smoothed trajectories for which they win suggests a plausible division of the feature space with prototypes representing partial trajectories covering different paths with differing velocities. Groups of partial trajectories with high probability density are represented by many similar prototypes as expected.

6. Event recognition

The most obvious use for the model we have developed is assessment of the typicality of instantaneous movements and partial trajectories, where typicality is defined statistically. By observing the approximate probability density in the model of an object's instantaneous movements and partial trajectories, we can flag possible incidents of interest. To achieve this it is necessary to label each prototype with an estimate of the local probability density.

By estimating the volume Vi within feature space for which a particular node i wins, and assuming the probability density is constant within this region, the probability density can be approximated by

pi = 1 / (k Vi),  (10)

where k is the number of prototypes and the entire distribution is assumed to lie within a unit hypercube. Since estimation of Vi is impractical for high dimensional spaces, the mean distance

μi = (1/n) Σ(j=1..n) ||z(t) − pi(t)||j,  (11)

for which node i is the winner can instead be used as a measure of relative probability density.

Since some prototypes will be bordering areas of the distribution which are 'empty', it is also necessary to estimate a prototype's boundary, providing a means of deciding whether an input vector which causes a particular node to win is from the area represented by the node's prototype or from a nearby region of near-zero probability density. This can be achieved by calculating the standard deviation

σi = √((1/n) Σ(j=1..n) (||z(t) − pi(t)||j − μi)²)  (12)

of distances for which node i is the winner, and rejecting inputs where ||z(t) − pi(t)|| > μi + 3σi.

Since both instantaneous movements and partial trajectories have been modelled, continuous assessment of typicality in terms of both the current movement and complete trajectory is possible. Incidents of interest can be flagged if the typicality drops below a certain threshold, or if the input is found to be outside of the distribution.

Recognition of simple events could be achieved by attaching semantics or meaning to areas of the distributions. This would be achieved by labelling the relevant nodes and retrieving the information when the nodes are activated.

Long-range trajectory prediction could also be achieved using the model.

7. Conclusions

We have presented a statistically based model of object trajectories which is learnt from image sequences. The model is based on a neural network allowing fast parallel implementation. Experimental results show the generation of the model for the trajectories of pedestrians within a real-life pedestrian scene. Minor additions to the model have been suggested allowing the detection of incidents through the detection of atypical instantaneous movements and partial trajectories, and the recognition of simple events by attaching meaning to prototypes representing instantaneous movements and partial trajectories.
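The per-node statistics and rejection test of Section 6 can be sketched as follows (a sketch under our own assumptions: the statistics are computed in one batch pass over training data rather than online, and function names are illustrative):

```python
import numpy as np

def node_statistics(data, prototypes):
    """Estimate mean and std of winning distances per node (Eqs. 11-12)."""
    dists = np.linalg.norm(data[:, None, :] - prototypes[None, :, :], axis=2)
    winners = dists.argmin(axis=1)
    k = len(prototypes)
    mu = np.zeros(k)
    sigma = np.zeros(k)
    for i in range(k):
        d = dists[winners == i, i]         # distances for which node i wins
        if len(d):
            mu[i], sigma[i] = d.mean(), d.std()
    return mu, sigma

def is_typical(z, prototypes, mu, sigma):
    """Reject z if it lies beyond mu + 3*sigma of its winning node's prototype."""
    dists = np.linalg.norm(prototypes - z, axis=1)
    i = int(dists.argmin())
    return dists[i] <= mu[i] + 3.0 * sigma[i]
```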

Appendix: Ensuring correct distribution of prototypes

Vector quantisation as described has two major problems:

• The final distribution of prototypes is extremely sensitive to their initial random placement within the feature space. Prototypes can be 'stranded' in areas where they will never win, which will result in a sub-optimal distribution. This is a particular problem in sparse distributions such as those to be modelled here.
• The final distribution of prototypes does not exactly model the pdf of input vectors (i.e. the density matching is not exact). Typically, areas of high probability density are under-represented and areas of low probability density are over-represented.

Rumelhart and Zipser [5] propose a solution to the first of these problems called leaky learning, where the losing nodes also move their prototypes towards the input vector, but by a much smaller amount. This results in stranded prototypes moving towards the mean of the distribution. For a sparse distribution this is not adequate, since the mean of the distribution may itself be 'empty'.

Instead, a method similar to that suggested by Bienenstock et al. [9] is used, where each node i has an associated sensitivity. In our implementation, this sensitivity Si is initially zero and is updated on each iteration:

Si(t + 1) = ξSi(t) + Δi,  (13)

where

ξ = 1 − β / ((k − 1)√N)  (14)

and

Δi = −β if i is the winner, β / (k − 1) otherwise.  (15)

β is a constant in the interval (0, 1) specifying the magnitude of adjustments, k is the number of prototypes, and N is the dimensionality of the vectors being modelled. The value of β should be small relative to the feature space, but large enough to enable stranded prototypes to 'escape' within the network's learning period. The ξSi(t) term in Eq. (13) ensures that Si ≤ √N and that sensitivity values will tend to zero. The form of the adjustments Δi in Eq. (15) ensures that, for correctly distributed prototypes, the mean adjustment will be zero.

The sensitivity is subtracted from the Euclidean distance when finding the nearest prototype during learning. In this way a node with positive sensitivity is more likely, and a node with negative sensitivity is less likely, to win the competition, allowing stranded prototypes to enter the competition and improving the density matching of the algorithm. Thus competitive learning with node sensitivities performs a robust vector quantisation.
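The sensitivity-biased competition of Eqs. (13)-(15) can be sketched as a drop-in replacement for the winner-selection step of the learning loop (function name and in-place update convention are our own):

```python
import numpy as np

def sensitive_winner(z, prototypes, sens, beta=0.01):
    """Pick a winner with sensitivity-biased distances; update sensitivities.

    Sensitivities are subtracted from Euclidean distances when competing
    (so positive sensitivity helps a node win), then the winner's sensitivity
    is decreased by beta and every loser's is increased by beta / (k - 1),
    each after decay by xi, per Eqs. (13)-(15).
    """
    k, n_dims = prototypes.shape
    dists = np.linalg.norm(prototypes - z, axis=1)
    c = int((dists - sens).argmin())               # biased competition
    xi = 1.0 - beta / ((k - 1) * np.sqrt(n_dims))  # Eq. (14): decay factor
    delta = np.full(k, beta / (k - 1))             # Eq. (15): losers gain...
    delta[c] = -beta                               # ...the winner loses
    sens[:] = xi * sens + delta                    # Eq. (13): update in place
    return c
```

Since the winner wins roughly 1/k of the time once prototypes are well distributed, the expected adjustment (1/k)(−β) + ((k−1)/k)(β/(k−1)) is zero, so sensitivities stop drifting when the distribution is correct.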

References

[1] R. Howarth and H. Buxton, Analogical representation of spatial events for understanding traffic behaviour, in B. Neumann (ed.), 10th European Conference on Artificial Intelligence, John Wiley, 1992, pp. 785-789.
[2] A. Baumberg and D. Hogg, An efficient method for contour tracking using active shape models, IEEE Workshop on Motion of Non-rigid and Articulated Objects, IEEE Press, November 1994, pp. 194-199.
[3] T.J. Cootes, C.J. Taylor, D.H. Cooper and J. Graham, Training models of shape from sets of examples, British Machine Vision Conference, September 1992, pp. 9-18.
[4] A. Baumberg and D. Hogg, Learning flexible models from image sequences, European Conference on Computer Vision, Vol. 1, May 1994, pp. 299-308.
[5] D. Rumelhart and D. Zipser, Feature discovery by competitive learning, Cognitive Science, 9 (1985) 75-112.
[6] T. Kohonen, The self-organizing map, Proc. IEEE, 78(9) (1990) 1464-1480.
[7] M. Reiss and G. Taylor, Storing temporal sequences, Neural Networks, 4 (1991) 773-787.
[8] D. Wang and M. Arbib, Complex temporal sequence learning based on short-term memory, Proc. IEEE, 78(9) (1990) 1536-1542.
[9] E. Bienenstock, L. Cooper and P. Munro, Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex, J. Neuroscience, 2 (1982) 32-48.