Thesis: Identification of Billboards in a Live Handball Game



Table of Contents

ABSTRACT
I. INTRODUCTION
  1.1. BACKGROUND
  1.2. IMAGE AND VIDEO ISSUES
  1.3. VARIOUS APPROACHES
II. IMAGE FEATURES FOR TRACKING
  2.1. EDGES
  2.2. COLOR
  2.3. HISTOGRAM
III. DEFORMABLE TEMPLATE MATCHING
  3.1. BASIC THEORY
    3.1.1. Bayes Theorem
    3.1.2. Bayesian Formulation of the Deformation
  3.2. DEFORMATION MODELS
  3.3. ALGORITHM
IV. CONDITIONAL DENSITY PROPAGATION (CONDENSATION)
  4.1. BASIC THEORY
    4.1.1. Modelling Shape and Motion
    4.1.2. Discrete Time Propagation of State Density
    4.1.3. Temporal Propagation of Conditional Densities
    4.1.4. Dynamic Models
    4.1.5. Measurement
    4.1.6. Propagation
    4.1.7. Factored Sampling
  4.2. THE CONDENSATION ALGORITHM
  4.3. TEMPLATE REPRESENTATION
    4.3.1. B-spline Curves
    4.3.2. Template Curves
    4.3.3. Affine Representation of B-spline Curves
  4.4. CONDENSATION TRACKER
    4.4.1. Dynamic Model
    4.4.2. Observation Model
    4.4.3. Initialization
    4.4.4. Detailed Algorithm
V. EXPERIMENT AND FINDINGS
  5.1. EXPERIMENT PURPOSE
  5.2. DATA SET
  5.3. ERROR MEASUREMENT
    5.3.1. Mean Square Error
    5.3.2. Confidence Interval
  5.4. EXPERIMENTS ON DEFORMABLE TEMPLATE MATCHING
    5.4.1. Experimental Results
    5.4.2. Findings and Discussions
  5.5. EXPERIMENTS ON CONDENSATION ALGORITHM
    5.5.1. Define an Optimal Template
    5.5.2. Coefficients A0, A1, and B
    5.5.3. Result and Findings
  5.6. FURTHER DISCUSSIONS
    5.6.1. Optimal Definition of Final Q at Measurement Step
    5.6.2. Optimal Number of Particles
    5.6.3. Initialization and the Stability of the Filter
VI. COMPARISON OF TEMPLATE MATCHING AND CONDENSATION
VII. CONCLUSION AND FUTURE IMPROVEMENTS
APPENDIX
  A. GRADIENT MAGNITUDE FOR PIXEL REPRESENTATION
  B. COLOR MODEL CONVERSION FROM RGB TO HSI
  C. ESTIMATION OF A0, A1, AND B BY MAXIMUM LIKELIHOOD ESTIMATION METHOD
  D. IMAGES FROM THE EXPERIMENT
    D.1. The Results of Section 5.4 Experiments on Deformable Template Matching
    D.2. A0s, A1s and Bs Calculated from the Training Data in Section 5.5.2
    D.3. Results of Section 5.5 Experiments on Condensation
  E. MATLAB CODE
REFERENCES


Abstract

Replacing the commercial billboards around the field in a live TV transmission of a ball game

has huge commercial potential. The fewer parties are involved in this process, the

more profitable it is. This thesis addresses part of this process in a handball game: the

way to track a billboard in a live handball game in real-time without knowledge about

camera conditions. The information provided beforehand is the approximate location

of cameras and the designs of the billboards. We present two methods for detecting non-rigid objects, namely deformable template matching and the condensation algorithm, and

evaluate their accuracy and speed.

The template matching method seeks an object that best matches a deformable

template using the edge direction and the gradient magnitude of the edge. It can detect

the right object quite accurately. However, it is too slow to achieve real-time tracking.

The condensation algorithm predicts the location of the target object using a non-linear dynamic model. Following this, the observation model determines the new

probability distribution for the next step by comparing the edge feature between

samples and the B-spline template. B-spline curves are flexible for representing

various shapes. The condensation algorithm is flexible and fast enough to

achieve real-time tracking. However, it is difficult to create an appropriate dynamic

model suitable for many different settings.

Last but not least, we wish to thank our supervisor, Kim Streenstrup Pederson, for his

advice and encouragement. We also thank Maz Spork at Bopa Vision for materials as

well as advice from a practical point of view.


I. Introduction

1.1. Background The analysis and manipulation of live TV transmissions has huge commercial

potential. For example, while a TV station in Denmark is live broadcasting a football

match in England, the TV station would be able to replace the advertisement

billboards in the stadium with Danish advertisements without the viewers in Denmark

being able to notice the underlying process. It is even more flexible and profitable if

the TV station in Denmark can achieve this replacement without any information

from other parties, such as the location and properties of the involved cameras. To

implement such a system several problems need to be solved.

This project has been done in collaboration with the Danish company Bopa Vision,

which would like to create a product for real-time commercial replacement for live

TV of, for instance, sports events. Bopa Vision provided us with parts of a videotaped handball game, several pictures with entire billboards, and a map of the

approximate locations of the billboards. The given video sequence consists of many

clips1 taken by different cameras.

Replacing billboards in a live handball game involves several steps, as depicted in Figure 1: identifying clips, identifying the location of billboards with advertisements in each frame, handling changes in the appearance of billboards caused by motion blur,

zooming, shadows, and obstacles in front of the billboards, and placing another

commercial in the right place by adjusting for changes in global illumination without

losing picture quality or introducing delays in the transmission.

1 By a clip we mean a sequence of frames taken by one camera


Figure 1 Diagram of the billboard replacement process: from the video sequence, clips are identified; given the designs of the existing billboards, the billboards are identified in real time, yielding their positions or transformations; given the designs of the new billboards, the billboard designs are replaced in real time (handling illumination, motion blur, zooming, shadow and occlusion); finally, the clips with the new billboards are connected into a sequence.

Given the limited scope of our project we will mainly focus on the second part of such

a system. Given a clip, we will detect the relevant billboard and provide either the

transformation parameters or the location of the billboard corresponding to each

frame in the video sequence. Even though Bopa Vision’s request is detecting all

billboards in the scene, we focus on the primary stage of detecting one billboard at a

time. A robust tracker should be able to deal with whatever happens in the scene, such as shadows, lighting changes, overlapping objects, and objects entering or leaving the scene (Stauffer and Grimson, 2000).

In the identification process, we will deal with the following issues:

1. Should the process be done frame by frame? Can it be done on an entire clip?

2. How should the billboard be defined? We are supposed to have only limited

information about the scene beforehand, such as the picture of a billboard. Is


it possible to detect billboards only from their shape? And if it is the case, will

connected billboards be considered as one?

3. What should the output be? What kind of coordinate descriptions should be

given to describe the locations of the billboards? Since the output should be

ready for the replacement process, it should also identify the different kinds of

advertisements.

4. How should distorted billboards be handled? Even though every input is a clip

taken by one camera, the camera zooms and pans all the time, which can result

in motion blur. Furthermore, the lighting conditions in the stadium may

change. These factors make the identification rather difficult.

5. How should a billboard, which is not totally visible, be identified? Sometimes

one object or a few objects will be in front of a billboard, which makes part of

the billboard invisible. So we might see only part of the billboard or a

billboard which is disconnected by obstacles.

6. How should we identify obstacles and generate masks for them? As mentioned

in Issue 5, it is often the case that a few obstacles will be in front of the

billboards. We need to identify these obstacles and make masks for them, so

that the frame can be ready for the replacement process. These masks are very

important because we need to put the objects back in their original places after

replacing the billboard.

Regarding Issue 1, we start by detecting a template frame by frame, even though the

motion between two consecutive frames is in general not very large. We need to find

the accurate location of the template in order to maintain the sequence's continuity.

On the other hand, it is impossible to know beforehand when the camera makes a

sudden, large pan. If we skip a frame where the camera pans a lot, it is very likely that we will lose track of the template.

Regarding issue 2, among many image features, we will discuss edges, colors,

histograms and shapes. In particular, we will look at edges and mainly use this feature in

our algorithms.

Regarding issue 3, the output could be coordinates of each billboard. However, all

billboards move in the same way based on some transformation functions. Therefore


it is also possible to predict the locations of all billboards in a frame if we concentrate

on detecting this common transformation function. We will consider this issue in

Chapter IV.

Regarding issue 5, we will perform a few experiments on the case where a few players run in front of a billboard. However, we leave the more thorough discussion of

this issue together with issue 4 and issue 6 for future research.

1.2. Image and Video Issues

In order to successfully locate and track objects in a video sequence, we need to

understand the features of the targeted objects and some issues concerning the video

sequence.

Useful features of an image are normally the ones that are detectable to the human

eye, such as color, texture, and edges. These features are often used in image

segmentation and object recognition. In our project, each billboard has its own unique features. Most of the billboards are rectangular and contain text of a certain color and

font. But color itself might not be efficient in identifying billboards in our project,

because some billboards have the same colors and the background scene may also

have the same color as the billboards. So it is natural that we will start the tracking by

representing each billboard by a unique feature. This could be the shape of the logos

represented on the billboard and some additional color information. In order to

identify features of each billboard, we will use the color and edge details of each

frame. We will discuss color and image edges in chapter II.

Another source of concern is related to the motion in the video sequence. In video

sequences objects are blurred and the view frequently changes due to the zooming and

panning of the camera. This makes tracking more difficult. Moreover most target

objects do not have exactly the same features as the template image because of not

only motion blur, but also occlusions and lighting changes etc. Even the same object’s

features will change from frame to frame due to the above mentioned factors and the

object will be deformed by the zooming and panning of the camera as well as the

change of the 3 dimensional (3D) orientation of the billboard. Moving objects in a


video sequence have inherent motion blur, especially fast moving objects. Very often,

an object is partly occluded in the scene, for example, in our project, the billboards are

very often behind the handball players and parts of the billboard will be outside the

screen. This will cause a failure in locating and tracking the billboards. Figure 2

shows an example of an EL GIGANTEN billboard being occluded by a player.

Figure 2 A frame from the video sequence (one player is in front of the EL GIGANTEN billboard)

Figure 3 is an example of motion blur taken from one of the clips. The images blur a

lot when the camera turns quickly to trace the ball. On top of that, players who are

running fast blur a lot as well.

Figure 3 Images with motion blur (taken from the 120th frame in '10-2.avi')

A player on the left and the billboard behind her are heavily blurred because she is running faster than the other players.

In our video sequence, the camera moves constantly, sometimes very fast, in order to

follow the handball and the players. In this way, almost all the billboards we want to

track are motion blurred.

The task at this stage of the overall commercial replacement project, detecting billboards, is to provide the accurate location of the billboard in each frame where another billboard can be inserted. But when the replacement billboard is put into the scene, the same motion blur needs to be generated in order to make the scene look


natural. It looks odd, for example, if a blurred edge of the original billboard is visible

at the border of the new, replaced billboard.

Due to time limitations, we will only cover motion blur and occlusion briefly

later in the thesis.

Our thesis will be based on the following assumptions:

1) A sequence of frames in the same clip is taken by the same camera.

2) The following information is given: (1) the designs of all billboards, (2) layout

of the billboards in the sports arena (relative positioning of billboards and

cameras)

3) Billboards have to be detected even if they are occluded or parts of them are not visible on the screen.

4) All kinds of billboards can be detected by the system we implement.

5) Real time performance can be achieved by the use of programming languages

such as C++. We have, however, decided to use Matlab and lower the speed

requirement reflecting the expected gains in speed from using a faster

programming language.

6) Clips are already made, which means that separating a video sequence into

clips is out of the scope of this project. The provided video sequence is an

MPEG2 file, which Matlab cannot read. Therefore the clips are converted into avi files, which Matlab can read. We will use these converted images (a minimal reading sketch is given after this list). Somewhat surprisingly, the number of frames is reduced when the MPEG2 file is converted to an avi file, and thereby the data set becomes smaller than the

original one.
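As a hedged illustration of reading the converted clips, the following minimal Matlab sketch assumes a Matlab version that provides VideoReader (older versions used aviread instead) and uses '10-1.avi', one of the converted clips, as an example:

% Minimal sketch: read one frame of a converted clip into Matlab.
% Assumes VideoReader is available; '10-1.avi' is one of the converted clips.
v     = VideoReader('10-1.avi');
frame = read(v, 30);                 % the 30th frame as an RGB image
gray  = rgb2gray(frame);             % gray scale version used for edge detection
imshow(gray);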

1.3. Various Approaches

There are several real-time tracking approaches using various image features for

different purposes. For video surveillance systems, where a camera neither moves nor

zooms, background subtraction works well because it detects only the pixels which

have changed their colors or intensities depending on whether it uses color images or


gray scale images2 (Stauffer and Grimson, 2000; Lipton et al., 1998). Lipton et al.

mention, however, that background subtraction is not robust to changes in object size,

orientation and lighting conditions, which happen often in a handball game.

Combining more features into background subtraction for improving accuracy and/or

speed, Berriss et al. (2003) investigate a color-based approach for the MPEG-7 standard, and Koller et al. (1994) use a contour tracker and an affine motion model for robust real-time traffic scene surveillance. Background subtraction is effective only

when the objects change neither their shape nor size.

Another application of real-time tracking of deformable objects is area tracking, for

example, tracking a speaker’s head in a video conference. Fieguth and Terzopoulos

(1997) develop a very fast color-based model for tracking a speaker’s head. For this

purpose, precise position information is not needed as long as the object is visible on the screen.

In contrast, accuracy is one of the important requirements for our project. In

addition, billboards not only change their size but also may deform by the projection

from the three-dimensional (3D) viewing frame to the two-dimensional (2D) viewing

plane. Therefore, we need more precise information on the positions or

transformations of deformable objects for every frame. Deformable template

matching, addressed by Jain et al. (1996, 1998), may be able to achieve this goal by

detecting the precise positions or transformation of an object in non-linear systems. A

disadvantage of this approach is that it is time consuming (Kervrann and Heitz, 1994).

The length of processing time also depends on the capacity of the computer used for

the analysis.

If deformable template matching is too slow to achieve on-line tracking,

another approach desirable for on-line tracking is to predict the motion as Yang and

Waibel (1996) suggest. Kalman filtering is a popular approach for linear tracking

when clutter does not occur in the image (Isard and Blake, 1996). In the handball game we need to deal with clutter because players often overlap in front of billboards.

2 Explanation of color and intensity follows in Chapter 2.


Particle filtering, or the condensation algorithm (Isard and Blake, 1996), is considered because this non-linear tracker can keep tracking even in clutter.

We will first discuss image features for tracking in chapter 2; then the first method, deformable template matching, in chapter 3; the second method, the conditional density propagation method, in chapter 4; the experiments on the two methods in chapter 5; a comparison of the two methods in chapter 6; and finally the conclusion of our work in chapter 7.


II. Image Features for Tracking

In order to locate and track the billboards in the video sequence, we first need to

define the features of these billboards. This is done by making a unique template for

each billboard. The template T can be represented in different ways. For example, T

can be a histogram representing a certain pattern of the billboard. T can also be a

closed or open curve representing the shape of the billboard.

2.1. Edges

Edges are very important image features in image processing. They are the points

with high intensity contrast and characterize boundaries of objects contained in an

image. Using edge information of an image also significantly reduces the amount of

data while preserving the important structural properties of an image. An edge is

characterized by its gradient magnitude3 and gradient direction4. Among the available edge detectors we use the Canny edge operator, which can simultaneously attain (i) a low probability of error, (ii) good localization, and (iii) only one response to a single edge. A low probability of error corresponds to maximizing the signal-to-noise ratio because they

are monotonically decreasing functions. Canny (1983) found that optimization of the

product of the first two conflicting criteria, (i) and (ii), provides a single best solution.

It takes a gray scale image, smoothes it by Gaussian convolution in order to ignore

small intensity differences, and shows where the intensity discontinuity is higher than

a given threshold value5 in a binary image. Pixels whose gradient magnitudes are higher than the threshold are considered edges and shown in white (binary value 1) on

an edge map, whereas the other pixels are in black with the binary value 0.

3 The gradient magnitude is the difference in intensities or magnitude between neighboring pixels in the two directions x and y. It is defined as ∇f = [(∂f/∂x)² + (∂f/∂y)²]^(1/2) = (G_x² + G_y²)^(1/2) ≈ |G_x| + |G_y| for an image function f (Gonzalez, 2001). The vector ∇f = [G_x G_y]^T is called the gradient. More details are given in the appendix.

4 The gradient direction is the direction angle of the gradient vector. It is defined as tan⁻¹(G_y / G_x) at a point (x, y). The angle is measured with respect to the x-axis. The edge direction is perpendicular to the direction of the gradient vector (Gonzalez, 2001).

5 The threshold value should be a decimal number between zero and one.


A pixel is considered an edge if its gradient magnitude is larger than the threshold value we give.

The following figures show the importance of choosing the right threshold to define

edges. The same picture as Figure 4.a) produces different edge images when different Canny edge thresholds are given. Figure 4.b) and Figure 4.c) clearly show that different threshold values influence the result when the edge feature is used in image analysis. Figure 4.b) with the threshold 0.4 keeps the strong edges in the logo 'EL GIGANTEN' as well as other less important edges, whereas most players' edges and even the 'EL GIGANTEN' billboard's boundary disappear with the threshold 0.6 in Figure 4.c).

a) Original image

b) Edge map with Canny edge threshold of 0.4

c) Edge map with Canny edge threshold of 0.6

Figure 4 Edge pictures by different threshold values

It is very important to choose a suitable threshold value, so that the edges that need to be detected are visible in the edge map. It is, at the same time, a very difficult task. In

this thesis, we try to find a good threshold value through trials and visual judgment.

We will further discuss how the threshold affects the tracking performance later in

chapter 5.
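As a minimal Matlab sketch of this step (assuming the Image Processing Toolbox; the file name 'billboard.png' is a placeholder), the two thresholds compared in Figure 4 can be reproduced as follows:

% Minimal sketch: compare Canny edge maps for two threshold values.
I    = imread('billboard.png');        % placeholder file name
gray = rgb2gray(I);                    % edges are defined on the intensity image
bw04 = edge(gray, 'canny', 0.4);       % keeps the logo edges plus weaker edges
bw06 = edge(gray, 'canny', 0.6);       % keeps only the strongest edges
subplot(1,3,1); imshow(gray); title('original');
subplot(1,3,2); imshow(bw04); title('threshold 0.4');
subplot(1,3,3); imshow(bw06); title('threshold 0.6');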

2.2. Color

Color is another important and useful descriptor of objects in an image. Using color as

a feature has advantages; firstly, processing color is relatively fast and secondly, color

is orientation invariant (Yang and Waibel, 1996). There are many color models in use

today in order to organize the variety of colors.

The RGB color model (Red, Green, Blue), the HSI model (Hue, Saturation, Intensity), and the HSV model (Hue, Saturation, Value) are the ones most frequently used in digital image processing. The RGB model is used by most digital image devices (e.g. monitors and color cameras), including our video sequence and images. In the RGB model, a color is represented in terms of the amount of red, green and blue light it contains.

The HSI model describes a color in terms of the way it is perceived by the human eye. Hue

represents pure color. In the color model, hue is measured as an angle between 0 and

360 degrees starting from red via yellow, green, cyan, blue and magenta. For example

hue of pure green is 120 degrees. Saturation represents how much a pure color is


diluted by white light. It is measured between 0 and 1, where 0 indicates absence of

color and 1.0 indicates a pure color. Intensity or brightness is an achromatic notion.

Gray scale images are shown in intensity. Edges are defined only by gradient

magnitude of the intensity. We convert color images into gray scale images and use

the intensity values to find edges. The HSI model is obtained by converting from the RGB

model6.

We can use color information to differentiate the target objects from their

surroundings. So color might be used as an image feature to identify the target or at

least significantly reduce the searching area.

Figure 5 is a frame image from the video sequence. If, for example, we want to track

billboard 'e', we can limit the searching area in each frame image to the area where the color information is similar to billboard 'e', i.e., the combination of red and

white color.

Figure 5 Areas taken out by color information

Left: the original picture; right: red areas selected based on the color in 'e.' The right picture was made using Photoshop.

Color information is helpful in identifying the target. It can, at the same time, be misleading. One of the problems with images is that the same objects might have different colors and intensities when the lighting conditions change or there are shadows. This occurs particularly often in our project. The billboard images for the templates were taken separately, under different circumstances from the actual game in the video sequence. But in the live match broadcast, the lighting conditions are

6 The conversion formula from RGB to HSI is given in the appendix.


different and they even change often during the match. Moreover there are many

shadows caused by the players in front of the billboards. So when we use the template

color as the sample color and try to find areas with a similar color in the frame, if the tolerance level is high, a lot of unnecessary area will be included and the reduction in the searching area is not very significant; on the other hand, if the tolerance level is low, we risk ignoring important areas.
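The idea of limiting the search area by color can be sketched as follows (our own illustration, not the exact procedure used later in the thesis; the file name, the sample color and the tolerance value 40 are arbitrary examples):

% Minimal sketch: keep only pixels whose color is close to a sample color
% taken from the template. Sample color and tolerance are arbitrary examples.
frameRGB = imread('frame.png');                       % placeholder frame image
sample   = reshape([200 30 30], 1, 1, 3);             % hypothetical red of billboard 'e'
d    = sqrt(sum(bsxfun(@minus, double(frameRGB), double(sample)).^2, 3));
mask = d < 40;                                        % tolerance level
imshow(mask);                                         % candidate search area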

2.3. Histogram

The histogram can be another useful feature for tracking. One of the problems is that the

intensity is vulnerable to various unknown and uncontrollable factors which change

constantly (Weng et al., 1993), such as lighting and shadow. Even though lighting and

shadow affect the intensity, the histogram or at least its distribution may keep

characteristics of the target object. When the billboard is, however, occluded by

players or other objects, the histogram of the target billboard changes, and it then becomes

difficult to identify the correct target. Figure 7 shows the difference in histogram of a

billboard with and without occlusion given in Figure 6.

a) without occlusion b) with occlusion

Figure 6 Billboards without and with occlusion

a) and b) are neighbouring billboards on the same frame, therefore lighting effects are similar. a) taken from 10-1.avi 30th frame (415 ≤ x ≤ 565, 165 ≤ y ≤ 220), b) taken from 10-1.avi 30th frame (555 ≤ x ≤ 705, 155 ≤ y ≤ 215).

a) without occlusion b) with occlusion

Figure 7 Histogram of the intensity of Figure 6 a) and b)

The x-axis represents the intensity value from 0 to 255 and the y-axis represents the number of pixels with the corresponding intensity.


We can see a clear difference in the histograms between intensity values of 50 and 100. The histogram changes every time the billboard is occluded. Players move suddenly and irregularly as they try to trick their opponents. Since we have no clue about when occlusion occurs, the histogram does not seem to be very useful in our case.
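For completeness, the comparison behind Figures 6 and 7 can be sketched in Matlab as follows (the crop coordinates are those quoted in the Figure 6 caption; the use of VideoReader and of a simple L1 histogram distance are our own illustrative assumptions):

% Minimal sketch: compare intensity histograms of the two regions in Figure 6.
v     = VideoReader('10-1.avi');
frame = rgb2gray(read(v, 30));            % 30th frame, gray scale
regA  = frame(165:220, 415:565);          % billboard without occlusion
regB  = frame(155:215, 555:705);          % billboard with occlusion
hA = imhist(regA) / numel(regA);          % normalized 256-bin histograms
hB = imhist(regB) / numel(regB);
d  = sum(abs(hA - hB));                   % simple L1 histogram distance
bar([hA hB]);                             % visual comparison, cf. Figure 7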


III. Deformable Template Matching

An object in two-dimensional (2D) video images cannot avoid non-rigid7 transformation from the original 3D viewing frame. Objects change their shapes; for example, two parallel lines look as if they meet at the end if we see them from the side. This is called shearing. Even though the shearing may be too small to notice, it is not zero.

Therefore it is dangerous to use a rigid scheme because a target object may be

considered different from the template because of transformations other than translation, scaling, and rotation (Jain et al., 1996).

On the other hand, a deformable template can 'deform' itself to fit the image features by non-rigid transformations, meaning various transformations that are usually more complex than translation, scaling, and rotation (Jain et al., 1996). The

deformable template matching method is an approach often used for object

localization and identification of non-rigid shapes (Jain et al. 1996, 1998, 2000; Tate

and Takefuji 2002). The basic idea is searching for a target object which minimizes

the energy function defined by a Bayesian objective function. The template moves

around and is deformed in order to match with the boundary of objects in an input

image. The one which achieves the minimum energy is considered as the target. Thus

deformable template matching is more versatile and flexible in dealing with various

non-rigid transformations. It is also applicable to tracking (Jain et al. 2000; Kervrann and Heitz, 1994). Moreover, a template can be any shape, even one whose exact form is not known; it can also be a hand-drawn sketch (Jain et al. 1996).

The disadvantage of this template matching method is that it may take too much time

to achieve real-time tracking. It takes six minutes to process a whole 256×256 image (Kervrann and Heitz, 1994). Among the published methods, the one introduced by Jain et al. (1996) seems fast enough for consideration. Later on, Jain and Zhong (2000) combined color

and texture as well as shape into object localization with still images.

7 Rigid transformation includes translation, scaling, and rotation, where the shape does not change.


3.1. Basic Theory

3.1.1. Bayes Theorem

Before explaining the basic theory we briefly give the definition of Bayes theorem,

which we use for both template matching and condensation.

Bayes theorem is defined as

p(A|B) = p(B|A) p(A) / p(B)    eq. 1

p(A|B) is the posterior probability distribution, called a posteriori or posterior in English. p(B|A) is called the likelihood and is the generative model including noise, whereas p(A) provides the prior probability of the model and is therefore called a priori or prior in English. p(B) is a constant in the model and works as a normalization term. We can obtain the posterior probability based on the prior available knowledge.

3.1.2. Bayesian Formulation of the Deformation

The deformable template matching method combines both the structural knowledge

and local image features. These features are versatile in incorporating object

variations. We try to implement the deformable template matching method as

described by Jain et al. (1996, 2000) to locate the target in one still frame at first. Later on we would like to extend it to the sequence as described by Jain et al. (2000). Our template deformation model has the following components:

1) A binary contour map of the prototype template describing a real shape (gray

levels 0 and 255)

2) A set of parametric deformation transformations

3) A probabilistic model of deformation which weights the possible deformation

levels.


The templates consist of a set of points on the object contour, which can be either

closed or open. This contour is represented with white pixels (gray level of 255)

surrounded by dark pixels (gray level of 0) elsewhere.

Deformations are defined by classes, such as locations, color, and texture. The points are moved by the function (x, y) → (x, y) + (χ^x(x, y), χ^y(x, y)), where χ^x(x, y) and χ^y(x, y) are displacement functions. More details about this projection are described later. Then the deformed template is represented as follows, using the rotation angle θ, scaling factor s, translation d = (d_x, d_y) and local deformation parameter ξ on a different orthogonal basis:

ℑ_{ξ,θ,s,d}(x, y) = ℑ_0( s · R_θ((x, y) + χ_ξ(x, y)) + (d_x, d_y) )    eq. 2

where ℑ_0 denotes the prototype template and R_θ is the rotation by an angle θ.
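To make eq. 2 concrete, the rigid part of the transformation can be applied to a set of 2D template points as in the following sketch (our own illustration with arbitrary example values; the local deformation χ_ξ is omitted for brevity):

% Minimal sketch: apply scaling s, rotation R_theta and translation d (eq. 2)
% to a 2 x N matrix of template points; the local deformation is omitted.
pts   = [0 1 1 0; 0 0 1 1];                                % example template points
s     = 1.5;  theta = pi/12;  d = [10; 5];                 % example s, theta, d
R     = [cos(theta) -sin(theta); sin(theta) cos(theta)];   % rotation matrix R_theta
defPts = s * (R * pts) + repmat(d, 1, size(pts, 2));
plot(pts(1,:), pts(2,:), 'bo-', defPts(1,:), defPts(2,:), 'r*-');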

The posterior probability density of the deformed template given the input image

(p(DT|I)) is obtained by combining the prior probability density of the deformed

template (p(DT)) and the likelihood of the input image given the deformed template

(p(I|DT)), using Bayes theorem (equation 1):

p(DT|I) = p(I|DT) p(DT) / p(I)

This probability problem can be replaced by an energy function problem, so that the

template with less transformation has a lower total energy, the sum of the internal and external energy. The internal energy is model-driven and measures how much the template is deformed from its original shape. The external energy is data-driven and

measures the goodness of fit between the template ℑ and the input image I.


3.2. Deformation Models

First, a prototype template should be defined. The prototype template is the prior

shape information of the object of interest. It contains the edge or boundary

information. In our case, the templates are the binary edge maps of given pictures of

the billboards.

We assume that the template edge map is drawn on a unit square S = [0, 1]². The displacement functions χ^x(x, y) and χ^y(x, y) are continuous and satisfy the following boundary conditions: χ^x(0, y) ≡ χ^x(1, y) ≡ χ^y(x, 0) ≡ χ^y(x, 1) ≡ 0. First the binary edge map space is spanned by the following orthogonal bases defined by Amit et al. (1991), so that we can measure the deformation levels relative to these axes (Jain et al., 1996):

e^x_m(x, y) = (2 sin(πmx) cos(πmy), 0),   e^y_m(x, y) = (0, 2 cos(πmx) sin(πmy))    eq. 3

where m = 1, 2, … controls the locality and smoothness. As m increases, these basis functions vary from global and smooth to local and coarse. Using the deformation parameters ξ = {(ξ^x_m, ξ^y_m), m = 1, 2, …}, which project the displacement functions on the orthogonal basis, the deformation in the finite range for m is defined as

χ_ξ(x, y) = (χ^x_ξ(x, y), χ^y_ξ(x, y)) = Σ_{m=1}^{M} (ξ^x_m · e^x_m + ξ^y_m · e^y_m) / λ_m    eq. 4

where λ_m = απ²(2m²), m = 1, 2, …, are the normalizing constants. In this way complex deformations are represented as a function of m and ξ_m. The smaller ξ_m is, the smaller the deformation is; therefore the deformed template is closer to the original shape. Naturally the prototype template is the most likely prior shape, and a template closer to the original shape is more likely to be generated than largely displaced ones, which at the same time have larger internal energy.
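As an illustration of eq. 3 and eq. 4 (our own sketch; α, M and the deformation parameters are arbitrary example values, and λ_m is the normalizing constant reconstructed above), the displacement field on the unit square can be synthesized as follows:

% Minimal sketch: synthesize the displacement field of eq. 4 on the unit square.
[x, y] = meshgrid(linspace(0, 1, 50));
alpha  = 1;  M = 2;                     % example smoothness constant and truncation
xiX    = [0.3 -0.1];  xiY = [0.2 0.1];  % example deformation parameters xi_m
chiX = zeros(size(x));  chiY = zeros(size(y));
for m = 1:M
    lambda = alpha * pi^2 * (2 * m^2);                               % lambda_m
    chiX = chiX + xiX(m) * 2 * sin(pi*m*x) .* cos(pi*m*y) / lambda;  % e^x_m term
    chiY = chiY + xiY(m) * 2 * cos(pi*m*x) .* sin(pi*m*y) / lambda;  % e^y_m term
end
quiver(x, y, chiX, chiY);               % visualize the resulting deformation field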


Assuming that ξ^x_m and ξ^y_m are independent, that the ξ_m are independent of each other, and that they are identically distributed zero-mean Gaussian with variance σ², the simple internal energy for the deformable template ℑ is given as:

ε(ℑ(ξ)) = Σ_{m=1}^{M} ((ξ^x_m)² + (ξ^y_m)²) / (2σ²)    eq. 5

A smaller variance implies a more rigid template. The smaller the deformation

parameters are, the smaller the internal energy is. So the template which is close to

the original shape is favoured.

The external energy measures how the template contours fit with the edges in the

input edge map image data I. The better they fit, the smaller the energy is. The

important variables for the external energy are the distance between the contour and

its nearest edges and the difference in their edge directions. The gradient of edges

based on the Canny method is used as the direction. First the template is placed in the

edge potential field, which is defined by the positions and the direction of the input

image as follows:

Φ(x, y) = −exp{−ρ D(x, y)}    eq. 6

where D(x, y) = (δx² + δy²)^(1/2) is the distance to the nearest edge point in I and ρ is a

smoothing factor which controls the degree of smoothness of the potential field. The

larger ρ is, the smaller the change in the absolute value of Φ(x, y) is. In this way a larger ρ smoothes the potential field more. Fitness of the edge is measured as a combination of the distance between two edges and the difference in their directions. Combining edge directions and distances can decrease the risk of a false match. The distance is measured by Φ(x, y) and the difference in edge direction is measured by the cosine of β(x, y), the angle between the edge direction at (x, y) on the template and the direction of its nearest edge. Let Θ represent all rigid


transformations, translation d, scaling s, and rotation θ, that is, Θ = {d, s, θ}; then the

external energy is defined as:

ε(ℑ(ξ, Θ), I) = (1/n_T) Σ_{(x,y)} (1 + Φ(x, y) cos(β(x, y))) = (1/n_T) Σ_{(x,y)} (1 − exp{−ρ D(x, y)} cos(β(x, y)))    eq. 7

n_T is the number of pixels on the template. In order to make the potentials in the sum

positive, the constant 1 is added.

We would like to minimize the total energy function given by combining equation 5

and equation 7:

E(ℑ(ξ, Θ), I) = (1/n_T) Σ_{(x,y)} (1 − exp{−ρ D(x, y)} cos(β(x, y))) + Σ_{m=1}^{M} ((ξ^x_m)² + (ξ^y_m)²) / (2σ²)    eq. 8

This equation should be minimized with respect to the deformation parameter ξ and the rigid transformation parameter Θ.
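The quantities in eq. 6–8 are cheap to compute from a binary edge map; the following minimal Matlab sketch (our own illustration, assuming the Image Processing Toolbox and ignoring the edge-direction term cos β and the internal energy for brevity) shows the idea:

% Minimal sketch: edge potential field (eq. 6) and a simplified external
% energy (eq. 7 with the edge-direction term omitted).
rho = 0.05;                                          % example smoothing factor
bw  = edge(rgb2gray(imread('frame.png')), 'canny', 0.4);   % edge map of I
D   = bwdist(bw);                                    % distance to nearest edge
Phi = -exp(-rho * D);                                % potential field, eq. 6
tx  = [100 120 140];  ty = [50 50 50];               % hypothetical template pixels
Eext = mean(1 + Phi(sub2ind(size(Phi), ty, tx)));    % simplified eq. 7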

There are several local minima. Jain et al. (1996) employ a multiresolution approach, which can quickly find good solutions. Starting the search

from the coarsest stage with a large value of ρ in equation 6, insignificant dips are

smoothed. Then there are fewer spurious local minima, which makes it easier to

roughly locate the global minimum. This helps the template quickly move there. This

process is repeated at a finer stage in order to move the template even closer to the

global minimum. By narrowing the search area the exact location is detected. This

multiresolution approach reduces the iteration steps and deformation parameters

because at every step we do not need a thorough search to detect the exact location.


3.3. Algorithm

First we try the simplest conditions in order to check whether the method is

robust or not and whether processing time is within an acceptable range. Some

assumptions are made in order to make the primary experiment as simple as possible.

First of all, our template is composed of some points manually selected on the

billboard edges. Even though these points are selected on the curve, they are

considered individually and their relationships are not utilized.

Secondly, since the billboard shapes do not deform much in the video sequence and the main changes can therefore be regarded as translation, rotation and scaling, we

assume that the change in the internal energy term can be ignored. We consider only

the external energy part in equation 8. Therefore we do not control the locality and

smoothness, because the controlling parameter m is only involved in the internal

energy.

Thirdly, we limit the translation field to the edges in the input image because applying template matching to the entire image takes time, at least five seconds (Jain et al., 1996). Jain and Zhong (2000) use color and texture information in order to limit the search area by region-based screening. First we use only edge features. We place one of the template points on the input image pixels which are considered edges by the Canny method.

More general cases will be examined if the simplest algorithm works well.

Preprocessing - Define the Prototype Template (on the billboard image)

(i) Create an edge map from a given billboard image using the Canny edge method with a threshold value of 0.4, which by visual inspection seems to nicely retrieve the contours of the logos. The edge map stores the normalized directional vector (v_x, v_y) at every pixel which is considered an edge.


(ii) On the edge map, manually select some points on the contour or edge

which can represent the billboard as the prototype template. Ideally we should

take into account the fit of all pixels on the template. We use, however,

only some points in order to save processing time based on the assumption

that the quality of the search is still kept. Figure 8 shows an example. The

manually selected 12 red * points represent the billboard with the logo

‘tøj.’ Remember that our template is represented only by points;

relationships between them, such as the curves where they lie, are ignored.

Figure 8 Template representing ‘tøj’

Deformable Template Matching (on the frame image)

(iii) Create a Canny edge map of the selected pixels on the frame as done in (i) and

store the normalized directional vectors of each of the selected pixels. Create a

‘distance map’ of the selected pixels as follows. Each pixel should store the

shortest distance from the edge points. In order to minimize the calculation,

calculate distances only along their normal.

[Grid illustration: most pixels hold the default value 250, while pixels near the template points along their normals hold their actual distances (1, 2, 3) to the nearest edge.]

Figure 9 Nearest edge searching mechanism

At first a very large number, such as 250, is assigned on each pixel as ‘distance.’ If a pixel is within a distance along the normal from a selected point on the template curve, the ‘distance’ assigned on this pixel is overwritten by the actual distance.


Set all pixels to a large default value, such as 250. We only take into

account pixels within a certain distance of the edge points in order to reduce the

computational load. Replace the default values along the normals if the real

distances are smaller than the default, such as 250.

(iv) Place the template so that the first point fits on each edge pixel on the frame

image. This process amounts to translating the template to every possible location. Calculate the external energy using equation 7 and find the template which gives the minimum energy. As a trial, the rigid transformation range is defined as a rotation range from −30 degrees to 30 degrees with an interval of 15 degrees, and a scaling range from 1/2 to 2 with an interval of 1/8.
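A minimal Matlab sketch of the translation search in steps (iii) and (iv) (our own illustration; rotation, scaling and the edge-direction term are omitted, and the template points are arbitrary examples):

% Minimal sketch: slide the template over candidate edge pixels and keep the
% translation with the smallest (simplified) external energy.
rho  = 0.05;
bw   = edge(rgb2gray(imread('frame.png')), 'canny', 0.4);
Phi  = -exp(-rho * bwdist(bw));                  % edge potential field (eq. 6)
tmpl = [0 20 40 40 20 0; 0 0 0 10 10 10];        % example template points (x; y)
[ey, ex] = find(bw);                             % candidate translations: edge pixels
bestE = inf;  bestPos = [0; 0];
for k = 1:numel(ex)
    pts = tmpl + repmat([ex(k); ey(k)], 1, size(tmpl, 2));   % translated template
    ok  = all(pts(1,:) >= 1 & pts(1,:) <= size(bw,2) & ...
              pts(2,:) >= 1 & pts(2,:) <= size(bw,1));
    if ~ok, continue; end
    E = mean(1 + Phi(sub2ind(size(Phi), pts(2,:), pts(1,:)))); % simplified eq. 7
    if E < bestE, bestE = E; bestPos = [ex(k); ey(k)]; end
end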


IV. Conditional Density Propagation (Condensation)

The deformable template matching method may be too slow for real-time tracking. In

this chapter, we discuss the Condensation (Conditional Density Propagation)

algorithm for feature tracking. It is more likely to achieve real-time video tracking for

untrained data than deformable template matching.

Condensation, introduced by Blake and Isard (1995 and 1998), is based on the idea

that it is possible to track an object in subsequent images if motion and shape are

modelled (Isard and Blake, 1996). It is a particle filtering algorithm which represents

a tracked object’s state using an entire probability distribution. A probability density

function describing the likely state of the objects is propagated over time using a

dynamic model. The measurements influence the probability function and allow the

incorporation of new objects into the tracking scheme.

In general there are two cases of the probability distribution of states and

measurement: 1) both the dynamics and the measurement are linear, or 2) they are non-linear. Even a slight non-linearity due to objects' overlap results in multiple peaks in the probability

distribution of a tracked object state, which makes tracking very difficult. Laptev and

Lindeberg (2003) argue that a feature tracking approach may fail when objects

overlap or split because the assumed constant appearance of an object changes.

However, Forsyth and Ponce (2003) suggest that there is an algorithm that often

works for states in a low-dimensional space.

Unlike Kalman filtering8, the Condensation tracker can have more than one local maximum, making it multimodal. It can track an object correctly in a non-linear dynamic model without losing its way, even after locking onto one local maximum.

Moreover, this framework is not vulnerable to background clutter.

A billboard in the given still images is used as a template. Our template is represented as a curve since we use not only the straight boundary lines but also the letters on the billboard.8

8 Kalman filtering assumes that the likelihood of the image data given the sample configuration is Gaussian distributed. Since a Gaussian is unimodal, meaning only one local maximum is allowed, the tracker cannot represent simultaneous alternatives in clutter, and once the tracker is distracted it never recovers.

A spline is flexible and often used to produce a smooth curve given a set of

control points (Baker, 2004). A set of weights are distributed along the strip to keep

the spline smooth. Among the various splines, B-spline curves, with 'B' standing for 'basis', are suitable for our purpose because they are more flexible than other splines in locally adjusting the curvature; they can also represent sharp bends and even corners (Buss, 2003).

We discuss the condensation algorithm and the representation of the template in more detail in the following sections: Sections 4.1 and 4.2 cover the basic theory of the condensation algorithm; Section 4.3 discusses B-spline curves and describes our algorithm for creating templates; Section 4.4 gives the details and definitions of our condensation tracker.

4.1. Basic Theory

4.1.1. Modelling Shape and Motion

First, a model is made in order to track the object in an image sequence. This model should characterise the object's features and remain valid as the object moves over time.

Theoretically, B-spline curves could be parameterised by their control points and

blending functions. Control points move over time, but blending functions stay the

same. At every frame new curves are placed based on the new positions of the control

points.

Secondly, given the target object represented as a curve, the tracking problem is to

estimate the motion of this curve. The modelling of motion is to specify the likely

dynamics of the curve over time relative to the template using the probability

densities. Some reasonable functions can be chosen for these densities.

4.1.2. Discrete Time Propagation of State Density

Tracking techniques exploit the previous history of the image features' motion to predict the positions of these features in the next frame (Trucco and Verri, 1998). For


digital analysis, the propagation process is defined at discrete time t. The following terms are used in the condensation algorithm:

$x_t$: the state of the object at time t

$z_t$: the set of image features at time t

The history of the state up to time t is $(x_1, x_2, \ldots, x_t)$ and the history of the observation data up to time t is $(z_1, z_2, \ldots, z_t)$.

Theoretically, an observation density that characterises the statistical variability of z given x can be estimated for $x_t$ given $z_t$ at any time t (Blake and Isard, 1996). The next state is estimated from the current measurement, the current state, and the motion model.

4.1.3. Temporal Propagation of Conditional Densities

The propagation of the probability density is well described by the so-called Fokker-Planck equation9 given by Isard and Blake (1996), in which the density for $x_t$ drifts deterministically and diffuses stochastically. The condensation algorithm is designed to address this situation in general. The deterministic component drifts the entire body of

the probability density together as shown in Figure 10. Then it is diffused by the

random component of the stochastic part, which results in spreading or increasing

uncertainty and changes the probability density distribution.

9 Fokker Planck equation describes the probability density of the position and velocity of particles over time. It was first made to describe stochastically the motion of particles in fluid.


Figure 10 Probability density propagation over a discrete time-step

There are three phases: drift due to the deterministic component of object dynamics; diffusion due to the random component; reactive reinforcement due to observations (Blake and Isard, 1998).

4.1.4. Dynamic Models

We make the general assumption that the probabilistic framework of our dynamic model is based on a Markov chain, a succession of elements in which each element is generated from the preceding ones and the future depends only on the present, regardless of the earlier past. The state is then assumed to depend only on the previous state, independent of its earlier history:

$$p(x_t \mid x_1, x_2, \ldots, x_{t-1}) = p(x_t \mid x_{t-1})$$   eq. 9

Using a second-order stochastic differential equation in discrete time, the dynamics are entirely determined by the conditional density $p(x_t \mid x_{t-1})$. We discuss additional details of the model later.

4.1.5. Measurement

The observations $z_t$ are assumed to be independent of both their earlier history and the previous states. Figure 11 depicts this relationship between x and z.


Figure 11 The states $x_1 \leftrightarrow x_2 \leftrightarrow \ldots \leftrightarrow x_{t-1} \leftrightarrow x_t$ form a chain, and each measurement $z_i$ depends only on its current state $x_i$; ↔ shows the dependency.

Using the fact that $z_i$ depends only on $x_i$, the probability of the measurements given the states is expressed as follows:

$$p(z_{t-1}, \ldots, z_2, z_1, x_t \mid x_{t-1}, \ldots, x_2, x_1) = p(x_t \mid x_{t-1}, \ldots, x_1)\, p(z_{t-1}, \ldots, z_1 \mid x_{t-1}, \ldots, x_1) = p(x_t \mid x_{t-1}, \ldots, x_1) \prod_{i=1}^{t-1} p(z_i \mid x_i)$$   eq. 10

Therefore the observation process is defined solely by the conditional density $p(z_t \mid x_t)$ at each time t. We assume that $p(z_t \mid x_t)$ does not depend on time, i.e. it is stationary. This assumption leads to:

$$p(z_t \mid x_t) = p(z \mid x)$$   eq. 11

The term on the right is called the observation model. Details about our model are

discussed later.

4.1.6. Propagation

The conditional state density $p(x_t \mid z_t, \ldots, z_2, z_1)$ gives all the necessary information about the state at time t as determined by the entire history of the data. Based on Bayes' theorem, the propagation of the state density over time is defined as

$$p(x_t \mid z_t, z_{t-1}, \ldots, z_1) = k_t\, p(z_t \mid x_t)\, p(x_t \mid z_{t-1}, \ldots, z_1)$$   eq. 12


where $k_t$ is a normalization constant which does not depend on $x_t$. The first term $p(z_t \mid x_t)$ is called the likelihood and the latter term $p(x_t \mid z_{t-1}, \ldots, z_2, z_1)$ is known as the effective prior; it is in fact a prediction derived from the posterior at t-1. It can be written as follows:

$$p(x_t \mid z_{t-1}, \ldots, z_1) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid z_{t-1}, \ldots, z_1)\, dx_{t-1}$$   eq. 13

The prior $p(x_{t-1} \mid z_{t-1}, \ldots, z_2, z_1)$ at t-1 is the posterior distribution of $x_{t-1}$.

Here Bayes theorem is used in a sequence; the prior is based on the information up to

t-1, and the information at time t is added to update it as a posterior. On the other

hand, in template matching the same Bayes theorem is used in the same frame to

deform a template in a proper way.

4.1.7. Factored Sampling

The factored sampling algorithm can deal with non-Gaussian observations in an image sequence. Weighting the samples of the current sample set creates an artificial conditional situation for predicting the future state $x_t$ through $p(x_t \mid z_{t-1}, \ldots, z_2, z_1)$.

As described above, the posterior density $p(x \mid z)$ offers all the necessary information about x. It is obtained by applying Bayes' theorem, equation 1:

$$p(x \mid z) = k\, p(z \mid x)\, p(x)$$   eq. 14

where k is a normalization constant which does not depend on x. Since the posterior is not always computable, sampling is done in two steps. First a sample set {s1, s2, ..., sN} with N samples is generated for all possible x values based on the prior density p(x). Then each sample is chosen with probability

$$\pi_n = \frac{p(z \mid s_n)}{\sum_{j=1}^{N} p(z \mid s_j)}$$   eq. 15


Due to this weight, the samples which are more likely to happen have higher

probability of being chosen. Extending this idea with time evolution, the probability

or weight at any time t is given as

$$\pi_{t,n} = p(z_t \mid x_t = s_{t,n}) = \frac{p(z_t \mid s_{t,n})}{\sum_{j=1}^{N} p(z_t \mid s_{t,j})}$$   eq. 16

When new samples have been selected at every time t+1, a new set of weights is

calculated based on the new sample set.
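A minimal NumPy sketch of factored sampling as in equations 15 and 16, assuming a user-supplied likelihood function p(z | s); the function and variable names are illustrative, not from the thesis implementation.

```python
import numpy as np

def factored_sampling_weights(samples, likelihood):
    """Normalised weights pi_n = p(z | s_n) / sum_j p(z | s_j)  (eq. 15/16)."""
    w = np.array([likelihood(s) for s in samples], dtype=float)
    return w / w.sum()

def resample(samples, weights, rng=np.random.default_rng()):
    """Draw a new sample set with probability proportional to the weights."""
    idx = rng.choice(len(samples), size=len(samples), p=weights)
    return [samples[i] for i in idx]
```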

4.2. The Condensation Algorithm

The condensation algorithm is an iterative process of sampling, predicting the state of the next time-step, and measuring the position of the object through an image sequence. The factored sampling approach is used in the sampling phase. At each iteration step t, a sample set {s_{n,t}, n = 1, ..., N} is updated with corresponding weights {π_n}, which approximates the conditional state density $p(x_t \mid z_t, \ldots, z_1)$. We introduce another parameter, a cumulative probability {c_n}, in order to reduce the computational load of factored sampling. The sum of the weights is normalized so that $c_N = 1$. We generate a random number between 0 and 1. If this number lies between $c_{i-1,t}$ and $c_{i,t}$, then $s_{i,t}$ is selected as a sample. In this way N random draws determine the N samples at the next step t+1. $c_n$ is defined as

$$c_{0,t} = 0, \qquad c_{n,t} = c_{n-1,t} + \pi_{n,t}$$   eq. 17

We construct a new sample set $\{s_{n,t}, \pi_{n,t}, c_{n,t}\}$ at time step t from the old sample set $\{s_{n,t-1}, \pi_{n,t-1}, c_{n,t-1}\}$ at t-1 as follows:

1. Select a sample $s'_{n,t}$:

1) Generate a random number $r \in [0, 1]$, uniformly distributed.

2) Find the smallest j for which $c_{j,t-1} \ge r$.10

3) Set $s'_{n,t} = s_{j,t-1}$.

2. Predict $s_{n,t}$, for each n = 1, ..., N, by sampling from

$$p(x_t \mid x_{t-1} = s'_{n,t})$$   eq. 18

In this project the dynamics are described by a linear stochastic differential equation, as detailed later. Therefore the new sample value can be generated as $s_{n,t} = A\, s'_{n,t} + B\, w_{n,t}$, where $w_{n,t}$ is a vector of standard normal random numbers representing Gaussian white noise11 and $BB^T$ is the process noise covariance.

3. Measure and weight the new position in terms of the measured features $z_t$:

$$\pi_{n,t} = p(z_t \mid x_t = s_{n,t})$$   eq. 19

Then normalize so that $\sum_n \pi_{n,t} = 1$. The cumulative probability is updated as $c_{0,t} = 0$ and $c_{n,t} = c_{n-1,t} + \pi_{n,t}$ (n = 1, ..., N).

Update the new sample set $\{s_{n,t}, \pi_{n,t}, c_{n,t}\}$.

Figure 12 Condensation algorithm (taken from Blake and Isard, 1996)

10 A fast way is a binary subdivision search, which checks whether the given number fits in the lower half of a non-descending set. If the element in the middle is larger than the given number, the number must fit in the lower half. The chosen half is then divided in two and the same check is repeated until we find where the number fits in the set. This binary search is very quick because the number of iteration steps grows only logarithmically (Sestoft, 1998).
11 Gaussian white noise is a zero-mean random noise whose values are independent and identically distributed.
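The following sketch shows one iteration of the select, predict and measure loop of Figure 12, using cumulative weights and a binary search (footnote 10) for the selection step. The `predict` and `likelihood` callables stand in for the dynamic model of equation 18 and the observation model of equation 19; they, like the function name itself, are assumptions of this sketch rather than the thesis code.

```python
import numpy as np

def condensation_step(samples, weights, predict, likelihood, rng=np.random.default_rng()):
    """One iteration of the condensation algorithm (Figure 12)."""
    N = len(samples)
    cum = np.cumsum(weights)                    # c_n = c_{n-1} + pi_n, with c_N = 1
    new_samples, new_weights = [], np.empty(N)
    for n in range(N):
        r = rng.random()                        # uniform random number in [0, 1)
        j = min(int(np.searchsorted(cum, r)), N - 1)   # binary search: smallest j with c_j >= r
        s_prev = samples[j]
        new_samples.append(predict(s_prev, rng))       # sample from p(x_t | x_{t-1} = s'_{n,t})
        new_weights[n] = likelihood(new_samples[-1])   # pi_{n,t} = p(z_t | x_t = s_{n,t})
    new_weights /= new_weights.sum()            # normalise so the weights sum to 1
    return new_samples, new_weights
```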


4.3. Template Representation

There are several different kinds of splines for representing a smooth curve, such as the Hermite spline, the Bézier spline and the B-spline.

Hermite spline is described by a cubic polynomial and interpolates two end control

points and has specified first derivatives at the end points (Baker, 2004). Hermite

curve segments are connected so that the neighboring segments share control points.

Bézier spline approximates any arbitrary number of given control points. The degree

of the curve is determined by the number of control points. Some very useful

properties of the Bézier curve are that the curve connects or passes through the two

end control points and that the first derivatives at the end points can be given as a

multiple of the vector from the end point to the next control point. The B-spline also approximates a set of control points, but is more general than the Bézier spline. Its two advantages over Bézier splines are: 1) the degree of a B-spline can be set independently of the number of control points, and 2) the shape of a B-spline can be controlled locally, whereas a change of one control point affects the whole of a Bézier curve. The Bézier spline is a special case of the B-spline.

We use B-spline curves for our template because we can determine the degree of the

curve no matter how many control points we need. For example it is possible to use a

first-order linear spline to represent a line and a cubic polynomial to represent a letter.

Furthermore B-splines can represent a sharp bend.

4.3.1. B-spline Curves

4.3.1.1. Definition of B-spline Curves

B-splines approximate a set of control points, which indicate the general shape of the curve. The contribution of each control point is weighted and combined by blending functions.


Using u as a parameter, blending functions B(u) = [B1(u), B2(u), ..., Bn(u)], and X = [x1, x2, ..., xn]T 12 and Y = [y1, y2, ..., yn]T, where (xi, yi) are the coordinates of the ith control point, a parametric representation of a curve is given by

$$r(u) = (x(u), y(u)), \qquad u_{start} \le u \le u_{end}$$   eq. 20

where x(u) = B(u)X and y(u) = B(u)Y.

The parameter u traces the curve from one end to the other, where ustart<uend are the

value of u at two end points. The range of u can be defined by any real numbers;

however starting from zero is simple and easy for computational purposes. uend can be

any positive number. Sometimes it is easy if uend = 1 so that the range is normalized.

A curve is subdivided into n sections and each subinterval endpoint is called a knot

(Baker, 2004). Each blending function is defined over d intervals of the total range of

u = [ustart, uend]. An entire set of knots compose a knot vector, which should be a

nondecreasing sequence. That is, the knots can take any values as long as $u_j \le u_{j+1}$. Blending functions for B-spline curves are defined by the Cox-de Boor recursion formulas (de Boor, 1986):

$$B_{k,1}(u) = \begin{cases} 1, & \text{if } u_k \le u < u_{k+1} \\ 0, & \text{otherwise} \end{cases}$$   eq. 21

$$B_{k,d}(u) = w_{k,d}(u)\, B_{k,d-1}(u) + \bigl(1 - w_{k+1,d}(u)\bigr)\, B_{k+1,d-1}(u)$$

with

$$w_{k,d}(u) = \begin{cases} \dfrac{u - u_k}{u_{k+d-1} - u_k}, & \text{if } u_{k+d-1} \ne u_k \\ 0, & \text{otherwise.} \end{cases}$$

One minor exception is that the last point $u_{k+1} = u_{end}$ is included in the last nonzero function $B_{k,1}$. That is, if $u_k < u_{k+1} = u_{end}$, then $B_{k,1} = 1$ for $u_k \le u \le u_{k+1}$.

12 T represents transpose here. X and Y should be column vectors for calculation purposes.
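A direct transcription of the Cox-de Boor recursion of equation 21 (including the end-point exception) might look as follows; the 1-based index k and the Python function name are illustrative choices, not the thesis's Matlab code.

```python
def blending(k, d, u, knots):
    """Cox-de Boor recursion for B_{k,d}(u) (eq. 21); k is 1-based as in the thesis."""
    uk, uk1 = knots[k - 1], knots[k]
    if d == 1:
        # end-point exception: u == u_end is included in the last nonzero function
        if uk <= u < uk1 or (u == knots[-1] and uk < uk1 == knots[-1]):
            return 1.0
        return 0.0
    def w(kk, dd):
        lo, hi = knots[kk - 1], knots[kk + dd - 2]
        return 0.0 if hi == lo else (u - lo) / (hi - lo)
    return (w(k, d) * blending(k, d - 1, u, knots)
            + (1.0 - w(k + 1, d)) * blending(k + 1, d - 1, u, knots))
```

For any u inside the interval where the curve is defined, the values blending(k, d, u, knots) for k = 1, ..., n then sum to one, in line with the partition-of-unity property (equation 22 below).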


Each blending function $B_{k,d}(u)$ corresponds to a control point. Therefore there are as many blending functions as control points. Each control point acts as a weight on the corresponding blending function in the weighted sum r(u). The B-spline blending functions are polynomials of order d-1, where d is the degree parameter (Baker, 2004). The degree can be any integer value from 2 up to the number of control points, n.

B-spline curves possess the following properties (Baker, 2004):

• The polynomial curve has degree d-1 and $C^{d-2}$ continuity13 over the range of u.

• The curve is described with n control points and as many blending functions.

• Each blending function $B_{k,d}(u)$ is defined over d subintervals, between $[u_k, u_{k+d})$.

• A knot vector is composed of d+n knot values.

• Each subinterval of the B-spline curve is influenced by only d control points and accordingly d blending functions.

• Any one control point can affect the shape of at most d subintervals.

• A B-spline curve is defined only in the interval $[u_d, u_{n+1})$14, since each strip of the curve is defined by d blending functions, but some of them are not defined beyond this interval.

This is illustrated in Figure 13 using an example with d = 4 and n = 6. The knot vector is composed of 10 knots. The blending function $B_{3,4}$, for example, is defined between $[u_3, u_7)$, and the B-spline curve is defined between $[u_4, u_7)$.

13 dth-order parametric continuity, $C^d$ continuity, means that the first, second, ..., dth parametric derivatives all exist and are continuous at u (Baker, 2004).
14 Much of the literature (Buss, 2003, Baker, 2004, de Boor, 1986) counts the control points as n+1, starting from CP0 and ending with CPn. For convenience in programming, our control points run from CP1 to CPn, n control points in total. We stick to this notation in the thesis as well as in our programming.


Figure 13 The ranges over which the blending functions are defined. An example with d = 4, n = 6 and knots u1, ..., u10; the numbers on the curve indicate how many blending functions ($B_{1,4}$, ..., $B_{6,4}$) are defined on the corresponding interval. A B-spline is defined only where 4 blending functions are defined, which in this example means the B-spline is defined on $[u_4, u_7)$.

By choosing $B_{k,d}(u)$ as in equation 21, the constraint of a partition of unity is kept:

$$\sum_{k=1}^{n} B_{k,d}(u) = 1$$   eq. 22

Since every $B_{k,d}(u)$ is nonnegative, a B-spline curve lies within the convex hull of at most d+1 control points (Baker, 2004).

B-splines are represented by control points and blending functions. Control points

influence the shape of the curve; however they usually do not lie on the curve itself. If

control points are on the curve, it is because of repeated knots just like the end knots

of open splines. Once a knot is repeated, the corresponding point on the curve loses

one degree of continuity. So if three knots have the same value, which means the knot

is repeated twice, continuity decreases by two degrees. Since the curve loses some

continuity properties for its derivatives at the repeated knots, it loses its good

smoothness (Buss, 2003).


4.3.1.2. Classification of B-spline Curves

B-splines are generally classified by knot vector type; there are three knot vector types, namely uniform, open uniform and nonuniform (Baker, 2004).

1. Uniform B-spline curves

The interval between knot values is constant, such as {-1.0, -0.5, 0.0, 0.5, 1.0, 1.5}. It

is more convenient when a knot vector is normalized such as {0.0, 0.2, 0.4, 0.6, 0.8,

1.0} or when a knot vector starts with zero and the interval between two neighbouring

knots is one, such as {0, 1, 2, 3, 4, 5}.

One of the properties of uniform B-spline curves is that their blending functions are periodic, because the denominators in equation 21 are fixed at (d-1) × the knot interval:

$$B_{k,d}(u) = B_{k+1,d}(u + \Delta u) = B_{k+2,d}(u + 2\Delta u)$$   eq. 23

Periodicity can reduce computation time.

2. Open uniform B-spline curves (Open B-spline curves)

Open uniform B-splines are the same as uniform B-splines except at both ends, where

knot values are repeated d times, such as {0, 0, 0, 0, 1, 2, 3, 3, 3, 3} when d = 4 and n

= 6, and {0, 0, 0, 0.5, 1, 1, 1} when d = 3 and n = 4. Open uniform B-spline curves

have similar properties as Bézier spline curves, such as the curve connects the first

and the last control points. Actually Bézier spline curves are a special case of B-

splines, where a knot vector has only zeros and ones, such as {0, 0, 0, 0, 1, 1, 1, 1}.

3. Nonuniform B-spline curves

Nonuniform B-spline curves are defined by any values and intervals for knot vectors

as long as the knot sequence is not descending. Knot vectors can have multiple knot

values and unequal spacing, such as {0, 1, 3, 3, 4} and {0, 0, 0, 1, 1, 3, 3, 3}.


Nonuniform B-splines provide more flexibility in controlling a curve shape because

the range of each blending function, which is defined by the interval of two knots, can

be given in different lengths.

4.3.1.3. First-Order Derivatives of Blending Functions

Later we need the first-order derivatives of the blending functions in order to obtain the normal to the curve at arbitrary points on the curve. Given the first-order derivative (x', y'), where x' = dx(u)/du and y' = dy(u)/du, at an arbitrary point r(u), the normal is defined as $(y', -x')/\sqrt{y'^2 + x'^2}$.

From equation 20, the first derivative of the curve r(u) depends only on the first derivatives of the blending functions, since the coordinates of the control points are fixed. The first-order derivative of $B_{k,d}(u)$ is also calculated from two blending functions of one order lower, $B_{k,d-1}(u)$ and $B_{k+1,d-1}(u)$ (de Boor, 1986):

$$DB_{k,d}(u) = z_{k,d}(u)\, B_{k,d-1}(u) - z_{k+1,d}(u)\, B_{k+1,d-1}(u)$$   eq. 24

where

$$z_{k,d}(u) = \begin{cases} \dfrac{d-1}{u_{k+d-1} - u_k}, & \text{if } u_{k+d-1} \ne u_k \\ 0, & \text{otherwise.} \end{cases}$$

Therefore the first-order derivatives can be calculated for a given u value in just the same way as the blending functions in our algorithm.
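Under the same assumptions, the derivative of equation 24 reuses the lower-order blending functions from the previous sketch (the `blending` function defined above is assumed to be available).

```python
def blending_derivative(k, d, u, knots):
    """First derivative DB_{k,d}(u) (eq. 24), built from two lower-order blending functions."""
    def z(kk, dd):
        lo, hi = knots[kk - 1], knots[kk + dd - 2]
        return 0.0 if hi == lo else (dd - 1) / (hi - lo)
    return (z(k, d) * blending(k, d - 1, u, knots)
            - z(k + 1, d) * blending(k + 1, d - 1, u, knots))
```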

4.3.2. Template Curves

Our template is general and is created on a billboard image by manually selecting knots. The knot vectors are therefore non-uniform and, for convenience, the knot values are normalized as $u_{start} = 0 \le u_j \le u_{j+1} \le u_{end} = 1$. The order of our curve is no more than 3 (d ≤ 4) because a cubic polynomial in general provides a good trade-off between flexibility, controllability and computational complexity (Baker, 2004).

2004).

Our template curve should be drawn on the image based on some selected points on

image edges. Therefore neither the position of control points nor the u parameter on

the curve is available. Instead we know the coordinates of points on the curve. We

would prefer to interpolate points on the curve. Therefore, based on the information available, we create the curve in somewhat the opposite way: we interpolate B-splines using points on the image edges as knots and then calculate the coordinates of the control points. Buss (2003) explains this method.

The u parameters are approximated by the so-called chord length parameterization method, where $u_k$ is chosen such that

$$u_k - u_{k-1} = \lVert P_k - P_{k-1} \rVert$$   eq. 25

where $P_k$ is the coordinate of the kth interpolated point on the curve, as shown in Figure 14. Buss (2003) finds that the chord length parameterization method provides a successfully smooth curve.

Figure 14 Example of the chord length parameterization. The dotted line represents the approximated length $\lVert P_k - P_{k-1} \rVert$ of the real arc length between knots $u_k$ and $u_{k-1}$.

The next section describes our interpolation method based on Buss (2003). At the

tracking stage, control points are transformed according to our transformation method

described later, whereas corresponding blending functions are kept throughout the

time sequence.
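A small NumPy sketch of the chord length parameterization of equation 25, returning normalized knot values; the function name is an illustrative assumption.

```python
import numpy as np

def chord_length_knots(points):
    """Normalised knot values u_k from cumulative chord lengths (eq. 25)."""
    pts = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)   # ||P_k - P_{k-1}||
    u = np.concatenate(([0.0], np.cumsum(seg)))
    return u / u[-1]                                      # normalise to [0, 1]
```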


4.3.2.1. Interpolating with B-splines

Our strategy for interpolating smooth B-splines is first to select knot positions, second to

calculate the chord length parameters for these knots, third to give the blending

function values for these parameters, and fourth to obtain the control points.

The preconditions are:

1) A curve connects or passes through the first and the last control points

2) The degree of the polynomials is at most 4

3) Knots are manually selected on the image so that the coordinates of knots are

available

4) Spacing between knots is nonuniform

5) Multiple selection of the same point as a knot is not allowed15

6) Knot values are normalized

Two end points are used as two control points from the precondition 1). As for 2), a

linear B-spline (d = 2) can be used for a straight line and a cubic polynomial B-spline

(d = 4) can be used for a smoothly bending curve. As for 6), it is easier to use the

curve later if the precise range of knot values is available and fixed.

The explanation below is based on the assumption that the curve is of order three (d = 4). Given points on the curve, [P1, P2, P3, ..., Pm]16, and their parameters $[u_1, u_2, \ldots, u_{m-1}, u_m]$ with $u_j < u_{j+1}$ for all j (from precondition 5), the knot vector should repeat the two end knots four times:

$$[u_1, u_1, u_1, u_1, u_2, \ldots, u_{m-1}, u_m, u_m, u_m, u_m] = \left[0, 0, 0, 0,\; \frac{\lVert P_2 - P_1 \rVert}{\sum_{i=1}^{m-1} \lVert P_{i+1} - P_i \rVert},\; \ldots,\; \frac{\sum_{i=1}^{m-2} \lVert P_{i+1} - P_i \rVert}{\sum_{i=1}^{m-1} \lVert P_{i+1} - P_i \rVert},\; 1, 1, 1, 1\right]$$   eq. 26

15 Repeated knots disable the following matrix calculation because the determinant becomes zero (see equation 29). In reality it is nearly impossible to click exactly the same position twice; therefore we think this assumption does not limit the generality of the template for the condensation algorithm.
16 The number of control points n ≠ the number of selected points on the curve m when d ≠ 2. The reason is explained in Table 1.


There are m+2× (d-1) = m+6 knots. The number of control points then must be (m+2×

(d-1))-d = m+2. That is m+2 control points should be determined from m known

points.

The following table shows the relationships between the number of known conditions,

which are the points to be interpolated, and the number of control points and how

many conditions are required to calculate all control points for each degree d = 2, 3,

and 4.

Degree   Number of knots   Number of known points   Number of control points   Number of missing information
  2          m+2                   m                         m                            0
  3          m+4                   m                         m+1                          1
  4          m+6                   m                         m+2                          2

Table 1 Numbers of known conditions and missing information given m interpolated points

Table 1 shows that from m given interpolated points we would like to know m+1

control points when d = 3, and m+2 control points when d = 4. That means one more

condition is required to calculate all control points when d = 3, and two more

condition are required when d = 4. Therefore, Buss (2003) suggests making ‘one more

arbitrary assumption’ to fulfil the minimum conditions to identify all control points.

That is, the first-derivatives at u = 0 and u = 1 are equal to zero when d = 4. This is

equivalent to assuming that the first two control points must be the same and the last

two control points are identical as well. When d=3, the required condition is just one,

so the additional condition is only that the first derivative at u = 0 is zero. Then we

can calculate m different control points from m interpolated points.

4.3.2.2. B-spline drawing algorithm

The goal is to create a nonuniform B-spline curve on an image using a set of manually

selected points on the image as knots and to return the coordinates of control points as

mentioned above. The first-order derivatives are also calculated by following this

algorithm.


Given interpolated points, the knot values are calculated using Chord length

parameterization as mentioned in the previous section and normalized. The end knots

0 and 1, are repeated d times to create a knot vector as given in equation 26.

Taking d = 4 as an example, each interpolated point is defined as follows:

$$P(u_k) = B_{k-3,4}(u_k)\, CP_{k-3} + B_{k-2,4}(u_k)\, CP_{k-2} + B_{k-1,4}(u_k)\, CP_{k-1} + B_{k,4}(u_k)\, CP_k, \qquad u_k \le u < u_{k+1},\; 4 \le k \le m+2$$   eq. 27

where m denotes the number of selected points on the curve and $CP_k$ is the kth control point.

Regardless of the degree, there are m different control points and as many interpolated

points. We use only m different points and corresponding equations for the linear

matrix calculation in order to avoid obtaining a zero determinant. With the condition

that the two end points are used as control points, equation 27 can be put into a single

matrix equation.

$$\begin{pmatrix} P_1 \\ P_2 \\ \vdots \\ P_m \end{pmatrix} = \begin{pmatrix} 1, & 0, & 0, & \ldots, & 0 \\ 0, & B_{2,4}(u_5), & B_{3,4}(u_5), & B_{4,4}(u_5), & 0, \ldots, 0 \\ 0, & B_{3,4}(u_6), & B_{4,4}(u_6), & B_{5,4}(u_6), & 0, \ldots, 0 \\ & & \vdots & & \\ 0, & \ldots, & 0, & 0, & 1 \end{pmatrix} \begin{pmatrix} CP_1 \\ CP_2 \\ \vdots \\ CP_m \end{pmatrix}$$   eq. 28

We can solve this equation for the control points by multiplying both sides by the inverse of the blending function matrix B. In case the inverse does not exist, the pseudo-inverse $B^{+} = (B^T B)^{-1} B^T$ is used instead of $B^{-1}$ (Hartley and Zisserman, 2003).

We need to explain how each blending function is calculated. We know the u value of each interpolated point. For each point, each blending function $B_{k,d}(u)$, k ∈ [1, n+d-1], is calculated recursively from the lower-order blending functions and ultimately from d first-degree functions.


In the computation of the blending functions, first, when d = 1, only one blending function $B_{k,1}(u) = 1$ for $u_k \le u < u_{k+1}$, while all other first-degree functions $B_{i,1}(u) = 0$ for i ∈ [1, n-1], i ≠ k. Each blending function influences two blending functions of one degree higher, as shown in Figure 15; for example, $B_{k,1}(u)$ influences $B_{k,2}(u)$ and $B_{k-1,2}(u)$.

Figure 15 How blending functions influence each other

Given only $B_{k,1}(u) = 1$ at d = 1, all blending functions at d = 2, 3, and 4 are calculated as shown in Table 2.

        d = 1    d = 2        d = 3                                        d = 4
1       0        0            0                                            0
2       0        0            0                                            0
...     0        0            0                                            0
k-3     0        0            0                                            $w_{k-3,4}\cdot B_{k-2,3}$
k-2     0        0            $w_{k-2,3}\cdot B_{k-1,2}$                   $w_{k-2,4}\cdot B_{k-2,3} + (1-w_{k-1,4})\cdot B_{k-1,3}$
k-1     0        $w_{k-1,2}$  $w_{k-1,3}\cdot B_{k-1,2} + (1-w_{k,3})\cdot B_{k,2}$   $w_{k-1,4}\cdot B_{k-1,3} + (1-w_{k,4})\cdot B_{k,3}$
k       1        $w_{k,2}$    $w_{k,3}\cdot B_{k,2}$                       $w_{k,4}\cdot B_{k,3}$
...     0        0            0                                            0
n       0        0            0                                            0

Table 2 How the blending functions at k, d are calculated

The definitions of the scalar $w_{k,d}$ and the blending function $B_{k,d}$ are given in equation 21.

The first derivatives of the blending functions are influenced by the same two blending functions. For example, $DB_{k,d}$ is calculated using $B_{k,d-1}$ and $B_{k+1,d-1}$, the same functions that produce $B_{k,d}$. Therefore the first derivatives can be calculated with the same algorithm using different weights.

A step-by-step algorithm for producing a B-spline curve given a u value as an input is

given as follows.


Algorithm

1. Manually select a number of points {P1, P2, P3, ..., Pm} on the image.

2. Compute normalized knot values for these points using the chord length parameterization (equation 25).

a) Approximate the distance between every two neighbouring points using the chord length parameterization.

b) Assign to each point of {p1, p2, p3, ..., pm} the accumulated distance from p1.

c) Normalize: set {p'1, p'2, p'3, ..., p'm} = {0, $\lVert p_2 - p_1 \rVert / \sum_{i=1}^{m-1}\lVert p_{i+1} - p_i \rVert$, $(\lVert p_2 - p_1 \rVert + \lVert p_3 - p_2 \rVert) / \sum_{i=1}^{m-1}\lVert p_{i+1} - p_i \rVert$, ..., 1}.

3. Create a knot vector and a set of sample points.

a) Create a knot vector by adding d-1 zeros at the beginning and as many ones after the last knot, e.g. v = {0, 0, 0, 0, p'2, p'3, ..., p'm-1, 1, 1, 1, 1} for d = 4.

4. Calculate $B_{k,d}(u)$ and $DB_{k,d}(u)$ for each u of the interpolated points. When the first-order derivatives are not required, the $DB_{k,d}(u)$ calculation is skipped.

a) For each knot value u, assign $B_{k,1}(u) = 1$ if $u_k \le u < u_{k+1}$ and $B_{k,1}(u) = 0$ otherwise, for k = [1, 2, ..., n+d-1].

b) Calculate $B_{k,2}(u)$ using $B_{k,1}(u)$ for k = [1, 2, ..., n].

c) Calculate $DB_{k,2}(u)$ using $B_{k,1}(u)$ for k = [1, 2, ..., n].

d) Repeat this process until $B_{k,d}(u)$ and $DB_{k,d}(u)$ are obtained from $B_{k,d-1}(u)$, for k = [1, 2, ..., n].

e) Repeat the process from a) to d) for each u value, one by one.

5. Calculate the coordinates of the control points:

$$[X, Y] = B^{-1} \cdot [P(x), P(y)]$$


where $X = [CP_1(x), CP_2(x), \ldots, CP_n(x)]^T$ and similarly for Y. $CP_1(x)$ denotes the x coordinate of the first control point. $CP_1 = P_1$ and $CP_n = P_n$ as explained in section 4.3.2.1.

$P(x) = [CP_1(x), P_2(x), \ldots, P_{n-1}(x), CP_n(x)]^T$ and similarly for P(y); $P_i$ denotes the ith sample point.

B is the square n×n blending function matrix, in which each row contains the blending functions $[B_{1,d}(u), B_{2,d}(u), \ldots, B_{n,d}(u)]$ evaluated at the u value of the corresponding sample point of P.

The pseudo-inverse is used when det(B) = 0.17

6. Return the coordinates of control points and the first-order derivatives of

sample points.
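Step 5 in code form, assuming the blending-function matrix B has already been evaluated at the chord-length parameters (step 4); the pseudo-inverse branch mirrors the det(B) = 0 case. This is a sketch under those assumptions, not the thesis's Matlab routine.

```python
import numpy as np

def solve_control_points(B, Px, Py):
    """Step 5: recover control-point coordinates X, Y from the blending matrix B
    and the sample-point coordinate vectors P(x), P(y)."""
    try:
        Binv = np.linalg.inv(B)
    except np.linalg.LinAlgError:
        Binv = np.linalg.pinv(B)        # pseudo-inverse (B^T B)^{-1} B^T when B is singular
    return Binv @ Px, Binv @ Py
```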

4.3.3. Affine Representation of B-spline Curves

4.3.3.1. Definition of the Affine Representation

In the tracking process, the template is transformed to find a reasonable match. As we

have mentioned in the beginning of chapter III, objects are deformed by the projection

of the 3D viewing frame to the 2D viewing plane. However, since a billboard is a

rigid and planar shape, we approximate the transformation of billboards by the affine

transformation as Weng et al. (1989) assumed the scene is rigid for their image

matching. Since the affine transformation keeps parallel lines as parallel, the

projective transformation of depth in the 3D view frame onto a 2D viewing plane as

shown in Figure 16 is missed.

17 The case where det(B) =0 did not happen during our experiment partly due to the fact that we do not allow repeated knots except for the end knots.



Figure 16 Non-affine projective transformation from 3D to 2D

In the affine transformation only six degrees of freedom are required to describe a curve, namely translation vertically and horizontally, rotation, and scaling vertically, horizontally and in the diagonal direction, i.e. equal scaling vertically and horizontally (Blake et al., 1995). Furthermore, in our project we aim at tracking the template represented by B-spline curves, which is defined by the control points and the corresponding blending functions. Since tracking all control points allows too many degrees of freedom, tracking all of them would be complicated and unstable. Instead we reduce the number of variables to as few as six without losing accuracy, based on the fact that, supposing a billboard is represented as a planar shape, it can be described by the six affine degrees of freedom (Blake et al., 1995).

A B-spline curve can be approximated by a six-degree linear vector-valued function,

Q. The relationship between the coordinates of the control points and Q are defined as

follows (Blake et. al., 1995).

$$\begin{pmatrix} X \\ Y \end{pmatrix} = WQ + \begin{pmatrix} \bar{X} \\ \bar{Y} \end{pmatrix}$$   eq. 29

or

$$Q = M \left[ \begin{pmatrix} X \\ Y \end{pmatrix} - \begin{pmatrix} \bar{X} \\ \bar{Y} \end{pmatrix} \right]$$   eq. 30

where the matrices W and M define the relationship between Q and the template $(\bar{X}, \bar{Y})$.

This template $(\bar{X}, \bar{Y})$ is a pair of column vectors of the x and y coordinates of the n control points, respectively. Using the notation $\mathbf{1}$ and $\mathbf{0}$ for an n×1 column vector of all ones and all zeros, respectively, the space of Q-vectors is spanned by the following basis vectors: $\{(\mathbf{1}; \mathbf{0}), (\mathbf{0}; \mathbf{1}), (\bar{X}; \mathbf{0}), (\mathbf{0}; \bar{Y}), (\mathbf{0}; \bar{X}), (\bar{Y}; \mathbf{0})\}$ (Blake et al., 1995). Then the 2n×6 matrix W can be defined as follows:

$$W = \begin{pmatrix} \mathbf{1}, & \mathbf{0}, & \bar{X}, & \mathbf{0}, & \mathbf{0}, & \bar{Y} \\ \mathbf{0}, & \mathbf{1}, & \mathbf{0}, & \bar{Y}, & \bar{X}, & \mathbf{0} \end{pmatrix}$$   eq. 31

Then the pseudo-inverse of W, called M, is given as follows:

$$M = (W^T H W)^{-1} W^T H$$   eq. 32

where H is a 2n×2n matrix,

$$H = \begin{pmatrix} \int_0^n B(u)^T B(u)\, du & 0 \\ 0 & \int_0^n B(u)^T B(u)\, du \end{pmatrix}.$$

Using this expression a curve is rewritten as

$$r(u, t) = U(u)\, Q(t)$$   eq. 33

where

$$U(u) = \begin{pmatrix} B(u) & 0 \\ 0 & B(u) \end{pmatrix} \cdot W$$

and $B(u) = [B_1(u), B_2(u), \ldots, B_n(u)]$.

As can be seen from the equation 30, any planar B-spline curve with n control points

is expressible with a six-dimensional vector Q, which enormously reduces

computation time because n is usually much larger than six.


4.3.3.2 Transformation in Q space

The following relationship between Q and a control point $(X(i), Y(i))$ is obtained by inserting equation 31 into equation 29:

$$\begin{pmatrix} X \\ Y \end{pmatrix} = \begin{pmatrix} \mathbf{1}, & \mathbf{0}, & \bar{X}, & \mathbf{0}, & \mathbf{0}, & \bar{Y} \\ \mathbf{0}, & \mathbf{1}, & \mathbf{0}, & \bar{Y}, & \bar{X}, & \mathbf{0} \end{pmatrix} \begin{pmatrix} Q(1) \\ Q(2) \\ Q(3) \\ Q(4) \\ Q(5) \\ Q(6) \end{pmatrix} + \begin{pmatrix} \bar{X} \\ \bar{Y} \end{pmatrix}$$   eq. 34

This can be rewritten as

$$X = Q(1)\cdot\mathbf{1} + Q(3)\cdot\bar{X} + Q(6)\cdot\bar{Y} + \bar{X}$$
$$Y = Q(2)\cdot\mathbf{1} + Q(4)\cdot\bar{Y} + Q(5)\cdot\bar{X} + \bar{Y}$$

This means that the ith template control point is transformed by Q as follows:

$$\begin{pmatrix} X(i) \\ Y(i) \end{pmatrix} = \begin{pmatrix} Q(3)+1, & Q(6) \\ Q(5), & Q(4)+1 \end{pmatrix} \begin{pmatrix} \bar{X}(i) \\ \bar{Y}(i) \end{pmatrix} + \begin{pmatrix} Q(1) \\ Q(2) \end{pmatrix}$$   eq. 35

The influence of each of the six elements is depicted with the simplest template of a

square with its four corner coordinates as P1=(-1, -1), P2=(1, -1), P3=(1, 1), and

P4=(-1, 1) as shown in Figure 17.


Figure 17 Template for checking the Q space transformation


Then the template coordinate vectors are $\bar{X}$ = [-1, 1, 1, -1]T and $\bar{Y}$ = [-1, -1, 1, 1]T. The Q vectors are chosen so that one of the six elements has a non-zero value, and $\bar{X}$, $\bar{Y}$ and each Q-vector are inserted into equation 34 to calculate the transformed template.

Figure 18 shows the influence of Q(1) and Q(2). Only the value of Q(1) is given in

a) as Qa=[1,0,0,0,0,0]T and b) as Qb=[-1,0,0,0,0,0]T, where only the value of Q(2) is

given in c) as Qc=[0,1,0,0,0,0]T and d) as Qd=[0,-1,0,0,0,0]T. We can see that Q(1)

and Q(2) translate the template in the x direction and the y direction by Q(1) and Q(2),

respectively.

a) Qa=[1,0,0,0,0,0]T b) Qb=[-1,0,0,0,0,0] T c) Qc=[0,1,0,0,0,0]T d) Qd=[0,-1,0,0,0,0]T

Figure 18 Influence of Q(1) and Q(2) The template is transformed by Qa, Qb, Qc, and Qd.

The influence of Q(3) and Q(4) are tested in the same way using another set of Q

vectors, which have a non-zero value either in Q(3) or Q(4).

Figure 19 shows that Q(3) and Q(4) scale the x coordinate and the y coordinate with

respect to the y-axis and the x-axis by Q(3)+1 and Q(4)+1, respectively. It can naturally be assumed that Q(3)+1 and Q(4)+1 are always non-negative, because it does not make sense to scale in the negative direction, which would imply that a billboard is flipped as if reflected in a mirror.


a) Qe=[0,0,1,0,0,0]T b) Qf=[0,0,-1,0,0,0] T c) Qg=[0,0,0,1,0,0]T d) Qh=[0,0,0,-1,0,0] T

Figure 19 Influence of Q(3) and Q(4)

The template is transformed by Qe, Qf, Qg, and Qh. Finally the influence of Q(5) and Q(6) are tested in the same way using another set of

Q vectors, which have a non-zero value either in Q(5) or Q(6). Figure 20 shows that

Q(5) and Q(6) shear relative to the y direction and the x direction, respectively.

a) Qk=[0,0,0,0,1,0]T b) Ql=[0,0,0,0,-1,0] T c) Qp=[0,0,0,0,0,1]T d) Qq=[0,0,0,0,0,-1]T

Figure 20 Influence of Q(5) and Q(6) The template is transformed by Qk, Ql, Qp, and Qq.

In general affine transformation is a composition of rotations, scaling, shearing and

translations and when the transformation is along the (x, y) axes and centered at the

origin, it can be expressed as:

$$\begin{pmatrix} X(i) \\ Y(i) \end{pmatrix} = \begin{pmatrix} sc_x\cos\theta, & -sh_x\sin\theta \\ sh_y\sin\theta, & sc_y\cos\theta \end{pmatrix} \begin{pmatrix} \bar{X}(i) \\ \bar{Y}(i) \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}$$   eq. 36

where $sc_x$ and $sc_y$ denote scaling in the x and y directions, $sh_x$ and $sh_y$ denote shearing along the x and y directions, and $t_x$ and $t_y$ denote translation in the x and y directions, respectively. The transformation includes rotation only when appropriate values of $\cos\theta$ and $\sin\theta$ are given, which satisfy the following equation obtained by combining equations 35 and 36:


$$\frac{(Q(3)+1)(Q(4)+1)}{sc_x\, sc_y} - \frac{Q(5)\, Q(6)}{sh_x\, sh_y} = \cos^2\theta + \sin^2\theta = 1$$

This condition is never met when three out of four elements are zero. Therefore the

above examples are not rotated. The following example in Figure 21 with Q = [0.5, -

0.5, 0.25, 0.25, 1, -1]T shows the rotation effect as well.

Figure 21 Transformation with Q = [0.5, -0.5, 0.25, 0.25, 1, -1]T

As such, we track these six transformation values instead of the control points themselves. Knowing the control points of the template $(\bar{X}, \bar{Y})$, we can always recover the control points of a target object (X, Y) by equation 34.
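A sketch of the Q-space transformation of equation 35 applied to the square template of Figure 17; the Python indices are 0-based, so Q[0] corresponds to Q(1) and so on, and the function name is illustrative.

```python
import numpy as np

def apply_Q(Q, Xbar, Ybar):
    """Transform template control points (Xbar, Ybar) by the 6-vector Q (eq. 34/35)."""
    Q = np.asarray(Q, dtype=float)
    X = Q[0] + (1.0 + Q[2]) * Xbar + Q[5] * Ybar
    Y = Q[1] + Q[4] * Xbar + (1.0 + Q[3]) * Ybar
    return X, Y

# The unit-square template of Figure 17:
Xbar = np.array([-1.0, 1.0, 1.0, -1.0])
Ybar = np.array([-1.0, -1.0, 1.0, 1.0])
# Q = [1, 0, 0, 0, 0, 0] translates the square by +1 in the x direction (Figure 18a).
print(apply_Q([1, 0, 0, 0, 0, 0], Xbar, Ybar))
```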

4.4. Condensation Tracker

4.4.1. Dynamic Model

As mentioned earlier in section 4.1.4, a particle's state at time t depends on its history.

Based on the assumption that the object dynamics form a Markov chain, the new state

is conditional only on the immediately preceding state, which can be written as:

$$p(x_t \mid x_1, x_2, \ldots, x_{t-1}) = p(x_t \mid x_{t-1})$$   eq. 37

Using a second-order stochastic differential equation in discrete time, $p(x_t \mid x_{t-1})$ can be modelled by:

$$x_t - \bar{x} = a\,(x_{t-1} - \bar{x}) + b\,\omega_t$$   eq. 38


where $\omega_t$ are independent vectors of independent standard Gaussian variables and, as such, can be regarded as Gaussian white noise, and $\bar{x}$ here is the mean value of $x_{t-1}$.18 The parameter a controls the deterministic drift component of the dynamic model; it determines the oscillatory behaviour of the system, such as its modes, natural frequencies and damping constants. The parameter b adjusts the stochastic part of the system and couples the noise into the deterministic dynamics of the system.

18 It is potentially confusing that the same symbol is used as for the template described in the earlier section, but this mean can be different from the template.

This dynamic model is applied in our 6D affine Q space. Let

$$x_t = \begin{pmatrix} Q_{t-1} \\ Q_t \end{pmatrix}, \qquad \bar{x} = \begin{pmatrix} \bar{Q} \\ \bar{Q} \end{pmatrix}, \qquad a = \begin{pmatrix} 0 & I \\ A_0 & A_1 \end{pmatrix}, \qquad b = \begin{pmatrix} 0 & 0 \\ 0 & B \end{pmatrix};$$

equation 38 can then be written as

$$Q_t = A_0 Q_{t-2} + A_1 Q_{t-1} + (I - A_0 - A_1)\bar{Q} + B\omega_t, \qquad t \ge 2$$   eq. 39

$\bar{Q}$ represents the transformation to the mean position. $\bar{Q}$ displaces the template to the mean position at t-1, meaning that the transformation by Q is expressed relative to the mean position at t-1. The next state at time t depends only on the two previous states $Q_{t-1}$ and $Q_{t-2}$.

The coefficients $A_0$, $A_1$ and B are important for a good prediction of the next state. They can be estimated from a training data set or set to good default values. We first set likely default values for $A_0$, $A_1$ and B. If this does not work well, we will improve the algorithm so that the coefficients are estimated from training data.
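A one-function NumPy sketch of drawing the next state from the second-order model of equation 39, assuming A0 and A1 are 6x6 matrices, B is the 6x6 noise coupling matrix and Qbar is the mean position; all names are illustrative.

```python
import numpy as np

def predict_Q(Q_prev1, Q_prev2, Qbar, A0, A1, B, rng=np.random.default_rng()):
    """Draw Q_t from eq. 39: Q_t = A0 Q_{t-2} + A1 Q_{t-1} + (I - A0 - A1) Qbar + B w_t."""
    w = rng.standard_normal(6)                 # Gaussian white noise vector
    return A0 @ Q_prev2 + A1 @ Q_prev1 + (np.eye(6) - A0 - A1) @ Qbar + B @ w
```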

4.4.1.1. Estimation of the coefficients $A_0$, $A_1$ and B

The more accurately the coefficients A0, A1 and B are estimated, the more robust the tracking can be. Since our tracking model is complicated, with both stochastic and

dynamic elements, the motion can be more accurately tracked if the model has been

trained, meaning the model parameters have been estimated. Although the movement

in live handball games is always irregular in terms of the direction and the speed, we

assume that the camera has basically similar movement within a game and from one

game to the next so that we can make use of the same training data.

We use the maximum-likelihood estimation method, an ‘effective’ method for

learning the dynamics from the previous Q sequence introduced by Blake et.

al.(1995). They found that the performance is improved especially by ignoring

background features and by being able to follow rapid motions. The disadvantage of

this method is that the tracker works well only for the specific shapes and motions

used for the training (Blake et. al., 1995).

In Maximum Likelihood Estimation (MLE) we try to find the most likely discrete-time system parameters $A_0$, $A_1$ and B by maximizing the likelihood of the observed data set over all possible values of these parameters, following Blake et al. (1995). The deterministic and stochastic parameters are independent, so we can estimate them separately. The likelihood of the parameters $A_0$, $A_1$ and B with respect to a set of samples is given as

$$p(Q_1, Q_2, \ldots, Q_m \mid A_0, A_1, B) = \prod_n p_\omega\!\left(B^{-1}(Q_{n+2} - A_0 Q_n - A_1 Q_{n+1})\right)$$   eq. 40

11021021 ω eq. 40

Remember that tω are independent vectors of independent zero-mean Gaussian

variables with the variance BBT. The probability density function of a Gaussian

random variable, w, is given by

22 2)(

21)( σµ

σπ−−= wewp eq. 41

where µ is the mean of w, and σ is its standard deviation.

The log-likelihood is often used instead because the calculation becomes much simpler

thanks to the change of multiplications into additions and because the log function is

monotonic; the maximum of equation 40 is obtained with the same parameters as the


maximum of its logarithm. The log-likelihood function L for the sequence of states $Q_i$ given the parameters $A_0$, $A_1$ and B is described as follows (Blake et al., 1995):

$$L(Q_1, Q_2, \ldots, Q_m, A_0, A_1, B) \equiv \log p(Q_1, \ldots, Q_m \mid A_0, A_1, B) + const$$   eq. 42

Our goal is to find $A_0$, $A_1$ and B which maximize the log-likelihood L. Except for the constant, equation 42 can be rewritten by inserting equations 40 and 41 with µ = 0 and $\sigma^2 = BB^T$:

$$L(Q_1, Q_2, \ldots, Q_m, A_0, A_1, B) = -\frac{1}{2}\sum_{n=1}^{m-2} \left\lVert B^{-1}\bigl(Q_{n+2} - A_0 Q_n - A_1 Q_{n+1}\bigr) \right\rVert^2 - (m-2)\log\det B$$   eq. 43

for a training sequence $Q_1, \ldots, Q_m$. It is in principle impossible to estimate B itself, but it is possible to estimate the covariance $C = BB^T$. Maximizing L with respect to $A_0$ and $A_1$ is independent of the value of C, because the deterministic parameters $A_0$ and $A_1$ and the stochastic parameter B can be estimated independently of each other. By going through the detailed proof and calculations given in appendix C, we obtain the parameters which maximize equation 43: $A_0$ and $A_1$ are the solutions of

$$S_{20} - A_0 S_{00} - A_1 S_{10} = 0 \qquad \text{and} \qquad S_{21} - A_0 S_{01} - A_1 S_{11} = 0,$$

where $S_{ij} = \sum_{n=1}^{m-2} Q_{n+i} Q_{n+j}^T$ (i, j = 0, 1, 2), and $BB^T = \frac{1}{m-2} Z(\hat{A}_0, \hat{A}_1)$.
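The least-squares solution of these (reconstructed) normal equations can be sketched as follows, assuming the training sequence of Q vectors is already expressed relative to the mean position as in equation 40; this is an illustration, not the derivation of appendix C.

```python
import numpy as np

def learn_dynamics(Q_seq):
    """Estimate A0, A1 and the noise covariance BB^T from a training sequence
    of Q vectors (array of shape m x 6) via the normal equations above."""
    Q = np.asarray(Q_seq, dtype=float)
    m = len(Q)
    S = {(i, j): sum(np.outer(Q[n + i], Q[n + j]) for n in range(m - 2))
         for i in range(3) for j in range(3)}
    M = np.block([[S[0, 0], S[0, 1]],
                  [S[1, 0], S[1, 1]]])                   # 12 x 12 moment matrix
    A = np.hstack((S[2, 0], S[2, 1])) @ np.linalg.pinv(M)
    A0, A1 = A[:, :6], A[:, 6:]
    resid = Q[2:] - Q[:-2] @ A0.T - Q[1:-1] @ A1.T       # Q_{n+2} - A0 Q_n - A1 Q_{n+1}
    C = resid.T @ resid / (m - 2)                        # estimate of BB^T
    return A0, A1, C
```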

4.4.2. Observation Model

The purpose of the observation step is to measure how likely each predicted particle is given the observed data. Naturally, the observation model utilizes the set of image features $z_t$, such as edges, color, histograms, etc., to determine how well the predicted state of a sample matches the input or observation data $z_t$.


The observation process is defined by an observation density function $p(z_t \mid x_t)$, which defines the probability of the measurement $z_t$ for a given state $x_t$. As given in equation 11, $p(z_t \mid x_t) = p(z \mid x)$, which means that $p(z \mid x)$ can be used throughout the tracking (Blake and Isard, 1998).

In a 2D image, the observation z is, in principle, the entire set of visible features. The observation density $p(z \mid x)$ in two dimensions describes the distribution of the observed curve z(s) given the predicted curve r(s), $0 \le s \le 1$, described by a state parameter x. Each possible observation is found by tracing normals from the predicted curve r. This does not always find the correct observation z, because edges arise from clutter as well as from foreground features. The observation density $p(z \mid x)$ in 2D is defined as:

$$p(z \mid x) \propto \exp\left(-\frac{1}{2\sigma^2} f(v_1; \mu)\right)$$   eq. 44

where $f(v; \mu) = \min(v^2, \mu^2)$, µ is a constant and $v_1$ represents the distance from the predicted curve to the closest image feature. The constant µ also limits the tracing to a finite distance by providing a definite value when no image feature is found within this distance; therefore µ is the tracing range and can be any sufficiently large value. In this way we avoid selecting more than one feature per point on the predicted curve. The simplest discrete approximation of equation 44 is a combination of one-dimensional densities, with $\sigma^2 = rM$:

$$p(z \mid x) = \exp\left(-\frac{1}{2rM}\sum_{m=1}^{M} f\bigl(z(s_m) - r(s_m);\, \mu\bigr)\right)$$   eq. 45

where $s_m = m/M$. This means that the minimum distance is found and evaluated independently along M curve normals.
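A sketch of the discrete observation density of equation 45, assuming the distances to the nearest edge along the M curve normals have already been measured; the default values of mu and r and the function name are illustrative assumptions.

```python
import numpy as np

def observation_density(dists, mu=10.0, r=1.0):
    """Weight of a predicted curve from the nearest-edge distances measured along
    its M normals (eq. 45); distances beyond mu are truncated, f(v; mu) = min(v^2, mu^2)."""
    d = np.minimum(np.asarray(dists, dtype=float), mu)
    M = len(d)
    return np.exp(-np.sum(d ** 2) / (2.0 * r * M))
```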


4.4.3. Initialization

As described for the condensation algorithm in section 4.2, the tracking is an iterative process of sampling, predicting and measuring a sample set S = {s_{n,t}, n = 1, ..., N} and updating the corresponding weights {π_n}, so a set of samples needs to be generated at t = 0 before the tracking starts. Furthermore, since the dynamic model is assumed to be represented by a second-order linear stochastic differential equation (equation 39), we need the states at the first two time steps, t = 0 and t = 1, to start the condensation algorithm.

First, a set of random samples are generated and each sample is represented by a state.

At this stage, the likelihood of the states only depends on how much information on

the targeted objects we have beforehand. If the initial guess of each parameter in the

state is within a rather large range, the likelihood of some samples might be quite low.

At time t = 0, N samples in Q space are ‘randomly’ thrown on the first image of the

video sequence depending on a random number, rand, taken from a uniformly

distributed set between zero and one. The range $[Q_{min}(i), Q_{max}(i)]$ of each of the six Q elements is given as input and the Q(i) of a sample is selected as

$$Q(i) = Q_{min}(i) + rand \times (Q_{max}(i) - Q_{min}(i)), \qquad 0 \le rand \le 1.$$

The weight associated with each randomly generated sample is calculated directly using the observation model discussed above. At this time step, no prediction is

needed, since the motion has not started yet.

Next at time t =1, N new samples are generated from the set of N samples at t = 0

using their weights. There is, however, no relationship described in equation 39

between the samples at the first two steps because equation 39 is not defined for the

first two steps. The first samples may be far away from the target object since the

information about the template is not reflected.


Even though the tracking result might not be good for the first few time steps, the

condensation algorithm itself can gradually find a better candidate as t increases.

4.4.4. Detailed Algorithm

This section gives an overview of the detailed algorithm of the condensation tracker as a summary. First, a template curve is created from an image with billboards. Then

the condensation tracker is run to find the best match.

4.4.4.1. Creating a template

1. Define the template curve as described in the B-spline drawing algorithm in section

4.3.2.2 and get the normalized knot vector v and the coordinates of the control

points.

2. Translate the template so that the first point lies on the origin because the Q space

transformation is with respect to x = 0 (y-axis) and y = 0 (x-axis) as explained in

section 4.3.3.2.

3. Transform the template representation from the control point coordinates into the Q

space representation using equation 30 in section 4.3.3.1.

4.4.4.2. Running the Condensation Tracker

1. Generate a sample set with n particles, each particle state represented in Q space

with six parameters as described in section 4.3.3.1. Each of the six parameter

values is randomly chosen by uniform distribution within a range defined by the

user. The range should indicate the transformation range of the state, such as

translation, scaling and shearing.

2. At time t=0, throw the sample set on the first frame of the clip as described in

section 4.4.3. Figure 22 depicts this process using an example.


Initially, 100 particles are thrown onto the first frame of the video sequence, shown as the red 'e's.

Figure 22 n = 100 samples thrown based on the uniformly distributed random sampling

3. Create an edge map of the first frame image using the Canny edge function.

4. Get the measurement at t = 0 and measure the fitness of all the particles in the sample set as follows:

i) Use equation 29 to compute the corresponding control point coordinates CP of the state $Q_0$ at t = 0;

ii) Pick uniformly distributed points on the curve, for example the curve parameter u as u = 0:1/20:1 ($0 \le u \le 1$), then calculate the blending functions B(u) at each point. Then get the coordinates of each point using $r = B(u) \cdot CP$.

iii) Calculate the first-derivatives of the blending functions at each point r(u) in

order to obtain the normal there. The first-derivative gives the tangent vector

to the curve at the corresponding point as shown in Figure 23. So, given the

tangent vector as (x, y), the normal vector can be represented as (y,-x).

Figure 23 Normal and tangent at a point

iv) Normalize the normal vectors.

v) Set the default distance at a large constant. Measure the distance to the nearest

edge along the curve normal from each curve point selected in ii). In Figure 24

for example, the black curve is the curve resulting from ii) and the red curve is the

the detected canny edge of the image.


Figure 24 The nearest edge searching process: the distance to the nearest edge is measured along the normal from each curve point.

5. Sum the shortest distances and calculate a weight using the observation model (equation 45).

6. Normalize the weights of all the particles and calculate the cumulative weights for

each particle.

7. Define the optimum $Q_0$ and recover the curve using $Q_0$ and the control point coordinates as the result at t = 0. The optimum $Q_0$ can be defined in different ways, for example as the weighted average or as the sample with the highest weight. We take the weighted average for this experiment because it may reflect the various errors in an adequate way. Other possibilities are discussed later.

Figure 25 The weighted average $Q_0$ of the initial stage: the weighted average of the samples in Figure 22 (the black 'e' is the first measurement result).

8. At t = 1 sample from the first n samples. Re-do step 3 to 7 for the second frame. Up

to this stage, the initialization is finished and a sample set containing n particles

(s(Q,w,c)) with corresponding states, weights and cumulated weights is obtained

at time t=0 and t=1.

9. From the third frame to the last frame of the clip, i.e. from t = 2 to t = the number of frames in the clip, run the condensation algorithm: select samples, predict by the dynamic model (equation 39), then measure and weight by the observation model (equation 45) with $s_m = m/M$. In Figure 26 the picture on the top shows the 100 particles sampled, and the bottom picture shows the weighted average of the particles.

Figure 26 Result after the 3rd step. Top: the 100 particles after the resampling and prediction steps on the 3rd frame of the video sequence; bottom: their weighted average, the result after the measurement step for the 3rd frame.


V. Experiment and Findings

We discussed in much detail the theory behind the deformable template matching and

condensation algorithm in the previous two chapters. In this chapter we will

investigate the performance of these two algorithms.

5.1. Experiment Purpose

As mentioned in the introduction, we would like to implement real-time tracking of

billboards in a live-broadcasting handball game without any information other than

the billboards’ designs and the layout of the billboards and the cameras beforehand.

The tracker should identify all billboards by their design. The positional and

translation information should be precise enough to replace the billboards with other

designs later19. In order to achieve natural replacement, occlusion should also be taken

into account. As such, our experiments will test mainly the following issues, which

we assume are essential for an accurate real-time tracking:

1) Speed

The algorithm should be fast enough to achieve real-time tracking.

2) Accuracy

The algorithm should achieve an accuracy level at which the replacement of billboards is undetectable by human observation.

3) Good template to represent the uniqueness of the billboard.

The template is represented both as a set of points and as curves in the

experiments.

4) Generalization

Even though handball games are different from each other, we assume the

cameras move mainly horizontally with different zooming levels. We hope to

be able to get a set of values to approximate the dynamic model and use them

for tracking in all handball games.

5) Handling of occlusion

How well the algorithms handle situations where occlusion occurs and where part of the billboard moves out of the scene.

19 As mentioned in chapter I, the replacement is out of the scope of this project.

5.2. Data Set

Even though our assumption is that clips are given, as mentioned in chapter I, what

we have as data is a video sequence of a handball game in MPEG2 format. So our

training data and test data are made out of the given video sequence of one handball

game. We split the original video sequences into clips and the training data and test

data are video sequences from clips taken by different cameras. Each sequence, which

consists of 50 frames, is converted into an AVI file compressed with the IndeoVideo5

codec.

The training data consist of 18 different video sequences. The first nine training sequences are taken from camera N20 and the last nine are taken from camera S. The locations of these two cameras are shown in Figure 27.

Figure 27 Reference of the cameras (camera S and camera N)

Figure 28 shows our data set structure. The test data are different video sequences

from the training data. They are taken by either camera N or camera S. We included

especially some sequences where the target billboards are occluded or partly out of the scene. As described in item 4) of section 5.1 Experiment Purpose, we would like to experiment on the generalization of the algorithms, so we hope that the set of values we obtain

20 We can see from the given camera layout picture that there is one camera at the upper side of the playfield and three cameras at the opposite side. To keep it simple, we refer to the camera at the upper side as camera N and the ones at the opposite side as camera S (since we do not know which of the three cameras was used for each clip, we assigned all data assumed to be taken by camera S by pure eye observation).


from the training data can be applied generally. As such, the test data are different from the training data.

Figure 28 Data set structure. Video clips taken by the camera at one side provide both a sequence used for training (frames t1 to t50) and a sequence used for test (frames ta to ta+50); video clips taken by the camera at the other side provide a sequence used for test only (frames tc to tc+50).

As for templates, the given still billboard images are very big in terms of file size, so

we cut out only the billboard part and lowered the resolution in order to make the images easier to process in Matlab. The images in Figure 29 are the ones we used to obtain the templates. The 'e' at the bottom left is taken from a frame in the video sequence, which is why it is slightly blurred. Since we are only interested in the shape of the billboard, reducing the size and resolution of the original image does not affect the algorithm's performance. It does not matter whether the billboard image is taken from the video or is drawn by users.

Figure 29 Billboard images used to form templates


5.3. Error Measurement

None of Jain et al. (1998), Isard and Blake (1996) and other relevant papers have quantified the error; the goodness of fit is judged visually. One of the reasons might be that their tracking is aimed at surveillance or TV conferencing, where the performance does not need to be very accurate as long as the camera can follow the targeted object. In our case, however, the accuracy of the location is as important as real-time tracking21.

5.3.1. Mean Square Error

In addition to the visual judgment, an error measurement is made in order to quantify the quality of tracking. Since correct tracking data do not exist, we compare the tracking result with our manual tracking result, under the assumption that both contain errors but that the errors in manual selection are within a reasonably acceptable range. The mean square error (MSE)22 is often used as an error measurement and can be defined as

$MSE = \frac{1}{n_T}\sum_{i=1}^{n_T} \left\| X_i - P_i \right\|^2, \qquad X_i, P_i \in \Re^2$   eq. 46

where, in our case, $n_T$ is the number of points defining the template (i.e. the points on the template in the deformable template matching and the control points of the template in the condensation algorithm), $X_i$ is the ith point of the tracked result for a frame and $P_i$ is the corresponding ith point of the manually tracked result for that frame. So the average error of a video sequence can be written as

21 As mentioned in Chapter 1, we cannot achieve real-time tracking because of the processing speed of matlab. Our goal is to make an algorithm which can run in real-time under a better circumstance.

22 MSE is determined by calculating the deviations of the points from their reference positions, squaring and summing them, and averaging the sum (eq. 46).


$avgError = \frac{\sum_{t=1}^{N} MSE_t}{N}$   eq. 47

where N is the number of frames in the video sequence and $MSE_t$ is the mean square error of the tth frame. The standard deviation of the error over a sequence can then be written as

$\sigma = \sqrt{\frac{1}{N-1}\sum_{t=1}^{N}\left(MSE_t - avgError\right)^2}$   eq. 48

Since the real tracking starts from the 3rd frame, where the tracking results become stable, the MSE, avgError and σ measured later in the experiments do not include the first two frames of each test sequence.

In order to reduce the error in manual clicking, the four corners of the billboard are

clicked and the error is measured only from the four manually tracked and the tracker

tracked corners in each frame, as in Figure 30. We believe this error is a good approximation of the error that occurs in the tracking process.
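As a small illustration of equations 46 to 48 applied to the four clicked corners (Python/NumPy; the function names are ours, and the synthetic data at the end only demonstrates the call):

```python
import numpy as np

def frame_mse(X, P):
    """eq. 46: mean squared distance between the tracked points X and the
    manually clicked points P, both arrays of shape (n_T, 2)."""
    X, P = np.asarray(X, dtype=float), np.asarray(P, dtype=float)
    return np.mean(np.sum((X - P) ** 2, axis=1))

def sequence_error(tracked, manual, skip=2):
    """eq. 47 and eq. 48: average error and standard deviation over a sequence;
    the first `skip` frames are excluded, as described in the text."""
    mse = np.array([frame_mse(X, P) for X, P in zip(tracked, manual)][skip:])
    avg_error = mse.mean()
    sigma = np.sqrt(np.sum((mse - avg_error) ** 2) / (len(mse) - 1))
    return avg_error, sigma

# usage: a 50-frame sequence with 4 corners per frame and synthetic clicking noise
rng = np.random.default_rng(1)
manual = rng.uniform(0, 500, size=(50, 4, 2))
tracked = manual + rng.normal(0, 2, size=manual.shape)
print(sequence_error(tracked, manual))
```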

Figure 30 Manually tracking the 4 corners of the billboard. The image is the result of manually tracking the 3rd frame of sequence tn2.avi.


5.3.2. Confidence Interval

Since our manual tracking may not be completely correct, we estimate the range within which the true tracking error lies with a certain probability using the confidence interval (see Papoulis and Pillai, 2002).

We can estimate the correct value within a tolerance limit, which is defined as an interval estimator (Papoulis and Pillai, 2002). When the unknown $\theta$ is in the interval $(\theta_1, \theta_2)$, $100\gamma$ percent of the samples measured under the same condition fall within this range. Then $(\theta_1, \theta_2)$ is called a $\gamma$ confidence interval of $\theta$. This relationship is represented in the following equation:

$P\{\theta_1 < \theta < \theta_2\} = \gamma$   eq. 49

where the constant $\gamma$ is the confidence coefficient of the estimate $\theta$ and $\delta = 1 - \gamma$ is called the confidence level. If the estimator of the mean $\eta$ is unbiased, meaning that the mean of the sampling distribution can be shown to be equal to the estimated mean, and the density of the MSE is symmetrical about the mean, the mean is in the middle of the interval $(\theta_1, \theta_2)$, namely $2\eta = \theta_1 + \theta_2$.

We define the variance and the error as follows. Two people individually click the same sequence. One of the clicking results is taken as the golden standard, $P_i$, and the other as $X_i$ in equation 46. Then the average error and standard deviation σ between the two clicked sequences are computed using equations 47 and 48 respectively23. We use this standard deviation as σ when calculating the confidence interval.

The error of the tracking result is defined as its difference from the golden standard. First the MSE is calculated by equation 46, using the golden standard as $P_i$ and the result of the tracker as $X_i$. Then the average MSE of this tracking error, avgError, is computed using equation 47.

23 Following exactly Papoulis and Pillai (2002) we should measure the same sequence many times. However, we assume that clicking different frames provides almost the same effect of generalization of sampling.


Suppose the variance is known but the distribution of the MSE is unknown; then the probability that the true mean is in the interval estimate $\left(avgError - \frac{\sigma}{\sqrt{\delta N}},\ avgError + \frac{\sigma}{\sqrt{\delta N}}\right)$ is larger than $\gamma$, where N is the number of frames24. This is represented as follows:

$P\left\{avgError - \frac{\sigma}{\sqrt{\delta N}} < \eta < avgError + \frac{\sigma}{\sqrt{\delta N}}\right\} > 1 - \delta = \gamma$

If we set the confidence level $\delta$ to 0.05, then $1/\sqrt{\delta} = 4.47$.

The confidence interval with confidence level $\delta$ is then calculated as

$avgError - 4.47\frac{\sigma}{\sqrt{N}} < \eta < avgError + 4.47\frac{\sigma}{\sqrt{N}}$   eq. 50

In the experiments testing the condensation algorithm, we will use equation 50 to calculate the confidence interval of the true tracking error, with a confidence coefficient of 0.9525, for each test sequence.
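Equation 50 translates into a couple of lines of code; the sketch below is only an illustration with made-up numbers (the function name is ours):

```python
import math

def confidence_interval(avg_error, sigma, n_frames, delta=0.05):
    """eq. 50: distribution-free interval for the true mean error eta,
    avg_error - sigma/sqrt(delta*N) < eta < avg_error + sigma/sqrt(delta*N),
    which holds with probability greater than 1 - delta.  With delta = 0.05
    the factor 1/sqrt(delta) is about 4.47."""
    half_width = sigma / math.sqrt(delta * n_frames)
    return avg_error - half_width, avg_error + half_width

# usage: avgError = 10.0, sigma = 5.0 over N = 48 frames (50 minus the first two)
print(confidence_interval(10.0, 5.0, 48))
```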

5.4. Experiments on Deformable Template Matching

5.4.1. Experimental Results

The experiments are conducted both on a still image containing 'tøj', which is one of the billboard images, and on two frames from video sequences, as shown in Figure 31. The logo 'SPAR' is relatively clear, whereas 'EL GIGANTEN' is quite blurred. Throughout the template matching experiments the edge smoothing parameter ρ in equation 8 on page 23 is kept at 1, and the distance map is built up to 20 pixels from the edges, using a Canny edge threshold value of 0.4 when creating the distance map.

24 As mentioned under equation 48, the first two frames are not included. 25 A confidence coefficient of 0.95 means that 95 percent of errors measured under the same condition fall within this range.


Figure 31 Images and their Canny edge images for testing deformable template matching: a) tøj, b) EL GIGANTEN, c) SPAR. The Canny edge threshold used is 0.4.

The billboards’ canny edge images are as in Figure 32, with which the templates are

defined as described in the algorithm in chapter 3.3. Their original images are in

Figure 29.

Figure 32 Edge maps of the 'tøj,' 'EL GIGANTEN,' and 'SPAR' templates: a) tøj, b) EL GIGANTEN, c) SPAR. The Canny edge threshold used is 0.4; the original images are shown in Figure 29.

5.4.1.1. Experiment with point templates

We tested with different numbers of points to see whether the number of points affects the results. The points do not have to lie next to each other, because each point is independent. As mentioned earlier, it is very difficult to click exactly the corresponding points on a frame, especially when the points are not visible in the edge image. For example, the vertical lines between the 'EL GIGANTEN' billboards disappear in the edge image of the frame (see Figure 31 b). So we should avoid selecting points


on this edge. It is easier to select points at the corners, but they are not suitable as points on the template because their edge directions are not uniquely determined. So, in order to click exactly the same points, we chose points that are easier to find, such as points on the extension of a straight horizontal line in G.

Table 326 below shows some of the results of experiments with different numbers of points representing templates. The processing time to create a template is not included because it can be done beforehand. The full experiment results are in Table 3 in the Appendix. We experienced almost as many failed matches as good ones, although fewer failed results are shown in the table below. The quality of the fit is quantified by the MSE for results that are not clearly 'failed'. As mentioned before, the calculated error includes the error in manually locating the exact corresponding points. Over a considerable number of experiments, we feel that the error measurement does not reflect the error very well, because the more points are involved, the higher the risk of clicking wrong parts.

Exp. No.1 | Figure No. | Template | Number of template points | Time to create a distance map2 (seconds) | Time to find the best match3 (seconds) | MSE
1 | Figure 33 | tøj | 4 | 3.4 | 27.0 | 0.5
2 | Figure 71 | tøj | 4 | 3.8 | 28.0 | 1.0
3 | Figure 72 | tøj | 4 | 3.3 | 27.0 | Failed5
4 | Figure 34 | tøj | 4 | 3.3 | 26.6 | Failed
5 | Figure 74 | EG4 | 6 | 3.8 | 43.9 | 4.0
6 | Figure 75 | EG | 6 | 3.8 | 43.8 | 1.2
7 | Figure 76 | EG | 11 | 3.8 | 47.9 | 2.9
8 | Figure 35 | EG | 11 | 3.8 | 49.8 | 2.5
9 | Figure 78 | EG | 11 | 3.9 | 47.5 | 3.6
10 | Figure 36 | EG | 11 | 3.9 | 50.7 | Failed
11 | Figure 80 | EG | 15 | 3.9 | 52.9 | Failed
12 | Figure 81 | SPAR | 8 | 3.9 | 46.8 | 1.1
13 | Figure 82 | SPAR | 8 | 3.9 | 45.0 | 12.1
14 | Figure 83 | SPAR | 10 | 3.8 | 44.9 | 1.0
15 | Figure 37 | SPAR | 10 | 3.8 | 45.0 | 0.7
16 | Figure 85 | SPAR | 4 | 3.8 | 41.4 | Failed
17 | Figure 38 | SPAR | 10 | 3.8 | 45.6 | Failed

1. Result images are shown below if the experiment numbers are in bold font. 2. Time to create a distance map of the frame up to 20 pixels from edges. 3. Time to find the object which best matches the template. 4. EG stands for 'EL GIGANTEN.' 5. Failed indicates that another object is located instead of the target.

Table 3 Results of template matching with point templates

26 Result images shown are those whose result numbers are in bold font in Table 3.


The most serious problem is the processing time. For every frame, it takes (time to

create a distance map of the frame + time to find the best match) at least 30 seconds.

The more points involved, the longer it takes. We believe that even with better

computer conditions, real-time tracking cannot be achieved.

As for accuracy, 'tøj' shows a good match with a small error even with a small number of points. Although 'EL GIGANTEN' is more blurred than 'SPAR,' the number of points needed to obtain a good result is even smaller. This implies that a good choice of template is more important than the number of points, as long as the number is sufficient to represent the uniqueness of the billboard. A good template captures the unique parts of the billboard design, such as the L in 'EL GIGANTEN' and the curve between t and ø in 'tøj.' On the other hand, no part of the 'SPAR' billboard is clearly different from other parts of the frame. When the number of points is larger than those in Table 3, the error becomes larger, presumably because of the larger error in manual clicking.

Figure 33 Result 1 with 'tøj' – 4 points – successful (a: template with 4 points, b: result). Total time to find the match = 30 seconds, MSE = 0.5.

Figure 34 Result 4 with 'tøj' – 4 points – failed (a: template with 4 points, b: result). Total time to find the match = 30 seconds.


Figure 35 Result 8 with 'EL GIGANTEN' – 11 points – successful (a: template with 11 points, b: result). Total time to find the match = 54 seconds, MSE = 2.5.

Figure 36 Result 10 with 'EL GIGANTEN' – 11 points – failed (a: template with 11 points, b: result). Total time to find the match = 55 seconds.

Figure 37 Result 15 with 'SPAR' – 10 points – successful (a: template with 10 points, b: result). Total time to find the match = 49 seconds, MSE = 0.7.

Figure 38 Result 17 with 'SPAR' – 10 points – failed (a: template with 10 points, b: result). Total time to find the match = 49 seconds.


Next, curves are used to represent the templates for the billboards 'EL GIGANTEN' and 'SPAR.' This time the manually selected points are used as knots, and a given number of uniformly distributed points between the first and last knots are used for matching. Therefore there is no problem in clicking corner points, because they are not directly used for matching. Moreover, we can use more points for matching than we actually click.
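One way to realise "click a few knots, match with more uniformly spaced points" is to fit a parametric spline through the clicked points and resample it. The sketch below uses SciPy purely for illustration; it is not necessarily how the thesis code builds its curve templates.

```python
import numpy as np
from scipy.interpolate import splprep, splev

def sample_curve(clicked_xy, n_samples):
    """Fit a parametric B-spline through the manually clicked knots (rows of
    clicked_xy) and return n_samples points spaced uniformly in the curve
    parameter between the first and last knot."""
    x, y = np.asarray(clicked_xy, dtype=float).T
    k = min(3, len(x) - 1)                        # cubic when enough knots are given
    tck, _ = splprep([x, y], k=k, s=0)            # s=0: interpolate the clicked points
    u = np.linspace(0.0, 1.0, n_samples)
    xs, ys = splev(u, tck)
    return np.column_stack([xs, ys])

# usage: 6 clicked points, 11 points used for matching (as in experiment no.19)
clicked = [(10, 40), (25, 12), (47, 18), (66, 35), (80, 22), (95, 30)]
print(sample_curve(clicked, 11).shape)            # (11, 2)
```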

Table 4 shows the experiment results using curves as templates. Some of the good and failed results are shown in Figure 39 to Figure 42.

Exp. No. | Figure No.1 | Template | Number of points clicked2 | Number of points used3 | Time to create a distance map4 (seconds) | Time to find the best match5 (seconds) | MSE
18 | Figure 87 | EG6 | 6 | 11 | 3.8 | 44.7 | 1.7
19 | Figure 39 | EG | 6 | 11 | 3.8 | 48.2 | 3.1
20 | Figure 89 | EG | 6 | 11 | 3.8 | 48.2 | 3.9
21 | Figure 90 | EG | 6 | 14 | 3.8 | 51.0 | 2.3
22 | Figure 91 | EG | 6 | 14 | 3.8 | 51.0 | 2.7
23 | Figure 92 | EG | 10 | 16 | 3.8 | 51.7 | 1.3
24 | Figure 93 | EG | 10 | 16 | 3.8 | 45.0 | 0.7
25 | Figure 40 | EG | 6 | 11 | 3.9 | 51.9 | Failed7
26 | Figure 95 | EG | 10 | 11 | 3.9 | 51.0 | Failed
27 | Figure 96 | SPAR | 6 | 5 | 3.8 | 41.0 | 15.4
28 | Figure 97 | SPAR | 6 | 10 | 3.8 | 48.4 | 8.3
29 | Figure 98 | SPAR | 6 | 10 | 3.8 | 45.9 | 11.0
30 | Figure 99 | SPAR | 10 | 10 | 3.7 | 45.2 | 1.8
31 | Figure 100 | SPAR | 10 | 15 | 3.8 | 48.5 | 1.9
32 | Figure 41 | SPAR | 10 | 15 | 3.8 | 44.6 | 1.8
33 | Figure 42 | SPAR | 10 | 10 | 3.9 | 47.8 | Failed
34 | Figure 103 | SPAR | 15 | 15 | 3.8 | 52.7 | Failed

1. Result images are shown below if the experiment numbers are in bold font; other images are in the appendix. 2. The number of points clicked to create a template curve. 3. The number of points used to find the best match. 4. Time to create a distance map of the frame up to 20 pixels from edges. 5. Time to find the object which best matches the template. 6. EG stands for 'EL GIGANTEN.' 7. Failed indicates that another object is located instead of the target.

Table 4 Results of template matching with curve templates

In both the SPAR and EG cases, a template can be represented by a curve with six points. The more points are chosen, the more accurate the result is, but the more time it takes. It therefore saves labour to use more points for matching than are manually selected. When we use only as many points for matching as we have clicked, the results are often not good, even when many points are selected. Presumably this is because nearly the same points as those manually selected, which are often at corners, are used for finding the


best match. Therefore the match is better when more points are used for matching than were originally selected as knots.

However, more points are required for matching than in the previous experiments 1-17, presumably because we cannot deliberately choose points that uniquely represent a billboard as we could in experiments 1-17. Therefore the choice of points is less important than in the point template case.

Figure 39 Result 19 with 'EL GIGANTEN' – 6 points clicked, 11 points sampled – successful (a: template from a 6-point curve, b: result). Total time to find the match = 52 seconds, MSE = 3.1.

Figure 40 Result 25 with 'EL GIGANTEN' – 6 points clicked, 11 points sampled – failed (a: template from a 6-point curve, b: result). Total time to find the match = 56 seconds.

Figure 41 Result 32 with 'SPAR' – 10 points clicked, 15 points sampled – successful (a: template with 10 points, b: result). Total time to find the match = 49 seconds, MSE = 1.8.


Figure 42 Result 33 with 'SPAR' – 10 points clicked, 10 points sampled – failed (a: template with 10 points, b: result). Total time to find the match = 49 seconds.

5.4.2. Findings and Discussions

1) Speed

The above experiment shows clearly that deformable template matching is very slow.

It takes approximately half a minute to process the image with a simple background

using only 4 points to represent the template. When the number of the points used for

the template increases and the background becomes more complicated, meaning that

there are more edges in the image, the processing time increases tremendously. On-line tracking is unlikely to be achievable even with faster computers or faster programming languages. Therefore we abandon the idea of using this method for object tracking.

2) Accuracy

The deformable template matching method can locate target objects in images with

clear edges, regardless of blur and the number of template points, as long as these

points are enough to represent the unique shape of the template. This accuracy is very likely achieved because not only the distance but also the edge direction between the edges selected on the template and those on the target object is used to check the goodness of fit.

3) Good template to represent the uniqueness of the template

A unique template representation is the key to successful matching. Curve templates are better than point templates because they save the labor of selecting enough points to represent a billboard, which helps reduce the error in manual clicking.


However, it becomes more difficult to define an object uniquely when the target object does not have very strong edges in the image. Deformable template matching uses only the edge feature for matching, so a well-defined, clear edge of the target object is essential for this method. There is no guarantee that the edge points of the object we would like to track are visible in the edge map, because we use one Canny edge threshold value to process one sequence. As described in the algorithm, the template is placed on all the edge points of the edge map; if the corresponding match point of the first point of the template is not visible in the edge map, the real object will never be located accurately in the frame. It is difficult to find an optimal threshold value that suits all frames, or to define a different threshold value for each frame. We will discuss this further in the experiments on the condensation algorithm.

Of course, the more points are used, the more likely the uniqueness of the template is preserved and the more likely matching is to succeed. However, the processing time increases roughly in proportion to the number of points, whereas the accuracy depends more on the quality of the template representation than on the number of points. So as long as the points are sufficient to represent the unique features of the template, a huge number of points is not necessary. However, it is difficult to know which points represent the billboard better and how many points are required at minimum.

4) Generalization

Once a good template is defined for a billboard, we believe it is possible to detect the

same design no matter how frames are taken as long as it is not drastically deformed,

since as assumed in section 3.3, deformation of the template is not taken into account

in the algorithm. If it is, even a deformed design can be detected. We will not try to

include the deformation part as it will take even more time to process a single frame.

We think template matching is a strong method to detect the accurate position of the

template. It is, however, a fundamental flaw that the processing speed is intolerably

slow for tracking.


5.5. Experiments on Condensation Algorithm

Based on section 5.1 Experiment Purpose, many experiments are conducted on the condensation algorithm to show how different factors influence the performance of the tracker. These factors are mainly a good template representation, an accurate set of coefficients A0, A1, and B, and an optimal number of particles to be generated.

As we discussed earlier in section 4.4.1 of Chapter 4, the coefficients A0, A1, and B are very important for successfully tracking the target, since they approximate the motion of the system. As such, we will classify the experiments according to how the coefficients A0, A1, and B are approximated.

5.5.1. Define an Optimal Template

A template can be represented by a single curve, as in Figure 25. However, the condensation tracker often fails to track a simple single curve when a similar shape, but of a different size or rotation, is hit by a particle or sample. Furthermore, in reality it is very difficult to represent the unique features of a billboard with only a single curve. As such, we use multiple curves to represent billboards.

Take the 'EL GIGANTEN' billboard for example; we can represent it with EL plus the billboard borders.

Figure 43 Template with multiple curves

As shown in Figure 43, the template is represented by six curve segments: third-order B-spline curves for E and L, and four straight lines (first-order B-spline curves). We denote these six segments all together as one curve.


In the measurement step, we use the same Q to get the control points for each segment and eventually the six segments of each particle. Using the same method as for measuring one curve, described in section 4.4.4.2 Running the condensation tracker, we measure the distance from all six segments to the nearest edges, and the sum of these edge distances is used to calculate the normalized weight. Let L(i) be the shortest distance for the ith segment; then the total distance can be written as

$Total\ L = \sum_{i=1}^{6} L(i)$

and after measuring Total L for every particle, we use equation 17 to get the normalized weight of each particle.
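A hedged sketch of this measurement: for every particle, sum the nearest-edge distances of the sampled points of all segments (read from a precomputed distance map) and convert the totals into normalized weights with an exponential of the kind used in the observation model. The constant r and the function names are placeholders, not the thesis implementation.

```python
import numpy as np

def total_edge_distance(segments, distance_map):
    """Total L = sum over the segments of the distances from their sampled
    points to the nearest edge, looked up in a precomputed distance map."""
    total = 0.0
    for pts in segments:                           # pts: (M_i, 2) points of one segment
        cols = np.clip(pts[:, 0].astype(int), 0, distance_map.shape[1] - 1)
        rows = np.clip(pts[:, 1].astype(int), 0, distance_map.shape[0] - 1)
        total += distance_map[rows, cols].sum()
    return total

def normalized_weights(total_distances, r=20.0):
    """Turn the per-particle total distances into weights that sum to one,
    in the spirit of equation 17: the smaller the distance, the larger the weight."""
    d = np.asarray(total_distances, dtype=float)
    w = np.exp(-d / (2.0 * r))
    return w / w.sum()

# usage: 3 particles, each with 2 segments of 5 sampled points, on a 100x100 distance map
rng = np.random.default_rng(2)
dist_map = rng.uniform(0, 20, size=(100, 100))
particles = [[rng.uniform(0, 99, size=(5, 2)) for _ in range(2)] for _ in range(3)]
totals = [total_edge_distance(segs, dist_map) for segs in particles]
print(normalized_weights(totals))
```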

It would be ideal to include all the curves of a billboard in the template, but that would increase the computation time in the measurement step. In the experiments, three curves are used to represent each billboard. The curves chosen to represent a billboard are, first of all, the ones that we assume can be detected by the Canny edge detector in the measurement step; secondly, these curves together should be a good representation of the billboard.

Figure 44 shows some examples of the templates used to represent some billboards.

a) ‘e’ template b) ‘El GIGANTEN’ template c) ‘SPAR’ template

Figure 44 Templates with multiple curves a) ‘e’ billboard is represented by the shape of ‘e’ and the billboard’s bottom line. b) ‘El GIGANTEN’ billboard is represented by the shape of E and the lines around E and the billboard’s bottom line. c) ‘SPAR’ billboard is represented by the shape of S and R and the lines around SPAR.

5.5.2. Coefficients A0, A1, and B

As mentioned earlier, the more accurate the coefficients A0, A1 and B, the more robust

and accurate the tracking will be. As such, we would like to find out if we can train or

define a set of A0, A1 and B, which are suitable for billboard tracking in all parts of the


handball games. We assume that the camera at each side moves in roughly the same pattern, even though every handball game has its own features. So the question is, supposing we can train on enough sequences from handball games to obtain A0, A1 and B, whether this trained set of A0, A1 and B works well for tracking any billboard in general, or whether we can define a simple default set of A0, A1 and B.

(1) Estimated A0, A1, and B from training data

A training sequence $Q_1, \ldots, Q_j$, where j is the number of training samples, is needed in order to estimate A0, A1 and B (Blake et al. (1995) and Reynard et al. (1996)). In

order to make the situation simple, we use the four corner points of the billboard as

the interpolated points to obtain a linear B-spline curve (first-order) to represent the

template. Taking the ‘e’ billboard for example, the manually selected four yellow

corner points seen in Figure 45 are the control points that need to be tracked manually in each frame of the video sequence. The Qs can then be calculated using equation 30.

Figure 45 Template for training. Yellow points are the manually clicked points used to calculate the control points; red points are the calculated control points, which have been transformed so that the first control point is at the coordinate origin.
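For reference, a generic least-squares fit of the second-order model to such a manually tracked sequence Q1, ..., Qj could look like the sketch below. The actual estimation in Blake et al. (1995) and Reynard et al. (1996), which the thesis follows, also learns the mean state and constrains the noise term, so this is only an illustration of the idea, not the method used for Tables 11 and 12.

```python
import numpy as np

def estimate_dynamics(Q):
    """Least-squares fit of Q_t ~ A0 Q_{t-2} + A1 Q_{t-1} + d, with the residual
    covariance factored as B (so that B @ B.T equals the covariance).
    Q has shape (j, 6): one manually tracked state per frame."""
    Q = np.asarray(Q, dtype=float)
    X = np.hstack([Q[:-2], Q[1:-1], np.ones((len(Q) - 2, 1))])   # [Q_{t-2}, Q_{t-1}, 1]
    Y = Q[2:]
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)                 # shape (13, 6)
    A0, A1, d = coef[:6].T, coef[6:12].T, coef[12]
    resid = Y - X @ coef
    B = np.linalg.cholesky(np.cov(resid.T) + 1e-9 * np.eye(6))   # any factor of the covariance
    return A0, A1, d, B

# usage: a synthetic 50-frame training sequence of 6-D states
rng = np.random.default_rng(3)
Q_train = np.cumsum(rng.normal(0, 1, size=(50, 6)), axis=0)
A0, A1, d, B = estimate_dynamics(Q_train)
print(A0.shape, A1.shape, B.shape)
```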

Using the control points obtained manually from the training data, the Qs are calculated and A0, A1 and B are estimated.

We observed that during the video sequence the target billboard's size and direction

have changed only very slightly, so the values of Q(2), Q(3), Q(4), Q(5) and Q(6) do not change much through this video sequence. Furthermore, Q(5) and Q(6) are very small, as there is not much shearing of the billboard shape in this sequence. As mentioned earlier, the 5th column determines the rotation and shearing relative to the y coordinate and the 6th column determines the rotation and shearing relative to the x coordinate, so we expect that the 5th and 6th rows of the A0, A1 and B


matrices will be very small. This can be seen from Table 11 and Table 12 in the appendix, which show the A0, A1 and B matrices calculated from the training data.

In order to observe how the camera moves, we plot the motion of the centre of the 'e' billboard (denoted C) from the manually tracked training sequences in the following figures, where the x-axis is the ordinal number of the frame in the video sequence and the y-axis is C's x coordinate or y coordinate in that frame.

Figure 46 Horizontal motion of the centre of the 'e' billboard per sequence — training data taken from camera N (training sequences e1.avi-e5.avi, en11.avi-en14.avi). The x-axis is the ordinal number of the frame and the y-axis is C's x coordinate in each frame.

Figure 47 Vertical motion of the centre of the 'e' billboard per sequence — training data taken from camera N (training sequences e1.avi-e5.avi, en11.avi-en14.avi). The x-axis is the ordinal number of the frame and the y-axis is C's y coordinate in each frame.


Figure 48 Horizontal motion of the centre of the 'e' billboard per sequence — training data taken from camera S (training sequences e6.avi-e10.avi, es11.avi-es14.avi). The x-axis is the ordinal number of the frame and the y-axis is C's x coordinate in each frame.

Figure 49 Vertical motion of the centre of the 'e' billboard per sequence — training data taken from camera S (training sequences e6.avi-e10.avi, es11.avi-es14.avi). The x-axis is the ordinal number of the frame and the y-axis is C's y coordinate in each frame.

It is very difficult to observe a clear motion pattern from Figure 46 to Figure 49; however, they show that the motion difference between two consecutive frames is about 0-20 pixels in the x coordinate and 0-10 pixels in the y coordinate. The motion along the x coordinate is larger than that along the y coordinate, and the horizontal motion direction changes: sometimes the billboard moves to the right and sometimes to the left. So, instead of using the trained set of A0, A1, and B, a simple default set can probably be defined.


(2) Default coefficients A0, A1, and B

In the description of the prediction step, we stated that A0, A1, and B can take likely default values. The simplest A0, A1, and B matrices are those with non-zero values only on the diagonal. Our experiments show that the particles do not move when A0 and A1 are set as

$A_0 = A_1 = 0.5 \times I$ (a 6x6 matrix with 0.5 on the diagonal and zeros elsewhere),

and there is no Gaussian noise effect when B is set as the 6x6 zero matrix,

$B = 0$.

When the diagonal values of A0 and A1 are set higher than 0.5, the particles move towards the right, and when they are set lower than 0.5, the particles move towards the left. This finding is based only on our visual observation of the experiments; we have not further explored how the elements of the matrices A0, A1, and B influence each other, due to time limits. However, we would like to use sets of A0, A1, and B which we observed work fairly well for a few sequences, to test whether they can serve as reasonable default values. So, initially we set A0, A1, and B as

$A_0 = 0.48 \times I$, $A_1 = 0.48 \times I$; B = diag([15 15 0.01 0.01 0.001 0.001])

and during the tracking, the direction of the motion from time t-2 to t-1 is checked. One condition for this method to work is that we trust the tracker, that is to say, we consider the tracked particle to be the optimal target. Our assumption is that the


next movement will be the same as the previous movement; therefore the movement at time t will be to the right if the movement from time t-2 to t-1 was to the right. Since the first value of the Q vector, Q(1), indicates the translation in the x direction, we can say that if $Q_{t-1}(1) - Q_{t-2}(1)$ is positive, the motion at time t is toward the right, and we update A0 and A1 as

$A_0 = 0.52 \times I$, $A_1 = 0.52 \times I$.

Otherwise, the motion at time t is toward the left and we update A0 and A1 as

$A_0 = 0.48 \times I$, $A_1 = 0.48 \times I$,

as illustrated in Figure 50.

Figure 50 Check the motion and set the default A0 and A1: if $Q_{t-1}(1) - Q_{t-2}(1) < 0$, $Q_t$ is assumed to move to the left, i.e. A0 = A1 = 0.48 x eye(6); if $Q_{t-1}(1) - Q_{t-2}(1) > 0$, $Q_t$ is assumed to move to the right, i.e. A0 = A1 = 0.52 x eye(6).
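The check in Figure 50 amounts to a sign test on the previous horizontal motion; a small sketch (the function name is ours):

```python
import numpy as np

def default_dynamics(Q_t1, Q_t2, d=6):
    """Pick the default A0, A1 and B for the next prediction from the sign of the
    previous horizontal motion Q_{t-1}(1) - Q_{t-2}(1), as in Figure 50."""
    gain = 0.52 if (Q_t1[0] - Q_t2[0]) > 0 else 0.48   # moving right vs. moving left
    A0 = A1 = gain * np.eye(d)
    B = np.diag([15, 15, 0.01, 0.01, 0.001, 0.001])
    return A0, A1, B

# usage: the first state element (x translation) moved to the right by 5 pixels
A0, A1, B = default_dynamics(np.array([105.0, 50, 0, 0, 0, 0]),
                             np.array([100.0, 50, 0, 0, 0, 0]))
print(A0[0, 0])   # 0.52
```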

We will compare tracker’s performance of using the estimated , and B with that

of using default ,and B. Regarding the trained A

0A , 1A

0A , 1A 0, A1 and B five different

situations are tested:

1) whether it can model the dynamics of those video sequences taken from the same

camera;

2) whether it can be used to model the dynamics of the video sequences taken from

other cameras;

3) whether tracking will be successful if we combine the training data from different

cameras and obtain a corresponding set of A0, A1 and B to model the dynamics of

any video sequence of a handball game regardless of which camera is used;

4) whether the trained set of A0, A1 and B obtained from the locations of billboards at one place can be used to track billboards located at other places in the frame image, for example, whether we can use the A0, A1 and


B trained from billboards at the corner, such as the 'e' billboard, to track billboards at the side of the playground, say the 'SPAR' billboard;

5) whether on-line training can be used to get A0, A1 and B.

5.5.3. Result and Findings

Many factors affect the performance of the condensation algorithm, such as the number of particles, the initial searching range in which the particles are generated, the coefficients A0, A1, and B, etc. It is very difficult to separate these factors in the experiments. As such, when doing experiments, we manually specify some values, such as the initial range, in order to observe the influence of each individual factor.

5.5.3.1 Experiment on the performance by number of samples

In order to conduct these experiments, we manually tracked the billboard in each test sequence and estimated A0, A1, and B from it. We consider these estimated A0, A1, and B for each test sequence accurate enough to represent the dynamic model of that sequence. The initial range input for each test sequence can be found in Table 13 in the appendix.

Although Isard and Blake (1998) use more than 1000 samples, we use 300 at most

because of the processing speed. The more particles are involved, the slower the

implementation becomes. By observing the tracking process, we can see that the

tracking result using 300 particles is better than using 100 particles. With 300

particles, the tracker keeps tracking the target. The tracking is accurate with respect to

location, size and shearing factors. Even when the particles hit places with similar

curves, the tracker can still track the right target (see Figure 51).


Figure 51 The tracker successfully tracks the right target even though some particles are predicted to be around a similar curve nearby. The image is taken from the process of tracking the 'e' billboard in video sequence new11.avi.

Table 5 below shows the average tracking error and the standard deviation for each video sequence we tested. As mentioned earlier, the error in the table includes the error in manual tracking. So even though the error is large, such as 26.198 in ree11.avi, the tracking result is rather good by inspection, as shown in Figure 52.

Figure 52 Good tracking result despite a high error. The tracking result is rather good by inspection even though the average error measured is as high as 26.198. The images are taken from the tracking result sequence of ree11.avi.


Exp. No.27 | Test data (avi) | Result file28 (avi) | No. of particles | Confidence interval of the error η | Confidence interval of manual tracking error η'29 | Standard deviation
1 | new11 | 53 | 100 | 2.5871e+007 < η < 5871e+007 | 7.3299 < η' < 43.12 | 1.0016e+008
2 | '' | 54 | 300 | 12.0 < η < 31.0 | '' | 14.6378
3 | e1 | ree1100 | 100 | 91474 < η < 91486 | 1.6350 < η' < 14.7650 | 41853
4 | '' | ree11 | 300 | 20.3 < η < 32.0 | '' | 15.46
5 | tn3 | retn33 | 100 | 11.1 < η < 25.8 | 0.5081 < η' < 26.7419 | 10.848
6 | '' | retn3300 | 300 | 10.2 < η < 25.0 | '' | 10.519
7 | ts1 | rets110 | 100 | 50.1 < η < 59.5 | 3.8960 < η' < 18.2540 | 49.124
8 | '' | rets18 | 300 | 42.1 < η < 51.5 | '' | 38.33
9 | ts2 | rets22 | 100 | 6.3 < η < 16.5 | 3.7280 < η' < 23.5720 | 7.1146
10 | '' | rets21 | 300 | 4.7 < η < 14.9 | '' | 5.4373

Table 5 Errors in tracking using A0, A1 and B trained from the test sequence itself

In order to see how the tracker performs throughout the tracking process, Figure 53 below shows the MSE at each frame for the above test video sequences. The x-axis is the ordinal number of the frame and the y-axis is the MSE at that frame. Since the errors can be very large, the y-axis uses a logarithmic scale, which is often used to plot drastic changes.
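Plotting the per-frame MSE on a logarithmic axis is a one-liner per curve; the sketch below uses synthetic error values only to show the call (matplotlib):

```python
import numpy as np
import matplotlib.pyplot as plt

# synthetic per-frame MSE for two runs of one sequence (100 vs. 300 particles)
rng = np.random.default_rng(5)
mse_100 = rng.lognormal(mean=3.0, sigma=1.5, size=48)
mse_300 = rng.lognormal(mean=2.0, sigma=1.0, size=48)
frames = np.arange(3, 51)                 # tracking starts from the 3rd frame

plt.semilogy(frames, mse_100, label="100 particles")
plt.semilogy(frames, mse_300, label="300 particles")
plt.xlabel("frame number")
plt.ylabel("MSE (logarithmic scale)")
plt.legend()
plt.show()
```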

27 Each experiment result is given a number, which makes it easier to refer to a particular result later in the thesis. 28 Each experiment result is a new video sequence in which the tracking result is recorded. 29 If the confidence interval of the error η is below the maximum of η', we consider that the tracker successfully tracks the target.


Figure 53 Errors per frame, experiments no.1 to 10 (panels: 53.avi, 54.avi, ree1100.avi, ree11.avi, retn3100.avi, retn3300.avi, rets110.avi, rets18.avi, ret22.avi, ret21.avi).

The x-axis denotes the frame number and the y-axis denotes the MSE at each frame.

The above experiments show that the number of particle samples is a very important factor in determining the performance of the tracker: the more samples we use, the better the tracking result. Take experiments 1 and 2 for example: the tracking result using 300 particles is much better than that using 100 particles. When the number of particles is not sufficient, it is more likely that the target object is not hit by the particles, and the tracking result can be very wrong. However, the better performance is achieved at the cost of the tracker's speed, i.e., the tracker needs more time when the number of particle samples increases. It generally takes 5.1 seconds to process each frame using 100 particles and 8.0 seconds using 300 particles.

It is very difficult to find a universal optimal number to suit all trackers under different kinds of conditions. Take experiments no.1 and no.2 and experiments no.9 and no.10


for example: compared with its performance using 100 particles in experiment no.1, the tracker's performance improves dramatically in experiment no.2 using 300 particles. However, we cannot see such an improvement between experiments no.9 and no.10. The reason is that the initial range inputs are different. As described in section 4.4.4 Detailed Algorithm of Chapter IV, a number of particles are generated in a certain searching area with certain rotation and shearing parameters, which are defined by the user's input Range30. From the table below we can see that the initial range in which particles are generated for experiments no.1 and no.2 (100 x 50) is bigger than that of experiments no.9 and no.10 (50 x 30). Furthermore, the initial shearing factors are neglected in experiments no.1 and no.2, while they are much closer to the true values in experiments no.9 and no.10. As such, fewer particles are required to achieve a good tracking result. We can also observe from Table 5 that the standard deviation of the error decreases when the number of particles increases, which indicates that the condensation tracker becomes more stable.

Initial ranges for experiments no.1&2 (test sequence new11.avi) and no.9&10 (test sequence ts2.avi):

Parameter | Exp. no.1&2 Min. | Exp. no.1&2 Max. | Exp. no.9&10 Min. | Exp. no.9&10 Max.
Q0(1) | 480 | 580 | 100 | 150
Q0(2) | 120 | 170 | 190 | 220
Q0(3) | -0.5 | -0.25 | -0.5 | -0.25
Q0(4) | -0.5 | -0.25 | -0.5 | -0.25
Q0(5) | 0 | 0 | -0.15 | -0.1
Q0(6) | 0 | 0 | 0.2 | 0.4

We assume that at least 300 particles are needed in order to achieve an acceptable accuracy level with our tracker; however, this optimal number is valid only under the restriction that the initial range, within which the particles are generated, is a good guess of where the target is. If the variations of the particles' locations, sizes and shearing factors span a wide range, the optimal number of 300 might not hold.

5.5.3.2. Experiment on the performance using A0, A1, and B trained on another sequence

(1) A0, A1 and B trained from sequences taken by the same camera.

(2) A0, A1 and B trained from sequences taken by different cameras.

(3) Tracking a non-corner billboard with A0, A1 and B trained from a corner billboard.

30 Later, in Chapter VII, we briefly mention that we might be able to obtain the values for Range automatically by using the deformable template matching method.


(1) A0, A1 and B trained from sequences taken by the same camera.

Based on the findings in experiments using different numbers of samples, we use 300

particles in the experiments in Table 6 and the initial range for each test sequence is

the same as the one used in the previous experiments no.1 to no. 10.

Exp. No. | Test data (avi) | Result file (avi) | Coefficients A0, A1, and B | Confidence interval of the error η | Confidence interval of manual tracking error η'31 | Standard deviation
11 | e1 | ree13 | Trained from e1.avi-e5.avi | 5348 < η < 5360 | 1.6350 < η' < 14.765 | 2813.7
12 | tn3 | retn32 | '' | 2193 < η < 2207 | 0.5081 < η' < 26.742 | 1229.5
13 | new11 | new11-3 | '' | 3595 < η < 3614 | 7.3299 < η' < 43.12 | 4277.2
14 | e1 | ree15 | Trained from e1.avi-e5.avi, en11.avi-en14.avi | 13799 < η < 13811 | 1.6350 < η' < 14.765 | 5111.2
15 | tn3 | retn37 | '' | 2939 < η < 2954 | 0.5081 < η' < 26.742 | 462.49
16 | new11 | renew116 | '' | 45.2 < η < 64.3 | 7.3299 < η' < 43.12 | 74.344
17 | ts1 | rets16 | Trained from e6.avi-e10.avi, es11.avi-es14.avi | 2511 < η < 2521 | 3.8960 < η' < 18.2540 | 1934.8
18 | ts2 | rets24 | '' | 8.8 < η < 19.0 | 3.7280 < η' < 23.5720 | 9.9144

Table 6 Errors in tracking using A0, A1 and B trained from sequences taken by the same camera

We can see that the errors are very big except for experiment no.18, where the initial

range is very limited and close to the true state of the billboard. The failure has many reasons. First of all, the coefficients A0, A1 and B do not represent the motion model as accurately as those trained from the test sequence itself. In different sequences the camera moves with different speed and zooming; as such, the oscillatory motion trained from many sequences is only a general representation of all the sequences, and the noise coupled to the dynamic model also increases. Take experiments no.13 and no.16 for example: test sequence new11.avi is taken by camera S, and training data e1.avi to e5.avi are also taken by the same camera S, but all with lower zooming than that used in new11.avi. Training data en11.avi to en14.avi are taken by camera S with zooming similar to that in new11.avi. When we use A0, A1, and B trained only from e1.avi to e5.avi, the tracking error is approximately 3500-3600: the tracker only manages to track the rough location of the target for a few frames and loses track completely very quickly. When we include en11.avi to en14.avi in the training data

31 If the confidence interval of the error η is below the maximum of η', we consider that the tracker successfully tracks the target.


and re-estimate A0, A1, and B, the tracker's performance becomes better and the average error decreases to approximately 45-60. Figure 104 in the appendix shows the MSE per frame for the above experiments. We can observe that the tracker follows the right location of the target for most of the frames throughout the sequence. We believe that if the training data are large enough and carefully chosen, a good approximation of the coefficients A0, A1, and B can be obtained.

In order to further test whether coefficients A0, A1, and B trained from data taken by one camera can be used to track sequences taken by the same camera, we further manually limit the initial range as in Table 14, so that the searching area and the initial state are very close to the true location and state of the target. With these new range inputs, we run experiments no.11 to no.17 again; the results are shown in Table 15 in the appendix. It is still very difficult to conclude whether coefficients A0, A1, and B trained from data taken by one camera are stable and good enough to be used in tracking, as only the re-run of experiment no.11 (experiment no.19) shows a satisfactory result. The other experiments still have very large errors. The only conclusion we can draw is that the more accurately the coefficients A0, A1, and B represent the system motion model, the better the tracker performs. The tracker's performance is better and more stable when the coefficients A0, A1, and B are trained from the sequence itself, as in experiments no.2, 4, 6, 8, and 10.

We observed in the above experiments that in many failed cases the tracker keeps tracking the same wrong target, even though the real target is sometimes hit by one of the particles. If the tracker had tracked various different locations with big variations, we could have concluded that A0, A1, and B trained from other sequences cannot be used; however, we do not observe such behaviour. Take experiment no.24 for example: Figure 54 shows some of the results of tracking billboard 'e'. The tracker tracked the right target until the 20th frame; it then keeps tracking the corner of the billboard 'Harboe' from the 20th frame till the end of the test sequence ts1.avi.


Figure 54 Tracking results from experiment no.24 (frames 15, 18, 21, 24, 27, 30, 33 and 36)

We can further observe from experiments no.12 and no.20, where we tracked the billboard 'e' in the test video sequence tn3.avi, that the tracker jumps to another target at almost the same frame in both experiments. The same occurs in experiments no.17 and no.24, where we tracked the billboard 'e' in the test video sequence ts1.avi.

Figure 55 Error per frame for experiments no.12 and no.20 and for experiments no.17 and no.24: a) top retn32.avi (exp. no.12), bottom retn3R1.avi (exp. no.20); b) top rets16.avi (exp. no.17), bottom rets1R3.avi (exp. no.24). The x-axis denotes the frame number and the y-axis denotes the MSE at each frame.


Another very important factor affecting the performance of the tracker is the set of edges detected by the Canny edge detector in the measurement step. As discussed in chapter II Image Features for Tracking, different Canny thresholds result in different edge images. A lower threshold value keeps many edges in the searching area of each frame's edge image, which might mislead the tracker, as shown in the figure below. When the threshold is set to 0.3, the edges of the 'e' billboard remain, but the edges of the nearby billboard are kept as well. If many particles hit both red circles, the finally located target might not be the true target at all: instead of tracking 'e', the tracker might track the nearby similar curves, or the final target will be located in the middle of the two red circles.
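The effect of the threshold can be reproduced with any Canny implementation. The OpenCV sketch below is only indicative: OpenCV's Canny takes absolute gradient-magnitude thresholds, not the normalized 0-1 values quoted above for Matlab's edge detector, so the numbers are assumptions.

```python
import cv2
import numpy as np

def edge_maps(gray, thresholds=((50, 120), (100, 240))):
    """Return Canny edge maps of a grayscale frame for two threshold pairs:
    the lower pair keeps more edges (the target plus nearby clutter), the
    higher pair keeps fewer edges but may also drop parts of the target."""
    return [cv2.Canny(gray, low, high) for low, high in thresholds]

# usage with a synthetic frame; for real data use gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
gray = np.random.default_rng(4).integers(0, 256, size=(120, 160), dtype=np.uint8)
loose, strict = edge_maps(gray)
print((loose > 0).sum(), (strict > 0).sum())   # the looser thresholds typically mark more edge pixels
```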

a) Edge map of the 1st frame of the test video sequence tn3.avi

b) Tracking result of the 15th frame –experiment no.20 – the location of the target is in the middle

of the two red circles in a)


c) Tracking result of the 15th frame –experiment no.20 –the target is the nearby similar

curves in red circles in a) instead of ‘e’.

Figure 56 Influence of the Canny edge threshold value on the tracker's performance

If the Canny edge threshold is set higher, the nearby edges might not be kept in the edge image, but at the same time the target's edges might be excluded as well. It is very difficult to set a single good threshold value to suit all occasions, especially when the camera changes zooming or moves very fast throughout a sequence, because the target may become more blurred due to motion blur or being out of focus. As shown in Figure 57, in the video sequence tn2.avi the camera pans very fast. Even though the camera zooms in during this video sequence, the edges of the target are not fully visible due to strong motion blur; the left corner edges of the target are missing (as in the red circle in the figure). So the tracker is misled to the edges of the players.

a) Tracking result of the 4th frame – experiment no.25; b) Part of the edge image (see Table 16 in the appendix)

Figure 57 Influence of motion blur on the tracker's performance

The failure caused by the edge image can also be seen later in image d) of Figure 58 in this chapter, where we use the default A0, A1, and B for tracking. In all the experiments we actually use different thresholds for different sequences in order to exclude or reduce the influence of the Canny edge threshold value. During actual real-time tracking, however, we cannot change the threshold value manually. So, one optimal


threshold value needs to be defined beforehand, or the tracker should automatically generate different thresholds to suit tracking under different zooming effects; we leave this for future improvement due to time limits.

Besides the problem with edges, the way we calculate the weight of each particle in the measurement step also has limitations. Unlike in the template matching, the gradient direction information is not used; the tracker only checks the distance to the nearest edge, so the shape of the curve is effectively ignored.

(2) A0, A1 and B trained from sequences taken by different cameras

Based on the results of the above experiments, we do not expect the tracker to track the target successfully using A0, A1 and B trained from sequences taken by another camera or by different cameras. Table 7 and Table 8 show that the tracker fails to track the right target in all cases. The errors per frame are shown in Figure 107 in the appendix.

Exp. No. | Test data | Result file | Confidence interval of the error η | Confidence interval of manual tracking error η'32 | Standard deviation
26 | e1.avi | ree1R7.avi | 48792 < η < 43804 | 1.6350 < η' < 14.7650 | 44102
27 | tn3.avi | retn3R5.avi | 5784 < η < 5800 | 0.5081 < η' < 26.7419 | 866.8
28 | new11.avi | renew11R11.avi | 383.2 < η < 402.2 | 7.3299 < η' < 43.12 | 185.13
29 | ts1.avi | rets111.avi | 10282 < η < 10292 | 3.8960 < η' < 18.2540 | 3091.2
30 | ts2.avi | rets25.avi | 7022 < η < 7033 | 3.7280 < η' < 23.5720 | 6269.7

Table 7 Errors in tracking using A0, A1 and B trained from sequences taken by another camera

Exp. No. | Test data | Result file | Confidence interval of the error η | Confidence interval of manual tracking error η'33 | Standard deviation
31 | e1.avi | ree1R8.avi | 22435 < η < 22447 | 1.6350 < η' < 14.7650 | 11864
32 | tn3.avi | retn3R4.avi | 4944 < η < 4959 | 0.5081 < η' < 26.7419 | 623.92
33 | new11.avi | renew11R12.avi | 106.3 < η < 125.3 | 7.3299 < η' < 43.12 | 76.814
34 | ts1.avi | rets1K9.avi | 14091 < η < 14101 | 3.8960 < η' < 18.2540 | 21723
35 | ts2.avi | rets26.avi | 1539 < η < 1550 | 3.7280 < η' < 23.5720 | 2732.5

Table 8 Errors in tracking using A0, A1 and B trained from sequences taken by both cameras

32 If the confidence interval of the error η is below the maximum of η', we consider that the tracker successfully tracks the target. 33 If the confidence interval of the error η is below the maximum of η', we consider that the tracker successfully tracks the target.


(3) Tracking a non-corner billboard with A0, A1 and B trained from a corner billboard

Most of the non-corner billboards are occluded because players are in front of them. As such, the experiments on tracking a non-corner billboard with A0, A1 and B trained from a corner billboard are shown in section 5.5.3.5, Experiment on occluded billboards and billboards that partly run out of the scene.

5.5.3.3. Experiment on tracking with default A0, A1 and B

As discussed earlier in section 5.5.2, we would like to test whether we can use a likely default set of A0, A1 and B as the coefficients. Table 9 and Figure 58 below show the error for each test sequence. As in the experiments done earlier, the tracker failed to track the right target for test sequences e1.avi and tn3.avi; instead, the tracker keeps tracking almost the same wrong places. We assume the reason is the same as discussed above.

In experiments no.41 to 43, the tracker managed to track the target's rough location, though the scaling and shearing are not as accurate as in experiments no.2, 4, 6, 8, and 10, where we used the coefficients A0, A1 and B trained from the sequence itself. With the default A0, A1 and B, the tracker's performance is as unstable as in the earlier experiments (no.11-no.35). As shown in experiment no.42, when a player runs close to the billboard, the bottom line edge of the target becomes invisible; furthermore, the players have strong edges in the searching area, so the tracker jumps to the player and starts tracking the wrong target, even though it had successfully tracked the right target for 19 frames.


Exp. No. | Test data | Result file | Confidence interval of the error η | Confidence interval of manual tracking error η'34 | Standard deviation
39 | e1.avi | ree1Default2.avi | 21484 < η < 21496 | 1.6350 < η' < 14.7650 | 13728
40 | tn3.avi | retn3Default2.avi | 5663 < η < 5678 | 0.5081 < η' < 26.7419 | 2065.5
41 | new11.avi | renew11Default2.avi | 639.3 < η < 658.4 | 7.3299 < η' < 43.12 | 2287.1
42 | ts1.avi | rets1Default1.avi | 8283 < η < 8292 | 3.8960 < η' < 18.2540 | 8351.7
43 | ts2.avi | rets2Default1.avi | 109.7 < η < 120.0 | 3.7280 < η' < 23.5720 | 45.091

Comment: the initial ranges used are those used in experiments no.19-no.24.

Table 9 Errors in tracking using default A0, A1 and B

a) ree1Default2.avi – The tracker failed to track the target. – exp. no.39

b) retn3Default2.avi – The tracker keeps tracking the same wrong place as in other experiments, partly due to the edge image as described earlier in this section. – exp. no.40

c) renew11Default2.avi – The tracker keeps tracking the right target in most of the frames; the high peak error in the pink circle occurs when the tracker jumps to the nearby 'harboe' billboard. – exp. no.41

34 If the confidence interval of the error η is below the maximum of η', we consider that the tracker successfully tracks the target.


d) rets1Default1.avi – At the 20th frame, the tracker jumps to the player as the bottom edge line is no longer visible. – exp. no.42

e) rets2Default1.avi – The tracker keeps tracking the target throughout the sequence, but the scaling and shearing are not accurate. – exp. no.43

Figure 58 Tracking result and error per frame using default A0, A1, and B

5.5.3.4. On-line training

We observed from the above experiments that both the default A0, A1, and B and the A0, A1, and B trained from the same video sequence give a fairly good tracking result, so we would like

to combine them to make on-line training. Questions are when and how often we

should train the data. Our experiment shows that it is not a good idea to start training

data from the very beginning because initially particles are randomly chosen. They do

not represent the system’s motion well. We start to train the data from the 31st frame.

The tracking result is not as accurate as we can observe in using A0, A1, and B trained

from the test sequence itself, but the tracked curve is very close to the target.

Furthermore, we can also conclude that it is not necessary to train data to get a new

set of A0, A1, and B for every frame; however, it is difficult for us to find out an

optimal interval for training. Below figure shows some of the tracking results using

on-line training. 2 experiments result can be seen in Figure 109 and Table 17 in the

appendix.
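In code, the on-line training loop looks roughly as follows. This is a minimal sketch only: trackOneFrame and trainDynamics are assumed helper functions (the latter implementing the maximum likelihood estimation of A0, A1, and B described in Appendix C), and the retraining interval of 10 frames is an illustrative value, not a tuned setting.

% Minimal on-line training sketch (illustrative only).
Qhist = [];                                      % buffer of tracked Q vectors, one column per frame
for t = 1:nFrames
    Q = trackOneFrame(frames{t}, A0, A1, B);     % one condensation step with the current dynamic model
    Qhist = [Qhist, Q];
    % start training at the 31st frame and retrain every 10 frames
    if t >= 31 && mod(t, 10) == 0
        [A0, A1, B] = trainDynamics(Qhist(:, end-29:end));   % re-estimate from the last 30 states
    end
end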


31st frame / 36th frame / 41st frame / 46th frame
Figure 59 Tracking using on-line trained A0, A1, and B – some results from experiment no.44

5.5.3.5. Experiments on occluded billboards and billboards partly running out of the scene
It is very difficult to train A0, A1, and B by manually tracking the billboards at each side of the playfield, as most of the billboards at the sides are either totally or partly covered by the players, or totally or partly out of the scene. Table 10 below shows the results of tracking billboards at the side of the playfield using both A0, A1, and B trained by tracking the corner billboard in training data taken by the same camera, and A0, A1, and B trained by tracking the side billboard in the test sequence itself. Because experiments no.16 and no.18 show very good tracking results using A0, A1, and B trained from sequences taken by the same camera, we test only these two video sequences: in test sequence ts2.avi the billboard 'e BOKS' is occluded, and in test sequence new11.avi the billboard 'EL GIGANTEN' in the pink circle shown in Figure 60 is also occluded.


We name the billboard that eventually goes out of the screen EL1, and the billboard that has players in front of it EL2.
a) 1st frame from new11.avi
b) 1st frame of ts2.avi
Figure 60 Video sequences used to test occlusion cases

Exp.No. | Test data | Result file | Coefficients A0, A1, and B | Confidence interval of the error η | Confidence interval of the manual tracking error η' [35] | Standard deviation
46 | ts2.avi | reBoksTs25.avi | trained by tracking the side billboard itself in the same sequence | 686.0 < η < 696.0 | 3.7280 < η' < 23.5720 | 758.07
47 | '' | reBoksTs24.avi | trained by manually tracking the corner billboard | 2383 < η < 2393 | '' | 1075.4
48 | new11.avi | reelg03.avi | trained by tracking the side billboard itself in the same sequence | 4444 < η < 4463 | 7.3299 < η' < 43.12 | 3232.4
49 | '' | elgan12.avi | trained by manually tracking the corner billboard | 45044 < η < 45064 | '' | 35028
Table 10 Errors in tracking using A0, A1, and B trained by manually tracking the corner billboard versus those trained by tracking the side billboard itself in the same sequence – experiments no.46-49

As shown in Table 10, using the coefficients A0, A1, and B trained by tracking the corner billboard in training data taken by the same camera, the tracker's performance is not satisfactory.

[35] If the confidence interval of the error η is below the maximum of η', we consider that the tracker successfully tracks the target.


In the case of tracking the 'eBOKS' billboard, the tracker managed to track the rough location of the billboard, but the rotation of the tracking result is very wrong from the very beginning, even though the first two frames tracked almost the correct target (see Figure 61).

1st frame / 2nd frame / 3rd frame / 4th frame
Figure 61 Tracking result of experiment no.47 – reBoksTs24.avi

On the contrary, the tracking result using the coefficients A0, A1, and B trained by tracking the 'eBOKS' billboard in the same test video sequence is much better. We assume one of the reasons is that we treat the billboards as affine transformed and ignore the projection from 3D to 2D images.

Tracking 'EL GIGANTEN' is very difficult. We can observe that the tracking result is very bad. One reason is clearly the Canny edge threshold, as we have observed in all the previous experiments: when we lower the Canny edge threshold, the error seems to decrease. Since our measurement step uses only the edge information to define the degree of fitness, the edges of each frame are decisive for the tracker's performance. The figure below shows the Canny edge images of some frames. As we observed, there are seldom clear edges for the 'EL GIGANTEN' billboard, even though we lowered the Canny edge threshold to 0.3 or 0.1. The strong edges for this billboard are the bottom line and the lines around the text 'EL GIGANTEN'; these lines are, however, not a unique representation of the 'EL GIGANTEN' billboard.
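The effect of the Canny threshold can be inspected directly in Matlab. The lines below are a small illustration using the built-in edge function (not the cannyEdge wrapper from the appendix) and the threshold values 0.3 and 0.1 mentioned above; the file name is a placeholder.

% Compare binary edge maps for two Canny thresholds (illustration only).
frame = imread('frame.png');          % placeholder file name
gray = rgb2gray(frame);
edges03 = edge(gray, 'canny', 0.3);   % higher threshold: fewer, stronger edges
edges01 = edge(gray, 'canny', 0.1);   % lower threshold: more edges, but more clutter
figure;
subplot(1,2,1); imshow(edges03); title('threshold 0.3');
subplot(1,2,2); imshow(edges01); title('threshold 0.1');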


1st frame / 6th frame / 14th frame / 24th frame
Figure 62 Canny edges of the 'EL GIGANTEN' billboard

Further to testing the tracking result for cases like occlusion and parts of the image running out of the frame, we experiment on the video sequences eOcclusion.avi, sparOut.avi and brother.avi. In eOcclusion.avi, the billboard 'e' is occluded in the last few frames; in sparOut.avi, the billboard 'SPAR' runs out of the frame image in the last few frames; and in brother.avi, the billboard 'brother' is occluded from the start. The results are shown in Table 18 in the appendix.

eOcclusion.avi / sparOut.avi / brother.avi
Figure 63 Test data – experiments no.50-52

We observe that when occlusion occurs, the tracker's performance is affected. Take 'eOcllusion3.avi' as an example: before the occlusion happens, the tracker's performance is pretty good. When occlusion occurs, as long as it does not significantly change the shape of the curve, the tracker can still manage to locate the target


as shown in Figure 64. But when the occlusion covers most of the curve, the tracker fails to track the target.

11th frame 13th frame 15th frame 17th frame

22nd frame 24th frame 26th frame

Figure 64 Tracking occluded object-experiment no.50

For the case of tracking the billboard 'brother', the result is not satisfactory. We analyse that one reason is that the occlusion occurs right from the start, making it difficult to locate the right candidate in the first few frames; after a few very wrong tracking results, it is difficult for the tracker to recover the target again. Another reason for the bad result is that the 'brother' billboard, like the 'EL GIGANTEN' billboard, is rather long. As such, when we increase the search area, the number of particles might also need to be increased in order to get a satisfactory tracking result. Furthermore, most of the billboards on the side are the same billboards lying one after another, so it is very likely that the tracker will switch to tracking another billboard with the same pattern instead of the one we target.


5.6. Further Discussions

5.6.1. Optimal Definition of the Final Q at the Measurement Step
At each time step there are n Q vectors for n samples, because each sample has a unique Q vector for its transformation. Since we have no clue about what exactly the correct transformation is, an optimal Q vector should be defined based on convincing reasoning.

Our tracker calculates a weighted average of the Q vectors, because this takes all samples into account based on their probabilities. The weighted average can be a good alternative when the distribution is symmetrical and steeply unimodal, as shown in Figure 65 a).

a) b) c)
Figure 65 Weighted average in different distributions

However, when the distribution is asymmetric as in Figure 65 b), the weighted average may not always lead to a good result, because the value with the highest weight does not contribute much. Instead, the mode, as shown in Figure 66 a), reflects the real probability.

a) b)
Figure 66 Mode in different distributions
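The difference between the two estimates can be written in a few lines of Matlab. The sketch below assumes a 6×N matrix Q of particle states and a 1×N vector w of normalized weights (both placeholder names); it only illustrates the two definitions discussed above.

% Two ways to define the final state from N weighted particles.
% Q : 6 x N matrix, one Q-space state per column
% w : 1 x N vector of particle weights, assumed to sum to 1
Qmean = Q * w';          % weighted average of all particle states
[~, iMax] = max(w);      % index of the particle with the highest weight
Qmode = Q(:, iMax);      % "mode": the single best-weighted state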


When there are two peak modes as in Figure 66 b) and c), using the mode is better than using the weighted average. In most handball games there are always a few identical billboards in a scene; these billboards are either placed one after another or at different places. Our current algorithm only gives one best result. As such, if two billboards are hit by the particles, their weights will both be high, and we face the problem illustrated in Figure 66 c): the weighted average will fail to track either of the good candidates, and the final result will instead be somewhere in between the two targets.

All the discussions so far only handle tracking one single billboard. Supposing the tracking is correct and there are a few identical billboards in a scene, it might be a good idea to set up a good threshold to check the fitness: all the particles whose fitness degree is above the threshold can then represent the multiple targets.

5.6.2. Optimal Number of Particles
Isard and Blake (1998) argue that, with 1,000 or more samples, the particle filter's performance improves a lot in tracking a shaking object moving with a natural speed in a six-dimensional shape space. Our results support their findings as well, even though we tested with as few samples as 100, 200 and 300 due to the running time problem. We assume that, using a larger number of particle samples and a faster computer language than Matlab, our tracker's performance will be much better. As mentioned before, an increased number of particles will increase the computation time, so it is important to find an optimal number of particles where the required quality and the speed of tracking are both at an acceptable level.

We observed that, when using the trained A0, A1, and B, after some frames the particle samples are more or less clustered at the same place, as shown in Figure 67. As such, we assume that we can reduce the number of particles after a certain number of frames by only re-sampling those particles which have higher probabilities and ignoring the ones whose probabilities are very low.
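A minimal sketch of this pruning idea is shown below, again assuming a 6×N matrix Q of particle states and a weight vector w (placeholder names); the threshold and the reduced particle count are illustrative values, not tuned settings from our experiments. Note that randsample requires the Statistics Toolbox.

% Prune low-probability particles once the tracker has stabilized (illustration only).
keep = w > 0.2 * max(w);                        % drop particles far below the best weight
Q = Q(:, keep);
w = w(keep) / sum(w(keep));                     % renormalize the remaining weights
Nnew = min(100, numel(w));                      % optionally resample a smaller particle set
idx = randsample(numel(w), Nnew, true, w);      % weighted resampling with replacement
Q = Q(:, idx);
w = ones(1, Nnew) / Nnew;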


1st frame (particles randomly thrown in the range area) / 3rd frame / 7th frame
Figure 67 Particles have almost the same size and shearing as tracking continues.

5.6.3. Initialization and the Stability of the Filter
As mentioned earlier in this chapter, in order to run the condensation algorithm we need to generate a number of particle samples and assign each particle a weight according to how well the particle fits the template. Since the initial particles are randomly generated, the results for the first two frames are very often not accurate. As long as the results of the first two steps are not very wrong, the condensation algorithm will gradually locate the target through the iterative sampling, and the particle filter's performance becomes stable (as shown in Figure 68). It is very important that the template is unique enough to prevent the tracker from jumping to another similar object. It is also required that the curves chosen as the template lie on strong edges, so that they are visible in the binary edge image.
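Initialization amounts to drawing the particle states uniformly inside the predefined Q-space ranges (cf. Table 13 in the appendix). The lines below are a generic sketch of that step; the bounds shown are the ones listed for new11.avi, and N is a placeholder particle count.

% Draw N initial particles uniformly inside a 6-dimensional Q-space range.
Qmin = [480; 120; -0.5; -0.5; 0; 0];            % lower bounds (cf. Table 13, new11.avi)
Qmax = [580; 170; -0.25; -0.25; 0; 0];          % upper bounds
N = 200;                                        % number of particles
Q0 = repmat(Qmin, 1, N) + rand(6, N) .* repmat(Qmax - Qmin, 1, N);   % 6 x N initial states
w0 = ones(1, N) / N;                            % equal initial weights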


1st frame / 2nd frame / 3rd frame / 4th frame
Figure 68 Initial results and tracking results

However, in real-time tracking the video sequence cannot be trained beforehand, so only the coefficients A0, A1, and B trained from previous games are used to approximate the system's motion model. In such a case, an accurate initialization is preferable to ensure successful tracking.

VI. Comparison of Template Matching and Condensation
In chapter III and chapter IV we discussed the deformable template matching and the condensation algorithm respectively, and demonstrated them with the experiments discussed in chapter V. The two methods have many common features. First of all, a good, unique representation of the template is required in both methods. Secondly, the shape of the template can be either rigid or non-rigid (deformable), although in the template matching method we considered only the rigid case. Thirdly, both the deformable template matching and the condensation tracker use edge information to check the fitness level, i.e., we use the edge information to approximate the posterior probability of each sample (each location in the template matching method). In order to approximate the posterior probability, the distances between the particles' curves and their nearest edges in the frame image are calculated.

Both methods also have unique features. First, in the template matching algorithm, the gradient directions of the template edge and the frame image edge are used in approximating the posterior probability. Secondly, in the template matching, the


object is located in each frame by minimizing the total energy; it does not utilize any relationship between the frames, such as the motion pattern. As such, the result from the previous frame does not affect the result of the next frame, and the tracking from frame to frame is totally independent. We could utilize the result from the previous frame by limiting the search region around the previous location of the target object, based on the assumption that the object does not move too quickly.

In contrast, the condensation tracker is a combination of factored sampling and a stochastic model for object motion. A number of particles are generated, each of which has a state defined as a Q-space representation of the template curves and its associated weight. The states of the particles at time t depend on their states at time t-1. As such, the result from the previous frame affects the result of the next frame. The final object location is obtained as the weighted average location of the particles. It is more robust and time-saving.

It is clear that the condensation algorithm is more suitable for real-time tracking with respect to speed. Condensation has, at the same time, its difficulties, such as initialization and inaccuracy at the beginning of the tracking. Deformable template matching, on the other hand, can locate the right object given a good representation of the template and a clear edge image right from the start. This is achieved by a huge amount of computation, which is very time-consuming.

Since each method has its advantages and disadvantages, we can try to utilize both of them. For example, it may be a good idea to use the deformable template matching to locate the object initially and then define the initial particles' range according to the location of the detected object. As such, an automatic initialization becomes possible. Of course, it will cause problems when a new clip starts, since new particles need to be generated, and the long processing time required by the template matching method will cause delays in the broadcast.


VII. Conclusion and Future Improvements
We have implemented algorithms which can achieve real-time tracking of a billboard in a handball game. Specifically, we studied two algorithms, deformable template matching and the condensation algorithm, made a program for each of these, and tested them in terms of speed and accuracy. We have also discussed what constitutes a good template representation and the handling of occlusion.

Tracking in real time is difficult, especially when the background is cluttered and the motion is complicated. The deformable template matching successfully finds a target by searching for edges which minimize the potential energy, penalizing the deformation of the template. It takes a lot of time even to process just one frame. We think it is more suitable for locating objects and retrieval from some database, where processing speed is not crucial.

The condensation algorithm, on the other hand, combines statistical factored sampling for non-Gaussian observations in a sequence with deterministic and stochastic effects in a dynamic model. This algorithm is fast enough to track non-rigid motion even against a complicated background. Motion blur, however, is a problem for robust tracking. A good number of samples is required to improve the accuracy, at the cost of processing time. Tracking can be achieved in real time if a faster programming language than Matlab is used. Furthermore, the number of control points can also affect the processing time: the more control points used, the more accurately the template is represented, and the fewer points used in the measurement step, the faster the performance is.

The following features can be added to the tracker in the future in order to improve the speed and the accuracy of the performance.

First, our tracker uses a Q-space representation of the curve as the state of the particles and uses edge information in the measurement step to check how well each particle fits the template. However, other state parameters and image features can also be used. For example, we can define the state as a searching window. Instead of checking the nearest edges, we can compare the histogram or color combinations of


each searching window with that of the template window and then use the difference

to approximate the probability.
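As an illustration of this alternative measurement model, the snippet below compares the gray-level histogram of a candidate search window with that of the template window using the Bhattacharyya coefficient. It is a generic sketch (templateWin and candidateWin are placeholder image patches), not part of our implemented tracker.

% Histogram-based fitness of one search window against the template (illustration only).
tmplHist = imhist(rgb2gray(templateWin), 32);   % 32-bin gray-level histogram of the template window
candHist = imhist(rgb2gray(candidateWin), 32);  % histogram of the candidate search window
tmplHist = tmplHist / sum(tmplHist);            % normalize to probability distributions
candHist = candHist / sum(candHist);
fitness = sum(sqrt(tmplHist .* candHist));      % Bhattacharyya coefficient in [0,1];
                                                % a higher value could be used as the particle weight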

Secondly, we can eliminate particles with low likelihood, as Rui and Chen (2001) suggest. In order to do so, we should find a good threshold value to eliminate very unlikely particles. Then all the particles will be our targets as long as their values obtained from the measurement step are higher than the threshold. This threshold value will also help us to realize that the targets are no longer in the scene or that the tracker has lost tracking; in such a case, we should reinitiate the particles.

Alternatively, we can reduce the number of particles when the tracker becomes stable in order to reduce the computational load, and subsequently the processing time, without lowering the quality. Through the iterations, the more likely particles gradually get higher probabilities and the less likely particles get lower probabilities. Until good particles have been selected it is important to have many samples; afterwards, however, the number does not matter a lot. We should find a reasonable number of iterations for the tracker to stabilize, and after that number of iterations we can reduce the number of particles.

Thirdly, in order to make the replacement as natural as possible, in the future we need to deal with motion blur and occlusion. If we can estimate the amount of motion blur in the video sequence, we can remove it when getting the edge map of the frames and put the same motion blur back when inserting the new billboard. Furthermore, we need to identify players and make masks for them so that we can detect occlusion and put the players back after inserting new billboards.

Finally, a combination of template matching and condensation might improve the tracker's performance. Since the reliability of the condensation tracker improves a lot if the initial Q-space state is sampled close to the target, initialization can be done through template matching. It can be feasible even if it takes one minute for initialization, as long as the delay in presentation time is kept equal for every frame and the sequence looks natural (Lu, 1997). Actually, Jain et al. (1998) also suggest the possibility of combining Kalman filtering or other prediction schemes with the template matching.


The template matching can also be improved in some respects. First, as we mentioned earlier, when implementing the deformable template matching method we made the assumption that the template does not deform, so that we can ignore the internal energy part in equation 8. However, a 3D to 2D projection is likely to deform the shape of an object. Even though the deformation is small, it should be included in the energy minimization calculation. At the same time, this inclusion takes additional computation, which results in a longer processing time.

Secondly, we could utilize the variable m in equation 3 to change the locality and smoothness of the model. The original algorithm uses a threshold to decide whether an object is the target object; if it is a good candidate, m is incremented by 1. Such an elaborate screening mechanism searches objects more thoroughly, which may result in more computational load together with more accuracy. In addition, we have no idea which threshold we should use; we choose the object with the minimum energy defined by equation 7, which means that we can only detect one object.

Finally, Jain and Zhong (2000) discuss that, in the first step, region screening by texture and color information is quite useful for reducing the processing time. Combining edge features together with other features might improve the speed and accuracy of this step as well.


Appendix

A. Gradient Magnitude for Pixel Representation

The gradient magnitude is defined as

\nabla f = \left[ \left( \frac{\partial f}{\partial x} \right)^2 + \left( \frac{\partial f}{\partial y} \right)^2 \right]^{1/2} = \left[ G_x^2 + G_y^2 \right]^{1/2} \approx |G_x| + |G_y|        eq. 51

for an image function f (Gonzalez, 2001), where [G_x \; G_y]^T is called the gradient. Often the intensities are laid out pixel by pixel as shown in Figure 69.

z1  z2  z3
z4  z5  z6
z7  z8  z9

Figure 69 A 3×3 region of an image, where the z's are intensity (gray-level) values

In this case, the gradient magnitude is calculated as follows:

\nabla f \approx |z_9 - z_5| + |z_8 - z_6|        eq. 52

or, if we take gradients in four directions instead of two,

\nabla f \approx |(z_7 + 2z_8 + z_9) - (z_1 + 2z_2 + z_3)| + |(z_3 + 2z_6 + z_9) - (z_1 + 2z_4 + z_7)|        eq. 53

The corresponding masks for the former are called Roberts cross-gradient operators, and those for the latter are Sobel operators.
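For reference, these approximations can be evaluated directly in Matlab; the snippet below is a small illustration of eq. 52 and eq. 53 using 2-D convolution (it is not taken from our tracker code, and the input file name is a placeholder).

% Gradient magnitude approximations of eq. 52 (Roberts) and eq. 53 (Sobel).
I = im2double(rgb2gray(imread('frame.png')));
r1 = [-1 0; 0 1];  r2 = [0 -1; 1 0];                        % Roberts cross-gradient masks
gradRoberts = abs(conv2(I, r1, 'same')) + abs(conv2(I, r2, 'same'));
sy = [-1 -2 -1; 0 0 0; 1 2 1];                              % Sobel mask for vertical change
sx = sy';                                                   % Sobel mask for horizontal change
gradSobel = abs(conv2(I, sy, 'same')) + abs(conv2(I, sx, 'same'));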

B. Color Model Conversion from RGB to HSI

H = \begin{cases} \theta & \text{if } B \le G \\ 360^{\circ} - \theta & \text{if } B > G \end{cases}        eq. 54

with

\theta = \cos^{-1} \left\{ \frac{ \frac{1}{2}\left[ (R-G) + (R-B) \right] }{ \left[ (R-G)^2 + (R-B)(G-B) \right]^{1/2} } \right\}        eq. 55

S = 1 - \frac{3}{R+G+B} \min(R, G, B)        eq. 56

I = \frac{R+G+B}{3}        eq. 57

We convert the color information into the intensity values by equation 57 in order to

calculate the gradient and subsequently find edges.

C. Estimation of A0, A1, and B by the Maximum Likelihood Estimation Method

First we maximize the log-likelihood with respect to A_0 and A_1. This is equivalent to minimizing the following function with respect to A_0 and A_1, ignoring the second term of equation 43 on page 56 because it is independent of both A_0 and A_1:

f(A_0, A_1) = \sum_{n=1}^{m-2} \left\| B^{-1} \left( Q_{n+2} - A_1 Q_{n+1} - A_0 Q_n \right) \right\|^2        eq. 58

This is equivalent to

f(A_0, A_1) = \sum_{n=1}^{m-2} \left( Q_{n+2} - A_1 Q_{n+1} - A_0 Q_n \right)^T C^{-1} \left( Q_{n+2} - A_1 Q_{n+1} - A_0 Q_n \right), \qquad C = B B^T .

Expanding the products and collecting the sums of outer products of the Q vectors gives

f(A_0, A_1) = \mathrm{tr}\left( C^{-1} Z \right)        eq. 59

where

Z = S_{22} - S_{21} A_1^T - S_{20} A_0^T - A_1 S_{12} + A_1 S_{11} A_1^T + A_1 S_{10} A_0^T - A_0 S_{02} + A_0 S_{01} A_1^T + A_0 S_{00} A_0^T

S_{ij} = \sum_{n=1}^{m-2} Q_{n+i} Q_{n+j}^T, \qquad i, j = 0, 1, 2.        eq. 60

From equation 58, f is non-negative and quadratic in (A_0, A_1), so a minimum of f exists, and the minimum of zero is achieved when

Q_{n+2} - A_1 Q_{n+1} - A_0 Q_n = 0 .

Multiplying this condition by Q_n^T and Q_{n+1}^T respectively and summing over n, the condition expands to

\sum_{n=1}^{m-2} \left( Q_{n+2} - A_1 Q_{n+1} - A_0 Q_n \right) Q_n^T = S_{20} - A_1 S_{10} - A_0 S_{00} = 0

\sum_{n=1}^{m-2} \left( Q_{n+2} - A_1 Q_{n+1} - A_0 Q_n \right) Q_{n+1}^T = S_{21} - A_1 S_{11} - A_0 S_{01} = 0        eq. 61

In this way the solutions \hat{A}_0 and \hat{A}_1 are obtained independently of C.

It remains to estimate C. Rewriting equation 43 on page 56,

L = -\frac{1}{2} \mathrm{tr}\left( C^{-1} Z \right) + \frac{1}{2}(m-2) \log \det C^{-1}

Taking the derivative with respect to C^{-1} (using the identity \partial \det(M) / \partial M \equiv \det(M) M^{-T}) and setting it to zero,

\nabla_{C^{-1}} L = -\frac{1}{2} Z + \frac{m-2}{2} C^T = 0

By fixing \hat{A}_0 and \hat{A}_1 obtained in equation 61, the optimum C which maximizes L is given as

\hat{C} = \frac{1}{m-2} Z(\hat{A}_0, \hat{A}_1)        eq. 62

Since C is a covariance matrix, B can simply be obtained as a matrix square root of C:

B = \sqrt{C}
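A compact Matlab sketch of this estimation is given below. It follows eq. 58-62 directly, solving the two linear conditions of eq. 61 for A1 and A0 and then forming Z and C; the function name and the layout of the training states as matrix columns are illustrative choices, not the exact code used in the experiments.

function [A0, A1, B] = trainDynamics(Q)
% ML estimation of A0, A1, B from tracked states (sketch of eq. 58-62).
% Q : d x m matrix whose n-th column is the tracked Q vector of frame n (d = 6 here).
[d, m] = size(Q);
Q0 = Q(:, 1:m-2);  Q1 = Q(:, 2:m-1);  Q2 = Q(:, 3:m);
% S_ij = sum over n of Q_{n+i} * Q_{n+j}'   (eq. 60)
S00 = Q0*Q0';  S01 = Q0*Q1';  S02 = Q0*Q2';
S10 = Q1*Q0';  S11 = Q1*Q1';  S12 = Q1*Q2';
S20 = Q2*Q0';  S21 = Q2*Q1';  S22 = Q2*Q2';
% eq. 61:  [A1 A0] * [S11 S10; S01 S00] = [S21 S20]
A  = [S21 S20] / [S11 S10; S01 S00];
A1 = A(:, 1:d);
A0 = A(:, d+1:2*d);
% eq. 59 and eq. 62: Z evaluated at (A0, A1), then C = Z/(m-2) and B = sqrt(C)
Z = S22 - S21*A1' - S20*A0' - A1*S12 + A1*S11*A1' + A1*S10*A0' ...
    - A0*S02 + A0*S01*A1' + A0*S00*A0';
C = Z / (m - 2);
C = (C + C') / 2;      % symmetrize against numerical round-off
B = sqrtm(C);
end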

D. Images from the Experiment

D.1. The Results of Section 5.4 Experiments on Deformable Template Matching

D. 1. 1. Experiment with Point Templates Given in Table 3

In the following figures blue points are manually clicked points and red points are the

best matched points on the target object.

a) Template with 4 points b) Result

Figure 70 Result 1 with ‘tøj’ -4 points - successful (The same figure as Figure 33) Total time to find the match = 30 seconds, MSE = 0.5

a) Template with 4 points b) Result

Figure 71 Result 2 with ‘tøj’ -4 points – successful Total time to find the match = 32 seconds, MSE = 1.0


a) Template with 4 points b) Result

Figure 72 Result 3 with ‘toj’ – 4 points - failed

Total time to find the best match = 30 seconds, MSE = 4578

a) Template with 4 points b) Result

Figure 73 Result 4 with ‘toj’ – 4 points - failed (The same figure as Figure 34) Total time to find the best match = 30 seconds, MSE = 4578

a) Template with 6 points b) Result

Figure 74 Result 5 with ‘El GIGANTEN’ -6 points – successful

Total time to find the best match = 47 seconds, MSE = 4.0

a) Template with 6 points b) Result

Figure 75 Result 6 with ‘El GIGANTEN’ – 6 points - successful Total time to find the best match = 47 seconds, MSE = 1.2


a) Template with 11 points b) Result

Figure 76 Result 7 with ‘El GIGANTEN’ – 11 points - successful

Total time to find the best match = 52 seconds, MSE = 2.9

a) Template with 11 points b) Result

Figure 77 Result 8 with ‘El GIGANTEN’ – 11 points – successful

(The same figure as Figure 35) Total time to find the best match = 54 seconds, MSE = 2.5

a) Template with 11 points b) Result

Figure 78 Result 9 with ‘El GIGANTEN’ – 11 points – successful Total time to find the best match = 51 seconds, MSE = 3.6

a) Template with 11 points b) Result

Figure 79 Result 10 with ‘El GIGANTEN’ – 11 points – failed

(The same figure as Figure 36) Total time to find the best match = 55 seconds


a) Template with 15 points b) Result

Figure 80 Result 11 with ‘El GIGANTEN’ – 15 points – failed Total time to find the best match = 57 seconds

a) Template with 8 points b) Result

Figure 81 Result 12 with ‘SPAR’ - 8 points – successful Total time to find the best match = 51 seconds, MSE = 1.1

a) Template with 8 points b) Result

Figure 82 Result 13 with ‘SPAR’ - 8 points - successful Total time to find the best match = 49 seconds, MSE = 12.1

a) Template with 10 points b) Result

Figure 83 Result 14 with ‘SPAR’ - 10 points – successful Total time to find the best match = 49 seconds, MSE = 1.0


a) Template with 10 points b) Result

Figure 84 Result 15 with ‘SPAR’ - 10 points - successful (The same figure as Figure 37) Total time to find the best match = 49 seconds, MSE = 0.7

a) Template with 4 points b) Result

Figure 85 Result 16 with ‘SPAR’ - 4 points - failed Total time to find the best match = 45 seconds

a) Template with 10 points b) Result

Figure 86 Result 17 with ‘SPAR’ - 10 points - failed (The same figure as Figure 38) Total time to find the best match = 49 seconds

D.1.2. Experiment with Curve Templates Given in Table 4

In the following pictures, blue points are manually clicked points in order to create a

curve. Red points are control points of the curve. Yellow points are sampled for

searching the target object.


a) Template with 6 points curve b) Result

Figure 87 Result 18 with ‘EL GIGANTEN’ - 6 points are clicked, 11 points sampled - successful Total time to find the best match = 49 seconds, MSE = 1.7

a) Template with 6 points curve b) Result

Figure 88 Result 19 with ‘EL GIGANTEN’ - 6 points are clicked, 11 points sampled - successful (The same figure as Figure 39) Total time to find the best match = 52 seconds, MSE = 3.1

a) Template with 6 points curve b) Result

Figure 89 Result 20 with ‘El GIGANTEN’ - 6 points are clicked, 11 points sampled - successful Total time to find the best match = 52 seconds, MSE = 3.9

a) Template with 6 points curve b) Result

Figure 90 Result 21 with ‘EL GIGANTEN’ - 6 points are clicked, 15 points sampled - successful Total time to find the best match = 55 seconds, MSE = 2.3


a) Template with 6 points curve b) Result

Figure 91 Result 22 with ‘EL GIGANTEN’ - 6 points are clicked, 15 points sampled - successful Total time to find the best match = 55 seconds, MSE = 2.7

a) Template with 10 points curve b) Result

Figure 92 Result 23 with ‘EL GIGANTEN’ - 10 points are clicked, 16 points sampled - successful Total time to find the best match = 56 seconds, MSE = 1.3

a) Template with 10 points curve b) Result

Figure 93 Result 24 with ‘EL GIGANTEN’ - 10 points are clicked, 16 points sampled - successful Total time to find the best match = 49 seconds, MSE = 2.6


a) Template with 6 points curve b) Result

Figure 94 Result 25 with ‘EL GIGANTEN’ - 6 points are clicked, 11 points sampled - failed (The same figure as Figure 40) Total time to find the best match = 56 seconds

a) Template with 10 points curve b) Result

Figure 95 Result 26 with ‘EL GIGANTEN’ - 10 points are clicked, 11 points sampled - failed Total time to find the best match = 55 seconds

a) Template with 6 points curve b) Result

Figure 96 Result 27 with ‘SPAR’ - 6 points are clicked, 5 points sampled - successful Total time to find the best match = 45 seconds, MSE = 15.4

a) Template with 6 points b) Result

Figure 97 Result 28 with ‘SPAR’ - 6 points are clicked, 10 points sampled - successful Total time to find the best match = 52 seconds, MSE = 8.3


a) Template with 6 points b) Result

Figure 98 Result 29 with ‘SPAR’ - 6 points are clicked, 10 points sampled - successful Total time to find the best match = 50 seconds, MSE = 11

a) Template with 10 points b) Result

Figure 99 Result 30 with ‘SPAR’ - 10 points are clicked, 10 points sampled - successful Total time to find the best match = 49 seconds, MSE = 1.8

a) Template with 10 points b) Result

Figure 100 Result 31 with ‘SPAR’ - 10 points are clicked, 15 points sampled - successful Total time to find the best match = 52 seconds, MSE = 1.9

a) Template with 10 points b) Result

Figure 101 Result 32 with ‘SPAR’ - 10 points are clicked, 15 points sampled - successful (The same figure as Figure 41) Total time to find the best match = 49 seconds, MSE = 1.9


a) Template with 10 points b) Result

Figure 102 Result 33 with ‘SPAR’ - 10 points are clicked, 10 points sampled - failed (The same figure as Figure 42) Total time to find the best match = 49 seconds

a) Template with 15 points b) Result

Figure 103 Result 34 with ‘SPAR’ - 15 points are clicked, 15 points sampled - failed Total time to find the best match = 52 seconds

D.2. A0, A1 and B Calculated from the Training Data in Section 5.5.2

Camera N: A0, A1, B
Table 11 A0, A1, and B estimated from the training data taken from camera N

Camera S: A0, A1, B
Table 12 A0, A1, and B estimated from the training data taken from camera S

D.3. Results of Section 5.5 Experiments on Condensation

Test Sequence new11.avi e1.avi tn3.avi ts1.avi/ ts2.avi

Initial Range Min. Max. Min. Max. Min. Max. Min. Max. Min. Max.

Q0(1) 480 580 380 430 400 450 135 175 100 150

Q0 (2) 120 170 145 175 160 180 165 185 190 220

Q0 (3) -0.5 -0.25 -0.8 -0.5 -0.8 -0.5 -0.5 -0.25 -0.5 -0.25

Q0 (4) -0.5 -0.25 -0.8 -0.5 -0.8 -0.5 -0.5 -0.25 -0.5 -0.25

Q0 (5) 0 0 0 0 0 0 0 0 -0.15 -0.1

Q0 (6) 0 0 0 0 0 0 0 0 0.2 0.4

Table 13 Initial range where particles are generated in each test video sequence—used for

experiment no.1-no.18


ree13.avi retn32. avi

new11-3.avi ree15.avi

retn37.avi renew116.avi

rets16.avi rets24.avi

Figure 104 Errors per frame –experiment no. 11 to 18 The x-axis denotes the frame number and the y-axis denotes MSE at each frame.

Test Sequence new11.avi e1.avi tn3.avi ts1.avi/ ts2.avi

Initial Range Min. Max. Min. Max. Min. Max. Min. Max. Min. Max.

Q0(1) 500 550 380 410 400 440 155 200 100 150

Q0 (2) 145 175 145 165 155 175 180 190 190 220

Q0 (3) -0.1 -0.3 -0.8 -0.5 -0.8 -0.5 -0.5 -0.25 -0.5 -0.25

Q0 (4) -0.1 -0.3 -0.8 -0.5 -0.8 -0.5 -0.5 -0.25 -0.5 -0.25

Q0 (5) 0.01 0.02 0.01 0.02 0.001 0.01 -0.1 -0.08 -0.15 -0.1

Q0 (6) 0.004 0.005 0.02 0.03 0.008 0.01 0.2 0.4 0.2 0.4

Table 14 Initial range where particles are generated in each test video sequence—used for

experiment no.19-no.24

Exp.No. / Crp. Exp.No. | Testing data | Result file | Coefficients A0, A1, and B | Confidence interval of the error η | Confidence interval of the manual tracking error η' [36] | Standard deviation
19/11 | e1.avi | ree1R1.avi | Trained from e1.avi-e5.avi | 5.8 < η < 17.5 | 1.6350 < η' < 14.765 | 5.4843
20/12 | tn3.avi | retn3R1.avi | '' | 3499 < η < 3514 | 0.5081 < η' < 26.741 | 2134.3
21/13 | new11.avi | new11R1.avi | '' (smaller and smaller) | 924.6 < η < 943.7 | 7.3299 < η' < 43.12 | 563.56
22/15 | e1.avi | ree1R5.avi | Trained from e1.avi-e5.avi, en11.avi-en14.avi | 28293 < η < 28305 | 1.6350 < η' < 14.765 | 13179
23/16 | new11.avi | renew11R13.avi | '' | 46.0 < η < 65.0 | 7.3299 < η' < 43.12 | 159.78
24/17 | ts1.avi | rets1R3.avi | Trained from e6.avi-e10.avi, es11.avi-es14.avi | 4739 < η < 4748 | 3.8960 < η' < 18.254 | 3587.2
Table 15 Errors in tracking using A0, A1, and B trained from the sequences taken by the same camera – experiments no.19-24

[36] If the confidence interval of the error η is below the maximum of η', we consider that the tracker successfully tracks the target.

ree1R1.avi retn3R1.avi

renew11R1.avi ree1R5.avi

rets1R3.avi

Figure 105 Errors as per frame –experiment no. 19 to 24 The x-axis denotes the frame number and the y-axis denotes MSE at each frame.

Exp.No. | Test data | Result file | Coefficients A0, A1, and B | Confidence interval of the error η | Confidence interval of the manual tracking error η' [37] | Standard deviation
25 | tn2.avi | retn204.avi | Trained from sequences taken by the same camera (e1.avi-e5.avi, en11.avi-en14.avi) | 86976 < η < 87002 | 4.9268 < η' < 40.173 | 71314
Initial range = [380 410 145 155 -0.3 -0.2 -0.3 -0.2 0.01 -0.02 0.07 0.1], threshold = 0.5
Table 16 Errors in tracking using A0, A1, and B trained from the sequences taken by the same camera – experiment no.25

[37] If the confidence interval of the error η is below the maximum of η', we consider that the tracker successfully tracks the target.


Figure 106 Errors as per frame –experiment no. 25

The x-axis denotes the frame number and the y-axis denotes MSE at each frame.

ree1R7.avi retn3R5.avi

renew11R11.avi rets111.avi

rets25.avi

Figure 107 Errors as per frame –experiment no. 26-no.30 The x-axis denotes the frame number and the y-axis denotes MSE at each frame.

ree1R8.avi retn3R4.av

renew11R12.avi rets1K9.avi


rets26.avi

Figure 108 Errors as per frame –experiment no. 31-no.35 The x-axis denotes the frame number and the y-axis denotes MSE at each frame.

Exp.No. | Test data | Result file | avgError with default A0, A1, and B | avgError with A0, A1, and B estimated from the trained data of the previous 30 frames | Confidence interval of the error η | Confidence interval of the manual tracking error η' [38] | Total standard deviation
44 | new11.avi | 4.avi | 25.201 | 72.124 | 35.2 < η < 54.3 | 7.3299 < η' < 43.12 | 58.344
45 | ts2.avi | rets2On3.avi | 38.859 | 560.64 | 1087 < η < 1098 | 3.7280 < η' < 23.5720 | 1982.6
Table 17 Errors in tracking the template 'e' using on-line trained A0, A1, and B – experiments no.44, 45

Figure 109 Error per frame –experiment no. 44 and no.45 The x-axis denotes the frame number and the y-axis denotes MSE at each frame.

reBoksTs24.avi reBoksTs25.avi

[38] If the confidence interval of the error η is below the maximum of η', we consider that the tracker successfully tracks the target.


reelg03.avi Elgan12.avi

Figure 110 Error per frame –experiment no. 46 and no.49 The x-axis denotes the frame number and the y-axis denotes MSE at each frame.

Coefficients A0, A1, and B: estimated from the trained data of the same sequence

Experiment factor | Occlusion occurs later | Occlusion from the start | Out of image
Experiment No. | 50 | 51 | 52
Confidence interval of the error η | 12044 < η < 12068 | 3274 < η < 3323 | 35989 < η < 36013
Confidence interval of the manual tracking error η' [39] | 6.0286 < η' < 29.0214 | 7.3091 < η' < 103.9091 | 7.0562 < η' < 30.293
Standard deviation of the error | 37409 | 959.19 | 37845
File name of the result | eOcllusion3.avi | brother4.avi | spar9.avi
Table 18 Errors – experiments no.50-52

a) Error per frame–experiment no.50 b) Error per frame –experiment no.51

c) Error per frame –experiment no.52 d) Error per frame –experiment no.53

e) Error per frame –experiment no.54 f) Error per frame–experiment no.55

Figure 111 Error per frame – experiments no.50-52. The x-axis denotes the frame number and the y-axis denotes MSE at each frame.

[39] If the confidence interval of the error η is below the maximum of η', we consider that the tracker successfully tracks the target.


E. Matlab Code 1. Functions used for both template matching and condensation function [BF, DB1] = BleFuncGivenV2(u, v, d) % Create a blending function for a paramaeter u on the curve % from a given knot vector v % input: v = [v1, v2, v3,,,, ] = [0,0,0,0, u1,,,u,u,u,u] % the first and last numbers are repeated 4 times. % u -- parameter on the curve % d -- degree of the B-splines % output: BF -- blending function (B1,B2, B3...,Bn+1) % DB1 -- first derivative of Blending function at u m = length(v); % number of knots n+d+1 B = zeros(m-d,d); % B = number of control points x degree DB = zeros(m-d-1, d-1); if u>=v(d) & u<=v(m-d+1) idx = length(find(u>v)); % the range of non-zero Bk,1 is u(idx)~u(idx+1) if u == 0 B(1:d, 1) = 1; else B(idx,1)=1; end for i = 2:d [B, DB] = calcACol(B,DB,v,u,m-d,i); end end BF = B(:,d)'; DB1 = DB(:,d-1)'; function [B, DB] = calcACol(B,DB,v,u,bfn,d) % calculating the dth column of the blending function matrix col(= n+1) x degree. w = zeros(1,bfn); z = zeros(1,bfn); for k = 1:bfn if v(k+d-1)-v(k) ~= 0 z(k) = 1/(v(k+d-1)-v(k)); w(k) = (u-v(k))*z(k); end end for k = 1:bfn-1 B(k,d)=B(k,d-1)*w(k) + B(k+1,d-1)*(1-w(k+1)); DB(k,d-1)=B(k,d-1)*(d-1)*z(k) - B(k+1,d-1)*(d-1)*z(k+1); end B(bfn,d) = B(bfn,d-1)*w(bfn); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


Function [BF, DB1] = BleFuncGivenVp(u, v, p, d) % Create a blending function for a parameter u on the curve % from a given knot vector v % input: v = [v1, v2, v3,..]= [0,...0, u1,...,1,...,1] % the first and last numbers are repeated d times. % u -- parameter on the curve % p -- number of interpolated points % d -- degree of the B-splines % output: BF -- blending function (B1,B2,B3...,Bn+1) % DB1 -- first derivative of Blending function at u if d==4 tempv = zeros(1,p+5); for i = 1:length(tempv) tempv(i) = v(i+1); end v = tempv; end m = length(v); % number of knots n+d+1 B = zeros(p,d); % B = number of control points x degree DB = zeros(p-1, d-1); if u>=v(1) & u<=v(end) idx = length(find(u>=v)); if idx > p idx = p; end % the range of non-zero Bk, 1 is u(idx)~u(idx+1) if u == 0 B(1:d, 1) = 1; else B(idx,1)=1; end for i = 2:d [B, DB] = calcACol(B,DB,v,u,p,i); end end BF = B(:,d)'; DB1 = DB(:,d-1)'; function [B, DB] = calcACol(B,DB,v,u,bfn,d) % calculating the dth column of the blending function matrix col(= n+1) x degree. w = zeros(1,bfn); z = zeros(1,bfn); for k = 1:bfn if v(k+d-1)-v(k) ~= 0 z(k) = 1/(v(k+d-1)-v(k)); w(k) = (u-v(k))*z(k); end end for k = 1:bfn-1 B(k,d)=B(k,d-1)*w(k) + B(k+1,d-1)*(1-w(k+1)); DB(k,d-1)=B(k,d-1)*(d-1)*z(k) - B(k+1,d-1)*(d-1)*z(k+1); end B(bfn,d) = B(bfn,d-1)*w(bfn);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% function [eout,ax,ay] = cannyEdge(varargin) % Create a binary edge map based on the given threshold value % using Canny edge method. % Arranged by edge function in matlab. % input: template image and canny edge threshold % output: eout -- binary edge map - 1 implies edge and 0 otherwise % ax, ay -- gradient magnitude in x and y direction [a,method,thresh,sigma,H,kx,ky] = parse_inputs(varargin{:}); % Transform to a double precision intensity image if necessary if ~isa(a, 'double') a = im2double(a); end m = size(a,1); n = size(a,2); rr = 2:m-1; cc=2:n-1; % The output edge map: e = repmat(false, m, n); if strcmp(method,'canny') % Magic numbers GaussianDieOff = .0001; PercentOfPixelsNotEdges = .7; % Used for selecting thresholds ThresholdRatio = .4; % Low thresh is this fraction of the high. % Design the filters - a gaussian and its derivative pw = 1:30; % possible widths ssq = sigma*sigma; width = max(find(exp(-(pw.*pw)/(2*sigma*sigma))>GaussianDieOff)); if isempty(width) width = 1; % the user entered a really small sigma end t = (-width:width); gau = exp(-(t.*t)/(2*ssq))/(2*pi*ssq); % the gaussian 1D filter % Find the directional derivative of 2D Gaussian (along X-axis) % Since the result is symmetric along X, we can get the derivative along % Y-axis simply by transposing the result for X direction. [x,y]=meshgrid(-width:width,-width:width); dgau2D=-x.*exp(-(x.*x+y.*y)/(2*ssq))/(pi*ssq); % Convolve the filters with the image in each direction % The canny edge detector first requires convolution with % 2D gaussian, and then with the derivitave of a gaussian. % Since gaussian filter is separable, for smoothing, we can use % two 1D convolutions in order to achieve the effect of convolving % with 2D Gaussian. We convolve along rows and then columns. %smooth the image out


aSmooth=imfilter(a,gau,'conv','replicate'); % run the filter accross rows aSmooth=imfilter(aSmooth,gau','conv','replicate'); % and then accross columns %apply directional derivatives ax = imfilter(aSmooth, dgau2D, 'conv','replicate'); ay = imfilter(aSmooth, dgau2D', 'conv','replicate'); mag = sqrt((ax.*ax) + (ay.*ay)); if any(ay) theta = atan(ay./ax); end magmax = max(mag(:)); if magmax>0 mag = mag / magmax; % normalize end % Select the thresholds if isempty(thresh) [counts,x]=imhist(mag, 64); highThresh = min(find(cumsum(counts) > PercentOfPixelsNotEdges*m*n)) / 64; lowThresh = ThresholdRatio*highThresh; thresh = [lowThresh highThresh]; elseif length(thresh)==1 highThresh = thresh; if thresh>=1 error('The threshold must be less than 1.'); end lowThresh = ThresholdRatio*thresh; thresh = [lowThresh highThresh]; elseif length(thresh)==2 lowThresh = thresh(1); highThresh = thresh(2); if (lowThresh >= highThresh) | (highThresh >= 1) error('Thresh must be [low high], where low < high < 1.'); end end % The next step is to do the non-maximum supression. % We will accrue indices which specify ON pixels in strong edgemap % The array e will become the weak edge map. idxStrong = []; for dir = 1:4 idxLocalMax = cannyFindLocalMaxima(dir,ax,ay,mag); idxWeak = idxLocalMax(mag(idxLocalMax) > lowThresh); e(idxWeak)=1; idxStrong = [idxStrong; idxWeak(mag(idxWeak) > highThresh)]; end rstrong = rem(idxStrong-1, m)+1; cstrong = floor((idxStrong-1)/m)+1; e = bwselect(e, cstrong, rstrong, 8); e = bwmorph(e, 'thin', 1); % Thin double (or triple) pixel wide contours end if nargout==0,


imshow(e); else eout = e; end %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % % Local Function : cannyFindLocalMaxima % function idxLocalMax = cannyFindLocalMaxima(direction,ix,iy,mag); % % This sub-function helps with the non-maximum supression in the Canny % edge detector. The input parameters are: % % direction - the index of which direction the gradient is pointing, % read from the diagram below. direction is 1, 2, 3, or 4. % ix - input image filtered by derivative of gaussian along x % iy - input image filtered by derivative of gaussian along y % mag - the gradient magnitude image % % there are 4 cases: % % The X marks the pixel in question, and each % 3 2 of the quadrants for the gradient vector % O----0----0 fall into two cases, divided by the 45 % 4 | | 1 degree line. In one case the gradient % | | vector is more horizontal, and in the other % O X O it is more vertical. There are eight % | | divisions, but for the non-maximum supression % (1)| |(4) we are only worried about 4 of them since we % O----O----O use symmetric points about the center pixel. % (2) (3) [m,n,o] = size(mag); % Find the indices of all points whose gradient (specified by the % vector (ix,iy)) is going in the direction we're looking at. switch direction case 1 idx = find((iy<=0 & ix>-iy) | (iy>=0 & ix<-iy)); case 2 idx = find((ix>0 & -iy>=ix) | (ix<0 & -iy<=ix)); case 3 idx = find((ix<=0 & ix>iy) | (ix>=0 & ix<iy)); case 4 idx = find((iy<0 & ix<=iy) | (iy>0 & ix>=iy)); end % Exclude the exterior pixels if ~isempty(idx)


v = mod(idx,m); extIdx = find(v==1 | v==0 | idx<=m | (idx>(n-1)*m)); idx(extIdx) = []; end ixv = ix(idx); iyv = iy(idx); gradmag = mag(idx); % Do the linear interpolations for the interior pixels switch direction case 1 d = abs(iyv./ixv); gradmag1 = mag(idx+m).*(1-d) + mag(idx+m-1).*d; gradmag2 = mag(idx-m).*(1-d) + mag(idx-m+1).*d; case 2 d = abs(ixv./iyv); gradmag1 = mag(idx-1).*(1-d) + mag(idx+m-1).*d; gradmag2 = mag(idx+1).*(1-d) + mag(idx-m+1).*d; case 3 d = abs(ixv./iyv); gradmag1 = mag(idx-1).*(1-d) + mag(idx-m-1).*d; gradmag2 = mag(idx+1).*(1-d) + mag(idx+m+1).*d; case 4 d = abs(iyv./ixv); gradmag1 = mag(idx-m).*(1-d) + mag(idx-m-1).*d; gradmag2 = mag(idx+m).*(1-d) + mag(idx+m+1).*d; end idxLocalMax = idx(gradmag>=gradmag1 & gradmag>=gradmag2); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % % Local Function : parse_inputs % function [I,Method,Thresh,Sigma,H,kx,ky] = parse_inputs(varargin) % OUTPUTS: % I Image Data % Method Edge detection method % Thresh Threshold value % Sigma standard deviation of Gaussian % H Filter for Zero-crossing detection % kx,ky From Directionality vector error(nargchk(1,5,nargin)); I = varargin{1}; copy_checkinput(I,{'double','logical','uint8','uint16'},... {'nonsparse','2d'},mfilename,'I',1); % Defaults Method='canny'; Thresh=[]; Direction='both'; Sigma=2; H=[]; K=[1 1];


methods = {'canny','prewitt','sobel','marr-hildreth','log','roberts','zerocross'}; directions = {'both','horizontal','vertical'}; % Now parse the nargin-1 remaining input arguments % First get the strings - we do this because the intepretation of the % rest of the arguments will depend on the method. nonstr = []; % ordered indices of non-string arguments for i = 2:nargin if ischar(varargin{i}) str = lower(varargin{i}); j = strmatch(str,methods); k = strmatch(str,directions); if ~isempty(j) Method = methods{j(1)}; if strcmp(Method,'marr-hildreth') warning('''Marr-Hildreth'' is an obsolete syntax, use ''LoG'' instead.'); end elseif ~isempty(k) Direction = directions{k(1)}; else error(['Invalid input string: ''' varargin{i} '''.']); end else nonstr = [nonstr i]; end end % Now get the rest of the arguments switch Method case 'canny' Sigma = 1.0; % Default Std dev of gaussian for canny threshSpecified = 0; % Threshold is not yet specified for i = nonstr if prod(size(varargin{i}))==2 & ~threshSpecified Thresh = varargin{i}; threshSpecified = 1; elseif prod(size(varargin{i}))==1 if ~threshSpecified Thresh = varargin{i}; threshSpecified = 1; else Sigma = varargin{i}; end elseif isempty(varargin{i}) & ~threshSpecified % Thresh = []; threshSpecified = 1; else error('Invalid input arguments'); end end otherwise error('Invalid input arguments'); end if Sigma<=0


error('Sigma must be positive'); end switch Direction case 'both', kx = K(1); ky = K(2); case 'horizontal', kx = 0; ky = 1; % Directionality factor case 'vertical', kx = 1; ky = 0; % Directionality factor otherwise error('Unrecognized direction string'); end %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% function v = chordLengthPara(P,d) % Calculate the knot parameters using the Chord Length % Parameterization method. % % input: P -- [p1(x,y); p2(x,y);..;pn+1(x,y)]' interpolated points % d -- degree (cubic d=4, quadratic d=3, linear d=2) % output: v -- normalized knot vector for open B-splines m = length(P); % Calculate the distance between two neighboring pixels Lsqrt = sqrt(sum((P(1:m-1,:)-P(2:m,:))'.^2)); totalL = sum(Lsqrt); % Normalize the knot vector s = cumsum(Lsqrt)/totalL; knots = [0, s]; % Pad zeros and ones at both ends of the knot vector so that the curve interpolates the two end control points. v = [zeros(1,d-1), knots, ones(1,d-1)]; %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% function CP = find_Control_Points_byCL(kv, P, d) % Find coordinates of n+1 control points for a curve defined by n+1 % interpolated points on the curve and corresponding blending function % when d==4, d==3, d==2 % knots length(v) = n+7, n+5, n+3 % control points = n+3, n+2, n+1 % conditons = n+1, n+1, n+1 % we need more conditions when d==4 and d==3 or we should reduce control % points. % Assume that first control point cp0 repeat twice when d==3 % the first and the last control points cp0 and cpn repeat twice when d==4 % input: kv = [v1, v2, v3,,,, ]= [0,..0, u1,,,1,..1] knot vector

-- the first and last numbers are repeated d times. % P = [p1(x,y); p2(x,y);...;pn+1(x,y)] coordinates of % interpolated points


% d = degree of the B-spline % output: CP = [cp1(x,y),...,cpn+1(x,y)] coordinates of control points p = length(P); B = zeros(p); B(1,1) = 1; % the first control point = first knot B(end, end) = 1; % the last control point = last knot % Calculate the value of blending functions and the first derivatives at each knot. for i = 2:p-1 [BF, DB1]= BleFuncGivenVp(kv(d+i-1), kv, p, d); B(i,:) = BF; end % Calculate the coordinates of control points using the inverse of B. % If det(B) ==0, pseudo inverse is used instead. if det(B) ~= 0 CP = inv(B)*P; else CP = inv(B'*B)*B'*P; end % add the first and last knots as control points % when d==4 repead both the first and the last control points, when d==3 repeat the first control % point. When d==2, do nothing. switch d case 4 CP = [CP(1,:); CP; CP(end,:)]; case 3 CP = [CP(1,:); CP;]; end %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % function [AX,AY] = normaliz2(ax, ay); % Normalize a vector (ax(i,j), ay(i,j)) % input : ax -- x coordinates matrix % ay -- y coordinates matrix % output: AX -- normalized x coordinates matrix % AY -- normalized y coordinates matrix [m,n]=size(ax); AX=ones(m,n); AY=ones(m,n); for i = 1:m for j=1:n d=sqrt(ax(i,j)^2+ay(i,j)^2); if d~=0 AX(i,j)=ax(i,j)/d; AY(i,j)=ay(i,j)/d; end end end


2. Functions used for template matching function [dMap,ix,iy, iX, iY,nrI,Ie]=disMap(I,thresh,d) %Calculate the distance map Ø and %the correspoinding nearest edge point's unit vector %input: I -- frame image % thresh -- canny edge threshold % d -- range to check the distance from edge points %output: dMap -- distance map % ix, iy -- coordinates of the edges % iX, iY -- normalized directional vectors of the nearest edge % nrI -- number of edge pixels % Ie -- edge map by canny tic %I=imread(I); I=rgb2gray(I); [m,n]=size(I); % Since the frame is blurred, put lower threshold to get better edge [Ie,ix1,iy1] = cannyEdge(I,’canny’, thresh); % Normalize the directional vector [IX, IY] = normaliz2(ix1,iy1); % Get index for the edge points [iy, ix] = find(Ie==1); % Get the distance map for the frame as far as d from edge points nrI=length(iy); dMap=250*ones(m,n);% initial distance map iY=ones(m,n); % used to record distances iX=ones(m,n); for i=1:nrI for j=-d:1:d r=round(iy(i)+j*IY(i)); c=round(ix(i)+j*IX(i)); if r>1 & r<m & c>1 & c<n % not include the border if dMap(r,c)>abs(j) dMap(r,c)=abs(j); iY(r,c)=IY(iy(i),ix(i)); iX(r,c)=IX(iy(i),ix(i)); end end end end toc %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% function MSE=errorTM(tx, ty, mx, my) % Calculate the mean square error % The tracking result is compared to the manually selected result % input: tx, ty -- column vector with n tracked coordinate % mx, my -- column vector with n manually selected coordinate % output: MSE -- error per frame % the number of points q=size(tx, 1);


% Distance between the tracked and manually selected points. dx = tx-mx; dy = ty-my; SE = (dx'*dx+dy'*dy); MSE= SE/q; %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% function [mx, my]=manualTrack(I, n) % Manually track for error measurement and show the result % input: I -- input image % n - number of interpolated points % output: mx -- a column vector of x coordinates of % manually clicked points. % my -- a column vector of y coordinates of % manually clicked points. %I = imread(I); [BW, ax, ay] =cannyEdge(rgb2gray(I), 'canny', 0.4); figure(2); imshow(BW); [mx, my] = ginput(n); hold on; for i=1:n plot(mx(i), my(i), 'r*'); end %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% function [mmx, mmy]=manualTrackCurve(I, n, d, k) % Manually track for error measurement and show the result % input: I -- input image % n -- number of interpolated points % d -- degree % k -- interval of sample points % output: mmx -- a column vector of x coordinates of uniformly % distributed points based on manual click of knots. % mmy -- a column vector of y coordinates of uniformly % distributed points based on manual click of knots. %I = imread(I); [BW, ax, ay] =cannyEdge(rgb2gray(I), 'canny', 0.4); figure(2); imshow(BW); [mx, my] = ginput(n); hold on; m = size(mx,1); % number of interpolated points % calculate knot values using the Chord Length parameterization method P = [mx, my]; v = chordLengthPara(P,d); % Find control points given the knot vector, interpolated points and the degree


CP = find_Control_Points_byCL(v, P, d); for i= 1:m plot(P(i,1), P(i,2), 'b*'); end % Sample, draw and save uniformly distributed points r = []; for i = 0:k u = i*1/k; [BF, DB1] = BleFuncGivenV2(u, v, d); r = [r; BF*CP]; plot(r(i+1, 1), r(i+1,2), 'y.:'); end r = round(r); mmx = r(:,1); mmy = r(:,2); hold off %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% function [E,finY,finX]=objFunc(tx,ty,tX,tY,nrT,In,threshIn,threshT,d,p) % implement objective function % Input: T -- template % In -- frame image % d -- the range where distance will be calculated % p -- smooth factor in objective function % threshIn/T -- cannyedge threshold for Input image/Template % tx, ty -- coordinates of the edges of the template points % tX, tY -- normalized directional vectors of the nearest % edge of the template points % Output: E -- minimum energy % finY -- y coordinates of the final findings % finX -- x coordinates of the final findings tic; [dMap,ix,iy,iX,iY,nrI,Ie]=disMap(In,threshIn,d); t1=toc; tic; [m,n]=size(Ie); %move template over frame image and check the minimum E. minE=inf; % translate so that the first point on the origin Teyy=ty-ty(1); Texy=tx-tx(1); % Rotate the template for r = -1/12*pi:1/24*pi:1/12*pi TeyY=Teyy*cos(r)-Texy*sin(r); TexY=Texy*cos(r)+Teyy*sin(r); TXX=tX*cos(r)+tY*sin(r); TYY=tY*cos(r)-tX*sin(r); %scale based on the 1st edge point for sc=1/2:1/8:2%scale based on the 1st edge point nTey=round(sc*TeyY); nTex=round(sc*TexY);

        % Translate the template over the frame image
        for index=1:nrI
            R=iy(index);
            C=ix(index);
            sTey=nTey+R;
            sTex=nTex+C;
            if all(max(sTey)<m & min(sTey)>0 & max(sTex)<n & min(sTex)>1)  % all points must lie within the image
                E=0;
                % Calculate the energy
                for a=1:nrT
                    % the original location is used to find the directional vector
                    tr=sTey(a);
                    tc=sTex(a);
                    % cosA renamed from 'cos' so the built-in cos used in the rotation above is not shadowed
                    cosA=abs(iX(tr,tc)*TXX(a)+iY(tr,tc)*TYY(a));
                    E=E+1-exp(-p*dMap(tr,tc))*cosA;
                end
                EE=E/nrT;  % objective function energy
                if EE < minE  % keep the minimum E
                    minE = EE;
                    finY=sTey;
                    finX=sTex;
                    sss=sc;
                    sssr=r;
                end
            end
        end
    end
end
E=minE;  % return the minimum energy found
figure(5)
imshow(Ie);
hold on
for q=1:nrT
    plot(finX(q),finY(q),'r*');
end
hold off
t2=toc;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function r = runFindCurve(I,n,d,k)
% Create a B-spline curve, sample it and return uniformly distributed points
% input:  I -- image
%         n -- number of interpolated points
%         d -- degree
%         k -- interval of sample points
% output: r -- uniformly distributed points sampled on the curve
imshow(I);
hold on;
P = ginput(n);
% Calculate knot values using the Chord Length parameterization method
v = chordLengthPara(P,d);
CP = find_Control_Points_byCL(v, P, d);

% Draw the selected points
for i= 1:n
    plot(P(i,1), P(i,2), 'b*');
end
% Draw the control points
for i= 1:length(CP)
    plot(CP(i,1), CP(i,2), 'r*');
end
% Save and draw uniformly distributed points
r = [];
for i = 0:k
    u = i*1/k;
    [BF, DB1] = BleFuncGivenV2(u, v, d);
    r = [r; BF*CP];
    plot(r(i+1, 1), r(i+1,2), 'y.:');
end
r = round(r);
hold off

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [tx,ty,tX,tY,nrT,P]=tempPoints(T,thresh,n)
% Define the sample points of the template manually
% input:  T      -- template image
%         thresh -- canny edge threshold
%         n      -- number of input points
% output: tx, ty -- coordinates of the input points
%         tX, tY -- normalized directional vectors on the rounded
%                   coordinates of the points
%         nrT    -- number of input points
% T=imread(T);
T=rgb2gray(T);
% argument order as in the other cannyEdge calls
[Te,tx1,ty1] = cannyEdge(T,'canny',thresh);
[TX, TY] = normaliz2(tx1,ty1);
P = selectPoints(Te, n);
tx=P(:,1);
ty=P(:,2);
nrT=length(tx);
tX=zeros(nrT,1);
tY=tX;
for i=1:nrT
    tX(i,1)=TX(ty(i,1),tx(i,1));
    tY(i,1)=TY(ty(i,1),tx(i,1));
end

function [P] = selectPoints(I,n)
% input:  I -- image
%         n -- number of interpolated points
% output: P -- input points
imshow(I);
hold on;
P = ginput(n);
P = round(P);
for i= 1:n

    plot(P(i,1), P(i,2), 'b*');
end
hold off

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [tx,ty,tX,tY,nrT,P]=tempPointsCurve(T, thresh, n, d, k)
% Select points for finding the best match, given the degree of the curve,
% the number of knots and the number of samples on the curve
% input:  T      -- template image
%         thresh -- canny edge threshold
%         n      -- number of interpolated points
%         d      -- degree
%         k      -- number of sample points for finding the best match
% output: tx, ty -- coordinates of the input points
%         tX, tY -- normalized directional vectors on the rounded
%                   coordinates of the points
%         nrT    -- number of input sample points
%         P      -- sampled points on the curve used to find the best match
%T = imread(T);
T = rgb2gray(T);
% Obtain a binary edge map and the gradient magnitude of each edge point
[Te, tx1, ty1] = cannyEdge(T,thresh);
[TX, TY] = normaliz2(tx1,ty1);
% Create a curve and get the points for finding the best match
[P] = runFindCurve(Te, n, d, k);
tx=P(:,1);
ty=P(:,2);
% Save the normalized directional vectors
nrT=length(tx);
tX=zeros(nrT,1);
tY=tX;
for i=1:nrT
    tX(i,1)=TX(ty(i,1),tx(i,1));
    tY(i,1)=TY(ty(i,1),tx(i,1));
end
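
The functions in this section are typically combined along the following lines. The snippet is only an illustrative sketch: the file names, thresholds and point counts are assumed placeholders, not the settings used in the experiments.

% Hypothetical driver for the template matching functions above
T  = imread('template.jpg');    % billboard template image (assumed file name)
In = imread('frame001.jpg');    % one frame of the handball sequence (assumed file name)
% sample 10 points on the template edge map together with their edge directions
[tx,ty,tX,tY,nrT] = tempPoints(T, 0.4, 10);
% search rotation, scale and translation of the template over the frame;
% objFunc builds the frame's distance map internally via disMap
[E,finY,finX] = objFunc(tx,ty,tX,tY,nrT, In, 0.2, 0.4, 20, 0.1);
% click the same 10 points manually and compare with the tracked result
[mx,my] = manualTrack(In, 10);
MSE = errorTM(finX(:), finY(:), mx, my);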

3. Functions used for condensation

function num = binarySubdivide(cumDist, r, n)
% Find the index of the smallest element of cumDist that is > r by
% binary subdivision search
% Binary subdivision is a fast way to find where the given number fits
% The code is given in ITU's
% input:  cumDist -- array to be searched (cumulative weights)
%         r       -- number whose position in cumDist is searched
%         n       -- number of elements in cumDist
% output: num     -- index of the smallest element that satisfies > r
high = n;
low = 0;
while (high > (low+1))

    middle = round((high + low)/2);
    if r > cumDist(middle)
        low = middle;
    else
        high = middle;
    end
end
num = low + 1;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [Temp1, Temp2, Temp3, Temp4] = create3temp(I, d1, d2, d3, d4)
% Create a template with 3 curves (Temp4 and d4 are used for
% error measurement)
% input:  I      -- billboard image from which the template is created
%         d1..d4 -- the degrees of the b-spline curves
% output: Temp1..Temp4 -- curve segments, each with
%         Temp.CP -- control points
%         Temp.v  -- knot values used to get the control points
%         Temp.d  -- the b-spline curve's degree
imshow(I);
hold on;
Temp1 = interpolate(d1);
Temp2 = interpolate(d2);
Temp3 = interpolate(d3);
Temp4 = interpolate(d4);
hold off

function Temp = interpolate(d)
% Create an interpolated curve with the given degree d
% input:  d    -- degree
% output: Temp -- template containing its knot vector, control points and degree
P = ginput;
% Calculate knot values using the Chord Length parameterization method
v = chordLengthPara(P,d);
% Calculate the control points
CP = find_Control_Points_byCL(v,P, d);
% Draw the curve
x = [];
y = [];
for u = 0:1/20:1
    [BF, DB1] = BleFuncGivenV2(u, v, d);
    r = BF*CP;
    x = [x, r(1)];
    y = [y, r(2)];
end
line(x, y,'Color','r');
Temp.v = v;
Temp.CP = CP;
Temp.d = d;
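
As a small sanity check of binarySubdivide defined above (an assumed workspace snippet, not part of the tracker), the index it returns should agree with a linear search over the cumulative weights:

cumDist = cumsum([0.1 0.4 0.2 0.3]);    % cumulative weights = [0.1 0.5 0.7 1.0]
r    = 0.65;                            % a uniform random draw in [0,1)
idx1 = binarySubdivide(cumDist, r, length(cumDist));
idx2 = find(cumDist > r, 1, 'first');   % linear-search equivalent
% both give 3, i.e. the third particle would be selected during resampling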

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [newR,dRR]=getCurveAndNormal3(Q,Temp)
% Construct the predicted curve from the predicted Q and get the normals
% of the curve
% input:  Q    -- the shape-space vector of the new curve
%         Temp -- the control point info
% output: newR -- the new curve
%         dRR  -- the normals of the curve (not normalized)
% 1st step: get the new control points from the given Q
% get the W matrix from the template control points
Nc = length(Temp.CP(:,1));
One = ones(Nc,1);
Zero = zeros(Nc,1);
W = [One Zero Temp.CP(:,1) Zero Zero Temp.CP(:,2); ...
     Zero One Zero Temp.CP(:,2) Temp.CP(:,1) Zero];
% get the new control points from Q, W and Temp
nCP=W*Q+[Temp.CP(:,1);Temp.CP(:,2)];
newCP=[nCP(1:Nc),nCP(Nc+1:2*Nc)];
% 2nd step: get the coordinates of certain points on the curve
% and the corresponding normal vectors
newR=[];
dR=[];
for u = 0:1/20:1  % the interval can be changed
    % get the blending functions and the first derivatives of the curve
    [BF, DB1] = BleFuncGivenV2(u, Temp.v, Temp.d);
    r = BF*newCP;
    newR=[newR;r];
    dr=DB1*newCP(1:(Nc-1),:);
    dR=[dR;dr];
end
% when the tangent vector (the 1st derivative) is (dx,dy),
% the normal vector is (-dy, dx)
dRR=[-dR(:,2),dR(:,1)];

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [weight, c,QM]=Meas3(newQ, frame, r,thresh,Temp1,Temp2, Temp3,D)
% Measurement step
% Measure the likelihood of all the predicted particles
% input:  newQ   -- the predicted particles
%         frame  -- the current frame
%         r      -- st = sqrt(rM) = 0.3
%         thresh -- canny edge threshold
%         Temp1..Temp3 -- the control points and the degree of the
%                   b-spline curve segments of the template
%         D      -- the length along the curve normal within which
%                   we need to check for edge points
% output: weight -- new normalized weight of each particle
%         c      -- new normalized cumulated weight
%         QM     -- the best matching curve (the average), using edge info

% uses equation 45 in the thesis report
[M N]=size(frame(:,:,1));  % size of the frame image
n=length(newQ(1,:));
c=ones(1,n);
% get the edges of the frame
[Ie,ix1,iy1] = cannyEdge(rgb2gray(frame),'canny', thresh);
% measure all three curve segments
for i=1:n
    L1(:,i) = Poisson(newQ(:,i), Temp1, Ie, r, M, N, D);
    L2(:,i) = Poisson(newQ(:,i), Temp2, Ie, r, M, N, D);
    L3(:,i) = Poisson(newQ(:,i), Temp3, Ie, r, M, N, D);
end
LL=[L1',L2',L3'];
LL=LL';
weight=exp(-0.5*1/r*sqrt(sum(LL.^2)));
% combine the 3 weights for each sample and normalize
sumW = sum(weight);
if sumW ~= 0
    weight = (weight/sumW);
end
c=cumsum(weight);
QM=newQ*weight';  % the tracking result -- the weighted average

function L = Poisson(newQ, Temp, Ie, r, M, N, D)
% Measurement by the observation model
% input:  newQ -- new Q vector
%         Temp -- template including the knot vector, control points and
%                 degree of the curve
%         Ie   -- binary edge map of the frame
%         r    -- st = sqrt(rM) = 0.3
%         M, N -- the size of the frame Ie, rows and columns, respectively
%         D    -- the length along the curve normal within which
%                 we need to check for edge points
% output: L    -- the distance to the nearest edge from the sample points
% reconstruct the curve and get its normals
[newR, dRR]=getCurveAndNormal3(newQ,Temp);
% get the normalized normals of the curve
[X,Y] = normaliz2(dRR(:,1),dRR(:,2));
nr=length(Y);
L=40*ones(nr,1);  % initialize the nearest edge distance as 40 (250 is too big)
% find the nearest edge
for j=1:nr-1
    for z=-D:1:D
        ro=round(newR(j,2)+z*Y(j));
        c=round(newR(j,1)+z*X(j));
        if ro>1 & ro<M & c>1 & c<N  % do not include the border

            if Ie(ro,c)==1
                if L(j)>abs(z)
                    L(j)=abs(z);
                end
            end
        end
    end
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function pf1=pfIn(range,n)
% Initialize the particle filter - create n particles (t=0)
% Each particle is a shape-space vector
% input:  range -- 12x1 vector indicating the initial guess of the range of
%                  the six parameters of Q:
%                  [min. of Q(1), max. of Q(1), min. of Q(2), ..., max. of Q(6)]
%         n     -- number of samples
% output: pf1   -- initial particle set
%                  pf1.q1 ~ pf1.q6 are the values of the Q space vector
%                  pf1.w is the weight (probability)
%                  pf1.c is the cumulated weight
rand('state', sum(100*clock));
pf1.n=n;
pf1.w = zeros(n,1);
pf1.c = zeros(n,1);
for i=1:n
    % define the six values of the Q space vector
    pf1.q1(i)=rand*(range(2)-range(1))+range(1);
    pf1.q2(i)=rand*(range(4)-range(3))+range(3);
    pf1.q3(i)=rand*(range(6)-range(5))+range(5);
    pf1.q4(i)=rand*(range(8)-range(7))+range(7);   % offset corrected to range(7)
    pf1.q5(i)=rand*(range(10)-range(9))+range(9);
    pf1.q6(i)=rand*(range(12)-range(11))+range(11);
end
pf1.bar=zeros(6,1);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function pp=plotAll3(Q,file,Temp4,Temp2, Temp3)
% Plot the frame image with the tracking result
% input:  Q    -- the Q space vector
%         file -- the frame image
%         Temp -- template with the knot vector, the control points
%                 and the degree of the curve
imshow(file);
hold on
pp  = plotEach(Q,file,Temp4);
pp2 = plotEach(Q,file,Temp2);
pp3 = plotEach(Q,file,Temp3);
hold off

function pp=plotEach(Q, file, Temp)
pp=[];

Nc=length(Temp.CP(:,1));
One = ones(Nc,1);
Zero = zeros(Nc,1);
W = [One Zero Temp.CP(:,1) Zero Zero Temp.CP(:,2); ...
     Zero One Zero Temp.CP(:,2) Temp.CP(:,1) Zero];
for e=1:size(Q,2)
    % get the new control points from Q, W and Temp.CP
    nCP=W*Q(:,e)+[Temp.CP(:,1);Temp.CP(:,2)];
    newCP=[nCP(1:Nc),nCP(Nc+1:2*Nc)];
    pp=[pp,newCP];
    % 2nd step: get the coordinates of certain points on the curve
    % In order to use the line function, the x and y coordinates are
    % saved in separate vectors
    x = [];
    y = [];
    for u = 0:1/20:1  % the interval can be changed
        [BF, DB1] = BleFuncGivenV2(u, Temp.v, Temp.d);
        r = BF*newCP;
        x = [x, r(1)];
        y = [y, r(2)];
    end
    line(x, y);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function newQ=Pr1(PF,Qplus1,A0,A1,B,n,QM)
% Prediction step
% Use the dynamic model to predict the resampled particles' Qs at time t
% dynamic model: Q(n+2) = A0*Q(n) + A1*Q(n+1) + (I-A0-A1)*Qbar + B*W(n), or
%                Xt = xBar + (Xtminus - xBar)*A + B*Noise
% input:  PF     -- resampled particle set PFn
%         Qplus1 -- resampled particle set PFnplus1
%         A0     -- 6x6 matrix
%         A1     -- 6x6 matrix
%         B      -- 6x6 matrix
%         n      -- number of particles
%         QM     -- result from the previous time step t-1
% output: newQ   -- the predicted states at time t
newQ=[];
Q=[[PF.q1]; [PF.q2]; [PF.q3]; [PF.q4]; [PF.q5]; [PF.q6]];
I=1*eye(6);
for i=1:n
    noise=randn(6,1);
    newq=A0*Q(:,i)+A1*Qplus1(:,i)+(I-A0-A1)*QM+B*noise;
    newQ=[newQ, newq];
end

function mov = readfile(filename)
% Read an avi sequence
% input:  filename -- the name of the avi file (including '.avi')
% output: mov      -- avi file
mov=aviread(filename);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function Qplus1=reS(pfMinus)
% Construct a new sample set at time t from the old one at time t-1.
% First generate a uniformly distributed random number r,
% then find, by binary subdivision, the smallest j for which ct-1(j)>=r,
% and finally set st(n)=st-1(j)
% input:  pfMinus -- sample set at t-1
% output: Qplus1  -- the resampled states [q1;q2;q3;q4;q5;q6] at time t
n=pfMinus.n;
Qplus1=zeros(6,n);
for i=1:n
    r=rand(1);
    smallest = binarySubdivide(pfMinus.c, r, n);
    Qplus1(1,i)=pfMinus.q1(smallest);
    Qplus1(2,i)=pfMinus.q2(smallest);
    Qplus1(3,i)=pfMinus.q3(smallest);
    Qplus1(4,i)=pfMinus.q4(smallest);
    Qplus1(5,i)=pfMinus.q5(smallest);
    Qplus1(6,i)=pfMinus.q6(smallest);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [sS, ppp]=runPFQ3curves(mov,range,n,A0,A1,B,r,thresh,Temp1, Temp2, Temp3,Temp4,D)
% Run the particle filter to track the target object.
% This code can only track one target at a time.
% input:  mov    -- avi video sequence
%         range  -- 12x1 vector indicating the initial guess of the range
%                   of the six parameters of Q
%         n      -- number of particles
%         A0, A1 -- dynamic model -- reflect the drift
%         B      -- dynamic model -- reflects the diffusion by noise
%         r      -- as in equation 45 in the thesis report
%         thresh -- the canny edge threshold
%         Temp.CP -- the control points of the template
%         Temp.v  -- the normalized knot vector of the template curve
%         Temp.d  -- degree of the B-splines
%         D      -- the length along the curve normal within which
%                   we need to check for edge points
sS=[];   % records the Q vector, which gives the transformation info for each frame
ppp=[];  % used to record the control points for the error measurement later
nrF=length(mov);  % number of frames in the clip

% generate n particles within the given range
pf0=pfIn(range,n);  % get the initial particles at t=0
Q0=[[pf0.q1]; [pf0.q2]; [pf0.q3]; [pf0.q4]; [pf0.q5]; [pf0.q6]];
% Initialization -- get the sample sets with corresponding weights for the
% 1st and 2nd frame
pp=plotAll3(Q0,mov(1).cdata, Temp4, Temp2, Temp3);  % plot all the particles
pause(1/50);
% create a new avi file to record the tracking result
tracker = avifile('reelg03.avi');
tracker.Compression='Indeo5';  % the compression format
tracker.fps=25;                % frames per second
% for the 1st frame: measure, plot and get the weights of the particles
[weight1, c1,QM]=Meas3(Q0, mov(1).cdata, r,thresh,Temp1,Temp2, Temp3,D);
pp=plotAll3(QM,mov(1).cdata,Temp4, Temp2, Temp3);
pause(1/50);
frame = getframe(gca);
tracker = addframe(tracker, frame);  % write the new frame into the file
PF=upD1(Q0,weight1,c1, QM);  % update the particles
% for the 2nd frame: measure, plot and get the weights of the particles
%Qplus=reS(PF);  % since we do not predict at t=2, this resampling is not very meaningful
[weight2, c2,QM]=Meas3(Q0, mov(2).cdata, r,thresh,Temp1, Temp2, Temp3,D);
pp=plotAll3(QM,mov(2).cdata,Temp4, Temp2, Temp3);
frame = getframe(gca);
tracker = addframe(tracker, frame);
pause(1/50);
PFplus1=upD1(Q0,weight2,c2, QM);
% now we have the initialized particles at t=1 and t=2, i.e. PF and PFplus1
%...............................................................
% run the condensation algorithm from the 3rd frame
for i=3:50
    % step 1: resample the particle set
    Qplus1=reS(PFplus1);
    % step 2: prediction
    newQ=Pr1(PF,Qplus1,A0,A1,B,n,QM);
    pp=plotAll3(newQ,mov(i).cdata,Temp4, Temp2, Temp3);
    pause(1/50);

    % step 3: observation model -- measurement
    [weight, c,QM]=Meas3(newQ, mov(i).cdata, r,thresh,Temp1, Temp2, Temp3,D);
    sS=[sS,QM];
    % show the result
    pp=plotAll3(QM,mov(i).cdata,Temp4, Temp2, Temp3);
    pause(1/50);
    frame = getframe(gca);
    tracker = addframe(tracker, frame);
    ppp=[ppp,pp];
    % update the particles at t=t and t=t+1
    [PF,PFplus1]=upD(newQ,PFplus1,weight,c, QM);
end
tracker=close(tracker);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [PF,PFplus1]=upD(newQ,PFplus1,weight,c,QM)
% Update the particles at time t+1
% input:  newQ    -- result from the prediction step at time t
%         PFplus1 -- particles at time t-1
%         weight  -- weights from the measurement
%         c       -- cumulated weights from the measurement
%         QM      -- tracking result from the measurement
PF=PFplus1;
N=length(newQ(1,:));
PFplus1.n=N;
PFplus1.w=weight;
PFplus1.c=c;
for i=1:N
    PFplus1.q1(i)=newQ(1,i);
    PFplus1.q2(i)=newQ(2,i);
    PFplus1.q3(i)=newQ(3,i);
    PFplus1.q4(i)=newQ(4,i);
    PFplus1.q5(i)=newQ(5,i);
    PFplus1.q6(i)=newQ(6,i);
end
PFplus1.bar=QM;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function pf=upD1(Q0,weight,c,QM)
% Update the particles at time t=1 and t=2
% input:  Q0     -- the particles generated by pfIn
%         weight -- the weights from the measurement
%         c      -- the cumulated weights from the measurement
%         QM     -- tracking result from the measurement
N=length(Q0(1,:));
pf.n=N;
pf.w=weight;
pf.c=c;
for i=1:N
    pf.q1(i)=Q0(1,i);

    pf.q2(i)=Q0(2,i);
    pf.q3(i)=Q0(3,i);
    pf.q4(i)=Q0(4,i);
    pf.q5(i)=Q0(5,i);
    pf.q6(i)=Q0(6,i);
end
pf.bar=QM;
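
The condensation functions above fit together roughly as in the sketch below. The file name, particle count, parameter ranges and the placeholder dynamic model are assumptions made for illustration; in the experiments A0, A1 and B are estimated from training data with the functions of the next section.

% Hypothetical driver for the condensation tracker (assumed placeholder values)
mov = readfile('handball.avi');               % load the sequence (assumed file name)
% click the three template curves (plus a fourth for error measurement) on the first frame
[Temp1,Temp2,Temp3,Temp4] = create3temp(mov(1).cdata, 2, 2, 2, 2);
range = [-20 20 -20 20 -0.2 0.2 -0.2 0.2 -0.1 0.1 -0.1 0.1];  % initial guess for Q (assumed)
A0 = zeros(6); A1 = eye(6); B = 0.1*eye(6);   % placeholder dynamic model
n  = 100;                                     % number of particles (assumed)
[sS, ppp] = runPFQ3curves(mov, range, n, A0, A1, B, 0.3, 0.3, ...
                          Temp1, Temp2, Temp3, Temp4, 40);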

4. Functions used for error measurement of condensation

function [MSE, MSEperF, ST]=errorCP(cp1,cp2)
% Calculate the mean squared error
% The tracking result is compared to the manually selected result
% input:  cp1     -- result tracked by the tracker
%         cp2     -- manually tracked result
% output: MSE     -- average error of a sequence
%         MSEperF -- mean square error per frame
%         ST      -- standard deviation of MSEperF
% the length of the sequence is 48 since we do not want to take into
% account the first 2 frames, which are not really tracked by the
% particle filter
MSEperF=zeros(1,48);
for i=1:48
    tt=(cp1(:,(2*i-1):(2*i))-cp2(:,(2*i-1):(2*i))).*(cp1(:,(2*i-1):(2*i))-cp2(:,(2*i-1):(2*i)));
    MSEperF(1,i)=sum(sum(tt,2))/2;
end
MSE = mean(MSEperF);
ST = std(MSEperF);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [A0,A1,B] = estimateAB(setQ)
% Estimate A0, A1, and B from the history (Q1, Q2, ..., Qt)
% input:  setQ = [Q1, Q2, ..., Qn] -- 6xn matrix
% output: A0, A1, B -- coefficients of the dynamic model
[h,m] = size(setQ);
S = zeros(6,6,3*3);  % S = [S(0,0), S(0,1), S(0,2), ..., S(2,2)]
for i = 0:2
    for j=0:2
        sumQ= zeros(6);
        for k = 1:m-2
            tempQ = setQ(1:h,k+i)*setQ(1:h,k+j)';
            sumQ =tempQ+sumQ;
        end
        S(1:6,1:6,i*3+j+1) = sumQ;
    end
end
A0 = zeros(6);
A1 = zeros(6);
if det(S(1:6,1:6,4))~=0 & det(S(1:6,1:6,5)) ~=0
    S0 = S(1:6,1:6,1)*inv(S(1:6,1:6,4))-S(1:6,1:6,2)*inv(S(1:6,1:6,5));
    if det(S0)~=0
        A0 = (S(1:6,1:6,7)*inv(S(1:6,1:6,4))-S(1:6,1:6,8)*inv(S(1:6,1:6,5)))*inv(S0)

    end
end
if det(S(1:6,1:6,1))~=0 & det(S(1:6,1:6,2)) ~=0
    S1 = S(1:6,1:6,4)*inv(S(1:6,1:6,1))-S(1:6,1:6,5)*inv(S(1:6,1:6,2));
    if det(S1)~=0
        A1 = (S(1:6,1:6,7)*inv(S(1:6,1:6,1))-S(1:6,1:6,8)*inv(S(1:6,1:6,2)))*inv(S1)
    end
end
Z = S(1:6,1:6,9)+A1*S(1:6,1:6,5)*A1'+A0*S(1:6,1:6,1)*A0'-S(1:6,1:6,8)*A1' ...
    -S(1:6,1:6,7)*A0'+A1*S(1:6,1:6,4)*A0'-A1*S(1:6,1:6,6)-A0*S(1:6,1:6,3) ...
    +A0*S(1:6,1:6,2)*A1';
C = Z/(m-2);
B = sqrtm(C)

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [min, max]=EstMean(cp1,cp2, trackmse)
% Quantify the manually tracked error and the confidence interval
% The standard deviation is assumed to come from two manual clickings
% Suppose a normal distribution
% When the distribution is not known (Tchebycheff inequality):
%   zu = 3.16, u = 0.90
%   zu = 4.47, u = 0.95
% input:  cp1      -- manually tracked result 1 (golden standard)
%         cp2      -- manually tracked result 2
%         trackmse -- mean error of the tracked sequence
% output: min      -- minimum of the estimated mean
%         max      -- maximum of the estimated mean
zu = 4.47;
n = 18;
MSEperF=zeros(1,n);
for i=3:20
    tt=(cp1(:,(2*i-1):(2*i))-cp2(:,(2*i-1):(2*i))).*(cp1(:,(2*i-1):(2*i))-cp2(:,(2*i-1):(2*i)));
    MSEperF(1,i-2)=sum(sum(tt,2))/2;  % store at i-2 so only the 18 measured frames enter the statistics
end
s = std(MSEperF);
min = trackmse-zu*s/sqrt(n);
max = trackmse+zu*s/sqrt(n);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [min, max]=EstMean2(cp1,cp2)
% Quantify the manually tracked error and its confidence interval
% Suppose one clicking result is the golden standard.
% The standard deviation is assumed to come from two manual clickings.
% The mean error is taken from the same set.
% The output is the confidence interval of the error
% assume zu = 2, meaning u = 0.975
% When the distribution is not known (Tchebycheff inequality):
%   zu = 3.16, u = 0.90

%   zu = 4.47, u = 0.95
% input:  cp1 -- manually tracked result 1 (golden standard)
%         cp2 -- manually tracked result 2
%         the input can also be a sequence tracked by the tracker
% output: min -- minimum of the estimated mean
%         max -- maximum of the estimated mean
zu = 4.47;
n = 18;
MSEperF=zeros(1,n);
for i=3:20
    tt=(cp1(:,(2*i-1):(2*i))-cp2(:,(2*i-1):(2*i))).*(cp1(:,(2*i-1):(2*i))-cp2(:,(2*i-1):(2*i)));
    MSEperF(1,i-2)=sum(sum(tt,2))/2;  % store at i-2 so only the 18 measured frames enter the statistics
end
MSE = mean(MSEperF);
s = std(MSEperF);
min = MSE-zu*s/sqrt(n);  % lower bound (the original listing repeated the upper bound here)
max = MSE+zu*s/sqrt(n);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [CPP, Q, Cxy]=findQ(filename,n1,n2,CP)
% Use manual tracking to get the Q and the control points
% Used only in the experiments to estimate A0, A1, B and to do the error
% measurement
% input:  filename -- the video sequence
%         n1, n2   -- the starting and the ending frame of the sequence
%                     that we want to track manually
%         CP       -- the template control points
% output: CPP      -- the manually tracked control points for each frame
%         Q        -- Q for each frame
%         Cxy      -- the coordinates of the center of the control points
%                     for each frame
CPP=[];
Q=[];
Cxy=[];  % the center of the template
for i=n1:n2
    imshow(filename(i).cdata);
    P=ginput;
    hold on
    v = chordLengthPara(P,2);
    cp = find_Control_Points_byCL(v,P, 2);  % get the control points for one frame
    CPP=[CPP, cp];   % record the control points for frames n1..n2
    cxy=sum(cp)/4;   % get the center of the template
    Cxy=[Cxy;cxy];   % record the center points
    % get Q
    % first get the blending functions
    n = length(P);
    B = zeros(n);
    B(1,1) = 1;      % the first control point = first knot

    B(end, end) = 1; % the last control point = last knot
    for j = 2:n-1
        [BF, DB1]= BleFuncGivenVp(v(2+j-1), v, n, 2);
        B(j,:) = BF;
    end
    % then calculate W and H, and then M
    Nc = size(cp,1);
    One = ones(Nc,1);
    Zero = zeros(Nc,1);
    W = [One Zero CP(:,1) Zero Zero CP(:,2); ...
         Zero One Zero CP(:,2) CP(:,1) Zero];
    sumBs = B'*B;
    H = [sumBs, zeros(Nc, Nc); zeros(Nc, Nc), sumBs];
    M = inv(W'*H*W)*W'*H;
    % calculate Q
    q = M*[cp(:,1)-CP(:,1); cp(:,2)-CP(:,2)];
    Q=[Q,q];
    i  % display the frame counter as progress
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function plotMSEperF(MSEperF)
% Plot the mean square error per frame on a logarithmic scale
semilogy(MSEperF,'b-');
hold off

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function CP=temCPforError(im)
% Used only for the experiments
% Get the template control point coordinates
% Use CP to calculate Q for the training data in order to calculate A and B
% input:  im -- template image
% output: CP -- the template curve's control points (the curve used to do
%               the error measurement)
% the control points are translated so that the 1st point is at the
% coordinate origin
imshow(im);
hold on
axis auto
P=ginput;
v = chordLengthPara(P,2);
m = size(P,1);
for i= 1:m
    plot(P(i,1), P(i,2), 'y*');
end
CP1 = find_Control_Points_byCL(v,P, 2);
n=size(CP1,1);
CP=zeros(n,2);
for j=1:n
    CP(j,1)=CP1(j,1)-CP1(1,1);
    CP(j,2)=CP1(j,2)-CP1(1,2);
end

for k=1:n
    plot(CP1(k,1), CP1(k,2), 'r*');
end
x = [];
y = [];
for u = 0:1/20:1
    [BF, DB1] = BleFuncGivenV2(u, v, 2);
    r = BF*CP1;
    x = [x, r(1)];
    y = [y, r(2)];
end
line(x, y);
hold off
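
A hypothetical use of the error-measurement functions is sketched below. The file name, frame numbers and the column offset into the manually tracked control points are illustrative assumptions (errorCP expects two columns of control points per frame and skips the first two frames).

% Hypothetical error-measurement driver (assumed file name and frame range)
CP = temCPforError(imread('template.jpg'));   % template control points, first point at the origin
[CPP, Qtrain, Cxy] = findQ(mov, 1, 50, CP);   % manually track a training sequence to get Q per frame
[A0, A1, B] = estimateAB(Qtrain);             % estimate the dynamic model coefficients
% compare the tracker's control points (ppp from runPFQ3curves) with the manual ones,
% skipping the first two frames (assumed column alignment)
[MSE, MSEperF, ST] = errorCP(ppp, CPP(:,5:end));
plotMSEperF(MSEperF);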

References

Amit, Y., U. Grenander and M. Piccioni, Structural image restoration through deformable template, Journal of the American Statistical Association, Vol. 86, No. 414, pp. 376-387, June 1991.

Baker, Hearn, Computer Graphics with OpenGL, third edition, Pearson Prentice Hall, 2004.

Blake, Andrew, Michael Isard, and David Reynard, Learning to track the visual motion of contours, Artificial Intelligence, 78, p.179-212, 1995.

Buss, Samuel R., 3-D Computer Graphics, Cambridge University Press, 2003.

Canny, John, Finding Edges and Lines in Images, Massachusetts Institute of Technology, Cambridge, MA, 1983.

de Boor, Carl, B(asic)-spline Basics, in Carl de Boor (Ed.), Extension of B-spline Curve Algorithms to Surfaces (Course #5, ACM SIGGRAPH 86, pp. 18-22), Dallas, TX: ACM, 1986.

Duda, Richard O., Peter E. Hart, and David G. Stork, Maximum-likelihood and Bayesian parameter estimation, Pattern Classification, 2nd edition, Wiley-Interscience, p.84-90, 2001.

Fieguth, P. and D. Terzopoulos, Color-Based Tracking of Heads and Other Mobile Objects at Video Frame Rates, IEEE Conference on Computer Vision and Pattern Recognition, Puerto Rico, p.21-27, 1997.

Forsyth, D. and J. Ponce, Computer Vision: A Modern Approach, chapter 17, Prentice Hall, 2003.

Hartley, Richard and Andrew Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2nd edition, 2003.

Gonzalez, Rafael C. and Richard E. Woods, Digital Image Processing, chapter 3, Addison Wesley, second edition, 2001.

Gonzalez, Rafael C. and Richard E. Woods, Digital Image Processing, chapter 5, Addison Wesley, second edition, 2001.

Gonzalez, Rafael C. and Richard E. Woods, Digital Image Processing, chapter 6, Addison Wesley, second edition, 2001.

Isard, Michael and Andrew Blake, CONDENSATION - Conditional Density Propagation for Visual Tracking, International Journal of Computer Vision, 29(1), p.5-28, 1998.

Isard, Michael and Andrew Blake, Contour tracking by stochastic propagation of conditional density, European Conference on Computer Vision, p.343-356, 1996.

Jain, Anil K. and Yu Zhong, Object localization using color, texture and shape, Pattern Recognition, 33, p.671-684, 2000.

Jain, Anil K., Yu Zhong, and Marie-Pierre Dubuisson-Jolly, Deformable template models: A review, Signal Processing, 71, p.109-129, 1998.

Jain, Anil K., Yu Zhong, and Marie-Pierre Dubuisson-Jolly, Object tracking using deformable templates, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, No. 5, p.544-549, May 2000.

Jain, Anil K., Yu Zhong and Sridhar Lakshmanan, Object Matching Using Deformable Templates, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1996.

Kervrann, Charles and Fabrice Heitz, A hierarchical statistical framework for the segmentation of deformable objects in image sequences, Proceedings of the 1994 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, p.724-728, 1994.

Koller, D., J. Weber, T. Huang, J. Malik, G. Ogasawara, B. Rao, and S. Russell, Towards robust automatic traffic scene analysis in real-time, Proceedings of the 33rd Conference on Decision and Control, Lake Buena Vista, FL, December, p.3776-3781, 1994.

Laptev, Ivan and Tony Lindeberg, Interest point detection and scale selection in space-time, in Proceedings of ICCV 2003, Nice, France, pp. 432-439, 2003.

Lipton, Alan J., Hironobu Fujiyoshi, and Raju S. Patil, Moving target classification and tracking from real-time video, in Workshop on Applications of Computer Vision, Princeton, NJ, October, 1998.

Lu, Guojun, Communication and Computing for Distributed Multimedia Systems, chapter 9, Artech House Publications, 1997.

Papoulis, Athanasios and S. Unnikrishna Pillai, Probability, Random Variables and Stochastic Processes, chapter 8, fourth edition, McGraw-Hill, 2002.

Pedersen, Kim Steenstrup, Curves and Surfaces, part II, lecture note for Computer Graphics, IT University of Copenhagen, autumn 2004.

Reynard, David, Andrew Wildenberg, Andrew Blake and John Marchant, Learning Dynamics of Complex Motions from Image Sequences, Proceedings of the 4th European Conference on Computer Vision, Volume I, p.357-368, 1996.

Rui, Yong and Yuqiang Chen, Better Proposal Distributions: Object tracking using unscented particle filter, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, vol. II, p.786-793, 2001.

Sestoft, Peter, Searching and sorting with Java, lecture note on Java, 1998.

Stauffer, Chris, and W. Eric L. Grimson, Learning patterns of activity using real-time tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22(8), p.747-757, August 2000.

Tate, Shuta and Yoshiyasu Takefuji, Video-based human shape detection by deformable templates and neural network, Knowledge-Based Intelligent Information Engineering Systems and Allied Technologies (KES 2002), 2002.

Trucco, E. and A. Verri, Introductory Techniques for 3-D Computer Vision, chapter 8, Prentice-Hall, 1998.

Weng, Juyang, Narendra Ahuja, and Thomas S. Huang, Motion from images: Image matching, parameter estimation and intrinsic stability, in Proceedings of the IEEE Workshop on Visual Motion (Irvine, CA), pp. 356-366, 1989.

Weng, Juyang, Narendra Ahuja, and Thomas S. Huang, Learning Recognition and Segmentation of 3-D Objects from 2-D Images, in Proceedings of the Fourth International Conference on Computer Vision, IEEE, 1993.

Yang, Jie and Alex Waibel, A real-time face tracker, in Proceedings of the Third Workshop on Applications of Computer Vision, IEEE, p.142-147, 1996.
