Finding Mean Traffic Speed in Low Frame-rate Video

Jared Friedman

January 14, 2005
Final Project Report

Computer Science 283


Abstract

In this paper, I present a novel approach to estimate mean traffic speed using low frame-rate video taken from an uncalibrated camera. This approach takes advantage of a known relationship between traffic speed and traffic density to make tracking of individual vehicles unnecessary. The algorithm has been developed especially for nighttime conditions, though extensions to daytime images seem quite possible. It has been tested on several image sequences and shown to produce results consistent with human estimations from those sequences.

Introduction

Computer vision techniques have been applied to images of traffic scenes for a variety of purposes [1]. One of the more popular of these purposes is to attempt to extract a measure of the level of congestion of the road in the scene. Properly disseminated, this information can be used by drivers to plan routes that avoid traffic and by first responders to identify accidents. The measurement typically used for congestion is the mean speed of traffic, as this is the measurement a rational traveler should care about. In this paper, I present a new algorithm to estimate mean traffic speed using video images at low frame rates. The work is motivated by the presence of Trafficland, a new company that offers video streams from over 400 traffic cameras in the Washington D.C. area through free internet access (at www.trafficland.com). Previous work on finding traffic speed has measured it directly, essentially by tracking vehicles for a known time over a known distance and calculating an average ratio. However, due to bandwidth limitations, Trafficland’s cameras give video at less than one frame per second, with unreliable and difficult to determine time intervals between frames, making tracking extremely difficult if at all possible. Some published algorithms [2],[3] instead place two virtual lines or “tripwires” on the road at a known separation and measure the time interval between cars crossing the first and crossing the second, in the natural computer vision analogue of the physical loop detectors on roads. While this may seem to be a different technique from tracking, it shares many similarities, including the assumption that cars will not travel much from one frame to the next, and in practice it requires an even higher frame rate than tracking.

Nevertheless, clearly humans are able to judge traffic levels from Trafficland’s video, and they would still be able to do so even if shown only every second or third frame, making tracking literally impossible. I assert that the way we make this judgment is by determining how closely spaced the cars are and using the intuitive fact that closely spaced cars tend to travel more slowly than widely spaced ones. More precisely, we use the inverse correlation between mean traffic speed and traffic density, which is defined as vehicles per lane-mile [4]. The approach of this paper is to take advantage of this relationship: compute density directly, which is easier to do at low frame rates, and convert it into a speed using the known relationship.

To compute density in a particular region of interest, we must know the number of lanes of the region, the length of the region, and the number of cars in the region in each frame. Some traffic vision systems have had to accommodate cameras that could be rotated and zoomed by traffic operators with joysticks (e.g., [1], [5]), and thus have had to build some automatic calibration capability into their programs. However, Trafficland’s cameras appear to be stationary, and we make the assumption that the number of lanes and the length of the region need only be determined once, and take advantage of human input in this one-time low-cost setup procedure. Many published and commercial systems [6] [7] also require some initial human setup: for one, if there are multiple roads or two directions of a single road in the picture (as is usually the case), the software cannot possibly know which road is the intended one without some human input. Specifically, the initial calibration setup simply requires a human to draw a rectangle (in world coordinates) around an area of interest and then to trace out the lanes in the region. Using some simple geometric constraints and assuming a typical lane width, the length of the region can then be calculated.
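As a minimal sketch of the quantity being computed, the following Python fragment converts per-frame vehicle counts into a density in vehicles per lane-mile. The function name, the averaging of counts over frames, and the sample numbers are illustrative assumptions rather than details of the implementation described in this report.

```python
FEET_PER_MILE = 5280.0


def traffic_density(vehicle_counts, num_lanes, region_length_feet):
    """Mean density in vehicles per lane-mile over a sequence of frames.

    vehicle_counts: number of vehicles counted in the region, one entry per frame.
    Averaging the counts over frames is an assumption made for this sketch.
    """
    if not vehicle_counts:
        raise ValueError("need at least one frame")
    if num_lanes <= 0 or region_length_feet <= 0:
        raise ValueError("lane count and region length must be positive")
    mean_count = sum(vehicle_counts) / len(vehicle_counts)
    lane_miles = num_lanes * region_length_feet / FEET_PER_MILE
    return mean_count / lane_miles


# Hypothetical example: three frames of a 3-lane region that is 528 feet long.
density = traffic_density([4, 6, 5], num_lanes=3, region_length_feet=528.0)
```

The resulting density is the input to whichever speed-density relationship is adopted.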

The calculation of the number of vehicles in the region proved to be more challenging than expected. This is partially because surprisingly little previous work has been done on the problem. Several algorithms have developed excellent tracking of cars in daytime conditions at high frame rates, which implies that they are able to recognize vehicles to some extent. However, in tracking vehicles directly, it is not necessary to segment cars properly, but only to identify blobs that correspond to multiple vehicles or parts of vehicles, since in a traffic stream all the vehicles, their parts and their shadows tend to move at about the same velocity. The algorithm reported in [1] does require correct segmentation of vehicles, because it must estimate their size correctly, but it requires correct segmentation of only a few vehicles at a time, and thus it simply throws away any blobs that do not correspond to a tightly defined vehicle profile. Accurate counting of vehicles in daytime conditions requires a more sophisticated approach to deal with occlusion, shadows, and vehicles of widely varying appearance. In this preliminary report, I chose to focus on nighttime conditions only and to leave daytime conditions for future work. Nighttime conditions are easier because at night, it is usually possible to simply count the number of headlights appearing in the region, and headlights are much more visible and less vulnerable to occlusion, shadow, and varying appearance than cars. Nighttime conditions are in any case a more suitable potential use of the algorithm advanced in this paper, since tracking-based systems ordinarily find daytime conditions much easier than nighttime conditions, giving the density approach a particular advantage in these conditions.

This paper first reviews the key assumptions of the algorithm and discusses their validity and which ones could be relaxed in further work. I then discuss in detail the workings of the algorithm and follow with some considerations of its computational efficiency. I conclude with empirical results validating the accuracy of the algorithm.

Underlying Assumptions

The following gives a list of the key assumptions used to simplify the problem and some discussion of their validity.


1) Images are taken at night. Headlights are the brightest objects in the region of interest.

2) Traffic is moving generally towards the camera, but not (almost) directly into it. The second requirement exists because when traffic is going almost directly towards the camera, the glare from the headlights creates bloom, lens flares, and severe distortion. The first requirement exists because if the traffic is moving away from the camera, the headlights will not be directly visible. In my opinion, the second of these twin requirements is much more reasonable than the first. Many of Trafficland’s nighttime images are so severely distorted by the lens flares that it is nearly impossible even for a human to determine the amount of traffic, and working with these images would be a real challenge. However, tracking cars going away from the camera will obviously be necessary for a fielded system, and future systems could use either the rear vehicle lights or the fairly bright reflected glare from the headlights to accomplish this.

3) Vehicles are confined to the road plane, and there exists a region of interest with straight edges. Also, the number of lanes in the region of interest is constant. These requirements are necessary for the calculation of the geometry of the situation.

4) The width of each lane in the picture is approximately 11.5 feet. This assumption supplies the distance required to determine the scale of the image and thus the length of the region of interest. The validity of the assumption is taken from [8], which states that virtually all American highways have lane widths between 10 and 12.5 feet at all times, with lane widths close to 12 feet being the most common. Other systems have used a variety of means to attempt to produce a scale measurement, mostly by placing physical marks on the road [9], [10], although [1] did so by assuming a known distribution of vehicle lengths. However, in this situation it was impossible to have an operator placing marks on or near the road. Estimating by mean vehicle length requires an algorithm that can accurately determine the length of all vehicles, including trucks, which is difficult to construct and must be run over a considerable period to get an accurate mean value. Furthermore, it is not at all clear that this mean vehicle length is more constant from road to road than the mean lane width. [11] reports evidence that the mean vehicle length changed considerably depending on the time of day, the highway, and the lane observed, primarily due to the considerable variation in the presence of large trucks, leading to large errors in systems that assumed a constant vehicle length. Using the lane width as a calibration tool appears to be a novel suggestion, and it seems a sensible choice for a variety of situations, not limited to low frame-rate video. It is perhaps worth noting that if the lane width did not hold to the 10-12.5 foot range, then the validity of assumption (5) would be in question anyway, as this would affect the density-speed relationship.

5) Traffic speed and density have a known and constant relationship, specifically Edie’s model as given in [4]. This assumption is admittedly somewhat controversial. While the inverse correlation between speed and density is obvious, the exact relationship has been a topic of considerable debate. For decades, it was believed to be a linear relationship (the Greenshields model) on the basis of a single study using seven data points all collected from a single highway [4]. Further study by Greenberg found that a logarithmic relationship was the best fit, as in Fig. 1. However, despite the seemingly excellent fit, a number of caveats can be raised with Greenberg’s methods, and several later studies found that Greenberg’s relationship was only a mediocre fit to their data. The modern favored choice for the relationship is Edie’s hypothesis, which is a piecewise function shown in Fig. 2. The piecewise relationship has not only been confirmed by a rigorous study into the matter [12], it also fits well with theoretical models of traffic flow, which invariably divide the problem into at least two subcases corresponding to free flow and congested flow, if not more. Despite all the debate as to the precise nature of the relationship, the actual difference for this application between the initial Greenshields model and the most recent Edie model is at most 10-20%. However, the data for these studies appears to have been collected only during the daytime and only in normal weather conditions. How nighttime conditions and adverse weather might affect the relationship is quite unknown.

Fig. 1. Greenberg’s speed-density hypothesis, plotted with his data.

Fig. 2. Edie’s speed-density hypothesis, plotted with his data.
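To make the three candidate relationships concrete, the sketch below writes each as a Python function, assuming the commonly cited forms: a linear model for Greenshields, a logarithmic model for Greenberg, and for Edie a two-regime model that is exponential in free flow and logarithmic when congested. All parameter values and names are placeholders chosen for illustration; they are assumptions and are not taken from this report or fitted to its data.

```python
import math

# Illustrative parameter values only (speeds in mph, densities in vehicles
# per lane-mile); they are assumptions, not values taken from this report.
FREE_FLOW_SPEED = 54.9
FREE_FLOW_DECAY_DENSITY = 163.9
CONGESTED_SPEED = 26.8
JAM_DENSITY = 162.5
BREAKPOINT_DENSITY = 50.0


def greenshields_speed(density, free_flow_speed=FREE_FLOW_SPEED,
                       jam_density=JAM_DENSITY):
    """Linear model: speed falls linearly to zero at the jam density."""
    return max(0.0, free_flow_speed * (1.0 - density / jam_density))


def greenberg_speed(density, optimum_speed=CONGESTED_SPEED,
                    jam_density=JAM_DENSITY):
    """Logarithmic model; undefined at zero density."""
    if density <= 0.0:
        raise ValueError("the logarithmic model is undefined at zero density")
    return max(0.0, optimum_speed * math.log(jam_density / density))


def edie_speed(density):
    """Two-regime model: exponential in free flow, logarithmic when congested."""
    if density <= BREAKPOINT_DENSITY:
        return FREE_FLOW_SPEED * math.exp(-density / FREE_FLOW_DECAY_DENSITY)
    return greenberg_speed(density)


# Speeds predicted by the piecewise model at a few sample densities.
speeds = [edie_speed(k) for k in (10.0, 50.0, 100.0, 150.0)]
```

In use, the density estimated from the image sequence would be passed to the chosen function to obtain the mean speed estimate.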


Finally, it is important to note that there is another relationship between traffic variables that could be useful in further study. The relationship is between traffic speed and traffic volume, which is defined as cars per lane-hour. I chose not to use this relationship in my algorithm because the relationship between volume and speed is considerably less well-established than the one between speed and density. However, calculating volume does not require finding the length of the region of interest, and thus it is immune from that source of error. If the expected level of error from the estimation of the length of the region were greater than the expected error from the estimation of the speed-volume relationship, then it would be a reasonable choice to abandon density and instead calculate volume. Determining whether this is in fact the case unfortunately goes beyond the scope of this paper, but if it were, then the change would be quite easy to make. The part of the algorithm that calculates the length of the region would simply be dropped, and instead the number of cars counted by the second part of the algorithm, divided by the product of the time period of operation and the number of lanes, could be plugged into the function in [4] that computes an expected relationship between volume and speed.

Algorithm Operation

This section details the workings of the algorithm. The first part explains the counting of cars, and the second part explains the camera calibration and determination of the region length.

I. Counting Cars

The algorithm operates on a sequence of nighttime images, sampled at virtually any frame rate. The images in the dataset are originally in color, but they are converted to greyscale for analysis. For nighttime images, there is normally very little useful information in the color channels, and what information there might have been is obscured by the terrible color distortion in Trafficland’s cameras. The car counting part of the algorithm assumes that an operator has drawn a rectangular region of interest and that we are counting cars only in that region.

The car-counting algorithm operates essentially by counting headlights. Unlike other papers [13], I do not assume a particular shape for headlights, nor do I require that each vehicle have exactly two nearly identical headlights. While those assumptions are often valid, occlusion, reflections on the road and on the vehicles, and varying headlight configurations complicate the picture. Instead, I merely assume that each vehicle has one or more bright dots on or right next to it. The algorithm finds those dots and attempts to determine which dots belong to which vehicles.


Fig. 3. A typical unprocessed Trafficland image

The first step of the algorithm is to crop the image to the smallest size that contains the region of interest, and then to set all the pixels outside the region of interest to zero intensity. The image is then converted to greyscale. A typical image at this stage is shown in Fig. 3. The image is then top-hat filtered. Top-hat filtering is a technique used to smooth out uneven dark backgrounds. It is defined by subtracting the result of performing a morphological opening on the input image from the input image itself. This has the effect of reducing background noise by eroding it away, and thus producing a more even, clean background. Results of top-hat filtering are shown in Fig. 4. Top-hat filtering requires a choice of a structuring element for the morphological opening. The choice of a disk shape was easy, as this is standard. The choice of the size of the disk was more difficult and also somewhat arbitrary. Examination of the scale of a few Trafficland cameras showed that a disk radius of 10 pixels gave good results. Some further experimentation showed that the performance of the algorithm was quite insensitive to changes in this size.
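The filtering step can be sketched in plain Python as follows. This is an illustrative toy version, not the implementation used for the results in this report: it computes the white top-hat as the image minus its morphological opening (erosion followed by dilation) with a disk-shaped structuring element, represents the image as a list of rows, and simply ignores neighbors that fall outside the image. All function names are hypothetical.

```python
def disk_offsets(radius):
    """Pixel offsets covered by a disk-shaped structuring element."""
    return [(dy, dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if dy * dy + dx * dx <= radius * radius]


def _morph(image, offsets, pick):
    """Apply `pick` (min or max) over each pixel's in-bounds neighborhood."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            out[y][x] = pick(image[y + dy][x + dx]
                             for dy, dx in offsets
                             if 0 <= y + dy < rows and 0 <= x + dx < cols)
    return out


def erode(image, offsets):
    return _morph(image, offsets, min)


def dilate(image, offsets):
    return _morph(image, offsets, max)


def top_hat(image, radius):
    """White top-hat: the image minus its opening (erosion, then dilation)."""
    offsets = disk_offsets(radius)
    opened = dilate(erode(image, offsets), offsets)
    return [[p - o for p, o in zip(img_row, open_row)]
            for img_row, open_row in zip(image, opened)]


# Toy example: a flat grey background of 0.2 with one bright pixel.
image = [[0.2] * 7 for _ in range(7)]
image[3][3] = 1.0
filtered = top_hat(image, radius=2)
```

In this toy example the bright pixel is smaller than the disk, so the opening removes it and the subtraction leaves the pixel standing on a background of zero.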


Fig. 4. Image after top-hat filtering.

The next step is to choose a threshold and convert the greyscale image into black and white. Choosing the threshold is obviously the difficult part. If there are headlights in the image, then Otsu’s method, which chooses a threshold to minimize the intra-class variance, works very well. This is essentially because in this case, the image histogram will be strongly bimodal between headlight and not-headlight, and this method will easily find the dividing point. However, in an image that has no headlights, Otsu’s method will return terrible results, as it will segment the road itself based on random noise on the road, whereas in this case we want every pixel to be black. One way to solve this problem is simply to set a fixed parameter that represents a minimum reasonable headlight intensity, and to take the threshold to be the maximum of this intensity and the threshold computed by Otsu’s method. In practice, this method will return good results with almost all images, as the difference between the road background, generally 0 to 0.3 in intensity, and the headlight intensity, usually 0.9 to 1, is so extreme that any parameter value from 0.4 to 0.7 will correctly separate them, and the choice of the parameter within this range will have little effect on the quality of the segmentation.
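The fixed-parameter rule can be sketched as below. This is a minimal illustration under stated assumptions: pixel intensities are floats in [0, 1], Otsu's method is implemented over a 256-bin histogram by maximizing the between-class variance (equivalent to minimizing the intra-class variance), ties are resolved by averaging the tied bins, and the floor of 0.5 is one arbitrary choice from the 0.4 to 0.7 range. The function names are hypothetical.

```python
def otsu_threshold(pixels, bins=256):
    """Otsu threshold for intensities in [0, 1]."""
    if not pixels:
        raise ValueError("need at least one pixel")
    hist = [0] * bins
    for p in pixels:
        hist[min(bins - 1, int(p * bins))] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_score, best_bins = -1.0, []
    weight0, sum0 = 0, 0
    for t in range(bins - 1):
        weight0 += hist[t]
        sum0 += t * hist[t]
        weight1 = total - weight0
        if weight0 == 0 or weight1 == 0:
            continue
        mean0 = sum0 / weight0
        mean1 = (sum_all - sum0) / weight1
        score = weight0 * weight1 * (mean0 - mean1) ** 2
        if score > best_score:
            best_score, best_bins = score, [t]
        elif score == best_score:
            best_bins.append(t)  # plateau between the two modes
    if not best_bins:
        return 0.0  # all pixels fall in one bin; nothing to separate
    return (sum(best_bins) / len(best_bins) + 1) / bins


def segmentation_threshold(pixels, min_headlight_intensity=0.5):
    """Never threshold below a minimum plausible headlight intensity."""
    return max(min_headlight_intensity, otsu_threshold(pixels))


# Hypothetical pixel samples: a frame with headlights and an empty road.
pixels_with_cars = [0.1] * 90 + [0.95] * 10
pixels_empty_road = [0.05 + 0.25 * i / 99 for i in range(100)]

threshold_cars = segmentation_threshold(pixels_with_cars)
threshold_empty = segmentation_threshold(pixels_empty_road)
```

With headlights present, the Otsu value lies between the two modes and is used as is; on the empty road the floor takes over, so no pixel is marked white.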

In the expectation that this method might not return optimal results for all images, I pursued a method that would learn the threshold from previous images. The algorithm begins using the fixed parameter method with a low fixed parameter value. A record is kept of all the thresholds determined by Otsu’s method during the past 100 images, including those determinations when the value was not used as the actual threshold because it was too low. To this vector, Otsu’s method is itself applied again, to separate the vector into values that were found during no vehicle presence and ones that were found during an actual vehicle presence. If there was at least one image with headlights and one image without headlights, this method will return good results. Once again, a problem can occur when trying to separate a non-bimodal distribution. I make the assumption that in at least one of these images, a car was present. To test whether all the images have headlights in them, I do a 2-sample t-test, comparing the values below the computed threshold to the values above the computed threshold. If the difference is significant, then there are likely two different distributions in the data, one with headlights and one without headlights, and I set the minimum parameter to the threshold computed on the vector of 100 previous thresholds. If the difference is not significant, then probably all of the images had cars in them, and I keep the old threshold. This assumes that the values of the grey threshold computed when there are cars in the picture and when there are not cars in the picture both have normal distributions, and I have found this to be a fairly accurate assumption, as confirmed both visually and by the Lilliefors normality test. Fig. 5 shows a histogram of the thresholds determined in a 200 image sequence with a highly bimodal distribution due to the presence or absence of headlights. I found this method to accurately determine when there were actually cars present in the picture, and to choose a grey threshold accordingly that reflected the lighting conditions of the image (e.g., images with brighter backgrounds had a higher threshold). However, because the rest of the algorithm is insensitive to the value of this parameter, this method failed to return significantly better or even significantly different results on the actual dataset than the simple-minded hard-coded parameter method.

Fig. 5. Histogram of 100 threshold values determined by Otsu’s method.

Having converted the image into black and white, the next step is to identify cars from the white blobs that may correspond to headlights or reflections. A typical image at this stage is shown in Fig. 6. As discussed before, other work [13] has attempted to use prior knowledge about headlight shape to accomplish this segmentation. Unfortunately, they give virtually no details about their algorithm, so it was not possible to reproduce their results. After a considerable amount of experimentation with template-based matching, the technique used in [13], I decided that these assumptions about headlight shape were not valid enough in general to be useful, and instead I use a simpler method that relies only on the assumption that all the headlights and headlight reflections on a car will be close together. First, the binary image is dilated, which tends to connect the unconnected blobs belonging to a single car. Unfortunately, this simple technique runs the risk of joining blobs of adjacent cars incorrectly, leading to undercounting the actual number of vehicles. To mitigate this problem, I use the fact that the user has drawn segments corresponding to the lanes in the region in the initial setup and I separate blobs along those lane boundaries. Assuming that all cars are entirely in a lane, this essentially solves the problem of connection across lane boundaries, and leaves only the potential problem of connecting two cars within a lane. But since the headlights of cars in a single lane are separated by dark car bodies, this is rarely a problem. Simply counting the blobs found at this stage gives a reasonable result, but I do one further step of noise reduction that improves performance further. Since all headlights should by now have been joined together in blobs of considerable size, I eliminate all blobs below a certain threshold size, since these usually correspond to small reflections on or around vehicles that have already been counted. The size chosen is an area of 15 pixels, which is an extremely conservative estimate for the size of a headlight, especially after dilation. Rather than an assumption about the size of the headlights in the images, it is best considered as a constraint on the choice of the region of interest by the traffic operator, requiring the region to be placed close enough to the camera so that dilated headlights have an area of more than 15 pixels. Indeed, if this is not the case, the rest of the algorithm is unlikely to perform well anyway, as the resolution will be very poor.

Fig. 6. Black and white segmentation.
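The dilate, split, and count steps described above can be sketched in plain Python as follows. Several details here are assumptions made for illustration: the binary image is a list of rows of 0/1 values, the operator's lane tracing is represented by a hypothetical per-pixel lane label map `lane_of`, blobs are 8-connected, and a blob is kept when its area is at least 15 pixels.

```python
from collections import deque


def disk_offsets(radius):
    """Pixel offsets covered by a disk-shaped structuring element."""
    return [(dy, dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if dy * dy + dx * dx <= radius * radius]


def dilate_binary(mask, radius):
    """Binary dilation: switch on every pixel within `radius` of a white pixel."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    offsets = disk_offsets(radius)
    for y in range(rows):
        for x in range(cols):
            if mask[y][x]:
                for dy, dx in offsets:
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < rows and 0 <= nx < cols:
                        out[ny][nx] = 1
    return out


def count_vehicles(mask, lane_of, radius, min_area=15):
    """Count dilated blobs, never joining pixels that lie in different lanes."""
    dilated = dilate_binary(mask, radius)
    rows, cols = len(dilated), len(dilated[0])
    seen = [[False] * cols for _ in range(rows)]
    vehicles = 0
    for y in range(rows):
        for x in range(cols):
            if not dilated[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            area = 0
            while queue:  # breadth-first flood fill of one blob
                cy, cx = queue.popleft()
                area += 1
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and dilated[ny][nx] and not seen[ny][nx]
                                and lane_of[ny][nx] == lane_of[cy][cx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if area >= min_area:  # drop small reflections
                vehicles += 1
    return vehicles


# Toy scene: two cars side by side in adjacent lanes plus one small reflection.
ROWS, COLS = 12, 20
mask = [[0] * COLS for _ in range(ROWS)]
for y, x in [(3, 4), (3, 7), (3, 12), (3, 15), (9, 3)]:
    mask[y][x] = 1
two_lanes = [[0 if x < 10 else 1 for x in range(COLS)] for _ in range(ROWS)]
one_lane = [[0] * COLS for _ in range(ROWS)]

with_lanes = count_vehicles(mask, two_lanes, radius=2)
without_lanes = count_vehicles(mask, one_lane, radius=2)
```

In this toy scene the dilated headlights of the two cars touch across the lane boundary, so the count without lane information merges them into one vehicle, while the lane-aware count keeps them apart.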

Dilating the image in the above step requires a choice of a structuring element, and this choice is best determined in a principled manner, as the performance of the algorithm is considerably affected by it. In images with small headlights, like the ones shown in the above figures, dilation is not necessary, though a small amount rarely hurts. In images with a closer view of the traffic, dilation becomes essential. The shape of the structuring element is simply a disk, as is standard. The idea behind my method of choosing the radius of the disk is that the disk needs to be large enough to join headlight pairs but should not be much larger, else it will run the risk of joining together different cars. The algorithm for calculating this size depends on some assumptions about headlight size taken from [14] and also informally observed in Trafficland’s images. The key assumption specifically is that the average distance between the headlights is approximately proportional to the typical headlight size, as recorded by the image. One case in which the assumption is clearly false is when the cars are traveling almost directly towards the camera, as then the distortion will make the headlights seem much larger. Another case in which the assumption does not work well is when the traffic is traveling almost directly across the camera’s field of view; however, in this case, some dilation is still useful in connecting cars with the glare reflection, which will be particularly prominent. In between these two extremes, however, the constancy of this ratio is good enough that setting the size of the structuring element to be a constant times the estimated headlight size returns excellent results. The correct ratio is difficult to determine exactly, but it is close to one; the value I use is 1.3. I measure the size of the headlights by finding the median area of all the blobs found in the first 100 frames, and calculating the corresponding radius of a circle of this area.
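The radius rule amounts to a few lines of arithmetic, sketched here in Python. The function name and the sample blob areas are hypothetical, and rounding the result to a whole number of pixels is an assumption of this sketch rather than a detail stated above.

```python
import math
from statistics import median


def dilation_radius(blob_areas, ratio=1.3):
    """Structuring-element radius from the median blob area.

    The median area is converted to the radius of a circle of equal area,
    then scaled by the headlight spacing ratio (1.3 in this report).
    """
    if not blob_areas:
        raise ValueError("need at least one blob area")
    typical_area = median(blob_areas)
    headlight_radius = math.sqrt(typical_area / math.pi)
    return ratio * headlight_radius


# Hypothetical blob areas (in pixels) collected over the first frames.
areas = [3.0, 4.0 * math.pi, 40.0]
radius = dilation_radius(areas)
disk_radius_pixels = max(1, round(radius))  # rounding is an assumption
```

Using the median rather than the mean keeps a few very large glare blobs or tiny reflections from skewing the headlight size estimate.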

II. Determining the Region Length

The algorithm I use to determine the region length is taken from [15], adapted to the information available for the scene. First, the traffic operator gives the initial set-up information pictured in Fig. 7. This includes a region of interest, whose projection onto the road plane must be rectangular in world coordinates, and a trace of all the lanes in this region of interest. For good performance, the region of interest must be a straight section of road, it should begin as close to the camera as possible, and it should not extend so far that the resolution at the end of the region is too poor (see above for a more precise definition).

Fig. 7. What the traffic operator draws.


The estimate of the region length begins with computing a camera calibration from the data that the operator has given. The camera calibration can then be combined with some simple geometry to estimate the region length. The camera calibration technique described in [15] is easier for the traffic operator than the one described in [7], which requires that a grid evenly spaced along the road axis be determined by the operator, which in practice is a difficult judgment for a human to make. I do not repeat the detailed derivation of the algorithm in [15] here, but I give an overview of its operation as applied to this situation. Most of what follows is taken from this paper; for brevity, I omit the exact citations.

Camera calibration involves finding a camera’s intrinsic and external parameters. First consider the intrinsic parameters. Recall that they can be represented by the matrix

$$
A = \begin{bmatrix} \alpha_u & \alpha_u \cot\theta & u_0 \\ 0 & \dfrac{\alpha_v}{\sin\theta} & v_0 \\ 0 & 0 & 1 \end{bmatrix}
$$

To simplify the calibration process, I make several assumptions about the intrinsic parameters which are approximately true for most cameras and common in computer vision applications. First, I assume that the image axes are in fact perpendicular, so that $\theta = \pi/2$ (and hence $\cot\theta = 0$ and $\sin\theta = 1$). I also assume that the horizontal and vertical focal lengths are equal ($\alpha_u = \alpha_v$), and that $(u_0, v_0)$, the coordinates of the camera center, are actually at the image center. This reduces what was previously a five parameter problem to a one parameter problem, leaving $\alpha_u$ as the only unknown.

To calculate the external parameters, we first calculate a vanishing point. The vanishing point that can be calculated most accurately is the one in the road direction. We could use only the region of interest boundaries to calculate this point, but we will get better results if we also use the tracings of the lanes the user has made. Specifically, we wish to find the point whose sum of squared distances to all the lines is a minimum. This least squares estimate can be easily determined by solving a system of linear equations of the form $Mx = b$. If there are $n$ lanes on the road, then $M$ and $b$ will each have $n+2$ rows. Let $L_i$ be a unit vector in homogeneous coordinates representing the direction of the $i$th line (out of the $n+2$), and let $P_i$ be a point on that line (in homogeneous coordinates). Then we can define the $i$th row of $M$ and $b$ as follows:

$$
M_i = \begin{bmatrix} -L_{i,2} & L_{i,1} \end{bmatrix}, \qquad b_i = (L_i \times P_i)^T \begin{bmatrix} 0 & 0 & 1 \end{bmatrix}^T
$$

Then the vanishing point $x$ is simply the pseudo-inverse of $M$ times $b$. The vanishing direction can be computed as $A^{-1}x$, where $A$ is the camera intrinsic parameters matrix (which is not yet entirely known).

We can describe the world coordinates in terms of three axes: $G_x$, which is perpendicular to the vanishing direction, $G_y$, which is parallel to the vanishing direction, and $G_z$, which completes the coordinate system. Let $v = (v_x, v_y, v_z)$ denote the normalized vanishing direction. Let $\phi$ be the roll angle about the vanishing direction, and define $\beta = 1/(1+v_y)$. Then we can determine the three axes in terms of these variables, only two of which are unknown. [15] provides the expressions for $G_x$ and $G_z$ (the expression for $G_y$ is trivial); unfortunately, the expression for $G_z$ contains several apparently typographical errors. The correct expressions are:

$$
G_x = \begin{bmatrix} (1-\beta v_x^2)\cos\phi + \beta v_x v_z \sin\phi \\ v_z \sin\phi - v_x \cos\phi \\ -(1-\beta v_z^2)\sin\phi - \beta v_x v_z \cos\phi \end{bmatrix}
$$

$$
G_z = \begin{bmatrix} (1-\beta v_x^2)\sin\phi - \beta v_x v_z \cos\phi \\ -v_x \sin\phi - v_z \cos\phi \\ (1-\beta v_z^2)\cos\phi - \beta v_x v_z \sin\phi \end{bmatrix}
$$

If these axes were known (currently they are written in terms of two unknowns), we would then be able to use the axes, our intrinsic parameter matrix, and some geometry to calculate distances on the road plane. Specifically, this can be done in the following manner. Given an image point $x$, compute its projection $p = A^{-1}x$. This is of course a vector in the direction of the ray that goes from the camera center to the point $x$. Writing $\hat{p}$ for $p$ normalized to unit length, the intersection of this ray and the road plane is

$$
P = \frac{\lVert G_z \rVert^2}{G_z \cdot \hat{p}}\,\hat{p}
$$

Given two such projections $P_1$ and $P_2$, the distance between them is simply $\lVert P_1 - P_2 \rVert$, up to some unknown but constant scale factor.

Of course, all of this assumes that we have values for the two unknowns $\alpha_u$ and $\phi$. The key insight of [15] is that with a knowledge of the ratios of lengths in the picture, we can use a non-linear optimization process to solve for those two unknowns. For the Trafficland situation, say again that we have $n$ lanes. Then we know of $n+2$ segments in the direction of the road that must be of the same length in the world. Also, all the $2n+2$ segments perpendicularly connecting the lanes at the beginning and end of the region of interest must be of the same length. Let us denote the lengths of the $n+2$ road-parallel segments as $q_0, \ldots, q_{n+1}$ and those of the $2n+2$ perpendicular segments as $s_0, \ldots, s_{2n+1}$. The residual I compute is a modification of the one in [15] and is defined by:

$$
r = \sum_{i=1}^{n+1} \left(1 - \frac{q_i}{q_0}\right)^2 + \sum_{i=1}^{2n+1} \left(1 - \frac{s_i}{s_0}\right)^2
$$


A non-linear optimization process can then be used to solve for the $\alpha_u$ and $\phi$ that minimize $r$. [15] recommends the Levenberg-Marquardt method, but I use a subspace trust region method based on the interior-reflective Newton method, as some informal experimentation showed that this algorithm was much less likely to converge to incorrect local minima when given an initial value distant from the correct one. Finally, the scale factor can be easily determined by assuming the lane width as stated above and dividing the actual lane width by the average of the computed ones.
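Two of the simpler steps in this section, the least-squares vanishing point and the final scale factor, can be sketched in plain Python. This is an illustration under assumptions, not the implementation used here: each drawn line is given as a pair of image points, the point minimizing the sum of squared distances to the lines is found by solving the 2x2 normal equations directly instead of forming a pseudo-inverse, and all names and sample coordinates are hypothetical.

```python
import math


def vanishing_point(segments):
    """Point minimizing the sum of squared distances to a set of image lines.

    segments: list of ((x1, y1), (x2, y2)) pairs, one per drawn line.
    """
    a11 = a12 = a22 = c1 = c2 = 0.0
    for (x1, y1), (x2, y2) in segments:
        length = math.hypot(x2 - x1, y2 - y1)
        lx, ly = (x2 - x1) / length, (y2 - y1) / length  # unit direction
        m1, m2 = -ly, lx                                 # unit normal (row of M)
        b = lx * y1 - ly * x1                            # normal . point (entry of b)
        a11 += m1 * m1
        a12 += m1 * m2
        a22 += m2 * m2
        c1 += m1 * b
        c2 += m2 * b
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("lines are parallel in the image; no finite solution")
    return ((a22 * c1 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det)


def scale_factor(computed_lane_widths, lane_width_feet=11.5):
    """Feet per model unit: assumed lane width over the mean computed width."""
    mean_width = sum(computed_lane_widths) / len(computed_lane_widths)
    return lane_width_feet / mean_width


# Hypothetical operator input: four lines that all pass through (160, 40).
lines = [((60, 240), (110, 140)),
         ((160, 240), (160, 140)),
         ((260, 240), (210, 140)),
         ((360, 240), (260, 140))]
vp = vanishing_point(lines)

# Hypothetical lane widths measured on the road plane in unscaled units.
feet_per_unit = scale_factor([0.50, 0.46, 0.42])
```

Multiplying an unscaled road-plane distance, such as the computed region length, by the scale factor gives the distance in feet.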

Some Considerations of Computational Efficiency

The algorithm essentially has two parts, the initial setup and calibration, and the counting of vehicles in actual operation. Obviously, the efficiency requirements of the two are quite different. The part of the algorithm that operates in real time must be highly efficient; however, the initial setup, which only happens once, is not nearly so constrained.

Virtually the entirety of the computational time for the initial setup is consumed by the nonlinear optimization process, which must compute a relatively computationally intensive function many times to find a good minimum. The function that it computes is constant time with respect to the size of the image, but it contains several matrix inversions and a good deal of matrix multiplication and arithmetic operations. Typical running time for the nonlinear optimization process to complete is about ten seconds on a Pentium 4. Considering that it will take the operator significantly longer to draw the lines on the image that are used in the calibration process, this seems within acceptable limits.

The part of the algorithm that must operate in real time is the car counting. Empirically, this algorithm is highly efficient, requiring only about 0.2 seconds per frame, whereas the frames arrive at less than one frame per second. Assuming no frames are dropped (and it would in fact be possible to drop frames without affecting performance significantly), each computer could therefore process the video feeds from 5 cameras simultaneously, which is superior to most published algorithms, which are usually able to handle only one camera [1]. Part of the reason for the high efficiency is the small size of the images that are effectively being worked with. The original traffic image is 320x240, but most of this is background, and the region of interest is typically on the order of 100x100. Profiling the execution of the algorithm using the excellent profiler tool in MATLAB showed that the algorithm spent 64.4% of its time directly computing morphological operations of some kind. When the time to check arguments, resize matrices and execute other miscellaneous utility functions connected with the morphological operations is taken into account, the actual percentage of the time spent doing morphological operations is between 80% and 90%. Virtually all of the rest of the CPU time is spent resizing the image and converting it to greyscale. Computing the threshold using Otsu’s method takes only 2.0% of the time. The top-hat transformation at the beginning is particularly computationally intensive (55%), as for a circular structuring element, the computational complexity of the morphological closing is proportional to the area of the circle, which is rather large. This may indicate that in situations where computational efficiency is at a premium, a smaller structuring element or one of an easier shape should be substituted, though this was not investigated.
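Two of the figures above can be checked with a short Python sketch. It is an illustration under stated assumptions rather than a reproduction of the MATLAB code: the function names are hypothetical, frames are assumed to be processed serially, and a frame interval of 1 second is used as the worst case, since the cameras deliver less than one frame per second.

```python
import math


def cameras_per_machine(seconds_per_frame, frame_interval_s):
    """Number of feeds one machine can serve when frames are processed serially."""
    # The small epsilon guards against floating-point results just below an integer.
    return math.floor(frame_interval_s / seconds_per_frame + 1e-9)


def disk_pixel_count(radius):
    """Count the pixels inside a disk-shaped structuring element of integer radius."""
    return sum(
        1
        for dy in range(-radius, radius + 1)
        for dx in range(-radius, radius + 1)
        if dx * dx + dy * dy <= radius * radius
    )


# 0.2 s of processing per frame against a 1 s frame interval (worst case).
print("cameras per machine:", cameras_per_machine(0.2, 1.0))

# The pixel count, and hence the per-pixel work of a naive morphological
# operation, grows roughly with the square of the radius.
for r in (2, 4, 8):
    print(f"radius {r}: {disk_pixel_count(r)} pixels")
```

Because the pixel count grows roughly quadratically with the radius, halving the radius of the structuring element would cut the cost of a naive implementation to roughly a quarter.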


The memory requirements of the algorithm are very modest. Since each image is processed individually, only the data to process that particular image must be stored. In my implementation, that is approximately 4 times the memory requirement of the image cropped down to the region of interest, because temporarily we must store the original image, the top-hat enhanced image, the segmented image, and the dilation of the segmented image. The other major memory requirement comes from storing the sizes of the headlights of the past 100 frames. In a fielded system, this would really not need to be calculated every frame; instead it could just be re-calculated every few thousand frames. However, during its calculation it takes a matrix of about 1000 elements to store it, assuming about ten blobs per image. The other variables are independent of the size of the images and very small.
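The memory figures above can be illustrated with a small Python sketch. The names `working_set_bytes` and `BlobSizeHistory` are hypothetical, and one byte per pixel is an assumption made for the example; the four buffers, the 100-frame window, and the ten blobs per image come from the text.

```python
from collections import deque


def working_set_bytes(roi_width, roi_height, bytes_per_pixel=1, buffers=4):
    """Estimate per-frame memory: original, top-hat, segmented, and dilated images."""
    return roi_width * roi_height * bytes_per_pixel * buffers


class BlobSizeHistory:
    """Keep headlight blob sizes for only the most recent frames."""

    def __init__(self, max_frames=100):
        # A bounded deque discards the oldest frame automatically.
        self._frames = deque(maxlen=max_frames)

    def add_frame(self, blob_sizes):
        self._frames.append(list(blob_sizes))

    def total_elements(self):
        return sum(len(frame) for frame in self._frames)


# A 100x100 region of interest with four temporary one-byte-per-pixel images.
print("working set:", working_set_bytes(100, 100), "bytes")

# Feed 150 frames of ten blobs each; only the last 100 frames are retained.
history = BlobSizeHistory(max_frames=100)
for _ in range(150):
    history.add_frame([12] * 10)
print("stored blob sizes:", history.total_elements())
```

A bounded buffer keeps the history at a fixed size regardless of how long the system runs, which matches the observation that memory use does not grow over time.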

Empirical Results

The gold standard in the empirical validation of computer vision speed detection algorithms is simultaneous inductance loop data. Inductance loops are wires run under the highway and connected to electrical monitoring equipment in such a way that a clear and easily measurable electrical impulse occurs when an axle rolls over the wire. By building two inductance loops close together at a known distance, the speed of traffic on the highway can be measured very accurately. If the images being analyzed by a computer vision algorithm have simultaneous loop detector data, the algorithm’s results can be compared with the known good speeds from the inductance loops and the algorithm validated with high accuracy.

Unfortunately, no such simultaneous data is publicly available. Without it, a total verification of the algorithm’s accuracy is impossible, but significant confirmation is still possible. Recall that the accuracy of the entire algorithm rests on the accuracy of three components: the counting of the cars, the determination of the size of the region, and the relationship between speed and density; if all three of these are correct, then the speed estimates produced must also be correct. The third of these is impossible for me to test, but it has been verified by numerous studies into the matter, and so it is reasonable to assume its accuracy. The second of these is also nearly impossible for me to test accurately. However, I can say informally that the algorithm returns results within some reasonable bound; there is at least no egregious error in implementation. More importantly, this algorithm has been used before in [15], and they provide considerable empirical validation of the approach using data obtained from physically measuring the road. Thus, the only part of the algorithm whose accuracy is in serious question is the car counting, and this is easily checked by counting the cars by hand and comparing that actual result to the estimated result found by the algorithm. In brief, such a comparison shows that the algorithm has excellent accuracy.

However, it is not sufficient validation to test the algorithm on a single camera in that manner, and furthermore it is not really sufficient validation of the robustness of the algorithm to test it on a camera whose images have been used to develop the algorithm. To provide a valid test of robustness, I developed the algorithm while working with the images of only one camera. Once the algorithm was performing well, I froze the code and then tested it on image sequences from several new cameras. However, I did not choose the new cameras randomly; rather, I chose only cameras that met the fairly restrictive criteria outlined in the Assumptions section. Unfortunately, only a small percentage of Trafficland’s cameras actually meet those criteria; however, I believe that my algorithm will work more or less equally well on all that do. Most of the cameras are disqualified either because the traffic is going in the wrong direction or because of some form of severe distortion from headlight or streetlight glare.

The key results of the study are shown in Table 1. They consist of twenty-image sequences from four cameras, and they compare the hand-counted results with the automatically determined results. Of the four cameras, one was the base camera the algorithm was developed on, and three were the new cameras in the test set. The results show that the algorithm's estimates are essentially unbiased and quite accurate over a fairly large range of traffic densities.

Camera     Type        Mean   S.D.   % Error
Base       Manual      4.25   1.1    1.2
Base       Automatic   4.30   1.7
Camera 1   Manual      0.40   0.5    12.5
Camera 1   Automatic   0.45   0.6
Camera 2   Manual      7.30   1.5    2.1
Camera 2   Automatic   7.15   2.1
Camera 3   Manual      4.5    1.5    4.4
Camera 3   Automatic   4.7    1.8

Table 1. Empirical Testing Results.
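The % Error column appears to be the absolute difference between the automatic and manual means, expressed as a percentage of the manual mean. The short Python check below recomputes it from the tabulated means; the function name `percent_error` is a hypothetical label for this illustration, and the formula is inferred from the table rather than stated in the text.

```python
# Means copied from Table 1: camera -> (manual mean, automatic mean).
TABLE_1_MEANS = {
    "Base": (4.25, 4.30),
    "Camera 1": (0.40, 0.45),
    "Camera 2": (7.30, 7.15),
    "Camera 3": (4.5, 4.7),
}


def percent_error(manual_mean, automatic_mean):
    """Absolute difference of the means relative to the manual (hand-counted) mean."""
    return abs(automatic_mean - manual_mean) / manual_mean * 100.0


for camera, (manual, automatic) in TABLE_1_MEANS.items():
    print(f"{camera}: {percent_error(manual, automatic):.1f}%")
```

Note that the large relative error for Camera 1 corresponds to a small absolute difference, since the mean counts for that camera are well below one.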

References

1. Dailey, D.J., Cathey, F.W., Pumrin, S., An algorithm to estimate mean traffic speed using uncalibrated cameras, IEEE Trans. Intelligent Transportation Systems (1), No. 2, June 2000, pp. 98-107.

2. S. Takaba, M. Sakauchi, T. Kaneko, B. Won-Hwang, and T. Sekine, Measurement of traffic flow using real time processing of moving pictures, in Proc. 32nd IEEE Vehicular Technology Conf., San Diego, CA, May 23-26, 1982, pp. 488-494.

3. N. Hashimoto, Y. Kumagai, K. Sakai, K. Sugimoto, Y. Ito, K. Sawai, and K. Nishiyama, Development of an image-processing traffic flow measuring system, Sumitomo Electric Tech. Rev., no. 25, pp. 133-137.

4. Traffic Flow Theory, edited by N.H. Gartner, C.J. Messer, and A.K. Rathi. Washington, D.C.: US Federal Highway Administration. Chap. 2, Traffic Stream Characteristics, by Hall, F.


5. José Melo, Andrew Naftel, Alexandre Bernardino, José Santos-Victor, Viewpoint Independent Detection of Vehicle Trajectories and Lane Geometry from Uncalibrated Traffic Surveillance Cameras, International Conference of Image Analysis and Recognition (ICIAR 2004), Porto, October 2004, pp. 454-462.

6. Peek Traffic VideoTrak Detection System. Described in http://www.peek-traffic.com/File.asp?FileID=ss96-081-1VideoTrak.

7. Worrall, A. D., Sullivan, G. D. and Baker, K. D., A simple, intuitive camera calibration tool for natural images, Proc. 5th British Machine Vision Conference, 13-16 September, University of York, York, 1994, pp. 781-790.

8. A policy on geometric design of highways and streets (AASHTO Green Book), American Association of State Highway and Transportation Officials, Jan. 2001, pp. 315-316.

9. K.W. Dickinson and R.C. Waterfall, Video image processing for monitoring road traffic, in Proc. IEE Int. Conf. Road Traffic Data Collection, Dec. 5-7, 1984, pp. 105-109.

10. R. Ashworth, D.G. Darkin, K.W. Dickinson, M.G. Hartley, C.L. Wan, and R.C. Waterfall, Applications of video image processing for traffic control systems, in Proc. 2nd Int. Conf. Road Traffic Control, London, U.K., Apr. 14-18, 1985, pp. 119-122.

11. Bickel, P., Chen, C., Kwon, J., Rice, J., van Zwet, E., Varaiya, P., Measuring traffic. (Preprint) June 2004, http://www.stat.berkeley.edu/users/rice/664.pdf

12. Drake, J.S., J.L. Schofer, and A.D. May, A statistical analysis of speed density hypotheses, Highway Research Record 154, Highway Research Board, NRC, Washington, D.C., 1967, pp. 53-87.

13. Cucchiara, R., Piccardi, M., Vehicle detection under day and night illumination, in Proc. of IIA’99 - Third Int. ICSC Symp. on Intelligent Industrial Automation, Special Session on Vision Based Intelligent Systems for Surveillance and Traffic Control, 1999, pp. 789-794.

14. Zwahlen, H.T., and Schnell, T., Driver-headlamp dimensions, driver characteristics, and vehicle and environmental factors in retroreflective target visibility calculations, Transportation Research Record 1692, National Academy of Sciences, Washington, D.C., 1999.

15. Masoud, O., Papanikolopoulos, N.P., Kwon, E., The use of computer vision in monitoring weaving sections, IEEE Trans. Intelligent Transportation Systems (2), No. 1, March 2001, pp. 18-25.
