
Active Vision for Road Scene Awareness

Andrew Dankers∗, Nick Barnes†, Alex Zelinsky‡

∗Research School of Information Sciences and Engineering, Australian National University, Canberra ACT Australia 0200, [email protected]

†National ICT Australia1, Locked Bag 8001, Canberra ACT Australia 2601, [email protected]

‡CSIRO ICT Centre, Canberra ACT Australia 0200, [email protected]

Abstract— We present a mapping approach to road scene awareness based on active stereo vision. We generalise traditional static multi-camera rectification techniques to enable active epipolar rectification with a mosaic representation of the output. The approach is used to apply standard static depth mapping and optical flow techniques to the active case. We use the framework to extract the ground plane and segment moving objects in dynamic scenes using arbitrarily moving cameras on a moving vehicle. The approach enables an estimation of the velocity of the vehicle relative to the road, and the velocity of objects in the scene. We provide footage of preliminary results of the system operating in real-time, including dynamic object extraction and tracking, ground plane extraction, and recovery of vehicle velocity.

I. INTRODUCTION

The concept of safety through prevention and preparedness is emerging as an automotive industry philosophy. The Smart Car project, a collaboration between the Australian National University, National ICT Australia, and the CSIRO, focusses its attention on Driver Assistance Systems for increased road safety. One aspect of the project involves monitoring the driver and road scene to ensure a correlation between where the driver is looking, and events occurring in the road scene [1]. The detection of objects on the road such as signs [2] and pedestrians [3], and the location of the road itself [4], form part of the set of observable road scene events that we would like to ensure the driver is aware of, or warn the driver about in the case that they have not noticeably observed such events. In this paper, we concentrate solely on the use of active computer vision as a scene sensing input to the driver assistance architecture. In particular, we focus on object and ground plane detection for subsequent use with higher level classification processes such as pedestrian, vehicle, and sign detection.

Stereo vision has become a viable sensor for obtaining three-dimensional range information [5]. Traditionally, stereo sensors have used fixed geometric configurations effective in obtaining range estimates for regions of relatively static scenes.

1 National ICT Australia is funded by the Australian Department of Communications, Information Technology and the Arts and the Australian Research Council through Backing Australia's Ability and the ICT Centre of Excellence Program.

In reducing processor expense, most depth-mapping algorithms match pixel locations in separate camera views within a small disparity range, e.g. ±32 pixels. Consequently, depth-maps obtained from static stereo configurations are often dense and well populated over portions of the scene around the fixed horopter, but they are not well suited to dynamic scenes or tasks that require well-resolved depth estimation over larger scene volumes.

A visual system able to adjust its visual parameters to aid task-oriented behaviour – an approach labeled active [6] or animate [7] vision – can offer impressive computational benefits for scene analysis in realistic environments [8]. By actively varying the camera geometry it is possible to place the horopter and/or vergence point over any of the locations of interest in a scene and thereby obtain maximal depth information about those locations. Where a subject is moving, the horopter can be made to follow the subject such that information about the subject is maximised. Varying the camera geometry not only improves the resolution of range information about a particular location, but by scanning the horopter, it can also increase the volume of the scene that may be densely depth-mapped. Figure 1 shows how the horopter can be scanned over the scene by varying the camera geometry for a stereo configuration. This approach is potentially more efficient and useful than static methods because a small disparity range scanned over the scene is less computationally expensive and obtains more dense results than a single, unscannable, but large disparity range from a static configuration.
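As a rough geometric illustration of this idea (not part of the paper's implementation), the sketch below estimates the fixation depth of a symmetrically verged stereo pair and the band of depths covered by a small ± disparity search window around the horopter; the baseline, focal length and vergence values are illustrative assumptions, not CeDAR's actual parameters.

```python
import numpy as np

def fixation_depth(baseline_m, vergence_rad):
    """Depth of the fixation point for symmetric vergence: each camera is
    toed in by vergence/2, so the optical axes meet at Z = (B/2)/tan(vergence/2)."""
    return (baseline_m / 2.0) / np.tan(vergence_rad / 2.0)

def depth_band(baseline_m, focal_px, vergence_rad, disp_range_px):
    """Approximate near/far depths searchable with a +/- disp_range_px window,
    using the parallel-geometry relation Z = f*B/d relative to the horopter."""
    z_fix = fixation_depth(baseline_m, vergence_rad)
    d_fix = focal_px * baseline_m / z_fix            # disparity at the horopter
    near = focal_px * baseline_m / (d_fix + disp_range_px)
    d_far = d_fix - disp_range_px
    far = np.inf if d_far <= 0 else focal_px * baseline_m / d_far
    return near, far

if __name__ == "__main__":
    B, f = 0.3, 320.0                                # assumed baseline (m), focal length (px)
    for verg_deg in (2.0, 4.0, 8.0):                 # scanning the horopter in and out
        near, far = depth_band(B, f, np.radians(verg_deg), disp_range_px=16)
        print(f"vergence {verg_deg:4.1f} deg: searchable depths "
              f"{near:5.2f} m to {far:6.2f} m")
```

Scanning the vergence angle therefore sweeps a narrow, cheap-to-search depth band through the scene, which is the effect depicted in Figure 1.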

In [9], we developed a framework for using existing static multiple-camera algorithms, such as depth-mapping, on active multi-camera configurations. The approach involved the use of a three-dimensional occupancy grid for integrating range information using active stereo. We have since improved and extended this framework with the goal of detecting, segmenting and localising objects in the road-scene. We show how the framework enables 3D mass flow in the scene to be calculated from standard optical flow algorithms on an active platform, because the active rectification process we have developed inherently removes the effect of camera motion.


Fig. 1. Scanning the horopter over the scene: The locus of zero disparity points defines a plane known as the horopter. For a given camera geometry, searching for pixel matches between left and right stereo images over a small disparity range defines a volume about the horopter. By varying the geometry, this searchable volume can be scanned over the scene. In the first frame, only the circle lies within the searchable region. As the horopter is scanned outwards by varying the vergence point, the triangle, then the cube become detectable.

We use this inherent benefit of the active rectification process to determine the velocity of objects detected in the scene relative to the vehicle and/or road at high frame rates. It is difficult to depict the achievements of an active vision system in a printed format, and for this reason, we have provided footage of the system and its components in operation (see Section X).

II. OUTLINE

We begin by presenting the research platform (Section III) and reviewing our background work on active rectification [9] (Section IV) and occupancy grid representation of the scene (Section V).

We introduce a space variant grid representation of the scene (Section V-B), and methods to analyse the occupancy grid for ground plane and object extraction (Section VI).

We present our approach to extracting three dimensional scene motion (Section VII) and describe how the 3D flow of mass in the scene can be used to extract the motion of objects and the road while the cameras are in any geometric configuration, or even moving. We describe how we use this actively collated occupancy and velocity information to segment and track objects (Section VIII).

Finally, we conclude with a summary of our present work, and comment on future improvements and additions to the system (Section IX).

III. RESEARCH PLATFORM

A 1999 Toyota Landcruiser 4WD is equipped with the appropriate sensors, actuators and other hardware to provide an environment in which desired driver assistance competencies can be developed [10]. Installed centrally inside the front windscreen is an advanced active stereo vision mechanism. CeDAR, the Cable-Drive Active-Vision Robot [11], incorporates a common tilt axis and two pan axes each exhibiting a range of motion of 90°. Angles of all three axes are monitored by encoders that give an effective angular resolution of 0.01°. An important kinematic property of the CeDAR mechanism is that the pan axes rotate about the optical centre of each camera, minimising kinematic translational effects to reduce complexity in the epipolar rectification process.

Fig. 2. The research platform. Left: The Smart Car (bottom) and CeDAR mounted behind the windscreen (top); Right: CeDAR.

Figure 2 shows the CeDAR platform as it is mounted in the Smart Car.

IV. ACTIVE RECTIFICATION

In [9] we described a rectification method used to actively enforce parallel epipolar geometry [12] using camera geometric relations, independent of the contents of the images. The camera relations may be determined by any number of methods. Visual techniques such as the SIFT algorithm [13] or Harris corner detection [14] can be used to identify features common to each camera view, and thereby infer the geometry. We use a fixed baseline and encoders to measure camera rotations. A combination of visual and encoder techniques could also be adopted to obtain the camera relationships more precisely; this is the focus of additional present work within the project [15].
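A minimal sketch of how a fixed baseline plus pan/tilt encoder readings could be assembled into the relative rotation and translation consumed by a rectification stage is shown below. The axis conventions, function names and baseline value are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def rot_y(theta):
    """Rotation about the camera y (pan) axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_x(theta):
    """Rotation about the camera x (tilt) axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def stereo_geometry(pan_left, pan_right, tilt, baseline_m=0.3):
    """Relative rotation and translation between the two cameras.

    Because each CeDAR pan axis rotates about its camera's optical centre,
    rotating a camera does not translate it: the translation between the
    cameras remains the fixed baseline, and the relative rotation is composed
    directly from the encoder angles (axis conventions assumed here).
    """
    R_left = rot_x(tilt) @ rot_y(pan_left)     # world <- left camera
    R_right = rot_x(tilt) @ rot_y(pan_right)   # world <- right camera
    R_rel = R_right.T @ R_left                 # maps left-camera coords into the right frame
    # Baseline vector from left to right camera centre, expressed in the right frame.
    t_rel = R_right.T @ np.array([baseline_m, 0.0, 0.0])
    return R_rel, t_rel

if __name__ == "__main__":
    R, t = stereo_geometry(pan_left=np.radians(2.0),
                           pan_right=np.radians(-2.0),
                           tilt=np.radians(-5.0))
    print("relative rotation:\n", np.round(R, 4))
    print("relative translation (m):", np.round(t, 4))
```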

The rectification process, an extension of similar work in [16], enables online epipolar rectification of the image sequences and the calculation of the shift in pixels between consecutive frames from each camera, and between the current frames from the left and right cameras (in the case of a stereo rig, though any number or configuration of cameras can be used as long as the translations and rotations between cameras are known). We have shown the effectiveness of the process by using it to create globally epipolar rectified mosaics of the scene as the cameras were moved. Figure 3 shows a snapshot of online output from the mosaic process for a single camera.

The active rectification process yields the relationship between the left and right camera view frames in terms of pixels, as well as rectifying the images. Therefore, all that needs to be done to obtain absolute disparities for a pair of concurrent images from any geometric configuration is to output a standard disparity map from the overlapping regions of the current rectified left and right images, and offset all disparities in the resulting disparity map by the pixel shift between the current left and right images, as calculated by the rectification process. It is then a simple matter to convert the absolute disparities to absolute depths (see [9]) to yield range data from active multi-camera configurations.
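The offsetting and conversion step amounts to only a few operations. The sketch below assumes a dense disparity map computed over the overlapping region of the rectified frames, the left-right pixel shift reported by the rectification stage, and a pinhole model with focal length f (pixels) and baseline B (metres); the variable names are illustrative.

```python
import numpy as np

def absolute_depth(disparity_map, left_right_shift_px, focal_px, baseline_m):
    """Convert a relative disparity map from an active stereo pair into absolute depth.

    disparity_map       : disparities from a standard matcher over the overlapping
                          region of the rectified left/right frames
    left_right_shift_px : pixel shift between the current left and right frames,
                          as reported by the active rectification process
    Returns depth in metres (inf where the total disparity is zero or negative).
    """
    absolute_disparity = disparity_map + left_right_shift_px
    depth = np.full_like(absolute_disparity, np.inf, dtype=np.float64)
    valid = absolute_disparity > 0
    # Standard parallel-geometry relation: Z = f * B / d
    depth[valid] = focal_px * baseline_m / absolute_disparity[valid]
    return depth

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    disp = rng.integers(-16, 17, size=(4, 6)).astype(np.float64)   # +/-16 px search range
    print(absolute_depth(disp, left_right_shift_px=40.0,
                         focal_px=320.0, baseline_m=0.3))
```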


Fig. 3. Snapshot of online output of the active rectification process: mosaic of rectified frames from the right CeDAR camera. The full video sequence is also available (Section X).

V. AN OCCUPANCY GRID REPRESENTATION OF THE SCENE

A. Summary of Previous Work

Traditionally, somewhat sparse and noisy stereo depth data has been used to judge the existence of mass at a location in the scene. Decisions based directly on such unfiltered data could adversely affect the sequence of future events reliant upon such a decision. In previous use of stereo range data, only a few attempts were made to strengthen or attenuate a belief in the location of mass in the scene [17]. Occupancy grids can be used to accumulate diffuse evidence about the occupancy of a grid of small volumes of space from individual sensor readings and thereby develop increasingly confident and detailed maps of a scene [18].

As well as addressing the above issues, an occupancy grid allows the integration of data according to a sensor model. Each pixel in the disparity map is considered as a single measurement for which a sensor model is used to fuse data into the occupancy grid. Not only is uncertainty in the measurements considered in the sensor model, but it is also partially absorbed by the granularity of the occupancy grid.

We integrate the active stereo depth map range data into a three-dimensional occupancy grid using a Bayesian approach [19]. Depth maps are produced using a processor-economical SAD-based technique with difference of Gaussian pre-filtering to reduce the effect of intensity variation [5]. Figures 5 and 6 show example occupancy grid output.
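To make the Bayesian integration concrete, here is a minimal log-odds occupancy grid sketch with a very simple hit/miss sensor model and the decay step described later; the class, probabilities and decay rate are assumptions for illustration and do not reproduce the sensor model used by the authors.

```python
import numpy as np

class OccupancyGrid3D:
    """Minimal log-odds occupancy grid: a sketch of a Bayesian update,
    not the sensor model actually used in the paper."""

    def __init__(self, shape, p_hit=0.7, p_miss=0.4, decay=0.02):
        self.log_odds = np.zeros(shape)
        self.l_hit = np.log(p_hit / (1.0 - p_hit))
        self.l_miss = np.log(p_miss / (1.0 - p_miss))
        self.decay = decay

    def integrate_ray(self, cells_along_ray, hit_index):
        """Cells before the measured depth are observed empty; the cell
        containing the depth measurement is observed occupied."""
        for i, (x, y, z) in enumerate(cells_along_ray):
            if i < hit_index:
                self.log_odds[x, y, z] += self.l_miss
            elif i == hit_index:
                self.log_odds[x, y, z] += self.l_hit
                break

    def decay_step(self):
        """Pull all certainties back towards 'unknown' so that the grid can
        track dynamic scenes (cf. the decay rate applied to cell certainties)."""
        self.log_odds *= (1.0 - self.decay)

    def probability(self):
        return 1.0 / (1.0 + np.exp(-self.log_odds))

if __name__ == "__main__":
    grid = OccupancyGrid3D((8, 8, 8))
    ray = [(3, 4, z) for z in range(8)]       # one ray of cells at slice index (3, 4)
    grid.integrate_ray(ray, hit_index=5)      # the depth measurement falls in slice 5
    print(np.round(grid.probability()[3, 4, :], 2))
```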

B. Space Variant Grid Representation and Ray Tracing

At farther scene depths, pixel disparities correspond to larger changes in scene depth. Accordingly, we have adopted an occupancy grid configuration that exhibits cell sizes that increase with depth. Cell cube edge lengths correspond to the effect of a specific number of pixels of disparity at that depth.

Fig. 4. Occupancy grid configuration

For example, 10 pixels of disparity at 1m scene depth yields a cell size much smaller than 10 pixels of disparity at 50m scene depth.

This approach significantly reduces the number of cells at larger depths where high depth resolution is not available anyway, improving processor performance. It also increases resolution in the grid at nearer depths where we are more interested in an accurate estimation of the location of objects.

The space variant occupancy grid formulation reduces ray tracing computations associated with sensor model integration of range measurements. At 1m depth, a slice through the occupancy grid contains a fixed number of cells in the horizontal direction and another fixed number of cells in the vertical direction. At any other depth, the number of vertical and horizontal cells in the slice are the same respectively, with the central cells aligned with the origin at the location of the sensor. This means that a ray emanating from the origin and passing through a cell at 1m with slice coordinates (x,y) also passes through all other slices at coordinates (x,y). This configuration means that ray-tracing through the occupancy grid for sensor integration becomes trivial. Figure 4 shows the construction of the occupancy grid and rays of cells emanating from the origin.
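The two properties above can be sketched in a few lines: slice boundaries chosen so that each slice spans a fixed number of pixels of disparity, and rays that reduce to a constant (x, y) cell index across slices. The focal length, baseline and disparity step below are assumed values for illustration.

```python
import numpy as np

def slice_depths(focal_px, baseline_m, disp_step_px=10, d_max_px=320, d_min_px=10):
    """Depth-slice boundaries chosen so that each slice spans a fixed number of
    pixels of disparity; cell edge lengths therefore grow with depth."""
    disparities = np.arange(d_max_px, d_min_px - 1, -disp_step_px)
    return focal_px * baseline_m / disparities           # nearest to farthest

def cells_along_ray(col_bin, row_bin, n_slices):
    """With the central cells of every slice aligned with the sensor origin, the
    ray through image bin (col_bin, row_bin) visits the cell with the same (x, y)
    index in every depth slice - no real ray tracing is needed."""
    return [(col_bin, row_bin, z) for z in range(n_slices)]

if __name__ == "__main__":
    depths = slice_depths(focal_px=320.0, baseline_m=0.3)
    print("slice boundaries (m):", np.round(depths, 2))
    print("cell edge grows from %.2f m to %.2f m" %
          (depths[1] - depths[0], depths[-1] - depths[-2]))
    print("ray for bin (12, 7):", cells_along_ray(12, 7, len(depths))[:4], "...")
```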

Updates to the grid occur at a frequency high enough for us to effectively analyze dynamic scenes through the use of a decay rate applied to all cell certainties. Propagation of uncertainties according to measured cell velocities would mean that the reliance upon high frequency updating and decay could be alleviated (uncertainty propagation is present work, see Section IX).

VI. ANALYSIS OF THE OCCUPANCY GRID

A. Ground Plane Extraction from the Occupancy Grid

Our approach to ground plane extraction is similar to that of a v-disparity analysis [20]. Essentially, we look at the occupancy grid from the side to produce an accumulator image plotting the number of occupied cells at a particular height and depth in the occupancy grid.


Fig. 5. Online snapshot of occupancy grid construction. Left: view from left camera. Right: occupancy grid; the two semi-transparent vertical planes show the near and far bounds on the region in which a depth response is possible, given the current camera geometry and disparity search range (±16 pixels).

Fig. 6. Online snapshot of occupancy grid construction with colour texturing of cell front faces.

A Hough transform [21] is then used to find the most dominant line in this density side view of the occupancy grid. We search for the line within reasonable bounds of where the road is likely to be to reduce the computational expense of the Hough transform. In this manner, we are able to extract a planar approximation to the location of the ground plane in terms of altitude and attitude. The assumption is that the sensor is, under most circumstances, situated at a roll angle that is parallel to the road, and that the road is planar, so that we do not consider the roll of the road relative to the sensor. The granularity of the occupancy grid is such that small violations of this assumption are somewhat absorbed. Any systematic misalignment can be removed by calibration. Figure 8 shows an image from the online output of the occupancy grid, including the location of the ground plane.
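The sketch below illustrates the side-view accumulator and the bounded line search. For simplicity it sweeps candidate (height, pitch) pairs directly rather than using a Hough parameterisation; the search bounds, tolerance and grid dimensions are assumptions for illustration.

```python
import numpy as np

def extract_ground_plane(occupied, cell_size_y_m, cell_size_z_m,
                         pitch_range_deg=(-5.0, 5.0), n_pitch=41,
                         height_range_m=(-2.0, 0.0), n_height=41, tol_m=0.15):
    """Fit y = y0 + z * tan(pitch) to the side view of the occupancy grid.

    occupied : boolean array indexed [x, y, z] (lateral, height, depth).
    Returns (y0, pitch_rad) of the densest line, i.e. a planar altitude/attitude
    estimate of the road. A brute-force stand-in for the bounded Hough search.
    """
    # Side-view accumulator: number of occupied cells at each (height, depth).
    side = occupied.sum(axis=0)                               # shape (ny, nz)
    ys = np.arange(side.shape[0]) * cell_size_y_m
    zs = np.arange(side.shape[1]) * cell_size_z_m

    best, best_score = (0.0, 0.0), -1
    for pitch in np.radians(np.linspace(*pitch_range_deg, n_pitch)):
        for y0 in np.linspace(*height_range_m, n_height):
            line_y = y0 + zs * np.tan(pitch)                  # expected road height per depth
            near = np.abs(ys[:, None] - line_y[None, :]) < tol_m
            score = side[near].sum()                          # occupied cells close to the line
            if score > best_score:
                best, best_score = (y0, pitch), score
    return best

if __name__ == "__main__":
    grid = np.zeros((20, 40, 30), dtype=bool)                 # toy (x, y, z) grid
    grid[:, 10, :] = True                                     # a flat "road" at one height
    y0, pitch = extract_ground_plane(grid, cell_size_y_m=0.1, cell_size_z_m=0.5,
                                     height_range_m=(0.0, 3.0))
    print(f"road height {y0:.2f} m, pitch {np.degrees(pitch):.2f} deg")
```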

B. Object Segmentation from the Occupancy Grid

An object in the occupancy grid is considered to be a group of 26-connected cells located above the ground plane (see Figure 10). We use a 3D raster scan to uniquely label connected components.
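As a hedged illustration of this segmentation step, the sketch below labels 26-connected groups of occupied cells above the ground plane using scipy's connected-component labelling with a full 3x3x3 structuring element, in place of the paper's 3D raster scan; the argument names and toy grid are assumptions.

```python
import numpy as np
from scipy import ndimage

def label_objects(occupied, ground_height_idx):
    """Label 26-connected groups of occupied cells above the ground plane.

    occupied          : boolean occupancy array indexed [x, y, z]
    ground_height_idx : per-depth y index of the ground plane, shape (nz,);
                        cells at or below it are ignored.
    Returns (labels, n_objects).
    """
    y_idx = np.arange(occupied.shape[1])[None, :, None]        # broadcast over x and z
    above_ground = occupied & (y_idx > ground_height_idx[None, None, :])
    structure = np.ones((3, 3, 3), dtype=int)                  # 26-connected neighbourhood
    labels, n_objects = ndimage.label(above_ground, structure=structure)
    return labels, n_objects

if __name__ == "__main__":
    grid = np.zeros((10, 10, 10), dtype=bool)
    grid[2:4, 5:7, 2:4] = True                                 # one compact object
    grid[7, 8, 7] = True                                       # another, isolated cell
    labels, n = label_objects(grid, ground_height_idx=np.full(10, 3))
    print("objects found:", n)
```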

VII. 3D SCENE FLOW

The velocities of cells in the occupancy grid are calculated using an approach similar to that of [22]. First, optical flow in each of the camera view frames is calculated to determine the x and y components of cell flow.

Fig. 7. A snapshot from an online sequence showing 3D vectors representing mass flow in the scene. The complete video sequence is available for viewing (Section X).

Next, consecutive depth-maps are subtracted to provide the z component. Because we are calculating flow in image space, we are able to assign sub-cell sized motions per frame (velocities) to cells in the occupancy grid.

Optical flow between consecutive images from a camera is calculated using a simple, but fast flow estimation. The rectification and mosaic construction process described earlier allows the removal of camera rotations and translations so that a standard SAD-based flow estimation can be used [5]. As the location of the current and previous frame in the mosaic from a single camera is known, we calculate optical flow on the overlapping region of consecutive frames in the mosaic. In the same manner, the overlapping regions of consecutive depth-maps are subtracted to obtain the depth flow in the z direction [22].

The flow vectors are binned into bin sizes corresponding to the image frame width of a cell in the occupancy grid. The average flow for each bin is then determined. For example, if the cell sizes in the occupancy grid were chosen such that the edge length of occupancy grid cubes corresponded to 10 pixels of disparity at a particular depth (as described earlier), then adjacent bins would contain the average of flow vectors for adjacent 10 by 10 regions in the flow images.

Having determined the x, y and z components of each bin-sized region of pixels in the image, these components are then overlaid upon the occupancy grid using ray tracing. We look along a ray in the occupancy grid that corresponds to a particular bin of pixels in the image until an occupied cell is found. The velocity components of the bin are then assigned to that occupied cell. Figure 7 shows online output of 3D flow.
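The binning and overlay steps can be sketched as follows, assuming a per-pixel flow field with x, y and z components and the ray-aligned grid indexing described in Section V-B; the array shapes and names are illustrative assumptions.

```python
import numpy as np

def bin_flow(flow_xyz, bin_px):
    """Average per-pixel flow (H, W, 3) over bin_px x bin_px regions,
    giving one mean (vx, vy, vz) per image bin."""
    h, w, _ = flow_xyz.shape
    hb, wb = h // bin_px, w // bin_px
    trimmed = flow_xyz[:hb * bin_px, :wb * bin_px]
    return trimmed.reshape(hb, bin_px, wb, bin_px, 3).mean(axis=(1, 3))

def assign_flow_to_grid(binned_flow, occupied):
    """Attach each bin's velocity to the first occupied cell along the ray of
    cells corresponding to that bin (same (x, y) cell index in every slice)."""
    flow_grid = np.zeros(occupied.shape + (3,))
    hb, wb, _ = binned_flow.shape
    for r in range(hb):
        for c in range(wb):
            hits = np.nonzero(occupied[c, r, :])[0]            # cells along this ray
            if hits.size:
                flow_grid[c, r, hits[0]] = binned_flow[r, c]
    return flow_grid

if __name__ == "__main__":
    flow = np.zeros((60, 80, 3)); flow[20:30, 40:50] = (1.0, 0.0, -0.5)
    occ = np.zeros((8, 6, 12), dtype=bool); occ[4, 2, 7] = True
    grid_flow = assign_flow_to_grid(bin_flow(flow, bin_px=10), occ)
    print(grid_flow[4, 2, 7])
```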

A. Object Segmentation from Scene Flow

After an occupancy grid – and subsequently a flow grid – has been calculated, we segment objects in the flow grid in a manner similar to that of the occupancy grid. A 3D raster scan labels adjacent 26-connected cells whose velocities are similar. We use information about the location of the ground plane from the previous step to limit the search for objects to the region above the ground plane.


Fig. 8. Online ground plane extraction from the occupancy grid. Inset shows view from left camera. Detection of the cyclist, light pole, and trees in the background are also evident on the occupancy grid. The complete video sequence is available for viewing (Section X).

Fig. 9. Vehicle velocity according to unfiltered 3D flow data.

Essentially, if a cell has a non-zero flow assigned to it, and its velocity is not significantly different from that of an adjacent cell with a non-zero velocity, it is assigned the same unique object identity as that cell.

B. Vehicle Motion from Scene Flow

We wish to infer the motion of the vehicle relative to the road from an analysis of the flow grid. Preferably, the analysis would not consider regions of the scene that are likely to be moving in a manner dissimilar to that of the road. Hence, we only consider regions in the vicinity of the ground plane to extract the vehicle velocity. Histograms of the velocity components of all the cells adjacent to the previously detected ground plane are constructed. At present, we use the histogram mean and associated 95% confidence interval as a measurement of the vehicle velocity. Once the velocity of the vehicle relative to the road has been calculated, we can remove the velocity of the vehicle from calculations of the velocity of objects in the scene.
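A minimal sketch of this estimator is given below: it takes the z-flow of cells adjacent to the detected ground plane and reports the mean with a normal-approximation 95% confidence interval. The units, frame rate and mask construction are assumptions, not the paper's exact procedure.

```python
import numpy as np

def vehicle_velocity(flow_grid, ground_mask, fps):
    """Estimate vehicle speed from the z flow of cells adjacent to the road.

    flow_grid   : (nx, ny, nz, 3) per-cell velocities in metres per frame
    ground_mask : boolean (nx, ny, nz), True for cells adjacent to the ground plane
    fps         : camera frame rate, to convert per-frame motion to m/s
    Road cells move towards the sensor, so the forward speed is the negated
    mean z flow; the CI uses a simple normal approximation.
    """
    vz = flow_grid[..., 2][ground_mask]
    vz = vz[vz != 0.0]                              # ignore cells with no flow assigned
    mean = -vz.mean() * fps
    ci95 = 1.96 * vz.std(ddof=1) / np.sqrt(vz.size) * fps
    return mean, ci95

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    flow = np.zeros((8, 6, 12, 3))
    mask = np.zeros((8, 6, 12), dtype=bool); mask[:, 1, :] = True
    flow[:, 1, :, 2] = -rng.normal(0.28, 0.03, size=(8, 12))   # ~0.28 m/frame towards camera
    speed, ci = vehicle_velocity(flow, mask, fps=30.0)
    print(f"vehicle speed {speed:.1f} m/s (+/- {ci:.2f}), ~{speed * 3.6:.0f} km/h")
```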

Fig. 10. Online object segmentation. Inset shows image acquired by right camera. The hand (blue) is segmented from the chair (burgundy) because it is moving. The complete sequence is available for viewing (Section X).

Figure 9 shows a preliminary plot of the vehicle velocity as determined by unfiltered 3D flow data. Only the flow in the z-direction (directly towards the cameras) is considered. Although the velocity of the vehicle was not logged, the fluctuation of the velocity about 30km/h fits well with the fact that the vehicle was driving in a designated 40km/h zone on the ANU campus. The velocity from flow was determined for the same sequence of footage as that shown in Figure 8.

VIII. TRACKING OBJECTS

Tracking an object initially involves finding object correspondences in consecutive frames. Data associated with each object in the occupancy grid includes its mass and centre of gravity. Objects segmented in the flow grid also include additional velocity information. By considering each object in the current frame and comparing the data associated with it to each object in the previous frame, we are able to determine likely object correspondences over time. The velocity information enables us to distinguish, for example, a person moving in front of a parked car. Fig 10 shows a hand that has been segmented from a chair using velocity information, despite being in contact with the chair and being labelled as the same object in the occupancy grid segmentation.
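One way to realise this frame-to-frame comparison is a simple greedy nearest-neighbour association over mass, centroid and velocity, sketched below; the cost weights, the greedy strategy and all names are assumptions for illustration rather than the tracker used in the paper.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class SceneObject:
    mass: float              # number of occupied cells (or summed certainty)
    centroid: np.ndarray     # centre of gravity in grid/world coordinates (m)
    velocity: np.ndarray     # mean cell velocity (m/s)

def match_objects(previous, current, w_pos=1.0, w_mass=0.05, w_vel=0.5, max_cost=3.0):
    """Greedy frame-to-frame association of segmented objects.

    Cost mixes centroid distance, mass change and velocity difference.
    Returns a list of (index_in_previous, index_in_current) pairs.
    """
    pairs, used = [], set()
    for j, cur in enumerate(current):
        costs = [w_pos * np.linalg.norm(cur.centroid - prev.centroid)
                 + w_mass * abs(cur.mass - prev.mass)
                 + w_vel * np.linalg.norm(cur.velocity - prev.velocity)
                 if i not in used else np.inf
                 for i, prev in enumerate(previous)]
        i_best = int(np.argmin(costs)) if costs else -1
        if i_best >= 0 and costs[i_best] < max_cost:
            pairs.append((i_best, j))
            used.add(i_best)
    return pairs

if __name__ == "__main__":
    prev = [SceneObject(120, np.array([2.0, 0.5, 8.0]), np.array([0.0, 0.0, -8.0])),
            SceneObject(40,  np.array([-1.5, 0.4, 5.0]), np.array([1.2, 0.0, -8.0]))]
    cur = [SceneObject(118, np.array([2.0, 0.5, 7.7]), np.array([0.0, 0.0, -8.1])),
           SceneObject(43,  np.array([-1.45, 0.4, 4.7]), np.array([1.1, 0.0, -8.0]))]
    print(match_objects(prev, cur))   # expect [(0, 0), (1, 1)]
```

The velocity term in the cost is what allows two touching objects with similar appearance in the grid, such as the hand and the chair of Fig 10, to remain distinct tracks.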

IX. CONCLUSION

A. Summary

We have presented a method for active epipolar rectification that has been shown to allow static stereo algorithms – such as disparity mapping and optical flow – to operate on an active stereo platform. We have presented a framework for active depth mapping with a robot-centered occupancy grid representation for data integration and filtering. We have been able to analyze the 3D occupancy grid representation of the scene to extract objects and an estimate of the location and attitude of the road.


We have incorporated image-based calculations of scene flow into the grid representation of the scene to enable sub-cell velocity estimations of mass detected in the scene. We are able to use this flow information to further segment objects in the scene, and to estimate the velocity of the vehicle relative to the road.

We are able to detect and segment moving or stationary objects in dynamic scenes (see Figure 10 and corresponding video footage) using moving cameras on a moving platform. Preliminary, unfiltered results show promise for the approach, with further gains in accuracy and efficiency expected.

B. Future Work

Presently, our approach has not incorporated methods to filter parameters important to the rectification process. Noisy angular readings are to be refined by a combination of filtering and integration with image-based methods to extract the camera geometry more precisely. Already, work near completion involves the incorporation of SIFT techniques to extract angles. In addition to improved geometry calculations, the location and velocity of the ground plane and objects detected in the scene are to be filtered rather than used at face value.

Certainties are currently integrated into the occupancy grid from observations of the current scene state, with a decay rate used to reduce these certainties over time and allow the analysis of dynamic scenes. Since we determine probability estimates of the current location of mass in the scene, as well as the likely velocity of that mass in three dimensions, we speculate that propagation of uncertainties within the occupancy grid according to their estimated velocities will yield increased certainty in the location of mass and reduce the reliance on the decay rate.

In addition to improving the accuracy of the system, future work involves its use as an input for road scene understanding. At present, objects are segmented from the scene using the 3D occupancy and flow approach, and we are already able to map these segmented objects back into image space to obtain their image frame profile. Objects are thus extracted from their surroundings and background. This extraction process is expected to assist higher level tasks such as classification of objects in the scene.

The ability to obtain an awareness of the scene regardless of the geometry of the active head is the first step towards experiments in fixation and gaze arbitration. Our approach enables information about the scene to be gathered regardless of the geometry of the cameras, and for an increased, scannable volume of the scene. Actively tracking objects and retaining an awareness of other objects in the scene such that attention can be shifted between regions of the scene according to priority is the next step in our work towards artificial vision. Our work enables such gaze arbitration experimentation in realistic environments.

X. FOOTAGE

It is difficult to depict the achievements of an active vision system in a printed format. For this reason, we have published footage of the system in operation at:

http://www.rsise.anu.edu.au/~andrew/ivs05

XI. ACKNOWLEDGMENTS

The authors would like to acknowledge the significant contributions made by all members of the Smart Cars project. In particular, we thank Niklas Pettersson, Lars Petersson and Luke Fletcher for their direct contributions to this paper.

REFERENCES

[1] L. Fletcher, N. Barnes, and G. Loy, "Robot vision for driver support systems," in International Conference on Intelligent Robots and Systems, 2004.
[2] G. Loy and N. Barnes, "Fast shape-based road sign detection for a driver assistance system," in International Conference on Intelligent Robots and Systems, 2004.
[3] G. Grubb, A. Zelinsky, L. Nilsson, and M. Rilbe, "3D vision sensing for improved pedestrian safety," in Intelligent Vehicles Symposium, 2004.
[4] N. Apostoloff and A. Zelinsky, "Vision in and out of vehicles: Integrated driver and road scene monitoring," in Intelligent Transport Systems, 2003.
[5] J. Banks and P. Corke, "Quantitative evaluation of matching methods and validity measures for stereo vision," International Journal of Robotics Research, 1991.
[6] J. Aloimonos, I. Weiss, and A. Bandyopadhyay, "Active vision," International Journal of Computer Vision, 1988.
[7] D. Ballard, "Animate vision," Artificial Intelligence, 1991.
[8] R. Bajcsy, "Active perception," International Journal of Computer Vision, 1988.
[9] A. Dankers, N. Barnes, and A. Zelinsky, "Active vision - rectification and depth mapping," in Australian Conference on Robotics and Automation, 2004.
[10] A. Dankers and A. Zelinsky, "Driver assistance: Contemporary road safety," in Australian Conference on Robotics and Automation, 2004.
[11] H. Truong, S. Abdallah, S. Rougeaux, and A. Zelinsky, "A novel mechanism for stereo active vision," in Australian Conference on Robotics and Automation, 2000.
[12] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed., Cambridge University Press, ISBN 0521540518, 2004.
[13] S. Se, D. Lowe, and J. Little, "Vision-based mobile robot localization and mapping using scale-invariant features," in International Conference on Robotics and Automation, 2001.
[14] C. Harris and M. Stephens, "A combined corner and edge detector," in Alvey Vision Conference, 1988.
[15] N. Pettersson and L. Petersson, "Online stereo calibration using FPGAs (submitted)," in Intelligent Vehicles Symposium, 2005.
[16] A. Fusiello, E. Trucco, and A. Verri, "A compact algorithm for rectification of stereo pairs," Machine Vision and Applications, 2000.
[17] H. Moravec, "Robot spatial perception by stereoscopic vision and 3D evidence grids," Carnegie Mellon University, Pennsylvania, 1996.
[18] A. Elfes, "Using occupancy grids for mobile robot perception and navigation," Carnegie Mellon University, Pennsylvania, 1989.
[19] H. Moravec, "Sensor fusion in certainty grids for mobile robots," Carnegie Mellon University, Pennsylvania, 1989.
[20] R. Labayrade, D. Aubert, and J. Tarel, "Real time obstacle detection on non flat road geometry through 'v-disparity' representation," in Intelligent Vehicle Symposium, 2002.
[21] T. Tian and M. Shah, "Recovering 3D motion of multiple objects using adaptive Hough transform," in International Conference on Pattern Analysis and Machine Intelligence, 1997.
[22] S. Kagami, K. Okada, M. Inaba, and H. Inoue, "Realtime 3D depth flow generation and its application to track to walking human being," in International Conference on Robotics and Automation, 2000.