
CVGIP: IMAGE UNDERSTANDING

Vol. 56, No. 1, July, pp. 108-129, 1992

Dynamic Aspects in Active Vision

MASSIMO TISTARELLI AND GIULIO SANDINI

Department of Communication, Computer and Systems Science, Integrated Laboratory for Advanced Robotics (LIRA-Lab), University of Genoa, Via Opera Pia 11A, 16145 Genoa, Italy

Received November 11, 1991; accepted March 18, 1992

The term active stresses the role of the motion or, generally speaking, the dynamic interaction of the observer with the environment. This concept emphasizes the relevance of determining scene properties from the temporal evolution of image features. Within the active vision approach, two different aspects are considered: the advantages of space-variant vision for dynamic visual processing and the qualitative analysis of optical flow. A space-variant sampling of the image plane has many good properties in relation to active vision. In particular, two examples are presented in which a log-polar representation is used for active vergence control and to estimate the time-to-impact during tracking egomotion. These are just two modules which could fit into a complete active vision system, but they already highlight the advantages of space-variant sensing within the active vision paradigm. In the second part of the paper the extraction of qualitative properties from the optical flow is discussed. A new methodology is proposed in which optical flows are analyzed in terms of anomalies, i.e., unexpected velocity patterns or inconsistencies with respect to some predicted dynamic feature. Two kinds of knowledge are necessary: about the dynamics of the scene (for example, the approximate motion of the camera) and about the task to be accomplished, which is analogous to (qualitatively) knowing the kind of scene I would expect to see. The first simply implies the measurement of some motion parameters (for example, with inertial sensors) directly on the camera, or putting some constraints on the egomotion. The second requirement implies that the visual process is task-driven. Some examples are presented in which the method is successfully applied to robotic tasks. © 1992 Academic Press, Inc.

1. INTRODUCTION

The development of dynamic behaviors activated by the sensed world and aimed at a given task is the central issue of active perception [24].

This approach differs from more traditional ones in that the exploration of the environment is performed by interacting with it, and not only while interacting with it. Actions like grasping, pushing, or touching objects are purposively planned with the aim of understanding the physical properties of the objects and not only their iconic features.

A primary, although not the most important, advantage of this approach is that it allows a simplification of many computational problems related to visual processing, by actively introducing an additional constraint [2].¹

Within this framework, two different aspects will be presented in this paper. The first is related to the problem of “active fixation” and to the superiority of a space-variant sensing strategy, with respect to uniform resolution sensors, in solving behavioral problems like where-to-look-next, tracking, and estimation of time-to-impact. The second is the essential role of vision in understanding physical properties of the scene subjected to purposively planned actions.

One of the key aspects of active vision is concerned with the problem of where-to-look-next or, in other words, where to shift the focus of attention during exploration of the environment or, more simply, during motor actions. This general problem will be discussed with respect to the use of a space-variant sensing strategy which mimics the distribution of the photoreceptors of the human retina. In fact, the existence of a region of the visual field where details are perceived with the greatest accuracy (the fovea) undoubtedly implies a much stricter relationship between the focus of attention and the fixation point. Moreover, even if the spatial positions of these two points may not be coincident (as has been demonstrated in many experiments and in everyday experience), their uncoupling is possible only by an explicit and voluntary overriding of the motor action which tends to bring the fixation point close to the focus of attention.²

A related issue, which will be discussed in the remainder of the paper, is the fact that visuomotor coordination is essential to fully exploit the use of a space-variant sensor, because the center of the visual field is sampled more densely than the periphery (which, however, is much larger than that of a traditional sensor with the same number of photoreceptors).

¹ Provided by the motion of the observer itself.

² In laboratory experiments the subjects must be asked to maintain fixation while shifting attention; it is good experimental practice to actually check whether the subject is capable of performing as requested.


An important observation is that this visuomotor strategy is based upon the task being performed;³ vision can only be active.

In this way, active vision seems to be implied by, and complementary to, the notion of space-variant sensing.

The peculiarities of a space-variant sensor are also beneficial to other visual processes, such as the tracking of a moving target in space (which is complementary to fixation). This task implies the measurement of a small displacement near the image center (the “tracking error”), while the motion in the periphery of the visual field is large.⁴

Therefore, the highest resolution part of the visual field near the fixation point provides the maximum resolution in the estimation of the tracking errors and, even more importantly, the periphery of the visual field is not overloaded with useless information. It is worth noting, however, that, as long as the sensor can be actively moved, any area within the visual field can be sampled at the desired resolution.

Another crucial advantage of a space-variant sampling structure, in particular with a log-polar distribution of the sampling elements, is related to the computation of dynamic parameters, like the time-to-impact. A log-polar sensor implicitly decomposes the optical flow into an expanding flow and a rotational flow, relative to the fovea, where only the former is proportional to the time-to-impact. In the remainder of the paper, the advantages of the log-polar mapping for the computation of the time-to-impact will be discussed.

Optical flows, for a task-oriented system, can be analyzed for qualitative properties. The reduced resolution in a space-variant vision system, as well as the requirements of real-time vision systems, does not allow the accurate computation of image parameters or dynamic data like optical flows. Therefore, it has been proposed to exploit optical flows for qualitative analysis, rather than for quantitative estimation.⁵

Moreover, the expected results from image analysis are strongly dependent on the task to be accomplished. There is no need to accurately compute the 3D structure of a path just to walk on it or to avoid obstacles.

³ As reported in [61], humans follow different strategies in scanning the scene with the eyes depending on the task to be accomplished, like recognition or detection of parts.

⁴ The velocity component in the periphery is only due to the rotational motion of the camera, which is quadratic in the image Cartesian coordinates referred to the image center; therefore, the amplitude of image displacements increases quadratically, regardless of environmental depth, going from the fovea toward the periphery.

⁵ In this respect, binocular stereo seems best suited for a quantitative analysis; in fact, stereo disparity is generally obtained through a correlation approach, which enforces the integration of measurements, while velocity estimates are generally based on differential operators, which tend to enhance image noise. Therefore, stereo estimates can be more robust than velocity estimates.

Current research in autonomous navigation has demonstrated that, in many cases, a rough measurement is sufficient [22, 21, 12, 44, 35]. Still, it is not well understood which kind of measurement is needed in navigation for obstacle avoidance, and this is true for many other robotic applications. Therefore, some qualitative parameters can be sufficient in many cases. For example, in the case of navigation, it would be sufficient to detect the obstacle and compute how much time there is to plan and follow an avoidance trajectory (the time-to-impact), while the exact estimation of distance and shape may be redundant.

In the second part of this paper we will consider the estimation of qualitative properties of optical flow within the active vision paradigm. In particular, we will discuss how it is possible to take advantage of some a priori knowledge about the motion and/or the scene. This knowledge is central to active vision because, contrary to other approaches, the motion of the observer is purposively planned.

Following this approach, the optical flow can be analyzed in terms of anomalies, i.e., unexpected velocity patterns or inconsistencies with respect to some expected dynamic behavior. In this respect, we believe that by stressing the concept of visual expectations at the lowest possible level it is not only possible but also highly beneficial to constrain the understanding process.

In this part of the work, the qualitative analysis of optical flows, in terms of anomalies in the vector field, is discussed and some examples related to robotic applications are presented.

2. THE LOG-POLAR TRANSFORMATION

In the human visual system the receptors of the retina are distributed in space with increasing density toward the center of the visual field (the fovea) and decreasing density from the fovea toward the periphery [46, 10, 42, 60].

The space-variant sampling performs a topological transformation of the image plane which is described as a conformal mapping of the points on the polar (retinal) plane (ρ, η) onto a Cartesian (log-polar) plane (ξ = log ρ, γ = η), where the values of (ρ, η) can be obtained by mapping the Cartesian coordinates (x, y) of each pixel into the corresponding polar coordinates. The resulting log-polar projection is invariant, under certain conditions, to linear scalings and rotations of the retinal image. These complex transformations are reduced to simple translations along the coordinate axes of the log-polar image. This property is valid if, and only if, the scene and/or the sensor moves along (scaling) or around (rotation) the optical axis.

The same properties hold in the case of a simple polar mapping of the image, but a linear dilation around the fovea is transformed into a linear shift along the radial coordinate in the (ρ, η) plane, whereas the log-polar transformation produces a constant shift along the radial coordinate of the log-polar projection. The geometric properties of the polar and log-polar mappings are well known [32, 59, 42, 60], as is the considerable data reduction they allow (because the image is not equally sampled throughout the field of view) while preserving a high resolution around the fovea. These properties turn out to be very effective in focusing attention on a particular object feature, or in tracking a moving target (i.e., stabilizing the image of the target in the fovea) [32, 39, 59, 58, 40, 50].

FIG. 1. Picture of the prototype CCD retinal sensor.

2.1. A CCD Retina-like Sensor

In this paper we will refer to the physical characteristics of a prototype sensor which has been designed within a collaborative project involving several partners.⁶ This is not a limitation; in fact, the results could easily be generalized to any particular log-polar mapping by modifying the (constant) parameters involved in the transformation.

The log-polar mapping is implemented in a circular CCD array [18, 16], using a polar scan of a space-variant sampling structure, characterized by a linear relationship between the sampling period and the eccentricity (the distance from the center of the sensor) [18]. In Fig. 1⁷ a picture of the prototype CCD sensor is presented.

The log-polar transformation is defined by

$$\xi = \log_a \frac{\rho}{\rho_0}, \qquad \gamma = q\,\eta, \qquad (1)$$

where (ρ, η) are the polar coordinates of a point on the retinal plane, and ρ₀, q, and a are constants determined by the physical layout of the CCD sensor.

⁶ The institutions involved in the design and fabrication of the retinal CCD sensor are: DIST-University of Genoa, Italy; University of Pennsylvania, Department of Electrical Engineering; Scuola Superiore "S. Anna", Pisa, Italy. The actual fabrication of the chip was done at IMEC, Leuven, Belgium.

⁷ Currently, the performance of the CCD sensor is being evaluated using a prototype camera. The reported experiments are carried out by resampling standard TV images following the geometry of the sampling structure of the sensor.
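To make the mapping of Eq. (1) concrete, the sketch below resamples a uniformly sampled image onto a (ξ, γ) grid, mirroring the way the reported experiments resample standard TV images. It is only an illustration under assumed parameters: ρ₀, the base a, and the grid size are chosen here for convenience (in the real device they are fixed by the physical layout of the CCD), and nearest-neighbour sampling replaces the receptive-field averaging of the actual sensor.

```python
import numpy as np

def cartesian_to_logpolar(image, n_rings=30, n_angles=64, rho_0=2.0):
    """Resample a Cartesian image onto a (xi, gamma) log-polar grid.

    xi    = log_a(rho / rho_0)   (radial index, Eq. (1))
    gamma = q * eta              (angular index, q = n_angles / (2*pi))
    Nearest-neighbour sampling is used for simplicity.
    """
    h, w = image.shape
    cx, cy = w / 2.0, h / 2.0
    rho_max = min(cx, cy)
    # Base 'a' chosen so that n_rings rings span [rho_0, rho_max].
    a = (rho_max / rho_0) ** (1.0 / n_rings)

    xi = np.arange(n_rings)                    # ring index
    gamma = np.arange(n_angles)                # angular index
    rho = rho_0 * a ** xi                      # inverse of xi = log_a(rho/rho_0)
    eta = gamma * (2.0 * np.pi / n_angles)     # inverse of gamma = q * eta

    # Sampling positions on the Cartesian plane (one per (xi, gamma) cell).
    xs = cx + np.outer(rho, np.cos(eta))
    ys = cy + np.outer(rho, np.sin(eta))
    xs = np.clip(np.round(xs).astype(int), 0, w - 1)
    ys = np.clip(np.round(ys).astype(int), 0, h - 1)
    return image[ys, xs]                       # shape (n_rings, n_angles)

# A pure scaling of the scene about the fovea shifts this map along xi,
# and a rotation about the optical axis shifts it along gamma.
if __name__ == "__main__":
    img = np.random.rand(256, 256)
    logpol = cartesian_to_logpolar(img)
    print(logpol.shape)   # (30, 64), matching the 30 x 64 layout used in the paper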


3. ACTIVE VISION AND SPACE-VARIANT SENSING

The role of space-variant vision is to probe the environment at high resolution (with the fovea) and, at the same time, to limit the amount of detailed information (the fovea occupies a small part of the field of view). The processing of foveal data is restricted to either static or stabilized (i.e., tracked) objects. In fact, it is not possible to provide detailed information on moving objects.⁸ Therefore, a reflex mechanism exists in humans to keep the fixation on the same object point regardless of ego- and/or ecomotion [37, 1], stabilizing the image on the fovea [27, 19, 28, 7, 50].

The broad field of view provided by peripheral vision has both alerting and guidance functions. With respect to foveal vision, peripheral vision has enhanced dynamic performance, used for motion detection and for providing the motor control with the information necessary to foveate and track moving objects [11, 4, 2]. Besides these dynamic characteristics, peripheral vision also has a role in the recognition of gross features.

The concepts of foveal and peripheral vision are useless without an efficient motor control capable of directing the fovea toward the desired focus of attention. This function is also important in order to establish the appropriate relations between different locations of the visual field, regardless of the actual image feature falling on the fovea. We argue that the active control of fixation is not always driven by cognitive processes; in most cases this mechanism is activated by task-related parameters. For example, a fast moving object within the field of view catches one's attention, invoking a fast saccadic movement of the eye to foveate and track the object. A preattentive, reflex-like process must drive gaze control for fixation.

Similarly, we can consider the vergence control of a binocular vision system. In this case we can split the control into two parts, one devoted to the tracking of the object (performed by the dominant eye) and the other keeping the vergence of the eyes on the point currently tracked (performed by a coordinated control of both eyes). Both tracking and vergence control can be achieved by means of low-level processing, which further motivates a preattentive process.

Usually, visual attention is related to the process of interpretation of the visual scene: “if I want to recognize a face I had better shift my attention to the eyes and the mouth.” Our point is that “attention” also has a very strong behavioral basis: “if I want to walk I had better look in front of my feet.” The major difference between these two views is that while in the former case an a priori hypothesis is necessary, based on some high-level processing or expectation, in the latter everything is based on self-generated actions which are, usually, task-driven. In the examples given above, both focusing and vergence involve the selection of a point in space before the processing could even start.

⁸ For example, consider the blurring effect perceived on close objects when looking out of a train window.

A problem that arises is to determine how to direct attention to a particular point in space; three different mechanisms can be considered:

• Involuntary reflex or reflex-like eye movements. For example, when looking out of the window of a moving train the eye movement strategy is characterized by the so-called optokinetic nystagmus, which is composed of smooth movements of the eyes (during which the eyes track a point of the environment) interrupted by fast saccadic movements which reset the direction of gaze. In this case it is virtually impossible to keep the eyes steady, even if the subject is totally uninterested in the scene. The resulting motion strategy maximizes the period during which “useful information” can be analyzed (i.e., information not degraded by motion blurring). This behavior is, again, preattentive or, in other words, does not require “high-level processing.” Another example of this kind is the stabilization of gaze performed during active or passive motion of the head (for example, during motion). This eye-head coordination strategy is performed partially through the vestibulo-ocular reflex (which only accounts for the correction of the motion of the head) and partially by active visual tracking (the control of fixation also depends upon the distance of the tracked object, which cannot be determined by the vestibular mechanism). It is worth noting that, also in this case, it is impossible to stabilize the orientation of the visual axis by looking at an “abstract” point in space (if this were to happen, the visual information would be corrupted by motion blurring).

• Task-driven gaze control. A very common example of this kind of control is the strategy adopted during walking. In this case the smooth reflex-like movements described previously are interrupted by saccadic movements which are, in general, directed alternately at the ground plane (to detect small obstacles and to evaluate the roughness of the floor [22, 21, 12, 44]) and at points far away (used to self-orient in space). During this period the peripheral part of the visual field plays a double role: the first, usually limited to the perifoveal region, is to evaluate image velocity (used to maintain fixation), and the second is to detect unexpected changes of the image brightness (or color) distribution. It is worth noting that, also in this case, the direction of gaze does not depend on a high-level reasoning process.

• Voluntary fixation control. In this case fixation is directly controlled by high-level processes like recognition. For example, wishing to self-orient in a partially known environment, a human moves the gaze around, trying to identify some known feature. The direction of gaze is moved toward several points in the scene which exhibit a particular feature that can facilitate recognition. Doors and windows constitute potential reference objects; therefore fixation can be driven toward vertical edges. Detection of known objects can be performed by moving fixation along relevant features like edges. This approach will be explained in the section on recognition of 2D shapes.

Also in this case, however, it is possible to design motion control strategies based upon “categorical” features (i.e., features not related to the specific object as a whole but to some of its constituent parts) like corners, luminance or color discontinuities, symmetries, and centroids.

This behavioral connotation is stressed even more during tracking and for the computation of time-to-impact. In fact, if we consider the requirements of a tracking process, the only situation in which this process is not active is when the camera is fixed and looking at a steady environment. In all other situations the need to reduce motion blurring forces the activation of the tracking system which, consequently, cannot be actively suppressed. The alternative, proposed by current technological trends, is to use high-speed shutters which “freeze” the image by sampling very briefly in time. This certainly avoids motion blurring but still does not free the overall system from measuring precise information about egomotion in order to be able to extract useful information from the evolving scene (in particular, to separate image velocities generated by egomotion from those due to objects moving independently in the scene: the case of a moving object tracked by a moving camera). Tracking a point of the environment, on the contrary, gives the system a reference point which does not move in the environment and simplifies the computation of time-to-impact [7, 25]. It is worth noting, moreover, that the tracking process is data-driven, as is the case for focusing and vergence. The role of high-level processing is limited to the selection of the target to track, which is often based on the behavioral task being executed (think, for instance, of the visuomotor coordination during walking which has been discussed before). Therefore, the peculiarities of a retina-like space-variant sensor in relation to visual tracking and vergence control are among the major advantages of this approach.

The relevance of a space-variant sensor for active vision is further stressed by the advantages in the computation of parameters for motor control. In the remainder of the paper we will illustrate a simple algorithm for vergence control and a new method for the estimation of the time-to-impact,⁹ which take advantage of the use of a space-variant visual sensor.

⁹ The time-to-impact represents a parameter which directly relates the observer to the environment, also by taking into account the dynamic evolution of the events. It is very useful for navigation and has a very distinct role from depth or range.

3.1. Binocular Vergence Control

In binocular systems gaze and fixation control involve keeping the optical axes directed at the point in space currently fixated. This is accomplished through active vergence control.

Olson and Coombs [36] and also Coombs and Brown [13, 14] demonstrated a simple and efficient vergence control system based on cepstral filtering of stereo images. The rationale was the computation of a gross cross-correlation score between the left and right views, the maximum correlation identifying the correct vergence of the cameras.

A similar algorithm has been devised for space-variant stereo images sampled with the retinal sensor. The basic idea is that of computing a pointwise cross-correlation between the log-polar projections of the left and right views. The vergence between the cameras is varied by moving the nondominant camera, while the cumulative cross-correlation is computed. Even though the log-polar images do not differ by a simple translation, as uniformly sampled raster images do, the global correlation of the images still provides a measure of the displacement between them. When the cameras are almost correctly verged, the global cross-correlation becomes very high. Therefore, if the global cross-correlation decreases, the nondominant camera is moved in the opposite direction until the maximum is reached. In Fig. 2a a diagram is shown which illustrates the values of the inverse, global cross-correlation (simply computed as the normalized sum of the absolute differences of intensity values between the left and right image) obtained by applying the algorithm to the original, uniformly sampled images. As can be noted, the outlines of the diagrams in Figs. 2a and 2b are very similar, but the one obtained from the log-polar images has a sharper peak at the correct vergence angle. This fact implies a faster convergence of the algorithm using images sampled with the retinal sensor. It is worth noting that the space-variant sampling intrinsically emphasizes the central part of the image, weighting it more than the periphery. As a result the cross-correlation function is much sharper, making the vergence control much easier.

FIG. 2. Inverse, global cross-correlation of uniformly sampled (a) and space-variant (b) images, computed for different vergence angles θ.

FIG. 3. Results of the vergence control algorithm. (top) Original images, (bottom) log-polar maps, with the result of the inverse cross-correlation in the middle. (a) Starting vergence. (b) Correct vergence found by the gradient descent algorithm. Note that the values of the correlation are almost zero.

In Fig. 3 the images used to obtain the diagrams in Fig. 2 are shown. The image pairs corresponding to the maximum vergence and to the correct vergence found using a gradient descent algorithm, together with the pixel-by-pixel inverse cross-correlation of the log-polar images, are shown.
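The following sketch outlines the vergence search described above: the inverse, global cross-correlation is taken as the normalized sum of absolute intensity differences between the two log-polar images, and the vergence angle of the nondominant camera is stepped until the score stops decreasing. The interface function `acquire_pair`, the step size, and the stopping rule are hypothetical placeholders for the actual camera-head control.

```python
import numpy as np

def inverse_cross_correlation(left_lp, right_lp):
    """Normalized sum of absolute intensity differences between the left and
    right log-polar images (the 'inverse, global cross-correlation' of Fig. 2)."""
    return np.abs(left_lp.astype(float) - right_lp.astype(float)).mean()

def find_vergence(acquire_pair, theta0, step=0.5, min_step=1e-3, max_iter=100):
    """Greedy descent on the inverse cross-correlation.

    `acquire_pair(theta)` is a hypothetical camera-head interface that verges
    the nondominant camera to angle `theta` and returns a (left, right) pair
    of log-polar images."""
    theta = theta0
    best = inverse_cross_correlation(*acquire_pair(theta))
    direction = 1.0
    for _ in range(max_iter):
        candidate = theta + direction * step
        score = inverse_cross_correlation(*acquire_pair(candidate))
        if score < best:          # moving this way sharpens the correlation peak
            theta, best = candidate, score
        else:                     # overshot the peak: reverse and refine the step
            direction = -direction
            step *= 0.5
            if step < min_step:
                break
    return theta
```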

3.2. Motion and Time-to-Impact

3.2.1. Measurement of Image Velocity

The dynamic evolution of the scene can be described in terms of the instantaneous velocity of (projected) image points. This measurement can be represented as the optical flow computed on the log-polar plane.

The optical flow is computed by solving an overdetermined system of linear equations in the unknown terms (u, v) = V. The equations impose the constancy of the image brightness over time [26] and the stationarity of the image motion field [54, 17]:

$$\frac{dI}{dt} = 0, \qquad \frac{d}{dt}\,\nabla I = 0, \qquad (2)$$

where I represents the image intensity of the point (x, y) at time t. The least squares solution of (2) is computed for each point on the log-polar plane as

$$\mathbf{V}(\xi, \gamma, t) = (A^{T}A)^{-1}A^{T}\mathbf{b}, \qquad
A = \begin{pmatrix} I_{\xi} & I_{\gamma} \\ I_{\xi\xi} & I_{\xi\gamma} \\ I_{\xi\gamma} & I_{\gamma\gamma} \end{pmatrix}, \qquad
\mathbf{b} = -\begin{pmatrix} I_{t} \\ I_{\xi t} \\ I_{\gamma t} \end{pmatrix}, \qquad (3)$$

where (ξ, γ) represent the point coordinates on the log-polar plane. This method allows a direct measurement of velocity without iterative computations; the resulting flow field is very dense and the estimation is more robust with respect to image noise (considered both in the spatial and temporal domain) than, for example, using second-order derivatives only.
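A minimal sketch of the per-point least-squares solution is given below. It assumes a small stack of log-polar frames, computes the spatio-temporal derivatives with finite differences, and solves V = (AᵀA)⁻¹Aᵀb at each pixel using the A and b entries as reconstructed in Eq. (3); the small damping term and the use of `np.gradient` are implementation choices of this sketch, not part of the original method.

```python
import numpy as np

def optical_flow_ls(frames, eps=1e-6):
    """Per-pixel least-squares flow on a (t, xi, gamma) image stack.

    Implements V = (A^T A)^{-1} A^T b with
        A = [[I_xi,      I_gamma     ],
             [I_xixi,    I_xigamma   ],
             [I_xigamma, I_gammagamma]],
        b = -[I_t, I_xit, I_gammat]     (Eq. (3), as reconstructed above).
    """
    I = frames.astype(float)
    I_t, I_xi, I_g = np.gradient(I)              # derivatives along t, xi, gamma
    I_xixi = np.gradient(I_xi, axis=1)
    I_xig  = np.gradient(I_xi, axis=2)
    I_gg   = np.gradient(I_g,  axis=2)
    I_xit  = np.gradient(I_xi, axis=0)
    I_gt   = np.gradient(I_g,  axis=0)

    mid = I.shape[0] // 2                        # solve at the central frame
    flow = np.zeros(I.shape[1:] + (2,))
    for r in range(I.shape[1]):
        for c in range(I.shape[2]):
            A = np.array([[I_xi[mid, r, c],   I_g[mid, r, c]],
                          [I_xixi[mid, r, c], I_xig[mid, r, c]],
                          [I_xig[mid, r, c],  I_gg[mid, r, c]]])
            b = -np.array([I_t[mid, r, c], I_xit[mid, r, c], I_gt[mid, r, c]])
            AtA = A.T @ A + eps * np.eye(2)      # small damping for numerical stability
            flow[r, c] = np.linalg.solve(AtA, A.T @ b)
    return flow                                  # flow[..., 0] = xi-dot, [..., 1] = gamma-dot
```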


3.2.2. Computation of Time-to-Impact

The ability to quickly detect an obstacle in order to react and avoid it is of vital importance for animates. Passive vision techniques can be beneficially adopted if active movements are performed [7, 6, 33, 25, 4, 40, 41]. A dynamic spatiotemporal representation of the scene, which is the time-to-impact with the objects, can be computed from the optical flow extracted from monocular image sequences acquired during tracking movements of the sensor [7, 43, 37, 47, 55, 50].

Jain and co-workers [29, 8] pointed out the advantages of processing the optical flow, due to camera translation, by using a log-polar complex mapping of the images and choosing the position of the FOE as the center for the representation.

It is possible to generalize this property to more general and complex kinds of motion. Generally, any expansion of the image of an object, due to the motion of either the camera or the object itself, will produce a radial component of velocity on the retinal plane. This intuitive observation can be stated in the following way:

the time-to-impact of a point on the retinal plane affects only the radial component of the optical flow.

We will formally prove this assertion in the remainder of the paper. It turns out that the most convenient way of representing and analyzing velocity is in terms of its radial and angular components with respect to the fovea [51].

Let us consider, for the moment, a general motion of the camera, both rotational, with angular velocity (φ, θ, ψ), and translational, with velocity (W_x, W_y, W_z). The velocity on the image plane along the radial and angular coordinates, V = (ξ̇, γ̇), is [51, 52]

$$\dot{\xi} = \left[ \frac{W_z}{Z} - \frac{F}{\rho Z}\left(W_x \cos\eta + W_y \sin\eta\right) + \left(\frac{\rho}{F} + \frac{F}{\rho}\right)\left(\phi \sin\eta - \theta \cos\eta\right) \right] \log_a e$$

$$\dot{\gamma} = q\,\frac{F}{\rho}\left[ \left(\frac{W_x}{Z} + \theta\right)\sin\eta + \left(\phi - \frac{W_y}{Z}\right)\cos\eta \right] - q\,\psi, \qquad (4)$$

where (ξ, γ) are defined as in Eq. (1) and F is the focal length. These equations simply show that, while both components of the optical flow depend upon the depth Z of the objects in space, only the radial component ξ̇ depends upon the time-to-impact Z/W_z. Moreover, only the angular component γ̇ depends upon rotations around the optical axis, while the radial component is invariant with respect to ψ. Note that up to now we have not made any hypothesis about the motion of the sensor. Therefore Eq. (4) certainly holds for any relative motion between the observer and the objects in the scene.

Equation (4) can be further developed in the case of tracking egomotion. By imposing V(0, 0) = 0 in the general optical flow equations and substituting the resulting values of W_x and W_y into (4), we obtain

$$\dot{\xi} = \left[ \frac{W_z}{Z} + \left(\frac{\rho}{F} + \frac{F}{\rho}\left(1 - \frac{D_z}{Z}\right)\right)\left(\phi \sin\eta - \theta \cos\eta\right) \right] \log_a e$$

$$\dot{\gamma} = q\,\frac{F}{\rho}\left(1 - \frac{D_z}{Z}\right)\left(\theta \sin\eta + \phi \cos\eta\right) - q\,\psi. \qquad (5)$$

D_z is the distance of the fixation point from the retinal plane, measured at the frame time following the one where the optical flow is computed.

By assuming the depth Z to be locally constant, the time-to-impact can be computed. Differentiating (5) with respect to the log-polar coordinates (note that ρ = ρ₀ a^ξ and ∂Z/∂ξ = 0, ∂Z/∂γ = 0) gives

$$\frac{\partial\dot{\gamma}}{\partial\gamma} = \frac{F}{\rho}\left(\frac{D_z}{Z} - 1\right)\left(\phi \sin\eta - \theta \cos\eta\right)$$

$$\frac{\partial\dot{\xi}}{\partial\xi} = \left[\frac{\rho}{F} - \frac{F}{\rho}\left(1 - \frac{D_z}{Z}\right)\right]\left(\phi \sin\eta - \theta \cos\eta\right), \qquad (6)$$

and, eliminating the rotational term between (5) and (6),

$$\frac{Z}{W_z} = \left[\dot{\xi}\,\log_e a - \frac{\partial\dot{\xi}}{\partial\xi} + 2\,\frac{\partial\dot{\gamma}}{\partial\gamma}\right]^{-1}. \qquad (7)$$

This equation allows the direct computation of the time-to-impact from the images only. Note that only first-order derivatives of the optical flow are required and the equation does not include the pixel position, which is generally difficult to compute accurately. The parameters q and a are calibrated constants of the CCD sensor. It is interesting to relate this result to the divergence approach proposed by Thompson [48] and also, recently, by Nelson and Aloimonos [35]. Equation (7) can be regarded as a formulation of the oriented divergence for the tracking motion, modified to take into account the fact that the sensor is planar and not spherical.
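The sketch below applies Eq. (7), as reconstructed above, to a log-polar flow field such as the one produced by the previous routine. The derivatives of the flow are taken with simple finite differences and no regularizing smoothing is included, so it should be read as an illustration of the formula rather than as the authors' implementation.

```python
import numpy as np

def time_to_impact(flow, a, eps=1e-6):
    """Time-to-impact Z/W_z from a log-polar flow field via Eq. (7),
    as reconstructed above:
        Z/W_z = [ xi_dot*ln(a) - d(xi_dot)/d(xi) + 2*d(gamma_dot)/d(gamma) ]^(-1).

    `flow[..., 0]` is xi_dot, `flow[..., 1]` is gamma_dot (rings x angles),
    and `a` is the calibrated base of the logarithm of the sensor."""
    xi_dot, gamma_dot = flow[..., 0], flow[..., 1]
    dxidot_dxi = np.gradient(xi_dot, axis=0)      # derivative along the radial index
    dgdot_dg = np.gradient(gamma_dot, axis=1)     # derivative along the angular index
    inv_tti = xi_dot * np.log(a) - dxidot_dxi + 2.0 * dgdot_dg
    # Mask near-zero values to avoid meaningless, huge time-to-impact estimates.
    return 1.0 / np.where(np.abs(inv_tti) < eps, np.nan, inv_tti)
```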

In Fig. 4a the first and last image of a sequence of 15 are shown. The images have been acquired at a resolution of 256 × 256 pixels, using a conventional CCD camera mounted on a mobile platform (at a height of about 80 cm above the ground); the optical axis of the camera was kept almost parallel to the direction of motion. The vehicle was moving forward at a speed of 200 mm/s and the images were captured at a rate of 8 frames/s. In the scene are a big moving box in the foreground and a desk in the background. The box is moving independently along a trajectory inclined with respect to the vehicle trajectory (it was not moving straight toward the camera).

FIG. 4. (a) First and last image of the sequence. (b) Retinal sampling applied to the central image, and simulated output of the retinal CCD sensor, represented in the Cartesian (x, y) and log-polar (ξ, γ) planes.

In Fig. 4b the simulated output of the retinal sensor, relative to the central frame of the sequence, is shown. The resulting images, obtained following the characteristics of the CCD sensor, are 30 × 64 pixels. In Fig. 5 the optical flow relative to the images in Figs. 4a and 4b is shown.

The time-to-impact Z/W_z, computed by applying Eq. (7) to the optical flow in Fig. 5, is shown in Fig. 6a. Gray levels are directly proportional to the time-to-impact. Despite the low resolution (1920 pixels) the object with the smaller time-to-impact (the moving box) is correctly located.

Another equation for the time-to-impact can be obtained by computing the second-order partial derivative of ξ̇ [51]:


$$\frac{Z}{W_z} = \left[\dot{\xi}\,\log_e a - \frac{\partial^2\dot{\xi}}{\partial\xi^2}\,\log_a e\right]^{-1}. \qquad (8)$$

This equation clearly states that the time-to-impact can be computed using only the radial component of the velocity with respect to the fovea. Moreover, this formulation does not depend on the motion of the fixated target, and it has been demonstrated in [52] that it can be applied to a completely general motion of the camera and of the objects in the scene.

As has been shown in [53, 9], numerical differentiation is an ill-posed problem. In fact, the solution to the differentiation of a continuous function does not depend continuously on the input data. Differentiation acts like a high-pass filter, amplifying the high-frequency noise. A solution to the problem of transforming differentiation into a well-posed problem has already been suggested [9]. This can be accomplished by filtering the input data with an appropriate regularizing filter. In practice, a smooth low-pass filter (like a Gaussian) can be used.

Optical flows are generally noisy, and consequently the recovered time-to-impact will be affected by noise. In practice, however, the application of a smoothing filter can reduce the noise, while preserving the qualitative properties of the time-to-impact. More precise solutions could be found by means of an appropriate interpolation of the flow field, but only if such accuracy is mandatory for the given task. It has been shown that, for navigation purposes, a qualitative evaluation of the time-to-impact (even just locating the closest objects in the scene) is sufficient to safely avoid obstacles or to find a collision-free course [35].

In the work presented by Nagel [34] the computation of the optical flow and its derivatives from gray levels is discussed. The application of these results to directly estimate the time-to-impact from the equations presented in this paper will be explored in a continuation of this work.

4. QUALITATIVE ANALYSIS OF OPTICAL FLOW

The optical flow field (or image velocity field) can be easily described through differential equations, and a well-settled mathematical theory exists which can be used to handle vector fields. As already pointed out by Verri and Poggio [57] and also by Nelson and Aloimonos [35], this representation can be analyzed in terms of qualitative properties. Some scene features of a temporally evolving event (the motion of an object and/or the camera, the approach to a very close obstacle, etc.) can generally be identified by performing simple local operations which exploit some general properties of the optical flow. The focus of expansion, for example, is among these properties and can be easily determined with local, parallel operations, yet it makes a great deal of information explicit, like the direction of motion of the objects or of the camera.

FIG. 5. Optical flow of the sequence in Fig. 4b, represented in the Cartesian (a) and log-polar (ξ, γ) (b) planes.

Girosi and co-workers [54] and others [56, 17] further investigated this topic, developing a general framework to analyze optical flows in terms of “singular points” and then determine the 3D motion which provoked the image velocity, simply by determining the singular point on the image plane. In this context very few assumptions were made, mainly to simplify and in some cases linearize the equations, but still giving a general methodology.


FIG. 6. Time-to-impact of the scene in Fig. 4, computed by applying Eq. (7) to the optical flow in Fig. 5. For clarity, data are represented on the retinal and log-polar plane.

Nevertheless, many problems remain concerning the possibility of identifying the singular points (due to unknown motions) and then unambiguously characterizing them.

A great simplification can be obtained using some a priori information about, for example, the motion of the camera (or, conversely, by constraining the camera movements, as in the case of an active observer [43, 4]). In such cases (for example, if the motion of the observer is known) it is possible to look for some qualitative properties of the optical flow, which can help in finding scene characteristics like obstacles or other global parameters useful during navigation.

An example, which is very important in navigation, concerns the estimation of the heading direction. Consider a mobile robot wishing to move along a direction parallel to the optical axis of a camera mounted on board. To accomplish this task, it is necessary to judge the error in the trajectory in order to adjust the camera heading and/or the direction of motion of the vehicle. This error measure can be simply obtained by checking the position of the focus of expansion on the image plane. An error in the motion direction will be denoted by a shift of the FOE with respect to the image center. In this case the corresponding behavior would be to change the heading so as to track the FOE, keeping it in the image center.

This task can be simplified by using a space-variant sensor. Since a dilation is mapped into a translation on the log-polar plane, it is sufficient to check whether the flow vectors are all parallel to the radial ξ axis. This is a simple operation which requires less computation than locating the FOE on the image plane. If the velocity vectors are not parallel, the appropriate correcting motion is executed until all the vectors are parallel again. This is possible if the check is executed continuously, so that only small corrections are necessary.
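A sketch of this check is given below: on the log-polar plane, pure translation along the optical axis should leave only a radial (ξ) flow component, so the mean angular deviation of the flow vectors from the ξ axis can serve as a heading-error signal. The tolerance value is hypothetical.

```python
import numpy as np

def heading_error(flow, tol_deg=5.0):
    """Check whether the log-polar flow is consistent with pure forward
    translation along the optical axis: in that case all vectors should be
    parallel to the radial xi axis (an expanding flow, positive xi_dot).
    Returns the mean angular deviation in degrees and a flag suggesting
    that a heading correction is needed."""
    xi_dot, gamma_dot = flow[..., 0], flow[..., 1]
    angles = np.degrees(np.arctan2(gamma_dot, xi_dot))   # 0 deg == parallel to xi
    deviation = np.nanmean(np.abs(angles))
    return deviation, deviation > tol_deg
```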

Another example concerns a robot moving along a corridor; if there are no obstacles along the path and the vehicle is proceeding along the center of the corridor, then the velocity field should be symmetric with respect to a vertical line passing through the image center. In this case all the points projecting onto the left and right subimages are at the same distance from the camera; therefore the velocity should be roughly the same. If an obstacle is on the path, then the flow vectors will be larger in the image area corresponding to the obstacle, and an asymmetric flow field will be obtained. The same effect is observed if the vehicle is not proceeding in the middle of the corridor. In either case the same behavior is required, which is to correct the motion trajectory. The same principle can be applied to passing between two obstacles (supposing the obstacles have been identified somehow). In order to pass safely between the obstacles, the optical flow must be symmetric; the trajectory can be corrected until this condition is satisfied.
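The corridor test reduces to an equally simple measurement, sketched below as a normalized left/right comparison of the flow amplitude: the sign of the result indicates on which side the flow is larger, and a large magnitude calls for a trajectory correction. The normalization and the absence of a threshold are implementation choices of this sketch.

```python
import numpy as np

def flow_asymmetry(flow_magnitude):
    """Compare the mean flow amplitude on the left and right halves of the
    image; a large normalized difference signals either an obstacle on one
    side or a drift from the middle of the corridor, both of which call for
    a trajectory correction."""
    h, w = flow_magnitude.shape
    left = np.nanmean(flow_magnitude[:, : w // 2])
    right = np.nanmean(flow_magnitude[:, w // 2 :])
    return (left - right) / max(left + right, 1e-6)
```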

These are two examples in which global, qualitative parameters can be extracted from the optical flow. The observer behavior can then be based on these simple quantities. It is worth noting that, as the computation is performed on all the flow vectors, the resulting measurement is very robust and insensitive to noise. Moreover, the kind of behavior underlying these examples is to incrementally correct the motion parameters of the observer. Therefore, measurements can be integrated over time, or, in other words, the motion is stabilized by continuously verifying the flow field, according to the described criteria.

In the following section we consider the case in which it is possible to take advantage of the knowledge of some parameters about the motion of the observer or the environment. Again, the task to be accomplished dictates the strategy adopted to estimate scene characteristics.

4.1. Optical Flow Anomalies

Temporal changes in the images can be due to a variety of causes, sometimes unpredictable. The interpretation of image displacements (here treated as optical flow) can be difficult, but it becomes possible if:

• the visual process is task-driven (or purposively engaged [2]);

• some a priori information about the motion and/or the scene is available.

In a number of cases some a priori information on the nature of the observed scene and/or proprioceptive knowledge can be reasonably obtained without limiting the generality of the method. In such cases a prediction of the optical flow and/or the object's surfaces can be made. In this paper we investigate a new paradigm for the analysis of the image velocity field, namely the detection of anomalies in the optical flow.

Following this approach, the optical flow can be analyzed in terms of anomalies, i.e., unexpected velocity patterns or inconsistencies with respect to some predicted dynamic feature. In order to identify and understand the anomalies in the optical flow, two kinds of knowledge are necessary: about the dynamics of the scene, for example, the approximate motion of the camera, and about the task to be accomplished (already proposed as active vision or purposive vision [2]), which is analogous to (qualitatively) knowing the kind of scene I would expect to see. The first simply implies the measurement of some motion parameters (for example, with inertial sensors) directly on the camera, or putting some constraints on the egomotion. The second requirement implies that the visual process is task-driven. Therefore we know what kind of anomaly should be found in the optical flow.

The analysis is performed as a hypothesis/prediction formation stage, which is usually off-line, and an on-line verification. The first phase is based on some constraints which allow the off-line estimation of some ego or eco parameters. Such parameters do not need to be computed explicitly, but can be hidden in a global reference map. For this phase it is necessary to define some features in the optical flow which are invariant to all possible events that are feasible in reality, except the one we want to detect. Both the event and the feature must be defined explicitly for every given application.

The verification phase consists of the comparison between the hypothesized reference pattern, relative to the defined optical flow feature, and the on-line computed optical flow. It will not always be necessary to explicitly compute the optical flow; a minimal data set, like only one velocity component, could be sufficient.

The prediction is task-driven and a given reference pattern can only be used to detect a given anomaly. While this approach may seem less general than others, it always gives simpler solutions which are in fact very robust.

In order to clarify and give a taste of the concept of flow anomalies, a few examples are reported:

• detection of static obstacles without using depth maps;

• detection of moving obstacles and measurement of their velocity, from a moving camera;

• monitoring of object manipulation.

These examples illustrate the applicability of the approach to a wide range of problems, in the area of robotics and machine vision, which still do not have a feasible solution.

4.1.1. Ground Plane Obstacle Detection

This example refers to a common problem in robotic navigation: the avoidance of static obstacles lying on the ground plane. Such a problem requires a reliable and fast method to detect obstacles on the path; on the other hand, computational accuracy is not mandatory for simple reactive behaviors like detecting an obstacle within the field of view and trying to avoid it.

The algorithm assumes a constrained egomotion within a partially constrained environment (an “I know how I am moving, but I can only suppose what I will see” paradigm). The camera is mounted on the moving vehicle and directed toward the ground, with the optical axis forming a fixed angle with the ground plane. The vehicle translates with constant velocity. In a “standard” case the camera should “see” the ground plane, and a hypothesis about the optical flow pattern can be made. In this case the anomaly is defined as a difference between the expected and the computed optical flow amplitude.

The method consists of two phases:

• an off-line calibration, during which a reference optical flow relative to the ground plane and a given camera velocity is computed;

• a fast on-line procedure during which the amplitude of the optical flow is computed from the incoming images. Static obstacles are detected by taking the difference between the amplitude of the computed image velocity and the reference velocity map estimated during the calibration phase.

An explicit evaluation of the errors involved in computing the reference velocity map and the on-line velocity estimation can be made, to determine the appropriate threshold to locate an obstacle.

Similar approaches have been proposed [21, 12] for the detection of static obstacles in outdoor scenes. Apart from the different scenario, we trust our method to be more robust, thanks to the calibration phase and the fact that the computation of the total optical flow is not required in the on-line phase.

Off-Line Calibration. As only rough estimates are needed for obstacle detection, the calibration phase is not conceived as a procedure to accurately estimate the camera and/or motion parameters. Rather, an approximate velocity map is computed.

The reference velocity map is obtained by averaging several optical flows (i.e., for each image point a velocity vector which is the mean value over a set of optical flows is computed). Therefore many optical flows are computed from a long sequence (typically 40-50 images). As proposed in [22] for stereo calibration, an estimate of the variance of the optical flows over time can be used to determine how many flow fields must be added to obtain the final map. This variance estimate is performed point by point and a global threshold is used to decide when to stop the averaging process.

In Fig. 7, one image out of a sequence of 47 used to compute the reference optical flow is presented. The optical flow, computed by averaging 10 optical flows, is shown in Fig. 8a. It was computed by filtering the sequence with a spatiotemporal Gaussian with a σ equal to 1 and then applying Eq. (3) to the five central images (the derivatives are computed with a five-point (frames) mask). The amplitude of the vectors is displayed in Fig. 8b.
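The calibration step can be summarized as in the sketch below: optical flows from the calibration sequence are averaged into a reference map, and a running per-pixel variance (maintained here with Welford's update) decides when enough flow fields have been accumulated. The variance threshold and the maximum number of flows are illustrative values, not the ones used in the experiments.

```python
import numpy as np

def build_reference_map(flow_iterator, var_threshold=0.05, max_flows=50):
    """Average successive optical flows into a reference velocity map,
    stopping when the running per-pixel variance falls below a global
    threshold (the threshold value here is only illustrative).

    `flow_iterator` yields flow fields of shape (H, W, 2) computed from the
    calibration sequence (e.g., with the least-squares routine above)."""
    mean = None
    m2 = None          # running sum of squared deviations (Welford's method)
    n = 0
    for flow in flow_iterator:
        n += 1
        if mean is None:
            mean = flow.astype(float)
            m2 = np.zeros_like(mean)
            continue
        delta = flow - mean
        mean += delta / n
        m2 += delta * (flow - mean)
        if n >= 2 and np.mean(m2 / (n - 1)) < var_threshold:
            break                      # flow statistics have stabilized
        if n >= max_flows:
            break
    return mean
```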

The reference map intrinsically contains information about the position of the camera with respect to the vehicle and the motion of the vehicle. Therefore a reference map can only be used for a given camera setup and motion trajectory. Nevertheless it is possible to generalize the method by building a set of reference maps, one for each possible (useful) direction of motion of the vehicle, with a known orientation of the camera (which could always be the same). The speed of the vehicle simply acts as a scale factor for the velocity map; hence the same velocity map can be used with different vehicle speeds.

FIG. 7. One image of the sequence used for the calibration phase.

Obstacle Detection. During the on-line phase of the algorithm, the amplitude of the optical flow is computed and compared with the reference map. If the difference between the two estimates, normalized to the velocity of the vehicle, exceeds a fixed threshold, an obstacle is detected. In fact, it is well known that, in the case of translational motion, only the amplitude of the velocity vectors changes with the object's distance. Even if a threshold must be set, it does not seem to be critical. An adaptive threshold which depends only on the “regularity” of the ground plane, which is assumed to be flat, could be adopted.

Instead of computing the total optical flow, only the component along the direction of the intensity gradient, Vⁿ, is needed to characterize the scene [26]. The amplitude of the true velocity vectors is obtained by projecting Vⁿ along the direction given by the FOE computed from the reference map [44],

$$|\mathbf{V}| = \frac{|\mathbf{V}^{n}|^{2}\,|\mathbf{A}|}{\mathbf{V}^{n}\cdot\mathbf{A}} = \frac{-I_{t}\,|\mathbf{A}|}{I_{x}A_{x} + I_{y}A_{y}}, \qquad (9)$$


FIG. 8. (a) The computed reference optical flow. (b) Amplitude of the reference optical flow.

where I_x = ∂I(x, y, t)/∂x, I_y = ∂I(x, y, t)/∂y, I_t = ∂I(x, y, t)/∂t, and A = (A_x, A_y) = (x − FOE_x, y − FOE_y) is the directional vector of the straight line connecting the considered image point with the FOE. This vector also corresponds to the velocity direction predicted from the calibration phase, assuming a given vehicle motion and camera setup.

In this scheme only simple geometrical operations, which can be easily implemented on general-purpose hardware, are performed. As a matter of fact, the computation of Vⁿ is less sensitive to quantization errors and is faster than solving the optical flow equations (3), since only first-order derivatives are required.
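A sketch of the on-line phase follows: the velocity amplitude is obtained from first-order derivatives and the FOE direction via Eq. (9), and obstacles are flagged where the difference from the reference map, normalized to the vehicle speed, exceeds a threshold. The derivative scheme, the handling of small denominators, and the threshold value are assumptions of this sketch.

```python
import numpy as np

def velocity_amplitude_from_normal_flow(I_prev, I_next, foe, dt=1.0, grad_eps=1e-6):
    """Amplitude of the image velocity from the normal flow, projected along
    the direction predicted by the FOE (Eq. (9)):
        |V| = -I_t * |A| / (I_x * A_x + I_y * A_y),
    with A = (x - FOE_x, y - FOE_y).  Only first-order derivatives are used."""
    I_t = (I_next.astype(float) - I_prev.astype(float)) / dt
    I_y, I_x = np.gradient(I_prev.astype(float))      # rows ~ y, columns ~ x
    h, w = I_prev.shape
    y, x = np.mgrid[0:h, 0:w]
    A_x, A_y = x - foe[0], y - foe[1]
    denom = I_x * A_x + I_y * A_y
    safe = np.where(np.abs(denom) < grad_eps, np.nan, denom)
    return -I_t * np.hypot(A_x, A_y) / safe

def detect_obstacles(amp, reference_amp, vehicle_speed, threshold=0.3):
    """Flag pixels where the measured amplitude deviates from the reference
    map by more than a fixed fraction of the vehicle speed (the threshold
    value is hypothetical)."""
    return np.abs(amp - reference_amp) / vehicle_speed > threshold
```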

In Fig. 9 the first and last image from a subset of 10 images are shown. The sequence was acquired from a camera mounted on a TRC Labmate mobile platform, moving at a speed of about 200 mm/s. The movement of the vehicle was a translation along a direction almost parallel to the ground plane. The camera optical axis was pointing toward the ground and turned slightly to the right.

FIG. 9. Two images from a sequence of 47 used in the obstacle detection experiment.

In the experiment, the ground plane and the obstacle (a rucksack) were covered with colorful posters, which provide a random pattern, to demonstrate that the obstacle is effectively located from differential velocity measurements, while it could not be detected by analyzing the intensity values only or by means of a simple thresholding of the velocity field (which would be zero on the ground plane in the absence of significant texture).

In Fig. 10 the thresholded difference between the computed velocity amplitude and the amplitude of the reference optical flow in Fig. 8 is shown. It has been obtained by thresholding the difference image in three amplitude maps and verifying the consistency of marked pixels over time. In Fig. 10 it is possible to locate the two obstacles, represented by the small box on the upper right part of the original images and the vase with flowers on the upper left.

4.1.2. Identification of Moving Obstacles from a Moving Camera

The detection of moving obstacles is certainly a very important task in visual navigation. In the past most of the work has been devoted to the detection of static obstacles [5, 15, 31, 30] or the tracking of moving objects with a steady, possibly rotating, camera [38, 45, 3, 39]. The main problem arising when trying to detect moving objects during egomotion stems from the fact that discontinuities in the optical flow are hard to find in the presence of noise [49]. Moreover, such discontinuities can be difficult to interpret because they can be due to many scene features, such as depth discontinuities.

In the case of translational motion, the amplitude of the image velocity is inversely proportional to depth, while the direction of the flow vectors is independent of depth. This last feature is sensitive to independent object motion, while it is invariant to depth. Therefore, it can be used as a feature which reveals anomalies due to object motion.

The proposed algorithm assumes a translational motion of the camera. This is not a very restrictive assumption; if the rotational motion of the vehicle is known, for example from odometric measurements, and the pose of the camera with respect to the vehicle is also known, the rotational component of the optical flow can be easily computed and the translational component estimated from V. The optical flow due to camera translation is determined by subtracting the rotational flow from the total optical flow.

FIG. 10. Thresholded difference between the amplitude of the optical flow and the reference velocity map. The detected obstacle (a rucksack) can be located in the upper part of the picture.

In the initial phase of the process, a prediction of the direction of the translational flow in the absence of object motion is made. It relies on the independence of the flow direction from depth. In the case of translational egomotion the image velocity can be expressed as

$$\mathbf{V}_t(x, y) = \left( \frac{xW_z - FW_x}{Z},\; \frac{yW_z - FW_y}{Z} \right), \qquad (10)$$

where $(W_x, W_y, W_z)$ are the components of the camera velocity, $F$ is the focal length of the camera in pixels, and $Z$ is the distance of the considered point from the camera. It is trivial to show that only the modulus of $\mathbf{V}_t$ depends on $Z$, while its direction does not:

$$|\mathbf{V}_t(x, y)| = \frac{W_z\,\varrho(x, y)}{Z}, \qquad \Lambda(\mathbf{V}_t) = \frac{yW_z - FW_y}{xW_z - FW_x} = \frac{y - \mathrm{FOE}_y}{x - \mathrm{FOE}_x}, \qquad (11)$$

where $\varrho(x, y)$ denotes the distance of the image point $(x, y)$ from the FOE and $\Lambda(\mathbf{X})$ represents the direction of the vector $\mathbf{X}$ (the tangent of the angle subtended by the vector with the $x$ axis).
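The independence of the direction from depth follows directly from the expression of the FOE for translational motion; a one-line derivation:

$$\mathrm{FOE} = \left(\frac{FW_x}{W_z},\, \frac{FW_y}{W_z}\right) \;\Longrightarrow\; xW_z - FW_x = W_z\,(x - \mathrm{FOE}_x), \quad yW_z - FW_y = W_z\,(y - \mathrm{FOE}_y),$$

so that Eq. (10) becomes $\mathbf{V}_t = \frac{W_z}{Z}\,(x - \mathrm{FOE}_x,\; y - \mathrm{FOE}_y)$: the depth $Z$ scales only the modulus, while the direction is always that of the line joining the image point to the FOE.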

In the case of roto-translational motion the velocity vectors can be decomposed into two terms, $\mathbf{V}_t$ and $\mathbf{V}_r$, due to camera translation and rotation, respectively. The translational


vector $\mathbf{V}_t$ must be computed because $\Lambda(\mathbf{V})$ depends on the depth $Z$:

$$\Lambda(\mathbf{V}) = \frac{F \cdot K_y + Z \cdot H_y}{F \cdot K_x + Z \cdot H_x}, \qquad K_x = xW_z - FW_x, \quad K_y = yW_z - FW_y, \qquad (12)$$

$$H_x = xy\,\omega_x - [x^2 + F^2]\,\omega_y + y\,\omega_z, \qquad H_y = [y^2 + F^2]\,\omega_x - xy\,\omega_y - x\,\omega_z,$$

where $(\omega_x, \omega_y, \omega_z)$ denotes the rotational velocity of the camera.

The direction of the flow vectors at each image point is computed using the FOE position. The FOE can be determined from image sequences in one of the following ways: computation from odometric measurements of the vehicle motion; computation from a reference optical flow obtained from an image sequence acquired during a given vehicle motion with a fixed camera setup; or estimation using a robust least squares technique to determine the pseudo-intersection of the translational flow vectors. This last method, which is the most practical, can be applied if the moving objects do not cover the majority of the field of view; the solution then approximates the real FOE position well.
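The pseudo-intersection can be written as a small linear least-squares problem: each translational flow vector lies on a line through the FOE, and the FOE is the point minimizing the sum of squared distances to all those lines. A sketch (a robust variant would iteratively down-weight vectors far from the current estimate, such as those on moving objects):

```python
import numpy as np

def estimate_foe(points, flows):
    """Least-squares pseudo-intersection of the flow lines.
    points: (N, 2) image coordinates; flows: (N, 2) flow vectors."""
    p = np.asarray(points, float)
    f = np.asarray(flows, float)
    n = np.stack([-f[:, 1], f[:, 0]], axis=1)            # normals to the flow directions
    n /= np.linalg.norm(n, axis=1, keepdims=True) + 1e-12
    # each line: n_i . q = n_i . p_i ; solve for q = FOE in the least-squares sense
    b = np.sum(n * p, axis=1)
    foe, *_ = np.linalg.lstsq(n, b, rcond=None)
    return foe
```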

During the on-line phase optical flows are computed from the incoming images. The deviation of the direction of the velocity vectors from the predicted velocity map is the anomaly used to detect points on moving objects. The concurrence of object and camera motion and the influence of noise, especially if correlated, as in the case of inexact image displacements due to egomotion drift, can make it difficult to set a threshold on the deviations in flow direction due to object motion. In such cases, differences in the velocity amplitude can also be used because, on moving objects, the amplitude does not vary smoothly.

Once the object's silhouette has been identified, the velocity of the object can be computed from the optical flow.

In Fig. 11 the 21st and 23rd images from a sequence of 47 are shown. The sequence was acquired from a camera mounted on a TRC Labmate mobile platform, moving at a speed of 200 mm/s. The movement of the vehicle was a translation along a direction almost parallel to the optical axis. The camera optical axis was pointing toward the ground and slightly turned to the right. In Fig. 12a the optical flow computed from the sequence in Fig. 11 is presented. The difference between the direction of the flow vectors and the predicted directions is shown in Fig. 12b. The moving object in the right part of the image has been detected by applying a threshold of 10° to the difference map. The threshold was obtained from the histogram of the difference map.
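A sketch of the direction test (the array names, the minimum-speed guard, and the default threshold are illustrative; the 10° value is the one derived from the histogram in the experiment):

```python
import numpy as np

def moving_object_mask(u, v, foe, angle_thresh_deg=10.0, min_speed=0.1):
    """Flag pixels whose flow direction deviates from the direction predicted
    by the FOE by more than the threshold."""
    h, w = u.shape
    y, x = np.mgrid[0:h, 0:w].astype(np.float64)
    predicted = np.arctan2(y - foe[1], x - foe[0])
    measured = np.arctan2(v, u)
    diff = np.abs(np.angle(np.exp(1j * (measured - predicted))))   # wrapped to [0, pi]
    return (np.degrees(diff) > angle_thresh_deg) & (np.hypot(u, v) > min_speed)
```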

Estimation of Object Velocity. The instantaneous velocity of an image point due to camera and object motion is

$$\mathbf{V} = \mathbf{V}^{camera}(W_x, W_y, W_z) + \mathbf{V}^{obj}(T_x, T_y, T_z) + \mathbf{V}^{obj}(\phi, \theta, \psi) \qquad (13)$$

$$\mathbf{V} = \frac{k_t}{Z}\,(e_1, e_2) + (A_x, A_y) = K_t\,(e_1, e_2) + (A_x, A_y),$$

FIG. 11. 21st and 23rd image from a set of 47, used for the moving obstacle detection.


FIG. 12. (a) The optical flow computed for the walking sequence. (b) Thresholded difference between the direction of the flow vectors and the predicted direction due to the motion of the vehicle.

where $(W_x, W_y, W_z)$ is the translational velocity of the camera, $(T_x, T_y, T_z)$ is the translational velocity of the object, and $(\phi, \theta, \psi)$ is the rotational velocity of the object with respect to a Cartesian reference system centered on the object and parallel to the camera coordinate system. $Z$ is the distance of the moving point from the camera, $k_t$ is a scale factor which depends on the camera velocity only, $(e_1, e_2)$ is the unit direction vector (versor) of $\mathbf{V}^{camera}$, and $(A_x, A_y)$ is the velocity of the object point on the image plane. The object displacement can be computed by solving Eq. (13) in a 3 × 3 neighborhood. There are two underlying assumptions in solving these equations (a least-squares sketch is given after the list below):

1. the object motion $(A_x, A_y)$ is constant within the considered neighborhood;

2. the distance $Z$ of all the points within the neighborhood is constant.
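Under these assumptions, Eq. (13) reduces, inside each neighborhood, to a linear system in the three unknowns $K_t$, $A_x$, and $A_y$, which can be solved by least squares. A small sketch (illustrative, not the authors' code):

```python
import numpy as np

def object_flow_patch(V_patch, e_dir):
    """Solve V_i = k*(e1, e2) + (Ax, Ay) over a small patch by least squares.
    V_patch: (N, 2) flow vectors of the patch; e_dir: unit direction (e1, e2)
    of the egomotion flow at the patch center; k plays the role of K_t."""
    V = np.asarray(V_patch, float).reshape(-1, 2)
    e1, e2 = e_dir
    m = V.shape[0]
    A = np.zeros((2 * m, 3))                 # unknowns: [k, Ax, Ay]
    A[0::2, 0], A[0::2, 1] = e1, 1.0         # u_i = k*e1 + Ax
    A[1::2, 0], A[1::2, 2] = e2, 1.0         # v_i = k*e2 + Ay
    k, Ax, Ay = np.linalg.lstsq(A, V.reshape(-1), rcond=None)[0]
    return k, (Ax, Ay)
```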

It is worth noting that the estimated velocity component relative to object motion lacks the term whose direction is parallel to the flow vectors due to egomotion. This term is included in $K_t$ and cannot be separated from it. This fact can be understood by considering an object moving with exactly the same velocity as the camera: using only local operators, it is impossible to judge whether it is a static object very far from the camera or whether it is moving independently, unless a more global analysis is performed. Therefore, the computed object flow describes the motion of the object along every direction other than the direction of motion of the camera.

Subtracting the partial object flow from the total optical flow yields a flow field which is due only to the relative motion between the camera and the object along the direction of motion of the camera. Apart from the detection of the object, this is a very important measurement which can be used in robot navigation to choose a trajectory that avoids a collision with the object. The partial object flow determines whether a collision with the robot will happen, while the relative flow between the camera and the object gives the time that will elapse before the collision.

In the proposed method the rotational velocity of the object is almost ignored. In fact, the only variation allowed by the least squares fit is relative to the change in direction of the translational flow of the camera $(e_1, e_2)$. This is equivalent to assuming that the motion of the object is frontoparallel, with a component along the direction of egomotion (which cannot be identified immediately). This is not a very restrictive assumption because it is only applied to the small neighborhood over which Eq. (13) is solved. Object flow variation is allowed, and can be estimated, over a wider neighborhood, as shown by the vector field in Fig. 13, obtained by applying Eq. (13) to the optical flow in Fig. 12. As can be noted, only the image points corresponding to the moving object have a significant value of $(A_x, A_y)$, while it is almost zero elsewhere. The direction of the object flow vectors is almost vertical, pointing down, which correctly corresponds to the motion observed by comparing the two images in Fig. 11: the leg in the foreground and the entire body appear to move downward while stepping forward. The component of motion toward the camera (the man was walking toward the camera) is not revealed. This procedure, when applied to all image points, could also be used to detect moving obstacles.


FIG. 13. Computed velocity field relative to the moving object.

4.1.3. Monitoring of Object Manipulation

The localization of the object position and the identification of possible grasping points are among the most difficult tasks in visually guided robot manipulation. In general, complex scene analysis techniques are required to match the object shape and position in the scene.

Another approach is based on the extraction of 3D features in the observed work space [20]. The object position and orientation are determined by segmenting a volumetric representation of the viewed objects. Three-dimensional features like corners or edges and surfaces are extracted from the volumetric representation. Possible grasping points can be selected by applying simple geometrical rules to the final representation.

Before initiating a manipulative action, some object properties can be determined by a direct interaction of the manipulator with the objects, like touching or pushing.

Looking at an object, while the robot end-effector is approaching and then touching it, it is possible to:

• identify the time instant of touching;
• localize the object and the end-effector position; and
• finally, estimate some object properties from its trajectory, relative to the robot movement.


In Fig. 14 two images from a sequence of 47 are shown (the computed optical flows are superimposed). The images are 256 × 256 pixels with 8 bits per pixel. The sequence was acquired from a fixed camera during the motion of a human hand toward a man-made object (a toy cat).

FIG. 14. 16th and 28th image from a set of 47, acquired during a pushing action.


In the first image the hand is approaching the toy; in the second image the hand pushes the object. The images were captured at video rate during a continuous movement of the hand toward the object. The optical flow is computed for each frame of the sequence. In order to speed up the procedure, in the experiment only the optical flows relative to the five frames before and the five frames after the touching instant were computed. In Fig. 15 the optical flow relative to the second frame in Fig. 14 is presented. As can be noted, both the hand and the object appear to be moving as a whole.

It is possible to characterize this scene as an object moving within the camera field of view against a stationary background. Therefore, a limited, smoothly changing number of moving points should be observed over time. The histogram representing the number of flow vectors in each frame can be computed as a function of time. In most cases this histogram should be nearly constant, with very small fluctuations. A slow rise of the histogram can be due to the arm progressively entering the field of view of the camera; if the images are sampled at a high frequency with respect to the robot motion, this rise will have a very small slope. If the object has been touched, a sudden variation in the number of moving points is observed. In this case the anomaly is detected as a sudden change in the number of flow vectors.
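A sketch of this anomaly test on the per-frame count of moving flow vectors (the speed and jump thresholds are illustrative):

```python
import numpy as np

def touch_frame(flow_fields, speed_thresh=0.2, jump_thresh=1.5):
    """Return the index of the frame at which the number of moving pixels
    jumps suddenly, or None if no such jump is found."""
    counts = np.array([(np.hypot(u, v) > speed_thresh).sum()
                       for u, v in flow_fields], dtype=float)
    if counts.size < 2:
        return None
    jumps = np.diff(counts) / (counts[:-1] + 1.0)    # relative frame-to-frame change
    idx = int(np.argmax(jumps))
    return idx + 1 if jumps[idx] > jump_thresh else None
```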

The robot end-effector can be isolated and its mean velocity computed by analyzing the optical flows just preceding the touching instant.

FIG. 15. Computed velocity field relative to the second frame in Fig. 14.

From the object motion following the contact instant it is possible to determine some qualitative properties of the object in the scene:

• if the force applied by the robot is constant and the velocity of the robot changes after the contact with the object, then it is possible to estimate the friction force of the object relative to the ground;

• if the object does not follow the same trajectory as the robot end-effector, then the end-effector is not properly located, as the contact point is relatively distant from the projection of the object's center of mass on the contact surface.

This methodology allows the measurement of global properties of the object, while force and contact sensors can only estimate local properties relative to the contact point, unless many repeated measurements are performed. This is not always possible, especially when dealing with natural objects like fruits, deformable man-made objects, or objects which can be degraded by repeated touching.

The deformability or stiffness of the object can be measured as well by comparing the velocity of the pushing robot arm with that of the object [23]. In this case the velocity histogram, computed over the whole image at one time instant, of both amplitude and phase is used. By analyzing the amplitude histogram only, it is possible to determine whether the object is undergoing a pure translation along the pusher trajectory. The translation hypothesis is enforced by also comparing the pusher mean velocity with the velocity of each object point, taking into account the variance of the object velocity. Object deformation, on the other hand, is characterized by an almost uniform distribution of the velocity directions in the phase histogram between 0 and 2π. Thus it is possible to distinguish between object rotation and deformation by analyzing the velocity phase histogram.
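A rough sketch of this histogram analysis; the decision thresholds, the uniformity test, and the three-way outcome are illustrative simplifications of the criterion described above:

```python
import numpy as np

def classify_object_motion(u, v, object_mask, pusher_speed, n_bins=16):
    """Compare object speeds with the pusher speed and look at the spread of
    the flow directions (phase histogram) inside the object region."""
    speed = np.hypot(u[object_mask], v[object_mask])
    phase = np.arctan2(v[object_mask], u[object_mask])
    hist, _ = np.histogram(phase, bins=n_bins, range=(-np.pi, np.pi), density=True)
    nearly_uniform = hist.std() / (hist.mean() + 1e-12) < 0.5
    translating = abs(speed.mean() - pusher_speed) < 3.0 * (speed.std() + 1e-12)
    if nearly_uniform:
        return "deformation"
    return "translation" if translating else "rotation or mixed motion"
```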

Another object property which can be determined is its stability with respect to external forces applied to its surface. The object's equilibrium can be assessed by performing a tapping action on the object's surface and then observing its motion: if the object is in stable equilibrium it will stop very shortly after the touch. On the image plane, the number and amplitude of the optical flow vectors will gradually decrease after tapping until the object stops. Also in this case, it is possible to characterize this behavior by computing the histogram of the mean amplitude of the optical flow versus time. The standard case, which corresponds to a stable equilibrium, is characterized by an approximately bell-shaped histogram, with the maximum very close to the time instant of the contact.


Two possible anomalies can be detected (corresponding to two different events):


• if the object falls (from unstable equilibrium) after the touch, the histogram will have a steep slope (corresponding to the progressive acceleration of the object), with the maximum far from the touching instant, and then will suddenly drop (corresponding to the quick stop of the object on the ground);

• if the object oscillates (from unstable equilibrium, but with the impulsive force applied on a particular point of the object's body), then the histogram will be double-bell shaped, the two peaks corresponding to the two opposite motions at two different time instants.


An experiment has been carried out in which a tapping action is performed on two different points of an object lying in unstable equilibrium on a flat table. The same experiment was then performed with the same object in stable equilibrium. In Fig. 16 the velocity histograms computed for the standard case and the two expected events are shown. As can be noted, the anomalies in the histograms, corresponding either to the fall or to the oscillation of the object, can be easily detected from a simple analysis of their shapes, for example, by computing the values of the first derivative.
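A coarse sketch of such a shape analysis on the temporal profile of the mean flow amplitude (the peak test and the constants are illustrative):

```python
import numpy as np

def equilibrium_from_tap(mean_speed, contact_idx):
    """Classify the profile of the mean flow amplitude after a tap:
    one bell close to the contact -> stable equilibrium; a late maximum
    followed by a sharp drop -> fall; two bells -> oscillation."""
    s = np.asarray(mean_speed, float)
    peaks = [i for i in range(1, len(s) - 1) if s[i] >= s[i - 1] and s[i] > s[i + 1]]
    if len(peaks) >= 2:
        return "oscillation (unstable equilibrium)"
    if peaks and peaks[0] - contact_idx > 3 and np.diff(s).min() < -0.5 * s.max():
        return "fall (unstable equilibrium)"
    return "stable equilibrium"
```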


Other qualitative object properties, like the reaction to an external force (translational or rototranslational motion, the axis of instantaneous rotation, etc.), can be determined by pushing, touching, and tapping [23].


It is worth noting that other sensors, like contact or force-torque sensors, can also provide some information about the position and force reaction of the object. Nevertheless, this information is limited to the time instant of contact; it is therefore not possible to monitor the object behavior after an impulsive tapping action. Vision is necessary to verify the object behavior after the contact with the pusher. This is the only way to understand, for example, whether the object is in a stable equilibrium position.


FIG. 16. (a) Histogram of mean |V| with an oscillating object. (b) Histogram of mean |V| with an object in stable equilibrium. (c) Histogram of mean |V| for a falling object.

Once the object has been grasped and moved from the equilibrium position, it must be manipulated or put into another location. During the manipulation it is important to tune the force of the robot gripper exactly, in order to hold the object safely. Sometimes weak grasping poses or the characteristics of the object surfaces (if they are greasy or covered with oil) can cause the object to slip out of the gripper. This is a dangerous condition which can be detected by analyzing a flow of images acquired from a camera looking at the object. The only requirement is that the object and the robot end-effector are kept within the field of view of the camera. This can be done either by moving the object in front of the camera or by actively moving the camera to track the end-effector during the manipulation.

The optical flow is computed from the incoming images. As the only moving item in the scene is the pair formed by the object and the robot end-effector, all the optical flow vectors must follow the same motion law. Moreover, as the trajectory of the robot arm is known, the corresponding direction of velocity on the image plane can also be easily computed.

If the object slides out of the end-effector, its velocity will be different, at least in amplitude and/or sign, from that of the manipulator. This anomaly can be detected by comparing the velocity vectors at each point with the velocity of the end-effector. The end-effector is isolated by a thresholding procedure before the object is grasped, and its mean velocity is computed. The image projection of the end-effector is tracked during the motion and its velocity pattern is compared with that of the grasped object. A difference between the velocity of the robot and that of the object indicates that the object is sliding out of the gripper. This event can activate a command to the robot control module to increase the grasping force.
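A sketch of the velocity comparison; the region masks are assumed to come from the thresholding and tracking steps described above, and the tolerance is illustrative:

```python
import numpy as np

def is_slipping(u, v, gripper_mask, object_mask, tol=0.25):
    """Compare the mean image velocity of the tracked end-effector with that
    of the grasped object; a large relative mismatch signals slipping."""
    v_gripper = np.array([u[gripper_mask].mean(), v[gripper_mask].mean()])
    v_object = np.array([u[object_mask].mean(), v[object_mask].mean()])
    return np.linalg.norm(v_object - v_gripper) > tol * (np.linalg.norm(v_gripper) + 1e-9)
```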



The proposed method requires very simple computations and is very sensitive to anomalous object movements. It can be used as an adjunct to tactile sensing. Even though sliding could be detected using tactile and force sensors, this would be a rather complex procedure, and the sliding velocity (how fast the object is sliding down) could not be determined.

5. CONCLUSION

The application of a space-variant sampling strategy to dynamic visual tasks has been presented. The distinctive feature of active/space-variant vision, with respect to conventional image processing, is where to sample data within the visual field, not how much. It is our opinion that in recent years much of the research on computer vision has been biased, as far as sensor technology is concerned, by the requirements of “image processing,” without stressing that in computer vision images are to be understood, not transmitted or enhanced. Currently available video cameras are considered a good technological solution, and improvements are sought both in terms of resolution and in terms of “aperture time.” Both these features are, in our opinion, not relevant for vision-based control of autonomous systems; rather, they are going to make visual computations even more complex, while also increasing the amount of data to be processed.

Considering the requirements of a tracking process, the only situation in which this process is not active is when the camera is fixed and looking at a steady environment. In all other situations the need to reduce motion blur forces the activation of the tracking system, which, consequently, cannot be actively suppressed. The alternative, proposed by current technological trends, is to use high-speed shutters which “freeze” the image by sampling very briefly in time. This certainly avoids motion blur, but it does not relieve the overall system of the need to measure precise egomotion information in order to extract useful information from the evolving scene (in particular, to separate image velocities generated by egomotion from those due to objects moving independently in the scene: the case of a moving object tracked by a moving camera). Tracking a point of the environment, on the contrary, gives the system a reference point which does not move in the environment and simplifies the computation of time-to-impact [7, 25].

During fixation, and also during vergence control, the high-resolution part of the sensor is positioned over the object being fixated [i.e., where, in the image, the smallest displacements (actually the tracking errors) need to be measured], while the low-resolution part sweeps over the background, and motion blur actually filters out the high-frequency components of the incoming information.

From a behavioral viewpoint, depth computation is necessary only within the grasping range of the arm. During manipulative tasks, in fact, the motion of the objects is driven mainly by the arm itself and the “head” can be considered steady. In this situation motion parallax does not provide reliable depth information. As a consequence, the major role in depth computation during manipulative tasks is played by the stereo subsystem.

Very different is the role of motion parallax in the control of egomotion. In the case of a steady system (i.e., a system without legs or wheels) the computation of depth outside the grasping range is entirely useless if one is interested in the role of vision for control and not for interpretation or recognition. On the other hand, as soon as the system starts moving, the relevant information is not how far an object is but how long it will take to reach it. The trajectory used to avoid an obstacle depends more on how fast we are approaching it than on its actual distance. For example, the steering strategy of a driver is very different when parking a car and when passing a truck on a highway. Therefore, even if in principle it is possible to compute depth from motion parallax, our opinion is that this is not necessary and that the simpler computation of time-to-impact is much more important in these cases. In the section devoted to the use of the retina-like sensor for the computation of this behavioral variable we stressed the unique characteristics of this sampling strategy for the computation of time-to-impact.

Other parameters can be extracted from image flows. The optical flow, computed from an image sequence, can be exploited to derive qualitative properties which make explicit some global characteristic related to the scene and/or to the observer. A method has been presented for the detection of anomalies in the computed velocity pattern. The method acts using different rules depending on the task to be accomplished. In general, a prediction is made of the general properties of the optical flow field, which relies on the knowledge of some dynamic parameters (for example, an expanding flow in the case of a camera undergoing a translational forward motion with static or moving objects in the scene). The anomaly, and consequently a dangerous event, is detected by checking whether the predicted velocity features actually appear in the computed optical flows. This methodology is very effective in dealing with real scenes for many robot applications. The algorithms developed following the proposed approach are generally very robust and are easily implemented with local, fast operations.

Some examples, demonstrating the wide applicability of the approach, have been presented. They include some relevant applications in the area of robot vision, like the detection of static or moving obstacles from a moving camera and the monitoring of workpieces during manipulation with a robot arm.


ACKNOWLEDGMENTS

This research was supported by the Special Project on Robotics of the Italian National Council of Research and by the ESPRIT BRA project P3274 FIRST.

REFERENCES

1. J. S. Albus and T. H. Hong, Motion, depth, and image flow, in Proceedings, IEEE International Conference on Robotics and Automation, Cincinnati, OH, 1990, pp. 1161-1171.

2. J. Aloimonos, Purposive and qualitative active vision, in Proceedings, International Workshop on Active Control in Visual Perception, Antibes, France, 1990.

3. J. Aloimonos and D. Tsakiris, Tracking in a complex visual environment, in Proceedings, First European Conference on Computer Vision, Antibes, France, Apr. 23-26, 1990, Springer-Verlag, New York.

4. J. Aloimonos, I. Weiss, and A. Bandyopadhyay, Active vision, Int. J. Comput. Vision 1(4), 1988, 333-356.

5. N. Ayache and O. D. Faugeras, Maintaining representations of the environment of a mobile robot, IEEE Trans. Robotics Automat. RA-5(6), Dec. 1989, 804-819.

6. R. K. Bajcsy, Active perception vs passive perception, in Proceedings, Third IEEE Workshop on Computer Vision: Representation and Control, Bellaire, MI, 1985, pp. 13-16.

7. D. H. Ballard, R. C. Nelson, and B. Yamauchi, Animate vision, Opt. News E(5), 1989, 17-25.

8. S. L. Bartlett and R. C. Jain, Depth determination using complex logarithmic mapping, in SPIE International Conference on Intelligent Robots and Computer Vision IX: Neural, Biological and 3-D Methods, Boston, MA, Nov. 4-9, 1990, Vol. 1382.

9. M. Bertero, T. Poggio, and V. Torre, Ill-posed problems in early vision, Proc. IEEE 76(8), Aug. 1988, 869-889.

10. C. Braccini, G. Gambardella, G. Sandini, and V. Tagliasco, A model of the early stages of the human visual system: Functional and topological transformation performed in the peripheral visual field, Biol. Cybernet. 44, 1982, 47-58.

11. P. J. Burt, Smart Sensing in Machine Vision, Academic Press, New York, 1988.

12. S. Carlsson and J. O. Eklundh, Object detection using model based prediction and motion parallax, in Proceedings, First European Conference on Computer Vision, Antibes, France, 1990, pp. 297-306, Springer-Verlag, New York.

13. D. J. Coombs and C. M. Brown, Intelligent gaze control in binocular vision, in Proceedings, Fifth IEEE International Symposium on Intelligent Control, Philadelphia, PA, Sept. 1990.

14. D. J. Coombs, T. J. Olson, and C. M. Brown, Gaze control and segmentation, in Proceedings, AAAI-90 Workshop on Qualitative Vision, Boston, MA, July 1990.

15. J. L. Crowley, Dynamic modeling of free-space for a mobile robot, in Proceedings, International Workshop on Intelligent Robots and Systems, Tsukuba, Japan, Sept. 4-6, 1989.

16. I. Debusschere, E. Bronckaers, C. Claeys, G. Kreider, J. Van der Spiegel, P. Bellutti, G. Soncini, P. Dario, F. Fantini, and G. Sandini, A 2D retinal CCD sensor for fast 2D shape recognition and tracking, in Proceedings, 5th International Conference on Solid-State Sensors and Transducers, Montreux, 1989.

17. E. DeMicheli, G. Sandini, M. Tistarelli, and V. Torre, Estimation of visual motion and 3D motion parameters from singular points, in Proceedings, IEEE International Workshop on Intelligent Robots and Systems, Tokyo, Japan, 1988.

18. J. Van der Spiegel, G. Kreider, C. Claeys, I. Debusschere, G. Sandini, P. Dario, F. Fantini, P. Bellutti, and G. Soncini, Analog VLSI and Neural Network Implementations, Kluwer, Dordrecht, 1989.

19. E. D. Dickmanns, 4D-dynamic scene analysis with integral spatio-temporal models, in Proceedings, 5th International Symposium on Robotics Research, Tokyo, Japan, 1988 (R. Bolles and B. Roth, Eds.), MIT Press, Cambridge, MA.

20. G. Succi, E. Grosso, G. Sandini, and M. Tistarelli, 3D feature extraction from sequences of range data, in Proceedings, 5th International Symposium on Robotics Research, Tokyo, Japan, 1989, MIT Press, Cambridge, MA.

21. W. Enkelmann, Obstacle detection by evaluation of optical flow fields from image sequences, in Proceedings, First European Conference on Computer Vision, Antibes, France, 1990, pp. 134-138, Springer-Verlag, New York.

22. F. Ferrari, E. Grosso, G. Sandini, and M. Magrassi, A stereo vision system for real time obstacle avoidance in unknown environment, in Proceedings, IEEE International Workshop on Intelligent Robots and Systems, Tsuchiura, Japan, 1990, pp. 703-708.

23. F. Gandolfo, M. Tistarelli, and G. Sandini, Visual monitoring of robot actions, in Proceedings, IEEE International Workshop on Intelligent Robots and Systems, Osaka, Japan, Nov. 3-5, 1991.

24. J. J. Gibson, The Ecological Approach to Visual Perception, Houghton Mifflin, Boston, 1979.

25. E. Grosso, G. Sandini, and M. Tistarelli, 3D object reconstruction using stereo and motion, IEEE Trans. Syst. Man Cybernet. SMC-19(6), 1989.

26. B. K. P. Horn and B. G. Schunck, Determining optical flow, Artif. Intell. 17(1-3), 1981, 185-204.

27. H. Inoue, H. Mizoguchi, Y. Murata, and M. Inaba, A robot vision system with flexible multiple attention capability, in Proceedings, ICAR, 1985, pp. 199-205.

28. L. Jacobson and H. Wechsler, Joint spatial/spatial-frequency representations for image processing, in SPIE International Conference on Intelligent Robots and Computer Vision, Boston, MA, 1985.

29. R. C. Jain, S. L. Bartlett, and N. O'Brian, Motion stereo using ego-motion complex logarithmic mapping, IEEE Trans. PAMI PAMI-9(3), 1987, 356-369.

30. D. J. Kriegman, E. Triendl, and T. O. Binford, Stereo vision and navigation in buildings for mobile robots, IEEE Trans. Robotics Automat. RA-5(6), Dec. 1989, 792-803.

31. R. Manmatha, R. Dutta, E. M. Riseman, and M. A. Snyder, Issues in extracting motion parameters and depth from approximate translational motion, in Proceedings, International Workshop on Visual Motion, Irvine, CA, March 20-22, 1989, IEEE Computer Society.

32. L. Massone, G. Sandini, and V. Tagliasco, Form-invariant topological mapping strategy for 2-D shape recognition, Comput. Vision Graphics Image Process. 30(2), 1985, 169-188.

33. P. Morasso, G. Sandini, and M. Tistarelli, Active vision: Integration of fixed and mobile cameras, in NATO ARW on Sensors and Sensory Systems for Advanced Robots, Berlin/Heidelberg, 1986, Springer-Verlag, New York/Berlin.

34. H. H. Nagel, Extending the 'oriented smoothness constraint' into the temporal domain and the estimation of derivatives of optical flow, in Proceedings, First European Conference on Computer Vision, Antibes, France, 1990, pp. 139-148, Springer-Verlag, New York.

35. R. C. Nelson and J. Aloimonos, Using flow field divergence for obstacle avoidance in visual navigation, IEEE Trans. PAMI PAMI-11(10), 1989.


36. T. J. Olson and D. J. Coombs, Real-time vergence control for binocular robots, in Proceedings, DARPA Image Understanding Workshop, Pittsburgh, PA, Sept. 1990.

37. D. Raviv and M. Herman, Towards an understanding of camera fixation, in Proceedings, IEEE International Conference on Robotics and Automation, Cincinnati, 1990, pp. 28-33.

38. J. W. Roach and J. K. Aggarwal, Computer tracking of moving objects in space, IEEE Trans. PAMI PAMI-1(2), 1979.

39. G. Sandini, F. Bosero, F. Bottino, and A. Ceccherini, The use of an anthropomorphic visual sensor for motion estimation and object tracking, in Proceedings, OSA Topical Meeting on Image Understanding and Machine Vision, 1989.

40. G. Sandini and P. Dario, Active vision based on space-variant sensing, in Proceedings, 5th International Symposium on Robotics Research, Tokyo, Japan, 1989, MIT Press, Cambridge, MA.

41. G. Sandini, P. Morasso, and M. Tistarelli, Motor and spatial aspects in artificial vision, in Proceedings, 4th International Symposium on Robotics Research, Santa Cruz, CA, 1987, MIT Press, Cambridge, MA.

42. G. Sandini and V. Tagliasco, An anthropomorphic retina-like structure for scene analysis, Comput. Graphics Image Process. 14(3), 1980, 365-372.

43. G. Sandini and M. Tistarelli, Active tracking strategy for monocular depth inference over multiple frames, IEEE Trans. PAMI PAMI-12(1), 1990, 13-27.

44. G. Sandini and M. Tistarelli, Robust obstacle detection using optical flow, in Proceedings, IEEE International Workshop on Robust Computer Vision, Seattle, WA, Oct. 1-3, 1990, pp. 396-411.

45. R. J. Schalkoff and E. S. McVey, A model and tracking algorithm for a class of video target, IEEE Trans. PAMI PAMI-4(1), 1982.

46. E. L. Schwartz, Spatial mapping in the primate sensory projection: Analytic structure and relevance to perception, Biol. Cybernet. 25, 1977, 181-194.

47. M. A. Taalebinezhaad, Direct recovery of motion and shape in the general case by fixation, in Proceedings, IEEE International Conference on Computer Vision, Osaka, Japan, 1990, pp. 451-455.

48. W. B. Thompson and J. K. Kearney, Inexact vision, in Proceedings, IEEE Workshop on Motion: Representation and Analysis, Kiawah Island Resort, 1986, pp. 15-21.

49. W. B. Thompson and T. C. Pong, Detecting moving objects, Int. J. Comput. Vision 4(1), 1990, 39-58.

50. M. Tistarelli and G. Sandini, Estimation of depth from motion using an anthropomorphic visual sensor, Image Vision Comput. 8(4), 1990, 271-278.

51. M. Tistarelli and G. Sandini, On the advantages of polar and log-polar mapping for direct estimation of time-to-impact from optical flow, IEEE Trans. PAMI, in press.

52. M. Tistarelli and G. Sandini, Direct estimation of time-to-impact from optical flow, in Proceedings, IEEE Workshop on Motion, Princeton, NJ, Oct. 7-9, 1991, pp. 226-233.

53. V. Torre and T. Poggio, On edge detection, IEEE Trans. PAMI PAMI-8, Feb. 1986, 147-163.

54. S. Uras, F. Girosi, A. Verri, and V. Torre, Computational approach to motion perception, Biol. Cybernet. 60, 1988, 69-87.

55. D. Vernon and M. Tistarelli, Using camera motion to estimate range for robotic parts manipulation, IEEE Trans. Robotics Automat. RA-6(5), 1990, 509-521.

56. A. Verri, F. Girosi, and V. Torre, Mathematical properties of the 2D motion field: From singular points to motion parameters, in Proceedings, International Workshop on Visual Motion, Irvine, CA, March 20-22, 1989, IEEE Computer Society.

57. A. Verri and T. Poggio, Motion field and optical flow: Qualitative properties, IEEE Trans. PAMI PAMI-11, 1989, 490-498.

58. C. F. R. Weiman, Polar exponential sensor arrays unify iconic and Hough space representation, in SPIE International Conference on Intelligent Robots and Computer Vision VIII: Algorithms and Techniques, Philadelphia, PA, Nov. 6-10, 1990, Vol. 1192, pp. 832-842.

59. C. F. R. Weiman and R. D. Juday, Tracking algorithms using log-polar mapped image coordinates, in SPIE International Conference on Intelligent Robots and Computer Vision VIII: Algorithms and Techniques, Philadelphia, PA, Nov. 6-10, 1990, Vol. 1192, pp. 843-853.

60. C. F. R. Weiman and G. Chaikin, Logarithmic spiral grids for image processing and display, Comput. Graphics Image Process. 11, 1979, 197-226.

61. A. L. Yarbus, Eye Movements and Vision, Plenum, New York, 1967.