Towards Full Omnidirectional Depth Sensing using Active Vision for Small Unmanned Aerial Vehicles

Adam Harmat
Dept. of Mechanical Engineering
McGill University
Montreal, Canada
[email protected]

Inna Sharf
Dept. of Mechanical Engineering
McGill University
Montreal, Canada
[email protected]

Abstract—Collision avoidance for small unmanned aerial vehicles operating in a variety of environments is limited by the types of available depth sensors. Currently, there are no sensors that are lightweight, function outdoors in sunlight, and cover enough of a field of view to be useful in complex environments, although many sensors excel in one or two of these areas. We present a new depth estimation method, based on concepts from multi-view stereo and structured light methods, that uses only lightweight miniature cameras and a small laser dot matrix projector to produce measurements in the range of 1-12 meters. The field of view of the system is limited only by the number and type of cameras/projectors used, and can be fully omnidirectional if desired. The sensitivity of the system to design and calibration parameters is tested in simulation, and results from a functional prototype are presented.

Keywords-active vision; depth sensing; unmanned aerial vehicle; omnidirectional; structured light

I. INTRODUCTION

Small unmanned aerial vehicles (UAVs) are being envisioned for a multitude of uses, such as aerial surveying, videography, and even object delivery, due to their low cost and increasing ease of operation. However, essentially all commercially available platforms to date operate under the assumption that they will be flying far enough from obstacles that a collision is unlikely. On the other hand, many research groups are working on enabling small UAVs to operate with varying degrees of autonomy, and in particular, to autonomously avoid obstacles. For example, competitors in the International Aerial Robotics Competition [1] need to use state-of-the-art solutions to simultaneous localization and mapping (SLAM), navigation, and collision avoidance to complete challenging missions autonomously. Unfortunately, these solutions have not translated into robust, reliable collision avoidance in real-world flying vehicles because they are not general enough to work in all reasonable environments.

In order to avoid a collision, obstacles must first be sensed in 3D space. There are two common approaches that are light enough to be used on small UAVs and that can provide this type of data with sufficient density: passive vision and active vision. Passive vision, such as stereo depth estimation or visual SLAM, requires natural texture in the scene. Moreover, that texture needs to be sufficiently two-dimensional so that commonly used patch matching strategies give correct results. This means that such methods may work well in man-made environments, but will suffer when there is a lack of texture or the texture is a result of three-dimensional objects occluding one another, as for example, the branches of a tree.

Active vision solves this problem by projecting light into the scene and inferring depth from the reflected light. There are many ways of accomplishing this, and a good overview is given in [2] and [3]. The most common approaches used in robotics are the 2D laser rangefinder, structured light projection, and time-of-flight (ToF) cameras, which will be discussed in more detail in section II.

Our proposed depth sensing solution is an active vision method, so that lack of proper environmental texture does not hinder it. It uses coherent laser light in a multi-dot or multi-line pattern as the source of illumination, due to its proven ability to be used in outdoor scenarios even in bright ambient light [4]. However, unlike other structured light approaches, we do not make any assumptions about the scene geometry, as is often done with coded structured light [5]. In addition, the specific laser pattern is not crucial to the operation of the system, which is not the case for grid-based approaches [6]. Only a single exposure is required with our method, meaning instantaneous capture of 3D data is possible.

Inspired by our work on multi-camera visual SLAM onboard small UAVs [7], we make use of multiple miniature cameras with overlapping FOVs to observe a projected laser pattern from different perspectives. Knowing both the intrinsic and extrinsic properties of the cameras and projectors, a set of epipolar constraints is established that relates the images of the laser pattern seen by all the cameras to each other. These constraints enable the construction of a fitness function for each part of the projected pattern, and extracting the set of maxima from these functions results in the most likely 3D reconstruction of the pattern. Unlike other omnidirectional sensors, which typically refer to a 360° capture region in the azimuthal direction but to a much smaller region in the elevation direction [8], our approach is completely flexible in defining a FOV, limited only by the number and types of cameras and projectors employed. All of the cameras contribute to the depth estimate for each projector point, unlike the case of simply combining a number of other depth sensors, where at best the sensors duplicate each other if there is any overlap in their FOVs, and at worst interfere with one another.

II. RELATED WORK

Although sensitive to the existence of appropriate texture, passive visual depth estimation techniques have nonetheless had good success in certain applications. In [9], a keyframe-based visual SLAM algorithm is used to compute an unmanned aerial vehicle's pose as well as build a map of sparse point features. The point features are converted into a triangle mesh, which can then be used for obstacle avoidance. Since only visually interesting points get made into features, the resulting map is at best a simple approximation to the true world. The same visual SLAM algorithm is used in [10] as a source of poses for dense depth map construction from a large number of images. This dense depth map is very well suited to collision avoidance; however, the fact that the poses used to generate it come from a method that requires good 2D features limits the environments where it can be applied.

Laser scanners, while limited to depth estimation in a single plane, are used with great success on small UAVs flown indoors [11], where the vertical planar nature of most building walls eases the requirement for full 3D volume capture. In a less structured environment, the device can be made to "nod" about an axis in its capture plane, but then the volume measurements do not happen all at the same time, a problem for fast moving vehicles. However, if the sensor is rotated quickly enough, it can capture a full 3D volume in a small enough time frame that realtime collision avoidance can be performed. This is exactly what the Velodyne laser scanner accomplishes, having a set of individual laser depth sensors which rotate about the device's vertical axis. A large version of the sensor, having 64 individual lasers, was used by many teams in the DARPA Urban Challenge, such as the Stanford vehicle "Junior" [12]. Recently, a smaller version of the sensor, having 32 lasers, was used by Phoenix Aerial Systems onboard an eight-rotor aerial vehicle for a dense point cloud reconstruction demonstration [13]. As well as these sensors perform, there are two large drawbacks: their extremely high cost, and the limited field of view in the elevation direction.

Structured light methods, in the form of the Kinect sensor from Microsoft, have performed well in indoor depth estimation for small UAVs [14], enabling the generation of dense 3D maps by fusing the depth data with the image data in a SLAM algorithm. When used in an outdoor setting, however, the Kinect suffers as its infrared projector is not able to overcome even moderate sunlight.

ToF cameras have found some success in robotics applications, such as analyzing terrain traversability for a ground robot [15] or pose estimation and map building indoors [16]. However, the robots carrying these sensors are ground-based, as the current generation of ToF cameras is too heavy for most small UAVs to carry, such as the vehicle in [11], which was unable to fly with a Swissranger 4000 sensor onboard. In addition, these sensors typically offer a small FOV due to the limitation of having to completely illuminate a scene with infrared light. As the desired FOV increases, the volume of illuminated space increases greatly, which cannot be met without significantly increasing the amount of power required by the lighting system. Finally, the current generation of ToF cameras has low pixel counts [17], which leads to low angular resolution.

Collimated laser light, as shown by the success of both planar laser scanners and the Velodyne, is a very good choice for illuminating scenes even in sunlit outdoor environments. Triangulation-based methods of depth sensing using laser light have also been shown to work outdoors. For example, in [18] the performance of a line projection system is evaluated for the task of curb detection for a moving vehicle, and in [19] the requirements for an outdoor line projection system are derived based on photometric analysis. The projection of dot matrices instead of lines is explored in [20] for a car cockpit occupancy measurement system, and in [21] for terrain traversability in a planetary rover.

In both [20] and [21] and other cases of dot projection, the dot labeling problem is solved by applying the epipolar constraint to a sparse set of dots viewed by one or two cameras. Ambiguous labels can sometimes be resolved by resorting to additional constraints, such as left-right ordering and smooth motion over time, but these can be violated in certain environments. In contrast, our approach uses only the epipolar constraint, which is guaranteed to hold in all scenarios. Instead of labeling spots in images in an attempt to look for matches, we work directly with the image pixels to construct a fitness function for each part of the projected pattern, whose maxima are located at the most likely estimated depth.

III. APPROACH

Our method is based on approaches found in the field of multi-view stereo, specifically in [22], where features are extracted from one image at a time, and matches are sought along epipolar lines in the other images using a feature's sum of squared differences (SSD) score. In our case, we are seeking the depths of a pattern projected from a device with known calibration, meaning we know the projection direction of all parts of the pattern. Since every projected point is guaranteed to be visible from the location of the projector, and thinking of the projector as an inverse camera, we define it to be a "reference image" whose points we will find matches for in the other images. In many multi-view stereo approaches, establishing a reference image is considered an arbitrary decision to be avoided. However, in our case it is justified since we will be looking only for the projected pattern in the other images.

A. Fundamentals

We represent the pose of a camera c (or projector p) in the world frame w with a homogeneous transform:

$$T_{wc} = \begin{bmatrix} R_{wc} & t_{wc} \\ 0 & 1 \end{bmatrix} \qquad (1)$$

where T_wc ∈ SE(3), the special Euclidean group that defines all rigid motions (i.e., translation, rotation and their combination) and is represented by a 4 × 4 matrix. The submatrix R_wc is a 3 × 3 matrix and a member of SO(3), the set of all rigid rotations, and t_wc ∈ R^3 is the translation of the camera's center of projection from the world origin. T_wc is defined such that x_w = T_wc x_c, where x is a 3D point.
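To make the pose convention concrete, a minimal sketch (assuming NumPy) of assembling T_wc and applying it to a point is given below; the function names are illustrative, not part of our implementation.

```python
# Sketch of the pose convention of Eq. (1): T_wc maps points from the camera
# frame into the world frame, x_w = T_wc x_c.
import numpy as np

def make_T(R_wc: np.ndarray, t_wc: np.ndarray) -> np.ndarray:
    """Assemble a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector translation."""
    T = np.eye(4)
    T[:3, :3] = R_wc
    T[:3, 3] = t_wc
    return T

def transform_point(T: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Apply a homogeneous transform T to a 3D point x."""
    xh = np.append(x, 1.0)
    return (T @ xh)[:3]

# Example: a camera 0.2 m along the world x-axis, no rotation.
T_wc = make_T(np.eye(3), np.array([0.2, 0.0, 0.0]))
x_c = np.array([0.0, 0.0, 5.0])        # point 5 m in front of the camera
x_w = transform_point(T_wc, x_c)       # -> [0.2, 0.0, 5.0] in the world frame
```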

To keep our method as general as possible, we are using a spherical camera projection model instead of the typically used pinhole model. Specifically, we are using the Taylor model outlined in [23], which can be used to define cameras with a FOV greater than 180°. The Taylor model maps a point in the camera frame directly to an image point through a spherical projection intermediary step. This mapping will be defined as u = C(x_c), where u is a point in the image. In keeping with the inverse camera view of the laser projector, the "unprojection" function is defined as v = C^-1(u), where the pixel u is part of a virtual image (defined below) and v is a unit length 3-vector. Note that our method can use a pinhole camera model as well, but then it will be limited to smaller FOVs.
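The interface the rest of the method relies on is simply the pair C and C^-1. The sketch below illustrates that interface with an ideal equidistant fisheye model standing in for the calibrated Taylor polynomial of [23]; the focal scale, image center, and function names are illustrative assumptions, not our calibration.

```python
# Sketch of the C / C^{-1} interface using an equidistant fisheye stand-in for the
# Taylor model. F_PX and CENTER are assumed values for illustration only.
import numpy as np

F_PX = 300.0                       # assumed focal scale (pixels per radian)
CENTER = np.array([320.0, 240.0])  # assumed image center (pixels, x then y)

def project(x_c: np.ndarray) -> np.ndarray:
    """u = C(x_c): camera-frame 3D point -> image point."""
    theta = np.arctan2(np.linalg.norm(x_c[:2]), x_c[2])  # angle from optical axis
    phi = np.arctan2(x_c[1], x_c[0])                      # azimuth in the image plane
    r = F_PX * theta
    return CENTER + r * np.array([np.cos(phi), np.sin(phi)])

def unproject(u: np.ndarray) -> np.ndarray:
    """v = C^{-1}(u): image point -> unit-length viewing ray in the camera frame."""
    d = u - CENTER
    r = np.linalg.norm(d)
    theta = r / F_PX
    phi = np.arctan2(d[1], d[0])
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])
```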

B. Projection

The first step in our method is discretization of the laser projector pattern at a certain resolution. If only dots are projected, then this step can be skipped, as the dot centers can be used directly and we know their direction vectors from the projector calibration. For other patterns, we can define a virtual image of a desired resolution on which the pattern is drawn, and unproject the pixels that comprise the pattern to form the set of pattern vectors V ⊂ R^3, so that v ∈ V.

We define an inverse depth map D : V → R such that d = D(v), where d is an inverse depth value that satisfies C^-1(C(x_c)) = d x_c. Finding D for each element of V is the overall goal of our method, and is accomplished as follows for any given v. Figure 1 gives an illustration of the process.

First, a minimum and maximum inverse depth is chosen to define the interval [d_min, d_max], which is uniformly partitioned into n_d segments. For each candidate d_i ∈ [d_min, d_max], we project the point (1/d_i)v into all other cameras:

$$u_{ij} = C_j\!\left(T_{c_j p}\,\tfrac{1}{d_i}\,v\right) \qquad (2)$$

where i ∈ (1, 2, ..., n_d), j ∈ (1, 2, ..., n_c), n_c is the number of cameras, T_cjp is the pose of the projector reference frame in the jth camera's reference frame, and u_ij is the image coordinate of the projected point. Since u_ij will not, in general, have integer components, the intensity value I_ij(u_ij) is linearly interpolated from neighboring pixels. We employ this inverse depth projection method instead of computing epipolar lines in the target cameras because the Taylor camera model supports FOVs over 180°, which cannot be unprojected to the Cartesian image required for straight epipolar lines. In addition, this method allows us to work with raw camera images rather than having to de-warp the potentially significant radial distortion that the Taylor model supports.

Figure 1. Candidate points along v with inverse depth between d_min and d_max are projected into all cameras, and the resulting image coordinates lie on epipolar curves. The Taylor camera model does not lend itself easily to illustration, therefore a standard pinhole projection model is shown.
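A minimal sketch of this inverse-depth sweep is given below, reusing the project and transform_point helpers from the earlier sketches and adding a bilinear intensity lookup; the interval bounds and n_d value are illustrative, roughly matching the 1-12 m operating range.

```python
# Sketch of the inverse-depth sweep of Eq. (2) for one pattern vector v and one camera j.
import numpy as np

def bilinear(img: np.ndarray, u: np.ndarray) -> float:
    """Bilinearly interpolate a grayscale image at non-integer pixel coordinates u = (x, y)."""
    x, y = u
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    if x0 < 0 or y0 < 0 or x0 + 1 >= img.shape[1] or y0 + 1 >= img.shape[0]:
        return 0.0                               # samples outside the image contribute nothing
    ax, ay = x - x0, y - y0
    return ((1 - ax) * (1 - ay) * img[y0, x0]     + ax * (1 - ay) * img[y0, x0 + 1] +
            (1 - ax) * ay       * img[y0 + 1, x0] + ax * ay       * img[y0 + 1, x0 + 1])

def sample_fitness(v, T_cj_p, img_j, d_min=1.0 / 12.0, d_max=1.0, nd=100):
    """Return the candidate inverse depths and the samples f_j = I_ij(u_ij) for camera j."""
    depths = np.linspace(d_min, d_max, nd)
    f_j = np.empty(nd)
    for i, d_i in enumerate(depths):
        x_p = v / d_i                            # candidate 3D point in the projector frame
        x_cj = transform_point(T_cj_p, x_p)      # expressed in camera j's frame
        u_ij = project(x_cj)                     # Eq. (2): image coordinate in camera j
        f_j[i] = bilinear(img_j, u_ij)           # interpolated intensity I_ij(u_ij)
    return depths, f_j
```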

C. Fitness Function

We can maximize the brightness of the projected laser pattern by employing narrow-band optical filters on all the cameras, which block out most other unwanted light. Therefore, we define a fitness function f_j for each v to simply be the set of intensity values I_ij(u_ij), i ∈ (1, 2, ..., n_d), rather than computing a patch-based matching score, as is done in [22]. Figure 2 shows an example of individual fitness functions f_j produced by each camera in the example of Figure 1. Computing the overall fitness function F(p), which has contributions from f_j for all j, can be accomplished as simply as a sum of the individual fitness functions, similar to the method employed in [22]. Another popular formulation for F(p) is to use the product of the f_j, an approach commonly used in the field of tomographic particle image velocimetry [24]. Unfortunately, this method does not work well when occlusions between the projected pattern and the observing cameras can be expected, since any camera that does not observe a given laser spot will reduce F(p) to zero. In our method, we define the overall fitness function as the mean of the individual fitness functions:

$$F(p) = \frac{1}{n_c}\sum_{j=1}^{n_c} f_j \qquad (3)$$

which allows the application of thresholding with a fixed value regardless of the number of cameras, as will be done in section V.

The maximum of F(p) is taken to be the most likely inverse depth of the projection along pattern vector p. However, if there are multiple large local maxima in F(p), then the solution is ambiguous. How to deal with such cases depends on the processing that will be applied to the generated depth maps. For example, in a SLAM application, it could be advantageous to return all maxima of the fitness function along with confidence measures (i.e., the width of each peak), whereas for a conservative collision avoidance application we might want to return only the peak at the smallest distance. In our implementation, we can set a threshold by which the largest maximum must exceed all other maxima; otherwise, no depth is returned.
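A sketch of combining the per-camera samples via Eq. (3) and selecting a depth is shown below; the intensity threshold of 70 matches the prototype of section V, while the peak-margin ratio is an illustrative stand-in for the ambiguity threshold described above.

```python
# Sketch of Eq. (3) and peak selection over the sampled fitness functions.
import numpy as np

def local_maxima(F: np.ndarray) -> list:
    """Indices where the sampled fitness has a strict local maximum."""
    return [i for i in range(1, len(F) - 1) if F[i] > F[i - 1] and F[i] >= F[i + 1]]

def estimate_inverse_depth(f_all: np.ndarray, depths: np.ndarray,
                           intensity_thresh: float = 70.0,
                           peak_margin: float = 1.2):
    """f_all is an (n_c, n_d) array holding f_j for each camera j; returns d or None."""
    F = f_all.mean(axis=0)                             # Eq. (3): mean over cameras
    peaks = sorted(local_maxima(F), key=lambda i: F[i], reverse=True)
    if not peaks or F[peaks[0]] < intensity_thresh:    # too dim: occluded or unobserved
        return None
    if len(peaks) > 1 and F[peaks[0]] < peak_margin * F[peaks[1]]:
        return None                                    # ambiguous: comparable maxima
    return depths[peaks[0]]                            # most likely inverse depth (1/m)
```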

Figure 2. Individual fitness functions generated by projecting candidate points into each of the cameras of Figure 1. (Axes: depth vs. image intensity; curves f_1 and f_2.)

IV. SIMULATION

The proposed depth estimation system is sensitive to a number of parameters, some of which are design choices while others are values found through calibration. To determine the effect of these parameters, simulations are carried out while varying only one parameter at a time. The following parameters are investigated: pattern density, number of cameras, uncertainty in projector/camera intrinsic calibration, and uncertainty in projector/camera extrinsic calibration (translation and rotation separately).

The nominal translational offsets of the cameras used in the simulations are shown in table I, while the rotations are set to zero. The projector is placed at the origin of the world frame, with zero rotation as well. The camera translational offsets are chosen at random, while keeping the maximum distance from the projector to 0.2 m in any axis in order to mimic the size constraints of a small UAV. The simulated environment consists of a triangle mesh with 50 triangles that encompasses the entire field of view of the projector as well as all cameras. The depth of each triangle vertex from the world origin is randomly chosen with a mean of 6 m and a standard deviation of 0.5 m, with the minimum allowed distance set at 1 m. This allows us to investigate depth sensing in environments of the scale we are interested in, typically 1-12 m in depth.

Table I
CAMERA TRANSLATIONAL OFFSETS FOR SIMULATION

Name       Position (x, y, z) [m]    Name       Position (x, y, z) [m]
Camera 1   (-0.20, -0.20, 0.00)      Camera 4   (0.20, 0.05, 0.00)
Camera 2   (0.05, -0.17, 0.00)       Camera 5   (-0.04, 0.16, 0.00)
Camera 3   (-0.12, -0.03, 0.00)      Camera 6   (0.10, 0.20, 0.00)
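A minimal sketch of the environment generation described above follows; drawing vertex directions uniformly over the sphere is an assumption made only for illustration.

```python
# Sketch of random environment vertices: depths with mean 6 m, std 0.5 m, clipped to 1 m.
import numpy as np

rng = np.random.default_rng(0)

def random_vertices(n_vertices: int = 150) -> np.ndarray:
    """Random scene vertices (50 triangles imply up to 150 vertices, an assumption here)."""
    dirs = rng.normal(size=(n_vertices, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)         # random unit directions
    depths = np.maximum(rng.normal(6.0, 0.5, n_vertices), 1.0)  # mean 6 m, std 0.5 m, min 1 m
    return dirs * depths[:, None]
```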

Generation of the camera images is accomplished through ray tracing from pixels in the camera image to the projector's virtual image, upon which the projection pattern is drawn. The pattern consists of white dots drawn on a black background and convolved with a Gaussian function with a standard deviation of one pixel to simulate the cross section of a typical laser beam. The nominal intrinsic parameters of all the cameras as well as the projector are the same, representing a device with significant radial distortion and a diagonal FOV of 185°.
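The dot rendering step can be sketched as follows, assuming SciPy's Gaussian filter; the virtual-image size and dot spacing are illustrative.

```python
# Sketch of the simulated projector pattern: white dots on a black virtual image,
# blurred with a 1-pixel-sigma Gaussian to mimic a laser spot cross section.
import numpy as np
from scipy.ndimage import gaussian_filter

def render_dot_pattern(dots_per_side: int = 51, size: int = 640) -> np.ndarray:
    img = np.zeros((size, size), dtype=float)
    coords = np.linspace(size * 0.1, size * 0.9, dots_per_side).round().astype(int)
    for y in coords:
        for x in coords:
            img[y, x] = 255.0                   # ideal one-pixel dot center
    return gaussian_filter(img, sigma=1.0)      # simulated laser spot profile
```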

For all the results presented, each data point is the averaged result of ten runs. For the pattern density and camera number trials, the environment is regenerated for each execution of the depth estimation algorithm, whereas for the camera/projector calibration trials only the noise applied to the respective set of variables is changed. The global maximum of the overall fitness function F(p) for each projection vector is taken as the estimate of the inverse depth, regardless of the presence of other (potentially large) local maxima (i.e., no threshold or other criterion is applied).

The default values for these trials are as follows: a 51 × 51 matrix of projected dots, 4 cameras, and no noise on either the intrinsic or extrinsic calibration. The performance of the algorithm is evaluated by comparing normalized errors, that is, the error in the depth estimate divided by the ground truth depth. This is used instead of the raw error because, for the purposes of obstacle avoidance, a large error at a small depth is worse than the same error at a large depth.
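A minimal sketch of this metric, aggregated as the normalized RMSE reported in the plots of Figure 4:

```python
# Normalized RMSE: depth error divided by ground-truth depth, root-mean-squared.
import numpy as np

def normalized_rmse(est_depth: np.ndarray, true_depth: np.ndarray) -> float:
    """Root mean square of (estimate - truth) / truth over all evaluated pattern points."""
    rel_err = (est_depth - true_depth) / true_depth
    return float(np.sqrt(np.mean(rel_err ** 2)))
```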


A. Pattern Density

The projected pattern is a matrix of dots, simulating a typical laser pattern achievable using diffraction gratings. The size of the pattern is varied from 11 × 11 to 111 × 111 in increments of 20 dots per side. The normalized root mean squared error (RMSE) for these patterns is shown in Figure 4(a). It is clear that with increasing pattern density comes increasing error, as more incorrect peaks in the overall fitness function are identified as the estimated depth. This is an intuitive result, as the limiting case of complete uniform illumination would clearly provide no useful information. The algorithm performance is fairly constant for a low number of dots, but beyond a certain threshold it tends towards this limiting case. When this happens, inaccuracies in the ray tracing method used to generate the camera images become a dominant factor in deciding which of the numerous, nearly identical maxima in the overall fitness function get selected.

B. Number of Cameras

The number of cameras varies from n_c = 1 to 6, taking the first n_c cameras from table I. The normalized RMSE for the different numbers of cameras is shown in Figure 4(b). Increasing the number of cameras reduces the chance that an incorrect peak in the overall fitness function will be the maximum, since the peaks in the individual fitness functions f_j, j ∈ (1, ..., n_c), need to align correctly for them to produce large maxima in F(p). This is demonstrated in Figure 3, which compares the responses for the same pattern vector p in the 2 and 6 camera cases. In the 2 camera case, there are two large peaks in F(p), and the incorrect one is slightly larger, so it gets chosen. In the 6 camera case, there is only one large peak, so there is no ambiguity.

Figure 3. The individual as well as overall (mean) fitness functions for the 2 camera (bottom) and 6 camera (top) cases. (Axes: inverse depth in 1/m vs. image intensity; curves shown for cameras 1-6 and their mean.)

C. Intrinsic Calibration

The Taylor camera model is characterized by 9 parameters: 4 polynomial coefficients, 2 image center coordinates, and 3 affine transformation parameters. We apply Gaussian noise to all of these parameters, varying the standard deviation of the noise from 0.1% to 0.5% of the default value. Figure 4(c) shows the relationship between normalized RMSE and the applied noise. Keeping in mind that a normalized RMSE value ≥ 1 implies so much error that the estimate is completely useless, it is clear that the proposed method is extremely sensitive to uncertainty in the intrinsic parameters when the FOV is as large as in the simulation. In [25] it is shown that the uncertainty in a stereo depth estimate due to uncertainty in the intrinsic calibration is inversely proportional to the focal length of the lens. Since the lens focal length in a pinhole camera model is inversely proportional to FOV, we can reframe the conclusion to say that the uncertainty is directly proportional to FOV. Since our simulated FOV is so large, we expect (and observe) the depth error to be very sensitive to changes in the intrinsic parameters.
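A sketch of this perturbation, assuming the nine Taylor parameters are stored as a flat array, is:

```python
# Gaussian noise with standard deviation expressed as a percentage of each nominal value.
import numpy as np

rng = np.random.default_rng(0)

def perturb_intrinsics(params: np.ndarray, sigma_pct: float = 0.3) -> np.ndarray:
    """Add zero-mean Gaussian noise with sigma equal to sigma_pct % of each default value."""
    sigma = np.abs(params) * (sigma_pct / 100.0)
    return params + rng.normal(0.0, 1.0, params.shape) * sigma
```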

D. Extrinsic Calibration: Translation

The translation components of the extrinsic calibration, whose default values are shown in table I, are varied by applying Gaussian noise. The standard deviation of the noise varies from 0.001 m to 0.005 m, which is within the range expected for a high quality calibration on a multi-camera system with the dimensions presented here. Figure 4(d) shows the relationship between the normalized RMSE and the translation noise. Although our method is clearly sensitive to translation noise, it is to a much lesser degree than with intrinsic calibration, as the significantly smaller RMSE values demonstrate.

E. Extrinsic Calibration: Rotation

The rotation components of the extrinsic calibration are varied by applying Gaussian noise to the rotation matrix R_wc of each camera pose. This is accomplished by first creating a noise vector, then converting it to a rotation matrix using the exponential map of SO(3). The noise matrix is premultiplied by the original rotation matrix (in this case, the identity rotation) to yield the new noisy rotation matrix. For small values of noise, the rotation induced about each axis is approximately equal to the elements of the noise vector. The noise is varied from 0.001 rad to 0.005 rad, which is within the range expected for a high quality calibration of a multi-camera system. Figure 4(e) shows the relationship between normalized RMSE and rotation noise. The sensitivity of the depth estimation to rotation noise is evident, as increasing the noise rapidly leads to nearly unusable estimates.
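A minimal sketch of this perturbation, using Rodrigues' formula for the SO(3) exponential, is given below; the sigma value is one of the tested settings.

```python
# Perturb a nominal rotation with a small random rotation drawn via the SO(3) exponential map.
import numpy as np

def so3_exp(w: np.ndarray) -> np.ndarray:
    """Exponential map: axis-angle 3-vector -> 3x3 rotation matrix (Rodrigues' formula)."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])           # skew-symmetric cross-product matrix
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

rng = np.random.default_rng(0)
sigma = 0.003                                  # rad, within the 0.001-0.005 rad range tested
R_nominal = np.eye(3)                          # identity rotation, as in the simulation
R_noisy = R_nominal @ so3_exp(rng.normal(0.0, sigma, 3))
```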

Figure 4. Normalized RMSE for depth estimation with varying (a) pattern density, (b) camera numbers, (c) intrinsic parameter noise, (d) translation parameter noise, and (e) rotation parameter noise.

F. Summary

The results from the intrinsic and extrinsic parameter noise tests are consistent with the known characteristics of camera systems, namely that image formation (or projection) is much more sensitive to the intrinsic and rotation parameters than to the translation parameters. Therefore, all calibration methods will reduce errors in the intrinsic and rotation parameters more than the translation parameters, and calibration methods that produce good results in other domains of visual computation (such as SLAM) should produce good results for the proposed depth estimation method as well.

The design variables of pattern density and number of cameras are shown to have stable performance over a wide range of values, after which the performance deteriorates. The method is particularly susceptible to too few cameras, and using fewer than 3 cameras is not advisable. Although not shown, patterns other than dot matrices can be used successfully, such as lines or concentric circles. However, in this case performance will suffer somewhat, as the effective density of such patterns is higher. Specifically, the pattern density in the direction of a line is infinite, meaning that the line center cannot be deduced if an epipolar line is parallel to a projected line.

V. REAL PROTOTYPE RESULTS

A real prototype of the depth estimation system was built, using four cameras and a laser pointer with a diffraction grating as the source of projected dots. Since the FOV of the laser is approximately 70°, we chose lenses for the cameras that closely matched this. Calibration of the laser projector can be done with the aid of a known object, such as a checkerboard plate, as is done in [26]. However, as our eventual goal is to use a projector with a very large FOV, we do not implement this method, as it requires the whole calibration object to be visible in all the cameras simultaneously. A potential calibration solution could use dense visual SLAM to map an environment with good texture, obviating the need for a calibration object, since the environment would be well characterized.

In the prototype system, however, we employ a simpler approach: one of the cameras is placed very close to the laser projector, and the resulting image becomes the projector's virtual image, resulting in a system with 3 cameras and 1 projector. To construct the pattern vectors, thresholding is applied to the image, followed by dilation, erosion, and blob detection. The centers of the found blobs are unprojected to form the pattern vector set V. Intrinsic calibration of the cameras is performed as in [23], while extrinsic calibration follows the procedure of [7]. While a real-world system would use an infrared laser and narrow-band optical filters on the cameras, for the prototype we perform experiments in a dark room with a visible green laser.
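A sketch of this pattern-vector extraction, assuming OpenCV and the unproject helper from the earlier camera-model sketch, is given below; the threshold value and kernel size are illustrative assumptions.

```python
# Threshold the projector's virtual image (8-bit grayscale), clean it up with dilation
# and erosion, find blob centers, and unproject them into pattern ray directions.
import cv2
import numpy as np

def extract_pattern_vectors(virtual_img: np.ndarray, thresh: int = 70) -> list:
    _, mask = cv2.threshold(virtual_img, thresh, 255, cv2.THRESH_BINARY)
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.dilate(mask, kernel)              # close small gaps inside each spot
    mask = cv2.erode(mask, kernel)               # restore approximate spot size
    n, _, _, centroids = cv2.connectedComponentsWithStats(mask)
    # Label 0 is the background; remaining centroids are laser spot centers (x, y).
    return [unproject(np.array(c)) for c in centroids[1:]]
```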

The result of one experiment in an office environment is presented below. Figure 5 shows the images from all the cameras and the projector's virtual image, after processing has been applied to generate the pattern vectors. Figure 6 shows a 3D reconstruction of the scene, which is approximately 2-4 m in depth. Qualitatively, the reconstruction is quite satisfactory, as evidenced by the wall, desk, and partition planar surfaces.

To achieve this reconstruction, a threshold of 70 is applied to the overall fitness function for each pattern vector, which eliminates all points that are generated as a result of small global maxima in the overall fitness function. This can happen when a point is not seen by all the cameras, as the contribution of the occluded camera's fitness function will be zero. Two main sources of error are present in the reconstruction. First, the chair in the scene is made mostly of a semi-transparent mesh material, which results in multiple weak reflections that are known to be confusing for all types of light-based depth sensors [17]. However, since these reflections are weak, a simple threshold on the overall fitness function is sufficient to suppress incorrect depths generated by them. Due to this suppression, only one good point from the chair is visible in Figure 6. Second, the lenses of two of the cameras contain infrared filters, while the other two do not. Green laser light in inexpensive units is generated by doubling the frequency of an infrared laser diode, resulting in both green and infrared beams being emitted. The diffraction grating that produces the matrix of dots splits different wavelengths of light into different angles, resulting in the increased spot density visible in the bottom two images of Figure 5, which come from the cameras without infrared filters. Since one of the three cameras cannot see these extra dots, any false peaks in the overall fitness function generated as a result of the infrared dots have a smaller magnitude than those of dots visible in all three cameras, and thus thresholding is sufficient to remove them.

Figure 5. Screenshot from the depth reconstruction showing the projector virtual image (top left) and camera images. A selected pattern vector's inverse depth interval is shown projected into the cameras (zoom in on digital version).

Figure 6. Reconstruction of an office environment by the prototype depth estimation system. (Labeled regions: projector, wall, desk, partition, shelves, chair; depth axis 2-4 m.)

VI. CONCLUSION

We have demonstrated the feasibility of constructing an active vision depth sensor using only miniature cameras and a laser dot matrix projector by exploiting the epipolar constraints between them. The placement of the cameras and projector is flexible, but computing an optimal distribution for a given volume is an interesting future task. This flexibility allows one to cover as much FOV as desired, constrained only by the availability of suitable cameras and projectors. Although currently it is not practical to mount more than a handful of cameras on a small UAV, the state of the art in camera design is improving rapidly. In [27], researchers built a "fly's eye" omnidirectional sensor out of 100 small CMOS cameras, and in [28] a compound eye sensor is built out of an elastomeric microlens array and stretchable photodiodes at the centimeter scale. Multiple lenslets and a single traditional image sensor can also be used to capture multiple viewpoints of a scene, as is done in [29]. Therefore, it is conceivable that covering the exterior of a small UAV with dozens or even hundreds of micro-cameras could be feasible, even desirable, as researchers attempt to emulate the capabilities of insects, which fly using vision almost exclusively. However, even insects require visual texture; enabling a small UAV to sense depth in all environmental conditions and all directions would allow it to outperform anything found in nature.

ACKNOWLEDGMENT

This work was supported by the Natural Sciences and Engineering Research Council (NSERC) through the NSERC Canadian Field Robotics Network (NCFRN).

REFERENCES

[1] (2014, February) International aerial robotics competition. [Online]. Available: http://www.aerialroboticscompetition.org

[2] D. Fofi, T. Sliwa, and Y. Voisin, "A comparative survey on invisible structured light," in Proc. SPIE Machine Vision Applications in Industrial Inspection XII, vol. 5303, May 2004, pp. 90–98.

[3] N. Pfeifer and C. Briese, "Laser scanning - principles and applications," in 3rd International Exhibition & Scientific Congress on Geodesy, Mapping, Geology, Geophysics, Novosibirsk, Russia, 2007.

[4] C. Mertz, S. J. Koppal, S. Sia, and S. G. Narasimhan, "A low-power structured light sensor for outdoor scene reconstruction and dominant material identification," in 9th IEEE International Workshop on Projector-Camera Systems, June 2012.

[5] J. Salvi, J. Pagès, and J. Batlle, "Pattern codification strategies in structured light systems," Pattern Recognition, vol. 37, no. 4, pp. 827–849, April 2004.

[6] R. Dvorak, M. Drahansky, and F. Orsag, "Object surface reconstruction from one camera system," International Journal of Multimedia and Ubiquitous Engineering, vol. 5, no. 2, pp. 27–38, 2010.

[7] A. Harmat, I. Sharf, and M. Trentini, "Parallel tracking and mapping with multiple cameras on an unmanned aerial vehicle," in Int. Conf. Intelligent Robotics and Applications, Montreal, Canada, 2012, pp. 421–432.

[8] R. Orghidan, E. M. Mouaddib, J. Salvi, and J. J. Serrano, "Catadioptric single-shot rangefinder for textured map building in robot navigation," IET Computer Vision, vol. 1, no. 2, pp. 43–53, June 2007.

[9] S. Weiss, M. Achtelik, L. Kneip, D. Scaramuzza, and R. Siegwart, "Intuitive 3d maps for mav terrain exploration and obstacle avoidance," J. Intelligent and Robotic Systems, vol. 61, no. 1-4, pp. 473–493, 2010.

[10] A. Wendel, M. Maurer, G. Graber, T. Pock, and H. Bischof, "Dense reconstruction on-the-fly," in IEEE Conference on Computer Vision and Pattern Recognition, 2012.

[11] W. Morris, I. Dryanovski, and J. Xiao, "3d indoor mapping for micro-uavs using hybrid range finders and multi-volume occupancy grids," in Robotics: Science and Systems Conf., Zaragoza, Spain, 2010.

[12] M. Montemerlo, J. Becker, S. Bhat, H. Dahlkamp, D. Dolgov, S. Ettinger, D. Haehnel, T. Hilden, G. Hoffmann, B. Huhnke, D. Johnston, S. Klumpp, D. Langer, A. Levandowski, J. Levinson, J. Marcil, D. Orenstein, J. Paefgen, I. Penny, A. Petrovskaya, M. Pflueger, G. Stanek, D. Stavens, A. Vogt, and S. Thrun, "Junior: The Stanford entry in the Urban Challenge," Journal of Field Robotics, vol. 25, no. 9, pp. 569–597, September 2008.

[13] (2014, February) Phoenix aerial systems. [Online]. Available: http://www.phoenix-aerial.com

[14] A. Bachrach, S. Prentice, R. He, P. Henry, A. Huang, M. Krainin, D. Maturana, D. Fox, and N. Roy, "Estimation, planning and mapping for autonomous flight using an RGB-D camera in GPS-denied environments," International Journal of Robotics Research, vol. 31, no. 11, pp. 1320–1343, September 2012.

[15] H. Balta, G. D. Cubber, D. Doroftei, Y. Baudoin, and H. Sahli, "Terrain traversability analysis for off-road robots using time-of-flight 3d sensing," in IARP RISE-Robotics for Risky Environment-Extreme Robotics, 2013.

[16] A. Prusak, O. Melnychuk, H. Roth, I. Schiller, and R. Koch, "Pose estimation and map building with a pmd-camera for robot navigation," International Journal of Intelligent Systems Technologies and Applications, vol. 5, pp. 355–364, 2008.

[17] M. Hansard, S. Lee, O. Choi, and R. P. Horaud, Time-of-Flight Cameras: Principles, Methods and Applications, S. Zdonik, P. Ning, S. Shekhar, J. Katz, X. Wu, L. C. Jain, D. Padua, and X. Shen, Eds. Springer Publishing Company, 2013.

[18] C. Mertz, J. Kozar, J. R. Miller, and C. Thorpe, "Eye-safe laser line striper for outside use," in IEEE Intelligent Vehicle Symposium, 2002.

[19] D. Ilstrup and R. Manduchi, "Active triangulation in the outdoors: A photometric analysis," in International Symposium on 3D Data Processing, Visualization, and Transmission, 2010.

[20] J.-M. Lequelle and F. Lerasle, "Car cockpit 3d reconstruction by a structured light sensor," in IEEE Intelligent Vehicles Symposium, 2000.

[21] L. Matthies, T. Balch, and B. Wilcox, "Fast optical hazard detection for planetary rovers using multiple spot laser triangulation," in IEEE International Conference on Robotics and Automation, Albuquerque, New Mexico, April 1997.

[22] M. Okutomi and T. Kanade, "A multiple-baseline stereo," in IEEE Conference on Computer Vision and Pattern Recognition, June 1991, pp. 63–69.

[23] D. Scaramuzza, "Omnidirectional vision: From calibration to robot motion estimation," Ph.D. dissertation, ETH Zurich, 2008.

[24] F. Scarano, "Tomographic piv: principles and practice," Measurement Science and Technology, vol. 24, no. 1, pp. 1–28, January 2013.

[25] S. Das and N. Ahuja, "Performance analysis of stereo, vergence, and focus as depth cues for active vision," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 12, pp. 1213–1220, 1995.

[26] T. Botterill, S. Mills, and R. Green, "Design and calibration of a hybrid computer vision and structured light 3d imaging system," in International Conference on Automation, Robotics and Applications, December 2011, pp. 441–446.

[27] H. Afshari, L. Jacques, L. Bagnato, A. Schmid, P. Vandergheynst, and Y. Leblebici, "The panoptic camera: A plenoptic sensor with real-time omnidirectional capability," Journal of Signal Processing Systems for Signal, Image and Video Technology, vol. 7, no. 3, pp. 305–328, 2013.

[28] Y. M. Song, Y. Xie, V. Malyarchuk, J. Xiao, I. Jung, K.-J. Choi, Z. Liu, H. Park, C. Lu, R.-H. Kim, R. Li, K. B. Crozier, Y. Huang, and J. A. Rogers, "Digital cameras with designs inspired by the arthropod eye," Nature, vol. 497, pp. 95–99, 2013.

[29] R. Ng, "Digital light field photography," Ph.D. dissertation, Stanford University, 2006.