Fusing Real-Time Depth Imaging with High Precision Pose Estimation by a Measurement Arm

Svenja Kahn, Arjan Kuijper
Fraunhofer IGD

Darmstadt, Germany
Email: [email protected]

Abstract—Recently, depth cameras have emerged which capture dense depth images in real-time. To benefit from their 3D imaging capabilities in interactive applications which support an arbitrary camera movement, the position and orientation of the depth camera needs to be robustly estimated in real-time for each captured depth image. Therefore, this paper describes how to combine a depth camera with a mechanical measurement arm to fuse real-time depth imaging with real-time, high precision pose estimation. Estimating the pose of a depth camera with a measurement arm has three major advantages over 2D/3D image based optical pose estimation: the measurement arm has a guaranteed accuracy better than 0.1 mm, the pose estimation accuracy is not influenced by the captured scene, and the computational load is much lower than for optical pose estimation, leaving more processing power for the applications themselves.

Keywords-Depth Imaging, Pose Estimation, Measurement Arm, Mixed Reality

I. INTRODUCTION

Whereas until a few years ago only 2D images could be captured with video cameras, depth cameras have recently been developed as well. These cameras capture dense depth images in real-time by measuring the distance to a 3D scene for each pixel in the camera image: time-of-flight depth cameras emit near-infrared light and measure the time it takes until the light reflected by the 3D scene gets back to the camera sensor [1]. In contrast, structured light based cameras such as the Kinect depth camera project a near-infrared pattern onto the scene and estimate the distance by analyzing the distortion of the projected pattern [2].

The benefits of real-time 3D imaging over two-dimensional imaging become evident in combination with mixed reality applications: on the one hand, real-time depth imaging enhances the realism of mixed reality by providing 3D information about the scene required for realistic shadow and occlusion calculation [3]. On the other hand, real-time depth imaging in combination with augmented reality techniques supports 3D inspection and 3D modeling tasks by providing a dense 3D geometric difference detection between a real object and a 3D model of this object [4]. Such interactive applications use a depth camera that is not restricted to a single viewpoint. Thus, the position and orientation of the depth camera needs to be calculated in real-time for each captured frame.

In previous work [3][4][5], the depth camera’s pose was estimated with image based camera tracking, either marker-based or markerless. However, image based camera pose estimation of depth cameras has the drawback that the accuracy of the camera pose depends strongly on the captured scene. If there are not enough distinguishable features in the 2D camera image, or if these features cover only a subpart of the camera image, the tracking accuracy decreases or the tracking fails completely. The weaknesses of 2D image based optical camera tracking can be reduced by incorporating the depth measurements in the pose estimation process [6]. However, the alignment of measured depth images with a reconstructed 3D model of a scene is computationally very expensive. Calculating the pose in real-time requires a very sophisticated approach which massively parallelizes all required computations on the graphics card [7]. This approach has the drawback that the massive execution on the graphics card blocks graphics card resources which the application itself could otherwise use, and that the tracking feasibility still depends on the availability of characteristic 2D or 3D features in the captured camera images.

For 3D imaging applications, inaccuracies in the camera pose estimation are usually more critical than for classical 2D image based augmented reality applications, which augment an object onto the 2D camera image: for the augmentation of an object onto a 2D image, only the 2D projection needs to be accurate. However, slightly different combinations of 3D camera positions and orientations cause very similar projections in the 2D space. Thus, when the estimated camera pose differs from the real camera pose, this has a larger effect in the 3D space than in projections onto 2D images. For example, this effect can be observed when a marker tracker is used to estimate the pose of a depth camera: if a 3D model of the marker is projected onto the 2D camera image, the marker corners are robustly projected onto the correct positions in the 2D image, even in frames where the 3D measurements are obviously misaligned due to inaccuracies in the estimated camera pose.

Figure 1. Depth cameras rigidly coupled with a mechanical Faro measurement arm: (a) Kinect, (b) SwissRanger 4000, (c) CamCube 3.0.

II. CONCEPT AND SETUP

To overcome the disadvantages of image based camera tracking and to achieve an adequate pose estimation accuracy for 3D imaging applications, our approach combines real-time depth imaging with high precision pose estimation by a mechanical measurement arm.

For classical 2D color cameras, the combination of a 2D camera with a measurement arm has already been shown to be useful for the acquisition of ground truth data for the evaluation of camera tracking algorithms [8][9]. We extend this combination to depth cameras and show that the combination of a measurement arm and a depth camera makes real-time depth imaging applications feasible which would not be possible without this specific combination. Our approach is targeted at industrial applications, where measurement arms are already used for measuring tasks: measurement arms have a point tip with which the user can measure the 3D positions of points on the surface of an object with a very high accuracy.

For calculating the camera pose with a measurement arm, we make use of the fact that the Faro measurement arm not only provides the exact 3D position of the point tip, but also its orientation. The main advantages of our approach are:

• High precision pose estimation: a Faro Platinum measurement arm has a measurement range of 3.7 meters and provides a pose with a guaranteed precision of 0.073 mm within this working space. In contrast, the camera pose estimated by real-time image-based camera tracking can differ from the real pose by several millimeters to centimeters.

• The pose estimation does not depend on a well-suited environment: in contrast to 2D/3D image based camera tracking, the pose is also calculated robustly if there are only few characteristic 2D/3D features visible in the current camera image (even if no trackable features are visible at all). Thus, the user does not need to be careful to point the camera only towards well-textured regions in order to maintain a stable camera pose estimation.

• Minimal computational load: the measurement arm mechanically measures the rotation of each joint with shaft encoders and directly outputs the position and orientation of the point tip relative to the coordinate system of the measurement arm. Thus, the application only needs to get the pose from the interface to the Faro arm and does not need to spend any further processing resources on the pose calculation.

To track the pose of a depth camera with a measurement arm, the depth camera is rigidly coupled with the measurement arm. Figure 1 shows three state-of-the-art depth cameras rigidly attached to a Faro Platinum arm.

Synchronization: While the measurement arm provides the possibility to attach a hardware trigger cable, we chose a software based approach for the synchronization of the depth camera and the measurement arm. The reason for this choice is to provide a generic approach which can be used with any depth camera. For example, the Kinect depth camera does not provide an interface for a hardware triggering signal, so it is not possible to use a hardware triggered synchronization for this depth camera.

The high update rate of the measurement arm's pose ensures an accurate synchronization in a software based approach: the Faro arm measures and outputs more than 200 poses per second. By acquiring the most recent pose from the measurement arm directly before capturing a new depth image, a maximal offset of less than 5 ms can be ensured. This offset is very small compared to the typical integration time of a time-of-flight depth camera, which is about 100 ms per captured frame.
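The ordering of the two queries matters more than the exact API. The following minimal sketch illustrates this software based synchronization, assuming hypothetical `arm.get_latest_pose()` and `depth_cam.capture()` wrappers around the vendor interfaces (neither name comes from the paper).

```python
# Minimal sketch of the software-based synchronization described above.
# `arm` and `depth_cam` are hypothetical wrapper objects around the Faro
# arm interface and the depth camera SDK; only the call order is the point.
import time

def capture_synchronized_frame(arm, depth_cam):
    """Pair a depth image with the most recent arm pose.

    Because the arm reports more than 200 poses per second, querying its
    latest pose immediately before grabbing a depth frame bounds the time
    offset between pose and image to a few milliseconds.
    """
    pose = arm.get_latest_pose()        # most recent (R, t) of the point tip
    pose_time = time.monotonic()        # timestamp of the pose query
    depth_image = depth_cam.capture()   # blocks until the next frame is ready
    return pose, pose_time, depth_image
```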

Figure 2. Sketch of the measurement arm, the depth camera and the image marker used for the hand-eye calibration, showing the marker corners p1, ..., p4, the coordinate systems WCS, CCS and TipCS, and the transformations T1, T2 and T3.

III. HAND-EYE CALIBRATION

To calculate the pose of the depth camera from a given pose of the measurement arm, the relative transformation between the depth camera and the measurement tip of the Faro arm needs to be calculated first. This process is called hand-eye calibration [10][11]. For a depth camera, it is similar to the hand-eye calibration between a measurement arm and a classical 2D camera. However, the hand-eye calibration needs to be adapted such that it accounts for the low resolution of time-of-flight depth cameras.

A necessary prerequisite for estimating the hand-eye calibration is that the intrinsic parameters of the depth camera need to be known (the horizontal and vertical focal length of the camera, its principal point, as well as the distortion parameters). These intrinsic parameters only need to be estimated once for each camera and are the same parameters as for 2D color cameras. Thus, they can be calculated in an offline calibration procedure with a 2D camera calibration toolbox (the GML Camera Calibration Toolbox).
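As an illustration of how these intrinsic parameters are used later on, the sketch below builds the standard pinhole projection from them. The numeric values are placeholders, not calibration results from the paper, and lens distortion is omitted for brevity.

```python
# Sketch of the pinhole projection model defined by the intrinsic parameters
# named above. The values are placeholders; in practice they come from an
# offline calibration, e.g. with the GML Camera Calibration Toolbox.
import numpy as np

fx, fy = 250.0, 250.0   # horizontal / vertical focal length in pixels (placeholder)
cx, cy = 88.0, 72.0     # principal point in pixels (image center of a 176x144 sensor)
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

def project(point_cam):
    """Project a 3D point given in the camera coordinate system onto the image
    plane (lens distortion, which the full model applies first, is ignored)."""
    x, y, z = point_cam
    return fx * x / z + cx, fy * y / z + cy
```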

Figure 2 visualizes a sketch of the measurement arm, the depth camera and an image marker which is used to calculate the hand-eye calibration. We define the world coordinate system (WCS) as the coordinate system of the measurement arm. The measurement arm outputs the transformation T1, which is the relative transformation between the measurement tip’s coordinate system (TipCS) and the coordinate system of the base of the measurement arm (WCS). The transformation T2 is the hand-eye transformation between the coordinate system of the depth camera (CCS) and TipCS. T3 is the camera pose relative to the world coordinate system. Once the hand-eye transformation is known, the camera pose can be calculated from the pose of the measurement arm and the hand-eye transformation with equation 1. In the notation of this equation, each transformation Ti is split up into its rotational and translational component (Ri and ti).

R3 = R2 · R1
t3 = R2 · t1 + t2        (1)
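Written out with rotation matrices and translation vectors, equation 1 amounts to a two-line computation; the sketch below follows the paper's composition formula literally.

```python
# Equation (1) written out with numpy: given the arm pose T1 = (R1, t1) and
# the hand-eye transformation T2 = (R2, t2), compute the camera pose
# T3 = (R3, t3) relative to the world coordinate system.
import numpy as np

def camera_pose_from_arm_pose(R1, t1, R2, t2):
    R3 = R2 @ R1            # rotational part of equation (1)
    t3 = R2 @ t1 + t2       # translational part of equation (1)
    return R3, t3
```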

Figure 3. Marker captured with a SwissRanger 4000 depth camera. The intensity image is more blurred than typical 2D images of higher-resolution 2D cameras.

The hand-eye transformation is calculated from n pose pairs (T1j, T3j) with 1 ≤ j ≤ n. Each such pair contains a pose of the measurement arm’s point tip and a depth camera pose, both relative to the world coordinate system. The main challenge for acquiring such a pose pair is how to calculate the pose of the depth camera T3j if the hand-eye transformation is not known yet. To solve this task, we use the following two properties:

• The Faro measurement arm can be used to measure the 3D coordinates of 3D points on object surfaces. The measured 3D coordinates are in the base coordinate system of the measurement arm (which is the world coordinate system).

• The pose of a camera can be calculated from a set of 2D-3D correspondences. Each such 2D-3D correspondence stores the position of a 3D point in the world coordinate system and its 2D projection onto the image coordinate system of the camera.

We use an image marker to obtain such 2D-3D correspondences. Instead of an image marker, a checkerboard pattern could be used for the acquisition of 2D-3D correspondences as well. However, time-of-flight depth cameras have a very low resolution (a SwissRanger 4000 depth camera has a resolution of 176 × 144 pixels and a CamCube 3.0 has a resolution of 200 × 200 pixels). Furthermore, the intensity images of time-of-flight depth cameras are more blurred than the images of 2D cameras. Thus, instead of using a checkerboard pattern, we use the corners of a single image marker, which can be robustly detected in low-resolution depth images, and capture close-up images of this marker for a robust hand-eye calibration. Figure 3 shows such a marker captured with a SwissRanger 4000 depth camera.

This marker is attached to a planar surface in the working range of the measurement arm and the 3D positions of its four corners (p1, ..., p4) are measured with the point tip of the measurement arm. These four 3D points as well as the intrinsic parameters of the depth camera and an image of the marker are the input for the camera pose estimation. Both the marker detection and the pose estimation are provided by the computer vision framework InstantVision [12]. The depth camera’s pose T3j is estimated with a direct linear transformation (DLT) and a subsequent nonlinear least squares optimization. Then, the accuracy of the calculated camera pose is estimated by a pixelwise comparison of the marker in the captured camera image with a simulated projection of the virtual marker image onto the 2D image, given the calculated camera pose. A pose pair (T1j, T3j) is only used for the hand-eye calibration if the projected marker matches the captured marker well in the 2D image. For such a valid pose pair, the hand-eye calibration T2j is specified in equation 2 (it can easily be inferred from equation 1).

R2 = R3 · R1⁻¹
t2 = t3 − R2 · t1        (2)
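The per-frame camera pose estimation and the reprojection based validity check described above can be sketched as follows. The paper uses the DLT and nonlinear least squares optimization of InstantVision; OpenCV's solvePnP is used here merely as a stand-in for that step, and the acceptance threshold is an assumption rather than a value from the paper.

```python
# Sketch of the per-frame camera pose estimation from the four measured
# marker corners, with a simplified reprojection check standing in for the
# pixelwise marker comparison described in the text.
import cv2
import numpy as np

def estimate_camera_pose(corners_3d_world, corners_2d, K, dist_coeffs,
                         max_reproj_error_px=2.0):
    """corners_3d_world: 4x3 marker corners measured with the arm's point tip,
    corners_2d: 4x2 detected corner positions in the depth camera image.
    Returns (R, t) mapping world coordinates into the camera frame, or None."""
    ok, rvec, tvec = cv2.solvePnP(
        corners_3d_world.astype(np.float64),
        corners_2d.astype(np.float64),
        K, dist_coeffs, flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        return None

    # Reject the pose if the reprojected corners deviate too much from the
    # detected corners (assumed threshold, not a value from the paper).
    reprojected, _ = cv2.projectPoints(corners_3d_world, rvec, tvec, K, dist_coeffs)
    error = np.linalg.norm(reprojected.reshape(-1, 2) - corners_2d, axis=1).max()
    if error > max_reproj_error_px:
        return None

    R, _ = cv2.Rodrigues(rvec)
    return R, tvec.reshape(3)
```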

Theoretically, the hand-eye calibration could be approximated from a single pose pair. However, as described in section I, the camera pose can be calculated inaccurately even if the projected marker matches the real marker well in the 2D image. To account for such inaccuracies, several hundred to a few thousand pose pairs are captured (the required sequence can be recorded quite quickly, as depth cameras capture 10 to 30 depth images per second) and T2j is calculated for each pose pair. Then, each rotational and translational parameter of the final hand-eye calibration is the median of this parameter over all collected T2j transformations. With so many transformations, the median differs only slightly from the mean. Nevertheless, the median is used to calculate the final hand-eye transformation because it is more robust against outliers than the mean.
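A minimal sketch of this median based aggregation follows, assuming equation 2 is evaluated per pose pair and that the rotational parameters are represented as Euler angles (the paper does not specify the parameterization, so this choice is an assumption for illustration).

```python
# Sketch of the hand-eye aggregation: equation (2) per valid pose pair,
# followed by a component-wise median. Euler angles are an assumed rotation
# parameterization (not stated in the paper) and can be fragile near +-180
# degrees; this is purely illustrative.
import numpy as np
from scipy.spatial.transform import Rotation

def hand_eye_from_pose_pairs(pose_pairs):
    """pose_pairs: list of ((R1, t1), (R3, t3)) with 3x3 rotation matrices and
    3-vectors for every valid pair collected during the calibration sequence."""
    euler_angles, translations = [], []
    for (R1, t1), (R3, t3) in pose_pairs:
        R2 = R3 @ R1.T          # equation (2), using R1^-1 = R1^T for rotations
        t2 = t3 - R2 @ t1
        euler_angles.append(Rotation.from_matrix(R2).as_euler("xyz"))
        translations.append(t2)

    median_angles = np.median(np.asarray(euler_angles), axis=0)
    median_t = np.median(np.asarray(translations), axis=0)
    R2_final = Rotation.from_euler("xyz", median_angles).as_matrix()
    return R2_final, median_t
```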

Calibration of the Kinect depth camera: Whereas time-of-flight cameras record both a depth value and an intensity (grey) value for each pixel, the depth camera of the Kinect captures only depth values. Nevertheless, the depth camera of the Kinect can also be calibrated with a special workaround: the depth camera is in fact a near-infrared camera which captures a projected pattern and calculates the distances from the distortions of this pattern in the near-infrared image. With the OpenNI interface, the raw infrared image can be acquired. In this raw image, the marker cannot be detected due to the projected point pattern. Therefore, the projector is covered such that it does not project the pattern anymore, and the marker is illuminated with an external infrared lamp. Thus, the marker becomes visible in the infrared images of the Kinect depth camera and the depth camera can be calibrated as described in this section.

Figure 4. 3D difference detection with a depth camera attached to a measurement arm.

IV. APPLICATION: DIFFERENCE DETECTION

Figure 4 shows a 3D difference detection which is based on the combination of real-time depth imaging with a measurement arm. This demonstrator was created in cooperation with Volkswagen research and can be used to check whether a 3D model of a real object accurately models the real object, or whether there are differences between the object and the 3D model. The user can move the depth camera around the real object to detect differences. With the known pose of the depth camera (inferred from the pose of the Faro measurement arm), the depth measurements are compared to the corresponding positions in the 3D model and the differences are visualized with a color-encoded augmentation: green represents a good match, yellow shows that the real object is farther away than modeled, and red shows that the real object is closer than it should be (according to the 3D model). For example, the steering wheel is part of the 3D model, but it is missing in the real object. The gear is shifted differently in the real object than in the 3D model, so it is colored both in yellow (where it should be, according to the 3D model) and in red (where it actually is in the real object).
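The color-encoded comparison can be sketched as follows: each valid depth pixel is back-projected into the world coordinate system using the arm-derived camera pose and compared against the 3D model. The `signed_distance_to_model` helper and the tolerance value are assumptions for illustration; an efficient implementation would query the model surface through a spatial data structure rather than per-pixel Python loops.

```python
# Sketch of the color-encoded 3D difference detection. `signed_distance_to_model`
# is a hypothetical helper (positive: measured surface farther away than the
# model, negative: closer); the tolerance is an assumed value.
import numpy as np

GREEN, YELLOW, RED = (0, 255, 0), (0, 255, 255), (0, 0, 255)

def color_code_differences(depth_image, K, R_cam_to_world, t_cam_to_world,
                           signed_distance_to_model, tolerance_mm=2.0):
    h, w = depth_image.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    overlay = np.zeros((h, w, 3), dtype=np.uint8)
    for v in range(h):
        for u in range(w):
            z = depth_image[v, u]
            if z <= 0:                       # no valid depth measurement
                continue
            # Back-project the pixel into the camera frame, then into world
            # coordinates using the pose derived from the measurement arm.
            p_cam = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])
            p_world = R_cam_to_world @ p_cam + t_cam_to_world
            d = signed_distance_to_model(p_world)
            if abs(d) <= tolerance_mm:
                overlay[v, u] = GREEN        # measurement matches the model
            elif d > 0:
                overlay[v, u] = YELLOW       # real surface farther away than modeled
            else:
                overlay[v, u] = RED          # real surface closer than modeled
    return overlay
```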

The major advantage of such an application in comparison to static laser scanning is that the user can interactively move the camera to inspect differences at different parts of the object and from arbitrary viewpoints. Such an interactive 3D difference detection application could not be realized without the combination of a depth camera and a measurement arm: laser scanners usually either only capture depth values along a single 1D scan line at a time, or they can only be used from a static position, such that the user cannot change the position of the scanner during the scanning process. Without the precise pose measured by the mechanical measurement arm, the camera pose would have to be tracked either optically or by a real-time registration of the 3D point clouds. Whereas the latter is too computationally intensive to be executed in parallel to the difference detection algorithm, optical camera tracking would not be accurate enough: for such a textureless object, the camera pose estimated with optical camera tracking usually differs by several millimeters to centimeters from the real camera pose. Such an inaccurate pose estimation would cause severe errors in the difference calculation, making it infeasible to detect differences which are smaller than the inaccuracies of the camera pose estimation.

The accuracy of the 3D difference detection depends both on the accuracy of the pose estimation and on the measurement accuracy of the depth camera. The measurement accuracy of a depth camera is influenced by several factors, such as its distance to the measured surface and the surface’s reflection properties (for example, the measurement accuracy is better for diffuse than for specular surfaces). Kahn [4] provides a simulation based analysis of the general influence of camera pose estimation inaccuracies and measurement errors on the overall accuracy of the 3D difference detection.

V. CONCLUSION

In this paper, we have presented the combination of a depth camera with a mechanical measurement arm, which fuses real-time depth imaging with highly precise pose estimation. Such an integration is the basis for applications that acquire and register dense 3D measurements from arbitrary camera positions in real-time, such as the application described in section IV.

Furthermore, an accurate camera pose acquired by a mechanical measurement arm provides the ground truth data required for a quantitative evaluation of the accuracy of 3D measurements. The setup described in this paper was used to acquire a variety of depth image sequences with known ground truth camera poses. These test sequences are currently being evaluated to quantify the measurement accuracy of depth cameras for different scenarios and to quantify the influence of image based camera tracking inaccuracies on the overall accuracy.

Regarding the hand-eye calibration between the depth camera and the measurement arm, it would be interesting in future work to evaluate whether the hand-eye transformation could be further improved by a global optimization. Currently, the hand-eye calibration is calculated separately for each captured frame and the final hand-eye calibration is the median of several hundred hand-eye calibrations which were calculated independently of each other. An alternative approach would be to estimate a hand-eye transformation which globally minimizes the projection errors of all 2D-3D marker corner correspondences, instead of first estimating a camera pose from the 2D-3D correspondences of each frame separately.
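Such a global formulation could, for example, parameterize the hand-eye transformation with a rotation vector and a translation and minimize the stacked reprojection errors over all frames. The following sketch (with assumed data layout and helper names, not taken from the paper) illustrates the idea with a generic least squares solver.

```python
# Sketch of the global alternative outlined above: one hand-eye transformation
# is optimized so that it minimizes the reprojection error of all 2D-3D marker
# corner correspondences over all frames. Data layout is an assumption.
import cv2
import numpy as np
from scipy.optimize import least_squares

def residuals(params, frames, corners_3d_world, K, dist_coeffs):
    """params: hand-eye transformation as rotation vector (3) + translation (3).
    frames: list of (R1, t1, corners_2d) per captured image."""
    R2, _ = cv2.Rodrigues(params[:3].reshape(3, 1))
    t2 = params[3:]
    errors = []
    for R1, t1, corners_2d in frames:
        R3 = R2 @ R1                          # camera pose via equation (1)
        t3 = R2 @ t1 + t2
        rvec3, _ = cv2.Rodrigues(R3)
        projected, _ = cv2.projectPoints(corners_3d_world, rvec3,
                                         t3.reshape(3, 1), K, dist_coeffs)
        errors.append((projected.reshape(-1, 2) - corners_2d).ravel())
    return np.concatenate(errors)

def refine_hand_eye(initial_R2, initial_t2, frames, corners_3d_world, K, dist_coeffs):
    """Refine a median-based initial hand-eye estimate by global optimization."""
    rvec0, _ = cv2.Rodrigues(initial_R2)
    x0 = np.concatenate([rvec0.ravel(), np.asarray(initial_t2, dtype=float)])
    result = least_squares(residuals, x0,
                           args=(frames, corners_3d_world, K, dist_coeffs))
    R2_opt, _ = cv2.Rodrigues(result.x[:3].reshape(3, 1))
    return R2_opt, result.x[3:]
```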

REFERENCES

[1] T. Oggier, F. Lustenberger, and N. Blanc, “Miniature 3d tof camera for real-time imaging,” in Perception and Interactive Technologies, 2006, pp. 212–216.

[2] Z. Zalevsky, A. Shpunt, A. Maizels, and J. Garcia, “Method and system for object reconstruction,” WO Patent Application WO 2007/043 036 A1, 2007.

[3] T. Franke, S. Kahn, M. Olbrich, and Y. Jung, “Enhancing realism of mixed reality applications through real-time depth-imaging devices in x3d,” in Proceedings of the 16th International Conference on 3D Web Technology, ser. Web3D ’11. New York, NY, USA: ACM, 2011, pp. 71–79.

[4] S. Kahn, “Reducing the gap between augmented reality and 3d modeling with real-time depth imaging,” Virtual Reality, pp. 1–13, 2012, doi: 10.1007/s10055-011-0203-0.

[5] K. Yun and W. Woo, “Spatial interaction using depth camera for miniature ar,” in 3DUI 2009, 2009, pp. 119–122.

[6] S. Liu, Y. Wang, L. Yuan, P. Tan, and J. Sun, “Video stabilization with a depth camera,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, 8 pp.

[7] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, “Kinectfusion: Real-time dense surface mapping and tracking,” in Proceedings of the 2011 10th IEEE International Symposium on Mixed and Augmented Reality, ser. ISMAR ’11, 2011, pp. 127–136.

[8] L. Gruber, S. Gauglitz, J. Ventura, S. Zollmann, M. Huber, M. Schlegel, G. Klinker, D. Schmalstieg, and T. Hollerer, “The city of sights: Design, construction, and measurement of an augmented reality stage set,” in Proc. Ninth IEEE International Symposium on Mixed and Augmented Reality (ISMAR’10), Seoul, Korea, Oct. 13–16, 2010, pp. 157–163.

[9] S. Lieberknecht, S. Benhimane, P. Meier, and N. Navab, “Benchmarking template-based tracking algorithms,” Virtual Reality, vol. 15, pp. 99–108, 2011.

[10] R. Y. Tsai and R. K. Lenz, “A new technique for fully autonomous and efficient 3d robotics hand-eye calibration,” in Proceedings of the 4th International Symposium on Robotics Research, 1988, pp. 287–297.

[11] K. H. Strobl and G. Hirzinger, “Optimal hand-eye calibration,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2006, pp. 4647–4653.

[12] M. Becker, G. Bleser, A. Pagani, D. Stricker, and H. Wuest, “An architecture for prototyping and application development of visual tracking systems,” in Capture, Transmission and Display of 3D Video (Proceedings of 3DTV-CON 07 [CD-ROM]), 2007.