
Manipulator and object tracking for in-hand 3D object modeling

The International Journal of Robotics Research 00(000) 1–17
© The Author(s) 2011
Reprints and permission: sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/0278364911403178
ijr.sagepub.com

Michael Krainin1, Peter Henry1, Xiaofeng Ren2, and Dieter Fox1,2

Abstract
Recognizing and manipulating objects is an important task for mobile robots performing useful services in everyday environments. While existing techniques for object recognition related to manipulation provide very good results even for noisy and incomplete data, they are typically trained using data generated in an offline process. As a result, they do not enable a robot to acquire new object models as it operates in an environment. In this paper we develop an approach to building 3D models of unknown objects based on a depth camera observing the robot's hand while moving an object. The approach integrates both shape and appearance information into an articulated Iterative Closest Point approach to track the robot's manipulator and the object. Objects are modeled by sets of surfels, which are small patches providing occlusion and appearance information. Experiments show that our approach provides very good 3D models even when the object is highly symmetric and lacks visual features and the manipulator motion is noisy. Autonomous object modeling represents a step toward improved semantic understanding, which will eventually enable robots to reason about their environments in terms of objects and their relations rather than through raw sensor data.

Keywords
Cognitive robotics, learning and adaptive systems, personal robots, sensing and perception, computer vision, sensor fusion, visual tracking

1. Introduction

The ability to recognize and manipulate objects is important for mobile robots performing useful services in everyday environments. In recent years, various research groups have made substantial progress in recognition and manipulation of everyday objects (Ciocarlie et al., 2007; Berenson and Srinivasa, 2008; Saxena et al., 2008; Collet Romea et al., 2009; Glover et al., 2009; Lai and Fox, 2009; Rasolzadeh et al., 2009). While the developed techniques are often able to deal with noisy data and incomplete models, they still have limitations with respect to their usability in long-term robot deployments in realistic environments. One crucial limitation is due to the fact that there is no provision for enabling a robot to autonomously acquire new object models as it operates in an environment. This is an important limitation, since no matter how extensive the training data, a robot might always be confronted with a novel object instance or type when operating in an unknown environment.

The goal of our work is to develop techniques that enable robots to autonomously acquire models of unknown objects, thereby increasing their semantic understanding. Ultimately, such a capability will allow robots to actively investigate their environments and learn about objects in an incremental way, adding more and more knowledge over time. In addition to shape and appearance information, object models could contain information such as the weight, type, typical location, or grasp properties of the object. Equipped with these techniques, robots can become experts in their respective environments and share information with other robots, thereby allowing for rapid progress in robotic capabilities.

In this paper we present a first step toward this goal. Specifically, we develop an approach to building a 3D surface model of an unknown object based on data collected by a depth camera observing the robot's hand moving the object. In contrast to much existing work in 3D object modeling, our approach does not require a highly accurate depth sensor or a static or unobstructed view of the object, nor does our approach require an extremely precise manipulator. This point is essential because our manipulator can experience errors of multiple centimeters caused by cable stretch (see Figure 3). It is also an important feature if such techniques are to be used in robots priced for consumer use.

1 Department of Computer Science & Engineering, University of Washington, USA
2 Intel Labs Seattle, USA

Corresponding author: Michael Krainin, Department of Computer Science & Engineering, University of Washington, Seattle, WA 98195, USA. Email: [email protected]

Fig. 1. Experimental setup. We use a WAM arm with BarrettHand on a Segway base. Mounted next to the arm on a pan-tilt unit is a PrimeSense depth camera.

Recently, sensors combining RGB images with depth measurements (RGB-D sensors) have come to prominence due to their gaming applications, and in particular due to the release of the Xbox Kinect. Such sensors are now both very affordable and readily available, making them ideal for personal robotics applications. In this paper we equip our robot with an RGB-D sensor developed by PrimeSense (a reference design equivalent to the Kinect) to allow it to sense its environment and model objects that it encounters.

We develop a Kalman filter that uses depth and visual information to track the configuration of the robot's manipulator along with the object in the robot's hand. By doing so, our approach can compensate for errors in manipulator and object state estimates arising from factors such as noise in the manipulator's joint sensors and poor kinematic modeling. Over time, an increasingly complete 3D model of the object is generated by extracting points from each RGB-D frame and aligning them according to the tracked hand and object position. The approach integrates the frames into a consistent surface model using surfels, small discs which represent local surface patches. Experiments show that our approach can generate good models even for objects that are highly symmetric, such as coffee cups, and objects lacking visual texture.

Figure 2 presents a simplified outline of our approach. On each frame, new sensor data are fed into a Kalman filter, which produces pose estimates of the manipulator, object, and sensor. The Kalman filter tracking is based on RANSAC feature matching and an iterative closest point variant. These estimates enable segmentation of the object from the RGB-D frame and its alignment with the existing object model. The model can then be updated based on the new sensor data.

Fig. 2. High-level overview of our approach. As sensor data come in, we incrementally update the configuration and object model estimates.

Our work provides the following contributions:

• We propose a framework for simultaneously tracking a robotic manipulator and a grasped object while constructing a 3D model of the object. We also show how this system can be integrated with other components to perform autonomous object modeling: an unknown object can be picked up, modeled, placed back down, and regrasped so as to fill in holes caused by manipulator occlusion.

• We present a novel Iterative Closest Point (ICP) variant useful for performing tracking in RGB-D data sequences. We build upon a standard point-to-plane error metric, demonstrating extensions for sparse feature matching, dense color matching, and priors provided by a Kalman filter. Additionally, we show how to use the surfel representation to provide occlusion information to ICP.

• We propose an algorithm for merging multiple surfaces consisting of local surface patches after loop closure. Our algorithm attempts to find a consensus between the surfaces to avoid redundancy among patches.

This paper is organized as follows. In the next section, we provide an overview of related work. We then describe our Kalman filter approach to tracking a robot's manipulator and an object grasped by the hand in Section 3. We also introduce a novel version of articulated ICP suitable to our tracking task. In Section 4, we go into detail on the modeling process. Section 5 presents experimental results demonstrating the quality of both the tracking and modeling components of our algorithm. Finally, we conclude in Section 6.

2. Related work

The existing work in tracking and modeling addresses subsets of the problem we are trying to solve; however, no one paper addresses them all. We make use of depth, visual, and encoder information to provide a tracking and modeling solution for enabling active object exploration for personal robotics. Below, we discuss a number of areas of research related to our own work.


2.1. Robotics for object modeling

Broadly speaking, object modeling techniques in robotics can be divided into two categories: ones where a sensor is moved around a stationary object and ones where the object is picked up and moved in front of a sensor. The first category avoids the difficult problem of robotic grasping and can be applied even to objects too large or delicate to be picked up. The second category, into which our technique falls, has the advantages of being able to move the object to see previously occluded sides and also lends itself to extracting further properties such as weight and stiffness.

The first category of papers is closely related to the problem of 3D mapping as it involves motion of a depth sensor in a stationary scene. Triebel et al. (2004) mount a SICK laser range finder on a four-DOF manipulator for 3D volumetric modeling and exploration. They use the manipulator encoder values for sensor pose estimation. Other approaches for environment and object modeling with a depth sensor include Henry et al.'s RGB-D mapping (Henry et al., 2010) and Strobl et al.'s Self-Referenced DLR 3D-Modeler (Strobl et al., 2009). Both use visual feature tracking as the primary means of camera pose estimation. The former uses ICP to improve pose estimates, while the latter uses an IMU to provide better image flow predictions.

In the second category, Kraft et al. (2008) model contours of objects using a robotic manipulator and a stereo camera. The representations they learn, however, are not complete surface models but rather sparse sets of oriented 3D points along contours. Another important difference to our approach is that the authors assume precise camera-to-robot calibration and a precisely known robot state at all times. We believe these assumptions to be too restrictive for the technique to be generally applicable.

Ude et al. (2008) use robotic manipulation to generate training examples for object recognition. Their approach involves generating motion sequences to achieve varied views of an object, segmenting the object from images, and extracting training examples for a vision-based classifier. Unlike Kraft's work, their paper assumes neither known camera calibration nor precisely known joint angles. However, the paper does not deal with constructing 3D models and therefore does not require precise object pose.

Similarly, Li and Kleeman (2009) use a robotic manipulator to achieve varied views of an object for visual recognition. They store Scale-Invariant Feature Transform (SIFT) features for frames at discrete rotation angles and perform detection by matching the features of an input image against each viewpoint of each modeled object. The authors mention that such models could be useful for object pose estimation. We assert that this requires estimating the motion of the object between viewpoints using techniques such as those we propose in this paper.

2.2. In-hand object modeling

Object modeling where the object is held and manipulated by a human hand has received considerable recent interest. This problem is more difficult than the one we address because there are no longer encoders to provide an (approximate) hand pose. Additionally, the appearance of human hands varies from person to person, and the human hand is capable of much more intricate motions than a robotic hand. For these reasons, the hand is typically ignored. These algorithms typically rely on either visual data or depth data, but not both, and to our knowledge none explicitly try to track the hand as a means of improving alignment.

In the case of ProFORMA (Pan et al., 2009), the goal is to acquire and track models via a webcam. While visual features alone work fine for some objects, many everyday objects lack sufficient texture for this type of tracking.

Weise et al. (2009) use 3D range scans and model objects using surfels but rely solely on ICP with projection-based correspondences to provide alignment. Their alignment technique is very similar to that of Rusinkiewicz et al. (2002), with the exception that the alignment is performed between a partial object model and the current frame rather than between the last two frames. Because they rely on geometric matching only, these techniques tend to fail for objects exhibiting rotational symmetries, as many household objects do.

2.3. Articulated tracking

A number of techniques exist for human hand tracking; however, many of these make use of only 2D information such as silhouettes and edge detections (Athitsos and Sclaroff, 2003; Sudderth et al., 2004). Some require precomputed databases and may only detect configurations within that database (Athitsos and Sclaroff, 2003), and others are far from real-time algorithms. Given that we are using 3D sensors and that we wish to track the manipulator in real time through a continuous space of joint angles, such approaches are unsuitable.

In the area of human body tracking, the work of Ganapathi et al. (2010) is quite promising. The authors perform tracking using a time-of-flight sensor at close to real time by taking advantage of GPUs. Their approach involves evaluating the likelihood of hypotheses by ray-tracing a model and comparing with the observed depth measurements. Additionally, part detections can inform the search to allow recovery from failures caused by occlusion or fast motion. These challenges are somewhat different from those we face in our problem. Because our work focuses on robotic tracking, the motions are largely known, relatively slow, and with relatively low occlusion. Our challenge lies in precise and consistent pose estimation of the end effector and object model to facilitate object modeling. We therefore focus on incorporating many sources of information into our matching procedure rather than on fast evaluations of a purely depth-based metric.


Articulated ICP is an appealing option because of its speed, its use of 3D information, and the fact that it is readily modified to suit specific tasks, as demonstrated by the wealth of existing ICP variants (see Section 2.4). It has been used in articulated pose estimation in the past (Mündermann et al., 2007; Pellegrini et al., 2008); however, to the best of our knowledge, it has not been integrated with Kalman filters, which provide the advantages of smoothing and uncertainty estimation. These uncertainties are crucial as they can be fed back into ICP to reflect the accumulated knowledge of the state (Section 3.2.3).

The work of Kashani et al. (2010) on tracking the state of heavy machinery combines particle filters with ICP. Their use of particle filters, however, is not to guide ICP's search. Rather, it provides a coarse initial registration through importance sampling over a broader set of hypotheses than ICP itself would be likely to search. It is also worth noting that the feasibility of this approach is largely due to their search space being limited to just three degrees of freedom.

2.4. ICP variants

In this paper we introduce a number of extensions to ICP, resulting in the error function we present in Section 3. Other works have also introduced modifications to the ICP algorithm to incorporate additional matching criteria.

A common approach is to augment each point in the two point clouds with additional attributes. The correspondence selection step then finds closest point pairs in this higher-dimensional space. This approach has been applied to color (Johnson and Kang, 1997), geometric descriptors (Sharp et al., 2002), and image descriptors (Lemuz-López and Arias-Estrada, 2006). The downsides of this approach are that descriptors must be computed for every point and that the dimensionality of the nearest neighbor search increases. In comparison, our algorithm only requires SIFT descriptors at detected keypoints, and the nearest neighbor search occurs in only three dimensions.

Other approaches for incorporating additional attributes include matching only a subset of the points which have a specific attribute value (Druon et al., 2006), and constraining correspondences to be within a hard threshold with respect to attribute similarity (Godin et al., 2001). The most similar approach to our own uses projection-based correspondence selection but moves the projection along the image gradient to better match intensities (Weik, 1997). We have found that projection-based correspondence selection struggles with long, thin objects such as our robot's fingers. We therefore opt to use nearest neighbor-based correspondence selection and to augment our error function to encourage color agreement.

Feature matching has also been used to find initializations for ICP (Johnson, 1997; Lemuz-López and Arias-Estrada, 2006). Due to our tight Kalman filter integration, we already have a natural choice of initialization. We instead use our feature correspondences in the ICP error function, forcing the algorithm to consider these constraints during the final alignment.

2.5. Reconstruction

For the graphics community, obtaining accurate 3D shapes of objects is a primary research objective and has been extensively studied. Many researchers have applied range sensing of various kinds (e.g. Curless and Levoy, 1996; Pai et al., 2001) and can recover amazing details by combining elaborate hardware with meticulous experimental setup, such as that in the Digital Michelangelo Project (Levoy et al., 2000). In comparison, although we are recovering shape and appearance information, we do so using cheap, noisy sensors and standard robotic hardware. Our goal is to robustly and efficiently model objects and to apply such knowledge in recognition and manipulation.

In this paper we primarily perform reconstruction using a surface-element (or surfel) based approach. Surfels, originally used as a rendering primitive (Pfister et al., 2000), are oriented discs representing local surface patches on an object. They have been used with much success in reconstruction (e.g. Habbecke and Kobbelt, 2007; Weise et al., 2009), largely due to their conceptual simplicity and ease of implementation. In this paper we base much of the reconstruction itself on the approach of Weise et al. We will go into more detail on surfels and surfel-based reconstruction in Section 4.

3. Manipulator and object tracking

Our goal is to acquire 3D models of objects grasped by a robot's manipulator. To do so, we must determine alignments between a (partial) object model and each sensor frame. Existing techniques for in-hand modeling either ignore the manipulator entirely and rely on object geometry or texture to provide alignment (Pan et al., 2009; Weise et al., 2009), or they rely on the known manipulator motion as the sole means of registration (Kraft et al., 2008; Sato et al., 1997). We argue that the first approach will fail for symmetric and/or textureless objects, while the second relies too heavily on, for example, the accuracy of the joint encoders, the kinematic modeling, and the extrinsic sensor calibration. As a demonstration of this second point, we show in Figure 3 that encoder values along with forward kinematics are not always sufficiently precise for object modeling. Many of the factors contributing to the inaccuracies seen in the figure are less prominent in higher-precision robots for industrial applications, but because we wish our techniques to be applicable to affordable, in-home robots, we choose not to sidestep the issue with high-precision hardware.

Fig. 3. Pictured here is our arm in four configurations. For clarity we have attached a ping pong ball to the end of the manipulator (highlighted with circles). In each configuration the encoders, along with forward kinematics, predict the same hand pose; however, the center-to-center distance between the two farthest ball positions is approximately 8 cm.

We propose, as an alternative to these object-tracking-only and encoder-only techniques, to instead make use of multiple sources of information (namely encoders and RGB-D frames of both manipulator and object), and to rely on each to the extent that it is reliable. We assume that the robot is equipped with a 3D depth sensor that observes the robot's manipulation space, producing 3D, colored point clouds of the robot's manipulator and the object grasped by the hand. Figure 4 shows an example image along with depth information of a BarrettHand holding a box. This sensor is used both for tracking and for object modeling. To use such a point cloud for tracking, we assume that the robot has a 3D model of its manipulator. Such a model can either be generated from design drawings or measured in an offline process. The 3D model allows us to generate an expected point cloud measurement for any configuration of the manipulator. In our current system, we perform a one-time ray-casting on an existing model of the WAM Arm and BarrettHand included with OpenRAVE (Diankov, 2010). In the future, we plan to investigate techniques similar to those of Sturm et al. (2009) to instead learn these models.

Here, we present a Kalman filter based approach having the following benefits:

1. It does not rely solely on the accuracy of the arm and is therefore applicable to a much broader class of robotic hardware. We will demonstrate that our technique can function even in the absence of encoder data.
2. Since the proposed technique tracks the arm in addition to the object, it can find proper alignments for geometrically and visually featureless object regions.
3. The algorithm reasons about the object in hand (both its motion and the occlusion it causes), which we show improves manipulator tracking.
4. Explicitly tracking the hand leads to straightforward hand removal from sensor data for the object model update.

Finally, although we focus in this paper on the benefits of the manipulator tracking technique for object modeling, the technique can be used in any situation where a more precise estimate of a robot's hand pose is desirable.

Table 1. Kalman filter for joint manipulator and object tracking.

1:  Hand_object_tracker(μk−1, Σk−1, Sm, Sobj, Pz, θ̃k, θ̃k−1):
2:    uk = θ̃k − θ̃k−1                         // motion reported by encoders
3:    μ̄k = μk−1 + B uk                        // motion update of mean
4:    Σ̄k = Σk−1 + Rk                          // motion update of covariance
5:    μ̂k = Articulated_ICP(Sm, Sobj, Pz, μ̄k, Σ̄k)
6:    S′obj = Segment_and_merge_object(Sm, Sobj, Pz, μ̂k)
7:    Kk = Σ̄k (Σ̄k + Qk)^{−1}                 // Kalman gain
8:    μk = μ̄k + Kk (μ̂k − μ̄k)                 // measurement update of mean
9:    Σk = (I − Kk) Σ̄k                        // measurement update of covariance
10:   return μk, Σk, S′obj

3.1. Kalman filter tracking

We use a Kalman filter for tracking because it helps maintain temporal consistency as well as providing estimates of uncertainty. The algorithm, shown in Table 1, takes as input the previous time step's mean and covariance, surfel clouds representing the manipulator and object, and the joint angles reported by the encoders of the manipulator. At a high level, the algorithm uses the measured joint angles in a motion-based update (Steps 2–4), uses articulated ICP to provide a measurement of the system state (Step 5), updates the object model (Step 6), and performs a measurement-based state update (Steps 7–9).

In particular, our Kalman filter state consists of three components:

• The manipulator joint angles θ̂.
• The transformation T̂calib, representing an adjustment to the initial robot-to-camera calibration, which transforms the base of the manipulator into the 3D sensor frame.
• The transformation T̂obj, representing an adjustment to the location of the object relative to the palm of the hand. It transforms the object point cloud into the reference frame of the palm.

The adjustment transformations T̂calib and T̂obj are initialized to identity transformations and are encoded as quaternions and translations. The initial state of the Kalman filter has associated with it a covariance matrix representing the uncertainties in the initial angles, the camera calibration, and the placement of the palm relative to the first object point cloud.

Fig. 4. (Left) A BarrettHand holding a box. (Right) Rendering of the depth map provided by our depth camera.

As given in Step 2, the motion update uk is based upon the difference in the reported joint angles since the previous time step. The prediction step of the Kalman filter then generates the predicted state μ̄k in Step 3. The matrix B simply projects the joint angle update into the higher dimensional space of the Kalman filter state. Associated with this update step is noise in the motion, distributed according to the covariance matrix Rk (Step 4). If the camera is assumed fixed (as in our case), then Rk does not contribute any new uncertainty to those components of the state. The components of Rk representing uncertainty in the object's motion allow our algorithm to handle slight slippage of the object. In Section 3.3, we discuss how we use these components to enable regrasping the object.

If the calibration and the object's pose relative to the palm are assumed fixed (that is, if the object is grasped firmly), then Rk will not contribute any new uncertainty to those components of the state. Alternatively, one may include those terms in Rk in order to compensate for movement of the camera or of the object inside the hand.

In Step 5, the function Articulated_ICP matches the surfel models of the manipulator and object into the observed point cloud and returns an estimate, μ̂k, of the state vector that minimizes the mismatch between these clouds. Details of this algorithm are given in Section 3.2.

Segment_and_merge_object uses the output of Articulated_ICP to extract points from the current measurement, Pz, that belong to the object. To do so, it uses the ICP result μ̂k to appropriately transform the manipulator surfel model Sm into the correct joint angles and into the sensor's reference frame. Sm is then used to identify points in Pz generated by the manipulator via simple distance checking. This hand removal procedure is demonstrated in Figure 5.

The points belonging to the object in hand can then be identified due to their physical relation to the end effector. This technique has the added benefit that it does not require a static background, as many vision-based algorithms do. The resulting points are then integrated into Sobj with update rules we will describe in Section 4.

Fig. 5. Because we explicitly track the robot's arm, we can remove it from the current measurement Pz by simply checking 3D distances to the arm model.
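As an illustration of this distance-based hand removal, the following sketch drops frame points lying near the posed manipulator model; the KD-tree query, the 2 cm threshold, and the array-based interface are assumptions of the sketch rather than details taken from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def remove_manipulator_points(frame_points, manipulator_points, dist_thresh=0.02):
    """Drop frame points lying within dist_thresh (meters) of the posed manipulator
    surfel model; the remaining points are candidate object points."""
    tree = cKDTree(manipulator_points)      # KD-tree over the posed arm/hand model
    d, _ = tree.query(frame_points, k=1)    # distance to the closest model point
    return frame_points[d > dist_thresh]

# Example usage: frame_points and manipulator_points are (N, 3) arrays in the sensor frame.
```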

Steps 7 through 9 are standard Kalman filter correction steps, where we take advantage of the fact that Articulated_ICP already generates an estimate of the state, μ̂k, thereby allowing the simplified correction in Step 8. Qk represents the uncertainty in the ICP result μ̂k. While techniques do exist for estimating this matrix (e.g. by using the Jacobian matrix of the error function), their estimates are based on the local neighborhood of the solution. If ICP finds only a local optimum, such estimates could drastically underestimate the degree to which the ICP solution is incorrect. We therefore set Qk manually and leave online estimation as future work.
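Because the ICP result is treated as a direct, noisy observation of the full state, the filter in Table 1 takes a particularly simple form. The following is a minimal NumPy sketch of Steps 2–9 under that assumption; the matrices B, R, and Q and the articulated_icp callback are placeholders standing in for the quantities described above.

```python
import numpy as np

def track_step(mu_prev, Sigma_prev, theta_meas, theta_prev, B, R, Q, articulated_icp):
    """One Kalman filter update for joint manipulator/object tracking (sketch of Table 1)."""
    u = theta_meas - theta_prev                  # motion reported by the encoders
    mu_bar = mu_prev + B @ u                     # motion update of the mean
    Sigma_bar = Sigma_prev + R                   # motion update of the covariance

    mu_hat = articulated_icp(mu_bar, Sigma_bar)  # ICP acts as a direct state measurement

    K = Sigma_bar @ np.linalg.inv(Sigma_bar + Q)     # Kalman gain
    mu = mu_bar + K @ (mu_hat - mu_bar)              # measurement update of the mean
    Sigma = (np.eye(len(mu)) - K) @ Sigma_bar        # measurement update of the covariance
    return mu, Sigma
```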

3.2. Articulated ICP

We now describe the function Articulated_ICP used in Step 5 of the tracking algorithm. Articulated ICP is an iterative approach to estimating joint angles by attempting to minimize an error function over two point clouds. As shown in Table 1, we use the result of the articulated ICP algorithm as a noisy state observation.

We begin with a review of the ICP algorithm for rigid objects. The inputs to ICP are two 3D point clouds, a source cloud Ps = {p_s^1, . . . , p_s^M} and a target cloud Pt = {p_t^1, . . . , p_t^N}. The goal is to find a transformation T* (3D rotation and translation) which aligns the point clouds as follows:

T^* = \arg\min_T \sum_{i=1}^M \min_{p_t^j \in P_t} w_i \| T(p_s^i) - p_t^j \|^2    (1)

To achieve this minimization, the ICP algorithm iterates between the inner minimization of finding correspondences as pairs of closest points given a transformation, and the outer minimization of finding the transformation minimizing the sum of squared residuals given the correspondences. Since ICP only converges to a local minimum, a good initialization for T is important.
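For reference, a minimal rigid ICP loop of the form in (1) can be written as follows. The closed-form SVD (Kabsch) alignment step, uniform weights w_i = 1, and the fixed iteration count are simplifications for illustration; the articulated, weighted variant used in this paper is developed below.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, tgt):
    """Least-squares rotation R and translation t mapping src onto tgt (Kabsch/SVD)."""
    cs, ct = src.mean(axis=0), tgt.mean(axis=0)
    U, _, Vt = np.linalg.svd((src - cs).T @ (tgt - ct))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, ct - R @ cs

def icp(source, target, iters=30):
    """Alternate closest-point correspondences and closed-form alignment (Eq. (1), w_i = 1)."""
    tree = cKDTree(target)
    R, t = np.eye(3), np.zeros(3)
    for _ in range(iters):
        moved = source @ R.T + t
        _, idx = tree.query(moved, k=1)                        # inner minimization: closest points
        R_d, t_d = best_rigid_transform(moved, target[idx])    # outer minimization: best transform
        R, t = R_d @ R, R_d @ t + t_d                          # compose the incremental update
    return R, t
```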

In our context, Ps is a combined model of the object Sobj and the manipulator Sm, and Pt contains the current observation. As in Mündermann et al. (2007) and Pellegrini et al. (2008), the point clouds in our articulated ICP are related to objects that consist of multiple links connected via revolute joints. Specifically, each point p_s^i ∈ Ps has associated to it a link l_i in the robot's manipulator, and is specified in the local coordinate system of that link. Given the state x = ⟨θ, Tcalib, Tobj⟩, a link l_i in the robot model has a unique transformation T^S_{l_i}(x) that maps its points into the reference frame of the sensor. The object is treated as its own link, which has an offset Tobj from the palm frame. The goal of articulated ICP is to solve for the following:

\langle \theta, T_{calib}, T_{obj} \rangle^* = x^* = \arg\min_x \sum_{i=1}^M \min_{p_t^j \in P_t} w_i \| [T^S_{l_i}(x)](p_s^i) - p_t^j \|^2    (2)

We have found that the use of point-to-plane type error functions (Chen and Medioni, 1992) can help improve the performance of ICP by preventing undesired sliding of surfaces. When using point-to-plane, the ICP algorithm continues to select correspondences in the usual way (for efficiency, we use a KD-tree over the target cloud). We denote the index of the ith source point's correspondence as corr(i). The difference when using a point-to-plane metric is that the error function optimized by the outer minimization changes from

E_{pt-pt}(x) = \sum_{i=1}^M w_i \| [T^S_{l_i}(x)](p_s^i) - p_t^{corr(i)} \|^2    (3)

to

E_{pt-plane}(x) = \sum_{i=1}^M w_i \left( \left( [T^S_{l_i}(x)](p_s^i) - p_t^{corr(i)} \right) \cdot n_t^{corr(i)} \right)^2    (4)

To efficiently determine the normals {n_t^1, . . . , n_t^N} of the target cloud, we compute eigenvectors of the covariance matrices for local neighborhoods provided by the grid structure of the data. The weight w_i for the ith source point is based on the agreement between the normals of the corresponding points (the inner product of the two). Additionally, a weight is set to zero if the correspondence distance is over a threshold, if the source point's normal (provided by the surfel representation) is oriented away from the camera, or if the line segment from the camera to the source point intersects any other surfel (occlusion checking).

The point-to-plane error metric (4) is the starting point from which we extend articulated ICP. It has no closed-form solution (even for the simple, non-articulated case), so we use Levenberg–Marquardt to optimize the function. We now describe extensions to the ICP error function to improve tracking by taking advantage of other available information.
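As an illustration of this optimization step, the point-to-plane objective for a single rigid transform can be handed to a Levenberg–Marquardt solver as a vector of weighted residuals. The (translation, rotation-vector) parameterization and the use of SciPy here are assumptions of the sketch, not the paper's implementation, which optimizes over the full articulated state.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def point_to_plane_residuals(x, src, tgt, tgt_normals, weights):
    """Weighted point-to-plane residuals (cf. Eq. (4)) for a rigid transform x = (t, rotvec)."""
    R = Rotation.from_rotvec(x[3:]).as_matrix()
    moved = src @ R.T + x[:3]
    # signed distance of each moved source point to its correspondence's tangent plane
    return np.sqrt(weights) * np.einsum('ij,ij->i', moved - tgt, tgt_normals)

def refine_pose(src, tgt, tgt_normals, weights, x0=np.zeros(6)):
    """Levenberg–Marquardt refinement of the alignment, given fixed correspondences."""
    res = least_squares(point_to_plane_residuals, x0, method='lm',
                        args=(src, tgt, tgt_normals, weights))
    return res.x
```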

3.2.1. Sparse feature matching

Provided we can find a set of sparse feature correspondences F between 3D points in the source and target clouds, we can add these to the optimization by including an extra error term:

E_{feature}(x) = \sum_{(f_s, f_t) \in F} \| [T(x)](f_s) - f_t \|^2    (5)

Here, T(x) refers to the transformation for the object link because, as we will explain below, we only track features lying on the object. An important point to note is that (5) uses a point-to-point style error function, which provides a stronger constraint than a point-to-plane function.

These fixed, feature-based correspondences enable the use of visual information to improve matching. As an example, for objects that are geometrically featureless (e.g. planar or uniformly curved like a cup), the error function (4) is unable to disambiguate between the correct state and one in which the object model is shifted along the target surface. If the surface is textured, visual features can provide the correct within-plane shift; this is due to the use of a point-to-point error metric in (5).

Our feature matching approach consists of maintaining a model containing locations (in the coordinate system of Sobj) and descriptors of features lying on the object. In each frame, we extract a new set of visual features and determine the 3D location of the features from the depth data. We then find matches between the model features and the current frame features using RANSAC to ensure geometric consistency. Provided there are a minimum number of geometrically consistent feature matches, we use the additional error term.

To maintain the object feature model, we take an approach similar to Kawewong et al. (2009). Features detected in each frame are matched against features from the previous frame. Only features with matches going back a specified number of frames and lying within the segmented-out object are used in updating the object feature model. This avoids cluttering the model with non-stable features arising from factors such as specular highlights. Stable features with geometrically consistent matches to the feature model will result in updates to the corresponding feature in the model (running averages of position and descriptor). Stable features without matches are added into the model as new feature points. In our implementation, we use SiftGPU (Wu, 2007) for feature detection.
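The geometric-consistency check over 3D feature matches can be prototyped with a standard RANSAC loop such as the one below; the three-point rigid hypothesis, the 1 cm inlier threshold, and the iteration count are illustrative assumptions.

```python
import numpy as np

def _rigid_fit(a, b):
    """Least-squares rotation/translation mapping points a onto b (Kabsch/SVD)."""
    ca, cb = a.mean(axis=0), b.mean(axis=0)
    U, _, Vt = np.linalg.svd((a - ca).T @ (b - cb))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cb - R @ ca

def ransac_consistent_matches(model_pts, frame_pts, iters=200, thresh=0.01, seed=0):
    """Boolean mask of 3D feature matches consistent with a single rigid motion.
    model_pts, frame_pts: (N, 3) arrays of matched feature locations."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(model_pts), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(model_pts), size=3, replace=False)   # minimal sample
        R, t = _rigid_fit(model_pts[idx], frame_pts[idx])
        err = np.linalg.norm(model_pts @ R.T + t - frame_pts, axis=1)
        inliers = err < thresh
        if inliers.sum() > best.sum():
            best = inliers
    return best
```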

3.2.2. Dense color matching

Sparse feature matching is a very effective technique when there are sufficiently many and sufficiently distinctive features for matching. Unfortunately, these conditions do not always hold. Not all objects have large quantities of distinctive features. Additionally, we are limited in image resolution and in the minimum distance of the object from the camera by our sensor. So for smaller objects, we typically cannot use the error term (5).

Even when distinctive visual features cannot be found, there is still useful information in the RGB image. We add the following additional error term to encourage colors in the object model to align with similarly colored pixels in the image:

E_{color}(x) = \sum_{i=1}^{|S_{obj}|} w_i \| (C \circ proj \circ T(x))(p_s^i) - c_i \|_c^2    (6)

We use the notation (x, y) = proj(p) to denote the projection of a 3D point p into the sensor's image plane, and C(x, y) to denote the (interpolated, or if needed, extrapolated) color value for a point in the image plane. As we will explain in further detail in Section 4, each surfel in the object model Sobj has a color associated with it. We denote the color of the ith model point p_s^i as c_i. Thus, (6) penalizes differences between a surfel's color and the color at its projection.

In principle, ‖·‖_c could be any distance function over colors. In our current implementation, it is simply the magnitude of the intensity difference between the two colors. The weight w_i in (6) is 1 if the surfel has a correspondence corr(i) in the point-to-plane matching (i.e. non-zero weight in (4)) and 0 otherwise. This prevents non-visible regions from being projected into the image.

Before (6) is evaluated, we first smooth the image, which has two benefits. First, it helps reduce image noise. Second, it introduces color gradients where there would otherwise be hard color boundaries. This provides the optimizer with a gradient to follow.

An important aspect to consider when performing color-based matching is that of lighting variation. One problem we came across is that specular highlights will move across an object as it is rotated. To help prevent them from disturbing the color matching, we ignore any terms for which c_i or its correspondence in the target frame has an intensity over a pre-defined threshold. This heuristic is quite simple but helps prevent highlights from being drawn toward each other.

Other considerations for changes in illumination could further improve the robustness of color matching in ICP. For instance, projecting patches of points rather than one point at a time would allow for the use of a metric based on normalized cross-correlation.
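A sketch of this color term for a pinhole camera is given below. The intrinsics fx, fy, cx, cy, the Gaussian pre-smoothing, and bilinear sampling via SciPy are assumptions made for illustration; occluded surfels are hidden simply by passing zero weights, matching the weighting rule described above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def color_residuals(points_cam, colors, intensity_img, fx, fy, cx, cy, weights):
    """Per-surfel intensity differences (cf. Eq. (6) with an intensity-based color distance).
    points_cam: (N, 3) surfel positions already transformed into the camera frame.
    colors: (N,) stored surfel intensities in [0, 1]."""
    img = gaussian_filter(intensity_img, sigma=2.0)     # smoothing reduces noise, adds gradients
    u = fx * points_cam[:, 0] / points_cam[:, 2] + cx   # pinhole projection (column)
    v = fy * points_cam[:, 1] / points_cam[:, 2] + cy   # pinhole projection (row)
    sampled = map_coordinates(img, np.vstack([v, u]), order=1, mode='nearest')  # bilinear sample
    return weights * (sampled - colors)                 # zero weight hides occluded surfels
```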

Fig. 6. Two grasps of the same object. With just a single grasp, the resulting object model has holes. Any given part of the object is visible in at least one of the grasps.

3.2.3. Prior state information

Finally, we include one last error term encoding our prior knowledge about the state:

E_{prior}(x) = (x - \bar{\mu}_k)^T \bar{\Sigma}_k^{-1} (x - \bar{\mu}_k)    (7)

The use of this term helps influence ICP to make more likely choices when the registration is ambiguous. For instance, if the robot rotates a solid-colored cup, a solution having the cup remain still while the hand moves will appear just as valid to the ICP error function (4) as would one in which the cup rotates along with the hand. The Kalman filter covariance Σ̄k encodes our uncertainties in the system's degrees of freedom and can be used to help resolve such ambiguities.

Another scenario where the use of this prior term is desirable is when the robot's hand is visible but its arm is largely out of frame. Multiple sets of joint angles give the proper hand pose and therefore the same ICP error, but they are not all equally consistent with the rest of the tracking sequence.

Although there are clearly important reasons for including this error term in ICP, it should also be noted that this prior has the potential to affect the performance of the Kalman filter: μ̂k is supposed to be an independent estimate of the true state, but our prior biases it toward the existing μ̄k.

With the inclusion of these additional terms, our overall error function becomes

E(x) = E_{pt-plane}(x) + \alpha_1 E_{feature}(x) + \alpha_2 E_{color}(x) + \alpha_3 E_{prior}(x)    (8)

The α's are relative weights for the components of the error function, which we set heuristically.
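If the combined objective (8) is minimized as a nonlinear least-squares problem, the Mahalanobis prior (7) can be appended as additional residuals by whitening with a Cholesky factor of the information matrix. This folding is a standard trick and the sketch below is illustrative; the paper does not specify this particular formulation.

```python
import numpy as np

def prior_residuals(x, mu_bar, Sigma_bar, alpha3):
    """Residual vector r with ||r||^2 = alpha3 * (x - mu_bar)^T Sigma_bar^{-1} (x - mu_bar),
    so the prior term (7) can be stacked with the other weighted ICP residuals."""
    L = np.linalg.cholesky(np.linalg.inv(Sigma_bar))   # Sigma_bar^{-1} = L L^T
    return np.sqrt(alpha3) * (L.T @ (x - mu_bar))
```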

3.3. Handling multiple grasps

As described so far, the tracking procedure does not allow for regrasping of objects, an important step since the manipulator occludes parts of the object (see Figure 6). Most notably, we assume the object moves (roughly) with the hand, which would not be true during a regrasp procedure.

To continue manipulator and object tracking during transitions between grasps, we use a switching Kalman filter with three states:


1. The robot is firmly grasping the object.
2. The robot is grasping or releasing the object.
3. The object is sitting on a table between grasps.

An advantage of performing the object modeling using a robot is that we have knowledge of when it will grasp or release, and we therefore always know which state the Kalman filter should be in.

The first state is the most common and the one we have focused on so far. The primary difference in the second state is that the object may wobble or slide in unexpected ways. We therefore modify the motion covariance Rk by softening the constraint on the object pose relative to the manipulator.

In the third state, the object is expected to remain fixed relative to the robot's base rather than the manipulator. In this state, the object pose components are reinterpreted as being relative to the base.

The first state is the only one in which Segment_and_merge_object is performed. Even during the grasped state, the modeling procedure is not done while the object is being raised from or lowered to the table.

The use of a switching Kalman filter allows the robot to examine an object using one grasp, put the object down, regrasp it, and examine it again, thereby filling in holes from the first grasp. In Section 5 we demonstrate example models built from multiple grasps.
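The switching behavior mainly amounts to selecting a different motion covariance (and, in the third state, a different reference frame for the object pose) depending on the grasp state. The sketch below illustrates that selection; the state block sizes and variance values are placeholders, not the paper's tuned parameters.

```python
import numpy as np
from enum import Enum

class GraspState(Enum):
    FIRM_GRASP = 1        # object moves rigidly with the hand
    GRASP_OR_RELEASE = 2  # object may wobble or slide in the hand
    ON_TABLE = 3          # object fixed relative to the robot base

def motion_covariance(state, n_joints, joint_var=0.03, obj_var_firm=1e-4, obj_var_loose=1e-2):
    """Build a diagonal R_k whose object-pose block is inflated while (re)grasping."""
    obj_var = obj_var_loose if state == GraspState.GRASP_OR_RELEASE else obj_var_firm
    diag = np.concatenate([
        np.full(n_joints, joint_var),  # joint-angle motion noise
        np.zeros(7),                   # camera adjustment assumed fixed (quaternion + translation)
        np.full(7, obj_var),           # object adjustment (quaternion + translation)
    ])
    return np.diag(diag)
```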

4. Object modeling

We now describe the representation underlying our object models and the key steps involved in updating object models based on new data.

4.1. Surfels

Our choice of surfels (Habbecke and Kobbelt, 2007; Weise et al., 2009) as a representation was strongly influenced by the constraints of our problem. Our depth sensor, while versatile, does suffer from certain types of noise. In particular, we must be able to compensate for quantization errors, filling in of holes, and expansion of objects by extra pixels (Figure 7). Therefore, it is crucial that we be able to revise the models not just by adding points but also by keeping running estimates of their locations and by removing spurious points; how surfels accomplish these properties is explained in Section 4.2.

Additionally, the problem at hand involves tracking the robot's manipulator, some of which may be occluded by the object or by itself. We wish to be able to reason explicitly about the visibility of any particular point in Sm or Sobj before assigning it a correspondence. Doing so prevents irrelevant model points from negatively impacting the alignment.

Surfels fit all of these requirements and are very easy to work with. As we explain below, the addition, update, and removal rules for surfels are quite simple and robust. While other representations such as triangle meshes could provide the occlusion information we need, the update rules can be substantially more inefficient and difficult to implement because of the need to maintain explicit connections between vertices. Surfels, on the other hand, can be updated independently of each other and, if desired, can later be converted to a mesh in a post-processing step.

Fig. 7. One of the error modes of our depth sensor. Depicted here is a point cloud of the lip of a mug against a light blue background. Along both edges shown, extra depth points appear and are colored by the scene's background. Additionally, the sensor has quantization errors and tends to fill in small holes.

A surfel is essentially a circular surface patch. The properties of a surfel s_i include its position p_i, its normal n_i, and its radius r_i. The radius, as described by Weise et al., is set such that, as viewed from the camera position, it would fill up the area of one pixel. As the camera gets closer to the surface, surfels are automatically resized, providing an elegant means for selecting the appropriate resolution and, further, for using varying levels of detail across a single surface.

One can associate additional attributes to surfels such as a "visibility confidence" v_i. The possible viewing angles of a surfel are divided into 64 bins over a hemisphere. The confidence is the number of such bins from which the surfel has been seen at least once. This provides a better measure of confidence than simply the number of frames in which a surfel has been seen, because a patch seen from the same angle may consistently produce the same faulty reading.

For visualization purposes and for dense color matching (Section 3.2.2), we also keep track of the color c_i of the surfel. We use the heuristic of using the color from the frame with the most perpendicular viewing angle to the surfel so far.
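A minimal record capturing these surfel attributes might look like the following; the dataclass layout, the boolean bin mask, and the staleness counter are assumptions made for illustration.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class Surfel:
    position: np.ndarray               # 3D center of the disc
    normal: np.ndarray                 # unit surface normal
    radius: float                      # sized to cover roughly one pixel at the observing range
    color: float                       # stored intensity (from the most head-on view so far)
    seen_bins: np.ndarray = field(default_factory=lambda: np.zeros(64, dtype=bool))
    frames_since_seen: int = 0         # used for the starvation-based removal rule

    @property
    def visibility_confidence(self) -> int:
        """Number of distinct viewing-direction bins from which this surfel was observed."""
        return int(self.seen_bins.sum())
```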

4.2. Model update

After performing the segmentation described in Section 3, we use surfel update rules similar to Weise et al. (2009) to modify the object model Sobj. Each surfel location p_i is projected into the image plane. We then use bilinear interpolation to determine the point p_i^* and normal n_i^* at that same location in the current frame. p_i has a depth d_i and p_i^* has a depth d_i^*; the difference d_i − d_i^* is used to determine the update rule that is used. In the following rules, we will say that a sensor reading p_i^* is a valid object reading if it is part of the object (as determined by the object segmentation in Section 3.1) and n_i^* does not deviate from the camera direction by more than θ_max.


1. |d_i − d_i^*| ≤ d_max: If p_i^* is a valid object reading and n_i does not deviate from the camera direction by more than θ_max, then the surfel is updated. This is done by computing running averages of the surfel location and normal, and updating the grid of viewing directions. Additionally, if the new measurement was taken from a closer location, then the radius of the surfel is updated accordingly. If the conditions do not hold, then we do nothing.

2. d_i − d_i^* < −d_max: In this case, the observed point is behind the surfel. If the visibility confidence v_i is below v_high, then the existing surfel is considered an outlier and removed. If v_i is at least v_high, then the reading is considered an outlier and is ignored.

3. d_i − d_i^* > d_max: The observed point is in front of the model surfel s_i, so we do not update the surfel.

After the surfel update comes the surfel addition step. For each pixel in the object segments, a new surfel is added if there are no existing surfels with normals toward the camera either in front of or close behind the reading. This is a simple heuristic; however, it allows us to acquire models of objects which have two surfaces close together, such as the inside and outside of a coffee mug. Finally, there is one more removal step. Any surfel with v_i < v_starve that has not been seen within the last t_starve frames is removed. This is very effective at removing erroneous surfels without the need to return to a viewing angle capable of observing the surfel patch. More details on the parameters in this approach and reasonable values for them can be found in Weise et al. (2009).
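The three depth cases can be expressed as a small decision rule, sketched below. It assumes the Surfel record from the earlier sketch and hypothetical thresholds d_max and v_high; the actual averaging and bin updates in case 1 are only indicated by comments.

```python
def update_surfel(surfel, d_model, d_obs, valid_object_reading, d_max=0.01, v_high=10):
    """Apply the depth-comparison update rule to one surfel.
    d_model: depth of the projected surfel; d_obs: interpolated frame depth at that pixel.
    Returns False if the surfel should be removed from the model."""
    diff = d_model - d_obs
    if abs(diff) <= d_max:
        if valid_object_reading:
            pass  # case 1: running-average position/normal, mark viewing bin, maybe shrink radius
        return True
    if diff < -d_max:
        # case 2: the observation lies behind the surfel
        return surfel.visibility_confidence >= v_high  # low-confidence surfels are treated as outliers
    return True  # case 3: the observation is in front of the surfel; leave it untouched
```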

4.3. Loop closure

We also base our loop closure on the techniques developed by Weise et al. (2009). The approach involves maintaining a graph whose nodes are a subset of the surfels in the object model. An edge in the graph indicates that the nodes were both visible and used for registration in the same frame. We refer the interested reader to Weise et al. (2009) for a more detailed description of the procedure.

The most important operation on the graph is the computation of connected components when ignoring currently non-visible nodes. These components represent surfaces from separate passes over the visible region. To prevent all of the loop closure error from occurring in a single frame when a second connected component comes into view, only one connected component can be matched into each frame. The error is distributed over the whole structure in a separate relaxation step.

As illustrated in Figure 8(a), the relaxation step is triggered when two connected components both individually explain some minimum percentage of the object pixels in a frame (surfels maintain connections to nearby nodes and can therefore be associated with connected components).

Fig. 8. Illustration of the loop closure and merging procedures. (a) Two connected components overlap enough to trigger a loop closure. (b) TORO brings the components into alignment. (c, d) Each surfel is updated to a consensus position and orientation based on nearby surfels in the other component. (e) A new surface is constructed, removing redundancy; preference is given to high-confidence surfels.

At this point, there is some number L of connected components (usually just two), each with associated surfels. These component surfaces are surfel clouds denoted SC1, . . . , SCL.

Each of the L component clouds is registered into the current frame using ICP, yielding transformations T^S_{C1}, . . . , T^S_{CL} bringing the surfels in each component into the sensor frame. The tracking algorithm described in Section 3 implicitly defines a transformation T^S_{obj} of the whole object into the sensor frame through its mean state vector. By combining these two transformations, we arrive at a local adjustment transformation for each component which makes it align to the sensor data when T^S_{obj} is applied: T^{adj}_{Ci} = (T^S_{obj})^{-1} * T^S_{Ci}.

We cannot simply transform each visible surfel by its appropriate T^{adj} transform because this would break the surface at the boundaries between visible and non-visible surfels. Instead, Weise et al. propose an optimization over node poses which trades off the T^{adj} constraints and relative node pose constraints.

If T^{init}_i and T^{init}_j are the pre-loop-closure poses of the ith and jth nodes respectively, then the pose of the jth node within the coordinate system of the ith node is T^{init}_{i→j} = (T^{init}_i)^{-1} * T^{init}_j. Utilizing constraints of this form in the relaxation step forces the solution to respect the relative node poses at the boundaries and to spread the error over the entire structure.

We use TORO (Grisetti et al., 2007), an open-source pose graph optimization tool, to optimize over node poses with the following constraints:

• For each node i, if it is in some connected component Cj, its pose in the object coordinate frame is constrained to be T^{adj}_{Cj} * T^{init}_i, with identity information matrix.

• For each node i and each of its k (we use k = 4) closest connected nodes j, the relative transformation between nodes is constrained to be T^{init}_{i→j}, with identity information matrix.

The result of the TORO optimization is a new pose for every node in the graph. Poses of non-node surfels are subsequently determined through the interpolation process described in Weise et al. (2009). The result will look something like Figure 8(b).
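The constraint transforms above are simple compositions of 4x4 homogeneous matrices. The helper functions below show that bookkeeping; the graph optimization itself is delegated to TORO in the paper, so these names and this decomposition are assumptions for illustration.

```python
import numpy as np

def adjustment_transform(T_S_obj, T_S_Ci):
    """T_adj_Ci = (T_S_obj)^-1 * T_S_Ci: aligns component i with the frame once the
    whole-object transform is applied. Inputs are 4x4 homogeneous matrices."""
    return np.linalg.inv(T_S_obj) @ T_S_Ci

def relative_pose(T_init_i, T_init_j):
    """T_init_{i->j} = (T_init_i)^-1 * T_init_j: pose of node j in node i's frame."""
    return np.linalg.inv(T_init_i) @ T_init_j

def absolute_constraint(T_adj_Cj, T_init_i):
    """Target object-frame pose for node i of component Cj after loop closure."""
    return T_adj_Cj @ T_init_i
```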

4.4. Surface merging

After the loop closure procedure completes, the object model will have multiple overlapping surfaces. In particular, there will be overlaps among the surfaces SC1, . . . , SCL, which we illustrate in Figure 8(b). One could simply allow this overlap to persist, but it is more desirable to instead try to find a consensus among the layers to further average out measurement noise, as in Figure 8(e). A single-layered surface also avoids giving overlapping regions extra weight in (4) caused by higher point density.

Our approach is to first use the algorithm shown in Table 2 to transform each surfel into the consensus position and normal of the surfaces within a distance r. The position of a surfel is adjusted along its normal direction to match the average distance along this normal to each surface. The normal is set to the average of the normals found for each of the nearby surfaces. These averages can alternatively be weighted averages, where the weights are determined by the average visibility confidence of the surfels being considered. These weights are omitted from Table 2 for readability. This consensus procedure is similar to the one used in mesh zippering (Turk and Levoy, 1994).

After the consensus procedure we are faced with the problem of redundant surfels. Because it is not as straightforward to merge attributes such as viewing direction grids (which may not be aligned in their discretizations), we would like to be careful about which surfels (and therefore which attributes) we include in the updated model.

We choose to use visibility confidence as a measure of which surfels to retain when there is redundancy. We construct a new surfel cloud by first sorting the surfels from all surfaces by visibility confidence. Each surfel, in order of decreasing confidence, is added if a line segment extending from +r to −r in its normal direction does not intersect any surfels already in the new cloud. This results in a final surfel cloud having only a single layer of surfels. A summary of the loop closure and merging procedure can be seen in Figure 8.
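A simplified version of this greedy selection is sketched below. Approximating the ±r segment-intersection test by a proximity-and-normal-offset check against already accepted surfels, and reusing the Surfel record from the earlier sketch, are assumptions of this illustration.

```python
import numpy as np

def merge_by_confidence(surfels, r):
    """Greedily keep the most-confident surfels, skipping ones whose +/- r segment along the
    normal would overlap an already accepted nearby surfel (approximate redundancy test)."""
    order = sorted(surfels, key=lambda s: s.visibility_confidence, reverse=True)
    kept = []
    for s in order:
        redundant = False
        for t in kept:
            offset = t.position - s.position
            # a nearby accepted surfel within +/- r along s's normal counts as an intersection
            if np.linalg.norm(offset) < 2.0 * r and abs(np.dot(offset, s.normal)) < r:
                redundant = True
                break
        if not redundant:
            kept.append(s)
    return kept
```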

5. Experiments and results

The robot used in our experiments is shown in Figure 1.The basic setup consists of a WAM Arm and BarrettHandmounted on a Segway. The depth camera is located to theside and above the robot manipulator so as to provide agood view of the manipulator workspace. The specific depthcamera we use, developed by PrimeSense (PrimeSense,

Table 2. Consensus surfel algorithm. Transforms multiple surfel clouds to obtain corrected surfel positions and normals. This algorithm can be followed by a surfel removal step to give a single consensus surfel cloud.

Consensus_surfel_clouds(S_1, . . . , S_L):
  S'_1, . . . , S'_L = S_1, . . . , S_L
  for i ∈ {1, . . . , L}
    for j ∈ {1, . . . , |S_i|}
      depthSum = 0
      normalSum = n_j
      numSurfaces = 1
      for k ∈ {1, . . . , L}, k ≠ i
        S = surfels_in_range(S_k, p_j, r)
        if |S| > 0
          depthSum += avg_dist_along_normal(S, n_j)
          normalSum += avg_normal(S)
          numSurfaces += 1
      p'_j = p_j + (depthSum / numSurfaces) ∗ n_j
      n'_j = normalize(normalSum)
  return S'_1, . . . , S'_L


We collected depth camera data and joint angle sequences of the moving system. In all but the final experiments, which use Next Best View planning, the manipulator grasps and trajectories were specified manually and the objects were grasped only once. One example of a hand-specified trajectory can be seen in Extension 1. Using multiple grasps to generate complete object models is discussed briefly in Section 5.3.

Our current implementation of the algorithm described in Section 3.1 runs at 2 to 3 frames per second. We are confident that the update rate of the system can be increased to 10 frames per second using a more optimized implementation and taking advantage of GPU hardware. To simulate such a higher update rate, we played back the datasets at approximately one-fifth real time. We have found that, by skipping frames, we are able to operate in real time, but the resulting models are not as detailed.

The tracking requires a few parameters, which we set heuristically. First are the α weights in (8). We set α1 to 5 to give feature correspondences higher weight than regular correspondences. This is desirable because we have more reason to believe the feature correspondences and because features are somewhat sparse, so they contribute fewer terms to the summation. α2 is set to 0.0001, which we select by giving equal weight to a 1 mm spatial error and a 10% color error, the latter of which could naturally arise from lighting variation. Finally, we set α3 to 0.005, which corresponds roughly (given the typical number of correspondences we have) to equating one standard deviation from the prior to a few millimeters of misalignment for each correspondence.
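As a sanity check of the α2 choice above (assuming the spatial and color terms enter (8) as squared errors, with distances in meters and color on a 0–1 scale): α2 · (0.10)² = 10⁻⁴ · 10⁻² = 10⁻⁶ = (0.001 m)², so a 10% color deviation incurs the same penalty as a 1 mm spatial deviation.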

For the Kalman filter, we use diagonal covariance matrices and set the variances empirically. We initialize the uncertainties in the joint angles to 5°, camera translation and rotation to 0.03 m and 0.01 respectively (rotation uncertainty using the L2 distance of normalized quaternions), and object translation and rotation to 0.01 m and 0.01. The motion uncertainties are 2° for the joint angles and 0.005 m and 0.01 for the object translation and rotation (the camera is assumed fixed). Finally, the measurement uncertainties are 4° for the joint angles, 0.02 m and 0.01 for the camera translation and rotation, and 0.01 m and 0.02 for the object translation and rotation. The measurement uncertainties are purposely large to reflect that we never truly know whether ICP has found the correct solution, so that the Kalman filter does not grow overconfident.
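A minimal sketch of how these diagonal covariances could be laid out (Python/NumPy). The state ordering, the use of seven joints, treating the quoted values as standard deviations, and converting joint angles to radians are illustrative assumptions, not details of our implementation.

    import numpy as np

    NUM_JOINTS = 7  # assumed joint count for the arm

    def diag_cov(joint_deg, cam_trans, cam_rot, obj_trans, obj_rot):
        # Block layout: [joint angles | camera translation | camera rotation
        # (quaternion, L2 convention) | object translation | object rotation].
        stds = np.concatenate([
            np.full(NUM_JOINTS, np.deg2rad(joint_deg)),
            np.full(3, cam_trans), np.full(4, cam_rot),
            np.full(3, obj_trans), np.full(4, obj_rot),
        ])
        return np.diag(stds ** 2)

    Sigma_0 = diag_cov(5.0, 0.03, 0.01, 0.01, 0.01)  # initial uncertainty
    R_k = diag_cov(2.0, 0.0, 0.0, 0.005, 0.01)       # motion noise (camera held fixed)
    Q_k = diag_cov(4.0, 0.02, 0.01, 0.01, 0.02)      # measurement noise (deliberately large)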

5.1. Joint manipulator and object tracking

In this experiment we evaluate the ability of our technique to track the position of the robot hand. Specifically, we investigate whether our system would enable accurate tracking of a low-cost manipulator equipped with position feedback far less accurate than that of the WAM arm. To do so, we use the WAM controller's reported angles as ground truth. Though these angles are far from perfect, they provide a common comparison point for the different noise settings. To simulate an arm with greater inaccuracies, we included normally distributed additive noise of varying magnitude.
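A minimal sketch of how such noise could be injected (Python), assuming the ° s^{-1/2} drift rate describes a per-joint random walk sampled at the 30 Hz frame rate; the names are illustrative.

    import numpy as np

    def add_encoder_drift(joint_angles_deg, drift_rate_deg, dt=1.0 / 30.0, rng=None):
        # joint_angles_deg: (T, J) array of reported joint angles over T frames.
        # drift_rate_deg: drift rate in degrees per sqrt(second); the per-frame step
        # standard deviation is drift_rate_deg * sqrt(dt), so the accumulated drift
        # grows with the square root of elapsed time.
        rng = np.random.default_rng() if rng is None else rng
        steps = rng.normal(0.0, drift_rate_deg * np.sqrt(dt), size=joint_angles_deg.shape)
        return joint_angles_deg + np.cumsum(steps, axis=0)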

To provide an intuitive feel for the units involved, we show in Figure 9 an example of the deviation between the reported and observed manipulator after 20 seconds of motion at a 2.0° s^{-1/2} noise level. In Figure 10, we demonstrate that our tracking algorithm can handle large amounts of noise in the reported angles without losing accuracy in tracking the end effector. Red dots in Figure 10 show errors for the uncorrected, noisy pose estimates. Green dots, along with 95% confidence intervals, show tracking results when ignoring the object in the robot's hand. Blue dots are results for our joint tracking approach, when modeling the object and tracking it along with the manipulator. Each dot represents the end effector positioning error at the end of the tracking sequence, averaged over multiple runs and multiple arm motions. Although the noise associated with the encoder readings increases along the x-axis, we do not change the motion model covariance R_k; however, adjusting R_k to better reflect the true uncertainties involved would only improve the results we present here.

Fig. 9. (left) Drift resulting from 2.0° s^{-1/2} noise. Actual sensor data in true color, tracking result in red, and noisy joint angles in blue. (right) Surfel model produced at this level of noise.

Fig. 10. Distance of the end effector from the ground truth as a function of the per-joint angle drift rate. Notice that the tracking errors begin earlier when only the arm, and not the object, is tracked. The horizontal offset of 'Joint Tracking' is for display purposes only.

As can be seen, modeling the object further increases the robustness to noise. With arm-only tracking, large variations in end effector accuracy (indicative of frequent failures) begin around 3.0° s^{-1/2}, while comparable failure levels do not occur until 3.4° s^{-1/2} for the joint tracking. This is because we can both explicitly reason about the modeled object occluding the fingers and use the object as an additional surface to match. An example model generated for the coffee mug under high noise conditions is shown in Figure 9. When comparing this model to one built without additional noise (see Figure 12), it becomes apparent that our approach successfully compensates for motion noise.

Additionally, we ran experiments on the same data sets ignoring the encoders entirely and assuming a constant velocity motion model for Step 2 of Table 1 (i.e. u_k = μ_{k−1} − μ_{k−2}). This is not the same as ignoring the arm; the arm is still tracked and still provides benefits such as disambiguating the rotation of symmetric objects as described in Section 3. In these experiments, we found end effector errors of 2.15 cm, which is almost identical to the errors when using the true encoder values. This both attests to the accuracy of our ICP-based alignment procedure and suggests that our approach may be applicable to domains where encoder information is not available.
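A minimal sketch of this substitution (Python); the prediction shown is a generic constant-velocity update and is not a reproduction of Table 1.

    def constant_velocity_prediction(mu_prev, mu_prev2):
        # u_k = mu_{k-1} - mu_{k-2}: when encoders are ignored, reuse the previous
        # estimated state change as the control input.
        u_k = mu_prev - mu_prev2
        mu_pred = mu_prev + u_k   # predicted mean before the ICP-based measurement update
        return mu_pred, u_k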


Fig. 11. Tracking accuracy when using an arm model with a shortened upper arm. Our algorithm is able to track the end effector even in the presence of both robot modeling errors and discrepancies between encoder values and true joint angles.

Fig. 12. Shown here are a can and a mug aligned with ICP alone on the left of each image and with our algorithm on the right. Due to the high level of symmetry in these objects, ICP is unable to find the correct alignments between sensor frames, resulting in largely useless object models.

Besides factors such as cable stretch, which simply cause encoder values to disagree with the true joint angles, there may be other sources of noise. One such example is inaccuracies in the robot model. To test our algorithm's resilience to model inaccuracies, we altered our robot model by shortening the upper arm by 1.0 cm. We then re-ran the experiments from Figure 10 to determine how combinations of both modeling and encoder errors affect our tracking algorithm. These results are shown in Figure 11. The end effector error increases to about 2.5 cm with the shortening of the arm, but our algorithm is able to reliably track the hand up to at least 2.5° s^{-1/2}. Notice again that the joint tracking remains resilient to higher levels of encoder noise than the arm-only tracking.

5.2. Object modeling

In this set of experiments we investigate how our algorithm performs in terms of the quality of the resulting object models.

Fig. 13. Pre-loop closure can model (left) without use of the dense color error term; (middle) with the color error term; (right) with the color error term after loop closure and surface merging.

First, we examine the ability of our algorithm to model rotationally symmetric objects and objects lacking distinctive geometric regions. Many existing object modeling algorithms, such as Weise et al. (2009), rely on being able to geometrically match an object model into the new frame. In this experiment, object segmentations from our joint tracking algorithm were fed to ICP to be aligned without any information about the hand motion. The resulting clouds are compared with the output of our system in Figure 12. As can be seen in the figure, ICP is unable to recover the proper transformations because of the ambiguity in surface matching. It should be noted that for the mug case in particular, systems like ProFORMA (Pan et al., 2009), which rely on tracking visual features, would also be incapable of tracking or modeling the object.

We note that the presence of the color error term (6) can help improve the object models. The left image in Figure 13 shows a pre-loop closure model of a can generated without the dense color error term, while the middle image shows the model built when using color. The model generated using color has noticeably less accumulated vertical misalignment.

We have also found that surfels are a compact and elegant representation for maintaining object models. Besides the benefits of occlusion-checking and incremental update, multiple measurements can be merged into a single surfel, and the arrangement is cleaner and more visually appealing. Figure 14 illustrates the difference between raw data point clouds and surfel models. Shown on the right are the surfel patches belonging to two separate objects. The two panels on the left show the raw, colored point clouds from which the surfels were generated. The raw clouds contained on the order of one million points and were randomly downsampled for visualization purposes. The surfel clouds contain on the order of ten thousand surfels.

Fig. 14. Comparison between aggregated point clouds and surfel models generated from the same data and the same frame alignments.

Fig. 15. Surfel models and their respective meshed versions. Meshing can fill in small gaps between surfels, such as the rim of the coffee can that was partially occluded by the palm. On the other hand, it struggles with thin surfaces like the top of the orange juice box and can smooth out sharp corners.

The surfel models we have obtained in our online process contain accurate information about both surface positions and normals, and can be readily converted to meshes in a post-processing step. We use the open-source Meshlab software, first applying the Poisson Reconstruction algorithm (Kazhdan et al., 2006) to obtain a surface mesh from the oriented point cloud. We then apply Catmull–Clark subdivision to refine the mesh. Colors for vertices are assigned using the color of the nearest surfel in the original model.
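A minimal sketch of the nearest-surfel color assignment (Python, using SciPy's k-d tree); the array layout is an assumption.

    from scipy.spatial import cKDTree

    def color_mesh_vertices(vertices, surfel_positions, surfel_colors):
        # Assign each mesh vertex the color of the nearest surfel in the original model.
        tree = cKDTree(surfel_positions)
        _, nearest = tree.query(vertices, k=1)
        return surfel_colors[nearest]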

Meshing is potentially useful because many existing algorithms operate on meshes (e.g. simplification, refinement, hole filling), and many algorithms, including most grasping algorithms, assume meshes as inputs. Meshing additionally adds visual appeal and can fill in small gaps between surfels (and in fact larger ones, but we remove triangles with long edges to leave larger holes intact in our results; a sketch of this filter follows below). Figure 15 shows examples of the output of the meshing procedure.
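A minimal sketch of the long-edge filter (Python); max_edge is a hypothetical threshold.

    import numpy as np

    def drop_long_edge_triangles(vertices, faces, max_edge):
        # Keep a triangle only if all three edges are shorter than max_edge, so that
        # large occlusion holes stay open instead of being bridged by the meshing step.
        tri = vertices[faces]                                              # (F, 3, 3)
        edge_len = np.linalg.norm(tri - np.roll(tri, -1, axis=1), axis=2)  # (F, 3)
        return faces[edge_len.max(axis=1) < max_edge]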

A few of the reconstructed objects are shown in Figure 16. Rendered videos of the models are available in Extension 2. Additionally, a video demonstrating the arm tracking and surfel model construction is available in Extension 1.

Finally, we compare dimensions of some of the objects in Figure 16 and Figure 17 with measurements taken of the physical objects to better gauge the accuracy of the reconstructions. The results are shown in Table 3 and, as can be seen, the reconstructed dimensions are typically within a few millimeters of the true dimensions.

5.3. Toward autonomous object modeling

Fig. 16. Triangulated surface models constructed from surfel clouds. Left column: a router box, large Lego pieces, and a mug. Right column: a can, a clock, and a stuffed doll. Holes in the models are due to occlusion by the hand or regions left unseen by the trajectory.

Fig. 17. Side-by-side comparisons showing object models before and after regrasping. From left to right: an orange juice bottle, a coffee can, and a pill bottle. The regrasping procedure allows holes caused by the manipulator to be filled in. In the case of the orange juice bottle, a small hole remains as there was some overlap between the grasp locations.

To perform autonomous grasping and modeling, we first implemented an approach based on the findings of Balasubramanian et al. (2010) that enables the robot to pick up an unknown object. The object grasp point and approach direction are determined by first subtracting the table plane from the depth camera data and then computing the principal component of the point cloud representing the object. The approach is then performed orthogonal to this principal component. While this technique is not intended as a general grasping approach, it worked well enough for our initial experiments. Alternatively, one can use local visual or geometric features as in Saxena et al. (2008) to obtain this first grasp.
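A minimal sketch of this computation (Python). The text leaves open which direction orthogonal to the principal component is used for the approach, so the least-variance axis chosen here is only one illustrative option.

    import numpy as np

    def grasp_approach(object_points):
        # object_points: Nx3 cloud of the segmented object (table plane already removed).
        centered = object_points - object_points.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        principal_axis = vt[0]   # direction of greatest extent of the object
        approach = vt[2]         # one direction orthogonal to the principal component
        return principal_axis, approach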

The model can be improved by allowing the robot to place the object back down and regrasp it. To generate the second grasps, and to choose which regions of the object to examine, we use a Next Best View algorithm which we present in a related paper (Krainin et al., 2011). The results of this system are shown in Figure 17. Here it can clearly be seen that regrasping allows for filling in previously occluded regions and that the modeling is capable of resuming, even after an exchange of grasps.

In summary, we have presented an end-to-end robotic system capable of picking up an unknown object, tracking and modeling the object in the presence of noisy joint sensors, regrasping the object, and resuming modeling to fill in holes. Additionally, the regrasp planning (Krainin et al., 2011) uses the partial object model, demonstrating grasp planning as one application that can greatly benefit from the ability to generate such models.


Table 3. Comparison of reconstructed dimensions with the measured dimensions of the original objects.

Object            Dimension   Measured [cm]   Reconstructed [cm]   Error [cm]
Boddingtons Can   Diameter    6.6             6.5                  0.1
Lego Block        Width       3.2             3.3                  0.1
Lego Block        Length      6.4             6.7                  0.3
OJ Bottle         Width       9.7             10.1                 0.4
OJ Bottle         Height      19.6            19.7                 0.1
Pill Bottle       Diameter    9.5             9.3                  0.2
Pill Bottle       Height      18.0            17.7                 0.3
White Mug         Diameter    9.2             9.4                  0.2

Fig. 18. Failure cases: (top-left) Poor alignment after a regrasp due to factors such as high uncertainty in the object pose, non-distinctive geometry, and poor lighting. (top-right) RGB image from the same frame. (bottom) Poorly estimated loop closure transformation on the lip of a mug (inconsistency circled) caused by fairly uniform color and geometry.

5.4. Failure cases

Although our technique makes use of many sources of information to guide its tracking, registration failures can still occasionally occur if many things go wrong at once. One such scenario is shown in Figure 18 (top). In these images, the front face of a box is being tracked after the box has been regrasped. The combination of an uncertain grasp location, a completely planar surface, poor lighting conditions, and an oblique viewing angle results in the misalignment.

Failures can also occur in the loop closure procedure, which relies on ICP to register each individual connected component into the current frame. For some objects, such as the white mug in Figure 18 (bottom), ICP may not always succeed due to the fairly uniform color and geometry.

Finally, failures may occur during the regrasping procedure. While our tracking algorithm can handle slight wobble, sliding, or slipping during the exchange, it cannot currently handle large unexpected motions such as an object falling over on the table. This is due to ICP's requirement of being initialized close to the true state. Recovery from an object falling over would likely require a more global search.

6. Conclusions and future work

We developed an algorithm for tracking robotic manipulators and modeling grasped objects using RGB-D sensors. Our approach performs tracking, robot-to-sensor calibration, and object modeling in a single Kalman filter-based framework. Experiments show that the technique can robustly track a manipulator even when significant noise is imposed on the position feedback provided by the manipulator or when the robot model contains inaccuracies. The experiments also show that jointly tracking the hand and the object grasped by the hand further increases the robustness of the approach. The insight behind this technique is that even though an object might occlude the robot hand, the object itself can serve as guidance for the pose estimate.

We also introduced a tight integration of the tracking algorithm and an object modeling approach. Our technique uses the Kalman filter estimate to initially locate the object and to incorporate new observations into the object model. We use surfels as the key representation underlying the object and manipulator models. This way, our approach can perform occlusion-based outlier rejection and adapt the resolution of the representation to the quality of the available data.

An alternative to our approach would be to generate an object model by moving a camera around the object. However, such an approach cannot provide information about object parts that are not visible given the object's position in the environment. Furthermore, our approach of investigating an object in the robot's hand also lends itself to extracting information about the object's weight and surface properties.

Our key motivation for this work is enabling robots to actively investigate objects in order to acquire rich models for future use. Toward this goal, several open research questions need to be addressed. We have implemented a very simple approach for initial grasp generation but, particularly in the presence of clutter, grasp planning for unknown objects is far from a solved problem. Also, we have shown that our object models are useful for grasp planning, but we would also like to explore techniques for detecting objects through attached features, such as in the work of Collet Romea et al. (2009), so that a robot can quickly detect and grasp objects it has examined before.

In the longer run, we would like to use autonomous modeling as a step toward achieving a greater semantic understanding. For instance, if a robot can autonomously construct models of new objects and detect them reliably, it can begin extracting further information, possibly including typical locations, use cases, and relationships with other objects. The robot can also learn to better interact with an object by, for example, storing grasp locations that worked well on that particular object. Finally, allowing robots to share collected models and associated information would enable them to learn much more rapidly and is a problem worth exploring.

Acknowledgements

We would like to thank Brian Mayton for all of his assistance, including generating and executing suitable grasps and arm trajectories for our experiments. We would also like to thank Peter Brook for allowing us to incorporate his heuristic grasp planner into our system.

Funding

This work was funded in part by an Intel grant, by ONR MURI grants number N00014-07-1-0749 and N00014-09-1-1052, and by the NSF under contract number IIS-0812671. Part of this work was also conducted through collaborative participation in the Robotics Consortium sponsored by the U.S. Army Research Laboratory under the Collaborative Technology Alliance Program, Cooperative Agreement W911NF-10-2-0016.

Conflict of interest

None declared.

References

Athitsos V and Sclaroff S (2003) Estimating 3D hand pose from a cluttered image. IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2: 432.
Balasubramanian R, Xu L, Brook P, Smith J and Matsuoka Y (2010) Human-guided grasp measures improve grasp robustness on physical robot. In: Proceedings of the IEEE International Conference on Robotics & Automation (ICRA), pp. 2294–2301.
Berenson D and Srinivasa S (2008) Grasp synthesis in cluttered environments for dexterous hands. In: Proceedings of the IEEE-RAS International Conference on Humanoid Robots (Humanoids).
Chen Y and Medioni G (1992) Object modelling by registration of multiple range images. Image and Vision Computing 10: 145–155.
Ciocarlie M, Goldfeder C and Allen P (2007) Dimensionality reduction for hand-independent dexterous robotic grasping. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
Collet Romea A, Berenson D, Srinivasa S and Ferguson D (2009) Object recognition and full pose registration from a single image for robotic manipulation. In: Proceedings of the IEEE International Conference on Robotics & Automation (ICRA).
Curless B and Levoy M (1996) A volumetric method for building complex models from range images. In: ACM Transactions on Graphics (Proceedings of SIGGRAPH).
Diankov R (2010) Automated Construction of Robotic Manipulation Programs. PhD thesis, Robotics Institute, Carnegie Mellon University. http://www.programmingvision.com/rosen_diankov_thesis.pdf.
Druon S, Aldon M and Crosnier A (2006) Color constrained ICP for registration of large unstructured 3D color data sets. International Conference on Information Acquisition, pp. 249–255.
Ganapathi V, Plagemann C, Koller D and Thrun S (2010) Real time motion capture using a single time-of-flight camera. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR).
Glover J, Rus D and Roy N (2009) Probabilistic models of object geometry with application to grasping. International Journal of Robotics Research (IJRR) 28: 999–1019.
Godin G, Laurendeau D and Bergevin R (2001) A method for the registration of attributed range images. International Conference on 3D Digital Imaging and Modeling, p. 179.
Grisetti G, Grzonka S, Stachniss C, Pfaff P and Burgard W (2007) Efficient estimation of accurate maximum likelihood maps in 3D. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), San Diego, CA, USA.
Habbecke M and Kobbelt L (2007) A surface-growing approach to multi-view stereo reconstruction. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR).
Henry P, Krainin M, Herbst E, Ren X and Fox D (2010) RGB-D mapping: Using depth cameras for dense 3D modeling of indoor environments. In: Proceedings of the International Symposium on Experimental Robotics (ISER), New Delhi, India.
Johnson A (1997) Spin-Images: A Representation for 3-D Surface Matching. PhD thesis, Robotics Institute, Carnegie Mellon University.
Johnson A and Kang SB (1997) Registration and integration of textured 3-D data. In: International Conference on Recent Advances in 3-D Digital Imaging and Modeling (3DIM '97), pp. 234–241.
Kashani A, Owen WS, Himmelman N, Lawrence PD and Hall RA (2010) Laser scanner-based end-effector tracking and joint variable extraction for heavy machinery. The International Journal of Robotics Research. http://ijr.sagepub.com/cgi/content/abstract/0278364909359316v1.
Kawewong A, Tangruamsub S and Hasegawa O (2009) Wide-baseline visible features for highly dynamic scene recognition. In: Proceedings of the 13th International Conference on Computer Analysis of Images and Patterns (CAIP), pp. 723–731.
Kazhdan M, Bolitho M and Hoppe H (2006) Poisson surface reconstruction. In: Proceedings of the Fourth Eurographics Symposium on Geometry Processing.
Kraft D, Pugeault N, Baseski E, Popovic M, Kragic D, Kalkan S, et al. (2008) Birth of the object: Detection of objectness and extraction of object shape through object-action complexes. International Journal of Humanoid Robotics 5: 247–265.
Krainin M, Curless B and Fox D (2011) Autonomous generation of complete 3D object models using next best view manipulation planning. In: Proceedings of the IEEE International Conference on Robotics & Automation (ICRA), Shanghai, China, pp. 5031–5037.
Lai K and Fox D (2009) 3D laser scan classification using web data and domain adaptation. In: Proceedings of Robotics: Science and Systems (RSS).
Lemuz-López R and Arias-Estrada M (2006) Iterative closest SIFT formulation for robust feature matching. In: Proceedings of the Second International Symposium on Advances in Visual Computing, Springer Lecture Notes in Computer Science, 4292: 502–513.
Levoy M, Pulli K, Curless B, Rusinkiewicz S, Koller D, Pereira L, et al. (2000) The Digital Michelangelo Project: 3D scanning of large statues. In: ACM Transactions on Graphics (Proceedings of SIGGRAPH).
Li WH and Kleeman L (2009) Interactive learning of visually symmetric objects. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), St Louis, MO, USA. http://www.waiholi.com/papers/li_kleeman_iros09_learn.pdf.
Mündermann L, Corazza S and Andriacchi TP (2007) Accurately measuring human movement using articulated ICP with soft-joint constraints and a repository of articulated models. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR).
Pai D, van den Doel K, James D, Lang J, Lloyd J, Richmond J, et al. (2001) Scanning physical interaction behavior of 3D objects. In: ACM Transactions on Graphics (Proceedings of SIGGRAPH), pp. 87–96.
Pan Q, Reitmayr G and Drummond T (2009) ProFORMA: Probabilistic feature-based on-line rapid model acquisition. In: Proceedings of the 20th British Machine Vision Conference (BMVC), London.
Pellegrini S, Schindler K and Nardi D (2008) A generalisation of the ICP algorithm for articulated bodies. In: Everingham M and Needham C (eds) British Machine Vision Conference (BMVC'08).
Pfister H, Zwicker M, van Baar J and Gross M (2000) Surfels: Surface elements as rendering primitives. In: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques. New York, USA: ACM Press/Addison-Wesley Publishing Co., pp. 335–342.
PrimeSense (2010) PrimeSense Home Page. http://www.primesense.com/.
Rasolzadeh B, Bjorkman M, Huebner K and Kragic D (2009) An active vision system for detecting, fixating and manipulating objects in real world. International Journal of Robotics Research (IJRR).
Rusinkiewicz S, Hall-Holt O and Levoy M (2002) Real-time 3D model acquisition. ACM Transactions on Graphics (Proceedings of SIGGRAPH) 21: 438–446.
Sato Y, Wheeler M and Ikeuchi K (1997) Object shape and reflectance modeling from observation. In: ACM Transactions on Graphics (Proceedings of SIGGRAPH), pp. 379–387.
Saxena A, Driemeyer J and Ng A (2008) Robotic grasping of novel objects using vision. International Journal of Robotics Research (IJRR) 27: 157–173.
Sharp GC, Lee SW and Wehe DK (2002) ICP registration using invariant features. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 24: 90–102.
Strobl K, Mair E, Bodenmuller T, Kielhofer S, Sepp W, Suppa M, et al. (2009) The self-referenced DLR 3D-modeler. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 21–28.
Sturm J, Plagemann C and Burgard W (2009) Body schema learning for robotic manipulators from visual self-perception. Journal of Physiology-Paris 103: 220–231.
Sudderth EB, Mandel MI, Freeman WT and Willsky AS (2004) Visual hand tracking using nonparametric belief propagation. Computer Vision and Pattern Recognition Workshop 12: 189.
Triebel R, Frank B, Meyer J and Burgard W (2004) First steps towards a robotic system for flexible volumetric mapping of indoor environments. In: Proceedings of the 5th IFAC Symposium on Intelligent Autonomous Vehicles.
Turk G and Levoy M (1994) Zippered polygon meshes from range images. In: ACM Transactions on Graphics (Proceedings of SIGGRAPH). New York, NY, USA: ACM, pp. 311–318.
Ude A, Omrcen D and Cheng G (2008) Making object learning and recognition an active process. International Journal of Humanoid Robotics 5: 267–286.
Weik S (1997) Registration of 3-D partial surface models using luminance and depth information. International Conference on 3D Digital Imaging and Modeling, p. 93.
Weise T, Wismer T, Leibe B and Gool LV (2009) In-hand scanning with online loop closure. In: IEEE International Workshop on 3-D Digital Imaging and Modeling.
Wu C (2007) SiftGPU: A GPU implementation of scale invariant feature transform (SIFT). http://cs.unc.edu/~ccwu/siftgpu.

Index to Multimedia Extensions

The multimedia extension page is found at http://www.ijrr.org.

Extension   Type    Description
1           Video   A demonstration of the tracking and modeling process
2           Video   Meshed models generated from single grasps of objects
