
Learning Motion Planner Output from Depth Images via Fully Convolutional Networks

Sherdil Niyaz

NVIDIA Research

1 Introduction

Given a depth image of a scene and a series of SE(3) grasp proposals in the frame of the image, our goal is to classify which grasp proposals are reachable given kinematic and collision constraints.

These proposals could be sampled via some analytic metric, or proposed by another model that takes the depth image as input. Our deep-learning-based approach to computing reachability is motivated by a desire to avoid the overhead of a traditional motion planner and to eliminate the need for full knowledge of the environment. Our approach requires only a depth image of the scene, as opposed to knowledge of all obstacles and their locations.

In this write-up we demonstrate a proof-of-concept system built for this task, an overview of which is given in Fig. 1. We first detail a pipeline used to generate synthetic training data. We then explain the process used to train our model, a fully convolutional network (FCN), and justify our choice of data representation. Finally, we report results from preliminary simulation experiments, and use these to identify key next steps required to improve performance.

2 Training Data Generation

2.1 Scene Generation

We randomly generate scenes using the OpenRave simulation environment. In our experiments we consider grasping objects in clutter on a tabletop.

Figure 1: A schematic of our system. Core steps are shown in blue, while external components used in each step are shown in orange.


Figure 2: Successfully reaching grasp proposals in two different randomly-generated scenes.

We place n objects, each at a random location on the table and with a random rotation about its own axis. Objects are placed such that no two are in collision with each other and such that some part of each object is within the robot’s reachable workspace. Two example scenes are shown in Fig. 2.
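
To make the sampling procedure concrete, the following Python sketch outlines the scene-generation loop under stated assumptions: the helper functions (sample_table_position, place_object, in_collision, in_reachable_workspace, remove_object) are hypothetical stand-ins for the corresponding OpenRave calls, and the retry limit is illustrative.

import math
import random

def generate_scene(env, object_meshes, n, max_attempts=100):
    placed = []
    for mesh in random.sample(object_meshes, n):
        for _ in range(max_attempts):
            # Random position on the table and random rotation about the
            # object's own axis.
            x, y = sample_table_position(env)
            yaw = random.uniform(0.0, 2.0 * math.pi)
            body = place_object(env, mesh, x, y, yaw)
            # Reject placements that collide with an already-placed object or
            # leave no part of the object inside the reachable workspace.
            if in_collision(env, body, placed) or not in_reachable_workspace(env, body):
                remove_object(env, body)
                continue
            placed.append(body)
            break
    return placed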

2.2 Computing Reachability

We are given m grasp proposals for the various objects contained in each scene. These proposals were sampled using an analytic metric on each object mesh. Each grasp is parameterized using a roll θ and offset σ with respect to the object normal ~n at specified (u, v) coordinates in the depth image. We only permit one grasp proposal per set of (u, v) coordinates.

For each grasp pose among our proposals, we attempt to reach the pose using a snap planner. This is done by sampling an inverse kinematics solution q_goal for the pose using a deterministic IK solver, and then taking a straight line in configuration space between q_start and q_goal. A plan can fail either due to kinematic constraints (causing the IK solver to return no solution) or collision constraints (by contacting an obstacle or the robot at some point along the returned trajectory). We use a snap planner because it only requires an SE(3) pose to generate a motion plan, as opposed to complete knowledge of the obstacle geometry. This enables proposals classified as reachable by the final system to immediately be executed. We note that q_start is kept constant across all scenes, and that changing it would correspondingly change the reachability labels.

We also employ a pruning strategy where, for each proposed grasp pose, we first collision check only the gripper with the environment. We note that any grasp that would cause the gripper itself to be in collision is clearly unreachable. This pruning is used to avoid collision checking the entire arm along the entire trajectory: only if the gripper itself would be in free space do we then collision check the entire plan.
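
The reachability check, including the gripper-only pruning step, can be summarized by the following sketch. The helpers gripper_in_collision, solve_ik, and arm_in_collision are hypothetical wrappers around the simulator's collision checker and a deterministic IK solver; the waypoint count is illustrative.

import numpy as np

def check_reachable(env, robot, grasp_pose, q_start, n_waypoints=50):
    # Pruning: if the gripper alone would collide at the grasp pose, the grasp
    # is unreachable and we skip the (expensive) full-arm check.
    if gripper_in_collision(env, robot, grasp_pose):
        return False, "collision"

    # Deterministic IK: no solution means the pose is kinematically infeasible.
    q_goal = solve_ik(robot, grasp_pose)
    if q_goal is None:
        return False, "kinematic"

    # Snap plan: a straight line in configuration space from q_start to q_goal,
    # collision checked at interpolated waypoints.
    for t in np.linspace(0.0, 1.0, n_waypoints):
        q = (1.0 - t) * q_start + t * q_goal
        if arm_in_collision(env, robot, q):
            return False, "collision"
    return True, None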

2.3 Saved Data

We render a synthetic depth image of each scene using the Blender 3D modeling software. We also save the values (u, v, θ, σ, ~n) that define each grasp proposal, as well as whether that proposal was reachable via the snap planner. In the event that a grasp proposal was not reachable, we also record the reason (either that it was kinematically infeasible or led to a collision). A visualization of some of the data generated by our pipeline is shown in Fig. 3.
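
For illustration, a minimal version of the per-scene record might look as follows. The field names and the .npz container are our own choices here, not the exact on-disk format used by the pipeline.

import numpy as np

def save_scene(path, depth_image, proposals):
    # proposals: list of dicts with keys u, v, theta, sigma, normal (3-vector),
    # reachable (bool), and failure ("kinematic", "collision", or None).
    np.savez_compressed(
        path,
        depth=depth_image.astype(np.float32),
        uv=np.array([[p["u"], p["v"]] for p in proposals], dtype=np.int32),
        theta=np.array([p["theta"] for p in proposals], dtype=np.float32),
        sigma=np.array([p["sigma"] for p in proposals], dtype=np.float32),
        normal=np.array([p["normal"] for p in proposals], dtype=np.float32),
        reachable=np.array([p["reachable"] for p in proposals], dtype=bool),
        failure=np.array([p["failure"] or "" for p in proposals]),
    )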


Figure 3: At left, an example of our synthetic depth images. At right, each pixel is colored according to the grasp proposal with the corresponding (u, v) coordinates. A blue pixel corresponds to an unreachable proposal, while a yellow pixel corresponds to a reachable one. Purple pixels have no corresponding proposals.

3 Training the Network

3.1 Data Representation

We encode each scene and its grasp proposals as an eight-channel image. The first channel (C1) contains the synthetic depth image of the scene. In C2 and C3, each pixel contains its own u and v coordinate, respectively. This is done because our network contains only convolutional layers, which share weights across all regions of the image. Without these u/v channels, the network would lack the spatial information it needs to learn kinematic reachability. C4 has each (u, v) pixel contain the roll θ of the grasp proposal at those coordinates. Coordinates that do not correspond to a grasp proposal contain a -1. C5 is identical, but contains the offset σ instead. C6, C7, and C8 contain the x, y, and z components of the object normal ~n at coordinates corresponding to proposals. Coordinates not corresponding to a proposal take a value of 0 in these channels. We found that using 0 as the default value here, instead of -1, better distinguished the different values in these channels.
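
A minimal sketch of this encoding, assuming the per-proposal fields saved earlier (the channel ordering and helper names here are ours):

import numpy as np

def encode_scene(depth, proposals):
    h, w = depth.shape
    img = np.zeros((8, h, w), dtype=np.float32)
    img[0] = depth                                   # C1: depth image
    vs, us = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    img[1] = us                                      # C2: u coordinate of each pixel
    img[2] = vs                                      # C3: v coordinate of each pixel
    img[3] = -1.0                                    # C4: roll theta (-1 where no proposal)
    img[4] = -1.0                                    # C5: offset sigma (-1 where no proposal)
    # C6-C8 default to 0 where there is no proposal.
    for p in proposals:                              # at most one proposal per (u, v)
        u, v = p["u"], p["v"]
        img[3, v, u] = p["theta"]
        img[4, v, u] = p["sigma"]
        img[5:8, v, u] = p["normal"]                 # C6-C8: x, y, z of the object normal
    return img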

This data representation is chosen because it allows the scene and all its grasp proposals to be encoded as a single eight-channel image. This requires only a single iteration of backpropagation to update the weights based on the reachability of the entire scene. We initially attempted to represent each grasp proposal as a single training point, i.e. as a depth image associated with a tuple (u, v, θ, σ). However, this caused an explosion in the number of backpropagation iterations required for a single epoch.

The labels for each scene are represented as a binary image, where (u, v) coordinates corresponding to reachable and unreachable proposals take values of 1 and 0, respectively. Coordinates not corresponding to a grasp proposal also take a value of 0, but we will later see that the loss function ignores the output of the network at these coordinates during training.

3.2 Network Architecture

We use the fully convolutional network (FCN) in [5], originally used for semantic segmentation. We modify the network to take an eight-channel image as input instead of three, and apply batch normalization after each activation. In our experiments, we found this use of batch normalization to be critical in training the network. We train the network using stochastic gradient descent with momentum, and use a batch size of five scenes. The FCN outputs a value p_R for each (u, v) coordinate in the input image, which we take as the probability of that grasp being reachable.
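
The two modifications, an eight-channel input and batch normalization after each activation, are illustrated below on a toy convolutional model written in PyTorch. This is not the VGG16-based FCN of [5]; the architecture, learning rate, and framework choice here are illustrative only.

import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(out_ch),   # batch normalization applied after the activation
    )

class ToyFCN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(conv_block(8, 64), conv_block(64, 64))
        self.head = nn.Conv2d(64, 1, kernel_size=1)  # one logit per (u, v)

    def forward(self, x):
        # Output p_R in [0, 1] for every pixel of the eight-channel input.
        return torch.sigmoid(self.head(self.features(x)))

model = ToyFCN()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)  # lr is illustrative
# Batches of five scenes, each an 8 x H x W tensor built by the encoding above.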


Figure 4: A comparison of predicted grasp reachability versus the ground truth for a standard tabletop scene. Predictions are similar with (bottom) and without (top) the sensor noise model applied. Note that the FCN output at (u, v) coordinates not corresponding to a proposal is not visualized.

As opposed to traditional cross-entropy loss, we use focal loss [3] to train the FCN. We also “target” this loss towards only the (u, v) coordinates that correspond to grasp proposals (i.e. coordinates where C4 and C5 contain values that are not -1). This is done by only using the FCN output at these coordinates to compute the focal loss, and zeroing out the contribution from other pixels. This focuses backpropagation on better distinguishing reachable and unreachable grasp proposals, and prevents the FCN from wasting capacity on pixels that carry no proposal.
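
A sketch of this targeted focal loss is given below, assuming per-pixel predictions, labels, and a proposal mask derived from C4/C5. The focusing parameter γ and balancing weight α shown are common choices, not necessarily the values used in our experiments.

import torch

def targeted_focal_loss(p_r, labels, proposal_mask, gamma=2.0, alpha=0.25, eps=1e-6):
    # p_r:   predicted reachability probabilities, shape (B, H, W)
    # labels: 1 for reachable proposals, 0 otherwise, shape (B, H, W)
    # proposal_mask: 1 where a proposal exists (C4/C5 != -1), 0 elsewhere
    mask = proposal_mask.float()
    labels = labels.float()
    p_t = torch.where(labels > 0.5, p_r, 1.0 - p_r)
    alpha_t = torch.where(labels > 0.5,
                          torch.full_like(p_r, alpha),
                          torch.full_like(p_r, 1.0 - alpha))
    loss = -alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=eps))
    # Zero out the contribution of pixels that carry no grasp proposal.
    loss = loss * mask
    return loss.sum() / mask.sum().clamp(min=1.0)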

4 Experiments, Results, and Insights

4.1 Object Split

We perform initial experiments in simulation. We randomly generate tabletop scenes with graspable objects selected from a variety of mesh corpora. Training and test data are generated using the same synthetic data pipeline described previously. However, we split our objects in a deliberate manner when generating these training and test scenes.

Each corpus of object meshes is broken into sets of “train” and “test” objects. Scenes used to train the FCN only contain objects from the “train” set, and likewise for scenes used to test the FCN after training. Each test set contains the same object classes as the training set, but contains different instances of these classes. For example, the training and test set may both contain mugs, but the test set will contain mugs that are different from those seen in the training set. We also ensure that both sets contain a uniform distribution over the different object classes. Performing this split allows us to evaluate generalization by showing the network new objects at test time that are similar to what it saw during training, but not identical. While in these experiments we split scenes over object class, we note that other splits could be employed, such as over the number of objects.
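
A sketch of this class-wise split is shown below: every class appears in both sets, but each individual mesh instance appears in only one. The 50/50 split ratio and the random seed are illustrative.

import random
from collections import defaultdict

def split_objects(meshes_by_class, test_fraction=0.5, seed=0):
    rng = random.Random(seed)
    train, test = defaultdict(list), defaultdict(list)
    for cls, meshes in meshes_by_class.items():
        meshes = meshes[:]
        rng.shuffle(meshes)
        k = int(len(meshes) * test_fraction)
        test[cls] = meshes[:k]
        train[cls] = meshes[k:]
    return train, test

# Scenes then draw objects uniformly over classes, e.g.:
#   cls = rng.choice(list(train.keys())); mesh = rng.choice(train[cls])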


Figure 5: A comparison of predicted grasp reachability versus the ground truth for a more constrained box scene. The sensor noise model is applied on the bottom depth image.

4.2 First Object Corpus: Mugs and Bowls

These experiments use mugs and bowls selected from ShapeNet [1], resized for the ABB YuMi’s gripper. All training and test scenes contain n = 3 objects.

4.2.1 Tabletop Experiment

Our first experiments use scenes generated on the same table as Fig. 2. We train the network on about 1500 scenes and evaluate it on 20 test scenes containing objects from the “test” set. We also train/test the network with and without a sensor noise model applied to the depth images, in order to evaluate its robustness to this noise. Accuracy is evaluated by rounding the output p_R value at each coordinate corresponding to a grasp proposal to 0 or 1, and reporting the percentage of proposals that are correctly classified as reachable or unreachable. The network performs similarly without and with the sensor noise model applied, achieving 80.5% and 80.9% test accuracy respectively. An example of predicted grasp reachability versus the ground truth is provided in Fig. 4.

In addition to pure accuracy, we also evaluate a metric of safety: of the grasp proposals classified as reachable by the network, we report the percentage that were classified correctly as such (i.e., the precision on the “reachable” class). Our intuition is that false positives in our network could be dangerous, leading to the robot colliding with the environment. In these tabletop scenes, the network achieved safety scores of 75.3% and 76.4% without and with the sensor noise model applied, respectively.
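
Both metrics can be computed directly from the rounded network output, as in the following sketch (the array names are ours; inputs are per-scene prediction, label, and proposal-mask images):

import numpy as np

def evaluate(p_r, labels, proposal_mask):
    pred = (p_r >= 0.5).astype(np.int32)       # round p_R to 0 or 1
    m = proposal_mask.astype(bool)             # only pixels with grasp proposals count
    accuracy = np.mean(pred[m] == labels[m])
    predicted_reachable = m & (pred == 1)
    if predicted_reachable.any():
        # Safety: fraction of predicted-reachable proposals that truly are reachable.
        safety = np.mean(labels[predicted_reachable] == 1)
    else:
        safety = float("nan")
    return accuracy, safety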

4.2.2 Constrained Box Scene

We also explore how the network performs in a more constrained environment. Here, the robot must grasp objects in a box with openings on the top and front (Fig. 6). We train the FCN on about 1000 scenes. Without and with the noise model applied, the network achieves a test accuracy of 79.3% and 78.0% and a safety score of 72.3% and 72.0%, respectively. Example predictions are shown in Fig. 5.


Figure 6: A reachable grasp proposal in the more constrained box environment.

4.2.3 Mixed Environment Scenes

We are also interested in how the network adapts to reachability in different environments. We create a training set containing 1000 training scenes from each of the table and box experiments, and a combined test set of 40 scenes using both test sets. Ideally, the network should generalize and learn that proposals reachable in the tabletop environment may not be reachable in the box environment. This appears to be the case: without and with the sensor noise model applied, the network achieves a test accuracy of 79.4% and 79.7% respectively and a safety score of 73.9% in both cases.

4.2.4 Effectiveness of Targeting Loss

To explore the effectiveness of only computing the focal loss using coordinates corresponding to grasp proposals, we train the FCN for the tabletop experiment in Section 4.2.1 both with and without this adjustment. We recall that without the adjustment, the focal loss is computed using the output p_R for all coordinates in the input. The FCN achieves a test accuracy of only 54.9% without the adjustment to the loss function, compared to 80.5% with the adjustment. We note that these experiments were performed without the sensor noise model applied.

4.3 Second Object Corpus: Mugs, Bowls, and Bottles

Our next experiments use mugs and bowls taken from ShapeNet, but also include bottles. These bottles have a very different distribution of grasps than the mugs and bowls, with most of them being concentrated on the neck of the bottle.

4.3.1 Tabletop Experiment

We randomly generate training and test scenes in the same fashion as earlier, again using n = 3 objects per scene and the table from Fig. 2. We use approximately 1500 training scenes. Performance on these tabletop scenes is notably lower than on those that contained mugs and bowls but excluded bottles. Without and with the sensor noise model applied, the FCN achieves a test accuracy of 74.2% and 74.1%, and average safety scores of 68.5% and 67.6%, respectively.


Figure 7: A comparison of labels from the snap planner (top) and from sampling free IK solutions (bottom). We see that the latter are far less noisy.

4.3.2 Insight: The Choice of Motion Planner

These lower scores begin to make sense when examining the labels for the grasp proposals used to train the FCN. On bottles, the ground truth labels are extremely noisy, as shown in Fig. 7. Our current hypothesis is that this arises from our use of a snap planner as opposed to a more powerful motion planner. A more complete planner, such as a PRM [2], explores configuration space more exhaustively to find a path between q_start and q_goal. By comparison, a snap planner is extremely brittle: if the interpolated path between q_start and q_goal is in collision, the planner will simply fail and return no solution. In addition, the planner is extremely sensitive, since different IK solutions for similar grasp poses will lead to wildly different trajectories. In short, the use of such a simple and sensitive planner can lead to successes becoming almost random.

To support this hypothesis, we generated a different set of labels using these bottles. Instead of recording whether a grasp proposal was reachable via our snap planner, we sampled multiple IK solutions for each one and recorded whether at least one was in free space. While our snap planner represents one extreme, these labels are intended to show another: a “perfect” planner that can move from q_start to one of the free-space IK solutions. We note that in the unconstrained tabletop environment used, this is not an unreasonable assumption. We see in Fig. 7 that these labels are far less noisy than those generated by the snap planner. This suggests a function that would be easier for the FCN to learn, as opposed to our extremely noisy labels.
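
A sketch of this alternative labeling scheme is given below. sample_ik_solutions and arm_in_collision are hypothetical helpers of the same kind used in the reachability check earlier, and the number of sampled IK solutions is illustrative.

def free_ik_label(env, robot, grasp_pose, n_samples=20):
    # A grasp is labeled reachable if any sampled IK solution is collision free,
    # ignoring the path taken to reach it.
    for q in sample_ik_solutions(robot, grasp_pose, n_samples):
        if not arm_in_collision(env, robot, q):
            return True    # at least one free-space IK solution exists
    return False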

However, we recall our original reasoning for selecting a snap planner. This allows any grasp proposal classified as reachable to immediately be used to generate a trajectory. All that is required is an IK solution and a linear interpolation in configuration space. Notably, no information about the environment (locations of obstacles, etc.) is required. However, a more powerful planner requires more information than we would have a priori at execution time, given that our input is only a depth image. A PRM, for example, requires the exact obstacle geometry in order to perform collision checks. There appears to be a trade-off between the information required to run the planner and the ability of the network to learn the corresponding reachability labels. Ideally, our core planner would balance this trade-off, requiring less information than a traditional graph-based planner while being less brittle than a snap planner. A deterministic planner that could be run from a depth image while providing some degree of collision avoidance would be ideal.

4.4 Third Object Corpus: 3DNet

Our final corpus of objects is taken from 3DNet [4]. While previous experiments used only a few object classes, these experiments use objects from 33 different classes, all scaled to fit within the gripper of the YuMi. We use these objects to create highly cluttered scenes on the table from Fig. 2, each containing n = 10 objects. An example of one such scene is shown in Fig. 9.


Figure 8: A comparison of predicted grasp reachability versus the ground truth for cluttered tabletop scenes using objects from 3DNet.

4.4.1 Tabletop Experiment

The network is trained on approximately 1000 scenes. With these highly cluttered tabletop scenes, the FCN achieves a test accuracy of 80.5% and 79.1% without and with the sensor noise model applied. In addition, the FCN achieves respective safety scores of 63.6% and 59.3% in these highly challenging scenes. We see that the network has less trouble with false negatives than it does with false positives. We hypothesize that this could be improved by using a more robust planner (as discussed in the previous section) as well as more data. We discuss efficiently generating training data in the next section. Examples of predictions from our network are shown in Fig. 8.

5 Next Steps

Our preliminary system has enabled experiments that in turn reveal key next steps. As demonstrated earlier, our use of a snap planner can lead to extremely noisy labels in some scenarios due to the planner’s simple nature. These results suggest that use of a more robust planner would lead to less noisy reachability labels, and thus a function that could more easily be learned by the network. Integrating such a planner into the Reachability Computation stage of our system (Fig. 1) should be treated as a very high priority. However, this planner should require as little information about the world as possible, given that our input is only a depth image as opposed to the exact collision geometry.

In addition, an important engineering challenge emerges at this stage of our system. We currently use the OpenRave simulator to compute reachability. This includes loading each scene, running the motion planner, and then computing collision checks to determine which trajectories are valid. However, OpenRave does not use the GPU for these tasks. While we currently employ threading to parallelize reachability computation, the restriction of running on the CPU means this is still a very slow process. For example, one of our scenes contains two bowls and a mug, with a total of 3963 grasp proposals to evaluate across all objects. For this scene, evaluating reachability for all proposals takes a total of 189.4 seconds. Furthermore, collision checking and IK solving together constitute the vast majority of this time, taking 87.9 and 94.3 seconds for the entire scene respectively.
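
The thread-based parallelization can be sketched as follows. check_reachable is the snap-planner routine sketched earlier; the worker count is illustrative, and in practice each worker would typically need to operate on its own clone of the simulation environment.

from concurrent.futures import ThreadPoolExecutor

def scene_reachability(env, robot, grasp_poses, q_start, workers=8):
    # Evaluate reachability for all grasp proposals of one scene across a
    # CPU thread pool; results come back in the order of grasp_poses.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(check_reachable, env, robot, pose, q_start)
                   for pose in grasp_poses]
        return [f.result() for f in futures]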


Figure 9: Examples of highly cluttered scenes using objects from 3DNet.

Switching to a simulator that allows both collision checking and IK sampling to be parallelized on the GPU would clearly accelerate reachability computation for different scenes. This would in turn increase the amount of synthetic training data that could be used with the FCN, and would likely increase performance. We note that taking full advantage of this would also require any new planner to be parallelizable on the GPU as well.

6 Code

Code for this project can be found in the following repositories. Note that some or all of these repositories may be private. Please contact Clemens Eppner (ceppner AT nvidia DOT com) for access.

Scene Generation and Reachability Computation (branch: “sniyaz”)

Training Data Formatting (branch: “sniyaz”)

FCN Training and Testing (branch: “sniyaz”)

References

[1] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An Information-Rich 3D Model Repository. CoRR, abs/1512.03012, 2015.

[2] L. E. Kavraki, P. Svestka, J. C. Latombe, and M. H. Overmars. Probabilistic Roadmaps for Path Planning in High-Dimensional Configuration Spaces. IEEE Transactions on Robotics and Automation, vol. 12, no. 4, pp. 566–580, 1996.

[3] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal Loss for Dense Object Detection. CoRR, abs/1708.02002, 2017.

[4] W. Wohlkinger, A. Aldoma, R. B. Rusu, and M. Vincze. 3DNet: Large-Scale Object Class Recognition from CAD Models. In IEEE International Conference on Robotics and Automation (ICRA), pp. 5384–5391, 2012.

[5] Y. Xiang. https://github.com/yuxng/DA-RNN/blob/master/lib/networks/vgg16_convs.py.
