
Map-based Deep Imitation Learning for Obstacle Avoidance

Yuejiang Liu1, An Xu2, Zichong Chen1

Abstract— Making an optimal decision to avoid obstacles while heading to the goal is one of the fundamental challenges for mobile robots equipped with limited computational resources. In this paper, we present a deep imitation learning algorithm that develops a computationally efficient obstacle avoidance policy based on egocentric local occupancy maps. The trained model, embedded with a variant of the value iteration networks, is able to provide near-optimal continuous action commands through fast feed-forward inferences and generalize well to unseen planning-based scenarios. To improve the policy robustness, we augment the training data set with artificially generated maps, which effectively alleviates the shortage of catastrophic samples in normal demonstrations. Extensive experiments on a Segway robot show the effectiveness of the proposed approach in terms of solution optimality, robustness as well as computation time.

I. INTRODUCTION

Obstacle avoidance is one of the fundamental skills that mobile robots need to acquire. A variety of algorithms have been developed in the past decade to enable robots to drive towards a given destination or along a reference path without running into obstacles [1], [2], [3], [4]. Despite significant progress, the gap in motion intelligence between autonomous mobile robots and humans remains far from closed. For instance, humans make movement decisions with negligible effort, react to emergencies robustly, and behave in a smooth and natural manner. By contrast, consumer mobile robots have been struggling to do so.

A group of high-performance obstacle avoidance algorithms addresses the local decision-making problem from an optimization perspective. The main characteristic of these algorithms is to minimize a pre-defined cost function (or maximize a utility function) associated with the potential trajectories in the near future, which are predicted in the control or state space [5]. The optimal trajectory, together with the corresponding control action, is then used to drive the mobile robot. Though mathematically sound, identifying a proper cost function that accurately describes the optimal behavior can be difficult. Moreover, solving the optimization problem also poses a practical challenge: the problem is generally nonlinear due to the robot dynamics and geometry constraints, and hence difficult to solve on consumer robots in real time. To reduce computational expenses, researchers have proposed several methods such as sampling [2] and linearization [3]. These approximations, however, inevitably degrade solution optimality.

1 Segway Robotics Inc., Beijing 100192, China, {yuejiang.liu, zichong.chen}@ninebot.com

2 Department of Electronic Engineering, Tsinghua University, Beijing 100084, China, [email protected]

Aside from the optimization-based approach, imitation learning (IL), particularly end-to-end learning, has become an emerging technique to develop intelligent planning or control policies [6]. By means of convolutional neural networks (CNNs), deep imitation learning algorithms have been proposed for a wide range of applications and achieved exciting results [4], [7], [8], [9]. Nevertheless, the performances of most existing CNN-based planners or controllers are noticeably inferior to their behavioral demonstrators [4]. One of the open challenges lies in the capacity of models for planning-based representations [10], [11]. Another one is the distribution mismatch between the states visited by the demonstrator and the learned policy, which may result in sub-optimal and even divergent trajectories. Some on-policy training strategies try to address this issue by iteratively gathering training samples closer to the learned policy [12], [13]. However, the scarcity of catastrophic events in the training dataset remains a hidden danger for safety-critical operations.

The goal of this work is to train an obstacle avoidance policy that is near-optimal for sequential decision making, robust in emergency events and applicable to consumer mobile robots equipped with limited computational resources. Different from end-to-end imitation learning, we develop a local policy based on an egocentric local occupancy map [14]. This local map, incorporating multi-sensor, multi-frame information, provides a compact representation of the surrounding environment. We model the local policy by a variant of the value iteration networks [10], which are known for planning-based feature extraction. To enhance the robustness, we collect the training dataset from a mixture of real-world demonstrations and artificially generated maps, which not only speeds up the data collection procedure but also effectively alleviates the shortage of dangerous samples that are rarely encountered in normal expert demonstrations.

The main contribution of this work is two-fold:

• Formulate obstacle avoidance for mobile robots as an imitation learning problem based on a preprocessed egocentric local map, which serves as a compact and generic data representation for policy learning.

• Overcome the shortage of demonstration samples, particularly from the rare yet dangerous events, by generating a set of artificial local maps.

The rest of this paper is organized as follows. In Section II, we provide a brief review of the related work. In Section III, we present our deep imitation learning algorithm for obstacle avoidance based on preprocessed local maps. Experiments with analysis are presented in Section IV,




followed by conclusions and discussion in Section V.

II. RELATED WORK

An extensive body of recent research in imitation learning has focused on the end-to-end approach. For example, convolutional neural networks are trained to map a visual image input to the left/right steering command for ground mobile robots [7], quadrotors [8] and autonomous cars [9]. Another recent work [4] proposes a target-driven motion planner producing linear and angular velocities from the raw input of a laser range-finder. Despite these advances, the effectiveness of neural network models in the planning context remains a challenge.

To improve the planning-based reasoning, several novel neural network architectures inspired by classical planning algorithms have recently been proposed. An end-to-end architecture named Predictron is developed in [11], the core of which is an abstract model representing the Markov reward process that can be rolled out for value estimation. Another deep neural network architecture with a similar underlying motivation is the value iteration networks [10], which embed a special module for recursive value iteration. In [15], a recurrent network, motivated by path integral optimal control, is developed to learn both the cost and the dynamics models.

Another practical challenge for imitation learning is the data distribution mismatch between the demonstration expert and the learned policy. To cope with it, an iterative data aggregation approach is proposed in [12] and applied to learn a reactive controller for micro aerial vehicles [16]. This approach is further extended to SafeDAgger [13], which trains a safety policy to prevent the learning process from falling into dangerous states while reducing the human intervention frequency. Another recent paper addresses this issue by using an adaptive model predictive controller as the demonstrator that adjusts its policy to gradually eliminate the distributional shift [17]. These approaches, though they mitigate the sample distribution problem to some extent, can hardly overcome the scarcity of dangerous events in real-world demonstrations, and thus leave a pitfall for safety-critical applications.

III. APPROACH

In this section, we present the map-based deep imitation learning approach for obstacle avoidance, which aims to learn a near-optimal policy for driving among obstacles as efficiently, safely and generically as possible.

A. System Structure

The designed obstacle avoidance system consists of two blocks. As shown in Figure 1, the first block preprocesses the raw sensory input and sends a local egocentric occupancy map that describes the surrounding physical environment to the second block. In addition, a local goal in the robot coordinate system is generated according to task-specific requirements. For instance, in a person following task, the local goal is defined as a standby point sitting behind the target person. In a navigation task, the local goal is a waypoint on the remaining global path free of obstacles. The robot velocity is also supplied to the second policy block for sequential decision making.

Fig. 1: Block diagram of the map-based obstacle avoidance system for mobile robots. The first block produces a local occupancy map, a local goal as well as the vehicle velocity, based on which the second block infers an action command from a local policy network.

B. Problem Formulation

Given a local egocentric occupancy map $m_o$, a local goal position $p_g$ in the robot coordinate system and the robot velocity $u'$, the proposed local obstacle avoidance policy provides an action command as follows:

$m_r = h_\lambda(p_g, m_o)$ (1)

$u = f_\theta(m_r, u')$ (2)

where $m_r$ is a local cost map describing the obstacle avoidance task, $u = (v, \omega)$ is a pair of linear velocity and angular velocity for execution, and $\theta$ and $\lambda$ are model parameters.

Specifically, the cost map $m_r$ is constructed as an aggregate of the local goal reward and the obstacle penalty:

$m_r = \lambda\, m_g(p_g) - m_o$ (3)

where $m_g$ is an egocentric binary map of the local goal, on which the value of the pixel corresponding to the goal equals 1 and all other pixels equal 0. In case the local goal point $p_g = (x_g, y_g)$ lies out of the local map range $x \in [\underline{x}_m, \overline{x}_m]$ and $y \in [\underline{y}_m, \overline{y}_m]$, the goal point is linearly projected onto the map border. For instance, if $x_g > \overline{x}_m$ and $\underline{y}_m < y_g < \overline{y}_m$, the projected position $p'_g = (\overline{x}_m, \frac{y_g}{x_g}\overline{x}_m)$ is used as the goal point.

Note that the local cost map $m_r$ can be learned within the neural network model as well [10]. However, in our empirical experiments, we do not observe a significant difference in prediction accuracy between the hand-crafted cost map and a learned one. One reason might be that the linearly combined cost map (3) can essentially represent the rewards that the demonstrator seeks. Additionally, the value function is ultimately learned and adapted to the cost map. In this work, we remove the layers for cost learning, aiming to reduce model redundancy.
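To make the cost-map construction concrete, the following is a minimal NumPy sketch of Eqs. (1) and (3), assuming a robot-centred grid of fixed extent and resolution; the helper name, the map extents and the clipping of out-of-range goals along the goal ray are illustrative assumptions, not details taken from the paper.

import numpy as np

def build_cost_map(m_o, p_g, lam=1.0, x_lim=1.4, y_lim=1.4, res=0.1):
    # Sketch of Eq. (3): m_r = lambda * m_g(p_g) - m_o.
    # m_o : egocentric occupancy grid with values in [0, 1]
    # p_g : local goal (x_g, y_g) in the robot frame [m]
    # lam : weight lambda on the goal reward
    # x_lim, y_lim, res : assumed map half-extents and resolution
    h, w = m_o.shape
    x_g, y_g = p_g
    # Project an out-of-range goal onto the map border along the goal ray,
    # mirroring the projection rule described in the text.
    if abs(x_g) > x_lim or abs(y_g) > y_lim:
        scale = min(x_lim / (abs(x_g) + 1e-9), y_lim / (abs(y_g) + 1e-9))
        x_g, y_g = scale * x_g, scale * y_g
    # Convert metric goal coordinates to grid indices (robot assumed at map centre).
    col = int(np.clip(round((x_g + x_lim) / res), 0, w - 1))
    row = int(np.clip(round((y_g + y_lim) / res), 0, h - 1))
    m_g = np.zeros_like(m_o)
    m_g[row, col] = 1.0        # binary goal map: 1 at the goal cell, 0 elsewhere
    return lam * m_g - m_o     # hand-crafted cost map m_r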



Fig. 2: The map-based deep neural network model for learning an obstacle avoidance policy. The parameters in the value iteration layers represent the filter size, depth and stride. The parameters in the fully connected layers represent the dimension of output units. In our experiment, the number of recurrences in the value iteration block, K, is chosen to be 36, which is related to the size of the local map image.

Supervised by a set of expert demonstrations $\hat{u}$, a local policy is trained by minimizing the following loss function:

$J(\theta) = \sum_i \lVert u_i - \hat{u}_i \rVert^2$ (4)

where $i$ is the sample index.
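For completeness, a one-line NumPy sketch of the loss in Eq. (4); the array shapes are an assumption for illustration.

import numpy as np

def imitation_loss(u_policy, u_demo):
    # Eq. (4): sum over samples of the squared L2 error between the commands
    # inferred by the policy and the demonstrated commands.
    # Both arrays are assumed to have shape (num_samples, 2) for (v, omega).
    return np.sum((u_policy - u_demo) ** 2)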

C. Neural Network Model

To imitate the intended behavior for obstacle avoidance, we model the local policy by means of the value iteration networks [10], which are known for planning-based reasoning. As illustrated in Figure 2, the value iteration module first extracts high-level planning features from the local map input through a recursion of policy improvement and truncated policy evaluation [18]:

$V_{k+1}(x_i) = \max_{u_i} \sum_{x_{i+1}, r_{i+1}} p(x_{i+1}, r_{i+1} \mid x_i, u_i)\,\big[ r_{i+1} + \gamma V_k(x_{i+1}) \big]$ (5)

where $x_i$ is the grid cell where the robot is located at time step $i$, $r_{i+1}$ is the reward received when moving from $x_i$ to $x_{i+1}$, $V(x)$ is the value function of the grid cell $x$, $k$ is the iteration number, $\gamma$ is the discount factor and $p(x_{i+1}, r_{i+1} \mid x_i, u_i)$ represents the transition probability. The features extracted from the value iteration module are subsequently fused with the vehicle velocity and supplied to the fully connected layers. A velocity command is finally produced by the network to drive the robot.
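The recurrence in Eq. (5) can be illustrated with a small NumPy sketch. Note that the actual value iteration module learns its transition and reward filters as convolutions; the fixed 4-connected transition model, the wrap-around boundary handling and the discount value below are simplifying assumptions for illustration only, with K = 36 taken from the experimental setting.

import numpy as np

def value_iteration_features(m_r, K=36, gamma=0.95):
    # Recursive value iteration over the cost map m_r, assuming a deterministic
    # transition to one of the four neighbouring cells per action.
    V = np.zeros_like(m_r)
    for _ in range(K):
        # For each cell, the return of moving to a neighbour is that
        # neighbour's immediate reward plus its discounted value.
        q = m_r + gamma * V
        candidates = [np.roll(q, shift, axis) for shift, axis in
                      [(-1, 0), (1, 0), (-1, 1), (1, 1)]]
        V = np.max(np.stack(candidates), axis=0)   # greedy max over actions
    return V   # planning features fed, together with the velocity, to the FC layers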

D. Policy Demonstration

The desired behavior for obstacle avoidance can be demonstrated by either human experts or a well-tuned optimization-based algorithmic planner. In this work, we employ the latter as an efficient demonstrator, which computes a predictive driving command by minimizing the following cost at each sampling time:

$\min_{u} \ \sum_{i=1}^{N} m_o(x_i) + w_1 d_g + w_2 \alpha_g + w_3 \sum_{i=0}^{N-1} \lVert u_{i+1} - u_i \rVert$ (6a)

s.t. $x_{i+1} = h(x_i, u_i), \quad i = 0, \dots, N-1$ (6b)

$m_o(x_i) \le m, \quad i = 0, \dots, N$ (6c)

where $x_i$ is the robot pose at time step $i$, $N$ is the length of the prediction horizon, $d_g$ is the distance between the robot pose and the local goal point at time step $N$, $\alpha_g$ is the heading angle between the robot orientation and the direction from the robot to the local goal at time step $N$, $h(x, u)$ is the robot dynamics model, $m_o(x)$ is the occupancy probability of the grid cell corresponding to the pose $x$, $m$ is the maximum occupancy probability safe for robots to visit, and $w_1, w_2, w_3$ are cost weights.

As $m_o(x_i)$ and $h(x_i, u_i)$ are generally non-convex, we tackle problem (6) on the robot by a sampling-based solver that generates and evaluates a set of candidate trajectories [2]. For the purpose of expert demonstration, the candidate set is designed to be large and diverse so as to approximate a near-optimal solution.
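A minimal sketch of such a sampling-based demonstrator is given below, assuming unicycle dynamics for h and a helper lookup(m_o, pose) that returns the occupancy probability of the grid cell containing a pose; the candidate grid, horizon, weights and occupancy threshold are illustrative values, not the settings of the 734-trajectory demonstrator. Because each candidate uses a constant command over the horizon, the smoothness term of (6a) reduces here to the change relative to the previous command.

import numpy as np

def demonstrate(m_o, p_g, u_prev, lookup, N=20, dt=0.1,
                w1=1.0, w2=0.5, w3=0.1, m_max=0.5):
    # Enumerate candidate (v, omega) commands, roll each out through a unicycle
    # model (6b), discard rollouts violating the occupancy constraint (6c) and
    # return the command whose rollout minimizes the cost (6a).
    best_u, best_cost = None, np.inf
    for v in np.linspace(0.0, 1.0, 11):            # candidate linear velocities
        for w in np.linspace(-1.5, 1.5, 31):       # candidate angular velocities
            x, y, th = 0.0, 0.0, 0.0               # egocentric start pose
            cost, feasible = 0.0, True
            for _ in range(N):
                x += v * np.cos(th) * dt
                y += v * np.sin(th) * dt
                th += w * dt
                occ = lookup(m_o, (x, y))
                if occ > m_max:                    # occupancy constraint (6c)
                    feasible = False
                    break
                cost += occ                        # accumulated obstacle penalty
            if not feasible:
                continue
            d_g = np.hypot(p_g[0] - x, p_g[1] - y)                 # terminal goal distance
            a_g = abs(np.arctan2(p_g[1] - y, p_g[0] - x) - th)     # heading error (wrapping omitted)
            smooth = np.hypot(v - u_prev[0], w - u_prev[1])        # command change penalty
            cost += w1 * d_g + w2 * a_g + w3 * smooth
            if cost < best_cost:
                best_u, best_cost = (v, w), cost
    return best_u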

E. Data Acquisition

As mentioned in Section II, an open challenge for imitation learning is the state distribution mismatch between the demonstration expert and the policy learner. To efficiently address this issue in practice, we gather the training samples from two sources. The first one is the set of experimental trajectories collected from normal demonstrations, which a well-trained policy is expected to encounter most of the time. The second source is artificially generated maps, which are used to supply the dangerous cases that a well-behaved demonstrator rarely runs into.

The artificial map samples are generated in the following way. First, some binary clusters of obstacles are randomly generated on an empty map. Second, the obstacle clusters are blurred by a Gaussian filter, which converts the binary map to a probabilistic occupancy grid. A local goal point is then drawn randomly on or around the map. To reduce the learning complexity, the generated maps are finally transformed to the egocentric robot coordinate system, which makes the robot pose identical in all samples. Examples of the artificial and real map samples can be found in Figure 3 and Figure 8 respectively.
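The following is a minimal sketch of this generation procedure, assuming NumPy and SciPy's gaussian_filter; the grid size, number and size of clusters, blur width and goal range are illustrative choices, not the parameters used to build the actual dataset.

import numpy as np
from scipy.ndimage import gaussian_filter

def generate_artificial_sample(size=28, n_clusters=4, rng=np.random):
    # Random binary obstacle clusters, Gaussian blurring into a probabilistic
    # occupancy grid, and a randomly drawn local goal on or beyond the map.
    m_o = np.zeros((size, size))
    for _ in range(rng.randint(1, n_clusters + 1)):
        r, c = rng.randint(0, size, 2)             # cluster centre cell
        h, w = rng.randint(2, 6, 2)                # cluster extent in cells
        m_o[max(r - h, 0):r + h, max(c - w, 0):c + w] = 1.0
    m_o = np.clip(gaussian_filter(m_o, sigma=1.0), 0.0, 1.0)   # blur to probabilities
    p_g = rng.uniform(-1.6, 1.6, size=2)           # goal possibly outside the map border
    return m_o, p_g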

IV. EXPERIMENT

In this section, we present experimental results for the proposed map-based imitation learning approach. We first introduce the experiment setup, followed by an analysis of the training metrics. Navigation experiments in simulation and real-world environments are finally evaluated.

A. Setup

Fig. 3: Inferred actions from the trained policy in three particular cases in the test dataset. The robot is located at the origin facing north in the occupancy map. The isolated yellow cell is the target goal, while the darkness of the other cells stands for the obstacle probability. The length of the line represents the linear velocity, while the direction represents the angular velocity. The prediction error is typical on the left (0.0024), large in the middle (0.1477), and very large on the right (0.5733).

The robotic platform used in our experiments is the Segway Loomo, a self-balancing robot equipped with an onboard Intel Atom Z8750 CPU and multiple sensors including an Intel RealSense ZR300, ultrasonic sensors, wheel encoders, etc. The robot pose and velocity are provided by an encoder-visual-inertial state estimator. An occupancy map is constructed from depth camera measurements, from which an egocentric local occupancy map is cropped to a fixed size of 2.8 m × 2.8 m and a resolution of 0.1 m at each cycle.
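How such an egocentric crop might be obtained from a global occupancy grid is sketched below; the placement of the robot at the centre of the local map, the global-map conventions and the use of SciPy's rotate are illustrative assumptions, not details taken from the paper.

import numpy as np
from scipy.ndimage import rotate

def crop_local_map(global_map, robot_rc, robot_yaw_deg, size_m=2.8, res=0.1):
    # Cut a generous window around the robot cell, rotate it into the robot
    # frame, then trim to the final 2.8 m x 2.8 m (28 x 28 cell) patch.
    n = int(round(size_m / res))          # 28 cells per side
    half = n                              # margin large enough to survive rotation
    r, c = robot_rc
    padded = np.pad(global_map, half, mode='constant', constant_values=0.0)
    window = padded[r:r + 2 * half, c:c + 2 * half]          # robot at window centre
    window = rotate(window, robot_yaw_deg, reshape=False, order=1)
    lo = half - n // 2
    return window[lo:lo + n, lo:lo + n]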

Fig. 4: Segway Loomo equipped with an Intel RealSense ZR300 (30 Hz RGB-Depth, FishEye, and IMU), an Intel Atom Z8750 (4 cores, 2.4 GHz) and 4 GB memory.

We collect over 600k samples for policy learning. Half of the samples come from real-world demonstrations and the other half from generated samples. The dataset is split into a training set (80%) and a test set (20%). The demonstrator is an optimization-based planner that evaluates 734 candidate trajectories. The neural network model is implemented with the TensorFlow framework [19], and trained from scratch with the Adam optimizer [20] on an Nvidia Titan X for approximately 8 hours.

B. Model Metrics

We first evaluate the learned policy on a frame-by-frame basis by examining the prediction error and comparing against the optimization-based approach.

1) Prediction accuracy: The prediction errors of the linear velocity and angular velocity are summarized in Table I. While the variance on the test dataset is comparatively larger, the mean errors on both the training and test datasets are small, which shows the potential of the developed model in imitating the obstacle avoidance behavior and generalizing to unseen scenarios.

       train-v [m/s]   train-w [rad/s]   test-v [m/s]   test-w [rad/s]
mean   0.0031          0.0115            0.0037         0.0151
std    0.0050          0.0132            0.0079         0.0308

TABLE I: Statistics of the prediction error.


Fig. 5: Box plot of the prediction error. The lower and upper bounds of the blue box are the first quartile and the third quartile respectively. The red lines within the blue boxes are the median values, and the red points marked by '+' outside the horizontal black line are outliers.

Figure 5 shows the distribution of the absolute prediction errors in a box plot. The first and the third quartile of the prediction errors are quite close, significantly better than the results of end-to-end training reported in [4]. Yet we can still observe some large prediction errors in the test set, the reasons for which are discussed below.

The detailed action commands predicted by the learned policy in three different cases in the test set are shown in Figure 3. In the left case, the trained policy produces an action command that tends to keep some distance away from the obstacle clusters on both sides, which almost overlaps with the decision provided by the demonstrator. In the middle one, the trained policy acts slightly differently from the label, which is likely due to the ambiguity caused by the local goal point near the obstacle clusters. When the local goal is behind the robot, as shown in the right case, it turns out to be hard for the learner to replicate the demonstrated behavior in exactly the same way. However, the decision of the trained policy is still considered reasonable, as it turns to the correct direction with a smoother change of the angular velocity.



2) Comparison against optimization-based algorithms: One motivation for the proposed imitation learning method is to overcome the dilemma between computation time and solution quality in which classical optimization-based algorithms are often trapped. Therefore, we compare the performance of the trained policy against optimization-based algorithms with different numbers of candidate trajectories. Taking the demonstrated behavior as a baseline, the optimality gap of each policy is defined as $\varepsilon = \frac{1}{n} \sum_{i=1}^{n} \lVert u_i - \hat{u}_i \rVert^2$, where $n$ is the total number of test samples and $i$ is the sample index.
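A direct NumPy sketch of this metric, assuming the policy and demonstrator commands for the n test samples are stacked into (n, 2) arrays:

import numpy as np

def optimality_gap(u_policy, u_demo):
    # Mean squared L2 distance between the evaluated policy's commands and
    # the demonstrated commands over the n test samples.
    return np.mean(np.sum((u_policy - u_demo) ** 2, axis=1))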

Figure 6a shows the computation time of the considered algorithms running on the Segway robot platform equipped with a 4-core CPU. The feed-forward inference of the proposed neural network policy takes approximately 44 ms on a Segway robot, which is sufficient to support normal real-time operations. In terms of the optimality gap, the trained policy also achieves a highly competitive result. As shown in Figure 6b, the optimality gap of the trained policy is approximately equivalent to that of an optimization-based algorithm with over 250 candidate sample trajectories, which requires about 80 ms to compute once on the robot. Moreover, the quality of the trained policy is closely tied to the quality of the demonstrated behaviors and the training dataset, while independent of the online computation time. If the demonstrator can provide an even better training dataset, for example by fully solving problem (6) to optimality or adding more training samples, the learned policy is expected to improve further.

C. Navigation in simulation environments

On the basis of an accurate frame-by-frame action prediction, we subsequently examine the performance of the learned policy in navigation tasks simulated in Matlab. In order to assess the capability of obstacle avoidance, a global reference path for the robot navigation is defined close to or across obstacles. At every sampling time, the trained policy sends a velocity command to the low-level controller, which drives the robot to a new state according to the robot dynamics model.
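A closed-loop rollout of this kind can be sketched as follows, assuming unicycle dynamics and caller-supplied helpers for the local map and local goal; none of the interfaces below are taken from the paper's implementation.

import numpy as np

def rollout(policy, local_map_at, local_goal_at, x0, steps=500, dt=0.1):
    # At each sampling time: build the egocentric map and local goal, query the
    # trained policy for (v, omega), then integrate simple unicycle dynamics.
    x, y, th = x0
    u = np.zeros(2)                                   # last executed command
    trajectory = [(x, y)]
    for _ in range(steps):
        m_o = local_map_at(x, y, th)                  # egocentric occupancy map
        p_g = local_goal_at(x, y, th)                 # waypoint on the reference path
        u = policy(m_o, p_g, u)                       # feed-forward network inference
        x += u[0] * np.cos(th) * dt
        y += u[0] * np.sin(th) * dt
        th += u[1] * dt
        trajectory.append((x, y))
    return np.array(trajectory)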

As shown in Figure 7, the navigation trajectory under the trained policy successfully avoids the obstacles and smoothly follows the pre-defined reference path in open spaces. Moreover, the distance between the path of the trained policy and that of the demonstrator is minute, showing that the intended behavior has been learned well.

D. Navigation in the real world

Finally, we deploy the trained model in real-world experiments on a Segway robot. Given a global reference path, the task for the Segway robot is to drive along the path and avoid the obstacles in a crowded office environment that is unseen in the training set.

1) Reactions to obstacles: As shown in Figure 8, when the robot confronts obstacles, the local egocentric occupancy map provides a clear planar representation of the local planning task. Based on the map image, the trained policy succeeds in providing a reactive action command that drives the robot away from obstacles.


Fig. 6: Comparison of the computation time and optimality gap between the learned policy and the optimization-based approach. (a) The computation time of the learned policy is approximately equivalent to that of an optimization-based algorithm with 125 sample trajectories. (b) The optimality error of the learned policy is approximately equivalent to that of an optimization-based algorithm with 270 sample trajectories.


Fig. 7: Navigation in a simulation environment. The black line is the global reference path, and the yellow and red lines correspond to the trajectories of the learned policy and the optimization-based demonstrator respectively.


2) Long-term operation: In the long-term experiments, the robot navigates repeatedly in the corridor along the global reference path. Figure 9 shows the navigation trajectories of two policies trained with different datasets: the first one is trained only with the observations gathered from normal demonstration experiments, while the second one is trained with a mixture of samples from both demonstrations and artificially generated maps. As shown in Figure 9, the robot controlled by the former hits obstacles in the crowded regions a couple of times, where human interventions have to be carried out for safety. By contrast, the policy trained with the mixed dataset successfully drives the robot among obstacles robustly and smoothly. A video of the real-world navigation experiments can be found in the supplementary material.




Fig. 8: Reactions of the trained policy to unexpected obstacles in real-world experiments. The figures on top display the local occupancy maps fed to the policy network. The figures on the bottom are the corresponding front views captured by the camera of the robot. The robot is located at the origin of the egocentric map facing north. The length of the lines stands for linear velocity, while the direction stands for angular velocity.



Fig. 9: Navigation experiments in a real-world office environment. The task of the robot is to navigate repeatedly in the corridor with the trained policy. The darkness of the occupancy map is proportional to the obstacle probability.

V. CONCLUSIONS AND DISCUSSION

In this work, we propose a deep imitation learning algorithm for obstacle avoidance based on preprocessed local occupancy maps. Embedded with a value iteration network and trained with a mixture of real and artificially generated samples, the developed policy achieves highly competitive solution optimality and robustness, and remains efficient and generic for real-time operation on the robot. A brief comparison between the proposed approach and other existing methods is summarized in Table II.

                    Map IL        E2E IL     Optimization
model / method      VIN           CNN        sampling
training data       exp. + gen.   exp.       ×
sensor type         generic       specific   generic
policy optimality   high          med        low - high
computation time    short         med        short - long

TABLE II: A brief comparison of the proposed map-based imitation learning algorithm, the existing end-to-end imitation learning and the optimization-based methods for obstacle avoidance.

A remaining limitation of the proposed imitation learning approach for obstacle avoidance is the lack of safety guarantees, which the classical optimization-based approach can better provide. In the future, we plan to further improve the robustness by investigating adversarial samples. We also consider extending the proposed model to a probabilistic version that can provide a degree of uncertainty for the inferred action command, which would be a valuable addition for safety-critical robot operations.

ACKNOWLEDGMENT

We would like to thank the other members of the algorithm team at Segway Robotics for their help and support in this work.

REFERENCES

[1] S. M. LaValle, Planning Algorithms. Cambridge University Press, 2006.

[2] D. Fox, W. Burgard, and S. Thrun, "The dynamic window approach to collision avoidance," IEEE Robotics & Automation Magazine, vol. 4, no. 1, pp. 23–33, Mar. 1997.

[3] A. Liniger, A. Domahidi, and M. Morari, "Optimization-based autonomous racing of 1:43 scale RC cars," Optimal Control Applications and Methods, vol. 36, no. 5, pp. 628–647, Sep. 2015.

[4] M. Pfeiffer et al., "From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots," arXiv:1609.07910 [cs], Sep. 2016.

[5] T. M. Howard, C. J. Green, A. Kelly, and D. Ferguson, "State space sampling of feasible motions for high-performance mobile robot navigation in complex environments," Journal of Field Robotics, vol. 25, no. 6-7, pp. 325–345, Jun. 2008.

[6] P. Abbeel and A. Y. Ng, "Apprenticeship learning via inverse reinforcement learning," in Proceedings of the Twenty-first International Conference on Machine Learning, ser. ICML '04. ACM, 2004.

[7] U. Muller et al., "Off-road obstacle avoidance through end-to-end learning," in Advances in Neural Information Processing Systems 18, 2006, pp. 739–746.

[8] A. Giusti et al., "A machine learning approach to visual perception of forest trails for mobile robots," IEEE Robotics and Automation Letters, vol. 1, no. 2, pp. 661–667, Jul. 2016.

[9] M. Bojarski et al., "End to end learning for self-driving cars," arXiv:1604.07316 [cs], Apr. 2016.

[10] A. Tamar et al., "Value iteration networks," in Advances in Neural Information Processing Systems 29, 2016, pp. 2154–2162.

[11] D. Silver et al., "The Predictron: End-to-end learning and planning," arXiv:1612.08810 [cs], Dec. 2016.

[12] S. Ross, G. Gordon, and D. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," in PMLR, Jun. 2011, pp. 627–635.

[13] J. Zhang and K. Cho, "Query-efficient imitation learning for end-to-end autonomous driving," arXiv:1605.06450 [cs], May 2016.

[14] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. MIT Press, 2005.

[15] M. Okada, L. Rigazio, and T. Aoshima, "Path integral networks: End-to-end differentiable optimal control," arXiv:1706.09597 [cs], Jun. 2017.

[16] S. Ross et al., "Learning monocular reactive UAV control in cluttered natural environments," in 2013 IEEE International Conference on Robotics and Automation, May 2013, pp. 1765–1772.

[17] G. Kahn, T. Zhang, S. Levine, and P. Abbeel, "PLATO: Policy learning using adaptive trajectory optimization," in 2017 IEEE International Conference on Robotics and Automation (ICRA), May 2017, pp. 3342–3349.

[18] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.

[19] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv:1603.04467 [cs], Mar. 2016.

[20] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv:1412.6980 [cs], Dec. 2014.
