
BaRC: Backward Reachability Curriculum for Robotic Reinforcement Learning

Boris Ivanovic¹, James Harrison², Apoorva Sharma¹, Mo Chen³, Marco Pavone¹

Abstract— Model-free Reinforcement Learning (RL) offers an attractive approach to learn control policies for high-dimensional systems, but its relatively poor sample complexity often necessitates training in simulated environments. Even in simulation, goal-directed tasks whose natural reward function is sparse remain intractable for state-of-the-art model-free algorithms for continuous control. The bottleneck in these tasks is the prohibitive amount of exploration required to obtain a learning signal from the initial state of the system. In this work, we leverage physical priors in the form of an approximate system dynamics model to design a curriculum for a model-free policy optimization algorithm. Our Backward Reachability Curriculum (BaRC) begins policy training from states that require a small number of actions to accomplish the task, and expands the initial state distribution backwards in a dynamically-consistent manner once the policy optimization algorithm demonstrates sufficient performance. BaRC is general, in that it can accelerate training of any model-free RL algorithm on a broad class of goal-directed continuous control MDPs. Its curriculum strategy is physically intuitive, easy-to-tune, and allows incorporating physical priors to accelerate training without hindering the performance, flexibility, and applicability of the model-free RL algorithm. We evaluate our approach on two representative dynamic robotic learning problems and find substantial performance improvement relative to previous curriculum generation techniques and naïve exploration strategies.

I. INTRODUCTION

Reinforcement learning (RL) is a powerful tool for training agents to maximize reward accumulation in sequential decision-making problems [1]–[4]. In particular, model-free approaches to robotic RL may allow for the completion of tasks that are difficult for traditional control theoretic tools to solve, such as learning policies mapping observations directly to actions [3]. In spite of previous successes, RL has not seen widespread adoption in real-world settings, especially in the context of robotics. One of the fundamental barriers to applying RL in robotic systems is the extremely high sample complexity associated with training policies.

One approach to overcoming sample complexity is to train in simulation and port to the true environment, possibly with a “sim-to-real” algorithm to mitigate the impacts of simulator-reality mismatch [4]–[6]. While this is a promising approach that reduces the number of trials that are required to run on a physical robot, it still requires possibly millions of trials on simulated systems [7]. A root cause of this extremely high sample complexity is the use of naïve exploration strategies such as ε-greedy [8], Ornstein-Uhlenbeck

¹ Department of Aeronautics and Astronautics, Stanford University, Stanford, CA, USA 94305. {borisi,apoorva,pavone}@stanford.edu

² Department of Mechanical Engineering, Stanford University, Stanford, CA, USA 94305. [email protected]

³ School of Computing Science, Simon Fraser University, Burnaby, BC, Canada V5A 1S6. [email protected]

noise [9], or, in the case of stochastic policy gradient algorithms such as Proximal Policy Optimization (PPO) [7], reliance on the stochasticity of the policy. These dithering exploration strategies are well known to take exponential time for problems with sparse reward [10]. Unfortunately, in robotic learning, smooth reward functions can lead to undesirable behavior, and sparse rewards corresponding directly to accomplishing a task are often necessary [11], [12]. Moreover, if another component of the reward function is a cost associated with control, naïve exploration strategies may actually guide the system away from the goal region in an effort to minimize control costs.

In the context of robotics, exploration efficiency has the potential to be substantially improved by leveraging physical priors [13]. Model-based approaches to robotic RL have demonstrated exceptional sample efficiency [14], but are often limited to generating local models and policies, and are not capable of solving as wide a variety of problems as model-free methods. Techniques that leverage models to guide exploration often require either substantial modification of the underlying model-free learning algorithm [13], [15], or control theoretic tools that potentially prevent application to many interesting problems [16], [17].

Contributions: We propose Backward Reachability Curriculum (BaRC), a method for improving exploration efficiency in model-free RL by seamlessly leveraging approximate physical models without altering the RL algorithm. Our approach works by altering the initial state distribution used when training an arbitrary model-free RL agent. We start training from states near the sparse goal region, and iteratively expand the initial state distribution by computing approximate backward reachable sets (BRSs): the set of all points in the state space capable of reaching a certain region in a fixed, short amount of time.

By explicitly moving the initial state distribution backwards in time from the goal according to approximate dynamics of the system, we initially train on simple problems, ensuring the learning agent receives a strong reward signal. As the agent demonstrates mastery, we increase problem difficulty until arriving at the original learning problem. Our approach requires only approximate BRSs, so a low-fidelity approximate model can be used to compute them. During forward learning, the original high-fidelity model is used, so our method is simply a wrapper that augments any model-free RL algorithm.

One key feature of our approach is that BRSs expand in all dynamically feasible directions of the state space, providing a “frontier” from which states are sampled. By using prior knowledge of the system through approximate dynamics and performing backward reachability, we drastically improve sample efficiency while still learning a model-free control


policy, thereby maintaining the potential to learn high quality policies with arbitrary inputs (such as images or other sensor inputs). Additionally, our approach is capable of handling highly dynamic systems in which the forward and backward time dynamics vastly differ, and is modular in the sense that any model-free RL method and any reachability method can be combined.

We demonstrate our approach on a car environment and a dynamic planar quadrotor environment, and observe substantial improvements in learning efficiency over both standard model-free RL and existing curriculum approaches. We further investigate a variety of scenarios in which there is mismatch between the model used for curriculum generation and the model used for forward learning, and find good performance even with model mismatch.

II. RELATED WORK

One growing body of work that aims to improve exploration efficiency in general RL problems has been on the topic of deep exploration [10]. Dithering strategies such as ε-greedy aim to explore the state space by taking suboptimal actions with some low probability. However, this does not account for the relative uncertainty of value estimates, and thus an agent in an MDP with sparse reward will continuously explore a region around the initial state, as opposed to deliberately searching the state space [18]. General purpose deep exploration strategies address this problem by, for example, maintaining approximate distributions over value functions and sampling at the start of each episode [10], or augmenting the MDP reward towards reducing the agent's uncertainty in environment dynamics [19]. In the specific context of robotic RL in simulation, the availability of physical models and the ability to adjust the parameters of the RL task while training should enable guiding exploration in a more direct manner; this is the high-level motivation for our work.

Our approach is based on ideas from curriculum learning, which aims to improve the rate of learning by first training on easier examples, and increasing the difficulty as training progresses until the original task becomes appropriate to train on directly [20]. In supervised learning, a variety of hand-designed curricula as well as automatic curriculum generation methods have been effective in certain scenarios [21], [22]. Several authors have also applied curriculum schemes to robotic systems [23], [24]; however, these approaches are based on learning pre-specified control primitives or identifying inverse dynamics models, as opposed to directly solving a given MDP. A common curriculum method in RL is reward “smoothing”, in which a sparse reward signal is replaced with a smooth function, thus providing reward signal everywhere in the state space. A common example of this is providing a cost that is quadratic in the distance to the goal state. This approach has been shown to yield highly sub-optimal policies in numerous robotic control tasks [11], [12]. For example, in a peg insertion task, [11] found that the robot would instead place the peg beside the hole, as opposed to successfully inserting it. We compare against reward smoothing in our numerical experiments.

The work of [12] offers an attractive, purely model-free approach to curriculum generation for robotic RL, which adjusts problem complexity by adjusting the initial state distribution of the RL task during training. The authors argue that initial states which yield a medium success rate lead to good learning performance, and thus select new start states by sampling random actions from these medium-success-rate states. This approach was applied successfully in a variety of systems, but can break down for highly dynamic or unstable systems. In such systems, the state reached by taking random actions from a “good start state” is likely to be far from the original state and thus unlikely to itself also be a “good start state.” Further, for systems where dynamics moving forward in time are very different from those moving backward in time, e.g. underactuated systems with drift, this random action sampling curriculum may never train from starts that belong to the true initial state distribution. Compared to the work in [12], our method can be viewed as a generalization that explicitly utilizes prior knowledge of a system. By expanding BRSs outward from the goal and generating a more uniform coverage of the state space, our approach effectively prioritizes exploration of the frontier of the BRS.

Backward reachability has been extensively used for verifying performance and safety of systems [25], [26]. There is a plethora of tools for computing BRSs for many different classes of system models [27]–[29]. Since the traditional focus of backward reachability has been on providing performance and safety guarantees, computations of BRSs are expensive. Recent system decomposition techniques are effective in alleviating the computational burden in a variety of problem setups [30]–[32]. As we will demonstrate, approximate BRSs computed using these decomposition techniques are sufficient for the purpose of guiding policy learning and do not significantly impact training time.

III. PROBLEM FORMULATION

The goal of this work is to seamlessly leverage physical priors to provide a method for generating curricula to accelerate model-free RL for sparse-reward robotic tasks in simulation. In the following subsections we formalize the notion of a sparse-reward robotic RL task and specify the physical priors we leverage.

A. Sparse Reward MDP

We assume the robotic RL task is defined as a discrete-time MDP of the form (S, A, F, r, ρ0, γ, Sg, Sf), in which S ⊂ R^n and A ⊂ R^m are the state and action sets respectively, ρ0 is the distribution of initial states, and F : S × A → S is the transition function, which may be stochastic.

The MDP is an infinite-horizon process with discount factor γ. There exist two sets of absorbing states, which we refer to as goal states, Sg ⊂ S, and failure states, Sf ⊂ S. This distinction is made for technical reasons, but is an intuitive one that is common in robotic motion planning [33]. For example, a goal state may be a robotic arm successfully placing an object in the correct location, whereas a failure state may be the robot dropping the object out of reach.


The function r : S × A → R is a reward function, composed of a strictly negative running cost¹ and the cost or reward associated with the goal and failure regions. The running cost implies that a robot can only accumulate negative reward during normal operation, and thus makes it desirable to reach the goal region as opposed to continuously accumulating reward. Finally, we make the technical assumption that, under the optimal policy, for all s ∈ S, trajectories terminating in the goal region have higher expected reward than those terminating in the failure region. This simply says that it is always desirable to try to reach the goal region and to avoid failure. The reward structure presented herein is a generalization of that presented in [12], and allows us to incorporate cost components that are common and desirable in robotics, such as control effort penalties.

We assume we are training in simulation, and thus we can set the initial state of each episode arbitrarily. Furthermore, we assume we have knowledge of the goal region.

B. Approximate Dynamics Model as a Physical Prior

We assume access to physical prior knowledge in the form of a simplified, approximate dynamics model of the system, henceforth referred to as the curriculum model M, with which approximate BRSs can be efficiently computed to guide the policy learning process. Depending on the method used to compute BRSs, an appropriate choice of curriculum model could take the form of an ODE, a difference equation, or an MDP. In this paper, we choose to use the Hamilton-Jacobi (HJ) reachability formulation, and hence assume that this curriculum model takes the form of a continuous-time ODE. In accordance with HJ reachability, let the curriculum state be denoted s ∈ S ⊂ R^n, with dynamics given by

\dot{s}(t) = f(s(t), a(t)), \quad a(t) \in A, \qquad (1)

where the simulator state dimension is not necessarily equal to the curriculum state dimension n. Specifically, we define an injective map φ(·) that maps the simulator model state to the curriculum model state in R^n. In general, φ(·) is nonlinear and often a projection. The curriculum dynamics should be chosen to be a minimally sufficient representation of the dynamics, and are only needed for a coarse approximation of the BRSs.

The curriculum dynamics f : S × A → S are assumed to be uniformly continuous, bounded, and Lipschitz continuous in s for fixed a, so that given a measurable control function a(·), there exists a unique trajectory solving (1) [34].
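To make the role of the curriculum model and the map φ(·) concrete, the following Python sketch shows one way they might be represented in code for the five-state car used later in the paper. This is our illustration under the stated assumptions (disturbance-free dynamics, identity φ), not the released implementation, and all names are ours.

import numpy as np

def phi(sim_state):
    # Map a simulator state to a curriculum state. For the car example the
    # curriculum model keeps the same five states (x, y, theta, v, kappa),
    # so phi is the identity; for a richer simulator it would typically
    # drop or project out states the curriculum model does not represent.
    return np.asarray(sim_state, dtype=float)

def curriculum_dynamics(s, a):
    # Disturbance-free car dynamics s_dot = f(s, a), cf. Eq. (7) in the appendix.
    x, y, theta, v, kappa = s
    a_v, a_kappa = a
    return np.array([
        v * np.cos(theta),  # x_dot
        v * np.sin(theta),  # y_dot
        kappa,              # theta_dot
        a_v,                # v_dot
        a_kappa,            # kappa_dot
    ])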

IV. APPROACH

Our full approach is summarized in Algorithm 1². The intuition behind our method stems from pedagogy, where learners are taught foundational topics that are later layered with advanced study. BRSs offer a convenient choice as foundational topics in the context of continuous control tasks: in order to reach a state s in time T, a trajectory must pass through the BRS with time horizon T of state s, which we

¹ We use the term running cost to denote a cost associated with a non-terminal state-action pair.

² The code used in this paper is available at https://github.com/StanfordASL/BaRC

denote as R(T; s). Thus, our algorithm begins by requiring the learner (an RL policy) to learn how to transition from states in the goal's BRS to the goal. Once the learner can successfully reach the goal from these states, the curriculum computes their BRS and uses this expanded set as new initial states from which to train, thus incrementally increasing the difficulty of the task. This process of expanding the initial-state distribution in a dynamically-informed manner continues iteratively until the BRS spans the state space or a given start state is reached. The rest of this section describes our algorithm in detail.

Every stage of the curriculum is defined by determining the set of initial states to train from: we call EXPANDBACKWARDS to compute starts_set, the union of R(T; s) for every s in starts, which initially contains just a state in the goal region. The inner loop of the algorithm represents training a policy on this stage of the curriculum. Rather than only training on states from this expanded set, we sample N_new states from the start set and mix in N_old states sampled from old_starts, a list of states from which the policy has previously demonstrated mastery of the task. In this way, we avoid catastrophic forgetting during policy training. We train the policy using a model-free policy optimization algorithm for N_TP iterations via the TRAINPOLICY subroutine. The subroutine also returns the success rate of the policy from the different initial states sampled during training. The starts list is updated to contain initial states from which the policy can reliably reach the goal. This is done via the SELECT subroutine, which returns the initial states in success_map with success rate greater than C_select. These states are also added to the old_starts buffer.

This inner loop repeats until the call to EVALUATE determines that the fraction of states in this stage of the curriculum from which π can reach the goal exceeds C_pass. At this point, π is considered to have adequately mastered this curriculum stage, and we move to the next curriculum stage.

Hyperparameters: The algorithm has six hyperparameters. C_select and C_pass define a notion of mastery of a particular state and of a curriculum stage, respectively. The horizon T used in the BRS computation controls how much the task difficulty increases between stages of the curriculum. The ratio of N_old to N_new balances training on new scenarios against preventing forgetting. The number of training iterations between evaluations, N_TP, should ideally be set to a value such that the policy has a high chance of mastering a stage of the curriculum after N_TP iterations of the policy optimization algorithm, in order to avoid unnecessary evaluation checks. Empirically, the algorithm is robust to the settings of these hyperparameters.

Computing BRSs: To actually obtain BRSs, we use methods from reachability analysis. Specifically, we use the HJ formulation of reachability since our approach focuses on capturing key nonlinear behaviors of systems (through the curriculum dynamics model). The BRS of a target set F ⊂ R^n represents the set of states of the curriculum model s ∈ R^n from which the curriculum system can be driven into F at the end of a short time horizon of duration T. In our work, the BRS of a target set F ⊂ R^n is denoted R(T; F).


Algorithm 1 The full BaRC algorithm.
Require: Hyperparameters N_new, N_old, T, C_pass, N_TP, C_select
function BARC(s_g, ρ_0, M)
    π ← π_0
    starts ← {s_g}
    old_starts ← {s_g}
    for i = 1, 2, ... do
        starts_set ← EXPANDBACKWARDS(starts, M, T)
        frac_successful ← 0.0
        while frac_successful < C_pass do
            ρ_i ← Unif(starts_set, N_new) ∪ Unif(old_starts, N_old)
            π, success_map ← TRAINPOLICY(ρ_i, π, N_TP)
            starts ← SELECT(success_map, C_select)
            old_starts ← old_starts ∪ starts
            frac_successful ← EVALUATE(ρ_i, π)
        end while
        iter_result ← PERFORMANCEMETRIC(ρ_0, π)
    end for
    return π
end function
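For readers who prefer code to pseudocode, the outer loop of Algorithm 1 can be sketched in Python as below. This is a simplified illustration, not the released implementation: the subroutines are passed in as callables, the frontier is represented as a finite list of candidate states (in BaRC it is sampled from the BRS by rejection sampling), and the loop is truncated after a fixed number of stages.

import random

def barc(s_g, rho_0, M, pi_0, expand_backwards, train_policy, select,
         evaluate, performance_metric, N_new=200, N_old=100, T=0.1,
         C_pass=0.5, C_select=0.5, N_TP=20, num_stages=50):
    # Sketch of the BaRC outer loop (Algorithm 1).
    pi = pi_0
    starts = [s_g]
    old_starts = [s_g]
    for _ in range(num_stages):
        # Expand the frontier: union of BRSs of the current starts under model M.
        starts_set = list(expand_backwards(starts, M, T))
        frac_successful = 0.0
        while frac_successful < C_pass:
            # Mix new frontier states with previously mastered starts to
            # avoid catastrophic forgetting.
            rho_i = (random.choices(starts_set, k=N_new)
                     + random.choices(old_starts, k=N_old))
            pi, success_map = train_policy(rho_i, pi, N_TP)
            # Keep only starts the policy solves with rate above C_select.
            starts = select(success_map, C_select)
            old_starts.extend(starts)
            frac_successful = evaluate(rho_i, pi)
        performance_metric(rho_0, pi)
    return pi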

The BRS R(T; F) is formally defined as

\{s_0 : \exists a(\cdot),\ s(\cdot) \text{ satisfies } (1),\ s(-T) = s_0,\ s(0) \in F\},

which is obtained as the zero sublevel set of a value function V(−T, s) that is the solution to an HJ partial differential equation (PDE) [25], [26]: R(T; F) = {s : V(−T, s) ≤ 0}. More details are given in the appendix. When the target set consists of a single state s, we denote the BRS by R(T; s).

Solving the PDE is computationally expensive in general. However, since here we use BRSs only to guide policy search rather than for their traditional application of verifying system performance and safety, we can make approximations. Specifically, we utilize system decomposition methods [30]–[32] with the simplified curriculum model in (1) to obtain approximate BRSs without significantly impacting the overall policy training time. The techniques in [30]–[32] provide outer approximations, although an outer approximation is not necessary for guiding policy search; any approximation that captures key nonlinear system behavior suffices. Each system we experiment on is first decomposed into overlapping subsets of one or two components, which can each be solved efficiently and run online in our algorithm. Specifically, we use the open source helperOC³ and Level Set Methods⁴ toolboxes in MATLAB.

Sampling from a BRS: We perform rejection sampling over the components of the decomposed BRS approximation; each component has low dimensionality (typically 1-D or 2-D), so membership is inexpensive to evaluate. We first determine a tight bounding box around each component's BRS (this can be computed efficiently from widely available low-level contour plotting methods, e.g. contourc in MATLAB). Then, we uniformly sample points in the bounding box and reject any that fail membership checks. The initial bounding box calculation enables us to efficiently perform rejection sampling in cases where the size of the BRS is much smaller than the state space.

³ Found at https://github.com/HJReachability/helperOC
⁴ Found at http://www.cs.ubc.ca/~mitchell/ToolboxLS/

Fig. 1. Performance of the car model under BaRC, the approach of [12] (referred to as Random Curriculum), and standard PPO [7]. Upper: Percentage of start states from which the learned policy can reach a specified goal state. This curve is equivalent to expected reward over a wide initial state distribution. Mean and 95% confidence intervals were computed from the results of 5 different runs. Lower: Average reward during policy training. Note that it is not desirable for expected reward to be approximately 1, as this implies that the curriculum is not providing sufficiently challenging problems. Instead, it is desirable to ensure the reward is between 0 and 1 to provide a learning signal for the agent during training.

Visualizations of our rejection sampling method and of BRSs can be found in the appendix.
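As a sketch of this sampling step (our illustration, with assumed names: brs_contains stands in for a membership test against one low-dimensional BRS component, and bounding_box for the box extracted from the zero-level contour):

import numpy as np

def sample_from_brs(brs_contains, bounding_box, n_samples, rng=None, max_tries=100000):
    # Uniform rejection sampling inside one low-dimensional BRS component.
    # brs_contains: callable mapping a point to True if it lies inside the
    #               component's BRS (e.g., interpolated value function <= 0).
    # bounding_box: (lower, upper) arrays that tightly bound the zero-level contour.
    rng = np.random.default_rng() if rng is None else rng
    lower, upper = (np.asarray(b, dtype=float) for b in bounding_box)
    samples = []
    for _ in range(max_tries):
        candidate = rng.uniform(lower, upper)  # uniform sample in the bounding box
        if brs_contains(candidate):            # reject points outside the BRS
            samples.append(candidate)
            if len(samples) == n_samples:
                break
    return np.array(samples)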

While a more formal theoretical analysis is beyond the scope of this paper, the theoretical benefits of sampling based on BRSs can be gleaned from the definition of the BRS, which involves an existential quantifier over the control function a(·). This implies that any state that can possibly reach the goal under some policy is included in the BRS. This is in contrast to methods that sample the state space by applying random controls.

V. EXPERIMENTAL RESULTS

We perform numerical experiments on two dynamical systems, and evaluate a variety of model mismatch scenarios to investigate the robustness of BaRC to inaccurate curriculum models. All experiments were performed using the PPO algorithm as the model-free policy optimization method. The hyperparameters for both experiments are listed in the appendix.

A. Car Model

We use a five-dimensional car that is a standard test environment in motion planning [35] and RL [5].


Fig. 2. Performance of BaRC, random curriculum, and standard PPO on the planar quadrotor task. Left: Mean reward over ten experiments for the three approaches (95% confidence intervals). Note that between 4-8 iterations, the different instantiations of BaRC learn to reach the goal. The PPO policy simply learns how to hover, whereas the random curriculum never learns any non-colliding behavior. Center: The average reward during training of the policy. The iterative increasing and decreasing of average reward for the BaRC agent shows the policy learning for a given BRS, followed by BaRC expanding the BRS. This alternating learning and expansion is repeated until the BaRC agent reaches the initial state and the algorithm achieves maximal reward in every iteration. Right: The mean reward over ten experiments, plotted versus wall clock time.

The car has state s = [x, y, θ, v, κ]^T, where x and y denote position in the plane, θ denotes heading angle, v denotes speed, and κ denotes trajectory curvature. The system takes as input a = [a_v, a_κ]^T, where a_v is longitudinal acceleration and a_κ is the derivative of the curvature. The reward function is an indicator variable for a small region around the goal state. The goal state is a point at the origin with velocity sampled uniformly from 0.1 to 1.0 m/s and heading sampled uniformly from −π to π radians. We consider states that are within 0.1 m in distance, 0.1 m/s in velocity, and 0.1 radians of the goal state to have reached the goal. The agent receives a reward of 1.0 upon reaching the goal, and 0.0 otherwise. The simulator model was used as the curriculum model for the experiments presented in the body of the paper. The appendix contains a collection of results for cases where there is a mismatch between the simulator model and the curriculum model; these systems performed similarly to the cases presented here.
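For concreteness, the sparse car reward described above can be written as a small indicator function; the sketch below follows our reading of the stated tolerances and is not the released environment code.

import numpy as np

def car_reward(state, goal):
    # state, goal: arrays [x, y, theta, v, kappa].
    # Returns 1.0 if the state is within 0.1 m in position, 0.1 m/s in speed,
    # and 0.1 rad in heading of the goal state, and 0.0 otherwise.
    pos_err = np.hypot(state[0] - goal[0], state[1] - goal[1])
    vel_err = abs(state[3] - goal[3])
    # Wrap the heading error into (-pi, pi] before comparing.
    heading_err = abs((state[2] - goal[2] + np.pi) % (2 * np.pi) - np.pi)
    reached = pos_err <= 0.1 and vel_err <= 0.1 and heading_err <= 0.1
    return 1.0 if reached else 0.0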

To test coverage of the state space via the BaRC algorithm, we initialized a set of 100 points “behind” the goal state. Here, “behind” means that any sampled point's projection onto the goal state's velocity vector is negative. Further, the maximum angle allowed between any sampled state and the goal state's negative velocity vector is π/4. Finally, we orient the sampled states randomly within π/4 radians of the goal vector's orientation. Each of the sampled states has zero initial velocity and curvature. We then ran BaRC, evaluating at each iteration the number of original sampled states from which the learned policy could traverse to the goal. We chose this set as it is representative of states from which a policy could realistically traverse to a goal with nonzero velocity.

Figure 1 shows the performance of BaRC, as well as the curriculum method presented in [12] (which we refer to as Random Curriculum) and standard PPO [7]. The top plot shows the number of start states from which the algorithm successfully reaches the goal state. When the success rate exceeded 95%, the task was considered solved and training was terminated. BaRC achieves rapid coverage of the state space and a success rate of greater than 95% for all five experiments in under thirty iterations of the outer loop. In contrast, PPO obtains almost no reward. The random curriculum approach successfully connects for between twenty and forty percent of start states and shows little improvement beyond the first 40 iterations. On the bottom, the average reward during training is plotted. These reward returns are for training from curriculum-selected start states as opposed to the true initial state distribution of the problem. As a result, counter-intuitively, it is desirable to have neither an average reward close to zero (implying an ineffective exploration scheme), nor an average reward close to one (implying the curriculum is expanding too slowly). The average reward for BaRC varied between approximately 0.4 and 0.9, demonstrating that the algorithm creates a well-paced curriculum that ensures good, continuous learning progress.

B. Planar Quadrotor Model

To evaluate the performance of BaRC on a highly dynamic, unstable system, we tested the algorithm on a planar simulation of a quadrotor flying in clutter. This planar quadrotor is a standard test problem in the control literature [36], [37]. The system has state s = [x, v_x, y, v_y, φ, ω], where x, y, φ denote the planar coordinates and roll, and v_x, v_y, ω denote their time derivatives. The actions available to the agent are the two rotor thrusts, T_1 and T_2. The learning agent receives an observation that is augmented with 8 laser rangefinder sensors (see Figure 3 for a visualization). These sensors are placed every 45 degrees and provide the distance from the vehicle to the nearest obstacle. This observation allows the learning agent to map directly from sensor inputs to actions.

In this problem, the agent receives a reward of 1000 for reaching the goal region (defined as S_g = {s : x ≥ 4, y ≥ 4}) and accumulates a control cost of 0.01(T_1² + T_2²) per timestep. The episode runs for at most 200 timesteps, and the maximum thrust is set to be twice the weight of the quadrotor. If the agent collides with an obstacle or the walls, it receives a cost equal to the maximum control cost over the entire episode. As such, it is always desirable for the agent to avoid collisions. Because of the extremely sparse reward associated with reaching the goal, the unstable dynamics, and the absorbing collision states, this problem is extremely difficult to solve for standard exploration schemes.
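Collecting the terms above, one step of the quadrotor reward can be sketched as follows; the collision penalty equals the largest control cost that could be accumulated over a full episode, matching the description, and all names are illustrative rather than taken from the released code.

def quad_reward(thrusts, in_goal, collided, max_thrust,
                horizon=200, control_coeff=0.01, goal_bonus=1000.0):
    # Per-step reward for the planar quadrotor task (sketch).
    T1, T2 = thrusts
    reward = -control_coeff * (T1 ** 2 + T2 ** 2)  # control cost every timestep
    if in_goal:
        reward += goal_bonus                       # sparse bonus for reaching S_g
    if collided:
        # Penalty equal to the maximum possible control cost over an episode.
        reward -= horizon * control_coeff * (2.0 * max_thrust ** 2)
    return reward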

Figure 2 shows the performance of BaRC as well as the random curriculum and standard PPO. PPO, as expected,


Fig. 3. Left: Visualization of a trajectory from a BaRC-trained policy. The policy was trained for 10 BaRC iterations. The red lines denote laser rangefinder measurements, which were included in the observation. They are excluded from the other two plots for clarity. Magenta indicates the goal region and grey indicates obstacles. Center Left: A trajectory from a policy trained using the random curriculum approach from [12]. It was trained for 10 curriculum iterations. Center Right: A trajectory from a policy trained using standard PPO [7]. It was trained for 200 PPO iterations (equivalent to 10 outer loops of 20 PPO iterations). Right: A trajectory from a policy trained using standard PPO with a quadratized reward function.

never reaches the goal and thus cannot learn to deliberately move toward the goal region. Instead, it learns a locally optimal policy which hovers in place to avoid collision. We also compare to standard PPO with a smoothed reward function. In this version of the problem, the reward function is augmented with a quadratic cost term penalizing distance from the goal. As a result, the learned policy is a highly sub-optimal local minimum, in which the agent hovers slightly closer to the obstacle above it, and to the right. It does not, however, learn a policy that allows it to reach the goal.

The random curriculum, because it takes random actions forward and backward, rapidly becomes unstable and achieves only infrequent rewards. It does not advance far enough in the state space to produce any deliberate goal-seeking behavior from the true start state of the problem. In contrast, the BaRC agent learns to move to the goal every time within 4-8 iterations of the curriculum. As can be seen in the average reward plot (Figure 2, center), there are discrete waves of policy improvement (average reward increase) followed by iterative expansion of the BRS (visible as the associated sharp reward decrease). This continues until the curriculum reaches the initial state. It then rapidly trains from the initial state, quickly improving to be able to consistently solve the problem. Sample trajectories are visualized in Figure 3. Because we do not enforce a hover condition in the goal region, the BaRC policy learns extremely aggressive trajectories toward the goal.

Since PPO exploration steps are interleaved with expansion of the BRS, the average time per iteration is longer than for standard PPO or the random curriculum. The performance versus wall clock time for both the car system and the planar quadrotor is presented in Figure 2 (right). Each iteration of BaRC takes roughly 2-3 times as long as standard PPO iterations. Even considering performance versus clock time, BaRC demonstrates dramatic performance improvement versus previous approaches. Moreover, once the BRS expands to include the initial state, expansion of the BRS may be halted and training can continue without any curriculum computation overhead.

VI. DISCUSSION AND CONCLUSIONS

In this paper, we proposed Backward Reachability Curriculum (BaRC), which addresses the challenge of sample complexity in model-free reinforcement learning (RL) by using backward reachable sets (BRSs) to provide curricula for the policy learning process. In our approach, we initially train the policy starting close to the goal region, before iteratively computing BRSs using a simplified dynamics model to train the policy from more difficult initial states. BaRC is an effective way to leverage physical priors to dramatically increase the rate of training of model-free algorithms without affecting their flexibility in terms of applicability to a large variety of RL problems, and it enables the solution of sparse-reward robotic problems that were previously intractable with standard RL exploration schemes. The hyperparameters of BaRC, such as the time horizon of BRS expansions and the mastery threshold, are intuitive to a system designer and may be easily adjusted. In our numerical examples, since we used continuous-time BRS methods, the time horizon of BRS expansion may be chosen to be any positive number, as opposed to being limited to the time discretization of the simulator. Discrete-time BRS methods can also be used to similar effect, since the time discretization of the curriculum model need not be the same as that of the simulator model.

While we have seen strong exploration efficiency gains from directly computing BRSs, it is unclear if our approach could be relaxed to a sampling-based scheme. Our approach mimics dynamic programming, first solving the control subproblem close to the goal region, and iteratively increasing the problem complexity while relying on the solutions to the previously solved subproblems. Sampling-based analogues to dynamic programming exist in, for example, motion planning [38], [39]. These sampling-based algorithms, using the curriculum dynamics, may be a promising way to increase the efficiency of approximating BRSs.

There are several additional avenues of future work. First, incorporating both forward and backward reachability to speed up policy training when a particular initial state distribution is of interest has the potential to improve training efficiency. Investigating potential performance benefits of


tuning the BRS time horizon, and quantifying the sample complexity benefits of using BRSs versus simulating trajectories, potentially with respect to the quality of the curriculum model, are important characterizations of BaRC. Finally, performing efficient system identification [40]–[42] during portions of the training process that occur on the true system, to obtain a curriculum model in situations when an approximate model is not known a priori, may be investigated to increase the set of problems to which BaRC may be applied.

ACKNOWLEDGMENT

The authors were partially supported by the Office of Naval Research, ONR YIP Program, under Contract N00014-17-1-2433 and the Toyota Research Institute (“TRI”). James Harrison was supported in part by the Stanford Graduate Fellowship and the Natural Sciences and Engineering Research Council (NSERC). This article solely reflects the opinions and conclusions of its authors and not ONR, TRI or any other Toyota entity.

REFERENCES

[1] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, 2016.
[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," Neural Information Processing Systems (NIPS), 2013.
[3] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," Journal of Machine Learning Research, 2016.
[4] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, "Sim-to-real: Learning agile locomotion for quadruped robots," Robotics: Science and Systems (RSS), 2018.
[5] J. Harrison, A. Garg, B. Ivanovic, Y. Zhu, S. Savarese, L. Fei-Fei, and M. Pavone, "AdaPT: Zero-shot adaptive policy transfer for stochastic dynamical systems," International Symposium on Robotics Research (ISRR), 2017.
[6] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," IEEE International Conference on Intelligent Robots and Systems (IROS), 2017.
[7] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv:1707.06347, 2017.
[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, 2015.
[9] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," International Conference on Learning Representations (ICLR), 2016.
[10] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy, "Deep exploration via bootstrapped DQN," Neural Information Processing Systems (NIPS), 2016.
[11] G. Thomas, M. Chien, A. Tamar, J. A. Ojea, and P. Abbeel, "Learning robotic assembly from CAD," IEEE International Conference on Robotics and Automation (ICRA), 2018.
[12] C. Florensa, D. Held, M. Wulfmeier, M. Zhang, and P. Abbeel, "Reverse curriculum generation for reinforcement learning," Conference on Robot Learning (CoRL), 2017.
[13] S. Bansal, R. Calandra, S. Levine, and C. Tomlin, "MBMF: Model-based priors for model-free reinforcement learning," Conference on Robot Learning (CoRL), 2017.
[14] M. Deisenroth and C. E. Rasmussen, "PILCO: A model-based and data-efficient approach to policy search," International Conference on Machine Learning (ICML), 2011.
[15] Y. Chebotar, K. Hausman, M. Zhang, G. Sukhatme, S. Schaal, and S. Levine, "Combining model-based and model-free updates for trajectory-centric reinforcement learning," International Conference on Machine Learning (ICML), 2017.
[16] S. Levine and P. Abbeel, "Learning neural network policies with guided policy search under unknown dynamics," Neural Information Processing Systems (NIPS), 2014.
[17] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, "Continuous deep Q-learning with model-based acceleration," International Conference on Machine Learning (ICML), 2016.
[18] I. Osband, D. Russo, Z. Wen, and B. Van Roy, "Deep exploration via randomized value functions," arXiv:1703.07608, 2017.
[19] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel, "VIME: Variational information maximizing exploration," Neural Information Processing Systems (NIPS), 2016.
[20] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," International Conference on Machine Learning (ICML), 2009.
[21] A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu, "Automated curriculum learning for neural networks," International Conference on Machine Learning (ICML), 2017.
[22] W. Zaremba and I. Sutskever, "Learning to execute," arXiv:1410.4615, 2014.
[23] A. Karpathy and M. Van De Panne, "Curriculum learning for motor skills," Canadian Conference on Artificial Intelligence, 2012.
[24] A. Baranes and P.-Y. Oudeyer, "Active learning of inverse models with intrinsically motivated goal exploration in robots," Robotics and Autonomous Systems, 2013.
[25] S. Bansal, M. Chen, S. Herbert, and C. J. Tomlin, "Hamilton-Jacobi reachability: A brief overview and recent advances," IEEE Conference on Decision and Control (CDC), 2017.
[26] M. Chen and C. J. Tomlin, "Hamilton-Jacobi reachability: Some recent theoretical advances and applications in unmanned airspace management," Annual Review of Control, Robotics, and Autonomous Systems, 2018.
[27] I. M. Mitchell, "The flexible, extensible and efficient toolbox of level set methods," J. Scientific Computing, 2008.
[28] M. Althoff, "An introduction to CORA 2015," Proc. Workshop on Applied Verification for Continuous and Hybrid Systems, 2015.
[29] C. Fan, B. Qi, S. Mitra, M. Viswanathan, and P. S. Duggirala, "Automatic reachability analysis for nonlinear hybrid models with C2E2," Computer Aided Verification, 2016.
[30] I. M. Mitchell and C. J. Tomlin, "Overapproximating reachable sets by Hamilton-Jacobi projections," J. Scientific Computing, 2003.
[31] M. Chen, S. Herbert, and C. J. Tomlin, "Fast reachable set approximations via state decoupling disturbances," IEEE Conf. Decision and Control, 2016.
[32] M. Chen, S. L. Herbert, M. Vashishtha, S. Bansal, and C. J. Tomlin, "Decomposition of reachable sets and tubes for a class of nonlinear systems," IEEE Transactions on Automatic Control, 2018.
[33] S. M. LaValle, Planning Algorithms. Cambridge University Press, 2006.
[34] E. A. Coddington and N. Levinson, Theory of Ordinary Differential Equations. McGraw-Hill Inc., 1955.
[35] D. J. Webb and J. van den Berg, "Kinodynamic RRT*: Asymptotically optimal motion planning for robots with linear dynamics," IEEE International Conference on Robotics and Automation (ICRA), 2013.
[36] J. H. Gillula, H. Huang, M. P. Vitus, and C. J. Tomlin, "Design of guaranteed safe maneuvers using reachable sets: Autonomous quadrotor aerobatics in theory and practice," IEEE International Conference on Robotics and Automation (ICRA), 2010.
[37] S. Singh, A. Majumdar, J.-J. Slotine, and M. Pavone, "Robust online motion planning via contraction theory and convex optimization," IEEE International Conference on Robotics and Automation (ICRA), 2017.
[38] L. Janson, E. Schmerling, A. Clark, and M. Pavone, "Fast marching tree: A fast marching sampling-based method for optimal motion planning in many dimensions," International Journal of Robotics Research, 2015.
[39] B. Ichter, E. Schmerling, and M. Pavone, "Group marching tree: Sampling-based approximately optimal motion planning on GPUs," IEEE International Conference on Robotic Computing (IRC), 2017.
[40] L. Ljung, System Identification. Springer, 1998.
[41] S. Levine and V. Koltun, "Guided policy search," International Conference on Machine Learning (ICML), 2013.
[42] J. Harrison, A. Sharma, and M. Pavone, "Meta-learning priors for efficient online Bayesian regression," Workshop on the Algorithmic Foundations of Robotics (WAFR), 2018.
[43] M. G. Crandall and P.-L. Lions, "Viscosity solutions of Hamilton-Jacobi equations," Trans. American Mathematical Society, 1983.


APPENDIX

A. Computing backward reachable sets (BRSs)

Intuitively, a BRS represents the set of states of the curriculum model s ∈ R^n from which the curriculum system can be driven into a target set F ⊂ R^n at the end of a time horizon of duration T.

In this paper, we focus on the “maximal BRS”⁵; in this case the system seeks to enter F using some control function, and the maximal BRS (referred to simply as “BRS” in this paper) represents the set of states from which the system is guaranteed to reach F.

Definition 1: Backward Reachable Set (BRS).

R(T; F) = \{s_0 : \exists a(\cdot),\ s(\cdot) \text{ satisfies } (1),\ s(-T) = s_0,\ s(0) \in F\}.

HJ formulations outlined in the review papers [25], [26] cast the reachability problem as an optimal control problem and directly compute BRSs in the full state space of the system. The numerical methods, such as those implemented in [27], for obtaining the optimal solution all involve solving an HJ PDE on a grid that represents a discretization of the state space.

Let the target set F ⊂ R^n be represented by the implicit surface function V_0(s) as F = {s : V_0(s) ≤ 0}. Consider the optimization problem

V_R(t, s) = \inf_{a(\cdot)} V_0(s(0)) \quad \text{subject to (1)} \qquad (2)

The value function V_R(−T, s) is given by the implicit surface function representing the maximal BRS: R(T; F) = {s : V_R(−T, s) ≤ 0}, and is the viscosity solution [43] of the following HJ PDE, derived through dynamic programming:

\frac{\partial V(t, s)}{\partial t} + H(s, \nabla V(t, s)) = 0, \quad t \in [-T, 0], \qquad (3)

V(0, s) = V_0(s), \qquad (4)

where the Hamiltonian H(s, λ) is given by

H(s, \lambda) = \min_{a \in A} \lambda \cdot f(s, a). \qquad (5)
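As a simple worked instance (our illustration, not taken from the paper), consider the one-dimensional subsystem \dot{v} = a_v with |a_v| \le \hat{a}_v that appears in the car decomposition of Appendix C. The minimization in (5) then has the closed form

H(v, \lambda) = \min_{|a_v| \le \hat{a}_v} \lambda \, a_v = -|\lambda| \, \hat{a}_v,

attained by the bang-bang control a_v^* = -\hat{a}_v \, \mathrm{sign}(\lambda).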

B. System decomposition

Solving the HJ PDE (3) is computationally expensive; BRSs for systems with more than 5 state dimensions are computationally intractable using the HJ formulation. In order to take advantage of the HJ formulation's ability to compute BRSs for nonlinear systems without significantly impacting the overall policy training time, we utilize system decomposition methods to obtain approximate BRSs. In particular, in [32], the authors proposed decomposing systems into “self-contained subsystems”, and in [30] and [31], the authors proposed ways to compute projections of BRSs. We combine these methods to obtain over-approximations of BRSs for the curriculum model. Specific details of each curriculum model are described in the sections below.

⁵ The term “maximal” refers to the role of the optimal control, which is trying to maximize the size of the BRS.

It is important to note that in this paper BRSs are used to guide policy learning, and as we will demonstrate, approximate BRSs computed with the simplified curriculum model in (1) are sufficient for the purposes of this paper.

C. Car Model

For the car environment, we used the following five-state simulator model:

\dot{s} =
\begin{bmatrix} \dot{x} \\ \dot{y} \\ \dot{\theta} \\ \dot{v} \\ \dot{\kappa} \end{bmatrix}
=
\begin{bmatrix}
(v + d_v)\cos\theta \\
(v + d_v)\sin\theta \\
\kappa \\
a_v + d_{v\text{-ctrl}} \\
d_{a_\kappa}\,(a_\kappa + d_{\kappa\text{-ctrl}})
\end{bmatrix},
\quad |a_v| \le \bar{a}_v,\ |a_\kappa| \le \bar{a}_\kappa, \qquad (6)

where the position (x, y), heading θ, speed v, and angular speed κ are the car's states. The control variables are the linear acceleration a_v and the angular acceleration a_κ. The variables d_v, d_{v-ctrl}, d_{κ-ctrl}, and d_{a_κ} represent stochastic disturbances on the speed, the two control inputs, and the commanded curvature rate, respectively. We assume that the learning agent has full access to the car state.

For the curriculum model, we used the same state variables and dynamics, but without the disturbance variables:

\dot{s} =
\begin{bmatrix} \dot{x} \\ \dot{y} \\ \dot{\theta} \\ \dot{v} \\ \dot{\kappa} \end{bmatrix}
=
\begin{bmatrix} v\cos\theta \\ v\sin\theta \\ \kappa \\ a_v \\ a_\kappa \end{bmatrix},
\quad |a_v| \le \hat{a}_v,\ |a_\kappa| \le \hat{a}_\kappa. \qquad (7)

In order to ensure that the approximate BRS computation can be done quickly, we decomposed (7) as follows:

\begin{bmatrix} \dot{x} \\ \dot{\theta} \end{bmatrix} = \begin{bmatrix} v\cos\theta \\ \kappa \end{bmatrix}, \quad \underline{\kappa} \le \kappa \le \bar{\kappa},\ \underline{v} \le v \le \bar{v}, \qquad (8a)

\begin{bmatrix} \dot{y} \\ \dot{\theta} \end{bmatrix} = \begin{bmatrix} v\sin\theta \\ \kappa \end{bmatrix}, \quad \underline{\kappa} \le \kappa \le \bar{\kappa},\ \underline{v} \le v \le \bar{v}, \qquad (8b)

\dot{v} = a_v, \quad |a_v| \le \hat{a}_v, \qquad (8c)

\dot{\kappa} = a_\kappa, \quad |a_\kappa| \le \hat{a}_\kappa. \qquad (8d)

The decomposition occurs in two steps. First, the 5-D curriculum dynamics are decomposed into two 4-D subsystems with states (x, θ, v, κ) and (y, θ, v, κ). This results in an over-approximation of the BRS [32]. Next, v and κ are separated from (x, θ) and (y, θ), resulting in the subsystems in (8). In (8a) and (8b), v and κ become fictitious control variables that are dependent on the subsystem BRSs in (8c) and (8d), which results in an over-approximation of the BRS according to [30], [31].
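A natural way to use this decomposition at runtime (our sketch, assuming each subsystem BRS is exported as a value function sampled on a regular grid, e.g. from the MATLAB toolboxes above) is to accept a full state only if each subsystem's projection of that state has a non-positive interpolated value; by the results cited above, this check corresponds to an over-approximation of the full BRS.

import numpy as np
from scipy.interpolate import RegularGridInterpolator

def make_brs_checker(grid_axes, values):
    # Membership test for one subsystem BRS.
    # grid_axes: tuple of 1-D arrays defining the subsystem's state grid.
    # values:    value-function samples V(-T, .) on that grid (<= 0 inside the BRS).
    interp = RegularGridInterpolator(grid_axes, values,
                                     bounds_error=False, fill_value=np.inf)
    return lambda point: float(interp(np.atleast_2d(point))[0]) <= 0.0

def in_decomposed_brs(state, checkers, projections):
    # Accept a full state only if every subsystem projection lies in its BRS.
    return all(check(project(state)) for check, project in zip(checkers, projections))

For the car, the projections would map (x, y, θ, v, κ) to (x, θ), (y, θ), (v,), and (κ,) for (8a)-(8d), respectively.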

Hyperparameter Settings: In all experiments involving the car environment, we chose

N_new = 200, N_old = 100, T = 0.1 s, C_pass = C_select = 0.5, N_TP = 20.

These values were chosen with almost no tuning effort, as BaRC is robust to hyperparameter choice.


D. Planar Quadrotor Model

For the planar quadrotor environment, we used the same dynamics for the simulator and curriculum models, given by

\begin{bmatrix} \dot{x} \\ \dot{v}_x \\ \dot{y} \\ \dot{v}_y \\ \dot{\phi} \\ \dot{\omega} \end{bmatrix}
=
\begin{bmatrix}
v_x \\
-\frac{1}{m} C_D^v v_x - \frac{T_1}{m}\sin\phi - \frac{T_2}{m}\sin\phi \\
v_y \\
-\frac{1}{m}\left(mg + C_D^v v_y\right) + \frac{T_1}{m}\cos\phi + \frac{T_2}{m}\cos\phi \\
\omega \\
-\frac{1}{I_{yy}} C_D^\phi \omega - \frac{l}{I_{yy}} T_1 + \frac{l}{I_{yy}} T_2
\end{bmatrix} \qquad (9a)

with \underline{T} \le T_1, T_2 \le \bar{T}, and where the state is given by the position in the vertical plane (x, y), velocity (v_x, v_y), pitch φ, and pitch rate ω. The control variables are the thrusts T_1 and T_2. In addition, the quadrotor has mass m, moment of inertia I_{yy}, and half-length l. Furthermore, g denotes the acceleration due to gravity, C_D^v the translational drag coefficient, and C_D^φ the rotational drag coefficient.

The quadrotor received an observation containing its full state, as well as eight sensor measurements from laser rangefinders. These sensors were located every 45 degrees on the body of the quad and returned the distance to the nearest obstacle. The sensors were fixed to the body frame of the quadrotor. The task of incorporating these sensor outputs to avoid obstacles would require a substantial effort using traditional control, estimation, and mapping tools.
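As an illustration of this sensor model (our sketch, not the released environment), each rangefinder reading can be produced by marching along a body-frame ray until an occupancy test succeeds; the is_occupied callable and the step/range parameters are assumptions.

import numpy as np

def rangefinder_scan(position, phi, is_occupied, max_range=10.0, step=0.05):
    # Distances from the vehicle to the nearest obstacle along 8 body-frame rays.
    # position: (x, y) of the quadrotor; phi: body angle (rays rotate with the body).
    # is_occupied: callable (x, y) -> True if the point is inside an obstacle or wall.
    position = np.asarray(position, dtype=float)
    readings = []
    for k in range(8):                                # one sensor every 45 degrees
        angle = phi + k * np.pi / 4.0
        direction = np.array([np.cos(angle), np.sin(angle)])
        distance = max_range
        for r in np.arange(step, max_range, step):    # coarse ray marching
            point = position + r * direction
            if is_occupied(point[0], point[1]):
                distance = r
                break
        readings.append(distance)
    return np.array(readings)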

For efficient computation of approximate BRSs, we decomposed the quadrotor model as follows:

\begin{bmatrix} \dot{v}_x \\ \dot{\phi} \end{bmatrix} = \begin{bmatrix} -\frac{1}{m} C_D^v v_x - \frac{T_1}{m}\sin\phi - \frac{T_2}{m}\sin\phi \\ \omega \end{bmatrix}, \qquad (10a)

\begin{bmatrix} \dot{v}_y \\ \dot{\phi} \end{bmatrix} = \begin{bmatrix} -\frac{1}{m}\left(mg + C_D^v v_y\right) + \frac{T_1}{m}\cos\phi + \frac{T_2}{m}\cos\phi \\ \omega \end{bmatrix}, \qquad (10b)

\dot{x} = v_x, \qquad (10c)

\dot{y} = v_y, \qquad (10d)

\dot{\omega} = -\frac{1}{I_{yy}} C_D^\phi \omega - \frac{l}{I_{yy}} T_1 + \frac{l}{I_{yy}} T_2 \qquad (10e)

with \underline{T} \le T_1, T_2 \le \bar{T}, \underline{\omega} \le \omega \le \bar{\omega}, \underline{v}_x \le v_x \le \bar{v}_x, \underline{v}_y \le v_y \le \bar{v}_y, and where ω is a fictitious control in (10a) and (10b), and v_x and v_y are fictitious controls in (10c) and (10d), respectively.

Hyperparameter Settings: In all experiments involving the quadrotor environment, the hyperparameters were set to exactly match the parameters used previously for the 5-D car environment.

E. Car Model

Wall Clock Runtime: In addition to the analyses performed in the main body of the paper, we also compared the wall clock runtime of BaRC to that of a random curriculum and standard PPO. This experiment was run on servers with octa-core CPUs, but each individual run was limited to running in one thread. Figure 4 illustrates our results. We can see that our algorithm is competitive in runtime,

Fig. 4. Percentage of successful starts for the car model, plotted versus wall clock time.

even with the additional cost of computing BRSs at each iteration. In this environment, the longest iterations of BaRC take 2-3 times as long as standard PPO iterations. However, this multiple rapidly shrinks as the policy learns to reach the goal quickly and learns to generalize across the state space. Once the BRS has reached the initial state, the BRS expansion can be terminated, and each iteration of BaRC then runs for approximately the same amount of time as standard PPO.

Robustness to Model-System Mismatch: To determine how accurate the approximate dynamics model must be (and therefore how robust our approach is to model error), we evaluated BaRC's performance in the presence of three common model mismatch scenarios: additive velocity noise, nonzero-mean additive control noise, and oversteer.

To add velocity noise, we sample a disturbance value d_v from a standard Gaussian distribution, add this to the current state's velocity, numerically integrate the dynamics equations to produce the next state, and then remove the added velocity from the resulting state. Thus, the added velocity noise affects the propagation of the system, but is not observed by the policy. This velocity noise is roughly equivalent to wheel slip or increased traction during a timestep of the discrete dynamics. For nonzero-mean additive control noise, we sample the disturbance values d_{v-ctrl} and d_{κ-ctrl} independently from a Gaussian centered at 0.3 with standard deviation 0.2; these are then added to the control inputs. Finally, to produce oversteer we set d_{a_κ} = 1.5, which multiplies the commanded rate of curvature. Figure 5 shows the performance of BaRC, a random curriculum, and standard PPO on each of these model mismatch scenarios. As can be seen, our approach outperforms the others in all cases. Interestingly, these model mismatches generally improved the performance of the random curriculum method.
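The three corruptions can be reproduced with a thin wrapper around the simulator step; the sketch below is our reading of the procedure above (the step interface and names are assumptions), not the exact experiment code.

import numpy as np

def corrupted_step(step, state, action, rng,
                   velocity_noise=False, control_noise=False, oversteer=False):
    # step(state, action) -> next_state integrates the nominal car dynamics.
    # state = [x, y, theta, v, kappa], action = [a_v, a_kappa].
    state = np.array(state, dtype=float)
    action = np.array(action, dtype=float)
    d_v = 0.0
    if velocity_noise:
        d_v = rng.normal(0.0, 1.0)               # standard Gaussian wheel-slip term
        state[3] += d_v                           # add to the speed before propagating
    if control_noise:
        action += rng.normal(0.3, 0.2, size=2)    # nonzero-mean additive control noise
    if oversteer:
        action[1] *= 1.5                          # multiply the commanded curvature rate
    next_state = np.array(step(state, action), dtype=float)
    if velocity_noise:
        next_state[3] -= d_v                      # remove the added velocity afterwards
    return next_state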

F. Planar Quadrotor Model

BRS Visualization: While not strictly an experiment, for the benefit of the reader we show example BRSs for the planar quadrotor system in Figure 6 to illustrate how a BRS is sampled from and how it evolves per iteration.


Fig. 5. Corrupted car model results. Left: The performance of BaRC, random curricula, and PPO alone on a system with additive nonzero-mean Gaussian noise on the control input. Middle: The performance of the three algorithms on a system with oversteer (steering control inputs are multiplied by a constant factor greater than 1). Right: The algorithms' performance on a system with additive Gaussian longitudinal velocity noise.

[Fig. 6 panels: the BRS in the (v_x, φ) plane at level 0, shown together with the target set, the contour bounding box, and the sampled points.]

Fig. 6. An example BRS for our planar quad system and how it evolves over time as more points are added to the starts variable. Plots advance forward from top-left to bottom-right. Green dots show training start states obtained via rejection sampling. The three BRSs shown in red correspond to time horizons of T = {0.05, 0.10, 0.15} s, respectively, from the innermost (closest to the target set) to the outermost contour.