
Quadruped Robot Obstacle Negotiation via Reinforcement Learning

Honglak Lee∗†, Yirong Shen‡, Chih-Han Yu††, Gurjeet Singh§ and Andrew Y. Ng∗
∗ Computer Science Department, Stanford University, Stanford, CA 94305
† Department of Applied Physics, Stanford University, Stanford, CA 94305
‡ Department of Electrical Engineering, Stanford University, Stanford, CA 94305
†† Division of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138
§ Scientific Computing and Computational Mathematics Program, Stanford University, Stanford, CA 94305

Abstract— Legged robots can, in principle, traverse a large variety of obstacles and terrains. In this paper, we describe a successful application of reinforcement learning to the problem of negotiating obstacles with a quadruped robot. Our algorithm is based on a two-level hierarchical decomposition of the task, in which the high-level controller selects the sequence of foot-placement positions, and the low-level controller generates the continuous motions to move each foot to the specified positions. The high-level controller uses an estimate of the value function to guide its search; this estimate is learned partially from supervised data. The low-level controller is obtained via policy search. We demonstrate that our robot can successfully climb over a variety of obstacles which were not seen at training time.

I. INTRODUCTION

While wheeled vehicles (such as cars and trucks) are very fuel efficient, legged robots can, in principle, climb over much larger obstacles relative to the size of the robot, and thereby access significantly more difficult terrain. In this paper, we apply reinforcement learning to develop a novel controller for a quadruped robot so as to enable it to negotiate a wide variety of obstacles, including ones that had not been previously seen during training.

In related work, some other researchers have focused on clever mechanical design of legs that automatically adapt to terrains [1], [2]. Bai et al. [3] proposed a rule-based algorithm for generating gaits to step onto (and over) small obstacles, and demonstrated them in simulation. There has also been work on adaptive gaits for quadruped robots based on “central pattern generators” [4]–[8]. However, these gaits have mainly been applied to traversing rough surfaces, rather than climbing over large/discrete obstacles. Moore et al.'s hexapod (six-legged) RHex robot is capable of climbing up steps, but does so using ingenious mechanical design in which large, circular, “wheel-like” legs flail at the obstacles, rather than through careful balance and coordination [9], [10]. There are also several robots that rely on hopping to climb up steps, e.g., Raibert's biped robot [11] and Poulakakis et al.'s galloping robot [12]. Bretl et al.'s vertical climbing robot successfully applied motion planning to robot climbing [13]. Their algorithm relied on computing statically stable poses for vertical climbing, and does not immediately apply to more general obstacle negotiation tasks. Learning algorithms have been successfully applied to a few legged locomotion problems, for example [14]–[16]. But these methods have been demonstrated to work only on flat terrain in which the same (periodic) gait could be repeated without taking into account possible obstacles in the path. Kim et al. [17] proposed an algorithm for walking and climbing in 3D environments, but specifically for a robot whose feet had strong suction pads. Chestnutt et al. [18] used hierarchical planning to deal with obstacle avoidance, but considered only 2D path planning and 3D obstacle avoidance (“stepping” over small obstacles), and not the more difficult task of climbing over obstacles.

Our method is based on hierarchical reinforcement learning [19]–[22] using a two-level hierarchical decomposition of the task. Given an obstacle (such as a step) that the robot needs to climb over, the high-level planner selects a sequence of “target foot placement positions” for the robot, one foot at a time. It is then the low-level controller's task to move the feet to these targets in order, while keeping the robot balanced and preventing the legs from hitting any obstacles.

The low-level controller operates on a very short temporal scale, in a problem which requires fine coordination and balance, but does not require careful multi-step reasoning. Policy search algorithms [23] with simple parameterized policies apply well to such problems; we use this approach to learn the low-level controller.

In contrast, the high-level controller has to plan a sequence of foot placement positions, and must reason on significantly longer timescales. We build a high-level controller using value function approximation and beam search. In detail, a value function indicates the relative “desirability” of different states in the problem. We learn a value function using a novel reinforcement learning algorithm, one that also makes use of some supervised information. We then apply beam search with the learned value function to find the sequence of foot placements that maximizes the reward. This sequence is then executed by the low-level controller.

Hierarchical reinforcement learning algorithms have been applied to a number of other problems, for example various grid-world variants (e.g., [22]). To our knowledge, they have not previously been successfully applied to any continuous state-space problems of comparable size and complexity to ours.


Fig. 1. Quadruped robot

II. QUADRUPED ROBOT MODEL

The quadruped robot used in this work is shown in Figure 1. The robot is approximately 10cm tall, 28cm wide, and 12cm long, and weighs 500 grams. Each leg has three servomotors: two at the hip joint, for rotating the upper segment of each leg forward/backward or up/down; and one at the knee joint, for rotating the lower segment of each leg inward/outward. The servos have a maximum torque of 3.0 kg-cm, and may fail to move to the commanded position if the torque is insufficient. The task is to send the right sequence of commands (target angles) to these servos so as to enable the robot to climb up an obstacle.

The state of the robot is completely determined by its position (x, y, z); orientation (roll ψ1, pitch ψ2, yaw ψ3); the twelve joint angles ξ1, ..., ξ12; and the corresponding velocities and angular velocities ẋ, ẏ, ż, ψ̇1, ψ̇2, ψ̇3, ξ̇1, ..., ξ̇12. This gives a 36-dimensional state space. However, our low-level controller will choose commands only as a function of the variables (x, y, z, ψ1, ψ2, ψ3, ξ1, ..., ξ12), which span an 18-dimensional subspace (cf. state abstraction in reinforcement learning [19]–[22]). To simplify our notation, we will sometimes write these 18 state variables at time t as Ω_t = [ω1, ..., ω18]^T.

Figure 2 shows the kinematic model of our robot and the coordinate transformations used to compute the location of its joints. Standard forward kinematics can be used to calculate all the foot positions given Ω_t [24]. We write foot i's position as follows:

u^i_t = ^{world}_{foot}T(x, y, z, ψ1, ψ2, ψ3, ξ_{i1}, ξ_{i2}, ξ_{i3}),   (1)

where u^i_t is foot i's position at time t, and i1, i2, i3 are the indices of the three joints on foot i. The term ^{world}_{foot}T(·) is obtained from simple kinematic computations.
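As an illustration of Eq. (1), the sketch below composes homogeneous transforms from the body pose and one leg's three joint angles to obtain a foot position in the world frame. The hip offset, segment lengths, and joint-axis assignment are made-up placeholders, not the actual robot geometry.

```python
import numpy as np

def _rot(axis, a):
    """3x3 rotation about a principal axis ('x', 'y', or 'z')."""
    c, s = np.cos(a), np.sin(a)
    if axis == 'x':
        return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
    if axis == 'y':
        return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def _hom(R, p):
    """4x4 homogeneous transform from rotation R and translation p."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = p
    return T

def foot_position(x, y, z, roll, pitch, yaw, hip_fb, hip_ud, knee,
                  hip_offset=(0.06, 0.05, 0.0), upper_len=0.06, lower_len=0.06):
    """World-frame position of one foot, for illustrative geometry only.

    The hip offset and segment lengths are placeholders, and the joint-axis
    assignment (hip forward/back about z, hip up/down about y, knee about y)
    is likewise an assumption made for this sketch.
    """
    # Body pose in the world frame (roll-pitch-yaw convention assumed).
    T_world_body = _hom(_rot('z', yaw) @ _rot('y', pitch) @ _rot('x', roll),
                        (x, y, z))
    # Hip joints: rotate, then translate down the upper leg segment.
    T_body_upper = _hom(_rot('z', hip_fb) @ _rot('y', hip_ud), hip_offset)
    T_upper_lower = _hom(_rot('y', knee), (0.0, 0.0, -upper_len))
    # The foot sits at the end of the lower segment.
    foot_in_lower = np.array([0.0, 0.0, -lower_len, 1.0])
    return (T_world_body @ T_body_upper @ T_upper_lower @ foot_in_lower)[:3]

print(foot_position(0, 0, 0.10, 0.0, 0.0, 0.0, 0.1, -0.2, 0.4))
```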

III. LOW-LEVEL CONTROLLER

The job of the low-level controller is to output a sequence of servo commands that will move a single foot to a given target position while keeping the other 3 feet stationary. Our low-level controller uses a policy (controller) that maps from the current state, the index of the moving foot, and the target position to commands for the 12 servos. We take a policy search approach to learning the low-level controller. Thus, as is standard practice in policy search, we will begin by specifying the policy class (i.e., the parameterization of the low-level controller). The low-level controller has to generate foot motions that, during the motion to the target position, do not hit any of a variety of possible obstacles. Potential fields [25] seem well suited to this task. More precisely, we choose a policy parameterization based on potential fields, and use policy search to automatically learn the potential field parameters.

Fig. 2. Quadruped state variables and forward kinematics

Potential fields [25] are frequently used in robot motion planning to find a sequence of actions to move towards a goal without violating certain constraints; for example, to generate a path to a goal while avoiding obstacles. Informally, we will place an “attractive potential” at the goal (this potential is represented by a function over the state space that decreases as we approach the goal); and “repulsive potentials” at the obstacles (represented by functions that increase as we approach an obstacle). The overall potential function is the sum of all the attractive and repulsive potentials. On each step, the robot moves by taking the steepest descent direction over the overall potential function.

In detail, we formulate a low-level controller using a potential field composed of three functions: (1) Goal potential: an attractive potential field which encourages the robot to move a single leg toward the specific goal position generated by the high-level planner. (2) Surface potential: a repulsive potential that keeps the moving foot away from the ground and obstacle surfaces. (3) Posture potential: a potential function that encourages “good” posture (cf. [26]), such as having the center of gravity (CG) within the triangle formed by the three supporting feet and minimizing yaw and roll of the robot body.

For the task of moving foot i from the current position u^i_t to a goal foot placement position u^i_g, we construct the following potential:

U_t(Ω_t, u^i_t, u^i_g) = U^g_t(u^i_t, u^i_g) + U^s_t(u^i_t, u^i_g) + U^p_t(Ω_t),   (2)

where U^g_t(·), U^s_t(·), and U^p_t(·) represent the goal, floor, and posture potential functions respectively. We defer to the Appendix the exact functional forms of these potential functions. To execute a foot motion, at each instant in time the robot computes the negative gradient, evaluated at the current state, of the potential function:

g_{t,j} = −∇_{ω_j} U_t(Ω_t, u^i_t, u^i_g),   j = 1, ..., 18.   (3)

The robot then tries to change its joint angles in the direction specified by g_t.¹

¹ Only 12 of the components of g_t correspond to robot joint angles, and are executed on the servomotors. The other 6 components are ignored.
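The sketch below evaluates Eq. (3) numerically for an arbitrary scalar potential over the 18 state variables; the toy quadratic potential stands in for the full goal/surface/posture combination defined in the Appendix, and the real controller could of course use analytic gradients instead.

```python
import numpy as np

def descent_direction(U, omega, eps=1e-5):
    """Numerical version of Eq. (3): g_{t,j} = -dU/d(omega_j).

    `U` is any callable mapping the 18-dimensional state vector to a scalar
    potential (a placeholder here). A central finite difference is used per
    coordinate purely for illustration.
    """
    g = np.zeros_like(omega)
    for j in range(len(omega)):
        step = np.zeros_like(omega)
        step[j] = eps
        g[j] = -(U(omega + step) - U(omega - step)) / (2 * eps)
    return g

# Toy potential: quadratic bowl pulling the state toward a made-up target.
target = np.linspace(0.0, 1.0, 18)
U_toy = lambda w: 0.5 * np.sum((w - target) ** 2)
print(descent_direction(U_toy, np.zeros(18))[:4])
```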

This idea needs two additional refinements. Naively following the gradient of the potential function will only work if there are no local minima in the potential, other than a global minimum at the goal position. However, near the front of obstacles, it is possible for the attractive goal potential and the repulsive surface potential to cancel, causing local minima to form in the potential function. To deal with this problem, we add a vortex field term [27]; this is a vector field that goes around the obstacles. Our use of the vortex field is motivated by the observation that the moving foot must not only avoid colliding with obstacles, but must also go around obstacles in order to reach the goal position. So, near the front of obstacles where the attractive goal potential and the repulsive surface potentials can nearly cancel each other, the vortex field will cause the moving foot to move upwards over the obstacle. Details on the vortex field can be found in the Appendix.

One additional refinement to the algorithm is necessary to keep the three supporting feet from moving. At every step, we informally think of the robot as trying to change its state in some direction g_t (given by the sum of the gradient of U_t and the vortex field vector). In general, following the direction of g_t exactly would also change the position of the supporting (non-moving) feet u^{s1}_t, u^{s2}_t, and u^{s3}_t, where s1, s2, and s3 are the indices of the robot's three supporting feet at time t. In order to avoid this (since we want to move only one leg at a time), following [28] we will project g_t into the subspace of motions which keeps the positions of the supporting feet constant. More precisely, we define Φ_t ∈ ℝ^{18×9} to be the matrix whose columns are the gradients of the components of u^{s1}_t, u^{s2}_t, and u^{s3}_t with respect to the 18 state variables. The change in the positions of the supporting feet due to a small change in the state variables δΩ_t is approximately Φ_t^T · δΩ_t. Hence, in order to keep the supporting feet stationary, we should only move in directions that are in the nullspace of Φ_t^T. Now, let g_t^∗ be the projection of g_t onto the nullspace of Φ_t^T. Finding g_t^∗ can be posed as the following minimization problem:

min ||g_t^∗ − g_t||_2   s.t.   Φ_t^T · g_t^∗ = 0.   (4)

The closed form solution for g_t^∗ is:

g_t^∗ = (I − Φ_t · (Φ_t^T · Φ_t)^{−1} · Φ_t^T) · g_t.   (5)

Thus, we change the joint angles in the direction of g_t^∗.
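A minimal NumPy sketch of the projection in Eqs. (4)–(5); the random Φ matrix here is only a stand-in for the true constraint Jacobian of the supporting feet.

```python
import numpy as np

def project_to_nullspace(g, Phi):
    """Project the desired state change g onto the nullspace of Phi^T (Eq. 5).

    Phi is the 18x9 matrix whose columns are the gradients of the supporting
    feet's coordinates with respect to the state; after projection, a small
    step along the returned direction leaves those feet (approximately) fixed.
    A pseudoinverse is used for numerical robustness; the closed form in
    Eq. (5) assumes Phi^T Phi is invertible.
    """
    projector = np.eye(Phi.shape[0]) - Phi @ np.linalg.pinv(Phi.T @ Phi) @ Phi.T
    return projector @ g

# Example with a random (made-up) constraint matrix.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((18, 9))
g = rng.standard_normal(18)
g_star = project_to_nullspace(g, Phi)
print(np.allclose(Phi.T @ g_star, 0.0))  # supporting-feet motion ~ 0
```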

In choosing the form of the potential function, we were thus able to encode a significant amount of prior knowledge about the task of moving a single foot. Specifically, minimizing the combination of the three potential functions will tend to cause the robot to move the foot toward the goal while maintaining balance and avoiding collisions with obstacles or with the ground. Further, the projection onto the nullspace of Φ_t prevents the supporting feet from moving. All this prior knowledge, encoded into the low-level controller's policy class, helps it to solve the high-dimensional search problem associated with moving a single foot to a new position.² That we can encode so much prior knowledge into the policy parameterization is one of the strengths of policy search. However, the policy class still has many free parameters which are difficult to choose by hand. For example, even though it is easy to specify the form of a potential that repulses a foot from an obstacle, it is hard to decide by hand exactly what the magnitude of this repulsion should be. In our approach, we learn the parameters of the potential functions automatically, as discussed in the next section.

² The action space used by the low-level controller is 12d and not 3d, since even if the current step requires only moving foot 1, we may still wish to move the servomotors in the other legs to maintain balance.

A. Policy search

The low-level controller has 20 parameters that govern various trade-offs between its attractive and repulsive potentials. To learn these parameters, we began by building a dynamical simulator³ of the robot following the specifications of the real robot. Using the simulator, we then applied PEGASUS policy search [23], implemented with locally greedy hill-climbing search, to optimize the parameters on a set of predefined training tasks. Each of the training tasks involves moving a single foot to a new location. During learning, we use a reward function that penalizes undesirable behaviors such as taking a long time to complete the foot movement, passing too close to the vertical surface of an obstacle, or failing to move the foot to the desired goal location. The policy search algorithm tries to choose parameters so as to maximize the rewards. The learned parameters are then fixed, giving us the low-level controller that will be used by the high-level planner.

³ We used Open Dynamics Engine (ODE) to build our dynamical simulator. See http://ode.org for details on ODE.
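The sketch below shows the kind of locally greedy hill-climbing loop described above, with a toy objective standing in for the simulated rollouts and reward; it is not the actual PEGASUS implementation, which additionally evaluates each candidate parameter vector on the same fixed scenarios so that comparisons are consistent.

```python
import numpy as np

def hill_climb(reward_fn, theta0, step=0.05, iters=200, seed=0):
    """Locally greedy hill-climbing over controller parameters.

    `reward_fn` stands in for rolling out the low-level controller in the
    simulator on the training tasks and summing the rewards. Everything here
    is a schematic stand-in, not the paper's implementation.
    """
    rng = np.random.default_rng(seed)
    theta, best = theta0.copy(), reward_fn(theta0)
    for _ in range(iters):
        candidate = theta + step * rng.standard_normal(theta.shape)
        r = reward_fn(candidate)
        if r > best:          # keep the perturbation only if it helps
            theta, best = candidate, r
    return theta, best

# Toy objective standing in for simulated rollout reward (20 parameters).
toy_reward = lambda th: -np.sum((th - 0.3) ** 2)
theta_opt, r_opt = hill_climb(toy_reward, np.zeros(20))
print(round(r_opt, 4))
```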

IV. HIGH-LEVEL PLANNER

Given the low-level controller for moving a single leg, we now need a high-level planner for generating the sequence of target foot positions. Finding this sequence of target foot positions represents a difficult search problem because a bad choice of foot position early in the sequence could lead the robot to a bad state (for example, one with very bad posture), from which it may be extremely difficult to make further progress. Moreover, performing exhaustive search over all possible sequences of foot positions is computationally intractable, because of the high branching factor and large search depth involved.

In order to avoid performing exhaustive search, we need a way of measuring how “good” a state is. For example, in A∗ search, a heuristic cost-to-go function gives estimates of the optimal future cost from any given state to the goal. However, it is often difficult to hand-specify a good heuristic cost-to-go function in complex problems, since it requires estimating the entire sequence of unknown future costs. In reinforcement learning (RL) [29], the “goodness” of a state s is measured by the optimal value function V∗(s). Informally, this is the “expected optimal future rewards” starting from the state s. Using reinforcement learning, we will automatically learn an approximate value function.

A. Reinforcement learning preliminaries

This section gives a brief overview of the reinforcement learning formalism. For more details, see, e.g., [29].

In reinforcement learning, we model a system to be controlled as having a set of possible states S, and a set of possible actions A. Further, we are given a reward function R : S × A → ℝ, so that R(s, a) is the reward for taking action a in state s. The system's dynamics are given by a state transition function F : S × A → S, defined so that s′ = F(s, a) is the state reached upon taking action a in state s.⁴ A policy is any function π : S → A mapping from the states to the actions. We say we are executing a policy π if whenever we are in state s, we take action a = π(s). The value function V^π(s0) for a policy π is defined as the sum of discounted rewards upon starting in state s0 and taking actions according to π:

V^π(s0) = R(s0, a0) + γ R(s1, a1) + γ² R(s2, a2) + ...

where a_i = π(s_i), s_{i+1} = F(s_i, a_i), and γ ∈ [0, 1) is the discount factor. The discount factor causes rewards in the distant future to be weighted less than rewards in the near future. The optimal value function is defined as

V∗(s) = max_π V^π(s).

This is the best possible sum of discounted rewards that can be attained using any policy. The optimal value function satisfies the Bellman equation:

V∗(s) = max_a R(s, a) + γ V∗(s′),   (6)

where s′ is the state reached upon taking action a in state s. In other words, the maximum possible value starting from the current state s is given by maximizing the sum of the immediate one-step reward R(s, a) and the future rewards from the next state s′. Once V∗(·) is known, finding the policy that gives the maximum sum of discounted rewards is easy. The optimal policy π∗ : S → A is given by

π∗(s) = arg max_a R(s, a) + γ V∗(s′).   (7)

⁴ Although the general RL framework is most commonly applied to problems with stochastic dynamics, here we will use a simplified version with only deterministic dynamics.

In our high-level planner, taking an action a will correspond to choosing a new position for one of the feet. The reward function we used gave positive rewards for (1) movement of the robot's CG towards the goal; small negative rewards for (2) the time it took to execute the movement; and very large negative rewards for (3) failure of the low-level controller to execute the action.
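For concreteness, the one-step lookahead of Eq. (7) with deterministic dynamics can be written as below; R, F, and V are placeholders to be supplied by the caller, and the 1-D usage example is purely illustrative.

```python
def greedy_action(s, actions, R, F, V, gamma=0.95):
    """One-step lookahead policy of Eq. (7) for deterministic dynamics.

    `R(s, a)` is the reward, `F(s, a)` the next state from the simulator,
    and `V(s)` the learned value estimate; `actions` is the discretized set
    of candidate foot placements.
    """
    return max(actions, key=lambda a: R(s, a) + gamma * V(F(s, a)))

# Toy usage on a 1-D "position" state where the goal is at 10.
R = lambda s, a: -abs((s + a) - 10)   # reward: progress toward 10
F = lambda s, a: s + a                # deterministic transition
V = lambda s: -abs(s - 10)            # value: negative distance to goal
print(greedy_action(0.0, [-1.0, 0.5, 1.0], R, F, V))  # -> 1.0
```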

B. Value function approximation

There is a large number of existing RL algorithms for exactly learning the optimal value function. However, most of the exact algorithms apply only to problems with small, finite state spaces. For problems with large, continuous state spaces, it is generally not possible to obtain V∗(·) exactly, since even explicitly storing V∗(s) for every s ∈ S is impossible. In such cases, an approximate representation for V∗(·) must be used. Following fairly standard practice in RL, we will approximate V∗(s) as a linear combination of a set of features of the state s. Specifically, we approximate V∗(s) as:

V(s) = θ^T φ(s),

where φ(s) is a vector of features of the state s, and θ is a parameter vector that will be learned.

In our approach, φ(s) consisted of the following features of the state⁵: (1) Distance from the robot's CG to the goal; (2) Average distance from the feet to the goal; (3) Orientation of the body, expressed as (1 − cos(ψ1), 1 − cos(ψ2), 1 − cos(ψ3)); (4) Maximum knee angle; (5) A feature capturing surface roughness underneath each foot on the ground⁶; (6) Surface curvature⁷; (7) Difference between maximum and minimum heights of the feet; (8) Area of the supporting triangle (formed by the three stationary feet); (9) Radius of the largest circle that inscribes the supporting triangle; (10) The distances between each pair of feet.

⁵ We fixed the foot movement order as in the conventional trot gait (left-front, right-rear, right-front, left-rear), and so our state actually also includes information about which foot is to be moved next.

⁶ This is computed using a (Gaussian-weighted) sum of squares of the local gradient of the terrain height, over a small patch centered at each foot on the ground.

⁷ Computed as a (Gaussian-weighted) sum of the second derivatives of the terrain height function, over a small patch centered at each foot on the ground.
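A sketch of how a few of these features might be assembled into φ(s); the inputs and the subset of features included are assumptions, and the terrain-dependent features (5)–(6) are omitted because they require the height map.

```python
import numpy as np

def features(state, goal, foot_positions, body_angles):
    """A few of the value-function features listed above, as a sketch.

    `state` (18-vector), `goal` (3-vector), `foot_positions` (4x3 array) and
    `body_angles` (roll, pitch, yaw) are assumed inputs for this example.
    """
    cg = state[:3]                                   # assume CG ~ body position
    dists = np.linalg.norm(foot_positions - goal, axis=1)
    pair_dists = [np.linalg.norm(foot_positions[i] - foot_positions[j])
                  for i in range(4) for j in range(i + 1, 4)]
    return np.concatenate([
        [np.linalg.norm(cg - goal)],                 # (1) CG-to-goal distance
        [dists.mean()],                              # (2) mean foot-to-goal distance
        1.0 - np.cos(body_angles),                   # (3) orientation terms
        [foot_positions[:, 2].max() - foot_positions[:, 2].min()],  # (7)
        pair_dists,                                  # (10) pairwise foot distances
    ])

phi = features(np.zeros(18), np.array([0.3, 0.0, 0.0]),
               np.array([[ 0.1,  0.05, 0.0], [ 0.1, -0.05, 0.0],
                         [-0.1,  0.05, 0.0], [-0.1, -0.05, 0.0]]),
               np.array([0.0, 0.1, 0.0]))
print(phi.shape)
```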

C. Reinforcement learning algorithm

To learn an approximation to the value function, we developed a new partially supervised learning algorithm inspired by Support Vector Machines [30]. Our algorithm is based on the observation that in order for π(s)—the action chosen using our learned value function V—to be the same as the optimal action π∗(s), we need only that the following hold:

R(s, π∗(s)) + γV(F(s, π∗(s))) > R(s, a) + γV(F(s, a))   (8)

for all actions a ≠ π∗(s). Thus, if we have a “training set” consisting of 3-tuples (s_i, a∗_i, a_i), i = 1, ..., m, where s_i ∈ S, a∗_i = π∗(s_i) and a_i ≠ π∗(s_i), then one might try to find an approximation V to the value function that satisfies the equation above for all tuples in the training set. Since V(s) = θ^T φ(s) is linear in the parameters θ, this simply gives a set of linear constraints, and in fact any number of standard linear classification algorithms (such as softmax regression [31], and support vector machines [30]) can be used to learn the parameters.

However, in our experiments, one additional refinement to this idea caused it to work significantly better. Specifically, under mild conditions, it can be shown that the optimal value function V∗ is the unique solution to the Bellman equations (Eq. 6). Thus, one possible approach to approximating V∗ is to find the parameters θ that come as close as possible to solving the Bellman equations. More precisely (following [32]), we can solve for θ so as to minimize the squared Bellman error:

min_θ Σ_i |V(s_i) − R(s_i, a∗_i) − γV(F(s_i, a∗_i))|²,   (9)

where s_i and a∗_i are from our training set,⁸ as described above, and V(s_i) = θ^T φ(s_i).

⁸ We sum over only distinct s_i's.

Putting together the two ideas described above—supervised learning to try to satisfy the constraints given in Eq. 8, as well as minimizing the squared Bellman error as in Eq. 9—we obtain the following optimization problem:

max_{θ,δ}   δ − β Σ_i |V(s_i) − R(s_i, a∗_i) − γV(F(s_i, a∗_i))|² − α ||θ||₂²
s.t.   R(s_i, a∗_i) + γV(F(s_i, a∗_i)) − R(s_i, a_i) − γV(F(s_i, a_i)) ≥ δ,   i = 1, ..., m.   (10)

Here, α and β are constants that determine the relative weighting of the different terms in the optimization objective. This optimization problem is a quadratic program, which can be readily solved [33]. Note that the constraints in the optimization are exactly demanding that Eq. 8 holds for all tuples in the training set (assuming δ > 0), and moreover that the left-hand side be larger than the right-hand side by a gap or “margin” of at least δ.⁹

⁹ Following the L1-norm soft-margin SVM [30], we also considered adding slack parameters to address the case of linearly inseparable data. However, this turned out to be unnecessary for our application, and even without the soft margin, the algorithm attained good performance.

If β = 0, then the objective function reduces to only trying to optimize the constraints given in Eq. 8. (In fact, the resulting algorithm would be extremely similar to using a Support Vector Machine [30] with parameters θ to try to learn a classifier to identify optimal actions a∗_i vs. suboptimal actions a_i.) Conversely, for small or zero α, the optimization objective instead emphasizes minimizing the squared Bellman error, as in Eq. 9. However, by combining both of these objective functions together, we obtain a better algorithm than either approach alone.
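A small illustration of the quadratic program in Eq. (10), written with the cvxpy modeling package on a synthetic training set; the feature vectors, rewards, and the values of α, β, γ are all made up for the example.

```python
import cvxpy as cp
import numpy as np

# Synthetic stand-ins: phi_s[i] = phi(s_i), phi_opt[i] = phi(F(s_i, a*_i)),
# phi_sub[i] = phi(F(s_i, a_i)), with matching one-step rewards.
rng = np.random.default_rng(0)
m, d, gamma, alpha, beta = 30, 8, 0.95, 1e-3, 1e-2
phi_s   = rng.standard_normal((m, d))
phi_opt = phi_s + 0.3 + 0.05 * rng.standard_normal((m, d))
phi_sub = phi_s - 0.3 + 0.05 * rng.standard_normal((m, d))
r_opt   = rng.uniform(0.5, 1.0, m)
r_sub   = rng.uniform(-1.0, 0.0, m)

theta = cp.Variable(d)
delta = cp.Variable()
# Squared Bellman error term of Eq. (9), restricted to the optimal actions.
bellman = cp.sum_squares(phi_s @ theta - r_opt - gamma * (phi_opt @ theta))
# Margin by which the optimal action beats the suboptimal one (Eq. 8).
margin = (r_opt + gamma * (phi_opt @ theta)) - (r_sub + gamma * (phi_sub @ theta))
prob = cp.Problem(
    cp.Maximize(delta - beta * bellman - alpha * cp.sum_squares(theta)),
    [margin >= delta])
prob.solve()
print(delta.value, np.round(theta.value, 3))
```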

In our experiments, we used a training set of 400 tuples of states, optimal actions (target foot positions) and suboptimal actions.¹⁰ Out of these 400 training examples, there were only 16 distinct states s_i and (assumed) optimal actions a∗_i, which corresponded to a sequence of 16 footsteps that successfully climbed a single 2.6cm step. The suboptimal actions for these 16 steps were generated by manually labeling a number of foot positions that were clearly suboptimal or inappropriate for the current state s_i.

¹⁰ We note that generating these training examples was fairly easy, since it is not difficult for this robot to climb the (relatively short) 2.6cm step. Thus, we could quite easily find a sequence of actions which succeeded in climbing this step; the actions in this sequence then formed our (assumed optimal) a∗_i in the training set. A set of (assumed suboptimal) actions a_i were similarly quickly generated.

Fig. 3. Simulated robot executing a sequence of controls to climb up a single 4cm step.

Fig. 4. Simulated robot climbing up an irregular step; the right portion of the step is 6cm and the left portion is 3cm.

We also tried learning a value function using temporal difference (TD) learning [29], which is one of the best known, standard, reinforcement learning algorithms. Unlike our algorithm, TD does not make use of a labeled training set during learning. But despite devoting significant effort to tuning TD, including trying out different features, different minor modifications to the learning algorithm and so on, in every instance it still failed to produce a good value function even after running for more than 10 hours.

Having learned the parameters θ, we are now ready to generate actions using the high-level planner. Given V, Eq. 7 gives a straightforward way to choose actions. However, as in [34] we can do even better by performing a multiple-step lookahead. More precisely, we used beam search to look multiple steps ahead, and to try to find a sequence of actions to move us towards the goal, or more formally, to maximize the total sum of discounted rewards. During the beam search procedure, we used our (deterministic) simulator for the robot to expand different search nodes, and used the learned V(s) as the heuristic function with which to guide the search algorithm. Our implementation discretized the space of possible actions a (possible foot placements) into 15 different values, and used a beam width of 10.
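The following is a schematic version of this beam-search procedure; F, R, and V are placeholders for the simulator, reward, and learned value function, and bookkeeping such as the foot-movement order and rejection of actions the low-level controller cannot execute is omitted.

```python
import heapq

def beam_search(start, actions, F, R, V, depth=4, beam_width=10, gamma=0.95):
    """Multi-step lookahead with a learned value estimate as the heuristic.

    Returns the first action of the best-scoring plan. This is a sketch of
    the idea, not the paper's planner.
    """
    # Each entry: (discounted return so far, first action, current state).
    beam = [(0.0, None, start)]
    for step in range(depth):
        candidates = []
        for ret, first, s in beam:
            for a in actions:
                s2 = F(s, a)
                candidates.append((ret + (gamma ** step) * R(s, a),
                                   a if first is None else first, s2))
        # Keep the beam_width partial plans with the best score-plus-heuristic.
        beam = heapq.nlargest(beam_width, candidates,
                              key=lambda c: c[0] + (gamma ** (step + 1)) * V(c[2]))
    return max(beam, key=lambda c: c[0] + (gamma ** depth) * V(c[2]))[1]

# Toy usage: 1-D chain where the goal is to reach 5.
act = beam_search(0, [1, 2], F=lambda s, a: s + a, R=lambda s, a: -1,
                  V=lambda s: -abs(5 - s), depth=3, beam_width=4)
print(act)  # first action of the best-scoring 3-step plan
```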

V. EXPERIMENTAL RESULTS

All learning was done using our simulator of the robot. After learning the parameters for both the low- and the high-level controllers, we tested the resulting hierarchical policy, in simulation, on a variety of obstacles. In the first experiment, we asked the robot to climb up a 4cm tall step. Figure 3 shows the resulting sequence of robot motions. We also tested the robot on an “irregular” step (see Figure 4). In this example, the right half of the step is 6cm high, and the left half is 3cm high. We believe that it is extremely difficult to hand-code controllers for climbing asymmetric steps like these, since they require somewhat difficult, asymmetric (and, we find, somewhat non-intuitive) gaits. The learned controller successfully climbed this obstacle. In addition to these examples, the learned controller also appeared quite robust, and was able to climb a large variety of different obstacles, including steps at various heights (0–6cm); ones oriented at different angles relative to the robot; and irregular steps (with high and low portions) of different heights.

Fig. 5. Real robot executing learned sequences of controls to climb up obstacles: (a) a single step; (b) double-angled steps; (c) double-cornered steps; (d) four interleaved steps.

We also successfully demonstrated these gaits on the real robot, for a large variety of obstacles. For example, Figure 5 shows the real robot executing the gait shown in Figure 3, as well as gaits for several other obstacles.¹¹ Videos of the simulated and actual robot executing the learned gaits are available online, at http://ai.stanford.edu/∼yirong/quadruped/videos

¹¹ So far, our description has assumed a closed-loop controller that chooses the current action as a function of the state s. In our initial experiments, we used a 3-camera array and an IMU to estimate the robot state s, so as to execute a closed-loop controller. However, somewhat to our surprise, just by generating an open-loop sequence of actions (by executing the controller in closed-loop in simulation, and taking the resulting sequence of servomotor angles executed), we were also able to execute these gaits (comprising small numbers of steps) open-loop on the robot. We note also that the obstacle shown in Figure 5(d), which comprises surfaces at four different heights, seemed particularly difficult. Our algorithm successfully found a gait for it using a beam width as small as two, but neither the Bellman-error-only nor the supervised-learning-only (β = 0) algorithm was able to do so.

Note that our experiments used a training set (specifying optimal and sub-optimal target foot positions) based on climbing only a 2.6cm step. However, our learned gaits readily generalized to a large range of obstacles that were not seen in the training set, and were indeed significantly more complex and difficult than the single 2.6cm step seen during training.

For comparison, we also implemented the Rapidly-exploring Random Trees (RRT) algorithm [35]. This is a commonly-used nonholonomic motion planning algorithm that, in principle, can be applied to tasks such as ours. We spent significant effort adjusting RRT's parameters, but in all versions, even after running for more than twenty hours, it was never able to find control sequences for negotiating a single 4cm step.

VI. CONCLUSIONS

We described the application of a hierarchical reinforcement learning algorithm, one that uses a two-level hierarchical task decomposition, to a quadruped robot. The algorithm was able to learn a controller for negotiating a large variety of different obstacles, including ones not seen during training. We believe that hierarchical reinforcement learning holds rich promise for negotiating obstacles and rough terrain with biped, quadruped and hexapod robots.

ACKNOWLEDGMENTS

We thank Sebastian Thrun and Greg Lawrence for helpful conversations. The robot hardware used in this work was based on a design by Lynxmotion, and portions of it were built by Doron Tal. Support from the DARPA Learning Locomotion program (under contract number FA8650-05-C-7261) is gratefully acknowledged.

REFERENCES

[1] S. Hirose, A. Nagabuko, and R. Toyama, “Machine that can walk and climb on floors, walls, and ceilings,” in International Conference on Advanced Robotics, Pisa, Italy, 1991.

[2] S. Hirose, K. Yoneda, and H. Tsukagoshi, “TITAN VII: Quadruped walking and manipulating robot on a steep slope,” in International Conference on Robotics and Automation, Albuquerque, New Mexico, USA, 1997.

[3] S. Bai, K. H. Low, G. Seet, and T. Zielinska, “A new free gait generation for quadrupeds based on primary/secondary gait,” in International Conference on Robotics and Automation, 1999, pp. 1371–1376.

[4] H. Kimura and Y. Fukuoka, “Adaptive dynamic walking of the quadruped on irregular terrain - autonomous adaptation using neural system model,” in International Conference on Robotics and Automation, 2000, pp. 436–443.

[5] H. Kimura, Y. Fukuoka, Y. Hada, and K. Takase, “Three-dimensional adaptive dynamic walking of a quadruped - rolling motion feedback to CPGs controlling pitching motion,” in International Conference on Robotics and Automation, 2002, pp. 2228–2233.

[6] Y. Fukuoka, H. Kimura, Y. Hada, and K. Takase, “Adaptive dynamic walking of a quadruped robot ‘Tekken’ on irregular terrain using a neural system model,” in International Conference on Robotics and Automation, 2003, pp. 2037–2042.

[7] S. Peng, C. Lani, and G. Cole, “A biologically inspired four legged walking robot,” in International Conference on Robotics and Automation, 2003, pp. 2024–2030.

[8] Z. Zhang, Y. Fukuoka, and H. Kimura, “Adaptive running of a quadruped robot on irregular terrain based on biological concepts,” in International Conference on Robotics and Automation, 2003, pp. 2043–2048.

[9] E. Z. Moore and M. Buehler, “Stable stair climbing in a simple hexapod,” in International Conference on Climbing and Walking Robots, Karlsruhe, Germany, 2001, pp. 603–610.

[10] E. Z. Moore, D. Campbell, F. Grimminger, and M. Buehler, “Reliable stair climbing in the simple hexapod RHex,” in International Conference on Robotics and Automation, vol. 3, Washington, D.C., USA, 2002, pp. 2222–2227.

[11] M. H. Raibert, Legged Robots that Balance. Cambridge, MA: MIT Press, 1986.

[12] I. Poulakakis, J. A. Smith, and M. Buehler, “Experimentally validated bounding models for Scout II quadrupedal robot,” in International Conference on Robotics and Automation, New Orleans, LA, USA, 2004, pp. 825–830.

[13] T. Bretl, S. Rock, J. Latombe, B. Kennedy, and H. Aghazarian, “Free-climbing with a multi-use robot,” in 9th International Symposium on Experimental Robotics, Singapore, 2004.

[14] J. Morimoto and C. Atkeson, “Minimax differential dynamic programming: an application to robust biped walking,” in Neural Information Processing Systems Conference, 2002.

[15] J. Bagnell, S. Kakade, A. Ng, and J. Schneider, “Policy search by dynamic programming,” in Neural Information Processing Systems Conference, Vancouver, Canada, 2003.

[16] N. Kohl and P. Stone, “Machine learning for fast quadrupedal locomotion,” in National Conference on Artificial Intelligence, San Jose, California, USA, 2004.

[17] H. Kim, T. Kang, V. G. Loc, and H. R. Choi, “Gait planning of quadruped walking and climbing robot for locomotion in 3D environment,” in International Conference on Robotics and Automation, 2005.

[18] J. Chestnutt, J. Kuffner, K. Nishiwaki, and S. Kagami, “Planning biped navigation strategies in complex environments,” in International Conference on Humanoid Robots, 2003.

[19] R. Sutton, D. Precup, and S. Singh, “Intra-option learning about temporally abstract actions,” in International Conference on Machine Learning, 1998, pp. 556–564.

[20] M. Hauskrecht, N. Meuleau, C. Boutilier, L. P. Kaelbling, and T. Dean, “Hierarchical solution of Markov decision processes using macro-actions,” in Conference on Uncertainty in Artificial Intelligence, 1998.

[21] R. Parr and S. Russell, “Reinforcement learning with hierarchies of machines,” in Neural Information Processing Systems Conference, 1998.

[22] T. G. Dietterich, “Hierarchical reinforcement learning with the MAXQ value function decomposition,” Journal of Artificial Intelligence Research, vol. 13, pp. 227–303, 1999.

[23] A. Y. Ng and M. I. Jordan, “PEGASUS: A policy search method for large MDPs and POMDPs,” in Conference on Uncertainty in Artificial Intelligence, 2000, pp. 406–415.

[24] J. Craig, Introduction to Robotics: Mechanics and Control, 2nd ed. Addison-Wesley Longman Publishing, 1989.

[25] O. Khatib, “Real-time obstacle avoidance for manipulators and mobile robots,” International Journal of Robotics Research, vol. 5, no. 1, pp. 90–98, 1986.

[26] O. Brock, O. Khatib, and S. Viji, “Task-consistent obstacle avoidance and motion behavior for mobile manipulation,” in International Conference on Robotics and Automation, Washington, USA, 2002.

[27] C. D. Medio and G. Oriolo, “Robot obstacle avoidance using vortex fields,” in Advances in Robot Kinematics, S. Stifter and J. Lenarcic, Eds. Springer-Verlag, 1991, pp. 227–235.

[28] O. Khatib, “Inertial properties in robotics manipulation: An object-level framework,” International Journal of Robotics Research, vol. 14, no. 1, pp. 19–36, February 1995.

[29] R. S. Sutton and A. G. Barto, Reinforcement Learning. MIT Press, 1998.

[30] V. N. Vapnik, Statistical Learning Theory. John Wiley & Sons, 1998.

[31] P. McCullagh and J. A. Nelder, Generalized Linear Models (second edition). Chapman and Hall, 1989.

[32] L. C. Baird, “Residual algorithms: Reinforcement learning with function approximation,” in International Conference on Machine Learning, 1995.

[33] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

[34] S. Davies, A. Y. Ng, and A. Moore, “Applying online-search to reinforcement learning,” in National Conference on Artificial Intelligence, 1998.

[35] S. M. LaValle and J. J. Kuffner, “Randomized kinodynamic planning,” in International Conference on Robotics and Automation, Detroit, Michigan, USA, 1999.

APPENDIX
LOW-LEVEL CONTROLLER DETAILS

Here, we describe the potential functions and the vortex field in detail. Our coordinate system is defined with the robot's direction of travel being the positive x direction, with up being the positive z direction, and y being given by the right-hand rule. We will use the following notation:

t is the time since the start of the task (of moving a single leg).
Ω_t is the state vector at time t.
i is the index of the moving foot.
u^j_t = (u^j_{t,x}, u^j_{t,y}, u^j_{t,z}) is the position vector of foot j at time t.
u_g = (u_{g,x}, u_{g,y}, u_{g,z}) is the goal position for the moving foot.
N_obs is the number of obstacles in the environment. Our current implementation permits only obstacles that are rectangular boxes. Note that the ground is also considered as an obstacle in the definitions below.
ξ(Ω_t) is a measure of how far the center of gravity (CG) of the robot is inside the triangle formed by the 3 supporting legs. It is calculated by projecting the positions of the 3 supporting feet and the CG into the xy plane and measuring the minimum distance between the projection of the CG and the sides of the triangle formed by the projected feet positions. ξ(Ω_t) is positive if the CG is within the triangle, and negative if it is outside the triangle.

The low-level controller has 20 parameters: ρ1, ρ2, ..., ρ16, λ1, λ2, τ1, τ2. The ρ_i's control the shape and size of the various potential and vortex fields. The λ_i's and τ_i's determine which potentials are turned on, based on the position of the CG and the amount of time elapsed.

At the beginning of each task, the robot starts in a centering phase, during which the goal and surface potentials are off while the CG potential is on. This allows the robot to shift its CG into the new supporting triangle before it tries to take a new step. The initial centering phase will last a minimum of τ2 time steps and will terminate when either the CG is sufficiently inside the supporting triangle (i.e., ξ(Ω_t) ≤ λ1) or when the maximum number of time steps for the initial centering phase τ2 is reached. The goal and surface potentials are also turned off whenever the center of mass is near or outside the boundary of the supporting triangle (ξ(Ω_t) ≤ τ1); this in effect stops the motion of the active foot and allows the robot to move its CG into a stable position before continuing with moving the active foot.

A. Goal Potential

The goal potential attracts the moving foot to the goal location u_g.

If we are in the initial centering phase or if the CG is near the boundary of the supporting triangle (i.e., ξ(Ω_t) ≤ λ1), then U^g_t = 0. Otherwise,

U^g_t = ρ1 ||u^i_t − u_g||_2 − ρ2 exp(−ρ3 ||u^i_t − u_g||_2) + ρ4 (u^i_{t,y} − u_{g,y})².

The third term puts a heavier emphasis on the y coordinate to reduce small oscillations of the moving foot in the y direction.

B. Surface Potential

Away from the goal position, the surface potential repels the moving foot from the surfaces of obstacles and the ground. The strength of this potential decays to 0 as one approaches the goal position.

If we are in the initial centering phase or if the CG is near the boundary of the supporting triangle (i.e., ξ(Ω_t) ≤ λ1), then U^s_t = 0. Otherwise,

U^s_t = (1 − exp(−ρ7 ||u^i_t − u_g||_2²)) Σ_{k=1}^{N_obs} ρ5 exp(−ρ6 d_k),

where d_k is the distance from the moving foot to obstacle k.
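A sketch of the goal and surface potentials above, with illustrative (not learned) ρ values and with the centering-phase and CG-boundary gating omitted.

```python
import numpy as np

def goal_potential(u_foot, u_goal, rho):
    """U^g_t from Appendix A; `rho` holds made-up parameter values."""
    d = np.linalg.norm(u_foot - u_goal)
    return (rho['r1'] * d - rho['r2'] * np.exp(-rho['r3'] * d)
            + rho['r4'] * (u_foot[1] - u_goal[1]) ** 2)

def surface_potential(u_foot, u_goal, obstacle_dists, rho):
    """U^s_t from Appendix B; `obstacle_dists` holds the distances d_k."""
    gate = 1.0 - np.exp(-rho['r7'] * np.linalg.norm(u_foot - u_goal) ** 2)
    return gate * sum(rho['r5'] * np.exp(-rho['r6'] * d) for d in obstacle_dists)

# Illustrative (not learned) parameter values and inputs.
rho = {'r1': 1.0, 'r2': 0.5, 'r3': 5.0, 'r4': 2.0,
       'r5': 1.0, 'r6': 20.0, 'r7': 50.0}
foot, goal = np.array([0.05, 0.0, 0.0]), np.array([0.15, 0.0, 0.04])
print(goal_potential(foot, goal, rho) + surface_potential(foot, goal, [0.02, 0.10], rho))
```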

C. Posture Potential

The posture potential is defined as the sum of three other potentials

U^p_t = U^c_t + U^o_t + U^f_t

where U^c_t, U^o_t and U^f_t are the CG, orientation, and supporting-feet constraint potentials respectively.

1) CG Potential: The CG potential encourages the CG of the robot to stay within the triangle formed by the three supporting feet. It is always on during the initial centering phase of a new task. It is also on when the CG is near or outside the boundary of the supporting triangle.

If we are not in the initial centering phase and ξ(Ω_t) ≥ λ2, then U^c_t = 0. Otherwise,

U^c_t = exp(−ρ9 min(ξ, ρ8)).

2) Orientation Potential: The orientation potential discourages large roll and yaw of the robot body and is defined as

U^o_t = ρ10 (1 − cos(ψ1)) + ρ11 (1 − cos(ψ3)),

where ψ1 and ψ3 are respectively the roll and yaw of the robot body.

3) Constraint Potential: The constraint potential tries to keep the supporting feet at the same positions as they were at the beginning of the task and is defined as

U^f_t = ρ12 Σ_{j=1, j≠i} ||u^j_t − u^j_0||_2².

Although it may seem that the constraint potential is redundant, since it accomplishes the same purpose as the nullspace projection described by Eq. 4, it is nonetheless necessary because of errors in the linearization used by the nullspace projection.

D. Vortex Field

The vortex field encourages the moving foot to move up along the vertical surfaces of obstacles. The vortex field is used to help ensure that the moving foot does not get stuck in regions where the potential function has nearly zero gradient (e.g., places where the goal and surface potential gradients point in nearly opposite directions). The strength of the vortex field of an obstacle decays with distance to the obstacle.

The total vortex field is the sum of the vortex fields induced by all the obstacles. Let v_k denote the vortex field induced by obstacle k. Let p_k = (p_{k,x}, p_{k,y}, p_{k,z}) be the nearest point on obstacle k to the position of the moving foot u^i_t. If u^i_{t,x} = p_{k,x}, then the moving foot is already above a flat region of obstacle k and we set v_k = 0. If the moving foot is higher than and relatively close to the goal position (i.e., u^i_{t,z} > u_{g,z} and |u^i_{t,x} − u_{g,x}| < ρ16), then we expect the goal potential will be able to pull the foot in, and so we set all the vortex fields to 0 in this case. In all other cases, v_k is calculated as

v_k = ρ13 / (1 + exp(ρ14 (||u^i_t − p_k||_2 − ρ15))) · sign(u^i_{t,x} − u_{g,x}) · ṽ_k / ||ṽ_k||_2,

where ṽ_k is the vector cross product

ṽ_k = (u^i_t − p_k) × ŷ,

and ŷ is the unit vector in the y direction.

We note that v is given in terms of the 3D coordinates of the moving foot, and we need to convert it into the 18-dimensional state space of the robot. This is done by multiplying by the Jacobian made up of the partial derivatives of the components of the moving foot's position with respect to the state variables.
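A sketch of the per-obstacle vortex vector defined above; the ρ values and the example inputs are illustrative only, and the final mapping into the 18-dimensional state space via the foot Jacobian is left out.

```python
import numpy as np

def vortex_field(u_foot, u_goal, p_k, rho13, rho14, rho15, rho16):
    """Vortex vector induced by one obstacle (Appendix D), as a sketch.

    `p_k` is the nearest point on the obstacle to the moving foot. Returns a
    3-D vector; the controller would map it into the 18-D state space with
    the moving foot's Jacobian.
    """
    y_hat = np.array([0.0, 1.0, 0.0])
    if np.isclose(u_foot[0], p_k[0]):                # already above a flat region
        return np.zeros(3)
    if u_foot[2] > u_goal[2] and abs(u_foot[0] - u_goal[0]) < rho16:
        return np.zeros(3)                           # goal potential takes over
    v = np.cross(u_foot - p_k, y_hat)                # circulate around the obstacle
    n = np.linalg.norm(v)
    if n < 1e-12:
        return np.zeros(3)
    scale = rho13 / (1.0 + np.exp(rho14 * (np.linalg.norm(u_foot - p_k) - rho15)))
    return scale * np.sign(u_foot[0] - u_goal[0]) * v / n

# Example with made-up parameters: a foot just in front of a step face.
print(vortex_field(np.array([0.10, 0.0, 0.01]), np.array([0.20, 0.0, 0.04]),
                   np.array([0.12, 0.0, 0.01]), 1.0, 30.0, 0.05, 0.03))
```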