
Grounded Action Transformation for Robot Learning in Simulation

Abstract

Robot learning in simulation is a promising alternative to the prohibitive sample cost of learning in the physical world. Unfortunately, policies learned in simulation often perform worse than hand-coded policies when applied on the physical robot. Grounded simulation learning (GSL) promises to address this issue by altering the simulator to better match the real world. This paper proposes a new algorithm for GSL – Grounded Action Transformation – and applies it to learning of humanoid bipedal locomotion. Our approach results in a 43.27% improvement in forward walk velocity compared to a state-of-the-art hand-coded walk. We further evaluate our methodology in controlled experiments using a second, higher-fidelity simulator in place of the real world. Our results contribute to a deeper understanding of grounded simulation learning and demonstrate its effectiveness for learning robot control policies.

Introduction

Manually designing control policies for every possible situation a robot could encounter is impractical. Reinforcement learning (RL) provides a promising alternative to hand-coding skills. Recent applications of RL to high dimensional control tasks have seen impressive successes within simulation (Schulman et al. 2015; Lillicrap et al. 2015). Unfortunately, a large gap exists between what is possible in simulation and the reality of robot learning. State-of-the-art learning methods require thousands of episodes of experience, which is impractical for a physical robot. Aside from the time it would take, collecting the required training data may lead to substantial wear on the robot. Furthermore, as the robot explores different policies it may execute unsafe actions which could damage the robot.

An alternative to learning directly on the robot is learning in simulation (Cutler and How 2015; Koos, Mouret, and Doncieux 2010). Simulation is a valuable tool for robotics research as execution of a robotic skill in simulation is comparatively easier than real world execution. However, the value of simulation learning is limited by the inherent inaccuracy of simulators in modeling the dynamics of the physical world (Kober, Bagnell, and Peters 2013). As a result, learning that takes place in a simulator is unlikely to increase real world performance.

Grounded Simulation Learning (GSL) is a framework for learning with a simulator in which the simulator is modified with data from the physical robot, learning takes place in simulation, the new policy is evaluated on the robot, and data from the new policy is used to further modify the simulator (Farchy et al. 2013). The paper introducing GSL demonstrates the effectiveness of the method in a single, limited experiment, by increasing the forward walking velocity of a slow, stable bipedal walk by 26.7%.

This paper introduces a new algorithm – Grounded Action Transformation (GAT) – for simulator grounding within the GSL framework. The algorithm is fully implemented and evaluated using a high-fidelity simulator as a surrogate for the real world. The results of this study contribute to a deeper understanding of transfer from simulation methods and the effectiveness of GAT. In contrast to prior work, our real-world evaluation of GAT starts from a state-of-the-art walk engine as the base policy, and nonetheless is able to improve the walk velocity by over 43%, leading to what may be the fastest known walk on the SoftBank NAO robot.

Background

Preliminaries

We formalize robot skill learning as a reinforcement learning (RL) problem. At discrete time-step t the robot takes action A_t ∼ π(·|S_t) according to π, which is a distribution over actions, A_t ∈ A, conditioned on the current state, S_t ∈ S. The environment, E, responds with S_{t+1} ∼ P(·|S_t, A_t) according to the dynamics, P : S × A × S → R, which is a probability density function over next states conditioned on the current state and action. A trajectory of length L is a sequence of states and actions, τ := S_0, A_0, ..., S_L, A_L. We also define a cost function, c, which assigns a scalar cost to each (s, a) pair. We denote the value of a policy, π, as

J(π) := E_{τ∼Pr(·|π)} [ Σ_{t=0}^{L} c(S_t, A_t) ]

where Pr(τ|π) is the probability of τ when selecting actions according to π.

We assume π is parameterized by a vector θ and denote the parameterized policy as π_θ. Since θ determines the policy distribution we overload notation and refer to π_θ as θ. Given initial parameters θ_0, the goal of policy improvement is to find θ′ such that J(θ′) < J(θ_0). In this work, θ is a deterministic policy which maps state observations to an action vector a_t. Each component of a_t, denoted a_t^i, is the desired joint angle for degree-of-freedom i. At each time-step, t, θ selects a_t according to the robot's configuration in joint space, x_t, high level intention commands (e.g. walk forward at 75% of maximum velocity), ω_t, and a sensor vector, ψ_t, of IMU and foot sensor readings. The state of the robot can be fully described as s_t = ⟨x_t, ẋ_t, ẍ_t, ω_t, ψ_t⟩, where ẋ_t and ẍ_t are the time derivatives of x_t. Since the robot only observes x_t, ω_t, and ψ_t, E is partially observable. Note that while the high level intention commands are determined by the robot, from the point-of-view of θ, these directional commands are state features.

In this paper, learning takes place in a simulator, which is an environment, E_sim, that models E. Specifically, E_sim has the same state-action space as E but inevitably a different dynamics distribution, P_sim. In many robotics settings, c is user defined and thus is identical for E and E_sim. However, the different dynamics distributions mean that for any policy, θ, J(θ) ≠ J_sim(θ), since θ induces a different trajectory distribution in E than in E_sim. Thus θ′ with J_sim(θ′) < J_sim(θ_0) does not imply J(θ′) < J(θ_0) – in fact J(θ′) could even be worse than J(θ_0). In practice and in the literature, learning in a simulator frequently leads to catastrophic degradation of J. This paper explores methods for learning in E_sim which result in lower policy cost.
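To make the policy-value notation concrete, the following Python sketch estimates J(θ) by Monte Carlo rollouts; the env, policy, and cost_fn interfaces are hypothetical stand-ins for E (or E_sim), π_θ, and c, not an API from the paper.

import numpy as np

def estimate_J(env, policy, cost_fn, num_trajectories=20, max_steps=500):
    """Monte Carlo estimate of J(theta) = E[sum_t c(S_t, A_t)].

    Hypothetical interfaces: env.reset() -> state,
    env.step(action) -> (next_state, done),
    policy(state) -> action, cost_fn(state, action) -> float.
    """
    returns = []
    for _ in range(num_trajectories):
        state = env.reset()
        total_cost = 0.0
        for _ in range(max_steps):
            action = policy(state)
            total_cost += cost_fn(state, action)
            state, done = env.step(action)
            if done:  # e.g., the robot fell or reached the target
                break
        returns.append(total_cost)
    # Applying the same routine to E and to E_sim generally gives different
    # estimates because the dynamics P and P_sim differ.
    return float(np.mean(returns))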

Related Work

This section surveys work in two areas related to this research: simulation-transfer methods and model-based reinforcement learning. Simulation-transfer methods attempt to optimize J(θ) with experience from a provided model E_sim. Model-based RL attempts to construct the model as well. Our proposed method builds a partial model, uses it to modify the simulator, and then successfully optimizes a policy. This methodology places our work at the intersection of simulation learning and model-based RL.

Transfer from Simulation: One approach to simulation-transfer is using experience in simulation to reduce learning time on the physical robot. Cutler et al. (2014) use lower fidelity simulators to narrow the action search space for faster learning in higher fidelity simulators or the real world. Cully et al. (2015) use a simulator to estimate fitness values for low-dimensional robot behaviors, which gives the robot prior knowledge of how to adapt its behavior if it becomes damaged during real world operation. Cutler et al. (2015) use experience in simulation to estimate a prior for a Gaussian process model to be used with the PILCO learning algorithm (Deisenroth and Rasmussen 2011). In contrast to these methods, our method performs all learning in a grounded simulator and only uses the physical robot for policy evaluation.

Another class of simulation-transfer methods directly modifies J_sim. Koos et al. (2010) use multi-objective optimization to find policies that trade off between optimizing J_sim(π) and transferability. Iocchi et al. (2007) compute a correction function from J(π) such that J_sim(π) = J(π). Instead of modifying J_sim directly, we modify the actions passed to the simulator, which indirectly changes J_sim.

The most straightforward approach is to transfer the policy learned in simulation by using it as is. It is often noted in the literature that such direct transfer is generally ineffective.

One example of successful direct transfer is Hengst et al. (2011), who learn a control policy for a bipedal robot's feet. This policy is then used with an inverted pendulum model walk policy in the real world. We consider direct transfer in our experiments but find it ineffective for learning full bipedal locomotion.

Model-Based Reinforcement Learning: Model-based reinforcement learning aims to learn a model, P_sim, of the task MDP dynamics and then use P_sim to learn a policy. Abbeel et al. (2006) use real-world experience to modify an inaccurate model of a deterministic MDP. Learning is done with a policy gradient algorithm which improves policy parameters along the gradient of J_sim(θ) with respect to θ. The real-world experience is used to modify P_sim so that the gradient of J_sim is the same as the gradient of J. This method assumes random access modification — the ability to arbitrarily modify the simulated dynamics of any state-action pair. This assumption is restrictive in that it may be false for many simulators (including both simulators used in this paper). Our proposed algorithm makes no such assumption.

Ross et al. (2012) present a bound on the performance loss of J(θ) compared to J(θ′) for θ learned with the model. The bound gets looser as the difference between θ and θ′ gets larger. This bound provides a theoretical justification for an iterative process (such as GSL) that alternates between data collection and policy improvement. Our experiments validate the use of an iterative process in practice.

Grounded Simulation Learning

In this section we introduce the Grounded Simulation Learning (GSL) framework as presented by Farchy et al. (2013). GSL improves robot learning in simulation by using state transition data from the physical system to modify E_sim such that it is a better model of E. The process of making the simulator more like the real world is referred to as grounding.

The GSL framework assumes the following:

1. There is an imperfect simulator, E_sim = ⟨S, A, P_sim, c⟩, which models the environment of interest, E. Furthermore, E_sim must be modifiable. A modifiable simulator has parameterized transition probabilities P_φ(·|s, a) := P_sim(·|s, a; φ), where the vector φ can be changed to produce, in effect, a different simulator.

2. Additionally, GSL assumes the physical robot can record a data set, D, of trajectories when executing any policy θ.

3. Finally, GSL requires a policy improvement routine, optimize, that searches for θ which decrease J_sim(θ). The optimize routine returns a set of candidate policies, Π_c, to evaluate on the physical robot.

GSL grounds E_sim by finding φ* such that:

φ* = argmin_φ Σ_{τ∈D} d(Pr(τ|θ), Pr_sim(τ|θ, φ))    (1)

where Pr(τ|θ) is the probability of observing trajectory τ on the physical robot, Pr_sim(τ|θ, φ) is the probability of τ in the simulator with dynamics parameterized by φ, and d is a measure of similarity between these probabilities.

Assuming GSL can solve (1), the framework is as follows:

1. Execute policy θ_0 on the physical robot to collect a data set of trajectories, D.

2. Find φ* which satisfies Equation 1, estimated with trajectories from D.

3. Use optimize with J_sim and the modified P_sim to learn a set of candidate policies Π_c in simulation which are expected to perform well on the physical robot.

4. Evaluate each proposed θ_c ∈ Π_c on the physical robot and return the policy, θ′, with minimal J.

GSL can be applied iteratively, with θ′ being used to collect more trajectories to ground the simulator again for a second iteration of policy improvement. The re-grounding step is necessary since changes to θ result in changes to the distribution of states that the robot observes. When this happens, a simulator that has been modified with data from the state distribution of θ_0 may be a poor model under the state distribution of θ′.
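To make the framework concrete, the Python sketch below mirrors the four steps above plus the iterative re-grounding. The routines passed in as arguments (collect_trajectories, ground_simulator, optimize, evaluate_on_robot) are hypothetical placeholders, not functions defined by the paper.

def grounded_simulation_learning(theta0, collect_trajectories, ground_simulator,
                                 optimize, evaluate_on_robot, num_iterations=2):
    """Hypothetical sketch of the GSL loop described above."""
    theta = theta0
    for _ in range(num_iterations):
        # Step 1: execute the current policy on the physical robot to collect D.
        D = collect_trajectories(theta)
        # Step 2: ground the simulator, choosing phi so that simulated
        # trajectories match the real ones (Equation 1).
        phi = ground_simulator(D, theta)
        # Step 3: learn candidate policies in the grounded simulator.
        candidates = optimize(phi, theta)
        # Step 4: evaluate candidates on the robot; keep the one with minimal cost J.
        theta = min(candidates, key=evaluate_on_robot)
    return theta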

Farchy et al. proposed a GSL algorithm, which we refer to as GUIDED GSL. GUIDED GSL blends simulator modification with human guided policy optimization. This algorithm demonstrated the efficacy of GSL in optimizing the forward walk velocity of a bipedal robot. The robot in that research began learning with a slow but stable walk policy. Four iterations of GUIDED GSL led to an improvement of over 26% on this baseline walk. Two limitations of GUIDED GSL are that:

1. It makes the assumption that actions in simulation achieve their desired effect instantaneously.

2. It required a human expert to manually select which components of θ were allowed to change during the optimize routine.

In this work, we introduce an enhanced GSL methodology which eliminates both limitations and optimizes one of the fastest available NAO walk engines.

Robot Platform and Task

Before presenting GAT, our new GSL algorithm, we describe the robot platform, initial walk policy, and simulators which we use for evaluation. While our method is applicable to other robots, tasks, and simulators, we describe these components first in order to ground the algorithm's presentation.

Robot Platform: Our target task is bipedal walking using the SoftBank NAO.[1] The NAO is a humanoid robot with 25 degrees of freedom (see Figure 2a). For walking, our NAO uses an open source walk engine developed at the University of New South Wales (UNSW) (Ashar et al. 2015; Hall et al. 2016). This walk engine has been used by at least one team in each of the past three RoboCup Standard Platform League (SPL) championship games, in which teams of five NAOs compete in soccer matches. To the best of our knowledge, it is one of the fastest open source walks available.[2] The walk is a zero moment point walk based on an inverted pendulum model. The walk is closed loop, using the inertial measurement unit (IMU) sensors and a learned ankle control policy to improve stability. The UNSW walk engine has 15 parameters that control features of the walk (e.g. step height, pendulum model height). The values of the parameters from the open source release constitute θ_0. For full information on the UNSW walk see (Hengst 2014).

[1] https://www.ald.softbankrobotics.com/en

Simulators: We use two different simulators in this work. The first is the open source SimSpark simulator.[3] The simulator simulates realistic physics with the Open Dynamics Engine.[4] This simulator was also used in the work introducing GSL (Farchy et al. 2013). We also use the Gazebo simulator[5] provided by the Open Source Robotics Foundation.[6] Gazebo is an open source simulator that is commonly used in robotics research. SimSpark enables fast simulation but is a lower fidelity model of the real world. Gazebo enables high fidelity simulation with an added computational cost. In one of our experiments we consider the more realistic Gazebo environment as E and SimSpark as E_sim. The difference between these simulators and the physical robot that our algorithm corrects is how actions change the robot's configuration. In SimSpark, actions achieve the desired command angle almost instantaneously. On the physical robot there is a delay. Gazebo also models joints as more reactive than the real world, although less reactive than in SimSpark.

Grounded Action Transformation

We now introduce our principal algorithmic contribution, GSL with Grounded Action Transformation (GAT), which improves the grounding step (step 2) of the GSL framework. GAT improves grounding by correcting differences in the effects of actions between E and E_sim. GSL with GAT is presented in Algorithm 1.

The GSL framework grounds the simulator by finding φ* that satisfies (1). Since it is often easier to minimize error in the one-step dynamics distribution than error in the trajectory distributions, GAT uses:

φ* = argmin_φ Σ_{(s_t, a_t, s_{t+1})∈D} ||P(s_{t+1}|s_t, a_t) − P_φ(s_{t+1}|s_t, a_t)||²    (2)

as a surrogate loss function which can be minimized with transitions observed in D. GSL with GAT begins by collecting the dataset D (Algorithm 1, line 4).
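For intuition on why matching the one-step dynamics is a reasonable surrogate for the trajectory-level objective in (1), the LaTeX fragment below writes out the standard factorization of a trajectory's probability. This derivation is an added illustration, assuming E and E_sim share an initial state distribution; it is not text from the paper.

% Trajectory probability under policy \theta in the real environment:
\Pr(\tau \mid \theta) = \Pr(S_0) \prod_{t=0}^{L-1} \pi_\theta(A_t \mid S_t)\, P(S_{t+1} \mid S_t, A_t)
% and in the grounded simulator:
\Pr\nolimits_{\mathrm{sim}}(\tau \mid \theta, \phi) = \Pr(S_0) \prod_{t=0}^{L-1} \pi_\theta(A_t \mid S_t)\, P_\phi(S_{t+1} \mid S_t, A_t)
% The two expressions differ only through the one-step dynamics factors,
% so driving P_\phi toward P (Equation 2) also drives the two trajectory
% distributions together.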

[2] While other regular RoboCup participants also have competitive walks, we are not aware of any published result that confirms a speed as much as 42% faster than UNSW's, as we achieve in this paper.
[3] http://simspark.sourceforge.net
[4] http://www.ode.org/
[5] http://gazebosim.org/
[6] http://www.osrfoundation.org/

Physics-based simulators (such as SimSpark and Gazebo) have a large number of parameters determining the physics of the simulated environment (e.g. friction coefficients). However, using these parameters as φ is not amenable to numerical optimization of (2). To find φ* efficiently, GAT uses a parameterized action transformation function which takes the agent's state and action as input and outputs a new action which – when taken in simulation – will result in the robot transitioning to the same next state it would have in E. We denote this function g_φ : S × A → A; the parameters of g serve as φ under the GSL framework. GAT constructs g with parameterized models of the robot's dynamics and the simulator's inverse dynamics. Assuming the simulated robot can record a dataset D_sim of experience like the physical robot, GAT reduces (2) to a supervised learning problem.

GAT defines g := g_φ by a deterministic forward model of the robot's dynamics, f, and a deterministic model of the simulator's inverse dynamics, f⁻¹_sim. The function f maps (s_t, a_t) to the maximum likelihood estimate of x_{t+1} under P. The function f⁻¹_sim maps (s_t, s_{t+1}) to the action that is most likely to produce this transition in simulation. When executing θ in simulation, the robot selects a_t ∼ π_θ(·|s_t) and then uses f to predict what the resulting configuration, x_{t+1}, would be in E. Then a_t is replaced with â_t := f⁻¹_sim(s_t, f(s_t, a_t)). The result is that the robot achieves the exact x_{t+1} it would have on the physical robot.[7]

[7] GAT subsumes GUIDED GSL, which makes the additional assumption f⁻¹_sim(s_t, x_{t+1}) = x_{t+1}, in other words that requesting joint angles means the robot achieves those joint angles at the next time-step. GAT removes all human guidance from policy optimization.

In practice, f and f⁻¹_sim are represented with supervised regression models and learned from D and D_sim respectively (Algorithm 1, lines 5-6). The approximation of g introduces noise into the robot's motion. To ensure stable motions, GAT uses a smoothing parameter α. The action transformation function (Algorithm 1, line 7) is then defined as:

g(s_t, a_t) := α f⁻¹_sim(s_t, f(s_t, a_t)) + (1 − α) a_t.

In our experiments, we set α as high as possible subject to the walk remaining stable. Figure 1 illustrates the GAT-modified E_sim. GAT then proceeds to improve θ with optimize within the grounded simulator (lines 8-9).
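As an illustration, the action transformation g can be built from the two learned models as in the sketch below; forward_model and inverse_model_sim are hypothetical callables standing in for f and f⁻¹_sim, and the smoothing follows the definition above.

import numpy as np

def make_action_transform(forward_model, inverse_model_sim, alpha):
    """Return g(s, a) = alpha * f_sim_inverse(s, f(s, a)) + (1 - alpha) * a.

    forward_model(state, action) predicts the configuration the real robot
    would reach; inverse_model_sim(state, config) returns the simulator
    action that reaches that configuration. Both are hypothetical callables.
    """
    def g(state, action):
        predicted_config = forward_model(state, action)          # predicted x_{t+1} in E
        sim_action = inverse_model_sim(state, predicted_config)  # action for E_sim
        # Smooth between the transformed and original action for stability.
        return alpha * np.asarray(sim_action) + (1.0 - alpha) * np.asarray(action)
    return g

At each simulation time-step the policy's action would then be passed through g before being sent to the simulator, which is exactly the modified-simulator structure shown in Figure 1.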

GAT Implementation

In this work, GAT uses two neural networks to approximate f and f⁻¹_sim. Each function is a three layer network with 200 hidden units in the first layer and 180 hidden units in the second. While a_t is a vector of desired joint angles, â_t is encoded as a desired change in x_t, which was found to improve prediction. Prior knowledge of the relationship between (s_t, a_t) and s_{t+1} is incorporated by having f predict joint accelerations instead of the next state. The accelerations can then be integrated to produce the resulting next state. Additionally, f and f⁻¹_sim regress to the sine and cosine of the target angular accelerations and then pass the outputs through the arctan function to produce the final angular acceleration. Passing the network outputs through the arctan function ensures f and f⁻¹_sim produce valid joint angles.

Algorithm 1 Grounded Action Transformation (GAT) Pseudocode.
Input: An initial policy, θ, the environment, E, a simulator, E_sim, smoothing parameter α, and a policy improvement method, optimize. The function RolloutN(E, θ, N) executes N trajectories with θ in environment E and returns the observed state transition data. The functions trainForwardModel and trainInverseModel estimate models of the forward and inverse dynamics respectively.

1:  function GAT
2:      θ_0 ← θ
3:      D_robot ← RolloutN(E, θ_0, N)
4:      D_sim ← RolloutN(E_sim, θ_0, N)
5:      f ← trainForwardModel(D_robot)
6:      f⁻¹_sim ← trainInverseModel(D_sim)
7:      g(s, a) ← α f⁻¹_sim(s, f(s, a)) + (1 − α) · a
8:      Π ← optimize(E_sim, θ, g)
9:      return argmin_{θ∈Π} J(θ)
10: end function

Figure 1: An illustration of the modifiable simulator GAT induces. At each time-step the robot takes an action, a_t, and passes a_t to a modification function, g. The modification function uses a deterministic model of the real robot's dynamics, f, to predict the effect of executing a_t on the physical robot. Then, a model of the simulated robot's inverse dynamics uses the prediction, x̂_{t+1}, to predict the action â_t which will achieve x̂_{t+1} in simulation. Finally, â_t is passed to the environment, E_sim, and the resulting state transition will be similar to the transition that would have occurred in E.

The true state, s_t, is estimated by concatenating joint configurations x_t, x_{t−1}, ..., x_{t−4} and past actions a_{t−1}, ..., a_{t−4}. The state estimate ⟨x_t, ..., x_{t−4}, a_{t−1}, ..., a_{t−4}⟩ improves the predictions of f and f⁻¹_sim because it implicitly captures the unobserved ẋ_t and ẍ_t state variables. The length of the configuration history was chosen to balance predictive accuracy with keeping the number of inputs to the networks small. Both networks are trained with backpropagation.
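A forward model along these lines could look like the following PyTorch sketch. The layer sizes and the history-based input follow the description above, but the flattened input encoding, the exact dimensions, and the use of atan2 to combine the sine/cosine outputs are our interpretation for illustration, not the authors' released code.

import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Sketch of f: predicts joint accelerations from a short history of
    joint configurations and past actions (assumed 25-DoF robot)."""

    def __init__(self, num_joints=25, history=5):
        super().__init__()
        # Input: 5 joint configurations plus 4 past actions, flattened.
        input_dim = num_joints * history + num_joints * (history - 1)
        self.hidden = nn.Sequential(
            nn.Linear(input_dim, 200), nn.ReLU(),
            nn.Linear(200, 180), nn.ReLU(),
        )
        # Regress to the sine and cosine of each target angular acceleration.
        self.sin_head = nn.Linear(180, num_joints)
        self.cos_head = nn.Linear(180, num_joints)

    def forward(self, state_history):
        h = self.hidden(state_history)
        # atan2(sin, cos) recovers an angle-valued acceleration, keeping the
        # output in a valid angular range; the acceleration would then be
        # integrated to obtain the predicted next configuration.
        return torch.atan2(self.sin_head(h), self.cos_head(h))

# Example: a batch of one flattened history (5*25 + 4*25 = 225 inputs)
# yields 25 predicted joint accelerations.
# ForwardModel()(torch.randn(1, 225)).shape == (1, 25)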

Empirical Results

Experimental Set-up

We evaluate GAT on the task of bipedal robot walking. The walk takes a target forward velocity parameter in the range [0, 1]. We set this parameter to 0.75, which we found led to the fastest walk that was reliably stable. The robot walks forward towards a target at this velocity. If the robot's angle to the target becomes greater than five degrees it turns back towards the target while continuing to walk forward. In all environments, average forward velocity determines J. On the physical robot a trajectory terminates once the robot has walked four meters or falls. A trajectory generated with θ_0 lasts ≈ 20.5 seconds on the robot. In simulation a trajectory terminates after a fixed time interval (7.5 seconds in SimSpark and 10 seconds in Gazebo) or when the robot falls. Previous work has shown that when using a model estimated from data it is better to use shorter action trajectories to avoid overfitting to model error (Jiang et al. 2015). At each step t, s_t in E_sim will be different than s_t in E even if s_{t−1} and a_{t−1} were the same in each environment. This error compounds over the course of a trajectory since state error at s_t is propagated forward into s_{t+1}. This observation motivates using shorter trajectory lengths as simulator fidelity decreases.

Figure 2: The three robotic environments used in this work: (a) an Aldebaran NAO robot, our target physical platform; (b) a simulated NAO in Gazebo; (c) a simulated NAO in SimSpark.

Optimizing θ: We use the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) algorithm (Hansen 2006) as the optimize routine. CMA-ES is a derivative-free policy search method which samples a population of candidate θ from a Gaussian distribution over parameters. The top k parameters are used to update the sampling distribution so that the mean is closer to an optimal policy. We modify J_sim(θ) for the optimization by adding a penalty of −15 for any trajectory in which the robot falls. The added penalty encourages CMA-ES to strongly favor stable policies over faster, less stable ones. To clarify terminology, a generation refers to a single update of CMA-ES; an iteration refers to a complete cycle of GSL.
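A minimal version of such an optimize routine could use the open source cma Python package's ask/tell interface, as sketched below. The rollout_cost function, the initial step size, and the sign convention for the fall penalty are illustrative assumptions rather than the paper's exact setup; the population size, rollouts per policy, and generation count match the numbers reported later in this section.

import cma  # pip install cma
import numpy as np

def optimize_in_sim(theta0, rollout_cost, generations=10, popsize=150, rollouts=20):
    """Hypothetical CMA-ES loop. rollout_cost(theta) runs one simulated
    trajectory and returns (cost, fell), where lower cost is better."""
    es = cma.CMAEvolutionStrategy(theta0, 0.1, {"popsize": popsize})
    candidates = []
    for _ in range(generations):
        population = es.ask()
        costs = []
        for theta in population:
            results = [rollout_cost(theta) for _ in range(rollouts)]
            # Penalize trajectories in which the robot fell; the sign is
            # chosen so that falls increase the minimized cost.
            costs.append(np.mean([c + (15.0 if fell else 0.0) for c, fell in results]))
        es.tell(population, costs)
        # Keep the best policy of each generation as a candidate for the robot.
        candidates.append(population[int(np.argmin(costs))])
    return candidates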

SimSpark to Gazebo: Since a large number of trials are difficult to obtain on a physical robot, we present a study of GSL using Gazebo as a surrogate for the real world. In this setting we evaluate the effectiveness of GAT compared to learning with no grounding and to grounding E_sim by injecting noise into the robot's actions. Adding an "envelope" of noise has been used before to minimize simulation bias by preventing the policy improvement algorithm from overfitting to the simulator's dynamics (Jakobi, Husbands, and Harvey 1995). We refer to this baseline as NOISE-ENVELOPE. Since GAT with function approximation introduces noise into the robot's actions, we wish to verify that GAT offers benefits over such methods. NOISE-ENVELOPE adds standard Gaussian noise to the robot's actions within the ungrounded simulator. We also attempted to evaluate GUIDED GSL, but preliminary experiments showed that the assumption that actions achieve their desired effect instantaneously did not hold for this setting.

We run 10 trials of each method. For GAT we collect 50 trajectories of robot experience to train f and 50 trajectories of simulated experience to train f⁻¹_sim. For each method, we optimize θ for 10 generations of the CMA-ES algorithm. In each generation, 150 policies are sampled and evaluated in simulation. CMA-ES estimates J_sim with 20 trajectories from each policy. Overall, the CMA-ES optimization requires 30,000 simulated trajectories for each trial.

Table 1 gives the average improvement in stable walk policies for each method and the number of trials in which a method failed to produce a stable improvement. This table only considers policies found after the first generation of CMA-ES. The reason for this is that the policies in the first generation were randomly sampled from an initial search distribution. We consider learning to begin once CMA-ES has used the J_sim values of the first generation to update the search distribution. Using J_sim to identify θ which improve J in the first generation is a policy evaluation problem, while improvement afterwards is a policy improvement problem.

Simulation to NAO Set-up: We evaluate GAT for transferring simulation learning to the physical NAO using both SimSpark and Gazebo as E_sim. The data set D consists of 15 trajectories collected with θ_0 on the physical NAO. For each iteration, we optimize θ for 10 generations of the CMA-ES algorithm. We evaluate the best policy from each generation with 5 trajectories on the physical robot. If the robot falls in any of the 5 trajectories the policy is considered unstable.

One challenge with our setting is that the simulators receive robot actions at 50 Hz while the NAO sends joint commands at 100 Hz. The discrepancy in action frequency poses a problem for using real world data to modify how joints move in simulation. By skipping every other measurement to get an effectively 50 Hz data trace, we are able to model how the physical robot's joints move at the simulator's frequency.
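A simple way to implement this alignment is to downsample the 100 Hz joint trace by keeping every other sample, as in the small sketch below; the array shapes are assumptions for illustration.

import numpy as np

def downsample_to_sim_rate(trace_100hz, factor=2):
    """Keep every `factor`-th sample of a (timesteps, num_joints) trace,
    e.g. converting a 100 Hz robot log into an effective 50 Hz trace that
    matches the simulator's action frequency."""
    return np.asarray(trace_100hz)[::factor]

# Example: a 10-second log at 100 Hz for 25 joints becomes 500 samples at 50 Hz.
log = np.zeros((1000, 25))
assert downsample_to_sim_rate(log).shape == (500, 25)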

Experimental Results

SimSpark to Gazebo Results: Table 1 shows that GAT maximizes improvement in J while minimizing iterations without improvement. NOISE-ENVELOPE improves upon no grounding in both improvement and number of iterations without improvement. Adding noise to the simulator encourages CMA-ES to propose robust policies which are more likely to be stable. However, GAT further improves over NOISE-ENVELOPE. This result demonstrates that action transformations are grounding the simulator in a more effective way than simply injecting noise.

Table 1 also shows that on average GAT finds an improved policy within the first few policy updates after grounding.

Method           No Ground   Noise-Envelope   GAT
% Improvement    11.094      18.93            23.37
Failures         7           5                1
Best Gen.        1.33        6.6              2.67

Table 1: This table compares Grounded Action Transformation (GAT) with baseline approaches for transferring learning between SimSpark and Gazebo. The first row displays the average maximum improvement over all stable walks found by each method after the first policy update made by CMA-ES. The second row is the number of times a method failed to find a stable walk. The third row gives the average generation of CMA-ES when the best policy was found. No Ground refers to learning done in the unmodified simulator.

When learning with no grounding finds an improvement, it is also usually in an early generation of CMA-ES. As policy improvement progresses, the best policies in each generation begin to overfit to the dynamics of E_sim. Without grounding, overfitting happens almost immediately. NOISE-ENVELOPE is shown to be more robust to overfitting since any policy it proposes achieved good cost in a noisy E_sim. The grounding done by GAT is inherently local to the trajectory distribution of θ_0. Thus, as θ changes during policy improvement, the action transformation function fails to produce a more realistic simulator. Noise modification methods can mitigate overfitting by emphasizing robust policies, although they are also limited in finding as strong an improvement as GAT.

Simulator to Physical NAO Results: Our final empirical evaluation applies GAT to learning bipedal walks on a physical NAO. Table 2 gives the walk velocity of θ′ returned by both methods. The physical robot walks at a velocity of 19.52 cm/s with θ_0. Two iterations of GAT with SimSpark increased the walk velocity of the NAO to 27.97 cm/s — an improvement of 43.27% compared to θ_0.[8] GAT with SimSpark and GAT with Gazebo both improved walk velocity by over 30%. This result demonstrates the generality of our approach across different simulators.

As in the previous experiment, policy improvement with CMA-ES required 30,000 trajectories per iteration to find the 10 policies that were evaluated on the robot. In contrast, the total number of trajectories executed on the physical robot is 65 (15 trajectories in D and 5 evaluations per θ′ ∈ Π). This result demonstrates that GAT can use sample-intensive simulation learning to optimize real world skills with a low number of trajectories on the physical robot.

Farchy et al. demonstrated the benefits of re-grounding and further optimizing θ. We reground using the 15 trajectories collected with the best policy found by GAT with SimSpark and optimize for a further 10 generations in simulation. The second iteration results in a walk which averages 27.97 cm/s for a total improvement of 43.27% over θ_0.

Finally, we evaluate θ_0 with 100% of maximum velocity requested. The average velocity of this walk across five stable trajectories is 24.3 cm/s. This result shows that GAT can learn walk policies which outperform one of the best hand-coded walks available.

[8] A video of the learned walk policies is included in the supplemental material.

                     θ_0     GAT with SimSpark   GAT with Gazebo
Velocity (cm/s)      19.52   26.27               26.89
Improvement (%)      0.0     34.58               37.76

Table 2: This table gives the maximum learned velocity and percent improvement for each method starting from θ_0 (left column). Not shown: Iteration 2 of GAT with SimSpark resulted in a policy with velocity 27.97 cm/s, for a total improvement of 43.27% over θ_0.

Discussion and Future Work

Our proposed algorithm, GAT, has some limitations that we discuss here. The decision to learn an action modification function, g, makes the assumption that for all (s_t, a_t, s_{t+1}) that could be observed on the physical robot there exists an action â_t which will produce the same transition when used in place of a_t. We posit that this assumption is often true since a_t can usually be executed with more or less force in simulation to achieve the desired response. However, this assumption is likely to break down under contact dynamics where external forces resist the robot's actions. Other tasks may introduce other forms of simulator bias caused by discrepancies which GAT is currently limited in handling. An important direction for future work is to characterize the settings where this approach is limited and to identify alternatives.

We specify simulator grounding as a supervised learning problem; however, the distribution of inputs to the learned models, f and f⁻¹_sim, during policy execution differs from the training distributions. The distribution change is likely to weaken the quality of action modification under g. To see why the change happens, consider that f is trained on (s_t, a_t, s_{t+1}) sampled from executing θ on the real robot while f⁻¹_sim is trained on (s_t, a_t, s_{t+1}) sampled in simulation. During simulator modification, both models are evaluated on data sampled from the distribution of θ in the grounded simulator. We have started to explore methods similar to the Dataset Aggregation (DAgger) algorithm (Ross and Bagnell 2012) which collect new data to retrain f⁻¹_sim as the state distribution of θ in E_sim changes with grounding.

Finally, this paper considers action modification, but a complementary approach could consider sensor modification. Sensor modification would add another layer to the modified simulator such that the environment returns a state and the sensor modification function predicts what the robot would observe in that state. The two methods have an interesting interaction since sensor modification changes the action θ selects, while action modification changes the next state, which changes the observed sensor values as well.

Conclusion

This paper proposed and evaluated the Grounded Action Transformation (GAT) algorithm for grounded simulation learning. Our method led to a 43.27% improvement in the walk velocity of a state-of-the-art bipedal robot. We conducted an empirical study which analyzed the benefits of GAT compared to a pair of baseline simulation-transfer methods. This experiment demonstrates that GAT allows learning to progress further in simulation than other methods.

References

Abbeel, P.; Quigley, M.; and Ng, A. Y. 2006. Using inaccurate models in reinforcement learning. In 23rd International Conference on Machine Learning, ICML.

Ashar, J.; Ashmore, J.; Hall, B.; Harris, S.; Hengst, B.; Liu, R.; Mei (Jacky), Z.; Pagnucco, M.; Roy, R.; Sammut, C.; Sushkov, O.; Teh, B.; and Tsekouras, L. 2015. RoboCup SPL 2014 champion team paper. In Bianchi, R. A. C.; Akin, H. L.; Ramamoorthy, S.; and Sugiura, K., eds., RoboCup 2014: Robot World Cup XVIII, volume 8992 of Lecture Notes in Computer Science. Springer International Publishing. 70–81.

Cully, A.; Clune, J.; Tarapore, D.; and Mouret, J.-B. 2015. Robots that can adapt like animals. Nature.

Cutler, M., and How, J. P. 2015. Efficient reinforcement learning for robots using informative simulated priors. In IEEE International Conference on Robotics and Automation, ICRA.

Cutler, M.; Walsh, T. J.; and How, J. P. 2014. Reinforcement learning with multi-fidelity simulators. In IEEE Conference on Robotics and Automation, ICRA.

Deisenroth, M. P., and Rasmussen, C. E. 2011. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, ICML.

Farchy, A.; Barrett, S.; MacAlpine, P.; and Stone, P. 2013. Humanoid robots learning to walk faster: From the real world to simulation and back. In Twelfth International Conference on Autonomous Agents and Multiagent Systems, AAMAS.

Hall, B.; Harris, S.; Hengst, B.; Liu, R.; Ng, K.; Pagnucco, M.; Pearson, L.; Sammut, C.; and Schmidt, P. 2016. RoboCup SPL 2015 champion team paper.

Hansen, N. 2006. The CMA evolution strategy: a comparing review. In Lozano, J.; Larranaga, P.; Inza, I.; and Bengoetxea, E., eds., Towards a New Evolutionary Computation. Advances on Estimation of Distribution Algorithms. Springer. 75–102.

Hengst, B.; Lange, M.; and White, B. 2011. Learning to control a biped with feet. In Humanoids.

Hengst, B. 2014. rUNSWift Walk2014 report, RoboCup Standard Platform League. Technical report, The University of New South Wales.

Iocchi, L.; Libera, F. D.; and Menegatti, E. 2007. Learning humanoid soccer actions interleaving simulated and real data. In Second Workshop on Humanoid Soccer Robots.

Jakobi, N.; Husbands, P.; and Harvey, I. 1995. Noise and the reality gap: The use of simulation in evolutionary robotics. In European Conference on Artificial Life, 704–720. Springer.

Jiang, N.; Kulesza, A.; Singh, S.; and Lewis, R. 2015. The dependence of effective planning horizon on model accuracy. In International Conference on Autonomous Agents and Multiagent Systems, AAMAS.

Kober, J.; Bagnell, J. A.; and Peters, J. 2013. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research.

Koos, S.; Mouret, J.-B.; and Doncieux, S. 2010. Crossing the reality gap in evolutionary robotics by promoting transferable controllers. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, 119–126. ACM.

Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. CoRR abs/1509.02971.

Ross, S., and Bagnell, J. A. 2012. Agnostic system identification for model-based reinforcement learning. In 29th International Conference on Machine Learning, ICML.

Schulman, J.; Moritz, P.; Levine, S.; Jordan, M. I.; and Abbeel, P. 2015. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations.