
Learning To Simulate Accurately And

Learning To Learn From SimulationPrimary PI: Christopher G. Atkeson ([email protected])

Email: [email protected], Phone: +1-412-512-0150

Co-PI: Akihiko Yamaguchi

A proposal to the SONY 2016 Focused Research Award program in the area of Evolving

Reinforcement Learning from Simulation to Real Environment.

Abstract: We believe there are two important challenges for fast, robust, scalable and precise transition

of learned knowledge from simulation to real environments. The first is making better simulators using learned

knowledge from the real world and also from simulation. The second is learning robust and more efficient

control and learning algorithms from simulation. We propose emphasizing (1) task-level modeling for simu-

lation. We have found that modeling complex physics in the context of a task and a particular robot strategy is

much more effective than trying to learn general models. For example, modeling shaking salt from a dispenser

from first principles is very difficult. Learning models of granular materials in general or for all possible salt

dispensers is very difficult. However, learning models of the outcome of a particular shaking strategy applied

to a particular salt dispenser is much easier and more accurate. We propose (2) using learned task models at

several levels in hierarchical simulation: qualitative/symbolic models that represent different physics and task

phases such as in-contact/unconstrained or sliding/stuck; task-level models that represent continuous outcomes

of tasks such as how much salt will come out when a particular salt shaker is shaken in a particular way; and

process models such as how granular materials will flow or how much friction there will be. We propose (3)

learning probability distributions of the physics of the real world. Estimated probability models can be used

in simulation to (4) design robust policies and more efficient and robust learning algorithms. Modeling

how much a robot knows and more importantly what it doesn’t know is very useful in deciding how best to

learn what it expects to not know. For example, knowing the possible range of an unknown delay is very useful

in designing both a robust policy to handle any possible delay, and an efficient learning algorithm to learn the

delay so learned policies can take advantage of knowing the delay and perform better. These learned models,

better policies, and learning algorithms can be shared across robots to better evolve reinforcement learning from

simulations to real environments.

Motivation: Our experience with the DARPA Robotics Challenge (Figure 1), our SARCOS humanoids, and

experiments with manipulating deformable objects, liquids, and granular materials has convinced us of the need

for innovative simulation approaches that can generate robust behaviors in real environments with less trial-

and-error. In the DARPA Robotics Challenge, many teams including our WPI-CMU team found learning from

Figure 1: DARPA Robotics Challenge and robot juggling


Figure 2: In our past work we used inverse models of the task to map task errors to command corrections [2, 1].

We have found that optimization is more effective than trying to track a learned reference movement, especially

with non-minimum phase plants. This figure shows an example of a robot learning to do a nonlinear, unstable,

and non-minimum phase task (pendulum swing up) from watching a human do it (once) [6, 7, 4]. Left: A

cartoon of the robot swinging up an inverted pendulum by moving the hand side to side. Middle (pendulum

motion) and Right (hand motion): The figures show the teacher’s demonstration of how to manipulate a

pendulum by moving the hand side to side and the robot’s practice trials. On the first attempt the robot imitates

its perception of the teacher’s movement, which fails to swing the pendulum upright due to differences in grasp

and slight deviations in the trajectory of handle position and three dimensional orientation. The robot then uses

optimization and an updated model of the task dynamics to adapt its hand motion. The 2nd trial is better, the

model is updated again, and the 3rd trial succeeds. We have found that model-based optimization greatly speeds

up learning. However, model-based learning sometimes gets stuck because updating the model with new data

causes only a very slow change in the policy because the planned movement is in a different area of state space

from the new data. We use a different form of learning, policy optimization, to speed up learning in this case.

simulation to not be very useful, and we had to build a replica of the test course and practice tasks with the actual

robot for over two months. Much of the time the robot was not available due to breakdowns, and this process

required a large number of people working around the clock to take advantage of every second of available

practice time. Similarly, my graduate students are reluctant to do simulations of our SARCOS humanoid. At

the high level of performance at which we operate the robot, the students feel simulation is a waste of time, since improving performance depends on handling the unmodeled features of the robot, and they jump directly to

experiments with the actual robot. This means it takes a year or so to program the robot to do a new high-

performance task. This is not a cost-effective way to program high-quality robots. To simplify programming

robots, we have developed very fast reinforcement learning algorithms that use simulation to perform mental

practice using learned models during actual robot learning: robot juggling, air hockey, and humanoid walking

(Figure 2) [?, ?, ?].

However, it is clear to us that we can make better use of simulation than just mental practice using learned

models. Morimoto and Doya provided an early example of how reinforcement learning in simulation can ac-

celerate actual robot learning [?]. Unfortunately, the techniques they used almost 20 years ago are still state of

the art today. We have faster computers and can do more simulation, but the paradigm is the same: practice the

entire task in simulation, and then practice the entire task with the robot. Idealized rigid body dynamic models

are primarily used to perform the simulation (for example SDFast then and MuJoCo now). There is no robust

planning, or learning to simulate or learn more effectively using the simulation, and the simulation is not used

once the robot starts to practice in reality.

In our case study on robot pouring [49], we built a liquid simulator (Figure 3). We found that existing

fast approximate liquid simulations seemed to be far from reality and accurate liquid simulation was very slow,

especially for simulating different types of materials in the context of different tasks; for example shaking to

pour tomato sauce, and squeezing a shampoo bottle. We found that using learned task input/output models to

simulate liquids and granular materials in the context of a particular task was much more effective.
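
As a concrete illustration of what we mean by a task-level input/output model, the following minimal sketch (hypothetical data and parameter names; Gaussian-kernel locally weighted regression in the spirit of [5]) maps a shaking command directly to a predicted amount dispensed, with no fluid or granular simulation in the loop.

```python
# Minimal sketch (hypothetical shaking data; Gaussian-kernel locally weighted
# regression) of a task-level input/output model: task command -> task outcome.
import numpy as np

def lwr_predict(X, y, x_query, bandwidth=0.5):
    """Predict an outcome at x_query with locally weighted linear regression."""
    d2 = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))            # kernel weights
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])       # affine features
    W = np.diag(w)
    beta = np.linalg.solve(Xa.T @ W @ Xa + 1e-6 * np.eye(Xa.shape[1]),
                           Xa.T @ W @ y)                # weighted least squares
    return np.append(x_query, 1.0) @ beta

# Hypothetical logged trials: [shake amplitude (cm), shake frequency (Hz)] -> grams of salt.
X = np.array([[2.0, 1.0], [2.0, 2.0], [4.0, 1.0], [4.0, 2.0], [6.0, 2.0]])
y = np.array([0.3, 0.9, 0.8, 2.1, 3.0])
print(lwr_predict(X, y, np.array([5.0, 2.0])))          # predicted grams for a new command
```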

In all of this work, what we have noticed is that low level commands, controller gains, and nonlinear policies

do not transfer well from simulation to robot, or even from robot to robot. In the DARPA Learning Locomotion

program and the DARPA Robotics Challenge we worked with a set of identical robots (in the first case Little


Figure 3: Left and Middle: Actual robot pouring by our PR2 and Baxter robots. Right: (a) Pouring simulation.

(b) Using different particle physics to simulate variations in material properties.

Dog and the second case Atlas I). We found that successful performance on one robot did not transfer well

to a test site with a supposedly identical robot. What did transfer well were task structure, task strategies, and task-specific learning algorithms, and, to a lesser extent, filter, controller, and learning algorithm modes and time

constants/bandwidths. We can use simulation to find good task strategies, task-specific learning algorithms,

filter, controller, and learning algorithm properties.

Goals: One of our research goals is to test the idea that large libraries (thousands) of matched models

and policies (a skill library) can improve robot behavior generation and learning to human-level performance

and efficiency. We will verify and explore this idea on practical robotics domains, such as deformable object

manipulation and cooking, which also involves manipulating liquids and granular materials (Figure 3). Our key

idea for making a better simulation is to emphasize task models, which are co-developed with corresponding

task policies [?, ?]. The simulation knows how to pour salt from a particular salt dispenser, and accurately

models that behavior. Task-level modeling minimizes simulation error due to integrating behavior across time

and actually facilitates transfer of learning across tasks. Combining models and policies allows the skill library

to support both model-based and model-free learning.

We will develop our work on 1) performing simulation using task-level learned models as well as idealized

physical models to maximize the fidelity of our simulations, and 2) learning a hierarchy of models including

symbolic and qualitative models that select features and bound effects, numerical task input/output models that

make specific predictions given specific robot commands, and finite element and other forms of detailed me-

chanical, material, surface interaction, thermal, and chemical models that verify plans developed at higher levels

as well as enable simulations to discover new strategies using different physical phenomena. These elements

will be combined into a Skill Simulator (Figure 4). Such a simulator would enable robots to learn robust poli-

cies from it using mental practice, analyze the reasons for failures in actual practice, learn to prepare for future

learning (learning to learn), and estimate human intentions during learning from demonstrations.

A second key idea is to 3) learn and utilize probabilistic models of what is not known exactly about

the world. Probabilistic models such as models of prior distributions and measurement and process noise are

used to set Kalman filter parameters, for example. 4) Probability distributions of possible model parameters

and model structures will be used to guide robust policy learning. 5) In Learning To Learn, simulation over learned probabilistic models is used to discover good sequences of tasks, features, and directions in task space to explore, enabling an actual robot to learn efficiently in the real world. Learning To Learn addresses an expanded

vision of curriculum learning, in which tasks and features are sequenced in a generalized form of “shaping”. We

believe human motor learning is effective because humans identify the most important direction in command

space and learn that first. Only later are more dimensions added to learning in a principal components-like

process to deal with the true dimensionality of the task. In addition, we will develop our work on 6) improving

partial robot-in-the-loop control and simulation techniques where the robot is always successful or almost

successful as it performs more and more components of a task, and spends as little time as possible performing

badly and learning little. This is a form of assisted learning, like bicycle training wheels. We will also contribute

to robotics with research on learning from demonstration and deformable object manipulation.


Figure 4: Conceptual illustration of the proposed Skill Simulator. It consists of a skill library (symbolic repre-

sentations of tasks) and a model server. Each model may be an engineered model or a learned model (memory-

based or using parametric structures such as neural networks).

Methods

We have chosen difficult domains to test our ideas and framework: manipulating deformable objects, liquids, and

granular materials (Figure 3). Cooking is an everyday task that involves these domains, including (1) Grasping

objects such as rigid and soft containers, food, and tools. (2) Pouring liquids, powders, and particles. (3) Cutting

food such as vegetables, fruits, and meats. (4) Mixing materials such as pancake mix and milk, and seasoning

and soup. Concretely these tasks have the following features: (A) Each of them consists of component skills, which are common across tasks, such as grasping and moving an object or stirring. (B) They include deformable object manipulation, liquids, and/or granular materials. (C) Another interesting point of cooking tasks is that they involve complex dynamics relating quantities that are easy to measure during cooking, such as temperature and viscosity, to human ratings of taste, which are expensive to measure.

These videos of our PR2 robot pouring ( https://youtu.be/GjwfbOur3CQ ) and our Baxter robot pouring ( https://youtu.be/NIn-mCZ-h_g ) are good introductions to our work [49]. From these case

studies of pouring, we obtained key ideas for practical robot learning: a skill library is useful to deal with

the variations of tasks, and planning behaviors with learned models is a successful approach in robotics. We

developed a framework of learning decomposed (subtask-level) dynamic models and planning actions in [47].

We also developed a stochastic extension of neural networks in [48] that is useful for modeling dynamical

systems. Related research on robot cooking includes making pancakes [19] and baking cookies [14]. There is

extensive work on robot cooking and serving food in Japan [?]. Although they have successful results, their

robotic behaviors do not generalize widely. We have shown generalization and adaptation abilities, and efficient

learning based on both simulated and actual practice. Our and others’ work provides an excellent foundation for

the proposed work. Due to space limitations, the following offers more details on only portions of the proposed

work.

Proposed Framework: Building The Skill Simulator: The proposed Skill Simulator design provides a

foundation and framework to develop practical reinforcement learning (RL) methods for robots. From a techni-

cal point of view, making such a simulator fast and accurate is difficult. We propose to model task components

that are tightly connected with skills (sub-tasks, or primitive actions). These models form graph structures of

dynamical systems (Figure 4). The Skill Simulator is hierarchical: including symbolic and qualitative models,

numerical task input/output models, and finite element and other forms of detailed mechanical, material, surface

interaction, thermal, and chemical models. The Skill Simulator supports two types of component models: engi-


Figure 5: Our Baxter robot, Baxter cutting a tomato, and our PR2 robot.

neered and learned models. When we do not have good models made by humans, which is usual in deformable

object manipulation, we learn models from practice. The Skill Simulator is multimodal, including sound and

other vibrations, and thermal and chemical models. The Skill Simulator also includes models that relate human

concepts and sensed data, for example taste and food properties (e.g., salinity, sugar content (brix), and acidity).

It includes failure models as probabilistic bifurcations.
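
The following minimal sketch (illustrative data structures only; class and field names are hypothetical, not a specification) shows how a skill library and a model server could be organized so that each skill node queries engineered or learned component models.

```python
# Minimal sketch (hypothetical names) of the proposed Skill Simulator organization:
# a skill library plus a model server, where each skill references component models.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class ComponentModel:
    name: str
    kind: str                        # "engineered" | "learned"
    predict: Callable[[dict], dict]  # maps task command/state to predicted outcome

@dataclass
class Skill:
    name: str                        # symbolic label, e.g. "pour", "shake", "grasp"
    models: List[str]                # component models used by this skill
    next_skills: List[str] = field(default_factory=list)  # graph edges to later subtasks

class SkillSimulator:
    def __init__(self):
        self.model_server: Dict[str, ComponentModel] = {}
        self.skill_library: Dict[str, Skill] = {}

    def add_model(self, m: ComponentModel):
        self.model_server[m.name] = m

    def add_skill(self, s: Skill):
        self.skill_library[s.name] = s

    def simulate(self, skill_name: str, command: dict) -> dict:
        """Predict the outcome of one skill by querying its component models in order."""
        out = dict(command)
        for model_name in self.skill_library[skill_name].models:
            out.update(self.model_server[model_name].predict(out))
        return out
```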

In learning models, we will build on our previous work on learning non-parametric memory-based robot

and task models [5, 24, 26, 25, 33, 31, 34, 32]. We will store data from all past robot behavior, and build models

as necessary to answer queries. We will refer to research on transfer learning [22, 45] to improve generalization.

Importance sampling (e.g. [46]) is a useful method in this context.

We will also explore the use of deep (many-layered) neural networks. As mentioned above we use stochastic

models of dynamical systems for dealing with simulation biases. We will extend deep neural networks to be

capable of: (1) modeling prediction error and output noise, (2) computing an output probability distribution for

a given input distribution, and (3) computing gradients of output expectation with respect to an input. Since

neural networks have nonlinear activation functions (e.g. rectified linear units; ReLU), these extensions are not

trivial. We will use approximations to make training deep neural networks more efficient [REFERENCE OR

EXAMPLE]. We will verify this approach using grasping, pouring, and cutting tasks, for example.
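
As one ingredient of such an extension, the following minimal sketch (assuming Gaussian inputs, independent units, and a diagonal covariance approximation) propagates a mean and variance through a single linear + ReLU layer using the standard Gaussian moment formulas.

```python
# Minimal sketch (Gaussian, independent units; diagonal covariance approximation)
# of propagating an input mean/variance through one linear + ReLU layer.
import math
import numpy as np

def relu_moments(mean, var):
    """Mean and variance of max(0, x) for x ~ N(mean, var), elementwise."""
    s = np.sqrt(np.maximum(var, 1e-12))
    z = mean / s
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    m_out = mean * cdf + s * pdf
    second = (mean ** 2 + var) * cdf + mean * s * pdf    # E[max(0, x)^2]
    return m_out, np.maximum(second - m_out ** 2, 0.0)

def layer_moments(W, b, mean, var):
    """Propagate mean/variance through y = ReLU(W x + b), treating units as independent."""
    m_lin = W @ mean + b
    v_lin = (W ** 2) @ var                               # diagonal-covariance approximation
    return relu_moments(m_lin, v_lin)

W = np.array([[1.0, -0.5], [0.3, 0.8]])
b = np.zeros(2)
print(layer_moments(W, b, mean=np.array([0.2, -0.1]), var=np.array([0.05, 0.02])))
```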

Hierarchies of task-level models make simulations more realistic and support more efficient domain adapta-

tion and transfer learning, but often require new simulator technologies, rather than a single temporal integrator

solving a large set of simultaneous equations that basically express only rigid body dynamics. With task-

command to task-output models we do not need to compute time integrals to estimate the output state, which

reduces the uncertainty caused by time integrals. We also use trajectory optimization techniques (stochas-

tic differential dynamic programming (SDDP)) to simultaneously generate a local policy and perform the

simulation [23, 30]. We have extended SDDP for graph-structured dynamic systems. A challenge is that there

would be many local maxima since we use learned dynamical models. Such local maxima could trap SDDP

since it is a gradient method. We will work on developing a method to avoid this. We will also extend DDP to

directed-graph-structured dynamical systems. Since DDP consists of forward and backward propagating calcu-

lations, it is clearly possible to extend it for tree structures. We use graph theory to transform a graph structure

into a tree structure, and apply our proposed extended SDDP algorithm. In order to deal with discrete selec-

tions, we refer to methods for N-armed bandit problems as well as hypothesis testing methods [41]. The Skill

Simulator represents both task models and policies, making it a combination of model-based and model-free

reinforcement learning which supports policy search/optimization.
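
The following minimal sketch (assuming the subtask graph is a directed acyclic graph and using placeholder node labels) illustrates the graph-to-tree step: nodes reachable along multiple paths are duplicated so that a tree-structured backward pass can be applied.

```python
# Minimal sketch (assumption: the subtask graph is a DAG; payloads are just labels)
# of unrolling a directed graph of subtask dynamics into a tree by duplicating
# nodes reachable along multiple paths.
from typing import Dict, List, Tuple

Graph = Dict[str, List[str]]      # node -> successor subtasks
Tree = Tuple[str, list]           # (node label, list of child subtrees)

def unroll_to_tree(graph: Graph, root: str) -> Tree:
    """Copy the DAG rooted at `root` into a tree, duplicating shared successors."""
    children = [unroll_to_tree(graph, nxt) for nxt in graph.get(root, [])]
    return (root, children)

# Example: "grasp" feeds two branches that both end in a shared "place" node.
graph = {"grasp": ["pour", "shake"], "pour": ["place"], "shake": ["place"], "place": []}
print(unroll_to_tree(graph, "grasp"))
# ('grasp', [('pour', [('place', [])]), ('shake', [('place', [])])])
```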

Proposed Work: Hybrids of Model-free and Model-based RL: We believe model-based reinforcement

learning is the way to achieve efficient learning. However, model-based learning has disadvantages such as

occasionally getting stuck. We will reduce the disadvantages of model-based RL by introducing a supervisory

model-free approach. We aim to increase the final performance and reduce the computational cost of execution.

Specifically, we refer to direct policy search (e.g., [43, 17]). In addition to dynamical models, we maintain policies


that map input states to actions. There are at least two choices on how to use the samples in learning: using

samples to train dynamic models only (the policies are optimized with the dynamical models), and using samples

to train both dynamic models and policies. A well-known architecture is Dyna [38]. Originally it was developed

for discrete state-action domains, and later a new version with a linear function approximator was developed

for continuous domains [40]. Recently Levine et al. [21] proposed a more practical approach where a trajectory

optimization method for unknown dynamics is combined with a local linear model learned from samples. While

these methods were developed for continuous or discrete state-action domains, methods for hierarchical dynamic

systems have not been developed. Developing a combined method of model-free and model-based approaches

for the hierarchical hybrid-reality simulator will contribute to efficient learning using simulation.
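
For reference, the following minimal sketch shows the classic tabular Dyna-Q loop [38], in which each real transition updates both the value function and a learned model, and the model then generates additional "mental practice" updates; our proposed hybrid would operate on hierarchical, continuous task-level models rather than tables, and env_step, reset, and actions are placeholders supplied by the caller.

```python
# Minimal sketch of classic tabular Dyna-Q [38]; illustrative only.
# env_step(s, a) -> (reward, next_state, done) and reset() -> state are supplied by the caller.
import random
from collections import defaultdict

def dyna_q(env_step, reset, actions, episodes=50, planning_steps=10,
           alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)        # (state, action) -> value
    model = {}                    # (state, action) -> (reward, next_state, done)
    for _ in range(episodes):
        s, done = reset(), False
        while not done:
            a = (random.choice(actions) if random.random() < eps
                 else max(actions, key=lambda a_: Q[(s, a_)]))
            r, s2, done = env_step(s, a)                  # one real-world transition
            target = r + (0.0 if done else gamma * max(Q[(s2, a_)] for a_ in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])     # direct (model-free) update
            model[(s, a)] = (r, s2, done)                 # learn the model
            for _ in range(planning_steps):               # mental practice with the model
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr + (0.0 if pdone else gamma * max(Q[(ps2, a_)] for a_ in actions))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q
```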

Proposed Work: Hierarchical Modeling: The task-level dynamic models we have described so far form a hierarchy: symbolic finite state machines, learned task models of skills, and learned primitive action models and sub-skills. For more complicated tasks such as cooking, we consider a hierarchy with more layers. Such research falls under hierarchical RL (e.g., [39, 8, 15]). Most of the existing algorithms are for symbolic

worlds, i.e. they are not capable of performing practical robotic tasks. In our work, we include continuous

variables even at higher levels. For example, a pouring skill will have continuous parameters like a target

amount, and they are reasoned about in a cooking task.

Proposed Work: Learning Hierarchical Representations From Human Demonstrations: A practical

benefit of using learning from human demonstrations (LfD) is automation of building the skill library and the

(graph) structures of task-level dynamic systems. Much existing LfD research (e.g. [9, 18, 13]) would be useful

for this purpose. Segmentation methods for behavior (e.g. [29]) will be very useful for task-level modeling, for

example. We will contribute to this field by introducing our Skill Simulator. Its high level models will help

robots understand the intention of human demonstrations even if a skill is new to the robots. For example, when

the human is demonstrating a squeezing skill to a robot, the robot can guess that the skill is for pouring since

the phenomenon matches the high-level dynamic model of pouring. The robot can relate the perceived

demonstration to stored experiences about pouring. We will build on our previous work on learning from

demonstration. We have implemented direct policy learning to allow a robot to learn air hockey and a marble

maze task from watching a human [12, 10, 11]. Other work we have done on policy learning and optimization includes [3, 35, 36, 37].

Proposed Work: Learning Probabilistic Models: We will use memory-based approaches to construct

density models to estimate probability distributions [5]. The data from actual robot execution will be stored.

When a probabilistic model is required for a particular task, data from similar tasks and conditions will be

combined and a density function estimated non-parametrically using kernel smoothing.
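
The following minimal sketch (hypothetical stored data and similarity weights) shows the kind of weighted kernel density estimate we have in mind, here over an unknown delay.

```python
# Minimal sketch (hypothetical data) of memory-based density estimation: stored
# outcomes from similar tasks are pooled and smoothed with a Gaussian kernel.
import numpy as np

def kernel_density(samples, weights, query, bandwidth=0.05):
    """Weighted Gaussian kernel density estimate evaluated at `query` points."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    diffs = (np.asarray(query)[:, None] - np.asarray(samples)[None, :]) / bandwidth
    k = np.exp(-0.5 * diffs ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return k @ w

# Stored delay estimates (s) from past executions; weights decay with task dissimilarity.
delays = np.array([0.08, 0.10, 0.11, 0.15, 0.09])
similarity = np.array([1.0, 1.0, 0.8, 0.3, 0.9])
grid = np.linspace(0.0, 0.3, 7)
print(kernel_density(delays, similarity, grid))
```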

Proposed Work: Robust Policy Design: One way to design robust policies is to learn a policy that works

for a set or probability distribution of models. The proposed approach achieves robustness by simultaneously

designing one control law for multiple models with potentially different model structures, which represent model

uncertainty and unmodeled dynamics. Multiple model policy search can be done in the typical “model-free”

way by finding the cost of the current policy applied to all of the models. It can be made much more efficient

if a model-based policy optimization approach is taken which uses first and/or second order gradients. These

gradients can be computed recursively, in a similar way to how a value function itself is computed. The gradients

for each model are summed to get the total gradient. Here is an example of a policy gradient for one time step of

a discrete time simulation [?]: $V^k_x = L_x + L_u \pi_x + V^{k+1}_x (F_x + F_u \pi_x)$ and $V^k_p = (L_u + V^{k+1}_x F_u) \pi_p + V^{k+1}_p$, where x is the state on this step and u is the action. p is the vector of adjustable policy parameters. Subscripts indicate partial derivatives. V is the value function, L is the reward or cost function, π is the current policy being evaluated, and F is the dynamic model for the process or simulation.
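
The following minimal sketch (a scalar state, action, and policy parameter with hypothetical linear dynamics and quadratic cost) implements this backward recursion and sums the per-model gradients over a set of sampled models to take a robust policy-gradient step.

```python
# Minimal sketch: scalar dynamics x' = a x + b u, linear policy u = -p x, cost q x^2 + r u^2.
# The backward recursion matches the equations above; gradients are summed over models.
import numpy as np

def policy_gradient_one_model(a, b, p, x0, q=1.0, r=0.1, T=50):
    xs, us = [x0], []
    for _ in range(T):                         # forward pass under the current policy
        u = -p * xs[-1]
        us.append(u)
        xs.append(a * xs[-1] + b * u)
    Vx, Vp = 0.0, 0.0                          # terminal derivatives
    for k in reversed(range(T)):               # backward recursion
        x, u = xs[k], us[k]
        Lx, Lu = 2.0 * q * x, 2.0 * r * u
        Fx, Fu, pi_x, pi_p = a, b, -p, -x
        Vp = (Lu + Vx * Fu) * pi_p + Vp        # V^k_p = (L_u + V^{k+1}_x F_u) pi_p + V^{k+1}_p
        Vx = Lx + Lu * pi_x + Vx * (Fx + Fu * pi_x)
    return Vp                                  # dJ/dp for this model (check vs. finite differences)

models = [(1.05, 0.9), (1.10, 1.0), (1.00, 1.1)]   # sampled (a, b) model parameters
p = 0.5
for _ in range(100):                           # robust gradient descent on the shared policy
    grad = sum(policy_gradient_one_model(a, b, p, x0=1.0) for a, b in models)
    p -= 1e-3 * grad
print(p)
```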

Proposed Work: Learning To Learn addresses an expanded vision of curriculum learning, in which tasks

and features are chosen and sequenced in a generalized form of “shaping”. In our work on building models of

robot and actuator dynamics [?], we discovered that some directions in unknown parameter space were hard to

identify and required a great deal of learning effort to reduce parametric uncertainty. Often, these parameter

directions were hard to identify because they had little effect, and thus did not have to be learned. Similarly,


other directions had large effects, needed to be learned, and could be learned quite rapidly and accurately.

Directions that have the largest effects on reward need to be learned first. We believe human motor learning is

effective because humans identify the most important direction in input space and learn that first. Only later are

more dimensions added to learning in a principal components-like process to deal with the true dimensionality

of the task. In Learning to Learn, simulation over learned probabilistic models of possible model parameters and model structures is used to discover good sequences of tasks, features, and directions in task space to explore, enabling an actual robot to learn efficiently in the real world.

We present a simple example of learning to learn that learns directions in simulation so the number of

adjustable parameters on the real robot is greatly reduced. Consider the cart-pole problem, in which the length

of the pole and the mass of the cart vary. In addition, there is an unknown delay as well as actuator dynamics.

Given there is only a single actuator (the force on the cart), there is a direction in state space and a corresponding

single feedback gain for any optimal full state LQR controller, for example. In terms of learning to control this

system, a single parameter, the feedback gain, is all that needs to be adjusted. The feedback gain has to be large

enough to stabilize the system, but small enough not to cause oscillations or instabilities due to the unknown

delay and actuator dynamics. The direction in state space that the gain is applied to can be fixed over a wide

range of possible pole lengths, cart masses, delays, and actuator dynamics. The first phase of learning on

the real robot should focus on learning an appropriate feedback gain (one parameter) and nothing else. Later

phases of learning can tune the direction for slightly improved performance (d − 1 parameters where d is the

dimensionality of the state). Compare this to a traditional system identification approach, which would need

to learn d^2 + d parameters of a locally linear model. In more complex systems, similar decomposition of the

system into local modes (directions) and scalar gains can be used to create a regulator, or a local controller at

each step of a trajectory. A similar decomposition into directions and scalar gains applies to filter design and

learning algorithms.
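
The following minimal sketch (standard point-mass cart-pole linearization about the upright position, ignoring the delay and actuator dynamics discussed above) illustrates the idea: LQR gains are computed for many sampled models, their common direction is extracted, and only a scalar gain along that direction would be learned on the real robot.

```python
# Minimal sketch: find a feedback direction that is nearly invariant across sampled
# cart-pole models, leaving only a scalar gain to learn on the real robot.
import numpy as np
from scipy.linalg import solve_continuous_are

def lqr_gain(M, m, l, g=9.81, Q=np.eye(4), R=np.eye(1)):
    # State [cart pos, cart vel, pole angle, pole ang. vel]; input is the force on the cart.
    A = np.array([[0, 1, 0, 0],
                  [0, 0, -m * g / M, 0],
                  [0, 0, 0, 1],
                  [0, 0, (M + m) * g / (M * l), 0]], dtype=float)
    B = np.array([[0], [1 / M], [0], [-1 / (M * l)]], dtype=float)
    P = solve_continuous_are(A, B, Q, R)
    return (np.linalg.solve(R, B.T @ P)).ravel()          # LQR gain vector

# Sample plausible models (cart mass, pole mass, pole length) and collect gain directions.
rng = np.random.default_rng(0)
gains = np.array([lqr_gain(M=rng.uniform(0.5, 2.0),
                           m=rng.uniform(0.05, 0.3),
                           l=rng.uniform(0.3, 1.0)) for _ in range(50)])
directions = gains / np.linalg.norm(gains, axis=1, keepdims=True)
_, _, Vt = np.linalg.svd(directions)
d = Vt[0]                    # dominant feedback direction, fixed across models
print(d)
# On the real robot, learn only the scalar g in u = -g * (d @ x) first.
```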

Strategies that limit the need for learning could be one reason for the efficiency of human learning. For

example an infant starts from lower degrees of freedom, and gradually increases the degrees of freedom [42].

In our pouring case study [49], we created several different skills (e.g. tipping and shaking) for controlling flow

rather than using a general skill (e.g. joint angle trajectory). With such skills, we could reduce the number

of parameters to be learned. One aspect of the proposed research is to develop algorithms that formalize and

generalize this idea.

In addition, simulation with probabilistic models can be used to select filtering, control, and learning

algorithm parameters. The Kalman filter is a good example of how learning probability distributions leads to

improved performance, in this case state estimation. The better the estimate of the prior state distribution and

the process and measurement noise covariance, the better the resulting filter design. In the nonlinear case,

simulation (Monte Carlo evaluation) is needed to find optimal filter designs. This also applies to control design

and learning algorithm design.
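
The following minimal sketch (a hypothetical scalar plant with noise levels drawn from learned distributions) shows Monte Carlo selection of a fixed estimator gain; the same pattern applies to controller and learning-algorithm parameters.

```python
# Minimal sketch (hypothetical scalar system and learned noise distributions) of
# Monte Carlo selection of a fixed estimator gain over sampled possible worlds.
import numpy as np

def estimation_cost(gain, process_std, meas_std, a=0.95, T=200, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    x, xhat, cost = 0.0, 0.0, 0.0
    for _ in range(T):
        x = a * x + rng.normal(0.0, process_std)          # true state
        z = x + rng.normal(0.0, meas_std)                 # noisy measurement
        xhat = a * xhat + gain * (z - a * xhat)           # fixed-gain filter
        cost += (x - xhat) ** 2
    return cost / T

rng = np.random.default_rng(1)
# Sample noise levels from learned (here: hypothetical) distributions of the real world.
worlds = [(rng.uniform(0.05, 0.2), rng.uniform(0.1, 0.5)) for _ in range(30)]
gains = np.linspace(0.05, 0.95, 19)
# Common random numbers (same seed) make the comparison across gains fair.
avg = [np.mean([estimation_cost(g, ps, ms, rng=np.random.default_rng(2)) for ps, ms in worlds])
       for g in gains]
print(gains[int(np.argmin(avg))])   # gain that minimizes expected estimation error
```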

More generally, Learning to Learn can be viewed as a form of active learning [?] in which we use an

estimate of the increase in reward or utility for a fixed amount of learning effort (often experiment time)

(∆Reward/∆LearningEffort) to optimize future learning by choosing optimal actions for both control and

learning. Learning how to trade off exploitation vs. exploration can be determined using simulation with prob-

ability distributions of possible worlds. One can view this as a Monte Carlo approach to designing optimal

algorithms for Dual Control [?].

Proposed Work: Finding Better Strategies By Exploring In Simulation And In Reality: Simulation

can be used to explore, and find better task strategies. Figure 6 shows robot learning on the marble maze task,

where a rolling ball is guided through a maze (with holes) by tilting the board. Learning using “skills” in the

marble maze task was much faster than learning using a direct mapping from states to actions. This is clearly

a two-edged sword: learning is greatly sped up, but ultimate performance might be limited because a needed

skill is missing. However, new skills can be discovered in simulation and in reality. In the three observed

games the human maneuvers the marble below hole 14 (Figure 6). During robot practice the ball falls into

hole 14 and the robot learns that it can more easily maneuver the ball around the top of hole 14. We did not


Figure 6: Ball paths during hardware maze learning: Left: The 3 training games. Middle: Performance on 10

games based on learning from observation using the 3 training games. The maze was successfully completed

5 times, and the red circles mark where the ball fell into the holes. Right: Performance on 10 games based on

learning from practice after 30 practice games. There were no failures. Note the strategy change at hole 14

(zoom in using your PDF viewer).

even know this action was possible until we observed the robot do it. For this to work in simulation, we need

models that can generalize beyond the skills that have already been learned. We have found that our skill models

can extrapolate somewhat. To go beyond that, we need finite element and other forms of detailed mechanical,

material, surface interaction, thermal, and chemical models that enable simulations to discover new strategies,

often using different event sequences (such as contacts) or even different physical phenomena.

Proposed Work: Using Simulation (Models) While Running The Real Robot: We note that model-based

reinforcement learning continues while operating the real robot. We should use the power of simulation in other

ways during actual robot learning. Simulation can help estimate derivatives and gradient directions for learning

using first or second order gradient descent. Simulation can explore possible actions and eliminate bad or unsafe

actions that a learning algorithm suggests, before the real robot tries them and fails. Simulation can perform

components of the task in robot-in-the-loop learning, so the robot is always close to success and estimated

gradients of utility with respect to task inputs are accurate, and steps in the learning algorithm actually lead to

improved performance. Finally, Learning To Learn based on simulation can continue to improve learning based

on actual robot data.

Proposed Work: Transfer: We will explore sharing learned knowledge among multiple tasks and robots.

For example grasping skills are used in different tasks. Although a grasping policy is task or situation dependent

(e.g., grasping for pouring and grasping for loading dishes into a dishwasher are different), a grasping dynamical

model could be more easily shared among tasks. Policies are the result of planning over a task given goals,

while dynamical models are local relationships independent of goals and much task context. Other models

can be transferred as well. For example, in pouring, after material comes out of the source container and flow

happens, the dynamical model between the flow features (flow position, flow variance) and the amount poured

into the receiver or spilled onto a table is largely independent of the robot.

Fast, robust, scalable and precise transition of learned knowledge from simulation to real environ-

ments: Our Skill Simulator is fast because it uses task models and focuses on relevant behavior. It is precise

and transitions to real robot behavior are fast because the Skill Simulator uses learned models, which are more

accurate and often do not require temporal integration, as they map directly from task commands to outputs. The

simulator and transitioned behavior are scalable because we use trajectory graphs to represent policies locally,

which has a computational cost of the dimensionality to the third power (due to matrix inversions) rather than

being exponential in dimensionality. The computational cost scales linearly with respect to the duration of the

behavior/simulation. The transitioned behavior is robust because we use multiple models and model probability

distributions in controller and learning algorithm design.


Differentiation From The Current State Of The Art

AKIHIKO, THIS IS WHERE RELATED WORK GOES. WE CAN PUT A LONGER VERSION ON THE

WEB AND POINT TO IT. CAN YOU REVISE THIS AND ADD MORE RELATED WORK UNTIL WE

OVERFLOW 10 PAGES?

As model-based RL, our research is superior to the state of the art (e.g., [20]) because of its (sub)task-level dynamic models.

Compared to direct policy search, which is among the most successful robot learning approaches (cf. [16]), our approach has better generalization ability, reusability and shareability, and robustness to reward changes. We contribute to reducing the simulation bias issue [16].

As an approach to planning with physics simulators, our method provides non-rigid object models that are hard to handle even in state-of-the-art simulators such as MuJoCo [44].

As deep reinforcement learning research, we present a practical use of deep neural networks as learned models in parts of the simulator.

Unlike the state-of-the-art robot learning methods [16], where robots directly search for policies, we create simulation models of the world through interaction with the real world.

Such research falls under hierarchical RL (e.g., [39, 8, 15]). For more complicated tasks such as cooking,

we consider a hierarchy of more layers. Most of the existing algorithms are for symbolic worlds, i.e. they are

not capable of performing practical robotic tasks. In our work, we include continuous variables even at higher

levels. For example, a pouring skill will have continuous parameters like a target amount, and they are reasoned

about in a cooking task.

Such model-based reinforcement learning (RL) has not been popular in robot learning in the past decade [16]. The state-of-the-art reinforcement learning methods are direct policy search (e.g., [43, 17]). Morimoto et al. explored learning models through practice and planning actions with them [27]. An early example of how reinforcement learning in simulation can accelerate actual robot learning is shown in [28]. One reason this approach has not become dominant is the growth of uncertainty during simulation (time integration), which is discussed as simulation bias in [16]. We can improve this in several ways.

CMU Resources: The Search-based Planning Lab provides a PR2 robot for pouring experiments (Figure 5).

The PR2 robot has two 7 DOF arms, a parallel gripper on each arm, a lift-type torso, and an omni-directional

mobile platform. Its arm payload is 1.8 kg, the grip force is 80 N, and the grip range is 0 to 90 mm, which have

been enough for our pouring experiments so far. We also have a Baxter research robot for the proposed work

(Figure 5). It has two 7 DOF arms, and two different types of parallel grippers. Its arm payload is 2.2 kg. One

gripper’s grip force is 44 N with a grip range 37 to 75 mm, and the other gripper’s grip force is 100 N with a

grip range 0 to 84 mm. The Baxter robot has torque sensors at each joint. We extensively instrument both

robots and the surrounding environment with many cameras: RGB, depth, thermal, etc.

Budget: Proposed budget in US$:

Atkeson - 0.5 month        10,000
Yamaguchi - 75%            40,000
Benefits                   11,398
Computing                   2,000
Materials                  20,000
Japan Travel                8,000
US Travel                   4,000
Direct Costs              100,000
Overhead                   50,000
Total                     150,000

References

[1] E. Aboaf, S. Drucker, and C. Atkeson. Task-level robot learning: Juggling a tennis ball more accurately. In IEEE International Conference on Robotics and Automation, pages 1290–1295, Scottsdale, AZ, 1989.


[2] Chae H. An, C. G. Atkeson, and John M. Hollerbach. Model-Based Control of a Robot Manipulator. MIT Press, Cambridge, MA, 1988.

[3] S. O. Anderson, J. K. Hodgins, and C. G. Atkeson. Approximate policy transfer applied to simulated bongo board balance. In IEEE-RAS International Conference on Humanoid Robots (Humanoids), 2007.

[4] C. G. Atkeson. Nonparametric model-based reinforcement learning. In Advances in Neural Information Processing Systems, volume 10, pages 1008–1014. MIT Press, Cambridge, MA, 1998.

[5] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11:11–73, 1997.

[6] C. G. Atkeson and S. Schaal. Learning tasks from a single demonstration. In Proceedings of the 1997 IEEE International Conference on Robotics and Automation (ICRA97), pages 1706–1712, 1997.

[7] C. G. Atkeson and Stefan Schaal. Robot learning from demonstration. In Proc. 14th International Conference on Machine Learning, pages 12–20. Morgan Kaufmann, 1997.

[8] Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.

[9] Darrin C. Bentivegna. Learning from Observation Using Primitives. PhD thesis, Georgia Institute of Technology, 2004.

[10] D.C. Bentivegna, C.G. Atkeson, A. Ude, and G. Cheng. Learning tasks from observation and practice. Robotics and Autonomous Systems, 47:163–169, 2004.

[11] D.C. Bentivegna, C.G. Atkeson, A. Ude, and G. Cheng. Learning to act from observation and practice. International Journal of Humanoid Robotics, 1(4):585–611, 2004.

[12] D.C. Bentivegna, G. Cheng, and C.G. Atkeson. Learning from observation and from practice using behavioral primitives. In 11th International Symposium on Robotics Research, Siena, Italy, 2003.

[13] Aude Billard and Daniel Grollman. Robot learning by demonstration. Scholarpedia, 8(12):3824, 2013.

[14] Mario Bollini, Stefanie Tellex, Tyler Thompson, Nicholas Roy, and Daniela Rus. Interpreting and executing recipes with a cooking robot. In the 13th International Symposium on Experimental Robotics, pages 481–495, 2013.

[15] Shahar Cohen, Oded Maimon, and Evgeni Khmlenitsky. Reinforcement learning with hierarchical decision-making. In ISDA ’06: Proceedings of the Sixth International Conference on Intelligent Systems Design and Applications, pages 177–182, USA, 2006. IEEE Computer Society.

[16] J. Kober, J. Andrew Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. International Journal of Robotics Research, 32(11):1238–1274, 2013.

[17] Jens Kober and Jan Peters. Policy search for motor primitives in robotics. Machine Learning, 84(1-2):171–203, 2011.

[18] Petar Kormushev, Sylvain Calinon, and Darwin G. Caldwell. Robot motor skill coordination with EM-based reinforcement learning. In the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’10), pages 3232–3237, 2010.

[19] Lars Kunze and Michael Beetz. Envisioning the qualitative effects of robot manipulation actions using simulation-based projections. Artificial Intelligence, 2015.

[20] Ian Lenz, Ross Knepper, and Ashutosh Saxena. DeepMPC: Learning deep latent features for model predictive control. In Robotics: Science and Systems (RSS’15), 2015.

[21] Sergey Levine, Nolan Wagener, and Pieter Abbeel. Learning contact-rich manipulation skills with guided policy search. In the IEEE International Conference on Robotics and Automation (ICRA’15), 2015.

[22] Michael G. Madden and Tom Howley. Transfer of experience between reinforcement learning environments with progressive difficulty. Artificial Intelligence Review, 21:375–398, June 2004.

[23] David Mayne. A second-order gradient method for determining optimal trajectories of non-linear discrete-time systems. International Journal of Control, 3(1):85–95, 1966.

[24] J. Morimoto and C. G. Atkeson. Improving humanoid locomotive performance with learnt approximated dynamics via Gaussian processes for regression. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2007.

[25] J. Morimoto and C. G. Atkeson. Nonparametric representation of an approximated Poincare map for learning biped locomotion. Autonomous Robots, 27(2):131–144, 2009.

[26] J. Morimoto, S. H. Hyon, C. G. Atkeson, and G. Cheng. Low-dimensional feature extraction for humanoid locomotion using kernel dimension reduction. In IEEE-RAS Conference on Robotics and Automation, pages 2711–2716, 2008.

[27] J. Morimoto, G. Zeglin, and C.G. Atkeson. Minimax differential dynamic programming: Application to a biped walking robot. In the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’03), volume 2, pages 1927–1932, 2003.

[28] Jun Morimoto and Kenji Doya. Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. In ICML ’00: Proceedings of the Seventeenth International Conference on Machine Learning, pages 623–630, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.

[29] Scott Niekum, Sachin Chitta, Bhaskara Marthi, Sarah Osentoski, and Andrew G Barto. Incremental semantically grounded learning from demonstration. In Robotics: Science and Systems 2013, 2013.

[30] Yunpeng Pan and Evangelos Theodorou. Probabilistic differential dynamic programming. In Advances in Neural Information Processing Systems 27, pages 1907–1915. Curran Associates, Inc., 2014.

[31] S. Schaal and C. G. Atkeson. Constructive incremental learning from only local information. Neural Computation, 10(8):2047–2084, 1998.

[32] S. Schaal and C. G. Atkeson. Learning control for robotics. IEEE Robotics & Automation Magazine, 17(2):20–29, 2010.

[33] S. Schaal, C. G. Atkeson, and S. Vijayakumar. Real-time robot learning with locally weighted learning. In Proceedings, IEEE International Conference on Robotics and Automation, 2000.

[34] S. Schaal, C. G. Atkeson, and S. Vijayakumar. Scalable locally weighted statistical techniques for real time robot learning. Applied Intelligence, 16(1), 2002.

[35] M. Stolle, H. Tappeiner, J. Chestnutt, and C. G. Atkeson. Transfer of policies based on trajectory libraries. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2007.

[36] Martin Stolle and Christopher G. Atkeson. Knowledge transfer using local features. In Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL), 2007.

[37] Martin Stolle and Christopher G. Atkeson. Finding and transferring policies using stored behaviors. Autonomous Robots, 29(2):169–200, 2010.

[38] Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In the Seventh International Conference on Machine Learning, pages 216–224. Morgan Kaufmann, 1990.

[39] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.

[40] Richard S. Sutton, Csaba Szepesvari, Alborz Geramifard, and Michael Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, pages 528–536, 2008.

[41] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[42] G. Taga, R. Takaya, and Y. Konishi. Analysis of general movements of infants towards understanding of developmental principle for motor control. In the IEEE International Conference on Systems, Man, and Cybernetics, 1999 (SMC ’99), volume 5, pages 678–683, 1999.

[43] E. Theodorou, J. Buchli, and S. Schaal. Reinforcement learning of motor skills in high dimensions: A path integral approach. In the IEEE International Conference on Robotics and Automation (ICRA’10), pages 2397–2403, May 2010.


[44] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.

[45] L. Torrey and J. Shavlik. Transfer learning. In E. Soria, J. Martin, R. Magdalena, M. Martinez, and A. Serrano, editors, Handbook of Research on Machine Learning Applications, chapter 11. IGI Global, 2009.

[46] Eiji Uchibe and Kenji Doya. Competitive-cooperative-concurrent reinforcement learning with importance sampling. In the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pages 287–296, 2004.

[47] Akihiko Yamaguchi and Christopher G. Atkeson. Differential dynamic programming with temporally decomposed dynamics. In the 15th IEEE-RAS International Conference on Humanoid Robots (Humanoids’15), 2015.

[48] Akihiko Yamaguchi and Christopher G. Atkeson. Neural networks and differential dynamic programming for reinforcement learning problems. In the IEEE International Conference on Robotics and Automation (ICRA’16), 2016.

[49] Akihiko Yamaguchi, Christopher G. Atkeson, and Tsukasa Ogasawara. Pouring skills with planning and learning modeled from human demonstrations. International Journal of Humanoid Robotics, 12(3):1550030, 2015.
