Learning To Simulate Accurately And Learning To Learn From Simulation
Primary PI: Christopher G. Atkeson ([email protected])
Email: [email protected], Phone: +1-412-512-0150
Co-PI: Akihiko Yamaguchi
A proposal to the SONY 2016 Focused Research Award program in the area of Evolving
Reinforcement Learning from Simulation to Real Environment.
Abstract: We believe there are two important challenges for fast, robust, scalable and precise transition
of learned knowledge from simulation to real environments. The first is making better simulators using learned
knowledge from the real world and also from simulation. The second is learning robust and more efficient
control and learning algorithms from simulation. We propose emphasizing (1) task-level modeling for simu-
lation. We have found that modeling complex physics in the context of a task and a particular robot strategy is
much more effective than trying to learn general models. For example, modeling shaking salt from a dispenser
from first principles is very difficult. Learning models of granular materials in general or for all possible salt
dispensers is very difficult. However, learning models of the outcome of a particular shaking strategy applied
to a particular salt dispenser is much easier and more accurate. We propose (2) using learned task models at
several levels in hierarchical simulation: qualitative/symbolic models that represent different physics and task
phases such as in-contact/unconstrained or sliding/stuck; task-level models that represent continuous outcomes
of tasks such as how much salt will come out when a particular salt shaker is shaken in a particular way; and
process models such as how granular materials will flow or how much friction there will be. We propose (3)
learning probability distributions of the physics of the real world. Estimated probability models can be used
in simulation to (4) design robust policies and more efficient and robust learning algorithms. Modeling
how much a robot knows and more importantly what it doesn’t know is very useful in deciding how best to
learn what it expects to not know. For example, knowing the possible range of an unknown delay is very useful
in designing both a robust policy to handle any possible delay, and an efficient learning algorithm to learn the
delay so learned policies can take advantage of knowing the delay and perform better. These learned models,
better policies, and learning algorithms can be shared across robots to better evolve reinforcement learning from
simulations to real environments.
Motivation: Our experience with the DARPA Robotics Challenge (Figure 1), our SARCOS humanoids, and
experiments with manipulating deformable objects, liquids, and granular materials has convinced us of the need
for innovative simulation approaches that can generate robust behaviors in real environments with less trial-
and-error. In the DARPA Robotics Challenge, many teams including our WPI-CMU team found learning from
Figure 1: DARPA Robotics Challenge and robot juggling
Figure 2: In our past work we used inverse models of the task to map task errors to command corrections [2, 1].
We have found that optimization is more effective than trying to track a learned reference movement, especially
with non-minimum phase plants. This figure shows an example of a robot learning to do a nonlinear, unstable,
and non-minimum phase task (pendulum swing up) from watching a human do it (once) [6, 7, 4]. Left: A
cartoon of the robot swinging up an inverted pendulum by moving the hand side to side. Middle (pendulum
motion) and Right (hand motion): The figures show the teacher’s demonstration of how to manipulate a
pendulum by moving the hand side to side and the robot’s practice trials. On the first attempt the robot imitates
its perception of the teacher’s movement, which fails to swing the pendulum upright due to differences in grasp
and slight deviations in the trajectory of handle position and three dimensional orientation. The robot then uses
optimization and an updated model of the task dynamics to adapt its hand motion. The 2nd trial is better, the
model is updated again, and the 3rd trial succeeds. We have found that model-based optimization greatly speeds
up learning. However, model-based learning sometimes gets stuck: updating the model with new data causes only a very slow change in the policy when the planned movement is in a different area of state space from the new data. We use a different form of learning, policy optimization, to speed up learning in this case.
simulation to not be very useful, and we had to build a replica of the test course and practice tasks with the actual
robot for over two months. Much of the time the robot was not available due to breakdowns, and this process
required a large number of people working around the clock to take advantage of every second of available
practice time. Similarly, our graduate students are reluctant to do simulations of our SARCOS humanoid. At the high level of performance we have the robot operating at, the students feel simulation is a waste of time, since improving performance depends on handling the unmodeled features of the robot, and they jump directly to experiments with the actual robot. This means it takes a year or so to program the robot to do a new high-
performance task. This is not a cost-effective way to program high-quality robots. To simplify programming
robots, we have developed very fast reinforcement learning algorithms that use simulation to perform mental
practice using learned models during actual robot learning: robot juggling, air hockey, and humanoid walking
(Figure 2) [?, ?, ?].
However, it is clear to us that we can make better use of simulation than just mental practice using learned
models. Morimoto and Doya provided an early example of how reinforcement learning in simulation can ac-
celerate actual robot learning [?]. Unfortunately, the techniques they used almost 20 years ago are still state of
the art today. We have faster computers and can do more simulation, but the paradigm is the same: practice the
entire task in simulation, and then practice the entire task with the robot. Idealized rigid body dynamic models
are primarily used to perform the simulation (for example SDFast then and MuJoCo now). There is no robust
planning, or learning to simulate or learn more effectively using the simulation, and the simulation is not used
once the robot starts to practice in reality.
In our case study on robot pouring [49], we built a liquid simulator (Figure 3). We found that existing
fast approximate liquid simulations seemed to be far from reality and accurate liquid simulation was very slow,
especially for simulating different types of materials in the context of different tasks; for example shaking to
pour tomato sauce, and squeezing a shampoo bottle. We found that using learned task input/output models to
simulate liquids and granular materials in the context of a particular task was much more effective.
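As an illustration of task-level input/output modeling, the sketch below fits a kernel-smoothed model mapping a shaking command (amplitude, frequency) directly to the amount dispensed, for one particular dispenser. All data points and parameter values here are invented for illustration, not taken from our experiments.

```python
import numpy as np

# Hypothetical logged trials for one particular salt dispenser:
# each row is (shake amplitude [m], shake frequency [Hz]); targets are grams dispensed.
X = np.array([[0.02, 1.0], [0.02, 2.0], [0.04, 1.0],
              [0.04, 2.0], [0.06, 1.5], [0.06, 3.0]])
y = np.array([0.3, 0.8, 0.9, 2.1, 2.4, 4.9])

def task_model(query, bandwidth=0.5):
    """Kernel-smoothed task-level model: predicts the task outcome directly
    from the task command, with no physics integration over time."""
    scale = X.std(axis=0)                      # make amplitude/frequency comparable
    d = np.linalg.norm((X - query) / scale, axis=1)
    w = np.exp(-0.5 * (d / bandwidth) ** 2)
    return float(w @ y / w.sum())

grams = task_model(np.array([0.04, 1.5]))      # predicted outcome of a new command
```

Such a model answers "how much salt comes out if I shake like this?" in one evaluation, which is the operation our Skill Simulator needs, rather than simulating grains.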
In all of this work, what we have noticed is that low level commands, controller gains, and nonlinear policies
do not transfer well from simulation to robot, or even from robot to robot. In the DARPA Learning Locomotion
program and the DARPA Robotics Challenge we worked with a set of identical robots (in the first case Little
Figure 3: Left and Middle: Actual robot pouring by our PR2 and Baxter robots. Right: (a) Pouring simulation.
(b) Using different particle physics to simulate variations in material properties.
Dog and the second case Atlas I). We found that successful performance on one robot did not transfer well
to a test site with a supposedly identical robot. What did transfer well were task structure, task strategies, and
task-specific learning algorithms, and to a lesser extent, filter, controller, and learning algorithm modes and time
constants/bandwidths. We can use simulation to find good task strategies, task-specific learning algorithms,
filter, controller, and learning algorithm properties.
Goals: One of our research goals is to test the idea that large libraries (thousands) of matched models
and policies (a skill library) can improve robot behavior generation and learning to human-level performance
and efficiency. We will verify and explore this idea on practical robotics domains, such as deformable object
manipulation and cooking, which also involves manipulating liquids and granular materials (Figure 3). Our key
idea for making a better simulation is to emphasize task models, which are co-developed with corresponding
task policies [?, ?]. The simulation knows how to pour salt from a particular salt dispenser, and accurately
models that behavior. Task-level modeling minimizes simulation error due to integrating behavior across time
and actually facilitates transfer of learning across tasks. Combining models and policies allows the skill library
to support both model-based and model-free learning.
We will develop our work on 1) performing simulation using task-level learned models as well as idealized
physical models to maximize the fidelity of our simulations, and 2) learning a hierarchy of models including
symbolic and qualitative models that select features and bound effects, numerical task input/output models that
make specific predictions given specific robot commands, and finite element and other forms of detailed me-
chanical, material, surface interaction, thermal, and chemical models that verify plans developed at higher levels
as well as enable simulations to discover new strategies using different physical phenomena. These elements
will be combined into a Skill Simulator (Figure 4). Such a simulator would enable robots to learn robust poli-
cies from it using mental practice, analyze the reasons for failures in actual practice, learn to prepare for future
learning (learning to learn), and estimate human intentions during learning from demonstrations.
A second key idea is to 3) learn and utilize probabilistic models of what is not known exactly about
the world. Probabilistic models such as models of prior distributions and measurement and process noise are
used to set Kalman filter parameters, for example. 4) Probability distributions of possible model parameters
and model structures will be used to guide robust policy learning. 5) In Learning To Learn, simulation over
learned probabilistic models is used to discover good sequences of tasks, features, and directions in task space
to explore, enabling an actual robot to learn efficiently in the real world. Learning To Learn addresses an expanded
vision of curriculum learning, in which tasks and features are sequenced in a generalized form of “shaping”. We
believe human motor learning is effective because humans identify the most important direction in command
space and learn that first. Only later are more dimensions added to learning in a principal components-like
process to deal with the true dimensionality of the task. In addition, we will develop our work on 6) improving
partial robot-in-the-loop control and simulation techniques where the robot is always successful or almost
successful as it performs more and more components of a task, and spends as little time as possible performing
badly and learning little. This is a form of assisted learning, like bicycle training wheels. We will also contribute
to robotics with research on learning from demonstration and deformable object manipulation.
Figure 4: Conceptual illustration of the proposed Skill Simulator. It consists of a skill library (symbolic repre-
sentations of tasks) and a model server. Each model may be an engineered model or a learned model (memory-
based or using parametric structures such as neural networks).
Methods
We have chosen difficult domains to test our ideas and framework: manipulating deformable objects, liquids, and
granular materials (Figure 3). Cooking is an everyday task that involves these domains, including (1) Grasping
objects such as rigid and soft containers, food, and tools. (2) Pouring liquids, powders, and particles. (3) Cutting
food such as vegetables, fruits, and meats. (4) Mixing materials such as pancake mix and milk, and seasoning
and soup. Concretely these tasks have the following features: (A) Each of them consists of component skills,
which are common across tasks, such as grasping and moving an object or stirring. (B) They include deformable
object manipulation, liquids, and/or granular materials. (C) Another interesting point of cooking tasks is that
they have complex dynamics between things that are easy to measure during cooking such as temperature and
viscosity, and rating of taste by humans, which is expensive to measure.
These videos of our PR2 robot pouring ( https://youtu.be/GjwfbOur3CQ ) and our Baxter robot
pouring ( https://youtu.be/NIn-mCZ-h_g ) are good introductions to our work [49]. From these case
studies of pouring, we obtained key ideas for practical robot learning: a skill library is useful to deal with
the variations of tasks, and planning behaviors with learned models is a successful approach in robotics. We
developed a framework of learning decomposed (subtask-level) dynamic models and planning actions in [47].
We also developed a stochastic extension of neural networks in [48] that is useful for modeling dynamical
systems. Related research on robot cooking includes making pancakes [19] and baking cookies [14]. There is
extensive work on robot cooking and serving food in Japan [?]. Although they have successful results, their
robotic behaviors do not generalize widely. We have shown generalization and adaptation abilities, and efficient
learning based on both simulated and actual practice. Our and others' work provides an excellent foundation for
the proposed work. Due to space limitations, the following offers more details on only portions of the proposed
work.
Proposed Framework: Building The Skill Simulator: The proposed Skill Simulator design provides a
foundation and framework to develop practical reinforcement learning (RL) methods for robots. From a techni-
cal point of view, making such a simulator fast and accurate is difficult. We propose to model task components
that are tightly connected with skills (sub-tasks, or primitive actions). These models form graph structures of
dynamical systems (Figure 4). The Skill Simulator is hierarchical: including symbolic and qualitative models,
numerical task input/output models, and finite element and other forms of detailed mechanical, material, surface
interaction, thermal, and chemical models. The Skill Simulator supports two types of component models: engineered and learned models.
Figure 5: Our Baxter robot, Baxter cutting a tomato, and our PR2 robot.
When we do not have good models made by humans, which is usual in deformable
object manipulation, we learn models from practice. The Skill Simulator is multimodal, including sound and
other vibrations, and thermal and chemical models. The Skill Simulator also includes models that relate human
concepts and sensed data, for example taste and food properties (e.g. salinity, sugar content (brix), and acidity).
It includes failure models as probabilistic bifurcations.
In learning models, we will build on our previous work on learning non-parametric memory-based robot
and task models [5, 24, 26, 25, 33, 31, 34, 32]. We will store data from all past robot behavior, and build models
as necessary to answer queries. We will refer to research on transfer learning [22, 45] to improve generalization.
Importance sampling (e.g. [46]) is a useful method in this context.
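A minimal sketch of the memory-based approach, assuming a toy scalar system with invented dynamics: all experience is stored raw, and a locally weighted linear model is fit only when a query arrives.

```python
import numpy as np

rng = np.random.default_rng(0)
# Memory: all past experience stored raw as (state, action) -> next state,
# for a hypothetical scalar system with a mild nonlinearity.
X = rng.uniform(-1, 1, size=(200, 2))                    # (state, action) pairs
Y = 0.9 * X[:, 0] + 0.3 * X[:, 1] + 0.05 * X[:, 0] ** 2  # toy next-state data

def lwr_predict(query, h=0.3):
    """Locally weighted regression: a model is built only when queried,
    from the stored data, weighted by distance to the query point."""
    w = np.exp(-0.5 * np.sum((X - query) ** 2, axis=1) / h ** 2)
    A = np.hstack([X, np.ones((len(X), 1))])             # local affine model
    AW = A.T * w                                         # A^T diag(w)
    beta = np.linalg.solve(AW @ A + 1e-9 * np.eye(3), AW @ Y)
    return float(np.append(query, 1.0) @ beta)
```

Because nothing is fit until query time, new data is incorporated simply by appending it to memory, which suits continual learning on the robot.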
We will also explore the use of deep (many-layered) neural networks. As mentioned above we use stochastic
models of dynamical systems for dealing with simulation biases. We will extend deep neural networks to be
capable of: (1) modeling prediction error and output noise, (2) computing an output probability distribution for
a given input distribution, and (3) computing gradients of output expectation with respect to an input. Since
neural networks have nonlinear activation functions (e.g. rectified linear units; ReLU), these extensions are not
trivial. We will use approximations to make training deep neural networks more efficient [REFERENCE OR
EXAMPLE]. We will verify this approach using grasping, pouring, and cutting tasks, for example.
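As a baseline for extension (2), an input distribution can always be pushed through a ReLU network by Monte Carlo sampling; the sketch below (with a tiny hand-set network, standing in for a trained one) shows the two moments our analytic approximations would need to match.

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny fixed ReLU network (in practice the weights would be trained):
# f(x) = relu(x1) + relu(x2 + 0.5) + relu(x1 + x2 - 0.5)
W1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, 0.5, -0.5])

def net(x):
    return float(np.maximum(W1 @ x + b1, 0.0).sum())

def push_distribution(mean, cov, n=20000):
    """Monte Carlo propagation of an input distribution through the net;
    the proposed analytic extensions would approximate these two moments."""
    ys = np.array([net(x) for x in rng.multivariate_normal(mean, cov, size=n)])
    return ys.mean(), ys.var()

m, v = push_distribution(np.zeros(2), 0.01 * np.eye(2))
# Note the ReLU kink makes m != net(mean): for x1 ~ N(0, 0.1^2),
# E[relu(x1)] = 0.1/sqrt(2*pi) ~= 0.04, so m ~= 0.54 while net(0) = 0.5.
```

The nontriviality mentioned above is visible here: the output mean shifts away from the network's value at the input mean precisely because of the activation nonlinearity.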
Hierarchies of task-level models make simulations more realistic and support more efficient domain adapta-
tion and transfer learning, but often require new simulator technologies, rather than a single temporal integrator
solving a large set of simultaneous equations that basically express only rigid body dynamics. With task-
command to task-output models we do not need to compute time integrals to estimate the output state, which
reduces the uncertainty caused by time integrals. We also use trajectory optimization techniques (stochas-
tic differential dynamic programming (SDDP)) to simultaneously generate a local policy and perform the
simulation[23, 30]. We have extended SDDP for graph-structured dynamic systems. A challenge is that there
would be many local maxima since we use learned dynamical models. Such local maxima could trap SDDP
since it is a gradient method. We will work on developing a method to avoid this. We will also extend DDP to
directed-graph-structured dynamical systems. Since DDP consists of forward and backward propagating calcu-
lations, it is clearly possible to extend it for tree structures. We use graph theory to transform a graph structure
into a tree structure, and apply our proposed extended SDDP algorithm. In order to deal with discrete selec-
tions, we refer to methods for N-armed bandit problems as well as hypothesis testing methods [41]. The Skill
Simulator represents both task models and policies, making it a combination of model-based and model-free
reinforcement learning which supports policy search/optimization.
Proposed Work: Hybrids of Model-free and Model-based RL: We believe model-based reinforcement
learning is the way to achieve efficient learning. However, model-based learning has disadvantages such as
occasionally getting stuck. We will reduce the disadvantages of model-based RL by introducing a supervisory
model-free approach. We aim to increase the final performance and reduce the computational cost of execution.
Specifically we refer to direct policy search (e.g. [43, 17]). As well as dynamical models, we maintain policies
that map input states to actions. There are at least two choices on how to use the samples in learning: using
samples to train dynamic models only (the policies are optimized with the dynamical models), and using samples
to train both dynamic models and policies. A well-known architecture is Dyna [38]. Originally it was developed
for discrete state-action domains, and later a new version with a linear function approximator was developed
for continuous domains [40]. Recently Levine et al. [21] proposed a more practical approach where a trajectory
optimization method for unknown dynamics is combined with a local linear model learned from samples. While
these methods were developed for continuous or discrete state-action domains, methods for hierarchical dynamic
systems have not been developed. Developing a combined method of model-free and model-based approaches
for the hierarchical hybrid-reality simulator will contribute to efficient learning using simulation.
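For concreteness, the Dyna idea [38] in its original tabular form can be sketched on a toy chain world: each real step trains the model, and the learned model then provides extra "mental practice" updates between real steps. The environment and parameter values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5                     # chain of states 0..N-1; reward on reaching the right end
Q = np.zeros((N, 2))      # action 0 = left, 1 = right
model = {}                # learned deterministic model: (s, a) -> (reward, next state)
alpha, gamma, eps = 0.5, 0.95, 0.1

def step(s, a):
    s2 = min(s + 1, N - 1) if a == 1 else max(s - 1, 0)
    return (1.0 if s2 == N - 1 else 0.0), s2

def greedy(s):
    best = np.flatnonzero(Q[s] == Q[s].max())
    return int(rng.choice(best))            # break ties randomly

for episode in range(30):
    s = 0
    while s != N - 1:
        a = int(rng.integers(2)) if rng.random() < eps else greedy(s)
        r, s2 = step(s, a)
        # Direct RL update from the real step.
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        model[(s, a)] = (r, s2)
        # Dyna: extra mental-practice updates using the learned model.
        keys = list(model)
        for _ in range(10):
            ps, pa = keys[rng.integers(len(keys))]
            pr, ps2 = model[(ps, pa)]
            Q[ps, pa] += alpha * (pr + gamma * Q[ps2].max() - Q[ps, pa])
        s = s2

policy = Q.argmax(axis=1)
```

The proposed work replaces the table and the flat model with the hierarchical, graph-structured models of the Skill Simulator, but the division of labor between real experience and model-based replay is the same.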
Proposed Work: Hierarchical Modeling: The task-level dynamic models we have described so far form
a hierarchy: symbolic finite state machines, learned task models of skills, and learned primitive action
models and sub-skills. For more complicated tasks such as cooking, we consider a hierarchy of more layers.
Such research is found as hierarchical RL (e.g. [39, 8, 15]). Most of the existing algorithms are for symbolic
worlds, i.e. they are not capable of performing practical robotic tasks. In our work, we include continuous
variables even at higher levels. For example, a pouring skill will have continuous parameters like a target
amount, and they are reasoned about in a cooking task.
Proposed Work: Learning Hierarchical Representations From Human Demonstrations: A practical
benefit of using learning from human demonstrations (LfD) is automation of building the skill library and the
(graph) structures of task-level dynamic systems. Much existing LfD research (e.g. [9, 18, 13]) would be useful
for this purpose. Segmentation methods for behavior (e.g. [29]) will be very useful for task-level modeling, for
example. We will contribute to this field by introducing our Skill Simulator. Its high level models will help
robots understand the intention of human demonstrations even if a skill is new to the robots. For example, when
the human is demonstrating a squeezing skill to a robot, the robot can guess that the skill is for pouring since
the phenomenon matches with the high-level dynamic model of pouring. The robot can relate the perceived
demonstration to stored experiences about pouring. We will build on our previous work on learning from
demonstration. We have implemented direct policy learning to allow a robot to learn air hockey and a marble
maze task from watching a human [12, 10, 11]. Other work we have done on policy learning and optimization
and learning includes [3, 35, 36, 37].
Proposed Work: Learning Probabilistic Models: We will use memory-based approaches to constructing
density models to estimate probability distributions [5]. The data from actual robot execution will be stored.
When a probabilistic model is required for a particular task, data from similar tasks and conditions will be
combined and a density function estimated non-parametrically using kernel smoothing.
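A minimal sketch of the non-parametric density estimate, assuming scalar data (here, hypothetical observed delays):

```python
import numpy as np

def kde(samples, x, h=0.3):
    """Memory-based density estimate: one Gaussian kernel per stored
    observation (kernel smoothing), evaluated lazily at query time."""
    z = (x - np.asarray(samples)) / h
    return float(np.mean(np.exp(-0.5 * z ** 2)) / (h * np.sqrt(2 * np.pi)))

rng = np.random.default_rng(3)
delays = rng.normal(0.0, 1.0, size=2000)   # hypothetical stored observations
p_center = kde(delays, 0.0)                # high density near typical values
p_far = kde(delays, 10.0)                  # essentially zero in the tail
```

Like the memory-based models above, the density is defined directly by the stored data, so it sharpens automatically as the robot accumulates experience.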
Proposed Work: Robust Policy Design: One way to design robust policies is to learn a policy that works
for a set or probability distribution of models. The proposed approach achieves robustness by simultaneously
designing one control law for multiple models with potentially different model structures, which represent model
uncertainty and unmodeled dynamics. Multiple model policy search can be done in the typical “model-free”
way by finding the cost of the current policy applied to all of the models. It can be made much more efficient
if a model-based policy optimization approach is taken which uses first and/or second order gradients. These
gradients can be computed recursively, in a similar way to how a value function itself is computed. The gradients
for each model are summed to get the total gradient. Here is an example of a policy gradient for one time step of
a discrete time simulation [?]: $V^k_x = L_x + L_u \pi_x + V^{k+1}_x (F_x + F_u \pi_x)$ and $V^k_p = (L_u + V^{k+1}_x F_u)\pi_p + V^{k+1}_p$,
where x is the state on this step and u is the action. p is the vector of adjustable policy parameters. Subscripts
indicate partial derivatives. V is the value function, L is the reward or cost function, $\pi$ is the current policy
being evaluated, and F is the dynamic model for the process or simulation.
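The backward recursion for $V_x$ and $V_p$ can be sketched for a scalar linear system with policy u = -p x: the gradient is computed per model, summed across the sampled models, and used for gradient descent on the shared policy parameter. The model values are illustrative, not estimates from any real system.

```python
import numpy as np

def rollout_cost(p, a, b, x0=1.0, T=20):
    """Total cost of policy u = -p*x on x' = a*x + b*u with cost x^2 + u^2."""
    J, x = 0.0, x0
    for _ in range(T):
        u = -p * x
        J += x * x + u * u
        x = a * x + b * u
    return J

def grad_cost(p, a, b, x0=1.0, T=20):
    """dJ/dp via the backward recursion: Vx, Vp are derivatives of the
    cost-to-go with respect to the state and the policy parameter."""
    xs = [x0]
    for _ in range(T):
        xs.append(a * xs[-1] + b * (-p * xs[-1]))
    Vx, Vp = 0.0, 0.0                      # terminal values are zero
    for k in range(T - 1, -1, -1):
        x = xs[k]
        Lx, Lu = 2 * x, 2 * (-p * x)       # cost gradients at (x, u)
        pi_x, pi_p = -p, -x                # policy derivatives
        Vp = (Lu + Vx * b) * pi_p + Vp     # uses V^{k+1}_x, so update Vp first
        Vx = Lx + Lu * pi_x + Vx * (a + b * pi_x)
    return Vp

# Three sampled models (a, b) standing in for draws from the estimated
# model distribution (values are illustrative):
models = [(1.2, 1.0), (1.1, 0.8), (1.3, 1.1)]
p = 0.5
for _ in range(200):                       # descend the summed gradient
    p -= 0.01 * sum(grad_cost(p, a, b) for a, b in models)
```

The resulting single p trades off performance across all sampled models, which is exactly the robustness property we want from multiple-model policy optimization.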
Proposed Work: Learning To Learn addresses an expanded vision of curriculum learning, in which tasks
and features are chosen and sequenced in a generalized form of “shaping”. In our work on building models of
robot and actuator dynamics [?], we discovered that some directions in unknown parameter space were hard to
identify and required a great deal of learning effort to reduce parametric uncertainty. Often, these parameter
directions were hard to identify because they had little effect, and thus did not have to be learned. Similarly,
other directions had large effects, needed to be learned, and could be learned quite rapidly and accurately.
Directions that have the largest effects on reward need to be learned first. We believe human motor learning is
effective because humans identify the most important direction in input space and learn that first. Only later are
more dimensions added to learning in a principal components-like process to deal with the true dimensionality
of the task. In Learning to Learn, simulation over learned probabilistic models of possible model parameters
and model structures is used to discover good sequences of tasks, features, and directions in task space to explore,
enabling an actual robot to learn efficiently in the real world.
We present a simple example of learning to learn that learns directions in simulation so the number of
adjustable parameters on the real robot is greatly reduced. Consider the cart-pole problem, in which the length
of the pole and the mass of the cart vary. In addition, there is an unknown delay as well as actuator dynamics.
Given there is only a single actuator (the force on the cart), there is a direction in state space and a corresponding
single feedback gain for any optimal full state LQR controller, for example. In terms of learning to control this
system, a single parameter, the feedback gain, is all that needs to be adjusted. The feedback gain has to be large
enough to stabilize the system, but small enough not to cause oscillations or instabilities due to the unknown
delay and actuator dynamics. The direction in state space that the gain is applied to can be fixed over a wide
range of possible pole lengths, cart masses, delays, and actuator dynamics. The first phase of learning on
the real robot should focus on learning an appropriate feedback gain (one parameter) and nothing else. Later
phases of learning can tune the direction for slightly improved performance (d − 1 parameters where d is the
dimensionality of the state). Compare this to a traditional system identification approach, which would need
to learn d2 + d parameters of a locally linear model. In more complex systems, similar decomposition of the
system into local modes (directions) and scalar gains can be used to create a regulator, or a local controller at
each step of a trajectory. A similar decomposition into directions and scalar gains applies to filter design and
learning algorithms.
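The cart-pole argument can be checked in simulation: compute optimal gains for linearized cart-poles with quite different masses and pole lengths and compare the gain directions. The sketch below uses a simplified point-mass-pole linearization and illustrative parameter values (no delay or actuator dynamics).

```python
import numpy as np

def cartpole_AB(M, l, m=0.1, g=9.8, dt=0.02):
    """Euler-discretized cart-pole linearized about the upright pole
    (point-mass pole, massless rod). State: [x, xdot, theta, thetadot]."""
    Ac = np.array([[0, 1, 0, 0],
                   [0, 0, -m * g / M, 0],
                   [0, 0, 0, 1],
                   [0, 0, (M + m) * g / (M * l), 0]])
    Bc = np.array([[0.0], [1.0 / M], [0.0], [-1.0 / (M * l)]])
    return np.eye(4) + dt * Ac, dt * Bc

def lqr_gain(A, B, iters=3000):
    """Discrete LQR gain via Riccati value iteration (Q = I, R = 0.1)."""
    Q, R = np.eye(4), np.array([[0.1]])
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K.ravel()

# Gains for two quite different cart masses and pole lengths:
K1 = lqr_gain(*cartpole_AB(M=1.0, l=0.5))
K2 = lqr_gain(*cartpole_AB(M=2.0, l=1.0))
# The *direction* of the gain vector is largely shared across models;
# only its scale (one parameter) has to be learned on the real robot first.
cos = K1 @ K2 / (np.linalg.norm(K1) * np.linalg.norm(K2))
```

This is the decomposition into a fixed direction plus a scalar gain described above: simulation over the model distribution identifies the direction, and real-world learning tunes one number.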
Strategies that limit the need for learning could be one reason for the efficiency of human learning. For
example an infant starts from lower degrees of freedom, and gradually increases the degrees of freedom [42].
In our pouring case study [49], we created several different skills (e.g. tipping and shaking) for controlling flow
rather than using a general skill (e.g. joint angle trajectory). With such skills, we could reduce the number
of parameters to be learned. One aspect of the proposed research is to develop algorithms that formalize and
generalize this idea.
In addition, simulation with probabilistic models can be used to select filtering, control, and learning
algorithm parameters. The Kalman filter is a good example of how learning probability distributions leads to
improved performance, in this case state estimation. The better the estimate of the prior state distribution and
the process and measurement noise covariance, the better the resulting filter design. In the nonlinear case,
simulation (Monte Carlo evaluation) is needed to find optimal filter designs. This also applies to control design
and learning algorithm design.
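The Kalman filter example can be made concrete with a scalar random walk: a filter whose assumed noise variances match the true ones estimates the state better than one whose variances are badly mis-set. All values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
q_true, r_true = 0.01, 1.0        # true process / measurement noise variances
T = 4000

# Simulate a scalar random walk observed in noise.
x = np.cumsum(rng.normal(0, np.sqrt(q_true), T))
y = x + rng.normal(0, np.sqrt(r_true), T)

def kalman(y, q, r):
    """Scalar Kalman filter for x' = x + w, y = x + v, using *assumed*
    variances q, r; filter quality depends on how well they match reality."""
    xh, P, out = 0.0, 1.0, []
    for yk in y:
        P += q                    # predict
        K = P / (P + r)           # gain from the assumed covariances
        xh += K * (yk - xh)       # measurement update
        P *= (1 - K)
        out.append(xh)
    return np.array(out)

mse_good = np.mean((kalman(y, q_true, r_true) - x) ** 2)
mse_bad = np.mean((kalman(y, 100 * q_true, r_true) - x) ** 2)  # overtrusts data
```

Learning the probability distributions (here q and r) is thus directly a filter-design step, and the same logic carries over to controller and learning-algorithm design via Monte Carlo evaluation.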
More generally, Learning to Learn can be viewed as a form of active learning [?] in which we use an
estimate of the increase in reward or utility for a fixed amount of learning effort (often experiment time)
(∆Reward/∆LearningEffort) to optimize future learning by choosing optimal actions for both control and
learning. Learning how to trade off exploitation vs. exploration can be determined using simulation with prob-
ability distributions of possible worlds. One can view this as a Monte Carlo approach to designing optimal
algorithms for Dual Control [?].
Proposed Work: Finding Better Strategies By Exploring In Simulation And In Reality: Simulation
can be used to explore, and find better task strategies. Figure 6 shows robot learning on the marble maze task,
where a rolling ball is guided through a maze (with holes) by tilting the board. Learning using “skills” in the
marble maze task was much faster than learning using a direct mapping from states to actions. This is clearly
a two-edged sword: learning is greatly sped up, but ultimate performance might be limited because a needed
skill is missing. However, new skills can be discovered in simulation and in reality. In the three observed
games the human maneuvers the marble below hole 14 (Figure 6). During robot practice the ball falls into
hole 14 and the robot learns that it can more easily maneuver the ball around the top of hole 14. We did not
Figure 6: Ball paths during hardware maze learning: Left: The 3 training games. Middle: Performance on 10
games based on learning from observation using the 3 training games. The maze was successfully completed
5 times, and the red circles mark where the ball fell into the holes. Right: Performance on 10 games based on
learning from practice after 30 practice games. There were no failures. Note the strategy change at hole 14
(zoom in using your PDF viewer).
even know this action was possible until we observed the robot do it. For this to work in simulation, we need
models that can generalize beyond the skills that have already been learned. We have found that our skill models
can extrapolate somewhat. To go beyond that, we need finite element and other forms of detailed mechanical,
material, surface interaction, thermal, and chemical models that enable simulations to discover new strategies,
often using different event sequences (such as contacts) or even using different physical phenomena.
Proposed Work: Using Simulation (Models) While Running The Real Robot: We note that model-based
reinforcement learning continues while operating the real robot. We should use the power of simulation in other
ways during actual robot learning. Simulation can help estimate derivatives and gradient directions for learning
using first or second order gradient descent. Simulation can explore possible actions and eliminate bad or unsafe
actions that a learning algorithm suggests, before the real robot tries them and fails. Simulation can perform
components of the task in robot-in-the-loop learning, so the robot is always close to success and estimated
gradients of utility with respect to task inputs are accurate, and steps in the learning algorithm actually lead to
improved performance. Finally, Learning To Learn based on simulation can continue to improve learning based
on actual robot data.
Proposed Work: Transfer: We will explore sharing learned knowledge among multiple tasks and robots.
For example grasping skills are used in different tasks. Although a grasping policy is task or situation dependent
(e.g. grasping for pouring and grasping for loading dishes into a dishwasher are different), a grasping dynamical
model could be more easily shared among tasks. Policies are the result of planning over a task given goals,
while dynamical models are local relationships independent of goals and much task context. Other models
can be transferred as well. For example, in pouring, after material comes out of the source container and flow
happens, the dynamical model between the flow features (flow position, flow variance) and the amount poured
into the receiver or spilled onto a table is largely independent of the robot.
Fast, robust, scalable and precise transition of learned knowledge from simulation to real environ-
ments: Our Skill Simulator is fast because it uses task models and focuses on relevant behavior. It is precise
and transitions to real robot behavior are fast because the Skill Simulator uses learned models, which are more
accurate and often do not require temporal integration, as they map directly from task commands to outputs. The
simulator and transitioned behavior are scalable because we use trajectory graphs to represent policies locally,
whose computational cost is cubic in the dimensionality (due to matrix inversions) rather than exponential
in the dimensionality. The computational cost scales linearly with the duration of the
behavior/simulation. The transitioned behavior is robust because we use multiple models and model probability
distributions in controller and learning algorithm design.
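The scaling claim can be made concrete with a back-of-the-envelope operation count (a rough model, not a measured benchmark): computing a locally linear policy along a trajectory inverts one d x d matrix per time step, so the cost grows cubically with the state dimensionality d and linearly with the number of steps, while a policy tabulated on a state-space grid grows exponentially with d:

```python
def trajectory_policy_ops(num_steps, state_dim):
    """Rough operation count for a locally linear policy along a trajectory:
    one state_dim x state_dim matrix inversion (~d^3 operations) per step,
    each step processed once, so the total is linear in duration."""
    return num_steps * state_dim ** 3

def grid_ops(resolution, state_dim):
    """Rough operation count for tabulating a policy on a state-space grid:
    exponential in the dimensionality."""
    return resolution ** state_dim

# Doubling the duration doubles the trajectory cost; doubling the
# dimensionality multiplies it by 8 (cubic), while the grid cost is squared.
```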
Differentiation From The Current State Of The Art
As model-based RL, our research advances the state of the art (e.g. [20]) through its (sub)task-level
dynamical models.
Compared to direct policy search, which is among the most successful robot learning approaches (cf. [16]), our approach
has better generalization, reusability and shareability, and robustness to reward changes. We also contribute
to reducing the simulation bias issue [16].
As an approach to planning with physics simulators, our method provides models of non-rigid objects that are
hard to handle even in state-of-the-art simulators such as MuJoCo [44].
As deep reinforcement learning research, we present a practical use of deep neural networks as learned
models within part of the simulator.
Unlike the state-of-the-art robot learning methods [16], in which robots directly search for policies, we create
simulation models of the world through interaction with the real world.
Related research is found in hierarchical RL (e.g. [39, 8, 15]). For more complicated tasks such as cooking,
we consider a hierarchy with more layers. Most existing algorithms are designed for symbolic worlds, i.e., they are
not capable of performing practical robotic tasks. In our work, we include continuous variables even at the higher
levels. For example, a pouring skill has continuous parameters such as a target amount, and these are reasoned
about within a cooking task.
Model-based reinforcement learning (RL) of this kind has not been popular in robot learning over the past decade [16].
The state-of-the-art reinforcement learning methods are direct policy search (e.g. [43, 17]). Morimoto et al. ex-
plored learning models through practice and planning actions with them [27]. An early example of how reinforcement
learning in simulation can accelerate actual robot learning is shown in [28]. One reason this approach has
not become dominant is the growth of uncertainty during simulation (due to time integration), which is discussed as
simulation bias in [16]. We can improve on this in several ways.
CMU Resources: The Search-based Planning Lab provides a PR2 robot for pouring experiments (Figure 5).
The PR2 robot has two 7 DOF arms, a parallel gripper on each arm, a lift-type torso, and an omni-directional
mobile platform. Its arm payload is 1.8 kg, the grip force is 80 N, and the grip range is 0 to 90 mm, which have
been sufficient for our pouring experiments so far. We also have a Baxter research robot for the proposed work
(Figure 5). It has two 7 DOF arms and two different types of parallel grippers. Its arm payload is 2.2 kg. One
gripper's grip force is 44 N with a grip range of 37 to 75 mm, and the other gripper's grip force is 100 N with a
grip range of 0 to 84 mm. The Baxter robot has torque sensors at each joint. We extensively instrument both
robots and the surrounding environment with many cameras: RGB, depth, thermal, etc.
Budget: Proposed budget in US$:
Atkeson - 0.5 month 10,000
Yamaguchi - 75% 40,000
Benefits 11,398
Computing 2,000
Materials 20,000
Japan Travel 8,000
US Travel 4,000
Direct Costs 100,000
Overhead 50,000
Total 150,000
References
[1] E. Aboaf, S. Drucker, and C. Atkeson. Task-level robot learning: Juggling a tennis ball more accurately. In IEEE International Conference on Robotics and Automation, pages 1290–1295, Scottsdale, AZ, 1989.
[2] Chae H. An, C. G. Atkeson, and John M. Hollerbach. Model-Based Control of a Robot Manipulator. MIT Press, Cambridge, MA, 1988.
[3] S. O. Anderson, J. K. Hodgins, and C. G. Atkeson. Approximate policy transfer applied to simulated bongo board balance. In IEEE-RAS International Conference on Humanoid Robots (Humanoids), 2007.
[4] C. G. Atkeson. Nonparametric model-based reinforcement learning. In Advances in Neural Information Processing Systems, volume 10, pages 1008–1014. MIT Press, Cambridge, MA, 1998.
[5] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11:11–73, 1997.
[6] C. G. Atkeson and S. Schaal. Learning tasks from a single demonstration. In Proceedings of the 1997 IEEE International Conference on Robotics and Automation (ICRA97), pages 1706–1712, 1997.
[7] C. G. Atkeson and Stefan Schaal. Robot learning from demonstration. In Proc. 14th International Conference on Machine Learning, pages 12–20. Morgan Kaufmann, 1997.
[8] Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.
[9] Darrin C. Bentivegna. Learning from Observation Using Primitives. PhD thesis, Georgia Institute of Technology, 2004.
[10] D.C. Bentivegna, C.G. Atkeson, A. Ude, and G. Cheng. Learning tasks from observation and practice. Robotics and Autonomous Systems, 47:163–169, 2004.
[11] D.C. Bentivegna, C.G. Atkeson, A. Ude, and G. Cheng. Learning to act from observation and practice. International Journal of Humanoid Robotics, 1(4):585–611, 2004.
[12] D.C. Bentivegna, G. Cheng, and C.G. Atkeson. Learning from observation and from practice using behavioral primitives. In 11th International Symposium on Robotics Research, Siena, Italy, 2003.
[13] Aude Billard and Daniel Grollman. Robot learning by demonstration. Scholarpedia, 8(12):3824, 2013.
[14] Mario Bollini, Stefanie Tellex, Tyler Thompson, Nicholas Roy, and Daniela Rus. Interpreting and executing recipes with a cooking robot. In the 13th International Symposium on Experimental Robotics, pages 481–495, 2013.
[15] Shahar Cohen, Oded Maimon, and Evgeni Khmlenitsky. Reinforcement learning with hierarchical decision-making. In ISDA ’06: Proceedings of the Sixth International Conference on Intelligent Systems Design and Applications, pages 177–182, USA, 2006. IEEE Computer Society.
[16] J. Kober, J. Andrew Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. International Journal of Robotics Research, 32(11):1238–1274, 2013.
[17] Jens Kober and Jan Peters. Policy search for motor primitives in robotics. Machine Learning, 84(1-2):171–203, 2011.
[18] Petar Kormushev, Sylvain Calinon, and Darwin G. Caldwell. Robot motor skill coordination with EM-based reinforcement learning. In the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’10), pages 3232–3237, 2010.
[19] Lars Kunze and Michael Beetz. Envisioning the qualitative effects of robot manipulation actions using simulation-based projections. Artificial Intelligence, 2015.
[20] Ian Lenz, Ross Knepper, and Ashutosh Saxena. DeepMPC: Learning deep latent features for model predictive control. In Robotics: Science and Systems (RSS’15), 2015.
[21] Sergey Levine, Nolan Wagener, and Pieter Abbeel. Learning contact-rich manipulation skills with guided policy search. In the IEEE International Conference on Robotics and Automation (ICRA’15), 2015.
[22] Michael G. Madden and Tom Howley. Transfer of experience between reinforcement learning environments with progressive difficulty. Artificial Intelligence Review, 21:375–398, June 2004.
[23] David Mayne. A second-order gradient method for determining optimal trajectories of non-linear discrete-time systems. International Journal of Control,3(1):85–95, 1966.
[24] J. Morimoto and C. G. Atkeson. Improving humanoid locomotive performance with learnt approximated dynamics via Gaussian processes for regression. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2007.
[25] J. Morimoto and C. G. Atkeson. Nonparametric representation of an approximated Poincare map for learning biped locomotion. Autonomous Robots,27(2):131–144, 2009.
[26] J. Morimoto, S. H. Hyon, C. G. Atkeson, and G. Cheng. Low-dimensional feature extraction for humanoid locomotion using kernel dimension reduction. In IEEE-RAS Conference on Robotics and Automation, pages 2711–2716, 2008.
[27] J. Morimoto, G. Zeglin, and C.G. Atkeson. Minimax differential dynamic programming: Application to a biped walking robot. In the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’03), volume 2, pages 1927–1932, 2003.
[28] Jun Morimoto and Kenji Doya. Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. In ICML ’00: Proceedings of the Seventeenth International Conference on Machine Learning, pages 623–630, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
[29] Scott Niekum, Sachin Chitta, Bhaskara Marthi, Sarah Osentoski, and Andrew G Barto. Incremental semantically grounded learning from demonstration. In Robotics: Science and Systems, 2013.
[30] Yunpeng Pan and Evangelos Theodorou. Probabilistic differential dynamic programming. In Advances in Neural Information Processing Systems 27, pages 1907–1915. Curran Associates, Inc., 2014.
[31] S. Schaal and C. G. Atkeson. Constructive incremental learning from only local information. Neural Computation, 10(8):2047–2084, 1998.
[32] S. Schaal and C. G. Atkeson. Learning control for robotics. IEEE Robotics & Automation Magazine, 17(2):20–29, 2010.
[33] S. Schaal, C. G. Atkeson, and S. Vijayakumar. Real-time robot learning with locally weighted learning. In Proceedings, IEEE International Conference onRobotics and Automation, 2000.
[34] S. Schaal, C. G. Atkeson, and S. Vijayakumar. Scalable locally weighted statistical techniques for real time robot learning. Applied Intelligence, 16(1), 2002.
[35] M. Stolle, H. Tappeiner, J. Chestnutt, and C. G. Atkeson. Transfer of policies based on trajectory libraries. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2007.
[36] Martin Stolle and Christopher G. Atkeson. Knowledge transfer using local features. In Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL), 2007.
[37] Martin Stolle and Christopher G. Atkeson. Finding and transferring policies using stored behaviors. Autonomous Robots, 29(2):169–200, 2010.
[38] Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In the Seventh InternationalConference on Machine Learning, pages 216–224. Morgan Kaufmann, 1990.
[39] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.
[40] Richard S. Sutton, Csaba Szepesvari, Alborz Geramifard, and Michael Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, pages 528–536, 2008.
[41] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[42] G. Taga, R. Takaya, and Y. Konishi. Analysis of general movements of infants towards understanding of developmental principle for motor control. In the IEEE International Conference on Systems, Man, and Cybernetics, 1999 (SMC ’99), volume 5, pages 678–683, 1999.
[43] E. Theodorou, J. Buchli, and S. Schaal. Reinforcement learning of motor skills in high dimensions: A path integral approach. In the IEEE International Conference on Robotics and Automation (ICRA’10), pages 2397–2403, May 2010.
[44] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.
[45] L. Torrey and J. Shavlik. Transfer learning. In E. Soria, J. Martin, R. Magdalena, M. Martinez, and A. Serrano, editors, Handbook of Research on Machine Learning Applications, chapter 11. IGI Global, 2009.
[46] Eiji Uchibe and Kenji Doya. Competitive-cooperative-concurrent reinforcement learning with importance sampling. In the International Conference on Simulation of Adaptive Behavior: From Animals and Animats, pages 287–296, 2004.
[47] Akihiko Yamaguchi and Christopher G. Atkeson. Differential dynamic programming with temporally decomposed dynamics. In the 15th IEEE-RAS International Conference on Humanoid Robots (Humanoids’15), 2015.
[48] Akihiko Yamaguchi and Christopher G. Atkeson. Neural networks and differential dynamic programming for reinforcement learning problems. In the IEEE International Conference on Robotics and Automation (ICRA’16), 2016.
[49] Akihiko Yamaguchi, Christopher G. Atkeson, and Tsukasa Ogasawara. Pouring skills with planning and learning modeled from human demonstrations. International Journal of Humanoid Robotics, 12(3):1550030, 2015.