
[Figure 1, Right panel: "Simple Example Results" — position (m) vs. time (s) for the nominal and actual plants with the original and optimized gains p.]

Figure 1: Left: One of our soft robot prototypes. Middle: Our humanoid robot working with a human. Right: Simulations of the simple example.

NRI-Small: Adaptive/Approximate Dynamic Programming (ADP) Control Of Soft Inflatable Robots

This proposal focuses on how to control robots in the presence of unmodeled dynamics, a relatively neglected area in optimal control, reinforcement learning, and adaptive/approximate dynamic programming. A major focus of the National Robotics Initiative is to work with humans. This necessarily involves touching soft and poorly modeled objects (such as a human). In addition, the National Robotics Initiative stresses the need for soft robots for safety. We have begun research into inflatable robots to achieve safe physical human-robot interaction (Figure 1 Left) [93, 94]. We also believe that inflatable robots can be made much more cheaply than current robots, using manufacturing techniques from the clothing, printing, and plastic (pool) toy industries [42, 75, 57]. We have found that for robots with soft structure, the contact state greatly affects the dynamics of the robot, in the same way that pressing the strings of a guitar against the frets changes the natural frequency of the strings while playing music. Control algorithms for soft robots and manipulating soft objects need to be able to handle poorly known objects and loads, substantial model changes, and unmodeled dynamics as contact state changes. We have found that the same dependence of unmodeled dynamics on contact state is true for complex "rigid" manipulators such as our humanoid robot (Figure 1 Middle). All robots are soft to some degree.

We have found that current model-based controller design techniques such as dynamic programming can be destabilized by unmodeled dynamics. One approach to preserving model-based design is to learn models and design controllers online (typically referred to as an "indirect" approach). Even in the linear case, there is too little time or data to fully identify the dynamics at all frequencies at the rate typical locomotion or manipulation tasks are performed. Current model-free ("direct") adaptive control and reinforcement learning techniques learn too slowly to be practical.

Our approach is to design and learn policies (feedback controllers) for particular tasks, and make them robust to unmodeled dynamics using a multiple model design approach [116, 36, 113, 77, 12, 73, 83, 15, 60, 96, 11, 47]. The proposed approach achieves robustness by simultaneously designing one control law for multiple models with potentially different model structures, which represent model uncertainty and unmodeled dynamics. Our approach supports the design of deterministic nonlinear and time varying controllers for both deterministic and stochastic nonlinear and time varying systems, including policies with internal state such as observers or other state estimators. We highlight the benefit of control laws made up of collections of simple policies where only one simple policy is active at a time. Multiple model controller optimization and learning is particularly fast and effective in this situation because derivatives are decoupled. This proposal is focused on designing controllers for soft robots. Our approach will also apply to systems with uncertain time delays, bandwidth or power limits on actuation, vibrational modes, and non-collocated sensing found in lightweight robot arms, series elastic actuation, satellites with booms or large solar panels, and large space structures.

What is transformative? 1) The emphasis on nonparametric unmodeled dynamics is new to reinforcement learning and relatively neglected in optimal control, and the emphasis on unmodeled dynamics that depend on contact state and force levels addresses an issue that has been neglected in robotics. 2) The emphasis on complex and hierarchical policies, such as policies that are collections of simple policies, is unusual for control theory and optimal control. 3) Efficient multiple model methods for designing robust nonlinear and time varying controllers for nonlinear systems, as well as feedforward input design, will increase the acceptance of these techniques. 4) Multiple model methods for designing robust controllers that handle different model structures, and multiple model controller design methods that design policies with internal state, allowing the co-design of controllers and state estimators, are both novel. 5) The multiple model design perspective is applied in a unified way to a wide range of controller design issues. 6) High performance control for soft robots will transform how safety concerns are addressed, and will reduce the cost of robots by orders of magnitude.

Robust Multiple Model Controller Design

In our research we will focus on both discrete time and continuous time formulations. Due to space restrictions, we will limit our discussion here to how to design deterministic nonlinear and potentially time varying discrete time control laws. For cases where the multiple models all have the same state vector, the common policy is $u = \pi(x,p)$, where $u$ is a vector of controls of dimensionality $N_u$, $x$ is the state vector of the controlled system (dimensionality $N_x$), and $p$ is a policy parameter vector of dimensionality $N_p$ that describes the policy $\pi()$.

A Simple Example: We present our method applied to a simple example. We then compare our method to a perturbation-based approach applied to second order unmodeled dynamics and an unknown delay. Consider a nominal linear plant which is a double integrator sampled at 1 kHz. The state vector $x$ consists of the position $p$ and velocity $v$. In this example the feedback control law has the structure $u = Kx = k_p p + k_v v$. An optimal Linear Quadratic Regulator (LQR) is designed for the nominal double integrator plant with a one step cost function of $L(x,u) = 0.5(x^T Q x + u^T R u)$. In this example $Q = [1000\ 0;\ 0\ 1]$ and $R = [0.001]$, resulting in optimal feedback gains of $K = [973\ 54]$.

The true plant is the nominal plant with the following unmodeled dynamics: a second order low pass filter is added on the input with a cutoff of 10 Hz. The transfer function for the unmodeled dynamics is $\omega^2/(s^2 + 2\gamma\omega s + \omega^2)$, with a damping ratio $\gamma = 1$ and a natural frequency $\omega = 20\pi$. There is no resonant peak and the unmodeled dynamics act as a well behaved low pass filter. However, the unmodeled dynamics drive the true plant unstable when the feedback gains designed using the nominal plant model are used. Figure 1 shows simulations of these conditions: the blue dot-dashed line is the nominal plant with the original gains $[973\ 54]$, and the black dotted line shows the true plant with the original gains, which is unstable.

One way to design a robust control law is to optimize the parameters of a control law (in this case the position and velocity feedback gains) by evaluating them on several different models. The control law is simulated for a fixed duration $D$ on each of $M$ models for $S$ initial conditions, and the cost of each trajectory, $V_m(x_s,p)$, is summed for the overall optimization criterion, using the above $L(x,u)$: $C = \sum_{m=1}^{M}\sum_{s=1}^{S} w(m,s) V_m(x_s,p)$. $w(m,s)$ is a weight on each trajectory. It is useful to normalize the weights so that $\sum_{m=1}^{M}\sum_{s=1}^{S} w(m,s) = 1/N$, where $N$ is the total number of time steps of all trajectories combined. We will suppress the $m$ subscript on $V$ to simplify our results. We assume that each trajectory is created using the appropriate model and use the appropriate model dynamics to calculate derivatives of $V$ in what follows. First and second order gradients are summed using the same weights $w(m,s)$ as were used on the trajectories.

Optimizing $u = Kx$ for the nominal model and the nominal model with an added input filter with $\omega = 10\pi$ and $\gamma = 0.5$, with initial conditions $(1,0)$, results in feedback gains of $[148\ 16]$. These gains are also stable for the true plant ($\omega = 20\pi$, $\gamma = 1$). Figure 1 shows simulations of these conditions: the magenta dashed line shows the nominal plant with the gains optimized for multiple models, and the red solid line shows the true plant with the same gains. The multiple model gains are less aggressive than the original gains, and the true plant is stable and reasonably well damped.
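To make the multiple model criterion concrete, here is a minimal Python sketch of gain optimization over two models for the double integrator example above. The Euler discretization of the input filter, the one second horizon, the Nelder-Mead search, and the $u = -Kx$ sign convention are assumptions of this sketch rather than specifications from the proposal, so the resulting gains and costs will not exactly match the numbers reported above.

```python
import numpy as np
from scipy.optimize import minimize

dt = 0.001                                   # 1 kHz sampling
Q = np.diag([1000.0, 1.0])                   # state cost weights from the example
R = 0.001                                    # control cost weight

def simulate(K, w_n=None, zeta=None, x0=(1.0, 0.0), D=1000):
    """Total cost of u = -K x on the double integrator, optionally with a second
    order low pass filter (natural frequency w_n, damping zeta) on the input."""
    x = np.array(x0, dtype=float)
    f = np.zeros(2)                          # filter state: [filtered input, its derivative]
    cost = 0.0
    for _ in range(D):
        u = -K @ x                           # commanded control
        cost += 0.5 * (x @ Q @ x + R * u**2)
        if w_n is None:
            u_eff = u
        else:                                # Euler-discretized unmodeled input filter
            f_dd = w_n**2 * (u - f[0]) - 2.0 * zeta * w_n * f[1]
            f = f + dt * np.array([f[1], f_dd])
            u_eff = f[0]
        x = np.array([x[0] + dt * x[1] + 0.5 * dt**2 * u_eff,
                      x[1] + dt * u_eff])
        if not np.isfinite(cost) or cost > 1e12:
            return 1e12                      # treat divergence as a very large cost
    return cost

def multi_model_cost(K):
    # nominal model plus nominal model with an omega = 10*pi, zeta = 0.5 input filter
    models = [dict(), dict(w_n=10 * np.pi, zeta=0.5)]
    return sum(simulate(np.asarray(K), **m) for m in models) / len(models)

result = minimize(multi_model_cost, x0=[973.0, 54.0], method="Nelder-Mead")
print("multiple-model gains:", result.x)
# check the optimized gains on the "true" plant (omega = 20*pi, zeta = 1)
print("true-plant cost:", simulate(result.x, w_n=20 * np.pi, zeta=1.0))
```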

A model with the same model structure as the true plant does not have to be included in the set of models used in policy optimization. Optimizing using the nominal double integrator model and the nominal model with an input delay of 50 milliseconds results in optimized gains of $[141\ 18]$, which provide about the same performance on the true plant as the previous optimized gains. In addition, the new gains are stable for double integrator plants with delays up to 61 milliseconds, while the original gains of $[973\ 54]$ are stable only for delays up to 22 milliseconds. We note that the nominal double integrator model, the nominal model with an input filter, and the nominal model with a delay all have different model structures (number of state variables, for example), which a multiple model policy optimization approach should handle.

We compare our approach to the heuristic used in reinforcement learning of adding simulated perturbations to make the policy more robust. We use the method of common random numbers [41] (which has been reinvented many times and is also known as correlated sampling, matched pairs, matched sampling, and Pegasus [72]) to optimize the policy. An array of random numbers is created, and that same array is used to perturb each simulation of the nominal system, typically by adding process noise to the plant input $u$, while optimizing policy parameters. On the simple example, we found that the added process noise needed to be quite large (±1000 uniformly distributed on each time step) for the generated controller to work reliably on the true plant with the input filter with a cutoff of 10 Hz. However, there was only a narrow window of noise levels that worked reliably, and higher and lower levels of noise produced unstable controllers quite often. We have found in general that added noise is not a reliable proxy for unmodeled dynamics. The challenging aspect of unmodeled dynamics is that small errors are correlated across time, leading to large effects.
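A minimal sketch of the common random numbers idea described above (the horizon, noise level, and nominal double integrator are assumed here purely for illustration): the noise array is drawn once and reused for every policy evaluation, so cost differences between candidate gains reflect the gains rather than resampled noise.

```python
import numpy as np

rng = np.random.default_rng(0)
D, S = 1000, 4                                       # horizon and number of rollouts
noise = rng.uniform(-1000.0, 1000.0, size=(S, D))    # drawn once, then frozen

def noisy_cost(K, noise, dt=0.001):
    """Average cost of u = -K x + w on the nominal double integrator, reusing the
    same frozen noise array for every candidate gain vector K."""
    total = 0.0
    for s in range(S):
        x = np.array([1.0, 0.0])
        for k in range(D):
            u = -K @ x + noise[s, k]                 # common random numbers: same w every call
            total += 0.5 * (1000.0 * x[0]**2 + x[1]**2 + 0.001 * u**2)
            x = np.array([x[0] + dt * x[1], x[1] + dt * u])
    return total / S

# two candidate gain vectors can now be compared without Monte Carlo variability
print(noisy_cost(np.array([973.0, 54.0]), noise),
      noisy_cost(np.array([148.0, 16.0]), noise))
```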

Related Work

[88] surveys the early history of dealing with unmodeled dynamics in control theory. Loop transfer recovery and H∞ design are prominent techniques for designing linear robust control systems. One approach to handling the effect of contact on unmodeled dynamics is to use passivity and/or impedance control. These arguments typically assume that the control signal produces forces or torques or that the actuator dynamics are perfectly known, and that the sensors and actuators are collocated or the dynamics between the control signals and sensors are perfectly known [1]. In our case, with both our soft robots and our humanoid robot, control signals typically control pneumatic or hydraulic valves which are "third order" in that they affect the derivative of a force or torque, and the multi-stage valves typically have internal dynamics with power sources (so they definitely aren't passive) as well. In the case of tendon driven systems the dynamics of the tendons are affected by the loading. Our sensors and actuators are not collocated, and in fact with inflatable robots the actuation can be distributed over a wide area, so the "point" of actuation is not well defined. In the case of our hydraulic system we have to use force control to make it compliant, and our implementation of impedance control was severely limited by the structural resonances of the various metal parts, including the force sensors themselves, and play in the bearings. Even "passive" controller designs and impedance control are affected by unmodeled dynamics. Restrictions that the unmodeled dynamics take a particular form or are passive or minimum phase are unrealistic.

[2] surveys the progress in dealing with unmodeled dynamics in adaptive control. There are several reasons why optimal control is a more useful foundation for what we are trying to do than adaptive control. Unlike most of adaptive control, we are not trying to follow a reference model but trying to optimize a criterion. This has been referred to as adaptive optimal control [23]. We are dealing with severely nonlinear systems and quick transient tasks, rather than trying to regulate a linear or linearizable system to a steady state. It is unlikely that there will be enough data or excitation to identify a system during each phase of a task, either directly or indirectly. We must integrate information across repeated attempts to execute related tasks. We will borrow ideas from adaptive control such as separation of time scales, limiting the number of adaptable parameters, adaptation dead zones, persistence of excitation, turning off any model identification when the input is not rich enough, adding in small system identification signals, limiting the controller bandwidth, adding hysteresis on model switching, and taking into account that changing the controller can change an identified model as well. We must be mindful that "1) there are always unmodeled dynamics at sufficiently high frequencies (and it is futile to try to model these dynamics) and 2) the plant cannot be isolated from unknown disturbances (e.g., 60 Hz hum) even though these may be small" [2]. In our case disturbances such as impacts during locomotion or manipulation can be quite large.

There is a strong relationship between the proposed work and Differential Dynamic Programming (DDP), which propagates value function information backward in time along a trajectory, and chooses optimal actions and feedback gains at each time step [43, 35]. The optimization of global parameters $\alpha$ and the general form of the value function update equations in [35] were an inspiration for the proposed work. Our proposed work suggests alternative forms of DDP, such as optimizing a trajectory-based policy and a version which uses multiple models simultaneously.

Soon after the development of the linear quadratic regulator, the research area of output feedback controller optimization was created to handle the case when full state feedback was not available, and an observer or state estimator was not used [116, 44, 74, 55, 115]. Output feedback optimization computes the optimal control law for linear models when the structure of the control law is fixed. We note that it is difficult to apply linear matrix inequality (LMI) or polytopic model-based optimal output feedback techniques to multiple models with different model structures since it is not clear how to interpolate between these models [116]. One can embed the multiple models in a much more complex single model so that structural differences become parametric differences, but that greatly complicates the design process. For linear systems one can interpolate models in the frequency domain, but it is not clear how to generalize frequency domain interpolation to nonlinear models with different structures. Varga showed how to apply multiple models to output feedback controller optimization where all models have the same state vector [116].

Policy optimization (aka policy search/refinement/improvement/gradient) is of great interest in reinforcement learning (RL) [114, 72, 12, 49, 47]. Typically a stochastic policy is used to provide "exploration" or, from our point of view, to perform numeric differentiation to find the dependence of the trajectory cost on the policy parameters. Gradient learning algorithms such as backpropagation applied to a lattice network model of the trajectory-based computations, or backpropagation through time applied to a recurrent network model, result in similar gradient equations to this work [126, 119, 120, 82]. One area of reinforcement learning that is also closely related to this work is that of adaptive critics [62, 122, 21, 97, 48, 117, 49, 79]. We discuss adaptive critics in more detail in the section on "Global Optimization Of General Policies". Lewis and Vrabie develop first order analytic gradient equations for the special case when the policy is linearly parametrized [49]. Kolter developed a first order analytic gradient that propagates derivatives forward in time for deterministic policy optimization, which, because it does not take advantage of value functions, is in general less efficient than our proposed approach [47].

The term multiple models means different things in different fields. We use it to mean alternative plants that could exist. In machine learning it often means multiple model structures that are selected or blended to fit data. In control theory it has been used both for alternative global models and for local models that divide up the state space [121]. In Multiple Model Adaptive Control and Multiple Model Adaptive Estimation (MMAC and MMAE), instead of computing one policy based on multiple models as is done in the proposed approach, a policy is computed for each possible model. An adaptive algorithm learns to select or combine the individual policies.

Research Approach

This section outlines some of our approaches to anticipated research challenges (the ones that fit in the page limit). A wide variety of optimization algorithms can be used to optimize the policy parameters $p$. One goal of the proposed work is to provide efficient algorithms to calculate first and second order gradients of the total trajectory cost of a control law with respect to its parameters, to speed up policy optimization. We describe how to propagate analytic gradients backward along simulated trajectories. Our approach supports the design of deterministic nonlinear and time varying controllers for both deterministic and stochastic nonlinear and time varying systems, including policies with internal state such as observers or other state estimators. We highlight the benefit of control laws made up of collections of simple policies where only one simple policy is active at a time. Controller optimization and learning is particularly fast and effective in this situation because derivatives are decoupled.

The gradient algorithms presented are intended to be used in a controller design process, so we assume the nominal models, one step cost functions, and policy structure are all known. Initially we will consider policy optimization problems using multiple discrete time models where there is no discounting, full state feedback is available, all the models use the same state vector, the policies are static, and there is no opponent. In later years we will explore extensions to the basic approach. Due to space restrictions, in this proposal we will only show how to calculate the first and second order cost gradients for a single trajectory. The total derivatives for a set of models and trajectories are the sum of the derivatives for each trajectory.

First Order Gradient: A first order gradient descent algorithm updates the policy parameters in the following way: $\Delta p = -\varepsilon \sum_{m=1}^{M}\sum_{s=1}^{S} w(m,s) V_p^T(x_s,p)$, where $\Delta p$ is the update, $\varepsilon$ is a step size, and $V_p = \partial V/\partial p$. $V_p$ and other derivatives of scalars are row vectors. Initially, we will use a finite horizon to a fixed point in time to evaluate the policy. In this case the Bellman Equation (principle of optimality [20]) becomes $V^k(x,p) = L(x,\pi(x,p)) + V^{k+1}(F(x,\pi(x,p)),p)$, where $x_{i+1} = F(x_i,u_i)$ are the system dynamics equations appropriate for each model, and $V^k(x,p) = \phi(x_D) + \sum_{i=k}^{D-1} L(x_i,u_i)$ is the cost of the remaining trajectory generated by starting at $x_k$ and using the policy $u_i = \pi(x_i,p)$. $\phi(x)$ is a terminal cost function evaluated at the end of the trajectory. We note that the one step cost function $L()$ and terminal cost function $\phi()$ may depend on the model $m$, and also on the initial state ($s$ index). We will explore this possibility in the later years of this project. The derivative $V_p$ is $V_p^0$, and we will use the notation $V_p$ and $V_p^0$ interchangeably. $i$ and $k$ are temporal indices and can appear as either subscripts or superscripts as needed for readability.

To calculate the first order gradient we will approximate the dynamics, one step cost, policy, and value function $V()$ with first order Taylor series approximations. For example, $F(x,u) = F + F_x\Delta x + F_u\Delta u$, where we follow the conventions of [35] in that $x$, $u$, and $p$ subscripts indicate partial derivatives evaluated with the appropriate arguments at that time point along the trajectory. Derivatives of scalars ($L_x$, $L_u$, $V_x$, and $V_p$) are row vectors. Derivatives of vectors are matrices whose rows are the derivatives of the components of the original vector. $F_x$ is an $N_x \times N_x$ matrix, $F_u$ is $N_x \times N_u$, $\pi_x$ is $N_u \times N_x$, and $\pi_p$ is $N_u \times N_p$. In this case, the derivatives of the Bellman Equation are:

$$V_x^k = L_x + L_u\pi_x + V_x^{k+1}(F_x + F_u\pi_x) \qquad (1)$$

and

$$V_p^k = (L_u + V_x^{k+1}F_u)\pi_p + V_p^{k+1} \qquad (2)$$

We are suppressing the $k$ superscripts on the right hand sides of these equations since every symbol not indexed by $k+1$ is indexed by $k$. $V_p^0$ is calculated by using these equations to propagate $V$ and its derivatives backward in time along the trajectory. We are making extensive use of the chain rule. Depending on how the policy optimization is formulated, $V^D$ and its derivatives can be those of a terminal cost function, or they can be zero if there is no terminal cost function. For a terminal cost function $\phi(x)$, $V_x^D = \phi_x$. Since $\phi()$ is independent of the policy parameters, $V_p^D = 0$.

Equations 1 and 2 can be used in many ways in optimization. Backward passes to calculate $\Delta p$ can alternate with forward passes that generate new trajectories by using the new policy and integrating the appropriate dynamics forward in time for each model. Trajectory segments can be generated, as in multiple shooting [110]. Trajectories can be represented parametrically and an optimization procedure can be used to make the trajectories consistent with the new policy and appropriate dynamics, as in collocation [22].
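The backward recursion in Equations 1 and 2 can be implemented in a few lines. The sketch below is only an illustration under assumed conventions (precomputed Jacobian arrays, and the standard $u = -Kx$ sign convention in the usage example), not the proposal's implementation.

```python
import numpy as np

def first_order_policy_gradient(Fx, Fu, Lx, Lu, pix, pip, phi_x=None):
    """Backward pass for V_x and V_p along one trajectory (Equations 1 and 2).
    Fx[k], Fu[k]: dynamics Jacobians at step k; Lx[k], Lu[k]: one step cost
    gradients (row vectors); pix[k], pip[k]: policy Jacobians with respect to
    the state and the policy parameters.  Returns V_p at the trajectory start."""
    D = len(Fx)
    Vx = phi_x.copy() if phi_x is not None else np.zeros(Fx[0].shape[0])  # V_x^D
    Vp = np.zeros(pip[0].shape[1])                                        # V_p^D = 0
    for k in reversed(range(D)):
        Vp = (Lu[k] + Vx @ Fu[k]) @ pip[k] + Vp                       # Equation (2)
        Vx = Lx[k] + Lu[k] @ pix[k] + Vx @ (Fx[k] + Fu[k] @ pix[k])   # Equation (1)
    return Vp

# usage sketch: double integrator, policy u = -K x with p = (k_p, k_v)
dt, D = 0.001, 500
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.5 * dt**2], [dt]])
K = np.array([148.0, 16.0])
x = np.array([1.0, 0.0])
Fx, Fu, Lx, Lu, pix, pip = [], [], [], [], [], []
for k in range(D):
    u = float(-K @ x)
    Fx.append(A); Fu.append(B)
    Lx.append(x * np.array([1000.0, 1.0]))     # d/dx of 0.5 x^T Q x, Q = diag(1000, 1)
    Lu.append(np.array([0.001 * u]))           # d/du of 0.5 R u^2,   R = 0.001
    pix.append(-K.reshape(1, 2))               # du/dx
    pip.append(-x.reshape(1, 2))               # du/dp (the gains are the parameters)
    x = A @ x + B.flatten() * u
print("V_p at the trajectory start:", first_order_policy_gradient(Fx, Fu, Lx, Lu, pix, pip))
```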

Second Order Gradient: A second order gradient descent algorithm updates the policy parameters in the following way: $\Delta p = -\left(\sum_{m=1}^{M}\sum_{s=1}^{S} w(m,s) V_{pp}(x_s,p)\right)^{-1} \sum_{m=1}^{M}\sum_{s=1}^{S} w(m,s) V_p^T(x_s,p)$, where $V_{pp} = \partial^2 V/\partial p\,\partial p$.

To calculate the second order gradient we will approximate the dynamics, one step cost, policy, and value function $V()$ with second order Taylor series approximations. For example: $F(x,u) = F + F_x\Delta x + F_u\Delta u + 0.5\Delta x^T F_{xx}\Delta x + \Delta x^T F_{xu}\Delta u + 0.5\Delta u^T F_{uu}\Delta u$. We follow the conventions of [35] in that the second derivatives of vectors ($F_{xx}$, $F_{xu}$, $F_{uu}$, $\pi_{xx}$, ...) are third-order tensors. A quadratic form including the second derivative of a vector, such as $\Delta x^T F_{xu}\Delta u$, is a vector whose $j$th component is the quadratic form using the second derivative of the $j$th component of the original vector: $\Delta x^T F^j_{xu}\Delta u$. Another useful formula is the product of a row vector $v$, a matrix $A$, a third order tensor ($\pi_{xp}$ for example), and another matrix $B$: $v(A\pi_{xp}B) = \sum_j v_j (A\pi^j_{xp}B)$. We note that cross derivatives are independent of the order in which the derivatives are taken, so $L_{ux} = L_{xu}^T$, $V_{px} = V_{xp}^T$, $F^j_{ux} = (F^j_{xu})^T$, and $\pi^j_{px} = (\pi^j_{xp})^T$.

This results in the following recursion in time for the second order derivatives of $V$:

$$V_{xx}^k = L_{xx} + L_{xu}\pi_x + (L_{xu}\pi_x)^T + \pi_x^T L_{uu}\pi_x + L_u\pi_{xx} + (F_x + F_u\pi_x)^T V_{xx}^{k+1}(F_x + F_u\pi_x) + V_x^{k+1}\left(F_{xx} + F_{xu}\pi_x + (F_{xu}\pi_x)^T + \pi_x^T F_{uu}\pi_x + F_u\pi_{xx}\right) \qquad (3)$$

$$V_{xp}^k = L_{xu}\pi_p + \pi_x^T L_{uu}\pi_p + L_u\pi_{xp} + (F_x + F_u\pi_x)^T V_{xx}^{k+1}F_u\pi_p + (F_x + F_u\pi_x)^T V_{xp}^{k+1} + V_x^{k+1}\left(F_{xu}\pi_p + \pi_x^T F_{uu}\pi_p + F_u\pi_{xp}\right) \qquad (4)$$

$$V_{pp}^k = \pi_p^T L_{uu}\pi_p + L_u\pi_{pp} + (F_u\pi_p)^T V_{xx}^{k+1}F_u\pi_p + (F_u\pi_p)^T V_{xp}^{k+1} + \left((F_u\pi_p)^T V_{xp}^{k+1}\right)^T + V_{pp}^{k+1} + V_x^{k+1}\left(\pi_p^T F_{uu}\pi_p + F_u\pi_{pp}\right) \qquad (5)$$

$V_{pp}^0$ is calculated by using these equations to propagate $V$ and its derivatives backward in time, again making extensive use of the chain rule. $V^D()$ can also be that of a terminal cost function, or zero. For a terminal cost function $\phi(x)$, $V_x^D = \phi_x$ and $V_{xx}^D = \phi_{xx}$. Since $\phi()$ is independent of the policy parameters, $V_p^D$, $V_{xp}^D$, and $V_{pp}^D$ are zero. There are actually a wide variety of ways to use first and second order gradients in optimization [81], and our methods to calculate gradients can be used in many of them.

Discounting: It is often useful to apply a discount factor $\gamma$ to the Bellman Equation: $V^k(x,p) = L(x,\pi(x,p)) + \gamma V^{k+1}(F(x,\pi(x,p)),p)$. This is easily handled by modifying the above algorithms, either by multiplying each occurrence of $V^{k+1}$ and its derivatives in the above derivative propagation equations by $\gamma$, or, equivalently, by including the discounting as a separate step interleaved with the above derivative propagation equations: $V^k \leftarrow \gamma V^k$, $V_x^k \leftarrow \gamma V_x^k$, $V_p^k \leftarrow \gamma V_p^k$, $V_{xx}^k \leftarrow \gamma V_{xx}^k$, $V_{xp}^k \leftarrow \gamma V_{xp}^k$, and $V_{pp}^k \leftarrow \gamma V_{pp}^k$.

Constraints: We will explore handling constraints using both Lagrange multipliers and penalty functions [35, 43]. Although penalty functions may be more convenient from a programming point of view (only the one step and terminal cost functions are modified), Lagrange multiplier approaches allow constraints to be met exactly. Since actions are generated by the policy as a function of state, constraints on actions can be transformed into constraints on states. We will show how to handle a violated terminal state constraint $\varphi(x(t_f)) = 0$ on a single trajectory using Lagrange multipliers and first order gradient methods. Handling state constraints at other times is done in a similar way. We will explore how to generalize this approach to multiple models and trajectories.

Because the policy parameters can have complex effects on constraint violations, it is useful to introduce a Lagrange multiplier $\nu$ and a constraint value function $W^k(x,p)$ for each active constraint. The constraint value function is propagated in a way similar to the $V$ value function equations except that there is no one step cost: $W^k(x,p) = W^{k+1}(F(x,\pi(x,p)),p)$. $W^D$ is $\varphi(x(t_f))$. $W_x^k = W_x^{k+1}(F_x + F_u\pi_x)$ and $W_p^k = W_x^{k+1}F_u\pi_p + W_p^{k+1}$. $W_x^D$ is $\varphi_x(x(t_f))$ and $W_p^D = 0$. The Hamiltonian is $H(x_0,p,\nu) = V^0(x_0,p) + \nu W^0(x_0,p)$. The derivative of the Hamiltonian with respect to $p$ (which is zero at the optimal point) gives the modified first order gradient update:

$$\Delta p = -\varepsilon\left(V_p^0(x_0,p) + \nu W_p^0(x_0,p)\right)^T \qquad (6)$$

$\nu$ is chosen to extremize the Hamiltonian, in that setting the derivative of the Hamiltonian with respect to $\nu$ to zero gives $W^0(x_0,p) = 0$, which enforces the desired constraint.
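One concrete (assumed) way to choose $\nu$ is to solve the linearized constraint $W^0 + W_p\Delta p = 0$ for the update in Equation 6; the proposal only requires that $\nu$ extremize the Hamiltonian, so this is one illustrative option rather than the proposed method:

```python
import numpy as np

def constrained_update(Vp, Wp, W0, eps=1e-3):
    """First order update Delta_p = -eps (Vp + nu Wp)^T, with nu chosen so that the
    linearized terminal constraint is driven to zero: W0 + Wp @ Delta_p = 0.
    Vp and Wp are the row-vector gradients of V^0 and W^0 with respect to p."""
    nu = (W0 / eps - Wp @ Vp) / (Wp @ Wp)
    return -eps * (Vp + nu * Wp)

# made-up gradients and constraint violation, for illustration only
print(constrained_update(np.array([2.0, -1.0]), np.array([0.5, 0.2]), W0=0.3))
```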

Simplifying Policies: We will explore biasing, abstracting, and deliberately simplifying policies. We may want to bias individual policy parameters to be zero unless there is substantial evidence they should be non-zero. This is particularly useful when using large numbers of simple local policies. We will explore several ways to do this. One way we will consider is to add a cost function on the policy parameters: $L(p)$. One example of a suitable cost function is the L2 norm $L(p) = p^T p$. Another is the L1 norm, which sums the absolute values of the components of $p$: $\sum_{j=1}^{N_p}|p_j|$.

We will explore a second approach to simplifying policies which limits the dimension of the policy parameter update $\Delta p$. One can collect a number of updates after several steps, and use those updates to define a basis. Future updates can be projected into that subspace, limiting the possible policies. A similar approach is to limit the dimensionality of the inverted Hessian matrix in a second order update. We can decompose the Hessian into $UDU^T$ using an eigenvalue/eigenvector decomposition, where $D$ is diagonal with the eigenvalues $d_j$ as diagonal elements. If the desired dimensionality of the update is $n$, the top $n$ eigenvalues can be inverted, while the inverses of all other eigenvalues can be set to zero in $D^{-1}$. Another approach we will explore is to add basis vectors to the allowed policy subspace while removing others as the policy optimization proceeds, based on the space spanned by the largest $n$ elements of $\delta p$ or the eigenvectors corresponding to the top $n$ eigenvalues of the Hessian.
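A minimal numpy sketch of the eigenvalue truncation described above, applied to a made-up Hessian and gradient:

```python
import numpy as np

def truncated_second_order_step(Vpp, Vp, n):
    """Second order update restricted to the top-n eigendirections of the Hessian:
    invert the n largest eigenvalues of Vpp and zero the inverses of the rest."""
    d, U = np.linalg.eigh(Vpp)              # Vpp = U diag(d) U^T, eigenvalues ascending
    d_inv = np.zeros_like(d)
    d_inv[-n:] = 1.0 / d[-n:]               # keep only the top-n eigenvalues
    return -(U * d_inv) @ U.T @ Vp          # Delta_p = -U D_trunc^{-1} U^T Vp^T

# made-up 3-parameter Hessian and gradient, keeping a 2-dimensional update
H = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 0.1]])
print(truncated_second_order_step(H, np.array([1.0, -2.0, 0.3]), n=2))
```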

Combinations of Simple Policies: We will explore hierarchical policies where a complex policy is built out of many simpler policies. In preliminary work, we have found that collections of simple policies where only one simple policy is active at a time have particularly fast and effective controller optimization and learning because derivatives are decoupled.

The second order gradient descent update typically has a very large reduction in computational cost. For simple policy $j$ first order gradient descent updates its parameters in the following way (including biasing the parameters):

$$\Delta p_j = -\varepsilon_j\left(L_p + \sum_{m=1}^{M}\sum_{s=1}^{S} w(m,s) V_{p_j}^T(x_s,p)\right) \qquad (7)$$

Note that the step size $\varepsilon$ can now depend on the simple policy being updated (this is especially useful if adaptive step size algorithms are used). Since only simple policies that are actually used are updated, this leads to a reduction in computational cost. The second derivative with respect to the policy parameters, $V_{pp}$ or the Hessian matrix, is block diagonal. Policy parameters of different simple policies do not interact, since only one policy operates on each time step and $V_x$ and $V_{xx}$ are used to decouple the current policy optimization from optimization of simple policies used in the future. Second order policy updates can be handled independently for each simple policy. Second order gradient descent updates the $j$th simple policy in the following way (including biasing the parameters and including a regularizing positive definite diagonal matrix $\lambda I$, with $\lambda$ chosen by a Levenberg-Marquardt or Trust Region algorithm to control step size [81]):

$$\Delta p_j = -\left(L_{pp} + \lambda_j I + \sum_{m=1}^{M}\sum_{s=1}^{S} w(m,s) V_{p_j p_j}(x_s,p)\right)^{-1}\left(L_p + \sum_{m=1}^{M}\sum_{s=1}^{S} w(m,s) V_{p_j}^T(x_s,p)\right) \qquad (8)$$

Inverting several small Hessian matrices is typically much less expensive than inverting a single large Hessian matrix. Note that the regularization parameter $\lambda$ can now depend on the simple policy being updated, which is useful if the Hessians have negative eigenvalues of various magnitudes.

There are several ways to generate such policy collections. We can divide the state space up into a grid or some other tessellation and place a simple policy in each cell (Figure 2 E). We can also place simple policies along a trajectory or at random locations in state space [10] and use nearest neighbor operations to find the closest simple policy based on an appropriate distance metric. We can implement a time varying policy where at each time step $k$ the $k$th simple policy is used.
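As an illustration of a collection of simple policies placed at random locations in state space, the following sketch (all names and the distance metric are illustrative choices, not from the proposal) selects the single active affine policy by a nearest neighbor lookup:

```python
import numpy as np

# a collection of affine simple policies placed at random locations in state space;
# only the nearest one (under a chosen metric) is active at any given state
rng = np.random.default_rng(1)
centers = rng.uniform(-np.pi, np.pi, size=(50, 2))   # policy locations x_bar_j
K = rng.normal(size=(50, 1, 2))                      # per-policy gain rows K_j
u_bar = np.zeros((50, 1))                            # per-policy nominal actions u_bar_j
metric = np.diag([1.0, 0.1])                         # weight position more than velocity

def active_policy(x):
    d = centers - x
    return int(np.argmin(np.einsum('ij,jk,ik->i', d, metric, d)))

def policy(x):
    j = active_policy(x)                             # only policy j is active at x
    return u_bar[j] + K[j] @ (x - centers[j]), j     # affine action, plus which j was used
```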

For example, let's consider a collection of simple policies where the simple policies are affine: $u(x,p) = \bar{u} + KC(x - \bar{x})$. At time $k$ a single policy is used (the $j$th affine policy), and its adjustable parameters are $(\bar{u}_j, K_j)$. We refer to this case as Locally Linear Policy Optimization (LLPO). The time varying version of this approach, where on each time step $k$ the $k$th affine policy is used, is the policy optimization analog of Differential Dynamic Programming (DDP) [43, 35], which can be referred to as DDP-PO.

The parameter vector for the $j$th affine policy, $p_j$, concatenates $\bar{u}_j$ and the rows of $K_j$. If the $j$th affine policy is used on time step $k$, $\pi_x^k = K_j C$. The $\pi_{p_j}^k$ matrix has $N_u$ rows and $N_u(1+N_y)$ columns:

$$\pi_{p_j}^k = \begin{pmatrix} 1 & 0 & \cdots & 0 & y_k^T & 0^T & \cdots & 0^T \\ 0 & 1 & \cdots & 0 & 0^T & y_k^T & \cdots & 0^T \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & 0^T & 0^T & \cdots & y_k^T \end{pmatrix} \qquad (9)$$

[Figure 2 panels: A: "Optimal trajectory" — angle (radians) vs. time (s). B: "Optimal u and feedback gains along optimal trajectory" — u (Nm), kp (Nm/rad), kv (Nm·s/rad) vs. time (s). C: "Value function for one link example" — value vs. position (r) and velocity (r/s). D: "Policy for one link example" — torque (Nm) vs. position (r) and velocity (r/s).]

Figure 2: A: The optimal trajectory for the pendulum swing up (θ vs. time). B: Optimal parameters of locally linear policies along the optimal trajectory (u, position gain kp, and velocity gain kv). The corresponding optimal value function (C) and optimal policy (D). E: The optimal policy with a manually generated tessellation for locally linear policies. The optimal trajectory is shown as a red line. The value function is cut off above 10 for visibility.

$\pi_{xx}^k = 0$ and $\pi_{p_j p_j}^k = 0$. The product of a vector $v$ of length $N_u$ and $\pi_{xp_j}^k$ is given by the $N_x$ by $N_u(1+N_y)$ matrix: $v\pi_{xp} = \begin{pmatrix} 0_{N_u\times N_u} & v_1 C^T & v_2 C^T & \cdots & v_{N_u} C^T \end{pmatrix}$.

The derivative propagation equations along a trajectory are:

$$V_x^k = L_x + L_u K_j C + V_x^{k+1}(F_x + F_u K_j C) \qquad (10)$$

$$V_{xx}^k = L_{xx} + L_{xu}K_j C + (L_{xu}K_j C)^T + (K_j C)^T L_{uu}K_j C + (F_x + F_u K_j C)^T V_{xx}^{k+1}(F_x + F_u K_j C) + V_x^{k+1}\left(F_{xx} + F_{xu}K_j C + (F_{xu}K_j C)^T + (K_j C)^T F_{uu}K_j C\right) \qquad (11)$$

For the affine policy currently in use (simple policy $j$):

$$V_{p_j}^k = (L_u + V_x^{k+1}F_u)\pi_{p_j} + V_{p_j}^{k+1} \qquad (12)$$

$$V_{xp_j}^k = L_{xu}\pi_{p_j} + (K_j C)^T L_{uu}\pi_{p_j} + L_u\pi_{xp_j} + (F_x + F_u K_j C)^T V_{xx}^{k+1}F_u\pi_{p_j} + (F_x + F_u K_j C)^T V_{xp_j}^{k+1} + V_x^{k+1}\left(F_{xu}\pi_{p_j} + (K_j C)^T F_{uu}\pi_{p_j} + F_u\pi_{xp_j}\right) \qquad (13)$$

$$V_{p_j p_j}^k = \pi_{p_j}^T L_{uu}\pi_{p_j} + (F_u\pi_{p_j})^T V_{xx}^{k+1}F_u\pi_{p_j} + (F_u\pi_{p_j})^T V_{xp_j}^{k+1} + \left((F_u\pi_{p_j})^T V_{xp_j}^{k+1}\right)^T + V_{p_j p_j}^{k+1} + V_x^{k+1}\left(\pi_{p_j}^T F_{uu}\pi_{p_j}\right) \qquad (14)$$

For affine policies not currently in use but that have been used (simple policy $l$): $V_{p_l}^k = V_{p_l}^{k+1}$, $V_{xp_l}^k = (F_x + F_u K_l C)^T V_{xp_l}^{k+1}$, and $V_{p_l p_l}^k = V_{p_l p_l}^{k+1}$. One of the update equations (7) or (8) is used.
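Because only one simple policy is active at a time, the overall Hessian is block diagonal and each simple policy can be updated on its own. A sketch of the decoupled second order update of Equation 8, assuming the per-policy gradient and Hessian sums have already been accumulated (the parameter cost term $L_p$, $L_{pp}$ is omitted for brevity):

```python
import numpy as np

def update_simple_policies(grads, hessians, params, lam=1e-3):
    """Decoupled second order update (Equation 8) for each simple policy.
    grads[j] and hessians[j] hold the weighted sums of V_{p_j} and V_{p_j p_j}
    over all models and trajectories; params[j] is the parameter vector p_j.
    Simple policies that were never used do not appear in grads and are skipped;
    inverting many small Hessians is much cheaper than inverting one large one."""
    for j, g in grads.items():
        H = hessians[j] + lam * np.eye(len(g))       # per-policy Levenberg-Marquardt term
        params[j] = params[j] - np.linalg.solve(H, g)
    return params
```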

LLPO Preliminary Results: To verify the LLPO algorithm and explore timing, we implemented both numeric and our analytic first and second order policy optimization on a pendulum swing up problem with dynamics $\ddot\theta = (\text{torque} - mgl\cos\theta)/I$, simulated with time step $T = 0.01$ s, where $x = (\theta, \dot\theta)^T$, $\theta$ is the pendulum angle with straight down being 0, the moment of inertia about the joint is $I = 0.3342$, the product of mass, gravity, and pendulum length is $mgl = 4.905$, $C$ is an identity matrix, and $L(x,u) = 0.5T(0.1\theta^2 + \text{torque}^2)$. We have found the optimal trajectory (cost = 3.5385) using dynamic programming (DP) and differential dynamic programming (DDP) (Figure 2 A, C, and D). We can use these solutions to see how the optimal parameters of locally linear policies ($\bar u$, position gain $k_p$, and velocity gain $k_v$) vary along the optimal trajectory (Figure 2 B).

We will use this problem to test algorithm timing in the context of numeric and analytic first and second order gradients on a 500 step trajectory. In the numeric approach we used finite differencing of total trajectory costs to numerically estimate $V_p$ and $V_{pp}$. We can vary the number of affine policies and see how the cost of computing these gradients increases for both approaches (Table 1). Table entries report time in milliseconds for one calculation of $V_p$, or of $V_p$ and $V_{pp}$, for 10, 100, and 500 local policies. We see that analytic derivatives become relatively much cheaper to compute as the number of affine policies increases, since the numeric approaches have to vary all the parameters of all the simple policies to estimate derivatives, while the analytic approaches only require a number of updates related to the length of the trajectory and largely independent of the number of simple policies. The cost of the numeric first order gradient computation is proportional to the number of simple policies, while the numeric second order gradient computation grows with the square of the number of simple policies. In the analytic approaches the computational cost of finding the nearest neighbor simple policy and initializing all simple policies for each new trajectory depends on the number of simple policies. The cost of updating the policy gradients for simple policies not used on the current time step, inverting the Hessian matrices, and updating simple policy parameters depends only on the number of simple policies used, so in the worst case this cost is proportional to the length of each trajectory. Simple policies that have not been used on the current trajectory do not need to be updated until they are used. In practice the total cost of the analytic approaches is almost independent of the number of simple policies available or used.

Method                  10 policies   100 policies   500 policies
First order numeric     0.108         11.3           53
First order analytic    0.098         0.104          0.124
Second order numeric    450           45000          1061000
Second order analytic   0.77          0.89           1.20

Table 1: LLPO implementation timing comparison (time in milliseconds per gradient calculation).

Using a single simple policy at a time and using analytic derivatives makes optimization and learning possible for large complex policy optimization problems. In our experience so far using collections of simple policies, sometimes the second order analytic approach is faster than the first order analytic approach because it takes many fewer iterations to converge, and sometimes the first order approach has a slight edge. Either the solutions found are equivalent, or the second order approach finds a better solution. We will explore these issues further in the proposed research.

Weighted Averaging of Simple Policies: We will explore weighted averaging of simple policies. One possible consequence of using multiple simple policies on each time step, by forming a weighted average of their outputs, is that the Hessian matrix may no longer be block diagonal or have a form that reduces the computational cost of inverting it. When both policy $j$ and policy $l$ are active at the same time, cross terms between $\pi_{p_j}$, $\pi_{p_l}$, $V_{xp_j}$, and $V_{xp_l}$ arise which might destroy the block diagonal nature of the Hessian ($V_{p_j p_l} \neq 0$).

Handling Models With Different States: So far we have assumed all of the multiple models have the same state vector. We will outline what happens to the first order update formulas (Equations 1 and 2) when we are optimizing over multiple models with different state vectors. Since we are planning, we assume we know the dynamics of each model: $z^m_{i+1} = F^m(z^m_i, u_i)$. In order to use the same policy with all models, each model must provide a vector of observations, $y = g^m(z^m)$, and the common policy is a function of those measurements: $\pi(y,p)$. In order to use the same one step cost function $L(x,u)$, each model must provide a way to generate a "nominal" state: $x = h^m(z^m)$. This function is unnecessary if the one step cost function is a function of the observation vector, $L(y,u)$. Finally, there must be a way to start each model's trajectory from an equivalent state $z^m_0$, given a nominal starting state $x_0$. The Bellman Equation for each model is: $V^{m,k}(z^m,p) = L(h^m(z^m), \pi(g^m(z^m),p)) + V^{m,k+1}(F^m(z^m, \pi(g^m(z^m),p)),p)$. The first order derivative propagation equations are (suppressing the model index $m$):

$$V_z^k = L_x h_z + L_u\pi_y g_z + V_z^{k+1}(F_z + F_u\pi_y g_z) \qquad (15)$$

$$V_p^k = (L_u + V_z^{k+1}F_u)\pi_p + V_p^{k+1} \qquad (16)$$

During policy optimization the appropriate model specific dynamics, observation equation, $h()$, value function, and derivative propagation equations are used on each application of a model $m$ to a starting point indexed by $s$. We will explore how to generalize the second order derivative propagation equations for $V_{zz}$, $V_{zp}$, and $V_{pp}$ to multiple models with different model structures.
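One way to organize the per-model functions required above is a small wrapper interface; the names below are illustrative, not from the proposal:

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class ModelWrapper:
    """What each model m must provide so one shared policy pi(y, p) can be
    rolled out on models with different state vectors."""
    F: Callable     # z_{i+1} = F(z_i, u_i), model-specific dynamics
    g: Callable     # y = g(z), observations fed to the shared policy
    h: Callable     # x = h(z), "nominal" state fed to the shared one step cost
    init: Callable  # z_0 = init(x_0), equivalent start state from a nominal start

def rollout_cost(model, policy, L, p, x0, D):
    """Cost of one D-step trajectory of one model under the shared policy pi(y, p)."""
    z = np.asarray(model.init(x0), dtype=float)
    cost = 0.0
    for _ in range(D):
        u = policy(model.g(z), p)
        cost += L(model.h(z), u)
        z = model.F(z, u)
    return cost
```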

Global Optimization Of General Policies: We will explore combining our gradient-based policy optimization, which finds locally optimal policies, with approximate dynamic programming. This hybrid approach can globally optimize grid-based policies, with associated grid-based value functions (Figure 2 C and D). For more general parametrized policies a similar approximate dynamic programming approach [62, 122, 21, 97, 48, 117, 49, 79] does not necessarily find globally optimal policies, but it does help avoid many bad local minima. Function approximation is used to represent both a policy $\pi(x,p)$, as we do, and a parametrized global value function $V(x,\omega)$. Gradient descent and other optimization techniques are used to learn $p$ and $\omega$. Our approach tries to avoid making a commitment to a global structure and parametrization for $V(x)$ or $V_x(x)$ by using local quadratic models for $V(x)$ (or equivalently local linear models for $V_x(x)$). Performing local gradient-based policy optimization along an explicit trajectory may greatly reduce the need for an accurate global model of the value function, and allow quite approximate function optimization methods to be used in approximate dynamic programming (ADP). The hybrid method may work well with an inaccurate global model of the value function $V()$ but accurate local models of its derivatives. We will explore this hybrid approach in the proposed work.

In Heuristic Dynamic Programming (HDP), a form of policy iteration, the parameters $\omega$ of a value function approximation $V(x,\omega)$ are trained using supervised learning to match the right hand side of the Bellman Equation, $L(x_i,\pi(x_i,p)) + V(F(x_i,\pi(x_i,p)),\omega)$, evaluated with the current $\omega$. The parameters $p$ of a policy $\pi(p)$ are trained using supervised learning with targets from minimizing $u = \arg\min_u\left(L(x,u) + V(F(x,u),\omega)\right)$. These targets are created assuming arbitrary $u$ are possible and without respect to the parametrization of the policy. In local versions of HDP gradient approaches are used to train $\omega$ and $p$, and only a local minimum is found. [119] discusses computing a derivative of the right hand side of the Bellman Equation with respect to the policy parameters $p$ to facilitate the minimization: $Q_p^k = L_u\pi_p + V_x F_u\pi_p + Q_p^{k+1}$, which matches our first order gradient (2). Dual Heuristic Programming (DHP) learns $V_x$ directly by training using the right hand side of (1) as the target [62, 122, 118, 21, 97, 117, 49, 79]. We are proposing representing $V_x(x)$ and $V_p(x)$ in a first order approach, and in addition $V_{xx}(x)$, $V_{xp}(x)$, and $V_{pp}(x)$ in a second order approach. Globalized DHP (GDHP) combines HDP, training a global representation of $V(x,\omega)$, with DHP, including $V_x$ as part of the training algorithm. Action dependent versions of HDP (ADHDP) and DHP (ADDHP) learn Q functions $Q(x,u,w,p) = L(x,u) + V^{k+1}(F(x,\pi(x,p)),w)$, $Q_x()$, and $Q_u()$ instead of value functions $V()$ and $V_x()$.

Policy iteration in dynamic programming must be adapted for multiple models [125]. To match our gradient-based minimization of a weighted sum across models, we envisage optimizing a weighted average of the cost across models, sampled at a set of states $x_j$: $u_j = \arg\min_u \sum_m w(m)\left(L_m(x_j,u) + V_m(F_m(x_j,u))\right)$. There is one policy approximation, but a separate $V()$ approximation for each model, updated separately: $V_{m,j} = L_m(x_j,u_j) + V_m(F_m(x_j,u_j))$. We are effectively assuming that a trajectory starts at each $x_j$, so the $s$ index in $w(m,s)$ can be dropped in the above equation.
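A sketch of one sweep of this multiple model policy iteration step; minimizing over a discrete action grid and the model interface (objects with L, F, V attributes) are simplifications assumed here for illustration:

```python
import numpy as np

def policy_targets(states, action_grid, models, weights):
    """One sweep of multiple model policy iteration at a set of sample states:
    at each x_j pick the action minimizing the weighted one step lookahead cost
    across models, then return the per-model value targets for that action.
    Each model supplies L(x, u), F(x, u), and a current value approximation V(x)."""
    best_actions, value_targets = [], []
    for x in states:
        costs = [sum(w * (m.L(x, u) + m.V(m.F(x, u))) for m, w in zip(models, weights))
                 for u in action_grid]
        u_best = action_grid[int(np.argmin(costs))]
        best_actions.append(u_best)
        value_targets.append([m.L(x, u_best) + m.V(m.F(x, u_best)) for m in models])
    return best_actions, value_targets
```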

[92, 40, 71] train adaptive critics along trajectories (as we propose to do), rather than at a set of training points. [40, 71] use a set of initial conditions, as we propose to do. [92, 40] use gradient equations from Pontryagin's minimum principle-based trajectory optimization, which focus on taking the derivative of the Hamiltonian with respect to state and action, and leave out terms involving $\pi_x$, $\pi_p$, and $V_p$. [71] takes advantage of special forms of the dynamics (linear in the control) and one step cost function (quadratic in the control). For the LQR case the value function is a global quadratic function which is learned on each iteration of policy iteration. [71] also explores using radial basis functions to represent $V(x)$. [71, 13, 80] analyze convergence issues and provide convergence proofs.

We propose two methods to approximate the value function in "global" optimization of the policy using $u = \arg\min_u\left(L(x,u) + V(F(x,u),\omega)\right)$. The first is to use the collection of local quadratic models of the value function at each point along each trajectory. These models do not have to be trained in the usual machine learning sense, as $V$, $V_x$, and $V_{xx}$ can be computed for each trajectory point by a single sweep backwards along each trajectory. We could use a nearest neighbor approach with a distance metric that potentially depends on the query state to find the most appropriate local model. We could also use weighted combinations of the predictions of various local value function models, with weights depending on the distances of the trajectory points to the query. The second value function approximation approach is to use a global value function approximation, for example sigmoidal neural nets or radial basis functions. The "global" policy optimization step can be interspersed with gradient-based local optimization of the policy using the approaches previously described in this paper.

Adaptive Grids/Parametrizations: In addition to fixed parametrizations of the policy, we will explore adaptive grids and parametrizations. [39] reviews recent work in adaptive grids and parametrizations for value function representation. Figure 2 E shows a manually generated tessellation of the state space for pendulum swing up. Each cell has an approximately linear policy. We will look for automatic ways to create such an adaptive grid, using various tessellation approaches including kd-trees. An advantage of our gradient-based policy optimization is that it provides several indicators as to where and how to subdivide a cell. One approach is to divide cells whose gradients $V_p$ are different at different locations. The magnitude of $V_{px}$ provides an indicator of this, and the matrix provides an indication of the best way to split a cell (to minimize the $V_p$ discrepancy in the new cells).

We can only represent the policy in detail at states that are actually visited by the trajectories. One way to create new features is to start with constant or linear features of one variable, and then combine them into polynomials or create new features in the collection by applying nonlinearities such as exp(), log(), sqrt(), sin(), cos(), ... This is reminiscent of the Group Method of Data Handling (GMDH), originally developed in the Soviet Union and also referred to as polynomial neural networks. The number of hidden units and layers in sigmoidal neural networks or radial basis function networks can also be varied. Another way to discover useful cells, features, and parametrizations is to look for ways in which the inputs to the policy can be factored into independent subsets. [123, 124] has shown that walking can be simplified in this way, for example. All of these operations can be applied simultaneously (each cell can have a different set of features or parametrization, and cells can be split and merged as well as have their representation changed). It is possible that representing the policy at several levels of resolution is useful, either with several grids or using wavelets. We can also explore using multiple temporal resolutions, where trajectories of varying lengths are simulated to estimate policy gradients. We will also explore whether simultaneously applying these techniques to value functions will be useful.

We have proposed optimization using random actions as a particularly efficient way to perform dynamic programming [7]. We will explore performing random actions using cell-based representations by replacing a cell's local policy with a random policy during dynamic programming and using that random policy until a trajectory leaves the cell. It may be the case that this accelerates convergence to a better local optimum or helps avoid bad local optima.

Handling Stochastic Systems: One proposed approach to handling stochastic systems with uncertain dynamics and noisy measurements is to sample from realizations of trajectories (a Monte Carlo approach) [43]. Since we already propose sampling trajectories based on models and initial states, we can re-use all of the machinery we will develop and use several trajectory realizations for each model and initial state. During policy optimization we would keep the random noise fixed for each trajectory as in the common random numbers method [41] (also known as correlated sampling, matched pairs, matched sampling, and Pegasus [72]).

In a situation where the process and measurement noise can be described by first and second moments (as with Gaussian noise), we propose analytically propagating and optimizing means and variances. We will initially focus on discrete time systems with zero mean process noise $w$ with covariance $W$ and independent zero mean measurement noise $v$ with covariance $V$, and later extend to continuous time systems. We will inject the noise directly into the process dynamics and measurement functions, so additive, multiplicative, and more general nonlinear noise can all be handled: $x_{i+1} = F(x_i, \pi(y_i,p), w_i)$ and $y_i = g(x_i, v_i)$. In later years we will also explore various enhancements to this model, including having an effect of the past command $u_{i-1}$ on the measurement noise (which would allow sensing actions such as aiming a sensor).

We want to minimize the total cost, including the additional cost due to the process and measurement noise.

In preliminary work we have explored using an Extended Kalman Filter to process the noisy measurements.

The variance of the error in x on time step k before we incorporate the measurement is [44, 55, 45]:

X−(k+1) = (Fx +Fuππygx)X+(k)(Fx +Fuππygx)

T +(Fuππygv)V (Fuππygv)T +FwW FT

w

+1

2trace(FxxX

+(k)FxxX+(k)) (17)

Fxx = Fxx +FxuFuππygx +(Fuππygx)TFux +(ππygx)

TFuuππygx +Fu(gTxππyygx +ππygxx) (18)

where the superscripts− and + indicate the state error covariance X before and after the measurement has been

11

Page 12: NRI-Small: Adaptive/Approximate Dynamic Programming (ADP ...cga/tmp-public/nsf.pdf · NRI-Small: Adaptive/Approximate Dynamic Programming (ADP) Control Of Soft Inflatable Robots

incorporated. The measurement update of the Kalman Filter is [45]:

X+(k) = (I−K gx)X−(k) (19)

K = X−(k)gx

(gxX

−(k)gTx +V +1

2trace(gxxX

−(k)gxxX−(k))

)−1

(20)

This leads to the following additional cost:

∆V = trace(φxx(xD)X+(D))+D−1

∑i=1

(0.5∗ trace(LxxX+(i))+ trace((Lxuππygx +(Lxuππygx)

T)X+(i))

+0.5∗ trace(Luu((ππygx)X+(i)(ππygx)

T +(ππygv)V (ππygv)T))) (21)

We can optimize the policy including these stochastic terms by calculating the gradient of ∆V with respect to

p and adding that to the deterministic gradient Vp. A similar approach can be used to perform “dual control”

where caution (avoiding risk: the combination of large uncertainty and a large cost Hessian) and probing (adding

exploratory actions where they will do the most good) are combined in a learned policy [45].
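For a linear plant with quadratic cost the second derivative terms Fxx and gxx vanish, and equations (17)-(21) reduce to a short covariance recursion. The Python sketch below implements only that special case (all matrices are hypothetical, and the 1/2 conventions in the cost Hessians are absorbed into Q, R, and Phi):

import numpy as np

# Linear-quadratic specialization of Eqs. (17)-(21): F(x,u,w) = A x + B u + w,
# g(x,v) = C x + v, policy u = K y, one step cost x'Qx + u'Ru.  The second
# derivative terms (Fxx, gxx) are exactly zero here, so the trace corrections drop out.
A = np.array([[1.0, 0.02], [0.0, 1.0]])
B = np.array([[0.0], [0.02]])
C = np.eye(2)
W = 1e-4 * np.eye(2)          # process noise covariance
V = 1e-2 * np.eye(2)          # measurement noise covariance
Q = np.diag([1.0, 0.1])       # state cost weight
R = np.array([[0.01]])        # control cost weight
Phi = np.diag([10.0, 1.0])    # terminal cost weight
K = np.array([[-30.0, -8.0]]) # hypothetical output-feedback gain (the policy)

D = 200
Xplus = np.zeros((2, 2))      # state error covariance after the measurement
dV = 0.0
for i in range(D):
    # Additional expected cost this step, Eq. (21) with Lxu = 0.
    KC = K @ C
    dV += np.trace(Q @ Xplus) + np.trace(R @ (KC @ Xplus @ KC.T + K @ V @ K.T))
    # Time update, Eq. (17).
    Acl = A + B @ K @ C
    Xminus = Acl @ Xplus @ Acl.T + (B @ K) @ V @ (B @ K).T + W
    # Measurement update, Eqs. (19)-(20).
    Kf = Xminus @ C.T @ np.linalg.inv(C @ Xminus @ C.T + V)
    Xplus = (np.eye(2) - Kf @ C) @ Xminus

dV += np.trace(Phi @ Xplus)   # terminal term trace(phi_xx X+(D))
print("additional expected cost due to noise:", dV)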

Handling Dynamic Policies: We will explore how to optimize and learn policies which have internal state,

such as state estimators (as in LQG design) and central pattern generators (CPGs) in biology. We will explore

extending our approach by augmenting the model state x with the additional policy state x̃: z = (x^T x̃^T)^T. A dynamical equation for z must be defined, which will probably be a function of the policy parameter vector p: zk+1 = F̃(zk, uk, p). A suitable observation equation ỹ = g̃(z) must also be defined; the new observation vector will probably include the old observation vector and all the policy state variables. The mapping to the one step cost function, x = h(z), is easy to define since z can be truncated to provide x. Starting at a z0 corresponding to x0 is straightforward if the policy state x̃ can be initialized to zero, a known value, or random values. After

adding appropriate process and measurement noise, we can use the results of the previous sections to optimize

p. We will also explore keeping the actual state and the policy state separate, which may lead to more efficient

optimization algorithms.
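A minimal sketch of the state augmentation, assuming a hypothetical scalar plant and a policy whose internal state is a first-order filter of the observation (both invented for the example):

import numpy as np

# Augment a hypothetical scalar plant state x with a policy state xf
# (a first-order filter of the observation), giving z = (x, xf).
def plant(x, u):
    return 0.98 * x + 0.02 * u            # hypothetical stable scalar plant

def policy_with_state(y, xf, p):
    alpha, k = p                          # p = (filter pole, feedback gain)
    xf_next = alpha * xf + (1.0 - alpha) * y   # internal (policy) dynamics
    return k * xf_next, xf_next

def augmented_step(z, p):
    x, xf = z
    y = x                                 # observation (noise omitted for brevity)
    u, xf_next = policy_with_state(y, xf, p)
    return np.array([plant(x, u), xf_next])

p = (0.9, -3.0)
z = np.array([1.0, 0.0])                  # policy state initialized to zero
for i in range(300):
    z = augmented_step(z, p)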

Receding Horizon Control (RHC)/Model Predictive Control (MPC): The multiple model optimization

criterion and our gradient approaches to efficiently optimizing it can be used in implementing Receding Horizon

Control (RHC, also known as Model Predictive Control or MPC). We can implement RHC/MPC with time varying locally constant policies that apply a new control vector on each time step. We can also implement a policy with

very few parameters, such as a gain matrix K. We can also use variable temporal resolution policies, in which

the first N_RHC steps each have their own control vector, but after that a policy with many fewer parameters is

used until the end of the horizon. At the end of the horizon, an appropriate terminal cost function should be

used. We note that on each control iteration the optimized control vectors can be shifted forward in time and the

candidate policy can be initialized to what it was on the last control step to provide a warm start and speed up

optimization. We also expect that the terminal cost function can be learned or updated based on value functions

from previous optimizations.
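The warm start itself is only a shift of the previous solution; a minimal sketch (optimize_controls is a placeholder name for whatever trajectory optimizer is used, not an existing routine) is:

import numpy as np

# Shift last iteration's optimized control sequence forward one step and
# repeat the final entry as a crude guess for the newly exposed step.
def warm_start(prev_controls):
    shifted = np.roll(prev_controls, -1, axis=0)
    shifted[-1] = prev_controls[-1]
    return shifted

# Hypothetical usage inside an RHC/MPC loop:
N_RHC, nu = 20, 1
controls = np.zeros((N_RHC, nu))
# for each control step:
#     controls = optimize_controls(x_current, initial_guess=warm_start(controls))
#     apply controls[0] to the robot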

Adaptive Multiple Model Policy Optimization: We expect that during RHC/MPC, we can adapt the

weights of the multiple models used in the optimization criterion. Models that accurately predict the next state

can have their weights increased, while models that poorly predict future states can have their weights reduced.

It is likely that some regularization will be required so that the system does not focus on only one model and

reject all others.
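One possible weight update, shown only as a sketch (the exponential scoring, temperature, and weight floor are illustrative choices, not a committed design):

import numpy as np

def update_model_weights(weights, prediction_errors, temperature=1.0, floor=0.02):
    # Increase the weight of models that predicted the observed next state well
    # (small squared error), decrease the others, then regularize so that no
    # model's weight can collapse to zero.
    scores = np.exp(-np.asarray(prediction_errors) / temperature)
    w = weights * scores
    w = w / w.sum()
    w = (1.0 - floor * len(w)) * w + floor   # keep every weight >= floor
    return w / w.sum()

weights = np.ones(4) / 4.0
errors = [0.01, 0.50, 0.03, 0.90]   # hypothetical one-step prediction errors
weights = update_model_weights(weights, errors)
print(weights)

The floor keeps every model's weight bounded away from zero, which is one simple form of the regularization mentioned above.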

Model Following: So far we have focused on optimizing a policy. We note that we can also use gradients to

achieve model following or model reference control. If the desired model is xi+1 = md(xi, ui), then we modify the original one step cost function to L̄(x,u) = L(x,u) + (F(x,u) − md(x,u))^T Q (F(x,u) − md(x,u)). This approach can be used to perform pole placement, for example.
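A minimal sketch of the modified one step cost, with hypothetical plant and reference models:

import numpy as np

Q_follow = np.eye(2)                      # weight on deviation from the desired model

def L(x, u):
    return float(x @ x + 0.01 * u * u)    # hypothetical original one step cost

def F(x, u):
    # Hypothetical plant dynamics.
    return np.array([x[0] + 0.02 * x[1], x[1] + 0.02 * u])

def m_d(x, u):
    # Desired (reference) model: more heavily damped than the plant.
    return np.array([x[0] + 0.02 * x[1], 0.9 * x[1] + 0.02 * u])

def L_bar(x, u):
    # Modified one step cost penalizing the plant's deviation from the model.
    e = F(x, u) - m_d(x, u)
    return L(x, u) + float(e @ Q_follow @ e)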

Continuous Time Policy Optimization: Continuous time approaches are extensively used in the output

feedback controller optimization literature, and [49] explores them in adaptive dynamic programming. We ex-

pect the continuous time version of our approach to be very similar to our discrete time version. It is most useful

when the duration of the trajectory or terminal time depends on reaching a goal set or is part of the optimization.


Instead of discrete time dynamics xk+1 = F(xk, uk) we have continuous time dynamics ẋ = f(x,u). The value function corresponding to the policy π(x,p) is generated by a cost increment function l(x,u) integrated along the trajectory generated by the policy: V(x(t), p, t) = φ(x(t_f), t_f) + ∫_t^{t_f} l(x(τ), π(x(τ), p)) dτ, where x(t) is the state at time t on the trajectory and t_f is the final time of the trajectory. φ(x, t) is the terminal cost function. Due

to space restrictions we must skip the details of the derivation of the first order gradient propagation equations

in continuous time, which are (they propagate backwards in time):

−V̇x = Vx (fx + fu πx) + lx + lu πx                                               (22)

−V̇p = Vx fu πp + lu πp                                                           (23)
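A sketch of integrating (22) and (23) backwards along a stored trajectory with explicit Euler steps (the Jacobian callables and the dictionary that holds them are hypothetical placeholders for whatever model and policy are in use):

import numpy as np

def backward_gradients(ts, x_traj, jac, phi_x):
    # ts: time stamps; x_traj: states along the trajectory; jac: dict of callables
    # returning f_x, f_u, pi_x, pi_p, l_x, l_u at a state; phi_x: terminal cost gradient.
    Vx = phi_x(x_traj[-1])                              # boundary condition at t_f
    Vp = np.zeros(jac["pi_p"](x_traj[-1]).shape[1])
    for k in range(len(ts) - 1, 0, -1):
        dt = ts[k] - ts[k - 1]
        x = x_traj[k]
        fx, fu = jac["f_x"](x), jac["f_u"](x)
        pix, pip = jac["pi_x"](x), jac["pi_p"](x)
        lx, lu = jac["l_x"](x), jac["l_u"](x)
        Vx_dot = -(Vx @ (fx + fu @ pix) + lx + lu @ pix)   # Eq. (22)
        Vp_dot = -(Vx @ fu @ pip + lu @ pip)               # Eq. (23)
        Vx = Vx - dt * Vx_dot                              # step backwards in time
        Vp = Vp - dt * Vp_dot
    return Vx, Vp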

Creating Models And Minimax Policies: We can use the policy optimization machinery we have devel-

oped to optimize worst case models (make them as difficult as possible for the current policy). This allows us

to automatically create difficult models in the set of models used for training on the next iteration. We can think

of this as automatically creating policies for opponents.
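A minimal sketch of this idea for a scalar plant with a single model parameter (everything here is invented for illustration): the model parameter is adjusted by gradient ascent to maximize the cost achieved by the current policy, and the resulting model can then be added to the training set for the next policy optimization.

import numpy as np

def rollout_cost(p_policy, theta_model, x0=1.0, steps=100):
    # Hypothetical scalar plant whose pole is the "model parameter" theta_model.
    x, cost = x0, 0.0
    for _ in range(steps):
        u = p_policy * x
        cost += x**2 + 0.1 * u**2
        x = theta_model * x + 0.5 * u
    return cost

def make_adversarial_model(p_policy, theta0, lr=0.01, iters=50, eps=1e-4):
    # Gradient *ascent* on the model parameter: make the model as hard as
    # possible for the current policy (finite-difference gradient for brevity).
    theta = theta0
    for _ in range(iters):
        grad = (rollout_cost(p_policy, theta + eps) -
                rollout_cost(p_policy, theta - eps)) / (2 * eps)
        theta = np.clip(theta + lr * grad, 0.5, 1.2)   # keep the model plausible
    return theta

hard_model = make_adversarial_model(p_policy=-0.8, theta0=0.9)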

Research Plan

This research involves two full time PhD students, one part time postdoc, and the PI. The first graduate

student will focus on implementing this controller design approach on various soft robots, including our inflat-

able robot and rigid robots with soft skin that we will build using other NSF ERC and DARPA funding. The

second graduate student will focus on implementing this controller design approach on our Sarcos humanoid

(see Facilities section). Both students will also develop new theory and algorithms. The postdoc will focus

on theoretical and algorithm development (it is likely the postdoc position will rotate through postdocs with

maximum two year stays). The PI will supervise, guide, and provide assistance as needed. All members of the

team will work together to evaluate each other’s results. Each year we will choose a set of issues to focus on, to

maximize synergy among researchers. The planned schedule is as follows.

Year 1: We will implement and evaluate both discrete time and continuous time baseline versions of the

proposed approach in simulation. This includes developing first and second order gradients for a range of sit-

uations. We will be checking if derivatives are accurate, and evaluating a number of different policy structures

(collections of constant policies (for example tables with constant entries), collections of affine policies, linear

parametrization with complex hand-crafted basis functions, radial basis functions, and sigmoidal neural nets).

We will be empirically checking for convergence of the optimization algorithms to a local or potentially global

optimum, and whether the resulting policy is empirically robust to modeling error relative to the set of mod-

els used in optimization. We will explore how to best find an initial policy, including using optimal control

approaches for single models such as dynamic programming, DDP, and LQR. We will also simulate the appli-

cation of the proposed approach to the soft robots and our humanoid. Tasks for the soft robots include feeding,

dressing, cleaning, and grooming a human. Tasks for the humanoid robot include locomotion over rough terrain

and whole body behaviors involving both manipulation and locomotion.

Year 2: We will do the first round of hardware implementations, both on the soft robots and on our hu-

manoid. We will develop adaptive grid and parametrization approaches to global policy optimization and also

global value function optimization. We will apply our approach to combined optimization of controllers and

state estimators. As part of this, we will develop our algorithms for optimizing controllers for stochastic sys-

tems, including optimizing the probability of success as a measure of performance. We will explore the use

of non-parametric modeling error (additive process noise) as a synergistic approach to achieving robust policy

design. We will explore the connection between locally constant policies and differential dynamic programming

(DDP), as well as the connection between locally linear policies and DDP [43, 35].

Year 3: We will continue and evaluate our hardware implementations, and identify and resolve new issues

that result. We will develop policy optimization algorithms that deliberately bias or simplify the policy, and

analyze the effect of such bias theoretically. We will develop parallel implementations of our algorithms on

multi-core computers, on GPUs, and on a supercomputing cluster. We will explore the use of opponent policies

as a way to refine the model set, and as a synergistic approach to achieving policy robustness. We will explore

whether it is useful to make the one step cost function L() and terminal cost function φ() depend on the model m


or the initial state (s index). We will explore enhancements to the stochastic model including having an effect of

the past command ui−1 on the measurement noise (which would allow sensing actions such as aiming a sensor).

We will explore how to handle trajectories with free terminal time whose termination is determined by a cost

function or by arrival at a goal or goal set. We will explore how to handle discontinuities in the dynamics,

cost function, policy, and value function. We will explore the effect of a minimax rather than average cost

performance criterion. We will explore “unscented” approaches to calculating derivatives that use information

from multiple trajectories, to handle bad local minima and model discontinuities such as look up tables, dead

zones, saturations, and hysteresis. We will develop a multi-model approach to system identification, where

different temporal segments of data are identified with different models, and controllers are designed for the

resulting set of models or model distribution.

Year 4: We will focus on a second round of hardware implementation, both on the soft robots and on our

humanoid. We will explore whether providing explicit examples of possible unmodeled dynamics during the

control design process is easier for the control designer than specifying parameter ranges in complex models that turn different model structures into parameter variations. We will develop theoretical proofs both that our optimization algorithms converge to a local or global optimum, and that the resulting policies are robust to modeling error relative to the set of models used in optimization. Based on our parallel implementation,

we will develop receding horizon control (RHC)/model predictive control (MPC) approaches to online policy

optimization. We will explore “dual control” of stochastic systems where caution (avoiding risk: the combina-

tion of large uncertainty and a large cost Hessian) and probing (adding exploratory actions where they will do

the most good) are combined in a learned policy.

Year 5: A major focus of Year 5 is evaluation of our algorithms, both practically and theoretically. We

will develop inverse optimal control, where the optimization criterion is parametrized and optimized to match

teacher-provided training data. We will develop model following versions of our approach, in addition to pol-

icy optimization. We will also explore the optimization of dynamic policies such as central pattern generators

(CPGs). Students will be finishing their theses, and pursuing unresolved and/or unexpected issues at the direc-

tion of their thesis committees.

Prior NSF Supported Work:

NSF award number: ECS-0325383, Total award: $1,466,667, Duration: 10/1/03 - 9/30/08, Title: ITR: Collab-

orative Research: Using Humanoids to Understand Humans, PI: Atkeson, 3 other co-PIs. This grant focused

on programming human-like behaviors in a humanoid robot. Our research focused on the coordination and

control of dynamical physical systems with particular interest in providing robots with the ability to control

their actions in complex and unpredictable environments. Atkeson’s work has focused on model-based learning

using optimization and also reinforcement learning. This work has shown that we can generate useful nonlinear

feedback control laws using optimization, but also made us aware of the challenge of unmodeled dynamics.

NSF award number: ECCS-0824077, Total award: $348,199, Duration: 9/1/08 - 8/31/12, Title: Approximate

Dynamic Programming Using Random Sampling. PI: Atkeson. This grant has supported much of the founda-

tional and preliminary work behind this proposal.

NSF award number: IIS-0964581, Total award: $699,879, Duration: 7/1/10 - 6/30/13, Title: RI: Medium:

Collaborative Research: Trajectory Libraries for Locomotion on Rough Terrain. PI: Hodgins, Atkeson co-

PI. This grant has supported work on controlling humanoid robots based on trajectory libraries, which can be

viewed as a collection of complex policies.

Relevant publications that are a result of this research include [4, 5, 6, 3, 7, 9, 8, 10, 14, 17, 19, 18, 16, 24, 25,

28, 27, 26, 29, 30, 31, 32, 33, 37, 38, 46, 51, 50, 52, 54, 53, 56, 59, 58, 61, 66, 69, 70, 67, 64, 63, 68, 65, 76, 78,

85, 84, 87, 86, 91, 89, 90, 95, 99, 98, 100, 101, 102, 104, 103, 105, 108, 106, 109, 107, 111, 112, 123, 124, 127,

128, 129].

Broader Impact

Education: A key part of this effort is teaching undergraduates and graduate students how to apply what

they learn in relatively theoretical classes on robotics or control theory to the real world. We will work to

develop an appreciation for the limits of theory, and an understanding of what might go wrong (or right) in


practical applications. We will try to infuse this kind of “common sense” in CMU classes generally, starting

with our own teaching. As a focus for this effort, Atkeson has created and is teaching an undergraduate course

16-264: Humanoids. Wowwee Inc. has recently donated 40 Robosapien RS Media humanoid robots for this

course. Atkeson will also create a graduate course on “Robust Intelligence” based on our and others’ work in

this area. We will also coordinate our education activities with the CMU/Pitt NSF Quality of Life Technologies

Engineering Research Center. Results from this research will feed directly into ERC courses.

Outreach: One form of outreach we have pursued is an aggressive program of visiting students and post-

docs. This has been most successful internationally, with visitors from the Delft University of Technology (4

students) [4, 56, 127], the HUBO lab at KAIST (1 student and 1 postdoc) [16, 33, 46], and 5 students supported by the Chinese Scholarship Council [52, 128]. We welcome new visitors, who are typically paid by their home

institutions during the visit. We are currently experimenting with the use of YouTube and lab notebooks on the

web to make public preliminary results as well as final papers and videos. We have found this is a useful way

to support internal communication as well as potentially create outside interest in our work. We will continue

to give lectures to visiting K-12 classes. We will participate in NSF QoLT ERC outreach activities including

the QoLT Ambassador program, TechLink, and BodyScout activities. The Carnegie Mellon Robotics Institute

already has an aggressive outreach program at the K-12 level [34], and we will participate in that program, as

well as other CMU programs such as Andrew’s Leap (a CMU summer enrichment program for high school

students), the SAMS program (Summer Academy for Mathematics + Science: a summer program for diversity

aimed at high schoolers), Creative Tech Night for Girls, and other minority outreach programs.

Dissemination of materials: To distribute the results of our research, we will make our simulations publicly

available, and we will develop course materials on legged locomotion, optimal control, and humanoid robotics

for courses we and others will teach. We are currently developing a course on balance and posture, and also

legged locomotion. We will make these course materials freely available on the web. An additional activity that

would be made possible by this funding is to make our simulations publicly available and maintained.

Diversity: For a proposal of this size it is difficult to create independent diversity efforts such as recruiting

efforts at minority colleges. We will utilize the REU associated with the NSF QoLT ERC as a way to reach

undergraduates who are not students at the University of Pittsburgh or Carnegie Mellon. This REU focuses on

students with disabilities and minority students. We will also make use of existing efforts that are part of ongoing

efforts in the Robotics Institute, and CMU-wide efforts. These efforts include supporting minority visits to

CMU, recruiting at various conferences and educational institutions, and providing minority fellowships. CMU

is fortunate to be attracting an unusually high percentage of female undergraduates in Computer

Science. Our NSF QoLT ERC and our connection with the University of Pittsburgh’s Rehabilitation Science

and Technology Department have proved to be a magnet for students with disabilities. In terms of attracting

non-typical graduate students, we have had success working with relevant graduate admissions programs.

Relationship to current funding: Atkeson is participating in CMU’s NSF Quality of Life Technology

(QoLT) Engineering Research Center (ERC) (EEC-0540865). The ERC focuses on assistive technology rather

than robust intelligence or robotics, and is not able to support the proposed work. The ERC is able to support the

development of soft robot prototypes for assistive applications. Atkeson is PI on NSF award ECCS-0824077,

Approximate Dynamic Programming Using Random Sampling. This award supports one student and will be

finished by the time any new support is awarded. Atkeson is co-PI on NSF award IIS-0964581, RI: Medium:

Collaborative Research: Trajectory Libraries for Locomotion on Rough Terrain, which focuses on using tra-

jectory libraries for the control of a humanoid robot. Atkeson is PI on the DARPA M3 Program with support

entitled Achieving Maximum Mobility and Manipulation Using Human-like Compliant Behavior and Behavior

Libraries. This project focuses its support on high performance control of a humanoid robot. Atkeson is PI on

a DARPA M3 Program subcontract from iRobot entitled Maximum Mobility and Manipulation which supports

the construction of soft robot prototypes for military applications. The ERC and DARPA support for soft robot

prototypes will provide the soft robots used in the proposed research.

Why five years of funding?: My average student takes about five years to graduate. We start our students

on research the day they arrive. I would like them to be able to fully concentrate on one project during their

graduate school career. The three year funding model does not match reality for our students.


References

[1] A. Albu-Schaffer, C. Ott, and G. Hirzinger. A unified passivity-based control framework for position,

torque and impedance control of flexible joint robots. International Journal of Robotics Research,

26(1):23–39, 2007.

[2] B. D. O. Anderson. Failures of adaptive control theory and their resolution. Commun. Inf. Syst., 5(1):1–

20, 2005.

[3] S. O. Anderson, J. K. Hodgins, and C. G. Atkeson. Approximate policy transfer applied to simulated

bongo board balance. In IEEE-RAS International Conference on Humanoid Robots (Humanoids), 2007.

[4] S. O. Anderson, M. Wisse, C. G. Atkeson, J. K. Hodgins, G. J. Zeglin, and B. Moyer. Powered bipeds

based on passive dynamic principles. In Proceedings of the 5th IEEE-RAS International Conference on

Humanoid Robots (Humanoids 2005), pages 110–116, 2005.

[5] S.O. Anderson, C.G. Atkeson, and J.K. Hodgins. Coordinating feet in bipedal balance. In Proceedings

of the 6th IEEE-RAS International Conference on Humanoid Robots (Humanoids 2006), 2006.

[6] S.O. Anderson and S. Srinivasa. Identifying trajectory classes in dynamic tasks. In International Sympo-

sium on Approximate Dynamic Programming and Reinforcement Learning, 2007.

[7] C. G. Atkeson. Randomly sampling actions in dynamic programming. In IEEE International Symposium

on Approximate Dynamic Programming and Reinforcement Learning (ADPRL), pages 185–192, 2007.

[8] C. G. Atkeson and B. Stephens. Multiple balance strategies from one optimization criterion. In IEEE-RAS

International Conference on Humanoid Robots (Humanoids), 2007.

[9] C. G. Atkeson and B. Stephens. Random sampling of states in dynamic programming. In Neural Infor-

mation Processing Systems (NIPS) Conference, 2007.

[10] C. G. Atkeson and B. Stephens. Random sampling of states in dynamic programming. IEEE Transactions

on Systems, Man, and Cybernetics, Part B, 38(4):924–929, 2008.

[11] V. Azhmyakov, V. Boltyanski, and A. Poznyak. The dynamic programming approach to multi-model

robust optimization. Nonlinear Analysis, Theory, Methods & Applications, 72(2):1110–1119, 2010.

[12] D. Bagnell and J. Schneider. Autonomous helicopter control using reinforcement learning policy search

methods. In IEEE International Conference on Robotics and Automation, 2001.

[13] S. N. Balakrishnan, J. Ding, and F. L. Lewis. Issues on stability of ADP feedback controllers for dynami-

cal systems. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 38(4):913–917,

2008.

[14] Jernej Barbic, Alla Safonova, Jia-Yu Pan, Christos Faloutsos, Jessica K. Hodgins, and Nancy S. Pollard.

Segmenting Motion Capture Data into Distinct Behaviors. In Proceedings of Graphics Interface 2004,

pages 185–194, 2004.

[15] F. J. Bejarano, A. Poznyak, and L. Fridman. Min-max output integral sliding mode control for multiplant

linear uncertain systems. In American Control Conference, pages 5875–5880, 2007.

[16] D. Bentivegna, C. G. Atkeson, and J.-Y. Kim. Compliant control of a hydraulic humanoid joint. In

IEEE-RAS International Conference on Humanoid Robots (Humanoids), 2007.

[17] D.C. Bentivegna, C.G. Atkeson, and G. Cheng. Learning from observation and practice using primitives.

In AAAI Symposium on Real Life Reinforcement Learning, 2004.


[18] D.C. Bentivegna, C.G. Atkeson, A. Ude, and G. Cheng. Learning tasks from observation and practice.

Robotics and Autonomous Systems, 47:163–169, 2004.

[19] D.C. Bentivegna, C.G. Atkeson, A. Ude, and G. Cheng. Learning to act from observation and practice.

International Journal of Humanoid Robotics, 1(4):585–611, 2004.

[20] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 1995.

[21] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA,

1996.

[22] J. T. Betts. Practical Methods for Optimal Control Using Nonlinear Programming. Society for Industrial

and Applied Mathematics (SIAM), 2001.

[23] R. R. Bitmead, M. Gevers, and V. Wertz. Adaptive Optimal Control: The Thinking Man’s GPC. Prentice

Hall, 1990.

[24] Jinxiang Chai and Jessica K. Hodgins. Performance animation from low-dimensional control signals.

ACM Transactions on Graphics (SIGGRAPH 2005), 24(3):686–696, 2005.

[25] Thierry Chaminade, Jessica Hodgins, and Mitsuo Kawato. Anthropomorphism influences perception of

computer-animated characters’ actions. Social Cognitive and Affective Neuroscience, 2007.

[26] L. Y. Chang, N. Pollard, T. Mitchell, and E. P. Xing. Feature selection for grasp recognition from optical

markers. In IEEE/RSJ Intl. Conference on Intelligent Robots and Systems (IROS), 2007.

[27] L. Y. Chang and N. S. Pollard. Constrained least-squares optimization for robust estimation of center of

rotation. Journal of Biomechanics, 40(6):1392–1400, 2007.

[28] L. Y. Chang and N. S. Pollard. Robust estimation of dominant axis of rotation. Journal of Biomechanics,

40(12):2707–2715, 2007.

[29] G. Cheng, S.-H. Hyon, A. Ude, J. Morimoto, J. G. Hale, J. Hart, J. Nakanishi, D. Bentivegna, J. Hodgins,

C. Atkeson, M. Mistry, S. Schaal, and M. Kawato. CB: Exploring neuroscience with a humanoid research

platform. In IEEE International Conference on Robotics and Automation, pages 1772–1773, 2008.

[30] J. Chestnutt and J.J. Kuffner. A tiered planning strategy for biped navigation. In IEEE International

Conference on Humanoid Robotics, 2004.

[31] J. Chestnutt and J.J. Kuffner. Dynamic footstep planning. In IEEE International Conference on Robotics

and Automation, 2005.

[32] German Cheung, Simon Baker, Jessica K. Hodgins, and Takeo Kanade. Markerless human motion trans-

fer. In Proceedings of the SIGGRAPH 2004 Conference on Sketches & Applications. ACM Press, August

2004.

[33] D. Choi, C. G. Atkeson, S. J. Cho, and J. Y. Kim. Phase plane control of a humanoid. In IEEE-RAS

International Conference on Humanoid Robots (Humanoids), pages 145–150, 2008.

[34] CMU Office of Robotics Education. www.cs.cmu.edu/~roboed/.

[35] P. Dyer and S. R. McReynolds. The Computation and Theory of Optimal Control. Academic Press, New

York, NY, 1970.

[36] L. A. Feldkamp and G. V. Puskorius. Fixed-weight controller for multiple systems. In International

Conference on Neural Networks, pages 773–778, 1997.


[37] J. L. Fu and N. S. Pollard. On the importance of asymmetries in grasp quality metrics for tendon driven

hands. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1068–

1075, 2006.

[38] J. L. Fu, S. S. Srinivasa, N. S. Pollard, and B. C. Nabbe. Planar batting with shape and pose uncertainty.

In IEEE International Conference on Robotics and Automation, 2007.

[39] A. Geramifard. Scaling Learning to Multi-UAV Domains Using Adaptive Representations and Integration

with Cooperative Planners. PhD thesis, MIT, 2011.

[40] D. Han and S. N. Balakrishnan. Adaptive critic based neural networks for control-constrained agile missile

control. In IEEE American Control Conference, pages 2600–2604, 1999.

[41] R. G. Heikes, D. C. Montgomery, and R. L. Rardin. Using common random numbers in simulation

experiments - an approach to statistical analysis. Simulation, 27(3):81–85, 1976.

[42] iRobot Inflatable Robots. www.irobot.com.

[43] D. H. Jacobson and D. Q. Mayne. Differential Dynamic Programming. Elsevier, New York, NY, 1970.

[44] S. M. Joshi. Optimal output feedback control of linear systems in presence of forcing and measurement

noise. Technical Memorandum TM X-72003, NASA, Langley Research Center, Hampton VA, 1974.

DTIC ADA280162; handle.dtic.mil/100.2/ADA280162.

[45] D. A. Kendrick. Stochastic Control For Economic Models. eco.utexas.edu/faculty/Kendrick, 2nd edition,

2002.

[46] J.-Y. Kim, C. G. Atkeson, J. K. Hodgins, D. Bentivegna, and S. J. Cho. Online gain switching algorithm

for joint position control of a hydraulic humanoid robot. In IEEE-RAS International Conference on

Humanoid Robots (Humanoids), 2007.

[47] J. Z. Kolter. Learning and control with inaccurate models. PhD thesis, Stanford, August 2010.

[48] F. L. Lewis, G. Lendaris, and D. Liu. Special issue on approximate dynamic programming and reinforce-

ment learning for feedback control. IEEE Trans. Syst. Man, Cybern – B: Cybernetics, 41(1):896–897,

2008.

[49] F. L. Lewis and D. Vrabie. Reinforcement learning and adaptive dynamic programming for feedback

control. IEEE Circuits and Systems Magazine, 9(3):32–50, 2009.

[50] Y. Li and N. S. Pollard. A shape matching algorithm for synthesizing humanlike enveloping grasps. In

IEEE-RAS International Conference on Humanoid Robots (Humanoids), pages 442–449, 2005.

[51] Ying Li, Jiaxin L. Fu, and Nancy S. Pollard. Data driven grasp synthesis using shape matching and

task-based pruning. IEEE Transactions on Visualization and Computer Graphics, 13(4):732–747, 2007.

[52] C. Liu and C. G. Atkeson. Standing balance control using a trajectory library. In IEEE/RSJ International

Conference on Intelligent Robots and Systems (IROS), pages 3031–3036, 2009.

[53] A. Mahboobin, P. J. Loughlin, C. G. Atkeson, and M. S. Redfern. A mechanism for sensory re-weighting

in postural control. Medical and Biological Engineering and Computing, 47(9):921–929, 2009.

[54] A. Mahboobin, P. J. Loughlin, M. S. Redfern, S. O. Anderson, C. G. Atkeson, and J. K. Hodgins. Sen-

sory adaptation in human balance control: Lessons for biomimetic robotic bipeds. Neural Networks,

21(4):621–627, 2008.


[55] P. M. Makila and H. T. Toivonen. Computational methods for parametric LQ problems— a survey. IEEE

Transactions on Automatic Control, AC-32(8):658–671, 1987.

[56] T. Mandersloot, M. Wisse, and C.G. Atkeson. Controlling velocity in bipedal walking: A dynamic

programming approach. In Proceedings of the 6th IEEE-RAS International Conference on Humanoid

Robots (Humanoids), 2006.

[57] Mangar USA Inflatable Mobility Aids. www.mangarusa.com.

[58] James McCann and Nancy S. Pollard. Responsive characters from motion fragments. ACM Transactions

on Graphics (SIGGRAPH), 26(3), 2007.

[59] James McCann, Nancy S. Pollard, and Siddhartha S. Srinivasa. Physics-based motion retiming. In 2006

ACM SIGGRAPH / Eurographics Symposium on Computer Animation, September 2006.

[60] M. McNaughton. CASTRO: robust nonlinear trajectory optimization using multiple models. In IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS), pages 177–182, 2007.

[61] Ronald A. Metoyer and Jessica K. Hodgins. Reactive pedestrian path following from examples. The

Visual Computer, 20(10):635–649, 2004.

[62] W. T. Miller, R. S. Sutton, and P. J. Werbos. Neural Networks for Control. MIT Press, 1990.

[63] J. Morimoto and C. G. Atkeson. Improving humanoid locomotive performance with learnt approximated

dynamics via gaussian processes for regression. In IEEE/RSJ International Conference on Intelligent

Robots and Systems, 2007.

[64] J. Morimoto and C. G. Atkeson. Learning biped locomotion: Application of Poincare-map-based rein-

forcement learning. IEEE Robotics and Automation Magazine, 14(2):41–51, 2007.

[65] J. Morimoto and C. G. Atkeson. Nonparametric representation of an approximated Poincare map for

learning biped locomotion. Autonomous Robots, 27(2):131–144, 2009.

[66] J. Morimoto, G. Cheng, C. G. Atkeson, and G. Zeglin. A simple reinforcement learning algorithm for

biped walking. In IEEE International Conference on Robotics and Automation, pages 3030–3035, 2004.

[67] J. Morimoto, G. Endo, J. Nakanishi, S. Hyon, G. Cheng, D. Bentivegna, and C.G. Atkeson. Modulation

of simple sinusoidal patterns by a coupled oscillator model for biped walking. In IEEE International

Conference on Robotics and Automation, 2006.

[68] J. Morimoto, S. H. Hyon, C. G. Atkeson, and G. Cheng. Low-dimensional feature extraction for hu-

manoid locomotion using kernel dimension reduction. In IEEE-RAS Conference on Robotics and Au-

tomation, pages 2711–2716, 2008.

[69] J. Morimoto, J. Nakanishi, G. Endo, G. Cheng, C. G. Atkeson, and G. Zeglin. Acquisition of biped

walking policy using an approximated Poincare map. In IEEE International Conference on Humanoid

Robotics, 2004.

[70] J. Morimoto, J. Nakanishi, G. Endo, G. Cheng, C. G. Atkeson, and G. Zeglin. Poincare-map-based re-

inforcement learning for biped walking. In IEEE International Conference on Robotics and Automation,

pages 2381–2386, 2005.

[71] J. J. Murray, C. J. Cox, G. G. Lendaris, and R. Saeks. Adaptive dynamic programming. IEEE Transac-

tions on Systems, Man, and Cybernetics, Part C, 32(2):140–153, 2002.


[72] A. Y. Ng and M. Jordan. Pegasus: A policy search method for large MDPs and POMDPs. In Uncertainty

in Artificial Intelligence, Proceedings of the Sixteenth Conference, 2000.

[73] R. H. Nystrom, J. M. Boling, J. M. Ramstedt, H. T. Toivonen, and K. E. Haggblom. Application of robust

and multimodel control methods to an ill-conditioned distillation column. Journal of Process Control,

12:39–53, 2002.

[74] J. O’Reilly. Optimal low-order feedback controllers for linear discrete-time systems. In C. T. Leondes,

editor, Control and Dynamic Systems, volume 16, pages 335–367. Academic Press, 1980.

[75] Otherlab Inflatable Robots. www.otherlab.com.

[76] Sang Il Park and Jessica K. Hodgins. Capturing and animating skin deformation in human motion. ACM

Transactions on Graphics (SIGGRAPH 2006), 25(3), August 2006.

[77] Y. Piguet, U. Holmberg, and R. Longchamp. A minimax approach for multi-objective controller design

using multiple models. International Journal of Control, 72(7-8):716–726, 1999.

[78] N. S. Pollard and V. B. Zordan. Physically based grasping control from example. In 2005 ACM SIG-

GRAPH / Eurographics Symposium on Computer Animation, August 2005.

[79] W. B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley-

Interscience, 2nd edition, 2011.

[80] W. B. Powell and J. Ma. A review of stochastic algorithms with continuous value function approximation

and some new approximate policy iteration algorithms for multi-dimensional continuous applications.

Journal of Control Theory and Applications, 9(3):336–352, 2011.

[81] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C. Cambridge

University Press, 2nd edition, 1992.

[82] D. V. Prokhorov. Backpropagation through time and derivative adaptive critics — a common framework

for comparison. In Handbook of Learning and Approximate Dynamic Programming [97], pages 381–404.

[83] D. V. Prokhorov. Training recurrent neurocontrollers for robustness with derivative-free Kalman filter.

IEEE Transactions on Neural Networks, 17(6):1606–1616, 2006.

[84] P. S. A. Reitsma and N. S. Pollard. Evaluating motion graphs for character animation. ACM Transactions

on Graphics, in press, 2007.

[85] Paul S. A. Reitsma and Nancy S. Pollard. Evaluating motion graphs for character navigation. In 2004

ACM SIGGRAPH / Eurographics Symposium on Computer Animation, August 2004.

[86] Liu Ren, Alton Patrick, Alexei A. Efros, Jessica K. Hodgins, and James M. Rehg. A data-driven approach

to quantifying natural human motion. ACM Transactions on Graphics (SIGGRAPH 2005), 24(3), August

2005.

[87] Liu Ren, Gregory Shakhnarovich, Jessica K. Hodgins, Hanspeter Pfister, and Paul Viola. Learning sil-

houette features for control of human motion. ACM Transactions on Graphics, 24(4), October 2005.

[88] M. G. Safonov and M. K. H. Fan. Editorial, special issue on multivariable stability margin. International

Journal Of Robust And Nonlinear Control, 7:97–103, 1997.

[89] Alla Safonova and Jessica K. Hodgins. Analyzing the physical correctness of interpolated human motion.

In ACM SIGGRAPH / Eurographics Symposium on Computer Animation, pages 171–180, 2005.


[90] Alla Safonova and Jessica K. Hodgins. Construction and optimal search of interpolated motion graphs.

ACM Transactions on Graphics (SIGGRAPH 2007), 26(3), August 2007.

[91] Alla Safonova, Jessica K. Hodgins, and Nancy S. Pollard. Synthesizing physically realistic human motion

in low-dimensional, behavior-specific spaces. ACM Transactions on Graphics (SIGGRAPH 2004), 23(3),

August 2004.

[92] G. Saini and S. N. Balakrishnan. Adaptive critic based neurocontroller for autolanding of aircrafts. In

IEEE American Control Conference, pages 1081–1085, 1997.

[93] S. Sanan, J. Moidel, and C. G. Atkeson. Robots with inflatable links. In IEEE/RSJ Int. Conf. Intell.

Robots Syst. (IROS), pages 4331–4336, 2009.

[94] S. Sanan, M. H. Ornstein, and C. G. Atkeson. Physical human interaction for an inflatable manipulator.

In IEEE Engineering in Medicine and Biology Society (EMBC), 2011.

[95] S. Schaal and C. G. Atkeson. Learning control for robotics. IEEE Robotics & Automation Magazine,

17(2):20–29, 2010.

[96] J. Shinar, V. Glizer, and V. Turetsky. Solution of a linear pursuit-evasion game with variable structure and

uncertain dynamics. In S. Jorgensen, M. Quincampoix, and T. Vincent, editors, Advances in Dynamic

Game Theory - Numerical Methods, Algorithms, and Applications to Ecology and Economics, volume 9,

pages 195–222. Birkhauser, 2007.

[97] J. Si, A. G. Barto, W. B. Powell, and D. Wunsch II. Handbook of Learning and Approximate Dynamic

Programming. Wiley-IEEE Press, 2004.

[98] B. Stephens. Humanoid push recovery. In IEEE-RAS International Conference on Humanoid Robotics

(Humanoids), 2007.

[99] B. Stephens. Integral control of humanoid balance. In IEEE/RSJ International Conference on Intelligent

Robots and Systems (IROS), 2007.

[100] B. Stephens. Modeling and control of periodic humanoid balance using the linear biped model. In

International Conference on Humanoid Robots, 2009.

[101] B. Stephens. Dynamic balance force control for compliant humanoid robots. In International Conference

on Intelligent Robots and Systems (IROS), pages 1248–1255, 2010.

[102] B. Stephens and C. G. Atkeson. Push recovery by stepping for humanoid robots with force controlled

joints. In International Conference on Humanoid Robots, 2010.

[103] M. Stilman, C. G. Atkeson, J. J. Kuffner, and G. Zeglin. Dynamic programming in reduced dimensional

spaces: Dynamic planning for robust biped locomotion. In IEEE International Conference on Robotics

and Automation (ICRA), pages 2399–2404, 2005.

[104] M. Stilman and J. J. Kuffner. Navigation among movable obstacles: Real-time reasoning in complex

environments. In IEEE International Conference on Humanoid Robotics (Humanoids), 2004.

[105] M. Stilman and J.J. Kuffner. Navigation among movable obstacles: Real-time reasoning in complex

environments. Int. J. Humanoid Robotics, 2(4):1–24, 2005.

[106] M. Stilman and J.J. Kuffner. Planning among movable obstacles with artificial constraints. In Proc. 6th

Int’l Workshop on the Algorithmic Foundations of Robotics (WAFR’06), pages 1–20, 2006.


[107] M. Stilman and J.J. Kuffner. Planning among movable obstacles with artificial constraints. Submitted to

Int. J. Robotics Research (special issue of selected papers from WAFR’06), 2007.

[108] M. Stilman, K. Nishiwaki, S. Kagami, and J.J. Kuffner. Planning and executing navigation among mov-

able obstacles. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS’06), 2006.

[109] M. Stilman, K. Nishiwaki, S. Kagami, and J.J. Kuffner. Planning and executing navigation among mov-

able obstacles. Advanced Robotics, 2007. To appear.

[110] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Springer Verlag, 3rd edition, 2002.

[111] Martin Stolle and Christopher G. Atkeson. Policies based on trajectory libraries. In Proceedings of the

International Conference on Robotics and Automation (ICRA), 2006.

[112] Martin Stolle and Christopher G. Atkeson. Knowledge transfer using local features. In Proceedings of the

IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL), 2007.

[113] H. T. Su and T. Samad. Neuro-control design: Optimization aspects. In O. Omidvar and D. L. Elliott,

editors, Neural Systems For Control, pages 259–288. Academic Press, 1997.

[114] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA,

1998.

[115] V. L. Syrmos, C. T. Abdallah, P. Dorato, and K. Grigoriadis. Static output feedback — a survey. Auto-

matica, 33(2):125–137, 1997.

[116] A. Varga. Optimal output feedback control: a multi-model approach. In IEEE International Symposium

on Computer-Aided Control System Design, pages 327–332, 1996.

[117] F.-Y. Wang, H. Zhang, and D. Liu. Adaptive dynamic programming: An introduction. IEEE Computa-

tional Intelligence Magazine, 4(2):39–47, 2009.

[118] P. J. Werbos. Approximate dynamic programming for real-time control and neural modeling. In Hand-

book of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches [122], pages 493–525.

[119] P. J. Werbos. Neurocontrol and supervised learning: An overview and evaluation. In Handbook of

Intelligent Control: Neural, Fuzzy, and Adaptive Approaches [122], pages 65–89.

[120] P. J. Werbos. The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political

Forecasting. Wiley, 1994.

[121] P. J. Werbos. Multiple models for approximate dynamic programming and true intelligent control: Why

and how. In K. Narendra, editor, Proc. 10th Yale Conference on Learning and Adaptive Systems. EE

Dept. Yale University, 1998.

[122] D. A. White and D. A. Sofge. Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches.

Van Nostrand Reinhold, 1992.

[123] E. Whitman and C. G. Atkeson. Control of a walking biped using a combination of simple policies. In

International Conference on Humanoid Robots, pages 520–527, 2009.

[124] E. Whitman and C. G. Atkeson. Control of instantaneously coupled systems applied to humanoid walk-

ing. In International Conference on Humanoid Robots, pages 210–217, 2010.

[125] E. C. Whitman and C. G. Atkeson. Multiple model robust dynamic programming. Submitted to the
American Control Conference, 2012.


[126] R. J. Williams. Adaptive state representation and estimation using recurrent connectionist networks. In

Neural Networks for Control [62], pages 97–114.

[127] M. Wisse, C. G. Atkeson, and D. K. Kloimwieder. Swing leg retraction helps biped walking stability. In

Proceedings of the 5th IEEE-RAS International Conference on Humanoid Robots (Humanoids), 2005.

[128] D. Xing, C. G. Atkeson, J. Su, and B. Stephens. Gain scheduled control of perturbed standing balance. In

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4063–4068, 2010.

[129] Katsu Yamane, James J. Kuffner, and Jessica K. Hodgins. Synthesizing animations of human manipula-

tion tasks. ACM Transactions on Graphics, 23(3):532–539, August 2004.
