
50th Anniversary of The Curse of Dimensionality

• Continuous States:

– Storage cost: resolution^dx

– Computational cost: resolution^dx

• Continuous Actions:

– Computational cost: resolution^du
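To get a feel for this exponential growth, here is a small illustrative calculation (the resolution, dimensions, and bytes-per-entry below are made-up values, not figures from the lecture):

# Illustrative only: table size as a function of state dimension d_x.
resolution = 100                    # assumed grid points per dimension
for d_x in (2, 4, 6, 8):
    cells = resolution ** d_x       # storage cost ~ resolution^dx
    gb = cells * 4 / 1e9            # 4 bytes per float32 value-function entry
    print(f"d_x = {d_x}: {cells:.1e} grid cells, ~{gb:.1e} GB")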

Beating The Curse Of Dimensionality

• Reduce dimensionality (biped examples)

• Use primitives (Poincaré section)

• Parameterize V, policy (future lecture)

• Reduce volume of state space explored

• Use greater depth search

• Adaptive/problem-specific grid/sampling

– Split where needed

– Random sampling: add where needed

• Random action search

• Random state search

• Hybrid approaches: combine local and global optimization

Use Brute Force

• Deal with the computational cost by using a cluster supercomputer.

• The main issue is minimizing communication between nodes.

Cluster Supercomputing

• Cores: ~8 per node, each with small local memory (cache)

• Nodes: ~100, with shared memory (16 GB) per node

• Network: 4–16 Gb/s

• Disks: ~100 TB

Q(x,u) = L(x,u) + V(f(x,u))

• c = L(x,u): as in desktop case

• x_next = f(x,u): as in desktop case

• V(x_next):

– Uniform grid

– Multilinear interpolation if all values are available; distance-weighted averaging if some values are bad
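A minimal sketch of this V(x_next) lookup, assuming "bad" entries are marked as NaN in the table (the function name and grid layout are illustrative, not the lecture's code):

import numpy as np
from itertools import product

def lookup_value(V, x, lo, dx):
    """Estimate V(x) on a uniform grid (sketch).
    V: d-dimensional array of values, with NaN marking a bad/unknown entry.
    lo: grid origin per dimension; dx: grid spacing per dimension."""
    g = (np.asarray(x, dtype=float) - lo) / dx      # continuous grid coordinates
    i0 = np.floor(g).astype(int)                    # lower corner of the enclosing cell
    frac = g - i0
    offsets = list(product((0, 1), repeat=len(x)))  # the 2^d cell corners
    weights = np.array([np.prod(np.where(o, frac, 1.0 - frac)) for o in offsets])
    values = np.array([V[tuple(i0 + np.array(o))] for o in offsets])
    good = ~np.isnan(values)
    if good.all():                                  # multilinear interpolation
        return float(weights @ values)
    if not good.any():
        return np.nan
    # fall back to distance-weighted averaging over the good corners
    dist = np.array([np.linalg.norm(frac - np.array(o)) for o in offsets])
    w = 1.0 / (dist[good] + 1e-9)
    return float(w @ values[good] / w.sum())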

Allocate grid to cores/nodes

Handle Overlap

Push Updated V’s To Users

So what does this all mean for programming?

• On a node, split grid cells among threads, which execute on cores.

• Share updates of V(x) and u(x) within a node almost for free using shared memory.

• Pushing updated V(x) and u(x) to other nodes uses the network, which is relatively slow.

Dealing with the slow network

• Organize grid cells into packet-sized blocks and send each block as a unit.

• Threshold updates: if a change is too small, don't send it.

• Only send every Nth update of each block, subject to a maximum skip time.

• Tolerate packet loss (UDP) rather than pay for verification (TCP/MPI).
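A minimal sketch of this update-throttling logic (the threshold, skip limit, and block bookkeeping below are illustrative assumptions, not the lecture's implementation):

import numpy as np

class BlockSender:
    """Decide whether a block of V(x)/u(x) updates is worth sending (sketch)."""
    def __init__(self, threshold=1e-3, max_skips=8):
        self.threshold = threshold   # ignore changes smaller than this
        self.max_skips = max_skips   # but never skip a block more than this many times
        self.last_sent = {}          # block_id -> values last pushed over the network
        self.skips = {}              # block_id -> consecutive skips so far

    def maybe_send(self, block_id, values, send_fn):
        prev = self.last_sent.get(block_id)
        skips = self.skips.get(block_id, 0)
        change = np.inf if prev is None else np.max(np.abs(values - prev))
        if change < self.threshold and skips < self.max_skips:
            self.skips[block_id] = skips + 1     # update too small: hold it back
            return False
        send_fn(block_id, values)                # e.g. one UDP datagram per block
        self.last_sent[block_id] = values.copy()
        self.skips[block_id] = 0
        return True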

Use Adaptive Grid

• Reduce computational and storage costs by using an adaptive grid.

• Generate the adaptive grid using random sampling.

Trajectory-Based Dynamic Programming

Full Trajectories Help Reduce the Resolution Needed

SIDP vs. Trajectory-Based

Reducing the Volume Explored

An Adaptive Grid Approach

Global Planning: Propagate Value Function Across Trajectories in Adaptive Grid

Growing the Explored Region: Adaptive Grids

Bidirectional Search

Bidirectional Search Closeup

Spine Representation

Growing the Explored Region: Spine Representation

Comparison

One Link Swing Up Needed Only 63 Points

Trajectories For Each Point

Random Sampling of States

• Initialize with a point at the goal with local models based on LQR.

• Choose a random new state x.

• Use the nearest stored point's local model of the value function to predict the value of the new point (VP).

• Optimize a trajectory from x to the goal. At each step use the nearest stored point's local model of the policy to create an action. Use DDP to refine this trajectory. VT is the cost of the trajectory starting from x.

• Store a point at the start of the trajectory if |VT − VP| > λ (surprise), VT < Vlimit, and VP < Vlimit; otherwise discard it.

• Interleave re-optimization of all stored points. Only update if Vnew < V (V is an upper bound on the value).

• Gradually increase Vlimit. (A code sketch of this loop follows the list.)
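A minimal sketch of one iteration of this sampling loop; the helpers nearest(), optimize(), and the stored-point fields are hypothetical stand-ins for the nearest-neighbor lookup and the local-policy rollout plus DDP refinement:

def accept_new_point(V_T, V_P, lam, V_limit):
    """Acceptance test from the list above (sketch).
    V_T: cost of the DDP-refined trajectory from the sampled state x.
    V_P: value predicted by the nearest stored point's local model."""
    surprise = abs(V_T - V_P) > lam     # local model was wrong: point is informative
    return surprise and V_T < V_limit and V_P < V_limit

def sampling_iteration(stored, sample_state, nearest, optimize, lam, V_limit):
    """One pass of the loop; nearest() and optimize() are hypothetical stand-ins."""
    x = sample_state()                  # choose a random new state
    p = nearest(stored, x)              # nearest stored point, with LQR-based local models
    V_P = p.predict_value(x)            # predicted value of the new point
    traj, V_T = optimize(x, p.local_policy)   # local-policy rollout refined with DDP
    if accept_new_point(V_T, V_P, lam, V_limit):
        stored.append((x, traj, V_T))   # store the point at the start of the trajectory
    return stored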

Two Link Pendulum

• Criterion:

[Slide figure; axes labeled ankle angle, hip angle, ankle torque, and hip torque.]

Four Links: An 8-Dimensional System

Convergence?

• Because we create trajectories to the goal, each value function estimate at a point is an upper bound for the value at that point.

• Eventually all value function entries will be consistent with their nearest neighbor’s local model, and no new points can be added.

• We are using more aggressive acceptance tests for new points: VB < λVP (λ < 1) and VP < Vlimit, vs. |VB − VP| < ε and VB < Vlimit.

• It is not clear whether needed new points can be blocked.

Use Local Models

• Try to achieve a sparse representation using local models.

Linear Quadratic Regulators

Learning From Observation

Regulator Tasks

• Examples: balance a pole, move at a constant velocity.

• A reasonable starting point is a Linear Quadratic Regulator (an LQR controller).

• We might have nonlinear dynamics x_{k+1} = f(x_k, u_k), but since the system stays near x_d, we can locally linearize: x_{k+1} = A x_k + B u_k.

• We might have a complex scoring function c(x, u), but we can locally approximate it with a quadratic model: c ≈ x^T Q x + u^T R u.

• dlqr() in MATLAB computes the LQR gain (a Python sketch follows).
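A rough Python equivalent of MATLAB's dlqr(), assuming SciPy's discrete Riccati solver is available (a sketch, not the lecture's code):

import numpy as np
from scipy.linalg import solve_discrete_are

def dlqr(A, B, Q, R):
    """Discrete-time LQR: u = -K x minimizes the sum of x'Qx + u'Ru (sketch)."""
    P = solve_discrete_are(A, B, Q, R)                  # steady-state Riccati solution
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal feedback gain
    return K, P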

Linearization Example

• I θ'' = −m g l sin(θ) − μ θ' + τ

• Linearize: sin(θ) ≈ θ

• Discretize time with step T

• Vectorize:

(θ, θ')_{k+1} = [1, T; −m g l T/I, 1 − μ T/I] (θ, θ')_k + [0, T/I]^T τ_k
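Building these matrices numerically and closing the loop with the dlqr() sketch above (the parameter values and cost weights are made up for illustration):

import numpy as np

# Illustrative parameters (not from the lecture): unit mass/length/inertia, light damping.
m, g, l, I, mu, T = 1.0, 9.81, 1.0, 1.0, 0.1, 0.01

# Discretized, linearized pendulum: x_{k+1} = A x_k + B u_k with x = (theta, theta_dot)
A = np.array([[1.0,                T               ],
              [-m * g * l * T / I, 1.0 - mu * T / I]])
B = np.array([[0.0],
              [T / I]])

Q = np.diag([1.0, 0.1])     # assumed state penalty
R = np.array([[0.01]])      # assumed torque penalty
K, P = dlqr(A, B, Q, R)     # dlqr() as sketched earlier
print("feedback gain K =", K)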

LQR Derivation

• Assume V() is quadratic: V_{k+1}(x) = x^T V_{xx:k+1} x

• C(x, u) = x^T Q x + u^T R u + (Ax + Bu)^T V_{xx:k+1} (Ax + Bu)

• Want ∂C/∂u = 0

• B^T V_{xx:k+1} A x = −(B^T V_{xx:k+1} B + R) u

• u = Kx (linear controller)

• K = −(B^T V_{xx:k+1} B + R)^{-1} B^T V_{xx:k+1} A

• V_{xx:k} = A^T V_{xx:k+1} A + Q + A^T V_{xx:k+1} B K
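The same recursion written out as a finite-horizon backward pass (a minimal sketch that follows the formulas above, assuming zero terminal cost):

import numpy as np

def lqr_backward_pass(A, B, Q, R, horizon):
    """Finite-horizon LQR via the recursion above (sketch).
    Returns the feedback gains K_k and value-function Hessians V_{xx:k}."""
    n = A.shape[0]
    Vxx = np.zeros((n, n))        # terminal V_{xx} (assumed zero terminal cost)
    gains, hessians = [], [Vxx]
    for _ in range(horizon):
        # K = -(B' Vxx B + R)^{-1} B' Vxx A
        K = -np.linalg.solve(B.T @ Vxx @ B + R, B.T @ Vxx @ A)
        # Vxx:k = A' Vxx:k+1 A + Q + A' Vxx:k+1 B K
        Vxx = A.T @ Vxx @ A + Q + A.T @ Vxx @ B @ K
        gains.append(K)
        hessians.append(Vxx)
    return gains[::-1], hessians[::-1]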

Trajectory Optimization (closed loop)

• Differential Dynamic Programming (local approach to DP).

Learning Trajectories

Q function

• x: state, u: control or action

• Dynamics: x_{k+1} = f(x_k, u_k)

• Cost function: L(x, u)

• Value function: V(x) = Σ L(x, u), summed along the trajectory from x

• Q function: Q(x, u) = L(x, u) + V(f(x, u))

• Bellman's equation: V(x) = min_u Q(x, u)

• Policy/control law: u(x) = argmin_u Q(x, u)
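A tabular value-iteration sketch of these definitions on a toy 1-D problem (the grid, dynamics f, and cost L below are placeholders, not the lecture's system):

import numpy as np

xs = np.linspace(-1.0, 1.0, 101)               # state grid
us = np.linspace(-0.5, 0.5, 21)                # action grid
f = lambda x, u: np.clip(x + 0.1 * u, -1, 1)   # dynamics x_{k+1} = f(x_k, u_k)
L = lambda x, u: x**2 + 0.1 * u**2             # one-step cost

V = np.zeros_like(xs)
for sweep in range(200):                       # repeated Bellman backups
    Q = np.empty((xs.size, us.size))
    for j, u in enumerate(us):
        x_next = f(xs, u)
        # Q(x,u) = L(x,u) + V(f(x,u)), with V interpolated on the grid
        Q[:, j] = L(xs, u) + np.interp(x_next, xs, V)
    V = Q.min(axis=1)                          # Bellman: V(x) = min_u Q(x,u)
policy = us[Q.argmin(axis=1)]                  # u(x) = argmin_u Q(x,u)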

Local Models About a Trajectory

Propagating Local Models Along a Trajectory:

Differential Dynamic Programming: Gradient Version

• V_{x:k-1} = Q_x = L_x + V_x f_x

• Δu = Q_u = L_u + V_x f_u
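A minimal sketch of this gradient-only backward pass (per-step Jacobians and cost gradients are assumed inputs; the code uses the column-gradient convention, so V_x f_x appears as f_x' V_x):

import numpy as np

def ddp_gradient_backward(fx, fu, Lx, Lu, Vx_terminal):
    """Gradient-version DDP backward pass (sketch).
    fx[k], fu[k]: dynamics Jacobians at step k; Lx[k], Lu[k]: cost gradients.
    Returns Q_u at each step, the direction along which to correct u."""
    T = len(fx)
    Vx = Vx_terminal
    Qu = [None] * T
    for k in reversed(range(T)):
        Qu[k] = Lu[k] + fu[k].T @ Vx    # Δu ∝ Q_u = L_u + V_x f_u
        Vx = Lx[k] + fx[k].T @ Vx       # V_{x:k} = Q_x = L_x + V_x f_x
    return Qu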

Differential Dynamic Programming (DDP)

[McReynolds 70, Jacobson 70]

[Diagram: a nominal trajectory from the initial to the terminal state over time t, with a backward value-function update sweep and a forward execution pass producing an improved trajectory. Starting from V(T), each backward step computes Q(k) (action value function), u'(k) (new control output), and V(k) (state value function): Q(T−1), u'(T−1), V(T−1), then Q(T−2), u'(T−2), V(T−2), and so on.]

Requires:

• Dynamics model

• Penalty function

Propagating Local Models Along a Trajectory:

Differential Dynamic Programming

Levenberg-Marquardt

• y = f(x)

• min_x s, where s = y^T y / 2

• Gradient: ∂s/∂x = (∂f/∂x)^T y = J^T y

• Hessian: ∂²s/∂x² = H = (∂²f/∂x²) y + J^T J

• 2nd-order gradient descent: Δx = H^{-1} J^T y

• Problem: H may not be positive definite

• Solution: Δx = (H + λI)^{-1} J^T y

• λ small: 2nd-order approach

• λ large: 1st-order approach, Δx ≈ J^T y / λ

• Trick 2: H ≈ J^T J (Gauss-Newton approximation)
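A minimal sketch of one Levenberg-Marquardt step using the H ≈ J^T J trick (the residual and Jacobian functions are assumed to be supplied by the caller):

import numpy as np

def lm_step(residual, jacobian, x, lam):
    """One Levenberg-Marquardt update for min_x 0.5 * ||f(x)||^2 (sketch)."""
    y = residual(x)                      # y = f(x)
    J = jacobian(x)                      # J = df/dx
    H = J.T @ J                          # Trick 2: H ≈ J'J (drop the second-order term)
    g = J.T @ y                          # gradient of s = y'y/2
    dx = np.linalg.solve(H + lam * np.eye(x.size), g)   # Δx = (H + λI)^{-1} J'y
    return x - dx                        # step against the gradient direction

In practice λ is decreased when a step reduces the cost and increased when it does not.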

Levenberg Marquardt-like DDP

• Δu = (Q_uu + λI)^{-1} Q_u

• K = (Q_uu + λI)^{-1} Q_ux

• Drop the f_xx, f_xu, f_ux, and f_uu terms
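One regularized backward-pass step following these formulas (a sketch; Q_uu, Q_u, and Q_ux are assumed to come from expanding Q along the nominal trajectory):

import numpy as np

def lm_ddp_gains(Quu, Qu, Qux, lam):
    """Regularized DDP update at one time step (sketch).
    Returns the feedforward correction du and feedback gain K."""
    Quu_reg = Quu + lam * np.eye(Quu.shape[0])   # (Q_uu + λI)
    du = np.linalg.solve(Quu_reg, Qu)            # Δu = (Q_uu + λI)^{-1} Q_u
    K = np.linalg.solve(Quu_reg, Qux)            # K  = (Q_uu + λI)^{-1} Q_ux
    return du, K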

Other tricks

• If Δu fails, try ε Δu (a smaller step).

• Optimize just the last part of the trajectory.

• Regularize Q_xx.

Neighboring Optimal Control

What Changes When the Task Is Periodic?

• A discount factor means V() might increase along a trajectory; V() cannot always decrease in periodic tasks.

Robot Hopper Example

Dimensionality Reduction

• Use of simple models (for example, the Linear Inverted Pendulum Model, LIPM)

• Poincaré section

Inverted Pendulum Model

• Massless legs

• State: pitch angular velocity at TOP

• Controls: ankle torque, step length


Optimization Criterion

• T is the step duration; Ta is the ankle torque; ø is the leg swing angle; Vd is the desired velocity.

• Ankle torque: Σ Ta²

• Swing leg acceleration: (ø/T²)²

• Match desired velocity: (2 sin(ø/2)/T − Vd)²

• The criterion is a weighted sum of the above terms (a small numeric sketch follows).
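Putting the terms together numerically (the weights, torque sequence, and other values below are made-up placeholders, not the lecture's numbers):

import numpy as np

def step_criterion(Ta_seq, phi, T, Vd, w=(1.0, 1.0, 1.0)):
    """Weighted per-step cost for the walking model above (sketch).
    Ta_seq: ankle torques sampled over the step; phi: leg swing angle; T: step duration."""
    ankle_torque   = np.sum(np.square(Ta_seq))                # Σ Ta²
    swing_accel    = (phi / T**2) ** 2                        # (ø/T²)²
    velocity_error = (2.0 * np.sin(phi / 2.0) / T - Vd) ** 2  # (2 sin(ø/2)/T − Vd)²
    return w[0] * ankle_torque + w[1] * swing_accel + w[2] * velocity_error

print(step_criterion(Ta_seq=np.full(10, 0.05), phi=0.5, T=0.6, Vd=1.0))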

[Diagram: Poincaré section taken at TOP, with the transition between successive sections shown.]

Poincaré Section

Optimal Controller For Sagittal Plane Only (Vd = 1)

[Figure panels: foot placement policy, ankle torque policy, value function, and return map, plotted against velocity.]

Return Map (Vd = 1)


Foot Placement Policies

Ankle Torque Policies

Return Maps


Add Torso

Optimization Criteria

• Ankle torque: Σ Ta²

• Swing leg acceleration: (ø/T²)²

• Match desired velocity: (2 sin(ø/2)/T − Vd)²

• Desired torso angle: Σ (ψd)²

Simulation

[Simulation plots; panels labeled "Commands" and "Torso".]

What difference does a torso make?

3D version: Add roll

• State: pitch velocity, roll, roll velocity at TOP

• Action 1: Sagittal foot placement

• Action 2: Sagittal ankle torque

• Action 3: Lateral foot placement

Roll Optimization Criteria

• T = step duration.

• Ankle torque: torque²

• Swing leg acceleration: (ø/T²)²

• Match desired velocity: (2 sin(ø/2)/T − Vd)²

• Roll leg acceleration: (ø_roll/T²)²

Lateral foot placement at fixed roll
