Page 1:

Space-Indexed Dynamic Programming: Learning to Follow Trajectories

J. Zico Kolter, Adam Coates, Andrew Y. Ng, Yi Gu, Charles DuHadway

Computer Science Department, Stanford University

July 2008, ICML

Page 2:

Outline

• Reinforcement Learning and Following Trajectories

• Space-indexed Dynamical Systems and Space-indexed Dynamic Programming

• Experimental Results

Page 3:

Reinforcement Learning and Following Trajectories

Page 4:

Trajectory Following

• Consider the task of following a trajectory in a vehicle such as a car or helicopter

• State space is too large to discretize, so tabular RL / dynamic programming can't be applied

Page 5:

Trajectory Following

• Dynamic programming algorithms with non-stationary policies seem well-suited to this task:
  – Policy Search by Dynamic Programming (Bagnell et al., 2004)
  – Differential Dynamic Programming (Jacobson and Mayne, 1970)

Page 6:

Dynamic Programming

t=1

Divide control task into discrete time steps

Page 7:

Dynamic Programming

t=1

Divide control task into discrete time steps

t=2

Page 8:

Dynamic Programming

t=1

Divide control task into discrete time steps

t=2 t=3 t=4 t=5 …

Page 9:

Dynamic Programming

t=1 t=2 t=3 t=4 t=5 …

Proceeding backwards in time, learn policies for t = T, T-1, …, 2, 1

Page 10:

Dynamic Programming

t=1 t=2 t=3 t=4 t=5 …

Proceeding backwards in time, learn policies for t = T, T-1, …, 2, 1

π5

Page 11:

Dynamic Programming

t=1 t=2 t=3 t=4 t=5 …

Proceeding backwards in time, learn policies for t = T, T-1, …, 2, 1

π5 π4

Page 12:

Dynamic Programming

t=1 t=2 t=3 t=4 t=5 …

Proceeding backwards in time, learn policies for t = T, T-1, …, 2, 1

π5 π4 π3 π2 π1

Page 13:

Dynamic Programming

t=1 t=2 t=3 t=4 t=5 …

Key Advantage: Policies are local (only need to perform well over a small portion of the state space)

π5 π4 π3 π2 π1
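To make the backward pass above concrete, here is a minimal sketch of dynamic programming with non-stationary, time-indexed policies. It is a simplified greedy-rollout variant in the spirit of PSDP, not the exact published algorithm; `sample_states`, `simulate`, `cost`, and `fit_policy` are assumed placeholders supplied by the caller.

```python
def backward_time_indexed_dp(sample_states, actions, simulate, cost, fit_policy, T):
    """Hedged sketch: learn a separate local policy for each time step,
    proceeding backwards in time (t = T, T-1, ..., 1).

    sample_states(t) -> assumed sample of states the vehicle may occupy at step t
    simulate(s, u)   -> next state after one time step (placeholder dynamics)
    cost(s, u)       -> immediate cost of taking action u in state s
    fit_policy(data) -> supervised learner mapping state -> action
    """
    policies = [None] * (T + 1)                   # policies[t] is executed at time t

    for t in range(T, 0, -1):                     # proceed backwards in time
        examples = []
        for s in sample_states(t):
            best_u, best_c = None, float("inf")
            for u in actions:
                c, s_next = cost(s, u), simulate(s, u)
                # Score u by rolling out the already-learned policies for t+1..T
                for t2 in range(t + 1, T + 1):
                    u2 = policies[t2](s_next)
                    c, s_next = c + cost(s_next, u2), simulate(s_next, u2)
                if c < best_c:
                    best_u, best_c = u, c
            examples.append((s, best_u))
        # Each policy only needs to perform well near the states sampled for step t
        policies[t] = fit_policy(examples)
    return policies
```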

Page 14:

Problems with Dynamic Programming

Problem #1: Policies from traditional dynamic programming algorithms are time-indexed

Page 15:

Problems with Dynamic Programming

π5

Suppose we learned policy π5 assuming this distribution over states

Page 16:

Problems with Dynamic Programming

π5

But, due to the natural stochasticity of the environment, the car is actually here at t = 5

Page 17:

Problems with Dynamic Programming

π5

The resulting policy will perform very poorly

Page 18:

Problems with Dynamic Programming

π5 π4 π3 π2 π1

Partial Solution: Re-indexing
Execute the policy closest to the current location, regardless of time
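A hedged sketch of this re-indexing rule: at execution time, pick the time-indexed policy whose nominal trajectory point is nearest the vehicle's current position. The state layout and the nominal waypoint list are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def reindexed_action(state, policies, nominal_positions):
    """Execute the policy closest to the current location, regardless of time.

    policies[t]          -- policy learned for time step t (callable: state -> action)
    nominal_positions[t] -- planned (x, y) position on the trajectory at step t
    state                -- assumed to carry the vehicle position in its first
                            two entries (illustrative convention)
    """
    pos = np.asarray(state[:2], dtype=float)
    dists = [np.linalg.norm(pos - np.asarray(p, dtype=float)) for p in nominal_positions]
    t_closest = int(np.argmin(dists))             # spatially nearest time index
    return policies[t_closest](state)
```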

Page 19:

Problems with Dynamic Programming

Problem #2: Uncertainty over future states makes it hard to learn any good policy

Page 20:

Problems with Dynamic Programming

Due to stochasticity, large uncertainty over states in the distant future

Dist. over states at time t = 5

Page 21:

Problems with Dynamic Programming

DP algorithms require learning a policy that performs well over the entire distribution

Dist. over states at time t = 5

Page 22:

Space-Indexed Dynamic Programming

• Basic idea of Space-Indexed Dynamic Programming (SIDP):

Perform DP with respect to space indices (planes tangent to trajectory)

Page 23:

Space-Indexed Dynamical Systems and Dynamic Programming

Page 24:

Difficulty with SIDP

• No guarantee that taking a single action will move the vehicle to the next plane along the trajectory

• Introduce the notion of a space-indexed dynamical system

Page 25:

Time-Indexed Dynamical System

• Creating time-indexed dynamical systems:

ṡ = f(s, u)

Page 26:

Time-Indexed Dynamical System

• Creating time-indexed dynamical systems:

ṡ = f(s, u)

s: current state

Page 27:

Time-Indexed Dynamical System

• Creating time-indexed dynamical systems:

ṡ = f(s, u)

u: control action, s: current state

Page 28:

Time-Indexed Dynamical System

• Creating time-indexed dynamical systems:

ṡ = f(s, u)

u: control action, s: current state, ṡ: time derivative of state

Page 29:

Time-Indexed Dynamical System

• Creating time-indexed dynamical systems:

ṡ = f(s, u)

Euler integration:

s_{t+Δt} = s_t + f(s_t, u_t) · Δt
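For concreteness, a minimal sketch of this Euler update for an assumed toy car model (a kinematic bicycle-style f chosen purely for illustration, not the model from the paper):

```python
import numpy as np

def f(s, u):
    """Assumed continuous-time dynamics: s = (x, y, heading), u = steering angle."""
    x, y, theta = s
    v, wheelbase = 1.0, 0.25                      # constant speed and wheelbase (illustrative)
    return np.array([v * np.cos(theta),
                     v * np.sin(theta),
                     v * np.tan(u) / wheelbase])

def euler_step(s, u, dt):
    """Time-indexed update: s_{t+dt} = s_t + f(s_t, u_t) * dt."""
    return s + f(s, u) * dt

# One step from the origin with a small steering angle
s_next = euler_step(np.array([0.0, 0.0, 0.0]), u=0.05, dt=0.1)
```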

Page 30:

Space-Indexed Dynamical Systems

• Creating space-indexed dynamical systems:

• Simulate forward until the vehicle hits the next tangent plane

space index d

space index d+1

Page 31:

Space-Indexed Dynamical Systems

• Creating space-indexed dynamical systems:

space index d → space index d+1

ṡ = f(s, u)

s_{d+1} = s_d + f(s_d, u_d) · Δt(s_d, u_d)

Page 32:

Space-Indexed Dynamical Systems

• Creating space-indexed dynamical systems:

space index d → space index d+1

ṡ = f(s, u)

s_{d+1} = s_d + f(s_d, u_d) · Δt(s_d, u_d)

Δt(s, u) = [ (ṡ*_{d+1})ᵀ (s*_{d+1} − s) ] / [ (ṡ*_{d+1})ᵀ ṡ ]

(Δt is the time for the flow from s to reach the plane through the desired trajectory point s*_{d+1} with normal ṡ*_{d+1}; a positive solution exists as long as the controller makes some forward progress)
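Under the definitions above, one space-indexed transition can be sketched as follows. This is a minimal illustration assuming the Δt formula just given; the dt_max clipping is an added safeguard that is not part of the original formulation.

```python
import numpy as np

def space_indexed_step(s, u, f, s_star_next, sdot_star_next, dt_max=1.0):
    """Hedged sketch of s_{d+1} = s_d + f(s_d, u_d) * dt(s_d, u_d).

    s_star_next    -- desired state on the trajectory at space index d+1
    sdot_star_next -- desired velocity there (normal of the index-(d+1) plane)
    Returns the state on the next plane, or None if no positive dt exists
    (i.e. the controller makes no forward progress).
    """
    sdot = f(s, u)
    forward_progress = float(np.dot(sdot_star_next, sdot))
    if forward_progress <= 0.0:
        return None                               # not moving toward the next plane
    dt = float(np.dot(sdot_star_next, s_star_next - s)) / forward_progress
    if dt <= 0.0:
        return None                               # already past the plane
    return s + sdot * min(dt, dt_max)             # single Euler step of length dt
```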

Page 33:

Space-Indexed Dynamical Systems

• Result is a dynamical system indexed by spatial-index variable d rather than time

• Space-indexed dynamic programming runs DP directly on this system

s_{d+1} = s_d + f(s_d, u_d) · Δt(s_d, u_d)

Page 34:

Space-Indexed Dynamic Programming

Divide trajectory into discrete space planes

d=1

Page 35:

Space-Indexed Dynamic Programming

Divide trajectory into discrete space planes

d=1 d=2

Page 36:

Space-Indexed Dynamic Programming

Divide trajectory into discrete space planes

d=1 d=2 d=3 d=4 d=5

Page 37:

Space-Indexed Dynamic Programming

d=1 d=2 d=3 d=4 d=5

Proceeding backwards, learn policies for d = D, D-1, …, 2, 1

Page 38:

Space-Indexed Dynamic Programming

d=1 d=2 d=3 d=4 d=5

π5

Proceeding backwards, learn policies for d = D, D-1, …, 2, 1

Page 39:

Space-Indexed Dynamic Programming

d=1 d=2 d=3 d=4 d=5

π5 π4

Proceeding backwards, learn policies for d = D, D-1, …, 2, 1

Page 40:

Space-Indexed Dynamic Programming

d=1 d=2 d=3 d=4 d=5

π5 π4 π3 π2 π1

Proceeding backwards, learn policies for d = D, D-1, …, 2, 1
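Structurally this is the same backward pass as in the time-indexed case, just indexed by planes d rather than time steps. A hedged sketch follows, with step(s, u, d) standing in for a space-indexed transition such as the space_indexed_step sketch above, and sample_states, cost, and fit_policy again illustrative placeholders.

```python
def backward_space_indexed_dp(sample_states, actions, step, cost, fit_policy, D):
    """Hedged sketch: learn a local policy for each space index,
    proceeding backwards (d = D, D-1, ..., 1).

    step(s, u, d) -> state on the plane at index d+1 reached from state s on
                     plane d under action u, or None if no forward progress
                     (assumed to be defined up to the plane past index D)
    """
    policies = [None] * (D + 1)                   # policies[d] is executed at plane d

    for d in range(D, 0, -1):                     # proceed backwards over planes
        examples = []
        for s in sample_states(d):
            best_u, best_c = None, float("inf")
            for u in actions:
                c, s_next, d2 = cost(s, u), step(s, u, d), d + 1
                # Roll forward through the policies already learned for d+1..D
                while s_next is not None and d2 <= D:
                    u2 = policies[d2](s_next)
                    c, s_next, d2 = c + cost(s_next, u2), step(s_next, u2, d2), d2 + 1
                if s_next is None:
                    c = float("inf")              # vehicle never reached the next plane
                if c < best_c:
                    best_u, best_c = u, c
            if best_c < float("inf"):
                examples.append((s, best_u))
        policies[d] = fit_policy(examples)        # local policy tied to plane d
    return policies
```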

Page 41:

Problems with Dynamic Programming

Problem #1: Policies from traditional dynamic programming algorithms are time-indexed

Page 42:

Space-Indexed Dynamic Programming

Time-indexed DP: can execute a policy learned for a different location

Space-indexed DP: always executes the policy based on the current spatial index

π5 π4

Page 43:

Problems with Dynamic Programming

Problem #2: Uncertainty over future states makes it hard to learn any good policy

Page 44:

Space-Indexed Dynamic Programming

Time-indexed DP: wide distribution over future states

Space-indexed DP: much tighter distribution over future states

Dist. over states at time t = 5 | Dist. over states at index d = 5

Page 45:

Space-Indexed Dynamic Programming

Time-indexed DP: wide distribution over future states

Space-indexed DP: much tighter distribution over future states

Dist. over states at time t = 5 | Dist. over states at index d = 5

Page 46:

Experiments

Page 47:

Experimental Domain

• Task: following a race-track trajectory in an RC car with randomly placed obstacles

Page 48:

Experimental Setup

• Implemented a space-indexed version of the PSDP algorithm
  – Policy chooses steering angle using an SVM classifier (constant velocity); a rough sketch of such a per-index classifier follows below
  – Used a simple textbook model simulator of the car dynamics to learn the policy

• Evaluated PSDP time-indexed, time-indexed with re-indexing, and space-indexed
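As a rough illustration of the policy class described above (not the authors' code), the per-index learner could look like the sketch below, with scikit-learn's SVC standing in for the SVM; the steering-angle discretization and feature layout are arbitrary choices for the example.

```python
import numpy as np
from sklearn.svm import SVC

STEERING_ANGLES = np.linspace(-0.5, 0.5, 7)       # radians; assumed discretization

def fit_steering_policy(examples):
    """Fit one space-index policy from (state, best_steering_angle) pairs
    produced by the DP pass; returns a callable state -> steering angle."""
    X = np.array([np.asarray(s, dtype=float) for s, _ in examples])
    y = np.array([int(np.argmin(np.abs(STEERING_ANGLES - u))) for _, u in examples])
    clf = SVC(kernel="rbf").fit(X, y)             # multi-class SVM over steering bins
    return lambda state: STEERING_ANGLES[
        int(clf.predict(np.asarray(state, dtype=float).reshape(1, -1))[0])]
```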

Page 49:

Time-Indexed PSDP

Page 50:

Time-Indexed PSDP w/ Re-indexing

Page 51:

Space-Indexed PSDP

Page 52:

Empirical Evaluation

Time-indexed PSDP: Infinite cost (no trajectory succeeds)
Time-indexed PSDP with Re-indexing: Cost 59.74
Space-indexed PSDP: Cost 49.32

Page 53:

Additional Experiments

• In the paper: additional experiments on the Stanford Grand Challenge Car using space-indexed DDP, and on a simulated helicopter domain using space-indexed PSDP

Page 54:

Related Work

• Reinforcement learning / dynamic programming: Bagnell et al., 2004; Jacobson and Mayne, 1970; Lagoudakis and Parr, 2003; Langford and Zadrozny, 2005

• Differential Dynamic Programming: Atkeson, 1994; Tassa et al., 2008

• Gain Scheduling, Model Predictive Control: Leith and Leithead, 2000; Garcia et al., 1989

Page 55:

Summary

• Trajectory following uses non-stationary policies, but traditional DP / RL algorithms suffer because they are time-indexed

• In this paper, we introduce the notions of a space-indexed dynamical system and space-indexed dynamic programming

• Demonstrated usefulness of these methods on real-world control tasks.

Page 56:

Thank you!

Videos available online at http://cs.stanford.edu/~kolter/icml08videos