Learning First Order Markov Models for Control (Pieter Abbeel and Andrew Y. Ng, Poster 48 Tuesday)
Consider modeling an autonomous RC-car’s dynamics from a sequence of states and actions collected at 100Hz.
We have training data: (s1, a1, s2, a2, …).
We’d like to build a model of the MDP’s transition probabilities $P(s_{t+1} \mid s_t, a_t)$.
Learning First Order Markov Models for Control (Pieter Abbeel and Andrew Y. Ng, Poster 48 Tuesday)
• If we use maximum likelihood (ML) to fit the parameters of the MDP, then we are constrained to fit only the 1-step transitions:
$$\max \; \textstyle\sum_t \log p(s_{t+1} \mid s_t, a_t)$$
• But in RL, our goal is to maximize the long-term rewards, so we aren’t really interested in the 1/100th-second dynamics.
• The dynamics on longer time-scales are often only poorly approximated (assuming the system isn’t really first-order).
• This work: algorithms for building models that better capture dynamics on longer time-scales.
• This work: experiments on autonomous RC car driving.
Learning First Order Markov Models for Control
Pieter Abbeel and Andrew Y. Ng
Stanford University
Autonomous RC Car
Motivation
• Consider modeling an RC-car’s dynamics from a sequence of states and actions collected at 100Hz.
• Maximum likelihood fitting of a first order Markov model constrains the model to fit only the 1-step transitions. However, for control applications we care not only about the dynamics on the time-scale of 1/100 of a second, but also about longer time-scales.
Motivation
• If we use maximum likelihood (ML) to fit the parameters of a first-order Markov model, then we are constrained to fit only the 1-step transitions.
• The dynamics on longer time-scales are often only poorly approximated [unless the system dynamics are really first-order].
• However, for control we are interested in maximizing the long-term expected rewards.
Random Walk Example
• Random walk: $S_T = \sum_{i=1}^{T} \epsilon_i$.
• Consider two cases (unit-variance increments):
  - Increments $\epsilon_i$ perfectly correlated: $\text{Var}(S_T) = T^2$.
  - Increments $\epsilon_i$ independent: $\text{Var}(S_T) = T$.
• Regardless of the true model, ML will return the same first order model, with the same estimate of the one-step noise variance.
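As a sanity check of the two variance claims, here is a minimal simulation sketch (ours, not from the slides), assuming unit-variance Gaussian increments:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_runs = 50, 200_000

# Perfectly correlated increments: one draw shared by all T steps.
eps = rng.standard_normal(n_runs)
S_corr = T * eps  # S_T is T copies of the same increment

# Independent increments: T separate draws per run.
S_indep = rng.standard_normal((n_runs, T)).sum(axis=1)

print(f"correlated:  Var(S_T) ~ {S_corr.var():.1f}  (theory: T^2 = {T**2})")
print(f"independent: Var(S_T) ~ {S_indep.var():.1f}  (theory: T = {T})")
```

In both cases the one-step statistics are identical, which is exactly why one-step maximum likelihood cannot distinguish the two models.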
Examples of physical systems
• Influence of wind disturbances on a helicopter: very small over one time step, but strong correlations lead to a substantial effect over time.
• A first order ML model may overestimate our ability to control the helicopter and the car [thinking the variance is $O(T)$ rather than $O(T^2)$]. This leads to the danger of, e.g., flying too close to a building, or driving on too narrow a road.
• Systematic model errors can show up as correlated noise, e.g., oversteering or understeering of the car.
Problem statement
• The learning problem. Given: state/action sequence data from a system. Goal: model the system for purposes of control (such as to use with an RL algorithm).
• Even when the dynamics are not governed by an MDP, we would often still like to model them as such (rather than as a POMDP), since MDPs are much easier to solve.
• How do we learn an accurate first order Markov model from data for control?
[Our ideas are also applicable to higher order, and/or more structured models such as dynamic Bayesian networks and mixed memory Markov models.]
Preliminaries and Notation
• Finite-state decision process (DP): $S$: set of states; $A$: set of actions; $P$: set of state transition probabilities [not Markov!]; $\gamma$: discount factor; $D$: initial state distribution; $R$: reward function, with $\forall s,\ R(s) \le R_{\max}$.
• We will fit a model $\hat{M}$, with estimates $\hat{P}(s' \mid s, a)$ of the transition probabilities.
• Value of state $s_0$ in $\hat{M}$ under policy $\pi$: $V^{\pi}_{\hat{M}}(s_0) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\big|\, \pi, \hat{M}\big]$.
Parameter estimation when no actions
• Consider making the model’s multi-step predictions $\hat{P}(s_{t+k} \mid s_t)$ close to the true distributions $P(s_{t+k} \mid s_t)$, measured by the variational distance $d_{\text{var}}$.
• $d_{\text{var}}$ is hard to optimize from samples, but it can be upper-bounded by a function of the KL-divergence.
• Minimizing the KL-divergence is, in turn, identical to minimizing the log-loss.
• Chain of reductions: $d_{\text{var}} \rightarrow \text{KL} \rightarrow$ log-likelihood.
[The last step reflects that we are equally interested in every state as a possible starting state $s_0$.]
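For reference, the two standard facts behind this chain (not reproduced from the slide, but standard results):

```latex
\begin{align*}
d_{\mathrm{var}}(p, q) &= \sum_x \lvert p(x) - q(x) \rvert
  \;\le\; \sqrt{2\,\mathrm{KL}(p \,\|\, q)} && \text{(Pinsker's inequality)} \\
\mathrm{KL}(p \,\|\, q) &= \sum_x p(x)\log p(x) \;-\; \sum_x p(x)\log q(x)
\end{align*}
```

The first term of the KL-divergence does not depend on the model $q$, so minimizing the KL over $q$ is the same as maximizing the expected log-likelihood $\sum_x p(x)\log q(x)$, i.e., minimizing the log-loss.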
The resulting lagged objective
• Given a training sequence $s_{0:T}$, we propose to use the lagged objective
$$\max_{\hat{P}} \; \sum_{t} \sum_{k \ge 1} \log \hat{P}(s_{t+k} \mid s_t),$$
where $\hat{P}(s_{t+k} \mid s_t)$ is the $k$-step transition probability obtained by composing the first-order model $k$ times.
• Compare this to the maximum likelihood objective
$$\max_{\hat{P}} \; \sum_{t} \log \hat{P}(s_{t+1} \mid s_t).$$
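To make the comparison concrete, here is a small sketch (our illustration, not the authors’ code) that evaluates both objectives for a candidate row-stochastic matrix `P_hat` on a discrete state sequence; the $k$-step model probabilities come from powers of the one-step matrix, and we cap the lag at a horizon `H`:

```python
import numpy as np

def ml_objective(P_hat, seq):
    """Classical ML: sum of log 1-step transition probabilities."""
    return sum(np.log(P_hat[seq[t], seq[t + 1]]) for t in range(len(seq) - 1))

def lagged_objective(P_hat, seq, H):
    """Lagged objective: sum of log k-step model probabilities, k = 1..H.
    The k-step transition matrix is the k-th power of the 1-step matrix."""
    powers, P_k = [], np.eye(P_hat.shape[0])
    for _ in range(H):
        P_k = P_k @ P_hat
        powers.append(P_k)
    total = 0.0
    for t in range(len(seq) - 1):
        for k in range(1, min(H, len(seq) - 1 - t) + 1):
            total += np.log(powers[k - 1][seq[t], seq[t + k]])
    return total

rng = np.random.default_rng(0)
seq = rng.integers(0, 3, size=200)           # toy state sequence
P_hat = rng.dirichlet(np.ones(3), size=3)    # toy candidate model
print(ml_objective(P_hat, seq), lagged_objective(P_hat, seq, H=5))
```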
Lagged objective vs. ML
• Consider a length-four training sequence $s_0, s_1, s_2, s_3$, which could have various dependencies.
• ML takes into account only the 1-step transitions: $s_0 \to s_1$, $s_1 \to s_2$, $s_2 \to s_3$.
• Our lagged objective also takes into account the longer-range pairs $(s_0, s_2)$, $(s_1, s_3)$, and $(s_0, s_3)$, treating the intermediate states as unobserved.
[In the figure, yellow nodes are observed, white nodes are unobserved.]
EM-algorithm to optimize lagged objective
• E-step: compute the expected transition counts of the unobserved intermediate states, i.e., $\forall t, k, l, i, j$ compute $P(S_{t+l} = i, S_{t+l+1} = j \mid S_t, S_{t+k})$ under the current model, and accumulate these in stats$(i, j)$.
• M-step: update $\hat{P}$ such that $\hat{P}(j \mid i) \propto \text{stats}(i, j)$.
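Below is a compact sketch of one way to implement this EM loop in the action-free case (our reading of the slide; dense matrices, no message sharing, only a small smoothing constant). The posterior over a hidden transition at offset $l$ between an observed pair $(s_t, s_{t+k})$ factorizes through powers of the current model:

```python
import numpy as np

def em_lagged(seq, n_states, H, n_iters=20, seed=0):
    """EM for the lagged objective sum_t sum_{k<=H} log P_hat(s_{t+k} | s_t).
    E-step: expected transition counts of the hidden chain between each
    observed pair (s_t, s_{t+k}); M-step: renormalize the counts."""
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(n_states), size=n_states)   # random init
    T = len(seq) - 1
    for _ in range(n_iters):
        powers = [np.eye(n_states)]                        # P^0, P^1, ..., P^H
        for _ in range(H):
            powers.append(powers[-1] @ P)
        stats = np.zeros((n_states, n_states))
        for t in range(T):
            for k in range(1, min(H, T - t) + 1):
                a, b = seq[t], seq[t + k]
                denom = powers[k][a, b]
                if denom <= 0.0:
                    continue
                for l in range(k):
                    # post[i, j] = P(S_{t+l}=i, S_{t+l+1}=j | S_t=a, S_{t+k}=b)
                    post = (powers[l][a, :, None] * P
                            * powers[k - l - 1][:, b][None, :]) / denom
                    stats += post
        P = (stats + 1e-12) / (stats + 1e-12).sum(axis=1, keepdims=True)
    return P
```

For `H = 1` the posterior puts all its mass on the observed pair, and the update reduces to ordinary ML count-and-normalize.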
Computational Savings for E-step
• Inference for the E-step can be done using standard forward and backward message passing. For every pair $(t, t+k)$, the forward messages at position $t+i$ depend on $t$ only, not on $k$; so the computation of the different terms in the inner summation can share messages. Similarly for backward messages. This reduces the number of message computations by a factor $T$.
• Often we are only interested in some maximum horizon $H$, i.e., in the inner summation of the objective we only consider $k = 1, \ldots, H$. This gives a reduction from $O(T^3)$ to $O(T H^2)$.
• More substantial savings: $(S_t = i, S_{t+k} = j)$ and $(S_{t'} = i, S_{t'+k} = j)$ contribute the same amount to stats$(\cdot, \cdot)$, so the stats$(\cdot, \cdot)$ contribution for all such pairs need only be computed once. This gives a further reduction to $O(|S|^2 H^2)$.
Incorporating actions
• If actions are incorporated, our objective becomes
$$\max_{\hat{P}} \; \sum_{t} \sum_{k \ge 1} \log \hat{P}(s_{t+k} \mid s_t, a_{t:t+k-1}).$$
• The EM-algorithm is trivially extended by conditioning on the actions during the E-step.
• As before, forward messages need to be computed only once for every $t$, and backward messages once for every $t+k$.
• The number of possibilities for $a_{t:t+k-1}$ is $O(|A|^k)$. By using only a few deterministic exploration policies, we can still obtain the same computational savings as before.
Experiment 1: shortest vs. safest path
• Actions are 4 compass directions.
• Move in intended direction with probability 0.7, and a random direction with probability 0.3.
• The directions of the “random transitions” are dependent, and correlated over time. A parameter q controls the correlation between the directions of the random transitions on different time steps (uncorrelated if q=0, perfectly correlated if q=1).
• We will fit a first order Markov model to these dynamics (with each grid position being a state).
[Details: Noise process governed by a Markov process (not directly observable by the agent) with each of the 4 directions as states, with Prob(staying in same state) = q.]
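The slides leave the simulator implicit; here is a minimal sketch under our reading of the details above (with probability $q$ the hidden noise direction is kept from the previous step, otherwise it is resampled uniformly, matching “uncorrelated if q=0, perfectly correlated if q=1”):

```python
import numpy as np

def simulate_gridworld(policy, q, T, size, seed=0):
    """Gridworld with temporally correlated 'random' transitions: with
    prob. 0.7 the agent moves in the intended direction, with prob. 0.3
    in the current hidden noise direction."""
    rng = np.random.default_rng(seed)
    dirs = np.array([(0, 1), (0, -1), (1, 0), (-1, 0)])   # N, S, E, W
    pos = np.array([0, 0])
    noise_dir = rng.integers(4)
    traj = [tuple(pos)]
    for t in range(T):
        if rng.random() >= q:                 # resample hidden noise direction
            noise_dir = rng.integers(4)
        a = policy(tuple(pos), t)             # intended direction index 0..3
        step = dirs[a] if rng.random() < 0.7 else dirs[noise_dir]
        pos = np.clip(pos + step, 0, size - 1)
        traj.append(tuple(pos))
    return traj

# Exploration as described below: independent random actions at each step.
act_rng = np.random.default_rng(1)
traj = simulate_gridworld(lambda s, t: act_rng.integers(4), q=0.9, T=1000, size=10)
```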
Experiment 1: shortest vs. safest path
[Details: Learning was done using a 200,000 length state-action sequence. Reported results are averages over 5 independent trials. The exploration policy used independent random actions at each time step.]
If the noise is strongly correlated across time (large q), our model estimates the dynamics to have a higher “effective noise level.” As a consequence the more cautious policy (path B) is used.
[Plot of the results as a function of the correlation parameter q.]
Experiment 2: Queue
• Customers arrive over time to be served. At every time step, the arrival probability equals $p$.
• Service rate = probability that the customer first in queue gets serviced successfully in the current time step.
• Actions: 3 service rates, with faster service rates being more expensive:
  - $q_0 = 0$, reward $= 0$
  - $q_1 = p$, reward $= -1$
  - $q_2 = 0.75$, reward $= -10$
• Queue buffer length = 20; buffer overflow results in reward $-1000$.
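A minimal simulator sketch for this queue (our assumptions: the overflow penalty is applied and the queue is clipped at the buffer length; the hidden arrival modes from the next slide are included, with the slow-to-fast switching probability set to a quarter of the fast-to-slow one so that the 0.8/0.2 steady state holds):

```python
import numpy as np

def simulate_queue(policy, p_switch, T=1000, seed=0):
    """Queue with a hidden 2-mode arrival process (slow/fast arrivals)."""
    rng = np.random.default_rng(seed)
    arrival_prob = {"slow": 0.01, "fast": 0.99}
    p = 0.8 * 0.01 + 0.2 * 0.99          # overall arrival probability
    service_rates = [0.0, p, 0.75]       # q0, q1, q2
    costs = [0, -1, -10]
    mode, queue, total_reward = "slow", 0, 0.0
    for t in range(T):
        # Hidden mode switching (rates chosen for the 0.8/0.2 steady state).
        if mode == "slow" and rng.random() < 0.25 * p_switch:
            mode = "fast"
        elif mode == "fast" and rng.random() < p_switch:
            mode = "slow"
        a = policy(queue, t)             # pick one of the 3 service rates
        total_reward += costs[a]
        if rng.random() < arrival_prob[mode]:
            queue += 1
        if queue > 0 and rng.random() < service_rates[a]:
            queue -= 1
        if queue > 20:                   # buffer overflow
            total_reward -= 1000
            queue = 20
    return total_reward

# Example: exploration policy that repeats the same service rate for 25 steps.
r = simulate_queue(lambda queue, t: (t // 25) % 3, p_switch=0.05)
```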
Experiment 2: Queue
• The underlying (unobserved!) arrival process has 2 different modes (fast arrivals and slow arrivals):
  - $P(\text{arrival} \mid \text{slow mode}) = 0.01$
  - $P(\text{arrival} \mid \text{fast mode}) = 0.99$
  - Steady state: $P(\text{slow mode}) = 0.8$, $P(\text{fast mode}) = 0.2$
• An additional parameter determines how rapidly the system changes between the fast and slow modes.
[Figure: sample arrival sequences under fast switching between modes and under slow switching between modes.]
Experiment 2: Queue
• We estimate/learn a first order Markov model with: State = size of the queue; Actions = 3 service rates; Exploration policy = repeatedly use the same service rate for 25 time-steps. We used 8000 such trials.
• Results: 15% better performance at high correlation levels; same performance at low correlation levels.
Experiment 3: RC-car
Consider the situation where the RC-car can choose between 2 paths:
• A curvy path with high reward if successful in reaching the goal.
• An easier path with lower reward if successful in reaching the goal.
We build a dynamics model of the car, and find a policy/controller in simulation for following each of the paths. The decision about which path to follow is then made based upon this simulation.
RC-car model
• $\theta$: angular direction the RC-car is headed
• $\dot{\theta}$: angular velocity
• $V$: velocity of the RC-car (kept constant)
• $u_t$: steering input to the car ($u_t \in [-1, 1]$)
• $C_1, C_2, C_3$: parameters of the model, estimated using linear regression
• $w_t$: noise term, zero-mean Gaussian with variance $\sigma^2$
Using the lagged objective, we re-estimate the variance $\sigma^2$, and compare its performance to the first-order estimate of $\sigma^2$.
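The slides do not spell out the re-estimation procedure; the following is an illustrative sketch of the idea for a scalar noise term. It treats the accumulated one-step residuals $E_{t,k} = e_t + \dots + e_{t+k-1}$ as $\mathcal{N}(0, k\sigma^2)$ under the first order model and maximizes the lagged Gaussian log-likelihood over $\sigma^2$, which has the closed-form maximizer shown below. Correlated residuals then inflate the lag-learned variance, which is the effect exploited here:

```python
import numpy as np

def lagged_sigma2(residuals, H):
    """Closed-form maximizer of sum_{t,k} log N(E_{t,k}; 0, k*sigma^2):
    the mean of E_{t,k}^2 / k over all start times t and lags k = 1..H."""
    e = np.asarray(residuals, dtype=float)
    cs = np.concatenate([[0.0], np.cumsum(e)])    # prefix sums of residuals
    vals = [(cs[t + k] - cs[t]) ** 2 / k
            for k in range(1, H + 1)
            for t in range(len(e) - k + 1)]
    return float(np.mean(vals))

rng = np.random.default_rng(0)
e = np.zeros(5000)                 # AR(1) residuals: positively correlated
for t in range(1, len(e)):
    e[t] = 0.9 * e[t - 1] + rng.standard_normal()
print("first-order sigma^2:", e.var())              # ~ 1/(1-0.9^2) ~ 5.3
print("lag-learned sigma^2:", lagged_sigma2(e, 50)) # substantially larger
```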
Controller
• We use the following controller:
  desired steering angle $= p_1 (y - y_{\text{des}}) + p_2 (\theta - \theta_{\text{des}})$;
  $u = f(\text{desired steering angle})$.
• We optimize over the parameters $p_1, p_2$ to follow the straight line $y = 0$, for which we set $y_{\text{des}} = 0$, $\theta_{\text{des}} = 0$.
• For the two specific trajectories, $y_{\text{des}}(x)$, $\theta_{\text{des}}(x)$ are optimized as a function of the current $x$ position.
• For localization, we use an overhead camera.
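A small sketch of this controller (the saturating map `f`, the gain values, and `max_angle` are our assumptions; the slide does not specify `f`):

```python
import numpy as np

def steering_input(y, theta, y_des, theta_des, p1=-2.0, p2=-1.0, max_angle=0.5):
    """desired steering angle = p1*(y - y_des) + p2*(theta - theta_des);
    u = f(desired steering angle), here a simple scaling clipped to [-1, 1]."""
    desired_angle = p1 * (y - y_des) + p2 * (theta - theta_des)
    return float(np.clip(desired_angle / max_angle, -1.0, 1.0))

# Following the straight line y = 0: y_des = 0, theta_des = 0.
u = steering_input(y=0.1, theta=0.05, y_des=0.0, theta_des=0.0)
```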
Simulated performance on curvy trajectory
Plot shows 100 sample runs in simulation under the ML-model.
The ML-model predicts the RC-car can follow the curvy road >95% of the time.
Plot shows 10 sample runs in simulation under the lag-learned model.
The lag-learned model predicts the RC-car can follow the curvy road < 10% of the time.
Green lines: simulated trajectories, Black lines: road boundaries.
Simulated performance on easier trajectory
Plot shows 100 sample runs in simulation under the ML-model.
The ML-model predicts the RC-car can follow the easier road >99% of the time.
Plot shows 100 sample runs in simulation under the lag-learned model.
The lag-learned model predicts the RC-car can follow the easier road >70% of the time.
Green lines: simulated trajectories, Black lines: road boundaries.
ML would therefore choose the curvy road whenever the reward along the curvy road is sufficiently high.
Actual performance on easier trajectory
[Movies available.]
The real RC-car succeeded on the easier road 20/20 times.
The real RC-car failed on the curvy road 19/20 times.
RC-car movie
Conclusions
• Maximum likelihood with a first order Markov model only tries to model the 1-step transition dynamics.
• For many control applications, we desire an accurate model of the dynamics on longer time-scales.
• We showed that, by using an objective that takes into account the longer time scales, in many cases a better dynamical model (and a better controller) is obtained.
Special thanks to Mark Woodward, Dave Dostal, Vikash Gilja and Sebastian Thrun.
Cut out slides follow
Lagged objective vs. ML
• Consider a length four training sequence, which could have various dependencies.
• ML takes into account only the following transitions.
• Our lagged objective also takes into account the longer-range transitions.
[Shaded nodes are observed, white nodes are unobserved.]
Experiment 2: Queue [use this one or the previous one?]
[Figure: transition diagram from queue size at time $t$ to queue size at time $t+1$, with branches labeled arrival / no arrival and successful / unsuccessful servicing, leading to $s(t+1) = s(t)+1$, $s(t+1) = s(t)$, or $s(t+1) = s(t)-1$.]
• Choice of actions between 3 service rates:
  - $q_0 = 0$, reward $= 0$
  - $q_1 = p$, reward $= -1$
  - $q_2 = 0.75$, reward $= -10$
• Buffer size = 20. Buffer overflow results in reward of $-1000$.
• Arrival probability $= p$.
Actual performance on curvy trajectory
[Movies available.]
Green lines: simulated trajectories, Black lines: road boundaries.
Real trajectories as obtained on the floor.
The actual RC-car fell off the curvy trajectory 19/20 times.
Alternative title slides follow
Learning First Order Markov Models for Control
Pieter Abbeel and Andrew Y. Ng
Stanford University
Learning First Order Markov Models for Control
[Title built up one fragment per slide: "Learning First" / "Order Markov" / "Models for" / "Control".]
Pieter Abbeel and Andrew Y. Ng
Stanford University