PEGASUS for Helicopter Control
Presented by: Ken Alton
March 13, 2008
References
• PEGASUS: A policy search method for large MDPs and POMDPs, 2000 - Andrew Ng and Michael Jordan
• Shaping and Policy Search in Reinforcement Learning, PhD Thesis, 2003 – Andrew Ng
• Autonomous helicopter flight via reinforcement learning, 2004 – Andrew Ng, H. Jin Kim, Michael Jordan, and Shankar Sastry
• Inverted autonomous helicopter flight via reinforcement learning, 2004 – Andrew Ng and others
• An application of reinforcement learning to aerobatic helicopter flight, 2007 - Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Y. Ng
PEGASUS
• Policy Evaluation-of-Goodness And Search Using Scenarios
• Markov Decision Process (MDP): a tuple M = (S, D, A, {P_sa(·)}, γ, R) of states, an initial-state distribution, actions, transition probabilities, a discount factor, and a reward function
• Policy: a map π : S → A from states to actions
Policy Evaluation and Optimization
• V^π(s) is the expected discounted sum of rewards for executing policy π starting from state s: V^π(s) = E[Σ_t γ^t R(s_t) | π, s_0 = s]
• Value of a policy: V(π) = E_{s ~ D}[V^π(s)]
• Optimal policy within a class Π for MDP M: π* = arg max_{π ∈ Π} V(π)
• Find a policy π̂ ∈ Π whose value is close to that of π*
Deterministic Simulative Model
• For POMDPs, the policy class may contain
– memory-free policies that depend only on the observables
– limited-memory policies that introduce artificial memory variables into the state
• Generative model: takes input (s, a) and outputs s' sampled according to P_sa(·)
• Deterministic simulative model: a function g : S × A × [0,1]^{d_p} → S such that, for p drawn uniformly from [0,1]^{d_p}, g(s, a, p) is distributed according to P_sa(·)
Deterministic Simulative Model (2)
• Most computer implementations already provide such a model, but they need to expose an interface to the random number generator
• Example: a normally distributed next state
– Let Φ be the cumulative distribution function of the standard normal
– Let d_p = 1 and choose g to be the inverse CDF: g(s, a, p) = Φ^{-1}(p)
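The inverse-CDF construction can be sketched in a few lines; the toy dynamics below (next state = s + a + Gaussian noise) are hypothetical, and `NormalDist.inv_cdf` from the Python standard library plays the role of Φ^{-1}:

```python
import random
from statistics import NormalDist

def g(s, a, p):
    """Deterministic simulative model for a toy 1-D system:
    next state = s + a + standard-normal noise, produced by mapping
    the uniform number p through the inverse normal CDF (d_p = 1)."""
    noise = NormalDist(mu=0.0, sigma=1.0).inv_cdf(p)
    return s + a + noise

# Fixing p fixes the "randomness" of the transition:
p = random.uniform(1e-9, 1 - 1e-9)   # inv_cdf needs 0 < p < 1
s1 = g(0.0, 1.0, p)
s2 = g(0.0, 1.0, p)                  # same p, so the same next state
```

Calling `g` twice with the same `p` is what lets PEGASUS re-evaluate different policies against identical noise.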
Transformation of (PO)MDP
• Given M = (S, D, A, {P_sa}, γ, R) with a deterministic simulative model g
• Construct a transformed MDP M' = (S', D', A, γ, R')
– State s' = (s, p_1, p_2, ...): the original state augmented with an infinite sequence of uniform random numbers
– Deterministic transition: taking action a in (s, p_1, p_2, ...) leads to (g(s, a, p_1), p_2, p_3, ...)
– D' is the distribution of (s, p_1, p_2, ...) where s ~ D and the p_i are i.i.d. Uniform[0,1]
– Reward: R'((s, p_1, p_2, ...)) = R(s)
– Policies look only at the s component, so each policy for M is also a policy for M'
Policy Search Method
• M has been transformed into the deterministic MDP M'
• Value of a policy: V(π) = E_{s'_0 ~ D'}[V^π(s'_0)]
• Draw a sample of m initial states (the scenarios) s'^(1), ..., s'^(m) i.i.d. from D'
• Approximate value: V̂(π) = (1/m) Σ_{i=1}^{m} V^π(s'^(i)), where each term is computed by a single deterministic rollout
• Like generating m Monte Carlo trajectories and taking their average discounted reward, except that the randomization is fixed in advance, so re-evaluating a policy always gives the same number
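A minimal sketch of the fixed-scenario estimator: each scenario is an initial state plus a pre-drawn sequence of random numbers, and a policy's value is the average return of deterministic rollouts over those scenarios. The one-dimensional dynamics, reward, and policy here are illustrative, not from the papers:

```python
import random
from statistics import NormalDist

GAMMA, HORIZON, M = 0.95, 50, 10

def g(s, a, p):
    # toy deterministic simulative model: noise via the inverse normal CDF
    return s + a + 0.1 * NormalDist().inv_cdf(p)

def reward(s):
    return -s * s  # reward peaks when the state is at the origin

def value_estimate(policy, scenarios):
    """PEGASUS estimate: average return of one deterministic rollout
    per scenario, where a scenario = (initial state, random numbers)."""
    total = 0.0
    for s0, ps in scenarios:
        s, ret = s0, 0.0
        for t in range(HORIZON):
            ret += (GAMMA ** t) * reward(s)
            s = g(s, policy(s), ps[t])
        total += ret
    return total / len(scenarios)

# Draw the m scenarios once; they are then held fixed for every policy.
rng = random.Random(0)
scenarios = [(rng.gauss(0.0, 1.0),
              [rng.uniform(1e-9, 1 - 1e-9) for _ in range(HORIZON)])
             for _ in range(M)]

pull_to_origin = lambda s: -0.5 * s
v1 = value_estimate(pull_to_origin, scenarios)
v2 = value_estimate(pull_to_origin, scenarios)  # deterministic: identical
```

Because the scenarios are frozen, two policies can be compared without Monte Carlo noise swamping the difference between them.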
Policy Search Method (2)
• Since the objective function V̂ is deterministic, standard optimization methods can be used
• Gradient ascent methods
– Smoothly parameterize a family of policies π_θ
– If the relevant quantities are differentiable, compute the gradient ∇_θ V̂(π_θ)
• Local maxima can be a problem
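Because the fixed-scenario objective is an ordinary deterministic function of the policy parameters, even a finite-difference gradient estimate is stable (it would be useless against a noisy Monte Carlo objective). The one-parameter linear policy a = −θs and toy dynamics below are hypothetical:

```python
import random

GAMMA, HORIZON, M = 0.9, 30, 8

def g(s, a, p):
    # toy deterministic model; p in [0, 1) supplies the fixed "noise"
    return 0.9 * s + a + 0.2 * (p - 0.5)

def value(theta, scenarios):
    # deterministic PEGASUS objective for the policy a = -theta * s
    total = 0.0
    for s0, ps in scenarios:
        s, ret = s0, 0.0
        for t in range(HORIZON):
            ret += (GAMMA ** t) * (-s * s)
            s = g(s, -theta * s, ps[t])
        total += ret
    return total / len(scenarios)

rng = random.Random(1)
scenarios = [(rng.uniform(-2, 2), [rng.random() for _ in range(HORIZON)])
             for _ in range(M)]

# Finite-difference gradient ascent on the deterministic objective.
theta, lr, eps = 0.0, 0.02, 1e-4
for _ in range(300):
    grad = (value(theta + eps, scenarios)
            - value(theta - eps, scenarios)) / (2 * eps)
    theta += lr * grad

improved = value(theta, scenarios) > value(0.0, scenarios)
```

In this toy problem the ascent drives θ toward cancelling the 0.9 drift in the dynamics, improving on the do-nothing policy θ = 0.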
Example – Grid World
Example: Riding a Bicycle
• Randløv and Alstrøm's bicycle simulator
• Objective: ride to a goal 1 km away
• Actions: torque applied to the handlebars and displacement of the rider's center of gravity
• 15 hand-picked (but not fine-tuned) features of the state
• Policy defined over these features
• Gradient ascent
• Shaping rewards give credit for progress toward the goal
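The shaping idea can be sketched with a potential equal to the negative distance to the goal. The 1 km goal matches the slide; γ and the function names are illustrative. Potential-based shaping F(s, s') = γΦ(s') − Φ(s) is known to leave the optimal policy unchanged (Ng, Harada, and Russell, 1999):

```python
GAMMA = 0.99

def potential(pos, goal=1000.0):
    # Phi(s): negative distance (in metres) to the goal, so riding
    # toward the goal raises the potential (goal distance per the slide)
    return -abs(goal - pos)

def shaped_reward(base_reward, pos, next_pos):
    # Potential-based shaping F(s, s') = gamma * Phi(s') - Phi(s);
    # adding F to the reward preserves the optimal policy
    return base_reward + GAMMA * potential(next_pos) - potential(pos)

# Riding 5 m toward a goal 1 km away earns a positive shaping bonus:
bonus = shaped_reward(0.0, 0.0, 5.0)
```

Without shaping, the learner would see essentially no reward signal until it happened to reach the goal; the shaping term turns every metre of progress into feedback.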
Helicopter Flight
Helicopter
• Inertial Navigation System: accelerometers and gyroscopes
• Differential GPS and a digital compass
• A Kalman filter integrates the sensor information
• State: position, orientation, and their rates of change
• Actions
– a1, a2: longitudinal (front-back) and lateral (left-right) cyclic pitch controls
– a3: main rotor collective pitch control
– a4: tail rotor collective pitch control
Model Identification
• Dynamics modeled in body coordinates
• Locally weighted linear regression with (s_t, a_t) as inputs and the one-step differences s_{t+1} − s_t as outputs
• Some parameters in the regression were hard-coded
• Extra unobserved variables model the latency in the response to controls
• Human pilot flight data was used to fit and to test the model
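One-step model identification by locally weighted linear regression can be sketched as follows; the synthetic "flight data," the toy one-dimensional dynamics, and the bandwidth τ are made up for illustration:

```python
import numpy as np

def lwlr_predict(query, X, Y, tau=0.5):
    """Locally weighted linear regression: fit a weighted least-squares
    model around `query` and predict the one-step difference s_{t+1}-s_t.
    X: (n, d) inputs [s_t, a_t];  Y: (n,) observed one-step differences."""
    w = np.exp(-np.sum((X - query) ** 2, axis=1) / (2 * tau ** 2))
    Xb = np.hstack([X, np.ones((len(X), 1))])   # add an intercept column
    A = Xb.T * w @ Xb                           # weighted normal equations
    b = Xb.T * w @ Y
    theta = np.linalg.solve(A, b)
    return np.append(query, 1.0) @ theta

# Synthetic data: true difference is 0.1*s + 0.5*a plus small noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))           # columns: [s_t, a_t]
Y = 0.1 * X[:, 0] + 0.5 * X[:, 1] + 0.01 * rng.standard_normal(200)

pred = lwlr_predict(np.array([0.2, 0.3]), X, Y)
# true one-step difference at (s, a) = (0.2, 0.3) is 0.17
```

Predicting differences s_{t+1} − s_t rather than next states keeps the regression targets small and centered, which is why the slide's model is set up that way.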
Policy Search
• PEGASUS policy search
• Neural network for the policy class
Policy Search (2)
• Quadratic state cost: a weighted sum of squared deviations of the state variables from their target values
• The weights scale the terms to the same order of magnitude
• Quadratic action cost: a weighted sum of the squared controls
• Overall reward: the negative of the total quadratic cost, R(s, a) = −(state cost + action cost)
• Both gradient ascent and random-walk search worked
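The quadratic reward described above can be sketched directly; the hover target and the weight values below are illustrative, not the paper's:

```python
def reward(state, action, target, state_w, action_w):
    """Quadratic reward: negative weighted squared deviation of the
    state from its target, minus a weighted squared action cost.
    The weights put each term on roughly the same order of magnitude."""
    state_cost = sum(w * (s - t) ** 2
                     for w, s, t in zip(state_w, state, target))
    action_cost = sum(w * a ** 2 for w, a in zip(action_w, action))
    return -(state_cost + action_cost)

# Hypothetical hover example: state = (x, y, z, yaw), four controls
target = (0.0, 0.0, 5.0, 0.0)
state_w = (1.0, 1.0, 1.0, 0.5)     # illustrative weights
action_w = (0.1, 0.1, 0.1, 0.1)

r_hover = reward((0.0, 0.0, 5.0, 0.0), (0, 0, 0, 0), target, state_w, action_w)
r_off   = reward((1.0, 0.0, 5.0, 0.0), (0, 0, 0, 0), target, state_w, action_w)
```

The reward is maximal (zero) exactly at the target with zero control effort, and falls off quadratically in every direction, which gives the optimizer a smooth signal to follow.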
Competition Maneuvers
Competition Maneuvers (2)
• Vary the target over a desired trajectory
• Augment the policy class to capture coupling between the helicopter's subdynamics
• Use the deviation from a projection onto the desired trajectory
• Use a potential function that increases along the trajectory
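The last two ideas can be sketched together: project the current position onto a piecewise-linear desired trajectory, returning both the deviation from the path and the arc length reached, the latter acting as a potential that increases along the trajectory. The waypoints below are illustrative:

```python
import numpy as np

def project_progress(pos, waypoints):
    """Project `pos` onto a piecewise-linear desired trajectory.
    Returns (deviation, progress): the distance to the nearest point
    on the trajectory, and the arc length up to that point."""
    best = (float("inf"), 0.0)
    arc = 0.0
    for a, b in zip(waypoints[:-1], waypoints[1:]):
        seg = b - a
        seg_len = np.linalg.norm(seg)
        # parameter of the closest point on this segment, clamped to it
        t = np.clip(np.dot(pos - a, seg) / seg_len ** 2, 0.0, 1.0)
        closest = a + t * seg
        dev = np.linalg.norm(pos - closest)
        if dev < best[0]:
            best = (dev, arc + t * seg_len)
        arc += seg_len
    return best

# Hypothetical desired trajectory: 10 m forward, then 5 m up
wps = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 5.0]])
dev, progress = project_progress(np.array([4.0, 1.0]), wps)
# point (4, 1): 1 m off the first segment, 4 m of arc-length progress
```

Penalizing `dev` keeps the vehicle on the path, while rewarding increases in `progress` pulls it forward along the maneuver rather than letting it hover at the nearest point.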