dynamic programming i · 2009. 10. 6. · reinforcement learning dynamic programming i subramanian...
TRANSCRIPT
![Page 1: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/1.jpg)
Reinforcement Learning
Dynamic Programming I
Subramanian RamamoorthySchool of Informatics
5 October, 2009
![Page 2: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/2.jpg)
In this Lecture
• Review of RL problem formulation
• Optimality and Bellman’s equation
• Policy evaluation and improvement
05/10/2009 Reinforcement Learning 2
![Page 3: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/3.jpg)
The RL Problem
Main Elements:
• States, s
• Actions, a
• State transition dynamics -often, stochastic & unknown
• Reward (r) process - possibly stochastic
Objective: Policy t(s,a)
– probability distribution over actions given current state
05/10/2009 Reinforcement Learning 3
Assumption:Environment is a finite-state MDP
![Page 4: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/4.jpg)
Some Key Quantities in an MDP
• System dynamics are stochastic – represented by a probability distribution.
• Problem is defined as maximization of expected rewards
– Recall that E(X) = xi p(xi)
for finite-state systems
05/10/2009 Reinforcement Learning 4
![Page 5: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/5.jpg)
5
Three Aspects of the RL Problem
• PlanningThe MDP is known (states, actions, transitions, rewards).
Find an optimal policy, *!
• LearningThe MDP is unknown. You are allowed to interact with it.
Find an optimal policy *!
• Optimal learningWhile interacting with the MDP, minimize the loss due to not using an optimal policy from the beginning.
05/10/2009 Reinforcement Learning
![Page 6: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/6.jpg)
6
Solving MDPs – Many Dimensions
• Which problem? (Planning, learning, optimal learning)• Exact or approximate?• Uses samples?• Incremental?• Uses value functions?
– Yes: Value-function based methods• Planning: DP, Random Discretization Method, FVI, …• Learning: Q-learning, Actor-critic, …
– No: Policy search methods• Planning: Monte-Carlo tree search, Likelihood ratio methods (policy
gradient), Sample-path optimization (Pegasus), …• Representation
– Structured state:• Factored states, logical representation, …
– Structured policy space:• Hierarchical methods
Today, we will focuson DP for the
‘planning’ problem.
05/10/2009 Reinforcement Learning
![Page 7: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/7.jpg)
Value Functions
• Value functions are used to determine how good it is for the agent to be in a given state
– Or, how good is it to perform an action from a given state?
• This is defined w.r.t. a specific policy (s,a)
• State value function:
• Action value function:
05/10/2009 Reinforcement Learning 7
![Page 8: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/8.jpg)
Value Functions
Note that there are multiple sources of (probabilistic) uncertainty:
• In state s, one can select different actions a
• The system can transition to different states s’ from s
• Depending on the above, return (defined in terms of reward) is a random variable - which is what we seek to maximize
05/10/2009 Reinforcement Learning 8
![Page 9: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/9.jpg)
Recursive Form of V – Bellman Equation
05/10/2009 Reinforcement Learning 9
Expand 1-step forward& rewrite expectation
![Page 10: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/10.jpg)
Backup Diagrams
• If you go all the way ‘down’, you can just read off the reward value
• The backup process (i.e., recursive equation above) allows you to compute the corresponding value at current state
– taking transition probabilities into account
05/10/2009 Reinforcement Learning 10
![Page 11: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/11.jpg)
Bellman Equation for Q
05/10/2009 Reinforcement Learning 11
![Page 12: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/12.jpg)
Optimal Value Function
• For finite MDPs,
• Let us denote the optimal policy
• The optimal value functions are:
• From this,
• Will there always be a well defined ?
Theorem [Blackwell, 1962] – For every MDP with finite state/action space, there exists an optimal deterministic stationary plan
05/10/2009 Reinforcement Learning 12
![Page 13: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/13.jpg)
Recursive Form of V*
05/10/2009 Reinforcement Learning 13
![Page 14: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/14.jpg)
Backup for Q*
05/10/2009 Reinforcement Learning 14
![Page 15: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/15.jpg)
What is Dynamic Programming?
Given a known model of the environment as an MDP (transition dynamics and reward probabilities), DP is a collection of algorithms for computing optimal policies.
• Much of what is in the S+B book is founded on DP concepts
• The goal is often to achieve the same effect as DP but with less computation and imperfect information regarding the environment
RL Simulation-based methods for solving optimal control problems
• The notion of value functions came out of the classical DP literature
(as did Bellman’s equation – Hamilton-Jacobi before that)
05/10/2009 Reinforcement Learning 15
![Page 16: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/16.jpg)
Policy Evaluation
How to compute V(s) for an arbitrary policy ? (Prediction problem)
For a given MDP, this yields a system of simultaneous equations
– as many unknowns as states
– Matrix form:
Solve iteratively, with a sequence of value functions,
05/10/2009 Reinforcement Learning 16
V = (I - P )-1 r
![Page 17: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/17.jpg)
Computationally…
We could achieve this in a number of different ways:
• Maintain two arrays, computing iterations over one and copying results to the other
• In-place: Overwrite as new backed-up values become available
• It can be shown that this algorithm will also converge to optimality (somewhat faster, even)
– Backups sweep through the space
– Sweep order has significant influence on convergence rates
05/10/2009 Reinforcement Learning 17
![Page 18: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/18.jpg)
Iterative Policy Evaluation
05/10/2009 Reinforcement Learning 18
![Page 19: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/19.jpg)
Grid-World Example
05/10/2009 Reinforcement Learning 19
![Page 20: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/20.jpg)
Iterative Policy Evaluation in Grid World
05/10/2009 Reinforcement Learning 20
Note: The value function can be searchedgreedily to find long-term optimal actions
![Page 21: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/21.jpg)
Important Property: Convergence
05/10/2009 Reinforcement Learning 21
Contraction mapping.
![Page 22: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/22.jpg)
Contraction Mapping Property
05/10/2009 Reinforcement Learning 22
![Page 23: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/23.jpg)
Policy Improvement
Does it make sense to deviate from (s) at any state (following the policy everywhere else)?
05/10/2009 Reinforcement Learning 23
- Policy Improvement Theorem [Howard/Blackwell]
![Page 24: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/24.jpg)
Key Idea Behind Policy Improvement
05/10/2009 Reinforcement Learning 24
![Page 25: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/25.jpg)
Computing Better Policies
Starting with an arbitrary policy, we’d like to approach truly optimal policies. So, we compute new policies using the following,
Are we restricted to deterministic policies? No.
With stochastic policies,
05/10/2009 Reinforcement Learning 25
![Page 26: Dynamic Programming I · 2009. 10. 6. · Reinforcement Learning Dynamic Programming I Subramanian Ramamoorthy School of Informatics 5 October, 2009. ... Find an optimal policy, *!](https://reader035.vdocuments.us/reader035/viewer/2022062510/6145452434130627ed50deb8/html5/thumbnails/26.jpg)
Summary
• Formulation of MDP and three kinds of problems (planning, learning, optimal learning)
• Value functions
• Optimality and Bellman’s equation
• Policy evaluation via dynamic programming
• Policy improvement
Next time:
• Algorithms for value and policy iteration
• Making the computational techniques efficient
05/10/2009 Reinforcement Learning 26