Reinforcement Learning
Dynamic Programming I
Subramanian Ramamoorthy, School of Informatics
5 October, 2009
In this Lecture
• Review of RL problem formulation
• Optimality and Bellman’s equation
• Policy evaluation and improvement
The RL Problem
Main Elements:
• States, s
• Actions, a
• State transition dynamics – often stochastic and unknown
• Reward (r) process – possibly stochastic
Objective: Policy π(s,a)
– probability distribution over actions given current state
Assumption: Environment is a finite-state MDP
Some Key Quantities in an MDP
• System dynamics are stochastic – represented by a probability distribution.
• Problem is defined as maximization of expected rewards
– Recall that \(E(X) = \sum_i x_i \, p(x_i)\) for finite-state systems
Three Aspects of the RL Problem
• Planning: The MDP is known (states, actions, transitions, rewards). Find an optimal policy, π*!
• Learning: The MDP is unknown, but you are allowed to interact with it. Find an optimal policy, π*!
• Optimal learning: While interacting with the MDP, minimize the loss due to not using an optimal policy from the beginning.
Solving MDPs – Many Dimensions
• Which problem? (Planning, learning, optimal learning)
• Exact or approximate?
• Uses samples?
• Incremental?
• Uses value functions?
  – Yes: Value-function based methods
    • Planning: DP, Random Discretization Method, FVI, …
    • Learning: Q-learning, Actor-critic, …
  – No: Policy search methods
    • Planning: Monte-Carlo tree search, Likelihood ratio methods (policy gradient), Sample-path optimization (Pegasus), …
• Representation
  – Structured state: Factored states, logical representation, …
  – Structured policy space: Hierarchical methods
Today, we will focus on DP for the ‘planning’ problem.
Value Functions
• Value functions are used to determine how good it is for the agent to be in a given state
– Or, how good is it to perform an action from a given state?
• This is defined w.r.t. a specific policy π(s,a)
• State value function:
  \(V^{\pi}(s) = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s \right\}\)
• Action value function:
  \(Q^{\pi}(s,a) = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s, a_t = a \right\}\)
Value Functions
Note that there are multiple sources of (probabilistic) uncertainty:
• In state s, one can select different actions a
• The system can transition to different states s’ from s
• Depending on the above, the return (defined in terms of rewards) is a random variable – its expectation is what we seek to maximize
Recursive Form of V – Bellman Equation
\(V^{\pi}(s) = E_{\pi}\left\{ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \,\middle|\, s_t = s \right\} = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right]\)
Expand one step forward and rewrite the expectation.
Backup Diagrams
• If you go all the way ‘down’, you can just read off the reward value
• The backup process (i.e., recursive equation above) allows you to compute the corresponding value at current state
– taking transition probabilities into account
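For concreteness, here is a minimal Python sketch of this one-step backup, assuming a tabular model with hypothetical arrays pi[s, a], P[a, s, s'] and R[a, s, s'] (these names are ours, not from the slides):

```python
import numpy as np

def backup_v(s, pi, P, R, V, gamma):
    """One-step Bellman backup for V at state s: average over actions a
    (weighted by pi[s, a]) and successors s' (weighted by P[a, s, s'])
    of the immediate reward plus the discounted value of s'."""
    return sum(
        pi[s, a] * (P[a, s] @ (R[a, s] + gamma * V))
        for a in range(pi.shape[1])
    )
```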
Bellman Equation for Q
\(Q^{\pi}(s,a) = \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma \sum_{a'} \pi(s',a') \, Q^{\pi}(s',a') \right]\)
Optimal Value Function
• For finite MDPs, policies can be partially ordered: \(\pi \ge \pi'\) iff \(V^{\pi}(s) \ge V^{\pi'}(s)\) for all \(s\)
• Let us denote the optimal policy \(\pi^{*}\)
• The optimal value functions are:
  \(V^{*}(s) = \max_{\pi} V^{\pi}(s)\) and \(Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)\)
• From this, \(Q^{*}(s,a) = E\left\{ r_{t+1} + \gamma V^{*}(s_{t+1}) \,\middle|\, s_t = s, a_t = a \right\}\)
• Will there always be a well-defined \(\pi^{*}\)?
Theorem [Blackwell, 1962] – For every MDP with finite state and action spaces, there exists an optimal deterministic stationary plan.
Recursive Form of V*
\(V^{*}(s) = \max_{a} \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{*}(s') \right]\)
(the Bellman optimality equation)
Backup for Q*
\(Q^{*}(s,a) = \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s',a') \right]\)
What is Dynamic Programming?
Given a known model of the environment as an MDP (transition dynamics and reward probabilities), DP is a collection of algorithms for computing optimal policies.
• Much of what is in the Sutton & Barto book is founded on DP concepts
• The goal is often to achieve the same effect as DP, but with less computation and without assuming perfect knowledge of the environment
RL ≈ simulation-based methods for solving optimal control problems
• The notion of value functions came out of the classical DP literature
(as did Bellman’s equation – Hamilton-Jacobi before that)
Policy Evaluation
How do we compute \(V^{\pi}(s)\) for an arbitrary policy \(\pi\)? (The prediction problem.)
For a given MDP, the Bellman equation yields a system of simultaneous linear equations
– as many unknowns as there are states
– Matrix form: \(V^{\pi} = (I - \gamma P^{\pi})^{-1} r^{\pi}\)
Solve iteratively, with a sequence of value functions \(V_0, V_1, V_2, \ldots \to V^{\pi}\):
\(V_{k+1}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V_k(s') \right]\)
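For small state spaces the matrix form can be solved directly. A minimal numpy sketch, assuming the policy-induced transition matrix P_pi and expected-reward vector r_pi are given (hypothetical names):

```python
import numpy as np

def evaluate_policy_direct(P_pi, r_pi, gamma):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = r_pi.
    A linear solve is preferred to forming the inverse explicitly,
    for numerical stability; the cost is O(|S|^3) either way."""
    n = len(r_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```

The cubic cost of the direct solve is one reason the iterative scheme matters for large state spaces.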
Computationally…
We could achieve this in a number of different ways:
• Maintain two arrays, computing iterations over one and copying results to the other
• In-place: Overwrite as new backed-up values become available
• It can be shown that the in-place algorithm also converges to \(V^{\pi}\) (often somewhat faster, even)
– Backups sweep through the space
– Sweep order has significant influence on convergence rates
Iterative Policy Evaluation
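The algorithm box from this slide did not survive transcription; the following Python sketch implements the in-place variant described on the previous slide (array names are our assumptions):

```python
import numpy as np

def evaluate_policy_inplace(P_pi, r_pi, gamma, tol=1e-8):
    """In-place iterative policy evaluation.
    P_pi[s, s'] : transition matrix under the fixed policy pi
    r_pi[s]     : expected immediate reward in state s under pi
    Sweeps the state space, overwriting V as new backed-up values
    become available, until the largest change falls below tol."""
    V = np.zeros(len(r_pi))
    while True:
        delta = 0.0
        for s in range(len(r_pi)):          # one sweep through the states
            v_new = r_pi[s] + gamma * (P_pi[s] @ V)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```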
Grid-World Example
Iterative Policy Evaluation in Grid World
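A usage sketch, assuming the example is the standard 4×4 grid world from Sutton & Barto (terminal corner states, reward −1 per step, γ = 1, equiprobable random policy) – an assumption, since the figure itself is not in the transcript:

```python
import numpy as np

n, moves = 16, [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
terminal = {0, 15}
P = np.zeros((n, n))         # state transitions under the random policy
r = np.full(n, -1.0)         # -1 reward per step...
r[list(terminal)] = 0.0      # ...except from the absorbing terminal states
for s in range(n):
    if s in terminal:
        P[s, s] = 1.0
        continue
    row, col = divmod(s, 4)
    for dr, dc in moves:
        nr, nc = row + dr, col + dc
        # moves off the grid leave the state unchanged
        s2 = nr * 4 + nc if 0 <= nr < 4 and 0 <= nc < 4 else s
        P[s, s2] += 0.25

V = evaluate_policy_inplace(P, r, gamma=1.0)   # sketch from the previous slide
print(np.round(V.reshape(4, 4), 1))
```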
Note: The value function can be searched greedily to find long-term optimal actions
Important Property: Convergence
Iterative policy evaluation converges to \(V^{\pi}\) because the Bellman backup operator is a contraction mapping.
Contraction Mapping Property
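The slide's formal statement is not recoverable from the transcript; a standard version, consistent with the notation above, is:

```latex
% Bellman backup operator for a fixed policy pi
\[
  (T^{\pi} V)(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'}
                   \left[ R^{a}_{ss'} + \gamma V(s') \right]
\]
% Contraction in the max norm, for 0 <= gamma < 1
\[
  \lVert T^{\pi} V - T^{\pi} V' \rVert_{\infty}
  \le \gamma \, \lVert V - V' \rVert_{\infty}
\]
```

By the Banach fixed-point theorem, repeated application of \(T^{\pi}\) from any initial \(V_0\) converges to the unique fixed point \(V^{\pi}\) – which is why iterative policy evaluation works.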
Policy Improvement
Does it make sense to deviate from \(\pi(s)\) at some state \(s\), while following the policy everywhere else?
Policy Improvement Theorem [Howard/Blackwell]: if \(Q^{\pi}(s, \pi'(s)) \ge V^{\pi}(s)\) for all states \(s\), then \(V^{\pi'}(s) \ge V^{\pi}(s)\) for all \(s\) – i.e., \(\pi'\) is at least as good as \(\pi\).
Key Idea Behind Policy Improvement
Act greedily with respect to \(Q^{\pi}\): if a one-step lookahead at some state finds an action better than \(\pi(s)\), switching to it (and following \(\pi\) thereafter) cannot decrease the value anywhere.
Computing Better Policies
Starting with an arbitrary policy, we’d like to approach a truly optimal policy. So, we compute new (greedy) policies using the following:
\(\pi'(s) = \arg\max_{a} Q^{\pi}(s,a) = \arg\max_{a} \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right]\)
Are we restricted to deterministic policies? No. With stochastic policies, the same results hold, with \(Q^{\pi}(s, \pi'(s)) = \sum_{a} \pi'(s,a) \, Q^{\pi}(s,a)\).
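A minimal sketch of the greedy improvement step, reusing the hypothetical tabular arrays P[a, s, s'] and R[a, s, s'] from earlier:

```python
import numpy as np

def improve_policy(P, R, V, gamma):
    """Greedy policy improvement: pi'(s) = argmax_a Q(s, a), where
    Q(s, a) = sum over s' of P[a, s, s'] * (R[a, s, s'] + gamma * V[s'])."""
    Q = np.einsum('ast,ast->as', P, R + gamma * V[None, None, :])
    return Q.argmax(axis=0)    # one (deterministic) action per state
```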
Summary
• Formulation of MDP and three kinds of problems (planning, learning, optimal learning)
• Value functions
• Optimality and Bellman’s equation
• Policy evaluation via dynamic programming
• Policy improvement
Next time:
• Algorithms for value and policy iteration
• Making the computational techniques efficient