Reinforcement Learning
Dynamic Programming I
Subramanian Ramamoorthy
School of Informatics
31 January, 2012
Continuing from last time…
MDPs and Formulation of RL Problem
Returns
Suppose the sequence of rewards after step t is r_{t+1}, r_{t+2}, r_{t+3}, … What do we want to maximize?

In general, we want to maximize the expected return, E\{R_t\}, for each step t.

Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.

R_t = r_{t+1} + r_{t+2} + \cdots + r_T,

where T is a final time step at which a terminal state is reached, ending an episode.
Returns for Continuing Tasks
Continuing tasks: interaction does not have natural episodes.
Discounted return:

R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},

where 0 \le \gamma \le 1 is the discount rate.

\gamma \to 0: shortsighted; \gamma \to 1: farsighted
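To make the definition concrete, here is a minimal Python sketch of computing a discounted return from a finite reward sequence; the reward list and discount values are made-up examples:

```python
# Discounted return R_t = sum_k gamma^k * r_{t+k+1}, for a finite
# (hypothetical) sequence of rewards r_{t+1}, r_{t+2}, ...
def discounted_return(rewards, gamma):
    """Sum of gamma^k * rewards[k] over the reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# gamma = 0 is maximally shortsighted: only the immediate reward counts.
print(discounted_return([1.0, 2.0, 3.0], 0.0))   # 1.0
# gamma = 1 simply sums the rewards (valid only for episodic tasks).
print(discounted_return([1.0, 2.0, 3.0], 1.0))   # 6.0
# Intermediate gamma weights later rewards geometrically less.
print(discounted_return([1.0, 2.0, 3.0], 0.5))   # 1 + 1.0 + 0.75 = 2.75
```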
Example – Setting Up Rewards
Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.

As an episodic task, where the episode ends upon failure:
reward = +1 for each step before failure
⇒ return = number of steps before failure

As a continuing task with discounted return:
reward = −1 upon failure; 0 otherwise
⇒ return = −\gamma^k, for k steps before failure

In either case, the return is maximized by avoiding failure for as long as possible.
Another Example
Get to the top of the hill as quickly as possible.

reward = −1 for each step when not at the top of the hill
⇒ return = −(number of steps) before reaching the top of the hill

The return is maximized by minimizing the number of steps taken to reach the top of the hill.
A Unified Notation
• In episodic tasks, we number the time steps of each episode starting from zero.
• We usually do not have to distinguish between episodes, so we write s_t instead of s_{t,j} for the state at step t of episode j.
• Think of each episode as ending in an absorbing state that always produces a reward of zero.
• We can cover all cases by writing

R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},

where \gamma can be 1 only if a zero-reward absorbing state is always reached.
Markov Decision Processes
• If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
• If the state and action sets are finite, it is a finite MDP.
• To define a finite MDP, you need to give:
– state and action sets
– one-step "dynamics" defined by transition probabilities:

P_{ss'}^a = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \} for all s, s' \in S, a \in A(s)

– expected rewards:

R_{ss'}^a = E\{ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \} for all s, s' \in S, a \in A(s)
The RL Problem
Main Elements:
• States, s
• Actions, a
• State transition dynamics (often stochastic & unknown)
• Reward (r) process (possibly stochastic)

Objective: Policy \pi_t(s,a), a probability distribution over actions given the current state

Assumption: the environment defines a finite-state MDP.
Recycling Robot
An Example Finite MDP
• At each step, robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
• Searching is better but runs down the battery; if it runs out of power while searching, it has to be rescued (which is bad).
• Decisions made on basis of current energy level: high, low.
• Reward = number of cans collected
Recycling Robot MDP
S = {high, low}
A(high) = {search, wait}
A(low) = {search, wait, recharge}

R^search = expected no. of cans while searching
R^wait = expected no. of cans while waiting
R^search > R^wait
Enumerated in Tabular Form
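The table itself survives here only as an image, but its shape can be sketched as a Python dict mapping each (state, action) pair to its possible outcomes. The probabilities alpha, beta and the numeric rewards below are assumed placeholders for illustration, not the slide's actual values:

```python
# Hypothetical one-step dynamics for the recycling robot.
# alpha = assumed P(battery stays high | search from high)
# beta  = assumed P(battery stays low  | search from low)
alpha, beta = 0.9, 0.6
R_SEARCH, R_WAIT, R_RESCUE = 2.0, 1.0, -3.0   # assumed expected rewards

# (state, action) -> list of (next_state, probability, expected_reward)
P = {
    ("high", "search"):   [("high", alpha, R_SEARCH), ("low", 1 - alpha, R_SEARCH)],
    ("low",  "search"):   [("low", beta, R_SEARCH), ("high", 1 - beta, R_RESCUE)],
    ("high", "wait"):     [("high", 1.0, R_WAIT)],
    ("low",  "wait"):     [("low", 1.0, R_WAIT)],
    ("low",  "recharge"): [("high", 1.0, 0.0)],
}

# Sanity check: outgoing probabilities sum to 1 for every (state, action).
for sa, outcomes in P.items():
    assert abs(sum(p for _, p, _ in outcomes) - 1.0) < 1e-12
```

Note that recharging from the high state is not an available action, matching A(high) = {search, wait} above.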
Given such an enumeration of transitions and corresponding costs/rewards, what is the best sequence of actions?
We know we want to maximize:

R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

So, what must one do?
The Shortest Path Problem
Finite-State Systems and Shortest Paths
– the state space S_k is a finite set for each k
– action a_k takes you from s_k to f_k(s_k, a_k) at a cost g_k(s_k, a_k)

Length ≈ Cost ≈ sum of the lengths of the arcs

Solve this first:

J_k(i) = \min_j [ a_{ij}^k + J_{k+1}(j) ]
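The backward recursion above can be sketched directly in Python; the layered graph below is a made-up example, with INF marking missing arcs:

```python
# Backward DP for shortest paths on a layered graph, following
# J_k(i) = min_j [ a_{ij}^k + J_{k+1}(j) ].
INF = float("inf")

def shortest_path(arcs, terminal_costs):
    """arcs[k][i][j]: cost of the arc from node i (layer k) to node j
    (layer k+1). Returns the optimal costs-to-go J_0 for layer 0."""
    J = list(terminal_costs)              # J_N(i): cost at the final layer
    for stage in reversed(arcs):          # sweep backwards in k
        J = [min(stage[i][j] + J[j] for j in range(len(J)))
             for i in range(len(stage))]
    return J

# Two stages, two nodes per layer (arbitrary example costs).
arcs = [
    [[1, 5], [2, 1]],     # stage-0 arc costs a_{ij}^0
    [[3, 1], [1, 4]],     # stage-1 arc costs a_{ij}^1
]
print(shortest_path(arcs, [0, 0]))   # [2, 2]
```

Each backward sweep reuses the already-solved tail subproblem, which is exactly the "solve this first" structure in the figure.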
Value Functions
State-value function for policy \pi: the value of a state is the expected return starting from that state; it depends on the agent's policy:

V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \} = E_{\pi}\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s \}

Action-value function for policy \pi: the value of taking an action in a state under policy \pi is the expected return starting from that state, taking that action, and thereafter following \pi:

Q^{\pi}(s,a) = E_{\pi}\{ R_t \mid s_t = s, a_t = a \} = E_{\pi}\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a \}
Recursive Equation for Value
The basic idea:

R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots
    = r_{t+1} + \gamma ( r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots )
    = r_{t+1} + \gamma R_{t+1}

So:

V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \} = E_{\pi}\{ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \}
Optimality in MDPs – Bellman Equation
More on the Bellman Equation
V^{\pi}(s) = \sum_a \pi(s,a) \sum_{s'} P_{ss'}^a [ R_{ss'}^a + \gamma V^{\pi}(s') ]

This is a set of equations (in fact, linear), one for each state. The value function for \pi is its unique solution.
Backup diagrams:
for V for Q
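One standard way to use this system of linear equations is to iterate the backup until the values stop changing (iterative policy evaluation). A minimal sketch, with a made-up two-state MDP:

```python
# Iterative policy evaluation: repeatedly apply the Bellman backup
# V(s) <- sum_a pi(s,a) sum_s' P[s][a][s'] * (R[s][a][s'] + gamma * V[s']).
def policy_eval(states, actions, P, R, pi, gamma, tol=1e-10):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = sum(pi[s][a] * sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                   for s2 in states)
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v                      # in-place (Gauss-Seidel) update
        if delta < tol:
            return V

# Toy MDP (invented for illustration): A -> B with reward 1, B absorbs
# with reward 0; a single deterministic action and policy.
S, A_ = ["A", "B"], ["go"]
P = {"A": {"go": {"A": 0.0, "B": 1.0}}, "B": {"go": {"A": 0.0, "B": 1.0}}}
R = {"A": {"go": {"A": 0.0, "B": 1.0}}, "B": {"go": {"A": 0.0, "B": 0.0}}}
pi = {"A": {"go": 1.0}, "B": {"go": 1.0}}
V = policy_eval(S, A_, P, R, pi, 0.5)
print(round(V["A"], 6), round(V["B"], 6))   # 1.0 0.0
```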
Bellman Equation for Q
Gridworld
• Actions: north, south, east, west; deterministic.
• If an action would take the agent off the grid: no move, but reward = −1.
• Other actions produce reward = 0, except actions that move the agent out of the special states A and B, as shown.

State-value function for the equiprobable random policy; \gamma = 0.9
Golf
• State is ball location
• Reward of −1 for each stroke until the ball is in the hole
• Value of a state?
• Actions: putt (use putter), driver (use driver)
• putt succeeds anywhere on the green
Optimal Value Functions
• For finite MDPs, policies can be partially ordered: \pi \ge \pi' if and only if V^{\pi}(s) \ge V^{\pi'}(s) for all s \in S.
• There are always one or more policies that are better than or equal to all the others. These are the optimal policies. We denote them all \pi^*.
• Optimal policies share the same optimal state-value function: V^*(s) = \max_{\pi} V^{\pi}(s) for all s \in S.
• Optimal policies also share the same optimal action-value function: Q^*(s,a) = \max_{\pi} Q^{\pi}(s,a) for all s \in S and a \in A(s). This is the expected return for taking action a in state s and thereafter following an optimal policy.
Optimal Value Function for Golf
• We can hit the ball farther with the driver than with the putter, but with less accuracy.
• Q*(s, driver) gives the value of using the driver first, then using whichever actions are best thereafter.
Bellman Optimality Equation for V*
The value of a state under an optimal policy must equal the expected return for the best action from that state:

V^*(s) = \max_{a \in A(s)} Q^{\pi^*}(s,a)
       = \max_{a \in A(s)} E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \}
       = \max_{a \in A(s)} \sum_{s'} P_{ss'}^a [ R_{ss'}^a + \gamma V^*(s') ]

The relevant backup diagram:

V^* is the unique solution of this system of nonlinear equations.
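Turning the max-form backup into an update rule gives value iteration, a standard way to solve this nonlinear system numerically. A minimal sketch, over a made-up two-state MDP:

```python
# Value iteration: apply the Bellman optimality backup
# V(s) <- max_a sum_s' P[s][a][s'] * (R[s][a][s'] + gamma * V[s'])
# until the values stop changing.
def value_iteration(states, actions, P, R, gamma, tol=1e-10):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = max(sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                        for s2 in states)
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

# Toy MDP (invented): in A, "safe" stays in A with reward 1, "risky"
# jumps to B with reward 0; B absorbs with reward 2 either way.
P = {"A": {"safe": {"A": 1.0, "B": 0.0}, "risky": {"A": 0.0, "B": 1.0}},
     "B": {"safe": {"A": 0.0, "B": 1.0}, "risky": {"A": 0.0, "B": 1.0}}}
R = {"A": {"safe": {"A": 1.0, "B": 0.0}, "risky": {"A": 0.0, "B": 0.0}},
     "B": {"safe": {"A": 0.0, "B": 2.0}, "risky": {"A": 0.0, "B": 2.0}}}
V = value_iteration(["A", "B"], ["safe", "risky"], P, R, 0.5)
print(round(V["A"], 6), round(V["B"], 6))   # 2.0 4.0
```

Here V(B) = 2 / (1 − γ) = 4, and both actions from A tie at value 2 once V(B) has converged.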
Recursive Form of V*
Bellman Optimality Equation for Q*
Q^*(s,a) = E\{ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a \}
         = \sum_{s'} P_{ss'}^a [ R_{ss'}^a + \gamma \max_{a'} Q^*(s', a') ]
The relevant backup diagram:
Q* is the unique solution of this system of nonlinear equations.
Why Optimal State-Value Functions are Useful
Any policy that is greedy with respect to V* is an optimal policy.
Therefore, given V* , one-step-ahead search produces the long-term optimal actions.
E.g., back to the gridworld (from your S+B book):
What About Optimal Action-Value Functions?
Given Q*, the agent does not even have to do a one-step-ahead search: \pi^*(s) = \arg\max_a Q^*(s,a).
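With Q* in hand, acting optimally is a table lookup rather than a search. A sketch using a hypothetical Q-table (the values are invented, loosely following the recycling-robot states):

```python
# Hypothetical optimal action-values Q*(s, a); these numbers are made up.
Q_star = {
    ("high", "search"): 8.5, ("high", "wait"): 7.0,
    ("low", "search"): 4.0, ("low", "wait"): 5.5, ("low", "recharge"): 6.0,
}

def greedy_action(Q, state, actions):
    """No lookahead needed: just pick the action with the largest Q-value."""
    return max(actions, key=lambda a: Q[(state, a)])

print(greedy_action(Q_star, "high", ["search", "wait"]))             # search
print(greedy_action(Q_star, "low", ["search", "wait", "recharge"]))  # recharge
```

Contrast this with acting greedily from V*, which needs the one-step dynamics P and R to evaluate each action.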
Solving the Bellman Optimality Equation
• Finding an optimal policy by solving the Bellman Optimality Equation requires the following:
– accurate knowledge of environment dynamics;
– enough space and time to do the computation;
– the Markov Property.
• How much space and time do we need?
– polynomial in the number of states (via dynamic programming),
– BUT the number of states is often huge (e.g., backgammon has about 10^20 states).
• We usually have to settle for approximations.
• Many RL methods can be understood as approximately solving the Bellman Optimality Equation.