
Page 1:

Reinforcement Learning

Dynamic Programming I

Subramanian Ramamoorthy
School of Informatics

5 October, 2009

Page 2:

In this Lecture

• Review of RL problem formulation

• Optimality and Bellman’s equation

• Policy evaluation and improvement

Page 3:

The RL Problem

Main Elements:

• States, s

• Actions, a

• State transition dynamics - often stochastic & unknown

• Reward (r) process - possibly stochastic

Objective: Policy π(s,a)

– probability distribution over actions given current state


Assumption: Environment is a finite-state MDP
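To make these elements concrete, here is a minimal Python sketch of how a finite-state MDP and a stochastic policy could be represented. The states, actions, and numbers are illustrative only, not taken from the lecture.

# Illustrative finite-state MDP: all names and numbers are made up.
# P[s][a] lists (next_state, probability) pairs; R[s][a] is the
# expected immediate reward; pi[s][a] is the action probability.
P = {
    "s0": {"stay": [("s0", 1.0)], "go": [("s1", 0.8), ("s0", 0.2)]},
    "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]},
}
R = {
    "s0": {"stay": 0.0, "go": 1.0},
    "s1": {"stay": 2.0, "go": 0.0},
}
pi = {
    "s0": {"stay": 0.5, "go": 0.5},
    "s1": {"stay": 0.9, "go": 0.1},
}
gamma = 0.9  # discount factor, chosen arbitrarily for illustration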

Page 4:

Some Key Quantities in an MDP

• System dynamics are stochastic – represented by a probability distribution.

• Problem is defined as maximization of expected rewards

– Recall that, for finite-state systems, E(X) = \sum_i x_i \, p(x_i)

Page 5:

Three Aspects of the RL Problem

• Planning: The MDP is known (states, actions, transitions, rewards).

Find an optimal policy, π*!

• Learning: The MDP is unknown. You are allowed to interact with it.

Find an optimal policy, π*!

• Optimal learning: While interacting with the MDP, minimize the loss due to not using an optimal policy from the beginning.

Page 6:

Solving MDPs – Many Dimensions

• Which problem? (Planning, learning, optimal learning)
• Exact or approximate?
• Uses samples?
• Incremental?
• Uses value functions?
  – Yes: Value-function based methods
    • Planning: DP, Random Discretization Method, FVI, …
    • Learning: Q-learning, Actor-critic, …
  – No: Policy search methods
    • Planning: Monte-Carlo tree search, Likelihood ratio methods (policy gradient), Sample-path optimization (Pegasus), …
• Representation
  – Structured state: Factored states, logical representation, …
  – Structured policy space: Hierarchical methods

Today, we will focus on DP for the ‘planning’ problem.

Page 7:

Value Functions

• Value functions are used to determine how good it is for the agent to be in a given state

– Or, how good is it to perform an action from a given state?

• This is defined w.r.t. a specific policy π(s,a)

• State value function:

• Action value function:
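In the notation of Sutton & Barto (the course text referenced later in these slides), the two value functions are defined as expected discounted returns:

V^\pi(s) = E_\pi\!\left[\, \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\middle|\; s_t = s \right]

Q^\pi(s,a) = E_\pi\!\left[\, \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\middle|\; s_t = s,\; a_t = a \right]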

Page 8:

Value Functions

Note that there are multiple sources of (probabilistic) uncertainty:

• In state s, one can select different actions a

• The system can transition to different states s’ from s

• Depending on the above, the return (defined in terms of rewards) is a random variable - its expectation is what we seek to maximize

Page 9:

Recursive Form of V – Bellman Equation


Expand 1-step forward & rewrite expectation
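Carrying out that expansion gives the Bellman equation for V^\pi; in Sutton & Barto's notation (a reconstruction, since the equation itself is not reproduced here):

V^\pi(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma\, V^\pi(s') \right]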

Page 10:

Backup Diagrams

• If you go all the way ‘down’, you can just read off the reward value

• The backup process (i.e., the recursive equation above) allows you to compute the corresponding value at the current state (see the sketch below)

– taking transition probabilities into account
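The one-step backup can be sketched in Python, reusing the illustrative dictionary representation introduced earlier; the function and variable names are mine, not the lecture's.

def backup_v(s, V, pi, P, R, gamma):
    """One-step backup of the state value at s under policy pi.

    Computes sum_a pi(a|s) * sum_{s'} P(s'|s,a) * [R(s,a) + gamma * V(s')].
    """
    value = 0.0
    for a, p_a in pi[s].items():
        # Expected one-step return of taking action a in state s.
        expected = 0.0
        for s_next, p_next in P[s][a]:
            expected += p_next * (R[s][a] + gamma * V[s_next])
        value += p_a * expected
    return value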

Page 11:

Bellman Equation for Q
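In the same notation, the Bellman equation for the action-value function (a reconstruction of the standard form) is:

Q^\pi(s,a) = \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma \sum_{a'} \pi(s',a')\, Q^\pi(s',a') \right]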

Page 12:

Optimal Value Function

• For finite MDPs, value functions induce a partial ordering over policies: π ≥ π′ iff Vπ(s) ≥ Vπ′(s) for all s

• Let us denote the optimal policy by π*

• The optimal value functions (written out below) are:

• From this, V*(s) = max_a Q*(s, a)

• Will there always be a well-defined π*?

Theorem [Blackwell, 1962] – For every MDP with finite state/action space, there exists an optimal deterministic stationary policy.
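For reference, the optimal value functions mentioned above are, in standard notation:

V^*(s) = \max_{\pi} V^\pi(s), \qquad Q^*(s,a) = \max_{\pi} Q^\pi(s,a), \qquad V^*(s) = \max_{a} Q^*(s,a)

A policy that is greedy with respect to Q^* attains these values, so it is optimal.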

Page 13:

Recursive Form of V*
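The recursive form referred to here is the Bellman optimality equation; in standard notation:

V^*(s) = \max_{a} \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma\, V^*(s') \right]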

Page 14:

Backup for Q*
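Analogously, the backup for Q^* (the Bellman optimality equation for action values) is, in standard notation:

Q^*(s,a) = \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma \max_{a'} Q^*(s',a') \right]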

Page 15:

What is Dynamic Programming?

Given a known model of the environment as an MDP (transition dynamics and reward probabilities), DP is a collection of algorithms for computing optimal policies.

• Much of what is in the S+B book is founded on DP concepts

• The goal is often to achieve the same effect as DP, but with less computation and without assuming perfect information about the environment

RL: simulation-based methods for solving optimal control problems

• The notion of value functions came out of the classical DP literature

(as did Bellman’s equation – Hamilton-Jacobi before that)

Page 16:

Policy Evaluation

How to compute Vπ(s) for an arbitrary policy π? (Prediction problem)

For a given MDP, this yields a system of simultaneous equations

– as many unknowns as states

– Matrix form: V^\pi = (I - \gamma P^\pi)^{-1} R^\pi, where P^\pi is the state-transition matrix and R^\pi the expected-reward vector under π

Solve iteratively, with a sequence of value functions V_0, V_1, V_2, \ldots converging to V^\pi
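For a small number of states, the linear system can also be solved directly; below is a minimal numpy sketch (function and variable names are my own choices):

import numpy as np

def evaluate_policy_direct(P_pi, r_pi, gamma):
    """Solve V = r_pi + gamma * P_pi @ V for V exactly.

    P_pi: (n, n) state-transition matrix under a fixed policy pi.
    r_pi: length-n vector of expected immediate rewards under pi.
    """
    n = P_pi.shape[0]
    # Solve (I - gamma * P_pi) V = r_pi rather than forming the inverse.
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)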

Page 17:

Computationally…

We could achieve this in a number of different ways:

• Maintain two arrays: compute each new sweep of values into one array using the old values held in the other

• In-place: Overwrite as new backed-up values become available

• It can be shown that the in-place algorithm also converges to Vπ (typically somewhat faster, even)

– Backups sweep through the space

– Sweep order has significant influence on convergence rates

Page 18:

Iterative Policy Evaluation
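A minimal Python sketch of in-place iterative policy evaluation, using the illustrative dictionary representation from earlier; the stopping threshold theta is an assumption on my part, following the usual presentation in Sutton & Barto.

def iterative_policy_evaluation(states, pi, P, R, gamma, theta=1e-6):
    """In-place iterative policy evaluation for a fixed policy pi.

    Sweeps the state space, replacing V(s) with its one-step backup,
    until the largest change in a sweep falls below theta.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            # One-step backup under pi, using already-updated values in place.
            V[s] = sum(
                p_a * sum(p_next * (R[s][a] + gamma * V[s_next])
                          for s_next, p_next in P[s][a])
                for a, p_a in pi[s].items()
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V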

Page 19:

Grid-World Example

Page 20:

Iterative Policy Evaluation in Grid World


Note: The value function can be searched greedily to find long-term optimal actions
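As a sketch of what searching the value function greedily means in code (again with illustrative names, not the lecture's):

def greedy_policy(states, P, R, V, gamma):
    """Return a deterministic policy that is greedy with respect to V."""
    policy = {}
    for s in states:
        best_action, best_value = None, float("-inf")
        for a in P[s]:
            # One-step lookahead: expected immediate reward plus discounted value.
            q = sum(p_next * (R[s][a] + gamma * V[s_next])
                    for s_next, p_next in P[s][a])
            if q > best_value:
                best_action, best_value = a, q
        policy[s] = best_action
    return policy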

Page 21:

Important Property: Convergence


Contraction mapping.

Page 22:

Contraction Mapping Property
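In standard form, the property is that the Bellman backup operator T^\pi for a fixed policy, defined by (T^\pi V)(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V(s')], is a \gamma-contraction in the max norm:

\left\| T^\pi V - T^\pi V' \right\|_\infty \;\le\; \gamma \left\| V - V' \right\|_\infty

By the Banach fixed-point theorem, for \gamma < 1 the iterates V_{k+1} = T^\pi V_k therefore converge to the unique fixed point V^\pi from any starting point V_0.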

Page 23:

Policy Improvement

Does it make sense to deviate from π(s) at any state (following the policy everywhere else)?


- Policy Improvement Theorem [Howard/Blackwell]
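In its standard form, the theorem states: if \pi and \pi' are deterministic policies such that

Q^\pi\!\left(s, \pi'(s)\right) \;\ge\; V^\pi(s) \quad \text{for all } s,

then V^{\pi'}(s) \ge V^\pi(s) for all s, i.e. \pi' is at least as good as \pi.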

Page 24:

Key Idea Behind Policy Improvement

Page 25:

Computing Better Policies

Starting with an arbitrary policy, we’d like to approach truly optimal policies. So, we compute new policies using the following,
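In standard notation, the improvement step picks actions greedily with respect to the current value function:

\pi'(s) \;=\; \arg\max_{a} Q^\pi(s,a) \;=\; \arg\max_{a} \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma\, V^\pi(s') \right]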

Are we restricted to deterministic policies? No.

With stochastic policies, the improved policy can instead give each maximizing (greedy) action a share of the selection probability.

Page 26:

Summary

• Formulation of MDP and three kinds of problems (planning, learning, optimal learning)

• Value functions

• Optimality and Bellman’s equation

• Policy evaluation via dynamic programming

• Policy improvement

Next time:

• Algorithms for value and policy iteration

• Making the computational techniques efficient
