
Introduction to Markov Decision Processes and Dynamic Programming

Judith Bütepage and Marcus Klasson

KTH, Royal Institute of Technology, Stockholm

[email protected], [email protected]

February 14, 2017


Overview

1 Introduction to Markov Decision Processes
   Formal Modelling of RL Tasks
   Value Functions
   Bellman and his equations
   Optimal Value Function

2 Dynamic Programming
   Policy Evaluation
   Policy Improvement
   Policy Iteration
   Value Iteration


The Agent-Environment Interaction

S_t ∈ S, where S is the set of possible states
A_t ∈ A(S_t), where A(S_t) is the set of actions available in state S_t
R_{t+1} ∈ R ⊂ ℝ is a numerical reward
π_t(a|s) is a policy denoting the probability of choosing action A_t = a in state S_t = s

The agent’s goal is to maximize the total amount of reward it receives over the long run.


Help us to maximize our rewards!

The states are the slides of this lecture.
The actions are your reactions.
We get more reward when you understand and when you ask questions.

So raise your hand and do not get lost in this mathematical jungle!


A Short Discourse Into Multi-Armed Bandits

The agent can choose between k actions and receives a reward for each action. The expected reward for taking action a at time t is

q*(a) = E[R_t | A_t = a].

If the agent has chosen actions up to time t, the average received reward is

Q_t(a) = ( Σ_{i=1}^{t−1} R_i · 1(A_i = a) ) / ( Σ_{i=1}^{t−1} 1(A_i = a) ).


Multi-Armed Bandits Example - Dragon Finder

We can choose the actions

A = {d1, d2, d3}

We have chosen actions and received rewards

A_{1:t−1} = [d1, d2, d1, d3, d2, d3, d3]
R_{1:t−1} = [2.6, 1.1, 3.4, 6.1, 0.8, 4.6, 5.2]


Multi-Armed Bandits Example - Dragon Finder

We have chosen actions and received rewards

A_{1:t−1} = [d1, d2, d1, d3, d2, d3, d3]
R_{1:t−1} = [2.6, 1.1, 3.4, 6.1, 0.8, 4.6, 5.2]

Then we have

Q_t(d1) = (2.6 + 3.4) / 2 = 3.0
Q_t(d2) = (1.1 + 0.8) / 2 = 0.95
Q_t(d3) = (6.1 + 4.6 + 5.2) / 3 = 5.3

We can be greedy and exploit these estimates by choosing the action that gives us the highest expected reward, or we can explore our action space and choose a random action with probability ε.
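As a small illustration of these two choices, here is a minimal Python sketch (not from the slides; the function and variable names, such as epsilon_greedy_bandit and true_means, are mine) of a k-armed bandit agent that keeps sample-average estimates Q_t(a) and acts ε-greedily:

```python
import random

def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=1000, seed=0):
    """Run an epsilon-greedy agent on a k-armed bandit with Gaussian rewards."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k          # number of times each action was chosen
    q_est = [0.0] * k         # sample-average estimate Q_t(a)
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                       # explore: random action
        else:
            a = max(range(k), key=lambda i: q_est[i])  # exploit: greedy action
        r = rng.gauss(true_means[a], 1.0)              # sample a reward
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]         # incremental sample average
        total_reward += r
    return q_est, total_reward / steps

# Example: three "dragons" with hypothetical mean rewards
print(epsilon_greedy_bandit([3.0, 1.0, 5.0], epsilon=0.1))
```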


Multi-Armed Bandits Example - ε-greedy Graph

Figure: Average reward (0 to 5) over 1000 steps for ε = 0 (greedy), ε = 0.01, and ε = 0.1.

Comparing the greedy method with two ε-greedy methods (ε = 0.01 and ε = 0.1). Rewards are normally distributed as

R_d ∼ N(μ_d, σ_d), with μ = [3, 1, 5] and σ = [0.5, 0.25, 1].

Each run takes t = 1000 steps and the curves are averaged over 1000 runs.
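A rough sketch of how such a comparison could be reproduced; the means and standard deviations below are the ones given in the caption, the plotting is omitted, and fewer runs are used than on the slide to keep it quick:

```python
import numpy as np

def run_bandit(epsilon, mu, sigma, steps=1000, runs=200, seed=0):
    """Average reward per step of an epsilon-greedy agent, averaged over runs."""
    rng = np.random.default_rng(seed)
    k = len(mu)
    avg_reward = np.zeros(steps)
    for _ in range(runs):
        q_est = np.zeros(k)
        counts = np.zeros(k)
        for t in range(steps):
            if rng.random() < epsilon:
                a = int(rng.integers(k))            # explore
            else:
                a = int(np.argmax(q_est))           # exploit
            r = rng.normal(mu[a], sigma[a])
            counts[a] += 1
            q_est[a] += (r - q_est[a]) / counts[a]  # sample-average update
            avg_reward[t] += r
    return avg_reward / runs

mu, sigma = [3, 1, 5], [0.5, 0.25, 1]   # values from the figure caption
for eps in (0.0, 0.01, 0.1):
    curve = run_bandit(eps, mu, sigma)  # slide uses 1000 runs; 200 here for speed
    print(f"epsilon={eps}: final average reward ~ {curve[-1]:.2f}")
```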


Markov Decision Processes

A Markov Decision Process (MDP) is defined by a 5-tuple (S, A, p, R, γ)

S is a finite set of possible states
A(S_t) is a finite set of actions in state S_t
p(s'|s, a) is the state-transition probability to state s' from state s when taking action a
R is a numerical reward
γ is a discount factor, 0 ≤ γ ≤ 1

A finite MDP has a finite number of states and actions.
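In code, a finite MDP can be stored as a nested mapping from states and actions to outcome lists. The sketch below is only illustrative; the states, probabilities, and rewards are placeholders of mine, not values from the lecture:

```python
# A finite MDP as a nested dict: P[s][a] -> list of (probability, next_state, reward).
# All states, actions, probabilities, and rewards below are illustrative placeholders.
P = {
    "corridor": {
        "walk":  [(1.0, "dragon_room", -1.0)],
    },
    "dragon_room": {
        "fight": [(0.7, "treasure", -5.0), (0.3, "corridor", -5.0)],
        "sneak": [(0.5, "treasure", -1.0), (0.5, "dragon_room", -1.0)],
    },
    "treasure": {},   # terminal state: no actions available
}
gamma = 0.9  # discount factor, 0 <= gamma <= 1

# Sanity check: outgoing probabilities for each (s, a) should sum to 1.
for s, actions in P.items():
    for a, outcomes in actions.items():
        assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9, (s, a)
```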


The Valentine’s Dilemma

The final goal of the princess is to rescue her prince. However, there are obstacles on the way. Valentine's day is only ONCE a year, so she needs to be fast!
For every step she gets a reward of -1, unless she meets a dragon and needs to fight it. Then the reward is -5.


Goals and Rewards

Goal: The maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
Reward signal: What we want to achieve, not how to achieve it.


Discounted Rewards

Episodic task: T ∈ ℕ, each episode ends in a terminal state. Continuing task: T = ∞.

Expected return: G_t is some specific function of the reward sequence.

Episodic task: G_t = R_{t+1} + R_{t+2} + R_{t+3} + ... + R_T

Continuing task: G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + γ³R_{t+4} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}

0 ≤ γ ≤ 1 is called the discount rate.


Unified Notation

Episodic task: T ∈ ℕ, each episode ends in a terminal state. Continuing task: T = ∞.

G_t = Σ_{k=0}^{T−t−1} γ^k R_{t+k+1}

T can be ∞, or γ can be 1, but not both T = ∞ and γ = 1.

Myopic agent: γ = 0. Far-sighted agent: γ → 1.
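The unified return is easy to check numerically; a short sketch (the reward sequence below is made up):

```python
def discounted_return(rewards, gamma):
    """G_t = sum_{k=0}^{T-t-1} gamma^k * R_{t+k+1} for the reward sequence R_{t+1}, ..., R_T."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Episodic example: rewards R_{t+1}, ..., R_T observed after time t.
print(discounted_return([-1, -1, -5, -1, 10], gamma=1.0))   # undiscounted sum: 2.0
print(discounted_return([-1, -1, -5, -1, 10], gamma=0.9))   # later rewards count less
```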


State Representations

Figure: Three alternative state representations (Representation 1, Representation 2, Representation 3).

A state can include sensory signals, abstract environmental information or even mental states. However, it should only contain information relevant for decision making.


The Valentine’s Dilemma - The Markov Property

Generally, the current response could depend on the entire past:
p(S_{t+1} = s', R_{t+1} = r | S_0, A_0, R_1, ..., S_{t−1}, A_{t−1}, R_t, S_t, A_t)

The Markov property assumes independence of the past given the present:
p(s', r | s, a) ≐ p(S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a)


Markov Decision Processes

A Markov Decision Process is defined by a 5-tuple (S, A, p, R, γ)

S is a finite set of possible states
A(S_t) is a finite set of actions in state S_t
p(s'|s, a) is the state-transition probability to state s' from state s when taking action a
R is a numerical reward
γ is a discount factor, 0 ≤ γ ≤ 1

Expected reward for a state–action pair

r(s, a) ≐ E[R_{t+1} | S_t = s, A_t = a] = Σ_{r∈R} r Σ_{s'∈S} p(s', r | s, a)

State-transition probabilities

p(s'|s, a) ≐ p(S_{t+1} = s' | S_t = s, A_t = a) = Σ_{r∈R} p(s', r | s, a)

Expected reward for a state–action–next-state triple

r(s, a, s') ≐ E[R_{t+1} | S_t = s, A_t = a, S_{t+1} = s'] = Σ_{r∈R} r p(s', r | s, a) / p(s'|s, a)
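These marginals can be computed mechanically from the joint dynamics p(s', r | s, a). A sketch, assuming the joint for one fixed (s, a) pair is stored as a dict mapping (s', r) to its probability (a hypothetical layout of mine, with illustrative numbers):

```python
def transition_prob(joint, s_next):
    """p(s'|s,a) = sum_r p(s', r | s, a) for a fixed (s, a)."""
    return sum(p for (sp, r), p in joint.items() if sp == s_next)

def expected_reward(joint):
    """r(s,a) = sum_{s',r} r * p(s', r | s, a) for a fixed (s, a)."""
    return sum(r * p for (sp, r), p in joint.items())

def expected_reward_triple(joint, s_next):
    """r(s,a,s') = sum_r r * p(s', r | s, a) / p(s'|s,a)."""
    num = sum(r * p for (sp, r), p in joint.items() if sp == s_next)
    return num / transition_prob(joint, s_next)

# Joint dynamics p(s', r | s, a) for one fixed (s, a); numbers are illustrative only.
joint = {("win", 10.0): 0.3, ("fight", -5.0): 0.5, ("lose", -10.0): 0.2}
print(transition_prob(joint, "fight"))          # 0.5
print(expected_reward(joint))                   # 0.3*10 + 0.5*(-5) + 0.2*(-10) = -1.5
print(expected_reward_triple(joint, "fight"))   # -5.0
```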


MDP Transition Graph - Encountering a Dragon

Figure: Transition graph and table.
States: Sm = Smashed against the wall, Fi = Fighting, Wo = Won.
Actions: A = Attacking, H = Hitting, S = Sneaking past the dragon.
Functions: [p(s'|s, a), r(s, a, s')]


Value Functions

E_π[G_t] denotes the expectation of G_t when following policy π(a|s).

State–value function for policy π

v_π(s) ≐ E_π[G_t | S_t = s] = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s ]

Action–value function for policy π

q_π(s, a) ≐ E_π[G_t | S_t = s, A_t = a] = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s, A_t = a ]


Bellman Equation for State–Value Functions

Figure: Richard Ernest Bellman (August 26, 1920 - March 19, 1984)

v_π(s) ≐ E_π[G_t | S_t = s]
       = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s ]
       = E_π[ R_{t+1} + γ Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_t = s ]
       = Σ_{a∈A} π(a|s) Σ_{r∈R} Σ_{s'∈S} p(s', r | s, a) [ r + γ E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_{t+1} = s' ] ]
       = Σ_{a∈A} π(a|s) Σ_{r∈R} Σ_{s'∈S} p(s', r | s, a) [ r + γ v_π(s') ],   ∀s ∈ S
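The last line is a one-step backup that can be evaluated for any state once a model and a policy are available. A minimal sketch, assuming the hypothetical layout P[s][a] = [(probability, next_state, reward), ...] and policy[s] = {action: π(a|s)} used in the earlier MDP sketch:

```python
def bellman_backup_v(P, policy, V, s, gamma):
    """Right-hand side of the Bellman equation for v_pi at state s:
       sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v_pi(s')]."""
    value = 0.0
    for a, pi_a in policy[s].items():           # policy[s]: dict action -> pi(a|s)
        for prob, s_next, reward in P[s][a]:    # model outcomes for (s, a)
            value += pi_a * prob * (reward + gamma * V.get(s_next, 0.0))  # terminal states default to 0
    return value
```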


Bellman Equation for Action–Value functions

q_π(s, a) = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s, A_t = a ]
          = ...
          = Σ_{r∈R} Σ_{s'∈S} p(s', r | s, a) [ r + γ Σ_{a'∈A} π(a'|s') q_π(s', a') ],   ∀s ∈ S, ∀a ∈ A


Backup Diagrams

(a) v_π(s) = Σ_{a∈A} π(a|s) Σ_{r∈R} Σ_{s'∈S} p(s', r | s, a) [ r + γ v_π(s') ],   ∀s ∈ S

(b) q_π(s, a) = Σ_{r∈R} Σ_{s'∈S} p(s', r | s, a) [ r + γ Σ_{a'∈A} π(a'|s') q_π(s', a') ],   ∀s ∈ S, ∀a ∈ A


Optimal Value Function

We say that policy π is better than (or equal to) π′ if π ≥ π′, where π ≥ π′ iff v_π(s) ≥ v_{π′}(s) ∀s ∈ S.

It is always the case that ∃π : π ≥ π′ ∀π′; this π is the optimal policy π*, and

v*(s) ≐ max_π v_π(s), ∀s ∈ S is the optimal state-value function,

q*(s, a) ≐ max_π q_π(s, a), ∀s ∈ S, ∀a ∈ A(s) is the optimal action-value function.


Bellman Optimality Equation

v*(s) = max_{a∈A(s)} q*(s, a)
      = max_a E_{π*}[G_t | S_t = s, A_t = a]
      = max_a E_{π*}[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s, A_t = a ]
      = max_a E_{π*}[ R_{t+1} + γ Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_t = s, A_t = a ]
      = max_a E_{π*}[ R_{t+1} + γ v*(S_{t+1}) | S_t = s, A_t = a ]
      = max_{a∈A(s)} Σ_{s',r} p(s', r | s, a) [ r + γ v*(s') ]


Bellman Optimality Equation - Backup Diagrams

(a) v*(s) = max_{a∈A(s)} Σ_{s',r} p(s', r | s, a) [ r + γ v*(s') ]

(b) q*(s, a) = Σ_{s',r} p(s', r | s, a) [ r + γ max_{a'∈A(s')} q*(s', a') ]


Introduction to Dynamic Programming

In general, Dynamic Programming techniques optimize subproblems of the main problem to reach a globally optimal solution.
In the context of RL, Dynamic Programming is a collection of algorithms that can compute the optimal value function of a finite MDP given a perfect model of the environment.


Evaluating a Policy fi

We have a policy π(a|s) and want to compute the value function v_π(s), ∀s ∈ S. The Bellman equation can be solved directly:

v_π(s) ≐ Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ v_π(s') ]

v_π(s1) = c_{1,0} + c_{1,1} v_π(s1) + c_{1,2} v_π(s2) + c_{1,3} v_π(s3) + ...
v_π(s2) = c_{2,0} + c_{2,1} v_π(s1) + c_{2,2} v_π(s2) + c_{2,3} v_π(s3) + ...
v_π(s3) = c_{3,0} + c_{3,1} v_π(s1) + c_{3,2} v_π(s2) + c_{3,3} v_π(s3) + ...
v_π(s4) = ...

If the MDP is not finite we are in trouble! A large number of states and actions also makes this approach infeasible: solving the linear system has computational complexity O(n³), where n is the number of states.


Policy Evaluation

Assume that the environment is a finite MDP. We can use an iterative approach:

v_{k+1}(s) ≐ E_π[R_{t+1} + γ v_k(S_{t+1}) | S_t = s]
           = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ v_k(s') ]

This update is an operation called a full backup, and the resulting algorithm is called Iterative Policy Evaluation. It converges to the fixed point v_k = v_π.


Iterative Policy Evaluation
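The algorithm box from this slide is not reproduced in the transcript. The following is a minimal Python sketch of iterative policy evaluation with synchronous full backups, under the same assumed conventions as the earlier sketches (P[s][a] = [(probability, next_state, reward), ...], policy[s] = {action: π(a|s)}, terminal states have no actions):

```python
def policy_evaluation(P, policy, gamma=1.0, theta=1e-6):
    """Apply the full backup v_{k+1}(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v_k(s')]
       over all states until the largest change in a sweep is below theta."""
    V = {s: 0.0 for s in P}
    while True:
        V_new = {}
        delta = 0.0
        for s in P:
            if not P[s]:                       # terminal state: value fixed at 0
                V_new[s] = 0.0
                continue
            V_new[s] = sum(
                pi_a * prob * (reward + gamma * V[s_next])
                for a, pi_a in policy[s].items()
                for prob, s_next, reward in P[s][a]
            )
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < theta:
            return V

# Tiny usage example (hypothetical 2-state chain: "s0" -> "end" with reward -1)
P = {"s0": {"go": [(1.0, "end", -1.0)]}, "end": {}}
policy = {"s0": {"go": 1.0}, "end": {}}
print(policy_evaluation(P, policy, gamma=0.9))   # {'s0': -1.0, 'end': 0.0}
```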


Running Example

Shaded squares are terminal states. Actions that would take the agent off the grid leave it in the same state.


Running Example - "Random Policy"

v_{k+1}(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ v_k(s') ]


Policy Improvement

We have a policy π(a|s) but it is not optimal. How can we improve it?

Policy improvement theorem: If q_π(s, π′(s)) ≥ v_π(s) for all s ∈ S, then the policy π′ must be as good as, or better than, π.

It must obtain greater or equal expected return in all states: v_{π′}(s) ≥ v_π(s).


Proof of Policy Improvement Theorem

v_π(s) ≤ q_π(s, π′(s))
       = E_{π′}[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]
       ≤ E_{π′}[R_{t+1} + γ q_π(S_{t+1}, π′(S_{t+1})) | S_t = s]
       = E_{π′}[R_{t+1} + γ E_{π′}[R_{t+2} + γ v_π(S_{t+2})] | S_t = s]
       = E_{π′}[R_{t+1} + γ R_{t+2} + γ² v_π(S_{t+2}) | S_t = s]
       ...
       ≤ E_{π′}[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... | S_t = s]
       = v_{π′}(s)


Greedy Policy

We have the state-value function v_π(s), ∀s ∈ S, and greedily choose actions that maximize it.

π′(s) ≐ argmax_a q_π(s, a)
      = argmax_a E[R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t = a]
      = argmax_a Σ_{s',r} p(s', r | s, a) [ r + γ v_π(s') ]

If the greedy policy π′ is as good as, but not better than, π, then v_{π′}(s) = v_π(s) ∀s ∈ S, and

v_{π′}(s) = max_a E[R_{t+1} + γ v_{π′}(S_{t+1}) | S_t = s, A_t = a]
          = max_a Σ_{s',r} p(s', r | s, a) [ r + γ v_{π′}(s') ]

This is the Bellman optimality equation, so v_{π′} = v* and both π and π′ are optimal policies.
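Acting greedily with respect to v_π is a one-step lookahead through the model. A sketch under the same assumed P[s][a] = [(probability, next_state, reward), ...] layout; it returns a deterministic policy mapping each state to one action:

```python
def greedy_policy(P, V, gamma=1.0):
    """pi'(s) = argmax_a sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]."""
    policy = {}
    for s, actions in P.items():
        if not actions:                    # terminal state: nothing to choose
            continue
        q = {
            a: sum(prob * (reward + gamma * V.get(s_next, 0.0))
                   for prob, s_next, reward in outcomes)
            for a, outcomes in actions.items()
        }
        policy[s] = max(q, key=q.get)      # greedy (deterministic) action
    return policy
```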


Running Example


Policy Iteration
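The policy iteration diagram and algorithm box from this slide are not in the transcript. Below is a compact sketch of the evaluate/improve loop for deterministic policies, reusing the assumed P[s][a] = [(probability, next_state, reward), ...] layout (in-place value sweeps are used for brevity):

```python
def policy_iteration(P, gamma=0.9, theta=1e-8):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    states = [s for s in P if P[s]]                    # non-terminal states
    policy = {s: next(iter(P[s])) for s in states}     # arbitrary initial deterministic policy
    V = {s: 0.0 for s in P}

    def q_value(s, a):
        """One-step lookahead q(s, a) under the current value estimates V."""
        return sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in P[s][a])

    while True:
        # Policy evaluation for the current deterministic policy
        while True:
            delta = 0.0
            for s in states:
                v_new = q_value(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Greedy policy improvement
        stable = True
        for s in states:
            best = max(P[s], key=lambda a: q_value(s, a))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```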


Generalized Policy Iteration

Any method that interleaves the two processes of policy evaluation and policy improvement falls under the umbrella of generalized policy iteration.
The two processes of policy evaluation and policy improvement can be seen as opposing forces that will agree on a single joint solution in the long run.


Value Iteration

Value iteration combines policy improvement and truncated policy evaluation steps.

v_{k+1}(s) ≐ max_a E[R_{t+1} + γ v_k(S_{t+1}) | S_t = s, A_t = a]
           = max_a Σ_{s',r} p(s', r | s, a) [ r + γ v_k(s') ],   ∀s ∈ S.


Value Iteration
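The value iteration algorithm box is likewise missing from the transcript. A minimal sketch of the max-backup with a greedy policy read-out at the end, again under the assumed P[s][a] = [(probability, next_state, reward), ...] layout:

```python
def value_iteration(P, gamma=0.9, theta=1e-8):
    """v_{k+1}(s) = max_a sum_{s',r} p(s',r|s,a) [r + gamma * v_k(s')], iterated until convergence."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:                          # terminal state: value stays 0
                continue
            v_new = max(
                sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in outcomes)
                for outcomes in P[s].values()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Read out a greedy policy with one final one-step lookahead
    policy = {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in P[s][a]))
        for s in P if P[s]
    }
    return V, policy
```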


Convergence and Termination

All methods presented up to here are only guaranteed to converge for k → ∞.

However, often we get reasonable results by setting a convergence criterion such as |V_{k+1}(s) − V_k(s)| < θ.

Dynamic programming methods scale polynomially in the number of states and actions. Therefore they are exponentially faster than any direct search in the policy space. On today's computers, MDPs with millions of states can be solved with DP methods.


Limitations of MDPs

Take off the rose-tinted glasses: the real world is not a video game!


Limitations of MDPs

Circumvent the problem of high-dimensional state and action spaces by dividing your problem into subproblems.


Summary

In reinforcement learning we have an agent that interacts with its environment and receives rewards based on its decisions. The goal is to learn to choose actions that maximize the expected future reward.

states – states should contain all relevant information for making decisions

actions – an action brings you from state s into state s' according to p(s'|s, a)

rewards – an agent receives rewards for being in a state

policy – a policy is a stochastic rule for choosing actions as a function of states

Markov Decision Process – (S, A, p, R, γ) + Markov property

value functions – v_π(s) & q_π(s, a) summarize the expected return when following a policy π

policy evaluation – given a policy π(s) we iteratively compute v_π(s) ∀s ∈ S

policy improvement – given v_π(s), improve your policy π(s), e.g. by being greedy

policy iteration – alternate between policy evaluation and policy improvement

value iteration – combine policy evaluation and policy improvement


Questions?


References

Sutton, Richard S. and Barto, Andrew G. (2016++). Reinforcement Learning: An Introduction. MIT Press, Cambridge.


Thanks!


Policy Evaluation - Proof Sketch

Assume we converged at iteration K.

v_K(s) ≐ E_π[R_{t+1} + γ v_{K−1}(S_{t+1}) | S_t = s]

v_K(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ E_π[v_{K−1}(s')] ],
where E_π[v_{K−1}(s')] = Σ_a π(a|s') Σ_{s'',r} p(s'', r | s', a) [ r + γ E_π[v_{K−2}(s'')] ].

Since we follow π in every step, we will effectively approximate v_π.
