
A Guided Tour of Chapter 2: Markov Decision Process and Bellman Equations

Ashwin Rao

ICME, Stanford University


Developing Intuition on Optimal Sequential Decisioning

Chapter 1 covered “Sequential Uncertainty” and the notion of “Rewards”

Here we extend the framework to include “Sequential Decisioning”

Developing intuition by revisiting the Inventory example

Over-ordering risks “holding costs” of overnight inventory

Under-ordering risks “stockout costs” (empty shelves more damaging)

Orders influence future inventory levels, and consequent future orders

Also need to deal with delayed costs and demand uncertainty

Intuition on how challenging it is to determine Optimal Actions

Cyclic interplay between the Agent and Environment

Unlike supervised learning, there’s no “teacher” here (only Rewards)


Cyclic Interplay between Agent and Environment


MDP Definition for Discrete Time, Countable States

Definition

A Markov Decision Process (MDP) comprises:

A countable set of states $\mathcal{S}$ (the State Space), a set $\mathcal{T} \subseteq \mathcal{S}$ (known as the set of Terminal States), and a countable set of actions $\mathcal{A}$

A time-indexed sequence of environment-generated pairs of random states $S_t \in \mathcal{S}$ and random rewards $R_t \in \mathcal{D}$ (a countable subset of $\mathbb{R}$), alternating with agent-controllable actions $A_t \in \mathcal{A}$, for time steps $t = 0, 1, 2, \ldots$

Markov Property: $\mathbb{P}[(R_{t+1}, S_{t+1}) \mid (S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0, A_0)] = \mathbb{P}[(R_{t+1}, S_{t+1}) \mid (S_t, A_t)]$ for all $t \geq 0$

Termination: if an outcome for $S_T$ (for some time step $T$) is a state in the set $\mathcal{T}$, then this sequence outcome terminates at time step $T$:

$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, \ldots, S_{T-1}, A_{T-1}, R_T, S_T$


Time-Homogeneity, Transition Function, Reward Functions

Time-Homogeneity: $\mathbb{P}[(R_{t+1}, S_{t+1}) \mid (S_t, A_t)]$ is independent of $t$

$\Rightarrow$ Transition Probability Function $\mathcal{P}_R : \mathcal{N} \times \mathcal{A} \times \mathcal{D} \times \mathcal{S} \rightarrow [0, 1]$

$\mathcal{P}_R(s, a, r, s') = \mathbb{P}[(R_{t+1} = r, S_{t+1} = s') \mid S_t = s, A_t = a]$

State Transition Probability Function $\mathcal{P} : \mathcal{N} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$:

$\mathcal{P}(s, a, s') = \sum_{r \in \mathcal{D}} \mathcal{P}_R(s, a, r, s')$

Reward Transition Function $\mathcal{R}_T : \mathcal{N} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$ defined as:

$\mathcal{R}_T(s, a, s') = \mathbb{E}[R_{t+1} \mid S_{t+1} = s', S_t = s, A_t = a] = \sum_{r \in \mathcal{D}} \frac{\mathcal{P}_R(s, a, r, s')}{\mathcal{P}(s, a, s')} \cdot r = \sum_{r \in \mathcal{D}} \frac{\mathcal{P}_R(s, a, r, s')}{\sum_{r' \in \mathcal{D}} \mathcal{P}_R(s, a, r', s')} \cdot r$

Reward Function $\mathcal{R} : \mathcal{N} \times \mathcal{A} \rightarrow \mathbb{R}$ defined as:

$\mathcal{R}(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{D}} \mathcal{P}_R(s, a, r, s') \cdot r$
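To make these definitions concrete, here is a minimal standalone sketch (plain Python dicts rather than the book's rl library; all state and action names and numbers are hypothetical) deriving $\mathcal{P}$, $\mathcal{R}_T$ and $\mathcal{R}$ from a tabular $\mathcal{P}_R$:

# P_R stored as {(s, a): {(r, s'): probability}}
def state_transition(p_r, s, a, s1):
    # P(s, a, s') = sum over rewards r of P_R(s, a, r, s')
    return sum(p for (r, s_next), p in p_r[(s, a)].items() if s_next == s1)

def reward_transition(p_r, s, a, s1):
    # R_T(s, a, s') = E[R_{t+1} | S_{t+1} = s', S_t = s, A_t = a]
    return sum(p * r for (r, s_next), p in p_r[(s, a)].items()
               if s_next == s1) / state_transition(p_r, s, a, s1)

def reward(p_r, s, a):
    # R(s, a) = E[R_{t+1} | S_t = s, A_t = a]
    return sum(p * r for (r, s_next), p in p_r[(s, a)].items())

# Hypothetical example: from state "s0", action "order" leads to "s1" with
# reward 1.0 w.p. 0.3 and to "s2" with reward -2.0 w.p. 0.7
p_r = {("s0", "order"): {(1.0, "s1"): 0.3, (-2.0, "s2"): 0.7}}
print(state_transition(p_r, "s0", "order", "s1"))   # 0.3
print(reward_transition(p_r, "s0", "order", "s1"))  # 1.0
print(reward(p_r, "s0", "order"))                   # 0.3*1.0 + 0.7*(-2.0) = -1.1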


Policy: Function defining the Behavior of the Agent

A Policy is an Agent-controlled function $\pi : \mathcal{N} \times \mathcal{A} \rightarrow [0, 1]$

$\pi(s, a) = \mathbb{P}[A_t = a \mid S_t = s]$ for all time steps $t = 0, 1, 2, \ldots$

Above definition assumes Policy is Markovian and Stationary

If not stationary, we can include time in State to make it stationary

We denote a deterministic policy as a function $\pi_D : \mathcal{N} \rightarrow \mathcal{A}$ with

$\pi(s, \pi_D(s)) = 1$ and $\pi(s, a) = 0$ for all $a \in \mathcal{A}$ with $a \neq \pi_D(s)$

class Policy(ABC, Generic[S, A]):
    @abstractmethod
    def act(self, state: NonTerminal[S]) -> Distribution[A]:
        pass
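As a toy standalone illustration (not the book's rl code; state and action names are made up), a stochastic policy maps each non-terminal state to action probabilities, and a deterministic policy puts probability 1 on a single action:

import random

# pi(s, a) = policy[s][a]
policy = {
    "low_inventory":  {"order_2": 0.8, "order_0": 0.2},   # stochastic in this state
    "high_inventory": {"order_0": 1.0},                   # deterministic in this state
}

def act(state: str) -> str:
    # Sample A_t ~ pi(state, .)
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act("low_inventory"))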


[MDP, Policy] := MRP

$\mathcal{P}^{\pi}_R(s, r, s') = \sum_{a \in \mathcal{A}} \pi(s, a) \cdot \mathcal{P}_R(s, a, r, s')$

$\mathcal{P}^{\pi}(s, s') = \sum_{a \in \mathcal{A}} \pi(s, a) \cdot \mathcal{P}(s, a, s')$

$\mathcal{R}^{\pi}_T(s, s') = \frac{\sum_{a \in \mathcal{A}} \pi(s, a) \cdot \mathcal{P}(s, a, s') \cdot \mathcal{R}_T(s, a, s')}{\mathcal{P}^{\pi}(s, s')}$

$\mathcal{R}^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(s, a) \cdot \mathcal{R}(s, a)$


@abstractclass MarkovDecisionProcess

class MarkovDecisionProcess(ABC, Generic[S, A]):
    @abstractmethod
    def actions(self, state: NonTerminal[S]) -> Iterable[A]:
        pass

    @abstractmethod
    def step(self, state: NonTerminal[S], action: A) \
            -> Distribution[Tuple[State[S], float]]:
        pass


@abstractclass MarkovDecisionProcess

    def apply_policy(self, policy: Policy[S, A]) \
            -> MarkovRewardProcess[S]:
        mdp = self

        class RewardProcess(MarkovRewardProcess[S]):
            def transition_reward(
                self,
                st: NonTerminal[S]
            ) -> Distribution[Tuple[State[S], float]]:
                actions: Distribution[A] = policy.act(st)
                return actions.apply(
                    lambda a: mdp.step(st, a)
                )

        return RewardProcess()
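The essential idea — fixing a policy turns the two-argument MDP step into a one-argument MRP transition — can be sketched standalone (plain Python with hypothetical dynamics, not the book's rl code):

import random

def mdp_step(state: int, action: int) -> tuple:
    # Hypothetical MDP dynamics: sample (next state, reward) given (state, action)
    next_state = state + action - random.choice([0, 1, 2])
    return next_state, -abs(next_state)

def policy_act(state: int) -> int:
    # Hypothetical stochastic policy: sample an action given the state
    return random.choice([0, 1, 2])

def mrp_transition_reward(state: int) -> tuple:
    # The policy-implied MRP: a transition that needs only the state
    return mdp_step(state, policy_act(state))

print(mrp_transition_reward(5))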


Finite Markov Decision Process

Finite State Space $\mathcal{S} = \{s_1, s_2, \ldots, s_n\}$, with $|\mathcal{N}| = m \leq n$

Action Space $\mathcal{A}(s)$ is finite for each $s \in \mathcal{N}$

Finite set of (next state, reward) transitions

We'd like a sparse representation for $\mathcal{P}_R$

Conceptualize $\mathcal{P}_R : \mathcal{N} \times \mathcal{A} \times \mathcal{D} \times \mathcal{S} \rightarrow [0, 1]$ as:

$\mathcal{N} \rightarrow (\mathcal{A} \rightarrow (\mathcal{S} \times \mathcal{D} \rightarrow [0, 1]))$

StateReward = FiniteDistribution[Tuple[State[S], float]]
ActionMapping = Mapping[A, StateReward[S]]
StateActionMapping = Mapping[NonTerminal[S], ActionMapping[A, S]]
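For intuition, here is what that nested-map representation looks like for a hypothetical two-state, two-action finite MDP, with plain dicts standing in for the Mapping and FiniteDistribution types (innermost keys are (next state, reward) pairs):

state_action_mapping = {
    "s1": {
        "a1": {("s1", 1.0): 0.7, ("s2", 2.0): 0.3},
        "a2": {("s2", 0.5): 1.0},
    },
    "s2": {
        "a1": {("s1", -1.0): 0.4, ("T", 10.0): 0.6},   # "T" is a terminal state
        "a2": {("T", 0.0): 1.0},
    },
}

# Sanity check: each (state, action) distribution sums to 1
for s, action_map in state_action_mapping.items():
    for a, dist in action_map.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9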


class FiniteMarkovDecisionProcess

class FiniteMarkovDecisionProcess(MarkovDecisionProcess[S, A]):
    mapping: StateActionMapping[S, A]
    nt_states: Sequence[NonTerminal[S]]

    def __init__(self, mapping: Mapping[S, Mapping[A,
                 FiniteDistribution[Tuple[S, float]]]]):
        nt: Set[S] = set(mapping.keys())
        # Wrap raw states as NonTerminal/Terminal and re-package each
        # action's distribution over (next state, reward) pairs
        self.mapping = {NonTerminal(s): {a: Categorical(
            {(NonTerminal(s1) if s1 in nt else Terminal(s1), r): p
             for (s1, r), p in v.table().items()}
        ) for a, v in d.items()} for s, d in mapping.items()}
        self.nt_states = list(self.mapping.keys())


class FinitePolicy

    # (method of FiniteMarkovDecisionProcess, continued from the previous slide)
    def step(self, state: NonTerminal[S], action: A) \
            -> StateReward:
        return self.mapping[state][action]

@dataclass(frozen=True)
class FinitePolicy(Policy[S, A]):
    policy_map: Mapping[S, FiniteDistribution[A]]

    def act(self, state: NonTerminal[S]) \
            -> FiniteDistribution[A]:
        return self.policy_map[state.state]

With this, we can write a method for FiniteMarkovDecisionProcess that takes a FinitePolicy and produces a FiniteMarkovRewardProcess (a standalone sketch of the idea follows below)
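A standalone sketch of that conversion (plain dicts, not the book's rl code): for each non-terminal state, mix each action's (next state, reward) distribution with the policy's action probabilities, which implements $\mathcal{P}^{\pi}_R(s, r, s') = \sum_a \pi(s, a) \cdot \mathcal{P}_R(s, a, r, s')$:

def apply_finite_policy(state_action_mapping, policy_map):
    # state_action_mapping: {s: {a: {(s', r): prob}}}, policy_map: {s: {a: prob}}
    mrp = {}
    for s, action_map in state_action_mapping.items():
        mixed = {}
        for a, pi_sa in policy_map[s].items():
            for (s1, r), p in action_map[a].items():
                mixed[(s1, r)] = mixed.get((s1, r), 0.0) + pi_sa * p
        mrp[s] = mixed
    return mrp

# Tiny hypothetical example: one non-terminal state "s", two actions
mdp_map = {"s": {"a1": {("s", 1.0): 1.0},
                 "a2": {("T", 0.0): 1.0}}}
policy = {"s": {"a1": 0.5, "a2": 0.5}}
print(apply_finite_policy(mdp_map, policy))
# {'s': {('s', 1.0): 0.5, ('T', 0.0): 0.5}}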


Inventory MDP

α := On-Hand Inventory, β := On-Order Inventory

h := Holding Cost (per unit of overnight inventory)

p := Stockout Cost (per unit of missed demand)

C := Shelf Capacity (number of inventory units shelf can hold)

$\mathcal{S} = \{(\alpha, \beta) : 0 \leq \alpha + \beta \leq C\}$

$\mathcal{A}((\alpha, \beta)) = \{\theta : 0 \leq \theta \leq C - (\alpha + \beta)\}$

$f(\cdot)$ := PMF of demand (Poisson with mean $\lambda$), $F(\cdot)$ := CDF of demand

$\mathcal{R}_T((\alpha, \beta), \theta, (\alpha + \beta - i, \theta)) = -h\alpha$ for $0 \leq i \leq \alpha + \beta - 1$

$\mathcal{R}_T((\alpha, \beta), \theta, (0, \theta)) = -h\alpha - p \cdot \sum_{j=\alpha+\beta+1}^{\infty} f(j) \cdot (j - (\alpha + \beta)) = -h\alpha - p \cdot \big(\lambda(1 - F(\alpha + \beta - 1)) - (\alpha + \beta)(1 - F(\alpha + \beta))\big)$
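A quick numeric check of the closed-form stockout term, as a standalone sketch assuming Poisson demand with mean $\lambda$ (the hypothetical parameter values and the truncation at j = 200 are only for the illustration):

from math import exp, factorial

lam, alpha, beta = 2.5, 1, 2        # hypothetical parameter values
k = alpha + beta

def f(j):                           # Poisson PMF
    return exp(-lam) * lam ** j / factorial(j)

def F(j):                           # Poisson CDF
    return sum(f(i) for i in range(j + 1))

direct = sum(f(j) * (j - k) for j in range(k + 1, 200))
closed = lam * (1 - F(k - 1)) - k * (1 - F(k))
print(direct, closed)               # both are approximately 0.4132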


State-Value Function of an MDP for a Fixed Policy

Define the Return $G_t$ from state $S_t$ as:

$G_t = \sum_{i=t+1}^{\infty} \gamma^{i-t-1} \cdot R_i = R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + \ldots$

$\gamma \in [0, 1]$ is the discount factor

State-Value Function (for policy $\pi$) $V^{\pi} : \mathcal{N} \rightarrow \mathbb{R}$ defined as:

$V^{\pi}(s) = \mathbb{E}_{\pi, \mathcal{P}_R}[G_t \mid S_t = s]$ for all $s \in \mathcal{N}$, for all $t = 0, 1, 2, \ldots$

$V^{\pi}$ is the Value Function of the $\pi$-implied MRP, satisfying the MRP Bellman Equation

$V^{\pi}(s) = \mathcal{R}^{\pi}(s) + \gamma \cdot \sum_{s' \in \mathcal{N}} \mathcal{P}^{\pi}(s, s') \cdot V^{\pi}(s')$

This yields the MDP (State-Value Function) Bellman Policy Equation

$V^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(s, a) \cdot \Big(\mathcal{R}(s, a) + \gamma \cdot \sum_{s' \in \mathcal{N}} \mathcal{P}(s, a, s') \cdot V^{\pi}(s')\Big) \quad (1)$
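For a finite MDP, Equation (1) is a linear system in $V^{\pi}$ and can be solved directly; here is a standalone numpy sketch on a hypothetical two-state $\pi$-implied MRP (not the book's rl code):

import numpy as np

gamma = 0.9
# Hypothetical pi-implied MRP over two non-terminal states:
# R_pi[s] = expected reward, P_pi[s, s'] = transition probabilities
R_pi = np.array([1.0, -0.5])
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])

# Bellman Policy Equation in matrix form: V = R_pi + gamma * P_pi V
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V)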


Action-Value Function of an MDP for a Fixed Policy

Action-Value Function (for policy $\pi$) $Q^{\pi} : \mathcal{N} \times \mathcal{A} \rightarrow \mathbb{R}$ defined as:

$Q^{\pi}(s, a) = \mathbb{E}_{\pi, \mathcal{P}_R}[G_t \mid S_t = s, A_t = a]$ for all $s \in \mathcal{N}$, $a \in \mathcal{A}$

$V^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(s, a) \cdot Q^{\pi}(s, a) \quad (2)$

Combining Equation (1) and Equation (2) yields:

$Q^{\pi}(s, a) = \mathcal{R}(s, a) + \gamma \cdot \sum_{s' \in \mathcal{N}} \mathcal{P}(s, a, s') \cdot V^{\pi}(s') \quad (3)$

Combining Equation (3) and Equation (2) yields:

$Q^{\pi}(s, a) = \mathcal{R}(s, a) + \gamma \cdot \sum_{s' \in \mathcal{N}} \mathcal{P}(s, a, s') \sum_{a' \in \mathcal{A}} \pi(s', a') \cdot Q^{\pi}(s', a') \quad (4)$

MDP Prediction Problem: Evaluating $V^{\pi}(\cdot)$ and $Q^{\pi}(\cdot)$ for a fixed policy $\pi$

MDP State-Value Function Bellman Policy Equation


MDP Action-Value Function Bellman Policy Equation


Optimal Value Functions

Optimal State-Value Function $V^* : \mathcal{N} \rightarrow \mathbb{R}$ defined as:

$V^*(s) = \max_{\pi \in \Pi} V^{\pi}(s)$ for all $s \in \mathcal{N}$

where $\Pi$ is the space of all stationary (stochastic) policies

For each $s$, maximize $V^{\pi}(s)$ across choices of $\pi \in \Pi$

Does this mean we could have different maximizing $\pi$ for different $s$?

We'll answer this question later

Optimal Action-Value Function $Q^* : \mathcal{N} \times \mathcal{A} \rightarrow \mathbb{R}$ defined as:

$Q^*(s, a) = \max_{\pi \in \Pi} Q^{\pi}(s, a)$ for all $s \in \mathcal{N}$, $a \in \mathcal{A}$


Bellman Optimality Equations

$V^*(s) = \max_{a \in \mathcal{A}} Q^*(s, a) \quad (5)$

$Q^*(s, a) = \mathcal{R}(s, a) + \gamma \cdot \sum_{s' \in \mathcal{N}} \mathcal{P}(s, a, s') \cdot V^*(s') \quad (6)$

These yield the MDP State-Value Function Bellman Optimality Equation

$V^*(s) = \max_{a \in \mathcal{A}} \Big\{\mathcal{R}(s, a) + \gamma \cdot \sum_{s' \in \mathcal{N}} \mathcal{P}(s, a, s') \cdot V^*(s')\Big\} \quad (7)$

and the MDP Action-Value Function Bellman Optimality Equation

$Q^*(s, a) = \mathcal{R}(s, a) + \gamma \cdot \sum_{s' \in \mathcal{N}} \mathcal{P}(s, a, s') \cdot \max_{a' \in \mathcal{A}} Q^*(s', a') \quad (8)$

MDP Control Problem: Computing $V^*(\cdot)$ and $Q^*(\cdot)$
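As a hedged preview (the DP algorithms built on these equations are developed later, not in this chapter): Equation (7) can be treated as a fixed-point iteration; a standalone numpy sketch on a hypothetical two-state, two-action MDP:

import numpy as np

gamma = 0.9
# Hypothetical finite MDP: R[s, a] and P[s, a, s'] for 2 states, 2 actions
R = np.array([[1.0, 0.0],
              [-0.5, 2.0]])
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])

V = np.zeros(2)
while True:
    Q = R + gamma * (P @ V)         # right-hand side of Equation (6)
    V_new = Q.max(axis=1)           # Equation (5), i.e. Equation (7)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print(V, Q.argmax(axis=1))          # V* and a greedy (deterministic) optimal policy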


MDP State-Value Function Bellman Optimality Equation


MDP Action-Value Function Bellman Optimality Equation


Optimal Policy

Bellman Optimality Equations don’t directly solve Control

Because (unlike the Bellman Policy Equations) these are non-linear

But these equations form the foundations of DP/RL algos for Control

But will solving Control give us the Optimal Policy?

What does Optimal Policy mean anyway?

What if different $\pi$ maximize $V^{\pi}(s)$ for different $s$?

So define an Optimal Policy $\pi^*$ as one that “dominates” all other $\pi$:

$\pi^* \in \Pi$ is an Optimal Policy if $V^{\pi^*}(s) \geq V^{\pi}(s)$ for all $\pi$ and for all $s$

Is there an Optimal Policy $\pi^*$ such that $V^*(s) = V^{\pi^*}(s)$ for all $s$?


Optimal Policy achieves Optimal Value Function

Theorem

For any (discrete-time, countable-spaces, time-homogeneous) MDP:

There exists an Optimal Policy $\pi^* \in \Pi$, i.e., there exists a Policy $\pi^* \in \Pi$ such that $V^{\pi^*}(s) \geq V^{\pi}(s)$ for all policies $\pi \in \Pi$ and for all states $s \in \mathcal{N}$

All Optimal Policies achieve the Optimal Value Function, i.e., $V^{\pi^*}(s) = V^*(s)$ for all $s \in \mathcal{N}$, for all Optimal Policies $\pi^*$

All Optimal Policies achieve the Optimal Action-Value Function, i.e., $Q^{\pi^*}(s, a) = Q^*(s, a)$ for all $s \in \mathcal{N}$, for all $a \in \mathcal{A}$, for all Optimal Policies $\pi^*$


Proof Outline

For any two Optimal Policies $\pi^*_1$ and $\pi^*_2$, $V^{\pi^*_1}(s) = V^{\pi^*_2}(s)$ for all $s \in \mathcal{N}$

Construct a candidate Optimal (Deterministic) Policy $\pi^*_D : \mathcal{N} \rightarrow \mathcal{A}$:

$\pi^*_D(s) = \arg\max_{a \in \mathcal{A}} Q^*(s, a)$ for all $s \in \mathcal{N}$

$\pi^*_D$ achieves the Optimal Value Functions $V^*$ and $Q^*$:

$V^*(s) = Q^*(s, \pi^*_D(s))$ for all $s \in \mathcal{N}$

$V^{\pi^*_D}(s) = V^*(s)$ for all $s \in \mathcal{N}$

$Q^{\pi^*_D}(s, a) = Q^*(s, a)$ for all $s \in \mathcal{N}$, for all $a \in \mathcal{A}$

$\pi^*_D$ is an Optimal Policy:

$V^{\pi^*_D}(s) \geq V^{\pi}(s)$ for all policies $\pi \in \Pi$ and for all states $s \in \mathcal{N}$
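The candidate policy construction above is just a greedy read-off from $Q^*$; as a standalone one-liner on a hypothetical table of optimal action-values:

# Hypothetical Q*(s, a) table
q_star = {"s1": {"a1": 3.2, "a2": 4.1},
          "s2": {"a1": 0.7, "a2": 0.3}}

# pi*_D(s) = argmax_a Q*(s, a)
pi_star_d = {s: max(q, key=q.get) for s, q in q_star.items()}
print(pi_star_d)   # {'s1': 'a2', 's2': 'a1'}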


State Space Size and Transitions Complexity

Tabular Algorithms for State Spaces that are not too large

In real-world problems, state spaces are very large, infinite, or continuous

Curse of Dimensionality: Size Explosion as a function of dimensions

Curse of Modeling: Transition Probabilities hard to model/estimate

Dimension-Reduction techniques, Unsupervised ML methods

Function Approximation of the Value Function (in ADP and RL)

Sampling, Sampling, Sampling ... (in ADP and RL)


Action Space Sizes

Large Action Spaces: Hard to represent, estimate and evaluate:

Policy $\pi$

Action-Value Function for a policy, $Q^{\pi}$

Optimal Action-Value Function $Q^*$

Large Action Spaces also make it hard to calculate $\arg\max_a Q(s, a)$

Optimization over Action Space for each non-terminal state

Policy Gradient is a technique to deal with large action spaces


Time-Steps Variants and Continuity

Time-Steps: terminating (episodic) or non-terminating (continuing)

Discounted or Undiscounted MDPs, Average-Reward MDPs

Continuous-time MDPs: Stochastic Processes and Stochastic Calculus

When States/Actions/Time are all continuous: the Hamilton-Jacobi-Bellman (HJB) Equation


Partially-Observable Markov Decision Process (POMDP)

Two different notions of State:

Internal representation of the environment at each time step $t$ (call it $S^{(e)}_t$)

The agent's state at each time step $t$ (call it $S^{(a)}_t$)

We assumed $S^{(e)}_t = S^{(a)}_t$ ($= S_t$) and that $S_t$ is fully observable

A more general framework assumes the agent sees Observations $O_t$

The Agent cannot see (or infer) $S^{(e)}_t$ from the history of observations

This more general framework is called POMDP

A POMDP is specified with Observation Space $\mathcal{O}$ and observation probability function $\mathcal{Z} : \mathcal{S} \times \mathcal{A} \times \mathcal{O} \rightarrow [0, 1]$ defined as:

$\mathcal{Z}(s', a, o) = \mathbb{P}[O_{t+1} = o \mid S_{t+1} = s', A_t = a]$

Along with the usual transition probabilities specification $\mathcal{P}_R$

MDP is a special case of POMDP with $O_t = S^{(e)}_t = S^{(a)}_t = S_t$


POMDP Framework


Belief States, Tractability and Modeling

Agent doesn’t have knowledge of St , only of Ot

So Agent has to “guess” St by maintaining Belief States

b(h)t = (P[St = s1|Ht = h],P[St = s2|Ht = h], . . .)

where history Ht is all data known to agent by time t:

Ht := (O0,R0,A0,O1,R1,A1, . . . ,Ot ,Rt)

Ht satisfies Markov Property ⇒ b(h)t satisfies Markov Property

POMDP yields (huge) MDP whose states are POMDP’s belief states

Real-world: Model as accurate POMDP or approx as tractable MDP?
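Maintaining the belief state amounts to a Bayes update after each (action, observation) pair; here is a standalone discrete-state sketch (not from the book; the tables and names are hypothetical), using the state transition probabilities $\mathcal{P}$ and the observation function $\mathcal{Z}$ defined above:

def belief_update(b, a, o, P, Z):
    # b'(s') is proportional to Z(s', a, o) * sum_s b(s) * P(s, a, s')
    states = {s1 for dist in P.values() for s1 in dist}
    unnormalized = {s1: Z[(s1, a)].get(o, 0.0) *
                    sum(b[s] * P[(s, a)].get(s1, 0.0) for s in b)
                    for s1 in states}
    total = sum(unnormalized.values())
    return {s1: p / total for s1, p in unnormalized.items()}

# Hypothetical 2-state example: P[(s, a)][s'] and Z[(s', a)][o]
P = {("rainy", "wait"): {"rainy": 0.7, "sunny": 0.3},
     ("sunny", "wait"): {"rainy": 0.2, "sunny": 0.8}}
Z = {("rainy", "wait"): {"umbrella": 0.9, "no_umbrella": 0.1},
     ("sunny", "wait"): {"umbrella": 0.2, "no_umbrella": 0.8}}
b = {"rainy": 0.5, "sunny": 0.5}
print(belief_update(b, "wait", "umbrella", P, Z))   # rainy ≈ 0.786, sunny ≈ 0.214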


Key Takeaways from this Chapter

MDP Bellman Policy Equations

MDP Bellman Optimality Equations

Existence of an Optimal Policy, and of each Optimal Policy achieving the Optimal Value Function
