A Guided Tour of Chapter 2: Markov Decision Process and Bellman Equations
Ashwin Rao
ICME, Stanford University
Ashwin Rao (Stanford) MDP Chapter 1 / 31
Developing Intuition on Optimal Sequential Decisioning
Chapter 1 covered “Sequential Uncertainty” and the notion of “Rewards”
Here we extend the framework to include “Sequential Decisioning”
Developing intuition by revisiting the Inventory example
Over-ordering risks “holding costs” of overnight inventory
Under-ordering risks “stockout costs” (empty shelves are more damaging)
Orders influence future inventory levels, and consequent future orders
Also need to deal with delayed costs and demand uncertainty
Intuition on how challenging it is to determine Optimal Actions
Cyclic interplay between the Agent and Environment
Unlike supervised learning, there’s no “teacher” here (only Rewards)
MDP Definition for Discrete Time, Countable States
Definition
A Markov Decision Process (MDP) comprises:
A countable set of states S (the State Space), a set T ⊆ S (the set of Terminal States), and a countable set of actions A
A time-indexed sequence of environment-generated pairs of random states St ∈ S and random rewards Rt ∈ D (a countable subset of R), alternating with agent-controllable actions At ∈ A, for time steps t = 0, 1, 2, . . .
Markov Property: P[(Rt+1, St+1) | (St, At, St−1, At−1, . . . , S0, A0)] = P[(Rt+1, St+1) | (St, At)] for all t ≥ 0
Termination: If an outcome for ST (for some time step T) is a state in the set T, then this sequence outcome terminates at time step T:
S0, A0, R1, S1, A1, R2, S2, A2, . . . , ST−1, AT−1, RT, ST
Time-Homogeneity, Transition Function, Reward Functions
Time-Homogeneity: P[(Rt+1, St+1) | (St, At)] independent of t
⇒ Transition Probability Function PR : N × A × D × S → [0, 1]:
PR(s, a, r, s′) = P[(Rt+1 = r, St+1 = s′) | St = s, At = a]
State Transition Probability Function P : N × A × S → [0, 1]:
P(s, a, s′) = ∑r∈D PR(s, a, r, s′)
Reward Transition Function RT : N × A × S → R defined as:
RT(s, a, s′) = E[Rt+1 | (St+1 = s′, St = s, At = a)]
= ∑r∈D (PR(s, a, r, s′) / P(s, a, s′)) · r = ∑r∈D (PR(s, a, r, s′) / ∑r∈D PR(s, a, r, s′)) · r
Reward Function R : N × A → R defined as:
R(s, a) = E[Rt+1 | (St = s, At = a)] = ∑s′∈S ∑r∈D PR(s, a, r, s′) · r
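The derived functions P, RT and R can be computed directly from a tabular PR. Below is a minimal sketch; the two-state MDP, its action names, and all probabilities are hypothetical, invented purely for illustration:

```python
# Hypothetical tabular PR: PR[(s, a)] maps (reward, next state) -> probability.
PR = {
    ("s1", "a"): {(1.0, "s1"): 0.3, (2.0, "s2"): 0.7},
    ("s1", "b"): {(0.0, "s1"): 1.0},
}

def transition_prob(s, a, s1):
    # P(s, a, s') = sum over rewards r of PR(s, a, r, s')
    return sum(p for (r, t), p in PR[(s, a)].items() if t == s1)

def reward_function(s, a):
    # R(s, a) = sum over s' and r of PR(s, a, r, s') * r
    return sum(p * r for (r, t), p in PR[(s, a)].items())

def reward_transition(s, a, s1):
    # RT(s, a, s') = (sum_r PR(s, a, r, s') * r) / P(s, a, s')
    return sum(p * r for (r, t), p in PR[(s, a)].items()
               if t == s1) / transition_prob(s, a, s1)
```

Each function is a direct transcription of the corresponding summation on this slide.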
Policy: Function defining the Behavior of the Agent
A Policy is an Agent-controlled function π : N ×A → [0, 1]
π(s, a) = P[At = a|St = s] for all time steps t = 0, 1, 2, . . .
Above definition assumes Policy is Markovian and Stationary
If not stationary, we can include time in State to make it stationary
We denote a deterministic policy as a function πD : N → A
π(s, πD(s)) = 1 and π(s, a) = 0 for all a ∈ A with a ≠ πD(s)
```python
class Policy(ABC, Generic[S, A]):
    @abstractmethod
    def act(self, state: NonTerminal[S]) -> Distribution[A]:
        pass
```
[MDP, Policy] := MRP
PπR(s, r, s′) = ∑a∈A π(s, a) · PR(s, a, r, s′)
Pπ(s, s′) = ∑a∈A π(s, a) · P(s, a, s′)
RπT(s, s′) = ∑a∈A π(s, a) · RT(s, a, s′)
Rπ(s) = ∑a∈A π(s, a) · R(s, a)
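These π-weighted averages are straightforward to compute for a finite MDP. A minimal sketch follows; the states, actions, probabilities and rewards are made-up numbers, not taken from the chapter:

```python
# Hypothetical MDP with one non-terminal state "s1" and two actions.
P = {("s1", "a", "s1"): 0.3, ("s1", "a", "s2"): 0.7,
     ("s1", "b", "s1"): 1.0, ("s1", "b", "s2"): 0.0}
R = {("s1", "a"): 1.7, ("s1", "b"): 0.0}
pi = {"s1": {"a": 0.5, "b": 0.5}}      # stochastic policy pi(s, a)

def mrp_transition(s, s1):
    # P_pi(s, s') = sum_a pi(s, a) * P(s, a, s')
    return sum(pa * P[(s, a, s1)] for a, pa in pi[s].items())

def mrp_reward(s):
    # R_pi(s) = sum_a pi(s, a) * R(s, a)
    return sum(pa * R[(s, a)] for a, pa in pi[s].items())
```

Averaging out the action dimension in this way is exactly how a policy "collapses" an MDP into an MRP.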
@abstractclass MarkovDecisionProcess
```python
class MarkovDecisionProcess(ABC, Generic[S, A]):
    @abstractmethod
    def actions(self, state: NonTerminal[S]) -> Iterable[A]:
        pass

    @abstractmethod
    def step(self, state: NonTerminal[S], action: A) \
            -> Distribution[Tuple[State[S], float]]:
        pass
```
@abstractclass MarkovDecisionProcess
```python
    def apply_policy(self, policy: Policy[S, A]) \
            -> MarkovRewardProcess[S]:
        mdp = self

        class RewardProcess(MarkovRewardProcess[S]):
            def transition_reward(
                self,
                st: NonTerminal[S]
            ) -> Distribution[Tuple[State[S], float]]:
                actions: Distribution[A] = policy.act(st)
                return actions.apply(
                    lambda a: mdp.step(st, a)
                )

        return RewardProcess()
```
Finite Markov Decision Process
Finite State Space S = {s1, s2, . . . , sn}, |N| = m ≤ n
Action Space A(s) is finite for each s ∈ N
Finite set of (next state, reward) transitions
We’d like a sparse representation for PR
Conceptualize PR : N × A × D × S → [0, 1] as:
N → (A → (S × D → [0, 1]))
```python
StateReward = FiniteDistribution[Tuple[State[S], float]]
ActionMapping = Mapping[A, StateReward[S]]
StateActionMapping = Mapping[NonTerminal[S], ActionMapping[A, S]]
```
class FiniteMarkovDecisionProcess
```python
class FiniteMarkovDecisionProcess(MarkovDecisionProcess[S, A]):
    m: StateActionMapping[S, A]
    nt_states: Sequence[NonTerminal[S]]

    def __init__(self, m: Mapping[S, Mapping[A,
                 FiniteDistribution[Tuple[S, float]]]]):
        nt: Set[S] = set(m.keys())
        self.m = {NonTerminal(s): {a: Categorical(
            {(NonTerminal(s1) if s1 in nt else Terminal(s1), r): p
             for (s1, r), p in v.table().items()}
        ) for a, v in d.items()} for s, d in m.items()}
        self.nt_states = list(self.m.keys())
```
class FinitePolicy
```python
    def step(self, state: NonTerminal[S], action: A) -> StateReward:
        return self.m[state][action]
```

```python
@dataclass(frozen=True)
class FinitePolicy(Policy[S, A]):
    policy_map: Mapping[S, FiniteDistribution[A]]

    def act(self, state: NonTerminal[S]) \
            -> FiniteDistribution[A]:
        return self.policy_map[state.state]
```
With this, we can write a method for FiniteMarkovDecisionProcess thattakes a FinitePolicy and produces a FiniteMarkovRewardProcess
Inventory MDP
α := On-Hand Inventory, β := On-Order Inventory
h := Holding Cost (per unit of overnight inventory)
p := Stockout Cost (per unit of missed demand)
C := Shelf Capacity (number of inventory units shelf can hold)
S = {(α, β) : 0 ≤ α + β ≤ C}
A((α, β)) = {θ : 0 ≤ θ ≤ C − (α + β)}
f(·) := PMF of demand (Poisson with mean λ), F(·) := CDF of demand
RT((α, β), θ, (α + β − i, θ)) = −hα for 0 ≤ i ≤ α + β − 1
RT((α, β), θ, (0, θ)) = −hα − p · ∑j=α+β+1 to ∞ f(j) · (j − (α + β))
= −hα − p · (λ(1 − F(α + β − 1)) − (α + β)(1 − F(α + β)))
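The closed form on this slide can be checked numerically against the direct (truncated) summation of expected missed demand, assuming Poisson demand. The parameter values below (λ, h, p, α, β) are arbitrary, chosen only for the check:

```python
from math import exp, factorial

lam, h, p = 2.0, 1.0, 10.0   # Poisson mean, holding cost, stockout cost (arbitrary)
alpha, beta = 1, 2           # example on-hand and on-order inventory

def f(j):                    # Poisson PMF
    return exp(-lam) * lam ** j / factorial(j)

def F(k):                    # Poisson CDF
    return sum(f(j) for j in range(k + 1))

kappa = alpha + beta
# Direct expectation of missed demand beyond kappa (tail truncated at j = 99)
direct = -h * alpha - p * sum(f(j) * (j - kappa) for j in range(kappa + 1, 100))
# Closed form from the slide
closed = -h * alpha - p * (lam * (1 - F(kappa - 1)) - kappa * (1 - F(kappa)))
```

The two values agree because, for Poisson demand, j · f(j) = λ · f(j − 1), which telescopes the tail sum into the CDF terms.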
State-Value Function of an MDP for a Fixed Policy
Define the Return Gt from state St as:
Gt = ∑i=t+1 to ∞ γ^(i−t−1) · Ri = Rt+1 + γ · Rt+2 + γ² · Rt+3 + . . .
γ ∈ [0, 1] is the discount factor
State-Value Function (for policy π) V π : N → R defined as:
V π(s) = Eπ,PR[Gt |St = s] for all s ∈ N , for all t = 0, 1, 2, . . .
V π is the Value Function of the π-implied MRP, satisfying the MRP Bellman Equation
V π(s) = Rπ(s) + γ · ∑s′∈N Pπ(s, s′) · V π(s′)
This yields the MDP (State-Value Function) Bellman Policy Equation
V π(s) = ∑a∈A π(s, a) · (R(s, a) + γ · ∑s′∈N P(s, a, s′) · V π(s′)) . . . (1)
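Equation (1) can be turned directly into iterative policy evaluation: repeatedly apply its right-hand side as an update until the values stop changing. A sketch on a hypothetical two-state, two-action MDP (all numbers invented for illustration):

```python
# Iterative policy evaluation:
# V(s) <- sum_a pi(s,a) * (R(s,a) + gamma * sum_s' P(s,a,s') * V(s'))
gamma = 0.9
states, actions = ["s1", "s2"], ["a", "b"]
P = {("s1", "a", "s1"): 0.5, ("s1", "a", "s2"): 0.5,
     ("s1", "b", "s1"): 1.0, ("s1", "b", "s2"): 0.0,
     ("s2", "a", "s1"): 0.0, ("s2", "a", "s2"): 1.0,
     ("s2", "b", "s1"): 0.5, ("s2", "b", "s2"): 0.5}
R = {("s1", "a"): 1.0, ("s1", "b"): 0.0,
     ("s2", "a"): 2.0, ("s2", "b"): 0.0}
pi = {"s1": {"a": 0.5, "b": 0.5}, "s2": {"a": 0.5, "b": 0.5}}

def backup(V):
    # One application of the right-hand side of Equation (1)
    return {s: sum(pi[s][a] * (R[(s, a)] + gamma * sum(
        P[(s, a, s1)] * V[s1] for s1 in states))
        for a in actions) for s in states}

V = {s: 0.0 for s in states}
for _ in range(500):
    V = backup(V)          # converges since gamma < 1 (contraction)
```

After convergence, V is (numerically) a fixed point of the Bellman Policy Equation.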
Action-Value Function of an MDP for a Fixed Policy
Action-Value Function (for policy π) Qπ : N ×A → R defined as:
Qπ(s, a) = Eπ,PR[Gt | (St = s, At = a)] for all s ∈ N, a ∈ A
V π(s) = ∑a∈A π(s, a) · Qπ(s, a) . . . (2)
Combining Equation (1) and Equation (2) yields:
Qπ(s, a) = R(s, a) + γ · ∑s′∈N P(s, a, s′) · V π(s′) . . . (3)
Combining Equation (3) and Equation (2) yields:
Qπ(s, a) = R(s, a) + γ · ∑s′∈N P(s, a, s′) · ∑a′∈A π(s′, a′) · Qπ(s′, a′) . . . (4)
MDP Prediction Problem: Evaluating V π(·) and Qπ(·) for a fixed policy π
Optimal Value Functions
Optimal State-Value Function V ∗ : N → R defined as:
V ∗(s) = maxπ∈Π V π(s) for all s ∈ N
where Π is the space of all stationary (stochastic) policies
For each s, maximize V π(s) across choices of π ∈ Π
Does this mean we could have different maximizing π for different s?
We’ll answer this question later
Optimal Action-Value Function Q∗ : N ×A → R defined as:
Q∗(s, a) = maxπ∈Π Qπ(s, a) for all s ∈ N, a ∈ A
Bellman Optimality Equations
V ∗(s) = maxa∈A Q∗(s, a) . . . (5)
Q∗(s, a) = R(s, a) + γ · ∑s′∈N P(s, a, s′) · V ∗(s′) . . . (6)
These yield the MDP State-Value Function Bellman Optimality Equation
V ∗(s) = maxa∈A {R(s, a) + γ · ∑s′∈N P(s, a, s′) · V ∗(s′)} . . . (7)
and the MDP Action-Value Function Bellman Optimality Equation
Q∗(s, a) = R(s, a) + γ · ∑s′∈N P(s, a, s′) · maxa′∈A Q∗(s′, a′) . . . (8)
MDP Control Problem: Computing V ∗(·) and Q∗(·)
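Equation (7) suggests one of the classic Control algorithms, Value Iteration: apply the optimality backup repeatedly until convergence. A sketch on a hypothetical two-state, two-action MDP (the numbers are invented, not from the chapter):

```python
# Value Iteration: V(s) <- max_a { R(s,a) + gamma * sum_s' P(s,a,s') * V(s') }
gamma = 0.9
states, actions = ["s1", "s2"], ["a", "b"]
P = {("s1", "a", "s1"): 0.5, ("s1", "a", "s2"): 0.5,
     ("s1", "b", "s1"): 1.0, ("s1", "b", "s2"): 0.0,
     ("s2", "a", "s1"): 0.0, ("s2", "a", "s2"): 1.0,
     ("s2", "b", "s1"): 0.5, ("s2", "b", "s2"): 0.5}
R = {("s1", "a"): 1.0, ("s1", "b"): 0.0,
     ("s2", "a"): 2.0, ("s2", "b"): 0.0}

def optimal_backup(V):
    # One application of the right-hand side of Equation (7)
    return {s: max(R[(s, a)] + gamma * sum(
        P[(s, a, s1)] * V[s1] for s1 in states) for a in actions)
        for s in states}

V = {s: 0.0 for s in states}
for _ in range(1000):
    V = optimal_backup(V)   # gamma-contraction: converges to V*
```

For this toy MDP the fixed point can be solved by hand: always choosing action "a" gives V∗(s2) = 2/(1 − 0.9) = 20 and V∗(s1) = 10/0.55 = 200/11.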
Optimal Policy
Bellman Optimality Equations don’t directly solve Control
Because, unlike the Bellman Policy Equations, these are non-linear
But these equations form the foundations of DP/RL algos for Control
But will solving Control give us the Optimal Policy?
What does Optimal Policy mean anyway?
What if different π maximize V π(s) for different s?
So define an Optimal Policy π∗ as one that “dominates” all other π:
π∗ ∈ Π is an Optimal Policy if V π∗(s) ≥ V π(s) for all π and for all s
Is there an Optimal Policy π∗ such that V ∗(s) = V π∗(s) for all s?
Optimal Policy achieves Optimal Value Function
Theorem
For any (discrete-time, countable-spaces, time-homogeneous) MDP:
There exists an Optimal Policy π∗ ∈ Π, i.e., there exists a Policy π∗ ∈ Π such that V π∗(s) ≥ V π(s) for all policies π ∈ Π and for all states s ∈ N
All Optimal Policies achieve the Optimal Value Function, i.e. V π∗(s) = V ∗(s) for all s ∈ N, for all Optimal Policies π∗
All Optimal Policies achieve the Optimal Action-Value Function, i.e. Qπ∗(s, a) = Q∗(s, a) for all s ∈ N, for all a ∈ A, for all Optimal Policies π∗
Proof Outline
For any Optimal Policies π∗1 and π∗2, V π∗1(s) = V π∗2(s) for all s ∈ N
Construct a candidate Optimal (Deterministic) Policy π∗D : N → A:
π∗D(s) = argmaxa∈A Q∗(s, a) for all s ∈ N
π∗D achieves the Optimal Value Functions V ∗ and Q∗:
V ∗(s) = Q∗(s, π∗D(s)) for all s ∈ N
V π∗D (s) = V ∗(s) for all s ∈ N
Qπ∗D (s, a) = Q∗(s, a) for all s ∈ N , for all a ∈ A
π∗D is an Optimal Policy:
V π∗D (s) ≥ V π(s) for all policies π ∈ Π and for all states s ∈ N
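The greedy construction at the heart of this proof sketch is one line of code. A minimal illustration; the Q∗ values below are made up, not computed from any MDP:

```python
# Greedy deterministic policy extraction: pi*_D(s) = argmax_a Q*(s, a)
Q_star = {("s1", "a"): 18.2, ("s1", "b"): 16.4,
          ("s2", "a"): 20.0, ("s2", "b"): 17.1}

def greedy_policy(Q, states, actions):
    # For each state, pick the action with the highest Q-value
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

pi_star = greedy_policy(Q_star, ["s1", "s2"], ["a", "b"])
```

This is why solving the Control problem (computing Q∗) is enough: an Optimal Policy falls out of Q∗ for free.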
State Space Size and Transitions Complexity
Tabular Algorithms for State Spaces that are not too large
In the real world, state spaces are often very large, infinite, or continuous
Curse of Dimensionality: Size Explosion as a function of dimensions
Curse of Modeling: Transition Probabilities hard to model/estimate
Dimension-Reduction techniques, Unsupervised ML methods
Function Approximation of the Value Function (in ADP and RL)
Sampling, Sampling, Sampling ... (in ADP and RL)
Action Space Sizes
Large Action Spaces: hard to represent, estimate and evaluate:
Policy π
Action-Value Function Qπ for a policy π
Optimal Action-Value Function Q∗
Large Action Spaces make it hard to calculate argmaxa Q(s, a)
Optimization over the Action Space for each non-terminal state
Policy Gradient is a technique to deal with large action spaces
Time-Steps Variants and Continuity
Time-Steps: terminating (episodic) or non-terminating (continuing)
Discounted or Undiscounted MDPs, Average-Reward MDPs
Continuous-time MDPs: Stochastic Processes and Stochastic Calculus
When States/Actions/Time are all continuous, we get the Hamilton-Jacobi-Bellman (HJB) Equation
Partially-Observable Markov Decision Process (POMDP)
Two different notions of State:
The Environment’s internal representation at each time step t (call it S(e)t)
The Agent’s state at each time step t (call it S(a)t)
We assumed S(e)t = S(a)t (= St) and that St is fully observable
A more general framework assumes the agent sees only Observations Ot
The agent cannot see (or fully infer) S(e)t from the history of observations
This more general framework is called POMDP
A POMDP is specified with an Observation Space O and an observation probability function Z : S × A × O → [0, 1] defined as:
Z(s′, a, o) = P[Ot+1 = o | (St+1 = s′, At = a)]
Along with the usual transition probabilities specification PR
MDP is a special case of POMDP with Ot = S(e)t = S(a)t = St
Belief States, Tractability and Modeling
Agent doesn’t have knowledge of St, only of Ot
So Agent has to “guess” St by maintaining Belief States
b(h)t = (P[St = s1 | Ht = h], P[St = s2 | Ht = h], . . .)
where history Ht is all data known to the agent by time t:
Ht := (O0, R0, A0, O1, R1, A1, . . . , Ot, Rt)
Ht satisfies the Markov Property ⇒ b(h)t satisfies the Markov Property
A POMDP yields a (huge) MDP whose states are the POMDP’s belief states
Real-world tradeoff: model as an accurate POMDP, or approximate as a tractable MDP?
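The belief-state update is just Bayes’ rule on top of the transition probabilities and Z. A sketch for a hypothetical two-state POMDP with one action and one observation value; all probabilities are invented for illustration:

```python
# One-step belief update: b'(s') ∝ Z(s', a, o) * sum_s P(s, a, s') * b(s)
states = ["s1", "s2"]
P = {("s1", "a", "s1"): 0.7, ("s1", "a", "s2"): 0.3,
     ("s2", "a", "s1"): 0.2, ("s2", "a", "s2"): 0.8}
Z = {("s1", "a", "o"): 0.9, ("s2", "a", "o"): 0.1}  # P[O=o | S'=s', A=a]

def belief_update(b, a, o):
    # Predict the next state, weight by observation likelihood, then normalize
    unnorm = {s1: Z[(s1, a, o)] * sum(P[(s, a, s1)] * b[s] for s in states)
              for s1 in states}
    total = sum(unnorm.values())
    return {s1: w / total for s1, w in unnorm.items()}

b1 = belief_update({"s1": 0.5, "s2": 0.5}, "a", "o")
```

Observing "o" (much likelier under s1) shifts the belief sharply toward s1, illustrating how the belief-state MDP evolves.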