

Improved Bounds for Policy Iteration in Markov Decision Problems

Shivaram Kalyanakrishnan
Department of Computer Science and Engineering, Indian Institute of Technology Bombay
[email protected]

November 2017

Collaborators: Neeldhara Misra, Aditya Gopalan, Utkarsh Mall, Ritish Goyal, Anchit Gupta


Sequential Decision Making

[Figure: the agent-environment interaction loop. The agent senses the state, thinks, and acts; the environment returns a reward and the next state.]

https://img.tradeindia.com/fp/1/524/panoramic-elevators-564.jpg
http://www.nature.com/polopoly_fs/7.33483.1453824868!/image/WEB_Go—1.jpg_gen/derivatives/landscape_630/WEB_Go—1.jpg
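To make the loop concrete, here is a minimal sketch (not from the slides) of the sense-think-act cycle the diagram depicts; ToyEnvironment and RandomAgent are hypothetical stand-ins for an environment and an agent.

```python
import random

class ToyEnvironment:
    """Hypothetical environment: three states, cycles forward on every step."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Reward depends (arbitrarily) on the chosen action; the state advances cyclically.
        reward = 1.0 if action == "stay" else 0.0
        self.state = (self.state + 1) % 3
        return reward, self.state

class RandomAgent:
    """Hypothetical agent whose 'Think' step is just a uniform random choice."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, state):
        return random.choice(self.actions)

env = ToyEnvironment()
agent = RandomAgent(["stay", "move"])
state = env.state
for t in range(5):
    action = agent.act(state)          # Think + Act
    reward, state = env.step(action)   # Sense the reward and next state
    print(f"t={t}: action={action}, reward={reward}, next state={state}")
```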


Markov Decision Problems (MDPs)

[Figure: a three-state MDP over s1, s2, s3 with two actions, RED and BLUE; each arrow is labelled with its transition probability and reward, e.g. 0.5, 0 and 0.75, −2. Discount factor γ = 0.9.]

Elements of an MDP
States (S)
Actions (A)
Transition probabilities (T)
Rewards (R)
Discount factor (γ)

Behaviour is encoded as a Policy π, which maps states to actions.

What is a “good” policy? One that maximises expected long-term reward.

Vπ is the Value Function of π. For s ∈ S,
Vπ(s) = Eπ[r0 + γr1 + γ²r2 + · · · | start state = s].
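As an illustration of this definition (on a hypothetical two-state MDP, not the one drawn above), the sketch below estimates Vπ(s) by averaging sampled discounted returns r0 + γr1 + γ²r2 + · · · over finite-horizon rollouts.

```python
import random

gamma = 0.9

# Hypothetical two-state MDP: T[s][a] is a list of (probability, next_state, reward) triples.
T = {
    0: {"RED":  [(1.0, 1, 1.0)],
        "BLUE": [(0.5, 0, 0.0), (0.5, 1, 3.0)]},
    1: {"RED":  [(1.0, 0, -1.0)],
        "BLUE": [(1.0, 1, 2.0)]},
}
pi = {0: "BLUE", 1: "BLUE"}            # a fixed deterministic policy

def sample_step(s, a):
    """Sample (next_state, reward) from the transition distribution of (s, a)."""
    u, acc = random.random(), 0.0
    for p, s_next, r in T[s][a]:
        acc += p
        if u <= acc:
            return s_next, r
    return T[s][a][-1][1], T[s][a][-1][2]

def estimate_value(s, episodes=5000, horizon=100):
    """Monte-Carlo estimate of V_pi(s): average discounted return from start state s."""
    total = 0.0
    for _ in range(episodes):
        state, ret, discount = s, 0.0, 1.0
        for _ in range(horizon):
            state, r = sample_step(state, pi[state])
            ret += discount * r
            discount *= gamma
        total += ret
    return total / episodes

print("estimated V_pi(s0):", round(estimate_value(0), 2))
```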


Optimal Policies

Vπ satisfies a recursive equation: Vπ = Rπ + γTπVπ, which gives
Vπ = (I − γTπ)⁻¹Rπ.

π      Vπ(s1)   Vπ(s2)   Vπ(s3)
RRR     4.45     6.55    10.82
RRB    −5.61    −5.75    −4.05
RBR     2.76     4.48     9.12
RBB     2.76     4.48     3.48
BRR    10.0      9.34    13.10
BRB    10.0      7.25    10.0
BBR    10.0     11.0     14.45   ← Optimal policy
BBB    10.0     11.0     10.0

Every MDP is guaranteed to have an optimal policy π⋆, such that
∀π ∈ Π, ∀s ∈ S : Vπ⋆(s) ≥ Vπ(s).

What is the complexity of computing an optimal policy?

Note: an MDP with |S| = n states and |A| = k actions has a total of kⁿ policies.

One extra definition needed: the Action Value Function Qπa for a ∈ A, given by
Qπa = Ra + γTaVπ.
Given π, a polynomial computation yields Vπ and Qπa for a ∈ A.
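For concreteness, the sketch below (using a small randomly generated MDP, not the three-state example above) evaluates a policy exactly via the linear solve Vπ = (I − γTπ)⁻¹Rπ and then enumerates all kⁿ deterministic policies to find an optimal one, mirroring the table. The array layout (T[a] and R[a] indexed by action) is an assumption of this sketch.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, k, gamma = 3, 2, 0.9                       # |S| = n states, |A| = k actions

# Hypothetical MDP: T[a][s] is the next-state distribution for action a in state s,
# and R[a][s] is the corresponding expected immediate reward.
T = rng.dirichlet(np.ones(n), size=(k, n))    # shape (k, n, n); each row sums to 1
R = rng.uniform(-1.0, 3.0, size=(k, n))

def evaluate(policy):
    """Exact policy evaluation: V_pi = (I - gamma * T_pi)^(-1) R_pi."""
    T_pi = np.array([T[policy[s], s] for s in range(n)])
    R_pi = np.array([R[policy[s], s] for s in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)

# Brute force over all k^n deterministic policies (feasible only for tiny MDPs);
# the optimal policy dominates every other policy in every state.
best_policy, best_value = None, None
for policy in itertools.product(range(k), repeat=n):
    V = evaluate(policy)
    if best_value is None or np.all(V >= best_value):
        best_policy, best_value = policy, V

print("optimal policy:", best_policy)
print("its values:    ", np.round(best_value, 2))
```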


Policy Improvement

Given π,
Pick one or more improvable states, and in them,
Switch to an arbitrary improving action.
Let the resulting policy be π′.

[Figure: a policy π over states s1, ..., s8; the improvable states and their improving actions are marked, and a policy improvement step switches them to yield π′.]

Policy Improvement Theorem (H60, B12):
(1) If π has no improvable states, then it is optimal, else
(2) if π′ is obtained as above, then
∀s ∈ S : Vπ′(s) ≥ Vπ(s) and ∃s ∈ S : Vπ′(s) > Vπ(s).
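A minimal sketch of one improvement step as described above, assuming the tabular (T, R, γ) arrays from the previous sketch; every improvable state is switched to an arbitrarily chosen improving action (here, the first one found).

```python
import numpy as np

def improve(policy, T, R, gamma=0.9):
    """One policy improvement step: switch every improvable state of `policy`
    to an arbitrary improving action; returns the new policy (a tuple)."""
    k, n, _ = T.shape
    # Evaluate the current policy exactly.
    T_pi = T[np.array(policy), np.arange(n)]
    R_pi = R[np.array(policy), np.arange(n)]
    V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    # Action values: Q[a, s] = R_a(s) + gamma * T_a(s, .) . V
    Q = R + gamma * T @ V
    new_policy = list(policy)
    for s in range(n):
        improving = [a for a in range(k) if Q[a, s] > V[s] + 1e-10]
        if improving:                  # s is an improvable state
            new_policy[s] = improving[0]
    return tuple(new_policy)
```

By the Policy Improvement Theorem, the returned policy weakly dominates the input and strictly improves it in at least one state; if the input has no improvable states, it is returned unchanged.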


Policy Iteration (PI)

π ← Arbitrary policy.
While π has improvable states:
    π ← PolicyImprovement(π).

Different switching strategies lead to different routes to the top.

How long are the routes?!
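A self-contained sketch of the loop above in its Howard-style form (switch every improvable state to a best improving action on each pass), again on a small hypothetical random MDP; the array layout matches the earlier sketches.

```python
import numpy as np

def policy_iteration(T, R, gamma=0.9):
    """Howard-style PI: evaluate the current policy, switch all improvable states
    to a Q-maximising action, and stop when no state is improvable."""
    k, n, _ = T.shape
    policy = np.zeros(n, dtype=int)            # arbitrary initial policy
    while True:
        T_pi = T[policy, np.arange(n)]
        R_pi = R[policy, np.arange(n)]
        V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
        Q = R + gamma * T @ V                  # Q[a, s]
        greedy = Q.argmax(axis=0)
        if np.all(Q[greedy, np.arange(n)] <= V + 1e-10):
            return policy, V                   # no improvable states: policy is optimal
        policy = greedy

rng = np.random.default_rng(1)
n, k = 4, 3
T = rng.dirichlet(np.ones(n), size=(k, n))     # hypothetical transition probabilities
R = rng.uniform(-1.0, 2.0, size=(k, n))        # hypothetical rewards
policy, V = policy_iteration(T, R)
print("optimal policy:", policy, " values:", np.round(V, 2))
```

Other switching strategies differ only in which improvable states are switched and to which improving actions; the loop structure stays the same.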


Switching Strategies and Bounds

Upper bounds on number of iterations

PI Variant                                 Type           k = 2       General k
Howard’s PI [H60, MS99]                    Deterministic  O(2ⁿ/n)     O(kⁿ/n)
Mansour and Singh’s Randomised PI [MS99]   Randomised     1.7172ⁿ     ≈ O((k/2)ⁿ)
Batch-switching PI [KMG16a, GK17]          Deterministic  1.6479ⁿ     k^(0.7207n)
Recursive Simple PI [KMG16b]               Randomised     –           (2 + ln(k − 1))ⁿ

Lower bounds on number of iterations

Ω(n)         Howard’s PI on n-state, 2-action MDPs [HZ10].
Ω(1.4142ⁿ)   Simple PI on n-state, 2-action MDPs [MC94].


Recursive Simple Policy Iteration

Given π,
Pick the improvable state with the highest index, and
Switch to an improving action picked uniformly at random.
Let the resulting policy be π′.

[Figure: a policy π over states s1, ..., s8, with its highest-indexed improvable state marked.]

Expected number of iterations: (1 + Hₖ₋₁)ⁿ ≤ (2 + ln(k − 1))ⁿ.
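A minimal sketch of the switching rule described above (highest-indexed improvable state, improving action drawn uniformly at random), using the same tabular (T, R, γ) layout as the earlier sketches; it illustrates the update rule only, not the recursive argument behind the (1 + Hₖ₋₁)ⁿ bound.

```python
import numpy as np

def rspi_switch(policy, T, R, gamma=0.9, rng=None):
    """One switch of (Recursive) Simple PI: find the highest-indexed improvable
    state and move it to an improving action chosen uniformly at random.
    Returns the new policy, or None if the current policy is optimal."""
    if rng is None:
        rng = np.random.default_rng()
    k, n, _ = T.shape
    T_pi = T[np.array(policy), np.arange(n)]
    R_pi = R[np.array(policy), np.arange(n)]
    V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    Q = R + gamma * T @ V                      # Q[a, s]
    for s in reversed(range(n)):               # highest index first
        improving = [a for a in range(k) if Q[a, s] > V[s] + 1e-10]
        if improving:
            new_policy = list(policy)
            new_policy[s] = int(rng.choice(improving))
            return tuple(new_policy)
    return None                                # no improvable states: optimal
```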


Conclusion

Policy Iteration: a widely used algorithm, more than half a century old.

A substantial gap exists between the known upper and lower bounds.

We furnish several exponential improvements to the upper bounds.

PI bears similarity to the Simplex algorithm for Linear Programming.

Howard’s PI works much better in practice than the variants for which we have shown improved upper bounds!

Open problem: Is the number of iterations taken by Howard’s PI on n-state, 2-action MDPs upper-bounded by the (n + 2)-nd Fibonacci number?

For references, see the tutorial:
Theoretical Analysis of Policy Iteration, Tutorial at IJCAI 2017,
https://www.cse.iitb.ac.in/~shivaram/resources/ijcai-2017-tutorial-policyiteration/index.html

Thank you!