

Improved Bounds for Policy Iteration in Markov Decision Problems

Shivaram Kalyanakrishnan
Department of Computer Science and Engineering, Indian Institute of Technology Bombay
[email protected]

November 2017

Collaborators: Neeldhara Misra, Aditya Gopalan, Utkarsh Mall, Ritish Goyal, Anchit Gupta


Sequential Decision Making

[Figure: the agent-environment interaction loop. The agent senses the state, thinks, and acts; the environment returns a reward and the next state.]

https://img.tradeindia.com/fp/1/524/panoramic-elevators-564.jpg
http://www.nature.com/polopoly_fs/7.33483.1453824868!/image/WEB_Go—1.jpg_gen/derivatives/landscape_630/WEB_Go—1.jpg
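To make the loop concrete, here is a minimal sketch (not from the slides) of the sense-think-act cycle the diagram depicts; ToyEnvironment and RandomAgent are hypothetical stand-ins for an environment and an agent.

```python
import random

class ToyEnvironment:
    """Hypothetical environment: three states, cycles forward on every step."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Reward depends (arbitrarily) on the chosen action; the state advances cyclically.
        reward = 1.0 if action == "stay" else 0.0
        self.state = (self.state + 1) % 3
        return reward, self.state

class RandomAgent:
    """Hypothetical agent whose 'Think' step is just a uniform random choice."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, state):
        return random.choice(self.actions)

env = ToyEnvironment()
agent = RandomAgent(["stay", "move"])
state = env.state
for t in range(5):
    action = agent.act(state)          # Think + Act
    reward, state = env.step(action)   # Sense the reward and next state
    print(f"t={t}: action={action}, reward={reward}, next state={state}")
```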


Markov Decision Problems (MDPs)

[Figure: a three-state MDP over s1, s2, s3 with two actions, RED and BLUE; each arrow is labelled with its transition probability and reward, e.g. 0.5, 0 and 0.75, −2. Discount factor γ = 0.9.]

Elements of an MDP
States (S)
Actions (A)
Transition probabilities (T)
Rewards (R)
Discount factor (γ)

Behaviour is encoded as a Policy π, which maps states to actions.

What is a “good” policy? One that maximises expected long-term reward.

Vπ is the Value Function of π. For s ∈ S,
Vπ(s) = Eπ[r0 + γr1 + γ²r2 + · · · | start state = s].
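As an illustration of this definition (on a hypothetical two-state MDP, not the one drawn above), the sketch below estimates Vπ(s) by averaging sampled discounted returns r0 + γr1 + γ²r2 + · · · over finite-horizon rollouts.

```python
import random

gamma = 0.9

# Hypothetical two-state MDP: T[s][a] is a list of (probability, next_state, reward) triples.
T = {
    0: {"RED":  [(1.0, 1, 1.0)],
        "BLUE": [(0.5, 0, 0.0), (0.5, 1, 3.0)]},
    1: {"RED":  [(1.0, 0, -1.0)],
        "BLUE": [(1.0, 1, 2.0)]},
}
pi = {0: "BLUE", 1: "BLUE"}            # a fixed deterministic policy

def sample_step(s, a):
    """Sample (next_state, reward) from the transition distribution of (s, a)."""
    u, acc = random.random(), 0.0
    for p, s_next, r in T[s][a]:
        acc += p
        if u <= acc:
            return s_next, r
    return T[s][a][-1][1], T[s][a][-1][2]

def estimate_value(s, episodes=5000, horizon=100):
    """Monte-Carlo estimate of V_pi(s): average discounted return from start state s."""
    total = 0.0
    for _ in range(episodes):
        state, ret, discount = s, 0.0, 1.0
        for _ in range(horizon):
            state, r = sample_step(state, pi[state])
            ret += discount * r
            discount *= gamma
        total += ret
    return total / episodes

print("estimated V_pi(s0):", round(estimate_value(0), 2))
```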


Optimal Policies

Vπ satisfies a recursive equation: Vπ = Rπ + γTπVπ, which gives
Vπ = (I − γTπ)⁻¹Rπ.

π      Vπ(s1)   Vπ(s2)   Vπ(s3)
RRR     4.45     6.55    10.82
RRB    −5.61    −5.75    −4.05
RBR     2.76     4.48     9.12
RBB     2.76     4.48     3.48
BRR    10.0      9.34    13.10
BRB    10.0      7.25    10.0
BBR    10.0     11.0     14.45   ← Optimal policy
BBB    10.0     11.0     10.0

Every MDP is guaranteed to have an optimal policy π⋆, such that
∀π ∈ Π, ∀s ∈ S : Vπ⋆(s) ≥ Vπ(s).

What is the complexity of computing an optimal policy?

Note: an MDP with |S| = n states and |A| = k actions has a total of kⁿ policies.

One extra definition needed: the Action Value Function Qπa for a ∈ A, given by
Qπa = Ra + γTaVπ.
Given π, a polynomial computation yields Vπ and Qπa for a ∈ A.
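For concreteness, the sketch below (using a small randomly generated MDP, not the three-state example above) evaluates a policy exactly via the linear solve Vπ = (I − γTπ)⁻¹Rπ and then enumerates all kⁿ deterministic policies to find an optimal one, mirroring the table. The array layout (T[a] and R[a] indexed by action) is an assumption of this sketch.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, k, gamma = 3, 2, 0.9                       # |S| = n states, |A| = k actions

# Hypothetical MDP: T[a][s] is the next-state distribution for action a in state s,
# and R[a][s] is the corresponding expected immediate reward.
T = rng.dirichlet(np.ones(n), size=(k, n))    # shape (k, n, n); each row sums to 1
R = rng.uniform(-1.0, 3.0, size=(k, n))

def evaluate(policy):
    """Exact policy evaluation: V_pi = (I - gamma * T_pi)^(-1) R_pi."""
    T_pi = np.array([T[policy[s], s] for s in range(n)])
    R_pi = np.array([R[policy[s], s] for s in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)

# Brute force over all k^n deterministic policies (feasible only for tiny MDPs);
# the optimal policy dominates every other policy in every state.
best_policy, best_value = None, None
for policy in itertools.product(range(k), repeat=n):
    V = evaluate(policy)
    if best_value is None or np.all(V >= best_value):
        best_policy, best_value = policy, V

print("optimal policy:", best_policy)
print("its values:    ", np.round(best_value, 2))
```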


Policy Improvement

Given π,
Pick one or more improvable states, and in them,
Switch to an arbitrary improving action.
Let the resulting policy be π′.

[Figure: a policy π over states s1, ..., s8; the improvable states and their improving actions are marked, and a policy improvement step switches them to yield π′.]

Policy Improvement Theorem (H60, B12):
(1) If π has no improvable states, then it is optimal, else
(2) if π′ is obtained as above, then
∀s ∈ S : Vπ′(s) ≥ Vπ(s) and ∃s ∈ S : Vπ′(s) > Vπ(s).
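A minimal sketch of one improvement step as described above, assuming the tabular (T, R, γ) arrays from the previous sketch; every improvable state is switched to an arbitrarily chosen improving action (here, the first one found).

```python
import numpy as np

def improve(policy, T, R, gamma=0.9):
    """One policy improvement step: switch every improvable state of `policy`
    to an arbitrary improving action; returns the new policy (a tuple)."""
    k, n, _ = T.shape
    # Evaluate the current policy exactly.
    T_pi = T[np.array(policy), np.arange(n)]
    R_pi = R[np.array(policy), np.arange(n)]
    V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    # Action values: Q[a, s] = R_a(s) + gamma * T_a(s, .) . V
    Q = R + gamma * T @ V
    new_policy = list(policy)
    for s in range(n):
        improving = [a for a in range(k) if Q[a, s] > V[s] + 1e-10]
        if improving:                  # s is an improvable state
            new_policy[s] = improving[0]
    return tuple(new_policy)
```

By the Policy Improvement Theorem, the returned policy weakly dominates the input and strictly improves it in at least one state; if the input has no improvable states, it is returned unchanged.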


Policy Iteration (PI)

π ← Arbitrary policy.
While π has improvable states:
    π ← PolicyImprovement(π).

Different switching strategies lead to different routes to the top.

How long are the routes?!
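A self-contained sketch of the loop above in its Howard-style form (switch every improvable state to a best improving action on each pass), again on a small hypothetical random MDP; the array layout matches the earlier sketches.

```python
import numpy as np

def policy_iteration(T, R, gamma=0.9):
    """Howard-style PI: evaluate the current policy, switch all improvable states
    to a Q-maximising action, and stop when no state is improvable."""
    k, n, _ = T.shape
    policy = np.zeros(n, dtype=int)            # arbitrary initial policy
    while True:
        T_pi = T[policy, np.arange(n)]
        R_pi = R[policy, np.arange(n)]
        V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
        Q = R + gamma * T @ V                  # Q[a, s]
        greedy = Q.argmax(axis=0)
        if np.all(Q[greedy, np.arange(n)] <= V + 1e-10):
            return policy, V                   # no improvable states: policy is optimal
        policy = greedy

rng = np.random.default_rng(1)
n, k = 4, 3
T = rng.dirichlet(np.ones(n), size=(k, n))     # hypothetical transition probabilities
R = rng.uniform(-1.0, 2.0, size=(k, n))        # hypothetical rewards
policy, V = policy_iteration(T, R)
print("optimal policy:", policy, " values:", np.round(V, 2))
```

Other switching strategies differ only in which improvable states are switched and to which improving actions; the loop structure stays the same.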


Switching Strategies and Bounds

Upper bounds on number of iterations

PI Variant                                 Type           k = 2       General k
Howard’s PI [H60, MS99]                    Deterministic  O(2ⁿ/n)     O(kⁿ/n)
Mansour and Singh’s Randomised PI [MS99]   Randomised     1.7172ⁿ     ≈ O((k/2)ⁿ)
Batch-switching PI [KMG16a, GK17]          Deterministic  1.6479ⁿ     k^(0.7207n)
Recursive Simple PI [KMG16b]               Randomised     –           (2 + ln(k − 1))ⁿ

Lower bounds on number of iterations

Ω(n)         Howard’s PI on n-state, 2-action MDPs [HZ10].
Ω(1.4142ⁿ)   Simple PI on n-state, 2-action MDPs [MC94].


Recursive Simple Policy Iteration

Given π,
Pick the improvable state with the highest index, and
Switch to an improving action picked uniformly at random.
Let the resulting policy be π′.

[Figure: a policy π over states s1, ..., s8, with its highest-indexed improvable state marked.]

Expected number of iterations: (1 + Hₖ₋₁)ⁿ ≤ (2 + ln(k − 1))ⁿ.
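A minimal sketch of the switching rule described above (highest-indexed improvable state, improving action drawn uniformly at random), using the same tabular (T, R, γ) layout as the earlier sketches; it illustrates the update rule only, not the recursive argument behind the (1 + Hₖ₋₁)ⁿ bound.

```python
import numpy as np

def rspi_switch(policy, T, R, gamma=0.9, rng=None):
    """One switch of (Recursive) Simple PI: find the highest-indexed improvable
    state and move it to an improving action chosen uniformly at random.
    Returns the new policy, or None if the current policy is optimal."""
    if rng is None:
        rng = np.random.default_rng()
    k, n, _ = T.shape
    T_pi = T[np.array(policy), np.arange(n)]
    R_pi = R[np.array(policy), np.arange(n)]
    V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    Q = R + gamma * T @ V                      # Q[a, s]
    for s in reversed(range(n)):               # highest index first
        improving = [a for a in range(k) if Q[a, s] > V[s] + 1e-10]
        if improving:
            new_policy = list(policy)
            new_policy[s] = int(rng.choice(improving))
            return tuple(new_policy)
    return None                                # no improvable states: optimal
```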


Conclusion

Policy Iteration: a widely used algorithm, more than half a century old.

A substantial gap exists between the known upper and lower bounds.

We furnish several exponential improvements to the upper bounds.

PI bears similarity to the Simplex algorithm for Linear Programming.

Howard’s PI works much better in practice than the variants for which we have shown improved upper bounds!

Open problem: Is the number of iterations taken by Howard’s PI on n-state, 2-action MDPs upper-bounded by the (n + 2)-nd Fibonacci number?

For references, see the tutorial:
Theoretical Analysis of Policy Iteration, Tutorial at IJCAI 2017,
https://www.cse.iitb.ac.in/~shivaram/resources/ijcai-2017-tutorial-policyiteration/index.html

Thank you!