Winning slow, losing fast, and in between. Reinaldo A. Uribe Muriel. Colorado State University (Prof. C. Anderson), Oita University (Prof. K. Shibata), Universidad de Los Andes (Prof. F. Lozano). February 8, 2010.

DESCRIPTION

BMAC presentation. Department of Computer Science. Colorado State University. Feb. 08, 2010.

TRANSCRIPT

Page 1: Between winning slow and losing fast

Winning slow, losing fast, and in between.

Reinaldo A Uribe Muriel

Colorado State University. Prof. C. Anderson
Oita University. Prof. K. Shibata

Universidad de Los Andes. Prof. F. Lozano

February 8, 2010

Page 2: Between winning slow and losing fast

It’s all fun and games until someone proves a theorem.

Outline

1 Fun and games

2 A theorem

3 An algorithm

Page 4: Between winning slow and losing fast

A game: Snakes & Ladders
Board: Crawford & Son, Melbourne, 1901. (Source: http://www.naa.gov.au/)

Boring! (No skill required, only luck.)

The player advances the number of steps indicated by a die.

Landing on a snake’s mouth sends the player back to the tail.

Landing on a ladder’s bottom moves the player forward to the top.

Goal: reaching state 100.

Page 5: Between winning slow and losing fast

Variation: Decision Snakes and Ladders

Sets of “win” and “loss” terminal states.

Actions: either “advance” or “retreat,” to be decided before throwing the die.
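A minimal sketch of this variation as code. The snake/ladder jumps and the win/loss squares below are illustrative placeholders, not the board from the talk (the actual win/loss sets are not listed in the transcript); only the 100-square goal is from the original game:

    import random

    # Hypothetical layout: snakes map mouth -> tail, ladders map bottom -> top.
    JUMPS = {16: 4, 33: 20, 48: 26, 62: 19, 87: 36,
             3: 22, 11: 40, 28: 55, 50: 67, 71: 92}
    WIN, LOSS = {100}, {99}   # placeholder terminal sets for the decision variant

    def step(state, action):
        """One turn: choose "advance" or "retreat" before throwing the die."""
        roll = random.randint(1, 6)
        state = state + roll if action == "advance" else max(1, state - roll)
        state = min(state, 100)
        state = JUMPS.get(state, state)   # slide down a snake or climb a ladder
        done = state in WIN or state in LOSS
        return state, done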

Page 6: Between winning slow and losing fast

Reinforcement Learning: Finding the optimal policy.

“Natural” rewards: ±1 on “win”/“lose”, 0 otherwise.

The optimal policy maximizes total expected reward.

Dynamic programming quickly finds the optimal policy.

Probability of winning: pw = 0.97222...

But...

Page 9: Between winning slow and losing fast

Claim:

It is not always desirable to find the optimal policy for that problem.

Hint: mean episode length of the optimal policy, d = 84.58333 steps.

Page 10: Between winning slow and losing fast

Optimal policy revisited.

Seek winning.

Avoid losing.

Stay safe.

Page 19: Between winning slow and losing fast

A simple, yet powerful idea.

Introduce a step punishment term −rstep so the agent has an incentive to terminate faster.

At time t,

r(t) = +1 − rstep on “win”,
       −1 − rstep on “loss”,
       −rstep otherwise.

Origin: maze rewards, −1 except on termination.
Problem: rstep = ? (i.e., the cost of staying in the game is usually incommensurable with the terminal rewards.)
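As a sketch of how these quantities can be estimated, the snippet below reuses the hypothetical step(), WIN and LOSS names from the earlier code; policy is any state → action function. It implements the shaped reward above and a Monte Carlo estimate of the two statistics the talk keeps returning to, pw and d:

    def shaped_reward(state, r_step):
        """Reward received on entering `state`, shaped by the step punishment."""
        if state in WIN:
            return 1.0 - r_step
        if state in LOSS:
            return -1.0 - r_step
        return -r_step

    def estimate_pw_and_d(policy, episodes=10_000, start=1):
        """Monte Carlo estimate of winning probability pw and mean length d."""
        wins, steps = 0, 0
        for _ in range(episodes):
            state, done = start, False
            while not done:
                state, done = step(state, policy(state))
                steps += 1
            wins += state in WIN
        return wins / episodes, steps / episodes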

Page 20: Between winning slow and losing fast

Better than optimal?

Optimal policy for rstep = 0.

Optimal policy for rstep = 0.08701:

pw = 0.48673 (was 0.97222 — 50.06%)

d = 11.17627 (was 84.58333 — 13.21%)

This policy maximizes pw / d.
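For comparison, the arithmetic on the numbers above (not on the slide) gives

\[
\left.\frac{p_w}{d}\right|_{r_{\mathrm{step}}=0} = \frac{0.97222}{84.58333} \approx 0.0115,
\qquad
\left.\frac{p_w}{d}\right|_{r_{\mathrm{step}}=0.08701} = \frac{0.48673}{11.17627} \approx 0.0436,
\]

so under the pw / d criterion the second policy is roughly 3.8 times better: it wins far less often per episode, but much more often per step of play.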

Page 28: Between winning slow and losing fast

Chess: White wins∗
Uribe Muriel. Journal of Fabricated Results, Vol. 06, No. 8, 2010.

∗ In 10^8 ply.

Such a game visits only about the fifth root of the total number of valid states [1], but, if a ply takes one second, an average game will last three years and two months.

Certainly unlikely to be the case, but in fact finding policies of maximum winning probability remains the usual goal in RL.

The discount factor γ, used to ensure values are finite, has an effect on episode length, but it is unpredictable and suboptimal (for the pw / d problem).

[1] Shannon, 1950.

Page 29: Between winning slow and losing fast

Main result.

For a general ±1-rewarded problem, there exists an r∗step for which the value-optimal solution maximizes pw / d and the value of the initial state is −1:

∃ r∗step such that

π∗ = argmax_{π ∈ Π} v = argmax_{π ∈ Π} pw / d,   and   v∗(s0) = −1.

Page 33: Between winning slow and losing fast

Stating the obvious.

Every policy has a mean episode length d ≥ 1 and probability of winning 0 ≤ pw ≤ 1.

v = 2 pw − 1 − rstep d

(Lemma: Extensible to vectors using indicator variables)

The proof rests on a solid foundation of duh!
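Spelling the “duh” out (a one-line check, not from the slides): with terminal reward ±1, per-step punishment rstep, and episode length T,

\[
v \;=\; \mathbb{E}\!\left[\pm 1 - r_{\mathrm{step}}\,T\right]
  \;=\; p_w\,(+1) + (1 - p_w)\,(-1) - r_{\mathrm{step}}\,\mathbb{E}[T]
  \;=\; 2 p_w - 1 - r_{\mathrm{step}}\, d .
\]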

Page 34: Between winning slow and losing fast

Key substitution: the w − l space.

w = pw / d        l = (1 − pw) / d

Each policy is represented by a unique point in the w − l plane.

The policy cloud is limited by the triangle with vertices (1,0), (0,1), and (0,0).

Page 37: Between winning slow and losing fast

Execution and speed in the w − l space.

[Two level-set plots in the w − l plane (axes w and l, each from 0 to 1): winning probability and mean episode length.]

Winning probability: pw = w / (w + l)

Mean episode length: d = 1 / (w + l)

Page 38: Between winning slow and losing fast

Proof Outline - Value in the w − l space.

v = (w − l − rstep) / (w + l)
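This is just the previous two slides combined (a quick substitution check, not on the slides):

\[
v \;=\; 2 p_w - 1 - r_{\mathrm{step}}\, d
  \;=\; \frac{2w}{w+l} - 1 - \frac{r_{\mathrm{step}}}{w+l}
  \;=\; \frac{w - l - r_{\mathrm{step}}}{w + l},
\qquad\text{using } p_w = \frac{w}{w+l},\; d = \frac{1}{w+l}.
\]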

Page 39: Between winning slow and losing fast

So...

All level sets intersect at the same point, (rstep, −rstep).

There is a one-to-one relationship between values and slopes.

Value (for all rstep), mean episode length, and winning probability level sets are lines.

Optimal policies are in the convex hull of the policy cloud.

Page 40: Between winning slow and losing fast

And done!

π∗ = argmax_π pw / d = argmax_π w

(Vertical level sets: pw / d = (w / (w + l)) · (w + l) = w, so the level sets of the objective are vertical lines in the w − l plane.) When vt ≈ −1, we’re there.

Page 41: Between winning slow and losing fast

Algorithm

Set ε. Initialize π0.
rstep ← 0
Repeat:
    Find π+, vπ+ (solve from π0 by any RL method)
    rstep ← r′step
    π0 ← π+
Until |vπ+(s0) + 1| < ε

On termination, π+ ≈ π∗.

rstep update, using a learning rate µ > 0:

r′step = rstep + µ [vπ+(s0) + 1]
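A minimal sketch of this loop in code. The solve routine stands in for whatever RL or DP method is used to find the value-optimal policy and value function for the current rstep; its name, signature, and the warm-start argument are assumptions, not part of the slides:

    def maximize_pw_over_d(solve, s0, mu=0.5, eps=1e-4):
        """Adjust r_step until the optimal value of the start state is about -1."""
        r_step, policy = 0.0, None
        while True:
            policy, v = solve(r_step, warm_start=policy)   # any RL/DP solver
            if abs(v[s0] + 1.0) < eps:
                return policy, r_step                      # policy ~ argmax pw/d
            r_step += mu * (v[s0] + 1.0)                   # r'step = rstep + mu [v(s0) + 1]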

Page 43: Between winning slow and losing fast

Optimal rstep update.

Minimizing the interval of rstep uncertainty in the next iteration.

Requires solving a minmax problem: either the root of an 8th-degree polynomial in r′step or the zero of the difference of two rational functions of order 4. (Easy using the secant method.)

O(log 1/ε) complexity.
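For reference, a generic secant-method root finder of the kind this step could use (the specific 8th-degree polynomial is not given in the transcript, so f is an arbitrary placeholder):

    def secant(f, x0, x1, tol=1e-10, max_iter=100):
        """Find a zero of f near x0, x1 by the secant method."""
        for _ in range(max_iter):
            f0, f1 = f(x0), f(x1)
            if f1 == f0:          # flat secant; avoid division by zero
                break
            x0, x1 = x1, x1 - f1 * (x1 - x0) / (f1 - f0)
            if abs(x1 - x0) < tol:
                break
        return x1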

Page 44: Between winning slow and losing fast

Extensions.

Problems solvable through a similar method:

Convex (linear) tradeoff.
π∗ = argmax_{π ∈ Π} { α pw − (1 − α) d }

Greedy tradeoff.
π∗ = argmax_{π ∈ Π} { (2 pw − 1) / d }

Arbitrary tradeoffs.
π∗ = argmax_{π ∈ Π} { (α pw − β) / d }

Asymmetric rewards.
rwin = a, rloss = −b; a, b ≥ 0

Games with tie outcomes.

Games with multiple win / loss rewards.

Page 45: Between winning slow and losing fast

Harder family of problems

Maximize the probability of having won before n steps / m episodes.

Why? Non-linear level sets / non-convex functions in the w − l space.

Page 46: Between winning slow and losing fast

Outline of future research: towards robustness.

Policy variation in tasks with fixed episode length. Inclusion of time as a component of the state space.

Defining policy neighbourhoods.
1 Continuous/discrete statewise action neighbourhoods.
2 Discrete policy neighbourhoods for structured tasks.
3 General policy neighbourhoods.

Feature-robustness.
1 Value/Speed/Execution neighbourhoods in the w − l space.
2 Robustness as a trading off of features.

Can traditional Reinforcement Learning methods still be used to handle the learning?

Page 52: Between winning slow and losing fast

Thank you. [email protected] - [email protected]

Untitled by Li Wei, School of Design, Oita University, 2009.