Winning slow, losing fast, and in between. Reinaldo A. Uribe Muriel. Colorado State University (Prof. C. Anderson), Oita University (Prof. K. Shibata), Universidad de Los Andes (Prof. F. Lozano). February 8, 2010.

DESCRIPTION

BMAC presentation. Department of Computer Science. Colorado State University. Feb. 08, 2010.

TRANSCRIPT

Page 1: Between winning slow and losing fast

Winning slow, losing fast, and in between.

Reinaldo A Uribe Muriel

Colorado State University. Prof. C. Anderson
Oita University. Prof. K. Shibata

Universidad de Los Andes. Prof. F. Lozano

February 8, 2010

Page 2: Between winning slow and losing fast

It’s all fun and games until someone proves a theorem.

Outline

1 Fun and games

2 A theorem

3 An algorithm

Page 4: Between winning slow and losing fast

A game: Snakes & Ladders
Board: Crawford & Son, Melbourne, 1901. (Source: http://www.naa.gov.au/)

Boring! (No skill required, only luck.)

The player advances the number of steps indicated by a die.

Landing on a snake’s mouth sends the player back to the tail.

Landing on a ladder’s bottom moves the player forward to the top.

Goal: reaching state 100.

Page 5: Between winning slow and losing fast

Variation: Decision Snakes and Ladders

Sets of “win” and “loss” terminal states.

Actions: either “advance” or “retreat,” to be decided before throwing the die.
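A minimal sketch of this variation as code. The snake/ladder jumps and the win/loss squares below are illustrative placeholders, not the board from the talk (the actual win/loss sets are not listed in the transcript); only the 100-square goal is from the original game:

    import random

    # Hypothetical layout: snakes map mouth -> tail, ladders map bottom -> top.
    JUMPS = {16: 4, 33: 20, 48: 26, 62: 19, 87: 36,
             3: 22, 11: 40, 28: 55, 50: 67, 71: 92}
    WIN, LOSS = {100}, {99}   # placeholder terminal sets for the decision variant

    def step(state, action):
        """One turn: choose "advance" or "retreat" before throwing the die."""
        roll = random.randint(1, 6)
        state = state + roll if action == "advance" else max(1, state - roll)
        state = min(state, 100)
        state = JUMPS.get(state, state)   # slide down a snake or climb a ladder
        done = state in WIN or state in LOSS
        return state, done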

Page 6: Between winning slow and losing fast

Reinforcement Learning: Finding the optimal policy.

“Natural” rewards: ±1 on “win”/“lose”, 0 otherwise.

The optimal policy maximizes total expected reward.

Dynamic programming quickly finds the optimal policy.

Probability of winning: pw = 0.97222...

But...

Page 9: Between winning slow and losing fast

Claim:

It is not always desirable to find the optimal policy for that problem.

Hint: mean episode length of the optimal policy, d = 84.58333 steps.

Page 10: Between winning slow and losing fast

Optimal policy revisited.

Seek winning.

Avoid losing.

Stay safe.

Page 19: Between winning slow and losing fast

A simple, yet powerful idea.

Introduce a step punishment term −rstep so the agent has an incentive to terminate faster.

At time t,

r(t) = +1 − rstep on “win”,
       −1 − rstep on “loss”,
       −rstep otherwise.

Origin: maze rewards, −1 except on termination.
Problem: rstep = ? (i.e., the cost of staying in the game is usually incommensurable with the terminal rewards.)
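As a sketch of how these quantities can be estimated, the snippet below reuses the hypothetical step(), WIN and LOSS names from the earlier code; policy is any state → action function. It implements the shaped reward above and a Monte Carlo estimate of the two statistics the talk keeps returning to, pw and d:

    def shaped_reward(state, r_step):
        """Reward received on entering `state`, shaped by the step punishment."""
        if state in WIN:
            return 1.0 - r_step
        if state in LOSS:
            return -1.0 - r_step
        return -r_step

    def estimate_pw_and_d(policy, episodes=10_000, start=1):
        """Monte Carlo estimate of winning probability pw and mean length d."""
        wins, steps = 0, 0
        for _ in range(episodes):
            state, done = start, False
            while not done:
                state, done = step(state, policy(state))
                steps += 1
            wins += state in WIN
        return wins / episodes, steps / episodes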

Page 20: Between winning slow and losing fast

Better than optimal?

Optimal policy for rstep = 0.

Optimal policy for rstep = 0.08701:

pw = 0.48673 (was 0.97222 — 50.06%)

d = 11.17627 (was 84.58333 — 13.21%)

This policy maximizes pw / d.
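For comparison, the arithmetic on the numbers above (not on the slide) gives

\[
\left.\frac{p_w}{d}\right|_{r_{\mathrm{step}}=0} = \frac{0.97222}{84.58333} \approx 0.0115,
\qquad
\left.\frac{p_w}{d}\right|_{r_{\mathrm{step}}=0.08701} = \frac{0.48673}{11.17627} \approx 0.0436,
\]

so under the pw / d criterion the second policy is roughly 3.8 times better: it wins far less often per episode, but much more often per step of play.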

Page 28: Between winning slow and losing fast

Chess: White wins∗
Uribe Muriel. Journal of Fabricated Results, Vol. 06, No. 8, 2010.

∗ In 10^8 ply.

Such a game visits only about the fifth root of the total number of valid states [1], but, if a ply takes one second, an average game will last three years and two months.

Certainly unlikely to be the case, but in fact finding policies of maximum winning probability remains the usual goal in RL.

The discount factor γ, used to ensure values are finite, has an effect on episode length, but it is unpredictable and suboptimal (for the pw / d problem).

[1] Shannon, 1950.

Page 29: Between winning slow and losing fast

Main result.

For a general ±1-rewarded problem, there exists an r∗step for which the value-optimal solution maximizes pw / d and the value of the initial state is −1:

∃ r∗step such that

π∗ = argmax_{π ∈ Π} v = argmax_{π ∈ Π} pw / d,   and   v∗(s0) = −1.

Page 33: Between winning slow and losing fast

Stating the obvious.

Every policy has a mean episode length d ≥ 1 and probability of winning 0 ≤ pw ≤ 1.

v = 2 pw − 1 − rstep d

(Lemma: Extensible to vectors using indicator variables)

The proof rests on a solid foundation of duh!
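Spelling the “duh” out (a one-line check, not from the slides): with terminal reward ±1, per-step punishment rstep, and episode length T,

\[
v \;=\; \mathbb{E}\!\left[\pm 1 - r_{\mathrm{step}}\,T\right]
  \;=\; p_w\,(+1) + (1 - p_w)\,(-1) - r_{\mathrm{step}}\,\mathbb{E}[T]
  \;=\; 2 p_w - 1 - r_{\mathrm{step}}\, d .
\]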

Page 34: Between winning slow and losing fast

Key substitution: the w − l space.

w = pw / d        l = (1 − pw) / d

Each policy is represented by a unique point in the w − l plane.

The policy cloud is limited by the triangle with vertices (1,0), (0,1), and (0,0).

Page 37: Between winning slow and losing fast

Execution and speed in the w − l space.

[Two level-set plots in the w − l plane (axes w and l, each from 0 to 1): winning probability and mean episode length.]

Winning probability: pw = w / (w + l)

Mean episode length: d = 1 / (w + l)

Page 38: Between winning slow and losing fast

Proof Outline - Value in the w − l space.

v = (w − l − rstep) / (w + l)
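This is just the previous two slides combined (a quick substitution check, not on the slides):

\[
v \;=\; 2 p_w - 1 - r_{\mathrm{step}}\, d
  \;=\; \frac{2w}{w+l} - 1 - \frac{r_{\mathrm{step}}}{w+l}
  \;=\; \frac{w - l - r_{\mathrm{step}}}{w + l},
\qquad\text{using } p_w = \frac{w}{w+l},\; d = \frac{1}{w+l}.
\]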

Page 39: Between winning slow and losing fast

So...

All level sets intersect at the same point, (rstep, −rstep).

There is a one-to-one relationship between values and slopes.

Value (for all rstep), mean episode length, and winning probability level sets are lines.

Optimal policies are in the convex hull of the policy cloud.

Page 40: Between winning slow and losing fast

And done!

π∗ = argmax_π pw / d = argmax_π w

(Vertical level sets: pw / d = (w / (w + l)) · (w + l) = w, so the level sets of the objective are vertical lines in the w − l plane.) When vt ≈ −1, we’re there.

Page 41: Between winning slow and losing fast

Algorithm

Set ε. Initialize π0.
rstep ← 0
Repeat:
    Find π+, vπ+ (solve from π0 by any RL method)
    rstep ← r′step
    π0 ← π+
Until |vπ+(s0) + 1| < ε

On termination, π+ ≈ π∗.

rstep update, using a learning rate µ > 0:

r′step = rstep + µ [vπ+(s0) + 1]
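A minimal sketch of this loop in code. The solve routine stands in for whatever RL or DP method is used to find the value-optimal policy and value function for the current rstep; its name, signature, and the warm-start argument are assumptions, not part of the slides:

    def maximize_pw_over_d(solve, s0, mu=0.5, eps=1e-4):
        """Adjust r_step until the optimal value of the start state is about -1."""
        r_step, policy = 0.0, None
        while True:
            policy, v = solve(r_step, warm_start=policy)   # any RL/DP solver
            if abs(v[s0] + 1.0) < eps:
                return policy, r_step                      # policy ~ argmax pw/d
            r_step += mu * (v[s0] + 1.0)                   # r'step = rstep + mu [v(s0) + 1]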

Page 43: Between winning slow and losing fast

Optimal rstep update.

Minimizing the interval of rstep uncertainty in the next iteration.

Requires solving a minmax problem: either the root of an 8th-degree polynomial in r′step or the zero of the difference of two rational functions of order 4. (Easy using the secant method.)

O(log 1/ε) complexity.
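For reference, a generic secant-method root finder of the kind this step could use (the specific 8th-degree polynomial is not given in the transcript, so f is an arbitrary placeholder):

    def secant(f, x0, x1, tol=1e-10, max_iter=100):
        """Find a zero of f near x0, x1 by the secant method."""
        for _ in range(max_iter):
            f0, f1 = f(x0), f(x1)
            if f1 == f0:          # flat secant; avoid division by zero
                break
            x0, x1 = x1, x1 - f1 * (x1 - x0) / (f1 - f0)
            if abs(x1 - x0) < tol:
                break
        return x1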

Page 44: Between winning slow and losing fast

Extensions.

Problems solvable through a similar method:

Convex (linear) tradeoff.
π∗ = argmax_{π ∈ Π} { α pw − (1 − α) d }

Greedy tradeoff.
π∗ = argmax_{π ∈ Π} { (2 pw − 1) / d }

Arbitrary tradeoffs.
π∗ = argmax_{π ∈ Π} { (α pw − β) / d }

Asymmetric rewards.
rwin = a, rloss = −b; a, b ≥ 0

Games with tie outcomes.

Games with multiple win / loss rewards.

Page 45: Between winning slow and losing fast

Harder family of problems

Maximize the probability of having won before n steps / m episodes.

Why? Non-linear level sets / non-convex functions in the w − l space.

Page 46: Between winning slow and losing fast

Outline of future research: towards robustness.

Policy variation in tasks with fixed episode length. Inclusion of time as a component of the state space.

Defining policy neighbourhoods.
1 Continuous/discrete statewise action neighbourhoods.
2 Discrete policy neighbourhoods for structured tasks.
3 General policy neighbourhoods.

Feature-robustness.
1 Value/Speed/Execution neighbourhoods in the w − l space.
2 Robustness as a trading off of features.

Can traditional Reinforcement Learning methods still be used to handle the learning?

Page 52: Between winning slow and losing fast

Thank you. [email protected] - [email protected]

Untitled by Li Wei, School of Design, Oita University, 2009.