Monte-Carlo Planning: Policy Improvement
Alan Fern

TRANSCRIPT

Page 1: Monte-Carlo Planning: Policy Improvement

Monte-Carlo Planning: Policy Improvement

Alan Fern

Page 2: Monte-Carlo Planning: Policy Improvement

Monte-Carlo Planning

- Often a simulator of a planning domain is available or can be learned from data
  - Examples: Fire & Emergency Response, Conservation Planning

Page 3: Monte-Carlo Planning: Policy Improvement

Large Worlds: Monte-Carlo Approach

- Often a simulator of a planning domain is available or can be learned from data
- Monte-Carlo Planning: compute a good policy for an MDP by interacting with an MDP simulator

[Diagram: the planner interacts with a World Simulator, sending an action and receiving the next state and reward, and executes the chosen actions in the Real World.]

Page 4: Monte-Carlo Planning: Policy Improvement

MDP: Simulation-Based Representation

- A simulation-based representation gives: S, A, R, T, I:
  - Finite state set S (|S| = n, generally very large)
  - Finite action set A (|A| = m, assumed to be of reasonable size)
  - Stochastic, real-valued, bounded reward function R(s,a) = r
    - Stochastically returns a reward r given inputs s and a
  - Stochastic transition function T(s,a) = s' (i.e., a simulator)
    - Stochastically returns a state s' given inputs s and a
    - The probability of returning s' is dictated by Pr(s' | s,a) of the MDP
  - Stochastic initial state function I
    - Stochastically returns a state according to an initial state distribution

These stochastic functions can be implemented in any language!
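For concreteness, here is a minimal Python sketch of such a simulation-based interface; the class name, the two-state toy dynamics, and the method names (initial_state, reward, transition) are illustrative assumptions rather than anything prescribed by the slides. Later sketches in this transcript reuse this reward/transition interface.

    import random

    class SimulatedMDP:
        """Minimal simulation-based MDP: sample-only access to R, T, and I."""

        def __init__(self, actions):
            self.actions = actions                 # finite action set A

        def initial_state(self):                   # I: sample an initial state
            return random.choice([0, 1])

        def reward(self, s, a):                    # R(s, a): stochastic, bounded reward
            return random.gauss(float(s == a), 0.1)

        def transition(self, s, a):                # T(s, a): sample s' ~ Pr(. | s, a)
            return a if random.random() < 0.8 else 1 - s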

Page 5: Monte-Carlo Planning: Policy Improvement

Outline

- You already learned how to evaluate a policy given a simulator
  - Just run the policy multiple times for a finite horizon and average the rewards
- In the next two lectures we'll learn how to select good actions

Page 6: Monte-Carlo Planning: Policy Improvement

Monte-Carlo Planning Outline

- Single State Case (multi-armed bandits)
  - A basic tool for other algorithms
- Monte-Carlo Policy Improvement
  - Policy rollout
  - Policy switching
- Monte-Carlo Tree Search
  - Sparse Sampling
  - UCT and variants

[Callout: "Today" marks the part of the outline covered in this lecture.]

Page 7: Monte-Carlo Planning: Policy Improvement

Single State Monte-Carlo Planning

- Suppose the MDP has a single state s and k actions
  - Can sample rewards of actions using calls to the simulator
  - Sampling action a is like pulling a slot machine arm with random payoff function R(s,a)

[Diagram: Multi-Armed Bandit Problem -- from state s, arms a1, a2, ..., ak with stochastic payoffs R(s,a1), R(s,a2), ..., R(s,ak).]

Page 8: Monte-Carlo Planning: Policy Improvement

Multi-Armed Bandits

- We will use bandit algorithms as components for multi-state Monte-Carlo planning
  - But they are useful in their own right
- Pure bandit problems arise in many applications
- Applicable whenever:
  - We have a set of independent options with unknown utilities
  - There is a cost for sampling options or a limit on total samples
  - We want to find the best option or maximize the utility of our samples

Page 9: Monte-Carlo Planning: Policy Improvement

Multi-Armed Bandits: Examples

- Clinical Trials
  - Arms = possible treatments
  - Arm pulls = application of a treatment to an individual
  - Rewards = outcome of the treatment
  - Objective = find the best treatment quickly (debatable)

- Online Advertising
  - Arms = different ads/ad-types for a web page
  - Arm pulls = displaying an ad upon a page access
  - Rewards = click-throughs
  - Objective = find the best ad quickly (then maximize clicks)

Page 10: Monte-Carlo Planning: Policy Improvement

Simple Regret Objective

- Different applications suggest different types of bandit objectives
- Today minimizing simple regret will be the objective
  - Simple regret: quickly identify an arm with high reward (in expectation)

[Diagram: Multi-Armed Bandit Problem -- from state s, arms a1, a2, ..., ak with payoffs R(s,a1), R(s,a2), ..., R(s,ak).]

Page 11: Monte-Carlo Planning: Policy Improvement

Simple Regret Objective: Formal Definition

- Protocol: At time step n the algorithm picks an "exploration" arm a_n to pull and observes its reward, and also picks an arm index j_n it thinks is best (a_n and j_n are random variables).
  - If interrupted at time n the algorithm returns arm j_n.
- Let R* denote the expected reward of the best arm
- Expected simple regret E[SR_n]: the difference between R* and the expected reward of the arm j_n selected by our strategy at time n

Page 12: Monte-Carlo Planning: Policy Improvement

UniformBandit Algorithm (or Round Robin)

UniformBandit Algorithm:
- At round n, pull the arm with index (n mod k) + 1
- At round n, return the arm (if asked) with the largest average reward
  - I.e., j_n is the index of the arm with the best average so far

Theorem: The expected simple regret of UniformBandit after n arm pulls is upper bounded by O(e^{-c n}) for a constant c.

- This bound is exponentially decreasing in n!
  - So even this simple algorithm has a provably small simple regret.

Bubeck, S., Munos, R., & Stoltz, G. (2011). Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19), 1832-1852.
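As a concrete illustration, a minimal Python sketch of the round-robin procedure; pull(i), which samples one reward of arm i, and the function name are assumptions made for the example.

    def uniform_bandit(pull, k, n):
        """Round-robin exploration; recommend the arm with the best average reward.

        pull(i) -> one stochastic reward sample of arm i; k arms; n total pulls.
        """
        totals = [0.0] * k
        counts = [0] * k
        for t in range(n):
            i = t % k                              # round-robin arm choice
            totals[i] += pull(i)
            counts[i] += 1
        # recommendation: arm with the largest empirical mean so far
        return max(range(k), key=lambda i: totals[i] / counts[i] if counts[i] else float("-inf"))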

Page 13: Monte-Carlo Planning: Policy Improvement

Can we do better?

Algorithm ε-GreedyBandit (parameter ε):
- At round n, with probability 1 - ε pull the arm with the best average reward so far; otherwise pull one of the other arms at random.
- At round n, return the arm (if asked) with the largest average reward.

Theorem: The expected simple regret of ε-Greedy (for a fixed ε) after n arm pulls is upper bounded by O(e^{-c n}) for a constant c that is larger than the constant for UniformBandit (this holds for "large enough" n).

Tolpin, D. & Shimony, S. E. (2012). MCTS Based on Simple Regret. AAAI Conference on Artificial Intelligence.

ε-Greedy is often seen to be more effective in practice as well.
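A corresponding sketch of ε-GreedyBandit, written under the standard convention that ε is the exploration probability; the function and variable names are again assumptions.

    import random

    def epsilon_greedy_bandit(pull, k, n, eps):
        """Epsilon-greedy exploration; recommend the arm with the best average reward."""
        totals = [0.0] * k
        counts = [0] * k

        def mean(i):
            return totals[i] / counts[i] if counts[i] else float("-inf")

        for t in range(n):
            best = max(range(k), key=mean)
            if counts[best] == 0 or random.random() < eps:
                # explore: pull one of the other arms at random
                others = [a for a in range(k) if a != best]
                i = random.choice(others) if others else best
            else:
                i = best                           # exploit the current best arm
            totals[i] += pull(i)
            counts[i] += 1
        return max(range(k), key=mean)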

Page 14: Monte-Carlo Planning: Policy Improvement

Monte-Carlo Planning Outline

- Single State Case (multi-armed bandits)
  - A basic tool for other algorithms
- Monte-Carlo Policy Improvement
  - Policy rollout
  - Policy switching
- Monte-Carlo Tree Search
  - Sparse Sampling
  - UCT and variants

[Callout: "Today" marks the part of the outline covered in this lecture.]

Page 15: Monte-Carlo Planning: Policy Improvement

Policy Improvement via Monte-Carlo

- Now consider a very large multi-state MDP.
- Suppose we have a simulator and a non-optimal policy π
  - E.g., the policy could be a standard heuristic or based on intuition
- Can we somehow compute an improved policy?

[Diagram: the World Simulator plus the Base Policy interacts with the Real World via actions and observed state + reward.]

Page 16: Monte-Carlo Planning: Policy Improvement

Policy Improvement Theorem

- Definition: The Q-value function Q^π(s,a,h) gives the expected future reward of starting in state s, taking action a, and then following policy π until the horizon h.

- Define: π'(s) = argmax_a Q^π(s,a,h)

- Theorem [Howard, 1960]: For any non-optimal policy π, the policy π' is strictly better than π.

- So if we can compute π'(s) at any state s we encounter, then we can execute an improved policy.

- Can we use bandit algorithms to compute π'(s)?

Page 17: Monte-Carlo Planning: Policy Improvement

Policy Improvement via Bandits

[Diagram: from state s, arms a1, a2, ..., ak with payoffs SimQ(s,a1,π,h), SimQ(s,a2,π,h), ..., SimQ(s,ak,π,h).]

- Idea: define a stochastic function SimQ(s,a,π,h) that we can implement and whose expected value is Q^π(s,a,h)
- Then use a bandit algorithm to select (approximately) the action with the best Q-value

How to implement SimQ?

Page 18: Monte-Carlo Planning: Policy Improvement

Policy Improvement via Bandits

SimQ(s, a, π, h)
    r = R(s, a)                  // simulate a in s
    s = T(s, a)
    for i = 1 to h-1             // simulate h-1 steps of policy π
        r = r + R(s, π(s))
        s = T(s, π(s))
    return r

- Simply simulate taking a in s and following policy π for h-1 steps, returning the discounted sum of rewards
- The expected value of SimQ(s,a,π,h) is Q^π(s,a,h), which can be made arbitrarily close to Q^π(s,a) by increasing h
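A runnable Python version of SimQ, reusing the hypothetical reward/transition simulator interface sketched earlier; the explicit discount factor gamma is an added assumption, included because the text refers to discounted rewards.

    def sim_q(sim, s, a, policy, h, gamma=1.0):
        """One stochastic sample whose expected value is Q^pi(s, a, h)."""
        r = sim.reward(s, a)                       # simulate taking a in s
        s = sim.transition(s, a)
        discount = gamma
        for _ in range(h - 1):                     # then follow the policy for h-1 steps
            a_pi = policy(s)
            r += discount * sim.reward(s, a_pi)
            s = sim.transition(s, a_pi)
            discount *= gamma
        return r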

Page 19: Monte-Carlo Planning: Policy Improvement

Policy Improvement via Bandits

SimQ(s, a, π, h)
    r = R(s, a)                  // simulate a in s
    s = T(s, a)
    for i = 1 to h-1             // simulate h-1 steps of policy π
        r = r + R(s, π(s))
        s = T(s, π(s))
    return r

[Diagram: from state s, one trajectory under π for each action a1, a2, ..., ak; the sum of rewards along each trajectory is one sample of SimQ(s,a1,π,h), SimQ(s,a2,π,h), ..., SimQ(s,ak,π,h).]

Page 20: Monte-Carlo Planning: Policy Improvement

Policy Improvement via Bandits

[Diagram: from state s, arms a1, a2, ..., ak with payoffs SimQ(s,a1,π,h), SimQ(s,a2,π,h), ..., SimQ(s,ak,π,h).]

- Now apply your favorite bandit algorithm for simple regret
- UniformRollout: use UniformBandit
  - Parameters: number of trials n and horizon/height h
- ε-GreedyRollout: use ε-GreedyBandit
  - Parameters: number of trials n, horizon/height h, and ε (a moderate value of ε is often a good choice)
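Putting the pieces together, a sketch of UniformRollout as a one-step action selector: each action is treated as a bandit arm whose pull is one SimQ sample. It builds on the hypothetical sim_q and uniform_bandit sketches above.

    def uniform_rollout(sim, s, actions, policy, h, n, gamma=1.0):
        """Return the (approximately) best action at s under one step of policy improvement."""
        def pull(i):                               # pulling arm i = one SimQ sample for action i
            return sim_q(sim, s, actions[i], policy, h, gamma)
        best_index = uniform_bandit(pull, len(actions), n)
        return actions[best_index]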

Page 21: Monte-Carlo Planning: Policy Improvement

UniformRollout

[Diagram: from state s, each action ai gets w SimQ(s,ai,π,h) trajectories; each trajectory simulates taking action ai and then following π for h-1 steps, and the resulting values qi1, qi2, ..., qiw are samples of SimQ(s,ai,π,h).]

- Each action is tried roughly the same number of times (approximately n/k times)

Page 22: Monte-Carlo Planning: Policy Improvement

ε-GreedyRollout

[Diagram: from state s, the actions receive non-uniform numbers of SimQ(s,ai,π,h) trajectories (e.g., samples q11 ... q1u for a1, q21 ... q2v for a2, and only qk1 for ak).]

- Allocates a non-uniform number of trials across actions (focuses on more promising actions)
- For a well-chosen ε, we might expect it to do better than UniformRollout for the same value of n.

Page 23: Monte-Carlo Planning: Policy Improvement

Executing Rollout in Real World

[Diagram: real-world state/action sequence along the top; at each real-world state the agent runs a policy rollout over actions a1, a2, ..., ak (simulated experience), executes the chosen action, observes the next real-world state, and runs the rollout again there.]

Page 24: Monte-Carlo Planning: Policy Improvement

Policy Rollout: # of Simulator Calls

- Total of n SimQ calls, each using h calls to the simulator
- Total of hn calls to the simulator

[Diagram: from state s, the SimQ(s,ai,π,h) trajectories; each simulates taking action ai and then following π for h-1 steps.]

Page 25: Monte-Carlo Planning: Policy Improvement

Multi-Stage Rollout

- A single call to Rollout[π,h,w](s) yields one iteration of policy improvement starting at policy π
- We can use more computation time to yield multiple iterations of policy improvement via nested calls to Rollout
  - Rollout[Rollout[π,h,w],h,w](s) returns the action for state s resulting from two iterations of policy improvement
  - Can nest this arbitrarily
- Gives a way to use more time in order to improve performance (see the sketch below)
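A sketch of how the nesting could be expressed, reusing the hypothetical uniform_rollout above: one iteration of policy improvement is itself a policy (a function from states to actions), so it can be passed back in as the base policy.

    def rollout_policy(sim, actions, base_policy, h, n, gamma=1.0):
        """Wrap one iteration of policy improvement as a new policy (a callable on states)."""
        def improved(s):
            return uniform_rollout(sim, s, actions, base_policy, h, n, gamma)
        return improved

    # Two-stage rollout: improve the rollout policy of the base policy pi.
    # Each simulated step of the inner rollout policy costs n*h simulator calls itself,
    # so the total cost grows exponentially with the number of stages.
    # two_stage = rollout_policy(sim, actions, rollout_policy(sim, actions, pi, h, n), h, n)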

Page 26: Monte-Carlo Planning: Policy Improvement

Practical Issues

- There are three ways to speed up either rollout procedure:
  1. Use a faster policy
  2. Decrease the number of trajectories n
  3. Decrease the horizon h

- Decreasing trajectories:
  - If n is small compared to the number of actions k, then performance could be poor since actions don't get tried very often
  - One way to get away with a smaller n is to use an action filter

- Action filter: a function f(s) that returns a subset of the actions in state s that rollout should consider
  - You can use your domain knowledge to filter out obviously bad actions
  - Rollout decides among the rest (see the sketch below)
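For example, a filtered variant of the hypothetical rollout above would pass only the actions returned by f(s) to the bandit:

    def filtered_uniform_rollout(sim, s, all_actions, policy, h, n, action_filter, gamma=1.0):
        """Rollout restricted to the actions the filter deems worth considering."""
        candidates = action_filter(s) or all_actions   # fall back if the filter prunes everything
        return uniform_rollout(sim, s, candidates, policy, h, n, gamma)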

Page 27: Monte-Carlo Planning: Policy Improvement

Practical Issues

- There are three ways to speed up either rollout procedure:
  1. Use a faster policy
  2. Decrease the number of trajectories n
  3. Decrease the horizon h

- Decreasing horizon h:
  - If h is too small compared to the "real horizon" of the problem, then the Q-estimates may not be accurate
  - One way to get away with a smaller h is to use a value estimation heuristic

- Heuristic function: a heuristic function v(s) returns an estimate of the value of state s
  - SimQ is adjusted to run the policy for h steps, ending in some state s', and return the sum of rewards plus the estimate v(s') (see the sketch below)
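A sketch of that adjustment to the hypothetical sim_q above: after the truncated h-step simulation, the heuristic value of the final state is added (discounted) to the return.

    def sim_q_with_heuristic(sim, s, a, policy, h, v, gamma=1.0):
        """SimQ variant that bootstraps the truncated tail with a heuristic value v(s')."""
        r = sim.reward(s, a)
        s = sim.transition(s, a)
        discount = gamma
        for _ in range(h - 1):
            a_pi = policy(s)
            r += discount * sim.reward(s, a_pi)
            s = sim.transition(s, a_pi)
            discount *= gamma
        return r + discount * v(s)                 # add heuristic estimate of the remaining value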

Page 28: Monte-Carlo Planning: Policy Improvement

Multi-Stage Rollout

[Diagram: from state s, trajectories of SimQ(s,ai,Rollout[π,h,w],h); each step of the rollout policy itself requires nh simulator calls.]

- Two stage: compute the rollout policy of the "rollout policy of π"
- Requires (nh)^2 calls to the simulator for 2 stages
- In general, exponential in the number of stages

Page 29: Monte-Carlo Planning: Policy Improvement

Example: Rollout for Solitaire [Yan et al. NIPS'04]

- Multiple levels of rollout can pay off, but it is expensive

  Player                Success Rate   Time/Game
  Human Expert          36.6%          20 min
  (naïve) Base Policy   13.05%         0.021 sec
  1 rollout             31.20%         0.67 sec
  2 rollout             47.6%          7.13 sec
  3 rollout             56.83%         1.5 min
  4 rollout             60.51%         18 min
  5 rollout             70.20%         1 hour 45 min

Page 30: Monte-Carlo Planning: Policy Improvement

Rollout in 2-Player Games

[Diagram: from state s, actions a1, a2, ..., ak, each with SimQ samples; the simulated games alternate moves between the two players p1 and p2.]

- SimQ simply uses the base policy π to select moves for both players until the horizon
- The result is biased toward playing well against π itself
- Is this ok?

Page 31: Monte-Carlo Planning: Policy Improvement

Another Useful Technique: Policy Switching

- Suppose you have a set of base policies {π1, π2, …, πM}
- Also suppose that the best policy to use can depend on the specific state of the system, and we don't know how to select it
- Policy switching is a simple way to select which policy to use at a given step via a simulator

Page 32: Monte-Carlo Planning: Policy Improvement

Another Useful Technique: Policy Switching

[Diagram: from state s, arms π1, π2, ..., πM with payoffs Sim(s,π1,h), Sim(s,π2,h), ..., Sim(s,πM,h).]

- The stochastic function Sim(s,π,h) simply samples the h-horizon value of π starting in state s
- Implement it by simply simulating π starting in s for h steps and returning the discounted total reward
- Use a bandit algorithm to select the best policy, and then select the action chosen by that policy

Page 33: Monte-Carlo Planning: Policy Improvement

PolicySwitching

PolicySwitch[{π1, π2, …, πM}, h, n](s)
    1. Define a bandit with M arms giving rewards Sim(s,πi,h)
    2. Let i* be the index of the arm/policy selected by your favorite bandit algorithm using n trials
    3. Return action πi*(s)

[Diagram: from state s, w Sim(s,πi,h) trajectories for each πi; each simulates following πi for h steps, and the values vi1, vi2, ..., viw are the discounted cumulative rewards.]
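A sketch of policy switching in the same hypothetical Python setting, reusing uniform_bandit as the bandit component; the sim_v helper and all names are assumptions for illustration.

    def sim_v(sim, s, policy, h, gamma=1.0):
        """One stochastic sample of the h-horizon (discounted) value of policy starting at s."""
        r, discount = 0.0, 1.0
        for _ in range(h):
            a = policy(s)
            r += discount * sim.reward(s, a)
            s = sim.transition(s, a)
            discount *= gamma
        return r

    def policy_switch(sim, s, policies, h, n, gamma=1.0):
        """Pick the policy that looks best at s via a simple-regret bandit, then return its action."""
        def pull(i):
            return sim_v(sim, s, policies[i], h, gamma)
        i_star = uniform_bandit(pull, len(policies), n)
        return policies[i_star](s)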

Page 34: Monte-Carlo Planning: Policy Improvement

Executing Policy Switching in Real World

[Diagram: real-world state/action sequence along the top; at each real-world state the agent runs policy switching over π1, π2, ..., πk (simulated experience), executes the chosen policy's action (e.g., π2(s), then πk(s') at the next state), and repeats.]

Page 35: Monte-Carlo Planning: Policy Improvement

Policy Switching: Quality

- Let π_s denote the ideal switching policy
  - It always picks the best policy index at any state
- The value of the switching policy is at least as good as the best single policy in the set
  - It will often perform better than any single policy in the set
  - For the non-ideal case, where the bandit algorithm only picks approximately the best arm, we can add an error term to the bound

Theorem: For any state s, V^{π_s}(s,h) ≥ max_i V^{π_i}(s,h).

Page 36: Monte-Carlo Planning: Policy Improvement

Policy Switching in 2-Player Games

Now we give two policy sets, one for each player:

Max Player Policies (us): {π1, π2, …, πM}

Min Player Policies (them): {φ1, φ2, …, φN}

These policy sets will often be the same, when the players have the same action sets.

They encode our knowledge of what the possible effective strategies might be in the game.

Page 37: Monte-Carlo Planning: Policy Improvement

Minimax Policy Switching

Build Game Matrix (using the game simulator at the current state s):

- One row per max-player policy and one column per min-player policy
- Each entry gives the estimated value (for the max player) of playing a policy pair against one another
- Each value is estimated by averaging across w simulated games

Page 38: Monte-Carlo Planning: Policy Improvement

Minimax Switching

[Diagram: the same game matrix, built with the game simulator at the current state s.]

MaxiMin Policy: select the max-player policy whose worst-case estimated value (over the min-player policy set) is largest.

Select action: execute the action that this maximin policy chooses in the current state s (a sketch follows below).
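A sketch of how the matrix construction and maximin selection might look, assuming a hypothetical simulate_game(s, max_policy, min_policy, h) that plays the two policies against each other for h steps and returns the max player's total reward; all names are assumptions for illustration.

    def minimax_policy_switch(simulate_game, s, max_policies, min_policies, h, w):
        """Estimate the game matrix by sampling, then act with the maximin max-player policy."""
        # value[i][j]: average value (for the max player) of pi_i vs phi_j over w simulated games
        value = [[sum(simulate_game(s, pi, phi, h) for _ in range(w)) / w
                  for phi in min_policies]
                 for pi in max_policies]
        # maximin: pick the row whose worst-case (minimum) entry is largest
        i_star = max(range(len(max_policies)), key=lambda i: min(value[i]))
        return max_policies[i_star](s)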