
Monte-Carlo Graph Search: the Value of Merging Similar States

Edouard Leurent (1,2), Odalric-Ambrym Maillard (1)

(1) Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 – CRIStAL
(2) Renault Group

Motivation

Monte-Carlo Tree Search algorithms
• rely on a tree structure to represent their value estimates.
• have a performance independent of the size S of the state space:

  Tabular RL (UCBVI):  √(HSAn)
  MCTS (OPD):          n^(−log(1/γ) / log A)


Limitation

• There can be several paths to the same state s
• s is represented several times in the tree
• No information is shared between these paths


Motivating example

Not accounting for state similarity hinders exploration.

Sparse gridworld: reward of 0 everywhere.


Planners behaviours

Uniform planning in the space of sequences of actions.

[Figure: states visited by OPD, with a budget of n = 5460 moves]

Motivating example 2/2

Concentration
• Uniform planning does not lead to uniform exploration of the state space
• After H steps of a 2D random walk, the distance d to the start follows a Rayleigh distribution: P(d) = (2d/H) e^(−d²/H)

[Figure: heatmap of the states visited by OPD (log scale), budget of 5460 samples; maximum distance reached d = 6]
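A quick way to see this concentration (a sketch of my own, not from the talk) is to simulate many short 2D random walks and look at where they end:

```python
# Sketch (my own, not from the talk): simulate 2D random walks of H steps
# and check that they concentrate near the start, as the Rayleigh-like
# density P(d) = (2d/H) exp(-d^2/H) predicts.
import numpy as np

rng = np.random.default_rng(0)
H, n_walks = 12, 100_000

steps = rng.integers(0, 4, size=(n_walks, H))   # 0/1: left/right, 2/3: down/up
dx = (steps == 0).sum(axis=1) - (steps == 1).sum(axis=1)
dy = (steps == 2).sum(axis=1) - (steps == 3).sum(axis=1)
d = np.hypot(dx, dy)                            # final distance to the start

print(f"mean final distance after {H} steps: {d.mean():.2f}")
print(f"walks ending within distance 3 of the start: {(d <= 3).mean():.0%}")
```

Most trajectories end close to the start state, so distant states are sampled rarely.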


Goal

Better exploit this wasted information
• By merging similar states into a graph


Questions

• How to adapt MCTS algorithms to work on graphs?
• Can we quantify the benefit of using graphs over trees?

MCTS for Deterministic MDPs

Optimism in the Face of Uncertainty: OPD
1. Build confidence bounds L(a) ≤ V(a) ≤ U(a)
2. Follow optimistic actions, from the root down to a leaf b
3. Expand the leaf b ∈ ∂Tn

[Figure: the look-ahead tree, with internal nodes Tn∘ and leaves ∂Tn]
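As a concrete illustration of these three steps, here is a minimal OPD loop in Python. This is my own sketch, not the authors' implementation; the toy `step` dynamics, the action set and the recommendation rule are assumptions.

```python
# Minimal OPD sketch (my own, under assumptions): a deterministic model
# step(s, a) -> (s', r) with rewards in [0, 1] and discount GAMMA.
GAMMA, ACTIONS = 0.9, (0, 1)

def step(s, a):                       # hypothetical toy dynamics
    return s + (1 if a else -1), 1.0 if s == 3 else 0.0

class Node:
    def __init__(self, state, depth, value):
        self.state, self.depth = state, depth
        self.value = value            # discounted reward sum from the root
        self.children = {}            # action -> Node

    def U(self):                      # 1. upper confidence bound on V
        if self.children:             # internal node: best child
            return max(c.U() for c in self.children.values())
        return self.value + GAMMA ** self.depth / (1 - GAMMA)

def opd(root_state, budget):
    root = Node(root_state, 0, 0.0)
    for _ in range(budget):
        node = root
        while node.children:          # 2. follow optimistic actions to a leaf
            node = max(node.children.values(), key=Node.U)
        for a in ACTIONS:             # 3. expand the leaf
            s2, r = step(node.state, a)
            node.children[a] = Node(s2, node.depth + 1,
                                    node.value + GAMMA ** node.depth * r)
    return max(root.children, key=lambda a: root.children[a].U())

print(opd(root_state=0, budget=50))   # recommended first action
```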


On a graph...

Same principle: GBOP-D
1. Build confidence bounds L(s) ≤ V(s) ≤ U(s)
2. Follow optimistic actions until an external node s is reached
3. Expand the external node s ∈ ∂Gn

> We are guaranteed to expand any state only once.

[Figure: the graph Gn, with internal nodes and external nodes ∂Gn]
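The structural change from OPD is small but important: nodes are keyed by state rather than by action sequence. A minimal sketch of this merging (my own illustration, reusing the same hypothetical toy dynamics as above):

```python
# State merging (my own sketch): the graph keys nodes by state, so all
# action sequences reaching the same state share one node and its bounds.
GAMMA, ACTIONS = 0.9, (0, 1)

def step(s, a):                        # same hypothetical toy dynamics
    return s + (1 if a else -1), 1.0 if s == 3 else 0.0

nodes = {}                             # state -> node: each state stored once

def get_node(s):
    if s not in nodes:
        nodes[s] = {"L": 0.0, "U": 1.0 / (1 - GAMMA), "edges": {}}
    return nodes[s]

def expand(s):                         # 3. expand an external node
    node = get_node(s)
    for a in ACTIONS:
        s2, r = step(s, a)
        node["edges"][a] = (r, s2)
        get_node(s2)                   # reuses the node if s2 was seen before

expand(0); expand(1); expand(-1)
print(sorted(nodes))                   # [-2, -1, 0, 1, 2]: "right then left"
                                       # and "left then right" merge into 0
```

A depth-2 tree would store 7 nodes here; the graph stores 5.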


How to build the bounds L, U?

How to bound V(s) = sup Σ_{t=0}^∞ γ^t r_t ?

• Initialize with the trivial bounds: L = 0, U = 1/(1−γ)
• Apply the Bellman operator B : V ↦ max_a [R(s, a) + γ V(s′)], which tightens them:

  L(a) ≤ B(L)(a) ≤ V(a) ≤ B(U)(a) ≤ U(a)

• Trees: converges in dn steps (the depth of the tree)
• Graphs: may require infinitely many steps to converge when there is a loop
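A minimal sketch of this iteration (my own illustration; the tiny looping graph is a made-up example):

```python
# Repeated Bellman backups tighten L and U over a known graph (my sketch).
# `graph` maps state -> {action: (reward, next_state)}; rewards in [0, 1].
GAMMA = 0.9

def bellman_bounds(graph, leaves, n_iters):
    # leaves (unexpanded states) keep their trivial bounds throughout
    L = {s: 0.0 for s in list(graph) + list(leaves)}
    U = {s: 1.0 / (1 - GAMMA) for s in L}
    for _ in range(n_iters):                 # with a loop in the graph, the
        for s, actions in graph.items():     # bounds only converge asymptotically
            L[s] = max(r + GAMMA * L[s2] for r, s2 in actions.values())
            U[s] = max(r + GAMMA * U[s2] for r, s2 in actions.values())
    return L, U

# Two states that loop back into each other: a tree would unroll this
# forever, the graph handles it with repeated backups.
graph = {"a": {"left": (0.0, "b"), "right": (1.0, "a")},
         "b": {"left": (0.0, "a"), "right": (0.0, "b")}}
L, U = bellman_bounds(graph, leaves=[], n_iters=100)
print(L["a"], U["a"])   # both approach V("a") = 1 / (1 - GAMMA) = 10
```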


Analysis

Is GBOP-D more efficient than OPD?

Performance: simple regret rn = V⋆ − V(an)

Theorem (Sample complexity of OPD, Hren and Munos, 2008)
  rn = O(n^(−log(1/γ) / log κ)),
  where κ is a problem-dependent difficulty measure.

Theorem (Sample complexity of GBOP-D)
  rn = O(n^(−log(1/γ) / log κ∞)),
  where κ∞ is a tighter problem-dependent difficulty measure: κ∞ ≤ κ.
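To see what the tighter measure buys, here is a worked comparison with illustrative values of my own choosing (γ = 0.9, κ = 2, κ∞ = 1.2; these numbers are assumptions, not from the paper):

```latex
% Worked example with assumed values gamma = 0.9, kappa = 2, kappa_inf = 1.2:
\[
  r_n^{\text{OPD}}
    = O\big(n^{-\log(1/\gamma)/\log\kappa}\big)
    = O\big(n^{-0.105/0.693}\big)
    \approx O\big(n^{-0.15}\big),
\]
\[
  r_n^{\text{GBOP-D}}
    = O\big(n^{-\log(1/\gamma)/\log\kappa_\infty}\big)
    = O\big(n^{-0.105/0.182}\big)
    \approx O\big(n^{-0.58}\big).
\]
```

The same budget then yields a polynomially faster decay of the simple regret.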


Comparing difficulty measures

• κ∞ = κ if the MDP has a tree structure
• κ∞ < κ when trajectories intersect a lot, e.g. when
  > actions cancel each other out (e.g. moving left or right)
  > actions are commutative (e.g. placing pawns on a board)

Illustrative example: 3 states, K > 2 actions, for which κ∞ = 1 < κ = K − 1

[Figure: the 3-state MDP, with transition rewards r⋆, r+, r−, 0]
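To get a feel for how much intersecting trajectories can save, here is a small count of my own (a hypothetical gridworld, not the paper's example): with commutative moves that can cancel out, the number of distinct states is far smaller than the number of action sequences.

```python
# My own illustration: on a 2D grid where moves commute (and left/right,
# up/down cancel), count distinct end states vs. action sequences of length H.
from itertools import product

MOVES = ((-1, 0), (1, 0), (0, 1), (0, -1))   # left, right, up, down
H = 6

endpoints = set()
for seq in product(MOVES, repeat=H):
    endpoints.add((sum(dx for dx, _ in seq), sum(dy for _, dy in seq)))

print(f"{len(MOVES) ** H} sequences -> {len(endpoints)} distinct states")
# 4096 sequences -> 49 distinct states: a tree grows one leaf per
# sequence, while the graph keeps one node per state.
```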


Experiment: sparse gridworld

Rewards in a ball around (10, 10) of radius 5, with quadratic decay.

[Figure: heatmaps of the states visited by OPD and by GBOP-D (log scale), n = 5640 samples]

Extension to stochastic MDPs

• Use state similarity to tighten the bounds L ≤ V ≤ U.
• We adapt MDP-GapE (Jonsson et al., 2020) to obtain GBOP.

Experiment: stochastic gridworld

Noisy transitions with probability p = 10%.

[Figure: heatmaps of the states visited by UCT and by GBOP (log scale), n = 5640 samples]

Exploration-Exploitation score

S = Σ_{t=1}^n [ d(st, s0) − d(st, sg) ]

where the first term rewards exploration (moving away from the start state s0) and the second penalizes distance to the goal state sg (exploitation).

[Figure: bar chart of the score of each planner (OPD, GBOP-D, MDP-GapE, BRUE, UCT, KL-OLOP, GBOP), under deterministic and stochastic dynamics; n = 5640 samples]
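A direct transcription of this score (my own sketch; the Euclidean distance and the example states are assumptions, a gridworld could equally use a grid distance):

```python
# My own sketch of the exploration-exploitation score: each visited state
# is rewarded for being far from the start s0 and penalized for being far
# from the goal sg.
import numpy as np

def score(states, s0, sg):
    states, s0, sg = np.asarray(states), np.asarray(s0), np.asarray(sg)
    d0 = np.linalg.norm(states - s0, axis=1)   # exploration term
    dg = np.linalg.norm(states - sg, axis=1)   # exploitation term
    return float(np.sum(d0 - dg))

print(score([(1, 0), (5, 5), (9, 10)], s0=(0, 0), sg=(10, 10)))
```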

Sailing Domain (Vanderbei, 1996)

[Figure: simple regret rn versus budget n, log-log scale, for BRUE, GBOP, KL-OLOP, MDP-GapE and UCT]

Effective branching factor κe:
• κe ≈ 3.6 for BRUE, KL-OLOP, MDP-GapE and UCT
• κe ≈ 1.2 for GBOP, which suggests our results may still hold
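One way to read κe off such log-log curves (my own reconstruction, assuming the empirical regret follows the theoretical rate rn ∝ n^(−log(1/γ)/log κe)):

```latex
% Assumed: the measured curve is log r_n ~ slope * log n + c.
\[
  \log r_n \approx -\frac{\log(1/\gamma)}{\log \kappa_e}\,\log n + c
  \quad\Longrightarrow\quad
  \kappa_e = \exp\!\left(-\frac{\log(1/\gamma)}{\text{slope}}\right).
\]
```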


Thank You!