david silver - ucl · the real world real-world problems are messy huge state spaces huge branching...

Simulation-Based Search

David Silver

Part I: Background

The Real World

Real-world problems are MESSY

Huge state spaces

Huge branching factors

Long-term consequences of actions

No good heuristics

Traditional search algorithms fail (e.g. A*, alpha-beta)

Example: The Game of Go

The ancient oriental game of Go is ~4000 years old

Usually played on 19x19 board

Simple rules, complex strategy

Black and white place down stones alternately

Capturing

If stones are completely surrounded they are captured

Winner

The game is finished when both players pass

The intersections surrounded by each player are known as territory

The player with more territory wins the game

Shape Knowledge in Go

Go players utilise a large vocabulary of shapes:

One-point jump

Ponnuki

Hane

A Grand Challenge for AI

Huge state space: 10^170 states

Big branching factor: 361 actions

Long-term consequences: hundreds of moves

No good heuristics: amateur level after 40 years

Traditional search has failed in Computer Go

Progress In 19x19 Computer Go

[Figure: program rank (15 kyu to 7 dan, beginner to master) by year, 1996 to 2010. Traditional search programs shown: Many Faces of Go, Go++, Handtalk.]

Progress In 19x19 Computer Go

[Figure: the same rank-by-year plot (15 kyu to 7 dan, 1996 to 2010), now showing Monte-Carlo search programs (Indigo, CrazyStone, MoGo, Zen) alongside traditional search.]

Success of Monte-Carlo Search

Human master level in Computer Go

Superhuman, world-champion level in Backgammon and Scrabble

Computer world champion in general game playing, Hex, Amazons, Lines of Action, Hearts, Skat, ...

World computer record for many puzzles: Morpion Solitaire, 16x16 Sudoku, SameGame, ...

Part II: Monte-Carlo Tree Search

Position Evaluation

Game outcome z

Black wins z=1

White wins z=0

Value of position s

$V^{\pi}(s) = \mathbb{E}_{\pi}[z \mid s]$ (estimated by Monte-Carlo simulation)

$V^{*}(s) = \max_{\pi} V^{\pi}(s)$ (estimated by tree search)

Monte-Carlo Simulation

Simulate n random games from current position s

Evaluate position by mean outcome of simulations

$V(s) = \frac{1}{n} \sum_{i=1}^{n} z_i$

Monte-Carlo Simulation

[Figure: four simulations from the current position s, with outcomes 1, 1, 0, 0]

V(s) = 2/4 = 0.5
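The mean-outcome estimate can be sketched in a few lines of Python. This is an illustrative sketch only: `mc_evaluate` and `coin_flip_game` are hypothetical names, and the coin-flip "game" is a stand-in for a real random playout, not anything from the slides.

```python
import random

def mc_evaluate(simulate, state, n, rng):
    """Estimate V(state) as the mean outcome z of n simulated games."""
    outcomes = [simulate(state, rng) for _ in range(n)]
    return sum(outcomes) / n

def coin_flip_game(state, rng):
    # Hypothetical stand-in for a random playout:
    # Black wins (z = 1) with probability `state`, otherwise z = 0.
    return 1 if rng.random() < state else 0

# Reproduce the four outcomes from the example above: 1, 1, 0, 0.
fixed_outcomes = iter([1, 1, 0, 0])
example_value = mc_evaluate(lambda s, r: next(fixed_outcomes), None, 4, None)
print(example_value)  # 0.5
```

With a real simulator, `simulate` would play a full game from `state` and return z.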

Simple Monte-Carlo Search

Run Monte-Carlo simulation from the current position s, for each action a

Select action maximising Monte-Carlo value
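As a rough illustration of simple Monte-Carlo search, the sketch below uses a toy game that is an assumption of this example, not part of the slides: players alternately remove 1 to 3 stones from a pile, and whoever takes the last stone wins. All function names are hypothetical.

```python
import random

def legal_actions(stones):
    return [a for a in (1, 2, 3) if a <= stones]

def rollout(stones, to_move, rng):
    """Play uniformly random moves to the end.

    Returns z = 1 if player 0 takes the last stone, else 0.
    Requires stones > 0.
    """
    while True:
        stones -= rng.choice(legal_actions(stones))
        if stones == 0:
            return 1 if to_move == 0 else 0
        to_move = 1 - to_move

def simple_mc_search(stones, n, rng):
    """Player 0 to move: evaluate each action by n rollouts, pick the best mean."""
    values = {}
    for a in legal_actions(stones):
        remaining = stones - a
        if remaining == 0:
            values[a] = 1.0  # taking the last stone wins immediately
        else:
            wins = sum(rollout(remaining, 1, rng) for _ in range(n))
            values[a] = wins / n
    best = max(values, key=values.get)
    return best, values

action, values = simple_mc_search(5, 2000, random.Random(0))
print(action, values)
```

From a pile of 5, leaving 4 stones is the move whose random-playout value is highest (about 2/3 against 1/2 for the alternatives), so the search should settle on removing one stone.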

Monte-Carlo Tree Search

Monte-Carlo simulation from the current position s

Builds a search tree containing Monte-Carlo values

Simulation policy has two phases:

Tree policy (e.g. greedy)

Default policy (e.g. uniform random)
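The two-phase simulation policy can be sketched as follows. This is a minimal sketch under stated assumptions: the toy game (remove 1 to 3 stones, taking the last stone wins), the `Node` class, and all function names are hypothetical, and the tree policy is the plain greedy choice mentioned above, with untried actions expanded first.

```python
import random

ACTIONS = (1, 2, 3)

class Node:
    def __init__(self, stones, to_move):
        self.stones, self.to_move = stones, to_move
        self.children = {}  # action -> Node
        self.n = 0          # visit count
        self.wins = 0.0     # sum of outcomes z (z = 1 means player 0 wins)

    def value(self):
        return self.wins / self.n if self.n else 0.5

def legal(stones):
    return [a for a in ACTIONS if a <= stones]

def default_policy(stones, to_move, rng):
    """Uniform random playout; returns z = 1 if player 0 takes the last stone."""
    last = 1 - to_move  # player who made the previous move
    while stones > 0:
        stones -= rng.choice(legal(stones))
        last, to_move = to_move, 1 - to_move
    return 1 if last == 0 else 0

def tree_policy(node, rng):
    """Try unvisited actions first, otherwise act greedily for the player to move."""
    untried = [a for a in legal(node.stones) if a not in node.children]
    if untried:
        return rng.choice(untried)
    sign = 1 if node.to_move == 0 else -1
    return max(node.children, key=lambda a: sign * node.children[a].value())

def simulate(root, rng):
    path, node = [root], root
    # Phase 1: tree policy, until a new node is added or the game ends.
    while node.stones > 0:
        a = tree_policy(node, rng)
        is_new = a not in node.children
        if is_new:
            node.children[a] = Node(node.stones - a, 1 - node.to_move)
        node = node.children[a]
        path.append(node)
        if is_new:
            break
    # Phase 2: default policy from the leaf to the end of the game.
    z = default_policy(node.stones, node.to_move, rng)
    # Back up the outcome along the visited path.
    for visited in path:
        visited.n += 1
        visited.wins += z

def mcts(stones, n_simulations, rng):
    root = Node(stones, 0)
    for _ in range(n_simulations):
        simulate(root, rng)
    best = max(root.children, key=lambda a: root.children[a].n)
    return best, root
```

Pure greedy selection can lock onto an action that happened to win its first playout, so this sketch is not guaranteed to find the best move.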


Optimism in the Face of Uncertainty

Want to exploit the knowledge we have accumulated

Search the best nodes of the tree most deeply

Want to explore to accumulate more knowledge

Search the most uncertain nodes of the tree

Solution: use upper confidence bound on value

$Q^{\oplus}(s, a) = Q(s, a) + c \sqrt{\frac{\log n(s)}{n(s, a)}}$

UCT (Upper Confidence Trees)

$Q^{\oplus}(s, a) = Q(s, a) + c \sqrt{\frac{\log n(s)}{n(s, a)}}$

Exploitation: the $Q(s, a)$ term

Exploration: the $c \sqrt{\log n(s) / n(s, a)}$ term
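A small sketch of action selection with the upper confidence bound is given here. The function names and the example statistics are hypothetical. Treating an untried action as having an infinite bound, and taking n(s) as the sum of the action counts, are common conventions assumed for this sketch rather than something the slides state.

```python
import math

def ucb_value(q, n_parent, n_child, c=1.0):
    """Q+(s, a) = Q(s, a) + c * sqrt(log n(s) / n(s, a))."""
    if n_child == 0:
        return math.inf  # assumption: untried actions are tried first
    return q + c * math.sqrt(math.log(n_parent) / n_child)

def select_action(stats, c=1.0):
    """stats maps action -> (Q(s, a), n(s, a)); n(s) is the sum of the counts."""
    n_parent = sum(n for _, n in stats.values())
    return max(stats, key=lambda a: ucb_value(stats[a][0], n_parent, stats[a][1], c))

# Hypothetical statistics: A looks better, but B has barely been explored.
stats = {"A": (0.6, 100), "B": (0.5, 4)}
print(select_action(stats, c=1.0))  # B: the exploration bonus dominates
print(select_action(stats, c=0.0))  # A: pure exploitation
```

The constant c trades off the two terms: c = 0 recovers greedy selection, and larger values spend more simulations on rarely visited actions.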

Weaknesses of MCTS

No generalisation between similar positions

No knowledge beyond the search tree

Monte-Carlo evaluation is high variance

Highly dependent on random rollout policy

Part III: Heuristic MC-RAVE

MoGo

Generalise value of same move in similar positions

Use a heuristic function to initialise leaf nodes

(Gelly and Silver, 2007)

Rapid Action Value Estimate (RAVE)

Assume that the value of move is the same

Regardless of when move is played


MC value of C3 = 0/1, RAVE value of C3 = 3/5

MC value of C3 = 1/1, RAVE value of C3 = 2/3
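The difference between the two statistics can be sketched as follows. The simulation records below are made-up data, chosen only so that the counts match the first pair of values quoted above (MC 0/1, RAVE 3/5). The function name and the record layout are hypothetical.

```python
def mc_and_rave(simulations, move):
    """Return ((mc_wins, mc_count), (rave_wins, rave_count)) for `move`.

    Each simulation is (moves played by the player to move, in order; outcome z).
    MC counts only simulations where `move` was played first.
    RAVE counts simulations where `move` was played at any later point too.
    """
    mc_n = mc_w = rave_n = rave_w = 0
    for own_moves, z in simulations:
        if own_moves and own_moves[0] == move:
            mc_n += 1
            mc_w += z
        if move in own_moves:
            rave_n += 1
            rave_w += z
    return (mc_w, mc_n), (rave_w, rave_n)

# Made-up simulations from one position (move names are illustrative).
sims = [
    (["C3", "A1"], 0),        # C3 played first, game lost
    (["B2", "C3"], 1),        # C3 played later, game won
    (["A1", "C3"], 1),
    (["D4", "B2", "C3"], 1),
    (["B2", "C3"], 0),
    (["A1", "B2"], 1),        # C3 never played: ignored by both statistics
]
print(mc_and_rave(sims, "C3"))  # ((0, 1), (3, 5))
```

RAVE gathers five samples where plain Monte-Carlo has one, which is why it is useful early on, at the cost of assuming the move is equally good whenever it is played.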

MC-RAVE

Monte-Carlo value is unbiased but has high variance

RAVE value is biased but has low variance

Combine both values so as to minimise MSE
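One way to combine the two values is a weighted average whose weight shifts from RAVE to Monte-Carlo as the visit count grows. The slides do not give the weighting formula. The schedule below, beta = sqrt(k / (3n + k)), is a hand-selected schedule from the published MC-RAVE work and is used here as an assumption. It is a heuristic approximation, not the MSE-minimising weight itself, and the function names are hypothetical.

```python
import math

def rave_weight(n, k):
    """Assumed schedule: beta = sqrt(k / (3n + k)).

    beta = 1 when n = 0 (trust RAVE only) and beta = 1/2 when n = k.
    """
    return math.sqrt(k / (3 * n + k))

def mc_rave_value(q_mc, q_rave, n, k=1000):
    """Blend the Monte-Carlo value and the RAVE value for visit count n."""
    beta = rave_weight(n, k)
    return (1 - beta) * q_mc + beta * q_rave

for n in (0, 10, 1000, 100000):
    print(n, round(rave_weight(n, 1000), 3), round(mc_rave_value(0.0, 0.6, n), 3))
```

Under this schedule, k is the number of simulations at which the two estimates receive equal weight.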

MC-RAVE in MoGo

[Figure: winning rate against GnuGo 3.7.10 as a function of k (log scale, 1 to 10000), comparing UCT and MC-RAVE]

Heuristic MCTS

[Figure: search-tree node s annotated with its value V(s) and count n(s)]
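Initialising a node from a heuristic can be sketched as giving it a prior value together with a count of "virtual" simulations. Later outcomes then update the mean as usual. The class name and the numbers are hypothetical, and treating the heuristic as pseudo-counts is an assumption of this sketch.

```python
class HeuristicNode:
    """Node whose value starts from a heuristic instead of from zero visits."""

    def __init__(self, prior_value=0.0, prior_count=0):
        self.value = prior_value  # V(s), initialised by the heuristic
        self.n = prior_count      # n(s), confidence in the heuristic

    def update(self, z):
        """Incremental mean update with a simulation outcome z."""
        self.n += 1
        self.value += (z - self.value) / self.n

plain = HeuristicNode()            # no heuristic: first outcome decides everything
primed = HeuristicNode(0.5, 50)    # heuristic says 0.5, worth 50 simulations
for node in (plain, primed):
    node.update(1)
print(plain.value, primed.value)
```

After one winning simulation the plain node jumps to 1.0, while the primed node moves only slightly above 0.5. This is the variance reduction the heuristic is meant to provide.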

Heuristic MCTS in MoGo

[Figure: winning rate against GnuGo 3.7.10 (default level) as a function of n̂h (0 to 200), comparing local shape features, handcrafted heuristic, grandfather heuristic, even game heuristic, and UCT-RAVE]

MoGo (2007)

• MoGo = heuristic MCTS + MC-RAVE + handcrafted default policy

• 99% winning rate against best traditional programs

• Highest rating on 9x9 and 19x19 Computer Go Server

• Gold medal at Computer Go Olympiad

• First victory against 9-dan professional player (9x9)

MoGo (2009)

• MoGo += massive parallelisation, expert Go knowledge, better heuristics, ...

• First victory against 9-dan professional player (19x19, 7-stone handicap)

• Is the end nigh for humankind?

MoGo: 13x13 Scalability

Part IV: Temporal-Difference Search

MC vs. TD Learning

In reinforcement learning:

TD learning reduces variance but increases bias

TD learning is usually more efficient than MC

TD(λ) can be much more efficient than MC

Function Approximation

Also in reinforcement learning:

Can use MC or TD with function approximation

Approximate value over large state spaces

Generalise between similar states

e.g. linear function approximation (tile coding, coarse coding, etc.)

Temporal-Difference Search

Simulation-based search

Using TD instead of MC

Using function approximation

Temporal-Difference Search

Consider subgame starting from current state s

Apply temporal-difference learning to subgame:

Simulate games of self-play from s

Update feature weights using TD

Value function is specialised online to current subgame

(Silver et al. 2008)

Linear TD learning

Features: φ(s)

Value function: V(s) = θ·φ(s)

Play many games from start to end

Update feature weights θ after every move

TD-error: δ = V(s’) - V(s)

Weight update: Δθ = αδφ(s)
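A single step of the update above can be sketched in plain Python. The function names are hypothetical. The slide does not spell out the final move of a game. Using the game outcome z in place of V(s') at that point is a common convention and is an assumption here.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def td_update(theta, phi_s, phi_next, alpha, outcome=None):
    """One linear TD(0) step: delta = V(s') - V(s), theta += alpha * delta * phi(s).

    If `outcome` is given, s' is terminal and z replaces V(s') (assumption).
    Returns the new weights and the TD-error.
    """
    v_s = dot(theta, phi_s)
    target = outcome if outcome is not None else dot(theta, phi_next)
    delta = target - v_s
    new_theta = [w + alpha * delta * x for w, x in zip(theta, phi_s)]
    return new_theta, delta

# Two binary features; s activates the first, s' activates the second.
theta = [0.2, 0.8]
theta, delta = td_update(theta, [1.0, 0.0], [0.0, 1.0], alpha=0.1)
print(theta, delta)
```

Only the weights of features active in s change, and they move towards the value of the successor state.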

Linear TD search

Features: φ(s)

Value function: V(s) = θ·φ(s)

Simulate many games from current position

Update feature weights θ after every simulated move

TD-error: δ = V(s’) - V(s)

Weight update: Δθ = αδφ(s)
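The loop structure of TD search, simulating from the current position and updating weights after every simulated move, can be sketched on a toy game. Everything here is an assumption of the sketch: the game (remove 1 to 3 stones, taking the last stone wins), the one-hot features, the epsilon-greedy simulation policy, and the function names.

```python
import random

def legal(stones):
    return [a for a in (1, 2, 3) if a <= stones]

def features(stones, to_move, max_stones):
    """One-hot phi(s) over (stones, player to move)."""
    phi = [0.0] * (2 * (max_stones + 1))
    phi[2 * stones + to_move] = 1.0
    return phi

def value(theta, phi):
    return sum(w * x for w, x in zip(theta, phi))

def choose(theta, stones, to_move, max_stones, epsilon, rng):
    """Epsilon-greedy one-ply lookahead; player 0 maximises V, player 1 minimises."""
    actions = legal(stones)
    if rng.random() < epsilon:
        return rng.choice(actions)
    sign = 1 if to_move == 0 else -1

    def after(a):
        left = stones - a
        if left == 0:  # the mover takes the last stone and wins
            return sign * (1.0 if to_move == 0 else 0.0)
        return sign * value(theta, features(left, 1 - to_move, max_stones))

    return max(actions, key=after)

def td_search(stones, n_simulations, rng, alpha=0.1, epsilon=0.1):
    theta = [0.5] * (2 * (stones + 1))  # V(s) = 0.5 everywhere initially
    for _ in range(n_simulations):
        s, p = stones, 0                # every simulation starts from "now"
        while s > 0:
            a = choose(theta, s, p, stones, epsilon, rng)
            s_next, p_next = s - a, 1 - p
            phi = features(s, p, stones)
            if s_next == 0:
                target = 1.0 if p == 0 else 0.0  # outcome z at the end
            else:
                target = value(theta, features(s_next, p_next, stones))
            delta = target - value(theta, phi)   # TD-error
            theta = [w + alpha * delta * x for w, x in zip(theta, phi)]
            s, p = s_next, p_next
    return choose(theta, stones, 0, stones, 0.0, rng), theta

best, theta = td_search(5, 500, random.Random(0))
print(best)
```

Unlike TD learning over whole games, the weights here are fitted only to positions reachable from the current one, which is what specialises the value function to the current subgame.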

Local Shape Features

Binary features matching a local configuration of stones

All possible locations and configurations from 1x1 to 3x3

~1 million features for 9x9 Go
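The figure of roughly one million can be checked with a naive count. The sketch assumes square templates only, every board position of each template, and three states per intersection (empty, black, white). The actual feature set may be counted differently.

```python
def count_local_shape_features(board_size=9, max_template=3, states_per_point=3):
    """Count (location, configuration) pairs for k x k templates, k = 1..max_template."""
    total = 0
    for k in range(1, max_template + 1):
        locations = (board_size - k + 1) ** 2       # placements of a k x k window
        configurations = states_per_point ** (k * k)  # stone patterns in the window
        total += locations * configurations
    return total

print(count_local_shape_features())  # 969894
```

The 3x3 templates dominate: 49 locations times 3^9 = 19683 configurations gives 964467 of the total.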

TD Learning

Current State

Learning

$Q(s, a) = \phi(s, a)^{\top} \theta$

Feature weights

TD Search

Search

Current State

$\bar{Q}(s, a) = \phi(s, a)^{\top} \theta + \bar{\phi}(s, a)^{\top} \bar{\theta}$

Feature weights

Empty Triangle

Temporal difference learning:

Local shape feature acquires a negative weight

Guzumi

Temporal difference search:

Local shape feature acquires a positive weight

Blood Vomiting Game

Dyna-2: Two Memories

A memory is a vector of feature weights

Long-term memory updated by TD learning

General domain knowledge

Short-term memory updated by TD search

Specific knowledge about current situation

Positions are evaluated by combining both memories
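Combining the two memories amounts to summing two linear evaluations. In the sketch below the feature vectors and weights are hypothetical numbers, loosely modelled on the empty-triangle example: a shape that is bad in general but good in the current position. Starting the short-term weights at zero before each search is an assumption of this sketch.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def combined_value(phi, theta, phi_bar, theta_bar):
    """Q-bar(s, a) = phi . theta (long-term) + phi_bar . theta_bar (short-term)."""
    return dot(phi, theta) + dot(phi_bar, theta_bar)

# One binary feature, e.g. "this move forms an empty triangle" (illustrative).
phi = phi_bar = [1.0]
theta = [-0.3]          # long-term memory: the shape is usually bad
theta_bar = [0.0]       # short-term memory before any search (assumed reset)
print(combined_value(phi, theta, phi_bar, theta_bar))   # -0.3

theta_bar = [0.5]       # after TD search: the shape works in this position
print(combined_value(phi, theta, phi_bar, theta_bar))
```

Before any search the evaluation equals the long-term value. The short-term weights then act as a position-specific correction.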

Results for Dyna-2

[Figure: results comparing TD Learning + TD Search, TD Search, TD Learning, and UCT]

Dynamic Evaluation

Traditional search algorithms (e.g. alpha-beta) use a static evaluation function

Dyna-2 provides a dynamic evaluation function

Re-learn the evaluation function after every move

Dyna-2 + Alpha-Beta: Results

RLGO

RLGO = TD Learning + TD Search (+ Alpha-Beta)

Outperforms all handcrafted, traditional search and traditional machine learning programs

Outperforms MCTS in 9x9 Go and scales better with board size

A New Paradigm for Big AI

Given a generative model of the world’s dynamics:

Consider subproblem starting from now

Simulate experience from now with the model

Apply reinforcement learning to simulations

Questions?
