david silver - ucl · the real world real-world problems are messy huge state spaces huge branching...

Simulation-Based Search David Silver


Post on 17-Nov-2019



Simulation-Based Search

David Silver


Part I: Background


The Real World

Real-world problems are MESSY

Huge state spaces

Huge branching factors

Long-term consequences of actions

No good heuristics

Traditional search algorithms fail (e.g. A*, alpha-beta)


Example: The Game of Go

The ancient oriental game of Go is ~4000 years old

Usually played on 19x19 board

Simple rules, complex strategy

Black and white place down stones alternately


Capturing

If stones are completely surrounded they are captured


Winner

The game is finished when both players pass

The intersections surrounded by each player are known as territory

The player with more territory wins the game


Shape Knowledge in Go

Go players utilise a large vocabulary of shapes:

One-point jump

Ponnuki

Hane


A Grand Challenge for AI

Huge state space: $10^{170}$ states

Big branching factor: 361 actions

Long-term consequences: hundreds of moves

No good heuristics: amateur level after 40 years

Traditional search has failed in Computer Go

Progress In 19x19 Computer Go

[Figure, shown in two versions: playing strength of 19x19 Go programs by year, 1996 to 2010. Vertical axis: rank from 15 kyu (beginner) up to 7 dan (master). The first version shows the Traditional Search series, with programs labelled Many Faces of Go, Go++ and Handtalk. The second version adds the Monte-Carlo Search series, with programs labelled Indigo, CrazyStone, MoGo and Zen.]


Success of Monte-Carlo Search

Human master level in Computer Go

Superhuman, world-champion level in Backgammon, Scrabble

Computer world champion in general game playing, Hex, Amazons, Lines of Action, Hearts, Skat, ...

World computer record for many puzzles: Morpion Solitaire, 16x16 Sudoku, SameGame, ...

Part II: Monte-Carlo Tree Search

Position Evaluation

Game outcome z

Black wins z=1

White wins z=0

Value of position s

Vπ(s) = Eπ[z|s]

V*(s) = maxπ Vπ(s)

<= Monte-Carlo simulation

<= Tree search


Monte-Carlo Simulation

Simulate n random games from current position s

Evaluate position by mean outcome of simulations

$V(s) = \frac{1}{n} \sum_{i=1}^{n} z_i$


Monte-Carlo Simulation

Current position s

Simulation

1 1 0 0 Outcomes

V(s) = 2/4 = 0.5
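The averaging step above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: `monte_carlo_value`, `coin_flip_game` and the `rng` argument are hypothetical names, and the "game" is a stand-in that returns an outcome z directly rather than playing out real moves.

```python
import random

def monte_carlo_value(simulate, state, n, rng):
    """Evaluate `state` as the mean outcome z of n simulated games."""
    outcomes = [simulate(state, rng) for _ in range(n)]
    return sum(outcomes) / n

def coin_flip_game(state, rng):
    # Hypothetical stand-in for a random playout:
    # Black wins (z = 1) with probability `state`, otherwise z = 0.
    return 1 if rng.random() < state else 0

# The slide's example: four simulations with outcomes 1, 1, 0, 0.
fixed_outcomes = iter([1, 1, 0, 0])
slide_value = monte_carlo_value(lambda s, r: next(fixed_outcomes), None, 4, None)
print(slide_value)  # 0.5

# A noisier estimate from 1000 simulated playouts of the stand-in game.
estimate = monte_carlo_value(coin_flip_game, 0.7, 1000, random.Random(0))
```

With more simulations the estimate settles closer to the true win probability, which is the sense in which simulation replaces a handcrafted evaluation function.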


Simple Monte-Carlo Search

Run Monte-Carlo simulation from the current position s, for each action a

Select action maximising Monte-Carlo value
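As a rough sketch of simple Monte-Carlo search, the example below uses a toy subtraction game rather than Go (an assumption made to keep it short): players alternately remove 1 to 3 stones from a pile, and whoever takes the last stone wins. The function names are hypothetical. Each action is evaluated by uniformly random playouts from the position it leads to, and the action with the highest mean outcome is selected.

```python
import random

def legal_actions(stones):
    """Moves available in the toy game: take 1, 2 or 3 stones."""
    return [a for a in (1, 2, 3) if a <= stones]

def random_playout(stones, to_move, rng):
    """Play uniformly random moves to the end; return the winner (0 or 1)."""
    if stones == 0:
        return 1 - to_move  # the previous player took the last stone
    while True:
        stones -= rng.choice(legal_actions(stones))
        if stones == 0:
            return to_move
        to_move = 1 - to_move

def simple_mc_search(stones, player, n, rng):
    """Evaluate every action with n random playouts and pick the best one."""
    values = {}
    for a in legal_actions(stones):
        wins = sum(
            random_playout(stones - a, 1 - player, rng) == player
            for _ in range(n)
        )
        values[a] = wins / n
    best = max(values, key=values.get)
    return best, values

best, values = simple_mc_search(3, 0, 200, random.Random(0))
print(best, values)
```

With three stones left, taking all three wins immediately, so that action scores 1.0 in every playout and is selected.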


Monte-Carlo Tree Search

Monte-Carlo simulation from the current position s

Builds a search tree containing Monte-Carlo values

Simulation policy has two phases:

Tree policy (e.g. greedy)

Default policy (e.g. uniform random)

Monte-Carlo Tree Search

[Figures: five consecutive diagram slides with this title; no text survives in the transcript.]

Optimism in the Face of Uncertainty

Want to exploit the knowledge we have accumulated

Search the best nodes of the tree most deeply

Want to explore to accumulate more knowledge

Search the most uncertain nodes of the tree

Solution: use upper confidence bound on value

UCT (Upper Confidence Trees)

$Q^{\oplus}(s, a) = Q(s, a) + c \sqrt{\frac{\log n(s)}{n(s, a)}}$

Exploitation: the first term, $Q(s, a)$, favours actions whose estimated value is high

Exploration: the second term, $c \sqrt{\frac{\log n(s)}{n(s, a)}}$, favours actions that have been tried rarely
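A minimal sketch of how the upper confidence bound could drive action selection inside the tree. The data layout (a dict from action to a `(Q, n)` pair) and the rule that an untried action scores infinity are assumptions for illustration; the slides only give the formula itself.

```python
import math

def uct_score(q, n_parent, n_action, c):
    """Upper confidence bound: Q(s, a) + c * sqrt(log n(s) / n(s, a))."""
    if n_action == 0:
        return math.inf  # assumed convention: try every action at least once
    return q + c * math.sqrt(math.log(n_parent) / n_action)

def select_action(stats, c=1.0):
    """stats maps action -> (Q(s, a), n(s, a)); n(s) is the sum of the counts."""
    n_parent = sum(n for _, n in stats.values())
    return max(
        stats,
        key=lambda a: uct_score(stats[a][0], n_parent, stats[a][1], c),
    )

stats = {"a": (0.6, 90), "b": (0.5, 10)}
greedy_choice = select_action(stats, c=0.0)      # pure exploitation
optimistic_choice = select_action(stats, c=1.0)  # exploration bonus included
```

With `c = 0` the higher-valued action `"a"` is chosen; with `c = 1` the rarely tried action `"b"` receives a larger bonus and is chosen instead, which is the exploitation and exploration trade-off the constant `c` controls.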

Weaknesses of MCTS

No generalisation between similar positions

No knowledge beyond the search tree

Monte-Carlo evaluation is high variance

Highly dependent on random rollout policy

Part III: Heuristic MC-RAVE

MoGo

Generalise value of same move in similar positions

Use a heuristic function to initialise leaf nodes

(Gelly and Silver, 2007)


Rapid Action Value Estimate (RAVE)

Assume that the value of move is the same

Regardless of when move is played

MC value of C3 = 0/1

RAVE value of C3 = 3/5

MC value of C3 = 1/1

RAVE value of C3 = 2/3

MC-RAVE

Monte-Carlo value is unbiased but has high variance

RAVE value is biased but has low variance

Combine both values so as to minimise MSE
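One way to picture the combination is a weighted average whose weight shifts from the RAVE value to the Monte-Carlo value as real simulations accumulate. The sketch below is an assumption, not the MSE-minimising weighting the slide refers to: it swaps in a simple hand-tuned schedule, beta = sqrt(k / (3n + k)), where `k` is a tunable constant, and the function names are hypothetical.

```python
import math

def rave_weight(n, k):
    """Assumed schedule: the weight on the RAVE value decays with n."""
    return math.sqrt(k / (3 * n + k))

def mc_rave_value(q_mc, n_mc, q_rave, k=1000):
    """Blend the unbiased, noisy MC value with the biased, stable RAVE value."""
    beta = rave_weight(n_mc, k)
    return (1 - beta) * q_mc + beta * q_rave

# No Monte-Carlo simulations yet: the estimate is the RAVE value alone.
fresh = mc_rave_value(q_mc=0.0, n_mc=0, q_rave=0.6)

# After many simulations the Monte-Carlo value dominates.
mature = mc_rave_value(q_mc=0.4, n_mc=10**7, q_rave=0.6)
```

The design intent is that RAVE supplies a quick, low-variance guess for a new node, while the bias it introduces fades as the node gathers its own statistics.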

MC-RAVE in MoGo

[Figure: winning rate against GnuGo 3.7.10 (vertical axis, 0.2 to 0.7) as a function of the parameter k (horizontal axis, log scale from 1 to 10000). Series: UCT, MC-RAVE.]

Heuristic MCTS

[Figure: diagram in which a node s is labelled with V(s) and n(s).]

Heuristic MCTS in MoGo

[Figure: winning rate against GnuGo 3.7.10 at its default level (vertical axis, 0.4 to 0.75) as a function of the quantity labelled n̂h (horizontal axis, 0 to 200). Series: Local shape features, Handcrafted heuristic, Grandfather heuristic, Even game heuristic, UCT-RAVE.]


MoGo (2007)

• MoGo = heuristic MCTS + MC-RAVE + handcrafted default policy

• 99% winning rate against best traditional programs

• Highest rating on 9x9 and 19x19 Computer Go Server

• Gold medal at Computer Go Olympiad

• First victory against 9-dan professional player (9x9)


MoGo (2009)

• MoGo += massive parallelisation, expert Go knowledge, better heuristics, ...

• First victory against 9-dan professional player (19x19) (7 stones handicap)

• Is the end nigh for humankind?


MoGo: 13x13 Scalability

Part IV: Temporal-Difference Search

MC vs. TD Learning

In reinforcement learning:

TD learning reduces variance but increases bias

TD learning is usually more efficient than MC

TD(λ) can be much more efficient than MC


Function Approximation

Also in reinforcement learning:

Can use MC or TD with function approximation

Approximate value over large state spaces

Generalise between similar states

e.g. linear function approximation (tile coding, coarse coding, etc.)


Temporal-Difference Search

Simulation-based search

Using TD instead of MC

Using function approximation


Temporal-Difference Search

Consider subgame starting from current state s

Apply temporal-difference learning to subgame:

Simulate games of self-play from s

Update feature weights using TD

Value function is specialised online to current subgame

(Silver et al. 2008)


Linear TD learning

Features: φ(s)

Value function: V(s) = θ·φ(s)

Play many games from start to end

Update feature weights θ after every move

TD-error: δ = V(s’) - V(s)

Weight update: Δθ = αδφ(s)
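The update rule above can be sketched as follows. Two things here are assumptions rather than slide content: weights and features are plain Python lists, and at the end of a game the target is taken to be the outcome z (the slide only shows the non-terminal error δ = V(s') - V(s)). The function names are hypothetical.

```python
def value(theta, phi):
    """Linear value function V(s) = theta . phi(s)."""
    return sum(t * f for t, f in zip(theta, phi))

def td_update(theta, phi_s, phi_next, alpha, terminal_outcome=None):
    """Apply one TD step; return the new weights and the TD-error."""
    if terminal_outcome is not None:
        target = terminal_outcome       # assumed: bootstrap from z at game end
    else:
        target = value(theta, phi_next)  # V(s') as on the slide
    delta = target - value(theta, phi_s)
    new_theta = [t + alpha * delta * f for t, f in zip(theta, phi_s)]
    return new_theta, delta

# Two states with one-hot features; the game ends in a win (z = 1) after the second.
theta = [0.0, 0.0]
theta, d1 = td_update(theta, [1.0, 0.0], [0.0, 1.0], alpha=0.5)
theta, d2 = td_update(theta, [0.0, 1.0], None, alpha=0.5, terminal_outcome=1.0)
theta, d3 = td_update(theta, [1.0, 0.0], [0.0, 1.0], alpha=0.5)
```

The first update changes nothing because both values start at zero; the win then raises the second state's weight, and on the next visit that information propagates back to the first state.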


Linear TD search

Features: φ(s)

Value function: V(s) = θ·φ(s)

Simulate many games from current position

Update feature weights θ after every simulated move

TD-error: δ = V(s’) - V(s)

Weight update: Δθ = αδφ(s)


Local Shape Features

Binary features matching a local configuration of stones

All possible locations and configurations from 1x1 to 3x3

~1 million features for 9x9 Go
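The figure of roughly one million features can be checked with a short count. The sketch assumes square shapes only, every placement of a k x k window on the board, and three possible contents per intersection (empty, black, white), with no merging of symmetric or equivalent shapes; the function name is hypothetical.

```python
def count_local_shape_features(board_size=9, max_shape=3, point_states=3):
    """Count (location, configuration) pairs for square shapes 1x1..max_shape."""
    total = 0
    for k in range(1, max_shape + 1):
        locations = (board_size - k + 1) ** 2    # placements of a k x k window
        configurations = point_states ** (k * k)  # contents of its k*k points
        total += locations * configurations
    return total

print(count_local_shape_features())  # 969894
```

Almost all of the total comes from the 3x3 shapes (49 locations times 19683 configurations), with 5184 features from 2x2 shapes and 243 from 1x1 shapes.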


TD Learning

Current State

Learning

$Q(s, a) = \phi(s, a)^{\top} \theta$

Feature weights


TD Search

Search

Current State

$\bar{Q}(s, a) = \phi(s, a)^{\top} \theta + \bar{\phi}(s, a)^{\top} \bar{\theta}$

Feature weights


Empty Triangle

Temporal difference learning:

Local shape feature acquires a negative weight


Guzumi

Temporal difference search:

Local shape feature acquires a positive weight


Blood Vomiting Game


Dyna-2: Two Memories

A memory is a vector of feature weights

Long-term memory updated by TD learning

General domain knowledge

Short-term memory updated by TD search

Specific knowledge about current situation

Positions are evaluated by combining both memories
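A small sketch of how the two memories might be combined when evaluating a move, following the sum of two linear terms given on the TD Search slide. Representing each memory as a list of weights, and the function names, are assumptions for illustration.

```python
def dot(weights, features):
    """Inner product of a weight vector and a feature vector."""
    return sum(w * f for w, f in zip(weights, features))

def long_term_value(phi_long, theta_long):
    """Evaluation from general domain knowledge only (TD learning)."""
    return dot(theta_long, phi_long)

def dyna2_value(phi_long, theta_long, phi_short, theta_short):
    """Combined evaluation: long-term memory plus short-term correction."""
    return dot(theta_long, phi_long) + dot(theta_short, phi_short)

phi = [1.0, 1.0]             # two active binary features
theta_long = [0.2, -0.1]     # learned across many games
theta_short = [0.0, 0.3]     # learned by search in the current situation

general = long_term_value(phi, theta_long)
combined = dyna2_value(phi, theta_long, phi, theta_short)
```

Here the long-term memory rates the second feature slightly negatively, while the short-term memory has learned during search that it is good in the current position, so the combined evaluation is higher than the general one.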


Results for Dyna-2

[Figure: results plot. Series: TD Learning + TD Search, TD Search, TD Learning, UCT.]


Dynamic Evaluation

Traditional search algorithms (e.g. alpha-beta) use a static evaluation function

Dyna-2 provides a dynamic evaluation function

Re-learn the evaluation function after every move


Dyna-2 + Alpha-Beta: Results


RLGO

RLGO = TD Learning + TD Search (+ Alpha-Beta)

Outperforms all handcrafted, traditional search and traditional machine learning programs

Outperforms MCTS in 9x9 Go and scales better with board size


A New Paradigm for Big AI

Given a generative model of the world’s dynamics:

Consider subproblem starting from now

Simulate experience from now with the model

Apply reinforcement learning to simulations


Questions?