adversarial search continued: stochastic gamesbarto/courses/cs383-fall11/383... · 2011. 10. 4. ·...
TRANSCRIPT
![Page 1: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/1.jpg)
1
CMPSCI 383 October 4, 2011
Adversarial Search Continued: Stochastic Games
![Page 2: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/2.jpg)
2
Tip for doing well
• Begin work on assignments early!
![Page 3: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/3.jpg)
3
Game tree (2=player, deterministic, turns)
![Page 4: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/4.jpg)
4
Minimax algorithm
• Perfect play for deterministic games
• Idea — select moves with highest minimax value. That is, select the best achievable payoff against best play by your opponent
![Page 5: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/5.jpg)
5
α-β pruning
![Page 6: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/6.jpg)
6
α-β pruning
![Page 7: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/7.jpg)
7
Example: α-β pruning
![Page 8: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/8.jpg)
8
Example: α-β pruning
![Page 9: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/9.jpg)
9
Example: α-β pruning
![Page 10: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/10.jpg)
10
Example: α-β pruning
![Page 11: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/11.jpg)
11
Why is it called α-β?
• α is the value of the best (highest-value) choice found so far at any choice point along the path for MAX • If v is worse than α, MAX will avoid it, so that branch can be
pruned.
• β is the value of the best (lowest-value) choice found so far at any choice point along the path for MIN • If v is worse than β , MIN will avoid it, so that branch can
be pruned.
If m is better than n, we will never get to n
![Page 12: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/12.jpg)
12
What if a game has a “chance element”?
![Page 13: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/13.jpg)
13
Chance nodes
We know how to value the other nodes. How do we value chance nodes?
![Page 14: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/14.jpg)
14
Expected value
• The sum of the probability of each possible outcome multiplied by its value:
• Are there pathological cases where this statistic could do something strange? • Extreme values (“outliers”) • Functions that are a non-linear transformation
of the probability of winning
€
E(X) = pixii∑
![Page 15: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/15.jpg)
15
Expected minimax value
• Now three different cases to evaluate, rather than just two. • MAX • MIN • CHANCE
EXPECTIMINIMAX(n) = UTILITY(n), If terminal node maxs ∈ successors(n) EXPRCTIMINIMAX(s), If n is MAX node mins ∈ successors(n) EXPECTIMINIMAX(s), If n is MIN node ∑s ∈ successors(n) P(s) • EXPECTIMINIMAX(s), If n is CHANCE node
![Page 16: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/16.jpg)
16
Expectiminimaxing
![Page 17: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/17.jpg)
17
In Backgammon
![Page 18: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/18.jpg)
18
Evaluation functions
• So cut off the search and evaluate leaves with an evaluation function (as in H-MINIMAX)
• But do evaluation functions behave the same way in stochastic games? • Just need to order nodes in the right way, so
particular values are not so important.
• Chance nodes make things more difficult.
![Page 19: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/19.jpg)
19
Particular values DO matter
Order-preserving transformation
![Page 20: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/20.jpg)
20
![Page 21: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/21.jpg)
21
Partially Observable Games (Games of Imperfect Information)
argmaxa ∑s P(s) MINIMAX(RESULT(s,a)),
argmaxa 1/N ∑ MINIMAX(RESULT(si,a)), N
i = 1
Monte Carlo approximation:
“averaging over clairvoyance”
![Page 22: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/22.jpg)
22
Example: Kriegspiel
Maintain belief states
![Page 23: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/23.jpg)
23
Rollouts
• Play out a position to completion several thousand times with different random dice sequences.
• The best play is assumed to be the one that produced the best outcome statistics in the rollout.
• What program should select moves?
![Page 24: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/24.jpg)
24
TD Gammon
Tesauro 1992, 1994, 1995, ...
• White has just rolled a 5 and a 2 so can move one of his pieces 5 and one (possibly the same) 2 steps
• Objective is to advance all pieces to points 19-24
• Hitting • Doubling • 30 pieces, 24 locations implies
enormous number of configurations
• Effective branching factor of 400
![Page 25: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/25.jpg)
25
Multi-layer Neural Network
![Page 26: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/26.jpg)
26
Summary of TD-Gammon Results
Bill Robertie: world-class human grandmaster and former World Champion
TD-Gammon 2.1 plays at a strong master level that is extremely close to equaling the world’s best human planers. ….would be the favorite in a long money game session or grueling tournament like the World Cup competition (it never gets tired or careless).
![Page 27: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/27.jpg)
27
A Few Details
• Reward: 0 at all times except those in which the game is won, when it is 1
• Episodic (game = episode), undiscounted • Gradient descent TD(λ) with a multi-layer
neural network • weights initialized to small random numbers • backpropagation of TD error • four input units for each point; unary encoding
of number of white pieces, plus other features • Use of afterstates • Learning during self-play
![Page 28: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/28.jpg)
28
MENACE (Michie, 1961) “Matchbox Educable Noughts and Crosses Engine”
x"
x"
x"
x"
o"
o"
o"
o"
o"
x"
x"
x"x"
o"
o"
o"x"
x"
x"o"
o"
o"x" x"
x" x"
x" x"
x" x"
o"
o"
x"o"
o"x"
x"
x"
x"x"
x" o"
o"
o"
o"
o"
o"
o"
o"
o"
o"
o"
x"
x"
x"
o"
o"
o"
o"o"
o"
o"
x"
x"
x"
o"o"x"
x"
x"
x"
o"
o"
o"
x" o"
o"
o"
o"
o"
o"
o"
o"
o"
x"
x"x"
x"
o"o"
o"
o"
o"
x"
x"
x"
x"
o"
o"
o"
o"
o"
o"
o"
o"
x"
x"
x"
x"
o"
o"
o"
o"
o"
o"
o"
x"
x"
x"
x"
o"
o"
o"
o"x"
x"
x"
x"
o"
o"
o"
o"x"
x"
![Page 29: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/29.jpg)
29
Tic-Tac-Toe
X X X O O X
X O X
O X O
X O X
X O X
O X O
X O X
O X O
X
} x’s move
} x’s move
} o’s move
} x’s move
} o’s move
...
... ... ...
... ... ... ... ...
x x x
x o x
o x o
x
x
x x o o
Assume an imperfect opponent: —he/she sometimes makes mistakes
![Page 30: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/30.jpg)
30
1. Make a table with one entry per state:
2. Now play lots of games. To pick our moves,
look ahead one step:
State V(s) – estimated probability of winning: value .5 ? .5 ? . . .
. . .
. . . . . .
1 win
0 loss
. . . . . .
0 draw
x
x x x o o
o o o x x
o o o o x x x x o
current state
various possible next states *
Just pick the next state with the highest estimated prob. of winning — the largest V(s); a greedy move.
But 10% of the time pick a move at random; an exploratory move.
Learning an evaluation function
![Page 31: Adversarial Search Continued: Stochastic Gamesbarto/courses/CS383-Fall11/383... · 2011. 10. 4. · Bill Robertie: world-class human grandmaster and former World Champion TD-Gammon](https://reader035.vdocuments.us/reader035/viewer/2022071418/61167228f8d2d10e742fdb82/html5/thumbnails/31.jpg)
31
“Exploratory” move
– the state before our greedy move – the state after our greedy move
x y
We increment each V(x) toward V(y) – a backup :
a small positive fraction, the step – size parameter
← V(x) V(x) + α [V(y) – V(x)]
A Temporal-Difference (TD) method
Backups