TRANSCRIPT
Probability
CSE 473 – Autumn 2003
Henry Kautz
ExpectiMax
ExpectiMax(n) =
  U(n)                                        if n is a terminal node
  max{ ExpectiMax(s) | s ∈ children(n) }      if n is a max node
  Σ_{s ∈ children(n)} P(s) · ExpectiMax(s)    if n is a chance node
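The recurrence above can be sketched directly in code. This is a minimal illustration, not from the slides; the `Node` class and its field names (`kind`, `utility`, `children`, `probs`) are hypothetical choices for the example.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    kind: str                       # "terminal", "max", or "chance"
    utility: float = 0.0            # U(n); used only at terminal nodes
    children: List["Node"] = field(default_factory=list)
    probs: List[float] = field(default_factory=list)   # P(s) for each child s

def expectimax(n: Node) -> float:
    if n.kind == "terminal":
        return n.utility                               # U(n)
    if n.kind == "max":
        return max(expectimax(s) for s in n.children)  # best child
    # chance node: probability-weighted average over children
    return sum(p * expectimax(s) for p, s in zip(n.probs, n.children))

# A chance node that yields 1 with probability 2/3 and 0 otherwise:
c = Node("chance",
         children=[Node("terminal", 1.0), Node("terminal", 0.0)],
         probs=[2/3, 1/3])
print(expectimax(c))   # ≈ 0.6667
```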
Hungry Monkey: 2-Ply Game Tree
[Figure: the 2-ply game tree. The root is a max node choosing jump or shake; each action leads to a chance node whose outcomes lead to max nodes for the second action, ending in leaf utilities equal to the number of bananas obtained (0, 1, or 2). Recovered edge probabilities: jump lands the monkey on the table with probability 2/3 (it stays on the floor with probability 1/3); shake yields a banana with probability 2/3 from the table and 1/6 from the floor.]
ExpectiMax 1 – Chance Nodes
[Figure: the same tree with the second-ply chance nodes evaluated from the leaves. For example: shake from the table, (2/3)·1 + (1/3)·0 = 2/3; shake from the floor, (1/6)·1 + (5/6)·0 = 1/6; shake from the floor while already holding a banana, (1/6)·2 + (5/6)·1 = 7/6. A jump at the last step yields no additional banana.]
ExpectiMax 2 – Max Nodes
[Figure: the second-ply max nodes each take the maximum over their jump and shake children, giving 2/3 (on table), 1/6 (on floor), 7/6 (on floor with one banana), and 1/6 (on floor).]
ExpectiMax 3 – Chance Nodes
[Figure: the top-level chance nodes are evaluated next. jump: (2/3)·(2/3) + (1/3)·(1/6) = 1/2; shake: (1/6)·(7/6) + (5/6)·(1/6) = 1/3.]
ExpectiMax 4 – Max Node
[Figure: the root max node takes max(1/2, 1/3) = 1/2, so the optimal first action is jump.]
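The four passes above can be checked with a short script. This is a sketch under the transition model recovered from the slides (jump puts the monkey on the table with probability 2/3; shake yields a banana with probability 2/3 on the table and 1/6 on the floor); the function and parameter names are chosen for illustration.

```python
def value(on_table: bool, bananas: int, steps: int) -> float:
    """ExpectiMax value of a Hungry Monkey state with `steps` actions left."""
    if steps == 0:
        return bananas                      # utility = bananas obtained
    p = 2/3 if on_table else 1/6            # shake success probability
    shake = p * value(on_table, bananas + 1, steps - 1) \
          + (1 - p) * value(on_table, bananas, steps - 1)
    if on_table:
        jump = value(True, bananas, steps - 1)      # nothing to gain
    else:
        jump = 2/3 * value(True, bananas, steps - 1) \
             + 1/3 * value(False, bananas, steps - 1)
    return max(jump, shake)

print(value(False, 0, 2))   # ≈ 0.5, matching the root value on the slide
```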
Policies
• The result of the ExpectiMax analysis is a conditional plan (also called a policy):
  – Optimal plan for 2 steps: jump; shake
  – Optimal plan for 3 steps: jump; if (ontable) {shake; shake} else {jump; shake}
• Probabilistic planning can be generalized in many ways, including:
  – Action costs
  – Hidden state
• The general problem is that of solving a Markov Decision Process (MDP)
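The policy is just the argmax taken at each max node, so it branches on the observed state. A self-contained sketch under the same assumed Hungry Monkey model (jump succeeds with probability 2/3; shake yields a banana with probability 2/3 on the table, 1/6 on the floor); all names here are illustrative:

```python
def value(on_table, bananas, steps):
    if steps == 0:
        return bananas
    return max(q for _, q in q_values(on_table, bananas, steps))

def q_values(on_table, bananas, steps):
    """Expected utility of each action in the given state."""
    p = 2/3 if on_table else 1/6            # shake success probability
    shake = p * value(on_table, bananas + 1, steps - 1) \
          + (1 - p) * value(on_table, bananas, steps - 1)
    if on_table:
        jump = value(True, bananas, steps - 1)
    else:
        jump = 2/3 * value(True, bananas, steps - 1) \
             + 1/3 * value(False, bananas, steps - 1)
    return [("jump", jump), ("shake", shake)]

def best_action(on_table, bananas, steps):
    return max(q_values(on_table, bananas, steps), key=lambda aq: aq[1])[0]

# The 3-step conditional plan: act, then branch on where the jump landed.
print(best_action(False, 0, 3))   # jump
print(best_action(True, 0, 2))    # shake  (then shake again)
print(best_action(False, 0, 2))   # jump   (then shake)
```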
2 Player Games of Chance
ExpectiMiniMax(n) =
  U(n)                                           if n is a terminal node
  max{ ExpectiMiniMax(s) | s ∈ children(n) }     if n is a max node
  min{ ExpectiMiniMax(s) | s ∈ children(n) }     if n is a min node
  Σ_{s ∈ children(n)} P(s) · ExpectiMiniMax(s)   if n is a chance node
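The recurrence above extends ExpectiMax with min nodes for the opponent. A minimal sketch; the tuple-based node representation is a hypothetical choice for illustration.

```python
def expectiminimax(node):
    kind = node[0]
    if kind == "terminal":
        return node[1]                      # U(n)
    if kind == "max":
        return max(expectiminimax(s) for s in node[1])
    if kind == "min":
        return min(expectiminimax(s) for s in node[1])
    # chance node: node[1] is a list of (P(s), child) pairs
    return sum(p * expectiminimax(s) for p, s in node[1])

# A fair coin-flip chance node; on one outcome the opponent (min) replies:
tree = ("chance", [
    (0.5, ("min", [("terminal", 3), ("terminal", 1)])),   # min picks 1
    (0.5, ("terminal", 4)),
])
print(expectiminimax(tree))   # 0.5*1 + 0.5*4 = 2.5
```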
Backgammon
• Branching factor:
  – Chance node: 21
  – Max node: about 20 on average
  – Size of tree: O(c^k m^k)
  – In practice: can search 3 plies
• Neurogammon & TD-Gammon (Tesauro 1995)
  – Learned weights on static evaluation function by playing against itself
  – Use results of games to optimize weights:
    • “Punish” features that were on in losing games
    • “Reward” features that were on in winning games
  – A kind of reinforcement learning
  – Became world’s best backgammon player!
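The “punish/reward” idea can be sketched as a simple weight update. This is only an illustration of the description above, assuming binary features and a hypothetical learning rate; TD-Gammon’s actual update is the more refined temporal-difference rule.

```python
def update_weights(weights, active_features, won, lr=0.01):
    """Nudge the weight of every feature that was on during the game:
    up if the game was won, down if it was lost.  (Illustrative only;
    `lr` is a hypothetical learning rate, not TD-Gammon's.)"""
    sign = 1.0 if won else -1.0
    return [w + sign * lr if active else w
            for w, active in zip(weights, active_features)]

w = [0.5, 0.5, 0.5]
w = update_weights(w, [True, False, True], won=True)   # reward features 0 and 2
print(w)   # ≈ [0.51, 0.5, 0.51]
```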