TRANSCRIPT
-
Monte-Carlo tree search for multi-player, no-limit Texas hold'em poker
Guy Van den Broeck
-
Deceptive play
Should I bluff?
-
Opponent modeling
Is he bluffing?
-
Incomplete information
Who has the Ace?
-
Game of chance
What are the odds?
-
Exploitation
I'll bet because he always calls
-
Huge state space
What can happen next?
-
Risk management & continuous action space
Should I bet $5 or $10?
-
Take-Away Message: We can solve all these problems!
Should I bet $5 or $10? Should I bluff?
Who has the Ace? What are the odds?
Is he bluffing? What can happen next?
I'll bet because he always calls
-
Problem Statement
A bot for Texas hold'em poker: no-limit & > 2 players
Not done before! Exploitative, not game-theoretic
Game tree search + opponent modeling
Applies to any problem with incomplete information, non-determinism, or continuous actions
-
Outline
Overview of the approach: the poker game tree, opponent model, Monte-Carlo tree search
Research challenges: search, opponent model
Conclusion
-
Poker Game Tree
Minimax trees: deterministic games (tic-tac-toe, checkers, chess, Go, ...) with alternating max and min nodes
Expecti(mini)max trees: games of chance (backgammon, ...) add chance nodes that mix over outcomes
Miximax trees: hidden information; opponent decisions become mix nodes, weighted by an opponent model
-
Example: backing up values in the poker game tree
my action: fold, call, or raise
fold: value 0
call: Resolve; Reveal Cards is a chance node (probabilities 0.5 and 0.5) with outcomes -1 and 3, so the backed-up value is 1
raise: opp-1 action with opponent-model probabilities fold 0.6, call 0.3, raise 0.1; folding yields 4, calling yields 2, and raising leads to an opp-2 action subtree that yields 0
backed-up value of raise: 0.6 * 4 + 0.3 * 2 + 0.1 * 0 = 3
the max node over my actions picks raise, with value 3
-
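The backup on the example slide can be sketched in code. This is a minimal illustration, not the talk's implementation: max nodes take the best child, while chance nodes (revealed cards) and mix nodes (opponent actions weighted by the opponent model) take probability-weighted averages. The node encoding is made up for this sketch; the numbers are the ones from the slide.

```python
# Sketch: backing up values in a miximax-style tree, using the slide's example
# numbers (fold = 0; call -> 0.5/0.5 chance over -1 and 3; raise -> opponent
# folds 0.6 (value 4), calls 0.3 (value 2), raises 0.1 (value 0)).

def backup(node):
    """Recursively back up expected values through the tree."""
    kind = node["kind"]
    if kind == "leaf":
        return node["value"]
    if kind == "max":                      # our decision: pick the best action
        return max(backup(c) for c in node["children"])
    if kind in ("chance", "mix"):          # chance cards / modeled opponent
        return sum(p * backup(c) for p, c in node["children"])
    raise ValueError(kind)

leaf = lambda v: {"kind": "leaf", "value": v}
tree = {"kind": "max", "children": [
    leaf(0),                                           # fold
    {"kind": "chance", "children": [(0.5, leaf(-1)),   # call: reveal cards
                                    (0.5, leaf(3))]},
    {"kind": "mix", "children": [(0.6, leaf(4)),       # raise: opp-1 folds
                                 (0.3, leaf(2)),       # opp-1 calls
                                 (0.1, leaf(0))]},     # opp-1 raises
]}

print(backup(tree))  # ~3: raising has the highest expected value
```

The same traversal handles all three node kinds, which is what lets the opponent model plug into an otherwise standard expectimax backup.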
Outline
Overview of the approach: the poker game tree, opponent model, Monte-Carlo tree search
Research challenges: search, opponent model
Conclusion
-
Short Experiment
-
Opponent Model
Set of probability trees (Weka's M5'), with separate models for:
Actions
Hand cards at showdown
-
Fold Probability
[probability tree over features such as nbAllPlayerRaises]
-
(Can also be relational: Tilde probability tree [Ponsen08])
-
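To make the fold-probability idea concrete, here is a toy stand-in for the Weka M5' probability trees the talk actually uses: a smoothed frequency estimate of P(fold) conditioned on the slide's nbAllPlayerRaises feature. The class name, the observations, and the Laplace smoothing are all illustrative assumptions, not the talk's model.

```python
# Toy opponent-action model: P(fold | nbAllPlayerRaises), assuming a simple
# smoothed frequency estimator in place of the M5' probability trees.
from collections import defaultdict

class FoldModel:
    def __init__(self):
        self.folds = defaultdict(int)   # fold count per nbAllPlayerRaises
        self.total = defaultdict(int)   # observation count per nbAllPlayerRaises

    def observe(self, nb_raises, folded):
        self.folds[nb_raises] += int(folded)
        self.total[nb_raises] += 1

    def fold_probability(self, nb_raises):
        # Laplace smoothing: unseen contexts give 0.5 instead of 0/0
        return (self.folds[nb_raises] + 1) / (self.total[nb_raises] + 2)

model = FoldModel()
for nb, folded in [(0, False), (0, False), (1, True), (2, True), (2, True)]:
    model.observe(nb, folded)
print(model.fold_probability(2))  # 0.75: this opponent folds often facing two raises
```

A real model would generalize over many features at once, which is exactly what the probability trees provide.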
Opponent Ranks
Learn the distribution of hand ranks at showdown
[histograms: probability per rank bucket, and probability per number of raises]
-
Outline
Overview of the approach: the poker game tree, opponent model, Monte-Carlo tree search
Research challenges: search, opponent model
Conclusion
-
Traversing the tree
Limit Texas hold'em: ~10^18 nodes, fully traversable
No-limit: >10^71 nodes, too large to traverse; sampled, not searched: Monte-Carlo Tree Search
-
Monte-Carlo Tree Search
[Chaslot08]
-
Selection
In each node: v_i is an estimate of the reward, n_i is the number of samples
UCT (Multi-Armed Bandit) selects the child maximizing v_i + C * sqrt(ln(n) / n_i)
The first term drives exploitation, the second term exploration
CrazyStone uses an alternative selection strategy
-
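The UCT rule on the slide can be sketched directly; the statistics and the constant C below are illustrative, not tuned values from the talk.

```python
# Sketch of UCT child selection: pick the child maximizing
# v_i + C * sqrt(ln(n) / n_i), i.e. exploitation plus exploration.
import math

def uct_select(children, c=1.4):
    """children: list of (value_estimate v_i, sample_count n_i). Returns index."""
    n = sum(n_i for _, n_i in children)   # total samples through the parent
    def score(child):
        v_i, n_i = child
        if n_i == 0:
            return float("inf")           # always try unvisited children first
        return v_i + c * math.sqrt(math.log(n) / n_i)
    return max(range(len(children)), key=lambda i: score(children[i]))

# A well-sampled strong child vs. a barely-sampled weaker one:
print(uct_select([(0.6, 90), (0.4, 2)]))  # 1: the uncertain child gets explored
```

The exploration term shrinks as a child accumulates samples, so selection gradually concentrates on the best-looking subtree.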
Expansion & Simulation
-
Backpropagation
v_i is an estimate of the reward, n_i is the number of samples
Two backpropagation rules: sample-weighted average, or maximum child
-
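The two backpropagation rules named on the slide are easy to state in code; the child statistics below are illustrative.

```python
# Sketch of the two backpropagation rules, assuming each child is a
# (value_estimate, sample_count) pair.
def sample_weighted(children):
    """Average of child estimates, weighted by how often each was sampled."""
    total = sum(n for _, n in children)
    return sum(v * n for v, n in children) / total

def maximum_child(children):
    """Back up the best child's estimate (a max-node view)."""
    return max(v for v, _ in children)

children = [(3.0, 10), (4.0, 10)]
print(sample_weighted(children))  # 3.5
print(maximum_child(children))    # 4.0
```

These two rules bracket the behavior the ERD max-distribution backpropagation later improves on: the sample-weighted average is too pessimistic at max nodes, the maximum child ignores estimation uncertainty.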
Initial experiments: 1 MCTS bot vs. 2 rule-based bots. Exploitative!
[plot: MCTS bot winnings]
-
Outline
Overview of the approach: the poker game tree, opponent model, Monte-Carlo tree search
Research challenges: search (uncertainty in MCTS, continuous action spaces), opponent model (online learning, concept drift)
Conclusion
-
MCTS for games with uncertainty? Expected reward distributions (ERD): sample selection using ERD, backpropagation of ERD
[VandenBroeck09]
-
Estimating the expected reward distribution
[figure: expected reward distributions after 10, 100, and ∞ samples; in MiniMax the spread is variance due to estimating, in ExpectiMax/MixiMax it combines sampling variance and uncertainty]
-
ERD selection strategy
Objective? Find the maximum expected reward. Sample more in subtrees with
(1) a high expected reward ("expected value under perfect play")
(2) an uncertain estimate ("measure of uncertainty due to sampling")
UCT does (1) but not really (2); CrazyStone does (1) and (2), but for deterministic games (Go)
UCT+ selection combines (1) the estimated expected value with (2) the standard deviation of that estimate due to sampling
-
ERD max-distribution backpropagation
A max node has children A (samples around 3) and B (samples around 4)
Sample-weighted average backpropagates 3.5; maximum child backpropagates 4
"When the game reaches P, we'll have more time to find the real maximum"
Max-distribution backpropagation propagates the distribution of max(A, B), with mean 4.5 here
Using P(max(A,B) <= x) = P(A <= x) * P(B <= x): with P(A > 4) = 0.2 and P(B > 4) = 0.5, P(max(A,B) > 4) = 1 - 0.8 * 0.5 = 0.6 > 0.5
-
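Max-distribution backpropagation can be sketched for discrete ERDs. This is an illustrative reconstruction, not the talk's implementation: each ERD is a dict from reward to probability, the children are assumed independent, and the example distributions are chosen so the expected maximum comes out near the slide's 4.5 even though neither child's mean exceeds 4.

```python
# Sketch of max-distribution backpropagation for discrete expected reward
# distributions (ERDs). For independent children,
# P(max <= x) = product over children of P(child <= x).
def max_distribution(dists):
    values = sorted({v for d in dists for v in d})
    def cdf(d, x):
        return sum(p for v, p in d.items() if v <= x)
    result, prev = {}, 0.0
    for x in values:
        cum = 1.0
        for d in dists:
            cum *= cdf(d, x)
        result[x] = cum - prev   # P(max == x)
        prev = cum
    return result

def mean(d):
    return sum(v * p for v, p in d.items())

# Child A is uncertain (it might be much better than 4), child B is known to be 4:
A = {3.0: 0.8, 6.5: 0.2}   # P(A > 4) = 0.2
B = {4.0: 1.0}
m = max_distribution([A, B])
print(mean(m))  # ~4.5: above both the max of means (4) and the weighted average
```

This is why propagating distributions instead of point estimates matters: the chance that the uncertain child turns out best contributes to the backed-up value.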
Experiments
2 MCTS bots: max-distribution vs. sample-weighted backpropagation
2 MCTS bots: UCT+ (stddev) vs. UCT selection
-
Outline
Overview of the approach: the poker game tree, opponent model, Monte-Carlo tree search
Research challenges: search (uncertainty in MCTS, continuous action spaces), opponent model (online learning, concept drift)
Conclusion
-
Dealing with continuous actions
Sample discrete actions
Progressive unpruning [Chaslot08] (ignores smoothness of the EV function)
...
Tree learning search (work in progress)
[plot: EV as a function of relative bet size]
-
Tree learning search
Based on regression tree induction from data streams: training examples arrive quickly, nodes split when a split gives a significant reduction in stddev, and training examples are immediately forgotten
Edges in the TLS tree are not actions but sets of actions, e.g., (raise in [2,40]), (fold or call)
MCTS provides a stream of (action, EV) examples; split action sets to reduce the stddev of EV (when the reduction is significant)
-
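The split rule at the heart of tree learning search can be sketched as follows. This is an assumption-laden simplification: it buffers the (action, EV) samples instead of processing a true stream, and uses a fixed minimum stddev reduction in place of a proper statistical significance test.

```python
# Sketch of the TLS split rule: given (action, EV) samples for a continuous
# action set, find the split point that most reduces the stddev of EV, and
# split only when the reduction is large enough.
import statistics

def best_split(samples, min_reduction=0.5):
    """samples: list of (action, ev), action numeric. Returns split point or None."""
    samples = sorted(samples)
    base = statistics.pstdev([ev for _, ev in samples])
    best, best_red = None, 0.0
    for i in range(1, len(samples)):
        left = [ev for _, ev in samples[:i]]
        right = [ev for _, ev in samples[i:]]
        # size-weighted stddev after splitting between samples i-1 and i
        pooled = (len(left) * statistics.pstdev(left)
                  + len(right) * statistics.pstdev(right)) / len(samples)
        if base - pooled > best_red:
            best_red = base - pooled
            best = (samples[i - 1][0] + samples[i][0]) / 2
    return best if best_red >= min_reduction else None

# Bets below 4 have low EV, bets above 4 have high EV:
stream = [(1, 0.1), (2, 0.0), (3, 0.2), (5, 2.0), (7, 2.1), (9, 1.9)]
print(best_split(stream))  # 4.0: split (Bet in [0,10]) at 4
```

When the EV function is flat over the action set, no split clears the threshold and the node keeps generalizing over all its actions.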
Tree learning search example
A max node has edges {Fold, Call} and (Bet in [0,10]), each leading to a max node
Samples reveal an optimal split at 4: (Bet in [0,10]) is split into (Bet in [0,4]) and (Bet in [4,10]), each with its own subtree
-
Tree learning search
[tree diagram: levels alternate between one action of P1 and one action of P2]
-
Selection Phase
A sample (e.g., bet size 2.4) is selected in P1's node
Each node has an EV estimate, which generalizes over actions
-
Expansion
The selected node is expanded with a new node that represents any action of P3
-
Backpropagation
A new sample arrives; the split becomes significant
-
Outline
Overview of the approach: the poker game tree, opponent model, Monte-Carlo tree search
Research challenges: search (uncertainty in MCTS, continuous action spaces), opponent model (online learning, concept drift)
Conclusion
-
Online learning of the opponent model
Start from a (safe) model of a general opponent; exploit the weaknesses of the specific opponent
Start to learn a model of the specific opponent (exploration of opponent behavior)
-
Multi-agent interaction
Yellow learns a model for Blue and changes strategy
Yellow doesn't profit!
Green profits without changing strategy!!
-
Outline
Overview of the approach: the poker game tree, opponent model, Monte-Carlo tree search
Research challenges: search (uncertainty in MCTS, continuous action spaces), opponent model (online learning, concept drift)
Conclusion
-
Concept drift
While learning from a stream, the training examples in the stream change; in an opponent model, the opponent changes strategy
"Changing gears is not just about bluffing, it's about changing strategy to achieve a goal."
Learning with concept drift must adapt quickly to changes yet stay robust to noise (and recognize recurrent concepts)
-
Basic approach to concept drift
Maintain a window of training examples: large enough to learn from, small enough to adapt quickly without 'old' concepts
Heuristics adjust the window size, based on the FLORA2 framework [Widmer92]
-
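The windowing idea can be sketched with a crude heuristic loosely in the spirit of FLORA2: shrink the window when predictions start failing (suspected drift), otherwise let it grow. The majority-vote learner, the shrink-by-half rule, and the thresholds are all illustrative assumptions, not the talk's heuristic.

```python
# Sketch of window-based learning under concept drift: keep recent (x, y)
# examples, shrink the window aggressively when accuracy drops.
from collections import deque, Counter

class WindowLearner:
    def __init__(self, max_size=50):
        self.window = deque()
        self.max_size = max_size

    def predict(self, x):
        votes = Counter(y for wx, y in self.window if wx == x)
        return votes.most_common(1)[0][0] if votes else None

    def observe(self, x, y):
        correct = self.predict(x) == y
        self.window.append((x, y))
        if not correct and len(self.window) > 10:
            # suspected drift: drop the oldest half of the window
            for _ in range(len(self.window) // 2):
                self.window.popleft()
        while len(self.window) > self.max_size:
            self.window.popleft()
        return correct

learner = WindowLearner()
for _ in range(20):
    learner.observe("raise", "fold")   # opponent folds to raises at first
for _ in range(20):
    learner.observe("raise", "call")   # then changes gears and starts calling
print(learner.predict("raise"))        # call: the model has adapted
```

The tension the slide describes is visible here: a smaller window adapts faster, but a single noisy example can trigger an unnecessary shrink, which is why the parameters matter.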
[plot: accuracy and window size over time for 4 components of a single opponent model, marking the start of online learning and a concept drift]
-
With bad parameters for the heuristic: NOT ROBUST
[plot: accuracy and window size]
-
Outline
Overview of the approach: the poker game tree, opponent model, Monte-Carlo tree search
Research challenges: search, opponent model
Conclusion
-
Conclusions
First exploitative poker bot for no-limit hold'em with > 2 players
Applies to other games: backgammon, computational pool, ...
Challenges for MCTS: games with uncertainty, continuous action spaces
Challenges for ML: online learning, concept drift (relational learning)
-
Thanks for listening!