Monte-Carlo tree search for multi-player, no-limit Texas hold'em poker
Guy Van den Broeck


  • Monte-Carlo tree search for multi-player, no-limit Texas hold'em poker

    Guy Van den Broeck

  • The challenges of poker (built up over several slides)

    Deceptive play: Should I bluff?

    Opponent modeling: Is he bluffing?

    Incomplete information: Who has the Ace?

    Game of chance: What are the odds?

    Exploitation: I'll bet because he always calls.

    Huge state space: What can happen next?

    Risk management & continuous action space: Should I bet $5 or $10?

  • Take-away message: we can solve all of these problems!

  • Problem Statement

    A bot for Texas hold'em poker: no-limit and more than 2 players.

    Not done before! Exploitative, not game-theoretic.

    Game tree search + opponent modeling.

    Applies to any problem with incomplete information, non-determinism, or continuous actions.

  • Outline

    Overview of the approach: the poker game tree, the opponent model, Monte-Carlo tree search

    Research challenges: search, opponent model

    Conclusion

  • Poker game tree

    Minimax trees: deterministic games (tic-tac-toe, checkers, chess, Go, ...): max and min nodes.

    Expecti(mini)max trees: games of chance (backgammon, ...): max, min, and chance nodes.

    Miximax trees: hidden information: max nodes plus "mix" nodes that average over the opponent model's action probabilities.

  • Worked example: expected values in the poker game tree (figure, built up over several slides)

    At the root, "my action" has three children: fold, call, and raise.

    Fold ends the hand ("Resolve") with value 0.

    Call leads to a "Reveal Cards" chance node with probabilities 0.5 and 0.5 over leaf values -1 and 3, so its expected value is 0.5 · (-1) + 0.5 · 3 = 1.

    Raise leads to an "opp-1 action" node whose fold, call, and raise probabilities come from the opponent model (0.6, 0.3, and 0.1). Its subtrees, including a further "opp-2 action" node, are evaluated recursively; with child values 4, 2, and 0 the node mixes them into 0.6 · 4 + 0.3 · 2 + 0.1 · 0 = 3.

    The root then plays the child with the highest expected value (raise, worth 3).
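
To make the backup concrete, here is a minimal Python sketch of expected-value computation over a tree shaped like the one in the figure; the node helpers and the reconstructed example numbers are illustrative, not the bot's actual implementation.

```python
# Minimal sketch of expected-value backup in a miximax-style tree.
# Our decision nodes take the max over children; chance and opponent
# nodes mix children by probability; leaves carry a payoff.

def leaf(value):
    return ("leaf", value)

def max_node(*children):
    return ("max", children)

def mix_node(children_with_probs):
    # opponent-action or chance node: list of (probability, child)
    return ("mix", children_with_probs)

def expected_value(node):
    kind = node[0]
    if kind == "leaf":
        return node[1]
    if kind == "max":
        return max(expected_value(c) for c in node[1])
    # "mix": probability-weighted average of child values
    return sum(p * expected_value(c) for p, c in node[1])

# The example tree from the slides: fold = 0, call = chance node over -1/3,
# raise = opponent node with fold/call/raise probabilities 0.6/0.3/0.1.
tree = max_node(
    leaf(0),                                       # fold
    mix_node([(0.5, leaf(-1)), (0.5, leaf(3))]),   # call -> reveal cards
    mix_node([(0.6, leaf(4)),                      # raise -> opp-1 folds
              (0.3, leaf(2)),                      # opp-1 calls
              (0.1, leaf(0))]),                    # opp-1 raises (subtree)
)

print(expected_value(tree))  # 3.0: the raise branch has the highest EV
```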

  • Outline

    Overview of the approach: the poker game tree, the opponent model, Monte-Carlo tree search

    Research challenges: search, opponent model

    Conclusion

  • Short Experiment

  • Opponent Model

    A set of probability trees (Weka's M5'), with a separate model for:

    Actions

    Hand cards at showdown

  • Fold Probability (plot: fold probability vs. nbAllPlayerRaises)

  • (Can also be relational: a TILDE probability tree [Ponsen08])

  • Opponent Ranks

    Learn the distribution of hand ranks at showdown.

    (plots: probability vs. rank bucket, and probability vs. number of raises)
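
As a rough illustration of what such a model can look like in code (a sketch only: the slides use Weka's M5' probability trees, and the feature names and library below are assumptions), one could fit a regression tree for the fold probability and a histogram over rank buckets at showdown:

```python
# Illustrative opponent-model sketch (assumed features and library choice;
# the slides use Weka's M5' probability trees, scikit-learn is a stand-in).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Observed opponent decisions: features -> did the opponent fold (1) or not (0).
# Feature columns (assumed): [nbAllPlayerRaises, pot_size, bets_to_call]
X_actions = np.array([[0, 4, 1], [2, 30, 2], [1, 12, 1], [3, 60, 2]])
y_folded = np.array([0, 1, 0, 1])

fold_model = DecisionTreeRegressor(max_depth=3).fit(X_actions, y_folded)

# Predicted fold probability for a new situation (regression output in [0, 1]).
print(fold_model.predict([[2, 25, 2]]))

# Hand-rank model at showdown: a simple histogram over rank buckets,
# optionally conditioned on features such as the number of raises.
showdown_buckets = np.array([0, 3, 4, 4, 2, 4, 1, 3])   # observed rank buckets
rank_distribution = np.bincount(showdown_buckets, minlength=5) / len(showdown_buckets)
print(rank_distribution)  # P(rank bucket) for this opponent
```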

  • Outline

    Overview of the approach: the poker game tree, the opponent model, Monte-Carlo tree search

    Research challenges: search, opponent model

    Conclusion

  • Traversing the tree

    Limit Texas hold'em: ~10^18 nodes, fully traversable.

    No-limit: >10^71 nodes, too large to traverse; the tree is sampled, not searched: Monte-Carlo Tree Search.

  • Monte-Carlo Tree Search

    [Chaslot08]

  • Selection

    In each node, $\bar{X}_i$ is an estimate of the reward and $n_i$ is the number of samples.

    UCT (multi-armed bandit) selects the child that maximizes $\bar{X}_i + C\sqrt{\ln n / n_i}$, where $n$ is the parent's sample count: the first term drives exploitation, the second exploration.

    CrazyStone is an alternative selection strategy.
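
A minimal sketch of UCT child selection to make the formula concrete; the dictionary-based node fields are assumptions for the example, not the bot's data structures.

```python
import math

# Minimal UCT selection sketch. Each child is assumed to carry a running
# reward estimate (value) and a visit count (n); C trades off exploration.
def uct_select(children, total_visits, C=1.4):
    def uct_score(child):
        if child["n"] == 0:
            return float("inf")  # always try unvisited children first
        exploitation = child["value"]
        exploration = C * math.sqrt(math.log(total_visits) / child["n"])
        return exploitation + exploration
    return max(children, key=uct_score)

children = [{"name": "fold", "value": 0.0, "n": 50},
            {"name": "call", "value": 0.9, "n": 30},
            {"name": "raise", "value": 1.1, "n": 5}]
print(uct_select(children, total_visits=85)["name"])  # "raise"
```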

  • Expansion and Simulation

  • Backpropagation

    $\bar{X}_i$ is the estimate of the reward and $n_i$ is the number of samples.

    Two ways to back a value up to the parent: the sample-weighted average of the children, or the value of the maximum child.
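
A small sketch of the two backup rules, using the same assumed node fields as the selection sketch above.

```python
# Two backup rules for an internal node, given its children's statistics.
def backup_sample_weighted(children):
    total = sum(c["n"] for c in children)
    return sum(c["n"] * c["value"] for c in children) / total

def backup_max_child(children):
    return max(c["value"] for c in children)

children = [{"value": 3.0, "n": 90}, {"value": 4.0, "n": 10}]
print(backup_sample_weighted(children))  # 3.1: dominated by the heavily sampled child
print(backup_max_child(children))        # 4.0: trusts the current best child
```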

  • Initial experiments

    1 MCTS bot + 2 rule-based bots: the MCTS bot is exploitative!

    (plot of the MCTS bot's winnings)

  • Outline

    Overview of the approach: the poker game tree, the opponent model, Monte-Carlo tree search

    Research challenges:

    Search: uncertainty in MCTS, continuous action spaces

    Opponent model: online learning, concept drift

    Conclusion

  • MCTS for games with uncertainty?

    Expected reward distributions (ERD)

    Sample selection using ERD

    Backpropagation of ERD

    [VandenBroeck09]

  • Expected reward distributions (diagram, built up over several slides)

    The estimate of a node's value is shown after 10, 100, and ∞ samples.

    MiniMax (deterministic games): with ∞ samples the value is a single number; the spread after 10 or 100 samples is variance due to sampling and estimating.

    ExpectiMax/MixiMax (chance and hidden information): even with ∞ samples the reward follows a distribution T(P); the finite-sample estimate therefore combines the game's inherent uncertainty with sampling variance.

  • ERD selection strategy

    Objective: find the action with the maximum expected reward. Sample more in subtrees with (1) a high expected reward and (2) an uncertain estimate.

    UCT does (1) but not really (2); CrazyStone does (1) and (2), but for deterministic games (Go).

    UCT+ selection combines (1), the "expected value under perfect play", with (2), a "measure of uncertainty due to sampling".
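
The exact UCT+ formula is not reproduced in this transcript; as an assumption-laden sketch, the code below scores each child by its estimate plus a multiple of the estimator's standard error, which captures ingredients (1) and (2) above.

```python
import math

# Sketch of an ERD-style selection rule: prefer children whose estimate is
# high (1) or still uncertain (2). The exact UCT+ formula is not in the
# transcript; "mean + C * standard error of the mean" is an assumed stand-in.
def erd_select(children, C=2.0):
    def score(child):
        if child["n"] < 2:
            return float("inf")           # force a minimum number of samples
        std_error = child["std"] / math.sqrt(child["n"])
        return child["mean"] + C * std_error
    return max(children, key=score)

children = [{"name": "call",  "mean": 1.0, "std": 0.5, "n": 400},
            {"name": "raise", "mean": 0.8, "std": 3.0, "n": 9}]
print(erd_select(children)["name"])  # "raise": uncertain estimate, worth sampling
```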

  • ERD max-distribution backpropagation (example: a max node with children A and B, estimated at 3 and 4)

    The sample-weighted average backs up 3.5; taking the maximum child backs up 4 ("when the game reaches P, we'll have more time to find the real value").

    The max-distribution instead backs up the distribution of max(A, B): with P(A > 4) = 0.2 and P(B > 4) = 0.5, P(max(A, B) ≤ 4) = 0.8 · 0.5 = 0.4, so P(max(A, B) > 4) = 0.6 > 0.5 and the backed-up value (4.5 in the example) exceeds the best single child's estimate.
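
A small numerical sketch of the difference between the backup rules, using assumed Gaussian reward distributions for A and B.

```python
import numpy as np

# Max-distribution backup, sketched with empirical samples. Children A and B
# are represented by samples of their expected-reward distributions; the
# Gaussian shapes below are assumptions for the example, not the bot's model.
rng = np.random.default_rng(0)
A = rng.normal(loc=3.0, scale=1.2, size=100_000)
B = rng.normal(loc=4.0, scale=1.0, size=100_000)

print(np.mean(np.maximum(A, B)))  # max-distribution value: noticeably above 4
print(max(A.mean(), B.mean()))    # maximum-child backup: about 4
```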

  • Experiments

    (plots of winnings over time)

    2 MCTS bots: max-distribution vs. sample-weighted backpropagation.

    2 MCTS bots: UCT+ (stddev) vs. UCT selection.

  • Outline

    Overview of the approach: the poker game tree, the opponent model, Monte-Carlo tree search

    Research challenges:

    Search: uncertainty in MCTS, continuous action spaces

    Opponent model: online learning, concept drift

    Conclusion

  • Dealing with continuous actions

    Sample discrete actions.

    Progressive unpruning [Chaslot08] (ignores the smoothness of the EV function over the relative bet size).

    ... Tree learning search (work in progress).

  • Tree learning search

    Based on regression tree induction from data streams: training examples arrive quickly, nodes are split when the split gives a significant reduction in stddev, and training examples are immediately forgotten.

    Edges in the TLS tree are not single actions but sets of actions, e.g. (raise in [2,40]) or (fold or call).

    MCTS provides a stream of (action, EV) examples; action sets are split to reduce the stddev of the EV (when the reduction is significant), as sketched below.
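
A rough batch sketch of the splitting step on (bet size, EV) samples; the variance-reduction criterion below is a simplification of what a stream-based regression tree learner with a significance test would do, and the toy data is invented.

```python
import numpy as np

# Sketch: choose a split point of a continuous bet-size interval that most
# reduces the (size-weighted) standard deviation of the observed EVs.
# Simplified batch version of what a stream-based regression tree would do.
def best_split(bets, evs):
    order = np.argsort(bets)
    bets, evs = np.asarray(bets)[order], np.asarray(evs)[order]
    parent_sd = evs.std()
    best = (None, 0.0)
    for i in range(1, len(bets)):
        left, right = evs[:i], evs[i:]
        weighted_sd = (len(left) * left.std() + len(right) * right.std()) / len(evs)
        reduction = parent_sd - weighted_sd
        if reduction > best[1]:
            best = ((bets[i - 1] + bets[i]) / 2, reduction)
    return best  # (split point, stddev reduction); split only if significant

# (bet size, EV) samples from MCTS; EV jumps around bet = 4 in this toy data.
bets = [0.5, 1, 2, 3, 3.5, 4.5, 5, 7, 8, 9.5]
evs  = [1.0, 1.2, 0.9, 1.1, 1.0, -0.8, -1.1, -0.9, -1.2, -1.0]
print(best_split(bets, evs))  # split point near 4, large stddev reduction
```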

  • Tree learning search: splitting an action set (figure, built up over several slides)

    A max node starts with two edges, {Fold, Call} and Bet in [0,10], each leading to a not-yet-expanded subtree.

    Once the samples show that the optimal split point is at 4, the edge Bet in [0,10] is split into Bet in [0,4] and Bet in [4,10], each with its own max child.

  • Tree learning search: one iteration (figures)

    Each level of the TLS tree corresponds to one action of P1, P2, ...

    Selection phase: a concrete action (e.g. bet 2.4) is sampled at P1's node; each node keeps an EV estimate that generalizes over its set of actions.

    Expansion: the selected node is expanded with a child node that represents any action of the next player (P3).

    Backpropagation: the new sample is propagated upward; when a split becomes significant, the node's action set is divided.

  • Outline

    Overview of the approach: the poker game tree, the opponent model, Monte-Carlo tree search

    Research challenges:

    Search: uncertainty in MCTS, continuous action spaces

    Opponent model: online learning, concept drift

    Conclusion

  • Online learning of the opponent model

    Start from a (safe) model of a general opponent; as a model of the specific opponent is learned (exploration of opponent behavior), start to exploit that opponent's weaknesses.
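
One simple way to realize this, shown only as an assumed illustration, is to blend a general-population prior with opponent-specific statistics and to trust the specific statistics more as hands are observed.

```python
# Assumed illustration: shrink an opponent-specific estimate toward a general
# prior, trusting the specific data more as observations accumulate.
def blended_fold_probability(prior_p, opponent_folds, opponent_hands, strength=50):
    # strength: how many observed hands it takes to weigh the specific model
    # as much as the general prior (an assumed hyperparameter).
    if opponent_hands == 0:
        return prior_p
    specific_p = opponent_folds / opponent_hands
    w = opponent_hands / (opponent_hands + strength)
    return (1 - w) * prior_p + w * specific_p

print(blended_fold_probability(0.4, opponent_folds=3, opponent_hands=5))      # near the prior
print(blended_fold_probability(0.4, opponent_folds=180, opponent_hands=300))  # near 0.6
```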

  • Multi-agent interaction (experiment plots)

    Yellow learns a model for Blue and changes strategy.

    Yellow doesn't profit, but Green profits without changing strategy!

  • Outline

    Overview of the approach: the poker game tree, the opponent model, Monte-Carlo tree search

    Research challenges:

    Search: uncertainty in MCTS, continuous action spaces

    Opponent model: online learning, concept drift

    Conclusion

  • Concept drift

    While learning from a stream, the concept behind the training examples changes over time; in an opponent model, this corresponds to a changing strategy.

    "Changing gears is not just about bluffing, it's about changing strategy to achieve a goal."

    Learning with concept drift must adapt quickly to changes, yet be robust to noise (and recognize recurrent concepts).

  • Basic approach to concept drift

    Maintain a window of training examples: large enough to learn from, small enough to adapt quickly without holding on to 'old' concepts.

    Heuristics adjust the window size, based on the FLORA2 framework [Widmer92], as sketched below.
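
A toy sketch of an adaptive-window heuristic in the FLORA2 spirit; the thresholds and step sizes are assumptions, not the framework's exact rules.

```python
from collections import deque

# Toy adaptive-window heuristic in the FLORA2 spirit: shrink the window when
# recent accuracy drops (suspected drift), grow it while accuracy is stable.
# The thresholds and step sizes are assumptions, not FLORA2's exact rules.
class AdaptiveWindow:
    def __init__(self, min_size=20, max_size=400):
        self.examples = deque()
        self.min_size, self.max_size = min_size, max_size
        self.capacity = min_size

    def add(self, example, recent_accuracy):
        if recent_accuracy < 0.55:        # accuracy collapse: drift suspected
            self.capacity = max(self.min_size, self.capacity // 2)
        elif recent_accuracy > 0.75:      # stable concept: keep more history
            self.capacity = min(self.max_size, self.capacity + 5)
        self.examples.append(example)
        while len(self.examples) > self.capacity:
            self.examples.popleft()       # forget the oldest examples

window = AdaptiveWindow()
window.add(("features...", "fold"), recent_accuracy=0.8)
print(window.capacity, len(window.examples))
```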

  • (plots: accuracy and window size over time for the 4 components of a single opponent model, with markers for the start of online learning and for the concept drift)

  • (plot: with bad parameters for the heuristic, the window size is NOT ROBUST)

  • Outline

    Overview of the approach: the poker game tree, the opponent model, Monte-Carlo tree search

    Research challenges: search, opponent model

    Conclusion

  • Conclusions

    First exploitative poker bot for no-limit hold'em with more than 2 players.

    Applies to other games: backgammon, computational pool, ...

    Challenges for MCTS: games with uncertainty, continuous action spaces.

    Challenges for ML: online learning, concept drift (relational learning).

  • Thanks for listening!
