Monte-Carlo tree search for multi-player, no-limit Texas hold'em poker
Guy Van den Broeck


  • Monte-Carlo tree search for multi-player, no-limit Texas hold'em poker

    Guy Van den Broeck

  • The challenges of poker (built up over several slides)

    Deceptive play: Should I bluff?

    Opponent modeling: Is he bluffing?

    Incomplete information: Who has the Ace?

    Game of chance: What are the odds?

    Exploitation: I'll bet because he always calls.

    Huge state space: What can happen next?

    Risk management & continuous action space: Should I bet $5 or $10?

  • Take-away message: we can solve all of these problems!

  • Problem Statement

    A bot for Texas hold'em poker: no-limit and more than 2 players.

    Not done before! Exploitative, not game-theoretic.

    Game tree search + opponent modeling.

    Applies to any problem with incomplete information, non-determinism, or continuous actions.

  • Outline

    Overview of the approach: the poker game tree, the opponent model, Monte-Carlo tree search

    Research challenges: search, opponent model

    Conclusion

  • Poker game tree

    Minimax trees: deterministic games (tic-tac-toe, checkers, chess, Go, ...): max and min nodes.

    Expecti(mini)max trees: games of chance (backgammon, ...): max, min, and chance nodes.

    Miximax trees: hidden information: max nodes plus "mix" nodes that average over the opponent model's action probabilities.

  • Worked example: expected values in the poker game tree (figure, built up over several slides)

    At the root, "my action" has three children: fold, call, and raise.

    Fold ends the hand ("Resolve") with value 0.

    Call leads to a "Reveal Cards" chance node with probabilities 0.5 and 0.5 over leaf values -1 and 3, so its expected value is 0.5 · (-1) + 0.5 · 3 = 1.

    Raise leads to an "opp-1 action" node whose fold, call, and raise probabilities come from the opponent model (0.6, 0.3, and 0.1). Its subtrees, including a further "opp-2 action" node, are evaluated recursively; with child values 4, 2, and 0 the node mixes them into 0.6 · 4 + 0.3 · 2 + 0.1 · 0 = 3.

    The root then plays the child with the highest expected value (raise, worth 3).
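
To make the backup concrete, here is a minimal Python sketch of expected-value computation over a tree shaped like the one in the figure; the node helpers and the reconstructed example numbers are illustrative, not the bot's actual implementation.

```python
# Minimal sketch of expected-value backup in a miximax-style tree.
# Our decision nodes take the max over children; chance and opponent
# nodes mix children by probability; leaves carry a payoff.

def leaf(value):
    return ("leaf", value)

def max_node(*children):
    return ("max", children)

def mix_node(children_with_probs):
    # opponent-action or chance node: list of (probability, child)
    return ("mix", children_with_probs)

def expected_value(node):
    kind = node[0]
    if kind == "leaf":
        return node[1]
    if kind == "max":
        return max(expected_value(c) for c in node[1])
    # "mix": probability-weighted average of child values
    return sum(p * expected_value(c) for p, c in node[1])

# The example tree from the slides: fold = 0, call = chance node over -1/3,
# raise = opponent node with fold/call/raise probabilities 0.6/0.3/0.1.
tree = max_node(
    leaf(0),                                       # fold
    mix_node([(0.5, leaf(-1)), (0.5, leaf(3))]),   # call -> reveal cards
    mix_node([(0.6, leaf(4)),                      # raise -> opp-1 folds
              (0.3, leaf(2)),                      # opp-1 calls
              (0.1, leaf(0))]),                    # opp-1 raises (subtree)
)

print(expected_value(tree))  # 3.0: the raise branch has the highest EV
```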

  • Outline

    Overview of the approach: the poker game tree, the opponent model, Monte-Carlo tree search

    Research challenges: search, opponent model

    Conclusion

  • Short Experiment

  • Opponent Model

    A set of probability trees (Weka's M5'), with a separate model for:

    Actions

    Hand cards at showdown

  • Fold Probability (plot: fold probability vs. nbAllPlayerRaises)

  • (Can also be relational: a TILDE probability tree [Ponsen08])

  • Opponent Ranks

    Learn the distribution of hand ranks at showdown.

    (plots: probability vs. rank bucket, and probability vs. number of raises)
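
As a rough illustration of what such a model can look like in code (a sketch only: the slides use Weka's M5' probability trees, and the feature names and library below are assumptions), one could fit a regression tree for the fold probability and a histogram over rank buckets at showdown:

```python
# Illustrative opponent-model sketch (assumed features and library choice;
# the slides use Weka's M5' probability trees, scikit-learn is a stand-in).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Observed opponent decisions: features -> did the opponent fold (1) or not (0).
# Feature columns (assumed): [nbAllPlayerRaises, pot_size, bets_to_call]
X_actions = np.array([[0, 4, 1], [2, 30, 2], [1, 12, 1], [3, 60, 2]])
y_folded = np.array([0, 1, 0, 1])

fold_model = DecisionTreeRegressor(max_depth=3).fit(X_actions, y_folded)

# Predicted fold probability for a new situation (regression output in [0, 1]).
print(fold_model.predict([[2, 25, 2]]))

# Hand-rank model at showdown: a simple histogram over rank buckets,
# optionally conditioned on features such as the number of raises.
showdown_buckets = np.array([0, 3, 4, 4, 2, 4, 1, 3])   # observed rank buckets
rank_distribution = np.bincount(showdown_buckets, minlength=5) / len(showdown_buckets)
print(rank_distribution)  # P(rank bucket) for this opponent
```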

  • Outline

    Overview of the approach: the poker game tree, the opponent model, Monte-Carlo tree search

    Research challenges: search, opponent model

    Conclusion

  • Traversing the tree

    Limit Texas hold'em: ~10^18 nodes, fully traversable.

    No-limit: >10^71 nodes, too large to traverse; the tree is sampled, not searched: Monte-Carlo Tree Search.

  • Monte-Carlo Tree Search

    [Chaslot08]

  • Selection

    In each node, $\bar{X}_i$ is an estimate of the reward and $n_i$ is the number of samples.

    UCT (multi-armed bandit) selects the child that maximizes $\bar{X}_i + C\sqrt{\ln n / n_i}$, where $n$ is the parent's sample count: the first term drives exploitation, the second exploration.

    CrazyStone is an alternative selection strategy.
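
A minimal sketch of UCT child selection to make the formula concrete; the dictionary-based node fields are assumptions for the example, not the bot's data structures.

```python
import math

# Minimal UCT selection sketch. Each child is assumed to carry a running
# reward estimate (value) and a visit count (n); C trades off exploration.
def uct_select(children, total_visits, C=1.4):
    def uct_score(child):
        if child["n"] == 0:
            return float("inf")  # always try unvisited children first
        exploitation = child["value"]
        exploration = C * math.sqrt(math.log(total_visits) / child["n"])
        return exploitation + exploration
    return max(children, key=uct_score)

children = [{"name": "fold", "value": 0.0, "n": 50},
            {"name": "call", "value": 0.9, "n": 30},
            {"name": "raise", "value": 1.1, "n": 5}]
print(uct_select(children, total_visits=85)["name"])  # "raise"
```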

  • Expansion and Simulation

  • Backpropagation

    $\bar{X}_i$ is the estimate of the reward and $n_i$ is the number of samples.

    Two ways to back a value up to the parent: the sample-weighted average of the children, or the value of the maximum child.
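
A small sketch of the two backup rules, using the same assumed node fields as the selection sketch above.

```python
# Two backup rules for an internal node, given its children's statistics.
def backup_sample_weighted(children):
    total = sum(c["n"] for c in children)
    return sum(c["n"] * c["value"] for c in children) / total

def backup_max_child(children):
    return max(c["value"] for c in children)

children = [{"value": 3.0, "n": 90}, {"value": 4.0, "n": 10}]
print(backup_sample_weighted(children))  # 3.1: dominated by the heavily sampled child
print(backup_max_child(children))        # 4.0: trusts the current best child
```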

  • Initial experiments

    1 MCTS bot + 2 rule-based bots: the MCTS bot is exploitative!

    (plot of the MCTS bot's winnings)

  • Outline

    Overview of the approach: the poker game tree, the opponent model, Monte-Carlo tree search

    Research challenges:

    Search: uncertainty in MCTS, continuous action spaces

    Opponent model: online learning, concept drift

    Conclusion

  • MCTS for games with uncertainty?

    Expected reward distributions (ERD)

    Sample selection using ERD

    Backpropagation of ERD

    [VandenBroeck09]

  • Expected reward distributions (diagram, built up over several slides)

    The estimate of a node's value is shown after 10, 100, and ∞ samples.

    MiniMax (deterministic games): with ∞ samples the value is a single number; the spread after 10 or 100 samples is variance due to sampling and estimating.

    ExpectiMax/MixiMax (chance and hidden information): even with ∞ samples the reward follows a distribution T(P); the finite-sample estimate therefore combines the game's inherent uncertainty with sampling variance.

  • ERD selection strategy

    Objective: find the action with the maximum expected reward. Sample more in subtrees with (1) a high expected reward and (2) an uncertain estimate.

    UCT does (1) but not really (2); CrazyStone does (1) and (2), but for deterministic games (Go).

    UCT+ selection combines (1), the "expected value under perfect play", with (2), a "measure of uncertainty due to sampling".
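
The exact UCT+ formula is not reproduced in this transcript; as an assumption-laden sketch, the code below scores each child by its estimate plus a multiple of the estimator's standard error, which captures ingredients (1) and (2) above.

```python
import math

# Sketch of an ERD-style selection rule: prefer children whose estimate is
# high (1) or still uncertain (2). The exact UCT+ formula is not in the
# transcript; "mean + C * standard error of the mean" is an assumed stand-in.
def erd_select(children, C=2.0):
    def score(child):
        if child["n"] < 2:
            return float("inf")           # force a minimum number of samples
        std_error = child["std"] / math.sqrt(child["n"])
        return child["mean"] + C * std_error
    return max(children, key=score)

children = [{"name": "call",  "mean": 1.0, "std": 0.5, "n": 400},
            {"name": "raise", "mean": 0.8, "std": 3.0, "n": 9}]
print(erd_select(children)["name"])  # "raise": uncertain estimate, worth sampling
```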

  • ERD max-distribution backpropagation (example: a max node with children A and B, estimated at 3 and 4)

    The sample-weighted average backs up 3.5; taking the maximum child backs up 4 ("when the game reaches P, we'll have more time to find the real value").

    The max-distribution instead backs up the distribution of max(A, B): with P(A > 4) = 0.2 and P(B > 4) = 0.5, P(max(A, B) ≤ 4) = 0.8 · 0.5 = 0.4, so P(max(A, B) > 4) = 0.6 > 0.5 and the backed-up value (4.5 in the example) exceeds the best single child's estimate.
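
A small numerical sketch of the difference between the backup rules, using assumed Gaussian reward distributions for A and B.

```python
import numpy as np

# Max-distribution backup, sketched with empirical samples. Children A and B
# are represented by samples of their expected-reward distributions; the
# Gaussian shapes below are assumptions for the example, not the bot's model.
rng = np.random.default_rng(0)
A = rng.normal(loc=3.0, scale=1.2, size=100_000)
B = rng.normal(loc=4.0, scale=1.0, size=100_000)

print(np.mean(np.maximum(A, B)))  # max-distribution value: noticeably above 4
print(max(A.mean(), B.mean()))    # maximum-child backup: about 4
```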

  • Experiments

    (plots of winnings over time)

    2 MCTS bots: max-distribution vs. sample-weighted backpropagation.

    2 MCTS bots: UCT+ (stddev) vs. UCT selection.

  • Outline

    Overview of the approach: the poker game tree, the opponent model, Monte-Carlo tree search

    Research challenges:

    Search: uncertainty in MCTS, continuous action spaces

    Opponent model: online learning, concept drift

    Conclusion

  • Dealing with continuous actions

    Sample discrete actions.

    Progressive unpruning [Chaslot08] (ignores the smoothness of the EV function over the relative bet size).

    ... Tree learning search (work in progress).

  • Tree learning search

    Based on regression tree induction from data streams: training examples arrive quickly, nodes are split when the split gives a significant reduction in stddev, and training examples are immediately forgotten.

    Edges in the TLS tree are not single actions but sets of actions, e.g. (raise in [2,40]) or (fold or call).

    MCTS provides a stream of (action, EV) examples; action sets are split to reduce the stddev of the EV (when the reduction is significant), as sketched below.
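
A rough batch sketch of the splitting step on (bet size, EV) samples; the variance-reduction criterion below is a simplification of what a stream-based regression tree learner with a significance test would do, and the toy data is invented.

```python
import numpy as np

# Sketch: choose a split point of a continuous bet-size interval that most
# reduces the (size-weighted) standard deviation of the observed EVs.
# Simplified batch version of what a stream-based regression tree would do.
def best_split(bets, evs):
    order = np.argsort(bets)
    bets, evs = np.asarray(bets)[order], np.asarray(evs)[order]
    parent_sd = evs.std()
    best = (None, 0.0)
    for i in range(1, len(bets)):
        left, right = evs[:i], evs[i:]
        weighted_sd = (len(left) * left.std() + len(right) * right.std()) / len(evs)
        reduction = parent_sd - weighted_sd
        if reduction > best[1]:
            best = ((bets[i - 1] + bets[i]) / 2, reduction)
    return best  # (split point, stddev reduction); split only if significant

# (bet size, EV) samples from MCTS; EV jumps around bet = 4 in this toy data.
bets = [0.5, 1, 2, 3, 3.5, 4.5, 5, 7, 8, 9.5]
evs  = [1.0, 1.2, 0.9, 1.1, 1.0, -0.8, -1.1, -0.9, -1.2, -1.0]
print(best_split(bets, evs))  # split point near 4, large stddev reduction
```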

  • Tree learning search: splitting an action set (figure, built up over several slides)

    A max node starts with two edges, {Fold, Call} and Bet in [0,10], each leading to a not-yet-expanded subtree.

    Once the samples show that the optimal split point is at 4, the edge Bet in [0,10] is split into Bet in [0,4] and Bet in [4,10], each with its own max child.

  • Tree learning search: one iteration (figures)

    Each level of the TLS tree corresponds to one action of P1, P2, ...

    Selection phase: a concrete action (e.g. bet 2.4) is sampled at P1's node; each node keeps an EV estimate that generalizes over its set of actions.

    Expansion: the selected node is expanded with a child node that represents any action of the next player (P3).

    Backpropagation: the new sample is propagated upward; when a split becomes significant, the node's action set is divided.

  • Outline

    Overview of the approach: the poker game tree, the opponent model, Monte-Carlo tree search

    Research challenges:

    Search: uncertainty in MCTS, continuous action spaces

    Opponent model: online learning, concept drift

    Conclusion

  • Online learning of the opponent model

    Start from a (safe) model of a general opponent; as a model of the specific opponent is learned (exploration of opponent behavior), start to exploit that opponent's weaknesses.
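
One simple way to realize this, shown only as an assumed illustration, is to blend a general-population prior with opponent-specific statistics and to trust the specific statistics more as hands are observed.

```python
# Assumed illustration: shrink an opponent-specific estimate toward a general
# prior, trusting the specific data more as observations accumulate.
def blended_fold_probability(prior_p, opponent_folds, opponent_hands, strength=50):
    # strength: how many observed hands it takes to weigh the specific model
    # as much as the general prior (an assumed hyperparameter).
    if opponent_hands == 0:
        return prior_p
    specific_p = opponent_folds / opponent_hands
    w = opponent_hands / (opponent_hands + strength)
    return (1 - w) * prior_p + w * specific_p

print(blended_fold_probability(0.4, opponent_folds=3, opponent_hands=5))      # near the prior
print(blended_fold_probability(0.4, opponent_folds=180, opponent_hands=300))  # near 0.6
```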

  • Multi-agent interaction (experiment plots)

    Yellow learns a model for Blue and changes strategy.

    Yellow doesn't profit, but Green profits without changing strategy!

  • Outline

    Overview of the approach: the poker game tree, the opponent model, Monte-Carlo tree search

    Research challenges:

    Search: uncertainty in MCTS, continuous action spaces

    Opponent model: online learning, concept drift

    Conclusion

  • Concept drift

    While learning from a stream, the concept behind the training examples changes over time; in an opponent model, this corresponds to a changing strategy.

    "Changing gears is not just about bluffing, it's about changing strategy to achieve a goal."

    Learning with concept drift must adapt quickly to changes, yet be robust to noise (and recognize recurrent concepts).

  • Basic approach to concept drift

    Maintain a window of training examples: large enough to learn from, small enough to adapt quickly without holding on to 'old' concepts.

    Heuristics adjust the window size, based on the FLORA2 framework [Widmer92], as sketched below.
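
A toy sketch of an adaptive-window heuristic in the FLORA2 spirit; the thresholds and step sizes are assumptions, not the framework's exact rules.

```python
from collections import deque

# Toy adaptive-window heuristic in the FLORA2 spirit: shrink the window when
# recent accuracy drops (suspected drift), grow it while accuracy is stable.
# The thresholds and step sizes are assumptions, not FLORA2's exact rules.
class AdaptiveWindow:
    def __init__(self, min_size=20, max_size=400):
        self.examples = deque()
        self.min_size, self.max_size = min_size, max_size
        self.capacity = min_size

    def add(self, example, recent_accuracy):
        if recent_accuracy < 0.55:        # accuracy collapse: drift suspected
            self.capacity = max(self.min_size, self.capacity // 2)
        elif recent_accuracy > 0.75:      # stable concept: keep more history
            self.capacity = min(self.max_size, self.capacity + 5)
        self.examples.append(example)
        while len(self.examples) > self.capacity:
            self.examples.popleft()       # forget the oldest examples

window = AdaptiveWindow()
window.add(("features...", "fold"), recent_accuracy=0.8)
print(window.capacity, len(window.examples))
```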

  • (plots: accuracy and window size over time for the 4 components of a single opponent model, with markers for the start of online learning and for the concept drift)

  • (plot: with bad parameters for the heuristic, the window size is NOT ROBUST)

  • Outline

    Overview of the approach: the poker game tree, the opponent model, Monte-Carlo tree search

    Research challenges: search, opponent model

    Conclusion

  • Conclusions

    First exploitative poker bot for no-limit hold'em with more than 2 players.

    Applies to other games: backgammon, computational pool, ...

    Challenges for MCTS: games with uncertainty, continuous action spaces.

    Challenges for ML: online learning, concept drift (relational learning).

  • Thanks for listening!
