
Learning to Communicate Implicitly by Actions

Zheng Tian†, Shihao Zou†, Ian Davies†, Tim Warr†, Lisheng Wu†, Haitham Bou Ammar†‡, Jun Wang†

†University College London, ‡Huawei R&D UK

{zheng.tian.11, shihao.zou.17, ian.davies.12, tim.warr.17, lisheng.wu.17}@ucl.ac.uk, [email protected], [email protected]

Abstract

In situations where explicit communication is limited, human collaborators act by learning to: (i) infer meaning behind their partner's actions, and (ii) convey private information about the state to their partner implicitly through actions. The first component of this learning process has been well-studied in multi-agent systems, whereas the second, which is equally crucial for successful collaboration, has not. To mimic both components mentioned above, thereby completing the learning process, we introduce a novel algorithm: Policy Belief Learning (PBL). PBL uses a belief module to model the other agent's private information and a policy module to form a distribution over actions informed by the belief module. Furthermore, to encourage communication by actions, we propose a novel auxiliary reward which incentivizes one agent to help its partner to make correct inferences about its private information. The auxiliary reward for communication is integrated into the learning of the policy module. We evaluate our approach on a set of environments including a matrix game, a particle environment and the non-competitive bidding problem from contract bridge. We show empirically that this auxiliary reward is effective and easy to generalize. These results demonstrate that our PBL algorithm can produce strong pairs of agents in collaborative games where explicit communication is disabled.

Introduction

In collaborative multi-agent systems, communication is essential for agents to learn to behave as a collective rather than a collection of individuals. This is particularly important in the imperfect-information setting, where private information becomes crucial to success. In such cases, efficient communication protocols between agents are needed for private information exchange, coordinated joint-action exploration, and true world-state inference.

In typical multi-agent reinforcement learning (MARL) settings, designers incorporate explicit communication channels hoping to conceptually resemble language or verbal communication, which are known to be important for human interaction (Baker et al. 1999). Though they can be used for facilitating collaboration in MARL, explicit communication channels come at additional computational and memory costs, making them difficult to deploy in decentralized control (Roth, Simmons, and Veloso 2006).

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Environments where explicit communication is difficult or prohibited are common. These settings can be synthetic, such as those in games, e.g., bridge and Hanabi, but also frequently appear in real-world tasks such as autonomous driving and autonomous fleet control. In these situations, humans rely upon implicit communication as a means of information exchange (Rasouli, Kotseruba, and Tsotsos 2017) and are effective in learning to infer the implicit meaning behind others' actions (Heider and Simmel 1944). The ability to perform such inference requires the attribution of a mental state and reasoning mechanism to others. This ability is known as theory of mind (Premack and Woodruff 1978). In this work, we develop agents that benefit from considering others' perspectives and thereby explore the further development of machine theory of mind (Rabinowitz et al. 2018).

Previous works have considered ways in which an agent can, by observing an opponent's behavior, build a model of opponents' characteristics, objectives or hidden information, either implicitly (He et al. 2016; Bard et al. 2013) or explicitly (Raileanu et al. 2018; Li and Miikkulainen 2018). Whilst these works are of great value, they overlook the fact that an agent should also consider that it is being modeled and adapt its behavior accordingly, thereby demonstrating a theory of mind. For instance, in collaborative tasks, a decision-maker could choose to take actions which are informative to its teammates, whereas, in competitive situations, agents may act to conceal private information to prevent their opponents from modeling them effectively.

In this paper, we propose a generic framework, titled policy belief learning (PBL), for learning to cooperate in imperfect-information multi-agent games. Our work combines opponent modeling with a policy that considers that it is being modeled. PBL consists of a belief module, which models other agents' private information by considering their previous actions, and a policy module, which combines the agent's current observation with their beliefs to return a distribution over actions. We also propose a novel auxiliary reward for encouraging communication by actions, which is integrated into PBL.


Our experiments show that agents trained using PBL can learn collaborative behaviors more effectively than a number of meaningful baselines without requiring any explicit communication. We conduct a complete ablation study to analyze the effectiveness of different components within PBL in our bridge experiment.

Related Work

Our work is closely related to (Albrecht and Stone 2017; Lowe et al. 2017; Mealing and Shapiro 2017; Raileanu et al. 2018), where agents build models to estimate other agents' hidden information. Contrastingly, our work enhances a "flat" opponent model with recursive reasoning. "Flat" opponent models estimate only the hidden information of opponents. Recursive reasoning requires making decisions based on the mental states of others as well as the state of the environment. In contrast to works such as I-POMDP (Gmytrasiewicz and Doshi 2005) and PR2 (Wen et al. 2019), where the nested belief is embedded into the training agent's opponent model, we incorporate the level-1 nested belief "I believe that you believe" into our policy via a novel auxiliary reward.

Recently, there has been a surge of interest in using reinforcement learning (RL) approaches to learn communication protocols (Foerster et al. 2016; Lazaridou, Peysakhovich, and Baroni 2016; Mordatch and Abbeel 2017; Sukhbaatar, Szlam, and Fergus 2016). Most of these works enable agents to communicate via an explicit channel. Among these works, Mordatch and Abbeel (2017) also observe the emergence of non-verbal communication in collaborative environments without an explicit communication channel, where agents are exclusively either a sender or a receiver. Similar research is also conducted in (de Weerd, Verbrugge, and Verheij 2015). In our setting, we do not restrict agents to be exclusively a sender or a receiver of communications; agents can communicate mutually by actions. Knepper et al. (2017) propose a framework for implicit communication in a cooperative setting and show that various problems can be mapped into this framework. Although our work is conceptually close to (Knepper et al. 2017), we go further and present a practical algorithm for training agents. The recent work of (Foerster et al. 2018) solves an imperfect-information problem as considered here from a different angle. We approach the problem by encouraging agents to exchange critical information through their actions, whereas Foerster et al. train a public player to choose an optimal deterministic policy for players in a game based on publicly observable information. In Cooperative Inverse RL (CIRL), where robotic agents try to infer a human's private reward function from their actions (Hadfield-Menell et al. 2016), optimal solutions need to produce behavior that conveys information.

Dragan, Lee, and Srinivasa (2013) consider how to train agents to exhibit legible behavior (i.e., behavior from which it is easy to infer the intention). Their approach is dependent on a hand-crafted cost function to attain informative behavior. Mutual information has been used as a means to promote coordination without the need for a human-engineered cost function. Strouse et al. (2018) use a mutual information objective to encourage an agent to reveal or hide its intention.

In a related work, Jaques et al. (2019) utilize a mutual information objective to imbue agents with social influence. While the objective of maximal mutual information in actions can yield highly effective collaborating agents, a mutual information objective in itself is insufficient to necessitate the development of implicit communication by actions. Eccles et al. (2019) introduce a reciprocity reward as an alternative approach to solve social dilemmas.

A distinguishing feature of our work in relation to previous works in multi-agent communication is that we do not have a predefined explicit communication protocol or learn to communicate through an explicit channel. Information exchange can only happen via actions. In contrast to previous works focusing on unilaterally making actions informative, we focus on bilateral communication by actions, where information transmission is directed to a specific party with potentially limited reasoning ability. Our agents learn to communicate through iterated policy and belief updates such that the resulting communication mechanism and belief models are interdependent. The development of a communication mechanism therefore requires either direct access to the mental state of other agents (via centralized training) or the ability to mentalize, commonly known as theory of mind. We investigate our proposed algorithm in both settings.

Problem Definition

We consider a set of agents, denoted by N, interacting with an unknown environment by executing actions from a joint set A = {A^1, ..., A^N}, with A^i denoting the action space of agent i, and N the total number of agents. To enable models that approximate real-world scenarios, we assume private and public information states. Private information states, jointly (across agents) denoted by X = {X^1, ..., X^N}, are a set of hidden information states where X^i is only observable by agent i, while public states O are observed by all agents. We assume that hidden information states at each time step are sampled from an unknown distribution Px : X → [0, 1], while public states evolve from an initial distribution Po : O → [0, 1], according to a stochastic transition model T : O × X × A^1 × ··· × A^N × O → [0, 1]. Having transitioned to a successor state according to T, agents receive rewards from R : S × A^1 × ··· × A^N → R, where we have used S = O × X to denote joint state descriptions that incorporate both public and private information. Finally, rewards are discounted over time by a factor γ ∈ (0, 1]. With this notation, our problem can be described succinctly by the tuple ⟨N, A, O, X, T, R, Px, Po, γ⟩, which we refer to as an imperfect-information Markov decision process (I2MDP)¹. In this work, we simplify the problem by assuming that hidden information states X are temporally static and are given at the beginning of the game.
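For readers who prefer code, the tuple can be mirrored by a small container. The sketch below is a hypothetical structure of our own; the field names and callable signatures are illustrative assumptions, not part of any released implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class I2MDP:
    """Container for the tuple <N, A, O, X, T, R, Px, Po, gamma> defined above."""
    n_agents: int                   # N
    action_spaces: Sequence[Any]    # A = {A^1, ..., A^N}
    public_states: Any              # O, observed by all agents
    private_states: Sequence[Any]   # X = {X^1, ..., X^N}, X^i visible to agent i only
    transition: Callable            # T(o, x, a^1, ..., a^N, o') -> probability
    reward: Callable                # R(s, a^1, ..., a^N) -> float, with s in S = O x X
    private_prior: Callable         # Px: distribution over hidden information states
    public_prior: Callable          # Po: initial distribution over public states
    gamma: float = 0.99             # discount factor in (0, 1]
```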

We interpret the joint policy from the perspective of agent i such that π = (π^i(a^i|s), π^{-i}(a^{-i}|s)), where π^{-i}(a^{-i}|s) is a compact representation of the joint policy of all agents excluding agent i.

¹We also note that our problem can be formalized as a decentralized partially observable Markov decision process (Dec-POMDP) (Bernstein, Zilberstein, and Immerman 2013).


In the collaborative setting, each agent is presumed to pursue the shared maximal cumulative reward, expressed as

$$\max_{\pi}\ \eta^{i}(\pi) = \mathbb{E}\left[\sum_{t=1}^{\infty}\gamma^{t}\,R\big(s_t, a_t^{i}, a_t^{-i}\big)\right], \qquad (1)$$

where s_t = [x_t^i, x_t^{-i}, o_t] is the current full information state, (a_t^i, a_t^{-i}) are the joint actions taken by agent i and all other agents respectively at time t, and γ is a discount factor.

Policy Belief Learning

Applying naive single-agent reinforcement learning (SARL) algorithms to our problem will lead to poor performance. One reason for this is the partial observability of the environment. To succeed in a partially observable environment, an agent is often required to maintain a belief state. Recall that, in our setting, the environment state is formed from the union of the private information of all agents and the publicly observable information, s_t = [x_t^i, x_t^{-i}, o_t]. We therefore learn a belief module Φ^i(x_t^{-i}) to model other agents' private information x_t^{-i}, which is the only hidden information from the perspective of agent i in our setting. We assume that an agent can model x_t^{-i} given the history of public information and actions executed by other agents, h_t^i = {o_{1:t-1}, a_{1:t-1}^{-i}}. We use a NN to parameterize the belief module, which takes in the history of public information and produces a belief state b_t^i = Φ^i(x_t^{-i}|h_t^i). The belief state together with information observable by agent i forms a sufficient statistic, s_t^i = [x_t^i, b_t^i, o_t], which contains all the information necessary for the agent to act optimally (Astrom 1965). We use a separate NN to parameterize agent i's policy π^i(a_t^i|s_t^i), which takes in the estimated environment state s_t^i and outputs a distribution over actions. As we assume hidden information is temporally static, we drop its time subscript in the rest of the paper.

The presence of multiple learning agents interacting with the environment renders the environment non-stationary. This further limits the success of SARL algorithms, which are generally designed for environments with stationary dynamics. To solve this, we adopt centralized training and decentralized execution, where during training all agents are recognized as one central representative agent differing only by their observations. Under this approach, one can imagine belief models Φ^i(x^{-i}|h_t^i) and Φ^{-i}(x^i|h_t^{-i}) sharing parameters φ. The input data, however, varies across agents due to the dependency on both h_t^i and h_t^{-i}. In a similar fashion, we let policies share the parameters θ. Consequently, one may think of updating θ and φ using one joint data set aggregated across agents. Without loss of generality, in the remainder of this section, we discuss the learning procedure from the point of view of a single agent, agent i.

We first present the learning procedure of our belief module. At iteration k, we use the current policy π_[k](a^i|s) to generate a data set of size M, Ω_[k] = {(x_j^{-i}, h_j^i)}_{j=1}^{M}, using self-play, and learn a new belief module by minimizing:

$$\phi_{[k]} := \operatorname*{arg\,min}_{\phi}\ \mathbb{E}_{(x^{-i},\,h^{i})\sim\Omega_{[k-1]}}\Big[\mathrm{KL}\big(x^{-i}\,\big\|\,b^{i}(h^{i};\phi)\big)\Big], \qquad (2)$$

where KL(·||·) is the Kullback-Leibler (KL) divergence and we use a one-hot vector to encode the ground truth, x^{-i}, when we calculate the relevant KL divergence.
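As a concrete illustration, the following is a minimal PyTorch sketch of the belief update in Equation 2 for the case where the private information is a single categorical variable: with one-hot targets, the KL term reduces to a cross-entropy loss. The network interface and tensor shapes are our own illustrative assumptions, not the released implementation (the bridge experiments instead use the 52-dimensional Bernoulli loss of Appendix B.5).

```python
import torch
import torch.nn.functional as F

def belief_loss(belief_net, histories, private_labels):
    """Eq. (2): KL(x^{-i} || b^i(h^i; phi)) with one-hot x^{-i} reduces to cross-entropy.

    histories:       (batch, history_dim) tensors encoding o_{1:t-1} and a^{-i}_{1:t-1}
    private_labels:  (batch,) integer labels identifying the one-hot ground truth x^{-i}
    """
    logits = belief_net(histories)              # (batch, n_private_states)
    return F.cross_entropy(logits, private_labels)

# One belief-learning step at PBL iteration k, on self-play data generated with pi_[k]:
#   optimizer.zero_grad(); belief_loss(net, h_batch, x_batch).backward(); optimizer.step()
```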

With the updated belief module Φ_[k], we learn a new policy for the next iteration, π_[k+1], via a policy gradient algorithm. Sharing information in multi-agent cooperative games through communication reduces intractability by enabling coordinated behavior. Rather than implementing expensive protocols (Heider and Simmel 1944), we encourage agents to implicitly communicate through actions by introducing a novel auxiliary reward signal. To do so, notice that in the centralized setting agent i has the ability to consult its opponent's belief model Φ^{-i}(x^i|h_t^{-i}), thereby exploiting the fact that other agents hold beliefs over its private information x^i. In fact, comparing b_t^{-i} to the ground truth x^i enables agent i to learn which actions bring these two quantities closer together and thereby learn informative behavior. This can be achieved through an auxiliary reward signal devised to encourage informative action communication:

$$r_{c,t}^{i} = \mathrm{KL}\big(x^{i}\,\big\|\,b^{-i,*}\big) - \mathrm{KL}\big(x^{i}\,\big\|\,b_{t+1}^{-i}\big), \qquad (3)$$

where b^{-i,*} = Φ_[k]^{-i}(x^i|h_{t,*}^{-i}) is agent -i's best belief (so far) about agent i's private information:

$$b^{-i,*} = \operatorname*{arg\,min}_{b_{u}^{-i}}\ \mathrm{KL}\big(x^{i}\,\big\|\,b_{u}^{-i}\big) \quad \forall\, u \le t.$$

In other words, r_{c,t}^i encourages communication as it is proportional to the improvement in the opponent's belief (for a fixed belief model Φ_[k]^{-i}(x^i|h_{t+1}^{-i})), measured by its proximity to the ground truth, resulting from the opponent observing agent i's action a_t^i. Hence, during the policy learning step of PBL, we apply a policy gradient algorithm with a shaped reward of the form:²

$$r = r_e + \alpha r_c, \qquad (4)$$

where r_e is the reward from the environment, r_c is the communication reward, and α ≥ 0 balances the communication and environment rewards.
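To make the reward shaping concrete, here is a small NumPy sketch of Equations 3 and 4. It assumes the private information is one-hot encoded and that the partner's belief after each of the agent's past actions has been recorded in a list; all names and shapes are illustrative rather than taken from the released code.

```python
import numpy as np

def kl_onehot(x_onehot, belief, eps=1e-8):
    """KL(x || b) for a one-hot ground-truth vector x and a probability vector b."""
    x = np.asarray(x_onehot, dtype=float)
    b = np.asarray(belief, dtype=float)
    return float(np.sum(x * (np.log(x + eps) - np.log(b + eps))))

def communication_reward(x_i_onehot, partner_beliefs):
    """Eq. (3): improvement of the partner's belief over its best belief so far.

    partner_beliefs holds b^{-i}_1, ..., b^{-i}_{t+1}; the last entry is the belief
    after the partner observed agent i's latest action.
    """
    best_so_far = min(kl_onehot(x_i_onehot, b) for b in partner_beliefs[:-1])
    return best_so_far - kl_onehot(x_i_onehot, partner_beliefs[-1])

def shaped_reward(r_env, r_comm, alpha):
    """Eq. (4): r = r_e + alpha * r_c."""
    return r_env + alpha * r_comm
```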

Algorithm 1 Per-Agent Policy Belief Learning (PBL)

1: Initialize: Randomly initialize policy π_0 and belief Φ_0
2: Pre-train π_0
3: for k = 0 to max iterations do
4:   Sample episodes for belief training using self-play, forming the data set Ω_[k]
5:   Update the belief network using data from Ω_[k] by solving Equation 2
6:   Given the updated belief Φ_[k+1](·), update the policy π(·) (policy gradients with rewards from Equation 4)
7: end for
8: Output: Final policy and belief model
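Read as code, Algorithm 1 is a simple alternation between belief fitting and policy improvement. The skeleton below is a hypothetical rendering in which every step is passed in as a callable, since the concrete sub-routines (self-play rollout, the PPO update of Appendix B.4, and so on) are described elsewhere in the paper.

```python
def policy_belief_learning(pretrain, self_play, fit_belief, policy_update, max_iterations):
    """Skeleton of Algorithm 1. Arguments are user-supplied callables:
    pretrain() -> policy; self_play(policy) -> dataset of (x^{-i}, h^i) pairs;
    fit_belief(dataset) -> belief (Eq. 2); policy_update(policy, belief) -> policy
    (policy gradients with the shaped reward of Eq. 4)."""
    policy = pretrain()                          # line 2
    belief = None
    for _ in range(max_iterations):              # line 3
        dataset = self_play(policy)              # line 4
        belief = fit_belief(dataset)             # line 5
        policy = policy_update(policy, belief)   # line 6
    return policy, belief                        # line 8
```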

Initially, in the absence of a belief module, we pre-train a policy π_[0] naively by ignoring the existence of other agents in the environment.

²Please note, we omit the agent index i in the reward equation, as we shape rewards similarly for all agents.


Figure 1: Matrix game experiment and results. (a) Payoff table for the matrix game: Player 1 (acts first) and Player 2 (acts second) each hold Card 1 or Card 2 and choose among actions A, B and C. (b) Learning curves (average reward vs. iterations) of PBL and baselines (BAD, BAD-CF, PG) over 100 runs.

As an agent's reasoning ability may be limited, we may then iterate between belief and policy learning multiple times until either the allocated computational resources are exhausted or the policy and belief modules converge. We summarize the main steps of PBL in Algorithm 1. Note that, although information can be leaked during training, as training is centralized, distributed test-phase execution ensures that private variables remain hidden during execution.

Machine Theory of Mind

In PBL, we adopt a centralized training and decentralized execution scheme where agents share the same belief and policy models. In reality, however, it is unlikely that two people will have exactly the same reasoning process. In contrast to requiring everyone to have the same reasoning process, a person's success in navigating social dynamics relies on their ability to attribute mental states to others. This attribution of mental states to others is known as theory of mind (Premack and Woodruff 1978). Theory of mind is fundamental to human social interaction, which requires the recognition of other sensory perspectives, the understanding of other mental states, and the recognition of complex non-verbal signals of emotional state (Lemaignan and Dillenbourg 2015). In collaboration problems without an explicit communication channel, humans can effectively establish an understanding of each other's mental state and subsequently select appropriate actions. For example, a teacher will reiterate a difficult concept to students if she infers from the students' facial expressions that they have not understood. The effort of one agent to model the mental state of another is characterized as Mutual Modeling (Dillenbourg 1999).

In our work, we also investigate whether the proposed communication reward can be generalized to a distributed setting which resembles a human application of theory of mind. Under this setting, we train a separate belief model for each agent so that Φ^i(x^{-i}|h_t^i) and Φ^{-i}(x^i|h_t^{-i}) do not share parameters (φ^i ≠ φ^{-i}). Without centralization, an agent can only measure how informative its action is to others with its own belief model. Assuming agents can perfectly recall their past actions and observations, agent i computes its communication reward as:³

$$r_{c,t}^{i} = \mathrm{KL}\big(x^{i}\,\big\|\,b_{t}^{i,*}\big) - \mathrm{KL}\big(x^{i}\,\big\|\,b_{t+1}^{i}\big),$$

where b_t^{i,*} = Φ^i(x^i|h_{t,*}^{-i}) and b_{t+1}^i = Φ^i(x^i|h_{t+1}^{-i}). In this way, an agent essentially establishes a mental state of others with its own belief model and acts upon it. We humbly believe this could be a step towards machine theory of mind, where algorithmic agents learn to attribute mental states to others and adjust their behavior accordingly.
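The only change relative to the centralized reward of Equation 3 is whose belief model is queried. A minimal sketch of this distributed variant, under the same one-hot assumptions as the earlier snippet, is:

```python
import numpy as np

def tom_communication_reward(x_i_onehot, self_beliefs, eps=1e-8):
    """Distributed variant: agent i evaluates its action with its OWN belief model
    applied to the partner's observed history, b^i_t = Phi^i(x^i | h^{-i}_t),
    instead of consulting the partner's model as in Eq. (3)."""
    x = np.asarray(x_i_onehot, dtype=float)
    kls = [float(np.sum(x * (np.log(x + eps) - np.log(np.asarray(b, dtype=float) + eps))))
           for b in self_beliefs]                 # KL(x^i || b^i_u) for u = 1, ..., t+1
    return min(kls[:-1]) - kls[-1]
```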

The ability to mentalize relieves the restriction of collaborators having the same reasoning process. However, the success of collaboration still relies on the correctness of one's belief about the mental states of others. For instance, correctly inferring other drivers' mental states and conventions can reduce the likelihood of traffic accidents. Therefore, road safety education is important as it reduces variability among drivers' reasoning processes. In our work, this alignment amounts to the similarity between two agents' trained belief models, which is affected by training data, initialization of weights, training algorithms and so on. We leave investigation of the robustness of collaboration to variability in collaborators' belief models to future work.

Experiments & Results

We test our algorithms in three experiments. In the first, we validate the correctness of the PBL framework, which integrates our communication reward with iterative belief and policy module training, in a simple matrix game. In this relatively simple experiment, PBL achieves near-optimal performance. Equipped with this knowledge, we further apply PBL to the non-competitive bridge bidding problem to verify its scalability to more complex problems. Lastly, we investigate the efficacy of the proposed communication reward in a distributed training setting.

Matrix Game

We test our PBL algorithm on a matrix card game where an implicit communication strategy is required to achieve the global optimum.

³Note the difference of super/sub-scripts of the belief model and its parameters when compared to Equation 3.


Figure 2: a) Learning curves (average reward vs. iterations) for non-competitive bridge bidding with PBL, NCR, NPBI and IP, with a warm start from a model trained to predict the score distribution (average reward at warm start: 0.038). Details of the warm start are provided in Appendix B.3. b) Bar graph comparing PBL to variants of PQL (PQL-5 and PQL-4, each also with No-Rules and Single-Network ablations), with the full version of PQL results as reported in (Yeh and Lin 2016).

This game was first proposed in (Foerster et al. 2018). There are two players and each player receives a card drawn from {card 1, card 2} independently at the beginning of the game. Player 1 acts first and Player 2 responds after observing Player 1's action. Neither player can see the other's hand. By the design of the payoff table (shown in Figure 1a), Player 1 has to use actions C and A to signify that it holds Cards 1 and 2 respectively, so that Player 2 can choose its actions optimally given this information. We compare PBL with the algorithms proposed in (Foerster et al. 2018) and vanilla policy gradient. As can be seen from Figure 1b, PBL performs similarly to BAD and BAD-CF on this simple game and outperforms vanilla policy gradient significantly. This demonstrates a proof of principle for PBL in a multi-agent imperfect-information coordination game.

Contract Bridge Case-Study

Non-competitive contract bridge bidding is an imperfect-information game that requires information exchange between agents to agree high-quality contracts. Hence, such a game serves as an ideal test-bed for PBL. In bridge, two teams of two (North-South vs East-West) are situated in opposing positions and play a trick-taking game using a standard 52-card deck. Following a deal, the bidding and playing phases can be effectively separated. During the bidding phase, players sequentially bid for a contract until a final contract is reached. A PASS bid retains previously proposed contracts, and a contract is considered final if it is followed by three consecutive PASS bids. A non-PASS bid proposes a new contract of the form ⟨integer, suit⟩, where integer takes integer values between one and seven, and suit belongs to {♣, ♦, ♥, ♠, NT}. The number of tricks needed to achieve a contract is 6 + integer, and an NT suit corresponds to bidding to win tricks without trumps. A contract-declaring team achieves points if it fulfills the contract; if not, the points for the contract go to the opposing team. Bidding must be non-decreasing, meaning integer is non-decreasing and must increase if the newly proposed trump suit precedes or equals the currently bid suit in the ordering ♣ < ♦ < ♥ < ♠ < NT.
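The legality rule for bids (non-decreasing level, with the suit ordering ♣ < ♦ < ♥ < ♠ < NT breaking ties) is easy to state in code. The helper below is an illustrative sketch of that rule only; it is not taken from the paper's implementation, and PASS is assumed to be handled separately since it is always legal.

```python
SUIT_ORDER = {"C": 0, "D": 1, "H": 2, "S": 3, "NT": 4}   # clubs < diamonds < hearts < spades < NT

def is_legal_bid(new_bid, current_bid):
    """Check the non-decreasing bidding rule for non-PASS bids.

    Bids are (level, suit) tuples with level in 1..7; current_bid is None when no
    contract has been proposed yet.
    """
    level, suit = new_bid
    if not 1 <= level <= 7 or suit not in SUIT_ORDER:
        return False
    if current_bid is None:
        return True
    cur_level, cur_suit = current_bid
    # Same level is allowed only if the new suit ranks strictly higher; otherwise the level must rise.
    return level > cur_level or (level == cur_level and SUIT_ORDER[suit] > SUIT_ORDER[cur_suit])
```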

In this work, we focus on non-competitive bidding in bridge, where we consider North (N) and South (S) bidding in the game, while East (E) and West (W) always bid PASS. Hence, the declaring team never changes. Thus, each deal can be viewed as an independent episode of the game. The private information of player i ∈ {N, S}, x^i, is its hand. x^i is a 52-dimensional binary vector encoding player i's 13 cards. An agent's observation at time step t consists of its hand and the bidding history: o_t^i = {x_t^i, h_t^i}. In each episode, Players N and S are dealt hands x^N, x^S respectively. Their hands together describe the full state of the environment s = {x^N, x^S}, which is not fully observed by either of the two players. Since rolling out via self-play for every contract is computationally expensive, we resort to double dummy analysis (DDA) (Haglund 2010) for score estimation. Interested readers are referred to (Haglund 2010) and the appendix for further details. In our work, we use standard Duplicate bridge scoring rules (League 2017) to score games and normalize scores by dividing them by the maximum absolute score.

Benchmarking & Ablation Studies: PBL introduces several building blocks, each affecting performance in its own right. We conduct an ablation study to better understand the importance of these elements and compare against a state-of-the-art method in PQL (Yeh and Lin 2016). We introduce the following baselines:

1. Independent Player (IP): A player bids independently without consideration of the existence of the other player.

2. No communication reward (NCR): One important question to ask is how beneficial the additional communication auxiliary reward r_c is in terms of learning a good bidding strategy. To answer this question, we implement a baseline using the same architecture and training schedule as PBL but setting the communication reward weighting to zero, α = 0.

3. No PBL style iteration (NPBI): To demonstrate that multiple iterations between policy and belief training are beneficial, we compare our model to a baseline policy trained with the same number of weight updates as our model but no further PBL iterations after training a belief network Φ_0 at PBL iteration k = 0.


Figure 3: An example of a belief update trace showing how PBL agents use actions for effective communication. Upon observing ⟨PASS⟩ from East, West decreases its HCP belief in all suits. When West bids ⟨1,C⟩, East improves its belief in clubs. Next, East bids ⟨1,H⟩. West recalculates its belief from the last time step and increases its HCP belief in hearts.

4. Penetrative Q-Learning (PQL): PQL was proposed by Yeh and Lin (2016) as the first bidding policy for non-competitive bridge bidding without human domain knowledge.

Figure 2a shows the average learning curves of our model and three baselines for our ablation study. We obtain these curves by testing trained algorithms periodically on a pre-generated test data set which contains 30,000 games. Each point on the curve is an average score computed by Duplicate bridge scoring rules (League 2017) over 30,000 games and 6 training runs. As can be seen, IP and NCR both initially learn faster than our model. This is reasonable, as PBL spends more time learning a communication protocol at first. However, IP converges to a local optimum very quickly and is surpassed by PBL after approximately 400 learning iterations. NCR, with a belief module, learns a better bidding strategy than IP. However, NCR learns more slowly than PBL in the later stage of training because it has no guidance on how to convey information to its partner. PBL outperforming NPBI demonstrates the importance of iterative training between policy and belief modules.

Restrictions of PQL: PQL (Yeh and Lin 2016) is the first algorithm trained to bid in bridge without human-engineered features. However, its strong bidding performance relies on heavy adaptation and heuristics for non-competitive bridge bidding. First, PQL requires a predefined maximum number of allowed bids in each deal, while using different bidding networks at different times. Our results show that it fails when we train a single NN for the whole game, which can be seen as a minimum requirement for most DRL algorithms. Second, PQL relies on a rule-based function for selecting the best contracts at test time. In fact, removing this second heuristic significantly reduces PQL's performance, as reported in Figure 2b. In addition, without pre-processing the training data as in (Yeh and Lin 2016), we could not reproduce the original results.

To achieve state-of-the-art performance, we could use these (or other) heuristics for our bidding algorithm. However, this deviates from the focus of our work, which is to demonstrate that PBL is a general framework for learning to communicate by actions.

Belief Update Visualization: To understand how agents update their beliefs after observing a new bid, we visualize the belief update process (Figure 3). An agent's belief about its opponent's hand is represented as a 52-dimensional vector with real values, which is not amenable to human interpretation. Therefore, we use high card points (HCPs) to summarize each agent's belief. For each suit, each card is given a point score according to the mapping: A=4, K=3, Q=2, J=1, else=0. Note that while an agent's beliefs are updated based on the entire history of its opponent's bids, the difference between that agent's belief from one round to the next is predominantly driven by the most recent bid of its opponent, as shown in Figure 3.
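For reference, the HCP summary used in Figure 3 can be computed directly from a 52-dimensional belief vector. The sketch below assumes a hypothetical card indexing of index = 13 * suit + rank with ranks ordered 2 through Ace within each suit; the paper does not specify its encoding, so treat the layout as illustrative.

```python
import numpy as np

# Assumed card layout: index = 13 * suit + rank, suits ordered (clubs, diamonds, hearts, spades),
# ranks 2, 3, ..., 10, J, Q, K, A. HCP weights: J=1, Q=2, K=3, A=4, all other ranks 0.
HCP_PER_RANK = np.array([0] * 9 + [1, 2, 3, 4], dtype=float)   # 13 rank weights

def belief_to_hcp(belief_52):
    """Summarize per-card probabilities of being in the partner's hand as expected HCP per suit."""
    probs = np.asarray(belief_52, dtype=float).reshape(4, 13)   # (suit, rank)
    return probs @ HCP_PER_RANK                                 # expected HCP for clubs, diamonds, hearts, spades
```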

Learned Bidding Convention: Whilst our model's bidding decisions are based entirely on raw card data, we can use high card points as a simple way to observe and summarize the decisions which are being made. For example, we observe that our policy opens the bidding with 1♠ if its spade HCP is 4.5 or higher and its HCP in every other suit is lower. We run the model on the unseen test set of 30,000 deals and summarize the learned bidding convention in Appendix C.1.

Imperfect Recall of History: The length of the action history players can recall affects the accuracy of the belief models. The extent of the impact depends on the nature of the game. In bridge, the order of bidding encodes important information. We ran an ablation study where players can only recall the most recent bid. In this setting, players do worse (average score 0.065) than players with perfect recall. We conjecture that this is because players can extract less information and therefore the accuracy of the belief models drops.


Figure 4: a) Learning curves (average reward vs. iterations) for Silent Guide. The Guide agent trained with the communication reward (CR) significantly outperforms the one trained with no communication reward (NCR). b) A trajectory of Listener (gray circle) and Guide (blue circle) with CR. Landmarks are positioned randomly and the Goal landmark (blue square) is randomly chosen at the start of each episode. c) A trajectory of Listener and Guide with NCR. Trajectories are presented with agents becoming progressively darker over time.

Silent Guide

We modify a multi-agent particle environment (Lowe et al. 2017) to test the effectiveness of our novel auxiliary reward in a distributed setting. This environment also allows us to explore the potential for implicit communication to arise through machine theory of mind. In the environment there are two agents and three landmarks. We name the agents Guide and Listener respectively. Guide can observe Listener's goal landmark, which is distinguished by its color. Listener does not observe its goal. However, Listener is able to infer the meaning behind Guide's actions. The two agents receive the same reward, which is the negative distance between Listener and its goal. Therefore, to maximize the cumulative reward, Guide needs to tell Listener the goal landmark color. However, as the "Silent Guide" name suggests, Guide has no explicit communication channel and can only communicate to Listener through its actions.

In the distributed setting, we train separate belief modules for Guide and Listener respectively. The two belief modules are both trained to predict a naive agent's goal landmark color given its history within the current episode, but using different data sets. We train both Guide and Listener policies from scratch. Listener's policy takes Listener's velocity, relative distance to the three landmarks and the prediction of the belief module as input. It is trained to maximize the environment reward it receives. Guide's policy takes its velocity, relative distance to landmarks and Listener's goal as input. To encourage communication by actions, we train the Guide policy with the auxiliary reward proposed in our work. We compare our method against a naive Guide policy which is trained without the communication reward. The results are shown in Figure 4. Guide, when trained with the communication reward (CR), learns to inform Listener of its goal by approaching the goal it observes. Listener learns to follow. However, in the NCR setting, Listener learns to ignore Guide's uninformative actions and moves to the center of the three landmarks. While Guide and Listener are equipped with belief models trained from different data sets, Guide manages to

use its own belief model to establish the mental state of Listener and learns to communicate through actions judged by this constructed mental state of Listener. We also observe that a trained Guide agent can work with a naive RL Listener (best reward -0.252) which has no belief model but can observe the PBL Guide agent's actions. The success of Guide with CR shows the potential for machine theory of mind. We obtain the learning curves by repeating the training process five times and taking the shared average environment reward.

Conclusions & Future Work

In this paper, we focus on implicit communication through actions. This distinguishes our work from previous works which focus on either explicit communication or unilateral communication. We propose an algorithm combining agent modeling and communication for collaborative imperfect-information games. Our PBL algorithm iterates between training a policy and a belief module. We propose a novel auxiliary reward for encouraging implicit communication between agents which effectively measures how much closer the opponent's belief about a player's private information becomes after observing the player's action. We empirically demonstrate that our methods can achieve near-optimal performance in a matrix problem and scale to complex problems such as contract bridge bidding. We conduct an initial investigation of the further development of machine theory of mind. Specifically, we enable an agent to use its own belief model to attribute mental states to others and act accordingly. We test this framework and achieve some initial success in a multi-agent particle environment under distributed training. There are many interesting avenues for future work, such as exploring the robustness of collaboration to differences in agents' belief models.

References

Albrecht, S. V., and Stone, P. 2017. Autonomous Agents Modelling Other Agents: A Comprehensive Survey and Open Problems. arXiv e-prints.


Astrom, K. J. 1965. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications 10:174–205.

Baker, M.; Hansen, T.; Joiner, R.; and Traum, D. 1999. The role of grounding in collaborative learning tasks. Collaborative learning: Cognitive and computational approaches.

Bard, N.; Johanson, M.; Burch, N.; and Bowling, M. 2013. Online implicit agent modelling. In Proceedings of the 2013 International Conference on Autonomous Agents and Multiagent Systems, AAMAS '13, 255–262.

Bernstein, D. S.; Zilberstein, S.; and Immerman, N. 2013. The Complexity of Decentralized Control of Markov Decision Processes. arXiv e-prints arXiv:1301.3836.

de Weerd, H.; Verbrugge, R.; and Verheij, B. 2015. Higher-order theory of mind in the tacit communication game. Biologically Inspired Cognitive Architectures.

Dillenbourg, P. 1999. What do you mean by collaborative learning? In Dillenbourg, P., ed., Collaborative-learning: Cognitive and Computational Approaches. Oxford: Elsevier.

Dragan, A. D.; Lee, K. C.; and Srinivasa, S. S. 2013. Legibility and predictability of robot motion. In Proceedings of the 8th ACM/IEEE International Conference on Human-Robot Interaction, 301–308. IEEE Press.

Eccles, T.; Hughes, E.; Kramar, J.; Wheelwright, S.; and Leibo, J. Z. 2019. Learning reciprocity in complex sequential social dilemmas. arXiv preprint arXiv:1903.08082.

Foerster, J. N.; Assael, Y. M.; Freitas, N.; and Whiteson, S. 2016. Learning to communicate with deep multi-agent reinforcement learning. CoRR abs/1605.06676.

Foerster, J. N.; Song, F.; Hughes, E.; Burch, N.; Dunning, I.; Whiteson, S.; Botvinick, M.; and Bowling, M. 2018. Bayesian Action Decoder for Deep Multi-Agent Reinforcement Learning. arXiv e-prints arXiv:1811.01458.

Gmytrasiewicz, P. J., and Doshi, P. 2005. A framework for sequential planning in multi-agent settings. J. Artif. Int. Res.

Hadfield-Menell, D.; Russell, S. J.; Abbeel, P.; and Dragan, A. 2016. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, 3909–3917.

Haglund, B. 2010. Search algorithms for a bridge double dummy solver.

He, H.; Boyd-Graber, J.; Kwok, K.; and Daumé III, H. 2016. Opponent modeling in deep reinforcement learning. In ICML '16.

Heider, F., and Simmel, M. 1944. An experimental study of apparent behavior. The American Journal of Psychology.

Jaques, N.; Lazaridou, A.; Hughes, E.; Gulcehre, C.; Ortega, P.; Strouse, D.; Leibo, J. Z.; and De Freitas, N. 2019. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In International Conference on Machine Learning, 3040–3049.

Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Knepper, R. A.; Mavrogiannis, C. I.; Proft, J.; and Liang, C. 2017. Implicit communication in a joint action. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, HRI '17. ACM.

Lazaridou, A.; Peysakhovich, A.; and Baroni, M. 2016. Multi-agent cooperation and the emergence of (natural) language. CoRR abs/1612.07182.

League, A. C. B. 2017. Laws of duplicate bridge.

Lemaignan, S., and Dillenbourg, P. 2015. Mutual modelling in robotics: Inspirations for the next steps. 2015 10th ACM/IEEE International Conference on Human-Robot Interaction (HRI) 303–310.

Li, X., and Miikkulainen, R. 2018. Dynamic adaptation and opponent exploitation in computer poker. In Workshops at the 32nd AAAI Conference on Artificial Intelligence.

Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; and Mordatch, I. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. Neural Information Processing Systems (NIPS).

Mealing, R., and Shapiro, J. L. 2017. Opponent modeling by expectation-maximization and sequence prediction in simplified poker. IEEE Transactions on Computational Intelligence and AI in Games 9(1):11–24.

Mordatch, I., and Abbeel, P. 2017. Emergence of grounded compositional language in multi-agent populations. CoRR abs/1703.04908.

Premack, D., and Woodruff, G. 1978. Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences 1.

Rabinowitz, N. C.; Perbet, F.; Song, H. F.; Zhang, C.; Eslami, S. M. A.; and Botvinick, M. 2018. Machine theory of mind. CoRR abs/1802.07740.

Raileanu, R.; Denton, E.; Szlam, A.; and Fergus, R. 2018. Modeling others using oneself in multi-agent reinforcement learning. CoRR abs/1802.09640.

Rasouli, A.; Kotseruba, I.; and Tsotsos, J. 2017. Agreeing to cross: How drivers and pedestrians communicate. In Intelligent Vehicles Symposium (IV), 2017 IEEE. IEEE.

Roth, M.; Simmons, R.; and Veloso, M. 2006. What to communicate? Execution-time decision in multi-agent POMDPs. In DARS '06. Springer. 177–186.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Strouse, D. J.; Kleiman-Weiner, M.; Tenenbaum, J.; Botvinick, M.; and Schwab, D. 2018. Learning to share and hide intentions using information regularization. NIPS '18.

Sukhbaatar, S.; Szlam, A.; and Fergus, R. 2016. Learning multiagent communication with backpropagation. CoRR.

Wen, Y.; Yang, Y.; Luo, R.; Wang, J.; and Pan, W. 2019. Probabilistic recursive reasoning for multi-agent reinforcement learning. In ICLR.

Yeh, C., and Lin, H. 2016. Automatic bridge bidding using deep reinforcement learning. CoRR abs/1607.03290.


A Bridge

A.1 Playing Phase in Bridge

After the final contract is decided, the player from the declaring side who first bid the trump suit named in the final contract becomes Declarer. Declarer's partner is Dummy. The player to the left of Declarer becomes the first leading player. Dummy then lays his cards face up on the table and play proceeds clockwise. On each trick, the leading player shows one card from their hand and the other players must play the same suit as the leading player if possible; otherwise, they can play a card from another suit. The trump suit is superior to all other suits and, within a suit, a higher-rank card is superior to a lower-rank one. A trick is won by the player who plays the card with the highest priority, and the winner of the trick becomes the leading player for the next trick.

A.2 Double Dummy Analysis (DDA)

Double Dummy Analysis assumes that, for a particular deal, one player's hand is fully observed by the other players and players always play cards to their best advantage. However, given a set s = {x^N, x^S}, the distribution of remaining cards for the two non-bidding players East and West is still unknown. To reduce the variance of the estimate, we repeatedly sample a deal U times by allocating the remaining cards randomly to East and West and then estimate r_e(s) by taking the average of their DDA scores,

$$r_e(s) = \frac{1}{U}\sum_{u=1}^{U} r_e\big(x^N, x^S, x_u^E, x_u^W\big), \qquad (5)$$

where x_u^E, x_u^W are the hands for East and West from the u-th sampling respectively. For a specific contract a_t, the corresponding score is given by r_e(a_t|s) = r_e(x^N, x^S, a_t). In our work, we set U = 20.
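The sampling scheme of Equation 5 can be sketched as follows. The `dda_score` callable stands in for a double dummy solver in the spirit of (Haglund 2010); it and the data representations are assumptions made for illustration, not the paper's actual interface.

```python
import random

def estimate_score(north_hand, south_hand, deck, dda_score, n_samples=20):
    """Eq. (5): average double-dummy scores over U random completions of the deal."""
    north, south = set(north_hand), set(south_hand)
    remaining = [card for card in deck if card not in north and card not in south]
    total = 0.0
    for _ in range(n_samples):
        random.shuffle(remaining)
        east, west = remaining[:13], remaining[13:26]   # deal the unseen cards to East and West
        total += dda_score(north_hand, south_hand, east, west)
    return total / n_samples
```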

A.3 Bridge Scoring

Algorithm 2 Bridge Duplicate Scoring

Score(tricks made, bid level, trump suit)
  T ← trump suit
  δ ← tricks made − (bid level + 6)
  score ← 0
  if δ ≥ 0 then
    score ← score + bid level ∗ scale_T + bias_T    ▷ Contract tricks
    if score ≥ 100 then
      score ← 300                                   ▷ Game bonus
    else
      score ← 50                                    ▷ Partscore
    end if
    if δ = 6 then
      score ← score + 500                           ▷ Slam bonus
    else if δ = 7 then
      score ← score + 1000                          ▷ Grand slam bonus
    end if
    if δ > 0 then
      score ← score + δ ∗ scale_T                   ▷ Over-tricks
    end if
  else
    score ← score − bid level ∗ 50                  ▷ Under-tricks
  end if

Algorithm 2 shows how we score a game under Duplicate Bridge Scoring rules. We obtain the average number of tricks made using Double Dummy Analysis (Haglund 2010) given the hands of players North and South, {x^N, x^S}, the declarer and the trump suit. The score function above has a scale and bias for each trump suit. The scale is 20 for ♣ and ♦ and 30 for all others. The bias is zero for all trumps, except NT, which has a bias of 10.
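For convenience, here is a direct Python transcription of Algorithm 2 with the scale and bias values just given. It mirrors the pseudocode as printed (non-vulnerable, undoubled contracts); anything beyond that is not implied.

```python
SCALE = {"C": 20, "D": 20, "H": 30, "S": 30, "NT": 30}
BIAS = {"C": 0, "D": 0, "H": 0, "S": 0, "NT": 10}

def duplicate_score(tricks_made, bid_level, trump):
    """Python rendering of Algorithm 2 (Bridge Duplicate Scoring), following the pseudocode above."""
    delta = tricks_made - (bid_level + 6)
    score = 0
    if delta >= 0:
        score += bid_level * SCALE[trump] + BIAS[trump]   # contract tricks
        score = 300 if score >= 100 else 50               # game bonus / partscore, as in the pseudocode
        if delta == 6:
            score += 500                                  # slam bonus
        elif delta == 7:
            score += 1000                                 # grand slam bonus
        if delta > 0:
            score += delta * SCALE[trump]                 # over-tricks
    else:
        score -= bid_level * 50                           # under-tricks
    return score
```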

Note that Double is only a valid bid in response to a contract proposed by one's opponents. Also, a Redouble bid must be preceded by a Double bid. In the non-competitive game, opponents do not propose contracts, so these options are naturally not included.


Figure 5: The architecture of our Policy Network and Belief Network in bridge. E represents the environment and FC stands for fully connected layers. For player i, we add its private information x^i and belief b^i together to form η^i, the input to its policy network π. Player i's action a^i becomes part of player j's history h^j, which is the input to player j's belief network Φ.

A.4 Double Pass Analysis

At the beginning of bidding, when two players both bid Pass (Double Pass), all players' hands are re-dealt and a new episode starts. If we ignore these episodes in training, a naive strategy emerges where a player always bids Pass unless it is highly confident about its hand and therefore bids at level 1, whose risk is at the minimum. In this work, we are interested in solving problems where private information needs to be inferred from observed actions for better performance in a game. Therefore, this strategy is less meaningful, and the opportunity cost of bidding Pass could be high when a player could have won the game with a high reward. To reflect this opportunity cost in training, we set the reward for Double Pass as the negative of the maximum attainable reward given the players' hands: r_dp(s) = −max(r_e(s)). Therefore, a player will be penalized heavily for bidding Pass if it could have obtained a high reward otherwise, and rewarded slightly if it could never win in the current episode. It is worthy of note that an initial Pass bid can convey information; however, if it is followed by a second Pass bid the game ends and hence no further information is imparted from a second Pass.

We note that, by Duplicate Bridge Laws 77 and 22 of (League 2017), if the game opens with four Pass bids all players score zero. In our setting, however, East and West bid Pass regardless of their hands, which would give the North and South players extra advantages. Therefore, we use r_dp to avoid results where North and South only bid when they have particularly strong hands; we discourage fully risk-averse behavior.

B Experiment Details in Bridge

B.1 Offline Environment

Generating a new pair of hands for North and South and pre-calculating scores for every possible contract at the start of each new episode during policy training is time-inefficient. Instead, we pre-generate 1.5 million hands and score them in advance. We then sample new episodes from this data set when we train a policy. We also generate a separate test data set containing 30,000 hands for testing our models.

B.2 Model Architecture

We parameterize the policy and belief modules with two neural networks, π_{θp} and Φ_{θb} respectively. Fig. 5 shows our model for the bridge experiments. Weights of the belief and policy modules are shared between players. The input to a player's policy π_{θp} is the sum of its private information x and belief b. The last layer of the belief module Φ has a Sigmoid non-linear activation function, so that the belief vector b has elements in the interval [0, 1]. By adding belief b to x, we can reuse the parameters associated with x for b and avoid training extra parameters.

B.3 Policy Pre-training

In the first PBL iteration (k = 0), to obtain a good initial policy and avoid a low-entropy one, we train our policy to predict a distribution formed by applying a Softmax with temperature τ to re(s). The pre-training loss for the policy, given s, is:

Lk0 = KL(π(a|x)||Softmax(re(s), τ)),

where KL is the KL-divergence. For a fair comparison, all our benchmarks are initialized with this pre-trained policy. Supervising a bidding policy to predict pre-calculated scores for all actions only provides it with a basic understanding of its hand.
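As a sketch, the pre-training objective can be written as a KL divergence between the policy's output distribution and a temperature-scaled Softmax of the pre-calculated rewards; the temperature value and batching below are assumptions.

```python
import torch
import torch.nn.functional as F

def pretraining_loss(policy_logits, contract_rewards, tau=1.0):
    """KL(pi(.|x) || Softmax(r_e(s), tau)), averaged over the batch.

    policy_logits:    (batch, num_bids) unnormalized policy outputs.
    contract_rewards: (batch, num_bids) pre-calculated rewards r_e(s).
    tau:              Softmax temperature (its value is not given in the paper).
    """
    log_pi = F.log_softmax(policy_logits, dim=-1)
    log_target = F.log_softmax(contract_rewards / tau, dim=-1)
    kl = (log_pi.exp() * (log_pi - log_target)).sum(dim=-1)
    return kl.mean()
```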


B.4 Policy Training

We use Proximal Policy Optimization (PPO) (Schulman et al. 2017) for policy training. Optimization is performed with the Adam optimizer (Kingma and Ba 2014) with β1 = 0.9, β2 = 0.999, ε = 10−8. The initial learning rate is 10−4, which we decay exponentially with a decay rate of 0.95 and a decay step of 50.

The distance measure used in the communication reward rc is cross-entropy, where we treat each dimension of the belief and the ground truth as an independent Bernoulli distribution.
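A sketch of this distance measure is given below; whether the communication reward is the negative of this quantity, and how it is scaled by the weight α, follows the definition of rc in the main text and is assumed here.

```python
import numpy as np

def belief_cross_entropy(belief, true_hand, eps=1e-8):
    """Sum of per-card Bernoulli cross-entropies between the partner's
    belief (values in [0, 1]) and the true 0/1 hand vector."""
    belief = np.clip(belief, eps, 1.0 - eps)
    return -np.sum(true_hand * np.log(belief) + (1.0 - true_hand) * np.log(1.0 - belief))

# One plausible shaping: the easier an agent's actions make it for the
# partner to infer its hand (lower cross-entropy), the higher the reward.
def communication_reward(belief, true_hand):
    return -belief_cross_entropy(belief, true_hand)
```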

We train PBL for 8 iterations; within each iteration we perform 200 policy-gradient updates, sampling 5000 episodes each time. Each mini-batch is a sub-sample of 2048 episodes. The initial communication weight α is 5 and is decayed gradually. All other baselines are trained with the same PPO hyperparameters.

B.5 Learning a Belief Network

When a player models its partner's hand from the observed bidding history, we assume it can ignore the restriction that its partner holds exactly 13 cards. We therefore treat predicting the partner's hand given the observed bidding history as a 52-label classification problem, where each label indicates whether the corresponding card is in the partner's hand. In other words, we treat the presence of each card from a 52-card deck in the partner's hand as an independent Bernoulli variable and train the belief network by maximizing the joint likelihood of these 52 Bernoulli distributions given a bidding history h. This gives the belief network loss:

LΦ = −∑_{i=1}^{52} [ x−i log(bi) + (1 − x−i) log(1 − bi) ],

where x−i and bi are the i-th elements of the binary encoding of the partner's hand and of the agent's belief, respectively. The reasoning behind this assumption is that we consider a more accurate prediction over an invalid distribution preferable to a less accurate prediction over a valid one, since the belief itself is already an approximation.
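The loss above is exactly a binary cross-entropy summed over the 52 card indicators; a minimal PyTorch version, assuming batched tensors, is:

```python
import torch
import torch.nn.functional as F

def belief_network_loss(belief, partner_hand):
    """Negative joint log-likelihood of 52 independent Bernoulli variables.

    belief:       (batch, 52) Sigmoid outputs of the belief network Phi.
    partner_hand: (batch, 52) binary vector, 1 if the card is in the partner's hand.
    """
    # Summing over the 52 cards matches the joint-likelihood formulation;
    # averaging over the batch is an implementation choice.
    per_example = F.binary_cross_entropy(belief, partner_hand, reduction="none").sum(dim=-1)
    return per_example.mean()
```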

For each iteration of belief training, we generate 300,000 episodes with the current policy to train the belief network. Optimization is performed with the Adam optimizer with β1 = 0.9, β2 = 0.999, ε = 10−8 and a decay rate of 0.95. The initial learning rate is 10−3 and the batch size is 1024. We split the data set so that 90% is used for training and 10% for an early-stopping check to prevent overfitting.


C Learned Opponent Model and Bidding Convention

C.1 High Card Points Tables

Bid    ♣     ♦     ♥     ♠     Total
PASS   1.4   1.3   1.4   1.4    5.4
1♠     2.3   2.1   2.3   4.5   11.2
1♥     2.3   2.2   4.7   1.8   11.0
1♦     2.2   4.4   2.1   2.0   10.7
1♣     4.8   1.9   2.1   2.1   10.9
3NT    4.4   4.4   4.9   4.6   18.3
4♦     3.0   6.5   3.1   3.4   16.1
4♣     6.6   5.3   2.3   2.3   16.5

Table 1: Opening bid - own aHCPs.

Bid    ♣     ♦     ♥     ♠     Total
PASS   1.4   1.3   1.3   1.3    5.3
1♠     2.3   2.1   2.2   4.6   11.1
1♥     2.2   2.2   4.7   1.9   11.0
1♦     2.3   4.5   2.0   2.0   10.8
1♣     4.8   2.0   2.2   2.2   11.1
3NT    4.4   4.6   4.8   4.7   18.5
4♦     2.8   6.8   3.3   3.1   16.0
4♣     6.2   4.0   3.0   2.8   16.0

Table 2: Belief HCPs after observing opening bid.

Bid    ♣     ♦     ♥     ♠     Total
PASS   4.4   4.5   4.4   4.5   17.8
1♠     4.4   4.3   4.5   7.1   20.3
1♥     4.6   4.4   7.0   4.9   20.9
1♦     4.3   6.7   4.8   4.9   20.7
1♣     7.2   4.3   4.7   4.9   21.1
2♠     4.8   4.7   5.2   8.5   23.2
2♥     4.9   5.1   8.6   5.1   23.7
3NT    6.6   6.6   7.1   7.2   27.6
4♦     5.2   8.5   5.5   5.6   24.9
4♣     8.4   5.6   5.4   5.4   24.8
6♥     5.7   6.2   9.9   6.9   28.7
6♦     6.4   9.2   7.3   7.1   30.0
6♣     10.8  4.0   10.0  6.5   31.3
6NT    7.5   7.0   8.7   9.0   32.2

Table 3: Responding bid - own + belief aHCPs.

Table 1 shows the average HCPs (aHCPs) present in a hand for each of North's opening bidding decisions. Once an opening bid is observed, South updates their belief; Table 2 shows the effect each opening bid has on South's belief. Table 3 shows South's responding bidding decisions; the aHCP values in Table 3 are the sum of the HCPs in South's hand and South's belief over the HCPs in North's hand. For every row whose bid has a specified trump suit, the maximum aHCP value in that row falls in the corresponding trump suit's column.
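The HCP figures above use the standard high-card point scale (Ace 4, King 3, Queen 2, Jack 1), which matches the hand totals quoted in the examples below; a small helper for tallying per-suit HCP, with an assumed (rank, suit) hand encoding, is:

```python
# Standard high-card point values; all other ranks count zero.
HCP = {"A": 4, "K": 3, "Q": 2, "J": 1}
SUITS = ("♣", "♦", "♥", "♠")

def per_suit_hcp(hand):
    """hand is a list of (rank, suit) pairs, e.g. [("A", "♣"), ("10", "♦")]."""
    points = {suit: 0 for suit in SUITS}
    for rank, suit in hand:
        points[suit] += HCP.get(rank, 0)
    points["Total"] = sum(points[s] for s in SUITS)
    return points

# Example: a hand holding A♣, K♣, Q♠, J♥ and some low cards.
hand = [("A", "♣"), ("K", "♣"), ("Q", "♠"), ("J", "♥"), ("10", "♦"), ("4", "♦")]
print(per_suit_hcp(hand))  # {'♣': 7, '♦': 0, '♥': 1, '♠': 2, 'Total': 10}
```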


C.3 Interesting Bidding Examples

In general, in the initial stages of bidding, the suit of a bid is used to show strength in that suit, no-trump bids are used to show strength across all suits, and PASS bids are used to show a weak hand. In later stages of bidding, players either reiterate their strength with higher trick numbers in their bids or show strength in other suits. In the final rounds of bidding, agents express their overall combined view of the pairing's hands and propose bids that reflect all 26 of the pair's cards. In the following examples, we describe how this bidding process manifests in particular cases. Note that in each case where a bid specifying a trump suit is made, the bidding pair hold the majority of the cards in that suit between their hands; this is often essential for success in making suited contracts.

Example 1
NORTH: 6♠ 10♠ 2♥ 4♥ 6♥ A♥ 2♦ 3♣ 5♣ J♣ Q♣ K♣ A♣
SOUTH: 8♠ K♠ A♠ 8♥ 3♦ 4♦ 7♦ 9♦ J♦ A♦ 2♣ 6♣ 7♣
BIDS: 1♣ 1♦ 2♣ 2♦ 3NT PASS
REWARD: 0.283

North has a very strong hand in clubs, holding all of the picture cards and the ace. North also holds the ace of hearts. South has strength in diamonds and spades. The bidding pattern shows that North initially signals their strength in clubs and South their strength in diamonds. Both parties then reiterate their strengths on their next turns and a contract of 3NT is agreed.

Example 2
NORTH: 8♠ Q♠ K♠ A♠ 8♥ Q♥ 2♦ 3♦ 4♦ 9♦ J♦ Q♦ A♦
SOUTH: 3♠ 4♠ 3♥ 7♥ 9♥ 10♥ J♥ K♥ 5♦ 6♦ K♦ 4♣ A♣
BIDS: 1♦ 1♥ 3NT 4♥ 6♥ PASS
REWARD: 0.301

North opens the bidding with diamonds to show their strength in that suit. South then responds with hearts to show their strongest suit. North then bids no trumps to signal an inferred collective strength in multiple suits. South has little strength outside of hearts and continues to bid hearts, and ultimately a contract of 6♥ is reached, as bid by North, who holds only two hearts. This is a bid to win all but one trick with hearts as trumps, which reflects the fact that between them the pair hold from the 7 to the King of hearts. The pairing bid hearts despite their collective strength also in diamonds, since hearts commands a higher per-trick score if the contract is met.

Example 3
NORTH: 2♠ 4♠ 7♠ J♠ A♠ K♥ K♦ 3♣ 4♣ 5♣ 6♣ 8♣ A♣
SOUTH: 6♠ 9♠ K♠ 7♥ 3♦ 5♦ 6♦ 10♦ Q♦ A♦ 10♣ J♣ K♣
BIDS: 1♣ 1♦ 1♠ 2♣ 3NT 4♠ PASS
REWARD: 0.305

Between them, the pairing have a strong hand in diamonds, clubs and spades, with a lack of hearts (15 + 13 = 28 HCP Total). Bidding opens with 1♣ as North aims to show their strength in clubs. South then conveys their strength in diamonds with a bid of 1♦. Bidding continues as North demonstrates some strength in spades with a bid of 1♠. Later, a bid by North of 3NT indicates that North believes the pair have strength across the board, as reflected by the bids covering all suits except hearts. Bidding ends, however, with a bid of 4♠ from South, which is then accepted through a PASS bid from North. South makes the bid of 4♠ despite holding only 3 spades. The earlier bid of 1♠ from North informs South that North has some strength in spades, and hence South believes that the pair may be able to attain 4♠.


Example 4
NORTH: 2♠ 7♥ 9♥ 10♥ K♥ 2♦ 4♦ 9♦ J♦ A♦ 10♣ J♣ A♣
SOUTH: 9♠ K♠ 2♥ 4♥ J♥ A♥ 3♦ 6♦ 8♦ Q♦ K♦ Q♣ K♣

BIDS: 1♦ 3NT 4♦ 6♦ 6NT PASS
REWARD: 0.651

North has strength in all suits excluding spades (13 HCP Total) and South has strength evenly spread across all suits (18 HCP). Their shared strongest suit is diamonds, where they hold the highest four cards between them. The bid of 6♦ is raised to 6NT as North forms a belief that the pair share strength across multiple suits (31 HCP Total across both hands).

Example 5
NORTH: 3♠ 9♠ J♠ Q♠ 6♥ 10♥ Q♥ 2♦ 3♦ 7♦ A♦ 2♣ 5♣
SOUTH: 7♠ 10♠ 4♥ J♥ A♥ 4♦ 8♦ Q♦ K♦ 4♣ 6♣ K♣ A♣

BIDS: PASS 1♣ 1♦ 2♦ 3NT PASS
REWARD: 0.211

North opens the bidding with PASS, signaling that they have a relatively weak hand (8 HCP Total). South then responds with 1♣, signaling their strength in clubs. Bidding progresses as North bids 1♦, signaling that diamonds is the stronger of their suits. South also has strength in diamonds, and ultimately the pairing is able to reason about their collective hands and bid 3NT; a bid that would not be reached without forming beliefs, given the relative weakness of North's hand.

Example 6
NORTH: 6♠ 7♠ 10♠ J♠ 5♥ 6♥ 7♥ K♥ 8♦ 10♦ 7♣ Q♣ K♣
SOUTH: 9♠ K♠ A♥ 3♦ 4♦ 9♦ Q♦ A♦ 2♣ 3♣ 4♣ 8♣ A♣

BIDS: PASS 1♣ 2♣ 2♦ 3NT PASS
REWARD: 0.258

Between them, North and South have evenly spread strength across the suits (9 + 17 = 26 HCP Total). Similarly to Example 4, we see that the agents are able to reason, using their beliefs, about the collective strength of both of their hands. Here the pair reach a bid of 3NT, whereas in Example 4 the pairing shared 31 HCP and reached a bid of 6NT.

Example 7
NORTH: 8♥ Q♥ K♥ 4♦ 5♦ 6♦ 7♦ 8♦ 9♦ 2♣ 3♣ 7♣ K♣
SOUTH: 2♠ 4♠ 6♠ 10♠ J♠ 3♥ 9♥ A♥ 9♣ 10♣ J♣ Q♣ A♣

BIDS: PASS 1♣ 1 1♠ 2♣ PASS
REWARD: 0.104

Both players lack any cards in one of the four suits. North opens the bidding with PASS to show the relative weakness of their hand (8 HCP Total). Bidding remains low and moves through different suits, reflecting a similar level of strength across the suits, while the low bids show no significant overall strength. Ultimately a bid of 2♣ is reached, reflecting strength in ♣, as between them the pair hold from the 9 to the Ace of clubs.