Effective Master-Slave Communication On A Multi-Agent Deep Reinforcement Learning System

Xiangyu Kong1, Bo Xin2∗, Fangchen Liu1∗, Yizhou Wang1

1Nat’l Eng. Lab. for Video Technology, Cooperative Medianet Innovation Center, Key Laboratory of Machine Perception (MoE), Sch’l of EECS, Peking University, Beijing, 100871, China
[email protected], {liufangchen, yizhou.wang}@pku.edu.cn

2Microsoft Research, Beijing, China [email protected]

Abstract

Many challenging practical problems require multiple agents to solve collaboratively. However, communication becomes a bottleneck when a multi-agent system (MAS) scales. This is particularly true when a MAS is deployed for autonomous learning (e.g. reinforcement learning), where massive interactive communication is required. We argue that the effectiveness of communication is a key factor in determining the intelligence level of a multi-agent learning system. In this regard, we propose to adapt the classical hierarchical master-slave architecture to facilitate efficient multi-agent communication during the interactive reinforcement learning (RL) process implemented on a deep neural network. The master agent aggregates messages uploaded from the slaves and generates a unique message for each slave according to the aggregated information and the state of that slave. Each slave incorporates both the instructive message from the master and its own to take actions that fulfill the goal. In this way, the joint action-state space of the agents grows only linearly with the number of agents, instead of geometrically as in the peer-to-peer architecture. In experiments, we show that with effective communication, the proposed multi-agent learning system consistently outperforms the latest competing methods both in synthetic experiments and in the challenging StarCraft1 micromanagement tasks.

1 Introduction
Recent years have witnessed successful applications of RL technologies to many challenging problems, ranging from game playing [13; 19] to robotics [8] and other important artificial intelligence (AI) related fields such as [18]. Most of these works study the problem of a single agent. However, many challenging practical tasks require the collaboration of multiple agents, for example, the coordination of autonomous vehicles [1], multi-robot control [10], network packet delivery [27] and multi-player games [22]. Motivated by the success of (single-agent) deep RL, where value/policy approximators are implemented via deep neural networks, recent research efforts on multi-agent RL also embrace deep networks and target more complicated environments and complex tasks, e.g. [20; 17; 2; 9]. However, the essential state-action space of multiple agents grows geometrically with the number of agents. It therefore remains an open challenge how deep reinforcement learning can be effectively scaled to more agents in various situations.

We argue that the key solution to this problem is an effective communication architecture that can bridge all agents. Inspired by the canonical master-slave architecture that has been proved

∗Equal contribution.
1StarCraft and its expansion StarCraft: Brood War are trademarks of Blizzard Entertainment™.

Hierarchical Reinforcement Learning Workshop at the 31st Conference on Neural Information Processing Systems (HRL@NIPS 2017), Long Beach, CA, USA.

Figure 1: A) master module, B) slave module, C) gated composition module, D) model architecture.

to be effective in a wide range of multi-agent tasks [14; 4; 16; 15; 25; 12; 11], we propose a hierarchical deep neural network to facilitate efficient multi-agent communication in an interactive RL environment. Although our designs differ from the existing works, we have inherited the spirit of leveraging an agent hierarchy in a master-slave manner. That is, the master agent tends to plan in a global manner without focusing on potentially distracting details from each slave agent, and meanwhile the slave agents locally optimize their actions with respect to both their local state and the guidance coming from the master agent. One can think of this relationship as that between the coach and the players in a football/basketball team. Although the idea is clear and intuitive, we note that our work is among the first to explicitly design a master-slave architecture for deep MARL.

Note that our proposal is closely related to hierarchical deep RL, e.g. [7; 26; 23]. However, such hierarchical deep RL methods study the hierarchy regarding tasks or goals and usually target sequential sub-tasks, where the meta-controller constantly generates goals for controllers to achieve. By contrast, in our case the meta-controller dispatches instructions to multiple low-level agents in a parallel manner to help them work collaboratively towards a common goal. In this sense, our design can also be considered an extension of hierarchical deep RL to the multi-agent scenario.

We instantiate our idea with a multi-agent policy network constructed with the master-slave agent hierarchy (shown in Figure 1). For both each slave agent and the master agent, the policy approximators are realized using recurrent neural networks (RNN). The communication among the master agent and slave agents is implemented as inter-agent message transmission. While each slave agent takes its local states as input, the master agent takes both the global states and the messages $m_t$ from all slave agents as its input. The final action output of each slave agent is composed of contributions from both the corresponding slave agent and the master agent. In this way, the master agent can generate a different message $a_t^{m \to i}$ for each slave agent $i$. This is implemented via a gated composition module, as shown in Figure 1. We test our proposal using both synthetic experiments and challenging StarCraft micromanagement tasks. Our method consistently outperforms recent competing MARL methods by a clear margin. We also provide analysis to showcase the effectiveness of the learned policies, many of which illustrate interesting phenomena related to our specific designs.

2 Master-Slave Multi-Agent RL
Hereafter we focus on introducing our policy-based implementation. However, our idea can also be realized using value-based or actor-critic methods. In particular, our target is to learn a mapping from states to actions $\pi_\theta(a_t|s_t)$ at any given time $t$, where $s = \{s^m, s^1, ..., s^C\}$ and $a = \{a^1, ..., a^C\}$ are the collective states and actions of $C$ agents respectively, and $\theta = \{\theta^m, \theta^1, ..., \theta^C\}$ represents the parameters of the policy function approximators of all agents, including the master agent $\theta^m$. Note that we have explicitly formulated $s^m$ to represent the independent state of the master agent but have left out a corresponding $a^m$, since the master's actions will be merged with and represented by the final actions of all slave agents. This design has two benefits: 1) one can now input potentially more


global states to the master agent; and meanwhile 2) the whole network can be trained end-to-end with signals directly coming from actual actions.

In Figure 1, we illustrate the whole pipeline of our master-slave multi-agent architecture. Specifically, we demonstrate the network structure unfolded at a single time step. For example, at time step $t$, the state $s$ consists of each $s^i$ of the $i$th slave agent and $s^m = o_t$ of the master agent. Each slave agent is represented as a blue circle and the master agent is represented as a yellow rectangle. All agents are policy networks realized with RNN modules such as LSTM [5] cells or a stack of RNN/LSTM cells. Therefore, besides the states, all agents also take the hidden state of the RNN $h_{t-1}$ as input, representing their reasoning along time. Meanwhile, the master agent also takes as input some information $c^i$ from each slave agent. These communications are represented via colored connections in the figure.
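To make the data flow of this unrolled step concrete, the following is a minimal PyTorch-style sketch of one time step. The module names, the tensor sizes, and the use of mean pooling for the message $m_t$ are our own illustrative assumptions rather than the authors' released implementation; the per-slave proposals produced here would then be combined with the master's contribution as described below.

```python
import torch
import torch.nn as nn

class MasterSlaveStep(nn.Module):
    """One unrolled time step of a master-slave policy network (illustrative sketch)."""

    def __init__(self, obs_dim, state_dim, hidden_dim, action_dim):
        super().__init__()
        # Each slave agent runs an LSTM cell over its local state s^i_t.
        self.slave_cell = nn.LSTMCell(state_dim, hidden_dim)
        # The master runs an LSTM cell over the global observation o_t plus the
        # pooled message m_t (here assumed: mean of the previous slave hidden states).
        self.master_cell = nn.LSTMCell(obs_dim + hidden_dim, hidden_dim)
        # Independent action proposal a^i_t of each slave from its own hidden state.
        self.slave_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, global_obs, slave_states, slave_hc, master_hc):
        # global_obs: (1, obs_dim); slave_states: (n_slaves, state_dim)
        # slave_hc = (h^i_{t-1}, c^i_{t-1}); master_hc = (h^m_{t-1}, c^m_{t-1})
        slave_h_prev, _ = slave_hc
        message = slave_h_prev.mean(dim=0, keepdim=True)          # m_t from slave "thoughts"
        master_h, master_c = self.master_cell(
            torch.cat([global_obs, message], dim=-1), master_hc)
        slave_h, slave_c = self.slave_cell(slave_states, slave_hc)
        slave_logits = self.slave_head(slave_h)                   # per-slave proposals a^i_t
        return slave_logits, (slave_h, slave_c), (master_h, master_c)
```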

To merge the actions from the master agent and those from the slave agents, we propose a gated composition module (GCM), whose behavior resembles that of an LSTM. Figure 1 illustrates more details. Specifically, this module takes the "thoughts" or hidden states of the master agent $h_t^m$ and the slave agents $h_t^i$ as input and outputs action proposals $a_t^{m \to i}$, which are later added to the independent action proposals $a_t^i$ from the corresponding slave agents. Since such a module depends on both the "thoughts" from the master agent and those from a specific slave agent, it allows the master to provide different action proposals to individual slave agents. We denote this solution as "MS-MARL+GCM". (Note that in [6] a similar but simpler gating mechanism was proposed only for two-agent communication.) In certain cases, one may also want the master to provide unified action proposals to all agents. This can easily be implemented as a special case in which the gate related to the slave's "thoughts" shuts down, denoted as regular MS-MARL.
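Since this workshop paper does not spell out the GCM's exact parameterization beyond it resembling an LSTM cell (Figure 1 sketches sigmoid gates and tanh nonlinearities), the following is only a rough sketch, with an assumed gate layout, of how such a gated composition of the two hidden states into $a_t^{m \to i}$ could look:

```python
import torch
import torch.nn as nn

class GatedCompositionModule(nn.Module):
    """LSTM-like gating over (h^m_t, h^i_t) -> per-slave action proposal a^{m->i}_t.
    The gate layout below is an assumption for illustration, not taken from the paper."""

    def __init__(self, hidden_dim, action_dim):
        super().__init__()
        self.gate_master = nn.Linear(2 * hidden_dim, hidden_dim)   # sigmoid gate
        self.gate_slave = nn.Linear(2 * hidden_dim, hidden_dim)    # sigmoid gate
        self.content = nn.Linear(2 * hidden_dim, hidden_dim)       # tanh candidate content
        self.out = nn.Linear(hidden_dim, action_dim)

    def forward(self, h_master, h_slave):
        # h_master: (1, hidden_dim); h_slave: (n_slaves, hidden_dim)
        x = torch.cat([h_master.expand_as(h_slave), h_slave], dim=-1)
        g_m = torch.sigmoid(self.gate_master(x))
        g_s = torch.sigmoid(self.gate_slave(x))
        mixed = g_m * torch.tanh(self.content(x)) + g_s * torch.tanh(h_slave)
        # Regular MS-MARL corresponds to the slave-side gate g_s shutting down, so the
        # master's proposal no longer depends on the individual slave's "thoughts".
        return self.out(torch.tanh(mixed))                          # a^{m->i}_t

# The final proposal for slave i adds its own head output and the master's message:
#   logits_i = slave_head(h_slave)[i] + gcm(h_master, h_slave)[i]
```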

As mentioned above, due to our design, learning can be performed in an end-to-end manner by directly applying the policy gradient from a centralized perspective. Specifically, one would update all parameters following the policy gradient theorem [21] as

$$\theta \leftarrow \theta + \lambda \sum_{t=1}^{T-1} \nabla_\theta \log \pi_\theta(s_t, a_t)\, v_t \qquad (1)$$

where data samples are collected stochastically from each episode $\{s_1, a_1, r_2, ..., s_{T-1}, a_{T-1}, r_T\} \sim \pi_\theta$ and $v_t = \sum_{j=1}^{t} r_j$. Note that for discrete action spaces we apply a softmax policy on the top layer, and for continuous action spaces we adopt a Gaussian policy on the top layer.
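A minimal sketch of this update for the discrete-action case might look as follows, applying a REINFORCE-style step with the cumulative reward $v_t$ as written in Eq. (1). The episode container of (logits, action, reward) tuples and the optimizer are placeholders for whatever the full system uses; for continuous actions the log-softmax term would be replaced by a Gaussian log-density.

```python
import torch

def policy_gradient_update(optimizer, episode):
    """One REINFORCE-style update from a single episode (illustrative sketch).

    episode: list of (logits, action, reward) tuples recorded while acting with the
             current policy; `logits` are the pre-softmax outputs for the per-slave
             discrete actions at that step and still carry gradients.
    """
    # v_t accumulated from the per-step rewards, as in Eq. (1).
    returns, running = [], 0.0
    for _, _, reward in episode:
        running += reward
        returns.append(running)

    loss = torch.zeros(())
    for (logits, action, _), v_t in zip(episode, returns):
        # log pi_theta(s_t, a_t): sum of the chosen per-slave log-probabilities
        # (assuming the joint policy factorizes over slaves).
        log_prob = torch.log_softmax(logits, dim=-1).gather(-1, action.unsqueeze(-1)).sum()
        loss = loss - log_prob * v_t    # negated so a gradient step ascends Eq. (1)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```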

3 Experiments
To justify the effectiveness of the proposed master-slave architecture, we conducted experiments on five representative multi-agent tasks and environments in which multiple agents interact with each other to achieve certain goals. These tasks and environments have been widely used for the evaluation of popular MARL methods such as [20; 9; 17; 24; 2; 3]. The first two tasks are the traffic junction task and the combat task originally proposed in [20]. These two tasks are discrete in state and action spaces. The others are StarCraft micromanagement tasks originating from the well-known real-time strategy (RTS) game StarCraft, originally defined in [22; 24]. Instead of playing a complete StarCraft game, the micromanagement tasks involve a local battle between two groups of units. Similar to the combat task, the possible actions include move and attack. One group wins the game when all the units of the other group are eliminated. However, one big difference from the combat task is that the StarCraft environment is continuous in state and action space, which means a much larger search space for learning. The selected StarCraft micromanagement tasks for our experiments are {15 Marines vs. 16 Marines, 10 Marines vs. 13 Zerglings, 15 Wraiths vs. 17 Wraiths}. We will refer to these three tasks as {15M vs. 16M, 10M vs. 13Z, 15W vs. 17W} for brevity in the rest of this paper. Note that all three tasks are categorized as "hard combats" in [17].

Tasks             GMEZO   CommNet   BiCNet   MS-MARL   MS-MARL + GCM
Traffic Junction    -       0.94      -        0.96          -
Combat              -       0.31      -        0.59          -
15M vs. 16M        0.63     0.68     0.71      0.77         0.82
10M vs. 13Z        0.57     0.44     0.64      0.75         0.76
15W vs. 17W        0.42     0.47     0.53      0.61         0.60

Table 1: Mean winning rates of different methods on the two discrete tasks and the continuous StarCraft micromanagement tasks


Figure 2: Comparing winning rates of different methods (Ours vs. CommNet) on all three tasks: (a) the traffic junction task, (b) the combat task, (c) 15M vs. 16M

Figure 3: Demonstration of the policy learned by our MS-MARL method: (a) how the master helps the slaves in the combat task, (b) a failure case of CommNet, (c) the successful "Pincer Movement" policy learned by our MS-MARL model

Table 1 demonstrates the performance improvement of our method when compared with the baselines. For CommNet, we directly run the released code on the traffic junction task and the combat task using the hyper-parameters provided in [20]. We compute the mean winning rates in Table 1 by testing the trained models for 100 rounds. However, since the code of GMEZO and BiCNet has not been released, there is no report of their performance on the first two tasks. It can be seen that our MS-MARL model performs better than CommNet on both tasks. On the StarCraft micromanagement tasks, the mean winning rates of GMEZO, CommNet and BiCNet are all available from [17]. As for GMEZO, we directly follow the results in [17] since we utilize a similar state and action definition to theirs. The results are displayed in Table 1. On all three selected tasks, our MS-MARL method achieves consistently better performance than the baselines. Figure 2 shows the training process of CommNet and our MS-MARL method by plotting winning rate curves for the first two tasks as well as "15M vs. 16M". From this plot, our MS-MARL model clearly enjoys better, faster and more stable convergence, which highlights the importance of our master-slave design.

Another interesting observation is how the master agent and each of the slave agents contribute to the final action choices (as shown in Figure 3 (a)). In the combat task, we observe that the master agent does often learn an effective global policy. The action components extracted from the master agent lead the whole team of agents to move towards the enemies' region. Meanwhile, all the slave agents adjust their positions locally to gather together. In the task of "15M vs. 16M", our MS-MARL model learns a particular policy of spreading the agents into a half-moon shape ("spread out") to focus fire and attack the frontier enemies before the others enter firing range, as illustrated in Figure 3 (c). This group behavior is similar to the famous military maneuver "Pincer Movement", which has been widely exploited in successful battles throughout history. Although CommNet sometimes follows this kind of policy, it often fails to spread into a larger "pincer" that covers the enemies and therefore loses the battle. Figure 3 (b) shows one such example. The size of the "pincer" seems especially important for winning the task of "15M vs. 16M", where we have fewer units than the enemy.

4 Conclusion
In this paper, we revisit the master-slave architecture for effective multi-agent communication in deep MARL. As designed, the master agent aggregates messages uploaded from the slaves and generates a unique message for each slave; each slave incorporates both the instructive message from the master and its own to take actions that fulfill the goal. With the proposed instantiation using policy gradient networks, we empirically demonstrate the superiority of our proposal over existing methods on several challenging multi-agent tasks.


Acknowledgments

We would like to express our thanks for support from the following research grants: NSFC-61625201 and the Qualcomm Greater China University Research Grant.

References

[1] Yongcan Cao, Wenwu Yu, Wei Ren, and Guanrong Chen. An overview of recent progress in the study of distributed multi-agent coordination. IEEE Transactions on Industrial Informatics, 9(1):427–438, 2013.

[2] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926, 2017.

[3] Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Philip Torr, Pushmeet Kohli, Shimon Whiteson, et al. Stabilising experience replay for deep multi-agent reinforcement learning. arXiv preprint arXiv:1702.08887, 2017.

[4] Satoru Fujita and Victor R Lesser. Centralized task distribution in the presence of uncertainty and time deadlines. In Proceedings of the Second International Conference on Multi-Agent Systems, pages 87–94, 1996.

[5] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[6] Xiangyu Kong, Bo Xin, Yizhou Wang, and Gang Hua. Collaborative deep reinforcement learning for joint object search. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[7] Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 3675–3683, 2016.

[8] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.

[9] Hangyu Mao, Yan Ni, Zhibo Gong, Weichen Ke, Chao Ma, Yang Xiao, Yuan Wang, Jiakang Wang, Quanbin Wang, Xiangyu Liu, et al. ACCNet: Actor-coordinator-critic net for "learning-to-communicate" with deep multi-agent reinforcement learning. arXiv preprint arXiv:1706.03235, 2017.

[10] Laëtitia Matignon, Laurent Jeanpierre, Abdel-Illah Mouaddib, et al. Coordinated multi-robot exploration under communication constraints using decentralized Markov decision processes. In AAAI, 2012.

[11] DB Megherbi and Minsuk Kim. A hybrid P2P and master-slave cooperative distributed multi-agent reinforcement learning technique with asynchronously triggered exploratory trials and clutter-index-based selected sub-goals. In Computational Intelligence and Virtual Environments for Measurement Systems and Applications (CIVEMSA), 2016 IEEE International Conference on, pages 1–6. IEEE, 2016.

[12] Dalila B Megherbi and Manuel Madera. A hybrid P2P and master-slave architecture for intelligent multi-agent reinforcement learning in a distributed computing environment: A case study. In Computational Intelligence for Measurement Systems and Applications (CIMSA), 2010 IEEE International Conference on, pages 107–112. IEEE, 2010.

[13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[14] Fabrice R Noreils. Toward a robot architecture integrating cooperation between mobile robots: Application to indoor environment. The International Journal of Robotics Research, 12(1):79–98, 1993.

[15] Johan Parent, Katja Verbeeck, Jan Lemeire, Ann Nowe, Kris Steenhaut, and Erik Dirkx. Adaptive load balancing of parallel applications with multi-agent reinforcement learning on heterogeneous systems. Scientific Programming, 12(2):71–79, 2004.

[16] Kui-Hong Park, Yong-Jae Kim, and Jong-Hwan Kim. Modular Q-learning based multi-agent cooperation for robot soccer. Robotics and Autonomous Systems, 35(2):109–122, 2001.

[17] Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, and Jun Wang. Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069, 2017.

[18] Zhou Ren, Xiaoyu Wang, Ning Zhang, Xutao Lv, and Li-Jia Li. Deep reinforcement learning-based image captioning with embedding reward. arXiv preprint arXiv:1704.03899, 2017.

[19] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[20] Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pages 2244–2252, 2016.

[21] Richard S Sutton and Andrew G Barto. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.

[22] Gabriel Synnaeve, Nantas Nardelli, Alex Auvolat, Soumith Chintala, Timothée Lacroix, Zeming Lin, Florian Richoux, and Nicolas Usunier. TorchCraft: A library for machine learning research on real-time strategy games. arXiv preprint arXiv:1611.00625, 2016.

[23] Chen Tessler, Shahar Givony, Tom Zahavy, Daniel Mankowitz, and Shie Mannor. A deep hierarchical approach to lifelong learning in Minecraft, 2017.

[24] Nicolas Usunier, Gabriel Synnaeve, Zeming Lin, and Soumith Chintala. Episodic exploration for deep deterministic policies: An application to StarCraft micromanagement tasks. arXiv preprint arXiv:1609.02993, 2016.

[25] Katja Verbeeck, Ann Nowé, and Karl Tuyls. Coordinated exploration in multi-agent reinforcement learning: An application to load-balancing. In Proceedings of the Fourth International Joint Conference on Autonomous Agents and Multiagent Systems, pages 1105–1106. ACM, 2005.

[26] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3540–3549, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[27] Dayong Ye, Minjie Zhang, and Yun Yang. A multi-agent framework for packet routing in wireless sensor networks. Sensors, 15(5):10026–10047, 2015.
