

Qatten: A General Framework for Cooperative Multiagent Reinforcement Learning

Yaodong Yang 1 Jianye Hao 1 Ben Liao 2 Kun Shao 3 Guangyong Chen 2 Wulong Liu 3 Hongyao Tang 1

Abstract

In many real-world settings, a team of cooperative agents must learn to coordinate their behavior under private observations and communication constraints. Deep multiagent reinforcement learning (Deep-MARL) algorithms have shown superior performance in these realistic and difficult problems, but still suffer from challenges. One branch is multiagent value decomposition, which decomposes the global shared multiagent Q-value $Q_{tot}$ into individual Q-values $Q^i$ to guide individual behaviors. However, previous work achieves the value decomposition heuristically without valid theoretical grounding: VDN assumes an additive form, and QMIX adopts an implicit, inexplicable mixing method. In this paper, for the first time, we theoretically derive a linear decomposition from $Q_{tot}$ into each $Q^i$. Based on this theoretical finding, we introduce the multi-head attention mechanism to approximate each term in the decomposition formula, with theoretical explanations. Experiments show that our method outperforms state-of-the-art MARL methods on the widely adopted StarCraft benchmarks across different scenarios, and an attention analysis is also investigated with insights.

1. Introduction

The cooperative multiagent reinforcement learning problem has been studied extensively in the last decade, where a system of agents learns coordinated policies to optimize the accumulated global rewards (Busoniu et al., 2008; Gupta et al., 2017; Palmer et al., 2018).

1 Tianjin University, China  2 Tencent, China  3 Huawei Noah's Ark Lab, China. Correspondence to: Jianye Hao <[email protected]>.

Preprint.

Complex tasks such as the coordination of autonomous vehicles (Cao et al., 2012), optimizing the productivity of a factory in distributed logistics (Ying & Sang, 2005), and energy distribution, often modeled as cooperative multi-agent learning problems, have great application prospects and enormous commercial value.

One natural way to address the cooperative MARL problem is the centralized approach, which views the multiagent system (MAS) as a whole and solves it as a single-agent learning task. In such settings, existing reinforcement learning (RL) techniques can be leveraged to learn joint optimal policies based on the agents' joint observations and common rewards (Tan, 1993). However, the centralized approach usually does not scale well, since the joint action space grows exponentially with the number of agents. Furthermore, partial observability and communication constraints also necessitate learning decentralised policies, which condition only on the local action-observation history of each agent (Foerster et al., 2018).

Another choice is the decentralized approach, in which each agent learns its own policy. Letting individual agents learn concurrently based on the global reward (aka independent learners) (Tan, 1993) is the simplest option. However, it has been shown to struggle even in simple two-agent, single-state stochastic coordination problems. One main reason is that the global reward signal brings non-stationarity: agents cannot distinguish between the stochasticity of the environment and the exploitative behaviors of other co-learners (Lowe et al., 2017), and thus may mistakenly update their policies. To mitigate this issue, decentralized policies can be learned in the centralized training with decentralized execution (CTDE) paradigm. While training, a centralized critic can access the joint action and additional state information to give update signals to the decentralized agent policies, which receive only their local observations at deployment. For CTDE, one major challenge is how to represent and use the centralized multiagent action-value function $Q_{tot}$. Such a complex function is difficult to learn when there are many agents and, even if it can be learned, offers no obvious way to induce decentralized policies that allow each agent to act based on an individual observation.

We can learn a fully centralized state-action value function $Q_{tot}$ and then use it to guide the optimization of decentralised policies in an actor-critic framework, an approach taken by counterfactual multi-agent (COMA) policy gradients (Foerster et al., 2018), as well as the work of Gupta et al. (2017).


However, the fully centralized critic of COMA suffers difficulty in evaluating global Q-values over the joint state-action space, especially when there are more than a handful of agents, and it is hard to provide an appropriate multiagent baseline (Schroeder de Witt et al., 2019).

Different from previous methods, the Value Decomposition Network (VDN) (Sunehag et al., 2018) was proposed to learn a centralized but factored $Q_{tot}$, in between these two extremes. By representing $Q_{tot}$ as a sum of individual value functions $Q^i$ that condition only on individual observations and actions, a decentralized policy arises simply from each agent selecting actions greedily with respect to its $Q^i$. However, VDN assumes a strict additive relation between $Q_{tot}$ and the $Q^i$, and ignores any extra state information available during training. Later, QMIX was proposed to overcome the limitations of VDN (Rashid et al., 2018). QMIX employs a network that estimates joint action-values as a black-box non-linear combination of per-agent values that condition only on local observations. Besides, QMIX enforces that $Q_{tot}$ is monotonic in the $Q^i$, which allows computationally tractable maximisation of the joint action-value in off-policy learning. But QMIX performs an implicit mixing of the $Q^i$ and treats the mixing process as a black box. Both VDN and QMIX introduce representation limitations through their assumptions of an additive or monotonic relationship between $Q_{tot}$ and $Q^i$. Recently, QTRAN was proposed to guarantee optimal decentralization by using linear constraints while avoiding the representation limitations introduced by VDN and QMIX. However, the constraints QTRAN imposes on the optimization problem are computationally intractable, and the authors have to relax them with two penalties, which deviates QTRAN from exact solutions (Mahajan et al., 2019).

In this paper, for the first time, we theoretically derive a generalized formalization of $Q_{tot}$ and $Q^i$ for any number of cooperative agents (Theorems 2 and 3) without introducing additional assumptions or constraints. Following this theoretical formalization, we also propose a practical multi-head attention based Q-value mixing network (Qatten) to approximate the global Q-value and its decomposition into individual Q-values. Qatten takes advantage of the key-value memory operation to measure the importance of each agent for the global system, and of the multi-head structure to capture the different high-order partial derivatives of $Q_{tot}$ with respect to the $Q^i$ when decomposing. Besides, Qatten can be enhanced by the weighted head Q-value and non-linearity mechanisms for greater approximation ability. Experiments on the challenging MARL benchmark show that our method obtains the best performance. The attention analysis shows that Qatten properly captures the importance of each agent when approximating $Q_{tot}$ and, to some degree, reveals the internal workflow of approximating $Q_{tot}$ from the $Q^i$.

The remainder of this paper is organized as follows. We first introduce Markov games, Deep-MARL algorithms and the attention mechanism in Section 2. Then, in Section 3, we derive the explicit formula for the local behavior of cooperative MARL and the mathematical relation between $Q_{tot}$ and $Q^i$. Next, in Section 4, we present the framework of Qatten in detail. Furthermore, we validate Qatten on the challenging StarCraft II platform and give a specific analysis, including of the multi-head attention, in Section 5. Finally, conclusions and future work are provided in Section 6.

2. Background

2.1. Markov Games

Markov games are a multi-agent extension of Markov Decision Processes (Littman, 1994). They are defined by a set of states $S$; action sets for each of $N$ agents, $A^1, \ldots, A^N$; a state transition function $T: S \times A^1 \times \ldots \times A^N \to P(S)$, which defines the probability distribution over possible next states given the current state and the actions of each agent; and a reward function for each agent that also depends on the global state and the actions of all agents, $R^i: S \times A^1 \times \ldots \times A^N \to \mathbb{R}$. We specifically consider partially observed Markov games, in which each agent $i$ receives a local observation $o^i: Z(S, i) \to O^i$. Each agent learns a policy $\pi^i: O^i \to P(A^i)$ which maps its observation to a distribution over its action set, and each agent learns a policy that maximizes its expected discounted return

$$J^i(\pi^i) = \mathbb{E}_{a^1 \sim \pi^1, \ldots, a^N \sim \pi^N, s \sim T}\Big[\sum_{t=0}^{\infty} \gamma^t r^i_t(s_t, a^1_t, \ldots, a^N_t)\Big],$$

where $\gamma \in [0, 1]$ is the discount factor. If all agents receive the same reward ($R^1 = \ldots = R^N = R$), the Markov game becomes fully cooperative: a best-interest action of one agent is also a best-interest action of the others (Matignon et al., 2012). We consider fully cooperative, partially observed Markov games in this paper. For such games, the empirical centralized training, decentralized execution paradigm (Lowe et al., 2017) is employed to train well-coordinated decentralized policies by incorporating global information during training in the simulation.

2.2. Centralized Training with Decentralized Execution

Arguably the most naive training method for MARL tasks is to learn the individual agents' action-value functions independently, i.e., independent Q-learning. This method is simple and scalable, but it has been shown to struggle even in simple two-agent, single-state stochastic coordination problems (Tan, 1993). Recently, for Markov games, especially with only partial observability and restricted inter-agent communication, the empirical Centralized Training with Decentralized Execution (CTDE) paradigm (Lowe et al., 2017) has been employed to train well-coordinated decentralized policies by incorporating global information while training in the simulation. When executing, agents make decisions based on local observations according to the learned policies.


CTDE allows agents to learn and construct individual action-value functions, such that optimization at the individual level leads to optimization of the joint action-value function. This, in turn, enables agents at execution time to select an optimal action simply by looking up the individual action-value functions, without having to refer to the joint one. Recent works, including VDN and QMIX, employ CTDE to train scalable multiple agents with different ways of approximating the multiagent value function.

An important concept for such methods is decentralisability, also called Individual-Global-Max (IGM) (Son et al., 2019), which asserts that $\exists Q^i$ such that $\forall s, \vec{a}$:

$$\arg\max_{\vec{a}} Q^*(s, \vec{a}) = \Big(\arg\max_{a^1} Q^1(\tau^1, a^1), \ldots, \arg\max_{a^N} Q^N(\tau^N, a^N)\Big), \quad (1)$$

where $\tau^i$ is the history record of agent $i$'s observations $o^i$. While the containment is strict in the partially observable setting, it can be shown that all tasks are decentralisable given full observability and sufficient representational capacity.

2.3. MARL Algorithms

2.3.1. VDN

Instead of letting each agent learn an individual action-value function $Q^i$ independently as in IQL (Tan, 1993), VDN learns a centralized but factored $Q_{tot}$, where $Q_{tot}(s, \vec{a}) = \sum_i Q^i(s, a^i)$. By representing $Q_{tot}$ as a sum of individual value functions $Q^i$ that condition only on individual observations $o^i$ and actions, a decentralised policy arises simply from each agent selecting actions greedily with respect to its $Q^i$. However, VDN assumes that additivity holds when each $Q^i$ is evaluated based on $o^i$, which is only an approximation and introduces inaccuracy. VDN also severely limits the complexity of the centralized action-value function and ignores any extra state information available during training.
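As a point of reference for the factorizations discussed here, a minimal VDN-style mixer is sketched below; it simply sums the chosen per-agent Q-values along the agent dimension. Tensor shapes are assumptions for illustration.

```python
import torch

def vdn_mix(chosen_agent_qs: torch.Tensor) -> torch.Tensor:
    """VDN factorization: Q_tot(s, a) = sum_i Q^i(o^i, a^i).

    chosen_agent_qs: (batch, n_agents) Q-values of the actions actually taken.
    returns:         (batch, 1) the factored joint value Q_tot.
    """
    return chosen_agent_qs.sum(dim=1, keepdim=True)

# Example: batch of 2 transitions, 3 agents.
q_tot = vdn_mix(torch.tensor([[1.0, 2.0, 0.5],
                              [0.3, 0.3, 0.3]]))
print(q_tot)  # tensor([[3.5000], [0.9000]])
```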

2.3.2. QMIX

QMIX learns a monotonic multiagent Q-value approximation $Q_{tot}$ (Rashid et al., 2018). QMIX factors the joint action-value $Q_{tot}$ into a monotonic non-linear combination of each agent's individual Q-value $Q^i$, learned via a mixing network. The mixing network, whose non-negative weights are produced by a hypernetwork, is responsible for combining the agents' utilities for the chosen actions into $Q_{tot}(s, \vec{a})$. This non-negativity ensures that $\partial Q_{tot} / \partial Q^i \geq 0$, which in turn guarantees the IGM property (Equation 1). The decomposition allows for efficient, tractable maximisation, as it can be performed linearly from the decentralized policies, as well as easy decentralized deployment. During learning, the QMIX agents use $\epsilon$-greedy exploration over their individual utilities to ensure sufficient exploration.
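A compressed sketch of the monotonic mixing idea just described: a hypernetwork maps the global state to mixing weights, and an absolute value keeps them non-negative so that $\partial Q_{tot} / \partial Q^i \geq 0$. The single mixing layer and the layer sizes are simplifications for illustration, not QMIX's exact architecture.

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """Single-layer sketch of a QMIX-style monotonic mixer."""
    def __init__(self, n_agents: int, state_dim: int):
        super().__init__()
        # Hypernetworks: global state -> mixing weights / bias.
        self.hyper_w = nn.Linear(state_dim, n_agents)   # one weight per Q^i
        self.hyper_b = nn.Linear(state_dim, 1)          # state-dependent bias

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        w = torch.abs(self.hyper_w(state))              # non-negative => monotonic in each Q^i
        b = self.hyper_b(state)
        return (w * agent_qs).sum(dim=1, keepdim=True) + b

mixer = MonotonicMixer(n_agents=3, state_dim=8)
q_tot = mixer(torch.rand(4, 3), torch.rand(4, 8))
print(q_tot.shape)  # torch.Size([4, 1])
```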

Figure 1. Attention Mechanism (scaled dot-product attention: MatMul, Scale and SoftMax over the query $V_Q$, keys $V_K$ and values $V$).

2.3.3. QTRAN

QTRAN further studies IGM. Theorem 1 in the QTRAN paper guarantees optimal decentralization by using linear constraints between agent utilities and joint action-values, and avoids the representation limitations introduced by VDN and QMIX. However, the constraints on the optimization problem involved are computationally intractable to solve in discrete state-action spaces and impossible in continuous state-action spaces. The authors propose two algorithms (QTRAN-base and QTRAN-alt) which relax these constraints using two L2 penalties, but these relaxations deviate QTRAN from the exact solution. In practice, QTRAN performs poorly on complex MARL domains (Samvelyan et al., 2019; Mahajan et al., 2019).

2.4. Attention Mechanism

In recent years, the attention mechanism (Vaswani et al., 2017) has been widely used in various research fields. The attention mechanism in deep learning originates from human visual attention and tries to imitate the human focusing process. The core goal of both is to select, from the abundant information available, the feature areas most relevant to the current task. Given the remarkable performance of attention mechanisms in various domains, more and more work relies on the idea of attention to deal with the challenges of MAS. Figure 1 shows the particular attention mechanism proposed in (Vaswani et al., 2017). An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query $V_Q$, keys $V_K^j$, values and output are all vectors. The output is computed as a weighted sum of the values, where the weight $w^j$ assigned to each value is computed by a compatibility function of the query with the corresponding key:

$$w^j = \frac{\exp\big(f(V_Q, V_K^j)\big)}{\sum_k \exp\big(f(V_Q, V_K^k)\big)}, \quad (2)$$


where $f(V_Q, V_K^j)$ is a user-defined function measuring the importance of the corresponding value; the scaled dot-product is a common choice. In practice, a multi-head structure is usually employed to allow the model to focus on information from different representation sub-spaces.
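A short sketch of the attention-weight computation in Eq. (2), using the common scaled dot-product compatibility function; the query/key dimensions below are arbitrary illustrative choices.

```python
import math
import torch

def attention_weights(query: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
    """Eq. (2) with scaled dot-product compatibility f(V_Q, V_K^j) = V_Q . V_K^j / sqrt(d).

    query: (d,)    a single query vector V_Q
    keys:  (n, d)  one key vector V_K^j per entry
    returns: (n,)  softmax-normalized weights w^j (non-negative, summing to 1)
    """
    d = query.shape[-1]
    scores = keys @ query / math.sqrt(d)
    return torch.softmax(scores, dim=-1)

w = attention_weights(torch.randn(16), torch.randn(5, 16))
print(w.sum())  # tensor(1.) up to floating point error
```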

3. Theory

In this section, we provide the theoretical foundation of our proposed Qatten model. We show that in the general framework of MARL, without any additional assumption, when we investigate the global Q-value $Q_{tot}$ near a maximum point in action space, the dependence of $Q_{tot}$ on the individual Q-values $Q^i$ is approximately linear (Theorem 2). We propose to model the linear coefficients with the attention mechanism, elaborated in Section 4, and the precision and effectiveness of this linear approximation are validated in our experiment section (Section 5).

The functional relation between $Q_{tot}$ and the $Q^i$ appears to be linear in action space, yet it contains all the non-linear information. We show in Theorem 3 that the detailed structure of the linear coefficients is related to all the non-linear dependence of $Q_{tot}$ on the $Q^i$. In other words, as functions of the action variables, non-linear cross terms such as $Q^i Q^j$ turn out to be approximately linear.

We emphasize that our theory is based solely on conditions required in a general MARL setting. Our theory serves as an umbrella covering existing methods such as VDN, QMIX and QTRAN, including the (already) general IGM framework (Son et al., 2019), since all these methods require additional assumptions. In more precise terms, the Q-value $Q_{tot}(s, \vec{a})$ is a function of the state vector $s$ and the joint actions $\vec{a} = (a^1, \ldots, a^N)$. We fix $s$ and investigate the local behavior of $Q_{tot}$ near a maximum point $\vec{a}_o$. The gradient vanishes at $\vec{a}_o$:

$$\frac{\partial Q_{tot}}{\partial a^i}(s, \vec{a}_o) = 0, \quad (3)$$

so locally we have

$$Q_{tot}(s, \vec{a}) = Q_{tot}(s, \vec{a}_o) + \sum_{ij} \frac{\partial^2 Q_{tot}}{\partial a^i \partial a^j}(\vec{a}_o)\,(a^i - a^i_o)(a^j - a^j_o) + o(\|\vec{a} - \vec{a}_o\|^2). \quad (4)$$

One of the non-trivial points of our finding is that the second order term in Equation 4 has no cross terms.

Theorem 1. The coefficients of $(a^i - a^i_o)(a^j - a^j_o)$ for $i \neq j$ in Equation 4 vanish, namely

$$\frac{\partial^2 Q_{tot}}{\partial a^i \partial a^j}(\vec{a}_o) = 0.$$

Applying the Implicit Function Theorem, $Q_{tot}$ can also be viewed as a function of the $Q^i$:

$$Q_{tot} = Q_{tot}(s, Q^1, Q^2, \ldots, Q^N), \quad (5)$$

where $Q^i = Q^i(s, a^i)$. Our main result is that the major terms in Equation 4 ($Q_{tot}(\vec{a}_o)$ plus the second order term) coincide with a linear functional of the $Q^i$, while the higher order terms $o(\|\vec{a} - \vec{a}_o\|^2)$ are verified to be negligible in our experiments (Section 5).

Theorem 2. There exist constants $c(s)$, $\lambda_i(s)$ (depending on the state $s$), such that when we neglect higher order terms $o(\|\vec{a} - \vec{a}_o\|^2)$, the local expansion of $Q_{tot}$ admits the following form:

$$Q_{tot}(s, \vec{a}) = c(s) + \sum_i \lambda_i(s) Q^i(s, a^i). \quad (6)$$

Moreover, in a cooperative setting, the constants satisfy $\lambda_i(s) \geq 0$.

The reason why, in the cooperative setting, the $\lambda_i(s)$ in Theorem 2 are non-negative is the following. The common goal of the agents is to maximize the global Q-value $Q_{tot}$. If the coefficient $\lambda_i(s)$ of agent $i$ were negative, increasing $Q^i$ would have a negative impact on the global value $Q_{tot}$. In this case, agent $i$ would sit in a competitive position against the whole group, violating the cooperative assumption.

The major reason behind Theorem 2 is the following. Every agent $i$ is connected to the whole group; an agent independent of the group should not be considered a group member. Mathematically, this means that a variation of the individual Q-value $Q^i$ has an impact (negative or positive) on the global Q-value $Q_{tot}$, that is,

$$\frac{\partial Q_{tot}}{\partial Q^i} \neq 0.$$

Since the gradient $\frac{\partial Q_{tot}}{\partial a^i}$ vanishes as in Eq. (3), it can be expanded by the chain rule as

$$\frac{\partial Q_{tot}}{\partial a^i} = \frac{\partial Q_{tot}}{\partial Q^i} \frac{\partial Q^i}{\partial a^i} = 0.$$

We conclude that

$$\frac{\partial Q^i}{\partial a^i} = 0, \quad (7)$$

namely each $Q^i$ is locally quadratic in $a^i$. For a detailed proof of Theorem 2 see Appendix A.

In the next section, we propose to make use of the attention mechanism as a universal function approximator (Yun et al., 2020) to approximate the coefficients $\lambda_i(s)$ in Theorem 2 (up to a normalization factor). The Universal Approximation Theorem for Transformers proved in (Yun et al., 2020) mathematically justifies our choice, where the attention mechanism is shown to be responsible for the Transformer's approximation ability.


The non-negativity of the coefficients in the cooperative setting coincides with the non-negativity of the attention weights in Eq. (2).

Our second theoretical contribution is a finer characterization of the coefficients $\lambda_i$ in Eq. (6).

Theorem 3. We have the following finer structure of $\lambda_i$ in Theorem 2:

$$\lambda_i(s) = \sum_h \lambda_{i,h}(s). \quad (8)$$

Namely, Eq. (6) becomes

$$Q_{tot}(s, \vec{a}) = c(s) + \sum_{i,h} \lambda_{i,h}(s) Q^i(s, a^i), \quad (9)$$

where $\lambda_{i,h}$ is a linear functional of all partial derivatives $\frac{\partial^h Q_{tot}}{\partial Q^{i_1} \cdots \partial Q^{i_h}}$ of order $h$, and decays super-exponentially fast in $h$.

We remark that when we employ the attention mechanism to approximate $\lambda_i$, each $\lambda_{i,h}$ is modelled as single-head attention. As far as we know, this yields, for the first time, a theoretical explanation of the multi-head construction in the attention mechanism. Since $\lambda_{i,h}$ decays fast in $h$, we truncate the series at $H$ (the number of heads) for feasible computation in the next section. For a proof of Theorem 3 see Appendix A.

Our Eq. (9) appears to be linear in the $Q^i$, yet it contains all the non-linear information: the head coefficient $\lambda_{i,h}$ is a linear functional of all partial derivatives of order $h$, and corresponds to all cross terms $Q^{i_1} \cdots Q^{i_h}$ of order $h$ (e.g., $\lambda_{i,2}$ corresponds to the second order non-linearity $Q^i Q^j$). This non-linearity includes, in particular, the special case of the mixing network in the QMIX model (Rashid et al., 2018). In fact, when expanding the non-linear terms $Q^{i_1} \cdots Q^{i_h}$, we find that their major part is identical to a linear term with coefficient $\lambda_{i,h}$, which contributes a summand to the coefficient $\lambda_i$; this is how our proof of Theorem 3 works (Appendix A).

While our theory is developed locally, which usually requires an effective (convergence) radius of local approximation in action space, the extensive experiments conducted in Section 5 show that this approximation radius readily covers a wide range of practical applications.

4. Framework

In this section, we propose the Q-value Attention network (Qatten), a practical deep Q-value decomposition network following the theoretical formalization in Eq. (9). Figure 2 illustrates the overall architecture, which consists of agents' recurrent Q-value networks, representing each agent's individual value function $Q^i(\tau^i, a^i)$, and a refined attention-based value-mixing network that models the relation between $Q_{tot}$ and the individual Q-values. The attention-based mixing network takes the individual agents' Q-values and local information as input and mixes them with the global state to produce $Q_{tot}$.
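As a rough illustration of the per-agent recurrent Q-network just described (and depicted on the right of Figure 2), the sketch below stacks an MLP, a GRU cell over the observation-action history, and a final layer producing $Q^i(\tau^i, \cdot)$. All layer sizes are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RecurrentAgent(nn.Module):
    """DRQN-style agent: (o_t, a_{t-1}) -> MLP -> GRU -> Q^i(tau_t, .)."""
    def __init__(self, input_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_and_last_action, h_prev):
        x = torch.relu(self.fc1(obs_and_last_action))
        h = self.rnn(x, h_prev)          # carries the action-observation history tau^i
        q_values = self.fc2(h)           # Q^i(tau_t^i, .) for every action
        return q_values, h

agent = RecurrentAgent(input_dim=30, n_actions=9)
q, h = agent(torch.rand(1, 30), torch.zeros(1, 64))
print(q.shape, h.shape)  # torch.Size([1, 9]) torch.Size([1, 64])
```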

We start from the fundamental theoretical form of the multiple linear mixing of $Q_{tot}$ and $Q^i$. Expanding the outer sum over $h$ in Eq. (8), we have

$$Q_{tot} = c(s) + \sum_{h=1}^{H} \sum_{i=1}^{N} \lambda_{i,h}(s) Q^i. \quad (10)$$

For each $h$, the inner linear sum can be implemented using the differentiable key-value memory model (Graves et al., 2014; Oh et al., 2016) to establish the linear relations from the individuals to the global value. Specifically, we pass the similarity value between the global state's embedding vector $e_s(s)$ and each individual's embedding vector $e_i(o^i)$ into a softmax:

$$\lambda_{i,h} \propto \exp\big(e_i^{T} W_{k,h}^{T} W_{q,h}\, e_s\big), \quad (11)$$

where $W_{q,h}$ transforms $e_s$ into the global query and $W_{k,h}$ transforms $e_i$ into the key of each agent. The embeddings $e_s$ and $e_i$ can be obtained by a one- or two-layer embedding transformation of $s$ and each $o^i$. To increase the non-linearity between $Q_{tot}$ and $Q^i$, we can append each agent's Q-value $Q^i$ to its $o^i$ when performing self-attention. As $o^i$ is replaced by $(o^i, Q^i)$ and the self-attention function uses the softmax activation, Qatten can represent a non-linear combination of $Q_{tot}$ and $Q^i$. Next, for the outer sum over $h$, we use multiple attention heads to implement the approximations of the different orders of partial derivatives. Summing up the head Q-values $Q^h$ from the different heads, we get

$$Q_{tot} = c(s) + \sum_{h=1}^{H} Q^h, \quad \text{where } Q^h = \sum_{i=1}^{N} \lambda_{i,h} Q^i. \quad (12)$$

$H$ is the number of attention heads. Lastly, the first term $c(s)$ in Eq. (9) could be learned by a neural network with the global state $s$ as input.

As implied by our theory, Qatten naturally satisfies monotonicity and achieves the IGM property between $Q_{tot}$ and $Q^i$:

$$\frac{\partial Q_{tot}}{\partial Q^i} \geq 0, \quad \forall i \in \{1, 2, \ldots, N\}. \quad (13)$$

Thus, Qatten allows tractable maximisation of the joint action-value in off-policy learning, and guarantees consistency between the centralized and decentralized policies.
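The following sketch pulls Eqs. (10)-(12) together: per-head query/key projections produce the softmax coefficients $\lambda_{i,h}$ of Eq. (11), each head forms $Q^h = \sum_i \lambda_{i,h} Q^i$, and the heads are summed and added to a learned $c(s)$. The embedding sizes and the simple linear embeddings are assumptions for illustration; the paper's actual configuration is listed in Table 3 of the Appendix.

```python
import torch
import torch.nn as nn

class QattenStyleMixer(nn.Module):
    """Multi-head attention mixing of individual Q-values, following Eqs. (10)-(12)."""
    def __init__(self, n_agents, state_dim, obs_dim, embed_dim=32, n_heads=4):
        super().__init__()
        self.state_embed = nn.Linear(state_dim, embed_dim)          # e_s(s)
        self.obs_embed = nn.Linear(obs_dim, embed_dim)              # e_i(o^i)
        self.w_q = nn.ModuleList(nn.Linear(embed_dim, embed_dim, bias=False)
                                 for _ in range(n_heads))           # W_{q,h}
        self.w_k = nn.ModuleList(nn.Linear(embed_dim, embed_dim, bias=False)
                                 for _ in range(n_heads))           # W_{k,h}
        self.c = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                               nn.Linear(embed_dim, 1))             # c(s)

    def forward(self, agent_qs, state, obs):
        # agent_qs: (batch, n_agents); state: (batch, state_dim); obs: (batch, n_agents, obs_dim)
        e_s = self.state_embed(state)                               # (batch, embed)
        e_i = self.obs_embed(obs)                                   # (batch, n_agents, embed)
        head_qs = []
        for w_q, w_k in zip(self.w_q, self.w_k):
            query = w_q(e_s).unsqueeze(2)                           # (batch, embed, 1)
            keys = w_k(e_i)                                         # (batch, n_agents, embed)
            lam = torch.softmax(keys.bmm(query).squeeze(2), dim=1)  # Eq. (11): lambda_{i,h}
            head_qs.append((lam * agent_qs).sum(dim=1, keepdim=True))  # Q^h = sum_i lambda_{i,h} Q^i
        return self.c(state) + torch.stack(head_qs, dim=0).sum(dim=0)  # Eq. (12)

mixer = QattenStyleMixer(n_agents=5, state_dim=48, obs_dim=30)
q_tot = mixer(torch.rand(8, 5), torch.rand(8, 48), torch.rand(8, 5, 30))
print(q_tot.shape)  # torch.Size([8, 1])
```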

Weighted Head Q-value. In the previous description, we directly add the Q-value contributions from the different heads. If the environment becomes more complex, we can assign weights $w_h$ to the Q-values from the different heads to capture more complicated relations.


Figure 2. The overall architecture of Qatten. The right is agent $i$'s recurrent deep Q-network, which receives the last hidden state $h^i_{t-1}$ and the current local observation $o^i_t$ as inputs; the last action $a^i_{t-1}$ is also input to enhance the local observation. The left is the mixing network of Qatten, which mixes the $Q^i(\tau^i_t, a^i_t)$ together with $s_t$. At time $t$, $s_t$ could be represented by the joint unit features or the joint observations.

To ensure monotonicity, we retrieve these head Q-value weights with an absolute activation function from a two-layer feed-forward network $f_{NN}$, which adjusts the head Q-values based on the global state $s$:

$$Q_{tot} = c(s) + \sum_{h=1}^{H} w_h \sum_{i=1}^{N} \lambda_{i,h} Q^i, \quad (14)$$

where $w_h = |f_{NN}(s)|_h$. Compared with directly summing up the head Q-values, the weighted head Q-values relax the upper and lower bounds of $Q_{tot}$ imposed by the attention.
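A minimal sketch of the weighted-head variant of Eq. (14): a two-layer feed-forward network with an absolute-value activation produces one non-negative weight per head from the global state, and these weights rescale the head Q-values before summation. Layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class WeightedHeadSum(nn.Module):
    """Eq. (14): Q_tot = c(s) + sum_h w_h * Q^h, with w_h = |f_NN(s)|_h >= 0."""
    def __init__(self, state_dim: int, n_heads: int, hidden_dim: int = 64):
        super().__init__()
        self.f_nn = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU(),
                                  nn.Linear(hidden_dim, n_heads))

    def forward(self, head_qs, state):
        # head_qs: (batch, n_heads) with Q^h = sum_i lambda_{i,h} Q^i; state: (batch, state_dim)
        w = torch.abs(self.f_nn(state))                 # non-negative head weights keep monotonicity
        return (w * head_qs).sum(dim=1, keepdim=True)   # c(s) would be added on top, as in Eq. (14)

weighted = WeightedHeadSum(state_dim=48, n_heads=4)
print(weighted(torch.rand(8, 4), torch.rand(8, 48)).shape)  # torch.Size([8, 1])
```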

5. Experimental Evaluation

5.1. Settings

In this section, we evaluate Qatten on StarCraft II decentralized micromanagement tasks and use the StarCraft Multi-Agent Challenge (SMAC) environment (Samvelyan et al., 2019) as our testbed, which has become a commonly used benchmark for evaluating state-of-the-art MARL approaches such as COMA (Foerster et al., 2018), QMIX (Rashid et al., 2018) and QTRAN (Son et al., 2019). We train multiple agents to control the allied units, while the enemy units are controlled by a built-in hand-crafted AI. At the beginning of each episode, the enemy units move to attack the allies. Proper micromanagement of units during battles is needed to maximize the damage dealt to enemy units while minimizing the damage received, and hence requires a range of coordination skills such as focusing fire and avoiding overkill. Learning these diverse cooperative behaviors under partial observation is challenging. All results are averaged over 5 runs with different seeds. Training and evaluation schedules, such as the number of testing episodes and the training hyper-parameters, are kept the same as for QMIX in SMAC. For the attention part, the embedding dimension for the query ($s$) and keys ($o^i$) is 32, and the number of heads is set to 4. More details are provided in the Appendix.

Table 1. Maps in hard and super hard scenarios.

Name | Ally Units | Enemy Units | Type
5m_vs_6m (hard) | 5 Marines | 6 Marines | Asymmetric, Homogeneous
3s_vs_5z (hard) | 3 Stalkers | 5 Zealots | Asymmetric, Heterogeneous
2c_vs_64zg (hard) | 2 Colossi | 64 Zerglings | Asymmetric, Heterogeneous
bane_vs_bane (hard) | 4 Banelings, 20 Zerglings | 4 Banelings, 20 Zerglings | Symmetric, Heterogeneous
3s5z_vs_3s6z (super hard) | 3 Stalkers, 5 Zealots | 3 Stalkers, 6 Zealots | Asymmetric, Heterogeneous
MMM2 (super hard) | 1 Medivac, 2 Marauders, 7 Marines | 1 Medivac, 3 Marauders, 8 Marines | Asymmetric, Heterogeneous

According to the map characteristics and the learning performance of the algorithms, SMAC divides these maps into three levels: easy scenarios, hard scenarios, and super hard scenarios. The map scenarios differ in agent number and agent type. The easy scenarios include 2s_vs_1sc, 2s3z, 3s5z, 1c3s5z and 10m_vs_11m. Table 1 briefly introduces the hard and super hard scenarios.


In the hard scenarios, 5m_vs_6m requires precise control, such as focusing fire and consistent walking, to win. In 3s_vs_5z, since Zealots counter Stalkers, the only winning strategy is to kite the enemy around the map and kill them one after another. In 2c_vs_64zg, the 64 enemy units make the action space of the agents the largest among all scenarios. In bane_vs_bane, the winning strategy is to correctly use the Banelings to destroy as many enemies as possible. For the super hard scenarios, the key winning strategy of MMM2 is for the Medivac to head toward the enemies first to absorb fire and then retreat to heal the ally with the least health. As the most difficult scenario, 3s5z_vs_3s6z may require a better exploration mechanism during training.

5.2. Validation

5.2.1. EASY SCENARIOS

First, we test Qatten on the easy scenarios to validate its effectiveness. Table 2 shows the median test win rate of the different MARL algorithms. These results demonstrate that Qatten achieves competitive performance with QMIX and other popular methods. Qatten masters these easy tasks and also works well in heterogeneous and asymmetric settings. COMA's poor performance results from its sample-inefficient on-policy learning and naive critic structure. Except on 2s_vs_1sc and 2s3z, IDL's win percentage is quite low, as directly using global rewards to update policies brings non-stationarity, which becomes severe when the number of agents increases. QTRAN also does not perform well, as its practical relaxations impede the exactness of its updates.

Table 2. Median performance of the test win percentage.

Scenario | Qatten | QMIX | COMA | VDN | IDL | QTRAN
2s_vs_1sc | 100 | 100 | 97 | 100 | 100 | 100
2s3z | 97 | 97 | 34 | 97 | 75 | 83
3s5z | 94 | 94 | 0 | 84 | 9 | 13
1c3s5z | 97 | 94 | 23 | 84 | 11 | 67
10m_vs_11m | 97 | 94 | 5 | 94 | 19 | 59

5.2.2. HARD SCENARIOS

Next, we test Qatten on the hard scenarios. Results are presented in Figure 3, with the 25%-75% percentile shaded. To summarize, these scenarios reflect different challenges, and existing approaches fail to achieve the consistent performance they show on the easy scenarios. For example, in bane_vs_bane, it is surprising that IQL performs quite well (ranking 2nd) despite there being 24 agents, while QMIX cannot learn steadily. The reason is that the global Q-value heavily depends on the 4 Banelings among the 24 agents, as they are vital to winning the battle; thus IQL only needs to learn well-coordinated policies for the 4 Banelings, while QMIX cannot easily find an appropriate formulation for $Q_{tot}$ and $Q^i$.

Figure 3. Median win percentage on the hard scenarios: (a) 5m_vs_6m, (b) 3s_vs_5z, (c) bane_vs_bane, (d) 2c_vs_64zg.

In the four hard maps, VDN and QMIX show advantages over each other on different maps. In contrast, Qatten consistently beats all other approaches on these hard maps, which validates the effectiveness of its general mixing formula for $Q_{tot}$ and $Q^i$.

5.2.3. SUPER HARD SCENARIOS

Finally, we test Qatten on the super hard scenarios, as shown in Figure 4, where the 25%-75% percentile is shaded. Due to their difficulty and complexity, we augment Qatten with weighted head Q-values. In MMM2, Qatten masters the winning strategy of sending the Medivac ahead to absorb damage and then retreating it, and exceeds QMIX by a large margin. 3s5z_vs_3s6z is an even more challenging scenario on which all existing approaches fail. In contrast, our approach Qatten wins approximately 16% of battles after 2 million training steps.

Figure 4. Median win percentage on the super hard scenarios: (a) MMM2, (b) 3s5z_vs_3s6z.

5.3. Ablation Study

We investigate the influence of weighted head Q-values on three difficult scenarios (2c_vs_64zg, MMM2 and 3s5z_vs_3s6z) from the previous experiments.


For clarity, the basic Qatten without weighted head Q-values is called Qatten-base, while Qatten with weighted head Q-values is called Qatten-weighted. Figure 5 shows the ablation results, with the 25%-75% percentile shaded. As we can see, weighted head Q-values improve the performance of Qatten-base on these difficult scenarios, suggesting that this mechanism captures the sophisticated relations between $Q_{tot}$ and $Q^i$ more accurately, with $w_h$ adjusting the head weights flexibly. Specifically, in MMM2, Qatten is also augmented with the non-linearity trick, which appends $Q^i$ to $o^i$ when performing self-attention.

Figure 5. Ablation study of Qatten on three difficult scenarios: (a) 2c_vs_64zg, (b) MMM2, (c) 3s5z_vs_3s6z.

5.4. Attention Analysis

Next, we visualize the attention weights ($\lambda_{i,h}$) at each step during a battle to better understand Qatten's learning process. We choose two representative maps, 5m_vs_6m and 3s5z_vs_3s6z, for illustration; the attention weight heat map of each head for each map is shown in Figure 6(a-b) and Figure 6(c-d), respectively. For 5m_vs_6m, the most important trick to win is to avoid being killed and to focus fire to kill the enemy. From Figure 6(a), we see that allies have similar attention weights, which means that Qatten indicates an almost equally divided $Q_{tot}$ across agents. This is because each allied Marine plays a similar role in this homogeneous scenario. It may also explain the fact that VDN can slightly outperform QMIX when the weights of the $Q^i$ are roughly equal. Still, we find some tendencies of Qatten. One main tendency is that the only surviving unit (agent 0) receives the highest weight in the later steps of the episode, which encourages agents to survive in order to win. Thus, with the refined attention mechanism adjusting the weights with minor differences, Qatten performs best.

Then we analyse the other map, 3s5z_vs_3s6z, whose attention heat map is shown in Figure 6(c). After analyzing the correlation between attention weights and agent features, we notice that Head 1 focuses on the output damage: units with high CoolDown (indicating they have just fired) receive high attention values. Heads 1 and 2 tend to focus on the Stalker agents (Agents 0, 1 and 2), while the Zealot agents (Agents 3-7) receive almost no attention at the beginning. While Stalker agent 1 with high health receives attention from heads 1 and 2, Zealot agent 6 with high health receives attention from heads 0 and 3. In summary, the attention weights on map 3s5z_vs_3s6z change violently over the course of the battle.

Figure 6. Attention weights on 5m_vs_6m and 3s5z_vs_3s6z: (a) 5m_vs_6m attention heat map (Heads 0-3), (b) 5m_vs_6m agent properties (Health, CoolDown, Relative X, Relative Y), (c) 3s5z_vs_3s6z attention heat map (Heads 0-3), (d) 3s5z_vs_3s6z agent properties. Steps increase from top to bottom in the attention heat maps; the horizontal ordinate indicates the agent id under each head.

As the battle involves highly fluctuating dynamics, VDN, which statically decomposes $Q_{tot}$ equally, cannot adapt to such complex scenarios. Similarly, QMIX, with its rough and implicit approximation, also fails. In contrast, Qatten can approximate the sophisticated relations between $Q^i$ and $Q_{tot}$, with different attention heads capturing different features in sub-spaces.

6. Conclusion and Future Work

In this paper, we propose the novel Q-value Attention network for the multiagent Q-value decomposition problem. For the first time, we derive a theoretical linear decomposition formula relating $Q_{tot}$ and $Q^i$, which covers previous methods and provides a theoretical explanation of the multi-head structure. To approximate each term of the decomposition formula, we introduce multi-head attention to build the mixing network. Furthermore, Qatten can be enhanced by weighted head Q-values. Experiments on the standard MARL benchmark show that our method obtains the best performance on almost all maps, and the attention analysis gives intuitive explanations of each agent's weight when approximating $Q_{tot}$ and, to some degree, reveals the internal workflow of approximating $Q_{tot}$ from the $Q^i$.

For future work, improving Qatten by combining it with explicit exploration mechanisms on difficult MARL tasks is a straightforward direction. Besides, incorporating recent progress in attention to adapt Qatten to large-scale settings with hundreds of agents is also promising.


References

Busoniu, L., Babuska, R., and De Schutter, B. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, 38(2):156-172, 2008.

Cao, Y., Yu, W., Ren, W., and Chen, G. An overview of recent progress in the study of distributed multi-agent coordination. IEEE Transactions on Industrial Informatics, 9(1):427-438, 2012.

Foerster, J. N., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. Counterfactual multi-agent policy gradients. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018.

Graves, A., Wayne, G., and Danihelka, I. Neural Turing machines. arXiv:1410.5401 [cs], 2014. URL http://arxiv.org/abs/1410.5401.

Gupta, J. K., Egorov, M., and Kochenderfer, M. Cooperative multi-agent control using deep reinforcement learning. In Proceedings of the 16th International Conference on Autonomous Agents and MultiAgent Systems, pp. 66-83, 2017.

Littman, M. L. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings, pp. 157-163. Elsevier, 1994. doi: 10.1016/B978-1-55860-335-6.50027-1.

Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., and Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379-6390. Curran Associates, Inc., 2017.

Mahajan, A., Rashid, T., Samvelyan, M., and Whiteson, S. MAVEN: Multi-agent variational exploration. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 7611-7622. Curran Associates, Inc., 2019.

Matignon, L., Laurent, G. J., and Le Fort-Piat, N. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering Review, 27(1):1-31, 2012.

Oh, J., Chockalingam, V., Singh, S., and Lee, H. Control of memory, active perception, and action in Minecraft. In Proceedings of the 33rd International Conference on Machine Learning, pp. 2790-2799. JMLR.org, 2016. URL http://dl.acm.org/citation.cfm?id=3045390.3045684.

Palmer, G., Tuyls, K., Bloembergen, D., and Savani, R. Lenient multi-agent deep reinforcement learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 443-451, 2018.

Rashid, T., Samvelyan, M., de Witt, C. S., Farquhar, G., Foerster, J. N., and Whiteson, S. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, pp. 4292-4301, 2018.

Samvelyan, M., Rashid, T., de Witt, C. S., Farquhar, G., Nardelli, N., Rudner, T. G. J., Hung, C.-M., Torr, P. H. S., Foerster, J., and Whiteson, S. The StarCraft Multi-Agent Challenge. arXiv:1902.04043 [cs, stat], 2019. URL http://arxiv.org/abs/1902.04043.

Schroeder de Witt, C., Foerster, J., Farquhar, G., Torr, P., Boehmer, W., and Whiteson, S. Multi-agent common knowledge reinforcement learning. In Advances in Neural Information Processing Systems 32, pp. 9924-9935. Curran Associates, Inc., 2019.

Son, K., Kim, D., Kang, W. J., Hostallero, D. E., and Yi, Y. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5887-5896. PMLR, 2019.

Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., and Graepel, T. Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2085-2087, 2018. URL http://dl.acm.org/citation.cfm?id=3237383.3238080.

Tan, M. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the 10th International Conference on Machine Learning, pp. 330-337. Morgan Kaufmann, 1993.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 5998-6008. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.

Ying, W. and Sang, D. Multi-agent framework for third party logistics in e-commerce. Expert Systems with Applications, 29(2):431-436, 2005.

Yun, C., Bhojanapalli, S., Rawat, A. S., Reddi, S. J., and Kumar, S. Are transformers universal approximators of sequence-to-sequence functions? In Proceedings of the 8th International Conference on Learning Representations, 2020.


A. Proofs

Proof of Theorem 1:

Proof. This follows directly from Theorem 2: $Q^i$ is only related to $a^i$; therefore, when the expansion is rewritten in terms of the $a^i$ via the $Q^i$, there are no cross terms $a^i a^j$ for $i \neq j$.

Proof of Theorem 2:

Proof. Theorem 2 follows directly from Theorem 3, and its proof is identical to that of Theorem 3 as well. Please see the proof of Theorem 3 for completeness.

Proof of Theorem 3:

Proof. We expand $Q_{tot}$ in terms of the $Q^i$:

$$Q_{tot} = \text{constant} + \sum_i \mu_i Q^i + \sum_{ij} \mu_{ij} Q^i Q^j + \ldots + \sum_{i_1 \ldots i_k} \mu_{i_1 \ldots i_k} Q^{i_1} \cdots Q^{i_k} + \ldots, \quad (15)$$

where

$$\mu_i = \frac{\partial Q_{tot}}{\partial Q^i}, \qquad \mu_{ij} = \frac{1}{2}\frac{\partial^2 Q_{tot}}{\partial Q^i \partial Q^j},$$

and in general

$$\mu_{i_1 \ldots i_k} = \frac{1}{k!}\frac{\partial^k Q_{tot}}{\partial Q^{i_1} \cdots \partial Q^{i_k}}.$$

We will simply take $\lambda_{i,1} = \mu_i$ in Theorem 3.

Recall that every agent $i$ is related to the whole group, and thus its interest must be associated with that of the group. That is to say, a variation of each $Q^i$ has an impact on $Q_{tot}$:

$$\frac{\partial Q_{tot}}{\partial Q^i} \neq 0.$$

Since the gradient $\frac{\partial Q_{tot}}{\partial a^i}$ vanishes as in Eq. (3), it can be expanded by the chain rule as

$$\frac{\partial Q_{tot}}{\partial a^i} = \frac{\partial Q_{tot}}{\partial Q^i} \frac{\partial Q^i}{\partial a^i} = 0.$$

We conclude that

$$\frac{\partial Q^i}{\partial a^i}(\vec{a}_o) = 0.$$

This is exactly Equation 7, which we copy here for the reader's convenience. Consequently, we have the local expansion

$$Q^i(a^i) = \alpha^i + \beta^i (a^i - a^i_o)^2 + o\big((a^i - a^i_o)^2\big).$$

Now we apply the equation above to the second order term in Equation 15:

$$\begin{aligned}
\sum_{ij} \mu_{ij} Q^i Q^j
&= \sum_{ij} \mu_{ij}\big(\alpha^i + \beta^i(a^i - a^i_o)^2\big)\big(\alpha^j + \beta^j(a^j - a^j_o)^2\big) + o(\|a - a_o\|^2) \\
&= \sum_{ij} \mu_{ij}\alpha^i\alpha^j + 2\sum_{ij} \mu_{ij}\alpha^j\beta^i(a^i - a^i_o)^2 + o(\|a - a_o\|^2) \\
&= \sum_{ij} \mu_{ij}\alpha^i\alpha^j + 2\sum_{ij} \mu_{ij}\alpha^j(Q^i - \alpha^i) + o(\|a - a_o\|^2) \\
&= -\sum_{ij} \mu_{ij}\alpha^i\alpha^j + 2\sum_{ij} \mu_{ij}\alpha^j Q^i + o(\|a - a_o\|^2).
\end{aligned}$$

Therefore, we will take

$$\lambda_{i,2} = 2\sum_j \mu_{ij}\alpha^j.$$

In general, we have

$$\mu_{i_1 \ldots i_k} = \frac{1}{k!}\frac{\partial^k Q_{tot}}{\partial Q^{i_1} \cdots \partial Q^{i_k}}$$

and

$$\sum_{i_1, \ldots, i_k} \mu_{i_1 \ldots i_k} Q^{i_1} \cdots Q^{i_k} = -(k-1)\sum_{i_1, \ldots, i_k} \mu_{i_1 \ldots i_k}\, \alpha^{i_1} \cdots \alpha^{i_k} + k \sum_{i_1, \ldots, i_k} \mu_{i_1 \ldots i_k}\, \alpha^{i_1} \cdots \alpha^{i_{k-1}} Q^{i_k} + o(\|a - a_o\|^2),$$

so we take

$$\lambda_{i,k} = k \sum_{i_1, \ldots, i_{k-1}} \mu_{i_1 \ldots i_{k-1} i}\, \alpha^{i_1} \cdots \alpha^{i_{k-1}}.$$

The convergence of the series $\sum_k \lambda_{i,k}$ only requires mild conditions, e.g., boundedness or even small growth of the partial derivatives $\frac{\partial^k Q_{tot}}{\partial Q^{i_1} \cdots \partial Q^{i_k}}$ in terms of $k$.

Both Theorem 2 and Theorem 3 follow from this proof.

B. Experimental Settings

We follow the settings of SMAC (Samvelyan et al., 2019), which can be found in the SMAC paper. For clarity and completeness, we state these environment details again.


B.1. States and Observations

At each time step, agents receive local observations within their field of view. This encompasses information about the map within a circular area around each unit with a radius equal to the sight range, which is set to 9. The sight range makes the environment partially observable for the agents. An agent can only observe other units if they are both alive and located within its sight range. Hence, there is no way for agents to distinguish whether their teammates are far away or dead. If a unit (ally or enemy) is dead or unseen from another agent's viewpoint, its unit feature vector is reset to all zeros. The feature vector observed by each agent contains the following attributes for both allied and enemy units within the sight range: distance, relative x, relative y, health, shield, and unit type. If the agents are homogeneous, the unit type feature is omitted. All Protoss units have shields, which serve as a source of protection to offset damage and regenerate if no new damage is received. Lastly, agents can observe the terrain features surrounding them, in particular the values of eight points at a fixed radius indicating height and walkability.

The global state is composed of the joint unit features of both ally and enemy soldiers. Specifically, the state vector includes the coordinates of all agents relative to the centre of the map, together with the unit features present in the observations. Additionally, the state stores the energy/cooldown of the allied units, depending on the unit type, which represents the minimum delay between attacks or heals. All features, both in the global state and in the individual observations of the agents, are normalized by their maximum values.

B.2. Action Space

The discrete set of actions which agents are allowed to take consists of move[direction], attack[enemy id], stop, and no-op. Dead agents can only take the no-op action, while live agents cannot. Agents can only move by a fixed amount of 2 in four directions: north, south, east, or west. To ensure decentralization of the task, agents are restricted to use the attack[enemy id] action only towards enemies within their shooting range. This additionally constrains the ability of the units to use the built-in attack-move macro-actions on enemies that are far away. The shooting range is set to 6 for all agents. Having a larger sight range than shooting range allows agents to make use of the move commands before starting to fire. The unit behavior of automatically responding to enemy fire without being explicitly ordered is also disabled. As healer units, Medivacs use heal[agent id] actions instead of attack[enemy id].

B.3. Rewards

At each time step, the agents receive a joint reward equal to the total damage dealt to the enemy units. In addition, agents receive a bonus of 10 points after killing each opponent, and 200 points after killing all opponents and winning the battle. The rewards are scaled so that the maximum cumulative reward achievable in each scenario is around 20.
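For concreteness, a hedged sketch of how a SMAC-style shaped reward could be computed from the quantities described above (damage dealt, a kill bonus of 10, a win bonus of 200, then rescaled so the maximum return is about 20); the scaling factor is illustrative, since SMAC derives the maximum raw return per map.

```python
def shaped_reward(damage_dealt: float, enemies_killed: int, battle_won: bool,
                  max_raw_return: float) -> float:
    """Illustrative SMAC-style reward: damage + 10 per kill + 200 for winning,
    rescaled so the maximum cumulative reward per episode is roughly 20."""
    raw = damage_dealt + 10.0 * enemies_killed + (200.0 if battle_won else 0.0)
    return raw * 20.0 / max_raw_return

# e.g. a step where 35 damage was dealt and one enemy died, on a map whose
# maximum achievable raw return is (hypothetically) 1000:
print(shaped_reward(35.0, 1, False, max_raw_return=1000.0))  # 0.9
```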

B.4. Training Configurations

The training time ranges from about 8 to 18 hours per map (GPU Nvidia RTX 2080 and CPU AMD Ryzen Threadripper 2920X), depending on the number of agents and the map features. The total number of training steps is about 2 million, and we evaluate the model every 10 thousand steps. When training, a batch of 32 episodes is retrieved from the replay buffer, which contains the most recent 5000 episodes. We use an $\epsilon$-greedy policy for exploration: the starting exploration rate is set to 1, the final exploration rate is 0.05, and the exploration rate decays linearly over the first 50 thousand steps. We keep the default configurations of the environment parameters.
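The exploration schedule described above (start at 1, finish at 0.05, linear decay over the first 50 thousand environment steps) can be written as a small helper; a minimal sketch:

```python
def epsilon(step: int, start: float = 1.0, end: float = 0.05,
            decay_steps: int = 50_000) -> float:
    """Linearly anneal epsilon from `start` to `end` over `decay_steps`, then hold."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

print(epsilon(0), epsilon(25_000), epsilon(100_000))  # 1.0 0.525 ~0.05
```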

Table 3. The network configurations of Qatten's mixing network.

Qatten mixing network configuration | Value
Query embedding layer number | 2
Unit number in query embedding layer 1 | 64
Activation after query embedding layer 1 | ReLU
Unit number in query embedding layer 2 | 32
Key embedding layer number | 1
Unit number in key embedding layer 1 | 32
Head weight layer number | 2
Unit number in head weight layer 1 | 64
Activation after head weight layer 1 | ReLU
Unit number in head weight layer 2 | 4
Attention head number | 4
Constant value layer number | 2
Unit number in constant value layer 1 | 32
Activation after constant value layer 1 | ReLU
Unit number in constant value layer 2 | 1

B.5. Mixing Network Hyper-parameters

We adopt the Python MARL framework (PyMARL) (Samvelyan et al., 2019) from GitHub to develop our algorithm. The hyper-parameters of the training and testing configurations are the same as in SMAC (Samvelyan et al., 2019) and can be found in the source code.


Here we list the specific parameters of Qatten's mixing network in Table 3.
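As a convenience, the settings of Table 3 can be collected into a plain dictionary of the kind typically passed to a PyMARL-style mixer; the key names below are our own illustrative choices, not PyMARL's actual configuration keys.

```python
# Hypothetical key names; the values follow Table 3.
qatten_mixing_config = {
    "query_embedding_layers": [64, 32],   # two layers, ReLU after the first
    "key_embedding_layers": [32],         # single layer
    "head_weight_layers": [64, 4],        # two layers, ReLU after the first
    "n_attention_heads": 4,
    "constant_value_layers": [32, 1],     # two layers, ReLU after the first
}
```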

B.6. Win Percentage Table of All Maps

We here give the median win rates of all methods on the scenarios presented in our paper. In MMM2 and 3s5z_vs_3s6z, we augment Qatten with the weighted head Q-values. The other maps are reported with win rates based on the original Qatten (Qatten-base). We can see that Qatten obtains the best performance on almost all the map scenarios.

Table 4. Median performance of the test win percentage.

Scenario | Qatten | QMIX | COMA | VDN | IDL | QTRAN
2s_vs_1sc | 100 | 100 | 97 | 100 | 100 | 100
2s3z | 97 | 97 | 34 | 97 | 75 | 83
3s5z | 94 | 94 | 0 | 84 | 9 | 13
1c3s5z | 97 | 94 | 23 | 84 | 11 | 67
5m_vs_6m | 74 | 63 | 0 | 63 | 49 | 57
3s_vs_5z | 96 | 85 | 0 | 87 | 43 | 0
bane_vs_bane | 97 | 62 | 40 | 90 | 97 | 100
2c_vs_64zg | 65 | 45 | 0 | 19 | 2 | 10
MMM2 | 79 | 61 | 0 | 0 | 0 | 0
3s5z_vs_3s6z | 16 | 1 | 0 | 0 | 0 | 0