
Feudal Multi-Agent Hierarchies for Cooperative Reinforcement Learning

Sanjeevan Ahilan 1 Peter Dayan 2

Abstract

We investigate how reinforcement learning agents can learn to cooperate. Drawing inspiration from human societies, in which successful coordination of many individuals is often facilitated by hierarchical organisation, we introduce Feudal Multi-agent Hierarchies (FMH). In this framework, a 'manager' agent, which is tasked with maximising the environmentally-determined reward function, learns to communicate subgoals to multiple, simultaneously-operating, 'worker' agents. Workers, which are rewarded for achieving managerial subgoals, take concurrent actions in the world. We outline the structure of FMH and demonstrate its potential for decentralised learning and control. We find that, given an adequate set of subgoals from which to choose, FMH performs, and particularly scales, substantially better than cooperative approaches that use a shared reward function.

1. Introduction

In cooperative multi-agent reinforcement learning, simultaneously-acting agents must learn to work together to achieve a shared set of goals. A straightforward approach is for each agent to optimise a global objective using single-agent reinforcement learning (RL) methods such as Q-Learning (Watkins & Dayan, 1992) or policy gradients (Williams, 1992; Sutton et al., 2000). Unfortunately, this approach suffers from various problems in general.

First, from the perspective of any one agent, the environment is non-stationary. This is because as other agents learn, their policies change, creating a partially unobservable influence on the effect of the first agent's actions. This issue has recently been addressed by a variety of deep RL methods, for instance with centralised training of ultimately decentralised policies (Lowe et al., 2017; Foerster et al., 2018). However, this requires that the centralised critic or critics have access to the actions and observations of all agents during training, which may not always be possible.

1 Gatsby Computational Neuroscience Unit, University College London, London, United Kingdom. 2 Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany. Correspondence to: Sanjeevan Ahilan <[email protected]>.

A second problem is coordination; agents must learn how to choose actions coherently, even in environments in which many optimal equilibria exist (Matignon et al., 2012). Whilst particularly challenging when agents are completely independent, such problems can be made more feasible if a form of communication is allowed (Vlassis, 2007). Nevertheless, it is difficult for agents to learn how to communicate relevant information effectively to solve coordination problems, with most approaches relying on further helpful properties such as a differentiable communication channel (Sukhbaatar et al., 2016; Foerster et al., 2016) and/or a model of the world's dynamics (Mordatch & Abbeel, 2018).

Third, multi-agent methods scale poorly: the effective state space grows exponentially with the number of agents. Learning a centralised value function therefore suffers the curse of dimensionality, whilst the alternative of decentralised learning often appears inadequate for addressing non-stationarity. Optimising a global objective also becomes challenging at scale, as it becomes difficult to assign credit to each agent (Wolpert & Tumer, 2002; Chang et al., 2004).

A clue for meeting this myriad of challenges may lie in the way in which human and animal societies are hierarchically structured. In particular, even in broadly cooperative groups, it is frequently the case that different individuals agree to be assigned different objectives which they work towards for the benefit of the collective. For example, members of a company typically have different roles and responsibilities. They will likely report to managers who define their objectives, and they may in turn be able to set objectives for more junior members. At the highest level, the CEO is responsible for the company's overall performance.

Inspired by this idea, we propose an approach for the multi-agent domain which organises many, simultaneously acting agents into a managerial hierarchy. Whilst typically all agents in a cooperative task seek to optimise a shared reward, in Feudal Multi-agent Hierarchies (FMH) we instead only expose the highest-level manager to this 'task' reward.



The manager must therefore learn to solve the principal-agent problem (Jensen & Meckling, 1976) of communicating subgoals, which define a reward function, to the worker agents under its control. Workers learn to satisfy these subgoals by taking actions in the world and/or by setting their own subgoals for workers immediately below them in the hierarchy.

FMH allows for a diversity of rewards. This can provide individual agents with a rich learning signal, but necessarily implies that interactions between agents will not in general be fully cooperative. However, the intent of our design is to achieve collective behaviours which are apparently cooperative, from the perspective of an outside observer viewing performance on the task objective.

Our idea is a development of a single-agent method for hierarchical RL known as feudal RL (Dayan & Hinton, 1993; Vezhnevets et al., 2017), which involves a 'manager' agent defining subgoals for a 'worker' agent in order to achieve temporal abstraction. Feudal RL allows for the division of tasks into a series of subtasks, but has largely been investigated with only one worker per manager acting at any one time. By introducing a feudal hierarchy with multiple agents acting simultaneously in FMH, we not only divide tasks over time but also across worker agents. Furthermore, we embrace the full multi-agent setting, in which observations are not in general shared across agents.

We outline a method to implement FMH for concurrently acting, communicating agents trained in a decentralised fashion. Our approach pre-specifies appropriate positions in the hierarchy as well as a mapping from the manager's choice of communication to the workers' reward functions. We show how to facilitate learning of our deep RL agents within FMH through suitable pretraining and repeated communication to encourage temporally extended goal-setting.

We conduct a range of experiments that highlight the advantages of FMH. In particular, we show its ability to address non-stationarity during training, as managerial reward renders the behaviour of workers more predictable. We also demonstrate FMH's ability to scale to many agents and many possible goals, as well as to enable effective coordination amongst workers. It performs substantially better than both decentralised and centralised approaches.

2. Background

2.1. Markov Decision Processes

Single-agent RL can be formalised in terms of Markov Decision Processes, which consist of a set of states S and a set of actions A, a reward function r : S × A → R and a transition function T : S × A → ∆(S), where ∆(S) denotes the set of discrete probability distributions over S. Agents act according to a stochastic policy π : S × A → [0, 1]. One popular objective is to maximise the discounted expected future reward, defined as E_π[∑_{t=0}^{∞} γ^t r(s_t, a_t)], where the expectation is over the sequence of states and actions which result from policy π, starting from an initial state distribution ρ_0 : S → [0, 1]. Here γ ∈ [0, 1) is a discount factor and t is the time step. This objective can be equivalently expressed as E_{s∼ρ^π, a∼π}[r(s, a)], where ρ^π is the discounted state distribution induced by policy π starting from ρ_0.
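As a purely illustrative check of this objective, the following minimal Python snippet computes the discounted return of a single sampled reward sequence (the rewards and γ here are arbitrary and not taken from the paper):

```python
# Discounted return: sum over t of gamma^t * r(s_t, a_t) for one rollout.
# The reward values below are arbitrary and only illustrate the formula.
gamma = 0.95
rewards = [0.0, 0.0, 1.0, 0.5]  # r(s_0, a_0), r(s_1, a_1), ...

discounted_return = sum(gamma ** t * r for t, r in enumerate(rewards))
print(discounted_return)  # 0.95**2 * 1.0 + 0.95**3 * 0.5 ≈ 1.331
```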

2.2. Deterministic Policy Gradient Algorithms

Deterministic policy gradients (DPG) is a frequently used single-agent algorithm for continuous control using model-free RL (Silver et al., 2014). It uses deterministic policies µ_θ : S → A, whose parameters θ are adjusted in an off-policy fashion, using an exploratory behavioural policy, to perform stochastic gradient ascent on an objective J(θ) = E_{s∼ρ^µ, a∼µ_θ}[r(s, a)].

We can write the gradient of J(θ) as:

∇_θ J(θ) = E_{s∼ρ^µ}[∇_θ µ_θ(s) ∇_a Q^µ(s, a)|_{a=µ_θ(s)}].    (1)

Deep Deterministic Policy Gradients (DDPG) (Lillicrap et al., 2015) is a variant with the policy µ and critic Q^µ represented via deep neural networks. Like DQN (Mnih et al., 2015), DDPG stores experienced transitions in a replay buffer and replays them during training.
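The following is a minimal PyTorch sketch of one DDPG update step; it assumes pre-built actor/critic modules, their target copies, optimisers and a replay-buffer batch of tensors, none of which are taken from the authors' code. It shows the critic regressing onto a bootstrapped target, the actor ascending the deterministic policy gradient of Eq. (1), and Polyak averaging of the target networks. The defaults γ = 0.75 and τ = 0.01 match the hyperparameters reported in Suppl. Mat. A.1.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.75, tau=0.01):
    # batch: tensors sampled from the replay buffer (illustrative format).
    obs, act, rew, next_obs, done = batch

    # Critic update: regress Q(s, a) towards the one-step TD target
    # computed with the slowly-updated target networks.
    with torch.no_grad():
        next_act = target_actor(next_obs)
        target_q = rew + gamma * (1 - done) * target_critic(next_obs, next_act)
    critic_loss = F.mse_loss(critic(obs, act), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: descend -Q(s, mu(s)), i.e. follow Eq. (1) in expectation.
    actor_loss = -critic(obs, actor(obs)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Polyak averaging of the target networks.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```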

2.3. Markov Games

A partially observable Markov game (POMG) (Littman, 1994; Hu et al., 1998) for N agents is defined by a set of states S, and sets of actions A_1, ..., A_N and observations O_1, ..., O_N for each agent. In general, the stochastic policy of agent i may depend on the set of action-observation histories H_i ≡ (O_i × A_i)* such that π_i : H_i × A_i → [0, 1]. In this work we restrict ourselves to history-independent stochastic policies π_i : O_i × A_i → [0, 1]. The next state is generated according to the state transition function T : S × A_1 × ... × A_N → ∆(S). Each agent i obtains deterministic rewards defined as r_i : S × A_1 × ... × A_N → R and receives a private observation o_i : S → O_i. There is an initial state distribution ρ_0 : S → [0, 1] and each agent aims to maximise its own discounted sum of future rewards.
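To make the per-agent view of this formalism concrete, here is a minimal sketch of a POMG interface written as a Python protocol; it is an illustration only, not the interface of the multi-agent particle environment used later.

```python
from typing import Any, Protocol, Sequence, Tuple

class MarkovGame(Protocol):
    """Per-agent view of an N-agent partially observable Markov game."""
    n_agents: int

    def reset(self) -> Sequence[Any]:
        """Sample s_0 ~ rho_0 and return each agent's private observation o_i."""
        ...

    def step(self, actions: Sequence[Any]
             ) -> Tuple[Sequence[Any], Sequence[float], bool]:
        """Apply the joint action (a_1, ..., a_N), sample the next state from T,
        and return per-agent observations, per-agent rewards r_i, and a done flag."""
        ...
```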

2.4. Centralised and Decentralised Training

In multi-agent RL, agents can be trained in a centralised or decentralised fashion. In decentralised training, agents have access only to local information: their own action and observation histories during training (Tan, 1993). Agents are often trained using single-agent methods for RL, such as Q-Learning or DDPG.

In centralised training, the action and observation histories of all agents are used, effectively reducing the multi-agent problem to a single-agent problem. Although it may appear restrictive for agents to require access to this full information, this approach has generated recent interest due to the potential for centralised training of ultimately decentralised policies (Lowe et al., 2017; Foerster et al., 2018). For example, in simulation one can train a Q-function, known as a critic, which exploits the action and observation histories of all agents to aid the training of local policies for each one. These policies can then be deployed in multi-agent systems in the real world, where centralisation may be infeasible.

2.5. Multi-Agent Deep Deterministic Policy Gradients

Multi-agent deep deterministic policy gradients (MADDPG) (Lowe et al., 2017) is an algorithm for centralised training and decentralised control of multi-agent systems. It uses deterministic policies, as in DDPG, which condition only on each agent's local observations and actions. MADDPG handles the non-stationarity associated with the simultaneous adaptation of all the agents by introducing a separate centralised critic for each agent. Lowe et al. trained history-independent feedforward networks on a range of mixed cooperative-competitive environments, significantly improving upon agents trained independently with DDPG.
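The structure of such a centralised critic can be sketched as follows, assuming it scores agent i's action while conditioning on the observations and actions of all agents. Dimensions, layer sizes and names are illustrative and not taken from the MADDPG reference implementation.

```python
import torch
import torch.nn as nn

class CentralisedCritic(nn.Module):
    """Q-network conditioned on all agents' observations and actions."""
    def __init__(self, obs_dims, act_dims, hidden=256):
        super().__init__()
        in_dim = sum(obs_dims) + sum(act_dims)  # joint observation-action input
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, all_obs, all_acts):
        # all_obs, all_acts: lists of per-agent tensors, each of shape (batch, dim)
        joint = torch.cat(list(all_obs) + list(all_acts), dim=-1)
        return self.net(joint)
```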

3. Methods

We propose FMH, a framework for multi-agent RL which addresses the three major issues outlined in the introduction: non-stationarity, scalability and coordination. In order to increase the generality of our approach, we do so without requiring a number of potentially helpful features: access to the observations of all agents (as in centralised training), a differentiable model of the world's dynamics or a differentiable communication channel between agents.

3.1. Hierarchies

We begin by defining the type of hierarchical structure we create for the agents. In its most straightforward setting, this hierarchy corresponds to a rooted directed tree, with the highest-level manager as the root node and each worker reporting to only a single manager. Our experiments use this structure, although, for simplicity, using just two-level hierarchies. However, we also note the possibility, in more exotic circumstances, of considering more general directed acyclic graphs in which a single worker reports to multiple managers, and allowing for more than one manager at the highest level of the hierarchy. Acyclicity prevents potentially disastrous feedback cycles for the reward.

Appropriate assignment of agents to positions in a hierarchy should depend on the varied observation and action capabilities of the agents. Agents exposed to key information associated with global task performance are likely to be suited to the role of manager, whereas agents with narrower information, but which can act to achieve subgoals, are more naturally suited to being workers. As we show in our experiments, identifying these differences can often make assigning agents straightforward. Here, we specify positions directly, to focus on demonstrating the resulting effectiveness of the hierarchy. However, extensions in which the hierarchy is itself learned would be interesting.

3.2. Goal-Setting

In FMH, managers set goals for workers by defining their rewards. To arrange this, managers communicate a special kind of message to workers that defines the workers' reward functions according to a specified mapping. For example, if there are three objects in a room, we might allow the manager to select from three different messages, each defining a worker reward function proportional to the negative distance from a corresponding object. Our experiments explore variants of this scheme, imposing the constraint that the structure of the reward function remain fixed whilst the target is allowed to change. In this context and task, messages therefore correspond to 'goals' requiring approach to different objects. The manager must learn to communicate these goals judiciously in order to solve complex tasks. In turn, the workers must learn how to act in the light of the resulting managerial reward, in addition to immediate rewards (or costs) they might also experience.
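A minimal sketch of such a mapping, assuming the 'negative distance to a landmark' reward described above; the variable names (worker_pos, landmark_positions) are ours, not the paper's code.

```python
import numpy as np

def worker_reward(worker_pos: np.ndarray,
                  landmark_positions: np.ndarray,
                  goal: int) -> float:
    """Message `goal` selects landmark `goal`; the worker is rewarded by the
    negative distance to that landmark (illustrative reward structure)."""
    return -float(np.linalg.norm(worker_pos - landmark_positions[goal]))

# Example: with three landmarks, message 2 rewards approach to landmark 2.
landmarks = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(worker_reward(np.array([0.1, 0.8]), landmarks, goal=2))
```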


Figure 1. An example of a worker-computed Feudal Multi-agent Hierarchy with one manager agent and two worker agents. Worker rewards are goal-dependent and computed locally; the manager's reward is provided by the environment.

Since a communicated goal is simply another type of action for the manager, our approach is consistent with the formalism of a (partially observable) Markov game, for which the reward for agent i is defined as r_i : S × A_1 × ... × A_N → R.

However, the worker's reward need not be treated as part of the environment. For instance, each worker agent i could compute its own reward locally, r_i : O_i × A_i → R, where O_i includes the observed goal from the manager. We illustrate this 'worker-computed' interpretation in Fig. 1.

3.3. Pretraining and Temporally Extended Subgoals

We next consider the issue of non-stationarity. This frequently arises in multi-agent RL because the policies of other agents may change in unpredictable ways as they learn. By contrast, in FMH we allow manager agents to determine the reward functions of workers, compelling the workers towards more predictable behaviour from the perspective of the manager. However, the same issue applies at the starting point of learning for workers: they will not yet have learned how to satisfy the goals. We would therefore expect a manager to underestimate the value of the subgoals it selects early in training, potentially leading it sub-optimally to discard subgoals which are harder for the worker to learn.

Thus, it would be beneficial for worker agents already to have learned to fulfill managerial subgoals. We address this issue practically in two steps. First, we introduce a bottom-up 'pretraining' procedure, in which we initially train the workers before training the manager. Although the manager is not trained during this period, it still acts in the multi-agent environment and sets subgoals for the worker agents. As subgoals are initially of (approximately) equal value, the manager will explore them uniformly. If the set of possible subgoals is sufficiently compact, this will enable workers to gain experience of each potential subgoal.

This pretraining does not completely solve the non-stationarity problem. This is because the untrained manager will, with high probability, vacillate between subgoals, preventing the workers under its command from having any reasonable hope of extracting reward. We therefore want managers not only to try out a variety of subgoals but also to commit to them for long enough that they have some hope of being at least partially achieved. Thus, the second component of the solution is a communication-repeat heuristic for the manager, such that goal-setting is temporally extended. We demonstrate its effectiveness in our experiments.
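A minimal sketch of this communication-repeat heuristic, assuming a manager policy that maps an observation to a subgoal (the class and attribute names are illustrative): the chosen subgoal is held fixed for `repeat` consecutive environment steps.

```python
class RepeatedGoalManager:
    """Holds the manager's chosen subgoal fixed for `repeat` steps."""
    def __init__(self, policy, repeat: int = 8):
        self.policy = policy          # maps manager observation -> subgoal
        self.repeat = repeat          # e.g. 8 of the 25 steps per episode
        self._steps_left = 0
        self._current_goal = None

    def act(self, observation):
        if self._steps_left == 0:
            self._current_goal = self.policy(observation)
            self._steps_left = self.repeat
        self._steps_left -= 1
        return self._current_goal
```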

3.4. Coordination

Multi-agent problems may have many equilibria, and good ones can only be achieved through effective coordination of agent behaviour. In FMH, the responsibility for coordination is incumbent upon the manager, which exerts control over its workers through its provision of reward. We show, in the simplified setting of a matrix game, how a manager may coordinate the actions of its workers (see Suppl. Mat. C).

3.5. FMH-DDPG

FMH provides a framework for rewarding agents in multi-agent domains, and can work with many different RL algorithms. In our experiments, we chose to apply FMH in combination with the single-agent algorithm DDPG, trained in a fully decentralised fashion. As we do not experiment with combining FMH with any other algorithm, we frequently refer to FMH-DDPG simply as FMH.

3.6. Parameter Sharing

We apply our method to scenarios in which a large number of agents have identical properties. For convenience, when training using a decentralised algorithm (FMH-DDPG or vanilla DDPG) we implement parameter sharing among identical agents, in order to train them efficiently. We only add experience from a single agent (among those sharing parameters) into the shared replay buffer, and carry out a single set of updates. We find this gives very similar results to training without parameter sharing, in which each agent is updated using its own experiences stored in its own replay buffer (see Suppl. Mat. A.2).
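A minimal sketch of this scheme, assuming illustrative policy, buffer and environment interfaces: all identical workers act with one shared set of weights, but only a single worker's transition is stored for updates.

```python
def collect_step(shared_policy, shared_buffer, observations, env):
    """One environment step with parameter sharing across identical workers."""
    actions = [shared_policy(o) for o in observations]   # one set of weights
    next_obs, rewards, done = env.step(actions)

    # Only worker 0's transition enters the shared replay buffer,
    # so the amount of stored experience matches the no-sharing case.
    shared_buffer.add(observations[0], actions[0], rewards[0],
                      next_obs[0], done)
    return next_obs, rewards, done
```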

4. Experiments and Results

We used the multi-agent particle environment¹ as a framework for conducting experiments to test the potential of our method to address the issues of non-stationarity, scalability and coordination, comparing against MADDPG and DDPG. We will release code for both the model and the environments after the reviewing process ends.

The RL agents have both an actor and a critic, each corresponding to a feedforward neural network. We give a detailed summary of all hyperparameters used in Suppl. Mat. A.1.

4.1. Cooperative Communication

We first experiment with an environment implemented in Lowe et al. (2017) called 'Cooperative Communication' (Figure 2a), in which there are two cooperative agents, one called the 'speaker' and the other called the 'listener', placed in an environment with many landmarks of differing colours. On each episode, the listener must navigate to a randomly selected landmark; in the original problem, both listener and speaker obtain reward proportional to the negative distance² of the listener from the correct target. However, whilst the listener observes its relative position from each of the differently coloured landmarks, it does not know to which landmark it must navigate. Instead, the colour of the target landmark can be seen by the speaker, which cannot observe the listener and is unable to move. The speaker can however communicate to the listener at every time step, and so successful performance on the task corresponds to the speaker enabling the listener to reach the correct target. We also note that, although reward is provided during the episode, it is used only for training agents and not directly observed, which means that agents cannot simply learn to follow the gradient of reward.

Figure 2. Cooperative Communication with 1 listener and 12 landmarks. (a) The speaker (grey circle) sees the colour of the listener (green circle), which indicates the target landmark (green square). It communicates a message to the listener at every time step. Here there are 12 landmarks and the agent trained using FMH has successfully navigated to the correct landmark. (b) FMH substantially outperforms MADDPG and DDPG. The dotted green line indicates the end of pretraining for FMH. (c) FMH worker reward and the probability of the manager assigning the correct target to the worker. The manager learns to assign the target correctly with probability 1.

¹ https://github.com/openai/multiagent-particle-envs
² The original implementation used the negative square distance; we found this slightly worse for all algorithms.

Although this task seems simple, it is challenging for many RL methods. Lowe et al. (2017) trained, in a decentralised fashion, a variety of single-agent algorithms, including DDPG, DQN and trust-region policy optimisation (Schulman et al., 2015), on a version of this problem with three landmarks and demonstrated that they all perform poorly on this task. Of these methods, DDPG reached the highest level of performance and so we use DDPG as the strongest commonly used baseline for the decentralised approach. We also compare our results to MADDPG, which combines DDPG with centralised training. MADDPG was found to perform strongly on Cooperative Communication with three landmarks, far exceeding the performance of DDPG.

For our method, FMH, we also utilised DDPG, but reverted to the more generalisable decentralised training that was previously ineffective. Crucially, we assigned the speaker the role of manager and the listener the role of worker. The speaker can therefore communicate messages which correspond to the subgoals of the different coloured landmarks. The listener is therefore not rewarded for going to the correct target but is instead rewarded proportional to the negative distance from the speaker-assigned target. The speaker itself is rewarded according to the original problem definition, the negative distance of the listener from the true target.

We investigated in detail a version of Cooperative Communication with 12 possible landmarks (Figure 2a). FMH performed significantly better than both MADDPG and DDPG (Figure 2b) over a training period of 100 epochs (each epoch corresponds to 1000 episodes). For FMH, we pretrained the worker for 10 epochs and enforced extended communication over 8 time steps (each episode is 25 time steps).

In Figure 2c, the left axis shows the reward received by the FMH worker (listener) over training. This increased during pretraining and again immediately after pretraining concluded. Managerial learning after pretraining resulted in decreased entropy of communication over the duration of an episode (see Suppl. Mat. A.3), allowing the worker to optimise the managerial objective more effectively. This in turn enabled the manager to assign goals correctly, with the rise in the probability of correct assignment occurring shortly thereafter (right axis), then reaching perfection.

Our results show how FMH resolves the issue of non-stationarity. Initially, workers are able to achieve reward by learning to optimise managerial objectives, even whilst the manager itself is not competent. This learning elicits robust behaviour from the worker, conditioned on managerial communication, which makes workers more predictable from the perspective of the manager. This then enables the manager to achieve competency, learning to assign goals so as to solve the overall task.

Our implementation of FMH used both pretraining and extended goal-setting with a communication-repeat heuristic. In Suppl. Mat. A.4, we show that pretraining the worker improved performance on this task, although even without pretraining FMH still performed better than MADDPG and DDPG. The introduction of extended communication is however more critical. In Figure 3 we show the performance of FMH with goal-setting over various numbers of time steps (and a fixed pretraining period of 10 epochs). When there were no communication repeats (Comm 1), performance was similar to MADDPG, but introducing even a single repeat greatly improved performance. By contrast, introducing communication repeats to MADDPG did not improve performance (see Suppl. Mat. A.5).

Listeners | Landmarks | Final reward: FMH | MADDPG | DDPG | CoM | Epochs until convergence: FMH | MADDPG | DDPG
1 | 3 | −6.63 ± 0.05 | −6.58 ± 0.03 | −14.26 ± 0.07 | −17.28 | 56 | 24 | 55
1 | 6 | −6.91 ± 0.07 | −6.69 ± 0.06 | −18.10 ± 0.07 | −18.95 | 57 | 66 | 42
1 | 12 | −7.79 ± 0.06 | −15.96 ± 0.09 | −19.32 ± 0.11 | −19.56 | − | − | 36
3 | 6 | −7.10 ± 0.04 | −11.13 ± 0.03 | −18.90 ± 0.05 | −18.95 | 77 | − | 50
5 | 6 | −7.17 ± 0.03 | −18.47 ± 0.04 | −19.73 ± 0.06 | −18.95 | 79 | 75 | 53
10 | 6 | −8.96 ± 0.03 | −19.80 ± 0.06 | −21.19 ± 0.03 | −18.95 | − | 59 | 32

Table 1. Performance of FMH, MADDPG and DDPG for versions of Cooperative Communication with different numbers of listeners and landmarks. Final reward is determined by training for 100 epochs and evaluating the mean reward per episode in the final epoch. We indicate no convergence with a − symbol. For further details see Suppl. Mat. A.6.

Figure 3. Communication sent by the manager is repeated for extended goal-setting over various numbers of time steps.

4.1.1. SCALING TO MANY AGENTS

We next scaled Cooperative Communication to include many listener agents. At the beginning of an episode each listener is randomly assigned a target out of all the possible landmarks, and the colour of each listener's target is observed by the speaker. The speaker communicates a single message to each listener at every time step. To allow for easy comparison with versions of Cooperative Communication with only one listener, we normalised the reward by the number of listeners. As discussed in the methods, we shared parameters across listeners for FMH and DDPG.

In Table 1 we show the performance of FMH, MADDPG and DDPG for variants of Cooperative Communication with different numbers of listener agents and landmarks. Consider the version with 6 landmarks, which we found that MADDPG could solve with a single listener within 100 epochs (unlike for 12 landmarks). On increasing the number of listeners up to a maximum of 10, we found that FMH scales much better than MADDPG; FMH was able to learn an effective policy with 5 listener agents whereas MADDPG could not. Further, FMH even scaled to 10 listeners, although it did not converge over the 100 epochs.

To aid interpretation of the reward values in Table 1, we also compare performance to a policy which simply moves to the centroid of the landmarks. This 'Centre of Mass' (CoM) agent was trained using MADDPG until convergence on a synthetic task in which reaching the centroid is maximally rewarded, and then evaluated on the true version of the task. We find that both MADDPG and DDPG do not perform better than the CoM agent when there are 10 listeners and 6 landmarks.

In Figure 4 we show the final state achieved on an example episode of this version of the task, for agents trained using MADDPG and FMH. After training for 100 epochs, MADDPG listeners do not find the correct targets by the end of the episode whereas FMH listeners manage to do so.

Figure 4. Scaling Cooperative Communication to 10 listeners with 6 landmarks: the final time step of an example episode (MADDPG left, FMH right).

4.2. Cooperative Coordination

We then designed a task to test coordination called 'Cooperative Coordination' (Figure 5a). In this task, there are 6 landmarks. At the beginning of each episode, 3 landmarks are randomly selected to be green targets and 3 to be blue decoys. A team of 3 agents must navigate to cover the green targets whilst ignoring the blue decoys, but they are unable to see the colours of the landmarks. A fourth agent, the speaker, can see the colours of the landmarks and can send messages to the other agents (but cannot move). All agents can see each others' positions and velocities, are large in size and face penalties if they collide with each other. The task shares aspects with 'Cooperative Navigation' (Lowe et al., 2017) but is considerably more complex due to the greater number of potential targets and the hidden information.

Figure 5. Cooperative Coordination. (a) Three listeners (light grey agents) must move to cover the green landmarks whilst ignoring the blue landmarks. However, only the speaker (dark grey agent) can see the landmarks' colours; it communicates with the listeners at every time step. In this example, FMH agents have successfully coordinated to cover the three correct targets. (b) FMH performs significantly better than MADDPG and DDPG. The dotted green line indicates the end of pretraining for FMH. (c) Agents trained using FMH cover, on average, more targets by the end of the episode than MADDPG and DDPG. (d) Agents trained using FMH avoid collisions more effectively than MADDPG and DDPG over the duration of an episode.

We apply FMH to this problem, assigning the speaker agent the role of manager. One consideration is whether the manager, the worker, or both should receive the negative penalties due to worker collisions. Here we focus on the case in which the manager only concerns itself with the 'task' reward function. Penalties associated with collisions are therefore experienced only by the workers themselves, which seek to avoid these whilst still optimising the managerial objective.

In Figure 5b we compare the performance of FMH to MADDPG and DDPG. As with Cooperative Communication, FMH does considerably better than both after training for 150 epochs. This is further demonstrated when we evaluate the trained policies over a period of 10 epochs: Figure 5c shows the mean number of targets covered by the final time step of the episode, for which FMH more than doubles MADDPG. Figure 5d shows the mean number of collisions (which are negatively rewarded) during an episode. FMH collides very rarely, whereas MADDPG and DDPG collide over 4 times more frequently.

We also implement a version of Cooperative Coordination in which the manager is responsible not only for coordinating its workers but must also navigate to targets itself. We find that it can learn to do both roles effectively (see Suppl. Mat. B.2.1).

4.2.1. EXPLOITING DIVERSITY

One role of a manager is to use the diversity it has available in its workers to solve tasks more effectively. We tested this in a version of Cooperative Coordination in which one of the listener agents was lighter than the other two and so could reach farther targets more quickly.

We trained FMH (without parameter sharing, due to the diversity) on this task and then evaluated the trained policies on a 'Two-Near, One-Far' (TNOF) version of the task in which one target is far away and the remaining two are close (see Suppl. Mat. B.2.2). We did this to investigate whether FMH, which was trained on the general problem, had learned the specific strategy of assigning the farthest target to the fastest agent. We found this to be true 100 percent of the time (evaluating over 10 epochs); we illustrate this behaviour in Figure 6.

Figure 6. FMH solves the TNOF task (example episode). Left: Agents are initialised at the bottom of the environment; two targets are close by, and one far away. Right: By the end of the episode, the faster (red) agent covers the farther target on the right, despite starting on the left.


5. Discussion

We have shown how cooperative multi-agent problems can be solved efficiently by defining a hierarchy of agents. Our hierarchy was reward-based, with manager agents setting goals for workers, and with the highest-level manager optimising the overall task objective. Our method, called FMH, was trained in a decentralised fashion and showed considerably better scaling and coordination than both centralised and decentralised methods that used a shared reward function.

Our work was partly inspired by the feudal RL architecture (FRL) (Dayan & Hinton, 1993), a single-agent method for hierarchical RL (Barto & Mahadevan, 2003) which was also developed in the context of deep RL by Vezhnevets et al. (2017). In particular, FMH addresses the 'too many chiefs' inefficiency inherent to FRL, namely that each manager only has a single worker under its control. A much wider range of management possibilities and benefits arises when multiple agents operate at the same time to achieve one or more tasks (Busoniu et al., 2008). We focused on the cooperative setting (Panait & Luke, 2005); however, unlike the fully-cooperative setting, in which all agents optimise a shared reward function (Boutilier, 1996), our approach introduces a diversity of rewards, which can help with credit-assignment (Wolpert & Tumer, 2002; Chang et al., 2004) but also introduces elements of competition. This competition need not always be deleterious; for example, in some cases, an effective way of optimising the task objective might be to induce competition amongst workers, as in generative adversarial networks (Goodfellow et al., 2014).

We followed FRL (though not all of its successors; Vezhnevets et al., 2017) in isolating the workers from much of the true environmental reward, making them focus on their own narrower tasks. Such reward hiding was not complete: we considered individualised or internalised costs from collisions that workers experienced directly, such that their reward was not purely managerially dependent. A more complete range of possibilities for creating and decomposing rewards between managers and workers, when the objectives of the two are not perfectly aligned, could usefully be studied under the aegis of principal-agent problems (Jensen & Meckling, 1976; Laffont & Martimort, 2009).

Goal-setting in FMH was achieved by specifying a relationship between the chosen managerial communication and the resulting reward function. The communication was also incorporated into the observational state of the worker; however, the alternative possibility of a goal-embedding would be worth exploring (Vezhnevets et al., 2017). We also specified goals directly in the observational space of the workers, whereas Vezhnevets et al. specified goals in a learned hidden representation. This would likely be of particular value for problems in which defining a set of subgoals is challenging, such as those which require learning directly from pixels. More generally, work on the way that agents can learn to construct and agree upon a language for goal-setting would be most important.

To train our multi-agent systems we leveraged recent advances in the field of deep RL, in particular the algorithm DDPG (Lillicrap et al., 2015), which can learn in a decentralised fashion. Straightforward application of this algorithm has been shown to achieve some success in the multi-agent domain (Gupta et al., 2017), but it has also been shown to be insufficient for handling more complex multi-agent problems (Lowe et al., 2017). By introducing a managerial hierarchy, we showed that FMH-DDPG has the potential to greatly facilitate multi-agent learning whilst still retaining the advantage of decentralised training. Our proposed approach could also be combined with centralised methods, and this would be worth further exploration.

Other work in multi-agent RL has also benefitted from ideas from hierarchical RL. The MAXQ algorithm (Dietterich, 2000) has been used to train homogeneous agents (Makar et al., 2001), allowing them to coordinate by communicating subtasks rather than primitive actions, an idea recently re-explored in the context of deep RL (Tang et al., 2018). A meta-controller which structures communication between agent pairs in order to achieve coordination has also been proposed (Kumar et al., 2017) (and could naturally be hybridised with FMH), as well as a master-slave architecture which merges the actions of a centralised master agent with those of decentralised slave agents (Kong et al., 2017). Taken together, these methods represent interesting alternatives for invoking hierarchy which are unlike our primarily reward-based approach.

There are a number of additional directions for future work. First, the hierarchies used were simple in structure and specified in advance, based on our knowledge of the various information and action capabilities of the agents. It would be interesting to develop mechanisms for the formation of complex hierarchies. Second, in settings in which workers can acquire relevant task information, it would be worthwhile investigating how a manager might incentivise them to provide this. Third, it would be interesting to consider how a manager might learn to allocate resources, such as money, computation or communication bandwidth, to enable efficient group behaviour. Finally, we did not explore how managers should train or supervise the workers beneath them, such as through reward shaping (Ng et al., 1999). Such an approach might benefit from recurrent networks, which could enable managers to use the history of worker performance to better guide their learning.


6. Acknowledgements

We would like to thank Heishiro Kanagawa, Jorge A. Menendez and Danijar Hafner for helpful comments on a draft version of the manuscript. Sanjeevan Ahilan received funding from the Gatsby Computational Neuroscience Unit and the Medical Research Council. Peter Dayan received funding from the Max Planck Society.

References

Barto, A. G. and Mahadevan, S. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(1-2):41–77, 2003.

Boutilier, C. Planning, learning and coordination in multiagent decision processes. In Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge, pp. 195–210. Morgan Kaufmann Publishers Inc., 1996.

Busoniu, L., Babuska, R., and De Schutter, B. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(2), 2008.

Chang, Y.-H., Ho, T., and Kaelbling, L. P. All learning is local: Multi-agent learning in global reward games. In Advances in Neural Information Processing Systems, pp. 807–814, 2004.

Dayan, P. and Hinton, G. E. Feudal reinforcement learning. In Advances in Neural Information Processing Systems, pp. 271–278, 1993.

Dietterich, T. G. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.

Foerster, J., Assael, I. A., de Freitas, N., and Whiteson, S. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2137–2145, 2016.

Foerster, J. N., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Gupta, J. K., Egorov, M., and Kochenderfer, M. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pp. 66–83. Springer, 2017.

Hu, J., Wellman, M. P., et al. Multiagent reinforcement learning: theoretical framework and an algorithm. In ICML, volume 98, pp. 242–250. Citeseer, 1998.

Jensen, M. C. and Meckling, W. H. Theory of the firm: Managerial behavior, agency costs and ownership structure. Journal of Financial Economics, 3(4):305–360, 1976.

Kong, X., Xin, B., Liu, F., and Wang, Y. Revisiting the master-slave architecture in multi-agent deep reinforcement learning. arXiv preprint arXiv:1712.07305, 2017.

Kumar, S., Shah, P., Hakkani-Tur, D., and Heck, L. Federated control with hierarchical multi-agent deep reinforcement learning. arXiv preprint arXiv:1712.08266, 2017.

Laffont, J.-J. and Martimort, D. The Theory of Incentives: The Principal-Agent Model. Princeton University Press, 2009.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Littman, M. L. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pp. 157–163. Elsevier, 1994.

Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O. P., and Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390, 2017.

Makar, R., Mahadevan, S., and Ghavamzadeh, M. Hierarchical multi-agent reinforcement learning. In Proceedings of the Fifth International Conference on Autonomous Agents, pp. 246–253. ACM, 2001.

Matignon, L., Laurent, G. J., and Le Fort-Piat, N. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering Review, 27(1):1–31, 2012.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Mordatch, I. and Abbeel, P. Emergence of grounded compositional language in multi-agent populations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Ng, A. Y., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278–287, 1999.

Panait, L. and Luke, S. Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3):387–434, 2005.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In ICML, 2014.

Sukhbaatar, S., Fergus, R., et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pp. 2244–2252, 2016.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.

Tan, M. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pp. 330–337, 1993.

Tang, H., Hao, J., Lv, T., Chen, Y., Zhang, Z., Jia, H., Ren, C., Zheng, Y., Fan, C., and Wang, L. Hierarchical deep multiagent reinforcement learning. arXiv preprint arXiv:1809.09332, 2018.

Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., and Kavukcuoglu, K. Feudal networks for hierarchical reinforcement learning. In International Conference on Machine Learning, pp. 3540–3549, 2017.

Vlassis, N. A concise introduction to multiagent systems and distributed artificial intelligence. Synthesis Lectures on Artificial Intelligence and Machine Learning, 1(1):1–71, 2007.

Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Wolpert, D. H. and Tumer, K. Optimal payoff functions for members of collectives. In Modeling Complexity in Economic and Social Systems, pp. 355–369. World Scientific, 2002.


A. Experimental Results

A.1. Parameter Settings

In all of our experiments, we used the Adam optimizer with a learning rate of 0.001 and τ = 0.01 for updating the target networks. γ was 0.75. The size of the replay buffer was 10^7 and we updated the network parameters after every 100 samples added to the replay buffer. We used a batch size of 1024 episodes before making an update. For our feedforward networks we used two hidden layers with 256 neurons per layer. We trained with 10 random seeds (except where otherwise stated).

Hyperparameters were optimised using a line search centred on the experimental parameters used in Lowe et al. (2017). Our optimised parameters were found to be identical except for a lower value of γ (0.75) and of the learning rate (0.001), and a larger replay buffer (10^7). We found these values gave the best performance for both MADDPG and FMH on a version of Cooperative Communication with 6 landmarks, evaluated after 50 epochs (an epoch is defined to be 1000 episodes).
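For convenience, the values reported above can be gathered into a single configuration dictionary; the key names below are ours, not from the released code.

```python
# Hyperparameters from Suppl. Mat. A.1, collected for reference (sketch only).
config = {
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "tau": 0.01,                    # target-network update rate
    "gamma": 0.75,                  # discount factor
    "replay_buffer_size": int(1e7),
    "update_every_n_samples": 100,
    "batch_size_episodes": 1024,
    "hidden_layers": (256, 256),
    "random_seeds": 10,
}
```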

A.2. Parameter Sharing

We implemented parameter sharing for the decentralised algorithms DDPG and FMH. To make training results approximately similar to implementations which do not use parameter sharing, we restrict updates to a single agent and add experience only from a single agent to the shared replay buffer (amongst those sharing parameters). We find in practice that both approaches give very similar results for FMH, whereas parameter sharing slightly improves the performance of DDPG; we show this for a version of Cooperative Communication with 3 listeners and 6 targets (Figure S1). In general, sharing parameters reduces training time considerably, particularly as the number of agents scales.

Figure S1. Parameter sharing does not affect performance for FMH but slightly improves DDPG.

A.3. Entropy of Communication

We show the change in entropy of communication over the duration of an episode for the various algorithms (Figure S2). Agents are trained on Cooperative Communication with 12 landmarks and a single listener. As the target does not change during the middle of an episode, we expect the entropy to decrease as agents learn.

For FMH, during pretraining, entropy is high as all goals are sampled approximately uniformly (but with enforced extended communication over 8 time steps). However, shortly after pretraining ends the entropy of managerial communication rapidly decreases. This is in contrast to MADDPG, which decreases in entropy more steadily.

Figure S2. Entropy of managerial communication over the duration of an episode at different stages in training.

A.4. Pretraining

Agents trained using FMH were pretrained for 10 epochs across all experiments. During pretraining all agents act in the multi-agent environment. Although all experiences are added appropriately into the replay buffers, the manager does not update its parameters during pretraining whereas the workers do.

We show the benefits of pretraining on Cooperative Communication with 12 landmarks and a single listener agent in Figure S3.

Figure S3. Pretraining improves FMH, although agents can still learn effectively without it. We use extended communication (goal-setting) for 8 time steps.

A.5. Extended Communication

In the main text we showed that extended goal-setting in FMH led to substantially improved performance. We also considered whether a similar approach would benefit methods which do not treat communication as goals.

We found that extended communication did not help MADDPG on the same task, with learning curves being in all cases virtually identical (Figure S4).

Figure S4. Extended communication does not significantly improve the performance of MADDPG.

A.6. Further details on Table 1

Values in the table were determined using 10 random seeds in all cases, with the one exception of MADDPG with 10 listeners and 6 landmarks, which used 3 random seeds (training time is substantially longer as we do not share parameters). The CoM agent was trained on the synthetic task with different numbers of landmarks. Performance of the trained CoM policies was then evaluated over a period of 10 epochs on the corresponding true tasks.

Convergence was determined by comparing the mean performance in the final 5 epochs with the mean performance of a sliding window 5 epochs in width (we also take the mean across random seeds). If the mean performance within the window was within 2 percent of the final performance, and remained so for all subsequent epochs, we defined this as convergence, unless the first time this happened was within the final 10 epochs. In such a case, we defined the algorithm as not having converged. For assessing the exact time of convergence in the case of FMH, we report values which include the 10 epochs of pretraining.
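The following is our reading of this criterion as code (a sketch, not the authors' evaluation script); `rewards` is the per-epoch mean reward, already averaged over random seeds.

```python
import numpy as np

def convergence_epoch(rewards, window=5, tol=0.02, final_margin=10):
    """Return the first epoch of convergence, or None if the run did not converge."""
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    final_perf = rewards[-window:].mean()   # mean performance of the final 5 epochs

    for start in range(n - window + 1):
        window_means = [rewards[s:s + window].mean()
                        for s in range(start, n - window + 1)]
        # Within 2 percent of final performance, for this and all later windows.
        if all(abs(m - final_perf) <= tol * abs(final_perf) for m in window_means):
            epoch = start + window
            # If the criterion is first met only within the final 10 epochs,
            # treat the run as not having converged.
            return None if epoch > n - final_margin else epoch
    return None
```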

A.7. Cooperative Communication with 3 landmarks

For reference, we show performance of the various algorithms on Cooperative Communication with 3 landmarks. Both MADDPG and FMH perform well on this task, although MADDPG reaches convergence more rapidly (Figure S5).

Figure S5. Cooperative Communication with 1 listener and 3 landmarks.

B. Environments

B.1. Cooperative Communication

We provide further details on our version of Cooperative Communication (see main text for original description). In general, we keep environment details the same as Lowe et al., including the fact that the manager only has access to the target colour. However, we also scale up the number of coloured landmarks, which we do by taking the RGB values provided in the multi-agent particle environment, [0.65, 0.15, 0.15], [0.15, 0.65, 0.15], [0.15, 0.15, 0.65], and adding 9 more by placing them on the remaining vertices of the corresponding cube and at the centre-point of four of the faces (in RGB space). The colours for the 12 landmarks are therefore:

[0.65, 0.15, 0.15], [0.15, 0.65, 0.15], [0.15, 0.15, 0.65], [0.65, 0.65, 0.15], [0.15, 0.65, 0.65], [0.65, 0.15, 0.65], [0.65, 0.65, 0.65], [0.15, 0.15, 0.15], [0.40, 0.40, 0.65], [0.40, 0.40, 0.15], [0.15, 0.40, 0.40], [0.65, 0.40, 0.40].

The particular colour values used for the landmarks influence the performance of RL algorithms, as landmarks which have similar colours are harder for the speaker to learn to distinguish.

B.2. Cooperative Coordination

We provide further details on our version of Cooperative Coordination (see main text for original description). The task provides a negative reward of -1 to each agent involved in a collision. For DDPG and MADDPG this penalty is shared across agents, whereas in FMH only the agents involved in the collision experience this penalty.

We also evaluated performance of trained policies in Figures 5c and 5d with slight modifications to the overall task. In the case of Figure 5c, to ensure that targets were never impossible to achieve by overlapping with the immobile manager, we moved the manager off-screen. For Figure 5d we ensured that agent positions were never initialised in a way such that they would automatically collide (such cases are rare).

B.2.1. COORDINATION WITH A MOBILE MANAGER

We implemented a version of Cooperative Coordination with 2 listeners and 1 speaker, which together need to cover the 3 green targets. The manager must multi-task, directing workers to the correct targets whilst also covering these targets itself. We find that the manager learns to do this, outperforming both MADDPG and DDPG (Figure S6).


Figure S6. FMH performs well, even when the manager is required to move to cover targets whilst also setting goals for workers.

B.2.2. THE TWO-NEAR, ONE-FAR TASK

This task is not used for training but to evaluate the performance of agents trained on a version of Cooperative Coordination in which one agent is twice as light as normal and the remaining two are twice as heavy.

Evaluating the optimal assignments on this task can be difficult, so we assess it in the more easily interpreted TNOF environment. In TNOF, the agents start at the bottom of the environment. Two green targets are located nearby (in the bottom 40 percent of the screen) whereas one target is far away (in the top 30 percent of the screen). The x-coordinates are randomly sampled at the beginning of each episode and blue decoys are also added (one nearby, two far).

C. Solving a Coordination Game

Some of the most popular forms of multi-agent task are coordination games from microeconomics, in which there are multiple good and bad Nash equilibria and it is necessary to find the former. It is intuitively obvious that appointing one of the agents as a manager might resolve the symmetries inherent in a cooperative coordination game in which agents need to take different actions to receive reward:

                    Player Y
                    A         B
Player X   A     (0, 0)    (1, 1)
           B     (1, 1)    (0, 0)

This game has two pure strategy Nash equilibria and one mixed strategy Nash equilibrium which is Pareto dominated by the pure strategies. The challenge of this game is for both agents to choose a single Pareto optimal Nash equilibrium, either (A, B) or (B, A).

For a matrix game, we define the feudal approach as allowing Player X, the manager, to specify the reward Player Y will receive for its actions. This is a simplification when compared to the more general setting of a Markov game, in which the feudal manager can reward not only actions but also the achievement of certain states.³ In order to specify the reward, we assume that Player X communicates a goal, either g_A or g_B, prior to both players taking their actions. If Player X sends g_A, it means that action A is now rewarded for Player Y and action B is not. Player X's rewards are unchanged, and so together this induces the following matrix game:

                    Player Y
                    A         B
Player X   A     (0, 1)    (1, 0)
           B     (1, 1)    (0, 0)

Action A for Player Y is now strictly dominant and so a rational Player Y will always choose it. By iterated elimination of strictly dominated strategies we therefore find the resulting matrix game:

                    Player Y
                    A
Player X   A     (0, 1)
           B     (1, 1)

And so a rational Player X will always choose B, resulting in an overall strategy of (B, A) conditioned on an initial communication of g_A. By symmetry, we can see that conditioned on g_B, rational players X and Y will play (A, B). The feudal approach therefore allows the manager to flexibly coordinate the pair of agents to either Nash equilibrium. For games involving N players, coordination can be achieved by the manager sending out N − 1 goals.
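A tiny numerical check of this argument (illustrative only): conditioned on g_A, action A strictly dominates for Player Y, and Player X's best response is then B, recovering the (B, A) equilibrium.

```python
# Payoff indexing: (X action, Y action) -> (X payoff, Y payoff),
# for the game induced by the manager sending goal g_A.
payoffs_given_gA = {
    ("A", "A"): (0, 1), ("A", "B"): (1, 0),
    ("B", "A"): (1, 1), ("B", "B"): (0, 0),
}

# Player Y: A dominates B against every action of X.
y_prefers_A = all(payoffs_given_gA[(x, "A")][1] > payoffs_given_gA[(x, "B")][1]
                  for x in ("A", "B"))

# Player X: best response to Y playing A.
x_best = max(("A", "B"), key=lambda x: payoffs_given_gA[(x, "A")][0])

print(y_prefers_A, x_best)  # True, 'B' -> joint play (B, A)
```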

³ Typically we prefer the manager to choose distal states as targets rather than actions, as this requires the manager to micromanage less and so supports temporal abstraction.