
Evolutionary Reinforcement Learning for Sample-Efficient Multiagent Coordination

Shauharda Khadka * 1 Somdeb Majumdar * 1 Santiago Miret 1 Stephen McAleer 2 Kagan Tumer 3

*Equal contribution. 1 Intel Labs, 2 University of California, Irvine, 3 Oregon State University. Correspondence to: Shauharda Khadka <[email protected]>, Somdeb Majumdar <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 108, 2020. Copyright 2020 by the author(s).

Abstract

Many cooperative multiagent reinforcement learning environments provide agents with a sparse team-based reward, as well as a dense agent-specific reward that incentivizes learning basic skills. Training policies solely on the team-based reward is often difficult due to its sparsity. Furthermore, relying solely on the agent-specific reward is sub-optimal because it usually does not capture the team coordination objective. A common approach is to use reward shaping to construct a proxy reward by combining the individual rewards. However, this requires manual tuning for each environment. We introduce Multiagent Evolutionary Reinforcement Learning (MERL), a split-level training platform that handles the two objectives separately through two optimization processes. An evolutionary algorithm maximizes the sparse team-based objective through neuroevolution on a population of teams. Concurrently, a gradient-based optimizer trains policies to only maximize the dense agent-specific rewards. The gradient-based policies are periodically added to the evolutionary population as a way of information transfer between the two optimization processes. This enables the evolutionary algorithm to use skills learned via the agent-specific rewards toward optimizing the global objective. Results demonstrate that MERL significantly outperforms state-of-the-art methods, such as MADDPG, on a number of difficult coordination benchmarks.

1. Introduction

Cooperative multiagent reinforcement learning (MARL) studies how multiple agents can learn to coordinate as a team toward maximizing a global objective. Cooperative MARL has been applied to many real world applications such as air traffic control (Tumer and Agogino, 2007), multi-robot coordination (Sheng et al., 2006; Yliniemi et al., 2014), communication and language (Lazaridou et al., 2016; Mordatch and Abbeel, 2018), and autonomous driving (Shalev-Shwartz et al., 2016).

Many such environments endow agents with a team reward that reflects the team's coordination objective, as well as an agent-specific local reward that rewards basic skills. For instance, in soccer, dense local rewards could capture agent-specific skills such as passing, dribbling and running. The agents must then coordinate when and where to use these skills in order to optimize the team objective, which is winning the game. Usually, the agent-specific reward is dense and easy to learn from, while the team reward is sparse and requires the cooperation of all or most agents.

Having each agent directly optimize the team reward and ignore the agent-specific reward usually fails or leads to sample-inefficiency for complex tasks due to the sparsity of the team reward. Conversely, having each agent directly optimize the agent-specific reward also fails because it does not capture the team's objective, even with state-of-the-art methods such as MADDPG (Lowe et al., 2017).

One approach to this problem is to use reward shaping, where extensive domain knowledge about the task is used to create a proxy reward function (Rahmattalabi et al., 2016). Constructing this proxy reward function is challenging in complex environments, and is usually domain-dependent. In addition to requiring domain knowledge and manual tuning, this approach also poses risks of changing the underlying problem itself (Ng et al., 1999). Common approaches to creating proxy rewards via linear combinations of the two objectives also fail to solve or generalize to complex coordination tasks (Devlin et al., 2011; Williamson et al., 2009).

In this paper, we introduce Multiagent Evolutionary Reinforcement Learning (MERL), a state-of-the-art algorithm for cooperative MARL that does not require reward shaping. MERL is a split-level training platform that combines gradient-based and gradient-free optimization. The gradient-free optimizer is an evolutionary algorithm that maximizes the team objective through neuroevolution.


The gradient-based optimizer is a policy gradient algorithm that maximizes each agent's dense, local rewards. These gradient-based policies are periodically copied into the evolutionary population, while the two processes operate concurrently and share information through a shared replay buffer.

A key strength of MERL is that it is a general method which does not require domain-specific reward shaping. This is because MERL optimizes the team objective directly while simultaneously leveraging agent-specific rewards to learn basic skills. We test MERL in a number of multiagent coordination benchmarks. Results demonstrate that MERL significantly outperforms state-of-the-art methods such as MADDPG, while using the same observations and reward functions. We also demonstrate that MERL scales gracefully to increasing complexity of coordination objectives where MADDPG and its variants fail to learn entirely.

2. Background and Related Work

Markov Games: A standard reinforcement learning (RL) setting is formalized as a Markov Decision Process (MDP) and consists of an agent interacting with an environment over a finite number of discrete time steps. This formulation can be extended to multiagent systems in the form of partially observable Markov games (Littman, 1994). An N-agent Markov game is defined by a global state of the world, S, and a set of N observations {O_i} and N actions {A_i} corresponding to the N agents. At each time step t, each agent observes its corresponding observation O_i^t and maps it to an action A_i^t using its policy π_i.

Each agent receives a scalar reward r_i^t based on the global state S^t and the joint action of the team. The world then transitions to the next state S^{t+1}, which produces a new set of observations {O_i}. The process continues until a terminal state is reached. R_i = Σ_{t=0}^{T} γ^t r_i^t is the total return for agent i with discount factor γ ∈ (0, 1]. Each agent aims to maximize its expected return.
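
For concreteness, here is a minimal sketch of computing this discounted return for one agent from a list of its per-step rewards (the reward values below are illustrative, not from the paper):

```python
# Minimal sketch: per-agent discounted return R_i = sum_t gamma^t * r_i^t.
def discounted_return(rewards, gamma=0.95):
    total, discount = 0.0, 1.0
    for r in rewards:           # r_i^t for t = 0, ..., T
        total += discount * r
        discount *= gamma
    return total

agent_rewards = [0.0, 0.0, 1.0, 0.0, 1.0]   # hypothetical dense agent-specific rewards
print(discounted_return(agent_rewards))      # 0.95**2 + 0.95**4 ≈ 1.717
```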

TD3: Policy gradient (PG) methods frame the goal of maximizing the expected return as the minimization of a loss function. A widely used PG method for continuous, high-dimensional action spaces is DDPG (Lillicrap et al., 2015). Recently, (Fujimoto et al., 2018) extended DDPG to Twin Delayed DDPG (TD3), addressing its well-known overestimation problem. TD3 is the state-of-the-art, off-policy algorithm for model-free DRL in continuous action spaces.

TD3 uses an actor-critic architecture (Sutton and Barto, 1998), maintaining a deterministic policy (actor) π : S → A and two distinct critics Q : S × A → ℝ. Each critic independently approximates the actor's action-value function Q^π. Separate copies of the actor and critics are kept as target networks for stability and are updated periodically. A noisy version of the actor is used to explore the environment during training. The actor is trained using a noisy version of the sampled policy gradient computed by backpropagation through the combined actor-critic networks. This mitigates overfitting of the deterministic policy by smoothing the policy gradient updates.
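
As a rough illustration of TD3's clipped double-Q target with target-policy smoothing, the sketch below computes a single backup value; all callables (target_actor, target_q1, target_q2) are hypothetical stand-ins rather than the authors' implementation:

```python
import numpy as np

# Sketch: add clipped Gaussian noise to the target action, then back up the
# minimum of the two target critics (the "twin" part of TD3).
def td3_target(reward, next_obs, done, target_actor, target_q1, target_q2,
               gamma=0.95, noise_std=0.2, noise_clip=0.5):
    noise = np.clip(np.random.normal(0.0, noise_std), -noise_clip, noise_clip)
    next_action = np.clip(target_actor(next_obs) + noise, -1.0, 1.0)
    min_q = min(target_q1(next_obs, next_action), target_q2(next_obs, next_action))
    return reward + gamma * (1.0 - done) * min_q

# Toy usage with constant stand-in networks:
y = td3_target(reward=1.0, next_obs=None, done=0.0,
               target_actor=lambda o: 0.3,
               target_q1=lambda o, a: 1.2,
               target_q2=lambda o, a: 0.9)
```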

Evolutionary Reinforcement Learning (ERL) is a hybrid algorithm that combines Evolutionary Algorithms (EAs) (Floreano et al., 2008; Lüders et al., 2017; Fogel, 2006; Spears et al., 1993) with policy gradient methods (Khadka and Tumer, 2018). Instead of discarding the data generated during a standard EA rollout, ERL stores this data in a central replay buffer shared with the policy gradient's own rollouts, thereby increasing the diversity of the data available for the policy gradient learners. Since the EA directly optimizes for episode-wide return, it biases exploration towards states with higher long-term returns. The policy gradient algorithm, which learns using this state distribution, inherits this implicit bias towards long-term optimization. Concurrently, the actor trained by the policy gradient algorithm is inserted into the evolutionary population, allowing the EA to benefit from the fast gradient-based learning.

Related Work: (Lowe et al., 2017) introduced MADDPG, which tackled the inherent non-stationarity of a multiagent learning environment by leveraging a critic which had full access to the joint state and action during training. (Foerster et al., 2018b) utilized a similar setup with a centralized critic across agents to tackle StarCraft micromanagement tasks. An algorithm that could explicitly model other agents' learning was investigated in (Foerster et al., 2018a). However, all these approaches rely on a dense agent reward that properly captures the team objective. Methods to solve for these types of agent-specific reward functions were investigated in (Li et al., 2012) but were limited to tasks with strong simulators where tree-based planning could be used.

A closely related work to MERL is (Liu et al., 2019), where Population-Based Training (PBT) (Jaderberg et al., 2017) is used to automatically optimize the relative importance between a collection of dense, shaped rewards, alongside their discount rates, during training. This can be interpreted as a singular central reward function constructed by scalarizing a collection of reward signals, where the scalarization coefficients and discount rates are adaptively learned during training. PBT-MARL is a powerful method that can leverage complex mixtures of its reward signals throughout training. However, it still relies on finding an ideal mixing function between reward signals to drive learning. In contrast, MERL optimizes its reward functions independently, relying instead on information transfer across them to drive learning. This is facilitated directly through shared replay buffers and policy migration. This form of information transfer through a shared replay buffer has been explored extensively in recent literature (Colas et al., 2018; Khadka et al., 2019).


3. Motivating Example

Figure 1. (a) The rover domain, with a clear misalignment between agent and team reward functions. (b) Comparative performance of MERL against TD3-mixed, TD3-global, and EA.

Consider the rover domain (Agogino and Tumer, 2004), a classic multiagent task where a team of rovers coordinate to explore a region. The team objective is to observe all POIs (Points of Interest). Each robot also receives an agent-specific reward defined as negative distance to the closest POI. In Figure 1(a), a team of two rovers R1 and R2 seek to explore and observe POIs P1 and P2. R1 is closer to P2 and has enough fuel to reach either of the POIs, whereas R2 can only reach P2. There is no communication between the rovers.

If R1 optimizes only locally by pursuing the closer POI P2, then the team objective is not achieved, since R2 can only reach P2. The globally optimal solution for R1 is to spend more fuel and pursue P1, which is misaligned with its locally optimal solution. Figure 1(b) shows the comparative performance of four algorithms, namely TD3-global, TD3-mixed, EA and MERL, on this coordination task.

TD3-global optimizes only the team reward and TD3-mixed optimizes a linear mixture of the individual and team rewards. Since the team reward is sparse, TD3-global fails to learn anything meaningful. In contrast, TD3-mixed, due to the dense agent-specific reward component, successfully learns to perceive and navigate. However, since the mixed reward does not capture the true team objective, it converges to the greedy local policy of R1 pursuing P2.

EA relies on randomly stumbling onto a solution, e.g., a trajectory that takes the rover to a POI. Since EA directly optimizes the team objective, it has a strong preference for the globally optimal solution. However, the probability of the rovers randomly stumbling onto the globally optimal solution is extremely low. The greedy solution is significantly more likely to be found, and is what EA converges to.

MERL combines the core strengths of TD3 and EA. The TD3 component in MERL exploits the dense agent-specific reward to learn perception and navigation skills while being agnostic to the global objective. The task is then reduced to its coordination component: picking the right POI to go to. This is effectively tackled by the EA module within MERL. This ability to independently leverage reward functions across multiple levels, even when they are misaligned, is the core strength of MERL.

4. Multiagent Evolutionary Reinforcement Learning

MERL leverages both agent-specific and team objectives through a hybrid algorithm that combines gradient-free and gradient-based optimization. The gradient-free optimizer is an evolutionary algorithm that maximizes the team objective through neuroevolution. The gradient-based optimizer trains policies to maximize agent-specific rewards. These gradient-based policies are periodically added to the evolutionary population and participate in evolution. This enables the evolutionary algorithm to use agent-specific skills learned by training on the agent-specific rewards toward optimizing the team objective without needing to resort to reward shaping.

Figure 2. Multi-headed team policy

Policy Topology: We represent our multiagent (team) policies using a multi-headed neural network π, as illustrated in Figure 2. The head π^k represents the k-th agent in the team. Given an incoming observation for agent k, only the output of π^k is considered as agent k's response. In essence, all agents act independently based on their own observations while sharing weights (and, by extension, features) in the lower layers (trunk). This is commonly used to improve learning speed (Silver et al., 2017). Further, each agent k also has its own replay buffer (R_k), which stores its experience, defined by the tuple (state, action, next state, local reward), for each interaction with the environment (rollout) involving that agent.
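
A minimal PyTorch-style sketch of such a multi-headed team policy, with a shared trunk and one output head per agent; the layer sizes, tanh activations, and class name are illustrative choices rather than details taken from the released code:

```python
import torch
import torch.nn as nn

class MultiHeadTeamPolicy(nn.Module):
    """Shared trunk + one action head per agent; agents act independently."""
    def __init__(self, obs_dim, act_dim, n_agents, hidden=100):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, hidden), nn.Tanh())
        self.heads = nn.ModuleList(nn.Linear(hidden, act_dim) for _ in range(n_agents))

    def forward(self, obs, agent_k):
        # Only head k's output is used as agent k's response to its own observation.
        return torch.tanh(self.heads[agent_k](self.trunk(obs)))

team = MultiHeadTeamPolicy(obs_dim=8, act_dim=2, n_agents=3)
action_for_agent_0 = team(torch.zeros(1, 8), agent_k=0)
```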

Team Reward Optimization: Figure 3 illustrates the MERL algorithm. A population of M multi-headed teams, each with the same topology, is initialized with random weights. The replay buffer R_k is shared by the k-th agent across all teams. The population is then evaluated for each rollout. The team reward for each team is disbursed at the end of the episode and is considered as its fitness score.


A selection operator selects a portion of the population for survival with probability proportionate to their fitness scores. The weights of the teams in the population are probabilistically perturbed through mutation and crossover operators to create the next generation of teams. A portion of the teams with the highest relative fitness is preserved as elites. At any given time, the team with the highest fitness, or the champion, represents the best solution for the task.
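
A simplified sketch of one such generation over a population of flattened team weight vectors (fitness-proportionate survival, elites, single-point crossover, and Gaussian mutation); the operators here are generic neuroevolution stand-ins, not the exact operators used in MERL:

```python
import numpy as np

def one_generation(population, fitnesses, n_elites=4, mut_frac=0.1, mut_std=0.1):
    """population: list of flat team weight vectors; fitnesses: team returns (fitness)."""
    order = np.argsort(fitnesses)[::-1]                      # rank teams, best first
    elites = [population[i].copy() for i in order[:n_elites]]
    probs = np.asarray(fitnesses, dtype=float)
    probs = probs - probs.min() + 1e-8
    probs = probs / probs.sum()                              # survival proportional to fitness
    next_gen = list(elites)
    while len(next_gen) < len(population):
        a, b = np.random.choice(len(population), size=2, p=probs)
        cut = np.random.randint(1, population[a].size)       # single-point crossover
        child = np.concatenate([population[a][:cut], population[b][cut:]])
        mask = np.random.rand(child.size) < mut_frac         # mutate a fraction of weights
        child[mask] += mut_std * np.random.randn(int(mask.sum()))
        next_gen.append(child)
    return next_gen

# Toy usage on random weight vectors and random fitness scores:
pop = [np.random.randn(50) for _ in range(10)]
pop = one_generation(pop, fitnesses=np.random.rand(10))
```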

Figure 3. High-level schematic of MERL highlighting the integration of local and global reward functions

Policy Gradient: The procedure described so far resembles a standard EA, except that each agent k stores each of its experiences in its associated replay buffer (R_k) instead of just discarding it. However, unlike EA, which only learns based on the low-fidelity global reward, MERL also learns from the experiences within episodes of a rollout using policy gradients. To enable this kind of "local learning", MERL initializes one multi-headed policy network π_pg and one critic Q. A noisy version of π_pg is then used to conduct its own set of rollouts in the environment, storing each agent k's experiences in its corresponding buffer (R_k), similar to the evolutionary rollouts.

Agent-Specific Reward Optimization: Crucially, each agent's replay buffer is kept separate from that of every other agent to ensure diversity amongst the agents. The shared critic samples a random mini-batch uniformly from each replay buffer and uses it to update its parameters using gradient descent. Each agent π_pg^k then draws a mini-batch of experiences from its corresponding buffer (R_k) and uses it to sample a policy gradient from the shared critic. Unlike the teams in the evolutionary population, which directly seek to optimize the team reward, π_pg seeks to maximize the agent-specific local reward while exploiting the experiences collected via evolution.
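
Structurally, this per-agent update might look roughly like the loop below; `critic_update` and `actor_update` are hypothetical stand-ins for the TD3 critic and actor steps, and the dummy transitions are illustrative:

```python
import random

# Each agent k samples only from its own replay buffer R_k, the shared critic is
# updated on every agent's batch, and head pi_pg^k is updated only on agent k's data.
def agent_specific_step(replay_buffers, critic_update, actor_update, batch_size=32):
    for k, buffer_k in enumerate(replay_buffers):      # one buffer per agent
        if len(buffer_k) < batch_size:
            continue
        batch = random.sample(buffer_k, batch_size)    # uniform mini-batch from R_k
        critic_update(batch)                           # shared critic sees all agents' data
        actor_update(k, batch)                         # only head k takes a gradient step

# Toy usage with no-op updates and dummy transitions:
buffers = [[("obs", "act", 0.0, "next_obs")] * 64 for _ in range(3)]
agent_specific_step(buffers, critic_update=lambda b: None, actor_update=lambda k, b: None)
```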

Skill Migration: Periodically, the π_pg network is copied into the evolving population of teams and propagates its features by participating in evolution. This is the core mechanism that combines policies learned via agent-specific and team rewards. Regardless of whether the two rewards are aligned, evolution ensures that only the performant derivatives of the migrated network are retained. This mechanism guarantees protection against the destructive interference commonly seen when a direct scalarization between two reward functions is attempted. Further, the level of information exchange is automatically adjusted during the process of learning, in contrast to being manually tuned by an expert.
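
A minimal sketch of the migration step, copying the policy-gradient team over the weakest member of the evolutionary population (illustrative only; Algorithm 1 below gives the authors' full procedure):

```python
import copy

def migrate(population, fitnesses, pg_team):
    """Overwrite the least-fit team in the population with a copy of the PG team.
    Whether its descendants survive is left to the subsequent selection step,
    which is what protects against destructive interference between rewards."""
    weakest = min(range(len(population)), key=lambda i: fitnesses[i])
    population[weakest] = copy.deepcopy(pg_team)
    return population
```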

Algorithm 1 MERL

1: Initialize a population of M multi-head teams pop_π with k agents each and initialize their weights θ^π
2: Initialize a shared critic Q with weights θ^Q
3: Initialize an ensemble of N empty cyclic replay buffers R_k, one for each agent
4: Define a white Gaussian noise generator W_g and a random number generator r() ∈ [0, 1)
5: for generation = 1, ∞ do
6:     for team π ∈ pop_π do
7:         g, R = Rollout(π, R, noise=None, ξ)
8:         _, R = Rollout(π, R, noise=W_g, ξ = 1)
9:         Assign g as π's fitness
10:    end for
11:    Rank the population pop_π based on fitness scores
12:    Select the first e teams π ∈ pop_π as elites
13:    Select the remaining (M − e) teams π from pop_π to form Set S using tournament selection
14:    while |S| < (M − e) do
15:        Single-point crossover between a randomly sampled π ∈ e and π ∈ S and append to S
16:    end while
17:    for Agent k = 1, N do
18:        Randomly sample a minibatch of T transitions (o_i, a_i, l_i, o_{i+1}) from R_k
19:        Compute y_i = l_i + γ min_{j=1,2} Q'_j(o_{i+1}, a~ | θ^{Q'_j})
20:            where a~ = π'_pg(k, o_{i+1} | θ^{π'_pg}) + ε  [action sampled from the k-th head of π'_pg]
21:        Update Q by minimizing the loss: L = (1/T) Σ_i (y_i − Q(o_i, a_i | θ^Q))^2
22:        Update π_pg^k using the sampled policy gradient
23:            ∇_{θ^{π_pg}} J ≈ (1/T) Σ ∇_a Q(o, a | θ^Q)|_{o=o_i, a=a_i} ∇_{θ^{π_pg}} π_pg^k(o | θ^{π_pg})|_{o=o_i}
24:        Soft update target networks:
25:            θ^{π'} ⇐ τ θ^π + (1 − τ) θ^{π'}
26:            θ^{Q'} ⇐ τ θ^Q + (1 − τ) θ^{Q'}
27:    end for
28:    Migrate the policy gradient team: for the weakest π ∈ pop_π: θ^π ⇐ θ^{π_pg}
29: end for


Figure 4. Environments tested (Lowe et al., 2017; Rahmattalabi et al., 2016): (a) Predator-Prey, (b) Physical Deception, (c) Keep-Away, (d) Rover Domain.

Algorithm 1 provides a detailed pseudo-code of the MERL algorithm. The choice of hyperparameters is explained in the Appendix. Additionally, our source code is available online at https://tinyurl.com/y6erclts.
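
Putting the pieces together, the generation loop of Algorithm 1 has roughly the following shape; this is a high-level skeleton with hypothetical helper functions, not the released code:

```python
# High-level skeleton of one MERL run (Algorithm 1). evaluate_team, td3_updates,
# evolve and migrate are hypothetical stand-ins for the corresponding steps.
def merl(population, pg_team, replay_buffers, generations,
         evaluate_team, td3_updates, evolve, migrate):
    for generation in range(generations):
        # Evolutionary rollouts: fitness is the sparse team reward; every agent's
        # experience also lands in its shared per-agent replay buffer.
        fitnesses = [evaluate_team(team, replay_buffers) for team in population]
        # Policy-gradient rollouts and TD3 updates on the dense agent-specific rewards.
        evaluate_team(pg_team, replay_buffers)
        td3_updates(pg_team, replay_buffers)
        # Selection, crossover and mutation driven only by the team-reward fitness.
        population = evolve(population, fitnesses)
        # Skill migration: periodically copy the policy-gradient team into the population.
        population = migrate(population, fitnesses, pg_team)
    return population
```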

5. Experiments

We adopt environments from (Lowe et al., 2017) and (Rahmattalabi et al., 2016) to perform our experiments. Each environment consists of multiple agents and landmarks in a two-dimensional world. Agents take continuous control actions to move about the world. Figure 4 illustrates the four environments, which are described in more detail below.

Predator-Prey: N slower cooperating agents (predators) must chase the faster adversary (prey) around an environment with L large landmarks in randomly-generated locations. The predators get a reward when they catch (touch) the prey, while the prey is penalized. The team reward for the predators is the cumulative number of prey-touches in an episode. Each predator can also compute the average distance to the prey and use it as its agent-specific reward. All agents observe the relative positions and velocities of the other agents, as well as the positions of the landmarks. The prey can accelerate 33% faster than the predators and has a higher top speed. We test two versions, termed simple and hard predator-prey, where the prey is 30% and 100% faster, respectively. Additionally, the prey itself learns dynamically during training. We use DDPG (Lillicrap et al., 2015) as the learning algorithm for training the prey policy. All of our candidate algorithms are tested on their ability to train the team of predators in catching this prey.

Physical Deception: N agents cooperate to reach a single target Point of Interest (POI) among N POIs. They are rewarded based on the closest distance of any agent to the target. A lone adversary also desires to reach the target POI. However, the adversary does not know which of the POIs is the correct one. Thus the cooperating agents must learn to spread out and cover all POIs so as to deceive the adversary, as they are penalized based on the adversary's distance to the target. The team reward for the agents is then the cumulative reward in an episode. We use DDPG (Lillicrap et al., 2015) to train the adversary policy.

Keep-Away: In this scenario, a team of N cooperating agents must reach a target POI out of L total POIs. Each agent is rewarded based on its distance to the target. We construct the team reward as simply the sum of the agent-specific rewards in an episode. An adversary also has to occupy the target while keeping the cooperating agents from reaching the target by pushing them away. To incentivize this behavior, the adversary is rewarded based on its distance to the target POI and penalized based on the distance of the target from the nearest cooperating agent. Additionally, it does not know which of the POIs is the target and must infer this from the movement of the agents. DDPG (Lillicrap et al., 2015) is used to train the adversary policy.

Rover Domain: This environment is adapted from (Rahmattalabi et al., 2016). Here, N agents must cooperate to reach a set of K POIs. Multiple agents need to simultaneously go to the same POI in order to observe it. The number of agents required to observe a POI is termed the coupling requirement. Agents do not know this number and must infer the coupling factor from the rewards obtained. If a team with fewer agents than this number goes to a POI, no reward is observed. The team's reward is the percentage of POIs observed at the end of an episode.

Each agent can also locally compute its distance to its closest POI and use it as its agent-specific reward. Its observation comprises two channels, to detect POIs and rovers, respectively. Each channel receives intensity information over a 10° resolution spanning the 360° around the agent's position, loosely based on the characteristics of a Pioneer robot (Thrun et al., 2000). This is similar to a LIDAR. Since it returns the closest reflector, occlusions make the problem partially observable. A coupling factor of 1 is similar to the cooperative navigation task in (Lowe et al., 2017). We test coupling factors from 1 to 7 to capture extremely complex coordination objectives.
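
As an illustration of the 10° binning described above, the sketch below builds one observation channel by keeping only the strongest (closest) return per angular bin; the geometry, intensity formula, and sensing range are simplified assumptions, not the environment's exact implementation:

```python
import numpy as np

def lidar_channel(agent_xy, points_xy, n_bins=36, max_range=10.0):
    """One channel: for each 10-degree bin around the agent, return an intensity
    based on the closest detected point in that bin (farther points are occluded)."""
    channel = np.zeros(n_bins)
    for px, py in points_xy:
        dx, dy = px - agent_xy[0], py - agent_xy[1]
        dist = np.hypot(dx, dy)
        if dist > max_range:
            continue
        angle = np.degrees(np.arctan2(dy, dx)) % 360.0
        b = int(angle // (360.0 / n_bins))
        channel[b] = max(channel[b], 1.0 / (1.0 + dist))   # keep the closest return
    return channel

poi_channel = lidar_channel(agent_xy=(0.0, 0.0), points_xy=[(1.0, 1.0), (3.0, -2.0)])
```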

Compared Baselines: We compare the performance of MERL with a standard neuroevolutionary algorithm (EA) (Fogel, 2006), MADDPG (Lowe et al., 2017), and MATD3, a variant of MADDPG that integrates the improvements described within TD3 (Fujimoto et al., 2018) over DDPG. Internally, MERL uses EA and TD3 as its team-reward and agent-specific reward optimizer, respectively. MADDPG was chosen as it is the state-of-the-art multiagent RL algorithm. We implemented MATD3 to ensure that the differences between MADDPG and MERL do not originate from having the more stable TD3 over DDPG.

Further, we implement MADDPG and MATD3 using either only the global team reward or mixed rewards, where the local, agent-specific rewards were added to the global reward. The local reward function was simply defined as the negative of the distance to the closest POI. In this setting, we sweep over different scaling factors to weigh the two rewards, an approach commonly used to shape rewards. These experimental variations allow us to evaluate the efficacy of the differentiating features of MERL, as opposed to improvements that might come from other ways of combining reward functions.

Methodology for Reported Metrics: For MATD3 and MADDPG, the team network was periodically tested on 10 task instances without any exploratory noise. The average score was logged as its performance. For MERL and EA, the team with the highest fitness was chosen as the champion for each generation. The champion was then tested on 10 task instances, and the average score was logged. This protocol shielded the reported metrics from any bias of the population size. We conduct 5 statistically independent runs with random seeds from {2019, 2023} and report the average with error bars showing a 95% confidence interval. All scores reported are compared against the number of environment steps (frames). A step is defined as the multiagent team taking a joint action and receiving feedback from the environment. To make the comparisons fair across single-team and population-based algorithms, all steps taken by all teams in the population are counted cumulatively.

6. Results

Predator-Prey: Figure 5 shows the comparative performance in controlling the team of predators in the Predator-Prey environment. Note that this is an adversarial environment where the prey dynamically adapts against the predators. The prey (considered as part of the environment in this analysis) uses DDPG to learn constantly against our team of predators. This is why predator performance (measured as the number of prey touches) exhibits ebb and flow during learning. MERL outperforms MATD3, EA, and MADDPG across both simple and hard variations of the task. EA seems to be approaching MERL's performance, but is significantly slower to learn. This is expected behavior for neuroevolutionary methods, which are known to be sample-inefficient. In contrast, MERL, by virtue of its fast policy-gradient components, learns significantly faster.

Figure 5. Results on Predator-Prey where the prey is (a) 30% faster and (b) 100% faster.

Physical Deception: Figure 6 (left) shows the comparative performance in controlling the team of agents in the Physical Deception environment. The performance here is largely based on how close the adversary comes to the target POI. Since the adversary starts out untrained, all compared algorithms start out with a fairly high score. As the adversary gradually learns to infer and move towards the target POI, MATD3 and MADDPG demonstrate a gradual decline in performance. However, MERL and EA are able to hold their performance by concocting effective counter-strategies for deceiving the adversary. EA reaches the same performance as MERL, but is slower to learn.

Keep-Away: Figure 6 (right) shows the comparative performance in Keep-Away. Similar to Physical Deception, MERL and EA are able to hold performance by attaining good counter-measures against the adversary, while MATD3 and MADDPG fail to do so. However, EA slightly outperforms MERL on this task.

Figure 6. Results on (a) Physical Deception and (b) Keep-Away.

Rover Domain: Figure 7 shows the comparative performance of MERL, MADDPG, MATD3, and EA tested in the rover domain with coupling factors 1, 3 and 7. In order to benchmark against proxy reward functions that use scalarized linear combinations, we test MADDPG and MATD3 with two variations of reward functions.


Figure 7. Performance on the Rover Domain with (a) coupling 1, (b) coupling 3, and (c) coupling 7; (d) legend.

Global represents the scenario where the agents only receive the sparse team reward as their reinforcement signal. Mixed represents the scenario where the agents receive a linear combination of the team reward and agent-specific reward. Each reward is normalized before being combined. A weighing coefficient of 10 is used to amplify the team reward's influence in order to counter its sparsity. The weighing coefficient was tuned using a grid search (see Figure 8).
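
The mixed-reward baseline described above amounts to something like the following; the normalization scheme and example values are illustrative stand-ins, while the weighing coefficient of 10 comes from the text:

```python
def mixed_reward(team_reward, agent_reward, team_scale=1.0, agent_scale=1.0, team_coeff=10.0):
    """Linear scalarization used by the 'mixed' baselines: normalize each signal by a
    (here fixed, illustrative) scale, then amplify the sparse team reward by team_coeff."""
    return team_coeff * (team_reward / team_scale) + agent_reward / agent_scale

# Hypothetical step: no team reward yet, dense local reward of -0.8 (negative distance).
r = mixed_reward(team_reward=0.0, agent_reward=-0.8)
```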

MERL significantly outperforms all baselines across all coupling requirements. The tested baselines clearly degrade quickly beyond a coupling of 3. The increasing coupling requirement is equivalent to increasing difficulty in joint-space exploration and entanglement in the team objective. However, it does not increase the size of the state-space, complexity of perception, or navigation. This indicates that the degradation in performance is strictly due to the increase in complexity of the team objective.

Notably, MERL is able to learn on coupling greater than n = 6, where methods without explicit reward shaping have been shown to fail entirely (Rahmattalabi et al., 2016). MERL successfully completes the task using the same set of information and coarse, unshaped reward functions as the other algorithms. This is mainly due to MERL's split-level approach, which allows it to leverage the agent-specific reward function to solve navigation and perception while concurrently using the team-reward function to learn team formation and effective coordination.

Scalarization Coefficients for Mixed Rewards: Figure 8 shows the performance of MATD3 in optimizing mixed rewards computed with different coefficients used to amplify the team reward relative to the agent reward. The results demonstrate that finding a good balance between these two rewards through linear scalarization is difficult, as all values tested fail to make any progress in the task. This is because a static scalarization cannot capture the dynamic properties of which reward is important when, and instead leads to an ineffective proxy. In contrast, MERL is able to leverage both reward functions without the need to explicitly combine them, either linearly or via more complex mixing functions.

Team Behaviors: Figure 9 illustrates the trajectories generated for the Rover Domain with a coupling of n = 3.

Figure 8. MATD3 with varying scalarization coefficients

The animations can also be viewed at our code repository online (https://tinyurl.com/ugcycjk). The trajectories for a team fully trained with MERL are shown in Figure 9(a). Here, team formation and collaborative pursuit of the POIs is immediately apparent. Two teams of 3 agents each form at the start of the episode. Further, the two teams also coordinate to pursue different POIs in order to maximize the team reward. While not perfect (the bottom POI is left unobserved), they do succeed in observing 3 out of the 4 POIs.

Figure 9. Agent trajectories for coupling = 3 under (a) MERL and (b) MADDPG. Golden/black circles are observed/unobserved POIs, respectively.



Figure 10. Selection rate for migrating policies

In contrast, MADDPG-mixed (shown in Figure 9(b)) fails to observe any POI. From the trajectories, it is apparent that the agents have successfully learned to perceive and navigate to reach POIs. However, they are unable to use this skill towards fulfilling the team objective. Instead, each agent appears split on the objective that it is optimizing. Some agents seem to be in sole pursuit of POIs without any regard for team formation or collaboration, while others seem to exhibit random movements. The primary reason for this is the mixed reward function that directly combines the agent-specific and team reward functions. Since the two reward functions have no guarantees of alignment across the state-space of the task, they invariably lead to learning these sub-optimal joint behaviors that solve a certain form of scalarized mixed objective. In contrast, MERL, by virtue of its bi-level optimization framework, is able to leverage both reward functions without the need to explicitly combine them. This enables MERL to avoid these sub-optimal policies and solve the task without any reward shaping or manual tuning.

Conditional Selection Rate: We ran experiments tracking whether the policies migrated from the policy gradient learners to the evolutionary population were selected or discarded during the subsequent selection process (Figure 10). Mathematically, this is equivalent to the conditional probability of selection (sel), given that the individual was a migrant this generation (mig), represented as P(sel | mig). The expected selection rate P(sel) under uniform random selection is 0.47 (see Appendix B). A conditional selection rate above this baseline indicates that migration is a positive predictor for selection, and vice-versa. The conditional selection rates are distributed across both sides of this line, varying by task. The lowest selection rate is seen for Keep-Away, where evolution virtually discards all migrated individuals. This is consistent with the performance in Keep-Away (Figure 6b), where EA outperforms all other baselines while the policy-gradient methods struggle.
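
Tracking this metric amounts to counting how often a just-migrated team survives the next selection step, as in the sketch below (illustrative bookkeeping, not the authors' logging code):

```python
def conditional_selection_rate(events):
    """events: (was_migrant, was_selected) pairs, one per individual per generation.
    Returns the empirical P(sel | mig)."""
    migrants = [sel for mig, sel in events if mig]
    return sum(migrants) / len(migrants) if migrants else float('nan')

# Hypothetical log: three migrants, two of which survived the next selection step.
print(conditional_selection_rate([(True, True), (True, False), (True, True), (False, True)]))
```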

Both of the predator-prey tasks have consistently high conditional selection rates, indicating positive information transfer from the policy gradient to the evolutionary population throughout training. This is consistent with Figure 5, where MERL outperforms all other baselines. For a dynamic environment such as predator-prey, where the prey is consistently adapting against the agents, such information transfer is crucial for sustaining success. For physical deception, as well as the rover domain, the evolutionary population initially benefits heavily from the migrating individuals (consistent with Figure 6(a)). However, as this information propagates through the population, the marginal benefit from migration wanes through the course of learning. This form of adaptive information transfer is a key characteristic of MERL, enabling it to integrate two optimization processes towards maximizing the team objective without being erroneously misguided by either.

7. Conclusion

We introduced MERL, a hybrid multiagent reinforcement learning algorithm that leverages both agent-specific and team objectives by combining gradient-based and gradient-free optimization. MERL achieves this by using a fast policy-gradient optimizer to exploit dense agent-specific rewards while concurrently leveraging neuroevolution to tackle the team objective. Periodic migration and a shared replay buffer ensure consistent information flow between the two processes. This enables MERL to leverage both modes of learning towards tackling the global team objective.

Results demonstrate that MERL significantly outperforms MADDPG, the state-of-the-art MARL method, in a wide array of benchmarks. We also tested a modification of MADDPG to integrate TD3, the state-of-the-art single-agent RL algorithm. These experiments demonstrate that the core improvements of MERL come from its ability to leverage team and agent-specific rewards without the need to explicitly combine them. This differentiates MERL from other approaches like reward scalarization and reward shaping, which either require extensive manual tuning or can detrimentally change the MDP itself (Ng et al., 1999).

In this paper, MERL expended a constant amount of resources across its policy-gradient and EA components throughout training. Exploring automated methods to dynamically allocate computational resources among the two processes based on the conditional selection rate is an exciting thread for further investigation. Other future threads will explore MERL for settings such as Pommerman (Resnick et al., 2018), StarCraft (Risi and Togelius, 2017; Vinyals et al., 2017), RoboCup (Kitano et al., 1995), football (Kurach et al., 2019), and general multi-reward settings such as multitask learning.


References

A. K. Agogino and K. Tumer. Unifying temporal and structural credit assignment problems. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 2, pages 980-987. IEEE Computer Society, 2004.

C. Colas, O. Sigaud, and P.-Y. Oudeyer. GEP-PG: Decoupling exploration and exploitation in deep reinforcement learning algorithms. arXiv preprint arXiv:1802.05054, 2018.

S. Devlin, M. Grzes, and D. Kudenko. Multi-agent, reward shaping for RoboCup KeepAway. In The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 3, pages 1227-1228. International Foundation for Autonomous Agents and Multiagent Systems, 2011.

D. Floreano, P. Dürr, and C. Mattiussi. Neuroevolution: from architectures to learning. Evolutionary Intelligence, 1(1):47-62, 2008.

J. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and I. Mordatch. Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 122-130. International Foundation for Autonomous Agents and Multiagent Systems, 2018a.

J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018b.

D. B. Fogel. Evolutionary computation: toward a new philosophy of machine intelligence, volume 1. John Wiley & Sons, 2006.

S. Fujimoto, H. van Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.

M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.

S. Khadka and K. Tumer. Evolution-guided policy gradient in reinforcement learning. In Advances in Neural Information Processing Systems, pages 1196-1208, 2018.

S. Khadka, S. Majumdar, T. Nassar, Z. Dwiel, E. Tumer, S. Miret, Y. Liu, and K. Tumer. Collaborative evolutionary reinforcement learning. arXiv preprint arXiv:1905.00976v2, 2019.

H. Kitano, M. Asada, Y. Kuniyoshi, I. Noda, and E. Osawa. RoboCup: The robot world cup initiative, 1995.

K. Kurach, A. Raichuk, P. Stanczyk, M. Zajac, O. Bachem, L. Espeholt, C. Riquelme, D. Vincent, M. Michalski, O. Bousquet, et al. Google Research Football: A novel reinforcement learning environment. arXiv preprint arXiv:1907.11180, 2019.

A. Lazaridou, A. Peysakhovich, and M. Baroni. Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182, 2016.

F.-D. Li, M. Wu, Y. He, and X. Chen. Optimal control in microgrid using multi-agent reinforcement learning. ISA Transactions, 51(6):743-751, 2012.

T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pages 157-163. Elsevier, 1994.

S. Liu, G. Lever, J. Merel, S. Tunyasuvunakool, N. Heess, and T. Graepel. Emergent coordination through competition. arXiv preprint arXiv:1902.07151, 2019.

R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379-6390, 2017.

B. Lüders, M. Schläger, A. Korach, and S. Risi. Continual and one-shot learning through neural networks with dynamic external memory. In European Conference on the Applications of Evolutionary Computation, pages 886-901. Springer, 2017.

B. L. Miller and D. E. Goldberg. Genetic algorithms, tournament selection, and the effects of noise. In Complex Systems. Citeseer, 1995.

I. Mordatch and P. Abbeel. Emergence of grounded compositional language in multi-agent populations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278-287, 1999.

A. Rahmattalabi, J. J. Chung, M. Colby, and K. Tumer. D++: Structural credit assignment in tightly coupled multiagent domains. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4424-4429. IEEE, 2016.

C. Resnick, W. Eldridge, D. Ha, D. Britz, J. Foerster, J. Togelius, K. Cho, and J. Bruna. Pommerman: A multi-agent playground. arXiv preprint arXiv:1809.07124, 2018.

S. Risi and J. Togelius. Neuroevolution in games: State of the art and open challenges. IEEE Transactions on Computational Intelligence and AI in Games, 9(1):25-41, 2017.

T. Salimans, J. Ho, X. Chen, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

S. Shalev-Shwartz, S. Shammah, and A. Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.

W. Sheng, Q. Yang, J. Tan, and N. Xi. Distributed multi-robot coordination in area exploration. Robotics and Autonomous Systems, 54(12):945-955, 2006.

D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.

W. M. Spears, K. A. De Jong, T. Bäck, D. B. Fogel, and H. De Garis. An overview of evolutionary computation. In European Conference on Machine Learning, pages 442-459. Springer, 1993.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

S. Thrun, W. Burgard, and D. Fox. A real-time algorithm for mobile robot mapping with applications to multi-robot and 3D mapping. In ICRA, volume 1, pages 321-328, 2000.

K. Tumer and A. Agogino. Distributed agent-based air traffic flow management. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, page 255. ACM, 2007.

O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, et al. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782, 2017.

S. A. Williamson, E. H. Gerding, and N. R. Jennings. Reward shaping for valuing communications during multi-agent coordination. In Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, pages 641-648. International Foundation for Autonomous Agents and Multiagent Systems, 2009.

L. Yliniemi, A. K. Agogino, and K. Tumer. Multirobot coordination for space exploration. AI Magazine, 35(4):61-74, 2014.


A. Hyperparameters Description

Table 1. Hyperparameters used for Predator-Prey, Keep-Away and Physical Deception

Hyperparameter          MERL        MATD3/MADDPG
Population size         10          N/A
Rollout size            10          10
Target weight           0.01        0.01
Actor Learning Rate     0.01        0.01
Critic Learning Rate    0.01        0.01
Discount Factor         0.95        0.95
Replay Buffer Size      1e6         1e6
Batch Size              1024        1024
Mutation Prob           0.9         N/A
Mutation Fraction       0.1         N/A
Mutation Strength       0.1         N/A
Super Mutation Prob     0.05        N/A
Reset Mutation Prob     0.05        N/A
Number of elites        4           N/A
Exploration Policy      N(0, σ)     N(0, σ)
Exploration Noise       0.4         0.4
Rollouts per fitness    10          N/A
Actor Architecture      [100, 100]  [100, 100]
Critic Architecture     [100, 100]  [300, 300]
TD3 Noise Variance      0.2         0.2
TD3 Noise Clip          0.5         0.5
TD3 Update Freq         2           2

Table 1 details the hyperparameters used for MERL, MATD3, and MADDPG in tackling predator-prey and cooperative navigation. The hyperparameters were inherited from (Lowe et al., 2017) to match the original experiments for MADDPG and MATD3. The only exception to this was the use of hyperbolic tangent instead of ReLU activation functions.

Table 2 details the hyperparameters used for MERL, MATD3, and MADDPG in the rover domain.

Table 2. Hyperparameters used for Rover Domain
Hyperparameter           MERL        MATD3/MADDPG
Population size          10          N/A
Rollout size             50          50
Target weight            1e-5        1e-5
Actor Learning Rate      5e-5        5e-5
Critic Learning Rate     1e-5        1e-5
Discount Factor          0.5         0.97
Replay Buffer Size       1e5         1e5
Batch Size               512         512
Mutation Prob            0.9         N/A
Mutation Fraction        0.1         N/A
Mutation Strength        0.1         N/A
Super Mutation Prob      0.05        N/A
Reset Mutation Prob      0.05        N/A
Number of elites         4           N/A
Exploration Policy       N(0, σ)     N(0, σ)
Exploration Noise σ      0.4         0.4
Rollouts per fitness ξ   10          N/A
Actor Architecture       [100, 100]  [100, 100]
Critic Architecture      [100, 100]  [300, 300]
TD3 Noise Variance       0.2         0.2
TD3 Noise Clip           0.5         0.5
TD3 Update Frequency     2           2

The hyperparameters themselves are defined below:

• Optimizer = Adam
The Adam optimizer was used to update both the actor and critic networks for all learners.

• Population size M
This parameter controls the number of different actors (policies) that are present in the evolutionary population.

• Rollout size
This parameter controls the number of rollout workers (each running an episode of the task) per generation.

Note: The two parameters above (population size M and rollout size) collectively modulate the proportion of exploration carried out through noise in the actor's parameter space and its action space.

• Target weight τ
This parameter controls the magnitude of the soft update between the actor and critic networks, and their target counterparts.

• Actor Learning Rate
This parameter controls the learning rate of the actor network.

• Critic Learning Rate
This parameter controls the learning rate of the critic network.

• Discount Rate
This parameter controls the discount rate used to compute the return optimized by the policy gradient.

• Replay Buffer Size
This parameter controls the size of the replay buffer. After the buffer is filled, the oldest experiences are deleted in order to make room for new ones.

• Batch Size
This parameter controls the batch size used to compute the gradients.


• Actor Activation Function
Hyperbolic tangent was used as the activation function.

• Critic Activation Function
Hyperbolic tangent was used as the activation function.

• Number of Elites
This parameter controls the fraction of the population that is categorized as elites. Since an elite individual (actor) is shielded from the mutation step and preserved as it is, the elite fraction modulates the degree of exploration/exploitation within the evolutionary population.

• Mutation Probability
This parameter represents the probability that an actor goes through a mutation operation between generations.

• Mutation Fraction
This parameter controls the fraction of the weights in a chosen actor (neural network) that are mutated, once the actor is chosen for mutation.

• Mutation Strength
This parameter controls the standard deviation of the Gaussian operation that comprises the mutation (see the sketch after this list).

• Super Mutation Probability
This parameter controls the probability that a super mutation (larger mutation) happens in place of a standard mutation.

• Reset Mutation Probability
This parameter controls the probability that a neural weight is instead reset by re-sampling from N(0, 1) rather than being mutated.

• Exploration Noise
This parameter controls the standard deviation of the Gaussian operation that comprises the noise added to the actor's actions during exploration by the learners (learner rollouts).

• TD3 Policy Noise Variance
This parameter controls the standard deviation of the Gaussian operation that comprises the noise added to the policy output before applying the Bellman backup. This is often referred to as the magnitude of policy smoothing in TD3.

• TD3 Policy Noise Clip
This parameter controls the maximum norm of the policy noise used to smooth the policy.

• TD3 Policy Update Frequency
This parameter controls the number of critic updates per policy update in TD3.
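
A sketch of how the mutation-related parameters above could interact within a single mutation operator; this is a generic neuroevolution-style mutation under stated assumptions (e.g. the super-mutation magnitude), not the released implementation:

```python
import numpy as np

def mutate(weights, mut_prob=0.9, mut_frac=0.1, mut_strength=0.1,
           super_mut_prob=0.05, reset_prob=0.05, super_mut_factor=10.0):
    """Perturb a random fraction of the weights with Gaussian noise; occasionally
    apply a larger 'super' mutation or reset a weight by re-sampling from N(0, 1).
    super_mut_factor is an illustrative magnitude, not a value from the paper."""
    w = weights.copy()
    if np.random.rand() > mut_prob:                              # Mutation Prob: skip entirely
        return w
    chosen = np.flatnonzero(np.random.rand(w.size) < mut_frac)   # Mutation Fraction
    for i in chosen:
        roll = np.random.rand()
        if roll < reset_prob:                                    # Reset Mutation Prob
            w[i] = np.random.randn()
        elif roll < reset_prob + super_mut_prob:                 # Super Mutation Prob
            w[i] += np.random.randn() * mut_strength * super_mut_factor
        else:                                                    # standard Gaussian mutation
            w[i] += np.random.randn() * mut_strength
    return w

mutated = mutate(np.random.randn(100))
```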

B. Expected Selection Rate

We use a multi-step selection process inside MERL. First, we select the e top-performing individuals as elites, sequentially from the population without replacement. Then we conduct tournament selection (Miller and Goldberg, 1995) with tournament size t, with replacement, from the entire population including the elites. t is set to 3 in this paper. The candidates selected from the tournament selection are pruned for duplicates, and the resulting set is carried over as the offspring for this generation. The combined set of offspring and elites represents the candidates selected for that generation of evolution.

The expected selection rate P(sel) is defined as the probability of selection for an individual if the selection were random. This is equivalent to conducting selection using a random ranking of the population, where the fitness scores are ignored and a random number is assigned as an individual's fitness. Note that the selection rate is not the probability of selection for a random policy (an individual with random neural network weights) inserted into the evolutionary population. The probability of selection for such a random policy would be extremely low, as the other individuals in the population would be ranked significantly higher.

In order to compute the expected selection rate, we need to compute it for the elite set and the offspring set. The expected selection rate for the elite set is given by e/M. The expected selection rate for the offspring set involves multiple rounds of tournament selection with replacement, followed by pruning of duplicates to compute the combined set along with the elites. We computed this expectation empirically using an experiment with 1,000,000 iterations and found the expected selection rate to be 0.47.
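
A short Monte-Carlo simulation can reproduce this kind of estimate in spirit; the population size, elite count, and tournament size below follow the values in this paper, but the pruning details are assumptions, so the exact figure differs slightly from the paper's 0.47:

```python
import random

def expected_selection_rate(M=10, e=4, t=3, iters=100_000):
    """Monte-Carlo estimate of P(sel) under a random fitness ranking: e elites are kept,
    then (M - e) tournament winners (size t, with replacement over the whole population)
    are added and duplicates pruned."""
    total = 0
    for _ in range(iters):
        rank = list(range(M))
        random.shuffle(rank)                      # rank[i]: random fitness rank of individual i
        selected = set(sorted(range(M), key=lambda i: rank[i])[:e])   # elites
        for _ in range(M - e):                    # tournament selection for the offspring
            contenders = [random.randrange(M) for _ in range(t)]
            selected.add(min(contenders, key=lambda i: rank[i]))
        total += len(selected)
    return total / (iters * M)

# Roughly 0.5 under these simplified assumptions; the paper's exact procedure gives 0.47.
print(expected_selection_rate())
```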

C. Extended EA Benchmarks

We conducted some additional experiments varying the population sizes used for the EA baseline. The purpose of these experiments is to investigate whether larger population sizes (as is the norm for EA algorithms) can alleviate the need for the policy-gradient module within MERL.

Additionally, we also investigated Evolutionary Strategies (ES) (Salimans et al., 2017), which has been widely adopted in the community in recent years. We perform hyperparameter sweeps to tune this baseline. All results are reported in the rover domain with a coupling of 3.

C.1. Evolutionary Strategies (ES)

ES Population Sweep: Figure 11 (left) compares ES with varying population sizes in the rover domain with a coupling of 3. Sigma for all ES runs is set to 0.1. Among the ES runs, a population size of 100 yields the best results, converging to 0.1 in 100 million frames. MERL (red), on the other hand, is run for 2 million frames and converges to 0.48.

Figure 11. (a) Evolutionary Strategies population size sweep on the rover domain with a coupling of 3. (b) Evolutionary Strategies noise magnitude (sigma) sweep on the rover domain with a coupling of 3.

ES Noise Sweep: Apart from the population size, a key hyperparameter for ES is the variance of the perturbation factor (sigma). We run a parameter sweep for sigma and report results in Figure 11 (right). We do not see a great deal of improvement with the change of sigma.

Figure 12. Evolutionary Algorithm population size sweep on the rover domain with a coupling of 3. MERL was run for 2 million steps, while the other EA runs were run for 100 million steps.

C.2. EA Population Size

Next, we conduct an experiment to evaluate the efficacy of different population sizes, from 10 to 1,000, for the evolutionary algorithm used in the paper. All results are reported for the rover domain with a coupling factor of 3 and are illustrated in Figure 12. The best EA performance was found for a population size of 100, reaching 0.3 in 100 million time steps. Compare this to MERL, which reaches a performance of 0.48 in only 2 million time steps. This demonstrates a key thesis behind MERL: the efficacy of the guided-evolution approach over purely evolutionary approaches.

D. Rollout Methodology

Algorithm 2 describes an episode of rollout under MERL, detailing the connections between the local reward, global reward, and the associated replay buffer.

Algorithm 2 Rollout

1: function Rollout(π, R, noise, ξ)
2:     fitness = 0
3:     for j = 1:ξ do
4:         Reset environment and get initial joint state js
5:         while env is not done do
6:             Initialize an empty list of joint actions ja = []
7:             for each agent (actor head) π^k ∈ π and s^k in js do
8:                 ja ⇐ ja ∪ π^k(s^k | θ^{π^k}) + noise_t
9:             end for
10:            Execute ja and observe joint local reward jl, global reward g and joint next state js'
11:            for each replay buffer R_k ∈ R and s^k, a^k, l^k, s'^k in js, ja, jl, js' do
12:                Append transition (s^k, a^k, l^k, s'^k) to R_k
13:            end for
14:            js = js'
15:            if env is done then
16:                fitness ← g
17:            end if
18:        end while
19:    end for
20:    Return fitness/ξ, R
21: end function