

An Atari Model Zoo for Analyzing, Visualizing, and Comparing Deep Reinforcement Learning Agents

Felipe Petroski Such1, Vashisht Madhavan1, Rosanne Liu1, Rui Wang1

Pablo Samuel Castro2, Yulun Li1, Jiale Zhi1, Ludwig Schubert3, Marc G. Bellemare2, Jeff Clune1, Joel Lehman1∗

1Uber AI Labs, 2Google Brain, 3OpenAI

Abstract

Much human and computational effort has aimed to improve how deep reinforcement learning (DRL) algorithms perform on benchmarks such as the Atari Learning Environment. Comparatively less effort has focused on understanding what has been learned by such methods, and on investigating and comparing the representations learned by different families of DRL algorithms. Sources of friction include the onerous computational requirements and the general logistical and architectural complications of running DRL algorithms at scale. We lessen this friction by (1) training several algorithms at scale and releasing trained models, (2) integrating with a previous DRL model release, and (3) releasing code that makes it easy for anyone to load, visualize, and analyze such models. This paper introduces the Atari Zoo framework, which contains models trained across benchmark Atari games in an easy-to-use format, as well as code that implements common modes of analysis and connects such models to a popular neural network visualization library. Further, to demonstrate the potential of this dataset and software package, we show initial quantitative and qualitative comparisons between the performance and representations of several DRL algorithms, highlighting interesting and previously unknown distinctions between them.

1 Introduction

Since its introduction, the Atari Learning Environment (ALE; [Bellemare et al., 2013]) has been an important reinforcement learning (RL) testbed. It enables easily evaluating algorithms on over 50 emulated Atari games spanning diverse gameplay styles, providing a window on such algorithms' generality. Indeed, surprisingly strong results in ALE with deep neural networks (DNNs), published in Nature [Mnih et al., 2015], greatly contributed to the current popularity of deep reinforcement learning (DRL).

Like other machine learning benchmarks, much effort aims to quantitatively improve state-of-the-art (SOTA) scores. As the DRL community grows, a paper pushing SOTA is likely to attract significant interest and accumulate citations.

∗Corresponding author: [email protected]

While improving performance is important, it is equally important to understand what DRL algorithms learn, how they process and represent information, and what their properties, strengths, and weaknesses are. These questions cannot be answered through simple quantitative measurements of performance across the ALE suite of games.

Compared to pushing SOTA, much less work has focused on understanding, interpreting, and visualizing the products of DRL; in particular, little research compares DRL algorithms across dimensions other than performance. This paper thus aims to alleviate the considerable friction for those looking to rigorously understand the qualitative behavior of DRL agents. Three main sources of such friction are: (1) the significant computational resources required to run DRL at scale, (2) the logistical tedium of plumbing the products of different DRL algorithms into a common interface, and (3) the wasted effort in re-implementing standard analysis pipelines (like t-SNE embeddings of the state space [Mnih et al., 2015], or activation maximization for visualizing what neurons in a model represent [Erhan et al., 2009; Olah et al., 2018; Nguyen et al., 2017; Simonyan et al., 2013; Yosinski et al., 2015; Mahendran and Vedaldi, 2016]). To address these frictions, this paper introduces the Atari Zoo, a release of trained models spanning major families of DRL algorithms, and an accompanying open-source software package1 that enables their easy analysis, comparison, and visualization (and similar analysis of future models). In particular, this package enables easily downloading particular frozen models of interest from the zoo on demand, further evaluating them in their training environment or modified environments, generating visualizations of their neural activity, exploring compressed visual representations of their behavior, and creating synthetic input patterns that reveal what particular neurons most respond to.

To demonstrate the promise of this model zoo and software, this paper presents an initial analysis of the products of seven DRL algorithms spanning policy gradient, value-based, and evolutionary methods2: A2C (policy-gradient; [Mnih et al., 2016]),

1 https://github.com/uber-research/atari-model-zoo

2 While evolutionary algorithms are excluded from some definitions of RL, their inclusion in the zoo can help investigate what distinguishes such black-box optimization from more traditional RL.


IMPALA (policy-gradient; [Espeholt et al., 2018]), DQN (value-based; [Mnih et al., 2015]), Rainbow (value-based; [Hessel et al., 2017]), Ape-X (value-based; [Horgan et al., 2018]), ES (evolutionary; [Salimans et al., 2017]), and Deep GA (evolutionary; [Such et al., 2017]). The analysis illuminates differences in learned policies across methods that are independent of raw score performance, highlighting the benefit of going beyond simple quantitative measures and of having a unifying software framework that enables analyses with multiple, different, complementary techniques and applying them across many RL algorithms.

2 Background

2.1 Visualizing Deep Networks

One line of DNN research focuses on visualizing the internal dynamics of a DNN [Yosinski et al., 2015] or examines what particular neurons detect or respond to [Erhan et al., 2009; Zeiler and Fergus, 2014; Olah et al., 2018; Nguyen et al., 2017; Simonyan et al., 2013; Yosinski et al., 2015; Mahendran and Vedaldi, 2016]. The hope is to gain more insight into how DNNs represent information, motivated both by enabling more interpretable decisions from these models [Olah et al., 2018] and by illuminating previously unknown properties of DNNs [Yosinski et al., 2015]. For example, through live visualization of all activations in a vision network responding to different images, Yosinski et al. [2015] highlighted that representations were often surprisingly local (as opposed to distributed), e.g. one convolutional filter proved to be a reliable face detector. One practical value of such insights is that they can catalyze future research. The Atari Zoo enables animations in the spirit of Yosinski et al. [2015] that show an agent's activations as it interacts with a game, and also enables creating synthetic inputs via activation maximization [Erhan et al., 2009; Zeiler and Fergus, 2014; Olah et al., 2018; Nguyen et al., 2017; Simonyan et al., 2013; Yosinski et al., 2015; Mahendran and Vedaldi, 2016], specifically by connecting DRL agents to the Lucid visualization package [Luc, 2018].

2.2 Understanding Deep RL

While much more visualization and understanding work has been done for vision models than for DRL, a few papers directly focus on understanding DRL agents [Greydanus et al., 2017; Zahavy et al., 2016], and many others feature some analysis of DRL agent behavior (often in the form of t-SNE diagrams of the state space; see [Mnih et al., 2015]). One approach to understanding DRL agents is to investigate the learned features of models [Greydanus et al., 2017; Zahavy et al., 2016]. For example, Zahavy et al. [2016] visualize what pixels are most important to an agent's decision by using gradients of decisions with respect to pixels. Another approach is to modify the DNN architecture or training procedure such that a trained model will have more interpretable features [Annasamy and Sycara, 2018]. For example, Annasamy and Sycara [2018] augment a model with an attention mechanism and a reconstruction loss, hoping to produce interpretable explanations as a result.

The software package released here is in the spirit of the first paradigm. It facilitates understanding the most commonly applied architectures instead of changing them, although it is also designed to accommodate importing new vision-based DRL models, and could thus also be used to analyze agents explicitly engineered to be interpretable. In particular, the package enables re-exploring many past DRL analysis techniques at scale and across algorithms, where previously they were applied only for one algorithm and across only a handful of hand-selected games.

2.3 Model Zoos

A useful mechanism for reducing the friction of analyzing and building upon models is the idea of a model zoo, i.e. a repository of pre-trained models that can easily be further investigated, fine-tuned, and/or compared (e.g. by looking at how their high-level representations differ). For example, the Caffe website includes a model zoo with many popular vision models, as do TensorFlow, Keras, and PyTorch. The idea is that training large-scale vision networks (e.g. on the ImageNet dataset) can take weeks with powerful GPUs, and that there is little reason to constantly reduplicate the effort of training. Pre-trained word-embedding models are often released with similar motivation, e.g. for Word2Vec or GloVe. However, such a practice is much less common in the space of DRL; one reason is that so far, unlike with vision models and word-embedding models, there are few other downstream tasks for which Atari DRL agents provide obvious value. But if the goal is to better understand these models and algorithms, both to improve them and to use them safely, then there is value in their release.

The recent Dopamine reproducible DRL package [Bellemare et al., 2018] released trained ALE models; it includes final checkpoints of models trained by several DQN variants. However, in general it is non-trivial to extract TensorFlow models from their original context for visualization purposes, to compare agent behavior across DRL algorithms in the same software framework (e.g. due to slight differences in image preprocessing), or to explore dynamics that take place over learning, i.e. from intermediate checkpoints. To remedy this, for this paper's accompanying software release, the Dopamine checkpoints were distilled into frozen models that can be easily loaded into the Atari Zoo framework; and for algorithms trained specifically for the Atari Zoo, we distill intermediate checkpoints in addition to final ones.
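To make "frozen model" concrete, the sketch below shows one way such a frozen TensorFlow graph can be loaded and queried; the file path and tensor names are placeholders, not the actual identifiers used in the release.

    # Minimal sketch of loading a frozen TensorFlow graph (hypothetical file/tensor names).
    import numpy as np
    import tensorflow as tf

    tf.compat.v1.disable_eager_execution()  # TF1-style graph loading

    def load_frozen_graph(pb_path):
        graph_def = tf.compat.v1.GraphDef()
        with tf.io.gfile.GFile(pb_path, "rb") as f:
            graph_def.ParseFromString(f.read())
        graph = tf.Graph()
        with graph.as_default():
            tf.compat.v1.import_graph_def(graph_def, name="")
        return graph

    graph = load_frozen_graph("rainbow_seaquest_final.pb")   # hypothetical path
    with tf.compat.v1.Session(graph=graph) as sess:
        obs = np.zeros((1, 84, 84, 4), dtype=np.float32)     # one stacked observation
        # "X:0" and "output:0" are placeholder tensor names; the real names depend on the export.
        q_values = sess.run("output:0", feed_dict={"X:0": obs})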

3 Generating the Zoo

The approach is to run several validated implementations of DRL algorithms and to collect and standardize the models and results, such that they can then be easily used for downstream analysis and synthesis. There are many algorithms, implementations of them, and different ways that they could be run (e.g. different hyperparameters, architectures, input representations, stopping criteria, etc.). These choices influence the kind of post-hoc analysis that is possible. For example, Rainbow most often outperforms DQN, and if only final models are released, it is impossible to explore scientific questions where it is important to control for performance.

We thus adopted the high-level principles that the Atari Zoo should hold as many elements of architecture and experimental design constant across algorithms (e.g. DNN structure, input representation), should enable as many types of downstream analysis as possible (e.g. by releasing checkpoints across training time), and should make reasonable allowances for the particularities of each algorithm (e.g. ensuring hyperparameters are well-fit to the algorithm, and allowing for differences in how policies are encoded or sampled from). The next paragraphs describe specific design choices.

3.1 Frozen Model Selection Criteria

To enable the platform to facilitate a variety of explorations, we release multiple frozen models for each run, according to different criteria that may be useful to control for when comparing learned policies. The idea is that, depending on the desired analysis, controlling for samples, for wall-clock time, or for performance (i.e. comparing similarly-performing policies) will be more appropriate. In particular, in addition to releasing the final model for each run, additional models are taken over training time (at one, two, four, six, and ten hours); over game-frame samples (400 million and 1 billion frames); over scores (if an algorithm reaches human-level performance); and also a model before any training, to enable analysis of how weights change from their random initialization. The hope is that these frozen models will cover a wide spectrum of possible use cases.
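For illustration only, these selection criteria could be summarized as a small lookup table; the tag names below are hypothetical and need not match the identifiers used in the released zoo.

    # Illustrative summary of the checkpoint-selection criteria described above;
    # the tag strings are hypothetical, not the zoo's actual identifiers.
    CHECKPOINT_CRITERIA = {
        "initial": "weights before any training (random initialization)",
        "wall_clock_hours": [1, 2, 4, 6, 10],
        "game_frames": [400_000_000, 1_000_000_000],
        "human_level": "first checkpoint reaching human-level score, if any",
        "final": "last checkpoint of the run",
    }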

3.2 Algorithm Choice

One important choice for the Atari Zoo is which DRL algorithms to run and include. The main families of DRL algorithms that have been applied to the ALE are policy-gradient methods like A2C [Mnih et al., 2016], value-based methods like DQN [Mnih et al., 2015], and black-box optimization methods like ES [Salimans et al., 2017] and Deep GA [Such et al., 2017]. Based on representativeness and available trusted implementations, the particular algorithms chosen for training included two policy-gradient algorithms (A2C [Mnih et al., 2016] and IMPALA [Espeholt et al., 2018]), two evolutionary algorithms (ES [Salimans et al., 2017] and Deep GA [Such et al., 2017]), and one value-function-based algorithm (a high-performing DQN variant, Ape-X [Horgan et al., 2018]). Additionally, models are also imported from the Dopamine release [Bellemare et al., 2018], which includes DQN [Mnih et al., 2015] and a sophisticated variant of it called Rainbow [Hessel et al., 2017]. Note that for the Dopamine models, only final models are currently available. Hyperparameters and training details for all algorithms are available in supplemental material section S3. We hope to include models from additional algorithms in future releases.

3.3 DNN Architecture and Input Representation

All algorithms are run with the DNN architecture from Mnih et al. [2015], which consists of three convolutional layers (with filter sizes 8x8, 4x4, and 3x3), followed by a fully-connected layer. For most of the explored algorithms, the fully-connected layer connects to an output layer with one neuron per valid action in the underlying Atari game. However, A2C and IMPALA have an additional output that approximates the state value function; Ape-X's architecture features dueling DQN [Wang et al., 2015], which has two separate fully-connected streams;

(a) RGB frame (b) Observation (c) RAM state

Figure 1: Input and RAM Representation. (a) One RGB frame of emulated Atari gameplay is shown, which is (b) preprocessed and concatenated with previous frames before being fed as an observation into the DNN agent. A compressed representation of a 2000-step ALE simulation is shown in (c), i.e. the 1024-bit RAM state (horizontal axis) unfurled over frames (vertical axis).

and Rainbow's architecture includes C51 [Bellemare et al., 2017], which uses many outputs to approximate the distribution of expected Q-values.

Atari frames are 210x160 color images (see figure 1a); the canonical DRL representation is a tensor consisting of the four most recent observation frames, grayscaled and downsampled to 84x84 (figure 1b). By including some previous frames, the aim is to make the game more fully observable, to boost the performance of the feed-forward architectures that are currently most common in Atari research (although recurrent architectures offer possible improvements [Mnih et al., 2016; Espeholt et al., 2018]). One useful Atari representation that is applied in post-training analysis in this paper is the Atari RAM state, which is only 1024 bits long but encompasses the true underlying state (figure 1c).
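For reference, a minimal tf.keras sketch of this canonical network is given below; the strides (4, 2, 1), channel counts (32, 64, 64), and 512-unit hidden layer follow Mnih et al. [2015], while the output head shown is the simple one-neuron-per-action variant (algorithm-specific heads such as dueling or C51 would differ).

    # Sketch of the canonical Nature-DQN trunk used across the zoo.
    import tensorflow as tf

    def build_nature_dqn(n_actions):
        return tf.keras.Sequential([
            tf.keras.Input(shape=(84, 84, 4)),                     # four stacked grayscale frames
            tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu"),
            tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu"),
            tf.keras.layers.Conv2D(64, 3, strides=1, activation="relu"),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(512, activation="relu"),
            tf.keras.layers.Dense(n_actions),                      # one output per valid action
        ])

    model = build_nature_dqn(n_actions=18)  # 18 = full Atari action set

A2C and IMPALA add a value head, Ape-X a dueling split, and Rainbow a C51 distributional head on top of this same convolutional trunk.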

3.4 Data Collection

All algorithms are run across 55 Atari games, for at least three independent random weight initializations. Regular checkpoints were taken during training; after training, the checkpoints that best fit each of the desired criteria (e.g. 400 million frames or human-level performance) were frozen and included in the zoo. The advantage of this post-hoc culling is that additional criteria can be added in the future, e.g. if Atari Zoo users introduce a new use case, because the original checkpoints are archived. Log files were stored and converted into a common format that is also released with the models, to aid future performance-curve comparisons by other researchers. Each frozen model was run post-hoc in ALE for 2500 timesteps to generate cached behavior of policies in their training environment, which includes the raw game frames, the processed four-frame observations, RAM states, and high-level representations (e.g. neural representations at hidden layers). As a result, it is possible to do meaningful analysis without ever running the models themselves.
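A rollout-caching loop of the kind described above might look roughly like the following; it assumes the classic OpenAI Gym 4-tuple step API and the ale.getRAM() accessor exposed by Gym's Atari environments (availability depends on the gym/ale-py version), and uses a random policy as a stand-in for a frozen zoo model.

    # Sketch of caching a rollout's raw frames, RAM states, and rewards.
    import gym
    import numpy as np

    env = gym.make("SeaquestNoFrameskip-v4")
    obs = env.reset()                            # classic gym API; gymnasium returns (obs, info)
    frames, rams, rewards = [], [], []
    for t in range(2500):                        # the zoo caches 2500-step evaluations
        action = env.action_space.sample()       # placeholder; the zoo runs the frozen model here
        obs, reward, done, info = env.step(action)
        frames.append(obs)                                   # raw RGB frame (210x160x3)
        rams.append(env.unwrapped.ale.getRAM().copy())       # 128-byte (1024-bit) RAM state
        rewards.append(reward)
        if done:
            obs = env.reset()
    np.savez_compressed("seaquest_rollout.npz",
                        frames=np.array(frames), ram=np.array(rams), rewards=np.array(rewards))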

4 Quantitative Analysis

The open-source software package released with the acceptance of this work provides an interface to the Atari Zoo dataset, and implements several common modes of analysis. Models can be downloaded with a single line of code, and other single-line invocations interface directly with ALE and return the behavioral outcome of executing a model's policy, create movies of agents superimposed with neural activation, or access convolutional weight tensors. In this section, we demonstrate analyses the Atari Zoo software can facilitate, and highlight some of its built-in features. For many of the analyses below, for computational simplicity we study results in a representative subset of 13 ALE games used by prior research [Such et al., 2017], which we refer to here as the analysis subset of games.

4.1 Convolutional Filter Analysis

While understanding a DNN only by examining its weights is challenging, weights directly connected to the input can often be interpreted. For example, when visualizing the weights of the first convolutional layer in a vision model, Gabor-like edge-detection filters are nearly always present. An interesting question is whether Gabor-like filters also arise when DRL algorithms are trained from pixel input (as is done here). In visualizing filters across games and DRL algorithms, edge-detector-like features sometimes arise in the gradient-based methods, but they are seemingly never as crisp as in vision models; this may be because ALE lacks the visual complexity of natural images. In contrast, the filters in the evolutionary models are less regular. Representative examples across games and algorithms are shown in supplemental figure S1.

Learned filters are commonly tiled similarly across time (i.e. across the four DNN input frames), with past frames having lower-intensity weights. One explanation is that reward gradients are more strongly influenced by present observations. To explore this systematically, across games and algorithms we examined the absolute magnitude of filter weights connected to the present frame versus the past. In contrast to the gradient-based methods, the evolutionary methods show no discernible preference across time (supplemental figure S2), again suggesting that their learning differs qualitatively from that of the gradient-based methods. Interestingly, a rigorous information-theoretic approximation of memory usage is explored by Dann et al. [2016] in the context of DQN; our measure correlates well with theirs despite the relative simplicity of exploring only filter weight strength (supplemental section S1.1).

4.2 Robustness to Observation Noise

An important property is how agents perform in slightly out-of-distribution (OOD) situations; ideally they would not catastrophically fail in the face of nominal change. While it is difficult to freely alter the ALE game dynamics (without learning how to program in 6502 assembly code), it is possible to systematically distort observations. Here we explore one simple OOD change to observations by adding increasingly severe noise to the observations input to DNN-based agents, and observe how their evaluated game score degrades. The motivation is to discover whether some learning algorithms are learning more robust policies than others. The results show that, with some caveats, methods with a direct representation of the policy appear more robust to observation noise (supplemental figure S4).

Figure 2: A sub-detecting neuron in Seaquest. Each image represents an observation from an Ape-X agent playing Seaquest. The red square indicates which image patch highly activated the sub-detecting neuron on the third convolutional layer of the DNN. Having multiple tools (such as this image-patch finder, or the activation movies which identified this neuron of interest) enables more easily triangulating and verifying hypotheses about the internals of a DRL agent's neural network.

A similar study conducted for robustness to parameter noise (supplemental section S1.2) tentatively suggests that actor-critic methods are more robust to such noise.
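The observation-noise protocol itself is simple; a minimal sketch, assuming a generic Gym-style environment and a policy(obs) callable standing in for a zoo model (both placeholders), is given below.

    # Sketch of the observation-noise robustness protocol: add Gaussian noise of increasing
    # severity to the agent's observations and record how episode return degrades.
    import numpy as np

    def noisy_return(env, policy, sigma, max_steps=2500, seed=0):
        rng = np.random.default_rng(seed)
        obs, total = env.reset(), 0.0
        for _ in range(max_steps):
            noisy_obs = obs + rng.normal(0.0, sigma, size=np.shape(obs))
            obs, reward, done, _ = env.step(policy(noisy_obs))
            total += reward
            if done:
                break
        return total

    # Sweep severity levels; env and policy stand in for a zoo model's evaluation setup.
    # scores = {sigma: noisy_return(env, policy, sigma) for sigma in (0.0, 0.05, 0.1, 0.2, 0.4)}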

4.3 Distinctiveness of Learned Policies

To explore the distinctive signature of solutions discovered by different DRL algorithms, we train image classifiers to identify the generating DRL algorithm given states sampled from independent runs of each algorithm (details are in supplemental section S1.3). Supplemental figure S6 shows the confusion matrix for Seaquest, wherein a cluster of policy search methods (A2C, ES, and GA) have the most inter-class confusion, reflecting (as confirmed in later sections) that these algorithms tend to converge to the same sub-optimal behavior in this game; results are qualitatively similar when tabulated across the analysis subset of games (supplemental figure S7).

5 Visualization

We next highlight the Atari Zoo's capabilities to quickly and systematically visualize policies, which broadly can be divided into three categories: direct policy visualization, dimensionality reduction, and neuron activation maximization.

5.1 Animations to Inspect Policy and Activations

To quickly survey the solutions being learned, our software generates grids of videos, where one grid axis spans different DRL algorithms, and the other axis covers independent runs of the algorithm. Such videos can highlight when different algorithms are converging to the same local optimum (e.g. supplemental figure S9 shows a situation where this is the case for A2C, ES, and the GA; video: http://bit.ly/2XpD5kO).

To enable investigating the internal workings of the DNN, our software generates movies that display the activations of all neurons alongside animated frames of the agent acting in the game. This approach is inspired by the deep visualization toolbox [Yosinski et al., 2015], but put into a DRL context.


Figure 3: Multiple runs of algorithms sharing the same RAM-space embedding in Seaquest. This plot shows one ALE evaluation per model for A2C, ES, and Ape-X, visualized in the same underlying RAM t-SNE embedding. Each dot represents a separate frame from each agent, colored by score (darker color indicates higher score). The plot highlights that in this game, A2C and ES visit similar distributions of states (corresponding to the same sub-optimal behavior), while Ape-X visits a distinct part of the state space, matching what could manually be distilled from watching the policy movies shown in supplemental figure S9. The interface allows clicking on points to observe the corresponding RGB frame, and toggling different runs of different algorithms for visualization.

Supplemental figure S10 shows how this tool can lead to recognizing the functionality of particular high-level features (video: http://bit.ly/2tFHiCU); in particular, it helped to identify a submarine-detecting neuron on the third convolution layer of an Ape-X agent. Note that for ES and GA, no such specialized neuron was found; activations seemed qualitatively more distributed for those methods.

5.2 Image Patches that Maximally Excite Filters

One automated technique for uncovering the functionality of a particular convolutional filter is to find which image patches evoke from it the highest-magnitude activations. Given a trained DRL agent and a target convolution filter to analyze, observations from the agent interacting with its ALE training environment are input to the agent's DNN, and the resulting maps of activations from the filter of interest are stored. These maps are sorted by the single maximum activation within them, and the geometric location within the map of that maximum activation is recorded. Then, for each of these top-most activations, the specific image patch from the observation that generated it is identified and displayed, by taking the receptive field of the filter into account (i.e. modulated by both the stride and size of the convolutional layers). As a sanity check, we validate that the neuron identified in the previous section does indeed maximally fire for submarines (figure 2).
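The receptive-field bookkeeping in this procedure is mechanical; the sketch below computes, for the canonical network's (kernel, stride) pairs and 'valid' padding (the padding scheme is an assumption), which observation patch a given convolutional unit sees.

    # Sketch of mapping a conv-layer unit back to the observation patch in its receptive field.
    import numpy as np

    LAYERS = [(8, 4), (4, 2), (3, 1)]  # (kernel, stride) for conv1..conv3 of the Nature-DQN net

    def receptive_field(layer_index):
        """Return (patch_size, effective_stride) of one unit in the given conv layer."""
        size, stride = 1, 1
        for k, s in LAYERS[:layer_index + 1]:
            size += (k - 1) * stride
            stride *= s
        return size, stride

    def extract_patch(observation, row, col, layer_index):
        """Cut out the 84x84 observation patch seen by unit (row, col) of a conv layer."""
        size, stride = receptive_field(layer_index)
        r0, c0 = row * stride, col * stride
        return observation[r0:r0 + size, c0:c0 + size]

    # A third-layer unit sees a 36x36 patch placed every 8 input pixels:
    print(receptive_field(2))   # -> (36, 8)

    # Given stored activation maps of shape (N, H, W) for one filter, the top patch is then:
    # n, r, c = np.unravel_index(np.argmax(activation_maps), activation_maps.shape)
    # patch = extract_patch(observations[n], r, c, layer_index=2)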

5.3 Dimensionality Reduction

Dimensionality reduction provides another view on agent behavior; DRL research often includes t-SNE plots of agent DNN representations that summarize behavior in the domain [Mnih et al., 2015]. Our software includes such an implementation (supplemental figure S12).

However, such an approach relies on embedding the high-level representation of one agent; it is unclear how to apply it to create an embedding appropriate for comparisons of different independent runs of the same algorithm, or of runs from different DRL algorithms.

Figure 4: Synthesized inputs for output-layer neurons in Seaquest. For a representative run of Rainbow and DQN, inputs are shown optimized to maximize the activation of the first neuron in the output layer of a Seaquest network. Because Rainbow includes C51, its image is in effect optimized to maximize the probability of a low-reward scenario; this neuron appears to be learning interpretable features such as submarine location and the seabed. When maximizing (or minimizing) DQN Q-value outputs (one example shown on the left), this qualitative outcome of interpretability was not observed.

As an initial approach, we implement an embedding based on the Atari RAM representation (which is the same across algorithms and runs, but distinct between games). Like the grid view of agent behaviors and the state-distinguishing classifier, this t-SNE tool provides high-level information from which to compare runs within or between different algorithms (figure 3); details of this approach are provided in supplemental section S2.1.


Figure 5: Synthesized inputs for fully-connected-layer neurons in Freeway. Inputs synthesized to maximize the activations of the first three neurons in the first fully-connected layer are shown for a representative DQN and Rainbow DNN. One of the Rainbow neurons (in the red rectangle) appears to be capturing lane features.

5.4 Synthesizing Inputs to Understand Neurons

While the previous sections explore DNN activations in the context of an agent's training environment, another approach is to optimize synthetic input images that stimulate particular DNN neurons. Variations on this approach have yielded striking results in vision models [Nguyen et al., 2017; Olah et al., 2018; Simonyan et al., 2013]; the hope is that these techniques could yield an additional view on DRL agents' neural representations. To enable this analysis, we leverage the Lucid visualization library [Luc, 2018]; in particular, we create wrapper classes that enable easy integration of Atari Zoo models into Lucid, and release Jupyter notebooks that generate synthetic inputs for different DRL models.

We now present a series of synthetic inputs generated by the Lucid library across a handful of games that highlight the potential of these kinds of visualizations for DRL understanding (further details of the technique used are described in supplemental section S2.2). We first explore the kinds of features learned across depth. Supplemental figure S13 supports what was learned by visualizing the first-layer filter weights for value-based networks (section 4.1), i.e. showing that the first convolution layers in the value-based networks appear to be learning edge-detector features. The activation videos of section 5.1 and the patch-based approach of section 5.2 help to provide grounding, showing that in the context of the game, some first-layer filters detect the edges of the screen, in effect serving as location anchors, while others encode concepts like blinking objects (see figure S11). Supplemental figure S14 explores visualizing later-layer convolution filters, and figure 4 shows inputs synthesized to maximize output neurons, which sometimes yields interpretable features.

Such visualizations can also reveal that critical features are being attended to (figure 5 and supplemental figure S15). Overall, these visualizations demonstrate the potential of this kind of technique, and we believe that many useful further insights may result from a more systematic application and investigation of this and many of the other interesting visualization techniques implemented by Lucid, which can now easily be applied to Atari Zoo models. Also promising would be to further explore regularization to constrain the space of synthetic inputs, e.g. a generative model of Atari frames in the spirit of Nguyen et al. [2017] or similar works.

6 Discussion and Conclusions

There are many follow-up extensions that the initial explorations of the zoo raise. One natural extension is to include more DRL algorithms (e.g. TRPO or PPO [Schulman et al., 2017]). Beyond algorithms, there are many alternate architectures that might have interesting effects on representation and decision-making, for example recurrent architectures, or architectures that exploit attention. Also intriguing is examining the effect of the incentive driving search: do auxiliary or substitute objectives qualitatively change DRL representations, e.g. as in UNREAL [Jaderberg et al., 2016], curiosity-driven exploration [Pathak et al., 2017], or novelty search [Conti et al., 2017]? How do the representations and features of meta-learning agents such as MAML [Finn et al., 2017] change as they learn a new task? Finally, there are other analysis tools that could be implemented, which might illuminate other interesting properties of DRL algorithms and learned representations, e.g. the image perturbation analysis of Greydanus et al. [2017] or a variety of sophisticated neuron visualization techniques [Nguyen et al., 2017]. We welcome community contributions for these algorithms, models, architectures, incentives, and tools.

While the main motivation for the zoo was to reduce friction for research into understanding and visualizing the behavior of DRL agents, it can also serve as a platform for other research questions. For example, having a zoo of agents trained on individual games, for different amounts of data, would also reduce friction for exploring transfer learning within Atari, i.e. whether experience learned on one game can quickly provide benefit on another game. Also, by providing a huge library of cached rollouts for agents across algorithms, the zoo may be interesting in the context of learning from demonstrations, or for creating generative models of games. In conclusion, we look forward to seeing how this dataset will be used by the community at large.


References

Raghuram Mandyam Annasamy and Katia Sycara. Towards better interpretability in deep Q-networks. arXiv preprint arXiv:1809.05630, 2018.

Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Marc G Bellemare, Will Dabney, and Remi Munos. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887, 2017.

Marc G. Bellemare, Pablo Samuel Castro, Carles Gelada, Saurabh Kumar, and Subhodeep Moitra. Dopamine. GitHub repository, 2018.

Edoardo Conti, Vashisht Madhavan, Felipe Petroski Such, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. arXiv preprint arXiv:1712.06560, 2017.

Christoph Dann, Katja Hofmann, and Sebastian Nowozin. Memory lens: How much memory does an agent use? arXiv preprint arXiv:1611.06928, 2016.

Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.

Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.

Sam Greydanus, Anurag Koul, Jonathan Dodge, and Alan Fern. Visualizing and understanding Atari agents. arXiv preprint arXiv:1711.00138, 2017.

Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298, 2017.

Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado Van Hasselt, and David Silver. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018.

Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.

Lucid: A collection of infrastructure and tools for research in neural network interpretability. https://github.com/tensorflow/lucid, 2018.

Aravindh Mahendran and Andrea Vedaldi. Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision, 120(3):233–255, 2016.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, and Jason Yosinski. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR, volume 2, page 7, 2017.

Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Katherine Ye, and Alexander Mordvintsev. The building blocks of interpretability. Distill, 3(3):e10, 2018.

Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.

Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567, 2017.

Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.

Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.

Tom Zahavy, Nir Ben-Zrihem, and Shie Mannor. Graying the black box: Understanding DQNs. In International Conference on Machine Learning, pages 1899–1908, 2016.

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.


Supplementary Material

The following sections contain additional figures, and describe in more detail the experimental setups applied in the paper's experiments.

S1 Quantitative Analysis Details

Figure S1 shows a sampling of first-layer convolutional filters from final trained models, and figure S2 highlights that such filters often differentially attend to the present over the past.

S1.1 Further Study of Temporal Bias in DQN

As an exploration of the connection between the information-theoretic measure of memory-dependent action in Dann et al. [2016] and the pattern highlighted in this paper (i.e. that the strength of filter weights in the first layer of convolutions may highlight a network's reliance on the past), we examined first-layer filters in DQN across all 55 games. A simple metric of present-focus is the ratio of average weight magnitudes for the past three frames to that of the present frame. When sorted by this metric (see figure S3), there is high agreement with the 8 games identified by Dann et al. [2016] that use memory. In particular, three out of the top four games identified by our metric align with theirs, as do six out of the top twelve games, considered among the games that overlap between their 49 games and our 55.
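A minimal version of this metric, assuming first-layer weights of shape (kernel_h, kernel_w, 4, n_filters) with the most recent frame in the last input channel (the channel ordering is an assumption and may differ across implementations), is:

    # Sketch of the present-vs-past metric: ratio of mean |weight| for the three past frames
    # to that of the most recent frame, from first-layer convolutional weights.
    import numpy as np

    def past_to_present_ratio(conv1_weights):
        mags = np.abs(conv1_weights).mean(axis=(0, 1, 3))   # mean |w| per input frame -> shape (4,)
        return mags[:3].mean() / mags[3]

    # Example with random weights (a trained gradient-based net would typically give a ratio
    # below 1, per the analysis above):
    print(past_to_present_ratio(np.random.randn(8, 8, 4, 32)))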

S1.2 Observation and Parameter Noise Details

Figure S4 shows robustness to observation noise for games in the analysis subset. Beyond observation noise, another interesting property of learning algorithms is the kind of local optimum they find in the parameter space, i.e. whether the learned function is smooth or not in the area of a found solution. One coarse tool for examining this property is to test the robustness of policies to parameter perturbations. It is plausible that the evolutionary methods would be more robust in this way, given that they are trained through parameter perturbations. To measure this, we perturb the convolutional weights of each DRL agent with increasingly severe normally-distributed noise. We perturbed the convolutional weights only, because that part of the DNN is identical across agents, whereas the fully-connected layers sometimes vary in structure across DRL algorithms (e.g. value-based algorithms like Rainbow or Ape-X include features that require post-convolutional architectural changes). Figure S5 shows the results across games and algorithms; while no incontrovertible trend exists, the two policy-gradient methods (A2C and IMPALA) show greater relative robustness than the other methods.

Note that for both algorithm-best performance plots (i.e. figures S4a and S5a), three games were excluded from analysis because at least one of the DRL algorithms performed worse than a random policy before perturbation; including them would have conflated performance and robustness, which would undermine the purpose of the plot.

S1.3 Distinctiveness of Policies Learned by Algorithms

We use only the "present" channel of each gray-scale observation frame (i.e. without the complete four-frame stack) to train a classifier for each game. The classifier consists of two convolution layers and two fully-connected layers, and is trained with early stopping to avoid overfitting. For each game, 2501 frames are collected from multiple evaluations by each model. The reported classification results use 20% of the frames as the test set.
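A rough tf.keras sketch of such a classifier is shown below; the filter counts, kernel sizes, and optimizer settings are illustrative assumptions, since only the overall layer structure is specified above.

    # Sketch of the algorithm-identification classifier: two conv layers + two dense layers,
    # trained on single 84x84 grayscale frames with early stopping.
    import tensorflow as tf

    n_algorithms = 7
    classifier = tf.keras.Sequential([
        tf.keras.Input(shape=(84, 84, 1)),
        tf.keras.layers.Conv2D(16, 5, strides=2, activation="relu"),
        tf.keras.layers.Conv2D(32, 5, strides=2, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(n_algorithms, activation="softmax"),
    ])
    classifier.compile(optimizer="adam",
                       loss="sparse_categorical_crossentropy",
                       metrics=["accuracy"])
    early_stop = tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)
    # classifier.fit(train_frames, train_labels, validation_split=0.2,
    #                epochs=50, callbacks=[early_stop])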

We also provide summaries of F1 classification performance: Figure S8 summarizes classification performance across DRL algorithms and games, while table S1 shows performance averaged across games.

S2 Visualization Details

This section provides more details and figures for the visualization portion of the paper's analysis. Figure S9 shows one frame of a collage of simultaneous videos that gives a quick high-level comparison of how different algorithms and runs are solving an ALE environment. Figure S10 shows one frame of a video that simultaneously shows a DNN agent acting in an ALE environment and all of the activations of its DNN.

Figure S11 shows a second example of how the image-patch finder can help ground out what particular DNN neurons are learning.

S2.1 t-SNE Details

To visualize RAM states and high-level DNN representations in 2D, as is typical in t-SNE analysis, PCA is first applied to reduce the number of dimensions (to 50), followed by 3000 t-SNE iterations with a perplexity of 30. The dimensionality reduction of RAM states is applied across all available runs of DRL algorithms to be jointly embedded. In contrast, dimensionality reduction of high-level DNN representations is particular to a specific model trained by a single DRL algorithm (i.e. each run of a DRL algorithm learns its own distinct representation).
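This pipeline corresponds to a few lines of scikit-learn; the sketch below is a minimal version (note that the t-SNE iteration argument is named n_iter in older scikit-learn releases and max_iter in newer ones).

    # Sketch of the embedding pipeline: PCA to 50 dimensions, then 2-D t-SNE
    # with perplexity 30 and 3000 iterations.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    def embed_states(states):
        """states: (n_frames, n_features) array of RAM bytes or DNN activations."""
        reduced = PCA(n_components=50).fit_transform(states.astype(np.float64))
        return TSNE(n_components=2, perplexity=30, n_iter=3000).fit_transform(reduced)

    # For the joint RAM embedding, frames from all runs/algorithms are stacked before embedding:
    # xy = embed_states(np.concatenate([ram_run_a, ram_run_b, ram_run_c], axis=0))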


(a) Seaquest

(b) Venture

Figure S1: Learned Convolutional Filters. Shown are first-layer convolutional filters taken from DNNs trained by each algorithm, as well as random filters drawn from a normal distribution (Random). In games in which they exceed random performance, filters for the gradient-based algorithms often have spatial structure and sometimes resemble edge detectors, and the intensity of weights often degrades into the past (i.e. the left-most patches). This can be seen for all gradient-based methods in (a) Seaquest; when gradient-based methods fail to learn, as DQN and A2C often do in (b) Venture, their filters then appear more random (this effect is consistent across runs). Filters for the evolutionary algorithms appear less regular, even when their performance is competitive with the gradient-based methods.


Figure S2: Significance of Time Across Models. Filter weight magnitudes across input patches are shown averaged across a representative sample of ALE games, with 3 independent runs for each DRL algorithm; also included for analysis are random filters drawn from a normal distribution (Random). Before averaging, past-frame weight magnitudes are normalized by those of the weights attending to the most recent observation (i.e. the present magnitudes are anchored to 1.0). For the gradient-based DRL algorithms (Ape-X, Rainbow, DQN, and A2C), filter weights are stronger when connected to the current frame than to historical frames. Interestingly, such a trend is not seen for the evolutionary algorithms; note that ES includes L2 regularization, so this effect is not merely an artifact of weight decay being present in the gradient-based methods only. The effect is also present when looking at individual games (data not shown).

Figure S3: Attention to the past in DQN. DQN's tendency to focus on the present relative to the past (as measured by filter magnitudes from different input frames) is shown across 55 ALE games. From left to right, the amount of present-bias increases, e.g. the games at the left seemingly have greater use for information stored in the past three frames relative to the games on the right.

(a) Normalized by algorithm best (b) Normalized by best over all algorithms

Figure S4: Robustness to Observation Noise. How the performance of trained policies degrades with increasingly severe normally-distributed noise is shown, averaged over three independent runs across the analysis subset of games. The figure shows how performance degrades (a) relative to the baseline performance of that algorithm on each game, and (b) relative to the best performance of any algorithm on each game. Zero performance in this chart represents random play. The conclusion is that the policy search algorithms show less steep degradation relative to (a) their own best performance, although this is confounded by (b) the overall better absolute performance of the value-based methods. Follow-up analysis will control for performance by using the Atari Zoo human-performance frozen models.


(a) Performance relative to algorithm best (b) Performance relative to best over algorithms

Figure S5: Robustness to Parameter Noise. How the performance of trained policies degrades with increasingly severe normally-distributed parameter noise is shown, averaged over three independent runs across the analysis subset of games. The figure shows how performance degrades (a) relative to the baseline performance of that algorithm on each game, and (b) relative to the best performance of any algorithm on each game. Zero performance in this chart represents random play. Interestingly, the two policy-gradient methods demonstrate a very similar algorithm-best profile that is more robust than the other methods; our prior hypothesis was that the evolutionary algorithms might exhibit higher robustness by this measure (given that they are trained with parameter perturbations).

Figure S6: Confusion matrix for Seaquest frame classification. The cell in the i-th row and the j-th column denotes the number of frames generated by the algorithm in row i that are predicted to be generated by the algorithm in column j. The conclusion is that in this game, there is a cluster of confusion among many of the direct policy search algorithms (ES, A2C, and GA), highlighting that they are converging to similarly sub-optimal behaviors.


Figure S7: Confusion matrix summed across all games. The cell in the i-th row and the j-th column denotes the total number of frames from the rollouts of the algorithm in row i predicted to be from rollouts of the algorithm in column j. The true-positive predictions are reset to 0 to highlight the false positives.

Figure S8: F1 scores for frame classification. The F1 score is defined as 2 × (precision × recall) / (precision + recall). We observe that the classifier distinguishes each algorithm in all environments with an F1 score of at least 0.5.


Game        Mean F1
Amidar      0.96
Assault     0.86
Asterix     0.96
Asteroids   0.96
Atlantis    0.73
Enduro      0.92
Frostbite   0.94
Gravitar    0.92
Kangaroo    0.93
Seaquest    0.88
Skiing      0.9
Venture     0.96
Zaxxon      0.96

Table S1: Average F1 scores by game. The score is an unweighted average of F1 scores across all algorithms. Lower scores indicate games for which different DRL algorithms are less distinguishable from each other.

S2.2 Synthetic Input Generation Details

We use the Lucid library [Luc, 2018] to visualize what types of inputs maximize neuron activations throughout the agents' networks. This study used the trained checkpoints provided by Dopamine [Bellemare et al., 2018] for DQN and Rainbow (although it could be applied to any of the DRL algorithms in the Atari Zoo). These frozen graphs are then loaded as part of a Lucid model and an optimization objective is created.

An input pattern to the network (consisting of a stack of four 84x84-pixel screens) is optimized to maximize the activations of the desired neurons. Initially, the four 84x84 frames are initialized with random noise. The result of optimization ideally yields visualizations that reveal qualitatively what features the neurons have learned to capture. As recommended in Olah et al. [2017] and Mahendran and Vedaldi [2016], we apply regularization to produce clearer results; for most images we use only image jitter (i.e. randomly offsetting the input image by a few pixels to encourage local translation invariance). For some images, we found it helpful to add total-variation regularization (to encourage local smoothness; see Mahendran and Vedaldi [2016]) and L1 regularization (to encourage pixels that are not contributing to the objective to become zero) on the optimized image.
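Stripped of Lucid's abstractions, the optimization loop reduces to gradient ascent on a neuron's activation with the regularizers described above; the sketch below is a generic version (not the Lucid API), where model is a placeholder for any TensorFlow-callable mapping a (1, 84, 84, 4) input to a layer's activations.

    # Generic activation-maximization sketch with jitter, total-variation, and L1 regularization.
    import tensorflow as tf

    def synthesize_input(model, neuron_index, steps=512, lr=0.05, tv_weight=1e-4, l1_weight=1e-5):
        image = tf.Variable(tf.random.normal([1, 84, 84, 4], stddev=0.1))  # random-noise init
        opt = tf.keras.optimizers.Adam(lr)
        for _ in range(steps):
            with tf.GradientTape() as tape:
                # random jitter: shift the image a few pixels to encourage translation invariance
                shifts = tf.random.uniform([2], -4, 5, dtype=tf.int32)
                jittered = tf.roll(image, shift=shifts, axis=[1, 2])
                activation = model(jittered)[0, neuron_index]
                tv = tf.reduce_sum(tf.image.total_variation(jittered))
                loss = -activation + tv_weight * tv + l1_weight * tf.reduce_sum(tf.abs(jittered))
            grads = tape.gradient(loss, [image])
            opt.apply_gradients(zip(grads, [image]))
        return image.numpy()

In practice the released notebooks rely on Lucid's own rendering and regularization utilities; the loop above only illustrates the underlying idea.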

S3 DRL Algorithm Details and Hyperparameters

This section describes the implementations and hyperparameters used for training the models released with the zoo. The DQN and Rainbow models come from the Dopamine model release [Bellemare et al., 2018]1. The following sections describe the algorithms for the newly-trained models released with this paper.

S3.1 A2C

The implementation of A2C [Mnih et al., 2016] that generated the models in this paper was derived from the OpenAI Baselines software package [Dhariwal et al., 2017]. It ran with 20 parallel worker threads for 400 million frames; checkpoints occurred every 4 million frames. Hyperparameters are listed in table S2.

Hyperparameter                    Setting
Learning Rate                     7e-5
τ                                 1.0
Value Function Loss Coefficient   0.5
Entropy Loss Coefficient          0.01
Discount factor                   0.99

Table S2: A2C Hyperparameters. Population sizes are incremented to account for elites (+1). Many of the unusual numbers were found via preliminary hyperparameter searches in other domains.

1 The hyperparameters and training details for Dopamine can be found at https://github.com/google/dopamine/tree/master/baselines/


Figure S9: Grid of Rollout Videos in Seaquest. The vertical axis represents different independently-trained models, while the horizontal axis represents the DRL algorithms included in the Atari Zoo. In Seaquest, one objective is to control a submarine to shoot fish without getting hit by them, and another is to avoid running out of oxygen by intermittently resurfacing. All three independent runs of A2C, GA, and ES converge to the same sub-optimal behavior: they dive to the bottom of the ocean and shoot fish until they run out of oxygen. The value-function-based methods exhibit more sophisticated behavior, highlighting that in this game greedy policy searches may often converge to sub-optimal solutions, while learning the value of state-action pairs can avoid this pathology. Video is available at: http://bit.ly/2XpD5kO

S3.2 Ape-X

The implementation of Ape-X used to generate the models in this paper was based on the one found here: https://github.com/uber-research/ape-x. The hyperparameters are reported in Table S3.

Hyperparameter Setting

Buffer Size 2^21
Number of Actors 384
Batch Size 512
n-step 3
gamma 0.99
gradient clipping 40
target network period 2500
Prioritized replay (α, β) (0.6, 0.4)
Adam Learning rate 0.00025 / 4

Table S3: Ape-X Hyperparameters. For more details on what these parameters signify, see [Horgan et al., 2018].

S3.3 GA

The implementation of GA used to generate the models in this paper was based on the one found here: https://github.com/uber-research/deep-neuroevolution. The hyperparameters are reported in Table S4 and were found through random search.

S3.4 ES

The implementation of ES used to generate the models in this paper was based on the one found here: https://github.com/uber-research/deep-neuroevolution. The hyperparameters reported in Table S5 were found via preliminary search and are similar to those reported in [Conti et al., 2017].


Figure S10: Policy and activation visualization. The figure shows a still frame from a video of an Ape-X agent acting in the Seaquest environment (the full video can be accessed at http://bit.ly/2tFHiCU). On the left, the RGB frame is shown, while from top to bottom on the right are: the processed observations, followed by the activations of the convolutional layers, the fully connected layer, and finally the Q-value outputs. From watching the video, it is apparent that the brightest neuron in the third convolutional layer tracks the position of the submarine. This shows that, as in vision DNNs, important features are sometimes represented in a local rather than distributed fashion [Yosinski et al., 2015].


Figure S11: Location-anchor and oxygen-detector in a Rainbow agent in Seaquest. The top three images show image patches (red square) that highly activate a first-layer convolution filter of a Rainbow agent; this filter always activates maximally in the same geometric location, potentially serving as a geometric anchor for localization by downstream filters. The bottom three images show image patches that highly activate a separate first-layer filter in the same agent. It detects blinking objects; the submarine can blink before it runs out of oxygen, and the oxygen meter itself blinks when it is running low.
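The patch-activation probe behind this figure can be approximated in a few lines of NumPy: given first-layer feature maps collected over a rollout, find the unit that fires most strongly for a chosen filter and map it back to its receptive field in the input frame. The sketch below assumes the standard 8x8, stride-4 first convolution with no padding; the function name and array shapes are assumptions for illustration, not the Atari Zoo API.

import numpy as np

def top_activating_patch(frames, conv1_activations, filter_idx, kernel=8, stride=4):
    # frames: (N, 84, 84) preprocessed observations from a rollout.
    # conv1_activations: (N, H, W, C) first-layer feature maps for the same frames,
    # e.g. (N, 20, 20, 32) for an 8x8/stride-4 convolution with no padding.
    fmap = conv1_activations[..., filter_idx]                # (N, H, W)
    n, y, x = np.unravel_index(np.argmax(fmap), fmap.shape)  # most active unit
    top, left = y * stride, x * stride                       # receptive-field origin
    patch = frames[n][top:top + kernel, left:left + kernel]
    return patch, (n, y, x)

# Toy usage with random data in place of real activations.
patch, loc = top_activating_patch(np.random.rand(10, 84, 84),
                                  np.random.rand(10, 20, 20, 32), filter_idx=0)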

Figure S12: Comparing high-level DNN representations through separate t-SNE embeddings. The figure shows separate t-SNE embeddings of high-level representations for DNNs trained to play Seaquest by A2C and Ape-X. Each dot corresponds to a specific frame in a rollout, and darker shades indicate higher scores. Embeddings that represent similar frames cluster together, indicating states with different positions of the submarine and objects of various numbers, categories, and colors. Representative frames for selected clusters are displayed. For example, in the left figure (A2C), the top-left cluster represents terminated states and the bottom-left cluster corresponds to the situation of oxygen depletion, while in the right figure (Ape-X), the bottom-right cluster corresponds to a repeated series of actions that the agent takes to surface and refill its oxygen.
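A minimal version of this embedding can be reproduced with scikit-learn and matplotlib, assuming one vector of high-level (e.g. final fully-connected) activations and one running score per rollout frame; the random arrays in the usage line stand in for real rollout data.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def embed_high_level_representations(features, scores):
    # features: (num_frames, hidden_dim) high-level activations, one row per frame.
    # scores: (num_frames,) game score at each frame, used for shading.
    xy = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)
    plt.scatter(xy[:, 0], xy[:, 1], c=scores, cmap="Greys", s=4)  # darker = higher score
    plt.colorbar(label="score")
    plt.show()

# Toy usage with random data in place of real rollout activations.
embed_high_level_representations(np.random.rand(500, 512), np.random.rand(500))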


Figure S13: Synthesized inputs for neurons in the first convolutional layer in Seaquest. Inputs optimized to activate the first three neurons in the first convolutional layer are shown for representative runs of DQN and Rainbow. These neurons appear to be learning 'edge-detector'-style features.

Figure S14: Synthesized inputs for neurons in the third convolutional layer in Seaquest. Inputs optimized to activate the first four neurons in the last (third) convolutional layer in Seaquest are shown for a representative run of DQN and Rainbow (hyperparameters for regularization, e.g. a total variation penalty, were optimized by hand to improve image quality). Both networks appear to focus on particular styles of objects, combinations of them, and animation-related features such as blinking. Some synthetic inputs make sense in the context of other investigatory tools; e.g. Rainbow's first neuron's synthetic input includes objects that blink between frames, and when explored with the patch-activation technique, this neuron responds most intensely when the sub is blinking and about to explode from running out of oxygen. However, for some features it is unclear how the synthetic input should be interpreted without further investigation; e.g. the patch-activation technique shows that Rainbow's third neuron responds most when the sub is nearing the top border of the water. Further experimentation with regularization within Lucid, or employing more sophisticated techniques, may help to improve these initial results.


Figure S15: Synthesized inputs for neurons in the third convolutional layer in Pong. Inputs optimized to activate the first four neurons in the last (third) convolutional layer in Pong are shown for a representative run of DQN and Rainbow. Both networks seem to learn qualitatively similar features, with images featuring vertical lines reminiscent of paddles and smaller objects reminiscent of balls. Further exploration is needed to ground out these evocative appearances.


Hyperparameter Setting

σ (Mutation Power) 0.002
Population Size 1000
Truncation Size 20

Table S4: GA Hyperparameters. For more details on what these parameters signify, see [Such et al., 2017].

Hyperparameter Setting

σ (Mutation Power) 0.02
Virtual Batch Size 128
Population Size 5000
Learning Rate 0.01
Optimizer Adam
L2 Regularization Coefficient 0.005

Table S5: ES Hyperparameters. For more details on what these parameters signify, see [Salimans et al., 2017; Conti et al., 2017].

S3.5 IMPALA

The implementation of IMPALA used to generate the models in this paper is based on the one found here: https://github.com/deepmind/scalable_agent. The hyperparameters reported in Table S6 are the same as those reported in [Espeholt et al., 2018].

Hyperparameter Setting

Number of actors 25
Image Width 84
Image Height 84
Grayscaling Yes
Action Repetitions 4
Max-pool over last N action repeat frames 2
Frame Stacking 4
End of episode when life lost Yes
Reward Clipping [-1, 1]
Unroll Length (n) 20
Batch size 32
Discount (γ) 0.99
Baseline loss scaling 0.5
Entropy Regularizer 0.01
RMSProp momentum 0.0
RMSProp ε 0.01
Learning rate 0.0006
Clip global gradient norm 40.0
Learning rate schedule Anneal linearly to 0 from beginning to end of training

Table S6: IMPALA Hyperparameters. For more details on what these parameters signify, see [Espeholt et al., 2018].

References

Marc G. Bellemare, Pablo Samuel Castro, Carles Gelada, Saurabh Kumar, and Subhodeep Moitra. Dopamine. GitHub repository, 2018.

Edoardo Conti, Vashisht Madhavan, Felipe Petroski Such, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. arXiv preprint arXiv:1712.06560, 2017.


Christoph Dann, Katja Hofmann, and Sebastian Nowozin. Memory lens: How much memory does an agent use? arXiv preprint arXiv:1611.06928, 2016.

Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI Baselines. GitHub repository, 2017.

Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.

Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt, and David Silver. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018.

Lucid: A collection of infrastructure and tools for research in neural network interpretability. https://github.com/tensorflow/lucid, 2018.

Aravindh Mahendran and Andrea Vedaldi. Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision, 120(3):233–255, 2016.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2017. https://distill.pub/2017/feature-visualization.

Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567, 2017.

Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.