Advances in Deep-Q Learning.pdf


TRANSCRIPT


    Deep Reinforcement Learning with Double Q-learning

Hado van Hasselt, Arthur Guez and David Silver
Google DeepMind

    Abstract

The popular Q-learning algorithm is known to overestimate action values under certain conditions. It was not previously known whether, in practice, such overestimations are common, whether they harm performance, and whether they can generally be prevented. In this paper, we answer all these questions affirmatively. In particular, we first show that the recent DQN algorithm, which combines Q-learning with a deep neural network, suffers from substantial overestimations in some games in the Atari 2600 domain. We then show that the idea behind the Double Q-learning algorithm, which was introduced in a tabular setting, can be generalized to work with large-scale function approximation. We propose a specific adaptation to the DQN algorithm and show that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but that this also leads to much better performance on several games.

The goal of reinforcement learning (Sutton and Barto, 1998) is to learn good policies for sequential decision problems, by optimizing a cumulative future reward signal. Q-learning (Watkins, 1989) is one of the most popular reinforcement learning algorithms, but it is known to sometimes learn unrealistically high action values because it includes a maximization step over estimated action values, which tends to prefer overestimated to underestimated values.

In previous work, overestimations have been attributed to insufficiently flexible function approximation (Thrun and Schwartz, 1993) and noise (van Hasselt, 2010, 2011). In this paper, we unify these views and show overestimations can occur when the action values are inaccurate, irrespective of the source of approximation error. Of course, imprecise value estimates are the norm during learning, which indicates that overestimations may be much more common than previously appreciated.

It is an open question whether, if the overestimations do occur, this negatively affects performance in practice. Overoptimistic value estimates are not necessarily a problem in and of themselves. If all values were uniformly higher, the relative action preferences would be preserved and we would not expect the resulting policy to be any worse. Furthermore, it is known that sometimes it is good to be optimistic: optimism in the face of uncertainty is a well-known

Copyright © 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

exploration technique (Kaelbling et al., 1996). If, however, the overestimations are not uniform and not concentrated at states about which we wish to learn more, then they might negatively affect the quality of the resulting policy. Thrun and Schwartz (1993) give specific examples in which this leads to suboptimal policies, even asymptotically.

To test whether overestimations occur in practice and at scale, we investigate the performance of the recent DQN algorithm (Mnih et al., 2015). DQN combines Q-learning with a flexible deep neural network and was tested on a varied and large set of deterministic Atari 2600 games, reaching human-level performance on many games. In some ways, this setting is a best-case scenario for Q-learning, because the deep neural network provides flexible function approximation with the potential for a low asymptotic approximation error, and the determinism of the environments prevents the harmful effects of noise. Perhaps surprisingly, we show that even in this comparatively favorable setting DQN sometimes substantially overestimates the values of the actions.

We show that the idea behind the Double Q-learning algorithm (van Hasselt, 2010), which was first proposed in a tabular setting, can be generalized to work with arbitrary function approximation, including deep neural networks. We use this to construct a new algorithm we call Double DQN. We then show that this algorithm not only yields more accurate value estimates, but leads to much higher scores on several games. This demonstrates that the overestimations of DQN were indeed leading to poorer policies and that it is beneficial to reduce them. In addition, by improving upon DQN we obtain state-of-the-art results on the Atari domain.

    Background

To solve sequential decision problems we can learn estimates for the optimal value of each action, defined as the expected sum of future rewards when taking that action and following the optimal policy thereafter. Under a given policy π, the true value of an action a in a state s is

Q_π(s, a) ≡ E[ R1 + γR2 + γ²R3 + ... | S0 = s, A0 = a, π ] ,

where γ ∈ [0, 1] is a discount factor that trades off the importance of immediate and later rewards. The optimal value is then Q*(s, a) = max_π Q_π(s, a). An optimal policy is easily derived from the optimal values by selecting the highest-valued action in each state.
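To make the last sentence concrete, here is a minimal sketch (my own, not from the paper) of extracting a greedy policy from a table of estimated action values; the toy array and helper name are assumptions for the example.

```python
import numpy as np

# Minimal sketch: derive a greedy policy from estimated action values Q[s, a].
def greedy_policy(q_table: np.ndarray) -> np.ndarray:
    """Return, for each state, the index of the highest-valued action."""
    return q_table.argmax(axis=1)

# Example: 3 states, 2 actions.
Q = np.array([[0.1, 0.4],
              [0.7, 0.2],
              [0.0, 0.0]])
print(greedy_policy(Q))  # -> [1 0 0]
```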

arXiv:1509.06461v3 [cs.LG] 8 Dec 2015


Estimates for the optimal action values can be learned using Q-learning (Watkins, 1989), a form of temporal difference learning (Sutton, 1988). Most interesting problems are too large to learn all action values in all states separately. Instead, we can learn a parameterized value function Q(s, a; θt). The standard Q-learning update for the parameters after taking action At in state St and observing the immediate reward Rt+1 and resulting state St+1 is then

θt+1 = θt + α (Y^Q_t − Q(St, At; θt)) ∇θt Q(St, At; θt) ,  (1)

where α is a scalar step size and the target Y^Q_t is defined as

Y^Q_t ≡ Rt+1 + γ max_a Q(St+1, a; θt) .  (2)

This update resembles stochastic gradient descent, updating the current value Q(St, At; θt) towards a target value Y^Q_t.
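The sketch below is one illustrative reading of update (1)-(2) for a linear approximator Q(s, a; θ) = θ[a]·φ(s); the feature map, step size and discount are assumptions chosen for the example, not values from the paper.

```python
import numpy as np

# One Q-learning update in the sense of equations (1)-(2), linear case.
def q_learning_update(theta, phi_s, a, r, phi_s_next, alpha=0.1, gamma=0.99):
    q_sa = theta[a] @ phi_s                           # current estimate Q(S_t, A_t; theta_t)
    target = r + gamma * np.max(theta @ phi_s_next)   # Y_t^Q from equation (2)
    td_error = target - q_sa
    theta[a] += alpha * td_error * phi_s              # gradient of Q w.r.t. theta[a] is phi(s)
    return td_error

theta = np.zeros((4, 3))                              # 4 actions, 3 features (illustrative sizes)
print(q_learning_update(theta, np.ones(3), a=2, r=1.0, phi_s_next=np.ones(3)))
```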

    Deep Q Networks

A deep Q network (DQN) is a multi-layered neural network that for a given state s outputs a vector of action values Q(s, ·; θ), where θ are the parameters of the network. For an n-dimensional state space and an action space containing m actions, the neural network is a function from R^n to R^m. Two important ingredients of the DQN algorithm as proposed by Mnih et al. (2015) are the use of a target network, and the use of experience replay. The target network, with parameters θ⁻, is the same as the online network except that its parameters are copied every τ steps from the online network, so that then θ⁻t = θt, and kept fixed on all other steps. The target used by DQN is then

Y^DQN_t ≡ Rt+1 + γ max_a Q(St+1, a; θ⁻t) .  (3)

For the experience replay (Lin, 1992), observed transitions are stored for some time and sampled uniformly from this memory bank to update the network. Both the target network and the experience replay dramatically improve the performance of the algorithm (Mnih et al., 2015).
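The following sketch (not the authors' code) shows the two ingredients just described: a uniformly sampled replay memory and targets built from a periodically copied target network. The q_target argument stands in for the target network's forward pass; its dummy form here is an assumption of the example.

```python
import random
from collections import deque
import numpy as np

replay = deque(maxlen=1_000_000)                 # stores (s, a, r, s_next, done) tuples

def dqn_targets(batch, q_target, gamma=0.99):
    """Equation (3): Y_t^DQN = R_{t+1} + gamma * max_a Q(S_{t+1}, a; theta^-)."""
    targets = []
    for s, a, r, s_next, done in batch:
        bootstrap = 0.0 if done else gamma * np.max(q_target(s_next))
        targets.append(r + bootstrap)
    return np.array(targets)

# Usage fragment: fill the memory with toy transitions, sample a minibatch
# uniformly, and build targets with a placeholder target network.
for t in range(100):
    replay.append((t, 0, 1.0, t + 1, False))
batch = random.sample(list(replay), k=32)
print(dqn_targets(batch, q_target=lambda s: np.zeros(4)))
```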

    Double Q-learning

The max operator in standard Q-learning and DQN, in (2) and (3), uses the same values both to select and to evaluate an action. This makes it more likely to select overestimated values, resulting in overoptimistic value estimates. To prevent this, we can decouple the selection from the evaluation. This is the idea behind Double Q-learning (van Hasselt, 2010).

In the original Double Q-learning algorithm, two value functions are learned by assigning each experience randomly to update one of the two value functions, such that there are two sets of weights, θ and θ′. For each update, one set of weights is used to determine the greedy policy and the other to determine its value. For a clear comparison, we can first untangle the selection and evaluation in Q-learning and rewrite its target (2) as

Y^Q_t = Rt+1 + γ Q(St+1, argmax_a Q(St+1, a; θt); θt) .

    The Double Q-learning error can then be written as

Y^DoubleQ_t ≡ Rt+1 + γ Q(St+1, argmax_a Q(St+1, a; θt); θ′t) .  (4)

Notice that the selection of the action, in the argmax, is still due to the online weights θt. This means that, as in Q-learning, we are still estimating the value of the greedy policy according to the current values, as defined by θt. However, we use the second set of weights θ′t to fairly evaluate the value of this policy. This second set of weights can be updated symmetrically by switching the roles of θ and θ′.
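As a sketch of one tabular update in the spirit of (4) (my own illustration, not the paper's code), one table selects the greedy action, the other evaluates it, and the roles are swapped at random per update; the table sizes and constants below are illustrative.

```python
import numpy as np

def double_q_update(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.99):
    if np.random.rand() < 0.5:
        QA, QB = QB, QA                        # switch the roles of theta and theta'
    a_star = np.argmax(QA[s_next])             # selection with one set of values
    target = r + gamma * QB[s_next, a_star]    # evaluation with the other set
    QA[s, a] += alpha * (target - QA[s, a])

QA, QB = np.zeros((5, 2)), np.zeros((5, 2))    # 5 states, 2 actions
double_q_update(QA, QB, s=0, a=1, r=1.0, s_next=3)
```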

    Overoptimism due to estimation errors

Q-learning's overestimations were first investigated by Thrun and Schwartz (1993), who showed that if the action values contain random errors uniformly distributed in an interval [−ε, ε] then each target is overestimated up to γε (m − 1)/(m + 1), where m is the number of actions. In addition, Thrun and Schwartz give a concrete example in which these overestimations even asymptotically lead to sub-optimal policies, and show the overestimations manifest themselves in a small toy problem when using function approximation. Later van Hasselt (2010) argued that noise in the environment can lead to overestimations even when using tabular representation, and proposed Double Q-learning as a solution.

In this section we demonstrate more generally that estimation errors of any kind can induce an upward bias, regardless of whether these errors are due to environmental noise, function approximation, non-stationarity, or any other source. This is important, because in practice any method will incur some inaccuracies during learning, simply due to the fact that the true values are initially unknown.

The result by Thrun and Schwartz (1993) cited above gives an upper bound to the overestimation for a specific setup, but it is also possible, and potentially more interesting, to derive a lower bound.

Theorem 1. Consider a state s in which all the true optimal action values are equal at Q*(s, a) = V*(s) for some V*(s). Let Qt be arbitrary value estimates that are on the whole unbiased in the sense that ∑_a (Qt(s, a) − V*(s)) = 0, but that are not all correct, such that (1/m) ∑_a (Qt(s, a) − V*(s))² = C for some C > 0, where m ≥ 2 is the number of actions in s. Under these conditions, max_a Qt(s, a) ≥ V*(s) + √(C/(m − 1)). This lower bound is tight. Under the same conditions, the lower bound on the absolute error of the Double Q-learning estimate is zero. (Proof in appendix.)
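The numerical check below is my own illustration of the tightness claim, not the paper's proof: one configuration of errors that meets the unbiasedness and mean-squared-error conditions and attains the lower bound √(C/(m − 1)) exactly.

```python
import numpy as np

m, C, V = 10, 2.0, 0.0
errors = np.full(m, np.sqrt(C / (m - 1)))
errors[-1] = -np.sqrt(C * (m - 1))            # one large negative error balances the rest
Q = V + errors

assert abs(np.sum(Q - V)) < 1e-9              # unbiased on the whole
assert abs(np.mean((Q - V) ** 2) - C) < 1e-9  # mean squared error equals C
print(np.max(Q) - V, np.sqrt(C / (m - 1)))    # both ~0.4714: the bound is attained
```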

Note that we did not need to assume that estimation errors for different actions are independent. This theorem shows that even if the value estimates are on average correct, estimation errors of any source can drive the estimates up and away from the true optimal values.

The lower bound in Theorem 1 decreases with the number of actions. This is an artifact of considering the lower bound, which requires very specific values to be attained. More typically, the overoptimism increases with the number of actions as shown in Figure 1. Q-learning's overestimations there indeed increase with the number of actions,


Figure 1: The orange bars show the bias in a single Q-learning update when the action values are Q*(s, a) = V*(s) + ε_a and the errors {ε_a}^m_{a=1} are independent standard normal random variables. The second set of action values Q′, used for the blue bars, was generated identically and independently. All bars are the average of 100 repetitions.

while Double Q-learning is unbiased. As another example, if for all actions Q*(s, a) = V*(s) and the estimation errors Qt(s, a) − V*(s) are uniformly random in [−1, 1], then the overoptimism is (m − 1)/(m + 1). (Proof in appendix.)

We now turn to function approximation and consider a real-valued continuous state space with 10 discrete actions in each state. For simplicity, the true optimal action values in this example depend only on state so that in each state all actions have the same true value. These true values are shown in the left column of plots in Figure 2 (purple lines) and are defined as either Q*(s, a) = sin(s) (top row) or Q*(s, a) = 2 exp(−s²) (middle and bottom rows). The left plots also show an approximation for a single action (green lines) as a function of state as well as the samples the estimate is based on (green dots). The estimate is a d-degree polynomial that is fit to the true values at sampled states, where d = 6 (top and middle rows) or d = 9 (bottom row). The samples match the true function exactly: there is no noise and we assume we have ground truth for the action value on these sampled states. The approximation is inexact even on the sampled states for the top two rows because the function approximation is insufficiently flexible. In the bottom row, the function is flexible enough to fit the green dots, but this reduces the accuracy in unsampled states. Notice that the sampled states are spaced further apart near the left side of the left plots, resulting in larger estimation errors. In many ways this is a typical learning setting, where at each point in time we only have limited data.

The middle column of plots in Figure 2 shows estimated action value functions for all 10 actions (green lines), as functions of state, along with the maximum action value in each state (black dashed line). Although the true value function is the same for all actions, the approximations differ because we have supplied different sets of sampled states.¹

The maximum is often higher than the ground truth shown in purple on the left. This is confirmed in the right plots, which show the difference between the black and purple curves in orange. The orange line is almost always positive,

¹Each action-value function is fit with a different subset of integer states. States −6 and 6 are always included to avoid extrapolations, and for each action two adjacent integers are missing: for action a1 states −5 and −4 are not sampled, for a2 states −4 and −3 are not sampled, and so on. This causes the estimated values to differ.

indicating an upward bias. The right plots also show the estimates from Double Q-learning in blue², which are on average much closer to zero. This demonstrates that Double Q-learning indeed can successfully reduce the overoptimism of Q-learning.

The different rows in Figure 2 show variations of the same experiment. The difference between the top and middle rows is the true value function, demonstrating that overestimations are not an artifact of a specific true value function. The difference between the middle and bottom rows is the flexibility of the function approximation. In the left-middle plot, the estimates are even incorrect for some of the sampled states because the function is insufficiently flexible. The function in the bottom-left plot is more flexible but this causes higher estimation errors for unseen states, resulting in higher overestimations. This is important because flexible parametric function approximators are often employed in reinforcement learning (see, e.g., Tesauro 1995; Sallans and Hinton 2004; Riedmiller 2005; Mnih et al. 2015).

In contrast to van Hasselt (2010) we did not use a statistical argument to find overestimations; the process to obtain Figure 2 is fully deterministic. In contrast to Thrun and Schwartz (1993), we did not rely on inflexible function approximation with irreducible asymptotic errors; the bottom row shows that a function that is flexible enough to cover all samples leads to high overestimations. This indicates that the overestimations can occur quite generally.

In the examples above, overestimations occur even when assuming we have samples of the true action value at certain states. The value estimates can further deteriorate if we bootstrap off of action values that are already overoptimistic, since this causes overestimations to propagate throughout our estimates. Although uniformly overestimating values might not hurt the resulting policy, in practice overestimation errors will differ for different states and actions. Overestimation combined with bootstrapping then has the pernicious effect of propagating the wrong relative information about which states are more valuable than others, directly affecting the quality of the learned policies.

The overestimations should not be confused with optimism in the face of uncertainty (Sutton, 1990; Agrawal, 1995; Kaelbling et al., 1996; Auer et al., 2002; Brafman and Tennenholtz, 2003; Szita and Lőrincz, 2008; Strehl et al., 2009), where an exploration bonus is given to states or actions with uncertain values. Conversely, the overestimations discussed here occur only after updating, resulting in overoptimism in the face of apparent certainty. This was already observed by Thrun and Schwartz (1993), who noted that, in contrast to optimism in the face of uncertainty, these overestimations actually can impede learning an optimal policy. We will see this negative effect on policy quality confirmed later in the experiments as well: when we reduce the overestimations using Double Q-learning, the policies improve.

²We arbitrarily used the samples of action a_{i+5} (for i ≤ 5) or a_{i−5} (for i > 5) as the second set of samples for the double estimator of action a_i.


Figure 2: Illustration of overestimations during learning. In each state (x-axis), there are 10 actions. The left column shows the true values V*(s) (purple line). All true action values are defined by Q*(s, a) = V*(s). The green line shows estimated values Q(s, a) for one action as a function of state, fitted to the true value at several sampled states (green dots). The middle column plots show all the estimated values (green), and the maximum of these values (dashed black). The maximum is higher than the true value (purple, left plot) almost everywhere. The right column plots show the difference in orange. The blue line in the right plots is the estimate used by Double Q-learning with a second set of samples for each state. The blue line is much closer to zero, indicating less bias. The three rows correspond to different true functions (left, purple) or capacities of the fitted function (left, green). (Details in the text.)

Double DQN

The idea of Double Q-learning is to reduce overestimations by decomposing the max operation in the target into action selection and action evaluation. Although not fully decoupled, the target network in the DQN architecture provides a natural candidate for the second value function, without having to introduce additional networks. We therefore propose to evaluate the greedy policy according to the online network, but using the target network to estimate its value. In reference to both Double Q-learning and DQN, we refer to the resulting algorithm as Double DQN. Its update is the same as for DQN, but replacing the target Y^DQN_t with

Y^DoubleDQN_t ≡ Rt+1 + γ Q(St+1, argmax_a Q(St+1, a; θt), θ⁻t) .

In comparison to Double Q-learning (4), the weights of the second network θ′t are replaced with the weights of the target network θ⁻t for the evaluation of the current greedy policy. The update to the target network stays unchanged from DQN, and remains a periodic copy of the online network.
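The sketch below (my own illustration) contrasts the DQN target (3) with the Double DQN target above: the online network selects the action, the target network evaluates it. The q-value vectors are placeholders for the two networks' outputs.

```python
import numpy as np

def dqn_target(r, q_target_next, gamma=0.99):
    return r + gamma * np.max(q_target_next)

def double_dqn_target(r, q_online_next, q_target_next, gamma=0.99):
    a_star = np.argmax(q_online_next)           # selection by the online network
    return r + gamma * q_target_next[a_star]    # evaluation by the target network

q_online_next = np.array([1.0, 2.0, 0.5])
q_target_next = np.array([0.8, 1.1, 3.0])       # disagrees with the online network
print(dqn_target(0.0, q_target_next))           # 2.97: takes the max of the evaluator itself
print(double_dqn_target(0.0, q_online_next, q_target_next))  # 1.089: decoupled selection
```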

This version of Double DQN is perhaps the minimal possible change to DQN towards Double Q-learning. The goal is to get most of the benefit of Double Q-learning, while keeping the rest of the DQN algorithm intact for a fair comparison, and with minimal computational overhead.

Empirical results

In this section, we analyze the overestimations of DQN and show that Double DQN improves over DQN both in terms of value accuracy and in terms of policy quality. To further test the robustness of the approach we additionally evaluate the algorithms with random starts generated from expert human trajectories, as proposed by Nair et al. (2015).

Our testbed consists of Atari 2600 games, using the Arcade Learning Environment (Bellemare et al., 2013). The goal is for a single algorithm, with a fixed set of hyperparameters, to learn to play each of the games separately from interaction given only the screen pixels as input. This is a demanding testbed: not only are the inputs high-dimensional, the game visuals and game mechanics vary substantially between games. Good solutions must therefore rely heavily on the learning algorithm; it is not practically feasible to overfit the domain by relying only on tuning.

We closely follow the experimental setting and network architecture outlined by Mnih et al. (2015). Briefly, the network architecture is a convolutional neural network (Fukushima, 1988; LeCun et al., 1998) with 3 convolution layers and a fully-connected hidden layer (approximately 1.5M parameters in total). The network takes the last four frames as input and outputs the action value of each action. On each game, the network is trained on a single GPU for 200M frames, or approximately 1 week.

    Results on overoptimism

Figure 3 shows examples of DQN's overestimations in six Atari games. DQN and Double DQN were both trained under the exact conditions described by Mnih et al. (2015). DQN is consistently and sometimes vastly overoptimistic about the value of the current greedy policy, as can be seen by comparing the orange learning curves in the top row of plots to the straight orange lines, which represent the actual discounted value of the best learned policy. More precisely, the (averaged) value estimates are computed regularly during training with full evaluation phases of length T = 125,000 steps as

(1/T) ∑_{t=1}^{T} max_a Q(St, a; θ) .


Figure 3: The top and middle rows show value estimates by DQN (orange) and Double DQN (blue) on six Atari games. The results are obtained by running DQN and Double DQN with 6 different random seeds with the hyper-parameters employed by Mnih et al. (2015). The darker line shows the median over seeds and we average the two extreme values to obtain the shaded area (i.e., 10% and 90% quantiles with linear interpolation). The straight horizontal orange (for DQN) and blue (for Double DQN) lines in the top row are computed by running the corresponding agents after learning concluded, and averaging the actual discounted return obtained from each visited state. These straight lines would match the learning curves at the right side of the plots if there is no bias. The middle row shows the value estimates (in log scale) for two games in which DQN's overoptimism is quite extreme. The bottom row shows the detrimental effect of this on the score achieved by the agent as it is evaluated during training: the scores drop when the overestimations begin. Learning with Double DQN is much more stable.

The ground truth averaged values are obtained by running the best learned policies for several episodes and computing the actual cumulative rewards. Without overestimations we would expect these quantities to match up (i.e., the curve to match the straight line at the right of each plot). Instead, the learning curves of DQN consistently end up much higher than the true values. The learning curves for Double DQN, shown in blue, are much closer to the blue straight line representing the true value of the final policy. Note that the blue straight line is often higher than the orange straight line. This indicates that Double DQN does not just produce more accurate value estimates but also better policies.

More extreme overestimations are shown in the middle two plots, where DQN is highly unstable on the games Asterix and Wizard of Wor. Notice the log scale for the values on the y-axis. The bottom two plots show the corresponding scores for these two games. Notice that the increases in value estimates for DQN in the middle plots coincide with decreasing scores in the bottom plots. Again, this indicates that the overestimations are harming the quality of the resulting policies. If seen in isolation, one might perhaps be tempted to think the observed instability is related to inherent instability problems of off-policy learning with function approximation (Baird, 1995; Tsitsiklis and Van Roy, 1997; Sutton et al., 2008; Maei, 2011; Sutton et al., 2015). However, we see that learning is much more stable with Double DQN,

          DQN      Double DQN
Median    93.5%    114.7%
Mean      241.1%   330.3%

Table 1: Summary of normalized performance up to 5 minutes of play on 49 games. Results for DQN are from Mnih et al. (2015).

suggesting that the cause for these instabilities is in fact Q-learning's overoptimism. Figure 3 only shows a few examples, but overestimations were observed for DQN in all 49 tested Atari games, albeit in varying amounts.

    Quality of the learned policies

Overoptimism does not always adversely affect the quality of the learned policy. For example, DQN achieves optimal behavior in Pong despite slightly overestimating the policy value. Nevertheless, reducing overestimations can significantly benefit the stability of learning; we see clear examples of this in Figure 3. We now assess more generally how much Double DQN helps in terms of policy quality by evaluating on all 49 games that DQN was tested on.

As described by Mnih et al. (2015), each evaluation episode starts by executing a special no-op action that does not affect the environment up to 30 times, to provide different starting points for the agent. Some exploration during evaluation provides additional randomization. For Double DQN we used the exact same hyper-parameters as for DQN,


          DQN      Double DQN   Double DQN (tuned)
Median    47.5%    88.4%        116.7%
Mean      122.0%   273.1%       475.2%

Table 2: Summary of normalized performance up to 30 minutes of play on 49 games with human starts. Results for DQN are from Nair et al. (2015).

to allow for a controlled experiment focused just on reducing overestimations. The learned policies are evaluated for 5 mins of emulator time (18,000 frames) with an ε-greedy policy where ε = 0.05. The scores are averaged over 100 episodes. The only difference between Double DQN and DQN is the target, using Y^DoubleDQN_t rather than Y^DQN_t. This evaluation is somewhat adversarial, as the used hyper-parameters were tuned for DQN but not for Double DQN.

To obtain summary statistics across games, we normalize the score for each game as follows:

score_normalized = (score_agent − score_random) / (score_human − score_random) .  (5)

The random and human scores are the same as used by Mnih et al. (2015), and are given in the appendix.
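A direct transcription of equation (5) into code, with made-up example scores:

```python
def normalized_score(agent, random_score, human_score):
    """Equation (5): 0.0 is random-level play, 1.0 (100%) is human-level play."""
    return (agent - random_score) / (human_score - random_score)

# e.g. an agent scoring halfway between random and human play gets 0.5 (50%).
print(normalized_score(agent=50.0, random_score=0.0, human_score=100.0))
```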

Table 1, under no-ops, shows that on the whole Double DQN clearly improves over DQN. A detailed comparison (in appendix) shows that there are several games in which Double DQN greatly improves upon DQN. Noteworthy examples include Road Runner (from 233% to 617%), Asterix (from 70% to 180%), Zaxxon (from 54% to 111%), and Double Dunk (from 17% to 397%).

The Gorila algorithm (Nair et al., 2015), which is a massively distributed version of DQN, is not included in the table because the architecture and infrastructure is sufficiently different to make a direct comparison unclear. For completeness, we note that Gorila obtained median and mean normalized scores of 96% and 495%, respectively.

Robustness to Human starts

One concern with the previous evaluation is that in deterministic games with a unique starting point the learner could potentially learn to remember sequences of actions without much need to generalize. While successful, the solution would not be particularly robust. By testing the agents from various starting points, we can test whether the found solutions generalize well, and as such provide a challenging testbed for the learned policies (Nair et al., 2015).

We obtained 100 starting points sampled for each game from a human expert's trajectory, as proposed by Nair et al. (2015). We start an evaluation episode from each of these starting points and run the emulator for up to 108,000 frames (30 mins at 60Hz including the trajectory before the starting point). Each agent is only evaluated on the rewards accumulated after the starting point.

For this evaluation we include a tuned version of Double DQN. Some tuning is appropriate because the hyperparameters were tuned for DQN, which is a different algorithm. For the tuned version of Double DQN, we increased the number of frames between each two copies of the target network from 10,000 to 30,000, to reduce overestimations further, because immediately after each switch DQN and Double DQN

Figure 4: Normalized scores on 57 Atari games, tested for 100 episodes per game with human starts. Compared to Mnih et al. (2015), eight additional games were tested. These are indicated with stars and a bold font.

both revert to Q-learning. In addition, we reduced the exploration during learning from ε = 0.1 to ε = 0.01, and then used ε = 0.001 during evaluation. Finally, the tuned version uses a single shared bias for all action values in the top layer of the network. Each of these changes improved performance and together they result in clearly better results.³

Table 2 reports summary statistics for this evaluation on the 49 games from Mnih et al. (2015). Double DQN obtains clearly higher median and mean scores. Again Gorila DQN (Nair et al., 2015) is not included in the table, but for completeness note it obtained a median of 78% and a mean of 259%. Detailed results, plus results for an additional 8 games, are available in Figure 4 and in the appendix. On several games the improvements from DQN to Double DQN are striking, in some cases bringing scores much closer to

³Except for Tennis, where the lower ε during training seemed to hurt rather than help.


human, or even surpassing these.

Double DQN appears more robust to this more challenging evaluation, suggesting that appropriate generalizations occur and that the found solutions do not exploit the determinism of the environments. This is appealing, as it indicates progress towards finding general solutions rather than a deterministic sequence of steps that would be less robust.

Discussion

This paper has five contributions. First, we have shown why Q-learning can be overoptimistic in large-scale problems, even if these are deterministic, due to the inherent estimation errors of learning. Second, by analyzing the value estimates on Atari games we have shown that these overestimations are more common and severe in practice than previously acknowledged. Third, we have shown that Double Q-learning can be used at scale to successfully reduce this overoptimism, resulting in more stable and reliable learning. Fourth, we have proposed a specific implementation called Double DQN, that uses the existing architecture and deep neural network of the DQN algorithm without requiring additional networks or parameters. Finally, we have shown that Double DQN finds better policies, obtaining new state-of-the-art results on the Atari 2600 domain.

Acknowledgments

We would like to thank Tom Schaul, Volodymyr Mnih, Marc Bellemare, Thomas Degris, Georg Ostrovski, and Richard Sutton for helpful comments, and everyone at Google DeepMind for a constructive research environment.

References

R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, pages 1054–1078, 1995.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

L. Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning: Proceedings of the Twelfth International Conference, pages 30–37, 1995.

M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res. (JAIR), 47:253–279, 2013.

R. I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 3:213–231, 2003.

K. Fukushima. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural Networks, 1(2):119–130, 1988.

L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

L. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3):293–321, 1992.

H. R. Maei. Gradient temporal-difference learning algorithms. PhD thesis, University of Alberta, 2011.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. De Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, S. Legg, V. Mnih, K. Kavukcuoglu, and D. Silver. Massively parallel methods for deep reinforcement learning. In Deep Learning Workshop, ICML, 2015.

M. Riedmiller. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In J. Gama, R. Camacho, P. Brazdil, A. Jorge, and L. Torgo, editors, Proceedings of the 16th European Conference on Machine Learning (ECML'05), pages 317–328. Springer, 2005.

B. Sallans and G. E. Hinton. Reinforcement learning with factored states and actions. The Journal of Machine Learning Research, 5:1063–1088, 2004.

A. L. Strehl, L. Li, and M. L. Littman. Reinforcement learning in finite MDPs: PAC analysis. The Journal of Machine Learning Research, 10:2413–2444, 2009.

R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216–224, 1990.

R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, 1998.

R. S. Sutton, C. Szepesvári, and H. R. Maei. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. Advances in Neural Information Processing Systems 21 (NIPS-08), 21:1609–1616, 2008.

R. S. Sutton, A. R. Mahmood, and M. White. An emphatic approach to the problem of off-policy temporal-difference learning. arXiv preprint arXiv:1503.04269, 2015.

I. Szita and A. Lőrincz. The many faces of optimism: a unifying approach. In Proceedings of the 25th International Conference on Machine Learning, pages 1048–1055. ACM, 2008.

G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.

S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning. In M. Mozer, P. Smolensky, D. Touretzky, J. Elman, and A. Weigend, editors, Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ, 1993. Lawrence Erlbaum.

J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.

H. van Hasselt. Double Q-learning. Advances in Neural Information Processing Systems, 23:2613–2621, 2010.

H. van Hasselt. Insights in Reinforcement Learning. PhD thesis, Utrecht University, 2011.

C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, University of Cambridge, England, 1989.


Appendix

Theorem 1. Consider a state s in which all the true optimal action values are equal at Q*(s, a) = V*(s) for some V*(s). Let Qt be arbitrary value estimates that are on the whole unbiased in the sense that ∑_a (Qt(s, a) − V*(s)) = 0, but that are not all correct, such that (1/m) ∑_a (Qt(s, a) − V*(s))² = C for some C > 0, where m ≥ 2 is the number of actions in s. Under these conditions, max_a Qt(s, a) ≥ V*(s) + √(C/(m − 1)). This lower bound is tight. Under the same conditions, the lower bound on the absolute error of the Double Q-learning estimate is zero.

Proof of Theorem 1. Define the errors for each action a as ε_a = Qt(s, a) − V*(s). Suppose that there exists a setting of {ε_a} such that max_a ε_a ... Then the conditions of the theorem hold. If, furthermore, we have Qt(s, a1) = V*(s), then the error is zero. The remaining action values Qt(s, ai), for i > 1, are arbitrary.

Theorem 2. Consider a state s in which all the true optimal action values are equal at Q*(s, a) = V*(s). Suppose that the estimation errors Qt(s, a) − Q*(s, a) are independently distributed uniformly randomly in [−1, 1]. Then,

E[ max_a Qt(s, a) − V*(s) ] = (m − 1)/(m + 1) .

Proof. Define ε_a = Qt(s, a) − Q*(s, a); this is a uniform random variable in [−1, 1]. The probability that max_a ε_a ≤ x for some x is equal to the probability that ε_a ≤ x for all a simultaneously. Because the estimation errors are independent, we can derive

P(max_a ε_a ≤ x) = P(ε_1 ≤ x ∧ ε_2 ≤ x ∧ ... ∧ ε_m ≤ x) = ∏_{a=1}^{m} P(ε_a ≤ x) .

The function P(ε_a ≤ x) is the cumulative distribution function (CDF) of ε_a, which here is simply defined as

P(ε_a ≤ x) = 0 if x ≤ −1; (1 + x)/2 if x ∈ (−1, 1); 1 if x ≥ 1.

This implies that

P(max_a ε_a ≤ x) = ∏_{a=1}^{m} P(ε_a ≤ x) = 0 if x ≤ −1; ((1 + x)/2)^m if x ∈ (−1, 1); 1 if x ≥ 1.

This gives us the CDF of the random variable max_a ε_a. Its expectation can be written as an integral

E[ max_a ε_a ] = ∫_{−1}^{1} x f_max(x) dx ,

where f_max is the probability density function of this variable, defined as the derivative of the CDF: f_max(x) = (d/dx) P(max_a ε_a ≤ x), so that for x ∈ [−1, 1] we have f_max(x) = (m/2) ((1 + x)/2)^(m−1). Evaluating the integral yields

E[ max_a ε_a ] = ∫_{−1}^{1} x f_max(x) dx = [ ((x + 1)/2)^m (mx − 1)/(m + 1) ]_{−1}^{1} = (m − 1)/(m + 1) .
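A quick Monte Carlo check of Theorem 2 (my own verification, not part of the paper): with errors uniform in [−1, 1], the sample mean of the maximum over m actions should approach (m − 1)/(m + 1).

```python
import numpy as np

rng = np.random.default_rng(0)
for m in (2, 5, 10, 50):
    samples = rng.uniform(-1.0, 1.0, size=(100_000, m)).max(axis=1)
    print(m, samples.mean(), (m - 1) / (m + 1))   # empirical mean vs. closed form
```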

Experimental Details for the Atari 2600 Domain

We selected the 49 games to match the list used by Mnih et al. (2015); see the tables below for the full list. Each agent step is composed of four frames (the last selected action is repeated during these frames) and reward values (obtained from the Arcade Learning Environment (Bellemare et al., 2013)) are clipped between -1 and 1.


    Network Architecture

The convolution network used in the experiment is exactly the one proposed by Mnih et al. (2015); we only provide details here for completeness. The input to the network is an 84x84x4 tensor containing a rescaled, gray-scale version of the last four frames. The first convolution layer convolves the input with 32 filters of size 8 (stride 4), the second layer has 64 filters of size 4 (stride 2), and the final convolution layer has 64 filters of size 3 (stride 1). This is followed by a fully-connected hidden layer of 512 units. All these layers are separated by Rectified Linear Units (ReLU). Finally, a fully-connected linear layer projects to the output of the network, i.e., the Q-values. The optimization employed to train the network is RMSProp (with momentum parameter 0.95).
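For readers who want to see the layer sizes in one place, here is a rough re-expression of the described architecture; the original used a different framework, so the PyTorch code below is an assumption made purely for illustration.

```python
import torch
from torch import nn

def build_q_network(num_actions: int) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 84x84x4 input, 32 filters of size 8
        nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2),  # 64 filters of size 4
        nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1),  # 64 filters of size 3
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, 512),                  # fully-connected hidden layer of 512 units
        nn.ReLU(),
        nn.Linear(512, num_actions),                 # linear output layer: the Q-values
    )

q_net = build_q_network(num_actions=18)
print(q_net(torch.zeros(1, 4, 84, 84)).shape)        # torch.Size([1, 18])
```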

    Hyper-parameters

In all experiments, the discount was set to γ = 0.99, and the learning rate to α = 0.00025. The number of steps between target network updates was τ = 10,000. Training is done over 50M steps (i.e., 200M frames). The agent is evaluated every 1M steps, and the best policy across these evaluations is kept as the output of the learning process. The size of the experience replay memory is 1M tuples. The memory gets sampled to update the network every 4 steps with minibatches of size 32. The simple exploration policy used is an ε-greedy policy with ε decreasing linearly from 1 to 0.1 over 1M steps.
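The same hyper-parameters collected in one place, plus the linear ε schedule, as a sketch; the dictionary layout is my own, not from the paper.

```python
HYPERPARAMS = {
    "discount_gamma": 0.99,
    "learning_rate_alpha": 0.00025,
    "target_update_tau": 10_000,     # steps between target network copies
    "training_steps": 50_000_000,    # i.e. 200M frames at 4 frames per step
    "replay_capacity": 1_000_000,
    "update_every": 4,
    "minibatch_size": 32,
}

def epsilon(step: int) -> float:
    """Epsilon-greedy exploration: decay linearly from 1.0 to 0.1 over 1M steps."""
    return max(0.1, 1.0 - 0.9 * step / 1_000_000)
```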

Supplementary Results in the Atari 2600 Domain

The tables below provide further detailed results for our experiments in the Atari domain.


Game  Random  Human  DQN  Double DQN
Alien 227.80 6875.40 3069.33 2907.30
Amidar 5.80 1675.80 739.50 702.10
Assault 222.40 1496.40 3358.63 5022.90
Asterix 210.00 8503.30 6011.67 15150.00
Asteroids 719.10 13156.70 1629.33 930.60
Atlantis 12850.00 29028.10 85950.00 64758.00
Bank Heist 14.20 734.40 429.67 728.30
Battle Zone 2360.00 37800.00 26300.00 25730.00
Beam Rider 363.90 5774.70 6845.93 7654.00
Bowling 23.10 154.80 42.40 70.50
Boxing 0.10 4.30 71.83 81.70
Breakout 1.70 31.80 401.20 375.00
Centipede 2090.90 11963.20 8309.40 4139.40
Chopper Command 811.00 9881.80 6686.67 4653.00
Crazy Climber 10780.50 35410.50 114103.33 101874.00
Demon Attack 152.10 3401.30 9711.17 9711.90
Double Dunk -18.60 -15.50 -18.07 -6.30
Enduro 0.00 309.60 301.77 319.50
Fishing Derby -91.70 5.50 -0.80 20.30
Freeway 0.00 29.60 30.30 31.80
Frostbite 65.20 4334.70 328.33 241.50
Gopher 257.60 2321.00 8520.00 8215.40
Gravitar 173.00 2672.00 306.67 170.50
H.E.R.O. 1027.00 25762.50 19950.33 20357.00
Ice Hockey -11.20 0.90 -1.60 -2.40
James Bond 29.00 406.70 576.67 438.00
Kangaroo 52.00 3035.00 6740.00 13651.00
Krull 1598.00 2394.60 3804.67 4396.70
Kung-Fu Master 258.50 22736.20 23270.00 29486.00
Montezuma's Revenge 0.00 4366.70 0.00 0.00
Ms. Pacman 307.30 15693.40 2311.00 3210.00
Name This Game 2292.30 4076.20 7256.67 6997.10
Pong -20.70 9.30 18.90 21.00
Private Eye 24.90 69571.30 1787.57 670.10
Q*Bert 163.90 13455.00 10595.83 14875.00
River Raid 1338.50 13513.30 8315.67 12015.30
Road Runner 11.50 7845.00 18256.67 48377.00
Robotank 2.20 11.90 51.57 46.70
Seaquest 68.40 20181.80 5286.00 7995.00
Space Invaders 148.00 1652.30 1975.50 3154.60
Star Gunner 664.00 10250.00 57996.67 65188.00
Tennis -23.80 -8.90 -2.47 1.70
Time Pilot 3568.00 5925.00 5946.67 7964.00
Tutankham 11.40 167.60 186.70 190.60
Up and Down 533.40 9082.00 8456.33 16769.90
Venture 0.00 1187.50 380.00 93.00
Video Pinball 16256.90 17297.60 42684.07 70009.00
Wizard of Wor 563.50 4756.50 3393.33 5204.00
Zaxxon 32.50 9173.30 4976.67 10182.00

    Table 3: Raw scores for the no-op evaluation condition (5 minutes emulator time). DQN as given by Mnih et al. (2015).


Game  DQN  Double DQN
Alien 42.75% 40.31%
Amidar 43.93% 41.69%
Assault 246.17% 376.81%
Asterix 69.96% 180.15%
Asteroids 7.32% 1.70%
Atlantis 451.85% 320.85%
Bank Heist 57.69% 99.15%
Battle Zone 67.55% 65.94%
Beam Rider 119.80% 134.73%
Bowling 14.65% 35.99%
Boxing 1707.86% 1942.86%
Breakout 1327.24% 1240.20%
Centipede 62.99% 20.75%
Chopper Command 64.78% 42.36%
Crazy Climber 419.50% 369.85%
Demon Attack 294.20% 294.22%
Double Dunk 17.10% 396.77%
Enduro 97.47% 103.20%
Fishing Derby 93.52% 115.23%
Freeway 102.36% 107.43%
Frostbite 6.16% 4.13%
Gopher 400.43% 385.66%
Gravitar 5.35% -0.10%
H.E.R.O. 76.50% 78.15%
Ice Hockey 79.34% 72.73%
James Bond 145.00% 108.29%
Kangaroo 224.20% 455.88%
Krull 277.01% 351.33%
Kung-Fu Master 102.37% 130.03%
Montezuma's Revenge 0.00% 0.00%
Ms. Pacman 13.02% 18.87%
Name This Game 278.29% 263.74%
Pong 132.00% 139.00%
Private Eye 2.53% 0.93%
Q*Bert 78.49% 110.68%
River Raid 57.31% 87.70%
Road Runner 232.91% 617.42%
Robotank 508.97% 458.76%
Seaquest 25.94% 39.41%
Space Invaders 121.49% 199.87%
Star Gunner 598.09% 673.11%
Tennis 143.15% 171.14%
Time Pilot 100.92% 186.51%
Tutankham 112.23% 114.72%
Up and Down 92.68% 189.93%
Venture 32.00% 7.83%
Video Pinball 2539.36% 5164.99%
Wizard of Wor 67.49% 110.67%
Zaxxon 54.09% 111.04%

    Table 4: Normalized results for no-op evaluation condition (5 minutes emulator time).


Game  Random  Human  DQN  Double DQN  Double DQN (tuned)
Alien 128.30 6371.30 570.2 621.6 1033.4
Amidar 11.80 1540.40 133.4 188.2 169.1
Assault 166.90 628.90 3332.3 2774.3 6060.8
Asterix 164.50 7536.00 124.5 5285.0 16837.0
Asteroids 871.30 36517.30 697.1 1219.0 1193.2
Atlantis 13463.00 26575.00 76108.0 260556.0 319688.0
Bank Heist 21.70 644.50 176.3 469.8 886.0
Battle Zone 3560.00 33030.00 17560.0 25240.0 24740.0
Beam Rider 254.60 14961.00 8672.4 9107.9 17417.2
Berzerk 196.10 2237.50 635.8 1011.1
Bowling 35.20 146.50 41.2 62.3 69.6
Boxing -1.50 9.60 25.8 52.1 73.5
Breakout 1.60 27.90 303.9 338.7 368.9
Centipede 1925.50 10321.90 3773.1 5166.6 3853.5
Chopper Command 644.00 8930.00 3046.0 2483.0 3495.0
Crazy Climber 9337.00 32667.00 50992.0 94315.0 113782.0
Defender 1965.50 14296.00 8531.0 27510.0
Demon Attack 208.30 3442.80 12835.2 13943.5 69803.4
Double Dunk -16.00 -14.40 -21.6 -6.4 -0.3
Enduro -81.80 740.20 475.6 475.9 1216.6
Fishing Derby -77.10 5.10 -2.3 -3.4 3.2
Freeway 0.10 25.60 25.8 26.3 28.8
Frostbite 66.40 4202.80 157.4 258.3 1448.1
Gopher 250.00 2311.00 2731.8 8742.8 15253.0
Gravitar 245.50 3116.00 216.5 170.0 200.5
H.E.R.O. 1580.30 25839.40 12952.5 15341.4 14892.5
Ice Hockey -9.70 0.50 -3.8 -3.6 -2.5
James Bond 33.50 368.50 348.5 416.0 573.0
Kangaroo 100.00 2739.00 2696.0 6138.0 11204.0
Krull 1151.90 2109.10 3864.0 6130.4 6796.1
Kung-Fu Master 304.00 20786.80 11875.0 22771.0 30207.0
Montezuma's Revenge 25.00 4182.00 50.0 30.0 42.0
Ms. Pacman 197.80 15375.00 763.5 1401.8 1241.3
Name This Game 1747.80 6796.00 5439.9 7871.5 8960.3
Phoenix 1134.40 6686.20 10364.0 12366.5
Pit Fall -348.80 5998.90 -432.9 -186.7
Pong -18.00 15.50 16.2 17.7 19.1
Private Eye 662.80 64169.10 298.2 346.3 -575.5
Q*Bert 183.00 12085.00 4589.8 10713.3 11020.8
River Raid 588.30 14382.20 4065.3 6579.0 10838.4
Road Runner 200.00 6878.00 9264.0 43884.0 43156.0
Robotank 2.40 8.90 58.5 52.0 59.1
Seaquest 215.50 40425.80 2793.9 4199.4 14498.0
Skiing -15287.40 -3686.60 -29404.3 -11490.4
Solaris 2047.20 11032.60 2166.8 810.0
Space Invaders 182.60 1464.90 1449.7 1495.7 2628.7
Star Gunner 697.00 9528.00 34081.0 53052.0 58365.0
Surround -9.70 5.40 -7.6 1.9
Tennis -21.40 -6.70 -2.3 11.0 -7.8
Time Pilot 3273.00 5650.00 5640.0 5375.0 6608.0
Tutankham 12.70 138.30 32.4 63.6 92.2
Up and Down 707.20 9896.10 3311.3 4721.1 19086.9
Venture 18.00 1039.00 54.0 75.0 21.0
Video Pinball 20452.0 15641.10 20228.1 148883.6 367823.7
Wizard of Wor 804.00 4556.00 246.0 155.0 6201.0
Yars' Revenge 1476.90 47135.20 5439.5 6270.6
Zaxxon 475.00 8443.00 831.0 7874.0 8593.0

    Table 5: Raw scores for the human start condition (30 minutes emulator time). DQN as given by Nair et al. (2015).


Game  DQN  Double DQN  Double DQN (tuned)
Alien 7.08% 7.90% 14.50%
Amidar 7.95% 11.54% 10.29%
Assault 685.15% 564.37% 1275.74%
Asterix -0.54% 69.46% 226.18%
Asteroids -0.49% 0.98% 0.90%
Atlantis 477.77% 1884.48% 2335.46%
Bank Heist 24.82% 71.95% 138.78%
Battle Zone 47.51% 73.57% 71.87%
Beam Rider 57.24% 60.20% 116.70%
Berzerk 21.54% 39.92%
Bowling 5.39% 24.35% 30.91%
Boxing 245.95% 482.88% 675.68%
Breakout 1149.43% 1281.75% 1396.58%
Centipede 22.00% 38.60% 22.96%
Chopper Command 28.99% 22.19% 34.41%
Crazy Climber 178.55% 364.24% 447.69%
Defender 53.25% 207.17%
Demon Attack 390.38% 424.65% 2151.65%
Double Dunk -350.00% 600.00% 981.25%
Enduro 67.81% 67.85% 157.96%
Fishing Derby 91.00% 89.66% 97.69%
Freeway 100.78% 102.75% 112.55%
Frostbite 2.20% 4.64% 33.40%
Gopher 120.42% 412.07% 727.95%
Gravitar -1.01% -2.63% -1.57%
H.E.R.O. 46.88% 56.73% 54.88%
Ice Hockey 57.84% 59.80% 70.59%
James Bond 94.03% 114.18% 161.04%
Kangaroo 98.37% 228.80% 420.77%
Krull 283.34% 520.11% 589.66%
Kung-Fu Master 56.49% 109.69% 145.99%
Montezuma's Revenge 0.60% 0.12% 0.41%
Ms. Pacman 3.73% 7.93% 6.88%
Name This Game 73.14% 121.30% 142.87%
Phoenix 166.25% 202.31%
Pit Fall -1.32% 2.55%
Pong 102.09% 106.57% 110.75%
Private Eye -0.57% -0.50% -1.95%
Q*Bert 37.03% 88.48% 91.06%
River Raid 25.21% 43.43% 74.31%
Road Runner 135.73% 654.15% 643.25%
Robotank 863.08% 763.08% 872.31%
Seaquest 6.41% 9.91% 35.52%
Skiing -121.69% 32.73%
Solaris 1.33% -13.77%
Space Invaders 98.81% 102.40% 190.76%
Star Gunner 378.03% 592.85% 653.02%
Surround 13.91% 76.82%
Tennis 129.93% 220.41% 92.52%
Time Pilot 99.58% 88.43% 140.30%
Tutankham 15.68% 40.53% 63.30%
Up and Down 28.34% 43.68% 200.02%
Venture 3.53% 5.58% 0.29%
Video Pinball -4.65% 2669.60% 7220.51%
Wizard of Wor -14.87% -17.30% 143.84%
Yars' Revenge 8.68% 10.50%
Zaxxon 4.47% 92.86% 101.88%

    Table 6: Normalized scores for the human start condition (30 minutes emulator time).


    Published as a conference paper at ICLR 2016

    PRIORITIZED EXPERIENCE REPLAY

Tom Schaul, John Quan, Ioannis Antonoglou and David Silver
Google DeepMind
{schaul,johnquan,ioannisa,davidsilver}@google.com

    ABSTRACT

Experience replay lets online reinforcement learning agents remember and reuse experiences from the past. In prior work, experience transitions were uniformly sampled from a replay memory. However, this approach simply replays transitions at the same frequency that they were originally experienced, regardless of their significance. In this paper we develop a framework for prioritizing experience, so as to replay important transitions more frequently, and therefore learn more efficiently. We use prioritized experience replay in Deep Q-Networks (DQN), a reinforcement learning algorithm that achieved human-level performance across many Atari games. DQN with prioritized experience replay achieves a new state-of-the-art, outperforming DQN with uniform replay on 41 out of 49 games.

    1 INTRODUCTION

Online reinforcement learning (RL) agents incrementally update their parameters (of the policy, value function or model) while they observe a stream of experience. In their simplest form, they discard incoming data immediately, after a single update. Two issues with this are (a) strongly correlated updates that break the i.i.d. assumption of many popular stochastic gradient-based algorithms, and (b) the rapid forgetting of possibly rare experiences that would be useful later on.

Experience replay (Lin, 1992) addresses both of these issues: with experience stored in a replay memory, it becomes possible to break the temporal correlations by mixing more and less recent experience for the updates, and rare experience will be used for more than just a single update. This was demonstrated in the Deep Q-Network (DQN) algorithm (Mnih et al., 2013; 2015), which stabilized the training of a value function, represented by a deep neural network, by using experience replay. Specifically, DQN used a large sliding window replay memory, sampled from it uniformly at random, and revisited each transition¹ eight times on average. In general, experience replay can reduce the amount of experience required to learn, and replace it with more computation and more memory, which are often cheaper resources than the RL agent's interactions with its environment.

In this paper, we investigate how prioritizing which transitions are replayed can make experience replay more efficient and effective than if all transitions are replayed uniformly. The key idea is that an RL agent can learn more effectively from some transitions than from others. Transitions may be more or less surprising, redundant, or task-relevant. Some transitions may not be immediately useful to the agent, but might become so when the agent's competence increases (Schmidhuber, 1991). Experience replay liberates online learning agents from processing transitions in the exact order they are experienced. Prioritized replay further liberates agents from considering transitions with the same frequency that they are experienced.

In particular, we propose to more frequently replay transitions with high expected learning progress, as measured by the magnitude of their temporal-difference (TD) error. This prioritization can lead to a loss of diversity, which we alleviate with stochastic prioritization, and introduce bias, which we correct with importance sampling. Our resulting algorithms are robust and scalable, which we demonstrate on the Atari 2600 benchmark suite, where we obtain faster learning and state-of-the-art performance.

¹A transition is the atomic unit of interaction in RL, in our case a tuple of (state St−1, action At−1, reward Rt, discount γt, next state St). We choose this for simplicity, but most of the arguments in this paper also hold for coarser ways of chunking experience, e.g. into sequences or episodes.


arXiv:1511.05952v4 [cs.LG] 25 Feb 2016



    2 BACKGROUND

Numerous neuroscience studies have identified evidence of experience replay in the hippocampus of rodents, suggesting that sequences of prior experience are replayed, either during awake resting or sleep. Sequences associated with rewards appear to be replayed more frequently (Atherton et al., 2015; Ólafsdóttir et al., 2015; Foster & Wilson, 2006). Experiences with high magnitude TD error also appear to be replayed more often (Singer & Frank, 2009; McNamara et al., 2014).

It is well-known that planning algorithms such as value iteration can be made more efficient by prioritizing updates in an appropriate order. Prioritized sweeping (Moore & Atkeson, 1993; Andre et al., 1998) selects which state to update next, prioritized according to the change in value, if that update was executed. The TD error provides one way to measure these priorities (van Seijen & Sutton, 2013). Our approach uses a similar prioritization method, but for model-free RL rather than model-based planning. Furthermore, we use a stochastic prioritization that is more robust when learning a function approximator from samples.

TD-errors have also been used as a prioritization mechanism for determining where to focus resources, for example when choosing where to explore (White et al., 2014) or which features to select (Geramifard et al., 2011; Sun et al., 2011).

In supervised learning, there are numerous techniques to deal with imbalanced datasets when class identities are known, including re-sampling, under-sampling and over-sampling techniques, possibly combined with ensemble methods (for a review, see Galar et al., 2012). A recent paper introduced a form of re-sampling in the context of deep RL with experience replay (Narasimhan et al., 2015); the method separates experience into two buckets, one for positive and one for negative rewards, and then picks a fixed fraction from each to replay. This is only applicable in domains that (unlike ours) have a natural notion of positive/negative experience. Furthermore, Hinton (2007) introduced a form of non-uniform sampling based on error, with an importance sampling correction, which led to a 3x speed-up on MNIST digit classification.

There have been several proposed methods for playing Atari with deep reinforcement learning, including deep Q-networks (DQN) (Mnih et al., 2013; 2015; Guo et al., 2014; Stadie et al., 2015; Nair et al., 2015; Bellemare et al., 2016), and the Double DQN algorithm (van Hasselt et al., 2016), which is the current published state-of-the-art. Simultaneously with our work, an architectural innovation that separates advantages from the value function (see the co-submission by Wang et al., 2015) has also led to substantial improvements on the Atari benchmark.

3 PRIORITIZED REPLAY

Using a replay memory leads to design choices at two levels: which experiences to store, and which experiences to replay (and how to do so). This paper addresses only the latter: making the most effective use of the replay memory for learning, assuming that its contents are outside of our control (but see also Section 6).

3.1 A MOTIVATING EXAMPLE

To understand the potential gains of prioritization, we introduce an artificial 'Blind Cliffwalk' environment (described in Figure 1, left) that exemplifies the challenge of exploration when rewards are rare. With only n states, the environment requires an exponential number of random steps until the first non-zero reward; to be precise, the chance that a random sequence of actions will lead to the reward is 2^−n. Furthermore, the most relevant transitions (from rare successes) are hidden in a mass of highly redundant failure cases (similar to a bipedal robot falling over repeatedly, before it discovers how to walk).
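A minimal sketch of a Blind Cliffwalk-style environment as described above (my reading of the description, not the authors' code); in particular, tying the 'right' action to state parity is an assumption made so that the correct action cannot be generalized across states.

```python
import random

class BlindCliffwalk:
    def __init__(self, n: int):
        self.n, self.state = n, 0

    def step(self, action: int):
        right = (self.state % 2 == action)    # which action is 'right' varies by state
        if not right:
            self.state = 0
            return 0, 0.0, True               # wrong action: episode ends, no reward
        self.state += 1
        if self.state == self.n:
            self.state = 0
            return 0, 1.0, True               # reached the end: final reward of 1
        return self.state, 0.0, False

env = BlindCliffwalk(n=8)
print(env.step(random.randint(0, 1)))         # random behaviour rarely reaches the reward
```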

We use this example to highlight the difference between the learning times of two agents. Both agents perform Q-learning updates on transitions drawn from the same replay memory. The first agent replays transitions uniformly at random, while the second agent invokes an oracle to prioritize transitions. This oracle greedily selects the transition that maximally reduces the global loss in its current state (in hindsight, after the parameter update). For the details of the setup, see Appendix B.1. Figure 1 (right) shows that picking the transitions in a good order can lead to exponential speed-ups


    Figure 1: Left: Illustration of the Blind Cliffwalk example domain: there are two actions, a right and a wrong one, and the episode is terminated whenever the agent takes the wrong action (dashed red arrows). Taking the right action progresses through a sequence of n states (black arrows), at the end of which lies a final reward of 1 (green arrow); reward is 0 elsewhere. We chose a representation such that generalizing over what action is right is not possible. Right: Median number of learning steps required to learn the value function as a function of the size of the total number of transitions in the replay memory. Note the log-log scale, which highlights the exponential speed-up from replaying with an oracle (bright blue), compared to uniform replay (black); faint lines are min/max values from 10 independent runs.

    Such an oracle is of course not realistic, yet the large gap motivates our search for a practical approach that improves on uniform random replay.

    3.2 PRIORITIZING WITH TD-ERROR

    The central component of prioritized replay is the criterion by which the importance of each transition is measured. One idealised criterion would be the amount the RL agent can learn from a transition in its current state (expected learning progress). While this measure is not directly accessible, a reasonable proxy is the magnitude of a transition's TD error δ, which indicates how surprising or unexpected the transition is: specifically, how far the value is from its next-step bootstrap estimate (Andre et al., 1998). This is particularly suitable for incremental, online RL algorithms, such as SARSA or Q-learning, that already compute the TD error and update the parameters in proportion to δ. The TD error can be a poor estimate in some circumstances as well, e.g. when rewards are noisy; see Appendix A for a discussion of alternatives.

    To demonstrate the potential effectiveness of prioritizing replay by TD error, we compare the uniform and oracle baselines in the Blind Cliffwalk to a greedy TD-error prioritization algorithm. This algorithm stores the last encountered TD error along with each transition in the replay memory. The transition with the largest absolute TD error is replayed from the memory. A Q-learning update is applied to this transition, which updates the weights in proportion to the TD error. New transitions arrive without a known TD error, so we put them at maximal priority in order to guarantee that all experience is seen at least once. Figure 2 (left) shows that this algorithm results in a substantial reduction in the effort required to solve the Blind Cliffwalk task.²

    Implementation: To scale to large memory sizes N, we use a binary heap data structure for the priority queue, for which finding the maximum-priority transition when sampling is O(1) and updating priorities (with the new TD error after a learning step) is O(log N). See Appendix B.2.1 for more details.
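    As an illustration of this bookkeeping, the sketch below maintains such a priority queue with Python's heapq module (a min-heap over negated priorities). The transition storage and the q_update_fn callback are illustrative assumptions rather than the actual implementation.

```python
import heapq

class GreedyPrioritizedMemory:
    """Replay memory that always replays the transition with the largest last-seen
    |TD error|. New transitions enter at the current maximal priority so that
    every experience is replayed at least once."""

    def __init__(self):
        self.transitions = []   # stored transition tuples
        self.heap = []          # entries: (-priority, index into self.transitions)
        self.max_priority = 1.0

    def add(self, transition):
        idx = len(self.transitions)
        self.transitions.append(transition)
        heapq.heappush(self.heap, (-self.max_priority, idx))    # O(log N)

    def replay(self, q_update_fn):
        """q_update_fn applies a Q-learning update and returns the new TD error."""
        # Peeking at the maximum-priority entry is O(1); it is popped here because
        # its priority changes after the update and it must be re-inserted.
        _, idx = heapq.heappop(self.heap)                        # O(log N)
        td_error = q_update_fn(self.transitions[idx])
        priority = abs(td_error)
        self.max_priority = max(self.max_priority, priority)
        heapq.heappush(self.heap, (-priority, idx))              # O(log N)
        return idx, td_error
```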

    3.3 STOCHASTIC PRIORITIZATION

    However, greedy TD-error prioritization has several issues. First, to avoid expensive sweeps over the entire replay memory, TD errors are only updated for the transitions that are replayed. One consequence is that transitions that have a low TD error on first visit may not be replayed for a long time (which means effectively never with a sliding-window replay memory). Further, it is sensitive to noise spikes (e.g. when rewards are stochastic), which can be exacerbated by bootstrapping, where approximation errors appear as another source of noise.

    ² Note that a random (or optimistic) initialization of the Q-values is necessary with greedy prioritization. If initializing with zero instead, unrewarded transitions would appear to have zero error initially, be placed at the bottom of the queue, and not be revisited until the error on other transitions drops below numerical precision.


    Figure 2: Median number of updates required for Q-learning to learn the value function on the Blind Cliffwalk example, as a function of the total number of transitions (only a single one of which was successful and saw the non-zero reward). Faint lines are min/max values from 10 random initializations. Black is uniform random replay, cyan uses the hindsight-oracle to select transitions, red and blue use prioritized replay (rank-based and proportional respectively). The results differ by multiple orders of magnitude, hence the need for a log-log plot. In both subplots it is evident that replaying experience in the right order makes an enormous difference to the number of updates required. See Appendix B.1 for details. Left: Tabular representation, greedy prioritization. Right: Linear function approximation, both variants of stochastic prioritization.

    Finally, greedy prioritization focuses on a small subset of the experience: errors shrink slowly, especially when using function approximation, meaning that the initially high-error transitions get replayed frequently. This lack of diversity makes the system prone to over-fitting.

    To overcome these issues, we introduce a stochastic sampling method that interpolates between pure greedy prioritization and uniform random sampling. We ensure that the probability of being sampled is monotonic in a transition's priority, while guaranteeing a non-zero probability even for the lowest-priority transition. Concretely, we define the probability of sampling transition i as

        P(i) = p_i^α / Σ_k p_k^α        (1)

    where p_i > 0 is the priority of transition i. The exponent α determines how much prioritization is used, with α = 0 corresponding to the uniform case.

    The first variant we consider is the direct, proportional prioritization where p_i = |δ_i| + ε, where ε is a small positive constant that prevents the edge case of transitions not being revisited once their error is zero. The second variant is an indirect, rank-based prioritization where p_i = 1/rank(i), where rank(i) is the rank of transition i when the replay memory is sorted according to |δ_i|. In this case, P becomes a power-law distribution with exponent α. Both distributions are monotonic in |δ|, but the latter is likely to be more robust, as it is insensitive to outliers. Both variants of stochastic prioritization lead to large speed-ups over the uniform baseline on the Cliffwalk task, as shown in Figure 2 (right).
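    For concreteness, the following snippet computes the sampling probabilities of both variants from a small vector of TD errors; the function names and the value of ε are illustrative.

```python
import numpy as np

def proportional_priorities(td_errors, eps=1e-6):
    # p_i = |delta_i| + eps, so transitions remain revisitable at zero error.
    return np.abs(td_errors) + eps

def rank_based_priorities(td_errors):
    # p_i = 1 / rank(i), where rank 1 corresponds to the largest |delta_i|.
    order = np.argsort(-np.abs(td_errors))
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(td_errors) + 1)
    return 1.0 / ranks

def sampling_probabilities(priorities, alpha):
    # P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 recovers uniform sampling.
    scaled = priorities ** alpha
    return scaled / scaled.sum()

td_errors = np.array([0.01, -2.0, 0.5, 0.0])
P_prop = sampling_probabilities(proportional_priorities(td_errors), alpha=0.6)
P_rank = sampling_probabilities(rank_based_priorities(td_errors), alpha=0.7)
```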

    Implementation: To efficiently sample from distribution (1), the complexity cannot depend on N. For the rank-based variant, we can approximate the cumulative density function with a piecewise linear function with k segments of equal probability. The segment boundaries can be precomputed (they change only when N or α change). At runtime, we sample a segment, and then sample uniformly among the transitions within it. This works particularly well in conjunction with a minibatch-based learning algorithm: choose k to be the size of the minibatch, and sample exactly one transition from each segment; this is a form of stratified sampling that has the added advantage of balancing out the minibatch (there will always be exactly one transition with high magnitude δ, one with medium magnitude, etc.). The proportional variant is different, but also admits an efficient implementation based on a sum-tree data structure (where every internal node is the sum of its children, with the priorities as the leaf nodes), which can be efficiently updated and sampled from. See Appendix B.2.1 for additional details.
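    One possible realization of the sum-tree mentioned above is the array-based sketch below (capacity assumed to be a power of two for simplicity); it is an illustrative sketch rather than the actual implementation.

```python
import random

class SumTree:
    """Complete binary tree stored in an array: internal nodes hold the sum of
    their children, leaves hold transition priorities. Updating a leaf and
    sampling proportionally to priority are both O(log N)."""

    def __init__(self, capacity):
        self.capacity = capacity             # number of leaves (power of two)
        self.nodes = [0.0] * (2 * capacity)  # nodes[1] is the root, leaves start at index `capacity`

    def update(self, leaf, priority):
        i = leaf + self.capacity
        delta = priority - self.nodes[i]
        while i >= 1:                        # propagate the change up to the root
            self.nodes[i] += delta
            i //= 2

    def sample(self):
        """Return a leaf index drawn with probability proportional to its priority."""
        mass = random.uniform(0.0, self.nodes[1])
        i = 1
        while i < self.capacity:             # descend: the left child covers the lower mass
            left = 2 * i
            if mass <= self.nodes[left]:
                i = left
            else:
                mass -= self.nodes[left]
                i = left + 1
        return i - self.capacity

tree = SumTree(capacity=8)
for idx, p in enumerate([0.1, 2.0, 0.5, 1.2, 0.0, 0.3, 0.7, 0.05]):
    tree.update(idx, p)
slot = tree.sample()   # index 1 is returned most often here
```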


    Algorithm 1: Double DQN with proportional prioritization

    1: Input: minibatch k, step-size η, replay period K and size N, exponents α and β, budget T.
    2: Initialize replay memory H = ∅, Δ = 0, p_1 = 1
    3: Observe S_0 and choose A_0 ~ π_θ(S_0)
    4: for t = 1 to T do
    5:     Observe S_t, R_t, γ_t
    6:     Store transition (S_{t-1}, A_{t-1}, R_t, γ_t, S_t) in H with maximal priority p_t = max_{i<t} p_i
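    The listing above is cut off by the page break in this extraction. As a rough reconstruction of the replay step it describes, the sketch below samples a minibatch according to P(i), applies importance-sampling-weighted Q-learning updates, and refreshes priorities. The memory interface and the weight form w_j = (N · P(j))^{-β}, normalized by its maximum, are assumptions here, since the corresponding lines of the algorithm are not reproduced above.

```python
import numpy as np

def prioritized_replay_step(memory, q_update_fn, k, alpha, beta):
    """One replay step: sample k transitions with P(i) = p_i^alpha / sum_k p_k^alpha,
    weight each update by an importance-sampling correction, and update priorities
    with the new |TD error|."""
    # A real implementation samples via the sum-tree or precomputed segments;
    # materializing all probabilities keeps this sketch short but costs O(N).
    priorities = np.array([memory.priority(i) for i in range(len(memory))])
    probs = priorities ** alpha
    probs /= probs.sum()
    batch = np.random.choice(len(memory), size=k, p=probs)

    # IS weights compensate for non-uniform sampling; normalizing by the maximum
    # keeps them <= 1 so they only scale updates down (assumed form).
    weights = (len(memory) * probs[batch]) ** (-beta)
    weights /= weights.max()

    for j, w in zip(batch, weights):
        td_error = q_update_fn(memory.get(j), weight=w)   # weighted Q-learning update
        memory.update_priority(j, abs(td_error))
```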


    Figure 3: Difference in normalized score (the gap between random and human is 100%) on 57 games with human starts, comparing Double DQN with and without prioritized replay (rank-based variant in red, proportional in blue), showing substantial improvements in most games. Exact scores are in Table 6. See also Figure 9, where regular DQN is the baseline.

    4 ATARI EXPERIMENTS

    With all these concepts in place, we now investigate to what extent replay with such prioritized sampling can improve performance in realistic problem domains. For this, we chose the collection of Atari benchmarks (Bellemare et al., 2012) with their end-to-end RL-from-vision setup, because they are popular and contain diverse sets of challenges, including delayed credit assignment, partial observability, and difficult function approximation (Mnih et al., 2015; van Hasselt et al., 2016). Our hypothesis is that prioritized replay is generally useful, so that it will make learning with experience replay more efficient without requiring careful problem-specific hyperparameter tuning.

    We consider two baseline algorithms that use uniform experience replay, namely the version of the DQN algorithm from the Nature paper (Mnih et al., 2015), and its recent extension Double DQN (van Hasselt et al., 2016) that substantially improved the state of the art by reducing the overestimation bias with Double Q-learning (van Hasselt, 2010). Throughout this paper we use the tuned version of the Double DQN algorithm. For this paper, the most relevant component of these baselines is the replay mechanism: all experienced transitions are stored in a sliding-window memory that retains the last 10^6 transitions. The algorithm processes minibatches of 32 transitions sampled uniformly from the memory. One minibatch update is done for each 4 new transitions entering the memory, so all experience is replayed 8 times on average. Rewards and TD errors are clipped to fall within [-1, 1] for stability reasons.

    We use the identical neural network architecture, learning algorithm, replay memory and evaluation setup as for the baselines (see Appendix B.2). The only difference is the mechanism for sampling transitions from the replay memory, which is now done according to Algorithm 1 instead of uniformly. We compare the baselines to both variants of prioritized replay (rank-based and proportional).

    Only a single hyperparameter adjustment was necessary compared to the baseline: given that prioritized replay picks high-error transitions more often, the typical gradient magnitudes are larger, so we reduced the step-size by a factor of 4 compared to the (Double) DQN setup. For the α and β0 hyperparameters that are introduced by prioritization, we did a coarse grid search (evaluated on a subset of 8 games), and found the sweet spot to be α = 0.7, β0 = 0.5 for the rank-based variant and α = 0.6, β0 = 0.4 for the proportional variant. These choices trade off aggressiveness with robustness, but it is easy to revert to behavior closer to the baseline by reducing α and/or increasing β0.
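    For reference, these settings can be collected as in the sketch below; the dictionary layout and the baseline step-size value are illustrative placeholders rather than details of the actual implementation.

```python
# Assumed baseline step-size of the (Double) DQN setup; treated as a placeholder here.
baseline_step_size = 2.5e-4

prioritized_hparams = {
    "rank_based":   {"alpha": 0.7, "beta0": 0.5},
    "proportional": {"alpha": 0.6, "beta0": 0.4},
    "step_size": baseline_step_size / 4,   # reduced by a factor of 4 vs. the baseline
    "replay_memory_size": int(1e6),        # sliding window of the last 10^6 transitions
    "minibatch_size": 32,
    "replay_period": 4,                    # one minibatch update per 4 new transitions
}
```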

    We produce the results by running each variant with a single hyperparameter setting across all games, as was done for the baselines. Our main evaluation metric is the quality of the best policy, in terms of average score per episode, given start states sampled from human traces (as introduced in Nair et al., 2015, and used in van Hasselt et al., 2016; this requires more robustness and generalization as the agent cannot rely on repeating a single memorized trajectory).



    Figure 4: Summary plots of learning speed. Left: median over 57 games of the maximum baseline-normalized score achieved so far. The baseline-normalized score is calculated as in Equation 4, but using the maximum Double DQN score seen across training instead of the human score. The equivalence points are highlighted with dashed lines; those are the steps at which the curves reach 100% (i.e., when the algorithm performs equivalently to Double DQN in terms of median over games). For rank-based and proportional prioritization these are at 47% and 38% of total training time. Right: Similar to the left, but using the mean instead of the maximum, which captures cumulative performance rather than peak performance. Here rank-based and proportional prioritization reach the equivalence points at 41% and 43% of total training time, respectively. For the detailed learning curves that these plots summarize, see Figure 7.

    These results are summarized in Table 1 and Figure 3, but full results and raw scores can be found in Tables 7 and 6 in the Appendix. A secondary metric is the learning speed, which we summarize in Figure 4, with more detailed learning curves in Figures 7 and 8.

                     DQN                        Double DQN (tuned)
                     baseline    rank-based     baseline    rank-based    proportional
    Median           48%         106%           111%        113%          128%
    Mean             122%        355%           418%        454%          551%
    > baseline       -           41             -           38            42
    > human          15          25             30          33            33
    # games          49          49             57          57            57

    Table 1: Summary of normalized scores. See Table 6 in the appendix for full results.

    We find that adding prioritized replay to DQN leads to a substantial improvement in score on 41 out of 49 games (compare columns 2 and 3 of Table 6 or Figure 9 in the appendix), with the median normalized performance across 49 games increasing from 48% to 106%. Furthermore, we find that the boost from prioritized experience replay is complementary to the one from introducing Double Q-learning into DQN: performance increases another notch, leading to the current state of the art on the Atari benchmark (see Figure 3). Compared to Double DQN, the median performance across 57 games increased from 111% to 128%, and the mean performance from 418% to 551%, bringing additional games such as River Raid, Seaquest and Surround to a human level for the first time, and making large jumps on others (e.g. Gopher, James Bond 007 or Space Invaders). Note that mean performance is not a very reliable metric because a single game (Video Pinball) has a dominant contribution. Prioritized replay gives a performance boost on almost all games, and in aggregate, learning is twice as fast (see Figures 4 and 8). The learning curves in Figure 7 illustrate that while the two variants of prioritization usually lead to similar results, there are games where one of them remains close to the Double DQN baseline while the other one leads to a big boost, for example Double Dunk or Surround for the rank-based variant, and Alien, Asterix, Enduro, Phoenix or Space Invaders for the proportional variant. Another observation from the learning curves is that, compared to the uniform baseline, prioritization is effective at reducing the delay until performance gets off the ground in games that otherwise suffer from such a delay, such as Battlezone, Zaxxon or Frostbite.


    5 DISCUSSION

    In the head-to-head comparison between rank-based prioritization and proportional prioritization, we expected the rank-based variant to be more robust because it is affected by neither outliers nor error magnitudes. Furthermore, its heavy-tail property also guarantees that samples will be diverse, and the stratified sampling from partitions of different errors will keep the total minibatch gradient at a stable magnitude throughout training. On the other hand, the ranks make the algorithm blind to the relative error scales, which could incur a performance drop when there is structure in the distribution of errors to be exploited, such as in sparse-reward scenarios. Perhaps surprisingly, both variants perform similarly in practice; we suspect this is due to the heavy use of clipping (of rewards and TD errors) in the DQN algorithm, which removes outliers. We monitored the distribution of TD errors as a function of time for a number of games (see Figure 10 in the appendix), and found that it becomes close to a heavy-tailed distribution as learning progresses, while still differing substantially across games; this empirically validates the form of Equation 1. Figure 11, in the appendix, shows how this distribution interacts with Equation 1 to produce the effective replay probabilities.

    While doing this analysis, we stumbled upon another phenomenon (obvious in retrospect), namely that some fraction of the visited transitions are never replayed before they drop out of the sliding-window memory, and many more are replayed for the first time only long after they are encountered. Also, uniform sampling is implicitly biased toward out-of-date transitions that were generated by a policy that has typically seen hundreds of thousands of updates since. Prioritized replay, with its bonus for unseen transitions, directly corrects the first of these issues, and also tends to help with the second one, as more recent transitions tend to have larger error; this is because old transitions will have had more opportunities to have their errors corrected, and because novel data tends to be less well predicted by the value function.

    We hypothesize that deep neural networks interact with prioritized replay in another interesting way. When we distinguish learning the value given a representation (i.e., the top layers) from learning an improved representation (i.e., the bottom layers), then transitions for which the representation is good will quickly reduce their error and then be replayed much less, increasing the learning focus on others where the representation is poor, thus putting more resources into distinguishing aliased states, if the observations and network capacity allow for it.

    6 EXTENSIONS

    Prioritized Supervised Learning: The analogous approach to prioritized replay in the context of supervised learning is to sample non-uniformly from the dataset, with each sample using a priority based on its last-seen error. This can help focus the learning on those samples that can still be learned from, devoting additional resources to the (hard) boundary cases, somewhat similarly to boosting (Galar et al., 2012). Furthermore, if the dataset is imbalanced, we hypothesize that samples from the rare classes will be sampled disproportionately often, because their errors shrink less fast, and the chosen samples from the common classes will be those nearest to the decision boundaries, leading to an effect similar to hard negative mining (Felzenszwalb et al., 2008). To check whether these intuitions hold, we conducted a preliminary experiment on a class-imbalanced variant of the classical MNIST digit classification problem (LeCun et al., 1998), where we removed 99% of the samples for digits 0, 1, 2, 3, 4 in the training set, while leaving the test/validation sets untouched (i.e., those retain class balance). We compare two scenarios: in the informed case, we reweight the errors of the impoverished classes artificially (by a factor of 100); in the uninformed scenario, we provide no hint that the test distribution will differ from the training distribution. See Appendix B.3 for the details of the convolutional neural network training setup. Prioritized sampling (uninformed, with α = 1, β = 0) outperforms the uninformed uniform baseline, and approaches the performance of the informed uniform baseline in terms of generalization (see Figure 5); again, prioritized training is also faster in terms of learning speed.
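    A minimal sketch of this prioritized sampling loop for supervised learning is given below, with α = 1 and no importance-sampling correction (β = 0), as in the uninformed setting; the dataset and model interfaces are illustrative placeholders.

```python
import numpy as np

def prioritized_supervised_training(X, y, sgd_step, n_updates, batch_size=32):
    """Sample minibatches with probability proportional to each example's
    last-seen error (alpha = 1, beta = 0). sgd_step performs one parameter update
    and returns the per-example losses of the minibatch."""
    last_errors = np.ones(len(X))                    # new samples start at maximal priority
    for _ in range(n_updates):
        p = last_errors / last_errors.sum()
        idx = np.random.choice(len(X), size=batch_size, p=p)
        losses = sgd_step(X[idx], y[idx])
        last_errors[idx] = losses + 1e-6             # keep every example revisitable
    return last_errors
```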

    Off-policy Replay: Two standard approaches to off-policy RL are rejection sampling and using importance sampling ratios to correct for how likely a transition would have been on-policy. Our approach contains analogues to both of these, namely the replay probability P and the IS-correction w. It therefore appears natural to apply it to off-policy RL, if transitions are available in a replay memory. In particular, writing ρ for the importance sampling ratio, we recover weighted IS with w = ρ, α = 0, β = 1, and rejection sampling with p = min(1, ρ), α = 1, β = 0, in the proportional variant.


    Figure 5: Classification errors as a function of supervised learning updates on severely class-imbalanced MNIST. Prioritized sampling improves performance, leading to comparable errors on the test set, and approaching the imbalance-informed performance (median of 3 random initializations). Left: Number of misclassified test set samples. Right: Test set loss, highlighting overfitting.

    Our experiments indicate that intermediate variants, possibly with annealing or ranking, could be more useful in practice, especially when IS ratios introduce high variance, i.e., when the policy of interest differs substantially from the behavior policy in some states. Of course, off-policy correction is complementary to our prioritization based on expected learning progress, and the same framework can be used for a hybrid prioritization by defining p = |δ|·ρ, or some other sensible trade-off based on both δ and ρ.

    Feedback for Exploration: An interesting side-effect of prioritized replay is that the total number of times M_i that a transition will end up being replayed varies widely, and this gives a rough indication of how useful it was to the agent. This potentially valuable signal can be fed back to the exploration strategy that generates the transitions. For example, we could sample exploration hyperparameters (such as the fraction of random actions, the Boltzmann temperature, or the amount of intrinsic reward to mix in) from a parametrized distribution at the beginning of each episode, monitor the usefulness of the experience via M_i, and update the distribution toward generating more useful experience. Or, in a parallel system like the Gorila agent (Nair et al., 2015), it could guide resource allocation between a collection of concurrent but heterogeneous actors, each with different exploration hyperparameters.

    Prioritized Memories: Considerations that help determine which transitions to replay are likely to also be relevant for determining which memories to store and when to erase them (e.g. when it becomes likely that they will never be replayed anymore). An explicit control over which memories to keep or erase can help reduce the required total memory size, because it reduces redundancy (frequently visited transitions will have low error, so many of them will be dropped), while automatically adjusting for what has been learned already (dropping many of the easy transitions) and biasing the contents of the memory to where the errors remain high. This is a non-trivial aspect, because memory requirements for DQN are currently dominated by the size of the replay memory, no longer by the size of the neural network. Erasing is a more final decision than reducing the replay probability, thus an even stronger emphasis on diversity may be necessary, for example by tracking the age of each transition and using it to modulate the priority in such a way as to preserve sufficient old experience to prevent cycles (related to hall-of-fame ideas in the multi-agent literature, Rosin & Belew, 1997). The priority mechanism is also flexible enough to permit integrating experience from other sources, such as from a planner or from human expert trajectories (Guo et al., 2014), since knowing the source can be used to modulate each transition's priority, e.g. in such a way as to preserve a sufficient fraction of external experience in memory.

    7 CONCLUSION

    This paper introduced prioritized replay, a method that can make learning from experience replay more efficient. We studied a couple of variants, devised implementations that scale to large replay memories, and found that prioritized replay speeds up learning by a factor of 2 and leads to a new state-of-the-art performance on the Atari benchmark. We also laid out further variants and extensions that hold promise, notably for class-imbalanced supervised learning.


    ACKNOWLEDGMENTS

    We thank our DeepMind colleagues, in particular Hado van Hasselt, Joseph Modayil, Nicolas Heess, Marc Bellemare, Razvan Pascanu, Dharshan Kumaran, Daan Wierstra, Arthur Guez, Charles Blundell, Alex Pritzel, Alex Graves, Balaji Lakshminarayanan, Ziyu Wang, Nando de Freitas, Rémi Munos and Geoff Hinton for insightful discussions and feedback.

    REFERENCES

    Andre, David, Friedman, Nir, and Parr, Ronald. Generalized prioritized sweeping. In Advances in Neural Information Processing Systems. Citeseer, 1998.

    Atherton, Laura A., Dupret, David, and Mellor, Jack R. Memory trace replay: the shaping of memory consolidation by neuromodulation. Trends in Neurosciences, 38(9):560-570, 2015.

    Bellemare, Marc G., Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. arXiv preprint arXiv:1207.4708, 2012.

    Bellemare, Marc G., Ostrovski, Georg, Guez, Arthur, Thomas, Philip S., and Munos, Rémi. Increasing the action gap: New operators for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016. URL http://arxiv.org/abs/1512.04860.

    Collobert, Ronan, Kavukcuoglu, Koray, and Farabet, Clément. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.

    Diamond, Jared. Zebras and the Anna Karenina principle. Natural History, 103:44, 1994.

    Felzenszwalb, Pedro, McAllester, David, and Ramanan, Deva. A discriminatively trained, multiscale, deformable part model. In Computer Vision and Pattern Recognition (CVPR 2008), IEEE Conference on, pp. 1-8. IEEE, 2008.

    Foster, David J. and Wilson, Matthew A. Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature, 440(7084):680-683, 2006.

    Galar, Mikel, Fernández, Alberto, Barrenechea, Edurne, Bustince, Humberto, and Herrera, Francisco. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 42(4):4