
The importance of experience replay database composition in deep reinforcement learning

Tim de Bruin
Delft Center for Systems and Control
Delft University of Technology
[email protected]

Jens Kober
Delft Center for Systems and Control
Delft University of Technology
[email protected]

Karl Tuyls∗
Department of Computer Science
University of Liverpool
[email protected]

Robert Babuška
Delft Center for Systems and Control
Delft University of Technology
[email protected]

Abstract

Recent years have seen a growing interest in the use of deep neural networks as function approximators in reinforcement learning. This paper investigates the potential of the Deep Deterministic Policy Gradient method for a robot control problem, both in simulation and in a real setup. The importance of the size and composition of the experience replay database is investigated and some requirements on the distribution over the state-action space of the experiences in the database are identified. Of particular interest is the importance of negative experiences that are not close to an optimal policy. It is shown how training with samples that are insufficiently spread over the state-action space can cause the method to fail, and how maintaining the distribution over the state-action space of the samples in the experience database can greatly benefit learning.

1 Introduction

The combination of Reinforcement Learning (RL) with Artificial Neural Networks (ANN) is an interesting approach to the control of dynamical systems. Reinforcement learning provides a framework that makes it possible to learn complex nonlinear control policies without any prior knowledge of the system to be controlled. When this is combined with the ability of neural networks to represent highly nonlinear mappings and to generalize the gained knowledge to previously unseen inputs, it is clear that the combination holds a lot of potential.

One category of methods that is of special interest for control is the model-free off-policy actor-critic methods [11]. These methods allow for continuous actions and use only interaction samples from the system. Although model-free methods require no prior knowledge of the system, this comes at the price of needing more samples to learn a good policy. This is problematic when learning control on physical systems, since extra interaction time will cause wear on the system. Having an off-policy algorithm helps to some extent, since it allows for the use of experience replay [14, 21, 22]. This technique reuses experience samples, increasing the sample efficiency.

One recently published model-free off-policy actor-critic method that uses experience replay is the Deep Deterministic Policy Gradient (DDPG) method [1]. In this method, experience replay is used not only to increase the sample efficiency but it is also crucial for the stability (convergence) of the

∗Author is also affiliated as a part-time professor with Delft Center for Systems and Control.


learning process. In this paper, the influence of the properties of the database and of the distribution of the samples contained in it on the stability and speed of the learning process is examined.

In reinforcement learning, the way in which new experiences are obtained presents a well-known problem: the exploration-exploitation trade-off. It is important to explore sufficiently far away from the current best policy, so as not to get stuck in a local minimum, while still exploiting the current knowledge so that the learning process is as fast as possible [18]. When using neural networks, additional requirements are placed on the distribution of the experiences over the state-action space that also have to be taken into account. These requirements stem from the fact that when the neural network training data are not varied enough, the network is likely to over-fit, resulting in poor generalization performance in unexplored parts of the state space. When this is combined with the fact that neural networks can easily forget about data that they are no longer trained on [16], it becomes apparent that when an experience database of limited size is used, its contents should be maintained with some care.

These issues are investigated in the context of a 2-DOF robot, on which the DDPG method is implemented both in simulation and on the physical setup. The remainder of this paper is structured as follows. Section 2 provides background on the DDPG method and the use of experience replay. Section 3 gives details about the implementation of the experiments, and Section 4 compares different experience collection strategies, the effect of the size of the experience replay database, and the effect of retaining experiences that were obtained early on in the learning process.

2 Method

The reinforcement learning method used in this paper is based on the Deep Deterministic Policy Gradient method presented in [1], an off-policy actor-critic algorithm. The actor π attempts to determine the real-valued control action aπ ∈ R^n that will minimize the expected sum of future costs c, based on the current state of the system s ∈ R^m; aπ = π(s). The critic attempts to give the expected discounted sum of future costs when taking action a(k) in state s(k) at time k and the policy π is followed for all future time steps:

Qπ(s, a) = E[ c(s(k), a, s(k+1)) + Σ_{j=k+1}^{K} γ^(j−k) c(s(j), π(s(j)), s(j+1)) | s(k) = s ]   (1)

Here, 0 ≤ γ < 1 is the discount factor, which ensures that the sum is finite and that the emphasis is on short-term costs.

The actor and critic functions are approximated by neural networks with parameter vectors ζ and ξ, respectively. The critic network weights are updated to minimize the squared temporal difference error:

L(ξ) = ( Q(s, a|ξ) − (c + γ Q(s′, π(s′|ζ)|ξ)) )²   (2)

Where s = s(k), s′ = s(k + 1) and c = c(s, a, s′) for brevity. The actor network is updated in the direction that will minimize the expected cost according to the critic:

∆ζ ∼ −∇_a Q(s, a|ξ)|_{s=s(k), a=π(s(k)|ζ)} ∇_ζ π(s|ζ)|_{s=s(k)}   (3)

It is shown in [4] that this is the true gradient of the expected overall cost with respect to the policy parameters when the critic has a certain compatible form. Although using this compatible form might be too restricting, using (3) was shown in [1] to work for non-compatible forms as well, when certain stability measures are added to the method. These measures include the addition of an experience replay database and the use of separate target networks.

Using (2) and (3) directly can be problematic when using neural networks for the actor and critic functions [1, 2]. During the training of the critic network, the parameters ξ are updated to bring the output of the network Q(s, a|ξ) closer to the target value c + γQ(s′, π(s′|ζ)|ξ). However, the target value depends on the same parameters ξ that are being updated to bring the output of the network closer to the target, both directly through the value of Q at the next state and through the policy, which is updated based on Q. This coupling can cause the training of the networks to become unstable.


To solve this, the target values are updated using a separate copy of both the actor and the critic networks. A low-pass filter is used to update the network parameters of these copies according to:

ξ⁻ ← τ ξ + (1 − τ) ξ⁻   (4)

ζ⁻ ← τ ζ + (1 − τ) ζ⁻   (5)

Where the choice of τ presents a trade-off between the speed and stability of the learning process. The target networks are then used to replace (2) with:

L(ξ) = ( Q(s, a|ξ) − (c + γ Q(s′, π(s′|ζ⁻)|ξ⁻)) )²   (6)
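To make these updates concrete, the following minimal, framework-agnostic sketch shows the soft target-network update of (4)-(5) and the computation of the target value used in (6). It is an illustrative sketch rather than the authors' implementation; the actor_target and critic_target callables and the parameter lists are assumptions.

def soft_update(target_params, params, tau):
    """Low-pass filter of (4)-(5): target <- tau * online + (1 - tau) * target."""
    return [tau * p + (1.0 - tau) * tp for p, tp in zip(params, target_params)]

def td_target(c, s_next, actor_target, critic_target, gamma):
    """Target value appearing in (6), computed with the slowly updated
    copies pi(.|zeta-) and Q(.|xi-) rather than the networks being trained."""
    a_next = actor_target(s_next)                     # pi(s' | zeta-)
    return c + gamma * critic_target(s_next, a_next)  # c + gamma Q(s', a' | xi-)

The critic loss (6) is then the squared difference between Q(s, a|ξ) and this target, and the actor is updated along the direction given by (3).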

2.1 Experience replay

The second contribution of [1], based on earlier work [2, 3], that enabled the use of deep neural networks as function approximators for the actor and critic in a stable way is the use of the experience replay database [14]. The experience tuples <s, a, s′, c> from the interaction with the system are stored in a database. During the training of the neural networks, the experiences are sampled from this database, allowing experiences to be used multiple times. The addition of experience replay aids the learning in several ways. The first benefit is increased efficiency. Experience replay helps to increase the sample efficiency by allowing samples to be reused. On top of this, in the context of neural networks, experience replay allows for mini-batch updates, which helps the computational efficiency, especially when the training is performed on a GPU.

On top of the efficiency gains that experience replay brings, it also improves the stability of the DDPG learning algorithm in several ways. The first way in which the database helps stabilize the learning process is that it is used to break the temporal correlations of the neural network learning updates. Without an experience database, the updates of (2) and (3) would be based on subsequent experience samples from the system. These samples are highly correlated, since the state of the system does not change much between consecutive time steps. For real-time control, this effect is even more pronounced at high sampling frequencies. The problem this poses to the learning process is that most mini-batch optimization algorithms are based on the assumption of independent and identically distributed data [1]. Learning from subsequent samples would violate this i.i.d. assumption and cause the updates to the network parameters to have a high variance, leading to slower and potentially less stable learning [12, 13]. By saving the experiences over a period of time and updating the neural networks with mini-batches of experiences that are sampled uniformly at random from the database, this problem is alleviated.
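As a concrete illustration of this mechanism, the sketch below stores <s, a, s′, c> tuples in a fixed-size FIFO buffer and returns uniformly sampled mini-batches. It is a minimal sketch, not the authors' code; the capacity and batch size are left to the caller.

from collections import deque
import random

class ReplayDatabase:
    """Fixed-size experience database; once full, the oldest samples are overwritten (FIFO)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, s_next, c):
        self.buffer.append((s, a, s_next, c))

    def sample(self, batch_size):
        # Uniform random mini-batches break the temporal correlation
        # between consecutive experience samples.
        return random.sample(list(self.buffer), batch_size)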

A second way in which the experience database helps stabilize the learning process is by ensuring that the experiences used to train the networks are not based only on the most recent policy. The policy that is used during the trials to obtain new experiences is given by:

μ(s) = π(s|ζ) + εO   (7)

Where O is a noise process that provides exploration and ε is used to scale the amount of exploration. This policy therefore determines the experiences that are added to the database and that the neural networks are trained on. When the Q function changes slightly, the policy can change significantly, leading to very different samples that are added to the database. This in turn could lead to the algorithm getting stuck in a local minimum or even diverging [2]. By using a database with past experiences this effect is reduced, as the experiences from previous policies are still used for training. Important to note here is the fact that neural networks are global function approximators. When the training data are only from a small part of the state-action space, the networks might over-fit to the data and generalize poorly to other parts of the state space. Additionally, even if a network has previously learned to do a task well, it can forget this knowledge completely when learning a new task, even when the new task is related to the old one [16]. In the case of the DDPG algorithm, even if the critic can accurately predict the expected cost for parts of the state space, this ability can disappear when it is no longer trained on data from this part of the state-action space, as the same parameters apply to the other parts of the state-action space as well. These effects are shown to seriously affect the learning performance in Section 4.

Regularization can help prevent neural networks from over-fitting to their training data. However, the regularization of neural networks used for control is currently not as well understood as the regularization of neural networks in other domains [17]. It is therefore important to ensure that the training


data in the experience database are varied enough for the neural networks to properly generalize their knowledge to the whole state-action space. A simple method that helps with this issue is introduced in Section 4.

Figure 1: 2-link arm robot schematic (left) and photo (right). The schematic labels the joint angles θ1, θ2, the control inputs u1, u2, the Cartesian coordinates x, y, and the direction of gravity.

3 Experimental setup

The DDPG method is used to learn the control of a 2-link arm, both in simulation and on a real setup. The system is depicted in Figure 1. It consists of two links that are connected through a motorized joint. The arm hangs down from another motorized joint. The angle between the base and the first link is θ1 and the angle of the second link with respect to the first link is θ2. Both joints are constrained to θ1, θ2 ∈ [−π/2, π/2]. The control signals u1, u2 represent scaled versions of the voltages to the motors in the joints. To prevent damage to the physical setup due to jittering, the control signal to the setup, but not to the simulator, is filtered with a low-pass filter:

u(k) = 0.9 u(k − 1) + 0.1 a(k)   (8)

To ensure that the Markov property is satisfied, the reinforcement learning state at time k consists of the angles and angular velocities of the joints and the motor signals at time k − 1, as well as the reference:

s(k) = [θ1(k)  θ2(k)  θ̇1(k)  θ̇2(k)  u1(k−1)  u2(k−1)  rx(k)  ry(k)]^T   (9)

The reference in (9) is given in Cartesian coordinates, whereas the state of the arm is given in angular coordinates. It is left up to the neural networks to learn the mapping between the two and to deal with the fact that some reference positions can be reached through multiple arm configurations.
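As a small illustration, the sketch below assembles a state of the form (9) and applies the input filter (8) used on the physical setup; the variable names and array layout are assumptions made for this example.

import numpy as np

def filter_action(u_prev, a):
    """Low-pass filter of (8), applied to the physical setup only."""
    return 0.9 * u_prev + 0.1 * a

def build_state(theta, theta_dot, u_prev, reference):
    """Markov state (9): joint angles, angular velocities,
    previous motor signals and the Cartesian reference."""
    return np.concatenate([theta, theta_dot, u_prev, reference])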

The cost of taking action a in state s at time k is based on the distance between the Cartesian coordinates of the end of the second link at time step k + 1, as shown in Figure 1, and the reference position at time step k:

c(s(k), a(k), s(k+1)) = w1 d²(k) + w2 (1 − e^(−α d²(k))) + w3 ‖θ̇(k)‖²   (10)

with

d²(k) = ‖[rx(k) − x(k+1), ry(k) − y(k+1)]^T‖₂²,   x = sin(θ1) + sin(θ1 + θ2),   y = cos(θ1) + cos(θ1 + θ2)   (11)

The first term in (10) ensures that the initial learning is quick. Even when the policy is very bad, the quadratic cost makes it clear that moving towards the reference position is better than moving away from it. Using a quadratic term alone will not result in good final policies, however, since such a cost function is very flat close to the reference. The second term adds a steep drop in the cost very close to the reference to solve this problem. The third term discourages the highly oscillatory responses that might otherwise result. The constants in (10) that are used in the experiments are w1 = 0.2, w2 = 0.4, w3 = 8, α = 100.
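Written out directly, the cost computation might look like the sketch below, using the constants reported above. Reading the third term as a penalty on the joint velocities follows the stated goal of discouraging oscillatory responses; the function and variable names are illustrative assumptions.

import numpy as np

W1, W2, W3, ALPHA = 0.2, 0.4, 8.0, 100.0  # constants used in the experiments

def end_effector(theta1, theta2):
    """Cartesian position of the tip of the second link, as in (11)."""
    x = np.sin(theta1) + np.sin(theta1 + theta2)
    y = np.cos(theta1) + np.cos(theta1 + theta2)
    return x, y

def cost(reference, theta1_next, theta2_next, theta_dot):
    """Cost (10): quadratic distance term, steep near-reference term,
    and a penalty on the joint velocities."""
    x, y = end_effector(theta1_next, theta2_next)
    d2 = (reference[0] - x) ** 2 + (reference[1] - y) ** 2
    return W1 * d2 + W2 * (1.0 - np.exp(-ALPHA * d2)) + W3 * float(np.dot(theta_dot, theta_dot))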


The forgetting factor γ is calculated via:

γ = e^(−Ts/τγ)   (12)

Where Ts is the sampling period and τγ is the lookahead horizon in seconds. For this horizon a value of 2.5 seconds was used, which gives γ = 0.961 for the simulator, which uses a control frequency of 10 Hz, and γ = 0.996 for the real setup, which uses a control frequency of 100 Hz.
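The reported values follow directly from (12); the snippet below is only a numerical check of this relation.

import math

def discount(sampling_period, horizon=2.5):
    """Forgetting factor (12): gamma = exp(-Ts / tau_gamma)."""
    return math.exp(-sampling_period / horizon)

print(round(discount(1 / 10), 3))   # simulator, 10 Hz control   -> 0.961
print(round(discount(1 / 100), 3))  # real setup, 100 Hz control -> 0.996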

The networks used for both the actor and the critic are fully connected networks with Rectified Linear Unit (ReLU) [10] nonlinearities. Both networks have two hidden layers of equal size. In the critic network the action inputs enter the network before the second hidden layer. The number of hidden units in both networks is chosen such that they have around 10000 parameters each. This gives the actor network 94 neurons per hidden layer and the critic network 93 neurons per hidden layer. As in [1], batch normalization [9] is used on the inputs to all layers of the actor network and to all layers prior to the action input of the critic network. On the critic network, L2 weight decay of 0.5 · 10⁻² was used. The Adam [7] optimization algorithm is used to train the neural networks. This optimization algorithm is appropriate for non-stationary objective functions and noisy gradients, which makes it suitable for reinforcement learning problems.
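The reported layer widths can be checked by counting parameters. The count below assumes the 8-dimensional state of (9), a 2-dimensional action, and weights plus biases for each fully connected layer, with the action concatenated to the output of the critic's first hidden layer; these assumptions are not stated explicitly in the paper.

def actor_params(n_state=8, n_hidden=94, n_action=2):
    # state -> hidden -> hidden -> action; weights plus biases per layer
    return (n_state * n_hidden + n_hidden
            + n_hidden * n_hidden + n_hidden
            + n_hidden * n_action + n_action)

def critic_params(n_state=8, n_hidden=93, n_action=2):
    # the action enters the network before the second hidden layer
    return (n_state * n_hidden + n_hidden
            + (n_hidden + n_action) * n_hidden + n_hidden
            + n_hidden * 1 + 1)

print(actor_params())   # 9966, roughly 10000
print(critic_params())  # 9859, roughly 10000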

The experiments in this paper have been run in a growing batch setting [15]. Each experiment consists of n repetitions of a learning run, which itself consists of o trials of 60 seconds. During these trials the experiences are added to the experience database. Between trials, the neural networks are updated based on the experiences in the complete database. For each separate learning run in the experiment, the initialization of the networks and the training references are unique and the database is reset. When the database is full, the first experiences are overwritten in a first-in-first-out (FIFO) manner.

The noise process O in (7) that is used is an Ornstein-Uhlenbeck process:

O(k) = O(k−1) − α O(k−1) + β N(0, 1)   (13)

Here α = 0.6, β = 0.4, and N(0, 1) is Gaussian noise with a mean of 0.0 and a standard deviation of 1.0. During training the references are determined stochastically, in such a way that they can be reached with both joints θ1, θ2 ∈ [−0.5, 0.5]. They are changed periodically during the trials. After each learning run the final policy is evaluated with a fixed sequence of predetermined references in the same range.
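A minimal sketch of the exploration noise (13) and of the behavioral policy (7) is given below; the per-trial decay of ε mentioned in Section 4.1 is left to the caller, and the function and class names are illustrative.

import numpy as np

class OUNoise:
    """Discrete Ornstein-Uhlenbeck process of (13)."""

    def __init__(self, dim, alpha=0.6, beta=0.4):
        self.alpha, self.beta = alpha, beta
        self.state = np.zeros(dim)

    def step(self):
        # O(k) = O(k-1) - alpha * O(k-1) + beta * N(0, 1)
        self.state = (1.0 - self.alpha) * self.state + self.beta * np.random.randn(*self.state.shape)
        return self.state

def behavior_action(actor, s, noise, epsilon):
    """Behavioral policy (7): mu(s) = pi(s|zeta) + epsilon * O."""
    return actor(s) + epsilon * noise.step()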

4 Results

A series of tests was performed to investigate the importance of the size of the experience database and the properties of the experiences contained within it for the speed and stability of the learning process.

4.1 Data collection strategy

To investigate the effects of the properties of the collected data, three collection strategies are compared. For these tests, a database was used that could hold all samples of the learning run.

The first strategy is the default reinforcement learning setting, in which the data are collected by following a noisy version of the current policy, where the noise adds exploration (7). The amplitude ε of the exploration noise is decayed exponentially per trial.

In the second strategy, the data are collected by following a deterministic policy of decent quality. In this case, a PD controller is used, which achieved an average cost per trial of 0.35. No exploration is added to this policy, so that the effects of training the networks on data from only one policy become most apparent.

For the final strategy, the noise process of (13) was used exclusively to collect the experiences. The average cost per trial of this random policy was 1.1. In all cases the database size was chosen large enough to contain the entire learning run.

The results of this experiment are shown in Figure 2. The results of learning from data obtained with a completely random policy are close to those of learning from the current best policy with exploration. When trying to learn from experiences obtained by following a good but non-changing


Figure 2: Cost per trial for different experience collection strategies (exclusively good policy, standard, exclusively random policy). The curves depict the means over 5 learning run repetitions in the simulator.

Figure 3: Cost per trial for different database sizes (2, 10, and 30 trials). The curves depict the means over 5 learning run repetitions in the simulator; the shaded areas contain all runs.

deterministic policy, the method completely fails. This is likely to be the result of the neural networks over-fitting to this very limited sampling of the state-action space. It is therefore very important to ensure that the samples are diverse enough for the networks to be able to generalize to the whole state-action space with acceptable accuracy.

4.2 Database size

From Figure 2 it can be seen that the standard learning strategy of sampling the state-action space according to the current policy with exploration works when the database is large enough to contain all the samples that have been collected so far. It is interesting to see whether these results also hold when the amount of exploration is decaying and the database is not large enough to hold all samples. In that case, the database will initially contain experiences from an exploratory policy. As the exploration decays, the new samples will be closer to the policy and less varied. Eventually, the initial exploratory samples will be overwritten by the new samples, reducing the variety in the database. When this happens, the networks will already contain knowledge from the early exploration. However, this knowledge might be lost after repeated training on the new experiences.

To test this, the learning performance that results from using an experience replay database large enough to contain all experiences is compared to the performance that results from using a database that contains only a small number of trials. In Figure 3 the results are reported for experience databases that contain the last 2 trials and the last 10 trials.


Figure 4: Effects of keeping early samples, comparing databases that keep only the most recent trials with databases that permanently retain the first half of the trials. (a) Database size = 2 trials. (b) Database size = 10 trials (last 10 trials vs. first 5 and last 5 trials). The curves depict the means over 5 learning run repetitions on the simulator; the shaded areas contain all runs.

Although in all experiments a very basic policy is learned in the first few trials, the learning subsequently becomes unstable for all learning runs that use a database containing the experiences from only the last two trials, confirming the hypothesis that knowledge that was gained earlier on is simply forgotten. Additionally, with the amount of exploration decaying, these learning runs never recover and without exception result in unusable policies.

When a database containing the most recent 10 trials is used, the learning performance is on average quite close to when all trials are stored. However, it can be seen from Figure 3 that the stability of the learning process is not as good. Although a deteriorating policy is in all cases recovered in subsequent trials, this is still an unwanted effect.

4.3 Retaining early memories

To investigate whether the stabilizing effect observed in Figure 3 for larger databases is indeed related to the presence of the early samples from exploratory policies or simply to the presence of more data, new tests are performed with databases that can hold experiences from 2 and 10 trials respectively. This time, however, when the databases are full, the new samples only overwrite the second half of the database, as sketched below. This way, the experiences from the initial trials are kept indefinitely while the database also always contains the experiences from the most recent trials. The results are compared to the standard strategy of keeping only the most recent trials in Figure 4.
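A minimal sketch of this overwrite rule is given below: until the database is full, samples are simply appended; once it is full, new samples replace only entries in the second half, so the initial exploratory trials are never overwritten. The index arithmetic is an illustrative assumption.

class SplitReplayDatabase:
    """Experience database that permanently keeps its first half."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.write = capacity // 2  # next overwrite position once the database is full

    def add(self, experience):
        if len(self.buffer) < self.capacity:
            self.buffer.append(experience)
        else:
            self.buffer[self.write] = experience
            self.write += 1
            if self.write == self.capacity:   # wrap around within the second half only
                self.write = self.capacity // 2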

From Figure 4a it can be seen that retaining the experiences from the first trial ensures that the method becomes stable for a database size of 2 trials. For the experiment with a database that holds 10 trials, the results show that keeping the initial exploratory experiences in the database not only helps to make the learning process more stable, but also results in better policies than when all trials are kept in memory. Therefore, it might be necessary to make an exploration-exploitation trade-off not only for the policy by which new experiences are obtained, but also for choosing the size of the database and which experiences to replace in it. Care should be taken, however, when keeping experiences in the database indefinitely, as the networks might start to over-fit to these particular experiences.

4.4 Experiments on the real setup

Experiments have been conducted on the real setup to verify that the phenomena described in this paper apply to real physical systems in a similar way as they do in simulation. A database large enough to hold 6 trials was used in four learning runs. For two learning runs, the standard method of overwriting the entire database with new experiences in a FIFO way was used. For the two other learning runs, the first three trials were kept in memory and the second half of the database was overwritten in a FIFO way by new experiences.


Figure 5: Effects of keeping early samples on the real setup: (a) x-position and (b) y-position, each showing the reference and the responses for the two retention strategies (first 3 and last 3 trials vs. last 6 trials). Responses when using the policies after 40 trials of 30 seconds are shown. A database that can hold experiences from 6 trials was used.

The policies that resulted after 30 trials were tested on the same sequence of reference positions. The response of the system for the four tests is shown in Figure 5. It can be seen that although the two learning runs which kept the initial trials in the database produced useful policies, the method completely failed on both runs where only the six most recent trials were kept in memory.

5 Conclusion

This work investigates the effects of the properties of the experience replay database on the stability and speed of learning when using the Deep Deterministic Policy Gradient method for the control of a 2-DOF robot.

It is shown that the experience replay database is vital for the stability of the learning process. Most important in this respect is that the database should at all times contain experience samples that are diverse enough. Even after the neural networks have converged to represent a decent policy and the value function of that policy, they can still lose this knowledge when trained on experiences that are too limited in their coverage of the state-action space.

Although the samples from early trials, which include thorough exploration, are valuable for the learning performance of the algorithm, the standard DQN and DDPG methods will eventually replace these samples with new experiences that might be sampled closer to the policy. This can result in a loss of stability and learning performance when these new samples are not diverse enough.

Even when a database is used that is large enough to retain all experience samples, the influence of the early samples on the neural network training will eventually diminish, as the ratio of exploratory experiences to those close to the policy drops. This suggests that there is an exploration-exploitation trade-off not just for generating new experiences, but also for deciding which experiences in the database to replace.

It is shown in this work that retaining the experiences from the initial exploratory trials in the database can help prevent a loss of learning stability for very small databases, both in simulation and on a physical setup, and that using a relatively small database with this method can improve the performance of the learning algorithm compared to storing all experiences. However, care should be taken to prevent over-fitting to experiences that are kept in the database for too long.

A more sophisticated method to determine which experiences to overwrite in the database, such that a sufficient coverage of the state-action space is enforced, remains future work. The considerations could, for example, be based on closed-loop system identification theory [19] or on the temporal difference error of the experiences [20].


References

[1] Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).

[2] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.

[3] Mnih, Volodymyr, et al. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).

[4] Silver, David, et al. "Deterministic policy gradient algorithms." ICML. 2014.

[5] van Hasselt, Hado, Arthur Guez, and David Silver. "Deep reinforcement learning with double Q-learning." arXiv preprint arXiv:1509.06461 (2015).

[6] Hafner, Roland, and Martin Riedmiller. "Reinforcement learning in feedback control." Machine Learning 84.1-2 (2011): 137-169.

[7] Kingma, Diederik, and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).

[8] Engel, Jan-Maarten, and Robert Babuška. "On-line reinforcement learning for nonlinear motion control: Quadratic and non-quadratic reward functions." World Congress. Vol. 19. No. 1. 2014.

[9] Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).

[10] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Deep sparse rectifier neural networks." International Conference on Artificial Intelligence and Statistics. 2011.

[11] Degris, Thomas, Martha White, and Richard S. Sutton. "Off-policy actor-critic." arXiv preprint arXiv:1205.4839 (2012).

[12] Bengio, Yoshua. "Practical recommendations for gradient-based training of deep architectures." Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg, 2012. 437-478.

[13] LeCun, Yann A., et al. "Efficient backprop." Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg, 2012. 9-48.

[14] Lin, Long-Ji. "Self-improving reactive agents based on reinforcement learning, planning and teaching." Machine Learning 8.3-4 (1992): 293-321.

[15] Lange, Sascha, Thomas Gabel, and Martin Riedmiller. "Batch reinforcement learning." Reinforcement Learning. Springer Berlin Heidelberg, 2012. 45-73.

[16] Goodfellow, Ian J., et al. "An empirical investigation of catastrophic forgetting in gradient-based neural networks." arXiv preprint arXiv:1312.6211 (2013).

[17] Levine, Sergey. "Exploring deep and recurrent architectures for optimal control." arXiv preprint arXiv:1311.1761 (2013).

[18] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Vol. 1. No. 1. Cambridge: MIT Press, 1998.

[19] Van den Hof, Paul. "Closed-loop issues in system identification." Annual Reviews in Control 22 (1998): 173-186.

[20] Stadie, Bradly C., Sergey Levine, and Pieter Abbeel. "Incentivizing exploration in reinforcement learning with deep predictive models." arXiv preprint arXiv:1507.00814 (2015).

[21] Adam, Sander, Lucian Busoniu, and Robert Babuška. "Experience replay for real-time reinforcement learning control." IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 42.2 (2012): 201-212.

[22] Wawrzyński, Paweł. "Real-time reinforcement learning by sequential actor-critics and experience replay." Neural Networks 22.10 (2009): 1484-1497.
