Asynchronous Deep Q-Learning for Breakout
Edgard Bonilla (edgard), Jiaming Zeng (jiaming), Jennie Zheng (jenniezh)
CS229 Machine Learning, Stanford University
Introduction
Reinforcement learning allows machines to iteratively determine the best behavior in a specific environment based on feedback and performance, making it highly adaptable and applicable to a wide range of domains. We apply deep Q-learning [1] to teach an artificial agent to play the Atari game Breakout using RAM states.
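For concreteness, here is a minimal sketch of the environment interface this implies, using the classic OpenAI Gym API. The env id "Breakout-ram-v0" is an assumption, and the random policy is only a placeholder for the trained agent:

import gym

env = gym.make("Breakout-ram-v0")   # Breakout exposed as raw Atari RAM
state = env.reset()                 # 1x128 byte vector: the 128-byte RAM

done = False
while not done:
    action = env.action_space.sample()             # placeholder policy
    state, reward, done, info = env.step(action)   # reward = score delta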
Future Work
• Better accommodation of image inputs
• Implement other variations on the Boltzmann-Q policy to further explore the exploration vs. exploitation trade-off
References
1. Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
2. Mnih, V., et al. (2016). Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783.
MDP Formulation
§ States: 1x128 RAM state of the game
§ Actions: Left, Right, Do nothing
§ Policy: Linearly annealed ε-greedy method to select actions [2] (sketched below)
§ Rewards: The score of the game
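A sketch of the annealing scheme follows; the endpoints and horizon (1.0 → 0.1 over the first million frames) are illustrative values in the spirit of [2], not necessarily the ones we used:

import numpy as np

def epsilon(frame, eps_start=1.0, eps_end=0.1, anneal_frames=1_000_000):
    # Linearly anneal epsilon from eps_start to eps_end, then hold it fixed.
    fraction = min(frame / anneal_frames, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

def select_action(q_values, frame, rng=np.random):
    # Explore uniformly with probability epsilon; otherwise act greedily.
    if rng.rand() < epsilon(frame):
        return rng.randint(len(q_values))
    return int(np.argmax(q_values))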
Deep Q-Network (DQN)
A DQN is a neural network that, given an input state s, estimates Q(s, a) for all possible actions a.
[Network diagram: inputs (1x128 RAM states) feed into dense layers 1 through k, each with n nodes, followed by a dense output layer with one unit per action.]
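A minimal Keras sketch of this network (we used Keras; the ReLU activations and the RMSprop/MSE training configuration are assumptions, as the poster does not specify them):

from keras.models import Sequential
from keras.layers import Dense

def build_dqn(k=2, n=256, num_actions=3):
    # k dense hidden layers of n nodes mapping the 1x128 RAM state to
    # one Q-value per action (Left, Right, Do nothing).
    model = Sequential()
    model.add(Dense(n, activation="relu", input_shape=(128,)))
    for _ in range(k - 1):
        model.add(Dense(n, activation="relu"))
    model.add(Dense(num_actions, activation="linear"))  # Q(s, a) outputs
    model.compile(optimizer="rmsprop", loss="mse")
    return model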
Methodology
We implemented Asynchronous Q-Learning with a target network and reward clipping [1, 2].
[Flowchart: initialize the DQN network and the number of threads. Threads 1 … t each start an episode and repeatedly get a frame, act on the environment, and store experiences; when the game reaches a terminal state, each thread updates the DQN, sharing its DQN with and pulling the DQN from the other threads.]
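A simplified single-thread sketch of this loop is below. It is our reading of asynchronous one-step Q-learning [1, 2], not the exact poster code: the helper names (shared_dqn, target_dqn, make_env, select_action) are placeholders, updates are batched rather than gradient-accumulated, and the target network is assumed to be refreshed from the shared DQN elsewhere.

import numpy as np

GAMMA = 0.99  # discounted reward factor, as in our default settings

def worker(shared_dqn, target_dqn, make_env, select_action, lock,
           update_every=32, num_frames=1_000_000):
    # One asynchronous learner thread: act, store, and train the shared DQN.
    env = make_env()
    state = env.reset()
    batch, frame = [], 0
    while frame < num_frames:
        q_values = shared_dqn.predict(state[None, :])[0]
        action = select_action(q_values, frame)
        next_state, reward, done, _ = env.step(action)
        reward = float(np.clip(reward, -1.0, 1.0))     # reward clipping
        # One-step Q-learning target: y = r + gamma * max_a' Q_target(s', a')
        y = reward
        if not done:
            y += GAMMA * float(np.max(target_dqn.predict(next_state[None, :])[0]))
        target = q_values.copy()
        target[action] = y
        batch.append((state, target))
        frame += 1
        if len(batch) >= update_every or done:
            xs, ys = map(np.array, zip(*batch))
            with lock:                  # Keras models are not thread-safe
                shared_dqn.train_on_batch(xs, ys)
            batch = []
        state = env.reset() if done else next_state

# Example wiring (illustrative):
#   lock = threading.Lock()
#   threads = [threading.Thread(target=worker, args=(dqn, target_dqn,
#              make_env, select_action, lock)) for _ in range(num_threads)]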
Results

RAM vs. image state inputs:

Input | # Frames | Average | Max | Min | Median
RAM   | 55M      | 8.31    | 36  | 1   | 7
Image | 8M       | 1.09    | 4   | 0   | 1

Varied training setups:

Setup              | # Frames | Average | Max | Min | Median
1 life, ε-greedy   | 80M      | 10.57   | 39  | 0   | 8
5 lives, ε-greedy  | 80M      | 12.55   | 35  | 2   | 9
5 lives, Boltzmann | 42M      | 22.15   | 51  | 8   | 22.5
Summary of observations:

Change | Observation
Policy | Boltzmann-Q improves more quickly than ε-greedy
Lives  | No noticeable difference
Input  | RAM trains much faster than images
Results (cont.)
Network Architecture Experimentation
§ Network settings: k = 2, 3, 4; n = 128, 256
§ Comparison after 42 hrs of training*
§ We plot the moving averages for Max Q and rewards

Default training setting: RAM state inputs to a DQN of 2 layers, each with 256 nodes, and an ε-greedy policy. Each game episode has 5 lives. Discounted reward factor: γ = 0.99.
Reward statistics over 100 testing episodes
§ Varied settings to compare performance, including a change in training policy
§ Comparison to image inputs: after 84 hours of training*
Discussion
Network Architecture Experimentation
§ Network complexity increases training time
§ Most architectures show similar performance for the first 40 million frames

Testing over 100 episodes
§ Our implementation of Boltzmann-Q encourages exploration by giving an upper bound on the probability of choosing the greedy action (see the sketch below).
*Computing Info: All times are based on wall time of Stanford Farmshare’s Barley cluster.
Tools used: OpenAI Gym, Keras, TensorFlow