
Asynchronous Deep Q-Learning for Breakout
Edgard Bonilla (edgard), Jiaming Zeng (jiaming), Jennie Zheng (jenniezh)

CS229 Machine Learning, Stanford University

Reinforcement learning allows machines to iteratively determine the best behavior in a specific environment based on feedback and performance, making it highly adaptable and applicable to a wide range of domains. We apply deep Q-learning [1] to teach an artificial agent to play the Atari game Breakout using RAM states.

• Better accommodation of image inputs
• Implement other variations on the Boltzmann-Q policy to further explore the exploration vs. exploitation trade-off

1. Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
2. Mnih, V., et al. (2016). Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783.

MDP Formulation
§ States: 1x128 RAM state of the game
§ Actions: Left, right, do nothing
§ Policy: Linearly annealed ε-greedy method to select actions [2] (sketched below)
§ Rewards: The score of the game
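As a concrete illustration of this formulation, here is a minimal sketch of the environment and the linearly annealed ε-greedy policy. The Gym environment id and the annealing constants (EPS_START, EPS_END, ANNEAL_FRAMES) are assumptions; the poster only states that ε is annealed linearly [2].

```python
import random

import gym
import numpy as np

# Annealing constants are hypothetical; the poster does not give the schedule.
EPS_START, EPS_END, ANNEAL_FRAMES = 1.0, 0.1, 1_000_000

def epsilon_at(frame):
    """Linearly anneal epsilon from EPS_START to EPS_END over ANNEAL_FRAMES frames."""
    frac = min(frame / ANNEAL_FRAMES, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

def select_action(q_values, frame):
    """epsilon-greedy: random action with probability epsilon, otherwise the greedy action."""
    if random.random() < epsilon_at(frame):
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))

# "Breakout-ram-v0" is the classic Gym id that exposes the 1x128 RAM state.
env = gym.make("Breakout-ram-v0")
state = env.reset()   # uint8 vector of length 128; the reward is the game score
```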

[Network diagram: Inputs (1x128 RAM states) → Dense Layer 1 (n) → … → Dense Layer k (n) → Dense Layer (# actions)]
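A minimal Keras sketch of the dense architecture in the diagram above: k hidden layers of n units on the 128-byte RAM input and one output per action. The default values (k = 2, n = 256) match the default training setting quoted later; the ReLU activations, optimizer, and loss are assumptions not stated on the poster.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, InputLayer

def build_dqn(k=2, n=256, num_actions=3):
    """Dense DQN: 1x128 RAM state in, one Q-value per action out."""
    model = Sequential([InputLayer(input_shape=(128,))])
    for _ in range(k):                                   # k hidden layers of n units
        model.add(Dense(n, activation="relu"))           # ReLU is an assumption
    model.add(Dense(num_actions, activation="linear"))   # Q(s, a) for each action
    model.compile(optimizer="rmsprop", loss="mse")       # optimizer/loss are assumptions
    return model
```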

We implemented asynchronous Q-learning with a target network and reward clipping [1,2].

[Flowchart: Initialize the DQN network and the number of threads. Each thread (Thread 1 … Thread t) repeatedly: start an episode → get frame → act on the environment → store experiences → update the DQN, pulling the DQN from other threads and sharing its DQN with other threads; when the game is terminal, start a new episode.]
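The flow above is essentially asynchronous one-step Q-learning [2]. Below is a heavily simplified sketch of one worker thread under the classic Gym API, with a fixed ε, made-up TARGET_SYNC and NUM_THREADS constants, and an update applied every step rather than over a batch of stored experiences; it illustrates the structure (shared network, target network, reward clipping) rather than reproducing the poster's exact implementation.

```python
import threading

import gym
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

GAMMA = 0.99          # discount factor from the default training setting
EPSILON = 0.1         # fixed epsilon for brevity; the poster anneals it linearly
TARGET_SYNC = 10_000  # hypothetical target-network sync period (not on the poster)
NUM_THREADS = 4       # hypothetical thread count (not on the poster)

def make_model(num_actions):
    model = Sequential([Dense(256, activation="relu", input_shape=(128,)),
                        Dense(256, activation="relu"),
                        Dense(num_actions)])
    model.compile(optimizer="rmsprop", loss="mse")
    return model

def worker(shared, target, lock, frames=100_000):
    """One asynchronous one-step Q-learning worker (greatly simplified)."""
    env = gym.make("Breakout-ram-v0")
    s = env.reset()
    for t in range(1, frames + 1):
        q = shared.predict(s[None] / 255.0, verbose=0)[0]
        a = (np.random.randint(len(q)) if np.random.rand() < EPSILON
             else int(np.argmax(q)))
        s2, r, done, _ = env.step(a)
        r = float(np.clip(r, -1.0, 1.0))              # reward clipping
        y = q.copy()
        y[a] = r if done else r + GAMMA * target.predict(s2[None] / 255.0,
                                                         verbose=0)[0].max()
        with lock:                                    # shared-network update
            shared.train_on_batch(s[None] / 255.0, y[None])
            if t % TARGET_SYNC == 0:                  # periodic target sync
                target.set_weights(shared.get_weights())
        s = env.reset() if done else s2

n_actions = gym.make("Breakout-ram-v0").action_space.n
shared, target = make_model(n_actions), make_model(n_actions)
target.set_weights(shared.get_weights())
lock = threading.Lock()
workers = [threading.Thread(target=worker, args=(shared, target, lock))
           for _ in range(NUM_THREADS)]
# for w in workers: w.start()   # launching the threads starts training
```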

Input   # Frames   Average   Max   Min   Median
RAM     55M        8.31      36    1     7
Image   8M         1.09      4     0     1

Setup                # Frames   Average   Max   Min   Median
1 life, ε-greedy     80M        10.57     39    0     8
5 lives, ε-greedy    80M        12.55     35    2     9
5 lives, Boltzmann   42M        22.15     51    8     22.5

Changes   Observations
Policy    Boltzmann-Q improves more quickly than ε-greedy
Lives     No noticeable difference
Input     RAM trains much faster than images

Network Architecture Experimentation
§ Network settings: k = 2, 3, 4; n = 128, 256
§ Comparison after 42 hrs of training*
§ We plot the moving averages for max Q and rewards

Default training setting: RAM state inputs to a DQN of 2 layers, each with 256 nodes, and an ε-greedy policy. Each game episode has 5 lives. Discounted reward factor γ = 0.99.
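For reference, the default setting above can be written as a small configuration dictionary; the key names are ours, not from the poster.

```python
# Default training configuration quoted above (key names are illustrative).
DEFAULT_CONFIG = {
    "input": "ram",               # 1x128 RAM state
    "hidden_layers": 2,           # k = 2 dense layers
    "hidden_units": 256,          # n = 256 nodes per layer
    "policy": "epsilon_greedy",   # linearly annealed epsilon-greedy
    "lives_per_episode": 5,
    "gamma": 0.99,                # discount factor
}
```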

Reward statistics over 100 testing episodes
§ Varied settings to compare performance, including a change in training policy
§ Comparison to images: after 84 hours of training*

Network Architecture Experimentation
§ Network complexity increases training time
§ Most architectures show similar performance for the first 40 million frames

Testing over 100 episodes
§ Our implementation of Boltzmann-Q encourages exploration by giving an upper bound on the probability of choosing the greedy action (see the sketch below).
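A minimal sketch of such a capped Boltzmann-Q policy; the temperature, the cap p_max, and the redistribution rule are assumptions, since the poster does not specify them. It shows one plausible way to bound the probability of the greedy action.

```python
import numpy as np

def boltzmann_capped(q_values, temperature=1.0, p_max=0.9):
    """Boltzmann (softmax) action selection with the greedy action's probability
    capped at p_max; the excess is spread over the other actions, which
    guarantees a minimum amount of exploration."""
    q = np.asarray(q_values, dtype=np.float64)
    z = (q - q.max()) / temperature              # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()              # Boltzmann distribution over actions
    greedy = int(np.argmax(q))
    if len(q) > 1 and p[greedy] > p_max:
        excess = p[greedy] - p_max
        p[greedy] = p_max
        p[np.arange(len(q)) != greedy] += excess / (len(q) - 1)   # redistribute excess
    return int(np.random.choice(len(q), p=p))
```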

*Computing info: All times are wall time on Stanford FarmShare's Barley cluster.

Tools used: OpenAI Gym, Keras, TensorFlow

[Poster sections: Introduction, Deep Q-Network (DQN), Methodology, Results, Results (cont.), Discussion, Future Work]

DQN is a neural network that, given an input state s, estimates Q(s, a) for all possible actions a. Output: # actions (one Q-value per action).