
Bowling with Deep Learning

Zizhen Jiang
Department of Electrical Engineering

Stanford University, Stanford, CA, 94305
[email protected]

Abstract

A deep learning model, which can learn control policies directly from high-dimensional sensory input using reinforcement learning, is explored and improved for Atari Bowling. Extraction of high-level features from raw sensory data is made possible by recent advances in deep learning. However, in general, agents must take a series of actions to obtain a reward, and in environments with sparse rewards, reinforcement learning agents struggle to learn. To overcome this challenge, researchers have investigated multiple approaches, such as Q-learning and double Q-learning. In this project, we analyze the performance of Q-learning and double Q-learning on all Atari games, and further explore and improve the architecture of Q-learning and double Q-learning specifically on the Bowling game. A general architecture design guideline for a specific game, i.e., complex games require complex neural network architectures, is suggested. For the learning process, the training and testing data are generated while playing the game. The score the agent obtains in each round is evaluated as the result. The learning history is provided to investigate the learning process.

1. Introduction

Human beings are attracted to games, which are usually played for enjoyment and sometimes used as an educational tool. One of our favorite games, Rayman, is a platform video game series owned by Ubisoft [Rayman-wiki]. The agent in Rayman can jump, fly, run, or beat monsters. The goal is to save the Electoons locked in all the cages and then to save the world (Fig. 1). Complex agent movements and varied environments make the game challenging for human players.

To understand how human beings perceive and play games, researchers are developing methods to learn to control agents directly from high-dimensional sensory input, such as vision, using reinforcement learning (RL). Previously, most successful RL applications in these

Figure 1. One of our favorite games, Rayman, is a platform video game series owned by Ubisoft.

domains relied on hand-crafted features combined with linear value functions or policy representations. The quality of the feature representation determined the performance of those systems. Recent advances in deep learning have led to breakthroughs in computer vision, making it possible to extract high-level features from raw sensory data. These methods utilize a range of neural network architectures and have been demonstrated to be beneficial for RL with sensory data [Mnih, 2013].

However, in general, games may involve a series of actions before a reward is obtained. A human can easily understand how to acquire a reward, but it is difficult for a computer to sample billions of random moves and succeed with a probability of less than 1%. A reinforcement learning model for extended-time games therefore needs to handle sparse and delayed rewards.

To deal with sparse and delayed rewards, DeepMind introduced deep Q-learning, which falls under the umbrella of reinforcement learning [Mnih, 2013]. The world was astonished that computers were able to automatically learn to play Atari games (Fig. 2) and beat world champions at Go (Fig. 3) [Silver, 2016]. For the first time, reinforcement learning agents were learning from high-dimensional visual input using convolutional neural networks.


Figure 2. Atari computer system. [Q-news]

Figure 3. AlphaGo on the cover page of Nature. [This image is from Nature]

With only pixels as input and scores as rewards, the agents achieved superhuman performance.

In this work, we analyze the performance of the reported Q-learning architectures on various Atari games. Instead of providing a general architecture, we explore and improve the Q-learning-based architecture specifically for one game, Bowling (Fig. 4). A general architecture design guideline for a specific game is suggested.

2. Background

The task is that of an agent interacting with the Atari emulator in a sequence of actions, observations, and rewards. At each time-step, the agent selects an action $a_t$ from the set of game actions, $A = \{1, 2, \ldots, K\}$. The emulator takes the action as input and modifies its internal state and the game score. The agent does not observe the internal state of the emulator; instead, it observes an image $x_t \in \mathbb{R}^d$ from the emulator, which is a vector of raw pixel values representing the current screen. In addition, the agent receives a reward $r_t$ representing the change in game score. Note that in general the whole prior sequence of actions and observations determines the game score; the agent may receive feedback about an action only after many thousands of time-steps have elapsed (Fig. 5).

Since the task is only partially observed (the agent cannot fully understand the current situation from the current screen $x_t$ alone), we consider sequences of actions and observations, $s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t$, and learn game strategies that depend on these sequences through playing. With the assumption that all sequences in the emulator terminate in a finite number of time-steps, this formalism gives rise to a large but finite Markov decision process (MDP) in which each sequence is a distinct state. As a result, standard reinforcement learning methods for MDPs can be applied simply by using the complete sequence $s_t$ as the state representation at time $t$.

Figure 4. Bowling is chosen as an example of ’easier games’.

Figure 5. Learning flow of reinforcement learning. [Q-news]

The goal of the agent is to interact with the emulator by selecting actions in a way that maximizes future rewards.
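To make this interaction loop concrete, the minimal sketch below runs a random agent against the Atari Bowling emulator. It assumes the classic OpenAI Gym interface and the environment id "Bowling-v0"; the project itself uses the cs234 base code, so these names are illustrative assumptions, not the actual implementation.

```python
# Minimal sketch of the agent-emulator loop, assuming the classic Gym API.
import gym

env = gym.make("Bowling-v0")            # Atari Bowling emulator (assumed env id)
x_t = env.reset()                       # observed screen x_t: raw pixel values

episode_score = 0.0
done = False
while not done:
    a_t = env.action_space.sample()     # pick a_t from A = {1, ..., K}; a trained
                                        # agent would use argmax_a Q(s_t, a) instead
    x_t, r_t, done, _ = env.step(a_t)   # emulator updates its hidden internal state
    episode_score += r_t                # r_t is the change in game score; feedback
                                        # for an action may arrive many steps later
print("episode score:", episode_score)
```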

Multiple approaches have been developed. DeepMind first presented a model using a deep Q-network (DQN) [Mnih, 2013; Mnih, 2015; Wang, 2015; Kulkarni, 2016; van Hasselt, 2016]. Combining Q-learning with a flexible deep neural network, DQN was tested on a varied and large set of deterministic Atari 2600 games, reaching human-level performance on many of them [Mnih, 2013; Mnih, 2015]. Several variants have since been demonstrated. Double Q-learning (Double DQN) was investigated to deal with the overestimation caused by insufficiently flexible function approximation and noise [van Hasselt, 2016]. The dueling network uses two separate estimators, one for the state value function and one for the state-dependent action advantage function, generalizing learning across actions without imposing any change to the underlying reinforcement learning algorithm [Wang, 2015]. Focusing on specific 'harder' games, such as Montezuma's Revenge, Kulkarni proposed the hierarchical DQN (h-DQN) [Kulkarni, 2016]. However, little work analyzes the differences between Atari games or investigates architecture design starting from the 'easier' games.

In this project, we analyze the differences between Atari games, investigate the reported performance of DQN and Double DQN on various Atari games, and explore the architecture design for an 'easier' game, Bowling. A general architecture design guideline for a specific game is finally provided.

3. Technical Approach

We first implement the deep Q-network (DQN) presented by DeepMind [Q-wiki; Mnih, 2015]. Under the assumption that future rewards are discounted by a factor of $\gamma$ per time-step, we define the future discounted return at time $t$ as

$$ R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} $$

where $T$ is the time-step at which the game terminates. The optimal action-value function $Q^*(s, a)$ is defined as the maximum expected return achievable by following any strategy, after seeing some sequence $s$ and then taking some action $a$,

$$ Q^*(s, a) = \max_{\pi} \mathbb{E}[\, R_t \mid s_t = s, a_t = a, \pi \,] $$

where $\pi$ is a policy mapping sequences to actions (or distributions over actions).
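As a quick illustration of the discounted return (not code from the paper), the snippet below computes $R_t$ for a hypothetical reward sequence:

```python
# Sketch: R_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} for a recorded reward list.
def discounted_return(rewards, t, gamma=0.99):
    return sum(gamma ** (tp - t) * rewards[tp] for tp in range(t, len(rewards)))

rewards = [0.0, 0.0, 1.0, 0.0, 3.0]      # hypothetical per-step rewards
print(discounted_return(rewards, t=0))   # 0.99**2 * 1.0 + 0.99**4 * 3.0 ~= 3.862
```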

At the beginning, $Q(s, a)$ returns an arbitrary fixed value, randomly generated by the network. Then, each time the agent selects an action and observes a reward and a new state, $Q(s, a)$ is updated; the new state may depend on both the previous state and the selected action. The core of the algorithm is a simple value-iteration update, which assumes that the old value is close to the correct one and makes a correction based on the new information:

$$ Q(s_t, a_t) \leftarrow \underbrace{Q(s_t, a_t)}_{\text{old value}} + \alpha_t \Big( \underbrace{r_{t+1} + \gamma \max_a Q(s_{t+1}, a)}_{\text{learned value}} - Q(s_t, a_t) \Big) \quad (1) $$

where $\alpha_t$ is the learning rate, $r_{t+1}$ is the reward, and $\gamma$ is the discount factor.
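To make the update in Eq. (1) concrete, the tabular sketch below applies it to a toy problem; the agent in this project approximates $Q$ with a neural network rather than a table, so the sizes and names here are illustrative assumptions.

```python
# Tabular sketch of the value-iteration update in Eq. (1).
import numpy as np

num_states, num_actions = 10, 4           # toy sizes, not from the paper
Q = np.zeros((num_states, num_actions))

def q_update(s_t, a_t, r_t1, s_t1, alpha=0.1, gamma=0.99):
    learned_value = r_t1 + gamma * np.max(Q[s_t1])        # r_{t+1} + gamma * max_a Q(s_{t+1}, a)
    Q[s_t, a_t] += alpha * (learned_value - Q[s_t, a_t])  # correct the old value
```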

By minimizing a sequence of loss functions $L_i(\theta_i)$, the Q-network can be trained at each iteration $i$:

$$ L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\big[ (y_i - Q(s, a; \theta_i))^2 \big] \quad (2) $$

where $y_i = \mathbb{E}_{s' \sim \varepsilon}[\, r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a \,]$ is the target for iteration $i$ and $\rho(s, a)$ is a probability distribution over sequences $s$ and actions $a$ that we refer to as the behavior distribution. The parameters from the previous iteration, $\theta_{i-1}$, are held fixed when optimising the loss function $L_i(\theta_i)$. Differentiating the loss function with respect to the weights gives the following gradient:

$$ \nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot);\, s' \sim \varepsilon}\big[ \big( r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i) \big) \nabla_{\theta_i} Q(s, a; \theta_i) \big] \quad (3) $$

Stochastic gradient descent is used to optimise the loss function: the weights are updated after every time-step, and the expectations are replaced by single samples from the behavior distribution $\rho$ and the emulator $\varepsilon$, respectively.
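A minimal sketch of one such stochastic-gradient step is given below, written with the TensorFlow 2 / tf.keras API for readability; the project itself builds on the cs234 TensorFlow base code, so the function and variable names (q_net, target_net, optimizer) are assumptions, not the original implementation.

```python
# Sketch of one SGD step on the loss in Eq. (2), with the gradient of Eq. (3).
import tensorflow as tf

def train_step(q_net, target_net, optimizer, s, a, r, s_next, done, gamma=0.99):
    # Target y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}), theta_{i-1} held fixed.
    # (1 - done) masks the bootstrap term at episode end, a standard detail.
    q_next = tf.reduce_max(target_net(s_next), axis=1)
    y = r + gamma * (1.0 - done) * q_next

    with tf.GradientTape() as tape:
        q = q_net(s)                                           # Q(s, .; theta_i)
        q_sa = tf.reduce_sum(q * tf.one_hot(a, q.shape[-1]), axis=1)
        loss = tf.reduce_mean(tf.square(y - q_sa))             # (y_i - Q(s, a; theta_i))^2
    grads = tape.gradient(loss, q_net.trainable_variables)     # single-sample estimate of Eq. (3)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return loss
```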

We further implement the double deep Q-network (Double DQN) [van Hasselt, 2016]. The key difference between DQN and Double DQN is that Double DQN decouples the selection of the action from its evaluation when forming the target.
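Concretely, Double DQN keeps the same loss but changes only the target: the online network selects the greedy action and the target (previous-iteration) network evaluates it. A hedged sketch, reusing the assumed names from the step above:

```python
# Sketch of the Double DQN target (contrast with the max over the target net in DQN).
import tensorflow as tf

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    a_star = tf.argmax(q_net(s_next), axis=1)            # selection by the online network
    q_eval = target_net(s_next)                          # evaluation by the target network
    q_next = tf.reduce_sum(q_eval * tf.one_hot(a_star, q_eval.shape[-1]), axis=1)
    return r + gamma * (1.0 - done) * q_next
```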

4. Analysis

We analyze the reported performance of DQN and Double DQN on various Atari games [van Hasselt, 2010; van Hasselt, 2016]. Five key features of the reported Atari games are proposed: a) moving agent, whether the agent can move in the observation image; b) moving target, whether the reward target can move in the observation image; c) moving monster, whether a monster can move to eat the agent; d) moving background, whether the game has a background that confuses the perception of the agent; e) complex movement, whether the agent's contour in the image can influence the next action. A summary of these features is provided in Table 1.

We compare the reported scores [van Hasselt, 2016] and find that most of the games with a smaller number of features benefit from Double DQN, i.e., Double DQN tends to improve agent performance. This may be an indicator that the reported DQN architecture overestimates the action values for games with a smaller number of features, which we call 'easier games'. The reported architecture uses three convolution layers and two fully-connected layers, with all layers separated by rectified linear units (ReLU) [van Hasselt, 2016]. The results may indicate that this network is too complex for the easier games.

Another notable observation is that when games involve complex agent movement and multiple background settings, it is difficult for DQN and Double DQN to improve the agent's performance. As an example, the agent in Montezuma's Revenge can run, jump, and move up and down in multiple environments, which has led to specific architecture developments. We run the traditional DQN and Double DQN on Montezuma's Revenge. As expected, the agent either dies soon or gets stuck somewhere in the image (Figs. 6-7).


Figure 6. The agent in Montezuma's Revenge dies soon after training.

Figure 7. The agent in Montezuma's Revenge gets stuck after training.

Figure 8. Simplified neural network.

For this project, we focus on the easier games. Without loss of generality, we use the game Bowling.

5. Experiment

We simplify the architecture to two convolution layers and two fully-connected layers, with ReLU as the separation layer between adjacent layers (Fig. 8). The work is implemented based on the code provided by cs234 [Base] and written in TensorFlow [TF]. The first convolution layer convolves the input with 16 filters of size 8 (stride 4), and the second convolution layer has 32 filters of size 4 (stride 2). This is followed by a fully-connected hidden layer of 512 units. We may utilize dropout between the first and second layers. Our architecture has a smaller number of network parameters.
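For reference, a tf.keras sketch of this simplified architecture is shown below. The input shape of 84x84x4 (stacked, preprocessed frames) and the exact dropout placement are assumptions based on standard DQN preprocessing, not details stated in the paper; the project itself is implemented with the cs234 base code.

```python
# Sketch of the simplified network in Fig. 8: two conv layers + two FC layers,
# ReLU between adjacent layers, optional dropout (keep probability 0.8).
import tensorflow as tf

def build_q_network(num_actions, input_shape=(84, 84, 4), use_dropout=False):
    layers = [
        tf.keras.layers.Conv2D(16, 8, strides=4, activation="relu",
                               input_shape=input_shape),              # conv1: 16 filters, 8x8, stride 4
    ]
    if use_dropout:
        layers.append(tf.keras.layers.Dropout(0.2))                   # drop rate 0.2 = keep prob 0.8
    layers += [
        tf.keras.layers.Conv2D(32, 4, strides=2, activation="relu"),  # conv2: 32 filters, 4x4, stride 2
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation="relu"),                # fully-connected, 512 units
        tf.keras.layers.Dense(num_actions),                           # one Q-value per action
    ]
    return tf.keras.Sequential(layers)
```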

The performance comparison of the DeepMind architecture and our architecture is summarized in Table 2. The simpler architecture performs better on the simpler Bowling game using the same Q-learning update strategy. Our architecture beats the reported DQN performance and has the potential to beat Double DQN if the learning rate is tuned more carefully.

Figure 9. Reference learning-process plot of our network with DQN.

Figure 10. Reference learning-process plot of our network with Double DQN.

A reference learning-process plot of our network with DQN is shown in Fig. 9.

We further compare the performance of DQN and Double DQN using our architecture. However, Double DQN does not improve the performance further. This may be because the network is already simple enough to avoid overestimation, or because of the large learning rate, which makes the scores oscillate (Fig. 10).

Another approach using dropout [Dropout] is also investigated. We utilize dropout with a keep probability of 0.8. The result is shown in Fig. 11. The performance may improve with more training time and further parameter tuning.

Figure 11. Reference learning-process plot of our network with dropout.

The performance of our architecture with DQN, Double DQN, and dropout is summarized in Table 2.

6. Conclusion

This work analyzes the key features of various Atari games and suggests that simpler games typically require simpler neural networks (fewer parameters) while complex games require complex neural networks (more parameters). Based on this analysis, we simplify the architecture to two convolution layers and two fully-connected layers and beat the DQN performance reported by DeepMind. We demonstrate a concept of analyzing the features of a game and then exploring and improving the neural network architecture accordingly.

7. Future Work

A layer of batch normalization [Ioffe, 2015] may be added to the current simplified network to verify whether there is any over-fitting and to improve the robustness of the network. The analysis may provide a method to improve the architecture from the bottom up: researchers may perform the feature analysis over all Atari games, then explore and improve the network on specific games first, and finally generalize the architecture to all Atari games.
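As a rough illustration of this direction (purely hypothetical, not an experiment from this work), batch normalization could be inserted after each convolution of the simplified network, for example:

```python
# Hypothetical sketch: the simplified network with batch normalization added.
import tensorflow as tf

def build_q_network_bn(num_actions, input_shape=(84, 84, 4)):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 8, strides=4, input_shape=input_shape),
        tf.keras.layers.BatchNormalization(),     # [Ioffe, 2015]
        tf.keras.layers.ReLU(),
        tf.keras.layers.Conv2D(32, 4, strides=2),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(num_actions),
    ])
```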

8. References

[cs231n] CS231n course website, https://cs231n.stanford.edu
[cs234] CS234 course website, https://cs234.stanford.edu
[Base] Base code from CS234 Assignment 2, http://web.stanford.edu/class/cs234/assignment2/index.html. I did not take CS234; the GitHub code I found online required different versions of several packages and could not be made to work easily. I later found the package provided by CS234 suitable for the implementation of this project.

[Dropout] https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-7-action-selection-strategies-for-exploration-d3a97b7cceaf

[Ioffe, 2015] S. Ioffe, et al., "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).

[Kulkarni, 2016] T. D. Kulkarni, et al., "Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation." Advances in Neural Information Processing Systems, 2016.

[Mnih, 2015] V. Mnih, et al., "Human-level control through deep reinforcement learning." Nature 518 (2015): 529-533.

[Mnih, 2013] V. Mnih, et al., "Playing Atari with deep reinforcement learning." NIPS, 2013.

[Rayman-wiki] https://en.wikipedia.org/wiki/Rayman
[Silver, 2016] D. Silver, et al., "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.

[TF] TensorFlow, https://www.tensorflow.org/tutorials/
[Q-news] https://deepsense.io/playing-atari-on-ram-with-deep-q-learning/
[Q-wiki] https://en.wikipedia.org/wiki/Q-learning
[van Hasselt, 2010] H. van Hasselt, "Double Q-learning." Advances in Neural Information Processing Systems, 2010.

[van Hasselt, 2016] H. van Hasselt, et al., "Deep reinforcement learning with double Q-learning." AAAI, 2016.

[Wang, 2015] Z. Wang, et al., "Dueling network architectures for deep reinforcement learning." arXiv preprint arXiv:1511.06581 (2015).


Table 1. Summary of five features for reported Atari games. 'y' indicates that the game has the feature; 'n' indicates that it does not.

Game | Moving Agent | Moving Target | Moving Monster | Moving Background | Complex Movement
Alien | y | n | y | n | y
Amidar | y | n | y | n | n
Assault | y | y | y | n | n
Asterix | y | y | n | n | n
Asteroids | y | y | y | n | y
Atlantis | n | y | y | n | n
Bank Heist | y | n | n | n | n
Battle Zone | n | y | y | y | n
Beam Rider | y | y | y | n | n
Bowling | y | n | n | n | n
Boxing | y | y | y | n | n
Breakout | y | n | n | n | n
Centipede | y | y | y | n | n
Chopper Command | y | y | y | y | n
Crazy Climber | y | n | y | n | n
Demon Attack | y | y | y | n | n
Double Dunk | y | y | y | n | y
Enduro | y | n | y | y | y
Fishing Derby | y | y | y | n | n
Freeway | y | n | y | n | y
Frostbite | y | y | y | n | y
Gopher | y | n | y | n | n
Gravitar | y | y | y | y | y
Ice Hockey | y | y | y | n | y
James Bond | y | y | y | n | n
Kangaroo | y | n | y | n | n
Krull | y | n | y | n | n
Kung-Fu Master | y | y | y | n | n
Montezuma's Revenge | y | y | y | y | y
Ms. Pacman | y | n | y | n | n
Name This Game | y | y | y | n | n
Pong | y | y | n | n | n
Private Eye | y | y | y | y | n
Q*Bert | y | n | y | n | n
River Raid | y | y | y | y | n
Road Runner | y | y | y | y | n
Robotank | n | y | y | y | n
Seaquest | y | y | y | n | y
Space Invaders | y | y | y | n | n
Star Gunner | y | y | y | y | n
Tennis | y | y | n | n | y
Time Pilot | y | y | y | y | n
Tutankham | y | y | y | y | y
Up and Down | y | y | y | y | n
Venture | y | y | y | n | n
Video Pinball | y | y | n | n | n
Wizard of Wor | y | y | y | n | y
Zaxxon | y | y | y | y | n


Table 2. Performance comparison of the DeepMind architecture and our architecture.

Game | Random | Human | DQN | Double DQN | My DQN w/o Dropout | My DQN w/o Dropout (Peak) | My Double DQN | My DQN w/ Dropout
Bowling | 23.10 | 154.80 | 42.40 | 70.50 | 59.80 | 68.92 | 18.92 | 42.12
