    Learning Agents in Quake III

Remco Bonse, Ward Kockelkorn, Ruben Smelik, Pim Veelders and Wilco Moerman
Department of Computer Science

    University of Utrecht, The Netherlands

    Abstract

This paper shows the results of applying Reinforcement Learning (RL) to train combat movement behaviour for a Quake III Arena [3] bot. Combat movement of a Quake III bot is the part that steers the bot while in battle with an enemy bot or human player. The aim of this research is to train a bot to perform better than the standard Quake III bot it is based on, by only changing its combat movement behaviour.

We extended a standard bot with a Q-learning algorithm that trains a Neural Network (NN) to map a game state vector to Q-values, one for each possible action. This bot (to which we refer as NeurioN) is trained in a reduced Quake III environment, to decrease noise and to make the training phase more effective. The training consists of one-on-one combat with its non-learning counterpart, in runs of up to 100,000 kills (frags). Reward is given for avoiding damage, thereby letting the bot learn to avoid getting hit.

We found that it is possible to improve a Quake III bot by using RL to train combat movement. The trained bot outperforms its non-learning counterparts, but does not appear smarter in combat with human opponents.

    1 Introduction

Games are more and more considered a suitable testing ground for artificial intelligence research [5]. Reasons for this include their ever-increasing realism, their limited and simulated environment (in which agents do not require sensors and imaging techniques to perceive their surroundings), their accessibility and inexpensiveness, and the fact that the game industry is big business.

Artificial Intelligence has been successfully applied to classic games like Othello and Chess, but contemporary computer games, like First Person Shooters (FPS), which can be regarded as the most popular games nowadays, typically have limited artificial intelligence (rule-based agents or agents based on finite state machines). As a result, agents (called bots) are far from able to compete with human players on their tactical and learning abilities.

With the rendering computations being moved more and more to the graphics card's GPU, CPU cycles become available for AI. More elaborate and computationally intensive techniques might now be used in these games, adding to the games' realism and creating more human-like artificial opponents.

This research explores the application of Machine Learning (ML) methods to FPSs, focusing on one of the most popular games: Quake III Arena [3]. Bots in Quake III have limited intelligence and are no match for an expert human player. We will extend such a bot with ML capabilities and evaluate its performance against non-learning bots and human players. For more information on Quake III and its bots, we refer to [10, 11].

Previous research on applying ML techniques to bots includes [4], in which Laird et al. present a bot for Quake II, based on (800+) rules, that can infer how to set up an ambush in a way quite resembling a human player. Related to our research is the work of Zanetti et al. [12]. They have implemented a Quake III bot that uses three NNs for: movement during combat, aiming and shooting (including selecting which weapon to use), and path planning in non-combat situations. Their networks are trained using Genetic Algorithms (GA) and a training set of recordings of matches of expert human players. The goal is for the networks to imitate the behaviour of these experts. The resulting bot turns out to be far from competitive, but has still learned several expert behaviours.

A Quake III bot has several independent decision nodes (e.g. aiming and shooting, goal selection, chatting). Our focus is on the movement behaviour of a bot in combat mode (i.e. upon encountering a nearby enemy). This behaviour typically includes dodging rockets and moving in complex patterns so as to increase the aiming and shooting difficulty for the enemy. Human expert players excel at combat movement with techniques such as circle strafing and, of course, a lot of jumping around, thereby making it very difficult for their opponents to get a clear shot. Combat movement is one of the parts of bot AI that is not often studied in this type of research (whereas goal and weapon selection are). This made it interesting for us to examine.

As mentioned, Evolutionary Algorithms have already been applied successfully to similar problems in FPSs, albeit limited to research environments, yet little is known about the suitability of Reinforcement Learning (RL, see e.g. [1, 9]) in such games. Therefore, we have chosen to apply RL to improve the combat performance of a Quake III bot. As we did not have the time and resources (i.e. expert players) to implement supervised learning, and because supervised learning has already been applied (somewhat) successfully [12], we use an unsupervised learning algorithm.

    2 Reinforcement Learning

    We have implemented the Q-learning algorithm (see [9]) as described below.

1  Initialize Q(s, a) arbitrarily
2  Repeat (for each combat sequence):
3      Initialize state s
4      Repeat (for each step of the combat sequence):
5          Select action a based on s using softmax
6          Execute a, receive reward r and next state s'
7          Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') - Q(s, a)]
8          s ← s'
9      Until s is non-combat or the bot is killed

Listing 1: Q-learning algorithm

As follows from Listing 1, we consider each combat sequence an episode. A combat sequence starts when the combat movement decision node of the Quake III AI is first addressed, and ends when more than 10 in-game seconds have passed since the last call, in order to keep the Q-value function smooth. Since the Quake III engine decides when to move into combat mode, it might be 100 milliseconds or just as well 100 seconds since the last combat movement call. In this (fast) game this would mean a completely different state, which has minimal correlation with the action taken in the last known state. If the time limit has expired, the last state in the sequence cannot be rewarded properly and is discarded. However, if the bot is killed not too long after the last combat movement, it still receives a (negative) reward.
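To make this episode bookkeeping concrete, the following sketch (in Python, not the actual Quake III game-module code; COMBAT_TIMEOUT, handle_combat_step and handle_death are our own hypothetical names) shows how a transition could be stored, discarded after the 10-second timeout, or given a terminal penalty on death. The collected transitions would then feed the update of line 7 in Listing 1.

    COMBAT_TIMEOUT = 10.0  # in-game seconds (assumption: measured in the game's own clock)

    def handle_combat_step(memory, transitions, now, state, action, reward):
        """Called each time the combat-movement node runs; `memory` holds the previous step."""
        if memory.get("state") is not None:
            if now - memory["time"] <= COMBAT_TIMEOUT:
                # The previous state-action pair can still be rewarded with the new observation.
                transitions.append((memory["state"], memory["action"], reward, state))
            # Otherwise too much time has passed and the stored step is discarded:
            # the reward can no longer be attributed to it.
        memory.update(state=state, action=action, time=now)

    def handle_death(memory, transitions, now, death_penalty):
        """Being killed shortly after the last combat move still yields a (negative) reward."""
        if memory.get("state") is not None and now - memory["time"] <= COMBAT_TIMEOUT:
            transitions.append((memory["state"], memory["action"], death_penalty, None))
        memory.clear()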

We have also experimented with an adaptation of Q-learning known as Advantage Learning. Advantage Learning is implemented by replacing line 7 in Listing 1 with the following equation:

$$A(s, a) \leftarrow A(s, a) + \alpha\,\frac{r + \gamma \max_{a'} A(s', a') - A(s, a)}{\Delta T\,k} \quad (1)$$

Here the A-value of an (s, a) pair is the advantage the agent gains by selecting a in state s. ΔT stands for the time passed since the last time the action was selected and k is a scaling factor. Both α (in Q-learning) and k (in Advantage Learning) lie in [0, 1]. Since we work with discrete time, the ΔT term is always 1. Advantage Learning uses the advantage that a certain state-action pair (Q-value) has over the current Q-value. This advantage is then scaled (using the scaling factor k). The algorithm is useful when the Q-values do not differ very much. Normal Q-learning would have to become increasingly accurate to be able to represent the very small, but meaningful, differences in adjacent Q-values, since the policy has to be able to accurately determine the maximum over all the Q-values in a given state. Since the Q-values are approximated, this poses a severe problem for the function approximator. It is easy for the approximator to decrease the overall error by roughly approximating the Q-values; however, it is hard, requires a lot of training time, and might even be impossible (given the structure of the approximator, for instance not enough hidden neurons) to accurately approximate the small but important differences in Q-values. Advantage Learning learns the scaled differences (advantages) between Q-values, which are larger, and therefore has a better chance of approximating these advantages than Q-learning has of approximating the Q-values. Since the advantages correlate to the Q-values, the policy can simply take the maximum advantage when the best action needs to be selected [2].
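As a minimal sketch of the two update rules (assuming only NumPy; the function names are ours, and the learning rate α is written explicitly rather than being absorbed into the network's training step), note that with ΔT = 1 and k = 1 the advantage update reduces to the Q-learning update of line 7 in Listing 1:

    import numpy as np

    def q_learning_update(q_sa, r, q_next, alpha, gamma):
        """Line 7 of Listing 1: move Q(s, a) towards the one-step TD target."""
        return q_sa + alpha * (r + gamma * np.max(q_next) - q_sa)

    def advantage_learning_update(a_sa, r, a_next, alpha, gamma, k, dt=1.0):
        """Equation (1): the temporal-difference term is scaled by 1 / (dt * k)."""
        return a_sa + alpha * (r + gamma * np.max(a_next) - a_sa) / (dt * k)

    # Example: with k = 0.25 the difference term is amplified by a factor of 4,
    # which makes small but meaningful differences between action values easier
    # for a function approximator to represent.
    q_next = np.array([0.09, 0.11, 0.12])
    print(q_learning_update(0.10, -0.2, q_next, alpha=0.01, gamma=0.95))
    print(advantage_learning_update(0.10, -0.2, q_next, alpha=0.01, gamma=0.95, k=0.25))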

For action selection we use a softmax policy with Boltzmann selection. The chance P that action a is chosen in state s is defined as in Equation 2.

$$P(a) = \frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}} \quad (2)$$

where τ is the temperature and n the number of actions.

As is common with softmax, action a1 is chosen more often in state s if Q(s, a1) ≥ Q(s, ai) for all ai ∈ A. But since the policy is stochastic, there will always be some exploration. In the beginning, much exploration is performed, because all Q-values are initialized with random values. However, as the learning process progresses, actions that have been rewarded highly will have higher Q-values than others, thereby exponentially increasing their chance of being selected.
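A small sketch of this Boltzmann selection over the 18 Q-values, assuming the temperature τ of Equation 2 (the function and variable names are ours):

    import numpy as np

    def boltzmann_select(q_values, temperature, rng=np.random.default_rng()):
        """Select an action index with probability proportional to e^(Q/tau), as in Equation 2."""
        prefs = np.asarray(q_values, dtype=float) / temperature
        prefs -= prefs.max()          # subtract the maximum for numerical stability
        probs = np.exp(prefs)
        probs /= probs.sum()
        return rng.choice(len(probs), p=probs)

    # Example: with a high temperature the distribution is nearly uniform (much exploration);
    # with a low temperature the highest-valued action dominates.
    action = boltzmann_select(q_values=np.zeros(18), temperature=50.0)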

Because the state space is continuous, we use a Multilayer Perceptron (MLP) to approximate the Q-value function. The input layer consists of 10 neurons for the state vector; the output layer consists of 18 neurons that contain the Q-values, one for each action. We use one hidden layer and vary the number of hidden neurons during our experiments, ranging from 0 to 30, where 0 hidden neurons means there is no hidden layer. We use sigmoid activation functions for the hidden neurons and linear functions for the output neurons. This is because we expect a continuous output, but we also want to reduce the effects of noise: the sigmoid functions in the hidden layer filter out small noise in the input. In most experiments, we initialized the weights with random values drawn from the uniform distribution over [-0.01, 0.01]. We also conducted some experiments with a higher margin. More about the parameters of the MLP can be found in Section 4.
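The network is small enough to sketch directly in NumPy. The sketch below assumes the architecture just described (10 inputs, one sigmoid hidden layer, 18 linear outputs, weights drawn uniformly from [-0.01, 0.01]) and trains on a single Q-value target per step; it is an illustration with our own class and method names, not the original implementation, and it omits the 0-hidden-neuron variant.

    import numpy as np

    class QValueMLP:
        """Minimal MLP: 10 state inputs -> sigmoid hidden layer -> 18 linear Q-values."""

        def __init__(self, n_in=10, n_hidden=15, n_out=18, init_margin=0.01, lr=0.01,
                     rng=np.random.default_rng()):
            self.lr = lr
            self.w1 = rng.uniform(-init_margin, init_margin, (n_in, n_hidden))
            self.b1 = np.zeros(n_hidden)
            self.w2 = rng.uniform(-init_margin, init_margin, (n_hidden, n_out))
            self.b2 = np.zeros(n_out)

        def forward(self, state):
            self.h = 1.0 / (1.0 + np.exp(-(state @ self.w1 + self.b1)))  # sigmoid hidden layer
            return self.h @ self.w2 + self.b2                            # linear output layer

        def train(self, state, action, target):
            # One gradient-descent step on the squared error of the chosen action's Q-value.
            q = self.forward(state)
            err = q[action] - target
            grad_out = np.zeros_like(q)
            grad_out[action] = err
            grad_h = self.w2 @ grad_out * self.h * (1 - self.h)
            self.w2 -= self.lr * np.outer(self.h, grad_out)
            self.b2 -= self.lr * grad_out
            self.w1 -= self.lr * np.outer(state, grad_h)
            self.b1 -= self.lr * grad_h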

We considered several reward rules that might lead to good tactical movement. Possible reward rules are:

$$R(s_t, a) = -\,\frac{\mathit{health}_{t-1} - \mathit{health}_t}{\mathit{health}_{t-1}} \quad (3)$$

$$R(s_t, a) = \mathit{numFrags} - \frac{\mathit{health}_{t-1} - \mathit{health}_t}{\mathit{health}_{t-1}} \quad (4)$$

$$R(s_t, a) = \frac{\mathit{enemyHealth}_{t-1} - \mathit{enemyHealth}_t}{\mathit{enemyHealth}_{t-1}} - \frac{\mathit{health}_{t-1} - \mathit{health}_t}{\mathit{health}_{t-1}} \quad (5)$$

$$R(s_t, a) = \mathit{accurateHits} - \frac{\mathit{health}_{t-1} - \mathit{health}_t}{\mathit{health}_{t-1}} \quad (6)$$

Rule 3 mainly lets the bot minimize damage to itself and thus leads to self-preserving bots, which favour evasive actions over aggressive actions. Rule 4 will eventually lead to good game play because it is our intuition that this is what human players do: keep good health and, whenever possible, frag¹ someone. However, frags do not happen very often and a good hit does not always mean a frag. Therefore, we came up with the next rule.

¹ Fragging is slang for killing someone in a computer game.


Figure 1: Reduced training environments. (a) Cube map; (b) Round Rocket map.

In rule 5, every hit will be taken into account, and it will be more attractive to attack a player with low health or to run away if the bot's own health is low. But we argued that the health loss of the enemy is not directly related to tactical movement. For example, if the enemy happens to take a health pack, the last chosen action will be considered bad, while it could have been a good evasive manoeuvre. Rule 6 does not have this disadvantage, but depends too much on the aiming skills of the bot.

We have chosen to use rule (3), which rewards evasive actions, as this is the most important goal of combat movement: evading enemy fire. Rule 3 takes into consideration the minimal amount of information needed to evaluate the action.
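Rule (3) amounts to the normalised, non-positive health loss since the previous combat step; a one-line sketch (function and variable names are ours):

    def reward_rule_3(health_prev, health_now):
        """Equation (3): 0 when no damage was taken, approaching -1 as the bot loses all its health."""
        return -(health_prev - health_now) / health_prev

    # Example: dropping from 100 to 75 health yields a reward of -0.25.
    assert reward_rule_3(100, 75) == -0.25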

    3 Environment

To eliminate a part of the randomness commonly present in FPSs, we use a reduced training environment in the form of a custom level, which is a simple round arena in which the bots fight (Figure 1(b)). In this map, bots enter combat mode almost immediately when they spawn. To eliminate the influence of different types of weapons, we have chosen to allow the use of only one type of weapon. To speed up the learning process, the custom map does not feature any cover, health packs or other items, other than this weapon and its corresponding ammunition.

The bot we created, which we call NeurioN, is based on one of the standard Quake III bots, named Sarge. Except for their looks, sounds, and, most importantly, their combat movement behaviour, these bots are identical. For NeurioN we train this part of the AI with RL, while Sarge remained fixed during training. The combat movement node is called irregularly by the bot decision loop. It is not necessary for a bot to be in combat mode to shoot at an opponent, but when a bot is close to an opponent and has the opponent in sight, the combat mode will be called. By making a small custom map in which the bots cannot hide from each other, we forced the bots to enter their combat mode more often.

The normal behaviour of a bot in combat mode is based on its preference to move and/or jump in unpredictable directions. In determining these directions, the environment is not taken into account. The same goes for incoming rockets, to which a bot is blind. The only exceptions to this are collisions with walls and falling off edges. Because it is not easy to extract information about imminent collisions or drops off edges, we have chosen not to include this information in the state vector of the bot we train.

    3.1 Problems with the environment

The environment used for the experiment, Quake III Arena [3], seems very suitable for these types of experiments, since its source code is freely available and the game is very stable and well known. However, it is not without its disadvantages.

The first problem we ran into is the fact that, despite the release of the Quake III source code, it is still obvious that the game is a commercial product that was not created to be an experimental environment. The source code is hardly documented.


Figure 2: Advantage for the first spawned bot (Sarge vs Sarge control run; moving average over 5000 frags, shown every 500; win% plotted against total frags for Sarge 1 and Sarge 2).

Despite some web pages on the internet [7, 8, 6], information is scarce. The engine is created to work in real time. While speeding up the simulation is possible, training is still slow.

During the creation of our reduced training environment, we encountered several other problems. One problem was the cube-shaped map we started out with (Figure 1(a)). When in combat mode, the learning bot does not take its static surroundings into account. Therefore, more often than desirable, NeurioN would end up in a corner, with a very large chance of getting shot. To counter this problem a round map was created, so that there are no corners where the bot can get stuck.

Another problem we discovered while training in our map is that when the map does not have any items for bots to pick up, they remain standing still, having no goals at all. If both bots are standing still with their backs facing each other, nothing will happen. This situation is quite rare, and it therefore took a while to discover what was ruining our experiments. The problem was tackled by adding excessive amounts of weapons and ammunition crates.

We also discovered a problem that is caused by the order in which the different bots in a game are given time to choose their actions. This is done sequentially, which results in a small but significant advantage for the bot that is added to the arena first. When two identical (non-learning, standard Quake III) bots are added to a game, the bot first in the list wins about 52% of the time, as can be seen in Figure 2. This is a significant difference.

Finally, we found out that the shooting/aiming precision of Sarge at the highest difficulty level is so good that there is not much room to evade the shots. This is especially the case with weapons that fire their ammunition at a high velocity. The bullets of these weapons are almost instantaneous, which means that they impact directly when fired. Because we wanted the bot to learn combat movement, it was better to use the rocket launcher as the only weapon. The rocket launcher is the weapon with the slowest ammunition in Quake III, so it is the easiest weapon to evade and thus the best weapon for learning tactical movement in combat.

    3.2 States

After trying various sets of state information, we finally chose to use the following state vector (a code sketch of how these ten inputs could be assembled is given after the list):

distance to opponent: a number between 0 and 1, where 1 is very close and 0 is far away. This means that when the opponent is out of sight the distance will also be 0;

relative angle of placement of opponent: the relative angle of the bot to the opponent, a number between 0 and 1, with 1 if the opponent is standing in front of the bot and 0 if the opponent is standing behind the bot; see Figure 3(a);


Figure 3: The state information. (a) Opponent placement; (b) Opponent view angle; (c) Projectile placement; (d) Projectile direction.

relative side of placement of opponent: 1 if the opponent is standing to the left, -1 if the opponent is standing to the right;

relative angle of opponent to bot: a number between 0 and 1, with 1 if the bot is standing in front of the opponent and 0 if the bot is standing behind the opponent; see Figure 3(b);

relative side of placement of the bot: -1 or 1, depending on the sign of the previous input;

distance to the nearest projectile: a number between 0 and 1, where 1 is very close and 0 is far away. If there is no projectile heading towards NeurioN, 0 is also given;

relative angle of placement of the nearest projectile: a number between 0 and 1, with 1 if the projectile is in front of the bot and 0 if the projectile is behind the bot; see Figure 3(c);

relative side of placement of the nearest projectile: 1 if the projectile is to the left, -1 if the projectile is to the right;

relative angle of the nearest projectile to bot: a number between 0 and 1; if the projectile is heading straight for the bot this will be 1, and if the angle between the projectile and the bot is equal to or larger than a certain threshold this will be 0. Any angle beyond the threshold indicates the projectile isn't a threat to the bot; see Figure 3(d);

relative side of bot with regard to the nearest projectile: 1 if NeurioN is to the left, -1 if NeurioN is to the right of the projectile.
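A sketch of how these ten inputs could be packed into the MLP's input vector, assuming hypothetical records of values already computed from the game state (none of these names come from the Quake III game module):

    from collections import namedtuple
    import numpy as np

    # Hypothetical containers for values precomputed from the game state.
    Opponent = namedtuple("Opponent", "closeness angle_from_bot side_from_bot angle_to_bot side_to_bot")
    Projectile = namedtuple("Projectile", "closeness angle_from_bot side_from_bot threat_angle side_of_bot")

    def build_state_vector(opp, proj):
        """The ten MLP inputs described above, in the listed order."""
        return np.array([
            opp.closeness,        # 1 = very close, 0 = far away or out of sight
            opp.angle_from_bot,   # 1 = opponent in front of the bot, 0 = behind it
            opp.side_from_bot,    # +1 = opponent to the left, -1 = to the right
            opp.angle_to_bot,     # 1 = bot in front of the opponent, 0 = behind it
            opp.side_to_bot,      # sign of the previous input
            proj.closeness,       # 1 = very close, 0 = far away or no projectile
            proj.angle_from_bot,  # 1 = projectile in front of the bot, 0 = behind it
            proj.side_from_bot,   # +1 = projectile to the left, -1 = to the right
            proj.threat_angle,    # 1 = heading straight for the bot, 0 = beyond the threat threshold
            proj.side_of_bot,     # +1 = bot to the left of the projectile, -1 = to the right
        ])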

    3.2.1 Information not used for states

Some information that seems important to describe the game state is left out, for the following reasons:

Health of NeurioN: the health of a player has no influence on the movement of that player when in direct combat;

Health of the opponent: as above, the health of the opponent has no effect on the combat movement;

Current weapon of the opponent: in our training environment two ballistic weapons are available: the machine gun, for which there is no extra ammunition available, and the rocket launcher. The rocket launcher is by far the more popular of the two for Sarge, and thus also for NeurioN. As soon as they get one, they will switch to the rocket launcher. This results in effectively only one weapon being used; therefore the weapon that the opponent uses is known, and it is not necessary to include this information in the state vector.


    Figure 4: The actions visualized

    3.3 Actions

    We consider 18 (legal) combinations of five elementary actions:

    Move forward (W);

    Move backward (S);

    Strafe left (A);

    Strafe right (D);

    Jump (J);

So, the possible actions are {∅, W, WA, WJ, ..., SDJ, D, DJ} (Figure 4). In this system other moves would be theoretically possible, for example moving forward and backward (WS) simultaneously, but of course this is impossible, so these kinds of options are left out. This results in the 18 legal moves: 3 options for forward movement (forward, nothing, backward), 3 options sideways in the same manner, and 2 options for jumping (to jump or not to jump), giving 3 x 3 x 2 = 18.

The orientation is not considered in our combat movement, because it is determined by the aiming function of the bot, which would overwrite any change made in the rotation of the bot with its own values. The movement of a bot is relative to its orientation vector, given by the aiming function.
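The 18 legal actions can be generated as the Cartesian product of the three movement axes; a small sketch (the encoding is ours):

    from itertools import product

    FORWARD = ["W", "", "S"]   # forward, nothing, backward
    SIDEWAYS = ["A", "", "D"]  # strafe left, nothing, strafe right
    JUMP = ["J", ""]           # jump or not

    # 3 x 3 x 2 = 18 legal combinations; "" is the do-nothing action.
    ACTIONS = ["".join(parts) for parts in product(FORWARD, SIDEWAYS, JUMP)]
    assert len(ACTIONS) == 18
    # e.g. ACTIONS contains "WAJ" (forward + strafe left + jump) and "" (stand still).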

    4 Setup of Experiments

When using RL and NNs, a number of parameters have to be tweaked, as they have a large influence on the resulting training performance.

The training phase consists of a game with a very high frag limit (around 200,000 frags). Each setting is run several times to validate the results (as the game is stochastic and the network initialization may be of influence). Because RL starts each training run with a randomly initialized NN and contains several variables of which the optimal setting is unknown, we did a broad sweep of the parameter space. The settings we tried consist of the following variables (a code sketch of the resulting grid is given after the footnote below):

Number of hidden neurons n ∈ {0, 5, 15, 30}
Discount γ ∈ {0.95, 0.80}
Temperature τ ∈ {10, 50}
Learning rate α ∈ {0.001, 0.003, 0.01}
Timescale factor k ∈ {1, 0.5, 0.25}²

² k = 1 is equal to Q-learning.
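The sweep itself is just the Cartesian product of these values; a sketch of the grid (the dictionary keys are our own names):

    from itertools import product

    PARAMETER_GRID = {
        "hidden_neurons": [0, 5, 15, 30],
        "discount":       [0.95, 0.80],
        "temperature":    [10, 50],
        "learning_rate":  [0.001, 0.003, 0.01],
        "timescale_k":    [1, 0.5, 0.25],   # k = 1 is plain Q-learning
    }

    # Every combination defines one candidate training configuration
    # (4 * 2 * 2 * 3 * 3 = 144 in total).
    configs = [dict(zip(PARAMETER_GRID, values))
               for values in product(*PARAMETER_GRID.values())]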


Because we had some trouble determining which information had to be part of the state vector, and made some incorrect implementations of it, many training runs performed earlier unfortunately had to be discarded.

    5 Results

During the broad sweep, several combinations of parameters were found that resulted in successful learning after some 100,000 or more frags. It was found that 15 hidden neurons, combined with a learning rate of 0.01, a discount of 0.95 and a temperature of 50, worked well; see Figure 5(a). This was achieved with the Q-learning algorithm (and therefore the Advantage Learning scaling factor k equals 1). This combination of parameters therefore served as the base for further investigation; if not otherwise mentioned, these values are used.

Increasing the learning rate above 0.01 resulted in unstable neural nets which sometimes learned, but most of the time remained at the initial level of success; see Figure 5(b). A learning rate of 0.01 was therefore used in further experiments.

Increasing the range over which the initial weights were randomly distributed, to 0.1 or even 0.5, did not result in any stable learning (at least, using the base values for the other parameters). Sometimes a decent learning curve was seen, but most of the time learning did not occur (see Figure 6(a)). A margin of 0.01 for the initialisation of the random weights was therefore used.

Varying the timescale factor k never resulted in a bad-learning or non-learning situation; see Figure 6(b). Most of the values for k resulted in roughly the same learning curves, although k = 0.1 resulted in a quicker learning phase, but a less stable, more jagged curve, as the large standard deviation indicates.

Decreasing or increasing the number of hidden neurons (e.g. to 10) did not give good results (Figure 7(a)), or at least not in the time frame that the base setting with 15 hidden neurons needed to achieve a good success rate. And since the base setting already needed 100,000 to 200,000 frags, taking some 10 hours of computing time, other numbers of hidden neurons were not extensively investigated.

A lower temperature of 10 did not result in learning; however, a higher temperature of 80 did (Figure 7(b)). A higher temperature resulted in much quicker learning, but also a more jagged learning curve. As a comparison, the temperature was varied when 30 hidden neurons were used (other settings were the same as the base). This experiment revealed that the temperature did not have any significant effect on the average that was reached, but the variation increased with the higher temperature, resulting in some good runs and some runs where nothing was learned; see Figure 8(a).

    6 Conclusion

Teaching a bot how to move in combat situations is a difficult task. The experiment depends both on choosing the right input vector and settings and on the environment used, Quake III in our case. However, using an MLP with 15 hidden neurons, the learning bot can be improved to a 65% chance of winning a one-on-one combat against its non-learning equal. This means that a bot with a trained NN wins almost twice as often as a preprogrammed bot.

The learning phase takes quite some time, often 100,000 to 200,000 frags or more; on a normal PC this takes about 10 hours.

Some settings show a capricious learning process, but the speed of learning is higher in these cases. Despite the fact that the numerical results show an improvement, human players will hardly notice any difference in behaviour between the standard bot and the learning bot. This is largely due to the fact that the bot is not always in combat mode when shot at (as one might expect), and hence is not using its trained NN.


    6.1 Future Work

As an extension of the research described in this paper, it would be possible to look into the influence of a larger state vector, in which information concerning the surroundings of the bot is taken into account.

Moreover, it would be interesting to let the bot learn its combat movement in a more complex environment. One can think of more opponents, weapons and items, or simply more complex maps, so as to let the bot learn in a normal Quake III game.

Lastly, the use of memory could be a good research subject, in which the bot uses its last action as an input. This could be done by using recurrent NNs.

    References

[1] M. Harmon and S. Harmon. Reinforcement learning: a tutorial, 1996. http://www-anw.cs.umass.edu/~mharmon/rltutorial/frames.html.

[2] M. E. Harmon and L. C. Baird. Multi-player residual advantage learning with general function approximation. Technical Report WL-TR-1065, Wright Laboratory, WL/AACF, 2241 Avionics Circle, Wright-Patterson Air Force Base, OH 45433-7308, 1996.

    [3] id Software. Quake III Arena, 1999. http://www.idsoftware.com/games/quake/quake3-arena/.

[4] J. E. Laird. It knows what you're going to do: adding anticipation to a Quakebot. In AGENTS '01: Proceedings of the Fifth International Conference on Autonomous Agents, pages 385-392, New York, NY, USA, 2001. ACM Press.

[5] J. E. Laird and M. van Lent. Human-Level AI's Killer Application: Interactive Computer Games. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pages 1171-1178. AAAI Press / The MIT Press, 2000.

[6] PhaethonH. Quake III: Arena, baseq3 mod commentary. http://www.linux.ucla.edu/~phaethon/q3mc/q3mc.html.

    [7] PlanetQuake.com. Code3arena. http://code3arena.planetquake.gamespy.com/.

    [8] [email protected]. Quake 3 game-module documentation. http://www.soclose.de/q3doc/index.htm.

[9] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[10] J. van Waveren. The Quake III Arena Bot. Master's thesis, Delft University of Technology, 2003.

[11] J. van Waveren and L. Rothkrantz. Artificial player for Quake III Arena. International Journal of Intelligent Games & Simulation (IJIGS), 1(1):25-32, March 2002.

[12] S. Zanetti and A. E. Rhalibi. Machine learning techniques for FPS in Q3. In ACE '04: Proceedings of the 2004 ACM SIGCHI International Conference on Advances in Computer Entertainment Technology, pages 239-244, New York, NY, USA, 2004. ACM Press.


Figure 5: Results (a), (b).


Figure 6: Results (a), (b).


Figure 7: Results (a), (b).


Figure 8: Results (a).
