
Applying Reinforcement Learning Techniques in the Catch-the-Thief Domain

Master Thesis
Stefan Stavrev
University of Amsterdam
Master Artificial Intelligence, Track: Gaming
Supervised by: Dr. Shimon A. Whiteson (UvA) and MSc. Harm van Seijen (TNO)

Contents

1 Introduction
2 Background
  2.1 The RL framework
  2.2 Planning and learning
  2.3 Model-free and model-based learning
  2.4 SMDPs and Options
  2.5 Q-learning
  2.6 Value iteration
  2.7 Multi-agent systems
3 Domain description
  3.1 The environment
  3.2 Thief behavior
4 Methods
  4.1 State-space reduction
  4.2 Action-space reduction
  4.3 Algorithms
    4.3.1 Combination of offline Value Iteration and shortest-distance-to-target Deterministic Strategy (CVIDS)
    4.3.2 Options-based Reinforcement Learning Algorithm (ORLA)
    4.3.3 Shortest Distance Deterministic Strategy (SDDS)
5 Experimental results
  5.1 Experimental setup
  5.2 Results
6 Analysis and discussion
  6.1 ORLA-SA vs. CVIDS vs. SDDS
  6.2 ORLA-MA vs. ORLA-SA
7 Related work
8 Conclusions and Future work
  8.1 Conclusions
  8.2 Future work
Acknowledgments
Appendices
  1. Features to states and vice-versa
  2. Actions to features and vice-versa
  3. Line segment to line segment intersection
  4. Standard Error calculation
Bibliography

1 Introduction

Modern Artificial Intelligence is focused on designing intelligent agents.
An agent is a program that can make observations and take decisions as a consequence of these observations. An agent is considered intelligent if it can make autonomous decisions and act in a rational way. Learning how to act rationally is not easy. The different machine learning fields have different perspectives on learning. Some of them consider learning from explicit examples of correct behavior, while others rely on positive and negative reward signals for judging the quality of an agent's actions. The latter field is called Reinforcement Learning (RL). In RL, an agent learns by interacting with its environment.

In certain problems, there are multiple agents. One such problem is the catch-the-thief problem, in which several guards pursue a thief in a simulated environment. The main difficulty of this domain is scalability, because complexity grows exponentially with the number of agents and the size of the environment. To tackle the scalability problem, we make use of options [3]. Options are an extension of (primitive) actions and may take several timesteps to complete. Options are typically added to the set of available actions to obtain more effective exploration. However, we use them to reduce the size of the state-action space. This is achieved by only letting the agent learn in certain parts of the state space, while in the rest of the state space a fixed, hand-coded option is followed by the agent. The motivation behind this approach is that for large parts of the environment it is often easy to come up with a good policy. We propose to use hand-coded options for those parts and only use learning for the more difficult regions, where constructing a good policy is hard.

The main purpose of this master thesis is to empirically compare different strategies that use options for dealing with the scalability issue present in the catch-the-thief problem. We investigate both single-agent and multi-agent methods. Based on an analysis of the different methods, hypotheses are formed about which method will perform well on which task variation, and these hypotheses are tested with experiments. We are mainly interested in the performance in the limit that can be achieved with these methods. Under these settings, scalability plays a role, due to the limited space (RAM) available to store data, and because the total computation time available to reach maximum performance is limited in any practical setting. While we mainly focus on RL methods, for completeness we also compare against a planning method.

The rest of this thesis is organized as follows. First, we present some background knowledge about the reinforcement learning framework in general and the relevant RL techniques that we use in our research (Section 2). In Section 3, we formally introduce our domain and explain the typical difficulties of RL in this domain. Section 4 describes the methods and algorithms that we use, as well as relevant techniques for reducing the total state-action space; in addition, we form hypotheses about the performance and scalability of the proposed methods. Section 5 describes the experiments that we conduct and shows the relevant results. In Section 6 we perform an empirical analysis and discuss the outcome of our experiments. After that, we give a concise overview of existing empirical research in the catch-the-thief domain, as well as single- and multi-agent perspectives for modeling an agent's behavior (Section 7).
In Section 8 we summarize our conclusions and propose extensions for future work.

2 Background

2.1 The RL framework

We already briefly mentioned what an agent is in the Introduction. More formally, an RL agent is "an abstract entity (usually a program) that can make observations, takes actions, and receives rewards for the actions taken. Given a history of such interactions, the agent must make the next choice of action so as to maximize the long term sum of rewards. To do this well, an agent may take suboptimal actions which allow it to gather the information necessary to later take optimal or near-optimal actions with respect to maximizing the long term sum of rewards." [44][1]

Software agents can act following hand-coded rules, or learn how to act by employing machine learning algorithms. Reinforcement learning is one such sub-area of machine learning. Formally, one way to describe a reinforcement learning problem is by a Markov Decision Process (MDP). An MDP is defined as the tuple {S, A, T, R}: S is a set of states, A is a set of actions, T is the transition probability function and R is a reward function. In an MDP environment, a learning agent selects and executes an action a_t \in A in the current state s_t \in S at time t. At time t+1, the agent moves to state s_{t+1} \in S and receives a reward r_{t+1}. The agent's goal is to maximize the discounted sum of rewards from time t onwards, called the return R_t, defined as follows [1]:

R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

where \gamma (0 \le \gamma < 1) is a discount factor, specifying the degree of importance of future rewards. The agent chooses its actions according to a policy \pi, which is a mapping from states to actions. Each policy \pi is associated with a state-value function V^\pi(s), which predicts the expected return for state s when following policy \pi:

V^\pi(s) = E_\pi[R_t | s_t = s]

where E[\cdot] indicates the expected value. The optimal value of a state s, V^*(s), is defined as the maximum value over all possible policies:

V^*(s) = \max_\pi E_\pi[R_t | s_t = s]    (1)

Related to the state-value function is the action-value function Q^\pi(s, a), which gives the expected return when taking action a in state s and following policy \pi thereafter:

Q^\pi(s, a) = E_\pi[R_t | s_t = s, a_t = a]

The optimal Q-value of a state-action pair (s, a) is the maximum Q-value over all possible policies:

Q^*(s, a) = \max_\pi E_\pi[R_t | s_t = s, a_t = a]    (2)

An optimal policy \pi^*(s) is a policy whose state-value function is equal to (1).

2.2 Planning and learning

When a model of the environment is available, planning methods [1] can be used. Examples of such algorithms are Policy Iteration and Value Iteration. In practice, however, a model of the environment is not always available and has to be learned instead. In this case, learning methods [1] can be used, such as Monte Carlo, Q-learning and Sarsa.

2.3 Model-free and model-based learning

There are different ways of learning a good policy. Model-free learning algorithms directly update state or state-action values using observed samples. In the limit, some model-free learning methods are guaranteed to find an optimal policy \pi^* [1]. Most model-free methods require only O(|S||A|) space. In model-based learning [1], on the other hand, a model of the environment is estimated to determine the transition probability T between states and the reward function R. Then, T and R are used to compute the optimal values by means of off-line planning. Examples of such techniques are Dyna and Prioritized Sweeping [1]. The advantage of employing such methods is that they require fewer samples in order to achieve a good policy, but storing the transition model requires O(|S|^2 |A|) memory. The choice between model-free and model-based methods thus depends on the complexity of the domain: using a model requires more memory, but fewer samples.

2.4 SMDPs and Options

A Semi-Markov Decision Process (SMDP) is an extension of an MDP, appropriate for modeling continuous-time discrete-event systems [32], [33], [34]. It is formally defined as the tuple {S, A, T, R, F}, where S is a finite set of states, A is the set of primitive actions, T is the transition function from a state-action pair to a new state, R is the reward function and F is a function giving the transition-time probabilities for each state-action pair [2]. Discrete SMDPs represent transition distributions as F(s', N | s, a), which specifies the probability that action a, started in state s, terminates in state s' after N steps.

Options are a generalization of primitive actions [3]. Primitive actions always take one timestep to execute, while an option may take two, three or more timesteps to complete. That is why options are appropriate for modeling state transitions that take variable amounts of time, like those in an SMDP. An option is defined as the triplet {I, \pi, \beta}, where I \subseteq S is an initiation set, \pi is the option policy and \beta is a termination condition.¹ An option is available in state s_t if and only if s_t \in I. If the option is taken, actions are selected according to \pi until the option terminates stochastically according to \beta. In particular, an option executes as follows. First, the next action a_t is selected according to the policy \pi. The environment then makes a transition to state s_{t+1}, where the option either terminates, with probability \beta(s_{t+1}), or else continues, determining a_{t+1} according to \pi(s_{t+1}), possibly terminating in s_{t+2} according to \beta(s_{t+2}), and so on.

¹ The termination condition \beta plays a role similar to the \beta in \beta-models [47], but with an opposite sense. That is, \beta(s) here corresponds to (1 - \beta(s)) in [47].

2.5 Q-learning

Q-learning [1][27] is a well-known model-free algorithm, in which an agent iteratively chooses actions according to a policy \pi and updates state-action values according to the rule:

Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha [R + \gamma \max_{a'} Q(s', a')]

The learning rate \alpha (0 \le \alpha \le 1) determines to what extent a new sample changes the old Q-value. A factor of 0 will make the agent not learn anything, while a factor of 1 would make the agent consider only the most recent information. Obviously, choosing the correct \alpha is important for learning.
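To make this update rule concrete, the following is a minimal C++ sketch of a tabular Q-learning update (C++ being the language in which the algorithms of this thesis were implemented; see the implementation note in Section 5.1). The dense integer encoding of states and actions and the QTable structure are illustrative simplifications, not the actual thesis code.

    #include <algorithm>
    #include <vector>

    // Minimal tabular Q-learning, implementing
    // Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma * max_a' Q(s',a')).
    // States and actions are assumed to be encoded as dense integer indices.
    struct QTable {
        int numActions;
        std::vector<double> q;  // q[s * numActions + a]

        QTable(int numStates, int actions)
            : numActions(actions),
              q(static_cast<size_t>(numStates) * actions, 0.0) {}

        double& value(int s, int a) { return q[static_cast<size_t>(s) * numActions + a]; }

        double maxValue(int s) {
            double best = value(s, 0);
            for (int a = 1; a < numActions; ++a) best = std::max(best, value(s, a));
            return best;
        }

        // One Q-learning update after observing the transition (s, a, r, sNext).
        void update(int s, int a, double r, int sNext, double alpha, double gamma) {
            double target = r + gamma * maxValue(sNext);
            value(s, a) = (1.0 - alpha) * value(s, a) + alpha * target;
        }
    };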
2.6 Value iteration

Value iteration (VI) [1] is a planning algorithm. It uses as input the transition and reward functions, T and R, and computes from these the optimal value function V^* by iteratively improving the value estimates. Once the optimal value function is found, the optimal policy can easily be derived. Note that we can also use VI in model-based learning to compute a policy based on the model estimate. Below, the pseudo-code for the VI algorithm is given.

    loop until the policy is good enough
        for all states s \in S
            for all actions a \in A
                Q(s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s') V(s')
            end for
            V(s) = \max_a Q(s, a)
        end for
    end loop

2.7 Multi-agent systems

So far, we assumed the existence of only a single agent in an environment. In practice, however, there can be multiple agents interacting with each other in different ways in an environment. A system that consists of a group of such agents is called a Multi-Agent System (MAS). Formally, a MAS is a generalization of an MDP and is called a stochastic game (SG). An SG is the tuple {S, N, A_i, P(s, \bar{a}), r_i}: S is a finite set of states, N is a finite set of n agents, A_i is the set of actions available to agent i, P(s, \bar{a}) is the transition probability of reaching the next state after executing the joint action \bar{a} (a vector of the individual agents' primitive actions) in state s, and r_i is the individual reward for agent i. The rewards r_i(s, \bar{a}) also depend on the joint action.

There are different ways to select a joint action \bar{a}:

1) Central control mechanism
   a) The central approach is to combine all agents into a single "big" agent. That way, the multi-agent (MA) problem is transformed into a single-agent problem and we can use regular single-agent MDP techniques to solve it. This approach guarantees finding the optimal policy \pi^* in the limit, but convergence can be very slow. For example, if there are 4 agents, each with 4 available actions, the total number of joint actions is 4^4 = 256.

2) Decentralized (distributed) control mechanisms
   a) Another way to compute \bar{a} is by using coordination graphs² and explicit communication mechanisms, in order to model dependencies between individual agents. As a consequence of these mechanisms, the joint action is computed in a more efficient way and a computational speed-up is achieved.
   b) Finally, there are multi-agent methods that do not explicitly compute the joint action nor assume any dependence between the agents. Those approaches rely on implicit coordination mechanisms and/or shared value functions in order to achieve coordination between the agents. The total action space then scales linearly with the number of agents. In the example above, 4 agents with 4 available actions result in 4^4 = 256 joint actions, but we only need 4 * 4 = 16 Q-values per state. The implicit coordination mechanisms make it unnecessary to store 256 Q-values to describe all 256 joint actions (a short illustration of this is given at the end of this section).

The main advantages of employing a distributed MAS are computational speed-up and a reduction of the total action space whose values we need to store. A disadvantage is a decrease of the performance in the limit.

² A graph where there is a node for each agent and an edge between two agents if they must directly coordinate their actions to optimize some particular Q-value.
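As a rough illustration of this storage argument (not code from the thesis), the following C++ snippet computes the two quantities from the example above: the number of joint-action Q-values per state for the centralized "big agent" versus the number of per-agent Q-values per state for the decentralized approach.

    #include <cmath>
    #include <cstdio>

    // Storage argument from Section 2.7: a centralized "big agent" keeps one
    // Q-value per joint action (|A|^n per state), while decentralized agents
    // with individual tables keep n * |A| values per state. The numbers follow
    // the example in the text (4 agents, 4 actions each).
    int main() {
        const int numAgents = 4;
        const int actionsPerAgent = 4;

        const double jointValuesPerState = std::pow(actionsPerAgent, numAgents);  // 4^4 = 256
        const int factoredValuesPerState = numAgents * actionsPerAgent;           // 4*4 = 16

        std::printf("joint-action Q-values per state:    %.0f\n", jointValuesPerState);
        std::printf("per-agent Q-values per state (sum): %d\n", factoredValuesPerState);
        return 0;
    }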
3 Domain description

3.1 The environment

The domain in this thesis is the catch-the-thief domain, also known as the predator-prey domain: a number of agents, called guards, learn a control policy to catch a rogue entity, called the thief. The guards try to catch the thief as quickly as possible, while preventing it from escaping through one of the exits. In this specific case, the domain is modeled as an MDP with a discrete, finite state-action space. The choice has been made to use a discrete environment because it is easy to implement and analyze. A state consists of a feature vector, where each feature encodes the position of one of the entities (a guard or the thief). There are also obstacles/walls in the environment (Figure 1).

Figure 1: A snapshot of our catch-the-thief domain. Guards are dark-blue boxes, the thief is a red box, exits are green and walls are blue boxes.

The size of the total state space S, in the specific case of a 20 x 20 2D navigational cell grid with two guards and one thief, after subtracting the obstacle cells (137 in our case), is equal to:

[(20 * 20) - 137]^3 = 1.82 * 10^7    (3)

Each guard can take 4 (primitive) actions, corresponding to movements in the up, down, left and right directions. Since there are two guards, the total number of joint actions is equal to 4^2 = 16. As a consequence, the total state-action space S x A is equal to:

1.82 * 10^7 x 16 = 2.91 * 10^8    (4)

i.e., roughly 300 million state-action pairs. This number is substantial, which is why we look for a way to reduce the total state-action space in the Methods section. Note that the guards are the entities that are controlled by the algorithm. From the perspective of these guards, the thief is part of the environment.

The reward r_i(s) received by each agent after the execution of a joint action is equal to:
+1, if the thief is caught;
-1, if the thief has escaped;
0, otherwise.

The thief is caught when it is on the same grid cell as one of the guards. The thief's behavior is deterministic, but affected by whether it observes the guards or not. In the next section, we explain its behavior in detail.

3.2 Thief behavior

At each timestep, the thief takes a single step in one of four possible directions: up, down, right or left. As mentioned before, the thief chooses its actions deterministically. The thief's first priority is not being caught by the guards. The thief can observe guards using a visual system with a sensitive area of 13 x 13 cells centered around it. Line-segment-to-line-segment intersection is used for determining whether the thief sees any of the guards; implementation details can be found in Appendix 3. The system detects whether there are any guards within the sensitive area of the thief. If that is the case, the system reduces the set of available movement directions by removing the one that brings the thief closer to that guard. This process is repeated for all visible guards. From the set of remaining actions, the thief takes the one that results in the shortest path to one of the exits (ties are settled using a priority order over the actions, in order to keep the thief behavior deterministic). We use the number of cells free of guards and obstacles for measuring the proximity of the thief to one of the exits.

At a single timestep, the following occurs in order (a code sketch of this loop follows the list):
1. The guards select their movement actions.
2. The guards are moved.
3. If the thief is not caught after the guards are moved, it selects a movement direction, based on the location of the exits and the new location of the visible guards, and takes one step in that direction.
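The sketch below illustrates this per-timestep order of events in C++. The World interface and its helper functions (selectGuardActions, moveGuards, thiefCaught, selectThiefAction, moveThief, thiefEscaped) are hypothetical placeholders rather than the actual simulator code.

    // A minimal sketch of one simulation timestep, following the order of
    // events listed above. All helper functions are hypothetical placeholders.
    enum class StepOutcome { Caught, Escaped, Ongoing };

    template <typename World>
    StepOutcome simulateTimestep(World& world) {
        // 1. The guards select their movement actions (learned or hand-coded option).
        auto guardActions = world.selectGuardActions();

        // 2. The guards are moved.
        world.moveGuards(guardActions);
        if (world.thiefCaught()) return StepOutcome::Caught;    // reward +1

        // 3. The thief observes the new guard positions and takes one step
        //    towards the closest reachable exit.
        auto thiefAction = world.selectThiefAction();
        world.moveThief(thiefAction);
        if (world.thiefEscaped()) return StepOutcome::Escaped;  // reward -1

        return StepOutcome::Ongoing;                            // reward 0
    }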
4 Methods

In this section we look closely at the algorithms we use in our experiments. Furthermore, we explain the state-action space reduction techniques that these methods employ.

4.1 State-space reduction

We mentioned in the introduction that the state space can be reduced by employing options. In this section we explain in detail how our approach works. The core idea is to only do learning in certain parts of the environment, while in others hand-coded options are used. The motivation behind this approach is that for certain states it is difficult to come up with a good hand-coded strategy and learning has to be used. On the other hand, there are states where the hand-coded strategy works sufficiently well and learning is practically unnecessary. In principle, the state space of any MDP can be divided into learning and non-learning subsets. Before we explain our technique in more detail, we first give a few definitions. Let us denote our total state space by S. Then, S can be divided into:
- states where we use learning: for these states we estimate Q-values, and we call the subset of states where learning occurs QST;
- states where hand-coded options are used: for these states we do not estimate Q-values, and we call the subset of states where options are used NQST.

We now give an example of such a division for our catch-the-thief domain. We relate the QST and NQST subsets to so-called learning areas (LAs). These LAs are subsets of the floorplan that the guards and the thief move around in. We define QST for this task as the subset of states where both guards and the thief are in the same LA:

s \in QST if all entities are inside one and the same LA; s \in NQST otherwise.

A small code sketch of this membership test is given at the end of this subsection. We now compute the reduction of the state space achieved, using the task parameters. We denote:
- n: the total number of entities (guards plus thieves);
- L_1, L_2, ..., L_k: the learning areas, with k the number of learning areas; in our case two, L_1 and L_2 (Figure 2);
- c_1, c_2, ..., c_k: the number of free cells in each learning area;
- |S_L|: the size of QST.

Figure 2: A representation of the learning areas L1 and L2, as part of the whole shopping-mall floorplan.

The number of cells free of obstacles in a learning area, c_i, raised to the power of the number of entities n, gives the number of states per learning area L_i. The total size of QST, |S_L|, is then equal to:

|S_L| = \sum_{i=1}^{k} c_i^n    (5)

In our environment: n = 3, |S| = 1.82 * 10^7, k = 2, c_1 = 42, c_2 = 42. Then, according to (5):

|S_L| = 42^3 + 42^3 = 1.48 * 10^5

Dividing |S_L| by |S| yields the reduction percentage:

|S_L| / |S| = 1.48 * 10^5 / 1.82 * 10^7 = 0.0081 = 0.81%

As we can see, we will be using less than one percent of the original state-space size, which is a significant reduction.
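The following C++ sketch shows the membership test described above: a state is in QST exactly when every entity lies inside one and the same learning area. Rectangular learning areas and the Cell/LearningArea types are illustrative assumptions, not the thesis data structures.

    #include <vector>

    // A state belongs to QST if and only if all entities (guards and thief)
    // are inside one and the same learning area; otherwise the hand-coded
    // option is used. Learning areas are simplified to axis-aligned rectangles.
    struct Cell { int x, y; };

    struct LearningArea {
        int minX, minY, maxX, maxY;  // inclusive bounds of the rectangular area
        bool contains(const Cell& c) const {
            return c.x >= minX && c.x <= maxX && c.y >= minY && c.y <= maxY;
        }
    };

    bool stateInQST(const std::vector<Cell>& entityPositions,
                    const std::vector<LearningArea>& learningAreas) {
        for (const LearningArea& la : learningAreas) {
            bool allInside = true;
            for (const Cell& c : entityPositions) {
                if (!la.contains(c)) { allInside = false; break; }
            }
            if (allInside) return true;  // learning (Q-values) is used in this state
        }
        return false;                    // hand-coded option is used instead
    }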
4.2 Action-space reduction

In addition to reducing the state space, one of the methods that we use (ORLA-MA, see Section 4.3.2.2) employs a strategy to reduce the action space. In particular, we use a multi-agent learning technique called Distributed Value Functions (DVF) [4], which models each agent separately; the reward is common and each agent shares part of its value function with the other agent:

Q_g(s, a_g) \leftarrow (1 - \alpha) Q_g(s, a_g) + \alpha [R + \gamma^i \bar{V}_g(s')]
where \bar{V}_g(s') = (1 - w) \max_{a'} Q_g(s', a') + w \max_{a'} Q_{g-}(s', a')

Here, the subscript g denotes the agent we are updating and g- denotes the other agent in the environment; w is a weighting factor that determines how strongly the value estimate of the other agent contributes to the update of agent g; R is the discounted reward accumulated since the last update and i is the number of steps it covers (as in the option-based update of Section 4.3.2.1). As a result, instead of searching a 4^2 = 16 joint-action space, we only look at a 4 + 4 = 8 action space, and the total action space is reduced by half. Please note that, since we model the actions of each agent separately, we are not trying to explicitly find the optimal joint action \bar{a}. We will discuss the trade-offs (if any) of this approach in the Experiments and Analysis sections.

4.3 Algorithms

There are three algorithms that we develop. The first one is a Combination of offline Value Iteration and shortest-distance-to-target Deterministic Strategy (CVIDS). This method uses planning (value iteration) for some parts of the environment and an option for controlling the agent in the other parts, effectively reducing the size of the state space. The second algorithm reduces the state space by employing temporally extended actions (options) and learning only in some areas of the environment. We refer to the latter method as the Options-based Reinforcement Learning Algorithm (ORLA). Furthermore, in order to achieve action-space reduction, we extend ORLA into a multi-agent system with DVF. For consistency, we refer to the single-agent version of this algorithm as ORLA-SA, and to the multi-agent version as ORLA-MA. The final algorithm is a simple fixed hand-coded policy and is used for benchmarking (SDDS).

4.3.1 Combination of offline Value Iteration and shortest-distance-to-target Deterministic Strategy (CVIDS)

For CVIDS, planning only occurs within a subset of the total state space, the learning areas. In the non-learning area, a fixed deterministic policy is used, in which each guard takes the shortest path to the thief. There is a special abstract state outside of the learning area, which has a constant Q-value and is used for updating the states that lie just on the border of a learning area. The problem that we have to solve is now re-defined as an MDP with a smaller state space. Note that the optimal solution for this derived MDP will in general not be optimal in the original MDP: we sacrifice some performance for computational gain and domain-size reduction.

4.3.2 Options-based Reinforcement Learning Algorithm (ORLA)

4.3.2.1 ORLA-SA

The second algorithm, ORLA-SA, models the problem as an SMDP [3]. As mentioned earlier, only some states (QST) have Q-values, and the generalized Q-learning update rule is used:

Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha [R + \gamma^i \max_{a'} Q(s', a')]    (6)

with R = \sum_{j=0}^{i-1} \gamma^j r_{j+1}, where i is the number of steps since the last update; a is the action initiated at s, which lasted i steps and terminated in state s', while generating the total discounted sum of rewards R; r_j is the reward received after step j; and \gamma is the discount factor.
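The following C++ sketch shows one way to implement update (6): rewards are accumulated into a discounted return R while the agent is outside QST, and the last QST state-action pair is updated once a QST state is reached again. It reuses the QTable interface from the earlier Q-learning sketch and is an illustrative simplification, not the thesis implementation.

    #include <cmath>

    // Accumulates the discounted return R = sum_j gamma^j * r_{j+1} while the
    // agent is outside QST (i counts the steps since the last QST state).
    struct OptionReturn {
        double R = 0.0;  // discounted sum of rewards since the last QST state
        int steps = 0;   // i: number of steps since the last update

        void addReward(double r, double gamma) {
            R += std::pow(gamma, steps) * r;
            ++steps;
        }
    };

    // Update (6): Q(sLast,aLast) <- (1-alpha)*Q(sLast,aLast)
    //                                + alpha*(R + gamma^i * max_a' Q(s',a')).
    // QTable is assumed to provide value(s,a) and maxValue(s).
    template <typename QTable>
    void smdpUpdate(QTable& q, int sLast, int aLast, int sNext,
                    const OptionReturn& ret, double alpha, double gamma) {
        double target = ret.R + std::pow(gamma, ret.steps) * q.maxValue(sNext);
        q.value(sLast, aLast) = (1.0 - alpha) * q.value(sLast, aLast) + alpha * target;
    }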
Consider, for example, the transition sequence shown in Figure 3. When the agent is in state S_C \in QST, it takes an action (the action with the highest Q-value) that leads it to state S_D \in NQST. The reward is collected and i is increased. While in S_D, the agent temporally extends the action taken in S_C with a deterministic action. After another transition, the agent is in state S_E \in QST. The extended action is terminated, and the value of (S_C, a) is updated using the value of S_E, according to (6).

Figure 3: The shaded nodes represent QST and the blank nodes NQST; the arrows show the transition after taking an action a in state s; the number i = 0, 1, 2, ... is the number of timesteps since the last update.

Pseudo-code for the algorithm:

    observe thief behavior
    loop (for n episodes)
        if s \in QST:
            select the action a with the highest Q-value
            sLast <- s; aLast <- a
            R = 0; i = 0
        else:
            choose the deterministic shortest-distance action a
        end if
        take action a
        observe the reward r and the new state s'
        R = R + \gamma^i * r
        i = i + 1
        if s' \in QST:
            Q(sLast, aLast) <- (1 - \alpha) Q(sLast, aLast) + \alpha [R + \gamma^i max_{a'} Q(s', a')]
        end if
        s <- s'
    end loop

4.3.2.2 ORLA-MA

The algorithm described in 4.3.2.1 is easily extended to the DVF approach. Instead of having a global Q-value table, we use local Q-value tables, one per agent. Each agent selects the action with the highest Q-value from its own table. Then both agents perform their individually selected actions, observe the next state and receive a global reward, based on their joint action \bar{a} and the state s. We can then re-write ORLA-SA as:

    observe thief behavior
    loop (for n episodes)
        if s \in QST:
            for each agent g:
                select the action a_g with the highest Q-value in Q_g
                aLast_g <- a_g
            end for
            sLast <- s; R = 0; i = 0
        else:
            for each agent g:
                choose the deterministic shortest-distance action a_g
            end for
        end if
        for each agent g: take action a_g end for
        observe the joint action \bar{a}, the reward r and the new state s'
        R = R + \gamma^i * r
        i = i + 1
        if s' \in QST:
            for each agent g:
                Q_g(sLast, aLast_g) <- (1 - \alpha) Q_g(sLast, aLast_g)
                    + \alpha [R + \gamma^i ((1 - w) max_{a'} Q_g(s', a') + w max_{a'} Q_{g-}(s', a'))]
            end for
        end if
        s <- s'
    end loop

where g- denotes the other agent, aLast_g is the last action taken by agent g inside a learning area, and w is a weighting factor signifying the contribution of the state-value of agent g- to that of agent g.

4.3.3 Shortest Distance Deterministic Strategy (SDDS)

SDDS is an agent strategy in which the guards always take the shortest path to the thief, i.e. no learning occurs. The distance is defined as the number of steps it would take the agent to move to the current position of the thief, taking obstacles into account. Ties are settled using a priority order over the actions of the guards. A sketch of this shortest-distance action selection is given below.
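As an illustration (not the thesis code), the following C++ sketch computes such shortest-distance moves: a breadth-first search from the thief's cell over obstacle-free cells yields a distance map, and each guard steps to the neighbouring free cell with the smallest distance, with a fixed action-priority order settling ties. The Grid type and the coordinate convention are hypothetical simplifications.

    #include <climits>
    #include <queue>
    #include <utility>
    #include <vector>

    struct Grid {
        int width, height;
        std::vector<char> blocked;  // blocked[y * width + x] != 0 for walls/obstacles
        bool isFree(int x, int y) const {
            return x >= 0 && x < width && y >= 0 && y < height && !blocked[y * width + x];
        }
    };

    static const int DX[4] = {0, 0, -1, 1};   // up, down, left, right (assumed grid convention)
    static const int DY[4] = {-1, 1, 0, 0};

    // Breadth-first search: distance of every free cell to the thief's cell.
    std::vector<int> bfsDistancesFrom(const Grid& g, int thiefX, int thiefY) {
        std::vector<int> dist(g.width * g.height, INT_MAX);
        std::queue<std::pair<int, int>> frontier;
        dist[thiefY * g.width + thiefX] = 0;
        frontier.push(std::make_pair(thiefX, thiefY));
        while (!frontier.empty()) {
            std::pair<int, int> cell = frontier.front();
            frontier.pop();
            int x = cell.first, y = cell.second;
            for (int d = 0; d < 4; ++d) {
                int nx = x + DX[d], ny = y + DY[d];
                if (g.isFree(nx, ny) && dist[ny * g.width + nx] == INT_MAX) {
                    dist[ny * g.width + nx] = dist[y * g.width + x] + 1;
                    frontier.push(std::make_pair(nx, ny));
                }
            }
        }
        return dist;
    }

    // Returns the index (0=up, 1=down, 2=left, 3=right) of the guard's next move;
    // the fixed priority order of the loop settles ties deterministically.
    int shortestDistanceAction(const Grid& g, const std::vector<int>& dist, int gx, int gy) {
        int bestAction = 0, bestDist = INT_MAX;
        for (int d = 0; d < 4; ++d) {
            int nx = gx + DX[d], ny = gy + DY[d];
            if (g.isFree(nx, ny) && dist[ny * g.width + nx] < bestDist) {
                bestDist = dist[ny * g.width + nx];
                bestAction = d;
            }
        }
        return bestAction;
    }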
5 Experimental results

In this section, the algorithms discussed in the previous section are compared and evaluated. The evaluation criterion is the mean return, computed after learning has finished. It is calculated by summing the individual returns per evaluation episode, ret_i, and dividing that sum by the total number of evaluation episodes, maxEp:³

MeanRet = \frac{1}{maxEp} \sum_{i=1}^{maxEp} ret_i

³ The implementation itself was done using unmanaged C++ for the algorithms, managed C++ (Microsoft .NET, MS Visual Studio 2005) for building the simulator (which is the main tool for running the experiments) and OpenGL for the visualization. All experiments were run on a Windows XP 32-bit machine with an Intel dual-core 1.73 GHz CPU, 1 GB of memory and a 128 MB video card.

5.1 Experimental setup

For the first part of the experiments, for all three algorithms, the guards are initialized at the fixed positions (9, 10) and (10, 10) and the thief is initialized at one of twelve pre-defined positions. The discrete set of initialization positions is created on purpose, since we want to do a controlled experiment and confirm the hypothesis that fewer initialization positions cost fewer computational resources. Different initialization positions for the thief result in different state-action pairs being visited during learning. With 12 initialization positions, some state-action pairs will never be visited and hence no additional resources are needed for their updates. There are also two initialization positions (among those 12) for the thief where it is impossible for the guards to catch it before it escapes, no matter how good their strategy is. These positions were included on purpose, in order to:
- observe whether those two initialization positions affect performance in a certain way;
- see what the resulting policy will be (using the simulator) for the learning agent(s).

We also test how the number of initialization positions affects the total performance of the algorithms by using random initialization for the thief, in addition to the 12 hand-picked initialization positions.

The total size of the learning areas we define depends on:
- how effective we want our algorithms to be: the bigger the learning areas, the more effective the algorithm, but also the more resources are needed;
- how much computational time (and memory) we want to save.

The more resources we try to save, the less effective our methods will be. One of the goals of these experiments is to discover and discuss what sizes and positions of learning area(s) work well and give satisfactory performance. Another goal is to determine whether there are any trade-offs between the algorithms or any specific and interesting learning cases.

Figure 4: A snapshot of our domain simulator. The learning areas are represented by orange cells.

First, we run and evaluate the SDDS algorithm; it does not need time for learning and is going to be our baseline. The second algorithm to test is CVIDS: the environment functions T and R are pre-computed, after which offline value iteration is run on the derived MDP (the learning areas, i.e. the orange squares) and the Q-values are updated. The third experiment is done using ORLA and has two variations: the single-agent and the multi-agent perspective. The environment is an SMDP in this case, but learning occurs only in the orange areas. Since only ORLA-MA achieves a reduction of the action space, we use the memory resources saved to include more states in QST. We add more states to QST by increasing the size of the learning area: we extend the LA horizontally towards the exits (Figure 4). That way, we expect ORLA-MA to perform better than ORLA-SA. As an additional test, we compare ORLA-SA and ORLA-MA for the same-size learning area, without including any additional states in QST for ORLA-MA.

During evaluation, the action selection for CVIDS and ORLA is the same: select the action with the highest Q-value if s \in QST; otherwise use the option.

Before we run our experiments, we define our global parameters. We use a discount factor \gamma = 0.99 and a decaying learning rate \alpha (0 < \alpha < 1). The learning rate is initially set high, because in the beginning of learning new experience is more important; \alpha keeps decreasing with a constant decay rate \lambda (0 < \lambda < 1). Setting \lambda high will make the algorithm converge faster to a policy \pi, but it may well not be an optimal one (\pi^*). Choosing \lambda to be very low, on the other hand, facilitates convergence to a good policy, but at the cost of more learning episodes and higher computation time.
In order to decay \alpha, we use the formula (Figure 5):

\alpha[s][a] = \frac{\alpha_0}{\lambda (N[s][a] - 1) + 1}

where \alpha[s][a] is the learning rate for the currently updated state-action pair, \alpha_0 is the initial learning rate (set to 1), \lambda is the decay rate and N[s][a] is the number of times a state-action pair has been visited.

Figure 5: Decaying alpha, with different decay rates: 0.9, 0.5 and 0.2, respectively.

We use an \epsilon-greedy exploration policy with a fixed \epsilon (0 \le \epsilon \le 1), where an agent selects a random action in each turn with probability \epsilon.

5.2 Results

During preliminary experiments we determined the optimal parameters for each of the algorithms. We then compared the methods using these optimal parameters. For ORLA-MA, we initially fix the exploration factor \epsilon and try to estimate the optimal parameter w for weighting the shared agents' value functions. Varying w between 0.1 and 0.9 yielded the maximum mean return for w = 0.8. Trying out different exploration factors \epsilon and learning rates \alpha with w = 0.8 does not seem to have a direct impact on the performance. This suggests that w is independent of \epsilon and \alpha, and only depends on the complexity of our domain.

The decay rate \lambda is initially set to 0.09. Experimenting with different values, we determined that \lambda = 0.02 is optimal. Setting it higher makes ORLA-SA and ORLA-MA reach a lower value at the end of the learning period. Setting \lambda lower makes the algorithms achieve a slightly better performance at the cost of twice the number of learning episodes. Finally, we try out different (constant) exploration factors in the range 0.1 to 0.3. Values higher than 0.3 make learning hard. As a result, a value of \epsilon = 0.12 was obtained as seemingly optimal for ORLA-MA and \epsilon = 0.1 for ORLA-SA.

To summarize, the (optimal) parameters are:

Parameter   ORLA-SA   ORLA-MA
\alpha_0    1         1
\lambda     0.02      0.02
\epsilon    0.1       0.12
w           -         0.8

In our first experiment (Table 1), we compare ORLA-SA to CVIDS and SDDS. The notation 12 | Rand shows the type of initialization for the thief: 12 hand-picked initialization positions or random initialization.

Evaluation runs: 1000   SDDS (12 | Rand)   CVIDS (12 | Rand)   ORLA-SA (12 | Rand)   ORLA-MA (12 | Rand)
Mean return             0.24  | 0.075      0.47  | 0.12        0.69  | 0.2           0.65  | 0.17
Standard Error          0.003 | 0.002      0.002 | 0.002       0.001 | 0.001         0.001 | 0.002

Table 1: A comparison between SDDS, CVIDS, ORLA-SA and ORLA-MA.

For our second major experiment, we compare ORLA-SA and ORLA-MA. Table 2 shows the comparison: in the middle column the agents learn in exactly the same learning area (but with a smaller action space for ORLA-MA), while the third column shows the performance when using a bigger learning area for ORLA-MA, keeping the total state-action space approximately the same size as for ORLA-SA.

Evaluation runs: 1000         ORLA-SA (12 | Rand)   ORLA-MA, same LA (12 | Rand)   ORLA-MA, equal state-action space (12 | Rand)
Number of learning episodes   60000 | 150000        30000 | 70000                  80000 | 150000
Mean return                   0.686 | 0.197         0.645 | 0.165                  0.750 | 0.113
Standard Error                0.0019 | 0.0023       0.0013 | 0.0021                0.001 | 0.002

Table 2: The thief is initialized in 12 pre-defined positions in the environment, and then randomly.

Evaluation runs: 1000         ORLA-SA   ORLA-MA, same LA   ORLA-MA, equal state-action space
Number of learning episodes   10000     10000              10000
Mean return                   0.615     0.643              0.678
Standard Error                0.003     0.003              0.003

Table 3: A comparison between ORLA-SA and ORLA-MA, using the same number of episodes and the same learning rate. The thief is initialized in 1 of 12 possible positions.
As we can see in Table 3, the multi-agent version of ORLA converges very quickly to a near-optimal policy. Only after 60,000 episodes is the single-agent ORLA able to achieve a slightly better performance (Table 2). Using the same-size state-action space for ORLA-MA (but a bigger learning area) gives an even better performance benefit (Table 3).

6 Analysis and discussion

As expected, the shortest-distance strategy (SDDS) is not always the optimal policy, since the thief can take detours and thereby escape from the chasing guards (Figure 6).

Figure 6: The thief is trying to escape by taking a detour to the closest exit.

6.1 ORLA-SA vs. CVIDS vs. SDDS

The results of our first experiment (Table 1) show a significant advantage for ORLA-SA over the other two algorithms (CVIDS and SDDS). The low performance of SDDS is easily explained: there is no learning or reasoning involved and the naive approach does not perform that well. CVIDS achieves higher performance than SDDS, because SDDS is a simple hand-coded strategy and CVIDS is a more sophisticated algorithm. The MDP over which planning is performed, however, is different from the original MDP, and thus CVIDS only achieves an optimum in the derived MDP. Finally, ORLA-SA outperforms both SDDS and CVIDS because it uses learning in an SMDP rather than planning in a derived MDP. In practice, the choice of algorithm depends on whether a model of the environment is available or not.

6.2 ORLA-MA vs. ORLA-SA

After conducting the experiments and observing the mean return and standard error, we can conclude that ORLA-MA performs better after 10,000 learning episodes (Table 3). In the limit, ORLA-MA achieves a slightly lower performance than ORLA-SA, but the convergence speed of ORLA-MA is higher. It is surprising that the multi-agent DVF approach performs that well, since convergence to the optimal policy is not guaranteed [4], [19]. We found that if we use 30,000 episodes for learning, the performance of ORLA-MA is sufficiently good. Up to 50,000 learning episodes, ORLA-MA is still able to outperform ORLA-SA. Increasing the number of learning episodes further, however, does not give additional improvement of ORLA-MA's performance.

When using a bigger LA but approximately the same-size state-action space (and 12 initialization positions), it is not surprising that ORLA-MA outperforms ORLA-SA, given that they had similar performance using the same LA: the ORLA-MA algorithm simply has more area available for learning. The only noticeable trade-off is that, when using a bigger LA, ORLA-MA needs more samples/training episodes to converge. This result is expected: the subset QST is larger and therefore the algorithm requires more samples. On the other hand, when initializing the thief randomly in the bigger LA, performance decreases for ORLA-MA as well as for ORLA-SA. Performance of ORLA-MA is significantly lower in the same-size LA as well (Table 2). Apparently, there are some states that are very difficult for ORLA-MA to learn. Figure 7 identifies and compares the learning performance when encountering such a state.

Figure 7: A comparison during learning between ORLA-SA (red) and ORLA-MA (blue) for a single, particularly difficult initialization position.

We can see a similar relation for an easy initialization position (Figure 8).

Figure 8: A comparison between ORLA-SA (red) and ORLA-MA (blue) for a single, relatively easy initialization position.

Figure 9: Performance (during learning) of ORLA-MA, for the same-size LA as ORLA-SA and random initialization for the thief.
In the difficult case, the mean return of ORLA-MA is quite close to that of ORLA-SA. Performance is quite unstable and sometimes the mean return even drops below 0, which means that the thief escapes more often than it is captured. In contrast, in the easy case (Figure 8) ORLA-MA's mean return is still lower, but it never drops below 0. The interpretation of that behavior is that ORLA-MA manages to capture the thief, but not in the minimal number of steps. The main reason for the unstable multi-agent behavior is that each agent tries to optimize its Q-table individually, not only by receiving a reward at the end of each episode, but also by minimizing the number of steps taken in the course of an episode. Furthermore, the update of the Q-value of agent g could use a sub-optimal sample from the contribution of the other agent. Apparently, there are not any difficult cases among the 12 initialization positions of the thief, since ORLA-MA's performance there is nearly optimal compared to ORLA-SA (Table 2). As a result, it seems that even one or two difficult initialization positions may worsen the overall performance (Table 2, Figure 7 and Figure 9).

An interesting parameter is the exploration factor \epsilon. By changing it even by a small fraction, ORLA-MA's performance varies dramatically. We can only speculate on the reasons for that influence, since we did not find any theoretical or empirical explanation for this phenomenon. Furthermore, each agent explores independently, so the optimal joint action \bar{a}^* is never used for updating Q-values. Varying \epsilon in ORLA-SA does not seem to have as big an impact on its performance; a value of \epsilon = 0.1 seems optimal for the experiments. As we know, the Q-value updates in this case rely solely on the optimal joint action \bar{a}^*.

7 Related work

The catch-the-thief domain is a variation of the well-known predator-prey domain, first formulated by M. Benda et al. [36] in 1986. In the original domain, 4 predators catch the prey if they surround it from all sides, and the transition model of the prey is stochastic. In the catch-the-thief domain, by contrast, 2 guards try to catch a thief by occupying one and the same cell, and the thief moves deterministically. We improve upon previous work by employing methods that reduce the state-action space, thus addressing the problem of scalability. The predator-prey domain has been studied by Peter Stone and Manuela Veloso [37] and by Xiao Song Lu and Daniel Kudenko [40]. Using multi-agent systems to solve the original predator-prey domain has yielded good practical results: a recent multi-agent approach to the pursuit-evasion game (another variation of the predator-prey domain) has been formulated by Bruno Bouzy and Marc Metivier [10]; competition between agents has been researched by Jacob Schrum [14]; and a dynamic analysis using \epsilon-greedy exploration and multiple agents has recently been conducted by Eduardo Rodrigues Gomes et al. [22].

During our research it was important to determine what type of decentralized multi-agent system to use and what type of coordination mechanisms. We found insights by reviewing the problems and domains where it is useful to apply MAS and distributed control, which were well studied by Stone and Veloso [37]. Furthermore, cooperation and coordination mechanisms between multiple agents have been researched by numerous scientists [6] [8] [12] [20] [24] [26] [31] [38] [39]. The relation between SMDPs and options is well studied in the paper of Sutton et al. [3].
However, multi-agent approaches using options have not been studied that well, and this fact gave additional motivation for conducting our research. In contrast with sparse cooperative Q-learning and collaborative agents, which still compute the joint action (in a more efficient manner), we did not employ any of those mechanisms, since the number of agents in our case is small enough that action decomposition does not make a lot of sense. Instead, we reduced the total Q-value table. The choice was made to employ multi-agent Distributed Value Functions [4], since this can effectively reduce the total action space. In addition, this approach is one of the few that reduce the total action space, the others being Independent Learners [19] and a probabilistic approach, Frequency Maximum Q-Value (FMQ) [45]. Independent Learners does not create an incentive to cooperate, since each agent models the environment separately, completely ignoring the presence of the other learning agents, which affect the transition model as well [19]; as a result, estimating that model becomes extremely hard. The other approach, FMQ [45], makes use of probabilities in order to estimate the model. Despite the fact that it reduces the action space, it needs additional resources for computing and storing probabilities [45][17].

8 Conclusions and Future work

8.1 Conclusions

We have presented, compared and analyzed different RL algorithms, evaluated their performance in a deterministic and discrete environment, and were able to efficiently reduce the state-action space that poses the typical scalability problem in this learning scenario. Using options worked well. Our results demonstrate that using multiple agents for learning with distributed value functions yields results very close in performance to the centralized RL approach, keeping in mind that convergence of DVF is not guaranteed. It is interesting to note that there are certain trade-offs when applying DVF: learning performance may become unstable if the environment is too complex. We also showed that, given the specific thief behavior described in Section 3.2, initialization plays a significant role in learning and convergence, since some state-action pairs are never visited.

8.2 Future work

There are several interesting extensions of the work described in this thesis. One direction for future studies could be using a continuous environment with a combination of options and some kind of global function approximator. A typical example of such an environment is the helicopter hovering domain [46]. Another possibility is using local state representations s_i [4]: instead of modeling the whole environment (i.e. the positions of all entities in the catch-the-thief domain), each agent only uses its own position and that of the thief. That way, additional memory space is saved and scalability remains linear with respect to the number of agents. In addition, more interesting experiments can be run (for example, increasing the number of thieves, guards or the size of the LA). It is also worth mentioning that the framework and simulator developed can easily be extended to work with RL-Glue [43].

Acknowledgments

This master thesis was a challenging research topic. I will take the opportunity to express my sincere gratitude to the people that made it possible. First of all, I thank my parents for their continuous support during my studies. I thank both my supervisors, Dr. Shimon A. Whiteson and MSc. Harm van Seijen, for their guidance and academic advice.
I would also like to express special thanks to Prof. Jeff Schneider from CMU for the useful hints he gave me during the course of my research. Last but not least, I thank my friends for their support and motivation.

Appendices

1. Features to states and vice-versa

Instead of using a series of features for representing a certain state in the environment, it is much easier to encode those positions as a unique integer number. In the case of 2 features, X and Y, the formula is:

s = X * (max number of Y positions) + Y

The reverse encoding is as follows:

X = s / (max number of Y positions), where / is integer division and the result is the whole part of the quotient (e.g. 39 / 10 = 3)
Y = s % (max number of Y positions), where % is modular division and the result is the remainder (e.g. 39 % 10 = 9)

In the general case, let {X_1, X_2, ..., X_N} be the feature set, N the number of features, noX_i the number of feature values of feature X_i, and x_i the feature value of feature X_i. Then

s = A_1 x_1 + A_2 x_2 + \ldots + A_N x_N

where A_N = 1 and A_k = \prod_{i=k+1}^{N} noX_i for k = 1, ..., N-1. The decoding from state s to the feature values x_1, ..., x_N is, for k = 1, ..., N:

x_k = s / A_k
s = s % A_k

Essentially, this encoding/decoding can be viewed as a perfect hashing algorithm, in which the hash key is the string of (unique) positions and the hash value is the integer to which these positions are mapped.

2. Actions to features and vice-versa

The same algorithm is used as in Appendix 1, with the only difference that instead of the position tuple (x, y) we now have the tuple (a1, a2) for the discrete actions of each of the guards.

3. Line segment to line segment intersection

In order to determine whether the line segment drawn from the center of the thief to the center of a guard intersects with any of the walls between them, we use simple 2D linear algebra. The equations of the lines are:

P_a = P_1 + u_a (P_2 - P_1)
P_b = P_3 + u_b (P_4 - P_3)

Solving for the point where P_a = P_b gives the following two equations in the two unknowns u_a and u_b:

x_1 + u_a (x_2 - x_1) = x_3 + u_b (x_4 - x_3)
y_1 + u_a (y_2 - y_1) = y_3 + u_b (y_4 - y_3)

Solving gives the following expressions for u_a and u_b:

u_a = \frac{(x_4 - x_3)(y_1 - y_3) - (y_4 - y_3)(x_1 - x_3)}{(y_4 - y_3)(x_2 - x_1) - (x_4 - x_3)(y_2 - y_1)}

u_b = \frac{(x_2 - x_1)(y_1 - y_3) - (y_2 - y_1)(x_1 - x_3)}{(y_4 - y_3)(x_2 - x_1) - (x_4 - x_3)(y_2 - y_1)}

Substituting either of these into the corresponding line equation gives the intersection point. For example, the intersection point (x, y) is:

x = x_1 + u_a (x_2 - x_1)
y = y_1 + u_a (y_2 - y_1)

These equations apply to infinite lines, but since we want to check for an intersection between line segments, it is only necessary to test whether u_a and u_b lie between 0 and 1. We need them both to lie within the range 0 to 1 so that the intersection point is within both line segments (i.e. they actually intersect). It is also important to note that the inequalities we use are strict; if they were not, the line segments might happen to intersect exactly at their beginning or end points, resulting in the thief seeing a guard through a wall (Figure 10). We only need two dimensions, since the guards move with two degrees of freedom at a fixed z position (the ground). In addition, we only need the test to fail (to observe no intersection) once to determine that the thief can see one of the guards.

Figure 10: A thief can see the bottom guard if the inequality is not strict (0 < u_a, u_b < 1): according to the formula, there is no intersection, hence no obstacle is present between the guard and the thief.
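A compact C++ version of this test might look as follows; this is an illustrative translation of the formulas above rather than the thesis code, and the parallel (zero-denominator) case is simply treated as no proper intersection.

    // Segment-segment intersection test based on the formulas above.
    // Returns true when the segments P1-P2 and P3-P4 properly intersect;
    // the strict inequalities 0 < ua, ub < 1 deliberately exclude
    // intersections exactly at segment endpoints, as discussed in the text.
    struct Point { double x, y; };

    bool segmentsIntersect(Point p1, Point p2, Point p3, Point p4) {
        double denom = (p4.y - p3.y) * (p2.x - p1.x) - (p4.x - p3.x) * (p2.y - p1.y);
        if (denom == 0.0) return false;  // parallel or collinear: treated as no proper intersection

        double ua = ((p4.x - p3.x) * (p1.y - p3.y) - (p4.y - p3.y) * (p1.x - p3.x)) / denom;
        double ub = ((p2.x - p1.x) * (p1.y - p3.y) - (p2.y - p1.y) * (p1.x - p3.x)) / denom;

        return ua > 0.0 && ua < 1.0 && ub > 0.0 && ub < 1.0;
    }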
4. Standard Error calculation

The standard error of the mean is a statistical measure of how much, on average, the sample mean is expected to deviate from the population mean. It is calculated by dividing the standard deviation by the square root of the sample size N:

StdErr = \frac{\sigma}{\sqrt{N}}

with the standard deviation estimated by:

\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2}

where x_i is the i-th sample and \bar{x} is the sample mean.

Bibliography

[1] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, 1998.
[2] Advances in Neural Information Processing Systems 16, MIT Press, 2004.
[3] Richard S. Sutton, Doina Precup and Satinder Singh, Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning, 1998.
[4] J. Schneider, W. Wong, A. Moore and M. Riedmiller, Distributed Value Functions, 1999.
[5] Gavin Adrian Rummery, Problem Solving With Reinforcement Learning, 1995.
[6] Liviu Panait and Sean Luke, Cooperative Multi-Agent Learning: The State of the Art, George Mason University, 2005.
[7] Niels Joubert and James Andrews, Viewing and Camera Control in OpenGL, 2008.
[8] Jelle R. Kok and Nikos Vlassis, Sparse Cooperative Q-learning, 2004.
[9] Shimon Whiteson, Matthew E. Taylor and Peter Stone, Empirical Studies in Action Selection with Reinforcement Learning, 2007.
[10] Bruno Bouzy and Marc Metivier, Multi-Agent Model-Based Reinforcement Learning Experiments in the Pursuit Evasion Game, 2008.
[11] Hado Philip van Hasselt, Insights in Reinforcement Learning, 2011.
[12] Dan Xiao and Ah-Hwee Tan, Self-Organizing Neural Architectures and Cooperative Learning in a Multiagent Environment, IEEE, 2007.
[13] Adrian Agogino, Design and Control of Large Collections of Learning Agents, 2001.
[14] Jacob Schrum, Competition Between Reinforcement Learning Methods in a Predator-Prey GridWorld, 2008.
[15] Martin Midtgaard, Lars Vinther, Jeppe R. Christiansen, Allan M. Christensen and Yifeng Zeng, Time-based Reward Shaping in Real-Time Strategy Games, Aalborg University, Denmark, 2010.
[16] Matthew E. Gaston, Organizational Learning and Network Adaptation in Multi-Agent Systems, 2005.
[17] Lucian Busoniu, Robert Babuska and Bart De Schutter, A Comprehensive Survey of Multi-Agent Reinforcement Learning, 2008.
[18] Ludo Waltman and Uzay Kaymak, A Theoretical Analysis of Cooperative Behavior in Multi-agent Q-learning, 2006.
[19] Jelle R. Kok and Nikos Vlassis, Collaborative Multiagent Reinforcement Learning by Payoff Propagation, 2006.
[20] Jim Dowling, Raymond Cunningham, Eoin Curran and Vinny Cahill, Collaborative Reinforcement Learning of Autonomic Behavior, 2004.
[21] Vincent D. Blondel, Julien M. Hendrickx, Alex Olshevsky and John N. Tsitsiklis, Convergence in Multiagent Coordination, Consensus, and Flocking, 2005.
[22] Eduardo Rodrigues Gomes and Ryszard Kowalczyk, Dynamic Analysis of Multiagent Q-learning with epsilon-greedy Exploration, in ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, 2009.
[23] Gerald Tesauro, Extending Q-Learning to General Adaptive Multi-Agent Systems, 2004.
[24] Ming Tan, Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents, 1993.
[25] R. E. Bellman, Dynamic Programming, Princeton University Press, Princeton, NJ, 1957.
[26] Ido Levy, Distributed On-Line Learning, Cooperative Multi-Agent System, 2007.
[27] C. J. C. H. Watkins, Learning from Delayed Rewards, Ph.D. thesis, Cambridge University Psychology Department, 1989.
[28] Adrian K. Agogino and Kagan Tumer, Quicker Q-Learning in Multi-Agent Systems, NASA Ames Research Center, 2005.
[29] Adrian K. Agogino and Kagan Tumer, Reinforcement Learning in Large Multiagent Systems, NASA Ames Research Center, 2006.
[30] Jiang Da-Wei, Wang Shi-Yuan and Dong Yi-Sheng, Role-based Context-specific Multiagent Q-learning, 2007.
[31] Jelle R. Kok and Nikos Vlassis, Using the Max-Plus Algorithm for Multiagent Decision Making in Coordination Graphs, 2005.
[32] S. J. Bradtke and M. O. Duff, Reinforcement learning methods for continuous-time Markov decision problems, in Advances in Neural Information Processing Systems 7, pp. 393-400, MIT Press, Cambridge, MA, 1995.
[33] S. Mahadevan, N. Marchalleck, T. Das and A. Gosavi, Self-improving factory simulation using continuous-time average-reward reinforcement learning, in Proceedings of the 14th International Conference on Machine Learning, pp. 202-210, 1997.
[34] R. Parr, Hierarchical control and learning for Markov decision processes, PhD thesis, University of California at Berkeley, 1998.
[36] M. Benda, V. Jagannathan and R. Dodhiawala, On optimal cooperation of knowledge sources - an experimental investigation, Technical Report BCS-G2010-280, Boeing Advanced Technology Center, Boeing Computing Services, Seattle, Washington, 1986.
[37] Peter Stone and Manuela Veloso, Multiagent Systems: A Survey from a Machine Learning Perspective, 2000.
[38] Pedro V. Teixeira, Dimos V. Dimarogonas, Karl H. Johansson and Joao Sousa, Multi-Agent Coordination with Event-based Communication, 2010.
[39] Adrian K. Agogino and Kagan Tumer, QUICR-learning for Multi-Agent Coordination, NASA, 2006.
[40] Xiao Song Lu and Daniel Kudenko, Reinforcement Learning in the Sensor-Evader Domain, Department of Computer Science, University of York, UK, 1999.
[41] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 2003.
[42] G. Rummery and M. Niranjan, On-line Q-learning using connectionist systems, 1994.
[43] RL-Glue, http://glue.rl-community.org/wiki/Main_Page
[44] John Langford, Efficient Exploration in Reinforcement Learning, 2006.
[45] S. Kapetanakis and D. Kudenko, Reinforcement learning of coordination in cooperative multi-agent systems, in Proceedings of the 18th National Conference on Artificial Intelligence and 14th Conference on Innovative Applications of Artificial Intelligence (AAAI/IAAI-02), Menlo Park, US, pp. 326-331, 2002.
[46] RL Competition, http://www.rl-competition.org/
[47] R. S. Sutton, TD models: Modeling the world at a mixture of time scales, in Proceedings of the 12th International Conference on Machine Learning, pp. 531-539, Morgan Kaufmann, San Mateo, CA, 1995.