
Combine and Compare Evolutionary Robotics and Reinforcement Learning as Methods of Designing Autonomous Robots

Sergiu Goschin, Eduard Franti, Monica Dascalu and Sanda Osiceanu

Sergiu Goschin is with University Politehnica Bucharest, Spl. Independentei 33, Sec. 6, Bucharest, Romania (telephone: 0040-723-805-742; email: [email protected]).
Eduard Franti is with IMT Bucharest, Erou Nicolae 32B, Sec. 2, Bucharest, Romania (email: [email protected]).
Monica Dascalu is with University Politehnica Bucharest, Spl. Independentei 33, Sec. 6, Bucharest, Romania (email: [email protected]).
Sanda Osiceanu is with University Politehnica Bucharest, Spl. Independentei 33, Sec. 6, Bucharest, Romania (email: [email protected]).

Abstract—The purpose of this paper is to present a comparison between two methods of building adaptive controllers for robots. In spite of the wide range of techniques used for defining autonomous robot architectures, few attempts have been made to compare their performance under similar circumstances. Such a comparison is particularly important for establishing benchmarks and for determining the most promising approaches. The robotic tasks in our research mainly concern the convergence of behaviors such as obstacle avoidance, target reaching and shortest-path finding, using various methods of synthesizing the parameters of the control architectures. The first approach combines Neural Networks and Genetic Algorithms in a simple yet robust controller using an Evolutionary Robotics technique. The second introduces a way of using Reinforcement Learning with a Neural Network based architecture. The experiments take place in a simulated 3D environment, designed to allow the development, testing and comparison of various controllers in terms of advantages and disadvantages, in order to establish a benchmark for autonomous robots.

I. INTRODUCTION

In the last few years, a wide variety of approaches to the design of autonomous robots based on the simulation of natural evolution have been introduced. The main idea is to encode the control system as a chromosome. An initial random population of such individuals is then evaluated by allowing each agent to move through the environment. The performance is measured in order to select the best individuals for reproduction, and genetic operators are then applied to ensure diversity. This process is repeated until the desired behavior is synthesized.

Nevertheless, comparisons between different control techniques are rather scarce. Such assessments are necessary in order to determine the value of novel approaches and the promising directions of research.

In order to achieve a coherent comparison, the control algorithms need to be evaluated under similar conditions. This paper focuses on exactly this assessment, using a classic Evolutionary Robotics method (based on our previous work [8]) and a combination of an evolutionary approach with Reinforcement Learning. As far as we know, the latter is a novel way of combining Neural Networks, Genetic Algorithms and Reinforcement Learning for robot control.

We compared the two techniques against several criteria: finding a solution, the required time to find a solution, the required time to find the best solution, the quality of the solutions and the robustness of a control system when changing the environment.

The desired behaviors are obstacle avoidance, target reaching and optimum path finding. The experiments took place inside a dedicated 3D framework built for the simulation of autonomous robots.

We chose computer simulation because of the speed and flexibility it offers compared to testing in the real world. Although control systems developed in simulation will not necessarily yield equally good results in reality, simulation allows much faster testing of new ideas and can at least provide a starting point for developing real controllers. Jakobi [10] showed that there is a set of minimal conditions a simulation has to fulfill in order to produce controllers that perform well in the real world.

II. CONTROL ARCHITECTURES

In this section we will describe the simulation environment and the two control architectures implemented.

We chose feed-forward neural networks as the base for the control systems. They have remarkable properties: robustness to noise (which is a usual characteristic of the interaction with the environment), flexibility of the output (due to the number of neurons in the network and their weights) and good mapping of inputs to outputs.

A. Simulation Environment

The 3D framework used for simulation was built using OpenGL and Visual C++. It offers the possibility of creating, editing and saving 3D objects, materials, cameras and simulated robots. The user can set all the parameters linked to the robot architecture, the evolution and the characteristics of the environment. It also allows real-time visualization of the evolution, offline replay of selected moments and statistics computation.



Multi-threading is used to keep the computation and drawing contexts independent, allowing the faster execution that evolutionary training requires.

B. Evolutionary Robotics Approach

The first method we used for synthesizing a robot controller was Evolutionary Robotics. The starting point was the work of Nolfi, Floreano and Mondada (Floreano and Mondada [6], Nolfi et al. [12]).

Feed-forward Neural Networks are composed of a set of interconnected processing elements. Each neuron computes its activation as a function of the weighted sum of its inputs; the activation function used in this paper was the sigmoid. No hidden layers were used, because they would have needlessly widened the search space and increased the convergence time.

The inputs of the neural network are the sensor values and the output determines the future direction of the robot.
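As an illustration only (not the framework's actual code), a minimal C++ sketch of such a single-layer controller could look as follows; the function names, the bias term and the mapping of the sigmoid output to a (-PI/2, PI/2) heading change are assumptions consistent with the description above and with the first experiment in Section III.

#include <array>
#include <cmath>

// Minimal sketch of the motor-control network described above:
// 8 sensor inputs, no hidden layer, one sigmoid output neuron whose
// activation is mapped to a change in heading. Names are illustrative.
constexpr int kNumSensors = 8;

double Sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Returns the steering command (radians) for one control step.
double SteeringCommand(const std::array<double, kNumSensors>& sensors,
                       const std::array<double, kNumSensors>& weights,
                       double bias = 0.0) {
    double sum = bias;
    for (int i = 0; i < kNumSensors; ++i)
        sum += weights[i] * sensors[i];          // weighted sum of sensor inputs
    double activation = Sigmoid(sum);            // normalized to (0, 1)
    const double kPi = 3.14159265358979323846;
    return (activation - 0.5) * kPi;             // map (0,1) to (-PI/2, PI/2)
}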

The behaviors are learned through weight modifications. Usually, training a neural network means minimizing an error function (computed as the difference between the real and desired outputs of the network). Most training algorithms (including the most widely used one, Backpropagation) are gradient based. One problem in our case was that there was no way of knowing what the outputs of the network should have been, so no error could be computed and Backpropagation could not be applied. Another problem with this kind of algorithm is that it can get trapped in a local optimum. One way of avoiding this is to use evolutionary algorithms, thus transforming the classic training process into an evolution of the connection weights (an evolution that depends on the agent and the task it is supposed to accomplish).

The evaluation function is dependent on the application and doesn’t have to follow the restrictions of the gradient methods.

There are three phases in the evolutionary approach:
• Deciding the representation of the connection weights – we used a binary representation.
• Deciding the type of the evolutionary process – we chose the genetic operators of crossover and mutation.
• Deciding the fitness / evaluation function.

The cycle of evolution of the connection weights, as implemented, is presented below (a code sketch of this loop follows the list):
1. Establish the architecture of the neural network and initialize the chromosome population.
2. Decode each individual (chromosome) from the current generation into a set of weights and build the corresponding neural network.
3. Evaluate each neural network using the evaluation function decided at the beginning (its parameters are computed during each individual's life).
4. Select the parents for the next generation based on their fitness values.
5. Apply the genetic operators to obtain the children chromosomes and then restart the process from step 2 (until an exit condition is fulfilled – for instance, a maximum number of generations or a sufficiently good fitness).
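A minimal C++ sketch of this cycle is given below. It is illustrative only: the decoding, simulation and genetic-operator stages are passed in as callbacks because their concrete implementations belong to the simulation framework, and all names are assumptions rather than the paper's code.

#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Sketch of the five-step evolutionary cycle listed above.
using Chromosome = std::vector<bool>;                 // binary weight encoding
using Population = std::vector<Chromosome>;

std::pair<Chromosome, double> EvolveWeights(
    Population population,
    const std::function<double(const Chromosome&)>& evaluate,     // steps 2 + 3
    const std::function<Population(const Population&,
                                   const std::vector<double>&)>& reproduce,  // steps 4 + 5
    int maxGenerations, double targetFitness) {
    Chromosome best;
    double bestFitness = -1.0;
    for (int gen = 0; gen < maxGenerations; ++gen) {
        std::vector<double> fitness(population.size());
        for (std::size_t i = 0; i < population.size(); ++i) {
            fitness[i] = evaluate(population[i]);     // decode + run in the environment
            if (fitness[i] > bestFitness) { bestFitness = fitness[i]; best = population[i]; }
        }
        if (bestFitness >= targetFitness) break;      // exit condition
        population = reproduce(population, fitness);  // selection, crossover, mutation
    }
    return {best, bestFitness};
}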

An important choice that has to be made is the evaluation function. There are no formal guidelines for choosing such a function, and small differences can lead to completely different results. Usually, choosing an evaluation function is a trial-and-error process, which can be time consuming when performed on real robots rather than in simulation.

According to Urzelai [15], a good evaluation function must have certain characteristics: it should be implicit, meaning that it should contain a minimum number of variables and constraints; it should be internal, meaning that its value can be computed using information that is internal to the agent; and it should be behavioral, meaning that it should be linked to the behavior of the agent and not to the functional aspects needed to generate such a behavior.

Such a fitness function would allow the evolution to adapt to different, unpredictable environments.

The evaluation function chosen for this experiment depends on three parameters: the mean activation of the sensors (thus encouraging wall avoidance and straight trajectories), the number of steps taken until the target is reached (if the target is not reached, this is equal to the maximum number of steps the agent is allowed to wander through the environment) and the final distance to the target. The formula is presented in Figure 1.

Figure 1. Evaluation function
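Since the figure itself is not reproduced here, the following C++ sketch only illustrates a plausible shape of such an evaluation function from the three stated ingredients: lower mean sensor activation, fewer steps and a smaller final distance all increase fitness. The product form, the normalizations and the names are assumptions, not the formula of Figure 1.

#include <algorithm>

// Illustrative evaluation function built from the three quantities named in
// the text. All terms are kept in [0,1]; higher return value is better.
// This shows only the intended monotonicity, not the exact formula.
double EvaluationFunction(double meanSensorActivation,    // assumed in [0,1]
                          int stepsTaken, int maxSteps,
                          double finalDistance, double maxDistance) {
    double avoidTerm = 1.0 - meanSensorActivation;                        // low activation is good
    double stepTerm  = 1.0 - static_cast<double>(stepsTaken) / maxSteps;  // fewer steps is good
    double distTerm  = 1.0 - std::min(finalDistance / maxDistance, 1.0);  // closer to target is good
    return avoidTerm * stepTerm * distTerm;
}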

The first experiments did not include a target-following module. The search space was too wide and the algorithm failed in some cases; the only behavior the agents learned was to avoid obstacles.

For this reason, a target-following module was introduced. Its main function was to orient the simulated robot towards the target once the maximum sensor activation dropped below a certain threshold. This threshold was encoded into the chromosome and included in the evolution process, so that the priority of the two modules was adjusted automatically.
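A minimal C++ sketch of this priority rule, assuming the threshold is supplied by the evolved chromosome and that the bearing to the target is available, might look like this (names and signatures are illustrative):

#include <algorithm>
#include <array>

// Sketch of the priority rule described above: when the strongest sensor
// reading falls below an (evolved) threshold, the target-following module
// steers directly towards the target; otherwise the obstacle-avoidance
// network keeps control.
constexpr int kNumSensors = 8;

double ChooseHeadingChange(const std::array<double, kNumSensors>& sensors,
                           double networkCommand,     // output of the evolved network
                           double bearingToTarget,    // signed angle to the target
                           double threshold) {        // evolved with the chromosome
    double maxActivation = *std::max_element(sensors.begin(), sensors.end());
    if (maxActivation < threshold)
        return bearingToTarget;      // no nearby obstacle: turn towards the target
    return networkCommand;           // otherwise follow the obstacle-avoidance output
}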

C. Reinforcement Learning Approach

The second method we implemented is a new way of combining Reinforcement Learning and Evolutionary Robotics and it is an extension of the previous approach.

The main drawback of the pure Evolutionary Robotics method is that the agents have a fixed Neural Network during their quest for reaching the target and avoiding obstacles. The weights of the network are only changed at the beginning of the “life”, following the evolution process. It is a good way of learning, but it is rather slow. Lifetime learning would be a significant improvement to the process.

It would be much more interesting for a robot to be “born” with a good capability of learning how to avoid obstacles and get to a target than to be “born” directly with behavioral knowledge. Intuitively (and as confirmed by the experiments), a controller capable of self-adapting during its lifetime is more robust, when tested in unknown environments, than one with a fixed architecture.

The problem with learning during a lifetime is that there is no teacher to tell the network how to modify its weights (no person can do that for tens of generations and for hundreds of controllers in each generation). The only thing a controller can use is the information it gathers from the environment. It then transforms that information into a reward signal that is fed back to the controller (in our case a Neural Network) as an error correction.

The proposed architecture is presented in Figure 2:

Figure 2. ER and RL combined approach

All inputs of the Neural Networks are normalized to the interval [0,1]. The motor-control part is similar to the previous approach: the inputs come from the sensors and the weighted sum is normalized to a change in direction for the robot.

The difference is that at each step a reinforcement signal coming from the Reinforcement Signal Generator (also a Neural Network) is used as a reference to compute an error, which is propagated back into the Motor Control Neural Network through Backpropagation. Thus, the weights of the network that drives the robot change continuously during its lifetime.
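The following C++ sketch illustrates one plausible form of this lifetime update for the single-layer motor network: the reinforcement signal acts as the reference output and a delta-rule step (the single-layer case of Backpropagation) corrects the weights at every simulation step. The learning rate and the names are assumptions.

#include <array>
#include <cmath>

// Sketch of the lifetime weight update described above: the reinforcement
// signal is the desired output, the difference with the motor output is the
// error, and a gradient step adjusts the weights at every simulation step.
constexpr int kNumSensors = 8;

double Sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

void LifetimeUpdate(std::array<double, kNumSensors>& motorWeights,
                    const std::array<double, kNumSensors>& sensors,  // inputs in [0,1]
                    double reinforcementSignal,                      // reference in [0,1]
                    double learningRate = 0.05) {                    // assumed value
    double sum = 0.0;
    for (int i = 0; i < kNumSensors; ++i) sum += motorWeights[i] * sensors[i];
    double output = Sigmoid(sum);                          // current motor output
    double error = reinforcementSignal - output;           // reference minus output
    double delta = error * output * (1.0 - output);        // sigmoid derivative term
    for (int i = 0; i < kNumSensors; ++i)
        motorWeights[i] += learningRate * delta * sensors[i];  // weight correction
}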

The Reinforcement Signal Generator is a Neural Network with one hidden layer. The first neuron of the hidden layer receives the information from the sensors. The second one receives the current orientation difference with respect to the direction of the target (physically feasible if the target is, for instance, a source of light), the distance to the target (physically feasible if any positioning system is in place) and the status of hitting an object in the environment (1 if an object is hit, 0 otherwise). In this way, part of the evolutionary evaluation function was transferred into the Reinforcement Signal Generator, which allowed lifetime learning.

The weighted sum of the two hidden neurons determines the reinforcement signal (which is of course normalized to be similar to the output of the motor network).

The weights of the Reinforcement Signal Generator network are fixed during the life of an individual.

All the weights of the system (w1:wn, v1:vn, z1:z3, f1:f2) are coded into chromosomes that are used as individuals in the evolutionary process (with the same evaluation function as in the first approach). The purpose is to evolve systems that start with a capacity for learning the desired behaviors and have a regulatory system capable of teaching them, given the available information about the environment.
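Under the encoding of 10 bits per weight used in the ER experiments, the chromosome of the combined controller corresponds to 21 weights: 8 motor weights (w), 8 sensor weights of the first hidden neuron (v), 3 weights of the second hidden neuron (z) and 2 output weights (f), i.e. 21 × 10 = 210 bits, matching the codification length reported in Section III. The C++ sketch below shows one possible decoding; the mapping of each gene to the range [-1, 1] is an assumption.

#include <cstddef>
#include <vector>

// Illustrative decoding of the combined controller's 210-bit chromosome.
struct CombinedGenome {
    std::vector<double> motorWeights;        // w1..w8
    std::vector<double> sensorHiddenWeights; // v1..v8
    std::vector<double> targetHiddenWeights; // z1..z3
    std::vector<double> outputWeights;       // f1, f2
};

double DecodeGene(const std::vector<bool>& bits, std::size_t start) {
    unsigned value = 0;
    for (std::size_t b = 0; b < 10; ++b)                 // 10 bits per weight
        value = (value << 1) | (bits[start + b] ? 1u : 0u);
    return -1.0 + 2.0 * value / 1023.0;                  // map [0,1023] to [-1,1] (assumed range)
}

CombinedGenome DecodeChromosome(const std::vector<bool>& bits) {  // expects 210 bits
    CombinedGenome g;
    std::size_t pos = 0;
    auto take = [&](std::vector<double>& dst, int count) {
        for (int i = 0; i < count; ++i) { dst.push_back(DecodeGene(bits, pos)); pos += 10; }
    };
    take(g.motorWeights, 8);
    take(g.sensorHiddenWeights, 8);
    take(g.targetHiddenWeights, 3);
    take(g.outputWeights, 2);
    return g;
}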

III. EXPERIMENTS AND RESULTS

In the experiments we used several 3D environments to test the evolution of the controllers. A relevant environment is shown in Figure 3.

Several modifications were made to an initial environment, increasing the difficulty of reaching the target from the starting point. In the initial experiments the target was placed at a fixed position inside the maze, so the only behavior the agent was supposed to learn was obstacle avoidance in a specific manner. As the tasks became more complex, the behavior became harder to synthesize.

Another important characteristic of the experiment is the simulated robot. We simulated a robot with 8 frontal proximity sensors with a fixed range. These sensors have an activation equal to 0 if they do not detect anything and greater than 0 if they do, the value increasing linearly as the detected object gets closer. It was observed that a wider sensor range leads to a smaller number of “physical” hits during the learning process.
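As an illustration, one proximity sensor consistent with this description could be modeled as below, with activation 0 when nothing is within range and rising linearly towards 1 as the detected object gets closer (so that low mean activation corresponds to staying clear of walls); the exact scaling is an assumption.

#include <algorithm>

// Sketch of a single fixed-range proximity sensor as described above.
double SensorActivation(bool hit, double distanceToObject, double sensorRange) {
    if (!hit || distanceToObject >= sensorRange) return 0.0;   // nothing detected in range
    return 1.0 - std::min(distanceToObject, sensorRange) / sensorRange;  // closer object, higher activation
}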

The loaded target can be seen in Figure 3. As can be observed, it is quite difficult to reach that point: the agent has to escape from the local optima in the maze, then go around the central obstacle (without getting blocked) and hit the target.

A. Evolutionary Robotics Experiments

For the Evolutionary Robotics experiments we chose the following genetic algorithm parameters: 50 generations, a population of 60 chromosomes, a codification length of 10 bits per neural network weight, a total chromosome length of 80 bits (8 sensors leading to 8 input neurons × 10 bits per weight = 80), a crossover probability of 40% and a mutation probability of 5%. These numbers were chosen after various experiments: smaller populations led to premature convergence in local optima, while bigger ones took a long time to converge.
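For illustration, one-point crossover and bitwise mutation with these probabilities could be implemented as in the following C++ sketch; the one-point variant and the per-bit interpretation of the mutation rate are assumptions, not necessarily the operators actually used.

#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Sketch of the genetic operators with the stated probabilities
// (40% crossover, 5% mutation) on 80-bit chromosomes of equal length.
using Chromosome = std::vector<bool>;

std::pair<Chromosome, Chromosome> CrossoverAndMutate(Chromosome a, Chromosome b,
                                                     std::mt19937& rng,
                                                     double pCross = 0.40,
                                                     double pMut = 0.05) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) < pCross) {                                 // one-point crossover
        std::uniform_int_distribution<std::size_t> cut(1, a.size() - 1);
        std::size_t point = cut(rng);
        for (std::size_t i = point; i < a.size(); ++i) {
            bool tmp = a[i]; a[i] = b[i]; b[i] = tmp;         // swap tails of the parents
        }
    }
    for (std::size_t i = 0; i < a.size(); ++i) {              // bitwise mutation
        if (coin(rng) < pMut) a[i] = !a[i];
        if (coin(rng) < pMut) b[i] = !b[i];
    }
    return {a, b};
}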

An agent was allowed to wander through the environment for a fixed number of steps. At the beginning this number was rather large; the genetic algorithm found a solution, but the path was far from optimal (it failed to find the shortest path to the target). However, when the number of steps was lowered, the selection pressure on the optimization procedure to find a better solution increased, which ultimately proved successful.

Each complete experiment lasted approximately 20 minutes on a 2 GHz computer.

During the first three generations, the agents are blocked in the central part of the maze. At the beginning the majority collide with a wall and remain there. Only a few are successful in avoiding the walls. These are the ones that are selected for the next generations (due to their lower activation of sensors).

In the fourth generation one agent succeeds in getting out of the central part and avoids all the obstacles, but gets stuck at the exit of the maze. It has learned the obstacle avoidance behavior; the weights of the neural network are almost set for the next level.

During generations 5-8 the agents succeed in exiting the maze and finding the path to the isolated obstacle (see Figure 3). They are not following the optimum path (when exiting the maze they follow the ellipsoid rather than going directly to the target) and they are not reaching the target yet.

In the 9th generation an individual hits the target for the first time. Due to the probabilistic nature of the genetic algorithms this individual is lost in the 10th generation. Nevertheless, the population is now prepared for the next level.

During generations 10-21 the best individuals hit the target, improving on the first solution. Each improvement means fewer steps to the target and lower sensor activation (better obstacle avoidance). However, the global optimum is never reached in this experiment.

In generations 21-50 there is no significant improvement in performance. The maximum value of the fitness function was 0.039. Several modifications were made to improve the results.

In the second experiment a wider rotation range for the robot was tested: (-PI, PI) instead of the (-PI/2, PI/2) used in the first experiment. The rest of the parameters remained unchanged. The maximum value of the fitness function was higher, 0.049, which means that a better (though still not optimal) path to the target was found.

In the third experiment the fitness function was changed. We added a new parameter: the mean rotation angle relative to the current direction of the agent. The fitness is higher as this parameter gets lower; the purpose is to create a selection pressure towards straighter trajectories in order to improve obstacle avoidance. The results obtained were the best: the agent found the shortest path to the target and the fitness function reached the highest value of the ER experiments, 0.063.

Figure 3. An example environment with a path found by the ER algorithm (figure labels: Robot, Target, 19th Gen)

The best individuals were also tested with different targets in the same environment and they reached the target almost every time. When the starting position was changed, the target was always reached. This means that the agents learned a generic behavior for that specific environment.

However, when the environment was changed completely, most of the top agents failed to find their way to the target (some succeeded, but not in all situations).

B. Reinforcement Learning Experiments

We used the same evolution parameters in the Reinforcement Learning experiments, except for the codification length, which was 210 bits.

The convergence of the desired behaviors was much slower than for the first method (see Figure 4).

This is expected, since the chromosomes were much longer and there was no helper module to steer the robots towards the target explicitly.

The first successful target hit appeared only in the 21st generation (similar in quality to the best solution found with pure ER). From this point on, however, the improvement rate was much higher.

Within about 4-5 further generations the robots found the shortest way to the target and then slightly improved the way they avoided obstacles.

The highest fitness value was 0.071 (much higher than in the first experiment with ER).

As an experiment, we removed the Reinforcement Signal Generator from the best individual and let it wander through the environment. It immediately got stuck against a wall. We can deduce that the motor-control Neural Network is not evolved to avoid obstacles directly, but to learn how to do so quickly during the simulation.

The best individuals were tested from different starting positions in the same environment with different targets and they always found a path to the target.

They were also tested in completely different environments with different characteristics and yielded very good results (unless strong local optima existed). This is what differentiates this approach from the previous one: the agents learned how to acquire the appropriate behavior in any environment.

The bottom line is that ER was faster in finding a solution in a given environment, but ER&RL found a better one. Although ER was faster, the time it needed was of the same order of magnitude as that of ER&RL.

The robustness of the second type of controller was also higher than that of the first, since it adapted much better to new environments.

In some situations the first method did not find a solution at all, while the second found very good ones.

Figure 4. Performance evolution for the two experiments


C. Limitations

The main limitation of our approach is that we did not implement any high-level planner for our system. The main reason is that we had another purpose: comparing two architectures under similar conditions. Nevertheless, it is important to emphasize that, no matter how good reactive systems are, they need to be complemented with high-level planners in order to accomplish complex tasks.

IV. CONCLUSIONS

In this paper we presented two methods of synthesizing controllers for simulated autonomous robots and compared them in similar situations.

The winner was the approach combining Evolutionary Robotics and Reinforcement Learning, due to its higher generalization capability and the better solutions it generated. The main reason was the capability of lifetime self-adaptation provided by the Reinforcement Signal Generator network, as opposed to the fixed configuration of the pure Evolutionary Robotics technique.

The system gave very good results considering that it relied only on the current sensory input from the environment.

The ultimate purpose of any autonomous robot design is to develop systems capable of self-governing in complex and unpredictable environments. We hope this paper is a step towards this end.

As future developments, we are working on training the system on a variety of environments (even if this does not seem strictly necessary for the more complex approach), creating more complex motor-control architectures, creating high-level planners with state history in order to handle more complex environments and tasks, and ultimately porting the most successful architectures onto real robots.

ACKNOWLEDGMENT

Sergiu Goschin would like to thank Professor Cristian Giumale of University Politehnica Bucharest for his help, ideas and support during this research.

REFERENCES

[1] Ackley, D.H., and Littman, M.S. 1991. Interactions between learning and evolution. In Artificial Life, Vol. X, 487–509.

[2] Arkin, R.C. eds. 1998. Behavior-Based Robotics. Cambridge, M.A.: MIT Press.

[3] Blynel, J. 2003. Evolving Reinforcement Learning-Like Abilities for Robots. In The 5th International Conference on Evolvable Systems.

[4] Brooks, R.A. 1991. Intelligence without representation. In Artificial Intelligence, vol. 47. 139–159.

[5] Fang, J., and Xi, Y. 1997. Neural network design based on evolutionary programming. In Artificial Intelligence in Engineering, vol. 11, no. 2. 155–161.

[6] Floreano, D., and Mondada, F. 1994. Automatic Creation of an Autonomous Agent: Genetic Evolution of a Neural-Network Driven Robot. In Proceedings of the Conference on Simulation of Adaptive Behavior 1994.

[7] Floreano, D., and Mondada, F. 1996. Evolution of plastic neurocontrollers for situated agents. In Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior. Cambridge, MA.: MIT Press-Bradford Books.

[8] Goschin, S.; Franti, E.; Dascalu, M.; and Pietraroiu, M., 2006. Autonomous Agents with Control Systems Based on Genetic Algorithms. In Proceedings of the 12th IASTED International Conference on Robotics and Applications 2006, 537-071: 49–53.

[9] Huber, M. 2000. A Hybrid Architecture for Adaptive Robot Control. Ph. D. diss., Graduate School of University of Massachusetts Amherst.

[10] Jakobi, N. 1998. Minimal Simulations for Evolutionary Robotics. Ph.D. diss., School of Cognitive and Computing Sciences, University of Sussex.

[11] Nelson, A.L.; Grant, E.; Galeotti, J.M.; and Rhody, S. 2004. Maze exploration behaviors using an integrated evolutionary robotics environment. In Robotics and Autonomous Systems, 46(3): 159–173.

[12] Nolfi, S.; Floreano, D.; Miglino, O.; and Mondada, F. 1994. How to Evolve Autonomous Robots: Different Approaches in Evolutionary Robotics. In Artificial Life IV, 190–197.

[13] Sutton, R.S., and Barto A.G. eds. 1998. Reinforcement Learning: An Introduction. Cambridge, M.A.: MIT Press.

[14] Sutton, R.S.; McAllester, D.; Singh., S.; and Mansour, Y. 2000. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Proceedings of Advances in Neural Information Processing Systems 12. 1057 – 1063: MIT Press.

[15] Urzelai, J. 2000. Evolutionary adaptive robots: artificial evolution of adaptation mechanisms for autonomous agents. Ph. D. diss., Laboratory of Intelligent Systems, Ecole Polytechnique Federale de Lausanne.

[16] Togelius, J. 2004. Evolution of a subsumption architecture neurocontroller. In Journal of Intelligent and Fuzzy Systems, 15:15–20.

[17] Yao, X. 1999. Evolving artificial neural networks. In Proceedings of the IEEE, 87(9): 1423–1447.
