
MAKING ROBOTS SMART
Behavioral Learning Combines Sensing and Action

EDITED BY

Attilio Giordana, Università di Torino, Torino, Italy
Michael Kaiser, ABB Corporate Research Ltd., Baden-Daettwil, Switzerland
Volker Klingspor, Universität Dortmund, Dortmund, Germany
Hendrik Van Brussel, Katholieke Universiteit Leuven, Leuven, Belgium

KLUWER ACADEMIC PUBLISHERS
Boston/London/Dordrecht

CONTENTS

1 PREFACE
  R. Dillmann

Part I  LEARNING IN EXECUTION AND CONTROL

2 INTRODUCTION TO SKILL LEARNING
  M. Kaiser, R. Dillmann
  1 Skill Analysis, Design, and Implementation
  2 The Skill Model
  3 Skill Acquisition from Human Performance Data
  4 The Skill Acquisition Process
  5 Methods for Skill Acquisition
  6 Summary

3 LEARNING FUNCTION APPROXIMATORS
  C. Baroglio, A. Giordana, R. Piola
  1 Introduction
  2 Function Approximation
  3 Learning Algorithms
  4 Empirical Results and Observations
  5 Conclusions

4 LEARNING SENSOR ASSISTED ASSEMBLY OPERATIONS
  M. Nuttin, H. Van Brussel
  1 Introduction
  2 A historical experiment with learning automata
  3 A connectionist reinforcement learning controller
  4 Conclusions

5 LEARNING AND RE-CALIBRATION IN FLEXIBLE ASSEMBLY
  M. Nuttin, H. Van Brussel
  1 Introduction
  2 Re-calibration and the learning task
  3 Inductive synthesis of Regression Trees and Cascade Correlation Nets
  4 Generation of examples
  5 Experiments
  6 Learning and dimensional tolerances
  7 Experiments
  8 Conclusion

6 CONTACT ESTIMATION FOR COMPLIANT MOTION CONTROL
  R. Suárez, L. Basañez, J. Rosell
  1 Introduction
  2 The planner
  3 Contact Estimation: Analytical approach
  4 Contact Estimation: Learning approach
  5 Comparison of the analytical and learning approaches
  6 Conclusions

7 LEARNING SENSOR-BASED NAVIGATION
  J. del R. Millán, C. Torras
  1 Introduction
  2 Robot Testbed
  3 The Learning Approach
  4 Controller Architecture
  5 Learning Mechanisms
  6 Learning Opportunities
  7 Experimental Results
  8 Conclusions

8 LEARNING TO CONTROL A VISUAL SENSING SYSTEM
  M. Accame
  1 Introduction
  2 Description of the Visual Sensing System
  3 The Camera Module
  4 The Edge Extraction Module
  5 A Sample Experiment: Door Frame Identification
  6 Conclusions

Part II  LEARNING FOR HUMAN-ROBOT INTERACTION

9 LEARNING IN HUMAN-ROBOT COMMUNICATION
  M. Kaiser, H. Friedrich, V. Klingspor, K. Morik
  1 Introduction
  2 The Psychology of Human-Agent Interaction
  3 Human-Robot Communication
  4 Learning Tasks in Human-Agent Interaction
  5 Summary

10 LEARNING TO CLASSIFY
  T. Rauber, M. Barata
  1 Introduction
  2 Holistic Learning of Classifiers
  3 Tools for the Learning of Classifiers
  4 Learning of Classifiers in Machine Tool Supervision
  5 Conclusions and Outlook

11 LEARNING A TAXONOMY OF FAILURES IN ASSEMBLY
  L. Seabra Lopes
  1 Introduction
  2 Learning and Execution Supervision
  3 An Algorithm that Learns Hierarchies of Structured Concepts
  4 Example Generation and Processing
  5 Using Hierarchical Decomposition
  6 Conclusion

12 INCREMENTAL SIGNAL TO SYMBOL PROCESSING
  K. Morik, S. Wessel
  1 Introduction
  2 Representation and Scenario
  3 Finding a sequence of symbolic descriptions for sensor data
  4 Adapting the tolerance parameter
  5 Empirical evaluation
  6 Comparison with related work
  7 Conclusion

13 LEARNING UNDERSTANDABLE CONCEPTS FOR ROBOT NAVIGATION
  V. Klingspor, K. Morik
  1 Introduction
  2 Operational Concepts
  3 The Layout for Learning
  4 Learning Tasks and Experiments
  5 Applying learned concepts
  6 Conclusion

14 PROGRAM OPTIMIZATION FOR REAL-TIME PERFORMANCE
  A. Rieger, V. Klingspor
  1 Introduction
  2 Program optimization
  3 Inferring the prefix acceptor
  4 Speeding up inferences
  5 The Parallel Performance System
  6 Depth One Inferences with General Horn Clauses
  7 Real World Test
  8 Conclusion

CONTRIBUTORS

Marco Accame
University of Genoa, Department of Biophysical and Electronic Engineering (DIBE)
16145 Genova, Italy

Manuel Barata
Universidade Nova de Lisboa, Departamento de Engenharia Electrotécnica
2825 Monte da Caparica, Portugal

Cristina Baroglio
Università di Torino, Dipartimento di Informatica
10149 Torino, Italy

Luis Basañez
Universitat Politècnica de Catalunya, Institut de Cibernètica
08028 Barcelona, Spain

Luís M. Camarinha-Matos
Universidade Nova de Lisboa, Departamento de Engenharia Electrotécnica
2825 Monte da Caparica, Portugal

Rüdiger Dillmann
Universität Karlsruhe, Institute for Real-Time Computer Systems & Robotics
76128 Karlsruhe, Germany

Holger Friedrich
Universität Karlsruhe, Institute for Real-Time Computer Systems & Robotics
76128 Karlsruhe, Germany

Attilio Giordana
Università di Torino, Dipartimento di Informatica
10149 Torino, Italy

Michael Kaiser
ABB Corporate Research Ltd., Information Technology Dept.
5405 Baden-Daettwil, Switzerland

Volker Klingspor
Universität Dortmund, Lehrstuhl Informatik VIII
44221 Dortmund, Germany

José del R. Millán
Joint Research Centre of the European Commission, Institute for Systems, Informatics and Safety
21020 Ispra, Italy

Katharina Morik
Universität Dortmund, Lehrstuhl Informatik VIII
44221 Dortmund, Germany

Marnix Nuttin
Katholieke Universiteit Leuven, Department of Mechanical Engineering, Division PMA
3001 Leuven, Belgium

Roberto Piola
Università di Torino, Dipartimento di Informatica
10149 Torino, Italy

Thomas W. Rauber
Universidade Federal do Espírito Santo, Departamento de Informática
29065-900 Vitória - ES, Brazil

Anke Rieger
Universität Dortmund, Lehrstuhl Informatik VIII
44221 Dortmund, Germany

Jan Rosell
Universitat Politècnica de Catalunya, Institut de Cibernètica
08028 Barcelona, Spain

Luís Seabra Lopes
Universidade de Aveiro, Departamento de Electrónica e Telecomunicações
3810 Aveiro, Portugal

Raúl Suárez
Universitat Politècnica de Catalunya, Institut de Cibernètica
08028 Barcelona, Spain

Carme Torras
Universitat Politècnica de Catalunya, Institut de Cibernètica
08028 Barcelona, Spain

Hendrik Van Brussel
Katholieke Universiteit Leuven, Department of Mechanical Engineering, Division PMA
3001 Leuven, Belgium

Stephanie Wessel
Universität Hamburg, Labor für Künstliche Intelligenz
Vogt-Kölln-Str. 30, 22527 Hamburg, Germany

7
LEARNING SENSOR-BASED NAVIGATION

J. del R. Millán and C. Torras*

Joint Research Centre of the European Commission
*Universitat Politècnica de Catalunya

ABSTRACT

A mobile robot that uses reinforcement learning to acquire reactive navigation skills is presented. The basic skills needed for reaching a goal while avoiding obstacles are encoded as sensation-action associations in a modular Neural Net. The robot has no a priori knowledge of either the environment or the effect of its actions, and learns on-line using only raw sensory data collected as it moves. The experimental results show that a few trials suffice for the robot to navigate efficiently in a real environment of moderate complexity.

1 INTRODUCTION

Mobile robots must be able to navigate autonomously if they are to perform useful tasks in unconstrained environments. Moreover, since these environments are usually unknown when faced for the first time, robots need to be endowed with learning capabilities. This chapter deals with the learning task of acquiring a set of safe and efficient elementary operations for navigation. By elementary we mean that the learning task is aimed at closing the sensing-action loop at the lowest control level of the autonomous vehicle. In other words, the robot is to learn the suitable reactive motor skills using only raw sensory data. In particular, these skills correspond to goal-oriented obstacle-avoidance reactions, since the robot has to navigate efficiently. Some crucial aspects of this learning task are the following. First, the robot must always make decisions in real time. A robot operating in a hazardous environment cannot afford to stop for long periods to select the best course of action or to update its current knowledge

about the task. Second, the robot's knowledge has to be grounded. In fact, since the robot does not have any a priori knowledge of the environment, it can only work upon information that is extracted from its sensors. Third, the robot's controller (and all related modules) has to deal with noisy sensory data.

One way in which a robot can autonomously learn reactive motor skills is to first acquire a model of the effects of its actions. The robot can then employ this model to make a plan to achieve the goal. Examples of this approach are the works of Atkeson [Atkeson, 1991], Moore [Moore, 1991] and Thrun et al. [Thrun et al., 1991]. Nevertheless, building and maintaining good enough models is not only computationally expensive, but also prone to errors if built from noisy sensory data.

A different approach is to learn the suitable reactions directly from the interactions with the environment. The robot simply tries different actions for every situation it finds when experiencing the environment and selects the most useful ones as measured by a reinforcement or performance feedback signal. Indeed, reinforcement learning (RL) is thought to be an appropriate paradigm for acquiring control policies for autonomous robots that work in initially unknown environments. Many of the tasks faced by such robots have absorbing goal states, and so their aim is to learn to perform those actions that maximize the cumulative reinforcement in the long term (i.e., from the moment an action is taken until the goal is reached). These robots must learn a policy that maximizes

$$V(t) = \sum_{k=0}^{T} z(t+k),$$

where $z(\tau)$ is the reward obtained at time $\tau$ and $V(t)$ is the total future reinforcement from time $t$ until the goal is reached at time $T$.

Most RL systems learn value functions by means of temporal difference (TD) methods [Sutton, 1988]. A value function estimates $V$ for situations or for situation-action pairs. Actor-Critic architectures [Barto et al., 1983] and Q-learning [Watkins, 1989] are two of these RL systems. Recent works have proven the asymptotic convergence of TD methods (e.g., [Dayan and Sejnowski, 1994]). Unfortunately, these proofs rely on several assumptions that hardly apply to robots facing real-world tasks. In particular, most of them require codifying the TD estimates into tabular representations (which implies discrete situation and action spaces)[a] and trying out every action for every situation an infinite number of times (which is always unfeasible, and can only be approximated if both spaces are small). In many real-world tasks, however, the situation and action spaces are continuous and robots cannot afford long, risky learning trials. Thus, practical learning robots require compact representations such as NN to generalize experience between similar situations and actions, and to limit their experience to relevant parts of the problem only. But what can we expect from reinforcement-based agents that do not satisfy the theoretical requirements? Our experimental results with the autonomous mobile robot Teseo demonstrate empirically that, despite the lack of convergence results, reinforcement-based robots can rapidly solve real-world tasks. See also chapter 5 for other examples of successful reinforcement-based robots.

[a] Some results apply to non-tabular representations [Tsitsiklis and Van Roy, 1996], but rely on the use of appropriate features that are typically hand-coded. Interestingly, the authors discuss how their feature-based method may be combined with local networks similar to ours. On the contrary, theoretical results (e.g., [Gordon, 1995]) have shown that TD methods may fail to converge if global function approximators, such as feedforward NN, are used.

Key components of our learning architecture are the use of local networks[b] and the incorporation of bias into the network. The local networks we advocate allow the robot to learn incrementally, so as to adapt to new or changing environments without degrading its performance in previous situations. See section 3 for a more detailed explanation of the advantages of local networks and incremental learning. We use built-in reflexes (domain knowledge) as bias. Bias has two benefits. First, it accelerates the learning process, since it immediately focuses the search on promising parts of the action space. Second, it makes the robot operational from the very beginning and increases the safety of the learning process.

[b] Indeed, Sutton [Sutton, 1996] reports that TD converges if the function approximator is a sparse coarse coding network instead of a global one.

Categorization

Learning task: The robot has to learn autonomously to reach a goal while avoiding obstacles in an unknown environment. This amounts to learning suitable reactive motor skills on the basis of only raw sensory data and some built-in reflexes. In particular, these skills correspond to goal-oriented obstacle-avoidance reactions, which are computed by means of an unstructured numerical function that maps perceived situations onto suitable actions.

Training data generation: As the robot moves in a real environment, it will face thousands of sensory situations for which to perform actions. Each such situation constitutes a training instance, which has associated a data pair consisting of the set of readings from the robot sensors and the set of motor commands executed. In this way, non-labeled training data of a homogeneous nature are generated. Moreover, training data are noisy

and lack any structure, since the robot works upon raw sensory data and performs basic actions. An additional source of noise is the odometry system.

Choice of learning technique: The mobile robot must not only perform actions autonomously, but also learn autonomously, using the environment as its teacher. Moreover, the robot has no a priori knowledge about either the environment or the effects of its actions, and must only use information about what it can actually perceive. Finally, the robot has to learn incrementally and on-line. These requirements point to neural reinforcement learning as the most suitable approach.

Evaluation of learning result: The performance of the learning robot is evaluated on-line in terms of the quality of the trajectory; that is, length and clearance to the obstacles.

Adaptation of learning result: Since the robot is learning autonomously, incrementally and on-line, it adapts its behavior continuously.

Constraints: Learning must take place as the robot moves. This imposes a strong time constraint if the robot is to act in real time.

2 ROBOT TESTBED

Teseo is a commercial Nomad 200 mobile robot (figure 1). It has three independent motors: the first moves the three wheels of the robot together, the second steers the wheels together, and the third rotates the turret of the robot. The robot is equipped with 16 infrared sensors, 16 sonar sensors, and 20 contact sensors. The infrared and sonar sensors are evenly placed around the perimeter of the turret, and the tactile sensors cover the whole perimeter of the robot below the turret. Moreover, the robot has a dead-reckoning system that keeps track of the robot's position and orientation.

Teseo is controlled by a NN that maps the currently perceived situation into the next action. Its task is to reach a given goal location specified in Cartesian coordinates. A situation is made of sensory information coming from physical as well as virtual sensors. The action determines the next direction of travel.

Teseo receives a reinforcement signal after performing every action. This assumption is hard to satisfy in many reinforcement learning tasks, but not in those where the robot can estimate its degree of achievement at any moment. In our case, knowing the goal location permits rewarding actions approaching


it while avoiding obstacles. It is important to note, however, that Teseo does not seek to optimize each immediate reinforcement signal, but to optimize the total amount of reinforcement obtained along the path to the goal.

Figure 1  The mobile robot Teseo, a commercial Nomad 200.

While moving, Teseo uses a low-level asynchronous emergency routine to prevent collisions. The robot stops and retracts whenever its range sensors detect an obstacle in front of it that is closer than a safety distance or its tactile sensors detect a collision. In this case, the learning algorithm penalizes the action that made the emergency routine intervene.

The input to the NN consists of a vector of 40 real numbers in the interval [0, 1]. The first 32 components correspond to the infrared and sonar sensor readings; a value close to zero means that the corresponding sensor is detecting a very near obstacle. The remaining 8 components correspond to a coarse codification of an inverse exponential function of the distance from the current robot location to the goal, as computed by a virtual sensor based on the dead-reckoning system. The incorporation of information about the goal into the input pattern allows Teseo to discriminate between similar sensory situations that require different suitable actions.
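To make the encoding concrete, the sketch below assembles such a 40-component situation vector. The chapter does not give the sensors' saturation range, the decay constant of the inverse exponential, or the exact coarse-coding scheme, so MAX_RANGE, DECAY, the triangular receptive fields and the function name build_input are illustrative assumptions, not Teseo's actual implementation.

```python
import math

MAX_RANGE = 3.0   # assumed sensor saturation range (metres); not given in the chapter
DECAY = 2.0       # assumed decay constant of the inverse-exponential goal coding

def build_input(ir_readings, sonar_readings, distance_to_goal):
    """Assemble the 40-component situation vector described in section 2.

    ir_readings, sonar_readings: 16 range values each (metres).
    Range values are scaled to [0, 1], where 0 means a very near obstacle.
    """
    assert len(ir_readings) == 16 and len(sonar_readings) == 16
    ranges = [min(r, MAX_RANGE) / MAX_RANGE for r in ir_readings + sonar_readings]

    # Coarse coding of exp(-d / DECAY) over 8 overlapping triangular receptive
    # fields evenly spread on [0, 1]; the exact coding used by Teseo is not given.
    g = math.exp(-distance_to_goal / DECAY)
    centres = [i / 7.0 for i in range(8)]
    width = 1.0 / 7.0
    goal_code = [max(0.0, 1.0 - abs(g - c) / (2 * width)) for c in centres]

    return ranges + goal_code          # 32 + 8 = 40 components in [0, 1]

# Example: robot 1.5 m from the goal, all sensors reporting free space.
x = build_input([3.0] * 16, [3.0] * 16, 1.5)
print(len(x))  # 40
```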

The output of the NN consists of a single component that directly controls the steering motor and indirectly the turret rotation motor. This component is a real number in the interval [-180, 180] and determines the direction of travel with respect to the vector connecting the goal and the current robot location. Once the robot has steered the commanded degrees, it translates a fixed distance (25 cm) and, at the same time, rotates its turret in order to keep the front infrared and sonar sensors oriented toward the goal. It is worth noting that a codification of both the physical sensor readings and the motor command relative to the goal location enhances Teseo's generalization capabilities. This relative codification is not only a "technical trick", possible in a robot like the Nomad 200, where the turret orientation can differ from the robot's direction of travel. There is also experimental evidence that animals learn goal-oriented actions rather than specific motor commands [Roitblat, 1994].

The reinforcement signal $z$ is a real number in the interval [-3, 0] which measures the cost of doing a particular action in a given situation. The cost of an action is directly derived from the task definition, which is to reach the goal along trajectories that are sufficiently short and, at the same time, have a wide clearance to the obstacles. The reward is -3 if Teseo collides or detects an obstacle closer than a safety distance, and it is 0 if Teseo moves straight to the goal and does not detect near obstacles.
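Only the two extremes of this cost function are pinned down in the chapter. The sketch below is one plausible way to fill in the rest; the value of the safety distance, the interpolation between 0 and -3, and the clearance scale are assumptions, labelled as such in the comments.

```python
SAFETY_DISTANCE = 0.3   # metres; the chapter uses a safety distance but does not give its value

def reinforcement(collided, nearest_obstacle, steering_deg):
    """Sketch of the cost-based reinforcement z in [-3, 0] described in section 2.

    Only the two extremes are specified in the chapter (-3 on collision or when an
    obstacle is closer than the safety distance, 0 when moving straight to the goal
    with no near obstacles); the interpolation in between is an assumption.
    """
    if collided or nearest_obstacle < SAFETY_DISTANCE:
        return -3.0
    # Penalise deviation from the goal direction (steering_deg in [-180, 180])
    heading_cost = abs(steering_deg) / 180.0           # 0 .. 1
    # Penalise reduced clearance up to 1 m (assumed scale)
    clearance_cost = max(0.0, 1.0 - nearest_obstacle)  # 0 .. 1
    cost = 1.5 * (heading_cost + clearance_cost)       # stays within [0, 3]
    return 0.0 if cost == 0 else -cost

print(reinforcement(False, 2.0, 0.0))    # 0.0: straight to the goal, wide clearance
print(reinforcement(False, 0.1, 0.0))    # -3.0: obstacle inside the safety distance
```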

3 THE LEARNING APPROACH

Teseo's learning architecture is an improved Actor-Critic system. The Critic uses TD methods to estimate the total future reinforcement that will be obtained if the robot performs the best currently known actions that take it from its current location to the goal. The Actor uses this estimation to update the situation-action mapping, codified into the NN, using the associative search (AS) learning rule [Barto et al., 1983; Williams, 1992]. The improvements to the standard Actor-Critic architecture allow Teseo to overcome three limitations of basic reinforcement learning: slow convergence, an unsafe learning process, and lack of incremental improvement. The next three subsections present three aspects of Teseo's architecture that address these limitations.

3.1 Learning from basic reflexes

The main reason for the slow convergence of basic reinforcement learning is that every time the learner faces a new situation it takes random actions. A way of overcoming this problem is to use domain knowledge to determine quickly where to search for the suitable action to handle each situation.

Along this line, instead of learning from scratch, Teseo searches within a fixed set of basic reflexes every time its neural controller fails to generalize its previous experience correctly to the current situation. The neural controller associates the selected reflex with the perceived situation in one step. The sensory situation is represented by a new unit of the NN, and the selected reflex is codified into the weights of the controller. This new reaction rule is subsequently tuned through reinforcement learning. As explained later in the chapter, incorrect generalizations occur when none of the learned reaction rules is applicable to the perceived situation or when the actual reinforcement obtained after taking an action is much worse than expected. In this way, the NN gets control (and thus suppresses the activation of the basic reflexes) more often as the robot explores the environment.

Basic reflexes correspond to prior knowledge about the task and are codified as simple reactive behaviors [Brooks, 1986]. Like the actions computed by the neural controller, the reflexes determine the next direction of travel, which is followed for a fixed distance while the robot rotates its turret. Each basic reflex selects one of the 16 directions corresponding to the current orientations of the infrared and sonar sensors. We have chosen this fixed set of directions because they are the most informative for the robot in terms of obstacle detection.

The basic reflexes are prewired into a subsumption-like architecture with three reactive behaviors. These behaviors are, in order of priority, collision avoidance, move to goal and track object boundary (figure 2). Every time the basic reflexes are invoked, the applicable behavior with the highest priority is activated. The behaviors collision avoidance and track object boundary are inspired by those of [Mataric, 1992], and their combination makes Teseo safely follow the boundaries of objects in the environment.

It is worth noting that, except in simple cases, these reflexes alone do not generate efficient trajectories; they just provide acceptable starting points for the reinforcement learning algorithm to search for appropriate actions. Teseo does not learn to coordinate preprogrammed reflexes; rather, it learns new, efficient reactions from basic reflexes.
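The three behaviours, their priority order, and the restriction to the 16 sensor directions are as stated above; a minimal Python sketch of such a priority scheme follows. The applicability tests, the particular direction each behaviour picks, and the names avoid_collision, move_to_goal and track_boundary are illustrative assumptions, not the reflexes actually prewired into Teseo.

```python
# The predicates and direction choices below are illustrative assumptions; the
# chapter only specifies the three behaviours, their priority order, and that
# each reflex picks one of the 16 sensor directions.

SECTOR = 22.5  # degrees between consecutive infrared/sonar sensors (360 / 16)

def avoid_collision(ranges, goal_dir):
    """Highest priority: applicable when some sensor sees a very close obstacle."""
    if min(ranges) > 0.5:
        return None                               # not applicable
    free = max(range(16), key=lambda i: ranges[i])
    return free * SECTOR - 180.0                  # steer towards the freest direction

def move_to_goal(ranges, goal_dir):
    """Applicable when the sector containing the goal direction is free."""
    sector = int(round((goal_dir + 180.0) / SECTOR)) % 16
    return sector * SECTOR - 180.0 if ranges[sector] > 1.0 else None

def track_boundary(ranges, goal_dir):
    """Lowest priority: follow the boundary of the nearest object."""
    nearest = min(range(16), key=lambda i: ranges[i])
    return ((nearest + 4) % 16) * SECTOR - 180.0  # move roughly parallel to it

REFLEXES = [avoid_collision, move_to_goal, track_boundary]  # priority order

def basic_reflex(ranges, goal_dir):
    """Return the direction chosen by the applicable behaviour of highest priority."""
    for behaviour in REFLEXES:
        direction = behaviour(ranges, goal_dir)
        if direction is not None:
            return direction
    return goal_dir  # fallback (track_boundary is always applicable, so rarely reached)

print(basic_reflex([2.0] * 16, 45.0))   # goal sector is free -> head for the goal
```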


Figure 2  Codification of the basic reflexes into a subsumption-like architecture with 3 reactive behaviors. Applicable behaviors with higher priorities suppress behaviors with lower priorities.

Integrating learning and reaction in this way allows Teseo to focus on promising parts of the action space immediately, which accelerates learning. In addition, learning on top of basic reflexes makes Teseo operational from the very beginning and increases the safety of the learning process.

3.2 Modularity

Teseo's controller is a modular network. Each module maps similar sensory inputs into similar actions which, in addition, have similar long-term consequences. A resource-allocating procedure adds new units to the appropriate module only when necessary to improve Teseo's performance. This procedure is based mainly on the work of Alpaydin [Alpaydin, 1991]. The main difference is that Alpaydin's approach is intended for supervised learning tasks, while ours is for reinforcement learning tasks.

Modularity guarantees that improvements in the response to a given situation will not negatively alter other, unrelated reactions, thus achieving incremental learning.


3.3 Exploration

Teseo explores the action space by concentrating the search around the best actions currently known. The width of the search is determined by a counter-based scheme associated with the modules (see section 4.2). This exploration technique allows Teseo to avoid experiencing irrelevant actions and to minimize the risk of collisions.

4 CONTROLLER ARCHITECTURE

The neural controller is a modular two-layer network (figure 3). The first layer consists of units with overlapping localized receptive fields, which we call exemplars. The second layer is made of one single stochastic linear unit, the output unit. Big arrows in figure 3 represent full connectivity between the components of the input vector and the exemplars, and between the exemplars and the output unit.

Figure 3  Controller architecture.

4.1 Modules and exemplars

Since modules are created dynamically as Teseo explores its environment, different modules can have different numbers of exemplars. For instance, module 1 in figure 3 has $i$ exemplars while module $j$ has $n$ exemplars. Every module $j$ keeps track of four adaptive values that are updated during learning. First, the expected total future reinforcement, $b_j$, that the robot will receive if it uses this module for computing the next action. Second, the width of the receptive fields of all the exemplars in this module, $d_j$. Third, a counter that records how many times this module has been used without improving the robot's performance, $c_j$. Fourth, the prototypical action the robot should normally take whenever the perceived situation is classified into this module, $pa_j$.

Every exemplar $e^j_k$ is a tuple $(v^j_k, d_j)$ which represents a sphere in the input space. This sphere is centered on the point $v^j_k \in [0,1]^{40}$ and has a radius (or width of the receptive field) $d_j$. The activation level $a^j_k$ of the exemplar $e^j_k$ is a real value in the interval [0, 1] that measures how well $e^j_k$ matches the currently perceived situation $x$. The activation level is 0 if the perceived situation is outside the receptive field of the exemplar, and 1 if the situation corresponds to the point where the exemplar is centered. As mentioned above, exemplars have overlapping receptive fields. Hence a given situation can activate several exemplars simultaneously, each to a different degree.

Actions are computed after a competitive process among the existing modules: only the module that best classifies the perceived situation propagates the activation levels of its exemplars to the output unit. The winning module, if any, is the one having the highest active exemplar. In the case that no exemplar "matches" the perceived situation (i.e., if the input does not fall in the receptive field of any exemplar), a basic reflex is triggered and the current situation becomes a new exemplar. Section 5.2 provides more details about the resource-allocating procedure.

After reacting, the evaluator computes the reinforcement signal, $z$. Then, if the action was computed through the module $j$, the difference between $z$ and $b_j$ is used for learning. Only the weights of the links associated with the winning module $j$ are modified.

Thus, in order to improve its reactions to a particular kind of situation, Teseo needs only to adapt the module that covers that region of the input space. The adaptation of a module is done either by adding more exemplars or by tuning its weights.
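As a data-structure sketch, the module bookkeeping and the winner-take-all classification just described might look as follows. The linear falloff of the activation between the centre and the border of the receptive field is an assumption (the chapter only fixes the two extremes), and the class and function names are hypothetical; later sketches in this chapter reuse this Module structure.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Module:
    """One module of Teseo's controller (section 4.1)."""
    b: float              # expected total future reinforcement when this module acts
    d: float              # radius (receptive-field width) shared by all its exemplars
    c: int                # uses without improving performance
    pa: float             # prototypical action (degrees)
    centres: list = field(default_factory=list)   # exemplar centres v in [0, 1]^40
    weights: list = field(default_factory=list)   # weights to the output unit

def activation(x, v, d):
    """Activation of an exemplar: 1 at its centre, 0 outside its receptive field.
    The linear falloff in between is an assumption; the chapter only fixes the extremes."""
    dist = math.sqrt(sum((xi - vi) ** 2 for xi, vi in zip(x, v)))
    return max(0.0, 1.0 - dist / d)

def classify(x, modules):
    """Competition among modules: the winner is the module owning the most active
    exemplar; None if no exemplar's receptive field covers the situation x."""
    best, best_act = None, 0.0
    for m in modules:
        for v in m.centres:
            a = activation(x, v, m.d)
            if a > best_act:
                best, best_act = m, a
    return best     # None triggers a basic reflex and a new exemplar (section 5.2)
```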


Figure 4  The output unit. For the sake of simplicity, this figure only shows the information sent by the winning module.

4.2 Output unit

As illustrated in figure 4, the output of the neural controller is a prototypical action $pa$ plus a certain variation $s$.

As mentioned in section 3.3, the exploration needed for reinforcement learning is carried out only around the best actions currently known. Since each action is the sum of two components, $pa$ and $s$, the exploration mechanism works on each of them separately.

On the one hand, and assuming $j$ is the winning module, if $c_j$ is not divisible by a constant $k_{exp}$, then $pa_j$ is chosen. Otherwise, $pa$ is taken to be the prototypical action associated with the module $m$ activated by the current situation that has the best expected total future reinforcement, $b_m$. The value of $k_{exp}$ was empirically chosen to be 3 in the experiments.

On the other hand, the deviation $s$ from $pa$ is computed through a stochastic process in such a way that Teseo only explores actions between $pa$ and its four neighboring prototypical actions (two to the left and two to the right). The computation of $s$ is done in three steps.

The first step is to determine the values of the stochastic process' parameters. The mean $\mu$ is a weighted sum of the activation levels of the exemplars $e^j_1, \ldots, e^j_n$

of the winning module:

$$\mu = \sum_{k=1}^{n} w^j_k \, a^j_k, \qquad (6.1)$$

where $w^j_k$ is the weight associated with the link between $e^j_k$ and the output unit, and $a^j_k$ is the activation level of $e^j_k$. The variance $\sigma$ is proportional to $c_j$. This follows from the idea that the more often the module $j$ is used without improving Teseo's performance, the larger $\sigma$ must be.

In the second step, the unit calculates its activation level $l$, which is a normally distributed random variable:

$$l = N(\mu, \sigma). \qquad (6.2)$$

In the third step, the unit computes $s$:

$$s = \begin{cases} 45, & \text{if } l > 45, \\ -45, & \text{if } l < -45, \\ l, & \text{otherwise.} \end{cases} \qquad (6.3)$$
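Putting equations (6.1)-(6.3) and the $k_{exp}$ scheme together, a minimal sketch of the output unit could look as follows (it reuses the Module structure from the sketch in section 4.1). Since the chapter only states that $\sigma$ is proportional to $c_j$, the constant SIGMA_SCALE, the handling of $\sigma = 0$, and the function name compute_action are assumptions.

```python
import random

K_EXP = 3          # exploration constant, empirically set to 3 in the experiments
SIGMA_SCALE = 5.0  # sigma is only said to be "proportional to c_j"; this constant is an assumption

def compute_action(winner, activations, active_modules):
    """Sketch of the output unit (section 4.2): action = prototypical action pa + deviation s."""
    # Choice of pa: normally the winner's own prototypical action; every K_EXP
    # unsuccessful uses of the module, the prototypical action of the activated
    # module with the best expected reinforcement b_m is tried instead.
    if winner.c % K_EXP != 0 or not active_modules:
        pa = winner.pa
    else:
        pa = max(active_modules, key=lambda m: m.b).pa

    # (6.1) mean of the stochastic unit: weighted sum of the exemplar activations
    mu = sum(w * a for w, a in zip(winner.weights, activations))
    # sigma grows with c_j, widening the search when the module keeps failing
    sigma = max(1e-3, SIGMA_SCALE * winner.c)      # avoid a degenerate sigma of 0
    # (6.2) normally distributed activation level of the unit
    l = random.gauss(mu, sigma)
    # (6.3) clip the deviation to +/- 45 degrees (two prototypical actions away)
    s = max(-45.0, min(45.0, l))
    # mu, sigma and l are returned because the eligibility factor (6.8) needs them
    return pa + s, mu, sigma, l
```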

5 LEARNING MECHANISMS

There are four basic forms of learning in the proposed architecture. The first is related to the update of the $b_j$. The second kind of learning regards the topology of the network. The third type of learning consists of tuning the positions of the exemplars. Finally, the fourth concerns weight modification.

5.1 Improving reinforcement estimates

Recall that $b_j$ is an estimate of the total future reinforcement Teseo will obtain if it performs the best currently known actions that take it from its current location (whose associated observed situation is classified into the $j$th module) to the goal. Consequently, the value $b_j$ of the module $j$ should, after learning, be equal to the sum of the cost $z$ of reaching the best next module $i$ plus the value $b_i$. Since $z$ takes negative values, minimizing future cost corresponds to maximizing future reinforcement:

$$b_j = \max_{y \in \text{Actions}} \, (z + b_i). \qquad (6.4)$$

During learning, however, this equality does not hold for the value $b_j$ of every module $j$, because the optimal action is not always taken and, even if it is, the value $b_i$ of the next module $i$ has not yet converged.

In order to iteratively update the values of $b_j$, so that (6.4) finally holds for all of them, we have used the simplest TD method, i.e. TD(0) [Sutton, 1988]. If the situation perceived at time $t$ is classified into the module $j$ and, after performing the computed action, the next situation belongs to the module $i$ and the reinforcement signal is $z(t+1)$, then:

$$b_j(t+1) = b_j(t) + \beta \, \bigl[ z(t+1) + b_i(t) - b_j(t) \bigr]. \qquad (6.5)$$

$\beta$ controls the intensity of the modification, taking the value 0.75 when Teseo behaves better than expected, and 0.075 in the opposite case. The rationale for modifying $b_j$ less intensively when $z(t+1) + b_i(t) - b_j(t) < 0$ is that this error is probably due to the selection of an action different from the best currently known for the module $j$.

The problem is for Teseo to figure out $b_i(t)$ when no module is activated by the next situation. In this case, $b_i$ is estimated on the basis of the distance from the next location to the goal and the distance to the perceived obstacles between the robot and the goal.
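Equation (6.5) with its asymmetric learning rate is small enough to show directly; the sketch below is a straightforward transcription (the function name td0_update is, of course, ours).

```python
BETA_POS = 0.75    # learning rate when Teseo does better than expected
BETA_NEG = 0.075   # learning rate when it does worse

def td0_update(b_j, b_i, z):
    """TD(0) update of the module's reinforcement estimate, equation (6.5).

    b_j: estimate of the winning module before the move; b_i: estimate of the
    module matching the next situation (or the heuristic estimate of section 5.1
    when no module is activated); z: reinforcement received for the move.
    """
    error = z + b_i - b_j                        # TD error
    beta = BETA_POS if error > 0 else BETA_NEG   # update less when performing worse
    return b_j + beta * error, error

# Example: the move cost z = -0.4 and led to a module with a better outlook.
b_j, err = td0_update(b_j=-5.0, b_i=-4.0, z=-0.4)
print(round(b_j, 3), round(err, 3))   # -4.55 0.6
```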

5.2 Network growth

The second learning mechanism makes the NN grow as a function of the inputs received. Initially, there exist neither exemplars nor, consequently, modules; the resource-allocating procedure creates them as they are needed.

As mentioned in section 4.1, if no exemplar "matches" the perceived situation, then a basic reflex is triggered and the current situation becomes a new exemplar. The weight of the link from this exemplar to the output unit is initially set to zero and evolves subsequently through reinforcement learning.

The new exemplar is added to one of the existing modules if its receptive field overlaps the receptive fields of the module's exemplars and the selected reflex is the same as the module's prototypical action. The first condition assures that every module will cover a connected input subspace.

If either of the two conditions above is not satisfied, then a new module consisting of this exemplar and its associated connections is created. Concerning the four parameters associated with this new module $m$, they are initially set to the following values: $d_m$ equals $k_d$, $c_m$ equals 0, $pa_m$ is the selected reflex, and $b_m$ is estimated as when no module is activated by the perceived situation (see the preceding section). The value of $k_d$ is 0.4, which corresponds to 1% of the whole input space.

5.3 Tuning exemplars

The third learning mechanism moves the positions of the exemplars $e^j_1, \ldots, e^j_n$ of the winning module $j$ in order to better cover the input subspace dominated by that module. That is, the coordinates of the $k$th exemplar, $v^j_k$, are updated in proportion to how well they match the perceived situation $x$:

$$v^j_k(t+1) = v^j_k(t) + \eta \, a^j_k(t) \, \bigl[ x(t) - v^j_k(t) \bigr], \qquad (6.6)$$

where $\eta$ is the learning rate. In the experiments reported below, the value of $\eta$ is 0.1.

5.4 Weight update

The classical associative search (AS) rule [Barto et al., 1983; Williams, 1992] is applied to the weights of the connections from the exemplars $e^j_1, \ldots, e^j_n$ to the output unit:

$$w^j_k(t+1) = w^j_k(t) + \alpha \, \bigl[ z(t+1) + b_i(t) - b_j(t) \bigr] \, \epsilon^j_k(t), \qquad (6.7)$$

where $\alpha$ is the learning rate, and $\epsilon^j_k$ is the eligibility factor. The intensity of the weight modifications depends on the relative merit of the action, which is just the error provided by the TD method.

The eligibility factor of a given weight measures how influential that weight was in choosing the action. In our experiments, $\epsilon^j_k$ is computed in such a manner that the learning rule corresponds to a gradient ascent mechanism on the expected reinforcement [Williams, 1992]:

$$\epsilon^j_k(t) = \frac{\partial \ln N}{\partial w^j_k}(t) = a^j_k(t) \, \frac{l(t) - \mu(t)}{\sigma^2(t)}, \qquad (6.8)$$

where $N$ is the normal distribution function in (6.2). The weights $w^j_k$ are modified more intensely in case of reward (i.e., when Teseo behaves better

than expected) than in case of penalty. These two values of $\alpha$ are 0.2 and 0.02, respectively. The aim here is that Teseo maintains the best situation-action rules known so far, while exploring other reaction rules.

6 LEARNING OPPORTUNITIES

Let us now present the four occasions on which learning takes place. The first arises during the classification phase, the next two happen after reacting, and the last one takes place when reaching the goal.

6.1 Unexperienced situation

If the perceived situation is not classified into one of the existing modules, then the basic reflexes get control of the robot, and the resource-allocating procedure creates a new exemplar, which is added either to one of the existing modules or to a new module.

6.2 Performing within expectations

If the perceived situation is classified into the module $j$ and $z(t+1) + b_i(t) - b_j(t) \geq k_z$, where $k_z$ is a negative constant, then (i) the exemplars of that module are tuned to make them closer to the situation, (ii) the weights associated with the connections between the exemplars and the output unit are modified using the AS reinforcement learning rule, (iii) $b_j$ is updated through TD(0), and (iv) $d_j$, $c_j$, and $pa_j$ are adapted.

The adaptive parameters are updated differently in case of reward than in case of penalty. In case of reward, $d_j$ is increased by 0.1, $c_j$ is reset to 0, and if the output of the neural controller, $pa + s$, is closer to a prototypical action other than $pa_j(t)$, then $pa_j(t+1)$ is set to that prototypical action. In case of penalty, $c_j$ is increased by 1 and $d_j$ is decreased by 0.1 if it is still greater than the threshold $k_d/2$, where $k_d$ is the initial value of $d_j$.
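Sections 5.3, 5.4 and the parameter adaptation above can be combined into a single "performing within expectations" step. The sketch below does so, again reusing the Module structure and the (l, mu, sigma) values returned by the earlier output-unit sketch; leaving $d_j$ untouched when a decrease would cross $k_d/2$, and omitting the possible switch of $pa_j$, are our simplifications.

```python
ETA = 0.1                          # exemplar tuning rate (section 5.3)
ALPHA_POS, ALPHA_NEG = 0.2, 0.02   # weight-update rates for reward / penalty (section 5.4)
K_D = 0.4                          # initial receptive-field width (section 5.2)

def learn_within_expectations(m, activations, x, error, l, mu, sigma):
    """One 'performing within expectations' update (sections 5.3, 5.4 and 6.2).

    m: winning module; x: perceived situation; error = z(t+1) + b_i(t) - b_j(t);
    (l, mu, sigma): values used by the output unit when the action was chosen.
    """
    # (6.6) pull each exemplar centre towards the situation, in proportion to its activation
    for v, a in zip(m.centres, activations):
        for i in range(len(v)):
            v[i] += ETA * a * (x[i] - v[i])

    # (6.7)-(6.8) associative search rule: gradient ascent on the expected reinforcement
    alpha = ALPHA_POS if error > 0 else ALPHA_NEG
    for k, a in enumerate(activations):
        eligibility = a * (l - mu) / (sigma ** 2)
        m.weights[k] += alpha * error * eligibility

    # Adaptation of the module parameters (section 6.2); the possible switch of the
    # prototypical action pa_j towards the executed action is omitted here.
    if error > 0:                  # reward: widen the receptive fields, reset the counter
        m.d += 0.1
        m.c = 0
    else:                          # penalty: shrink the fields (but not below k_d / 2)
        m.c += 1
        if m.d - 0.1 > K_D / 2:
            m.d -= 0.1
```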

6.3 Performing rather badly

If the perceived situation is classified into the module $j$ and $z(t+1) + b_i(t) - b_j(t) < k_z$, then the topology of the network is slightly altered and $d_j$ is decreased by 0.1 if it is still greater than the threshold $k_d/2$.

If the total future reinforcement computed after reacting, $z + b_i$, is considerably worse than the expected one, $b_j$, this means that the situation was incorrectly classified and needs to be classified into a different module.

The resource-allocating procedure creates a new exemplar, $e_u$, that has the same coordinates as the perceived situation, but does not add it to any module. The next time this situation (or a similar one) is faced, $e_u$ will be the closest exemplar. Consequently, no module will classify the situation and the basic reflexes will get control of the robot. Then, the resource-allocating procedure will add $e_u$ either to one of the existing modules or to a new module, as described in section 5.2.

This means that Teseo classifies situations, in a first step, based on the similarity of their input representations. Then, it also incorporates task-specific information for classifying based on the similarity of the reinforcements received.

6.4 Reaching the goal

Finally, whenever the goal is reached, the values $b_j$ and $pa_j$ of every winning module $j$ along the path to the goal are also updated, in reverse chronological order. To do that, Teseo stores, along the current path, triples consisting of the winning module $j(t)$, the action taken, and the reinforcement $z(t+1)$. Then, after reaching the goal at time $n+1$ and updating the last value $b_{j(n)}$, the value $b_{j(n-1)}$ is updated, and so on until $b_{j(1)}$. This technique only accelerates the convergence of the value $b_j$ of every module $j$, but does not change its steady value [Mahadevan and Connell, 1992].

Regarding the update of the value $pa_{j(t)}$, it is done before updating the value $b_{j(t)}$. That is, if $z(t+1) + b_{j(t+1)} - b_{j(t)} \geq k_z$ and the action taken is closer to a prototypical action other than $pa_{j(t)}$, then $pa_{j(t)}$ is set to that prototypical action.
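A compact way to picture the backward sweep of section 6.4 (leaving aside the update of the prototypical actions) is sketched below. It reuses td0_update from the TD(0) sketch; storing (module, reinforcement) pairs and taking the value at the goal to be 0 are our interpretation of the bookkeeping, not details given in the chapter.

```python
def backup_along_path(path, b, goal_value=0.0):
    """Reverse-chronological update of the b estimates when the goal is reached (section 6.4).

    path: (module index, reinforcement) pairs stored along the trajectory, in
    chronological order; b: dict mapping module index -> current estimate.
    Equation (6.5) is reapplied step by step, starting from the goal.
    """
    next_b = goal_value                 # assumed: no future cost once the goal is reached
    for j, z in reversed(path):
        b[j], _ = td0_update(b[j], next_b, z)
        next_b = b[j]
    return b

# Example: three moves, each costing -0.5, ending at the goal.
print(backup_along_path([(0, -0.5), (1, -0.5), (2, -0.5)], {0: -3.0, 1: -2.0, 2: -1.0}))
```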


Figure 5  The environment and the first trajectory generated for a starting location within the office. Note that Teseo has some problems in going through the doorway.

7 EXPERIMENTAL RESULTS

Teseo's performance has been tested on a corridor with offices on both sides. The task is to generate a short but safe trajectory from inside an office (of size 4.5 × 3.5 meters) to a point at the end of a corridor (1.8 meters wide). The office is cluttered with furniture and, occasionally, static boxes. Teseo reaches the target location every time, and it never gets lost or trapped inside concave obstacles. The first time it tries to reach the goal, it relies almost all the time on the basic reflexes, which make Teseo follow walls and move around obstacles. As illustrated in figure 5, in the first trial Teseo enters a dead-end section of the office (but does not get trapped in it) and even collides with the door frame because its sensors were not able to detect it. Collisions happened because the door frame is relatively thin and the incident angles of the rays drawn from the sonar and infrared sensors were too large, resulting in specular reflections.

Teseo learns to follow a smooth trajectory to the goal very quickly, in just ten trials (figure 6). The resulting neural controller consists of 48 modules and 89 exemplars.


Figure 6  Trajectory generated after travelling ten times to the goal.

This experiment was run fifteen times, obtaining similar results. Figures 7, 8 and 9 show the learning curves for one of the runs, and table 1 reports the mean and standard deviation of the final performances over the fifteen runs. Note that the performance improves from the first trajectory to the seventh. In the eighth trajectory, Teseo explores new actions that turn out to be inappropriate. Then, Teseo recovers its preceding level of performance. Note also that after the tenth trajectory Teseo's performance stabilizes. Figure 9 illustrates the number of exemplars and modules of the neural controller after the generation of each trajectory.

                        Mean     Standard Deviation
  Total Reinforcement   -6.52    0.2108
  Steps                  42.8    0.8718
  Exemplars              88.6    2.0010
  Modules                47.6    1.9596

Table 1  Mean and standard deviation of the final performances over ten runs.

Figure 7  Total reinforcement obtained along each trajectory.

Figure 8  Number of steps taken along each trajectory.


Figure 9  Number of exemplars and modules of the neural controller after the generation of each trajectory.

The resulting trajectories are quite smooth, even though the basic reflexes have been programmed in a discretized way.

Figure 10 illustrates instances of the reaction rules learned. For every location considered (little circles), the move to be taken is depicted. Figure 10 shows that Teseo generates solution paths from any starting location inside the room. This indicates that Teseo exhibits good generalization abilities, since it can handle many more situations than those previously perceived.

Once Teseo has learned efficient navigation strategies, occasional static obstacles are put in its way to the goal. As illustrated in figure 11, Teseo moves around the obstacles and then returns to the original, efficient trajectory. In other experiments, the goal location is changed. Teseo also learns to navigate to this new goal in a few trials, and it is still able to reach the first goal as efficiently as before.

However, a note of caution should be stated when talking about generalization capabilities since, because of its modular architecture and the use of exemplars with localized receptive fields, Teseo may not be able to navigate in radically new environments.


Figure 10  Generalization abilities: situation-action rules applied for a sample of locations within the office and the first part of the corridor.

Figure 11 Teseo's behavior when facing an unexpected obstacle.

To finish this section, we wish to mention that, given the robot testbed used in the reported experiments, we have only used static goals. But Millán [Millán, 1995; Millán, 1997] presents results with simulated and real robots that show how our approach deals with moving targets.

8 CONCLUSIONS

We have described a reinforcement learning architecture that allows an autonomous mobile robot to acquire efficient navigation strategies in a few trials. Besides rapid learning, the architecture has three further appealing features. First, the robot improves its performance incrementally as it interacts with an initially unknown environment, and it ends up learning to avoid collisions even in those situations in which its sensors cannot detect the obstacles. This is a definite advantage over non-learning reactive robots. Second, since it learns from basic reflexes, the robot is operational from the very beginning and the learning process is safe. Third, the robot exhibits high tolerance to noisy sensory data and reasonable generalization abilities. All these features make this learning robot architecture very well suited to real-world applications. For more details about the learning architecture, see [Millán, 1996].

Our approach suffers, however, from some limitations. The first limitation is that the reactive rules learned in a given environment do not generalize across new environments. This is a consequence of both the modular architecture and the use of exemplars with localized receptive fields: each module can only generalize locally. But the robot can rapidly adapt to the new environments while maintaining the same level of performance in the original one. Indeed, there exists a trade-off between generalization, on the one hand, and rapid, safe and incremental learning, on the other. The experimental results reported in this chapter illustrate one aspect of this trade-off, while Millán and Torras [Millán and Torras, 1992] illustrate the opposite aspect. They have proposed a neural controller for a simulated mobile robot that shows good generalization capabilities when facing new environments, but it requires much more time to learn, does not improve its performance incrementally, and applies its current knowledge to any new situation without any caution.

Another limitation is the number of parameters of the architecture that need to be manually tuned. Millán [Millán, 1997] describes a conceptually simpler architecture.

The main limitation is the strong dependence on a reliable odometry system that keeps track of the robot's relative position with respect to the goal. In the reported experiments, odometry is based entirely on dead-reckoning, and thus the goal has to be fixed and specified in Cartesian coordinates. In all the experiments we have carried out so far, dead-reckoning has proven sufficient to reach the goal. As long as its estimate of the position of the robot does not differ greatly from the actual one, the neural controller is still able to produce correct actions. But dead-reckoning will probably be insufficient in large environments. Millán [Millán, 1997] describes the incorporation of photosensitive sensors that allow the robot to recognize the goal and track it while moving. Furthermore, Millán and Arleo [Millán and Arleo, 1997] have developed a complementary NN technique to learn a map of an unknown structured environment incrementally and on-line. It creates a variable-resolution partitioning of the environment, from which a topological map is learned on the fly. It is worth noting that, once the environment is partitioned into a topological map, the robot must only learn efficient sensor-based strategies to move from a given node to the neighboring ones. Thus the acquired sensory-motor rules are goal-independent.