
Distributed Reinforcement Learning for

Self-Reconfiguring Modular Robots

by

Paulina Varshavskaya

Submitted to the Department of Electrical Engineering and Computer Science

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2007

© Massachusetts Institute of Technology 2007. All rights reserved.

Author: Department of Electrical Engineering and Computer Science

July 27, 2007

Certified by: Daniela Rus

Professor, Thesis Supervisor

Accepted by: Arthur C. Smith

Chairman, Department Committee on Graduate Students


Distributed Reinforcement Learning for Self-Reconfiguring Modular Robots

by Paulina Varshavskaya

Submitted to the Department of Electrical Engineering and Computer Science on July 27, 2007, in partial fulfillment of the

requirements for the degree of Doctor of Philosophy

Abstract

In this thesis, we study distributed reinforcement learning in the context of automating the design of decentralized control for groups of cooperating, coupled robots. Specifically, we develop a framework and algorithms for automatically generating distributed controllers for self-reconfiguring modular robots using reinforcement learning. The promise of self-reconfiguring modular robots is that of robustness, adaptability and versatility. Yet most state-of-the-art distributed controllers are laboriously hand-crafted and task-specific, due to the inherent complexities of distributed, local-only control. In this thesis, we propose and develop a framework for using reinforcement learning for automatic generation of such controllers. The approach is profitable because reinforcement learning methods search for good behaviors during the lifetime of the learning agent, and are therefore applicable to online adaptation as well as automatic controller design. However, we must overcome the challenges due to the fundamental partial observability inherent in a distributed system such as a self-reconfiguring modular robot.

We use a family of policy search methods that we adapt to our distributed problem. The outcome of a local search is always influenced by the search space dimensionality, its starting point, and the amount and quality of available exploration through experience. We undertake a systematic study of the effects that certain robot and task parameters, such as the number of modules, presence of exploration constraints, availability of nearest-neighbor communications, and partial behavioral knowledge from previous experience, have on the speed and reliability of learning through policy search in self-reconfiguring modular robots. In the process, we develop novel algorithmic variations and compact search space representations for learning in our domain, which we test experimentally on a number of tasks.

This thesis is an empirical study of reinforcement learning in a simulated lattice-based self-reconfiguring modular robot domain. However, our results contribute to the broader understanding of automatic generation of group control and design of distributed reinforcement learning algorithms.

Thesis Supervisor: Daniela Rus
Title: Professor


Acknowledgments

My advisor Daniela Rus has inspired, helped and supported me throughout. Were it not for the weekly dose of fear of disappointing her, I might never have found the motivation for the work it takes to finish. Leslie Kaelbling has been a great source of wisdom on learning, machine or otherwise. Much of the research presented here will be published elsewhere co-authored by Daniela and Leslie. And despite this work being done entirely in simulation, Rodney Brooks has agreed to be on my committee and given some very helpful comments on my project. I have been fortunate to benefit from discussions with some very insightful minds: Leonid Peshkin, Keith Kotay, Michael Collins and Sarah Finney have been particularly helpful. The members of the Distributed Robotics Lab and those of the Learning and Intelligent Systems persuasion have given me their thoughts, feedback, encouragement and friendship, for which I am sincerely thankful.

This work was supported by Boeing Corporation. I am very grateful for their support.

I cannot begin to thank my friends and housemates, past and present, of the Bishop Allen Drive Cooperative, who have showered me with their friendship and created the most amazing, supportive and intense home I have ever had the pleasure to live in. I am too lazy to enumerate all of you (because I know I cannot disappoint you) but you have a special place in my heart. To everybody who hopped in the car with me and left the urban prison behind, temporarily, repeatedly, to catch the sun, the rocks and the snow, to give ourselves a boost of sanity: thank you.

To my parents: thank you, my beloved mama and papa, for everything, and especially for giving me every opportunity to study throughout my life, even though my studies have taken me so far away from you. Thank you!

To my life- and soul-mate Luke: thank you for seeing me through this.


Contents

1 Introduction
  1.1 Problem statement
  1.2 Background
    1.2.1 Self-reconfiguring modular robots
    1.2.2 Research motivation vs. control reality
  1.3 Conflicting assumptions
    1.3.1 Assumptions of a Markovian world
    1.3.2 Assumptions of the kinematic model
    1.3.3 Possibilities for conflict resolution
    1.3.4 Case study: locomotion by self-reconfiguration
  1.4 Overview
  1.5 Thesis contributions
  1.6 Thesis outline

2 Related Work
  2.1 Self-reconfiguring modular robots
    2.1.1 Hardware lattice-based systems
    2.1.2 Automated controller design
    2.1.3 Automated path planning
  2.2 Distributed reinforcement learning
    2.2.1 Multi-agent Q-learning
    2.2.2 Hierarchical distributed learning
    2.2.3 Coordinated learning
    2.2.4 Reinforcement learning by policy search
  2.3 Agreement algorithms

3 Policy Search in Self-Reconfiguring Modular Robots
  3.1 Notation and assumptions
  3.2 Gradient ascent in policy space
  3.3 Distributed GAPS
  3.4 GAPS learning with feature spaces
  3.5 Experimental Results
    3.5.1 The experimental setup
    3.5.2 Learning by pooling experience
    3.5.3 MDP-like learning vs. gradient ascent
    3.5.4 Learning in feature spaces
    3.5.5 Learning from individual experience
    3.5.6 Comparison to hand-designed controllers
    3.5.7 Remarks on policy correctness and scalability
  3.6 Key issues in gradient ascent for SRMRs
    3.6.1 Number of parameters to learn
    3.6.2 Quality of experience
  3.7 Summary

4 How Constraints and Information Affect Learning
  4.1 Additional exploration constraints
  4.2 Smarter starting points
    4.2.1 Incremental GAPS learning
    4.2.2 Partially known policies
  4.3 Experiments with pooled experience
    4.3.1 Pre-screening for legal motions
    4.3.2 Scalability and local optima
    4.3.3 Extra communicated observations
    4.3.4 Effect of initial condition on learning results
    4.3.5 Learning from better starting points
  4.4 Experiments with individual experience
  4.5 Discussion

5 Agreement in Distributed Reinforcement Learning
  5.1 Agreement Algorithms
    5.1.1 Basic Agreement Algorithm
    5.1.2 Agreeing on an Exact Average of Initial Values
    5.1.3 Distributed Optimization with Agreement Algorithms
  5.2 Agreeing on Common Rewards and Experience
  5.3 Experimental Results
  5.4 Discussion

6 Synthesis and Evaluation: Learning Other Behaviors
  6.1 Learning different tasks with GAPS
    6.1.1 Building towers
    6.1.2 Locomotion over obstacles
  6.2 Introducing sensory information
    6.2.1 Minimal sensing: spatial gradients
    6.2.2 A compact representation: minimal learning program
    6.2.3 Experiments: reaching for a random target
  6.3 Post-learning knowledge transfer
    6.3.1 Using gradient information as observation
    6.3.2 Experiments with gradient sensing
    6.3.3 Using gradient information for policy transformations
    6.3.4 Experiments with policy transformation
  6.4 Discussion

7 Concluding Remarks
  7.1 Summary
  7.2 Conclusions
  7.3 Limitations and future work


Chapter 1

Introduction

The goal of this thesis is to develop and demonstrate a framework and algorithms for decentralized automatic design of distributed control for groups of cooperating, coupled robots. This broad area of research will advance the state of knowledge in distributed systems of mobile robots, which are becoming ever more present in our lives. Swarming robots, teams of autonomous vehicles, and ad hoc mobile sensor networks are now used to automate aspects of exploration of remote and dangerous sites, search and rescue, and even conservation efforts. The advance of coordinated teams is due to their natural parallelism: they are more efficient than single agents; they cover more terrain more quickly; they are more robust to random variation and random failure. As human operators and beneficiaries of multitudes of intelligent robots, we would very much like to just tell them to go, and expect them to decide on their own exactly how to achieve the desired behavior, how to avoid obstacles and pitfalls along the way, how to coordinate their efforts, and how to adapt to an unknown, changing environment. Ideally, a group of robots faced with a new task, or with a new environment, should take the high-level task goal as input and automatically decide what team-level and individual-level interactions are needed to make it happen.

The reality of distributed mobile system control is much different. Distributed programs must be painstakingly developed and verified by human designers. The interactions and interference the system will be subject to are not always clear. And development of control for such distributed mobile systems is in general difficult for serially-minded human beings. Yet automatic search for the correct distributed controller is problematic: there is a combinatorial explosion of the search space due to the multiplicity of agents, and there are intrinsic control difficulties for each agent due to the limitations of local information and sensing. The difficulties are dramatically multiplied when the search itself needs to happen in a completely distributed way with the limited resources of each agent.

This thesis is a study in automatic adaptable controller design for a particular class of distributed mobile systems: self-reconfiguring modular robots. Such robots are made up of distinct physical modules, which have degrees of freedom (DOFs) to move with respect to each other and to connect to or disconnect from each other. They have been proposed as versatile tools for unknown and changing situations, when missions are not pre-determined, in environments that may be too remote or hazardous for humans.


These are the types of scenarios where a robot's ability to transform its shape and function would be undoubtedly advantageous. Clearly, adaptability and automatic decision-making with respect to control are essential in any such situation.

A self-reconfiguring modular robot is controlled in a distributed fashion by the many processors embedded in its modules (usually one processor per module), where each processor is responsible for only a few of the robot's sensors and actuators. For the robot as a whole to cohesively perform any task, the modules need to coordinate their efforts without any central controller. As with other distributed mobile systems, in self-reconfiguring modular robots (SRMRs) we give up on easier centralized control in the hope of increased versatility and robustness. Yet most modular robotic systems run task-specific, hand-designed algorithms. For example, a distributed localized rule-based system controlling adaptive locomotion (Butler et al. 2001) has 16 manually designed local rules, whose asynchronous interactions needed to be anticipated and verified by hand. A related self-assembly system had 81 such rules (Kotay & Rus 2004). It is very difficult to manually generate correct distributed controllers of this complexity.

Instead, we examine how to automatically generate local controllers for groups of robots capable of local sensing and communications to near neighbors. Both the controllers and the process of automation itself must be computationally fast and only require the kinds of resources available to individuals in such decentralized groups.

In this thesis we present a framework for automating distributed controller design for SRMRs through reinforcement learning (RL) (Sutton & Barto 1999): the intuitive framework where agents observe and act in the world and receive a signal which tells them how well they are doing. Learning may be done from scratch (with no initial knowledge of how to solve the task), or it can be seeded with some amount of information: received wisdom directly from the designer, or solutions found in prior experience. Automatic learning from scratch will be hindered by the large search spaces pervasive in SRMR control. It is therefore to be expected that constraining and guiding the search with extra information will in no small way influence the outcome of learning and the quality of the resulting distributed controllers. Here, we undertake a systematic study of the parameters and constraints influencing reinforcement learning in SRMRs.

It is important to note that unlike with evolutionary algorithms, reinforcement learning optimizes during the lifetime of the learning agent, as it attempts to perform a task. The challenge in this case is that learning itself must be fully distributed. The advantage is that the same paradigm can be used to (1) automate controller design, and (2) run distributed adaptive algorithms directly on the robot, enabling it to change its behavior as the environment, its goal, or its own composition changes. Such capability of online adaptation would take us closer to the goal of more versatile, robust and adaptable robots through modularity and self-reconfiguration. We are ultimately pursuing both goals (automatic controller generation and online adaptation) in applying RL methods to self-reconfigurable modular robots.

Learning distributed controllers, especially in domains such as SRMRs where neighbors interfere with each other and determine each other's immediate capabilities, is an extremely challenging problem for two reasons: the space of possible behaviors is huge, and it is plagued by numerous local optima which take the form of bizarre and dysfunctional robot configurations.


In this thesis, we describe strategies for minimizing dead-end search in the neighborhoods of such local optima by using easily available information, smarter starting points, and inter-module communication. The best strategies take advantage of the structural properties of modular robots, but they can be generalized to a broader range of problems in cooperative group behavior. To this end, we have developed novel variations on distributed learning algorithms. The undertaking is both profitable and hard due to the nature of the platforms we study: SRMRs have exceedingly large numbers of degrees of freedom, yet may be constrained in very specific ways by their structural composition.

1.1 Problem statement

Is distributed learning of control possible in the SRMR domain? If so, is it a trivial or a hard application? Where are the potential difficulties, and how do specific parameters of the robot design and the task to be learned affect the speed and reliability of learning? What are the relevant sources of information and search constraints that enable the learning algorithms to steer away from dead-end local optima? How might individuals share their local information, experience, and rewards with each other? How does sharing affect the process and the outcome of learning? What can be said about the learned distributed controllers? Is there any common structure or knowledge that can be shared across tasks?

The goal of this thesis is to systematically explore the above questions, and provide an empirical study of reinforcement learning in the SRMR domain that bears on the larger issues of automating generation of group control, and design of distributed reinforcement learning algorithms.

In order to understand precisely the contributions of our work, we will now describe its position and relevance within the context of the two fields of concern to us: self-reconfiguring modular robots and reinforcement learning. In the rest of this introductory chapter, we give the background on both SRMRs and reinforcement learning, describe and resolve a fundamental conflict of assumptions in those fields, present a circumscribed case study which will enable us to rigorously test our algorithms, and give an outline of the thesis.

1.2 Background

1.2.1 Self-reconfiguring modular robots

Self-reconfiguring modular robotics (Everist et al. 2004, Kotay & Rus 2005, Murata et al. 2001, Shen et al. 2006, Yim et al. 2000, Zykov et al. 2005) is a young and growing field of robotics with a promise of robustness, versatility and self-organization. Robots made of a large number of identical modules, each of which has computational, sensory, and motor capabilities, are proposed as universal tools for unknown environments.


Figure 1-1: Some recent self-reconfiguring modular robots (SRMRs): (a) The Molecule (Kotay & Rus 2005), (b) SuperBot (Shen et al. 2006). Both can act as lattice-based robots. SuperBot can also act as a chain-type robot (reprinted with permission).

Exemplifying the vision of modularity is the idea of a self-built factory on another planet: small but strong and smart modules can be packed tightly into a spaceship, then proceed to build structures out of themselves, as needed, in a completely autonomous and distributed fashion. A malfunction in the structure can be easily repaired by disconnecting the offending modules, to be replaced by fresh stock. A new configuration can be achieved without the cost of additional materials. A number of hardware systems have been designed and built with a view to approaching this vision. A taxonomy of self-reconfiguring hardware platforms can be built along two dimensions: the structural constraints on the robot topology and the mechanism of reconfiguration.

Lattice vs. chain robots

SRMRs can be distinguished by their structural topology. Lattice-based robots are constructed with modules fitting on a geometric lattice (e.g., cubes or tetrahedra). Connections may be made at any face of the module that corresponds to a lattice intersection. By contrast, chain-based robots are constructed with modules forming a chain, tree or ring. This distinction is best understood in terms of the kinds of degrees of freedom and motions possible in either case. In lattice-based robots, modules are usually assumed to not have individual mobility other than the degrees of freedom required to connect to a neighboring site and disconnect from the current one. Self-reconfiguration is therefore required for any locomotion and most other tasks. On the other hand, chain-type SRMRs may use the degrees of freedom in the chain (or tree branch) as "limbs" for direct locomotion, reaching or other mobility tasks. A chain-type robot may move as a snake in one configuration, and as a hexapod insect in another one. Connections and disconnections of modules occur only when the robot needs to change configurations. But a lattice-based robot needs to keep moving, connecting and disconnecting its modules in order to move its center of mass in some direction (locomotion), as well as when a specific reconfiguration is required.


Figure 1-1 shows two recent systems. The first one, called the Molecule (Kotay & Rus 2005), is a pure lattice system.

The robot is composed of male and female "atoms", which together form a "molecule" structure that is also the basis of a lattice. The second system, in figure 1-1b, is SuperBot (Shen et al. 2006), which is a hybrid. On the one hand, it may function in a lattice-based fashion similar to the Molecule: SuperBot modules are cubic, with six possible connection sites, one at each face of the cube, which makes for a straightforward cubic grid. On the other hand, the robot can be configured topologically and function as a chain (e.g., in a snake configuration) or tree (e.g., in a tetrapod, hexapod or humanoid configuration).

While chain-type or hybrid self-reconfiguring systems are better suited for tasks requiring high mobility, lattice-based robots are useful in construction and shape-maintenance scenarios, as well as for self-repair. The research effort in designing lattice-based robots and developing associated theory has also been directed towards future systems and materials. As an example, miniaturization may lead to fluid-like motion through lattice-based self-reconfiguration. At the other extreme, we might envision self-reconfiguring buildings built of block modules that are able, and required, to infrequently change their position within the building.

The focus of this thesis is mainly on lattice-based robots. However, chain-type systems may also benefit from adaptable controllers designed automatically through reinforcement learning.

Deterministic vs. stochastic reconfiguration

SRMRs may also be distinguished according to their mode of reconfiguration. The systems we label here as deterministic achieve reconfiguration by actively powered, motorized joints and/or gripping mechanisms within each module. Each module within such a system, when it finds itself in a certain local configuration, may only execute a finite and usually small number of motions that would change its position relative to the rest of the robot, and result in a new configuration. Both the Molecule and SuperBot (figure 1-1) are deterministic systems under this taxonomy.

On the other hand, systems such as programmable parts (Bishop et al. 2005), self-replicating robots from random parts (Griffith et al. 2005) or robots for in-space assembly (Everist et al. 2004) are composed of passive elements which require extraneous energy provided by the environment. Actuation, and therefore assembly and reconfiguration, occurs due to stochastic influences in this energy-rich environment, through random processes such as Brownian motion.

The focus of this thesis is exclusively on self-actuated, deterministic SRMRs that do not require extra energy in the environment.

1.2.2 Research motivation vs. control reality

Development of self-reconfiguring modular robotic systems, as well as theory and algorithms applicable to such systems, can be motivated by a number of problems. On the one hand, researchers are looking to distill the essential and fundamental properties of self-organization through building and studying self-assembling, self-replicating and self-reconfiguring artificial systems.


This nature-inspired scientific motive is pervasive especially in stochastically reconfigurable robotics, which focuses on self-assembly.

The designers of actuated, deterministic SRMRs, on the other hand, tend to motivate their efforts with the promise of versatility, robustness and adaptability inherent in modular architectures. Modular robots are expected to become a kind of universal tool, changing shape and therefore function as the task requirements change. It is easy to imagine, for example, a search and rescue scenario with such a versatile modular robot. Initially in a highly mobile legged configuration that enables the robot to walk fast over rough terrain, it can reconfigure into a number of serpentine shapes that burrow into the piles of rubble, searching in parallel for signs of life. Another example, often mentioned in the literature, concerns space and planetary exploration. A modular robot can be tightly packed into a relatively small container for travel on a spaceship. Upon arrival, it can deploy itself according to its mission (a "rover" without wheels, a lunar factory) by moving its modules into a correct configuration. If any of the modules should break down or become damaged, the possibility of self-reconfiguration allows for their replacement with fresh stock, adding the benefit of possible self-repair to modular robots.

The proclaimed qualities of versatility, robustness and adaptability with reference to the ideal "killer apps" are belied by the rigidity of state-of-the-art controllers available for actuated self-reconfiguring modular robots.

Existing controllers

Fully distributed controllers of many potentially redundant degrees of freedom are required for robustness and adaptability. Compiling a global high-level goal into distributed control laws or rules based solely on local interactions is a notoriously difficult exercise. Yet currently, most modular robotic systems run task-specific, hand-designed algorithms, for example, the rule-based systems in (Butler et al. 2004), which took hours of designer time to synthesize. One such rule-based controller for a two-dimensional lattice robot is reproduced here in figure 1-2. These rules guarantee eastward locomotion by self-reconfiguration on a square lattice.

Automating the design of such distributed rule systems and control laws is clearly needed, and the present work offers a framework in which it can be done. There has been some prior exploration of such automation. In some systems, the controllers were automatically generated using evolutionary techniques in simulation before being applied to the robotic system itself (Kamimura et al. 2004, Mytilinaios et al. 2004). Current research in automatic controller generation for SRMRs and related distributed systems will be examined more closely in chapter 2. One of the important differences between the frameworks of evolutionary algorithms and reinforcement learning is the perspective from which optimization occurs.


Figure 1-2: A hand-designed rule-based controller for locomotion by self-reconfiguration on a square lattice. Reprinted with permission from Butler et al. (2004).

Adaptation through reinforcement learning

Let us elaborate on this point. Evolutionary techniques operate on populations of controllers, each of which is allowed to run for some time, after which its fitness is measured and used as the objective function for the optimization process. Optimization happens over generations of agents running these controllers. In contrast, in the reinforcement learning framework, optimization happens over the lifetime of each agent. The parameters of the controller are modified in small, incremental steps, at runtime, either at every time step or episodically, after a finite and relatively small number of time steps. This difference means that reinforcement learning techniques may be used both for automated controller generation and for online adaptation to a changing task or environment. These are the twin goals of using RL algorithms in self-reconfiguring modular robots. While this thesis is primarily concerned with the first goal of automated controller generation, we always keep in mind the potential for online adaptation that is provided by the techniques presented here.

We now turn to the question of whether the problems and tasks faced by SRMRs are indeed amenable to optimization by reinforcement learning. This is not a trivial question, since modular robots are distributed, locally controlled systems of agents that are fundamentally coupled to each other and the environment. The question is: are the assumptions made by reinforcement learning algorithms satisfied in our domain?

1.3 Conflicting assumptions

The most effective of the techniques collectively known as reinforcement learning all make a number of assumptions about the nature of the environment in which the learning takes place: specifically, that the world is stationary (not changing over time in important ways) and that it is fully observed by the learning agent. These assumptions lead to the possibility of powerful learning techniques and accompanying theorems that provide bounds and guarantees on the learning process and its results. Unfortunately, these assumptions are usually violated by any application domain involving a physical robot operating in the real world: the robot cannot know directly the state of the world, but may observe a local, noisy measure on parts of it.


Figure 1-3: The reinforcement learning framework. (a) In a fully observable world, the agent can estimate a value function for each state and use it to select its actions. (b) In a partially observable world, the agent does not know which state it is in due to sensor limitations; instead of a value function, the agent updates its policy parameters directly.

We will demonstrate below that in the case of modular robots, even their very simplified abstract kinematic models violate the assumptions necessary for the powerful techniques to apply and the theorems to hold. As a result of this conflict, some creativity is essential in designing or applying learning algorithms to the SRMR domain.

1.3.1 Assumptions of a Markovian world

Before we examine closely the conflict of assumptions, let us briefly review the concept of reinforcement learning.

Reinforcement learning

Consider the class of problems in which an agent, such as a robot, has to achieve some task by undertaking a series of actions in the environment, as shown in figure 1-3a. The agent perceives the state of the environment, selects an action to perform from its repertoire and executes it, thereby affecting the environment, which transitions to a new state. The agent also receives a scalar signal indicating its level of performance, called the reward signal. The problem for the agent is to find a good strategy (called a policy) for selecting actions given the states. The class of such problems can be described by stochastic processes called Markov decision processes (see below). The problem of finding an optimal policy can be solved in a number of ways. For instance, if a model of the environment is known to the agent, it can use dynamic programming algorithms to optimize its policy (Bertsekas 1995). However, in many cases the model is either unknown initially or impossibly difficult to compute. In these cases, the agent must act in the world to learn how it works.


Reinforcement learning is sometimes used in the literature to refer to the class of problems that we have just described. It is more appropriately used to name the set of statistical learning techniques employed to solve this class of problems in the cases when a model of the environment is not available to the agent. The agent may learn such a model, and then solve the underlying decision process directly. Or it may estimate a value function associated with every state it has visited (as is the case in figure 1-3a) from the reward signal it has received. Or it may update the policy directly using the reward signal. We refer the reader to a textbook (Sutton & Barto 1999) for a good introduction to the problems addressed by reinforcement learning and the details of standard solutions.
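For reference, the value function mentioned here is standardly defined as the expected discounted sum of future rewards obtained when following a policy π from state s; this is the textbook definition, not notation specific to this thesis:

\[ V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s \right], \qquad 0 \le \gamma < 1, \]

where r_t is the reward received at time step t and γ is a discount factor.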

It is assumed that the way the world works does not change over time, so that the agent can actually expect to optimize its behavior with respect to the world. This is the assumption of a stationary environment. It is also assumed that the probability of the world entering the state s′ at the next time step is determined solely by the current state s and the action chosen by the agent. This is the Markovian world assumption, which we formalize below. Finally, it is assumed that the agent knows all the information it needs about the current state of the world and its own actions in it. This is the full observability assumption.
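In symbols, writing T for the transition function, the Markovian and stationarity assumptions together state that (in standard notation, added here for illustration)

\[ \Pr(s_{t+1} = s' \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = \Pr(s_{t+1} = s' \mid s_t, a_t) = T(s_t, a_t, s'), \]

with T independent of the time step t; full observability additionally means that the agent's observation at time t is the state s_t itself.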

Multi-agent MDPs

When more than one agent is learning to behave in the same world, which they all affect, this more complex interaction is usually described as a multi-agent MDP. Different formulations of processes involving multiple agents exist, depending on the assumptions we make about the information available to the learning agent or agents.

In the simplest case, we imagine that learning agents have access to a kind of oracle that observes the full state of the process, but execute factored actions (i.e., what counts as one action of the MDP is the result of coordinated mini-actions of all modules executed at the same time). The individual components of a factored action need to be coordinated among the agents, and information about the actions taken by all modules also must be available to every learning agent. This formulation satisfies all of the strong assumptions of a Markovian world, including full observability. However, as we argue later in section 1.3.2, learning and maintaining a policy with respect to each state is unrealistic in practical SRMRs, as well as prohibitively expensive in both experience and amount of computation.

Partially observable MDPs

Instead we can assume that each agent only observes a part of the state that is local to its operation. If the agents interact through the environment but do not otherwise communicate, then each agent is learning to behave in a partially observable MDP (POMDP), ignoring the distributed nature of the interaction, and applying its learning algorithm independently. Figure 1-3b shows the interaction between an agent and the environment when the latter is only partially observable through the agent's sensors. In the partially observable formulation, there is no oracle, and the modules have access only to factored (local, limited, perhaps noisy) observations over the state.


In this case, the assumption of full observability is no longer satisfied. In general, powerful learning techniques such as Q-learning (Watkins & Dayan 1992) are no longer guaranteed to converge to a solution, and optimal behavior is much harder to find.
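For reference, the standard Q-learning update (Watkins & Dayan 1992) has the form

\[ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right], \]

where α is a learning rate and γ a discount factor; its convergence guarantee relies on s being the true, fully observed state. When a module can only substitute a local observation o for s, the same update applied to Q(o, a) loses that guarantee.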

Furthermore, the environment with which each agent interacts comprises all the other agents, who are learning at the same time and thus changing their behavior. The world is non-stationary due to many agents learning at once, and thus cannot even be treated as a POMDP, although in principle agents could build a weak model of the competence of the rest of the agents in the world (Chang et al. 2004).

1.3.2 Assumptions of the kinematic model

Figure 1-4: The sliding-cube kinematic model for lattice-based modular robots: (1) a sliding transition, and (2) a convex transition.

How does our kinematic model fit into these possible sets of assumptions? We base most of our development and experiments on the standard sliding-cube model for lattice-based self-reconfigurable robots (Butler et al. 2004, Fitch & Butler 2006, Varshavskaya et al. 2004, Varshavskaya et al. 2007). In the sliding-cube model, each module of the robot is represented as a cube (or a square in two dimensions), which can be connected to other module-cubes at any of its six (four in 2D) faces. The cube cannot move on its own; however, it can move, one cell of the lattice at a time, on a substrate of like modules in the following two ways, as shown in figure 1-4: 1) if the neighboring module M1 that it is attached to has another neighbor M2 in the direction of motion, the moving cube can slide to a new position on top of M2, or 2) if there is no such neighbor M2, the moving cube can make a convex transition to the same lattice cell in which M2 would have been. Provided the relevant neighbors are present in the right positions, these motions can be performed relative to any of the six faces of the cubic module. The motions in this simplified kinematic model can actually be executed by physically implemented robots (Butler et al. 2004). Of those mentioned in section 1.2.1, the Molecule, MTRAN-II, and SuperBot can all form configurations required to perform the motions of the kinematic model, and therefore controllers developed in simulation for this model are potentially useful for these robots as well.
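To make the motion rules concrete, the sketch below checks the two primitives on a 2D grid. This is an illustrative reimplementation under our own conventions (a set of occupied lattice coordinates, hypothetical function names), not code from any published simulator.

# 2D sliding-cube motion check. `occupied` is a set of (x, y) lattice
# cells containing modules; `pos` is the moving module's cell; `d` is a
# unit step in the direction of motion, e.g. (1, 0); `p` is a unit step
# toward the face neighbor M1 the module pivots about, perpendicular to d.

def can_slide(occupied, pos, d, p):
    # Sliding transition: neighbor M1 at pos+p has its own neighbor M2
    # at pos+p+d in the direction of motion, and the target cell pos+d
    # (on top of M2) is free.
    m1 = (pos[0] + p[0], pos[1] + p[1])
    m2 = (m1[0] + d[0], m1[1] + d[1])
    target = (pos[0] + d[0], pos[1] + d[1])
    return m1 in occupied and m2 in occupied and target not in occupied

def can_convex(occupied, pos, d, p):
    # Convex transition: M1 is present but M2 is not; the module swings
    # around M1 into the cell M2 would have occupied, sweeping through
    # the face-adjacent cell pos+d, which must also be free.
    m1 = (pos[0] + p[0], pos[1] + p[1])
    m2 = (m1[0] + d[0], m1[1] + d[1])
    swept = (pos[0] + d[0], pos[1] + d[1])
    return m1 in occupied and m2 not in occupied and swept not in occupied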

The sliding-cube model assumes that the agents are the individual modules of the model. In physical instantiations, more than one physical module may coordinate to comprise one simulation "cube".


The modules have limited resources, which affects any potential application of learning algorithms:

limited actuation: each module can execute one of a small set of discrete actions; each action takes a fixed amount of time to execute, potentially resulting in a unit displacement of the module in the lattice

limited power: actuation requires a significantly greater amount of power than computation or communication

limited computation and memory: on-board computation is usually limited to microcontrollers

clock: the system may be either synchronized to a common clock, which is a rather unrealistic assumption, or asynchronous [1], with every module running its code in its own time

Clearly, if an individual module knew the global configuration and position of the entire robot, as well as the action each module is about to take, then it would also know exactly the state (i.e., position and configuration) of the robot at the next time step, as we assume a deterministic kinematic model. Thus, the world can be Markovian. However, this global information is not available to individual modules due to limitations in computational power and communications bandwidth. They may communicate with their neighbors at each of their faces to find out the local configuration of their neighborhood region. Hence, each learning agent has only a partial observation of the world state, and our kinematic model assumptions are in conflict with those of powerful RL techniques.

1.3.3 Possibilities for conflict resolution

We have established that modular robots have intrinsic partial observability, since the decisions of one module can only be based on its own state and the observations it can gather from local sensors. Communications may be present between neighboring modules, but there is generally no practical possibility for one module to know or infer the total system state at every timestep, except perhaps for very small-scale systems.

Therefore, for SRMR applications, we see two possibilities for resolving the fundamental conflict between locally observing, acting and learning modules and the Markov assumption. We can either resign ourselves to learning in a distributed POMDP, or we can attempt to orchestrate a coordination scheme between modules. In the first case, we lose convergence guarantees for powerful learning techniques such as Q-learning, and must resort to less appealing algorithms. In the second case, that of coordinated MDPs, we may be able to employ powerful solutions in practice (Guestrin et al. 2002, Kok & Vlassis 2006); however, the theoretical foundations of such an application are not as sound as those of the single-agent MDP. In this thesis, we focus on algorithms developed for partially observable processes.

[1] Later, in chapter 5, we introduce the notion of partially asynchronous execution, which assumes an upper bound on communication delays.


1.3.4 Case study: locomotion by self-reconfiguration

We now examine the problem of synthesizing locomotion gaits for lattice-based modular self-reconfiguring robots. The abstract kinematic model of a 2D lattice-based robot is shown in figure 1-5. The modules are constrained to be connected to each other in order to move with respect to each other; they are unable to move on their own. The robot is positioned on an imaginary 2D grid; and each module can observe at each of its 4 faces (positions 1, 3, 5, and 7 on the grid) whether or not there is another connected module on that neighboring cell. A module (call it M) can also ask those neighbors to confirm whether or not there are other connected modules at the corner positions (2, 4, 6, and 8 on the grid) with respect to M. These eight bits of observation comprise the local configuration that each module perceives.

Figure 1-5: The setup for locomotion by self-reconfiguration on a lattice in 2D.

Thus, the module M in the figure knows that lattice cells number 3, 4, and 5 have other modules in them, but lattice cells 1, 2, and 6-8 are free space. M has a repertoire of 9 actions, one for moving into each adjacent lattice cell (face or corner), and one for staying in the current position. Given any particular local configuration of the neighborhood, only a subset of those actions can be executed, namely those that correspond to the sliding or convex motions about neighbor modules at any of the face sites. This knowledge may or may not be available to the modules. If it is not, then the module needs to learn which actions it will be able to execute by trying and failing.
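An illustrative encoding of this local observation and action repertoire, in the same style as the earlier sketch, is shown below; the particular assignment of grid positions 1-8 to lattice offsets is our own assumption for illustration and follows figure 1-5 only loosely.

# Offsets of the eight neighboring cells around a module, alternating
# face and corner neighbors. The mapping of indices 1-8 to specific
# offsets is assumed here, not taken from the thesis.
NEIGHBOR_OFFSETS = [(-1, 0), (-1, 1), (0, 1), (1, 1),
                    (1, 0), (1, -1), (0, -1), (-1, -1)]

def local_observation(occupied, pos):
    """Return the 8-bit local configuration around `pos` as a tuple of 0/1."""
    return tuple(int((pos[0] + dx, pos[1] + dy) in occupied)
                 for dx, dy in NEIGHBOR_OFFSETS)

# Nine actions: move into any of the eight adjacent cells, or stay put.
ACTIONS = NEIGHBOR_OFFSETS + [(0, 0)]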

The modules are learning a locomotion gait, with the eastward direction of motion receiving positive rewards and the westward receiving negative rewards. Specifically, a reward of +1 is given for each lattice-unit of displacement to the East, −1 for each lattice-unit of displacement to the West, and 0 for no displacement. M's goal in figure 1-5 is to move right. The objective function to be maximized through learning is the displacement of the whole robot's center of mass along the x axis, which is equivalent to the average displacement incurred by individual modules.
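Restating this objective in equation form, with N modules, x_i(t) the x-coordinate of module i at time t, and T the episode length:

\[ J = \Delta x_{\mathrm{cm}} = \frac{1}{N} \sum_{i=1}^{N} \bigl( x_i(T) - x_i(0) \bigr), \]

so maximizing the center-of-mass displacement along the x axis is the same as maximizing the average displacement of the individual modules.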

Clearly, if the full global configuration of this modular robot were known to each module, then each module could learn to select the best action in any global configuration with an MDP-based algorithm. If there are only two or three modules, this may be done, as there are only 4 (or 18, respectively) possible global configurations.


However, the 8 modules shown in figure 1-5 can be in a prohibitively large number of global configurations; and we are hoping to build modular systems with tens and hundreds of modules. Therefore it is more practical to resort to learning in a partially observable world, where each module only knows about its local neighborhood configuration.

The algorithms we present in this thesis were not developed exclusively for this scenario. They are general techniques applicable to multi-agent domains. However, the example problem above is useful in understanding how they instantiate to the SRMR domain. Throughout the thesis, it will be used in experiments to illustrate how information, search constraints and the degree of inter-module communication influence the speed and reliability of reinforcement learning in SRMRs.

1.4 Overview

This thesis provides an algorithmic study, evaluated empirically, of using distributed reinforcement learning by policy search with different information sources for learning distributed controllers in self-reconfiguring modular robots.

At the outset, we delineate the design parameters and aspects of both robot and task that contribute to three dimensions along which policy search algorithms may be described: the number of parameters to be learned (the number of knobs to tweak), the starting point of the search, and the amount and quality of experience that modules can gather during the learning process. Each contribution presented in this thesis relates in some way to manipulating at least one of these issues of influence in order to increase the speed and reliability of learning and the quality of its outcome.

For example, we present two novel representations for lattice-based SRMR control, both of which aim to compress the search space by reducing the number of parameters to learn. We quickly find, however, that compact representations require careful construction, shifting the design burden away from design of behavior, but ultimately mandating full human participation. On the other hand, certain constraints on exploration are very easy for any human designer to articulate, for example, that the robot should not attempt physically impossible actions. These constraints effectively reduce the number of parameters to be learned. Other simple suggestions that can be easily articulated by the human designer (for example: try moving up when you see neighbors on the right) can provide good starting points to policy search, and we also explore their effectiveness.

Further, we propose novel variations on the theme of policy search which are specifically geared to the learning of group behavior. For instance, we demonstrate how incremental policy search, in which each new learning process starts from the point at which a previous run with fewer modules had stopped, reduces the likelihood of being trapped in an undesirable local optimum, and results in better learned behaviors. This algorithm is thus a systematic way of generating good starting points for distributed group learning.

In addition, we develop a framework for incorporating agreement (also known as consensus) algorithms into policy search, and we present a new algorithm with which individual learners in distributed systems such as SRMRs can agree on both the rewards and the experience gathered during the learning process.


This information can be incorporated into learning updates, and will result in consistently better policies and markedly faster learning. This algorithmic innovation thus provides a way of increasing the amount and quality of individuals' experience during learning. Active agreement also has an effect on the kind of policies that are learned.

Finally, as we evaluate our claims and algorithms on a number of tasks requiring SRMR cooperation, we also inquire into the possibility of post-learning knowledge transfer between policies which were learned for different tasks from different reward functions. It turns out that local geometric transformations of learned policies can provide another systematic way of automatically generating good starting points for further learning.

1.5 Thesis contributions

This thesis contributes to the body of knowledge on modular robots in that it provides a framework and algorithms for automating controller generation in a distributed way in SRMRs. It contributes to multi-agent reinforcement learning by establishing a new application area, as well as proposing novel algorithmic variations. We see the following specific contributions:

• A framework for distributed automation of SRMR controller search through reinforcement learning with partial observability

• A systematic empirical study of search constraints and guidance through extra information and inter-agent communications

• An algorithm for incremental learning using the additive nature of SRMRs

• An algorithm for policy search with spatial features

• An algorithm for distributed reinforcement learning with inter-agent agreement on rewards and experience

• A novel minimal representation for learning in lattice-based SRMRs

• A study of behavioral knowledge transfer through spatial policy transformation

• Empirical evaluation on the following tasks:

- directed locomotion

- directed locomotion over obstacle fields

- tall structure building

- reaching to random goal positions


These contributions bring us forward towards delivering on the promise of adaptability and versatility of SRMRs, properties often used to motivate SRMR research but conspicuously absent from most current systems. The results of this study are significant because they present a framework in which future efforts in automating controller generation and online adaptation of group behavior can be grounded. In addition, this thesis establishes empirically where the difficulties in such cooperative group learning lie and how they can be mitigated through the exploitation of relevant domain properties.

1.6 Thesis outline

We are proposing the use of reinforcement learning both for off-line controller design and as actual adaptive control [2] for self-reconfiguring modular robots. Having introduced the problem and detailed the fundamental underlying conflict of assumptions, we now prepare to resolve it in the following thesis chapters.

A review of related work in both SRMRs and distributed reinforcement learning can be found in chapter 2. In chapter 3 we derive the basic stochastic gradient ascent algorithm in policy space and identify the key issues contributing to the speed and reliability of learning in SRMRs. We then address some of these issues by introducing a representational variation on the basic algorithm, and present initial experimental results from our case study (see section 1.3.4).

We explore the space of extra information and constraints that can be employed to guide the learning algorithms and address the identified key issues in chapter 4. Chapter 5 proposes agreement algorithms as a natural extension to gradient ascent algorithms for distributed systems. Experimental results to validate our claims can be found throughout chapters 3 to 5. In chapter 6 we perform further evaluation of the thesis findings with different objective functions. Finally, we discuss the significance of those results and the limitations of this approach in chapter 7, and conclude with directions for future work.

[2] In the nontechnical sense of adaptable behavior.


Chapter 2

Related Work

We are building on a substantial body of research in both reinforcement learning and distributed robotic systems. Both these fields are well established and we cannot possibly do them justice in these pages, so we describe here only the most relevant prior and related work. Specifically, we focus our discussion on relevant research in the field of self-reconfigurable modular robots, cooperative and coordinated distributed RL, and agreement algorithms in distributed learning.

2.1 Self-reconfiguring modular robots

This section gives a summary of the most important work in the sub-field of lattice-based SRMRs, as well as the research efforts to date in automating controller design and planning for such robots.

2.1.1 Hardware lattice-based systems

There are a great number of lattice-based self-reconfiguring modular robotics platforms, or those that are capable of emulating lattice-based self-reconfiguration. The field dates back to the Metamorphic (Pamecha et al. 1996) and Fracta (Murata et al. 1994) systems, which were capable of reconfiguring their shape on a plane. The first three-dimensional systems appeared shortly afterwards, e.g., 3D-Fracta (Murata et al. 1998), the Molecule (Kotay et al. 1998), and the TeleCube (Suh et al. 2002). All these systems have degrees of freedom to connect and disconnect modules to and from each other, and move each module on the 2D or 3D lattice on a substrate of other modules. More recent and more sophisticated robots include the Modular Transformers MTRAN-II (Kamimura et al. 2004) and the SuperBot (Shen et al. 2006), all of which are capable of functioning as both chain-type and lattice-based robots, as well as the exclusively non-cubic lattice-based ATRON (Østergaard et al. 2006).

In this thesis, we are using the “sliding cubes” abstraction developed by Butler et al. (2004), which represents the motions of which abstract modules on a cubic lattice are capable. As we have previously mentioned, the Molecule, MTRAN-II and MTRAN-III, and the SuperBot are all capable of actuating these abstract motions.


More recent hardware systems have had a greater focus on stochastic self-assembly from modular parts, e.g., the 2D systems such as programmable parts (Bishop et al. 2005) and growing machines (Griffith et al. 2005), where the energy and mobility are provided to the modules through the flow generated by an air table; and the 3D systems such as those of White et al. (2005), where modules are in passive motion inside a liquid whose flow also provides exogenous energy and mobility to the modular parts. Miche (Gilpin et al. 2007) is a deterministic self-disassembly system that is related to this work insofar as the parts used in Miche are not moving. Such work is outside the scope of this thesis.

2.1.2 Automated controller design

In the modular robotics community, the need to employ optimization techniques to fine-tune distributed controllers has been felt and addressed before. The Central Pattern Generator (CPG) based controller of MTRAN-II was created with parameters tuned by off-line genetic algorithms (Kamimura et al. 2004). Evolved sequences of actions have been used for machine self-replication (Mytilinaios et al. 2004, Zykov et al. 2005). Kubica & Rieffel (2002) reported a partially automated system where the human designer collaborates with a genetic algorithm to create code for the TeleCube module. The idea of automating the transition from a global plan to local rules that govern the behavior of individual agents in a distributed system has also been explored from the point of view of compilation in programmable self-assembly (Nagpal 2002).

Reinforcement learning can be employed as just another optimization technique off-line to automatically generate distributed controllers. However, one advantage of applying RL techniques to modular robots is that the same, or very similar, algorithms can also be run on-line, eventually on the robot, to update and adapt its behavior to a changing environment. Our goal is to develop an approach that will allow such flexibility of application. Some of the ideas reported here were previously formulated in less detail by Varshavskaya et al. (2004) and (2006, 2007).

2.1.3 Automated path planning

In the field of SRMRs, the problem of adapting to a changing environment and goal during the execution of a task has not received as much attention, with one notable recent exception. Fitch & Butler (2007) use an ingenious distributed path planning algorithm to automatically generate kinematic motion trajectories for the sliding cube robots to all move into a goal region. Their algorithm, which is based on a clever distributed representation of cost/value and standard dynamic programming, works in a changing environment with obstacles and a moving goal, so long as at least one module is always in the goal region. The plan is decided on and executed in a completely distributed fashion and in parallel, making it an efficient candidate for very large systems.

We view our approaches as complementary. The planning algorithm requires extensive thinking time in between each executed step of the reconfiguration. By contrast, in our approach, once the policy is learned it takes no extra run-time planning to execute it. On the other hand, Fitch and Butler's algorithm works in an entirely asynchronous manner, and they have demonstrated path planning on very large numbers of modules.

2.2 Distributed reinforcement learning

While the reinforcement learning paradigm has not been applied previously to self-reconfigurable modular robotics, there is a vast research field of distributed and multi-agent reinforcement learning (Stone & Veloso 2000, Shoham et al. 2003). Applications have included network routing and traffic coordination, among others.

2.2.1 Multi-agent Q-learning

In robotics, multi-agent reinforcement learning has been applied to teams of robots with Q-learning as the algorithm of choice (e.g., Mataric 1997, Fernandez & Parker 2001). It was then also discovered that considerable engineering is needed to use Q-learning in teams of robots, so that there are only a few possible states and a few behaviors to choose from. Q-learning has also been attempted on a non-reconfiguring modular robot (Yu et al. 2002) with similar issues and mixed success. The RoboCup competition (Kitano et al. 1997) has also driven research in collaborative multi-agent reinforcement learning.

2.2.2 Hierarchical distributed learning

Hierarchies of machines (Andre & Russell 2000) have been used in multi-agent RL as a strategy for constraining policies to make Q-learning work faster. The related issue of hierarchically optimal Q-function decomposition has been addressed by Marthi et al. (2006).

2.2.3 Coordinated learning

More recently there has been much theoretical and algorithmic work in multi-agent reinforcement learning, notably in cooperative reinforcement learning. Some solutions deal with situations where agents receive individual rewards but work together to achieve a common goal. Distributed value functions, introduced by Schneider et al. (1999), allowed agents to incorporate neighbors' value functions into Q-updates in MDP-style learning. Coordination graphs (Guestrin et al. 2002, Kok & Vlassis 2006) are a framework for learning in distributed MDPs where values are computed by algorithms such as variable elimination on a graph representing inter-agent connections. In mirror situations, where individual agents must learn separate value functions from a common reward signal, Kalman filters (Chang et al. 2004) have been used to estimate individuals' reward contributions.


2.2.4 Reinforcement learning by policy search

In partially observable environments, policy search has been extensively used (Peshkin 2001, Ng & Jordan 2000, Baxter & Bartlett 2001, Bagnell et al. 2004) whenever models of the environment are not available to the learning agent. In robotics, successful applications include humanoid robot motion (Schaal et al. 2003), autonomous helicopter flight (Ng et al. 2004), navigation (Grudic et al. 2003), and bipedal walking (Tedrake et al. 2005), as well as simulated mobile manipulation (Martin 2004). In some of those cases, a model identification step is required prior to reinforcement learning, for example from human control of the plant (Ng et al. 2004) or human motion capture (Schaal et al. 2003). In other cases, the system is brilliantly engineered to reduce the search task to estimating as little as a single parameter (Tedrake et al. 2005).

Function approximation, such as neural networks or coarse coding, has also been widely used to improve the performance of reinforcement learning algorithms, for example in policy gradient algorithms (Sutton et al. 2000) or approximate policy iteration (Lagoudakis & Parr 2003).

2.3 Agreement algorithms

Agreement algorithms were first introduced in a general framework of distributed (synchronous and asynchronous) computation by Tsitsiklis (e.g., Tsitsiklis et al. 1986). The theory includes convergence proofs used in this dissertation, and results pertaining to asynchronous distributed gradient-following optimization. The framework, theory and results are available as part of a widely known textbook (Bertsekas & Tsitsiklis 1997). Agreement is closely related to swarming and flocking, which have long been discussed and modeled in the biological and physical scientific literature (e.g., Okubo 1986, Vicsek et al. 1995).

Current research in agreement, which is also more frequently known as consensus, includes decentralized control theory (e.g., Jadbabaie et al. 2003) and the networks literature (e.g., Olfati-Saber & Murray 2004). As the original models were developed to predict the behavior of biological groups (swarms, flocks, herds and schools), current biological and ecological studies of group behavior validate the earlier models (e.g., Buhl et al. 2006). A comprehensive review of such work is beyond the scope of this thesis.

In machine learning, agreement has been used in the consensus propagation algorithm (Moallemi & Van Roy 2006), which can be seen as a special case of belief propagation on a Gaussian Markov Random Field. Consensus propagation guarantees very fast convergence to the exact average of initial values on singly connected graphs (trees). The related belief consensus algorithm (Olfati-Saber et al. 2005) enables scalable distributed hypothesis testing.

In the reinforcement learning literature, agreement was also used in distributed optimization of sensor networks (Moallemi & Van Roy 2003). This research is closest to our approach, as it proposes a distributed optimization protocol using policy gradient MDP learning together with pairwise agreement on rewards. By contrast, our approach is applicable in partially observable situations. In addition, we take advantage of the particular policy search algorithm to let learning agents agree not only on reward, but also on experience. The implications and results of this approach are discussed in chapter 5.


Chapter 3

Policy Search in Self-Reconfiguring Modular Robots

The first approach we take in dealing with distributed actions, local observations and partial observability is to describe the problem of locomotion by self-reconfiguration of a modular robot as a multi-agent Partially Observable Markov Decision Process (POMDP).

In this chapter, we first describe the derivation of the basic GAPS algorithm, and in section 3.3 the issues concerning its implementation in a distributed learning system. We then develop a variation using a feature-based representation and a log-linear policy in section 3.4. Finally, we present some experimental results (section 3.5) from our case study of locomotion by self-reconfiguration, which demonstrate that policy search works where Markov-assuming techniques do not.

3.1 Notation and assumptions

We now introduce some formal notation which will become useful for the derivation of learning algorithms. The interaction is formalized as a Markov decision process (MDP), a 4-tuple ⟨S, A, T, R⟩, where S is the set of possible world states, A the set of actions the agent can take, T : S × A → P(S) the transition function defining the probability of being in state s′ ∈ S after executing action a ∈ A while in state s, and R : S × A → ℝ a reward function. The agent maintains a policy π(s, a) = P(a|s), which describes a probability distribution over actions that it will take in any state. The agent does not know T. In the multi-agent case we imagine that learning agents have access to a kind of oracle that observes the full state of the MDP, but execute factored actions a_t = {a_t^1, ..., a_t^n}, if there are n modules. In a multi-agent partially observable MDP, there is no oracle, and the modules have access only to factored observations over the state o_t = Z(s_t) = {o_t^1, ..., o_t^n}, where Z : S → O is the unknown observation function which maps MDP states to factored observations and O is the set of possible observations.

We assume that each module only conditions its actions on the current observation, and that only one action is taken at any one time. An alternative approach could be to learn controllers with internal state (Meuleau et al. 1999). However, internal state is unlikely to help in our case, as it would drastically multiply the number of parameters to be learned, which is already considerable.

To learn in this POMDP we use a policy search algorithm based on gradient ascent in policy space (GAPS), proposed by Peshkin (2001). This approach assumes a given parametric form of the policy π_θ(o, a) = P(a|o, θ), where θ is the vector of policy parameters, and the policy is a differentiable function of the parameters. The learning proceeds by gradient ascent on the parameters θ to maximize expected long-term reward.

3.2 Gradient ascent in policy space

The basic GAPS algorithm does hill-climbing to maximize the value (that is, long-term expected reward) of the parameterized policy. The derivation starts by noting that the value of a policy π_θ is

\[ V_\theta = E_\theta[R(h)] = \sum_{h \in H} R(h)\, P(h \mid \theta), \]

where θ is the parameter vector defining the policy and H is the set of all possible experience histories. If we could calculate the derivative of V_θ with respect to each parameter, it would be possible to do exact gradient ascent on the value by making updates Δθ_k = α ∂V_θ/∂θ_k. However, we do not have a model of the world that would give us P(h|θ), and so we will use stochastic gradient ascent instead. Note that R(h) is not a function of θ, and the probability of a particular history can be expressed as the product of two terms (under the Markov assumption and assuming learning is episodic, with each episode comprised of T time steps):

\[
\begin{aligned}
P(h \mid \theta) &= P(s_0) \prod_{t=1}^{T} P(o_t \mid s_t)\, P(a_t \mid o_t, \theta)\, P(s_{t+1} \mid s_t, a_t) \\
&= \left[ P(s_0) \prod_{t=1}^{T} P(o_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t) \right] \left[ \prod_{t=1}^{T} P(a_t \mid o_t, \theta) \right] \\
&= \Xi(h)\, \Psi(h, \theta),
\end{aligned}
\]

where Ξ(h) is not known to the learning agent and does not depend on θ, and Ψ(h, θ) is known and differentiable under the assumption of a differentiable policy representation. Therefore,

\[ \frac{\partial}{\partial \theta_k} V_\theta = \sum_{h \in H} R(h)\, \Xi(h)\, \frac{\partial}{\partial \theta_k} \Psi(h, \theta). \]


The differentiable part of the update gives:

\[
\begin{aligned}
\frac{\partial}{\partial \theta_k} \Psi(h, \theta) &= \frac{\partial}{\partial \theta_k} \prod_{t=1}^{T} \pi_\theta(a_t, o_t) \\
&= \sum_{t=1}^{T} \left[ \frac{\partial}{\partial \theta_k} \pi_\theta(a_t, o_t) \prod_{\tau \neq t} \pi_\theta(a_\tau, o_\tau) \right] \\
&= \sum_{t=1}^{T} \left[ \frac{\frac{\partial}{\partial \theta_k} \pi_\theta(a_t, o_t)}{\pi_\theta(a_t, o_t)} \prod_{t=1}^{T} \pi_\theta(a_t, o_t) \right] \\
&= \Psi(h, \theta) \sum_{t=1}^{T} \frac{\partial}{\partial \theta_k} \ln \pi_\theta(a_t, o_t).
\end{aligned}
\]

Therefore,

\[ \frac{\partial}{\partial \theta_k} V_\theta = \sum_{h \in H} R(h) \left( P(h \mid \theta) \sum_{t=1}^{T} \frac{\partial}{\partial \theta_k} \ln \pi_\theta(a_t, o_t) \right). \]

The stochastic gradient ascent algorithm operates by collecting algorithmic traces at every time step of each learning episode. Each trace reflects the contribution of a single parameter to the estimated gradient. The makeup of the traces will depend on the policy representation.

The most obvious representation is a lookup table, where rows are possible local observations and columns are actions, with a parameter for each observation-action pair. The GAPS algorithm was originally derived (Peshkin 2001) for such a representation, and is reproduced here for completeness (Algorithm 1). The traces are obtained by counting occurrences of each observation-action pair, and taking the difference between that number and the expected number that comes from the current policy. The policy itself, i.e., the probability of taking an action a_t at time t given the observation o_t and current parameters θ, is given by Boltzmann's law:

\[ \pi_\theta(a_t, o_t) = P(a_t \mid o_t, \theta) = \frac{e^{\beta\, \theta(o_t, a_t)}}{\sum_{a \in A} e^{\beta\, \theta(o_t, a)}}, \]

where β is an inverse temperature parameter that controls the steepness of the curve and therefore the level of exploration.
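As an illustration (not part of the original text), here is a minimal Python sketch of this Boltzmann (softmax) action-selection rule for a single observation row of a tabular parameter vector; the array shapes and example values are assumptions made only for the example.

```python
import numpy as np

def boltzmann_policy(theta_row, beta):
    """Probability of each action given the parameters theta(o, .) for one observation.

    theta_row : 1-D array of parameters theta(o, a) for a fixed observation o.
    beta      : inverse temperature; larger beta -> a more deterministic policy.
    """
    logits = beta * theta_row
    logits -= logits.max()            # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Example: 9 actions (8 moves + NOP); nearly uniform at beta = 1, sharper at beta = 3.
theta_o = np.array([0.2, 0.0, 1.0, 0.1, 0.0, 0.0, 0.0, 0.3, 0.0])
print(boltzmann_policy(theta_o, beta=1.0))
print(boltzmann_policy(theta_o, beta=3.0))
```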

This gradient ascent algorithm has some important properties. As with any stochastic hill-climbing method, it can only be relied upon to reach a local optimum in the represented space, and will only converge if the learning rate is reduced over time. We can attempt to mitigate this problem by running GAPS from many initialization points and choosing the best policy, or by initializing the parameters in a smarter way. The latter approach is explored in chapter 4.

3.3 Distributed GAPS

GAPS has another property interesting for our twin purposes of controller generation and run-time adaptation in SRMRs. In the case where each agent not only observes and acts but also learns its own parameterized policy, the algorithm can be extended in the most obvious way to Distributed GAPS (DGAPS), as was also done by Peshkin (2001). A theorem in his thesis says that the centralized factored version of GAPS and the distributed multi-agent version will make the same updates given the same experience. That means that, given the same observation and action counts, and the same rewards, the two instantiations of the algorithm will find the same solutions.

Algorithm 1 GAPS (observation function o, M modules)
  Initialize parameters θ ← small random numbers
  for each episode do
    Calculate policy π(θ)
    Initialize observation counts N ← 0
    Initialize observation-action counts C ← 0
    for each timestep in episode do
      for each module m do
        observe o_m and increment N(o_m)
        choose a from π(o_m, θ) and increment C(o_m, a)
        execute a
      end for
    end for
    Get global reward R
    Update θ according to θ(o, a) += α R ( C(o, a) − π(o, a, θ) N(o) )
    Update π(θ) using Boltzmann's law
  end for
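To make the tabular update concrete, the following Python sketch shows one way the centralized factored GAPS loop could be implemented. The simulator object `sim` and its methods (`reset`, `modules`, `observe`, `step`, `reward`) are hypothetical stand-ins for the lattice simulator described later in this chapter; only the counting and the update rule follow Algorithm 1.

```python
import numpy as np

N_OBS, N_ACT = 256, 9      # 2^8 Moore-neighborhood observations, 8 moves + NOP

def boltzmann(theta, beta):
    """Row-wise Boltzmann policy pi(a | o) from the parameter table theta[o, a]."""
    logits = beta * theta
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def run_gaps(sim, episodes=4000, T=50, alpha=0.01, beta=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    theta = 0.01 * rng.standard_normal((N_OBS, N_ACT))   # small random initialization
    for _ in range(episodes):
        pi = boltzmann(theta, beta)        # policy held fixed within the episode
        N = np.zeros(N_OBS)                # observation counts
        C = np.zeros((N_OBS, N_ACT))       # observation-action counts
        sim.reset()
        for _t in range(T):
            for m in sim.modules():        # synchronous turn-taking
                o = sim.observe(m)         # integer observation in [0, 256)
                a = rng.choice(N_ACT, p=pi[o])
                N[o] += 1
                C[o, a] += 1
                sim.step(m, a)             # illegal moves simply fail
        R = sim.reward()                   # e.g. x-displacement of the center of mass
        # GAPS update: surprise between actual and expected counts, scaled by reward
        theta += alpha * R * (C - pi * N[:, None])
    return theta
```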

In our domain of a distributed modular robot, DGAPS is naturally preferable. However, in a fully distributed scenario, agents do not in fact have access to the same experience and the same rewards as all others. Instead of requiring an identical reward signal for all agents, we take each module's displacement along the x axis to be its reward signal: R_m = x_m, since we assume that modules do not communicate. This means that the policy value landscape is now different for each agent. However, the agents are physically coupled with no disconnections allowed. If the true reward is R = (1/N) Σ_{m=1}^{N} x_m, and each individual reward is R_m = x_m, then each R_m is a bounded estimate of R that is at most N/2 away from it (in the worst-case scenario where all modules are connected in one line). Furthermore, as each episode is initialized, modules are placed at random in the starting configuration of the robot. Therefore the robot's center of mass, R = E[x_m], is the expected value of any one module's position along the x axis. We can easily see that in the limit, as the number of turns per episode increases and as learning progresses, each x_m approaches the true reward.

This estimate is better the fewer modules we have and the larger R is. Therefore it makes sense to simplify the problem in the initial phase of learning, while the rewards are small, by starting with fewer modules, as we explore in chapter 4. It may also make sense to split up the experience into larger episodes, which would generate larger rewards, all other things being equal, especially as the number of modules increases. Other practical implications of diverse experience and rewards among the learning agents are examined in chapter 5.

3.4 GAPS learning with feature spaces

In some domains it is unfeasible to enumerate all possible observations and maybe even all possible actions. When we are confronted with a feature-based representation for a domain, a log-linear1 version of the basic GAPS algorithm can be employed, which we derive below.

We propose to represent the policy compactly as a function of a number of features defined over the observation-action space of the learning module. Suppose the designer can identify some salient parts of the observation that are important to the task being learned. We can define a vector of feature functions over the observation-action space Φ(a, o) = [φ_1(a, o), φ_2(a, o), ..., φ_n(a, o)]^T. These feature functions are domain-dependent and can have a discrete or continuous response field. Then the policy encoding for the learning agent (the probability of executing an action a_t) is:

\[ \pi_\theta(a_t, o_t) = P(a_t \mid o_t, \theta) = \frac{e^{\beta\, \Phi(a_t, o_t) \cdot \theta}}{\sum_{a \in A} e^{\beta\, \Phi(a, o_t) \cdot \theta}}, \]

where β is again an inverse temperature parameter. This definition of the probability of selecting action a_t is the counterpart of Boltzmann's law in feature space. The derivation of the log-linear GAPS algorithm (LLGAPS) then proceeds as follows:

\[
\begin{aligned}
\frac{\partial}{\partial \theta_k} \ln P(a_t \mid o_t, \theta) &= \frac{\partial}{\partial \theta_k} \left( \beta\, \Phi(a_t, o_t) \cdot \theta - \ln \left( \sum_{a \in A} e^{\beta\, \Phi(a, o_t) \cdot \theta} \right) \right) \\
&= \beta\, \phi_k(a_t, o_t) - \frac{\frac{\partial}{\partial \theta_k} \sum_{a} e^{\beta\, \Phi(a, o_t) \cdot \theta}}{\sum_{a} e^{\beta\, \Phi(a, o_t) \cdot \theta}} \\
&= \beta \left( \phi_k(a_t, o_t) - \frac{\sum_{a} \phi_k(a, o_t)\, e^{\beta\, \Phi(a, o_t) \cdot \theta}}{\sum_{a} e^{\beta\, \Phi(a, o_t) \cdot \theta}} \right) \\
&= \beta \left( \phi_k(a_t, o_t) - \sum_{a} \phi_k(a, o_t)\, \pi_\theta(a, o_t) \right) = \lambda_k(t).
\end{aligned}
\]

The accumulated traces λ_k are only slightly more computationally involved than the simple counts of the original GAPS algorithm: there are |A| + N more computations per timestep than in GAPS, where A is the set of actions and N the number of features. The updates are just as intuitive, however, as they still assign more credit to those features that differentiate more between actions, normalized by the actions' likelihood under the current policy.

The algorithm (see Algorithm 2) is guaranteed to converge to a local maximum in policy value space, for a given feature and policy representation. However, the feature space may exclude good policies — using a feature-based representation shifts the burden of design from policies to features. Without an automatic way of providing good features, the process of controller generation cannot be called automatic either. Experimentally (section 3.5.4), we attempt to verify the robustness of LLGAPS to variations in feature spaces by providing the algorithm with a set of ad hoc features deemed salient for the lattice-based motion domain.

1The term log-linear refers to the log-linear combination of features in the policy representation.

Algorithm 2 LLGAPS (observation function o, N feature functions φ)
  Initialize parameters θ ← small random numbers
  for each episode do
    Initialize traces Λ ← 0
    for each timestep t in episode do
      Observe current situation o_t and get feature responses for every action Φ(∗, o_t)
      Sample action a_t according to policy (Boltzmann's law)
      for k = 1 to N do
        Λ_k ← Λ_k + β ( φ_k(a_t, o_t) − Σ_a φ_k(a, o_t) π_θ(a, o_t) )
      end for
    end for
    θ ← θ + α R Λ
  end for
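As an illustration of the mask-style features and the trace computation above, here is a small Python sketch. The neighborhood encoding, the specific corner mask, and the helper names are assumptions made for the example; only nine of the 144 features (one mask crossed with all nine actions) are instantiated.

```python
import numpy as np

N_ACT = 9   # 8 moves + NOP

def corner_feature(cells, action):
    """Feature that fires when the given neighbor cells are all occupied AND the
    chosen action equals `action`. A local observation is a tuple of 8 bits."""
    def phi(obs_bits, a):
        return float(all(obs_bits[c] == 1 for c in cells) and a == action)
    return phi

# One 'full corner' mask (e.g. cells 0, 1, 7 occupied), crossed with every action.
features = [corner_feature((0, 1, 7), a) for a in range(N_ACT)]

def feature_vector(obs_bits, a, feats=features):
    return np.array([phi(obs_bits, a) for phi in feats])

def llgaps_policy(obs_bits, theta, beta=1.0, feats=features):
    """Log-linear (Boltzmann-in-feature-space) policy over the 9 actions."""
    logits = np.array([beta * feature_vector(obs_bits, a, feats) @ theta
                       for a in range(N_ACT)])
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

def llgaps_trace_step(obs_bits, a_t, theta, beta=1.0, feats=features):
    """One timestep's contribution lambda_k(t) to the eligibility traces."""
    pi = llgaps_policy(obs_bits, theta, beta, feats)
    expected = sum(pi[a] * feature_vector(obs_bits, a, feats) for a in range(N_ACT))
    return beta * (feature_vector(obs_bits, a_t, feats) - expected)

# Example: before learning (theta = 0), the policy is uniform over actions.
obs = (1, 1, 0, 0, 0, 0, 0, 1)
theta = np.zeros(len(features))
print(llgaps_policy(obs, theta))
# At the end of an episode, the update is simply: theta += alpha * R * traces.
```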

3.5 Experimental Results

In this section we present results from a number of experiments designed to test how well the basic GAPS algorithm and its feature-based extension perform on the locomotion task using the sliding cubes kinematic model in simulation. First, we establish empirically the difference between applying a technique derived for MDPs to a partially observable distributed POMDP, by comparing how GAPS, Q-learning and Sarsa (Sutton 1995), which is an on-policy, temporal-difference, MDP-based method, fare on the task.

3.5.1 The experimental setup

We conduct experiments on a simulated two-dimensional modular robot, which is shown in figure 3-1a. The simulation is based solely on the abstract kinematic model of lattice reconfiguration, and we make no attempt to include physical realism, or indeed any physical laws2. However, the simulation enforces two simple rules: (1) the robot must stay fully connected at all times, and (2) the robot must stay connected to the “ground” at y = 0 by at least one module at all times. Actions whose result would violate one of these rules fail and are not executed. The guilty module simply forfeits its turn.

2Lattice-based SRMRs to date have seen only kinematic controllers. Dynamics enter into play when the robot configuration generates non-negligible lever forces on the module-to-module connectors. When learning is applied to physical robots, the dynamics of physical reality need to be taken into account. We discuss some implications of these concerns in section 7.3.
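A minimal sketch of how rules (1) and (2) might be enforced in such a simulator, using a flood fill over a hypothetical set of occupied lattice cells; the thesis does not prescribe an implementation, so the data layout here is an assumption.

```python
def stays_connected(occupied, move_from, move_to):
    """Rule (1) sketch: the occupied cells remain one connected component if the
    module at move_from slides to move_to. Cells are (x, y) tuples."""
    cells = (set(occupied) - {move_from}) | {move_to}
    if not cells:
        return True
    start = next(iter(cells))
    seen, stack = {start}, [start]          # flood fill with 4-connectivity
    while stack:
        x, y = stack.pop()
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nb in cells and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == len(cells)

def touches_ground(occupied, move_from, move_to):
    """Rule (2) sketch: at least one module remains at y = 0 after the move."""
    cells = (set(occupied) - {move_from}) | {move_to}
    return any(y == 0 for _, y in cells)
```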


Figure 3-1: Experimental setup: simulated 2D modular robot on a lattice.

As previewed in section 1.3.4, each module can observe its immediate Moore neighborhood (eight immediate neighbor cells) to see if those cells are occupied by a neighboring module or empty. Each module can also execute one of nine actions, which are to move into one of the neighboring cells or a NOP. The simulator is synchronous, and modules execute their actions in turn, such that no module can go twice before every other module has taken its turn. This assumption is unrealistic; however, it helps us avoid learning to coordinate motion across the entire robot. There are examples in the literature (e.g., in Fitch & Butler 2007) of algorithms intended for use in lattice-based SRMRs that provide for turn-taking. The task is to learn to move in one direction (East) by self-reconfiguration. The reward function measures the progress the modules make in one episode along the x axis of the simulator. Figure 3-2 shows how a policy (a) can be executed by modules to produce a locomotion gait (b) and therefore gain reward.
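For concreteness, a local observation can be packed into a single integer, which is one way a lookup-table policy with 2^8 rows could index it; the bit ordering below is an assumption made for the example only.

```python
# Moore neighborhood of the acting module at (x, y), enumerated clockwise from North.
NEIGHBOR_OFFSETS = [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)]

def encode_observation(occupied, x, y):
    """Pack the 8 neighbor cells into an integer in [0, 256): bit i is set
    if the i-th neighbor cell is occupied by another module."""
    obs = 0
    for i, (dx, dy) in enumerate(NEIGHBOR_OFFSETS):
        if (x + dx, y + dy) in occupied:
            obs |= 1 << i
    return obs

# Example: a module at (0, 0) with neighbors to the West and North-West.
occupied = {(0, 0), (-1, 0), (-1, 1)}
print(encode_observation(occupied, 0, 0))   # bits 6 (W) and 7 (NW) set -> 192
```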

We run the experiments in two broad conditions. First, the goal of automatically generating controllers can in principle be achieved in simulation off-line, and then imparted to the robot for run-time execution. In that case, we can require all modules to share one set of policy parameters, that is, to pool their local observations and actions for one learning “super-agent” to make one set of parameter updates, which then propagates to all the modules on the next episode. Second, we run the same learning algorithms on individual modules operating independently, without sharing their experience. We predict that more experience will result in faster learning curves and better learning results.

3.5.2 Learning by pooling experience

The focus of this section is on the first condition, where a centralized algorithm is run off-line with factored observations and actions. All modules execute the same policy.

In each case, the experiment consisted of 10 runs of the learning algorithm, starting from a randomized initial state. The experiments were set up as episodic learning with each episode terminating after 50 timesteps. The distance that the robot's center of mass moved during an episode was presented to all modules or, equivalently, to the centralized learner as the reward at the end of the episode.



Figure 3-2: (a) A locomotion gait policy (conditional probability distribution over actions given an observation). (b) First few configurations of 9 modules executing the policy during a test run: in blue, the module with the lowest ID number; in red, the module which is about to execute an action; in green, the module which has just finished its action.


Figure 3-3: Performance of policies learned by 15 modules in the 2D locomotion task running Q-learning, Sarsa and GAPS: (a) smoothed (100-point moving window), downsampled average rewards per learning episode over 10 trials (Q-learning and Sarsa each), 50 trials (GAPS), with standard error; (b) typical single-trial learning curves.


Figure 3-4: Reward distributions after 4,000 episodes of learning for Q-learning, Sarsa and GAPS (10 trials per algorithm). This box and whisker plot, generated using MATLAB®, shows the lower quartile, the median and the upper quartile values as box lines for each algorithm. The whiskers are vertical lines that show the extent of the rest of the data points. Outliers are data beyond the end of the whiskers. The notches on the box represent a robust estimate of the uncertainty about the medians at the 5% significance level.

Unless noted otherwise, in all conditions the learning rate started at α = 0.01, decreased over the first 1,500 episodes, and remained at its minimal value of 0.001 thereafter. The inverse temperature parameter started at β = 1, increased over the first 1,500 episodes, and remained at its maximal value of 3 thereafter. This ensured more exploration and larger updates in the beginning of each learning trial. These parameters were selected by trial and error, as the ones consistently generating the best results. Whenever results are reported as smoothed, downsampled average reward curves, the smoothing was done with a 100-point moving window on the results of every trial, which were then downsampled, for the sake of clarity, to exclude random variation and variability due to continued within-trial exploration.
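A small sketch of such parameter schedules; the text specifies only the endpoints and the 1,500-episode ramp, so the linear interpolation below is an assumption.

```python
def annealed(episode, start, end, ramp_episodes=1500):
    """Interpolate a parameter linearly from `start` to `end` over the first
    `ramp_episodes` episodes, then hold it constant."""
    frac = min(episode / ramp_episodes, 1.0)
    return start + frac * (end - start)

alpha = annealed(200, start=0.01, end=0.001)   # learning rate at episode 200
beta = annealed(200, start=1.0, end=3.0)       # inverse temperature at episode 200
```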

3.5.3 MDP-like learning vs. gradient ascent

The first set of experiments pits GAPS against two powerful algorithms which make the Markov assumption: Q-learning (Watkins & Dayan 1992) and Sarsa (Sutton 1995). Sarsa is an on-policy algorithm, and therefore may do better than Q-learning in our partially observable world. However, our prediction was that both would fail to reliably converge to a policy in the locomotion task, whereas gradient ascent would succeed in finding a locally optimal policy3.

Hypothesis: GAPS is capable of finding good locomotion gaits, but Q-learning and Sarsa are not.

3As we will see later, a locally optimal policy is not always a good locomotion gait.


Figure 3-3a shows the smoothed average learning curves for the three algorithms. Fifteen modules were learning to locomote eastward in 10 separate trial runs (50 trials for GAPS). As predicted, gradient ascent receives considerably more reward than either Q-learning or Sarsa. The one-way ANOVA4 reveals that the rewards obtained after 4,000 episodes of learning differed significantly as a function of the learning algorithm used (F(2, 27) = 33.45, p < .01). Figure 3-4 shows the mean rewards obtained by the three algorithms, with error bars and outliers. The post-hoc multiple comparisons test reports that the GAPS mean reward is significantly different from both Sarsa and Q-learning at the 99% confidence level, but the latter are not significantly different from each other. In test trials this discrepancy manifested itself as finding a good policy for moving eastward (one such policy is shown in figure 3-2) for GAPS, and failing to find a reasonable policy for Q-learning and Sarsa: modules oscillated, moving up-and-down or left-and-right, and the robot did not make progress. In figure 3-3b we see the raw rewards collected at each episode in one typical trial run of both learning algorithms.

While these results demonstrate that modules running GAPS learn a good policy, they also show that it takes a long time for gradient ascent to find it. We next examine the extent to which we can reduce the number of search space dimensions, and therefore the experience required by GAPS, through employing the feature-based approach of section 3.4.

3.5.4 Learning in feature spaces

We compare experiments performed under two conditions. In the original condition, the GAPS algorithm was used with a lookup table representation with a single parameter θ(o, a) for each possible observation-action pair. For the task of locomotion by self-reconfiguration, this constituted a total of 2^8 × 9 = 2,304 parameters to be estimated.

In the log-linear function approximation condition, the LLGAPS algorithm was run with a set of features determined by hand as the cross product of salient aspects of neighborhood observations and all possible actions. The salient aspects were full corners and straight lines, and empty corners and straight lines. The features were obtained by applying partial masks to the observed neighborhood, and selecting a particular action, with response fields as shown in figure 3-5. This gave a total of 144 features, the partial masks for which are fully detailed in figure 3-6. In all cases, modules were learning from the space of reactive policies only.

4The rewards are most probably not normally distributed: indeed, the distribution will be multimodal, with “good” and “bad” trials clustered together. It may therefore be more appropriate to use the nonparametric Kruskal-Wallis test, which does not make the normal distribution assumption. The Kruskal-Wallis test also reveals a significant difference in reward as a function of the learning algorithm (χ²(2, 27) = 16.93, p < .01). However, the KW-test does not transfer well to multi-way analysis (Toothaker & Chang 1980), which will be necessary in further experiments. We will therefore use ANOVA, because it is robust to some assumption violations, and because we are only interested in large effects. But see also Brunner & Puri (2001) for a framework for multi-way nonparametric testing.



Figure 3-5: 6 modules learning with LLGAPS and 144 features. (a) One of the features used for function approximation in the LLGAPS experiments: this feature function returns 1 if there are neighbors in all three cells of the upper left corner and a_t = 2 (NE). (b) Smoothed (100-point moving window), downsampled average rewards over 10 runs, with standard error: comparison between original GAPS and LLGAPS.

Hypothesis: LLGAPS's compact representation will lead to faster learning (better convergence times), as compared to standard GAPS.

Figure 3-5b presents the results of comparing the performance of LLGAPS with the original GAPS algorithm on the locomotion task for 6 modules. We see that both algorithms are comparable in both their speed of learning and the average quality of the resulting policies. If anything, around episode 1,500 basic GAPS has better performance. However, a two-way mixed-factor ANOVA (repeated measures every 1000 episodes, GAPS vs. LLGAPS) reveals no statistical significance. Our hypothesis was wrong, at least for this representation.

However, increasing the number of modules reveals that LLGAPS is more prone to finding unacceptable local minima, as explained in section 4.3.2. We also hypothesized that increasing the size of the observed neighborhood would favor the feature-based LLGAPS over the basic GAPS. As the size of the observation increases, the number of possible local configurations grows exponentially, whereas the features can be designed to grow linearly. We ran a round of experiments with an increased neighborhood size of 12 cells obtained as follows: the acting module observes its immediate face neighbors in positions 1, 3, 5, and 7, and requests from each of them a report on the three cells adjacent to their own faces5. Thus the original Moore neighborhood plus four additional bits of observation are available to every module, as shown in figure 3-7a. This setup results in 2^12 × 9 = 36,864 parameters to estimate for GAPS. For LLGAPS we incorporated the extra information thus obtained into an additional 72 features as follows: one partial neighborhood mask per extra neighbor present,

5If no neighbor is present at a face, and so no information is available about a corner neighbor, it is assumed to be an empty cell.


[Figure: the 16 observation masks, grouped as Full corner, Full line, Empty corner, and Empty line. Legend: current acting module; neighbor present; no neighbor (empty cell); neighbor or empty cell.]

Figure 3-6: All 16 observation masks for feature generation: each mask, combined with one of 9 possible actions, generates one of 144 features describing the space of policies.


                             15 modules    20 modules
GAPS                         0.1 ± 0.1     5 ± 1.7
LLGAPS with 144 features     6.6 ± 1.3     9.7 ± 0.2

Table 3.1: Mean number of configurations that do not correspond to locomotion gaits, out of 10 test trials, for 10 policies learned by GAPS vs. LLGAPS with 144 features, after 6,000 episodes of learning, with standard error. We will use the mean number of non-gait configurations throughout this thesis as one of the performance measures for various algorithms, representations and constraints. The standard error σ/√N (N is the number of sampled policies, here equal to 10) included in this type of table both gives an estimate of how spread out the values from different policies are, and is a measure of the quality of the mean value as a statistic.

and one per extra neighbor absent, where each of those 8 partial masks generates 9 features, one per possible action. We found that increasing the observation size indeed slows the basic GAPS algorithm down (figure 3-7b, where LLGAPS ran with a much lower learning rate to avoid too-large update steps: α = 0.005 decreasing to α = 1e−5 over the first 1,500 episodes and remaining at the minimal value thereafter). During the first few hundred episodes, LLGAPS does better than GAPS. However, we have also found that there is no significant difference in performance after both have found their local optima (a mixed-factor ANOVA with repeated measures every 1000 episodes, starting at episode 500, reveals no significance for GAPS vs. LLGAPS, but a significant difference at the 99% confidence level for the interaction between the two factors of episode and algorithm). Table 3.1 shows that the feature-based LLGAPS is even more prone to fall for local optima.

Two possible explanations would account for this phenomenon. On the one hand, feature spaces need to be carefully designed to avoid these pitfalls, which shifts the human designer's burden from developing distributed control algorithms to developing sets of features. The latter task may not be any less challenging. On the other hand, even well-designed feature spaces may eliminate redundancy in the policy space, and therefore make it harder for local search to reach an acceptable optimum. This effect is the result of compressing the policy representation into fewer parameters. When an action is executed that leads to worse performance, the features that contributed to the selection of this action will all get updated. That is, the update step results in a simultaneous change in different places in the observation-action space, which can more easily create a new policy with even worse performance. Thus, the parameters before the update would be locally optimal.

3.5.5 Learning from individual experience

When one agent learns from the combined experience of all modules' observations, actions and rewards, it is not surprising that the learning algorithm converges faster than in the distributed case, as the learner has more data on each episode. However, as the size of the robot (i.e., the number of modules acting and learning together) grows, so does the discrepancy between the cumulative experience of all agents and the individual experience of one agent. When many modules are densely packed into an initial shape at the start of every episode, we can expect the ones “stuck” in the middle of the shape to not gather any experience at all, and to receive zero reward, for at least as long as it takes the modules on the perimeter to learn to move out of the way. While a randomized initial position is supposed to mitigate this problem, as the number of modules grows, so grows the probability that any one module will rarely or never be on the perimeter at the outset of an episode. This is a direct consequence of our decision to start locomotion from a tightly packed rectangular configuration with a height-to-length ratio closest to 1. That decision was motivated by the self-unpacking and deploying scenario often mentioned in SRMR research.

Figure 3-7: (a) During experiments with an extended observation, modules had access to 12 bits of observation, as shown here. (b) Smoothed (100-point moving window), downsampled average rewards obtained by 6 modules over 10 trials, with standard error: comparison between basic GAPS and LLGAPS.

Hypothesis: Distributed GAPS will learn much slower than centralized factored GAPS, and may not learn at all.

As demonstrated in figure 3-8, indeed, in practice as few as 15 modules running the fully distributed version of the GAPS algorithm do not find any good policies at all, even after 100,000 episodes of learning. This limitation will be addressed in chapters 4 and 5.

3.5.6 Comparison to hand-designed controllers

The structure of the reward signal used during the learning phase determines what the learned policies will do to achieve maximum reward. In the case of eastward locomotion the reward does not depend on the shape of the modular robot, only on how far east it has gone at the end of an episode. On the other hand, the hand-designed policies of Butler et al. (2001) were specifically developed to maintain the convex, roughly square or cubic overall shape of the robot, and to avoid any holes within that shape. It turns out that one can go farther faster if this constraint is relaxed. The shape-maintaining hand-designed policy, executed by 15 modules for 50 time steps (as shown in figure 3-10), achieves an average reward per episode of 5.8 (σ = 0.9), whereas its counterpart learned using GAPS (execution sequence shown in figure 3-9) achieves an average reward of 16.5 (σ = 1.2). Table 3.2 graphically represents the rules distilled from the best policy learned by GAPS. The robot executing this policy unfolds itself into a two-layer thread, then uses a thread-gait to move East. While very good at maximizing this particular reward signal, these policies no longer have the “nice” properties of the hand-designed policies of Butler et al. (2001). By learning with no constraints and a very simple objective function (maximize horizontal displacement), we forgo any maintenance of shape, or indeed any guarantees that the learning algorithm will converge on a policy that is a locomotion gait. The best we can say is that, given the learning setup, it will converge on a policy that locally maximizes the simple reward.

Figure 3-8: Comparison of learning performance between the centralized, factored GAPS and the distributed GAPS (DGAPS) implementations: smoothed (10-point window) average rewards obtained during 10 learning trials with 15 modules. (a) first 10,000 episodes, and (b) to 100,000 episodes of learning.

Figure 3-9: Screenshot sequence of 15 modules executing the best policy found by GAPS for eastward locomotion.

Figure 3-10: Screenshot sequence of 15 modules executing the hand-designed policy for eastward locomotion.

Table 3.2: Rules for eastward locomotion, distilled from the best policy learned by GAPS. [The rules, shown graphically in the original, map local neighborhood patterns around the current actor (t) and its neighbors (d) to the moves NE, SE, E, and S.]

3.5.7 Remarks on policy correctness and scalability

Some learned policies are indeed correct, if less than perfect, locomotion gaits. Their correctness or acceptability is determined by an analysis of learned parameters. Consider the stochastic policy to which GAPS converged in table 3.3. The table only shows those observation-action pairs where θ(o, a) >> θ(o, a′) for all a′ and where executing a results in motion. This policy is a local optimum in policy space — a small change in any θ(o, a) will lead to less reward. It was found by the learner on a less than ideal run and it is different from the global optimum in table 3.2. We argue that this policy will still correctly perform the task of eastward locomotion with high probability as θ(o, a) gets larger for the actions a shown.

Note first of all that if the rules in table 3.3 were deterministic, then we could make an argument akin to the one in Butler et al. (2001), where correctness is proven for a hand-designed policy. Intuitively, if we start with a rectangular array of modules and assume that each module gets a chance to execute an action during a turn, then some rule can always be applied, and none of the rules move any modules west, so that eastward locomotion will always result. This crucially depends on our assumption of synchronous turn-taking. Figure 3-11 shows the first several actions of the only possible cyclic sequence for nm modules following these rules if treated as deterministic. The particular assignment of module IDs in the figure is chosen for clarity and is not essential to the argument.

Table 3.3: Learned rules for eastward locomotion: a locally optimal gait. [The rules, shown graphically in the original, map local neighborhood patterns around the current actor (t) and its neighbors (d) to the moves N, NE, E, SE, and S.]

Figure 3-11: Locomotion of nm modules following the rules in table 3.3, treated as deterministic. After the last state shown, module n executes action S, modules 1 then 2 execute action E, then module 4 can execute action NE, and so forth. The length of the sequence is determined by the dimensions n and m of the original rectangular shape.

However, the rules are stochastic. The probability of the robot's center of mass moving eastward over the course of an episode is equal to the probability, where there are T turns in an episode, that during t out of T turns the center of mass moved eastward and t > T − t. As θ(o, a) → ∞, π(o, a, θ) → 1 for the correct action a; thus t will also grow and we will expect correct actions. And when the correct actions are executed, the center of mass is always expected to move eastwards during a turn. Naturally, in practice, once the algorithm has converged, we can extract deterministic rules from the table of learned parameters by selecting the highest parameter value per state. However, it has also been shown (Littman 1994) that for some POMDPs the best stochastic state-free policy is better than the best deterministic one. It may be that randomness could help in our case also.

While some locally optimal policies will indeed generate locomotion gaits like the one just described, others will instead form protrusions in the direction of higher reward without moving the whole robot. Figure 3-12 shows some configurations resulting from such policies — the prevalence of these will grow with the robot size. In the next chapter, we examine in detail several strategies for constraining the policy search and providing more information to the learning agents in an effort to avoid local optima that do not correspond to acceptable locomotion gaits.

Figure 3-12: Some locally optimal policies result in robot configurations that are not locomotion gaits.

no. modules    no. bad configs
15             0.1 ± 0.1
20             5 ± 1.7
100            10 ± 0.0

Figure 3-13: Scalability issues with GAPS: 15, 20 and 100 modules. (a) Smoothed (10-point window) average reward over 10 learning trials. (b) Mean number of stuck configurations, out of 10 test trials for each of 10 learned policies after 10,000 episodes of learning, with standard error.

Local optima, especially those that do not correspond to acceptable locomotion gaits, become more of a problem for larger robots consisting of more modules. Figure 3-13a shows that 20 modules running the same GAPS algorithm do not perform as well as 15 modules. If we increase the size of the robot to 100 modules, no significant increase in reward over time occurs. The table in figure 3-13b demonstrates that the robot gets stuck in locally optimal but unacceptable configurations more often as the number of modules grows.


no. modules    no. bad configs    no. bad policy
15             0.1 ± 0.1          2 ± 1.3
20             0.9 ± 0.8          3 ± 1.5

Table 3.4: Results of introducing a stability requirement into the simulator: mean number of non-gaits, out of 10 test runs, for 10 learned policies, with standard error. (“Bad policy” refers to tests in which the modules stopped moving after a while despite being in an acceptable configuration; this effect is probably due to the reward structure in these experiments: since unstable configurations are severely penalized, it is possible that the modules prefer, when faced with a local neighborhood that previously led to an unstable configuration, to not do anything.)


Figure 3-14: Results of introducing a stability constraint into the simulator: two non-gait configurations.

At this point we might note that some of the undesirable configurations shown in figure 3-12 would not be possible on any physical robot, since they are not stable. Although we explicitly state our intent to work only with abstract rules of motion, we might consider introducing another rule enforced by the simulator: (3) the robot's center of mass must be within its footprint. In a series of experiments, we let the modules learn in this new environment, in which any episode that breaks rule (3) is immediately terminated with a reward of -10 to all modules (centralized factored learning).
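A minimal sketch of such a stability check, under the same hypothetical occupancy-set representation used in the earlier sketches, treating the footprint as the x-extent of the modules resting on the ground:

```python
def is_statically_stable(occupied):
    """Rule (3) sketch: the x coordinate of the center of mass must lie within
    the footprint spanned by the modules touching the ground (y == 0)."""
    ground = [x for x, y in occupied if y == 0]
    if not ground:
        return False
    com_x = sum(x for x, _ in occupied) / len(occupied)
    return min(ground) <= com_x <= max(ground)

# Example: an overhang whose center of mass lies past its single ground contact.
print(is_statically_stable({(0, 0), (0, 1), (1, 1), (2, 1), (3, 1)}))  # False
```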

Hypothesis: Penalizing unstable configurations will reduce the number of dysfunctional (stuck) locally optimal configurations in GAPS.

The results of the experiments (table 3.4) show that non-gait local optima still exist, even though unstable long protrusions are no longer possible. Some such configurations can be seen in figures 3-14a and 3-14b. For the rest of this thesis, we return to the original two rules of the simulator.

3.6 Key issues in gradient ascent for SRMRs

Whenever stochastic gradient ascent algorithms are used, three key issues affect both the speed of learning and the resulting concept or policy on which the algorithm converges. These issues are:


number of parameters The more parameters we have to estimate, the more data or experience is required (sample complexity), which results in slower learning for a fixed amount of experience per history (episode of learning).

amount of experience Conversely, the more data or experience is available to the agents per learning episode, the faster learning will progress.

starting point Stochastic gradient ascent will only converge to a local optimum in policy space. Therefore the parameter settings at which the search begins matter for both the speed of convergence and the goodness of the resulting policy.

In lattice-based self-reconfiguring modular robots, these three variables are affected by a number of robot and problem parameters. Of the three, the starting point is the most straightforward: the closer the initial policy parameters lie to a good locally optimal policy, the faster the robot will learn to behave well.

3.6.1 Number of parameters to learn

The more parameters need to be set during learning, the more knobs the agent needs to tweak, the more experience is required, and the slower the learning process becomes. In lattice-based SRMRs, the following problem setup and robot variables affect the number of policy parameters that need to be learned.

Policy representation

Clearly, the number of possible observations and executable actions determines the number of parameters in a full tabular representation. In a feature-based representation, this number can be reduced. However, we have seen in this chapter that recasting the policy space into a number of log-linearly compositional features is not a straightforward task. Simple reduction in the number of parameters does not guarantee a more successful learning approach. Instead, good policies may be excluded from the reduced space representation altogether. There may be more local optima that do not correspond to acceptable policies at all, or fewer optima overall, which would result in a harder search problem.

The number of parameters to set can also be reduced by designing suitable pre-processing steps on the observation space of each agent.

Robot size

In modular robots, especially self-reconfiguring modular robots, the robot size influences the effective number of parameters to be learned. We can see this relationship clearly in our case study of locomotion by self-reconfiguration on a 2D lattice-based robot. The full tabular representation of a policy in this case involves 2^8 possible local observations and 9 possible actions, for a total of 256 × 9 = 2,304 parameters. However, if the robot is composed of only two modules, learning to leap-frog over each other in order to move East, then effectively each agent only sees 4 possible observations, as shown in figure 3-15a. If there are only three modules, there are only 18 possible local observations. Therefore the number of parameters that the agent has to set will be limited to 4 × 9 = 36 or 18 × 9 = 162, respectively. This effective reduction will result in faster learning for smaller robots composed of fewer modules.

Figure 3-15: All possible local observations for (a) two modules, and (b) three modules, other than those in (a).

Search constraints

Another way of reducing the effective number of parameters to learn is to impose constraints on the search by stochastic gradient ascent. This can be achieved, for example, by confining exploration to only a small subset of actions.

3.6.2 Quality of experience

The amount and quality of available experience will influence both the speed and reliability of learning by stochastic gradient ascent. Specifically, experience during the learning phase should be representative of the expected situations for the rest of the robot lifetime, in order for the learned policy to generalize properly.

Episode length and robot size

For learning by GAPS, there is a straightforward correlation between the number of timesteps in an episode and the quality of the updates in that episode. Longer episodes result in larger counts of observations and actions, and therefore in better estimates of the "surprise factor" between the expected and actual actions performed.


In SRMRs, robot size also contributes to the relationship between episode length and quality of learning updates. In particular, for our case study task of locomotion by self-reconfiguration, there is a clear dependency between the number of modules in a robot and the capability of the simple reward function to discriminate between good and bad policies. To take an example, there is a clear difference between 15 modules executing a good locomotion gait for 50 timesteps and the same number of modules forming an arm-like protrusion, such as the one shown in figure 3-12a. This difference is visible both to us and to the learning agent through the reward signal (center of mass displacement along the x axis), which will be much greater in the case of a locomotion gait. However, if 100 modules followed both policies for 50 timesteps, then, due to the nature of the threadlike locomotion gaits, the robot would not be able to differentiate between what we perceive as a good locomotion policy and the start of an arm-like protrusion, based on reward alone. Therefore, as robot size increases, we must either provide more sophisticated reward functions, or increase episode length to increase the amount of experience per episode for the learning modules.

Shared experience

We have seen that centralized factored GAPS can learn good policies while fully distributed GAPS cannot. This is due to the amount and quality of experience the modules have access to during learning. In the centralized version, each module makes the same updates as all others as they share all experience: the sum total of all encountered observations and all executed actions plays into each update. In the distributed version, each module is on its own for collecting experience and only has access to its own observation and action counts. This results in extremely poor to nonexistent exploration for most modules, especially those initially placed at the center of the robot. If it were possible for modules to share their experience in a fully distributed implementation of GAPS, we could expect successful learning due to the increase in both amount and quality of experience available to each learner.

3.7 Summary

We have formulated the locomotion problem for a SRMR as a multi-agent POMDP and applied gradient-ascent search in policy value space to solve it. Our results suggest that automating controller design by learning is a promising approach. We should, however, bear in mind the potential drawbacks of direct policy search as the learning technique of choice.

As with all hill-climbing methods, there is a guarantee of GAPS converging to a local optimum in policy value space, given infinite data, but no proof of convergence to the global optimum is possible. A local optimum is the best solution we can find to a POMDP problem. Unfortunately, not all local optima correspond to reasonable locomotion gaits.

In addition, we have seen that GAPS takes on average a rather long time (measured in thousands of episodes) to learn. We have identified three key issues that contribute to the speed and quality of learning in stochastic gradient ascent algorithms such as GAPS, and we have established which robot parameters can contribute to the make-up of these three variables. In this chapter, we have already attempted, unsuccessfully, to address one of the issues, the number of policy parameters to learn, by introducing feature spaces. In the next two chapters, we explore the influence of robot size, search constraints, episode length, information sharing, and smarter policy representations on the speed and reliability of learning in SRMRs. The goal to keep in sight as we report the results of those experiments is to find the right mix of automation and easily available constraints and information that will help guide the automated search for good distributed controllers.


Chapter 4

How Constraints and Information Affect Learning

We have established that stochastic gradient ascent in policy space works in principle for the task of locomotion by self-reconfiguration. In particular, if modules can somehow pool their experience together and average their rewards, provided that there are not too many of them, the learning algorithm will converge to a good policy. However, we have also seen that even in this centralized factored case, increasing the size of the robot uncovers the algorithm's susceptibility to local optima which do not correspond to acceptable policies. In general, local search will be plagued by these unless we can provide either a good starting point, or guidance in the form of constraints on the search space.

Modular robot designers are usually well placed to provide either a starting point or search constraints, as we can expect them to have some idea about what a reasonable policy, or at least parts of it, should look like. In this chapter, we examine how specifying such information or constraints affects learning by policy search.

4.1 Additional exploration constraints

In an effort to reduce the search space for gradient-based algorithms, we are looking for ways to give the learning modules some information that is easy for the designer to specify yet will be very helpful in narrowing the search. An obvious choice is to let the modules pre-select actions that can actually be executed in any one of the local configurations.

During the initial hundreds of episodes where the algorithm explores the policy space, the modules will attempt to execute undesirable or impossible actions which could lead to damage on a physical robot. Naturally, one may not have the luxury of thousands of trial runs on a physical robot anyway.

Each module will know which subset of actions it can safely execute given any local observation, and how these actions will affect its position; yet it will not know what new local configuration to expect when the associated motion is executed. Restricting search to legal actions is useful because it (1) effectively reduces the number of parameters that need to be learned and (2) causes the initial exploration phases to be more efficient because the robot will not waste its time trying out impossible actions. The second effect is probably more important than the first.

Figure 4-1: Determining if actions are legal for the purpose of constraining exploration: (a) A = {NOP}, (b) A = {NOP}, (c) A = {NOP}, (d) A = {2(NE), 7(W)}.

The following rules were used to pre-select the subset A^i_t of actions possible for module i at time t, given the local configuration as the immediate Moore neighborhood (see also figure 4-1):

1. A^i_t = {NOP}¹ if three or more neighbors are present at the face sites

2. A^i_t = {NOP} if two neighbors are present at opposite face sites

3. A^i_t = {NOP} if module i is the only "cornerstone" between two neighbors at adjacent face sites²

4. A^i_t = {legal actions based on local neighbor configuration and the sliding cube model}

5. A^i_t = A^i_t − any action that would lead into an already occupied cell

These rules are applied in the above sequence and incorporated into the GAPS algorithm by setting the corresponding θ(o_t, a_t) to a large negative value, thereby making it extremely unlikely that actions not in A^i_t would be randomly selected by the policy. Those parameters are not updated, thereby constraining the search at every time step (a short code sketch of this masking follows the footnotes below).

¹NOP stands for 'no operation' and means the module's action is to stay in its current location and not attempt any motion.

²This is a very conservative rule, which prevents disconnection of the robot locally.
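The following minimal Python sketch shows one way the pre-screening could be folded into a Boltzmann (softmax) GAPS policy by pinning the parameters of disallowed actions to a large negative value, as described above. It is not the thesis implementation; the observation encoding and the legal_actions pre-selection are assumed to come from the rules listed earlier.

    import numpy as np

    BIG_NEGATIVE = -1e6  # plays the role of the "large negative value" above

    def masked_action_probabilities(theta_o, legal_actions, beta=1.0):
        # theta_o: per-action parameters for the current local observation
        # legal_actions: action indices surviving the rule-based pre-screening (A^i_t)
        prefs = np.full(len(theta_o), BIG_NEGATIVE)
        prefs[legal_actions] = theta_o[legal_actions]
        prefs = beta * prefs
        prefs -= prefs.max()              # numerical stability
        probs = np.exp(prefs)
        return probs / probs.sum()

    # Example: 9 actions; only NOP (index 0) and one motion (index 7) are legal here.
    theta_o = np.zeros(9)
    p = masked_action_probabilities(theta_o, legal_actions=[0, 7], beta=3.0)
    action = np.random.choice(9, p=p)

Because the pinned parameters never enter the counters, the subsequent GAPS update leaves them untouched, which is exactly the "not updated" behavior described in the text.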


We predict that the added space structure and constraints that were introduced here will result in the modular robot finding good policies with less experience.

4.2 Smarter starting points

Stochastic gradient ascent can be sensitive to the starting point in the policy space from which search initiates. Here, we discuss two potential strategies for seeding the algorithms with initial parameters that may lead to better policies.

4.2.1 Incremental GAPS learning

It is possible to leverage the modular nature of our platform in order to improve the convergence rate of the gradient ascent algorithm and reduce the amount of experience required by seeding the learning in an incremental way. As we have seen in chapter 3, the learning problem is easier when the robot has fewer modules than the size of its neighborhood, since it means that each module will see and have to learn a policy for a smaller number of observations.

If we start with only two modules, and add more incrementally, we effectively reduce the problem search space. With only two modules, given the physical coupling between them, there are only four observations to explore. Adding one other module means adding another 14 possible observations, and so forth. The problem becomes more manageable. Therefore, we have proposed the Incremental GAPS (IGAPS) algorithm.

IGAPS works by initializing the parameters of N modules' policy with those resulting from N − 1 modules having run GAPS for a number of episodes, but not until convergence, to allow for more exploration in the next (N'th) stage. Suppose a number of robotic agents need to learn a task that can also be done in a similar way with fewer robots. We start with the smallest possible number of robots; in our case of self-reconfiguring modular robots we start with just two modules learning to locomote, which run the GAPS algorithm. After the pre-specified number of episodes, their parameter values will be θ^c_2. Then a new module is introduced, and the resulting three agents learn again, starting from θ^c_2 and using GAPS for a number of episodes. Then a fourth module is introduced, and the process is repeated until the required number of modules has been reached, and all have run the learning algorithm to convergence of policy parameters.
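A minimal sketch of this incremental schedule is given below; run_gaps is a hypothetical function standing in for a GAPS training run of a given robot size, so the snippet only illustrates the seeding pattern, not the thesis code.

    def incremental_gaps(run_gaps, theta_init, target_size, episodes_per_stage):
        # run_gaps(theta, n, episodes) is assumed to train an n-module robot by GAPS,
        # starting from parameters theta, and to return the updated parameters.
        theta = theta_init
        for n in range(2, target_size + 1):
            # Each stage starts from the previous stage's parameters and is stopped
            # before convergence, leaving room for exploration at the next size.
            theta = run_gaps(theta, n, episodes_per_stage)
        return theta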

4.2.2 Partially known policies

In some cases the designer may be able to inform the learning algorithm by starting the search at a "good" point in parameter space. Incremental GAPS automatically finds such good starting points for each consecutive number of modules. A partially known policy can be represented easily in parameter space, in the case of representation by lookup table, by setting the relevant parameters to considerably higher values.


Figure 4-2: Smoothed (100-point moving window), downsampled average rewards comparing the basic GAPS algorithm versus GAPS with pre-screening for actions resulting in legal moves, with standard error: (a) 15 modules (50 trials in basic condition, 30 trials with pre-screening), (b) 20 modules (20 trials in each condition).

Starting from a partially known policy may be especially important for problems where experience is scarce.

4.3 Experiments with pooled experience

As in section 3.5.2, the experimental results presented here were obtained during 10 independent learning trials, assuming all modules learn and execute the same policy from the same reward signal, which is the displacement of the robot's center of mass along the x axis during the episode.

Unless noted otherwise, in all conditions the learning rate started at α = 0.01, decreased over the first 1,500 episodes, and remained at its minimal value of 0.001 thereafter. The inverse temperature parameter started at β = 1, increased over the first 1,500 episodes, and remained at its maximal value of 3 thereafter. This ensured more exploration and larger updates in the beginning of each learning trial. These parameters were selected by trial and error, as the ones consistently generating the best results. For clarity of comparison between different experimental conditions, we smoothed the rewards obtained during each learning trial with a 100-point moving average window, then downsampled each run, in order to smooth out variability due to ongoing exploration as the modules learn. We report smoothed, downsampled average reward curves with standard error bars (σ/√N, where N is the number of trials in the group) every 1,000 episodes.
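For reference, a small Python/NumPy sketch of the reporting pipeline described above (moving-average smoothing, downsampling, and standard error across trials); the window and sampling interval are the values quoted in the text, but the function itself is only an illustration.

    import numpy as np

    def smooth_and_downsample(reward_runs, window=100, every=1000):
        # reward_runs: array of shape (num_trials, num_episodes)
        runs = np.asarray(reward_runs, dtype=float)
        kernel = np.ones(window) / window
        smoothed = np.vstack([np.convolve(r, kernel, mode="valid") for r in runs])
        idx = np.arange(0, smoothed.shape[1], every)                  # downsample
        sampled = smoothed[:, idx]
        mean = sampled.mean(axis=0)
        stderr = sampled.std(axis=0, ddof=1) / np.sqrt(runs.shape[0])  # sigma / sqrt(N)
        return idx + window - 1, mean, stderr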

4.3.1 Pre-screening for legal motions

We then tested the effect of introducing specific constraints into the learning problem. We constrain the search by giving the robot partial a priori knowledge of the effect of its actions. The module will be able to tell whether any given action will generate a legal motion onto a new lattice cell, or not. However, the module still will not know which state (full configuration of the robot) this motion will result in. Nor will it be able to tell what it will observe locally at the next timestep. The amount of knowledge given is really minimal. Nevertheless, with these constraints the robot no longer needs to learn not to try and move into lattice cells that are already occupied, or those that are not immediately attached to its neighbors. Therefore we predicted that less experience would be required for our learning algorithms.

Figure 4-3: Some local optima which are not locomotion gaits, found by GAPS with pre-screening for legal actions and (a)-(f) 1-hop communications, (g)-(h) multi-hop communications.

Hypothesis: Constraining exploration to legal actions only will result in faster learning (shorter convergence times).

In fact, figure 4-2 shows that there is a marked improvement, especially as the number of modules grows.

4.3.2 Scalability and local optima

Stochastic gradient ascent algorithms are guaranteed to converge to a local maximum in policy space. However, those maxima may represent globally suboptimal policies. Figure 3-12 shows execution snapshots of some locally optimal policies, where the robot is effectively stuck in a configuration from which it will not be able to move. Instead of consistently sending modules up and down the bulk of the robot in a thread-like gait that we have found to be globally optimal given our simple reward function, the robot develops arm-like protrusions in the direction of larger rewards. Unfortunately, for any module in that configuration, the only locally good actions, if any, will drive them further along the protrusion, and no module will attempt to go down and back as the rewards in that case will immediately be negative.

                                             15 modules    20 modules
    GAPS                                     0.1±0.1       5±1.7
    GAPS with pre-screened legal actions
      no extra comms                         0.3±0.3       4.2±1.5
      1-hop                                  0.3±0.2       2.5±1.3
      multi-hop                              0.3±0.2       0.2±0.1

Table 4.1: In each of 10 learning trials, the learning was stopped after 10,000 episodes and the resulting policy was tested 10 times. The table shows the mean number of times, with standard error, during these 10 test runs, that modules became stuck in some configuration due to a locally optimal policy.

Figure 4-4: Smoothed (100-point window), downsampled average rewards comparing the basic GAPS algorithm to GAPS with pre-screening for legal actions and an extra bit of communicated observation through a 1-hop or a multi-hop protocol, with standard error: (a) for 15 modules (50 basic GAPS trials, 30 GAPS with pre-screening trials, 20 trials each communications), (b) for 20 modules (20 trials each basic GAPS and pre-screening, 10 trials each communications).

Local optima are always present in policy space, and empirically it seems that GAPS, with or without pre-screening for legal motions, is more likely to converge to one of them as the number of modules comprising the robot increases.

Having encountered this problem in scaling up the modular system, we introduce new information to the policy search algorithms to reduce the chance of convergence to a local maximum.


4.3.3 Extra communicated observations

In our locomotion task, locally optimal but globally suboptimal policies do not allow modules to wait for their neighbors to form a solid supporting base underneath them. Instead, they push ahead, forming long protrusions. We could reduce the likelihood of such formations by introducing new constraints, artificially requiring modules to stop moving and wait in certain configurations. For that purpose, we expand the local observation space of each module by one bit, which is communicated by the module's South/downstairs neighbor, if any. The bit is set if the South neighbor is not supported by the ground or another module underneath. If a module's bit is set, it is not allowed to move at this timestep. We thus create more time in which other modules may move into the empty space underneath to fill in the base. Note that although we have increased the number of possible observations by a factor of 2 by introducing the extra bit, we are at the same time restricting the set of legal actions in half of those configurations to {NOP}. Therefore, we do not expect GAPS to require any more experience to learn policies in this case than before.

We investigate two communication algorithms for the setting of the extra bit. If the currently acting module is M1 and it has a South neighbor M2, then either 1) M1 asks M2 if it is supported; if not, M2 sends the set-bit message and M1's bit is set (this is the one-hop scheme), or 2) M1 generates a support request that propagates South until either the ground is reached, or one of the modules replies with the set-bit message and all of their bits are set (this is the multi-hop scheme).
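As an illustration of the multi-hop scheme, here is a sketch of the chain-of-support query from one module's point of view, written as a recursive walk instead of message passing; south_neighbor and on_ground are hypothetical accessors into a simulated lattice, not the thesis's interfaces.

    def column_supported(module, south_neighbor, on_ground):
        # Follow South neighbors until the ground is reached (supported)
        # or the chain ends in mid-air (not supported).
        below = south_neighbor(module)
        if below is None:
            return on_ground(module)
        return column_supported(below, south_neighbor, on_ground)

    def wait_bit(module, south_neighbor, on_ground):
        # The extra observation bit: set (forcing NOP) when the South neighbor's
        # column is not anchored to the ground.
        below = south_neighbor(module)
        if below is None:
            return 0                      # nothing below to wait for
        return 0 if column_supported(below, south_neighbor, on_ground) else 1

The one-hop scheme corresponds to stopping after the first step of this walk instead of following the chain all the way down.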

Hypothesis: Introducing an additional bit of observation communicated by the South neighbor will reduce the number of non-gait local optima found by GAPS.

Experimentally (see table 4.1) we find that it is not enough to ask just one neighbor. While the addition of a single request on average halved the number of stuck configurations per 10 test trials for policies learned by 20 modules, the chain-of-support multi-hop scheme generated almost no such configurations. On the other hand, the addition of communicated information and waiting constraints does not seem to affect the amount of experience necessary for learning. Figure 4-4a shows the learning curves for both sets of experiments with 15 modules to be practically indistinguishable.

However, the average obtained rewards should be lower for algorithms that consistently produce more suboptimal policies. This distinction is more visible when the modular system is scaled up. When 20 modules run the four proposed versions of gradient ascent, there is a clear distinction in average obtained reward, with the highest value achieved by policies resulting from multi-hop communications (see figure 4-4b). The discrepancy reflects the greatly reduced number of trial runs in which modules get stuck, while the best found policies remain the same in all conditions.

A note on directional information

Table 4.1 shows that even with the extra communicated information in the multi-hop scenario, it is still possible for modules to get stuck in non-gait configurations. Might we be able to reduce the number of such occurrences to zero by imparting even more information to the modules? In particular, if the modules knew the required direction of motion, would the number of "stuck" configurations be zero?

Figure 4-5: Two global configurations that are aliased assuming only local, partial observations, even in the presence of directional information: (a) module M is really at the front of the robot and should not move, (b) module M should be able to move despite having no "front" neighbors.

In anticipation of chapter 6, where we introduce a sensory model allowing modules to perceive local gradients, let us assume that each module knows the general direction (eastward) of desired locomotion. It can still only use this new directional information in a local way. For example, we can constrain the module to only attempt motion if its column is supported by the ground, as discussed earlier, and if it perceives itself locally to be on the "front" of the robot with respect to the direction of desired motion. Even then, the module cannot differentiate locally between the two situations depicted in figure 4-5, and therefore the number of stuck configurations may not be assumed to go to zero even with directional information, so long as the condition of local, partial observation is preserved. Other communication schemes to differentiate between such situations may be devised, but they are outside the scope of this thesis.

Empirically, we ran 10 trials of GAPS with pre-screening for legal actions and multi-hop communications, with additional directional information available to 15 and 20 modules. Each module decided whether or not to remain stationary according to the following rule: if its column is not supported by the ground, as returned by the chain of communications, and if it sees no neighbors on the right (that is, the front), then the module chooses the NOP action. Otherwise, an action is picked according to the policy and legal pre-screening. Each of the resulting 10 policies for both robot sizes was then tested 10 times. On average, 15 modules got stuck 0.2±0.2 times out of 10 test runs, and 20 modules got stuck 0.3±0.2 times in a non-gait configuration.

The stochastic nature of the policies themselves and of the learning process makes it impossible for us to devise strategies where zero failures is guaranteed, short of writing down a complete policy at the start. The best we can do is minimize failure by reasonable means. For the rest of this chapter, and until chapter 6, we will not consider any directional information to be available to the modules.


                  height=3      height=4
    # modules     baseline      baseline     multi-hop
    15            0.1±0.1       0.2±0.1      -
    16            0.1±0.1       5.4±1.3      2.1±1.0
    17            0.1±0.1       3.9±1.5      2.5±1.3
    18            1.0±1.0       1.5±1.1      0.5±0.3
    19            0.2±0.1       3.3±1.3      1.1±1.0
    20            0.0±0.0       5.0±1.7      0.2±0.1

Table 4.2: Mean number of non-gait stuck configurations per 10 test trials, with standard error, of 10 policies learned by modules running centralized GAPS with (multi-hop) or without (baseline) pre-screening for legal actions and multi-hop communications.

4.3.4 Effect of initial condition on learning results

We have established that the larger 20-module robot does not in general learn as fast or as well as the smaller 15-module robot. However, it is important to understand that there is no simple relationship between the number of robot modules and the quality of the learned policies. We do, however, see the following effects:

Importance of initial condition The same number of modules started in different configurations will likely lead to different learning times and resulting policies.

Small size effect When the robot is composed of up to 10 modules, in reasonable initial configurations, the learning problem is relatively easy due to a limited number of potential global configurations.

Large size effect Conversely, when the robot is composed of upwards of 100 modules, the sheer scale of the problem results in a policy space landscape with drastically more local optima which do not correspond to locomotion gaits. In this case, even with the best initial conditions, modules will have trouble learning in the basic setup. However, biasing the search and scaling up incrementally, as we explore in this chapter, will help.

Therefore, when comparing different algorithms and the effect of extra constraints and information throughout this thesis, we have selected two examples to demonstrate these effects: 1) 15 modules in a lower aspect ratio initial configuration (height/width = 3/5), which is helpful for horizontal locomotion, and 2) 20 modules in a higher aspect ratio initial condition (height/width = 4/5), which creates more local-optima-related problems for the learning process. The initial configurations were chosen to most closely mimic a tight packing for transportation in rectangular "boxes".

In table 4.2, we report one of the performance measures used throughout this thesis (the mean number of non-gait configurations per 10 test runs of a policy) for all robot sizes between 15 and 20 modules, in order to demonstrate the effect of the initial condition. In all cases, at the start of every episode the robot's shape was a rectangle or a rectangle with an appendage, depending on the number of modules and the required height. For example, 16 modules with a height of 4 were initially in a 4 × 4 square configuration, but with a height of 3, they were placed in a 5 × 3 + 1, the last module standing on the ground line attached to the rest of the robot on its side. We see that all robot sizes do almost equally well when they have a low aspect ratio (height=3). When the initial aspect ratio is increased (height=4), there is a much higher variance in the quality of policies, but in most cases the same number of modules learn far worse than before. We can also see that introducing legality constraints and multi-hop communications in the latter case of the taller initial configuration helps reduce the number of non-gait configurations across the board.

                  25 modules    90 modules    100 modules
    height=3      1.1±1.0       9±1.0         9.4±0.4

Table 4.3: Larger robots get stuck with locally optimal non-gait policies despite a lower aspect ratio of the initial configuration.

While the initial condition is important to policy search, it does not account for all effects associated with robot size: the number of modules also matters, especially when it increases further. Table 4.3 shows that much larger robots still find more non-gait policies.

Clearly, both initial conditions and robot size influence the learning process. In the rest of this thesis, with the exception of experiments in incremental learning in section 4.3.5, we will continue to compare the performance of algorithms, representations and search constraints for two points: 15 and 20 modules. This allows us to test our approach on robots with different aspect ratios. The choice is reasonable, as part of the motivation for modular robots is to see them reconfigure into useful shapes after being tightly packed for transportation. Nevertheless we should bear in mind that initial conditions, not only in terms of the point in policy space where search starts (we will turn to that next), but also in terms of the robot configuration, will matter for learning performance.

4.3.5 Learning from better starting points

Hypothesis: Incremental GAPS will result in both faster learning and fewer undesirable local optima, as compared with standard GAPS.

Figure 4-6 shows how incremental GAPS performs on the locomotion task. Its effect is most powerful when there are only a few modules learning to behave at the same time; when the number of modules increases beyond the size of the local observation space, the effect becomes negligible. Statistical analysis fails to reveal any significant difference between the mean rewards obtained by basic GAPS vs. IGAPS for 6, 15, or 20 modules. Nevertheless, we see in figure 4-7b that only 2.1 out of 10 test runs on average produce arm-like protrusions when the algorithm was seeded in the incremental way, against 5 out of 10 for basic GAPS. The lack of statistical significance on the mean rewards may be due to the short time scale of the experiments: in 50 timesteps, 20 modules may achieve very similar rewards whether they follow a locomotion gait or build an arm in the direction of larger rewards. A drawback of the incremental learning approach is that it will only work to our advantage on tasks where the optimal policy does not change as the number of modules increases. Otherwise, IGAPS may well lead us more quickly to a globally suboptimal local maximum.

Figure 4-6: Smoothed, downsampled average rewards, with standard error, over 10 trials: comparison between basic GAPS and the incremental extension (a) for 6 modules and (b) for 15 modules.

We have taken the incremental idea a little further in a series of experiments where robot size was increased in larger increments. Starting with 4 modules in a 2 × 2 configuration, we have added enough modules at a time to increase the square's side length by one, eventually reaching a 10 × 10 initial configuration. The results can be seen in figure 4-8 and table 4.4. The number of locally optimal non-gait configurations was considerable for 100 modules, even in the IGAPS condition, but it was halved with respect to the baseline GAPS.

Figure 4-9 shows the results of another experiment in better starting points. Usually the starting parameters for all our algorithms are initialized to either small random numbers or all zeros. However, sometimes we have a good idea of what certain parts of our distributed controller should look like. It may therefore make sense to seed the learning algorithm with a good starting point by imparting to it our incomplete knowledge. We have made a preliminary investigation of this idea by partially specifying two good policy "rules" before learning started in GAPS. Essentially, we initialized the policy parameters to a very strong preference for the 'up' (North) action when the module sees neighbors only on its right (East), and a correspondingly strong preference for the 'down' (South) action when the module sees neighbors only on its left (West).
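A sketch of this seeding step, assuming a lookup-table parameterization indexed by (observation, action); the observation and action indices passed in, as well as the preference magnitude, are placeholders for illustration, not the thesis's actual encoding.

    import numpy as np

    NUM_OBSERVATIONS, NUM_ACTIONS = 2 ** 8, 9
    STRONG_PREFERENCE = 5.0     # assumed magnitude, large relative to beta in [1, 3]

    def seeded_parameters(obs_east_only, obs_west_only, north, south):
        # Start GAPS from a partially known policy: bias the "neighbors only to the
        # East" rows toward North and the "only to the West" rows toward South,
        # leaving every other parameter at zero.
        theta = np.zeros((NUM_OBSERVATIONS, NUM_ACTIONS))
        theta[obs_east_only, north] = STRONG_PREFERENCE
        theta[obs_west_only, south] = STRONG_PREFERENCE
        return theta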


              15 modules    20 modules
    GAPS      0.1±0.1       5±1.7
    IGAPS     0.3±0.3       2.1±1.3

Figure 4-7: Incremental GAPS with 20 modules: (a) Smoothed (100-point window), downsampled average rewards obtained by basic vs. incremental GAPS while learning locomotion; (b) The table shows the mean number, with standard error, of dysfunctional configurations, out of 10 test trials for 10 learned policies, after 15 and 20 modules learning with basic vs. incremental GAPS; IGAPS helps as the number of modules grows.

Hypothesis: Seeding GAPS with a partially known policy will result in faster learning and fewer local optima.

A two-way mixed-factor ANOVA on episode number (within-condition variation: tested after 1,500 episodes, 3,000 episodes, 6,000 episodes and 10,000 episodes) and algorithm starting point (between-condition variation) reveals that mean rewards obtained by GAPS with or without partial policy knowledge differ significantly (F = 19.640, p < .01) for 20 modules, but not for 15. Again, we observe an effect when the number of modules grows. Note that for 15 modules, the average reward is (not statistically significantly) lower for those modules initialized with a partially known policy. By biasing their behavior upfront towards vertical motion of the modules, we have guided the search away from the best-performing thread-like gait, which relies more heavily on horizontal motion. The effect will not necessarily generalize to any designer-determined starting point. In fact, we expect there to be a correlation between how many "rules" are pre-specified before learning and how fast a good policy is found.

                        non-gait locally optimal configurations
                        GAPS          incremental
    4x4 modules         5.4±1.3       1.3±0.9
    5x5 modules         10±0.0        1.7±1.2
    10x10 modules       10±0.0        4.3±1.6

Table 4.4: Mean number, with standard error, of locally optimal configurations which do not correspond to acceptable locomotion gaits out of 10 test trials for 10 learned policies: basic GAPS algorithm vs. incremental GAPS with increments by square side length. All learning trials had episodes of length T=50, but test trials had episodes of length T=250, in order to give the larger robots time to unfold.

Figure 4-8: Incremental learning by square side length: smoothed, downsampled average rewards for (a) 25 and (b) 100 modules learning eastward locomotion with basic vs. incremental GAPS. Rewards obtained in 50 timesteps by basic GAPS are greater than those of incremental GAPS for 100 modules; this highlights the effect of shorter episode lengths on learning, as policies learned by incremental GAPS are much more likely to be locomotion gaits (see text and table 4.4).

We expect extra constraints and information to be even more useful when experience is scarce, as is the case when modules learn independently in a distributed fashion without tying their policy parameters to each other.

4.4 Experiments with individual experience

In all of the experiments reported in section 3.5.2 the parameters of the modules' policies were tied; the modules were learning together as one, sharing their experience and their behavior. However, one of our stated goals for applying reinforcement learning to self-reconfigurable modular robots is to allow modules to adapt their controllers at run-time, which necessitates a fully distributed approach to learning, and individual learning agents inside every module. An important aspect of GAPS-style algorithms is that they can be run in such an individual fashion by agents in a multi-agent POMDP, and they will make the same gradient ascent updates provided that they receive the same experience. Tying policy parameters together essentially increased the amount of experience for the collective learning agent. Where a single module makes T observations and executes T actions during an episode, the collective learning agent receives nT observations and nT actions during the same period of time, where n is the number of modules in the robot. If we run n independent learners using the same GAPS algorithm on individual modules, we can therefore expect that it will take proportionately more time for individual modules to find good policies on their own.

Figure 4-9: Smoothed average rewards over 10 trials: effect of introducing a partially known policy for (a) 15 modules and (b) 20 modules.

Here we describe the results of experiments on modules learning individual policies without sharing observation or parameter information. In all cases, the learning rate started at α = 0.01, decreased uniformly over 7,000 episodes until 0.001 and remained at 0.001 thereafter. The inverse temperature started at β = 1, increased uniformly over 7,000 episodes until 3 and remained at 3 thereafter. The performance curves in this section were smoothed with a 100-point moving average and downsampled for clarity.

Hypothesis: Legality constraints and an extra observation bit communicated by neighbors will have a more dramatic effect in distributed GAPS.

We see in figure 4-10 that the basic unconstrained version of the algorithm is struggling in the distributed setting. Despite a random permutation of module positions in the start state before each learning episode, there is much less experience available to individual agents. Therefore we observe that legal action constraints and extra communicated observations help dramatically. Surprisingly, without any of these extensions, GAPS is also greatly helped by initializing the optimization with a partially specified policy. The same two "rules" were used in these experiments as in section 4.3.5. A two-way 4 × 2 ANOVA on the mean rewards after 100,000 episodes was performed with the following independent variables: information and constraints group (one of: none, legal pre-screening, 1-hop communications and multi-hop communications), and partially known policy (yes or no). The results reveal a statistically significant difference in means for both main factors, and for their simple interaction, at the 99% confidence level.³

³These results do not reflect any longitudinal data or effects.


Figure 4-10: Smoothed, downsampled average rewards over 10 trials with 15 modules learning to locomote using only their own experience, with standard error: (a) basic distributed GAPS vs. GAPS with pre-screening for legal motions vs. GAPS with pre-screening and 1-hop (green) or multi-hop (cyan) communications with neighbors below, and (b) basic distributed GAPS vs. GAPS with all actions and a partially specified policy as starting point.

Hypothesis: Seeding policy search with a partially known policy will help dramatically in distributed GAPS.

Figure 4-11a compares the average rewards during the course of learning in four conditions of GAPS, all initialized with the same partial policy: 1) basic GAPS with no extra constraints, 2) GAPS with pre-screening for legal actions only, 3) GAPS with legal actions and an extra observation bit communicated by the 1-hop protocol, and 4) by the multi-hop protocol. The curves in 1) and 4) are almost indistinguishable, whereas 2) is consistently lower. However, this very comparable performance may be due to the fact that there is not enough time in each episode (50 steps) to disambiguate between an acceptable locomotion gait and locally optimal protrusion-making policies.

Hypothesis: Test runs with longer episodes will disambiguate between the performance of policies learned with different sources of information. In particular, distributed GAPS with multi-hop communications will find many fewer bad local optima.

To test that hypothesis, in all 10 trials we stopped learning at 100,000 episodes and tested each resulting policy during 10 runs with longer episodes (150 timesteps). Figure 4-11b and table 4.5 show that multi-hop communications are still necessary to reduce the chances of getting stuck in a bad local optimum.


Figure 4-11: (a) Smoothed, downsampled average reward over 10 trials of 15 modules running algorithms seeded with a partial policy, with and without pre-screening for legal actions and extra communicated observations, with standard error; (b) Average rewards obtained by 15 modules with episodes of length T = 150 after 100,000 episodes of distributed GAPS learning seeded with a partially known policy, with various degrees of constraints and information, with standard deviation.

4.5 Discussion

The experimental results of section 4.3 are evidence that reinforcement learning can be used fruitfully in the domain of self-reconfigurable modular robots. Most of our results concern the improvement in SRMR usability through automated development of distributed controllers for such robots; and to that effect we have demonstrated that, provided with a good policy representation and enough constraints on the search space, gradient ascent algorithms converge to good policies given enough time and experience. We have also shown a number of ways to structure and constrain the learning space such that less time and experience is required and local optima become less likely. Imparting domain knowledge or partial policy knowledge requires more involvement from the human designer, and our desire to reduce the search space in this way is driven by the idea of finding a good balance between human designer skills and automatic optimization. If the right balance is achieved, the humans can seed the learning algorithms with the kinds of insight that are easy for us; and the robot can then learn to improve on its own.

We have explored two ways of imparting knowledge to the learning system. On the one hand, we constrain exploration by effectively disallowing those actions that would result in a failure of motion (sections 4.1 and 4.3.1) or any action other than NOP in special cases (section 4.3.3). On the other hand, we initialize the algorithm at a better starting point by an incremental addition of modules (sections 4.2.1 and 4.3.5) or by a partially pre-specified policy (section 4.3.5). These experiments suggest that a good representation is very important for learning locomotion gaits in SRMRs. Local optima abound in the policy space when observations consist of the 8 bits of the immediate Moore neighborhood. Restricting exploration to legal motions only did not prevent this problem. It seems that more information than what is available locally is needed for learning good locomotion gaits.

                                              15 modules    20 modules
                                              T=50          T=50        T=100
    Distributed GAPS                          10±0.0        10±0.0
    DGAPS with pre-screened legal actions
      1-hop comms                             1.3±0.2       10±0.0      9.6±0.4
      multi-hop comms                         0.9±0.3       9.6±0.4     3.7±1.1
    DGAPS with partially known policy
      no extra restrictions or info           1.2±0.4       10±0.0      9.9±0.1
      pre-screened legal actions              3.8±0.6       8.8±0.6     9.5±0.4
      + 1-hop                                 0.8±0.2       9.9±0.1     7.6±0.7
      + multi-hop                             0.3±0.3       4.7±1.0     1.3±0.3

Table 4.5: In each of the 10 learning trials, the learning was stopped after 100,000 episodes and the resulting policy was tested 10 times. The table shows the mean number of times, with standard error, during these 10 test runs for each policy, that modules became stuck in some configuration due to a locally optimal policy.

Therefore, as a means to mitigate the prevalence of local optima, we have introduced a very limited communications protocol between neighboring modules. This increased the observation space and potentially the number of parameters to estimate. However, more search constraints, in the form of restricting any motion if lack of support is communicated to the module, allowed modules to avoid the pitfalls of local optima substantially more often.

We have also found that local optima were more problematic the more modules were acting and learning at once; they were also preferentially found by robots which started from taller initial configurations (higher aspect ratio). The basic formulation of GAPS with no extra restrictions only moved into a configuration from which the modules would not move once in 100 test trials when there were 15 modules learning. When there were 20 modules learning, the number increased to almost half of all test trials. It is important to use resources and build supporting infrastructure into the learning algorithm in a way commensurate with the scale of the problem; more scaffolding is needed for harder problems involving a larger number of modules.

Restricting the search space and seeding the algorithms with good starting points is even more important when modules learn in a completely distributed fashion from their experience and their local rewards alone (section 4.4). We have shown that in the distributed case introducing constraints, communicated information, and partially known policies all contribute to successful learning of a locomotion gait, where the basic distributed GAPS algorithm has not found a good policy in 100,000 episodes. This is not very surprising, considering the amount of exploration each individual module needs to do in the unrestricted basic GAPS case. However, given enough constraints and enough experience, we have shown that it is possible to use the same RL algorithms on individual modules. It is clear that more work is necessary before GAPS is ported to physical robots: for instance, we cannot afford to run a physical system for 100,000 episodes of 50 actions per module each.

In the next chapter, we extend our work using the idea of coordination and communication between modules that goes beyond single-bit exchanges.


Chapter 5

Agreement in Distributed Reinforcement Learning

In a cooperative distributed reinforcement learning situation, such as in a self-reconfiguring modular robot learning to move, an individual learning agent only has access to what it perceives locally, including a local measure of reward. This observation goes against a common assumption in the RL literature (e.g., Peshkin et al. 2000, Chang et al. 2004) that the reward is a global scalar available unaltered to all agents¹. Consequently, where most research is concerned with how to extract a measure of a single agent's performance from the global reward signal², this is not a problem for locally sensing SRMR modules. A different problem, however, presents itself: the local reward is not necessarily a good measure of the agent's performance. For instance, it is possible that the agent's policy is optimal, but the bad choices made by others prevent it from ever finding itself in a configuration from which it can move.

Especially in the early stages of learning, the agents' policies will be near-random to ensure proper exploration. However, most modules will be stuck in the middle of the robot and will not have a chance to explore any actions other than NOP until the modules on the perimeter of the robot have learned to move out of the way.

The twin problems of limited local experience and locally observed but not necessarily telling reward signals are addressed in this chapter through the application of agreement algorithms (Bertsekas & Tsitsiklis 1997). We demonstrate below that agreement can be used to exchange both local reward and local experience among modules, and will result in a fully distributed GAPS implementation which learns good policies just as fast as the centralized, factored implementation.

¹This assumption is not made in research on general-payoff fully observable stochastic Markov games (e.g., Hu & Wellman 1998).

²David Wolpert famously compares identical-payoff multi-agent reinforcement learning to trying to decide what to do based on the value of the Gross Domestic Product (e.g., Wolpert et al. 1999).


5.1 Agreement Algorithms

Agreement (also known as consensus) algorithms are pervasive in distributed systems research. First introduced by Tsitsiklis et al. (1986), they have been employed in many fields including control theory, sensor networks, biological modeling, and economics. They are applicable in synchronous and partially asynchronous settings, whenever a group of independently operating agents can iteratively update variables in their storage with a view to converge to a common value for each variable. Flocking is a common example of an agreement algorithm applied to a mobile system of robots or animals: each agent maintains a velocity vector and keeps averaging it with its neighbors' velocities, until eventually all agents move together in the same direction.

5.1.1 Basic Agreement Algorithm

First, we quickly review the basic agreement algorithm, simplifying slightly from the textbook (chapter 7 of Bertsekas & Tsitsiklis 1997). Suppose there are N modules and therefore N processors, where each processor i is updating a variable x_i. For ease of exposition we assume that x_i is a scalar, but the results will hold for vectors as well. Suppose further that each i sends to a subset of others its current value of x_i at certain times. Each module updates its value according to a convex combination of its own and other modules' values:

\[ x_i(t+1) = \sum_{j=1}^{N} a_{ij}\, x_j(\tau^i_j(t)), \quad \text{if } t \in T^i, \]
\[ x_i(t+1) = x_i(t), \quad \text{otherwise}, \]

where the a_{ij} are nonnegative coefficients which sum to 1, T^i is the set of times at which processor i updates its value, and \tau^i_j(t) determines the amount of time by which the value x_j is outdated. If the graph representing the communication network among the processors is connected, and if there is a finite upper bound on communication delays between processors, then the values x_i will exponentially converge to a common intermediate value x such that x_min(0) ≤ x ≤ x_max(0) (part of Proposition 3.1 in Bertsekas & Tsitsiklis (1997)). Exactly what value x the processors will agree on in the limit will depend on the particular situation.
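A toy Python illustration of this update, in a synchronous special case with assumed equal weights over each processor's closed neighborhood, shows the convergence to a common value:

    import numpy as np

    def agreement_step(x, neighbors):
        # One synchronous round of x_i <- sum_j a_ij x_j, with equal weights
        # over each processor's closed neighborhood (itself plus its neighbors).
        x_new = np.empty_like(x)
        for i, nbrs in enumerate(neighbors):
            group = [i] + list(nbrs)
            x_new[i] = x[group].mean()
        return x_new

    # Four processors on a line; values converge to a common number
    # between the minimum and maximum of the initial values.
    neighbors = [[1], [0, 2], [1, 3], [2]]
    x = np.array([0.0, 1.0, 2.0, 10.0])
    for _ in range(100):
        x = agreement_step(x, neighbors)
    print(x)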

5.1.2 Agreeing on an Exact Average of Initial Values

It is possible to make sure the modules agree on the exact average of their initial values. This special case is accomplished by making a certain kind of updates: synchronous and pairwise disjoint updates.

Synchronicity means that updates are made at once across the board, such that all communicated values are out of date by the same amount of time. In addition, modules communicate and update their values in disjoint pairs of neighbors, equally and symmetrically:

\[ x_i(t+1) = x_j(t+1) = \tfrac{1}{2}\left( x_i(t) + x_j(t) \right). \]


In this version of agreement, the sum x_1 + x_2 + ... + x_N is conserved at every time step, and therefore all x_i will converge to the exact arithmetic mean of the initial values x_i(0). Actually, equal symmetrical averaging between all members of each fully connected clique of the graph representing neighboring processor communications will also be sum-invariant, such that all nodes agree on the exact average of initial values. This fact will be relevant to us later in this chapter, where we use agreement together with policy search in distributed learning of locomotion gaits for SRMRs.

In partially asynchronous situations, or if we would like to speed up agreement by using larger or overlapping sets of neighbors, we cannot in general say what value the processors will agree upon. In practice, pairwise disjoint updates lead to prohibitively long agreement times, while using all the nearest neighbors in the exchange works well.
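The sum-conserving pairwise variant is easy to check numerically; in this sketch (illustrative only), repeatedly averaging over randomly chosen disjoint pairs of a 4-module ring drives every value to the exact initial mean.

    import random
    import numpy as np

    def pairwise_average_round(x, disjoint_pairs):
        # Both members of each pair take the pair's mean, so sum(x) is conserved.
        x = x.copy()
        for i, j in disjoint_pairs:
            x[i] = x[j] = 0.5 * (x[i] + x[j])
        return x

    matchings = [[(0, 1), (2, 3)], [(1, 2), (3, 0)]]   # disjoint pairs on a 4-cycle
    x = np.array([4.0, 0.0, 2.0, 10.0])
    print(x.mean())                                    # 4.0, the exact average of x(0)
    for _ in range(100):
        x = pairwise_average_round(x, random.choice(matchings))
    print(x)                                           # every entry approaches 4.0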

5.1.3 Distributed Optimization with Agreement Algorithms

Agreement algorithms can also be used for gradient-style updates performed by several processors simultaneously in a distributed way (Tsitsiklis et al. 1986), as follows:

\[ x_i(t+1) = \sum_{j=1}^{N} a_{ij}\, x_j(\tau^i_j(t)) + \gamma\, s_i(t), \]

where \gamma is a small decreasing positive step size, and s_i(t) is in the possibly noisy direction of the gradient of some continuously differentiable nonnegative cost function. It is assumed that the noise is uncorrelated in time and independent for different i's.

However, GAPS is a stochastic gradient ascent algorithm, which means that the updates performed climb an estimate of the gradient of the policy value function, which may, depending on the quality of the agents' experience, be very different from the true gradient \nabla V, as shown in figure 5-1. A requirement of the partially asynchronous gradient-like update algorithm described in this section is that the update direction s_i(t) be in the same quadrant as \nabla V, but even this weak assumption is not necessarily satisfied by stochastic gradient ascent methods. In addition, the noise in the update direction is correlated in a SRMR, as modules are physically attached to each other. In the rest of this section, we nevertheless propose a class of methods for using the agreement algorithm with GAPS, and demonstrate their suitability for the problem of learning locomotion by self-reconfiguration.

5.2 Agreeing on Common Rewards and Experience

When individual modules learn in a completely distributed fashion, often most modules will make zero updates. However, if modules share their local rewards, all contributing to a common result, the agreed-upon R will only be equal to zero if no modules have moved at all during the current episode, and otherwise all modules will make some learning update. We expect that fact to speed up the learning process.


Figure 5-1: Stochastic gradient ascent updates are not necessarily in the direction of the true gradient.

Figure 5-2: Two implementations of agreement exchanges within the GAPS learning algorithm: (a) at the end of each episode, and (b) at each time step.

In addition, we expect the agreed-upon R to be a more accurate measure of how well the whole robot is doing, thereby also increasing the likelihood of better updates by all modules. This measure will be exactly right if the modules use synchronous, equal, symmetrical, pairwise disjoint updates, although this is harder to achieve and slower in practice.

A theorem in Peshkin (2001) states that a distributed version of GAPS, where each agent maintains and updates its own policy parameters, will make exactly the same updates as the centralized factored version of GAPS, provided the same experience and the same rewards. As we have seen, this proviso does not hold in general for the task of locomotion by self-reconfiguration. However, neighboring modules can exchange local values of reward to agree on a common value. What happens if modules also exchange local experience?

In the centralized GAPS algorithm, the parameter updates are computed with

71

Page 72: Distributed Reinforcement Learning for Self-Reconfiguring …people.csail.mit.edu/paulina/varshavskaya-phd-eecs-2007.pdf · 2007. 7. 27. · Paulina Varshavskaya Submitted to the

Algorithm 3 Asynchronous ID-based collection of experience. This algorithm can be run at the end of each episode before the learning update rule.
Require: unique id
  send message = ⟨id, R, C_o, C_oa⟩ to all neighbors
  loop
    if timeout then
      proceed to update
    end if
    whenever message m = ⟨id_m, R_m, C_o^m, C_oa^m⟩ arrives
      if id_m not seen before then
        store reward Rs(id_m) ← R_m
        add experience C_o += C_o^m, C_oa += C_oa^m
        resend m to all neighbors
      end if
  end loop
  update:
    average rewards R ← mean(Rs)
    GAPS update rule

the following rule:

    Δθ_oa = α R ( C_oa − C_o π_θ(o, a) ),

where the counters of experienced observations C_o and chosen actions per observation C_oa represent the sum of the modules' individual counters. If there is a way for the modules to obtain these sums through local exchanges, then the assumptions of the theorem are satisfied and we can expect individual modules to make the same updates as the global learner in the centralized case, and therefore to converge to a local optimum.
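As an illustration only (ours, not the thesis implementation), this update can be written in a few lines of numpy for a tabular Boltzmann (softmax) policy; the temperature beta, the function names and the array layout are our assumptions.

import numpy as np

def softmax_policy(theta, beta=1.0):
    # Boltzmann policy over actions for each observation (rows of theta)
    z = np.exp(beta * (theta - theta.max(axis=1, keepdims=True)))
    return z / z.sum(axis=1, keepdims=True)

def gaps_update(theta, C_o, C_oa, R, alpha=0.01, beta=1.0):
    """One GAPS update from experience counters.
    theta : (num_obs, num_actions) policy parameters
    C_o   : (num_obs,) counts of each observation during the episode
    C_oa  : (num_obs, num_actions) counts of each observation-action pair
    R     : scalar episode reward
    """
    pi = softmax_policy(theta, beta)
    return theta + alpha * R * (C_oa - C_o[:, None] * pi)

# tiny example: 4 observations, 3 actions
theta = np.zeros((4, 3))
C_oa = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 0], [0, 0, 0]], dtype=float)
C_o = C_oa.sum(axis=1)
theta = gaps_update(theta, C_o, C_oa, R=1.5)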

There are at least two possible ways of obtaining these sums. The first algorithm assumes a unique ID for every module in the system, a reasonable assumption if we consider physically instantiated robotic modules. At the end of each episode, each module transmits a message to its face neighbors, which consists of its ID and the values of all nonzero counters. Upon receiving such a message in turn, the module checks that the ID has not been seen before. If the ID is new, the receiving module will add the values from the message to its current sums. If not, the message is ignored. The only problem with this ID-based algorithm, shown also as Algorithm 3, is termination. We cannot assume that modules know how many of them comprise the system, and therefore must guess a timeout value beyond which a module will no longer wait for messages with new IDs and will proceed to update its parameters.
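The core logic of this ID-based flooding can be sketched in Python as follows (our own sketch; class and method names such as Module.receive are hypothetical, and message delivery, timeouts and the lattice itself are abstracted away; counters are kept as plain dictionaries).

from dataclasses import dataclass, field

@dataclass
class Message:
    sender_id: int
    reward: float
    C_o: dict          # sender's own observation counts
    C_oa: dict         # sender's own observation-action counts

@dataclass
class Module:
    uid: int
    reward: float
    own_C_o: dict
    own_C_oa: dict
    neighbors: list = field(default_factory=list)
    seen: set = field(default_factory=set)
    rewards: dict = field(default_factory=dict)
    sum_C_o: dict = field(default_factory=dict)
    sum_C_oa: dict = field(default_factory=dict)

    def start(self):
        # add our own reward and experience to the running sums, then flood them
        self.rewards[self.uid] = self.reward
        for o, n in self.own_C_o.items():
            self.sum_C_o[o] = self.sum_C_o.get(o, 0) + n
        for oa, n in self.own_C_oa.items():
            self.sum_C_oa[oa] = self.sum_C_oa.get(oa, 0) + n
        self._flood(Message(self.uid, self.reward,
                            dict(self.own_C_o), dict(self.own_C_oa)))

    def receive(self, msg):
        if msg.sender_id == self.uid or msg.sender_id in self.seen:
            return                                    # duplicate: ignore
        self.seen.add(msg.sender_id)
        self.rewards[msg.sender_id] = msg.reward
        for o, n in msg.C_o.items():                  # add in the new experience
            self.sum_C_o[o] = self.sum_C_o.get(o, 0) + n
        for oa, n in msg.C_oa.items():
            self.sum_C_oa[oa] = self.sum_C_oa.get(oa, 0) + n
        self._flood(msg)                              # pass it on

    def _flood(self, msg):
        for nbr in self.neighbors:
            nbr.receive(msg)

    def agreed_reward(self):
        # used after the flood (or a timeout) in place of the local reward estimate
        return sum(self.rewards.values()) / len(self.rewards)

Calling start() once on every module of a connected network floods each module's own counters through the system exactly once; agreed_reward() and the summed counters then take the place of the local quantities in the GAPS update.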

An alternative method is again based on the agreement algorithm. Recall that using synchronous, equal and symmetric pairwise disjoint updates guarantees convergence to a common x which is the exact average of the initial values x_i. If, at the end of an episode, each module runs such an agreement algorithm on each of the counters that it maintains, then each counter C^i will converge to C̄ = (1/N) ∑_{i=1}^{N} C^i.


Algorithm 4 Asynchronous average-based collection of experience. This algorithm may be run at the end of each episode, or the procedure labeled average may be used at each time step while gathering experience, as shown also in figure 5-2b.
  send message = ⟨R, C_o, C_oa⟩ to a random neighbor
  repeat
    if timeout then
      proceed to update
    end if
    whenever m = ⟨R_m, C_o^m, C_oa^m⟩ arrives
      average:
        reward R ← (1/2) (R + R_m)
        experience C_o ← (1/2) (C_o + C_o^m), C_oa ← (1/2) (C_oa + C_oa^m)
      send to a random neighbor new message m′ = ⟨R, C_o, C_oa⟩
  until converged
  update:
    GAPS update rule

If this quantity is then substituted into the GAPS update rule, the updates become:

    Δθ_oa = α R ( (1/N) ∑_{i=1}^{N} C^i_oa − π_θ(o, a) (1/N) ∑_{i=1}^{N} C^i_o ) = (1/N) α R ( C_oa − C_o π_θ(o, a) ),

which is equal to the centralized GAPS updates scaled by a constant 1/N. Note that synchronicity is also required at the moment of the updates, i.e., all modules must make updates simultaneously, in order to preserve the stationarity of the underlying process. Therefore, modules learning with distributed GAPS, using an agreement algorithm with synchronous pairwise disjoint updates to come to a consensus on both the value of the reward and the average of the experience counters, will make stochastic gradient ascent updates, and therefore will converge to a local optimum in policy space.
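A quick numerical check of this equivalence (our own illustration, with arbitrary random counters): the update computed from averaged counters is exactly the update computed from summed counters, divided by N.

import numpy as np

rng = np.random.default_rng(0)
N, num_obs, num_actions = 15, 9, 9
theta = rng.normal(size=(num_obs, num_actions))
pi = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)

# each module's individual observation-action counters for one episode
C_oa_i = rng.integers(0, 5, size=(N, num_obs, num_actions)).astype(float)
C_o_i = C_oa_i.sum(axis=2)
R, alpha = 1.0, 0.1

def update(C_o, C_oa):
    return alpha * R * (C_oa - C_o[:, None] * pi)

centralized = update(C_o_i.sum(axis=0), C_oa_i.sum(axis=0))   # summed counters
averaged = update(C_o_i.mean(axis=0), C_oa_i.mean(axis=0))    # agreed-upon averages
assert np.allclose(averaged, centralized / N)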

The algorithm (Algorithm 4) requires scaling of the learning rate by a constant proportional to the number of modules. Since stochastic gradient ascent is in general sensitive to the learning rate, this pitfall cannot be avoided. In practice, however, scaling the learning rate up by an approximate factor loosely dependent (to the order of magnitude) on the number of modules works just as well.

Another practical issue is the communications bandwidth required for the implementation of agreement on experience. While many state-of-the-art microcontrollers have a large capacity for speedy communications (e.g., the new Atmel AT32 architecture includes built-in full-speed 12 Mbps USB), significant downsizing of the required communications can be achieved, depending on the policy representation. The most obvious compression technique is to not transmit any zeros. If exchanges happen at the end of the episode, instead of sharing the full tabular representation of the counts, we notice


Algorithm 5 Synchronous average-based collection of experience. This algorithm may be run at the end of each episode, or the procedure labeled average may be used at each time step while gathering experience, as shown also in figure 5-2b.
  send to all n neighbors message = ⟨R, C_o, C_oa⟩
  for all time steps do
    average:
      receive from all n neighbors message m = ⟨R_m, C_o^m, C_oa^m⟩
      average reward R ← (1/(n+1)) (R + ∑_{m=1}^{n} R_m)
      average experience C_o ← (1/(n+1)) (C_o + ∑_{m=1}^{n} C_o^m), C_oa ← (1/(n+1)) (C_oa + ∑_{m=1}^{n} C_oa^m)
    send to all n neighbors m′ = ⟨R, C_o, C_oa⟩
  end for
  update:
    GAPS update rule

that often the experience histories h = ⟨o_1, a_1, o_2, a_2, ..., o_T, a_T⟩ are shorter; in fact, they have length 2T, where T is the length of an episode. In addition, these can be sorted by observation with the actions combined: h_sorted = ⟨o_1: a_1 a_2 a_1 a_4 a_4, o_2: a_2, o_3: a_3 a_3, ...⟩, and finally, if the number of repetitions in each string of actions warrants it, each observation's substring of actions can be re-sorted according to action, and a lossless compression scheme such as run-length encoding can be used, giving h_rle = ⟨o_1: 2a_1 a_2 2a_4, o_2: a_2, o_3: 2a_3, ...⟩. Note that this issue unveils the trade-off in experience gathering between longer episodes and sharing experience through neighbor communication.

How can a module incorporate this message into computing the averages of shared experience? Upon receiving a string h_rle^m coming from module m, it aligns its own and m's contributions by observation, creating a new entry for any o_i that was not found in its own h_rle. Then, for each o_i, it combines the action substrings, re-sorts them according to action, and re-compresses them such that there is only one number n_j preceding any action a_j. At this point, all of these numbers n are halved, and now represent the pairwise average of the current module's and m's combined experience gathered during this episode. Observation counts (averages) are trivially the sum of the action counts (averages) written in the observation's substring.
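The sketch below (ours; the helper names are hypothetical) illustrates both ideas on a toy history: zeros are never stored, per-observation action counts are kept in a compact run-length-style form, and two modules' messages are merged by aligning observations, combining counts and halving them.

from collections import Counter, defaultdict

def encode(history):
    """history: list of (observation, action) pairs from one episode.
    Returns {observation: [(count, action), ...]} with zeros never stored."""
    by_obs = defaultdict(Counter)
    for o, a in history:
        by_obs[o][a] += 1
    return {o: sorted((n, a) for a, n in acts.items()) for o, acts in by_obs.items()}

def merge_average(mine, theirs):
    """Pairwise averaging of two modules' encoded counts: align by observation,
    combine action counts, then halve every count."""
    merged = {}
    for o in set(mine) | set(theirs):
        acts = Counter()
        for enc in (mine.get(o, []), theirs.get(o, [])):
            for n, a in enc:
                acts[a] += n
        merged[o] = sorted((n / 2.0, a) for a, n in acts.items())
    return merged

h1 = [("o1", "a1"), ("o1", "a2"), ("o1", "a1"), ("o3", "a3")]
h2 = [("o1", "a1"), ("o2", "a2")]
print(merge_average(encode(h1), encode(h2)))
# e.g. {'o1': [(0.5, 'a2'), (1.5, 'a1')], 'o2': [(0.5, 'a2')], 'o3': [(0.5, 'a3')]}
# (key order may vary)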

5.3 Experimental Results

We ran the synchronized algorithm described as Algorithm 5 on the case study task of locomotion by self-reconfiguration in 2D. The purpose of the experiments was to discover how agreement on common rewards and/or experience affects the speed and reliability of learning with distributed GAPS. As in the earlier distributed experiments in chapter 4, the learning rate α here was initially set at the maximum value of 0.01 and steadily decreased over the first 7,000 episodes to 0.001, and remained at that value thereafter. The temperature parameter β was initially set to a minimum value of 1 and increased steadily over the first 7,000 episodes to 3, and remained at that value


Figure 5-3: Smoothed (100-point window), downsampled average center of mass displacement, with standard error, over 10 runs with 15 modules learning to locomote eastward, as compared to the baseline distributed GAPS and standard GAPS with given global mean reward. (a) The modules run the agreement algorithm on rewards to convergence at the end of episodes. (b) Distribution of first occurrences of center of mass displacement greater than 90% of maximal displacement for each group: baseline DGAPS, DGAPS with given mean as common reward, and DGAPS with agreement in two conditions: two exchanges per timestep, and to convergence at the end of episode.

thereafter. This encouraged more exploration during the initial episodes of learning.

Figure 5-3a demonstrates the results of experiments with the learning modules agreeing exclusively on the rewards generated during each episode. The two conditions, as shown in figure 5-2, differed in the time at which exchanges of reward took place. The practical difference between the end-of-episode and the during-the-episode conditions is that in the former, averaging steps can be taken until convergence (as recorded by each individual module separately) on a stable value of R, whereas in the latter, all averaging stops with the end of the episode, which will likely arrive before actual agreement occurs.

Figure 5-3a shows that despite this early stopping point, not-quite-agreeing on a common reward is significantly better than using each module's individual estimate. As predicted, actually converging on a common value for the reward results, on average, in very good policies after 100,000 episodes of learning in a fully distributed manner. By comparison, as we have previously seen, distributed GAPS with individual rewards only does not progress much beyond an essentially random policy. In order to measure convergence times (that is, the speed of learning) for the different algorithms, we use a single-number measure obtained as follows: for each learning run, the average robot center of mass displacement dx was smoothed with a 100-point moving window and downsampled to filter out variability due to within-run exploration and random variation. We then calculate a measure of speed of learning (akin to convergence time) for each condition by comparing the average first occurrence of dx



(a) reward mean: 2.4; agreement gives: 0.63

(b) mean number of stuck configurations:
    DGAPS                          10 ± 0.0
    DGAPS with common R            0.4 ± 0.2
    agreement at each t            0.5 ± 0.2
    agreement at end of episode    0.2 ± 0.1

Figure 5-4: Effect of active agreement on reward during learning: (a) The modules with the greatest rewards in such configurations have the fewest neighbors (module 15 has 1 neighbor and a reward of 5) and influence the agreed-upon value the least. The modules with the most neighbors (modules 1-10) do not have a chance to get any reward, and influence the agreed-upon value the most. (b) Mean number of stuck configurations, with standard error, out of 10 test runs each of 10 learned policies, after 15 modules learned to locomote eastward for 100,000 episodes.

greater than 90% of its maximum value for each condition, as shown in figure 5-3b. The Kruskal-Wallis test with four groups (baseline original DGAPS with individual rewards, DGAPS with common reward R = (1/N) ∑_{i=1}^{N} R_i available to individual modules, agreement at each timestep, and agreement at the end of each episode) reveals statistically significant differences in convergence times (χ²(3, 36) = 32.56, p < 10⁻⁶) at the 99% confidence level. Post-hoc analysis shows no significant difference between DGAPS with common rewards and agreement run at the end of each episode. Agreement (two exchanges at every timestep) run during locomotion results in a significantly later learning convergence measure for that condition; this is not surprising, as exchanges are cut off at the end of the episode, when the modules have not necessarily reached agreement.

While after 100,000 episodes of learning, the policies learned with agreement at the end of each episode and those learned with the given common mean R receive the same amount of reward, according to figure 5-3a, agreement seems to help more earlier in the learning process. This could be a benefit of averaging among all immediate neighbors, as opposed to pairwise exchanges. Consider 15 modules which started forming a protrusion early on during the learning process, such that at the end of some episode, they are in the configuration depicted in figure 5-4a. The average robot displacement here is R_mean = 2.4. However, if instead of being given that value, the modules actively run the synchronous agreement at the end of this episode, they will eventually arrive at the value R_agree = 0.63. This discrepancy is due to the fact that most of the modules, and notably those that have on average more neighbors, have received zero individual rewards. Those that have, because they started forming an arm, have fewer neighbors and therefore less influence on R_agree, which results in smaller updates for their policy parameters, and ultimately in less learning "pull"


Figure 5-5: Smoothed (100-point window), downsampled average center of mass displacement, with standard error, over 10 runs with 15 modules learning to locomote eastward, as compared to the baseline distributed GAPS. (a) The modules run the agreement algorithm on rewards and experience counters at the end of episodes with original vs. scaled learning rates. (b) Distribution of first occurrences of center of mass displacement greater than 90% of maximal displacement for each group.

exerted by the arm.

Our hypothesis was therefore that, during test runs of policies learned by DGAPS with common reward versus DGAPS with agreement algorithms, we would see significantly more dysfunctional stuck configurations in the first condition. The table in figure 5-4b details the number of times, out of 10 test trials, that policies learned by the different algorithms were stuck in such configurations. Our findings do not support this hypothesis: both types of agreement, as well as the baseline algorithm with common reward, generate similar numbers of non-gait policies.

For the set of experiments involving agreement on experience as well as rewards, the learning rate α was scaled by a factor of 10 to approximately account for the averaging of observation and observation-action pair counts. (This scaling does not in itself account for the difference in experimental results: scaling up α in the baseline distributed GAPS algorithm does not help in any way.) Figure 5-5a demonstrates that including exchanges of experience counts results in a dramatic improvement in how early good policies are found. There is also a significant difference in learning speed between learning with the original learning rate and scaling it to account for the averaging of experience (a Kruskal-Wallis test on the first occurrence of center of mass displacement greater than 90% of maximal displacement for each group reports 99% significance: χ²(2, 27) = 25.31, p < 10⁻⁵). Finally, we see in figure 5-6a that distributed GAPS with agreement on both reward and experience learns comparatively just as fast and just as well as the centralized version of GAPS for 15 modules. The three-way Kruskal-Wallis test reports a 99% significance level (χ²(2, 27) = 16.95, p = .0002); however, testing the centralized versus agreement groups only shows no significance. However, when we increase the robot size to 20 modules (and the episode length to 100 timesteps) in figure 5-6b, all three conditions yield different reward curves, with


Figure 5-6: Comparison between centralized GAPS, distributed GAPS and distributed GAPS with agreement on experience: (a) smoothed (100-point window) average center of mass displacement over 10 trials with 15 modules learning to locomote eastward (T=50), (b) same for 20 modules with episode length T=100: learning was done starting with α = 0.2 and decreasing it uniformly over the first 1,500 episodes to 0.02.

                        15 mods, T=50    20 mods, T=100
distributed GAPS        10 ± 0.0         10 ± 0.0
centralized GAPS        0.1 ± 0.1        5 ± 1.7
agreement on R and C    1.1 ± 1.0        6 ± 1.6

Table 5.1: Mean number of non-gait local optima, with standard error, out of 10 test trials for 10 learned policies. The starting learning rate was α = 0.1 for 15 modules and α = 0.2 for 20 modules.

agreement-based distributed learning getting significantly less reward than centralized GAPS.

Where does this discrepancy come from? In table 5.1, we see that learning in a fully distributed way with agreement on reward and experience makes the modules slightly more likely to find non-gait local optima than the centralized version. However, a closer look at the policies generated by centralized learning vs. agreement reveals another difference: gaits found by agreement are more compact than those found by central GAPS. Figure 5-7 demonstrates in successive frames two such compact gaits (recall figure 3.5.6 for comparison). Out of all the acceptable gait policies found by GAPS with agreement, none of them unfolded into a two-layer thread-like gait, but all maintained a more blobby shape of the robot, preserving something of its initial configuration and aspect ratio. Such gaits necessarily result in lower rewards; however, this may be a desirable property of agreement-based learning algorithms.


Figure 5-7: Screenshots taken every 5 timesteps of 20 modules executing two compact gait policies found by learning with agreement on reward and experience.


5.4 Discussion

We have demonstrated that with agreement-style exchanges of rewards and experience among the learning modules, we can expect performance in distributed reinforcement learning to increase by a factor of ten, and to approach that of a centralized system for smaller numbers of modules and lower aspect ratios, such as we have with 15 modules. This result suggests that the algorithms and extensions we presented in chapters 3 and 4 may be applicable to fully distributed systems. With higher numbers of modules (and higher aspect ratios of the initial configuration), there is a significant difference between the rewards obtained by centralized learning and distributed learning with agreement, which is primarily due to the kinds of policies that are favored by the different learning algorithms.

While centralized GAPS maximizes reward by unfolding the robot's modules into a two-layer thread, distributed GAPS with agreement finds gaits that preserve the blobby shape of the initial robot configuration. This difference is due to the way modules calculate their rewards by repeated averaging with their neighbors. As we explained in section 5.3, modules will agree on a common reward value that is not the mean of their initial estimates if the updates are not pairwise. This generates an effect where modules with many neighbors have more influence on the resulting agreed-upon reward value than those with fewer neighbors. It therefore makes sense that locomotion gaits found through this process tend to keep modules within closer proximity to each other. There is, after all, a difference between trying to figure out what to do based on the national GDP, and trying to figure out what to do based on how you and your neighbors are faring.
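The effect is easy to reproduce numerically (a toy illustration of ours, not data from the experiments): on a line graph where only the end module, which has a single neighbor, received any reward, synchronous averaging with all immediate neighbors settles on a consensus value noticeably below the uniform mean, because each module's initial value is weighted roughly by the size of its neighborhood (itself plus its neighbors).

import numpy as np

# line graph of 8 modules; only the end module (1 neighbor) got any reward
n = 8
rewards = np.zeros(n)
rewards[-1] = 5.0

# each module repeatedly replaces its value by the mean of itself + neighbors
x = rewards.copy()
for _ in range(2000):
    new_x = x.copy()
    for i in range(n):
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n]
        new_x[i] = (x[i] + sum(x[j] for j in nbrs)) / (1 + len(nbrs))
    x = new_x

print(rewards.mean())   # uniform mean: 0.625
print(x[0])             # agreed-upon value is lower, ~0.45 (end module has low weight)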

We have developed a class of algorithms incorporating stochastic gradient ascent in policy space (GAPS) and agreement algorithms, creating a framework for fully distributed implementations of this POMDP learning technique.

This work is very closely related to the distributed optimization algorithm described by Moallemi & Van Roy (2003) for networks, where the global objective function is the average of locally measurable signals. They proposed a distributed policy gradient algorithm where local rewards were pairwise averaged in an asynchronous but disjoint (in time and processor set) way. They prove that the resulting local gradient estimates converge in the limit to the global gradient estimate. Crucially, they do not make any assumptions about the policies the processors are learning, beyond continuity and continuous differentiability. In contrast, here we are interested in a special case where ideally all modules' policies are identical. Therefore, we are able to take advantage of neighbors' experience as well as rewards.

Practical issues in sharing experience include the amount of required communications bandwidth and message complexity. We have proposed a simple way of limiting the size of each message by eliminating zeros, sorting experience by observation and action, and using run-length encoding as a representation of counts (and ultimately, averages). We must remember, however, that in most cases, communication is much cheaper and easier than actuation, both in resource (e.g., power) consumption, and in terms of the learning process. Modules would benefit greatly from sharing experience even at a higher communication cost, and therefore learning from others' mistakes, rather


than continuously executing suboptimal actions that in the physical world may well result in danger to the robot or to something or someone in the environment.

If further reduction in communications complexity is required, in recent work Moallemi & Van Roy (2006) proposed another distributed protocol called consensus propagation, where neighbors pass two messages: their current estimate of the mean, and their current estimate of a quantity related to the cardinality of the mean estimate, i.e., a measure of how many other processors have so far contributed to the construction of this estimate of the mean. The protocol converges extremely fast to the exact average of initial values in the case of singly-connected communication graphs (trees). It may benefit GAPS-style distributed learning to first construct a spanning tree of the robot, then run consensus propagation on experience. However, we do not want to give up the benefits of simple neighborhood averaging of rewards in what concerns the avoidance of undesirable local optima.

We may also wish in the future to develop specific caching and communication protocols for passing just the required amount of information, and therefore requiring less bandwidth, given the policy representation and current experience.


Chapter 6

Synthesis and Evaluation: Learning Other Behaviors

So far we have focused on a systematic study of learning policies for a particular task: simple two-dimensional locomotion. However, the approach generalizes to other objective functions.

In this chapter, we combine the lessons learned so far about, on the one hand, the importance of a compact policy representation that 1) still includes our ideal policy as an optimum, but 2) creates a relatively smooth policy value landscape, and, on the other hand, that of good starting points, obtained through incremental learning or partially known policies. Here, we take these ideas further in order to evaluate our approach. Evaluation is based on different tasks requiring different objective functions, namely:

• tall structure (tower) building

• locomotion over obstacle fields

• reaching to an arbitrary goal position in robot workspace

Reducing observation and parameter space may lead to good policies disappearing from the set of representable behaviors. In general, it will be impossible to achieve some tasks using limited local observation. For example, it is impossible to learn a policy that reaches to a goal position in a random direction from the center of the robot, using only local neighborhoods, as there is no way locally of either observing the goal or otherwise differentiating between desired directions of motion. Therefore, we introduce a simple model of minimal sensing at each module and describe two ways of using the sensory information: by incorporating it into a larger observation space for GAPS, or by using it in a framework of geometric policy transformations, for a kind of transfer of learned behavior between different spatially defined tasks.


Figure 6-1: 25 modules learning to build tall towers (reaching upward) with basic GAPS versus incremental GAPS (in both cases with pre-screened legal actions): (a) Smoothed (100-point window), downsampled average height difference from the initial configuration, with standard error. (b) Distribution of first occurrence of 90% of the maximal height achieved during learning, for each group separately.

6.1 Learning different tasks with GAPS

We start by validating policy search with the standard representation of chapters 3 through 5, on two tasks that are more complex than simple locomotion: building tall towers, and locomotion over fields of obstacles.

6.1.1 Building towers

In this set of experiments, the objective was to build a configuration that is as tall as possible. The rewards were accumulated during each episode as follows:

    r^i_{t+1} = 5 × (y^i_{t+1} − y^i_t),

with the global reward calculated as R = (1/N) ∑_{i=1}^{N} ∑_{t=1}^{T} r^i_t. This objective function is maximized when globally the modules have created the greatest height difference since the initial configuration. In addition, it can be collected in a distributed way.
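As an illustration (ours; the y_positions array of per-timestep module heights is an assumed input), this reward can be computed as below. Note that each module's sum telescopes to its total height gain over the episode, so it can be accumulated locally.

import numpy as np

def tower_rewards(y_positions):
    """y_positions: array of shape (T+1, N) with each module's y coordinate
    at every timestep of an episode. Returns (per-module rewards, global R)."""
    diffs = np.diff(y_positions, axis=0)      # y[t+1] - y[t] for every module
    r_i = 5.0 * diffs.sum(axis=0)             # each module's accumulated reward
    R = r_i.mean()                            # global reward: average over modules
    return r_i, R

# toy episode: 3 modules, 3 steps; only module 2 climbs, by two cells in total
y = np.array([[0, 0, 0], [0, 0, 1], [0, 0, 1], [0, 0, 2]], dtype=float)
r_i, R = tower_rewards(y)
print(r_i, R)   # r_i = [0, 0, 10], R is approximately 3.33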

In order to penalize unstable formations, in these experiments we used the same additional stability constraints as earlier in section 3.5.7: namely, if the (global) center of mass of the robot moves outside its x-axis footprint, the modules fall, and the global reward becomes R = −10. Individual modules may also fall with a very low probability, which is proportional to the module's y position relative to the rest of the robot and inversely proportional to the number of its local neighbors. This also results in a reward of −10, but this time for the individual module only.

Ten learning trials with 25 modules each were run in two different conditions: either the modules used the basic GAPS algorithm or incremental GAPS with increments taken in the length of the original square configuration: from a length of 2 (4 modules) to the final length of 5 (25 modules). In both cases, modules pre-screened for legal actions only. These experiments validate the incremental approach on a


Figure 6-2: A sequence of 25 modules, initially in a square configuration, executing a learned policy that builds a tall tower (frames taken every 5 steps).

non-locomotion task. We can see in figure 6-1a that on average, modules learning with basic GAPS do not increase the height of the robot over time, whereas modules taking the incremental learning approach create an average height difference after 7,000 episodes.

Figure 6-1b shows the distribution of the first occurrence of a height difference greater than 90% of the maximal height difference achieved in each group. We take this as a crude measure of convergence speed. By this measure basic GAPS converges right away, while incremental GAPS needs on average over 6,000 episodes (a Kruskal-Wallis non-parametric ANOVA shows a difference at the 99% significance level: χ²(1, 18) = 15.26, p < .0001). However, by looking also at figure 6-1a, we notice that the maximal values are very different. Incremental GAPS actually learns a tower-building policy, whereas basic GAPS does not learn anything.

Figure 6-2 demonstrates an execution sequence of a policy learned by 25 modules through the incremental process, using pre-screening for legal actions. Not all policies produce tower-like structures. Often the robot reaches up from both sides at once, standing tall but with double vertical protrusions. This is due to two facts. The first is that we do not specifically reward tower-building, but simple height difference. Therefore, policies can appear successful to modules when they achieve any positive reward. This problem is identical to the horizontal protrusions appearing as local maxima to modules learning to locomote in chapter 4. However, the other problem is that the eight immediately neighboring cells do not constitute a sufficient representation for disambiguating between the robot's sides in a task that is essentially symmetric with respect to the vertical. Therefore modules can either learn to only go up, which produces double towers, or they can learn to occasionally come down on the other side, which produces something like a locomotion gait instead. A larger observed neighborhood could be sufficient to disambiguate in this case, which again highlights the trade-off between being able to represent the desired policy and compactness of representation.


Figure 6-3: First 80 time-steps of executing a policy for locomotion over obstacles, captured in frames every 10 steps (15 modules). The policy was learned with standard-representation GAPS seeded with a partial policy.

6.1.2 Locomotion over obstacles

Standard GAPS, which defines observations straightforwardly over the 8 neighboring cells, has 256 possible observations when there are no obstacles to be observed. However, when we introduce obstacles, there are now 3 possible states of each local cell, and therefore 3⁸ = 6,561 possible observations, or 3⁸ × 9 = 59,049 parameters to learn. We expect basic GAPS to do poorly on a task with such a large parameter space to search. We could naturally pretend that observations should not be affected by obstacles: in other words, each module can only sense whether or not there is another module in a neighboring cell, and conflates obstacles and empty spaces.

We perform a set of experiments with the larger and smaller representation GAPS, trying to learn a policy for "rough terrain" locomotion over obstacles. Figure 6-3 shows a sequence of a test run where 15 modules execute a policy for locomotion over obstacles learned with standard-representation GAPS. Due to the limited local observability and the conservative rules to prevent disconnections, it is essential that the robot learns to conform its shape to the shape of the terrain, as shown in the figure. Otherwise, in a local configuration where the acting module m_i is standing on the ground (or an obstacle) but does not observe any neighbors also connected to the ground, it will not move, unless another module can move into a position to unblock m_i.

Experiments were performed in two conditions: with or without seeding the search with a partially known policy. When specified, the partially known policy was in all cases equivalent to the one in earlier experiments (section 4.3.5): we specify that when there are neighbors on its right, the module should go up, and when there are neighbors on its left, the module should go down. Figure 6-4a shows the smoothed (100-point moving window), downsampled average rewards obtained by GAPS with the larger (3⁸ observations) and smaller (2⁸ observations) representations, with and without a partial policy.


                 2-bit        3-bit
basic GAPS       10 ± 0.0     10 ± 0.0
partial policy   4.0 ± 1.0    5.1 ± 0.6

Figure 6-4: Locomotion over obstacles with or without initializing GAPS to a partially known policy, for both 2-bit and 3-bit observations, learned by 15 modules. (a) Smoothed (100-point window), downsampled average rewards, with standard error. (b) Mean number of non-gait stuck configurations, with standard error, out of 10 test runs for each of 10 learned policies.

The table shown in figure 6-4b gives, for a sample of 10 learned policies, the average number of times modules were stuck in a dysfunctional configuration due to a local optimum that does not correspond to a locomotion gait. Locomotion over obstacles is a very hard task for learning, and the results reflect that, with even 15 modules not being able to learn any good policies without us biasing the search with a partially known policy. In addition, we observe that increasing the number of parameters to learn by using the more general 3-bit representation leads to a significant drop in the average rewards obtained by policies after 10,000 episodes of learning: the average reward curves for biased and unbiased GAPS with 3-bit observations are practically indistinguishable. While the mean number of arm-like locally optimal configurations reported in the table for the biased condition with 3-bit observations is significantly lower than for unbiased GAPS, this result fails to take into account that the "good" policies in that case still perform poorly. The more blind policies represented with only 2 bits per cell of observation fared much better due to the drastically smaller dimensionality of the policy parameter space in which GAPS has to search. Larger, more general observation spaces require longer learning.

6.2 Introducing sensory information

In previous chapters, we had also assumed that there is no other sensory information available to the modules apart from the existence of neighbors in the immediate Moore neighborhood (eight adjacent cells). As we have pointed out before, this limits the class of representable policies and excludes some useful behaviors. Here, we introduce a minimal sensing model.


6.2.1 Minimal sensing: spatial gradients

We assume that each module is capable of making measurements of some underlying spatially distributed function f such that it knows, at any time, in which direction lies the largest gradient of f. We will assume the sensing resolution to be either one in 4 (N, E, S or W) or one in 8 (N, NE, E, SE, S, SW, W or NW) possible directions. This sensory information adds a new dimension to the observation space of the module. There are now 2⁸ = 256 possible neighborhood observations × 5 possible gradient observations (four directions of largest gradient plus the case of a local optimum, where all gradient measurements are equal up to a threshold) × 9 possible actions = 11,520 parameters, in the case of the lower resolution, and respectively 2⁸ × 9 × 9 = 20,736 parameters in the case of the higher resolution. This is a substantial increase. The obvious advantage is to broaden the representation to apply to a wider range of problems and tasks. However, if we severely downsize the observation space and thus limit the number of parameters to learn, it should speed up the learning process significantly.
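For instance, the sensed gradient can be reduced to one of the 8 compass headings, or to a special "local optimum" symbol when the gradient is too small to be informative. The sketch below is our own; the threshold value and function name are assumptions.

import math

HEADINGS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

def gradient_observation(gx, gy, threshold=1e-3):
    """Discretize a sensed gradient vector (gx, gy) into one of 8 headings,
    or 'OPT' if the module sits at a local optimum of f."""
    if math.hypot(gx, gy) < threshold:
        return "OPT"
    # angle measured clockwise from North (the +y direction)
    angle = math.degrees(math.atan2(gx, gy)) % 360.0
    return HEADINGS[int((angle + 22.5) // 45) % 8]

print(gradient_observation(0.0, 1.0))   # N
print(gradient_observation(1.0, 1.0))   # NE
print(gradient_observation(0.0, 0.0))   # OPT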

6.2.2 A compact representation: minimal learning program

Throughout this thesis we have established the dimensions and parameters which influence the speed and reliability of reinforcement learning by policy search in the domain of lattice-based self-reconfigurable modular robots. In particular, we have found that the greatest influence is exerted by the number of parameters which represent the class of policies to be searched. The minimal learning approach is to reduce the number of policy parameters to the minimum necessary to represent a reasonable policy.

To achieve this goal, we incorporate a pre-processing step and a post-processing step into the learning algorithm. Each module still perceives its immediate Moore neighborhood. However, it now computes a discrete local measure of the weighted center of mass of its neighborhood, as depicted in figure 6-5. At the end of this pre-processing step, there are only 9 possible observations for each module, depending on which cell the neighborhood center of mass belongs to.

In addition, instead of actions corresponding to direct movements into adjacent cells, we define the following set of actions. There are 9 actions, corresponding as before to the set of 8 discretized directions plus a NOP. However, in the minimalist learning framework, each action corresponds to a choice of general heading. The particular motion that will be executed by a module, given the local configuration and the chosen action, is determined by the post-processing step shown in figure 6-5. The module first determines the set of legal motions given the local configuration. From that set, given the chosen action (heading), it determines the closest motion to this heading up to a threshold. If no such motion exists, the module remains stationary. The threshold is set by the designer and in our experiments is equal to 1.1, which allows motions up to 2 cells different from the chosen heading.
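The two wrapper steps around GAPS can be sketched as follows (our Python approximation of figure 6-5; the periphery-biased weighting and the interpretation of the 1.1 threshold as a Euclidean distance between cell offsets are assumptions, not the thesis implementation).

import math

# unit offsets for the 8 Moore-neighborhood cells, indexed by heading
OFFSETS = {"N": (0, 1), "NE": (1, 1), "E": (1, 0), "SE": (1, -1),
           "S": (0, -1), "SW": (-1, -1), "W": (-1, 0), "NW": (-1, 1)}

def preprocess(neighbors):
    """neighbors: set of occupied offsets from OFFSETS.values().
    Returns one of 9 observations: the cell the weighted center of mass
    falls into ('C' if it is the center cell)."""
    if not neighbors:
        return "C"
    # bias towards the periphery: weight each occupied cell by its distance
    w = {(dx, dy): math.hypot(dx, dy) for dx, dy in neighbors}
    total = sum(w.values())
    cx = sum(dx * wi for (dx, dy), wi in w.items()) / total
    cy = sum(dy * wi for (dx, dy), wi in w.items()) / total
    cell = (round(cx), round(cy))
    return next((h for h, off in OFFSETS.items() if off == cell), "C")

def postprocess(heading, legal_motions, threshold=1.1):
    """Pick the legal motion closest to the chosen heading under this offset
    metric; stay put if nothing is within the threshold."""
    target = OFFSETS[heading]
    best = min(legal_motions,
               key=lambda m: math.dist(OFFSETS[m], target),
               default=None)
    if best is None or math.dist(OFFSETS[best], target) > threshold:
        return "NOP"
    return best

print(preprocess({(1, 0), (1, 1), (0, 1)}))          # observation 'NE'
print(postprocess("E", legal_motions=["N", "SE"]))   # 'SE' (within threshold here)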

Differentiating between actions (headings) and motions (cell movements) allows the modules to essentially learn a discrete trajectory as a function of a simple measure



Figure 6-5: Compressed learning representation and algorithm: pre-processing (a) compute the local neighborhood's weighted center of mass, biased towards the periphery, (b) discretize into one of 9 possible observations; GAPS (c) from the observation, choose an action according to the policy: each action represents one of 8 possible headings or a NOP; post-processing (d) select the closest legal movement up to a threshold given the local configuration of neighbors, (e) execute the movement. Only the steps inside the dashed line are part of the actual learning algorithm.

of locally observed neighbors.

6.2.3 Experiments: reaching for a random target

The modules learned to reach out to a target positioned randomly in the robot's workspace (reachable (x, y) space). The target (x₀, y₀) was also the (sole) maximum of f, which was a distance metric. The targets were placed randomly in the half-sphere defined by the y = 0 line and the circle with its center at the middle of the robot and its radius equal to (N + M)/2, where N is the number of modules and M the robot's size along the y dimension.

Fifteen modules learned to reach to a random goal target using GAPS with the minimal representation and the minimal gradient sensing framework with the lower sensory resolution (4 possible directions or local optimum) of section 6.2.1. Figure 6-6 shows an execution sequence leading to one of the modules reaching the target: the sequence was obtained by executing a policy learned after 10,000 episodes. Ten learning trials were performed, and each policy was then tested for 10 test trials. The average number of times a module reached the goal in 50 timesteps during the test trials for each policy was 7.1 out of 10 (standard deviation: 1.9).

We record two failure modes that contribute to this error rate. On the one hand, local observability and a stochastic policy mean a module may sometimes move into a position that completes a box-type structure with no modules inside (as previously shown in figure 3-12b). When that happens, the conservative local rules we use to prevent robot disconnection will also prevent any corner modules of such a structure from moving, and the robot will be stuck. Each learned policy led the robot into this box-type stuck configuration a minimum of 0 and a maximum of 3 out of the 10 test runs (mean: 1.2, standard deviation: 1). We can eliminate this source of error by allowing a communication protocol to establish an alternative connection beyond the local observation window (Fitch & Butler (2007) have one such protocol). This


Figure 6-6: Reaching to a random goal position using sensory gradient information during learning and test phases. Policy execution captured every 5 frames.

leaves an average of 1.7 proper failures per 10 test trials (standard deviation: 1.6). Three out of 10 policies displayed no failures, and another two failed only once.

6.3 Post-learning knowledge transfer

If a function over the (x, y) space has a relationship to the reward signal such that sensing the largest function gradient gives the modules information about which way the greatest reward lies, then there should be a way of using that information to reuse knowledge obtained while learning one behavior for another one, under certain conditions. For example, if modules have learned to reach any goal randomly positioned in the whole (x, y) space, that policy would subsume any locomotion policies. Since distributed learning is hard, can modules learn one policy that would apply to different tasks, given the gradient information? We are also interested in situations where learning a simpler policy would give modules an advantage when presented with a more difficult task, in the spirit of the good starting points for policy search that we are exploring in this thesis.

In the next sections we study possible knowledge transfer from policies learned with underlying gradient information in two ways: using gradients as extra observations, or using gradient information in order to perform geometric transformations of the old policy for the new task.

6.3.1 Using gradient information as observation

Suppose modules have learned a policy with the observation space defined over the local neighborhood and the direction of greatest gradient of some sensed function f defined over the (x, y) space. Suppose further that the reward signal R was proportional to the gradient of f. The modules are now presented with a new spatially defined task, where the reward signal R′ is proportional to the gradient of a new spatially defined


function f′. The intuition is that if f′ is sufficiently similar to f, and if all possible gradient observations have been explored during the learning of the old policy, then the modules will get good performance on the new task with the old policy. The condition of sufficient similarity in this case is that f′ be a linear transformation of f. This is subject to symmetry-breaking experimental constraints such as requiring that the modules stay connected to the ground line at all times.

The idea that having seen all gradient observations, and having learned how to behave given them and the local configuration, should be of benefit in previously unseen scenarios can be taken further. In the next section, we experiment with a limited set of gradient directions during the learning phase, then test using random directions.

6.3.2 Experiments with gradient sensing

In this set of experiments, 16 modules, originally in a 4x4 square configuration, learned using GAPS with minimal representation and minimal sensing in the following setup. At each episode, the direction of reward was chosen randomly between eastward, westward and upward. Therefore, the modules learned a policy for three out of the four possible local gradients. We tested the policies obtained in 10 such learning trials, after 10,000 episodes each, and separated them into two groups based on performance (5 best and 5 worst). We then tested the policies again, this time on the task of reaching to a random goal position within the robot's reachable workspace. Our prediction was that this kind of policy transfer from locomotion and reaching up to reaching to a random goal should be possible, and therefore that the quality of the policy learned on this small subset of possible directions would determine whether the modules succeed on the second test task.

In these experiments, the 5 best policies succeeded in reaching the random goal on average in 7.8 out of 10 test trials (standard deviation: 1.6) and the 5 worst policies succeeded on average in 5 out of 10 trials (standard deviation: 1.9). The random policy never succeeded. The difference in success rate between the two groups is statistically significant (F(1,8) = 6.32, p = .0361) at the 95% confidence level. As we expected, learning with gradients enables modules to transfer some behavioral knowledge to a different task.

6.3.3 Using gradient information for policy transformations

It is interesting to note that in lattice-based SRMRs, the observation and action sets are spatially defined. Whether an observation consists of the full local neighborhood configuration surrounding module i or a post-processed number representing the discretized relative position of the neighbors' local center of mass, it comes from the spatial properties of the neighborhood. The set of actions consists of motions in space (or directions of motion in space), which are defined by the way they affect said properties. In the cases where the reward function R is also spatially defined, local geometric transformations exist that can enable agents to transfer their behavioral knowledge encoded in the policy to different reward functions.


Figure 6-7: When there is a known policy that optimizes reward R, modules should be able to use local geometric transformations of the observation and policy to achieve better than random performance in terms of reward R′ (sensed as direction of greatest gradient). The ground line breaks the rotational transformations when the angle ∠(R, R′) > π/2, such that a flip about the vertical becomes necessary. The implementation is discretized with π/4 intervals, which corresponds to 8 cells.

Suppose R = ∇f(x, y) in two-dimensional space and a good policy exists, parameterized by θ(o, a), where o is the result of local spatial filtering. Note that, in the minimal learning representation, both o and a are discretized angles from the vertical. If we now transform R, say, by rotating f(x, y) with respect to the module by an angle ω, then rotating also the observation and action such that o′ = o − ω and a′ = a − ω describes the new situation, and taking π′_θ(o′, a′) = π_θ(o, a) should give the action that should be executed in the new situation. Figure 6-7 shows how rotating the reward coordinate frame works. The idea is intuitive in the minimal representation, since the actually executed motions are divorced from the actions represented by the policy, the latter being simply desirable motion directions. However, the transformation also works in the standard GAPS representation. Figure 6-8 shows how to transform the neighborhood observation and find the equivalent actions in the standard representation.

The rotation example, while intuitive, breaks down because the existence of the ground line, to which the modules are tied, breaks the rotational symmetry. However, it should be possible to use symmetry about the vertical axis instead: in the experiments in section 6.3.4 we transform the observations and actions first by rotating by up to π/2, then by flipping about the vertical axis if necessary.

While some transfer of knowledge will be possible, there is no reason to believe that transformation alone will give a resulting policy that is optimal for the new reward. Instead, like incremental learning and partially specified policies, the use of transformations can be another way of biasing search.


Algorithm 6 GAPS with transformations at every timestep
  Initialize parameters θ according to experimental condition
  for each episode do
    Calculate policy π(θ)
    Initialize observation counts N ← 0
    Initialize observation-action counts C ← 0
    for each timestep in episode do
      for each module m do
        sense direction of greatest gradient ω
        observe o
        if ω ≤ π/2 then
          rotate o by −ω → o′
        else
          flip o about the vertical and rotate by π/2 − ω → o′
        end if
        increment N(o′)
        choose a from π(o′, θ)
        if ω ≤ π/2 then
          rotate a by ω → a′
        else
          flip a about the vertical and rotate by ω − π/2 → a′
        end if
        increment C(o′, a′)
        execute a′
      end for
    end for
    Get global reward R
    Update θ according to θ(o′, a′) += α R ( C(o′, a′) − π(o′, a′, θ) N(o′) )
    Update π(θ) using Boltzmann's law
  end for
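When headings are discretized into 8 directions, the rotation and flip steps of Algorithm 6 reduce to simple index arithmetic. The sketch below is our own rendering, with index 0 taken as the canonical reward direction the policy was learned for and ω given in the same 45-degree encoding; the thesis's exact angular conventions follow figure 6-7 and may differ, but the round-trip check shows this version is internally consistent.

# headings 0..7, spaced 45 degrees apart; index 0 is the canonical reward
# direction of the learned policy; omega_idx is the sensed reward direction
# in the same encoding (omega_idx = 0 means no transformation is needed).

def rotate(idx, steps):
    return (idx + steps) % 8

def flip(idx):
    # mirror about the axis through headings 0 and 4 (the vertical in the thesis)
    return (-idx) % 8

def to_canonical(idx, omega_idx):
    """Map an observed heading into the frame in which the policy was learned."""
    if omega_idx <= 2:                         # a rotation by at most pi/2 suffices
        return rotate(idx, -omega_idx)
    return rotate(flip(idx), 2 - omega_idx)    # flip, then rotate

def from_canonical(idx, omega_idx):
    """Inverse map: take a chosen action back to the world frame for execution."""
    if omega_idx <= 2:
        return rotate(idx, omega_idx)
    return flip(rotate(idx, omega_idx - 2))

# sanity check: the round trip is the identity for every heading and direction
for omega in range(8):
    for h in range(8):
        assert from_canonical(to_canonical(h, omega), omega) == h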

6.3.4 Experiments with policy transformation

We performed a set of experiments where 15 modules learned the eastward locomotion task with the minimal representation. After 10,000 episodes, the learning was stopped, and the resulting policies were first tested on the learning task and then transformed as described in section 6.3.3. Figure 6-9 shows equivalent situations before and after the transformation. The policies were then tested on the task of westward locomotion. Out of 8 policies that corresponded to acceptable eastward locomotion gaits after 10,000 episodes of learning, 7 performed equally well in 10 test trials each in westward locomotion when flipped. The remaining two policies, which made armlike protrusions, did not show locomotion gaits when flipped either.

In another set of experiments, 15 and 20 modules learned with the standard GAPS representation using 2⁸ possible observations, and the modules performed transformations of the local observation and corresponding policy at every timestep. Here, we use the higher-resolution sensing model with 8 possible gradient directions (plus the


Figure 6-8: Policy transformation with the standard GAPS representation when the reward direction is rotated by π/2 from the horizontal (i.e., the modules now need to go North, but they only know how to go East): (a) find the local neighborhood and rotate it by −π/2; (b)-(c) the resulting observation generates a policy; (d) rotate the policy back by π/2 to get an equivalent policy for moving North.

Figure 6-9: Equivalent observation and corresponding policy for module M (a) before and (b) after the flip.

case of a local optimum). Thus, modules maintain a single policy with the standard number of parameters; the sensory information is used in transformations alone, and the modules are learning a "universal" policy that should apply in all directions (see Algorithm 6). The experiments were then run in three conditions with ten trials in each condition:

1. Modules learned to reach to a random target from scratch, with parameters initialized to small random numbers (baseline).

2. Modules learned eastward locomotion, but were tested on the task of reaching to a random target (post-learning knowledge transfer).

3. Modules learned to reach to a random target with parameters initialized to an existing policy for eastward locomotion, obtained from condition 2 (biasing search with a good starting point).


mean success ± std. err.                                 15 mods     20 mods
learn with transformations                               8.1 ± .4    5.1 ± .6
learn eastward locomotion, test with transformations     6.4 ± .8    5.0 ± .8
use policy for eastward locomotion as starting point     7.6 ± .4    8.6 ± .5
random policy                                            3           0

Figure 6-10: (a) Average over 10 learning trials of smoothed (100-point window), downsampled reward over time (learning episodes) for 20 modules running centralized learning with an episode length of T=50, learning to reach to an arbitrary goal position with geometric transformations of observations, actions, and rewards, starting with no information (blue) vs. starting with an existing locomotion policy (black). Standard error bars at the 95% confidence level. (b) Table of the average number of times the goal was reached out of 10 test trials, with standard error: using a locomotion policy as a good starting point for learning with transformations works best with a larger number of modules.

The table in figure 6-10b summarizes the success rates of the policies learned under these three conditions on the task of reaching to a randomly placed target within the robot's reachable space, with the success rate of the random policy shown for comparison. We can see that all conditions do significantly better than the random policy, for both 15 and 20 modules. In addition, a nonparametric Kruskal-Wallis ANOVA shows no significant difference between the success rates of the three conditions for 15 modules, but for 20 modules, there is a significant difference (χ²(2, 27) = 13.54, p = .0011) at the 99% confidence level. A post-hoc multiple comparison shows that the policies from condition 3 are significantly more successful than all others. As the problem becomes harder with a larger number of modules, biasing the search with a good starting point generated by geometrically transforming the policy helps learning more.

Fig. 6-10a shows the smoothed, downsampled average rewards (with standard error) obtained over time by 20 modules learning with transformations in conditions 1 and 3. We can see clearly that biasing GAPS with a good starting point, obtained by transforming a good policy for a different but related task, results in faster learning and in more reward overall.

6.4 Discussion

We have demonstrated the applicability of GAPS to tasks beyond simple locomotion, and presented results in learning to build tall structures (reach upward), move over rough terrain and reach to a random goal position. As we move from simple locomotion


to broader classes of behavior for lattice-based SRMRs, we quickly outgrow the standard model for each module's local observations. Therefore, in this chapter we have introduced a model of local gradient sensing that allows expansion of the robots' behavioral repertoire, as amenable to learning. We have additionally introduced a novel, compressed policy representation for lattice-based SRMRs that reduces the number of parameters to learn.

With the evaluation of our results on different objective functions, we have also presented a study of the possibility of behavioral knowledge transfer between different tasks, assuming sensing capabilities of spatially distributed gradients, and rewards defined in terms of those gradients. We find that it is possible to transfer a policy learned on one task to another, either by folding gradient information into the observation space, or by applying local geometric transformations of the policy (or, equivalently, of the observation and action) based on the sensed gradient information. The relevance of the resulting policies to new tasks comes from the spatial properties of lattice-based modular robots, which require observations, actions and rewards to be local functions of the lattice space.

Finally, we demonstrate empirically that geometric policy transformations can provide a systematic way to automatically find good starting points for further policy search.


Chapter 7

Concluding Remarks

7.1 Summary

In this thesis, we have described the problem of automating distributed controller generation for self-reconfiguring modular robots (SRMRs) through reinforcement learning. In addition to being an intuitive approach for describing the problem, distributed reinforcement learning also provides a double-pronged mechanism for both control generation and online adaptation.

We have demonstrated that the problem is technically difficult due to the distributed but coupled nature of modular robots, where each module (and therefore each agent) has access only to local observations and a local estimate of reward. We then proposed a class of algorithms taking Gradient Ascent in Policy Space (GAPS) as a baseline starting point. We have shown that GAPS-style algorithms are capable of learning policies in situations where the more powerful Q-learning and Sarsa are not, as the latter make the strong assumption of full observability of the MDP. We identified three key issues in using policy gradient algorithms for learning, and specified which parameters of the SRMR problem contribute to these issues. We then addressed the scalability issues presented by learning in SRMRs, including the proliferation of local optima unacceptable for desired behaviors, and suggested ways to mitigate various aspects of this problem through search constraints, good starting points, and appropriate policy representations. In this context, we discussed both centralized and fully distributed learning. In the latter case, we proposed a class of agreement-based GAPS-style algorithms for sharing both local rewards and local experience, which we have shown speed up learning by at least a factor of 10. Along the way we provided experimental evidence for our algorithms' performance on the problem of locomotion by self-reconfiguration on a two-dimensional lattice, and further evaluated our results on problems with different objective functions. Finally, we proposed a geometric transformation framework for sharing behavioral knowledge between different tasks and as another systematic way to automatically find good starting points for distributed search.


7.2 Conclusions

We can draw several conclusions from our study.

1. The problem of partially observable distributed reinforcement learning (as in SRMRs) is extremely difficult, but it is possible to use policy search to learn reasonable behaviors.

2. Robot and problem design decisions affect the speed and reliability of learning by gradient ascent, in particular:

the number of parameters to learn is affected by the policy representation, the number of modules comprising the robot, and search constraints, if any;

the quality of experience is affected by the length of each learning episode, the robot size, and the sharing of experience, if any;

the starting point is affected by any initial information available to modules, and by the additive nature of modular robots, which enables learning by incremental addition of modules for certain tasks;

the initial condition: in addition to the starting point in policy space, the initial configuration of the robot affects the availability of exploratory actions;

the quality of the reward estimate is affected by the sharing of rewards through agreement, if any.

3. It is possible to create good design strategies that mitigate scalability and local-optima issues, such as pre-screening for legal actions, which effectively reduces the number of parameters to be learned, or using smart and appropriate policy representations, which nonetheless require careful human participation and trade-offs between representational power (the ideal policy must be included), the number of policy parameters, and the quality of the policy value landscape that these parameters create (it may have many local optima but lack the redundancy of the full representation).

4. Unlike centralized but factored GAPS, fully distributed gradient ascent with partial observability and local rewards is impossible without improvement in either (a) the amount and quality of experience, or (b) the quality of local estimates of reward. Improvement in (a) can be achieved through the good design strategies mentioned above, and either (a) or (b) or both benefit tremendously from agreement-style sharing among local neighbors (a minimal sketch of such neighborhood averaging appears after this list).

5. Agreement on both reward and experience results in learning performance that is empirically almost equivalent to the centralized but factored scenario for smaller numbers of modules.


6. The difference in reward functions between centralized learning and distributed learning with agreement generates a difference in learned policies: for larger numbers of modules and for initial configurations of higher aspect ratio, agreement by neighborhood averaging of local estimates of rewards and experience favors the finding of more compact gaits, and is unlikely to find the same two-layer thread-like gaits that prevail in the solutions found by centralized learning.

7. Some post-learning knowledge transfer is possible between different spatially defined tasks for lattice-based SRMRs. It is achievable through simple gradient sensing, which is then used either as an additional observation component, or in a framework of local geometric policy transformations. Either way, the resulting policies may be perfectly adequate for the new task, or at the least can become good starting points for further search.
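As referenced in conclusion 4 above, the following Python sketch illustrates one generic form that agreement by neighborhood averaging can take; the synchronous convex-combination update and the example topology are illustrative assumptions, not the exact agreement algorithm evaluated in this thesis.

    def agreement_step(estimates, neighbors, mix=0.5):
        """One synchronous round of agreement by neighborhood averaging.

        estimates: dict mapping module id -> local scalar estimate (e.g., reward)
        neighbors: dict mapping module id -> list of neighboring module ids
        """
        updated = {}
        for i, nbrs in neighbors.items():
            if nbrs:
                nbr_avg = sum(estimates[j] for j in nbrs) / len(nbrs)
                updated[i] = (1.0 - mix) * estimates[i] + mix * nbr_avg
            else:
                updated[i] = estimates[i]
        return updated

    # Example: three modules in a line; repeated rounds drive the local
    # estimates toward a common value near the network average.
    est = {0: 1.0, 1: 0.0, 2: 0.5}
    nbrs = {0: [1], 1: [0, 2], 2: [1]}
    for _ in range(20):
        est = agreement_step(est, nbrs)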

While this thesis is mainly concerned with learning locomotion by self-reconfiguration on a two-dimensional square lattice, the results obtained herein are in no way restricted to this scenario. In chapter 6 we already generalized our findings to different objective functions in the same domain. This generalization required only minimal additional assumptions, which ensure that good policies remain within the search space. For example, extending to locomotion over obstacles involved introducing a third possible observation value per neighborhood cell, which increased the number of policy parameters. Another example is the objective of reaching an arbitrary goal in the workspace, which necessitated the additional assumption of a simple sensory model. In both cases, however, the structure of the algorithm did not change, and the lessons we have learned about scaffolding learning by constraining and biasing the search remain valid.
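To make the growth in parameters concrete, consider a lookup-table policy over an n-cell local neighborhood with v distinct values per cell and |A| actions; it has |A| * v^n parameters. The numbers below assume an 8-cell neighborhood purely for illustration and are not a report of our experimental settings:

\[
|\theta| = |A| \cdot v^{\,n}, \qquad \frac{|A| \cdot 3^{8}}{|A| \cdot 2^{8}} = \frac{6561}{256} \approx 25.6,
\]

so adding a third observation value per cell multiplies the number of parameters by roughly a factor of 26 in this illustrative case.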

In addition, we believe that the same arguments and ideas will be at play whenever cooperative distributed controller optimization is required in networked or physically connected systems where each agent suffers from partial observability and noisy local estimates of reward. Such situations include coordination of networked teams of robots, robotic swarm optimization, and learning of other cooperative group behavior. For example, locomotion gaits learned by modules in this thesis could potentially be useful in formation control of a team of wheeled mobile robots, whose connections to one another consist of remaining within communication range rather than being physically attached. Similarly, in manufacturing and self-assembly scenarios, particularly whenever the environment can be partitioned into a geometric lattice, it should be possible to employ the same learning strategies described in this thesis, provided enough relevant information and search constraints are available. The result would be more automation in coordinated and cooperative control in these domains, with less involvement from human operators.

7.3 Limitations and future work

At present, we have developed an additional automation layer for the design of distributed controllers. The policy is learned offline by GAPS; it can then be transferred to the robots. We have not addressed online adaptation in this thesis. In the future, we would like to see the robots, seeded with a good policy learned offline, run a much faster distributed adaptation algorithm that makes practical run-time, on-the-robot adjustments to changing environments and goals. We are currently exploring new directions within the same concept of restricting the search in an intelligent way without requiring too much involved analysis on the part of the human designer. It is worth noting that the computations required by the GAPS algorithm are simple enough to run on a microcontroller with limited computing power and memory.

A more extensive empirical analysis of the space of possible representations and search restrictions is needed for a definitive strategy for making policy search work quickly and reliably in SRMRs. In particular, we would like to be able to describe the properties of good feature spaces for our domain, and ways to construct them. A special study of the influence of policy parameter redundancy would be equally useful and interesting: for example, we would like to know in general when restricting exploration to only legal actions might be harmful rather than helpful because it removes a source of smoothness and redundancy in the landscape.

In this thesis, we have studied how search constraints and information affect the learning of locomotion gaits in SRMRs. The general conclusion is that more information is always better, especially when combined with restrictions on how it may be used, which essentially guides and constrains the search. This opens up the theoretical question of the amount and kinds of information needed for particular systems and tasks, which could provide a fruitful direction for future research.

As we explore collaboration between the human designer and the automated learning agent, our experiments bring forward some issues that such interactions could raise. As we have observed, partial information coming from the human designer can potentially lead the search away from the global optimum. This misdirection can take the form of a bad feature representation, an overconstrained search, or a partial policy that favors suboptimal actions.

Another issue is how to provide the learning agents with a reward signal that gives good estimates of the underlying objective function. Experiments have shown that a simple performance measure such as displacement during a time period cannot always disambiguate between good behavior (i.e., a locomotion gait) and an unacceptable local optimum (e.g., a long arm-like protrusion in the right direction). In this thesis, we have restricted experiments to objective functions that depend on a spatially described gradient. Future work could include more sophisticated reward signals.

We have also seen throughout that the learning success rate (as measured, for example, by the percentage of policies tested which exhibited correct behavior) depends not only on robot size, but also on the initial condition, specifically the configuration that the robot assumes at the start of every episode. In this thesis, we have limited our study to a few such initial configurations and robot sizes in order to compare the different approaches to learning. In the future, we would like to describe the space of initial conditions, and research their influence on learning performance more broadly.

Finally, the motivation of this work has been to address distributed RL problems of a cooperative nature; as a special case of those problems, some of our approaches (e.g., sharing experience) are applicable only to agents learning essentially identical policies. While this is a limitation of the present work, we believe that most of the algorithmic and representational findings we have discussed should also be valid in less homogeneous and less cooperative settings. In particular, sharing experience but not local rewards might lead to interesting developments in games with non-identical payoffs. These developments are also potential subjects for future work.

In addition to the above, we would like to see the ideas described in this thesis applied to concrete models of existing modular robots, as well as to other distributed domains, such as chain and hybrid modular robots, ad hoc mobile networks, and robotic swarms.
