
A Distributed Adaptive Control System for a Quadruped Mobile Robot

Bruce L. Digney and M. M. Gupta
Intelligent Systems Research Laboratory, College of Engineering
University of Saskatchewan, Saskatoon, Sask., CANADA S7N 0W0
Email: [email protected]

Abstract- In this research, a method by which reinforcement learning can be combined into a behavior based control system is presented. Behaviors which are impossible or impractical to embed as predetermined responses are learned through self-exploration and self-organization using a temporal difference reinforcement learning technique. This results in what is referred to as a distributed adaptive control system (DACS); in effect, the robot's artificial nervous system. A DACS is developed for a simulated quadruped mobile robot and the locomotion behavior level is isolated and evaluated. At the locomotion level the proper actuator sequences were learned for all possible gaits, and eventually graceful gait transitions were also learned. When confronted with an actuator malfunction, all gaits and transitions were adapted, resulting in new limping gaits for the quadruped.

I. INTRODUCTION

Although conventional control and artificial intelligence researchers have made many advances, neither ideology seems capable of realizing autonomous operation. That is, neither can produce machines which can interact with the world with an ease comparable to humans or at least higher animals. In responding to such limitations, many researchers have looked to biologically and physiologically based systems as the motivation to design artificial systems. Examples are the behavior based systems of Brooks [1] and Beer [2]. Behavior based control systems consist of a hierarchical structure of simple behavior modules. Each module is responsible for the sensory motor responses of a particular level of behavior. The overall effect is that higher level behaviors are recursively built upon lower ones and the resulting system operates in a self-organizing manner. Both Brooks' and Beer's systems were loosely based upon the nervous systems of insects. These artificial insects operated in a hardwired manner and exhibited an interesting repertoire of simple behaviors.


By hardwired it is meant that each behavior module had its responses predetermined and was simply programmed externally. Although this approach is successful with simple behaviors, it is obvious that many situations exist where predetermined solutions are impossible or impractical to obtain. It is subsequently proposed that by incorporating learning into the behavior based control system, these difficult behaviors could be acquired through self-exploration and self-learning.

Complex behaviors are usually characterized by a sequence of actions with success or failure only known at the end of that sequence. Also, the critical error signal is only an indication of the success or failure of the system and no information regarding error gradients can be determined, as in the case of continuous valued error feedback. Thus the required learning mechanism must be capable of both reinforcement learning as well as temporal credit assignment. Incremental dynamic programming techniques such as Barto's [3] temporal difference (TD) appear to be well suited to such tasks. Based upon Barto's previous adaptive heuristic critic [4], TD employs adaptive state and action evaluation functions to incrementally improve its action policy until successful operation is attained. The incorporation of TD learning into behavior based control results in a framework of adaptive behavior modules (ABMs) and non-adaptive behavior modules which is referred to here as a distributed adaptive control system (DACS). The remainder of this report will be concerned with a brief description of the DACS and ABMs, and the implementation of the locomotion level ABM within the DACS of a simulated quadruped mobile robot. This level is considered appropriate because the actuator sequences for quadruped locomotion are not intuitively obvious and are difficult to determine. Other levels such as global navigation, task planning and task coordination are implemented and discussed by Digney [5].

II. DISTRIBUTED ADAPTIVE CONTROL SYSTEMS

The DACS shown in Figure 1 is comprised of various adaptive and non-adaptive behavior modules.


Non-adaptive modules are present as inherent knowledge and are used where adaptive solutions are not required. All modules receive sensory inputs and respond with actions in an attempt to perform a command specified by a higher level. The performance of commands in most cases will require a sequence of actions by the lower level system and possibly the cooperation of many lower level systems. The coupling between ABMs is shown in Figure 2. In this configuration, the action from level l+1 becomes the command for level l. Level l+1 also supplies goal based reinforcement, r_g, to drive level l towards successful completion of that command. Level l in turn issues actions to level l-1 and receives environment based reinforcement, r_e, from level l-1. This environment based reinforcement is representative of the difficulty or cost incurred while performing the requested actions and is included to drive level l to a cost effective solution. While operating, level l may enter a state which is in some way damaging or dangerous. To drive the system away from such a state, sensor based reinforcement, r_s, is used. Sensor based reinforcement is supplied from sensors at level l. It is analogous to pain or fear and will ensure that level l operates in a safe manner. These three reinforcements are combined into a total reinforcement signal, r_t, according to Equation 1.

r_t = r_e + a_g r_g + a_s r_s    (1)

where: a_g and a_s are the relative importance of the reinforcements.
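As a small illustration, the sketch below evaluates Equation 1 for one level of the hierarchy; the function name, the default weights and the numeric values are assumptions made for this example only, not values taken from the paper.

def total_reinforcement(r_e, r_g, r_s, a_g=1.0, a_s=1.0):
    # Equation 1: combine environment based (r_e), goal based (r_g) and
    # sensor based (r_s) reinforcement; a_g and a_s weight the relative
    # importance of the goal and sensor terms (placeholder values here).
    return r_e + a_g * r_g + a_s * r_s

# Hypothetical step: some cost was incurred (r_e < 0), the commanded goal
# was reached (r_g > 0) and no damaging state was sensed (r_s = 0).
r_t = total_reinforcement(r_e=-0.1, r_g=1.0, r_s=0.0)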

Figure 1: Schematic of DACS

It can be seen from Figure 2 that the flow of environmental and sensor based reinforcement is in the upward direction. This will result in lower level skills and behaviors being learned first, then other higher level behaviors, converging in a recursive manner toward the highest level. Figure 1 shows this highest level as existing within a single physical machine. However, in the case of multiple machines operating in a collective, higher abstract behavior levels are possible. Within the context of this paper, only behaviors relevant to individual machines will be discussed.

Figure 2: Hierarchy of Three ABMs

In the absence of higher collective behaviors controlling individual machines, the purpose or task of the machine is embedded within the DACS as an instinct or drive. This instinct is the high level action which results in a feeling of accomplishment or positive reinforcement within the DACS. It is then the responsibility of the adaptive behavior modules within the DACS to learn the skills and behaviors necessary to fulfill this drive. This concept, as well as the self-organizing characteristics that result from such interactions, is further discussed by Digney [5].

The ABM is the primary adaptive building block for the DACS. Within it exist computational mechanisms for state classification, learning and the combination of reinforcement signals. Figure 3 shows a schematic of an ABM complete with incoming command, sensory and reinforcement signals. For clarity the outgoing reinforcement signals have been removed. For any particular level, say l, the ABM observes the relevant system states through appropriate sensors. For a perception system consisting of N sensors, the state S_l is defined as

S_l = (s_1, s_2, ..., s_N)    (2)

where: s_n is the individual sensor reading, 0 < n < N.
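For illustration, the sketch below builds the state vector of Equation 2 from N sensor readings and maps it to a discrete state index with a simple nearest-prototype rule; this rule is only a stand-in assumption for the idealized neural classification scheme described below, and all names and the vigilance value are hypothetical.

import math

def classify_state(sensors, prototypes, vigilance=0.5):
    # Form the state vector S_l = (s_1, ..., s_N) and return the index of
    # the nearest stored prototype; allocate a new prototype (a new state)
    # when nothing lies within the vigilance radius. This loosely mimics
    # an unsupervised classifier such as ART2-A, but is only a toy stand-in.
    s = tuple(float(x) for x in sensors)
    if prototypes:
        dists = [math.dist(s, p) for p in prototypes]
        best = min(range(len(dists)), key=dists.__getitem__)
        if dists[best] <= vigilance:
            return best
    prototypes.append(s)
    return len(prototypes) - 1

prototypes = []                  # learned state prototypes, initially empty
state = classify_state([0.2, 0.8, 0.1], prototypes)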


    Figure 3: Single ABM

State transitions are detected and the resulting states are classified using an idealized neural classification scheme.


This classification embodies the macroscopic operating principles of unsupervised neural networks such as ART2-A [6] and will be assumed adequate in the context of these simulations. The Temporal Difference (TD) algorithm as developed by Barto [3] learns by adjusting state and action evaluation functions, then uses these evaluations to choose an optimum action policy. It can be shown that these two evaluation functions can be combined into a single action dependent evaluation function, say Q_{s,u}, similar to that described by Barto [7]. Given the system at state s, the action taken, u*, is the action which satisfies

Q_{s,u*} + ν_{u*} = max_u { Q_{s,u} + ν_u }    (3)

where: ν is a random valued function.

In Equation 3, Q_{s,u} and ν can be thought of as the goal driven and exploration driven components of the action policy respectively. Taking the action u* results in the transition from state s to state w and the incurring of a total reinforcement signal r_t. The action dependent evaluation function error is obtained by modifying the TD error equation and is

e = r_t + γ Q_virtual - Q_{s,u*}    (4)

where: Q_virtual is the virtual state evaluation value of the next state w and γ is the temporal discount factor.

If the action, u, does not achieve the desired goal, the virtual state evaluation is

Q_virtual = max_u { Q_{w,u} }    (5)

It is easily seen that Q_virtual becomes the minimum action dependent evaluation function of the new state, w (remember the evaluation functions are negative in sign) and in effect corresponds to the action most likely to be taken when the system leaves state w. If the action, u, achieves the desired goal, the virtual state evaluation is

Q_virtual = 0    (6)

This provides relative state evaluations and allows for open-ended or cyclic goal states. This is illustrated by considering that for cyclic goals it is the dynamic transitions between states that constitute a goal state and not simply the arrival at a static system state(s).

This error is used to adapt the evaluation functions according to LMS rules as follows

Q_{s,u*}(k+1) = Q_{s,u*}(k) + η e

where: η is the rate of adaption and k is the index of adaption.
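A minimal sketch of one learning step under these rules is given below, assuming a tabular action dependent evaluation function; the uniform exploration noise, the function names and the default values of γ and η are assumptions of this sketch rather than details taken from the paper.

import random

def select_action(Q_s, noise=0.5):
    # Equation 3: choose u* maximizing Q_{s,u} + nu, where nu is a random
    # valued function acting as the exploration driven component.
    nu = [random.uniform(-noise, noise) for _ in Q_s]
    return max(range(len(Q_s)), key=lambda u: Q_s[u] + nu[u])

def td_update(Q, s, u_star, w, r_t, goal_achieved, gamma=0.9, eta=0.1):
    # Equations 5 and 6: virtual evaluation of the resulting state w,
    # zero when the commanded goal was achieved.
    Q_virtual = 0.0 if goal_achieved else max(Q[w])
    # Equation 4: modified temporal difference error.
    e = r_t + gamma * Q_virtual - Q[s][u_star]
    # LMS rule: adapt the action dependent evaluation function.
    Q[s][u_star] += eta * e
    return e

In use, select_action is called in the current state s, the chosen action is executed, and td_update is applied once the resulting state w and the total reinforcement r_t of Equation 1 are known.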

As the evaluation function converges, the goal driven component begins to dominate over the exploration driven component. The resulting action policy will perform the command in a successful and efficient manner. Generally, an ABM will be capable of performing more than a single command. For an ABM capable of c_max commands, the vector of the evaluation functions is defined as

Q_l = [ Q^0_{s,u}, Q^1_{s,u}, ..., Q^{c_max}_{s,u} ]

where: Q^c_{s,u} is the evaluation function for the particular command c, 0 < c < c_max.
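One plain way to hold such a vector of evaluation functions, one per command, is to key the tables by command, as in the hypothetical layout below.

# One action dependent evaluation table per command c, 0 < c < c_max.
# Each table maps a classified state index to a list of action evaluations.
c_max = 4                                  # illustrative number of commands
num_actions = 5                            # illustrative number of actions
Q = {c: {} for c in range(c_max)}

def evaluations_for(command, state):
    # Lazily create the evaluation vector of an unseen state.
    return Q[command].setdefault(state, [0.0] * num_actions)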

III. DACS FOR A QUADRUPED MOBILE ROBOT

To evaluate the DACS, the simulated quadruped shown in Figure 4 was used. This mobile robot was placed inside a simulated three dimensional landscape where it is left to develop skills and behaviors as it interacts with its environment. This world is made up of ramps, plateaus, cliffs and walls, as well as various substances of interest. In the absence of any predetermined knowledge it is the responsibility of the DACS, and in particular the ABMs, to acquire the skills and behaviors for successful operation.

    Figure 4: Simulated Quadruped

Although not the most efficient method of locomotion, the learning of quadruped walking provides interesting and challenging problems. It involves the learning of complex actuator sequences in the midst of numerous false goal states and modes of failure. Figure 5 shows the locomotion ABM with the appropriate sensory, reinforcement and motor action connections.



    Figure 5: Locomotion ABM

The commands, C_locomotion, are issued from the ABM above and are dependent upon the possible sensory states of that module. In this case these sensors are capable of detecting all realizable modes of body motion. The commands for the locomotion level are defined in Equation 10.

C_locomotion = { 0 (forward), 1 (left turn), ..., c_max (all possible modes) }    (10)

For any specific command the locomotion ABM will issue action responses, u_locomotion, to the actuators driving the legs in the horizontal, h, and vertical, v, directions. Within this action vector are the individual actuator commands to extend, ex, or retract, rt, as shown in Equations 11 and 12.

u_locomotion = [ u_leg(1), u_leg(2), u_leg(3), u_leg(4) ]    (11)

where

u_leg = { hold, v_ex (extend vertical), v_rt (retract vertical), h_ex (extend horizontal), h_rt (retract horizontal) }    (12)
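The command and action sets of Equations 10 to 12 can be spelled out as simple enumerations, as sketched below; only the two commands named in the text are listed, and the particular action vector shown is an arbitrary example rather than a gait from the paper.

from enum import IntEnum

class LegAction(IntEnum):
    # Per-leg actuator commands of Equation 12.
    HOLD = 0
    EXTEND_VERTICAL = 1       # v_ex
    RETRACT_VERTICAL = 2      # v_rt
    EXTEND_HORIZONTAL = 3     # h_ex
    RETRACT_HORIZONTAL = 4    # h_rt

# Locomotion commands of Equation 10; the remaining commands run up to
# c_max, covering all possible modes of body motion.
FORWARD, LEFT_TURN = 0, 1

# Equation 11: a locomotion action assigns one per-leg command to each of
# the four legs (an arbitrary example assignment).
u_locomotion = (LegAction.EXTEND_VERTICAL, LegAction.HOLD,
                LegAction.HOLD, LegAction.EXTEND_HORIZONTAL)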

Each leg is equipped with sensors for measuring the forces on each foot and the positions of each leg. The forces on the foot are biased such that -f_max