
A Distributed Adaptive Control System for a Quadruped Mobile Robot

Bruce L. Digney and M. M. Gupta
Intelligent Systems Research Laboratory, College of Engineering
University of Saskatchewan, Saskatoon, Sask., CANADA S7N 0W0
Email: [email protected]

Abstract- In this research, a method by which reinforcement learning can be combined into a behavior based control system is presented. Behaviors which are impossible or impractical to embed as predetermined responses are learned through self-exploration and self-organization using a temporal difference reinforcement learning technique. This results in what is referred to as a distributed adaptive control system (DACS); in effect, the robot's artificial nervous system. A DACS is developed for a simulated quadruped mobile robot and the locomotion behavior level is isolated and evaluated. At the locomotion level the proper actuator sequences were learned for all possible gaits, and eventually graceful gait transitions were also learned. When confronted with an actuator malfunction, all gaits and transitions were adapted, resulting in new limping gaits for the quadruped.

I. INTRODUCTION

Although conventional control and artificial intelligence researchers have made many advances, neither ideology seems capable of realizing autonomous operation. That is, neither can produce machines which can interact with the world with an ease comparable to humans or at least higher animals. In responding to such limitations, many researchers have looked to biologically and physiologically based systems as the motivation to design artificial systems. Examples are the behavior based systems of Brooks [1] and Beer [2]. Behavior based control systems consist of a hierarchical structure of simple behavior modules. Each module is responsible for the sensory motor responses of a particular level of behavior. The overall effect is that higher level behaviors are recursively built upon lower ones and the resulting system operates in a self-organizing manner. Both Brooks' and Beer's systems were loosely based upon the nervous systems of insects. These artificial insects operated in a hardwired manner and exhibited an interesting repertoire of simple behaviors.


By hardwired it is meant that each behavior module had its responses predetermined and was simply programmed externally. Although this approach is successful with simple behaviors, it is obvious that many situations exist where predetermined solutions are impossible or impractical to obtain. It is subsequently proposed that by incorporating learning into the behavior based control system, these difficult behaviors could be acquired through self-exploration and self-learning.

Complex behaviors are usually characterized by a sequence of actions with success or failure only known at the end of that sequence. Also, the critical error signal is only an indication of the success or failure of the system and no information regarding error gradients can be determined, as in the case of continuous valued error feedback. Thus the required learning mechanism must be capable of both reinforcement learning as well as temporal credit assignment. Incremental dynamic programming techniques such as Barto's [3] temporal difference (TD) appear to be well suited to such tasks. Based upon Barto's previous adaptive heuristic critic [4], TD employs adaptive state and action evaluation functions to incrementally improve its action policy until successful operation is attained. The incorporation of TD learning into behavior based control results in a framework of adaptive behavior modules (ABMs) and non-adaptive behavior modules which is referred to here as a distributed adaptive control system (DACS). The remainder of this report will be concerned with a brief description of the DACS and ABMs, and the implementation of the locomotion level ABM within the DACS of a simulated quadruped mobile robot. This level is considered appropriate because the actuator sequences for quadruped locomotion are not intuitively obvious and are difficult to determine. Other levels such as global navigation, task planning and task coordination are implemented and discussed by Digney [5].

II. DISTRIBUTED ADAPTIVE CONTROL SYSTEMS

The DACS shown in Figure 1 is comprised of various adaptive and non-adaptive behavior modules.


Non-adaptive modules are present as inherent knowledge and are used where adaptive solutions are not required. All modules receive sensory inputs and respond with actions in an attempt to perform a command specified by a higher level. The performance of commands in most cases will require a sequence of actions by the lower level system and possibly the cooperation of many lower level systems. The coupling between ABMs is shown in Figure 2. In this configuration, the action from level l+1 becomes the command for level l. Level l+1 also supplies goal based reinforcement, r_g, to drive level l towards successful completion of that command. Level l in turn issues actions to level l-1 and receives environment based reinforcement, r_e, from level l-1. This environment based reinforcement is representative of the difficulty or cost incurred while performing the requested actions and is included to drive level l to a cost effective solution. While operating, level l may enter a state which is in some way damaging or dangerous. To drive the system away from such a state, sensor based reinforcement, r_s, is used. Sensor based reinforcement is supplied from sensors at level l. It is analogous to pain or fear and will ensure that level l operates in a safe manner. These three reinforcements are combined into a total reinforcement signal, r_t, according to Equation 1.

r_t = r_e + a_g r_g + a_s r_s    (1)

where: a_g and a_s are the relative importance of the reinforcements.
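As a small illustration, the sketch below evaluates Equation 1 for one level of the hierarchy; the function name, the default weights and the numeric values are assumptions made for this example only, not values taken from the paper.

def total_reinforcement(r_e, r_g, r_s, a_g=1.0, a_s=1.0):
    # Equation 1: combine environment based (r_e), goal based (r_g) and
    # sensor based (r_s) reinforcement; a_g and a_s weight the relative
    # importance of the goal and sensor terms (placeholder values here).
    return r_e + a_g * r_g + a_s * r_s

# Hypothetical step: some cost was incurred (r_e < 0), the commanded goal
# was reached (r_g > 0) and no damaging state was sensed (r_s = 0).
r_t = total_reinforcement(r_e=-0.1, r_g=1.0, r_s=0.0)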

Figure 1: Schematic of DACS

It can be seen from Figure 2 that the flow of environmental and sensor based reinforcement is in the upward direction. This will result in lower level skills and behaviors being learned first, then other higher level behaviors, converging in a recursive manner toward the highest level. Figure 1 shows this highest level as existing within a single physical machine. However, in the case of multiple machines operating in a collective, higher abstract behavior levels are possible. Within the context of this paper, only behaviors relevant to individual machines will be discussed.

Figure 2: Hierarchy of Three ABMs

In the absence of higher collective behaviors controlling individual machines, the purpose or task of the machine is embedded within the DACS as an instinct or drive. This instinct is the high level action which results in a feeling of accomplishment or positive reinforcement within the DACS. It is then the responsibility of the adaptive behavior modules within the DACS to learn the skills and behaviors necessary to fulfill this drive. This concept, as well as the self-organizing characteristics that result from such interactions, is further discussed by Digney [5].

The ABM is the primary adaptive building block for the DACS. Within it exist computational mechanisms for state classification, learning and the combination of reinforcement signals. Figure 3 shows a schematic of an ABM complete with incoming command, sensory and reinforcement signals. For clarity the outgoing reinforcement signals have been removed. For any particular level, say l, the ABM observes the relevant system states through appropriate sensors. For a perception system consisting of N sensors, the state S_l is defined as

S_l = (s_1, s_2, ..., s_N)    (2)

where: s_n is the individual sensor reading, 0 < n < N.
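For illustration, the sketch below builds the state vector of Equation 2 from N sensor readings and maps it to a discrete state index with a simple nearest-prototype rule; this rule is only a stand-in assumption for the idealized neural classification scheme described below, and all names and the vigilance value are hypothetical.

import math

def classify_state(sensors, prototypes, vigilance=0.5):
    # Form the state vector S_l = (s_1, ..., s_N) and return the index of
    # the nearest stored prototype; allocate a new prototype (a new state)
    # when nothing lies within the vigilance radius. This loosely mimics
    # an unsupervised classifier such as ART2-A, but is only a toy stand-in.
    s = tuple(float(x) for x in sensors)
    if prototypes:
        dists = [math.dist(s, p) for p in prototypes]
        best = min(range(len(dists)), key=dists.__getitem__)
        if dists[best] <= vigilance:
            return best
    prototypes.append(s)
    return len(prototypes) - 1

prototypes = []                  # learned state prototypes, initially empty
state = classify_state([0.2, 0.8, 0.1], prototypes)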


    Figure 3: Single ABM

State transitions are detected and the resulting states are classified using an idealized neural classification scheme.


This classification embodies the macroscopic operating principles of unsupervised neural networks such as ART2-A [6] and will be assumed adequate in the context of these simulations. The Temporal Difference (TD) algorithm as developed by Barto [3] learns by adjusting state and action evaluation functions, then uses these evaluations to choose an optimum action policy. It can be shown that these two evaluation functions can be combined into a single action dependent evaluation function, say Q_{s,u}, similar to that described by Barto [7]. Given the system at state s, the action taken, u*, is the action which satisfies

Q_{s,u*} + ν_{u*} = max_u { Q_{s,u} + ν_u }    (3)

where: ν is a random valued function.

In Equation 3, Q_{s,u} and ν can be thought of as the goal driven and exploration driven components of the action policy respectively. Taking the action u* results in the transition from state s to state w and the incurring of a total reinforcement signal r_t. The action dependent evaluation function error is obtained by modifying the TD error equation and is

e = r_t + γ Q_virtual - Q_{s,u*}    (4)

where: Q_virtual is the virtual state evaluation value of the next state w and γ is the temporal discount factor.

If the action, u, does not achieve the desired goal, the virtual state evaluation is

Q_virtual = max_u { Q_{w,u} }    (5)

It is easily seen that Q_virtual becomes the minimum action dependent evaluation function of the new state, w (remember the evaluation functions are negative in sign) and in effect corresponds to the action most likely to be taken when the system leaves state w. If the action, u, achieves the desired goal, the virtual state evaluation is

Q_virtual = 0    (6)

This provides relative state evaluations and allows for open-ended or cyclic goal states. This is illustrated by considering that for cyclic goals it is the dynamic transitions between states that constitute a goal state and not simply the arrival at a static system state(s).

This error is used to adapt the evaluation functions according to LMS rules as follows

Q_{s,u*}(k+1) = Q_{s,u*}(k) + η e

where: η is the rate of adaption and k is the index of adaption.
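A minimal sketch of one learning step under these rules is given below, assuming a tabular action dependent evaluation function; the uniform exploration noise, the function names and the default values of γ and η are assumptions of this sketch rather than details taken from the paper.

import random

def select_action(Q_s, noise=0.5):
    # Equation 3: choose u* maximizing Q_{s,u} + nu, where nu is a random
    # valued function acting as the exploration driven component.
    nu = [random.uniform(-noise, noise) for _ in Q_s]
    return max(range(len(Q_s)), key=lambda u: Q_s[u] + nu[u])

def td_update(Q, s, u_star, w, r_t, goal_achieved, gamma=0.9, eta=0.1):
    # Equations 5 and 6: virtual evaluation of the resulting state w,
    # zero when the commanded goal was achieved.
    Q_virtual = 0.0 if goal_achieved else max(Q[w])
    # Equation 4: modified temporal difference error.
    e = r_t + gamma * Q_virtual - Q[s][u_star]
    # LMS rule: adapt the action dependent evaluation function.
    Q[s][u_star] += eta * e
    return e

In use, select_action is called in the current state s, the chosen action is executed, and td_update is applied once the resulting state w and the total reinforcement r_t of Equation 1 are known.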

As the evaluation function converges, the goal driven component begins to dominate over the exploration driven component. The resulting action policy will perform the command in a successful and efficient manner. Generally, an ABM will be capable of performing more than a single command. For an ABM capable of c_max commands, the vector of the evaluation functions is defined as

Q_l = [ Q^0_{s,u}, Q^1_{s,u}, ..., Q^{c_max}_{s,u} ]

where: Q^c_{s,u} is the evaluation function for the particular command c, 0 < c < c_max.
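One plain way to hold such a vector of evaluation functions, one per command, is to key the tables by command, as in the hypothetical layout below.

# One action dependent evaluation table per command c, 0 < c < c_max.
# Each table maps a classified state index to a list of action evaluations.
c_max = 4                                  # illustrative number of commands
num_actions = 5                            # illustrative number of actions
Q = {c: {} for c in range(c_max)}

def evaluations_for(command, state):
    # Lazily create the evaluation vector of an unseen state.
    return Q[command].setdefault(state, [0.0] * num_actions)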

III. DACS FOR A QUADRUPED MOBILE ROBOT

To evaluate the DACS, the simulated quadruped shown in Figure 4 was used. This mobile robot was placed inside a simulated three dimensional landscape where it is left to develop skills and behaviors as it interacts with its environment. This world is made up of ramps, plateaus, cliffs and walls, as well as various substances of interest. In the absence of any predetermined knowledge it is the responsibility of the DACS, and in particular the ABMs, to acquire the skills and behaviors for successful operation.

    Figure 4: Simulated Quadruped

Although not the most efficient method of locomotion, the learning of quadruped walking provides interesting and challenging problems. It involves the learning of complex actuator sequences in the midst of numerous false goal states and modes of failure. Figure 5 shows the locomotion ABM with the appropriate sensory, reinforcement and motor action connections.



    Figure 5: Locomotion ABM

The commands, C_locomotion, are issued from the ABM above and are dependent upon the possible sensory states of that module. In this case these sensors are capable of detecting all realizable modes of body motion. The commands for the locomotion level are defined in Equation 10.

C_locomotion = { 0 (forward), 1 (left turn), ..., c_max (all possible modes) }    (10)

For any specific command the locomotion ABM will issue action responses, u_locomotion, to the actuators driving the legs in the horizontal, h, and vertical, v, directions. Within this action vector are the individual actuator commands to extend, ex, or retract, rt, as shown in Equations 11 and 12.

u_locomotion = [ u_leg(1), u_leg(2), u_leg(3), u_leg(4) ]    (11)

where

u_leg = { hold, v_ex (extend vertical), v_rt (retract vertical), h_ex (extend horizontal), h_rt (retract horizontal) }    (12)
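The command and action sets of Equations 10 to 12 can be spelled out as simple enumerations, as sketched below; only the two commands named in the text are listed, and the particular action vector shown is an arbitrary example rather than a gait from the paper.

from enum import IntEnum

class LegAction(IntEnum):
    # Per-leg actuator commands of Equation 12.
    HOLD = 0
    EXTEND_VERTICAL = 1       # v_ex
    RETRACT_VERTICAL = 2      # v_rt
    EXTEND_HORIZONTAL = 3     # h_ex
    RETRACT_HORIZONTAL = 4    # h_rt

# Locomotion commands of Equation 10; the remaining commands run up to
# c_max, covering all possible modes of body motion.
FORWARD, LEFT_TURN = 0, 1

# Equation 11: a locomotion action assigns one per-leg command to each of
# the four legs (an arbitrary example assignment).
u_locomotion = (LegAction.EXTEND_VERTICAL, LegAction.HOLD,
                LegAction.HOLD, LegAction.EXTEND_HORIZONTAL)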

Each leg is equipped with sensors for measuring the forces on each foot and the positions of each leg. The forces on the foot are biased such that -f_max