
Robotics and Autonomous Systems 19 (1997) 259-271

Hierarchical refinement of skills and skill application for autonomous robots

Michael Kaiser*, Rüdiger Dillmann

University of Karlsruhe, Institute for Real-Time Computer Systems and Robotics, D-76128 Karlsruhe, Germany

Abstract

One of the major goals in designing learning robots is to let these robots develop useful skills over time. These skills are not only related to physical actions of the robot, but also to the coordination of activities, communication with humans, and active sensing. Throughout this paper, the interdependency between these different kinds of skills is analyzed. For the case of elementary action skills and coordination skills, methods for integration of skill application and refinement are developed. It is shown that this integration has the potential to support long-term learning and autonomous experimentation.

1. Introduction

The complexity of tasks that autonomous manipulation robots as well as autonomous mobile platforms have to solve is usually tackled through task decomposition. Here, two characteristic modes of operation can always be distinguished:

Model-based operation, including path planning or assembly planning and execution on the basis of an a priori given and possibly continuously refined geometrical world model. This also includes mission planning, which might be based on a given model of the world's "semantics".

Reactive operation such as collision avoidance or compliant motion that involves a direct coupling between the robot's sensors and its actuators. The next action of the robot is determined by the current sensor readings, possibly their history, and the current goal. These operations will from now on be referred to as the basic or elementary action skills of the robot. They are complemented by basic sensing skills that

* Corresponding author. E-mail: [email protected].

allow for a targeted use of the robot's sensors in the framework of plan and mission execution.

As modeling and planning only make sense down to a certain level of abstraction, both modes of operation are usually combined. Then, elementary skills represent the interface between the planning and the control level in the robot's architecture. They also determine the basic operators available for planning: Only if the robot is able to associate a symbolic operator with a sequence of actions that are possibly dependent on its perceptions, i.e., only if the robot can operationalize the operator by applying a particular skill, does using this operator on the planning level make sense [18]. In addition, coordination skills are required to foster the efficient use of the available elementary sensing and action skills.

Realizing elementary sensing and action skills requires mapping perceptions to actions by means of a strategy that is goal-oriented. Several possibilities to encode such a strategy exist (Fig. 1). The "traditional" approach is the model-based one, which tries to determine the skill's application conditions a priori and to



[Fig. 1. Approaches to the design of elementary action and sensing skills: a skill specification (exact knowledge, approximate knowledge, examples, or an evaluation function) leads to a prototype of the robot skill (program/trajectory), which is then subject to application and refinement.]

systematically design a skill based on this identification. Since exact models are seldom feasible, usually qualitative models are in use, resulting, for example, in probabilistic approaches as in the case of occupancy grids [9], or in the use of fuzzy systems [32]. However, this explicit coding of the required strategy is not an easy task. Specifically, it cannot be assumed that users of future service robots will be able to perform this kind of low-level robot programming.

What can be expected from users of such robots is their ability to demonstrate strategies related to skills (e.g., by manually operating the robot), and to evaluate the strategy used by the robot, i.e., to evaluate the robot's performance. Approaches to learning based on human-generated examples usually appear under the headings of skill acquisition from human demonstration and behavior cloning, respectively. In robotics, these approaches are mostly related to manipulation skills [2,14,16]. The acquisition of basic mobility skills from human demonstration has, for instance, been described by Pomerleau [30] and Reignier [31]. However, to allow a human user to program the robot via demonstration, the human's actions must be correctly understood by the robot. Also, feedback to the human teacher must be given in an appropriate manner. Both requirements add up to the need for communication skills.

The refinement of elementary skills is a problem whose solution can be approached by means of conventional adaptive control techniques as well as through reinforcement learning [4,24,33] and in the context of adaptive neuro- and fuzzy-control [5,27].

Refinement, however, is not always an appropriate means to deal with an unsatisfactory performance of the robot. Sometimes, a new elementary skill must be generated in order to achieve a certain goal, or the conditions triggering the use of a specific elementary skill (i.e., the coordination skill) must be changed. If several skills for achieving a subgoal are available (e.g., several collision avoidance techniques based on different kinds of sensors), the most suitable one should be selected for a particular environment. If incomplete or even incorrect knowledge about the robot and the task is assumed, finding the best skill is also a problem of adaptation.

This paper presents a hierarchical approach to skill learning that includes skill refinement as well as the adaptation of skill activation conditions and the identification of a requirement to generate new skills. The use of skills to discretize the robot's action space allows for the solution of complex tasks. The specific problems that are tackled are:
• the initial acquisition and the refinement of elementary skills;
• the construction of a symbolic representation that allows for assigning skills to states that are defined via environmental conditions or mission-related conditions; and
• the refinement of the state transition → skill mapping.

Individual solutions to these problems are based on a precise definition of the term skill, as well as on a distinction of the several types of skills that an autonomous system must provide. The paper is therefore organized as follows: First, the notion of skill is clarified and skill models are developed. Then, representations for both skills and environmental states are developed and related to each other. Finally, solutions to the problem of multilevel skill refinement are presented, and experimental results are given.

2. Skills and skill models

"Skill" denotes the learned power of doing a thing completently. From a system's theoretic view-point, this means that for a given state x(t), the skilled system (the robot) should perform a competent action u(t): Such a competent action is an action that - possibly in the long term - contributes to achieving a goal.


[Fig. 2. The role of Cs, rs, es and ts during the application of an elementary action skill: the error criterion es and the termination criterion ts monitor the robot state x(t), signaling an error in skill execution and successful skill application, respectively. While es = 0 and ts = 0, the control function Cs produces the action, and the evaluation function rs computes the reward r(t).]

In the context of robot programming and control, the following types of skills can be identified:

Action skills are skills that relate goals to physical actions (e.g., motions) of the robot. Mostly, they rely on the robot's perceptions to produce a competent action.

Sensing skills are required for active sensing and involve a goal-oriented parameterization and use of the available sensors.

Communication skills are required for communicating information from the robot to the environment and vice versa. For example, the ability to understand a human demonstration is a communication skill, as is the construction of a natural language description of the robot's perceptions.

Coordination skills are responsible for defining appropriate subgoals that allow for efficient application of action, sensing, and communication skills.

The focus of this paper is on the acquisition and refinement of action and coordination skills. The efficient application of sensing skills has, for example, been described in [23,34]; their acquisition is for instance treated in [1]. Communication skills are mostly considered within the context of Robot Programming by Demonstration (RPD [13,22]), as, for example, in [21].

2.1. The local skill model

For a given state x(t), a skilled system (the robot) should perform a goal-oriented action u(t). The action performed should be the result of a competent decision, i.e., it should be optimal with respect to an evaluation criterion (a reward) r(x(t)) that is related to the goal to be achieved. Essentially, a skill s is therefore given through a control function

u(t) = Cs(x(t)),    (1)

that implicitly encodes the goal associated to the skill and produces in each state x(t) a competent action u(t), and a reward function rs(x(t)) ∈ [rmin, rmax] that evaluates the state x(t) w.r.t. the goal. To allow for execution monitoring, an error criterion es(x(t)) is also required.

If the skill application involves moving the system along a trajectory (x*(0), ..., x*(T)) in the state space, a termination criterion ts(x(t)) ∈ {0, 1} must also be present (see Fig. 2). The local model of the skill s is therefore given as a quadruple of functions, s = (Cs, rs, es, ts).
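For illustration, the quadruple translates directly into a data structure. The following is a minimal sketch in hypothetical Python (the paper prescribes no implementation); the names LocalSkill, apply_skill, read_state, and execute are assumptions chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

State = Sequence[float]    # x(t), the robot state
Action = Sequence[float]   # u(t), a directly executable action

@dataclass
class LocalSkill:
    """Local skill model s = (Cs, rs, es, ts)."""
    control: Callable[[State], Action]   # Cs: produces a competent action u(t)
    reward: Callable[[State], float]     # rs: evaluates x(t) w.r.t. the goal
    error: Callable[[State], bool]       # es: execution-monitoring criterion
    terminate: Callable[[State], bool]   # ts: fires when the goal state is reached

def apply_skill(skill: LocalSkill, read_state: Callable[[], State],
                execute: Callable[[Action], None], max_steps: int = 1000) -> str:
    """Run the skill's control loop until es or ts fires (cf. Fig. 2)."""
    for _ in range(max_steps):
        x = read_state()
        if skill.error(x):         # es(x(t)) = 1: error in skill execution
            return "error"
        if skill.terminate(x):     # ts(x(t)) = 1: skill successfully applied
            return "done"
        execute(skill.control(x))  # u(t) = Cs(x(t))
    return "timeout"
```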

In the case of elementary action skills and elementary sensing skills, the state x is represented as a sequence of sensorial inputs y, i.e., x(t) = (y(t − d), ..., y(t − d − p)), d, p ≥ 0, while the action u(t) is directly executable by the robot (e.g., as a primitive motion or a one-shot parameterization of a sensor) without requiring any further processing on a level above or equal to the skill level. The skill is therefore operational with respect to the robot. For coordination skills, the state describes the robot's environment on an application-dependent abstract level, whereas the action produced at any instant is the decision to activate


or deactivate a particular elementary skill. In the case of communication skills, the function Cs maps components of the robot's internal state vector or a state transition to a set or sequence of human-understandable symbols. Alternatively, it may map human-generated symbols of different modalities (e.g., speech or gestural input) to a robot-internal state vector or a state transition. In all cases, the error criterion is given via boundary conditions on the state variables. If present, the termination criterion is defined in a similar manner for the goal state.

2.2. Extensions to the local skill model

The local skill model considers only those components of a skill that are relevant for skill execution (Fig. 2). However, these components do not support a targeted selection and activation of skills.

To make efficient use of an elementary action or sensing skill, it is necessary to characterize appropriate applicability conditions. In addition, the effect of the skill application must be described. The global skill model therefore comprises not only the functions Cs, rs, es, and ts, but also a predicate As that determines the applicability of s and a function Es that describes the changes in the state variables induced by applying s.

Both As and Es are required to operate on those representations that are used by the existing skill selection and coordination mechanisms, i.e., to use the appropriate symbols. Consequently, the extension of the local skill model results in a hybrid representation of the skill s (Fig. 3). 1
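Continuing the sketch from Section 2.1, the extension is small: the symbolic components As and Es are attached to the local model. Again hypothetical Python; GlobalSkill, candidate_skills, and the dict-based context are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GlobalSkill(LocalSkill):
    """Hybrid representation: the local functions plus the symbolic As and Es."""
    applicable: Callable[[dict], bool]  # As: predicate over the symbolic context
    effect: Callable[[dict], dict]      # Es: predicted change of the context

def candidate_skills(context: dict, skills: list) -> list:
    """A coordination mechanism consults As before considering a skill at all."""
    return [s for s in skills if s.applicable(context)]
```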

2.3. Skill acquisition and representation

To realize an elementary action or sensing skill, the actual design (acquisition) procedure must be chosen (Fig. 1). Depending on this procedure, appropriate representation formalisms for each skill component must be selected.

2.3.1. Skill acquisition from human demonstration

One of the most promising approaches to the acquisition of elementary action and sensing skills

1 It should be noted that As and Es of elementary skills are mirrored in Cs of coordination skills.

[Fig. 3. Hybrid skill representation by the components of the extended skill model: a supervising and coordinating entity (user or coordination mechanism) interacts with the applicability conditions As and application postconditions Es on the symbolic level, and with the control function, evaluation function, and criteria functions on the execution level via application, error, evaluation, and termination signals.]

is to build these skills from human demonstrations [1,16,31]. However, two facts must always be considered:
(1) Human-generated examples alone are not enough for a robot to learn a skill sufficiently well [17].
(2) Skill refinement based only on a scalar evaluation of the robot's performance, as it is necessary in the context of skill acquisition from human demonstration, can only be successfully performed if the refinement procedure is provided with hints about a possible strategy. These hints can also be obtained from user demonstrations [16].

Within the general setting of learning from examples and based on these observations, several requirements emerge that must be met by the methods employed to approximate the control function Cs and the evaluation function rs:
(1) It must be possible to construct the function approximator from the human performance data.
(2) It must be possible to incrementally train the function approximator.
(3) It must be possible to analyze the function approximator on-line to determine if its structure is appropriate to approximate the target function sufficiently well.


As far as the selected representation techniques have a direct impact on the refinement procedures, these techniques are now discussed. For a detailed description of the several phases of the skill acquisition process, see [7,10,16,20].

2.3.2. Representation of control function and evaluation function

One possibility to meet the given requirements on the level of elementary skills is to employ Radial-Basis Function networks (RBF networks) for approximating Cs and rs. The basic principle realized in RBF networks is that of localized receptive fields. An RBF network consists of an input layer holding the current input values, a hidden layer representing the number of clusters in the input space, and an output layer integrating the degrees of membership calculated by the hidden layer units. This calculation is performed on the basis of two cluster-specific parameters, its center μ and its width σ. The hidden neuron representing a particular cluster gives the highest output when the input it receives is close to its center. The exact value of the output is determined by the neuron's transfer function (mostly a multidimensional Gaussian function or a function representing a hypersphere or a hypercube), and the specific measure of distance, which is usually the Euclidean. An extension of RBF networks that is particularly interesting for control applications is the introduction of time-delays that has been proposed by Berthold [6] (Fig. 4).

[Fig. 4. Structure of a Radial-Basis Function network with time-delays in the hidden layer.]

The output y (i.e., the action u produced by the network representing Cs or the predicted evaluation r

produced by the network representing rs) calculated by an RBF network for a given input vector x is given as

yj = f(x) = s(Σi wi,j ri(x)),

where s is the transfer function of the output neuron (usually the identity), wi,j is the weight between the output neuron j and the cluster i, and ri(x) is the membership value calculated for cluster i. For example,

ri(x) = exp[−(‖x − μi‖/σi)²]

with center μi and width σi of cluster i. In case of d time-delays in the hidden (cluster) layer, the output neuron calculates the output yj by adding current membership values and their history over the last d time steps, such that

yj(t) = s(Σi Σk=0..d wi,j,k ri(x(t − k))).

These networks are capable of universal approximation over R^n or over compact subsets of R^n, given proper cluster centers μi [29].
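The following sketch implements the forward pass just described, including the time-delayed membership values, together with a gradient-descent update of the output weights (used for training, cf. the construction procedure below). It is a minimal illustration in Python/NumPy under the stated formulas, not the implementation used in the experiments; the class and method names are assumptions.

```python
import numpy as np

class TimeDelayRBF:
    """RBF network with d time-delays in the hidden layer (cf. Fig. 4)."""

    def __init__(self, centers, widths, n_outputs, d):
        self.mu = np.atleast_2d(centers)   # cluster centers mu_i
        self.sigma = np.asarray(widths)    # cluster widths sigma_i
        self.d = d                         # number of time-delays
        # output weights w_{i,j,k}: cluster i, output j, delay k
        self.w = np.zeros((len(self.mu), n_outputs, d + 1))
        self.history = []                  # r(x(t)), r(x(t-1)), ...

    def memberships(self, x):
        """r_i(x) = exp[-(||x - mu_i|| / sigma_i)^2], Gaussian transfer."""
        dist = np.linalg.norm(self.mu - np.asarray(x), axis=1)
        return np.exp(-(dist / self.sigma) ** 2)

    def forward(self, x):
        """y_j(t) = sum_i sum_k w_{i,j,k} r_i(x(t-k)); output transfer s = id."""
        self.history.insert(0, self.memberships(x))
        self.history = self.history[: self.d + 1]
        y = sum(r_k @ self.w[:, :, k] for k, r_k in enumerate(self.history))
        self.last_y = y
        return y

    def train_step(self, target, eta=0.05):
        """Gradient descent on the output weights for the last forward pass."""
        err = np.asarray(target) - self.last_y           # (target_j - y_j)
        for k, r_k in enumerate(self.history):
            self.w[:, :, k] += eta * np.outer(r_k, err)  # dy_j/dw_{i,j,k} = r_i
```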

To construct an RBF network, the number of clusters, their centers, and their widths have to be determined. For solving this task, several algorithms have been proposed [25,26], among those a supervised clustering algorithm specifically designed for constructing networks to approximate continuous functions [3]. This algorithm is based on the idea that the amount of overlap between individual clusters of an RBF network should depend on the similarity of the outputs associated to these clusters. Additionally, it allows determining the amount of generalization done by the network by specifying the maximum network activity in regions of the input space that are not covered by examples (for further details see [3]). Following the construction, the network's output weights wi,j,k are trained using gradient descent.

2.3.3. Representation of termination criterion and error criterion

The criterion functions ts and es are represented as region lists, i.e., ordered lists of labeled hyperintervals (Fig. 5). Two lists originate from human demonstration data: the first initially consisting only of a hyperinterval representing es that describes the encountered state space, and a second one that represents the goal state by a hyperinterval, too.

[Fig. 5. Two-dimensional region list drawn "in reverse order". During evaluation, the brighter regions are considered first.]
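A region list can be kept as an ordered list of hyperintervals scanned front to back; the first region that contains the state decides the criterion's value, matching the "reverse order" evaluation of Fig. 5 and the insertion behavior described in Section 3.1. A small hypothetical sketch (insert_region and evaluate are assumed names):

```python
import numpy as np

# A region is a labeled hyperinterval (lower corner, upper corner, label).
def insert_region(regions, lower, upper, label):
    """Prepend, so that newer regions override older ones during evaluation."""
    regions.insert(0, (np.asarray(lower), np.asarray(upper), label))

def evaluate(regions, x, default=0):
    """Return the label of the first region containing x."""
    x = np.asarray(x)
    for lower, upper, label in regions:
        if np.all(lower <= x) and np.all(x <= upper):
            return label
    return default
```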

2.3.4. Representation of applicability: pre- and post-conditions

In the context of skill acquisition from human demonstration, several methods to represent the pre- and post-conditions associated to an elementary skill are available. Firstly, As can be described as a subset of the set of possible application contexts, while Es is the change in the application context.

For example, the coordination mechanism may represent the robot's environment by means of object-attribute-value pairs (ok : ak, Wk) and relations R = ((o1 : a1, W1), ..., (on : an, Wn)) with Wi, 1 ≤ i ≤ n, being a set of possible values of oi : ai. It cannot be assumed that sufficiently many examples of the form (Cb(s), Ca(s)), with Cb(s) describing the environment before and Ca(s) after the application of skill s, are available. Hence, As and Es can in general not be learned from examples. Therefore, an analytic learning technique that starts from a description of the user's intention is applied to determine As [10]. Es is chosen to be this intention.
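As a toy illustration of this representation, As can be realized as a predicate over object-attribute-value contexts. The helper below is hypothetical and the attribute names are invented for the example:

```python
def make_applicability(admissible: dict):
    """admissible maps 'object:attribute' to the set Wi of values for which
    the skill is applicable; As(C) holds if every attribute lies in its set."""
    def As(context: dict) -> bool:
        return all(context.get(key) in values
                   for key, values in admissible.items())
    return As

# e.g., a docking skill applicable only near an unoccupied docking station:
As_dock = make_applicability({"robot:area": {"hall", "corridor"},
                              "station:state": {"free"}})
assert As_dock({"robot:area": "hall", "station:state": "free"})
```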

In the context of service robots and robot programming by demonstration, another appealing option is to build an explicit representation of the states and state transitions to which elementary skills should be linked (Fig. 6). This kind of representation is especially helpful if the initial assignment of skills to state transitions is to be given by a human user. For mobile robots such as PRIAMOS [8], one possibility to realize this

Fig. 6. Association of an insertion skill to a graphically represented state transition.

kind of representation is given by a connectivity graph (Fig. 7).

To automatically generate such graphs from a geometrical model, the literature often proposes Voronoi diagrams [28]. However, Voronoi-based methods that generate a topological graph based on a two-dimensional representation of the environment, as it is used in PRIAMOS' control system, in general result in a huge set of very short edges. Additionally, these methods rely only on geometrical information and do not take other data into account, e.g., whether a particular segment represents a stationary or a dynamic obstacle. Therefore, the map shown in Fig. 7 has been created using a method based on tests of the environment's structure. This method is described in [19].

3. Skill application and refinement

During application and refinement of elementary skills, two levels of abstraction must be considered (Fig. 8). First, refinement (and, possibly, extension) takes place on the skill level itself, i.e., the functions Cs, ts, and es are altered (or extended) according to a local measure of performance that is modeled by the function rs. Secondly, the strategy responsible for choosing a particular skill in a particular situation, i.e., the coordination skill, has to be adapted to environmental and task-specific conditions. Here, learning to choose the best skill is equivalent to finding the best action to be taken to perform the desired state transition on the global level (the coordination level).

3.1. Refining individual skills

Refining a basic skill means altering the functions Cs, ts, and es with respect to some external feedback. The actual mechanism used for this adaptation


[Fig. 7. Geometrical map of PRIAMOS' operating environment with connectivity graph and indications of associated elementary action skills (e.g., DOOR PASSING, DOCKING, TURN).]

[Fig. 8. Hierarchical refinement of elementary robot skills.]

depends on the representation of the functions as well as on the information contained in the feedback [15].

The minimum feedback that is assumed to be available in the context of learning from human demonstrations is an evaluation of the performance of the robot after the application of an individual skill. The task is therefore to refine a continuous real-valued function on the basis of a delayed reinforcement signal. Gullapalli's approach based on stochastic real-valued units (SRV units [11,12]) provides a suitable starting point for solving this task, which can formally be described as follows.


Given: An initial skill represented by functions Cs, ts, and es, a model rs that acts as a critic, and an external feedback source providing a scalar evaluation r of the effect of the skill application.

Determine: New functions Csⁿ, tsⁿ and esⁿ whose application results in a better, if possible optimal, evaluation, and a new model rsⁿ that takes the changes introduced by the adaptation into account.

Since we choose RBF networks for the representation of basic mobility skills, the actual refinement of the control function Cs associated to a skill can take place as follows.

Extension of Cs towards new situations: The local representation employed in RBF networks allows for detecting situations that were not encountered so far, i.e., if

∀i ∈ {1, ..., n}: ri(x) < δs,  0 < δs < 1,

it can be concluded that x is a new situation. If it is desired to extend the network to cover this situation, a new cluster n + 1 is generated. The action u (i.e., the weights wn+1,j) to be associated to this cluster can be requested from the user or cloned from the action associated to the cluster that is nearest to the new one. In any case, the width σ of the new cluster is to be initialized such that it does not affect already existing clusters. If k is the index of the nearest old cluster (k ∈ {1, ..., n}), and a threshold δt controlling the maximum overlap between clusters as well as Gaussian transfer functions ri are given, σ can be initialized as

σ = ‖x − μk‖ / √(−ln δt).

Adaptation of the action u associated to a known situation: The typical action to be undertaken in skill refinement is the adaptation of the action u calculated by applying the function Cs (i.e., by evaluating the network representing Cs) to the given situation x. Assuming that the new action un obtained a feedback signal rn, whereas the original action u resulted in an evaluation r with r < rn, the network representing Cs is adapted on-line using (uj − unj) as error and η = η0 sgn(rn − r) as learning rate.

The only kind of information that can be expected from the user during the skill application and refinement process is an evaluation after the termination of the skill execution. 2 To perform the actual adaptation, we use the following rules:
(1) If the user aborted the operation, he/she is asked if the current state is a goal state or an error state. The corresponding function (i.e., es resp. ts) is updated.
(2) Otherwise, the user is asked if the criterion that fired did so correctly. If this is the case, the control function Cs is updated; otherwise, the criterion is updated. In the latter case, an additional feedback can be given by the user that can be used for updating the control function Cs.
(3) If the control function is updated, the critic rs is updated, too.

If the control function must be updated, the contribution of any particular action u(t) to the obtained evaluation must be determined. To solve this temporal credit assignment problem, usually exponentially discounted rewards are used. However, our approach has been to use a discounted adaptation rate.

For updating the region list, we simply include or exclude the final state x. If a new goal or error state must be learned, a corresponding region is inserted. Otherwise, a new unlabeled region covering the current state is inserted, thereby excluding the region from the error or the termination criterion.

To gradually improve the performance of the robot while applying a specific skill, i.e., to increase the quality of the function Cs w.r.t. given evaluation criteria, it is necessary to explore the action space, i.e., to systematically alter the action produced by the network representing Cs in order to find "better" actions. Obviously, the amount of exploration should be guided by the confidence that exists with respect to the quality of the original action the network produced.

Following Gullapalli [12], the output u of the action network is altered by means of a normal distribution Ψ(u, σ) with the standard deviation σ being a function of the difference between the maximum achievable reinforcement signal rmax and the predicted one r (see Fig. 9). Additionally, the exploration is guided by the parameter γr, the exploration width. If the skill s has been built from human performance data ((y(0), u(0)), ..., (y(T), u(T))), γr can be chosen for

2 Termination means that either a robot-specific error occurs, the goal state known from the demonstration is reached, or the user stops the robot.


[Fig. 9. Stochastic action selection following the SRV approach: the executed action is drawn from Ψ(Cs(x), σ) with σ = γr(rmax − r).]

each component uk of the action vector independently according to

γr = max i,j∈{0,...,T}, i≠j |uk(i) − uk(j)|.
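In code, the stochastic action selection of Fig. 9 reduces to a perturbation of the network output; a hedged NumPy sketch (exploration_width and explore_action are assumed names):

```python
import numpy as np

rng = np.random.default_rng()

def exploration_width(u_demo):
    """gamma_r per action component k: max_{i != j} |u_k(i) - u_k(j)|,
    i.e. the range of the demonstrated actions (shape: (T, action_dim))."""
    u_demo = np.asarray(u_demo)
    return u_demo.max(axis=0) - u_demo.min(axis=0)

def explore_action(u, r_pred, gamma_r, r_max=1.0):
    """Draw u' ~ N(u, sigma) with sigma = gamma_r * (r_max - r_pred):
    the better the predicted reward, the smaller the exploration noise."""
    sigma = np.maximum(gamma_r * (r_max - r_pred), 0.0)
    return rng.normal(loc=u, scale=sigma)
```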

Summarizing, the skill application, evaluation, and refinement process appears as follows:
(1) For the current situation x(t), evaluate the network representing Cs.
(2) If the network's activation is low and adaptation is desired, generate a new cluster covering x(t). If the user does not provide an appropriate action, assign the action associated to the closest old cluster to this new cluster.
(3) Else if no adaptation is desired, stop the skill application.
(4) Evaluate the network representing rs on the situation x(t).
(5) If the model network's activation is low, generate a new cluster covering x(t) and assign 0 as prediction r to this cluster.
(6) Alter the action u(t) based on the prediction r according to the procedure shown in Fig. 9.
(7) Store the situation-action pair (x(t), u(t)).
(8) If the goal situation has not been reached and the maximum number of actions has not been exceeded, go back to (1).
(9) Obtain the final reinforcement signal rfinal (rfinal = 1 if the goal has been reached, −1 otherwise).
(10) Obtain a general evaluation of the skill application rstart. If such an evaluation exists, calculate γ such that γ^T rfinal = rstart. Otherwise, choose a fixed γ ∈ [0, 1].
(11) For each situation-action pair (x(t), u(t)), t ∈ {0, ..., T}, calculate rn(t) as rn(t) = γ^(T−t) rfinal.
(12) For each situation-action pair (x(t), u(t)), t ∈ {0, ..., T}, if rn(t) > r(t), update Cs.
(13) For each situation-action pair (x(t), u(t)), t ∈ {0, ..., T} and the obtained reinforcement rn(t), update rs.
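Put together, the thirteen steps form one episode of the loop below, using explore_action from the previous sketch. This is a condensed, hypothetical rendering: C_s and r_s stand for the two RBF networks and are assumed to expose predict, is_novel, add_cluster, nearest_action, and update; env and user wrap the robot and the human evaluation.

```python
def refinement_episode(C_s, r_s, env, user, gamma_r, max_steps=200):
    trace, x = [], env.state()
    for _ in range(max_steps):                       # steps (1)-(8)
        if C_s.is_novel(x):                          # (2): extend towards x
            u0 = user.suggest(x)                     # may be None
            C_s.add_cluster(x, u0 if u0 is not None else C_s.nearest_action(x))
        if r_s.is_novel(x):                          # (5): prediction r = 0
            r_s.add_cluster(x, 0.0)
        r_pred = r_s.predict(x)
        u = explore_action(C_s.predict(x), r_pred, gamma_r)   # (6)
        trace.append((x, u, r_pred))                 # (7)
        x = env.execute(u)
        if env.goal_reached(x) or env.error(x):      # (8): termination
            break
    r_final = 1.0 if env.goal_reached(x) else -1.0   # (9)
    r_start = user.evaluate()                        # (10): overall evaluation
    T = len(trace)
    gamma = ((r_start / r_final) ** (1.0 / T)        # gamma^T r_final = r_start
             if r_start is not None and r_start * r_final > 0 else 0.9)
    for t, (x_t, u_t, r_pred) in enumerate(trace):
        r_n = (gamma ** (T - t)) * r_final           # (11): discounted credit
        if r_n > r_pred:                             # (12): keep only actions
            C_s.update(x_t, u_t)                     #      that improve matters
        r_s.update(x_t, r_n)                         # (13): refine the critic
```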

3.1.1. Example

One example of an elementary mobility skill that was learned and refined is the docking skill (Fig. 10). Here, two user demonstrations resulted in 75 and 84 examples of the form (24 ultrasonic distance measurements, translational offset Δx, translational offset Δy, rotational offset Δθ).

[Fig. 10. Docking.]

Based on the demonstration data, 14 of the 24 ultrasonic sensors were identified as relevant, and a training file of 122 samples was generated. From the training data, a network consisting of 89 clusters was built to represent Cs, whereas the network representing rs consisted of 35 clusters.

Table 1 shows the results obtained from applying the initially learned skill. Most notable is the relatively large number of neurons inserted during the first trial. At the end of this trial, the termination condition was not correctly identified, such that the termination criterion had to be extended.

3.2. Refinement of coordination skills

If global context information is available that can be related to the execution of a specific elementary operation, it makes sense to look for the best out of a set of similar skills to be employed in that context. The task on this level can therefore be stated as:


Table 1
Skill application and refinement

Function   T1   T2   T3   T4
Cs         11    3    0    0
rs         13    4    0    0
ts          1    0    0    0
Success     0    1    1    1

Neurons inserted in the networks representing Cs and rs, and new termination conditions ts generated, during four subsequent skill applications in a known environment, but starting from different locations. The goal was reached in all of the trials.

Given: A context C requiring the execution of an elementary skill s of a specific class.

Determine: The skill s that maximizes a given evaluation criterion related to C.

If only the immediate evaluation r of the application of skill s in context C is of interest, determining the best skill to be selected becomes a matter of predicting this evaluation via QΣ(C, s). Each QΣ(C, s) is a task-dependent value that estimates the appropriateness of applying skill s in context C for fulfilling task Σ. Hence, for the task-dependent coordination skill c the control function Cc is given by the rule applied to select s, given QΣ(C, s); the evaluation function rc is given by the QΣ; and the set of valid contexts (i.e., the error criterion) is encoded implicitly as the set of all contexts C for which values QΣ(C, s) exist.

The appropriate learning rule to be applied at the end of each skill application is simply

QΣ(C, s)t+1 = (1 − α) QΣ(C, s)t + α r,

where α is the learning rate. If the accumulated feedback must be considered, Watkins' Q-Learning [35] is an appropriate technique, and the learning law becomes

QΣ(C, s)t+1 = (1 − α) QΣ(C, s)t + α (r + γ QΣmax(C*)),

where C* is the active context after s has been successfully applied (e.g., the next edge in the connectivity graph). QΣmax(C*) is defined as

QΣmax(C*) = max s∈Skills QΣ(C*, s).

In both cases, finding the best skill to be applied requires exploration, i.e., systematically changing the skill selection. This exploration can be guided through background knowledge that might be available on the context level. If no sufficient background knowledge exists, the standard technique associated with Q-Learning, i.e., selecting the skill s by means of a Boltzmann distribution

P(e = s | C) = exp(QΣ(C, s)/T) / Σ b∈Skills exp(QΣ(C, b)/T),

in combination with a random initialization of the Q values, is applied. Here, T is a temperature parameter.

3.2.1. Example

To assess the convergence of Q-Learning in the context of optimal skill selection, several experiments were undertaken in simulation. For the results shown in Fig. 11, the task was to find the best out of a varying number of skills. The Q value for each skill was initialized randomly. The reinforcement signal r was selected randomly from [0.9, 1] if the optimal skill was executed, and set to a random value in [0, 1] otherwise, in order to simulate the inconsistency in evaluations obtained under real-world conditions. Obviously, the number of trials needed to find the optimal action (here, achieving P(s = sopt|C) > 0.9 was the convergence criterion) depended on both the learning rate and the number of alternative actions that were available.

4. Integration

To be able to perform efficient learning on multiple interdependent levels of abstraction, it is necessary to define a preference criterion (bias) that determines on which level exploration and learning should take place. Based on the assumption that the available skills are already operational (i.e., they can be applied, but are not optimal), the complete learning procedure is designed top-down. It starts with choosing the best elementary skill and performs skill-level adaptation afterwards.

4.1. Synchronizing refinement on the coordination and execution level

If rC,s denotes the evaluation obtained for applying a skill s in a specific context C, and rD,s is the skill-


[Fig. 11. Number of trials (averaged over 100 runs per data point, T = 50) to find the optimal action: left, comparison between different values of α for 5 and 9 available actions; right, influence of the number of available actions (α = 1.0).]

specific evaluation obtained in a different global context D, the learning control procedure can be designed as follows:
(1) For the current context C, find the skill yielding the highest reward rC,s.
(2) If rC,s is sufficiently high, do not perform any adaptation and proceed with (6).
(3) If rC,s is too low but rD,s is sufficiently high in a different global context D (i.e., the skill has already been applied successfully elsewhere), clone the skill and proceed with (5) using the clone.
(4) If rC,s is too low and low or unknown with respect to other global contexts, continue with (5).
(5) Permanently 3 adapt the skill s using the methods described in Section 3.1.
(6) Assign the skill s to the current context C.

On all levels, using background knowledge for guiding the exploration and the adaptation will greatly influence the speed of convergence. In that sense, the preliminary results presented can be considered as a worst-case indication of the effort needed for refinement, since they assume nothing but the existence of a feedback signal at the end of the application of an individual skill. Usually, additional information (like prototype actions, more informative evaluation

3 Obviously, an individual skill might be realized by means of an adaptive controller. This controller can be used every time the skill is applied, e.g., for suppressing oscillations. However, the changes made through this adaptation mechanism are not permanent, since every time the skill is invoked, control will start using the same setting of parameters.

functions, and additional attributes of skills that allow for a better initialization of the Q values) is available. For a realization of the complete method on a real robot, as it is being planned for PRIAMOS, this knowledge must be exploited.
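The six steps can be phrased as a single control routine. The sketch below is hypothetical: q_table is assumed to hold the rC,s values keyed by (context, skill name), and clone and refine_skill stand for skill duplication and the Section 3.1 refinement, respectively.

```python
def learning_control_step(C, skills, q_table, refine_skill, threshold=0.8):
    # (1) find the skill with the highest reward r_{C,s} in context C
    s = max(skills, key=lambda sk: q_table.get((C, sk.name), 0.0))
    r = q_table.get((C, s.name), 0.0)
    if r >= threshold:                       # (2) good enough: no adaptation
        return s                             # (6) assign s to context C
    # (3) successful elsewhere? then adapt a clone, keep the original intact
    elsewhere = any(value >= threshold
                    for (ctx, name), value in q_table.items()
                    if name == s.name and ctx != C)
    if elsewhere:
        s = s.clone()
    # (4)/(5) permanently adapt the (possibly cloned) skill, cf. Section 3.1
    refine_skill(s, C)
    return s                                 # (6) assign s to context C
```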

4.2. Extensions

Two additional aspects make the proposed method interesting in the context of robot systems that are designed for long-term operation in partially unknown environments, but which may communicate with humans. Firstly, if an unsatisfactory performance of the robot is detected that cannot be enhanced by adaptation, a new elementary skill has to be generated, and the generation of this skill can be triggered by the robot itself. Especially if the robot can communicate with a human supervisor, it can explicitly ask for an example of the required skill.

Given a set of elementary skills of a certain class, it is also possible to exploit the relationship between state transitions and actions in a different way. If only the class of the elementary skill to be applied is available but the parameters describing the global context are unknown, assessing the performance of a particular skill allows identifying these parameters, i.e., by systematically evaluating the performance of specific skills, it becomes possible to obtain world knowledge through experimentation. Imagine, for instance, a robot manipulator equipped with a grinding tool.


The robot has to polish different workpieces in its environment, all of which have been modeled geometrically only. For the actual grinding operation, the robot has to provide the ability to control the force it applies to each workpiece. Assume that the robot is not equipped with an adaptive force controller; instead, a set of controllers (elementary skills) optimized for different materials exists. By finding for each workpiece the CONTROL-FORCE skill that results in the best evaluation, the robot is able to classify the pieces with respect to the material, without having to analytically determine material stiffness or other parameters.

5. Conclusion and further work

If robots are to be employed efficiently in environments that are only partially known, they must exploit their experience. They must be able to adapt to their environment and to the specific conditions of use as they are given by the application and the user. In such a scenario, the need for learning becomes evident. A learning robot can relieve the robot designer from cumbersome programming tasks, and it can help the robot user customize the robot for his or her specific needs.

An indispensable requirement for efficient operation is the intelligent use and refinement of basic robot skills. Throughout the paper, a novel hierarchical approach to learning efficient skill application while using mechanisms of skill refinement has been proposed. Several preliminary results have shown the appropriateness of the individual techniques involved in the approach.

The realization of the complete learning control loop on PRIAMOS as well as on a Puma 260 manipulator is a topic of current work and requires substantial effort. Here, the first step is the identification and acquisition of task-independent background knowledge that supports the refinement, as well as an analysis of the minimum available feedback that can be expected on each of the levels of adaptation. Subsequently, the benefits obtained through multi-level adaptive behavior on a real robot must be evaluated quantitatively. This task is expected to become extremely difficult, since the effort involved in performing just a single experiment is very high, and appropriate criteria for evaluating complex learning systems are still to be found [18].

Acknowledgements

This work has been funded by the ESPRIT Project 7274 "B-Learn II". It has been performed at the Institute for Real-Time Computer Systems and Robotics, Prof. Dr.-Ing. U. Rembold and Prof. Dr.-Ing. R. Dillmann, Department of Computer Science, University of Karlsruhe, Germany. The authors would like to thank Marnix Nuttin, Attilio Giordana, and Volker Klingspor for their support.

References

[1] M. Accame and F.G.B. De Natale, A fast and easy way for teaching an ANN to extract edge points from an image, Proc. 4th European Workshop on Learning Robots, Karlsruhe, Germany (1995).

[2] H. Asada and S. Liu, Transfer of human skills to neural net robot controllers, Proc. IEEE Int. Conf. on Robotics and Automation (1991).

[3] C. Baroglio, A. Giordana, M. Kaiser, M. Nuttin and R. Piola, Learning controllers for industrial robots, Machine Learning (1996).

[4] A.G. Barto, R.S. Sutton and C.W. Anderson, Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics (1983) 835-846.

[5] H.R. Berenji, A reinforcement learning based architecture for fuzzy logic control, International Journal of Approximate Reasoning 6 (1992) 267-292.

[6] M.R. Berthold, A time delay radial basis function network for phoneme recognition, IEEE Int. Conf. on Neural Networks, Orlando, FL (1994) 4470-4472.

[7] R. Dillmann, M. Kaiser and A. Ude, Acquisition of elementary robot skills from human demonstration, Int. Symp. on Intelligent Robotics Systems, Pisa, Italy (1995).

[8] R. Dillmann, M. Kaiser, F. Wallner and P. Weckesser, PRIAMOS: An advanced mobile system for service, inspection, and surveillance tasks, in: T. Kanade, H. Bunke and H. Noltemeier, eds., Modelling and Planning for Sensor Based Intelligent Robot Systems (World Scientific, Singapore, 1995).

[9] A. Elfes, Using occupancy grids for mobile robot perception and navigation, IEEE Computer (June 1989) 46-57.

[10] H. Friedrich, S. Münch, R. Dillmann, S. Bocionek and M. Sassin, Robot programming by demonstration: Supporting the induction by human interaction, Machine Learning (1996).

[11] V. Gullapalli, A stochastic reinforcement learning algorithm for learning real-valued functions, Neural Networks 3 (1990) 671-692.

[12] V. Gullapalli, J.A. Franklin and H. Benbrahim, Acquiring robot skills via reinforcement learning, IEEE Control Systems Magazine 14 (1) (1994) 13-24.


[13] R. Heise, Demonstration instead of programming: Focussing attention in robot task acquisition, Research Report no. 89/360/22, Department of Computer Science, University of Calgary, 1989.

[14] S. Hirai, H. Noguchi and K. Iwata, Transplantation of human skillful motion to manipulators in insertion of deformable tubes, IEEE Int. Conf. on Robotics and Automation, Nagoya, Japan (1995) 1900-1905.

[15] M. Kaiser, M. Deck, A. Retey, K. Berns and W. Ilg, Using neural networks for real-world adaptive control, in: Neural Networks: Producing Dependable Systems (Solihull, UK, 1995).

[16] M. Kaiser and R. Dillmann, Building elementary robot skills from human demonstration, IEEE Int. Conf. Robotics and Automation, Minneapolis, MN (1996).

[17] M. Kaiser, H. Friedrich and R. Dillmann, Obtaining good performance from a bad teacher, Int. Conf. on Machine Learning, Workshop on Programming by Demonstration, Tahoe City (1995).

[18] M. Kaiser, V. Klingspor, J. del R. Millán, M. Accame, F. Wallner and R. Dillmann, Using machine learning techniques in real-world mobile robots, IEEE Expert (1995).

[19] M. Kaiser, V. Klingspor, K. Morik, A. Rieger, M. Accame and J. del R. Millán, B-Learn II-D 405: Learning Techniques for Mobile Systems, B-Learn II - ESPRIT BRA Project No. 7274 (1995).

[20] M. Kaiser, A. Retey and R. Dillmann, Robot skill acquisition via human demonstration, Proc. Int. Conf. on Advanced Robotics (1995).

[21] V. Klingspor and K. Morik, Towards concept formation grounded on perception and action of a mobile robot, Proc. 4th Int. Conf. on Intelligent Autonomous Systems, Karlsruhe (1995).

[22] Y. Kuniyoshi, M. Inaba and H. Inoue, Learning by watching: Extracting reusable task knowledge from visual observation of human performance, IEEE Transactions on Robotics and Automation 10 (6) (1994) 799-822.

[23] J.J. Leonard and H.F. Durrant-Whyte, Directed Sonar Sensing for Mobile Robot Navigation (Kluwer Academic Publishers, Dordrecht, 1992).

[24] J. del R. Millán, Learning efficient reactive behavioral sequences from basic reflexes in a goal-directed autonomous robot, Proc. 3rd Int. Conf. on Simulation of Adaptive Behavior (1994).

[25] J. Moody and C. Darken, Learning with localized receptive fields, in: T. Sejnowski, D. Touretzky and G. Hinton, eds., Proc. Connectionist Models Summer School, Carnegie Mellon University (1988).

[26] M.T. Musavi, W. Ahmed, K.H. Chan, K.B. Faris and D.M. Hummels, On the training of radial basis function classifiers, Neural Networks 5 (1992) 595-603.

[27] K.S. Narendra and S. Mukhopadhyay, Adaptive control of nonlinear multivariable systems using neural networks, Neural Networks 7 (1994).

[28] A. Okabe, B. Boots and K. Sugihara, Spatial Tessellations: Concepts and Applications of Voronoi Diagrams (Wiley, New York, 1992).

[29] J. Park and I.W. Sandberg, Approximation and radial-basis- function networks, Neural Computation 5 (1993).

[30] D.A. Pomerleau, Efficient training of artificial neural networks for autonomous navigation, Neural Computation 3 (1991) 88-97.

[31] P. Reignier, V. Hansen and J.L. Crowley, Incremental supervised learning for mobile robot reactive control, in: Intelligent Autonomous Systems 4 (IOS Press, 1995) 287-294.

[32] K.-T. Song and J.-C. Tai, Fuzzy navigation of a mobile robot, Proc. IEEE/RSJ Conf. on Intelligent Robots and Systems, Raleigh, NC (1992).

[33] R.S. Sutton, A.G. Barto and R.J. Williams, Reinforcement learning is direct adaptive control, IEEE Control Systems Magazine (April 1992) 19-22.

[34] F. Wallner, M. Kaiser, H. Friedrich and R. Dillmann, A multilevel learning approach to mobile robot path planning, in: V. Graefe, ed., Intelligent Robots and Systems (Elsevier, Amsterdam, 1995).

[35] C.J.C.H. Watkins, Learning from delayed rewards, Ph.D. Thesis, University of Cambridge, 1989.

Rüdiger Dillmann is a university professor in computer science and robotics at the University of Karlsruhe, where he is the Director of the CAD/CAM/Robotics group at the Institute for Real-Time Computer Systems and Robotics. He is also Head of the Interactive Techniques for Planning research group of the FZI Karlsruhe. He received the Dr.-Ing. degree from the University of Karlsruhe. He is a member of the German Society of

Computer Science (GI) and IEEE and has published more than 200 papers in the fields of mobile robotics, robot learning, and robot programming by demonstration.

Michael Kaiser currently works as a research associate at the Institute for Real-Time Computer Systems and Robotics of the University of Karlsruhe. His present research is concerned with robot learning, human-robot interaction, and robot programming by demonstration. He received a Dipl.-Inform. degree from the University of Karlsruhe, where he studied computer science and biomedical engineering. He is a member of

IEEE and has published more than 20 papers on robot learning and neural adaptive control.