
Robotics and Autonomous Systems 19 (1997) 259-271

Hierarchical refinement of skills and skill application for autonomous robots

Michael Kaiser*, Rüdiger Dillmann

University of Karlsruhe, Institute for Real-Time Computer Systems and Robotics, D-76128 Karlsruhe, Germany

Abstract

One of the major goals in designing learning robots is to let these robots develop useful skills over time. These skills are not only related to physical actions of the robot, but also to the coordination of activities, communication with humans, and active sensing. Throughout this paper, the interdependency between these different kinds of skills is analyzed. For the case of elementary action skills and coordination skills, methods for integration of skill application and refinement are developed. It is shown that this integration has the potential to support long-term learning and autonomous experimentation.

1. Introduction

The complexity of tasks that autonomous manipulation robots as well as autonomous mobile platforms have to solve is usually tackled through task decomposition. Here, two characteristic modes of operation can always be distinguished:

Model-based operation, including path planning or assembly planning and execution on the basis of an a priori given and possibly continuously refined geometrical world model. This also includes mission planning, which might be based on a given model of the world's "semantics".

Reactive operation such as collision avoidance or compliant motion that involves a direct coupling between the robot's sensors and its actuators. The next action of the robot is determined by the current sensor readings, possibly their history, and the current goal. These operations will from now on be referred to as the basic or elementary action skills of the robot. They are complemented by basic sensing skills that

* Corresponding author. E-mail: [email protected].

allow for a targeted use of the robot's sensors in the framework of plan and mission execution.

As modeling and planning only make sense down to a certain level of abstraction, both modes of operation are usually combined. Then, elementary skills represent the interface between the planning and the control level in the robot's architecture. They also determine the basic operators available for planning: Only if the robot is able to associate a symbolic operator with a sequence of actions that are possibly dependent on its perceptions, i.e., only if the robot can operationalize the operator by applying a particular skill, does using this operator on the planning level make sense [18]. In addition, coordination skills are required to foster the efficient use of the available elementary sensing and action skills.

Realizing elementary sensing and action skills requires mapping perceptions to actions by means of a strategy that is goal-oriented. Several possibilities to encode such a strategy exist (Fig. 1). The "traditional" approach is the model-based one, which tries to determine the skill's application conditions a priori and to



[Fig. 1. Approaches to the design of elementary action and sensing skills: a skill specification (exact knowledge, approximate knowledge, examples, or an evaluation function) leads to a prototype of the robot skill (program/trajectory), which is then subject to application and refinement.]

systematically design a skill based on this identification. Since exact models are seldom feasible, usually qualitative models are in use, resulting, for example, in probabilistic approaches as in the case of occupancy grids [9], or in the use of fuzzy systems [32]. However, this explicit coding of the required strategy is not an easy task. Specifically, it cannot be assumed that users of future service robots will be able to perform this kind of low-level robot programming.

What can be expected from users of such robots is their ability to demonstrate strategies related to skills (e.g., by manually operating the robot), and to evaluate the strategy used by the robot, i.e., to evaluate the robot's performance. Approaches to learning based on human-generated examples usually appear under the headings of skill acquisition from human demonstration and behavior cloning, respectively. In robotics, these approaches are mostly related to manipulation skills [2,14,16]. The acquisition of basic mobility skills from human demonstration has, for instance, been described by Pomerleau [30] and Reignier [31]. However, to allow a human user to program the robot via demonstration, the human's actions must be correctly understood by the robot. Also, feedback to the human teacher must be given in an appropriate manner. Both requirements add up to the need for communication skills.

The refinement of elementary skills is a problem whose solution can be approached by means of conventional adaptive control techniques as well as through reinforcement learning [4,24,33] and in the context of adaptive neuro- and fuzzy-control [5,27].

Refinement, however, is not always an appropriate means to deal with an unsatisfactory performance of the robot. Sometimes, a new elementary skill must be generated in order to achieve a certain goal, or the conditions triggering the use of a specific elementary skill (i.e., the coordination skill) must be changed. If several skills for achieving a subgoal are available (e.g., several collision avoidance techniques based on different kinds of sensors), the most suitable one should be selected for a particular environment. If incomplete or even incorrect knowledge about the robot and the task is assumed, finding the best skill is also a problem of adaptation.

This paper presents a hierarchical approach to skill learning that includes skill refinement as well as the adaptation of skill activation conditions and the identification of a requirement to generate new skills. The use of skills to discretize the robot's action space allows for the solution of complex tasks. The specific problems that are tackled are:
• the initial acquisition and the refinement of elementary skills;
• the construction of a symbolic representation that allows for assigning skills to states that are defined via environmental conditions or mission-related conditions; and
• the refinement of the state transition → skill mapping.

Individual solutions to these problems are based on a precise definition of the term skill, as well as on a distinction of the several types of skills that an autonomous system must provide. The paper is therefore organized as follows: First, the notion of skill is clarified and skill models are developed. Then, representations for both skills and environmental states are developed and related to each other. Finally, solutions to the problem of multilevel skill refinement are presented, and experimental results are given.

2. Skills and skill models

"Skill" denotes the learned power of doing a thing completently. From a system's theoretic view-point, this means that for a given state x(t), the skilled system (the robot) should perform a competent action u(t): Such a competent action is an action that - possibly in the long term - contributes to achieving a goal.


[Fig. 2. The role of Cs, rs, es and ts during the application of an elementary action skill: the error criterion es and the termination criterion ts monitor the robot state x(t), signaling an error in skill execution and successful skill application, respectively. While es = 0 and ts = 0, the control function Cs produces the action, and the evaluation function rs computes the reward r(t).]

In the context of robot programming and control, the following types of skills can be identified:

Action skills are skills that relate goals to physical actions (e.g., motions) of the robot. Mostly, they rely on the robot's perceptions to produce a competent action.

Sensing skills are required for active sensing and involve a goal-oriented parameterization and use of the available sensors.

Communication skills are required for communicating information from the robot to the environment and vice versa. For example, the ability to understand a human demonstration is a communication skill, as is the construction of a natural language description of the robot's perceptions.

Coordination skills are responsible for defining appropriate subgoals that allow for efficient application of action, sensing, and communication skills.

The focus of this paper is on the acquisition and refinement of action and coordination skills. The efficient application of sensing skills has, for example, been described in [23,34]; their acquisition is for instance treated in [1]. Communication skills are mostly considered within the context of Robot Programming by Demonstration (RPD [13,22]), as, for example, in [21].

2.1. The local skill model

For a given state x(t), a skilled system (the robot) should perform a goal-oriented action u(t). The action performed should be the result of a competent decision, i.e., it should be optimal with respect to an evaluation criterion (a reward) r(x(t)) that is related to the goal to be achieved. Essentially, a skill s is therefore given through a control function

u(t) = Cs(x(t)),    (1)

that implicitly encodes the goal associated to the skill and produces in each state x(t) a competent action u(t), and a reward function rs(x(t)) ∈ [rmin, rmax] that evaluates the state x(t) w.r.t. the goal. To allow for execution monitoring, an error criterion es(x(t)) is also required.

If the skill application involves moving the system along a trajectory (x*(0), ..., x*(T)) in the state space, a termination criterion ts(x(t)) ∈ {0, 1} must also be present (see Fig. 2). The local model of the skill s is therefore given as a quadruple of functions, s = (Cs, rs, es, ts).
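For illustration, the quadruple translates directly into a data structure. The following is a minimal sketch in hypothetical Python (the paper prescribes no implementation); the names LocalSkill, apply_skill, read_state, and execute are assumptions chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

State = Sequence[float]    # x(t), the robot state
Action = Sequence[float]   # u(t), a directly executable action

@dataclass
class LocalSkill:
    """Local skill model s = (Cs, rs, es, ts)."""
    control: Callable[[State], Action]   # Cs: produces a competent action u(t)
    reward: Callable[[State], float]     # rs: evaluates x(t) w.r.t. the goal
    error: Callable[[State], bool]       # es: execution-monitoring criterion
    terminate: Callable[[State], bool]   # ts: fires when the goal state is reached

def apply_skill(skill: LocalSkill, read_state: Callable[[], State],
                execute: Callable[[Action], None], max_steps: int = 1000) -> str:
    """Run the skill's control loop until es or ts fires (cf. Fig. 2)."""
    for _ in range(max_steps):
        x = read_state()
        if skill.error(x):         # es(x(t)) = 1: error in skill execution
            return "error"
        if skill.terminate(x):     # ts(x(t)) = 1: skill successfully applied
            return "done"
        execute(skill.control(x))  # u(t) = Cs(x(t))
    return "timeout"
```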

In the case of elementary action skills and elementary sensing skills, the state x is represented as a sequence of sensorial inputs y, i.e., x(t) = (y(t − d), ..., y(t − d − p)), d, p ≥ 0, while the action u(t) is directly executable by the robot (e.g., as a primitive motion or a one-shot parameterization of a sensor) without requiring any further processing on a level above or equal to the skill level. The skill is therefore operational with respect to the robot. For coordination skills, the state describes the robot's environment on an application-dependent abstract level, whereas the action produced at any instant is the decision to activate


or deactivate a particular elementary skill. In the case of communication skills, the function Cs maps components of the robot's internal state vector or a state transition to a set or sequence of human-understandable symbols. Alternatively, it may map human-generated symbols of different modalities (e.g., speech or gestural input) to a robot-internal state vector or a state transition. In all cases, the error criterion is given via boundary conditions on the state variables. If present, the termination criterion is defined in a similar manner for the goal state.

2.2. Extensions to the local skill model

The local skill model considers only those components of a skill that are relevant for skill execution (Fig. 2). However, these components do not support a targeted selection and activation of skills.

To make efficient use of an elementary action or sensing skill, it is necessary to characterize appropriate applicability conditions. In addition, the effect of the skill application must be described. The global skill model therefore comprises not only the functions Cs, rs, es, and ts, but also a predicate As that determines the applicability of s and a function Es that describes the changes in the state variables induced by applying s.

Both As and Es are required to operate on those representations that are used by the existing skill selection and coordination mechanisms, i.e., to use the appropriate symbols. Consequently, the extension of the local skill model results in a hybrid representation of the skill s (Fig. 3). 1
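Continuing the sketch from Section 2.1, the extension is small: the symbolic components As and Es are attached to the local model. Again hypothetical Python; GlobalSkill, candidate_skills, and the dict-based context are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GlobalSkill(LocalSkill):
    """Hybrid representation: the local functions plus the symbolic As and Es."""
    applicable: Callable[[dict], bool]  # As: predicate over the symbolic context
    effect: Callable[[dict], dict]      # Es: predicted change of the context

def candidate_skills(context: dict, skills: list) -> list:
    """A coordination mechanism consults As before considering a skill at all."""
    return [s for s in skills if s.applicable(context)]
```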

2.3. Skill acquisition and representation

To realize an elementary action or sensing skill, the actual design (acquisition) procedure must be chosen (Fig. 1). Depending on this procedure, appropriate representation formalisms for each skill component must be selected.

2.3.1. Skill acquisition from human demonstration

One of the most promising approaches to the acquisition of elementary action and sensing skills

1 It should be noted that As and Es of elementary skills are mirrored in Cs of coordination skills.

[Fig. 3. Hybrid skill representation by the components of the extended skill model: a supervising and coordinating entity (user or coordination mechanism) interacts with the applicability conditions As and application postconditions Es on the symbolic level, and with the control function, evaluation function, and criteria functions on the execution level via application, error, evaluation, and termination signals.]

is to build these skills from human demonstrations [1,16,31]. However, two facts must always be considered:
(1) Human-generated examples alone are not enough for a robot to learn a skill sufficiently well [17].
(2) Skill refinement based only on a scalar evaluation of the robot's performance, as it is necessary in the context of skill acquisition from human demonstration, can only be successfully performed if the refinement procedure is provided with hints about a possible strategy. These hints can also be obtained from user demonstrations [16].

Within the general setting of learning from examples and based on these observations, several requirements emerge that must be met by the methods employed to approximate the control function Cs and the evaluation function rs:
(1) It must be possible to construct the function approximator from the human performance data.
(2) It must be possible to incrementally train the function approximator.
(3) It must be possible to analyze the function approximator on-line to determine if its structure is appropriate to approximate the target function sufficiently well.


As far as the selected representation techniques have a direct impact on the refinement procedures, these techniques are now discussed. For a detailed description of the several phases of the skill acquisition process, see [7,10,16,20].

2.3.2. Representation of control function and evaluation function

One possibility to meet the given requirements on the level of elementary skills is to employ Radial-Basis Function networks (RBF networks) for approximating Cs and rs. The basic principle realized in RBF networks is that of localized receptive fields. An RBF network consists of an input layer holding the current input values, a hidden layer representing the number of clusters in the input space, and an output layer integrating the degrees of membership calculated by the hidden layer units. This calculation is performed on the basis of two cluster-specific parameters, its center μ and its width σ. The hidden neuron representing a particular cluster gives the highest output when the input it receives is close to its center. The exact value of the output is determined by the neuron's transfer function (mostly a multidimensional Gaussian function or a function representing a hypersphere or a hypercube), and the specific measure of distance, which is usually the Euclidean. An extension of RBF networks that is particularly interesting for control applications is the introduction of time-delays that has been proposed by Berthold [6] (Fig. 4).

[Fig. 4. Structure of a Radial-Basis Function network with time-delays in the hidden layer.]

The output y (i.e., the action u produced by the network representing Cs or the predicted evaluation r

produced by the network representing rs) calculated by an RBF network for a given input vector x is given as

yj = f(x) = s(Σi wi,j ri(x)),

where s is the transfer function of the output neuron (usually the identity), wi,j is the weight between the output neuron j and the cluster i, and ri(x) is the membership value calculated for cluster i. For example,

ri(x) = exp[−(‖x − μi‖/σi)²]

with center μi and width σi of cluster i. In case of d time-delays in the hidden (cluster) layer, the output neuron calculates the output yj by adding current membership values and their history over the last d time steps, such that

yj(t) = s(Σi Σk=0..d wi,j,k ri(x(t − k))).

These networks are capable of universal approximation over R^n or over compact subsets of R^n, given proper cluster centers μi [29].
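The following sketch implements the forward pass just described, including the time-delayed membership values, together with a gradient-descent update of the output weights (used for training, cf. the construction procedure below). It is a minimal illustration in Python/NumPy under the stated formulas, not the implementation used in the experiments; the class and method names are assumptions.

```python
import numpy as np

class TimeDelayRBF:
    """RBF network with d time-delays in the hidden layer (cf. Fig. 4)."""

    def __init__(self, centers, widths, n_outputs, d):
        self.mu = np.atleast_2d(centers)   # cluster centers mu_i
        self.sigma = np.asarray(widths)    # cluster widths sigma_i
        self.d = d                         # number of time-delays
        # output weights w_{i,j,k}: cluster i, output j, delay k
        self.w = np.zeros((len(self.mu), n_outputs, d + 1))
        self.history = []                  # r(x(t)), r(x(t-1)), ...

    def memberships(self, x):
        """r_i(x) = exp[-(||x - mu_i|| / sigma_i)^2], Gaussian transfer."""
        dist = np.linalg.norm(self.mu - np.asarray(x), axis=1)
        return np.exp(-(dist / self.sigma) ** 2)

    def forward(self, x):
        """y_j(t) = sum_i sum_k w_{i,j,k} r_i(x(t-k)); output transfer s = id."""
        self.history.insert(0, self.memberships(x))
        self.history = self.history[: self.d + 1]
        y = sum(r_k @ self.w[:, :, k] for k, r_k in enumerate(self.history))
        self.last_y = y
        return y

    def train_step(self, target, eta=0.05):
        """Gradient descent on the output weights for the last forward pass."""
        err = np.asarray(target) - self.last_y           # (target_j - y_j)
        for k, r_k in enumerate(self.history):
            self.w[:, :, k] += eta * np.outer(r_k, err)  # dy_j/dw_{i,j,k} = r_i
```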

To construct an RBF network, the number of clusters, their centers, and their widths have to be determined. For solving this task, several algorithms have been proposed [25,26], among those a supervised clustering algorithm specifically designed for constructing networks to approximate continuous functions [3]. This algorithm is based on the idea that the amount of overlap between individual clusters of an RBF network should depend on the similarity of the outputs associated to these clusters. Additionally, it allows determining the amount of generalization done by the network by specifying the maximum network activity in regions of the input space that are not covered by examples (for further details see [3]). Following the construction, the network's output weights wi,j,k are trained using gradient descent.

2.3.3. Representation of termination criterion and error criterion

The criterion functions ts and es are represented as region lists, i.e., ordered lists of labeled hyperintervals (Fig. 5). Two lists originate from human demonstration data: the first initially consisting only of a hyperinterval representing es that describes the encountered state space, and a second one that represents the goal state by a hyperinterval, too.

[Fig. 5. Two-dimensional region list drawn "in reverse order". During evaluation, the brighter regions are considered first.]
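A region list can be kept as an ordered list of hyperintervals scanned front to back; the first region that contains the state decides the criterion's value, matching the "reverse order" evaluation of Fig. 5 and the insertion behavior described in Section 3.1. A small hypothetical sketch (insert_region and evaluate are assumed names):

```python
import numpy as np

# A region is a labeled hyperinterval (lower corner, upper corner, label).
def insert_region(regions, lower, upper, label):
    """Prepend, so that newer regions override older ones during evaluation."""
    regions.insert(0, (np.asarray(lower), np.asarray(upper), label))

def evaluate(regions, x, default=0):
    """Return the label of the first region containing x."""
    x = np.asarray(x)
    for lower, upper, label in regions:
        if np.all(lower <= x) and np.all(x <= upper):
            return label
    return default
```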

2.3.4. Representation of applicability: pre- and post-conditions

In the context of skill acquisition from human demonstration, several methods to represent the pre- and post-conditions associated to an elementary skill are available. Firstly, As can be described as a subset of the set of possible application contexts, while Es is the change in the application context.

For example, the coordination mechanism may represent the robot's environment by means of object-attribute-value pairs (ok : ak, Wk) and relations R = ((o1 : a1, W1), ..., (on : an, Wn)) with Wi, 1 ≤ i ≤ n, being a set of possible values of oi : ai. It cannot be assumed that sufficiently many examples of the form (Cb(s), Ca(s)), with Cb(s) describing the environment before and Ca(s) after the application of skill s, are available. Hence, As and Es can in general not be learned from examples. Therefore, an analytic learning technique that starts from a description of the user's intention is applied to determine As [10]. Es is chosen to be this intention.
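As a toy illustration of this representation, As can be realized as a predicate over object-attribute-value contexts. The helper below is hypothetical and the attribute names are invented for the example:

```python
def make_applicability(admissible: dict):
    """admissible maps 'object:attribute' to the set Wi of values for which
    the skill is applicable; As(C) holds if every attribute lies in its set."""
    def As(context: dict) -> bool:
        return all(context.get(key) in values
                   for key, values in admissible.items())
    return As

# e.g., a docking skill applicable only near an unoccupied docking station:
As_dock = make_applicability({"robot:area": {"hall", "corridor"},
                              "station:state": {"free"}})
assert As_dock({"robot:area": "hall", "station:state": "free"})
```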

In the context of service robots and robot programming by demonstration, another appealing option is to build an explicit representation of the states and state transitions to which elementary skills should be linked (Fig. 6). This kind of representation is especially helpful if the initial assignment of skills to state transitions is to be given by a human user. For mobile robots such as PRIAMOS [8], one possibility to realize this

Fig. 6. Association of an insertion skill to a graphically represented state transition.

kind of representation is given by a connectivity graph (Fig. 7).

To automatically generate such graphs from a geometrical model, the literature often proposes Voronoi diagrams [28]. However, Voronoi-based methods that generate a topological graph based on a two-dimensional representation of the environment, as it is used in PRIAMOS' control system, in general result in a huge set of very short edges. Additionally, these methods rely only on geometrical information and do not take other data into account, e.g., whether a particular segment represents a stationary or a dynamic obstacle. Therefore, the map shown in Fig. 7 has been created using a method based on tests of the environment's structure. This method is described in [19].

3. Skill application and refinement

During application and refinement of elementary skills, two levels of abstraction must be considered (Fig. 8). First, refinement (and, possibly, extension) takes place on the skill level itself, i.e., the functions Cs, ts, and es are altered (or extended) according to a local measure of performance that is modeled by the function rs. Secondly, the strategy responsible for choosing a particular skill in a particular situation, i.e., the coordination skill, has to be adapted to environmental and task-specific conditions. Here, learning to choose the best skill is equivalent to finding the best action to be taken to perform the desired state transition on the global level (the coordination level).

3.1. Refining individual skills

Refining a basic skill means altering the functions Cs, ts, and es with respect to some external feedback. The actual mechanism used for this adaptation


[Fig. 7. Geometrical map of PRIAMOS' operating environment with connectivity graph and indications of associated elementary action skills (e.g., DOOR PASSING, DOCKING, TURN).]

[Fig. 8. Hierarchical refinement of elementary robot skills.]

depends on the representation of the functions as well as on the information contained in the feedback [15].

The minimum feedback that is assumed to be available in the context of learning from human demonstrations is an evaluation of the performance of the robot after the application of an individual skill. The task is therefore to refine a continuous real-valued function on the basis of a delayed reinforcement signal. Gullapalli's approach based on stochastic real-valued units (SRV units [11,12]) provides a suitable starting point for solving this task, which can formally be described as follows.


Given: An initial skill represented by functions Cs, ts, and es, a model rs that acts as a critic, and an external feedback source providing a scalar evaluation r of the effect of the skill application.

Determine: New functions Csⁿ, tsⁿ and esⁿ whose application results in a better, if possible optimal, evaluation, and a new model rsⁿ that takes the changes introduced by the adaptation into account.

Since we choose RBF networks for the representation of basic mobility skills, the actual refinement of the control function Cs associated to a skill can take place as follows.

Extension of Cs towards new situations: The local representation employed in RBF networks allows for detecting situations that were not encountered so far, i.e., if

∀i ∈ {1, ..., n}: ri(x) < δs,  0 < δs < 1,

it can be concluded that x is a new situation. If it is desired to extend the network to cover this situation, a new cluster n + 1 is generated. The action u (i.e., the weights wn+1,j) to be associated to this cluster can be requested from the user or cloned from the action associated to the cluster that is nearest to the new one. In any case, the width σ of the new cluster is to be initialized such that it does not affect already existing clusters. If k is the index of the nearest old cluster (k ∈ {1, ..., n}), and a threshold δt controlling the maximum overlap between clusters as well as Gaussian transfer functions ri are given, σ can be initialized as

σ = ‖x − μk‖ / √(−ln δt).

Adaptation of the action u associated to a known situation: The typical action to be undertaken in skill refinement is the adaptation of the action u calculated by applying the function Cs (i.e., by evaluating the network representing Cs) to the given situation x. Assuming that the new action un obtained a feedback signal rn, whereas the original action u resulted in an evaluation r with r < rn, the network representing Cs is adapted on-line using (uj − unj) as error and η = η0 sgn(rn − r) as learning rate.

The only kind of information that can be expected from the user during the skill application and refinement process is an evaluation after the termination of the skill execution. 2 To perform the actual adaptation, we use the following rules:
(1) If the user aborted the operation, he/she is asked if the current state is a goal state or an error state. The corresponding function (i.e., es resp. ts) is updated.
(2) Otherwise, the user is asked if the criterion that fired did so correctly. If this is the case, the control function Cs is updated; otherwise, the criterion is updated. In the latter case, an additional feedback can be given by the user that can be used for updating the control function Cs.
(3) If the control function is updated, the critic rs is updated, too.

If the control function must be updated, the contribution of any particular action u(t) to the obtained evaluation must be determined. To solve this temporal credit assignment problem, usually exponentially discounted rewards are used. However, our approach has been to use a discounted adaptation rate.

For updating the region list, we simply include or exclude the final state x. If a new goal or error state must be learned, a corresponding region is inserted. Otherwise, a new unlabeled region covering the current state is inserted, thereby excluding the region from the error or the termination criterion.

To gradually improve the performance of the robot while applying a specific skill, i.e., to increase the quality of the function Cs w.r.t. given evaluation criteria, it is necessary to explore the action space, i.e., to systematically alter the action produced by the network representing Cs in order to find "better" actions. Obviously, the amount of exploration should be guided by the confidence that exists with respect to the quality of the original action the network produced.

Following Gullapalli [12], the output u of the action network is altered by means of a normal distribution Ψ(u, σ) with the standard deviation σ being a function of the difference between the maximum achievable reinforcement signal rmax and the predicted one r (see Fig. 9). Additionally, the exploration is guided by the parameter γr, the exploration width. If the skill s has been built from human performance data ((y(0), u(0)), ..., (y(T), u(T))), γr can be chosen for

2 Termination means that either a robot-specific error occurs, the goal state known from the demonstration is reached, or the user stops the robot.


[Fig. 9. Stochastic action selection following the SRV approach: the executed action is drawn from Ψ(Cs(x), σ) with σ = γr(rmax − r).]

each component uk of the action vector independently according to

γr = max i,j∈{0,...,T}, i≠j |uk(i) − uk(j)|.
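In code, the stochastic action selection of Fig. 9 reduces to a perturbation of the network output; a hedged NumPy sketch (exploration_width and explore_action are assumed names):

```python
import numpy as np

rng = np.random.default_rng()

def exploration_width(u_demo):
    """gamma_r per action component k: max_{i != j} |u_k(i) - u_k(j)|,
    i.e. the range of the demonstrated actions (shape: (T, action_dim))."""
    u_demo = np.asarray(u_demo)
    return u_demo.max(axis=0) - u_demo.min(axis=0)

def explore_action(u, r_pred, gamma_r, r_max=1.0):
    """Draw u' ~ N(u, sigma) with sigma = gamma_r * (r_max - r_pred):
    the better the predicted reward, the smaller the exploration noise."""
    sigma = np.maximum(gamma_r * (r_max - r_pred), 0.0)
    return rng.normal(loc=u, scale=sigma)
```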

Summarizing, the skill application, evaluation, and refinement process appears as follows:
(1) For the current situation x(t), evaluate the network representing Cs.
(2) If the network's activation is low and adaptation is desired, generate a new cluster covering x(t). If the user does not provide an appropriate action, assign the action associated to the closest old cluster to this new cluster.
(3) Else if no adaptation is desired, stop the skill application.
(4) Evaluate the network representing rs on the situation x(t).
(5) If the model network's activation is low, generate a new cluster covering x(t) and assign 0 as prediction r to this cluster.
(6) Alter the action u(t) based on the prediction r according to the procedure shown in Fig. 9.
(7) Store the situation-action pair (x(t), u(t)).
(8) If the goal situation has not been reached and the maximum number of actions has not been exceeded, go back to (1).
(9) Obtain the final reinforcement signal rfinal (rfinal = 1 if the goal has been reached, −1 otherwise).
(10) Obtain a general evaluation of the skill application rstart. If such an evaluation exists, calculate γ such that γ^T rfinal = rstart. Otherwise, choose a fixed γ ∈ [0, 1].
(11) For each situation-action pair (x(t), u(t)), t ∈ {0, ..., T}, calculate rn(t) as rn(t) = γ^(T−t) rfinal.
(12) For each situation-action pair (x(t), u(t)), t ∈ {0, ..., T}, if rn(t) > r(t), update Cs.
(13) For each situation-action pair (x(t), u(t)), t ∈ {0, ..., T} and the obtained reinforcement rn(t), update rs.
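Put together, the thirteen steps form one episode of the loop below, using explore_action from the previous sketch. This is a condensed, hypothetical rendering: C_s and r_s stand for the two RBF networks and are assumed to expose predict, is_novel, add_cluster, nearest_action, and update; env and user wrap the robot and the human evaluation.

```python
def refinement_episode(C_s, r_s, env, user, gamma_r, max_steps=200):
    trace, x = [], env.state()
    for _ in range(max_steps):                       # steps (1)-(8)
        if C_s.is_novel(x):                          # (2): extend towards x
            u0 = user.suggest(x)                     # may be None
            C_s.add_cluster(x, u0 if u0 is not None else C_s.nearest_action(x))
        if r_s.is_novel(x):                          # (5): prediction r = 0
            r_s.add_cluster(x, 0.0)
        r_pred = r_s.predict(x)
        u = explore_action(C_s.predict(x), r_pred, gamma_r)   # (6)
        trace.append((x, u, r_pred))                 # (7)
        x = env.execute(u)
        if env.goal_reached(x) or env.error(x):      # (8): termination
            break
    r_final = 1.0 if env.goal_reached(x) else -1.0   # (9)
    r_start = user.evaluate()                        # (10): overall evaluation
    T = len(trace)
    gamma = ((r_start / r_final) ** (1.0 / T)        # gamma^T r_final = r_start
             if r_start is not None and r_start * r_final > 0 else 0.9)
    for t, (x_t, u_t, r_pred) in enumerate(trace):
        r_n = (gamma ** (T - t)) * r_final           # (11): discounted credit
        if r_n > r_pred:                             # (12): keep only actions
            C_s.update(x_t, u_t)                     #      that improve matters
        r_s.update(x_t, r_n)                         # (13): refine the critic
```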

3.1.1. Example

One example of an elementary mobility skill that was learned and refined is the docking skill (Fig. 10). Here, two user demonstrations resulted in 75 and 84 examples of the form (24 ultrasonic distance measurements, translational offset Δx, translational offset Δy, rotational offset Δθ).

[Fig. 10. Docking.]

Based on the demonstration data, 14 of the 24 ultrasonic sensors were identified as relevant, and a training file of 122 samples was generated. From the training data, a network consisting of 89 clusters was built to represent Cs, whereas the network representing rs consisted of 35 clusters.

Table 1 shows the results obtained from applying the initially learned skill. Most notable is the relatively large number of neurons inserted during the first trial. At the end of this trial, the termination condition was not correctly identified, such that the termination criterion had to be extended.

3.2. Refinement of coordination skills

If global context information is available that can be related to the execution of a specific elementary operation, it makes sense to look for the best out of a set of similar skills to be employed in that context. The task on this level can therefore be stated as:


Table 1
Skill application and refinement

Function   T1   T2   T3   T4
Cs         11    3    0    0
rs         13    4    0    0
ts          1    0    0    0
Success     0    1    1    1

Neurons inserted in the networks representing Cs and rs, and new termination conditions ts generated, during four subsequent skill applications in a known environment, but starting from different locations. The goal was reached in all of the trials.

Given: A context C requiring the execution of an elementary skill s of a specific class.

Determine: The skill s that maximizes a given evaluation criterion related to C.

If only the immediate evaluation r of the application of skill s in context C is of interest, determining the best skill to be selected becomes a matter of predicting this evaluation via QΣ(C, s). Each QΣ(C, s) is a task-dependent value that estimates the appropriateness of applying skill s in context C for fulfilling task Σ. Hence, for the task-dependent coordination skill c the control function Cc is given by the rule applied to select s, given QΣ(C, s); the evaluation function rc is given by the QΣ; and the set of valid contexts (i.e., the error criterion) is encoded implicitly as the set of all contexts C for which values QΣ(C, s) exist.

The appropriate learning rule to be applied at the end of each skill application is simply

QΣ(C, s)t+1 = (1 − α) QΣ(C, s)t + α r,

where α is the learning rate. If the accumulated feedback must be considered, Watkins' Q-Learning [35] is an appropriate technique, and the learning law becomes

QΣ(C, s)t+1 = (1 − α) QΣ(C, s)t + α (r + γ QΣmax(C*)),

where C* is the active context after s has been successfully applied (e.g., the next edge in the connectivity graph). QΣmax(C*) is defined as

QΣmax(C*) = max s∈Skills QΣ(C*, s).

In both cases, finding the best skill to be applied requires exploration, i.e., systematically changing the skill selection. This exploration can be guided through background knowledge that might be available on the context level. If no sufficient background knowledge exists, the standard technique associated with Q-Learning, i.e., selecting the skill s by means of a Boltzmann distribution

P(e = s | C) = exp(QΣ(C, s)/T) / Σ b∈Skills exp(QΣ(C, b)/T),

in combination with a random initialization of the Q values, is applied. Here, T is a temperature parameter.

3.2.1. Example

To assess the convergence of Q-Learning in the context of optimal skill selection, several experiments were undertaken in simulation. For the results shown in Fig. 11, the task was to find the best out of a varying number of skills. The Q value for each skill was initialized randomly. The reinforcement signal r was selected randomly from [0.9, 1] if the optimal skill was executed, and set to a random value in [0, 1] otherwise, in order to simulate the inconsistency in evaluations obtained under real-world conditions. Obviously, the number of trials needed to find the optimal action (here, achieving P(s = sopt|C) > 0.9 was the convergence criterion) depended on both the learning rate and the number of alternative actions that were available.

4. Integration

To be able to perform efficient learning on multiple interdependent levels of abstraction, it is necessary to define a preference criterion (bias) that determines on which level exploration and learning should take place. Based on the assumption that the available skills are already operational (i.e., they can be applied, but are not optimal), the complete learning procedure is designed top-down. It starts with choosing the best elementary skill and performs skill-level adaptation afterwards.

4.1. Synchronizing refinement on the coordination and execution level

If rC,s denotes the evaluation obtained for applying a skill s in a specific context C, and rD,s is the skill-


[Fig. 11. Number of trials (averaged over 100 runs per data point, T = 50) to find the optimal action: left, comparison between different values of α for 5 and 9 available actions; right, influence of the number of available actions (α = 1.0).]

specific evaluation obtained in a different global context D, the learning control procedure can be designed as follows:
(1) For the current context C, find the skill yielding the highest reward rC,s.
(2) If rC,s is sufficiently high, do not perform any adaptation and proceed with (6).
(3) If rC,s is too low but rD,s is sufficiently high in a different global context D (i.e., the skill has already been applied successfully elsewhere), clone the skill and proceed with (5) using the clone.
(4) If rC,s is too low and low or unknown with respect to other global contexts, continue with (5).
(5) Permanently 3 adapt the skill s using the methods described in Section 3.1.
(6) Assign the skill s to the current context C.

On all levels, using background knowledge for guiding the exploration and the adaptation will greatly influence the speed of convergence. In that sense, the preliminary results presented can be considered as a worst-case indication of the effort needed for refinement, since they assume nothing but the existence of a feedback signal at the end of the application of an individual skill. Usually, additional information (like prototype actions, more informative evaluation

3 Obviously, an individual skill might be realized by means of an adaptive controller. This controller can be used every time the skill is applied, e.g., for suppressing oscillations. However, the changes made through this adaptation mechanism are not permanent, since every time the skill is invoked, control will start using the same setting of parameters.

functions, and additional attributes of skills that allow for a better initialization of the Q values) is available. For a realization of the complete method on a real robot, as it is being planned for PRIAMOS, this knowledge must be exploited.
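The six steps can be phrased as a single control routine. The sketch below is hypothetical: q_table is assumed to hold the rC,s values keyed by (context, skill name), and clone and refine_skill stand for skill duplication and the Section 3.1 refinement, respectively.

```python
def learning_control_step(C, skills, q_table, refine_skill, threshold=0.8):
    # (1) find the skill with the highest reward r_{C,s} in context C
    s = max(skills, key=lambda sk: q_table.get((C, sk.name), 0.0))
    r = q_table.get((C, s.name), 0.0)
    if r >= threshold:                       # (2) good enough: no adaptation
        return s                             # (6) assign s to context C
    # (3) successful elsewhere? then adapt a clone, keep the original intact
    elsewhere = any(value >= threshold
                    for (ctx, name), value in q_table.items()
                    if name == s.name and ctx != C)
    if elsewhere:
        s = s.clone()
    # (4)/(5) permanently adapt the (possibly cloned) skill, cf. Section 3.1
    refine_skill(s, C)
    return s                                 # (6) assign s to context C
```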

4.2. Extensions

Two additional aspects make the proposed method interesting in the context of robot systems that are designed for long-term operation in partially unknown environments, but which may communicate with humans. Firstly, if an unsatisfactory performance of the robot is detected that cannot be enhanced by adaptation, a new elementary skill has to be generated, and the generation of this skill can be triggered by the robot itself. Especially if the robot can communicate with a human supervisor, it can explicitly ask for an example of the required skill.

Given a set of elementary skills of a certain class, it is also possible to exploit the relationship between state transitions and actions in a different way. If only the class of the elementary skill to be applied is available but the parameters describing the global context are unknown, assessing the performance of a particular skill allows identifying these parameters, i.e., by systematically evaluating the performance of specific skills, it becomes possible to obtain world knowledge through experimentation. Imagine, for instance, a robot manipulator equipped with a grinding tool.


The robot has to polish different workpieces in its environment, all of which have been modeled geometrically only. For the actual grinding operation, the robot has to provide the ability to control the force it applies to each workpiece. Assume that the robot is not equipped with an adaptive force controller; instead, a set of controllers (elementary skills) optimized for different materials exists. By finding for each workpiece the CONTROL-FORCE skill that results in the best evaluation, the robot is able to classify the pieces with respect to the material, without having to analytically determine material stiffness or other parameters.

5. Conclusion and further work

If robots are to be employed efficiently in environments that are only partially known, they must exploit their experience. They must be able to adapt to their environment and to the specific conditions of use as they are given by the application and the user. In such a scenario, the need for learning becomes evident. A learning robot can relieve the robot designer from cumbersome programming tasks, and it can help the robot user customize the robot for his or her specific needs.

An indispensable requirement for efficient operation is the intelligent use and refinement of basic robot skills. Throughout the paper, a novel hierarchical approach to learning efficient skill application while using mechanisms of skill refinement has been proposed. Several preliminary results have shown the appropriateness of the individual techniques involved in the approach.

The realization of the complete learning control loop on PRIAMOS as well as on a Puma 260 manipulator is a topic of current work and requires substantial effort. Here, the first step is the identification and acquisition of task-independent background knowledge that supports the refinement, as well as an analysis of the minimum available feedback that can be expected on each of the levels of adaptation. Subsequently, the benefits obtained through multi-level adaptive behavior on a real robot must be evaluated quantitatively. This task is expected to become extremely difficult, since the effort involved in performing just a single experiment is very high, and appropriate criteria for evaluating complex learning systems are still to be found [18].

Acknowledgements

This work has been funded by the ESPRIT Project 7274 "B-Learn II". It has been performed at the Institute for Real-Time Computer Systems and Robotics, Prof. Dr.-Ing. U. Rembold and Prof. Dr.-Ing. R. Dillmann, Department of Computer Science, University of Karlsruhe, Germany. The authors would like to thank Marnix Nuttin, Attilio Giordana, and Volker Klingspor for their support.

References

[1] M. Accame and F.G.B. De Natale, A fast and easy way for teaching an ANN to extract edge points from an image, Proc. 4th European Workshop on Learning Robots, Karlsruhe, Germany (1995).

[2] H. Asada and S. Liu, Transfer of human skills to neural net robot controllers, Proc. IEEE Int. Conf. on Robotics and Automation (1991).

[3] C. Baroglio, A. Giordana, M. Kaiser, M. Nuttin and R. Piola, Learning controllers for industrial robots, Machine Learning (1996).

[4] A.G. Barto, R.S. Sutton and C.W. Anderson, Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics (1983) 835-846.

[5] H.R. Berenji, A reinforcement learning based architecture for fuzzy logic control, International Journal of Approximate Reasoning 6 (1992) 267-292.

[6] M.R. Berthold, A time delay radial basis function network for phoneme recognition, IEEE Int. Conf. on Neural Networks, Orlando, FL (1994) 4470-4472.

[7] R. Dillmann, M. Kaiser and A. Ude, Acquisition of elementary robot skills from human demonstration, Int. Symp. on Intelligent Robotics Systems, Pisa, Italy (1995).

[8] R. Dillmann, M. Kaiser, F. Wallner and P. Weckesser, PRIAMOS: An advanced mobile system for service, inspection, and surveillance tasks, in: T. Kanade, H. Bunke and H. Noltemeier, eds., Modelling and Planning for Sensor Based Intelligent Robot Systems (World Scientific, Singapore, 1995).

[9] A. Elfes, Using occupancy grids for mobile robot perception and navigation, IEEE Computer (June 1989) 46-57.

[10] H. Friedrich, S. Münch, R. Dillmann, S. Bocionek and M. Sassin, Robot programming by demonstration: Supporting the induction by human interaction, Machine Learning (1996).

[11] V. Gullapalli, A stochastic reinforcement learning algorithm for learning real-valued functions, Neural Networks 3 (1990) 671-692.

[12] V. Gullapalli, J.A. Franklin and H. Benbrahim, Acquiring robot skills via reinforcement learning, IEEE Control Systems Magazine 14 (1) (1994) 13-24.


[13] R. Heise, Demonstration instead of programming: Focussing attention in robot task acquisition, Research Report no. 89/360/22, Department of Computer Science, University of Calgary, 1989.

[14] S. Hirai, H. Noguchi and K. Iwata, Transplantation of human skillful motion to manipulators in insertion of deformable tubes, IEEE Int. Conf. on Robotics and Automation, Nagoya, Japan (1995) 1900-1905.

[15] M. Kaiser, M. Deck, A. Retey, K. Berns and W. Ilg, Using neural networks for real-world adaptive control, in: Neural Networks: Producing Dependable Systems (Solihull, UK, 1995).

[16] M. Kaiser and R. Dillmann, Building elementary robot skills from human demonstration, IEEE Int. Conf. Robotics and Automation, Minneapolis, MN (1996).

[17] M. Kaiser, H. Friedrich and R. Dillmann, Obtaining good performance from a bad teacher, Int. Conf. on Machine Learning, Workshop on Programming by Demonstration, Tahoe City (1995).

[18] M. Kaiser, V. Klingspor, J. del R. Millán, M. Accame, F. Wallner and R. Dillmann, Using machine learning techniques in real-world mobile robots, IEEE Expert (1995).

[19] M. Kaiser, V. Klingspor, K. Morik, A. Rieger, M. Accame and J. del R. Millán, B-Learn II-D 405: Learning Techniques for Mobile Systems, B-Learn II - ESPRIT BRA Project No. 7274 (1995).

[20] M. Kaiser, A. Retey and R. Dillmann, Robot skill acquisition via human demonstration, Proc. Int. Conf. on Advanced Robotics (1995).

[21] V. Klingspor and K. Morik, Towards concept formation grounded on perception and action of a mobile robot, Proc. 4th Int. Conf. on Intelligent Autonomous Systems, Karlsruhe (1995).

[22] Y. Kuniyoshi, M. Inaba and H. Inoue, Learning by watching: Extracting reusable task knowledge from visual observation of human performance, IEEE Transactions on Robotics and Automation 10 (6) (1994) 799-822.

[23] J.J. Leonard and H.F. Durrant-Whyte, Directed Sonar Sensing for Mobile Robot Navigation (Kluwer Academic Publishers, Dordrecht, 1992).

[24] J. del R. Millán, Learning efficient reactive behavioral sequences from basic reflexes in a goal-directed autonomous robot, Proc. 3rd Int. Conf. on Simulation of Adaptive Behavior (1994).

[25] J. Moody and C. Darken, Learning with localized receptive fields, in: T. Sejnowski, D. Touretzky and G. Hinton, eds., Proc. Connectionist Models Summer School, Carnegie Mellon University (1988).

[26] M.T. Musavi, W. Ahmed, K.H. Chan, K.B. Faris and D.M. Hummels, On the training of radial basis function classifiers, Neural Networks 5 (1992) 595-603.

[27] K.S. Narendra and S. Mukhopadhyay, Adaptive control of nonlinear multivariable systems using neural networks, Neural Networks 7 (1994).

[28] A. Okabe, B. Boots and K. Sugihara, Spatial Tessellations: Concepts and Applications of Voronoi Diagrams (Wiley, New York, 1992).

[29] J. Park and I.W. Sandberg, Approximation and radial-basis- function networks, Neural Computation 5 (1993).

[30] D.A. Pomerleau, Efficient training of artificial neural networks for autonomous navigation, Neural Computation 3 (1991) 88-97.

[31] P. Reignier, V. Hansen and J.L. Crowley, Incremental supervised learning for mobile robot reactive control, in: Intelligent Autonomous Systems 4 (IOS Press, 1995) 287-294.

[32] K.-T. Song and J.-C. Tai, Fuzzy navigation of a mobile robot, Proc. IEEE/RSJ Conf. on Intelligent Robots and Systems, Raleigh, NC (1992).

[33] R.S. Sutton, A.G. Barto and R.J. Williams, Reinforcement learning is direct adaptive control, IEEE Control Systems Magazine (April 1992) 19-22.

[34] F. Wallner, M. Kaiser, H. Friedrich and R. Dillmann, A multilevel learning approach to mobile robot path planning, in: V. Graefe, ed., Intelligent Robots and Systems (Elsevier, Amsterdam, 1995).

[35] C.J.C.H. Watkins, Learning from delayed rewards, Ph.D. Thesis, University of Cambridge, 1989.

Rüdiger Dillmann is a university professor in computer science and robotics at the University of Karlsruhe, where he is the Director of the CAD/CAM/Robotics group at the Institute for Real-Time Computer Systems and Robotics. He is also Head of the Interactive Techniques for Planning research group of the FZI Karlsruhe. He received the Dr.-Ing. degree from the University of Karlsruhe. He is a member of the German Society of

Computer Science (GI) and IEEE and has published more than 200 papers in the fields of mobile robotics, robot learning, and robot programming by demonstration.

Michael Kaiser currently works as a research associate at the Institute for Real-Time Computer Systems and Robotics of the University of Karlsruhe. His present research is concerned with robot learning, human-robot interaction, and robot programming by demonstration. He received a Dipl.-Inform. degree from the University of Karlsruhe, where he studied computer science and biomedical engineering. He is a member of

IEEE and has published more than 20 papers on robot learning and neural adaptive control.