
Incremental Human-Based Training of an Agent-Based Training System

Debbie Richards and Meredith Taylor

Macquarie University

Computing Department, Faculty of Science

North Ryde, NSW, 2109, Australia.

{richards,mtaylor}@science.mq.edu.au

Abstract

Acquiring and transferring the knowledge needed for agents or even humans to think and act intelligently can be difficult due to the embrained, embodied, contextual and tacit nature of much knowledge. In the security domain we are working in, we have the additional problem that much of the knowledge and content is confidential and not freely shared. To capture practical intelligence, a component of successful intelligence, we have thus developed an approach which captures scenarios and knowledge directly from the domain expert as they (re)create and experience a training session. We employ a hybrid case and rule-based knowledge acquisition and representation technique known as Ripple Down Rules which allows the domain expert to incrementally train the agents who in turn are able to assist trainees to acquire the domain knowledge in the context of the learning situation.

1 Introduction

Successful intelligence has been defined by psychology Professor Sternberg as "the integrated set of abilities needed to attain success in life; however an individual defines it, within his or her sociocultural context" [24]. More specifically, these abilities comprise three types of intelligence: 1) analytical intelligence, which is used for analysis, problem solving and decision making; 2) creative intelligence, which allows the generation of novel and interesting ideas; and 3) practical intelligence, which individuals use to apply the most appropriate knowledge according to the situation. Our training environments seek to aid practical intelligence, much of which is tacit.

The majority of research in knowledge based systems has focused on codified, formal or explicit knowledge, probably due to the inherent difficulty in capturing implicit or tacit knowledge. However, it is well recognised that it is the tacit component of knowledge that provides competitive advantage [6], providing the key differentiator between the novice and the expert. We take a process perspective of knowledge which tightly links knowing and doing [3] where the focus is on the application of expertise. Fitting with a situated view of cognition [5], we see knowledge and expertise as highly contextual and made up to fit the circumstances. Therefore, we employ a scenario/case-based technique which seeks to capture both explicit and implicit knowledge as it is exercised by experts. We call such knowledge collectively "knowledge-in-action" [21].

In this project we are seeking to provide a training environment which allows trainees to safely experience a risky situation and come to their own conclusions regarding what is the right decision to be made or action to be taken. The system offers its own conclusions, but to some extent gaining experience is the main goal. Our particular domain is trainee airport customs and immigration officers who need to know what to ask a passenger, how to ask it, how to judge the validity of the answers and what other actions are necessary in deciding whether to allow the passenger entry into Australia. Border security has become a major concern to governments and the general population. Learning to detect suspicious behaviour in passengers is a key part of protecting borders; however, such knowledge is difficult to acquire via traditional or formal education. To allow the acquisition of tacit or practical knowledge via problem solving, the trainee needs to explore the environment learning through a process of trial and error. Role plays are often used for such purposes, but an even less threatening environment can be offered by computer-based simulations.

For computer-based simulations to provide an appropriate alternative to role plays, we need to model realistic looking avatars that demonstrate human-like intelligence such as "the avatar's ability to solve specific problems such as natural language processing and understanding" [1, p.1]. For such a task no algorithm currently exists that compares favourably with a human; however, within a virtual environment, it has been argued that it is not actual intelligence that is significant but rather the perceived intelligence [1]. "Philosophers and psychologists have argued for over a century that what matters in human-human interaction is the individual's subjective beliefs about each other, not the objective truth of the interaction" (pp. 1-2). The question of what initiates the perception of an avatar's intelligence arises from this subjectivity. Bailenson [1, p.2] suggests that "there are many likely variables [that affect this subjectivity and] interact in complex ways: photo-realism, non-verbal behaviour, autonomy and interactivity".


1.1 Related Work

A number of agents (e.g. Carmen [15]; GRETA [16]; STEVE [10]) have been developed which differ in appearance, behaviour, intelligence, embodiment, believability, social capability, and so on, according to the goals of the researchers and projects. Two noteworthy initiatives in the area of building virtual environments populated with intelligent agents include the Net Environment for Embodied Conversational Agents (NECA) [12] and the Mission Rehearsal Exercise System (MRE) [25]. More specifically, there is a body of work focused on the use of agents to detect or model behaviours such as lying and acting suspiciously. The application of this work to the "war against terrorism", the defence forces and perhaps even the justice system is apparent. As in the case of our research, much of the agent work relies and builds on the findings of other disciplines such as psychology (e.g. [8, 9]). For example, [19] have developed an agent system to test whether humans can detect when an agent is lying. Similar to the work of [16] and [17], their work concerns displaying and detecting deception via facial expression. Our focus currently is limited to language, dialogues and gross actions rather than fine actions such as body gestures and facial expressions. Other computer-supported research into deception is in the area of conversational systems (e.g. [13]).

2 Acquiring knowledge and experience

Characteristic of projects such as NECA [12] and MRE [25] is the need for the intensive effort of an extensive, multidisciplinary team. However, for agent-based training simulations to become more widespread, affordable and accessible, we need approaches which are less reliant on trained specialists such as graphic artists for developing animations, psychologists, sociologists or anthropologists for identifying and describing human experiences, and knowledge engineers or logicians for acquiring and encoding knowledge. Returning to the roots of AI, in which the goal was to create artificial intelligences similar to natural intelligences, we seek a more natural approach in which our agents and humans can learn, namely experiential and trial-and-error type learning which does not rely on a host of technical specialists and mediators in the learning process. Our desire to allow the trainer/trainee to directly interact with and train/learn from the system is in recognition that acquiring knowledge has been a bottleneck in the development of knowledge based systems (KBS) [18], further leading to validation and maintenance issues [23]. Similar issues are faced in the authoring of drama or narrative-based systems [2], with the added concern of how to manage control and flow.

To allow context-specific choices by the character and user-driven knowledge acquisition (KA)/engineering (KE) we have incorporated Ripple Down Rules (RDR) [7] into our system. The use of naturally occurring cases (or in this application past experiences) and the RDR exception structure allow local patching of knowledge which can be performed rapidly and easily by the user. The developer would work with the trainer to set up a number of initial scenarios. The trainer can independently interact with the base scenarios to add new scenarios and knowledge as exceptions by suggesting alternative outcomes, input variables, agent behaviours, scenario objects, and so on.

We have created a 3D training simulation environment as a research platform [22] containing a number of components. The scenarios, avatars, and animations used are contained in a separate content component. Vizard is used to render the content in our computer assisted virtual environment (CAVE) system, which consists of three projectors which display the virtual world onto a 6m wide semi-cylindrical screen canvas. The user is positioned slightly off centre towards the canvas to allow a 160° field of view. The Vizard Engine is controlled remotely by a software environment we have created called SynthIDE, which is controlled by Lua scripts, a language similar to Python (Fig. 1).

AgentSpeak(Officer,"Have you read and understood the questions on the incoming passenger card?","Have you <pron sym='r eh d'/> and understood the questions on the incoming passenger card?");
AgentSpeak(Passenger,"Yeah.", "<pron sym='y 1 eh h'/>");
AgentSpeak(Officer,"Do you have more than $10000 Australian or foreign currency equivalent on you?");
AgentSpeak(Passenger,"No I don't.");
NarratorSpeak(NarratorVoice,"Jason has not brought any luggage with him. He only has his wallet, passport and keys.");
l_Sleep(1000);
AgentSpeak(Officer,"Why haven't you brought any luggage with you?");
AgentSpeak(Passenger,"Well, I had to leave in a hurry. My mother's very sick. She's in hospital.");

Figure 1. "No luggage scenario" Lua script extract

Our agents are able to follow instructions via a behavioural engine which forms part of the scenario engine, but they have no reasoning capabilities of their own. To this end we need to populate the knowledge base with rules that will allow the agents to make decisions. These rules will also determine which scenarios are appropriate (including what objects and characters to include and their positioning and behaviours), what language is suitable (including not just the words to use but also the intonation and emphasis), and appropriate body language and facial gestures.

In exploring a number of our research questions, we have offered two main approaches to handle reasoning and control: 1) a Wizard of Oz style interface based on role playing games (RPG) using a human as the Game or SimMaster [20]; 2) a narrative engine based on our previous work on Interactive Drama using elementary and abstract units like goals, tasks and obstacles. In the first approach the burden falls on the human to provide intelligence. The second approach requires a trained logician to encode the narrative elements. This paper describes a hybrid approach which captures and validates codified rule-based intelligence from a human domain expert guided by a set of cases.


3.1 Incrementally acquiring knowledge via a training simulation

Ripple Down Rules (RDR) [7] is a technique for acquiring knowledge incrementally in the form of production rules which are motivated by and linked to domain-specific scenarios (cases). Note that we present and use Multiple Classification Ripple Down Rules (MCRDR) [11], which significantly extended single classification RDR. RDR is based on the observation that acquiring knowledge from a domain expert is not as simple as asking them everything they know about a certain subject. Rather, experts need some kind of context in which to exercise their knowledge. Just as humans learn incrementally via experience, our system learns one case at a time. Knowledge is elicited as the expert interacts with a case/scenario relevant to their area of expertise by asking them what their conclusions are about the scenario and what features justify this conclusion, in accordance with how experts tend to exercise their expertise. In RDR, the expert's conclusions about the scenario become the conclusion/s in one or more rules, and the features of the case become the rule conditions. The scenario is linked to the new rules and is used when new knowledge is added to override incorrect conclusions (e.g. incomplete conclusions, over-generalisations, over-specialisations) to ensure that inconsistencies do not arise in the knowledge base. The condition of the new rule must differentiate between the current case and the previously seen cases associated with the rule to be overridden. There is no attempt to capture a globally true set of rules; instead, knowledge evolves as new cases are seen and rules are added as exceptions to patch the knowledge wherever it is deemed not to be correct for the new case at hand.

For example, if the domain is airport security, a domain expert may be presented with the following scenario:

Scenario 1: A 24 year old male with no luggage is at the customs desk. He claims he is here for a week-long holiday.

The expert may then be asked how much of a risk the passenger in the scenario represents, what action the customs officers should take and why. The expert may say that because the passenger is between 18 and 30 and has no luggage, he represents a high risk and the customs officers should question him further. This scenario will then be associated with the following rule:

IF 18<age<30 AND luggage = 'none' THEN risk = "high" AND action = "question the passenger"

When another scenario is presented to the expert, the RDR system will use the rules it already has to make its own conclusions about the scenario, which the expert can then either accept or reject. If the expert rejects the system's conclusions, a new rule will be created as an exception to the rule the system used in making its conclusion. For example, another scenario in the airport security domain may be the following:

Scenario 2: A 27 year old female with no luggage is at the customs desk. She claims she is here for a day trip.

The system will look at the rule created for the previous scenario, see that it is relevant here (18<age<30 AND luggage = 'none'), and conclude that the passenger presents a high risk and the customs officers should question the passenger. The domain expert may then disagree with the system's conclusions on the basis that the passenger is only visiting for a day. An exception to the previous rule will then be created that could look like the following:

IF 18<age<30 AND luggage = 'none' AND trip_length = "day" THEN risk = "low" AND action = "Do nothing"

By using RDR, the domain expert will eventually build a large rule base, and the system will be able to make more accurate conclusions about a scenario based on the knowledge captured by the expert. A ripple down rule base is represented as a tree with each rule represented as a node in the tree (Figure 2). When a scenario is presented to the RDR system, it moves down the tree visiting each rule that applies to the scenario and testing each child. The last true rule on each pathway gives the conclusions.

Figure 2. RDR tree with 5 rules showing one of the cases associated with one of the rules

Figure 2 shows the nodes visited (in bold outline) for scenario 2. The system first visits the default rule (which is always true and draws no conclusion or the default conclusion) then visits rule 1, the first rule that applies to the scenario. After this, the system has nowhere further to go and presents the conclusion to the expert. The expert then chooses to add a new rule, rule 3 (bold dotted line) as an exception to rule 1. Thus, the next time a scenario is presented to the system that has a passenger between the ages of 18 and 30 making a day trip with no luggage, the system will be able to progress to rule 4 before making a conclusion.
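To make this inference process concrete, the following is a minimal Lua sketch (our illustration, not the system's actual implementation) of an MCRDR-style tree in the spirit of Figure 2: each node holds a condition, an optional conclusion and a list of exception children, and the last true rule on each path contributes a conclusion. The Rule constructor, field names and evaluate function are assumptions made for this example.

-- Minimal MCRDR-style inference sketch (illustrative only).
local function Rule(condition, conclusion, exceptions)
  return { condition = condition, conclusion = conclusion, exceptions = exceptions or {} }
end

-- Collect the conclusion of the last true rule on each path through the tree.
local function evaluate(rule, case, conclusions)
  conclusions = conclusions or {}
  if not rule.condition(case) then return conclusions end
  local refined = false
  for _, exception in ipairs(rule.exceptions) do
    if exception.condition(case) then
      evaluate(exception, case, conclusions)
      refined = true
    end
  end
  if not refined and rule.conclusion then
    table.insert(conclusions, rule.conclusion)
  end
  return conclusions
end

-- Rule 1 from the worked example, with its day-trip exception attached.
local rule1 = Rule(
  function(c) return c.age > 18 and c.age < 30 and c.luggage == "none" end,
  { risk = "high", action = "question the passenger" },
  { Rule(function(c) return c.trip_length == "day" end,
         { risk = "low", action = "do nothing" }) })
local root = Rule(function() return true end, nil, { rule1 })

-- Scenario 2: 27 year old female, no luggage, day trip.
local result = evaluate(root, { age = 27, luggage = "none", trip_length = "day" })
print(result[1].risk, result[1].action)   --> low    do nothing

Because MCRDR is a multiple-classification method, more than one top-level rule may fire for a case, which is why the sketch returns a list of conclusions rather than a single one.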

Our training simulation system includes several 3D animated scenarios that might occur in the domain of airport security. In our approach, the domain expert can interrupt a running scenario and display its attributes (Fig. 3). At this point the system presents its conclusions for the scenario (Fig. 4) based on the RDR knowledge base (KB) which has been built up from previous scenarios. The expert may then either agree with the conclusions or disagree by adding a new rule into the RDR KB, which will then be associated with that scenario or case (Fig. 2). If the expert chooses to disagree with the system's conclusions about a scenario, they will be re-shown the attributes of the current scenario, as well as the attributes of the scenario associated with the rule giving the incorrect conclusion.

Figure 3. Scenario with attributes displayed

Figure 4. System’s conclusion based on RDR KB in Figure 2

Rules for passengers with no luggage
1. A passenger with no luggage is immediately suspicious. Customs officers are advised to search the passenger's clothes and body and consider the passenger to present a moderate risk.
2. Additionally, if the passenger has a reason for visiting Australia (i.e. the reason is known) customs officers should attempt to verify that the reason is genuine.
3. If the passenger is only staying for one day, they present a low risk and customs officers should let the passenger through.

Scenario
You are an expert in airport security. Watch each scenario. Then correct the system's conclusions afterwards using the 3 rules listed above.

Ripple Down Rules Questions
Please give the most correct response to the statements below.
Q1. It was easy to understand what the scenario's attributes were.
Q2. I found it easy to understand what the system's conclusion was and how to disagree with it.
Q3. I found it easy to understand how the system worked out its conclusion.
Q4. I found it easy to select extra categories to change the conclusion for the scenario.
Q5. Once I had chosen the extra categories, I found it easy to specify what the new conclusion should be.
Q6. The user manual provided was important in helping me to understand how to use the system.
Q7. How effective/useful do you think this system is as a tool for capturing knowledge of customs officers for use in training?
Q8. If you have any other comments, please write them below.

3D character questions
Q9. How would you describe the appearance of the characters?
Q10. How would you describe the behaviour of the characters?
Q11. What effect did the appearance of the characters have on your ability to use the system?
Q12. What effect did the behaviour of the characters have on your ability to use the system?
Q13. How useful do you think this environment would be for learning? Give your reasons.

Figure 5. Information and questions in our pilot study

To demonstrate the knowledge acquisition process, in Fig. 4 a 32 year old female passenger with no luggage is being questioned by a customs officer. The system has suggested that the passenger presents a moderate risk and should be searched. However, the previous case that these rules were created from had a passenger who was planning to stay for 2 days – 1 week, whereas the passenger in the current case is only staying for one day. This means that the risk presented by the passenger and the action the customs officers should take might be different to those in the previous scenario. By choosing to disagree with the current conclusion (Fig. 5), the user adds/selects the new conclusion to do nothing and selects the "duration of stay" attribute-value "1-day" (Fig. 4), thus forming the new exception rule, which will be associated with the current scenario. By using the RDR system within the training simulation, over time a concise database of rules linked with scenarios will be built up.
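Continuing the same illustrative Lua sketch (again an assumption about representation, not the authors' code), the knowledge acquisition step of disagreeing with a conclusion amounts to attaching a new exception rule, built from the attribute-values the expert selects, to the rule that gave the wrong conclusion, and linking the current scenario to it as its justifying case:

-- Attach an exception rule to the rule that produced the wrong conclusion.
-- selected_conditions holds the attribute-values the expert chose to
-- differentiate the current scenario from the rule's earlier case.
local function add_exception(wrong_rule, current_scenario, selected_conditions, new_conclusion)
  local exception = {
    condition = function(c)
      for attribute, value in pairs(selected_conditions) do
        if c[attribute] ~= value then return false end
      end
      return true
    end,
    conclusion = new_conclusion,
    exceptions = {},
    case = current_scenario,   -- the scenario justifying (and linked to) this rule
  }
  table.insert(wrong_rule.exceptions, exception)
  return exception
end

-- e.g. add_exception(rule1, scenario2, { duration_of_stay = "1 day" },
--                    { risk = "low", action = "do nothing" })

The stored case would then play the role described earlier: when further knowledge is added, the expert must pick conditions that differentiate the new scenario from the cases already associated with the rule being overridden.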


3.2 Evaluating Usability – Pilot Study

To date we have completed an initial evaluation of our approach to capturing knowledge as one interacts with a scenario. We expect to complete a more comprehensive usability study in the next few months which incorporates the various technical, logistical and conceptual issues identified in this pilot. The pilot involved four participants (3 male and 1 female) aged between 22 and 29. Each participant was given the incoming passenger card used in Australian airports, a list of 'facts' about airport security that were made up for the study, along with some Likert-scale questions on using the system and questions regarding the believability of the characters (Figure 5). Also included was a small user manual for the RDR screens (not used by any participants!).

Scenario 1
Passenger's age: 35
Passenger's gender: Male
Arrived from: Havana
Passenger's nationality: Cuban
Type of luggage: none
Purpose of visit: visiting mother in hospital
Duration of stay: 2 days-1 week

Scenario 2
Passenger's age: 32
Passenger's gender: Female
Arrived from: Singapore City
Nationality: Singaporean
Type of luggage: none
Purpose of visit: business
Duration of stay: 1 day

Figure 6. Attributes of the two scenarios played to participants

The participants were shown two scenarios involving passengers being questioned by customs officers (Figure 6). Each scenario ended with the participant being asked "What should the customs officers do next?", after which the participant used the RDR system to change the system's conclusion for the scenario using the 'facts' that they were given. Initially the RDR system was only given the first three rules displayed in Figure 2, causing it to conclude for the first scenario that the passenger presented a low risk and that the customs officers should do nothing. When the participant chose to alter the conclusion for scenario 1, new rules were added to the RDR KB which could then be used when the system made its conclusion for scenario 2. After each participant had finished the test, they answered the questions on the RDR system in Figure 5. The results for Q1-Q6 by participant are given in Table 1.

ID    Gender    Q1    Q2    Q3    Q4    Q5    Q6
P1    Male      4     3     3     4     4     3
P2    Female    5     5     4     4     5     4
P3    Male      5     5     5     4     4     2
P4    Male      5     5     5     4     4     4

Table 1. Likert responses to questions 1-6 by participant.
Key: 5 = Strongly Agree, 4 = Agree, 3 = Neutral, 2 = Disagree, 1 = Strongly Disagree

All participants felt that the system had potential for capturing expertise for training customs officers. Additionally, most of the participants felt that the 3D environment was useful for learning and training purposes. One participant (P1) commented: "The environment would be useful because it will demonstrate a number of factors that would be used in decision making as well being able to demonstrate both existing rules as well as the consequences of taking particular decisions." Another participant (P2) pointed out that the system would "…Help trainees with putting principles into practice and seeing how well they remember them in a "realistic" but not stressful or judgmental environment."

Most of the participants had trouble initially with using the system and understanding the purpose of each screen. This is most likely attributable to the fact that they neglected to study the user manual before starting the exercises. In reality, customs officers would receive training before being required to use the system. Participants found that they did need some instruction on how each screen in the system worked.

Although each participant was given the same list of ‘facts’, each participant created a different conclusion for each scenario. (In practice the ability to tailor knowledge is a benefit of RDR). One participant (P2) only considered the type of luggage when changing the conclusion for scenario 1, whereas other participants also took into account duration of stay and purpose of visit. Depending on what a participant chose as their conclusion for scenario 1, they were able to see the system use their conclusion for scenario 2, prompting them to write a new conclusion for scenario 2 and illustrating the nature of the RDR system. Overall, the preliminary study yielded positive results on the usefulness of both the RDR system and the 3D training simulation and assisted us in targeting areas for further development.

4 Conclusions and Next Steps

The closest application to our problem domain using RDR is the work conducted by [26] for the Australian Health Insurance Commission to identify general practitioners who are involved in fraudulent practice. Single classification RDR was used to classify 1,500 practice profiles as fraudulent or not fraudulent. Through this incremental classification process the rules were simultaneously acquired and the results compared against manual assignment of classifications by humans. Our work differs not only in its use of MCRDR but also in that the interactive acquisition of knowledge as a narrative unfolds, and the creation of alternative scenarios/cases, raise challenges relating to control, (re-)inferencing and sustaining a sense of immersion.

Our next steps are to permit scenario authoring by adapting, altering and combining base scenarios on the fly with particular concern for how changes will affect scenario flow, knowledge acquisition and execution and the training experience in general. After further usability testing of the knowledge acquisition/scenario authoring approach, we intend to demonstrate the system to interested parties in the defence sector using domain experts in border security.

We acknowledge that non-verbal behaviours play an important part in detecting suspicious behaviour. Our KA approach allows the knowledge related to these behaviours to be captured (for example, IF fidgeting and sweating THEN suspicious, EXCEPT if child or brokenAirConditioner), but our immersive worlds lack this level of realism. We look forward to being able to benefit from other advances in agent technologies, in particular the work on embodied agents to make our agents more believable and the experience more immersive. Our work is complementary to that body of research and based on the realization that capturing knowledge, particularly practical intelligence, is inherently difficult and exacerbated in high security and restricted access domains. Thus, our approach allows those 'in the know' to pass on their knowledge and experience to others who have the need to know.

References

1. Bailenson, J.N., Beall, A.C., Blascovich, J.J., Raimundo, M. and Weishbush, M. (2001) Intelligent agents who wear your face: User's reactions to the virtual self, Tech. Report, Center for Virtual Environment & Behaviors, Psych. Dept, University of California.

2. Brisson, A., Dias, J., & Paiva, A. (2007). From chinese shadows to interactive shadows: building a storytelling application with autonomous shadows. Proc. ABSHLE at AAMAS 2007. ACM Press.

3. Carlsson, S.A., El Sawy, O.A., Eriksson, I., Raven, A. (1996), Gaining competitive advantage through shared knowledge creation: in search of a new design theory for strategic information systems, In Proc. of ECIS, Lisbon, pp.1067-76.

4. Castelfranchi, C. and Poggi, I. (1993). Lying as pretending to give information. In Herman Parret, (ed) Pretending to communicate, de Gruyter, Berlin, NY, pp 276–291.

5. Clancey, W. (1997) The Conceptual Nature of Knowledge, Situations and Activity In Feltovich, P.J., Ford, K. M. and R. R. Hoffman (eds) Expertise in Context: Human and Machine AAAI Press/MIT Press, Cambridge, Mass., 247-291.

6. Colonia-Willner, R., (1999) Investing in practical intelligence: Ageing and efficiency among executives Int.l Jrnl of Behavioural Development 23(3):591-614

7. Compton, P. and Jansen, R., (1990) A Philosophical Basis for KA. Knowledge Acquisition 2:241-257.

8. DePaulo, B. M., Lindsay, J. J., Malone, B. E., Muhlenbruck, L., Charlton, K., & Cooper, H. (2003). Cues to deception. Psychological Bulletin, 129, 74-118.

9. Ekman, P., and O'Sullivan, M. (2006) From Flawed Self-Assessment to Blatant Whoppers: The Utility of Voluntary and Involuntary Behavior in Detecting Deception. Behavioral Sciences and the Law, 24, 673-686.

10. Johnson, W. L., and Rickel, J. 1998. Steve: An animated pedagogical agent for procedural training in virtual environments. SIGART Bulletin 8:16-21.

11. Kang, B., Compton, P. and Preston, P. (1995) Multiple Classification Ripple Down Rules: Evaluation and Possibilities Proc. KAW’1995 Feb 26 - March 3 1995, Vol 1: 17.1-17.20.

12. Klesen, M. (2002). Report on affective reasoning and cultural diversity. Technical Report, DFKI.

13. Lee, M. and Wilks, Y. (1997). Eliminating deception and mistaken belief to infer conversational implicature. In IJCAI Wshp on Coop., Collab. and Conflict in Dialogue Systems.

14. Lorenz, M.; Gehrke, J.D.; Langer, H.; Timm, I.J.; Hammer, J. (2005) Situation-aware Risk Management in Autonomous Agents, CIKM’05, Oct 31–Nov 5, 2005, Bremen, Germany.

15. Marsella, S, and Gratch, J. (2001) Modeling the Interplay of Emotions and Plans in Multi-Agent Simulations In Proc. 23rd Annual Cog. Sci. Soc., Edinburgh, Scotland.

16. Pelachaud, C. and Poggi, I. (2002) Subtleties of facial expressions in embodied agents. Journal of Visualization and Computer Animation, 31:301–312.

17. Prendinger, H. and Ishizuka, M. (2001) Social Role Awareness in Animated Agents. Proc. Agents’01, 270–277.

18. Quinlan, J. R. (1979): Discovering rules by induction from large collections of examples, in D. E. Mitchie (ed.), Expert systems in the micro-electronic age. Edinburgh Uni. Press, Edinburgh, Scotland

19. Rehm, M. and André, E. (2005). Catch me if you can: exploring lying agents in social settings. In Proc.AAMAS’05, Utrecht, ACM, New York, NY, 937-944.

20. Richards, D. (2006) Is Interactivity Actually Important? Proceedings of the Third Australasian Conference on Interactive Entertainment (IE'2006), 4-6 Dec. 2006, Perth.

21. Richards, D. and Busch, P., (2002) Knowledge in Action: Blurring the Distinction Between Explicit and Tacit Knowledge Journal of Decision Systems, Editions Hermes 11(2): 149-164.

22. Richards, D., Szilas, N., Kavakli, M., Dras, M., 2007: Impacts of Visualisation, Interaction and Immersion on Learning with An Agent-Based Training Simulation, Special issue on "Agent Based Systems for Human Learning" of the Jrnl Int. Trans. on Systems Science and Applications (ITTSA), 1-16.

23. Soloway, E, Bachant, J. and Jensen, K. (1987) Assessing the Maintainability of XCON-in-RIME: Coping with Problems of a very Large Rule Base Proc IJCAI’87, Seattle, WA: Morgan Kaufman, Vol 2:824-829.

24. Sternberg, R. J., and Grigorenko, E. L. (2000). Teaching for successful intelligence: To increase student learning and achievement. Arlington Heights, IL: Skylight Prof.l Dev.

25. Swartout, W., Hill, R. Gratch, J., Johnson, W. L., Kyriakakis, C., LaBore, C., Lindheim, R., Marsella, S., Miraglia, D., Moore, B., Morie, J., Rickel, J., Thiebaux, M., Tuch, L., Whitney, R. and Douglas, J. (2001) Toward the Holodeck: Integrating Graphics, Sound, Character and Story In Proc. AAMAS’01, NY, ACM Press, pp. 409-416.

26. Wang, J. C., M. Boland, W. Graco, and H. He. (1996) Classifying general practitioner practice profiles. In Proc. PKAW96: Sydney, Australia, October 1996, pp 333–345.

27. Ward, D. and Hexmoor, H. (2003). Deception as a means for power among collaborative agents. In Int. WS on Collaborative Agents, pp. 61–66.


Spectrum Curricula for Measuring Teachability

Jacob Beal
BBN Technologies
10 Moulton Street
Cambridge, MA
[email protected]

Alice Leung
BBN Technologies
10 Moulton Street
Cambridge, MA
[email protected]

Robert Laddaga
BBN Technologies
10 Moulton Street
Cambridge, MA
[email protected]

ABSTRACT

Machines learning from human teachers can advance both the capabilities of engineered systems and our understanding of human intelligence. A key challenge for this field, however, is how to effectively measure and compare the teachability of machine learners, particularly given the diversity of potential learners and the inherent adaptivity and variability of human instructors. We are addressing this challenge with spectrum curricula, where each spectrum curriculum is a suite of lessons, all with the same instructional goal, but varied incrementally with respect to some property of interest. We have designed a set of seven spectrum curricula, investigating three instructional properties in the RoboCup domain, and are implementing these within the Bootstrapped Learning Framework produced by the DARPA Bootstrapped Learning program [2]. The materials we are producing are being made publicly available on the Open Bootstrapped Learning Project website, such that any researcher can test against the curricula available or can contribute their own curricula to improve the quality of this community resource.

Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning

General Terms
Measurement

Keywords
Teachability, Bootstrapped Learning, Robocup, Spectrum Curriculum

1. INTRODUCTION

Humans are much better at learning from other humans than our current machine learning systems are at learning from humans. Right now, a human who wants to instruct a machine needs to be two types of expert at once—an expert in the subject that is to be instructed, and also an expert in some method of configuring the machine, such as programming or knowledge representation.


In contrast, even the most naive instructors can easily teach other humans, as when little kids teach one another games. Outside of formal classroom settings, humans teach one another informally, imprecisely, and highly effectively (and even a classroom is extremely informal and imprecise compared to the current requirements for machine instruction). Cultural knowledge, like how to change a flat tire, what to do when you're sick, or fun games to play together, passes easily from person to person in a population, even when the teacher has no training as an instructor.

Building systems capable of such human-like learning is important for both engineering and scientific reasons. From an engineering perspective, if humans and machines are to work together in challenging environments, then machines must be able to learn from humans in a manner similar to how humans learn from one another. The environments that humans act in are highly diverse and constantly changing—whether in the office, at home, or on a battlefield. Even a machine with exhaustive background knowledge must be customized for the particular environment in which it acts, just as a secretary must become accustomed to the peculiarities of a particular office's policies and customs or a soldier must learn the particular patterns and cautions of a deployment area.

For humans and machines to work together routinely, machines must have this human-like capacity to learn from naive instructors. Neither the homemaker nor the soldier has the time and inclination to bother with a machine partner otherwise. Although machines are still far from human-like teachability, much might be accomplished even by meeting the humans partway, as with Graffiti character recognition [5]. Even a partial win could be revolutionary, allowing machines to adapt "culturally" to the environments in which they are deployed and also allowing the vast reservoirs of human curricula to be used in configuring machines for application domains.

From a scientific perspective, building a machine that learns more like a human does may shed light on the nature of human learning. We may hope to investigate questions such as:

• How important are shared assumptions and biases for human-like learning?

• Are humans powerful but "rational" learners, or do they leap to unwarranted (but often correct) conclusions?

• How does instruction enable representational change?

• What are the relative contributions of shared architecture, shared inherent biases, shared culture, and shared experiences?

• Is our arsenal of machine learning techniques missing any fundamental tools that humans are employing?

A key challenge for this field, however, is how to effectively measure and compare the effectiveness of machine learners, particularly given the diversity of learners and the inherent adaptivity and variability of human instructors. Spectrum curricula address this challenge by measuring a student's ability to adapt to a set of fixed teachers, which all teach the same lesson but vary incrementally along some property of interest in how the lesson is taught.

We have previously proposed the notion of spectrum curricula in [1]; in this paper, we develop the idea further and give specific and detailed examples of how such curricula can be designed. In particular, we have designed a set of seven spectrum curricula, investigating three properties of instruction—strength of assumptions about mutual knowledge, distance of transfer between lesson and use in context, and level of detail—in the domain of 3-on-2 keepaway in RoboCup simulated soccer [3]. We are implementing these curricula within the Bootstrapped Learning Framework produced by the DARPA Bootstrapped Learning program [2], and making the materials we are producing publicly available on the Open Bootstrapped Learning Project website, http://dsl.bbn.com/BL/, such that any researcher can test against the curricula available or can contribute their own curricula to improve the quality of this community resource.

2. SPECTRUM CURRICULA

When a skilled human teacher instructs a single student, the teacher typically adapts their curriculum to the background and needs of the student. Consider, for example, teaching a person how to play Kriegspiel, a chess variant in which each player sees only their own pieces and a referee adjudicates. If the student is already deeply familiar with chess, the game may be explained quite simply, with many of the details in making such a variant work left implicit, whereas all the rules may need to be spelled out explicitly for a student unfamiliar with chess. Likewise, some students may pick things up rapidly, whereas others may need much reinforcement before they are comfortable with the game. In a tutoring environment, a teacher will adapt to convey the lesson to the student, expanding where prompted by student difficulties and spending less time where the student picks up readily.

While this sort of adaptability is good for making learning systems work, since a human teacher can adjust to match the feedback being given by the system, it presents a problem for measuring the teachability of a system. An adaptive teacher can teach many systems with different capabilities the same lesson, masking their differences. Moreover, no human teacher will instruct exactly the same way twice and different teachers will instruct the same student in different ways.

A spectrum curriculum addresses this problem by turning it around, and measuring the adaptability of the student to a set of teachers, each executing a relatively rigid curriculum. Each spectrum curriculum is built around a single variable property of instruction, such as the level of detail in a lesson, and contains a set of lessons that incrementally vary this property from "too hard" to "too easy." For example, a level of detail spectrum for moving pawns might range from a single en passant capture on the "too hard" end of the spectrum to hundreds of examples with different rest-of-board configurations on the "too easy" end of the spectrum. The series of lessons is taught in order from difficult to easy, and the student is tested before the first lesson and after each lesson. The product of training against such a curriculum is thus a curve showing how the student's cumulative understanding improves as the aspect varies.

Figure 1: A game of 3-on-2 keepaway being played in the Robocup simulator.

Testing an instruction-based learning system against several curricula for the same aspect should produce a good measure of the system's adaptability with respect to that aspect, with more area under the curve generally being better. No system should be expected to perform well on the most challenging lessons, but the performance of a well-designed system should degrade gracefully as the challenge increases. Moreover, a spectrum that ranges from "too hard" to "too easy" should also be able to provide a fine-grained measure of performance on which incremental progress towards human-like teachability can be easily tracked.
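As a rough illustration of how such a curve might be produced and summarised (this is our sketch, not part of the Bootstrapped Learning Framework; run_test and teach_lesson stand in for whatever test harness a curriculum provides), the learner is tested before the first lesson and after each lesson, and the resulting scores are reduced to an area-under-the-curve figure:

-- Illustrative sketch: score a learner against one spectrum curriculum by
-- testing before the first lesson and after each lesson, then summarising
-- the resulting curve by its area (trapezoidal rule, unit spacing).
-- run_test and teach_lesson are assumed to be provided by the harness.
local function evaluate_curriculum(student, curriculum)
  local scores = { run_test(student, curriculum.test) }   -- pre-test
  for _, lesson in ipairs(curriculum.lessons) do           -- ordered lessons
    teach_lesson(student, lesson)
    table.insert(scores, run_test(student, curriculum.test))
  end
  local area = 0
  for i = 2, #scores do
    area = area + (scores[i - 1] + scores[i]) / 2
  end
  return scores, area   -- larger area generally indicates broader teachability
end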

Note, however, that the systems with the most human-like teachability may not show the most breadth on a given spectrum! For example, an assumption about a human-like teacher's intent may cause a system to fail on an "easy" task where too much detail is provided, since the student cannot believe something so painstakingly instructed can be so simple as the concept it has deduced.

To make a spectrum curriculum, one must first specify the spectrum property of interest. This might be any topic of interest to a researcher: the length of the lesson in seconds, the number of examples provided, the number of instructional techniques employed, the level of math knowledge that the instruction depends on, and many others. Spectrum curricula therefore can vary enormously, and the question then is which properties are the most interesting, the most useful and the most predictive of the behavior of human or machine learners. We are not able to provide an extensive characterization in this paper, but the topic itself is of scientific and philosophical interest, and we have barely scratched the surface in terms of the range of spectrum curricula we have begun to study. One of the things we would like this paper to accomplish is the beginning of a discussion of spectrum curricula as a topic of scientific interest and study in the area of educational technology.


Figure 2: In an instruction scenario, three agents—teacher, student, and world—interact by sending symbolic messages. The world sends percept messages to both teacher and student, and either of them can manipulate it by sending it action messages.

3. BBN ROBOCUP KEEPAWAY CURRICULA

We have begun applying the notion of spectrum curricula by designing a set of spectrum curricula, using the domain of Robocup soccer simulations [3]. We have designed seven curricula, each of which teaches a different skill for playing a 3-on-2 keepaway game in a restricted portion of the soccer field.

Our curricula use an agent-based instructional framework. An instruction scenario is set up in terms of a set of interacting agents: a teacher, a student, and a world (Figure 2). These all evolve independently, communicating by passing symbolic messages to one another. The teacher and student jointly observe the world, so any time the world's state changes, it sends updated percepts to both. The teacher and the student can both affect the world's state by commanding it to take actions. Finally, the teacher and the student send one another a variety of messages for different stages of the teaching process. This general framework should be able to support "classroom" worlds with multiple students and/or multiple teachers, as well as dialogue between teacher and student. At present, however, the curricula are all designed for one-on-one instruction with low interactivity.
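The following Lua fragment sketches the kind of message passing just described; the message names, callback style and contents are our assumptions for illustration, not the framework's actual API.

-- Illustrative teacher/student/world message passing (assumed message names;
-- not the Bootstrapped Learning Framework API).
local world   = { state = { ball = "K0" }, subscribers = {} }
local teacher = { name = "teacher" }
local student = { name = "student" }

-- When the world's state changes, it sends updated percepts to both agents.
local function world_step(action)
  if action then world.state[action.key] = action.value end
  for _, agent in ipairs(world.subscribers) do
    agent.on_percept(world.state)                 -- percept message
  end
end

teacher.on_percept = function(state)
  -- the teacher may instruct the student based on what it observes
  student.on_message({ kind = "example", label = state.ball == "K0" })
end

student.on_percept = function(state) end          -- the student observes too
student.on_message = function(msg)
  print("student received message of kind: " .. msg.kind)
end

world.subscribers = { teacher, student }
world_step({ key = "ball", value = "K0" })        -- either agent could also act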

Each of the seven spectrum curricula we have designed is a collection of 6 to 10 lessons, incrementally varying from "hard" to "easy" along one of three properties of interest:

• Strength of assumptions about mutual knowledge (2)

• Distance of transfer between lesson and use in context (3)

• Detail of instruction (2)

Three lessons teach knowledge for the "keeper" team, three teach knowledge for the "taker" team, and one is used by both teams. In every case, the piece of knowledge being learned is a function that either returns a boolean result or chooses one of two options.

These seven curricula also exercise three of the "natural instruction method" [4] teaching modalities that have been identified in the DARPA Bootstrapped Learning project:

• learning from examples (6)

• learning from “telling” (2)

• learning from feedback (from the instructor or environment) (2)

To clearly illustrate the spectrum curricula concept and show how it can be implemented, we present all seven curricula, organizing them by property to show how different curricula can investigate the same property of interest.

3.1 Assumptions about mutual knowledge

Out of Bounds.
Both keeper and taker players try to avoid going out of bounds. This spectrum teaches a function that is used to tell when a nearby location is illegal (a sketch of such a predicate follows the lesson list below).
Spectrum Property: Strength of assumptions about mutual knowledge.
Natural Instruction Methods: Learning by Example
Test: 10 random locations
Lessons (Easy to Difficult):

• 50 labelled examples scattered randomly around the field

• 20 labelled examples scattered around the boundary (assumes values not near the boundary are indicated by the examples given)

• 8 labelled examples, one in and one out on each line (assumes a rectilinear area)

• 2 labelled examples, in or out, at opposite corners (assumes the area is aligned with cardinal directions)

• hint at the line, two labelled examples, one in and one out (assumes the line is the border)

• hint at the line, one labelled example in or out (assumes the other side of the line is the opposite value)

• only hint at the line (assumes interior of boundary is likely to be "good")
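As flagged above, here is a minimal Lua sketch of the kind of boolean function this spectrum teaches, assuming a rectilinear, axis-aligned area with made-up boundary coordinates (the representation and values are ours, not the curriculum's):

-- Illustrative target concept for the Out of Bounds spectrum (assumed
-- representation and boundary values; not taken from the curriculum itself).
local field = { x_min = 0, x_max = 20, y_min = 0, y_max = 20 }

-- Returns true when a nearby location is illegal (outside the keepaway area).
local function out_of_bounds(x, y)
  return x < field.x_min or x > field.x_max
      or y < field.y_min or y > field.y_max
end

print(out_of_bounds(10, 10))  --> false
print(out_of_bounds(25, 10))  --> true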

Taker: Triangle Bounds.
In order to be effective at intercepting, a taker always tries to stay "near enough" to the center of the group. This spectrum teaches that "near enough" means within the triangle formed by the three keepers. Assumes prior knowledge of a "Distance" function that takes two positions and returns a scalar and an "Angle" function that takes a vertex and two points on rays extending from it and reports the angle between the rays.
Spectrum Property: Strength of assumptions about mutual knowledge.
Natural Instruction Methods: Learning from Examples
Test: 10 random configurations of keepers and the taker
Lessons (Easy to Difficult):

• 50 labelled examples scattered randomly around the field, with the three keepers in canonical positions. In every case, the positions of the three keepers are hinted at, along with the "distance" and "angle" functions.

• 20 labelled examples scattered around the boundary (assumes values not near the boundary are indicated by the examples given).

• Three pairs of examples, one in and one out on each edge (assumes boundaries are straight lines).


Figure 3: Configuration cases for teaching a keeper player where to pass; panels (a)-(d) show Cases 1-4. Keepers are blue, takers are green, and the player is keeper 0.

• One pair of examples, in and out, next to a boundary (assumes symmetry in decision boundary).

• One pair of examples, in and out, not near a boundary (assumes the function is set by the in/out spatial relation).

• One example, in or out, not near a boundary (assumes the other side is the opposite value).

3.2 Transfer between lesson and use

Keeper: Where to Pass.
When a keeper has decided to pass, it needs to choose one of its teammates to pass to. This spectrum teaches the keeper to pass to the team-mate with the greatest minimum angle to a taker. Assumes prior knowledge of "Distance" and "Angle" functions.
Spectrum Property: Distance of transfer between lesson and use in context.
Natural Instruction Methods: Learning from Examples
Test: 10 random configurations of all five players

These lessons involve four types of configurations, as shown in Figure 3. We will refer to the three keepers as K0, K1, K2, and the two takers as T1 and T2. The player is always K0, has the ball, and is choosing whether to pass to K1 or to K2. The configurations are divided into types based on what types of mistakes a human-like learner might make, given examples of that type.

Figure 3(a) shows an example of a Case 1 configuration. With examples of this type of configuration, a learner might mistakenly come to believe that the answer is to pick the furthest teammate or the teammate with the least average distance from the two takers. In Case 1, the correct answer, K2, is the teammate furthest from K0. T1 and T2 are closer in distance to K2 but closer in angle to K1.

Figure 3(b) shows an example of a Case 2 configuration. With examples of this type of configuration, a learner might mistakenly come to believe that the answer is to pick the closest teammate or the teammate with the least average distance from the two takers. In Case 2, the correct answer, K1, is the teammate closest to K0. T1 and T2 are closer in distance to K1 but closer in angle to K2.

Figure 3(c) shows an example of a Case 3 configuration. With examples of this type of configuration, a learner might mistakenly come to believe that the answer is to pick the furthest teammate or the teammate with the greatest average distance from the two takers. In Case 3, the correct answer, K2, is the teammate furthest from K0. Also, T1 and T2 are closer in both distance and angle to K1 than they are to K2.

Figure 3(d) shows an example of a Case 4 configuration. With examples of this type of configuration, a learner might mistakenly come to believe that the answer is to pick the closest teammate or the teammate with the greatest average distance from the two takers. In Case 4, the correct answer, K1, is the teammate closest to K0. Also, T1 and T2 are closer in both distance and angle to K2 than they are to K1.

For each case, it is also possible to generate a Case nA version. In the Case nA version, T1 and T2 are in the same place, simplifying learning.
Lessons (Easy to Difficult):

• 36 labelled examples with 3 keepers in different positions and the 2 takers at the same location: 12 examples each of Case 1A, Case 2A, and Case 3A.

• 36 labelled examples of all 5 players in different positions: 9 examples each of Case 1, Case 2, Case 3, and Case 4.

• 12 labelled examples with 3 keepers in different positions and the 2 takers at the same location: 2 examples each of Case 1A, Case 2A, and Case 3A and 6 examples of Case 4A.

• 2 labelled examples of all 5 players in different positions. 1 example each of Case 2 and Case 3.

• 1 labelled example with 3 keepers in different positions and the 2 takers at the same location. The example is Case 1A.

• 1 labelled example of all 5 players in different positions. The example is Case 1.

Taker: Who to Guard? Assuming that a taker player has decided to guard one of the keepers that does not have the ball, it needs to pick one of the two to be the one that it guards. This spectrum teaches that a taker should guard the keeper that it is closest to. Assumes prior knowledge of the "Distance" function.

Some potential incorrect conclusions that the learner might make are that it should select the keeper who does not have the ball and is (1) closest to the ball, (2) furthest from the ball, (3) nearest to the other taker, or (4) farthest from the other taker.
Spectrum Property: Distance of transfer between lesson and use in context.


Natural Instruction Methods: Learning from Examples
Test: 10 random configurations of you, the other taker, a keeper with the ball, and two keepers who don't have the ball.
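As a sketch, the target concept for this spectrum reduces to a nearest-neighbour choice (Python; the position-tuple representation is our own, not the curriculum's):

import math

def choose_keeper_to_guard(taker_pos, keepers_without_ball):
    # Guard whichever ball-free keeper is closest to this taker.
    return min(keepers_without_ball, key=lambda k: math.dist(taker_pos, k))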

In all the following lesson examples, the taker player is T1 and the keeper with the ball is K0.
Lessons (Easy to Difficult):

• 20 examples with only T1, K1, and K2. No ball. Make sure T1 is not equidistant from K1 and K2.

• 20 examples with all 5 players and T2 at the same location as K0. Make sure T1 is not equidistant from K1 and K2.

• 4 examples with K0 in the center of the field. K1 and K2 are equidistant from K0. T2 is at the same location as K0. T1 is located approximately on the line that runs through K1 and K2. In one example, T1 is on the side of K1 that is away from K2. In one example, T1 is between K1 and K2, but closer to K1. In one example, T1 is between K1 and K2, but closer to K2. In one example, T1 is on the side of K2 that is away from K1.

• 4 examples with K1 in the center of the field. K0 and K2 are equidistant from K1. T2 is at the same location as K0. T1 is located approximately on the line between K1 and K2. In one example, T1 is approximately 20% of the distance between K1 and K2. In the other examples, T1 is approximately 40%, 60%, and 80% of the distance between K1 and K2.

• One example with T1, K1, and K2 randomly placed on the field. No ball, no K0, no T2. Make sure that T1 is not equidistant from K1 and K2.

• One example with K0 in the center of the field. K1 and K2 are equidistant from K0. T2 is in the same place as K0. T1 is near K1. The answer is K1.

Keeper: Where to Move. When a keeper does not have the ball, it tries to get open so that its team-mate can pass to it. This spectrum teaches the keeper that good "open" locations are in a band 50-80% of the field away from the keeper with the ball, and not within 10 degrees of either taker or the other keeper without the ball. Assumes prior knowledge of "Distance" and "Angle" functions.
Spectrum Property: Distance of transfer between lesson and use in context.
Natural Instruction Methods: Learning from Feedback and Learning by Examples.
Test: 10 random configurations with the following properties: two are acceptable locations, and two each exercise the distance prohibition and the three player proximity prohibitions.
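A hedged sketch of the "open location" test described above; we read the 10-degree constraint as an angle measured at the keeper holding the ball, and field_length, like the other names, is our own placeholder:

import math

def angle_at(origin, a, b):
    v1 = (a[0] - origin[0], a[1] - origin[1])
    v2 = (b[0] - origin[0], b[1] - origin[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = (math.hypot(*v1) * math.hypot(*v2)) or 1.0
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def is_open(candidate, ball_keeper, other_players, field_length):
    # In the 50-80% distance band from the ball holder, and at least
    # 10 degrees away (seen from the ball holder) from every other player.
    d = math.dist(candidate, ball_keeper)
    if not (0.5 * field_length <= d <= 0.8 * field_length):
        return False
    return all(angle_at(ball_keeper, candidate, p) >= 10.0 for p in other_players)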

In all the following examples, the keeper player is K0.
Lessons (Easy to Difficult):

• 100 labelled examples of all 5 players in random plausible positions. The ball is with a different keeper. Student is then allowed to propose up to 10 position/label pairs and told whether each is right or wrong.

• 40 labelled examples with three keepers and one taker in random plausible positions. The ball position and feedback stage are as before.

• 20 labelled examples with two keepers in fixed positions, K0 at a random distance and one taker in a random position between them. The ball position and feedback stage are as before.

• 10 labelled examples with the ball keeper in a fixed position, K0 at a random distance, and one taker in a random position between them. The ball position and feedback stage are as before.

• Four examples, one per rule. One has just two keepers, and K0 starts too close and moves away until its distance is acceptable and the ball is kicked to it. Another is the same except that the keeper starts too far away and moves closer. The other two have a taker at too near an angle and a keeper at too near an angle. Feedback is as before.

• Two examples: two keepers and a taker slightly off the line between the keepers. In the first example, K0 is at an acceptable distance, and moves until its angle is acceptable, at which point the ball is kicked to it. In the second example, the angle is acceptable, but the distance is too close. Feedback is as before.

• One example: two keepers and a taker slightly off the line between the keepers. K0 starts too close in distance and angle, then moves away from the line, simultaneously hitting an acceptable distance and angle, at which point the other keeper kicks the ball to it. Feedback is as before.

• Same as before, but the ball starts with K0.

3.3 Detail of instruction

Taker: Guard or Take? When a keeper is holding the ball, each taker player periodically checks to see if it should be the one trying to take the ball from that keeper. If not, it will guard one of the other keepers. This spectrum teaches that the taker should try to take if it is the one closest to the ball. Assumes prior knowledge of a "Distance" function.
Spectrum Property: Detail of instruction
Natural Instruction Methods: Learning from Examples, becoming Learning by Telling as lessons become easier.
Test: 10 random configurations of takers and ball-holding keeper
Lessons (Easy to Difficult):

• Full calculation told as hints: GoTake = Distance(me.pos, ball.pos) < Distance(other.pos, ball.pos). Given an example of two takers and a ball on the field, with the function returning opposite answers when applied to the two takers (see the sketch after this list).

• Given both distance calls as hints, but not comparison.

• Hints just point to distance from teammate to ball.

• Hints just point to distance function and ball.

• Hint just points to distance function.


• Hint just points to position of ball.

• No hints at all, just the example.
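The GoTake calculation hinted in the first lesson of this spectrum maps directly onto code; a minimal sketch (argument names are ours):

import math

def go_take(my_pos, other_taker_pos, ball_pos):
    # GoTake = Distance(me.pos, ball.pos) < Distance(other.pos, ball.pos)
    return math.dist(my_pos, ball_pos) < math.dist(other_taker_pos, ball_pos)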

Keeper: When to Pass. Whenever a keeper player has the ball, it periodically checks to see whether it should continue holding the ball or pass. This spectrum teaches that it should pass whenever a taker comes "near," and how to determine what is "near." Assumes prior knowledge of a "Distance" function.
Spectrum Property: Detail of instruction
Natural Instruction Methods: Learning by Feedback, becoming Learning by Telling as lessons become easier.
Test: 10 random configurations of takers
Lessons (Easy to Difficult):

• Full calculation told as hints: PassNow = Or(Near(taker0), Near(taker1)); Near(taker) = Distance(me.pos, taker.pos) < k. The student is then told to maximize their length of possession and given 10 feedback trials (see the sketch after this list).

• Don’t tell the value of the threshold k.

• Also don’t tell how to use the distance calculation.

• For Near, hint only that the positions of self and taker are important.

• For Near, hint only that self is important.

• Give no information about Near.

• Hint only that Near should be calculated for taker0 and taker1, but not how to combine them.

• Hint only that taker0 and taker1 are important.

• Only tell the student to maximize possession time.

• Only tell the student that possession time is important.
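The hinted PassNow/Near calculation from the first lesson of this spectrum, as a sketch; the threshold k is exactly what the feedback trials are meant to pin down, so the value below is only a placeholder:

import math

K_NEAR = 8.0  # placeholder; the lessons deliberately leave k to be learned

def near(my_pos, taker_pos, k=K_NEAR):
    # Near(taker) = Distance(me.pos, taker.pos) < k
    return math.dist(my_pos, taker_pos) < k

def pass_now(my_pos, taker0_pos, taker1_pos, k=K_NEAR):
    # PassNow = Or(Near(taker0), Near(taker1))
    return near(my_pos, taker0_pos, k) or near(my_pos, taker1_pos, k)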

4. IMPLEMENTATION AND DISTRIBUTION

We are implementing these curricula in the Bootstrapped Learning Framework that is being produced by the DARPA Bootstrapped Learning program [2]. This Java-based framework provides an infrastructure for agent-based instruction scenarios, a scripting language for designing lessons and connecting them to form curricula, and a knowledge representation language (InterLingua [4]) that can be used to describe the goal knowledge that a lesson is attempting to teach. A Robocup domain package created by SRI interfaces a standard Robocup simulator into this framework and provides a base player for each team, with spots in its strategy where learned knowledge may be plugged in.

We are making the curricula we produce publicly available on the Open Bootstrapped Learning Project website, http://dsl.bbn.com/BL/, along with a copy of the framework, additional documentation, and a semi-competent base learner for researchers to build off. Researchers can use the spectrum curricula on the site as challenge problems or to evaluate their systems. Researchers are also encouraged to submit new spectrum curricula to the collection, either using the provided framework or their own framework, to improve the quality of this community resource.

Figure 4: BBN Open Bootstrapped Learning Project webpage

5. CONTRIBUTIONS

We have demonstrated how spectrum curricula can be designed for investigating a student's teachability, where each spectrum curriculum incrementally varies how a lesson is taught with respect to a property of interest. More particularly, we have presented a set of seven spectrum curricula in the Robocup domain that can be used to investigate three properties of instruction: strength of mutual knowledge assumptions, transfer distance between lesson and use, and lesson detail.

It is important to remember, however, that the actual effectiveness of these curricula at capturing their designed goals has yet to be evaluated. An important future direction of research is to test both humans and machine learners against these curricula in order to verify that useful spectra are actually produced.

The materials we are producing are being made publicly available on the Open Bootstrapped Learning Project website, such that any researcher can test against the curricula available or can contribute their own curricula. As this public resource is improved, we hope that it will prove stimulating to the community of researchers investigating teachability, both providing a useful set of challenge problems and a meaningful standard of cross-system comparison, and thereby advance us toward a future of adaptive human-machine collaboration and predictive models of human teachability.

6. REFERENCES

[1] J. Beal, P. Robertson, and R. Laddaga. Curricula and metrics to investigate human-like learning. In AAAI 2009 Spring Symposium "Agents that Learn from Human Teachers", March 2009.

[2] DARPA IPTO. Bootstrapped learning. http://www.darpa.mil/ipto/programs/bl/bl.asp (Retrieved Nov. 5, 2008).

[3] H. Kitano, M. Asada, Y. Kuniyoshi, I. Noda, E. Osawa, and H. Matsubara. Robocup: A challenge problem for AI and robotics. In RoboCup-97: Robot Soccer World Cup I, pages 1-19, London, UK, 1998. Springer-Verlag.

[4] L. Lefkowitz, J. Curtis, and M. Witbrock. Accessible research cyc. Technical Report AFRL-IF-RS-TR-2007-204, Air Force Research Laboratory, 2007.

[5] I. S. MacKenzie and S. X. Zhang. The immediate usability of graffiti. In Graphics Interface '97, pages 129-137, 1997.


Generalizing Apprenticeship Learning across Hypothesis Classes

Thomas J. Walsh, Rutgers University, Dept. of Computer Science, Piscataway, NJ, [email protected]
Kaushik Subramanian, Rutgers University, ECE Department, Piscataway, NJ, [email protected]
Michael L. Littman, Rutgers University, Dept. of Computer Science, Piscataway, NJ, [email protected]
Carlos Diuk, Princeton University, Dept. of Psychology, Princeton, NJ, [email protected]

ABSTRACT

This paper develops a generalized apprenticeship learning protocol for reinforcement-learning agents with access to a teacher who provides policy traces (transition and reward observations). We characterize sufficient conditions of the underlying models for efficient apprenticeship learning and link these criteria to two established learnability classes (KWIK and Mistake Bound). We then construct efficient apprenticeship-learning algorithms in a number of domains, including two types of relational MDPs that are not efficiently learnable in the autonomous case. We instantiate our approach in a software agent and a robot agent that learn effectively from a human teacher.

Categories and Subject Descriptors

I.2.6 [Artificial Intelligence]: Learning

General Terms

Theory

Keywords

Apprenticeship Learning, Reinforcement Learning

1. INTRODUCTION

Teachers unquestionably increase the speed and efficacy of learning in humans. Yet in the field of reinforcement learning (RL), almost all learning agents gain experience solely by interaction with their environment—teachers are not in the loop. This work addresses this disconnect by proposing a generalized protocol for apprenticeship learning within the reinforcement-learning paradigm, characterizing sufficient conditions for efficient apprenticeship learning that cover a wide swath of important AI domains, and showing that there is a potentially exponential improvement in sample complexity when an agent can interact with a teacher instead of learning on its own.

The idea of integrating a teacher into the learning process has been proposed in several different forms. For instance, in the early computational learning theory literature, equivalence queries [3] played the role of teacher and were shown to increase the class of learnable concepts in the supervised learning setting. In this work, we expand the apprenticeship protocol of [2] to cover a wider array of model classes. We consider a teacher with a policy $\pi_T$, who can deliver a trace, a sequence of states, actions, and rewards obtained by executing $\pi_T$ from a start state, to the learning agent after seeing it behaving suboptimally. We note that this scenario is different from inverse reinforcement learning [1], where the reward function is inferred from sequences of states and actions. Instead, our agents see the actual rewards and transitions induced by the teacher's policy and act to try to maximize this observable reward function.

We characterize a class of RL environments for which an agent can guarantee that only a polynomial number of example traces are needed to act near-optimally. Specifically, this class includes all KWIK-learnable domains from the autonomous case and all deterministic domains from the Mistake Bound (MB) learning class, a set that contains many models that thwart autonomous agents. These results generalize earlier theoretical results in a handful of RL representations, including flat MDPs, linear MDPs, Stochastic STRIPS, and deterministic OOMDPs.

2. TERMINOLOGY

In this section, we propose a protocol for apprenticeship learning and a supervised learning framework that will allow us to study the efficiency of learners in this setting.

2.1 Reinforcement Learning

A reinforcement-learning [12] agent interacts with an environment described by a Markov Decision Process (MDP), which is a 5-tuple $\langle S, A, T, R, \gamma\rangle$ with states $S$, actions $A$, transition function $T : S \times A \mapsto \Pr[S]$, reward function $R : S \times A \mapsto [R_{min}, R_{max}]$, and discount factor $\gamma$. An agent's deterministic¹ policy $\pi : S \mapsto A$ induces a value function over the state space, defined by the Bellman equations: $Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s') Q^{\pi}(s', \pi(s'))$ and $V^{\pi}(s) = Q^{\pi}(s, \pi(s))$. An optimal policy $\pi^*$ for an MDP is a policy such that $\forall s,\ V^*(s) = V^{\pi^*}(s) \ge V^{\pi'}(s)\ \forall \pi'$. The goal of a standard (autonomous) reinforcement-learning agent is to enact a near-optimal policy despite having to learn $T$ and $R$ from experience.

2.2 Generalized Apprenticeship Learning

In contrast to the autonomous learning framework, this paper considers a paradigm where the agent's experience is augmented with experience produced by a teacher and the criterion is to find a policy whose value function is nearly as good as, or better than, the value function induced by the teacher's policy. Formally, we define the Apprenticeship Learning Protocol for episodic domains where each episode has a length of at most $H = \mathrm{Poly}(|M|, |A|, R_{max}, \frac{1}{1-\gamma})$, in Algorithm 1.

¹The results of this paper can be extended to the stochastic case, although practical implementation gets more complex.


Here, $|M|$ is a measure of environment complexity, described later.

Algorithm 1 The Apprenticeship-Learning Protocol
  The agent starts with $S$, $A$ and $\gamma$, a time-cap $H$, and has access to episodic environment $E$. The teacher has policy $\pi_T$.
  for each new start state $s_0$ from $E$ do
    $t = 0$
    while the episode has not ended and $t < H$ do
      The agent chooses $a_t$
      $\langle s_{t+1}, r_t\rangle = E.\mathrm{progress}(s_t, a_t)$
      $t = t + 1$
    if the teacher believes it has a better policy for that episode then
      The teacher provides a trace $\tau$ starting from $s_0$.

Intuitively, the agent is allowed to interact with the environment, but, unlike standard RL, at the end of an episode, the teacher can provide the agent with a trace of its own behavior starting from the original start state. The criteria the teacher uses to decide when to send a trace is left general here, but one specific test of value, which we use in our experiments, is for the teacher to provide a trace if at any time $t$ in the episode, $Q^{\pi_T}(s_t, a_t) < Q^{\pi_T}(s_t, \pi_T(s_t)) - \epsilon$. That is, the agent chooses an action that appears worse than the teacher's choice in the same state. Traces are of the form $\tau = (s_0, a_0, r_0), \ldots, (s_t, a_t, r_t), \ldots, (s_g, r_g)$ where $s_0$ is the initial state and $s_g$ is a terminal (goal) state, or some other state if the $H$ cutoff is reached. Notice that the trajectory begins before (or at) the point where the agent first acted sub-optimally, and may not even contain the state in which the agent made its mistake.
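One way to read the trace-provision test used in the experiments is sketched below (Python); q_teacher and teacher_policy are stand-ins for the teacher's own value estimates and policy, not an API defined in the paper:

def teacher_should_send_trace(episode, q_teacher, teacher_policy, epsilon):
    # Send a trace if at any step t the agent's action looked more than
    # epsilon worse than the teacher's action under the teacher's Q-values.
    for (s_t, a_t, _r_t) in episode:
        if q_teacher(s_t, a_t) < q_teacher(s_t, teacher_policy(s_t)) - epsilon:
            return True
    return False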

Since the teacher’s policy may not be optimal, this tracecould potentially prescribe behavior worse than the agent’spolicy. We distinguish between these traces and their morehelpful brethren with the following definition.

Definition 1. A valid trace (with accuracy $\epsilon$) is a trace supplied by a teacher executing policy $\pi_T$, delivered to an agent who just executed policy $\pi_A$ starting from state $s_0$, such that $V^{\pi_T}(s_0) - \epsilon > V^{\pi_A}(s_0)$.

Defining valid traces in this way allows agents to outperform their teacher without being punished for it. With deterministic policies, this definition means that at no time in a valid trace does the teacher prescribe an action that is much worse than any of the actions the agent used in that state. Note that when the teacher enacts optimal behavior ($\pi_T = \pi^*$), only valid traces will be provided.

We now introduce a teacher into the learning loop in a way that allows us to characterize the efficiency of learning analogous to the way the PAC-MDP framework [11] has been used to characterize efficient behavior in the autonomous setting. We define PAC-MDP-Trace learning as follows:

Definition 2. An RL agent is said to be PAC-MDP-Trace if, given accuracy parameters $\epsilon$ and $\delta$, and following the protocol outlined in Algorithm 1, the number of valid traces (with accuracy $\epsilon$) received by the agent over its lifetime is bounded by $\mathrm{Poly}(|M|, |A|, R_{max}, \frac{1}{1-\gamma})$ with probability $1 - \delta$, where $|M|$ measures the complexity of the MDP's representation, specifically the description of $T$ and $R$.

2.3 Frameworks for Learning Models

In this section, we introduce a class of dynamics where PAC-MDP-Trace behavior can be induced. In autonomous reinforcement learning, the recent development of the KWIK [8] or "Knows What It Knows" framework has unified the analysis of models that can be efficiently learned. The KWIK-learning protocol for supervised learning consists of an agent seeing an infinite stream of inputs $x_t \in X$. For every input, the agent has the choice of predicting a label ($y_t \in Y$) or admitting $\perp$ ("I don't know"). If the agent predicts $\perp$, it sees a noisy observation $z_t$ of the true label. A hypothesis $h^*$ in class $H$ is said to be KWIK learnable if, with probability $1 - \delta$, two conditions are met. (1) Every time the agent predicts $y_t \neq \perp$, $||y_t - h^*(x_t)|| \le \epsilon$. (2) The number of times $y_t = \perp$ is bounded by a polynomial function of the problem description.
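A minimal sketch of the KWIK protocol as an interface (our naming, not from [8]); None stands in for the ⊥ prediction, and the driver counts how often the learner must admit ignorance:

from abc import ABC, abstractmethod

class KWIKLearner(ABC):
    @abstractmethod
    def predict(self, x):
        """Return a label y, or None to signal "I don't know" (⊥)."""

    @abstractmethod
    def observe(self, x, z):
        """Receive a (possibly noisy) observation z of the true label for x."""

def run_kwik(learner, stream):
    # stream yields (x, z) pairs; the label z is revealed only after a ⊥.
    bottoms = 0
    for x, z in stream:
        if learner.predict(x) is None:
            bottoms += 1
            learner.observe(x, z)
    return bottoms  # must stay polynomial for the class to be KWIK learnable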

In autonomous reinforcement learning, if $T$ and $R$ are efficiently KWIK learnable, efficient behavior can be achieved by optimistically filling in the $\perp$ predictions. These results cover a large number of models common to autonomous reinforcement learning, including flat MDPs and DBNs [8]. However, several relevant and intuitively appealing classes are not KWIK learnable. For instance, conjunctions of $n$ terms are not KWIK learnable (see Section 4.2). So an environment with a "combination lock" of $n$ tumblers that need to be set to the correct digits for a high reward action ("unlock") to be effective may require an exponential number of suboptimal steps. But, in the trace setting, learning to open such a lock is simple: the agent only needs the teacher to supply a single trace to learn the combination! Thus, there are clearly models that are learnable in the apprenticeship setting that are not autonomously learnable.

Prior work [3, 6] has established a link between teachable hypothesis classes and the mistake bound (MB) framework [9]. However, both of these considered only the prediction setting (not sequential decision making). The MB learning protocol for model learning is essentially the same as the KWIK protocol except for three changes. (1) In MB, there is no $\perp$ prediction. The agent must always predict a $y_t \in \{0, 1\}$ and receives a true label when it is wrong. (2) MB is only defined for deterministic hypothesis classes, so instead of $z_t$, the agent will actually see the true label. (3) Efficiency is characterized by a polynomial bound on the number of mistakes made. It follows [8] that any efficient KWIK learning algorithm for a deterministic hypothesis class can become an efficient algorithm for MB by simply replacing all $\perp$ labels with an arbitrary element from $Y$. We now introduce the following related criterion, called a mistake-bounded predictor (MBP), and will later show that if $T$ and $R$ for an MDP are learnable in this framework, it is PAC-MDP-Trace learnable.

Definition 3. A mistake-bounded predictor (MBP) is an online learner with accuracy parameters $\epsilon$ and $\delta$ that takes a stream of inputs from set $X$ and maps them to outputs from a set $Y$. After predicting any $y_t$, the learner receives a (perhaps noisy) label $z_t$ produced by an unknown function from a known hypothesis class. An MBP must make no more than a polynomial (in $\frac{1}{\epsilon}$, $\frac{1}{\delta}$, and some measure of the complexity of the hypothesis class) number of mistakes with probability $1 - \delta$. Here, a mistake occurs if, for input $x_t$, the learner produces $y_t$ and $||h^*(x_t) - y_t|| > \epsilon$, where $h^*$ is the unknown function to be learned.


Like KWIK, MBP observations can be noisy, and like MB, the learner is allowed to make a certain number of mistaken predictions. In fact, we can formalize the relationships with the following propositions.

Proposition 1. Any hypothesis class that is KWIK or MB learnable is MBP learnable.

Proof. A KWIK learner can be used in its standard form, except that whenever it predicts $\perp$, the MBP-Agent should pick a $y_t \in Y$ arbitrarily. Also, the underlying KWIK learner should not be shown new labels when it does not predict $\perp$, though the "outer" MBP agent does receive them.

MB is defined for deterministic hypothesis classes, so in this setting, the MB and MBP protocols line up.
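A sketch of the Proposition 1 construction, reusing the KWIKLearner interface assumed above: confident predictions pass through, ⊥ is replaced by an arbitrary default, and the inner learner is only trained on the steps where it admitted ⊥:

class MBPFromKWIK:
    # Mistake-bounded predictor wrapped around a KWIK learner (Proposition 1 sketch).
    def __init__(self, kwik_learner, default_label):
        self.kwik = kwik_learner
        self.default = default_label
        self._asked = False

    def predict(self, x):
        y = self.kwik.predict(x)
        self._asked = y is None          # did the inner learner say ⊥?
        return self.default if y is None else y

    def observe(self, x, z):
        if self._asked:                  # withhold labels otherwise, as in the proof
            self.kwik.observe(x, z)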

We can combine MBP learners in several ways and still preserve the MBP properties. For instance, consider the following MB-partitioned class.

Proposition 2. Consider two "low-level" MBP-learnable classes $C_0$ and $C_1$ with the input space $X$ and disjoint output sets $Y_0$ and $Y_1$, respectively. Consider a "high-level" MB-learnable class $C$ mapping $X$ to $\{0, 1\}$. The composition of these classes where the output of the class $C$ learner is used to select which low-level MBP class to use (if the output of the high-level learner is $i$, use class $C_i$) is MBP-learnable.

Proof. On input $x$, get a prediction $i$ from the $C$ learner. Then, query the $C_i$ learner and report its response as the solution. Observe $y$. Define $i$ such that $y \in Y_i$, then train the $C$ learner with $(x, i)$ and $C_i$ with $(x, y)$. By construction, all learners get the appropriate training data and will individually make a small number of mistakes. When they make accurate predictions, the overall learner is accurate.

As an example, consider a factored MDP with a reward function defined as follows. If a predetermined conjunction $c_R$ over all $n$ factors is false, then the agent receives reward $R_{min} < 0$, and otherwise it gets a reward drawn from a distribution over $[0, R_{max}]$. Given that information (but not $c_R$ or the distribution), the reward function can be learned using an MB conjunction learner and a KWIK learner for the distribution when the conjunction is true, because the cases are always discernible. In contrast, a class where the false conjunction again produces $R_{min}$, but a true conjunction induces a distribution over $[R_{min}, R_{max}]$, is not covered under this case because the output sets of the learning problems overlap (making it unclear how to solve the top-level learner).

Several simpler combinations of MBP learners also preserve the MBP properties. These include input-partition, where low-level learners are chosen based on some known function of $X$ (a degenerate case of MB-partition without the "high-level" MB learning); union, where the outputs of several MBP learners with the same input sets give predictions (one can simply use the "low-level" learner that has made the fewest mistakes and give samples to all the learners); and cross-product, where learners with disjoint inputs and outputs have their predictions combined. These forms of combination have proven useful in building KWIK learners for autonomous RL [8] and we use MB-partition in Section 4.3.

3. EFFICIENCY RESULTS

In this section, we link the class of MBP learnable functions to efficient PAC-MDP-Trace learning agents.

3.1 Efficient Apprenticeship Learning

We introduce a model-based RL algorithm (MBP-Agent, Algorithm 2) for the apprenticeship setting that uses an MBP learner as a module for learning the dynamics of the environment. Notice that because MBP learners never acknowledge uncertainty, our algorithm for the apprenticeship setting believes whatever its model tells it (which could be mistaken). While autonomous learners run the risk of failing to explore under such conditions, the MBP-Agent can instead rely on its teacher to provide experience in more "helpful" parts of the state space, since its goal is simply to do at least as well as the teacher. Thus, even model learners that default to pessimistic predictions when little data is available (as we see in later sections and as are used in our experiments) can be used successfully in the MBP-Agent algorithm. Algorithm 2 has the following property.

Algorithm 2 MBP-Agent
  The agent knows $\epsilon$, $\delta$, and $A$, and has access to the environment $E$, teacher $T$, and a planner $P$ for the domain.
  Initialize MBP learners $L_T(\epsilon, \delta)$ and $L_R(\epsilon, \delta)$
  for each episode do
    $s_0 = E.\mathrm{startState}$
    while episode not finished do
      $a_t = P.\mathrm{getPlan}(s_t, L_T, L_R)$
      $\langle r_t, s_{t+1}\rangle = E.\mathrm{executeAct}(a_t)$
      $L_T.\mathrm{Update}(s_t, a_t, s_{t+1})$; $L_R.\mathrm{Update}(s_t, a_t, r_t)$
    if $T$ provides trace $\tau$ starting from $s_0$ then
      $\forall \langle s, a, r, s'\rangle \in \tau$: $L_T.\mathrm{Update}(s, a, s')$, $L_R.\mathrm{Update}(s, a, r)$
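A compact Python sketch of the MBP-Agent loop above; the environment, planner, teacher, and the two MBP learners are placeholders standing in for the corresponding components, not code from the paper:

def mbp_agent(env, planner, L_T, L_R, teacher, num_episodes):
    # Plan greedily with the current learned model, act, update the transition
    # (L_T) and reward (L_R) learners, then fold in any end-of-episode trace.
    for _ in range(num_episodes):
        s = env.start_state()
        while not env.episode_finished():
            a = planner.get_plan(s, L_T, L_R)
            r, s_next = env.execute(a)
            L_T.update(s, a, s_next)
            L_R.update(s, a, r)
            s = s_next
        trace = teacher.maybe_provide_trace()   # None if the teacher is satisfied
        if trace is not None:
            for (s, a, r, s_next) in trace:
                L_T.update(s, a, s_next)
                L_R.update(s, a, r)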

Theorem 1. An MBP-learner is PAC-MDP-Trace for any domain where the transitions and rewards are MBP learnable.

The proof hinges on an extension of the standard Explore-Exploit lemma, which we call the Explore-Exploit-Explain Lemma.

Lemma 1. On each trial, we can define a set of known state-action ($\langle s, a\rangle$) pairs as the ones where the MBP currently predicts transitions accurately. One of these outcomes occurs: (1) The agent will encounter an unknown $\langle s, a\rangle$ (explore) with high probability. (2) The agent will execute a policy $\pi_t$ whose value is better or not much worse than the teacher's policy $\pi_T$ (exploit). (3) The teacher's trace will encounter an unknown $\langle s, a\rangle$ (explain) with high probability.

Lemma 1 proves Theorem 1 because MBP can only make a polynomial number of mistakes, meaning cases (1) and (3) can only happen a polynomial number of times. Below is a sketch of the lemma's proof.

Proof. The quantity $V^{\pi_T}(s_0)$ is the value, in the real environment, of the teacher's policy and $V^{\pi_A}(s_0)$ is the value, in the real environment, of the agent's current policy. Analogously, we can define $U^{\pi_T}(s_0)$ as the value, in the agent's learned MDP, of the teacher's policy and $U^{\pi_A}(s_0)$ as the value, in the agent's learned MDP, of the agent's policy.

By any of several simulation lemmata, such as Lemma 12 from [11], if $|U^{\pi_A}(s_0) - V^{\pi_A}(s_0)| > \epsilon$, then, with high probability, case (1), explore, will happen. That's because executing $\pi_A$ in the real environment will produce a sample of $V^{\pi_A}(s_0)$ and the only way it can be different from the agent's conception of the policy's value, $U^{\pi_A}(s_0)$, is if an unknown $\langle s, a\rangle$ is reached with sufficiently high probability.


Next, we consider the case where $U^{\pi_A}(s_0)$ and $V^{\pi_A}(s_0)$ are approximately equal. If $V^{\pi_A}(s_0) \ge V^{\pi_T}(s_0) - \epsilon$, that means $\pi_A$ is nearly optimal relative to $\pi_T$, and case (2), exploit, happens.

Finally, we consider the case where $U^{\pi_A}(s_0)$ and $V^{\pi_A}(s_0)$ are approximately equal and $V^{\pi_A}(s_0) < V^{\pi_T}(s_0) - \epsilon$. Note that $U^{\pi_A}(s_0) \ge U^{\pi_T}(s_0)$ (because $\pi_A$ was chosen as optimal). Chaining inequalities, we have $U^{\pi_T}(s_0) \le U^{\pi_A}(s_0) \le V^{\pi_A}(s_0) - \epsilon < V^{\pi_T}(s_0) - 2\epsilon$. We're now in a position to use a simulation lemma again: since $|U^{\pi_T}(s_0) - V^{\pi_T}(s_0)| > 2\epsilon$, then, with high probability, case (3), explain, will very likely happen when the teacher generates a trace.

In summary, KWIK-learnable models can be efficiently learned in the autonomous RL case, but MB learning is insufficient for exploration. MBP covers both of these classes, and is sufficient for efficient apprenticeship learning, so models that were autonomously learnable as well as many models that were formerly intractable (the MB class) are all efficiently learnable in the apprenticeship setting. As an example, the combination lock described earlier could require an exponential number of tries using a KWIK learner in the autonomous case, and MB is insufficient in the autonomous case because it does not keep track of what combinations have been tried. But in the apprenticeship setting, the MBP-Agent can get the positive examples it needs (see Section 4.2) and will succeed with at most n (one for each irrelevant tumbler) valid traces.

4. EXAMPLE DOMAIN CLASSES

We now present upper bounds and algorithms for efficient apprenticeship learning in several RL representations. While some of these classes are efficiently learnable in the autonomous case, others are provably intractable.

4.1 Flat and Linear MDPs

In the autonomous setting, flat MDPs (where states are propositional members of a set $S$) can be learned with a KWIK bound of $O(\frac{S^2 A}{\epsilon^2})$ [8]. Following Theorem 1, this gives us a polynomial PAC-MDP-Trace bound, a result that is directly comparable to the apprenticeship-learning result under an earlier protocol [2]. One difference between the two protocols is that theirs requires all traces to be given before learning starts and our learners are oblivious to when they have met their goal of matching or exceeding the teacher.

The same work considered apprenticeship learning of linear dynamics. We note that these domains are also covered by Theorem 1, as recent results on KWIK Linear Regression [15] have shown that such n-dimensional MDPs are again KWIK learnable with a bound of $O(\frac{n^3}{\epsilon^4})$.

4.2 Classical MB Results

While the results above are interesting, flat and linear MDPs are known to be efficiently learnable in the autonomous case. We now describe two MBP-learnable classes that we will use as the backbone of RL algorithms for relational MDPs that are not autonomously learnable.

As mentioned earlier, KWIK and MB are separable when learning monotone conjunctions² over $n$ literals when the number of literals relevant to the conjunction ($n_R$) can be as many as $n$.

²The results extend to the non-monotone setting using the standard method of including all negated literals.

PutDown(B, To, BelowTo): Reward = −1
PRE: Holding(B) ∧ Clear(To) ∧ On(To, BelowTo) ∧ Block(B) ∧ Block(To)
ω1 (p1 = 0.7): ADD: On(B, To), EmptyHand()  DEL: Holding(B, From), Clear(To)
ω2 (p2 = 0.3): ADD: On(B, BelowTo)  DEL: UnExploded(To), On(To, BelowTo), Clear(To)

Table 1: Exploding blocks in Stochastic STRIPS.

In KWIK, conjunctions of size $k = O(1)$ are efficiently learnable: the system simply enumerates all $n^k$ conjunctions of this size, predicts $\perp$ unless all the hypotheses agree on a prediction, and eliminates wrong hypotheses. However, when the conjunction is of size $O(n)$, the $O(2^n)$ hypothesis space can result in an exponential number of $\perp$ predictions. This situation arises because negative examples are highly uninformative. In the combination lock, for example, the agent has no idea which of the $2^n$ settings will allow the lock to be unlocked, so it must predict $\perp$ at every new combination. However, if it does see this one positive example it will have learned the correct combination.

In contrast, learners in the MB setting can efficiently learn conjunctions of arbitrary size by exploiting this asymmetry. Specifically, an MB agent for conjunction learning (detailed in [7]) can maintain a set of literals $l_j \in L_H$ where $l_j = 1$ for every positive example it has seen before. If $\forall l_j \in L_H$, $l_j = 1$ in $x_t$, the agent correctly predicts true; otherwise it defaults to false. By using such defaults, which KWIK agents cannot, and by only counting the highly informative positive samples, each of which subtracts at least one literal from $L_H$, polynomial sample efficiency is achieved.
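The MB conjunction learner sketched in prose above is short enough to state directly (a sketch assuming examples are dicts mapping literal names to 0/1; the representation is ours):

class MBConjunctionLearner:
    # Keep L_H = literals that were true in every positive example so far;
    # predict true iff every literal still in L_H is true in x.
    def __init__(self, literals):
        self.L_H = set(literals)   # start from the most specific hypothesis

    def predict(self, x):
        return all(x[l] == 1 for l in self.L_H)

    def observe(self, x, label):
        # Only positive examples are informative: each one removes every
        # literal it falsifies, so L_H shrinks monotonically.
        if label:
            self.L_H = {l for l in self.L_H if x[l] == 1}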

Another MB learnable class is k-term-DNF (disjunctive normal form of $k = O(1)$ terms). k-term-DNF formulas are of the form $(l_i \wedge l_j \wedge \ldots)_1 \vee \ldots \vee (\ldots \wedge \ldots)_k$, that is, a disjunction of $k$ terms, each of at most size $n$. This class of functions is known to be MB learnable [7] by creating a conjunction of new literals, each representing a disjunction of $k$ original literals (for $k = 3$ we would have $l_{ijm} = l_i \vee l_j \vee l_m$), and then using the conjunction learning algorithm described above.
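The k-term-DNF reduction can be sketched the same way: build one derived literal per size-k disjunction of original literals and hand the lifted examples to the conjunction learner above (an illustrative encoding of the construction, under our representation assumptions):

from itertools import combinations

def lift_to_disjunction_literals(x, k):
    # Derived literal for each size-k subset of original literals;
    # it is 1 iff at least one member literal is 1 in x.
    names = sorted(x)
    return {combo: int(any(x[l] for l in combo)) for combo in combinations(names, k)}

# Usage sketch: lifted = lift_to_disjunction_literals(x0, 3)
# learner = MBConjunctionLearner(lifted.keys()); learner.observe(lifted, label); ...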

4.3 Learning Stochastic STRIPS Operators

We now describe a relational MDP class that, when given some background information about the environment's possible dynamics, can be MBP learned using MB-partition with an MB conjunction learner and a KWIK learner. The class is Stochastic STRIPS with rewards [15], where states are comprised of objects $O$ and predicates $P$ (e.g. On(b, c)). Actions $a \in A$ (see Table 1) are parameterized and their dynamics are described by two parts. First, a pre-condition $c_a$, which is a conjunction (over $P$ and the action's parameters) that determines whether the action will actually execute in the current state or return a unique "failure" signal. The second part of an action's description is a set of possible effects $\Omega^a$, where each $\omega_i^a$ is a pair of "Add" and "Delete" lists describing the changes to the current state and has an associated probability $p_i^a$. Previous work has established that while the probabilities of these outcomes cannot be efficiently learned by traditional "counting" methods, they can be KWIK learned using a linear regression technique [15]. This approach is necessary because sometimes the difference between states $s_t$ and $s_{t+1}$ is explainable by several of the effects.


Given this result and the conjunction learning algorithm above, we have the following result:

Proposition 3. Stochastic STRIPS operators are PAC-MDP-Trace learnable if the agent is given the possible effects ($\Omega^a$, but not their probabilities) beforehand, by using Algorithm 2 and MB-partitioning between an MB conjunction learner (preconditions) and a KWIK-LR learner (effect probabilities).

Proof. Each transition sample is either a "failure" ($c_a$ is false) or a transition that returns a next state $s'$ (without a failure signal), so the output spaces are disjoint as required by MB-partition. We use a conjunction learner for each action (MB-CON$_a$) to predict whether the preconditions of a grounding of that action hold. As in earlier work on deterministic STRIPS [14], these learners produce a pessimistic version of the preconditions, the most specific hypothesis possible on the conjunction. Thus, unless a series of actions exists that, with some probability, lead to the goal without failure of these pessimistic preconditions, the agent will request a trace by acting randomly for $H$ steps (or ending the episode if possible). Each $\tau$ received because of such a request or some other suboptimal policy (with respect to the teacher) will provide positive examples of the preconditions, updating each MB-CON$_a$, so no more than $|A| \cdot n$ traces will be needed, where $n$ is the number of literals within the action's scope.

Each $p_i^a$ is then learned (with KWIK-LR) separately from the conjunction using a mixture of real experience and trace tuples and has a known KWIK bound of $O(\frac{|A||\Omega|^3}{\epsilon^4})$. Thus, given $\Omega^a$, the dynamics are MBP learnable, and thus the domain can be PAC-MDP-Trace learned.

We note that these results generalize the findings of [15] (which did autonomous learning of only the probabilities) and [14], which used an MB-like algorithm to prove the efficiency of "Trace-Learning" deterministic STRIPS operators. The use of a conjunction learner in this case relies on a unique failure signal when the preconditions of an action fail. We now investigate a different type of relational MDP, with conditional outcomes that do not provide this signal.

4.4 Deterministic OOMDP Operators

Object-oriented MDPs [5] or OOMDPs are made up of objects with attributes (e.g. agent6.location = 15) and predicates that must be defined in terms of these attributes (e.g. On(A, B): A.y = B.y + 1). Actions (as in Table 2) are described by condition-effect pairs $\langle c_i^a, \omega_i^a\rangle$ such that in state $s_t$, the condition (a conjunction over the predicates) that holds (conditions may not overlap) governs which effect occurs. Effects describe changes to the objects' attributes.

We now consider the problem of learning each $c_i^a$ given the $k = O(1)$ possible effects ($\omega_1^a \ldots \omega_k^a$) for each action. Notice that this case is different than the precondition learning done in the Stochastic STRIPS case because there is no longer a single failure signal for a $c_i^a$ not matching $s_t$. Moreover, state transitions might not always allow the learner to unambiguously determine which effect occurred. For instance, invoking the MoveRight action (Table 2) when $o_1.x = 4$, and then observing $o_1.x = 5$, does not tell us which of ($\omega_1$, $\omega_2$) actually occurred, so it is not immediately clear which $c_i^a$ should be updated. Here we give a solution in the apprenticeship setting using an MB k-term-DNF learner.

MoveRight(Obj, Loc): Reward = −1
c1: ClearToRight(Loc) ∧ GoodFooting(Obj, Loc)
  ω1: Obj1.x = min(2 + Obj1.x, 5)
c2: ClearToRight(Loc) ∧ WetFloor(Obj, Loc) ∧ Freezing(Loc)
  ω2: Obj1.x = min(1 + Obj1.x, 5)
c3: WallToRight(Loc)
  ω3: Obj1.x = Obj1.x

Table 2: An OOMDP operator for walking right with a limit of x = 5.

[Figure 1 plots steps to goal against episode number in the Simulated Taxi domain for the MBP Agent and autonomous DOORmax.]

Figure 1: A KWIK learner (autonomous) and an MBP learner (apprenticeship) in the Taxi Domain (10 trials each).

Proposition 4. Deterministic OOMDP conditions are PAC-MDP-Trace learnable if the agent is given $\Omega^a$ for each $a$ beforehand, by using Algorithm 2 and an MB k-term-DNF learner.

Proof sketch. The "trick" here is, instead of representing the condition that causes an effect to occur, to learn the conditions that do not cause effect $\omega_i$, which is $c_1 \vee \ldots \vee c_{i-1} \vee c_{i+1} \vee \ldots \vee c_k$. Since each $c_j$ is an arbitrarily sized condition, we are learning a k-term-DNF for each condition not occurring. While extra steps are needed to negate experience tuples and interpret the predictions, this insight gives us the desired result.

This result extends the previous (KWIK) sample complexity results for autonomous OOMDP learning, where it was necessary to limit the size of each $c_i^a$ to $O(1)$. Unfortunately, this particular algorithm does not extend to the stochastic setting, as it will be unclear which formulas to update.

5. EXPERIMENTS

Our first experiment is in a deterministic simulated "taxi" domain (from [5]): a 5 × 5 grid world with walls, a taxi, a "passenger" and 4 possible destination cells. The agent must learn about navigation and the conditions for picking up and dropping off a passenger. We use the deterministic OOMDP representation from above, but to accommodate the autonomous agent both learners used conjunction learning routines (KWIK-Enumeration and MB-Con) and effects were constructed to be unambiguous. Figure 1 shows our MBP-Agent (using MB-Con) approach with apprenticeship learning reaching optimal behavior ahead of the current state-of-the-art autonomous approach, DOORMAX [5], a KWIK-based algorithm. Here, and in the later robot version, traces were provided at the end of each episode if the agent's policy was worse than the teacher's. Each trial in Figure 1 used 4 to 7 such traces.


[Figure 2 plots average reward against episode for MBP with an optimal teacher, MBP with a sub-optimal teacher, KWIK-Autonomous with preconditions given, and KWIK-Autonomous.]

Figure 2: KWIK autonomous learners and apprenticeship learners in a noisy 3-blocks world. Traces were given after each episode and the results were averaged over 30 trials.

Next, we consider a Stochastic STRIPS noisy-blocks world similar to Table 1 but with more benign effects: with probability 0.2, the pickup and putdown actions simply have no effect. There are also two "extra" pickup/putdown actions that do nothing with probability 0.8. Figure 2 shows 4 agents in a 3-blocks version of this domain. The MBP-Agent with an optimal teacher learns the preconditions and to avoid the extra actions from a single trace. The MBP-Agent with a suboptimal teacher (who uses the extra actions 50% of the time) eventually learns the probabilities and performs optimally in spite of its teacher. Both MBP agents efficiently learn the preconditions. In contrast, a KWIK learner given the preconditions mirrors the suboptimal-trace learner, and a KWIK learner for both the preconditions and probabilities requires an inordinate amount of exploration to execute the actions correctly. We recorded similar results with 4-blocks and exploding blocks, although 2 or 3 optimal traces are needed in those domains.

Our last experiment was a version of the taxi environment above, but now on a physical Lego Mindstorms™ robot. A human demonstrated traces by controlling the robot directly. Unlike the simulated version, the real-world transitions are stochastic (due to noise). So in this experiment, the robot was given all the conditions it might encounter and had to learn the effects and probabilities of the actions. We provided traces after each episode until it completed the task twice on its own from random start states. We ran this experiment in two different settings: one where the dynamics of the environment's borders (physical walls) were given, and one where the robot had to learn about behavior near walls. The first setting contained 6 condition settings and required 5 traces and a total of 159 learning steps. In the harder setting, there were 60 conditions to learn and the agent required 10 traces and 516 learning steps. This experiment demonstrates the realizability and robustness of our apprenticeship-learning protocol in real world environments with sensor imprecision and stochasticity.

6. RELATED WORK AND CONCLUSIONS

A number of different protocols and complexity measures have been described for apprenticeship learning in RL. The one used in this paper is an extension of the interaction described by [2], with the change that teachers no longer have to give all of their traces up front. The field of Inverse RL [1, 13], also sometimes called apprenticeship learning, attempts to learn an agent's reward function by observing a sequence of actions (not rewards) taken by a teacher. In contrast to this interaction, our learners actually see samples of the transitions and rewards collected by the teacher and use this "experience" in a traditional model-based RL fashion. Recent work in Imitation Learning [10] took MDP instances and trajectories and generalized these behaviors, but assumed that costs were linear in the feature space. Our results do not rely on any such relation.

A topic of future work is an investigation of other learnability classes (beyond KWIK and MB) or combinations that may be sufficient for MBP learning, expanding the class of known MBP-learnable models. Of particular interest are supervised-learning classes that perform MB learning in the presence of label noise [4] as well as frameworks that consider helpful examples such as teaching dimension [6].

This work has extended the previous protocol for apprenticeship learning, defined a measure of sample complexity for this interaction (PAC-MDP-Trace), and shown that two widely studied model classes (KWIK and MB) are sufficient for PAC-MDP-Trace learning. We used these findings to construct learning algorithms for RL environments that were otherwise intractable and provided empirical evidence in simulated and robot domains.

6.1 Acknowledgments

DARPA IPTO FA8650-06-C-7606 provided funding.

7. REFERENCES

[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.

[2] P. Abbeel and A. Y. Ng. Exploration and apprenticeship learning in reinforcement learning. In ICML, 2005.

[3] D. Angluin. Queries and concept learning. Machine Learning, 2(4):319-342, 1987.

[4] P. Auer and N. Cesa-Bianchi. On-line learning with malicious noise and the closure algorithm. Annals of Mathematics and A.I., 23(1-2):83-99, 1998.

[5] C. Diuk, A. Cohen, and M. L. Littman. An object-oriented representation for efficient reinforcement learning. In ICML, 2008.

[6] S. A. Goldman and M. J. Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50:303-314, 1992.

[7] M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA, USA, 1994.

[8] L. Li, M. L. Littman, and T. J. Walsh. Knows what it knows: A framework for self-aware learning. In ICML, 2008.

[9] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285-318, 1987.

[10] N. D. Ratliff, D. M. Bradley, J. A. Bagnell, and J. E. Chestnutt. Boosting structured prediction for imitation learning. In NIPS, 2006.

[11] A. L. Strehl, L. Li, and M. L. Littman. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10(2):413-444, 2009.

[12] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, March 1998.

[13] U. Syed and R. E. Shapire. A game-theoretic approach to apprenticeship learning. In NIPS, 2008.

[14] T. J. Walsh and M. L. Littman. Efficient learning of action schemas and web-service descriptions. In AAAI, 2008.

[15] T. J. Walsh, I. Szita, C. Diuk, and M. L. Littman. Exploring compact reinforcement-learning representations with linear regression. In UAI, 2009.


Learning from Human Teachers: Issues and Challenges for ILP in Bootstrap Learning

Sriraam Natarajan¹, Gautam Kunapuli¹, Richard Maclin², David Page¹, Ciaran O'Reilly³, Trevor Walker¹ and Jude Shavlik¹

¹University of Wisconsin-Madison, Department of Biostatistics, {natarasr, kunapg, page, walker, shavlik}@biostat.wisc.edu
²SRI International, Artificial Intelligence Center, [email protected]
³University of Minnesota, Duluth, Department of Computer Science, [email protected]

ABSTRACT

Bootstrap Learning (BL) is a new machine learning paradigm that seeks to build an electronic student that can learn using natural instruction provided by a human teacher and by bootstrapping on previously learned concepts. In our setting, the teacher provides (very few) examples and some advice about the task at hand using a natural instruction interface. To address this task, we use our Inductive Logic Programming system called WILL to translate the natural instruction into first-order logic. We present approaches to the various challenges BL raises, namely automatic translation of domain knowledge and instruction into an ILP problem and the automation of ILP runs across different tasks and domains, which we address using a multi-layered approach. We demonstrate that our system is able to learn effectively in over fifty different lessons across three different domains without any human-performed parameter tuning between tasks.

Categories and Subject Descriptors: I.2.3 [Artificial Intelligence]: Deduction and Theorem Proving – logic programming.

General Terms: Algorithms, Design, Reliability, Experimentation, Human Factors.

Keywords: inductive logic programming, human teachers, automating setup problem

1. INTRODUCTION

One of the long cherished goals of Artificial Intelligence (AI) is to design agents that learn by interacting with humans, performing actions, receiving guidance and/or feedback from the human and improving their performance [3]. Traditional supervised learning approaches treat learning as a problem where some problem-dependent criterion (such as learning error, possibly combined with other means to control the inductive bias) is optimized given labeled examples.

Bootstrap Learning (BL) is a new learning paradigm proposed by Oblinger [5] which views learning as knowledge acquisition. The electronic student assumes all relevant knowledge is possessed by the teacher, who teaches through human-like natural instruction methods including providing domain descriptions, pedagogical examples, telling of instructions, demonstration and feedback. In addition to teacher instruction, the student learns concepts that build upon one another through a "ladder" of lessons; lower rungs of the lesson ladder teach simpler concepts which are learned first and bootstrap (i.e., are used to learn more complex concepts).

The electronic student, called MABLE, the Modular Architecture for Bootstrap Learning Experiments [9], addresses the aforementioned limitations of the classical learning paradigm. First, MABLE consists of several different learning algorithms, which it is able to employ depending on the concept being taught, and hence can learn a diverse range of tasks across different domains. Second, by virtue of the abstracted natural instruction and its ability to bootstrap complex behaviors, MABLE can be taught by non-programmers and non-experts. Thus, while traditional learning specializes by domain, BL specializes by the various natural instruction methods.

In this paper, we focus on one particular modality of teacher input: instruction by example, including teacher hints about specific examples. We use a logic-based approach that creates learned models expressed in first-order logic, which is called Inductive Logic Programming (ILP) [4]. ILP is especially well-suited for the "learning from examples" component in MABLE for two reasons. First, it can use a rich knowledge base that may have been provided to the learner initially or may have been learned/augmented during earlier lessons. Second, the declarative representation of both examples and learned rules makes it easier for the teacher and student to communicate about what has been learned so far; for example, a teacher can identify and correct student mistakes from earlier lessons. Similarly, the use of logic allows for sharing lessons of learned knowledge between modules that learn from different kinds of instruction.

This paper makes four key contributions. First, we present an ILP based system that learns from a human teacher in the presence of a very small number of examples. Second, we present the first of its kind methodology to automatically set up ILP runs that do not require intervention by an ILP expert (or any human for that matter). Third is our algorithm that converts human advice and feedback into sentences written in first-order logic that are then used to guide the ILP search. The final and a very important contribution is the evaluation of the system in 5 different domains with teaching lessons for over 50 different concepts, where the correct concepts are learned without any modification of our algorithm between the lessons.

Cite as: Learning from Human Teachers: Issues and Challenges in Bootstrap Learning, Sriraam Natarajan, Gautam Kunapuli, David Page, Trevor Walker, Ciaran O'Reilly and Jude Shavlik, AAMAS 2010 Workshop on Agents Learning Interactively from Human Teachers (www.ifaamas.org). All rights reserved.

¹1300 University Avenue, Medical Sciences Center, Madison WI 53705; ²333 Ravenswood Avenue, Menlo Park CA 94025; ³320 Heller Hall, 1114 Kirby Drive, Duluth MN 55812


Our computerized student is scored based on several test examples for each lesson and our student achieves a near-perfect grade when given advice prepared by a third party (who is not from our institution).

2. BL Challenges

We first introduce the learning framework and then outline the challenges of ILP and BL.

2.1 Learning Framework

The learning framework consists of the teacher, the environment, and the student interacting with each other. Given a domain within which learning takes place, the concepts to be learned are organized as lessons within a curriculum created by a separate group of researchers from outside our institution and not under our control. A lesson may be taught by more than one so-called natural instruction method. A lesson that teaches more complex concepts is broken down into two or more simpler lessons which are learned first, and the more complex lesson is bootstrapped from the simpler ones. The structure of the curriculum is analogous to a "lesson" ladder with lower rungs representing simpler concepts and the complexity of the lessons increasing as we climb higher.

The teacher interacts with the student during teaching lessons using utterance messages, and with the simulator using imperative messages, which are actions that change the world state. The teacher can test during testing sessions with imperative messages requiring MABLE to answer questions, and the teacher then evaluates the student's responses by providing a grade.

2.2 Inductive Logic Programming

ILP combines principles of two of the most important fields of AI: machine learning and knowledge representation. An ILP system learns a logic program given background knowledge as a set of first-order logic formulae and a set of examples expressed as facts represented in logic. In first-order logic, terms represent objects in the world and comprise constants (e.g., Mary), variables (X), and functions (fatherOf(John)). Predicates are functions with a boolean return value. Literals are truth-valued and represent properties of objects and relations among objects, e.g. married(John, Mary). Literals can be combined into compound sentences using connectives such as AND, OR and NOT. It is common [10] to convert sets of sentences into a canonical form, producing sets of clauses. We are developing a Java-based ILP system called WILL.

Now, consider the ILP search space presented in Figure 1, where logical variables are left out

for simplicity and the possible features are denoted by a letter in A

through Z. Let us assume that the true target concept is a

conjunction of the predicates A, Z, R, and W. ILP's search space without relevance is presented within the dashed box. Normally, ILP adds literals one after another, seeking a short rule that

covers all (or most of the) positive examples and none (or few) of

the negatives. If there are n predicates then this can lead to a

search of O(n!) combinations to discover the target concept. As

can be seen by the portion of the search space that is outside the

box, if a human teacher tells the ILP system that predicates A, Z,

and R are relevant to the concept being learned, the amount of

search that is needed can be greatly reduced. Such reduction can

enable an ILP system to learn from a rather small number of

examples. In the example, the teacher's hint specifies 3 out of 4

predicates that should appear in the target concept and hence an

ILP learner needs to search over a smaller set of hypotheses to

discover the correct concept.
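To make the reduction concrete, the toy Python sketch below counts candidate conjunctive rule bodies with and without a relevance hint over the A-Z feature alphabet of Figure 1; the counting function and the length limit are illustrative assumptions, not part of WILL.

from itertools import combinations

def count_candidate_rules(predicates, max_len):
    # Number of conjunctive rule bodies of length 1..max_len over the given predicates.
    return sum(len(list(combinations(predicates, k))) for k in range(1, max_len + 1))

all_preds = [chr(c) for c in range(ord('A'), ord('Z') + 1)]   # features A through Z, as in Figure 1
relevant = ['A', 'Z', 'R']                                     # teacher hint: 3 of the 4 target literals

without_hint = count_candidate_rules(all_preds, max_len=4)
# With the hint, the literals A, Z, and R are kept, so only the remaining literal
# has to be found among the other predicates.
with_hint = count_candidate_rules([p for p in all_preds if p not in relevant], max_len=1)
print(without_hint, with_hint)   # 17901 vs. 23 candidate rule bodies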

Of course if the teacher is malicious or incompetent, then teacher-

provided hints will increase the number of hypotheses that need to

be considered since they increase the branching factor of the space

being searched, but in this work we assume the human teacher has

useful things to say, even if a bit imperfectly (teacher errors of

omission are less harmful to WILL than errors of commission, as

Figure 1 illustrates). The major BL challenge for ILP is that it has

to be used, not only for different lessons within the same domain,

but also across different domains; this necessitates the automation

of the ILP setup problem without the intervention of an ILP

expert.

Another important aspect requiring automated ILP runs is that the

parameter settings cannot change between different runs. We

cannot expect any human guidance regarding settings and need to

find good default values that work broadly. In practice, our algorithms themselves try out a few parameter settings and use

cross validation to choose good settings. However, given the

large number of parameters in typical ILP systems (maximum rule

length, modes, minimal acceptable accuracy of learned clauses,

etc.), our algorithms cannot exhaustively try all combinations and

hence must choose an appropriate set of candidate parameters that

will work across dozens of learning tasks.
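A minimal sketch of this layer of the setup, assuming a cross_val_accuracy callable that scores a candidate setting on the few labeled examples; the parameter names and candidate values below are made up for illustration.

from itertools import product

# Illustrative candidate values only; a real ILP system exposes many more parameters.
CANDIDATES = {
    "max_rule_length": [3, 5, 7],
    "min_clause_accuracy": [0.75, 0.9],
}

def choose_parameters(examples, cross_val_accuracy):
    # Pick the candidate setting with the best cross-validated score on the handful of examples.
    best_score, best_setting = float("-inf"), None
    for values in product(*CANDIDATES.values()):
        setting = dict(zip(CANDIDATES.keys(), values))
        score = cross_val_accuracy(examples, setting)   # e.g., leave-one-out cross validation
        if score > best_score:
            best_score, best_setting = score, setting
    return best_setting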

The goal of our ILP-based agent is to translate the teacher's instructions into first-order logic. The instructions can be labels on examples, as well as advice and/or feedback about these examples. We have created an interpreter that converts the advice to first-order logic by combining and generalizing the advice from individual examples and uses a cost-based search through the possible set of hypotheses to learn the target concept. BL also provides the opportunity for the student to refine its concept if it had learned an incorrect one. This setting is called learning by feedback, where the teacher provides explicit feedback such as providing the correct answer, pointing to important features or previously learned concepts that the student should consider, etc. Our interpreter also interprets such feedback provided by the teacher and refines its learned concept.

2.3 BL Domains & Challenges
The domains of the BL project are Unmanned Aerial Vehicle (UAV) control, Automated Task Force (ATF), and International Space Station (ISS).

Figure 1. Sample search space to illustrate the usefulness of relevant statements to ILP.


UAV Domain Description: This domain involves operating a

UAV and its camera to execute a reconnaissance mission. Tasks

include determining if the UAV has enough fuel to accomplish a

mission, achieving appropriate latitude, altitude, learning if there

is a single (or multiple) stopped (or moving) truck(s) in a scenario,

whether an object (say a truck, building, or intersection) is near another object of interest. The idea is that the UAV is flying

around and has to automatically identify scenarios that are

potentially interesting from the defense perspective.

Figure 2 presents the lesson hierarchy for the domain. Each lesson

is presented as an oval in the figure. An arrow between lessons indicates the bootstrapping relationship between them. For example, an arrow between Near and TruckIsAtIntersection indicates that the latter lesson requires the concept learned by the former.

Figure 2. UAV lesson hierarchy: a relationship A→B between lessons A and B indicates that B uses A in its concept.

UAV Challenges: The learner has to deal with complex

structures such as position, which consists of attributes such as

latitude, longitude, altitude, etc. Encoding these spatial attributes

as part of one position literal would enable WILL to learn a

smaller clause, but would increase the branching factor during

search due to the additional arguments introduced by such a large-

arity predicate. Representing these spatial attributes as separate

predicates would decrease the branching factor at the expense of

the target concept being a longer clause. In addition, the tasks

involve learning the concept of "near" that can exist between any

two objects of interest. In a later lesson, this concept might be

used, for instance, to determine if a truck is at an intersection in

which case the objects must be specialized to be of the types truck

and intersection. It is a challenge for ILP systems to

automatically generalize and specialize at different levels of the

type hierarchy. Finally, this domain requires extensive

"bootstrapping" as can be seen from Figure 2, which presents a

hierarchy organizing the UAV lessons, and requires the object

hierarchies to be able to generalize across different lessons.

ATF Domain Description: The goal of the ATF domain is to

teach the student how to command a company of armored

platoons to move from one battlefield location to another in

accordance with military doctrine. The lessons are organized

based on the complexity of tasks. At the lowest level are the tasks

concerning individual vehicles and segments of vehicles. At a

higher level are the tasks concerning platoons (sets of segments)

while at the top-most level are the tasks of a company which is a

set of platoons.

ATF Challenges: ATF poses at least two key challenges for the application of ILP to BL. First is the presence of a large number of numeric features. For example, there are distances and angles

between vehicles, segments, platoons, and companies. For each

of these objects, there are numeric attributes such as direction,

location (in three dimensions), speed, etc. All these numeric

features require ILP to select good thresholds or intervals, which

can lead to a large number of features. The second important

challenge is the deep nesting of the object structure. Each

company has a list of platoons each of which has a list of

segments that contain a list of vehicles. This deep nesting requires

ILP to construct the predicates and features at the appropriate

level of the object hierarchy. While this might not appear as a

major issue with individual runs, it should be noted that the same

settings have to be used across all the lessons for all the domains.

ISS Domain Description: The ISS curriculum places the student in the role of a flight controller who must detect and diagnose

problems within the thermal control system of the International

Space Station. The lessons include teaching the student what

constitutes an emergency/alert, and how to post observation

reports concerning actionable alerts. Examples of these include

learning the conditions for alerting abnormal observations, warning, emergency, and caution.


ISS Challenges: This domain poses several issues that are not

prominent in the other ones. The key challenge is that the number

of features in the domain is very large. The fact-base of the

domain consists of all the small parts, measurements, reports of

the ISS and hence is significantly larger than the other domains

(hundreds of features for a single example). A direct consequence is that the amount of time taken to construct the predicates is far greater than in the other domains. This is an important issue because the learning strategies (in our case, learning by examples) have a fixed learning time. Within this time limit, the student has to interpret the teacher's statements, convert them to its internal representation (in our case, first-order logic statements), learn, and get evaluated on the test example. Unlike the other domains, this one is not inherently relational. There are specific valves and meters that should be considered while learning the target concept. ILP, which is a powerful tool for learning first-order logic models that allow for generalization, needs to consider objects at the ground level in this domain.

3. SOLVING BL PROBLEMS
We now present the two main steps of our approach, namely,

interpreting relevance and adopting a multi-layered strategy for

automating ILP runs.

3.1 Interpreting Relevance
One of the key challenges in BL is learning from a very small

number of examples. A human teacher will not spare the time and

effort to specify thousands of examples that any common machine

learning algorithm requires to learn a reasonable target concept.

Instead, the human teacher provides some information (that we

call relevance statements or advice) about the target concept that

the student uses to accelerate its learning. For instance, when

teaching the concept of a full fuel tank, the teacher might gesture

to the fuel capacity of the tank and the current fuel level of the

tank. The student might then infer the relationship between the

two attributes. The main advantage of such a specification of

relevant attributes is that it drastically reduces the search space

(i.e., the search through the list of possible features in the target

concept). Note that many possible features, such as color, length,

weight, tire pressure, etc. could spuriously discriminate between

positive and negative examples if the total number of examples is


very small (which is the case in the BL lessons, see Section 5).

Thus, relevance statements become very critical in discovering the

correct target concept.

We next outline our algorithm for interpreting the relevance

statements provided by the teacher. We first illustrate the process

of interpretation with an example before presenting the algorithm

formally. Consider the lesson, RecognizeSingleStoppedTruckScenario

in the UAV domain. The goal in this lesson is to identify if there

is one and only one stopped truck in the scenario. We now

present the teacher utterances followed by our interpretation of the

statements.

RelevantRelationship(arg1 = SameAs(arg1 = 1, arg2 = GetLength(arg1 = Of(arg1 = actors), arg2 = Scenario(actors = [Truck(name = Truck19, latitude = -10, longitude = 10, moveStatus = Stopped)]))))

pred3(S) IF Scenario_actors(S,L),length(L,I),sameAs(I,1).

Advice is provided using Relevant statements in BL. In the above

statement, the teacher states that the length (size) of the actor list

of the current scenario should be 1. After the above relevance statement, the teacher proceeds to give further instructions, here

talking about a different example:

Gesture(atObject = Truck(name = Truck17, latitude = -10, longitude = 10, moveStatus = Stopped))

Similarly, rules will be created for the other relevant statements

corresponding to the instance of and the move status of the truck.

RelevantRelationship(arg1 = InstanceOf(arg1 = this, arg2 = Truck))

In the above statements, the teacher first gestures at (points to) an

object (Truck17 in this case) and explains that it being an instance

of a truck is relevant to the target concept. The teacher further

utters the following:

RelevantRelationship(arg1 = SameAs(arg1 = Of(arg1 = moveStatus, arg2 = Indexical(name = this)), arg2 = Stopped))

The above statement specifies that the moveStatus of the truck

being "Stopped" is relevant to the target concept. The term

Indexical is used to access the object most recently gestured at by the teacher. Hence, Indexical(name = this) here refers to the truck that was gestured at earlier. In summary, the teacher utters that the actors list of the scenario must be of size 1, that the object in that list must be of the type truck, and that its move status must be equal to stopped. We will now proceed to explain how

WILL interprets these statements and constructs background

knowledge and partial answers correspondingly.

First, WILL identifies the interesting and relevant features from

the above statements. WILL first creates the following interesting

predicates:

isaInterestingComposite(Truck19)
isaInterestingComposite(Truck17)
isaInterestingNumber(1)
isaInterestingComposite(Scenario1)
isaInterestingSymbol(Stopped)

A key challenge when dealing with teacher-instruction about

specific examples is "what should be generalized (i.e., to a logical

variable) and what should remain constant?" The above facts

provide WILL with some candidate constants that should be

considered; WILL initially uses variables for all the arguments in

the rules it is learning, but it also considers replacing variables

with constants. WILL next creates the following relevant

statements:

relevant: Vehicle_moveStatus
relevant: Scenario
relevant: Scenario_actors
relevant: Truck
relevant: sameAs

The features (attributes, objects, and relations) that are mentioned

in the relevant statements are considered as relevant for the target

concept. Consequently, these features get lower scores when

searching through the space of ILP rules and computing the cost

of candidate rules. WILL then proceeds to construct rules

corresponding to the relevance statements it receives. In the

following rules, assume S is of type scenario, L is of type list, T is

of type truck, and I is an integer. Following Prolog notation,

commas denote logical AND. One rule WILL creates from teacher-provided instruction is:

pred3(S) IF Scenario_actors(S,L),length(L,I),sameAs(I,1).

The above rule is constructed from the first relevant statement (of

a positive example) that specifies that the length of the actors list

in a scenario must be of size 1. A rule will now be constructed for

the gesture that points at Truck19 in the list.

pred5(T,S) IF Scenario_actors(S,L),member(T,L)

pred7(T,S) IF Truck(T, S)

The above rules use the previous rule in asserting that the object

that is a member of the list is of the type truck. Finally, the last

relevance statement is interpreted as:

pred9(T,S) IF moveStatus(T,S),sameAs(S,stopped)

Once these rules are created for a particular example, WILL

creates the combinations by combining the pieces of advice using

the logical connective AND.

relevantFromPosEx1(S) IF pred3(S),pred5(T,S),pred7(T,S), pred9(T,S)

WILL then proceeds to construct similar statements for the second

example. Once all the individual examples are processed and the

rules are created for each of the examples, WILL then proceeds to

construct combinations of the rules in order to generalize across

all the examples. The simplest combination is the combination of

all rules from all positive examples and all the rules from all

negative examples.

posCombo(S) IF relevantFromPosEx1(S),...,relevantFromPosExN(S)

Similarly the negCombo is constructed by taking the negation of

the negative relevantANDs.

negCombo(S) IF ~relevantFromNegEx1(S),...,~relevantFromNegExN(S)

We denote the negation of a concept by ~. Hence, by now our

rules generalize positive and negative examples separately. Then

WILL constructs the cross product across the different

combinations and adds them to the background.

allCombo(S) IF posCombo(S),negCombo(S)

All the predicates (relevantFrom's, posCombo, negCombo, allCombo, etc.) are added to the background during search and are marked as being relevant to the target concept. We also combine

the advice about examples using the logical connective OR. We

use both AND and OR to combine because a human teacher might

be teaching the computer learner a new conjunctive concept, with

each example illustrating only one piece of the concept, or the

teacher might be teaching a concept with several alternatives, with


each alternative illustrated via a different example. We refer to

such rules as comboRules.
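The combination step can be sketched as follows in Python, using simplified rule strings rather than WILL's internal clause objects; the helper names (per_example_rule, posComboOr) and the pred11 literal are ours, not the paper's.

def per_example_rule(name, advice_literals):
    # AND together the advice literals derived from a single example.
    return f"{name}(S) IF " + ",".join(advice_literals)

def combo_rules(pos_advice, neg_advice):
    # pos_advice / neg_advice: one list of advice literals per positive / negative example.
    rules, pos_heads, neg_heads = [], [], []
    for i, literals in enumerate(pos_advice, 1):
        rules.append(per_example_rule(f"relevantFromPosEx{i}", literals))
        pos_heads.append(f"relevantFromPosEx{i}(S)")
    for i, literals in enumerate(neg_advice, 1):
        rules.append(per_example_rule(f"relevantFromNegEx{i}", literals))
        neg_heads.append(f"~relevantFromNegEx{i}(S)")
    # AND combination: a conjunctive concept, each example showing one piece of it.
    rules.append("posCombo(S) IF " + ",".join(pos_heads))
    rules.append("negCombo(S) IF " + ",".join(neg_heads))
    rules.append("allCombo(S) IF posCombo(S),negCombo(S)")
    # OR combination: a disjunctive concept, each example showing one alternative (';' is Prolog OR).
    rules.append("posComboOr(S) IF " + ";".join(pos_heads))
    return rules

print("\n".join(combo_rules(
    pos_advice=[["pred3(S)", "pred5(T,S)", "pred7(T,S)", "pred9(T,S)"]],
    neg_advice=[["pred11(S)"]])))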

The algorithm for interpreting relevance is presented in Table 1.

(Our ILP system can handle tasks that involve multiple categories

by repeatedly treating them as "1 versus the others" classification

problems.) WILL interprets the advice and creates relevant

features corresponding to the objects and features mentioned in

the relevant statements, as illustrated above.

The net result is that our algorithm has hypothesized a relatively

small number of individual and 'compound' general rules that can

be evaluated using the (small number of) labeled examples

provided by its human teacher. Should these prove insufficient,

WILL can combine and extend them (by using the 'primitive' features in the domain at hand) through further search of the hypothesis space.

Table 1. Algorithm for Interpreting Relevance.

For each category (e.g., TRUE and FALSE)
    For each example
        For each relevant statement about that example
            Construct relevant features
        Construct relevant rules for the particular example
    Combine the rules from individual examples to form "combo" rules about the current category.
Combine the rules from different examples to form "mega" rules about the concept as a whole.

3.2 Multi-Layered Strategy
One of the key issues with several machine learning methods

is the periodic intervention by the domain expert to select

features, tune parameters and set up runs. This is particularly true

of ILP where researchers face the problem of designing new

predicates, guiding ILP’s search, setting additional parameters,

etc. BL brings a major challenge for ILP in this area, because

WILL must automatically set up training without the intervention

of an ILP expert. This is needed because human teachers cannot

be expected to understand the algorithmic details of a learning

approach; rather, they communicate with the student in as natural and human-like a dialog as is feasible [8]. This necessitates

the guiding of search automatically in a domain independent

manner. Automatic parameter selection methods such as the one

proposed in [1] are not useful in our system due to the fact that we

do not have access to a large number of examples. Instead we

resort to a multi-layered strategy that tries several approaches to

learn the target concept.

Table 2 presents the algorithm of the multi-layered strategy, called

Onion. The innermost layer implements the basic strategy:

invoking WILL after automated mode construction, using only the

relevant combinations of features (as told by the teacher). This

means that WILL initially explores a very restricted hypothesis

space. If no theory is learned or if the learned theory has a poor

score (based on heuristics), then the hypothesis space is expanded,

say by considering features mentioned by the teacher. Continuing

this way, our multi-layered approach successively expands the

space of hypotheses until an acceptable theory is found. At each

level, the algorithm considers different clause lengths and different

values of coverage (#pos examples covered - #neg examples

covered). Whenever a theory that fits the specified criteria is

found, the algorithm returns the theory.

As a final note, while the teacher and the learner follow a

fixed protocol while communicating via Interlingua, interpreting

relevance amounts to more than simply implementing a rule-based

parsing system. This is because of the ambiguity that is prevalent

in every teacher relevance statement, in particular as to how

general the advice is. It is this ambiguity, of whether the teacher

advice is about specific examples or applies to all examples

generally, that necessitates a relevance interpreter as in Table 1.

Table 2. Multi-layered Strategy.

Procedure: Onion(facts, background, examples) returns theory
// n positive examples and m negative examples
While (time remaining)
    1. Include only combo-rules that are generated by WILL for the search. Call WILLSEARCH. If perfect theory found, return.
    2. Expand search space to include all relevant features. Call WILLSEARCH. If perfect theory found, return.
    3. Expand search space to include all features. Call WILLSEARCH. If perfect theory found, return.
    4. Flip the example labels and call Onion with the new examples.
End-while
If no theory learned, return the largest combo-rule

Procedure: WILLSEARCH returns theory
For rule length 2 to maxRuleLength
    For coverage = n to n/2
        Search for an acceptable theory. If found, return it.
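For readability, here is one way the Onion control flow of Table 2 could be rendered in Python; the search_fn callable, the feature-scope labels, and the time handling are assumptions standing in for WILL's actual implementation.

import time

def will_search(facts, background, examples, feature_scope, search_fn, max_rule_length):
    # Inner loop of Table 2: widen the rule length and relax the required coverage until a theory is found.
    n_pos = sum(1 for _, label in examples if label)
    for rule_len in range(2, max_rule_length + 1):
        for coverage in range(n_pos, n_pos // 2 - 1, -1):   # required (#pos covered - #neg covered)
            theory = search_fn(facts, background, examples, feature_scope, rule_len, coverage)
            if theory is not None:
                return theory
    return None

def onion(facts, background, examples, search_fn, time_budget_s, max_rule_length=7):
    # Outer loop of Table 2: successively widen the hypothesis space, then retry with flipped labels.
    deadline = time.time() + time_budget_s
    flipped = False
    while time.time() < deadline:
        for scope in ("combo_rules_only", "all_relevant_features", "all_features"):
            theory = will_search(facts, background, examples, scope, search_fn, max_rule_length)
            if theory is not None:
                return theory
        if flipped:
            break
        examples = [(x, not label) for x, label in examples]   # step 4: flip the example labels
        flipped = True
    return None   # the caller falls back to the largest combo-rule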

3.3 A Layered Approach for ILP
Having outlined the relevance interpreter and Onion, we now

present the complete learning strategy in Table 3. We first parse

all the Interlingua messages and create examples, both positive

and negative. In the case of multi-class problems, we pose the

problem as one vs. others. Then the ground facts and background

knowledge are constructed. The relevance interpreter then creates

the comboRules. Finally, Onion is called with the background

rules and facts to learn the target concept. Once the target concept

is learned, the teacher evaluates the target on a few test examples.

If the theory is unacceptable, the teacher provides more examples

and/or relevant statements as feedback, thus aiding WILL to

learn a better concept.

Table 3. Learning Strategy.

Procedure: Strategy(IL Messages) returns theory
1. Construct examples (pos and neg), facts (ground truth), and background
2. Parse relevant statements, construct comboRules, and add to background
3. Call Onion(facts, background, examples)
4. If an acceptable theory is found, return the theory.
   Else call for the feedback lesson to obtain more examples and/or relevant statements. Go to Step 1.

4. ADDITIONAL ISSUES
Generation of Negative Examples. In general, ILP requires

a large number of examples to learn a concept. While this is a

challenge in all of machine learning, the need to learn complex

relational concepts in first-order logic makes it even more so in

ILP. In some domains, it is natural for a teacher to say that a


particular world state contains a single positive example; for

example, it is natural for a teacher to point to a set of three blocks

and state that they form a stack. It is a reasonable assumption that

various combinations of the rest of the blocks in that scene do not

form a stack and hence, WILL assumes these are (putative)

negative examples. We have found that for most of the lessons

provided in BL there is such a need for automatically constructing

negatives because instruction contains mainly positive examples.

Another way to express negative examples is to say some

world state does not contain any instances of the concept being

taught: "the current configuration of blocks contains no stacks".

Assume the teacher indicates isaStack takes three arguments,

each of which is of type block. If WILL is presented with a

world containing N blocks where there are no stacks, it can create

N^3 negative examples. In general, negative examples are

generated by instantiating the arguments of predicates whose

types we may have been told, in all possible ways using typed

constants encountered in world states; examples known to be

positive are filtered out. Depending on the task, the student may

have either teacher-provided negatives or induced negatives. As

we do not want to treat these identically, WILL allows costs to be

assigned to examples ensuring that the cost of covering a putative

negative can be less than covering a teacher-provided one.
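A simplified Python rendering of this typed instantiation; the predicate signature, the constants, and the cost value are illustrative assumptions.

from itertools import product

def putative_negatives(arg_types, constants_by_type, positives, negative_cost=0.5):
    # Instantiate a predicate's typed arguments in all possible ways and drop known positives.
    # arg_types:         e.g. ("block", "block", "block") for isaStack/3
    # constants_by_type: e.g. {"block": ["b1", "b2", "b3", "b4"]}
    negatives = []
    for args in product(*(constants_by_type[t] for t in arg_types)):
        if args not in positives:
            # Putative negatives carry a lower misclassification cost than teacher-provided ones.
            negatives.append((args, negative_cost))
    return negatives

negs = putative_negatives(("block",) * 3,
                          {"block": ["b1", "b2", "b3", "b4"]},
                          positives={("b1", "b2", "b3")})
print(len(negs))   # 4^3 - 1 = 63 putative negatives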

Learning the Negation of a Concept. Human teachers typically

gauge the difficulty of concepts being taught by human

comprehensibility, in terms of which, accurate, short, conjunctive

rules are preferred. When learning concepts such as

outOfBounds in a soccer field, the target concept might have a

large set of disjunctions (since it can be out of bounds on any of

four sides). It is easier to learn if the ball is in bounds and then

negate the learned concept. Our learning bias here is that our

benevolent teacher is teaching a concept that is simple to state, but

we are not sure if the concept or its negation is simple to state, so

we always consider both. For a small number of examples, it is

usually hard to learn a disjunctive rule, especially if the examples are not the best ones but only 'reasonable' ones, in that they were near the boundaries but not exactly next to them.
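The bias can be sketched as follows, with learn_theory and theory_length standing in for WILL's actual search and clause-length measure.

def learn_concept_or_negation(examples, learn_theory, theory_length):
    # Learn both the concept and (by flipping labels) its negation, and keep the shorter theory;
    # a theory learned on flipped labels is negated at prediction time.
    direct = learn_theory(examples)
    flipped = learn_theory([(x, not label) for x, label in examples])
    if direct is not None and (flipped is None or theory_length(direct) <= theory_length(flipped)):
        return direct, False     # use the theory as-is
    return flipped, True         # treat this theory as the negation of the target concept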

5. CONCLUSION
As mentioned earlier, our implemented system perfectly learned

(100% accuracy) 56 lessons from a combination of training

examples and teacher-provided hints. Running our ILP system

without these hints - i.e., only using the training examples, for

which there was an average of 7.6 labeled examples per concept

taught - produced an average accuracy on held-aside examples of

63.9% (there was a 50-50 mixture of positive and negative

examples, so 50% is the default accuracy of random guessing).

We have shown how the naturally provided human advice can be

absorbed by an ILP approach in order to learn a large number of

concepts across a handful of domains. None of the advice, nor the

lessons solved were created by us. Instead our task was to make

effective use of the provided advice to learn the intended concepts

while given only a small number of labeled examples.

The ILP approach allows learning to be applied to much richer

types of data than the vast majority of machine-learning methods,

due to its use of first-order logic as a representation for both data

and hypotheses. However, ILP requires substantial experience to

properly set up the 'hypothesis space' it searches. The natural

teacher-learner interaction in our BL project is being interpreted

by WILL as guidance for defining ILP hypothesis spaces, as well

as biasing the search in such spaces toward the most promising

areas. Finally, it should be noted that while these teacher

instructions significantly influence the ILP algorithm in terms of

which hypotheses it considers, the algorithm is still able to make

additions to the teacher's instructions; decide which teacher

instructions should be kept and which should be discarded; and

choose how to integrate instructions about individual examples

into a general concept. In other words, the human teacher is

advising, rather than commanding, the student who still has the

capability to make decisions on its own.

Human advice taking has long been explored in AI in the context

of reinforcement learning [2], where the knowledge provided by

the human is converted into a set of rules and knowledge-based

neural networks are used to represent the utility function of the

RL agent. Advice has also been incorporated in ILP systems [7] to

learn constant-free Horn clauses. The key difference in our system

is the presence of a very small number of examples.

Currently, we are focusing on our layered approach to more robustly automate ILP across these different tasks. We are also looking at more richly exploiting teacher-provided feedback beyond statements about which features and objects are relevant. One possible future direction is to refine the learned theories using teacher feedback, along the lines of theory refinement for ILP [6]. Refining the teacher's advice is important, as it allows for teacher mistakes.

6. ACKNOWLEDGMENTS
The authors gratefully acknowledge support of the Defense

Advanced Research Projects Agency under DARPA grant

FA8650-06-C-7606. Views and conclusions contained in this

document are those of the authors and do not necessarily represent

the official opinion or policies, either expressed or implied, of the

US government or of DARPA.

7. REFERENCES
[1] Kohavi, R. and John, G. Automatic parameter selection by

minimizing estimated error. In ICML, 1995.

[2] Maclin, R. and Shavlik, J. W. Creating advice-taking

reinforcement learners. Mach. Learn., 22, 1996.

[3] McCarthy, J. The advice taker, a program with common sense.

In Symp. on the Mechanization of Thought Processes, 1958.

[4] Muggleton S. and De Raedt, L. Inductive logic programming:

Theory and methods. Journal of Logic Programming,

19/20:629–679, 1994.

[5] Oblinger, D. Bootstrap learning - external materials.

http:// www.sainc.com/bl-extmat, 2006.

[6] Ourston, D. and Mooney, R. Theory refinement combining

analytical and empirical methods. Artificial Intelligence,

66:273–309, 1994.

[7] Pazzani, M. and Kibler D. The utility of knowledge in

inductive learning. Mach. Learn. 9:57–94, 1992.

[8] Quinlan, J. R. Induction of decision trees. Mach. Learn.,

1(1):81–106, 1986.

[9] Shen, J., Mailler, R., Bryce, D. and O’Reilly, C. MABLE: a

framework for learning from natural instruction. In AAMAS,

2009.

[10] Russell, S. and Norvig, P. Artificial Intelligence: A Modern

Approach (Second Edition). Prentice Hall, 2003.


Learning to Ask the Right Questions

Melinda Gervasio, SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025, 1-650-859-4411, [email protected]
Eric Yeh, SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025, 1-650-859-6134, [email protected]
Karen Myers, SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025, 1-650-859-4833, [email protected]

ABSTRACT

Asking questions is an integral part of learning. It can be used to

clarify concepts, test hypotheses, fill in missing knowledge, or

acquire additional knowledge to facilitate learning. The latter

motivates the work described in this paper. By asking questions,

our system obtains information that serves as background

knowledge for a base learner, providing it with the ability to make

useful generalizations even with few training examples. In

previous work, we developed static strategies for question asking.

Here, we extend that work with a learning approach for acquiring

question-asking strategies that better accommodates the

interdependent nature of questions. We present experiments

validating the approach and showing its usefulness in acquiring

efficient, context-dependent question-asking strategies.

Categories and Subject Descriptors

I.2.6 [Learning]: Concept learning, Induction, Knowledge acquisition

General Terms
Algorithms, Performance, Experimentation

Keywords
Active learning, decision tree induction, metalearning, preference learning

1. INTRODUCTION
Imagine a car salesman trying to determine a customer's preferences so he can best decide which models to offer. Most likely, he will ask a few questions to try to quickly narrow down the kind of cars the customer is interested in before presenting the customer with cars. Using the customer's reactions, he can refine his model of the customer's preferences and adjust subsequent offers accordingly, in due time figuring out the cars most preferred by the customer. The salesman might be able to come to the same conclusion by just offering the customer one car after another and gradually figuring out the customer's preferences that way. But most customers will not willingly provide feedback on car after car after car. By asking questions to obtain some background information first, the salesman can figure out the customer's preferences more quickly. And if he asks just a few, focused questions, he simplifies the preference elicitation process for his customer as well.

Supervised machine learning algorithms rely primarily on labeled

training data to acquire models for tasks such as classification and

ranking. When training data is scarce, background knowledge can

help constrain the hypothesis space and avoid overfitting. One

approach to obtaining such background knowledge is by asking

questions of a human teacher. For example, active learning asks

for labels on selected examples [2][3][6], while active feature selection asks questions about feature relevance [12]. In most

cases, question asking is tightly integrated into a specific learner.

Our work on QUAIL (Question Asking to Inform Learning)

addresses an abstracted question-asking problem not tied to any

specific learning algorithm [7][8]. While tight integration with a particular learner clearly has its benefits, including the ability to more deeply integrate question asking into the learning algorithm, there are situations where an independent question-asking system is desirable. QUAIL is part of a large task learning

system, POIROT [1], that includes several different learning and

reasoning systems working in concert to address a single,

overarching problem. QUAIL provides POIROT with the ability

to manage question asking in a centralized manner, balancing the

demands of the different learners against the benefits to the

system as a whole.

Given questions with different costs and utilities, QUAIL's objective is to ask the highest-utility questions within a specified budget. Initially, we formulated QUAIL's question selection problem as a 0/1 knapsack problem [9]. Given a budget for

questions and a set of questions with associated costs and utilities,

QUAIL applied the knapsack algorithm to find a set of questions

with maximal utility within the given budget. To evaluate

QUAIL, we integrated it with CHARM [15], a system for learning

lexicographic preference models from training examples

consisting of pairwise preferences. While preliminary results

looked promising [8], further investigation revealed two

complicating factors regarding utility that rendered this

formulation problematic. First, question utility turned out to be

non-additive8that is, the utility of a set of questions was not

necessarily equal to the sum of the utilities of the individual

questions. Second, utility turned out to be context-dependent8that is, the utility of a question was often dependent on the answer

to previous questions.
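As a point of reference, the original knapsack formulation can be sketched as follows in Python (integer costs assumed; the question names, costs, and utilities are invented for illustration):

def select_questions(questions, budget):
    # 0/1 knapsack over (name, cost, utility) tuples with integer costs.
    best = {0: (0.0, [])}                     # budget spent -> (total utility, questions chosen)
    for name, cost, utility in questions:
        for spent, (value, chosen) in list(best.items()):   # snapshot: each question used at most once
            new_spent = spent + cost
            if new_spent <= budget and value + utility > best.get(new_spent, (-1.0, []))[0]:
                best[new_spent] = (value + utility, chosen + [name])
    return max(best.values(), key=lambda entry: entry[0])

print(select_questions([("q0", 1, 0.2), ("q4", 2, 0.5), ("q10", 2, 0.4), ("q11", 1, 0.3)], budget=3))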

A richer, more faithful utility model was clearly needed; however,

it is infeasible to expect such a model to be engineered by hand.

Indeed, our experience with the earlier experiments revealed that

our intuitions were not always right: our initial utility assignments

performed poorly and it took a number of tries to develop utility

assignments that led to better performance. In the course of our

experimentation, we also realized that questions were

interdependent: asking particular questions was only useful given

particular answers to previous questions, something our selection

strategy did not account for. For our car salesman, our static

strategy would have amounted to deciding, ahead of time, to ask a

particular set of questions to all customers, regardless of their

answers. Thus, the salesman might ask whether a customer liked


luxury cars even if the customer had already indicated a desire for

hybrid cars and there were no hybrid luxury cars.

To address these issues, we developed a learning approach to

question asking, initially focused on addressing the context-

dependent nature of question utilities. Since we are addressing a

metalearning¹ task (the objective being to find a question-asking strategy that will provide the greatest improvement for a base learner), our learner does not itself interact directly with a human

teacher. However, it uses performance data for a base learner that

itself relies on a question-asking agent to interact with a human

teacher. This agent may itself be considered a learner since its

goal is to obtain information to improve performance on some

task. This agent's learning is completely under its own initiative:

communication with the human teacher is restricted to questions

from the agent and answers from the teacher. However, the

questions may involve any content and be of any form provided

the answers can be incorporated by the base learner. The

metalearning process does not require a staged curriculum that

builds over previously learned concepts. However, for the

question-asking agent, the information acquired through question

asking is dependent on the information gleaned from previous

questions, which is the primary motivation for our work on learning

question-asking strategies. The goal of our research is not so

much to provide insight on human learning as it is to use

inspiration from the way humans ask focused questions to acquire

information.

We begin by describing our formulation of a question-asking

strategy as a decision tree. We then present our algorithm for

inducing such a decision tree from examples consisting of a base

learner's performance with the corresponding answers. Next we

describe our evaluation domain, starting with a discussion of how

question answers are incorporated into CHARM and moving on to

a presentation of the synthetic data sets we designed to investigate

the effects on learning of different target model populations and

different object distributions. We then present our evaluation of L-

QUAIL, the learning version of QUAIL, comparing its

performance against a set of handcrafted strategies. We conclude

with a discussion of related and future work.

2. LEARNING FORMULATION
To allow question selection to depend on previous answers, we

need a data structure that supports conditional branching. One

such structure is a decision tree, where the nodes correspond to

questions and the branches to answers. Our basic learning task is

thus one of decision tree induction. However, there are some

distinctive features about question-asking decision trees that make

them different from the typical decision trees learned for standard

supervised learning tasks.

In decision tree induction for classification (Quinlan, 1986), a

decision tree is constructed by selecting successive features on

which to split the training instances, represented as feature

vectors, until all the instances have the same class label or some

other stopping criterion is reached. Each node corresponds to a

feature and each branch to a feature value, and the class label of

the instances at a leaf node is the predicted class. The goal is to

classify examples as quickly and as accurately as possible, so key

to the construction of good decision trees is the splitting criterion

¹ We use metalearning in the sense of learning to learn, rather

than learning knowledge that can be transferred across domains.

for determining the next feature to test. For classification, the

splitting criterion typically relies on some measure of entropy or

impurity regarding the instances at a node and their labels.

For our metalearning problem, the training instances are the

models to be learned by the base learner: for example, preference

models for a preference learner or classifiers for a classification

learner. The features are questions and the feature values are

answers, so a model is effectively represented by its answers to

the questions. Our goal is to provide the base learner with the

most informative answers possible, so our splitting criterion is

based on the expected performance improvement from the

answers to a question across all the models at that node. More

specifically, the expected performance of the base learner at a

particular node is the average of its performance over all the

models at that node, given the answers leading to that node, and

we choose the question leading to highest expected performance

over all the resulting splits. Our decision tree induction algorithm

is summarized in Table 1.

Table 1. Decision tree induction algorithm for learning question-asking strategies.

LEARN(models, questions, answers, cost, budget, evalset)
    node = new decision tree node
    node.models = models
    node.answers = answers
    For each q in Q where q.cost <= (budget - cost)
        splits_q = {<models_a, a> | models_a = the models that answer a to q}
        q.perf = PERF(splits_q, answers, evalset)
    Select q_best as the q in Q with maximum q.perf
    node.question = q_best
    If q_best.perf < PERFORMANCE_THRESHOLD
        For each <models_a, a> in splits_qbest
            Add LEARN(models_a, Q - q_best, answers + a, cost + q_best.cost, budget) as a branch to node
    Return node

PERF(splits, answers, evalset)
    perf = 0
    nmodels = 0
    For each <models_a, a> in splits
        For each model in models_a
            perf = perf + EVAL_BASE_LEARNER(model, answers + a, evalset)
            nmodels++
    Return perf / nmodels
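An executable Python approximation of the pseudocode above; the model/question dictionaries, the eval_base_learner callable, and the threshold are placeholders rather than the actual L-QUAIL data structures.

def expected_perf(q_id, splits, answers, evalset, eval_base_learner):
    # Average base-learner performance over all models at this node, after adding each
    # model's own answer to the candidate question q_id.
    scores = [eval_base_learner(model, answers + [(q_id, a)], evalset)
              for a, group in splits.items() for model in group]
    return sum(scores) / len(scores) if scores else 0.0

def learn(models, questions, answers, cost, budget, evalset, eval_base_learner, threshold=1.0):
    # Greedily pick the affordable question whose answers most improve expected performance,
    # then recurse on each answer branch while performance is still below the threshold.
    node = {"question": None, "branches": {}, "models": models, "answers": answers}
    scored = []
    for q in questions:
        if q["cost"] > budget - cost:
            continue
        splits = {}
        for model in models:                               # group models by their answer to q
            splits.setdefault(model["answers"][q["id"]], []).append(model)
        scored.append((expected_perf(q["id"], splits, answers, evalset, eval_base_learner), q, splits))
    if not scored:
        return node
    best_perf, q_best, best_splits = max(scored, key=lambda s: s[0])
    node["question"] = q_best["id"]
    if best_perf < threshold:
        remaining = [q for q in questions if q["id"] != q_best["id"]]
        for answer, models_a in best_splits.items():
            node["branches"][answer] = learn(models_a, remaining, answers + [(q_best["id"], answer)],
                                             cost + q_best["cost"], budget, evalset,
                                             eval_base_learner, threshold)
    return node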

3. QUESTION ANSWERS AS BACKGROUND KNOWLEDGE
Building on our previous work, we decided to evaluate our

learning approach to question asking using the CHARM

preference learner as our base learner. CHARM learns

lexicographic preference models from pairwise preferences:

essentially, a partial ordering on object attributes, where the

attributes have ordinal values and preferences can be for either

direction (high or low) for each attribute [15]. We developed a set

of questions for CHARM and modified CHARM to incorporate as

background knowledge the corresponding answers for any target

model [8]. For the experiments described in this paper, we used

the binary attribute questions listed in Table 2.


Table 2. Binary attribute questions used in the experiments.² CHARM incorporates answers to these questions as background knowledge.

Question Class | Question
Attribute Relevance | Is ATT relevant?
Attribute Ordering | Is ATT1 more important than ATT2?
Value Ordering | For ATT, are LOW values preferred to HIGH?

Before learning, CHARM considers all attributes to be equally

important and, for each attribute, high and low values to be

equally preferred. Thus, without any learning or background

knowledge, CHARM predicts a tie for any two objects. CHARM

incorporates the answers to the attribute-related questions as

constraints on the learned model8more specifically, on the

relative attribute ordering and value preferences reflected by the

lexicographic preference model. Thus, even without any training

examples, answers will result in a modified model, typically

resulting in one that will predict preferences over some objects.

The question types in Table 2 are complete: there are several

combinations of question instantiations whose answers provide

complete information to CHARM, enabling it to learn a

preference model even without any training examples. Asking

questions comes at its own cost and our goal is not to replace

training but, rather, to augment it with the answers to carefully

selected questions, particularly when the base learner has very few

training examples. However, to focus our investigation purely on

the effects of question answers, we do not provide CHARM with

any training examples in the experiments discussed here.
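As a rough illustration of how such answers act as constraints, the toy Python sketch below predicts a pairwise preference from a lexicographic model assembled out of answered questions; this is our reading of the setup, not CHARM's actual code, and the attribute values and answers are invented.

def predict_preference(obj1, obj2, importance_order, value_direction):
    # importance_order: attributes from most to least important (from attribute-ordering answers)
    # value_direction:  attr -> "LOW" or "HIGH" preferred (from value-ordering answers)
    # Returns 1 if obj1 is preferred, -1 if obj2 is preferred, 0 for a tie.
    for attr in importance_order:
        if attr not in value_direction or obj1[attr] == obj2[attr]:
            continue        # unknown direction or no difference: fall through to the next attribute
        low_preferred = value_direction[attr] == "LOW"
        obj1_better = (obj1[attr] < obj2[attr]) if low_preferred else (obj1[attr] > obj2[attr])
        return 1 if obj1_better else -1
    return 0                # with no discriminating answered attribute, predict a tie

# Answers "A0 is more important than A1" and "for A0, LOW is preferred" constrain the model:
print(predict_preference({"A0": 0, "A1": 1}, {"A0": 1, "A1": 0},
                         importance_order=["A0", "A1"],
                         value_direction={"A0": "LOW"}))   # -> 1: the first object is preferred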

4. DATA
Since we are dealing with a metalearning problem, we need data

for evaluating the base learner (CHARM) as well as data for

training and testing the metalearner (L-QUAIL). Specifically, we

need the objects over which CHARM is to learn preferences, the

target models CHARM is to learn, the questions QUAIL can ask,

the answers of each target model to the questions, and CHARM's

performance numbers given different question answers.

To support controlled experimentation, we generated synthetic

data sets that varied assorted properties of the objects and the

target models. Specifically, given a fixed set of object attributes,

we created object sets that varied in how the attributes were

correlated. We also varied the target models regarding which

attributes were relevant, which attributes were equally preferred,

and which attribute value direction was preferred.

We decided to exhaustively generate performance data for all

possible question subsets for all the target models under the

different conditions. Aside from greatly facilitating

experimentation, these numbers provide us with a way to compute

an upper bound on CHARM's performance with question

answers. However, since the number of objects, models,

questions, and question subsets grows very quickly with the

number of attributes, this exhaustive generation approach limits us

to a small number of attributes, an issue we address in the

² In addition to these attribute-related questions, CHARM can also

incorporate answers to questions about objects (e.g., Is OBJ1 preferred to OBJ2?) and to more costly, set-based question

types (e.g., Which of these attributes are relevant?). These

questions were selected from a comprehensive question catalog

we designed for learning in POIROT [7].

discussion on future work. In the experiments reported here, we

use four binary attributes, which correspond to the question

instantiations in Table 3 and to the data sets in Table 4.

Table 3. Specific questions used in experiments.

Question Class | Question | ID
Attribute Relevance | Is A0 relevant? | q0
 | Is A1 relevant? | q1
 | Is A2 relevant? | q2
 | Is A3 relevant? | q3
Attribute Ordering | Is A0 more important than A1? | q4
 | Is A0 more important than A2? | q5
 | Is A0 more important than A3? | q6
 | Is A1 more important than A2? | q7
 | Is A1 more important than A3? | q8
 | Is A2 more important than A3? | q9
Attribute Value Ordering | For A0, is LO preferred to HI? | q10
 | For A1, is LO preferred to HI? | q11
 | For A2, is LO preferred to HI? | q12
 | For A3, is LO preferred to HI? | q13

Table 4. Data sets used in experiments.

Data Set | Object Att Correlations | Relevant Atts | Pref Val Dir | Equiv Atts
n4_v2_r4 | none | all four | any | none
n4_v2_r2 | none | any two | any | none
n4_v2_r2_01 | none | only A0,A1 | any | none
n4_v2_r2_23 | none | only A2,A3 | any | none
n4_v2_r2_01_23 | none | {A0,A1} or {A2,A3} | any | none
n4_v2_r4_H | none | all four | HI | none
n4_v2_r4_L | none | all four | LO | none
eq_n4_v2_r2_0-1 | none | A0,A1 | any | {0,1}
eq_n4_v2_r4_0-1_2-3 | none | all four | any | {0,1}, {2,3}
pos_n4_v2_r4 | positive between A0 and A1, A2 and A3 | all four | LO | none
pos_n4_v2_r2 | positive between A0 and A1, A2 and A3 | any two | LO | none
neg_n4_v2_r4 | negative between A0 and A1, A2 and A3 | all four | LO | none
neg_n4_v2_r2 | negative between A0 and A1, A2 and A3 | any two | LO | none

5. EVALUATION
Our primary motivation for developing a learning approach to

question asking was to avoid the costly and at times fallible hand-

engineering of question-asking strategies that would be required

to address the non-additive and context-dependent nature of

question utilities. By learning a question-asking strategy, we can

address these issues while also enabling question asking to adapt to

unique characteristics of the object and model populations. Thus,

we designed our experiments to explore how well L-QUAIL

could learn to ask the right questions.

5.1 Handcrafted Question-Asking Strategies
We compare the performance of the strategies learned by L-QUAIL against handcrafted strategies: both the original static

strategies, developed in consultation with the CHARM

developers, and a handcrafted decision tree. Based on our

previous experiments, we already know that the static strategies


suffer from a failure to account for the interdependent nature of

questions. Thus, it is useful to include them here to verify whether

learning indeed benefits question asking. Table 5 shows the static

strategies used in the experiments.

Table 5. Original static question-asking strategies, which ask questions according to the utilities of different question types.

Strategy | Utilities
valord | value ordering > attribute ordering, attribute relevance
attord | attribute ordering > value ordering, attribute relevance
attrel | attribute relevance > attribute ordering, value ordering
att | attribute ordering, attribute relevance > value ordering
attval | attribute ordering, value ordering > attribute relevance
valrel | attribute relevance, value ordering > attribute ordering

Based on a better understanding of CHARM and an analysis of its

method of incorporating question answers, we also handcrafted a decision tree that addresses context dependence and that can be used to acquire a perfect model purely by asking questions. The strategy first ascertains an attribute's relevance, then it asks

whether high or low values are preferred for the attribute, and

finally it asks about its relative importance with respect to every

other attribute that has also been determined to be relevant. By

taking previous answers into consideration, the strategy avoids

asking useless questions. However, it does not consider budget at

all and it is biased somewhat by the order in which it asks about

the attributes.³ The latter is an issue with the decision tree

formulation in general, where the question asked at each node is a

deterministic choice.

5.2 Performance Upper Bound
The performance data generated for the data sets provides us with

a means to compute a performance upper bound (or gold

standard). For each budget and each model, we can identify the

set of questions that results in the highest accuracy CHARM can

achieve for that model with that budget. Note that because the best

performing subset of questions may be different for different

models and distributions, in general it will not be possible to

create a strategy to ask precisely these questions.
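The gold-standard computation can be sketched as follows, assuming the exhaustively generated numbers are exposed via accuracy(model, question_subset) and cost(question_subset) lookups; both callables are assumptions for illustration.

from itertools import combinations

def upper_bound(models, questions, budget, cost, accuracy):
    # For each model, the best accuracy achievable with any affordable subset of questions.
    best = {}
    for model in models:
        best[model] = 0.0
        for k in range(len(questions) + 1):
            for subset in combinations(questions, k):
                if cost(subset) <= budget:
                    best[model] = max(best[model], accuracy(model, subset))
    return best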

5.3 Hypotheses
Our primary hypothesis was:

The learned question-asking strategy will outperform the

handcrafted strategies.

In addition, we also hypothesized that the learned decision trees

would be adapted to the underlying characteristics of the object

and model populations of the individual data sets. Specifically,

Attribute relevance questions will gain prominence when there

were irrelevant attributes.

Attribute value questions, although generally the most useful,

will not be used when the models always preferred a particular

attribute value direction.

Attribute ordering questions will not be asked between

attributes that had equal importance.

Learning preferences will be easier with positively correlated

attributes and harder with negatively correlated attributes.

³ With no prior information on relevance, we simply select an

arbitrary ordering over the attributes for question asking.

5.4 Evaluation Metrics
Our primary evaluation metric was CHARM's performance,

measured as the percentage of correct predictions (of pairwise

preferences) over the objects in a data set. Since we are interested

in question asking under a constrained budget, we plot this

performance over increasing budgets. We say that one strategy

outperforms another for a particular budget if it has higher

accuracy for that budget. We can also say that it outperforms

another if it reaches 100% accuracy sooner (i.e., with fewer

questions). Since L-QUAIL is dealing with a metalearning

problem, we average CHARM's performance over all the target

models in a data set. We also inspect the questions included and

relations between them in the learned trees.

5.5 Experimental Results
Over all the data sets, the results predominantly supported our

hypotheses (Figure 1).⁴ The learned strategy outperformed all the

handcrafted strategies (i.e., achieved higher accuracy for the same

budget and achieved 100% accuracy sooner) for all the data sets

except under two conditions. First, for the data set where any two

attributes were relevant (Figure 1b), the handcrafted decision tree

performed better at low budgets. This may be explained by the

fact that the handcrafted decision tree explicitly tests for attribute

relevance first. This is particularly valuable when there are

irrelevant attributes but is inefficient when all attributes are

relevant (Figure 1a). Second, when negatively correlated

attributes were involved (Figure 1f), the attord static strategy

preferring attribute ordering questions reached 100% accuracy

sooner. Negatively correlated attributes present a particular

challenge: if two attributes are negatively correlated, a model has the same preferred value direction for both, and a value ordering question is asked only for the less important attribute, then the

resulting model will predict incorrect preferences in cases where

objects differ in value for the more important attribute. However,

the results show that this is an even greater problem for the other

strategies than for the learned strategy, which is only

outperformed at high budgets. Another interesting thing to note

about these graphs is the nonmonotonic behavior of the

performance plots for some of the static strategies (Figure 1c,f),

providing evidence of the non-additive nature of question utility.

The results also showed that L-QUAIL learned question-asking

strategies that were adapted to unique characteristics of the

objects and models in each data set. For example, Figure 2 shows

the trees learned for data sets where a specific two of the four

attributes were relevant (see Table 3 for the corresponding

complete questions). Each tree only asked value ordering

questions for the relevant attributes and in the case where the two

attributes were equally important (Figure 2c), the tree did not ask

any attribute ordering questions.

Figure 3 shows another learned strategy, this one for the situation

where the models fell into two different classes: those where

only attributes A0 and A1 were relevant and those where only A2

and A3 were relevant. Interestingly, the attribute relevance

questions are never asked; they are in the tree but their answers

are implied by previous answers, illustrating a situation where the

learned tree benefits from taking previous answers into

consideration. The learned tree is not particularly intuitive but

results in much better performance than the handcrafted strategies.

⁴ Space constraints preclude the presentation of the complete

results; the rest are consistent with the findings reported here.


Figure 1. Results for different data sets showing the learned strategy (learnt) generally outperforming the handcrafted strategies: (a) all attributes relevant; (b) only two (any two) attributes relevant; (c) only attributes A0 and A1 relevant; (d) only A0,A1 relevant and equally important (the learnt strategy matches the gold standard); (e) high attributes always preferred; (f) A0,A1 and A2,A3 negatively correlated.

Figure 2. Decision trees learned for data sets with (a) only A0 and A1 relevant, (b) only A2 and A3 relevant, and (c) only A0 and A1 relevant and also equally ranked.

Figure 3. Decision tree learned for the data set where models only cared about attributes A0 and A1 or attributes A2 and A3.

While our experiments involved a comprehensive set of questions

with which CHARM could learn a perfect model, the results

should also apply to situations involving questions providing only

partial coverage and even to situations involving base learners that

may imperfectly incorporate answers. Because L-QUAIL learns

over a learner's actual performance with question answers, it

acquires question-asking strategies that are naturally adapted to

different situations.

6. RELATED WORK
Asking questions to obtain the most useful information for a learner has much in common with active learning [2][3][6].

Most active learning work focuses on acquiring class labels for

the most informative examples. An extension, active feature selection [11] attempts to obtain information about feature

relevance. In contrast, QUAIL obtains information of different

kinds through a variety of questions, some (like attribute

relevance) of which may result in direct modifications to the

hypothesis space.

Proactive learning augments active learning with decision-

theoretic techniques to accommodate imperfect oracles [4].

Because L-QUAIL learns over performance numbers with the

answers rather than from the answers themselves, to some extent

it already accommodates imperfect oracles and even imperfect

learners. Active model selection attempts to find an optimal

sequence of tests on learned models with different costs for the

purpose of identifying the model with the highest expected

accuracy [10]. While active model selection addresses budgetary

constraints on testing or evaluation during learning, L-QUAIL

addresses budgetary constraints during performance.

Games with a purpose (e.g., [13][14]) involve asking questions in

a manner specifically designed to entice humans to provide


answers to fill in knowledge gaps. QUAIL differs in its specific

focus on asking questions to help a learner learn and this paper

contributes a learning approach to finding effective question-

asking strategies for supervised learners.

7. FUTURE WORK
Our experimental results, while promising, investigated a

synthetic domain and relied on exhaustively generating

performance data for all possible question subsets. More realistic

settings will require CHARM's performance to be assessed during L-QUAIL's learning. Since CHARM will only have to be run

over the subset of models at a node and learning typically

terminates before all questions are asked, this approach should be

less expensive than exhaustive generation. However, it is likely

that some method will be needed to further reduce the set of

candidate questions at each node.

L-QUAIL's greedy heuristic for question selection does not yet

fully address the non-additive nature of question utilities. L-

QUAIL remains susceptible to sometimes missing questions that

individually lead to small performance improvements but together

lead to large gains, and to selecting useless questions when none
leads to an immediate improvement. We are considering two possible

solutions to this problem: 1) limited lookahead for considering

small sets of questions rather than single questions, and 2) random

decision trees allowing less immediately useful questions to be

selected with some probability.
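To make the first idea concrete, a minimal Python sketch of limited lookahead that scores small sets of questions (here, singletons and pairs) rather than single questions; the evaluate() callback is a hypothetical stand-in for running the base learner on a candidate set, and a full implementation would also have to account for the extra budget a larger set consumes:

from itertools import combinations

def lookahead_select(questions, evaluate, max_set_size=2):
    # Greedily pick the candidate question set whose estimated utility is
    # highest; evaluate() reports base-learner performance after asking a set.
    best_set, best_score = None, float("-inf")
    for size in range(1, max_set_size + 1):
        for subset in combinations(questions, size):
            score = evaluate(subset)
            if score > best_score:
                best_set, best_score = subset, score
    return best_set, best_score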

The experiments in this paper only considered questions of equal

cost. Questions with differing costs mean that learning must factor in
question cost when considering the next question to ask: for

example, by considering sets of questions of the same total cost

rather than individual questions. Within a fixed budget scenario,

the resulting tree can be used directly for question asking.

However, for a flexible budget scenario, something like a

conditional decision tree that selects questions based on the

remaining budget would be needed.
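As a rough illustration of the equal-total-cost idea above, the hypothetical helper below enumerates candidate question sets of a given total cost, which a selection heuristic could then compare against one another; it is a sketch, not part of L-QUAIL:

from itertools import combinations

def equal_cost_candidates(questions, costs, target_cost, max_size=3):
    # Yield every set of questions whose total cost equals target_cost;
    # a selection heuristic can then compare these sets against each other
    # instead of comparing individual questions of unequal cost.
    for size in range(1, max_size + 1):
        for subset in combinations(questions, size):
            if sum(costs[q] for q in subset) == target_cost:
                yield subset

For a flexible budget, the same enumeration could be repeated per remaining-budget level to build the conditional structure mentioned above.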

Our experiments also assume access to some pool of target

models together with their corresponding question answers and

preference judgments over the objects of interest. While a large

pool of training data clearly benefits learning, obtaining such data

comes at its own cost. Cost-sensitive decision trees [5], which

trade off the cost of obtaining better utility estimates against the

value of acquiring a higher-accuracy tree, may help in solving this

problem.

8. SUMMARY AND CONCLUSIONS
We presented a learning approach to acquiring effective question-

asking strategies under constrained budget scenarios. The

approach addresses the issue of context-dependent question

utilities and partially addresses the issue of non-additive utilities.

We formulated question-asking strategies as decision trees and

presented a decision tree learning algorithm for acquiring such

strategies. We presented experiments exploring the effects of

different target model populations and object distributions on L-

QUAIL, testing its ability to learn decision trees adapted to

different data sets. The results of our experiments provided strong

support for our hypotheses regarding L-QUAIL's ability to adapt

to different conditions and to generally outperform handcrafted

strategies. By relying on a base learner's actual performance with

the answers in the situations it is likely to encounter, our learning

approach avoids the costly hand-engineering of effective

question-asking strategies for different learners while providing

question-asking strategies that are naturally adapted to specific

base learners and their environments.

9. ACKNOWLEDGMENTS
This work was supported by the Defense Advanced Research

Projects Agency and the United States through BBN Technologies

Corp. on contract number FA8650-06-C-7606. We thank Omid

Madani for helping us formulate the learning problem, and Fusun

Yaman for extending CHARM to integrate QUAIL.

10. REFERENCES
[1] Burstein, M., Laddaga, R., McDonald, D., Cox, M., Benyo,

B., Robertson, P., Hussain, T., Brinn, M., & McDermott, D.

(2008). POIROT - integrated learning of web service

procedures. Proc. AAAI. Chicago, IL.

[2] Cohn, D., Atlas, L., & Ladner, R. (1994). Improving

generalization with active learning. Machine Learning, 15

(2), 201-221.

[3] Dasgupta, S., Hsu, D., & Monteleoni, C. (2007). A general

agnostic active learning algorithm. Proc. NIPS. Vancouver,

B.C., Canada.

[4] Donmez, P., & Carbonell, J. (2008). Proactive learning: cost-

sensitive active learning with multiple imperfect oracles.

Proc. CIKM. Napa Valley, CA.

[5] Esmeir, S., & Markovitch, S. (2008). Anytime induction of

low-cost, low-error classifiers: a sampling-based approach.

JAIR.

[6] Freund, Y., Seung, H. S., Shamir, E., & Tishby, N. (1997).

Selective sampling using the query by committee algorithm.

Machine Learning, 28, 133-168.

[7] Gervasio, M., & Myers, K. (2008). Question asking to inform

procedure learning. Proc. AAAI Workshop on Metareasoning: Thinking about Thinking. Chicago, IL.

[8] Gervasio, M., Myers, K., desJardins, M., & Yaman, F.

(2009). Question asking to inform preference learning: a case

study. Proc. AAAI Spring Symposium on Agents that Learn from Human Teachers. Stanford, CA.

[9] Kellerer, H. P. (2005). Knapsack Problems. Springer Verlag.

[10] Madani, O., Lizotte, D., & Greiner, R. (2004). Active model

selection. Proc. UAI. Banff, Canada.

[11] Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.

[12] Raghavan, H., Madani, O., & Jones, R. (2006). Active

learning with feedback on features and instances. Journal of Machine Learning Research.

[13] Speer, R., Krishnamurthy, J., Havasi, C., Smith, D.,

Lieberman, H., & Arnold, K. (2009). An interface for targeted

collection of common sense knowledge using a mixture

model. Proc. IUI. Sanibel Island, FL.

[14] von Ahn, L., Kedia, M., & Blum, M. (2006). Verbosity: a

game for collecting common-sense knowledge. Proc. CHI. Montreal, Quebec, Canada.

[15] Yaman, F., Walsh, T. J., Littman, M. L., & desJardins, M.

(2008). Democratic approximation of lexicographic

preference models. Proc. ICML. Helsinki, Finland.


Impact of Human Communication in a Multi-teacher, Multi-robot Learning by Demonstration System

Murilo Fernandes Martins
Dept. of Elec. and Electronic Engineering
Imperial College London
London, UK
[email protected]

Yiannis Demiris
Dept. of Elec. and Electronic Engineering
Imperial College London
London, UK
[email protected]

ABSTRACT
A wide range of architectures have been proposed within the areas of learning by demonstration and multi-robot coordination. These areas share a common issue: how humans and robots share information and knowledge among themselves. This paper analyses the impact of communication between human teachers during simultaneous demonstration of task execution in the novel Multi-robot Learning by Demonstration domain, using the MRLbD architecture. The performance is analysed in terms of time to task completion, as well as the quality of the multi-robot joint action plans. Participants with different levels of skills taught real robots solutions for a furniture moving task through teleoperation. The experimental results provide evidence that explicit communication between teachers does not necessarily reduce the time to complete a task, but contributes to the synchronisation of manoeuvres, thus enhancing the quality of the joint action plans generated by the MRLbD architecture.

Categories and Subject Descriptors
I.2.9 [Artificial Intelligence]: Robotics

General Terms
Algorithms, Design, Experimentation

Keywords
Learning by Demonstration, Multi-robot Systems

1. INTRODUCTION
The extensively studied field of Multi-robot Systems (MRS) is well known for its advantages and remarkable applications when compared to systems consisting of a single robot. MRS bring benefits such as redundancy, flexibility, robustness and so forth (for a review on MRS, refer to [14]) to a variety of domains, including search and rescue, cleanup of hazardous waste, space exploration, surveillance and the widely recognised RoboCup domain.

Likewise, a wide range of studies have addressed the challenge of developing Learning by Demonstration (LbD) architectures, in which a system can learn not from expert designers or programmers, but from observation and imitation of desired behaviour (for a comprehensive survey, see [1]).

However, very little work has been done attempting to merge LbD and MRS.

The work of [10] presented a novel approach to LbD in MRS – the Multi-robot Learning by Demonstration (MRLbD) architecture. This architecture enabled multiple robots to learn multi-robot joint action plans by observing the simultaneous execution of desired behaviour demonstrated by multiple teachers. This was achieved by firstly learning low-level behaviours at single robot level using the HAMMER architecture [6], and subsequently applying a spectral clustering algorithm to segment group behaviour [16].

Figure 1: The P3-AT mobile robots used in the experiments.

Communication is a classic feature, common to the areas of MRS and LbD, which has been addressed by a wide variety of studies. The performance impact of communication in MRS is well known [2], [13]. However, a question that remains open is whether the effects of communication between humans when teaching robots are somehow correlated with those in MRS. To the knowledge of the authors, no prior work has analysed the effects of communication between teachers in the MRLbD domain. This paper presents an analysis of performance in time to task completion and quality of the demonstration in a multiple-teacher, multiple-robot learning by demonstration scenario, making use of the MRLbD architecture.

The remainder of this paper is organised as follows: Section 2 provides background information on how communication has been used within the areas of MRS and LbD. In Section 3, the teleoperation framework, the MRLbD architecture and the robots (pictured in Fig. 1) are explained, followed by Section 4, which presents the real scenario and the task specification in which the experiments were performed. Subsequently, Section 5 analyses the results obtained, and finally Section 6 presents the conclusions and further work.

2. BACKGROUND INFORMATION
Several studies discuss properties and the nature of communication among humans and robots. In this paper, the characteristics of communication are categorised into 2 types, whose definitions are hereafter discussed. The impact of communication in MRS is also discussed in this section.

2.1 Implicit communication
Implicit communication is the information that can be gathered just by observing the environment, and relies on the perceptual capabilities of the observer, often involving computer vision. Implicit communication does not depend on communication networks or mechanisms. On the other hand, a mismatch between observed data and the observer's underlying capabilities can potentially happen – a common issue known as the correspondence problem [11].

LbD techniques often make use of implicit communication, where the behaviour of a group or a single agent (human or robot) is observed, and predictions of intention are formulated. The group or single agent is not necessarily aware of the fact that it is being observed. In some cases, humans pedagogically demonstrate desired behaviour to a robot by executing a task (e.g., [6]) or teleoperating robots (e.g., [3], [10]). In other cases a robot observes humans and hypothesises the actions humans may be executing [9].

2.2 Explicit communication
Explicit communication occurs when robots deliberately exchange internal states by means other than observation. This type of communication is largely used in MRS as a means to gather data in order to maximise the utility function of a group. Thus, it is expected that a group of robots can improve performance by sharing location, sensor data and so forth. As an example, the work of [17] presented an approach in which robots could form coalitions with other robots, sharing locations of robots and objects, as well as sensor data. Market-based approaches for multi-robot task allocation also make use of state communication to define auctions and distribute resources among robots (for a review, refer to [7]).

Within the LbD area, in [5] a teacher could explicitly teach 2 humanoid robots when information should be communicated, and also select the state features that should be exchanged using inter-robot communication. The approach presented by [12] allowed the teacher to provide the robot with additional information beyond the observed state by using verbal instructions, such as TAKE, DROP, START and DONE. These verbal instructions were used at the same time the robot learnt to transport objects by following the teacher. Similarly, the framework presented in [15] enabled a robot to establish a verbal dialog (using a pre-defined vocabulary of sentences) to learn how to perform a guided tour while following the teacher around.

2.3 Impact of Communication in MRS
Previous studies have analysed the effects of communication in MRS. In [2], the performance of a simulated group of reactive-controlled robots was analysed in 3 distinct tasks. According to the results obtained in that study, explicit communication is not essential for tasks in which implicit communication is available. On the other hand, if the observable state of the environment is very limited, then explicit communication significantly improves group performance.

In the work of [13], performance was analysed with regard to time to task completion; the results obtained led to conclusions similar to those of [2]. If the necessary information can be gathered by implicit communication, then explicit communication does not make a significant difference in the

Figure 2: Overview of the MRLbD architecture. (Block diagram components: Human-robot interface (joystick, visualisation, robot control); WiFi network; Robot cognitive capabilities (environment perception, Player server, logging, robot hardware); Plan Extraction (action recognition with HAMMER, group behaviour segmentation, multi-robot plan).)

time to complete a task. However, if the cost of executing redundant actions is high, then explicit communication can potentially reduce the time to task completion.

The results presented in [10] demonstrated that multiple robots can learn joint action plans from multiple teacher demonstrations. For those experiments, no explicit communication between the teachers was permitted. This paper analyses whether the impact on performance caused by communication between teachers is correlated with the findings of [2] and [13].

3. PLATFORM FOR EXPERIMENTATION
This section briefly describes the teleoperation framework with which the P3-AT robots were controlled during the experiments, as well as the relevant details regarding the MRLbD architecture (whose block diagram can be seen in Fig. 2) employed to extract multi-robot joint action plans.

3.1 Teleoperation framework
The teleoperation framework within the MRLbD architecture consists of client/server software written in C++ to control the P3-AT robots.

The server software comprises the robot cognitive capabilities and resides on the robot's onboard computer. The server is responsible for acquiring the sensor data and sending motor commands to the robot, whereas the client software runs on a remote computer and serves as the interface between the human operator and the robot.

The P3-AT mobile robots – Each robot is equipped with onboard computers, and its sensing capabilities comprise a laser range scanner model Sick LMS-200 with field of view of 180° and centimetre accuracy, a firewire camera providing coloured images with 320x240 pixels at 30 frames per second, and 16 sonar range sensors with field of view of 30° installed in a ring configuration around the robot. In addition, odometry sensors installed on the robot provide feedback of robot pose (2D coordinates and orientation relative to a starting point).

The server software – Within the Robot cognitive capabilities block (Fig. 2), the server interfaces with the robot hardware by means of the robot control interface Player [8].

Initially, sensor data is requested from the robot and then processed. Laser data is combined with robot odometry values to build an incremental 2D map of the environment. Additionally, the images captured from the camera are used for object recognition and pose estimation, and are also compressed in JPEG format to be sent over the Wi-Fi network to the client software. The sonar data is used for obstacle detection, which disables the motors in case a specific motor command would result in crashing into an unrecognised object. In addition, joystick inputs representing motor commands are constantly received from the client.

Figure 3: A screenshot of the human-robot interface; the camera view is on the top-left, the 2D map on the top-right.

The data manipulated by the server is incrementally stored in a log file every 0.5 seconds, comprising a series of observations made by the robot during task execution.
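A minimal sketch of this server-side cycle is given below; the robot, client and state objects and their methods are hypothetical stand-ins for the Player-based C++ implementation described above, so this is an illustration of the data flow rather than the authors' code:

import time

LOG_INTERVAL = 0.5  # seconds between log entries, as described above

def server_cycle(robot, client, state, last_log_time):
    # One pass of a hypothetical server loop mirroring the description above.
    laser = robot.read_laser()
    odometry = robot.read_odometry()
    sonar = robot.read_sonar()
    image = robot.read_camera()

    # Combine laser and odometry readings into an incremental 2D map.
    state.update_map(laser, odometry)

    # Recognise objects and estimate their poses from the camera image,
    # then forward a JPEG-compressed frame to the client interface.
    objects = state.recognise_objects(image)
    client.send_frame(state.compress_jpeg(image))

    # Sonar-based obstacle detection: veto a joystick command that would
    # drive the robot into an unrecognised object.
    command = client.receive_joystick_command()
    if state.would_collide(command, sonar, objects):
        robot.stop_motors()
    else:
        robot.send_motor_command(command)

    # Store an observation in the log file every LOG_INTERVAL seconds.
    now = time.time()
    if now - last_log_time >= LOG_INTERVAL:
        state.log(laser, odometry, sonar, objects, command)
        last_log_time = now
    return last_log_time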

The client software – Within the Human-robot interface block (Fig. 2), the sensor data received from the server is displayed to the human demonstrator. This data consists of the robot's onboard camera, laser and sonar range scanners, odometry values, battery level and Wi-Fi signal strength. A screenshot of this interface can be seen in Fig. 3. In order to aid in teleoperating the robots, thicker line segments are extracted from laser data, and recognised objects are represented by specific geometric shapes and colours (e.g., when other robots are recognised, a blue ellipse is displayed on screen, representing the observed robot pose). In addition, an incremental 2D map of the environment is built using laser and odometry data, a map which can be substantially useful for location awareness during navigation.

In the meantime, the robot is teleoperated by the demonstrator through a joystick.

3.2 The MRLbD Architecture
It is worth noting that the MRLbD architecture does not necessarily require the aforementioned teleoperation framework; it can be integrated into any MRS in which robots are capable of generating appropriate observations of the state of the environment along time.

The learning module of the MRLbD architecture is responsible for extracting a joint action plan for each set of simultaneously demonstrated data provided by multiple human teachers. Within the Plan Extraction block, 3 distinct stages, hereafter discussed, process the demonstrated data.

Action recognition at single robot level – When addressing the problem of recognising observed actions, a mismatch between observed data and robot internal states might happen. This is a common issue known as the correspondence problem [11].

The MRLbD architecture embeds a prediction-of-intent approach – the HAMMER architecture [6] – to recognise actions from observed data and manoeuvre commands, which is based upon the concept of multiple hierarchically connected inverse-forward models. As illustrated in Fig. 4, each inverse-forward model pair is a hypothesis of a primitive behaviour, and the predicted state is compared to the observed state to compute a confidence value. This value represents

Figure 4: Diagrammatic statement of the HAMMER architecture. In state s_t, inverse models (I1 to In) compute motor commands (M1 to Mn), with which forward models (F1 to Fn) form predictions of the next state s_{t+1} (P1 to Pn), then verified at s_{t+1}.

how correct that hypothesis is, thus determining the robot primitive behaviour which would result in the outcome most similar to the observed action.

As in [10], this paper makes use of five inverse-forward model pairs, which are: Idle (the robot remains stationary), Search (the robot searches for a specific object), Approach (the robot approaches the object once it is found), Push (the robot moves the object to a destination area) and Dock (the robot navigates back to its initial location). While the inverse models are hand-coded primitive behaviours, the forward models take into account the differential drive kinematics model of the robot.
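A rough sketch of this recognition step is shown below; the primitive names follow the paper, but the model objects and their predict_command()/predict_state() methods are hypothetical placeholders for HAMMER's inverse-forward model pairs rather than the actual implementation:

def recognise_action(model_pairs, observed_state, next_state):
    # model_pairs: dict mapping primitive name ('Idle', 'Search', 'Approach',
    # 'Push', 'Dock') to a hypothetical (inverse_model, forward_model) pair.
    best_name, best_confidence = None, float("-inf")
    for name, (inverse_model, forward_model) in model_pairs.items():
        # Inverse model: hypothesise the motor command for this primitive.
        command = inverse_model.predict_command(observed_state)
        # Forward model: predict the state that command would lead to.
        predicted = forward_model.predict_state(observed_state, command)
        # Confidence: higher when the prediction is closer to what was observed.
        confidence = -euclidean(predicted, next_state)
        if confidence > best_confidence:
            best_name, best_confidence = name, confidence
    return best_name, best_confidence

def euclidean(a, b):
    # Simple distance between two numeric state vectors (illustrative only).
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5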

Group behaviour segmentation – A Matlab implementation of the spectral clustering algorithm presented in [16] is used for spatio-temporal segmentation of group behaviour based on the observations made by the robots during task demonstration.

The current work makes use of the same parameters described in [10] for the clustering algorithm. These values were thoroughly defined to preserve the generality of the algorithm when applied to different sets of concurrently demonstrated data.

In addition, the same 2 events used in [10] were used in this implementation. The event SEESBOX occurs every time the robot recognises a known object within its field of view, and the event SEESROBOT occurs when the robot recognises another robot.

Multi-robot plan generation – In the MRLbD, the segmented group behaviour is combined with the recognised actions at single robot level to build multi-robot action plans. Such a plan comprises a sequence of actions at single robot level, with specific times to start and end action execution, that each robot should perform along time as an attempt to achieve the same goals as the task execution formerly demonstrated.

Given the unlabelled group behaviour segmentation over time for each robot, the recognised actions at single robot level are ranked based upon the forward model predictions and the start/end times of that particular segment of group behaviour. For every robot and every cluster, a confidence level of the inverse-forward model is calculated. The action with the highest confidence level is then selected, and a sequence of actions at individual level for each robot is generated, defining the multi-robot joint action plan.
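The sketch below illustrates that final ranking step under simplifying assumptions: segments is a hypothetical per-robot list of (start, end, cluster) tuples, and score() stands in for the inverse-forward model confidence described above:

def build_joint_plan(segments, primitives, score):
    # segments:   dict mapping robot id -> list of (start_time, end_time, cluster_id)
    # primitives: list of primitive behaviour names (Idle, Search, ...)
    # score:      callback (robot, primitive, start, end) -> confidence level
    plan = {}
    for robot, robot_segments in segments.items():
        timeline = []
        for start, end, _cluster in robot_segments:
            # Rank every primitive for this segment and keep the best one.
            best = max(primitives, key=lambda p: score(robot, p, start, end))
            timeline.append((start, end, best))
        plan[robot] = timeline
    return plan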

4. EXPERIMENTAL TESTS
Experiments were performed in a realistic furniture-movers scenario. This experiment differs from the one presented


Figure 5: Time-based comparison of the demonstrated task executions, with (green bars) and without (blue bars) verbal communication allowed between participants. (Bar chart of time to completion in seconds for trials 1-5.)

in [4] in the sense that demonstrators have full control of the robot through the joystick, as opposed to the selection of pre-defined actions comprising the robots' underlying capabilities. Moreover, in this work P3-AT mobile robots were used in a real environment. In comparison with the experiments performed in [10], the current experiments make use of a larger box, and the robots must move the box through a tighter, cluttered environment, demanding significantly more difficult coordinated manoeuvres to complete the task.

A total of 10 demonstrations were conducted, involving 7 participants: 3 novices, 2 intermediate and 2 experienced users of the teleoperation framework1. The same instructions were thoroughly given to all 7 demonstrators, including a training session of 10 minutes for familiarisation with the teleoperation framework. In pairs, the demonstrators were asked to remotely control 2 robots, search for a particular box whose location was initially unknown, and bring it to a specified destination, resulting in task completion. In order to analyse the impact of explicit communication between participants during task execution, the same pairs of participants performed 2 task executions (comprising one trial): one demonstration with implicit communication only, and another demonstration with verbal communication permitted.

An experienced and an intermediate user performed the first trial, then 2 experienced users demonstrated the second, and subsequently 2 novices performed the third trial. The fourth trial was also executed by 2 novices, and lastly an experienced and an intermediate user carried out the task demonstration.

5. RESULTS AND DISCUSSION
This section analyses the impact of explicit communication between teachers on the joint action plans extracted by the MRLbD architecture, discussing the results obtained from the 10 demonstrations of task execution with regard to time to completion of the task, as well as the quality of the multi-robot joint action plans generated.

5.1 Time to completion of task demonstrations
From the results obtained using the MRLbD architecture, Fig. 5 shows the time to completion of the 10 task executions, where each trial represents 2 demonstrations from the same pair of participants, with and without explicit (namely verbal) communication allowed between participants.

1 A novice and the 2 advanced participants performed the experiments twice.

A comparative analysis of the trials on a time basis provides no evidence of performance gains, which is consistent with the discussion presented in Section 2.3. In fact, no significant difference can be observed.

On the one hand, explicit communication can cause delays, as observed in trials 2 (13 secs.) and 4 (22 secs.) in Fig. 5. Delays could presumably be a consequence of participants naturally spending some time communicating with each other in order to synchronise and coordinate manoeuvres.

On the other hand, trials 1 and 5 presented a slight reduction in the time demanded (30 secs. and 19 secs.). Recalling the discussion in Section 2.3, implicit communication does not always provide enough information when the environment is partially observable. That is, sometimes a robot might lose track of its robot mate or the object being manipulated, thus requiring manoeuvres to look around and observe its surroundings. In that case, one robot can provide the other robot with the relevant information by explicitly communicating its observation, avoiding unnecessary manoeuvres.

In trial 3, however, a marked reduction in the time to completion of the task can be observed (4 min. and 11 secs.). Unexpectedly, this trial provided an excellent example of how group behaviour performance can benefit from explicit communication.

In this particular trial, during the demonstration using implicit communication, both participants decided to navigate in the direction opposite to where the box was initially located (Fig. 6(a)), thus demanding more time for the box to be found by both participants. Searching for the box does not require collaboration between robots, but it is a classic example of a redundant action.

Predictably, the participants benefited from the explicit communication: one participant quickly found the box and immediately shared that information, telling the other participant the box location (Fig. 6(b)), hence reducing the time for task completion.

Interestingly, the impact of allowing explicit communication between teachers on the time to completion of the task did not change whatsoever, whether the participants demonstrating the task were experienced users or novices.

5.2 Quality of Joint Action Plans Generated
It is widely recognised that developing systems to work in real-world scenarios markedly elicits real-world issues which must be taken into account, such as noise and the non-determinism of the environment. Furthermore, when developing systems in which robots learn from humans, the demonstrations of desired behaviour are unlikely to result in exactly the same data.

Therefore, the multi-robot joint action plans generated by the MRLbD architecture are expected to present variation, mainly in the segmentation of group behaviour, where robot trajectories along time are taken into account.

Fig. 7(a) plots the joint action plan generated by processing the data demonstrated with implicit communication only, from trial 1. In that particular demonstration, a lack of coordination, with manoeuvres mostly out of synchrony, can be observed. While robot 1 was already pushing the object (denoted by the character "P" within the


Figure 6: Robot trajectories along time from trial 3. (a) No communication allowed: both robots navigated opposite to the box location. (b) Communication allowed: box location information communicated to the other robot. (Spatio-temporal plots of x/y coordinates against time, marking the robots' initial location, the wrong direction taken by both robots in (a), and the points where communication occurred and the box was found in (b).)

Figure 7: Joint action plans generated using demonstrations from trial 1. (a) Implicit communication only. (b) Explicit communication allowed. (Robot action timelines for robots 1 and 2 over 0-600 time units, with actions labelled S, A, P, I, D.)

Figure 8: Joint action plans generated using demonstrations from trial 2. (a) Implicit communication only. (b) Explicit communication allowed. (Robot action timelines for robots 1 and 2 over 0-300 time units, with actions labelled S, A, P, D.)

second action on the timeline), robot 2 was still approaching ("A") the object. Only at the fourth action did robot 1 realise that robot 2 was not in synchrony with its manoeuvres. This lack of coordination caused the HAMMER architecture to recognise the fourth and fifth individual actions of robot 2 as Dock. At that stage, robot 2 was moving towards the place it was initially located and had presumably lost track of the object, leading the confidence level of the Dock action to be higher.

Conversely, the joint action plan extracted from the demonstration with explicit communication in trial 1 indicates reasonably synchronised actions, as can be seen in Fig. 7(b). Again, robot 1 started pushing the object first, while robot 2 was still approaching the object (second actions on the timeline). The participant teleoperating robot 1 was possibly notified by the other participant, which initiated a period in which the robots remained stationary (Idle action denoted by "I"). The remaining actions were hence executed in a coordinated manner until task completion, thus suggesting that explicit communication can help the MRLbD architecture to generate better coordinated joint action plans.

Fig. 8 also provides evidence that explicit communication


contributes to the extraction of better joint action plans. In trial 2 both participants were experienced users. Therefore, the result from the demonstration with implicit communication only (Fig. 8(a)) already presents a considerably coordinated, noticeably synchronised joint action plan.

However, this synchrony is lost at the fifth action on the timeline; it was quickly re-established by robot 2, which presumably approached the object rapidly and then re-engaged in moving the object. In contrast with Fig. 8(a), Fig. 8(b) reveals a plan in which actions are jointly executed and tightly synchronised in time, when explicit communication was permitted between the experienced teachers.

These results indicate that the use of explicit communication aids human teachers regardless of their level of skill, improving the quality of the demonstration, although that quality still depends on the level of skill.

6. CONCLUSION AND FUTURE WORK
This paper presented an analysis of the impact of communication between humans concurrently teaching multiple robots using the novel MRLbD architecture proposed in [10].

It was found that the impact of explicit communication on MRLbD system performance is strongly correlated with its impact in MRS. What is more, the results provide evidence that explicit communication affects human teachers to the same extent, regardless of the level of skill of these teachers.

Regarding the time to task completion, explicit communication was found to result in no substantial alteration in performance. Explicit communication can only be beneficial in reducing the time to task completion if the cost (e.g., in time or battery life) of executing redundant actions is high.

However, the quality of the joint action plans was shown to benefit from explicit communication: those plans presented more tightly coordinated, temporally synchronised actions executed by the robots.

Ongoing work focuses on refining the inverse-forward model pairs of the HAMMER architecture for action recognition and on investigating the use of other events within the spectral clustering algorithm.

7. ACKNOWLEDGEMENTS
The support of the CAPES Foundation, Brazil, under project No. 5031/06–0, is gratefully acknowledged.

8. REFERENCES
[1] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
[2] T. Balch and R. C. Arkin. Communication in reactive multiagent robotic systems. Auton. Robots, 1(1):27–52, 1994.
[3] S. Chernova and M. Veloso. Confidence-based policy learning from demonstration using Gaussian mixture models. In AAMAS '07: Proc. of the 6th Int. Joint Conf. on Autonomous Agents and Multiagent Systems, pages 1–8, New York, NY, USA, 2007. ACM.
[4] S. Chernova and M. Veloso. Multiagent collaborative task learning through imitation. In 4th Int. Symposium on Imitation in Animals and Artifacts, April 2007.
[5] S. Chernova and M. Veloso. Teaching multi-robot coordination using demonstration of communication and state sharing. In AAMAS '08: Proc. of the 7th Int. Joint Conf. on Autonomous Agents and Multiagent Systems, pages 1183–1186, Richland, SC, 2008. IFAAMAS.
[6] Y. Demiris and B. Khadhouri. Hierarchical attentive multiple models for execution and recognition of actions. Robotics and Autonomous Systems, 54(5):361–369, 2006.
[7] M. B. Dias, R. M. Zlot, N. Kalra, and A. T. Stentz. Market-based multirobot coordination: A survey and analysis. Technical report, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, April 2005.
[8] B. Gerkey, R. Vaughan, K. Stoy, A. Howard, G. Sukhatme, and M. Mataric. Most valuable player: a robot device server for distributed control. In Proc. of the 2001 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 3:1226–1231, 2001.
[9] R. Kelley, A. Tavakkoli, C. King, M. Nicolescu, M. Nicolescu, and G. Bebis. Understanding human intentions via hidden Markov models in autonomous mobile robots. In HRI '08: Proc. of the 3rd ACM/IEEE Int. Conf. on Human Robot Interaction, pages 367–374, New York, NY, USA, 2008. ACM.
[10] M. F. Martins and Y. Demiris. Learning multirobot joint action plans from simultaneous task execution demonstrations. In Proc. of the 9th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2010), page to appear, Toronto, Canada, 2010.
[11] C. L. Nehaniv and K. Dautenhahn. The correspondence problem, pages 41–61. MIT Press, Cambridge, MA, USA, 2002.
[12] M. N. Nicolescu and M. J. Mataric. Natural methods for robot task learning: Instructive demonstrations, generalization and practice. In Proc. of the 2nd Int. Joint Conf. on Autonomous Agents and Multi-Agent Systems, pages 241–248, 2003.
[13] L. E. Parker. The effect of action recognition and robot awareness in cooperative robotic teams. In IROS '95: Proc. of the Int. Conf. on Intelligent Robots and Systems, Volume 1, page 212, Washington, DC, USA, 1995. IEEE Computer Society.
[14] L. E. Parker. Distributed intelligence: Overview of the field and its application in multi-robot systems. Journal of Physical Agents, 2(1):5–14, March 2008. Special issue on Multi-Robot Systems.
[15] P. Rybski, K. Yoon, J. Stolarz, and M. Veloso. Interactive robot task training through dialog and demonstration. In 2nd ACM/IEEE Int. Conf. on Human Robot Interaction, March 2007.
[16] B. Takacs and Y. Demiris. Balancing spectral clustering for segmenting spatio-temporal observations of multi-agent systems. In ICDM '08: Proc. of the 8th IEEE Int. Conf. on Data Mining, pages 580–587, Dec. 2008.
[17] F. Tang and L. Parker. Asymtre: Automated synthesis of multi-robot task solutions through software reconfiguration. In Proc. of the 2005 IEEE Int. Conf. on Robotics and Automation (ICRA 2005), pages 1501–1508, 18-22 April 2005.


Biped Walk Learning On Nao Through Playback and Real-time Corrective Demonstration

Çetin Meriçli
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213, USA
[email protected]

Manuela Veloso
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213, USA
[email protected]

Categories and Subject Descriptors
I.2 [Artificial Intelligence]: Robotics

General Terms
Algorithms

ABSTRACT
We contribute a two-phase biped walk learning approach which is developed on the Aldebaran Nao humanoid robot. In the first phase, we identify and save a complete walk cycle from the motions of the robot while it is executing a given walk algorithm as a black box. We show how the robot can then play back such a recorded cycle in a loop to obtain a good open-loop walking behavior. In the second phase, we introduce an algorithm to directly modify the recorded walk cycle using real-time corrective feedback provided by a human. The algorithm learns joint movement corrections to the open-loop walk based on the corrective feedback as well as the robot's sensory readings while walking autonomously. Compared to the open-loop algorithm and hand-tuned closed-loop walking algorithms, our two-phase method provides an improvement in walking stability, as demonstrated by our experimental results.

Keywords
robot learning from demonstration, skill learning

1. BACKGROUND
Biped walk learning is a challenging problem in humanoid robotics due to the complex dynamics of walking. Developing efficient biped walking methods on commercial humanoid platforms with limited computational power is even more challenging since the developed algorithm should be computationally inexpensive, and it is not possible to alter the hardware.

The Aldebaran humanoid robot, Nao, is a 4.5 kg, 58 cm tall humanoid robot with 21 degrees of freedom (http://www.aldebaran-robotics.com/pageProjetsNao.php). Since the introduction of the Nao as the common robot platform of the RoboCup Standard Platform League (SPL) (http://www.tzi.de/spl) in 2008, competing teams have been drawn to investigate efficient biped walking methods suitable for the Nao hardware [1].

There have also been approaches to task and skill learning that utilize the learning from demonstration paradigm for biped and quadruped walk learning [2], low-level motion planning [3], and task learning for single and multi-robot systems [4].

2. APPROACH
A walking algorithm computes a vector of joint commands and sends it to the robot at each execution timestep. Considering the periodic nature of walking, it is possible to compute a walk cycle once and then play the cycle back in a loop to obtain a continuous walking behavior.

We begin by collecting 10 trials of the robot walking forwards for 3 meters at a fixed speed using Liu and Veloso's walking algorithm [1]. In some trials the robot completed the walk without losing its balance, and in others the robot lost its balance and fell before completing the full walk. A set of walk cycles is extracted from the trials in which the robot completed the walk, yielding a number of repetitions of a complete walk cycle. The extracted walk cycles are then compressed into a single walk cycle by computing the mean value of the joint commands for each individual timestep in the walk cycle. The resulting walk cycle is then played back in a loop to obtain an open-loop walk.
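A minimal sketch of this averaging step, assuming each extracted cycle has already been aligned and stored as a list of per-timestep joint-command vectors (the data layout is an assumption, not the authors' implementation):

def average_walk_cycle(cycles):
    # cycles: list of walk cycles; each cycle is a list of per-timestep
    # joint-command vectors (lists of floats) of equal length.
    num_timesteps = len(cycles[0])
    num_joints = len(cycles[0][0])
    averaged = []
    for t in range(num_timesteps):
        # Mean joint command across all cycles for this timestep.
        averaged.append([
            sum(cycle[t][j] for cycle in cycles) / len(cycles)
            for j in range(num_joints)
        ])
    return averaged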

Algorithm 1 Closed-loop walking using sensor-joint couplings.
  Cycle ← loadWalkCycle()
  timestep ← 0
  loop
    Sensors_t ← readSensors()
    smoothSensors(Sensors_t)
    for all j ∈ Joints do
      if timestep MOD correctioninterval = 0 then
        C_j = calculateCorrection(Sensors_t, j)
      else
        C_j = 0
      end if
      NextAction_j ← Cycle[timestep][j] + C_j
    end for
    timestep ← timestep + 1 (mod CycleSize)
  end loop

The changes in sensory readings when the robot is about to lose its balance can be used to derive a correction policy by mapping these changes to corrective feedback signals. The correction value for each joint is defined as a function of a single sensor reading. At each timestep the correction values for all joints are computed using the recent sensory readings and the defined correction functions. The pseudo-code of that process is given in Algorithm 1.


We use a human demonstrator to provide corrective feedback signals in real time while the robot is performing the playback walk action (Figure 1). One of the major engineering problems in using real-time human feedback for bipedal walk learning is that feedback must be provided without interfering with the robot dynamics. We developed a wireless control interface using the Nintendo Wiimote commercial game controller with the Nunchuk extension (http://www.nintendo.com/wii/what/controllers) to provide corrective demonstration to the robot. Both the controller and its extension are equipped with sensors measuring their absolute pitch and roll angles. Rolling a handle causes the corresponding half of the robot to bend towards the direction of the roll (Figure 2).

Figure 1: A snapshot from a demonstration session. A loose baby harness is used to prevent possible hardware damage in case of a fall. The harness neither affects the motions of the robot nor holds it as long as the robot is in an upright position.

Figure 2: Example corrections using the Wiimote. (a) Neutral position, (b) bent towards the right, (c) neutral position, and (d) bent forward.

We use the accelerometer readings as the sensory input. We fit a normal distribution to the correction data points received for each of the 256 possible values of the accelerometer, and we use the means of the normal distributions to compute the correction values during policy execution.
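A simple sketch of that fitting step under the stated assumptions (one distribution per discrete accelerometer value, summarised by its mean); the data layout is hypothetical:

from collections import defaultdict

def fit_correction_table(samples):
    # samples: iterable of (accelerometer_value, correction) pairs collected
    # during corrective demonstration; accelerometer_value is an integer 0..255.
    grouped = defaultdict(list)
    for accel_value, correction in samples:
        grouped[accel_value].append(correction)
    # The mean of a fitted normal distribution equals the sample mean,
    # so storing the mean per accelerometer value is sufficient here.
    return {value: sum(cs) / len(cs) for value, cs in grouped.items()}

def correction_for(table, accel_value, default=0.0):
    # Look up the correction during policy execution (0 if the value was unseen).
    return table.get(accel_value, default)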

3. RESULTS AND CONCLUSIONS
We evaluated the efficiency of the learned feedback policy against the open-loop policy obtained in the first phase and a hand-tuned closed-loop policy. The hand-tuned policy couples the roll and pitch angles of the torso, computed from the torso angle sensor, to roll and pitch corrections on both the left and right sides using a linear function C = AX + B, where X is the sensory reading, with hand-tuned A and B values. We conducted 20 runs per policy and measured the distance traveled before the robot falls. The results are given in Figure 3.

Figure 3: Performance evaluation results: a) open-loop, b) hand-tuned closed-loop using torso angle sensor readings, c) learned policy using accelerometer readings. The lines within the boxes mark the median, the marks at both ends of the boxes indicate minimum and maximum distances, and the left and right edges of the boxes mark the 25th and 75th percentiles, respectively.

The learned policy demonstrated an improvement over the open-loop and hand-tuned policies, reaching a maximum travelled distance of 956 centimeters, while the maximum distances for the open-loop and hand-tuned policies were 327 and 689 centimeters, respectively.

This paper contributes a two-phase biped walk learning algorithm and an inexpensive wireless feedback method for providing corrective demonstration. We presented a cheap open-loop playback walking method that learns a complete walking cycle from an existing walking algorithm and then plays it back. This method is highly suitable for humanoid robots with limited on-board processing power.

4. ACKNOWLEDGMENTS
This work is partially supported by The Scientific and Technological Research Council of Turkey (TÜBITAK) Programme 2214. The authors would like to thank Stephanie Rosenthal, Tekin Meriçli, and Brian Coltin for their valuable feedback, and the members of the Cerberus and CMWrEagle RoboCup SPL teams for providing their codebases.

5. REFERENCES
[1] Jinsu Liu, Xiaoping Chen, and Manuela Veloso. Simplified Walking: A New Way to Generate Flexible Biped Patterns. In Proceedings of the Twelfth International Conference on Climbing and Walking Robots and the Support Technologies for Mobile Machines (CLAWAR 2009), Istanbul, Turkey, 9-11 September 2009, O. Tosun, H. L. Akin, M. O. Tokhi and G. S. Virk (Eds.), World Scientific, 2009.
[2] D. H. Grollman and O. C. Jenkins. Learning elements of robot soccer from demonstration. In International Conference on Development and Learning (ICDL 2007), pages 276–281, London, England, Jul 2007.
[3] B. Argall, B. Browning, and M. Veloso. Learning robot motion control with demonstration and advice-operators. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'08), 2008.
[4] Sonia Chernova and Manuela Veloso. Interactive policy learning through confidence-based autonomy. Journal of Artificial Intelligence Research, 34, 2009.


Reinforcement Learning via Practice and Critique Advice

Kshitij Judah, Saikat Roy, Alan Fern, and Thomas Dietterich
School of EECS, Oregon State University
1148 Kelley Engineering Center
Corvallis, Oregon, USA
judahk,roy,afern,[email protected]

ABSTRACT
We consider the problem of incorporating end-user advice into reinforcement learning (RL). In our setting, the learner alternates between practicing, where learning is based on actual world experience, and end-user critique sessions where advice is gathered. During each critique session the end-user is allowed to analyze a trajectory of the current policy and then label an arbitrary subset of the available actions as good or bad. Our main contribution is an approach for integrating all of the information gathered during practice and critiques in order to effectively optimize a parametric policy. The approach optimizes a loss function that linearly combines losses measured against the world experience and the critique data. We evaluate our approach using a prototype system for teaching tactical battle behavior in a real-time strategy game engine. Results are given for a significant evaluation involving ten end-users, showing the promise of this approach and also highlighting challenges involved in inserting end-users into the RL loop.

Categories and Subject Descriptors
I.2.6 [Computing Methodologies]: Learning; H.5.2 [Information Systems]: Information Interfaces and Presentation—Graphical user interfaces (GUI)

General Terms
Algorithms

Keywords
Reinforcement Learning, Human-Computer Interaction, Intelligent User Interfaces

1. INTRODUCTION
Humans often learn through iterations of practice followed by feedback from a teacher. The goal of this paper is to make progress toward computer agents that can learn via a similar process where feedback is provided by end-users with no knowledge of the agents' internal workings. With this motivation we consider the problem of integrating an end-user into a reinforcement learning (RL) process where the RL agent will alternate between two stages: 1) the critique stage, where the end-user is able to observe a trajectory of the agent's current performance and then critique any of the possible actions as good or bad, and 2) the practice stage, where the RL agent learns from autonomous practice in the environment while taking the user's critique into account.

We develop a novel approach for combining the critique and practice data for selecting a policy. The approach is based on optimizing an objective function over policy parameters that linearly combines objectives related to each type of data. One of our key contributions is to propose an interpretation of the critique data that leads to a well-motivated objective for the critique data. In particular, the critique data is viewed as defining a distribution over a type of supervised learning problem we call any-label learning. The objective is then defined in terms of the expected loss of the policy parameters with respect to this distribution.

We present results of a user study in the domain of tactical battles in a real-time strategy game. The overall results show that our approach is able to significantly improve over pure RL and is a promising mechanism for leveraging end-users in the RL loop. Important usability issues also emerge.

2. RELATED WORK
Here we review only the most closely related prior work. Prior work uses high-level rules as advice for RL: for example, rules in the form of programming-language constructs [11], logical rules derived from a constrained natural language [8], and rules about action utilities and preferences [10, 9]. While experimental results show promise, there are no experiments that combine more than 2 or 3 pieces of advice at once and no end-user studies. Also, it is likely that end-users will have difficulty providing such advice due to the need to learn an advice language.

In the learning by demonstration (LBD) framework the user provides full demonstrations of a task that the agent can learn from [3]. Recent work [4] incorporates model learning to improve on the demonstrations, but does not allow users to provide feedback on the agent's observed behavior. Other work combines LBD and human critiques on behavior [1, 2] similar to those used in our work, but there is no autonomous practice and no experiments involving end-users.

Another type of approach has been to allow users to provide real-time feedback to the agent during the learning process in the form of an end-user driven reward signal. Knox and Stone [7] use supervised learning to predict the user reward signal and select actions in order to maximize the predicted reward. This approach has shown some promise in user studies, but does not integrate autonomous practice (until very recently; see [6]) and is hence limited by the abilities of the user. Thomaz and Breazeal [14] instead combine the end-user reward signal with the environmental reward and use a Q-learning variant for RL. This method has also shown promise in user studies, but the semantics of the combination of reward signals is unclear in general.

3. OVERALL APPROACH
Problem Formulation. We consider reinforcement learning (RL) in the framework of Markov decision processes (MDPs), where the goal is to maximize the expected total reward from an initial state.


It is assumed that all policies are guaranteed to reach a terminal state in a finite number of steps. Extending to discounted infinite-horizon settings is straightforward. In this work, we incorporate an end-user into the usual RL loop. The user-agent interaction alternates between two stages: 1) Get critique data C from the user regarding the current policy, and 2) Update the policy via autonomous practice while factoring in the critique data.

!(s1, c

+1 , c

!1 ), ..., (sM , c+M , c!M )

", where si is a state and c+i

and c!i are sets of actions that are labeled as good and bad for si.No constraints are placed on which states/actions are included in C,and it is expected that for many states on the trajectories one or bothof c+i and c!i will be empty. Intuitively, the actions in c+i are thosethat the end-user believes are “equally optimal" while those in c!iare those that are considered to be sub-optimal. In many domains,it is possible for there to be multiple optimal/satisfactory actions,which is why we consider sets of actions. One of the primary con-tributions of this paper (Section 4) is to propose a semantics for Cthat supports a principled approach to policy optimization.

In stage 2 of our protocol, the agent practices by interacting withthe MDP as in traditional RL, but while taking the current cri-tique data into account. This generates a set of trajectories T =

{T1, ..., TN}, where Ti =#(s1i , a

1i , r

1i ), ..., (s

Lii , aLi

i , rLii )

$and

sti , ati , and rti are the state, action and reward at time t in Ti. Thus,

during the practice session and before the next critique session, theagent has two sets of data T and C available. The fundamentalproblem is thus to use T and C in order to select an agent policythat maximizes reward in the environment.

Solution Approach. We consider learning the parameters $\theta$ of a stochastic policy $\pi_\theta(a|s)$, which gives the probability of selecting action a in state s given $\theta$. We use a standard log-linear policy representation $\pi_\theta(a|s) \propto \exp(f(s,a) \cdot \theta)$, where $f(s,a)$ is a feature vector of state-action pairs and $\theta$ weighs features relative to one another. Our problem is now to use the data sets T and C in order to select a value of $\theta$ that maximizes the expected total reward of $\pi_\theta$.
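For concreteness, the log-linear policy can be implemented as a softmax over feature scores. The sketch below assumes a Python/NumPy setting and a user-supplied feature function f(s, a); it is illustrative only and not the authors' code.

import numpy as np

def action_probs(theta, s, actions, f):
    # pi_theta(a|s) proportional to exp(f(s, a) . theta), computed as a
    # numerically stable softmax over the available actions.
    scores = np.array([np.dot(f(s, a), theta) for a in actions])
    scores -= scores.max()
    weights = np.exp(scores)
    return weights / weights.sum()

def sample_action(theta, s, actions, f, rng=None):
    rng = rng or np.random.default_rng()
    probs = action_probs(theta, s, actions, f)
    return actions[rng.choice(len(actions), p=probs)]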

Our high-level approach is to optimize $\theta$ with respect to a combined objective function:

$J_\lambda(\theta, C, T) = \lambda\, U(\theta, T) - (1 - \lambda)\, L(\theta, C)$   (1)

where $\lambda \in [0, 1]$. $U(\theta, T)$ is an estimate of the expected utility of $\pi_\theta$ based on T (details below) and $L(\theta, C)$ is a loss function that measures the disagreement between $\theta$ and C (details in Section 4). $\lambda$ specifies the relative influence of T versus C, where $\lambda = 1$ corresponds to pure RL and $\lambda = 0$ corresponds to pure supervised learning from C. We will use differentiable functions for U and L, and thus, for a fixed $\lambda$, $J_\lambda$ can be optimized via gradient descent. Supervised actor-critic learning [13] follows a similar approach, where online policy updates are driven by a combined reward and supervisory signal. In that work, a schedule for adjusting the value of $\lambda$ was proposed. However, we have found that in practice, fixing a schedule for $\lambda$ is difficult since the best choice varies widely across users, domains, and data quality. Thus, instead we use an explicit validation approach, where part of the practice session is used to evaluate different values of $\lambda$.

Algorithm 1 gives the main loop of our approach, which starts by collecting critique data based on the current policy (line 3). Next the practice session begins, and some number of additional trajectories (line 5) are collected via a combination of following the current policy and exploration.¹ Next (lines 6-10) the algorithm considers different values of $\lambda$, for each one optimizing $J_\lambda$ and then estimating the expected utility of the resulting policy by averaging the results of a number of policy executions. In our experiments we used 11 values of $\lambda$ equally distributed in [0, 1]. Finally, $\theta$ is set to the parameter value that yielded the best utility estimate. Note that the ability to easily optimize for different values of $\lambda$ is one of the advantages of our batch-style learning approach. It is unclear how a more traditional online approach would support such validation.

Algorithm 1 Main loop of our approach.
 1: loop
 2:   // Begin critique stage.
 3:   C := C ∪ GETCRITIQUEDATA(θ)
 4:   // Begin practice stage.
 5:   T := T ∪ GETPRACTICETRAJECTORIES(θ)
 6:   for i = 1:n do
 7:     λ = λ_i                        // choose a value of λ
 8:     θ_i = argmax_θ J_λ(θ, C, T)    // see Equation 1
 9:     U_i = ESTIMATEUTILITY(θ_i)     // by executing π_{θ_i}
10:   end for
11:   i* = argmax_i U_i
12:   θ = θ_{i*}
13: end loop

¹In our experiments, each practice stage generated 10 trajectories in the following way: with 0.8 probability the trajectory was generated by the current policy; otherwise, the parameter vector was divided by a randomly drawn constant to increase the randomness/temperature of the policy in order to facilitate exploration.
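A compact sketch of this loop, written against hypothetical helper functions (get_critiques, get_trajectories, optimize_J, estimate_utility) rather than the authors' actual interfaces, might look as follows.

import numpy as np

def train(theta, get_critiques, get_trajectories, optimize_J,
          estimate_utility, lambdas=np.linspace(0.0, 1.0, 11), iterations=5):
    C, T = [], []
    for _ in range(iterations):
        C += get_critiques(theta)        # critique stage (Algorithm 1, line 3)
        T += get_trajectories(theta)     # practice stage (Algorithm 1, line 5)
        # Validate over candidate values of lambda (lines 6-12).
        candidates = [optimize_J(lam, C, T, init=theta) for lam in lambdas]
        utilities = [estimate_utility(th) for th in candidates]
        theta = candidates[int(np.argmax(utilities))]
    return theta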

Trajectory Data Objective. To compute $U(\theta, T)$ it is straightforward to use likelihood weighting in order to estimate the utility of $\pi_\theta$ using off-policy trajectories T, an approach that has been previously considered for policy-gradient RL [12]. Let $p_\theta(T_j)$ be the probability of $\pi_\theta$ generating trajectory $T_j$ and let $\theta_j$ be the parameters of the policy that generated $T_j$. An importance-weighted utility estimate is given by:

$U(\theta, T) = \frac{1}{W(T)} \sum_{T_j \in T} w(T_j)\, R_j, \qquad w(T_j) = \frac{p_\theta(T_j)}{p_{\theta_j}(T_j)}$

where $W(T) = \sum_{T_j \in T} w(T_j)$ and $R_j$ is the total reward of trajectory $T_j$. Note that the unknown MDP dynamics cancel out of the weighting factor $w(T_j)$, leaving only factors related to the policies. The gradient of $U(\theta, T)$ has a compact closed form.
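The estimator above can be computed directly from the stored trajectories. The sketch below (our own, not the paper's code) assumes each trajectory is a list of (state, action, reward) tuples and reuses the action_probs function from the earlier policy sketch.

import numpy as np

def log_traj_prob(theta, traj, actions, f, action_probs):
    # Sum of log pi_theta(a_t|s_t); the unknown transition probabilities are
    # the same for any policy and cancel in the importance weight.
    return sum(np.log(action_probs(theta, s, actions, f)[actions.index(a)])
               for s, a, _ in traj)

def utility_estimate(theta, trajectories, behavior_thetas, actions, f, action_probs):
    # Self-normalized importance-weighted estimate of U(theta, T).
    weights, returns = [], []
    for traj, theta_j in zip(trajectories, behavior_thetas):
        log_w = (log_traj_prob(theta, traj, actions, f, action_probs)
                 - log_traj_prob(theta_j, traj, actions, f, action_probs))
        weights.append(np.exp(log_w))
        returns.append(sum(r for _, _, r in traj))
    weights = np.asarray(weights)
    return float(np.dot(weights, returns) / weights.sum())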

4. CRITIQUE DATA OBJECTIVE

We now give our formulation for $L(\theta, C)$. We first introduce a supervised learning problem we call any-label learning (ALL) and then extend it to incorporate critique data.

Any-Label Learning. We consider a view of the end-user where, for any state s, there is a set of actions O(s) that the user considers to be "equally optimal" in the sense that actions in O(s) are considered to be equally good, but strictly better than actions outside of O(s). The set of actions will vary across users and even for the same user across time. Also, in many cases, these sets will not correspond to the true set of optimal actions in the MDP. However, if the end-user is reasonably competent in the application domain, it is reasonable to assume that, if available, learning from O(s) would likely provide useful guidance.

Given an ideal end-user, we could elicit the full action sets O(s) for states along the observed trajectories, resulting in a supervised training set of states labeled by action sets: $\{(s_1, O(s_1)), \ldots, (s_M, O(s_M))\}$. A natural learning goal for this data is to find a probabilistic policy, or classifier, that has a high probability of returning an action in $O(s_i)$ when applied to $s_i$. That is, it is not important which particular action is selected as long as the probability of selecting an action in O(s) is high. This leads to the definition of the ALL supervised learning problem, where the input is a set of data of the above form assumed to be drawn i.i.d. and the goal is to learn policy parameters that maximize the ALL likelihood: $\prod_{i=1}^{M} \pi_\theta(O(s_i)|s_i)$, where $\pi_\theta(O(s_i)|s_i) = \sum_{a \in O(s_i)} \pi_\theta(a|s_i)$.

ALL is distinct from other supervised learning problems where instances/states are associated with sets of labels/actions. The multi-label learning problem [15] differs in that the goal is to learn a classifier that outputs all of the labels in the sets and no others. This objective is not appropriate for our problem, where we only care about outputting one of the labels and do not care about the relative probabilities assigned to the actions in $O(s_i)$. The multiple-label learning problem [5] interprets the label sets as possible labels, only one of which is the true label. Thus, in multiple-label learning, the label sets are used to encode label ambiguity, which is in sharp contrast to ALL.

Expected Any-Label Learning. It is unlikely that end-users will actually provide the full action sets O(s). For example, they will often indicate that the action selected by the agent is bad and possibly mark one other action as good, even when there are multiple good alternatives. Thus, in reality we must make do with only the good and bad action sets $c^+$ and $c^-$ for a state, which only provide partial evidence about O(s). The naive approach of treating $c^+$ as the true set O(s) can lead to difficulty. For example, when there are actions outside of $c^+$ that are equally good compared to those in $c^+$, attempting to learn a policy that strictly prefers the actions in $c^+$ results in difficult learning problems. Furthermore, we need a principled approach that can handle the situation where either $c^+$ or $c^-$ is empty.

The key idea behind our loss function is to define a user model that induces a distribution over ALL problems given the critique data. The loss is then related to the expected ALL likelihood with respect to this distribution. We define a user model to be a distribution $Q(O(s)|c^+, c^-)$ over sets $O(s) \in 2^A$ (where A is the action set) conditioned on the critique data for s, and assume independence among different states. We analyze a simple model involving two noise parameters $\epsilon^+$ and $\epsilon^-$ and a bias parameter $\beta$. The parameter $\epsilon^+$ ($\epsilon^-$) represents the rate of noise for good (bad) labels, and $\beta$ is the bias probability that an unlabeled action is in O(s). The user model Q further assumes that the actions in O(s) are conditionally independent given the critique data, giving the following:

$Q(O(s)|c^+, c^-) = \prod_{a \in O(s)} q(a \in O(s)|c^+, c^-) \prod_{a \notin O(s)} q(a \notin O(s)|c^+, c^-)$   (2)

$q(a \in O(s)|c^+, c^-) = \begin{cases} 1 - \epsilon^+, & \text{if } a \in c^+ \\ \epsilon^-, & \text{if } a \in c^- \\ \beta, & \text{if } a \notin c^+ \cup c^- \end{cases}$

While simple, this model can capture useful expectations about both the domain (e.g., the typical number of good actions) and the users (e.g., the mislabeling rate).
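The per-action term of this user model is a simple three-way case split. A sketch, with our parameter names eps_pos, eps_neg, and beta standing in for the two noise parameters and the bias:

def q_in_O(a, good, bad, eps_pos=0.0, eps_neg=0.0, beta=0.5):
    # Probability that action a belongs to O(s), given the critique sets
    # c^+ (good) and c^- (bad) for state s, under the simple user model.
    if a in good:
        return 1.0 - eps_pos
    if a in bad:
        return eps_neg
    return beta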

Given Q, the distribution over ALL training sets $D = \{(s_1, O(s_1)), \ldots, (s_M, O(s_M))\}$ conditioned on critique data $C = \{(s_1, c_1^+, c_1^-), \ldots, (s_M, c_M^+, c_M^-)\}$ is $P(D|C) = \prod_{i=1}^{M} Q(O(s_i)|c_i^+, c_i^-)$. Our loss function $L(\theta, C)$ is now defined as the negative log of the expected ALL likelihood,

$L(\theta, C) = -\sum_{i=1}^{M} \log\left( E\left[ \pi_\theta(O(s_i)|s_i) \mid c_i^+, c_i^- \right] \right)$   (3)

where $O(s_i)$ is a random variable distributed as $Q(O(s_i)|c_i^+, c_i^-)$.

Computing $L(\theta, C)$ by expanding each expectation is impractical, since there are $|2^A|$ terms in each. However, the loss function has a compact closed form for the above user model.

THEOREM 1. If the user model is given by Equation 2, then

$L(\theta, C) = -\sum_{i=1}^{M} \log\left[ \beta + (1 - \epsilon^+ - \beta)\, \pi_\theta(c_i^+) + (\epsilon^- - \beta)\, \pi_\theta(c_i^-) \right]$

where, as in the ALL likelihood, $\pi_\theta(c) = \sum_{a \in c} \pi_\theta(a|s_i)$ for a set of actions c in state $s_i$.

PROOF. (Sketch) The expected ALL likelihood inside each logarithm term of $L(\theta, C)$ is given by

$E\left[ \pi_\theta(O(s_i)|s_i) \mid c_i^+, c_i^- \right] = \sum_{O \in 2^A} Q(O|c_i^+, c_i^-)\, \pi_\theta(O|s_i)$
$= \sum_{O \in 2^A} Q(O|c_i^+, c_i^-) \sum_{a \in O} \pi_\theta(a|s_i)$
$= \sum_{a \in A} \pi_\theta(a|s_i) \sum_{O : a \in O} Q(O|c_i^+, c_i^-)$
$= \beta + (1 - \epsilon^+ - \beta)\, \pi_\theta(c_i^+) + (\epsilon^- - \beta)\, \pi_\theta(c_i^-)$

The final equality follows from $\sum_{O : a \in O} Q(O|c_i^+, c_i^-) = q(a \in O(s_i)|c_i^+, c_i^-)$ and algebra.

In our experiments, we assume a noise-free ($\epsilon^+ = \epsilon^- = 0$) and unbiased ($\beta = 0.5$) user model, which yields $L(\theta, C) = -\sum_{i=1}^{M} \log\left(1 + \pi_\theta(c_i^+) - \pi_\theta(c_i^-)\right) + k$, where k is a constant. Optimizing this loss function results in maximizing the probability on $c_i^+$ while minimizing the probability on $c_i^-$, which agrees with intuition for this model. The gradient of this objective is straightforward to calculate for our log-linear policy representation.
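Using the closed form of Theorem 1, the critique loss can be evaluated without enumerating subsets of A. The following sketch reuses the CritiqueEntry and action_probs helpers introduced earlier and is an illustration under those assumptions, not the authors' implementation.

import numpy as np

def critique_loss(theta, critiques, actions, f, action_probs,
                  eps_pos=0.0, eps_neg=0.0, beta=0.5):
    # L(theta, C) from Theorem 1; with eps_pos = eps_neg = 0 and beta = 0.5
    # this reduces (up to a constant) to -sum log(1 + pi(c+) - pi(c-)).
    loss = 0.0
    for entry in critiques:
        p = action_probs(theta, entry.state, actions, f)
        p_good = sum(p[actions.index(a)] for a in entry.good)
        p_bad = sum(p[actions.index(a)] for a in entry.bad)
        loss -= np.log(beta + (1.0 - eps_pos - beta) * p_good
                       + (eps_neg - beta) * p_bad)
    return loss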

5. EXPERIMENTAL SETUP

RTS Tactical Micro-Management. We evaluate our approach on the problem of micro-management of units in tactical battles in the real-time strategy (RTS) game of Wargus, which runs on the open-source RTS engine Stratagus. In particular, we consider controlling a group of 5 friendly close-range military units against a group of 5 enemy units. Such micro-management is very difficult for human players due to the fast pace, and thus it is typical for players to rely on the default game behavior for micro-management, which can often be considerably sub-optimal. In our experiments we asked end-users to help teach the learning agent to improve tactical micro-management against the in-built Wargus AI. This problem is a natural match for learning from critiques and practice rather than other types of human interfaces. For example, given the fast pace and multiple units acting in parallel, it is very difficult to provide real-time feedback as in the TAMER system [7]. It is also difficult to provide demonstrations due to the inherent difficulty of the control.

We instrumented Stratagus with an API for an RL agent and with an interface that allows an end-user to watch a battle involving the agent and pause at any moment. The user can then scroll back and forth within the episode and mark any possible action of any agent as good or bad. The RL agent makes decisions every 20 game cycles, where the available actions for each military unit are to attack any of the units on the map (enemy or friendly), giving a total of 10 actions per unit. To control all friendly units the RL agent uses a simple strategy of looping through each unit at each decision cycle and selecting an action for it using a policy that is shared among the units. The experiences of all the units are pooled together for learning the shared policy. The reward function provides positive and negative rewards for inflicting and receiving damage respectively, with a positive and negative bonus for killing an enemy or being killed. Our log-linear policy used 27 hand-coded features that measure properties such as "the number of friendly units attacking the targeted unit", "the health of the target unit", etc.
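The paper does not list the 27 features, so the sketch below is purely hypothetical; it only mirrors the two examples quoted above to show the kind of state-action feature function f(s, a) the log-linear policy consumes. The attribute names friendly_units, target, hit_points, max_hit_points, and is_enemy are invented for illustration.

def unit_features(state, unit, target):
    # Hypothetical features for the action "unit attacks target".
    return [
        sum(1 for u in state.friendly_units if u.target is target),  # friendly units attacking the target
        target.hit_points / target.max_hit_points,                   # health of the target unit
        1.0 if target.is_enemy else 0.0,                             # attacking an enemy vs. a friendly unit
        1.0 if unit.target is target else 0.0,                       # unit is already attacking this target
        1.0,                                                         # bias feature
    ]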

Evaluated Systems. We evaluate three learning systems: 1) Pure RL, which corresponds to removing the critique stage from Algorithm 1 and setting $\lambda = 1$; 2) Pure Supervised, which corresponds to removing the practice stage from Algorithm 1 and setting $\lambda = 0$; and 3) Combined System, which learns from both types of data according to Algorithm 1.

User Study. For the user study we constructed two battle maps, which differed only in the initial placement of the units. Both maps had winning strategies for the friendly team and were of roughly the same difficulty. The user study involved 10 end-users not familiar with the system, 6 with CS backgrounds and 4 without a CS background. For each user, the study consisted of teaching both the pure supervised and the combined systems, each on a different map, for a fixed amount of time. We attempted to maintain a balance between which map and learning system was used first by each user. We allowed 30 minutes for the pure supervised system and 60 minutes for the combined system, due to the additional time required by the combined system for the practice periods. This time difference is intended to give the users the chance of providing roughly the same amount of advice in both cases. Note that in a real training scenario, an end-user would not be confined to the system during the practice stages, as they are in the user study, meaning that the practice sessions will typically have lower cost to the user in reality than in the user study. It also means that we necessarily had to use shorter practice stages than might typically be used in a real application. In our studies, practice session times ranged from 3 to 7 minutes depending on the amount of data and the convergence rate of gradient descent.

6. EXPERIMENTAL RESULTS

6.1 Simulated Experiments with User Data

To provide a more controlled setting for evaluating the effectiveness of our approach, we first present simulated experiments using real critique data collected from our user studies. For each of our two maps, we selected the worst- and best-performing users when training the combined system, for a total of 2 users per map. The total amount of critique data (number of good and bad labels provided) for each user was: User1 - 36, User2 - 91, User3 - 115, User4 - 33. This shows that different users have very different practices in terms of the amount of advice. For each user, we divided their critique data into four equal-sized segments, from which we created four data sets for each user containing 25%, 50%, 75%, and 100% of their critique data. We provided the combined system with each of these data sets and allowed it to practice for 100 episodes. All results are averaged over 5 runs and are given in Table 1. For each user and map, the rows correspond to the percentage of critique data used with the number of episodes |T| increasing from 0 (pure supervised) to 100. The top row of each table corresponds to pure RL (0% critique data). Each cell in the table records the value estimate of the best policy (estimated via running the policy several times) found in the practice episodes up to that point. For example, data points recorded for 50 episodes show the value of the best policy found after any of the first 50 episodes. This evaluation reflects that in practice a user would use the best policy found during the session, rather than simply use the final policy. However, our conclusions are unchanged if instead we employ the value of the actual policy resulting after each episode.

Benefit of Critiques. The data show that pure RL is unable to learn a winning policy (i.e. achieve a positive value) within 100 practice episodes on either map.² We also see that in almost all cases, as the amount of critique data increases, the performance improves for a fixed number of practice episodes. That is, the learning curves associated with less critique data are dominated by the learning curves associated with more critique data. The only exception is for User 3, where moving from 50% critique data to 75% critique data hurt performance for 0 and 10 RL episodes. However, beyond 10 RL episodes, the combined system is able to recover, and we see improvement in going from 50% to 75%. This is possibly due to the introduction of noisy critiques between 50% and 75% that could dominate when too few practice episodes are available. In all cases, a winning policy was found using no more than 75% of the critique data. We ran pure RL for up to 500 episodes and found that it achieved -7.4 on Map 1 and 12.1 on Map 2, both of which are significantly worse than the combined system. Overall, these results give strong evidence that our approach is effective at leveraging the critique data of end-users to improve the effectiveness of practice.

Benefit of Practice. Each column of data shows the performance trend of increasing the amount of critique data for a fixed amount of practice. First, we see that for all but one user, the pure supervised approach (column |T| = 0) was able to find a winning strategy when considering 100% of the critique data. This indicates that even with no practice, the critique data from most users was sufficient to outperform pure RL. Next, comparing columns from left to right (increasing practice), there is a relatively rapid improvement in performance for each fixed amount of critique data. Overall, this provides strong evidence that our approach is able to leverage practice episodes in order to improve the effectiveness of a given amount of end-user critique data.

6.2 Results of User Study

Table 2 summarizes our user study. Each row corresponds to a particular performance level and each column corresponds to a particular learning system (pure supervised or combined) on either Map 1, Map 2, or the combined map results. Each entry in the table lists the fraction of end-users, for the particular system/map combination, that were able to achieve the corresponding performance level during the study. For example, on Map 1, 5 of the 6 users that trained the supervised system achieved a performance of at least 50.

Comparing to Pure RL. Recall that the performance of the pure RL system after 500 episodes is -7.4 and 12.16 on Map 1 and 2 respectively, which requires significantly more time than our users were given. On Map 1 all users were able to achieve -50 using either system, and all but one user (for the combined system) was able to achieve 0. Half of the users were able to achieve 50 with the combined system, and all but one achieved 50 with the pure supervised system. For Map 2, all users were able to achieve at least 50 using either system. These results show that the users were able to significantly outperform pure RL using both the supervised and combined systems.

Comparing Combined and Pure Supervised. The end-users had slightly greater success with the pure supervised system versus the combined system. In particular, more users were able to achieve performance levels of 50 and 100 using the supervised system. Based on the questionnaire feedback and our observations of the users, we conjecture a combination of reasons for this difference.

²We have found that, in this domain, our pure RL approach is generally more effective than several other alternatives including OLPOMDP and Q-learning, both of which are unable to learn winning strategies in any reasonable number of episodes.


Map 1 Results
%C \ |T|       0      10      20      30      40      50      60      70      80      90     100
RL (0%)   -230.9  -188.6  -170.5  -166.9  -166.9  -132.5  -124.6  -108.7  -104.9   -81.8   -79.2
User 4:
  25%       -166    -103    -103     -73   -68.4   -59.4   -59.4   -59.4   -47.8   -41.1   -34.8
  50%        -33     -33      88    89.2    89.2    89.2    99.4    99.4    99.4    99.4    99.4
  75%        126     126     126   129.6   135.2   135.2   135.2   135.2   135.2   135.2   135.2
  100%       160     160   161.2   161.2   161.2   161.2   161.2   161.2   161.2   161.2   161.2
User 3:
  25%       -119  -123.6  -105.4   -97.9   -76.8   -72.8   -72.8   -69.4   -69.4   -69.4     -49
  50%        -86   -38.2   -38.2   -35.6   -27.6   -20.2    -9.8    -9.8    -9.8      -9      -9
  75%       -130   -88.3   -28.4      34    45.6      47      47      47    76.8    76.8    76.8
  100%       -96     -93    65.9    78.3    78.7    83.1    83.1    91.5    91.5    91.5    92.6

Map 2 Results
%C \ |T|       0      10      20      30      40      50      60      70      80      90     100
RL (0%)   -210.9  -130.6   -66.1   -61.6   -43.9   -43.9   -40.6   -40.6   -35.4   -33.7   -33.7
User 1:
  25%       -147   -88.7   -70.4   -52.8     -44     -44   -29.6   -26.9   -18.3   -18.3    -5.7
  50%       -123   -84.9     -20   -13.8   -13.8    -4.4    -4.4    -0.4    -0.4    -0.4    -0.4
  75%        -35     -35   -29.8    -4.2     6.4    15.2      29    35.2    35.2      50      56
  100%        21      21      33    51.2    51.4    60.8    77.9    86.3    87.6    87.6    87.6
User 2:
  25%       -156  -154.8  -117.4  -116.4  -113.2  -112.4  -102.8  -101.4     -96   -84.4     -83
  50%       -160  -116.9   -57.2   -39.4    14.8    18.2    18.2    39.4    39.4    53.6    53.6
  75%        -48   -17.8     -14    37.2    52.4    58.8    58.8    76.2    76.2    82.2      95
  100%        64      64      64    72.4    76.4      96   117.6   118.8   118.8   118.8   118.8

Table 1: Results of simulated experiments for four end-users. Rows correspond to using different percentages of the critique data, with the number of practice episodes |T| increasing from 0 to 100. Negative values in a cell indicate a losing policy, and positive values indicate a winning policy. The magnitudes reflect the total hit points remaining for the winning team.

Performance    Map 1                   Map 2                   Both maps
level          Supervised  Combined    Supervised  Combined    Supervised  Combined
-50               6/6         4/4         4/4         6/6        10/10       10/10
0                 6/6         3/4         4/4         6/6        10/10        9/10
50                5/6         2/4         4/4         6/6         9/10        6/10
80                3/6         1/4         3/4         3/6         6/10        4/10
100               1/6         0/4         1/4         2/6         2/10        2/10

Table 2: Results of user studies on Map 1 and Map 2. Each row corresponds to a policy performance level, indicated by the first column entry. For each map, results are given for the pure supervised and combined systems. Each table entry lists the fraction of participants for the map/system pair that were able to achieve the corresponding performance level within the time constraints of the user study. For reference, using much more time, pure RL was unable to achieve level -50 for Map 1 and was only able to achieve a performance of 12 on Map 2.

Some users found the pure supervised system to be more user friendly, since there was no delay experienced while waiting for the practice stages to end. Because of this, the users were able to see the effects of their critiques almost immediately, and as a result they gave more and possibly better advice. For the majority of the users, the amount of advice given to the pure supervised system was nearly twice that given to the combined system. Note that in a real scenario the end-user would not need to wait around for the practice period to end, but could periodically check in on the system.

One could argue that the practice episodes of the combined system should compensate for the reduced amount of critique data. However, it was observed that, for some users, the policies returned after certain practice sessions were quite poor and even appeared to ignore the user's previous critiques. The users found this quite frustrating, and it likely impacted the quality of their critiques. We believe that such occurrences can be partly attributed to the fact that the amount of practice used in the user study between critique stages was not sufficient to overcome the reduction in the amount and perhaps quality of critique data. Indeed, our simulated experiments showed that, with a reasonable amount of practice, there were significant performance gains over no practice at all. An important future investigation is to design a user study where the user is not captive during the practice sessions, but rather checks in periodically, which will better reflect how the combined system might be used in reality. Nevertheless, in general, it is likely that with our current combined system, users will experience occasional decreases in performance after practice and perceive that the system has not taken their critiques into account. A lesson from this study is that such behavior can be detrimental to the user experience and the overall performance achieved.

We considered other factors that could have led to interesting observations about the study. One was examining whether playing a map first versus second results in any difference in performance. However, we found no significant differences. Another was whether a CS versus non-CS background resulted in any difference in performance. Again, we found no significant differences.

Advice Patterns of Users. Figure 1 shows the fraction of positive, negative, and mixed advice types for each user. For both systems, the majority of the advice given is mixed. This suggests that the typical teaching pattern followed by users is to critique the agent's actions and suggest (possibly multiple) better alternatives.

Figure 1: Fraction of positive, negative, and mixed advice, shown separately for (a) the supervised system and (b) the combined system. Positive (or negative) advice is where the user only gives feedback on the action taken by the agent. Mixed is where the user not only gives feedback on the agent's action but also suggests alternative actions to the agent.

However, many users (for supervised: users 2, 8, 4, 5, 3, and 9; for combined: users 2, 6, 3, and 9) still provide a good amount of negative-only feedback. Further analysis showed that for users 2, 8, 5, 3, and 9 on the supervised system, negative advice is more frequent because they start out by giving a lot of negative advice in the first critiquing round itself. In subsequent rounds, they give more mixed or positive advice. For user 4, however, a decrease in performance in a later round encouraged negative advice. Likewise for the combined system, users 2, 3, and 9 start out by giving a lot of negative advice in the first critiquing round, followed by mixed or positive advice in subsequent rounds. For user 6, a decrease in performance encouraged negative advice.

Hence we observe different groups of users based on their advice patterns. Users 2, 3, and 9 show a similar pattern where they give more negative advice at the beginning and mixed or positive advice in the later stages. Users 4 and 6 gave mixed advice from the beginning, except that a decrease in performance encouraged them to be more critical and give negative advice. Users 1, 10, and 7 almost always gave mixed advice. Only users 8 and 5 showed variation in giving negative advice: for the supervised system, they started out by giving more negative advice in the first critiquing round, but they did not do this for the combined system.

A similar analysis for positive advice shows that many users (for supervised: 7, 8, 4, 5, 6, and 9; for combined: 2, 7, 8, 4, 5, 6, and 9) provide a good amount of positive advice as well. This positive advice serves as motivational feedback from the user to the agent, encouraging the agent's improving behavior. Other users (1, 10, and 3) almost always gave mixed advice.

7. FUTURE WORK

Our current results are promising. However, the user study suggests significant usability challenges when end-users enter the learning loop. An important part of our future work will be to conduct further user studies in order to pursue the most relevant directions, including: enriching the forms of advice, increasing the stability of autonomous practice, studying user models that better approximate end-users, and incorporating end-users into model-based RL systems.

8. REFERENCES

[1] B. Argall, B. Browning, and M. Veloso. Learning by demonstration with critique from a human teacher. In International Conference on Human-Robot Interaction, 2007.
[2] B. Argall, B. Browning, and M. Veloso. Learning robot motion control with demonstration and advice-operators. In IROS, 2008.
[3] A. Billard, S. Calinon, R. Dillmann, and S. Schaal. Robot programming by demonstration. In Handbook of Robotics. 2008.
[4] A. Coates, P. Abbeel, and A. Y. Ng. Learning for control from multiple demonstrations. In ICML, 2008.
[5] R. Jin and Z. Ghahramani. Learning with multiple labels. In NIPS, 2003.
[6] W. B. Knox and P. Stone. Combining manual feedback with subsequent MDP reward signals for reinforcement learning. In Proc. of the 9th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2010), May 2010.
[7] W. B. Knox and P. Stone. Interactively shaping agents via human reinforcement: the TAMER framework. In International Conference on Knowledge Capture, 2009.
[8] G. Kuhlmann, P. Stone, R. Mooney, and J. Shavlik. Guiding a reinforcement learner with natural language advice: Initial results in RoboCup soccer. In AAAI Workshop on Supervisory Control of Learning and Adaptive Systems, 2004.
[9] R. Maclin, J. Shavlik, L. Torrey, T. Walker, and E. Wild. Giving advice about preferred actions to reinforcement learners via knowledge-based kernel regression. In AAAI, 2005.
[10] R. Maclin, J. Shavlik, T. Walker, and L. Torrey. Knowledge-based support-vector regression for reinforcement learning. In Reasoning, Representation, and Learning in Computer Games, 2005.
[11] R. Maclin and J. W. Shavlik. Creating advice-taking reinforcement learners. Machine Learning, 22(1):251-281, 1996.
[12] L. Peshkin and C. Shelton. Learning from scarce experience. In ICML, 2002.
[13] M. T. Rosenstein and A. G. Barto. Supervised actor-critic reinforcement learning. In J. Si, A. Barto, W. Powell, and D. Wunsch, editors, Learning and Approximate Dynamic Programming: Scaling Up to the Real World, pages 359-380. 2004.
[14] A. L. Thomaz and C. Breazeal. Teachable robots: Understanding human teaching behavior to build more effective robot learners. Artificial Intelligence, 172(6-7):716-737, 2008.
[15] G. Tsoumakas and I. Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1-13, 2007.


Beyond Teleoperation: Exploiting Human Motor Skills with MARIOnET

Adam Setapen
Dept. of Computer Science
University of Texas at Austin
[email protected]

Michael Quinlan
Dept. of Computer Science
University of Texas at Austin
[email protected]

Peter Stone
Dept. of Computer Science
University of Texas at Austin
[email protected]

ABSTRACT

Although machine learning has improved the rate and accuracy at which robots are able to learn, there still exist tasks for which humans can improve performance significantly faster and more robustly than computers. While some ongoing work considers the role of human reinforcement in intelligent algorithms, the burden of learning is often placed solely on the computer. These approaches neglect the expressive capabilities of humans, especially regarding our ability to quickly refine motor skills. In this paper, we propose a general framework for Motion Acquisition for Robots through Iterative Online Evaluative Training (MARIOnET). Our novel paradigm centers around a human in a motion-capture laboratory who "puppets" a robot in real time. This mechanism allows for rapid motion development for different robots, with a training process that provides a natural human interface and requires no technical knowledge. MARIOnET is fully implemented and tested on two robotic platforms (one quadruped and one biped), and this paper demonstrates that it is a viable way to directly transfer human motion skills to robots.

Categories and Subject Descriptors
I.2.9 [Artificial Intelligence]: Robotics - Operator interfaces, Kinematics and dynamics

General Terms
Human Factors, Design, Experimentation

Keywords
Human-robot/agent interaction, Agent development techniques, tools and environments

1. INTRODUCTION

As robots become more commonplace, the tools to facilitate knowledge transfer from human to robot will be vital, especially for non-technical users. Therefore, when designing autonomous robots that interact with humans, not only is it important to leverage advances in machine learning, but it is also useful to have the tools in place to enable direct transfer of knowledge between man and machine. We introduce such a tool for enabling a human to teach motion capabilities to a robot.

Specifically, this paper describes a direct and real-time interface between a human in a motion-capture suit and a robot. In our framework, the learning happens exclusively by the human, not the robot. Our approach exploits the rate at which humans are able to learn and refine fine motor skills [19, 14]. Called MARIOnET, for Motion Acquisition in Robots through Iterative Online Evaluative Training, the interface has been implemented on two robots, one quadruped and one humanoid. As the name indicates, MARIOnET is a form of iterative online evaluative training. The human performs a motion and the robot mimics it in real time. The human evaluates the robot's performance, and repeats the motion accounting for any errors perceived in the robot's previous actions. This loop is continued until a sufficient motion sequence is obtained.

Our results indicate that humans are able to quickly improve a robot's performance of a task requiring fine motor skills. The primary contribution of this paper is a new paradigm for directly encoding motion sequences from a human to a robot. One motivation of our approach is to develop an efficient method for generating cyclical motion sequences. We empirically evaluate the rate at which human subjects learn to exploit a direct robot mapping, and demonstrate that MARIOnET is a powerful way to harness the cognitive flexibility of humans for quickly training robots.

The remainder of the paper is organized as follows. Section 2 describes our motivation and provides background information necessary to understand the technical details of our approach. Section 3 provides a detailed account of the implementation and training process of MARIOnET. Section 4 outlines the hardware used in our evaluation and describes our experimental results. Section 5 discusses related work, and finally Section 6 presents our conclusions and possible future work.

2. MOTIVATION AND BACKGROUND

This section provides detailed motivation for our work, briefly situates MARIOnET within the learning agents literature, and describes the underlying technology upon which MARIOnET is built.

2.1 Motivation

In this paper, we examine a new approach towards robot motion acquisition through an innovative training paradigm that exploits a system finely tuned by thousands of years of biological evolution: the human body. While the process by which humans are able to learn exceptionally quickly is not yet fully understood, work being done on the neurological basis of learning is steadily shedding light on how we rapidly acquire and apply new knowledge [13]. Recent breakthroughs in behavioral motor control have enhanced our understanding of the human brain and illustrate how remarkable our innate capacity for delicate motor control is [19]. Muellbacher et al. report that given a 60-minute training period, human subjects can rapidly optimize performance of a complex task involving fine motor control [14].

The high-level motivation for MARIOnET is that a real-time mapping from a human to a robot will serve as a convenient interface for quickly and systematically training efficient motion sequences. While there is certainly a difference in the natural dynamics of robots and Homo sapiens, it is our belief that people's ability to quickly hone fine motor skills can be exploited to rapidly train diverse robot motions. Even if the mapping from human coordinates to robot coordinates is not exact, we hypothesize that humans will be able to rapidly learn to correct for any inconsistencies. Additionally, the prospect of mapping any human limb to any robot limb allows for a flexible training process (e.g., mapping human arms to robot legs).

Our long-term vision is to use MARIOnET to train robust motions for robots such as walking, running, and kicking. This could be useful in many domains, such as robot soccer, where most people create these behaviors by hand or via extensive learning experiments using constrained parameterizations, causing a lot of wear and tear on the robots [10]. Generally, programming specialized robot motions requires a significant amount of coding, which is not possible for most people, and is not necessary when using MARIOnET.

MARIOnET uses real-time mimicking on the part of the robot because we believe it is essential that the human subject is able to quickly evaluate the actions of the robot in order to refine its movement in subsequent training. For example, imagine the following situation in attempting to "puppet" a humanoid robot: the human subject realizes that when training a biped robot to walk, the robot frequently loses balance and topples forward. The human can try different things to correct this: lengthening the stride, reducing the knee-bend, etc., all in real time while watching the robot. Although MARIOnET uses teleoperation to interact with the robot, it represents a new methodology for knowledge transfer. The central hypothesis of this work is that humans are skilled enough at fine motor control that minor nuances essential for maximizing performance in robot locomotion (otherwise discoverable only through computationally expensive and time-consuming exhaustive methods) may be found with significantly less effort. Furthermore, since the training process proposed by MARIOnET provides an intuitive interface requiring no technical knowledge, our approach facilitates the direct transfer of motor knowledge from human to robot by any non-technical user.

2.2 ML and HRI

Machine learning (ML), the study of algorithms that improve automatically through experience, has drastically improved the rate at which robots can learn. Recently, machine learning algorithms have seen great success in training robots to move quickly and efficiently, and there are numerous case studies in which ML has been used for online and off-line performance improvement in multi-agent autonomous robot environments [10, 16].

Human-robot interaction (HRI) examines the principles and methods by which robots and humans can naturally communicate and interact. As robots become more immersed in our everyday lives, the ability for non-technical users to program, train, and interact with these machines will be vital. Thus, any viable framework for human-robot interaction should require very little technical knowledge to use. Additionally, HRI systems should aim to make the method of communication between robot and human as natural as possible, namely by providing a convenient interface for the human [1]. Steinfeld et al. emphasize the need for any viable human-robot interface to cater to non-technical users, and state that when testing any interface it is "critical to recruit subjects having a broad range of knowledge, experience, and expertise" [20].

2.3 Robot Locomotion

Historically, robot motion has been written by experts and falls into two main categories: open-loop and closed-loop solutions. In both approaches the limbs are roughly performing trajectory following, where the trajectory is either pre-calculated (open-loop) or is calculated based on sensors and dynamics (closed-loop). Both approaches require significant technical knowledge and considerable development time, and neither is suitable for non-technical users.

While machine learning has been applied to learn/optimize these trajectories [18, 22, 11], there is still a large amount of code required to initially define the motion sequences, placing a significant burden on the original robot programmer.

Inverse kinematics (IK) encompasses the task of determining the joint angles that will move a robot's limb to a desired end-effector position given in XYZ coordinates (relative to the robot's body). Two main methods exist for calculating inverse kinematics: a) Jacobian-based iterative approaches [5] and b) direct trigonometric calculation. In the work presented in this paper, both techniques are used.

On the humanoid robot used in this paper, each leg has six degrees of freedom. As a result there exist many solutions to the IK calculation; that is, multiple configurations of the joints can result in the same end-effector position. For that reason we use a Jacobian approach based on a Denavit-Hartenberg [8] representation of the limbs. This approach solves for the smallest set of joint movements that result in the end-effector being within a small tolerance ε of the desired location. The advantage of this approach is that the robot generally does a good job of reaching any possible location, even getting as close as possible to impossible ones. The disadvantage is that often multiple iterations over the inverse kinematics solver are required, which can be computationally expensive.

On the quadruped robot we use, each leg has only three degrees of freedom, meaning that in most locations there exists only one unique solution for the IK calculation. For that reason a direct trigonometric approach is used, in which we can accurately determine the three required angles. This approach is extremely fast to calculate and is ideal for many robots with relatively low processing power. However, at a few select locations two solutions exist and the robot can occasionally oscillate between these solutions. This is generally not a problem as these limb positions are rarely needed by the robot.
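As an illustration of the direct trigonometric style of solution (simplified to a planar two-link limb rather than the robots' actual three- or six-degree-of-freedom legs), the closed-form joint angles can be computed with the law of cosines. This is a generic textbook sketch, not the code running on the AIBO or Nao.

import math

def two_link_ik(x, y, l1, l2, elbow_up=True):
    # Closed-form IK for a planar two-link limb with segment lengths l1, l2.
    # Returns (theta1, theta2) placing the end effector at (x, y), or None
    # if the target lies outside the limb's workspace.
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if c2 < -1.0 or c2 > 1.0:
        return None
    theta2 = math.acos(c2)
    if elbow_up:
        theta2 = -theta2
    theta1 = math.atan2(y, x) - math.atan2(l2 * math.sin(theta2),
                                           l1 + l2 * math.cos(theta2))
    return theta1, theta2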

2.4 Motion Capture

Motion capture is a technique of recording the detailed movements of a performer, thereby capturing a digital model of their movements. Motion capture systems may be passive, where a performer wears reflective materials and light is generated near the camera, or active, where various LEDs on the subject's body emit light which is detected by surrounding cameras. Typically, the LEDs are strobed on and off very quickly in pre-defined groups, which enables realtime marker disambiguation by the cameras. State-of-the-art active motion capture systems allow for precise representations of human poses, resulting in a complete digital representation in Cartesian space.

In this work we use a PhaseSpace IMPULSE active-marker motion capture system that employs 16 high-sensitivity 12.6 megapixel cameras positioned overhead, surrounding a 20-by-20-foot tracking area. A human subject wears a black virtual-reality body suit, on which 36 LED markers are strategically placed. With a sample rate of 480 Hz and a latency of less than 10 ms, the PhaseSpace IMPULSE system is a fast and accurate way of capturing even the most subtle of human movements.

3. MARIONET

This section describes the MARIOnET framework. Section 3.1 describes the implementation details of MARIOnET, and Section 3.2 describes the MARIOnET user interface and training methodology.

3.1 Implementation

MARIOnET has been implemented in two distinct modules: a) a C++ framework with a custom client to connect to a motion-capture server and b) a generalized motion module that is directly implemented on each robot. A fully-functional graphical user interface (GUI) has also been developed that facilitates training. The motion capture data is down-sampled and commands can be sent to the robot at 8-30 Hz.

We represent each human limb as a vector of points that can be initialized to a "neutral" position. In this way, we can precisely represent any human pose by relating the current pose to a neutral position. The difference between these vectors is then transformed to a coordinate system appropriate for a particular robot, and a resulting set of robot joint positions is generated by calculating a solution to inverse kinematics. The control flow of our interface can be seen in Figure 1. An initial configuration procedure correlates the bounds of each human subject to the bounds of the robot, and captures a neutral human pose.

The main MARIOnET algorithm (Algorithm 1) will now be described. Once initialized, the client enters the main loop and captures the markers from the motion capture server, decoding each point to a body part based on unique marker IDs (line 2). The decoded packet is then transformed from the absolute coordinate system of the motion capture system to a relative coordinate system appropriate for a robot (line 5). This transformation is accomplished by calculating a forward-facing vector orthogonal to the plane created by the human's torso, and rotating every point in the pose accordingly. These vectors, now in a relative coordinate system, are scaled down to the robot's size by considering the subject's body size in conjunction with the robot's physical bounds (line 6).
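A rough sketch of such an absolute-to-relative transformation is shown below: it builds an orthonormal torso frame whose forward axis is the normal of the plane spanned by three torso markers and re-expresses every marker in that frame. The marker names and the choice of three reference markers are our own assumptions, not details from the MARIOnET implementation.

import numpy as np

def absolute_to_relative(pose, torso_ids=("chest_left", "chest_right", "waist")):
    # pose maps marker id -> 3D point in the motion-capture world frame.
    p0, p1, p2 = (np.asarray(pose[k], dtype=float) for k in torso_ids)
    origin = (p0 + p1 + p2) / 3.0

    right = p1 - p0                # in-plane direction across the chest
    up = p2 - origin               # roughly vertical in-plane direction
    forward = np.cross(right, up)  # normal of the torso plane

    # Build an orthonormal basis: forward (x), lateral (y), up (z).
    x = forward / np.linalg.norm(forward)
    z = up - np.dot(up, x) * x
    z /= np.linalg.norm(z)
    y = np.cross(z, x)

    # Rows of R are the new axes, so R @ v expresses v in the torso frame.
    R = np.stack([x, y, z])
    return {k: R @ (np.asarray(v, dtype=float) - origin) for k, v in pose.items()}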

After scaling to the appropriate robot coordinate system, a mapping is applied from human limb to robot limb (line 7). It is possible to map limbs one-to-one (for example, when fully controlling a humanoid robot), or one-to-many. For example, a "trot" gait can be generated for quadrupeds by mapping the human's left arm to the robot's front left and back right legs, and the human's right arm to the robot's front right and back left legs. The user can select different mappings through the GUI without the need to recompile any code. The GUI also includes realtime interactive sliders for scaling outgoing robot coordinates. The sliders allow independent control of the x, y, and z values for both arms and legs, which are applied after limb mapping (line 8).

Figure 1: Control flow of the MARIOnET interface

During training, it is often useful to have a "looped" motion sequence. For example, the human could take two steps and wish the robot to repeat this sequence over and over, resulting in a continuous gait. To facilitate a natural human interface, we have implemented hand-gesture recognition to control the looping state of the robot. Whenever the human touches his thumb and pinky fingers together, the robot changes its looping state. There are three looping states: live capture that is not being recorded, live capture that is being recorded for looping, and loop playback. Every time a hand gesture by the human is detected, MARIOnET updates an internal looping signal. All encountered loops are stored, and can be replayed using the GUI or saved directly as a sequence of joint angles on the robot that can be called from any high-level behavioral code to reproduce the looped motion. The lines of Algorithm 1 that deal with looping are 3-4, 9-10, and 13-15.

Input: config (individual's configuration), mapping (desired mapping)
 1: while TRUE do
 2:     pose = getMarkers();
 3:     loopState = handGestureDetected(pose);
 4:     if loopState != LOOPING then
 5:         relPose = absoluteToRelative(pose);
 6:         robotPose = transformToRobot(config);
 7:         robotPose.applyMapping(mapping);
 8:         robotPose.scale(GUI.scalars());
 9:     else
10:         (relPose, robotPose) = loop.nextFrame();
11:     end
12:     sendToRobot(robotPose);
13:     if loopState == RECORDING then
14:         loop.add(relPose, robotPose);
15:     end
16: end

Algorithm 1: Main motion-capture algorithm

At this stage, the MARIOnET client has a complete representation of the human's body that is scaled down to the robot's coordinate system and altered to represent a particular limb mapping. This information is sent to the robot wirelessly in the form of a packet (line 12), and the algorithm returns to the start of the main loop.

We now turn our attention to the robot motion algorithm, which must be implemented for each model of robot that MARIOnET wishes to communicate with. Pseudocode representing a generic robot motion algorithm can be found in Algorithm 2. Every time the robot receives a packet (line 3), it simply calculates a solution to inverse kinematics for each limb (lines 4, 5), sets the corresponding joint angles using interpolation (line 6), and conveys its current loop state to the user through speech and LED indicators (line 8).

1: setPose(INITIAL-POSE);
2: while TRUE do
3:     robotPose = getLatestCommand();
4:     foreach Effector e ∈ robotPose do
5:         [θ] = solveInvKin(e);
6:         interpolateJoints(e, [θ]);
7:     end
8:     conveyLoopState();
9: end

Algorithm 2: Robot motion control algorithm

3.2 Training and Interface

The MARIOnET GUI allows viewing of both human and robot kinematics. The GUI supports realtime tuning of mapping scalars, and provides a mechanism for recording motion sequences for later use. For example, if the human notices that the robot's arms are always "too close" to the front of its body, the user can simply increase the x-direction scalar of the arms through the GUI.

The training process works best with two people: one controlling the MARIOnET GUI and one in the motion-capture suit. The first step in training is creating a configuration file for the human, which is generated by prompting the trainer to successively place their arms at their sides, fully extended forward, fully extended to the sides, and fully extended upward. This process initializes mapping scalars which correlate to the corresponding physical bounds of the robot. The first author of this paper controlled the GUI for all of the training sessions. After initialization, the human and robot are synchronized and live motion-capture data is sent to the robot. The training process is as follows: the human performs a motion, which can be seen through realtime mimicking by the robot. The human evaluates the robot's performance, and repeats the motion accounting for perceived error in the robot's previous actions. This loop is continued until a satisfactory motion sequence is obtained.

4. EXPERIMENTS AND RESULTS

In this section, we describe various experiments that evaluate the effectiveness of our approach. First, we outline the robot hardware used in the current implementation of MARIOnET. Then, we analyze the use of MARIOnET for an episodic closed-loop task using the Nao. Finally, we show that our approach is also useful for capturing cyclical open-loop motions, such as walking, using the AIBO. This section primarily evaluates the ease of the training interface and assesses the ability of a human to quickly improve at a task involving fine motor control.

4.1 Experimental Robot Platforms

4.1.1 Sony AIBO

The AIBO is a sophisticated quadruped robot that was mass produced by Sony from 1999 to 2006 (see Figure 3). The ERS-7 model (used in these experiments) has an internal 64-bit RISC processor with a clock speed of 576 MHz. The robot has 20 degrees of freedom: 3 in the head, 1 in each ear, 1 in the chin, 3 in each leg, and 2 in the tail. The robot also contains an 802.11 wireless card that enables external communication.

4.1.2 Aldebaran Nao

We have also implemented MARIOnET on a humanoid robot called the Nao (see Figure 2). Developed by Aldebaran Robotics, the Nao recently replaced the AIBO as the robot for the RoboCup Standard Platform League. The Nao contains an AMD Geode 500 MHz processor, 256 MB of memory, and includes 802.11 wireless capabilities. Measuring 23 inches and weighing just under 9.6 pounds, the Nao has 21 degrees of freedom and body proportions similar to those of a human. Each foot of the robot contains four force-sensitive resistors, and the Nao houses an integrated inertial measurement unit with a two-axis gyrometer and three-axis accelerometer.

4.2 Humanoid Tasks

Eight volunteers served as test subjects, and each subject completed a 45- to 90-minute interactive training session with the Nao. Our test subjects consisted of three technical users and five non-technical users.

The setup of our episodic task, Car-Park, can be seen in Figure 2(a). The robot stands in front of a surface with two distinct boxes: a source and a sink. The human stands behind the robot and attempts to guide the robot to move a toy car from the source to the sink. The robot starts with both arms at its sides, and the task is completed when all of the car's wheels reside inside the bounds of the sink. If the car is knocked off the surface, the subject is given a three-second penalty.

The test subjects performed 60 iterations of Car-Park. For the first 10 episodes, the average time to completion was 28.5 seconds; for the last 10 episodes the average was 6.8 seconds. As can be seen in Figure 2(b), the learning curve representing elapsed time to complete Car-Park decreases significantly over 60 iterations. The entire training session took less than 1 hour, and the subjects decreased their average completion time by a factor greater than 4. This experiment helps verify our hypothesis that humans can quickly learn to control robots via the MARIOnET interface.

The solution for Car-Park that every user eventually converged on was to use both arms: one to nudge the car and the other to stop it at the correct location. This coordinated sequence is the type of motion that might have taken a standard ML algorithm a long time to find, and would certainly require significant exploration of the state space.

4.3 Quadruped Locomotion

While the Car-Park experiment is a closed-loop control task in which the human continually controls the robot, we envision MARIOnET's main usefulness being in generating open-loop control sequences such as periodic gaits. Due to the current fragility of the Nao robots, we limited our locomotion tests to the quadruped platform.


Figure 2: Experimental setup of Car-Park (a) and average learning curve over 60 iterations (b)

Six volunteers served as test subjects in evaluating the effectiveness of MARIOnET for quadruped locomotion, again consisting of both technical and non-technical users (3 technical, 3 non-technical). Due to physical limitations of the body suit, a "crawling" motion by the human was infeasible, because marker visibility on the front of the body suit is critical for coordinate transformations and limb detection. Therefore, we flipped the problem, and the human, upside down. Each subject lay on his back, and each human limb was mapped to the corresponding robot effector. The task was to get the AIBO to walk one meter.

Four of the test subjects were able to get the robot to walk one meter at least once during a 20-minute training period. The training position with the human on his back is somewhat strenuous, and the other two subjects eventually gave up. However, the four successful subjects exhibited dramatic improvement in walk speed over the course of their sessions. Generally, the subjects had a "moment of clarity", in which they found the correct general trajectory to achieve robot stability. After finding this stable region, the users simply adjusted various aspects of their motion and evaluated any changes on the robot. A learning curve consisting of eight iterations can be seen in Figure 4. While half of the individuals successfully made use of the looping mechanism, the other half preferred to control the robot in real-time for the duration of the task.

Figure 3: An example training session using the Sony AIBO

Interestingly, the two users unable to produce a walk were both technically inclined (computer scientists). All non-technical subjects achieved a steep learning curve, indicating that technical expertise is not needed to use MARIOnET. The fastest looped walk achieved a velocity of 18.8 cm/s, and the subject had only trained for 17 minutes before achieving this speed. To put this number in context, some of the fastest AIBO walks found through optimization algorithms are in excess of 34 cm/s [17], while the standard walk Sony includes with the AIBO is 3.2 cm/s. However, most parameter optimization techniques start with a decent hand-coded walk, while MARIOnET starts from scratch. It should be noted that the output of the MARIOnET learning could be used as the starting point for these optimizations.

Figure 4: Learning curves of 8 iterations by the four successful subjects

The first time a subject controls a robot, it takes approximately 1 minute to tune the interactive scalars to appropriate values, which correlates a comfortable human pose to a stable robot pose. A video illustrating both robots in action can be found at www.cs.utexas.edu/~AustinVilla/?p=research/marionet.

5. RELATED WORK

Learning from demonstration (LfD, or imitation learning) is a process in which a robot attempts to learn a task by observing a demonstration, typically performed by a human. LfD is a promising way of transferring knowledge from agent to agent, and work by Dautenhahn and Nehaniv illustrates how many animals use this technique to learn new skills [7]. A good deal of recent work in LfD indicates that using human feedback as a reward signal to a reinforcement learning or policy-search algorithm can significantly improve learning speed [21, 9, 6, 2]. These studies illustrate the importance of harnessing human-robot interfaces in order to "design algorithms that support how people want to teach and simultaneously improve the robot's learning behavior" [21]. Thomaz and Breazeal coin this paradigm "socially guided machine learning", where the benefits of machine learning are combined with the intuitive knowledge of humans.

Breazeal and Scassellati posit that there are four integral questions to consider when designing a system that uses learning from demonstration [4]: a) How does the robot know when to imitate? b) How does the robot know what to imitate? c) How does the robot map observed action into behavior? d) How does the robot evaluate its behavior, correct errors, and recognize when it has achieved its goal?

Our system bypasses the first two questions, as the robot imitates the human in real-time. The robot maps observed actions into behaviors using a deterministic scaling function that can be augmented by the user. Finally, while our robot does not currently evaluate its own behavior, we touch on this question in Section 6.
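The exact form of the scaling function is not given in the text; the sketch below shows one plausible reading, assuming per-joint linear scalars (scale and offset) that the user can adjust interactively, with the output clipped to hypothetical robot joint limits. All names and values are illustrative.

    import numpy as np

    def pose_to_joint_commands(human_angles, scale, offset, joint_limits):
        """Map captured human joint angles to robot joint commands.

        human_angles: joint angles (radians) estimated from motion capture.
        scale, offset: per-joint interactive scalars the user can tune online so that
                       a comfortable human pose corresponds to a stable robot pose.
        joint_limits:  (low, high) arrays of the robot's mechanical joint limits.
        """
        commands = scale * human_angles + offset      # deterministic linear mapping
        low, high = joint_limits
        return np.clip(commands, low, high)           # never exceed the robot's limits

    # Example with three hypothetical joints; the user has widened the scale on joint 0.
    cmd = pose_to_joint_commands(
        human_angles=np.array([0.30, -0.10, 0.85]),
        scale=np.array([1.4, 1.0, 0.8]),
        offset=np.array([0.0, 0.05, -0.10]),
        joint_limits=(np.array([-1.0, -0.5, 0.0]), np.array([1.0, 0.5, 1.5])),
    )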

Although motion-capture data has been harnessed to improve robot locomotion, to the best of our knowledge, no real-time human-robot interface using motion-capture has ever been utilized in the way MARIOnET proposes. Recent work by Kulic, Takano, and Nakamura introduced a system using incremental learning of “human motion pattern primitives” by observation of motion-capture data [12]. Additionally, Nakanishi et al. have presented a framework for learning bipedal locomotion using dynamical movement primitives based on non-linear oscillators, using motion-capture data as input [15]. While these approaches are based on a similar motivation of using human motion to train robots, MARIOnET’s real-time interface provides a direct route of controlling the pose of a robot.

6. CONCLUSIONS AND FUTURE WORK
While the similarities of human movement and robot locomotion have been investigated [3], our idea of exploiting human motor skills for efficient learning of robot locomotion takes a completely new approach. We control the motion of a robot not by modeling its dynamics or tweaking parameters of a machine learning algorithm, but by taking advantage of the most finely-tuned and sophisticated control mechanism known to man: himself.

As more robots appear with complex body dynamics, it is vital that interaction is possible for all types of users, both technical and non-technical. However, it is very difficult to systematically construct motion controllers that exploit the specific properties of a robot, even for a roboticist. Our experiments suggest that all types of subjects are able to successfully use MARIOnET, as the non-technical users were able to intuitively grasp the interface. This approach allows the layman to precisely develop specialized robot motions.

In this first specification of MARIOnET, we have laid the groundwork for much future work. As mentioned earlier, MARIOnET abstracts the task of learning away from the robot and places this burden on the human. Although our results indicate that this approach is viable, a more robust set of problems could be approached and optimized if the robot and human learned in harmony. Three of the four “integral questions” for LfD proposed by Breazeal and Scassellati [4] are naturally answered by MARIOnET, while the fourth requires the robot to reason about its actions. Using the effective combination of human reinforcement and machine learning, we plan to address this important question in future work.

7. ACKNOWLEDGMENTS
This work has taken place in the Learning Agents Research Group (LARG) at the Artificial Intelligence Laboratory, The University of Texas at Austin. LARG research is supported in part by grants from the National Science Foundation (CNS-0615104 and IIS-0917122), ONR (N00014-09-1-0658), DARPA (FA8650-08-C-7812), and the Federal Highway Administration (DTFH61-07-H-00030).

8. REFERENCES
[1] HRI ’09: Proc. of the 4th ACM/IEEE Intl. Conf. on Human Robot Interaction, 2009.
[2] C. Atkeson and S. Schaal. Robot learning from demonstration. In Machine Learning: Proc. 14th Intl. Conf. (ICML ’97), pages 12–20, 1997.
[3] C. Azevedo, B. Espiau, B. Amblard, and C. Assaiante. Bipedal locomotion: toward unified concepts in robotics and neuroscience. Biological Cybernetics, 96(2):209–228, 2007.
[4] C. Breazeal and B. Scassellati. Challenges in building robots that imitate people. Defense Tech. Information Center, 2000.
[5] S. Buss. Introduction to inverse kinematics with Jacobian transpose, pseudoinverse, and damped least squares methods. Technical report, UCSD, 2004.
[6] S. Chernova and M. Veloso. Interactive policy learning through confidence-based autonomy. Journal of Artificial Intelligence Research, 34:1–25, 2009.
[7] K. Dautenhahn and C. Nehaniv. Imitation in Animals and Artifacts. MIT Press, Cambridge, MA, USA, 2002.
[8] J. Denavit and R. Hartenberg. A kinematic notation for lower-pair mechanisms based on matrices. Journal of Applied Mechanics, 22(2):215–221, 1955.
[9] W. B. Knox and P. Stone. TAMER: Training an Agent Manually via Evaluative Reinforcement. In IEEE 7th Intl. Conf. on Development and Learning, August 2008.
[10] N. Kohl and P. Stone. Machine learning for fast quadrupedal locomotion. In The Nineteenth National Conf. on Artificial Intelligence, pages 611–616, July 2004.
[11] J. Z. Kolter, P. Abbeel, and A. Y. Ng. Hierarchical apprenticeship learning with application to quadruped locomotion. In NIPS, 2007.
[12] D. Kulic, W. Takano, and Y. Nakamura. Combining automated on-line segmentation and incremental clustering for whole body motions. In IEEE Intl. Conf. on Robotics and Automation (ICRA 2008), pages 2591–2598, 2008.
[13] A. Lawson. The Neurological Basis of Learning, Development and Discovery: Implications for Science and Mathematics Instruction. Springer, 2003.
[14] W. Muellbacher, U. Ziemann, B. Boroojerdi, L. Cohen, and M. Hallett. Role of the human motor cortex in rapid motor learning. Experimental Brain Research, 136(4):431–438, 2001.
[15] J. Nakanishi, J. Morimoto, G. Endo, G. Cheng, S. Schaal, and M. Kawato. Learning from demonstration and adaptation of biped locomotion. Robotics and Autonomous Systems, 47(2-3):79–91, 2004.
[16] J. Peters, S. Vijayakumar, and S. Schaal. Reinforcement learning for humanoid robotics. In Proc. of the Third IEEE-RAS Intl. Conf. on Humanoid Robots, pages 1–20, 2003.
[17] M. Saggar, T. D’Silva, N. Kohl, and P. Stone. Autonomous learning of stable quadruped locomotion. Lecture Notes in Computer Science, 4434:98, 2007.
[18] S. Schaal, J. Peters, J. Nakanishi, and A. Ijspeert. Learning movement primitives. In Intl. Symposium on Robotics Research (ISRR 2003). Springer, 2004.
[19] R. Schmidt and T. Lee. Motor Control and Learning: A Behavioral Emphasis. Human Kinetics, 2005.
[20] A. Steinfeld, T. Fong, D. Kaber, M. Lewis, J. Scholtz, A. Schultz, and M. Goodrich. Common metrics for human-robot interaction. In Proc. of the 1st ACM SIGCHI/SIGART Conf. on Human-Robot Interaction, pages 33–40. ACM, New York, NY, USA, 2006.
[21] A. Thomaz and C. Breazeal. Teachable robots: Understanding human teaching behavior to build more effective robot learners. Artificial Intelligence, 172(6-7):716–737, 2008.
[22] R. Zhang and P. Vadakkepat. An evolutionary algorithm for trajectory based gait generation of biped robot. In Proc. of the Intl. Conf. on CIRAS, 2003.


Integrating Human Demonstration and Reinforcement Learning: Initial Results in Human-Agent Transfer

Matthew E. Taylor, University of Southern California, Los Angeles, CA
Sonia Chernova, MIT Media Laboratory, Cambridge, MA

ABSTRACT

This work introduces Human-Agent Transfer (HAT), a method that combines transfer learning, learning from demonstration and reinforcement learning to achieve rapid learning and high performance in complex domains. Using experiments in a simulated robot soccer domain, we show that human demonstrations can be transferred into a baseline policy for an agent, and reinforcement learning can be used to significantly improve policy performance. These results are an important initial step that suggests that agents can not only quickly learn to mimic human actions, but that they can also learn to surpass the abilities of the teacher.

1. INTRODUCTION
Agent technologies for virtual agents and physical robots are rapidly expanding in industrial and research fields, enabling greater automation, increased levels of efficiency, and new applications. However, existing systems are designed to provide niche solutions to very specific problems and each system may require significant effort. The ability to acquire new behaviors through learning is fundamentally important for the development of general-purpose agent platforms that can be used for a variety of tasks.

Existing approaches to agent learning generally fall into two categories: independent learning through exploration and learning from labeled training data. Agents often learn independently from exploration via reinforcement learning (RL) [35]. While such techniques have had great success in offline learning and software applications, the large amount of data and high exploration times they require make them intractable for most real-world domains.

On the other end of the spectrum are learning from demonstration (LfD) algorithms [1, 8, 10, 11, 17, 21]. These approaches leverage the vast experience and task knowledge of a person to enable fast learning, which is critical in real-world applications. However, the final policy performance achieved by these methods is limited by the quality of the dataset and the performance of the teacher. Human teachers provide especially noisy and suboptimal data due to differences in embodiment (e.g., degrees of freedom, action speed, precision) and limitations of human ability.

This paper proposes a novel approach: use RL transfer learning methods [38] to combine LfD and RL and achieve fast learning and high performance in a complex domain. In transfer learning, knowledge from a source task is used in a target task to speed up learning. Equivalently, knowledge from a source agent is used to speed up learning in a target agent. For instance, knowledge has been successfully transferred between agents that balance different length poles [29], that solve a series of mazes [9, 45], or that play different soccer tasks [39, 41, 42]. The key insight of transfer learning is that previous knowledge can be effectively reused, even if the source task and target task are not identical. This results in substantially improved learning times because the agent no longer relies on an uninformed (arbitrary) prior.

In this work, we show that we can effectively transfer knowledge from a human to an agent, even when they have different perceptions of state. Our method, Human-Agent Transfer (HAT): 1) allows a human (or agent) teacher to perform a series of demonstrations in a domain, 2) uses an existing transfer learning algorithm, Rule Transfer [36], to bias learning in an agent, and 3) allows the agent to improve upon the transferred policy using RL. HAT is empirically evaluated in a simulated robot soccer domain and the results serve as a positive proof of concept.

2. BACKGROUND
This section provides background on the three key techniques discussed in this paper: reinforcement learning, learning from demonstrations, and transfer learning.

2.1 Reinforcement Learning
A common approach for an agent to learn from experience is reinforcement learning (RL). We define reinforcement learning using the standard notation of Markov decision processes (MDPs) [25]. At every time step the agent observes its state s ∈ S as a vector of k state variables such that s = ⟨x1, x2, . . . , xk⟩. The agent selects an action from the set of available actions A at every time step. An MDP’s reward function R : S × A → ℝ and (stochastic) transition function T : S × A → S fully describe the system’s dynamics. The agent will attempt to maximize the long-term reward determined by the (initially unknown) reward and transition functions.

A learner chooses which action to take in a state via a policy, π : S → A. π is modified by the learner over time to improve performance, which is defined as the expected total reward. Instead of learning π directly, many RL algorithms instead approximate the action-value function, Q : S × A → ℝ, which maps state-action pairs to the expected real-valued return. In this paper, agents learn using Sarsa [26, 30], a well known but relatively simple temporal difference RL algorithm, which learns to estimate Q(s, a). While some RL algorithms are more sample efficient than Sarsa, this paper will focus on Sarsa for the sake of clarity.
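For illustration, a minimal tabular Sarsa episode is sketched below. The environment interface (reset/step) and hashable states are assumptions made for the example; the experiments later in this paper use function approximation rather than a table.

    import random
    from collections import defaultdict

    def sarsa_episode(env, Q, actions, alpha=0.1, gamma=1.0, epsilon=0.1):
        """One episode of tabular Sarsa; Q maps (state, action) -> estimated return.

        `env` is assumed to expose reset() -> state and step(action) -> (state, reward, done),
        and states are assumed to be hashable.
        """
        def choose(s):  # epsilon-greedy action selection
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(s, a)])

        s = env.reset()
        a = choose(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = choose(s2) if not done else None
            target = r + (gamma * Q[(s2, a2)] if not done else 0.0)
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # on-policy TD update
            s, a = s2, a2

    Q = defaultdict(float)   # all action-values start at zero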

Although RL approaches have enjoyed multiple past successes (e.g., TD-Gammon [40], inverted helicopter control [19], and agent locomotion [27]), they frequently take substantial amounts of data to learn a reasonable control policy. In many domains, collecting such data may be slow, expensive, or infeasible, motivating the need for ways of making RL algorithms more sample-efficient.

2.2 Learning from Demonstration
Learning from demonstration (LfD) is a growing area of machine learning research that explores techniques for learning a policy from examples, or demonstrations, provided by a human teacher. We define demonstrations as sequences of state-action pairs that are recorded by the agent while the teacher executes the desired behavior. LfD algorithms utilize this dataset of examples to derive a policy that reproduces the demonstrated behavior. Compared to exploration-based methods, such as reinforcement learning, demonstration learning reduces the learning time and eliminates the frequently difficult task of defining a detailed reward function [2, 32].

The field of learning from demonstration is broadly defined and many different algorithms have been proposed within its scope [1]. Approaches vary based on how demonstrations are performed (e.g., teleoperation [6, 10, 20], teacher following [21], kinesthetic teaching [12], external observation [17, 23]), and the type of policy learning method used (e.g., regression [5, 10], classification [6, 28], or planning [43, 44]).

Demonstration learning algorithms have been highly effective for real-world agent systems. LfD techniques possess a number of key strengths. Most significantly, demonstration leverages the vast task knowledge of the human teacher to significantly speed up learning times, either by eliminating exploration entirely [10, 20] or by focusing learning on the most relevant areas of the state space [32]. Demonstration also provides an intuitive programming interface for humans, opening possibilities for policy development to non-agent experts.

LfD algorithms are inherently limited by the quality of the information provided by the human teacher. Algorithms typically assume the dataset to contain high quality demonstrations performed by an expert. In reality, however, teacher demonstrations may be ambiguous, unsuccessful, or suboptimal in certain areas of the state space. A naïvely learned policy will likely perform poorly in such areas [2]. To enable the agent to improve beyond the performance of the teacher, learning from demonstration must be combined with learning from experience. Previous work by Smart and Kaelbling showed that human demonstration can be used to bootstrap reinforcement learning in domains with sparse rewards [32]. However, their evaluation was performed in relatively simple domains with small feature spaces, and the cost of refining the policy using RL in more complex domains has not been previously addressed.

2.3 Transfer Learning
The insight behind transfer learning (TL) is that generalization may occur not only within tasks, but also across tasks, allowing an agent to begin learning with an informative prior instead of relying on random exploration.

Transfer learning methods for reinforcement learning agents can transfer a variety of information between agents. However, many transfer methods restrict what type of learning algorithm is used by both agents (for instance, some methods require temporal difference learning [39] or a particular function approximator [42] to be used in both agents). When transferring from a human, though, it is impossible to copy a human’s “value function”: both because the human would likely be incapable of providing a complete and consistent value function, and because the human would quickly grow weary of evaluating a large number of state-action pairs.

This paper uses Rule Transfer [36], a particularly appropriate transfer method that is agnostic to the knowledge representation of the source learner. The ability to transfer knowledge between agents that have different state representations and/or actions is critical when considering transfer of knowledge between a human and an agent. The following steps summarize Rule Transfer:

1a: Learn a policy (π : S → A) in the source task. Any type of reinforcement learning algorithm may be used.

1b: Generate samples from the learned policy. After training has finished, or during the final training episodes, the agent records some number of interactions with the environment in the form of (S, A) pairs while following the learned policy.

2: Learn a decision list (Ds : S → A) that summarizes the source policy. After the data is collected, a propositional rule learner is used to summarize the collected data to approximate the learned policy.¹ This decision list is used as a type of inter-lingua, allowing the following step to be independent of the type of policy learned (step 1a).

3: Use Dt to bootstrap learning of an improved policy in the target task. Our previous work [36] has suggested that providing the agent a pseudo-action, which when selected always executes the action suggested by the decision list, is an effective method for allowing the agent to both exploit the transferred knowledge, as well as learn when to ignore the knowledge (by selecting one of the base actions in the MDP). A minimal sketch of steps 1b–3 follows this list.
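The sketch below is illustrative only: the env and policy interfaces are assumptions, and scikit-learn’s DecisionTreeClassifier stands in for a propositional rule learner (the experiments in this paper use JRip from Weka; see Section 4.2).

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier  # stand-in for a propositional rule learner

    # Step 1b: record (state, action) pairs while the source policy runs.
    def record_demonstrations(policy, env, n_episodes):
        states, actions = [], []
        for _ in range(n_episodes):
            s, done = env.reset(), False
            while not done:
                a = policy(s)
                states.append(s)
                actions.append(a)
                s, _, done = env.step(a)
        return np.array(states), np.array(actions)

    # Step 2: summarize the demonstrated policy as a compact classifier Ds : S -> A.
    def learn_decision_list(states, actions):
        return DecisionTreeClassifier(max_depth=5).fit(states, actions)

    # Step 3: expose the summary as a pseudo-action the target-task learner can select.
    def pseudo_action(decision_list, state):
        return int(decision_list.predict(np.asarray(state).reshape(1, -1))[0])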

2.4 Additional Related Work
Learning from demonstration and transfer learning work has been discussed earlier. This section briefly summarizes three additional lines of related work.

Within psychology, behavioral shaping [31] is a training procedure that uses reinforcement to condition the desired behavior in a human or animal. During training, the reward signal is initially used to reinforce any tendency towards the correct behavior, but is gradually changed to reward successively more difficult elements of the task. Shaping methods with human-controlled rewards have been successfully demonstrated in a variety of software agent applications [3, 13]. In contrast to shaping, LfD allows a human to demonstrate complete behaviors, which may contain much more information than simple positive/negative rewards.

Most similar to our approach is the recent work by Knox and Stone [15], which combines shaping with reinforcement learning. Their TAMER [14] system learns to predict and maximize a reward that is interactively provided by a human. The learned human reward is combined in various ways with Sarsa(λ), providing significant improvements. The primary difference between HAT and this method is that we focus on leveraging human demonstration, rather than human reinforcement.

Transfer learning problems are typically framed as leveraging knowledge learned on a source task to improve learning on a related, but different, target task. Representation transfer [37] examines the complementary task of transferring knowledge between agents with different internal representations (i.e., the function approximator or learning algorithm) of the same task. This idea is somewhat similar to implicit imitation [24], in that one agent teaches another how to act in a task. Allowing for such shifts in representation gives additional flexibility to an agent designer; past experience may be transferred rather than discarded if a new representation is desired. Representation transfer is similar in spirit to HAT in that both the teacher and the learner function in the same task, but very different techniques are used since the human’s “value function” cannot be directly examined.

High-level advice and suggestions have also been used to bias agent learning. Such advice can provide a powerful learning technique that speeds up learning by moulding the behavior of an agent and reducing the policy search space. However, existing methods typically require either significant user sophistication (e.g., the human must use a specific programming language to provide advice [18]) or significant effort to design a human interface (e.g., the learning agent must have natural language processing abilities [16]). Allowing a teacher to demonstrate behaviors is preferable in domains where demonstrating a policy is a more natural interaction than providing such high-level advice.

¹ Additionally, if the agents in the source and target task use different state representations or have different available actions, the decision list can be translated via inter-task mappings [36, 39] (as step 2b). For the current paper, this translation is not necessary, as the source and target agents operate in the same domain.

3. METHODOLOGY
In this section we present HAT, our approach to combining LfD and RL. HAT consists of three steps, similar to those used in Rule Transfer:

Phase 1: Source Demonstration. The agent performs the task under teleoperated control by a human teacher, or by executing an existing suboptimal controller. During execution, the agent records all state-action transitions. Multiple task executions may be performed (similar to Rule Transfer step 1b).

Phase 2: Policy Transfer. HAT uses the state-action transition data recorded during Phase 1 to derive rules summarizing the policy (similar to Rule Transfer step 2). These rules are used to bootstrap autonomous learning.

Phase 3: Independent Learning. The agent learns independently in the task via reinforcement learning, using the transferred policy to bias its learning (similar to Rule Transfer step 3). In this phase, the agent initially executes actions based solely on the transferred rules so that it learns the value of the transferred policy. After this initial training, the agent is allowed either to execute the action suggested by the transferred rules or to execute one of the MDP actions. Through exploration, the RL agent can decide when it should follow the transferred rules and when it should execute a different action (e.g., when the transferred rules are sub-optimal). A minimal end-to-end sketch of these phases follows this list.
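The sketch reuses the helpers from the Rule Transfer example above; the learner object and its train_episode(env, extra_action, force_extra) interface are hypothetical and only intended to show how the three phases connect.

    def run_hat(teacher_policy, env, learner,
                n_demo_episodes, n_forced_episodes, n_learning_episodes):
        """End-to-end HAT sketch, under assumed env and learner interfaces."""
        # Phase 1: source demonstration.
        states, actions = record_demonstrations(teacher_policy, env, n_demo_episodes)

        # Phase 2: policy transfer via a summarized decision list.
        decision_list = learn_decision_list(states, actions)
        extra = lambda s: pseudo_action(decision_list, s)

        # Phase 3: independent learning, biased by the transferred policy.
        for episode in range(n_learning_episodes):
            force = episode < n_forced_episodes   # initially always take the pseudo-action
            learner.train_episode(env, extra_action=extra, force_extra=force)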

4. EXPERIMENTAL VALIDATION
This section first discusses Keepaway [34], a simulated robot soccer domain, explains the experimental methodology used to test HAT, and then reports on results in this domain that confirm the efficacy of our method.

4.1 Keepaway
In this section we discuss Keepaway, a domain with a continuous state space and significant amounts of noise in the agent’s actions and sensors. One team, the keepers, attempts to maintain possession of the ball within a 20m × 20m region while another team, the takers, attempts to steal the ball or force it out of bounds. The simulator places the players at their initial positions at the start of each episode and ends an episode when the ball leaves the play region or is taken away from the keepers.

The keeper with the ball has the option to either pass the ball to one of its two teammates or to hold the ball. In 3 vs. 2 Keepaway (3 keepers and 2 takers), the state is defined by 13 hand-selected state variables (see Figure 1) as defined elsewhere [34]. The reward to the learning algorithm is the number of time steps the ball remains in play after an action is taken. The keepers learn in a constrained policy space: they have the freedom to decide which action to take only when in possession of the ball. Keepers not in possession of the ball are required to execute the Receive macro-action, in which the player who can reach the ball the fastest goes to the ball and the remaining players follow a handcoded strategy to try to get open for a pass.

Figure 1: This diagram shows the distances and angles used to construct the 13 state variables used for learning with 3 keepers and 2 takers. Relevant objects are the 3 keepers (K) and the two takers (T), both ordered by distance from the ball, and the center of the field.
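For concreteness, the sketch below computes one commonly used formulation of the 13 state variables (distances to the center, distances from the ball holder K1 to the other players, minimum distances from K2/K3 to a taker, and minimum passing angles at K1); the exact definition should be taken from [34], and 2-D player coordinates are assumed.

    import math

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def angle_at(vertex, p, q):
        """Angle p-vertex-q in radians, wrapped to [0, pi]."""
        a1 = math.atan2(p[1] - vertex[1], p[0] - vertex[0])
        a2 = math.atan2(q[1] - vertex[1], q[0] - vertex[0])
        return abs(math.atan2(math.sin(a1 - a2), math.cos(a1 - a2)))

    def keepaway_state(k1, k2, k3, t1, t2, center):
        """13 state variables for 3 vs. 2 Keepaway (players ordered by distance to the ball)."""
        return [
            dist(k1, center), dist(k2, center), dist(k3, center),
            dist(t1, center), dist(t2, center),
            dist(k1, k2), dist(k1, k3),
            dist(k1, t1), dist(k1, t2),
            min(dist(k2, t1), dist(k2, t2)),
            min(dist(k3, t1), dist(k3, t2)),
            min(angle_at(k1, k2, t1), angle_at(k1, k2, t2)),
            min(angle_at(k1, k3, t1), angle_at(k1, k3, t2)),
        ]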

The Keepaway problem maps fairly directly onto the discrete-time, episodic RL framework. As a way of incorporating domain knowledge, the learners choose not from the simulator’s primitive actions but from a set of higher-level macro-actions implemented as part of the player [34]. The macro-actions that the learners select among (Hold, Pass1, and Pass2 in 3 vs. 2) can last more than one time step, and the keepers have opportunities to make decisions only when an on-going macro-action terminates. To handle such situations, it is convenient to treat the problem as a semi-Markov decision process, or SMDP [4, 25], where agents reason over multi-step macro-actions. Agents then make decisions at discrete time steps (when macro-actions are initiated and terminated).

To learn Keepaway with Sarsa, each keeper is controlled by a separate agent. Many kinds of function approximation have been successfully used to approximate an action-value function in Keepaway, but Gaussian radial basis function (RBF) approximation has been one of the most successful [33]. All weights in the RBF function approximator are initially set to zero; every initial state-action value is zero and the action-value function is uniform. Experiments in this paper use version 9.4.5 of the RoboCup Soccer Server [22], and version 0.6 of UT-Austin’s Keepaway players [33].
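A generic Gaussian-RBF linear action-value function with zero-initialized weights can be sketched as below; the placement of centers and widths, and the exact feature construction used by the UT-Austin players, are not specified here and are treated as assumptions.

    import numpy as np

    class RBFQFunction:
        """Linear action-value function over Gaussian RBF features, one weight vector per action."""

        def __init__(self, centers, widths, n_actions):
            self.centers = np.asarray(centers)    # shape: (n_features, state_dim), assumed given
            self.widths = np.asarray(widths)      # one width per RBF center
            self.w = np.zeros((n_actions, len(self.centers)))   # all weights start at zero

        def features(self, state):
            d2 = np.sum((self.centers - np.asarray(state)) ** 2, axis=1)
            return np.exp(-d2 / (2.0 * self.widths ** 2))

        def q(self, state, action):
            return float(self.w[action] @ self.features(state))

        def update(self, state, action, td_error, alpha):
            # Gradient step for a linear function approximator.
            self.w[action] += alpha * td_error * self.features(state)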

4.2 Experimental Setup
When measuring speedup in RL tasks, there are many possible metrics. In this paper, we measure the success of HAT along two (related) dimensions.

The initial performance of an agent in a target task may be improved by transfer. Such a jumpstart (relative to the initial performance of an agent learning without the benefit of any prior information) suggests that transferred information is immediately useful to the agent. In Keepaway, the jumpstart is measured as the average episode reward (corresponding to the average episode length in seconds), averaged over 1,000 episodes without learning. The jumpstart is a particularly important metric when learning is slow and/or expensive.

The total reward accumulated by an agent (i.e., the area under the learning curve) may also be improved. This metric measures the ability of the agent to continue to learn after transfer, but is heavily dependent on the length of the experiment. In Keepaway, the total reward is the sum of the average episode durations at every integral hour of training:

    total reward = Σ_{t=0}^{n} (average episode reward at training hour t)

where the experiment lasts n hours and each average reward is computed using a sliding window over the past 1,000 episodes (to help combat the high noise in the Keepaway domain).
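The two metrics can be computed from logged episode durations as in the short sketch below; the variable names are illustrative, and the 1,000-episode window follows the text.

    def sliding_average(durations, window=1000):
        """Average episode duration over the most recent `window` episodes."""
        recent = durations[-window:]
        return sum(recent) / len(recent)

    def jumpstart(transfer_durations, no_prior_durations, window=1000):
        """Initial performance gain, evaluated before any learning in the target task."""
        return sliding_average(transfer_durations, window) - sliding_average(no_prior_durations, window)

    def total_reward(avg_duration_at_hour):
        """Area under the learning curve: sum of hourly sliding-window averages over the run."""
        return sum(avg_duration_at_hour)   # avg_duration_at_hour[t] = average at training hour t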

In this work, we consider two types of policies which can bootstrap learning (Phase 1 of HAT).

1. Previous work [34] defined a policy that was hand-tuned to play 3 vs. 2 Keepaway. This static policy performs significantly better than allowing the keepers to select actions randomly, but players that learn can surpass its performance.

2. In the simulator, Keepaway players can be controlled by the keyboard. This allows a human to watch the visualization and instruct the keeper with the ball to execute the Hold, Pass1, or Pass2 actions.

In experiments, we record all (s, a) pairs selected by the hand-tuned policy and from a human’s control. It is worth noting that while the hand-tuned policy uses the same state variables (i.e., representation of state) that the target task learning agent uses, the human has a very different representation. Rather than observing a 13-dimensional state vector, the human uses a visualizer (Figure 2), which contains more detailed information. This additional information may or may not be useful for executing a high-performing policy, but it is critical that whatever method is used to glean information about the human’s policy does not require the agent and the human to have identical representations of state.

To evaluate HAT, we compare the outcome from four distinct experiments.

1. “No Prior”: The agent learns the task using Sarsa with an uninitialized (arbitrary) Q-value function.

2. “20 Episodes: Hand-coded Policy”: Allow the hand-coded agent to demonstrate its policy for 20 episodes, transfer this information to the target task agents, and then continue learning with Sarsa.

3. “10 Episodes: Human Training”: Allow a human to demonstrate a policy for 10 episodes, transfer this information to the target task agents, and then continue learning with Sarsa.

4. “18 Episodes: Human Training”: Allow a human to demonstrate a policy for 18 episodes, transfer this information to the target task agents, and then continue learning with Sarsa.

It is worth noting that while Keepaway learning trials are measured in simulator hours, the above three demonstration periods are significantly shorter. For example, it takes less than three simulator minutes for the hand-coded policy to demonstrate 10 episodes of 3 vs. 2 Keepaway.

In Phase 2, we use a simple propositional rule learner to generate a decision list summarizing the policy (that is, it learns to generalize which action is selected in every state). For these experiments, we use JRip, an implementation of RIPPER [7] included in Weka [46].
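Since JRip is a Weka classifier, one simple way to connect the Phase 1 logs to Phase 2 is to export the recorded (state, action) pairs in ARFF format, as sketched below; the file layout follows standard ARFF conventions, and the JRip command line in the comment is an assumed invocation rather than one taken from the paper.

    def write_arff(path, states, actions, action_names):
        """Write recorded (state, action) pairs as an ARFF file that a Weka rule learner can read."""
        n_vars = len(states[0])
        with open(path, "w") as f:
            f.write("@relation keepaway_demonstrations\n")
            for i in range(n_vars):
                f.write(f"@attribute x{i} numeric\n")
            f.write("@attribute action {" + ",".join(action_names) + "}\n")
            f.write("@data\n")
            for s, a in zip(states, actions):
                f.write(",".join(f"{v:.3f}" for v in s) + "," + action_names[a] + "\n")

    # The resulting file could then be passed to JRip, e.g. (assumed invocation):
    #   java weka.classifiers.rules.JRip -t demos.arff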

Figure 2: This figure shows a screenshot of the visualizer used for the human to demonstrate a policy in 3 vs. 2 Keepaway. The human controls the keeper with the ball (shown as a hollow white circle) by telling the agent when, and to whom, to pass. When no input is received, the keeper with the ball executes the Hold action, attempting to maintain possession of the ball.

In Phase 3, this decision list is loaded by all three keepers, after which they learn and act independently. The decision list is treated as a pseudo-action [36], which the agent may select and then execute the action indicated by the decision list. For the first 100 episodes, all keepers are forced to execute this pseudo-action, attempting to mimic the policy demonstrated in Phase 1. During these 100 episodes, the keepers learn the value of the transferred decision list.

After the first 100 episodes, keepers can select from the three MDP-level actions (Hold, Pass1, or Pass2) and the pseudo-action, which executes the action suggested by the decision list for the current state. The agent is free to explore (using ε-greedy exploration), allowing it to discover the value of executing actions that disagree with the transferred decision list. Specifically, over time, the agent learns to execute actions in areas of the state space that differ from those suggested by the decision list when the demonstrated policy’s actions are sub-optimal. (Were the agent to always execute the pseudo-action, the agent would never learn but would simply mimic the policy demonstrated in Phase 1.)
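The action-selection logic just described can be sketched as follows; the Q(state, action) callable (defined over the MDP actions plus the pseudo-action), the episode counter, and the exploration rate are assumptions, and the function returns both the selected (learned-about) action and the action actually sent to the simulator.

    import random

    def select_action(Q, state, episode, decision_list_action, mdp_actions,
                      epsilon=0.01, forced_episodes=100):
        """Phase 3 selection over MDP actions plus a pseudo-action that defers to the decision list.

        For the first `forced_episodes` episodes the keeper always takes the pseudo-action,
        learning the value of the transferred policy before it is allowed to deviate.
        """
        PSEUDO = "follow-decision-list"
        choices = list(mdp_actions) + [PSEUDO]

        if episode < forced_episodes:
            chosen = PSEUDO
        elif random.random() < epsilon:                   # epsilon-greedy exploration
            chosen = random.choice(choices)
        else:
            chosen = max(choices, key=lambda a: Q(state, a))

        # The pseudo-action executes whatever the transferred decision list suggests here.
        executed = decision_list_action(state) if chosen == PSEUDO else chosen
        return chosen, executed   # learn about `chosen`; send `executed` to the simulator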

4.3 Experimental Results
This section presents preliminary results showing that HAT is effective by using demonstration and Rule Transfer to bootstrap RL in Keepaway agents.

Figure 3 compares the performance of the four experimental settings discussed above. Each experiment was run five times and the performance was analyzed at every hour, using a 1,000-episode sliding window. For readability only one line per experiment is shown, the average of the five trials, and error bars show the standard error of the five trials. The “No Prior” line shows the performance of agents learning without the benefit of transfer. The other three lines show the performance of HAT after demonstration by a hand-coded policy and by a human. Table 1 compares the four experiments according to their jumpstart and total reward. A Student’s t-test suggests² that all jumpstarts are statistically significant (p < 0.05), relative to learning with no prior. Note that a jumpstart of roughly six seconds is also “practically significant,” as policies learned from scratch reach an average possession time of 14 seconds per episode after training. Only the difference in total reward between no prior and 18 episodes of human training is statistically significant (p < 0.05).

² Note that 5 trials is not sufficient for the normality assumption used by a t-test and technically a more sophisticated statistical test should be used.

Figure 3: This graph summarizes performance of Sarsa learning in 3 vs. 2 Keepaway (episode duration in seconds versus training time in simulator hours) using four different settings, averaged over five trials each. Error bars show the standard error in the performance.

Table 1: This table shows the jumpstart and total reward metrics for 3 vs. 2 Keepaway.

    Method                      Jumpstart    Total Reward
    No Prior                    0            531
    20 Episodes, Hand-coded     5.8          512
    10 Episodes, Human          5.6          559
    18 Episodes, Human          6.5          606

These results show that the “No Prior” agents initially perform very poorly, learning a reasonable control policy only after spending significant amounts of time exploring the environment. In all three cases, HAT is able to improve the jumpstart of the learner. This shows that the demonstrated policy is indeed useful to the agent during initial learning. Such a result is particularly important when training is slow and/or expensive: for the first 9 simulator hours, the HAT keepers dominate the keepers learning with an uninformed prior.

In terms of both the jumpstart and total reward, human demonstrations were more useful than the hand-coded policy, likely because the human was able to achieve higher performance using keyboard control than the hand-coded policy. Using more human demonstrations achieved higher performance, likely because the extra data allowed the decision list learned in Phase 2 to more accurately approximate the demonstrated policy.

Transferring information via HAT from both the hand-coded policy and the human results in significant improvements over learning without prior knowledge.

5. FUTURE WORK AND CONCLUSION
This paper has introduced HAT, a novel method to combine learning from demonstration with reinforcement learning by leveraging an existing transfer learning algorithm. Initial empirical results in the Keepaway domain have shown that HAT can improve learning by using demonstrations generated by a hand-coded policy or by a human.

In order to better understand HAT and possible variants, future work will address the following questions:

• Why do keepers have trouble improving their performance after using HAT to learn from hand-coded policy demonstrations? Is this a statistical quirk, is the hand-coded policy near a local maximum, or is there another (currently unknown) effect at work?

• How does the quality of the demonstration affect learning?

• How does the quantity or state space coverage of demonstrations affect learning?

• Rather than performing 1-shot transfer, could HAT be extended so that the learning agent and teacher could iterate between learning autonomously and providing additional demonstrations?

• Are there other transfer techniques that would better allow an agent to learn from a recorded demonstration?

• In this work, the human teacher and the learning agent had different representations of state. Will HAT still be useful if the teacher and agent have different actions? How similar do the source task and target tasks need to be for effective learning improvement?

• Is using a pseudo-action efficient? Previous work [36] suggested that using the pseudo-action was superior to a set of possible transfer learning variants, but this should be re-investigated in the context of human-agent transfer.

• Could we combine these techniques with inverse reinforcement learning? For instance, it could be that the human is maximizing a different reward function, which accounts in part for the human’s higher performance.

Acknowledgements

The authors would like to thank Shivaram Kalyanakrishnan for sharing his code to allow a human to control the keepers via keyboard input. We also thank the anonymous reviewers and W. Bradley Knox for useful comments and suggestions.

6. REFERENCES

[1] B. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
[2] C. G. Atkeson and S. Schaal. Robot learning from demonstration. In ICML, 1997.
[3] B. Blumberg, M. Downie, Y. Ivanov, M. Berlin, M. P. Johnson, and B. Tomlinson. Integrated learning for interactive synthetic characters. ACM Trans. Graph., 21(3):417–426, 2002.
[4] S. J. Bradtke and M. O. Duff. Reinforcement learning methods for continuous-time Markov decision problems. In NIPS, 1995.
[5] B. Browning, L. Xu, and M. Veloso. Skill acquisition and use for a dynamically-balancing soccer robot. In AAAI, 2004.
[6] S. Chernova and M. Veloso. Interactive policy learning through confidence-based autonomy. Journal of Artificial Intelligence Research, 34(1):1–25, 2009.
[7] W. W. Cohen. Fast effective rule induction. In ICML, 1995.


[8] Y. Demiris and A. Billard. Special issue on robot learning by observation, demonstration and imitation. IEEE Transactions on Systems, Man and Cybernetics, 2006.
[9] F. Fernandez and M. Veloso. Probabilistic policy reuse in a reinforcement learning agent. In AAMAS, 2006.
[10] D. H. Grollman and O. C. Jenkins. Dogged learning for robots. In ICRA, 2007.
[11] D. H. Grollman and O. C. Jenkins. Sparse incremental learning for interactive robot control policy estimation. In ICRA, 2008.
[12] M. Hersch, F. Guenter, S. Calinon, and A. Billard. Dynamical system modulation for robot learning via kinesthetic demonstrations. IEEE Transactions on Robotics, 24(6):1463–1467, Dec. 2008.
[13] F. Kaplan, P.-Y. Oudeyer, E. Kubinyi, and A. Miklosi. Robotic clicker training. Robotics and Autonomous Systems, 38(3-4):197–206, 2002.
[14] W. B. Knox and P. Stone. Interactively shaping agents via human reinforcement: The TAMER framework. In K-CAP, 2009.
[15] W. B. Knox and P. Stone. Combining manual feedback with subsequent MDP reward signals for reinforcement learning. In AAMAS, 2010.
[16] G. Kuhlmann, P. Stone, R. J. Mooney, and J. W. Shavlik. Guiding a reinforcement learner with natural language advice: Initial results in RoboCup soccer. In AAAI Workshop on Supervisory Control of Learning and Adaptive Systems, 2004.
[17] A. Lockerd and C. Breazeal. Tutelage and socially guided robot learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2004.
[18] R. Maclin and J. W. Shavlik. Creating advice-taking reinforcement learners. Machine Learning, 22(1-3):251–281, 1996.
[19] A. Y. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang. Inverted autonomous helicopter flight via reinforcement learning. In International Symposium on Experimental Robotics, 2004.
[20] M. Nicolescu, O. Jenkins, A. Olenderski, and E. Fritzinger. Learning behavior fusion from demonstration. Interaction Studies, 9(2):319–352, June 2008.
[21] M. N. Nicolescu and M. J. Mataric. Methods for robot task learning: Demonstrations, generalization and practice. In AAMAS, 2003.
[22] I. Noda, H. Matsubara, K. Hiraki, and I. Frank. Soccer server: A tool for research on multiagent systems. Applied Artificial Intelligence, 12:233–250, 1998.
[23] N. Pollard and J. K. Hodgins. Generalizing demonstrated manipulation tasks. In Workshop on the Algorithmic Foundations of Robotics, 2002.
[24] B. Price and C. Boutilier. Accelerating reinforcement learning through implicit imitation. Journal of Artificial Intelligence Research, 19:569–629, 2003.
[25] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
[26] G. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG-RT 116, Engineering Department, Cambridge University, 1994.
[27] M. Saggar, T. D’Silva, N. Kohl, and P. Stone. Autonomous learning of stable quadruped locomotion. In RoboCup-2006: Robot Soccer World Cup X, 2007.
[28] J. Saunders, C. L. Nehaniv, and K. Dautenhahn. Teaching robots by moulding behavior and scaffolding the environment. In ACM SIGCHI/SIGART Conference on Human-Robot Interaction, 2006.
[29] O. G. Selfridge, R. S. Sutton, and A. G. Barto. Training and tracking in robotics. In IJCAI, 1985.
[30] S. Singh and R. S. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123–158, 1996.
[31] B. F. Skinner. Science and Human Behavior. Collier-Macmillan, 1953.
[32] W. D. Smart and L. P. Kaelbling. Effective reinforcement learning for mobile robots. In ICRA, 2002.
[33] P. Stone, G. Kuhlmann, M. E. Taylor, and Y. Liu. Keepaway soccer: From machine learning testbed to benchmark. In RoboCup-2005: Robot Soccer World Cup IX, 2006.
[34] P. Stone, R. S. Sutton, and G. Kuhlmann. Reinforcement learning for RoboCup-soccer keepaway. Adaptive Behavior, 13(3):165–188, 2005.
[35] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, 1998.
[36] M. E. Taylor and P. Stone. Cross-domain transfer for reinforcement learning. In ICML, 2007.
[37] M. E. Taylor and P. Stone. Representation transfer for reinforcement learning. In AAAI 2007 Fall Symposium on Computational Approaches to Representation Change during Learning and Development, 2007.
[38] M. E. Taylor and P. Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(1):1633–1685, 2009.
[39] M. E. Taylor, P. Stone, and Y. Liu. Transfer learning via inter-task mappings for temporal difference learning. Journal of Machine Learning Research, 8(1):2125–2167, 2007.
[40] G. Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215–219, 1994.
[41] L. Torrey, J. W. Shavlik, T. Walker, and R. Maclin. Relational macros for transfer in reinforcement learning. In ILP, 2007.
[42] L. Torrey, T. Walker, J. W. Shavlik, and R. Maclin. Using advice to transfer knowledge acquired in one reinforcement learning task to another. In ECML, 2005.
[43] M. van Lent and J. E. Laird. Learning procedural knowledge through observation. In K-CAP, 2001.
[44] H. Veeraraghavan and M. Veloso. Learning task specific plans through sound and visually interpretable demonstrations. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2599–2604, Sept. 2008.
[45] A. Wilson, A. Fern, S. Ray, and P. Tadepalli. Multi-task reinforcement learning: a hierarchical Bayesian approach. In ICML, 2007.
[46] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.