
Interactive Robot Learning of Visuospatial Skills

Seyed Reza Ahmadzadeh, Petar Kormushev and Darwin G. Caldwell
Department of Advanced Robotics

Istituto Italiano di Tecnologia
via Morego 30, 16163, Genova

Email: {reza.ahmadzadeh, petar.kormushev, darwin.caldwell}@iit.it

Abstract—This paper proposes a novel interactive robot learning approach for acquiring visuospatial skills. It allows a robot to acquire new capabilities by observing a demonstration while interacting with a human caregiver. Most existing learning from demonstration approaches focus on the trajectories, whereas in our approach the focus is placed on achieving a desired goal configuration of objects relative to one another. Our approach is based on visual perception, which captures the object's context for each demonstrated action. The context implicitly embodies the visuospatial representation, including the relative positioning of the object with respect to multiple other objects simultaneously. The proposed approach is capable of learning and generalizing different skills such as object reconfiguration, classification, and turn-taking interaction. The robot learns to achieve the goal from a single demonstration while requiring minimum a priori knowledge about the environment. We illustrate the capabilities of our approach using four real-world experiments with a Barrett WAM robot.

I. INTRODUCTION

Several robot skill learning approaches based on human demonstrations have been proposed during the past decade. Motor skill learning approaches can be categorized into two main subsets: trajectory-based and goal-based approaches. Trajectory-based approaches put the focus on recording and re-generating trajectories and forces for object manipulation [1], [2]. However, in many cases it is not the trajectory that is important but the goal of the action, for example, solving a jigsaw puzzle [3] or assembling an electric circuit board. In such examples, trajectory-based approaches unnecessarily increase the complexity of the learning process.

Several goal-based approaches such as [4] have been developed to address this issue. For instance, there is a large body of literature on grammars from the linguistic and computer science communities, with a number of applications related to robotics [5], [6]. Other symbolic learning approaches are focused on goal configurations rather than action execution [7]. Such approaches inherently comprise many steps, for instance, segmentation, clustering, object recognition, structure recognition, symbol generation, syntactic task modeling, motion grammar and rule generation, etc. Another drawback of such approaches is that they require a significant amount of a priori knowledge to be manually engineered into the system. Furthermore, most of the above-mentioned approaches assume the availability of information about the internal state of the demonstrator, such as joint angles, which humans usually cannot directly access in order to imitate the observed behavior.

An alternative to motor skill learning approaches is visual skill learning [8], [9]. These approaches are based on observing the human demonstration and using human-like visuospatial skills to replicate the task [10].

A visuospatial skill is the capability to visually perceive the spatial relationship between objects.

In this paper, we propose a novel visuospatial skill learning approach for interactive robot learning tasks. Unlike motor skill learning approaches, our approach utilizes visual perception as the main information source for learning new skills from demonstration. The proposed visuospatial skill learning (VSL) approach uses a simple algorithm and minimum a priori knowledge to learn a sequence of operations from a single demonstration. In contrast to many previous approaches, VSL offers simplicity, efficiency, and user-friendly human-robot interaction. Rather than relying on complicated models of human actions, labeled human data, or object recognition, our approach allows the robot to learn a variety of complex tasks effortlessly, simply by observing and reproducing the visual relationships among objects. We demonstrate the feasibility of the proposed approach in four real-world experiments in which the robot learns to organize objects of different shapes and colors on a tabletop workspace to accomplish a goal configuration. In these experiments, the robot acquires and reproduces three main capabilities: object reconfiguration, classification, and turn-taking. This work is an extension of our previous research on visuospatial skill learning [11], with two major novelties: (i) the ability to learn and reproduce not only the position but also the orientation of objects; and (ii) the application to more challenging tasks, including object classification and human-robot turn-taking.

II. RELATED WORK

Visual skill learning, or learning by watching, is one of the most powerful mechanisms of learning in humans. Researchers have shown that even newborns can imitate simple body movements such as facial gestures [12]. In cognitive science, learning by watching has been investigated as a source of higher-order intelligence and fast acquisition of knowledge [8], [13]. In the rest of this section we give examples of visual skill learning approaches.

In [8], a robot agent watches a human teacher performing a simple assembly task in a tabletop environment. The motor movements of the human are classified as actions known to the robot (pick, move, place, etc.). The robot can reproduce the sequence of actions even if the initial configuration of the objects is changed. In their approach, the set of actions is already known to the robot, and symbolic labels are used to reproduce the sequence of actions. In order to detect the movement of an object, they detect and track the demonstrator's hand. Since they rely solely on passive observations of a teacher demonstration, this method has to make use of complex computer vision techniques, in carefully structured environments, in order to infer all the information necessary for the task. In our method, we do not use symbolic labels.


Asada et al. [14] propose a method for learning by observation (teaching by showing) based on the demonstrator's view recovery and adaptive visual servoing. They believe that coordinate transformation is a time-consuming and error-prone method. Instead, they assume that both the robot and the demonstrator have the same body structure. They use two sets of stereo cameras, one for observing the robot's motion and the other for observing the demonstrator's motion. The optic-geometrical constraint, called the 'epipolar constraint', is used to reconstruct the view of the agent, on which adaptive visual servoing is applied to imitate the observed motion. In our method, we use coordinate transformation.

In [15], the authors propose a learning-from-observation system using multiple sensors in a kitchen environment with typical household tasks. They focus on pick-and-place operations, including techniques for grasping. A data glove, a magnetic-field-based tracking system, and an active trinocular camera head are used in their experiment. Object recognition is done using fast view-based vision approaches. Also, they extract finger joint movements and hand position in 3D space from the data glove. The method is based on pre-trained neural networks to detect hand configurations and to search in a predefined symbol database. However, there was no real-world reproduction with a robot. A similar work focuses on extracting and classifying subtasks for grasping tasks using visual data from the demonstration and on generating trajectories [16]. They use color markers to capture data from the caregiver's hand. In our method, we use neither neural networks nor symbol abstraction techniques.

A visual learning by imitation approach is presented in [9]. The authors utilize neural networks to map visual perception to motor skills (visuo-motor mapping) together with viewpoint transformation. For gesture imitation, a Bayesian formulation is adopted. They use a single camera in their experiments.

A method for segmenting demonstrations, recognizing repeated skills, and generalizing complex tasks from unstructured demonstrations is presented in [5]. The authors use a Beta Process Autoregressive HMM for recognition and generalization. For object detection they use pre-defined coordinate frames and visual fiducials fixed on each object. Furthermore, they cluster the points in each frame, and a DMP is created for each segment in the demonstration. In our method, we instead use the captured object's context to find the pick and place points for each operation, rather than object recognition and point clustering, and we apply simple metrics on the captured object's context to distinguish between observations.

Visual analysis of demonstrations and automatic policy extraction for an assembly task is presented in [6]. To convert a demonstration of the desired task into a string of connected events, this approach uses a set of different techniques such as image segmentation, clustering, object recognition, object tracking, structure recognition, symbol generation, transformation of symbolic abstraction, and trajectory generation. In our approach, we do not use symbol generation techniques.

A robot goal learning approach that can ground discrete concepts from continuous perceptual data using unsupervised learning is presented in [7]. The demonstrations of five pick-and-place tasks in a tabletop workspace were provided by non-expert users.
They utilize object detection, segmentation, and localization methods using color markers.

Fig. 1. The experimental setup for a visuospatial skill learning (VSL) task.

During each operation, the approach detects the object that changes most significantly. The starting and ending times of each demonstration are provided using graphical or speech commands. Our approach relies solely on the captured observations to learn the sequence of operations, and there is no need to perform any of those steps.

III. VISUOSPATIAL SKILL LEARNING

In this section, we introduce the visuospatial skill learning (VSL) approach. After stating our assumptions and defining the basic terms, we describe the problem statement and explain the VSL algorithm.

As can be seen in Fig. 1, the experimental set-up for all the conducted experiments consists of a torque-controlled 7-DOF Barrett WAM robotic arm with a 3-finger Barrett Hand attached, a tabletop working area, a set of objects, and a CCD camera which is mounted above the workspace (not necessarily perpendicular to the workspace).

A human caregiver demonstrates a sequence of operations on the available objects. Each operation consists of one pick action and one place action, which are captured by the camera. Afterwards, using our proposed method, and starting from a random initial configuration of the objects, the robot can perform a sequence of new operations which ultimately results in reaching the same goal as the one demonstrated by the caregiver.

A. Terminology

First, we need to define some basic terms to describe our learning approach accurately.

World: In our method, the intersection area between the robot's workspace and the caregiver's workspace which is observable by the camera (with a specific field of view) is called the world. The world includes the objects which are being used during the learning task, and can be reconfigured by the human caregiver and the robot.

Frame: A bounding box which defines a cuboid in a 3D, or a rectangle in a 2D, coordinate system. The size of the frame can be fixed or variable. The maximum size of the frame is equal to the size of the world.

Observation: The captured context of the world from a predefined viewpoint and using a specific frame. Each observation is a still image which holds the content of the part of the world located inside the frame.

Pre-pick observation: An observation which is captured just before the pick action and is centered around the pick location.

Pre-place observation: An observation which is captured just before the place action and is centered around the place location.

The pre-pick and pre-place terms in our approach are analogous to the precondition and postcondition terms used in logic programming languages (e.g., Prolog).

B. Problem Statement

Formally, we define a process of visuospatial skill learning as a tuple V = {η, W, O, F, S}, where η defines the number of operations, which can be specified in advance or detected during the demonstration phase; W ∈ ℝ^{m×n} is a matrix containing the image which represents the context of the world, including the workspace and all objects; O is a set of observation dictionaries, O = {O_Pick, O_Place}; O_Pick is an observation dictionary comprising a sequence of pre-pick observations, O_Pick = 〈O_Pick(1), O_Pick(2), ..., O_Pick(η)〉, and O_Place is an observation dictionary comprising a sequence of pre-place observations, O_Place = 〈O_Place(1), O_Place(2), ..., O_Place(η)〉. For example, O_Pick(i) represents the pre-pick observation captured during the i-th operation. F ∈ ℝ^{m×n} is an observation frame which is used for recording the observations; and S is a vector which stores the scalar scores related to each detected match during the match finding process. The score vector is calculated using a metric by comparing the new observations captured during the reproduction phase with the dictionary of observations recorded in the demonstration phase.

The output of the reproduction phase is the set P = {P_Pick, P_Place} and the set θ = {θ_Pick, θ_Place}, which contain the positions and orientations identified by the VSL algorithm for performing the pick-and-place operations. P_Pick is an ordered set of pick positions, P_Pick = 〈P_Pick(1), P_Pick(2), ..., P_Pick(η)〉, and P_Place is an ordered set of place positions, P_Place = 〈P_Place(1), P_Place(2), ..., P_Place(η)〉. θ_Pick is an ordered set of pick orientations, θ_Pick = 〈θ_Pick(1), θ_Pick(2), ..., θ_Pick(η)〉, and θ_Place is an ordered set of place orientations, θ_Place = 〈θ_Place(1), θ_Place(2), ..., θ_Place(η)〉. For example, P_Pick(i) and θ_Pick(i) represent the pick position and the pick orientation during the i-th operation, respectively.
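To make the notation above concrete, the following sketch shows one possible way to organize the tuple V = {η, W, O, F, S} and the reproduction output in code. It is only an illustrative data layout under the assumption of a Python/NumPy implementation; the class and field names are ours, not part of the paper.

    from dataclasses import dataclass, field
    from typing import List, Tuple
    import numpy as np

    @dataclass
    class VSLTask:
        """Illustrative container for the VSL tuple V = {eta, W, O, F, S}."""
        eta: int                          # number of pick-and-place operations
        world: np.ndarray                 # W: image of the workspace context (m x n)
        frame_size: Tuple[int, int]       # F: size of the observation frame
        pre_pick_obs: List[np.ndarray] = field(default_factory=list)   # O_Pick(1..eta)
        pre_place_obs: List[np.ndarray] = field(default_factory=list)  # O_Place(1..eta)
        scores: List[float] = field(default_factory=list)              # S: match scores

    @dataclass
    class VSLOutput:
        """Positions P and orientations theta produced by the reproduction phase."""
        pick_positions: List[Tuple[float, float]]
        place_positions: List[Tuple[float, float]]
        pick_rotations: List[float]
        place_rotations: List[float]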

C. Methodology

A high-level flow diagram illustrating the VSL approach is shown in Fig. 2, and pseudo-code of the proposed implementation is given in Algorithm 1. The VSL approach consists of two phases: demonstration and reproduction. In both phases, the objects are initially placed randomly in the world, W = {W_D, W_R}, where W_D and W_R represent the world during the demonstration and the reproduction phases, respectively; consequently, VSL does not depend on the initial configuration of the objects.

The initial step in each learning task is calculating the (2D-to-2D) coordinate transformation (line 1 in Algorithm 1), which means computing a homography matrix, H ∈ ℝ^{3×3}, for the current set-up of the robot and the camera.

Fig. 2. A high-level flow diagram illustrating the demonstration and reproduction phases in the VSL approach.

Fig. 3. Coordinate transformation from the image plane to the workspace plane (H represents the homography matrix).

This transformation matrix is used to transform points from the image plane to the workspace plane (Fig. 3). In the demonstration phase, the caregiver starts to reconfigure the objects to fulfill a desired goal which is not known to the robot. During this phase the size of the observation frame, F_D, is fixed and can be equal to or smaller than the size of the world, W_D. Two functions, RecordPrePickObs and RecordPrePlaceObs, are developed to capture and rectify one pre-pick observation and one pre-place observation for each operation. The captured images are rectified using the calculated homography matrix H. Using image subtraction and thresholding techniques, the two mentioned functions (lines 3 and 4 in Algorithm 1) extract a pick point and a place point in the captured observations and then center the observations around the extracted positions. These functions also extract the initial pick and place orientations of the objects in the world. The recorded sets of observations, O_Pick and O_Place, together with the extracted orientations, form an observation dictionary which is used as the input for the reproduction phase of VSL.
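As a rough illustration of how RecordPrePickObs could be realized with standard tools, the sketch below rectifies two grayscale frames captured around the caregiver's pick action, subtracts and thresholds them, and crops an observation centered on the detected change. This is an assumption-laden sketch using OpenCV, not the authors' code; the threshold value and the handling of frame sizes are placeholders.

    import cv2
    import numpy as np

    def record_pre_pick_obs(img_before, img_after, H, world_size, frame_size, thresh=40):
        """Sketch of a pre-pick observation: rectify, subtract, threshold, crop.
        img_before/img_after: grayscale frames around the caregiver's pick action."""
        w, h = world_size
        before = cv2.warpPerspective(img_before, H, (w, h))   # rectify with homography H
        after = cv2.warpPerspective(img_after, H, (w, h))
        diff = cv2.absdiff(before, after)                     # image subtraction
        _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
        ys, xs = np.nonzero(mask)                             # pixels that changed
        cx, cy = int(xs.mean()), int(ys.mean())               # pick point = centroid of the change
        fw, fh = frame_size
        x0, y0 = max(cx - fw // 2, 0), max(cy - fh // 2, 0)
        observation = before[y0:y0 + fh, x0:x0 + fw]          # observation centered on the pick point
        return observation, (cx, cy)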


Input: {η, W, F}
Output: {P, θ}

1   Initialization: CalculateTransformation
    // Part I: Demonstration
2   for i = 1 to η do
3       O_Pick(i) = RecordPrePickObs(W_D, F_D)
4       O_Place(i) = RecordPrePlaceObs(W_D, F_D)
5   end
    // Part II: Reproduction
6   Re-calculate Transformation (if necessary)
7   for i = 1 to η do
8       F* = FindBestMatch(W_R, O_Pick(i))
9       P_Pick(i) = FindPickPosition(F*)
10      θ_Pick(i) = FindPickRotation(F*)
11      PickObjectFrom(P_Pick(i), θ_Pick(i))
12      F* = FindBestMatch(W_R, O_Place(i))
13      P_Place(i) = FindPlacePosition(F*)
14      θ_Place(i) = FindPlaceRotation(F*)
15      PlaceObjectTo(P_Place(i), θ_Place(i))
16  end

Algorithm 1: Pseudo-code for the VSL (Visuospatial Skill Learning) approach.

Before starting the reproduction phase, if the position and orientation of the camera with respect to the robot is changed, we need to re-calculate the homography matrix for the new configuration of the coordinate systems (line 6 in Algorithm 1). The next phase, reproduction, starts with randomizing the objects in the world to create a new world, W_R. Then the algorithm selects the first pre-pick observation and searches for a similar visual perception in the new world to find the best match (using the FindBestMatch function in lines 8 and 12). Although the FindBestMatch function can use any metric to find the best matching observation, to be consistent we use the same metric in all of our experiments (see Section IV-B). This metric produces a scalar score for each comparison. If more than one matching object is found, the FindBestMatch function returns the match with the highest score. After the algorithm finds the best match, the FindPickPosition and FindPickRotation functions (lines 9, 10) identify the pick position and rotation; the position is located at the center of the corresponding frame, F*. The PickObjectFrom function uses the results from the previous step to pick the object from the identified position. Then the algorithm repeats the searching process to find the best match with the corresponding pre-place observation (line 12). The FindPlacePosition and FindPlaceRotation functions identify the place position and rotation at the center of the corresponding frame, F*, and the PlaceObjectTo function uses the results from the previous step to place the object at the identified position. The frame size in the reproduction phase is fixed and equal to the size of the world (W_R). One of the advantages of the VSL approach is that it provides the exact position where the object should be picked without using any a priori knowledge about the object. The reason is that, when generating the observations, the algorithm centers the observation frame around the point at which the caregiver is operating on the object.
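For reference, the reproduction loop of Algorithm 1 can be paraphrased in a few lines; find_best_match, pick and place below stand for the SIFT-based matcher and the robot primitives described in Section IV and are hypothetical names, not functions defined in the paper.

    def reproduce(task, world_R, find_best_match, pick, place):
        """Paraphrase of Part II of Algorithm 1 (reproduction phase)."""
        for i in range(task.eta):
            # Locate the object to pick by matching the i-th pre-pick observation.
            position, rotation, _ = find_best_match(world_R, task.pre_pick_obs[i])
            pick(position, rotation)
            # Locate where to place it by matching the i-th pre-place observation.
            position, rotation, _ = find_best_match(world_R, task.pre_place_obs[i])
            place(position, rotation)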

IV. IMPLEMENTATION OF VSL

In this section we describe the steps required to implement the VSL approach for real-world experiments.

A. Calculating Coordinate Transformation (Homography)

In each experiment the camera's frame of reference may vary with respect to the robot's frame of reference. To transform points between these coordinate systems we calculate the homography H. Using at least 4 real points on the workspace plane and 4 corresponding points on the image plane, the homography is calculated using the Singular Value Decomposition (SVD) method [17]. The extracted homography matrix is used not only for coordinate transformation but also to rectify the raw images captured by the camera. For each task, the process of extracting the homography should be repeated whenever the camera's frame of reference is changed with respect to the robot's frame of reference.

After the robot acquires a new skill, it can still reproduce the skill afterwards even if the experimental set-up is changed; the algorithm just needs to use the new coordinate transformation. This means that VSL is a view-invariant approach.
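For illustration, the homography estimation and image rectification described above can be done with off-the-shelf routines; the snippet below is a minimal sketch using OpenCV (whose findHomography solves the corresponding least-squares problem internally). The point coordinates and file name are placeholders.

    import cv2
    import numpy as np

    # Four (or more) corresponding points: image plane -> workspace plane (placeholder values).
    image_pts = np.array([[102, 88], [540, 95], [530, 410], [110, 400]], dtype=np.float32)
    world_pts = np.array([[0, 0], [600, 0], [600, 450], [0, 450]], dtype=np.float32)  # in mm

    H, _ = cv2.findHomography(image_pts, world_pts)       # 3x3 homography matrix
    raw = cv2.imread("camera_frame.png")                  # raw captured image (placeholder path)
    rectified = cv2.warpPerspective(raw, H, (600, 450))   # rectified view of the workspace

    # A point detected on the image plane can now be mapped to workspace coordinates.
    p_img = np.array([[[320.0, 240.0]]], dtype=np.float32)
    p_world = cv2.perspectiveTransform(p_img, H)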

B. Image Processing

Image processing methods are used in both the demonstration and reproduction phases of VSL. In the demonstration phase, for each operation the algorithm captures a set of raw images consisting of pre-pick and pre-place images. Firstly, the captured raw images are rectified using the homography matrix H. Secondly, image subtraction and thresholding techniques are applied to each pair of images to generate the pre-pick and pre-place observations. The produced observations are centered around the extracted positions within the frame. In the reproduction phase, for each operation the algorithm rectifies the captured world observation. Then the corresponding recorded observations are loaded from the demonstration phase and a metric is applied to find the best match (the FindBestMatch function in Algorithm 1). Although any metric can be used in this function (e.g., a window search method), we use the Scale-Invariant Feature Transform (SIFT) algorithm [18]. SIFT is one of the most popular feature-based methods and is able to detect and describe local features that are invariant to scaling and rotation. Afterwards, we apply RANSAC in order to estimate the transformation matrix H_sift from the set of matches. Since the calculated transformation matrix H_sift has 8 degrees of freedom, with 9 elements in the matrix, to have a unique normalized representation we pre-multiply H_sift with a normalization constant:

\alpha = \mathrm{sign}(H_{sift}(3,3)) \Big/ \sqrt{H_{sift}(3,1)^2 + H_{sift}(3,2)^2 + H_{sift}(3,3)^2}    (1)

This normalization constant is selected to make the decomposed projective matrix have a vanishing line vector of unit magnitude, which avoids unnatural interpolation results. The normalized matrix αH_sift can be decomposed into simple transformation elements,

\alpha H_{sift} = T\, R_\theta\, R_{-\phi}\, S_v\, R_\phi\, P    (2)

where R_{±φ} are rotation matrices that align the axes for the horizontal and vertical scaling S_v;


Fig. 4. The result of image subtraction and thresholding for a place action (right); the match finding result between the 4th observation and the world in the 4th operation of the Domino task using SIFT (left).

R_θ is another rotation matrix that orientates the shape into its final orientation; T is a translation matrix; and, lastly, P is a pure projective matrix:

P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ \alpha H(3,1) & \alpha H(3,2) & \alpha H(3,3) \end{bmatrix}    (3)

An affine matrix H_A is the remainder of αH after extracting P: H_A = αH P^{-1}. T is extracted by taking the third column of H_A, and A, which is a 2 × 2 matrix, is the remainder of H_A. A can be further decomposed using SVD:

A = U D V^T    (4)

where D is a diagonal matrix, and U and V are orthogonal matrices. Finally, we can calculate

S_v = \begin{bmatrix} D & 0 \\ 0 & 1 \end{bmatrix}, \quad R_\theta = \begin{bmatrix} U V^T & 0 \\ 0 & 1 \end{bmatrix}, \quad R_\phi = \begin{bmatrix} V^T & 0 \\ 0 & 1 \end{bmatrix}    (5)

R_θ is calculated for both the pick and place operations (R_θ^pick, R_θ^place), so the pick and place rotation angles of the objects are extracted as

\theta_{pick} = \arctan\!\left(R_\theta^{pick}(2,2) \,/\, R_\theta^{pick}(2,1)\right)    (6)

\theta_{place} = \arctan\!\left(R_\theta^{place}(2,2) \,/\, R_\theta^{place}(2,1)\right)    (7)

Note that the projective transformation is position-dependent, in contrast to the position-independent affine transformation. More details about homography estimation and decomposition can be found in [19].

VSL relies on vision, which might be obstructed by other objects, by the caregiver's body, or, during the reproduction, by the robot's arm. Therefore, for a physical implementation of the VSL approach, special care needs to be taken to avoid such obstructions.

Finally, we should mention that the image processing part is not the focus of this paper; we use the SIFT-RANSAC combination because of its popularity and its capability for fast and robust match finding. Fig. 4 shows the result of applying the SIFT algorithm to an observation and a new world.
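A possible implementation of the FindBestMatch step with SIFT and RANSAC is sketched below. It returns the matched location of an observation in the new world, a rotation angle recovered from the estimated homography, and a simple inlier-count score. The ratio-test threshold, the use of the top-left 2x2 block as the affine part, and the atan2 angle convention are our own simplifications of the decomposition above, not the authors' exact implementation.

    import cv2
    import numpy as np

    def find_best_match(world_img, observation):
        """Match a recorded observation against the current world using SIFT + RANSAC."""
        sift = cv2.SIFT_create()
        k_obs, d_obs = sift.detectAndCompute(observation, None)
        k_wld, d_wld = sift.detectAndCompute(world_img, None)

        matcher = cv2.BFMatcher(cv2.NORM_L2)
        pairs = matcher.knnMatch(d_obs, d_wld, k=2)
        good = [m for m, n in pairs if m.distance < 0.75 * n.distance]   # Lowe's ratio test

        src = np.float32([k_obs[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
        dst = np.float32([k_wld[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
        H_sift, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        score = int(inliers.sum())                        # match score: number of RANSAC inliers

        # Map the observation center into the world to obtain the pick/place position.
        h, w = observation.shape[:2]
        center = cv2.perspectiveTransform(np.float32([[[w / 2.0, h / 2.0]]]), H_sift)[0, 0]

        # Rotation angle from the normalized homography (cf. eqs. (1)-(5)), using a
        # simplified decomposition: SVD of the (approximately affine) 2x2 block.
        alpha = np.sign(H_sift[2, 2]) / np.linalg.norm(H_sift[2, :])     # eq. (1)
        U, _, Vt = np.linalg.svd((alpha * H_sift)[:2, :2])
        R_theta = U @ Vt
        theta = np.degrees(np.arctan2(R_theta[1, 0], R_theta[0, 0]))
        return center, theta, score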

C. Trajectory Generation

The pick and place points, together with the pick and place rotation angles extracted in the image processing step, are used to generate a trajectory for the corresponding operation.

For each pick-and-place operation, the desired Cartesian trajectory of the end-effector is a cyclic movement between three key points: the rest point, the pick point, and the place point. Fig. 5(a) illustrates two different views of a generated trajectory, and four different rotation-angle profiles are depicted in Fig. 5(b). The robot starts from the rest point with the rotation angle equal to zero (point no. 1) and moves smoothly (along the red curve) towards the pick point (point no. 2). During this movement the robot's hand rotates to reach the pick rotation angle according to the rotation-angle profile. Then the robot picks up an object and relocates it (along the green curve) to the place point (point no. 3), while the hand rotates to meet the place rotation angle. Next, the robot places the object at the place point, and finally moves back (along the blue curve) to the rest point. For each part of the trajectory, including the grasping and the releasing parts, we define a specific duration and initial boundary conditions (initial positions and velocities). In addition, we assume the initial and final rotation angles to be zero. We define a geometric path in the workspace which can be expressed in the parametric form of equations (8) to (10), where s is defined as a function of time t (s = s(t)), and the elements of the geometric path are p_x = p_x(s), p_y = p_y(s), and p_z = p_z(s). A polynomial function of order three with initial position and velocity conditions is used for p_x and p_y, and equation (10) is used for p_z. Distributing the time smoothly with a third-order polynomial from the initial time to the final time, together with the above-mentioned equations, makes the generated trajectories reduce the probability of collision between the moving object and the adjacent objects. In order to generate the rotation-angle trajectory for the robot's hand, a trapezoidal profile is used together with the θ_pick and θ_place extracted from equations (6) and (7).

p_x = a_3 s^3 + a_2 s^2 + a_1 s + a_0    (8)

p_y = b_3 s^3 + b_2 s^2 + b_1 s + b_0    (9)

p_z = h \left[ 1 - \left| \left( \tanh^{-1}(h_0 (s - 0.5)) \right)^{\kappa} \right| \right]    (10)
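For illustration, the sketch below generates one Cartesian segment of the pick-and-place cycle following equations (8)-(10): cubic profiles in s for p_x and p_y with zero boundary velocities, a third-order time distribution s(t), and the bell-shaped p_z profile of equation (10). The duration and the values of h, h_0 and κ are placeholders chosen by us; the paper does not report its numeric settings.

    import numpy as np

    def cubic_profile(p0, p1, s):
        """Cubic in s with zero boundary velocities (eqs. (8)-(9)):
        a0 = p0, a1 = 0, a2 = 3(p1 - p0), a3 = -2(p1 - p0)."""
        return p0 + (p1 - p0) * (3 * s**2 - 2 * s**3)

    def pick_place_segment(p_start, p_goal, T=4.0, h=0.12, h0=1.5, kappa=1.0, n=200):
        """One Cartesian segment of the cyclic rest -> pick -> place movement."""
        t = np.linspace(0.0, T, n)
        s = cubic_profile(0.0, 1.0, t / T)                 # smooth third-order time distribution s(t)
        px = cubic_profile(p_start[0], p_goal[0], s)       # eq. (8)
        py = cubic_profile(p_start[1], p_goal[1], s)       # eq. (9)
        pz = h * (1.0 - np.abs(np.arctanh(h0 * (s - 0.5))) ** kappa)   # eq. (10), arc over the table
        return np.stack([px, py, pz], axis=1)

    # Example: rest point to pick point over 4 s (placeholder workspace coordinates in meters).
    trajectory = pick_place_segment(p_start=(0.35, -0.10), p_goal=(0.50, 0.05))

The trapezoidal rotation-angle profile mentioned above would be generated separately and synchronized with the same time distribution.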

V. EXPERIMENTAL RESULTS

In this section we present four experiments that illustrate the capabilities and the limitations of our approach. We show five different capabilities of VSL, which are summarized in Table I, including absolute and relative positioning, user intervention to modify the reproduction, classification, and turn-taking interaction. The video accompanying this paper shows the execution of the tasks and is available online at [20].

In all the real-world experiments the demonstration, learning, and reproduction phases are performed online, but during the demonstration phase the caregiver should take special care to avoid obstruction. In all of the experiments, in the demonstration phase, we set the size of the frame for the pre-pick observation equal to the size of the biggest object in the world, and the size of the frame for the pre-place observation 2 to 3 times bigger than the size of the biggest object in the world. In the reproduction phase, on the other hand, we set the size of the frame equal to the size of the world.

The camera is mounted above the table and looks down on the workspace. The resolution of the captured images is 1280 × 960 pixels.



Fig. 5. (a) The generated trajectory for the first pick-and-place operation of the alphabet ordering task from two viewpoints. Point 1: rest point, Point 2: pick point, Point 3: place point. (b) Four generated profiles of the rotation angle for the robot's hand used in different tasks.

TABLE I. CAPABILITIES OF VSL ILLUSTRATED IN EACH TASK

Capability                                     Animal Puzzle   Alphabet Ordering   Animals vs. Machines   Domino
Relative positioning                                X                 X                     -               X
Absolute positioning                                -                 X                     -               -
Classification                                      -                 -                     X               -
Turn-taking                                         -                 -                     X               X
User intervention to modify the reproduction        -                 -                     X               X

Although the trajectory is created in the end-effector space, we control the robot in the joint space based on inverse dynamics to avoid singularities. Also, during the reproduction phase, our controller keeps the spatial orientation of the robot's hand (end-effector) perpendicular to the workspace, in order to facilitate the pick-and-place operation. The robot's hand can still rotate parallel to the workspace plane for grasping the objects from different angles.

Table II lists the execution time for different steps of the implementation of the VSL approach.

A. Alphabet Ordering

In this VSL task, the world includes four cubic objects with A, B, C, and D labels, in addition to a fixed right-angle baseline which is a static part of the world. The goal is to assemble the set of objects with respect to the baseline according to the demonstration.

(a) The sequence of the operations by the caregiver

(b) The sequence of the operations by the robot

Fig. 6. Alphabet ordering (details in Section V-A). The initial configuration of the objects in the world is different in (a) and (b). The red arrows show the operations.

This task emphasizes VSL's capability of relative positioning of an object with respect to other surrounding objects in the world (a visuospatial skill). This inherent capability of VSL is achieved through the use of visual observations which capture both the object of interest and its surrounding objects (i.e., its context). In addition, the baseline is provided to show the absolute positioning capability of the VSL approach: we can teach the robot to attain absolute positioning of objects without defining any explicit a priori knowledge. Fig. 6(a) shows the sequence of operations in the demonstration phase. By recording pre-pick and pre-place observations, the robot learns the sequence of operations. Fig. 6(b) shows the sequence of operations reproduced by VSL in a novel world, which is obtained by randomizing the objects in the world.

B. Animal Puzzle

In the previous task, due to the absolute positioning capability of VSL, the final configuration of the objects in the reproduction and demonstration phases is always the same. In this experiment, however, the final result can be an entirely new configuration of objects, since the fixed baseline is removed from the world. The goal of this experiment is to show VSL's capability of relative positioning. In this VSL task, the world includes two sets of objects which complete a 'frog' puzzle beside a 'pond' label, together with a 'giraffe' puzzle beside a 'tree' label. Fig. 7(a) shows the sequence of operations in the demonstration phase. To show the capability of generalization, the 'tree' and the 'pond' are randomly relocated by the caregiver before the reproduction phase. The goal is to assemble the set of objects for each animal with respect to the labels according to the demonstration.

TABLE II. EXECUTION TIME FOR DIFFERENT STEPS OF VSL

Step                    Time (sec)   For each
Homography              0.2-0.35     learning task
SIFT+RANSAC             18-26        operation (reproduction)
Image subtraction       0.09-0.11    comparison
Image thresholding      0.13-0.16    comparison
Trajectory generation   1.3-1.9      operation (reproduction)
Trajectory tracking     6.6-15       operation (reproduction)
Image rectification     0.3-0.6      image
Demonstration           3.8-5.6      operation (calculation time)
Reproduction            27.5-44.1    operation


(a) The sequence of the operations by the caregiver

(b) The sequence of the operations by the robot

Fig. 7. Animal puzzle (details in Section V-B). The initial and final configurations of the objects in the world are different in (a) and (b). The red arrows show the operations.

Fig. 7(b) shows the sequence of operations reproduced by VSL. One of the explicit merits of this approach is that, without any changes in the implementation, it can deal with two different tasks.

C. Animals vs. Machines: A Classification Task

In this interactive task we demonstrate VSL's capability of classifying objects. We provide the robot with four objects, two 'animals' and two 'machines'. Also, two labeled bins are used in this experiment for classifying the objects. As in the previous tasks, the objects, labels, and bins are not known to the robot initially. In this task, firstly, all the objects are randomly placed in the world. The caregiver randomly picks objects one by one and places them in the corresponding bins. In the reproduction phase, the caregiver places one of the objects at a time, in a different sequence with respect to the demonstration. This is an interactive task between the human and the robot: the human caregiver can modify the sequence of operations in the reproduction phase by presenting the objects to the robot in a different sequence with respect to the demonstration.

To achieve the mentioned capabilities, the algorithm is modified so that the robot does not follow the operations sequentially but instead searches in the pre-pick observation dictionary to find the best matching pre-pick observation. It then uses the selected pre-pick observation for the reproduction phase as before (see the sketch below).

Fig. 8-1 and Fig. 8-2 show one operation during the demonstration phase in which the caregiver is classifying an object as an 'Animal'. To show that the VSL approach is rotation-invariant, in the reproduction phase the caregiver places each object in a different place with a different rotation. Fig. 8-3 and Fig. 8-4 show two reproduced operations by VSL.
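Schematically, the modification replaces the sequential indexing of the pre-pick dictionary with a search over all recorded pre-pick observations, reusing the hypothetical find_best_match helper sketched in Section IV-B:

    def select_operation(world_R, pre_pick_obs, already_done):
        """Choose the recorded operation whose pre-pick observation best matches the
        current world, instead of following the demonstrated order."""
        best_i, best_score = None, -1
        for i, obs in enumerate(pre_pick_obs):
            if i in already_done:
                continue
            _, _, score = find_best_match(world_R, obs)   # score = number of SIFT/RANSAC inliers
            if score > best_score:
                best_i, best_score = i, score
        return best_i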

D. Domino: A Turn-taking Task

The goal of this experiment is to show that VSL can deal with tasks involving the cognitive behaviour of turn-taking. In this VSL task, the world includes a set of objects, all of which are rectangular tiles divided into two square ends, each end labeled with a half object. (Due to lack of space in the workspace, one of the domino objects is cut in half.) In this task, the caregiver first demonstrates all the operations. The 3rd and 5th operations during the demonstration phase are shown in Fig. 9-1 and Fig. 9-2.

Fig. 8. The classification of one object by the caregiver in the third task is shown in subfigures 1 and 2. Two sets of match detection results in the reproduction phase are shown in subfigures 3 and 4. The red and green crosses on the objects and on the bins show the detected positions for the pick and place actions, respectively. (Details in Section V-C.)

In the reproduction phase, the caregiver starts the game by placing the first object (or another one) in a random place. The robot then takes its turn and finds and places the next matching domino piece. In this task we use the modified algorithm from the previous task. The human caregiver can also modify the sequence of operations in the reproduction phase by presenting the objects to the robot in a different sequence with respect to the demonstration. Fig. 9-3 and Fig. 9-4 show the match finding result by VSL and the final reproduced operation by the robot, respectively.

VI. DISCUSSION

In order to test the repeatability of our approach and to identify possible factors of failure, we used the captured images from the real-world experiments while excluding the robot from the loop. We kept all other parts of the loop intact and repeated each experiment three times. The results show that 2 out of 48 pick-and-place operations failed. The main failure factor is the match finding error, which can be resolved by adjusting the parameters of SIFT-RANSAC or by using alternative match finding algorithms. The noise in the images and the occlusion of the objects can be listed as two other potential factors of failure. Despite the fact that our algorithm is scale-invariant, color-invariant, and view-invariant, it has some limitations. For instance, if the caregiver accidentally moves one object while operating another, the algorithm may fail to find a pick/place position.


Fig. 9. Two operations during the demonstration phase by the caregiver in the domino task are shown in subfigures 1 and 2. The match detection result and the reproduced operation by the robot in the reproduction phase are shown in subfigures 3 and 4. The red cross on the object shows the detected position for the pick action by VSL. (Details in Section V-D.)

One possible solution is to combine classification techniques with the image subtraction and thresholding techniques to detect multi-object movements.

VII. CONCLUSION AND FUTURE WORK

We proposed a visuospatial skill learning approach that has powerful capabilities, as shown in the four real-world experiments. The method possesses the following capabilities: relative and absolute positioning, user intervention to modify the reproduction, classification, and turn-taking. These characteristics make the VSL approach a suitable choice for interactive robot learning tasks which rely on visual perception. Moreover, our approach is convenient for vision-based robotic platforms which are designed to perform a variety of repetitive and interactive production tasks (e.g., Baxter), because applying our approach to such robots requires no complex programming or costly integration.

Further perspectives include using a camera (e.g., a Kinect or a stereo camera) which is robust to changes in illumination conditions and can provide depth information, in order to perform assembly tasks along the z-axis as well (e.g., stacking objects on top of each other). Finally, in our future work, we aim to improve the grasping technique for objects of different sizes and materials; in this paper we applied a simple grasping method based on measuring the closing torque of the Barrett Hand.

ACKNOWLEDGMENT

This research was partially supported by the PANDORA EU FP7 project under grant agreement no. ICT-288273.

REFERENCES

[1] B. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration," Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009.

[2] P. Kormushev, S. Calinon, and D. G. Caldwell, "Imitation learning of positional and force skills demonstrated via kinesthetic teaching and haptic input," Advanced Robotics, vol. 25, no. 5, pp. 581–603, 2011.

[3] B. Burdea and H. Wolfson, "Solving jigsaw puzzles by a robot," Robotics and Automation, IEEE Transactions on, vol. 5, no. 6, pp. 752–764, 1989.

[4] D. Verma and R. Rao, "Goal-based imitation as probabilistic inference over graphical models," Advances in Neural Information Processing Systems, vol. 18, p. 1393, 2006.

[5] S. Niekum, S. Osentoski, G. Konidaris, and A. Barto, "Learning and generalization of complex tasks from unstructured demonstrations," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.

[6] N. Dantam, I. Essa, and M. Stilman, "Linguistic transfer of human assembly tasks to robots," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.

[7] C. Chao, M. Cakmak, and A. Thomaz, "Towards grounding concepts for transfer in goal learning from demonstration," in Development and Learning (ICDL), 2011 IEEE International Conference on, vol. 2. IEEE, 2011, pp. 1–6.

[8] Y. Kuniyoshi, M. Inaba, and H. Inoue, "Learning by watching: Extracting reusable task knowledge from visual observation of human performance," Robotics and Automation, IEEE Transactions on, vol. 10, no. 6, pp. 799–822, 1994.

[9] M. Lopes and J. Santos-Victor, "Visual learning by imitation with motor representations," Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 35, no. 3, pp. 438–449, 2005.

[10] A. K. Pandey and R. Alami, "Towards task understanding through multi-state visuo-spatial perspective taking for human-robot interaction," in IJCAI Workshop on Agents Learning Interactively from Human Teachers (ALIHT-IJCAI), 2011.

[11] S. R. Ahmadzadeh, P. Kormushev, and D. G. Caldwell, "Visuospatial skill learning for object reconfiguration tasks," in Intelligent Robots and Systems (IROS) 2013, IEEE/RSJ International Conference on. IEEE, 2013.

[12] A. Meltzoff and M. Moore, "Newborn infants imitate adult facial gestures," Child Development, pp. 702–709, 1983.

[13] G. Rizzolatti, L. Fadiga, V. Gallese, and L. Fogassi, "Premotor cortex and the recognition of motor actions," Cognitive Brain Research, vol. 3, no. 2, pp. 131–141, 1996.

[14] M. Asada, Y. Yoshikawa, and K. Hosoda, "Learning by observation without three-dimensional reconstruction," Intelligent Autonomous Systems (IAS-6), pp. 555–560, 2000.

[15] M. Ehrenmann, O. Rogalla, R. Zollner, and R. Dillmann, "Teaching service robots complex tasks: Programming by demonstration for workshop and household environments," in Proceedings of the 2001 International Conference on Field and Service Robots (FSR), vol. 1, 2001, pp. 397–402.

[16] M. Yeasin and S. Chaudhuri, "Toward automatic robot programming: learning human skill from visual data," Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 30, no. 1, pp. 180–185, 2000.

[17] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge Univ. Press, 2000, vol. 2.

[18] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[19] T. Y. Wong, P. Kovesi, and A. Datta, "Projective transformations for image transition animations," in Image Analysis and Processing, 2007. ICIAP 2007. 14th International Conference on. IEEE, 2007, pp. 493–500.

[20] Video accompanying this paper, available online: http://kormushev.com/goto/ICAR-2013, 2013.