alessandro bonardi , andrew j. davison - arxiv · 1 jtj u j x ˝2tj u f (˝) ^; (1) where v^= v kvk...

7
Learning One-Shot Imitation from Humans without Humans Alessandro Bonardi 1 , Stephen James 2 , Andrew J. Davison 2 Abstract— Humans can naturally learn to execute a new task by seeing it performed by other individuals once, and then reproduce it in a variety of configurations. Endowing robots with this ability of imitating humans from third person is a very immediate and natural way of teaching new tasks. Only recently, through meta-learning, there have been successful attempts to one-shot imitation learning from humans; however, these approaches require a lot of human resources to collect the data in the real world to train the robot. But is there a way to remove the need for real world human demonstrations during training? We show that with Task-Embedded Control Networks, we can infer control polices by embedding human demonstrations that can condition a control policy and achieve one-shot imitation learning. Importantly, we do not use a real human arm to supply demonstrations during training, but instead leverage domain randomisation in an application that has not been seen before: sim-to-real transfer on humans. Upon evaluating our approach on pushing and placing tasks in both simulation and in the real world, we show that in comparison to a system that was trained on real-world data we are able to achieve similar results by utilising only simulation data. Videos can be found here * . I. INTRODUCTION Humans are able to learn how to perform a task by simply observing their peers performing it once; this is a highly desirable behaviour for robots, as it would allow the next generation of robotic systems, even in households, to be easily taught tasks, without additional technology or long interaction times. Endowing a robot with the ability to learn from a single human demonstration rather than through teleoperation, would allow for a more seamless human-robot interaction. Previous work has investigated hand-engineered systems which track movements and specify a mapping between the human and robot domains [1], [2]. Rather than explicitly hand-engineered systems, an emerging trend in robotics is to instead learn control directly from raw sensor data in an end-to-end manner. These systems operate well when close and complicated coordination is required between vision and control [3]. Domain-Adaptive Meta-Learning (DAML) is a recent approach that uses an end-to-end method for one-shot imitation of humans [4] which leveraged a large amount of prior meta-training data collected for many different tasks. This approach required thousands of examples across many tasks during meta-training: these examples are videos of a person physically performing the tasks and teleoperated robot demonstrations, meaning that there has to be an active and 1 Department of Computing, Imperial College London 2 Dyson Robotics Lab, Imperial College London Research presented here has been supported by Dyson Technology Ltd * https://sites.google.com/view/tecnets-humans Fig. 1: The robot gains its ability to infer actions from humans in simulation, and can then learn a new task from a single human demonstration in the real-world. long human presence when collecting the dataset. Is there a way to reduce, or even eliminate, the amount of human presence that is needed when collecting datasets that require footage of humans? We propose that the recent successes in visual simulation-to-reality transfer [5], [6], [7], [8] suggest there is a way. To that end, we present an approach to the one-shot human imitation learning problem which does not require an active manual intervention during training, thus saving tens or hundreds of researchers hours. We show that the recent work on Task-Embedded Control Networks (TecNets) [9] can be used to infer control polices by embedding human demonstrations that can condition a control policy and achieve one-shot imitation learning. Rather than using real humans to supply demonstrations during training, we instead leverage domain randomisation in an application that has not been seen before: sim-to-real transfer on humans. arXiv:1911.01103v1 [cs.RO] 4 Nov 2019

Upload: others

Post on 21-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Alessandro Bonardi , Andrew J. Davison - arXiv · 1 jTj U j X ˝2Tj U f (˝) ^; (1) where v^= v kvk and where f is the embedding network. Using a combination of the cosine distance

Learning One-Shot Imitation from Humans without Humans

Alessandro Bonardi1, Stephen James2, Andrew J. Davison2

Abstract— Humans can naturally learn to execute a new taskby seeing it performed by other individuals once, and thenreproduce it in a variety of configurations. Endowing robotswith this ability of imitating humans from third person is avery immediate and natural way of teaching new tasks. Onlyrecently, through meta-learning, there have been successfulattempts to one-shot imitation learning from humans; however,these approaches require a lot of human resources to collectthe data in the real world to train the robot. But is there away to remove the need for real world human demonstrationsduring training? We show that with Task-Embedded ControlNetworks, we can infer control polices by embedding humandemonstrations that can condition a control policy and achieveone-shot imitation learning. Importantly, we do not use a realhuman arm to supply demonstrations during training, butinstead leverage domain randomisation in an application thathas not been seen before: sim-to-real transfer on humans. Uponevaluating our approach on pushing and placing tasks in bothsimulation and in the real world, we show that in comparisonto a system that was trained on real-world data we are able toachieve similar results by utilising only simulation data. Videoscan be found here∗.

I. INTRODUCTION

Humans are able to learn how to perform a task bysimply observing their peers performing it once; this is ahighly desirable behaviour for robots, as it would allowthe next generation of robotic systems, even in households,to be easily taught tasks, without additional technology orlong interaction times. Endowing a robot with the ability tolearn from a single human demonstration rather than throughteleoperation, would allow for a more seamless human-robotinteraction.

Previous work has investigated hand-engineered systemswhich track movements and specify a mapping between thehuman and robot domains [1], [2]. Rather than explicitlyhand-engineered systems, an emerging trend in robotics isto instead learn control directly from raw sensor data in anend-to-end manner. These systems operate well when closeand complicated coordination is required between vision andcontrol [3]. Domain-Adaptive Meta-Learning (DAML) is arecent approach that uses an end-to-end method for one-shotimitation of humans [4] which leveraged a large amount ofprior meta-training data collected for many different tasks.This approach required thousands of examples across manytasks during meta-training: these examples are videos of aperson physically performing the tasks and teleoperated robotdemonstrations, meaning that there has to be an active and

1Department of Computing, Imperial College London2Dyson Robotics Lab, Imperial College LondonResearch presented here has been supported by Dyson Technology Ltd∗https://sites.google.com/view/tecnets-humans

Fig. 1: The robot gains its ability to infer actions fromhumans in simulation, and can then learn a new task from asingle human demonstration in the real-world.

long human presence when collecting the dataset. Is therea way to reduce, or even eliminate, the amount of humanpresence that is needed when collecting datasets that requirefootage of humans? We propose that the recent successes invisual simulation-to-reality transfer [5], [6], [7], [8] suggestthere is a way.

To that end, we present an approach to the one-shothuman imitation learning problem which does not requirean active manual intervention during training, thus savingtens or hundreds of researchers hours. We show that therecent work on Task-Embedded Control Networks (TecNets)[9] can be used to infer control polices by embeddinghuman demonstrations that can condition a control policyand achieve one-shot imitation learning. Rather than usingreal humans to supply demonstrations during training, weinstead leverage domain randomisation in an application thathas not been seen before: sim-to-real transfer on humans.

arX

iv:1

911.

0110

3v1

[cs

.RO

] 4

Nov

201

9

Page 2: Alessandro Bonardi , Andrew J. Davison - arXiv · 1 jTj U j X ˝2Tj U f (˝) ^; (1) where v^= v kvk and where f is the embedding network. Using a combination of the cosine distance

Fig. 2: We use Task-Embedded Control Networks (TecNets) to allow tasks to be learned from a single human demonstration.Images of human demonstrations are embedded into a compact representation of a task, which is then expanded andconcatenated (channel-wise) to the most recent observation from a new configuration of that task before being sent throughthe control network in a closed-loop manner. Both the task-embedding net and control net are jointly optimised to producea rich embedding.

After training, we are able to deploy a system in the realworld which can perform a previously unseen task in a newconfiguration after a single real-world human demonstration.Our approach, which is summarised in Figure 1, is evaluatedon pushing and placing tasks in both simulation and in thereal world. We show we are able to achieve similar results toa system trained on real-world data. Moreover, we show thatour approach remains robust to visual domain-shifts, suchas a substantially different background, between the humandemonstrator and the robot agent performing the task.

II. RELATED WORK

Imitation learning aims to learn tasks by observing ademonstrator, and can broadly be classified into two key ar-eas: (1) behaviour cloning, where an agent learns a mappingfrom observations to actions given demonstrations [10], [11],and (2) inverse reinforcement learning [12], where an agentattempts to estimate a reward function that describes thegiven demonstrations [13], [14]. In this work we concentrateon the former. The majority of work in behaviour cloningoperates on a set of configuration-space trajectories that canbe collected via tele-operation [15], [16], kinesthetic teaching[17], [18], sensors on a human demonstrator [19], [20], [21],[22], through motion planners [5], or even by observinghumans directly. Expanding further on the latter, learningby observing humans has previously been achieved throughhand-designed mappings between human actions and robotactions [1], [2], [23], visual activity recognition and explicithandtracking [24], [25], and more recently by a system thatinfers actions from a single video of a human via an end-to-end trained system [4].

One-shot and few-shot learning is the paradigm oflearning from a small number of examples at test time, andhas been widely studied in the image recognition community[26], [27], [28], [29], [30], [31]. Our approach is based onJames et al. [9] which comes under the domain of metriclearning [32], [33]. There is an abundance of work in metriclearning, including Matching Networks [26], which use anattention mechanism over a learned embedding space to

produce a weighted nearest neighbour classifier given la-belled examples called a support set and unlabelled examplescalled a query set. Prototypical Networks [31] are similar,but differ in that they represent each class by the mean ofits examples, the prototype, and use a squared Euclideandistance rather than the cosine distance. In the case of one-shot learning, matching networks and prototypical networksbecome equivalent.

Sim-to-real methods attempt to address the apparent do-main gap of both the visual and dynamics between simulationand the real-world, which reduces the need for expensivereal-data collection. It has been shown that naively trans-ferring skills between the two domains is not possible [34],resulting in numerous attempts at transfer methods in bothcomputer vision and robotics. Domain randomisation, whichapplies random textures, lighting, and camera position tothe simulated scenes, has seen great success in numerousvision-based robotics applications [35], [36], [5], [6], [7].This method allows the algorithm operating on these ran-domised scenes to become invariant to domain differencesthat appear in the real world. Rather than directly operatingon randomised images, RCAN [8] is a recent approachthat instead translate randomised rendered images into theirequivalent non-randomised, canonical versions, producingsuperior results on a complex sim-to-real grasping task.Rather than operating on RGB images, other works haveinstead used depth images to cross the domain gap [37], [38];however, in our tasks, the colour of an object is an importantfeature when inferring what object the robot needs to interactwith, particularly when the geometry of the objects are verysimilar. In our work, we show that domain randomisationcan be leveraged to transfer the ability to infer actions fromhuman demonstrations.

III. BACKGROUND

Our approach builds on Task-Embedded Control Net-works (TecNets), which we summarise in the following. Themethod is composed of a task-embedding network and acontrol network that are jointly trained to output actions (e.g.

Page 3: Alessandro Bonardi , Andrew J. Davison - arXiv · 1 jTj U j X ˝2Tj U f (˝) ^; (1) where v^= v kvk and where f is the embedding network. Using a combination of the cosine distance

motor velocities) for a new variation of a task, given one ormore demonstrations of it. Using these demonstrations, thetask-embedding network has the responsibility of learning acompact representation of a task, which we call a sentence.The control network then takes this static sentence alongwith current observations of the world to output actions ona variation of the same task. TecNets do not have a strictrestriction on the number of tasks that can be learned, anddo not easily forget previously learned tasks during training,or after. The setup only expects the observations (e.g. visual)from the demonstrator during test time, which makes it veryapplicable for learning from human demonstrations.

Formally, a policy π for task T maps observations oto actions a, and we assume to have expert policies π∗

for multiple different tasks. Corresponding example trajec-tories consist of a series of observations and actions: τ =[(o1,a1), . . . , (oT ,aT )] and we define each task to be a setof such examples, T = {τ1, · · · , τK}. TecNets aim to learn auniversal policy π(o, s) that can be modulated by a sentences, where s is a learned description of a task T. The resultinguniversal policy π(o, s) should emulate the expert policy π∗

for task T.For training, we sample two disjoint sets of examples for

every task Tj : a support set TjU and a query set TjQ. Thesupport set is used to compute a combined sentence sj ∈ RNfor the task, by taking the normalised mean of the sentencefor each example:

sj =

[1

|TjU |

∑τ∈Tj

U

fθ(τ)

]∧, (1)

where v∧ = v‖v‖ and where fθ is the embedding network.

Using a combination of the cosine distance between pointsand the hinge rank loss (inspired by [39]), the loss for aquery set TjQ is defined as:

Lemb =∑τ∈Tj

Q

∑i6=j

max[0,margin− fθ(τ) ·sj + fθ(τ) ·si] .

(2)This loss helps learning an embedding space in which

tasks that are visually and semantically similar are also closein the embedding space. Additionally, given a sentence sj ,computed from the support set TjU , as well as examples fromthe query set TjQ, the following behaviour-cloning loss forthe policy π can be computed:

LQctr =∑τ∈Tj

Q

∑(o,a)∈τ

‖π(o, sj)− a‖22 . (3)

It was found that having the control network also predictthe action for the examples in the support set TjU leads toincreased performance. Thus, the final loss is:

LTec =∑T

λembLemb + λUctrLUctr + λQctrLQctr. (4)

IV. LEARNING FROM HUMANS USING TECNETS

We expand on the TecNets method introduced in theprevious section by incorporating the notion of a human

Algorithm 1 Training loss computation for one batch. B isthe batch size, KU and KQ are the number of examples fromthe support and query set respectively, KR is the numberof robot examples to sample, and RandomSample(S,N)selects N elements uniformly at random from the set S.

1: procedure TRAINING ITERATION2: B = RandomSample({T1, · · · ,TN}, B)3: for Tj ∈ B do4: (Rj ,Hj)← Tj

5: HjU = RandomSample(Hj ,KU )6: HjQ = RandomSample(Hj\HjU ,KQ)

7: Rj′ = RandomSample(Rj ,KR)

8: sjU =[

1KU

∑τ∈Hj

Ufθ(τ)

]∧9: sjq = fθ(τq) ∀τq ∈ HjQ

Lemb = LUHctr= LRctr = 0

10: for Tj ∈ B do11: Lemb +=

∑q

∑i6=jmax[ 0,margin−sjq ·s

jU+

sjq · siU ]12: LUHctr

+=∑τ∈Hj

U

∑(o,a)∈τ ‖π(o, s

jU )− a‖22

13: LRctr+=

∑τ∈Rj

∑(o,a)∈τ ‖π(o, s

jU )− a‖22

14: Ltec = λembLemb + λUHctrLUHctr

+ λQRctrLQRctr

15: return Ltec

demonstrator which can be summarised in Figure 2. Weslightly modify the definition of a task T to instead in-clude two collections of examples: a human demonstratorcollection H = {τ1, · · · , τK} and a robot agent collectionR = {τ1, · · · , τK}, such that T = (H,R).

From this, we now pick three disjoint sets of examples(rather than the original two) for every task Tj : a supportset of human examples HjU , a query set of human examplesHjQ, and a set of robot examples Rj

′ .In analogy to Eq. (1) a combined sentence sj ∈ RN

for a task is computed by taking the normalised mean ofthe sentence for each example in the support set of humanexamples HjU :

sj =

[1

|HjU |

∑τ∈Hj

U

fθ(τ)

]∧, (5)

We then train the embedding model to produce a higherdot-product similarity between human demonstrations of atask’s embedded example fθ(τ) and its sentence sj than tosentences of human demonstrations from other tasks si:

Lemb =∑τ∈Hj

∑i 6=j

max[0,margin− fθ(τ) · sj + fθ(τ) · si] .

(6)Additionally, given a sentence sj , computed from the

support set HjU , as well as examples from the robot set Rj′

for the same task we can compute the following behaviour-cloning loss for the policy π:

Lctr =∑τ∈Rj

Q

∑(o,a)∈τ

‖π(o, sj)− a‖22 . (7)

Page 4: Alessandro Bonardi , Andrew J. Davison - arXiv · 1 jTj U j X ˝2Tj U f (˝) ^; (1) where v^= v kvk and where f is the embedding network. Using a combination of the cosine distance

Fig. 3: An example of the variations we get when we applydomain randomisation to the simulated human arm.

The final loss is the combination of the embedding lossLemb, the control loss on the support set for the humanexamples, and the control loss on the robot examples:

Ltec = λembLemb + λUHctrLUHctr

+ λRctrLRctr

. (8)

Note that only the human examples of the same task areexplicitly enforced to be close together in the embeddingspace, rather than human and robot examples. Although wecould have also enforced an additional embedding loss onhuman and robot examples being close together, in practicewe found that this was not necessary. This is due to the jointtraining of both task-embedding and control networks whichenforces the network to implicitly learn to map the embeddedhuman examples to a set of corresponding robot actions.Pseudocode for both the training is provided in Algorithm 1.

Input to the task-embedding network consists of(width, height, 3× |τ |), where 3 represents the RGB chan-nels. As in the TecNets paper, we found that we only needto take the first and last frame of an example trajectory τfor computing the task embedding and so we discard inter-mediate frames, resulting in an input of (width, height, 6).The sentence from the task-embedding network is then tiledand concatenated channel-wise to the input of the controlnetwork (as shown in Figure 2), resulting in an input imageof (width×height×3+N), where N represents the lengthof the embedding.

A. Data Collection in Simulation

Many approaches to human imitation rely on trainingin the real world. This has many disadvantages, but mostevident is the amount of time and effort needed to collectdata for the training dataset. In the case of DAML, thousandsof demonstrations had to be recorded, which rely on anactive human presence to obtain both human and robotdemonstrations, as the robot still has to be controlled insome way. For instance, in the DAML placing experimenta total of 2586 demonstrations were collected to form thetraining dataset, meaning tens of research hours dedicatedto collecting data, with no guarantees that the dataset allowsthe network to generalise well enough. Training in simulationprovides much more flexibility and availability of data: datageneration can be easily parallelised and does not requireconstant human intervention. Additionally, there have beenmany successful examples of systems trained in simulation

Fig. 4: The simulation and real world setup. On the left, wesee the 24 DoF arm, and in the centre we see the 6 DoFKinova Mico arm; both have domain randomisation applied.On the right, we see real-world setup with the Kinova Mico.Observations come from an over-the-shoulder RGB camerain both simulation and the real world.

and then run in the real-word; one common approach to dothis is domain randomisation [35], [36], [5], [6], [7], [8].

Our approach generates the training dataset using PyRep[40], a recently released robot learning research toolkit, builton top of V-REP [41]. We modelled a 3D mesh of a humanarm from nonecg.com, which we then broke down into rigidshapes. Our simulated arm has 26 degrees of freedom: 3 inthe shoulder, 2 in the elbow, 2 for the wrist and the remaining19 in the hand. 26 revolute joints link together the differentrigid shapes: to emulate the soft-body behaviour of a realarm during motion, adjacent shapes slide over each other,making previously hidden parts of each shape visible. Theresulting effect is very similar to real human arm motion.

During dataset generation, we collect the image, the jointangles and the joint velocities at each timestep for bothhuman arm and robot. To achieve sim-to-real transfer weperform domain randomisation. Specifically, we sample froma set of 5000 textures and procedurally generated images (viaPerlin noise), and apply them to all objects in the scene andto the human arm (an example can be seen in Figure 3).Additionally, we sample the position, the orientation and thesize of the objects from a uniform distribution. The startingconfiguration of both the demonstrator and the agent, camerapose, light directions and lighting parameters are sampledfrom a normal distribution. A snapshot of the simulation andreal-world setup can be seen in Figure 4.

B. Training

Our task-embedding network and control network use aconvolutional neural network (CNN), which consists of 4convolution layers, each with 16 filters of size 5×5, followedby 3 fully-connected layers consisting of 200 neurons. Eachlayer is followed by layer normalisation [42] and an eluactivation function [43], except for the final layer, wherethe output is linear for both the task-embedding and controlnetwork.

Input consists of a 125× 125 RGB images and the robotproprioceptive data, including the joint angles. The proprio-

Page 5: Alessandro Bonardi , Andrew J. Davison - arXiv · 1 jTj U j X ˝2Tj U f (˝) ^; (1) where v^= v kvk and where f is the embedding network. Using a combination of the cosine distance

Fig. 5: The three tasks that we evaluate our model on in the real world. Left: placing an object in one container with twodistractor, same camera pose for human and robot and same background. Centre: the same placing task as on the left, butwith a domain shift (cloth on the table) between the human demonstration and the robot execution. Right: pushing oneobject to a target with one distractor, same camera pose for human and robot and same background.

(a) Placing (b) Pushing

Fig. 6: The real world test set for both the placing (a) andpushing (b) domain. For the placing domain (a), holdingobjects are on the top and placing objects (consisting ofbowls, plates, cups, and pots) are on the bottom.

ceptive data is concatenated to the features extracted from theCNN layers of the control network, before being sent throughthe fully-connected layers. The output of the embeddingnetwork (embedding size) is a vector of length 20. The outputof the control network corresponds to velocities applied tothe 6 joints of a Kinova Mico 6-DoF arm. During training,we set the margin to be 0.1 for the embedding loss Lemb,and set both the support and query size to be 5.

Optimisation was performed with Adam [44] with a learn-ing rate of 5× 10−4 and a batch size of 100. Lambdas wereset as follows: λemb = 0.1, λUHctr

= 1.0, and λRctr= 1.0.

V. RESULTS

In our experiments we try to answer the following ques-tions: (1) Can TecNets learn the domain shift between ademonstrator and an agent? In other words, can our approachlearn an embedding of a task given demonstrator examples,and also a mapping from the demonstrator domain to theagent domain for control? (2) Is it possible to learn a taskfrom a real-world human demonstration when all the trainingis done is simulation? (3) How does our approach compareto another state-of-the-art one-shot human imitation learning

method? We consider two experiments, placing and pushing,which were undertaken for DAML [4] in order to compareour approach with their results. We run a set of experimentsin both simulation and in the real world.

A. Placing

We begin by presenting our results for the placing exper-iment, both in simulation and in the real world: the goalis to place a hand-held object in a container, with othertwo containers in the scene acting as distractors. A trialis successful if the object lands inside the container. Ourdataset features a total of 2280 tasks, where each contains15 simulated human demonstrations, and 15 simulated robotdemonstrations. For each task we sampled three objects fromthe MIL dataset [45] of 105 training meshes, and usedthem as target containers and distractors; we randomised thescene as described in IV-A and we trained the network insimulation with the parameters in IV-B.

We evaluated one-shot placing in simulation on 74 tasks,with 6 trials each, using the MIL test meshes: in every trialwe randomise the position of the objects and of the camera,and we procedurally generate the hand-held object. We alsoperformed evaluation for the same system in the real world(Figure 5 Left) on 18 tasks and 4 trials, using the containersand the held objects shown in Figure 6a, maintaining thesame camera pose and background between demonstrationand trial.

The results for the placing experiment are shown in TableI. We find that the robot is able to learn from just one humandemonstration of a previously unseen task, and can leveragethe training with domain randomisation to bridge the realitygap, with comparable success rates to DAML. Additionally,we report the results of simulated placing evaluation for anetwork trained without the the control loss on the humanexamples support set LUHctr

. As it was previously outlinedin James at al. [9], the inclusion of the support loss assiststhe network in learning the task and the mapping betweendomains.

We also report the results of a real world experiment witha dataset where we simply randomised the scene and madethe held object float to the target bowl, without using oursimulated human arm. The results show that without the sim-ulated arm, the resulting real-world policy chooses a target at

Page 6: Alessandro Bonardi , Andrew J. Davison - arXiv · 1 jTj U j X ˝2Tj U f (˝) ^; (1) where v^= v kvk and where f is the embedding network. Using a combination of the cosine distance

random; therefore the arm model is necessary to successfullylearn to imitate from a single human demonstration.

Placing ExperimentDAML: Real World 93.8%Ours: Real World 88.9%

Ours: Sim 94.1%Ours: Sim λU

ctr = 0 78%Ours: Real World (No Sim Arm) 39%

TABLE I: One-shot success rate of the placing experiment,using novel objects. In simulation the scene is randomisedfor every trial. In the real-world the human demonstrationsare taken from the perspective of the robot. We achievecomparable performance to DAML despite training on noreal-world data.

B. Pushing

In the pushing experiment the goal is to push an objectagainst a target amid one distractor: a trial is successful ifthe object hits or falls within 5cm of the target. Our datasetfeatures a total of 1620 tasks, with 15 domain randomiseddemonstrations for both robot and human, using objects fromthe MIL dataset.

We evaluated one-shot pushing in both simulation andreal-world (Figure 5 Right), with the same number of trialsas for the previous placing experiments. The objects usedfor the real-world experiment are shown in Figure 6b. Wereport the results in Table II together with the DAML resultsto show that our network trained in simulation has once againcomparable performance to a model trained with real data.

Pushing ExperimentDAML: Real World 88.9%Ours: Real World 84.7%

Ours: Sim 87.6%

TABLE II: One-shot success rate of the pushing experiment.In simulation the scene is randomised for every trial, whereasin the real-world the human demonstrations are taken fromthe perspective of the robot. As in the placing results, weachieve comparable performance to DAML despite trainingon no real-world data.

C. Large Domain Shift

As a final experiment, we tested how resilient our modelis against large domain shifts in the real world, expecting itto leverage the adaptability acquired from domain randomi-sation. We evaluated placing in the real world taking thehuman demonstrations with a cloth on the table, and thenmaking the robot perform the same task with the table clothremoved, therefore with a substantial change of background(Figure 5 Centre): the model placed the held object correctly87.5% of the 72 trials.

We have therefore shown that due to domain randomi-sation our performance does not degrade on large domainshifts, whereas for example in DAML the success rate under

large change of scenes drops by up to 15%. This showcasesthe benefits of leveraging large scale simulations for roboticlearning.

VI. CONCLUSIONS

We have presented an approach to the one-shot humanimitation problem that leverages zero human interactionduring training time. We achieve this by the combinationof 2 methods. Firstly, we extending Task-Embedded ControlNetworks (TecNets) [9] to infer control polices by embed-ding human demonstrations that can condition a controlpolicy and achieve one-shot imitation learning. Secondly,and most importantly, we show that we are able to performsim-to-real on humans which allows us to train our systemwith no real-world data. With this approach, we are able toachieve similar performance to a state-of-the-art alternativemethod that relies on thousands of training demonstrationscollected in the real-world, whilst also remaining robustto visual domain-shifts, such as a substantially differentbackgrounds. For future work, we hope to further investigatethe variety of human actions that can be transferred fromsimulation to reality. For example, in this work, we haveshown that a human arm can be transferred, but would thesame method work for demonstrations including the entiretorso of a human? We hope this work provides the first stepin answering this question.

ACKNOWLEDGMENT

We thank Michael Bloesch, Ankur Handa, Sajad Saeedi,and Dan Lenton for insightful feedback on an early draft ofthis paper.

REFERENCES

[1] K. Lee, Y. Su, T.-K. Kim, and Y. Demiris, “A syntactic approachto robot imitation learning using probabilistic activity grammars,”Robotics and Autonomous Systems, vol. 61, no. 12, pp. 1323–1334,2013.

[2] Y. Yang, Y. Li, C. Fermuller, and Y. Aloimonos, “Robot learningmanipulation action plans by” watching” unconstrained videos fromthe world wide web,” in Twenty-Ninth AAAI Conference on ArtificialIntelligence, 2015.

[3] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end trainingof deep visuomotor policies,” The Journal of Machine LearningResearch, 2016.

[4] T. Yu, C. Finn, A. Xie, S. Dasari, T. Zhang, P. Abbeel, and S. Levine,“One-shot imitation from observing humans via domain-adaptivemeta-learning,” Robotics: Science and Systems, 2018.

[5] S. James, A. J. Davison, and E. Johns, “Transferring end-to-endvisuomotor control from simulation to real world for a multi-stagetask,” Conference on Robot Learning, 2017.

[6] J. Matas, S. James, and A. J. Davison, “Sim-to-real reinforcementlearning for deformable object manipulation,” Conference on RobotLearning, 2018.

[7] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrish-nan, L. Downs, J. Ibarz, P. Pastor, K. Konolige et al., “Using simulationand domain adaptation to improve efficiency of deep robotic grasping,”in 2018 IEEE International Conference on Robotics and Automation(ICRA). IEEE, 2018, pp. 4243–4250.

[8] S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan,J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis, “Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonicaladaptation networks,” in Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition, 2019, pp. 12 627–12 637.

Page 7: Alessandro Bonardi , Andrew J. Davison - arXiv · 1 jTj U j X ˝2Tj U f (˝) ^; (1) where v^= v kvk and where f is the embedding network. Using a combination of the cosine distance

[9] S. James, M. Bloesch, and A. J. Davison, “Task-embedded control net-works for few-shot imitation learning,” Conference on Robot Learning,2018.

[10] D. A. Pomerleau, “Alvinn: An autonomous land vehicle in a neuralnetwork,” Advances in Neural Information Processing Systems, 1989.

[11] S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learningand structured prediction to no-regret online learning,” InternationalConference on Artificial Intelligence and Statistics, 2011.

[12] A. Y. Ng, S. J. Russell et al., “Algorithms for inverse reinforcementlearning.” International Conference on Machine Learning, 2000.

[13] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse rein-forcement learning,” International Conference on Machine learning,2004.

[14] C. Finn, S. Levine, and P. Abbeel, “Guided cost learning: Deep inverseoptimal control via policy optimization,” International Conference onMachine Learning, 2016.

[15] S. Calinon, P. Evrard, E. Gribovskaya, A. Billard, and A. Kheddar,“Learning collaborative manipulation tasks by demonstration usinga haptic interface,” in 2009 International Conference on AdvancedRobotics. IEEE, 2009, pp. 1–6.

[16] T. Zhang, Z. McCarthy, O. Jow, D. Lee, K. Goldberg, and P. Abbeel,“Deep imitation learning for complex manipulation tasks from vir-tual reality teleoperation,” International Conference on Robotics andAutomation, 2018.

[17] B. Akgun, M. Cakmak, J. W. Yoo, and A. L. Thomaz, “Trajectoriesand keyframes for kinesthetic teaching: A human-robot interactionperspective,” in Proceedings of the seventh annual ACM/IEEE inter-national conference on Human-Robot Interaction. ACM, 2012, pp.391–398.

[18] P. Pastor, L. Righetti, M. Kalakrishnan, and S. Schaal, “Onlinemovement adaptation based on previous sensor experiences,” in 2011IEEE/RSJ International Conference on Intelligent Robots and Systems.IEEE, 2011, pp. 365–371.

[19] R. Dillmann, “Teaching and learning of robot tasks via observation ofhuman performance,” Robotics and Autonomous Systems, vol. 47, no.2-3, pp. 109–116, 2004.

[20] S. Ekvall and D. Kragic, “Interactive grasp learning based on humandemonstration,” in IEEE International Conference on Robotics andAutomation, 2004. Proceedings. ICRA’04. 2004, vol. 4. IEEE, 2004,pp. 3519–3524.

[21] S. Calinon and A. Billard, “Teaching a humanoid robot to recognizeand reproduce social cues,” in ROMAN 2006-The 15th IEEE Interna-tional Symposium on Robot and Human Interactive Communication.IEEE, 2006, pp. 346–351.

[22] V. Kruger, D. L. Herzog, S. Baby, A. Ude, and D. Kragic, “Learningactions from observations,” IEEE robotics & automation magazine,vol. 17, no. 2, pp. 30–43, 2010.

[23] J. Rothfuss, F. Ferreira, E. E. Aksoy, Y. Zhou, and T. Asfour, “Deepepisodic memory: Encoding, recalling, and predicting episodic expe-riences for robot action execution,” IEEE Robotics and AutomationLetters, vol. 3, no. 4, pp. 4007–4014, 2018.

[24] J. Lee and M. S. Ryoo, “Learning robot activities from first-personhuman videos using convolutional future regression,” in Proceedingsof the IEEE Conference on Computer Vision and Pattern RecognitionWorkshops, 2017, pp. 1–2.

[25] K. Ramirez-Amaro, M. Beetz, and G. Cheng, “Transferring skills tohumanoid robots by extracting semantic representations from observa-tions of human activities,” Artificial Intelligence, vol. 247, pp. 95–118,2017.

[26] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., “Matchingnetworks for one shot learning,” Advances in Neural InformationProcessing Systems, 2016.

[27] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networksfor one-shot image recognition,” ICML Deep Learning Workshop,2015.

[28] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap,“Meta-learning with memory-augmented neural networks,” Interna-tional Conference on Machine Learning, 2016.

[29] S. Ravi and H. Larochelle, “Optimization as a model for few-shot learning,” International Conference on Learning Representations,2017.

[30] E. Triantafillou, R. Zemel, and R. Urtasun, “Few-shot learning throughan information retrieval lens,” Advances in Neural Information Pro-cessing Systems, 2017.

[31] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” Advances in Neural Information Processing Systems,2017.

[32] B. Kulis et al., “Metric learning: A survey,” Foundations and Trendsin Machine Learning, 2012.

[33] A. Bellet, A. Habrard, and M. Sebban, “A survey on metric learning forfeature vectors and structured data,” arXiv preprint arXiv:1306.6709,2013.

[34] S. James and E. Johns, “3d simulation for robot arm control withdeep q-learning,” NIPS Workshop (Deep Learning for Action andInteraction), 2016.

[35] F. Sadeghi and S. Levine, “Cad2rl: Real single-image flight without asingle real image,” arXiv preprint arXiv:1611.04201, 2016.

[36] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel,“Domain randomization for transferring deep neural networks fromsimulation to the real world,” Intelligent Robots and Systems (IROS),2017 IEEE/RSJ International Conference on, 2017.

[37] U. Viereck, A. t. Pas, K. Saenko, and R. Platt, “Learning a visuomotorcontroller for real world robotic grasping using easily simulated depthimages,” in Conference on Robot Learnng.

[38] M. Gualtieri, A. ten Pas, K. Saenko, and R. Platt, “High precisiongrasp pose detection in dense clutter,” in IROS, 2016, pp. 598–605.

[39] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolovet al., “Devise: A deep visual-semantic embedding model,” Advancesin Neural Information Processing Systems, 2013.

[40] S. James, M. Freese, and A. J. Davison, “Pyrep: Bringing v-rep todeep robot learning,” arXiv preprint arXiv:1906.11176, 2019.

[41] E. Rohmer, S. P. N. Singh, and M. Freese, “V-rep: a versatile andscalable robot simulation framework,” in Proc. of The InternationalConference on Intelligent Robots and Systems (IROS), 2013.

[42] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXivpreprint arXiv:1607.06450, 2016.

[43] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accuratedeep network learning by exponential linear units (elus),” InternationalConference on Learning Representation, 2016.

[44] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimiza-tion,” International Conference on Learning Representation, 2015.

[45] C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine, “One-shot visualimitation learning via meta-learning,” Conference on Robot Learning,2017.