
Vision Guided Cutting Skill Acquisition for Cognitive Robots

Yi Zhang1, Yezhou Yang1, Cornelia Fermüller2, Yiannis Aloimonos1

Abstract— In this paper, we present a cognitive learning framework that enables a humanoid robot to acquire the skill of cutting. Our framework consists of two main modules: 1) a visual module, trained on deep features, that visually estimates an object's firmness level, and 2) a motor skill learning module that adopts a reinforcement learning method to generate optimized control parameters for cutting. We propose an evaluation measure for the cutting motion based on energy efficiency. Experiments conducted on both tasks show that 1) the visual module estimates the firmness level of a novel object with state-of-the-art performance, and 2) the motor learning module, given the estimated firmness level from the visual module, improves the actual cutting performance on a humanoid robot (Baxter). We also show that the proposed energy efficiency measure is a proper criterion for evaluating the cutting skill of cognitive robots.

I. INTRODUCTION

Home assistant robots, which aim to help human beings with daily routine tasks, need to master the skill of cutting. Here we consider the cutting action that aims to divide a target object into pieces with a knife or other sharp implement. At first glance, cutting seems a relatively easy skill to acquire. Yet even for humans it is notoriously difficult to become accurate with knife work while using minimum energy, so as to create dishes that cook evenly and look truly professional. The question becomes: how do we design a learning mechanism that equips a humanoid robot with the skill of cutting? In this paper, we argue that acquiring the skill of cutting requires both 1) visual recognition of an object's attributes (such as firmness), and 2) fine-level robot manipulator control.

See and Estimate: before actually applying a cutting action, we, as human beings, evaluate the physical attributes of the target object, especially its level of firmness. This is mainly to estimate how much force is needed to cut through the target object while minimizing the energy spent. In the field of Computer Vision, methods for estimating physical attributes from visual input have been studied [4], [6], [22]; however, these works are mainly concerned with either binary or relative attributes. For estimating object firmness, we need a measure with multiple levels.

Act and Fine-tune: given a visual estimate of the firmness measure, we generate a coarse cutting policy. After that, through repetitive trials of cutting this specific category of

1Y. Zhang, Y. Yang, and Y. Aloimonos are from the Department of Computer Science and 2C. Fermüller is from UMIACS, University of Maryland, College Park, MD 20742 USA. {zhange} at umd.edu, {yzyang, yiannis} at cs.umd.edu and {fer} at umiacs.umd.edu

Fig. 1. How a human learns to cut a new object: visual perception → firmness prediction → skill practice.

objects, we, as human beings, gradually learn to adapt our policy parameters to master the cutting of the specific target object. This process is generally referred to as "reinforcement learning" in the field of Robotics. Fig. 1 depicts how a human learns to cut a new object.

Inspired by these observations and intuitions, this work presents the first implementation of a vision-guided approach that helps a humanoid robot master the skill of cutting. It consists of two modules: 1) a CNN-based visual recognition module that first estimates the level of firmness of an arbitrary target object; and 2) a reinforcement learning paradigm that further refines the cutting policy with regard to the target object.

Moreover, cutting is an action that has to be defined from the perspective of applied forces, due to the inevitable interaction between the sharp implement and the target object. In the robotics literature, a vast number of actions have been studied for humanoids, and most of them rely on trajectory-based analysis [15], [20], [30], [31]. In this paper, we seek a more generative representation of the action, one that takes into account the forces involved, whether introduced by the agent, by the object the agent is interacting with, or by the environment. In this work, we use the force pattern to define the cutting policy and present a force-based controller to generate the cutting action on a humanoid platform.

The paper is organized in the following manner: we first summarize closely related work in the fields of Computer Vision, Robotics and Cognitive Systems in Sec. II; then we present our novel cognitive and developmental framework for cutting skill learning in Sec. III. In Sec. IV, three experiments are conducted to validate the proposed system, and we also present a new object firmness data-set


for future research.

II. RELATED WORK

A. Studies on manipulation skill acquisition

Previous works have described robot systems capable of performing manipulative actions and skills in the kitchen environment. Lenz et al. [19] presented a model-predictive control approach using deep learning techniques for food cutting. Yamaguchi et al. [31] developed a pouring behavior model that consists of a library of small skills and uses learning and planning methods to generalize across materials, container shapes, contexts, container locations and target amounts. Bollini et al. [2] designed a chef robot that parses a recipe, turns it into a high-level plan involving actions like pour, mix and bake, and executes it. Beetz et al. [1] demonstrated a robot with the skill of cooking a pancake; the robot has a perception routine for object detection and localization as well as for checking the estimated result, and performs force-adaptive control to flip the pancake using a spatula. Kormushev et al. [15] also studied the pancake-flipping skill, focusing on modulating demonstrated motion primitives with a reinforcement learning technique. Another important domestic application of autonomous robots is laundry tasks. Van den Berg et al. [30] presented an algorithm for the robot to detect the geometry of the cloth and determine the motion of the gripper that achieves the desired folding configuration. Lee et al. [18] proposed a method for generalizing the demonstrated skill over both kinematic and driven phases, demonstrated on the task of tying a knot. The cutting action is rarely studied as a skill for the humanoid robot. Our work presents a novel framework that includes both visual and motor learning for robots learning a cutting action.

B. Skill acquisition based on physical force

Several studies focus on enabling a robot to generalize force-related skills to different situations. Gams et al. [5] proposed a method for the robot's motion trajectory to adapt to different surfaces while keeping the same force profile using DMPs. Kalakrishnan et al. [10] developed an approach that enables the robot to acquire contact interaction skills through reinforcement learning. Our work is similar to that approach, but we also use a vision system to generalize the policy across different object conditions. Most of those studies require human demonstration: [16] uses a haptic device for the force demonstration, and [23], [21], [11] demonstrate force and position profiles simultaneously through kinesthetic teaching. Our system does not require a human demonstration; the initial parameters are generated randomly, and the reward feedback after every trial of execution guides the policy search to converge toward a reasonable configuration.

C. Motor skill learning through policy search

Reinforcement learning problems in robotics differ from typical benchmark problems for reinforcement learning. The problems are usually defined with high-dimensional, continuous states and actions. There are also challenges in obtaining the true states from the environment, the high cost of each roll-out, and under-modeling of the details of the real-world system [12]. Policy search methods are more suitable for robotics reinforcement learning problems than most other RL approaches, since they better integrate domain-specific knowledge, handle parameterized policies, and have been applied successfully in robotics [25], [27], [14], [13]. Another category of methods for policy improvement is black-box optimization (BBO). Recent studies have compared BBO methods with other RL methods and found them to have equal or better performance despite their simplicity [8], [29]. In this work, we use the CMA-ES algorithm [7], which is considered a standard in BBO [28].

III. OUR FRAMEWORK

In this section, we present a framework for cutting skill acquisition. Fig. 2 is a flowchart that illustrates the overall framework. It consists of two main modules: observation and execution.

The vision module takes an image of the target object as input, and is trained to predict an initial estimate of the target object's firmness level. The prediction model is trained with a supervised learning method, and we extract the visual features using a pre-trained Convolutional Neural Network (CNN).

The motor control module takes the belief of the firmness level to initialize the parameters of the controller that executes a cutting action. However, just as human beings learning a new skill, given the visually recognized physical attribute, we further fine-tune our motor skill through interaction with the physical world and observed feedback. This observation suggests a reinforcement learning paradigm for improving the control parameters. We formulate the robot cutting action as a Markov Decision Process (MDP): at each time step, the amount of force exerted on the knife by the robot's end effector is estimated based on the state of the knife. This MDP is the inner loop of the execution of each cutting trial, as shown in Fig. 3. After one trial of cutting is executed, a cutting quality score is generated with respect to the actual execution. The score is then used to update the action parameters; this makes up the outer loop in Fig. 3. The iterative learning process continues until the score of the action converges to a local maximum. We then store the trained set of cutting parameters as the learned knowledge with regard to the firmness level of the object.

A. Visual learning module for predicting object firmness

As human beings, when we are given an everyday object to cut, we are able to first visually estimate the firmness of the object from its appearance. People use the tactile sense to get haptic feedback confirming how hard the object is (by pressing its surface), but this is not always used; visual perception remains the main source of information for us to


Fig. 2. Framework flowchart: the input image passes through the CNN model to produce an image feature, a regression model predicts the object firmness, and the firmness initializes the controller's policy parameters; after each execution, a score is fed to the RL algorithm, which updates the policy parameters.

Fig. 3. Execution and reinforcement learning loop: the inner agent-environment loop (action/state) occurs at each time step within an episode, while the outer loop, in which CMA-ES updates the policy parameters from the episode reward, occurs once per episode.

estimate object firmness. Moreover, when we are given a material (or object) that we have never encountered before, we are able to infer a rough firmness estimate from our previous life experience. Given these observations and intuitions, we formulate a learning problem to empirically study the visual firmness-level estimation module.

First, we compiled a testing-bed that consists of everyday fruits and vegetables. The dataset contains images of a variety of categories, and the objects considered in our experiments have a large variation in firmness level. Without loss of generality, we constrain the firmness level to an integer parameter Y_f whose value ranges from 1 to 10, where 1 is the softest and 10 is the hardest.

In this work, we use a pre-trained CNN from [17]. The convolutional layers contain 96 to 1024 kernels of size 3x3 to 7x7. Max-pooling kernels of size 3x3 and 5x5 are used at different layers to build robustness to intra-class deformations. For each image in our testing-bed, we use this pre-trained model to extract a 4096-dimensional feature vector using [3], [17]. We then train a support vector machine (SVM) classifier to perform firmness level recognition on the deep features. For each image I in the training set, we have an annotated training pair (CNN(I), Y_f). The

training model can be represented by:

Y_f = SVM_θ(CNN(I)), (1)

where SVM stands for support vector machine, θ is the SVM parameter to be optimized, and CNN stands for the pre-trained network model used for feature extraction.

For each testing image I in our dataset, we apply the trained model SVM_θ and convert the classification scores to Pr(Y_f | I). At execution time, before cutting, the humanoid robot acquires an image of the object from its camera, attends to the object area through an attention mechanism, and then predicts the firmness of the object using the trained model. In the motor learning module, we use this prior belief of how hard the object is to initialize the parameters of the controller for the initial cutting action.
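The following is a minimal sketch of Eq. (1), assuming the 4096-d deep features CNN(I) have already been extracted (the paper uses Caffe for that step, which is omitted here) and saved to hypothetical .npy files; an SVM classifier over the ten firmness levels plays the role of SVM_θ.

```python
# Sketch of Eq. (1): firmness prediction from pre-extracted deep features.
# File names are hypothetical placeholders for pre-computed CNN(I) vectors.
import numpy as np
from sklearn.svm import SVC

X_train = np.load("train_features.npy")  # N x 4096 deep features CNN(I)
y_train = np.load("train_firmness.npy")  # annotated firmness levels Y_f

clf = SVC(probability=True)              # SVM_theta in Eq. (1)
clf.fit(X_train, y_train)

x = np.load("test_feature.npy").reshape(1, -1)
prob = clf.predict_proba(x)[0]           # approximates Pr(Y_f | I)
firmness = clf.classes_[np.argmax(prob)]
```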

B. Reinforcing the cutting skill

We formulate the cutting process as a Markov Decision

Process (MDP). An MDP consists of a set of states S, a set of actions A, the transition probability P and the reward function R. A fundamental assumption of an MDP is that the current state s_t (t denotes the time step) only depends on the state s_{t−1} and action a_{t−1} of the previous time step, which can be formulated as:

s_t ∼ p(s_t | s_{t−1}, a_{t−1}). (2)

An action at time step t, a_t, is generated by a policy π_θ based on the current state, which can be formulated as

a_t ∼ π_θ(a_t | s_t). (3)

An episode is defined as a sequence of states and actions; it ends when a certain terminal state is reached. The expected reward of an episode can be formulated as:

J(θ) = (1/α_Σ) E{ Σ_{t=0}^{T} α_t r_t }, (4)

where θ is the policy parameter vector, α_t is the weighting factor for each step and α_Σ is the normalization factor. The goal of reinforcement learning is to find a set of policy parameters θ such that the expected reward J(θ) is maximized. One category of methods that has been applied to this problem is the policy gradient approach, which follows the steepest gradient of the expected reward to search for the optimized parameters. At each time step t, the update of the policy parameters can be expressed as

θ_{t+1} = θ_t + α_t ∇_θ J |_{θ=θ_t}. (5)

In this work, we adopt the covariance matrix adaptation evolution strategy (CMA-ES) to search for the optimized policy parameters for cutting skill learning. Similar to quasi-Newton methods, it is a second-order method that estimates a covariance matrix through an iterative procedure. The main advantage of the algorithm is that it does not approximate the derivatives of the objective function; thus, it can be applied to non-smooth and noisy problems. It is widely considered one of the most robust direct policy search algorithms for reinforcement learning problems [8], [29].
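A minimal sketch of this outer policy-search loop, using Hansen's `cma` package (pip install cma); the paper does not name its CMA-ES implementation, and `run_cutting_episode`, which would execute one cutting trial on the robot and return the episode reward, is a hypothetical stand-in.

```python
# CMA-ES policy search over theta in R^4 via the ask/tell interface.
import cma

def run_cutting_episode(theta):
    """Execute one cutting trial with policy parameters theta; return reward."""
    raise NotImplementedError  # robot/simulation specific

es = cma.CMAEvolutionStrategy(4 * [0.0], 0.5)  # initial mean and step size
while not es.stop():
    candidates = es.ask()                      # sample candidate parameters
    # CMA-ES minimizes, so negate the episode reward J(theta)
    costs = [-run_cutting_episode(theta) for theta in candidates]
    es.tell(candidates, costs)                 # update mean and covariance
theta_star = es.result.xbest                   # optimized policy parameters
```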


C. Evaluating a cutting action

In order to improve the cutting skill, we first need a way to evaluate the cutting result. We base our evaluation metric design on the following observations: (1) a cutting action using an unnecessarily large amount of force can separate the target object, but it wastes energy and is not generally considered a safe operation when a human executes a cutting action; (2) a cutting action exerting too small a force is inefficient in terms of time; (3) a good cutting action requires not only separating the target object but also balancing well between the exerted force and the execution time. Thus, in this paper we propose the cutting energy as the evaluation measure for how good a generated cutting action is.

Here, we use E to denote the cutting energy. It can be expressed as the integral of the exerted force over the path of the full motion sequence:

E = ∫_C F ds. (6)

To illustrate the principle of minimizing energy, take sawing as an example of balancing distance against applied force: when a saw is used during a cutting task, the cutting path is prolonged, but the amount of force applied in the cutting process is reduced. In this study, we fix the cutting path, so the parameter of the evaluation function is the force profile, which is generated from the state of the robot's end effector and the policy parameters.
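Since the robot samples force and position at discrete time steps, Eq. (6) reduces in practice to numerical quadrature; a worked instance with made-up sample values:

```python
# Discrete approximation of Eq. (6), E = integral of F over the path C.
# `force` and `position` are hypothetical per-timestep recordings of the
# force magnitude and the accumulated path length of the end effector.
import numpy as np

force = np.array([2.0, 5.0, 8.0, 8.0, 3.0])          # |F| along the path (N)
position = np.array([0.00, 0.01, 0.03, 0.05, 0.06])  # path length s (m)

energy = np.trapz(force, position)  # trapezoidal rule for E = ∫_C F ds
print(f"cutting energy ≈ {energy:.3f} J")
```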

D. Generalization of the optimized parameters through the firmness level

As mentioned in Sec. III-A, a visual module is trained to predict a novel object's firmness level Y_f^I given its visual appearance I. In order to link Y_f^I with different policy parameters θ, we train several sets of policy parameters θ for different firmness levels Y_f. We then create a hash-table H with the firmness level as the hash-key and the set of policy profiles θ as the table content. Thus, for each level of firmness considered in Sec. III-A, we can retrieve a pre-trained control policy θ_I as:

θ_I = H(Y_f^I). (7)

When the robot is given a novel object to cut, our visual system first predicts a firmness level; then a set of optimized policy parameters is retrieved from the hash-table. In the experiment section, we test two ways of applying the retrieved control policy: 1) directly apply the retrieved policy to the controller; the experimental results show that this can already be a fairly good policy for cutting the novel object; 2) apply a further round of the reinforcement learning procedure. Nevertheless, since the retrieved policy parameters are not randomly initialized, the learning process takes much less time for the reward to converge.
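A plain dictionary suffices as a sketch of H in Eq. (7); the parameter vectors below are illustrative placeholders, not trained values, and the nearest-level fallback is an assumption beyond the paper.

```python
# Firmness-to-policy lookup table H, keyed by the predicted firmness level.
H = {
    3: [0.8, -0.2, 0.1, 4.5],   # theta trained for firmness level 3
    5: [1.1, -0.3, 0.2, 7.0],   # theta trained for firmness level 5
    8: [1.6, -0.5, 0.3, 11.2],  # theta trained for firmness level 8
}

def retrieve_policy(firmness_level):
    """theta_I = H(Y_f^I); fall back to the nearest trained level."""
    if firmness_level in H:
        return H[firmness_level]
    nearest = min(H, key=lambda k: abs(k - firmness_level))
    return H[nearest]
```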

E. A force-based controller for cutting action

It is worth mentioning a practical issue in implementing a cutting action on a humanoid robot platform. To execute the cutting action on a humanoid robot, a mere trajectory-based motion controller does not suffice, mainly because a trajectory-based controller only takes a position and orientation target as the movement goal for execution. When a trajectory-based controller is applied to the cutting action, a relatively high resistance from a hard object can stop the robot end effector from moving further; the execution of the action then halts when the execution time exceeds a certain threshold.

To solve this issue, we designed a force-based controller for implementing the cutting action on the humanoid robot. The controller essentially creates a new layer on top of the trajectory-based controller. We augmented the controller so that at every time step we assign not only a target pose but also how much torque to exert on each actuator. The robot joint torque τ_q at each time step follows

τ_q = J^T(q) τ_x, (8)

τ_x ∝ ∆x(F, x_t), (9)

where τ_x is the generalized force in the end-effector space. It is proportional to the spatial difference ∆x between the end-effector's current position x_t and the target position. The target position is determined by how much force F the robot intends to exert on the external environment through its end-effector; in the cutting application, this is how much force the robot uses to cut the object. τ_x is converted to τ_q, the torque on every joint of the robot arm, through the Jacobian transpose J^T(q). In this manner, our robot can follow a certain trajectory while executing the desired force profile.
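A minimal NumPy sketch of one step of this controller; the virtual stiffness gain, the placeholder Jacobian, and the way the target position encodes the desired force F are assumptions for illustration, since on the Baxter these quantities come from its kinematics interface.

```python
# One step of the force-based controller of Eqs. (8)-(9).
import numpy as np

STIFFNESS = 50.0  # virtual stiffness k (N/m), hypothetical

def target_from_force(x_current, F_desired):
    """Pick the target position so the stiffness law realizes F_desired."""
    return x_current + F_desired / STIFFNESS

def joint_torques(J, x_current, x_target):
    """Eq. (9): tau_x proportional to Delta-x; Eq. (8): tau_q = J^T(q) tau_x."""
    delta_x = x_target - x_current   # spatial difference Delta-x
    tau_x = STIFFNESS * delta_x      # end-effector space generalized force
    return J.T @ tau_x               # map to joint torques via the transpose

# toy usage: 3-D task space, 7-DOF arm
J = np.random.randn(3, 7)            # placeholder Jacobian J(q)
x = np.array([0.6, 0.0, 0.2])        # current end-effector position (m)
F = np.array([0.0, 0.0, -8.0])       # desired downward cutting force (N)
tau_q = joint_torques(J, x, target_from_force(x, F))
```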

IV. EXPERIMENTS

The theoretical framework we have presented suggests three hypotheses that deserve empirical tests: (a) firmness can be inferred, with decent accuracy, from visual features extracted by the CNN-based visual module; (b) using the reinforcement learning paradigm can improve the cutting skill as evaluated by the energy consumption of the cutting motion; (c) on a humanoid robot platform, the optimized control parameters generalize to a good cut for novel objects with a similar firmness level.

To test the three hypotheses empirically, we need to define a set of performance variables. The first hypothesis concerns the visual system; we test the system's performance by comparing the predicted firmness with the ground truth on the testing split of the dataset, using the mean absolute error (MAE) as the evaluation metric. We then use the same metric to compare our system's performance with a baseline method to support hypothesis (a). Hypothesis (b) is about validating the reinforcement learning module's performance. We qualitatively test the hypothesis


by analyzing the reward and the convergence trend of the motor control policy parameters over a number of trials during iterative learning. For the third hypothesis, we empirically test it by having the robot cut object B, with a similar firmness level, using the same set of policy parameters trained on object A, and observe whether the system still obtains sufficiently good performance without going through the iterative reinforcement learning phase.

A. Firmness prediction

1) Setup: To validate hypothesis (a), we need a data-set to serve as a test-bed. However, no public image data-set is available for our purpose. Thus, we compiled the first object firmness data-set for validation and evaluation.

The data-set consists of 44 categories of fruits and vegetables (the category here refers to the type of fruit or vegetable, e.g. apple, orange, broccoli). Each category has 10 to 20 images crawled from Google image search. Fig. 4 shows some examples of the images in the data-set. Most of the categories coincide with the object's name, but there are a few special instances. For example, the data-set has two categories for watermelon: one for whole-body watermelon and the other for sliced watermelon. These two categories share the same object name but have apparently different firmness levels (sliced watermelon is much softer than whole-body watermelon).

We follow a pre-defined training and testing protocol for our experiments. The data-set is divided into training, development and testing sets. We take 2 instances from each category as the development data and another 2 instances as the testing data. The total numbers of instances in the training, development and testing sets are 678, 88 and 88, respectively.

Here, we set the firmness level as an integer value ranging from 1 to 10. For example, a coconut has a firmness level toward 10 since it is hard, and a strawberry has a firmness level toward 1 since it is soft. Since the firmness level is a subjective measure, we designed a survey to find the consensus ground-truth firmness level of each instance. A snapshot of the survey is shown in Fig. 5. The human subject is asked to give a number on a scale from 1 to 10 describing how much force he or she would use to cut the fruit or vegetable shown in the picture. The survey was distributed on Mechanical Turk and we collected 100 valid responses for each of the 44 categories (4400 responses in total). The firmness level of each category is obtained by averaging the 100 responses. In this manner, every instance in our data-set is annotated with a consensus firmness level.

We used a pre-trained Convolutional Neural Network to extract features from the images, using Caffe [9], an open-source deep learning framework. The main advantage of a CNN over the traditional hand-crafted feature extraction procedure is that it yields a much more sophisticated representation of the image through its deep hierarchical structure. The network we used was trained on millions of images from the ImageNet visual object recognition

Fig. 4. Examples of images in the data-set, arranged on a scale of firmness from soft to hard.

Fig. 5. The survey interface used to collect the ground truth for the firmness label.

challenge. For each image instance in our data-set, we extracted the deep feature using this pre-trained network, which has a dimension of 4096.

Since the output of the model is a firmness value rather than a category, we approach this task with a regression model. The specific model we choose is the support vector machine for regression (SVR) implemented in Scikit-learn, a machine learning Python package [24]. By using different kernels, we are able to increase the dimension of the input data; specifically, in this paper we use the linear, second-order polynomial, and radial basis function (RBF) kernels. We use the mean absolute error (MAE) as the evaluation metric, defined as the absolute difference between the prediction and the ground truth divided by the number of samples. The hyper-parameter tuned on the development data is the penalty parameter for the error term in the SVM model: a higher penalty value leads to over-fitting, while a lower one can lead to under-fitting.
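A minimal sketch of this setup, assuming the 4096-d deep features have already been extracted and saved (file names are hypothetical): Scikit-learn's SVR with the three kernels, swept over the penalty parameter C and scored by MAE on the development split.

```python
# Kernel and penalty-parameter sweep for the firmness regression model.
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.svm import SVR

X_tr, y_tr = np.load("train_X.npy"), np.load("train_y.npy")
X_de, y_de = np.load("dev_X.npy"), np.load("dev_y.npy")

for kernel in ("linear", "poly", "rbf"):
    for C in 10.0 ** np.arange(-6.0, 2.0):         # penalty parameter sweep
        model = SVR(kernel=kernel, degree=2, C=C)  # degree used by 'poly' only
        model.fit(X_tr, y_tr)
        mae = mean_absolute_error(y_de, model.predict(X_de))
        print(f"{kernel:6s} C={C:g}  dev MAE={mae:.3f}")
```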

2) Results: Fig. 6 plots the mean absolute error on the training and development data for the different SVM kernels as a function of the penalty parameter. The mean absolute error is expressed as

MAE = (1/n) Σ_{i=1}^{n} |f_i − y_i|, (10)

where y_i is the ground-truth label and f_i is the prediction. The horizontal axis is displayed on a log scale. The error on the training data (solid lines) decreases monotonically for all three kernels. The error on the development data (dashed lines) decreases at first and then increases. The lowest error value occurs when the penalty parameter is 1.0


TABLE I
MAE OF OUR APPROACH AND THE BASELINE APPROACH ON THE TRAINING, DEVELOPMENT AND TESTING DATA

            Training   Development   Testing
CNN+SVM       0.24        0.67        0.83
Baseline        /           /         1.57

Fig. 6. Tuning the hyperparameter on the development data: mean absolute error versus the penalty parameter (log scale, 10^-6 to 10^1) for the RBF, linear and polynomial kernels, on both the training (solid) and development (dashed) data.

using the second-order polynomial kernel, which we treat as the tuned hyperparameter during testing.

The MAEs with the setting optimized on the development data are summarized in Table I. For the testing data, the mean absolute error is 0.83. As a comparison, we use a constant firmness-level labeling method as a naive baseline: regardless of the image, the baseline predicts a firmness level of 5.0. This naive method yields an MAE of 1.57 on the testing data. We note that the baseline performs decently because the firmness level of objects follows a roughly normal distribution, i.e., most fruits and vegetables have a firmness level close to 5.0. Nevertheless, our method performs much better at predicting object firmness, achieving an MAE of only 0.83.

B. Reinforcing the cutting skill

Fig. 7. Experimental setup

1) Setup: We demonstrate the reinforcement learning module by training a set of policy parameters for cutting objects on the humanoid robot. Here we use cucumbers as the main experimental target object.

The cutting motion is formulated in 1D space, along the direction of the gravity force in the robot coordinate system. The state is s = [x, ẋ, ẍ], in which x is the distance from the knife to the cutting board; in other words, it measures how far the object is from being cut through. ẋ is the velocity and ẍ is the acceleration. The action is defined as the increment of force ∆F_z applied at the end effector of the robot. The policy parameters are θ ∈ R^4, and we choose the deterministic policy ∆F_z = s · θ_{1:3} + θ_4.

An episode ends when 1) the time exceeds 30 seconds, 2) x becomes larger than 20 cm, or 3) x becomes 0, i.e. the knife reaches the cutting board. The total reward for the entire episode is calculated as R = −∫_0^T F_z v_z dt, where T is the time when the object is successfully cut through. If the episode finishes because x is larger than 20 cm, the knife never touched the object; in such a case, we apply a reward of −100.
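A minimal sketch of this episode bookkeeping: the three terminal conditions and the reward R = −∫_0^T F_z v_z dt approximated from per-timestep logs; the sampling period DT and the log arrays are illustrative assumptions, not values from the paper.

```python
# Episode termination and reward computation for the cutting MDP.
import numpy as np

DT = 0.01  # control period in seconds (hypothetical)

def episode_done(t, x):
    """Terminal conditions: timeout, knife moved away, or board reached."""
    return t > 30.0 or x > 0.20 or x <= 0.0

def episode_reward(x_final, F_z, v_z):
    """F_z, v_z: per-timestep force and velocity logs for one episode."""
    if x_final > 0.20:  # the knife never touched the object
        return -100.0
    # R = -sum(F_z * v_z) * DT approximates -integral of F_z v_z dt
    return -np.sum(np.asarray(F_z) * np.asarray(v_z)) * DT
```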

We implement the force-based cutting controller as a Robot Operating System (ROS) [26] node. For each episode, a set of control parameters is generated by the CMA-ES module and the cutting process is launched with those parameters. After the execution, the reward is calculated and used to update the model for the next round of policy search.

We use the Baxter as the robot platform for our experiments. Its arm has seven degrees of freedom. The force command on the end effector is converted to torque commands on every actuator through inverse kinematics calculations. The state (position, velocity) of the end effector is published by the robot at each time step; this state is then treated as the state of the knife, assuming a rigid connection.

2) Results: After 130 episodes of execution, the reward converges to a high level compared to the initial episodes, as shown in Fig. 8. This trend can also be observed in the experiment, because a low reward corresponds to cutting with a force that is either too small or too large. At the beginning, we observed a high proportion of trials in which the robot either performed an intensive cut that hit hard on the cutting board, or moved the knife so slowly that it consumed a lot of time. There were also many trials in which the knife moved upward instead of downward. These cases occurred less often as the system evolved through the learning process. The four policy parameters also converge after the learning process. Fig. 9 shows that the variation in the values of the normalized parameters dampened as more cutting experience was acquired; from another perspective, this shows that the algorithm explores the possible solutions and converges to an optimized one.

Fig. 10 shows the force profile using the optimized parameters obtained after reinforcement learning. It is interesting to point out that the plateau region in the middle of the profile is learned as an energy-efficient way for


Fig. 8. The rewards (×100) of each episode throughout the process of learning the cucumber and banana cutting skills. If an episode finishes because x is larger than 20 cm, meaning the knife never touched the object, a reward of −100 is returned.

Fig. 9. The values of the normalized policy parameters µ1–µ4 for each episode throughout the process of learning the cucumber cutting skill.

penetrating the skin of the cucumber. The temporal duration is about 0.9 seconds, a fairly fast cut-through, close to how long it takes a human being with a proficient cutting skill. Overall, this shows that our system converges to a decent set of policy parameters for the robot cutting skill.

C. Cutting performance on objects with similar firmness

1) Setup: We trained another set of parameters for cutting the banana, which has a firmness level similar to the cucumber's. Then we used the parameters optimized for cutting cucumbers to cut a banana, and vice versa. The optimized parameters are the mean values of the parameters used in the last ten episodes of the learning process. Ten trials of cutting were performed for each case, and the result was recorded as the average of their rewards.

Fig. 10. Force profile (exerted force in the z-direction, in N, over time, in s) during one trial of cutting execution using the optimized policy parameters.

TABLE II
THE REWARD R OBTAINED FROM USING OPTIMIZED PARAMETERS θ* TRAINED ON CUCUMBER/BANANA TO CUT BANANA/CUCUMBER

              θ*_cucumber   θ*_banana
R_cucumber        -31          -43
R_banana          -89          -67

2) Results: The average reward is lower for cutting bananas than for cutting cucumbers, because the banana has a thicker skin than the cucumber and it costs more energy to cut through. As shown in Table II, the difference in the reward when cutting with the parameters optimized on the other object of similar firmness level is about 30%. This validates hypothesis (c): the firmness level of the object can be used to generalize the policy parameters of the cutting action to a novel object.

V. CONCLUSION AND FUTURE WORK

In this paper, we presented a cognitive learning framework that enables a humanoid robot to learn a more energy-efficient way of executing the cutting action based on the visual features of the object. We conducted an empirical test on the object firmness data-set and confirmed that the firmness of an object can be inferred from visual features extracted by the CNN-based visual module. A set of experiments on the humanoid robot platform validated our hypotheses that 1) using the reinforcement learning paradigm improves the cutting skill in terms of energy efficiency, and 2) the optimized control parameters can be applied to cut novel objects with a similar firmness level.

Human beings rely not only on their visual system but also on their tactile system to obtain feedback in manipulation tasks. We believe that deploying a tactile system on the humanoid robot's manipulator would help the acquisition and execution of force-oriented actions.


REFERENCES

[1] M. Beetz, U. Klank, I. Kresse, A. Maldonado, L. Mösenlechner, D. Pangercic, T. Rühr, and M. Tenorth, "Robotic roommates making pancakes," in Humanoid Robots (Humanoids), 2011 11th IEEE-RAS International Conference on. IEEE, 2011, pp. 529–536.

[2] M. Bollini, S. Tellex, T. Thompson, N. Roy, and D. Rus, "Interpreting and executing recipes with a cooking robot," in Experimental Robotics. Springer, 2013, pp. 481–495.

[3] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 647–655.

[4] V. Ferrari and A. Zisserman, "Learning visual attributes," in Advances in Neural Information Processing Systems, 2007, pp. 433–440.

[5] A. Gams, M. Do, A. Ude, T. Asfour, and R. Dillmann, "On-line periodic movement and force-profile learning for adaptation to new surfaces," in Humanoid Robots (Humanoids), 2010 10th IEEE-RAS International Conference on, Dec 2010, pp. 560–565.

[6] H. Grabner, J. Gall, and L. Van Gool, "What makes a chair a chair?" in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 1529–1536.

[7] N. Hansen and A. Ostermeier, "Completely derandomized self-adaptation in evolution strategies," Evolutionary Computation, vol. 9, no. 2, pp. 159–195, 2001.

[8] V. Heidrich-Meisner and C. Igel, "Evolution strategies for direct policy search," in Parallel Problem Solving from Nature–PPSN X. Springer, 2008, pp. 428–437.

[9] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.

[10] M. Kalakrishnan, L. Righetti, P. Pastor, and S. Schaal, "Learning force control policies for compliant manipulation," in Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on. IEEE, 2011, pp. 4639–4644.

[11] J. Kober, M. Gienger, and J. Steil, "Learning movement primitives for force interaction tasks," in Robotics and Automation (ICRA), 2015 IEEE International Conference on, May 2015, pp. 3192–3199.

[12] J. Kober and J. Peters, "Reinforcement learning in robotics: A survey," in Reinforcement Learning. Springer, 2012, pp. 579–610.

[13] J. Kober and J. R. Peters, "Policy search for motor primitives in robotics," in Advances in Neural Information Processing Systems, 2009, pp. 849–856.

[14] T. Kollar and N. Roy, "Trajectory optimization using reinforcement learning for map exploration," The International Journal of Robotics Research, vol. 27, no. 2, pp. 175–196, 2008.

[15] P. Kormushev, S. Calinon, and D. G. Caldwell, "Robot motor skill coordination with EM-based reinforcement learning," in Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on. IEEE, 2010, pp. 3232–3237.

[16] ——, "Imitation learning of positional and force skills demonstrated via kinesthetic teaching and haptic input," Advanced Robotics, vol. 25, no. 5, pp. 581–603, 2011.

[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[18] A. X. Lee, H. Lu, A. Gupta, S. Levine, and P. Abbeel, "Learning force-based manipulation of deformable objects from multiple demonstrations," in Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp. 177–184.

[19] I. Lenz, R. Knepper, and A. Saxena, "DeepMPC: Learning deep latent features for model predictive control," in RSS, 2015.

[20] R. Mao, Y. Yang, C. Fermüller, Y. Aloimonos, and J. S. Baras, "Learning hand movements from markerless demonstrations for humanoid tasks," in 2014 IEEE-RAS International Conference on Humanoid Robots. IEEE, 2014, pp. 938–943.

[21] L. Pais, K. Umezawa, Y. Nakamura, and A. Billard, "Learning robot skills through motion segmentation and constraints extraction," in HRI Workshop on Collaborative Manipulation, 2013.

[22] D. Parikh and K. Grauman, "Relative attributes," in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 503–510.

[23] P. Pastor, M. Kalakrishnan, L. Righetti, and S. Schaal, "Towards associative skill memories," in Humanoid Robots (Humanoids), 2012 12th IEEE-RAS International Conference on. IEEE, 2012, pp. 309–315.

[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., "Scikit-learn: Machine learning in Python," The Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[25] J. Peters and S. Schaal, "Reinforcement learning of motor skills with policy gradients," Neural Networks, vol. 21, no. 4, pp. 682–697, 2008.

[26] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, "ROS: an open-source Robot Operating System," in ICRA Workshop on Open Source Software, vol. 3, no. 3.2, 2009, p. 5.

[27] J. W. Roberts, L. Moret, J. Zhang, and R. Tedrake, "Motor learning at intermediate Reynolds number: experiments with policy gradient on the flapping flight of a rigid wing," in From Motor Learning to Interaction Learning in Robots. Springer, 2010, pp. 293–309.

[28] T. Rückstieß, M. Felder, and J. Schmidhuber, "State-dependent exploration for policy gradient methods," in Machine Learning and Knowledge Discovery in Databases. Springer, 2008, pp. 234–249.

[29] F. Stulp and O. Sigaud, "Policy improvement methods: Between black-box optimization and episodic reinforcement learning," 2012.

[30] J. Van Den Berg, S. Miller, K. Goldberg, and P. Abbeel, "Gravity-based robotic cloth folding," in Algorithmic Foundations of Robotics IX. Springer, 2010, pp. 409–424.

[31] A. Yamaguchi, C. G. Atkeson, and T. Ogasawara, "Pouring skills with planning and learning modeled from human demonstrations," International Journal of Humanoid Robotics, vol. 12, no. 03, p. 1550030, 2015.