Learning the Semantics of Manipulation Action

Yezhou Yang† and Yiannis Aloimonos† and Cornelia Fermuller† and Eren Erdal Aksoy‡
† UMIACS, University of Maryland, College Park, MD, USA
{yzyang, yiannis, fer}@umiacs.umd.edu
‡ Karlsruhe Institute of Technology, Karlsruhe, Germany

[email protected]

Abstract

In this paper we present a formal computational framework for modeling manipulation actions. The introduced formalism leads to semantics of manipulation action and has applications to both observing and understanding human manipulation actions as well as executing them with a robotic mechanism (e.g. a humanoid robot). It is based on a Combinatory Categorial Grammar. The goal of the introduced framework is to: (1) represent manipulation actions with both syntax and semantic parts, where the semantic part employs λ-calculus; (2) enable a probabilistic semantic parsing schema to learn the λ-calculus representation of manipulation action from an annotated action corpus of videos; (3) use (1) and (2) to develop a system that visually observes manipulation actions and understands their meaning while it can reason beyond observations using propositional logic and axiom schemata. The experiments conducted on a publicly available large manipulation action dataset validate the theoretical framework and our implementation.

1 Introduction

Autonomous robots will need to learn the actions that humans perform. They will need to recognize these actions when they see them and they will need to perform these actions themselves. This requires a formal system to represent the action semantics. This representation needs to store the semantic information about the actions, be encoded in a machine-readable language, and inherently be programmable in order to enable reasoning beyond observation. A formal representation of this kind has a variety of other applications such as intelligent manufacturing, human-robot collaboration, action planning and policy design, etc.

In this paper, we are concerned with manipulation actions, that is, actions performed by agents (humans or robots) on objects, resulting in some physical change of the object. Most current AI systems, however, require manually defined semantic rules. In this work, we propose a computational linguistics framework, which is based on probabilistic semantic parsing with Combinatory Categorial Grammar (CCG), to learn manipulation action semantics (lexicon entries) from annotations. We later show that this learned lexicon enables our system to reason about manipulation action goals beyond just observation. Thus the intelligent system can not only imitate human movements, but also imitate action goals.

Understanding actions by observation and executing them are generally considered as dual problems for intelligent agents. The sensori-motor bridge connecting the two tasks is essential, and a great amount of attention in AI, Robotics as well as Neurophysiology has been devoted to investigating it. Experiments conducted on primates have discovered that certain neurons, the so-called mirror neurons, fire during both observation and execution of identical manipulation tasks (Rizzolatti et al., 2001; Gazzola et al., 2007). This suggests that the same process is involved in both the observation and execution of actions. From a functionalist point of view, such a process should be able to first build up a semantic structure from observations, and then the decomposition of that same structure should occur when the intelligent agent executes commands.

Additionally, studies in linguistics (Steedman, 2002) suggest that the language faculty develops in humans as a direct adaptation of a more primitive apparatus for planning goal-directed action in the world by composing affordances of tools and consequences of actions. It is this more primitive apparatus that is our major interest in this paper. Such an apparatus is composed of a “syntax part” and a “semantic part”. In the syntax part, every linguistic element is categorized as either a function or a basic type, and is associated with a syntactic category which identifies it accordingly. In the semantic part, a semantic translation is attached, following the syntactic category explicitly.

Combinatory Categorial Grammar (CCG), introduced by (Steedman, 2000), is a theory that can be used to represent such structures with a small set of combinators such as functional application and type-raising. What do we gain, though, from such a formal description of action? This is similar to asking what one gains from a formal description of language as a generative system. Chomsky's contribution to language research was exactly this: the formal description of language through the formulation of the Generative and Transformational Grammar (Chomsky, 1957). It revolutionized language research, opening up new roads for the computational analysis of language and providing researchers with common, generative language structures and syntactic operations on which language analysis tools were built. A grammar for action would contribute a common framework of the syntax and semantics of action, so that basic tools for action understanding can be built, tools that researchers can use when developing action interpretation systems without having to start development from scratch. The same tools can be used by robots to execute actions.

In this paper, we propose an approach for learning the semantic meaning of manipulation actions through a probabilistic semantic parsing framework based on CCG theory. For example, we want to learn from an annotated training action corpus that the action “Cut” is a function which has two arguments: a subject and a patient. Also, the action consequence of “Cut” is a separation of the patient. Using a formal logic representation, our system will learn the semantic representation of “Cut”:

Cut := (AP\NP)/NP : λx.λy.cut(x, y) → divided(y)

Here cut(x, y) is a primitive function. We will further introduce the representation in Sec. 3. Since our action representation is in a common calculus form, it naturally enables further logical reasoning beyond visual observation.

The advantage of our approach is twofold: 1) learning semantic representations from annotations helps an intelligent agent to automatically enrich its own knowledge about actions; 2) the formal logic representation of the action can be used to infer the object-wise consequence after a certain manipulation, and can also be used to plan a set of actions to reach a certain action goal. We further validate our approach on a large publicly available manipulation action dataset (MANIAC) from (Aksoy et al., 2014), achieving promising experimental results. Moreover, we believe that our work, even though it only considers the domain of manipulation actions, is also a promising example of a more closely intertwined computer vision and computational linguistics system. The diagram in Fig. 1 depicts the framework of the system.

Figure 1: A CCG based semantic parsing framework for manipulation actions.

2 Related Works

Reasoning beyond appearance: The very small number of works in computer vision which aim to reason beyond appearance models are also related to this paper. (Xie et al., 2013) proposed that beyond state-of-the-art computer vision techniques, we could possibly infer implicit information (such as functional objects) from video, and they call them “Dark Matter” and “Dark Energy”. (Yang et al., 2013) used stochastic tracking and graph-cut based segmentation to infer manipulation consequences beyond appearance. (Joo et al., 2014) used a ranking SVM to predict the persuasive motivation (or the intention) of the photographer who captured an image. More recently, (Pirsiavash et al., 2014) seeks to infer the motivation of the person in the image by mining knowledge stored in a large corpus using natural language processing techniques. Different from these fairly general investigations about reasoning beyond appearance, our paper seeks to learn manipulation action semantics in logic forms through CCG, and further infer hidden action consequences beyond appearance through reasoning.

Action Recognition and Understanding: Human activity recognition and understanding has been studied heavily in Computer Vision recently, and there is a large range of applications for this work in areas like human-computer interaction, biometrics, and video surveillance. Both visual recognition methods and non-visual description methods using motion capture systems have been used. A few good surveys of the former can be found in (Moeslund et al., 2006) and (Turaga et al., 2008). Most of the focus has been on recognizing single human actions like walking, jumping, or running (Ben-Arie et al., 2002; Yilmaz and Shah, 2005). Approaches to more complex actions have employed parametric approaches, such as HMMs (Kale et al., 2004), to learn the transition between feature representations in individual frames, e.g. (Saisan et al., 2001; Chaudhry et al., 2009). More recently, (Aksoy et al., 2011; Aksoy et al., 2014) proposed a semantic event chain (SEC) representation to model and learn the semantic segment-wise relationship transition from spatial-temporal video segmentation.

There have also been many syntactic approaches to human activity recognition which used the concept of context-free grammars, because such grammars provide a sound theoretical basis for modeling structured processes. Tracing back to the middle 90's, (Brand, 1996) used a grammar to recognize disassembly tasks that contain hand manipulations. (Ryoo and Aggarwal, 2006) used the context-free grammar formalism to recognize composite human activities and multi-person interactions. It is a two-level hierarchical approach where the lower levels are composed of HMMs and Bayesian Networks while the higher-level interactions are modeled by CFGs. To deal with errors from low-level processes such as tracking, stochastic grammars such as stochastic CFGs were also used (Ivanov and Bobick, 2000; Moore and Essa, 2002). More recently, (Kuehne et al., 2014) proposed to model goal-directed human activities using Hidden Markov Models and treat sub-actions just like words in speech. These works proved that grammar based approaches are practical in activity recognition systems, and shed insight onto human manipulation action understanding. However, as mentioned, thinking about manipulation actions solely from the viewpoint of recognition has obvious limitations. In this work we adopt principles from CFG based activity recognition systems, with extensions to a CCG grammar that accommodates not only the hierarchical structure of human activity but also action semantics representations. It enables the system to serve as the core parsing engine for both manipulation action recognition and execution.

Manipulation Action Grammar: As mentioned before, (Chomsky, 1993) suggested that a minimalist generative grammar, similar to the one of human language, also exists for action understanding and execution. The works most closely related to this paper are (Pastra and Aloimonos, 2012; Summers-Stay et al., 2013; Guha et al., 2013). (Pastra and Aloimonos, 2012) first discussed a Chomskyan grammar for understanding complex actions as a theoretical concept, and (Summers-Stay et al., 2013) provided an implementation of such a grammar using as perceptual input only objects. More recently, (Yang et al., 2014) proposed a set of context-free grammar rules for manipulation action understanding, and (Yang et al., 2015) applied it on unconstrained instructional videos. However, these approaches only consider the syntactic structure of manipulation actions without coupling semantic rules using λ expressions, which limits the capability of reasoning and prediction.

Combinatory Categorial Grammar and Semantic Parsing: CCG based semantic parsing was originally used mainly to translate natural language sentences to their desired semantic representations as λ-calculus formulas (Zettlemoyer and Collins, 2005; Zettlemoyer and Collins, 2007). (Mooney, 2008) presented a framework of grounded language acquisition: the interpretation of language entities into semantically informed structures in the context of perception and actuation. The concept has been applied successfully in tasks such as robot navigation (Matuszek et al., 2011), forklift operation (Tellex et al., 2014) and human-robot interaction (Matuszek et al., 2014). In this work, instead of grounding natural language sentences directly, we ground information obtained from visual perception into semantically informed structures, specifically in the domain of manipulation actions.

3 A CCG Framework for Manipulation Actions

Before we dive into the semantic parsing of manipulation actions, a brief introduction to the Combinatory Categorial Grammar framework in Linguistics is necessary. We will only introduce related concepts and formalisms. For a complete background reading, we refer readers to (Steedman, 2000). We will first give a brief introduction to CCG and then introduce a fundamental combinator, i.e., functional application. The introduction is followed by examples to show how the combinator is applied to parse actions.

3.1 Manipulation Action Semantics

The semantic expression in our representation of manipulation actions uses a typed λ-calculus language. The formal system has two basic types: entities and functions. Entities in manipulation actions are Objects or Hands, and functions are the Actions. Our λ-calculus expressions are formed from the following items:

Constants: Constants can be either entities or functions. For example, Knife is an entity (i.e., it is of type N) and Cucumber is an entity too (i.e., it is of type N). Cut is an action function that maps entities to entities. When the event Knife Cut Cucumber happened, the expression cut(Knife, Cucumber) returns an entity of type AP, i.e., an Action Phrase. Constants like divided are status functions that map entities to truth values. The expression divided(cucumber) returns a true value after the event (Knife Cut Cucumber) happened.

Logical connectors: The λ-calculus expression has logical connectors like conjunction (∧), disjunction (∨), negation (¬) and implication (→).

For example, the expression

connected(tomato, cucumber) ∧ divided(tomato) ∧ divided(cucumber)

represents the joint status that the sliced tomato is merged with the sliced cucumber. It can be regarded as a simplified goal status for “making a cucumber tomato salad”. The expression ¬connected(spoon, bowl) represents the status after the spoon finished stirring the bowl.

λx.cut(x, cucumber) → divided(cucumber)

represents that if the cucumber is cut by x, then the status of the cucumber is divided.

λ expressions: Lambda expressions represent functions with unknown arguments. For example, λx.cut(knife, x) is a function from entities to entities, which yields an expression of type AP once applied to any entity of type N that is cut by the knife.
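To make the typed λ-calculus constants concrete, the following minimal Python sketch (our own illustration, not part of the paper) encodes entities as strings and the action and status constants as curried functions; the names mirror the examples above.

```python
# Entities (type N) are plain strings; action constants map entities to
# action phrases (AP), and status constants map entities to truth-valued atoms.
knife, cucumber = "knife", "cucumber"

cut = lambda x: lambda y: ("cut", x, y)        # λx.λy.cut(x, y), yields an AP
divided = lambda y: ("divided", y)             # divided(y), a status atom

# λx.cut(x, cucumber): a function still waiting for its effector argument.
cut_the_cucumber = lambda x: cut(x)(cucumber)

print(cut_the_cucumber(knife))                 # ('cut', 'knife', 'cucumber')
print(divided(cucumber))                       # ('divided', 'cucumber')
```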

3.2 Combinatory Categorial Grammar

The semantic parsing formalism underlying our framework for manipulation actions is that of Combinatory Categorial Grammar (CCG) (Steedman, 2000). A CCG specifies one or more logical forms for each element or combination of elements for manipulation actions. In our formalism, an element of Action is associated with a syntactic “category” which identifies it as a function, and specifies the type and directionality of its arguments and the type of its result. For example, the action “Cut” is a function from a patient object phrase (NP) on the right into predicates, and into functions from a subject object phrase (NP) on the left into a sub action phrase (AP):

Cut := (AP\NP)/NP.    (1)

As a matter of fact, the pure categorial grammar is a context-free grammar presented in the accepting, rather than the producing, direction. Expression (1) is just an accepting form for the action “Cut” following the context-free grammar. While it is now convenient to write derivations as follows, they are equivalent to conventional tree structure derivations such as the one in Figure 2.

Knife          Cut            Cucumber
  N                              N
  NP       (AP\NP)/NP            NP
           --------------------------- >
                   AP\NP
  ------------------------------- <
                 AP

Figure 2: Example of conventional tree structure.

The semantic type is encoded in these categories, and their translation can be made explicit in an expanded notation. Basically, a λ-calculus expression is attached to the syntactic category. A colon operator is used to separate the syntactic and semantic expressions, and the right side of the colon is assumed to have lower precedence than the left side. This is intuitive, as any explanation of manipulation actions should first obey syntactic rules, then semantic rules. Now the basic element, the action “Cut”, can be further represented by:

Cut := (AP\NP)/NP : λx.λy.cut(x, y) → divided(y).

(AP\NP)/NP denotes a phrase of type AP which requires an element of type NP to specify what object was cut, and requires another element of type NP to further complement what effector initiates the cut action. λx.λy.cut(x, y) is the λ-calculus representation for this function. Since the functions are closely related to the state update, → divided(y) further points out the status expression after the action was performed.
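As a purely illustrative sketch (the paper does not prescribe an implementation), such a lexical entry can be stored as a surface form paired with a category string and a curried semantic term; the `LexEntry` structure and field names below are our own.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class LexEntry:
    word: str        # surface form, e.g. "Cut"
    category: str    # CCG category, e.g. "(AP\\NP)/NP"
    semantics: Any   # curried lambda term standing in for λx.λy. ...

# Cut := (AP\NP)/NP : λx.λy.cut(x, y) → divided(y)
cut_entry = LexEntry(
    word="Cut",
    category="(AP\\NP)/NP",
    semantics=lambda x: lambda y: {"action": ("cut", x, y),
                                   "result": ("divided", y)},
)
```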

A CCG system has a set of combinatory rules which describe how adjacent syntactic categories in a string can be recursively combined. In the setting of manipulation actions, we want to point out that similar combinatory rules are also applicable. Especially the functional application rules are essential in our system.

3.3 Functional application

The functional application rules with semantics can be expressed in the following form:

A/B : f    B : g  =>  A : f(g)    (2)

B : g    A\B : f  =>  A : f(g)    (3)

Rule (2) says that a string with type A/B can be combined with a right-adjacent string of type B to form a new string of type A. At the same time, it also specifies how the semantics of the category A can be compositionally built out of the semantics for A/B and B. Rule (3) is a symmetric form of Rule (2).

In the domain of manipulation actions, the following derivation is an example CCG parse. This parse shows how the system can parse an observation (“Knife Cut Cucumber”) into a semantic representation (cut(knife, cucumber) → divided(cucumber)) using the functional application rules.

Knife        Cut                              Cucumber
  N                                              N
  NP         (AP\NP)/NP                          NP
knife        λx.λy.cut(x, y) → divided(y)       cucumber
             ----------------------------------------- >
                  AP\NP
                  λx.cut(x, cucumber) → divided(cucumber)
  ----------------------------------------------------- <
      AP
      cut(knife, cucumber) → divided(cucumber)
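The derivation above can be reproduced mechanically. Below is a small, self-contained Python sketch of the two application rules; the string-based category handling and the curried semantics are our own simplifications (note that the patient NP on the right is consumed first, as in the derivation).

```python
# Constituents are (category, semantics) pairs; categories are strings such
# as "NP", "AP\\NP", "(AP\\NP)/NP".  Naive string handling, enough for this example.

def forward_apply(left, right):
    """Rule (2):  A/B : f   B : g   =>   A : f(g)"""
    cat, sem = left
    if cat.endswith("/" + right[0]):
        return (cat[: -len("/" + right[0])].strip("()"), sem(right[1]))
    return None

def backward_apply(left, right):
    """Rule (3):  B : g   A\\B : f   =>   A : f(g)"""
    cat, sem = right
    if cat.endswith("\\" + left[0]):
        return (cat[: -len("\\" + left[0])].strip("()"), sem(left[1]))
    return None

knife = ("NP", "knife")
cucumber = ("NP", "cucumber")
# Cut := (AP\NP)/NP; the patient (right NP) is consumed first, then the effector.
cut = ("(AP\\NP)/NP",
       lambda patient: lambda subject: f"cut({subject}, {patient}) -> divided({patient})")

step1 = forward_apply(cut, cucumber)   # ("AP\\NP", λx.cut(x, cucumber) -> divided(cucumber))
step2 = backward_apply(knife, step1)   # ("AP", "cut(knife, cucumber) -> divided(cucumber)")
print(step2)
```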

4 Learning Model and Semantic Parsing

Having defined the formalism and the application rule, instead of manually writing down all the possible CCG representations for each entity, we would like to apply a learning technique to derive them from the paired training corpus. Here we adopt the learning model of (Zettlemoyer and Collins, 2005), and use it to assign weights to the semantic representations of actions. Since an action may have multiple possible syntactic and semantic representations assigned to it, we use the probabilistic model to assign weights to these representations.

4.1 Learning Approach

First we assume that complete syntactic parses of the observed action are available; in fact, a manipulation action can have several different parses. The parsing uses a probabilistic combinatory categorial grammar framework similar to the one given by (Zettlemoyer and Collins, 2007). We assume a probabilistic categorial grammar (PCCG) based on a log-linear model. M denotes a manipulation task, L denotes the semantic representation of the task, and T denotes its parse tree. The probability of a particular syntactic and semantic parse is given as:

P(L, T | M; Θ) = e^{f(L,T,M)·Θ} / Σ_{(L,T)} e^{f(L,T,M)·Θ}    (4)

where f is a mapping of the triple (L, T, M) to feature vectors in R^d, and Θ ∈ R^d represents the weights to be learned. Here we use only lexical features, where each feature counts the number of times a lexical entry is used in T. Parsing a manipulation task under PCCG equates to finding L such that P(L | M; Θ) is maximized:

argmax_L P(L | M; Θ) = argmax_L Σ_T P(L, T | M; Θ).    (5)

We use dynamic programming techniques to calculate the most probable parse for the manipulation task. In this paper, the implementation from (Baral et al., 2011) is adopted, where an inverse-λ technique is used to generalize new semantic representations. The generalization of lexicon rules is essential for our system to deal with unknown actions presented during the testing phase.
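As an illustration of Eqs. (4) and (5) (a toy sketch under our own assumptions, not the NL2KR implementation), each candidate parse tree is summarized by the multiset of lexical entries it uses, scored with a log-linear model, and the semantics L with the highest marginal probability is returned:

```python
import math
from collections import Counter

def parse_probabilities(candidates, theta):
    """Eq. (4): P(L, T | M; Θ) ∝ exp(f(L, T, M) · Θ), where f counts
    how often each lexical entry is used in the parse tree T."""
    scores = [math.exp(sum(theta.get(entry, 0.0) * count
                           for entry, count in Counter(tree).items()))
              for _, tree in candidates]
    z = sum(scores)
    return [s / z for s in scores]

def best_semantics(candidates, theta):
    """Eq. (5): marginalize over trees T and return the most probable L."""
    marginals = {}
    for (L, _), p in zip(candidates, parse_probabilities(candidates, theta)):
        marginals[L] = marginals.get(L, 0.0) + p
    return max(marginals, key=marginals.get)

# Two hypothetical parses of the observation "Knife Cut Cucumber":
candidates = [
    ("cut(knife, cucumber) -> divided(cucumber)",
     ["Knife := N", "Cucumber := N", "Cut := (AP\\NP)/NP"]),
    ("cut(cucumber, knife) -> divided(knife)",
     ["Knife := N", "Cucumber := N", "Cut_alt := (AP/NP)\\NP"]),
]
theta = {"Cut := (AP\\NP)/NP": 1.2, "Cut_alt := (AP/NP)\\NP": -0.3}
print(best_semantics(candidates, theta))
```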

5 Experiments

5.1 Manipulation Action (MANIAC) Dataset

(Aksoy et al., 2014) provides a manipulation action dataset with 8 different manipulation actions (cutting, chopping, stirring, putting, taking, hiding, uncovering, and pushing), each of which consists of 15 different versions performed by 5 different human actors¹. There are in total 30 different objects manipulated in all demonstrations. All manipulations were recorded with the Microsoft Kinect sensor and serve as training data here.

The MANIAC dataset contains another 20 long and complex chained manipulation sequences (e.g. “making a sandwich”) which consist of a total of 103 different versions of these 8 manipulation tasks performed in different orders with novel objects under different circumstances. These serve as testing data for our experiments.

(Aksoy et al., 2014; Aksoy and Worgotter, 2015) developed a semantic event chain based model-free decomposition approach. It is an unsupervised probabilistic method that measures the frequency of the changes in the spatial relations embedded in event chains, in order to extract the subject and patient visual segments. It also decomposes the long chained complex testing actions into their primitive action components according to the spatio-temporal relations of the manipulator. Since the visual recognition is not the core of this work, we omit the details here and refer the interested reader to (Aksoy et al., 2014; Aksoy and Worgotter, 2015). All these features make the MANIAC dataset a great testing bed for both the theoretical framework and the implemented system presented in this work.

5.2 Training Corpus

We first created a training corpus by annotating the 120 training clips from the MANIAC dataset,

¹ Dataset available for download at https://fortknox.physik3.gwdg.de/cns/index.php?page=maniac-dataset.

in the format of observed triplets (subject action patient) and a corresponding semantic representation of the action as well as its consequence. The semantic representations in λ-calculus format are given by human annotators after watching each action clip. A set of sample training pairs is given in Table 1 (one from each action category in the training set). Since every training clip contains one single full execution of each manipulation action considered, the training corpus thus has a total of 120 paired training samples.

Triplet                   Semantic representation
cleaver chopping carrot   chopping(cleaver, carrot) → divided(carrot)
spatula cutting pepper    cutting(spatula, pepper) → divided(pepper)
spoon stirring bucket     stirring(spoon, bucket)
cup take down bucket      take down(cup, bucket) → ¬connected(cup, bucket) ∧ moved(cup)
cup put on top bowl       put on top(cup, bowl) → on top(cup, bowl) ∧ moved(cup)
bucket hiding ball        hiding(bucket, ball) → contained(bucket, ball) ∧ moved(bucket)
hand pushing box          pushing(hand, box) → moved(box)
box uncover apple         uncover(box, apple) → appear(apple) ∧ moved(box)

Table 1: Example annotations from the training corpus, one per manipulation action category.

We also assume the system knows that every “object” involved in the corpus is an entity of its own type, for example:

Knife := N : knife

Bowl := N : bowl

......

Additionally, we assume the syntactic form of each “action” has a main type (AP\NP)/NP (see Sec. 3.2). These two sets of rules form the initial seed lexicon for learning.
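For concreteness, one annotated training pair and the seed lexicon assumed before learning could be organized as follows (field names and data layout are our own, not prescribed by the dataset):

```python
# One paired training sample: an observed triplet plus its annotated
# lambda-calculus semantics (cf. Table 1).
training_pair = {
    "triplet": ("cleaver", "chopping", "carrot"),
    "semantics": "chopping(cleaver, carrot) -> divided(carrot)",
}

# Seed lexicon: every object is an entity of its own type (N), and every
# action starts with the main type (AP\NP)/NP; the learner fills in the
# semantic (lambda) part of the action entries.
seed_lexicon = {
    "Knife": ("N", "knife"),
    "Bowl": ("N", "bowl"),
    "Chopping": ("(AP\\NP)/NP", None),   # semantics to be learned
}
```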

5.3 Learned Lexicon

We applied the learning technique mentioned in Sec. 4, using the NL2KR implementation from (Baral et al., 2011). The system learns and generalizes a set of lexicon entries (syntactic and semantic) for each action category from the training corpus, accompanied by a set of weights.

We list the entry with the largest weight for each action here:

Chopping := (AP\NP)/NP : λx.λy.chopping(x, y) → divided(y)
Cutting := (AP\NP)/NP : λx.λy.cutting(x, y) → divided(y)
Stirring := (AP\NP)/NP : λx.λy.stirring(x, y)
Take down := (AP\NP)/NP : λx.λy.take down(x, y) → ¬connected(x, y) ∧ moved(x)
Put on top := (AP\NP)/NP : λx.λy.put on top(x, y) → on top(x, y) ∧ moved(x)
Hiding := (AP\NP)/NP : λx.λy.hiding(x, y) → contained(x, y) ∧ moved(x)
Pushing := (AP\NP)/NP : λx.λy.pushing(x, y) → moved(y)
Uncover := (AP\NP)/NP : λx.λy.uncover(x, y) → appear(y) ∧ moved(x).

The seed lexicon and the learned lexicon entries are further used to probabilistically parse the detected triplet sequences from the 20 long manipulation activities in the testing set.

5.4 Deducing Semantics

Using the decomposition technique from (Aksoy et al., 2014; Aksoy and Worgotter, 2015), the reported system is able to detect a sequence of action triplets in the form of (Subject Action Patient) from each of the testing sequences in the MANIAC dataset. Briefly speaking, the event chain representation (Aksoy et al., 2011) of the observed long manipulation activity is first scanned to estimate the main manipulator, i.e. the hand, and the manipulated objects, e.g. knife, in the scene, without employing any visual feature-based object recognition method. Solely based on the interactions between the hand and the manipulated objects in the scene, the event chain is partitioned into chunks. These chunks are further fragmented into sub-units to detect parallel action streams. Each parsed Semantic Event Chain (SEC) chunk is then compared with the model SECs in the library to decide whether the current SEC sample belongs to one of the known manipulation models or represents a novel manipulation. SEC models, stored in the library, are learned in an on-line unsupervised fashion using the semantics of manipulations derived from a given set of training data, in order to create a large vocabulary of single atomic manipulations.

Across the different testing sequences, the number of triplets detected ranges from two to seven. In total, we are able to collect 90 testing detections, and they serve as the testing corpus. However, since many of the objects used in the testing data are not present in the training set, an object model-free approach is adopted and thus the “subject” and “patient” fields are filled with segment IDs instead of specific object names. Fig. 3 and 4 show several examples of the detected triplets accompanied by a set of key frames from the testing sequences. Nevertheless, the method we use here can 1) generalize the unknown segments into the category of object entities and 2) generalize the unknown actions (those that do not exist in the training corpus) into the category of action functions. This is done by automatically generalizing the following two types of lexicon entries using the inverse-λ technique from (Baral et al., 2011):

Object [ID] := N : object [ID]

Unknown := (AP\NP)/NP : λx.λy.unknown(x, y)
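One simple way to picture this generalization step (our own sketch; the actual system relies on the inverse-λ technique of (Baral et al., 2011)) is to fall back to the two template entries above whenever a segment ID or action label is missing from the learned lexicon:

```python
def lookup(word, lexicon):
    """Return a (category, semantics) entry for a detected token,
    instantiating the Object[ID] / Unknown templates for unseen words.
    Illustrative only; segment IDs are assumed to look like "Object_014"."""
    if word in lexicon:
        return lexicon[word]
    if word.startswith("Object_"):
        # unseen segment: an entity of its own type
        return ("N", word.lower())
    # unseen action label: default transitive action with a placeholder term
    return ("(AP\\NP)/NP",
            lambda patient: lambda subject: f"{word.lower()}({subject}, {patient})")
```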

Among the 90 detected triplets, using the learned lexicon we are able to parse all of them into semantic representations. Here we pick the representation with the highest probability after parsing as the individual action semantic representation. The “parsed semantics” rows of Fig. 3 and 4 show several example action semantics on testing sequences. Taking the fourth sub-action from Fig. 4 as an example, the visually detected triplet based on segmentation and spatial decomposition is (Object 014, Chopping, Object 011). After semantic parsing, the system predicts that divided(Object 011). The complete training corpus and parsed results of the testing set will be made publicly available for future research.

5.5 Reasoning Beyond Observations

As mentioned before, because of the use of λ-calculus for representing action semantics, the obtained data can naturally be used to do logical reasoning beyond observations. This by itself is a very interesting research topic and is beyond this paper's scope. However, by applying a couple of common sense Axioms on the testing data, we can provide some flavor of this idea.

Case study one: See the “final action consequence and reasoning” row of Fig. 3 for case one. Using propositional logic and an axiom schema, we can represent the common sense statement (“if an object x is contained in object y, and object z is on top of object y, then object z is on top of object x”) as follows:

Figure 3: System output on complex chained manipulation testing sequence one. The segmentation output and detected triplets are from (Aksoy and Worgotter, 2015).

Figure 4: System output on the 18th complex chained manipulation testing sequence. The segmentation output and detected triplets are from (Aksoy and Worgotter, 2015).

Axiom (1): ∃x, y, z, contained(y, x) ∧ on top(z, y) → on top(z, x).

Then it is trivial to deduce an additional final action consequence in this scenario: on top(object 007, object 009). This matches the fact that the yellow box which is put on top of the red bucket is also on top of the black ball.

Case study two: See the “final action consequence and reasoning” row of Fig. 4 for a more complicated case. Using propositional logic and axiom schemata, we can represent three common sense statements:

1) “if an object y is contained in object x, and object z is contained in object y, then object z is contained in object x”;

2) “if an object x is contained in object y, and object y is divided, then object x is divided”;

3) “if an object x is contained in object y, and object y is on top of object z, then object x is on top of object z”, as follows:

Axiom (2): ∃x, y, z, contained(y, x) ∧ contained(z, y) → contained(z, x).

Axiom (3): ∃x, y, contained(y, x) ∧ divided(y) → divided(x).

Axiom (4): ∃x, y, z, contained(y, x) ∧ on top(y, z) → on top(x, z).
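To illustrate how such axioms can be applied mechanically to the parsed consequences, here is a small forward-chaining sketch (our own encoding; the object names are stand-ins for the segment IDs of Fig. 4):

```python
def forward_chain(facts):
    """Forward-chain Axioms (1)-(4) over ground facts until a fixpoint is reached.
    Encoding (ours, for illustration): ("contained", a, b) = a contains b,
    ("on_top", a, b) = a is on top of b, ("divided", a) = a is divided."""
    facts = set(facts)
    while True:
        new = set()
        cont = [(f[1], f[2]) for f in facts if f[0] == "contained"]
        for y, x in cont:                                  # y contains x
            for f in facts:
                if f[0] == "on_top" and f[2] == y:
                    new.add(("on_top", f[1], x))           # Axiom (1)
                if f == ("divided", y):
                    new.add(("divided", x))                # Axiom (3)
                if f[0] == "on_top" and f[1] == y:
                    new.add(("on_top", x, f[2]))           # Axiom (4)
            for z, w in cont:
                if w == y:                                 # z contains y
                    new.add(("contained", z, x))           # Axiom (2)
        if new <= facts:
            return facts
        facts |= new

# Hypothetical facts mirroring the sandwich sequence of Fig. 4:
observed = {("contained", "ham", "bread"), ("contained", "ham", "cheese"),
            ("divided", "ham"), ("on_top", "ham", "plate")}
for fact in sorted(forward_chain(observed) - observed):
    print(fact)   # divided(bread), divided(cheese), on_top(bread, plate), ...
```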

With these common sense Axioms, the system is able to deduce several additional final action consequences in this scenario:

divided(object 005) ∧ divided(object 010) ∧ on top(object 005, object 012) ∧ on top(object 010, object 012).

From Fig. 4, we can see that these additional consequences indeed match the facts: 1) the bread and cheese which are covered by the ham are also divided, even though from observation the system only detected the ham being cut; 2) the divided bread and cheese are also on top of the plate, even though from observation the system only detected the ham being put on top of the plate.

We applied the four Axioms on the 20 testing action sequences and deduced the “hidden” consequences from observation. To evaluate our system performance quantitatively, we first annotated all the final action consequences (both obvious and “hidden” ones) from the 20 testing sequences as ground-truth facts. In total there are 122 consequences annotated. Using perception only (Aksoy and Worgotter, 2015), due to decomposition errors (such as the ones shown in red font in Fig. 4), the system can detect 91 consequences correctly, yielding a 74% correct rate. After applying the four Axioms and reasoning, our system is able to detect 105 consequences correctly, yielding an 86% correct rate. Overall, this is a 15.4% relative improvement.

Here we want to mention a caveat: there are definitely other common sense Axioms that we are not able to address in the current implementation. However, from the case studies presented, we can see that using the presented formal framework, our system is able to reason about manipulation action goals instead of just observing what is happening visually. This capability is essential for intelligent agents to imitate action goals from observation.

6 Conclusion and Future Work

In this paper we presented a formal computational framework for modeling manipulation actions based on a Combinatory Categorial Grammar. An empirical study on a large manipulation action dataset validates that 1) with the introduced formalism, a learning system can be devised to deduce the semantic meaning of manipulation actions in λ-schema; 2) with the learned schema and several common sense Axioms, our system is able to reason beyond just observation and deduce “hidden” action consequences, yielding a decent performance improvement.

Due to the limitations of the current testing scenarios, we conducted experiments considering only a relatively small set of seed lexicon rules and logical expressions. Nevertheless, we want to mention that the presented CCG framework can also be extended to learn the formal logic representation of more complex manipulation action semantics. For example, the temporal order of manipulation actions can be modeled by considering a seed rule such as AP\AP : λf.λg.before(f(·), g(·)), where before(·, ·) is a temporal predicate. For actions in this paper we consider the seed main type (AP\NP)/NP. For more general manipulation scenarios, based on whether the action is transitive or intransitive, the main types of action can be extended to include AP\NP.

Moreover, the logical expressions can also be extended to include universal quantification ∀ and existential quantification ∃. Thus, a manipulation action such as “knife cut every tomato” can be parsed into a representation as ∀x.tomato(x) ∧ cut(knife, x) → divided(x) (the parse is given in the following chart). Here, the concept “every” has a main type of NP\NP and the semantic meaning ∀x.f(x). The same framework can also be extended to include other combinatory rules such as composition and type-raising (Steedman, 2002). These are parts of future work along the lines of the presented work.

Knife      Cut                              every        Tomato
  N                                                         N
  NP       (AP\NP)/NP                       NP\NP           NP
knife      λx.λy.cut(x, y) → divided(y)     ∀x.f(x)         tomato
                                            ------------------- >
                                               NP
                                               ∀x.tomato(x)
           --------------------------------------------------- >
               AP\NP
               ∀y.λx.tomato(y) ∧ cut(x, y) → divided(y)
  ------------------------------------------------------------- <
      AP
      ∀y.tomato(y) ∧ cut(knife, y) → divided(y)

The presented computational linguistic framework enables an intelligent agent to predict and reason about action goals from observation, and thus has many potential applications such as human intention prediction, robot action policy planning, human robot collaboration, etc. We believe that our formalism of manipulation actions bridges computational linguistics, vision and robotics, and opens further research in Artificial Intelligence and Robotics. As the robotics industry is moving towards robots that function safely, effectively and autonomously to perform tasks in real-world unstructured environments, they will need to be able to understand the meaning of actions and acquire human-like common-sense reasoning capabilities.

7 Acknowledgements

This research was funded in part by the support of the European Union under the Cognitive Systems program (project POETICON++), the National Science Foundation under INSPIRE grant SMA 1248056, and by DARPA through U.S. Army grant W911NF-14-1-0384 under the project: Shared Perception, Cognition and Reasoning for Autonomy.

References

E. E. Aksoy and F. Worgotter. 2015. Semantic decomposition and recognition of long and complex manipulation action sequences. International Journal of Computer Vision, page Under Review.

E. E. Aksoy, A. Abramov, J. Dorr, K. Ning, B. Dellen, and F. Worgotter. 2011. Learning the semantics of object–action relations by observation. The International Journal of Robotics Research, 30(10):1229–1249.

E. E. Aksoy, M. Tamosiunaite, and F. Worgotter. 2014. Model-free incremental learning of the semantics of manipulation actions. Robotics and Autonomous Systems, pages 1–42.

Chitta Baral, Juraj Dzifcak, Marcos Alvarez Gonzalez, and Jiayu Zhou. 2011. Using inverse λ and generalization to translate English to formal languages. In Proceedings of the Ninth International Conference on Computational Semantics, pages 35–44. Association for Computational Linguistics.

Jezekiel Ben-Arie, Zhiqian Wang, Purvin Pandit, and Shyamsundar Rajaram. 2002. Human activity recognition using multidimensional indexing. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(8):1091–1104.

Matthew Brand. 1996. Understanding manipulation in video. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, pages 94–99, Killington, VT. IEEE.

R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal. 2009. Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In Proceedings of the 2009 IEEE International Conference on Computer Vision and Pattern Recognition, pages 1932–1939, Miami, FL. IEEE.

N. Chomsky. 1957. Syntactic Structures. Mouton de Gruyter.

Noam Chomsky. 1993. Lectures on government and binding: The Pisa lectures. Walter de Gruyter.

V. Gazzola, G. Rizzolatti, B. Wicker, and C. Keysers. 2007. The anthropomorphic brain: the mirror neuron system responds to human and robotic actions. Neuroimage, 35(4):1674–1684.

Anupam Guha, Yezhou Yang, Cornelia Fermuller, and Yiannis Aloimonos. 2013. Minimalist plans for interpreting manipulation actions. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5908–5914.

Yuri A. Ivanov and Aaron F. Bobick. 2000. Recognition of visual activities and interactions by stochastic parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):852–872.

Jungseock Joo, Weixin Li, Francis F. Steen, and Song-Chun Zhu. 2014. Visual persuasion: Inferring communicative intents of images. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 216–223. IEEE.

A. Kale, A. Sundaresan, A. N. Rajagopalan, N. P. Cuntoor, A. K. Roy-Chowdhury, V. Kruger, and R. Chellappa. 2004. Identification of humans using gait. IEEE Transactions on Image Processing, 13(9):1163–1173.

Hilde Kuehne, Ali Arslan, and Thomas Serre. 2014. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 780–787. IEEE.

Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2011. A joint model of language and perception for grounded attribute learning. In International Conference on Machine Learning (ICML).

Cynthia Matuszek, Liefeng Bo, Luke Zettlemoyer, and Dieter Fox. 2014. Learning from unscripted deictic gesture and language for human-robot interactions. In Twenty-Eighth AAAI Conference on Artificial Intelligence.

T. B. Moeslund, A. Hilton, and V. Kruger. 2006. A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 104(2):90–126.

Raymond J. Mooney. 2008. Learning to connect language and perception. In AAAI, pages 1598–1601.

Darnell Moore and Irfan Essa. 2002. Recognizing multitasked activities from video using stochastic context-free grammar. In Proceedings of the National Conference on Artificial Intelligence, pages 770–776, Menlo Park, CA. AAAI.

K. Pastra and Y. Aloimonos. 2012. The minimalist grammar of action. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1585):103–117.

Hamed Pirsiavash, Carl Vondrick, and Antonio Torralba. 2014. Inferring the why in images. arXiv preprint arXiv:1406.5472.

Giacomo Rizzolatti, Leonardo Fogassi, and Vittorio Gallese. 2001. Neurophysiological mechanisms underlying the understanding and imitation of action. Nature Reviews Neuroscience, 2(9):661–670.

Michael S. Ryoo and Jake K. Aggarwal. 2006. Recognition of composite human activities through context-free grammar based representation. In Proceedings of the 2006 IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 1709–1718, New York City, NY. IEEE.

P. Saisan, G. Doretto, Y. N. Wu, and S. Soatto. 2001. Dynamic texture recognition. In Proceedings of the 2001 IEEE International Conference on Computer Vision and Pattern Recognition, volume 2, pages 58–63, Kauai, HI. IEEE.

Mark Steedman. 2000. The Syntactic Process, volume 35. MIT Press.

Mark Steedman. 2002. Plans, affordances, and combinatory grammar. Linguistics and Philosophy, 25(5-6):723–753.

D. Summers-Stay, C. L. Teo, Y. Yang, C. Fermuller, and Y. Aloimonos. 2013. Using a minimal action grammar for activity understanding in the real world. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4104–4111, Vilamoura, Portugal. IEEE.

Stefanie Tellex, Pratiksha Thaker, Joshua Joseph, and Nicholas Roy. 2014. Learning perceptually grounded word meanings from unaligned parallel data. Machine Learning, 94(2):151–167.

P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea. 2008. Machine recognition of human activities: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 18(11):1473–1488.

Dan Xie, Sinisa Todorovic, and Song-Chun Zhu. 2013. Inferring “dark matter” and “dark energy” from videos. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 2224–2231. IEEE.

Yezhou Yang, Cornelia Fermuller, and Yiannis Aloimonos. 2013. Detection of manipulation action consequences (MAC). In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 2563–2570, Portland, OR. IEEE.

Y. Yang, A. Guha, C. Fermuller, and Y. Aloimonos. 2014. A cognitive system for understanding human manipulation actions. Advances in Cognitive Systems, 3:67–86.

Yezhou Yang, Yi Li, Cornelia Fermuller, and Yiannis Aloimonos. 2015. Robot learning manipulation action plans by “watching” unconstrained videos from the world wide web. In The Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15).

A. Yilmaz and M. Shah. 2005. Actions sketch: A novel action representation. In Proceedings of the 2005 IEEE International Conference on Computer Vision and Pattern Recognition, volume 1, pages 984–989, San Diego, CA. IEEE.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI.

Luke S. Zettlemoyer and Michael Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In EMNLP-CoNLL, pages 678–687.