
Guiding Interaction Behaviors for Multi-modal Grounded Language Learning

Jesse Thomason, Jivko Sinapov, Raymond J. Mooney
University of Texas at Austin


Multi-Modal Grounded Linguistic Semantics

Robots need to be able to connect language to their environment in order to discuss real-world objects with humans. Grounded language learning connects a robot’s sensory perception to natural language predicates. We consider a corpus of objects explored by a robot in previous work (Sinapov, IJCAI 2016).

The robot explored each object using the behaviors grasp, lift, lower, drop, press, and push; hold and look behaviors were also performed for each object.

• Objects are sparsely annotated with language predicates from an interactive “I Spy” game with humans (Thomason, IJCAI 2016).

• Many object examples are necessary to ground language in multi-modal space.
• We explore additional sources of information from users and corpora:
  – Behavior annotations: “What behavior would you use to understand ‘red’?”
  – Modality annotations: “How do you experience ‘heaviness’?”
  – Word embeddings: exploiting that “thin” is close to “narrow.”

Methodology

The decision d(p, o) ∈ [−1, 1] for predicate p and object o is

    d(p, o) = ∑_{c ∈ C} κ_{p,c} G_{p,c}(o),    (1)

where G_{p,c} is a linear SVM trained on labeled objects for p in the feature space of context c, and κ_{p,c} is its Cohen’s κ agreement with human labels during cross-validation.
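As a concrete illustration of Eq. 1, here is a minimal Python sketch of the κ-weighted combination. The container names (svms, kappas, obj_features) and the clipping/normalization that keep the result in [−1, 1] are assumptions for illustration, not the exact implementation.

```python
import numpy as np

def decision(predicate, obj_features, svms, kappas, contexts):
    """Kappa-weighted combination of per-context SVM decisions (Eq. 1).

    Assumed (hypothetical) containers:
      svms[(predicate, c)]   -- a trained linear SVM for `predicate` in the
                                feature space of context c (e.g. sklearn LinearSVC)
      kappas[(predicate, c)] -- Cohen's kappa of that SVM against human labels,
                                estimated during cross-validation
      obj_features[c]        -- the object's feature vector in context c
    """
    num, den = 0.0, 0.0
    for c in contexts:
        kappa = max(0.0, kappas[(predicate, c)])  # ignore unreliable contexts
        if kappa == 0.0:
            continue
        x = np.asarray(obj_features[c]).reshape(1, -1)
        g = np.sign(svms[(predicate, c)].decision_function(x)[0])  # +1 / -1
        num += kappa * g
        den += kappa
    return num / den if den > 0.0 else 0.0  # normalized into [-1, 1]
```

Dividing by the total κ mass is one way to keep d(p, o) inside [−1, 1]; the poster’s equation leaves that detail implicit.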

[Figure: For the predicate “round”, the grid of behavior–modality contexts (grasp, lift, lower, look × audio, color, haptics, shape) is shown weighted by the estimated context reliability (κ) alone and again after adding human behavior annotations (a yes/no judgment per behavior).]

Data from Human Interactions

Behavior Annotations. We gather relevant behaviors from human annotators for each predicate, allowing us to augment classifier confidences with human suggestions. The predicate “heavy” favors interaction behaviors like lifting and holding.
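One way this augmentation could look in code (a sketch under assumed data structures, not necessarily the scheme used here): upweight the κ of every context whose behavior annotators marked as relevant for the predicate, then recombine as in Eq. 1. The multiplicative boost factor is an illustrative assumption.

```python
def behavior_adjusted_kappas(predicate, kappas, contexts, relevant_behaviors, boost=2.0):
    """Upweight contexts whose behavior humans marked relevant for `predicate`.

    contexts                      -- iterable of (behavior, modality) pairs
    relevant_behaviors[predicate] -- set of annotated behaviors, e.g. {"lift", "hold"}
                                     for "heavy" (hypothetical format)
    boost                         -- illustrative multiplier, not a value from the paper
    """
    adjusted = dict(kappas)
    annotated = relevant_behaviors.get(predicate, set())
    for behavior, modality in contexts:
        if behavior in annotated:
            adjusted[(predicate, (behavior, modality))] *= boost
    return adjusted
```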

Modality Annotations. We also derive modality annotations from past work (Lynott, 2009), allowing us to augment classifier confidences with human intuitions about which perceptual modalities are relevant to each predicate. The predicate “white” favors vision-related modalities.
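The modality cue can be folded in analogously; this sketch mirrors the behavior version but keys on the modality component of each context (again, the boost is an illustrative assumption).

```python
def modality_adjusted_kappas(predicate, kappas, contexts, relevant_modalities, boost=2.0):
    """Upweight contexts whose modality humans associate with `predicate`,
    e.g. {"color", "shape"} for "white" (hypothetical annotation format)."""
    adjusted = dict(kappas)
    annotated = relevant_modalities.get(predicate, set())
    for behavior, modality in contexts:
        if modality in annotated:
            adjusted[(predicate, (behavior, modality))] *= boost
    return adjusted
```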

Word Embeddings. We calculate positive cosine similarity between predicate word embeddings and augment each predicate’s classifier confidences with similarity-weighted averages of its neighboring predicates’ confidences. A predicate like “narrow” with few object examples can borrow information from a predicate like “thin” that is close in word-embedding space and has many examples.
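A minimal sketch of the sharing step, assuming precomputed word vectors for each predicate and using positive cosine similarity as the weight; the neighbor set (all other predicates) and how the shared score is mixed with the predicate’s own confidence are illustrative assumptions.

```python
import numpy as np

def embedding_shared_confidence(predicate, obj, predicates, embeddings, confidence):
    """Similarity-weighted average of neighboring predicates' confidences.

    embeddings[p]      -- word vector for predicate p (assumed precomputed,
                          e.g. from an off-the-shelf embedding model)
    confidence(p, obj) -- a predicate's own confidence in the object,
                          e.g. the kappa-weighted decision d(p, o) above
    Only positive cosine similarities contribute (an assumption here).
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    num, den = 0.0, 0.0
    for q in predicates:
        if q == predicate:
            continue
        sim = max(0.0, cos(embeddings[predicate], embeddings[q]))
        if sim > 0.0:
            num += sim * confidence(q, obj)
            den += sim
    return num / den if den > 0.0 else 0.0
```

A sparsely labeled predicate like “narrow” can then blend this shared score with its own d(p, o), letting well-trained neighbors such as “thin” fill in.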

Experiment and Results

         precision   recall   f1
  mc       .282       .355   .311
  κ        .406       .460   .422
  B+κ      .489       .489   .465
  M+κ      .414       .466   .430
  W+κ      .373       .474   .412

(B = behavior annotations, M = modality annotations, W = word embeddings, each combined with the κ confidences.)

• Adding behavior annotations or modality annotations improves performance over using kappa confidence alone.
• Behavior annotations helped the f-measure of predicates like “pink”, “green”, and “half-full”.
• Modality annotations helped with predicates like “round”, “white”, and “empty”.
• Sharing kappa confidences across similar predicates based on their embedding cosine similarity improves recall at the cost of precision.
• For example, “round” improved, but at the expense of domain-specific meanings of predicates like “water”.

[email protected]  https://www.cs.utexas.edu/~jesse/