
Robotic Vocabulary Building Using Extension Inference and Implicit Contrast

Kevin Gold, Marek Doniec, Christopher Crick, and Brian Scassellati

Department of Computer Science, Yale University, New Haven, CT, USA

Abstract

TWIG (“Transportable Word Intension Generator”) is a system that allows a robot to learn compositional meanings for new words that are grounded in its sensory capabilities. The system is novel in its use of logical semantics to infer which entities in the environment are the referents (extensions) of unfamiliar words; its ability to learn the meanings of deictic (“I,” “this”) pronouns in a real sensory environment; its use of decision trees to implicitly contrast new word definitions with existing ones, thereby creating more complex definitions than if each word were treated as a separate learning problem; and its ability to use words learned in an unsupervised manner in complete grammatical sentences for production, comprehension, or referent inference. In an experiment with a physically embodied robot, TWIG learns grounded meanings for the words “I” and “you,” learns that “this” and “that” refer to objects of varying proximity, that “he” is someone talked about in the third person, and that “above” and “below” refer to height differences between objects. Follow-up experiments demonstrate the system’s ability to learn different conjugations of “to be”; show that removing either the extension inference or implicit contrast components of the system results in worse definitions; and demonstrate how decision trees can be used to model shifts in meaning based on context in the case of color words.

Key words: Word learning, context, developmental robotics, robot, decision trees, symbol grounding, TWIG, pronouns, autonomous mental development, inference.

Email address: [email protected], [email protected], [email protected], [email protected] (Kevin Gold, Marek Doniec, Christopher Crick, and Brian Scassellati).

Preprint submitted to Elsevier 15 September 2008


1 Introduction

Robots that could understand arbitrary sentences in natural language would be incredibly useful – but real vocabularies are huge. American high school graduates have learned, on average, about 60,000 words [29]. Programming a robot to be able to understand each of these words in terms of its own sensors and experience would be a monumental undertaking. To recognize the words in the audio stream is one thing; to recognize their referents in the real world, quite another. The messy, noisy nature of sensory input can render this difficult problem intractable for the programmer. It would be far more useful to have a robot that is able to learn the meanings of new words, grounded in its own perceptual and conceptual capabilities, and ideally with a minimum of user intervention for training and feedback. The present paper describes a system that is built to do exactly that.

According to the developmental psychology literature, children can learn new words rapidly [7] and in a largely unsupervised manner [9]. From the time that word learning begins around the age of one, children’s word learning accelerates from a rate of about one word every 3 days for the first 4 months [14], to 3.6 words a day between the ages of 2½ and 6 [14,1,4], to 12 words a day between the ages of 8 and 10 [1]. This learning would impose quite a demand on the time of parents and educators if children had to be taught all these words explicitly, but instruction does not appear to be necessary for learning language [27,4], and in fact there are communities in which children learn to speak with hardly any adult instruction at all [21]. Thus, in designing a robotic system built to learn the meanings of words quickly and without supervision, it makes sense to study the heuristics that children appear to use in learning the meanings of words.

One strategy that children use is to infer what object or person a new word is referring to from its grammatical context – realizing that “sibbing” must refer to a verb, but “a sib” must refer to a noun [6]. Another strategy children use is to contrast the new word with other words they know in order to narrow down its meaning without explicit feedback [7,26,9]. For instance, a child may believe at first that “dog” refers to all four-legged animals, but on learning the word “cat,” the child realizes that implicitly, the meaning of “dog” does not include cats, and so the definition of “dog” is narrowed. This paper describes a word learning system called TWIG that employs both of these strategies, using sentence context and implicit contrast to learn the meanings of words from sensory information in the absence of feedback.

Until now, research into learning the semantics of words in an unsupervised fashion has fallen primarily into two camps. On the one hand, there have been attempts to learn word semantics in simulation, with the world represented as collections of atomic symbols [12] or statements in predicate logic [41]. These simulations generally assume that the robot already possesses concepts for each of the words to be learned, and thus word learning becomes a simple problem of matching words to concepts. Such systems can deal with arbitrarily abstract concepts, because they never have to identify them “in the wild.” On the other hand, systems that have dealt with real sensor data have generally assumed that concepts and even word boundaries in speech must be learned at the same time as words. Since learning abstract concepts directly from sensor data is difficult, these systems have focused on learning words for physical objects [37] and sometimes physical actions [48].

Fig. 1. The robot Nico, on which the TWIG word learning system was implemented.

The TWIG word-learning system is an attempt to bridge this divide. “TWIG” stands for “Transportable Word Intension Generator.” It is “transportable” in the sense that it does not rely too heavily on the particular sensory setup of our robot (Figure 1), but could be moved with ease to any robot that can express its sensory input in Prolog-like predicates that can include numeric values. (Note that it is the learning system, and not the definitions learned, that is transportable; each robot using TWIG will learn definitions specific to its own sensory capabilities.) A “word intension” is a function that, when applied to an object or relation, returns true if the word applies. In other words, an intension is a word meaning [8,13]. Though TWIG was originally designed to solve the conundrums posed by pronouns, which themselves straddle the boundary between the abstract and the concrete, its techniques are more generally applicable to other word categories, including verbs, prepositions, and nouns.

The first technique that TWIG introduces is extension inference, in which TWIG infers from sentence context what actual object the new word refers to. For example, if the system hears “Foo got the ball,” and it sees Bob holding the ball, it will assume that “Foo” refers to Bob; the entity in the world that is Bob is the extension of “Foo” in this example. Doing this allows the system to be picky about what facts it associates with a word, greatly reducing the impact of irrelevant information on learning. Grammatical context also informs the system as to whether it should be looking for facts about a single object or a relation between objects, as with prepositions and transitive verbs.

Second, to build word definitions, the system creates definition trees (see Section 4.3). Definition trees can be thought of as reconstructing the speaker’s decision process in choosing a word. The interior nodes are simple binary predicates, and the words are stored at the leaves. The meaning of a word can be reconstructed by following a path from the word back to the root; its definition is the conjunction of the true or negated predicates on this path. Alternately, the system can choose a word for a particular object or relation by starting at the root and following the branches that the object satisfies until arriving at a word.

Definition trees are a type of decision tree, and are created in a manner closely following the ID3 algorithm [32] – but treating word learning as a problem of word decision, with words defined in terms of the choices that lead to them, is an approach novel to TWIG. Previous robotic systems have typically assumed that word concepts can overlap with each other arbitrarily [37,48]. In assuming that words do not overlap, TWIG actually allows greater generalization from few examples, because the boundaries of each concept can extend all the way up to the “borders” of the other words.

Though we described an earlier version of the TWIG system in a previous conference paper [17], that version of TWIG did not make use of the definition tree method, which we developed later [15]. (Definition trees were originally simply called “word trees,” but this led to some confusion with parse trees.) The latest version of TWIG includes several enhancements that were not present in any previously reported results, including part-of-speech inference, online vocabulary updates that allow new words to be used immediately, and better handling of negated predicates. This paper also presents a new quantitative evaluation of the two halves of the system, and some new experiments that demonstrate the system’s flexibility in building on its previous vocabulary and learning multiple definitions for the same word.

2 Problems Addressed by TWIG

The TWIG system is best understood in the context of the problems it has been built to address. Word learning might at first seem to be a problem of simply associating sights and sounds, but as we shall see, language is more subtle than this, and many ostensibly reasonable approaches cannot succeed in the long run.

2.1 Finding the Referent

One of the reasons that real children are so adept at learning language is that they appear to be able to learn language from conversations in which they are not themselves involved [27]. Unfortunately, this means that many new words may not be clarified with pointing gestures or gaze direction. How, then, is the robot to know what objects to attend to when it hears a new word?

It is tempting to build a robot that simply associates every sound sequence heard with everything in its environment, hoping that statistically, words will appear with their referents more often than not. However, there are logical problems with this approach [4]. One of the most critical is that some words refer to properties that are almost always present somewhere in the environment, making the word uninformative as to the presence or absence of the property. For example, “I” refers to the speaker – but the word “I” is not informative as to whether someone in the environment is speaking, because the fact that any words at all are being spoken implies this. As another example, it will almost always be the case that one thing in the environment is “above” another – so the presence or absence of “aboveness” somewhere in the environment does not vary much with the presence or absence of the word.

Previous word learning systems have largely ignored the problem of finding the referent, either treating it as already solved by a black box [34,2,37] or attempting to associate words with everything in the environment [12]. Chen Yu’s eyetracking system was a notable exception, as it used gaze to determine reference [48]. However, gaze direction can still be ambiguous, particularly for words that describe relations. TWIG uses sentence context to determine reference, as described below.

2.2 Using Language to Learn More Language

Once some lexicon is known, it should become easier to learn other words. If a speaker of Babelese points to a rabbit eating a carrot and says, “Gavagai foo labra,” the meaning is highly unclear; but if we know that “gavagai” means rabbit and “labra” means carrot, then “foo” is quite likely to mean “eat.” Going still further, if we know that the most common word for “eat” is actually “barra” and not “foo,” we might look for differences between what the rabbit is doing to the carrot now and how it looks to “barra” a carrot; perhaps “foo” actually means to eat quickly, or simply to enjoy.¹

Previous systems have employed a weak form of this kind of inference by treating the meaning M of a sentence as an unordered set of symbols and finding a mapping from words to subsets of symbols such that the union of symbol subsets covers M [41]. This means that if one word in a sentence is known to mean “rabbit,” it is unlikely that other words in the same sentence will be assigned to “rabbit” as well. This is one way that increased vocabulary size can help learn new words, and it has been successfully employed in the past [12,41,48]. Still, treating sentences and their meanings as unordered sets tends to lose some information; for instance, there is no way to learn the difference between “above” and “below” without taking word order into account, since the order in which the nouns are spoken is the only difference between these prepositions.

The second way in which learning words can become easier as vocabulary size increases is that new words can be contrasted with known words for differences. This is a fairly novel approach for TWIG, and one of its great strengths; previous systems have generally assumed that words can overlap arbitrarily in meaning, and that they therefore do not inform each other’s boundaries in concept space [12,37,48,34]. If one assumes (as many writers do) that no two words mean quite the same thing, then the learner can assume that contrasts between words are informative. In fact, young children appear to follow this heuristic when learning new words, and will reject words that are technically accurate if more precise words are available [26]. Thus, existing definitions are a resource that can and should be exploited when learning new definitions, regardless of whether the old words appear in the same sentence as the new words.

2.3 Negation With No Negative Examples

The word “he” means roughly “someone who is male, and is neither the addressee nor the speaker.” These last conditions may be difficult to learn from passive observation, however, because the observed speakers will typically only use “he” correctly, making the definition indistinguishable from “someone who is male.” While it is true that all the learner’s examples of “he” will satisfy “not the speaker,” there are many other negated predicates that will be coincidentally satisfied every time “he” is spoken – for example, the speaker is not on the moon, it is not the year 3000, the referent is not the King of the United States, and so on. Including the values of all of these predicates as part of the definition of “he” would be incredibly unwieldy, if not impossible in the limit. How is the learner to note the significance of “not speaker,” but discount the importance of “not the year 3000”?

¹ The example of “gavagai” referring to a rabbit in some way is a classic example from Quine [31], though Quine was making a rather different point.

This problem of knowing to ignore irrelevant facts is just as true for the case of unnegated facts, but is especially acute for negated facts because there are so many of them that hold at any given time. If there are n disjoint classifications for an object, then n − 1 will not apply. Obviously, one could give up on the unsupervised approach and introduce positive and negative feedback to get around this problem, but TWIG uses a more elegant solution: introduce negated clauses only when they are necessary to contrast with other word meanings. In this way, definitions are produced that are not only more accurate, but also more concise and more consistent with Grice’s maxims [18]. Again, this approach will be described in more detail below.

2.4 Abandoning Finite Concept Regions

When given only positive examples of a concept, it can seem reasonable to construct a concept hypothesis that consists of only a small, finite region of the sensory space that is centered about those examples. For example, if one’s examples of “hot” temperatures are all between 25°C and 30°C, an algorithm might construct a hypothesis that a day is “hot” iff it falls within this range, or perhaps within 23°C – 32°C just to allow for a margin of error. Another approach might assume that hot temperatures obey a normal distribution about 27.5°C. Assuming that concepts occupy only finite or highly concentrated regions of the concept space is common in word learning work that deals with real visual input [48,37].

The problem with these approaches is that many word concepts, such as “hot” and “far,” actually extend to infinity along their relevant dimensions. When confronted with a 40°C temperature, the algorithms of the previous example would all classify it as unlikely to be “hot,” because the new example is so distant from the old examples. Yet children do not have to encounter infinitely hot items to know that if one keeps increasing a temperature, it just keeps getting hotter.

TWIG addresses this problem by assuming that all known words form a partition of the concept space, and that the relevant decisions can all be treated as thresholds that divide the space. Thus, if TWIG had access to examples of “cold” ranging from 0°C to 5°C and “hot” ranging from 25°C to 30°C, it would assume (in the absence of other words) that all temperatures less than 0°C are “cold” and that all temperatures greater than 30°C are “hot.” Further details can be found under “Definition Trees,” below.
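To make this concrete, a learned one-dimensional partition amounts to a handful of threshold clauses. A minimal Prolog sketch follows; the 15°C split is invented for illustration, since TWIG would place the threshold wherever it is most informative between the observed examples:

% A hypothetical learned partition of the temperature axis: one threshold,
% two words. Because decisions are thresholds rather than finite regions,
% a 40 degree temperature is still classified as "hot."
temp_word(T, cold) :- T =< 15.0.
temp_word(T, hot) :- T > 15.0.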


2.5 Roles and Relations as Definitions

The definition of “you” can be stated as, “the person(s) whom the speaker is addressing.” No amount of analyzing a person in isolation can determine whether the person counts as a “you”; one must know whether the speaker is addressing the person. Any word learning algorithm that removes the referent entirely from its context would therefore fail to learn the meaning of “you.” The same is true for other deictic pronouns, or pronouns that encode a relationship to the speaker: “I,” “you,” and “this,” for example. The relations that are encoded in deictic words can vary from language to language; they can include height relative to the speaker (Daga, spoken in Papua New Guinea), whether motion is toward or away from the speaker (“come,” “go,” various motion morphemes in Somali), whether something is upriver or downriver (Yup’ik, spoken in Alaska), and even whether the speaker came to know something through hearsay or direct evidence (Makah, spoken in Washington State) [39].

TWIG allows relations to the speaker as possible clauses in word definitions, such as “is within 30 cm of the speaker” or “someone the speaker is looking at.” This may seem to be a solution particular to deictic pronouns, but taking the speaker’s perspective into account is probably more widely applicable as well. For instance, “good” means little in an absolute sense, but implies something about the speaker’s evaluation of an object; it implies a relation between the speaker and the referent.

Definitions involving relations between objects, such as those implied by prepositions and transitive verbs, are handled in a very similar way, and thus with little extra overhead.

2.6 Language Generation

Once new words are learned, it should be possible to combine them to form meaningful sentences. Most previous robotic work has focused on the pre-grammatical phase of word learning. The learned words could be picked out of the audio stream, and the best word could be found for a given referent, but this left something to be desired in terms of giving the robot the ability to communicate.

TWIG understands and generates sentences within a framework of formal semantics, using Prolog to understand and generate new sentences. The definitions it creates are predicate calculus expressions that are ready to be combined during parsing or generation into logical forms. While this strategy makes the system somewhat more brittle in terms of recognition, as utterances must match a recognized grammatical parse, it leaves the system in a position to generate correct full sentences about its environment, rather than simply classify referents with single-word utterances.


                        Siskind  de Marcken  Regier  Bailey  CELL  Yu    TWIG
Parts of speech         all      all         p       v       n     n,v   n,v,p,pn
Real sensors                                                 +     +     +
Deixis                                                             +     +
Numerical data                               +       +             +     +
Compositional meanings  +        +                                       +
Produces sentences                                                       +
Word segmentation                +                           +     +
Visual prototypes                            +               +     +

Table 1
A summary of the most influential semantics learning work to date, compared to TWIG: Siskind’s noise-resistant semantics learner [41], de Marcken’s word segmentation algorithm with extensions to semantics [12], Regier’s preposition-learning neural network [34], Bailey’s verb prototype learner [2], the CELL system [37], and Yu and Ballard’s eyetracking word learner [48]. Parts of speech: p=preposition, v=verb, n=noun, pn=pronoun.


2.7 Summary of Previous Systems

Table 1 summarizes some of the differences between TWIG and the most influential recent word learning systems. Though each of these systems is essentially attempting to model the same problem of how children learn the meanings of new words without feedback, they have varied greatly in the amount of abstraction they assume in the data and the kinds of words they attempt to model.

Systems using real sensors, such as Roy and Pentland’s CELL system [37], Steels and Kaplan’s Aibo word learner [42], and Yu and Ballard’s eyetracking word learner [48], have typically been able to learn definitions only for concrete nouns, and in Yu and Ballard’s case, some physical verbs. Real sensors typically do not provide the predicates necessary to characterize more abstract words. These systems’ word semantics have been largely non-compositional, and therefore difficult to integrate into a language generation system. On the other hand, these systems typically solved the problems of finding word boundaries [37,48] and creating concepts for visual classification [37,42,48], two problems that TWIG does not currently address.

Algorithms that attempt to learn word meanings from text alone, such as “Latent Semantic Analysis” [23], typically make use of the proximity of words in text to create a metric of similarity between words. While such methods have their place in text retrieval, methods that do not make use of extra-textual information are doomed to create definitions that are ultimately circular, and useless for the task of semantics grounded in sensors [35]. Such methods would face substantial hurdles in being adapted to real environments, since a physical environment is not very similar to a text corpus.

Systems built on predicate logic representations, such as Siskind’s noise-resistant word learner [41] and Kate and Mooney’s KRISPER [22], generally treat the word learning problem as finding a mapping from one set of symbols to another, and do not take into account the fact that real input would probably be vector or scalar in nature. They have generally assumed that the exact predicate or metalanguage symbol necessary to define a word already exists somewhere in the input. This allowed for a great degree of generality, but this generality is somewhat artificial, as all the burden of creating word concepts is shifted to the sensory system. Predicate logic-based word learners extend the farthest back in AI history, since they could be implemented in simulation with a great deal of abstraction; the earliest example is Terry Winograd’s SHRDLU system, which could ask for natural language definitions when it encountered unknown words [45]. Few simulations have modeled the impact of sensor noise and recognition error on the learning, though more have modeled referential uncertainty [41,22].

Systems that use more realistic simulated sensory input, such as Regier’s preposition-learning neural network [34] and Bailey’s verb learner [2], have typically focused on learning a single part of speech in a highly simplified environment that consists of the referents alone. Because these systems were more psychological models than practical implementations, they tended to make optimistic assumptions about the sensors (noise-free, discretely valued) and the environment (empty except for the referents).

TWIG employs a hybrid approach, working with predicates generated from real sensors that include some scalar parameters. Predicate logic allows TWIG to maintain a representation of the world that includes relations between objects, while the ability to handle scalar data and form conjunctions keeps it from placing unreasonable demands on the programmer of the sensory system in providing ready-made concepts. TWIG achieves this balance of linguistic expressivity and robustness to the vagaries of real input through a curious combination of old-fashioned intensional logic (the intension/extension distinction was implemented as far back as LUNAR [46]) and a more modern information-theoretic approach.

Though the idea of using trees to represent distinctions in meaning has been used before to model word semantics [19] and store word meanings [3], TWIG is the first system to learn tree structures for storing word semantics in an unsupervised fashion.


3 Robotic Platform

This section is about the particular robot on which TWIG was implemented, and the predicates it generated from its sensors. Nothing in this section is a part of TWIG proper, but it will be useful when reading the description of TWIG to have an idea of the kinds of input it is meant to process.

TWIG was implemented on Nico, a humanoid non-mobile robot that uses video cameras, dual-channel microphones, and the Cricket indoor location system [30] to sense the world (Figure 1). Nico is a general research platform, and has been used for experiments in learning to point [43], modeling infant looking-time experiments [25], testing algorithms for intention recognition [10], drumming to a conductor’s beat [11], and mirror self-recognition [16]. Thus, the robot’s sensors and design are not particularly specific to TWIG. Conversely, the TWIG system is flexible about what kinds of predicates it uses, and in theory it could work with any robotic system that can provide some kind of simple object representation.

On processing an utterance, TWIG queries the robot’s sensors to receive a predicate-based description of the robot’s world. The predicates used in our experiments, and their generation from Nico’s sensors, shall now be described in greater detail.

3.1 Vision: Faces and gaze direction

The robot’s visual system was used for finding faces and determining the directions they faced. A wide-angle CCD camera in Nico’s right eye grabbed 320 × 240 images at 30 frames per second. These frames were processed by two separate face detectors using the Viola and Jones object finding algorithm [44], as implemented in OpenCV [5]. One face detector was used to find profile faces; the other, faces looking directly at the robot. Output from these two face detectors was combined using the forward algorithm [38] to give an estimate of whether a person was more likely to be looking to the side or directly at the robot at a given time. The forward algorithm also functioned to reduce the number of erroneous face detections.

For each face detected in the visual scene, a new symbol perL or perR was added to the logical representation of the environment, corresponding to whether the person was detected on the left or right side of the visual field. (This representation was somewhat tailored to the experiment, but made the data more human-readable.) For each of these symbols, person(X) was added to the environment. In addition, lookingAt(X,Y) was true for any pair X, Y such that Y was in the half-space 30 cm away from X, on the other side of the plane to which X’s looking direction was perpendicular.

Fig. 2. The pig toy used in the experiment, attached to the Cricket sensor the robot used to find it.

The robot also used a symbol for itself, nico, and assumed person(nico). When someone else was speaking, lookingAt(nico,Y) was true for any Y in the robot’s visual field. During language generation, lookingAt(nico,Y) was set to the person whom the robot was addressing.

3.2 Sensor networks: Object localization and distance

To simplify the perceptual problem of finding objects and distances between them, the robot used the Cricket Indoor Location System [30]. Two objects, a tennis ball and a plush pig toy, were equipped with Cricket receivers (Figure 2), while the laboratory ceiling was equipped with 9 Cricket beacons arranged in a 3×3 grid, each roughly 1.5 m from its neighbors. The beacons and receivers communicated via ultrasound signals to determine the distance between each beacon and receiver, and the robot performed a least-squares calculation to determine the location of each object in 3-dimensional space.
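One standard way to state that least-squares step (the paper does not spell out its exact formulation): given beacon positions b_j and measured beacon-to-receiver distances d_j, the estimated object position is

\hat{p} = \arg\min_{p \in \mathbb{R}^3} \sum_j \left( \lVert p - b_j \rVert - d_j \right)^2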

For each sensor-object, a symbol obj1 or obj2 was added to the environment. The distance in centimeters between each entity in the environment was then calculated and added to the environment with the dist(X,Y,V) predicate, e.g., dist(perL, obj1, 30.5). For the purpose of this calculation, the faces described earlier were assumed to be a fixed distance of 60 cm from the robot, since the vision system did not have access to visual depth information.

In addition, the absolute height above the ground for each object could be computed from the Cricket sensors. The difference height(X) - height(Y) for each object pair (X, Y) was encoded as the predicate relHeight(X,Y,V).


3.3 Identity predicates

In addition to the sensory predicates described above, each object received a predicate that was uniquely true of itself, such as obj1(obj1). This was to allow for the possibility of the robot mistakenly believing that one of the words to be learned was a proper noun. In addition, each object satisfied the identity relation ident(X,X) with itself.

3.4 Audio: Speaker detection and speech segmentation

A dual-channel microphone was used to determine the speaker of each utterance. The two microphone heads were placed 30 cm apart and 50 cm in front of the robot. Within the Sphinx-4 speech recognition system, whether the speaker was to the left or right was determined by comparing the average volume of the two channels over time. The symbol for the corresponding person, perL or perR, was then bound to the variable S (for speaker).

For the purpose of speech recognition, Sphinx used a context-free grammar:

<s> = <np> <vp>;
<np> = (this | that | the ball | the pig | i | you | he);
<vp> = (is <np> | is <p> <np> | got <np>);
<p> = (near | above | below);

Since Sphinx was not trained for the particular speakers in our experiments, even this small grammar resulted in a fairly high recognition error rate. A speaker-dependent speech system would have allowed a larger vocabulary of recognized words.

3.5 Environment summary

When it begins to parse an utterance, TWIG queries the robot for the state of the world, which the robot produces in the form of a list of all atomic sentences that the robot can produce from its available sensors at the time of query. In the experiments to be described, this list looked something like this:

% Unique predicates -- obj1/obj2 = ball & pig,
% perL/perR = person on left/right
obj1(obj1), obj2(obj2),
perL(perL), perR(perR),
% "Person" predicates (faces)
person(perL),
person(perR),
% Possible facing directions
lookingAt(perL, perR),
lookingAt(perL, obj1),
...
% Identity relations
ident(obj1, obj1),
ident(obj2, obj2),
...
% Distances
dist(perL,obj1,31.568943),
dist(perL,obj2,59.623984),
...
% Relative height
relHeight(obj1, obj2, 40.23904),
relHeight(obj1, perL, -5.239),
...

TWIG’s parsing and learning mechanisms then proceed to work with this summary of the robot’s environment.
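As a usage sketch, a query against such a world might look like the following; world/1 is an invented wrapper returning the fact list, and s//2 is the grammar given in Section 4.2:

% Parsing an utterance against the world list above.
?- world(W),
   phrase(s(Meaning, W), [the, pig, is, above, the, ball]).
% With obj1 = ball and obj2 = pig, one would expect
% Meaning = pred([above, obj2, obj1]).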

4 TWIG Architecture

4.1 Overall Structure

TWIG is divided into two parts. In the first part, the extension finder, TWIG parses an utterance and matches it to a fact observed in the environment. (Here and later, we use the word “fact” as a shorthand for “atomic sentence of predicate logic.”) If a word is not understood but the sentence is close to matching a fact in the environment, the new word is inferred to refer to the object or relation in the environment that satisfies the meaning of the rest of the sentence. This part corresponds to the problem of finding the extension of the word, or its meaning under particular circumstances [8,13]. For example, if the first author of this paper were to say “I,” the extension of “I” would be the individual named Kevin Gold. The extension finder can also find the two extensions connected by a transitive verb or preposition – for instance, if it heard “The pig foo the ball,” it could find the particular objects in the environment that were being related by the word “foo,” namely the pig and the ball.

The second part, the definition tree generator, takes as input the new words that the robot has heard and all the atomic sentences in the environment description that mention those words’ extensions, and generates decision trees that represent their intensional meanings. Recall that the intension of a word is its general meaning, abstracted from any particular referent. Rather than building a decision tree for every individual word’s meaning, the intension generator treats learning all the words for a particular part of speech as a single decision problem, with each word essentially defined by the decisions that would lead to it instead of a different word in the vocabulary.

The system assumes a predicate calculus representation for the robot’s environment, and a semantics that defines words in terms of predicate calculus as well. This puts some burden on the robot designer in generating useful sensory predicates. However, the predicates are allowed to mention continuous values, for which the system later generates thresholds. This makes the system more useful in the real world than if it required all values to be discretely quantized.

The system also assumes that the robot has access to a speech recognition system that can interpret audio utterances as text. This prohibits the system from learning the semantics of words that it cannot even recognize – but the assumption here is that speech recognition systems will generally cover a much broader vocabulary than the robot has meanings for. Indeed, there is evidence to suggest that human infants tackle the speech segmentation problem before ever learning a semantics for the segmented words [40], so our system is in good company in treating segmentation and semantics as separable issues.

We now turn to a more in-depth exposition of the two halves of the TWIG system.

4.2 TWIG Part I: The Extension Finder

The extension finder functions very much like a standard definite clause grammar in parsing an utterance into logical form, with two major differences. First, the state of the world (in the form of a list of predicate calculus facts) is passed down the parse tree, and noun phrases, as they are parsed, are matched to known objects in the world. The symbols for these real-world objects then replace the words’ intensional meanings during the parse. The fact that this replacement takes place before the parse is done means that the system can use some environmental context to narrow down the grammatical parse possibilities. The system can also backtrack on this match in case no parse succeeds with a particular word-to-object mapping.

If the parse fails, the system is allowed to retry the parse with a free variable. If the sentence contained a new word, this free variable resolves into whatever word extension is necessary to match a fact that the robot knows.


This takes the form of a temporary term with a predicate named after the new word, and arguments named after its extensions – for example, you(perR) or under(obj1, obj2). This temporary predicate allows the parse to succeed without the robot actually knowing what the word means; the system may not know what a “you” is, but it can infer that it’s one of those.

Another reason the parse may fail is that all the words are known, but the fact that the sentence represents does not match anything that the robot knows. In this case, the same free variable can bind instead to the logical form of the whole sentence, which can then be added to the robot’s knowledge base. In this way, understood sentences do not necessarily need to refer to things that the robot already knows, but can inform the robot of facts it cannot directly sense.

It is also possible that the sentence contains more than one new word, or contains a new word and also relates a fact that the robot does not know. In either case, the robot simply gives up on the sentence, and it has no further effect.

The extension finder is implemented in Prolog. The TWIG system adapts the following definite clause grammar from Pereira and Shieber [28]:

s(S, W) --> np(VP^S, W), vp(VP, W).
np((E^S)^S, W) --> pn(E, W).
np(NP, W) --> det(N^NP, W), n(N).
vp(X^S, W) --> tv(X^IV), np(IV^S, W).
vp(S, W) --> [is], pp(S, W).
vp(IV, _W) --> iv(IV).
pp(X^S, W) --> p(X^PREP), np(PREP^S, W).

The abbreviations for phrase types are typical: np for “noun phrase,” iv for “intransitive verb,” and so on. pn covers both proper nouns and pronouns. W is a list of predicates indicating the state of the world.

A term of the form X^Φ is shorthand for the lambda expression λX.Φ, the notation for a function Φ with an argument X. In Montague’s semantics (sometimes now simply called “formal semantics” [39]), the meanings of words and phrases can be expressed using lambda notation over logical expressions: the verb “has,” for instance, can be expressed as λX.λY.possesses(X, Y), indicating that “has” refers to a function possesses(X, Y) that takes two arguments, a possessor and a possessed. In the Prolog language, these terms can be used inline in definite clause grammars, and the arguments of the functions are substituted as the parse provides them (see Figure 3).
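To illustrate the mechanics (the predicate and argument order here are for illustration only, not taken from the grammar above), two successive unifications play the role of function application at the Prolog top level:

?- TV = X^Y^pred([got, X, Y]),
   TV = a^IV,   % the parse supplies the first argument, binding X = a
   IV = b^S.    % and then the second, binding Y = b
% S = pred([got, a, b]), i.e., the logical form got(a, b).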

In the case of verbs and nouns, words at the lowest level are associated with their lambda calculus definitions, based on their parts of speech:


Fig. 3. Parsing a sentence with an undefined word, “I.” The parse partially succeeds on the right, and the system finds that the whole sentence can be grounded if “I” refers to person P1, who has the ball. The missing definition, that “I” refers to the speaker, can only be learned over time.

tv(Word, X^Y^pred([Word, X, Y])) :- member(vocab(tv,Word), W).
iv(Word, X^pred([Word, X])) :- member(vocab(iv,Word), W).
n(Word, X^pred([Word, X])) :- member(vocab(n,Word), W).
p(Word, X^Y^pred([Word, X, Y])) :- member(vocab(p,Word), W).

Essentially, these statements relate the part of speech of a word to its logical semantics. For example, if a word Word is a transitive verb, then it should create a lambda calculus function of the form λX.λY.Word(X, Y), because it relates two different extensions, the subject and the object. (Transitive verbs that require an indirect object, such as “give,” would need three arguments and their own grammatical category, but are not included in the grammar above.) A noun, on the other hand, contains only one argument, because it refers to just one thing: λX.Word(X). The Prolog statement pred([P, ...]) represents the predicate P(...) in the robot’s sensory representation; we shall see below that it is useful to treat the predicate P as a variable, and its arguments as a list.

Proper nouns, pronouns, and noun phrases beginning with “the” are immediately grounded in the robot’s environment. In Prolog, this is expressed as follows:

det(the, W, (X^S1)^(X^S2)^S2) :- contains(W, S1).
pn(E, W) --> [PN], pn(PN, W, E).
pn(PN, W, X) :- contains(W, pred([PN, X])).

The predicate contains(W, X) is true if the world W entails the fact X. A call to contains therefore implies a check against the robot’s perceived reality, looking for an object that matches the description implied by a word. If an object to which the word or phrase applies is found, its symbol takes the place of the corresponding predicate. For instance, on parsing “the ball,” the system searches the world W for a symbol X such that ball(X). If ball(b) is found in W, X^ball(X) is replaced with b.
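In the simplest case, where the world W is just the flat list of sensed atomic sentences shown in Section 3.5, entailment reduces to membership; a minimal sketch:

% A minimal sketch of contains/2; a fuller version would apply inference
% rules rather than bare list membership.
contains(W, X) :- member(X, W).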

If the robot knows enough to understand the sentence S, the end result when the robot hears a sentence is that it is transformed into either the form pred([P, X]) or the form pred([P, X, Y]), where X and Y are symbols that correspond directly to known objects and P is a sensory predicate. If the robot’s world W contains this fact as well, nothing further happens. If W does not contain P(X, Y), but there is some fact in W that contains P, the sentence is understood as new information, and is added to the knowledge base.

If the parse fails, the system is allowed to guess one word extension that it does not actually know. An unconstrained variable A is appended to the world W before passing it into the parser, and the parser solves for A. This effectively allows the robot to hypothesize a fact of the form pred([Word, Object]), where Word is the new word and Object is the object to which it refers.
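A minimal sketch of that retry step (the top-level predicate name is invented; s//2 and contains/2 are as above):

% If no parse grounds out in the world W, retry with an unconstrained
% variable A prepended; the parser solves for A, the hypothesized fact.
infer_extension(Utterance, W, A) :-
    \+ ( phrase(s(S0, W), Utterance), contains(W, S0) ),
    phrase(s(S1, [A|W]), Utterance),
    contains(W, S1).
% e.g., for [i, got, the, ball], A may come back as pred([i, perL]).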

Figure 3 illustrates how this works. Suppose the robot hears the statement “I got the ball.” It does not know who “I” is, but it sees girl a holding a ball b and girl e holding nothing. The parse fails the first time because the robot does not know the meaning of “I.” It does, however, know that “got the ball” parses to λX.got(X, b). On retrying with the free variable, the robot finds that hypothesizing I(a) allows it to match the sentence to got(a, b), a fact it already knows. Thus, “I” is assumed to refer to a: the system has successfully inferred the extension.

In the case of new words with logical forms that take two arguments, such as transitive verbs and prepositions, the process works in the same way, only the system infers that the new word must be a relation between the two noun extensions found elsewhere in the parse. For instance, in parsing “The pig foo the ball,” if the words “pig” and “ball” can be matched to specific objects p and b in the environment, the system will infer that the logical form of the whole sentence is pred([foo, p, b]) even if it has not encountered the word “foo” before. This fact can be added as-is to the robot’s knowledge base, but the fact is ultimately meaningless (ungrounded in sensory predicates) until the robot also learns the intension of “foo,” which occurs in the next step.

In the process of parsing the sentence, the system must necessarily assign a part of speech to the new word as well. In the grammar described above, this choice is unambiguous. For more general grammars, the part of speech returned would be the first one found that satisfied the parse and produced a known fact.

The new word, its extension(s), and its part of speech are all passed to the next stage of the TWIG system for further processing.

[Figure 4: a definition tree whose interior nodes test person(X), ident(X,S), lookingAt(S,X), and male(X), with the words “I,” “you,” “that,” “he,” and “she” at its leaves.]

Fig. 4. An example of a definition tree. Paths to the left indicate that a predicate is satisfied. The predicates on the path to a word imply its definition. S refers to the speaker, and X refers to a referent. If the system reaches the dot when trying to decide what word to use for an entity, it decides that it has no word that applies.

4.3 TWIG Part II: Definition Tree Generator

4.3.1 The Interpretation of Definition Trees

Definition trees reconstruct the speaker’s decision process in choosing a word for a referent or a relation. They are essentially decision trees, but they are built for use in the “reverse direction”: the words are stored at the leaves, and a word’s definition is given by the path from the root to the word’s leaf. The interior nodes can be decisions about properties of the referent itself, or about its relations to other objects or people, particularly the speaker. When TWIG learns the meanings of new words, it does so by constructing definition trees that include the new words. The structures of these trees implicitly define the words.

Figure 4 shows an example of a definition tree. Paths to the left after a decision indicate that the predicate is satisfied, while the right branch indicates that it is not. Each decision consists of attempting to satisfy a logical predicate with at most one threshold on a numerical argument to the predicate. We follow the additional convention that S always refers to the speaker, X always refers to a referent, Y is an optional second referent for the purpose of defining relations, and V is a numerical threshold. For example, one decision might be dist(S,X,V) & V ≤ 28.8, indicating whether the distance between speaker S and referent X is less than 28.8 cm. (We will sometimes use the shorthand dist(X,Y) <= V, or omit mention of the threshold entirely if the attribute is boolean, assuming it to be ≥ 1.) In choosing a word to describe X, the system would decide whether speaker S and X satisfy this predicate and threshold. If the predicate is satisfied, the path on the left is followed; if not, the path on the right. This process continues until a leaf is reached, at which point the most common word at the leaf is chosen.
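That walk can be sketched in a few lines of Prolog, assuming an invented tree representation node(Decision, YesSubtree, NoSubtree) and leaf(Word), and a satisfied/3 test for decisions:

% Choosing a word for referent X spoken about by S: follow satisfied
% decisions left, unsatisfied ones right, and stop at a leaf.
choose_word(leaf(Word), _S, _X, Word).
choose_word(node(Decision, Yes, No), S, X, Word) :-
    (   satisfied(Decision, S, X)
    ->  choose_word(Yes, S, X, Word)
    ;   choose_word(No, S, X, Word)
    ).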

The structure of a definition tree implies the logical definitions of the words it contains. Each word is defined as the conjunction of the predicates that lie on the path from the root to the leaf, with a predicate negated if the “no” path is followed. For example, the tree in Figure 4 implies that the logical definition of “you” is

you(X) :- person(X) & ~ident(S,X) & lookingAt(S,X)

indicating that “you” is a person who is not the speaker, but whom the speaker is looking at. (We shall continue to use ident to refer to the identity relation.)
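Reading a definition off the tree is the reverse walk; a minimal sketch, using the same invented node/leaf representation as above:

% Recover a word's definition as the (possibly negated) decisions on the
% path from the root to the word's leaf.
definition(leaf(Word), Word, []).
definition(node(D, Yes, _), Word, [D|Rest]) :- definition(Yes, Word, Rest).
definition(node(D, _, No), Word, [neg(D)|Rest]) :- definition(No, Word, Rest).
% For the tree in Figure 4, definition(T, you, Def) would yield
% Def = [person(X), neg(ident(S,X)), lookingAt(S,X)].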

If the rightmost path of the tree is followed to its end, the system realizes that it does not know enough to describe the item in question; this is indicated by a dot in Figure 4. The decision just before this final check is depicted in dotted lines to indicate that it is constructed in a slightly different manner from the other decisions. There, the system is no longer contrasting words against other words, but choosing whether it knows enough about the referent to say anything at all.

The goal of the system is to construct such trees automatically from data consisting of utterances and the scenes that elicited them. We turn now to the algorithm for constructing these trees.

4.3.2 Constructing Definition Trees From Data

Definition trees are constructed and updated using the output of the extension-finding module. The construction method is essentially a variant on Quinlan’s ID3 algorithm [32], but with the addition of the numerical thresholds of C4.5 [33], a final tree branch that catches cases in which no known word should apply, a choice of variables that depends on the extensions and part of speech, and some optimizations for online learning.

At each decision node, the available evidence is split into two groups based on the decision about the referent that is most informative as to word choice, with one group satisfying the condition and the other not. This process then occurs recursively until no further decisions are statistically significant. The tree can be updated online, though the “batch” case shall be described first because it is simpler, and in some cases it must be called as a subroutine to the online version.

The output from the extension finder for a given utterance, and thus the input to the definition tree generator, is a 6-tuple (Wi, Ti, Xi, Yi, Si, Ωi), where Wi is the new word, Ti is the inferred part of speech (“type”), Xi is the literal determined to be the referent, Yi is an optional second referent (or null), Si is the literal that was the speaker, and Ωi is a list of predicates from the world that contain literals Xi, Yi, or Si as arguments. The second referent is non-null if the word was determined to refer to a relationship between two referents, instead of to a particular referent.

TWIG generates a separate decision tree for each part of speech. This is necessary because during language generation, the system should not attempt to use a noun as an intransitive verb, for instance, even though both have the same basic logical form and arguments. The rest of the exposition will assume we are dealing with the tree for a single part of speech.

Let t be an evidence 6-tuple; we now describe the function D(t) that generates the set of possible decisions implied by the tuple. A decision here is a triple (P, V0, σ), where P is a predicate with its arguments drawn exclusively from the set of variables X, Y, S, V, and _; V0 is a threshold value; and σ is a sign indicating whether the direction of inequality for the threshold is ≥ or ≤. For each tuple, all instances of object Xi are replaced with the variable name X; all instances of object Yi are replaced with variable name Y; instances of the speaker Si are replaced with the variable S; and one numerical constant can be replaced with variable name V. All other arguments to the predicate are replaced with the anonymous variable “_”. The threshold V0 becomes the constant value that was replaced with the variable V, or 1 if there was no constant value. Then, two decisions are added to the set: one for σ = ≤ and one for σ = ≥. If a single literal plays more than one role, by being both speaker and the referent, then all possible decisions that can be generated by these substitution rules are added to D(t). For example, if Bob was both the speaker and the sole referent of the new word, the term inchesTall(bob, 60) would generate four decisions: two decisions for whether the speaker (S) was at least or at most 60 inches tall, and two decisions for whether the referent (X) was at least or at most 60 inches tall. D(t) consists of the union of all such sets generated by each predicate in Ωi.


The algorithm must decide which of these decisions is most informative as to word choice. For each decision, a 2 × |W| table is maintained, where |W| is the number of unique words. This table maintains the counts of the instances in which the decision was satisfied for each word. Note that this requires attempting to satisfy each decision with the variables X, Y, S, and V bound to their new values for each tuple.

From this table, one can easily calculate the information gain from splitting on the decision. Let W be the set of all words wi, and let wi ∈ Wd if it was used under circumstances when decision d was satisfied, and wi ∈ W¬d otherwise. Then

\mathrm{Gain}(d) = H(W) - \frac{|W_d|}{|W|} H(W_d) - \frac{|W_{\neg d}|}{|W|} H(W_{\neg d}) \qquad (1)

where

H(W) = \sum_{i : w_i \in W} -\frac{|w_i|}{|W|} \log \frac{|w_i|}{|W|} \qquad (2)

Readers may recognize H(W) as the entropy of W, characterizing the average amount of information in a single word. Gain(d) is the expected reduction in entropy on learning the truth or falsity of decision d. The decision with the most information gain is thus the fact about the referent that maximally reduces the “surprise” about the choice of word [20]. This is the criterion used by Quinlan for the original ID3 decision tree algorithm [32]. If two decisions are tied for informativeness, TWIG breaks the tie in favor of non-numerical predicates, thus penalizing the numerical predicates for complexity.
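
A minimal sketch of this computation over the 2 × |W| count table, with counts[0][w] holding the number of times word w was heard when the decision held and counts[1][w] the number of times it was heard otherwise (a hypothetical layout, not TWIG’s actual table class):

    class InformationGain {
        // Entropy of a word distribution given per-word counts (Eq. 2).
        static double entropy(long[] wordCounts) {
            long total = 0;
            for (long c : wordCounts) total += c;
            if (total == 0) return 0.0;
            double h = 0.0;
            for (long c : wordCounts) {
                if (c == 0) continue;
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);  // bits; the base only rescales gains
            }
            return h;
        }

        // Information gain of splitting on a decision (Eq. 1).
        static double gain(long[][] counts) {        // counts is 2 x |W|
            int words = counts[0].length;
            long[] all = new long[words];
            long nYes = 0, nNo = 0;
            for (int w = 0; w < words; w++) {
                all[w] = counts[0][w] + counts[1][w];
                nYes += counts[0][w];
                nNo += counts[1][w];
            }
            long n = nYes + nNo;
            if (n == 0) return 0.0;
            return entropy(all)
                 - (double) nYes / n * entropy(counts[0])
                 - (double) nNo / n * entropy(counts[1]);
        }
    }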

The maximally informative decision is added to the tree, so long as the decision and word choice are determined to be highly unlikely to be independent. A chi-square test can be computed using the same 2 × |W| table to test for significance. TWIG uses Yates’ continuity-corrected chi-square test [47]:

$$\chi^2_{\mathrm{Yates}} = \sum_{i=1}^{N} \frac{(|O_i - E_i| - 0.5)^2}{E_i} \qquad (3)$$

where the sum is over the N cells of the table, $O_i$ is the observed value in that cell, and $E_i$ is the expected value in that cell under the assumption of independence. (Yates’ corrected chi-square is a compromise between the normal chi-square test, which is misleading for small sample sizes, and Fisher’s exact test, which would take too long to compute for the large numbers of decisions to be evaluated.)
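
The corresponding significance computation over the same table might look as follows (again a sketch, with the cell layout assumed in the gain example above):

    class YatesChiSquare {
        // Yates' continuity-corrected chi-square over a 2 x |W| table (Eq. 3).
        static double chiSquare(long[][] observed) {
            int words = observed[0].length;
            long n = 0;
            long[] rowSums = new long[2];
            long[] colSums = new long[words];
            for (int r = 0; r < 2; r++)
                for (int c = 0; c < words; c++) {
                    rowSums[r] += observed[r][c];
                    colSums[c] += observed[r][c];
                    n += observed[r][c];
                }
            double chi = 0.0;
            for (int r = 0; r < 2; r++)
                for (int c = 0; c < words; c++) {
                    // Expected count under independence of decision and word.
                    double e = (double) rowSums[r] * colSums[c] / n;
                    if (e == 0) continue;  // an all-zero column contributes nothing
                    double d = Math.abs(observed[r][c] - e) - 0.5;  // continuity correction
                    chi += d * d / e;
                }
            return chi;
        }
    }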

Because the chi-square critical value depends on the number of degrees of freedom, which in turn depends on the number of unique words, a single critical value can’t be used. Instead, the critical value τ is approximated using the following equations [24]:

$$\tau = \left(1 - A + \sqrt{A}\, P^{-1}(1 - \alpha)\right)^3 (|W| - 1) \qquad (4)$$

$$A = \frac{2}{9(|W| - 1)} \qquad (5)$$

Here $P^{-1}(\phi)$ is the inverse of the standard normal distribution function, |W| is the number of distinct words in the table, and α is the desired significance level. In the experiments to be described, p < 0.001 was arbitrarily set to be the threshold for significance.
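
In code, the approximation is direct. The inverse normal distribution function is assumed to come from elsewhere (a numerics library or a standard rational approximation), since the JDK does not provide one; passing it in keeps the sketch self-contained:

    import java.util.function.DoubleUnaryOperator;

    class CriticalValue {
        // Approximation of the chi-square critical value (Eqs. 4-5).
        // invNorm is the inverse standard normal distribution P^{-1}.
        static double tau(int numWords, double alpha, DoubleUnaryOperator invNorm) {
            int df = numWords - 1;                          // degrees of freedom
            double a = 2.0 / (9.0 * df);                    // Eq. (5)
            double z = invNorm.applyAsDouble(1.0 - alpha);  // P^{-1}(1 - alpha)
            double base = 1.0 - a + Math.sqrt(a) * z;
            return base * base * base * df;                 // Eq. (4)
        }
    }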

With the decision added to the tree, the set of evidence tuples T is partitioned into two subsets – one consisting of the examples in which the decision was satisfied, and one in which it was not. These subsets are used at the left and right branches of the tree to generate subtrees, following the same procedure as outlined above. The process is then repeated recursively for each branch of the tree, until no more informative decisions can be added. The most common word among all the tuples at a leaf is chosen as the “correct” word for that leaf. (Irrelevant words at a leaf are often the result of sensor noise or speech recognition error.)
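
The overall recursion can be sketched as follows. Example and Tree are illustrative stand-ins for TWIG’s internal classes, and the gain and significance tests (Eqs. 1–5) are passed in as functions rather than re-implemented here:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.BiPredicate;
    import java.util.function.Predicate;
    import java.util.function.ToDoubleBiFunction;

    final class Example {
        final String word;            // the word heard with this evidence tuple
        Example(String word) { this.word = word; }
    }

    final class Tree {
        Predicate<Example> decision;  // null at a leaf
        String word;                  // the leaf's word
        Tree yes, no;
    }

    class BatchBuild {
        static Tree build(List<Example> data,
                          List<Predicate<Example>> candidates,
                          ToDoubleBiFunction<List<Example>, Predicate<Example>> gain,
                          BiPredicate<List<Example>, Predicate<Example>> significant) {
            // Keep the highest-gain decision that passes the significance test.
            Predicate<Example> best = null;
            double bestGain = 0.0;
            for (Predicate<Example> d : candidates) {
                double g = gain.applyAsDouble(data, d);
                if (g > bestGain && significant.test(data, d)) {
                    bestGain = g;
                    best = d;
                }
            }
            Tree node = new Tree();
            if (best == null) {                    // no significant split: make a leaf
                node.word = mostCommonWord(data);
                return node;
            }
            List<Example> yes = new ArrayList<>(), no = new ArrayList<>();
            for (Example e : data) (best.test(e) ? yes : no).add(e);
            if (yes.isEmpty() || no.isEmpty()) {   // degenerate split: stop here
                node.word = mostCommonWord(data);
                return node;
            }
            node.decision = best;
            node.yes = build(yes, candidates, gain, significant);
            node.no = build(no, candidates, gain, significant);
            return node;
        }

        // The most common word among the tuples at a leaf becomes its word.
        static String mostCommonWord(List<Example> data) {
            Map<String, Integer> counts = new HashMap<>();
            for (Example e : data) counts.merge(e.word, 1, Integer::sum);
            String best = null;
            int bestCount = -1;
            for (Map.Entry<String, Integer> e : counts.entrySet())
                if (e.getValue() > bestCount) { best = e.getKey(); bestCount = e.getValue(); }
            return best;
        }
    }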

Once the basic tree has been recursively built, one more operation needs to be performed in order to make the rightmost branch meaningful. Unmodified, the rightmost path in a basic definition tree always corresponds to a failure to prove anything about a referent, since the right path must be taken when a value is unknown. Though there exist some words that describe items about which nothing is known (e.g., “it,” “thing”), they do not exist for all parts of speech; for example, there is no “default” transitive verb that implies nothing about its referents. Thus, some meaningful words may be assigned to this branch simply because there are no further distinctions to be made. Unmodified, this leaves the word mostly meaningless to the robot. Moreover, if the robot ever moved to a different environment or context, it would be constantly failing to prove anything, but then would mistakenly use its “default word” to describe every new situation.

For these reasons, the definition corresponding to the rightmost path is always augmented with the single non-negated decision that produces the highest information gain when contrasting the rightmost word with all other words in the tree. (This decision is represented in the decision tree diagrams with a pair of dotted lines.) The calculation uses all the evidence available at the root of the tree, but the 2 × |W| table for each decision is collapsed into a 2 × 2 table, so that only the rightmost-word vs. non-rightmost-word distinction is taken into account for informativeness.


4.3.3 Optimizations for Online Learning

The batch mode for tree generation is useful for reconstructing a tree from a datafile of evidence 6-tuples; most of the time, however, the tree is updated online in response to an utterance. If the 2 × |W| tables for each decision (including decisions not chosen) are maintained at each node in the tree, a node update usually need only consist of updating these tables and adding the few new decisions implied by the new tuple t. This update is performed first at the root, and then, if the root decision is unchanged, the new tuple is passed recursively down the tree to whichever branch it satisfies. The tree’s structure only changes if, after updating the existing decision tables and adding the new decisions, the most informative decision at a node changes. In this case, the batch algorithm must be called for the entire subtree. While the worst-case running time for this online version is the same – theoretically, every update could change the most informative decision at the root, requiring the whole tree to be rebuilt – in practice, most updates do not change any decision nodes, and the update is fast because it only updates tables down a single path to a leaf.

With these optimizations in place, a Java-based implementation running on a 3.4 GHz Pentium 4 could typically update a tree with 100–200 evidence tuples in 1–2 seconds, with a worst case of about eight seconds. The implementation used no optimizations besides those described here, and probably could have been further sped up with a better-optimized proof mechanism for checking the truth or falsity of decisions. (See also “Complexity,” below.)

Note that despite these optimizations for online performance, the order in which evidence is received does not matter in determining the final state of the definition tree. The inputs could be shuffled, and the final tree would remain the same, because the structure of tree n − 1 has no effect on the structure of tree n; it merely affects the time necessary to compute the nth tree. The table of evidence stored at the root contains all the information necessary to rebuild the tree from scratch. If new data is received that makes a different decision at the root more informative than the current root decision, the entire tree is rebuilt using this data; if a new decision is more informative at a different interior node, the subtree stemming from that node is rebuilt. The online optimizations are efficient under the assumption that these are rare occurrences, but when the tree must be remade to better accommodate the data, it changes its structure to do so. TWIG makes use of its existing tree and data tables to avoid repeating calculations it has already made, but no decision is permanent. This also means that bad decisions introduced due to sensory error can be fixed or eliminated if the system is given more input.


4.3.4 Conversion to Prolog

After each definition tree update, the definition tree can be converted into Prolog, sent back to the extension generator, and asserted, so that the word meanings can be used for parsing immediately. For example, the meaning of “got” as implied by the definition tree in Figure 5b becomes the following Prolog statement:

    contains(W, pred([got, X, Y])) :-
        member(W, pred([_P1, X])),
        member(W, pred([_P2, Y])),
        \+(contains(W, pred([ident, X, Y]))),
        (contains(W, pred([dist, X, Y, V0])), V0 =< 38.0).

The “member” statements are necessary so that X and Y are instantiated before entering negated clauses. Prolog’s negation-as-failure operator would otherwise dictate that got(X,Y) was false if ident(X,Y) held for any choice of X and Y in the environment – which is obviously not what is desired. If this definition had referred to the speaker, the clause member(W, pred([says, S, _])) would perform the necessary binding.

A depth-first traversal of each definition tree can create an explicit list of valid words for each part of speech in the grammar. Explicit vocabulary lists prohibit the robot from using its internally defined predicates as words.
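
A sketch of that traversal, where Node is an illustrative stand-in for a definition tree node:

    import java.util.LinkedHashSet;
    import java.util.Set;

    final class Node {
        final String word;   // non-null only at leaves
        final Node yes, no;
        Node(String word) { this.word = word; this.yes = null; this.no = null; }
        Node(Node yes, Node no) { this.word = null; this.yes = yes; this.no = no; }
    }

    class VocabularyList {
        // Depth-first traversal: every leaf word of the tree is a valid
        // vocabulary item for that part of speech.
        static Set<String> collect(Node root) {
            Set<String> vocab = new LinkedHashSet<>();
            walk(root, vocab);
            return vocab;
        }

        private static void walk(Node n, Set<String> vocab) {
            if (n == null) return;
            if (n.word != null) vocab.add(n.word);
            walk(n.yes, vocab);
            walk(n.no, vocab);
        }
    }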

4.3.5 Complexity

We now turn to the analysis of complexity. The time to build the tree is the time to evaluate a single decision set, multiplied by the number of nodes in the tree. This includes leaves, since decisions are evaluated at the leaves but ultimately discarded. The maximum number of leaves in the tree is |T|, the number of evidence tuples; the maximum number of total nodes in the tree is therefore still O(|T|). The number of decisions to be evaluated at each node is at most O(|P|V), where |P| is the number of different predicates and V is the number of different values encountered for each predicate. (Despite the fact that there are several permutations of variables permissible for each predicate, that number is still bounded by a small constant.) Evaluating a decision requires in the worst case examining every fact in the environment, or |E| steps. Thus, the complexity for the batch algorithm is $O(|T|^2 V|P||E|)$. V typically will be linear in |T||E|, creating a running time of $O(|T|^3|P||E|^2)$; however, sampling a constant number of values for each predicate would bring the complexity down to $O(|T|^2|P||E|)$.

As mentioned previously, the worst case for the online version is a reorganization of the entire tree at each step, resulting in $O(|T|^2 V|P||E|)$ steps for each new tuple. However, the more common case of an update that does not change the tree structure requires only $O(d|E|(|P| + |T|))$ operations to update, where d is the depth of the tree; the two added terms come from the time to update existing decisions at a node, $O(|E||P|)$, and the time to generate and evaluate new decisions from the new tuple, $O(|E||T|)$.

Fig. 5. The definition trees available to the system at the start of the experiment for (a) nouns and (b) transitive verbs.

The linear dependence on the number of predicates and the size of the world means that the algorithm would scale well with the addition of a large factual database, containing information that the robot cannot directly perceive. However, if extended chains of inference are necessary to prove certain facts, the |E| term would need to be replaced with the running time of whatever proof mechanism is used to evaluate the truth or falsity of decisions. Efforts at code optimization would be best directed at this proof procedure, since it is effectively the innermost loop of the tree update.

Practically speaking, most robots will probably have a small number of predicates, and the dominant terms will be the number of tuples observed so far and the size of the environment. In the online case, the time to update is usually linear in both of these quantities. Furthermore, as the number of examples increases, the chance of a major reorganization of the tree decreases, so that in the limit the updates will almost always be $O(d|E|(|P| + |T|))$, assuming value sampling. Thus, the algorithm should scale well in the long term to more words and more examples, as long as the number of numerical values is kept in check through sampling.


4.3.6 Starting Trees

Clearly, if the algorithm is to understand all but one of the words in the sentence, it needs some word definitions to begin with. This implies that the algorithm needs some definition trees from the outset in order to use its word extension finder. Unfortunately, merely specifying the structure of a starting tree is insufficient; the algorithm must have access to the data that generated a tree in order to update it.

The algorithm must therefore be initialized with some actual sensory data in which words are directly paired with their referents, circumventing the word extension finder. The starting vocabulary can be fairly minimal; our implementation began with five words: “ball,” “pig,” “is,” “got,” and “lost.” The definition trees generated for these words (using 45 labeled examples) are shown in Figure 5. Importantly, the data used to generate these trees must be generated under fairly similar conditions to the real online environment, or any consistent differences will be seen as highly informative. Ideally, the robot would be able to use pointing gestures or gaze direction to find these extensions without the extension finder, as this would presumably be most similar to how children learn these first examples in the absence of grammar; Yu and Ballard’s gaze-tracking word learner provided an excellent system for doing this, albeit not in a predicate logic framework [48]. In practice, it is not too difficult to manually provide words and referents for 45 different environment descriptions pulled from the robot’s sensors.

An alternative, which TWIG used previously [15], is to have the starting definitions exist outside the definition tree structure, rendering their definitions unalterable and moot for the purposes of definition tree construction. This approach is less elegant, however, as the known definitions should ideally always inform the definitions of new words, and vice versa.

5 Experiment 1: Learning Pronouns and Prepositions

5.1 Methods

The TWIG system was initialized with the data necessary to produce the trees of Section 4.3.6 for nouns and transitive verbs; the pronoun and preposition trees began empty. For 200 utterances, the experimenters moved the stuffed pig and ball to different locations in the room, and then spoke one of the following utterances ([noun] should be understood to be “ball” or “pig”): This is a [noun]; That is a [noun]; I got the [noun]; You got the [noun]; He got the [noun]; The [noun] is above the [noun]; The [noun] is below the [noun]; The [noun] is near the [noun].

Each utterance contained exactly one word for which the system did not know a definition (“this,” “that,” “I,” “you,” “he,” “above,” “below,” or “near”). Sentences that contained more than one new word, or no new words, were not included in the experiment, since these sentences could not affect the tree development – though speech recognition errors would sometimes result in such sentences being passed to TWIG.

Over the course of the experiment, the objects were moved arbitrarily about the room; positions included next to the robot, on the steps of a ladder, in the hands of one of the experimenters, on various tables situated about the room, and underneath those tables. The experimenters remained roughly 60 cm in front of the robot and 50–70 cm away from each other, and faced the appropriate individual (or robot) when saying “you” or “he.” (This experiment was first reported in [15], but the decision tree algorithm has changed, and the comparison under “Evaluation” is new.)

5.2 Results

Many utterances were incorrectly recognized by Sphinx: at least 46%, based on a review of the system’s transcripts. But because these false recognitions typically either included too many unknown words (e.g., “He is near the pig”) or resulted in tautologies (e.g., “That is that”), the system usually made no inferences from them. A recognition error only affected tree development when it resulted in a sentence that contained exactly one unknown word.

Figure 6 shows the state of the pronoun tree at the 27th, 40th, and final updates to the tree. The person(X) distinction remained the most informative attribute throughout the experiment, as it served to classify the two pronoun types into two broad categories. The proximal/distal distinction of “this” versus “that” was the next to be discovered by the system. The difference between “I,” “you,” and “he” remained unclear to the system for much of the experiment, because these words relied on two unreliable systems: the sound localization system and the facing classifier, which had exhibited error rates of roughly 10% and 15%, respectively.

The final definitions learned by the tree can be rendered into English as follows: “I” is the person that is the speaker. “You” is a person whom the speaker is looking at. “He” is a person who is not the speaker, and whom the speaker is not looking at. “This” is a non-person closer than 30 cm, and “that” is anything else.

Fig. 6. The pronoun tree at the 27th, 40th, and final update. S refers to the speaker, and X to the word’s referent; left branches indicate that the logical predicate is satisfied. “dist” refers to distance, and “ident” indicates the two terms are equal.

The words “above,” “below,” and “near” were stored in a separate tree, shown in Figure 7. Because there were no contrasting examples for “near,” and in fact only four recognized utterances including the word, it did not receive its own leaf. However, the system did learn that “above” and “below” referred to the appropriate differences in relative height.

Fig. 7. The tree the system created to define prepositions.


5.3 Analysis

The definitions implicit in these trees are for the most part correct, but were clearly limited by the robot’s sensory capabilities and the nature of the data. For instance, “non-person that is closer than 28.8 cm” is clearly not a full definition of “this,” which can also refer to farther-away objects if they are large (“this is my house”) or non-physical entities (“this idea of mine”). Under some circumstances, “you” may be used when the speaker is not looking at the addressee, and this subtlety is not captured here. Nevertheless, it is difficult to imagine better definitions being produced given the input available to the robot in this experiment.

The definitions are fairly subtle given the nature of the input, which contained no feedback from the speakers or “negative examples.” The system learned that “he” should not refer to somebody the speaker is looking at, even though its only evidence was the fact that “you” was the preferred word for that case. It learned that “I,” “you,” and “he” must refer to people, and that “this” and “that” should not – again, constraints learned in the absence of negative feedback. (Admittedly, “this” can sometimes be used in introductions – e.g., “This is Kevin” – but in general it would be rude to ask someone, “Could this go get me a cup of coffee?”) The system also demonstrated with “this” and “that” its ability to generate thresholds on numerical sensory data, in the absence of specific boolean predicates for proximity.

The space of possible binary relations was not large, and so the reader may think it obvious that the system would choose height over distance and lookingAt in its definitions for “above” and “below.” However, it is useful to note two things about this learning: first, that these words received fairly conservative thresholds due to the noisy nature of the Cricket input; and second, that “near” received no definition because every word in the tree implied some kind of nearness. (A pig on the other side of the room from the ball could not rightly be said to be “above” the ball.)

TWIG proved resilient against many speech recognition errors. Table 2 lists the different kinds of recognition errors encountered during the experiment, and TWIG’s reactions to them. Because the space of possible utterances is much larger than the space of utterances that are meaningful in context, most speech recognition errors result in sentences that cannot be interpreted, and so they are discarded.


Sentence Type                        Example                          TWIG Reaction   Proportion
Understandable, accurate             I got the ball (true)            Accept          48%
Understandable, data mismatch        I got the ball (wrong speaker)   Accept          5%
Understandable, ambiguous reference  That is that                     Accept          0.5%
More than one new word               He got that                      Discard         28.5%
No extension produces valid fact     He got he                        Discard         10.5%
Misheard as question                 That is what                     Discard         7.5%

Table 2. TWIG’s reaction to various speech recognition outcomes encountered during the experiment.

6 Experiment 2: Evaluation Against Variants

In addition to evaluating the TWIG system by the qualitative goodness of the definitions it produces, we examined the number and accuracy of the sentences it was able to produce about its environment, compared to similar systems.

6.1 Methods

The data generated by Experiment 1 was used to train four different word learning systems, each a modification of the core TWIG system. In one variant, TWIG’s extension inference system was disabled, so that the words were not bound to any particular object or relation in the environment. The definition trees were created under the assumption that a decision was satisfied if any object in the environment satisfied the decision. For example, dist(S,X) <= 30.0cm was satisfied if the speaker was closer than 30 cm to any object. We call this strategy of associating everything in the environment with each word “associationist,” using the terminology of [4].
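
The relaxation can be stated in a line or two of code (illustrative types; this is the ablated variant’s satisfaction rule, not TWIG’s):

    import java.util.List;
    import java.util.function.Predicate;

    class AssociationistVariant {
        // With extension inference disabled, a decision is "satisfied"
        // whenever any object in the environment satisfies it.
        static <O> boolean satisfied(Predicate<O> decision, List<O> environment) {
            return environment.stream().anyMatch(decision::test);
        }
    }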

As another variant, extension inference was allowed, but the system did not use definition trees. Instead, the system used only the single most statistically significant predicate to define each word, as measured by a chi-square test. This is the system that TWIG used prior to our invention of definition trees [17], and it demonstrates the behavior of systems that attempt to learn each word as a separate learning problem, rather than treating all of word learning as a single decision problem.

Toggling whether extension inference was allowed and whether full definition trees were created resulted in four system variants, including TWIG itself. Each system was presented with four test environments, taken from actual robot sensory data. Each environment contained two people and the robot itself, as well as the ball and the pig. The ball and pig were each located either close to one of the people, close to the robot, or far away from any of them, and at varying heights. For each environment, each system produced all of the sentences that were true of that environment, according to the semantics it had learned. For evaluation, the systems were each provided with valid Prolog definitions for the words “is,” “got,” “ball,” “lost,” and “pig.”

Fig. 8. Sentence production across four variants of TWIG. Disabling TWIG’s ability to find word extensions generally results in incorrect definitions, while using single predicates instead of definition trees tends to result in less nuanced definitions, and hence, overproduction.

6.2 Results

Figure 8 shows the results of this comparison. The associationist system that defined each word with a single predicate produced nonsensical definitions for every new word, but by chance managed to produce a fair number of correct sentences. The associationist tree-based system was more conservative and did not produce any more correct sentences than the tautologies implied by the definitions given to the system (“the pig is the pig”). The extension-finding system that defined each word by the single most informative predicate was much closer to the performance of TWIG, as it produced all the correct sentences that applied to each scene that TWIG did. However, because it could not represent conjunctions, its definitions of “this” and “that” omitted the requirement of being a person, while its definition of “he” was based purely on distance to the speaker. This resulted in sentences such as “that got the ball” (instead of “he got the ball”) and “he is the ball.”

Fig. 9. A diagram of the test environment that produced the utterances in Table 3. The robot has the pig toy and is addressing the person on the left, who has the ball.

Associationist,         Associationist,              Extension-Finding,    TWIG (Extension Finding,
Best Predicate          Word Decisions               Best Predicate        Word Decisions)
*I lost the pig.        *I is above I.               I got the pig.        I got the pig.
*That got the ball.     *I is above the ball.        You got the ball.     You got the ball.
I lost the ball.        The ball is the ball.        *That lost he.        He lost the ball.
That is that.           *The pig is I.               *That got the ball.   You got that.
*That lost the pig.     *The pig is above the pig.   This is the pig.      This is the pig.

Table 3. Sample sentences produced by each system to describe the situation in Figure 9. An asterisk indicates a statement counted as incorrect.

TWIG’s only errors in production were in the sentences “I is I” and “You is you” for each test environment, since it did not know the words “am” and “are.” (Grammatically correct tautologies were counted as valid across all four conditions.) Other utterances it produced that it had never heard before included “I got this” and “That is below the ball.”

Table 3 shows some sample utterances that each system produced about the situation shown in Figure 9, with the robot next to the pig and speaking to the person on the left, who has the ball.

7 Experiment 3: Using New Words as Context to Learn Forms of “To Be”

In Experiment 2, the two utterances which TWIG produced incorrectly across all four test situations were “I is I” and “you is you.” There seemed to be no particular reason why the system could not learn the meanings of “am” and “are,” adding them to its transitive verb tree; moreover, it seemed that the system might learn the correct conjugations of these words as well, with the structure of the tree dictating when each form of “to be” was appropriate based on relationship to the speaker.

Fig. 10. The transitive verb tree after 20 examples of “I am I” and “you are you.” Using the definitions for “I” and “you” learned in Experiment 1, the system infers that “am” and “are” refer to relations between an object and itself, and then introduces conjugation-like decisions as a means of distinguishing between them.

7.1 Methods

The definitions learned in Experiment 1 were loaded as Prolog word definitions. Instead of using the robot directly, the system cycled in simulation between the four scenes used for evaluation in Experiment 2, but with the dialogue replaced with one participant saying either “I am I” or “you are you.” The system was presented with 25 examples.

7.2 Results

As shown in Figure 10, the system correctly learned that “am” and “are” referred to the identity relation. Moreover, the system learned the correct conjugation of these three forms of “to be” based on the same criteria that it used to distinguish between “I,” “you,” and “he”: namely, if the first referent was the speaker, use “am”; if the first referent was being spoken to, use “are”; otherwise, use “is.” The tree reached this form after 20 examples.

It is also worth noting that the only sentence context available to learn these words was the words “I” and “you,” and that using sentence context was essential to learning the meanings of these words, as it allowed the system to infer that in each case, the word referred to a relation between an entity and itself. However, neither the decisions used for conjugation nor the identity relation were imported from the pronoun tree; both were learned anew given the inferred reference.

8 Experiment 4: Learning Context-Sensitive Definitions of Colors

So far, we have discussed context in several senses. TWIG uses “sentence context,” in which the meaning of the rest of the sentence informs the meaning of the word; “sensory context,” in which the presence of sensory data allows better learning than the text alone; “relational context,” in which relationships to other objects are taken into account when learning a word for a thing; “deictic context,” a subset of relational contexts dealing specifically with the speaker; and “vocabulary context,” in which the meanings of other words in the vocabulary can affect the meaning of a new word. Are these kinds of context sufficient to demonstrate the kind of context sensitivity desired by Roy and Reiter when they note that the meaning of “red” appears to differ depending on whether the phrase is “red wine, red hair, or red car” [36]?

The robot described in the previous experiments did not possess any means of object recognition, making this kind of distinction difficult to implement on the same robot as Experiment 1. Hence, this last experiment was performed with different sensory input entirely – namely, points of color sampled from Google image searches.

8.1 Methods

The following images were collected as samples from the first 20 hits of Google searches for the terms listed: 15 images of “red,” 18 of “red hair,” 17 of “red wine,” 14 of “brown,” 20 of “white hair,” 19 of “white wine,” 16 of “burgundy,” 15 of “yellow,” 15 of “brown hair,” and 20 of “blonde hair.” (The varying numbers were the result of excluding false positives, e.g., an image of Mr. Alan White.)

For each image, an arbitrary point in the relevant part of the image was sampled for its RGB values. This information was then converted into predicate form, and combined with a single predicate describing the kind of object depicted in the image. Thus, an image of red hair might receive the description hair(obj), red(obj, 66), green(obj, 33), blue(obj, 16). The object classifications were hand labeled, and included hair, wine, and 37 other classifications.

These descriptions were then fed back to the decision-tree-producing part of TWIG, bypassing the reference finder, with the new color words bound in each case to the object variable. (This is equivalent to giving the full version of TWIG sentences of the form “The [noun] is [color],” assuming it possesses definitions matching the nouns to their predicates.)
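
A sketch of the conversion from one hand-labeled sample to this predicate form (names illustrative):

    import java.util.List;

    class ColorPredicates {
        static List<String> describe(String obj, String category, int r, int g, int b) {
            return List.of(
                category + "(" + obj + ")",         // e.g. hair(obj)
                "red(" + obj + ", " + r + ")",
                "green(" + obj + ", " + g + ")",
                "blue(" + obj + ", " + b + ")");
        }

        public static void main(String[] args) {
            // Reproduces the red-hair example from the text:
            System.out.println(describe("obj", "hair", 66, 33, 16));
        }
    }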

Fig. 11. An experiment on learning context-sensitive color words using the results of a Google image search. The tree includes three different branches for white: one for the shade of white hair, another for the shade of white wine, and a third for white in general.

8.2 Results

This data produced the tree shown in Figure 11. The system did not produce separate definitions of red for wine and hair, despite the fact that the red of red wine might be better described as burgundy, and some red hair might otherwise be called brown. This is presumably because the range of shades called “red” spans and includes the red shades of red wine and red hair, and thus there was no need to distinguish these as special cases.

However, the system did produce three different definitions of white, based on the classification of the item in question. The system learned that some shades of white are better called “blonde” when hair is being described, and that something that appears yellow should be called “white” if it is wine. Thus, it appears that TWIG can indeed produce multiple definitions for words, with the word-to-definition mapping mediated by object category.


9 Discussion

To learn words in a complicated environment, a robotic word learning algorithm must have some way to determine what object or relation in the environment is being talked about. The mere presence or absence of an object in the environment is not enough, because many properties are continually present, but must be associated only with particular words. The malformed definitions created when we removed TWIG’s ability to infer reference from context are a demonstration of this principle. Using those definitions resulted in the robot being unable to generate the rich variety of sentences it could form with extension-finding enabled.

One of the best ways to determine whether an object is relevant or not is by analyzing the rest of the sentence, which can strongly imply a particular object or relation as the target for a new word. For example, when TWIG hears “I got the ball,” even when it does not know what “I” means in general, it can infer that the word refers to the person who has the ball. There are other methods of determining reference, such as following eye gaze direction, following pointing gestures, or using general perceptual salience, and these might be used to supplement the information available through sentence context. However, these other methods can be ambiguous, especially in the case of relations or deictic words. When the system hears “The ball is above the pig,” it is easy to point to the ball and the pig, but hard to point to the “aboveness.” Systems that can make sense of grammar have a clear advantage in the inference of new word meanings.

The addition of grammar is quite rare in sensor-grounded word learning, which reveals just how far this kind of research usually is from actual sentence production. Both in corpus-based text processing and sensor-based word learning, grammar is often absent, with sheer collocation in time used to find statistical associations [23,37,48]. Such a semantics does not lend itself to sentence production, because the learner does not learn how to combine its words to create new meanings. TWIG demonstrates that grammar not only allows sentence production, but also aids the comprehension of new words, particularly when relations are involved. The tradeoff is that the grammar limits the system’s ability to comprehend ungrammatical sentences, which are common in human conversation. Nevertheless, it seems to make more sense to build a system with knowledge of grammar and hope to add robustness against ungrammaticality later, rather than give up on grammar from the start.

The reason grammar is so important is that it dictates not only how words must be combined, but how their meanings compose as well; it is the tip of the formal semantics iceberg. Formal semantics provides a much richer notion of meaning than the somewhat ad hoc semantics that have been used in robotic word learning systems to date. TWIG’s separation of intension and extension, for instance, allows it to learn the meaning of “I” without being confused into thinking that “I” refers to a particular person all the time. Previous logic-based word learning systems would associate nouns with particular literal symbols [41,22], which is incorrect for all but proper nouns. On the other end of the spectrum, more sensor-based systems would produce interesting sensory prototypes for classes of objects, but then could not express relations between these objects or integrate more abstract properties into their definitions [37,48].

The primary difficulty with using a logic-based system in a real robot is that sensor data is rarely pre-packaged in the kind of neat formalism that formal semantics typically assumes. Real sensor data contains many numerical values and is often more low-level than the abstract predicates that word learning simulations commonly assume. TWIG gets around these difficulties to some extent by allowing definitions to include conjunctions of predicates and thresholds on numerical values. Conjunctions allow high-level definitions to be built out of lower-level predicates, while numerical thresholds bridge the gap between boolean-valued logic and continuous sensor readings.

Learning logical definitions for words can be difficult when the system is only given “positive” examples of each word, particularly when conjunction and negation are allowed in the definitions. Definitions that specify a value for every possible predicate are likely to be too specific: just because a word was learned on a Tuesday does not mean that isTuesday(X) should be informative. To address this issue, TWIG treats the word learning problem as one of making correct word decisions, with word meanings made more specific only when it is necessary to contrast one word with another. This approach appears to be similar to a heuristic young children use when learning new words, called the Principle of Contrast [9]. Though children appear to be partitioning the space of possible definitions, treating word learning as a single classification problem, this approach has not been previously used in robotic word learning, which has tended to treat each word as a separate recognition problem. Treating word learning as a problem of partitioning the semantic space allows faster learning with fewer examples and no explicit “inhibitory” or “negative” training. Definition trees have the additional advantages of being concise, quickly updated, and easily understood, and of specifying a word preference during generation when multiple words might apply.

Just as introducing grammar made the system less flexible in some ways, definition trees also carry a penalty: they assume that two different words of the same part of speech cannot refer to the same thing. Nevertheless, definition trees have the benefit of providing much more accurate definitions than our earlier system that did not make this assumption [17], which was essentially the “single best predicate” system tested in Section 6. Interestingly, human children appear to go through a phase of this assumption as well; in developmental psychology, it is known as the Principle of Mutual Exclusivity [26]. Perhaps further research into human language development will allow us to understand when, why, and how human infants abandon this assumption, so that we might incorporate their “fix” into TWIG. It is possible that these trees could still model which word should be used first, but that using a word should then free the system to travel down a second-best path in the tree; this approach may be useful for modeling anaphoric words such as “it.” At any rate, creating a different tree for each part of speech alleviates some of the problem; for example, it allows “that” (pronoun) and “pig” (noun) to refer to the same object, as demonstrated in the experiment.

The rather artificial nature of the dialog in the experiment, with each participant repeatedly stating facts that are obvious from the environment, may cast some doubt on the generality of the word learning; real conversations often refer to things that are not in the immediate environment, and may include imperatives or questions in addition to declarative statements. There are three replies to this objection. First, the dialog in the experiment was restricted to achieve learning on a reasonable timescale, but the addition of language that the robot cannot understand would not hinder TWIG’s learning, because TWIG only adds information to its trees if it can match the rest of the sentence to a fact it knows. Though waiting for such ideal statements from real dialogue would be a slow process, the introduction of real speech should not have a negative impact on TWIG’s learning, because most sentences would be ignored. Second, the robot can match statements to facts it knows that are not true of the immediate environment, if such facts are in its KB; obviously, more research needs to be done to ensure that the system could scale to a large KB, but note that TWIG in general will still only know a few facts about the specific extensions mentioned in the sentence and their relations to each other. Third, there is no reason in principle that the system could not be modified to learn from imperatives and questions, given some means of reasoning about such sentences; they are easily identified by their grammatical form, and their content might be inferred from the addressee’s response.

TWIG is good at handling some kinds of ambiguity, but bad at others. When a reference is ambiguous, TWIG can make an incorrect extension inference, which could then lead to incorrect data being used for determining meaning. It may be possible to simply avoid making inferences when a sentence contains ambiguous reference, but then again, referential ambiguity is common and this may not be a realistic solution. Perceptual uncertainty, on the other hand, can essentially be treated as noise, which our experiments handled reasonably effectively, while ambiguity in meaning in the case of homophones could probably be determined by context and handled by creating separate definition tree branches for each meaning.

Some of the definitions implied by Figure 6c may seem too simple, but they must be understood in the context of the robot’s conceptual and sensory capacities. “I” has more connotations to a human than it does to Nico, but Nico’s definition is sufficient to interpret nearby speakers’ sentences or produce its own. The definitions of “this” and “that” are not as general as one might like, since they cannot capture reference to abstractions (“this idea”) or take into account relativity of scale (“this great country”), but they do capture some interesting subtleties, such as the fact that “this” should not refer to the addressee even if he is close to the speaker. Moreover, the system can capture some of the context-sensitive nature of such words by creating new definition tree branches for particular kinds of referent, as demonstrated in Experiment 4. Nevertheless, the use of absolute numerical thresholds to create dividing lines between word definitions is not a desirable property of TWIG, and a more “fuzzy” approach to semantic boundaries is an area for further exploration.

Given that our defense of the robot’s definitions is built upon the robot’s limited sensory capabilities, one might counter that the robot’s world is therefore too simple to draw conclusions about learning in more complex environments. We would disagree that the robot’s world is on the level of “Blocks World,” as it includes several different noisy sensors producing continuous values and image processing routines that are genuinely useful (face detection, gaze direction). But even if Nico’s world were taken to be very simple, the fact that systems that lack notions of reference or implicit contrast fail to learn correct pronoun meanings in even such a simple world, as demonstrated by our primary experiment, implies that such systems are highly unlikely to scale to a more complex environment, with even more features and objects. We do not claim that our system is sufficient to learn the meanings of all possible words, but rather that extension inference and implicit contrast appear to be very useful, perhaps even necessary, for the unsupervised learning of certain word meanings.

The TWIG system’s treatment of deictic, or speaker-relative, terms is somewhat novel within the world of formal semantics. Normally, a system such as Montague semantics handles deixis by assigning a sentence several “indices” within the space of possible speakers, times, and worlds; hence the term “indexical” for deictic pronouns [13]. This rather clunky approach stems from the fact that formal semantics is often used to analyze free-floating snippets of text, unconstrained by a physical environment and ambiguous in context. In a grounded sensory system, the solution is more straightforward: treat the speaker as another variable S to be grounded in the physical environment. In fact, most of the indices used to constrain indexicals could be restated as properties of the speaker: the speaker’s time, place, topic of conversation, and so on. Moreover, the addition of the speaker variable S is useful even for words that aren’t deictic pronouns. For example, it can allow the learner to take a speaker’s attitude toward a referent into account: one can imagine a system in which likes(S,X) distinguishes the words “good” and “bad.” The case of interjections is also interesting, because they have no extension at all; yet every language has words that convey angry(S).

We have also experimented with introducing free variables beyond X, Y, and S in the sentence definitions. Such variables would potentially allow definition trees to learn the meaning of a phrase before learning its parts. For example, if the learner did not know the meaning of “have the ball,” it might treat the whole thing as an intransitive verb meaning got(X,A) & ball(A). Then, on hearing other phrases such as “have the pig” or “have the duckie,” the system might notice the repetition of “have” at the beginning of each word in the subtree. It could then reorganize the tree to break off the leaves of got(X,A), change “have” to a transitive verb meaning got(X,Y), and reclassify its children as noun phrases. This is an idea we are still developing, but it might allow the system to get around the one-word restriction and use whole sentences or phrases in the correct context without understanding their individual parts. Using our intensional definitions with variables that are not bound to specific objects in the environment might also allow “intensional reasoning” about general, rather than particular, members of a category.

It is also worthwhile to note in passing that this is the first system to learn the meaning of the word “I” from unsupervised observation, and then to use the word to refer to itself. In fact, the robot was not even provided with any training examples in which it was the speaker. We hope that the preceding discussion demonstrates that this accomplishment is not simply a parlor trick, but exists within a rich framework of language understanding.

The information-theoretic decisions used in definition trees may also provide a parsimonious explanation for several different heuristics observed among children. For example, young word learners cease to overextend particular words when they learn more apt ones, a heuristic known as the Principle of Contrast [9]. Similarly, young children reject perfectly good descriptions of objects if there are more appropriate words, a heuristic known as the Principle of Mutual Exclusivity [26]. The act of reasoning backwards about why a particular word was used is sometimes categorized as “theory of mind” [4]. Word choice based on maximal informativeness obeys Grice’s Maxim of Quantity [18].

There are still many questions that remain to be explored with this system. How should words for superordinate categories such as “animal” be learned? How should the system choose the “next best word” if it has already used one, but wants to convey more information? How would the system work with raw audio, instead of text from a speech recognizer? What challenges await in learning concrete nouns from visual data? And can the “one unknown word” restriction be removed, so that the system can learn phrases before understanding their parts? These are all exciting avenues for future work.

Acknowledgments

Thanks to Justin Hart for helping with data collection. Support for this work was provided by a National Science Foundation CAREER award (#0238334) and NSF award #0534610 (Quantitative Measures of Social Response in Autism). Some parts of the architecture used in this work were constructed under NSF grants #0205542 (ITR: A Framework for Rapid Development of Reliable Robotics Software) and #0209122 (ITR: Dance, a Programming Language for the Control of Humanoid Robots) and from the DARPA CALO/SRI project. This research was supported in part by a software grant from QNX Software Systems Ltd.

References

[1] J. Anglin, Vocabulary development: A morphological analysis, Monographs of the Society for Research in Child Development 58 (10) (1993) 1–166.

[2] D. R. Bailey, When push comes to shove: A computational model of the role of motor control in the acquisition of action verbs, Ph.D. thesis, Dept. of Computer Science, U.C. Berkeley (1997).

[3] J. Bateman, Basic technology for multilingual theory and practice: the KPML development environment, in: Proceedings of the IJCAI Workshop in Multilingual Text Generation, 1995.

[4] P. Bloom, How Children Learn the Meanings of Words, MIT Press, Cambridge, Massachusetts, 2000.

[5] G. Bradski, A. Kaehler, V. Pisarevsky, Learning-based computer vision with Intel’s open source computer vision library, Intel Technology Journal 9 (1), available online.

[6] R. W. Brown, Linguistic determinism and the part of speech, Journal of Abnormal and Social Psychology 55 (1) (1957) 1–5.

[7] S. Carey, The child as word-learner, in: M. Halle, J. Bresnan, G. A. Miller (eds.), Linguistic Theory and Psychological Reality, MIT Press, Cambridge, MA, 1978.

[8] R. Carnap, Meaning and Necessity, University of Chicago Press, Chicago, 1947.


[9] E. Clark, The principle of contrast: A constraint on language acquisition, in: B. MacWhinney (ed.), Mechanisms of Language Acquisition, Lawrence Erlbaum Assoc., Hillsdale, NJ, 1987, pp. 1–33.

[10] C. Crick, M. Doniec, B. Scassellati, Who is it? Inferring role and intent from agent motion, in: Proceedings of the 6th International Conference on Development and Learning, London, UK, 2007.

[11] C. Crick, M. Munz, B. Scassellati, Synchronization in social tasks: Robotic drumming, in: International Symposium on Robot and Human Interactive Communication, Hereford, England, 2006.

[12] C. G. de Marcken, Unsupervised language acquisition, Ph.D. thesis, MIT (1996).

[13] D. R. Dowty, R. E. Wall, S. Peters, Introduction to Montague Semantics, D. Reidel, Boston, 1981.

[14] L. Fenson, P. Dale, J. S. Reznick, E. Bates, D. Thal, S. Pethick, Variability in early communicative development, Monographs of the Society for Research in Child Development 59 (5).

[15] K. Gold, M. Doniec, B. Scassellati, Learning grounded semantics with word trees: Prepositions and pronouns, in: Proceedings of the 6th International Conference on Development and Learning, London, UK, 2007.

[16] K. Gold, B. Scassellati, A Bayesian robot that distinguishes “self” from “other”, in: Proceedings of the 29th Annual Meeting of the Cognitive Science Society (CogSci2007), Psychology Press, New York, NY, 2007.

[17] K. Gold, B. Scassellati, A robot that uses existing vocabulary to infer non-visual word meanings from observation, in: Proceedings of the Twenty-Second Conference on Artificial Intelligence (AAAI-07), AAAI Press, 2007.

[18] H. P. Grice, Logic and conversation, in: P. Cole, J. Morgan (eds.), Syntax and Semantics Vol. 3: Speech Acts, Academic Press, New York, 1975, pp. 41–58.

[19] M. A. K. Halliday, An Introduction to Functional Grammar, Hodder Arnold, London, 1994.

[20] R. W. Hamming, Coding and Information Theory, 2nd ed., Prentice-Hall, Englewood Cliffs, NJ, 1986.

[21] S. B. Heath, Ways With Words: Language, life, and work in communities and classrooms, Cambridge UP, New York, 1983.

[22] R. J. Kate, R. J. Mooney, Learning language semantics from ambiguous supervision, in: Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence (AAAI-07), AAAI Press, Menlo Park, CA, 2007.

[23] T. Landauer, S. Dumais, A solution to Plato’s problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge, Psychological Review 104 (2) (1997) 211–240.


[24] A. M. Law, D. Kelton, Simulation Modeling and Analysis, 3rd ed., McGraw-Hill, New York, 2000.

[25] A. Lovett, B. Scassellati, Using a robot to reexamine looking time experiments, in: Proceedings of the 4th International Conference on Development and Learning, San Diego, CA, 2004.

[26] E. M. Markman, G. F. Wachtel, Children’s use of mutual exclusivity to constrain the meanings of words, Cognitive Psychology 20 (1988) 121–157.

[27] W. O’Grady, How Children Learn Language, Cambridge UP, Cambridge, UK, 2005.

[28] F. C. N. Pereira, S. M. Shieber, Prolog and Natural-Language Analysis, CSLI/SRI International, Menlo Park, CA, 1987.

[29] S. Pinker, The Language Instinct, HarperCollins, New York, 1994.

[30] N. B. Priyantha, The Cricket Indoor Location System, Ph.D. thesis, Massachusetts Institute of Technology (2005).

[31] W. V. O. Quine, Word and Object, MIT Press, 1960.

[32] J. R. Quinlan, Induction of decision trees, Machine Learning 1 (1986) 81–106.

[33] J. R. Quinlan, Improved use of continuous attributes in C4.5, Journal of Artificial Intelligence Research 4 (1996) 77–90.

[34] T. Regier, The Human Semantic Potential: Spatial Language and Constrained Connectionism, MIT Press, Cambridge, MA, 1996.

[35] D. Roy, Semiotic schemas: A framework for grounding language in action and perception, Artificial Intelligence 167 (2005) 170–205.

[36] D. Roy, E. Reiter, Connecting language to the world, Artificial Intelligence 167 (1–2).

[37] D. K. Roy, A. P. Pentland, Learning words from sights and sounds: a computational model, Cognitive Science 26 (2002) 113–146.

[38] S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, 2nd ed., Prentice Hall, Upper Saddle River, New Jersey, 2003.

[39] J. I. Saeed, Semantics, 2nd ed., Blackwell Publishing, Malden, MA, 2003.

[40] J. R. Saffran, R. N. Aslin, E. L. Newport, Statistical learning by 8-month-old infants, Science 274 (1996) 1926–1929.

[41] J. M. Siskind, Lexical acquisition in the presence of noise and homonymy, in: Proceedings of the 12th National Conference on Artificial Intelligence (AAAI-94), MIT Press, 1994.

[42] L. Steels, F. Kaplan, Aibo’s first words: The social learning of language and meaning, Evolution of Communication 4 (1) (2001) 3–32.


[43] G. Sun, B. Scassellati, Reaching through learned forward model, in: Proceedings of the 2004 IEEE-RAS/RSJ International Conference on Humanoid Robots, Santa Monica, CA, 2004.

[44] P. Viola, M. Jones, Robust real-time face detection, International Journal of Computer Vision 57 (2) (2004) 137–154.

[45] T. Winograd, Computer program for understanding natural language, Ph.D. thesis, MIT (1971).

[46] W. A. Woods, Semantics and quantification in natural language question answering, in: B. J. Grosz, K. S. Jones, B. L. Webber (eds.), Readings in Natural Language Processing, Morgan Kaufmann, Los Altos, CA, 1978/1986, pp. 205–248.

[47] F. Yates, Contingency table involving small numbers and the chi square test, Journal of the Royal Statistical Society (Supplement) 1 (1934) 217–235.

[48] C. Yu, D. H. Ballard, A multimodal learning interface for grounding spoken language in sensory perceptions, ACM Transactions on Applied Perception 1 (1) (2004) 57–80.
