
Inference Through Embodied Simulation in Cognitive Robots

Vishwanathan Mohan · Pietro Morasso · Giulio Sandini · Stathis Kasderidis

Received: 1 October 2012 / Accepted: 3 February 2013 / Published online: 12 March 2013

© Springer Science+Business Media New York 2013

Cogn Comput (2013) 5:355–382. DOI 10.1007/s12559-013-9205-4

Abstract  In Professor Taylor's own words, the most striking feature of any cognitive system is its ability to "learn and reason" cumulatively throughout its lifetime, the structure of its inferences both emerging from and constrained by the structure of its bodily experiences. Understanding the computational/neural basis of embodied intelligence by reenacting the "developmental learning" process in cognitive robots, and in turn endowing them with primitive capabilities to learn, reason, and survive in "unstructured" environments (domestic and industrial), is the vision of the EU-funded DARWIN project, one of the last adventures Prof. Taylor embarked upon. This journey is about a year old at present, and our article describes the first developments in relation to the learning and reasoning capabilities of DARWIN robots. The novelty in the computational architecture stems from the incorporation of recent ideas, firstly from the field of "connectomics," which attempts to explore the large-scale organization of the cerebral cortex, and secondly from recent functional imaging and behavioral studies in support of the embodied simulation hypothesis. We show through the resulting behaviors of the robot that, from a computational viewpoint, the former biological inspiration plays a central role in facilitating "functional segregation and global integration," thus endowing the cognitive architecture with "small-world" properties. The latter, on the other hand, promotes the incessant interleaving of "top-down" and "bottom-up" information flows (which share computational/neural substrates), hence allowing learning and reasoning to "cumulatively" drive each other. How the robot learns about "objects" and simulates perception, and learns about "action" and simulates action (in this case learning to "push," which follows the pointing, reaching, and grasping behaviors), is used to illustrate the central ideas. Finally, an example of how simulation of perception and action leads the robot to reason about how its world can change such that it becomes a little more conducive toward realization of its internal goal (an assembly task) is used to describe how "object," "action," and "body" meet in the DARWIN architecture and how inference emerges through embodied simulation.

Keywords  Brain guidance · Embodied simulation · Small worlds · Body schema · Cognitive robotics · Learning and reasoning · DARWIN project

Introduction

So it was a cold winter night of 2011 in Luxembourg when we (VM and PM) last met Professor Taylor, enjoying his apple pie and acknowledging the young chef for her creativity. She was all smiles and indeed, creativity is infectious, be it Betty the New Caledonian crow [42, 81], Alex the parrot [58], a capuchin or a chimp [75, 76, 84], or a human infant playing!

V. Mohan (corresponding author), P. Morasso, G. Sandini: Robotics, Brain and Cognitive Science Department, Istituto Italiano di Tecnologia, Via Morego 30, 16163 Genoa, Italy. e-mail: [email protected]; [email protected]; [email protected]

S. Kasderidis: Novocaptis Cognitive Systems and Robotics, Thessaloniki, Greece. e-mail: [email protected]


How brains "become" creative and exhibit novelty in behavior through cumulative learning and effective use of experience is still a mystery. This has to be unlocked to better understand our own selves and to create artifacts that can intelligently assist us in the environments we inhabit and create. The next morning, we were asked "what aspect of cognition needs to be understood further to effectively design artificial cognitive systems?" There was an instantaneous reply: "the most striking feature of any cognitive system is its ability to learn cumulatively forever and use its experiences effectively to survive." It was the negotiation meeting of the newly funded EU project DARWIN, and Prof. Taylor had concisely spelled out the mantra to pursue. The underlying rationale was twofold: firstly, to explore the computational/neural basis of embodied intelligence by reenacting the infant developmental learning process in cognitive robots and, secondly, to create practical systems with "end-user value" that demonstrate cognitive capabilities. This is also evident from the expansion of the acronym "DARWIN": Dexterous Assembler Robots Working with embodied INtelligence (www.darwin-project.eu). Our journey is about a year old now, and this article presents the first developments in relation to the learning and reasoning capabilities of the DARWIN robots.

In general, after the tryst with GOFAI, most current research in the field of cognitive developmental robotics appreciates the fact that "sensorimotor experience precedes representation" and that cognition is gradually bootstrapped through a cumulative process of learning by interaction (physical and social) within the zone of proximal development [78] of the agent. This approach indeed has roots in Wiener's cybernetics [80], Varela et al.'s autopoiesis [73], Chiel and Beer's neuroethology [14], Clark's situatedness [15], Hesslow's simulation hypothesis [32, 33], and Thompson's enactive cognition [71]. The obvious reason to pursue this path is that it is impossible to predict and program at "design time" every possible situation, at every time instance, to which an artifact may be subjected in the future. Of course, robot programming approaches work for simple machines performing targeted functions (like a washing machine) but certainly not for general-purpose robotic companions envisaged to assist us in unstructured environments: housekeeping, workplace automation, industrial assembly, aid for the elderly and physically challenged, to mention a few. Complementing the extrinsic application-specific value, the embodied/enactive approach is also relevant from an intrinsic viewpoint of understanding our own selves: understanding how interactions between body and brain shape the mind, shape action, and shape reason. This is because, unlike the range of direct problems in conventional physics that involve computing the effects of forces on objects, the brains of animals have to deal with exactly the inverse problems of learning, reasoning, and choosing actions that would enable realization of one's goals and hence, ultimately, survival. Strikingly, many of the inverse problems faced by the brain to learn, reason, and generate goal-directed behavior are indeed analogous to the ones roboticists must solve to make their robots act cognitively in the real world. It was this interleaving of "extrinsic" and "intrinsic" value that fascinated Prof. Taylor and drove him to co-author and work in the DARWIN project. At the same time, it is only fair to say that in spite of extensive research scattered across multiple scientific disciplines and the prevalence of numerous machine learning techniques, present artificial agents still lack much of the resourcefulness, purposefulness, flexibility, and adaptability that biological agents so effortlessly exhibit. Certainly, this points toward the need to develop novel computational frameworks that go beyond the state of the art and endow cognitive agents with the capability to learn cumulatively and use past experience effectively "to connect the dots" when faced with novel situations [25]. Perhaps a "humanlike" learning touch to machine learning algorithms is the need of the times ahead!

Looking at the incessant loop of gaining experience and using experience (as prevalent in most biological systems that demonstrate cognition), learning and reasoning can be seen as foreground and background alternating with each other, as intricately depicted in the artistic creations of M. C. Escher [46]. In an intriguing work during the early days of embodied/enactive cognition, Mark Johnson [41] playfully remarked that "we are rational animals, but we are also rational animals," emphasizing that, like learning, the structure of reasoning and inference also does not transcend the structure of bodily experience. The centrality of embodiment directly influences "what" and "how" things can be meaningful to us, the ways in which our understanding of the world is gradually bootstrapped by experience, and the ways in which we reason about it. In this sense, we believe that for cognitive robots foreseen to operate in open-ended unstructured environments, learning and reasoning must cumulatively drive each other in a closed loop: more learning leading to better reasoning, and inconsistencies in reasoning driving new learning. For simplicity in neural computation, this implies that part of the cortical substrates activated during perceptual and motor learning (i.e., when an agent gains experience) are also activated when an agent reasons and simulates the causal consequences of its actions. While resonance between top-down and bottom-up information flows is a measure of the quality of learning, dissonance is the stepping stone to explore, gain more experience, and learn further. Such neural reuse also makes sense considering the fact that the brain is a product of evolution, meant to support the survival of a species in its natural environment, and importantly operates under constraints of space, time, and energy. A wealth of emerging evidence from neuroscience substantiates this fact (see [8, 24, 28, 33, 49] for recent reviews). We believe that this aspect must be an essential design feature in future cognitive robots that have any chance to survive, cooperate, and assist humans in the real world. While emerging results from functional imaging and behavioral studies may serve as a guiding light, there is still an urgent need to also focus on "cognitive computation" and look deeper into the underlying computational principles, in order to create artificial cognitive systems that can both be "practically useful" and in turn shed deeper insights into the ongoing "neural computation" in the brain. In this context, building upon an intriguing review from a decade back by Hesslow [32], we believe that computational architectures driving cognitive robots must include three basic features that form the core of the embodied simulation hypothesis.

Simulation of Action and Body Schema

Mounting evidence accumulated from different directions, such as brain imaging studies [21, 28], mirror neuron systems [61–63], and embodied cognition [23, 24], generally supports the idea that action "generation, observation, imagination and understanding" share similar underlying functional networks in the brain. In general, there is growing evidence that neural circuits in the predominantly motor areas are also activated in other contexts related to "action" that do not cause any overt movement. Such neural activity occurs not only during imagination of movement ([13, 17, 18], among several others) but also during observation and imitation of others' actions [9, 21, 28, 39] and during comprehension of language, that is, both action-related verbs and nouns [20, 26, 27, 47, 59]. The neural activation patterns include not only pre-motor and motor areas such as PMC, SMA, and M1 but also subcortical areas of the cerebellum and the basal ganglia. During the observation of movements of others, an entire network of cortical areas called the "action observation network," which includes the bilateral posterior superior temporal sulcus (STS), inferior parietal lobule (IPL), inferior frontal gyrus (IFG), dorsal pre-motor cortex, and ventral pre-motor cortex, is activated in a highly reproducible fashion [28]. The central hypothesis that emerges from these results is that motor imagery and motor execution draw on a shared set of cortical mechanisms underlying motor cognition. In simple terms, it posits that one can reason about an action (reach, grasp, push, etc.) without actually performing the action and yet use the same neural substrate in the sensorimotor system. Based on this wealth of neurobiological evidence, a preliminary foundation for such a "shared" computational machinery for "execution, simulation and understanding" of action has been created through the development of the Passive Motion Paradigm (PMP) framework (see [51] for a recent review) and used successfully in a range of tasks like bimanual coordination, motor skill learning, and tool use in the humanoid iCub (http://www.icub.org/), one of the robots used by the DARWIN consortium. The PMP mechanism basically emulates the animation of a "body schema," intended not as the passive homunculus posited by Penfield but as a multi-referential dynamical system which deals at the same time with sensorimotor variables in the end-effector space, joint space, and "tool space." Note that the issue of the body schema is not as popular in cognitive robotics as the concept of embodiment. These are not the same thing: if you have a body schema, you also have embodiment, but not the other way around. Vernon et al. [74], in their discussion of a roadmap for cognitive development in humanoid robots, present a catalog of cognitive architectures, but in none of them is the concept of body schema a key element. More recently, Hoffmann et al. [34] and Mohan and Morasso [51] review this concept in robotics, emphasizing the gap between the idea and its computational implementations. Studies on tool use in animals by Iriki and Sakura [40] and Umilta et al. [72] further support this viewpoint. In this article, we develop these ideas further, describing how "object, action and body" are connected in the DARWIN architecture, how novel actions can be learnt and simulated in the context of a goal, and what the underlying advantages are.
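To make the idea of animating a body schema concrete, here is a minimal sketch (our own illustrative simplification, not the DARWIN/PMP implementation itself) in which a goal generates a virtual force field in end-effector space, the field is mapped to joint space through the Jacobian transpose, and the internally simulated arm relaxes toward the target. Link lengths, gains, and function names are assumptions.

import numpy as np

# Minimal PMP-style relaxation of a 2-link planar arm (illustrative sketch).
L1, L2 = 0.3, 0.25          # link lengths (m), assumed values

def forward_kinematics(q):
    """End-effector position of the simulated body schema."""
    x = L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1])
    y = L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def jacobian(q):
    """Geometric Jacobian of the 2-link arm."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [ L1 * c1 + L2 * c12,  L2 * c12]])

def pmp_relaxation(q0, target, K=2.0, A=1.0, dt=0.01, steps=500):
    """Relax the internal body schema toward the target (no overt movement needed)."""
    q = np.array(q0, dtype=float)
    for _ in range(steps):
        x = forward_kinematics(q)
        force = K * (target - x)          # attractive force field in end-effector space
        torque = jacobian(q).T @ force    # mapped to joint space via Jacobian transpose
        q += A * torque * dt              # admittance-like integration of joint motion
    return q, forward_kinematics(q)

q_final, x_final = pmp_relaxation(q0=[0.3, 0.5], target=np.array([0.35, 0.2]))
print("simulated final end-effector position:", x_final)

Because the relaxation runs entirely on the internal model, the same machinery can either drive overt movement or merely simulate the action and inspect its outcome, which is the sense in which execution and simulation share a substrate.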

Simulation of Perception and Distributed Organization of Semantic Memory

Imagining perceiving something is similar to actually perceiving it, the only difference being that the perceptual activity is generated top down rather than by environmental stimuli. While this perspective has been emphasized in the reviews of Hesslow [32, 33] and Grush [29], among others, more recent developments on the organization of semantic knowledge in the brain (see [16, 48, 49, 57]) provide further insights that help to constrain computational architectures for cognitive agents. The main finding emerging from these results is that conceptual information is grounded in a "distributed fashion" in "property-specific" cortical networks that directly support perception and action (and that were active during learning). The same set of cortical areas is known to be active during real perception, imagination, and lexical processing. It is also established that "retrieval" or reactivation of the neural representation can be triggered by partial cues coming from "multiple modalities": for example, the sound of a hammer retroactivates its shape representation [43, 50], and the presence of a real object (a banana) or a 2D picture of it can still activate the complete network associated with the object (the network that was active when it was first learnt). The results indicate that while there is a fine level of "functional segregation" in the higher-level cortical areas processing sensorimotor information, there is also an underlying cortical dynamics that facilitates "cross-modal, top-down and bottom-up" activation of these areas. "Higher level" is emphasized because there is reason to believe that both early stages of perception (lower-level color and shape processing) and late stages of action (like muscle activity) should not be involved in embodied simulation; otherwise, it would become impossible to distinguish simulation from reality (and we believe retaining this distinction has advantages in computational terms too). There is evidence for this from both motor [19] and perceptual [49] studies. In the sections that follow, we attempt to transform the findings from neuroscience related to simulation of perception into a possible computational framework for the organization of the perceptual systems driving DARWIN robots (and conduct experiments to understand the resulting benefits in terms of the inferential capabilities of the robot).

Small Worlds, Hubs, and Global Integration

For any large-scale interconnected system composed of many millions of contributing elements (neurons, people, computers, etc.) to work efficiently, mechanisms related to functional segregation and global integration must be synergistically coupled, disruption of such synergy often leading to large-scale systemic breakdown. From the viewpoint of embodied simulation, global integration gives rise to powerful associative mechanisms that enable neural activity (coming from partial cues) to swiftly elicit other context-related neural activity in various cortical areas, hence resulting in the emergence of loops of anticipation between perception and action (most often in relation to the goal at hand). Importantly, a simulated action must be able to elicit perceptual activity that resembles the activity that would have occurred if the action had actually been performed, and vice versa: a simulated perception must be able to trigger actions that are "doable" in the context of the imagined situation, hence revealing how the world can be further causally transformed (and whether this is valuable in the context of a goal). Integrative mechanisms hence basically close the loop between simulated perception and simulated action. From a computational perspective, in a large-scale complex system like the brain, efficient integrative mechanisms may also help minimize the number of processing steps, ensure efficient wiring (thus costing less area), lower the metabolic cost of information transmission, and support synchronizability, pattern completion, and conflict resolution. Are there specific principles involved here that can be exploited while creating cognitive artifacts like DARWIN?

Recent developments in the fields of network theory [4, 5] and connectomics [67] provide the guiding light. The point of intersection is the property of "small worldness," now found to be prevalent in many large-scale networks. In simple terms, "small worlds" are complex systems whose individual members form tightly knit local communities (high clustering) but which are, at the same time, characterized by very short path lengths. Since the seminal works of Watts and Strogatz [79] and Barabasi and Albert [6], it is now established that several complex systems, like social networks, transportation networks, power grids, connectivity of the internet, gene networks, food webs, and patterns in sexually transmitted diseases (STDs), among several others, exhibit the "small-world" property. Emerging evidence from analysis of the large-scale architecture of the cerebral cortex [30, 67–69] using techniques like diffusion tensor imaging substantiates the fact that the cortical networks of the brain also exhibit the small-world property. Basically, these studies suggest the existence of a small set of hubs (highly connected cortical patches) that closely interact to facilitate swift cross-modal, top-down, and bottom-up iterations between sub-networks involved in learning, simulating, and representing various kinds of sensorimotor information. This is interesting because the studies mentioned earlier (in relation to simulation of perception and action) also point toward the existence of a small set of hubs that facilitate both "integration and differentiation" [16, 49, 57]. Further, with the recent discovery of the default mode network in the brain ([8, 10, 11, 70]; Addis and Schacter [1, 2, 31, 82]), it is now also known that a core network of "highly connected" areas is consistently activated when subjects perform diverse cognitive functions like recalling past experiences, simulating possible future events (prospection), planning possible actions, and interpreting the thoughts and perspectives of other individuals. In sum, the exciting recent developments from neuroscience, connectomics, and network science call for the creation of novel computational frameworks for learning and reasoning that are strongly grounded in the neurobiology of the brain. As far as we are aware, these recent findings have still not found a place in the computational architectures driving "acting, learning, and reasoning" robots.

"Brain guidance," or the need to "learn from the existing solutions," was often emphasized by Prof. Taylor in his numerous plenary talks and articles throughout his illustrious career, and also during his short but inspirational stint working on the DARWIN architecture. Recent developments emerging from multiple fields like connectomics, network science, and neuroscience provide valuable insights to guide the development of novel computational frameworks that go beyond the existing state-of-the-art machine learning systems. This would most probably both increase the sustainability of cognitive artifacts assisting humans in the real world and increase their value in the eyes of their end users. At the same time, such a pursuit would lead toward novel theoretical formulations of embodied intelligence that are deeply grounded in the biology of the brain. We believe this article is just a preliminary attempt in this direction.

The rest of the paper is organized as follows. The next section deals with simple naming games, or the robot learning about objects in its playground. This section is used to go deeper into small-world networks, distributed organization of concepts, network dynamics, and related issues. We describe how even a small sub-network consisting of just four neural maps is endowed with its own "local" ability both to "reason" in novel situations and to "resolve" contradictions that may arise between what the system anticipates "top down" and what actually activates the system "bottom up." The "A Body Schema for Cognitive Robots: 'Why and What'" section deals with the issue of the body schema and its implementation in one of the DARWIN robots (iCub), and discusses its potential utility in the DARWIN architecture in relation to "simulation of action." The "Connecting Object, Action and the Body: Learning About Action and Simulating Action" section connects "object, action, and body," building upon what has been presented in the "Naming Games: Learning About Objects and Simulating Perception" and "A Body Schema for Cognitive Robots: 'Why and What'" sections. How the robot "learns to push," anticipates the consequences of pushing on various objects, and inversely generates goal-directed pushing is used to illustrate the central ideas. Note that "pushing" is an important "multipurpose" action investigated extensively in the field of animal and infant cognition. The "Simulating 'Perception' and 'Action' in the Context of a 'Goal'" section demonstrates how all the learning comes to use in the context of a goal (a simple assembly task). A discussion concludes.

Naming Games: Learning About Objects and Simulating Perception

We start with a simple scenario of the robot learning about various objects in its playground and associating their names with their perceptual properties. The scenario is used to describe how object concepts are learnt and organized in the DARWIN architecture. For clarity, this section is broken into subsections that go into the details of various topics: small worldness, distributed organization of object concepts in DARWIN, learning a simple "color-word-shape" small-world network, activity dynamics, and the inferential capabilities of the robot at the end of this learning phase.

Small Worldness

Intuitively, any interconnected system consisting of many millions of individual members (people, neurons, computers, etc.) is a "small world" if any member can connect to any other member in a very small number of hops. Small worlds are complex systems whose individual members form tightly knit local communities (high clustering) but which are, at the same time, characterized by short path lengths (global accessibility). Since the seminal works of Watts and Strogatz [79] and Barabasi and Albert [6], it is now established that several complex systems exhibit the "small-world" property [5]. As an analogy, we all like to connect to the most well-connected people around us, and this shortens our global reach in the complex social network. More recent attempts to map the large-scale structural architecture of the cerebral cortex (Hagmann et al. [30]; see also the recent book by Sporns [67]) have revealed that the cortical networks of the brain also exhibit the small-world property. Several highly connected zones, or hubs, have been identified through DTI and tractography, and there is emerging evidence that disruption of "small worldness" may play a role in causing neurological disorders like Alzheimer's disease, schizophrenia, and autism spectrum disorders (see chapter 10 of Sporns [67] for a recent survey). In any network in general (like the internet, airports, etc.), it is the well-connected zones (hubs) that are most vulnerable to attack, since disabling them causes noticeable disruption in the global functioning of the system. While "self-organization" as a computational principle has been used extensively in the literature, "small worldness" has seldom been exploited in the design of cognitive architectures for embodied robots (as far as we are aware). We explore this idea further while designing the DARWIN architecture and investigate its computational advantages.
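To make "small worldness" concrete, the sketch below (purely illustrative, using the networkx library rather than anything DARWIN-specific) compares the two defining quantities, clustering coefficient and characteristic path length, for a regular ring lattice, a Watts–Strogatz small-world graph, and a nearly random graph; the small-world case combines high clustering with short paths.

import networkx as nx

def summarize(name, G):
    """Print the two quantities that define small worldness."""
    C = nx.average_clustering(G)               # how tightly knit local neighborhoods are
    L = nx.average_shortest_path_length(G)     # how few hops separate any two members
    print(f"{name:12s}  clustering={C:.3f}  path length={L:.2f}")

n, k = 200, 10                                  # 200 nodes, each wired to 10 neighbors
summarize("lattice", nx.connected_watts_strogatz_graph(n, k, p=0.0))      # regular: high C, long L
summarize("small world", nx.connected_watts_strogatz_graph(n, k, p=0.1))  # a few rewired shortcuts
summarize("random", nx.connected_watts_strogatz_graph(n, k, p=1.0))       # random: low C, short L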

Distributed Organization of Object Concepts Within a "Small World"

Recent functional imaging and behavioral studies shed light on how conceptual knowledge (and semantic memory) is organized in the brain and, more importantly, show that this organization is compatible with the "hub-based" small-world framework [16, 49, 50, 57]. The main finding emerging from this area of investigation is that conceptual information is grounded in a "distributed fashion" in "property-specific" cortical networks that directly support perception and action (and that were active during learning). The same set of networks is known to be active during real perception/action, imagination, and even lexical processing. It is also now well known that "retrieval" or reactivation of the conceptual representation can be triggered by partial cues coming from "multiple modalities" (sound, 2D picture, real object, word, etc.).

What are the computational principles necessary to create such a brain-inspired framework for representing object concepts in DARWIN, one that allows both learning globally (in sounds, colors, words, shapes, and movements) and activating the complete network from a partial cue (sound, word, picture)? It is here that we exploit the computational principles of "self-organization" to learn from experience, the emerging evidence related to property-specific distributed networks (to endow compositionality), and the network-theory-inspired idea of "small worldness" (to enable multimodal integration and pattern completion from partial cues). Figure 1 shows the block diagram that captures the building blocks and information flows. We briefly summarize the details below.

The sensory streams: At the bottom is the DARWIN sensory layer, which includes the sensors and the associated lower-level communication protocols and algorithms to analyze the properties of objects, mainly color, shape, and size. The color of objects is analyzed by a color segmentation module based on a recent approach using Markov random fields [64], developed by one of the partners in the DARWIN consortium (referred to in the acknowledgment section). This returns a triad of RGB values, which forms the input to the color SOM. At the level of the concept system, information related to object shape is passed as a 120-bit vector unique to each shape (like an abstract identifier of the object). In this way, the complexity of shape analysis is abstracted away from the concept system. Size-related information is organized into two different maps, one coding for magnitude (the maximum length of the object across any axis in Cartesian space, say S1) and one for proportion (i.e., the ratio of the maximum length with respect to the lengths along the other two axes, say S2). S3 relates to orientation, which is not a property of the object itself but rather is relative to the frame of reference of the observer. This kind of organization of size-related information is partly inspired by recent evidence related to the representation of magnitude in the parietal cortex [12]. There are several advantages to this scheme in terms of inferring what can be done with different objects that may be indistinguishable through color or shape: for example, consider a green cube and a green stick; both have the same shape and color, and what distinguishes them is the abstract magnitude and proportion (the former can be used to build a stack, the latter as a tool to pull an unreachable reward).
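As an illustration of how such size descriptors might be computed, the sketch below derives a magnitude (S1) and a proportion (S2) value from an object's bounding-box extents; the function, the use of a bounding box, and the choice of averaging the two minor axes for the ratio are our own assumptions, not the DARWIN sensory layer.

import numpy as np

def size_features(dims_xyz):
    """Compute magnitude (S1) and proportion (S2) from bounding-box extents (meters)."""
    dims = np.sort(np.asarray(dims_xyz, dtype=float))[::-1]  # longest axis first
    s1 = dims[0]                         # magnitude: maximum extent across any axis
    s2 = dims[0] / np.mean(dims[1:])     # proportion: max extent vs. the other two axes (assumed averaging)
    return s1, s2

# A cube and a stick may share color and coarse shape, but S1/S2 separate them.
print("cube ", size_features([0.05, 0.05, 0.05]))   # S2 close to 1: compact, stackable
print("stick", size_features([0.30, 0.02, 0.02]))   # S2 much greater than 1: elongated, tool-like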

Word information is the input coming directly from the teacher. Infants often learn to associate "words" with objects by learning in a social environment and interacting with the parent/teacher. It is further possible to exploit compositionality in the domain of words. For example, consider a "black apple": even though we may never have encountered such an object, we can easily "imagine" what it should be, and this should activate "top-down" the higher-level areas processing color and shape, and not just words, as is known from several studies in brain imaging [16].

Fig. 1  The block diagram that captures the building blocks and information flows leading to the distributed representation/learning of object concepts. "Growing SOM" stands for growing self-organizing maps learning, representing, and simulating different kinds of perceptual information about objects: color, shape, name, size, etc. The box in the top left shows 12 (out of 13) possibilities to connect 3 nodes, of which a particular type of connectivity called the "dual dyad" (highlighted) has been found to be prevalent in the cortex of several organisms. In the block diagram, the connectivity between the various self-organizing maps is of the dual dyad type, with every node representing a neuron in a different map. The basic computational advantage is to have both functional segregation and, at the same time, global integration, hence allowing the possibility of even a single neuron (in any map) "retroactivating" a large-scale cognitive network (Color figure online)


At present, "word"-related inputs are entered by the teacher using the keyboard and converted into vectors on the basis of letter usage frequencies in the English language, as is done in [37]. In the present system, a sequence of at most three words describing the object (size-color-shape, for example "small red cube") is considered, and the resulting individual activities are superimposed to obtain the final activations in the word SOM. From an "application perspective," the incorporation of a little linguistics (grounded in the sensorimotor experience of the learner) endows the architecture with a measure of user friendliness. In the future, we look forward to replacing the keyboard input modality with a direct auditory channel (along the lines of the work done in the EU-funded FP7 CHRIS project, or other available speech analysis software).
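A minimal sketch of this kind of word encoding is shown below: each word becomes a vector of letter-usage counts and the vectors of up to three descriptive words are superimposed. The normalization and vector length are illustrative choices and do not reproduce the exact scheme of [37].

import string
import numpy as np

ALPHABET = string.ascii_lowercase          # 26-dimensional letter space

def word_vector(word):
    """Encode a word by its letter usage counts (normalized)."""
    v = np.zeros(len(ALPHABET))
    for ch in word.lower():
        if ch in ALPHABET:
            v[ALPHABET.index(ch)] += 1.0
    return v / max(v.sum(), 1.0)

def phrase_vector(phrase, max_words=3):
    """Superimpose the vectors of up to three descriptive words (size-color-shape)."""
    words = phrase.split()[:max_words]
    return np.sum([word_vector(w) for w in words], axis=0)

v = phrase_vector("small red cube")        # input typically given by the teacher
print(v.round(2))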

Learning a Simple "Color-Word-Shape" Small-World Network

The information coming from the DARWIN sensory layer is projected bottom up to a set of growing self-organizing maps (SOMs) learning and representing object properties at a conceptual level. The first-level neural connectivity from the sensory layer to the property-specific SOMs is learnt using the basic SOM procedure [22, 44]. As we go higher up in the hierarchy (Fig. 1), the representations become more multimodal and there is greater integration of the information coming from the multiple layer 1 SOMs. Here, we need to go beyond standard self-organizing maps and introduce some novel concepts for learning the "layer 1 SOM to hub" (and, higher up, "hub to hub") neural connectivity. In general, hubs are also self-organizing maps, but they sit higher up in the processing hierarchy and serve two main purposes:

1. Facilitate multimodal integration of information arriving bottom up (from the sensory streams through the layer 1 SOMs).
2. Enable both "top-down" and "cross-modal" activation of the various "property-specific" maps during reasoning and resolution of contradictions.

As seen in Fig. 1, we distinguish between two kinds of hubs: "provincial hubs," which integrate neural activity coming from small sets of lower-level SOMs, and "connector hubs," which integrate information coming from provincial hubs. An analogy may be that of a team leader (who works on a specific problem with a group of 3–4 students) and the director of a department (who is the face of the organization to the external world). The connectivity between property-specific maps and hubs is developed using three additional rules, as described below.

1. Preferential Attachment: This idea is simple and just means that there is a tendency of individuals to "preferentially" connect to other highly connected "individuals" (instead of randomly connecting to anyone in the network). This has the net effect of reducing the path length between any two individuals in a large-scale complex network. It is well known from network theory that preferential attachment gives rise to growing scale-free networks with small-world properties [6], a feature prevalent in many real-world systems. In the initial attempts to create growing networks with small-world properties, preferential attachment of new nodes was directed toward existing "highly connected" individuals with greater "nodal degree," hence modeling a kind of "rich get richer" phenomenon. In this case, the previously existing nodes (or senior ones) have a clear advantage over newcomers. If this is the case, then how do newcomers make it in a world where the "rich get richer"? Realizing this issue, Barabasi [4] proposed a measure called the "fitness-connectivity" index, hence combining "fit gets richer" with "rich gets richer" to create growing networks. "Fitness" is generally "context dependent" and can be attributed to different factors based on the network in question (power grids, internet, air transport). Considering that "space and wiring" constraints play a crucial role in the emerging connectivity of the brain, we decided to have a gradient of "fitness" so as to promote layer 1 SOMs to preferentially connect to "provincial hubs" (and "provincial hubs" to "connector hubs"); a small sketch of such fitness-biased attachment is given after these three rules. In the biological case, we believe it is plausible that evolutionary pressures and genetic factors play a role in determining the "fitness" of cortical areas to promote preferential attachment.

2. Temporal Coincidence: This simply means that if neurons in different self-organizing maps are concurrently active (within a temporal window), then they get connected to each other, not directly but through the "provincial hub" in their territory. Note that being connected through the provincial hub (and not directly) ensures that there is both functional segregation (between different neural maps) and, at the same time, global integration. An analogy is two doctoral students working on their own problems, collaborating at times, and connected through a team leader: there is close contact and, at the same time, a level of local functional autonomy.

3. Dual Dyad Connectivity: If there are 3 nodes, then there are 13 ways to connect them (12 of which are shown in the left panel of Fig. 1). C. elegans is a tiny worm measuring about 1 mm whose brain (with about 302 neurons) has been exquisitely studied for almost three decades. Way back in 1985, the overabundance of "triangular sub-circuits" of a particular type, called the "dual dyad" (highlighted in Fig. 1), in the brain of C. elegans was noted by White [83], and this has been confirmed in several subsequent studies. More recently, analysis of the cat and macaque cortex has also revealed that "dual dyad" connectivity is found in significantly high proportions [69]. This implies that such connectivity comes with advantages (hence being retained by evolution). Guided by these studies, while connecting neurons belonging to different SOMs, we have retained the "dual dyad" type of connectivity. The computational gains of having such reciprocal connectivity between multiple maps will be demonstrated gradually in the following sections.
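Returning to rule 1, the following sketch illustrates fitness-biased preferential attachment: a new layer 1 map attaches to a hub with probability proportional to the product of the hub's current degree and a graded fitness that favors provincial hubs. The specific fitness values and the product rule are illustrative assumptions, not the exact DARWIN parameters.

import numpy as np

rng = np.random.default_rng(0)

# Candidate targets for a new layer 1 SOM: provincial hubs are given higher
# "fitness" than the connector hub, so attachment is biased toward them.
hubs = ["provincial_hub_1", "provincial_hub_2", "connector_hub"]
fitness = np.array([1.0, 1.0, 0.2])       # graded fitness (assumed values)
degree = np.array([3.0, 1.0, 6.0])        # current number of connections per hub

def attach_new_map():
    """Pick a hub with probability proportional to fitness x degree ('fit gets richer' + 'rich gets richer')."""
    score = fitness * degree
    p = score / score.sum()
    k = rng.choice(len(hubs), p=p)
    degree[k] += 1                         # the chosen hub becomes better connected
    return hubs[k]

for name in ["color SOM", "shape SOM", "word SOM"]:
    print(name, "->", attach_new_map())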

Figure 2 (left panel) shows activity in the color, word, shape, and provincial hub maps while learning the first steps of associating the names of different objects (given by the user) with their perceptual properties (color and shape processed bottom up through the sensory channels). The right panel results will be elaborated in the next subsection, after describing the global network dynamics of the "small world." In this subsection, we merely focus on how the connectivity between the various neural maps is developed (the connectivity relating "color," "word," "shape," and the provincial hub).

Let N be the number of neurons in any SOM and S be the dimensionality of the bottom-up input feeding the map.

Fig. 2  Left panel: Neural activity in the self-organizing maps related to color, word, shape, and the provincial hub while learning a simple "small-world network" that brings together all these functionally segregated neural maps (all driven bottom up by sensory channels) into a globally integrated system (at the level of the provincial hub, which is analogous to a team leader working with three graduate students). Five different cases are shown in the left panel. In every case, the robot is presented with a novel object coincident with a word sequence (provided by the teacher). Neurons in the different property-specific maps that have sensory weights closest to the incoming input signal start representing these signals (with their sensory weights gradually adapted as in standard SOMs). At the same time, connections are developed between the "property-specific" SOMs and the provincial hub due to preferential attachment and temporal coincidence (for example, the winner in the color SOM connects to the winner in the word SOM through the provincial hub by means of the dual dyad connectivity pattern). In the left panel, the activations in the word map and provincial hub are shown twice, as they correspond to activations in response to the individual components (color and shape); the net activation can be considered as the superposition of the activations resulting from the individual components (as in the right panel). Right panel: the local inferential powers of even this "small patch" of the DARWIN architecture. When someone mentions the words "red horse" or asks us to grasp a "black apple," most of us are able to anticipate what this new sequence of words may refer to. If a black apple is eventually placed in front of us, most of us would even grasp it, because we can anticipate top down what a novel object "could be," and if the bottom-up sensory input activates the neural maps in exactly the same way as the top-down flow, we can infer that this object is indeed the "black apple." Action networks could then be triggered to initiate the action, even though the goal was previously unheard of. The same scenario is replicated on the robot: the user inputs a new word sequence, and we observe how activity in the "word map" gradually retroactivates, in time, the complete network (in a very different combination from what was learnt: right panel). If a blue cube is indeed placed in front of the robot, "top-down" and "bottom-up" activity will resonate, allowing the robot to infer that the new object is indeed the blue cube that it has been commanded to grasp (Color figure online)


Then, the connectivity matrix has a dimensionality of N × S. Since we are dealing with multiple maps here, for clarity we denote by NC, NS, NW, and NPH the number of neurons in the color, shape, word, and provincial hub maps, respectively. Since the color, word, and shape SOM activity forms the bottom-up input to the provincial hub, the connectivity matrix of the provincial hub has a dimensionality of NPH × (NC + NS + NW). Since all SOMs are growing, N itself is a function of time and of the experience the robot acquires. For the illustration purposes of Fig. 2, the activity of 9, 9, 30, and 36 neurons in the color, shape, word, and provincial hub maps, respectively, is shown. Five different cases are shown in different rows; in each case the robot is presented with a new object followed by the teacher's linguistic input of what it is. In the first case, the robot is presented with a yellow cylinder. Color and shape are analyzed bottom up through the sensory layer and feed the respective SOMs with sensory vectors SC and SV, respectively. In the same temporal window of integration, the teacher inputs the word sequence "yellow cylinder." The two words are input in sequence, and the activity in the word SOM and provincial hub in response to the individual components (in this case, word 1 describing color and word 2 describing shape) is shown separately in Fig. 2 (left panel). The net activation in the word SOM and provincial hub can be visualized as the superposition of the individual activations (as in Fig. 2, right panel). The different sensory streams activate bottom up the various layer 1 neural maps, which initially have randomly initialized connectivity matrices. The layer 1 maps are trained in parallel using the standard SOM procedure, discussed in detail in numerous references (see [22, 44]). In short, this consists basically of two steps:

44]). In short, this consists basically of two steps:

1. Finding the neuron 'i' that shows maximum activity for the observed sensory stimulus St at time t. This also implies that neuron 'i' has sensory weights si such that ||si − St||² has the smallest value among all neurons existing in the respective SOM at that instant of time.
2. Adapting the sensory weights of the winner in a Hebbian fashion by bringing the sensory weights si of the winner 'i' closer to the stimulus St. This simply has the effect that in future instances neuron 'i' actively codes for the particular sensory stimulus St. In this way, neurons in the different property-specific maps of layer 1 that have sensory weights closest to the incoming sensory input vector start representing these signals.
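A minimal sketch of these two steps for a single growing map is shown below; the growth criterion (recruiting a new neuron when no existing weight vector is sufficiently close to the stimulus) and the learning rate are illustrative assumptions rather than the exact DARWIN settings.

import numpy as np

class GrowingSOM:
    """Bare-bones growing SOM: winner selection + Hebbian-style weight adaptation."""

    def __init__(self, dim, eta=0.2, grow_threshold=0.5):
        self.weights = np.empty((0, dim))   # one row of sensory weights per neuron
        self.eta = eta                      # learning rate
        self.grow_threshold = grow_threshold

    def present(self, stimulus):
        stimulus = np.asarray(stimulus, dtype=float)
        if len(self.weights) == 0:
            self.weights = stimulus[None, :].copy()
            return 0
        # Step 1: winner = neuron whose sensory weights are closest to the stimulus.
        dists = np.linalg.norm(self.weights - stimulus, axis=1)
        i = int(np.argmin(dists))
        if dists[i] > self.grow_threshold:  # nothing represents this input yet: grow
            self.weights = np.vstack([self.weights, stimulus])
            return len(self.weights) - 1
        # Step 2: pull the winner's weights toward the stimulus (Hebbian adaptation).
        self.weights[i] += self.eta * (stimulus - self.weights[i])
        return i

color_som = GrowingSOM(dim=3)               # RGB triads feed the color map
for rgb in [[1, 0, 0], [0.9, 0.1, 0], [0, 0, 1]]:
    print("winner neuron:", color_som.present(rgb))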

The net activity in the color, word, and shape SOMs forms the bottom-up input to the provincial hub. The connectivity is of the dual dyad type, and weights are adjusted in two identical steps, one relating to the "color–word" association and the other to the "shape–word" association. This is because the teacher is inputting a sequence of two words, the first related to color and the second to shape (of course, nothing prevents training the maps separately in a distributed organization scheme; for example, while showing a yellow paper and uttering the word "yellow," the shape map is simply switched off and learning takes place between the hub and the color SOM). However, we chose to train them together because the color and shape maps do not generally interfere with each other. Haptics has only recently been incorporated into the iCub humanoid (see the EU-funded RoboSkin project for details), and work is ongoing to exploit this modality; at present, vision is the main source of sensory information. The learning rule to connect the layer 1 SOMs with the provincial hub is as follows:

If neuron 'i' and neuron 'j', winning in the color and word SOMs, respectively, manage to activate neuron 'k' in the provincial hub, make Wik = 1 and Wjk = 1. This has the net effect of enabling neurons 'k', 'i', and 'j' in three different SOMs (operating on their own local sensory streams) to retroactivate each other in "bottom-up," "top-down," and "cross-modal" fashion. The same applies to adjusting the connectivity between the shape, word, and hub SOMs. The internal weights of the provincial hub can either have a random initialization, or a winner 'k' can be chosen randomly from the subset of neurons in the provincial hub whose internal weights are zero. The net effect is that in both cases there is some neuron in the "provincial hub" that responds to activity in two different SOMs processing different sensory streams. Activity in any map can then gradually trigger the whole network, hence enabling "pattern completion" from a partial cue.
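The rule can be sketched as follows: the winning color neuron i and winning word neuron j are both wired to the same provincial-hub neuron k, forming a dual dyad. The array sizes and the way a free hub neuron is recruited are illustrative assumptions.

import numpy as np

N_C, N_W, N_PH = 9, 30, 36                 # map sizes used for illustration in Fig. 2
W_color_hub = np.zeros((N_PH, N_C))        # hub <-> color SOM connections
W_word_hub = np.zeros((N_PH, N_W))         # hub <-> word SOM connections

def bind_via_hub(i_color, j_word):
    """Wire concurrently active winners i (color) and j (word) through one hub neuron k."""
    # Recruit a hub neuron that is still unused (all-zero weights), if any remain.
    unused = np.where((W_color_hub.sum(axis=1) + W_word_hub.sum(axis=1)) == 0)[0]
    k = int(np.random.choice(unused)) if len(unused) else int(np.random.randint(N_PH))
    W_color_hub[k, i_color] = 1.0          # Wik = 1
    W_word_hub[k, j_word] = 1.0            # Wjk = 1
    return k

k = bind_via_hub(i_color=2, j_word=17)
print("hub neuron", k, "now links color neuron 2 and word neuron 17")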

To start with, five objects (of different colors and shapes), together with their names, are taught to the robot. The activity in the various SOMs is shown in Fig. 2. The activity in the "word" and "provincial hub" maps is shown twice for clarity, because the teacher input consists of a sequence of two words. As we can see, in every layer 1 SOM different neural units start learning and representing different sensory stimuli. In the future, if a similar stimulus is projected bottom up, the neuron coding for it is reactivated. For example, as seen in Fig. 3 (right panel, row 1), showing just a red paper to the robot activates the neuron coding for the "red" sensory stimulus in the color SOM, processed bottom up through vision (and first experienced while the robot was presented with the red pyramid). At the same time, activity in the color, shape, and word SOMs is integrated at the level of the provincial hub by means of the "dual dyad" connectivity pattern. This also implies that showing a "red paper" to the robot should "cross-modally" activate the word representation "red" in the word SOM, even though in this case there is no word input from the teacher. This is indeed the case. In other words, mere perception of color "bottom up" is sufficient to retroactivate the global network learnt during past experience, but only the part associated with the particular "partial cue" perceived in the present context (in this case, there is no activation in the shape map). This behavior is very common in infants: show them a "dog," for example, and say the word "bow bow"; the next time the child sees a dog, we often see it playfully pointing to it with the word "bow bow." Studies in functional imaging go even further, providing evidence that a toy dog, a real dog, a cartoon, or just the word "bow bow" activates the global network as experienced during learning [49]. To further understand how this "cross-modal" and "top-down" retroactivation takes place when the network is triggered with a partial cue from the environment, we look into the global dynamics of the "small world," which is the topic of the next section.

Network Dynamics, Pattern Completion, and Modality Independence

In the proposed distributed small-world organization, even the simple "color-word-shape" network consisting of just four neural maps is endowed with its own "local" ability to "reason" in novel situations, to grow, and to resolve contradictions that may arise between what the system anticipates "top down" and what actually activates the system "bottom up." To achieve this, the small-world network has to be complemented with an equally powerful dynamics that allows neural activity in one map to retroactivate other relevant networks in "top-down," "bottom-up," and "cross-modal" fashion. The network dynamics builds upon the idea of neural fields [3] and supplements it with novel concepts like the introduction of the bifurcation parameter [53], which both brings in computational advantages and is biologically plausible (as will be discussed later).

Fig. 3  Left panel: four cases that demonstrate the compositionality, modality independence, and pattern completion properties exhibited by the "color-word-shape-provincial hub" sub-network composed of four neural maps. Right panel: an interesting scenario where the user issues the goal of reaching the "red container" (a novel word combination referring to an object that has never been encountered before). The evolving graphs at the top show the temporal evolution of activity in the different maps when given a "new word." The graphs at the bottom show two cases: bottom-up network activity (bifurcation parameter = 0) when a previously unseen object (a green container) is kept in front of the robot, and when another unseen object (a "red container") is placed in front of the robot. In the latter case, we can observe that "top-down" activity correlates with "bottom-up" activity, even though in both cases the object has never been encountered before (and was commanded using only linguistic input) (Color figure online)


Let hi be the activity of the ith neuron in the provincial hub and xprop be the activity of a neuron in any of the property-specific SOMs connected to the provincial hub (in this case the color, word, and shape SOMs). Let Wprop,hub encode the connections between the property-specific maps and the provincial hub. Basically, Wprop,hub is an NPH × (NC + NS + NW) matrix learnt as explained in the previous section; its transpose encodes the backward connectivity from the hub to the individual maps. The network dynamics of the hub neurons and of the neurons in the property-specific maps are governed by Eqs. (1) and (2), respectively:

τ_hub ḣ_i = −h_i + (1 − β) Σ_{i,j} W_prop,hub · x_prop + β · (Top-down)          (1)

τ_prop ẋ_prop = −x_prop + (1 − β) S_prop + β Σ_{i,j} W_hub,prop · h_hub          (2)

where

S_prop = (1 / √(2π σ_s)) · exp( −(s_i − S)² / (2 σ_s²) )

The instantaneous activation of any neuron in the hub or

the property-specific maps is governed by three different

components: The first term induces an exponential

relaxation to the dynamics. The second term is the net

feed-forward (or alternatively bottom-up) input. Since

property-specific maps are inputs to the provincial hub,

activity of neurons in the property-specific maps (Xprop

evolving through Eq. 2) drives the activity of hub neurons

(modulated by the connectivity matrix Wprop, hub). At the

same time, the sensory layer is the bottom-up input to the

property-specific maps. Since the property-specific maps are

trained using standard SOM procedure, a Gaussian kernel is

used to compare the sensory weight si of neuron i with

current sensor activations S in order to determine its bottom-

up activity. So while sensory layer drives property-specific

neural maps bottom up, the activity of the neurons in the

individual neural maps drives the provincial hub bottom up.

The third component is the top-down component: for the

property-specific SOMs, the top-down input comes from

the provincial hub to which they are connected. For the

provincial hub, the top-down component comes from the

connector hub (to which it will be linked using exactly

the same principles of preferential attachment and temporal

coincidence: this will become prominent in later sections

hence is just mentioned as ‘‘top down’’ in Eq. 1). So just like

the provincial hub activates the property-specific maps,

activity in the connector hub can activate the provincial hub

(which inversely acts as the bottom-up input to the

connector hub). Thus, there is always a bidirectional flow of information: as we move upwards, information becomes more multimodal and integrated; as we move downwards, it becomes more differentiated (to the level of basic properties that are sensed by the sensory layer). The top-down input is

also biased by a parameter ‘‘b,’’ called the bifurcation

parameter proposed originally in [53] that plays the role of

modulating ‘‘how much’’ of the neural activity in a specific

map is governed ‘‘top down’’ and how much ‘‘bottom up.’’

For example, if b = 0 in Eq. 2, the system operates only on

real sensory input and is not modulated by activity coming

from the provincial hub. Recent results from brain imaging

[8] have provided evidence for existence of such

dynamic switching between endogenous mental activity

and attention-driven exogenous activity mediated by

anterior insula (AI) and anterior cingulate cortex (ACC).

Computationally, the bifurcation parameter has several functions, the main one being the detection of contradictions between one's anticipation of a situation and what is actually perceived.

In simple terms, if the world does not behave the way we

anticipate it should, it may be better to attend to what is

happening in the real world and learn new things.
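To make the dynamics concrete, the following is a minimal Python sketch of Eqs. (1) and (2) for the ‘‘color-word-shape-hub’’ sub-network. The map sizes, time constants, random connectivity, and the simple Euler integration are illustrative assumptions rather than the values used on the robot; only the structure of the update follows the equations above.

import numpy as np

# Illustrative sizes (assumptions): color, shape and word SOMs plus one provincial hub
N_C, N_S, N_W, N_PH = 25, 25, 25, 50
N_PROP = N_C + N_S + N_W

rng = np.random.default_rng(0)
W_prop_hub = rng.random((N_PH, N_PROP))     # learnt forward connectivity (property maps -> hub)
W_hub_prop = W_prop_hub.T                   # backward connectivity is its transpose

tau_hub, tau_prop, dt = 10.0, 10.0, 1.0     # time constants and Euler step (assumed)
sigma_s = 0.5                               # width of the Gaussian sensory kernel

def bottom_up(sensory_weights, S):
    """Gaussian comparison of each SOM neuron's sensory weight s_i with the sensor input S."""
    d2 = np.sum((sensory_weights - S) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma_s ** 2)) / np.sqrt(2.0 * np.pi * sigma_s)

def step(h, x_prop, S_prop, topdown, b):
    """One Euler step of Eqs. (1)-(2); b is the bifurcation parameter in [0, 1]."""
    dh = (-h + (1.0 - b) * (W_prop_hub @ x_prop) + b * topdown) / tau_hub
    dx = (-x_prop + (1.0 - b) * S_prop + b * (W_hub_prop @ h)) / tau_prop
    return h + dt * dh, x_prop + dt * dx

# b = 0: the maps are driven purely by real sensory input (bottom up);
# b close to 1: activity is dominated by top-down retroactivation from the hubs.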

Perception as an Act of ‘‘Memory’’: Bottom Up Versus Top

Down

Figure 2 (right panel) shows an example of the application

of the network dynamics of the ‘‘color-word-shape’’ net-

work in novel situations. The user inputs a sequence of new

words ‘‘blue cube’’ (with no such object present in the

environment). As we see, the activity in the word SOM

gradually propagates to the provincial hub and eventually

activates the color SOM in a way that was learnt when the

robot was presented a ‘‘blue container’’ and shape SOM in

way that was learnt when the robot was presented a ‘‘green

cube’’. But the overall activity in the global system, that is,

provincial hub ? color-word-shape maps as a result of the

network dynamics triggered by the utterance of a new word

‘‘blue cube’’ resembles what the robot now anticipates that

a ‘‘blue cube’’ must be. If a blue cube is really kept in front

of the robot, bottom-up sensory input and top-down

anticipation will end up activating the same neurons in

every neural map (and resonance between top down and

bottom up is enough evidence to confirm that the novel

object placed in front of the robot is indeed a ‘‘blue cube’’).

Further, any motor behavior (reaching, grasping, trans-

porting) can be executed also on this novel object (men-

tioned just by linguistic input by the user).

Figure 3 (right panel) presents an interesting scenario

where a user now issues a goal to grasp the ‘‘red con-

tainer’’ (both a novel word and at the same time such an

object has never been encountered before). The graphs on

the top show the temporal evolution of activity in different

maps when given a ‘‘new word.’’ The graphs at the bottom


show two cases: network activity when a previously unseen

object (green container) is kept in front of the robot and

when another unseen object ‘‘red container’’ is placed in

front of the robot. In the latter case, we can observe that

‘‘top-down’’ activity correlates with ‘‘bottom up’’ even

though in both cases the object has been never encountered

before (and commanded just using linguistic input). Even if

the situation is novel, the robot is still able to execute a new

user command in the latter case (but at the same time in the

former case, the robot can infer that there is no ‘‘red con-

tainer’’ placed in front of it, hence quits the goal). The left panel of Fig. 3 shows four additional cases that demonstrate pattern

completion properties of the network. In sum, even in the

very basic network consisting of just four neural maps, the

results demonstrate three aspects:

1. How novel combinations of neural activity can emerge

by reconstructing relevant past experiences (relevant

means triggered by a partial sensory cue). Thus, perception is also seen as an act of memory and not as essentially driven bottom up.

2. Resonance between ‘‘top-down’’ anticipation and

‘‘bottom-up’’ sensation leads to inferential mecha-

nisms that can be used to drive goal-directed action

(here simple cases like reaching, grasping novel

objects, unheard words).

3. Contradictions between ‘‘top down’’ and ‘‘bottom up’’

can be used as a stepping stone to learn further and

grow the neural maps.

Detecting Contradictions: Switching to Attention-Driven

Exploration to Learn Further

A side effect of ‘‘top-down’’ and ‘‘bottom-up’’ activity

being projected on the same neural substrate is the auto-

matic detection of contradictions. This information is cru-

cial and can be used to generate saliency signals to bias

attention toward the anomaly and generate exploratory

behaviors to learn further (to resolve the contradiction).

Such mechanisms are important if the robot has to keep

learning ‘‘cumulatively’’ and gradually build up its under-

standing of how the world works. Results from neurosci-

ence [8] provide support for this idea and suggest that

anterior insula and anterior cingulate cortex play an important

role in the saliency detection network of the brain. Perhaps,

it is already evident when the user issues the goal to grasp

the ‘‘red container’’ (Fig. 3 right panel). Comparing the

top-down and bottom-up activity in different neural maps,

it is possible to infer that there is a container in the envi-

ronment but in the first case it is not of the right ‘‘color’’

that was requested by the user, while in the latter case, the

goal is realized. Further, the concept system is inherently

multimodal. Hence, in addition to mismatch between top

down versus bottom up, contradictions can also occur if

information coming from different modalities does not res-

onate with each other. The proposed computational model

also deals with such issues. Figure 4 presents some results.

When presented with a green sphere along with the word

green container inputted by the teacher, there is saliency in

the shape, hub, and word maps. Note that contradictions are

detected locally; in other words, the robot infers that there is

something green that correlated with the color perceived

visually and the word uttered by the teacher, but it also

infers that there is a contradiction between the shape and the word (i.e., between what it anticipates should be associated with the presented object and what the teacher calls it). In the second

example, all maps detect saliency. The same applies to the

third scenario where an absolutely new object is presented

to the robot. As seen from the activity in different neural

maps, there is no definitive winner (there are multiple

hypotheses, hence greater saliency). Saliency can also be thought of as a measure of how confused the system is, and this

applies ‘‘both’’ when there are ‘‘contradictions’’ and when

the system is operating in ‘‘novel situations.’’ Also note that

global saliency of a network is the cumulative sum of local

saliencies of individual members. The greater the global saliency, the greater the discomfort in the network and the greater the urgency to learn further. The net effect of saliency in terms

of the network dynamics is to lower the bifurcation

parameter, hence causing the switch from endogenous

mental simulations to attention-driven exogenous explora-

tion. Thus, contradictions can be seen as stepping stones to learn new things. More recently, interesting

results are emerging from neuroscience indicating that delusional behaviors in neurological disorders (like schizophrenia) result from an improper mixing of ‘‘top down’’ with ‘‘bottom up.’’ Against this background, the bifurcation parameter also has a biological basis and a significant role in switching the network dynamics between exogenous activity driven by the real world and endogenous mental simulations during reasoning about actions, resolving contradictions by either learning more or reconciling one's beliefs with what has newly been experienced.
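As a rough illustration of this switching mechanism, the sketch below computes a local saliency per neural map as the mismatch between top-down and bottom-up activity, accumulates it into a global saliency, and uses it to lower the bifurcation parameter. The normalized L1 mismatch and the linear mapping from saliency to b are assumptions made for illustration; the text above fixes only the qualitative relationship.

import numpy as np

def local_saliency(topdown_act, bottomup_act):
    # Mismatch between what a map anticipates and what it senses
    # (assumed measure: half the L1 distance between the normalized activity patterns)
    td = topdown_act / (np.sum(topdown_act) + 1e-9)
    bu = bottomup_act / (np.sum(bottomup_act) + 1e-9)
    return 0.5 * np.sum(np.abs(td - bu))      # 0 = resonance, 1 = full contradiction

def global_saliency(maps_topdown, maps_bottomup):
    # Cumulative sum of the local saliencies of the individual maps in the small world
    return sum(local_saliency(td, bu) for td, bu in zip(maps_topdown, maps_bottomup))

def bifurcation_parameter(g_sal, n_maps, b_max=0.8):
    # Greater global saliency -> smaller b -> switch from endogenous mental simulation
    # to attention-driven exogenous exploration (illustrative linear mapping)
    return b_max * max(0.0, 1.0 - g_sal / n_maps)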

A Body Schema for Cognitive Robots: ‘‘Why

and What’’

Why do cognitive robots need a body schema? For the same

reason for which a human or a chimp needs it: simply put,

without one, it would be unable to use its ‘‘complex body,’’

take advantage of it, and ultimately survive. In general, for

an organism with a complex body inhabiting an unstruc-

tured world, the purpose of ‘‘Action’’ is not just restricted to


shaping motor output to generate movement but also to

provide the self with information on the feasibility, conse-

quence, and understanding of ‘‘potential actions’’ (that

could lead to realization of ‘‘goals’’). As described in the

introduction, mounting evidence from neuroscience sub-

stantiates the fact that neural circuits in predominantly

motor areas are activated in many contexts related to action

that do not cause any overt movement. Hence, overt actions

are just the tip of an iceberg: under the surface is hidden a

vast territory of ‘‘actions without movements’’ (covert

actions), which is the essence of motor cognition. But as in the

iceberg metaphor, there must be continuity between what is

above and what is below the surface: the link, we suggest, is

the body schema mechanism. The issue of body schema is

not as popular in cognitive robotics in comparison with the

concept of embodiment. These are not the same things. If

you have a body schema, you also have embodiment but not

the other way around. Vernon et al. [74] in their discussion

on a roadmap for cognitive development in humanoid

robots present a catalog of cognitive architectures, but in

none of them is the concept of body schema a key element.

Hoffmann et al. [34] review this concept in robotics,

emphasizing the gap between the idea and its computational

implementations. A biologically plausible computational

formulation of ‘‘body schema’’ based on the idea of passive

motion paradigm (PMP: [55]) was developed by Mohan and

Morasso [51] and implemented on the 53 DoF humanoid

iCub. The model has been successfully applied to the iCub

humanoid in a number of contexts related to action. We

refer the interested reader to Mohan and Morasso [51] for

detailed formal analysis and applications in the context of

whole body coordination, skill learning, and tool use.

Considering that the body schema also contributes toward

goal-directed reasoning and embodied simulation, in this

section, we briefly introduce the central ideas to the level of

detail necessary to build up sections on action development

and reasoning.

As seen in Fig. 5a, the body schema is characterized by

different body parts and end points (hands, legs, etc.)

available for connection in the context of ‘‘goal.’’ The links

between body parts are associated with a number of

degrees of freedom (highly redundant in a complex body).

Fig. 4 Ability to detect and resolve contradictions is built in at every

local network. While the results in Fig. 3 show how contradictions can be inferred due to a mismatch between ‘‘top down’’ and ‘‘bottom up,’’ Fig. 4 presents results where contradictions are caused by a

mismatch between information coming from different sensory

modalities. Simply put, show any infant a potato and say that it is an apple, and it should naturally be surprised. The first two examples show

similar situations with the robot. When presented with a green sphere

along with the word green container inputted by the teacher, there is

saliency in the shape, hub, and word maps. Note that contradictions

are detected locally; in other words, the robot infers that there is

something green that correlated with the color perceived visually and

the word uttered by the teacher, but it also infers that there is a contradiction between the shape and the word (i.e., between what it anticipates should be associated with the presented object and what the teacher calls it). In the second example, all maps detect saliency. In this sense,

global saliency of a network is the cumulative sum of local saliencies

of individual members. The greater the global saliency, the smaller the bifurcation parameter and the greater the urgency to learn by switching to

attention-guided exploration (Color figure online)


In simple terms, the idea behind PMP is that such a schema

can be animated by attaching/detaching ‘‘force fields’’ to

one or more body parts in a ‘‘task-specific’’ fashion. The

animation process is analogous to coordination of a mari-

onette by means of attached strings: as the puppeteer pulls

the task-relevant effector to a target (or along a specific

trajectory), the rest of the body elastically reconfigures so

as to allow the motion to be simulated internally. The idea

is that such simulation process can characterize both

‘‘covert and overt’’ actions. The PMP framework has features

that distinguish it from other leading approaches in com-

putational motor control, like the optimal feedback control framework and the equilibrium point hypothesis (this has been

discussed in detail in a recent review Mohan and Morasso

[51]). PMP networks are assembled on the fly and operate

in a local, distributed, multi-referential, and goal-directed

fashion. Figure 5b shows the PMP network coordinating

the upper body (left arm-waist-right arm) chain of iCub,

which is relevant for tasks addressed in this paper. As seen,

the network is grouped into the different motor spaces

involved (in this case end effector, arm, and waist). Each

motor space consists of a displacement (blue) and force

node (pink) grouped as a work unit. Vertical links (purple)

within each work unit denote the impedance (stiffness K

and admittance A), while horizontal links (green) between

two work units denote the geometric transformation

between them (Jacobian: J). Note that the links do not carry

information, like in a block diagram, but a combination of

force and motion, that is, (computational) energy. There

are two additional nodes ‘‘sum’’ and ‘‘assignment’’ that add

or assign (forces or displacements) between different ‘‘sub-

networks’’ (in this case the connection between the two arms and

the waist). The resulting network is fully connected, with connectivity articulated in such a fashion that all transformations are

‘‘well posed.’’ This reduces computational cost because it

circumvents the need for kinematic inversions and cost

function computation (as in optimal control approaches).

Starting from an equilibrium state, goals (when switched

‘‘ON’’) basically inject virtual elastic energy in the net-

work, eliciting a reconfiguration of the internal DoFs to a

new equilibrium. The goal can be a point attractor (like in reaching) or a moving point attractor (virtual trajectory) as in

the case of handwriting, tool use, etc., where specific

motion trajectories have to be created using the desired end

effector/tool. The dynamics of the network evoked by the

activation of a goal (xT) is equivalent to integrating non-

linear differential equations that, in the simplest case of just

the right arm network and with no additional task-specific

constraints, takes the following form:

$$\dot{x} = J\, A_r\, J^T K_r\, (x_T - x_r) \qquad (3)$$

Whenever a network like that in Fig. 5b is triggered with a goal,

we get four sets of trajectories (as a function of time): (1)

trajectory of joint angles given by the position node in the

joint space (arm and waist); (2) the resulting consequence,

that is, the trajectory of end effectors given by the position

node in end effector space (hands, tools, etc.); (3) the tra-

jectory of torques at the different joints (arm and waist),

given by the force node in the joint space; (4) the resulting

consequence, that is, the trajectory of forces applied by the

end effector given by the force node in the end effector

space. Hence, PMP networks naturally form forward/

inverse models (we always get the motor commands to

coordinate a redundant body and at the same time get the

resulting consequence). If motor commands obtained by

this process of PMP simulation are relevant in the context of

the goal, they can be fed to the actuators and the robot

reproduces the movement. Otherwise, the information

related to consequence of the action predicted by the for-

ward model serves as valuable ‘‘internal event’’ for goal-

directed reasoning. It is here that PMP diverges from

Equilibrium point hypothesis. In EPH, the attractor

dynamics that underlies production of movement is attrib-

uted to the elastic properties of skeletal neuromuscular

system. But this contradicts with emerging results from

neuroscience that both real and imagined actions activate

similar neural substrates in the motor cortex, importantly

covert actions not activating the neuromuscular apparatus.

PMP on the other hand posits that even real actions are a

result of an internal simulation, using similar attractor

dynamics like posited by EPH but at a cortical level. This

could explain the similarity of real and imagined move-

ments because, although in the latter case the attractor

dynamics associated with the neuromuscular system is not

operant, the dynamics due to the interaction among other

brain areas are still at play. If actions generated by anima-

tion of the body schema are perceived to be ‘‘useful,’’ they

can be executed (as motor commands are always synthe-

sized at the intrinsic space in any simulation). Otherwise,

prediction of the forward model is a crucial event to drive

goal-directed reasoning. In this sense, PMP can be consid-

ered a generalization of EPH from action execution (‘‘overt

actions’’) to action planning and reasoning about actions

(‘‘covert actions’’). At the same time, it solves the degrees

of freedom problem [7] and abstracts the complexity of the

‘‘body’’ to the higher-level cognitive networks. As action-

related goals are ‘‘switched on,’’ what we get by the ani-

mation of the body schema is both the motor commands to

‘‘execute’’ a specific movement (inverse model) and at the

same time information on ‘‘feasibility, consequence, and

usefulness’’ of potential movements (forward model).
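A minimal sketch of a PMP-style relaxation for a planar two-joint arm is given below, in the spirit of Eq. (3). The link lengths, gains, and Euler integration are illustrative assumptions rather than the iCub implementation; the point is that one relaxation loop simultaneously yields the joint commands (inverse model) and the predicted end-effector trajectory (forward model).

import numpy as np

L1, L2 = 0.3, 0.25                     # link lengths of an assumed planar 2-DoF arm
K = np.eye(2) * 10.0                   # stiffness of the virtual force field attached to the goal
A = np.eye(2)                          # joint admittance
dt, steps = 0.01, 1500

def fwd_kin(q):
    return np.array([L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
                     L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])])

def jacobian(q):
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [ L1 * c1 + L2 * c12,  L2 * c12]])

def pmp_relax(q0, x_target):
    q, traj_q, traj_x = q0.astype(float), [], []
    for _ in range(steps):
        x = fwd_kin(q)
        F = K @ (x_target - x)         # virtual elastic force injected by the goal
        J = jacobian(q)
        q_dot = A @ (J.T @ F)          # inverse model: motor commands for the joints
        x_dot = J @ q_dot              # forward model: predicted end-effector velocity (Eq. 3)
        q = q + dt * q_dot
        traj_q.append(q.copy()); traj_x.append(fwd_kin(q))
    return np.array(traj_q), np.array(traj_x)

goal = np.array([0.35, 0.25])
qs, xs = pmp_relax(np.array([0.3, 0.6]), goal)
reach_error = np.linalg.norm(xs[-1] - goal)   # small if the goal is feasible; a large residual
                                              # error signals "not reachable" for reasoning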

Since force fields are additive, multiple task-specific

constraints can be incorporated into the PMP relaxation at

run time through superposition of multiple force fields. A

constraint in the extrinsic space could be an obstacle to

avoid, achieving the proper wrist pose to perform an action


(for example, while grasping or pushing an object); con-

straints in the intrinsic space mainly relate to taking into

account the limited range of motion of a joint, the joint

power, etc. This issue has been dealt with formally with

several experiments on iCub in a recent article [51] and

hence is not reiterated here. To summarize, we can see PMP

as a mechanism of multiple constraints satisfaction, which

solves implicitly the ‘degrees of freedom problem’ without

any fixed hierarchy between the extrinsic and intrinsic

spaces. The constraints integrated in the system are task-

oriented and can be modified at run time as a function of

performance and success.

Fig. 5 a A graphical representation of the body schema with various

end effectors available for connection with tools, force fields, and

targets. b The network implementation of such a body schema to

coordinate upper body of the humanoid iCub: Work unit: Force node

(pink) plus a displacement node (blue); Geometric causality repre-

sented by Jacobians (green), Elastic causality represented by Admit-

tance and Stiffness (light blue), Branching nodes (black), Timing

signal (yellow). The goal can be point attractor (like in reaching) or

moving point attractor (virtual trajectory) like in the case of pushing,

handwriting, use of tools, etc. The application of the goal causes

incremental elastic reconfigurations in the network analogous to the

coordination of a marionette with attached strings. Panels c–e show

the initial condition, end effector trajectories and the final solution

when the network of panel b is used to generate a bimanual reaching

action coordinating the upper body of the robot. This is a multi-

referential system of action representation and synergy formation,

which integrates a Forward and an Inverse Internal Model (Color

figure online)


The relaxation implied by the PMP model does not require the target to be fixed. It works

as well with moving targets. In this case, the ‘‘attractor

field’’ becomes an ‘‘attracting wave,’’ with a moving equi-

librium point (also in the case of pushing an object to a goal

location as we will see in the next section). In human

experiments also, there is some evidence of moving equi-

librium points in perturbation experiments [65].
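In code, a moving equilibrium point simply means updating the target inside the relaxation loop; a self-contained illustration (with an assumed linear timing signal and a point mass standing in for the end effector) is sketched below.

import numpy as np

# A moving point attractor ("attracting wave"): the equilibrium point x_T(t) travels along
# a virtual trajectory and the end effector x elastically relaxes toward it at every instant.
dt, K = 0.01, 8.0
x = np.array([0.0, 0.0])
start, goal = np.array([0.1, 0.0]), np.array([0.4, 0.2])
trajectory = []
for t in np.arange(0.0, 2.0, dt):
    s = min(t / 1.5, 1.0)                   # assumed linear timing signal (a minimum-jerk
                                            # time base could be used instead)
    x_T = (1.0 - s) * start + s * goal      # moving equilibrium point along the virtual trajectory
    x = x + dt * K * (x_T - x)              # first-order elastic relaxation toward it
    trajectory.append(x.copy())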

Connecting Object, Action, and the Body: Learning

About Action and Simulating Action

In an embodied framework, ‘‘Actions’’ are mediated

through the ‘‘Body’’ and directed toward ‘‘Objects’’ in the

environment. Playful interactions with objects give rise to

sensorimotor experience, learning, and ability to reason.

Thus, the need to connect ‘‘object,’’ ‘‘action,’’ and the

body/body schema. The scheme is shown in Fig. 6 and

directly builds up on the ‘‘object’’ related ‘‘small world’’

created in ‘‘Naming Games: Learning About Objects and

Simulating Perception’’ section. Note that there is a subtle

separation between representation of actions at an abstract

level (‘‘what all can be done with an object/tool’’) and the

procedural memory related to the action itself (‘‘how to

do’’). While the former relates to ‘‘affordance’’ of an

object, the latter relates to the ‘‘skill’’ of using an object.

The abstract layer forms the ‘‘connector hub’’ and consists

of single neurons coding for different actions like reach,

grasp, push, use of different tools, etc., and grows with time

as new skills are learnt. Single neurons in the connector

hub in turn have the capability to trigger the procedural

memory network responsible for generating the action they

code for. The connector hubs in the object space and the action space are connected, and all the connections are meant to develop by experience.

Fig. 6 The scheme builds upon Fig. 2 by adding new networks

related to ‘‘action’’ and ‘‘body schema.’’ The connectivity and

information flows between ‘‘object’’-, ‘‘action’’-, and ‘‘body’’-related

networks are shown. The information flow is inherently bidirectional

and characterized by ‘‘dual dyad’’-type connectivity. There is a subtle

separation between representation of actions at an abstract level and

the procedural memory network related to the action itself. The

abstract layer forms the ‘‘connector hub’’ in the action space and

consists of single neurons coding for different actions at an abstract

level (like reach, grasp, push, tool use, etc.). The abstract action layer

is similar to ‘‘canonical neurons’’ found in the pre-motor cortex that

are activated at the sight of objects to which specific actions are

applicable. Note that these single neurons do not code for the action

itself but instead have the capability to trigger the complete procedural

memory network responsible for generating the plan to execute the

concern action. All connectivity between various networks is learnt by

explorative sensorimotor experience. Specific functions of various

layers are summarized in the figure (Color figure online)


Connectivity is of the ‘‘dual dyad’’

type, hence allowing bidirectional flow of information

between different neural maps. As a simple example, con-

sider that an object is presented to the robot. Assume for

the sake of discussion that the robot has some past expe-

rience with it too. Then, information from the sensory layer

activates various property-specific maps, their provincial

hubs, finally leading to distributed activity in the object

connector hub (which is a multimodal representation of the

object: like in Figs. 2, and 3). Assuming that the robot

already had experience of performing different actions on

this object, the activity in the ‘‘connector hub’’ of the action

layer basically codes for what all high-level ‘‘actions’’ can

be done with this object (the more the robot learns, the more possibilities there are to exploit an object). In this sense,

single neurons in the top level ‘‘action connector hub’’ are

similar to ‘‘canonical neurons’’ found in the pre-motor

cortex (of monkeys and humans) that are activated at the

sight of objects to which specific actions are applicable. At

the same time, the detailed knowledge itself is learnt/rep-

resented in specialized procedural memory networks which

are triggered by neurons at the action connector hub.

The neural connectivity between top-level object and

action hubs is learnt gradually as the robot tries out and

learns what actions are possible with an object. This can be

due to explorative interaction (for example, a small red

cylinder may be reached, grasped, pushed and observed to move in a specific way, etc.) or by observing and imitating a teacher, as

while learning to maneuver various tools [52]. Importantly,

motor repertoire is gradually built up by interacting with

various objects. As a last step, actions have to be ultimately

executed by the body and for this we must synthesize the

motor commands in the task-relevant body chain. This is

accomplished by the link between the ‘‘procedural memory

layer’’ and ‘‘body schema.’’ Action plans (or virtual tra-

jectories) synthesized by the procedural memory networks

serve as attractors to the ‘‘body schema,’’ hence triggering

the PMP simulation in the task-relevant body network. As

an example, if the task is to rotate a lever, the desired

trajectory of motion in the extrinsic space is planned by the

procedural memory network. This acts as a moving point

attractor to the task-relevant body network of the PMP (for

example, the right hand-waist chain). PMP simulation

gives out the motor commands which if sent to the actua-

tors produces the desired action. Basic actions like reach,

grasp, directed search through vision, and use of one tool (a toy

crane) to pick up unreachable objects are presently func-

tional with a reasonable level of accuracy. So to go even

deeper inside the scheme presented in Fig. 6, in the next

section, we describe how the robot learns a new and fairly

important multipurpose action ‘‘pushing’’ (and inversely

learning to predict how objects move when forces are

applied on them).

The Pushing Sub-Network

‘‘Pushing’’ is an interesting action investigated extensively

in studies related to understanding of ‘‘physical causality’’

in primates and infants [77, 84]. In addition to the multiple

utilities of the ‘‘push/pull’’ action itself in manipulation

tasks, what makes it significant is the sheer range of

physical concepts that have to be ‘‘learnt’’ and ‘‘abstrac-

ted’’ in order to execute this action successfully. For

example, it has to be learnt that contact is necessary to

push, object properties influence pushability (balls roll

faster than cubes, etc.), pushing objects gives rise to path of

motion in specific directions (the inverse applies for goal-

directed pushing), pushing can be used to support grasping,

bringing objects into proximity, and there can be counterforces

that block the pushed object (similar to a goal keeper). The

requirement to capture/learn such a wide range of physical

concepts through ‘‘playful interactions’’ with different

objects makes this task both interesting and challenging.

Different objects move in different ways when force is

exerted on them, and some do not move at all. By interacting with

various objects, the goal of the robot is to learn a general

forward/inverse model for ‘‘pushing action’’: that is, being

able to predict how an object will move when pushed

(forward model) and being able to generate goal-directed

pushing actions in order to displace an object to a desired

location. Figure 7 zooms into the push sub-network as

connected to the rest of the system (other neural maps,

hubs) and body schema (PMP). To begin, when presented

with any ‘‘object,’’ different property-specific neural maps

are activated bottom up leading to a distributed represen-

tation of the concerned object in the object connector hub

(as described in ‘‘Naming Games: Learning About Objects

and Simulating Perception’’ section). Since object proper-

ties influence pushing, activity in the object connector hub

influences the pushing forward/inverse model and hence is

bidirectionally connected to it (connectivity learnt by

experience). As seen in Fig. 7, the pushing system is rep-

resented using two neural maps: one that is a growing SOM

learning ‘‘average displacement of an object per unit force’’

and the second that represents a distributed coding of

direction in which the object is moving (there is ample

evidence from studies in neuroscience that such directional

coding exists in the brain and serves many purposes). We

shall justify the choice of this representation shortly, but

before that we wish to summarize the process by which the

robot gains experience (which basically precedes repre-

sentation and learning).

Figure 8 (left panel) zooms further into the information

flows and connectivity that is learnt while playing with just ‘‘one’’ object. The pushing SOM is empty to start with and

gradually grows as the robot interacts with various objects.

The new elements that need to be learnt are the connections


between the object connector hub and neurons in the

pushing SOM (‘‘W’’) and the internal weight (Pi) of each

neuron that represents the ‘‘average displacement per unit

force’’ of the object it is representing. The former relates to

perception (as information comes from sensory channels to

activate the connector hub) while the latter relates to action

(or effect of force on the displacement of an object). Note

that both these learnt elements (W and P) complement each

other, that is,

1. Observing a motion, the robot must be able to

anticipate which object it is (by activating ‘‘top down’’

the object connector hub and hence the property-

specific SOMs and the name of the object) and

2. Inversely if it is necessary to ‘‘push’’ a given object to

a target location, the robot must be able to estimate the

force it needs to exert in which direction to realize the

goal. For every novel object the robot is interacting

with, the neural connectivity is learnt as per the

following steps:

1. Growth in the Pushing SOM: To start with, the robot is presented with an ‘‘object.’’ This leads to ‘‘bottom-up’’

activation of different perceptual maps ultimately culmi-

nating in a distributed representation of the concerned

object in the object connector hub (for novel unencoun-

tered objects, learning and growth take place at the per-

ceptual level too as described in ‘‘Naming Games:

Learning About Objects and Simulating Perception’’ sec-

tion based on saliency). For every novel object, we grow

one neuron in the pushing SOM which codes for the

influence of object properties in relation to the ‘‘Pushing action.’’

Fig. 7 Left panel zoom into the push sub-network as connected to the

rest of the system (object connector hub) and body schema (PMP).

When presented with an object, the different property-specific neural

maps are activated bottom up leading to a distributed representation

in the object connector hub (as described earlier). Since the user

goal is to learn to push, the push sub-network is activated (empty to

begin with as there is no experience or knowledge). The push sub-

network is represented using two neural maps: one that is a growing

SOM learning ‘‘average displacement of an object per unit force’’ and

the second that represents a distributed coding of direction in which

the object is moving. The former map is empty to start with and

gradually grows as the robot interacts with different objects, growth only taking place if there is a contradiction between ‘‘the robot's

anticipation of how an object might move’’ and ‘‘how it actually

moves in reality.’’ All connections indicated with ‘‘L’’ are learnt from

scratch. Information flow is bidirectional, meaning that it is

possible to move to the object hub from the pushing action network,

and if the connector hub is active, it is possible to trigger the property-specific maps (as seen in the previous sections). The right panel shows how

goal-directed pushing actions are generated through incremental

iterations between the pushing SOM and the direction between the

pushed object and the goal. This gives rise to a virtual trajectory that

serves as an attractor to the action generation system (see text for

details) (Color figure online)


More details on subtleties regarding growth in the

pushing SOM will follow in the next step.

2. Learning Connectivity Between Object Connector

Hub and Pushing SOM (W): Let ‘‘i’’ be the neuron in the

pushing SOM instantiated to represent the behavior of a

novel object presented and xj be the instantaneous activity

of the jth neuron in the connector hub (determined as per

dynamics of Eqs. 1 and 2). Let Wji be the connection

between the jth neuron in the connector hub and the ith neuron (i.e.,

the new neuron) in the pushing SOM. Then, the learning

rule is as follows: for all active neurons in the connector

hub, that is, if xj > xThreshold, make Wji = 1. For all cases,

we took xThreshold as 0.86. This has the net effect that any

time in future if either the same object or a similar object is

presented, it will end up activating the ith neuron in the

pushing SOM. The word ‘‘similar’’ is relevant here

because note that activity in the ‘‘connector’’ hub itself is a

result of activity in the property-specific maps. So if the

robot has experienced and learnt how a red cube behaves

when pushed and then later it is presented with a ‘‘blue

cube,’’ still the neuron in the pushing SOM that learnt the

‘‘red cube’’ will be active and will be in a position to

‘‘anticipate’’ the behavior of the novel object. This is

because both ‘‘red cube’’ and ‘‘blue cube’’ will activate

some common neurons in the connector hub (because of

similarity), and the activity of such common neurons is

sufficient to activate neurons in the pushing SOM (because

of connectivity learnt in the past). This implies that by

interacting with the red cube, the robot also has some

capability to ‘‘predict’’ how a blue cube might move when

pushed. If the top-down prediction is the same as the observed behavior, there is no contradiction between anticipation and observation, and hence no need to grow the pushing SOM further: the neuron coding for the red cube also codes for the blue cube. Growth in the pushing SOM takes place

when there is a contradiction between ‘‘anticipated’’ and

‘‘observed’’ behavior (simply this indicates that the robot

has either no information or incorrect information about the

object being manipulated).
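A compact sketch of this growth-and-wiring rule is given below. The 0.86 activity threshold and the ‘‘wire all active hub neurons to the new pushing neuron’’ rule follow the text; the data structures and the overlap-based activation measure are illustrative assumptions.

import numpy as np

X_THRESHOLD = 0.86       # hub activity threshold used for wiring (value taken from the text)

class PushingSOM:
    def __init__(self, n_hub):
        self.n_hub = n_hub
        self.W = np.zeros((0, n_hub))   # one row per grown neuron: hub -> pushing-SOM wiring
        self.P = np.zeros(0)            # internal weight: average displacement per unit force

    def activations(self, hub_activity):
        # A pushing neuron responds when the hub neurons it was wired to are active again;
        # similar objects share hub neurons, so they reactivate the same pushing neuron.
        if self.W.shape[0] == 0:
            return np.zeros(0)
        cue = (hub_activity > X_THRESHOLD).astype(float)
        overlap = self.W @ cue
        wired = self.W.sum(axis=1) + 1e-9
        return overlap / wired          # 1.0 means all wired hub neurons are currently active

    def grow(self, hub_activity, displacement_per_unit_force):
        # Called only when top-down anticipation contradicts the observed behavior
        w_new = (hub_activity > X_THRESHOLD).astype(float)   # W_ji = 1 for all active hub neurons
        self.W = np.vstack([self.W, w_new])
        self.P = np.append(self.P, displacement_per_unit_force)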

3. Learning Average Displacement of an Object Per

Unit Force (P): So far, we described the learning of bot-

tom-up connectivity between connector hub and pushing

SOM that basically code for perceptual properties of the

object the robot is interacting with. Next we need to learn

the ‘‘action’’-related effects that are coded by the internal

weight Pi for the ith neuron in the pushing SOM. To learn

‘‘Pi’’ that basically estimates ‘‘displacement of an object

per unit force exerted on it,’’ the robot has to act on the

object. So the robot is allowed to exert force on different

objects in different directions randomly (see Fig. 8), at the

same time visually observing the consequence, that is, the

displacement of the object as a result of exerting unit force.

Unit force is approximated as an ‘‘attempted’’ movement of

the deployed end effector by 5 cm. We clarify this ‘‘attempted’’ movement step by step below, because it is nontrivial and relates to explorative action generation using the PMP mechanism.

Fig. 8 Bottom panel some examples of the robot gaining experience. In

addition to different kinds of objects, different end effectors were also

used for pushing the same objects (right hand, left hand, and a ‘‘long

stick’’ as an extension of the arm). Such diverse experience is needed

to learn that ‘‘end effectors’’ used also do not really matter as far as

the causal behavior of the object is concerned’’ using a novel learning

rule that will be proposed in the next section. Further use of tools as

an extension to the arm to push or pull a food reward is one of the

widely investigated cases of tool use in animal behavior (Color figure

online)


So firstly, the robot reaches the object of

interest (using the network of Fig. 5b that coordinates the

upper body). If reaching is successful (error between goal

and forward model prediction is zero), a virtual trajectory

(a straight line in this case) is synthesized from the current

position of the end effector to a point 5 cm away in a

randomly chosen direction (we chose 8 directions, that is,

with 45° separation, as seen in Fig. 7). This virtual trajec-

tory acts as a moving point attractor to the end effector

performing the explorative action, hence causing the end

effector to follow it like the pull of the puppeteer that PMP

mechanism computationally emulates. While the robot

executes this action, it is also physically interacting with

the object it had reached previously and the intrinsic

properties of the object now begin to influence how suc-

cessful the end effector is in following the virtual trajec-

tory. For objects like ‘‘balls’’ as the end effector moves, the

object of interest goes quite far away (perceived and

localized through vision), small cubes move rather uni-

formly in correspondence with the force exerted, for some

other objects like a heavy box, there is relatively small

displacement and so on.

Hence, ‘‘displacement per unit force’’ basically mea-

sures the ‘‘mobility’’ of the object when a certain amount of

force is exerted on it. Inversely, this information allows the

robot to predict how an object will move when force is

exerted on it (useful while generating goal-directed push-

ing). For every object presented, the robot is allowed to

explore displacing it to a distance of 15 cm (i.e., 3 itera-

tions of application of unit force) in eight different direc-

tions. Averaging the result of this experience, the

parameter Pi for the neuron ‘‘i’’ coding for a particular

object is estimated. Cubes, cylinders, and balls of different

colors and sizes (some heavy ones) encountered by the

robot previously while learning their names (‘‘Naming

Games: Learning About Objects and Simulating Percep-

tion’’ section), and a few MECCANO blocks (from the MECCANO 2+ kit for 2-year-olds) were presented gradually.

Figure 8 (right panel) shows some examples of the robot

gaining experience. Different end effectors were also used

for pushing the objects (right hand, left hand and also a

‘‘long stick’’ as an extension of the arm). Such diverse

experience is needed to learn that the end effector used does not really matter as far as the causal behavior of the pushed object is concerned (the same is the case for color,

but properties like shape and size do matter). This issue of

additionally learning what are the ‘‘causally dominant

properties’’ relevant in a particular task is work in progress.

Still, while gaining experience, we opted to subject the

robot to a diverse set of experiences so that the acquired

sensorimotor data can be utilized to explore other questions

(the discussion section on ongoing work deals with these

issues).
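The estimation of Pi from explorative pushes can be sketched as follows. The 5 cm ‘‘unit force’’ step, the 8 directions, and the 3 iterations per direction follow the text; the push_and_observe callback is an assumed stand-in for the robot actually pushing and visually localizing the object.

import numpy as np

UNIT_FORCE_STEP = 0.05      # "unit force" ~ an attempted 5 cm displacement of the end effector
N_DIRECTIONS = 8            # explorative pushes at 45 degree separation
N_ITER = 3                  # three unit-force iterations per direction (~15 cm attempted)

def estimate_P(push_and_observe):
    """push_and_observe(theta) -> observed displacement (m) of the object after one attempted
    unit-force push in direction theta; assumed to wrap the PMP execution and visual tracking."""
    displacements = []
    for k in range(N_DIRECTIONS):
        theta = k * (2.0 * np.pi / N_DIRECTIONS)
        for _ in range(N_ITER):
            displacements.append(push_and_observe(theta))
    # Average displacement per unit force: the learnt "mobility" of this object
    return float(np.mean(displacements)) / UNIT_FORCE_STEP

# Intuition: a ball that rolls ~12 cm per 5 cm push gives P ~ 2.4, a small cube that moves in
# correspondence with the hand gives P ~ 1, and a heavy box that barely moves gives P << 1.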

4. Pushing in a goal-directed fashion: Finally, how does

the distributed coding of direction and a growing SOM

learning ‘‘average displacement of an object per unit force’’

generate goal-directed pushing? Consider that the robot has

to push some object from its initial position to a desired

location.

The sequence of network activations is illustrated in the

right panel of Fig. 7. We clarify the loop in detail below.

Given an object, firstly, there is a distributed representation

of the object in the connector hub. Activity in the connector

hub triggers the neurons in the Pushing SOM (through the

connectivity matrix ‘‘W’’) that code for the learnt behavior of

the object. In case there is no activity in the pushing SOM, it

implies that the object has not been experienced before and

we go back to step 1 (note that we also go back to step 1 in

another case too, that is, if there is a contradiction between

the predicted and observed behavior of the object when

pushed, because this indicates that more exploration is nee-

ded). This is the static part (bottom up) of the goal-directed

pushing phase (because the object being pushed remains the

same, within the scope of the issued goal). The dynamic/

interactive phase is a closed loop between ‘‘perception–

prediction–action’’ and consists of the following sequence:

(4a) Detect and localize the current position ‘X(x,y,z)’ of

the object and the target ‘XT(xT,yT,zT)’ (where the

object has to be displaced). This process involves

detection of the object through vision (i.e., what) and

3D reconstruction of the location in the egocentric

frame of reference of the robot (i.e., where). The 3D

reconstruction algorithm to localize the object and

the target position is based on direct linear transform

[66] and has been learnt through a motor babbling

process (see [54] for details).

(4b) Compute the desired direction ‘‘h’’ to push using

information on X and XT; this activates the neurons

in the motor map responsible for directional coding.

Based on the instantaneously computed direction,

we also see distributed activity in the 8 neurons coding

for different directions (see Fig. 9 top panel for 3

such cases).

(4c) If Ai is the activation of the ith neuron in the pushing

SOM and Pi is the internal weight representing

‘‘displacement per unit force’’ learnt by the ith

neuron, then the average predicted mobility of the

object for an incremental iteration where unit force

is applied on it can be computed as $P = \sum_i A_i \cdot P_i$.

(4d) Compute the next ‘‘virtual target: VT’’ where the

end effector must be such that the object moves to

the predicted location P. To start with, VT is


initialized as the starting location of the object being

pushed, that is, X(x,y,z), but diverges later in time as

seen in Fig. 9 (because VT also depends on the

mobility of the object being pushed). We can

decompose it into three components along the x, y,

and z axes (z component same as the initial condition

as pushing is learnt/executed on a planar surface

with infinitesimal effect of gravity):

$$VT_x = VT_x + (1/P)\cos(\theta), \quad VT_y = VT_y + (1/P)\sin(\theta), \quad VT_z = X(z) \qquad (4)$$

(4e) Compute the incremental predicted displacement of

the object if the end effector is displaced to a

location estimated by the virtual trajectory as per

Eq. 4. Go to step (b) till the time the predicted

location of the object is close to the goal (<1 cm).

Instead, we may also choose to feed the next

computed position of the virtual target to the PMP

system to move the designated arm to the next

incremental location, go to step (a) of visual tracking

and continue. However, it is computationally

expensive to involve vision in each incremental

step (consider a football player taking a penalty kick: he sees the goal post, synthesizes a trajectory,

and executes the kick. With robots we do have the

chance to go back to step (a) and recompute again).

So we chose to iterate steps ‘‘b–c–d–e’’ till the time

the predicted location of the object is close (<1 cm)

to the target.

$$X_x = X_x + P\cos(\theta), \quad X_y = X_y + P\sin(\theta), \quad X_z = X(z) \qquad (5)$$

In this way, we basically move from a ‘‘virtual target’’ to

a ‘‘virtual trajectory.’’ Secondly, the predicted end effector

position need not follow the virtual trajectory as it also

depends on the mobility of the object itself (i.e., the learnt

parameter P). This is clearly shown in the three different

cases of Fig. 9. While pushing a ball, the end effector

needs to be displaced just by a small amount along an

estimated virtual trajectory (green trajectory) with an

expectation that it should be enough to send it to the

target location. Cubes move more uniformly with the

displacement of the end effector and have to be pushed to

the destination. For large and heavy objects (we took a box

with a bottle of water inside it), as seen in Fig. 9, the

planned virtual trajectory goes beyond the goal position

because much greater force needs to be exerted (or P ≪ 1).

In sum, the loop ‘‘b–c–d’’ generates two sets of trajectories:

(1) the predicted trajectory in which the object will move

toward the goal and (2) the desired motion of the end

effector (or the virtual trajectory) to generate the action.
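The perception–prediction–action loop of steps (4b)–(4e) can be sketched as follows; the loop structure and the updates of Eqs. (4) and (5) follow the text, while the function signature and tolerances are illustrative assumptions.

import numpy as np

def plan_push(X, XT, P, tol=0.01, max_iter=500):
    """X, XT: current and target 3D object positions in the robot's egocentric frame;
    P: learnt average displacement per unit force of the object being pushed."""
    X, XT = np.asarray(X, dtype=float).copy(), np.asarray(XT, dtype=float)
    VT = X.copy()                                 # virtual target starts at the object position
    virtual_traj, predicted_traj = [VT.copy()], [X.copy()]
    for _ in range(max_iter):
        if np.linalg.norm(XT[:2] - X[:2]) < tol:  # predicted object position close to the goal (<1 cm)
            break
        theta = np.arctan2(XT[1] - X[1], XT[0] - X[0])   # step (4b): desired pushing direction
        VT[0] += (1.0 / P) * np.cos(theta)               # step (4d): next virtual target (Eq. 4)
        VT[1] += (1.0 / P) * np.sin(theta)
        X[0] += P * np.cos(theta)                        # step (4e): predicted object motion (Eq. 5)
        X[1] += P * np.sin(theta)
        virtual_traj.append(VT.copy()); predicted_traj.append(X.copy())
    return np.array(virtual_traj), np.array(predicted_traj)

# The virtual trajectory is then fed as a moving point attractor to the PMP body chain: for a
# mobile ball (large P) only a short push is planned, while for a heavy box (small P) the
# planned virtual trajectory extends well beyond the goal position, as in Fig. 9.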

Fig. 9 Right panels the virtual trajectories (attractors) and real

trajectories during goal-directed pushing of a cube, ball, and a large

container. Activity in the neurons responsible for distributed coding

of direction during the synthesis of the motor actions is shown at the

top for all three cases (see text for details). While pushing a ball, the

end effector needs to be displaced just by a small amount along an

estimated virtual trajectory (green trajectory), much like kicking a

football to the goal. Cubes move more uniformly with the displace-

ment of the end effector and have to be pushed gradually to the

destination. For large and heavy objects (a box with a bottle of water

inside it), the planned virtual trajectory goes beyond the goal position

because much greater force needs to be exerted (Color figure online)


The final step (e) is to synthesize the motor commands and

execute the action. For this, we feed the virtual trajectory

synthesized in step (d) as an attractor to the relevant body

chain of the PMP system (Fig. 5b). Motor commands syn-

thesized are transmitted to the actuators to generate the

movement of pushing (along a smooth planned trajectory)

with the designated end effector. Perceive the consequence

through vision (steps a–b) to evaluate whether the object is

close to the target (most often it is). The process (a–e) is like

smooth sliding of an object along a planned trajectory

(sometimes less force is needed like in the case of the ball, but

sometimes we need to take it to the destination with a lot of

force like the large container like depicted in Fig. 9). Note

that, time also is implicitly represented in Fig. 9. This is

because every ‘‘dot’’ in the virtual trajectory represents iter-

ation in time: the virtual trajectory is very short while pushing

the ball, almost uniform while pushing the small cube and

longer when pushing the large container. This resonates with

our own experience: to slide a ball to a target location takes

less time than pushing a heavy cylinder (because intrinsic

properties of the object like shape and mass influence the

mobility of an object).

To summarize, in ‘‘A Body Schema for Cognitive

Robots: ‘‘Why and What’’’’ and ‘‘Connecting Object, Action

and the Body: Learning About Action and Simulating

Action’’ sections, we started with the description of the body

schema for DARWIN robots and its various functions within

the cognitive architecture. The small-world framework for

distributed organization and learning of object concepts was

extended in a way to connect ‘‘object, action, and the body

schema’’ retaining top-down/bottom-up information flows

and small worldness. Finally, we showed how a forward/

inverse model for pushing is learnt by the robot by explor-

ative interactions with different objects in its playground.

The inverse problem of generating motor actions to push a

given object to a desired location was also summarized with

results. The system always generates two kinds of trajecto-

ries (for a large set of experienced objects) (Fig. 9): (1) the

virtual trajectory to push the object to a desired location that

basically acts as a moving point attractor to the body schema

to synthesize motor commands for the body chain per-

forming the action and (2) the predicted trajectory in which

the object will move as a consequence of pushing. The latter

is also a crucial piece of information for goal-directed

reasoning.

Simulating ‘‘Perception’’ and ‘‘Action’’ in the Context

of a ‘‘Goal’’

Since the DARWIN robot has gradually developed basic

capabilities to perceive objects, name them (with the help

of teacher/user input), generate primitive actions (like

reach, grasp, push, etc.), and anticipate the sensory motor

consequences of such actions, recently we introduced the

robot to ‘‘make and break’’ tasks using the MECCANO 2+ toy kit (a toy set for 2- to 3-year-old infants). The toy kit

basically consists of various building blocks using which

‘‘composite’’ objects can be assembled. In this section, we

present a rather simple scenario from the MECCANO

assembly task to illustrate how simulation of ‘‘perception

and action’’ enables the robot to generate a novel combi-

nation of actions in order to realize an otherwise unreal-

izable goal. As seen in Fig. 10a, the task is to insert

‘‘object 1’’ (the face attached to a screw) into ‘‘object 2’’

(that has a hole where the screw can be inserted), to

assemble a new composite toy. The standard sequence

consists of two actions: pick up object 1 and insert it into

object 2. However, standard sequences apply to ‘‘well-

defined’’ environments (like a fully programmed industrial

‘‘pick and place’’ set up). In an unstructured world, the

complexity of the environment under which the goal needs

to be realized plays a significant role in the causal sequence

of actions a cognitive agent must generate to realize its

goals. Standard sequences very often may not work.

• Firstly, there is a need to infer this without blindly

executing the standard/default action plan.

• Secondly, in such cases, cognitive agents must effec-

tively use their past experiences to go beyond experi-

ence and generate novel behaviors to realize the goal

or learn something new if unsuccessful. A simple

scenario of this kind and how simulation of perception

and action enables the robot to infer how its world

should be ‘‘causally’’ transformed such that it becomes

a little bit more conducive toward realization of its goal

is the subject of discussion in this section.

As seen in Fig. 10a, both objects are randomly placed at

different locations (inside the visual workspace of the stereo

cameras). To begin with, given any goal, the robot first

visually explores its locally available environment to gather

information about the objects that are present and what all

can be done with them. This process basically involves

focusing attention on various objects, activating bottom up

the various neural maps (related to color, shape, etc., in

Fig. 6) ultimately leading to a distributed representation of

the object in the connector hub (indicated as ‘‘what is it’’ in

Fig. 6). Figure 10e shows the running loop of visual pro-

cessing related to identification and localization of objects in

the scene as action takes place. Neural activity in the object

connector hub in turn causes activations in the single neu-

rons in the action hub coding for various motor actions

experienced with the object in the past (indicated as ‘‘what

all can be done with it’’ in Fig. 6). As mentioned in ‘‘A Body

Schema for Cognitive Robots: ‘‘Why and What’’’’ section,

neurons in the action hub are like the canonical neurons found in the pre-motor cortex that are activated at the sight of objects to which specific actions are applicable.


Fig. 10 a–d The task is to insert ‘‘object 1’’ (face) into ‘‘object 2’’ (body),

to assemble a new composite object. b The first 3 virtual actions using the

network of Fig. 5b. In simulations 1 and 2, the robot infers that though the

‘‘face’’ is directly reachable with the right arm, the ‘‘blue body’’ is located so

far that inserting it will not be successful. At the same time, the left arm

network is not coupled to any goal, so is available as a ‘‘tool’’ that could be

exploited. Coupling part 2 as a ‘‘goal’’ to the available left arm, the robot can

infer that it is indeed reachable by the left arm. Exploiting the knowledge of

pushing (learnt in the past and a feasible action here), the robot infers that if

part 2 is slowly displaced close to the ‘‘face,’’ it then becomes reachable by

the right hand and hence allowing the possibility of realizing the goal (2c) in

such an altered world. d The full combination of real and virtual actions that

basically enable the robot to infer how the world can change through ones

actions hence make it more conducive toward realization of its internal

goals. e–j The sequence of actions initiated by the robot to realize the goal

along with perceptual feedback (Color figure online)

Cogn Comput (2013) 5:355–382 377

123

Page 24: Inference Through Embodied Simulation in Cognitive Robotsmeeden/... · representation’’ and cognition is gradually bootstrapped through a cumulative process of learning by interaction

found in the pre-motor cortex that are activated at the sight of

objects to which specific actions are applicable. While the

‘‘object 1’’ affords reach and grasp actions, the ‘‘object 2’’

affords reach and push action (the robot has indeed experi-

enced pushing the blue MECCANO blocks and learnt a

forward/inverse model of how the object moves when force

is exerted on it: see Fig. 8). This information is stored in the

working memory. At present, the working memory of

DARWIN robot is fairly simple and keeps track of objects in

the world, their spatial locations (in the egocentric frame of

reference), feasible actions on the objects (activity in the

action hub), and status of utilization of body parts (mainly

end effectors that couple to goals during manipulation).

Such a WM structure does suffice for simple scenarios in the

early stages of development of the robot and we hope to

expand the WM further in the future in line with recent

developments [56].
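To make the description of the working memory concrete, the sketch below shows, in Python, one minimal way such a snapshot could be organized. The class and field names (WMObject, feasible_actions, effector_in_use, free_effectors) are hypothetical illustrations of the entries listed above, not the actual DARWIN data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class WMObject:
    # One entry per attended object: identity, egocentric location,
    # and the actions the action hub indicates as feasible on it.
    label: str
    location_egocentric: Tuple[float, float, float]    # x, y, z in metres
    feasible_actions: List[str] = field(default_factory=list)

@dataclass
class WorkingMemory:
    objects: Dict[str, WMObject] = field(default_factory=dict)
    # Status of utilization of body parts (end effectors coupled to goals).
    effector_in_use: Dict[str, bool] = field(
        default_factory=lambda: {"left_arm": False, "right_arm": False})

    def free_effectors(self) -> List[str]:
        return [e for e, busy in self.effector_in_use.items() if not busy]

# Example content for the assembly scenario described in the text.
wm = WorkingMemory()
wm.objects["object_1"] = WMObject("face+screw", (0.30, -0.20, 0.05), ["reach", "grasp"])
wm.objects["object_2"] = WMObject("blue body", (0.45, 0.35, 0.05), ["reach", "push"])
print(wm.free_effectors())
```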

Once the information about the available world is captured and stored in the WM, the robot initiates internal simulation of the default plan. Using the body schema network for the iCub upper body (Fig. 5b), the robot internally simulates the standard sequence of assembly (i.e., picking up the face and inserting it into the body) and its resulting consequence (given by the forward model). These are virtual actions 1–2 of Fig. 10b. As seen from the simulated right-hand trajectories (1–2), the robot infers that although the "face" is directly reachable with the right arm, the "blue body" is located so far out of reach that inserting the face into it will not be successful. This leads to the inference that the goal cannot be directly realized (there is a large error between the attempted goal and the consequence predicted by the forward model). At the same time, the left arm network is not coupled to any goal, so it is available as an additional degree of freedom (or tool) that can be used. Coupling object 2 as a goal to the left arm network, the robot can infer that object 2 is indeed reachable by the left arm (virtual action 3). Information in the working memory indicates that pushing is a feasible action supported by object 2 (and was experienced in the past: Fig. 8). Now the robot exploits its forward/inverse model of pushing to infer how the world will change if object 2 is incrementally pushed (with the left hand). While the inverse model gives the motor commands to generate goal-directed pushing, the forward model gives the resulting consequence (the predicted location of the object as a consequence of pushing). The predicted consequence of pushing is shown as virtual action 4 in Fig. 10c. The result of this simulation is an "imagined environment" that allows the goal to be realized (simulated action 5 shows that the screw can indeed be assembled to the body in such a modified environment). In sum, simulated actions 1–5 allow the robot to infer that while the default plan will not work, it is indeed possible to causally transform the world such that it becomes more conducive toward realizing the goal at hand. Several subsystems involved in perception and simulation of perception, the body schema, action-related forward/inverse models, and task-specific working memory structures play a synergetic role in leading to this inference. Figure 10d shows the full combination of real (shown in yellow) and simulated actions. The robot basically uses the left hand to slide the "body" close to the "face," picks up the face with its right hand, and inserts it into the body, hence assembling a composite object and realizing the goal. Figure 10e–j show snapshots of the real actions executed by the robot.
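The inference chain just described can be summarized as a loop of simulated actions checked against predicted consequences. The following Python sketch is only an illustration of that loop under strong simplifying assumptions: forward_model_reach and forward_model_push are hypothetical stand-ins for the learnt body-schema and pushing models, and planar positions replace the robot's real kinematics.

```python
import numpy as np

REACH_RADIUS = 0.45   # assumed reachable workspace radius per arm (metres)
TOLERANCE = 0.02      # acceptable error between a goal and its predicted consequence

def forward_model_reach(shoulder, target):
    """Stub forward model of reaching: predicted hand position after a simulated
    reach, clipped to the arm's workspace (stands in for the body schema network)."""
    shoulder, target = np.asarray(shoulder, float), np.asarray(target, float)
    offset = target - shoulder
    dist = np.linalg.norm(offset)
    return shoulder + offset / max(dist, 1e-9) * min(dist, REACH_RADIUS)

def forward_model_push(obj_pos, goal_pos, step=0.05):
    """Stub forward model of pushing: the object is displaced incrementally toward
    the pushing goal (stands in for the learnt pushing forward model)."""
    obj_pos, goal_pos = np.asarray(obj_pos, float), np.asarray(goal_pos, float)
    direction = goal_pos - obj_pos
    dist = np.linalg.norm(direction)
    return obj_pos + direction / max(dist, 1e-9) * min(step, dist)

def reach_error(shoulder, target):
    # Error between the attempted goal and the predicted consequence of reaching.
    return np.linalg.norm(forward_model_reach(shoulder, target) - np.asarray(target, float))

def plan_by_simulation(face, body, right_shoulder, left_shoulder):
    """Virtual actions 1-5: simulate the default plan; if it fails, imagine pushing
    the body into reach with the free left arm and re-simulate the default plan."""
    if reach_error(right_shoulder, face) > TOLERANCE:
        return None                                   # even the face cannot be picked up
    if reach_error(right_shoulder, body) < TOLERANCE:
        return ["right: pick face", "right: insert into body"]   # default plan works
    if reach_error(left_shoulder, body) > TOLERANCE:
        return None                                   # no free effector can act on the body
    plan, imagined_body = [], np.asarray(body, float)
    for _ in range(50):                               # bounded number of imagined pushes
        if reach_error(right_shoulder, imagined_body) < TOLERANCE:
            return plan + ["right: pick face", "right: insert into body"]
        imagined_body = forward_model_push(imagined_body, face)
        plan.append("left: push body toward face")
    return None

print(plan_by_simulation(face=(0.30, -0.20), body=(0.35, 0.30),
                         right_shoulder=(0.0, -0.15), left_shoulder=(0.0, 0.15)))
```

With the illustrative geometry above, the returned plan contains a few imagined pushes with the left hand followed by the default pick-and-insert with the right hand, mirroring the combination of real and virtual actions of Fig. 10d.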

Summing up, the ability to reason and to orchestrate thought and action in accordance with internal goals, especially when inhabiting an unstructured environment, is a fundamental feature of any kind of cognitive behavior. By coherently integrating information from the bottom up (sensory, motor) and the top down (memories of past learnt experiences, simulations of various internal models, etc.), cognitive agents often manage to swiftly exploit possibilities afforded by the structure of their immediate environment to counteract limitations (of perception, action, and movement) imposed by their bodies. This scenario is a simple example from the initial phase of the DARWIN developmental curve that demonstrates the power of embodied simulation in relation to the generation of goal-directed action in unstructured environments.

Concluding Remarks

Affordances are the seeds of "action." Identifying and exploiting them opportunistically in the "context" of an otherwise unrealizable goal is a sign of cognition. The ability to mentally manipulate the causal structure of their physical interactions with their environments endows cognitive agents with the capability to evaluate "what additional affordances" they can create in the world. This in turn enables them to infer how the world must "change" such that it becomes a little more conducive toward realization of their goals. A major part of this process of transformation "from affordance to action," and its inverse, is a result of "inferences emerging through embodied simulation." Experiments related to learning about objects and actions, and the underlying computational basis that enables DARWIN robots to demonstrate a preliminary level of embodied intelligence, were presented in this article. The developmental curve of the DARWIN robot started with simple tasks like learning to associate the names of objects (presented to it by the teacher) with their perceptual properties ("Naming Games: Learning About Objects and Simulating Perception" section). The underlying computational framework incorporated several recent findings related to the large-scale functional organization of the cortex, "small-world" properties, and "dual dyad"-type connectivity, and was powered by network dynamics that ensure "bottom-up, top-down, and cross-modal" activation of the various growing neural maps. The computational advantages are numerous: functional segregation and global integration, minimization of processing steps, efficient wiring (thus ensuring low metabolic cost), synchronizability, pattern completion, and conflict resolution. We showed using various examples (Figs. 2, 3, 4) how learning about objects in parallel endowed the robot with the capability to form "anticipations" about novel objects, generate primitive actions (like reach and grasp) on them, detect novelty and contradictions, and trigger new learning and growth.
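The "small-world" claim can be made quantitative with standard graph measures. The sketch below is not the DARWIN code but a generic illustration using Watts–Strogatz surrogate graphs: a small-world network retains the high clustering of a regular lattice while approaching the short characteristic path length of a random graph.

```python
import networkx as nx

def small_world_stats(graph):
    """Clustering coefficient and characteristic path length of a connected graph."""
    return nx.average_clustering(graph), nx.average_shortest_path_length(graph)

# Surrogate graphs: regular ring lattice (p=0), small-world rewiring, random (p=1).
n, k = 60, 6
for p in (0.0, 0.1, 1.0):
    g = nx.connected_watts_strogatz_graph(n, k, p, seed=1)
    c, l = small_world_stats(g)
    print(f"rewiring p={p:.1f}: clustering={c:.3f}, path length={l:.2f}")
```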

The need for cognitive robots to have a "flexible" body schema, and a possible computational implementation of it, was presented (Fig. 5). The PMP formalism gives rise to "growing" network implementations (forward/inverse models) of the body (and tool) being coordinated that, importantly, operate in a local, distributed, multi-referential, and goal-directed fashion. We suggest that the computational model for the body schema developed in DARWIN basically acts like a powerful "middleware" that interacts both with lower-level motor execution layers (which deal with the complexity/redundancy of the body being coordinated: the degrees of freedom problem of Bernstein [7]) and with higher-level reasoning and cognitive layers (which deal with the complexity of the world and the goals that have to be accomplished, that is, when to do what with which effector/tool and what the resulting consequence is). In addition, it provides a shared computational basis for the "execution, imagination, and understanding" of action, for which there is resounding neurobiological evidence. Action development in the DARWIN robots started with primitive actions like pointing, reaching, and grasping (still in progress), recently extended to drawing/scribbling [54] and learning to use simple tools by imitation [52]. The motor repertoire was further extended in this paper by learning a forward/inverse model of an important multipurpose action, namely "pushing" (Figs. 6, 7, 8, 9). Both the forward problem of anticipating how an object will move when force is exerted on it and the inverse problem of generating a goal-directed pushing action in order to displace an object to a desired location were learnt. Note that an expanding body schema network basically blurs the distinction between the tool and the body during goal-directed coordination, as known from experiments on animals using tools [40, 72], and resonates with the PMP framework [51]. Learning to use other tools commonly found in both industrial and domestic environments (for extension of reach, amplification of force, coupling objects, etc.) is scheduled in the next phase of the developmental learning curve of the DARWIN robots. Before moving toward these issues, our main emphasis was to close the loop between perception, action, and reasoning in a "robust" manner (taking inspiration from biology).

An interesting feature in the proposed framework for connecting "object, action, and body" is the fact that as we move higher upwards, information becomes more and more integrated and multimodal, and as we move downwards, information becomes more and more differentiated (down to the level of sensed properties). The underlying connectivity and dynamics ensure that activations in any neural map can trigger a complete network (both higher up in the hierarchy, like hubs, and below, like property-specific maps and procedural memory networks for specific actions). We believe that this feature essentially allows us to go beyond "object-action" to "property-action." In other words, it enables the robot not only to learn which actions apply to various objects and what their consequences are (as shown in this paper) but also to learn "which properties are causally dominant" while pursuing specific goals and actions. As an example to clarify: to wipe off a spider web in the topmost corner of a room, it does not matter whether someone uses a red-colored broom, a yellow-colored broom, or even a long stick. Any object that has the relevant property, "length," will suffice. Similarly, the colors of objects do not affect the way they move when pushed; shape does, and size too (in specific ways, based on recent ongoing experiments with the robot). Objects really do not matter; it is their properties that matter in the context of realization of various goals. Note that while we speak about properties, the robot is basically interacting with "objects" in the world. While the robot plays with objects gradually in time, how can it also learn and "pin down" which properties are causally dominant in a particular task by comparing multiple such playful interactions? Going beyond object-action while learning and interacting with objects (gradually in time), we believe, has fundamental significance in terms of analogical reasoning. Humans excel in making analogies, and in many ways, it is the essence of their creativity [35, 36]. Even a simple stone may be used as a weapon, as a paperweight, as a blockage to obstruct flow, as a building block of a house, and so on. Objects do not really matter; it is their properties that do, and this allows them to be exploited for different purposes in different circumstances. Though approaches for analogical reasoning exist in the literature [38, 45], they lack an embodied framework, hence limiting their reach in common unstructured worlds where neither can every object be experienced nor is everything known precisely. If a novel object has a property that supports a particular action in the context of a "goal," the robot must certainly attempt to opportunistically exploit it. If it succeeds, what we will see is behavior that is "novel" and "creative." A property-specific distributed organization of perception and action, further endowed with small-world properties, pushes for an "embodied approach" to analogical reasoning, and experiments are ongoing in this direction in the context of numerous tasks like pushing, learning to build the tallest stack given a random set of objects, use of tools for assembly, etc. These issues will be a subject of discussion in DARWIN-related articles in the near future.
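The kind of comparison across playful interactions hinted at above can be caricatured as follows: the robot could score how strongly each recorded property co-varies with the outcome of a given task and retain the properties with the highest score as "causally dominant." The sketch below is a deliberately crude illustration under assumed synthetic data (colour, shape, size, success are invented variables); a real implementation would sit on top of the growing maps and hubs rather than on a flat table of scores.

```python
import numpy as np

# Each interaction records the object's properties and whether the task
# (e.g., a stable push to a target) succeeded. Colour is irrelevant by
# construction; shape and size matter.
rng = np.random.default_rng(2)
n = 300
colour = rng.integers(0, 3, n)            # 0=red, 1=yellow, 2=blue
shape  = rng.integers(0, 2, n)            # 0=rounded, 1=flat-sided
size   = rng.uniform(0.02, 0.10, n)       # metres
success = (shape == 1) & (size < 0.08)    # hidden rule the robot must discover

def dominance(prop, outcome):
    """Absolute correlation between a property and the task outcome: a crude
    score of how 'causally dominant' that property appears across interactions."""
    prop = np.asarray(prop, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    return abs(np.corrcoef(prop, outcome)[0, 1])

for name, values in [("colour", colour), ("shape", shape), ("size", size)]:
    print(f"{name:6s} dominance ~ {dominance(values, success):.2f}")
# Expected: colour near 0, shape and size clearly above 0 -> "properties, not objects".
```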


To conclude, as noted by the neuroscientist Ramachandran [60], with people from several disciplines tinkering around with different open problems, cognitive science research is right now entering an exciting "Faraday era" in terms of discovering the general principles related to the structure and function of the "three-pound jelly" that, in the first place, makes all this tinkering and discovering possible. Using a humanoid robot equipped with state-of-the-art sensory and motor capabilities, playing, learning, and reasoning in a moderately complex (and changing) world, we have attempted to develop and better understand the computational principles necessary to drive a cognitive robot to exhibit a preliminary level of purposefulness, flexibility, and adaptability in its behavior. The results presented are indeed from the early stages of the developmental curve of the DARWIN robot, and certainly, vast oceans lie undiscovered in the quest to better understand the "forces and the causes" that shape our "reasons and actions" and make us as explorative, intuitive, cognitive, expressive, emotional, irrational, unpredictable, and conscious as we really are! This often drove the curiosity of Professor Taylor, and looking for a computational basis for cognitive computation grounded in the biology of the brain would be the only fitting homage offered by younger generations to keep the essence of his teachings alive!

Acknowledgments The research presented in this article is supported by IIT (Istituto Italiano di Tecnologia, RBCS Department) and by the EU FP7 project DARWIN (http://www.darwin-project.eu, Grant No. FP7-270138). We are indebted to the anonymous reviewers for their detailed analysis and suggestions to make the draft sharper and more reader friendly. The authors also acknowledge the support of all teams involved in the DARWIN consortium.

References

1. Addis DR, Schacter DL. The hippocampus and imagining the future: where do we stand? Front Hum Neurosci. 2012;5. Article 173.
2. Addis DR, Pan L, Vu MA, Laiser N, Schacter DL. Constructive episodic simulation of the future and the past: distinct subsystems of a core brain network mediate imagining and remembering. Neuropsychologia. 2009;47:2222–38.
3. Amari S. Dynamics of pattern formation in lateral-inhibition type neural fields. Biol Cybern. 1977;27:77–87.
4. Barabasi A-L. Linked: the new science of networks. Boston: Perseus Books; 2003. ISBN-10: 0738206679.
5. Barabasi A-L. The network takeover. Nat Phys. 2012;8:14–6.
6. Barabasi A-L, Albert R. Emergence of scaling in random networks. Science. 1999;286:509–12.
7. Bernstein N. The coordination and regulation of movements. Oxford: Pergamon Press; 1967.
8. Bressler SL, Menon V. Large-scale brain networks in cognition: emerging methods and principles. Trends Cogn Sci. 2010;14(6):277–90.
9. Buccino G, Binkofski F, Fink GR. Action observation activates premotor and parietal areas in a somatotopic manner: an fMRI study. Eur J Neurosci. 2001;13:400–4.
10. Buckner RL, Carroll DC. Self-projection and the brain. Trends Cogn Sci. 2007;2:49–57.
11. Buckner RL, Andrews-Hanna JR, Schacter DL. The brain's default network: anatomy, function, and relevance to disease. Ann N Y Acad Sci. 2008;1124:1–38.
12. Bueti D, Walsh V. The parietal cortex and the representation of time, space, number and other magnitudes. Philos Trans R Soc B Biol Sci. 2009;364(1525):1831–40.
13. Caeyenberghs K, van Roon D, Swinnen SP, Smits-Engelsman BC. Deficits in executed and imagined aiming performance in brain-injured children. Brain Cogn. 2009;69(1):154–61.
14. Chiel HJ, Beer RD. The brain has a body: adaptive behavior emerges from interactions of nervous system, body and environment. Trends Neurosci. 1997;20:553–7.
15. Clark A. Being there: putting brain, body and world together again. Cambridge: MIT Press; 1997.
16. Damasio A. Self comes to mind: constructing the conscious brain. New York: Pantheon; 2010.
17. Decety J. Do imagined and executed actions share the same neural substrate? Cogn Brain Res. 1996;3:87–93.
18. Decety J, Sommerville J. Motor cognition and mental simulation. In: Kosslyn SM, Smith E, editors. Cognitive psychology: mind and brain. New York: Prentice Hall; 2007. p. 451–81.
19. Desmurget M, Sirigu A. A parietal-premotor network for movement intention and motor awareness. Trends Cogn Sci. 2009;13:411–9.
20. Feldman J. From molecule to metaphor: a neural theory of language. Cambridge, MA: MIT Press; 2006.
21. Frey SH, Gerry VE. Modulation of neural activity during observational learning of actions and their sequential orders. J Neurosci. 2006;26:13194–201.
22. Fritzke B. A growing neural gas network learns topologies. In: Tesauro G, Touretzky D, Leen T, editors. Advances in neural information processing systems 7. Cambridge, MA: MIT Press; 1995. p. 625–32.
23. Gallese V, Lakoff G. The brain's concepts: the role of the sensory-motor system in reason and language. Cogn Neuropsychol. 2005;22:455–79.
24. Gallese V, Sinigaglia C. What is so special with embodied simulation. Trends Cogn Sci. 2011. http://www.unipr.it/arpa/mirror/pubs/pdffiles/Gallese/2011/tics_20111007.pdf.
25. Georg Stork H. Towards a scientific foundation for engineering cognitive systems—a European research agenda, its rationale and perspectives. Biol Inspired Cogn Archit. 2012;1:82–91. doi:10.1016/j.bica.2012.04.002.
26. Glenberg AM. What memory is for. Behav Brain Sci. 1997;20:1–19.
27. Glenberg A, Gallese V. Action-based language: a theory of language acquisition, production and comprehension. Cortex. 2012;48(7):905–22.
28. Grafton ST. Embodied cognition and the simulation of action to understand others. Ann N Y Acad Sci. 2009;1156:97–117.
29. Grush R. The emulation theory of representation: motor control, imagery, and perception. Behav Brain Sci. 2004;27:377–96.
30. Hagmann P, Cammoun L, Gigandet X, Meuli R, Honey CJ, Wedeen VJ, Sporns O. Mapping the structural core of human cerebral cortex. PLoS Biol. 2008;6(7):e159, 1479–93.
31. Hassabis D, Maguire EA. The construction system of the brain. In: Bar M, editor. Predictions in the brain: using our past to generate a future. New York: Oxford University Press; 2011.
32. Hesslow G. Conscious thought as a simulation of behavior and perception. Trends Cogn Sci. 2002;6:242–7.
33. Hesslow G, Jirenhed DA. The inner world of a simple robot. J Conscious Stud. 2007;14:85–96.
34. Hoffmann M, Gravato Marques H, et al. Body schema in robotics: a review. IEEE Trans Auton Mental Dev. 2010;2:304–24.
35. Hofstadter DR. Godel, Escher, Bach: an eternal golden braid. NY: Basic Books; 1979.
36. Hofstadter DR. I am a strange loop. NY: Basic Books; 2007.


37. Hopfield JJ. Searching for memories, Sudoku, implicit check bits, and the iterative use of not-always-correct rapid neural computation. Neural Comput. 2008;20(5):1119–64.
38. Hummel JE, Holyoak KJ. A symbolic-connectionist theory of relational inference and generalization. Psychol Rev. 2003;110:220–64.
39. Iacoboni M. Neurobiology of imitation. Curr Opin Neurobiol. 2009;19(6):661–5.
40. Iriki A, Sakura O. Neuroscience of primate intellectual evolution: natural selection and passive and intentional niche construction. Philos Trans R Soc Lond B Biol Sci. 2008;363:2229–41.
41. Johnson M. The body in the mind: the bodily basis of meaning, imagination and reason. Chicago: University of Chicago Press; 1987.
42. Kacelnik A, Chappell J, Weir AAS, Kenward B. Tool use and manufacture in birds. In: Bekoff M, editor. Encyclopedia of animal behavior, vol 3. Westport, CT: Greenwood Publishing Group; 2004. p. 1067–9.
43. Kohler E, et al. Hearing sounds, understanding actions: action representation in mirror neurons. Science. 2002;297(5582):846–8.
44. Kohonen T. Self-organizing maps. Berlin: Springer; 1995.
45. Kokinov BN, Petrov A. Integration of memory and reasoning in analogy-making: the AMBR model. In: The analogical mind: perspectives from cognitive science. Cambridge, MA: MIT Press; 2001.
46. Locher JL. The magic of M. C. Escher. Harry N. Abrams, Inc.; 2000. ISBN 0-8109-6720-0.
47. Marino BFM, Gough PM, Gallese V, Riggio L, Buccino G. How the motor system handles nouns: a behavioral study. Psychol Res. 2013;77(1):64–73.
48. Martin A. The representation of object concepts in the brain. Annu Rev Psychol. 2007;58:25–45.
49. Martin A. Circuits in mind: the neural foundations for object concepts. In: Gazzaniga M, editor. The cognitive neurosciences. 4th ed. Cambridge, MA: MIT Press; 2009. p. 1031–45.
50. Meyer K, Damasio A. Convergence and divergence in a neural architecture for recognition and memory. Trends Neurosci. 2009;32(7):376–82.
51. Mohan V, Morasso P. Passive motion paradigm: an alternative to optimal control. Front Neurorobot. 2011;5:4. doi:10.3389/fnbot.2011.00004.
52. Mohan V, Morasso P. How past experience, imitation and practice can be combined to swiftly learn to use novel "tools": insights from skill learning experiments with baby humanoids. In: International Conference on Biomimetic and Biohybrid Systems: Living Machines 2012, July 9–12, 2012, Barcelona, Spain; 2012.
53. Mohan V, Morasso P, Metta G, Kasderidis S. The distribution of rewards in growing sensorimotor maps acquired by cognitive robots through exploration. Neurocomputing. 2011. doi:10.1016/j.neucom.2011.06.009.
54. Mohan V, Morasso P, Zenzeri J, Metta G, Chakravarthy VS, Sandini G. Teaching a humanoid robot to draw 'Shapes'. Auton Robots. 2011;31(1):21–53.
55. Mussa Ivaldi FA, Morasso P, Zaccaria R. Kinematic networks. A distributed model for representing and regularizing motor redundancy. Biol Cybern. 1988;60:1–16.
56. O'Reilly RC, Munakata Y, Frank MJ, Hazy TE, Contributors. Computational cognitive neuroscience. Wiki Book, 1st ed; 2012. URL: http://ccnbook.colorado.edu.
57. Patterson K, Nestor PJ, Rogers TT. Where do you know what you know? The representation of semantic knowledge in the human brain. Nat Rev Neurosci. 2007;8(12):976–87.
58. Pepperberg IM. The Alex studies: cognitive and communicative abilities of grey parrots. Harvard University Press; 2000. ISBN 0-674-00806-5.
59. Pulvermuller F, Fadiga L. Active perception: sensorimotor circuits as a cortical basis for language. Nat Rev Neurosci. 2010;11(5):351–60.
60. Ramachandran VS. The tell-tale brain: a neuroscientist's quest for what makes us human. New York: W. W. Norton & Company; 2011.
61. Rizzolatti G, Sinigaglia C. The functional role of the parieto-frontal mirror circuit: interpretations and misinterpretations. Nat Rev Neurosci. 2010;11:264–74.
62. Rizzolatti G, Fadiga L, Matelli M, Bettinardi V, Paulesu E, Perani D, Fazio F. Localization of grasp representations in humans by PET: 1. Observation versus execution. Exp Brain Res. 1996;111:246–52.
63. Rizzolatti G, Fogassi L, Gallese V. Neurophysiological mechanisms underlying action understanding and imitation. Nat Rev Neurosci. 2001;2:661–70.
64. Rother C, Kolmogorov V, Blake A. GrabCut: interactive foreground extraction using iterated graph cuts. In: ACM Transactions on Graphics (SIGGRAPH). Los Angeles, CA: ACM Press; 2004. p. 309–14.
65. Shadmehr R, Mussa-Ivaldi FA, Bizzi E. Postural force fields of the human arm and their role in generating multijoint movements. J Neurosci. 1993;13:45–82.
66. Shapiro R. Direct linear transformation method for three-dimensional cinematography. Res Quart. 1978;49:197–205.
67. Sporns O. Networks of the brain. Cambridge, MA: MIT Press; 2010.
68. Sporns O, Kotter R. Motifs in brain networks. PLoS Biol. 2004;2:1910–8.
69. Sporns O, Honey CJ, Kotter R. Identification and classification of hubs in brain networks. PLoS ONE. 2007;2:e1049.
70. Suddendorf T, Addis DR, Corballis MC. Mental time travel and the shaping of the human mind. Philos Trans R Soc B. 2009;364:1317–24.
71. Thompson E. Mind in life: biology, phenomenology and the sciences of mind. 1st ed. Cambridge, MA: Harvard University Press; 2007. p. 568.
72. Umilta MA, Escola L, Intskirveli I, Grammont F, Rochat M, Caruana F, Jezzini A, Gallese V, Rizzolatti G. When pliers become fingers in the monkey motor system. Proc Natl Acad Sci USA. 2008;105(6):2209–13.
73. Varela FJ, Maturana HR, Uribe R. Autopoiesis: the organization of living systems, its characterization and a model. Biosystems. 1974;5:187–96.
74. Vernon D, von Hofsten C, Fadiga L. A roadmap for cognitive development in humanoid robots. Berlin: Springer; 2010.
75. Visalberghi E, Fragaszy D. What is challenging about tool use? The capuchin's perspective. In: Wasserman EA, Zentall TR, editors. Comparative cognition: experimental explorations of animal intelligence. New York: Oxford University Press; 2006. p. 529–52.
76. Visalberghi E, Limongelli L. Action and understanding: tool use revisited through the mind of capuchin monkeys. In: Russon A, Bard K, Parker S, editors. Reaching into thought: the minds of the great apes. Cambridge: Cambridge University Press; 1996. p. 57–79.
77. Visalberghi E, Tomasello M. Primate causal understanding in the physical and in the social domains. Behav Process. 1997;42:189–203.
78. Vygotsky LS. Mind in society: the development of higher psychological processes. Cambridge, MA: Harvard University Press; 1978.
79. Watts DJ, Strogatz SH. Collective dynamics of 'small-world' networks. Nature. 1998;393(6684):440–2.
80. Wiener N. Cybernetics: or control and communication in the animal and the machine. Paris: Hermann & Cie; Cambridge, MA: MIT Press; 1948. ISBN 978-0-262-73009-9.


81. Weir AAS, Chappell J, Kacelnik A. Shaping of hooks in New Caledonian crows. Science. 2002;297:981–3.
82. Welberg L. Neuroimaging: rats join the 'default mode' club. Nat Rev Neurosci. 2012;13(4):223. doi:10.1038/nrn3224.
83. White JG. Neuronal connectivity in C. elegans. Trends Neurosci. 1985;8:277–83.
84. Whiten A, McGuigan N, Marshall-Pescini S, Hopper LM. Emulation, imitation, overimitation and the scope of culture for child and chimpanzee. Philos Trans R Soc B Biol Sci. 2009;364:2417–28.
