Grounded Situation Models for
Situated Conversational Assistants
Nikolaos Mavridis,
PhD candidate,
The Media Laboratory,
Massachusetts Institute of Technology
Prelude
Acknowledgments: Roz, Whitman, Yiannis, Deb, Kai-yuh, Cogmac, Affective…
Central Theme
How can we build cooperative Situated Conversational Assistants
around
Grounded Situation Models,
which are amodal representations of current, past or imagined situations,
that can be fed
both by the senses and language?
Vision and Contributions
Vision: Cooperative Situated Conversational Assistants (SCAs)
Contributions: The Grounded Situation Model (GSM) proposal:
Motivation and Requirements
Explicit Computational Model
A modular Implementation Architecture
An Implemented System
Quantitative and Qualitative Evaluations
A discussion of a Future Extensions plan
A Discussion with proposals: GSMs in relation to theories in AI, cogsci, semantics
Other smaller contributions: A rough Method for Designing GSM-based systems
News
HOW WE WENT FROM…
…A SOMEWHAT DUMMY CONVERSATIONAL ROBOT
A robot that could only touch currently visible objects
News
TO A ROBOT THAT CAN… IMAGINE, REMEMBER, AND MORE!
Imagining: parts of situations through language, later updated/verified through the senses
Remembering / answering questions: about the current or past situations
Verbalising gradations of certainty: expressing and maintaining alternative possibilities
Motor commands: Pick up, touch, hand me…
Latest extensions: face & hand detection, bilinguality
News
1) Three Novelties above the state of the art in conversational robots
2) Behavioral repertoire comparable / surpassing the Token Test for children
3) Quantitative Performance: median 1 correction (repeat) per 10-step task, 100% task completion
Salient Idea: Grounded Situation Models
A LAYMAN'S DEFINITION: Customizable amodal representations with processes, acting as a "theater in the head", containing descriptions of present, past or imagined situations.
REQUIREMENTS FOR GSMs:
1. Break down situations into the types of an NL "folk ontology":
Histories = moments (situations) + events
Situations = agents + objects + relations
Agents = physical + mental part
Physical part = subparts with properties etc.
Contain representations for "me, you, others" in situations, and "embedded GSMs" within the mental part
Handle uncertainty / represent multiple alternatives
2. Multimodal connectivity / bi-directionality between senses and language:
Fed by multiple modalities (vision, speech, …), feeding parameters to multiple modalities (speech, motor)
Bi-directionality between sensing and language
Standard processes for basic operations: updates, matching, …
GSM BIDIRECTIONALITY
SENSES vs. LANGUAGE: DIFFERENCE in Informational Granularity / Specificity
In more detail…
I. Vision, Background, Motivation
II. Computational Model
III. Modular Implementation Architecture
IV. Implemented System: Ripley the Robot
V. Evaluation
VI. Future Extensions plan
VII. Discussion / relation to AI, semantics...
I. Vision, Background, Motivation
Vision: Cooperative SCAs
A Situated Conversational Assistant (SCA) is a system with sensing, acting and speech synthesis/recognition abilities, which engages in physically situated natural language conversation with human partners and assists them in carrying out tasks.
A continuum: from DUMMY TOOL to TRULY COOPERATIVE ASSISTANT
Tool: [Calligraphy Pen]
No autonomy – needs continuous control
Human adapts to tool
Tool does not know about human, objects & task
Cooperative Assistant: [Friend helping with cooking]
Coordinated autonomy – takes initiative
Assistant adapts to human
Assistant knows about human, objects & task
Three axes: autonomy, adaptation, knowledge
SOME PREREQUISITES FOR COOPERATIVITY:
Natural language interaction: Commands, Questions, Inform statements, Declarations, Promises…
intention recognition: He’s looking at the tomatoes…what does he want?
action coordination: Once he cuts the tomatoes, I’ll throw them in the pan
Background: Relevant Notions
I. Symbol Grounding [Harnad]
Any system of symbols, if isolated from a meaning space external to it, cannot help but become circular. (Ex: Semantic Nets.)
Meaning Spaces, Indirect Grounding, Approximations, …
II. Situation Models [Zwaan/Radvansky] ([Johnson-Laird], [Minsky]). "A situation model is a mental representation of a described or experienced situation in a real or imaginary world"
Situation model components & levels; the main focus is on story understanding / integration. No explicit processes for sensory connection!
III. Situated Language
Language about the concrete here-and-now
Development and levels of situatedness, also in Ripley’s progress
Also: Chapter 2 – purpose, internal rep, prediction etc.
Motivation: Four pathways
Main motivation: Building Cooperative SCAs
Other motivations:
Cognitive Architectures:How to interconnect vision, motor control, and speech? (NLP cannot PNP)
How can we go from percept to words & vice-versa?
GSMs as "structured blackboards" in AI?
Internal Models [Zwaan, Johnson-Laird, Minsky]: What if there was a "theatre in the head"
that is filled with present, past or imagined “Situation Models?”
Grounding the Meaning of Words [Roy, Siskind, Regier]: How can we empirically learn models of the meaning of words?
What is wrong with today’s NLP?
MIND(concepts)
WORLD(objects)
SIGN(words)
? “A blue ball”
II. Computational Model
Representation
Why three layers?
The three layers in operation
Example Contents
Operational Cycle
Histories (Moments & Events)
Representational Hierarchy
Two Desiderata for proposed design:
I) Folk Ontology
II) Multimodal connectivity /
bidirectionality between senses-language
Why three layers?
Different Layers for Different Purposes!
Consider an object (ball) which is part of a situation, and consider one of its properties (position). Imagine that you:
C1) See a ball and notice its position under uncertainty
C2) Need to grasp the ball
C3) Hear that "the ball is at the center"
What is a matching representation of the position for each of C1-C3?
L1) Stochastic (n-D continuous distribution) (INFO FROM SENSES)
L2) Continuous (n-D continuous value) (ACTION PARAMETERS)
L3) Categorical (1-D discrete distribution) (INFO FROM/TO WORDS)
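The three layers can be sketched as code. The following is a hedged illustration only: all class and method names are invented for the sketch, and it keeps a sample set for L1 where the thesis's actual representation uses, e.g., Gaussian mixtures.

```python
class PropertyLayers:
    """Sketch of one GSM property (e.g. position) kept in three layers.
    Illustrative names only, not the thesis's actual API."""

    def __init__(self, samples):
        # L1 (stochastic): a set of samples approximating a distribution.
        self.l1_samples = list(samples)

    @property
    def l2_value(self):
        # L2 (continuous): a single point estimate usable as an action
        # parameter; here simply the sample mean.
        n = len(self.l1_samples)
        dims = len(self.l1_samples[0])
        return tuple(sum(s[d] for s in self.l1_samples) / n for d in range(dims))

    def l3_category(self, classifiers):
        # L3 (categorical): a discrete distribution over word-level
        # categories, obtained by pushing L1 samples through
        # categorical classifiers.
        counts = {name: 0 for name in classifiers}
        for s in self.l1_samples:
            for name, pred in classifiers.items():
                if pred(s):
                    counts[name] += 1
        total = len(self.l1_samples)
        return {name: c / total for name, c in counts.items()}

# A ball position sensed with noise (C1), usable for grasping (C2),
# and describable in words (C3):
pos = PropertyLayers([(0.1, 0.0), (0.2, 0.1), (0.0, -0.1), (0.1, 0.0)])
classifiers = {"center": lambda p: abs(p[0]) < 0.5 and abs(p[1]) < 0.5,
               "left":   lambda p: p[0] < -0.5}
print(pos.l2_value)                  # point estimate for the motor system
print(pos.l3_category(classifiers))  # {'center': 1.0, 'left': 0.0}
```

The same object thus answers all three needs: L1 carries the sensory uncertainty, L2 feeds the arm, L3 feeds speech.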
The three layers in relation to traditional logical sentences
[Diagram: from SIGNALS (stochastic) to SYMBOLS (deterministic)]
The three layers in operation
Matching Algorithm
Match: sensory expectation to incoming sensory data
3 Outcomes: Create / Update / Destroy
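The three outcomes can be sketched in code. This is an illustrative nearest-neighbor pairing, not the thesis's actual matching algorithm; `match_step` and its data shapes are assumptions for the sketch.

```python
def match_step(expected, observed, radius=0.3):
    """One expectation-matching step (illustrative sketch).
    expected: {obj_id: (x, y)} positions the GSM predicts;
    observed: list of (x, y) sensed blob centroids.
    Returns (updates, creates, destroys)."""
    updates, creates = {}, []
    unmatched = dict(expected)
    for obs in observed:
        # Greedily pair each observation with the nearest unmatched
        # expectation within the match radius.
        best = None
        for oid, pos in unmatched.items():
            d = ((obs[0] - pos[0]) ** 2 + (obs[1] - pos[1]) ** 2) ** 0.5
            if d <= radius and (best is None or d < best[1]):
                best = (oid, d)
        if best:
            updates[best[0]] = obs   # matched -> Update the belief
            del unmatched[best[0]]
        else:
            creates.append(obs)      # no expectation -> Create a new object
    # Expected but unseen -> Destroy candidates. (A real system would
    # rather decay certainty over time than destroy immediately.)
    destroys = list(unmatched)
    return updates, creates, destroys

u, c, d = match_step({"ball": (0.0, 0.0)}, [(0.1, 0.0), (2.0, 2.0)])
# "ball" is updated, the far blob spawns a new object, nothing is destroyed
```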
Example Contents of GSM
At time = 137 (present), Agent #1 ("I")
Physical: Compound Object with 5 objects (body parts)
Object #1: ("head")
Property #1: ("position")
Layer #1 (Stochastic): Circular Gauss. Mix G1{m=(1.5,1.89,2), s=2}, G2{…}
Layer #2 (Continuous): (1.6, 1.85, 2.05)
Layer #3 (Categorical): #7 ("At the center")
Property #2: ("pose") …
Object #2: ("neck") …
Interface:
Viewpoint: R eye center (1.4, 1.89, 2.1), viewdir (-14deg, 5deg, 0deg)
Mover: Head Motor: attachment point (1.5, 1.89, 1.8), rot axis: (…) …
Mental:
Goals: …
Estimated GSM: (what Robot believes Robot's GSM contents are) …
Agent #2 ("You")
Physical: … Interface: … Mental: …
Estimated GSM: (what Robot believes Human's GSM contents are)
Operational Cycle
Basic Cycle
Init L1, L2, L3
Do Forever {
  If (incoming sensory data): Senses-to-L1, L1-to-L2, L1-to-L3
  Else If (incoming speech data): Words-to-L3, L3-to-L1, L1-to-L2
  Else (no incoming data – diffuse): Diffuse-L1, L1-to-L2, L1-to-L3
  Endif
  If (incoming request for words): L3-to-words
  Else If (incoming request for action parameters): L2-to-actionparams
  Endif
}
Senses-to-L1 contains EXPECTATION MATCHING algorithm
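One pass of the cycle can be sketched in Python. The method names mirror the slide's operation names; `TraceGSM` is a made-up stand-in for illustration, not the thesis's implementation.

```python
def cycle_step(gsm, sensory=None, speech=None, want_words=False, want_action=False):
    """One pass of the GSM operational cycle (hedged sketch; the real
    module names and call signatures may differ)."""
    if sensory is not None:
        gsm.senses_to_l1(sensory)   # includes expectation matching
        gsm.l1_to_l2()
        gsm.l1_to_l3()
    elif speech is not None:
        gsm.words_to_l3(speech)
        gsm.l3_to_l1()
        gsm.l1_to_l2()
    else:
        gsm.diffuse_l1()            # no data: uncertainty spreads over time
        gsm.l1_to_l2()
        gsm.l1_to_l3()
    if want_words:
        return gsm.l3_to_words()
    if want_action:
        return gsm.l2_to_action_params()

class TraceGSM:
    """Minimal stand-in that records which operations fired, to make the
    dispatch order visible."""
    def __init__(self):
        self.trace = []
    def __getattr__(self, name):
        def op(*args):
            self.trace.append(name)
            return self.trace
        return op

g = TraceGSM()
cycle_step(g, sensory="blob data")
print(g.trace)  # ['senses_to_l1', 'l1_to_l2', 'l1_to_l3']
```

The key point the sketch preserves: sensory data flows L1→L2/L3, while speech flows the opposite way, L3→L1→L2.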
Distribution across modules
VISOR and PROPRIOCEPTOR: contribute sensory data
IMAGINER: contributes speech data
INQUIRER and REMEMBERER: create words and action parameters
GSM: diffuses data that have not been updated
From Situation to History
Moments: Just copy and index current contents (situations)! [No filtering/consolidation yet.]
Events: Event classifiers take input from GSM contents. Currently hand-coded; future HMMs / CFGs etc. Continuously updated "live" events: <type> <participants> <time0> <time1> <parameters>
Two types of classifiers:
1) categorical classifiers ("red", "small", "heavy")
2) event classifiers ("appeared", "x started touching y")
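A minimal hand-coded event classifier in this spirit might look as follows. The function, the moment format, and the threshold are assumptions made for the sketch, not the thesis code.

```python
def detect_started_moving(history, obj_id, eps=0.05):
    """Sketch of a hand-coded event classifier over GSM moments.
    history: list of (t, {obj_id: (x, y)}) snapshots.
    Emits ('started moving', [obj_id], t0, t1, {}) at the first pair of
    moments where the object's displacement exceeds eps."""
    events = []
    moving = False
    for (t0, prev), (t1, cur) in zip(history, history[1:]):
        if obj_id not in prev or obj_id not in cur:
            continue
        dx = cur[obj_id][0] - prev[obj_id][0]
        dy = cur[obj_id][1] - prev[obj_id][1]
        speed = (dx * dx + dy * dy) ** 0.5
        if speed > eps and not moving:
            # Event tuple in the slide's <type> <participants> <t0> <t1>
            # <parameters> format.
            events.append(("started moving", [obj_id], t0, t1, {}))
            moving = True
        elif speed <= eps:
            moving = False
    return events

hist = [(0, {"ball": (0, 0)}), (1, {"ball": (0, 0)}), (2, {"ball": (0.5, 0)})]
print(detect_started_moving(hist, "ball"))
# [('started moving', ['ball'], 1, 2, {})]
```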
Semantics of verbs & adjectives
Adjective meaning: categorical classifier
Verb meaning (second/third person): event classifierVerb meaning (first person): action controller
Video Example Revisited
Ripley answers, imagines and remembers… Notice the contents of the layers!
III. Modular Implementation Architecture
Modular Implementation Architecture
IV. Implemented System:Ripley the Robot
Embodiment
Hardware:
7 DOF with force-feedback actuators
Gripper with force-sensitive touch sensors
Joint angle encoders
Dual cameras mounted around head
Microphone / Speakers
8 PCs with gigabit net / Linux / PVM
Environment: table/objects/human
Goal: Conversational Helping Hand
Behavioral specs that were set
1) Answering questions about the present physical context: "What color is the one on the left?"
2) Quantifying and expressing confidence in beliefs: "Where is the red one?" (red object out of view for a while) – "Probably at the left, but maybe not"
3) Responding to spoken motor requests: "Look to the left", "Hand me the small red one"
4) Imagining situations described through language: "Imagine an object at the left", "There is a small object at the right" (& sensory updates!)
5) Remembering and resolving temporal referents: "What color was the object that was at the center of the table when the red one appeared?"
6) Resolving indexical human hand pointing: "What color is this one?" + pointing
7) Recognizing human hand-related events: "Where was the blue one when my hand started touching the red one?"
8) Extending the speech interface to multiple languages: Greek / English. Rudimentary GSM-mediated translation
GSM Customization
AGENTs: for Robot, Human, Objects
PROPERTY structures:
P1 - Position (3D): Object center position
P2 - Pose (3D): Object pose in scaled quaternions, converted to matrices / angles
P3 - ShapeParams (3D): Object shape params (radius for spheres) etc.
P4 - Weight (1D): As acquired from the force-sensitive sensors
P5 - Color (3D): Simple RGB-space color
P6 - 3D voxelised occupancy grid: for 3D model acquisition
EVENT types:
appeared / disappeared
came in view / came out of view
started moving / stopped moving
started touching / stopped touching
SENSORY connections: Vision, Language, Proprioception
Vision System
[Pipeline diagram: Capture → Object Segmenter / Skin Segmenter / Face Detector (regions) → 2D Objecters (blobs: ellipses / rectangles) → StereObjecters for Objects / Skin / Faces → Visor (stereo & mono blobs, mono regions) → Situation Model. The right-channel pipeline mirrors the left.]
Segmenters: trainable color-based probabilistic with ellipse fit
Face Detector: Viola-Jones based (OpenCV)
StereObjecter: centroid/endpoint correspondence
Create / Delete / Update proposals
Spoken Language Processing
Transcript parsing: keyword-based semantic frames; speech-act type classification; tense determination; description extraction
Object referent resolution: body-part / attributive descriptions, indexicals with pointing; singular / plural; disambiguating questions
Temporal referent resolution: recency criterion, time of description; object should exist for action
Speech generation: decision trees / templates
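The attributive part of referent resolution can be sketched in a few lines. This is an illustrative simplification (function name and data shapes invented for the sketch); the real system additionally handles plurals, indexicals, and temporal referents.

```python
def resolve_object(description, objects):
    """Attributive referent-resolution sketch.
    description: required categorical values, e.g. {"color": "red"};
    objects: {obj_id: {"color": ..., "size": ...}} from the GSM.
    Returns all matching ids; a real system would ask a disambiguating
    question when the description is singular but matches several."""
    return [oid for oid, props in objects.items()
            if all(props.get(k) == v for k, v in description.items())]

objs = {"o1": {"color": "red", "size": "small"},
        "o2": {"color": "green", "size": "small"},
        "o3": {"color": "red", "size": "large"}}
print(resolve_object({"color": "red", "size": "small"}, objs))  # ['o1']
print(resolve_object({"color": "red"}, objs))  # ['o1', 'o3'] -> ambiguous
```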
Verbalizing Uncertainty
Gradations
Input:
P(first): Pr of 1st most probable category
P(second): Pr of 2nd most probable category
Entropy Ratio: Entropy / Max Entropy
Input:
Nmatch: Num of objects matching description given by human,
Number: Was human’s description singular or plural
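The mapping from these inputs to hedged phrases can be sketched as follows. The thresholds and wording here are invented for the illustration; the thesis derives the gradations from P(first), P(second) and the entropy ratio, but the exact cut-offs are not reproduced.

```python
import math

def verbalize_confidence(probs):
    """Map a categorical (L3) distribution to a hedged phrase.
    Illustrative thresholds only."""
    ordered = sorted(probs.items(), key=lambda kv: -kv[1])
    top, p1 = ordered[0]
    p2 = ordered[1][1] if len(ordered) > 1 else 0.0
    # Entropy ratio: entropy of the distribution over its maximum.
    max_h = math.log(len(probs))
    h = -sum(p * math.log(p) for p in probs.values() if p > 0)
    ratio = h / max_h if max_h > 0 else 0.0
    if p1 > 0.95:
        return top                       # near-certain: plain statement
    if ratio > 0.95:
        return "I don't know"            # near-uniform: no useful answer
    if p2 > 0.25:
        return f"probably {top}, but maybe {ordered[1][0]}"
    return f"probably {top}"

print(verbalize_confidence({"left": 1.0, "center": 0.0, "right": 0.0}))
# left
print(verbalize_confidence({"left": 0.6, "center": 0.3, "right": 0.1}))
# probably left, but maybe center
```

The second call mirrors the slide's example behavior: an object out of view for a while yields "probably at the left, but maybe not".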
From referent resolution to action execution
Voxelized shape models
Current Behavioral Repertoire
The system responds to the following utterance classes:
1. Questions (present/past): <question> is/are the <obj des>; <question> was/were the <obj des> when <event des>
2. Action requests (referent description in present/past): <action> the <obj des>; <action> the <obj des> when <event des>
3. Imaginer requests: Imagine/There is a <obj des>
4. Location-centered "look at": Look at <location>
5. Viewpoint: <action> the one on my/your left/right
6. Basic mode switching: Wake up, sleep, relax, look at me/the table
Types:
<question> belongs to {where, how big, what color}
<action> belongs to {touch, pick up, hand me, look at}
<obj des> contains <size> <color> <object> <locus>
<event des> contains <actor> <event type>
<size> belongs to {small, medium, large}
<color> belongs to {red, green, blue}
<locus> belongs to {center, left, right, top, bottom}
<actor> is either an <obj des> or <agent part>
<agent part> belongs to {my, your} × {head, arm}
<event type> belongs to {appeared, disappeared, started moving, stopped moving, came in view, came out of view}
7. Recent extensions:
Indexicals: <obj des> extended to {this} + finger pointing
Touch events: <event des> extended to <actor> <event type> <patient>; <event type> extended: {started touching, stopped touching}
Multilingual support: English/Greek (auto language switching)
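A keyword-based semantic-frame extractor over this repertoire can be sketched in a few lines. This is a simplified illustration of the parsing style, not the thesis's parser: it ignores tense, events, plurals and indexicals.

```python
SIZES = {"small", "medium", "large"}
COLORS = {"red", "green", "blue"}
ACTIONS = {"touch", "pick up", "hand me", "look at"}
QUESTIONS = {"where", "what color", "how big"}

def parse_utterance(text):
    """Keyword-based semantic-frame extraction over the grammar above
    (illustrative sketch)."""
    lowered = text.lower()
    words = lowered.rstrip("?").split()
    frame = {"type": None, "size": None, "color": None}
    # Speech-act type: question vs. action request, by leading keyword.
    for q in QUESTIONS:
        if lowered.startswith(q):
            frame["type"], frame["question"] = "question", q
    for a in sorted(ACTIONS, key=len, reverse=True):
        if lowered.startswith(a):
            frame["type"], frame["action"] = "action", a
    # Description extraction: fill <size> and <color> slots.
    for w in words:
        if w in SIZES:
            frame["size"] = w
        if w in COLORS:
            frame["color"] = w
    return frame

print(parse_utterance("Hand me the small red one"))
# {'type': 'action', 'size': 'small', 'color': 'red', 'action': 'hand me'}
```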
V. Evaluation
Novelties / Qualitative
Generics: levels of Eval for SCAs, Psych tests
Quantitative
Comparison w Token Test
Novelties above state-of-the-art
Our system exhibits three novelties:
N1. Imagining situations described through language “Imagine a large object at the left!”
N2. Remembering and resolving temporal referents“Where was the green one when the red one appeared?”
N3. Quantifying and expressing confidence in beliefs “It is probably at the left”
Some comparable systems: [Crangle and Suppes 94], [McGuire et al 2002], [Sofge et al 2003], [Breazeal et al 2004]
Levels of Evaluation for SCAs
It is proposed that an SCA may be evaluated at various levels (& analogy with humans!):
E1) Module-level: One could attempt quantifying the performance of each of the system's components (speech recognition, vision etc.). However, this would not be highly informative about the effectiveness of the system as a whole.
E2) System-level:One could evaluate through measurables that involve more than one component. For example, for the composite VisionSystem-Visor-SituationModel chain, how much statistical error or delay exists between the "outside world" (external reality) and the "inside world" (situation model)? What is the transfer function between the two?
E3) Task-level: For a repertoire of specific tasks, and some criteria of success, how often is the robot successful in carrying out these tasks? For example, in responding to a human saying: "hand me the red one" by indeed handing him the red one?
E4) User satisfaction:One could measure human satisfaction while interacting with the system through questionnaires or otherwise.
E5) Human replacement test: How close to a human are the SCA’s responses, given same situations and utterances heard
Also: Guidelines for Using Human Psychological Tests as Design Specs / Evaluation for SCAs
Quantitative Evaluation
Task-Level (T1): 10 turns
Rich mix of commands
Repeat cmd if unsuccessful
Error types:
SRE: Speech Recognition
CME: Category Misclassification
VSE: Vision System
MME: Mechanical Manipulation
Quantitative Evaluation
Module-Level
T2: "touch" 100%, "hand me" 82% (grip slip)
T3: “what color?” red 100%, green 64%
The Token Test for Children
Psychological tests usable as: evaluations and behavioral specifications
A test for 3-year-old children & aphasics!
Ripley vs. Token Test
ALSO: Ripley surpassing Token Test in other directions! (inform, remember…)
VI. Future Extensions plan
A 5-10 year research plan… The current system provides SIGNIFICANT LEVERAGE towards these extensions!
Sensory domain: X1-3: object vision, tactile, human activity
Cognitive / Affective domain: X4-6: embedded situation models, language, plans / goals / affect, LTM
Real-world operation and design / theory: X7-10: online learning, evaluation, design, theoretical reflections
Future Extensions plan:Sensory domain
X1) Better Visual Abilities:
X1.1 Richer shapes & shape recognition
X1.2 Active sensing
X1.3 Richer spatial relations (containment, support, etc.)
X2) Better Tactile Abilities:
X2.1 Soft/hard, texture, and temperature sensing
X2.2 Shape by touch
X3) Better support for Mind Reading / Embedded situation models:
X3.1 Eye gaze direction
X3.2 Absence/presence of humans, face recognition, multiple speakers
X3.3 Extend recognition of human-related events
X3.4 Keep history of interactions with each human
Future Extensions plan: Cognitive / Affective domain
X4) Better Language subsystem:
X4.1 Online naming of objects and agents
X4.2 Conditional execution of actions
X4.3 More systematic treatment of vocabulary, speech acts, descriptions etc.
X4.4 Extend inform speech act handling for story understanding / future anticipation
X5) Mixed motor/speech planning, explicit goals, inference, affective state:
X5.1 Translate categorical layer to FOPC
X5.2 Perform simple inference on the basis of the translated categorical layer
X5.3 Perform situational commonsense knowledge acquisition
X5.4 Devise goal representations and perform simple planning
X5.5 Perform online collaborative planning / action coordination
X5.6 Introduce affective state for the robot, fed through goal satisfaction, novelty etc.
X5.7 Introduce affective state for the human, fed through expression recognition etc.
X5.8 Affective learning for the robot by observation of human affective responses
X6) Core GSM enhancements:
X6.1 Extend basic GSM operations: attentional focusing etc.
X6.2 Better memory subsystem: consolidation, long-term experience acquisition
X6.3 Incorporate more results from cognitive science research
Future Extensions plan: Real-world operation / design
X7) Integrated Online Learning:
X7.1 Integrate online learning of adjectives, verbs & params with normal operation
X7.2 Online learning of tasks and situationally-appropriate actions
X8) Passing Landmark Tests / Evaluations:
X8.1 Pass all of the "Token Test" with a professional test giver
X8.2 Perform simple story understanding tests
X8.3 Perform cooperative assembly task tests
X8.4 Perform extensive tests through the levels described in section 7.2
X9) Standardisation / Open Sourcing / Design tools:
X9.1 Open-source a generic library of basic GSM representations and processes
X9.2 Build design tools for semi-automated CAD of GSM-based SCAs
X9.3 Build an affordable educational GSM-based robot (One Robot Per Child)
X10) TheoreticalX10.1 Alignment of Situation ModelsX10.2 Empirically Sound Semantics
Scalability! Hierarchical Learnability / Open Sourcing – huge teacher/developer base
Achieving Intention Recognition and Activity Coordination – the prerequisites for an ideal SCA
VII. Discussion / relation to AI, semantics, philosophy…
Block 1: Language grounding, Situated language
Block 2: Abstract entities, Imaginary situations
Block 3: Semantics, Situations
Block 4: Logic, Ontologies
Block 5: Stance X and its relation to GSMs
Discussion
9.1 Language grounding, Situated language
X 9.1.1 Meaning spaces
9.1.2 Unifying speech with sensing and acting
X 9.1.3 Grounding the whole of language
9.1.4 Levels of Detachment from the "here-and-now"
9.2 Abstract entities, Imaginary situations
9.2.1 Do abstract entities exist in a primary manner?
9.2.2 Can abstract entities exist in a primary manner in GSMs?
9.2.3 Representing abstract entities within situation models
9.2.4 Inform speech acts, Imagine statements & the "Imaginer" module
9.2.5 On the possibility of "Pure Imagination"
9.3 Semantics, Situations
X 9.3.1 The semantics of adjectives and verbs within the GSM proposal
X 9.3.2 Successive approximations of meaning models
9.3.3 Comparative adjectives and better spatial semantics
9.3.4 Similarity of situations
9.3.5 The Enumerability of Situations
9.4 Logic, Ontologies
X 9.4.1 Bridging the categorical layer with FOPC
9.4.2 Grounding ontologies
9.5 Stance X and its relation to GSMs
X 9.5.1 Embodiment and GSMs
X 9.5.2 SHRDLU and GSMs
9.5.3 Procedural semantics and GSMs
X 9.5.4 A proposal for Empirically Sound Semantics
9.5.5 The intentional stance and GSMs
9.5.6 The homunculus attack and GSMs
9.5.7 Some of Minsky's views and GSMs
Meaning Spaces & Grounding the whole of language
Semantics of verbs & adjectives
Adjective meaning: categorical classifier
Verb meaning (second/third person): event classifierVerb meaning (first person): action controller
Successive Approximations of Meaning Models
F1 (absolute) 1st order approx.: category = f(property value)
F2 (noun-dependent) 2nd order approx.: category = f(property value, noun)
F3 (situational-context-dependent) 3rd order approx.: category = f(property value, noun, values of the same property for other relevant objects)
"the small cup" when surrounded by other large cups vs. "the small cup" when surrounded by miniature cups
"small elephant" vs. "small ant"
Example: Small = radius less than 3cm
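The three approximation orders can be made concrete as classifiers for "small" (the radii and "typical size" numbers below are made-up illustrative values, not thesis data):

```python
def small_f1(radius):
    # F1 (absolute): category depends only on the property value,
    # e.g. "small" = radius under 3 cm.
    return radius < 0.03

def small_f2(radius, noun,
             typical={"cup": 0.04, "elephant": 1.5, "ant": 0.003}):
    # F2 (noun-dependent): compare against the noun's typical scale.
    return radius < typical[noun]

def small_f3(radius, noun, context_radii):
    # F3 (context-dependent): compare against the other relevant
    # objects in the current situation.
    mean = sum(context_radii) / len(context_radii)
    return radius < mean

# "small elephant" vs "small ant": F1 fails, F2 captures it.
assert not small_f1(1.2) and small_f2(1.2, "elephant")
# "the small cup" among large cups vs among miniature cups:
# only F3 captures the contrast.
assert small_f3(0.04, "cup", [0.08, 0.09, 0.04])
assert not small_f3(0.04, "cup", [0.01, 0.012, 0.04])
```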
GSMs and FOPC
Example of translation:
From Event(now, Object1, ismoving)
to
ADVANTAGE: Become able to use all FOPC-based methods
SHRDLU vs. GSMs
Differences: SHRDLU exists in a virtual world, while the GSM-based Ripley exists in the real world.
1) No need for sensing - world continuously & totally accessible at object & property level;
2) No ExternalReality-to-SituationModel process – no segmentation, property extraction etc.
3) The Situation Model IS the only Reality in SHRDLU's case – and this reality is totally accessible, not partially hidden, noisy, or only partially visible at any time.
4) No need for stochastic layer, no need for object permanence- the omniscient "eye" of SHRDLU knows all objects with their unique IDs, and they will never disappear from sight.
5) No concept of "sensory expectation" and verification of sensory expectation - it can never be the case that reality does not match its estimate (the situation model), because reality IS the situation model.
6) No learning whatsoever of semantics of words - "red" is not attached to any sensory space semantics through examples, "pick up" is never taught by demonstration.
7) No "imagination" in the sense used in this document too- no arbitrary recombination of sensory-derived elements, no sensory expectation formed that can be verified or disproved.
8) Actions in SHRDLU always have their predicted effects.
9) No user model or self model – no embedded models.
Similarities:
1) Some "understanding" of natural language – and dialogue with the user about the virtual / real world
2) 3D visualizations of the beliefs of the system about the state of the world
3) Keeping of sequences of events etc.
The differences are much deeper than the apparent similarities! SHRDLU was an amazing system in its time; but in order to get to the next level and build SCAs in real environments, GSMs are crucial…
Requirements forEmpirically Sound Semantics
Motivational Question: What is a meaningful definition of the "meaning" of an action – a definition that fulfills the following three requirements?
R1) It is empirically sound.
Meanings defined in this way should be estimatable in a computational manner, on the basis of external observations of a set of agents acting and possibly interacting (not by human expert introspection)
R2) It is not defined on the basis of a set of high-level natural-language "concepts".
I.e. it is defined in terms of an agreed-upon description of externally observable states of the world, where the world is modelled in some non-language-like model.
R3) It contains both an internal / mental and an external / physical part.
(Within clearly explicated "mind" and "world" models)
A proposal for Empirically Sound Semantics
Meaning (given sign, at a given instant) =
Production Conditions & Reception Effects across all agents
Production Conditions (sign, agent) =
Which of all possible past experiences will make this agent produce the sign – past experiences which will always ultimately ground out to states of external reality =
A classifier on all pasts, which dichotomizes them into those that will result in the production of the sign, and those that won't
Reception Effects (sign, agent) =
How the receiving agent's behavior will be modified for all possible futures, which will in turn further constrain all subsequent futures (states of external reality) after the agent's actions =
A further constraint on all possible futures after the reception of the sign
Central Theme
How can we build cooperative Situated Conversational Assistants
around
Grounded Situation Models,
which are amodal representations of current, past or imagined situations,
that can be fed
both by the senses and language?
Vision and Contributions
Vision: Cooperative Situated Conversational Assistants (SCAs)
Contributions: The Grounded Situation Model (GSM) proposal:
Motivation and Requirements
Explicit Computational Model
A modular Implementation Architecture
An Implemented System
Quantitative and Qualitative Evaluations
A discussion of a Future Extensions plan
A Discussion with proposals: GSMs in relation to theories in AI, cogsci, semantics
Other smaller contributions: A rough Method for Designing GSM-based systems
News
1) Three Novelties above the state of the art in conversational robots
2) Behavioral repertoire comparable / surpassing the Token Test for children
3) Quantitative Performance: median 1 correction (repeat) per 10-step task, 100% task completion
Epimetron
In the introductory section, we stated upfront and discussed the vision of this thesis: SCAs.
But taking a higher viewpoint, one might ask: why does humanity need SCAs?
Or taking an even higher viewpoint, one should ask: where should SCAs and intelligent robots fit within the Order of the world?
This is a question which deserves deep consideration and significant future-reaching examination. Thus, I will end this thesis not with an answer, but with a wish:
May the day come when humans and their artificial partners can co-exist in harmony and peace, enabling each other to utilize their full potential, and most importantly, helping each other towards fulfilling their true purpose of existence.
OPTIONAL MATERIAL
Behavioral Rep
Design Method
All the stuff from previous pres
All the new stuff
Some questions to explore…
Meaning of Words: How can we empirically learn models of the meaning of words?
What is wrong with today’s NLP?
Cooperative Machines: What does it mean for a machine to be "cooperative"?
Cognitive Architectures:How can we interconnect vision, motor control, and speech?
How can we go from percept to words & vice-versa?
Internal Models: What if there was a “theatre in the head” that is filled with present, past or imagined “Situation Models?”
Areas implicated: AI, Vision, Robotics, Linguistics, Philosophy
Word meaning?
Important Theoretical Questions:Q1a) What is the meaning of an utterance?Q1b) Where is it? Q1c) How should it be represented?
"A BLUE BALL" = ???
"JOHN KICKED THE BALL" = ???
"I LIKED REMEMBERING YESTERDAY'S LUNCH" = ???
Referential vs. Functional Meaning Semantics vs. Pragmatics
Origin of signs / Empirical Semantics
MIND(concepts)
WORLD(objects)
SIGN(words)
The “Semiotic Triangle”
The Central Problem
The Symbol Grounding Problem
1) How is meaning represented in traditional NLP systems?
Semantic Networks
2) Very much like:
Dictionary Definitions
Apple = a red fruit that has a kernel
Μήλο = είναι ένα κόκκινο φρούτο που έχει ένα κουκούτσι (Greek: the same circular definition)
3) Thus, one inescapable problem: Circularity
But how do children solve it? How do we solve it? See next slide…
Escaping the Problem
Our Proposed Solution: Equip machines with:
- a starting vocabulary whose meaning is acquired by "being in the world" (meaning("apple"), meaning("red") etc.)
- as well as a way to compose meanings: meaning("red apple") = f(meaning("red"), meaning("apple"))
- Then, by analogy, extension to abstract concepts could potentially be addressed. (meaning("justice") = ? meaning("opposite") = ?)
Meaning Spaces:
1. Sensory ("red", "loud")
2. SensoryMotor ("jump", "search")
3. Goal-oriented ("it is hot in here")
Bigger picture: the semiotic triangle
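One concrete reading of the composition function f is intersective modification over grounded classifiers. A hedged sketch (the classifiers and their thresholds are invented for illustration, not learned models):

```python
def meaning_red(obj):
    # Grounded classifier sketch: "red" as a region of RGB space
    # (thresholds are illustrative only).
    r, g, b = obj["rgb"]
    return r > 150 and g < 100 and b < 100

def meaning_ball(obj):
    # "ball" as a shape classifier (here just a stored shape label).
    return obj["shape"] == "sphere"

def compose(*classifiers):
    # One simple composition function f: intersective modification --
    # "red ball" applies to whatever both "red" and "ball" apply to.
    return lambda obj: all(c(obj) for c in classifiers)

meaning_red_ball = compose(meaning_red, meaning_ball)
obj = {"rgb": (200, 30, 40), "shape": "sphere"}
print(meaning_red_ball(obj))  # True
```

Intersection is only the simplest choice of f; non-intersective adjectives ("small", as above) need the context-dependent treatment discussed under successive approximations.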
Practical Pathways for models of grounded meaning
Three Practical Pathways:
Robots – Games – AV record
Interconnecting vision, motor control, and language
Q2)How can we interconnect existing subsystems?
And then, what is in the star? (Reps / Processes)
How do we map psychological terms:Beliefs? Goals? Intentions? Emotions?
? “A blue ball”
Are there mental models in the head?
Q3) If (and how) is a mental model constructed with structure analogous to the situation in which we are embedded, or which we are remembering / imagining?
What are its contents? What are its main operations?
Theoretical Steps
Theoretical agent model: a proposal, which is GSM-based.
Design methods: incremental, producing a design given a behavioral spec.
Psychological tests: using them for specification / evaluation.
The representation: What the GSM is and what it is not…
The GSM is amodal – it is NOT:
A viewer-dependent 2D image
An absolute 3D physical model (it even includes invisibles such as the beliefs of others)
An unstructured list of language-like propositions
The GSM is a representation:
Accumulating information from multiple modalities (vision, touch, proprioception, language)
Which has a structure analogous to the situation the robot is embedded in or is imagining
Which might extend to the past / future (not only the present)
Which contains Mental and Physical parts in agent descriptions – supports "embedded GSMs" (estimated GSMs of other agents)
Similar to "Situation/Mental/Internal Models" in CogSci [Johnson-Laird, Minsky]
The GSM is NOT only a representation: it also comes with a set of associated standard processes!
Theory: GSM-Based Agent Models
Representations:
R1) Present Situation model: known world (not only physical, necessarily)
R2) Internal state (goals, affect)
R3) Past states/situations (includes events)
R4) Future predictions
R5) Action list
Processes:
P1) Sensor-to-SituationModel process
P2) SituationModel-to-InternalState process
P3) Storage/Maintenance of Past process (includes event recognition)
P4) Future Prediction Generation process
P5) Action Selection process (SituationModel-to-action table or planner)
Also: further breakdown! And more arrows…
[Diagram: the AGENT, embedded in the WORLD via Sensors and Effectors, holds the Present Situation (R1), Past (R3), Future (R4), Goals (R2) and Action List (R2), connected by processes P1, P3, P4 and P5.]
A closer view: Reality-to-SituationModel
External Reality → M1: Sensor Projection → P1a: Individuation → P1b: Property Extraction → P1c: Category Projection → Internal Situation Model
M1: Spatiotemporally local external reality is projected into a "sensory stream"
P1a: Sensory streams are segmented into "objects" – i.e. some information is marked as pertaining to an object, which object is also given a unique ID
P1b: All of the segments of the sensory streams that resolve to a common object are collected, and properties are calculated on their basis: color, size…
P1c: Continuous-valued or fine-granularity property values are quantized to meaningful categories – color is quantized to "red", "green", "yellow" etc.
SITUATION-VIEWPOINT of an agent: {M1, P1a, P1b, P1c}(t)
All the info required to go from External Reality to Situation Model
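The P1a–P1c chain can be sketched over a toy one-dimensional "sensory stream" (all function names, thresholds and the stream itself are invented for the sketch; the real pipeline operates on stereo image streams):

```python
def p1a_individuate(stream, thresh=0.5):
    """P1a sketch: segment a 1-D sensory stream into objects by grouping
    consecutive above-threshold samples, each group getting a unique ID."""
    objects, current = {}, None
    for i, v in enumerate(stream):
        if v > thresh:
            if current is None:
                current = len(objects)
                objects[current] = []
            objects[current].append((i, v))
        else:
            current = None
    return objects

def p1b_extract_size(obj_samples):
    # P1b sketch: compute a property (spatial extent) from the segments
    # that resolved to one object.
    return len(obj_samples)

def p1c_categorize(size):
    # P1c sketch: quantize the continuous property into a word-level
    # category.
    return "small" if size <= 2 else "large"

stream = [0.1, 0.9, 0.8, 0.0, 0.7, 0.9, 0.95, 0.2]  # stand-in for M1's output
objs = p1a_individuate(stream)
print({oid: p1c_categorize(p1b_extract_size(s)) for oid, s in objs.items()})
# {0: 'small', 1: 'large'}
```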
Two questions of fundamental importance:
Q1) How do I estimate another agent's situation model?
Q2) How do I align situation models across agents?
Motivation
How do we interconnect language with perception/action subsystems in an SCA? (why NLP cannot PNP)
How can we find a systematic way to build an SCA starting from a behavioral specification?
GSMs as "structured blackboards"
GSMs as cognitive-science-like "situation models"
GSM as a "theatrical stage"