language acquisition framework for robots: from grounded language acquisition to spoken dialogues

LCore: A Language Acquisition Framework for Robots: From Grounded Language Acquisition to Spoken Dialogues

Komei Sugiura and Naoto Iwahashi
National Institute of Information and Communications Technology, [email protected]

2013/12/13

Open problem: grounded language processing

• Language processing based on non-verbal information (vision, motion, context, experience, …) is still very difficult
– e.g. “Put the blue cup away”, “Give me the usual”
– “blue cup”: multiple candidates
– “the usual”: umbrella, remote, drink, …

• What is missing in dialog processing for robots?
– Physical situatedness / symbol grounding
– Shared experience


Spoken dialogue system + Robot ≠ Robot dialogue

• Robot dialogue
– Categorization/prediction of real-world information
– Handling real-world properties
– Linguistic interaction
• Why is this difficult?
– Machine learning, CV, manipulation, symbol grounding problem, speech recognition, …

[Figure: object category examples: Tableware, Cutlery, Fork, Knife, Cup, Tea cup, Plate]

Robot Language Acquisition Framework [Iwahashi 10, “Robots That Learn to Communicate: A Developmental Approach…”]

• Task: Object manipulation dialogues
• Key features
– Fully grounded vocabulary
– Imitation learning
– Incremental & interactive learning
– Language independent


LCore functions

Phoneme learning

Word learning

Grammar learning

Disambiguation of word ellipsis

Utterance understanding

Robot-directed utterance detection

Learning question answering

Visual feature learning

Affordance learning

Imitation learning

Role reversal imitation

Active-learning-based dialogue


Learning modules

[Figure: learning modules: Word, Grammar, Motion-object relationship]

• Learning verbs
– Estimation of related objects
– Learning trajectories
– Learning phoneme sequences
• Learning nouns/adjectives
– Learning probabilistic distributions of visual features
– Learning phoneme sequences

Symbol grounding: Learning nouns and adjectives

• Visual features modeled by Gaussians
– Input: visual features of objects
• Out-of-vocabulary word = phoneme sequence + waveform
– Voice conversion (Eigenvoice GMM) to robot voice

[Figure: generative models for the words BLUE and RED, and an unknown object]
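To make the idea of grounding a word in a Gaussian over visual features concrete, here is a minimal sketch. It is not the LCore implementation: the feature values, the regularization constant `reg`, and the helper names `fit_gaussian`, `log_likelihood`, and `best_word` are all assumptions made for illustration.

```python
import numpy as np

def fit_gaussian(features, reg=1e-3):
    """Estimate mean and (regularized) covariance from an (N, D) feature array."""
    X = np.asarray(features, dtype=float)
    mean = X.mean(axis=0)
    # reg is an assumed constant that keeps the covariance invertible.
    cov = np.cov(X, rowvar=False) + reg * np.eye(X.shape[1])
    return mean, cov

def log_likelihood(x, mean, cov):
    """Log density of a multivariate Gaussian at feature vector x."""
    diff = np.asarray(x, dtype=float) - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (mean.size * np.log(2.0 * np.pi) + logdet + maha)

def best_word(x, models):
    """Return the word whose Gaussian gives x the highest log-likelihood."""
    return max(models, key=lambda word: log_likelihood(x, *models[word]))

# Hypothetical RGB-like features observed together with each word.
training = {
    "RED":  [[0.9, 0.1, 0.1], [1.0, 0.0, 0.2], [0.8, 0.2, 0.0], [0.95, 0.05, 0.1]],
    "BLUE": [[0.1, 0.1, 0.9], [0.0, 0.2, 1.0], [0.2, 0.0, 0.8], [0.05, 0.1, 0.95]],
}
models = {word: fit_gaussian(feats) for word, feats in training.items()}

unknown_object = [0.85, 0.15, 0.1]
print(best_word(unknown_object, models))
```

Scoring an unknown object against every word model is what lets a learned word be applied to objects that were never seen during teaching.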

Imitation learning of object manipulation [Sugiura+ 07]

• Difficulty: Clustering trajectories in the world coordinate system does not work
• Proposed method
– Input: Position sequences of all objects
– Estimation of reference point and coordinate system by EM algorithm
– Number of states is optimized by cross-validation

Place A on B
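The difficulty with world coordinates can be illustrated with a small sketch. It only shows the translation to a reference point; the EM-based estimation of the reference point and coordinate system is not reproduced here. The demonstrations, the box positions, and the helper `to_reference_frame` are hypothetical.

```python
import numpy as np

def to_reference_frame(trajectory, reference_point):
    """Express a (T, D) position sequence relative to a reference point."""
    return np.asarray(trajectory, dtype=float) - np.asarray(reference_point, dtype=float)

# Two hypothetical "place A on B" demonstrations; B sits somewhere else each time.
demo_1 = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
box_1 = [2.0, 2.0]
demo_2 = [[5.0, 1.0], [4.0, 2.0], [3.0, 3.0]]
box_2 = [3.0, 3.0]

# End points disagree in the world frame but coincide relative to B.
end_world = [demo_1[-1], demo_2[-1]]
end_relative = [
    to_reference_frame(demo_1, box_1)[-1],
    to_reference_frame(demo_2, box_2)[-1],
]
print("world frame end points:", end_world)
print("reference frame end points:", end_relative)
```

In the world frame the two demonstrations look unrelated, while in the frame of the reference object they end at the same place, which is why the reference point has to be estimated before trajectories are modeled.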

Imitation learning using reference-point-dependent HMMs [Sugiura+ 07][Sugiura+ 11]

• Delta parameters
– Position at time t
– [Equations: delta-parameter definitions, “= …”]
• Searching optimal coordinate system
– [Equation: search over reference object ID, coordinate system type, and HMM parameters]

* Sugiura, K. et al, “Learning, Recognition, and Generation of Motion by …”, Advanced Robotics, Vol.25, No.17, 2011

Results: motion learning

[Figure: learned motions: Place-on, Move-closer, Raise, Rotate, Jump-over, Move-away, Move-down]

[Figure: training-set log likelihood for the motion “place-on”; labels: Position, Velocity]

No verb is estimated to have the world coordinate system (WCS) -> reference-point-dependent verbs

[Figure: HMM “Place-on” (Place X on Y)]

Transformation of reference-point-dependent HMMs [Sugiura+ 11]

• What is the problem?
– Simple HMMs do not generate continuous trajectories
– Situation-dependent trajectories
• Reference-point-dependent HMM
– Input: (motion ID, object ID), e.g. <place-on, Object 1, Object 3>
– Output: Maximum likelihood trajectory

[Figure: HMM “Place-on” (Place X on Y), World CS, Situation]

* Sugiura, K. (2011), “Learning, Generation, and Recognition of Reference-Point-Dependent Probabilistic…”

Generating continuous trajectory using delta parameters [Tokuda+ 00]

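• $q$: state sequence, $\lambda$: HMM parameters
• $\xi$: time series of (position, velocity, acceleration)
• Maximum likelihood trajectory

$$\hat{x} = \arg\max_{x} P(\xi \mid q, \lambda) = (W^\top \Sigma^{-1} W)^{-1} W^\top \Sigma^{-1} \mu, \qquad \xi = W x$$

– $\mu$: vector of mean vectors
– $\Sigma$: matrix of covariance matrices of each OPDF
– $W$: filter ($\xi = W x$)
– $x$: time series of position

* Tokuda, K. et al., “Speech parameter generation algorithms for HMM-based speech synthesis”, 2000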
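The maximum likelihood trajectory under delta-parameter constraints can be sketched in a few lines of NumPy. This is an illustrative 1-D version of the parameter generation idea in [Tokuda+ 00], not the authors' code. It assumes diagonal covariances, a fixed state sequence already expanded to per-frame means and variances, and one common choice of delta windows (central difference for velocity, second difference for acceleration). The names `build_delta_matrix` and `ml_trajectory` are hypothetical.

```python
import numpy as np

def build_delta_matrix(T):
    """Build W (3T x T) mapping positions to stacked (position, velocity, acceleration)."""
    W = np.zeros((3 * T, T))
    for t in range(T):
        prev_t, next_t = max(t - 1, 0), min(t + 1, T - 1)  # clamp at the boundaries
        W[3 * t, t] = 1.0                 # position
        W[3 * t + 1, next_t] += 0.5       # velocity: 0.5 * (x[t+1] - x[t-1])
        W[3 * t + 1, prev_t] -= 0.5
        W[3 * t + 2, prev_t] += 1.0       # acceleration: x[t-1] - 2 x[t] + x[t+1]
        W[3 * t + 2, t] -= 2.0
        W[3 * t + 2, next_t] += 1.0
    return W

def ml_trajectory(means, variances):
    """Solve (W^T S^-1 W) x = W^T S^-1 mu for the position sequence x.

    means, variances: (T, 3) arrays of per-frame (position, velocity,
    acceleration) Gaussian parameters taken from the HMM state at each frame.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    W = build_delta_matrix(means.shape[0])
    mu = means.reshape(-1)
    precision = 1.0 / variances.reshape(-1)   # diagonal covariance assumed
    A = W.T @ (precision[:, None] * W)
    b = W.T @ (precision * mu)
    return np.linalg.solve(A, b)

# Hypothetical two-state sequence: stay near 0 for three frames, then near 1.
demo_means = np.array([[0.0, 0.0, 0.0]] * 3 + [[1.0, 0.0, 0.0]] * 3)
demo_vars = np.full_like(demo_means, 0.1)
print(ml_trajectory(demo_means, demo_vars))
```

Because the velocity and acceleration rows of `W` tie neighboring frames together, the solution changes smoothly between states instead of jumping from one state mean to the next.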

Quantitative results

• Evaluation measure
– Euclidean distance
– Normalized by frame number T

Trajectory by proposed method

Trajectory by Subject
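One plausible reading of this measure is the per-frame Euclidean distance between the generated trajectory and the subject's trajectory, summed and divided by T. The sketch below follows that reading. It assumes both trajectories have the same number of frames, and the function name `normalized_trajectory_distance` is hypothetical.

```python
import numpy as np

def normalized_trajectory_distance(generated, reference):
    """Sum of per-frame Euclidean distances between two (T, D) trajectories, divided by T."""
    generated = np.asarray(generated, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if generated.shape != reference.shape:
        raise ValueError("trajectories must have the same shape (T, D)")
    T = generated.shape[0]
    return float(np.linalg.norm(generated - reference, axis=1).sum() / T)

# Hypothetical 2-frame example: frame distances are 5 and 0.
proposed = [[0.0, 0.0], [1.0, 1.0]]
subject = [[3.0, 4.0], [1.0, 1.0]]
print(normalized_trajectory_distance(proposed, subject))
```

Dividing by T keeps the score comparable across motions of different durations.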

SPOKEN LANGUAGE UNDERSTANDING USING NON-LINGUISTIC INFORMATION

Utterance understanding in LCore (1)
• User utterances are understood by using multimodal information learned in a statistical learning framework

Shared belief
– Speech (HMM)
– Motion (HMM)
– Vision (Bayesian learning of a Gaussian)
– Motion-object relationship (Bayesian learning of a Gaussian)
– Context (MCE learning)

Integration of multimodal information

• Shared belief Ψ: weighted sum of five modules

[Equation: Ψ as a weighted sum of the Speech, Motion, Vision, Motion-object relationship, and Context terms, defined over utterance, action, scene, and context]
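The weighted sum can be sketched as follows. The module scores are treated here as log-likelihood-like numbers, and the weights, the candidate labels, and the function names `shared_belief` and `rank_candidates` are made up for illustration. In the framework the weights are learned rather than set by hand.

```python
MODULES = ("speech", "motion", "vision", "motion_object", "context")

def shared_belief(scores, weights):
    """Weighted sum of the five module scores for one candidate interpretation."""
    return sum(weights[m] * scores[m] for m in MODULES)

def rank_candidates(candidates, weights):
    """Sort (label, scores) pairs from highest to lowest shared belief."""
    return sorted(candidates, key=lambda c: shared_belief(c[1], weights), reverse=True)

# Hypothetical weights and module scores for two interpretations of one utterance.
weights = {"speech": 1.0, "motion": 0.5, "vision": 1.0, "motion_object": 0.5, "context": 2.0}
candidates = [
    ("place-on <Elmo, cup>",
     {"speech": -2.0, "motion": -4.0, "vision": -1.0, "motion_object": -6.0, "context": -0.5}),
    ("place-on <Elmo, box>",
     {"speech": -2.0, "motion": -1.0, "vision": -1.0, "motion_object": -2.0, "context": -0.5}),
]

for label, scores in rank_candidates(candidates, weights):
    print(label, shared_belief(scores, weights))
```

Both candidates have the same speech score here, so the ranking is decided by the non-linguistic modules, which is the point of integrating them.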

Inter-module learning

– Multimodal understanding
– Confidence learning
– Utterance/Motion generation

[Figure: user intention “Place Elmo on box”; utterances “Place Elmo”, “Place it”]

Grounded utterance disambiguation

• Simple dialog systems
– U: “Place the cup (on the table).”
– R: “You said place the cup.”
-> Risk of motion failure
• Generating confirmation utterances using physical information
– R: “I’ll place the red cup on the table, is it OK?”

[Figure callouts: Where to? Which “cup”?]

Multimodal utterance understanding

[Figure: candidate interpretations of “Place-on Elmo”, ranked 1st, 2nd, …, 30th]

Sugiura, K. et al., "Situated Spoken Dialogue with Robots Using Active Learning", Advanced Robotics, 2011

[Figure: ranked candidates (1st, 2nd, …, 30th) for “Place-on Elmo”, with the margin indicated]

Confirmation by paraphrasing user’s utterance

• Learning phase
– Bayesian Logistic Regression
– Input: Margin (d), Output: probability

[Figure: probability as a function of margin]

• Execution phase
– Decision-making on responses based on expected utility
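The two phases can be illustrated with a small sketch that maps a margin to a success probability and then picks a response by expected utility. It uses a plain logistic function with fixed, made-up coefficients `w0` and `w1` instead of Bayesian logistic regression. The utility values and the two-way choice between executing and confirming are also assumptions.

```python
import math

def success_probability(margin, w0=-1.0, w1=0.5):
    """Logistic map from margin d to the probability that the top candidate is correct."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * margin)))

def choose_response(margin, u_success=1.0, u_failure=-4.0, u_confirm=0.0):
    """Execute the motion only if its expected utility beats asking for confirmation."""
    p = success_probability(margin)
    expected_execute = p * u_success + (1.0 - p) * u_failure
    return "execute" if expected_execute > u_confirm else "confirm"

# A small margin means the best and runner-up interpretations are close.
for d in (0.0, 5.0, 20.0):
    print(d, round(success_probability(d), 3), choose_response(d))
```

Making a failed motion costlier than a confirmation question pushes the robot to ask when the margin is small and to act directly when it is large.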

Quantitative result: Risk reduction

[Figure: Baseline vs. Proposed; measures: failure rate, rejection rate, confirmation rate, number of confirmation utterances]

Decreased to 1/4

Reduction of motion failure in learning phase [Sugiura+ 11]

• So far…
– Learning utterance understanding probabilities

Sugiura, K. et al, "Situated Spoken Dialogue with Robots Using Active Learning", Advanced Robotics, Vol. 25, No. 17, 2011

• Idea
– Learning-by-asking

Phase               Operator   Motion executor
Active learning     Robot      User
(Passive) learning  User       Robot
Execution           User       Robot

[Figure legend: Motion success, Motion failure]

Reduction of motion failure in learning phase

[Figure: motion success and motion failure in the learning phase, active learning phase, and execution phase; “safe” training data]

• Problem:
– Motion failure is required in learning phase to avoid over-fitting

What kind of commands are effective for learning?

Target action         Robot utterance             Loss
Act=A, Objs=<1,3>     “Place-on Elmo blue box”    35.8
Act=A, Objs=<1,3>     “Place-on Elmo”             12.3
Act=A, Objs=<1,2>     “Place-on Elmo”             28.1
…
Act=B, Objs=<2>       “Raise box”                 332.3
…

• Proposed method: Active learning-based command generation
• Objective: Reduce the number of interactions
• [Input = image], [Output = utterance]
• Expected Log Loss Reduction (ELLR [Roy, 2001]) is used to select the optimal utterance

Active learning: a form of supervised learning in which inputs can be selected by the algorithm
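The selection principle behind expected log loss reduction can be sketched on a toy problem. The code below is a simplified stand-in, not the system's utterance generator. It assumes each candidate command is summarized by a single scalar feature, uses a plain gradient-descent logistic model, and reuses the candidates as the evaluation pool. For every candidate it simulates both outcomes, weights them by the current model's prediction, retrains, and measures the expected log loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(x, y, steps=500, lr=0.1, l2=0.01):
    """Fit p(y=1|x) = sigmoid(w0 + w1*x) by gradient descent on regularized log loss."""
    X = np.column_stack([np.ones_like(x), x])
    w = np.zeros(2)
    for _ in range(steps):
        w -= lr * (X.T @ (sigmoid(X @ w) - y) / len(y) + l2 * w)
    return w

def expected_log_loss(w, pool):
    """Average entropy of the model's own predictions over the pool."""
    p = np.clip(sigmoid(w[0] + w[1] * pool), 1e-12, 1.0 - 1e-12)
    return float(np.mean(-(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))))

def select_query(x_train, y_train, candidates):
    """Return the index of the candidate with the lowest expected log loss after retraining."""
    w_now = fit_logistic(x_train, y_train)
    losses = []
    for c in candidates:
        p_success = sigmoid(w_now[0] + w_now[1] * c)
        loss = 0.0
        for label, prob in ((1.0, p_success), (0.0, 1.0 - p_success)):
            w_new = fit_logistic(np.append(x_train, c), np.append(y_train, label))
            loss += prob * expected_log_loss(w_new, candidates)
        losses.append(loss)
    return int(np.argmin(losses)), losses

# Hypothetical training episodes (feature, success flag) and candidate commands.
x_train = np.array([0.0, 1.0, 4.0, 5.0])
y_train = np.array([0.0, 0.0, 1.0, 1.0])
candidates = np.array([0.5, 2.5, 6.0])
best, losses = select_query(x_train, y_train, candidates)
print(best, losses)
```

Choosing the candidate with the lowest expected loss is what lets the learner reduce the number of interactions compared with issuing commands at random.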

Utterance generation by ELLR

Reduction of motion failure in learning phase

[Figure: test-set likelihood vs. number of episodes for (1) Proposed and (2) Baseline]

[Figure: number of motion failures, Proposed vs. Baseline]

Fast convergence; motion failure risk reduced