LCore: A Language Acquisition Framework for Robots
From Grounded Language Acquisition to Spoken Dialogues
Komei Sugiura and Naoto Iwahashi
National Institute of Information and Communications Technology, [email protected]
2013/12/13
Open problem: grounded language processing
• Language processing based on non-verbal information (vision, motion, context, experience, …) is still very difficult
– e.g. “Put the blue cup away”, “Give me the usual”
– “blue cup”: multiple candidates
– “the usual”: umbrella, remote, drink, …
• What is missing in dialog processing for robots?
– Physical situatedness / symbol grounding
– Shared experience
Spoken dialogue system + Robot ≠ Robot dialogue
• Robot dialogue
– Categorization/prediction of real-world information
– Handling real-world properties
– Linguistic interaction
• Why is this difficult?
– Machine learning, CV, manipulation, symbol grounding problem, speech recognition, …
[Figure: object categories: Tableware, Cutlery, Cup, Tea cup, Plate, Fork, Knife]
Robot Language Acquisition Framework [Iwahashi 10, “Robots That Learn to Communicate: A Developmental Approach…”]
• Task: Object manipulation dialogues
• Key features
– Fully grounded vocabulary
– Imitation learning
– Incremental & interactive learning
– Language independent
LCore functions
Phoneme learning
Word learning
Grammar learning
Disambiguation of word ellipsis
Utterance understanding
Robot-directed utterance detection
Learning question answering
Visual feature learning
Affordance learning
Imitation learning
Role reversal imitation
Active-learning-based dialogue
Learning modules
Word / Grammar / Motion-object relationship
• Learning verbs
• Estimation of related objects
• Learning trajectories
• Learning phoneme sequences
• Learning nouns/adjectives
• Learning probabilistic distributions of visual features
• Learning phoneme sequences
Symbol grounding: Learning nouns and adjectives
• Visual features modeled by Gaussians
– Input: visual features of objects
• Out-of-vocabulary word = phoneme sequence + waveform
– Voice conversion (Eigenvoice GMM) to robot voice
[Figure: generative models for “BLUE” and “RED”, and an unknown object]
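The idea of modeling each word's visual features with a Gaussian can be sketched as follows. This is a minimal illustration, not the LCore implementation: the three-dimensional color-like features, the training values, and the helper names `fit_gaussian`, `log_likelihood`, and `best_word` are all assumptions made for the example.

```python
import numpy as np

def fit_gaussian(features):
    """Fit a mean and covariance to the visual features observed with one word."""
    x = np.asarray(features, dtype=float)
    mean = x.mean(axis=0)
    # A small diagonal term keeps the covariance invertible with few samples.
    cov = np.cov(x, rowvar=False) + 1e-3 * np.eye(x.shape[1])
    return mean, cov

def log_likelihood(x, mean, cov):
    """Log density of a multivariate Gaussian at feature vector x."""
    d = np.asarray(x, dtype=float) - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = d @ np.linalg.solve(cov, d)
    return -0.5 * (len(mean) * np.log(2 * np.pi) + logdet + maha)

# Hypothetical color-like features for objects that were named "blue" or "red".
training = {
    "blue": [[0.1, 0.2, 0.9], [0.2, 0.1, 0.8], [0.1, 0.3, 0.85], [0.15, 0.2, 0.95]],
    "red": [[0.9, 0.1, 0.1], [0.8, 0.2, 0.2], [0.95, 0.15, 0.1], [0.85, 0.1, 0.2]],
}
models = {word: fit_gaussian(feats) for word, feats in training.items()}

def best_word(x):
    """Return the word whose Gaussian gives the observed features the highest likelihood."""
    return max(models, key=lambda w: log_likelihood(x, *models[w]))

print(best_word([0.12, 0.2, 0.9]))
```

An unknown object is scored against every word model, and the word with the highest likelihood is taken as its description.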
Imitation learning of object manipulation [Sugiura+ 07]
• Difficulty: Clustering trajectories in the world coordinate system does not work
• Proposed method
– Input: Position sequences of all objects
– Estimation of the reference point and coordinate system by the EM algorithm
– The number of states is optimized by cross-validation
Place A on B
Imitation learning using reference-point-dependent HMMs [Sugiura+ 07][Sugiura+ 11]
• Delta parameters
– Position at time t
– First-order delta (velocity) = …
– Second-order delta (acceleration) = …
• Searching for the optimal coordinate system
– Estimated jointly: reference object ID, coordinate system type, HMM parameters
* Sugiura, K. et al, “Learning, Recognition, and Generation of Motion by …”, Advanced Robotics, Vol.25, No.17, 2011
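The search over reference objects and coordinate system types can be illustrated with a deliberately simplified sketch. The assumptions are loud ones: a single diagonal Gaussian over the trajector's final position stands in for the HMM, the reference object is identified by a fixed index rather than estimated by EM, and the only coordinate system types are "world" and "object-centered translation". The demonstration data and all function names are hypothetical.

```python
import numpy as np

def gaussian_loglik(points):
    """Training-set log likelihood of points under their own ML diagonal Gaussian."""
    x = np.asarray(points, dtype=float)
    mean = x.mean(axis=0)
    var = x.var(axis=0) + 1e-4  # variance floor avoids log(0)
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)))

def end_point_in_frame(demo, cs_type, ref_id):
    """Express the trajector's final position in the candidate coordinate system."""
    end = np.asarray(demo["end"], dtype=float)
    if cs_type == "world":
        return end
    return end - np.asarray(demo["objects"][ref_id], dtype=float)

# Hypothetical "place-on" demonstrations: the trajector always ends just above object 0.
demos = [
    {"end": (1.0, 2.1), "objects": [(1.0, 2.0), (4.0, 0.5)]},
    {"end": (3.0, 0.6), "objects": [(3.0, 0.5), (0.0, 2.0)]},
    {"end": (-2.0, 1.1), "objects": [(-2.0, 1.0), (1.0, 1.0)]},
]

# Candidate (coordinate system type, reference object ID) pairs.
candidates = [("world", None), ("object", 0), ("object", 1)]

def score(candidate):
    cs_type, ref_id = candidate
    return gaussian_loglik([end_point_in_frame(d, cs_type, ref_id) for d in demos])

best = max(candidates, key=score)
print(best)
```

The demonstrations only look consistent in the frame attached to object 0, so that candidate gets the highest training-set likelihood, while the world coordinate system scores poorly.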
Results: motion learning
[Figure: learned motions: Place-on, Move-closer, Raise, Rotate, Jump-over, Move-away, Move-down]
[Figure: training-set log likelihood for the motion “place-on”, with position and velocity features]
• No verb is estimated to have WCS (world coordinate system) -> reference-point-dependent verbs
[Figure: HMM “Place-on”: Place X on Y]
Transformation of reference-point-dependent HMMs [Sugiura+ 11]
• What is the problem?
– Simple HMMs do not generate continuous trajectories
– Situation-dependent trajectories
• Reference-point-dependent HMM
– Input: (motion ID, object ID), e.g. <place-on, Object 1, Object 3>
– Output: Maximum likelihood trajectory
[Figure: HMM “Place-on”, world CS, situation: Place X on Y]
* Sugiura, K. (2011), “Learning, Generation, and Recognition of Reference-Point-Dependent Probabilistic…”
Generating continuous trajectory using delta parameters [Tokuda+ 00]
• Maximum likelihood trajectory, given
– the state sequence and the HMM parameters
– the time series of (position, velocity, acceleration)
– the vector of mean vectors
– the matrix of covariance matrices of each OPDF (output probability density function)
– the filter ( )
– the time series of position
*Tokuda, K. et al, “Speech parameter generation algorithms for HMM-based speech synthesis”, 2000
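A minimal one-dimensional sketch of maximum likelihood trajectory generation with delta parameters follows. It assumes the commonly used closed form x = (WᵀΣ⁻¹W)⁻¹WᵀΣ⁻¹μ, where W is the filter that maps positions to (position, velocity, acceleration), μ stacks the per-frame means, and Σ is taken to be diagonal. The symbols, the specific difference windows, and the boundary handling are assumptions for illustration, not taken from the slide.

```python
import numpy as np

def build_window_matrix(T):
    """Stack position, velocity, and acceleration filters into W with shape (3T, T)."""
    W = np.zeros((3 * T, T))
    for t in range(T):
        prev, nxt = max(t - 1, 0), min(t + 1, T - 1)  # clamp at the boundaries
        W[3 * t, t] = 1.0           # position
        W[3 * t + 1, nxt] += 0.5    # velocity: (x[t+1] - x[t-1]) / 2
        W[3 * t + 1, prev] -= 0.5
        W[3 * t + 2, nxt] += 1.0    # acceleration: x[t+1] - 2 x[t] + x[t-1]
        W[3 * t + 2, t] -= 2.0
        W[3 * t + 2, prev] += 1.0
    return W

def ml_trajectory(means, variances):
    """means, variances: shape (T, 3), read off the HMM states along the state sequence."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    W = build_window_matrix(len(means))
    mu = means.reshape(-1)
    precision = 1.0 / variances.reshape(-1)  # diagonal inverse covariance
    A = W.T @ (precision[:, None] * W)
    b = W.T @ (precision * mu)
    return np.linalg.solve(A, b)

# Hypothetical two-state sequence: position mean 0 for 5 frames, then 1 for 5 frames.
means = np.array([[0.0, 0.0, 0.0]] * 5 + [[1.0, 0.0, 0.0]] * 5)
variances = np.tile([0.1, 0.01, 0.01], (10, 1))
x = ml_trajectory(means, variances)
print(np.round(x, 3))
```

Because the velocity and acceleration terms are penalized together with the position term, the generated positions move smoothly between the state means instead of jumping at the state boundary.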
Quantitative results
• Evaluation measure
– Euclidean distance
– Normalized by frame number T
[Figure: trajectory by the proposed method vs. trajectory by a subject]
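One way to read this measure is sketched below, under the assumption that both trajectories have already been resampled to the same T frames and that the per-frame Euclidean distances are summed and divided by T. The slide does not spell out the exact formula, so the function `normalized_distance` is a hypothetical reading.

```python
import numpy as np

def normalized_distance(generated, reference):
    """Sum of per-frame Euclidean distances, divided by the number of frames T."""
    g = np.asarray(generated, dtype=float)
    r = np.asarray(reference, dtype=float)
    if g.shape != r.shape:
        raise ValueError("trajectories must have the same number of frames and dimensions")
    T = len(g)
    return float(np.linalg.norm(g - r, axis=1).sum() / T)

# Two frames: distances 0 and 5, so the normalized distance is 2.5.
print(normalized_distance([[0, 0], [3, 4]], [[0, 0], [0, 0]]))
```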
SPOKEN LANGUAGE UNDERSTANDING USING NON-LINGUISTIC INFORMATION
Utterance understanding in LCore (1)
• User utterances are understood by using multimodal information learned in a statistical learning framework
[Figure: shared belief built from five modules: Speech (HMM), Motion (HMM), Vision (Bayesian learning of a Gaussian), Motion-object relationship (Bayesian learning of a Gaussian), Context (MCE learning)]
Integration of multimodal information
• Shared belief Ψ: weighted sum of five modules
– Speech
– Motion
– Vision
– Motion-object relationship
– Context
[Equation: Ψ annotated with context, scene, action, utterance]
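The weighted-sum integration can be sketched as below. The module scores (treated here as log-likelihood-like values), the weights, and the candidate actions are all made-up numbers for illustration; in the framework the weights are learned, which this sketch does not attempt.

```python
MODULES = ("speech", "motion", "vision", "motion_object", "context")

def shared_belief(scores, weights):
    """Weighted sum of the five module scores for one candidate action."""
    return sum(weights[m] * scores[m] for m in MODULES)

# Hypothetical weights and per-module scores for two candidate actions.
weights = {"speech": 1.0, "motion": 1.0, "vision": 1.0, "motion_object": 1.0, "context": 0.5}
candidates = {
    "place-on <Elmo, box>": {"speech": -10.0, "motion": -5.0, "vision": -3.0,
                             "motion_object": -2.0, "context": -1.0},
    "place-on <Elmo, cup>": {"speech": -10.0, "motion": -6.0, "vision": -8.0,
                             "motion_object": -4.0, "context": -2.0},
}

beliefs = {action: shared_belief(s, weights) for action, s in candidates.items()}
best_action = max(beliefs, key=beliefs.get)
print(best_action, beliefs[best_action])
```

The action with the highest Ψ is taken as the interpretation of the utterance in the current scene.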
Inter-module learning
[Figure: multimodal understanding, confidence learning, and utterance/motion generation; user intention: “Place Elmo on box”; utterances: “Place Elmo”, “Place it”]
Grounded utterance disambiguation
• Simple dialog systems
U: “Place the cup (on the table).”
R: “You said place the cup.”
-> Risk of motion failure (Where to? Which “cup”?)
• Generating confirmation utterances using physical information
R: “I’ll place the red cup on the table, is it OK?”
Multimodal utterance understanding
[Figure: candidate actions for the utterance “Place-on Elmo”, ranked 1st, 2nd, …, 30th]
Sugiura, K. et al, "Situated Spoken Dialogue with Robots Using Active Learning", Advanced Robotics, 2011
Multimodal utterance understanding
[Figure: candidate actions for the utterance “Place-on Elmo”, ranked 1st, 2nd, …, 30th, with the margin marked]
Confirmation by paraphrasing user’s utterance
• Learning phase
– Bayesian logistic regression
– Input: margin (d), Output: probability
[Figure: probability as a function of margin]
• Execution phase
– Decision-making on responses based on expected utility
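The two phases can be sketched together as below. Several things are assumed for illustration: the margin is taken as the gap between the best and second-best candidate scores, a plain logistic function with fixed hypothetical coefficients stands in for the learned Bayesian logistic regression, and the utilities of success, failure, and asking for confirmation are invented values.

```python
import math

def margin(scores):
    """Gap between the best and second-best candidate scores (assumed definition)."""
    ranked = sorted(scores, reverse=True)
    return ranked[0] - ranked[1]

def success_probability(d, w=1.0, b=-2.0):
    """Logistic map from margin d to a probability; w and b are hypothetical."""
    return 1.0 / (1.0 + math.exp(-(w * d + b)))

def choose_response(p, u_success=1.0, u_failure=-5.0, u_confirm=-0.2):
    """Pick the response with the highest expected utility (utilities are hypothetical)."""
    expected = {
        "execute": p * u_success + (1.0 - p) * u_failure,
        "confirm": u_confirm,
    }
    return max(expected, key=expected.get)

for scores in ([-20.5, -29.0, -40.0], [-20.5, -21.0, -40.0]):
    d = margin(scores)
    p = success_probability(d)
    print(d, round(p, 3), choose_response(p))
```

With a large margin the robot executes the motion directly; with a small margin the expected cost of a failed motion outweighs the cost of asking, so it confirms first.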
Quantitative result: Risk reduction
[Figure: Baseline vs. Proposed: failure rate, rejection rate, confirmation rate, # of confirmation utterances]
• Decreased to 1/4
Reduction of motion failure in learning phase [Sugiura+ 11]
• So far…
– Learning utterance understanding probabilities
• Idea
– Learning-by-asking

Phase               Operator  Motion executor
Active learning     Robot     User
(Passive) learning  User      Robot
Execution           User      Robot

[Figure legend: motion success / motion failure]
Sugiura, K. et al, "Situated Spoken Dialogue with Robots Using Active Learning", Advanced Robotics, Vol. 25, No. 17, 2011
Reduction of motion failure in learning phase
[Figure: execution phase, learning phase, and active learning phase, with motion success and motion failure marked; “safe” training data]
• Problem:
– Motion failure is required in the learning phase to avoid over-fitting
What kind of commands are effective for learning?
Target action        Robot utterance            Loss
Act=A, Objs=<1,3>    “Place-on Elmo blue box”   35.8
Act=A, Objs=<1,3>    “Place-on Elmo”            12.3
Act=A, Objs=<1,2>    “Place-on Elmo”            28.1
:                    :                          :
Act=B, Objs=<2>      “Raise box”                332.3
:                    :                          :
• Proposed method: Active Learning-based command generation
– Objective: Reduce the number of interactions
– [Input = image], [Output = utterance]
– Expected Log Loss Reduction (ELLR [Roy, 2001]) is used to select the optimal utterance
Active Learning: A form of supervised learning in which inputs can be selected by the algorithm
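Only the final selection step is sketched here, assuming the Loss column of the table above is the expected log loss already estimated for each candidate command and that the candidate with the smallest value is chosen. Estimating that loss (re-training on each hypothetical answer) is omitted, and the `Candidate` type is a hypothetical container introduced for the example.

```python
from typing import NamedTuple, Tuple

class Candidate(NamedTuple):
    action: str
    objects: Tuple[int, ...]
    utterance: str
    expected_loss: float  # assumed to be precomputed

# The visible rows of the table; elided rows are not reproduced.
candidates = [
    Candidate("A", (1, 3), "Place-on Elmo blue box", 35.8),
    Candidate("A", (1, 3), "Place-on Elmo", 12.3),
    Candidate("A", (1, 2), "Place-on Elmo", 28.1),
    Candidate("B", (2,), "Raise box", 332.3),
]

def select_command(cands):
    """Choose the command whose expected log loss is smallest."""
    return min(cands, key=lambda c: c.expected_loss)

chosen = select_command(candidates)
print(chosen.utterance, chosen.objects, chosen.expected_loss)
```

Among the visible rows, the shorter command “Place-on Elmo” for objects <1,3> has the lowest value and would be the one the robot asks the user to carry out.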
Utterance generation by ELLR
Reduction of motion failure in learning phase
[Figure: test-set likelihood vs. number of episodes for (1) Proposed and (2) Baseline: fast convergence]
[Figure: # of motion failures for Proposed and Baseline: motion failure risk reduced]