TRANSCRIPT
PERCEPTION-LINK BEHAVIOR MODEL:
REVISIT ENCODER & DECODER (IMI PhD Presentation)
Presenter: William Gu Yuanlong (PhD student)
Supervisor: Assoc. Prof. Gerald Seet Gim Lee
Co-Supervisor: Prof. Nadia Magnenat-Thalmann
CONTENT
• Introduction
• Summary of reviewed interface
• Overview of the proposed framework
• Encoder and Decoder
• Conclusion
• Future work
TELEPRESENCE
Telepresence (sense of being there) vs. "tele" social presence (sense of being together) [1]
Reference [1] F. Biocca et al., “The networked minds measure of social presence: Pilot test of the factor structure and concurrent validity,” in International Workshop on Presence, 2001.
COMMUNICATION MEDIUMS
Reference
[1] E. Paulos, "Personal Tele-Embodiment," University of California at Berkeley, 2002.
[2] K. M. Tsui et al., "Towards Measuring the Quality of Interaction: Communication through Telepresence Robots," in Performance Metrics for Intelligent Systems Workshop, 2012.
• Distance telecommunication
  • An essential tool
  • Advantages
    • Improves productivity
    • Eases constraints on resources
• Face-to-face communication
  • The gold standard: how you say it is more important than what you say
  • Advantage
    • More social richness
MOTIVATION
Reference
[1] E. Paulos, "Personal Tele-Embodiment," University of California at Berkeley, 2002.
[2] C. Breazeal, "MeBot: A robotic platform for socially embodied telepresence," in Proc. 5th ACM/IEEE International Conference on Human-Robot Interaction, 2010.
[3] K. Hasegawa and Y. Nakauchi, "Preliminary Evaluation of a Telepresence Robot Conveying Pre-motions for Avoiding Speech Collisions," in Proc. International Conference on Human-Agent Interaction, 2013.
[Figure: degree of social presence vs. anthropomorphism in terms of appearance and functionality, plotting PRoP [1], MeBot [2], Hasegawa's Bot [3], EDGAR, and face-to-face interaction.]
EDGAR
• Wider range of nonverbal cues; less certain postures
• Life-sized system
• Rear-projection robotic head for a realistic face display
Commercial TPR
• Limited nonverbal cues
• Semi-autonomous behavior
Existing academic TPR (MeBot and Hasegawa's Bot)
• Wider range of nonverbal cues
• Smaller systems
• Control systems contradict each other: passive model controller vs. natural interface
Goals:
- Improve the existing telepresence robot in terms of social presence.
- Two aspects of the work were explored:
  1) Physical appearance (EDGAR)
  2) Operator's interface (PLB)
SUMMARY: REVIEW OF THE OPERATOR’S INTERFACE
Reference
[1] C. Breazeal, "MeBot: A robotic platform for socially embodied telepresence," in Proc. 5th ACM/IEEE International Conference on Human-Robot Interaction, 2010.
[2] K. Hasegawa and Y. Nakauchi, "Preliminary Evaluation of a Telepresence Robot Conveying Pre-motions for Avoiding Speech Collisions," in Proc. International Conference on Human-Agent Interaction, 2013.
[3] H. Park, E. Kim, S. Jang, and S. Park, "HMM-based gesture recognition for robot control," in Pattern Recognition and Image Analysis, 2005, pp. 607–614.
[4] J. M. Susskind et al., "Generating Facial Expressions with Deep Belief Nets," in Affective Computing, Emotion Modeling, Synthesis and Recognition, 2008.
GENERAL FRAMEWORK
• Perception-link behavior system integration
• Encodes various features into their styles
  • Convolutional Neural Network with Restricted Boltzmann Machine and max pooling [1]
• Associates the styles of various features, for both operator and interactants
  • FUSION adaptive resonance theory [2]
• Decodes the current state based on the style and the previous state
  • Factored gated Restricted Boltzmann Machine [3]
• Natural interface
Reference
[1] H. Lee et al., "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," in Proc. 26th Annual International Conference on Machine Learning, 2009.
[2] A. Tan et al., "Intelligence through interaction: Towards a unified theory for learning," in Advances in Neural Networks, 2007.
[3] R. Memisevic and G. E. Hinton, "Learning to represent spatial transformations with factored higher-order Boltzmann machines," Neural Computation, 2010.
A novel, flexible model that exhibits expressive nonverbal cues without compromising safety or operator cognitive load.
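The three stages above (encoder, associator, decoder) can be wired together conceptually as below. This is a minimal sketch of the data flow only; the function name `plb_step` and the callable interfaces are hypothetical, and each stage (CNN-RBM encoder, fusion-ART associator, FRBM decoder) is assumed to be trained separately and passed in as a callable.

```python
# Hypothetical wiring of the three PLB stages; each stage's trained
# model is assumed to be exposed here as a plain callable.

def plb_step(operator_frames, interactant_frames,
             encoder, associator, decoder, prev_state):
    """One perception-link behavior update.

    1. Encode raw features of both parties into style codes.
    2. Associate the two styles into a joint style z_t.
    3. Decode the robot's next state from z_t and the previous state.
    """
    z_op = encoder(operator_frames)      # operator's style
    z_in = encoder(interactant_frames)   # interactant's style
    z_t = associator(z_op, z_in)         # joint style code
    return decoder(z_t, prev_state)      # next robot state
```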
REVISITING ENCODER
• Revisited the gesture encoder
• Additional database
• Compared various unsupervised methods
  • BOW – K-means
  • BOW – GMM
  • CNN-RBM-Max
• Evaluated via intra- and inter-cluster distances between known labels
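The intra-/inter-cluster evaluation mentioned above can be sketched as follows; a minimal sketch assuming encodings are NumPy row vectors with integer labels (the function name `cluster_distances` is hypothetical):

```python
import numpy as np

def cluster_distances(encodings, labels):
    """Mean intra-cluster and inter-cluster Euclidean distances.

    encodings: (n_samples, n_features) array of encoded signals.
    labels:    (n_samples,) array of known gesture labels.
    """
    centroids = {c: encodings[labels == c].mean(axis=0)
                 for c in np.unique(labels)}
    # Intra: average distance from each sample to its own centroid.
    intra = np.mean([np.linalg.norm(x - centroids[c])
                     for x, c in zip(encodings, labels)])
    # Inter: average pairwise distance between distinct centroids.
    cs = list(centroids.values())
    inter = np.mean([np.linalg.norm(cs[i] - cs[j])
                     for i in range(len(cs))
                     for j in range(i + 1, len(cs))])
    return intra, inter
```

A good encoder should yield a small intra-cluster distance relative to the inter-cluster distance.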
[Figure: Convolutional Neural Network via Restricted Boltzmann Machine and max pooling. A window of T input frames i_t ... i_{t−T+1} is convolved with a shared weight of window size c into hidden vectors h^(0) ... h^(T−c+1), each with units 1 ... n ... N; max pooling over positions yields the labeled encoded signal.]

h^(k) = f(i_{t−k : t−k−c+1}; W, b)
h_t = max(h^(0 : (T−c+1)))
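The convolution-plus-max-pooling encoding on this slide can be sketched as below. This is an illustrative forward pass only, assuming sigmoid hidden units; the function names are hypothetical, and in the actual model W and b would be learned by RBM training rather than supplied directly.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(window, W, b, c):
    """Encode a window of T input frames into one hidden vector.

    window: (T, d) input frames i_t ... i_{t-T+1}.
    W:      (c * d, N) convolution weights, shared across positions.
    b:      (N,) hidden biases.
    c:      convolution window size (c <= T).
    Computes h^(k) = f(i_{k:k+c-1}; W, b) at each of the T-c+1
    positions, then max-pools over positions: h = max_k h^(k).
    """
    T, d = window.shape
    hs = np.stack([sigmoid(window[k:k + c].ravel() @ W + b)
                   for k in range(T - c + 1)])
    return hs.max(axis=0)   # (N,) encoded signal
```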
DECODER FOR GESTURES
• Two main considerations
  • Capability to generate different gestures given any encoded signal
  • Capability to generate similar variations of gestures if the encoded signals are close to each other
[Figure: basic concept behind encoding and decoding signals; one possible application: collision prevention.]
FRBM MODEL
• Factored Gated Restricted Boltzmann Machine
• Bottom-up pass to estimate h_t given i_{(t−1):(t−T+1)} and z_t
• Top-down pass to infer i_t
[Figure: factored gated RBM. Input frames i_{t−1} ... i_{t−T+1} and i_t connect through factored weights W1, W2, W3 to the hidden units h_t; the encoded signal z_t gates the factors via R.]

h_t = f(W1 · i_{t:t−T+1} ∘ [W3 · R · z_t]; W2, b)
i_{t:t−T+1} = g(W2′ · h_t ∘ [W3 · R · z_t]; W1, a)
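The bottom-up and top-down passes above can be sketched numerically as below. This is a minimal sketch assuming sigmoid hidden units and linear (identity g) visible units; the function names and the exact dimensioning of the factor matrices are hypothetical and chosen only so the equations type-check.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bottom_up(i_hist, z, W1, W2, W3, R, b):
    """Estimate hidden state h_t from input history and style z_t.

    i_hist: (T*d,) concatenated frames i_t ... i_{t-T+1}.
    z:      (m,) encoded signal z_t; R: (r, m) projection.
    W1: (F, T*d), W2: (F, N), W3: (F, r) factored weights.
    h_t = f(W2' . (W1 i  o  W3 R z) + b), with o elementwise.
    """
    gate = W3 @ (R @ z)              # style gates each factor
    return sigmoid(W2.T @ ((W1 @ i_hist) * gate) + b)

def top_down(h, z, W1, W2, W3, R, a):
    """Infer the visible frames from h_t through the same gate
    (linear visible units assumed)."""
    gate = W3 @ (R @ z)
    return W1.T @ ((W2 @ h) * gate) + a
```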
GESTURES GENERATION @ DIFFERENT LABELS
[Figure: input Z (encoded signals), plotted as intensity of feature #18 (normalized) vs. frame index (15 Hz) for labels G1–G5; output gestures rendered from side, front, and top views (animation is looped).]
Given a specific encoded signal (top), a unique gesture (right) can be reconstructed.
[Figure: two panels plotting the intensity of each feature in Z vs. the number of features in Z.]
GESTURES GENERATION @ A LABEL'S PROXIMITY
[Figure: input Z (encoded signals) N1, N2, N3, and the original; output gestures rendered from side, front, and top views (animation is looped).]
Given a set of encoded signals with similar intensities (top), a set of gestures (right) with similar traits can be reconstructed.
CONCLUSION
• Capability to generate different gestures given a specific set of encoded signals
• Capability to generate similar variations of gestures given three similar encoded signals
• Future challenges for the decoder
  • An evaluation method to prove the correctness of the decoded signals
  • A set of new features to encode and decode the frequency characteristics
  • A cheap, real-time method to explore non-collision encoded signals
[Figure: encoding/decoding pipeline, ideal vs. reality.]
FUTURE WORK
• Associator
  • Adaptive Resonance Theory (Euclidean)
• Encoder for the face
  • Currently, the model works only on the CK+ database (frontal views only)
  • Facial identity and expression
[Figure: gestures/postures associated with identities and expressions; PCA1–PCA3 projection of the face encodings.]
QUESTION AND ANSWER