
Page 1: Part 3: Audio-Visual Child-Robot Interaction

Computer Vision, Speech Communication & Signal Processing Group,
Intelligent Robotics and Automation Laboratory,
National Technical University of Athens, Greece (NTUA)

Robot Perception and Interaction Unit,
Athena Research and Innovation Center (Athena RIC)

Petros Maragos


Tutorial at INTERSPEECH 2018, Hyderabad, India, 2 Sep. 2018

slides: http://cvsp.cs.ntua.gr/interspeech2018

Part 3: Audio-Visual Child-Robot Interaction

Page 2


EU project BabyRobot: Experimental Setup Room

Page 3

Interspeech 2018 Tutorial: Multimodal Speech & Audio Processing in Audio-Visual Human-Robot Interaction

Video: experiments with TD (typically developing) children

Page 4


Perception-system architecture: Sense – Think – Act

Perception System (Sense):
• Visual stream: visual gesture recognition, action recognition, 3D object tracking, visual emotion recognition
• Audio stream: distant speech recognition, speech emotion recognition
• Audio-visual localization & tracking
• Text emotion recognition

Think:
• Action branch (audio- and visual-related information) → child's activity
• Behavioral branch (visual, speech, and text emotion recognition feeding behavioral monitoring) → child's behavioral state

Act:
• IrisTK behavior generation, connected through the IrisBroker, with Wizard-of-Oz supervision

Page 5

Experimental Setup: Hardware & Software

Page 6


Action Branch: Developed Technologies
• 3D object tracking
• Multi-view gesture recognition
• Multi-view action recognition
• Speaker localization and distant speech recognition

Page 7


Audio-Visual Localization – Evaluation
• Track multiple persons using the Kinect skeleton.
• Select the person closest to the estimated auditory source position.
• Rcor: percentage of correct estimates (deviation from ground truth below 0.5 m):
  audio-only source localization: 45.5%; audio-visual localization: 85.6%
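The selection rule above fits in a few lines. A minimal sketch, assuming 2D floor coordinates; the function names and the example positions are illustrative, while the 0.5 m Rcor threshold comes from the slide:

```python
# Sketch of the audio-visual localization scheme: pick the Kinect-tracked
# person nearest the (noisier) audio source estimate, and score with Rcor.
import math

def fuse_av_localization(skeleton_positions, audio_estimate):
    """Return the tracked skeleton position closest to the audio estimate."""
    return min(skeleton_positions, key=lambda p: math.dist(p, audio_estimate))

def rcor(estimates, ground_truth, threshold=0.5):
    """Rcor: percentage of estimates within `threshold` meters of ground truth."""
    correct = sum(math.dist(e, g) <= threshold
                  for e, g in zip(estimates, ground_truth))
    return 100.0 * correct / len(estimates)

# A coarse audio estimate still snaps to the correct person:
persons = [(0.0, 0.0), (2.0, 1.0)]
speaker = fuse_av_localization(persons, (1.6, 1.2))   # → (2.0, 1.0)
```

Snapping the audio estimate to the nearest tracked skeleton is what lifts Rcor from 45.5% to 85.6% on the slide: the visual track is accurate, while audio alone is coarse.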

Page 8


Multi-view Gesture Recognition
• Multiple views of the child's gesture from different sensors
• Fusion of the three sensors' decisions
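The slide does not spell out the fusion rule; one simple illustrative choice is a score-weighted vote over the per-sensor class scores (all labels and scores below are made up):

```python
# Decision-level fusion of three sensors' gesture hypotheses: sum each
# class's scores across sensors and keep the highest-scoring label.
from collections import defaultdict

def fuse_decisions(sensor_scores):
    """sensor_scores: one {gesture_label: score} dict per sensor."""
    totals = defaultdict(float)
    for scores in sensor_scores:
        for label, score in scores.items():
            totals[label] += score
    return max(totals, key=totals.get)   # label with the highest summed score

# Two of three sensors lean towards "point"; fusion agrees:
fused = fuse_decisions([{"point": 0.6, "stop": 0.4},
                        {"stop": 0.55, "point": 0.45},
                        {"point": 0.7, "stop": 0.3}])   # → "point"
```

Summing scores rather than counting hard votes lets a confident sensor outweigh two hesitant ones, which matters when one Kinect has a poor view of the child.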

Page 9


Gesture Recognition – Vocabulary
Nod, Greet, Come Closer, Sit, Stop, Point, Circle

Page 10


Multi-view Gesture Recognition – Evaluation
• 7 classes: nod, greet, come closer, sit, stop, point, circle
• Average classification accuracy (%) for the employed gestures performed by 28 children (development corpus)
• Results for the five different features in both single- and multi-stream cases

Page 11


Multi-view Gesture Recognition – Children vs. Adults
• Different training schemes: adult models, child models, mixed model
• Employed features: MBH (motion boundary histograms)

A. Tsiami, P. Koutras, N. Efthymiou, P. Filntisis, G. Potamianos, P. Maragos, “Multi3: Multi-sensory Perception System for Multi-modal Child Interaction with Multiple Robots”, Proc. ICRA, 2018.

Page 12


Distant Speech Recognition System
• DSR model training and adaptation per Kinect (Greek models)
• Collected data – example utterances:
  "I think that you are hammering a nail"
  "I think that you are painting"
  "I think that it is the rabbit"
  "It relates to peace"

Page 13


Spoken Command Recognition – Evaluation
• TD (typically developing) children data: 40 phrases
• Average word (WCOR) and sentence (SCOR) accuracy for the DSR task, per utterance set, for all adaptation choices
• 4-fold cross-validation
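A sketch of how WCOR and SCOR can be computed from paired reference and hypothesis transcripts, assuming the usual ASR definitions (WCOR = (N - S - D) / N over all reference words, SCOR = fraction of exactly recognized sentences); the utterances echo the earlier slide, and the function names are illustrative:

```python
# Word correctness (WCOR) and sentence accuracy (SCOR) for a DSR task.

def align_counts(ref, hyp):
    """Levenshtein-align two token lists; return (subs, dels, ins) counts."""
    n, m = len(ref), len(hyp)
    dp = [[None] * (m + 1) for _ in range(n + 1)]   # cells: (cost, S, D, I)
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        c, s, d, ins = dp[i - 1][0]
        dp[i][0] = (c + 1, s, d + 1, ins)           # all deletions
    for j in range(1, m + 1):
        c, s, d, ins = dp[0][j - 1]
        dp[0][j] = (c + 1, s, d, ins + 1)           # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c, s, d, ins = dp[i - 1][j - 1]
            miss = ref[i - 1] != hyp[j - 1]
            cands = [(c + miss, s + miss, d, ins)]  # match or substitution
            c, s, d, ins = dp[i - 1][j]
            cands.append((c + 1, s, d + 1, ins))    # deletion
            c, s, d, ins = dp[i][j - 1]
            cands.append((c + 1, s, d, ins + 1))    # insertion
            dp[i][j] = min(cands)                   # minimal total cost
    return dp[n][m][1:]

def score(refs, hyps):
    """Return (WCOR, SCOR) for paired reference/hypothesis sentences."""
    total = sum(len(r.split()) for r in refs)
    subs = dels = exact = 0
    for r, h in zip(refs, hyps):
        s, d, _ = align_counts(r.split(), h.split())
        subs += s
        dels += d
        exact += (r == h)
    return (total - subs - dels) / total, exact / len(refs)

refs = ["i think that you are painting", "it relates to peace"]
hyps = ["i think that you are painting", "it relates peace"]
wcor, scor = score(refs, hyps)   # → (0.9, 0.5)
```

Insertions lower the edit-distance score but, under this WCOR definition, do not reduce word correctness; SCOR is the stricter measure, since a single wrong word fails the whole command.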

Page 14


Spoken Command Recognition – Children vs. Adults
• Different training schemes: adult models, child models, mixed model

Page 15


Page 16


Action Recognition – Vocabulary
Cleaning a window, Ironing a shirt, Digging a hole, Driving a bus, Painting a wall, Hammering a nail, Wiping the floor, Reading, Swimming, Working out, Playing the guitar, Dancing

Page 17


Multi-view Action Recognition – Evaluation
• 13 classes of pantomime actions
• Average classification accuracy (%) for the actions performed by 28 children (development corpus)
• Results for the five different features in both single- and multi-stream cases

N. Efthymiou, P. Koutras, P. Filntisis, G. Potamianos, P. Maragos, “Multi-view Fusion for Action Recognition in Child-Robot Interaction”, Proc. ICIP, 2018.

Page 18


Multi-view Action Recognition – Children vs. Adults
• Different training schemes: adult models, child models, mixed model
• Employed features: MBH

Page 19


A. Tsiami, P. Filntisis, N. Efthymiou, P. Koutras, G. Potamianos, P. Maragos, “Multi3: Multi-sensory Perception System for Multi-modal Child Interaction with Multiple Robots”, Proc. ICRA, 2018.

Child-Robot Interaction: TD video – Rock Paper Scissors

Page 20


Part 3: Conclusions

Synopsis:
• Data collection and annotation: 28 TD and 15 ASD children (+ 20 adults)
• Audio-visual localization and tracking
• 3D object tracking
• Multi-view gesture and action recognition
• Distant speech recognition
• Multimodal emotion recognition

Ongoing work:
• Evaluate the whole perception system with TD and ASD children
• Extend and develop methods for engagement and behavioral understanding

Tutorial slides: http://cvsp.cs.ntua.gr/interspeech2018

For more information, demos, and current results: http://cvsp.cs.ntua.gr and http://robotics.ntua.gr