Computer Vision, Speech Communication & Signal Processing Group, Intelligent Robotics and Automation Laboratory,
National Technical University of Athens, Greece (NTUA)
Robot Perception and Interaction Unit, Athena Research and Innovation Center (Athena RIC)
Petros Maragos
Tutorial at INTERSPEECH 2018, Hyderabad, India, 2 Sep. 2018
slides: http://cvsp.cs.ntua.gr/interspeech2018
Part 3: Audio-Visual Child-Robot Interaction
EU project BabyRobot: Experimental Setup Room
Interspeech 2018 Tutorial: Multimodal Speech & Audio Processing in Audio-Visual Human-Robot Interaction
Video of experiments with TD (typically-developing) children
Perception System overview (Sense – Think – Act):
• Inputs: Visual Stream, Audio Stream
• Action Branch (child's activity): Visual Gesture Recognition, Distant Speech Recognition, AV Localization & Tracking, Action Recognition, 3D Object Tracking
• Behavioral Branch (child's behavioral state): Visual Emotion Recognition, Speech Emotion Recognition, Text Emotion Recognition, Behavioral Monitoring
• Audio- and visual-related information is passed through the IrisBroker to IrisTK behavior generation (Wizard-of-Oz)
Experimental Setup: Hardware & Software
Action Branch: Developed Technologies
• 3D Object Tracking
• Multiview Gesture Recognition
• Multiview Action Recognition
• Speaker Localization and Distant Speech Recognition
Audio-Visual Localization Evaluation
• Track multiple persons using the Kinect skeleton.
• Select the person closest to the estimated auditory source position.
• Rcor: percentage of correct estimates (deviation from ground truth less than 0.5 m)
  – Audio Source Localization: 45.5%
  – Audio-Visual Localization: 85.6%
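The selection rule and the Rcor metric above can be sketched as follows; function and variable names are illustrative, not taken from the tutorial's actual code:

```python
import numpy as np

def av_localize(skeletons, audio_src):
    """Pick the tracked person whose Kinect skeleton position is
    closest to the auditory source estimate from the mic array.

    skeletons : list of (person_id, xyz ndarray) pairs
    audio_src : 3-D source position estimate (ndarray)
    """
    pid, _ = min(skeletons, key=lambda s: float(np.linalg.norm(s[1] - audio_src)))
    return pid

def rcor(estimates, ground_truth, tol=0.5):
    """Rcor: percentage of position estimates whose deviation from
    the ground-truth location is below `tol` metres."""
    errs = np.linalg.norm(np.asarray(estimates, float) - np.asarray(ground_truth, float), axis=1)
    return 100.0 * float(np.mean(errs < tol))
```

Fusing the visual track with the audio estimate this way explains the jump from 45.5% to 85.6% Rcor: the audio source only needs to be resolved to the correct person, not to within 0.5 m.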
Multi-view Gesture Recognition
• Multiple views of the child's gesture from different sensors
• Fusion of the three sensors' decisions
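The three single-view decisions can be combined with a simple late-fusion rule. The sketch below uses majority voting with a confidence fallback, which is an assumption for illustration — the slides do not specify the exact fusion scheme:

```python
from collections import Counter

def fuse_decisions(per_sensor):
    """Late fusion of per-sensor gesture decisions.

    per_sensor : list of (label, score) pairs, one per Kinect view.
    A label chosen by two or more views wins; otherwise fall back
    to the single most confident view.
    """
    counts = Counter(label for label, _ in per_sensor)
    top, n = counts.most_common(1)[0]
    if n > 1:
        return top
    # No majority: trust the most confident view.
    return max(per_sensor, key=lambda d: d[1])[0]
```

Decision-level fusion like this is robust to one occluded or noisy viewpoint, since the other two sensors can outvote it.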
Gesture Recognition – Vocabulary
Nod, Greet, Come closer, Sit, Stop, Point, Circle
Multi-view Gesture Recognition – Evaluation
• 7 classes: nod, greet, come closer, sit, stop, point, circle
• Average classification accuracy (%) for the employed gestures performed by 28 children (development corpus)
• Results for the five different features, for both single- and multi-stream cases
Multi-view Gesture Recognition – Children vs. Adults
Different training schemes: Adults models, Children models, Mixed model
Employed Features: MBH (Motion Boundary Histograms)
A. Tsiami, P. Koutras, N. Efthymiou, P. Filntisis, G. Potamianos, P. Maragos, “Multi3: Multi-sensory Perception System for Multi-modal Child Interaction with Multiple Robots”, Proc. ICRA, 2018.
Distant Speech Recognition System
DSR model training and adaptation per Kinect (Greek models)
Collected data (sample phrases):
• "I think that you are hammering a nail"
• "I think that you are painting"
• "I think that it is the rabbit"
• "It relates to peace"
Spoken Command Recognition Evaluation
• TD (Typically-Developing) children data: 40 phrases
• Average word (WCOR) and sentence (SCOR) accuracy for the DSR task, per utterance set, for all adaptation choices
• 4-fold cross-validation
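A minimal sketch of the two accuracy metrics: SCOR as exact sentence match, and WCOR approximated here with an LCS word alignment (an assumption — the standard word-correct measure uses a full Levenshtein alignment, as in NIST's sclite):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two word lists."""
    m = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            m[i + 1][j + 1] = m[i][j] + 1 if x == y else max(m[i][j + 1], m[i + 1][j])
    return m[-1][-1]

def wcor(hyps, refs):
    """Word correct rate (%): aligned matching words over reference words."""
    matched = sum(lcs_len(r.split(), h.split()) for h, r in zip(hyps, refs))
    total = sum(len(r.split()) for r in refs)
    return 100.0 * matched / total

def scor(hyps, refs):
    """Sentence accuracy (%): exact-match hypotheses over all utterances."""
    return 100.0 * sum(h == r for h, r in zip(hyps, refs)) / len(refs)
```

In 4-fold cross-validation, the 40 phrases are split into four folds; each fold is scored with models adapted on the other three, and the per-fold WCOR/SCOR values are averaged.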
Spoken Command Recognition – Children vs. Adults
Different training schemes: Adults models, Children models, Mixed model
Action Recognition – Vocabulary
Cleaning a window, Ironing a shirt, Digging a hole, Driving a bus, Painting a wall, Hammering a nail, Wiping the floor, Reading, Swimming, Working out, Playing the guitar, Dancing
Multi-view Action Recognition – Evaluation
• 13 classes of pantomime actions
• Average classification accuracy (%) for the actions performed by 28 children (development corpus)
• Results for the five different features, for both single- and multi-stream cases
N. Efthymiou, P. Koutras, P. Filntisis, G. Potamianos, P. Maragos, “Multi-view Fusion for Action Recognition in Child-Robot Interaction”, Proc. ICIP, 2018.
Multi-view Action Recognition – Children vs. Adults
Different training schemes: Adults models, Children models, Mixed model
Employed Features: MBH
A. Tsiami, P. Filntisis, N. Efthymiou, P. Koutras, G. Potamianos, P. Maragos, “Multi3: Multi-sensory Perception System for Multi-modal Child Interaction with Multiple Robots”, Proc. ICRA, 2018.
Child-Robot Interaction: TD video – Rock Paper Scissors
Part 3: Conclusions
Synopsis:
• Data collection and annotation: 28 TD and 15 ASD children (+ 20 adults)
• Audio-visual localization and tracking
• 3D object tracking
• Multi-view gesture and action recognition
• Distant speech recognition
• Multimodal emotion recognition

Ongoing work:
• Evaluate the whole perception system with TD and ASD children
• Extend and develop methods for engagement and behavioral understanding
Tutorial slides: http://cvsp.cs.ntua.gr/interspeech2018
For more information, demos, and current results: http://cvsp.cs.ntua.gr and http://robotics.ntua.gr