A Supervised Learning Architecture for Human Pose Recognition in a Social Robot
UNIVERSITY CARLOS III OF MADRID
COMPUTER SCIENCE DEPARTMENT
Victor Gonzalez-Pacheco
A thesis submitted for the degree of
Master in Computer Science and Technology - Artificial Intelligence -
July 2011
Director:
Fernando Fernandez Rebollo
Computer Science Department
University Carlos III of Madrid
Co-Director:
Miguel A. Salichs
Systems Engineering and Automation Department
University Carlos III of Madrid
Abstract
A main activity of social robots is to interact with people. To do that, the robot must be able to understand what the user is saying or doing. This document presents a supervised learning architecture that enables a social robot to recognise human poses. The architecture is trained using data obtained from a depth camera, which allows the creation of a kinematic model of the user. The user labels each set of poses by telling it directly to the robot, which identifies these labels with an Automatic Speech Recognition (ASR) system. The architecture is evaluated with two different datasets in which the quality of the training examples varies. In both datasets, a user trains the classifier to recognise three different poses. The learned classifiers are evaluated with twelve different users, demonstrating high accuracy and robustness when representative examples are provided in the training phase. Using this architecture in a social robot might improve the quality of the human-robot interaction, since the robot is able to detect non-verbal cues from the user, making it more aware of the interaction context.

Keywords: Pose Recognition, Machine Learning, Robotics, Human-Robot Interaction, HRI.
Resumen
One of the main activities of social robots is to interact with people. This requires the robot to be able to understand what the user is saying or doing. This document presents a supervised learning architecture that allows a social robot to recognise the poses of the people it interacts with. The architecture is trained using images from a depth camera, which allows the creation of a kinematic model of the user that is used for the training examples. The user labels the poses shown to the robot using his or her own voice. To detect the labels spoken by the user, the robot uses a speech recognition system integrated in the architecture. The architecture is evaluated with two different datasets in which the quality of the training examples varies. In both datasets, a user trains the classifier to recognise three different poses. The classifiers built from these datasets are evaluated in a test with twelve different users. The evaluation shows that this architecture achieves high accuracy and robustness when representative examples are provided in the training phase. Using this architecture in a social robot can improve the quality of human-robot interactions, because with it the robot is able to detect non-verbal information coming from the user. This makes the robot more aware of the interaction context in which it finds itself.

Keywords: Pose Recognition, Machine Learning, Robotics, Human-Robot Interaction, HRI.
To Raquel, eternal companion in this, and all the journeys.
Acknowledgements
This project could not have been finished without the help of many people, and for that reason I want to thank them for the precious time they dedicated to me. First, I want to express how thankful I am to my two advisors, Fernando Fernández and Miguel Ángel Salichs. You have been a lighthouse guiding me on this journey, and you have been helpful whenever I have needed you. I want to thank Fernando A., especially for helping me overcome all the difficulties I encountered when I was integrating the voice system of the robot, even when you were not physically here. It is really easy and great to work with colleagues like you. I am grateful to the rest of the "Social Robots" team as well, especially Arnaud, Alberto and David. It is incredible how you always respond with invaluable help and great advice. There are other people who have suffered the collateral damage of this project: thank you Martin, Javier, Miguel, Juan, Alberto, and Silvia. It is wonderful to have such colleagues and friends, always there to exchange ideas and concerns, help with problems and share great coffee breaks with me. Finally, I want to give special thanks to Raquel, my wife, for all the support she has given me. Without your patience, encouragement, and drive I would not have finished this project. Thank you.
Contents
List of Figures
List of Tables

1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Problems to be solved
1.4 Structure of the Document

2 Related Work
2.1 Machine Learning in Human-Robot Interactions
2.2 Depth Cameras
2.3 Gesture Recognition with Depth Cameras

3 Description of the Hardware and Software Platform
3.1 Hardware Description of the Robot Maggie
3.2 The AD Software Architecture
3.3 Robot Operating System
3.4 The Kinect Vision System
3.5 The Weka Framework

4 The Supervised Learning Based Pose Recognition Architecture
4.1 Training Phase
4.1.1 Pose Labeler Skill
4.1.2 Pose Labeler Bridge
4.1.3 Pose Trainer
4.2 Classifying Phase
4.2.1 Pose Classifier
4.2.2 Pose Teller

5 Pilot Experiments
5.1 Scenario
5.2 Results
5.3 Discussion

6 Conclusions

A Class Diagrams of the Main Nodes
B Detailed Results of the Pilot experiment
C The ASR Skill Grammar for recognising Pose Labels

References
License
List of Figures
2.1 Categories of Example Gathering
2.2 Operation Diagram of a Depth Camera
2.3 Comparison of the three main depth camera technologies
3.1 The sensors of the Robot Maggie
3.2 Actuators and interaction mechanisms of the Robot Maggie
3.3 The AD architecture
3.4 Block Diagram of the Prime Sense Reference Design
3.5 OpenNI's kinematic model of the human body
4.1 Overview of the Built System
4.2 Diagram of the built architecture
4.3 Sequence Diagram of the training phase
4.4 Use Case of the Pose Labeler Skill
4.5 Use Case of the Pose Labeler Bridge
4.6 Use Case of the Pose Trainer Node
4.7 Sequence Diagram of the classifying phase
4.8 Use Case of the Pose Classifier Node
4.9 Use Case of the Pose Teller Node
5.1 Scenario of the experiment
5.2 The tree built for the models M1 and M2
5.3 The tree built for the model M3
5.4 The tree built for the model M4
A.1 Class Diagram of the Pose Trainer Node
A.2 Class Diagram of the Pose Classifier Node
List of Tables
5.1 Results of the Pilot Experiment
B.1 Detailed Results of the Models M1 and M2
B.2 Confusion matrix for the models M1 and M2
B.3 Detailed Results of the Model M3
B.4 Confusion matrix for the model M3
B.5 Detailed Results of the Model M4
B.6 Confusion matrix for the model M4
Chapter 1
Introduction
1.1 Motivation
Human Robot Interaction (HRI) is the field of research that studies how humans and robots should interact and collaborate. Humans expect robots to understand them as other people do. To this end, a robot must understand natural language and should be capable of establishing complex dialogues with its human partners.
But dialogues are not only a matter of words. Most of the information exchanged in a conversation does not come from the phrases of the people engaged in it but from their non-verbal messages. Gestures carry a great part of the non-verbal information, but other factors also inform the participants; an example is the postural information that a person shows to his or her listeners.

For example, imagine a room where several people are sitting in chairs and talking. If, suddenly, one of them stands up, he will draw the attention of the rest of the people in the room. What is this man announcing when he suddenly stands up? Is he about to leave the room? Or does he want to say something relevant? His intentions will not be disclosed until he says or does something, but the important point is that the dynamics of the conversation have suddenly changed, or at least have been affected by a change in the context of the room.
Now imagine there is a robot in that room. Would the robot notice this change in the context? Probably not, given the state of current technology. Gesture and pose recognition systems have been an active research field in recent years [1], but traditional image capture systems require complex statistical models to recognise the body, making them difficult to use in practical applications [2].

Recent technological developments, however, are enabling new types of vision sensors that are more suitable for interactive scenarios [3]. These devices are depth cameras [3], [4], [5], which make extracting the body easier than it was with traditional cameras. And because body extraction is easier than before, it is now possible to redirect large amounts of computing power to algorithms that actually process the gestures or the pose of the user, rather than to detecting and tracking the body.
Especially relevant is the case of Microsoft's Kinect sensor [6], a low-cost depth camera that offers precision and performance similar to high-end depth cameras at a cost several times lower. Alongside the Kinect, several drivers and frameworks to control it have appeared. These drivers and frameworks provide direct access to a skeleton model of the user in front of the camera at a relatively low CPU cost. This model is precise enough to track the pose of the user and to recognise the gestures he or she is performing in real time.
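As an illustration of the kind of data such a skeleton model provides, the following sketch turns one frame of 3-D joint positions into a pose feature vector by expressing every joint relative to the torso. The joint names and coordinate values are hypothetical examples, not the actual OpenNI API.

```python
# Illustrative sketch: converting one frame of skeleton joints (as a depth-camera
# driver such as OpenNI might expose them) into a pose feature vector.
# Joint names and coordinates below are invented example values.

TORSO = "torso"

def pose_features(joints):
    """Express every joint relative to the torso, so the feature vector is
    invariant to where the user stands in front of the camera."""
    tx, ty, tz = joints[TORSO]
    return {name: (x - tx, y - ty, z - tz)
            for name, (x, y, z) in joints.items() if name != TORSO}

frame = {
    "torso": (0.1, 1.0, 2.5),
    "head": (0.1, 1.6, 2.5),
    "left_hand": (-0.4, 1.3, 2.3),
    "right_hand": (0.6, 0.6, 2.4),
}
features = pose_features(frame)
# features["head"] is roughly (0.0, 0.6, 0.0): the head is about 0.6 m above the torso
```

Feature vectors of this kind are what a pose classifier would consume, regardless of which concrete driver produced the skeleton.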
1.2 Objectives
The main objective of this master's thesis is to build a software architecture capable of learning human poses by leveraging the new capabilities offered by a Kinect sensor mounted on the robot Maggie [7]. With such an architecture, it is expected that the robot will be able to understand some of the non-verbal information coming from the users who interact with it, as well as context information about the situation.

Since the robot Maggie is a platform to study Human-Robot Interaction (HRI), one of the requirements of the system is that the user should be able to teach the robot the poses it does not know. This teaching process must be done through natural interaction, i.e., the user and the robot must communicate by voice.
The learning architecture will obtain the user's body information from the robot's vision system. The vision system relies on a Kinect sensor, which has been recently installed on the robot but not yet fully integrated into its software architecture. For that reason, one of the objectives of the project is to integrate the software that manages the Kinect sensor with the robot architecture.

Additionally, no learning architecture is currently running on the robot. Therefore, this architecture has to be built and integrated into the robot.
The integration of all these components is not a trivial task due to the requirements of the robot's software architecture, which demands that its software be capable of working in a distributed manner. To solve this issue, the mechanisms that will glue together all the developed components will be the communication systems of the robot's software architecture and those of ROS (Robot Operating System). ROS is an open-source software framework that aims to standardise robotics software. The vision components we intend to use are already integrated in ROS, and its multi-language support makes it possible to develop new ROS-compatible software. To avoid redoing work that has already been done in ROS, the vision components will be accessed through ROS. But ROS is not integrated into the robot, and there are no mechanisms to connect ROS with the software architecture of the robot. Hence, a final objective of the project is to integrate ROS into the software architecture of the robot.
These objectives can be summarised in the following list:
• To build a machine learning framework that learns from multi-modal examples. In essence, the system should learn by fusing verbal information with the information captured by the vision system.
• To let the human teach the robot by interacting naturally, in the same manner as he or she would with another human.
• To develop a pose recognition learning architecture leveraging the image acquisition techniques provided by the Kinect sensor and the algorithms of its drivers.
• To integrate ROS in the robot.
• To validate that the system works.
• To integrate and test the whole architecture in the robot Maggie.
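To make the first objective concrete, a multi-modal training example can be pictured as the fusion of a spoken label (from the speech recogniser) with the simultaneous skeleton reading (from the vision system). The sketch below is purely conceptual; the names `TrainingExample` and `fuse` are invented for illustration and do not belong to the actual architecture.

```python
# Conceptual sketch of a multi-modal training example: what the robot saw
# (the kinematic model) paired with what the user said (the ASR label).
from dataclasses import dataclass
from typing import Dict, Tuple

Joint = Tuple[float, float, float]  # (x, y, z) position of one body joint

@dataclass
class TrainingExample:
    joints: Dict[str, Joint]  # kinematic model captured by the depth camera
    label: str                # pose name recognised by the ASR

def fuse(skeleton_frame: Dict[str, Joint], spoken_label: str) -> TrainingExample:
    """Pair the latest skeleton frame with the label the ASR just heard."""
    return TrainingExample(joints=dict(skeleton_frame), label=spoken_label)

example = fuse({"head": (0.0, 1.6, 2.5)}, "arms crossed")
# example now carries both modalities and can be appended to a training dataset
```

A stream of such examples is exactly what a supervised learner needs: the vision system supplies the attributes and the voice channel supplies the class label.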
1.3 Problems to be solved
In order to accomplish the objectives enumerated above, several technical problems have to be addressed. Among them, three are especially relevant.

The first is to integrate the vision algorithms provided by the Kinect sensor into the architecture of the robot. Until now this task has remained incomplete, and it must be done to enable the learning framework to detect the pose of the users. Additionally, the architecture of the robot lacks a generic machine learning component. Therefore, if a learning component is to be used, it must be integrated into the robot software as well. Finally, the voice recognition system of the robot must be able to collaborate with the learning architecture in order to feed it with the training examples provided by the user.

As shown in the objectives section, one of the objectives of the project is to use some components of the architecture through ROS, which has to be integrated into the robot software architecture. This is a technical problem because the communication mechanisms of ROS differ from those of the robot. Thus, it is necessary to build components that enable the communication between ROS and the robot.
These technical difficulties are summarised in the following list:
• Integrate the vision acquisition system with the robot's software architecture.
• Integrate the machine learning framework in the architecture of the robot.
• Combine the user's inputs with the learning architecture.
• Integrate ROS with the robot's software architecture.
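The last item, bridging two middlewares with different communication mechanisms, typically boils down to a translator component that subscribes on one side and republishes on the other. The following is a middleware-agnostic sketch of that pattern; the `Bus` class is a stand-in for both the AD architecture and ROS, not their real APIs, and the topic names are invented.

```python
# Conceptual sketch of a ROS <-> robot-architecture bridge: a translator that
# listens on one message bus and republishes on the other. Bus is a stand-in
# for both middlewares, not the real AD or ROS API.

class Bus:
    def __init__(self):
        self._subscribers = {}

    def subscribe(self, topic, callback):
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers.get(topic, []):
            callback(message)

def bridge(source_bus, source_topic, target_bus, target_topic, translate):
    """Forward every message from source to target, converting its format."""
    source_bus.subscribe(
        source_topic,
        lambda msg: target_bus.publish(target_topic, translate(msg)))

ros, ad = Bus(), Bus()
received = []
ad.subscribe("pose_label", received.append)
# Translate a hypothetical ROS string message into the robot's event format
bridge(ros, "/pose/label", ad, "pose_label", lambda msg: msg.upper())
ros.publish("/pose/label", "standing")
# received is now ["STANDING"]
```

The real bridge has to cope with serialisation and threading as well, but the subscribe-translate-republish shape is the same.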
1.4 Structure of the Document
This document is organised as follows. Chapter 2 presents an overview of the related work in pose and gesture classification with depth cameras. Chapter 3 gives an overview of the systems that act as the building blocks of the developed architecture; it describes hardware components such as the robot Maggie and the Kinect sensor, as well as the software modules that act as the scaffold of the project. Chapter 4 presents the developed architecture and describes its components, organised by the two phases of the learning process: the training phase and the classifying phase. After that, chapter 5 presents some pilot experiments carried out to validate the correct functioning of the architecture. Finally, chapter 6 closes the document with the conclusions and the future work that remains to be done.
Chapter 2
Related Work
This chapter introduces the current state of the art in the topics related to the project. It starts with an overview of the machine learning techniques used in the field of Human-Robot Interaction (HRI) to improve the interactions between robots and people. Then, section 2.2 presents an overview of the technology used by depth cameras, which enables them to retrieve depth information from scenes. Finally, section 2.3 reviews the applications and the work that other research groups have done with depth cameras to detect human poses and gestures.
2.1 Machine Learning in Human-Robot Interactions
Fong et al. conducted an excellent survey [8] of the interactions between humans and social robots. In it, Fong mentions that the main purpose of learning in social robots is to improve the interaction experience. At the date of the survey (2003), most learning applications were used in robot-robot interaction. Some works addressed learning in human-robot interaction, mostly centred on imitating human behaviours such as motor primitives. According to the authors, learning in social robots is used for transferring skills, tasks and information to the robot. However, the authors do not mention the use of learning for transferring concepts to the robot.
A few years later, Goodrich and Schultz published a complete survey covering several HRI fields [9]. The authors highlighted the need for robots with learning capabilities, because scripting every possible interaction with humans is not an affordable task given the complexity and unpredictable behaviour of human beings. They pointed out the need for a continuous learning process in which the human can teach the robot in an ad-hoc and incremental manner to improve the robot's perceptual ability, autonomy and interaction capabilities. They called this process interactive learning, and it is carried out through natural interaction. Again, the survey only reports works that treat learning as an instrument to improve abilities, behaviour, perception and multi-robot interaction. No explicit mention is made of using learning to provide the robot with new concepts.
[Figure 2.1: a 2x2 matrix whose cells are Teleoperation, Sensors on Teacher, Shadowing and External Observation, organised by Record Mapping (rows) and Embodiment Mapping (columns).]

Figure 2.1: Categories of Example Gathering - These four squares categorise the manners of gathering examples in the Learning from Demonstration field applied to robotics. The rows represent whether the recording of the example captured all the sensory input of the teacher (upper row) or not (lower row). The columns represent whether the recorded dataset can apply directly to actions or states of the robot (left column) or whether some mapping is needed (right column). (Retrieved from [10])
Machine learning is applied to robotics in many fields. Among all machine learning techniques, supervised learning is one of the most widespread, since the robot can explore the world under the supervision of a teacher, thus reducing the danger to the robot or the environment [10].
This section presents some concepts regarding supervised learning, especially Learning from Demonstration (LfD). In this area, [10] presents an excellent survey that organises several LfD approaches into categories depending on how they collect the learning examples and on how they learn a policy from those examples. The latter is not relevant to this project, so only the former is summarised below.
Argall et al. [10] use the term correspondence for the process of recording learning examples and transferring them to the robot. They divide correspondence into two main categories depending on how the examples are recorded and transferred to the robot: Record Mapping and Embodiment Mapping. The former refers to whether the experience of the teacher during the demonstration is captured exactly or not. The latter refers to whether the examples recorded in the dataset are exactly those that the learner would observe or execute. Argall et al. present these two categorisations in the form of a 2x2 matrix (see Fig. 2.1).
Going deeper into the categorisation, the authors of [10] divide the Embodiment Mapping into two subcategories: demonstration and imitation. In the demonstration category, the demonstration is performed on the actual robot or on a physically identical platform, so there is no need for an embodiment mapping. On the contrary, in the imitation category the demonstration is performed on a platform different from the robot learner. Therefore, an embodiment mapping between the demonstration platform and the learning platform is needed.

As depicted in Fig. 2.1, both the demonstration and imitation categories are divided into two sub-categories depending on the Record Mapping: demonstration is divided into teleoperation and shadowing, while imitation is divided into sensors on the teacher and external observation. The four categories are described below.
• Demonstration. The embodiment mapping is direct. The robot uses its own sensors to record the example while its body executes the behaviour.
– Teleoperation: a technique in which the teacher directly operates the learner robot during the execution. The robot learner uses its own sensors to capture the demonstration, so there is a direct mapping between the recorded example and the observed example.
– Shadowing: in this technique, the robot learner uses its own sensors to record the example while, at the same time, it tries to mimic the teacher's motions. In this case there is no direct record mapping.
• Imitation. The embodiment mapping is not direct. The robot needs to retrieve the example data from the actions of the teacher.
– Sensors on the teacher: in this technique, several sensors are located on the executing body to record the teacher's execution. Therefore, this is a direct record mapping technique.
– External observation: in this case, the recording sensors are not located on the executing body. Sometimes the sensors are installed on the learner robot, and sometimes they are outside it. This means that this technique is not a direct record mapping technique.
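The four categories above reduce to a lookup over the two binary distinctions of Fig. 2.1: whether the record mapping is direct and whether the embodiment mapping is direct. A small, purely illustrative sketch of that 2x2 matrix:

```python
# The 2x2 categorisation of Fig. 2.1 as a lookup table.
# Keys: (record mapping is direct, embodiment mapping is direct).
# A direct embodiment mapping means demonstration; a mapped one means imitation.

CATEGORIES = {
    (True,  True):  "teleoperation",         # demonstration, direct record
    (False, True):  "shadowing",             # demonstration, mapped record
    (True,  False): "sensors on teacher",    # imitation, direct record
    (False, False): "external observation",  # imitation, mapped record
}

def categorise(record_is_direct: bool, embodiment_is_direct: bool) -> str:
    return CATEGORIES[(record_is_direct, embodiment_is_direct)]

# A demonstration captured by the robot's own sensors while being teleoperated:
assert categorise(True, True) == "teleoperation"
```

Framing the taxonomy this way makes it easy to place any new LfD approach by answering the two yes/no questions.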
As a commentary, teleoperation provides the most direct method for transferring information within demonstration learning but, on the downside, it is not suitable for all learning platforms [10]. On the other hand, shadowing techniques demand more processing to enable the learner to mimic the teacher.

The sensors-on-teacher technique provides very precise measurements, since the information is extracted directly from the teacher's sensors, but it requires extra overhead in the form of specialised sensors. Conversely, since external observation sensors are external to the teacher, they do not record the data directly from it, forcing the learner robot to infer the data of the execution. This makes external observation less reliable, but since setting up the sensors produces less overhead than the sensors-on-teacher technique, it is more widely used [10]. Typically, the external sensors used to record human teacher executions are vision-based.
If we apply the categorisation of [10] to this project, we find that there is no record
mapping, because the learning examples are recorded by the Kinect sensor of the
robot. Additionally, the embodiment mapping categorisation does not apply, because
there is no action or behaviour to learn. Instead, what is learnt is a concept which,
for the scope of this project, does not need to be mapped to the robot.
Almost all the presented works focus the learning process on tasks or
behaviours. Few of them use learning to teach concepts to the robot. This is the
case of [11], where the authors train a mobile robotic platform to understand concepts
related to the environment in which it has to navigate. The authors use a Feed-Forward
Neural Network (NN) to teach the robot concepts like doors or walls. They
train the NN by showing it numerous images of a trash can (its destination
point), labelling each photo with the distance and the orientation of the can. However,
the work presented some limitations: for instance, the learning process lacked
enough flexibility to generalise to other areas.
A work in an area not directly related to robotics [12] can give us an insight into the
kind of concepts our robot can learn with our system. In that work, Van Karsten et al.
mounted a wireless sensor network in a home to record the activities of the user
living in it. The activities were labelled by voice by the person living in the house,
using a wireless Bluetooth headset. The labels were processed by a grammar-based
speech recognition system similar to Maggie’s (see annex C for
an example of a Maggie grammar). The authors recorded the activities of the user for
a period of 28 days and formed a dataset with them. Of the 28 days of recording,
27 were used as the training dataset and the remaining day as the test dataset to
evaluate the classifiers. The training dataset was used to fit a Hidden Markov Model
(HMM) and a Conditional Random Field (CRF), building two models of the user’s
activities.
The performance of both classifiers was evaluated using two measures:
the time slice accuracy and the class accuracy. The former represents the percentage
of correctly classified time slices, while the latter represents the average percentage
of correctly classified time slices per class. Both classifiers showed good results
detecting the user’s activities: the CRF showed better time slice accuracy, while the
HMM performed better in class accuracy.
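The difference between the two measures can be sketched as follows; the data and function names are illustrative, not taken from [12]:

```python
# Toy sketch of the two accuracy measures over labelled activity time slices.
from collections import defaultdict

def time_slice_accuracy(true_labels, predicted_labels):
    """Percentage of time slices whose predicted activity matches the true one."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def class_accuracy(true_labels, predicted_labels):
    """Average, over activity classes, of the per-class fraction of correct slices."""
    per_class = defaultdict(lambda: [0, 0])  # class -> [correct, total]
    for t, p in zip(true_labels, predicted_labels):
        per_class[t][1] += 1
        if t == p:
            per_class[t][0] += 1
    return sum(c / n for c, n in per_class.values()) / len(per_class)

true = ["sleep", "sleep", "sleep", "cook", "cook", "eat"]
pred = ["sleep", "sleep", "cook",  "cook", "cook", "cook"]
print(time_slice_accuracy(true, pred))  # 4/6 ≈ 0.67
print(class_accuracy(true, pred))       # mean(2/3, 2/2, 0/1) ≈ 0.56
```

Note how a class that rarely occurs (here "eat") weighs as much as a frequent one in the class accuracy, which explains why the two measures can rank classifiers differently.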
Learning the activities of the user might improve the quality of the interactions of a
social robot. But, although activity learning has many potential applications in robotics,
no works have been found in the fields of social robotics or human-robot interaction.
Most of the presented works assume the robot is a passive learner. But a new
paradigm is appearing in which the robot takes the initiative and asks the user for more
examples where they are needed [13]. Active learning techniques can produce
classifiers with better performance from the same examples, or reach a given
performance with fewer of them [14].
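A common active learning strategy is uncertainty sampling, in which the learner queries the pool example it is least confident about. A minimal sketch with a toy distance-based classifier; all names and data are hypothetical, not the systems of [13] or [14]:

```python
# Pool-based active learning: query the example with the lowest top-class probability.
import math

def predict_proba(centroids, x):
    """Toy classifier: softmax over negative distances to per-class centroids."""
    scores = {c: -math.dist(mu, x) for c, mu in centroids.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / z for c, s in scores.items()}

def most_uncertain(centroids, pool):
    """Return the unlabelled example whose most probable class is least certain."""
    return min(pool, key=lambda x: max(predict_proba(centroids, x).values()))

centroids = {"left": (0.0, 0.0), "right": (4.0, 0.0)}
pool = [(0.5, 0.1), (2.0, 0.0), (3.8, 0.2)]
query = most_uncertain(centroids, pool)
print(query)  # (2.0, 0.0), the point near the decision boundary
```

Labelling the queried boundary point is what lets the learner improve with fewer examples than random selection.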
Applied to HRI, some works have demonstrated that active learning not
only improves the performance of supervised learning, but also improves the quality
of the interaction with the teacher. In [15], the authors studied how a social
robot was able to learn different concepts. Twenty-four people participated in the experiment
and the robot was configured to show four different degrees of initiative in the learning
process.
The robot ranged from a traditional passive supervised learning approach to three
different active learning configurations: a naïve active learner that queried the
teacher every turn; a mixed passive-active mode, where the robot waited for certain
conditions before asking; and a mode where the robot only queried the teacher when
the teacher granted permission to do so.
In their experiment, [15] found no appreciable difference between the active
learning modes, but all three outperformed the passive supervised learning mode in
both accuracy and the number of examples needed to achieve accurate models.
Additionally, the survey of the 24 users showed that they preferred the active learner
robot, which they found to be more intelligent, more engaging and easier to teach.
Although active learning seems better suited to HRI purposes than traditional
supervised learning, the scope of this project remains within passive supervised
learning, since this is a first approximation to the field. Nevertheless, an objective of
the project is to build an architecture that allows future expansion in case active
learning techniques are added to it.
2.2 Depth Cameras
Depth cameras are systems that build a 3D depth map of a scene by projecting
light onto that scene. The principle is similar to that of LIDAR scanners, with the
difference that the latter are only capable of performing a 2D scan of the scene, while
depth cameras scan the whole scene at once. Fig. 2.2 depicts an example of the
operation principle of a depth camera.
Traditionally, depth information has been obtained with stereo vision or laser-based
systems. Stereo cameras rely on passive triangulation to obtain the depth information
of the scene. These methods require two cameras separated by a baseline, which
determines a limited working depth range. Moreover, these algorithms face the
so-called correspondence problem: determining which pairs of points in the two
images are projections of the same 3D point. In contrast, depth cameras naturally
Figure 2.2: Operation Diagram of a Depth Camera - Depth cameras project IR light to the scene, which is analysed by a sensor to get the depth map of the scene. The depicted sensor is PrimeSense’s sensor, which uses the Light Coding technology to retrieve the depth data. (Retrieved from [16])
deliver depth and simultaneous intensity data, avoiding the correspondence problem,
and do not require a baseline in order to operate [3].
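The passive triangulation behind stereo systems can be sketched with the classic relation between depth, focal length, baseline and disparity; a simplified pinhole model with hypothetical numbers:

```python
# Once the correspondence problem is solved and the disparity d (in pixels)
# between the two views is known, depth follows from focal length f and baseline b.
def stereo_depth(f_px, baseline_m, disparity_px):
    """Z = f * b / d: a larger baseline extends the usable depth range."""
    return f_px * baseline_m / disparity_px

# Hypothetical rig: f = 525 px, baseline = 10 cm, disparity = 25 px.
print(stereo_depth(525.0, 0.10, 25.0))  # 2.1 m
```

The inverse dependence on disparity also shows why stereo depth resolution degrades quickly with distance.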
On the other hand, laser-based systems provide very precise sliced 3D measurements.
But these systems have difficulties in collision avoidance applications due to
their 2D field of view. The most widely adopted solution to this problem
has been mounting the sensor on a pan-and-tilt unit. This solves the problem, but
it also implies row-by-row sampling, which makes this solution inappropriate for
real-time, dynamic scenes. In short, although laser-based systems offer higher depth
range, accuracy and reliability, they are voluminous and heavy, increase power
consumption, and add moving parts when compared to depth cameras. Depth
cameras, on the contrary, are compact and portable, do not require the control
of mechanical moving parts, thus reducing power consumption, and do not need
row-by-row sampling, thus reducing image acquisition time [3].
There are three main categories of depth cameras, depending on how they project
light onto the scene to capture its depth information:
Time-of-Flight (ToF) Cameras ToF cameras obtain the depth information by emitting
near-infrared light, which is reflected by the 3D surfaces of the scene back to
the sensor (see Fig. 2.3a). Currently, two main approaches are employed
(a) ToF (b) Structured Light (c) Light Coding
Figure 2.3: Comparison of the three main depth camera technologies - (a) shows the principle of operation of ToF cameras. (b) shows the pattern emitted by a Structured Light camera. (c) shows the pattern emitted by a Light Coding camera. (b) and (c) obtain the depth information by comparing the distortions of the pattern received at the sensor with the original emitted pattern.
in ToF technology [17]. The first consists of sensors that measure the time of a
light pulse’s round trip to calculate depth. The second measures phase differences
between the emitted and received signals.
Structured Light Cameras Structured Light is based on projecting a narrow band of
IR light onto the scene [4]. When the projected band hits a 3D surface, it produces
a line of illumination that appears distorted from perspectives other than
the projector’s. When this distorted light is received by a sensor, it is possible to
calculate the shape of the 3D surface, because the initial form of the band is known.
This applies only to the section of the 3D surface that has been illuminated by the light
band. To extend the principle to the whole scene, many methods emit a pattern
of several light bands simultaneously (see Fig 2.3b).
Projected Light Cameras This is the newest technology. It is based on projecting
a pattern of IR light onto the scene and calculating its distortions. Unlike
Structured Light, here the pattern is based on light dots (see Fig. 2.3c). It is used
by devices such as Microsoft’s Kinect. Since this is the technology used in
this project, it is further described in section ??.
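The depth-recovery principles of these families can be sketched as follows; the geometry is deliberately simplified and all values are hypothetical:

```python
# Hedged sketch of the three depth-recovery principles (not any vendor's algorithm).
import math

C = 299_792_458.0  # speed of light, m/s

def tof_depth_pulse(round_trip_s):
    """Pulsed ToF: the light travels to the surface and back, so z = c*t/2."""
    return C * round_trip_s / 2.0

def tof_depth_phase(phase_rad, mod_freq_hz):
    """Continuous-wave ToF: depth from the phase shift of the modulated signal.
    Unambiguous only up to c / (2 * f_mod)."""
    return C * phase_rad / (4.0 * math.pi * mod_freq_hz)

def pattern_depth(u_px, f_px, baseline_m, theta_rad):
    """Structured/projected light: a pinhole camera at the origin sees the
    projected feature at pixel u (focal length f); the projector, displaced by
    a baseline b, emits it along the plane x = b + z*tan(theta). Intersecting
    the camera ray x = z*u/f with that plane recovers the depth z."""
    return baseline_m / (u_px / f_px - math.tan(theta_rad))

print(tof_depth_pulse(20e-9))          # 20 ns round trip -> ~3 m
print(tof_depth_phase(math.pi, 20e6))  # half-cycle shift at 20 MHz -> ~3.75 m
print(pattern_depth(105.0, 525.0, 0.10, -0.1))
```

Both pattern-based families reduce to triangulation; they differ only in the projected feature (bands versus dots) whose distortion is measured.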
2.3 Gesture Recognition with Depth Cameras
Depth cameras are an attractive opportunity in several fields that require intense
analysis of the 3D environment. [18] presents a survey of different technologies and
applications of depth sensors in recent years. It points out a relevant increase in
the scientific activity in this field in the two or three years prior to the survey. Among
other potential applications, [18] regards gesture recognition as one of the research
fields that can benefit most from the appearance of ToF technology. Especially,
tracking algorithms that fuse the data of depth sensors and RGB cameras have seen
a significant increase in robustness.
Prior to the use of depth cameras, considerable research had been carried out in the
field of gesture recognition [1]. But traditional computer vision systems require
complex statistical models for recognition, which are difficult to use in practical
applications [2].
In [19] and [20], the authors suggest that it is easier to infer the geometry and
3D location of body parts using depth images. Because of that, several works focus
on the use of depth cameras to detect and track different types of gestures.
Some examples of gesture recognition follow.
Since depth sensors enable easy segmentation and extraction of localised parts
of the body, several efforts have been dedicated to hand detection and
tracking. For instance, in [21] a ToF camera is used to reconstruct and track a 7 Degree
of Freedom (DoF) hand model. In [22], the authors use a ToF camera for interaction
with computers, focusing on two applications: recognising the number of raised fingers
of one hand, and moving an object in a virtual environment using only a hand gesture.
Other works propose the use of ToF cameras to track hand gestures [20]. They use
the depth data of the ToF camera to segment the body from the background. Once
the body is segmented, they detect the head position, since most hand gestures
are relative to the rest of the body. In [23], an algorithm for recognising single-stroke
hand gestures in a 3D environment with a ToF camera is presented. The authors
modified the "$1" gesture recogniser to work with depth data. The modification
enabled fingertip gestures to be recognised even when not articulated from the same
perspective in which the gesture templates were recorded.
In [24], a stereo camera is used for the recognition of pointing gestures in the context
of Human-Robot Interaction (HRI). To do so, the authors perform visual tracking
of the head, the hands and the head orientation. Using a Hidden Markov Model (HMM)
based classifier, they show that the gesture recognition performance improves
significantly when the classifier is provided with the head orientation as
an additional feature. However, in [25] the authors report that they achieved better
results with a ToF camera. To do that, they extract a set of body features from the
depth images of a ToF camera and train a model of pointing directions using Gaussian
Process Regression. This model presented higher accuracy than simple criteria such
as the head-hand, shoulder-hand or elbow-hand lines, and than the mentioned work of [24].
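The simple line criteria mentioned above amount to normalising the vector between two tracked key-points; a minimal sketch with hypothetical joint positions:

```python
# Pointing direction as the unit vector through two body key-points (e.g. elbow-hand).
import math

def pointing_direction(origin, hand):
    """Unit vector from a reference joint (head, shoulder or elbow) to the hand."""
    v = [h - o for o, h in zip(origin, hand)]
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

# Hypothetical 3D joint positions in metres.
elbow = (0.30, 1.10, 0.20)
hand = (0.55, 1.10, 0.70)
print(pointing_direction(elbow, hand))
```

Intersecting this ray with the ground plane or a table surface then yields the pointed-at target.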
Some works focus on tracking other parts of the body. For example, [19] uses
a time-of-flight camera for a head tracking application. The authors use a
knowledge-based training algorithm to divide the depth data into several initial
clusters, and then perform the tracking with an algorithm based on a modified
k-means clustering method.
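As a reference, the standard k-means procedure that such approaches modify alternates an assignment step and an update step; a one-dimensional sketch over hypothetical depth readings (the cited work uses a modified variant, not shown here):

```python
# Standard k-means on 1-D depth values, for illustration only.
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

# Hypothetical depth readings (metres): head region vs. background wall.
depths = [0.9, 1.0, 1.1, 2.9, 3.0, 3.1]
print(kmeans(depths, [0.5, 2.0]))  # converges near [1.0, 3.0]
```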
Other approaches rely on kinematic models to track human gestures once the body
is detected. For instance, in [26] two stereo cameras are used to detect the hands
of a person and build an Inverse Kinematics (IK) model of the human body. In [27],
the authors present a model-based kinematic self-retargeting framework to estimate
the human pose from a small number of key-points of the body. They show that it is
possible to recover the human pose from a small set of key-points, provided an
adequate kinematic model and a good formulation of tracking control subject to
kinematic constraints. A similar approach is used in [28]. Here, the authors use a
Kinect RGB-D sensor to extract and track the skeleton model of a human body. This
allows them to capture the 3D temporal variations of the velocity vector of the hand
and model them in a Finite State Machine (FSM) to classify hand gestures.
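The idea of classifying a gesture by driving quantised hand-velocity observations through a state machine can be sketched as follows (a toy FSM, not the one of [28]):

```python
# Toy FSM: a "wave" is two alternations between leftward and rightward hand motion.
def quantise(vx):
    """Map the x component of the hand velocity to a discrete symbol."""
    return "right" if vx > 0.1 else "left" if vx < -0.1 else "still"

def is_wave(velocities_x):
    """Count direction alternations along the velocity sequence."""
    state, alternations = None, 0
    for vx in velocities_x:
        d = quantise(vx)
        if d != "still" and d != state:
            if state is not None:
                alternations += 1
            state = d
    return alternations >= 2

print(is_wave([0.4, 0.5, -0.4, -0.3, 0.5]))  # True: right -> left -> right
print(is_wave([0.4, 0.5, 0.3]))              # False: no alternation
```

A real system would use full 3D velocity vectors and one FSM per gesture class, accepting whichever machine reaches its final state.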
Most of these works capture only one or a few parts of the body. However,
combining the ease of body segmentation provided by the Kinect with recent
kinematic approaches like the one in [27] makes it possible to track the whole body
without a significant increase in CPU consumption. This is the case of the OpenNI
framework [29], which enables tracking the skeleton model of the user at
a low CPU cost. Because OpenNI is one of the key components of this project, it is
addressed in section ??.
Chapter 3
Description of the Hardware and Software Platform
Before describing the architecture developed for this project, it is necessary to
understand the building blocks on which it relies. This chapter presents and
describes the pre-existing modules and systems that have been used as the principal
components on which the developments of this project lean.
Each section of the chapter corresponds to one of the main modules that have
been used as the base of the project. Section 3.1 describes the robot Maggie and
its hardware components. Then, in section 3.2, the software architecture of the
robot Maggie is presented. In section 3.3, the Robot Operating System (ROS) is
described; ROS is used in several modules of the system, such as the vision system
and the learning system. The vision system is described in section 3.4, which
presents the Kinect technology and the algorithms that enable it to extract and track
skeleton models. Finally, in section 3.5 the learning framework is presented and the
algorithm on which the learning process relies is shown.
3.1 Hardware Description of the Robot Maggie
Maggie is a robotic platform developed by the RoboticsLab team at University Carlos III
of Madrid [7]. The objective of this development is the exploration of the fields
of social robotics and Human-Robot Interaction (HRI).
The main hardware components of the robot can be classified as sensors, actuators
and others. Maggie’s sensing system is composed of a laser sensor, 12 ultrasound
sensors, 12 bumpers, several tactile sensors, a video camera, 2 RFID detectors and
an external microphone. Fig. 3.1 depicts all the sensors of the robot, while Fig. 3.2
depicts the rest of the robot’s hardware. The laser range finder is a Sick LMS 200,
used to build maps of the environment surrounding the robot and to detect obstacles.
The 12 ultrasound sensors surround the robot’s base and are used as a complement
to the laser. The robot base also has 12 contact bumpers, used to detect collisions in
case the laser and ultrasound sensors fail. Maggie has 9 capacitive sensors installed
at different points of the robot’s skin; they are used as touch sensors to allow tactile
interaction with the robot. The video camera, mounted in the robot’s mouth, is used
to detect and track humans and objects near the robot. In addition to the standard
camera, the robot has recently been equipped with a Microsoft Kinect RGB-D sensor,
which allows it to retrieve depth information from a scene as well as standard RGB
images. The robot also has 2 RFID (Radio Frequency IDentification) detectors, one
located in the robot’s base and the other in the robot’s nose. These detectors extract
data from RFID tags; how the use of RFID tags allows the robot to extract information
from tagged objects is described in [30]. The external wireless microphone allows the
robot to receive spoken commands or indications from humans.
The robot is actuated by a mobile base with 2 DoF (rotation and translation in the
ground plane). The robot also has two arms with one DoF each. Maggie’s head
can move in two DoF (pitch and yaw). Finally, each eyelid of the robot can move in
one DoF.
The robot has some hardware dedicated to the interaction with humans and other
devices: an infrared (IR) emitter, a tablet PC, 3 speakers and an array of LEDs (Light
Emitting Diodes). The infrared emitter allows the robot to control IR-based devices
like TVs and stereo equipment. The tablet PC, located in Maggie’s chest, is used to
show information related to the robot or as an interaction device. The speakers allow
the robot to communicate orally with humans. The LED emitters, located in the
robot’s mouth, are used to show expressiveness when the robot is talking.
Figure 3.1: The sensors of the Robot Maggie - Notice the Kinect RGB-D sensor attached to its belly.
Figure 3.2: Actuators and interaction mechanisms of the Robot Maggie
The robot is controlled by a computer located inside its body. The computer
communicates with the exterior using an IEEE 802.11n connection. The operating
system that runs on the computer is Ubuntu 10.10 Linux. The AD architecture runs
on top of the OS.
3.2 The AD Software Architecture
The main software architecture of Maggie is a software implementation of the
Automatic-Deliberative (AD) Architecture [31]. AD is designed to imitate human
cognitive processes. It is composed of two main levels, the Deliberative Level and the
Automatic Level. The Deliberative Level hosts the processes that require high-level
reasoning and decision capacity; these processes require a large amount of time and
resources to be computed. The Automatic Level hosts the low-level processes, which
interact directly with the hardware, such as sensors and actuators. Usually, these
processes are lighter than the deliberative ones and, therefore, need less time and
fewer resources to be computed. Fig 3.3 shows the general schema of the AD
architecture. The AD architecture is composed of the following modules: the
sequencer, the skills, the shared memory system and the events system.
The basic component of the AD architecture is the skill [32]. A skill is the minimum
module that allows the robot to execute an action. It can reside in either AD level,
depending on its behaviour. A skill that executes complex reasoning or decision
functions is a Deliberative Skill and resides in the Deliberative Level. A skill that
controls hardware components or does not perform complex reasoning is an
Automatic Skill. Every skill has a control loop that executes its main functionality.
This control loop can run in three different ways: cyclic, periodic or event-triggered.
Also, every skill has three different states: ready, running and blocked.
Ready. The first state of the skill: the state between the moment the skill is
instantiated and the first launch of its control loop.
Running. The state while the control loop is being executed. This state is also
called the active state.
Figure 3.3: The AD architecture - Notice that the main communication systems between the skills are the shared memory system and the events system.
Blocked. The state while the control loop is not being executed.
Two parameters must be defined when the skill is instantiated for the first time.
The first is the time between loop cycles, in other words, the time between
two running states. The second is the number of times the control loop is executed.
Every skill can be activated or blocked in two different ways. The first is by other
skills; this can be done at both the Deliberative and Automatic levels. The second is
by the sequencer. The sequencer operates at the deliberative level, so only
deliberative skills can be activated or blocked by it.
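The control loop, the two instantiation parameters and the three states described above can be sketched as follows; the class and method names are illustrative, not the actual AD API:

```python
# Illustrative sketch of an AD-style skill (hypothetical names, not the real AD code).
import time

class Skill:
    def __init__(self, cycle_time_s, n_iterations):
        self.cycle_time_s = cycle_time_s    # time between two running states
        self.n_iterations = n_iterations    # number of control loop executions
        self.state = "ready"                # between instantiation and first run

    def control_loop(self):
        """The skill's main functionality; overridden by concrete skills."""
        pass

    def activate(self):
        for _ in range(self.n_iterations):
            self.state = "running"          # control loop being executed
            self.control_loop()
            self.state = "blocked"          # control loop not being executed
            time.sleep(self.cycle_time_s)

class CounterSkill(Skill):
    def __init__(self):
        super().__init__(cycle_time_s=0.01, n_iterations=3)
        self.count = 0

    def control_loop(self):
        self.count += 1

s = CounterSkill()
s.activate()
print(s.count)  # 3
```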
Every skill is launched as a process; therefore, communication between skills
is an inter-process communication problem. To solve it, two communication systems
have been designed and developed: the Shared Memory System and the Event
System. The Shared Memory System is composed of the Long Term Memory (LTM)
and the Short Term Memory (STM). The Long Term Memory stores permanent
knowledge. The robot uses the data stored in the LTM for reasoning or for making
decisions, and this data persists even if the system is shut down. The Short Term
Memory stores data that is only needed during the running cycle of the robot, for
example data extracted from sensors or data that needs to be shared among skills.
This data is not needed after the robot is powered off; therefore, the STM is not a
persistent memory.
The Event System is used to communicate relevant events or information between
skills. It follows the publisher/subscriber paradigm described by Gamma et al. in
[33]. Skills can emit or subscribe to particular events. When a skill needs to inform
other skills of a relevant event, it emits that event. All the skills subscribed to the
event receive a notification in their “inboxes” when it is triggered. Events can also
carry data related to the nature of the event itself. Every skill has an event manager
for each subscribed event; the event manager defines what to do when the event is
received. For example, an obstacle monitoring skill can trigger an “obstacle found”
event when an obstacle is detected. Other skills that need to know if an obstacle is
near the robot (for example, a movement skill) can subscribe to the “obstacle found”
event and act accordingly when it is received, for example by stopping the robot to
avoid a collision with the obstacle.
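The obstacle example can be sketched with a minimal in-process publisher/subscriber; the real Event System is distributed and inter-process, and all names here are illustrative:

```python
# Minimal sketch of the publisher/subscriber Event System described above.
from collections import defaultdict

class EventSystem:
    def __init__(self):
        self.subscribers = defaultdict(list)   # event name -> event managers

    def subscribe(self, event, manager):
        """Register a callback (an "event manager") for a named event."""
        self.subscribers[event].append(manager)

    def emit(self, event, data=None):
        """Notify every subscribed skill, passing optional event data."""
        for manager in self.subscribers[event]:
            manager(data)

events = EventSystem()
log = []
# A movement skill subscribes to "obstacle found" and stops the robot.
events.subscribe("obstacle found", lambda d: log.append(f"stop: obstacle at {d}"))
# The obstacle monitoring skill detects an obstacle and emits the event.
events.emit("obstacle found", "0.5 m")
print(log)  # ['stop: obstacle at 0.5 m']
```

Note that the emitter never names its receivers; this decoupling is what lets skills be added or removed without touching each other.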
Both the Shared Memory System and the Event System are designed and built
following a distributed architecture. This allows sharing data and notifications (events)
between different machines. Because the skills communicate only through the Shared
Memory System and the Event System, it is possible to run AD on different machines
simultaneously and keep the whole architecture connected.
3.3 Robot Operating System
ROS (Robot Operating System) [34] is an open-source meta-operating system for
robots. It provides services similar to those of an Operating System (OS),
including hardware abstraction, low-level device control, implementation of
commonly-used functionality, inter-process communication and package management.
Additionally, it provides tools and libraries for obtaining, building, writing and running
code in a multi-computer environment.
The ROS runtime network is a distributed, peer-to-peer network of processes that
interoperate in a loosely coupled environment using the ROS communication
infrastructure. ROS provides three main communication styles: synchronous
RPC-style communication called services, asynchronous streaming of data called
topics, and storage of data in a Parameter Server.
All the elements of the architecture form a peer-to-peer network where data is
exchanged between elements and processed together. The basic concepts of ROS
are nodes, the Master, the Parameter Server, messages, services, topics and bags. All of
these provide data to the network.
Nodes: Nodes are the minimum structural unit of the ROS architecture. They are
processes that perform computation. ROS is designed to be modular: a robot
control system usually comprises many nodes performing different tasks. For
example, one node controls the wheel motors of the robot, another controls
the laser sensors, another performs the robot localisation, another performs
the path planning, etc.
Master: The ROS Master is the central server which provides name registration and
lookup to the rest of the ROS network. Without the Master, nodes are not able
to find each other and, therefore, exchange messages between them, or invoke
services.
Parameter Server: The Parameter Server is part of the Master. It acts as a central
server where nodes can store and retrieve parameters at runtime. It is not
designed for high performance; instead, it is used for static, non-binary data
such as configuration parameters. It is designed to be globally viewable, allowing
tools and nodes to easily inspect the configuration state of the system and modify
it if necessary.
Messages: Nodes communicate with each other by exchanging messages. A mes-
sage is a data structure of typed fields.
Topics: Messages are routed via a transport system with publish/subscribe seman-
tics. A node sends out a message by publishing it to a given topic. A topic is
a name that is used to identify the content of a message. A node that is inter-
ested in a certain kind of data will subscribe to the appropriate topic. There may
be multiple concurrent publishers and subscribers for a single topic, and a single
node may publish or subscribe to multiple topics. In general, publishers and sub-
scribers are not aware of each others’ existence. The objective is to decouple
the production of information from its consumption.
Services: The publish/subscribe model is a very flexible communication paradigm,
but its many-to-many, one-way transport is not appropriate for request/reply in-
teractions, which are often required in a distributed system. Request/reply is
done via services, which are defined by a pair of message structures: one for
the request and one for the reply. A providing node offers a service under a name
and a client uses the service by sending the request message and awaiting the
reply. ROS client libraries generally present this interaction to the programmer
as if it were a Remote Procedure Call (RPC).
Bags: Bags are a format for saving and playing back ROS message data. Bags are an
important mechanism for storing data, such as sensor data, that can be difficult
to collect but is necessary for developing and testing algorithms.
24
![Page 43: A Supervised Learning Architecture for Human Pose Recognition in a Social Robot](https://reader033.vdocuments.us/reader033/viewer/2022051210/54f6badb4a7959430c8b4877/html5/thumbnails/43.jpg)
Finally, to close the ROS section, it is worth mentioning that ROS only runs on Unix-based
platforms and is language independent. It currently supports C++ and Python,
while support for other languages such as Lisp, Octave and Java is still in an
experimental phase. The first stable release of ROS was delivered in March 2010.
3.4 The Kinect Vision System
Microsoft's Kinect RGB-D sensor is a peripheral designed as a video-game controlling
device for the Microsoft Xbox console. Despite its initial purpose, it is
currently used by numerous robotics research groups thanks to its combination
of high capabilities and low cost. The sensor provides a depth resolution similar to
that of high-end ToF cameras, but at a cost several times lower.
The reason for this balance between capabilities and low cost resides in how the
Kinect retrieves depth information. To obtain it, the device uses
PrimeSense's Light Coding technology [35]. This technology consists in projecting
an Infra-Red (IR) pattern onto the scene, similarly to what structured-light sensors do.
However, Light Coding differs from structured light in the light pattern itself: while
structured light usually uses grids or stripe bands as a pattern, Light Coding emits a
dot pattern onto the scene [5], [36] (see Fig. 2.3c).
This projected light pattern creates textures that make finding the correspondence
between pixels easier, especially on shiny or texture-less objects or under harsh lighting
conditions. Also, because the pattern is fixed, there is no time-domain variation other
than the movement of objects in the camera's field of view. This ensures a
precision similar to that of ToF and structured-light cameras, but the IR receiver
mounted by PrimeSense is a standard CMOS sensor, which reduces the price of the device
drastically.
The sensor is composed of an IR emitter, responsible for projecting the light pattern
onto the scene, and a depth sensor responsible for capturing the emitted pattern. It is also
equipped with a standard RGB sensor that records the scene in visible light (see Fig.
3.4).
Both depth and RGB sensors have a resolution of 640x480 pixels. This facilitates
the matching between the depth and the RGB pixels. This calibration process, referred
3. DESCRIPTION OF THE HARDWARE AND SOFTWARE PLATFORM
Figure 3.4: Block Diagram of the PrimeSense Reference Design - This is the block diagram of the reference design used by the Kinect sensor. The Kinect incorporates both a depth CMOS sensor and a colour CMOS sensor. (Retrieved from [16])
to by PrimeSense as registration, is done at the factory. Other processes, such as correspondence1
and reconstruction2, are handled by the chip.
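As an illustration of the reconstruction step mentioned above, depth can be recovered from the observed disparity through the classical triangulation relation Z = f·B/d. The sketch below uses hypothetical focal-length and baseline values, since the document does not state the Kinect's actual on-chip parameters:

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Classical triangulation: Z = f * B / d.

    Illustrative only: the real pipeline runs on the PrimeSense chip and
    its exact parameters are not given in this document."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px

# Hypothetical values: 580 px focal length, 7.5 cm emitter-sensor baseline.
z = depth_from_disparity(disparity_px=29.0, focal_length_px=580.0, baseline_m=0.075)
# 580 * 0.075 / 29 = 1.5 metres
```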
Together with other organisations, PrimeSense has created a non-profit organisation
to promote the use of devices such as the Kinect in the field of natural
interaction. The organisation is named OpenNI (NI stands for Natural Interaction) [29].
OpenNI has released an open-source framework, also called OpenNI, that provides several
algorithms for using PrimeSense-compliant depth cameras in natural interaction
applications.
Some of these algorithms provide the extraction and tracking of a skeleton model
of the user who is interacting with the device. This project uses these algorithms to
get the data from the user's joints. In other words, the information provided
to the learning framework comes from the output of OpenNI's skeleton extraction
algorithms.
The kinematic model provided by OpenNI is a skeleton model of
1Correspondence means matching the pixels of one camera with the pixels of the other camera.
2Reconstruction means recovering the 3D information from the disparity between both cameras.
the body consisting of 15 joints. Fig. 3.5 shows these joints. The algorithms provide
the positions and orientations of every joint, and additionally the confidence
of these measures. Moreover, these algorithms are able to track up to four
simultaneous skeletons, but this feature is not used in this project.
ROS provides a package that wraps OpenNI and enables access to this
framework from other ROS packages1. Thanks to that, other packages can access
the data of the Kinect sensor. One such package is the pi_tracker package. This
package has a node, named Skeleton Tracker, which uses the OpenNI Application
Programming Interface (API) to retrieve the tracking information of the user. The node publishes the
data of the joints on the /skeleton topic so that other nodes can use it. This is the case of
the Pose Trainer node.
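A minimal sketch of what one snapshot of this skeleton data might look like on the subscriber side is shown below. The field and joint names are illustrative and do not claim to match the actual pi_tracker message definition:

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class Joint:
    """One joint as described in the text: 3D position, orientation
    (a quaternion) and a confidence value. Field names are illustrative,
    not the actual pi_tracker message definition."""
    position: Tuple[float, float, float]
    orientation: Tuple[float, float, float, float]
    confidence: float

# The 15 joints of the OpenNI kinematic model (see Fig. 3.5).
JOINT_NAMES = [
    "head", "neck", "torso",
    "left_shoulder", "left_elbow", "left_hand",
    "right_shoulder", "right_elbow", "right_hand",
    "left_hip", "left_knee", "left_foot",
    "right_hip", "right_knee", "right_foot",
]

def make_empty_frame() -> Dict[str, Joint]:
    """One snapshot of the kinematic model, as received every frame."""
    return {name: Joint((0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 1.0), 0.0)
            for name in JOINT_NAMES}

frame = make_empty_frame()   # one 15-joint snapshot
```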
(Figure content: the 15 joints of the kinematic model: head, neck, torso, and the left and right shoulders, elbows, hands, hips, knees and feet.)
Figure 3.5: OpenNI's kinematic model of the human body - OpenNI algorithms are able to create and track a kinematic model of the human body. The model has 15 joints, and their positions and orientations are updated at 30 frames per second.
1The package is openni_kinect.
3.5 The Weka Framework
The Machine Learning Framework on which this project relies is Weka (Waikato En-
vironment for Knowledge Analysis) [37]. Weka is a popular suite of machine learning
software written in Java, developed at the University of Waikato, New Zealand. Weka
is open-source software released under the GNU General Public License.
Weka supports several data mining tasks such as data preprocessing, classification,
clustering, regression, visualization, and feature selection.
Most of Weka's operations assume that the data is available in a single text file or
relation. Weka also provides access to SQL databases using Java Database Connectivity,
and it can process as input any result returned by a database query. In both files
and database queries, each data instance is described by a fixed number of attributes.
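As an illustration of such a single text file, the following sketch writes data in Weka's ARFF format. The relation and attribute names are invented for the example and are not the ones used in this project:

```python
def to_arff(relation, attributes, classes, rows):
    """Build a minimal ARFF text, the single-file format Weka consumes.

    Relation and attribute names here are invented for the example."""
    lines = ["@relation " + relation, ""]
    for name in attributes:
        lines.append("@attribute %s numeric" % name)   # one line per attribute
    lines.append("@attribute class {%s}" % ",".join(classes))  # nominal class
    lines.append("")
    lines.append("@data")
    for values, label in rows:
        lines.append(",".join("%g" % v for v in values) + "," + label)
    return "\n".join(lines)

arff = to_arff("poses", ["head_x", "head_y", "head_z"],
               ["SIT", "STAND"],
               [((0.1, 1.6, 2.0), "STAND")])
```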
It is possible to operate Weka through a user interface, from the command line, or
by accessing its Application Programming Interface (API). The latter is the method chosen
for this project.
Since Weka's API is written in Java and ROS provides a client library for
this language, all the elements of the architecture that use Weka's API are
programmed as ROS nodes. Although the ROS Java client library is, at this date, in an
experimental phase, its core functionalities are sufficiently stable to be used. These
functionalities are connecting to the ROS Master and subscribing and publishing to
topics.
Since the project focuses on supervised learning methods, only these methods are
used from the Weka framework. Concretely, the C4.5 decision tree [38] has been
chosen as the main algorithm to learn and detect the poses of the user. This algorithm
has been chosen for its good performance [39] and for the possibility of inspecting what the robot
has learnt during the learning phase.
Chapter 4
The Supervised Learning Based Pose Recognition Architecture
This chapter describes all the modules that have been built in order to learn to detect
human poses. It is divided into two parts, each describing one phase of the learning
process. First, Section 4.1 describes all the components that operate in the training
phase. Second, Section 4.2 describes all the components that operate in the
process of recognising the poses of the user once the system has been trained. But
before entering into detail, an overview of the system is provided.
Fig. 4.1 depicts the general scheme of the built architecture. It consists of two
differentiated parts, each with a single purpose. The upper part represents the
training phase, where the user teaches the system to recognise certain poses. The
lower part of the figure represents the classifying phase, where the user stands in a
determined pose and the robot tells her which pose she is in.
In the training phase, the robot uses two sensory systems to learn from the user.
The first is its Kinect-based RGB-D vision system. With it, the robot acquires the figure
of the user separated from the background and processes it to extract a kinematic
model of the user's skeleton. The second input is the Automatic Speech Recognition
(ASR) system, which allows the robot to process the words said by the user and
convert them into text strings.
In other words, the sensory system captures a data pair formed by:
The pose of the user, defined by the configuration of the joints of the kinematic model
of the user
4. THE SUPERVISED LEARNING BASED POSE RECOGNITION ARCHITECTURE
Figure 4.1: Overview of the Built System - The upper part of the diagram shows the training phase, where the user teaches the robot by verbal commands which poses it must learn. The lower part of the diagram depicts the classifying phase, in which the robot loads the learnt model and tells the user's current pose using its voice system.
A label identifying this pose, defined by the text string captured by the auditory system
of the robot.
These two sensory inputs are fused together to form a data pair that is stored
in a dataset. At the end of the training process, the dataset contains all the pose-label
associations that the user has shown to the robot. This dataset is processed by a
machine learning framework that builds a model establishing the relations between
the poses and their associated labels. That is, the learned model establishes the rules that
define when a determined pose is associated with a certain label.
In the classifying phase, the robot continues receiving snapshots of the skeleton
model every frame. But this time, it does not receive the auditory input telling it
the pose of the user. This is the moment when the robot has to guess that pose. To
do so, it loads the learnt model into a classifier module that receives the
skeleton model every frame as input. The skeleton is processed by the model, which returns the
label corresponding to that skeleton.
Then, the label is sent to the Emotional Text To Speech (ETTS) module of the
robot. The ETTS module is in charge of transforming text strings into audible phrases
that are said through the robot's speakers. In this way, the label is sent to the ETTS module
and then said by the robot.
Fig. 4.2 presents an overview of the whole system, going deeper into how
all these processes are carried out and detailing the modules and messages that
participate in them. There, the reader can see that the architecture is divided
into two separated parts, the AD part and the ROS part. AD provides powerful tools
for HRI; especially relevant are the ASR and ETTS skills. The ASR skill processes the
speech of the user and transforms it into text. On the other hand, the ETTS skill
processes strings of text and transforms them into audible words or phrases emitted through
the robot's speakers.
Therefore, all the parts of the architecture that need verbal inputs or outputs from
or to the user have been developed as AD skills, or have some mechanism to communicate
with AD skills. In other words, the interaction with the user is carried out in
the AD part of the system. One of these interaction skills is the pose_labeler_skill,
described in section 4.1.1.
In the other part of the architecture, ROS provides the OpenNI and other packages
to track humans and extract their skeleton from the Kinect sensor. Therefore,
the components which need information about the human skeleton have been developed
as ROS nodes. Additionally, since ROS provides a client library in Java, all
the components of the architecture that need to access the Weka framework have
been programmed as ROS nodes as well. This is the case of the pose_trainer and
pose_classifier nodes, described in sections 4.1.3 and 4.2.1 respectively.
Figure 4.2: Diagram of the built architecture - The figure depicts a diagram of the whole architecture and its components. Note that all the interaction modules reside in the AD part of the architecture while the skeleton tracking algorithms and the learning framework reside in the ROS part.
Additionally, there are some components whose main function is to establish links
between ROS and AD and to enable communication between their modules. These
are the Pose_Labeler_Bridge and the Pose_Teller, described in sections 4.1.2 and
4.2.2 respectively.
Since the architecture is composed of several modules that act independently, the
key to integrating these modules is the messaging system of the architecture.
Communication among the architecture modules is done by exchanging asynchronous
messages. In other words, if a module needs information from another
module, it subscribes to its publications. Conversely, when a module has some
information that needs to be shared with others, it publishes it to the network. That
means that the mechanism followed for the exchange of information in the architecture
is based on the publisher/subscriber paradigm.
The following sections describe briefly what these modules do, how they operate
and, finally, how they link together to build the complete system. The description of
the components follows the usual temporal sequence of a user using the system.
Usually, the user would first train the pose classifier and then use it
to classify her pose. The former is described in section 4.1 and the latter in
section 4.2.
4.1 Training Phase
The training phase is the phase of the process in which the classifier is built. In this
phase, the human teaches the robot to recognise some poses. Completing this phase
means that a classifier is built and it can be used in the classifying phase. This section
describes all the components of the architecture that have been developed to train the
system.
Figure 4.3 depicts the temporal sequence of the training phase, showing the collaboration
among all the modules that participate in it. Summarised, the training phase
occurs in the following sequence. First, the system needs to detect what the user is
saying to the robot; this is explained in section 4.1.1. If what she is saying is a valid pose,
it is labeled and sent to a node in charge of communicating the interaction modules
(which are located in the AD part of the system) with the ROS modules (which
are in charge of the learning step). The bridging between AD and ROS is described
in section 4.1.2. Finally, the label arrives at the machine learning module, which
gathers the label describing the human pose and the data from the vision sensor. This
data is written to a dataset and sent to the Weka framework to build a classifier
able to detect human poses. This final part of the process is described in section 4.1.3.
Figure 4.3: Sequence Diagram of the training phase - The pose_trainer node is the node which fuses the information of the skeleton model and the labels told by the user.
4.1.1 Pose Labeler Skill
When the user starts training the robot, she has to carry out two tasks. The first
is to put herself in the pose she wants to show the robot. The second is to tell
the robot which pose she is in. From the robot's point of view, it first has to detect the
human pose and, second, it has to understand what the user is saying to it. The former
is described in section 4.1.3 while the latter is described below.
The user reports her pose to the robot by saying it. The robot has an Automatic Speech
Recognition (ASR) system that allows it to detect and process natural language. This
system mainly relies on the AD ASR skill [40]. The ASR skill detects the human
speech and processes it according to a predefined grammar. If the speech of the human
matches one or more of the semantic elements of that grammar, the ASR skill sends
an event notifying all the other skills that it has recognised a speech. Other skills
which are subscribed to the ASR events read the results obtained by the ASR skill
and process them accordingly.
The ASR skill cannot understand what the user is saying unless it is previously
provided with a grammar that defines the semantics of these speeches. Therefore, to
make the ASR skill able to understand which pose is being said by the
human, we need to build a special grammar. This grammar is summarised below and
described in Annex C.
In this scope, a grammar is a sequence of possible word combinations that the
user can say, linked to their semantic meaning. In this way, the built grammar is able to
detect several words that define distinct poses of the human. The semantics of those
words are coded into labels that can be represented as variables in a computer program.
The grammar has been built in a manner that allows detecting up to 18 labels by
combining different semantics in three categories:
1. Position Semantics
(a) SIT. Defines that the user is sitting on a chair.
(b) STAND. Defines that the user is standing in front of the robot.
2. Action Semantics
(a) TURNED. Defines that the user's body is turned towards a direction defined in the directions category.
(b) LOOKING. Defines that the user is looking towards a direction defined in the directions category.
(c) POINTING. Defines that the user is pointing with her arm to a location defined in the directions category.
3. Direction Semantics
(a) LEFT. Defines that the action the user is performing is towards her own left side. For example, if she is pointing, she is pointing to her left.
(b) FORWARD. Defines that the action performed by the user is towards her front.
(c) RIGHT. Defines that the action performed by the user is towards her right.
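The combination of the three categories can be sketched as follows; enumerating one semantic from each category yields the 18 labels mentioned above:

```python
from itertools import product

# The three semantic categories described above.
POSITIONS = ["SIT", "STAND"]
ACTIONS = ["TURNED", "LOOKING", "POINTING"]
DIRECTIONS = ["LEFT", "FORWARD", "RIGHT"]

def all_labels():
    """Combine one semantic from each category into a label such as
    SIT_LOOKING_LEFT, yielding 2 * 3 * 3 = 18 labels."""
    return ["_".join(parts) for parts in product(POSITIONS, ACTIONS, DIRECTIONS)]

labels = all_labels()   # the 18 labels the grammar can produce
```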
When the ASR skill detects a combination of words that defines the three semantics
above, it emits an event to notify all its subscribers and stores the recognition
results in the Short Term Memory system. One of these subscribers is the Pose
Labeler Skill. When it receives an event from the ASR skill, the Pose Labeler Skill reads
the recognition results from the Short Term Memory and analyses their semantics to
form a label. Examples of labels are SIT_LOOKING_LEFT, meaning that the user
is sitting and looking towards her left; or STAND_TURNED_FORWARD, meaning that
the user is standing and turned to the robot. Finally, if the label is a valid label (i.e. one
of the labels from above), the Pose Labeler Skill sends an event with the label ID. Fig.
4.4 summarises the main functionalities of the Pose Labeler Skill.
Figure 4.4: Use Case of the Pose Labeler Skill - The figure depicts the use case diagram of the Pose Labeler Skill. Its main function is to transform the labels told by the user into system-readable labels and emit them as AD events.
In addition to the semantics shown above, the Pose Labeler Skill can also process
two more semantics which have no relation to the labels. These semantics act as
a control layer that allows the user to control by voice some aspects of the training phase.
These semantics are the following:
• CHANGE. Used when the user wants to change her pose. It allows the classifier to discriminate the transitions between two poses.
• STOP. Used to end the training process. When the user says she wants to finish the training, the ASR builds this semantic to allow the Pose Labeler Skill to end it.
4.1.2 Pose Labeler Bridge
Figure 4.5: Use Case of the Pose Labeler Bridge - The figure depicts the use case diagram of the Pose Labeler Bridge ROS node. Its main function is to listen for the AD LABELED_POSE events and to bridge their data to the ROS /labeled_pose topic.
The Pose Labeler Bridge is the next step of the process. It is a ROS node that acts
as a bridge between AD and ROS. Its main functionality is to transform the events
emitted by the Pose Labeler Skill into a ROS topic. Concretely, when the Pose Labeler
Skill detects a label from the ASR's recognition results, it emits an event called
LABELED_POSE. The Pose Labeler Bridge parses the content of the messages sent
through this event and transforms them to fit in a ROS topic called /labeled_pose. Fig.
4.5 summarises the main functionalities of the Pose Labeler Bridge.
4.1.3 Pose Trainer
The last step of the training process is performed by the Pose Trainer node. The Pose
Trainer node is a ROS node that does several things (see Fig. 4.6). First of all, it
subscribes to the /labeled_pose topic to know the pose of the user. Secondly, it also
subscribes to the /skeleton topic (see section ?? for more information about this topic).
The Pose Trainer node reads the messages of this topic to extract the information of
the joints of the user. This information is combined with the information from the
/labeled_pose topic and formatted properly to be understood by the Weka framework.
Figure 4.6: Use Case of the Pose Trainer Node - The figure depicts the use case diagram of the Pose Trainer Node. Its main function is to receive messages from the /labeled_pose and /skeleton topics, and build a dataset with these messages. After the dataset is built, the node also creates a learned model from it.
Each skeleton message is coded as a Weka instance. Each instance has 121
attributes, divided in the following way. The message consists of 15 joints, with 3 attributes
for the position of each joint, 4 attributes for the orientation1 of each joint,
and 1 attribute for the confidence of the measures of that joint. This makes 120
attributes. The last attribute is the label that comes from the /labeled_pose topic. This
last attribute is also the class of the instance2.
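The flattening described above can be sketched as follows. The joint field names are illustrative, not the actual message or Weka attribute names used by the node:

```python
def skeleton_to_attributes(joints, label):
    """Flatten a 15-joint skeleton message into the 121 attributes described
    above: per joint 3 position + 4 orientation (quaternion) + 1 confidence
    values, plus the class label. Field access is illustrative only."""
    values = []
    for joint in joints:
        values.extend(joint["position"])      # 3 attributes
        values.extend(joint["orientation"])   # 4 attributes (quaternion)
        values.append(joint["confidence"])    # 1 attribute
    values.append(label)                      # attribute 121: the class
    return values

dummy_joint = {"position": (0.0, 0.0, 0.0),
               "orientation": (0.0, 0.0, 0.0, 1.0),
               "confidence": 1.0}
instance = skeleton_to_attributes([dummy_joint] * 15, "SIT_LOOKING_LEFT")
# 15 joints * 8 values + 1 label = 121 attributes
```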
While the Pose Trainer node is running, it collects /skeleton and /labeled_pose
messages and fuses them, creating instances. During its operation, the node
continuously builds a dataset of instances from the received messages. Finally, when it
receives a label with the "STOP" identifier, it stops adding messages to the dataset.
Fig. 4.3 shows this process.
1The orientation is coded as a quaternion.
2The class of an instance is the attribute that tells the learning algorithm to which class the instance belongs. In other words, it is the attribute that tells the classifier how this data must be classified.
With the dataset completed, the node calls the Weka API in order to build
a model from the dataset. This model is the classifier that will be used in section 4.2
to classify the poses of the user. For this project, only one model of classifier is built
from the dataset. This model is Weka's J48 decision tree, which is an open-source
implementation of the C4.5 decision tree (see section 3.5).
If the dataset has representative data, the model will be able to generalise to other
situations for which the classifier has not been trained. If not, the classifier will not be
able to classify situations that were not contemplated during the training phase and
will make several classification errors.
The structure of the Pose Trainer node is depicted in Fig. A.1 from Annex A.
4.2 Classifying Phase
The classifying phase is the phase in which the robot starts guessing the pose of the
user. To do so, it needs a previously created model of the poses the user may
adopt. The main elements of the classifying phase are the Pose Classifier node and the
Pose Teller node. The first is described in section 4.2.1 while the second is described
in section 4.2.2. The temporal sequence of how these nodes interact is depicted in
Fig. 4.7.
4.2.1 Pose Classifier
The Pose Classifier ROS node is the node that classifies the pose of the user. Its
main functions are depicted in Fig. 4.8. To do so, it needs two different inputs. The first
one is the knowledge to decide the pose of the user from the data of her joints; this
comes from the classifier built in section 4.1.3. The second input the
node needs is the content of the /skeleton topic messages. As said above, these
messages contain the information of the user's joints.
The node subscribes to the /skeleton topic and starts reading its messages. For
each received message, the node parses and formats it as a Weka instance. This
instance is similar to the instances created by the Pose Trainer node in section 4.1.3,
but these instances differ from the others in one aspect: they do not have the
Figure 4.7: Sequence Diagram of the classifying phase - The pose_classifier node processes the skeleton messages using the learnt model and sends the output to the voice system of the robot.
Figure 4.8: Use Case of the Pose Classifier Node - The figure depicts the use case diagram of the Pose Classifier Node. The node loads a previously trained model to classify skeleton messages into known poses.
class defined. In other words, the class of the instance is not set. The Pose Classifier
node uses the classifier to determine to which class the instance should belong.
After classifying the instance, the node emits a message to the /classified_pose
topic. In fact, the sent message has the same format as the messages sent to the
/labeled_pose topic. The difference this time is that the former refers to labels that
have been deduced, while the latter are labels specified by the user.
The structure of this node is depicted in Fig. A.2 from Annex A.
4.2.2 Pose Teller
The Pose Teller node is the node in charge of telling the user which pose she is in. In
other words, it tells the user what pose has been detected by the classifier. The pose
of the user is announced by the Pose Classifier node on the /classified_pose topic, but
the content of the messages of this topic is not understandable by humans. Therefore,
a node is needed that translates that content into something that can be understood by
people. The Pose Teller node is the module which carries out this task (see Fig. 4.9).
First of all, the Pose Teller node subscribes to the /classified_pose topic and reads its
messages to retrieve the label identifier written by the classifier. Then, it transforms
the label ID into a description that can be understood by the user. For example, if the
label is STAND_LOOKING_LEFT, the node transforms it to the text "You’re standing
and looking to the left". But this is only a text string. If we want the robot to say this
text, it must be sent to the AD’s ETTS (Emotional Text To Speech) Skill [41]. This skill
is in charge of transforming text strings into audible speeches. Therefore, the Pose
Teller node sends the "textified" label to the ETTS skill by an AD event. Finally, the
ETTS skill pronounces the text using the robot’s voice system.
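The label-to-text step can be sketched as a simple lookup. The phrase table and the textify helper below are illustrative assumptions, not the node's actual code; the real node additionally forwards the resulting string to the ETTS skill through an AD event.

```python
# Minimal sketch of the Pose Teller's label-to-text step. The phrase table
# is illustrative; the real node also emits an AD event towards the ETTS skill.

PHRASES = {
    "STAND": "standing",
    "SIT": "sitting",
    "TURNED": "turned",
    "LOOKING": "looking",
    "POINTING": "pointing",
    "LEFT": "to the left",
    "RIGHT": "to the right",
    "FORWARD": "forward",
}

def textify(label_id):
    """Turn a label like STAND_LOOKING_LEFT into a human-readable sentence."""
    position, action, direction = label_id.split("_", 2)
    return "You're {0} and {1} {2}".format(
        PHRASES[position], PHRASES[action], PHRASES[direction])

print(textify("STAND_LOOKING_LEFT"))  # You're standing and looking to the left
```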
[Use case bubbles: pose_teller — Subscribe to the /classified_pose topic; Tell the etts_skill to say the labels of the /classified_pose topic.]
Figure 4.9: Use Case of the Pose Teller Node - The figure depicts the use case diagram of the Pose Teller Node. Its main function is to receive labels from the /classified_pose topic and send them to the ETTS skill to make it say the label.
Chapter 5
Pilot Experiments
The main objective of carrying out a pilot experiment is to test that the classifier is able to
learn human poses. But, apart from that, there is another objective that emerged during
the design of the system. During the initial tests of the trainer it was observed that the
trainer node built different models depending on the kind of data, or features, provided
to it. That is, the pose_trainer node built a different model when it was fed only the
positions of the joints than when it was fed both their positions and orientations.
It was also observed that when the classifier was trained using all three data
types (position, orientation and confidence), the human trainer had to pay attention
to feeding the node with representative data. In other words, if the human trained the
pose_trainer node while standing in a single position, the classifier was only able to detect the
learned poses when they were shown in the exact same position in which they were trained.
Therefore, if we want to build a classifier that is able to generalise, we have two
options. The first is to train the classifier without giving it position data. The
second is to make the position data irrelevant during the training process. The
first option involves pre-processing the data before it is given to the trainer node;
during this pre-processing stage, the position of the joints is removed. The second option feeds
the pose_trainer node with all the data from the human joints, but during the training
process the human must move around the field of view of the Kinect sensor in order
to feed the node with all the relevant data.
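The pre-processing of the first option amounts to filtering the position fields out of each joint before the data reaches the trainer. A minimal sketch, assuming the joint attributes are held in a dictionary keyed by the Joint message field names (posX..posZ, orientX..orientW, confidence):

```python
# Sketch of the first option: drop the position attributes so the trainer
# only ever sees orientation and confidence. Field names follow the Joint
# message; the dict representation is an assumption for illustration.

POSITION_FIELDS = ("posX", "posY", "posZ")

def strip_positions(joint):
    """Return a copy of one joint's attributes without its position."""
    return {k: v for k, v in joint.items() if k not in POSITION_FIELDS}

joint = {"jointName": "torso", "confidence": 1.0,
         "posX": 0.1, "posY": 1.2, "posZ": 2.3,
         "orientX": 0.0, "orientY": 0.1, "orientZ": 0.0, "orientW": 0.99}

filtered = strip_positions(joint)
print(sorted(filtered))  # no pos* keys remain
```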
Note that, while the first option relies on the automation of the process, the second
places the responsibility on the human teacher, who is in charge of providing
the classifier with good data that will allow it to generalise. But, a priori, it is not clear
which method builds better classifiers. Therefore, the objective of this evaluation is to
discover which of the two methods produces better models. This is especially relevant
because it is difficult to guarantee that users are "good teachers" who will train
the classifier with good and representative examples.
Our initial intuition is that the classifier can detect the poses
quite accurately if a good model is provided, regardless of which method is used. But
our hypothesis is that it is easier to build better models with fewer features,
especially if those features better represent the states of a pose. In other words, it
seems that the joint orientations are more representative for detecting poses than the
information retrieved from the positions of the joints.
5.1 Scenario
To validate the hypothesis, two datasets were created. Both datasets consist of the data from a user who trained the classifier to detect three different poses:
1. Standing, turned left
2. Standing, turned front
3. Standing, turned right
But each dataset was recorded in a different manner. In the first dataset, "D1", the
user trained the classifier showing the three poses in different positions. In the
second dataset, "D2", the user trained each pose without changing her relative
position to the robot. That means that D1 has better training data than D2. Also, for
each dataset, two models have been built, each using different
features or attributes. The differences between the datasets and their models
are listed below:
• Dataset 1 (D1): Representative data. The user moved around while showing the poses.
– Model 1 (M1): All attributes were used to construct the model (position, orientation and confidence).
– Model 2 (M2): Only orientation and confidence attributes were used to build the model.
• Dataset 2 (D2): Not representative data. The user did not move from her original position.
– Model 3 (M3): All attributes were used to construct the model (position, orientation and confidence).
– Model 4 (M4): Only orientation and confidence attributes were used to build the model.
The recording of the datasets D1 and D2 was done in the scenario depicted in Fig.
5.1. The scenario consists of the robot Maggie equipped with a Kinect sensor. The
rectangle in Fig. 5.1 shows the area where the user who trained the classifier was
standing. She was able to move wherever she wanted inside the rectangle while
recording the poses. The only conditions were that she was not allowed to exit the
rectangle during the recording phase and that she was not allowed to change her pose
without warning the robot. During the recording of dataset D1, the user moved
throughout the rectangle, but in the case of dataset D2, the user did not move
from the center of the rectangle.
[Figure labels: 240 cm, 180 cm, Robot, User, User Area, Kinect Horizontal Field of View.]
Figure 5.1: Scenario of the experiment - The cone represents the field of view of the Kinect sensor. The user was allowed to move inside the rectangle, but turned in the direction of the arrows.
After the training phase, 12 different users recorded a test dataset.
Each of them recorded the same 3 poses as the user who trained the four
models. Moreover, they had the same recording conditions as the user who recorded
dataset D1; in short, they were allowed to move inside the rectangle during the
recording phase.
The data of the twelve users were recorded and gathered into a single dataset file,
which was then tested against the four models. The results of these tests are summarised
in the following section and stated completely in Annex B. A discussion of
these results is addressed in section 5.3.
5.2 Results
The results of the experiment are summarised in Table 5.1. The table presents how
the models M1, M2, M3 and M4 performed against the test dataset. The best models
were M1 and M2, with more than 92% of correctly classified instances and barely
4% of false positives. Next, model M4 performed worse, with 70%
of correctly classified instances and 14% of false positives. Finally, as
expected, model M3 showed the worst performance, with 56% of correctly classified
instances and 21% of false positives.
Note that the table shows the results of models M1 and M2 in the same row. This
is because they used the same dataset (D1) to build their trees and, in the end, the J48
algorithm built the same tree in both cases. Fig. 5.2 depicts the tree of models
M1 and M2. Although they used different data from the dataset1, the J48 algorithm
determined that the relevant information of D1 was located in the orientation attributes,
producing the same tree in both cases.
This did not happen with the tree built for model M3 (see Fig. 5.3). Here, the algorithm
that built the tree considered that some relevant information in the training dataset
was in the position of the right knee. When the users tested the tree, their positions
varied with respect to the user who trained it, which caused several errors.
The last tree, M4 (Fig. 5.4), is similar to M3, with the difference that M4
only uses orientation information. This enabled it to be more accurate than M3, but
1 Remember that M1 used position, orientation and confidence, while M2 used only orientation and confidence
| Model  | TP Rate | FP Rate | Precision | MAE    | RMSE   |
|--------|---------|---------|-----------|--------|--------|
| M1, M2 | 0.926   | 0.039   | 0.940     | 0.0049 | 0.0583 |
| M3     | 0.563   | 0.213   | 0.687     | 0.0416 | 0.204  |
| M4     | 0.700   | 0.141   | 0.796     | 0.0286 | 0.1691 |

Table 5.1: Results of the Pilot Experiment - Models M1 and M2 produced the same results and performed better than the other models. Note: TP = True Positive; FP = False Positive; MAE = Mean Absolute Error; RMSE = Root Mean Squared Error.
not as much as M1 and M2. The reason could be that the M1/M2 tree is a bit more
complex, so it seems it can cover more cases than M4.
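The accuracy figures in Table 5.1 follow directly from the confusion matrices detailed in Annex B. As a cross-check, the overall TP rate of M1/M2 can be recomputed from the matrix of Table B.2 (rows: true class, columns: predicted class):

```python
# Recompute the overall TP rate of models M1/M2 from the confusion matrix
# reported in Table B.2 (Annex B).

confusion = [
    [1171, 317,    0],   # STAND_TURNED_LEFT
    [   0, 1648,   6],   # STAND_TURNED_RIGHT
    [   0,   30, 1619],  # STAND_TURNED_FORWARD
]

correct = sum(confusion[i][i] for i in range(3))   # diagonal = correct instances
total = sum(sum(row) for row in confusion)
tp_rate = correct / total
print(round(tp_rate, 3))  # 0.926, matching the M1/M2 row of Table 5.1
```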
[Decision nodes: torsoOrient_w (split at 0.956792), torsoOrient_y (0.11679), left_kneeOrient_w (0.609245); leaves: STAND_TURNED_RIGHT, STAND_TURNED_FORWARD, STAND_TURNED_LEFT.]
Figure 5.2: The tree built for the models M1 and M2 - The classifier only used the orientations of the joints.
5.3 Discussion
The results show that the initial hypothesis is partially validated by this experiment.
The hypothesis stated that using only orientation and confidence attributes would
[Decision nodes: right_kneeOrient_Z (split at -0.196633), right_kneePos_X (0.100138); leaves: STAND_TURNED_RIGHT, STAND_TURNED_FORWARD, STAND_TURNED_LEFT.]
Figure 5.3: The tree built for the model M3 - This time, the classifier used position information of the joints.
[Decision nodes: right_kneeOrient_Z (split at -0.196633), torsoOrient_w (0.699489); leaves: STAND_TURNED_RIGHT, STAND_TURNED_FORWARD, STAND_TURNED_LEFT.]
Figure 5.4: The tree built for the model M4 - The tree is quite similar to the one in Fig. 5.3, but this one only uses orientation information for its joints.
lead to models with higher generalisation capabilities. This has only been partially
validated. When a good training set is provided, as in dataset D1, the classifier builds
models that are able to generalise, no matter which attributes are used to build
those models. This is the case of models M1 and M2, which ended up being the same.
However, when the dataset has no relevant information, using only orientation
attributes leads to classifiers that perform better than classifiers built using both
position and orientation attributes. In the studied case this difference was nearly 15%
in correct classification and nearly 7% in false positives.
In fact, the classifier built a model based on orientation attributes
rather than positions. This means that, as anticipated, the orientation
information was more significant than the position information. But it also
means that if the classifier is provided with a representative dataset, it will choose the
most significant attributes by itself.
To sum up, it seems that providing good training data to the classifier is of paramount
importance. Good datasets produce better models no matter which joint attributes
are used to build them. But when it is not possible to ensure that the user will train the
classifier properly, it may be better to avoid the use of position attributes.
Chapter 6
Conclusions
A pose recognition architecture has been built and integrated in the robot Maggie. This
architecture allows the robot to learn the poses that the user teaches it.
The system relies on two main pillars. The first is the vision system of the robot,
composed of a Kinect depth camera and its official algorithms to track people.
The sensor and its algorithms have proven robust to changes in lighting
conditions and partial occlusions of the body. The second pillar of the learning
platform is the HRI capabilities of the robot Maggie, especially its ability to communicate
with people by voice. Thanks to that, the user taught the robot by speaking to it as she
would with other people.
To validate the architecture, a pilot experiment was carried out. In the experiment,
two datasets were used to build 4 different models that were tested against twelve
people. The experiment demonstrated that the learning system is able to detect the
poses of the users with high accuracy when a good training dataset is
provided to the robot.
The main contributions of this project are listed below.
• A Pose Recognition Architecture has been developed and integrated in a social
robot, allowing it to recognise the poses of different people with high accuracy.
• ROS has been integrated with the AD architecture. Although the integration is
not complete, the initial communication mechanisms between both architectures have
been established. Additionally, some AD skills have started the process of
becoming both AD skills and ROS nodes.
• The Weka machine learning framework has been integrated with the AD archi-
tecture, thanks to the integration between ROS and AD.
• The robot Maggie now fully integrates the Kinect sensor, its drivers and the
OpenNI framework. This opens many new possibilities, not only in the
HRI field, but also in the object recognition and navigation fields.
• The robot Maggie is now able to understand human poses. This means
that the robot can now understand some contextual information when it
is interacting with a user. Thanks to that, the robot can adapt its behaviour
to this context and improve the interaction quality as it is perceived
by the user. As an example, imagine that two people are talking to each other
just in front of the robot. If the robot has its ASR turned on, it will process every
word of the conversation even though none of it is addressed to it. But because
the robot can see that these two people are turned towards each other, it may infer
that those words are not addressed to it and simply discard them.
Extending the architecture is the first step to be taken as future work.
Since the architecture has proven valid within its field of use, it would be
a worthwhile challenge to extend it to other fields such as gesture recognition. This would enable
the robot to better understand interaction situations with users.
This work opens the door to building a continuous learning framework. Thanks
to the integration between the learning system and the interaction system, it is now
possible to strengthen these bonds to build a more generic platform that would allow the
robot to learn continuously from its environment and its partners.
Since the learning platform allows the robot to understand information from the
user, one possible line of work is to study how the robot can infer the intentions of
users from the information it has learnt from them. Now that the robot can learn
human poses, the next step is to understand what these poses mean in
an interaction context.
Moving the focus from the generic to the specific, and going into the details of the
developed learning system itself, another line of further work is to go deeper into
the relation between the position attributes, the orientation attributes and the quality of the
training. Although it seems that good training data is enough, this relation deserves
a deeper study.
It would be interesting to know whether changes in the reference coordinate system affect
the training phase. For example, some poses are relative to the
robot, such as being near to or far from it; in this case, it is clear that the adequate
coordinate frame is the robot’s. But in other cases it might be better to use other
coordinate frames. An example would be a user separating his arms to announce
that something is big and, in the contrary pose, bringing his arms closer together to point out that
something is small. It has to be studied whether, in this last case, it is better to use a
coordinate frame whose origin is not the robot’s sensor but, for instance, one of the user’s
hands.
Other possibilities for further research include comparing several classifiers or, going
further, other data mining techniques. There are studies that have made these comparisons
in generic situations [39], but introducing the real-time human-robot
interaction component might lead to interesting findings.
Finally, since the main purpose of the robot is the study of HRI, user studies
should be carried out to understand how to improve this learning process from the
user’s perspective. Understanding what the user thinks about the process might lead
to better training scenarios that would result in robots that learn better from their users.
Appendix A
Class Diagrams of the Main Nodes
[Diagram classes: PoseTrainer (void main; holds a PoseSet, a PoseModel and a ROS NodeHandle); Pose (pose, userID, stamp, Joints), composed of 15 Joint objects (jointName, confidence, posX, posY, posZ, orientX, orientY, orientZ, orientW); PoseSet (FillAttributes, parseSkeletonMessage, labelMessage, addMessage, setClass, saveToFile), wrapping a Weka DatasetPoses / weka.core.Instances; PoseModel (loadFromFile, saveToFile, classifyInstance, buildClassifier), wrapping a Weka FilteredClassifier with J48 / weka.classifiers.Classifier.]
Figure A.1: Class Diagram of the Pose Trainer Node - The main class is a ROS node (it has a node handle) that connects with Weka to build a dataset and a model from the inputs it receives from the topics to which it is subscribed (defined in the node handle).
[The diagram mirrors Fig. A.1, with PoseClassifier as the main class and an additional LabeledPose message.]
Figure A.2: Class Diagram of the Pose Classifier Node - Notice that it is almost identical to the class diagram of the Pose Trainer Node. The main difference is that this node loads the model built by the Pose Trainer Node and uses that model to classify the skeleton messages it receives.
Appendix B
Detailed Results of the Pilot Experiment
This chapter shows the full results obtained in the pilot experiment described in chapter
5. In the experiment, four models were tested against a dataset recorded by twelve
people. The four models were trained according to the description in section 5.1.
The results were partially shown in section 5.2 and discussed in section 5.3.
The results are presented in the following tables. Table B.1 presents the detailed
results of the models M1 and M2, while Table B.2 presents their confusion matrix.
The detailed results of model M3 are presented in Table B.3, while its confusion matrix is
detailed in Table B.4. Finally, the results of model M4 are shown in Table
B.5, and its confusion matrix can be analysed in Table B.6.
| Class                | TP Rate | FP Rate | Precision | Recall | F-Measure | ROC Area |
|----------------------|---------|---------|-----------|--------|-----------|----------|
| STAND_TURNED_LEFT    | 0.787   | 0.000   | 1.000     | 0.787  | 0.881     | 0.999    |
| STAND_TURNED_RIGHT   | 0.996   | 0.111   | 0.826     | 0.996  | 0.903     | 0.997    |
| STAND_TURNED_FORWARD | 0.982   | 0.002   | 0.996     | 0.982  | 0.989     | 0.990    |
| Weighted Avg.        | 0.926   | 0.039   | 0.939     | 0.926  | 0.926     | 0.995    |

Table B.1: Detailed Results of the Models M1 and M2 - These models performed with 92% of correctly classified instances (TP Rate) and barely 4% of false positives (FP Rate).
| classified as −→         | a    | b    | c    |
|--------------------------|------|------|------|
| STAND_TURNED_LEFT = a    | 1171 | 317  | 0    |
| STAND_TURNED_RIGHT = b   | 0    | 1648 | 6    |
| STAND_TURNED_FORWARD = c | 0    | 30   | 1619 |

Table B.2: Confusion matrix for the models M1 and M2 - Almost all the errors occurred between the left and the right orientations.
| Class                | TP Rate | FP Rate | Precision | Recall | F-Measure | ROC Area |
|----------------------|---------|---------|-----------|--------|-----------|----------|
| STAND_TURNED_LEFT    | 0.673   | 0.315   | 0.490     | 0.673  | 0.567     | 0.679    |
| STAND_TURNED_RIGHT   | 0.209   | 0.001   | 0.989     | 0.209  | 0.345     | 0.604    |
| STAND_TURNED_FORWARD | 0.819   | 0.334   | 0.563     | 0.819  | 0.667     | 0.743    |
| Weighted Avg.        | 0.563   | 0.213   | 0.687     | 0.563  | 0.525     | 0.675    |

Table B.3: Detailed Results of the Model M3 - It showed poor performance, with only 56% of correctly classified instances (TP Rate) and more than 20% of false positives (FP Rate). Most of the errors were due to the low performance when classifying the TURNED_RIGHT pose.
| classified as −→         | a    | b   | c    |
|--------------------------|------|-----|------|
| STAND_TURNED_LEFT = a    | 1001 | 4   | 483  |
| STAND_TURNED_RIGHT = b   | 743  | 346 | 565  |
| STAND_TURNED_FORWARD = c | 299  | 0   | 1350 |

Table B.4: Confusion matrix for the model M3 - The TURNED_RIGHT pose produced a great percentage of the errors.
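As a cross-check, the per-class figures of Table B.3 can be recomputed from this confusion matrix: recall (the TP rate) is a row ratio, while precision is a column ratio.

```python
# Recompute recall and precision of model M3 from the confusion matrix in
# Table B.4 (rows: true class, columns: predicted class a, b, c).

matrix = {
    "LEFT":    [1001,   4,  483],
    "RIGHT":   [ 743, 346,  565],
    "FORWARD": [ 299,   0, 1350],
}
order = ["LEFT", "RIGHT", "FORWARD"]

def recall(cls):
    # Fraction of the true class recovered: diagonal cell over row total.
    row = matrix[cls]
    return row[order.index(cls)] / sum(row)

def precision(cls):
    # Fraction of predictions for the class that are right: diagonal over column total.
    col = order.index(cls)
    return matrix[cls][col] / sum(matrix[c][col] for c in order)

print(round(recall("RIGHT"), 3), round(precision("RIGHT"), 3))  # 0.209 0.989
```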
| Class                | TP Rate | FP Rate | Precision | Recall | F-Measure | ROC Area |
|----------------------|---------|---------|-----------|--------|-----------|----------|
| STAND_TURNED_LEFT    | 0.929   | 0.317   | 0.569     | 0.929  | 0.706     | 0.806    |
| STAND_TURNED_RIGHT   | 0.209   | 0.001   | 0.989     | 0.209  | 0.345     | 0.604    |
| STAND_TURNED_FORWARD | 0.984   | 0.123   | 0.807     | 0.984  | 0.887     | 0.931    |
| Weighted Avg.        | 0.700   | 0.141   | 0.796     | 0.700  | 0.644     | 0.779    |

Table B.5: Detailed Results of the Model M4 - It showed a slightly better performance than model M3, with 70% of correctly classified instances (TP Rate) but with 14% of false positives (FP Rate). As happened with model M3, performance was poor when classifying the TURNED_RIGHT pose.
| classified as −→         | a    | b   | c    |
|--------------------------|------|-----|------|
| STAND_TURNED_LEFT = a    | 1383 | 4   | 101  |
| STAND_TURNED_RIGHT = b   | 1022 | 346 | 286  |
| STAND_TURNED_FORWARD = c | 26   | 0   | 1623 |

Table B.6: Confusion matrix for the model M4 - Like in model M3, the TURNED_RIGHT pose produced a great percentage of the errors.
Appendix C
The ASR Skill Grammar for Recognising Pose Labels
This chapter describes the grammar used to detect the pose labels that the user tells the
robot when announcing her pose. The grammar has the format used by the ASR engine of
the ASR Skill, which is similar to [42] with slight modifications in its tags.
Since the ASR Skill currently only supports Spanish words, the grammar is written
in Spanish. However, the semantic values of the words that can be detected by the
grammar are written in English.
The grammar allows the understanding of 2 control commands and 1 pose command.
The control commands are defined by their semantics. The first is the STOP command,
used to finish the training phase. The second is the CHANGE command,
used to mark the transitions between poses.
The pose semantics are defined in the $pose field. This field understands 3 categories
of semantics: $position, $action and $direction. $position defines whether the user is
sitting (SIT) or standing (STAND). The second semantic, $action, defines whether the user
is turned (TURNED), looking (LOOKING) or pointing (POINTING) towards a certain
direction. Finally, the third semantic, $direction, defines the direction of
the user’s $action. These directions can be left (LEFT), right (RIGHT) or forward
(FORWARD).
The Spanish words located before the semantic labels are the words the
user has to say in order to trigger the related semantic value. For instance, in the
$position semantics, the words "sentado" or "en una silla" trigger the semantic
value SIT, while the words "de pie" or "levantado" trigger the semantic value STAND.
```abnf
#ABNF 1.0 ISO-8859-1;
language es-ES;
tag-format <loq-semantics/1.0>;
public $root = $pose_trainer;

$pose_trainer = [$GARBAGE] $stop
              | [$GARBAGE] $change [$GARBAGE]
              | [$GARBAGE] $pose;

$stop = ("para" : STOP
        | "para de etiquetar" : STOP
        | "stop" : STOP
        | "ya esta bien" : STOP
        | "dejalo ya" : STOP
        | "cansado" : STOP)
        {<@STOP_COMMAND $value>};

$change = ("pausa" : CHANGE
          | "cambio" : CHANGE
          | "cambio de" : CHANGE
          | "cambiar de" : CHANGE)
          {<@CHANGE_COMMAND $value>};

$pose = [$position] $action [$GARBAGE] $direction;

$position = ("sentado" : SIT
            | "en una silla" : SIT
            | "de pie" : STAND
            | "levantado" : STAND)
            {<@POSITION $value>};

$action = ("girado" : TURNED
          | "mirando" : LOOKING
          | "apuntando" : POINTING)
          {<@ACTION $value>};

$direction = ("derecha" : RIGHT
             | "izquierda" : LEFT
             | "delante" : FORWARD)
             {<@DIRECTION $value>};
```
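Once the ASR returns the three semantic values, they can be joined into the label IDs used by the trainer. The helper below is an illustrative sketch, not part of the ASR Skill:

```python
# Sketch of combining the grammar's semantic values (POSITION, ACTION,
# DIRECTION) into a pose label ID. The helper name is hypothetical.

def semantics_to_label(position, action, direction):
    """E.g. STAND + TURNED + LEFT -> STAND_TURNED_LEFT."""
    return "{0}_{1}_{2}".format(position, action, direction)

# Saying "de pie ... girado ... izquierda" would yield:
print(semantics_to_label("STAND", "TURNED", "LEFT"))  # STAND_TURNED_LEFT
```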
References
[1] SUSHMITA MITRA AND TINKU ACHARYA. Gesture Recognition: A Survey. IEEE
Transactions on Systems, Man and Cybernetics, Part C (Applications and Re-
views), 37(3):311–324, May 2007. 2, 15
[2] A JAUME-I CAPÓ AND JAVIER VARONA. Representation of human postures
for vision-based gesture recognition in real-time. Gesture-Based Human-
Computer, pages 102–107, 2009. 2, 15
[3] SERGI FOIX, G. ALENYA, AND C. TORRAS. Lock-in Time-of-Flight (ToF) Cam-
eras: A Survey. IEEE Sensors Journal, 11(99):1, 2011. 2, 13
[4] DANIEL SCHARSTEIN AND RICHARD SZELISKI. High-Accuracy Stereo Depth
Maps Using Structured Light. Computer Vision and Pattern Recognition, IEEE
Computer Society Conference on, 1:195, 2003. 2, 14
[5] B. FREEDMAN, A. SHPUNT, M. MACHLINE, AND Y. ARIELI. Depth map-
ping using projected patterns, October 2008. 2, 25
[6] VARIOUS AUTHORS. Kinect Entry at the Wikipedia, June 2011. 2
[7] MA SALICHS, R. BARBER, AM KHAMIS, M. MALFAZ, JF GOROSTIZA,
R. PACHECO, R. RIVAS, ANA CORRALES, E. DELGADO, AND D. GARCIA. Mag-
gie: A robotic platform for human-robot social interaction. In 2006 IEEE
Conference on Robotics, Automation and Mechatronics, pages 1–7, 2006. 2, 17
[8] TERRENCE FONG, ILLAH NOURBAKHSH, AND KERSTIN DAUTENHAHN. A survey
of socially interactive robots. Robotics and autonomous systems, 42(3-4):143–
166, 2003. 7
[9] MICHAEL A. GOODRICH AND ALAN C. SCHULTZ. Human-Robot Interaction: A
Survey. Foundations and Trends® in Human-Computer Interaction, 1(3):203–
275, 2007. 7
[10] BRENNA D. ARGALL, SONIA CHERNOVA, MANUELA VELOSO, AND BRETT
BROWNING. A survey of robot learning from demonstration. Robotics and
Autonomous Systems, 57(5):469–483, 2009. 8, 9, 10
[11] SRIDHAR MAHADEVAN, GEORGIOS THEOCHAROUS, AND NIKFAR KHALEELI.
Rapid Concept Learning for Mobile Robots. Autonomous Robots, 5(3):239–
251, 1998. 10
[12] TIM VAN KASTEREN, ATHANASIOS NOULAS, GWENN ENGLEBIENNE, AND BEN
KRÖSE. Accurate activity recognition in a home setting. In Proceedings of
the 10th international conference on Ubiquitous computing, UbiComp ’08, pages
1–9, New York, NY, USA, 2008. ACM. 11
[13] STEPHANIE ROSENTHAL, J. BISWAS, AND M. VELOSO. An effective personal
mobile robot agent through symbiotic human-robot interaction. In Proceed-
ings of the 9th International Conference on Autonomous Agents and Multiagent
Systems: volume 1-Volume 1, pages 915–922. International Foundation for Au-
tonomous Agents and Multiagent Systems, 2010. 11
[14] BURR SETTLES. Active Learning Literature Survey. Computer Sciences Tech-
nical Report 1648, University of Wisconsin–Madison, 2010. 11
[15] MAYA CAKMAK, CRYSTAL CHAO, AND ANDREA L THOMAZ. Designing Interac-
tions for Robot Active Learners. IEEE Transactions on Autonomous Mental
Development, 2(2):108–118, June 2010. 11, 12
[16] PRIMESENSE LTD. PrimeSense’s PrimeSensor Reference Design 1.08, June
2011. 13, 26
[17] A KOLB, E BARTH, AND R KOCH. ToF-sensors: New dimensions for realism
and interactivity. In Computer Vision and Pattern Recognition Workshops, 2008.
CVPRW ’08. IEEE Computer Society Conference on, pages 1–6, 2008. 14
[18] ANDREAS KOLB, ERHARDT BARTH, REINHARD KOCH, AND RASMUS LARSEN.
Time-of-flight sensors in computer graphics. In Eurographics State of the Art
Reports, pages 119–134, 2009. 15
[19] S B GOKTURK AND C TOMASI. 3D head tracking based on recognition and
interpolation using a time-of-flight depth sensor. In Computer Vision and
Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer
Society Conference on, volume 2, pages II-211–II-217, 2004. 15, 16
[20] XIA LIU AND K FUJIMURA. Hand gesture recognition using depth data. In
Automatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE Inter-
national Conference on, pages 529–534, May 2004. 15
[21] PIA BREUER, CHRISTIAN ECKES, AND STEFAN MÜLLER. Hand Gesture Recog-
nition with a Novel IR Time-of-Flight Range Camera - A Pilot Study. In AN-
DRÉ GAGALOWICZ AND WILFRIED PHILIPS, editors, Computer Vision/Computer
Graphics Collaboration Techniques, volume 4418 of Lecture Notes in Computer Science,
pages 247–260. Springer Berlin / Heidelberg, 2007. 15
[22] HERVÉ LAHAMY AND DEREK LICHTI. Real-time hand gesture recognition
using range cameras. In The International Archives of the Photogrammetry,
Remote Sensing and Spatial Information Sciences [on CD-ROM], page 38, 2010.
15
[23] NADIA HAUBNER, ULRICH SCHWANECKE, R. DÖRNER, SIMON LEHMANN, AND
J. LUDERSCHMIDT. Recognition of Dynamic Hand Gestures with Time-of-
Flight Cameras. In ITG / GI Workshop on Self-Integrating Systems for Better
Living Environments 2010: SENSYBLE 2010, pages 1–7, 2010. 15
[24] K NICKEL AND R STIEFELHAGEN. Visual recognition of pointing gestures
for human-robot interaction. Image and Vision Computing, 25(12):1875–1884,
December 2007. 16
[25] DAVID DROESCHEL, JÖRG STÜCKLER, AND SVEN BEHNKE. Learning to inter-
pret pointing gestures with a time-of-flight camera. In Proceedings of the 6th
international conference on Human-robot interaction, HRI ’11, pages 481–488,
New York, NY, USA, 2011. ACM. 16
[26] RONAN BOULIC, JAVIER VARONA, LUIS UNZUETA, MANUEL PEINADO, ANGEL
SUESCUN, AND FRANCISCO PERALES. Evaluation of on-line analytic and nu-
meric inverse kinematics approaches driven by partial vision input. Virtual
Reality, 10(1):48–61, April 2006. 16
[27] YOUDING ZHU, BEHZAD DARIUSH, AND KIKUO FUJIMURA. Kinematic self retar-
geting: A framework for human pose estimation. Computer Vision and Image
Understanding, 114(12):1362–1375, December 2010. 16
[28] ARNAUD RAMEY, VÍCTOR GONZÁLEZ-PACHECO, AND MIGUEL A SALICHS. Inte-
gration of a low-cost RGB-D sensor in a social robot for gesture recognition.
In Proceedings of the 6th international conference on Human-robot interaction -
HRI ’11, page 229, New York, New York, USA, 2011. ACM Press. 16
[29] OPENNI MEMBERS. OpenNI web page, June 2011. 16, 26
[30] ANA CORRALES, R. RIVAS, AND MA SALICHS. Sistema de identificación de
objetos mediante RFID para un robot personal. In XXVIII Jornadas de Au-
tomática, pages 50–54, Huelva, 2007. Comité Español de Automática. 18
[31] R. BARBER. Desarrollo de una arquitectura para robots móviles autónomos. Apli-
cación a un sistema de navegación topológica. PhD thesis, Universidad Carlos
III de Madrid, 2000. 20
[32] R. RIVAS, ANA CORRALES, R. BARBER, AND M.A. SALICHS. Robot skill ab-
straction for AD architecture. In 6th IFAC Symposium on Intelligent Autonomous
Vehicles, 2007. 20
[33] ERICH GAMMA, RICHARD HELM, RALPH JOHNSON, AND JOHN VLISSIDES. De-
sign Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley
Professional, 1995. 22
[34] M. QUIGLEY, B. GERKEY, K. CONLEY, J. FAUST, T. FOOTE, J. LEIBS, E. BERGER,
R. WHEELER, AND A. NG. ROS: an open-source Robot Operating System. In
Open-Source Software Workshop of the International Conference on Robotics
and Automation (ICRA), 2009. 23
[35] PRIMESENSE LTD. PrimeSense’s Frequently Asked Questions (FAQ) web-
site, June 2011. 25
[36] Z. ZALEVSKY, A. SHPUNT, A. MAIZELS, AND J. GARCIA. Method and System
for Object Reconstruction, April 2007. 25
[37] MARK HALL, EIBE FRANK, GEOFFREY HOLMES, BERNHARD PFAHRINGER, PE-
TER REUTEMANN, AND I.H. WITTEN. The WEKA data mining software: an
update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009. 28
[38] J R QUINLAN. C4.5: programs for machine learning. Morgan Kaufmann series in
machine learning. Morgan Kaufmann Publishers, 1993. 28
[39] SB KOTSIANTIS. Supervised machine learning: A review of classification
techniques. Informatica, 31(3):249–268, 2007. 28, 53
[40] F. ALONSO-MARTIN AND MIGUEL SALICHS. Integration of a Voice Recogni-
tion System in a Social Robot. Cybernetics and Systems, 42(4):215–245, May
2011. 35
[41] F ALONSO-MARTIN, ARNAUD A RAMEY, AND MIGUEL A SALICHS. Maggie: el
robot traductor. In UPM, editor, 9 Workshop RoboCity2030-II, number 9, pages
57–73, Madrid, 2011. Robocity 2030. 42
[42] D. CROCKER AND P. OVERELL. Augmented BNF for Syntax Specifications:
ABNF. RFC 2234, Internet Engineering Task Force, November 1997. 61
This document is published under a Creative Commons Attribution-ShareAlike (CC BY-SA) license.
You are free to:
Share - to copy, distribute and transmit this document.
Remix - to adapt the document.
To make commercial use of the document.
Under the following conditions:
Attribution - You must attribute the document in the manner specified by the author
or licensor (but not in any way that suggests that they endorse you or your use
of the work).
Share Alike - If you alter, transform, or build upon this document, you may distribute
the resulting work only under the same or similar license to this one.