
A Supervised Learning Architecture for Human Pose Recognition in a Social Robot

UNIVERSITY CARLOS III OF MADRID

COMPUTER SCIENCE DEPARTMENT

Victor Gonzalez-Pacheco

A thesis submitted for the degree of

Master in Computer Science and Technology - Artificial Intelligence -

July 2011


Director:

Fernando Fernandez Rebollo

Computer Science Department

University Carlos III of Madrid

Co-Director:

Miguel A. Salichs

Systems Engineering and Automation Department

University Carlos III of Madrid


Abstract

A main activity of social robots is to interact with people. To do that, the robot must be able to understand what the user is saying or doing. This document presents a supervised learning architecture to enable a social robot to recognise human poses. The architecture is trained using data obtained from a depth camera that allows the creation of a kinematic model of the user. The user labels each set of poses by telling it directly to the robot, which identifies these labels with an Automatic Speech Recognition (ASR) system. The architecture is evaluated with two different datasets in which the quality of the training examples varies. In both datasets, a user trains the classifier to recognise three different poses. The learned classifiers are evaluated with twelve different users, demonstrating high accuracy and robustness when representative examples are provided in the training phase. Using this architecture in a social robot might improve the quality of the human-robot interaction, since the robot is able to detect non-verbal cues from the user, making it more aware of the interaction context.

Keywords: Pose Recognition, Machine Learning, Robotics, Human-Robot Interaction, HRI.


Resumen

One of the main activities of social robots is to interact with people. This requires the robot to be able to understand what the user is saying or doing. This document presents a supervised learning architecture that allows a social robot to recognise the poses of the people it interacts with. The architecture is trained using the images coming from a depth camera, which allows the creation of a kinematic model of the user that is used for the training examples. The user labels the poses shown to the robot with his or her own voice. To detect the labels spoken by the user, the robot uses a speech recognition system integrated into the architecture. The architecture is evaluated with two different datasets in which the quality of the training examples varies. In both datasets, a user trains the classifier to recognise three different poses. The classifiers built from these datasets are evaluated in a test with twelve different users. The evaluation shows that this architecture achieves high accuracy and robustness when representative examples are provided in the training phase. Using this architecture in a social robot can improve the quality of human-robot interactions, since it enables the robot to detect non-verbal information from the user. This makes the robot more aware of the interaction context in which it finds itself.

Keywords: Pose Recognition, Machine Learning, Robotics, Human-Robot Interaction, HRI.


To Raquel, eternal companion in this, and all the journeys.


Acknowledgements

This large project could not have been finished without the help of many people, and for that reason I want to thank them for the precious time they dedicated to me. First, I want to express how thankful I am to my two advisors, Fernando Fernández and Miguel Ángel Salichs. You have been a lighthouse that guided me on this journey, and you have helped me whenever I have needed you. I want to thank Fernando A., especially for helping me to overcome all the difficulties I encountered when I was integrating the voice system of the robot, even more so when you were not physically here. It is really easy and great to work with colleagues like you. I am grateful to the rest of the "Social Robots" team as well, especially to Arnaud, Alberto and David. It is incredible how you always respond with invaluable help and great advice. There are other people who have suffered the collateral damage of this project: thank you Martin, Javier, Miguel, Juan, Alberto, and Silvia. It is wonderful to have such colleagues and friends, always there to exchange ideas and concerns, help with problems and have great coffee breaks with me. Finally, I want to give special thanks to Raquel, my wife, for all the support she has given me. Without your patience, encouragement, and drive I would not have finished this project. Thank you.


Contents

List of Figures

List of Tables

1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Problems to be solved
1.4 Structure of the Document

2 Related Work
2.1 Machine Learning in Human-Robot Interactions
2.2 Depth Cameras
2.3 Gesture Recognition with Depth Cameras

3 Description of the Hardware and Software Platform
3.1 Hardware Description of the Robot Maggie
3.2 The AD Software Architecture
3.3 Robot Operating System
3.4 The Kinect Vision System
3.5 The Weka Framework

4 The Supervised Learning Based Pose Recognition Architecture
4.1 Training Phase
4.1.1 Pose Labeler Skill
4.1.2 Pose Labeler Bridge
4.1.3 Pose Trainer
4.2 Classifying Phase
4.2.1 Pose Classifier
4.2.2 Pose Teller

5 Pilot Experiments
5.1 Scenario
5.2 Results
5.3 Discussion

6 Conclusions

A Class Diagrams of the Main Nodes

B Detailed Results of the Pilot Experiment

C The ASR Skill Grammar for recognising Pose Labels

References

License


List of Figures

2.1 Categories of Example Gathering
2.2 Operation Diagram of a Depth Camera
2.3 Comparison of the three main depth camera technologies
3.1 The sensors of the Robot Maggie
3.2 Actuators and interaction mechanisms of the Robot Maggie
3.3 The AD architecture
3.4 Block Diagram of the PrimeSense Reference Design
3.5 OpenNI's kinematic model of the human body
4.1 Overview of the Built System
4.2 Diagram of the built architecture
4.3 Sequence Diagram of the training phase
4.4 Use Case of the Pose Labeler Skill
4.5 Use Case of the Pose Labeler Bridge
4.6 Use Case of the Pose Trainer Node
4.7 Sequence Diagram of the classifying phase
4.8 Use Case of the Pose Classifier Node
4.9 Use Case of the Pose Teller Node
5.1 Scenario of the experiment
5.2 The tree built for the models M1 and M2
5.3 The tree built for the model M3
5.4 The tree built for the model M4
A.1 Class Diagram of the Pose Trainer Node
A.2 Class Diagram of the Pose Classifier Node


List of Tables

5.1 Results of the Pilot Experiment
B.1 Detailed Results of the Models M1 and M2
B.2 Confusion matrix for the models M1 and M2
B.3 Detailed Results of the Model M3
B.4 Confusion matrix for the model M3
B.5 Detailed Results of the Model M4
B.6 Confusion matrix for the model M4


Chapter 1

Introduction

1.1 Motivation

Human-Robot Interaction (HRI) is the field of research that studies how humans and robots should interact and collaborate. Humans expect robots to understand them as other people do. In this respect, a robot must understand natural language and should be capable of establishing complex dialogues with its human partners.

But dialogues are not only a matter of words. Most of the information exchanged in a conversation does not come from the phrases of the people engaged in the talk but from their non-verbal messages. Gestures encompass a great part of the non-verbal information, but there are also other factors that provide information to others. An example of this is the postural information that a person shows to his or her listeners.

For example, imagine a room where several people are sitting in chairs and talking about something. If, suddenly, a person stands up, it will draw the attention of the rest of the people in the room. What is this man announcing when he suddenly stands up? Is he about to leave the room? Or perhaps he wants to say something relevant? His intentions will not be disclosed until he says or does something, but the important thing is that the dynamics of the conversation have suddenly changed or, at least, have been affected by a change in the context of the room.

Now imagine there is a robot in that room. Would the robot notice this change in the context? Probably not, if we consider the state of current technology. Gesture and pose recognition systems have been an active research field in recent years [1], but traditional image capture systems require the use of complex statistical models for the recognition of the body, making them difficult to apply in practical applications [2].

But recent technology developments are enabling new types of vision sensors which are more suitable for interactive scenarios [3]. These devices are the depth cameras [3], [4], [5], which make the extraction of the body easier than it was with traditional cameras. And, because the extraction of the body is easier than before, it is now possible to redirect large amounts of computing power to algorithms that actually process the gestures or the pose of the user, rather than to detecting and tracking his body.

Especially relevant is the case of Microsoft's Kinect sensor [6], a low-cost depth camera that offers a precision and performance similar to high-end depth cameras, but at a cost several times lower. Along with the Kinect, several drivers and frameworks to control it have appeared. These drivers and frameworks provide direct access to a skeleton model of the user standing in front of the sensor at a relatively low CPU cost. This model is precise enough to track the pose of the user and to recognise the gestures he is doing in real time.

1.2 Objectives

The main objective of this master thesis is to build a software architecture capable of learning human poses by taking advantage of the new capabilities offered by a Kinect sensor mounted on the robot Maggie [7]. With such an architecture, it is expected that the robot will be able to understand some non-verbal information from the users who interact with it, as well as contextual information about the situation.

Since the robot Maggie is a platform to study Human-Robot Interaction (HRI), one of the requirements of the system is that the user should be able to teach the robot the poses it does not know. This teaching process must be carried out through natural interaction, i.e. the user and the robot must communicate by voice.

The learning architecture will obtain the user's body information from the robot's vision system. The vision system will rely on a Kinect sensor, which has recently been installed in the robot but is not yet fully integrated in its software architecture. For that reason, one of the objectives of the project is to integrate the software that manages the Kinect sensor with the robot architecture.

Additionally, no learning architecture is running on the robot. Therefore, this architecture has to be built and integrated in the robot.

The integration of all these components is not a trivial task due to the requirements of the robot's software architecture, which must be capable of working in a distributed manner. To solve this issue, the mechanisms that will glue all the developed components together will be the communication systems of the robot's software architecture and those of ROS (Robot Operating System). ROS is an open-source software framework that aims to standardise robotics software. The vision components we intend to use are already integrated in ROS, and its multi-language support makes it possible to develop new ROS-compatible software. To avoid redoing work that has already been done in ROS, the vision components will be accessed through ROS. But ROS is not integrated into the robot, and there are no mechanisms to communicate ROS with the software architecture of the robot. Hence, a last objective of the project is to integrate ROS in the software architecture of the robot.

These objectives can be summarised in the following list:

• To build a machine learning framework that learns from multi-modal examples. In essence, the system should learn by fusing verbal information with the information captured by the vision system.

• The human should be able to teach the robot by interacting naturally, in the same manner as he or she would with another human.

• To develop a pose recognition learning architecture that takes advantage of the image acquisition techniques provided by the Kinect sensor and the algorithms of its drivers.

• To integrate ROS in the robot.

• To validate that the system works.

• To integrate and test the whole architecture in the robot Maggie.


1.3 Problems to be solved

In order to accomplish the objectives enumerated above, several technical problems have to be addressed. Among all of them, three are especially relevant.

The first one is to integrate the vision algorithms provided by the Kinect sensor in the architecture of the robot. Until now this task has remained incomplete, and it has to be done to enable the learning framework to detect the pose of the users. Additionally, the architecture of the robot lacks a generic machine learning component. Therefore, if a learning component has to be used, it must be integrated in the robot software as well. Finally, the voice recognition system of the robot must be able to collaborate with the learning architecture in order to feed it with the training examples provided by the user.

As shown in the objectives section, one of the objectives of the project is to use some components of the architecture through ROS, which has to be integrated in the robot software architecture. This is a technical problem because the communication mechanisms of ROS differ from those of the robot. Thus, some components that enable the communication between ROS and the robot need to be built.

These technical difficulties are summarised in the following list:

• Integrate the vision acquisition system with the robot's software architecture.

• Integrate the machine learning framework in the architecture of the robot.

• Combine the user's inputs with the learning architecture.

• Integrate ROS with the robot's software architecture.

1.4 Structure of the Document

This document is organised as follows. Chapter 2 presents an overview of the related work in pose and gesture classification with depth cameras. Chapter 3 introduces an overview of the systems that act as the building blocks of the developed architecture; the chapter describes hardware components such as the robot Maggie and the Kinect sensor, as well as the software modules that act as the scaffold of the project. Chapter 4 presents the developed architecture and describes its components, separating them according to the two phases of the learning process: the training phase and the classifying phase. After that, chapter 5 presents some pilot experiments that have been carried out to validate the correct functioning of the architecture. Finally, chapter 6 closes the document by presenting the conclusions and the future work that remains to be done.


Chapter 2

Related Work

This chapter introduces the current state of the art in the topics related to the project. The chapter starts with an overview of the machine learning techniques used in the field of Human-Robot Interaction (HRI) with the purpose of improving the interactions between robots and people. Then, section 2.2 presents an overview of the technology used by depth cameras, which enables them to retrieve depth information from the scene. Finally, section 2.3 gives an insight into the applications and the work that other research groups have carried out with depth cameras to detect human poses and gestures.

2.1 Machine Learning in Human-Robot Interactions

Fong et al. made an excellent survey [8] of the interactions between humans and social robots. In the survey, Fong mentions that the main purpose of learning in social robots is to improve the interaction experience. At the date of the survey (2003), most of the learning applications were used in robot-robot interaction. Some works addressed the issue of learning in human-robot interaction, mostly centred on imitating human behaviours such as motor primitives. According to the authors, learning in social robots is used for transferring skills, tasks and information to the robot. However, the authors do not mention the use of learning for transferring concepts to the robot.

A few years later, Goodrich and Schultz published a complete survey that covered several HRI fields [9]. The authors highlighted the need for robots with learning capabilities, because scripting every possible interaction with humans is not an affordable task due to the complexity and unpredictable behaviour of human beings. They pointed out the need for a continuous learning process where the human can teach the robot in an ad hoc and incremental manner to improve the robot's perceptual ability, autonomy and interaction capabilities. They called this process interactive learning, and it is carried out through natural interaction. Again, the survey only reports works that refer to learning as an instrument to improve abilities, behaviour, perception and multi-robot interaction. No explicit mention was made of using learning to provide the robot with new concepts.

Figure 2.1: Categories of Example Gathering - These four squares categorise the manners of gathering examples in the Learning from Demonstration field applied to robotics: teleoperation, shadowing, sensors on the teacher and external observation. The rows represent whether the recording of the example captured all the sensory input of the teacher (upper row) or not (lower row). Columns represent whether the recorded dataset applies directly to actions or states of the robot (left column) or whether some mapping is needed (right column). (Retrieved from [10])

There are many fields where machine learning is applied to robotics. Among all the machine learning techniques, supervised learning is one of the most widespread, since the robot can explore the world under the supervision of a teacher, thus reducing the danger to the robot or the environment [10].

This section presents some concepts regarding supervised learning, especially Learning from Demonstration (LfD). In this area, [10] presents an excellent survey that organises several LfD approaches into different categories depending on how they collect the learning examples and on how they learn a policy from these examples. The latter is not relevant to this project, thus only the former is summarised below.

Argall et al. [10] use the term correspondence to refer to the way learning examples are recorded and transferred to the robot. They divide correspondence into two main categories depending on different aspects of how the examples are recorded and transferred to the robot. These categories are Record Mapping and Embodiment Mapping. The former refers to whether the experience of the teacher during the demonstration is exactly captured or not. The latter refers to whether the examples recorded in the dataset are exactly those that the learner would observe or execute. Argall et al. present these two ways of categorisation in the form of a 2x2 matrix (see Fig. 2.1).

Going deeper into the categorisation, the authors in [10] divide the Embodiment Mapping into two subcategories: demonstration and imitation. In the demonstration category, the demonstration is performed on the actual robot or on a physically identical platform and, thus, there is no need for an embodiment mapping. On the contrary, in the imitation category, the demonstration is performed on a platform different from the robot learner. Therefore, an embodiment mapping between the demonstration platform and the learning platform is needed.

As depicted in Fig. 2.1, both the demonstration and imitation categories are divided into two sub-categories depending on the Record Mapping. In essence, demonstration is divided into teleoperation and shadowing, while imitation is divided into sensors on the teacher and external observation. The four categories are described below.

• Demonstration. The embodiment mapping is direct. The robot uses its own sensors to record the example while its body executes the behaviour.

– Teleoperation: a technique where the teacher directly operates the learner robot during the execution. The robot learner uses its own sensors to capture the demonstration. In this case there is a direct mapping between the recorded example and the observed example.

– Shadowing: in this technique, the robot learner uses its own sensors to record the example while, at the same time, it tries to mimic the teacher's motions. In this case there is no direct record mapping.

• Imitation. The embodiment mapping is not direct. The robot needs to retrieve the example data from the actions of the teacher.

– Sensors on the teacher: in this technique, several sensors are located on the executing body to record the teacher's execution. Therefore, this is a direct record mapping technique.

– External observation: in this case, the recording sensors are not located on the executing body. In some cases the sensors are installed on the learner robot, while in others they are outside it. This means that this technique is not a direct record mapping technique.

As a commentary, teleoperation provides the most direct method for transferring information within demonstration learning but, as a drawback, it is not suitable for all learning platforms [10]. On the other hand, shadowing techniques demand more processing to enable the learner to mimic the teacher.

The sensors-on-the-teacher technique provides very precise measurements, since the information is extracted directly from the teacher's sensors, but it requires extra overhead in the form of specialised sensors. On the other side, since external observation sensors are external to the teacher, they do not record the data directly from it, forcing the learner robot to infer the data of the execution. This makes external observation less reliable but, since the setup of the sensors produces less overhead compared to the sensors-on-the-teacher technique, it is more widely used [10]. Typically, the external sensors used to record human teacher executions are vision-based.

If we apply the categorisation of [10] to this project, we find that no record mapping is needed, because the learning examples are recorded directly by the Kinect sensor of the robot. Additionally, the embodiment mapping categorisation does not apply, because there is no action or behaviour to learn. Instead, what is learnt is a concept which, for the scope of this project, does not need to be mapped to the robot.

Almost all the presented works focus the learning process on learning tasks or behaviours. Few of them use learning to teach concepts to the robot. This is the case of [11], where the authors train a mobile robotic platform to understand concepts related to the environment in which it has to navigate. The authors use a Feed-Forward Neural Network (NN) to train the robot to understand concepts like doors or walls. They train the NN by showing it numerous images of a trash can (its destination point), labelling each photo with the distance and the orientation of the can. However, the work presented some limitations: for instance, the learning process lacked enough flexibility to generalise the approach to other areas.

A work in an area not directly related to robotics [12] can give us an insight into the kind of concepts our robot can learn with our system. In that work, van Kasteren et al. mounted a wireless sensor network in a home to capture the activities of the user living in it. The activities were labelled by voice by the person living in the house, using a wireless Bluetooth headset. The labels were processed by a grammar-based speech recognition system similar to Maggie's (see Annex C for an example of one of Maggie's grammars). The authors recorded the activities of the user for a period of 28 days and formed a dataset with them. From the 28 days of recording, 27 were used as the training dataset and the remaining day was used as a test dataset to evaluate the classifiers. The training dataset was used to train a Hidden Markov Model (HMM) and a Conditional Random Field (CRF), building two models of the user's activities.

The performance of both classifiers was evaluated using two types of measures: the time-slice accuracy and the class accuracy. The former represents the percentage of correctly classified time slices, while the latter represents the average percentage of correctly classified time slices per class. Both classifiers demonstrated good results in detecting the user's activities: the CRF showed better time-slice accuracy, while the HMM performed better in class accuracy.
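To make the two measures concrete, the following minimal Python sketch (an illustration only, not code from [12] or from this project) computes both of them on a toy sequence of labelled time slices:

    from collections import defaultdict

    def time_slice_accuracy(y_true, y_pred):
        """Fraction of time slices whose predicted label matches the ground truth."""
        correct = sum(t == p for t, p in zip(y_true, y_pred))
        return correct / len(y_true)

    def class_accuracy(y_true, y_pred):
        """Average, over classes, of the per-class fraction of correct time slices."""
        totals, hits = defaultdict(int), defaultdict(int)
        for t, p in zip(y_true, y_pred):
            totals[t] += 1
            if t == p:
                hits[t] += 1
        return sum(hits[c] / totals[c] for c in totals) / len(totals)

    # Toy example: three activities over ten time slices
    y_true = ["sleep", "sleep", "sleep", "cook", "cook", "cook", "cook", "eat", "eat", "eat"]
    y_pred = ["sleep", "sleep", "cook",  "cook", "cook", "cook", "eat",  "eat", "eat", "eat"]
    print(time_slice_accuracy(y_true, y_pred))  # 0.8
    print(class_accuracy(y_true, y_pred))       # (2/3 + 3/4 + 3/3) / 3 ~ 0.81

The example shows why the two measures can rank classifiers differently: a model that favours frequent activities can score well on time-slice accuracy while doing poorly on rare classes, which the class accuracy penalises.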

Learning the activities of the user might improve the quality of the interactions of a social robot. But, although activity learning has many potential applications in robotics, no works have been found in the fields of social robotics or human-robot interaction.

Most of the presented works assume the robot is a passive learner, but a new paradigm is appearing in which the robot takes the initiative to ask the user for more examples where they are needed [13]. Active learning techniques can produce classifiers with better performance from fewer examples, or reduce the number of examples required to reach a certain performance [14].
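As an illustration of the idea of querying where examples are most needed, a minimal uncertainty-sampling sketch in Python could look as follows. This is a generic example, not the method used in [13], [14] or [15]; predict_proba stands for any hypothetical classifier function that returns class probabilities for an example.

    def most_uncertain(pool, predict_proba):
        """Return the unlabelled example the current classifier is least sure about;
        an active learner would ask the teacher to label this one next."""
        def margin(example):
            probs = sorted(predict_proba(example), reverse=True)
            return probs[0] - probs[1]   # small margin = high uncertainty
        return min(pool, key=margin)

    # Usage sketch: query = most_uncertain(unlabelled_poses, classifier.predict_proba)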

Applied to HRI, some works have demonstrated that active learning not only improves the performance of supervised learning, but also improves the quality of the interaction with the teachers. In [15], the authors studied how a social robot was able to learn different concepts. Twenty-four people participated in the experiment, and the robot was configured to show four different degrees of initiative in the learning process.

The robot ranged from a traditional passive supervised learning approach to three different active learning configurations. These active configurations were a naïve active learner that queried the teacher every turn; a mixed passive-active mode, where the robot waited for certain conditions before asking; and a mode where the robot only queried the teacher when the teacher granted permission to do so.

In their experiment, [15] found that there was no appreciable difference between the active learning modes, but all three outperformed the passive supervised learning mode in both accuracy and the number of examples needed to achieve accurate models. Additionally, the survey of the 24 users showed that they preferred the active learner robot, which they found to be more intelligent, more engaging and easier to teach.

Although active learning seems better suited for HRI purposes than traditional supervised learning, the scope of this project remains within passive supervised learning, since this is a first approach to the field. Nevertheless, an objective of the project is to build an architecture that allows future expansion in case active learning techniques are added to it in the near future.

2.2 Depth Cameras

Depth cameras are systems that can build a 3D depth map of a scene by projecting light onto it. The principle is similar to that of LIDAR scanners, with the difference that the latter are only capable of performing a 2D scan of the scene, while depth cameras scan the whole scene at once. Fig. 2.2 depicts an example of the operation principle of a depth camera.

Figure 2.2: Operation Diagram of a Depth Camera - Depth cameras project IR light onto the scene, which is analysed by a sensor to obtain the depth map of the scene. The depicted sensor is the PrimeSense sensor, which uses the Light Coding technology to retrieve the depth data. (Retrieved from [16])

Traditionally, depth information has been obtained by stereo vision or laser-based systems. Stereo cameras rely on passive triangulation methods to obtain the depth information from the scene. These methods require two cameras separated by a baseline, which determines a limited working depth range. Within these algorithms also appears the so-called correspondence problem: determining which pairs of points in the two images are projections of the same 3D point. In contrast, depth cameras naturally deliver depth and simultaneous intensity data, avoiding the correspondence problem, and do not require a baseline in order to operate [3].

On the other hand, laser-based systems provide very precise sliced 3D measurements, but they have to deal with difficulties in collision avoidance applications due to their 2D field of view. The most widely adopted solution to this problem has been mounting the sensor on a pan-and-tilt unit. This solves the problem, but it also implies row-by-row sampling, which makes this solution inappropriate for real-time, dynamic scenes. In short, although laser-based systems offer higher depth range, accuracy and reliability, they are voluminous and heavy, increase the power consumption, and add additional moving parts when compared to depth cameras. Depth cameras, on the contrary, are compact and portable, they do not require the control of mechanical moving parts, thus reducing power consumption, and they do not need row-by-row sampling, thus reducing image acquisition time [3].

There are three main categories of depth cameras, depending on how they project light onto the scene to capture its depth information:

Time-of-Flight (ToF) Cameras ToF cameras obtain the depth information by emitting near-infrared light which is reflected by the 3D surfaces of the scene back to the sensor (see Fig. 2.3a). Currently, two main approaches are employed in ToF technology [17]. The first one consists of sensors that measure the round-trip time of a light pulse to calculate depth. The second approach measures the phase difference between the emitted and received signals. (A small numeric sketch of both approaches is given after this list.)

Structured Light Cameras Structured Light is based on projecting a narrow band of IR light onto the scene [4]. When the projected band hits a 3D surface, it produces a line of illumination that appears distorted from perspectives other than the projector's. When this distorted light is received by a sensor, it is possible to calculate the shape of the 3D surface, because the initial form of the band is known. This applies only to the section of the 3D surface illuminated by the light band. To extend this principle to the whole scene, many methods emit a pattern of several light bands simultaneously (see Fig. 2.3b).

Projected Light Cameras This is the newest technology. It is based on projecting a pattern of IR light onto the scene and calculating its distortions. Unlike Structured Light, here the pattern is based on light dots (see Fig. 2.3c). It is used by devices such as Microsoft's Kinect. Since this is the technology used for this project, it is further described in section 3.4.

Figure 2.3: Comparison of the three main depth camera technologies - (a) shows the principle of operation of ToF cameras. (b) shows the pattern emitted by a Structured Light camera. (c) shows the pattern emitted by a Light Coding camera. (b) and (c) obtain the depth information by comparing the distortions of the pattern received at the sensor with the original emitted pattern.
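As an illustration of the two ToF principles mentioned above, the following minimal Python sketch (not part of the thesis software; the numeric values are arbitrary examples) converts a pulse round-trip time and a modulation phase shift into depth:

    import math

    SPEED_OF_LIGHT = 299792458.0  # m/s

    def depth_from_pulse(round_trip_time_s):
        """Pulsed ToF: the light travels to the surface and back, so the depth is
        half the distance covered during the measured round-trip time."""
        return SPEED_OF_LIGHT * round_trip_time_s / 2.0

    def depth_from_phase(phase_shift_rad, modulation_freq_hz):
        """Continuous-wave ToF: the phase shift between the emitted and received
        signals encodes the distance (unambiguous only within half a modulation
        wavelength)."""
        return SPEED_OF_LIGHT * phase_shift_rad / (4.0 * math.pi * modulation_freq_hz)

    print(depth_from_pulse(20e-9))              # a 20 ns round trip -> ~3.0 m
    print(depth_from_phase(math.pi / 2, 30e6))  # a 90-degree shift at 30 MHz -> ~1.25 m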


2.3 Gesture Recognition with Depth Cameras

Depth cameras are an attractive opportunity in several fields that require intense analysis of the 3D environment. [18] presents a survey showing different technologies and applications of depth sensors in recent years. It points out a relevant increase in the scientific activity in this field in the two or three years prior to the survey. Among other potential applications, [18] regards gesture recognition as one of the research fields that can benefit most from the appearance of ToF technology. In particular, tracking algorithms that combine the data provided by the fusion of depth sensors and RGB cameras have seen a significant increase in robustness.

Prior to the use of depth cameras, considerable research had been carried out in the field of gesture recognition [1]. But using traditional computer vision systems requires complex statistical models for recognition, which are difficult to use in practical applications [2].

In [19] and [20], the authors suggest that it is easier to infer the geometry and 3D location of body parts using depth images. Because of that, several works focus on the use of depth cameras to detect and track different types of gestures. Some examples of gesture recognition are presented below.

Since depth sensors enable an easy segmentation and extraction of localised parts of the body, several efforts have been dedicated to the field of hand detection and tracking. For instance, in [21] a ToF camera is used to reconstruct and track a 7 Degree of Freedom (DoF) hand model. In [22], the authors use a ToF camera for interaction with computers, focusing on two applications: recognising the number of raised fingers in one hand and moving an object in a virtual environment using only a hand gesture. Other works propose the use of ToF cameras to track hand gestures [20]. They use the depth data of the ToF camera to segment the body from the background. Once they have the body segmented, they detect the head position, since most hand gestures are relative to the rest of the body. In [23], an algorithm for recognising single-stroke hand gestures in a 3D environment with a ToF camera is presented. The authors modified the "$1" gesture recognizer to work with depth data. The modification of the algorithm means that fingertip gestures do not have to be articulated in the same perspective in which the gesture templates were recorded.


In [24], a stereo camera is used for the recognition of pointing gestures in the context of Human-Robot Interaction (HRI). To do so, the authors perform visual tracking of the head, the hands and the head orientation. Using a Hidden Markov Model (HMM) based classifier, they show that the gesture recognition performance improves significantly when the classifier is provided with information about the head orientation as an additional feature. In [25], however, the authors claim that they achieved better results with a ToF camera. To do so, they extract a set of body features from the depth images of a ToF camera and train a model of pointing directions using Gaussian Process Regression. This model presented higher accuracy than other simple criteria, such as the head-hand, shoulder-hand or elbow-hand lines, and than the aforementioned work of [24].

Some works have focused on tracking other parts of the body. For example, [19] uses a time-of-flight camera for a head tracking application. They use a knowledge-based training algorithm to divide the depth data into several initial clusters, and then perform the tracking with an algorithm based on a modified k-means clustering method. Other approaches rely on kinematic models to track human gestures once the body is detected. For instance, in [26] two stereo cameras are used to detect the hands of a person and build an Inverse Kinematics (IK) model of the human body. In [27], the authors present a model-based kinematic self-retargeting framework to estimate the human pose from a small number of key-points of the body. They show that it is possible to recover the human pose from a small set of key-points, provided that an adequate kinematic model and a good formulation of tracking control subject to kinematic constraints are available. A similar approach is used in [28], where the authors use a Kinect RGB-D sensor to extract and track the skeleton model of a human body. This allows them to capture the 3D temporal variations of the velocity vector of the hand and model them in a Finite State Machine (FSM) to classify hand gestures.

Most of these works rely on capturing only one or a few parts of the body. However, combining the ease of segmenting the body provided by the Kinect with recent kinematic approaches like the one in [27] makes it possible to track the whole body without a significant increase in CPU consumption. This is the case of the OpenNI framework [29], which makes it possible to track the skeleton model of the user at a low CPU cost. Because OpenNI is one of the key components of this project, it is addressed in section 3.4.


Chapter 3

Description of the Hardware and Software Platform

Before describing the architecture developed for this project, it is necessary to understand the building blocks on which it relies. This chapter presents and describes the pre-existing modules and systems that have been used as the principal components on which the developments of this project lean.

Each section of the chapter corresponds to one of the main modules used as the base of the project. Section 3.1 describes the robot Maggie and its hardware components. Then, in section 3.2, the software architecture of the robot Maggie is presented. In section 3.3, the Robot Operating System (ROS) architecture is described; this architecture is used by several modules of the system, such as the vision system and the learning system. The vision system is described in section 3.4, in which the Kinect technology and the algorithms that enable it to extract and track skeleton models are presented. Finally, in section 3.5 the learning framework is presented and the algorithm on which the learning process relies is described.

3.1 Hardware Description of the Robot Maggie

Maggie is a robotic platform developed by the RoboticsLab team at the Carlos III University of Madrid [7]. The objective of this development is the exploration of the fields of social robotics and Human-Robot Interaction (HRI).

The main hardware components of the robot can be classified into sensors, actuators and other devices. Maggie's sensing system is composed of a laser sensor, 12 ultrasound sensors, 12 bumpers, several tactile sensors, a video camera, 2 RFID detectors and an external microphone. Fig. 3.1 depicts all the sensors of the robot, while Fig. 3.2 depicts the rest of the robot's hardware. The laser range finder is a Sick LMS 200; it is used to build maps of the environment surrounding the robot and to detect obstacles. The 12 ultrasound sensors surround the robot's base and are used as a complement to the laser. The robot base also has 12 contact bumpers that are used to detect collisions in case the laser sensor and the ultrasound sensors fail. Maggie has 9 capacitive sensors installed at different points of the robot's skin; they are used as touch sensors to allow tactile interaction with the robot. The video camera is mounted in the robot's mouth and is used to detect and track humans and objects near the robot. In addition to the standard camera, the robot has recently been equipped with a Microsoft Kinect RGB-D sensor. This sensor allows the robot to retrieve depth information from a scene as well as standard RGB images. The robot also has 2 RFID (Radio Frequency IDentification) detectors, one located in the robot's base and the other in the robot's nose. These RFID detectors allow the robot to extract data from RFID tags; the use of RFID tags to extract information from tagged objects is described in [30]. The external wireless microphone allows the robot to receive spoken commands or indications from humans.

The robot is actuated by a mobile base with 2 DoF (rotation and translation in the ground plane). The robot also has two arms with one DoF each. Maggie's head can move in two DoF (pitch and yaw). Finally, the eyelids of the robot can move in one DoF each.

The robot has some hardware dedicated to the interaction with humans and other devices: an infrared (IR) emitter, a tablet PC, 3 speakers and an array of LEDs (Light Emitting Diodes). The infrared emitter allows the robot to control IR-based devices like TVs and stereo systems. The tablet PC, located in Maggie's chest, is used to show information related to the robot or as an interaction device. The speakers allow the robot to communicate orally with humans. The LED emitters are located in the robot's mouth and are used to show expressiveness when the robot is talking.


Figure 3.1: The sensors of the Robot Maggie - The figure labels the touch sensors, camera, RFID readers, bumpers, laser, ultrasonic sensors and the Kinect. Notice the Kinect RGB-D sensor attached to its belly.

Figure 3.2: Actuators and interaction mechanisms of the Robot Maggie - The figure labels the eyelid actuators (1 DoF), arm actuators (1 DoF), neck actuator (2 DoF), base actuator (2 DoF), the LEDs (expressivity), the tablet PC (information and interaction), the speakers (voice interaction), the infrared emitter and the main computer (robot control and communication).


The robot is controlled by a computer located inside her body. The computer communicates with the exterior using an IEEE 802.11n connection. The operating system that runs on the computer is Ubuntu 10.10 Linux, and the AD architecture runs on top of the OS.

3.2 The AD Software Architecture

The main software architecture of Maggie is a software implementation of the Automatic-Deliberative (AD) Architecture [31]. AD is designed by imitating human cognitive processes. It is composed of two main levels, the Deliberative Level and the Automatic Level. The Deliberative Level contains the processes that require high-level reasoning and decision capacity; these processes require a large amount of time and resources to be computed. The Automatic Level contains the low-level processes, which are the ones that interact directly with hardware such as sensors and actuators. Usually, these processes are lighter than the deliberative ones and, therefore, need less time to be computed and use fewer resources. Fig. 3.3 shows the general schema of the AD architecture.

Figure 3.3: The AD architecture - The diagram shows the Deliberative and Automatic levels, the Short-Term and Long-Term Memory, and the flows of data, execution orders and events between skills, sensors and actuators. Notice that the main communication systems between the skills are the shared memory system and the event system.

The AD architecture is composed of the following modules: the sequencer, the skills, the shared memory system and the event system.

The basic component of the AD architecture is the skill [32]. A skill is the minimum module that allows the robot to execute an action. It can reside in either AD level depending on its behaviour: a skill that executes complex reasoning or decision functions is a Deliberative Skill and resides in the Deliberative Level, while a skill that controls hardware components or does not perform complex reasoning functions is an Automatic Skill. Every skill has a control loop that executes its main functionality. This control loop can run in three different ways: cyclically, periodically or triggered by events. Also, every skill has three different states: ready, running and blocked.

Ready. The first state of the skill: the state between the moment the skill is instantiated and the first time the control loop is launched.

Running. The state in which the control loop is being executed. This state is also called the active state.


Blocked. The state in which the control loop is not being executed.

Two parameters must be defined when the skill is instantiated for the first time. The first parameter is the time between loop cycles, in other words, the time between two running states. The second parameter is the number of times the control loop is executed. Every skill can be activated or blocked in two different ways. The first one is to be activated or blocked by other skills; this can be done at both the Deliberative and Automatic levels. The second one is through the sequencer. The sequencer operates at the Deliberative Level; therefore, only deliberative skills can be activated or blocked by it.
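The life cycle just described can be illustrated with a minimal Python sketch. The real AD skills are separate processes written against the robot's own libraries; the class below is only an assumed, simplified mirror of the states and parameters explained above.

    import time

    class Skill:
        """Illustrative AD-style skill: a control loop with the ready, running and
        blocked states, a cycle time and an optional maximum number of iterations."""

        def __init__(self, cycle_time_s, max_cycles=None):
            self.cycle_time_s = cycle_time_s   # time between two executions of the loop
            self.max_cycles = max_cycles       # how many times the loop runs (None = until blocked)
            self.state = "ready"               # state right after instantiation

        def control_loop(self):
            raise NotImplementedError          # each concrete skill implements its action here

        def activate(self):
            self.state = "running"
            cycles = 0
            while self.state == "running":
                self.control_loop()
                cycles += 1
                if self.max_cycles is not None and cycles >= self.max_cycles:
                    break
                time.sleep(self.cycle_time_s)
            self.state = "blocked"

        def block(self):                       # called by another skill or by the sequencer;
            self.state = "blocked"             # in this single-threaded sketch it takes effect
                                               # between two loop cycles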

Every skill is launched as a process. Therefore, the communication between skills is an inter-process communication problem. To solve this problem, two communication systems have been designed and developed: the Shared Memory System and the Event System. The Shared Memory System is composed of the Long-Term Memory (LTM) and the Short-Term Memory (STM). The Long-Term Memory stores permanent knowledge; the robot uses the data stored in the LTM for reasoning or for making decisions, and this data persists even if the system is shut down. The Short-Term Memory is used for storing data that is only needed during the running cycle of the robot, for example data extracted from sensors or data that needs to be shared among skills. This data is not needed after the robot is powered off; therefore, the STM is not a persistent memory.

The Event System is used to communicate relevant events or information between skills. It follows the publisher/subscriber paradigm described by Gamma et al. in [33]. Skills can emit or subscribe to particular events. When a skill needs to inform other skills of a relevant event, it emits the event, and all the skills subscribed to it receive a notification in their "inboxes" when the event is triggered. The events can also carry some data related to the nature of the event itself. Every skill has an event manager for each subscribed event; the event manager defines what to do when the event is received. For example, an obstacle-monitoring skill can trigger an "obstacle found" event if an obstacle is detected. Other skills that need to know whether an obstacle is near the robot (for example a movement skill) can subscribe to the "obstacle found" event and act accordingly when it is received, for example by stopping the robot to avoid a collision with the obstacle.
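A toy publisher/subscriber sketch in Python can illustrate the mechanism with the obstacle example above; the actual Event System is distributed across processes and machines and is not implemented like this.

    from collections import defaultdict

    class EventSystem:
        """Toy in-process version of the publisher/subscriber event mechanism."""

        def __init__(self):
            self._managers = defaultdict(list)

        def subscribe(self, event_name, event_manager):
            # event_manager plays the role of a skill's per-event "event manager"
            self._managers[event_name].append(event_manager)

        def emit(self, event_name, data=None):
            for manager in self._managers[event_name]:
                manager(data)

    events = EventSystem()
    # A movement skill subscribes to the event emitted by an obstacle-monitoring skill
    events.subscribe("obstacle found", lambda data: print("stopping the base:", data))
    events.emit("obstacle found", {"distance_m": 0.4})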


Both the Shared Memory System and the Event System are designed and built following a distributed architecture. This allows data and notifications (events) to be shared between different machines. Because the skills communicate only through the Shared Memory System and the Event System, it is possible to run AD on different machines simultaneously and keep the whole architecture connected.

3.3 Robot Operating System

ROS (Robot Operating System) [34] is an open-source, meta-operating system for robots. It provides services similar to the ones provided by an Operating System (OS), including hardware abstraction, low-level device control, implementation of commonly used functionality, inter-process communication and package management. Additionally, it provides tools and libraries for obtaining, building, writing and running code in a multi-computer environment.

The ROS runtime network is a distributed, peer-to-peer network of processes that interoperate in a loosely coupled, distributed environment using the ROS communication infrastructure. ROS provides three main communication styles: synchronous RPC-style communication called services, asynchronous streaming of data called topics, and storage of data using a Parameter Server.

All the elements of the architecture form a peer-to-peer network where data is exchanged between elements and processed together. The basic concepts of ROS are nodes, the Master, the Parameter Server, messages, services, topics and bags. All of these provide data to the network.

Nodes: Nodes are the minimum unit structure of the ROS architecture. They are

processes that perform computation. ROS is designed to be modular: a robot

control system usually comprises many nodes performing different tasks. For

example, one node controls the wheel motors of the robot, one node controls

the laser sensors, one node performs the robot localization, another performs the path planning, and so on.

Master: The ROS Master is the central server which provides name registration and

lookup to the rest of the ROS network. Without the Master, nodes are not able


to find each other and, therefore, cannot exchange messages or invoke services.

Parameter Server: The Parameter Server is a part of the Master. Its functionality is

to act as a central server where nodes can store data. Nodes use this server to

store and retrieve parameters at runtime. It is not designed for high performance; instead, it is used for static, non-binary data such as configuration parameters. It is designed to be globally viewable, allowing tools and nodes to easily inspect the configuration state of the system and modify it if necessary.

Messages: Nodes communicate with each other by exchanging messages. A mes-

sage is a data structure of typed fields.

Topics: Messages are routed via a transport system with publish/subscribe seman-

tics. A node sends out a message by publishing it to a given topic. A topic is

a name that is used to identify the content of a message. A node that is inter-

ested in a certain kind of data will subscribe to the appropriate topic. There may

be multiple concurrent publishers and subscribers for a single topic, and a single

node may publish or subscribe to multiple topics. In general, publishers and subscribers are not aware of each other's existence. The objective is to decouple

the production of information from its consumption.

Services: The publish/subscribe model is a very flexible communication paradigm,

but its many-to-many, one-way transport is not appropriate for request/reply in-

teractions, which are often required in a distributed system. Request/reply is

done via services, which are defined by a pair of message structures: one for

the request and one for the reply. A providing node offers a service under a name

and a client uses the service by sending the request message and awaiting the

reply. ROS client libraries generally present this interaction to the programmer

as if it were a Remote Procedure Call (RPC).

Bags: Bags are a format for saving and playing back ROS message data. Bags are an

important mechanism for storing data, such as sensor data, that can be difficult

to collect but is necessary for developing and testing algorithms.


Finally, to close the ROS section, it is worth mentioning that ROS only runs on Unix-based platforms and is language independent. It currently supports C++ and Python, while support for other languages such as Lisp, Octave and Java is still in an experimental phase. The first stable release of ROS was delivered in March 2010.

3.4 The Kinect Vision System

Microsoft's Kinect RGB-D sensor is a peripheral designed as a video-game controller for the Microsoft Xbox console. Despite its initial purpose, it is currently being used by numerous robotics research groups thanks to its combination of high capabilities and low cost. The sensor provides a depth resolution similar to that of high-end ToF cameras, but at a cost several times lower.

The reason for this balance between capabilities and low cost resides in how the

Kinect retrieves the depth information. To obtain the depth information, the device uses

PrimeSense's Light Coding technology [35]. This technology consists of projecting an Infra-Red (IR) pattern onto the scene, similarly to how structured light sensors do. However, Light Coding differs from Structured Light in the light pattern: while Structured Light usually uses grids or stripe bands as a pattern, Light Coding emits a dot pattern onto the scene [5], [36] (see Fig. 2.3c).

The projected light pattern creates textures that make it easier to find the correspondence between pixels, especially on shiny or texture-less objects or under harsh lighting conditions. Also, because the pattern is fixed, there is no time-domain variation other than the movements of the objects in the field of view of the camera. This ensures a precision similar to that of ToF and Structured Light cameras, but the IR receiver mounted by PrimeSense is a standard CMOS sensor, which drastically reduces the price of the device.

The sensor is composed of an IR emitter, responsible for projecting the light pattern onto the scene, and a depth sensor responsible for capturing the emitted pattern. It is also equipped with a standard RGB sensor that records the scene in visible light (see Fig. 3.4).

Both depth and RGB sensors have a resolution of 640x480 pixels. This facilitates

the matching between the depth and the RGB pixels. This calibration process, referred


Figure 3.4: Block Diagram of the PrimeSense Reference Design - This is the block diagram of the reference design used by the Kinect sensor. The Kinect incorporates both a depth CMOS sensor and a colour CMOS sensor. (Retrieved from [16])

to by PrimeSense as registration, is done at the factory. Other processes, like correspondence1 and reconstruction2, are handled by the chip.

Together with other organisations, PrimeSense has created a non-profit organisa-

tion formed to promote the use of devices such as the Kinect in the area of natural interaction. The organisation is named OpenNI (NI stands for Natural Interaction) [29]. OpenNI has released an open-source framework, also called OpenNI, that provides several algorithms for using PrimeSense-compliant depth cameras in natural interaction applications.

Some of these algorithms provide the extraction and tracking of a skeleton model

from the user who is interacting with the device. This project uses these algorithms to get the data from the user's joints. In other words, the information that will be provided to the learning framework comes from the output of OpenNI's skeleton extraction

algorithms.

The kinematic model provided by OpenNI is a skeleton model of

1 Correspondence means matching the pixels of one camera with the pixels of the other camera.
2 Reconstruction means recovering the 3D information from the disparity between both cameras.


the body consisting of 15 joints. Fig. 3.5 shows these joints. The algorithms provide the positions and orientations of every joint, and additionally the confidence of these measurements. Moreover, these algorithms are able to track up to four simultaneous skeletons, but this feature is not used in this project.

ROS provides a package that wraps OpenNI and enables access to this framework from other ROS packages1. Thanks to that, other packages can access the data of the Kinect sensor. One such package is pi_tracker. This package has a node, named Skeleton Tracker, which uses OpenNI's Application Programming Interface (API) to retrieve the tracking information of the user. The node publishes the joint data in the /skeleton topic so other nodes can use it. This is the case of the Pose Trainer node.

Figure 3.5: OpenNI's kinematic model of the human body - OpenNI algorithms are able to create and track a kinematic model of the human body. The model has 15 joints (head, neck, torso, left and right shoulders, elbows, hands, hips, knees and feet), and their positions and orientations are updated at 30 frames per second.

1 The package is openni_kinect.


3.5 The Weka Framework

The Machine Learning Framework on which this project relies is Weka (Waikato En-

vironment for Knowledge Analysis) [37]. Weka is a popular suite of machine learning

software written in Java, developed at the University of Waikato, New Zealand. Weka

is open-source software released under the GNU General Public License.

Weka supports several data mining tasks such as data preprocessing, classification, clustering, regression, visualization, and feature selection.

Most of Weka's operations assume that the data is available in a single text file or relation. Weka also provides access to SQL databases using Java Database Connectivity, being able to process as input any result returned by a database query. In both files and database queries, each data instance is described by a fixed number of attributes.

It is possible to operate Weka through a graphical user interface, from the command line, or by accessing its Application Programming Interface (API). The latter is the method chosen

for this project.

Since Weka's API is programmed in Java and ROS provides a client library for this language, all the elements of the architecture that use Weka's API are programmed as ROS nodes. Although the ROS Java client library is, at this date, in an experimental phase, the core functionalities (connecting to the ROS Master and subscribing and publishing to topics) are sufficiently stable to be used.

Since the project focuses on supervised learning methods, only these methods are used from the Weka framework. Concretely, the C4.5 decision tree [38] has been chosen as the main algorithm to learn and detect the poses of the user. This algorithm has been chosen for its good performance [39] and because it allows inspecting what the robot has learnt in the learning phase.
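As an illustration of how the framework is used, the following minimal sketch trains Weka's J48 (the open-source implementation of C4.5) from a dataset file and prints the learnt tree, which is exactly the kind of inspection mentioned above. The file name poses.arff is hypothetical, and the last attribute is assumed to be the class.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch: train Weka's J48 (the open-source C4.5 implementation) from a
// dataset file and print the learnt tree. "poses.arff" is a hypothetical file
// name, and the last attribute is assumed to be the class (the pose label).
public class TrainJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("poses.arff");
        data.setClassIndex(data.numAttributes() - 1);  // the label is the class attribute

        J48 tree = new J48();
        tree.buildClassifier(data);

        // Printing the classifier shows the learnt decision tree in text form.
        System.out.println(tree);
    }
}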


Chapter 4

The Supervised Learning Based Pose Recognition Architecture

This chapter describes all the modules that have been built in order to learn to detect

human poses. It is divided into two parts, each one describing one part of the learning process. First, Section 4.1 describes all the components that operate in the training phase. Second, Section 4.2 describes all the components that operate in the process of recognising the poses of the user once the system has been trained. Before entering into detail, an overview of the system is provided.

Fig. 4.1 depicts the general scheme of the built architecture. It consists of two

differentiated parts, each one with one single purpose. The upper part represents the

training phase, where the user teaches the system to recognise certain poses. The

lower part of the figure represents the classifying phase, where the user adopts a certain pose and the robot tells her which pose she is in.

In the training phase, the robot uses two sensory systems to learn from the user.

The first is its Kinect-based RGB-D vision system. With it, the robot acquires the figure of the user separated from the background and processes it to extract a kinematic model of the user's skeleton. The second input is the Automatic Speech Recognition (ASR) System, which allows the robot to process the words said by the user and convert them into text strings.

In other words, the sensory system captures a data pair formed by:

The pose of the user, defined by the configuration of the joints of the kinematic model

of the user



Figure 4.1: Overview of the Built System - The upper part of the diagram shows the training phase, where the user teaches the robot by verbal commands which are the poses that the robot must learn. The lower part of the diagram depicts the classifying phase, in which the robot loads the learnt model and tells the user's current pose by its voice system.


A label identifying this pose, defined by the text string captured by the auditory system of the robot.

These two sensory inputs are fused together to form a data pair that is stored in a dataset. At the end of the training process, the dataset contains all the pose-label associations that the user has shown to the robot. This dataset is processed by a machine learning framework that builds a model establishing the relations between the poses and their associated labels. That is, the learned model establishes the rules that define when a given pose is associated with a certain label.

In the classifying phase, the robot continues receiving snapshots of the skeleton model every frame. But this time, it does not receive the auditory input that tells it the pose of the user; this is the moment when the robot has to guess that pose. To do so, it loads the learnt model into a classifier module that receives the skeleton model every frame as input. The skeleton is processed by the model, which returns the label corresponding to that skeleton.

Then, the label is sent to the Emotional Text To Speech (ETTS) module of the robot. The ETTS module is in charge of transforming text strings into audible phrases that are played through the robot's speakers. In this way, the label is sent to the ETTS module and then spoken by the robot.

Fig. 4.2 presents an overview of the whole system, going deeper into how all these processes are carried out and detailing the modules and messages that participate in them. There, the reader can see that the architecture is divided into two separate parts, the AD part and the ROS part. AD provides powerful tools for HRI; especially relevant are the ASR and ETTS Skills. The ASR Skill processes the speech of the user and transforms it into text. The ETTS Skill, on the other hand, processes text strings and transforms them into audible words or phrases emitted by the robot's speakers.

Therefore, all the parts of the architecture that need verbal inputs or outputs from

or to the user have been developed as AD skills, or have some mechanisms to com-

municate with AD skills. In other words, the interaction with the user is carried out in the AD part of the system. One of these interaction skills is the pose_labeler_skill,

described in section 4.1.1.

In the other part of the architecture, ROS provides the OpenNI and other packages to track humans and extract their skeleton from the Kinect sensor. Therefore,


the components that need information about the human skeleton have been developed as ROS nodes. Additionally, since ROS provides a client library in Java, all the components of the architecture that need to access the Weka framework have been programmed as ROS nodes as well. This is the case of the pose_trainer and

pose_classifier nodes. Both are described in sections 4.1.3 and 4.2.1 respectively.


Figure 4.2: Diagram of the built architecture - The figure depicts a diagram of the whole architecture and its components. Note that all the interaction modules reside in the AD part of the architecture while the skeleton tracking algorithms and the learning framework reside in the ROS part.

Additionally, there are some components whose main function is to establish links between ROS and AD and enable the communication between their modules. These

are the Pose_Labeler_Bridge and the Pose_Teller, described in sections 4.1.2 and

4.2.2 respectively.

Since the architecture is composed of several modules that act independently, the

key for the integration of these modules is the messaging system of the architecture.

The communication between the architecture modules is done by exchanging asyn-


chronous messages. In other words, if a module needs information from another module, it subscribes to that module's publications. Conversely, when a module has information that needs to be shared with others, it publishes it to the network. That means that the mechanism followed for the exchange of information in the architecture is based on the publisher/subscriber paradigm.

The following sections describe briefly what these modules do, how they operate

and, finally, how they link together to build the complete system. The description of

the components follows the usual temporal sequence of a user trying to use the system. Usually, the user would first train the pose classifier and then use it to classify her pose. The former is described in section 4.1 and the latter in section 4.2.

4.1 Training Phase

The training phase is the phase of the process in which the classifier is built. In this

phase, the human teaches the robot to recognise some poses. Completing this phase

means that a classifier is built and it can be used in the classifying phase. This section

describes all the components of the architecture that have been developed to train the

system.

Figure 4.3 depicts the temporal sequence of the training phase, showing the collaboration among all the modules that participate in it. In summary, the training phase

occurs in the following sequence. First, the system needs to detect what the user is

saying to the robot. This is explained in section 4.1.1. If what she is saying is a valid pose, it will be labeled and sent to a node in charge of communicating the interaction modules (located in the AD part of the system) with the ROS modules (in charge of the learning step). The bridging between AD and ROS is described in section 4.1.2. Finally, the label arrives at the machine learning module, which will

gather the label describing the human pose and the data from the vision sensor. This

data will be written to a dataset and sent to the Weka framework to build a classifier

able to detect human poses. This final part of the process is described in section 4.1.3.



Figure 4.3: Sequence Diagram of the training phase - The pose_trainer node is the node which fuses the information of the skeleton model and the labels told by the user.


4.1.1 Pose Labeler Skill

When the user starts training the robot, she has to carry out two tasks. The first one is to adopt the pose she wants to show the robot. The second task is to tell the robot which pose she is in. From the robot's point of view, it first has to detect the human pose and, second, it has to understand what the user is saying to it. The former is described in section 4.1.3 while the latter is described below.

The user reports her pose to the robot by telling it. The robot has an Automatic Speech Recognition (ASR) System that allows it to detect and process natural language. This system mainly relies on AD's ASR Skill [40]. The ASR Skill detects the human speech and processes it according to a predefined grammar. If the speech of the human matches one or more of the semantic elements of that grammar, the ASR Skill sends an event notifying all the other skills that it has recognised an utterance. Other skills that are subscribed to the ASR events read the results obtained by the ASR Skill and process them accordingly.

The ASR Skill cannot understand what the user is saying unless it is previously provided with a grammar that defines the semantics of these utterances. Therefore, if we want to make the ASR Skill able to understand which pose the human is describing, we need to build a specific grammar. This grammar is summarised below and

described in Annex C.

In this scope, a grammar is a set of possible word combinations that the user can say, linked to their semantic meaning. In this way, the built grammar is able to detect several words that define distinct poses of the human. The semantics of those words are coded into labels that can be represented as variables in a computer program. The grammar has been built so that it can detect up to 18 labels by combining different semantics from three different categories:

1. Position Semantics

(a) SIT. Defines that the user is sitting on a chair.

(b) STAND. Defines that the user is standing in front of the robot

2. Action Semantics

(a) TURNED. Defines that the user's body is turned towards a location defined in the directions category.

(b) LOOKING. Defines that the user is looking towards a location defined in the directions category.


(c) POINTING. Defines that the user is pointing with her arm to a location defined in the directions category.

3. Direction Semantics

(a) LEFT. Defines that the action the user is performing is towards her own left side. For example, if she is pointing, she is pointing to her left.

(b) FORWARD. Defines that the action performed by the user is towards her front.

(c) RIGHT. Defines that the action performed by the user is towards her right.

When the ASR Skill detects a combination of words that defines the three semantics above, it emits an event to notify all its subscribers and stores the recognition results in the Short Term Memory System. One of these subscribers is the Pose Labeler Skill. When it receives an event from the ASR Skill, the Pose Labeler Skill reads the recognition results from the Short Term Memory and analyses their semantics to form a label. Examples of labels are SIT_LOOKING_LEFT, meaning that the user is sitting and looking towards her left, or STAND_TURNED_FORWARD, meaning that the user is standing and turned towards the robot. Finally, if the label is valid (i.e. one of the 18 labels above), the Pose Labeler Skill sends an event with the label ID. Fig. 4.4 summarises the main functionalities of the Pose Labeler Skill.
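The following Java sketch illustrates how the three semantic values could be combined into one of the 18 valid labels. It is an illustration of the labelling logic only, not the actual code of the Pose Labeler Skill.

import java.util.Arrays;
import java.util.List;

// Illustrative sketch (not the actual skill code) of how the three semantic
// values recognised by the grammar can be combined into one of the 18 labels.
public class PoseLabelBuilder {
    static final List<String> POSITIONS = Arrays.asList("SIT", "STAND");
    static final List<String> ACTIONS = Arrays.asList("TURNED", "LOOKING", "POINTING");
    static final List<String> DIRECTIONS = Arrays.asList("LEFT", "FORWARD", "RIGHT");

    // Builds a label such as "SIT_LOOKING_LEFT", or returns null if any of the
    // semantic values is not one defined by the grammar.
    static String buildLabel(String position, String action, String direction) {
        if (!POSITIONS.contains(position) || !ACTIONS.contains(action)
                || !DIRECTIONS.contains(direction)) {
            return null; // not one of the 2 x 3 x 3 = 18 valid labels
        }
        return position + "_" + action + "_" + direction;
    }

    public static void main(String[] args) {
        System.out.println(buildLabel("STAND", "TURNED", "FORWARD")); // STAND_TURNED_FORWARD
    }
}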


Figure 4.4: Use Case of the Pose Labeler Skill - The figure depicts the use case diagram of the Pose Labeler Skill. Its main function is to transform the labels told by the user into system-readable labels and emit them as an AD event.

In addition to the semantics shown above, the Pose Labeler Skill can also process two more semantics which are not related to the labels. These semantics act as


a control layer that allows the user to control some aspects of the training phase by voice. These semantics are the following:

• CHANGE. Used to allow the user to change her pose. This command is said when the user wants to change her pose. It is used to allow the classifier to discriminate the transitions between two poses.

• STOP. Used to end the training process. When the user says she wants to finish the training process, the ASR builds this semantic to allow the Pose Labeler Skill to end it.

4.1.2 Pose Labeler Bridge


Figure 4.5: Use Case of the Pose Labeler Bridge - The figure depicts the use case diagram of the Pose Labeler Bridge ROS node. Its main function is to listen to the AD's LABELED_POSE events and to bridge their data to the ROS /labeled_pose topic.

The Pose Labeler Bridge is the next step of the process. It is a ROS node that acts as a bridge between AD and ROS. Its main functionality is to transform the events emitted by the Pose Labeler Skill into a ROS topic. Concretely, when the Pose Labeler Skill detects a label from the ASR's recognition results, it emits an event called LABELED_POSE. The Pose Labeler Bridge parses the content of the messages sent through this event and transforms them to fit into a ROS topic called /labeled_pose. Fig. 4.5 summarises the main functionalities of the Pose Labeler Bridge.

4.1.3 Pose Trainer

The last step of the training process is performed in the Pose Trainer Node. The Pose

Trainer node is a ROS node that does several things (see Fig. 4.6). First of all, it

subscribes to the /labeled_pose topic to know the pose of the user. Secondly, it also


subscribes to the /skeleton topic (see section 3.4 for more information about this topic). The Pose Trainer node reads the messages of this topic to extract the information about the user's joints. This information is combined with the information from the /labeled_pose topic and formatted properly to be understood by the Weka framework.


Figure 4.6: Use Case of the Pose Trainer Node - The figure depicts the use case diagram of the Pose Trainer Node. Its main function is to receive messages from the /labeled_pose and /skeleton topics, and build a dataset with these messages. After the dataset is built, the node also creates a learned model from it.

Each skeleton message is coded as a Weka instance. Each instance has 121 attributes, divided in the following way. The message consists of 15 joints, with 3 attributes for the position of each joint, 4 attributes for the orientation1 of each joint, and 1 attribute for the confidence of the measurements of that joint. This makes 120 attributes. The last attribute is the label that comes from the /labeled_pose topic. This last attribute is also the class of the instance2.
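The sketch below shows how such a 121-attribute header could be declared with the Weka API (assuming Weka 3.7 or later): eight attributes per joint for the fifteen joints, plus the nominal class. The attribute names and class values are illustrative, not necessarily those used by the real node.

import java.util.ArrayList;
import weka.core.Attribute;
import weka.core.Instances;

// Sketch of how the 121-attribute header could be declared with the Weka API
// (assuming Weka 3.7 or later): 8 attributes per joint x 15 joints + the class.
public class PoseHeader {
    public static Instances buildHeader(String[] jointNames, ArrayList<String> labels) {
        ArrayList<Attribute> attrs = new ArrayList<>();
        for (String joint : jointNames) {
            attrs.add(new Attribute(joint + "Pos_X"));      // position (3 attributes)
            attrs.add(new Attribute(joint + "Pos_Y"));
            attrs.add(new Attribute(joint + "Pos_Z"));
            attrs.add(new Attribute(joint + "Orient_X"));   // quaternion orientation (4 attributes)
            attrs.add(new Attribute(joint + "Orient_Y"));
            attrs.add(new Attribute(joint + "Orient_Z"));
            attrs.add(new Attribute(joint + "Orient_W"));
            attrs.add(new Attribute(joint + "Confidence")); // confidence (1 attribute)
        }
        attrs.add(new Attribute("pose", labels));           // nominal class attribute (the label)
        Instances header = new Instances("poses", attrs, 0);
        header.setClassIndex(header.numAttributes() - 1);   // the 121st attribute is the class
        return header;
    }
}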

While the Pose Trainer node is running, it collects /skeleton and /labeled_pose messages and fuses them into instances. During its operation, the node continuously builds a dataset of instances with the received messages. Finally, when it receives a label with the "STOP" identifier, it stops adding messages to the dataset.

Fig. 4.3 shows this process.

1 The orientation is coded as a quaternion.
2 The class of an instance is the attribute that tells the learning algorithm which class the instance belongs to. In other words, it is the attribute that tells the classifier how this data must be classified.


With the dataset completed, the node calls the Weka API in order to build a model from the dataset. This model is the classifier that will be used in section 4.2 to classify the poses of the user. For this project, only one type of classifier is built from the dataset: Weka's J48 decision tree, which is an open-source implementation of the C4.5 decision tree (see section 3.5).
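A minimal sketch of this step, assuming a dataset variable of type weka.core.Instances already filled with the collected examples, could look as follows; the model file name is hypothetical.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.SerializationHelper;

// Sketch of the model-building step, assuming "dataset" is the Instances object
// filled during training; "pose_model.model" is a hypothetical file name.
public class BuildAndSaveModel {
    public static void buildAndSave(Instances dataset) throws Exception {
        dataset.setClassIndex(dataset.numAttributes() - 1);  // the label is the class
        J48 tree = new J48();
        tree.buildClassifier(dataset);                       // learn the decision tree
        SerializationHelper.write("pose_model.model", tree); // persist it for the classifying phase
    }
}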

If the dataset has relevant data, the model will be able to generalise to other situa-

tions for which the classifier has not been trained. If not, the classifier will not be able

to classify situations that were not contemplated during the training phase, which will cause several classification errors.

The structure of the Pose Trainer node is depicted in Fig. A.1 from Annex A.

4.2 Classifying Phase

The classifying phase is the phase where the robot starts guessing the pose of the

user. To do so, it needs a previously created model of the poses the user may adopt. The main elements of the classifying phase are the Pose Classifier node and the

Pose Teller node. The first is described in section 4.2.1 while the second is described

in section 4.2.2. The temporal sequence of how these nodes interact is depicted in

Fig. 4.7.

4.2.1 Pose Classifier

The Pose Classifier ROS node is the node that classifies the pose of the user. Its main functions are depicted in Fig. 4.8. To do so, it needs two different inputs. The first one is the knowledge to decide the pose of the user from the data of her joints; this comes from the classifier that has been built in section 4.1.3. The second input the node needs is the content of the /skeleton topic messages. As said above, these messages contain the information about the user's joints.

The node subscribes to the /skeleton topic and starts reading its messages. For each received message, the node parses and formats it as a Weka instance. This instance is similar to the instances created by the Pose Trainer node in section 4.1.3, but it differs from them in one aspect: it does not have the



Figure 4.7: Sequence Diagram of the classifying phase - The pose_classifier node processes the skeleton messages using the learnt model and sends the output to the voice system of the robot.



Figure 4.8: Use Case of the Pose Classifier Node - The figure depicts the use case diagram of the Pose Classifier Node. The node loads a previously trained model to classify skeleton messages into known poses.

class defined. In other words, the class of the instance is not set. The Pose Classifier node uses the classifier to determine which class the instance should belong to.

After classifying the instance, the node emits a message to the /classified_pose topic. In fact, the sent message has the same format as the messages sent to the /labeled_pose topic; the difference is that the former carries labels that have been deduced by the classifier, while the latter carries labels specified by the user.
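A minimal sketch of this classification step (again assuming Weka 3.7 or later) is shown below. It loads the previously saved model, builds an instance with the 120 feature values and no class value, and maps the predicted class index back to a label string; variable and file names are illustrative.

import weka.classifiers.Classifier;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.SerializationHelper;

// Sketch of the classification step: an instance is built from the 120 feature
// values of a /skeleton message, its class is left unset, and the saved model
// predicts the label. Names are illustrative.
public class ClassifyPose {
    public static String classify(double[] jointValues, Instances header) throws Exception {
        Classifier model = (Classifier) SerializationHelper.read("pose_model.model");

        Instance inst = new DenseInstance(header.numAttributes());
        inst.setDataset(header);                    // ties the instance to the training header
        for (int i = 0; i < jointValues.length; i++) {
            inst.setValue(i, jointValues[i]);       // position, orientation and confidence values
        }
        // The class attribute stays missing; the classifier decides it.
        double predicted = model.classifyInstance(inst);
        return header.classAttribute().value((int) predicted); // e.g. "STAND_TURNED_LEFT"
    }
}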

The structure of this node is depicted in Fig. A.2 from Annex A.

4.2.2 Pose Teller

The Pose Teller node is the node in charge of telling the user which pose she is in. In other words, it tells the user what pose has been detected by the classifier. The pose of the user is announced by the Pose Classifier node in the /classified_pose topic, but the content of the messages of this topic is not understandable by humans. Therefore, a node is needed that translates that content into something that can be understood by people. The Pose Teller node is the module that carries out this task (see Fig. 4.9).

First of all, the Pose Teller node subscribes to the /classified_pose topic and reads its messages to retrieve the label identifier written by the classifier. Then, it transforms


the label ID into a description that can be understood by the user. For example, if the label is STAND_LOOKING_LEFT, the node transforms it into the text "You're standing and looking to the left". But this is only a text string. If we want the robot to say this text, it must be sent to AD's ETTS (Emotional Text To Speech) Skill [41]. This skill is in charge of transforming text strings into audible speech. Therefore, the Pose Teller node sends the "textified" label to the ETTS Skill via an AD event. Finally, the ETTS Skill pronounces the text using the robot's voice system.
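The following sketch illustrates the label-to-text step. The mapping and the class name are hypothetical; in the real node the resulting string is sent to the ETTS Skill through an AD event rather than printed.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the label-to-text step of the Pose Teller node. The
// mapping is hypothetical; the real node sends the resulting string to the ETTS
// Skill through an AD event instead of printing it.
public class PoseTellerSketch {
    static final Map<String, String> PHRASES = new HashMap<>();
    static {
        PHRASES.put("STAND_LOOKING_LEFT", "You're standing and looking to the left");
        PHRASES.put("STAND_TURNED_FORWARD", "You're standing and turned towards me");
        PHRASES.put("SIT_POINTING_RIGHT", "You're sitting and pointing to your right");
    }

    static String toSpeech(String labelId) {
        return PHRASES.getOrDefault(labelId, "I don't recognise your pose");
    }

    public static void main(String[] args) {
        System.out.println(toSpeech("STAND_LOOKING_LEFT"));
    }
}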


Figure 4.9: Use Case of the Pose Teller Node - The figure depicts the use case diagram of the Pose Teller Node. Its main function is to receive labels from the /classified_pose topic and send them to the ETTS skill so that it tells the label to the user.


Chapter 5

Pilot Experiments

The main objective of carrying out a pilot experiment is to test that the classifier is able to learn human poses. Apart from that, there is another objective that emerged during the design of the system. During the initial tests of the trainer, it was observed that the trainer node built different models depending on the kind of data or features provided to it. That is, the pose_trainer node built different models when it was fed only with the positions of the joints than when it was fed with both their positions and orientations.

It was also observed that, when the classifier was trained using all three data types (position, orientation and confidence), the human trainer had to take care to feed the node with representative data. In other words, if the human trained the pose_trainer node while standing in only one position, the classifier was only able to detect the learned poses when they were shown in the exact same position as during training.

Therefore, if we want to build a classifier that is able to generalise, we have two options. The first one is to train the classifier without giving it position data. The second option is to make the position data irrelevant during the training process. The first option involves pre-processing the data before it is given to the trainer node; during this pre-processing stage, the positions of the joints are removed (a sketch of this step is shown below). The second option feeds the pose_trainer node with all the data from the human joints, but during the training process the human must move around the field of view of the Kinect sensor in order to feed the node with all the relevant data.
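For the first option, a minimal sketch of the pre-processing step using Weka's Remove filter could look as follows. It assumes the attribute layout described in section 4.1.3 (eight attributes per joint, with the three position attributes first), which is a convention of this example rather than something imposed by Weka.

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

// Sketch of the pre-processing option: removing the position attributes before
// training, assuming 8 attributes per joint with the 3 position attributes first.
public class DropPositionAttributes {
    public static Instances withoutPositions(Instances dataset) throws Exception {
        int joints = 15;
        int[] positionIdx = new int[joints * 3];
        for (int j = 0; j < joints; j++) {
            positionIdx[j * 3] = j * 8;             // posX (0-based attribute index)
            positionIdx[j * 3 + 1] = j * 8 + 1;     // posY
            positionIdx[j * 3 + 2] = j * 8 + 2;     // posZ
        }
        Remove remove = new Remove();
        remove.setAttributeIndicesArray(positionIdx); // attributes to delete
        remove.setInputFormat(dataset);
        return Filter.useFilter(dataset, remove);     // orientation, confidence and class remain
    }
}

The same effect can also be obtained by wrapping J48 in a weka.classifiers.meta.FilteredClassifier, which matches the PoseModel class shown in the class diagrams of Annex A.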

Note that, while the first option relies on the automation of the process, the second one shifts the responsibility to the human teacher, who is in charge of providing


the classifier with good data that will allow it to generalise. But, a priori, it is not clear which method builds better classifiers. Therefore, the objective of this evaluation is to discover which of the two methods produces better models. This is especially relevant because it is difficult to guarantee that users will be "good teachers" and will train the classifier with good and representative examples.

The initial intuition leads us to think that the classifier is able to detect the poses quite accurately if a good training set is provided, regardless of which method is used. But our hypothesis is that it is easier to build better models with fewer features, especially if these features better represent the states of a pose. In other words, it seems that the joint orientations are more representative for detecting poses than the information retrieved from the positions of the joints.

5.1 Scenario

To validate the hypothesis, two datasets were created. Both datasets consist of the data from a user who has trained the classifier to detect three different poses:

1. Standing, turned left

2. Standing, turned front

3. Standing, turned right

However, each dataset has been recorded in a different manner. For the first dataset, D1, the user trained the classifier showing the three poses from different positions. For the second dataset, D2, the user trained each pose without changing her position relative to the robot. That means that D1 has better training data than D2. Also, for each dataset two models have been built, each one using different features or attributes. The differences between the datasets and their models are listed below:

• Dataset 1 (D1): Relevant data. The user moved around the allowed area while showing each pose.

– Model 1 (M1): All attributes were used to construct the model (position, orientation and confidence).

– Model 2 (M2): Only orientation and confidence attributes were used to build the model.


• Dataset 2 (D2): Not relevant data. The user did not move from her original position.

– Model 3 (M3): All attributes were used to construct the model (position, orientation and confidence).

– Model 4 (M4): Only orientation and confidence attributes were used to build the model.

The recording of the datasets D1 and D2 was done in the scenario depicted in Fig.

5.1. The scenario consists of the robot Maggie equipped with a Kinect sensor. The

rectangle in Fig. 5.1 shows the area where the user who trained the classifier was standing. She was able to move wherever she wanted inside the rectangle while recording the poses. The only conditions were that she was not allowed to leave the rectangle during the recording phase and that she was not allowed to change her pose without warning the robot. During the recording of dataset D1, the user moved throughout the whole rectangle, whereas in the case of dataset D2 the user did not move from the centre of the rectangle.


Figure 5.1: Scenario of the experiment - The cone represents the field of view of the Kinect sensor. The user was allowed to move inside the rectangle, but turned in the direction of the arrows.


After the training phase, 12 different users recorded a test dataset. Each one of them recorded the same 3 poses as the user who trained the four models. Moreover, they had the same recording conditions as the user who recorded dataset D1; in short, they were allowed to move inside the rectangle during the recording phase.

The data of the twelve users were recorded and gathered in a single dataset file, which was then tested against the four models. The results of these tests are summarised in the following section and fully reported in Annex B. A discussion of these results is given in section 5.3.

5.2 Results

The results of the experiment are summarised in Table 5.1. The table presents how

the models M1, M2, M3 and M4 performed against the test dataset. The best models

were M1 and M2, with more than 92% of correctly classified instances and barely 4% of false positives. Next, model M4 performed worse, with 70% of correctly classified instances and 14% of false positives. Finally, as expected, model M3 showed the worst performance, with 56% of correctly classified instances and 21% of false positives.

Note that the table shows the results of models M1 and M2 in the same row. This

is because they used the same dataset (D1) to build their trees and in the end, the J48

algorithm built the same tree in both cases. Fig. 5.2 depicts the tree of models M1 and M2. Although they used different attributes from the dataset1, the J48 algorithm decided that the relevant information in D1 was located in the orientation attributes, producing the same tree in both cases.

This did not happen with the tree built for model M3 (see Fig. 5.3). Here, the algorithm that built the tree considered that some relevant information in the training dataset was in the position of the right knee. When the users tested the tree, their positions varied with respect to that of the user who trained it, which caused several errors.

The last tree, M4 (Fig. 5.4), is similar to M3, with the difference that the former only uses orientation information. This enabled it to be more accurate than M3, but

1 Remember that M1 used position, orientation and confidence, while M2 used only orientation and confidence.


Model     TP Rate   FP Rate   Precision   MAE      RMSE
M1, M2    0.926     0.039     0.940       0.0049   0.0583
M3        0.563     0.213     0.687       0.0416   0.204
M4        0.700     0.141     0.796       0.0286   0.1691

Table 5.1: Results of the Pilot Experiment - Models M1 and M2 produced the same results and performed better than the other models. Note: TP = True Positive; FP = False Positive; MAE = Mean Absolute Error; RMSE = Root Mean Squared Error.

not as much as M1 and M2. The reason for this could be that the M1/M2 tree is a bit more complex, so it seems it can cover more cases than M4.

Figure 5.2: The tree built for the models M1 and M2 - The classifier only used the orientations of the joints (the tree splits on torsoOrient_w, torsoOrient_y and left_kneeOrient_w).

5.3 Discussion

The results show that the initial hypothesis is partially validated for this experiment.

The hypothesis stated that using only orientation and confidence attributes would


Figure 5.3: The tree built for the model M3 - This time, the classifier used position information of the joints (the tree splits on right_kneeOrient_Z and right_kneePos_X).

Figure 5.4: The tree built for the model M4 - The tree is quite similar to the one in Fig. 5.3, but this one only uses orientation information of the joints (it splits on right_kneeOrient_Z and torsoOrient_w).


lead to models with higher generalisation capabilities. This has only been partially validated. When a good training set is provided, as in dataset D1, the classifier builds models that are able to generalise, no matter which attributes are used to build them. This is the case of models M1 and M2, which ended up being the same.

However, when the dataset has no relevant information, using only orientation at-

tributes leads to classifiers that perform better than classifiers that are built using posi-

tion and orientation attributes. In the studied case, this difference was nearly 15 percentage points in correct classifications and nearly 7 points in false positives.

In fact, the classifier built a model based on orientation attributes rather than positions. This means that, as initially thought, the orientation information was more significant than the position information. But it also

means that if the classifier is provided with a relevant dataset, it will choose the most

significant attributes.

To sum up, it seems that providing good training data to the classifier is of paramount

importance. Good datasets produce better models no matter the joint attributes that

are used to build them. But when it is not possible to ensure that the user will train the

classifier properly, it could be better to avoid the use of position attributes.


Chapter 6

Conclusions

A pose recognition architecture has been built and integrated in the robot Maggie. This architecture allows the robot to learn the poses that the user teaches it.

The system relies on two main pillars. The first is the vision system of the robot, which is composed of a Kinect depth camera and its official algorithms to track people. The sensor and its algorithms have proven to be robust to changes in lighting conditions and to partial occlusions of the body. The second pillar of the learning platform is the HRI capabilities of the robot Maggie, especially its ability to communicate with people by voice. Thanks to that, the user taught the robot by speaking to it as she would do with other people.

To validate the architecture, a pilot experiment was carried out. In the experiment

two datasets were used to build four different models that were tested against twelve people. The experiment demonstrated that the learning system is able to detect the poses of the users, obtaining high accuracy rates when a good training dataset is provided to the robot.

The main contributions of this project are listed below.

• A Pose Recognition Architecture has been developed and integrated in a social

robot allowing it to recognise the poses of different people with high accuracy.

• ROS has been integrated with the AD architecture. Although it is not fully inte-

grated, the initial communication mechanisms between both architectures have

been established. Additionally, some AD skills have started the process of becoming both AD skills and ROS nodes.


• The Weka machine learning framework has been integrated with the AD archi-

tecture, thanks to the integration between ROS and AD.

• The robot Maggie has now fully integrated the Kinect sensor, its drivers and the

OpenNI framework. This opens up many new possibilities, not only in the HRI field, but also in the object recognition and navigation fields.

• The robot Maggie is now able to understand human poses. This means that the robot is able to understand some contextual information when it is interacting with a user. Thanks to that, the robot can adapt its behaviour according to this context and improve the interaction quality as it is perceived by the user. As an example, imagine that two people are talking to each other just in front of the robot. If the robot has its ASR turned on, it will process every word of the conversation even though the words are not addressed to it. But because the robot can see that these two people are turned towards each other, it may infer that those words are not addressed to it and simply discard them.

Extending the architecture is the first step of future work. Since the architecture has proven to be valid within its field of use, it would be interesting to extend it to other fields such as gesture recognition. This would enable the robot to better understand interaction situations with the users.

This work opens the door for building a continuous learning framework. Thanks

to the integration between the learning system and the interaction system, it is now possible to strengthen these bonds to build a more generic platform that would allow the robot to learn continuously from its environment and its partners.

Since the learning platform allows the robot to understand information from the

user, one possible line of work is to study how the robot can infer the intentions of

the users using the information it has learnt from them. Now that the robot can learn human poses, the next step of the process is to understand what these poses mean in an interaction context.

Moving the focus from the more generic to the more specific, and entering into details regarding the developed learning system itself, another line of further work is to go deeper into the relation between the position attributes, the orientation attributes and the quality of the training. Although it seems that good training data is enough to compensate for less informative attributes, this relation should be studied more systematically.


It would also be interesting to know whether changes in the reference coordinate system would affect the training phase. For example, there are some poses that are relative to the robot, such as being near to or far from the robot; in this case, it is clear that the adequate coordinate frame is the robot's one. But in other cases it might be better to use other coordinate frames. An example would be a user separating her arms to announce that something is big, and its contrary pose, bringing the arms closer together to point out that something is small. It has to be studied whether, in this last case, it is better to use a coordinate frame whose origin is not at the robot's sensor but, for instance, at one of the user's hands.

Other possibilities for further research are comparing several classifiers or, going further, other data mining techniques. There are studies that have made such comparisons in generic situations [39], but introducing the real-time human-robot interaction component might lead to interesting findings.

Finally, since the main purpose of the robot is the study of HRI, user studies should be carried out to understand how to improve this learning process from the user's perspective. Understanding what the user thinks about the process might lead to better training scenarios that would result in robots that learn better from their users.


Appendix A

Class Diagrams of the Main Nodes

[Class diagram content: Joint (jointName, confidence, posX, posY, posZ, orientX, orientY, orientZ, orientW); Pose (pose, userID, stamp, 15 Joints); PoseSet (fillAttributes, parseSkeletonMessage, labelMessage, addMessage, setClass, saveToFile; backed by weka.core.Instances); PoseModel (loadFromFile, saveToFile, classifyInstance, buildClassifier; wraps a FilteredClassifier/J48, i.e. a weka.classifiers.Classifier); PoseTrainer (main; holds a PoseSet, a PoseModel and a ROS NodeHandle).]

Figure A.1: Class Diagram of the Pose Trainer Node - The main class is a ROS node (it has a node handle) that connects with Weka to build a dataset and a model from the inputs it receives from the topics to which it is subscribed (defined in the node handle).


[Class diagram content: the same Joint, Pose, PoseSet and PoseModel classes as in Fig. A.1, plus a LabeledPose message; the main class is PoseClassifier (main; holds a PoseSet, a PoseModel and a ROS NodeHandle).]

Figure A.2: Class Diagram of the Pose Classifier Node - Notice that it is almost equal to the class diagram of the Pose Trainer Node. The main difference is that this node loads the model built by the Pose Trainer Node and uses this model to classify the skeleton messages it receives.


Appendix B

Detailed Results of the Pilot Experiment

This appendix shows the full results obtained in the pilot experiment described in chapter 5. In the experiment, four models were tested against a dataset recorded by twelve people. The four models were trained according to the description given in section 5.1. The results were partially shown in section 5.2 and discussed in section 5.3.

The results are presented in the following tables. Table B.1 presents the detailed results of the models M1 and M2, while Table B.2 presents their confusion matrix. The detailed results of model M3 are presented in Table B.3, while its confusion matrix is detailed in Table B.4. Finally, the results of model M4 are shown in Table B.5 and its confusion matrix can be analysed in Table B.6.

Class                   TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area
STAND_TURNED_LEFT       0.787     0.000     1.000       0.787    0.881       0.999
STAND_TURNED_RIGHT      0.996     0.111     0.826       0.996    0.903       0.997
STAND_TURNED_FORWARD    0.982     0.002     0.996       0.982    0.989       0.99
Weighted Avg.           0.926     0.039     0.939       0.926    0.926       0.995

Table B.1: Detailed Results of the Models M1 and M2 - These models performed with 92% of correctly classified instances (column 1) and with barely 4% of false positives (column 2).


classified as →              a      b      c
STAND_TURNED_LEFT = a      1171    317      0
STAND_TURNED_RIGHT = b        0   1648      6
STAND_TURNED_FORWARD = c      0     30   1619

Table B.2: Confusion matrix for the models M1 and M2 - Almost all the errors came between the left and the right orientations.

Class                   TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area
STAND_TURNED_LEFT       0.673     0.315     0.490       0.673    0.567       0.679
STAND_TURNED_RIGHT      0.209     0.001     0.989       0.209    0.345       0.604
STAND_TURNED_FORWARD    0.819     0.334     0.563       0.819    0.667       0.743
Weighted Avg.           0.563     0.213     0.687       0.563    0.525       0.675

Table B.3: Detailed Results of the Model M3 - It showed a poor performance, with only 56% of correctly classified instances (column 1) and more than 20% of false positives (column 2). Most of the errors were due to the low performance when classifying the TURNED_RIGHT pose.

classified as →              a      b      c
STAND_TURNED_LEFT = a      1001      4    483
STAND_TURNED_RIGHT = b      743    346    565
STAND_TURNED_FORWARD = c    299      0   1350

Table B.4: Confusion matrix for the model M3 - The TURNED_RIGHT pose produced a great percentage of the errors.

Class                   TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area
STAND_TURNED_LEFT       0.929     0.317     0.569       0.929    0.706       0.806
STAND_TURNED_RIGHT      0.209     0.001     0.989       0.209    0.345       0.604
STAND_TURNED_FORWARD    0.984     0.123     0.807       0.984    0.887       0.931
Weighted Avg.           0.7       0.141     0.796       0.7      0.644       0.779

Table B.5: Detailed Results of the Model M4 - It showed a better performance than model M3, with 70% of correctly classified instances (column 1) and 14% of false positives (column 2). As in model M3, the performance was poor when classifying the TURNED_RIGHT pose.


classified as →               a      b      c
STAND_TURNED_LEFT = a      1383      4    101
STAND_TURNED_RIGHT = b     1022    346    286
STAND_TURNED_FORWARD = c     26      0   1623

Table B.6: Confusion matrix for the model M4 - As in model M3, the TURNED_RIGHT pose produced a large percentage of the errors.
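As a quick check, the 70% of correctly classified instances reported for M4 in Table B.5 can be recovered from this matrix by dividing the diagonal by the total number of instances:

\[
\text{Accuracy}_{M4} = \frac{1383 + 346 + 1623}{1488 + 1654 + 1649} = \frac{3352}{4791} \approx 0.700 .
\]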


Appendix C

The ASR Skill Grammar for Recognising Pose Labels

This chapter describes the grammar used to recognise the pose labels that the user tells the robot when announcing her pose. The grammar has the format used by the ASR engine of the ASR Skill, which is similar to [42] with slight modifications in its tags.

Since the ASR Skill currently only supports Spanish words, the grammar is written in Spanish. However, the semantic values of the words that can be detected by the grammar are written in English.

The grammar allows the robot to understand two control commands and one pose command. The control commands are defined by their semantics. The first one is the STOP command, used to finish the training phase. The second one is the CHANGE command, used to mark the transitions between poses.

The pose semantics are defined in the $pose field. This field understands three categories of semantics: $position, $action and $direction. $position defines whether the user is sitting (SIT) or standing (STAND). The second semantic, $action, defines whether the user is turned (TURNED), looking (LOOKING) or pointing (POINTING) towards a certain direction. Finally, the third semantic, $direction, defines the direction of the user's $action. These directions can be left (LEFT), right (RIGHT) or forward (FORWARD).

The Spanish words located before the semantic labels are the words the user has to say in order to trigger their related semantic value. For instance, in the $position semantics, the words "sentado" or "en una silla" will trigger the semantic value SIT, whereas the words "de pie" or "levantado" will trigger the semantic value STAND.

#ABNF 1.0 ISO-8859-1;

language es-ES;
tag-format <loq-semantics/1.0>;
public $root = $pose_trainer;

$pose_trainer = [$GARBAGE] $stop
              | [$GARBAGE] $change [$GARBAGE]
              | [$GARBAGE] $pose;

$stop = ("para" : STOP
        | "para de etiquetar" : STOP
        | "stop" : STOP
        | "ya esta bien" : STOP
        | "dejalo ya" : STOP
        | "cansado" : STOP)
        {<@STOP_COMMAND $value>};

$change = ("pausa" : CHANGE
          | "cambio" : CHANGE
          | "cambio de" : CHANGE
          | "cambiar de" : CHANGE)
          {<@CHANGE_COMMAND $value>};

$pose = [$position] $action [$GARBAGE] $direction;

$position = ("sentado" : SIT
            | "en una silla" : SIT
            | "de pie" : STAND
            | "levantado" : STAND)
            {<@POSITION $value>};

$action = ("girado" : TURNED
          | "mirando" : LOOKING
          | "apuntando" : POINTING)
          {<@ACTION $value>};

$direction = ("derecha" : RIGHT
             | "izquierda" : LEFT
             | "delante" : FORWARD)
             {<@DIRECTION $value>};
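As an example, an utterance such as "de pie girado a la izquierda" matches the $pose rule and yields the semantics POSITION = STAND, ACTION = TURNED and DIRECTION = LEFT. A trivial way of combining these tags into the class labels used in the experiments (e.g. STAND_TURNED_LEFT) is sketched below in Java; this helper is hypothetical and only illustrates the mapping, it is not the actual implementation of the skill.

// Hypothetical helper: combine the semantic tags returned by the ASR Skill
// into the pose label used to tag the training examples.
public final class PoseLabelBuilder {

    public static String build(String position, String action, String direction) {
        return position + "_" + action + "_" + direction;
    }

    public static void main(String[] args) {
        // "de pie girado a la izquierda" -> STAND, TURNED, LEFT
        System.out.println(build("STAND", "TURNED", "LEFT"));  // prints STAND_TURNED_LEFT
    }
}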


References

[1] SUSHMITA MITRA AND TINKU ACHARYA. Gesture Recognition: A Survey. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews), 37(3):311–324, May 2007. 2, 15

[2] A JAUME-I CAPÓ AND JAVIER VARONA. Representation of human postures for vision-based gesture recognition in real-time. Gesture-Based Human-Computer, pages 102–107, 2009. 2, 15

[3] SERGI FOIX, G. ALENYA, AND C. TORRAS. Lock-in Time-of-Flight (ToF) Cameras: A Survey. IEEE Sensors Journal, 11(99):1, 2011. 2, 13

[4] DANIEL SCHARSTEIN AND RICHARD SZELISKI. High-Accuracy Stereo Depth Maps Using Structured Light. Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, 1:195, 2003. 2, 14

[5] B. FREEDMAN, A. SHPUNT, M. MACHLINE, AND Y. ARIELI. Depth mapping using projected patterns, October 2008. 2, 25

[6] VARIOUS AUTHORS. Kinect Entry at the Wikipedia, June 2011. 2

[7] MA SALICHS, R. BARBER, AM KHAMIS, M. MALFAZ, JF GOROSTIZA, R. PACHECO, R. RIVAS, ANA CORRALES, E. DELGADO, AND D. GARCIA. Maggie: A robotic platform for human-robot social interaction. In 2006 IEEE Conference on Robotics, Automation and Mechatronics, pages 1–7, 2006. 2, 17

[8] TERRENCE FONG, ILLAH NOURBAKHSH, AND KERSTIN DAUTENHAHN. A survey of socially interactive robots. Robotics and Autonomous Systems, 42(3-4):143–166, 2003. 7

[9] MICHAEL A. GOODRICH AND ALAN C. SCHULTZ. Human-Robot Interaction: A Survey. Foundations and Trends® in Human-Computer Interaction, 1(3):203–275, 2007. 7

[10] BRENNA D. ARGALL, SONIA CHERNOVA, MANUELA VELOSO, AND BRETT BROWNING. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009. 8, 9, 10

[11] SRIDHAR MAHADEVAN, GEORGIOS THEOCHAROUS, AND NIKFAR KHALEELI. Rapid Concept Learning for Mobile Robots. Autonomous Robots, 5(3):239–251, 1998. 10

[12] TIM VAN KASTEREN, ATHANASIOS NOULAS, GWENN ENGLEBIENNE, AND BEN KRÖSE. Accurate activity recognition in a home setting. In Proceedings of the 10th international conference on Ubiquitous computing, UbiComp '08, pages 1–9, New York, NY, USA, 2008. ACM. 11

[13] STEPHANIE ROSENTHAL, J. BISWAS, AND M. VELOSO. An effective personal mobile robot agent through symbiotic human-robot interaction. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: volume 1-Volume 1, pages 915–922. International Foundation for Autonomous Agents and Multiagent Systems, 2010. 11

[14] BURR SETTLES. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2010. 11

[15] MAYA CAKMAK, CRYSTAL CHAO, AND ANDREA L THOMAZ. Designing Interactions for Robot Active Learners. IEEE Transactions on Autonomous Mental Development, 2(2):108–118, June 2010. 11, 12

[16] PRIMESENSE LTD. PrimeSense's PrimeSensor Reference Design 1.08, June 2011. 13, 26

[17] A KOLB, E BARTH, AND R KOCH. ToF-sensors: New dimensions for realism and interactivity. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW '08. IEEE Computer Society Conference on, pages 1–6, 2008. 14

[18] ANDREAS KOLB, ERHARDT BARTH, REINHARD KOCH, AND RASMUS LARSEN. Time-of-flight sensors in computer graphics. In Eurographics State of the Art Reports, pages 119–134, 2009. 15

[19] S B GOKTURK AND C TOMASI. 3D head tracking based on recognition and interpolation using a time-of-flight depth sensor. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, 2, pages II–211 – II–217 Vol.2, 2004. 15, 16

[20] XIA LIU AND K FUJIMURA. Hand gesture recognition using depth data. In Automatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE International Conference on, pages 529–534, May 2004. 15

[21] PIA BREUER, CHRISTIAN ECKES, AND STEFAN MÜLLER. Hand Gesture Recognition with a Novel IR Time-of-Flight Range Camera - A Pilot Study. In ANDRÉ GAGALOWICZ AND WILFRIED PHILIPS, editors, Computer Vision/Computer Graphics Collaboration Techniques, volume 4418 of Lecture Notes in Computer Science, pages 247–260. Springer Berlin / Heidelberg, 2007. 15

[22] HERVÉE LAHAMY AND DEREK LITCHI. Real-time hand gesture recognition using range cameras. In The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences [on CD-ROM], page 38, 2010. 15

[23] NADIA HAUBNER, ULRICH SCHWANECKE, R. DÖRNER, SIMON LEHMANN, AND J. LUDERSCHMIDT. Recognition of Dynamic Hand Gestures with Time-of-Flight Cameras. In ITG / GI Workshop on Self-Integrating Systems for Better Living Environments 2010: SENSYBLE 2010, pages 1–7, 2010. 15

[24] K NICKEL AND R STIEFELHAGEN. Visual recognition of pointing gestures for human-robot interaction. Image and Vision Computing, 25(12):1875–1884, December 2007. 16

[25] DAVID DROESCHEL, JÖRG STÜCKLER, AND SVEN BEHNKE. Learning to interpret pointing gestures with a time-of-flight camera. In Proceedings of the 6th international conference on Human-robot interaction, HRI '11, pages 481–488, New York, NY, USA, 2011. ACM. 16

[26] RONAN BOULIC, JAVIER VARONA, LUIS UNZUETA, MANUEL PEINADO, ANGEL SUESCUN, AND FRANCISCO PERALES. Evaluation of on-line analytic and numeric inverse kinematics approaches driven by partial vision input. Virtual Reality, 10(1):48–61, April 2006. 16

[27] YOUDING ZHU, BEHZAD DARIUSH, AND KIKUO FUJIMURA. Kinematic self retargeting: A framework for human pose estimation. Computer Vision and Image Understanding, 114(12):1362–1375, December 2010. 16

[28] ARNAUD RAMEY, VÍCTOR GONZÁLEZ-PACHECO, AND MIGUEL A SALICHS. Integration of a low-cost RGB-D sensor in a social robot for gesture recognition. In Proceedings of the 6th international conference on Human-robot interaction - HRI '11, page 229, New York, New York, USA, 2011. ACM Press. 16

[29] OPENNI MEMBERS. OpenNI web page, June 2011. 16, 26

[30] ANA CORRALES, R. RIVAS, AND MA SALICHS. Sistema de identificación de objetos mediante RFID para un robot personal. In XXVIII Jornadas de Automática, pages 50–54, Huelva, 2007. Comité Español de Automática. 18

[31] R. BARBER. Desarrollo de una arquitectura para robots móviles autónomos. Aplicación a un sistema de navegación topológica. PhD thesis, Universidad Carlos III de Madrid, 2000. 20

[32] R. RIVAS, ANA CORRALES, R. BARBER, AND MA. Robot skill abstraction for AD architecture. 6th IFAC Symposium on Intelligent Autonomous Vehicles, 2007. 20

[33] ERICH GAMMA, RICHARD HELM, RALPH JOHNSON, AND JOHN VLISSIDES. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Professional, 1995. 22

[34] M. QUIGLEY, B. GERKEY, K. CONLEY, J. FAUST, T. FOOTE, J. LEIBS, E. BERGER, R. WHEELER, AND A. NG. ROS: an open-source Robot Operating System. In Open-Source Software workshop of the International Conference on Robotics and Automation (ICRA), 2009. 23

[35] PRIMESENSE LTD. PrimeSense's Frequently Asked Questions (FAQ) website, June 2011. 25

[36] Z. ZALEVSKY, A. SHPUNT, A. MAIZELS, AND J. GARCIA. Method and System for Object Reconstruction, April 2007. 25

[37] MARK HALL, EIBE FRANK, GEOFFREY HOLMES, BERNHARD PFAHRINGER, PETER REUTEMANN, AND I.H. WITTEN. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009. 28

[38] J R QUINLAN. C4.5: programs for machine learning. Morgan Kaufmann series in machine learning. Morgan Kaufmann Publishers, 1993. 28

[39] SB KOTSIANTIS. Supervised machine learning: A review of classification techniques. Informatica, 31(3):249–268, 2007. 28, 53

[40] F. ALONSO-MARTIN AND MIGUEL SALICHS. Integration of a voice recognition system in a social robot. Cybernetics and Systems, 42(4):215–245, May 2011. 35

[41] F ALONSO-MARTIN, ARNAUD A RAMEY, AND MIGUEL A SALICHS. Maggie: el robot traductor. In UPM, editor, 9 Workshop RoboCity2030-II, number 9, pages 57–73, Madrid, 2011. Robocity 2030. 42

[42] D. CROCKER AND P. OVERELL. Augmented BNF for Syntax Specifications: ABNF. RFC 2234, Internet Engineering Task Force, November 1997. 61


This document is published under the (CC) BY-SA license.

You are free to:

Share - to copy, distribute and transmit this document.

Remix - to adapt the document.

Make commercial use of the document.

Under the following conditions:

Attribution - You must attribute the document in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).

Share Alike - If you alter, transform, or build upon this document, you may distribute the resulting work only under the same or similar license to this one.