Hand Gesture Recognition for Human-Robot Interaction

JESÚS IGNACIO BUENO MONGE

Master of Science Thesis Stockholm, Sweden 2006


Master's Thesis in Computer Science (20 credits)
at the School of Computer Science and Engineering,
Royal Institute of Technology, year 2006

Supervisor at CSC: Danica Kragic
Examiner: Henrik Christensen

TRITA-CSC-E 2006:070
ISRN-KTH/CSC/E--06/070--SE
ISSN-1653-5715

Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.csc.kth.se


Abstract

This Master's thesis deals with the problem of visual gesture recognition. New forms of interaction for controlling systems have been an intensive field of research, since current methods do not feel very natural. The computer vision system presented here recognizes not only a gesture described by the user's movement, but also the posture the hands keep during the interaction. It is a general purpose system for commanding different applications. It was successfully tested controlling a robot navigation system, which recognized commands such as start, stop, turn right/left, speed up/down, etc.

The robustness of the process relies on its skin segmentation. It uses a two-class Bayes framework for skin and background classification. The parametric probability distributions are modeled by a Mixture of Gaussians. The model adapts itself to illumination changes in the scene throughout the execution time. Hand postures are identified using Principal Component Analysis, tracking is performed using Kalman filters, and finally, gestures are recognized using Hidden Markov Models.

Experimental evaluation shows how the integration of tracking and an adaptive color model supports the robustness and flexibility of the system when illumination changes occur, providing a good base for later gesture and posture recognition.


Contents

1 Introduction
  1.1 The goal of the thesis
  1.2 Outline
2 System background
  2.1 Segmentation
  2.2 Tracking
  2.3 Posture recognition
  2.4 Gesture recognition
  2.5 Color adaptation
  2.6 System architecture
3 Skin color space
  3.1 Color transformation
  3.2 Skin pixels distribution
4 Segmentation
  4.1 Color modeling
  4.2 Classification
  4.3 Blob searching
  4.4 Morphology
  4.5 Training
5 Tracking
  5.1 Predictions
  5.2 Tracking issues
6 Color adaptation
  6.1 On-line adaptive learning
  6.2 New skin pixels
7 Posture recognition
  7.1 Principal Component Analysis
  7.2 Training
8 Gesture recognition
  8.1 Gesture partitioning
  8.2 Hidden Markov Models
  8.3 Topology of models
  8.4 Recognition
  8.5 Training
9 Experimental evaluation
  9.1 Color model
  9.2 Color adaptation
  9.3 Updating policy
  9.4 Posture recognition
  9.5 Gesture recognition
  9.6 Speed and optimization
10 Conclusion
11 Improvements and future work
12 Other applications for the system
References
A User guide
  A.1 Running the system
  A.2 Training


1 Introduction

Over the next few decades, society will experience significant aging [1]. This demographic shift requires new services for managed care and new facilities for providing assistance to the elderly and disabled. One of the potential solutions that has been strongly advocated is the use of robotic appliances to provide services such as cleaning, getting dressed, or mobility assistance. For such systems to be truly useful to ordinary users, the facilities for interaction and instruction have to be made natural and easy.

A service robot that moves autonomously in a regular, dynamic environment and performs tasks such as cleaning is considered one of the main research goals in the field of robotics. Considering our aging society and the upcoming need for different types of care systems, service robot development is easily motivated. Mobile robots are already capable of moving around autonomously in known or partially unknown environments, fulfilling tasks such as door opening or simple object manipulation. In general, rather than executing tasks in a pre-programmed manner, the aim is to develop a system that allows the robot to acquire knowledge about the environment through interaction with the user and use that knowledge to plan task execution. Naturally, there is a need for a system that enables the user to demonstrate the task to the robot, which then learns from observation. Such a teaching scenario requires a system that is able to not only recognize and track the user, but also understand and interpret his/her instructions in a Programming by Demonstration (PbD) manner [2], [3].

Figure 1: Example scenarios for gesture recognition: left) instructing a robot, and right) Programming by Demonstration.

Interaction between human users and robots (HRI) has become an important research problem [4], [5]. Since its beginnings, different devices have been used for that purpose (keyboards, joysticks, tactile screens, game pads...). But none of them resembles the natural way of human communication [6], which considers not only voice, but also the gestures made during a conversation. Gestures make our communication richer and provide people with additional information which would not be easy to guess from the words alone (assent, negation, doubt,


pointing...) [7].

Therefore, hand gestures are one of the most appealing instruction methods. Many different kinds of systems have been proposed for gesture recognition [7], [6], [8]. Some of them use special devices, for example gloves, to capture finger and hand movements. This makes the interaction less natural, but makes the system easier to design. One option that avoids the use of any non-natural device is a computer vision system. The main disadvantage of vision systems is their complexity, not only in producing reliable results, but also because of the execution time the process can take. It should be as close to real time as possible.

HRI and PbD settings consider the use of both posture and gesture based interfaces. A posture is a static configuration of fingers and hands, and it is usually extracted from a single image. Regarding gestures, a sequence of images is used to track human hand motion and compare it to a temporal gesture model.

The current system uses both postures and gestures. For some tasks, such as instructing the robot to stop, a defined configuration of the hand is sufficient. In a PbD setting, a pointing posture is also very frequent.

1.1 The goal of the thesis

I was given the task of testing and improving previous work developed at the Centre for Autonomous Systems (CAS), Royal Institute of Technology. That previous work is a Master's thesis which studied the problem of vision based gesture recognition and implemented a complete gesture recognition system [9]. That work was intended to be incorporated into the Intelligent Service Robot (ISR) project at the same department. The ISR project was aimed at developing a basic robot architecture and prototype tasks for a domestic or office robot-assistant, but this gesture recognition system would be used in a new service robot scenario for natural task instruction.

The requirements for the new system are more wide-ranging than for the previous work. The new system was required to recognize gestures generated by two hands. Apart from gesture recognition, which requires a temporal model, the system was also required to implement hand posture recognition to provide different types of user-robot interaction. In contrast to gesture recognition, posture recognition commonly uses a single image to recognize one of the hand postures from a discrete set. Since it is based on a single image, posture recognition requires a reliable hand segmentation, because a hand posture is determined by finger configuration. If one of the fingers is lost during the hand segmentation process, this may cause a recognition failure.

Since the system was supposed to use images from a camera, color segmentation had to be implemented to provide a base for extracting the hands from the image. The technique used for that purpose has to work under different illumination conditions and with different skin colors. The system has to work in a domestic environment where illumination conditions can change significantly during user-robot interaction, and the system has to be robust enough to continue the reliable segmentation of the hands while the user, for example, walks near a window or below a lamp.

The previous work had some weaknesses and did not meet the requirements posed for this work. First of all, hand segmentation results were not


suitable for posture recognition. The technique used in the mentioned thesis provided only big, blob-like skin regions which strongly altered the original shape of the hand due to successive morphological processing. As a result, the segmented region representing the hand was not detailed enough to represent the clear difference between finger configurations. The second most important weakness of the previous work was that it was not robust to illumination changes in the scene. This is due to the adopted modeling, which uses a simple threshold based color segmentation technique in the HSV color space. The thresholds were updated after a gesture was recognized, but after some initial evaluation of the system it was clear that a simple updating policy using thresholding in the HSV space was not sufficient to meet the requirements of the new system. On the other hand, after some evaluation it was concluded that the tracking and gesture recognition parts performed well also in terms of the new system, with the exception that the previous work only recognized gestures described with the right hand.

In conclusion, the objective of this thesis was to design, implement, and test a complete gesture recognition system which improves the previous work in the points discussed above. Hand segmentation had to be re-designed from scratch in order to cover all the requirements of the new version. Hand posture recognition had to be included as a new feature. The tracking process and gesture recognition were adopted from the previous system and were not re-designed, apart from being adapted for integration in the new system architecture.

1.2 Outline

Chapter 2 describes the different steps of a recognition process. The following chapters explain in detail the chosen color model (chapter 3), the segmentation technique (chapter 4), tracking (chapter 5), color model updating (chapter 6), and posture (chapter 7) and gesture (chapter 8) recognition. Chapter 9 presents some experiments and results. Conclusions are summed up in chapter 10. Chapters 11 and 12 discuss improvements that could be made in future work, and other applications where the system can be useful. Finally, appendix A is a user guide for working with the programs developed (trainers and recognizer).


2 System background

The new system will have to support all the requirements described in the previous section (1.1). Thus, the new system architecture will be designed to meet those requirements. In addition, the tracking and gesture recognition parts from the previous work will need to be integrated in the new design.

The new architecture will be designed from the classical global vision based gesture interpretation system proposed in [6] (Fig. 2).

Figure 2: Flow diagram of the classical global vision based gesture interpretation system.

The two initial steps from the classical model will be divided into four: data analysis will be separated into image segmentation and tracking, while recognition will consist of static postures and dynamic gestures.

2.1 Segmentation

The segmentation step is the first one in the process. It is responsible for providing measurements which are robust to illumination changes, and thus for providing well defined skin shapes from every frame to the posture recognizer.

The color segmentation step takes an image from the input device and produces as output a new binary image in which each pixel is classified as skin or background.

There are many image segmentation techniques that can be considered when designing a color segmentation module [10], [11]. Some of these are mentioned below.

Histogram thresholding is based on constructing color and hue histograms. Images are then thresholded at their most clearly separated peaks, [12], [13]. But this thresholding technique is not possible when the histograms are unimodal, which usually happens when the image consists of a large background area and small regions to segment.

Region growing starts by generating initial pixels called seeds, and grows the regions around them using some homogeneity criteria, [14]. There have been examples of work that use this approach, since it is fast and robust when the thresholds for the initial seeds are well defined, [15], [16]. That is not the case in this thesis, since this system is required to be as independent as possible of the skin color of the user.


Non-parametric statistical approaches, where the model is not specified but is determined by data. Their main advantage is that they provide a probability density function that can be easily estimated, [17]. As a drawback, histogram approaches require a considerable amount of training data, and adapting a histogram is a slow process when there is a large amount of training data to process.

Parametric statistical approaches. Compared to non-parametric ones, it is harder to calculate the probability densities. But as a main advantage, they can adapt the density functions very fast to new colors, since they do not take into account the old training data, which does not need to be so abundant.

A parametric statistical method presents the most suitable advantages for the purpose of our system. It will adapt fast and continuously to the new skin colors in the scene. It will be slower to segment a frame, but still fast enough to perform the recognition in real time.

As a result of this first segmentation step, a new binary image is produced. That binary image marks each pixel as skin or background, and it is then necessary to find regions by connecting skin pixels.

The run length algorithm explained in [18] is a fast approach to do that. It presents the advantage of calculating, during its own passes, some geometrical moments that will be useful in later steps. Those moments are: position, center of mass, weight, maximum and minimum coordinates, and orientation.

The segmentation process will produce noise and some pixel misclassification, which will require some morphological operations to fix the connected regions that are found.

Figure 3: Flow diagram of the segmentation process.

At the moment, the segmentation diagram looks as shown in Fig. 3. In this diagram, an important fact is not shown: it is not defined in which color space the skin model will be represented. If that color space is different from that of the input image which comes from the camera, a new step will be necessary to perform the transformation. To make this color transformation faster, in case it needs to be performed, a look-up table should be used.

As an example, most cameras deliver either a raw image format (RGB0) or an RGB image. If we want to use some other color representation, such as HSV, the values of the pixels in the original image have to be transformed to this new space.


Input from the camera will be images 320 pixels wide and 240 high. This is a size that almost all cameras support, and it represents a good trade-off between the time required to process the image and the resolution required to detect the user's hands and head.

2.2 Tracking

Kalman filters were used in the previous work to track the hands and head of the user. These filters have often been used in interactive computer graphics, [19], [20].

Their main disadvantage is their weak robustness to occlusion, but since solving that is not one of the goals of the thesis, the previous tracking subsystem will be adapted to the new system.

2.3 Posture recognition

Continuous posture recognition is a new step that has to be implemented in the new system. It needs to be performed frame by frame for both hands, therefore the chosen technique has to be fast and reliable at the same time. Other current approaches use different pattern classification methods to perform it:

Geometric moments such as area, perimeter, major axis length, minor axis length, eccentricity, and some invariant moment functions are commonly used. This technique presents the disadvantage of being dependent on the order in which the moments are used in discrimination. It works fine when the objects to classify are very different, but in the examples considered in our work the major differences come from the finger configuration, which cannot be easily discriminated with the moment based representation alone.

Dimensionality reduction techniques. The most used are Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA). The first one (LDA) is a technique used to find the linear discriminant functions of a set of variables which best classify the samples. The second one (PCA) is used to simplify a dataset by reducing samples into a new space. It is commonly thought that LDA outperforms PCA because it is directly related to class discrimination, but as stated in [21], when the training dataset is small PCA is more reliable.

Therefore, PCA is the recognition technique that the system will use to perform the new continuous posture recognition process.

2.4 Gesture recognition

Hidden Markov Models are a technique widely used for speech and gesture recognition, [22], [23]. The previous work implements a gesture module which uses positions and their differences for recognition. The results that this module achieves are good enough to support the new system requirements; therefore, like the tracking module, it will not be re-designed but adapted to be integrated into the new system. That adaptation includes the use of both hands to describe gestures instead of only the right one.


2.5 Color adaptation

Color adaptation is the most important issue for the new system. This module will be in charge of updating the color model in such a way that the adaptation really helps the system to perform better in a dynamic environment. To manage that, some interactions between different modules need to be designed.

The adaptation has to be done after the frame has been segmented, because it needs to use the pixels that have been marked as skin to update the model. In order to make the model move towards a new color region, it is not enough if the system uses only skin that has already been segmented correctly. Some new pixels should be provided to make the model learn new colors. Those pixels can only be taken from real blobs, and they will be the pixels that are repaired from noise mistakes (holes in the blobs).

On the other hand, an updating policy should be designed. It is not useful to update every frame, because the system can fail easily when no skin is being recognized. Some confidence will be needed to decide when to update the model, and that can be provided by the tracking module. In the previous work, the model was updated only after a gesture was completely recognized. In this new system it should be done more often, because there is no use in waiting so long.

2.6 System architecture

In summary, the new system architecture will evolve as shown in Fig. 4. The different components which form the system are presented in the remainder of the report and evaluated in section 9.

The center of interaction in the system will be the model updating module, which, as said before, will use information from the tracking results, the original image, and the previous model to adapt it towards the new colors in future frames. The gesture and posture subsystems are isolated in the architecture, since they do not present any relationship between them. Posture recognition will be continuous and performed frame by frame, but gesture recognition will be performed only when the user moves his/her hands.


Figure 4: Flow diagram of the system.

3 Skin color space

Skin color detection and modelling have been frequently used for face detection. A recent survey [24] presents a detailed overview of commonly used color spaces. Some of these include RGB [25], normalized RGB [26, 27], HSV [28], YCrCb [29] and YUV [30]. In [31], skin chrominance models and their performance have been evaluated. For real-world applications and dynamic scenes, color spaces that separate the chrominance and luminance components of color are typically considered preferable. The main reason is that if only the chrominance-dependent components of color are considered, increased robustness to illumination changes can be achieved.

3.1 Color transformation

For the purpose of this thesis, RGB is a better choice than, for example, HSV to model skin color, [32]. The HSV family presents lower reliability when the scenes are complex and contain similar colors such as wood textures.

Moreover, in order to transform a frame it would be necessary to convert each pixel to the new color space, which can be avoided if the camera provides RGB images directly, as most of them do. In case a camera could not do that, a color transformation would be required. Each image is 320 pixels wide and 240 high, which means that 320 × 240 = 76800 transformations per image would be needed. In order to avoid such a huge amount of heavy calculations, the system would use a look-up table where all possible transformations are precalculated. Each transformation is then reduced to a memory access. The system uses 24 bit images, which means 2^24 different colors, and with three bytes per color the table requires 48 MB of memory.
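As an illustration, the following is a minimal numpy sketch of the look-up table idea; it is not the thesis implementation, and the vectorized `transform` argument and function names are assumptions for the example:

```python
import numpy as np

def build_lut(transform):
    """Precompute `transform` for all 2^24 possible RGB triples so that each
    later per-pixel conversion becomes a single memory access (three bytes
    per entry, about 48 MB). `transform` maps an (N, 3) uint8 RGB array to
    an (N, 3) uint8 result."""
    r, g, b = np.meshgrid(np.arange(256, dtype=np.uint8),
                          np.arange(256, dtype=np.uint8),
                          np.arange(256, dtype=np.uint8), indexing='ij')
    rgb = np.stack([r.ravel(), g.ravel(), b.ravel()], axis=1)  # (2**24, 3)
    return transform(rgb)

def convert(image, lut):
    """Apply a precomputed LUT to an H x W x 3 uint8 image."""
    idx = (image[..., 0].astype(np.int64) << 16) \
        | (image[..., 1].astype(np.int64) << 8) \
        | image[..., 2].astype(np.int64)
    return lut[idx]
```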


As shown in section 9.6, the system requires 10% of the total execution time when it needs to transform images into a new color space.

3.2 Skin pixels distribution

The system needs an initial training with skin colors. This step is done using an off-line training process with a set of example images that are manually segmented, from which some histograms are calculated. These histograms give the system an initial part of the whole color space where pixels will be considered to be skin.

The following three-dimensional plots (Fig. 5) show how skin pixels are distributed. The plots are generated from skin pixels of one single user, but under four different illumination conditions and taken with different cameras. The cloud of points which represents the skin adapts its shape and moves when the light changes, which motivates the idea of using an adaptive classification model.

The plots (Fig. 5) also show the difference between choosing to include head pixels in the model or not. Lips, eyes, and some hair colors fall outside the cloud of skin points, but they do not disturb it, because all the other head pixels fit well inside the cloud, helping to define it better.

Figure 5: Same user in four different scenes and with different cameras. Green points represent skin pixels from both hands and head, while the red points represent points which belong only to hands.

At the beginning, it is recommended to have a complex initial model (Fig. 6) covering as many users and illumination conditions as possible. Then, when a new user uses the system, it will segment some initial patches of skin and adapt to them.


Figure 6: Skin pixels in RGB color space.

4 Segmentation

This section describes the main step in the recognition process. Later tracking and recognition highly depend on the segmentation results. Details about how Mixtures of Gaussians are used to model skin color are given in section 4.1. The Bayes framework which classifies pixels using that probabilistic model is shown in 4.2. In 4.3 and 4.4, blob searching and the morphological operations used to fix the chosen blobs are discussed. Finally, 4.5 gives some ideas about how training should be performed.

4.1 Color modeling

It is not easy to choose the pixel segmentation technique to use. Some approaches use thresholds to determine the skin region inside the whole color space [9]. In fact, this is an easy and fast way to do it, but it is usually not a good method considering the skin pixel space shown in Fig. 6. This system also requires a good model updating technique, and modifying thresholds is not a very reliable way of achieving a robust system.

Another option is the use of non-parametric probabilistic models [17]. They are much more complex than the previous one, as they use Bayes' rule to decide which pixels are skin and which are background. The results are commonly much better, but they have a problem that makes them impracticable for this system: it takes too long to update the model with new pixels, and a lot of information from the past has to be saved.

As discussed in section 2.1, the method adopted in this work is a probabilistic


model, but a parametric one. From the histograms made during the training step, a set of normal Gaussian distributions is fit to the data (Fig. 7), creating a Mixture of Gaussians, [33]. A Mixture of Gaussians is more appropriate than a single Gaussian model to segment human skin, as was tested in [34] and [35]. In section 9.1, different numbers of components are tested for the mixture.

P(x) = \sum_{j=1}^{n} w(j) \, g(x; \mu(j), \Sigma(j)), \qquad \sum_{j=1}^{n} w(j) = 1

where g(x; \mu(j), \Sigma(j)) represents a single normal distribution.

Figure 7: Example of a Mixture of Gaussians (4 components).

The three color variables (R, G, and B) are not independent. In fact, the plots in Fig. 8 show how RGB skin colors are highly correlated.

The estimation is calculated using the full covariance matrices. If the correlation were not like this, the covariances could be modeled by simple independent variances, but we model the covariance matrix as:

\mathrm{cov} = \begin{pmatrix} \sigma_r^2 & \sigma_r\sigma_g & \sigma_r\sigma_b \\ \sigma_g\sigma_r & \sigma_g^2 & \sigma_g\sigma_b \\ \sigma_b\sigma_r & \sigma_b\sigma_g & \sigma_b^2 \end{pmatrix}

This estimation is calculated using the Expectation Maximization (EM) algorithm, [36], [37].


Figure 8: Skin pixels correlation (R-G, R-B, and G-B).

The EM algorithm is an iterative optimization method, where initial values for the means and covariances are calculated by randomly distributing the points among the mixture components (k points per component). Each component starts with the same weight:

\mu(j) = \frac{1}{k} \sum_{i=1}^{k} x_i

\Sigma(j)_{nm} = \frac{1}{k-1} \sum_{i=1}^{k} (x_{in} - \mu(j)_n)(x_{im} - \mu(j)_m)

w(j) = \frac{1}{N}

The algorithm consists of two steps. During the Expectation step (E step), the posterior probability of each pixel belonging to each component is calculated; these probabilities then have to be normalized:

E(j, i) = \frac{w(j)}{\sqrt{(2\pi)^3 |\Sigma(j)|}} \, e^{-\frac{1}{2}(x_i - \mu(j))' \Sigma(j)^{-1} (x_i - \mu(j))}

E(j, i) \leftarrow \frac{E(j, i)}{\sum_{j=1}^{N} E(j, i)}

The next step (M step) re-calculates the new weights, means and covariance matrices which will be used in the following iteration:

\mu(j) = \frac{\sum_{i=1}^{k} E(j, i) \, x_i}{\sum_{i=1}^{k} E(j, i)}

\Sigma(j)_{nm} = \frac{\sum_{i=1}^{k} E(j, i)(x_{in} - \mu(j)_n)(x_{im} - \mu(j)_m)}{\sum_{i=1}^{k} E(j, i)}

w(j) = \frac{\sum_{i=1}^{k} E(j, i)}{\sum_{j=1}^{N} \sum_{i=1}^{k} E(j, i)}
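To make the procedure concrete, here is a minimal numpy sketch of EM for a full-covariance mixture over RGB pixels. It is an illustration under the assumptions above (random initialization, mean-displacement convergence test), not the thesis implementation; all names are mine:

```python
import numpy as np

def em_gmm(X, n_components=3, tol=0.05, max_iter=100, seed=0):
    """Fit a Mixture of Gaussians with full covariances to pixels X (N x 3)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialization: distribute the points randomly among the components.
    assign = rng.integers(n_components, size=N)
    mu = np.stack([X[assign == j].mean(axis=0) for j in range(n_components)])
    cov = np.stack([np.cov(X[assign == j].T) for j in range(n_components)])
    w = np.full(n_components, 1.0 / n_components)
    for _ in range(max_iter):
        # E step: posterior responsibility E[j, i] of component j for pixel i.
        E = np.empty((n_components, N))
        for j in range(n_components):
            diff = X - mu[j]
            inv = np.linalg.inv(cov[j])
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov[j]))
            E[j] = w[j] * np.exp(-0.5 * np.einsum('ni,ij,nj->n', diff, inv, diff)) / norm
        E /= E.sum(axis=0, keepdims=True)
        # M step: re-estimate weights, means and covariance matrices.
        mu_old = mu.copy()
        Nj = E.sum(axis=1)
        w = Nj / N
        mu = (E @ X) / Nj[:, None]
        for j in range(n_components):
            diff = X - mu[j]
            cov[j] = (E[j, :, None] * diff).T @ diff / Nj[j]
        # Convergence: largest mean displacement below the error threshold.
        if np.linalg.norm(mu - mu_old, axis=1).max() < tol:
            break
    return w, mu, cov
```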


Convergence is reached when the distance between two consecutive means is below an error threshold (0.05). Choosing this threshold is not a critical decision, since the aim of the model is to be able to find some skin patches at the beginning of the execution. The model will then adapt to the new light and the user's skin color.

A further explanation of the Expectation Maximization algorithm for a Mixture of Gaussians can be found in [38].

The next important decision is: how many components (n) should each model use? This was decided empirically. The skin model is not exact enough when using just two components, while using four or five is similar to using three but requires more calculations per pixel. More details and evaluation are provided in section 9.1.

4.2 Classification

Once the model is available, a Bayes classifier is used to estimate the probability of a pixel being skin or background. The system therefore needs to keep two color models, one for skin colors and another for the background color.

For classification, the system uses ideas proposed in [17], but differently from their work, it uses Mixtures of Gaussians for color modeling instead of non-parametric histograms. In the following, the conditional density is denoted P(rgb | FG) for the skin regions and P(rgb | BG) for the background, with rgb \in R^3.

The posteriors P(BG | rgb) and P(FG | rgb) are estimated using Bayes' formula, and the classification boundary is drawn where the ratio of P(FG | rgb) and P(BG | rgb) exceeds some threshold T. T is a relative risk factor associated with misclassification:

T < \frac{P(FG \mid rgb)}{P(BG \mid rgb)} = \frac{P(rgb \mid FG) \, P(FG)}{P(rgb \mid BG) \, P(BG)}

In other words, when the above is satisfied, the pixel under consideration is labeled as skin. The value T was chosen empirically from a set of images. A ROC curve was calculated from three different sets of image samples (Fig. 9). The ROC curve shows the number of hits (the fraction of true skin pixels which are correctly classified) against the number of false alarms (the fraction of background pixels which are incorrectly classified as skin).

Depending on the complexity of the sample sets (the background can contain a door, table or shelf), the ROC line shows more or fewer false alarms. A high threshold value (nearly 1.0) would give the best hit rate, but that is not what the system requires. The most important issue is to decrease false alarms. If false alarms are too frequent, they can confuse the updating process at the beginning of the execution and move the model towards an incorrect color (for example, a piece of wood).

In conclusion, 0.4 is the threshold chosen, because it is much better for the system to get a false alarm rate below 2%, even though it means decreasing the hit rate to between 55 and 90% (depending on the sample set). That is not a problem for the system, because it will soon adapt to the user's skin color and then the hit rate will rise fast in the next frames.
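As an illustration, the ratio test can be written as follows, reusing mixtures fitted by `em_gmm` above; the function names are mine, and the equal priors and T = 0.4 are the assumptions discussed in the text:

```python
import numpy as np

def gmm_pdf(X, w, mu, cov):
    """Mixture density evaluated at each pixel (rows of X)."""
    d = X.shape[1]
    p = np.zeros(len(X))
    for wj, mj, cj in zip(w, mu, cov):
        diff = X - mj
        inv = np.linalg.inv(cj)
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cj))
        p += wj * np.exp(-0.5 * np.einsum('ni,ij,nj->n', diff, inv, diff)) / norm
    return p

def skin_mask(image, skin_model, bg_model, T=0.4, prior_fg=0.5):
    """Pixelwise ratio test P(FG | rgb) / P(BG | rgb) > T on an H x W x 3
    image; skin_model and bg_model are (w, mu, cov) triples from em_gmm."""
    X = image.reshape(-1, 3).astype(float)
    p_fg = gmm_pdf(X, *skin_model) * prior_fg
    p_bg = gmm_pdf(X, *bg_model) * (1.0 - prior_fg)
    return (p_fg > T * p_bg).reshape(image.shape[:2])
```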


Figure 9: ROC curve calculated from three different sets of images.

4.3 Blob searching

The result of the classification step is a binary image. That image shows which pixels were classified as skin and which as background. The next step is to create regions from adjacent pixels.

Supposing the user is always the only person in the image and that there is no interference with the background (for example, a big wood surface), it is reasonable to take the three biggest blobs as being the head and the two hands. But it is also possible for the user not to show both hands in every frame, so some checking should always be done (size, position, etc.).

A run length algorithm is used to search for blobs while also calculating some moments of the blobs (position, centroid, size, direction...). The size is used to choose blobs within an expected maximum and minimum size, position and centroid are used for tracking, and direction is only used when drawing tracking results on the original image. This algorithm works better searching for neighbors in the eight possible directions (8-connectivity) rather than using 4-connectivity, because sometimes a poorly segmented finger is connected to the palm only by a pixel on the diagonal. (A combined code sketch of this step and the morphological clean-up is given at the end of section 4.4.)

4.4 Morphology

The original blobs received after segmentation are not good enough to use directly for PCA recognition, which is rather sensitive to small shape changes. They are full of noise and holes, and it is necessary to make them more suitable for the recognition step. The first step is to fill the holes of each of the segmented


regions. As shown in Fig. 10, a small change was introduced to the classical morphological closing. After the initial dilation, the perimeter of the region is estimated, which is then used to fill the region inside it. After this, all the pixels along the perimeter are removed to retrieve the original outer shape of the region. This last step is done because the dilation can add some non-skin pixels to the region, and if they are added frequently, the model can get used to them, performing a worse segmentation in future frames.
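The following is a combined sketch of the blob search (section 4.3) and this modified closing, substituting scipy's connected-component labelling for the run-length pass and approximating the perimeter-removal step with an erosion; the size bounds and structuring elements are illustrative assumptions:

```python
import numpy as np
from scipy import ndimage

def find_blobs(mask, min_size=100, max_size=20000):
    """Connected skin regions with 8-connectivity plus simple moments."""
    labels, n = ndimage.label(mask, structure=np.ones((3, 3), dtype=int))
    blobs = []
    for i in range(1, n + 1):
        ys, xs = np.nonzero(labels == i)
        if min_size <= len(xs) <= max_size:
            blobs.append({'label': i,
                          'size': len(xs),
                          'centroid': (xs.mean(), ys.mean()),
                          'bbox': (xs.min(), ys.min(), xs.max(), ys.max())})
    # Take the three largest candidates as the head and the two hands.
    return sorted(blobs, key=lambda b: b['size'], reverse=True)[:3]

def clean_blob(blob_mask):
    """Approximate the modified closing of Fig. 10: dilate, fill the holes,
    then erode to recover roughly the original outline."""
    s = np.ones((3, 3), dtype=bool)
    grown = ndimage.binary_dilation(blob_mask, structure=s)
    filled = ndimage.binary_fill_holes(grown)
    return ndimage.binary_erosion(filled, structure=s)
```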

Figure 10: Morphological operations. From left to right: original blob, expanded blob, perimeter, filled blob, and final blob.

4.5 Training

Training the skin color model requires a set of sample images, which have to be segmented manually. All the other pixels are used to model the background. The model will be better if the set of samples is large and varied (different kinds of light, different users...). A big number of samples can make the Expectation Maximization algorithm very slow, but it is calculated off-line. The result is a parametric model, so the samples will not make the recognizer program much slower.

Choosing a good range of samples is not the only important decision to take while training the color model. The number of Gaussian components for each model (skin and background) must also be chosen. From experience, a background model with just two Gaussians is enough. As will be discussed later in section 9.3, the background model is not updated at run time, so it is not really important to have very detailed knowledge about the background. The skin model is more complicated: the more components the mixture has, the better the


model fits. On the other hand, the recognition and updating times also grow. One or two Gaussians are not good enough, and the difference in quality between using three or four Gaussians is not so big. A detailed explanation is provided in section 9.1.

The color training program saves a file with all the information about both models needed by the recognizer: number of components, means, weights and covariance matrices. The histogram generated from the skin pixels is not needed any more.


5 Tracking

In the previous work, the Kalman filter [39] was used to track the segmented blobs. The Kalman filter has been the subject of extensive research and application, and is often used in interactive computer graphics.

It provides noise reduction and predictions from its input. The predictions are useful to decide which blob should be considered the head or the right/left hand.

Each region uses one individual filter per coordinate (x, y), which makes six filters necessary to track the two hands and the head.

5.1 Predictions

The regions are tracked directly in image space, where the state is represented by the position p = (x \; y)^T and the velocity of each tracked region:

x = \begin{pmatrix} p \\ \dot{p} \end{pmatrix}

Under the assumption of a first order Taylor expansion, the system is defined as:

x_{k+1} = A x_k + C w_k, \qquad A = \begin{pmatrix} 1 & \Delta T \\ 0 & 1 \end{pmatrix}, \quad C = \begin{pmatrix} \Delta T^2 / 2 \\ \Delta T \end{pmatrix}

where w_k represents Gaussian noise with variance \sigma_w^2. The observation model is defined as:

z_k = H x_k + n_k, \qquad H = \begin{pmatrix} 1 & 0 \end{pmatrix}

where z is the measurement vector and n_k is the measurement noise, which is assumed to be Gaussian with variance \sigma_n^2. Using the standard Kalman filter updating model, [39], it is straightforward to track the regions of interest over time. The matching is performed using a nearest neighbor algorithm.

For a further description of the Kalman filter algorithm, see [40].
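A minimal sketch of one such per-coordinate filter follows, assuming a unit time step and hand-picked noise variances (the q and r values below are illustrative, not values from the thesis):

```python
import numpy as np

class CoordKalman:
    """Constant-velocity Kalman filter for one blob coordinate, following
    the model above with state [position, velocity]."""
    def __init__(self, dt=1.0, q=1e-2, r=1.0):
        self.A = np.array([[1.0, dt], [0.0, 1.0]])
        self.C = np.array([[dt ** 2 / 2], [dt]])
        self.H = np.array([[1.0, 0.0]])
        self.Q = self.C @ self.C.T * q      # process noise covariance
        self.R = np.array([[r]])            # measurement noise variance
        self.x = np.zeros((2, 1))           # state estimate
        self.P = np.eye(2)                  # state covariance

    def predict(self):
        self.x = self.A @ self.x
        self.P = self.A @ self.P @ self.A.T + self.Q
        return float(self.x[0])             # predicted position

    def update(self, z):
        y = np.array([[z]]) - self.H @ self.x          # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)       # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(2) - K @ self.H) @ self.P
```

Six such filters (x and y for each of the two hands and the head) provide the predictions used in the nearest neighbor matching.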


5.2 Tracking issues

The main disadvantage of Kalman filters in tracking is occlusion [9]. Short occlusions will not interrupt tracking, because the filters hold the velocities of the blobs and have restricted acceleration. Problems arise when the blobs are so close that they merge into one single object. If the user moves the two hands across each other, the tracker will probably get confused and will switch the right and left hands in future tracking.

To cope with another problem inherent to the Kalman filter, namely multiple targets converging to a single one, the previous work implemented a re-initialization step. During tracking, a measure of confidence is maintained. When one of the following three rules is broken, the measure of confidence is decreased. If they are violated for a long time, the level of confidence becomes low and re-initialization is triggered:

- The region that belongs to the left hand has to be to the left of the right hand.
- The face is above both hands.
- None of the three tracked regions are at the same location.

Figure 11 shows how the filter smooths the hand movements during one of the test runs. This is crucial for the other part of the system, namely the gesture recognition step, [41], since the Hidden Markov Models are based on the whole hand trajectory information.

Figure 11: Left) the raw image positions of the hand regions, and right) the positions of the hand regions estimated by the Kalman filter.


6 Color adaptation

In order to support the system requirements described in section 1.1, the system cannot rely on a static segmentation technique. Users are assumed to be in movement, moving nearer to and farther from light sources. The sources can be natural (sunlight from windows) or artificial, such as fluorescent lights or bulbs. Even a painted wall can alter skin colors by reflection.

A Mixture of Gaussians was chosen to model the skin color because it presents the best advantage over non-parametric models: on-line adaptation is fast in two different senses. It takes few frames to learn the colors of a hand palm, and it takes little computing time to transform the model.

6.1 On-line adaptive learning

Model updating is done by calculating new means, weights and covariance matrices for all the components in the Mixture of Gaussians, [42]. It is not necessary to keep in memory all the points used in the training step, nor the histograms. But it is necessary to estimate empirically a value for the adaptation speed \alpha (finally, 0.9995 was chosen). This value cannot be small, because the model would change drastically if false pixels got into the updating step. It cannot be equal to 1.0 either, or all new pixels would be useless for updating the original model. More details and testing results are provided in section 9.2.

Model updating starts by calculating the new expected posterior probability of each component G_j for the new skin pixel x_n:

q_n(j) = P(G_j \mid x_n) = \frac{w_{n-1}(j) \, g(x_n; \mu_{n-1}(j), \Sigma_{n-1}(j))}{P(x_n)}

New weights for the mixture components are calculated using these probabilities and the adaptation rate \alpha:

w_n(j) = \alpha \, w_{n-1}(j) + (1 - \alpha) \, q_n(j)

Each component gets a new learning rate from the defined adaptation speed and the previous values:

\eta_n(j) = \frac{q_n(j) \, (1 - \alpha)}{w_n(j)}

Finally, new means and covariance matrices are calculated using the above learning rate:

\mu_n(j) = (1 - \eta_n(j)) \, \mu_{n-1}(j) + \eta_n(j) \, x_n

\Sigma_n(j) = (1 - \eta_n(j)) \, \Sigma_{n-1}(j) + \eta_n(j) \, (x_n - \mu_n(j))(x_n - \mu_n(j))'
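A sketch of these recursions for one accepted skin pixel, updating in place a (w, mu, cov) model such as the one returned by `em_gmm` above (the function name is mine; alpha defaults to the 0.9995 of the text):

```python
import numpy as np

def adapt_model(w, mu, cov, x, alpha=0.9995):
    """On-line update of the skin mixture with one accepted skin pixel x."""
    d = len(x)
    # Posterior responsibility q(j) of each component for the new pixel.
    q = np.empty(len(w))
    for j in range(len(w)):
        diff = x - mu[j]
        inv = np.linalg.inv(cov[j])
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov[j]))
        q[j] = w[j] * np.exp(-0.5 * diff @ inv @ diff) / norm
    q /= q.sum()
    # New component weights, then per-component learning rates.
    w[:] = alpha * w + (1 - alpha) * q
    eta = q * (1 - alpha) / w
    # New means and covariances, moved towards the new pixel.
    for j in range(len(w)):
        mu[j] = (1 - eta[j]) * mu[j] + eta[j] * x
        diff = x - mu[j]
        cov[j] = (1 - eta[j]) * cov[j] + eta[j] * np.outer(diff, diff)
    return w, mu, cov
```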

6.2 New skin pixels

After performing the morphological operations, the model will get skin patches with some new pixels that the classifier did not find itself. It will also use all the correctly classified pixels in the hand, which helps to stabilize the model. In order


to prevent the model from moving towards strange areas, only points which are near the current Gaussian distribution are accepted.

The more frequently the updating is done, the more reliable the segmentation will be in future frames. On the other hand, if a false blob is recognized (some materials such as wood, cardboard...), the model can evolve towards a wrong situation. As shown in section 9.6, a frame by frame updating rate is feasible (the updating process takes less than 10% of the total execution time), but one has to be quite certain that the blob is a real hand (see section 9.3 for evaluation).

To cope with the problem of including non-skin pixels in the model update, some checking is done before they are used. Each pixel must be close to some Gaussian component of the model. If a pixel has a very low probability of being skin (less than 1 × 10^{-4}) in the current model, it will be rejected.


7 Posture recognition

Principal Component Analysis (PCA) is a classical technique which is often used in pattern recognition. Other approaches use techniques [6] such as geometrical moments [43], contours [44], particle filtering [45], [46], 3D models [47], etc., but PCA is simple, reliable, and fast. PCA training is the most expensive step in the system, but it is calculated off-line. If the segmentation results are good enough (without losing any finger, or expanding outside the skin), then even with few samples per posture there is a high probability of a correct recognition.

7.1 Principal Component Analysis

The basic idea is to project samples into a smaller space with fewer dimensions. In addition, the most relevant information is located in the first components of the new coordinate system [48].

The training procedure will be explained later in section 7.2; it is more complex and slower than recognition, but it is calculated off-line. As training results, some eigenvectors and eigenvalues are estimated. Those vectors ([e_1, e_2, \dots, e_k]') are used to calculate the new representation of each sample in the new space.

g_i = [e_1, e_2, \dots, e_k]' (x_i - c)

where c is the mean vector of the training samples (see section 7.2).

Before performing the recognition, the blob has to be prepared. To fit in a defined frame (30 pixels wide and 30 pixels high), it has to be scaled to the correct size along one dimension (width or height) and centered in the frame along the other. More details and evaluation of this step are given in section 9.4.

The representation of the new scaled sample is then calculated, and the training sample at the shortest distance from this point gives the most likely posture.

This could be calculated as the Euclidean distance, but the results show that using a pseudo-distance weighted by the eigenvalues is a little better. This way, the first components are considered more important than the following ones. An example of the distance calculation is given in the evaluation section 9.4.

Euclidean distance:

d^2 = \sum_{j=1}^{k} (s_j - g_{ij})^2

where g_i is the representation of the trained sample, s is the sample being recognized, and k is the number of eigenvectors used.

Pseudo-distance:

d^2 = \sum_{j=1}^{k} (s_j - g_{ij})^2 \cdot ev_j

where ev_j is the eigenvalue used to weigh the j-th component of the distance.


7.2 Training

This is a very time-consuming process, but like all the training for the system, it is calculated off-line. The objective is to get the eigenvalues and eigenvectors of a matrix containing n posture samples. A sample is a small binary image. It cannot be too large, because that would make the recognizer much slower, since its speed depends on the number of dimensions of a sample.

Each sample is scaled into a frame of 30 x 30 pixels, which is enough to perform a good recognition in a reasonable run time. It is recommended to have more than one sample per posture, and for some postures this is necessary because they can be described with either hand. A very large set of samples is not useful, because almost all of them will be similar, so they do not offer more reliability. The training set and evaluation are described in detail in section 9.4.

Image samples are transformed into vectors with value 1.0 for skin pixels and 0.0 for background pixels. Using all those vectors (samples) as columns, a matrix is built:

x = [x_1, x_2, \dots, x_N]'

From every pixel in that matrix, the average of its row has to be subtracted:

c = \frac{1}{N} \sum_{j=1}^{N} x_j = [c_1, c_2, \dots, c_m]'

The resulting matrix is called P:

P = [x_1 - c, x_2 - c, \dots, x_N - c]'

The covariance matrix (the Q matrix) is calculated as P \cdot P'. Finally, the eigenvectors and eigenvalues of the Q matrix are calculated.

Only the first vectors are useful. The number of dimensions for the new space has to be small, but large enough to perform a reliable recognition. Eigenvectors are accepted until the cumulative sum of the (normalized) eigenvalues almost reaches 1; this happens when the remaining dimensions do not contain useful information. The system currently works with 10 dimensions. An evaluation of the chosen threshold is given in section 9.4.

The last step is to find the new representation of all the samples used for the training. In order to do that, a new matrix is built using the chosen eigenvectors as rows. That matrix is then multiplied by all the samples.

The PCA training program saves a file with the chosen eigenvectors and eigenvalues, the averages of each pixel (they are necessary for recognition), and the new representation of the training samples.
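As an illustration, a numpy sketch of the training and the eigenvalue-weighted recognition described above; the cumulative-eigenvalue cutoff `var_threshold` is an assumed parameterization of the "almost 1" rule (the thesis ends up with 10 dimensions), and the names are mine:

```python
import numpy as np

def pca_train(samples, var_threshold=0.95):
    """Train the posture PCA from flattened 30x30 binary images
    (rows of `samples`, shape N x 900)."""
    X = np.asarray(samples, dtype=float)
    c = X.mean(axis=0)                    # per-pixel average
    P = X - c
    # Eigen-decomposition of the pixel covariance; eigh returns ascending order.
    evals, evecs = np.linalg.eigh(P.T @ P / len(X))
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    # Keep eigenvectors until the cumulative eigenvalue mass passes the cutoff.
    k = int(np.searchsorted(np.cumsum(evals) / evals.sum(), var_threshold)) + 1
    E = evecs[:, :k]                      # chosen eigenvectors, one per column
    G = P @ E                             # training samples in the new space
    return c, E, evals[:k], G

def pca_classify(sample, c, E, evals, G):
    """Project a scaled blob and return the index of the nearest trained
    posture under the eigenvalue-weighted pseudo-distance."""
    s = (np.asarray(sample, dtype=float) - c) @ E
    d2 = (((G - s) ** 2) * evals).sum(axis=1)
    return int(np.argmin(d2))
```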


8 Gesture recognition

Hidden Markov Models (HMM) have found much use in recognition problems since they were described in the 1960s. This statistical method started to be used in the 1970s for speech recognition, and since the 1980s it has also been used for biological sequences such as DNA.

It is a technique which fits this kind of gesture recognition problem properly. A HMM has some parameters whose values are trained to describe a pattern. The samples are then classified by the models, and the one which provides the highest posterior probability is chosen to describe the sample.

The previous work concentrates only on isolated gestures. Continuous recognition [49], [50] would be more suitable for the system, since it should be able to recognize movements without a silence stop between them. But it is not in the scope of this thesis.

8.1 Gesture partitioning

The gestures have to be separated by a silence stop between them. A gesture is a continuous sequence where the movement of the blob exceeds a threshold (Fig. 12). Only positions are taken into account while recognizing the gesture; in fact, the difference in position between two measurements of the blob is the motion used.

The motion measurements are then smoothed by convolution with a Gaussian kernel whose variance corresponds to the expected length of a gesture (about 20 frames).

When the gesture is isolated in time, the positions which describe it have to be normalized to avoid differences in user position or size of the gesture. All the blobs are then centered, which makes the system unable to distinguish whether a gesture is being made, for example, over or below the head. The system is only sensitive to the dynamics within the gesture.
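A sketch of this partitioning rule, assuming positions normalized to [0, 1] so that the 0.0006 threshold shown in Fig. 12 is meaningful; the smoothing width corresponds to the roughly 20-frame expected gesture length, and the names are mine:

```python
import numpy as np

def partition_gestures(positions, threshold=0.0006, kernel_sigma=20):
    """Split a trajectory into gestures separated by silence: frame-to-frame
    motion is smoothed with a Gaussian kernel and thresholded.
    positions: (T, 2) array; returns [start, end) frame ranges of gestures."""
    motion = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    t = np.arange(-3 * kernel_sigma, 3 * kernel_sigma + 1)
    kernel = np.exp(-0.5 * (t / kernel_sigma) ** 2)
    kernel /= kernel.sum()
    smooth = np.convolve(motion, kernel, mode='same')
    active = smooth > threshold
    # Collect contiguous runs of active frames.
    edges = np.flatnonzero(np.diff(active.astype(int)))
    bounds = np.r_[0, edges + 1, len(active)]
    return [(a, b) for a, b in zip(bounds[:-1], bounds[1:]) if active[a]]
```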

8.2 Hidden Markov Models

Each known gesture in the database is modeled by a trained HMM. The most probable HMM for a sequence is selected to classify the gesture.

A HMM is described by the following parameters:

- States, S = \{s_1, s_2, \dots, s_N\}. The state at time t is denoted q_t.
- Symbols, V = \{v_1, v_2, \dots, v_M\}.
- Transition probability distribution, A = \{a_{ij}\}, where a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i), 1 \le i, j \le N.
- Observation probability distribution, B = \{b_j(k)\}, where b_j(k) = P(v_k \text{ at } t \mid q_t = s_j), 1 \le j \le N, 1 \le k \le M.
- Initial state distribution, \pi = \{\pi_i\}, where \pi_i = P(q_1 = s_i), 1 \le i \le N.

Figure 12: Partitioning starts when the motion exceeds a defined threshold (0.0006), and stops when it falls below it again.

Measurements are organized into observation sequences, O = o_1 o_2 \dots o_T. The observations are not discrete for gestures; therefore, the observation probability distribution B for each state is replaced in this case by a continuous Gaussian density:

b_j(o_t) = \frac{1}{\sqrt{(2\pi)^n |\Sigma_j|}} \, e^{-\frac{1}{2}(o_t - \mu_j)' \Sigma_j^{-1} (o_t - \mu_j)}

where n is the number of dimensions of the observations, \mu_j is the mean, and \Sigma_j the covariance for state j.

For a further explanation of HMMs, see [51].

Given a HMM, there are three problems of interest to solve:

- Evaluation. Calculate P(O \mid \lambda), the probability of the observation sequence for the model \lambda.
- Decoding. Choose a corresponding sequence of internal states which best explains the observation sequence.
- Training. Adjust the model parameters using some observation sequences.

Only two of the three need to be solved for the purpose of this thesis: evaluation (section 8.4) and training (section 8.5). The decoding problem is more important in continuous recognition, where the most probable sequence of gestures needs to be recognized.


8.3 Topology of models

The following diagram shows the topology of the HMMs used in the previous work. Transitions are allowed from one state to the next one or two states forward, but never backwards. Transitions from a state to itself are also allowed. The observations are the normalized positions of the blobs.

Figure 13: The topology of the HMMs used.

8.4 Recognition

The evaluation problem is solved for each of the HMMs, and the pattern with the highest value of P(O \mid \lambda) is chosen as the corresponding gesture. Some algorithms have been proposed to calculate that probability. One possibility is the forward variable \alpha_i(t) = P(o_1, o_2, \dots, o_t, q_t = s_i \mid \lambda) [52].

\alpha_i(1) = \pi_i \, b_i(o_1), \qquad 1 \le i \le N

\alpha_j(t+1) = \left[ \sum_{i=1}^{N} \alpha_i(t) \, a_{ij} \right] b_j(o_{t+1}), \qquad 1 \le t \le T-1, \; 1 \le j \le N

P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_i(T)

Strictly speaking, the total likelihood P(O \mid \lambda) is not the only possible measurement for recognition. The decoding problem is usually solved by the Viterbi algorithm [52], which memorizes the most probable state in each step of the recursion. But as said before, the forward solution works well for the purpose of this thesis, where most of the effort has been put into the skin segmentation steps.

When the probability of the sequence has been calculated for all the models, the highest one is chosen as the recognized gesture, as long as that probability passes a confidence threshold.
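A log-space sketch of this forward recursion; working with logarithms (log-sum-exp) avoids underflow on longer sequences, and `log_b` holds precomputed log values of the continuous densities b_j(o_t). A sketch, not the thesis code:

```python
import numpy as np

def log_forward(pi, A, log_b):
    """Forward algorithm in log space. pi: (N,) initial distribution,
    A: (N, N) transition matrix, log_b: (T, N) with log b_j(o_t).
    Returns log P(O | lambda)."""
    log_alpha = np.log(pi + 1e-300) + log_b[0]
    logA = np.log(A + 1e-300)
    for t in range(1, len(log_b)):
        # alpha_j(t+1) = [sum_i alpha_i(t) a_ij] * b_j(o_{t+1})
        log_alpha = np.logaddexp.reduce(log_alpha[:, None] + logA, axis=0) + log_b[t]
    return np.logaddexp.reduce(log_alpha)
```

The gesture is then assigned to the model with the highest `log_forward` score, provided that score passes the confidence threshold.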


number of states and transitions. The measurements used are a combination of positions and velocities of the blobs [9]. The steps and programs that need to be run to train the system with new gestures are described in Appendix A.

Training is performed in two steps. First, the observation sequences are divided among the states, and initial means and covariances are calculated:

\mu_j = \frac{1}{T_j} \sum_{\forall t : q_t = s_j} o_t

\Sigma_j = \frac{1}{T_j} \sum_{\forall t : q_t = s_j} (o_t - \mu_j)(o_t - \mu_j)'

where T_j is the number of observations in state j. Using the Viterbi algorithm [52], the most probable state sequence is calculated, and the transition probabilities are then approximated by

a_{ij} = \frac{A_{ij}}{\sum_{k=1}^{N} A_{ik}}

where A_{ij} is the number of transitions made from i to j in the state sequence. The approximation of the means and covariances is then repeated with the new state sequence until they do not change.

But it is possible to get a better trained HMM using Baum-Welch re-estimation [52]. Instead of assigning each observation to one state, the observation is assigned to each state in proportion to the probability of occupying that particular state. If L_j(t) denotes the probability of being in state j at time t, then the means and covariances are

\mu_j = \frac{\sum_{t=1}^{T} L_j(t) \, o_t}{\sum_{t=1}^{T} L_j(t)}

\Sigma_j = \frac{\sum_{t=1}^{T} L_j(t) \, (o_t - \mu_j)(o_t - \mu_j)'}{\sum_{t=1}^{T} L_j(t)}

L_j(t) is calculated using the forward-backward algorithm. The forward probability is defined as

\alpha_i(t) = P(o_1, o_2, \dots, o_t, q_t = s_i \mid \lambda)

and was already described in section 8.4. The backward probability is defined as

\beta_i(t) = P(o_{t+1}, \dots, o_T \mid q_t = s_i, \lambda)

which is calculated using the following recursion:

\beta_i(T) = a_{iN}, \qquad 1 \le i \le N

\beta_i(t) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_j(t+1), \qquad 1 \le i \le N, \; 1 \le t < T


Multiplying the forward and backward probabilities gives

\alpha_i(t)\beta_i(t) = P(O, q_t = s_i \mid \lambda)

which can be used to calculate the state occupancy probability:

L_j(t) = P(q_t = s_j \mid O, \lambda) = \frac{P(O, q_t = s_j \mid \lambda)}{P(O \mid \lambda)} = \frac{1}{P(O \mid \lambda)} \alpha_j(t)\beta_j(t)

The transition probabilities are then approximated by

a_{ij} = \frac{\frac{1}{P(O \mid \lambda)} \sum_{t=1}^{T-1} \alpha_i(t) a_{ij} b_j(o_{t+1}) \beta_j(t+1)}{\frac{1}{P(O \mid \lambda)} \sum_{t=1}^{T} \alpha_i(t)\beta_i(t)}

A further explanation of training can be found in [52, 51].
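To make the re-estimation concrete, here is a minimal sketch of the forward-backward bookkeeping: computing the occupancies L_j(t) and the new means. It is illustrative only; all names are assumptions, the backward pass uses a simplified termination \beta_i(T) = 1 instead of the a_{iN} convention above, and the emission values b_j(o_t) are assumed precomputed into a matrix.

import numpy as np

def backward(A, emis):
    # beta_i(t) = sum_j a_ij b_j(o_{t+1}) beta_j(t+1); emis[t, j] = b_j(o_t)
    T, N = emis.shape
    beta = np.ones((T, N))          # simplified termination beta_i(T) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (emis[t + 1] * beta[t + 1])
    return beta

def occupancy(alpha, beta):
    # L_j(t) = alpha_j(t) beta_j(t) / P(O | lambda)
    return alpha * beta / alpha[-1].sum()

def reestimate_means(O, L):
    # mu_j = sum_t L_j(t) o_t / sum_t L_j(t), one row per state
    O = np.asarray(O, dtype=float)
    return (L.T @ O) / L.sum(axis=0)[:, None]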


9 Experimental evaluation

In this section some results from the experimental evaluation are given. The system has been tested using several different cameras, including low resolution web cams. Obviously the reliability of the results decreases in that case, but the system is still usable for applications such as, for example, controlling a music player on a desktop. The evaluation covers the quality of the color model depending on the number of components in the Mixture of Gaussians, the updating policy for model adaptation, posture recognition, gesture recognition, and the running speed of the recognizer program.

9.1 Color model

Choosing the correct number of components in the Mixture of Gaussians is very important for the behavior of the system. The more components it uses, the better the segmentation, but the slower the program runs. Since this is supposed to be a real-time application, a compromise between quality and speed is needed. The following pictures (Fig. 14) show skin segmentation using 1, 2, 3, 4 and 5 components in the mixture.

Figure 14: Skin segmentation using 1, 2, 3, 4 and 5 components.

Three is the number of components chosen. It provides better results than using 1 or 2, and the difference between a model of 3 components and one of 4 or 5 is not so big. Moreover, each additional Gaussian in the color mixture raises execution time, as shown in Table 1.

No. of Gaussians    Time increase [%]
3 + 1               18
3 + 2               35

Table 1: Increase of time requirements with additional Gaussians in the GMM (default is 3).
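As an illustration of the two-class Bayes decision behind these numbers, the following sketch classifies one pixel with a skin mixture against a background mixture. It is a simplified sketch, not the thesis implementation (which relies heavily on precalculation); the skin prior is an assumed parameter.

import numpy as np

def mixture_likelihood(x, weights, means, covs):
    # p(x | class) under a Mixture of Gaussians
    p = 0.0
    for w, mu, cov in zip(weights, means, covs):
        d = x - mu
        norm = np.sqrt((2 * np.pi) ** len(x) * np.linalg.det(cov))
        p += w * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d) / norm
    return p

def is_skin(x, skin_gmm, bg_gmm, skin_prior=0.3):
    # Two-class Bayes rule: label as skin iff P(skin | x) > P(background | x)
    p_skin = skin_prior * mixture_likelihood(x, *skin_gmm)
    p_bg = (1.0 - skin_prior) * mixture_likelihood(x, *bg_gmm)
    return p_skin > p_bg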


9.2 Color adaptation

As was said in section 6, the color adaptation speed is controlled by a constant called α. The closer that value is to 1.0, the slower the adaptation. It is a good idea to have a slow speed, to avoid situations in which the skin pixels come from a failure in segmentation and tracking. A constant of 0.9995 can make the model robust to changes in case the tracking fails, as long as that does not happen very often. On the other hand it makes updating slow, because each new pixel only weighs 0.05% against 99.95% for the current model. But taking into account that the updating process is performed frame by frame, with around 500 pixels or even more when the level of tracking confidence is high (hands and head), the model still manages to adapt quickly.

In the following samples (Fig. 15 and 16) the model adapts from the same initial state to the new skin color in the scene. The initial model covers a large set of users and illumination conditions, while the final one has been restricted to the actual scene by the updating process.

9.3 Updating policy

The main aim of this thesis is to develop a gestural control unit robust to the frequent illumination changes in the scene while interacting with the robot. Having a good training set is not enough: the color of skin pixels changes very often, so an updating policy is needed. Fig. 17 shows the results of segmentation using the color model without color updating. The results are not good. When the hand moves, its color changes, so frame-by-frame updating is necessary.

In the following tests only the skin model is updated, while the background model is kept constant. That decision simplifies the adaptation process and makes it faster, because most of the points in an image are supposed to be background, and updating them as well would make the program run much slower.

The following sequences show what happens when updating is done using just the three biggest blobs. When a false blob is accepted, the model moves towards that color. On a simple background this works well (Fig. 18), but on a complex background (with a lot of wooden tables, doors...) the results show the problem discussed above (Fig. 19).

The way to solve this problem is to define an updating rule. During the following sequence (Fig. 20) the system may update every frame, but only with the pixels from the blobs which have been tracked for at least a minimum period of time. Only the supposed hands (both left and right) are used. The reason why pixels from the head are not used to update the model at the beginning of a tracking sequence is that they can contain colors that would disturb the model (such as lips, glasses, hair...). They are used later, once tracking has been performed for some time and has become stable.
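The frame-by-frame update just described can be sketched as a recursive blend of the current model with the newly accepted skin pixels. This is a simplified, winner-take-all illustration with assumed names (it reuses gaussian_density from the evaluation sketch in section 8.4); a full on-line scheme such as [42] would also re-estimate the mixture weights.

import numpy as np

ALPHA = 0.9995  # adaptation constant: each new pixel weighs 0.05%

def update_skin_model(weights, means, covs, pixels):
    # Blend pixels from reliably tracked blobs into the skin mixture
    for x in pixels:
        # assign the pixel to the most responsible component
        resp = [w * gaussian_density(x, m, c) for w, m, c in zip(weights, means, covs)]
        k = int(np.argmax(resp))
        d = x - means[k]
        means[k] = ALPHA * means[k] + (1 - ALPHA) * x
        covs[k] = ALPHA * covs[k] + (1 - ALPHA) * np.outer(d, d)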


Figure 15: The initial model moves to a different color region, where the skin is illuminated by natural light.

The system also performs well using a low resolution web camera, as shown in Fig. 21.

9.4 Posture recognition

The system recognizes five different postures, shown in Fig. 22. To date the system works with a set of 12 samples (Fig. 23): four samples for the REST posture, and two samples each for the STOP, RIGHT, LEFT and UP postures.

The system has been trained with these samples and a threshold of 1 · 10^-10, which reduces the eigenspace from 900 to 10 dimensions. Recognition speed depends on that number of dimensions, as shown in Tables 2 and 3.

         1 · 10^-10   1 · 10^-15   1 · 10^-20
20x20    10           11           400
30x30    10           14           900
40x40    10           18           1600

Table 2: Number of eigenvectors chosen using different frame sizes and thresholds.


Figure 16: The initial model moves to a different color region, where the skin is illuminated by artificial light.

Figure 17: The hand is subdivided into two blobs because the model cannot update towards this user's skin color.

         1 · 10^-10   1 · 10^-15   1 · 10^-20
20x20    4.04         4.41         29.53
30x30    7.37         9.66         83.12
40x40    13.11        21.03        99.99

Table 3: Percentage of total execution time needed for PCA recognition using different frame sizes and thresholds.

Some conclusions can be drawn from those tables, considering that 1 · 10^-10 is used as the threshold. With a 20x20 frame, recognition time is below 5% of total execution time. With a 30x30 frame this time rises to 7.5%, but the recognition results are better; in particular, the confusion between the REST and STOP postures is reduced. Using a frame of 40x40 pixels would almost double the recognition time without providing better reliability.


Figure 18: At the beginning the left hand is not segmented, but the model updates with new pixels from the right one; its blob then starts growing until it fits the whole hand during the rest of the sequence.

Recognition is performed by choosing the closest neighbor. Table 4 shows the difference between using the Euclidean distance and the new pseudo-distance presented in Chapter 7. Postures are easier to recognize since the distances between the correct and the false postures are proportionally bigger.

The recognition depends highly on the quality of the skin segmentation. The shape of the blob should be as good as possible in order to get reliable recognition. Hands often lose a finger, and that can make the recognizer confuse the posture, as shown in Table 5. Those results were taken from two long sequences in which the REST posture is the most used one (as is usual while interacting with the system).
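The closest-neighbor decision can be outlined as follows. This is an illustrative sketch under assumed names, using the plain Euclidean distance (the pseudo-distance of Chapter 7 is not reproduced here); the dimensions follow the 900-to-10 reduction described above.

import numpy as np

def recognize_posture(patch, mean_shape, eigvecs, train_proj, train_labels, max_dist):
    # Project the binary hand-blob patch (e.g. 30x30) into the eigenspace
    x = patch.reshape(-1).astype(float) - mean_shape   # 900-d, mean-centered
    y = eigvecs.T @ x                                  # reduced to ~10 dims
    # Closest neighbor among the projected training samples
    dists = np.linalg.norm(train_proj - y, axis=1)
    k = int(np.argmin(dists))
    # Reject the recognition if even the best match is too far away
    return train_labels[k] if dists[k] <= max_dist else None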


Figure 19: When the left hand goes over the shelves, the model starts learning those new colors. Then a piece of the table is segmented, which makes the model learn some pixels from the computers near it. Finally the system fails, segmenting even the ceiling of the laboratory.

9.5 Gesture recognition

The system recognizes five different gestures, shown in Fig. 24. Some of them are described with one hand and some with both. The head does not participate in the gestures.

Each gesture is modelled by an HMM, which has been trained with a large number of samples of that gesture. Only examples without any occlusion or tracking failure were included in the training sets. All the gestures were performed by the author himself, wearing long sleeves.

The system is able to distinguish the five trained gestures when they are described carefully by the author, but some mistakes are made due to tracking failures. The results can be seen in Table 6, where SPEED UP is the least reliable command. The reason is that when the hands move up, they sometimes occlude the head, confusing the tracking filters.

9.6 Speed and optimization

Skin segmentation is a very time-consuming process: heavy operations have to be calculated for each pixel in the image. The parametric model uses a three-dimensional Mixture of Gaussians, which means that a lot of matrix multiplications are needed.


Figure 20: Leaving head pixels out of the model update and using only hand blobs which are being tracked, the model updates much better than before. Both hands are segmented during this sequence, but sometimes they are not chosen as one of the 3 biggest blobs because of the pieces of wood in the background. Head pixels are used once the tracking gets stable.

Figure 21: Example test run obtained with a low resolution web camera.


Figure 22: Known postures are REST (2), STOP (2), LEFT, RIGHT and UP (2).

Figure 23: Training set: LEFT (2), RIGHT (2), STOP (2), UP (2) and REST (4).

The system should be able to run in real time, and without any optimization that was completely unreachable. A lot of effort was put into making the program fast. The model and its updating operations became more complex, with a lot of precalculation steps to avoid doing that work while segmenting an image. Some matrix multiplications were also optimized depending on their known number of zero elements, or by exploiting their symmetry.

The program runs on a GNU/Linux system with a 1.6 GHz Pentium processor. At this point it segmented around 12 frames (320 x 240 pixels each) per second. That rate includes grabbing and displaying the images, which stands for approximately 13% of the processing time. Tables 7 and 8 show the execution profile for this slow version.

Color transformation and skin segmentation take almost 70% of the total execution time. When using RGB color space directly from the camera, the LUT transformation can be avoided, which saves 8% of the time.

It is clear that the next optimization needs to be made in the skin segmentation step. About 40% of the skin segmentation time is spent in background probability calculation.


         Euclidean   Pseudo
LEFT     231.2       57819.8
LEFT     257.2       68920.7
RIGHT    43.2        3473.5
RIGHT    32.2        2583.8
STOP     203.3       49659.0
STOP     282.2       64832.4
UP       179.9       47388.7
UP       171.1       37151.1
REST     217.8       47915.9
REST     234.7       57589.0
REST     206.5       51249.6
REST     222.3       59880.1

Table 4: Distances between a random RIGHT posture and all training samples.

%        REST   STOP   LEFT   RIGHT   UP
REST     82     4      0      5       9
STOP     23     77     0      0       0
LEFT     10     0      88     0       2
RIGHT    0      0      0      89      11
UP       9      0      0      0       91

Table 5: Confusion matrix using 8 · 10^3 as the distance threshold for accepting a recognition. System recognitions are rows while columns are real postures.

%     S     MR    ML    SU    SD
S     100   0     0     0     0
MR    0     100   0     0     0
ML    10    0     90    0     0
SU    10    0     0     85    5
SD    0     0     0     0     100

Table 6: Gesture confusion matrix. System recognitions are rows while columns are real gestures. S - START, MR - MOVE RIGHT, ML - MOVE LEFT, SU - SPEED UP, SD - SPEED DOWN.

Since the background is assumed to be static (section 9.3), its probabilities can be precalculated as a LUT table. This takes some time at the beginning of the execution, but it pays off, as can be seen in Table 9. The program speeds up to 15 frames per second after this optimization.
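A sketch of such a precalculation, assuming 8-bit RGB pixels quantized into coarse bins to keep the table small (the bin count and names are assumptions; mixture_likelihood is the helper from the sketch in section 9.1):

import numpy as np

def precalc_background_lut(bg_gmm, bins=32):
    # Evaluate p(x | background) once for every quantized RGB color
    step = 256 // bins
    lut = np.zeros((bins, bins, bins))
    for r in range(bins):
        for g in range(bins):
            for b in range(bins):
                x = np.array([r, g, b], dtype=float) * step + step / 2.0
                lut[r, g, b] = mixture_likelihood(x, *bg_gmm)
    return lut

def bg_likelihood(pixel, lut, bins=32):
    # O(1) lookup at run time instead of evaluating the mixture
    step = 256 // bins
    r, g, b = (int(c) // step for c in pixel)
    return lut[r, g, b]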


Figure 24: Gestures that the system recognizes to date: MOVE LEFT, MOVE RIGHT, SPEED UP, SPEED DOWN, START.

Skin segmentation is still the heaviest step, taking over 50% of the time. A final optimization idea is the following: sometimes it is not necessary to segment the whole image. When tracking is done with high confidence, it is possible to predict where the blobs will be located in the next frame, so a partial segmentation of that frame can be done. Experiments show that such a partial segmentation was possible in 85% of the frames. This is another example of how tracking integration helps the system perform better, raising its speed to over 20 frames per second. Table 10 shows the execution profile for the final optimized version.
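A minimal sketch of how such a search window might be derived from the tracker's prediction (the margin factor and all names are assumptions, not the thesis code):

def search_window(pred_center, blob_size, img_w=320, img_h=240, margin=1.5):
    # Partial segmentation: only classify pixels inside a window around
    # the Kalman-predicted blob position, enlarged by a safety margin.
    cx, cy = pred_center
    half_w = margin * blob_size[0] / 2.0
    half_h = margin * blob_size[1] / 2.0
    x0, y0 = max(0, int(cx - half_w)), max(0, int(cy - half_h))
    x1, y1 = min(img_w, int(cx + half_w)), min(img_h, int(cy + half_h))
    return x0, y0, x1, y1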


OPERATION            %
Skin segmentation    62.97
Show image           6.99
PCA analysis         6.63
Update model         6.02
Get image            5.86
Morphology           3.88
Search blobs         2.21
HMM calculations     0.02
Kalman filter        0.01
Other operations     5.14

Table 7: Execution profile during a long sequence using RGB directly from the camera.

OPERATION            %
Skin segmentation    60.28
LUT transformation   7.97
Show image           6.56
PCA analysis         6.07
Update model         5.66
Get image            5.65
Morphology           3.59
Search blobs         2.08
HMM calculations     0.02
Kalman filter        0.01
Other operations     2.11

Table 8: Execution profile during a long sequence with color transformation.

OPERATION              %
Skin segmentation      51.76
Show image             8.09
PCA analysis           7.47
Update model           6.84
Get image              6.72
Backg. precalculation  5.91
Morphology             4.47
Search blobs           2.53
HMM calculations       0.02
Kalman filter          0.01
Other operations       6.18

Table 9: Execution profile using background precalculation.

10 Conclusion

This system is a complete posture and gesture recognizer. Its flexible design makes it suitable for use as a commander for any application, although its motivation is to be integrated in a Programming by Demonstration (PbD) framework such as a service robot.


OPERATION              %
Skin segmentation      35.88
Show image             10.36
PCA analysis           9.78
Update model           9.39
Get image              8.64
Backg. precalculation  7.19
Morphology             6.11
Search blobs           3.26
HMM calculations       0.02
Kalman filter          0.01
Other operations       9.36

Table 10: Execution profile using partial segmentation.

The system runs in real time, processing around 20 frames per second at a frame size of 320 x 240 pixels.

Skin segmentation is the most important step in the process. All later results depend on its robustness to illumination changes during interaction. As expected, it is also the most time-consuming step, but spending more than 35% of the execution time on skin pixel classification is justified when it provides the desired robustness.

The color space that gives the best results is RGB. Other spaces (such as HSV, which is commonly used in skin segmentation) are more easily confused by non-skin pixels when the model is being updated. Moreover, using RGB directly from the camera speeds up the program, saving 8% of the execution time.

The adaptation algorithm and policy work well, making the classification produce good hand shapes. A set of morphological operations helps towards that purpose by adding some extra pixels to the hand blob. Those blob pixels adapt the color model, making future classifications better and better. Only hand pixels are allowed to update the model when tracking begins, because head pixels can contain some other colors (hair, lips...). Head pixels are used later, once tracking has been running for some more time. Updating is executed almost every frame (every frame in which a hand is being tracked) and takes less than 10% of the execution time.

Filtering and tracking can be improved. The Kalman filter works well as long as the user interacts in simple scenes. No other users should appear in the scene, or it will not be able to track properly. Occlusions of hands and head are very common while describing a gesture. They do not necessarily interrupt tracking, but they disturb the filtered trajectory. User movements should be done slowly or the tracker will get confused and will require re-initialization. Kalman filters require a careful user; otherwise another technique should be used to track the hands.

PCA works fine, although it sometimes takes some frames to recognize the posture of the hand properly. When a hand moves or changes its posture, some shadows appear on it. The model then needs some frames to learn the new colors and adapt to them, covering the whole hand. Even if a finger cannot be segmented properly, posture recognition is reliable after a few frames.


PCA is also performed every frame for both hands, taking 10% of the total time.

Gesture recognition also provides a high recognition rate, even when the system is tested by different users, once they get used to it.

In conclusion, the system is user-independent, although new users may need some time to get used to it. Skin adaptation does not take long at the beginning, posture recognition works properly after that, and it is easy to describe the hand gestures at the correct speed (Fig. 25).

Figure 25: Sequence of interaction with the system


11 Improvements and future work

Skin segmentation works well when the system runs on simple backgrounds, but when the scene contains big wooden elements such as tables, doors or shelves, the skin model can get confused by them and the system can fail.

The current system runs on the false hypothesis that the background is static. In future work it could be useful to test whether some background updating helps segmentation when interacting in such complex scenes.

Some more experiments should be done to determine a better updating policy, maybe a more restrictive one when tracking confidence is not good. It would also be a good idea to try to calculate the learning rate of the updating procedure depending on the context of the gesture (beginning or end).

At the moment not many rules are followed when choosing the blobs to track: only size is taken into account to accept or reject a blob. For future tests it would be a good idea to also use some other geometrical moments, such as a position near the previous one or an elliptical shape.

The Kalman filter works well when user movements are not very fast and no occlusions happen. Some other techniques have been demonstrated to be more robust for tracking, for example particle filtering [53].

In the current implementation only isolated gestures can be recognized. They work well, but in order to provide the user with a more natural way of interacting, a future improvement should consider performing continuous gesture recognition.

Another issue for future work is the integration of the gesture-based system, which uses only the trajectories of the user's hands, with the posture recognition system, to allow for a more expressive and flexible interaction system.


12 Other applications for the system

This system can be used for general purposes without rewriting or adapting its code. Its output can be read by any other control process (which is not included in this project) to command whatever is needed. Even creating new gestures and postures is done with training programs, which write files containing all the descriptions needed by the recognizer system.

That independence between the recognizer program and the control program gives the system a lot of versatility, making it easy to use for other kinds of purposes, for example controlling a music player.


References

[1] P. Wallace, AgeQuake: Riding the Demographic Rollercoaster Shaking Business, Finance and our World. London, UK: Nicholas Brealey Publishing Ltd., 1999.
[2] H. Friedrich, R. Dillmann, and O. Rogalla, "Interactive robot programming based on human demonstration and advice," in Christensen et al. (eds.): Sensor Based Intelligent Robots, LNAI 1724, 1999, pp. 96-119.
[3] S. Ekvall and D. Kragic, "Integrating object and grasp recognition for dynamic scene interpretation," in IEEE International Conference on Advanced Robotics, ICAR 2005, 2005.
[4] M. Kleinehagenbrock, J. Fritsch, and G. Sagerer, "Supporting advanced interaction capabilities on a mobile robot with a flexible control system," in IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, vol. 4, Oct. 2004, pp. 3469-3655.
[5] E. A. Topp, D. Kragic, P. Jensfelt, and H. I. Christensen, "An interactive interface for service robots," in IEEE International Conference on Robotics and Automation (ICRA'04), New Orleans, Apr. 2004, pp. 3469-3475.
[6] V. I. Pavlovic, R. Sharma, and T. S. Huang, "Visual interpretation of hand gestures for human computer interaction: a review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, 1997, pp. 677-695.
[7] S. Marcel, "Gestures for multi-modal interfaces: a review," Dalle Molle Institute for Perceptual Artificial Intelligence, Switzerland, Tech. Rep., September 2002.
[8] R. Watson, "A survey of gesture recognition techniques," Technical Report TCD-CS-93-11, Trinity College, Dublin, July 1993.
[9] F. Sandberg, "Vision based gesture recognition for human-robot interaction," Master's thesis, Royal Institute of Technology, 1999.
[10] M. Sharma, "Performance evaluation of image segmentation and texture extraction methods in scene analysis," Master's thesis, University of Exeter, 2001.
[11] H. Cheng, X. H. Jiang, Y. Sun, and J. L. Wang, "Color image segmentation: Advances & prospects," Pattern Recognition, vol. 35(2), 2002, pp. 373-393.
[12] R. B. Ohlander, "Analysis of natural scenes," Ph.D. dissertation, Carnegie Institute of Technology, 1975.
[13] S. M. Rubin, "Natural scene recognition using locus search," Computer Graphics and Image Processing, vol. 13, 1980, pp. 298-333.
[14] Y.-L. Chang and X. Li, "Adaptive image region-growing," IEEE Transactions on Image Processing, November 1994.


[15] S. Askar, Y. Dondratyuk, K. Elazouzi, P. Kauff, and O. Schreer, "Vision-based skin-colour segmentation of moving hands for real-time applications," in Proc. of 1st European Conference on Visual Media Production (CVMP), Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut, Germany, March 2004.
[16] J. M. Buades, M. Gonzalez, and F. J. Perales, "Face and hands segmentation in color images and initial matching," Journal of WSCG, vol. 2, 2004, pp. 33-36.
[17] L. Sigal, S. Sclaroff, and V. Athitsos, "Estimation and prediction of evolving color distributions for skin segmentation under varying illumination," Image and Video Computing Group, Boston University, MA, Tech. Rep. 1999-015, 1999. [Online]. Available: citeseer.ist.psu.edu/article/sigal00estimation.html
[18] J. Bruce, T. Balch, and M. Veloso, "Fast and inexpensive color image segmentation for interactive robots," in Proceedings of the 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '00), vol. 3, October 2000.
[19] K. Schwedt and J. L. Crowley, "Robust face tracking using color," in 4th IEEE International Conference on Automatic Face and Gesture Recognition, Grenoble, France, 2000.
[20] N. Funk, "A study of the Kalman filter applied to visual tracking," CMPUT 652 Probabilistic Methods in AI, University of Alberta, December 2003.
[21] A. M. Martínez and A. C. Kak, "PCA versus LDA," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, February 2001.
[22] T. Starner and A. Pentland, "Visual recognition of American Sign Language using hidden Markov models," Master's thesis, Massachusetts Institute of Technology, February 1995.
[23] R. H. Liang and M. Ouhyoung, "A real-time continuous gesture recognition system for sign language," in 3rd International Conference on Face and Gesture Recognition, Taipei, 1998.
[24] M. Yang, D. Kriegman, and N. Ahuja, "Detecting faces in images: A survey," IEEE Trans. on PAMI, vol. 24(1), 2002, pp. 34-58.
[25] T. Jebara, K. Russel, and A. Pentland, "Mixture of eigenfeatures for real-time structure from texture," in Proc. of ICCV, 1998, pp. 128-135.
[26] M. Jones and J. Rehg, "Statistical color models with application to skin detection," in Proc. of CVPR, vol. 1, 1999, pp. 274-280.
[27] S. Kim, N. Kim, S. Ahn, and H. Kim, "Object oriented face detection using range and color information," in Proc. of Face and Gesture Recognition, 1998, pp. 76-81.


[28] D. Saxe and R. Foulds, "Toward robust skin identification in video images," in Proc. of Face and Gesture Recognition, 1996, pp. 379-384.
[29] D. Chai and K. Ngan, "Locating facial region of a head-and-shoulders color image," in Proc. of Face and Gesture Recognition, 1998, pp. 124-129.
[30] M. Yang and N. Ahuja, Face Detection and Gesture Recognition for Human-Computer Interaction. Kluwer Academic Publishers, New York, 2001.
[31] J. Terrillon, M. Shirazi, H. Fukamachi, and S. Akamatsu, "Comparative performance of different skin chrominance models and chrominance spaces for the automatic detection of human faces in color images," in Proc. of Face and Gesture Recognition, 2000, pp. 54-61.
[32] M. Störring, "Computer vision and human skin colour," Ph.D. dissertation, Aalborg University, Denmark, 2004.
[33] Y. Raja, S. McKenna, and S. Gong, "Tracking color objects using adaptive mixture models," Image and Vision Computing, vol. 17(3-4), 1999, pp. 225-231.
[34] T. S. Caetano, S. D. Olabarriaga, and D. A. C. Barone, "Performance evaluation of single and multiple-Gaussian models for skin color modeling," in XV Simpósio Brasileiro de Computação Gráfica e Processamento de Imagens, Universidade Federal do Rio Grande do Sul, 2002.
[35] M. Yang and N. Ahuja, "Gaussian mixture model for human skin color and its application in image and video databases," in Conf. on Storage and Retrieval for Image and Video Databases (SPIE 99), University of Illinois at Urbana-Champaign, 1999.
[36] F. Dellaert, "The expectation maximization algorithm," Georgia Institute of Technology, Tech. Rep. GIT-GVU-02-20, 2002.
[37] T. M. Mitchell, Machine Learning. McGraw-Hill, 1997.
[38] I. Esteban and K. Balog, "EM for a mixture of Gaussians," University of Amsterdam, Tech. Rep., 2005.
[39] Y. Bar-Shalom and T. Fortmann, Tracking and Data Association. Academic Press, 1987.
[40] G. Welch and G. Bishop, "An introduction to the Kalman filter," University of North Carolina at Chapel Hill, Tech. Rep., 2002.
[41] H. Christensen, D. Kragic, and F. Sandberg, "Vision for interaction," in Intelligent Robotic Systems, ser. Lecture Notes in Computer Science, G. Hager and H. Christensen, Eds. Springer, 2001.
[42] D.-S. Lee, "Online adaptive Gaussian mixture learning for video applications," in ECCV 2004 Workshop on Statistical Methods for Video Processing, 2004.
[43] A. Chalechale, F. Safaei, G. Naghdy, and P. Premaratne, "Hand posture analysis for visual-based human-machine interface," in Proceedings of the WDIC2005, University of Wollongong, 2005, pp. 91-96.


[44] Y. Liu and Y. Jia, "A robust hand tracking and gesture recognition method for wearable visual interfaces and its applications," in Eighth IEEE International Symposium on Wearable Computers, Beijing Institute of Technology, 2004.
[45] L. Bretzner, I. Laptev, and T. Lindeberg, "Hand gesture recognition using multi-scale colour features, hierarchical models and particle filtering," in Proc. Face and Gesture 2002, Washington D.C., 2002.
[46] L. Brèthes, P. Menezes, F. Lerasle, and J. Hayet, "Face tracking and hand gesture recognition for human-robot interaction," in International Conference on Robotics and Automation, Coimbra University, 2004.
[47] S. Lu, D. Metaxas, D. Samaras, and J. Oliensis, "Using multiple cues for hand tracking and model refinement," in Proc. CVPR, Rutgers University, 2003.
[48] L. I. Smith, "A tutorial on principal component analysis," http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf, 2002.
[49] S. Eickeler and G. Rigoll, "Hidden Markov model based continuous online gesture recognition," in Int. Conference on Pattern Recognition (ICPR), University of Duisburg, August 1998, pp. 1206-1208.
[50] H.-K. Lee and J. H. Kim, "An HMM-based threshold model approach for gesture recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, October 1999, pp. 961-973.
[51] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, February 1989.
[52] S. Young, "The HTK Book," Cambridge University, htk.eng.cam.ac.uk, December 2002.
[53] M. Lourakis and A. Argyros, "Real time tracking of multiple skin-colored objects with a possibly moving camera," in Proceedings of the European Conference on Computer Vision, vol. 3, Foundation for Research and Technology - Hellas, Greece, May 2004, pp. 368-379.


A User guide

This manual assumes that the program is running on a Linux system, using a USB camera compatible with the video4linux API. Nevertheless, the program is also able to run using a set of images saved on hard disk, but it should then be compiled with the macro SOURCE re-defined accordingly and run with an extra option (the pictures' filename).

The program is called recognizer. If you run it in a console you will get the list of options you should provide to it.

./recognizer [--initialization] [--verbose n] [--lut_table filename]

--gesture_model | --capture_gestures filename

--skin_model filename --pca filename

- initialization turns on initialization mode. The program starts by updating the color model with the biggest blob on the scene for some seconds, then continues with normal execution.
- verbose changes the verbose mode. n goes from 0 to 4, with 1 being the default value.
- lut_table indicates the name of the LUT file. By default the system doesn't use any table, so images are in RGB color space.
- gesture_model tells the program to run in recognition mode. It loads the file with the HMM definitions.
- capture_gestures tells the program to run in training mode. It stores the gesture descriptions in the file.
- skin_model indicates the file which contains the skin model description.
- pca indicates the file which contains the PCA eigenspace description.

A.1 Running the system

If the available skin model file fits the scene where the interaction will take place, initialization mode can be skipped. Otherwise it is recommended to start it and spend five seconds moving a hand close enough to the camera, so the model can learn its pixels quickly.

The default verbose level is 1, which means that only one window pops up to show how tracking is doing. It is recommended to use level 2 while interacting with the system, so an extra window appears showing the skin segmentation results.

The output generated by the system goes through stdout, and it is just the names of the gestures and postures recognized. Extra verbose information goes through stderr. This allows the system to be piped to a control unit while the user can still see what is going on during execution. Example:

./recognizer <OPTIONS> | ./commander

Here commander can be a shell script to command whatever application is needed. For example, the following script can control Amarok, a well-known music player.


while true

do

read command

# POSTURE COMMANDS

if [ "${command}" = "STOP" ]

then

echo "POSTURE STOP -> Stopping player."

dcop amarok player stop

elif [ "${command}" = "RIGHT" ]

then

echo "POSTURE RIGHT -> Seeking forward 2 seconds."

dcop amarok player seekRelative 2

elif [ "${command}" = "LEFT" ]

then

echo "POSTURE LEFT -> Seeking backward 2 seconds."

dcop amarok player seekRelative -2

elif [ "${command}" = "UP" ]

then

echo "POSTURE UP -> Shuffling playlist."

dcop amarok playlist shufflePlaylist

elif [ "${command}" = "REST" ]

then

# Ignore this posture.

:

# GESTURE COMMANDS

elif [ "${command}" = "SPEED_UP" ]

then

echo "GESTURE SPEED_UP -> Setting volume up."

dcop amarok player volumeUp

elif [ "${command}" = "SPEED_DOWN" ]

then

echo "GESTURE SPEED_DOWN -> Setting volume down."

dcop amarok player volumeDown

elif [ "${command}" = "MOVE_RIGHT" ]

then

echo "GESTURE MOVE_RIGHT -> Playing next."

dcop amarok player next


elif [ "${command}" = "MOVE_LEFT" ]

then

echo "GESTURE MOVE_LEFT -> Playing previous."

dcop amarok player prev

elif [ "${command}" = "START_PAUSE" ]

then

echo "GESTURE START_PAUSE -> Starting player."

dcop amarok player playPause

# UNKNOWN COMMANDS

else

echo "WARNING: UNKNOWN COMMAND ->" "${command}"

fi

done

A.2 Training

The recognizer uses the information learned during the training steps to perform its work. There are three different trainers: skinTrainer, pcaTrainer and markovTrainer.

The first one (skinTrainer) uses a set of images and masks to learn the skin and background pixels and calculate the Mixture of n Gaussians. The input file should contain a sample image file name on one line, followed by another line with its mask file name. The result is a file containing all the information about both mixtures, the skin and the background models.

./skinTrainer [--lut_table filename] samplesfile modelfile

The second trainer (pcaTrainer) calculates the eigenvectors and eigenvalues for a set of shapes, also given as masks. The training file should contain a posture name and a file name per line. All the information related to the eigenspace will be written into a file as well.

./pcaTrainer samplesfile resultfile

Both of these trainers can be pretty slow, depending on the amount of training images given to them.

Gesture training is a bit more complicated and requires some more steps. First, some gestures need to be captured. The recognizer program does that work when running with the capture_gestures option. The user should then describe the same gesture several times, and the descriptions will be recorded in a file. Different gestures have to be recorded in different files.

While describing the gestures, some mistakes will be made. The program markovFilter helps to decide which gestures have to be rejected. The gestures are displayed one by one and can be accepted by pressing return or rejected by pressing n followed by return.


./markovFilter gesturefile filteredfile

Finally, the program markovTrainer is used to train an HMM with that gesture description file. It outputs a trained version of the template file. Training has to be performed for each gesture separately, and the resulting HMM definitions need to be concatenated into one file for use in recognition. Each gesture must be named manually in the definition file.

./markovTrainer default8.hmm filteredfile > gesture.hmm

The initial HMM template file, which contains the description of the model, should be formatted as:

<HMM> noName

<NumStates> 4

<NumDims> 8

<DimIdentifiers>

left_x left_y left_dx left_dy right_x right_y right_dx right_dy

<Means>

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

<Covariances>

1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ...

1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ...

1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ...

1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ...

<Transitions>

0.34 0.33 0.33 0.0

0.0 0.33 0.33 0.33

0.0 0.0 0.5 0.5

0.0 0.0 0.0 1.0

The DimIdentifiers section contains the names of the features to use as observations; they should correspond to column identifiers in the data files used for training. The Means section has one row for each state and one column for each observation dimension. The Covariances section has one row for each state, describing the NumDims x NumDims covariance matrix listed row by row. In the Transitions section, rows indicate the from state while columns indicate the to state. Transitions given zero probability are not considered possible by training at all.

The means and covariances of the template can be set to arbitrary values since they are initialized by training.

It is possible to evaluate an HMM definition file using the evalhmm program.

./evalhmm gesture.hmm datafile CORRECT_GESTURE_NAME

This will test the recognition of the HMMs defined in gesture.hmm against the example gestures in datafile (examples of one gesture). CORRECT_GESTURE_NAME is the name of the gesture being evaluated.

