

Vision-Based Tracking and Recognition of Dynamic Hand Gestures

Dissertation submitted in fulfillment of the requirements for the degree of Doktor der Naturwissenschaften (Doctor of Natural Sciences)

in the Faculty of Physics and Astronomy of the Ruhr-Universität Bochum

submitted by

Maximilian Krüger from Düsseldorf

Bochum 2007


First reviewer: Prof. Dr. Christoph von der Malsburg

Second reviewer: Prof. Dr. Andreas D. Wieck

Date of the oral defense: 11 December 2007


Acknowledgements

I am grateful to Prof. Dr. Christoph von der Malsburg for his guidance and support. He offered me the opportunity to start my research in the challenging field of computer vision and taught me to ask the question "How does the brain work?". His passion for this topic inspired me and the present work. Further, I would like to thank Prof. Dr. Andreas Wieck for his interest in being my second reviewer. Special thanks go to Dr. Rolf P. Würtz for many discussions and for his advice and support during my research and the writing of this thesis. Life at the institute has been so much easier with the help of Uta Schwalm and Anke Bucher; thank you for the administrative support. Acknowledgements also go to Michael Neef for providing the IT infrastructure.

For their patient help whenever I struggled with LaTeX or C++ problems I would like to thank Günter Westphal and Manuel Günter. I am also grateful to Marco Müller, who helped me with finding faces, and to Andreas Tewes for endless discussions on scientific or non-scientific matters.

I am glad to have friends like Thorsten Prustel and Bianca von Scheidt whoaccompanied and supported me from my first academic day on.

I thank my father Dietmar, my mother Margaretha, and my brother Christian and his family for all the nice moments and for showing me the strength of a family.

Words cannot express what you mean to me, my wife Seza and our son David.


Contents

1 Introduction

2 Survey of Sign Language Recognition Techniques
  2.1 Sign Language
  2.2 Previous Work on Sign Language Recognition

3 System Architecture
  3.1 Software Agents
  3.2 Multi-Agent System

4 Visual Tracking
  4.1 Multi-Agent Tracking Architecture
  4.2 Tracking Agent
  4.3 Recognition Agent

5 Feature Extraction
  5.1 Position
      5.1.1 Bunch Graph Matching for Face Detection
  5.2 Hand Posture
      5.2.1 Bunch Graph Matching for Hand Posture Classification
      5.2.2 Contour Information for Hand Posture Classification

6 Recognition
  6.1 Hidden Markov Models
      6.1.1 Elements of an HMM
      6.1.2 HMM Basic Problems
      6.1.3 Evaluation Problem
      6.1.4 Decoding Problem
      6.1.5 Estimation Problem
  6.2 Modification of the Hidden Markov Model
      6.2.1 Evaluation Problem
      6.2.2 Estimation Problem
  6.3 The HMM Recognition Agent
      6.3.1 Layer One: HMM Sensor
      6.3.2 Layer Two: Gesture HMM Integration
      6.3.3 Layer Three: Decision Center

7 Experiments and Results
  7.1 BSL Data
  7.2 Signer Dependent Experiments
  7.3 Different Signer Experiments

8 Discussion and Outlook
  8.1 Discussion
  8.2 Outlook

A Democratic Integration

B Gabor Wavelet Transformation

C BSL Database

D BSL Experiments Results

List of Figures

List of Tables

Bibliography

Curriculum Vitae


Chapter 1

Introduction

Communication between humans and computers is limited; it is done mainly by pressing keys, touching screens, giving verbal commands, or reading coded signs. Speech, however, is the dominant factor in communication between humans. It is often accompanied by intended or unintended gestures. Wherever the environment is not favorable to verbal exchange, gestures are the preferred way of communication, e.g., when background noise, long distances or language problems make verbal interaction between people difficult or impossible. The human brain is capable of understanding gestures quite easily and clearly. There are different ways to perform and interpret a gesture, using one or both hands, the arms, or a combination of them. Some gestures are static, e.g., showing a single hand posture, while other gestures are changing and need dynamics to transmit their information. The performance of computers could be greatly enhanced if they were able to recognize gestures and if their interaction with humans could be more "human".

Gesture recognition by computers offers new applications in industry (steering and control of robots) and in security (surveillance). Another important field for human computer interaction (HCI) lies in the recognition of sign language. A translation from the language of the deaf to that of the hearing would ease the communication in public institutions like post offices or medical centers. Thus, gesture recognition is an interdisciplinary field of research. Finding answers to questions like: How do we understand other people's behavior? How can we assign goals, intentions, or beliefs to the inhabitants of our social world? How can we combine and perform tracking and recognition? What data structures are suitable to store the information in a robust and fault-tolerant way? builds the basis for investigations in brain research and neural computation.

The current state of knowledge in the field of brain research is that gesture recognition or action recognition in primates and humans is managed by mirror neurons (Gallese and Goldman, 1998; Iacoboni et al., 2005).


A mirror neuron is defined as a neuron which fires both when an animal acts and when the animal observes the same action performed by another animal or person (Umiltà et al., 2001; Keysers et al., 2003). Thus, the neuron "mirrors" the behavior of another animal, as though the observer were itself acting. In humans, brain activity consistent with mirror neurons has been found in the premotor cortex and the inferior parietal cortex (Rizzolatti and Craighero, 2004). Hence, it seems that recognition is performed by comparing the current observations with a previously learned sequence. This makes it possible to predict the coming events during the observation and thus enhances the robustness of the whole process.

The aim of the present work is to build a recognition system which performs gesture recognition during the observation process. Each learned gesture is represented by an individual module that competes with the others during recognition. The data is acquired by a single camera facing the signer. No data gloves or electric tracking devices are used in the recognition system. Computer vision, and thus visual gesture recognition, has to deal with well-known problems of image processing, e.g., illumination changes and camera noise. Hence, the implemented recognition system has to be reliable in the presence of camera noise and a changing environment. Therefore, object tracking and object recognition are important steps towards the recognition of gestures. Thus, the system in the present work has to show robust feature extraction and adaptation to a changing environment and signer.

This is realized by applying different autonomous modules which cooperate in order to solve the given task. Following the principles of Organic Computing presented in Müller-Schloer et al. (2004), the robustness of the recognition system is enhanced by dividing a problem into different subtasks, which are solved by autonomous subsystems. All subsystems work on-line and can therefore help each other. Furthermore, they are able to flexibly adapt to new situations. The results of the different subsystems are combined to solve the overall task. This integration of information from different sources, like hand contour, position and their temporal development, presents, besides the coordination of these processes, the main challenge in creating the recognition system.

The implemented subsystems autonomously solve a part of the problem using techniques like democratic integration (Triesch and von der Malsburg, 2001a) for information merging, bunch graph matching (Lades et al., 1993) for face/object recognition, and a modified parallel Hidden Markov Model (HMM) (Rabiner, 1989), in which the information of the different subsystems is merged in order to recognize the observed gesture. Each of these techniques learns its knowledge from examples.


Figure 1.1: Trajectory variations — The signs "different" (a) and "bat" (b), shown with the trajectory differences of ten repetitions by the same professional signer. Both signs are part of the British Sign Language.

For this purpose the training is performed with minimal user interaction. Just like the one-click learning described in Loos and von der Malsburg (2002), the user only defines a similarity threshold value.

In order to realize the different autonomous units, their environment and the communication between them in a software framework, a multi-agent system (MAS) (Ferber, 1999) has been designed. The installed agents show self-x properties like dynamic adaptation to a changing environment (self-healing), perception of their environment, and the capability to rate their actions (self-control). They can easily be added or deleted during execution time and thus provide the needed flexibility.

Sign language is a good starting point for gesture recognition research because it has a structure. This structure allows methods to be developed and tested on sign language recognition before applying them to general gesture recognition. Therefore, the present work concentrates on the British Sign Language (BSL).

Like gesture recognition, sign language recognition has to deal with the challenge of recognizing the hand postures and their spatial and temporal changes that code the distinct sign. From a technical point of view, the projection of the 3D scene onto a 2D plane has to be considered. This results in the loss of depth information, and therefore the reconstruction of the 3D trajectory of the hand is not always possible.


Besides, the position of the signer in front of the camera may vary. Movements like shifting in one direction or rotating around the body axis must be considered, as well as the occlusion of some fingers or even a whole hand during signing.

Despite its constant structure, each sign shows plenty of variation in time and space. Even if the same person performs the same sign twice, small changes in speed and position of the hands will occur. An example is presented in fig. 1.1, which illustrates the different trajectories that occur when the same sign is performed ten times by a professional signer. Generally, a sign is affected by the preceding and subsequent signs. This effect is called co-articulation. Part of the co-articulation problem is that the recognition system has to be able to detect sign boundaries automatically, so that the user is not required to segment the sign sentence into single signs. Finally, any sign language or gesture recognition system should be able to reject unknown signs and gestures, respectively.

A graphical overview of the sign language recognition system applied in the present work is given in fig. 1.2. In order to explain the applied system, this thesis is structured as follows: Chapter 2 gives an overview of previous work in the field of sign language recognition. Subsequently, chapter 3 describes the multi-agent system architecture, in particular the constructed agents and the applied methods. The visual tracking is presented in chapter 4. Chapter 5 introduces the different features that are applied for the sign language recognition, which is presented in chapter 6. A description of the experiments undertaken and their results can be found in chapter 7. The present work is discussed in chapter 8, where future work is outlined.


Figure 1.2: Sign Language Recognition Work Flow — The sign language recognition system is divided into four work steps. Starting with the recording of the sign sequence using a monocular camera, the tracking is performed by multi-cue fusion, tracking each object separately. The next step is the feature extraction, where the position of the hands and the corresponding hand posture are extracted and processed in order to enter the sign recognition. There, the received features are integrated to recognize the performed sign.


Chapter 2

Survey of Sign Language Recognition Techniques

Research in sign language recognition is connected to the fields of spoken language recognition and computer vision: spoken language recognition, because it deals with the similar problem of recognizing a sequence of patterns in time (sound); and computer vision, because sign languages are visual languages, so computer vision is needed to collect and process the input data.

This chapter gives an overview of previous work in the field of gesture or, more precisely, sign language recognition. Each of the presented approaches, including the present work, focuses on the manual part of sign language. However, sign language is more complex, as will be shown before the sign language recognition systems are introduced.

2.1 Sign Language

In contrast to gestures, which are a typical component of spoken languages, sign languages present the natural way of communication between deaf people. Sign languages develop, like oral languages, in a self-organized way. An example which shows that sign language appears wherever communities of deaf people exist is reported by Bernard Tervoort¹ (Stokoe, 2005).

Like spoken languages, sign languages are not universal and vary from region to region. British and American Sign Language, for instance, are quite different and mutually unintelligible, even though the hearing people of both countries share the same oral language.

¹ Tervoort observed the development of a sign language in groups of deaf pupils. Although unacquainted with any sign language outside their own group, they developed signs that were only used by the group itself and tended to vanish when the group dispersed.


Furthermore, a sign language is not a visual reproduction of an oral language. Its grammar is fundamentally different, and thus it also distinguishes itself from gestures (Pavlovic et al., 1997). While the structure of a sentence in spoken language is linear, one word followed by another, the structure in sign language is not. It shows a simultaneous structure and allows parallel temporal and spatial configurations which code the information about time, location, person and predicate. Hence, even though the duration of a sign is approximately twice as long as the duration of a spoken word, the duration of a signed sentence is about the same (Kraiss, 2006).

Differing from pantomime, sign language does not include its environment. Signing takes place in the 3D signing space which surrounds the trunk and the head of the speaker. The communication is done by simultaneously combining hand postures, orientation and movement of the hands, arms or body, facial expressions and gaze. Fig. 2.1 (a) illustrates some signs of the British Sign Language (BSL), one-handed or two-handed with facial expression.

In addition to the signs, each sign language has a manual alphabet for finger spelling. Fig. 2.1 (b) depicts the alphabet performed in the BSL. Differing from the American Sign Language, its letters are performed with two hands. The finger spelling codes the letters of the spoken language and is mainly used to spell names or oral language words. In contrast to written text for spoken language, an equally accepted notation system does not exist for sign language (Kraiss, 2006). Although some systems like signwriting.org (2007) or HamNoSys (2007) exist, no standard notation system has been established.

As mentioned above, the performance of sign language can be divided into manual (hand shape, hand orientation, location and motion) and non-manual (trunk, head, gaze, facial expression, mouth) parameters. Some signs can be distinguished by manual parameters alone, while others remain ambiguous unless additional non-manual information is made available. If two signs differ in only one parameter, they are called a "minimal pair". The following recognition systems, including the present work, concentrate on manual features and investigate one-handed signs, performed by the dominant hand only, and two-handed signs, which can be performed symmetrically or non-symmetrically.


Figure 2.1: BSL — Both figures are kindly provided by the Royal Association for Deaf people RAD (2007) and show examples taken from the British Sign Language. The left image (a) shows the variety of performed signs, which include one-handed and two-handed signs, and the importance of facial expressions. On the right hand side (b) the finger spelling chart of the BSL is depicted.

2.2 Previous Work on Sign Language Recognition

Sign language recognition has to solve three problems. The first challenge is the reliable tracking of the hands, followed by robust feature extraction as the second problem. Finally, the third task concerns the interpretation of the temporal feature sequence. In the following, some approaches to solve these problems, which have inspired this work, are presented.

Starner and Pentland (1995) analyze sign gestures performed by one signer wearing colored gloves.


After color segmentation and the extraction of position and contour of the hands, their recognition is based on continuous sentences of signs, which are bound to a strict grammar, using trained Hidden Markov Models (HMMs). Their work is enhanced in Starner et al. (1998) by changing to skin color data collected from a camera in front of the speaker. In a second system the camera is mounted in a cap worn by the user.

Hienz et al. (1999) and Bauer and Kraiss (2002) introduce an HMM-based continuous sign language recognition system which splits the signs into subunits to be recognized. The needed image segmentation and feature extraction is simplified by using gloves with different colors for fingers and palm. Thus, the extracted sequence of feature vectors reflects the manual sign parameters. The same group has built another recognition system that works with skin color segmentation and uses a multiple tracking hypothesis system (Akyol, 2003; Zieren and Kraiss, 2005; von Agris et al., 2006). The winning hypothesis is determined at the end of the sign. The authors include high-level knowledge of the human body and the signing process in order to compute the likelihood of all hypothesized configurations per frame. They extract geometric features like axis ratio, compactness and eccentricity of the hands segmented by skin color and apply HMMs as well. von Agris et al. (2006) use the recognition system for signer-independent sign language recognition.

Instead of colored gloves, Vogler and Metaxas (1997, 1999, 2001) use 3D object shape and motion extracted with computer vision methods as well as a magnetic tracker fixed at the signer's wrists. They propose a parallel HMM algorithm to model gesture components and recognize continuous signing sentences. Shape, movement and location of the right hand, along with movement and location of the left hand, are represented by separate HMM channels, which were trained with relevant data and features. For recognition, individual HMM networks were built in each channel and a modified Viterbi decoding algorithm searched through all the networks in parallel. Path probabilities from each network that went through the same sequence of words were combined. This work is enhanced in Vogler and Metaxas (2003) using multiple channels, by integrating 3D motion data and cyberglove hand posture data. Tanibata et al. (2002) propose a similar scheme for isolated word recognition in the Japanese Sign Language. The authors apply HMMs which model the gesture data from the right and left hand in a parallel mode. The information is merged by multiplying the resulting output probabilities.

Richard Bowden's group structures the classification model around a linguistic definition of signed words (Bowden et al., 2003; Ong and Bowden, 2004; Kadir et al., 2004). This enables signs to be learned reliably from few training examples. Their classification process is divided into two stages.


The first stage generates a description of hand shape and movement using skin color detection. The extracted features are described in the same way that is used within sign linguistics to document signs. This description allows broad generalization and therefore significantly reduces the requirements of further classification stages. In the second stage, Independent Component Analysis (ICA) (Comon, 1994) is used to separate the sources of information from uncorrelated noise. The final classification uses a bank of Markov chains to recognize the temporal transitions of individual signs.

All of the presented works are very inspiring and offer different interesting approaches to overcome the various problems of sign language recognition. Most of the introduced systems work off-line, meaning they collect the feature sequence and start recognition only after the gesture has been performed.

The approach in the present work divides the problem into different subtasks that are solved by autonomous subsystems. Instead of single-color tracking, a self-organized multi-cue tracking of the different body parts is applied. As in the papers mentioned above, an HMM approach is chosen for the temporal recognition. The HMM method is extended by introducing self-controlling properties, allowing the recognition system to perform its recognition on-line during observation of the input sequence.


Chapter 3

System Architecture

The aim of this thesis is to develop a recognition system that distributes the task of recognition to a set of subsystems. Fig. 3.1 depicts the idea of splitting the process of sign language recognition into three main subsystems. There is one subsystem for object tracking and one for object recognition. Both provide the input data for the sign language recognition subsystem. Each subsystem works autonomously, but cooperation is needed to achieve the overall goal. Chapter 4 explains the subsystem for visual object tracking of the head and the hands. The object recognition subsystem is used for the recognition of static hand postures and face detection. The applied techniques for object recognition are described in chapter 6. Both systems continuously exchange information about the input data from tracking and the recognized object. The information flow to the sign language recognition system is one-way and comprises the preprocessed features, which are integrated in order to recognize the observed sign.

Each subsystem is technically realized by one or more software agents. Software agents are a popular way to implement a framework of autonomous subsystems on a computer. They provide the autonomy, flexibility and robustness that suit the demands of the tracking and recognition system. As there is more than one agent in use, a multi-agent system (MAS) has been developed to provide the infrastructure.

This chapter will give a short introduction to the field of software agents and the implemented MAS. The whole recognition system is written in C++ using the image processing libraries FLAVOR (Rinne et al., 1999) and Ltilib (Kraiss, 2006).


Figure 3.1: System Architecture — The sign language recognition system is divided into three subsystems. The object tracking and the object recognition subsystems work on the sequence of input images and are responsible for the robust preprocessing of the features that enter the third subsystem. The sign language recognition subsystem integrates the received information in order to determine the most probable sign. The data flow between the subsystems is denoted by arrows.

3.1 Software Agents

The term software agent describes a software abstraction, or an idea, similar to object-oriented programming terms such as methods, functions, and objects (Wikipedia, 2007). The concept of an agent provides a convenient way to describe a complex software entity that is capable of acting with a certain degree of autonomy in order to accomplish given tasks.

Software agents act in a self-contained manner and are capable of making independent decisions. They take actions to satisfy internal goals based upon their perceived environment (Liu and Chua, 2006). Thus, they are able to exhibit goal-directed behavior by taking the initiative (Nikraz et al., 2006).

Unlike objects, which are defined in terms of methods and attributes, an agent is defined in terms of its behavior. Franklin and Graesser (1997), Bigus (2001) and Wooldridge (2002) discuss four main concepts that distinguish agents from arbitrary programs: persistence, which is the concept of continuously running code that is not executed on demand, but rather decides for itself when it should perform which activity; and autonomy, which enables the agent to be task-selective, set its priorities, work goal-directed and make decisions without human intervention. The agent has control over its actions and internal state.


Figure 3.2: MAS Objects — The multi-agent system contains three base classes. The environment and the blackboard are singleton objects and administrate the input data and the communication. Different kinds and a varying number of agents are installed to solve the subtasks of sign language recognition. The arrows denote the data flow between the interfaces of the object classes.

Its social ability allows the agent to engage with other components through some kind of communication and coordination. Finally, reactivity enables the agent to perceive the context in which it operates and react to it appropriately: it perceives the surrounding area, to which it can adapt. These concepts also distinguish agents from expert systems, which are not coupled to their environment and are not designed for reactive or proactive behavior (Wooldridge, 2002).

3.2 Multi-Agent System

The whole recognition system is built on a multi-agent system developed earlier by Krüger et al. (2004). The MAS constitutes the framework for the implementation of the different subsystems and provides the interface for communication. It has to solve problems like a changing number of agents, caused by occlusion or tracking failure, and the communication between the different entities (Ferber, 1999). Although the agents are equipped with all the abilities required to fulfill their individual task, they are not supposed to have all data or all methods available to solve the overall goal of sign language recognition. The robustness and flexibility of the present work is achieved by the idea of splitting a complex task into smaller and simpler subtasks. Hence, as demanded by Liu and Chua (2006), collaboration and communication between the agents have to be installed.

As depicted in fig. 3.2, the MAS consists of three base classes of objects: the environment, the blackboard, and the agent. While environment and blackboard are realized as singleton objects (Buschmann et al., 1996), there can be a multitude of different agents.


These agents handle tasks ranging from the coordination of subprocesses and the tracking of an image point up to the recognition of human extremities.

Environment

The information about the world is supplied by the environment. Based on the desired functionality of visual tracking and recognition, the environment provides access to image sequences, e.g., the current original color image and its processed versions, like the gray value image and the difference image between two consecutive video frames.

Blackboard

Communication within the system is done via the blackboard. For this purpose, a message can be quite complex, e.g., it can carry an image, and it has a defined lifetime. Each agent can write messages onto the blackboard and read the messages other agents have posted. Thus, message handling allows the creation of new agents with specific properties, the change of properties and also the elimination of agents during run-time.
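To make this message handling concrete, the following minimal C++ sketch shows a blackboard whose messages carry a sender, a topic, an arbitrary payload and a lifetime measured in frames. The class and member names are illustrative assumptions and do not reproduce the actual implementation.

#include <algorithm>
#include <string>
#include <vector>

// Illustrative message type: sender, topic, payload and a lifetime in frames.
struct Message {
    std::string sender;   // e.g. "TrackingAgent_1"
    std::string topic;    // e.g. "TARGET_POSITION"
    std::string payload;  // serialized content; could also carry an image
    int lifetime;         // remaining frames before the message expires
};

class Blackboard {
public:
    void post(const Message& m) { messages_.push_back(m); }

    // Agents read all messages posted under a given topic.
    std::vector<Message> read(const std::string& topic) const {
        std::vector<Message> result;
        for (const Message& m : messages_)
            if (m.topic == topic) result.push_back(m);
        return result;
    }

    // Called once per frame: expired messages are removed.
    void tick() {
        for (Message& m : messages_) --m.lifetime;
        messages_.erase(std::remove_if(messages_.begin(), messages_.end(),
                                       [](const Message& m) { return m.lifetime <= 0; }),
                        messages_.end());
    }

private:
    std::vector<Message> messages_;
};

In such a scheme, creating, changing and deleting agents at run-time amounts to posting and reading suitable messages.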

Agent

The agent is the most interesting entity because it shows the above-mentioned properties. In order to implement this behavior, agents have three layers, which are shown in fig. 3.3. The top layer, called AgentInterface, administrates the communication and provides the interface to the environment. The fusion center in the second layer is called cueIntegrator and merges the information supplied by one or more sensors in layer three. The perception of the surrounding area is twofold: first, there is the message handling via the blackboard, and second, an agent can receive information through its sensors, which filter the input data coming directly from the environment. Based on the collected information the agent reaches a decision about further actions.
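A corresponding C++ sketch of the three-layer design is given below. It reuses the Blackboard sketch above; Frame stands for whatever image type the environment supplies, and all interfaces are assumptions meant to illustrate the relation between AgentInterface, cueIntegrator and Sensors rather than the real classes.

#include <memory>
#include <vector>

class Blackboard;                       // see the blackboard sketch above
struct Frame { /* image data supplied by the environment */ };

// Layer three: a sensor filters the raw input and returns a measurement.
struct Sensor {
    virtual ~Sensor() = default;
    virtual double measure(const Frame& frame) = 0;
};

// Layer two: fuses the measurements of all attached sensors.
struct CueIntegrator {
    double integrate(const std::vector<double>& measurements) const {
        if (measurements.empty()) return 0.0;
        double sum = 0.0;
        for (double m : measurements) sum += m;
        return sum / measurements.size();
    }
};

// Layer one: communication with the blackboard plus the decision logic.
class Agent {
public:
    virtual ~Agent() = default;
    void addSensor(std::unique_ptr<Sensor> s) { sensors_.push_back(std::move(s)); }

    // One processing cycle: perceive, integrate, decide.
    void step(const Frame& frame, Blackboard& blackboard) {
        std::vector<double> measurements;
        for (auto& s : sensors_) measurements.push_back(s->measure(frame));
        decide(integrator_.integrate(measurements), blackboard);
    }

protected:
    // E.g. post a message, adapt the sensors, or terminate the agent.
    virtual void decide(double fusedResult, Blackboard& blackboard) = 0;

private:
    std::vector<std::unique_ptr<Sensor>> sensors_;
    CueIntegrator integrator_;
};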

As mentioned above, the sign language recognition is separated into the subtasks of object tracking and recognition (of objects and gestures). Each subtask is solved by one or more agents. Hence, teamwork and an observer/controller architecture are essential. For this purpose, three main classes of agents are implemented in the MAS: the tracking agents, the agents for recognition and the agents for control.

Tracking agents merge different visual cues like color, texture and movement to follow an object. Cue fusion is done using democratic integration (Triesch and von der Malsburg, 2001a). This technique offers a self-organized, flexible, and robust way of tracking and will be explained in chapter 4.


Figure 3.3: Design of an Agent — An agent is based on three modules. The connection to the environment and the communication are handled by the AgentInterface module. It is followed by one cueIntegrator, the module that integrates and interprets the information provided by one or more Sensors.

Agents which provide stored world knowledge are called recognition agents and are applied for face and hand posture recognition. Training the recognition agents, i.e., learning world knowledge from examples, is also a crucial task requiring autonomously working methods (Würtz, 2005).

One key point of the recognition system is its independence from userinteraction. Hence, controlling agents are responsible for solving the conflictsthat might occur during execution.


Chapter 4

Visual Tracking

Human motion tracking, i.e., the tracking of a person or parts of a person, is a difficult task. Viewed on a large scale, humans look nearly the same; they have a head, two shoulders, two arms, a torso and two legs. However, they tend to be very different at smaller scales, having different clothes, shorter arms, et cetera¹. The same holds true for the movement of a person: when repeating the same sign, the trajectories and dynamics can be quite different.

Building on the framework presented in chapter 3, the tracking of the head and the left and right hand relies on visual information and is solved by a cooperation of globally and locally acting agents. Following Moeslund and Granum (2001), the implemented visual tracking has to solve two tasks: first, the detection of the interesting body parts, and second, their reliable tracking. An example tracking sequence is shown in fig. 4.1, where the implemented tracking system runs in a scene with a complex background which includes moving objects. This chapter will introduce the applied agents and their interaction for object tracking.

The description begins with the architecture, where the important modules are presented. Subsequently, the connection of the different agents is explained. The features which are applied for sign language recognition in the present work focus on the information extracted from the head and both hands of the signer. Thus, the head and hands have to be detected to start the tracking. For this purpose, the whole image is scanned for skin-colored moving blobs to get a rough estimate of the body parts. These blobs are the only constraints applied for defining a region of interest (ROI). When detected, their position and size are passed to the control agent which administrates the tracking agents.

¹ The problems that occur when learning models of articulated objects like the human body are investigated in Schäfer (2006).


Figure 4.1: Tracking Sequence — In this tracking sequence head and hands were found. Object identity is visualized by the color of the rectangles, which delineate the attention region of each tracking agent. Moving skin color in the background is ignored.

The task of a tracking agent is twofold. It tracks its target and, as shown in fig. 4.5, provides the information needed by the object recognition agents. In order to enhance the tracking, the tracking agent uses a multi-cue tracking method that adapts to changes of the target. Each cue is initialized with cue values extracted from the object during the agent's initialization. The tracking agents will be explained in detail in the second part of this chapter, in section 4.2.


Figure 4.2: Applied Agents — The arrows denote the data flow between the different agents. The attention agent searches for possible positions of head and hands; its results are passed to the tracking control agent. The tracking control agent verifies whether the object is already tracked by a tracking agent and otherwise instantiates a new one. Recognition agents continuously try to identify the objects that are tracked by the tracking agents.

4.1 Multi-Agent Tracking Architecture

Fig. 4.2 pictures the agents which are implemented in the tracking system. While the attention agent and the tracking control agent are only used to detect and administrate new ROIs, the tracking agents and the recognition agents run throughout the whole tracking and recognition process.

The object tracking starts with the attention agent scanning the whole image for the signer's head and hands. The identification of the foreground or target regions constitutes an interpretation of the image based on knowledge, which is usually specific to the application scenario. Since contour and size of a hand's projection onto the two-dimensional image plane vary considerably, color is the feature most frequently used for hand localization in gesture recognition systems. Although the color of the recorded object depends on the current illumination, skin color has proven to be a robust cue for human motion tracking (Jones and Rehg, 2002).

Just like in Steinhage (2002), the attention agent has two sensors, one scanning for skin color and the other for movement. Each sensor produces a two-dimensional map that codes the presence of colors similar to skin color and of motion, respectively. These binary maps are merged in the cueIntegration unit by applying the logical conjunction to obtain the skin-colored moving blobs.


At the current stage these blobs are likely, but not yet verified, to be the head and the two hands of the signer. In the next step, the detected blobs are segmented and used to define the object's position and size. In order to reject artifacts due to noise, small segments are filtered out.
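The conjunction of the skin-color and motion maps and the subsequent size filtering can be sketched in C++ as follows; the binary-map representation and the helper names are assumptions for illustration and not the interface of the actual attention agent.

#include <cstddef>
#include <cstdint>
#include <vector>

// A binary map: 1 where the cue responds (skin color or motion), 0 elsewhere.
struct BinaryMap {
    int width = 0, height = 0;
    std::vector<uint8_t> data;  // row-major, width * height entries
};

// Logical conjunction of the skin-color map and the motion map:
// only pixels that are both skin colored and moving survive.
BinaryMap conjunction(const BinaryMap& skin, const BinaryMap& motion) {
    BinaryMap out;
    out.width = skin.width;
    out.height = skin.height;
    out.data.resize(skin.data.size());
    for (std::size_t i = 0; i < skin.data.size(); ++i)
        out.data[i] = skin.data[i] & motion.data[i];
    return out;
}

// Region of interest: position and size of a skin-colored moving blob.
struct ROI { int x, y, width, height; };

// After connected-component segmentation of the conjunction map (not shown),
// blobs smaller than minArea are rejected as noise before being passed on
// to the tracking control agent.
std::vector<ROI> filterSmallBlobs(const std::vector<ROI>& blobs, int minArea) {
    std::vector<ROI> kept;
    for (const ROI& b : blobs)
        if (b.width * b.height >= minArea) kept.push_back(b);
    return kept;
}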

However, the global search is expensive in terms of computational time. Thus, the attention agent scans for these abstract cues and is only active as long as the head and hands have not been found in the image. It becomes reactivated if one of the tracking agents loses its target.

Position and size of the remaining ROIs are then passed to the tracking control agent, which supervises the tracking. It checks whether the target is already tracked and, if this is not the case, a new tracking agent is instantiated and calibrates its sensors to follow the object.

4.2 Tracking Agent

Object tracking is performed by tracing an image point on the target by its corresponding tracking agent. The tracking agents take on this task by scanning the local surroundings of the previous target position in the current frame. This area is called the tracking agent's attention region. Local agents have the advantage that they save computational time. Furthermore, they do not get confused by objects that are outside their region of interest.

Differing from the detection of ROIs, skin color and movement are not enough for reliable tracking. Therefore, the tracking agent uses a more robust method for object tracking. As shown in fig. 4.3, it does not rely on a single feature but instead integrates the results of four different information sources. In order to integrate this information, the tracking agent applies a scheme called democratic integration (Triesch and von der Malsburg, 2001a), which is explained in detail in appendix A.

The information acquisition for each feature is realized through a feature-specific sensor, namely pixel template, motion prediction, motion and color information. Each sensor contains a prototype of its corresponding feature. During the instantiation of the agent, the prototypes are extracted at the image position appointed by the tracking control agent. Hence, they are adjusted to the object's individual features and are capable of adapting themselves to new situations during the tracking process (see below). Scanning the attention region of its corresponding tracking agent, each sensor produces a similarity map by comparing its prototype with the image region. In order to determine the new position of the target object, the tracking agent's cueIntegrator computes a weighted average of the similarity maps derived from the different sensors.


Figure 4.3: Democratic Integration — As the person is entering the scene, there is only one tracking agent in charge. On the left, the tracking result of the sensor is marked with a circle; the rectangle shows the border of the agent's search region. On the right, the similarity maps created by the different sensors are given, from left to right: color, motion, motion prediction and pixel template. The fusion center shows the resulting saliency map obtained by applying the democratic integration scheme.

Out of this map, the position with the highest value is assigned as the current target position.

This position is fed back to the sensors and serves as the basis for two types of adaptation. First, the weights of the sensors are adapted according to their agreement with the overall result. Second, the sensors adapt their internal parameters in order to make their output match the determined collective result better.
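A compact C++ sketch of this fusion and weight adaptation is given below. It follows the general idea of democratic integration as described here and in appendix A (a weighted sum of similarity maps, cue quality measured at the agreed target position, weights relaxed towards the normalized qualities); the concrete update rule and all names are assumptions for illustration.

#include <cstddef>
#include <vector>

// One similarity map per cue, all of equal size (row-major).
struct SimilarityMap {
    int width;
    int height;
    std::vector<double> values;
    double at(int x, int y) const { return values[static_cast<std::size_t>(y) * width + x]; }
};

struct Position { int x; int y; };

// Fuse the cue maps with the current weights and return the most salient
// position (assumes at least one map and one weight per map).
Position fuse(const std::vector<SimilarityMap>& maps,
              const std::vector<double>& weights) {
    Position best{0, 0};
    double bestValue = -1.0;
    for (int y = 0; y < maps[0].height; ++y)
        for (int x = 0; x < maps[0].width; ++x) {
            double v = 0.0;
            for (std::size_t c = 0; c < maps.size(); ++c)
                v += weights[c] * maps[c].at(x, y);
            if (v > bestValue) { bestValue = v; best = Position{x, y}; }
        }
    return best;
}

// Adapt the weights: cues that agree with the collective result (high response
// at the chosen position) gain influence, the others lose it; rate is in (0,1].
void adaptWeights(const std::vector<SimilarityMap>& maps, Position target,
                  double rate, std::vector<double>& weights) {
    std::vector<double> quality(maps.size(), 0.0);
    double sum = 0.0;
    for (std::size_t c = 0; c < maps.size(); ++c) {
        quality[c] = maps[c].at(target.x, target.y);
        sum += quality[c];
    }
    if (sum <= 0.0) return;  // no cue responded; keep the old weights
    for (std::size_t c = 0; c < weights.size(); ++c)
        weights[c] += rate * (quality[c] / sum - weights[c]);
}

With weights that initially sum to one, this update keeps the sum at one while shifting influence towards the cues that currently agree with the fused result.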

Spengler and Schiele (2001) criticize the democratic integration presented in Triesch and von der Malsburg (2001a) by noting that the system is limited to single-target tracking. Furthermore, they add that self-organizing systems like democratic integration are likely to fail in cases where they start to track false positives: the system may adapt to a wrong target, resulting in reinforced false-positive tracking. By introducing a multitude of local tracking agents, the limitation to single-object tracking has been overcome. It is true that the tracking agents can be susceptible to tracking false positives, but as shown in fig. 4.4 they perform well even in cases where the three objects are very close to each other.

After tracking the object in the current frame, each tracking agent evaluates its success. For this purpose, democratic integration provides a confidence value, which allows the agent to rate the object tracking.


Figure 4.4: "Meet" gesture in BSL — The four images are part of the sign "meet" performed in British Sign Language. As shown, the local tracking agents perform well and do not mix or switch to another object, even in situations where all three objects (head and hands) meet.

If the confidence value is below a certain threshold, the tracking result is not reliable and the object is defined to be lost, e.g., because it has left the scanning region of the tracking agent. If the tracking agent is not able to retrieve the object during the next two frames, it terminates its tracking. Before it is deleted, it sends a message to the attention agent, which will restart the search for ROIs on the whole image. This evaluation of the tracking is part of the self-healing of the system.
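A few lines of C++ summarize this self-evaluation; the sketch reuses the Blackboard and Message types from the sketch in chapter 3, and the function name, the message format and the exact control flow are assumptions.

// Per-frame self-evaluation of a tracking agent (illustrative sketch).
// confidence: value provided by democratic integration for the current frame.
// Returns true if the agent should terminate itself.
bool shouldTerminate(double confidence, double threshold,
                     int& framesWithoutTarget, Blackboard& blackboard) {
    if (confidence >= threshold) {
        framesWithoutTarget = 0;        // tracking is reliable in this frame
        return false;
    }
    if (++framesWithoutTarget <= 2)     // grace period of two frames
        return false;
    // Target could not be retrieved: ask the attention agent to rescan the image.
    blackboard.post({"TrackingAgent", "TARGET_LOST", "", /*lifetime=*/1});
    return true;                        // the caller deletes this agent
}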

Otherwise, if tracking has been successful, the tracking agent informs the other agents by posting a message containing the information needed for recognition. This information includes the current position, the contour of the target and an image of its attention region. Fig. 4.5 illustrates that the images sent by different tracking agents can overlap.

4.3 Recognition Agent

The messages coming from the tracking agents are read by three recognition agents with different functions. One scans the image included in the message for the presence of a face. The other two agents are specialized in hand posture recognition and apply two different approaches. Chapter 6 will give a detailed description of the recognition techniques used. The results of the recognition agents are fed back to the tracking agents and, furthermore, provide the input for the sign language recognition.

When initialized, a tracking agent does not know which kind of object it is tracking. This information is stored in a type flag, which is initialized with UNKNOWN.


Figure 4.5: Input images — The input image for recognition is the attention region of the tracking agent in charge. On the left is the whole image, where the tracked objects, i.e., head, left hand and right hand, are color coded. The individual attention region of each tracking agent is displayed on the right. In addition to position and contour information, these regions provide the input for the recognition agents, which use the bunch graph method introduced in section 5.1.1.

A tracking agent following an UNKNOWN target over a couple of frames (three in the present work) is probably tracking an uninteresting object. Therefore, it will delete itself after sending a message to the attention agent. Only a message from a recognition agent changes the type flag to HEAD, LEFT HAND or RIGHT HAND and thus allows the tracking to continue. After the recognition agents have sent their messages, the environment loads the next image and the next tracking cycle is started.
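The type-flag handling can be written down in a few lines of C++; the enum values mirror the flags named above, while the function and counter names are illustrative assumptions.

// Possible object types a tracking agent can be assigned.
enum class TargetType { UNKNOWN, HEAD, LEFT_HAND, RIGHT_HAND };

// Called once per frame for each tracking agent. 'recognized' is the result
// reported by the recognition agents for this agent's attention region.
// Returns false if the agent should delete itself.
bool updateTypeFlag(TargetType recognized, TargetType& flag, int& unknownFrames) {
    if (recognized != TargetType::UNKNOWN) {
        flag = recognized;     // a recognition agent identified the target
        unknownFrames = 0;
        return true;
    }
    if (flag != TargetType::UNKNOWN)
        return true;           // already identified in an earlier frame
    // Still UNKNOWN: give up after three frames (the value used in the present work).
    return ++unknownFrames < 3;
}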


Chapter 5

Feature Extraction

The sign language recognition system has to be stable and robust enough to deal with variations in the execution speed and the position of the hands. Hidden Markov Models (HMMs) can solve these problems. Their ability to compensate time and amplitude variations of signals has been amply demonstrated for sign language recognition, as described in Ong and Ranganath (2005) and chapter 2. Thus, the HMM method serves as the data structure to store and recognize learned feature sequences. The information integration used for sign language recognition is installed in the HmmRecognition agent, which embeds an extended, self-controlled Hidden Markov Model architecture. Before the extension of the HMM method is discussed in section 6.2, this chapter describes the input features to be processed by the HMM.

In order to be suitable for a recognition task, a feature should show a high inter-gesture variance, which means that it varies significantly between different gestures, and a low intra-gesture variance, which means that it shows only small variations between multiple productions of the same gesture. The first property means that the feature carries much information, while the second indicates that it is not significantly affected by noise or unintentional variations.

Thus, the features used for sign language recognition in the present work include the positions of both hands and the corresponding hand postures. As mentioned in the previous chapter, each feature assignment is performed by a special recognition agent. In contrast to the trajectory information, which comes directly from the corresponding tracking agent and is explained in section 5.1, the static hand posture has to be recognized and assigned to previously learned sign lexicons, as introduced in section 5.2. The generation of a sign lexicon is driven by examples and requires only minimal user interaction; it consists of one threshold for the similarity of two corresponding features.


5.1 Position

In order to make position information translation invariant, the face of the signer serves as the origin of a body-centered coordinate system, which is used to describe the position of both hands. The position is posted directly by the corresponding tracking agent and is treated in a continuous fashion. Thus, a Gaussian mixture model (Titterington et al., 1985) is used to store the position information. A recognition agent for face detection browses the messages of the tracking agents and advises the corresponding agent if a face has been detected in its attention region. The face detection implemented in the present work is performed by the bunch graph matching algorithm (Wiskott et al., 1997). The concept of the bunch graph has proved to be very flexible and robust in many circumstances and in the present work serves for hand posture classification as well.
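The translation-invariant position feature and its storage in a Gaussian mixture can be sketched as follows in C++; the isotropic covariance and all names are simplifying assumptions made for illustration, the actual model follows Titterington et al. (1985).

#include <cmath>
#include <vector>

struct Point2D { double x; double y; };

// Hand position expressed in a body-centered coordinate system:
// the detected face position serves as the origin.
Point2D toFaceCentered(const Point2D& hand, const Point2D& face) {
    return {hand.x - face.x, hand.y - face.y};
}

// One component of a Gaussian mixture over 2D positions
// (isotropic covariance to keep the sketch short).
struct GaussComponent {
    double weight;   // mixture weight; all weights sum to 1
    Point2D mean;
    double sigma;    // standard deviation
};

// Density of the face-centered hand position under the stored mixture.
double mixtureDensity(const Point2D& p, const std::vector<GaussComponent>& gmm) {
    const double kPi = 3.14159265358979323846;
    double density = 0.0;
    for (const GaussComponent& c : gmm) {
        double dx = p.x - c.mean.x;
        double dy = p.y - c.mean.y;
        double norm = 1.0 / (2.0 * kPi * c.sigma * c.sigma);
        density += c.weight * norm *
                   std::exp(-(dx * dx + dy * dy) / (2.0 * c.sigma * c.sigma));
    }
    return density;
}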

5.1.1 Bunch Graph Matching for Face Detection

Two of the recognition agents implemented in the system apply bunch graph matching for object recognition. One is the above-mentioned face detection agent, the other classifies static hand postures and will be introduced in section 5.2. Both agents need a database of stored bunch graphs, which are used for the matching. The object knowledge is learned from examples; e.g., for faces, the nodes of the graph are set on remarkable positions, so-called landmarks, like the tip of the nose, the eyes, etc. Bunch graph matching, developed at the Institut für Neuroinformatik, Systembiophysik, Ruhr University of Bochum, Germany, proved to be a reliable concept for face finding and face recognition in three international contests, namely the 1996 FERET test (Rizvi et al., 1998; Phillips et al., 2000), the Face Recognition Vendor Test (Phillips et al., 2003) and the face authentication test based on the BANCA database (Messer et al., 2004). A good overview is given in Tewes (2006), chapter 3.

Elastic Graph Matching

Elastic Graph Matching (EGM) is the umbrella term for model graph and bunch graph matching. Its neurally inspired object recognition architecture, introduced in Lades et al. (1993), has been successfully applied to object recognition, face finding and face recognition. One advantage of EGM is that it does not require a perfectly segmented input image. Instead, the attention region of the tracking agent in charge provides the input images. The task of object recognition using these images as input can be quite difficult.


Figure 5.1: Face detection using EGM — From left to right this figure illustrates the previously trained elastic graph (a), the input image provided by the tracking agent (b) and the result of the matching process (c). At the position marked in (c), the object information which is represented in the graph (a) reaches the highest similarity with the input image (b).

As depicted in fig. 4.5, the attention regions of tracking agents tracing different targets can overlap.

Fig. 5.1 outlines EGM. The data structure is a two-dimensional labeled elastic graph, which represents the learned objects. An example of an elastic graph for face detection is shown in fig. 5.1 (a). The nodes of an elastic graph are labeled with a local image description of the represented object. The edges are attributed with a distance vector. Object recognition using EGM works in an unsupervised way. It performs a search for the set of node positions on the input image which best matches, in terms of maximizing a measure of similarity, the local image description attached to each node of the corresponding graph. An example of EGM is illustrated in fig. 5.1. The elastic graph depicted in fig. 5.1 (a) is trained to represent a face. The input image in fig. 5.1 (b) contains a face which is not included in the training set. Fig. 5.1 (c) shows the result of the EGM. As expected, the face region of the input image shows the highest similarity with the faces stored in the elastic graph.

Model Graph and Model Graph Matching

As mentioned above, each node codes a local image description. The texture information of the neighborhood around the node position proved to be a reliable description when applied to the recognition of the object.


Figure 5.2: Construction of a bunch graph — Each node of the model graph (a) is attributed with a local texture description. This texture description is obtained by a wavelet transformation with a family of Gabor wavelets and stored in a so-called "jet". In order to increase the flexibility, several model graphs are added or overlaid to form a bunch graph (b).

Texture can be stored well using Gabor jets. The Gabor jets are obtained by convolving the input image with a family of Gabor wavelets. These wavelets differ in size and orientation. They code an orientation- and scale-sensitive description of the texture around a node position and are explained in detail in appendix B. Furthermore, appendix B lists the applied set of parameters which control the Gabor jets in the present work in tab. B.1 and tab. B.2.
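For reference, the Gabor wavelets commonly used in this line of work (Lades et al., 1993; Wiskott et al., 1997) have the following general form; the concrete parameter values are those listed in appendix B, and the notation here may differ slightly from the one used there.

\[
\psi_{\vec{k}_j}(\vec{x}) \;=\; \frac{k_j^2}{\sigma^2}\,
\exp\!\left(-\frac{k_j^2 \vec{x}^{\,2}}{2\sigma^2}\right)
\left[\exp\!\left(i\,\vec{k}_j\!\cdot\!\vec{x}\right) - \exp\!\left(-\frac{\sigma^2}{2}\right)\right]
\]

The wave vector \(\vec{k}_j\) determines scale and orientation of the wavelet, and a jet \(\mathcal{J} = \{a_j e^{i\phi_j}\}\) collects the magnitudes \(a_j\) and phases \(\phi_j\) of all filter responses at one image position.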

An elastic graph whose nodes are attributed with Gabor jets is called a model graph G. In fig. 5.2 (a) a Gabor jet is depicted with its corresponding Gabor wavelets, which differ in orientation and scale. The model graph in fig. 5.2 (a) shows that a Gabor jet is attached to each node.

The recognition process using a model graph is called graph matching. For the purpose of graph matching, a similarity function which compares Gabor jets with each other is applied in the present work. The jets have to be extracted from the input image. This is done by convolving the whole image with the same Gabor wavelets which were used to create the entries of the model graph. The resulting Gabor wavelet transformed input image is called Feature Set Image (FeaStImage) and is created by attaching the computed Gabor jets to all pixels of the input image.


In order to conduct a graph matching, several so-called moves are performed on the FeaStImage. Each move modifies the location of the graph's nodes, and thus the whole graph, with respect to the input image. After the node positions have been modified, the node entries of the model graph are compared with the corresponding jets of the FeaStImage. This comparison is done by using the similarity function Sabs of eq. (B.7) and is explained in detail in appendix B. The applied similarity function Sabs uses the magnitudes of the jet entries and has the form of a normalized scalar product. It returns a similarity value for each node. By averaging the similarity values over all nodes, the similarity value sG (eq. (B.8)) for the whole model graph can be obtained.¹ This similarity value sG is used to rate the result of the EGM. Thus, the object which is represented by the model graph is defined to be at the position in the input image where the highest sG has been computed.
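Written out, the normalized scalar product of the jet magnitudes and the resulting graph similarity take the following form (cf. eqs. (B.7) and (B.8); the notation is the standard one from the bunch graph literature and may deviate slightly from appendix B):

\[
S_{\mathrm{abs}}(\mathcal{J}, \mathcal{J}') \;=\;
\frac{\sum_j a_j\, a'_j}{\sqrt{\sum_j a_j^2 \,\sum_j a_j'^2}},
\qquad
s_G \;=\; \frac{1}{N} \sum_{n=1}^{N} S_{\mathrm{abs}}\!\left(\mathcal{J}_n, \mathcal{J}'_n\right),
\]

where \(a_j\) and \(a'_j\) are the magnitudes of the two jets being compared, \(\mathcal{J}_n\) is the jet attached to node \(n\) of the model graph, \(\mathcal{J}'_n\) is the jet of the FeaStImage at the corresponding image position, and \(N\) is the number of nodes.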

For object recognition, a set of previously learned bunch graphs (see below) is sequentially matched to the input image. The graph which receives the highest similarity sG represents the classification result. In case sG is below a certain threshold TsG, the recognition is considered to be uncertain and is neglected.

Bunch Graphs

Wiskott et al. (1997) introduce an extension to the concept of model graphs. Model graphs are enlarged to store more object information and thus enhance the object recognition. While for model graphs only one Gabor jet is attached to each node, the bunch graph attaches several Gabor jets, as shown in fig. 5.2 (b). These Gabor jets are taken from different presentations of the same class of objects, e.g., different faces or the same object in front of different backgrounds. They form a bunch of jets for each node position.

The object recognition procedure using bunch graph matching is similar to the model graph matching presented above. The calculation of the similarity value sB in eq. (B.9) differs only in the computation of the similarities at each node position. This has to be modified insofar as now one jet of the image is compared to the whole bunch of jets connected to the bunch graph's node. For this purpose, each jet in the bunch is compared to the jet of the FeaStImage using the jet similarity function already mentioned, and the maximal similarity is associated with the corresponding node. Finally, the similarity for the bunch graph is determined in the same way as for the model graph, by averaging over all nodes.
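In formula form, and with the same caveat about notation, the bunch graph similarity replaces the per-node comparison by a maximum over the bunch (cf. eq. (B.9)):

\[
s_B \;=\; \frac{1}{N} \sum_{n=1}^{N} \max_{m}\,
S_{\mathrm{abs}}\!\left(\mathcal{J}^{(m)}_n, \mathcal{J}'_n\right),
\]

where \(\mathcal{J}^{(m)}_n\) denotes the \(m\)-th jet in the bunch attached to node \(n\) and \(\mathcal{J}'_n\) is again the jet of the FeaStImage at the corresponding position.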

¹ This work dispenses with moves in which the model graph is distorted. Therefore the distortion term which can be added to sG is neglected.


Figure 5.3: BSL sign "different" — The sign "different" taken from the British Sign Language. It is performed by both hands moving synchronously from the inner region to the outside.

5.2 Hand Posture

The classification of a hand posture stands for the search for the most similar entry in a previously learned sign lexicon of hand postures. However, vision-based object recognition is a complicated task (von der Malsburg, 2004), and the human hand, as an articulated object, makes hand posture recognition even more difficult. It has 27 degrees of freedom (Pavlovic et al., 1997) and is therefore very flexible and hard to model (Wu et al., 2005). Even if the scaling problem is left out by limiting the setup to a fixed distance from the camera, the recognition still has to deal with the problems of rotation and lighting conditions, as described in Barczak and Dadgostar (2005).

Eliminating the rotation is performed by increasing the number of angles to which the model graph has to be compared. This makes the classification computationally expensive. Therefore, the present work neglects rotation invariance. Rotated hand images (fig. 5.3) are treated as different hand postures.

The lighting conditions are sensitive to different sources of light and to other objects producing shadows over the detectable object. The way the object appears may change dramatically with variations in the surrounding area. If new situations arise, it is unlikely that the object will be detected. Therefore, it is important to collect appropriate training images. Solved in the same way as the rotation problem, objects that include shadows or that are partially occluded can best be encoded if the training examples include these situations.

For the purpose of hand posture recognition the present work implements two different and independent classification methods. The first method is based on bunch graph matching, which has already been introduced for face detection, while the second classification method is contour matching based on shape context information. As the bunch graph and the contour matching approach are quite different and demand diverse input data, they are applied to complement each other in hand posture recognition. Nevertheless, the aim of both techniques is to extract the hand posture and assign it, if possible, to the most similar element of the corresponding sign lexicon.

Unlike the position, the hand postures are stored in a discrete manner. Thus, the first step of each technique is to build up a sign lexicon. It consists of a set of representative hand postures, each of which is assigned a unique index number. The multiplicity of hand postures makes it necessary to construct the sign lexicons of the hand postures by learning from examples instead of using hand-labeled data.

5.2.1 Bunch Graph Matching for Hand Posture Classification

Bunch graph matching has already been applied to hand posture recognition by Becker et al. (1999), Triesch and von der Malsburg (2001b) and Triesch and von der Malsburg (2002). As suggested in Triesch and von der Malsburg (2002), the classification is enhanced by using a bunch graph which models complex backgrounds. For this purpose, model graphs taken from one hand posture in front of different backgrounds are merged into a bunch graph. A hand posture in front of different backgrounds is shown in fig. 5.4. These bunch graphs proved to enhance the matching because the nodes which lie on the edge of the object always extract and compare their texture information with some parts (depending on the used scales of the Gabor kernel) of the background. Thus, during the matching process the nodes with the learned background most similar to the current background are used to compute the similarity between the bunch graph and the presented image.


(a) Original Image (b) Extracted Hand Posture (c) Different Background

Figure 5.4: Bunch Graph with Background — The nodes on the edge of the object always extract and compare their texture information with some part (depending on the used scales of the Gabor kernel) of the background. Thus, it proved to enhance the bunch graph matching if the hand posture is extracted from the original image fig. 5.4 (a). Finally, the extracted hand posture fig. 5.4 (b) can be pasted in front of different backgrounds, as pictured in fig. 5.4 (c). Each of these scenes produces a model graph to be added to the bunch graph. Instead of different representations of the object, this bunch graph stores descriptions of the same model with different backgrounds.

The bunch graphs used by Triesch and von der Malsburg (2001b) and Triesch and von der Malsburg (2002) are hand-labeled. Their node positions are manually placed at anatomically significant points, i.e., on the rim of the hand and on highly textured positions within the hand. Their work proved to give good recognition results on data sets containing 10 examples. In the case of sign language recognition the huge amount of variation makes it complicated to set up a sign lexicon of representative hand postures manually. As reported by Triesch and von der Malsburg (2002), problems result from the large geometric variations which occur between different instantiations of a hand posture. These variations would require the bunch graph to be distorted in a way that would allow every node to easily reach a number of false targets, and the overall performance would therefore become worse (Triesch and von der Malsburg, 2002). Hence, the bunch graphs in the present work are not distorted, knowing that this enlarges the sign lexicon.


Figure 5.5: Automatic model graph extraction — Examples of images for automatic model graph extraction. In contrast to the hand-labeled data, the node positions are not set to geometric markers.

Creation of the Sign Lexicon

Automatic and unsupervised creation of bunch graphs for hand posture classification has been a research topic in Ulomek (2007). The author investigates different methods for the automatic creation of a bunch graph, especially using corner information. To compare the different concepts, he computed the recognition rates and the average calculation times. His experiments show that bunch graphs having regularly spaced nodes perform better than the ones placing nodes on corner information². The more densely the nodes are set, the better is the recognition rate, at the expense of an increase in calculation time. Hence, the bunch graphs in the present work ignore corner information and are created by putting a grid upon the hand posture, as shown in fig. 5.5.

The sign lexicon is data driven, using a simple but automatic and unsupervised training and clustering of the model graphs. Model graphs are automatically extracted from segmented images showing a left or a right hand. On the foreground a node is placed at every fourth pixel. Thus, a very dense model graph, as illustrated in fig. 5.5, is obtained for each example. Model graphs with too few nodes are rejected to increase the robustness of the recognition, knowing that very small hand postures will not be included in the constructed sign lexicon.

After their extraction, the model graphs are clustered according to algorithm 1 to reduce the number of entries in the sign lexicon. However, a nodewise comparison using the similarity function of the matching process

²Although it is shown in Ulomek (2007) that the hand-labeled graphs perform better, they are not used in this work for the reasons stated above.


Algorithm 1: Clustering of model graphs

Data: List of input images, TsG = 0.95
  create List of model graphs
  sort List by size
  while List not empty do
      take next entry as matching graph
      foreach model graph after the matching graph do
          match matching graph on the image of model graph
          if sG ≥ TsG then
              model graph is represented by the matching graph
              delete model graph from List
          else
              match next entry in List
          end
      end
  end
Result: Database of model graph sign lexicon

only makes sense if the topology of the model graphs and the number of nodes to be compared are equal. This is true for face finding but cannot be realized for the multitude of hand postures. Thus, an indirect similarity is measured by matching the extracted model graphs on the input image of the other model graph, using the resulting sG and a threshold value TsG to decide the similarity between two hand postures. Hence, the number of elements, and thus the decision which postures are needed to represent the training set in the lexicon, emerges from the data itself. Only minor user interaction, by setting the similarity threshold value TsG, is necessary.
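The greedy clustering of algorithm 1 can be summarized in a few lines of Python. The sketch below is an illustration under stated assumptions: the `match` callable standing in for the elastic graph matching step and the pairing of each graph with its source image describe an assumed interface, not the actual implementation.

def cluster_model_graphs(graphs, match, t_sg=0.95):
    """Greedy clustering in the spirit of algorithm 1.

    graphs: list of (model_graph, source_image) pairs, assumed sorted by size
    match : callable(model_graph, image) -> similarity s_G of the match
    Returns the representative graphs that form the sign lexicon.
    """
    remaining = list(graphs)
    lexicon = []
    while remaining:
        rep_graph, _ = remaining.pop(0)          # next entry becomes the matching graph
        lexicon.append(rep_graph)
        survivors = []
        for graph, image in remaining:
            # indirect similarity: match the representative onto the other graph's image
            if match(rep_graph, image) >= t_sg:
                continue                          # represented by rep_graph, so drop it
            survivors.append((graph, image))
        remaining = survivors
    return lexicon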

To enhance the robustness of the recognition, a bunch graph is formed using the computed model graph and the same model graph with added background information, as mentioned above. The sign lexicons for the left and right hand are applied to train the HMMs (section 6.2) and to equip the recognition agent to perform its recognition task.

5.2.2 Contour Information for Hand Posture Classification

In contrast to bunch graph matching, hand posture classification using contour information can only be done by segmenting the image and splitting it into the hand as foreground and the rest as background. The input data containing the contour or shape of the static hand posture is provided in the message of the corresponding tracking agent. The agent performs color segmentation based on the current value of its color sensor to extract the contour.

The hand posture recognition using contour information is performed on closed hand contours like the ones depicted in fig. 5.6 (a). The measurement of similarity of two closed hand shapes is performed by searching for correspondences between reference points on both contours. In order to solve the correspondence problem, a local descriptor called shape context is attached to each point on the contour (Belongie et al., 2002). The present work uses the shape context as described in Schmidt (2006). At a reference point the shape context captures the distribution of the remaining points relative to its position. Thus, a global discriminative characterization of the contour is offered. Therefore, corresponding points on two similar contours have similar shape contexts.

Shape Context

A contour is represented by a discrete set of n uniformly spaced reference points Φ = {p1, . . . , pn}, where pi ∈ R². These points need not correspond to key points such as maxima of curvature or inflection points. Given that contours are piecewise smooth, the reference points are a good approximation of the underlying continuous contour as long as the number n of reference points is sufficiently large. As illustrated in fig. 5.6 (b) for a single shape context, its K bins are uniformly distributed in log-polar space. Thus, the descriptor is more sensitive to positions of nearby sample points than to those of more distant points. For a point pi on the contour, a histogram hi of the relative coordinates of the remaining n − 1 points is computed. This histogram is defined to be the shape context at pi.
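The following sketch illustrates how such log-polar histograms could be computed with numpy. The number of radial and angular bins, the log-spaced radii relative to the mean pairwise distance, and the normalization are illustrative choices, not the exact parameters used in this work or in Schmidt (2006).

import numpy as np

def shape_contexts(points, n_radial=5, n_angular=12):
    """Compute a log-polar shape context histogram for every reference point.

    points: (n, 2) array of uniformly spaced contour points.
    Returns an (n, n_radial * n_angular) array of normalized histograms h_i.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    diff = points[None, :, :] - points[:, None, :]        # pairwise offsets p_j - p_i
    dist = np.linalg.norm(diff, axis=2)
    angle = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)

    # log-spaced radial bin edges, relative to the mean pairwise distance
    mean_dist = dist[dist > 0].mean()
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_radial + 1) * mean_dist

    histograms = np.zeros((n, n_radial * n_angular))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue                                   # skip the reference point itself
            r_bin = np.searchsorted(r_edges, dist[i, j]) - 1
            if r_bin < 0 or r_bin >= n_radial:
                continue                                   # outside the log-polar diagram
            a_bin = int(angle[i, j] / (2 * np.pi) * n_angular) % n_angular
            histograms[i, r_bin * n_angular + a_bin] += 1
    return histograms / np.maximum(histograms.sum(axis=1, keepdims=True), 1)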

Matching with Shape Contexts

Finding correspondences between two contours is comparable to assigning the most similar reference points on both contours. The similarity is measured by comparing the appropriate shape contexts. Hence, the dissimilarity between two contours is computed as a sum of matching errors between corresponding points.

Consider a point pi on the first contour and a point qj on the second contour. Let Cij = C(pi, qj) denote the cost of matching these two points.


(a) Contour Extraction (b) Shape Context

Figure 5.6: Shape Context — Closed contours as shown on the left side (a) provide the input data for the recognition, which is based on local image descriptors. The shape context descriptor depicted in (b) collects contour information by recording the distribution of the reference points relative to a fixed point in log-polar coordinates.

As shape contexts are distributions represented as histograms, it is natural to use the chi-square test statistic:

C_{ij} \equiv C(p_i, q_j) = \frac{1}{2} \sum_{k=1}^{K} \frac{[h_i(k) - h_j(k)]^2}{h_i(k) + h_j(k)},   (5.1)

where hi(k) and hj(k) represent the K-bin normalized histograms at pi and qj, respectively. Given the matrix of costs Cij between all pairs of points pi on the first contour and qj on the second contour, and using the permutation ε, the minimum of the total cost of matching,

H(\varepsilon) = \sum_{i} C(p_i, q_{\varepsilon(i)}),   (5.2)

is subject to the constraint that the matching is one-to-one (Schmidt, 2006). In order to apply eq. (5.2), both contours have to have the same number of shape context points. Differing from Belongie et al. (2002), the present work restricts the matching to preserve topology. Thus, the order of the points is a stringent condition. Therefore, if point pi of the first contour corresponds to point qj of the second contour, then pi+1 has to correspond to qj+1, or to qj−1 in the opposite direction, respectively. The direction results from the computation of the best matching result.
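A hedged sketch of the matching step follows: the chi-square cost of eq. (5.1) is evaluated for all point pairs, and the topology-preserving correspondence is approximated by trying all cyclic shifts of the second contour in both traversal directions, returning the shape distance of eq. (5.3) for the best one. Restricting the search to cyclic shifts is an assumption made here for brevity, not necessarily the optimization used in the original work.

import numpy as np

def chi2_cost(h1, h2):
    """Chi-square matching cost of eq. (5.1) between two normalized histograms."""
    denom = h1 + h2
    denom[denom == 0] = 1.0                      # avoid division by zero for empty bins
    return 0.5 * np.sum((h1 - h2) ** 2 / denom)

def shape_distance(contexts_p, contexts_q):
    """Order-preserving matching of two contours with equally many reference points.

    contexts_p, contexts_q: (n, K) arrays of shape context histograms.
    Tries every cyclic shift in both directions (topology-preserving permutations)
    and returns the shape distance D_SC of eq. (5.3) for the best one.
    """
    n = len(contexts_p)
    assert len(contexts_q) == n, "both contours need the same number of points"
    cost = np.array([[chi2_cost(p, q) for q in contexts_q] for p in contexts_p])

    best = np.inf
    for direction in (+1, -1):                   # forward or reverse traversal of q
        for shift in range(n):
            idx = (direction * np.arange(n) + shift) % n
            total = cost[np.arange(n), idx].sum()
            best = min(best, total)
    return best / n                              # D_SC = H(eps_min) / n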


Shape Distance Matching

The similarity of two contours measured at their n reference points is expressed by their shape distance DSC. The shape distance is computed from the sum of the shape context matching costs over the n matched reference points, with εmin as the optimal permutation:

D_{SC} = \frac{1}{n} H(\varepsilon_{\min}).   (5.3)

The shorter the distance DSC, the higher is the similarity between the two contours. Hence, DSC can serve as a measure for a nearest neighbor classification.

Together with a threshold value TSC, the shape distance DSC serves to construct the sign lexicon by using a standard vector quantization as described in Gray (1990). Similar to the bunch graph method, the training of the sign lexicon is performed with only very little user interaction. Again, just a threshold for the maximal distance TSC, which declares two contours to be similar enough to be represented by one of them, has to be set.

The same distance of eq. (5.3) is used for the classification of the current hand posture. If no entry of the contour sign lexicon is below the threshold³, the current hand posture cannot be detected by the contour recognition agent.

³Reasons are mainly connected with poor color segmentation or too small shapes.


Chapter 6

Recognition

As presented in the previous chapter, the data sources for sign language recognition are the position and the posture of the left and the right hand. Since the hand posture is determined by two procedures, which take two different approaches to the classification, the results of bunch graph and contour classification are supposed to be independent and will be treated as two different features. Therefore, six features/observations can be extracted from each frame of the sign sequence.

These six feature streams are collected over time in order to come to a decision whether the presented sign is known to the system and, if so, which sign is performed. The feature integration method applied for the sign language recognition works on-line. The data is collected for every frame and is directly processed by the recognition system during the performance of the sign. Thus, the most probable sign for the image sequence up to the current frame can be given.

Sometimes, recognition of a static hand posture might fail in the bunch graph agent, in the contour matching agent, or in both of them. Therefore, the challenge is recognition with varying temporal execution of the sign as well as handling missing data or falsely classified data. The ability of Hidden Markov models to compensate time and amplitude variations of signals has been amply demonstrated for speech (Rabiner, 1989; Rabiner and Juang, 1993; Schukat-Talamazzini, 1995) and character recognition (Bunke et al., 1995). The use of an HMM is twofold. First, during the training phase, it is applied to store a specified data sequence and its possible variations; second, during the recognition phase, the stored information can be compared with an observed data sequence by computing the similarity over time.

Following a divide-and-conquer strategy, the present work uses a single HMM for each feature to store the temporal and spatial variations that occur when the sign is performed several times.


The task of the recognition system lies not only in recognizing the sequence of a single feature, e.g., the left hand position, but also in fusing the feature recognition information. This challenge of integrating different features is met by differentiating between strong and weak features. The position of the hands builds up the strong feature which gives the baseline of the recognition. The hand posture information is chosen as the weak features. They are bound to the strong features and will be used in a rewarding way. Thus, weak features will add confidence to the recognition of the sign if they have been detected in the observation sequence. Otherwise, they will neither change nor punish the recognition if they are absent.

The idea of defining position as the strong feature which gives the basis for sign recognition is taken from the early experiments by Johansson (1973), where moving light images suggest that many movements or gestures may be recognized by motion information alone. Besides, the way of the feature description confirms the strong and weak feature method in this work. Position has a Euclidean metric, which can describe the similarity of surrounding points, while recognition of the hand posture is based on discrete sign lexicons, where each entry is chosen to be dissimilar from the others.

In order to run the sign language recognition system in real world applications, the detection of the temporal start and end point of meaningful gestures is important. This problem is referred to as gesture spotting (Lee and Kim, 1999; Derpanis, 2004) and addresses the problem of co-articulation, in that the recognition system has to ignore the meaningless transition of the hands from the end of the previous sign to the start point of the following sign. Another critical point of consideration in any recognition system is sign rejection, the rejection of an unknown observation sequence.

The idea of the present work is to solve the previously mentioned problems by distributing the information of the signs among autonomous subsystems, one for each sign. Every subsystem calculates its similarity to the running sign sequence and decides about the start and end of the sign on its own.

This chapter starts with an introduction to the theory of HMMs¹. The structure and basic problems of the HMM are explained in section 6.1. The HMM has to be adapted to the problem of sign language recognition. In section 6.2 the modifications undertaken in the present work are motivated and explained in detail. The whole sign recognition is performed by a recognition agent that is presented in section 6.3.

¹The introduction of Hidden Markov models is based on the famous paper of Rabiner (1989), which is recommended for a more detailed description of this topic.


Figure 6.1: 4-State Markov Chain — Markov chain having four states (denoted by circles). Each state can emit just one symbol, e.g., S1. The transition to the next state depends on the transition probability distribution at each state. The Markov chain is a first order Markov process, where future states depend only on the present state.

6.1 Hidden Markov Models

The capabilities of a Hidden Markov model are to store a sequence of symbols² and to produce a sequence of observations based on the previously stored information. The production of the observation sequence is described by two stochastic processes. The first stochastic process applies the Markov chain (Meyn and Tweedie, 1993) and directs the transition from the current state to the next state. A Markov chain as illustrated in fig. 6.1 is characterized by the distribution of the transition probabilities at each state. Thus, by running through the chain, it produces a sequence of states like S1, S3, S3, . . .

²including the occurring variations of the sequence


Figure 6.2: Bakis Model — HMM with the left-right (Bakis) topology, typically used in gesture and speech recognition. The solid lines denote the transition probabilities, and thus aij = 0 if j < i ∨ j > i + 2. The dotted line connects a continuous observation distribution to the corresponding state (circle).

In order to produce a sequence of observations, each state can be attributed to an observation symbol. The transition through the chain is described by a Markov process of order 1, in which the conditional probability distribution of future states depends only on the present state.

In contrast to the Markov chain, where each state of the model can only emit a single symbol, states in the HMM architecture as depicted in fig. 6.2 can emit one symbol out of a distinct alphabet. The probability of emitting a symbol is stored for each state in the emission probability distribution over the alphabet. The probability of emitting a symbol in the state S1 can be interpreted as the probability to observe the symbol when being in state S1. Therefore, the emission probability distribution is called observation distribution in the recognition process. The decision of emitting a symbol represents the second stochastic process. As the observer of an HMM only observes the emitted symbols, while the emitting states remain unknown, the states of an HMM are called hidden. Due to their doubly stochastic nature, HMMs are very flexible and became well known in the gesture recognition community, see chapter 2.


6.1.1 Elements of an HMM

A Hidden Markov Model is characterized by the following elements:

1. N, the number of states in the model. The individual states are denoted by S = {S1, S2, . . . , SN}, and the state at time t by qt.

2. M, the number of distinct observation symbols per state, i.e., the size of a discrete alphabet. The observation symbols correspond to the physical output of the system being modeled. The included symbols are denoted by V = {v1, v2, . . . , vM}. If the observation is in a continuous space, M is replaced by a continuous distribution over an interval of possible observations.

3. The transition probability distribution A = {a_{ij}}, which is given by

   a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \quad 1 \le i, j \le N   (6.1)

   and which adds up for a fixed state S_i as

   \sum_{j} a_{ij} = 1.   (6.2)

Just like the state transition of the Markov chain, the state transition probability aij (eq. (6.1)) from state Si to state Sj only depends on the preceding state (first order Markov process). For the special case that any state can reach any other state in a single step (the ergodic model), the transition probabilities have aij > 0 for all i, j. For other types of HMM (e.g., the Bakis model in fig. 6.2), the transition probabilities can be aij = 0 for one or more Si, Sj state pairs.

4. The observation probability distribution B = {b_j(k)} in state j, where

   b_j(k) = P(v_k \text{ at } t \mid q_t = S_j), \quad 1 \le j \le N, \; 1 \le k \le M   (6.3)

   and

   \sum_{k} b_j(k) = 1.   (6.4)

The observation probability distribution B can be discrete or continuous. It describes the probability of emitting/observing the symbol vk when the model is in state j.

5. The initial distribution π = {π_i}, where

   \pi_i = P(q_1 = S_i), \quad 1 \le i \le N   (6.5)

   specifies the probability of the state S_i to be the first state when starting the HMM calculations.


Thus, a complete specification of an HMM consists of two model parameters (N and M), the specification of observation symbols, and the specification of the three probabilistic measures A, B and π. In the following, the compact notation

\lambda = (\pi, A, B)   (6.6)

will be used to indicate the complete parameter set of the model λ.
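As a small illustration of this parameter set, the following container collects π, A and B and checks the stochasticity constraints of eqs. (6.2) and (6.4); the class name and the use of numpy arrays are assumptions made for this sketch only.

import numpy as np
from dataclasses import dataclass

@dataclass
class DiscreteHMM:
    """Complete parameter set lambda = (pi, A, B) of a discrete HMM (eq. 6.6)."""
    pi: np.ndarray   # initial distribution pi_i, shape (N,), eq. (6.5)
    A: np.ndarray    # transition probabilities a_ij, shape (N, N), eq. (6.1)
    B: np.ndarray    # emission probabilities b_j(k), shape (N, M), eq. (6.3)

    def __post_init__(self):
        # eq. (6.2) and eq. (6.4): every row of A and B must sum to one
        assert np.allclose(self.A.sum(axis=1), 1.0), "rows of A must sum to 1"
        assert np.allclose(self.B.sum(axis=1), 1.0), "rows of B must sum to 1"
        assert np.isclose(self.pi.sum(), 1.0), "pi must sum to 1"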

Generation of an Observation Sequence

Having appropriate values for an HMM λ, algorithm 2 can be used to generate a sequence of length T of observations ot taken from the alphabet V:

o = (o1, o2, . . . , oT ) . (6.7)

The advantage of algorithm 2 is twofold: first, it can generate a sequence of observations (eq. (6.7)) using an HMM, and second, it is able to model how a given observation sequence was generated by an appropriate HMM.

Algorithm 2: This algorithm can be applied to generate an observation sequence. It can also model how a given observation sequence was generated by an appropriate HMM.

Data: λ = (π, A, B)
Result: o
  Choose initial state q1 = Si according to π
  t = 1
  while t = t + 1 < T do
      Observation: select ot = vk, according to Bi in state Si
      Transition: based on (aij), of state Si change to the new state qt+1 = Sj
  end
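A minimal Python rendering of algorithm 2, assuming the parameters are given as numpy arrays; the function name and the random-number handling are illustrative.

import numpy as np

def generate_observations(pi, A, B, T, seed=None):
    """Generate an observation sequence o = (o_1, ..., o_T) in the spirit of algorithm 2.

    pi: initial distribution (N,), A: transitions (N, N), B: emissions (N, M).
    """
    rng = np.random.default_rng(seed)
    N, M = B.shape
    state = rng.choice(N, p=pi)                  # choose initial state q_1 according to pi
    observations = []
    for _ in range(T):
        observations.append(int(rng.choice(M, p=B[state])))   # emit o_t = v_k according to B in state S_i
        state = rng.choice(N, p=A[state])                      # transit to q_{t+1} = S_j according to a_ij
    return observations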


6.1.2 HMM Basic Problems

The art of HMMs lies in their topology of allowed transitions, the features to be observed and their emission probabilities. While the choice of features depends highly on the observation task, the solutions of the following three basic problems help to set transition and emission probabilities for real-world applications.

Evaluation problem: Given the model λ and the observation sequence o, how do we efficiently compute P(o | λ), the probability of generating the observation sequence given the model?

Decoding problem: Given the model λ and the observation sequence o = (o1, o2, . . . , oT), how do we choose a corresponding state sequence q = q1, q2, . . . , qT which is meaningful in some sense, i.e., best explains the observations?

Estimation problem: How do we train the HMM by adjusting the model parameters λ = (π, A, B) to maximize P(o | λ)?

6.1.3 Evaluation Problem

The objective is to calculate how well a given model λ matches a presented observation sequence o. This computation is important for any recognition application using HMMs. The standard way to recognize a gesture g out of a set G is to train an HMM λg for every single known gesture g ∈ G. Based on the trained data, the recognition starts, after the observation sequence of g has been recorded, with the calculation of P(o | λg) for every HMM λg. This computation is done by solving the evaluation problem. Finally, the model λg which produces the highest probability of describing the observation sequence,

\hat{g} = \arg\max_{g \in G} P(o \mid \lambda_g),   (6.8)

is deemed to be the recognized gesture ĝ.

Given the observation sequence o and the HMM λ, the straightforward way to compute P(o | λ) is to enumerate every possible state sequence q of length T and compute the corresponding probability P(o | q1, . . . , qT). The results are then added up to obtain the overall P(o | λ).

The evaluation problem is solved by calculating the observation probability while running through a distinct sequence of states S of the HMM. Therefore, first the probability P(S | λ) of the HMM passing through this fixed state sequence is computed by

P(S \mid \lambda) = P(q_1, \ldots, q_T \mid \lambda) = \pi_{q_1} \cdot \prod_{t=2}^{T} a_{q_{t-1} q_t}.   (6.9)

Then the observation probability of producing o by the state sequence S is calculated:

P(o \mid S, \lambda) = P(o_1, \ldots, o_T \mid q_1, \ldots, q_T, \lambda) = \prod_{t=1}^{T} b_{q_t}(o_t).   (6.10)

Thus, the joint probability of S and o can be calculated from the probabilities of eq. (6.9) and eq. (6.10) using the chain rule:

P(o, S \mid \lambda) = P(o \mid S, \lambda) \cdot P(S \mid \lambda) = \pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(o_t).   (6.11)

Finally, the observation probability is computed by

P(o \mid \lambda) = \sum_{S \in Q^T} P(o, S \mid \lambda)
                  = \sum_{S \in Q^T} P(o \mid S, \lambda) \cdot P(S \mid \lambda)
                  = \sum_{S \in Q^T} \pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(o_t),   (6.12)

summing over all possible state sequences Q^T of length T. Using eq. (6.12) to solve the evaluation problem, one has to sum over N^T state sequences, resulting in 2T · N^T multiplications to calculate P(o | λ). The computation time grows exponentially with the sequence length T. Therefore, even for small models and short observation sequences the calculation of the observation probability would take too much time. Thus, the above computation is by far too expensive for real-world applications, and the more efficient Forward-Backward algorithm is used.


Forward-Backward Algorithm

The Forward-Backward algorithm (Rabiner, 1989) is a recursive algorithm which performs the calculation of P(o | λ) in time linear in the length of the observation sequence. The algorithm is based on the computation of the forward variable α and the backward variable β. The forward variable is defined as

αt(j) = P(o1, . . . , ot, qt = j | λ) (6.13)

the probability of observing the partial observation sequence o1, o2, . . . , ot and being in state Sj at time t, given the model λ, while the backward variable β,

βt(i) = P(ot+1, . . . , oT | qt = i, λ). (6.14)

is defined as the probability of observing the partial observation sequence ot+1, . . . , oT, given that the HMM λ is in state i at time t. Algorithm 3 describes the computation of α and β. Their computations differ mainly in the temporal development of the recursion. As shown in the termination step, the resulting P(o | λ) can be computed using either αt(j) or βt(i) or both, using:

P(o \mid \lambda) = \sum_{j=1}^{N} \alpha_t(j) \, \beta_t(j).   (6.15)

When applying the Forward-Backward algorithm, especially the multiplication of the probabilities in eq. (6.15), the recognition would not be robust against:

1. a missing observation,

2. a falsely classified static hand posture, which was not included in the training data³,

3. an observation sequence which takes longer than the learned ones.

The first problem can be solved by perfect tracking and faultless classification of static hand postures. The remaining problems become less crucial when collecting more training data. However, up to now neither perfect tracking nor classification could be achieved.

Thus, the aim of the present work is to develop a system that works reliably and robustly under real world conditions, where only sparse data is obtained.

³Using a discrete observation distribution, zero will be returned for the observation probability of a symbol that has not been learned for the particular state.


Algorithm 3: The Forward-Backward algorithm is used for efficient computation of P(o | λ). Having computed the forward and backward variables, the probability of observing the sequence is calculated in the Termination step.

Initialization:
  For j = 1, 2, . . . , N initialize  \alpha_1(j) = \pi_j b_j(o_1)
  For i = 1, 2, . . . , N initialize  \beta_T(i) = 1

Induction:
  For t > 1 and j = 1, 2, . . . , N compute
      \alpha_t(j) = \Big( \sum_{i=1}^{N} \alpha_{t-1}(i) \, a_{ij} \Big) b_j(o_t)
  For t < T and i = 1, 2, . . . , N compute
      \beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)

Termination:
  Compute  P(o \mid \lambda) = \sum_{j=1}^{N} \alpha_T(j)
  Compute  P(o \mid \lambda) = \sum_{j=1}^{N} \pi_j b_j(o_1) \beta_1(j)
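A compact Python sketch of algorithm 3 follows. It computes α, β and P(o | λ) exactly as above; a practical implementation would use scaling or log-space arithmetic for long sequences, which is omitted here.

import numpy as np

def forward_backward(pi, A, B, obs):
    """Forward-Backward recursion of algorithm 3 for a discrete HMM.

    Returns (alpha, beta, prob) where prob = P(o | lambda) = sum_j alpha_T(j).
    """
    N = len(pi)
    T = len(obs)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))

    # Initialization
    alpha[0] = pi * B[:, obs[0]]
    beta[T - 1] = 1.0

    # Induction
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    # Termination: both expressions of eq. (6.15) give the same value
    prob = alpha[T - 1].sum()
    return alpha, beta, prob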

6.1.4 Decoding Problem

The aim of solving the decoding problem is to uncover the hidden part of the HMM, i.e., to find the most likely sequence of hidden states q that could have emitted a given output sequence o. This information about q becomes interesting when physical significance is attached to the states or to sets of states of the model λ. Besides, the decoding of the state sequence can be used for solving the estimation problem, which is applied for training the HMM.

Although the state sequence is hidden, it can be estimated from the knowledge of the observation sequence o and the model parameters of the HMM λ. Therefore, it is postulated in the present work that the observation sequence o is generated by one state sequence q ∈ Q^T of length T. Thus, the state sequence q that maximizes the posterior probability

P(q \mid o, \lambda) = \frac{P(o, q \mid \lambda)}{P(o \mid \lambda)}   (6.16)

is defined to be the producing sequence. Ignoring the (constant) denominator of eq. (6.16), the optimal state sequence q*,

P(o, q^* \mid \lambda) = \arg\max_{q \in Q^T} P(o, q \mid \lambda) =: P^*(o \mid \lambda),   (6.17)

can be computed using the Viterbi algorithm.

Viterbi Algorithm

The Viterbi algorithm (Viterbi, 1967) is a dynamic programming algorithm which is applied to find the sequence of hidden states q most likely to emit the sequence of observed events. This sequence of hidden states q is called the Viterbi path.

The Viterbi algorithm is based on several assumptions. First, both the observed events and the hidden states have to be in a sequence corresponding to time. Second, these two sequences have to be aligned. Thus, an observed event needs to correspond to exactly one hidden state. Third, computing the most likely hidden sequence qt up to a certain time t has to depend only on the observed event at time t and the most likely sequence qt−1 of the previous time step t − 1. These assumptions are all satisfied in the first-order hidden Markov model.

The concept behind the Viterbi algorithm is based on recursion with result caching (sometimes called memoization in the computer science literature) and is shown in algorithm 4. Instead of computing the forward variable αt(j) from section 6.1.3, the maximal achievable probability

ϑt(j) = max{P(o1, . . . , ot, q1, . . . , qt−1, Sj | λ) with q ∈ QT} (6.18)

of running through a sequence of states to produce the partial observation sequence o1, . . . , ot under the constraint that the HMM ends in state Sj is used. Thus, ϑt(j) is the best score (highest probability) along a single path at time t which accounts for the partial observation sequence until time t and ends in state Sj. The following state is obtained by induction:

\vartheta_{t+1}(j) = \Big[ \max_{i} \vartheta_t(i) \, a_{ij} \Big] b_j(o_{t+1}).   (6.19)


Algorithm 4: The Viterbi algorithm is similar to the implementation of the forward variable αt(j). The difference is the maximization over the previous states during the recursion, which is used instead of the summation in the induction step of the calculation of the forward variable in algorithm 3.

Initialization:
  \vartheta_1(j) = \pi_j b_j(o_1), \quad 1 \le j \le N
  \psi_1(j) = 0

Recursion:
  \vartheta_t(j) = \max_{1 \le i \le N} [\vartheta_{t-1}(i) a_{ij}] \, b_j(o_t), \quad 2 \le t \le T, \; 1 \le j \le N
  \psi_t(j) = \arg\max_{1 \le i \le N} [\vartheta_{t-1}(i) a_{ij}]

Termination:
  P^*(o \mid \lambda) = \max_{1 \le j \le N} [\vartheta_T(j)]
  q_T^* = \arg\max_{1 \le j \le N} [\vartheta_T(j)]

Backtracking:
  q_t^* = \psi_{t+1}(q_{t+1}^*)

In order to follow the argument which maximizes eq. (6.19) for each t and j, the path history is stored in the array ψt(j). As depicted in algorithm 4, the last step of the Viterbi algorithm, backtracking, is applied to determine the Viterbi path.
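For illustration, a minimal Python version of algorithm 4 including the backtracking of the Viterbi path; variable names follow ϑ and ψ above, and the array-based parameterization is an assumption of this sketch.

import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi algorithm (algorithm 4): most likely state sequence for a discrete HMM."""
    N = len(pi)
    T = len(obs)
    theta = np.zeros((T, N))           # best score ending in state j at time t
    psi = np.zeros((T, N), dtype=int)  # back-pointers

    theta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = theta[t - 1][:, None] * A            # theta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        theta[t] = scores.max(axis=0) * B[:, obs[t]]

    # Termination and backtracking of the Viterbi path
    path = np.zeros(T, dtype=int)
    path[T - 1] = theta[T - 1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]
    return path, theta[T - 1].max()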

6.1.5 Estimation Problem

Solving the estimation problem, i.e., training an HMM, means adjusting its model parameters such that they best describe how a given observation sequence otrain can be produced. In speech recognition, e.g., each word is represented by an HMM, which codes the sequence of phonemes. The HMM is trained by presenting a number of repetitions of the coded word and adapting the HMM parameters to enhance the recognition using the Forward-Backward algorithm from section 6.1.3.


Up to now, there is no known analytical solution for the HMM parameters which maximize the probability of the training sequence (Rabiner, 1989). Instead, the Baum-Welch algorithm constitutes an iterative procedure which can be used to adapt λ = {A, B, π} in such a way that P(otrain | λ) is locally maximized.

Baum-Welch Algorithm

The Baum-Welch algorithm (Baum et al., 1970) is a generalized expectation-maximization algorithm. It computes maximum likelihood estimates and posterior mode estimates for the parameters (A, B and π) of an HMM λ, when given a set of observation sequences otrain ∈ O as training data.

The construction of an improved model λ̄ is mainly adapted from the αt(j) and βt(j) probabilities computed for the calculation of P(otrain | λ). In order to describe the adaptation of the HMM parameters, the probability

\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid o_{train}, \lambda)
            = \frac{P(q_t = i, q_{t+1} = j, o_{train} \mid \lambda)}{P(o_{train} \mid \lambda)}
            = \frac{\alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{\sum_{i=1}^{N} \alpha_t(i) \beta_t(i)}, \qquad 1 \le t \le T,   (6.20)

of being in state Si at time t and in state Sj at time t + 1, is introduced. Further,

γt(i) = P(qt = i | otrain, λ) (6.21)

the probability of being in state Si at time t, given the observation sequence otrain and the model λ, is defined. The sums over the sequence of the two probabilities can be interpreted as:

\sum_{t=1}^{T-1} \gamma_t(i) = \text{expected number of transitions from } S_i,   (6.22)

\sum_{t=1}^{T-1} \xi_t(i, j) = \text{expected number of transitions from } S_i \text{ to } S_j.   (6.23)

They are connected by:

\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j) = \frac{\alpha_t(i) \, \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i) \beta_t(i)}, \qquad t < T.   (6.24)


Using the concept of event counting, the expectations for the transitions from state Si to Sj and the number of transitions through state Si, respectively, are obtained by summing the ξt(i, j) and γt(i) values resulting from the training sequence. These estimates allow changing the model parameters of λ in order to improve its evaluation towards the training sequence.

The Baum-Welch algorithm is depicted in algorithm 5 and works in two steps. In the first step, the forward variable α and the backward variable β for each HMM state are calculated in order to compute ξ (eq. (6.20)) and γ (eq. (6.21)). Based on these computations, the second step updates the HMM parameters. The initial distribution π̄i of state Si is calculated by counting the expected number of times in Si and dividing it by the overall sequence observation probability. The transition probabilities are updated to āij by counting the frequency of the transition pair values from state Si to Sj and dividing it by the entire number of transitions through Si. Finally, the new emission probabilities b̄j(k) for emitting the symbol vk when the HMM is in state Sj are determined by the fraction of the expected number of times of being in state Sj and observing symbol vk in the numerator and the expected number of times in state Sj in the denominator.

If the initial HMM λ = {A, B, π} is modified according to the Baum-Welch algorithm, it has been proven by Baum et al. (1970) that the resulting HMM λ̄ = {Ā, B̄, π̄} satisfies either λ̄ = λ, or the model λ̄ is more likely in the sense that

P(o | λ̄) > P(o | λ),

i.e., the new model is more likely to produce the training sequence.

While running the Baum-Welch algorithm, the new model λ̄ iteratively replaces λ. The HMM parameter adaptation is repeated until no further improvement of the probability of otrain is observed from the model or a fixed number of iterations (imax) has been reached. The final result of the Baum-Welch algorithm is called a maximum likelihood estimate of the HMM.

It should be noted that the number of states N is not part of the parameter estimation process and has to be specified manually. This is an important decision. Lower values of N will reduce the computational cost at the risk of inadequate generalization, while higher values may lead to over-specialization. Another disadvantage mentioned in Rabiner (1989) is the dependence of the result of the Baum-Welch algorithm on the initialization of the trained parameters. While the transition matrix can be initialized with random values, the emission matrix should contain good initial estimates.


Algorithm 5: The Baum-Welch algorithm works in two steps, first calculating the forward and backward variables and second updating the HMM parameters. Starting with an initial HMM λ = (π, A, B), the algorithm optimizes the parameters towards the training sequence otrain. The algorithm runs until a maximal number of iterations imax has been reached. χ[·] is a function that returns 1 in case of a true statement and zero otherwise.

Result: λ̄  /* improved HMM */
begin
    initialize: λ̄ = λ
    while imax not reached and λ̄ ≠ λ do
        /* Step one: */
        set: λ = λ̄
        calculate: γ, ξ for λ, otrain
        /* Step two: */
        update parameters:

        \bar{\pi}_i = \text{expected number of times in state } S_i \text{ at time } t = 1
               = \gamma_1(i) = \frac{\alpha_1(i)\,\beta_1(i)}{\sum_{i=1}^{N} \alpha_t(i)\,\beta_t(i)}

        \bar{a}_{ij} = \frac{\text{expected number of transitions from state } S_i \text{ to state } S_j}{\text{expected number of transitions from state } S_i}
               = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}
               = \frac{\sum_{t=1}^{T-1} \alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}{\sum_{t=1}^{T-1} \alpha_t(i)\,\beta_t(i)}

        \bar{b}_j(k) = \frac{\text{expected number of times in } S_j \text{ and observing symbol } v_k}{\text{expected number of times in state } S_j}
               = \frac{\sum_{t=1}^{T} \gamma_t(j)\,\chi_{[o_t = v_k]}}{\sum_{t=1}^{T} \gamma_t(j)}
               = \frac{\sum_{t=1}^{T} \alpha_t(j)\,\beta_t(j)\,\chi_{[o_t = v_k]}}{\sum_{t=1}^{T} \alpha_t(j)\,\beta_t(j)}
    end
end
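The re-estimation formulas above can be condensed into a single update step, sketched below for one discrete training sequence; iterating this step (and, in practice, pooling several sequences and guarding against empty denominators) yields the Baum-Welch procedure. This is an illustrative rendering only, not the training actually used in this work, which replaces Baum-Welch as described in section 6.2.2.

import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One re-estimation step of the Baum-Welch algorithm (algorithm 5).

    Returns updated parameters (pi_hat, A_hat, B_hat) for a single training sequence.
    """
    N, M = B.shape
    T = len(obs)

    # Step one: forward and backward variables (see algorithm 3)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    beta[T - 1] = 1.0
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    prob = alpha[T - 1].sum()

    # gamma_t(i): probability of being in state i at time t (eq. 6.21/6.24)
    gamma = alpha * beta / prob
    # xi_t(i, j): probability of the transition i -> j at time t (eq. 6.20)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / prob

    # Step two: update the parameters by counting expected events
    pi_hat = gamma[0]
    A_hat = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_hat = np.zeros_like(B)
    for k in range(M):
        B_hat[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    B_hat /= gamma.sum(axis=0)[:, None]
    return pi_hat, A_hat, B_hat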


6.2 Modification of the Hidden Markov Model

Following HMM notation, the extracted features introduced in chapter 5 will be called observations in the following. Each kind of observation has a particular degree of uncertainty. Due to the applied tracking method, the position on the object can vary, the contour might not be accurately determined due to blurring or erroneous segmentation, and the bunch graph classification could have gone wrong. Thus, instead of putting all observations into one observation vector as shown in fig. 6.3 (a), they are separated into single channels, fig. 6.3 (b), in order to enhance the robustness of recognition and to decrease the training effort.

The parallel HMM (PaHMM) structure has been applied by Vogler and Metaxas (1999, 2001, 2003) and Tanibata et al. (2002). These authors divided the observations for left and right hand and trained an HMM for each hand.

(a) Single HMMs (b) parallel HMMs

Figure 6.3: HMM architectures — Using single HMMs (a), each state stores the feature vector in a multi-dimensional space. The parallel HMMs (b) code each feature (circle, cross, triangle) in a single HMM and combine the information during runtime using the concept of strong and weak features presented in section 6.3.


Shet et al. (2004) and Vogler and Metaxas (2003) show that parallel HMMs model the parallel processes independently from each other. Thus, they can be trained independently, which has the advantage that considerations of the different combinations are not required at training time.

Hence, each gesture g will be represented by six channels which separate the position y relative to the head, the hand posture observations of the contour classification c and the texture τ, for the left and for the right hand. The HMMs coding the sign run in parallel and are merged during the recognition process.

For this purpose, the information of the weak (c, τ) and strong (y) feature channels passes through two stages of integration. At the first stage, the integration is limited to the three HMMs which code the observations for the left or right hand, respectively. The second stage connects the information of both hands in order to determine the similarity of the trained sign to the incoming observation sequence. The recognition process is administrated by a recognition agent and will be explained in section 6.3.

The topology of the HMM which serves to code the information of each channel is an extension of the Bakis model shown in fig. 6.2. While the data structure of the HMM is the same as described in section 6.1, the modifications needed to include the HMM in the recognition system refer to the evaluation and the estimation problem. In order to achieve the aim of robust sign language recognition, the computation of the probability of emitting the observed sign is solved in a way different from that explained in section 6.1.3. This modified solution to the evaluation problem will be explained in section 6.2.1; it was made to enhance the robustness of the recognition by allowing a flexible combination of the independent channels.

A different way of solving the training of the HMMs was chosen in order to improve the capability of the composition of the HMM, given that only a few training sequences are available. This alternative to the Baum-Welch algorithm is presented in section 6.2.2.

6.2.1 Evaluation Problem

The flexibility of the HMM depends on the training data, which determines the transition probabilities aij and the observation probability distributions B = {bj(k)} of the model. Under real world conditions the HMM will be confronted with unknown, i.e., not learned, observations or with variations in the dynamics, e.g., caused by missing tracking information, blurring etc. This leads to problems when using the conventional procedure to solve the evaluation problem (section 6.1.3). Thus, the doubly probabilistic nature of the HMM is separated by introducing a self-controlled transition between the HMM states. The self-controlled transition allows the HMM to pass the current observation to the next state (null transition) or to ignore it. This approach applies a strict left-right model with π1 = 1 and πi = 0 for i ≠ 1, and aij = 1 if j = i + 1 and aij = 0 otherwise. Instead of using the transition probability matrix A, where the transitions are learned, the aij are replaced by a weighting function wt(u) which is updated during the recognition process. Thus, the computation of the forward variable α introduced in section 6.1.3 changes to:

\alpha_{t+1}(j) = \alpha_t(i) \underbrace{a_{ij}}_{=1} \, w(u) \, b_j(o_{t+1}),   (6.25)

and therefore the computation of P(o | λ) becomes

P (o | λ) = αT (N). (6.26)

Eq. (6.25) is used to perform on-line gesture recognition and computes the probability of observing the partial observation sequence on every frame. The weighting function w expressing the certainty for each channel is inspired by a Gaussian:

w(u) = \exp\left(-\frac{u}{2}\right),   (6.27)

where u ∈ [0, ∞] is a measure of uncertainty. The modified calculation of P(o | λ) is presented in algorithm 6. Starting with a maximal certainty of u = 0 at the beginning of the recognition, the HMM checks whether the received observation ot is present in the observation distribution Bi of the current state Si. In case the probability bi(ot) is not satisfying, i.e., below a recognition threshold Tr, the HMM can pass the observation to the next state Si+1, check again, and pass it further to the state Si+2 if necessary. If the observation does not even match at state Si+2, it will be ignored and the HMM returns to state Si. Each of these transitions is punished by increasing the uncertainty u and thus lowering the weighting function. To regain its certainty, the HMM recovers with every recognized observation by decreasing u. In case of a recognized observation the system switches to the next state.
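One possible reading of this self-controlled transition scheme is sketched below. The emission callback, the value of the threshold Tr, the step size by which the uncertainty u is increased or decreased, and the advance to the next state after a recognized observation are assumptions drawn from the description above and from algorithm 6, not the exact implementation; the weighting follows eq. (6.27).

import math

def weight(u):
    """Certainty weighting w(u) = exp(-u/2) of eq. (6.27)."""
    return math.exp(-u / 2.0)

class HMMSensorSketch:
    """Hedged sketch of the modified evaluation with null transitions (algorithm 6).

    emission(state, obs) -> b_state(obs) is assumed to be supplied by the trained
    observation distributions; t_r and u_step are illustrative choices.
    """

    def __init__(self, emission, n_states, t_r=0.1, u_step=1.0):
        self.emission = emission
        self.n_states = n_states
        self.t_r = t_r
        self.u_step = u_step
        self.state = 0
        self.u = 0.0          # maximal certainty at the start of recognition
        self.log_prob = 0.0   # accumulated log score, see eq. (6.30)

    def observe(self, obs):
        """Process one observation o_t, updating state, uncertainty and score."""
        # look at the current state and up to two states ahead (null transitions)
        for skip in range(3):
            s = min(self.state + skip, self.n_states - 1)
            b = self.emission(s, obs)
            if b > self.t_r:
                if skip == 0:
                    self.u = max(0.0, self.u - self.u_step)   # direct hit: regain certainty
                else:
                    self.u += skip * self.u_step              # null transition(s): punish
                self.log_prob += math.log(b * weight(self.u))
                self.state = min(s + 1, self.n_states - 1)    # advance after a recognized observation
                return
        # no match at state i, i+1 or i+2: ignore the observation, stay and punish
        self.u += self.u_step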

6.2.2 Estimation Problem

Six different information channels (f = 1, . . . , 6) are applied for sign recognition. Each feature sequence is coded and treated in a separate HMM. Because one of the assumptions made above is independence of the features, the HMMs are trained independently. The HMMs which represent a sign g have the same number of states Ng. Their training is performed in the same way, differing only in the used feature and the feature representation.


Algorithm 6: Process performed for the determination of the recognition probability of an HMM, being in state i and receiving a new observation ot.

if bi(ot) > Tr then
    update (decrease) u
    P(o | λ) += bi(ot) · w(u)
else
    if bi+1(ot) > Tr then
        /* null transition */
        i = i + 1
        update (increase) u
        P(o | λ) += bi+1(ot) · w(u)
    else
        if bi+2(ot) > Tr then
            /* null transition */
            i = i + 2
            update (increase) u
            P(o | λ) += bi+2(ot) · w(u)
        else
            /* ignore observation */
            update (increase) u
            re-enter to state i = i − 1
        end
    end
end

Therefore, the following describes the training for one HMM λf,g, but the process is the same for the other feature HMMs. The estimation or training procedure of the λ parameters which is presented in this section is completely based on training examples. Thus, the HMM parameters are adapted towards a set Og of observation sequences which represent the sign g.

In the first step, the number of states Ng of the HMM⁴ has to be defined. For this purpose, two different strategies can be applied. The first strategy is to use a fixed number of states for each trained sign, as described in Starner et al. (1998) (using four states for each HMM) and Rigoll et al. (1996) (using three states).

⁴As mentioned in section 6.1.5, the number of states of the trained HMM is not given by the Baum-Welch algorithm.


The other strategy is to couple the number of states to the length of the applied training sequences of Og. This work follows the second strategy. In contrast to Grobel and Assan (1997) and Zahedi et al. (2005), who take the minimal sequence length of the training data, this work uses the maximal sequence length to determine Ng for the HMM.

As the HMM uses the modified evaluation algorithm from section 6.2.1, there is no need to train the transition matrix, and training concentrates on the estimation of the emission probability distributions. The creation of the observation probability distribution for each state is simple, easily extensible and avoids the accurate initialization needed for the Baum-Welch algorithm.

Representation of Discrete Observations

The emission probability distributions of the hand posture information received from the sign lexicons of the bunch graph and the contour matching are discrete and thus can be described by a histogram Hhp, where hp represents the contour or texture feature. The sign lexicon entries represent the bins of the corresponding histogram. As mentioned in chapter 5, they are chosen to be as dissimilar as possible. The probability of observing a symbol o equals the entry in the attributed histogram bin:

bj(o) = Hhp,j(o), (6.28)

where Hhp,j is the normalized histogram of feature hp at state j.

Representation of Continuous Observations

In contrast to the hand posture observations, the position information has a Euclidean metric and the description of similarity or proximity makes sense. Thus, the emission probability distribution of the position is described by a mixture of Gaussians:

b_j(o) = \sum_{m=1}^{M} c_{jm} \, \mathcal{N}\left[o, \mu_{jm}, \Sigma_{jm}\right],   (6.29)

where M is the number of Gaussians N, with mean vector µ and covariance matrix Σ, which is constant in this work. The mixture coefficient c applied in the present work is set constant to c = 1/M for each Gaussian.


Figure 6.4: Estimation of HMM Parameters — The number of states of the generated HMM is set to the length of the longest training sequence. The observation distribution stored at each state is constructed from the observations which are available at the position in the training sequences which corresponds to the current state.

Training of the HMM

Thus, training of the emission probability distribution for each state j reduces to either filling the bins of the histograms for the hand posture observations or setting up the mixture of Gaussians. Both types of representation are empty in the beginning.

Training is performed as illustrated in fig. 6.4. The number of Gaussian mixture components M or the number of entries in the histogram is equal to the number of observations available for each time step of the observation sequences. As the training sequences are of different length, there will be fewer observations for the last states of the HMM. In the case of the position, each observation at time step j is added as a mean vector µ, while for the hand posture the associated entry in the histogram is increased by one. If an observation is empty, e.g., the hand contour could not be matched with an entry of the sign lexicon, this "special" observation does not join the training and does not increase M or the number of entries in H, respectively.
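A hedged sketch of this training scheme: one normalized histogram per state for a discrete hand posture feature, and a mixture-of-Gaussians evaluation for the position feature with constant covariance and mixture coefficients c = 1/M. The lexicon size, the covariance value and the handling of empty observations as `None` are illustrative assumptions, not the values used in this work.

import numpy as np

def train_discrete_emissions(sequences, n_states, lexicon_size):
    """Fill one normalized histogram per state from the training sequences.

    sequences: list of observation index sequences of different length;
    an observation of None (no lexicon match) is skipped and does not
    contribute to the histogram, as described above.
    """
    hist = np.zeros((n_states, lexicon_size))
    for seq in sequences:
        for state, obs in enumerate(seq[:n_states]):
            if obs is not None:
                hist[state, obs] += 1
    sums = hist.sum(axis=1, keepdims=True)
    return np.divide(hist, sums, out=np.zeros_like(hist), where=sums > 0)

def position_emission(o, means, sigma=10.0):
    """b_j(o) for the position feature: mixture of Gaussians of eq. (6.29)
    with constant (isotropic) covariance and mixture coefficients c = 1/M."""
    means = np.asarray(means, dtype=float)
    M = len(means)
    d2 = np.sum((means - np.asarray(o, dtype=float)) ** 2, axis=1)
    gauss = np.exp(-0.5 * d2 / sigma ** 2) / (2 * np.pi * sigma ** 2)
    return float(gauss.sum() / M)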


Figure 6.5: Recognition Agent — The recognition agent is hierarchically organized. At the bottom are the HMM Sensor modules, six of them for each sign. They collect the information coming from the other agents and calculate their observation probability. The HMM Sensor results are merged in the Gesture HMM Integration modules, of which there is one for each learned sign. The Decision Center decides about the most probable sign.

The advantage of this solution to the estimation problem is that it can be easily extended, a starting model is not needed, and it can be trained with only little training data (9 sequences, compared to the 30 sequences used in Vogler and Metaxas (1999) or 25 training sequences in Shet et al. (2004)).

6.3 The HMM Recognition Agent

The administration of the sign language recognition is hosted in the HMM recognition agent. Starting the recognition cycle with collecting the observations, sign language recognition is done from bottom to top by running through the three hierarchically ordered layers of the HMM recognition agent (see fig. 6.5 and algorithm 8).

In terms of the notation of the HMM recognition agent, each learned sign is represented as a Gesture HMM Integration module and its attached HMM Sensors. The HMM Sensors denote the trained feature HMMs which calculate their current observation probability relative to the incoming data.


Figure 6.6: Information processing — Information processing used for single sign recognition. The rectangles on the left side refer to the HMM Sensors collecting input data. The information integration for one sign is placed in the Gesture HMM Integration module and runs through two stages. In the first stage, the information for each hand is analyzed. In the second stage the information from both hands is integrated to give a rating for the recognition of the gesture.

Their results are passed to the corresponding Gesture HMM Integration module, which fuses the information in order to compute the similarity of the incoming observation sequence to the stored sign. While the HMM Sensors collect the observations supplied by the other agents, the Gesture HMM Integration module performs the information integration based only on the information provided by the HMM Sensors. In order to compute the similarity of the accommodated sign to the observed image sequence, the Gesture HMM Integration module performs two integration steps as depicted in fig. 6.6. In the first integration step, all the HMM Sensor information connected to the same hand is fused. In the second step, the information received for each hand is merged and the similarity of both hands' information for the current frame is computed. This value is added to the overall similarity which has been acquired from the observations of the previous frames. Finally, in the Decision Center module of layer three, the results of the attached Gesture HMM Integration modules are analyzed in order to determine which sign is the most probable.


6.3.1 Layer One: HMM Sensor

The HMM Sensors provide the framework for the computation of the weighted observation probability w(u) · bj(ot) using algorithm 6 presented in section 6.2.1. In order to perform the recognition, each HMM Sensor embeds an HMM. The threshold Tr introduced in section 6.2.1 is the same for each HMM Sensor. While the HMM stores the learned data and is applied for the recognition, Tr determines whether an observation can be recognized with respect to the current HMM state. Six HMM Sensors, as illustrated in fig. 6.6, form a group that constitutes the features used for the recognition of a sign.

If the observed sequence is similar to the stored data, the encapsulated HMM, starting in its first state S1, will proceed to the following states S2, . . . during the recognition process described in section 6.2.1. This is an indicator for the progress of the recognition. The HMM Sensors are not connected to each other. The only restriction is that the lowest state in the sign group is at most four states behind the leading HMM, in terms of highest state, of the HMM Sensor group. Besides this restriction the sensors run independently and can be in different states.

Due to the nature of the observations, the HMM Sensors differ in the probability distributions of their embedded HMMs; these are a continuous distribution for the position, which has been realized by using Gaussian mixtures, and a discrete one, which represents the histogram for contour or texture information. As a Euclidean distance metric can be used for the position observations, the continuous distribution offers a more flexible way to evaluate the observation. In contrast to the position, the discrete feature space does not allow the concept of similarity, and distances cannot be assumed to be Euclidean.

However, only the resulting probability, independent of the engaged observation distribution, and the current state of the sensor are passed to the upper Gesture HMM Integration module.

6.3.2 Layer Two: Gesture HMM Integration

Each learned sign g is represented by a Gesture HMM Integration module in layer two. The task of a Gesture HMM Integration module is to merge the information of its attached HMM Sensors to compute the quality κg, eq. (6.32), of the sign g matching the partial observation sequence.

To get rid of possible multiplications with small numbers causing numerical problems when estimating αt+1(j) using eq. (6.25), it is common to work with the logarithmic values (log) of the probabilities sent by the sensors. Hence, the applied forward variable is expressed by:

    α_{t+1}(j) = α_t(i) + log(w_t(u) · b_j(o_{t+1})).    (6.30)

The logarithm is a strictly monotonic mapping, changing the multiplication of the probabilities into an addition of the received log values. Thus, for every frame the Gesture HMM Integration receives the log probability l of the left hand's (lh) position llh(y), contour llh(c) and texture llh(τ), and of the right hand lrh(y), lrh(c) and lrh(τ).

The weighted probabilities in eq. (6.30) lie in the interval [0, 1]. Hence, the resulting log values are all negative or zero. They decrease very quickly for lower probabilities and become −∞ for zero probability. Thus, high probability values map to negative log values close to zero, while low probability values map to lower negative values with a strongly increasing absolute value.
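
The update of eq. (6.30) then becomes a simple addition in log space. A minimal sketch; the finite floor used in place of −∞ for zero probabilities is an implementation assumption, not taken from the text:

    import math

    LOG_FLOOR = -1e10  # stand-in for log(0)

    def log_forward_update(alpha_prev, weighted_obs_prob):
        """One step of eq. (6.30): add log(w_t(u) * b_j(o_{t+1})) to alpha."""
        if weighted_obs_prob <= 0.0:
            return alpha_prev + LOG_FLOOR
        return alpha_prev + math.log(weighted_obs_prob)

    print(log_forward_update(0.0, 0.9))   # high probability: about -0.105
    print(log_forward_update(0.0, 0.01))  # low probability: about -4.605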

The calculation of the current quality κa for the present frame is done by passing the two integration stages illustrated in fig. 6.6. Integrating the quality for the left and right hand separately in the first stage is done using the concept of strong and weak features.

Missing Observations

Due to the continuous nature of its observation distribution, the position feature is chosen to be the strong cue. In case of a missing observation, e.g., the tracking agent failed to track the hand, the use of the Euclidean distance allows taking the last known position as the current observation. The missing data problem is more critical for the recognition of the static hand posture. As mentioned in section 5.2, recognition might not be stable on every frame, especially when the hand is moving. Thus, contour and texture information are called weak cues and are integrated into the recognition process by using the rewarding function ρ of eq. (6.31). Hence, correct contour and texture information reward the recognition, while missing data or falsely classified hand postures do not disturb the corresponding Gesture HMM Integration module.

Rewarding Function

The rewarding function ρ rewards only if position and contour or position and texture information of the corresponding hand are correlated. In this context, correlation does not necessarily mean that the HMMs have to be in the same state i. Instead, for each hand the corresponding pairs l(c) and l(y), or l(τ) and l(y), are considered.


[Plot of the rewarding function ρ(x) = (x − θ)H(x − θ): reward ρ(x) versus input value x, with the correlation threshold θ marked on the x-axis.]

Figure 6.7: Rewarding Function ρ — The most important parameter of the rewarding function is the correlation threshold θ. Only if the quality of the hand position is above this value is a reward given for the learned sign performed with the corresponding hand posture. The more probable the hand posture information (due to the applied log it will be close to zero), the higher will be the reward, which is given by the absolute value of the distance between threshold and input value x.

Both values just have to be above a correlation threshold θ. This means that both observations would match within the context of the performed sign. The reward is linked to the probability of the hand posture recognition, l(c) or l(τ), which are treated in the same way and are therefore depicted by the variable x in the following eq. (6.31). Both parameters, the hand posture probability x and the threshold θ, are negative or zero. The rewarding function is designed to give a higher reward if the hand posture is more likely in the sense of fitting to the current observation sequence. Generally, the rewarding function can be written as:

    ρ(x) = (x − θ) H(x − θ),    H: Heaviside step function.    (6.31)

Fig. 6.7 shows a plot of ρ as applied in the present work.


Algorithm 7: The κa for each hand is based on the information of the position and is rewarded by matching hand posture observations.

Result: Current quality κa of left and right hand.
foreach Hand do
    /* compute current quality */
    κa = l(y)
    /* check reward of texture information */
    if l(y) > θ ∧ l(τ) > θ then
        /* add reward */
        κa += ρ(l(τ))
    end
    /* check reward of contour information */
    if l(y) > θ ∧ l(c) > θ then
        /* add reward */
        κa += ρ(l(c))
    end
end
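
Algorithm 7 can be written down in a few lines; the following Python sketch uses illustrative names and the correlation threshold of tab. 7.1:

    def reward(x, theta):
        """Rewarding function rho(x) = (x - theta) * H(x - theta), eq. (6.31)."""
        return x - theta if x > theta else 0.0

    def hand_quality(l_y, l_c, l_tau, theta=-5.0):
        """Current quality kappa_a of one hand (algorithm 7): the position
        log probability, rewarded by correlated texture and contour cues."""
        kappa_a = l_y
        if l_y > theta and l_tau > theta:   # texture reward
            kappa_a += reward(l_tau, theta)
        if l_y > theta and l_c > theta:     # contour reward
            kappa_a += reward(l_c, theta)
        return kappa_a

    # position well matched, texture matched, contour missing: -0.5 + 4.0 = 3.5
    print(hand_quality(l_y=-0.5, l_c=-20.0, l_tau=-1.0))

The per-frame qualities of both hands would then be combined as in eq. (6.32), e.g. kappa_g += 0.3 * kappa_lh + 0.7 * kappa_rh.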

Integration of Strong and Weak Features

The term in front of the Heaviside function H determines the amount of the reward which will be added to κa as specified in algorithm 7. Due to the applied Heaviside function, the result of the rewarding function is zero or positive. Thus, in case of a missing or falsely classified observation the reward changes nothing, but it increases the κa if the hand posture stored in the present state of the HMM Sensor matches the learned sign of the current frame. At the beginning of the recognition process or after a reset (see below) the overall quality κg for the single sign g is initialized with zero.

After computing κa for the current frame, the quality κg is updated by adding the κa of each hand. Without the hand posture information, i.e., using only the position information, κg would continuously decrease with increasing sign length, as illustrated in fig. 6.8 (a). By introducing ρ of eq. (6.31), κa and thus κg can become positive. Therefore, κg cannot be transformed back into a probability. The resulting belief in the sign represented by the Gesture HMM Integration module is computed by a weighted addition of the current qualities κlh, κrh received for each hand using:

    κ_g += w_lh κ_lh + w_rh κ_rh.    (6.32)


Focusing on the dominant hand by setting wrh = 0.7 and wlh = 0.3, the recognition is run without further information on whether the presented sign is performed by one or both hands.

States of the Gesture HMM Integration Modules

Each Gesture HMM Integration module has two states, active or inactive. In the active state, the module is certain that it could match the sign and, by proceeding to higher states of the HMM Sensors, the recognition continuously follows the incoming observations. The increase of the HMM Sensor states is a cue for the similarity of the learned sign to the performed sign. A Gesture HMM Integration module, and therefore its corresponding sign, becomes active if the κa of the first state5 is above the activation threshold ξstart. Otherwise the Gesture HMM Integration module is inactive. This means that all the connected HMM Sensors are set to their initial state and all the parameters like the uncertainty u of each sensor and the κg are reset to zero. An active Gesture HMM Integration module becomes inactive if its κg drops below the inactivation threshold ξstop.
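
The activation bookkeeping can be sketched as a small state machine; class and attribute names are illustrative, and the default thresholds correspond to the values later listed in tab. 7.1:

    class GestureModule:
        """Active/inactive bookkeeping of a Gesture HMM Integration module
        (the reset of the attached sensors is only indicated)."""

        def __init__(self, xi_start=-1.0, xi_stop=-11.0):
            self.xi_start = xi_start   # activation threshold
            self.xi_stop = xi_stop     # inactivation threshold
            self.active = False
            self.kappa_g = 0.0         # overall quality

        def update(self, kappa_a):
            """kappa_a: weighted per-frame quality of both hands, eq. (6.32)."""
            if not self.active:
                # while inactive the sensors sit in their first state,
                # so kappa_a is the quality of the first state
                if kappa_a > self.xi_start:
                    self.active = True
                    self.kappa_g = kappa_a
                return
            self.kappa_g += kappa_a
            if self.kappa_g < self.xi_stop:
                self.reset()

        def reset(self):
            """Back to inactive: sensors, uncertainties and kappa_g are reset."""
            self.active = False
            self.kappa_g = 0.0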

A graphical overview of the behavior of the Gesture HMM Integration modules is given in fig. 6.8. Both plots were taken from the same recognition experiment described in section 7.2, where the system was tested to recognize the sign “about”. While fig. 6.8 (a) shows the development of the corresponding Gesture HMM Integration module, trained on the data of the “about” sign, fig. 6.8 (b) depicts the behavior of a module where the learned data (the sign “fast”) and thus the presented observation sequence do not match. As shown in fig. 6.8 (a) for the matching sign module, without the hand posture information, adding only the l(y), the κg decreases with increasing sign length. Depending on the matching position observation, the sign would run into its inactive state and would be reset for the further recognition. By introducing ρ of eq. (6.31), the κa and thus κg can become positive, which helps the Gesture HMM Integration module to stay active.

ξstart and ξstop have global values and allow the system to reset a sign module autonomously in order to restart the recognition during the performance of the sign. With a view to a continuous stream of data, the active/inactive mode was developed to handle the problem of co-articulation (the frames between two gestures) and the case where the first frames of one or more signs are similar and only the following frames decide which sign is performed. Thus, as the sequence is followed, only the most likely sign will stay active until the final decision about the most probable sign is made by the Decision Center of layer three.

5According to the applied recognition algorithm, if the observation is not matched with the first state, it can be passed to the second state, see section 6.2.1.


The Decision Center compares the results of the Gesture HMM Integration modules and determines which sign is the most probable so far.


[Plots of quality values over frames 0–20: (a) Gesture Module ABOUT, presented sign ABOUT; (b) Gesture Module FAST, presented sign ABOUT. Each plot shows the overall quality κg, the position-only quality, the rewards ρ(c) and ρ(τ), and the thresholds ξstart and ξstop.]

Figure 6.8: Recognition Modules — The plots illustrate the behavior of the Gesture HMM Integration modules during the recognition of the sign “about”. For this purpose each figure plots the thresholds for activating and inactivating the modules, the curve of the overall quality κg, and the rewards given by the attached weak cues. In order to demonstrate the advantage of the reward, the recognition quality based only on the position information is plotted as well. As can be seen for the “about” module in fig. 6.8 (a), this line constantly decreases and would therefore run the Gesture HMM Integration module into its inactive mode. The rewards allow the module to stay active and thus recognize the sign. Fig. 6.8 (b) shows that parts of the “about” sign are quite similar to the beginning of the stored “fast” sign. Thus, it is important to consider the introduced confidence value to discard these events.


6.3.3 Layer Three: Decision Center

Only active Gesture HMM Integration modules receive the attention of the Decision Center in layer three. The autonomy of the Gesture HMM Integration modules prohibits the Decision Center from using eq. (6.8) to declare the sign g with the highest current value of κg as the recognized sign: the Decision Center would then compare sign modules that are already running with sign modules that just started. Thus, recognition is coupled to the progress of the HMM Sensors by means of a confidence value ζg, which is individual for each sign g. It is computed as the ratio of the current state of the sensor to the maximal number of states N of the HMM and is a measure of certainty. Therefore, only signs which are above the confidence threshold ζg,min are handled by the Decision Center. This minimal confidence ζg,min can be different for each sign and is computed as the ratio of its shortest to its longest sequence of the training set.

Out of the signs that have reached their ζg,min, the Decision Center chooses the one with the highest κg as the most probable sign representing the observation sequence so far. This method favors short signs that only need a small number of recognized frames to reach their ζg,min. Therefore, competition is introduced in the Decision Center. Based on the sign module with the highest κg, all Gesture HMM Integration modules are inhibited by this κg. Thus, short signs become inactive before they reach the needed confidence value. If a sign reaches a confidence value of one, a reset signal is sent to all connected Gesture HMM Integration modules and the recognition of the current sign is completed.
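
A compact sketch of one Decision Center update, assuming each module object exposes its confidence ζg, its minimal confidence ζg,min, its quality κg and a reset() method (the attribute names are illustrative):

    def decision_step(modules):
        """One layer-three update over the currently active modules."""
        confident = [m for m in modules if m.confidence >= m.min_confidence]
        winner = max(confident, key=lambda m: m.kappa_g, default=None)

        if winner is not None and winner.confidence >= 1.0:
            for m in modules:          # sign completed: reset everything
                m.reset()
            return winner

        if modules:                    # competition: inhibit by the best quality
            kappa_max = max(m.kappa_g for m in modules)
            for m in modules:
                m.kappa_g -= kappa_max
        return winner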


Algorithm 8: The recognition is hierarchically organized into three layers. The characteristic of each layer is its way of information integration. Layer one, the HMM Sensor, analyzes the received observation with its observation probability function. Layer two comprises the HMM Integration module of each learned gesture and integrates the information received from layer one. The top layer compares the results of the HMM Integration modules. The Decision Center determines the most probable sign and manages the inhibition.

while not at end of gesture sequence do
    /* Layer one: HMM Sensor */
    foreach HMM sensor do
        calculate observation probabilities
    end
    /* Layer two: HMM Integration modules */
    foreach HMM Integration module do
        compute ρ to fuse the information of position and contour
        calculate the current quality κa
        update the overall quality κg
        control the activation using κg, ξstart and ξstop
    end
    /* Layer three: Decision Center */
    if ∃ HMM Integration module that reached its ζg,min then
        choose HMM Integration module with highest κg as current winner
    end
    if ζwinner == 1 then
        reset all HMM Integration modules
    else
        /* inhibit all gestures */
        search for the maximal quality κmax
        foreach HMM Integration module do
            subtract κmax from κg
        end
    end
end
Result: last winner will be chosen as recognized gesture.


Chapter 7

Experiments and Results

The presented MAS recognition system was tested for its capabilities in signer dependent (section 7.2) and signer independent (section 7.3) sign language recognition. The aim of the experiments was to evaluate how the teamwork and cooperation within the MAS (handling object tracking and hand posture recognition) solves the problems immanent in sign language recognition. Of special interest is the information fusion which uses the concept of weak and strong cues presented in section 6.3.2.

One crucial problem of any sign language (or speech) recognition system is to detect the start and the end of the observed sign sequence. Continuous sign language recognition systems like Vogler and Metaxas (2001) solve the problem by modeling the movement between two consecutive signs. However, the authors remark that this enlarges the number of included HMMs considerably. This number can be reduced by using filler model HMMs as presented in Eickeler and Rigoll (1998); Eickeler et al. (1998). An alternative is presented in Starner et al. (1998) and Hienz et al. (1999). These authors put the sign sentence under the constraints of a known grammar. Thus, it is treated as one long sign, where the hands' resting positions mark the start and the end of the sentence.

The recognition system in the present work solves this segmentation problem by introducing a high degree of autonomy to each sign module1. Every sign module implemented in the recognition agent is designed to detect the start of the sign by screening the input stream in order to find a partial observation sequence that fits the data sequence stored in its first state. As described in section 6.3, the corresponding Gesture HMM Integration module becomes active if its quality κa is above the activation threshold ξstart.

1The Gesture HMM Integration module and its HMM Sensors.


(a) 54242 (b) 54243 (c) 54244 (d) 54245 (e) 54246 (f) 54247

Figure 7.1: End of Sign Detection — This sequence of images shows the end of the sign “fantastic”, including some following frames. According to the provided ground truth information, the sign ends with frame 54244.

The detection of the last frame of the sign in the presented sequence is even more difficult. While the start of a sign performance is very similar in terms of starting position and hand posture, the varying length of the individual performances makes the precise decision on the last frame very difficult. Fig. 7.1 demonstrates the problem. Even for the human eye it is complicated to determine the turning point which marks the end of the sign and the transition to the next sign.

In order to determine the end of the sign, the system uses the confidence value ζg introduced in section 6.2. As the Gesture HMM Integration module runs through the states of the HMM Sensors, the confidence value increases. Hence, the end of the sign can be determined once the confidence value is 1. However, a confidence value of 1 can only be reached if the presented sequence is as long as the longest sequence which was used to build the HMM Sensor. Thus, a confidence threshold ζg,min was introduced for each sign module. It is computed as the ratio of the shortest to the longest sequence in the corresponding training set.
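
Both quantities translate directly into code; the helper names are illustrative:

    def confidence(current_state, num_states):
        """zeta_g: progress of the HMM Sensor through its N states."""
        return current_state / num_states

    def min_confidence(training_lengths):
        """zeta_g,min: ratio of the shortest to the longest training sequence."""
        return min(training_lengths) / max(training_lengths)

    # e.g. the sign "live" varies between 18 and 41 frames (cf. section 7.1)
    print(min_confidence([18, 23, 30, 41]))  # about 0.44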

The strength of the recognition system presented in this work is that it determines the most probable sign at any time during the performance of the sign. This is supported by the fact that the sign has already reached a certain degree of confidence, i.e., the confidence value is above the sign-specific threshold mentioned above. Hence, in the coming experiments the sign which is most probable at the end of the presented sequence is defined as the recognized sign. It is worth noting that the system does not know the end of the sequence.

Both experiments, the signer dependent and the signer independent, work on signs of the British Sign Language (BSL). Sign language data was kindly provided by Richard Bowden from the University of Surrey. The same data has been used to train the HMM Recognition agent as well as the applied agents for hand posture recognition. Some of the signs have already been used in recognition experiments published by Bowden et al. (2004) and Kadir et al. (2004).

Parameter   Value   Description
Tr          0.01    Declare observation as known to the current state of the HMM
θ           -5      Correlation threshold
ξstart      -1      Activate a sign module
ξstop       -11     Reset a sign module

Table 7.1: HMM Recognition Agent Parameters — Parameters which are applied by the HMM Recognition Agent. These parameters are the same in each of the conducted experiments.

All experiments are run with the same set of parameters: there is no individual tuning by changing the thresholds listed in tab. 7.1.

7.1 BSL Data

The BSL database consists of a continuous movie with a separate file containing ground truth information about the first and last frame of each recorded sign. Overall, the movie shows 91 different signs, each repeated 10 times. The signs were performed by one professional signer. This makes a total of 29,219 images that have to be processed. The number of frames per sign ranges from 11 to 81. Even within the sign repetitions, the sequence length shows differences of approximately 50 percent, e.g., the length of the sign “live” ranges from 18 to 41 frames. All 91 signs and some additional information about the temporal variations between the different signs and the variations within the repetitions of each sign are listed in appendix C.

Fig. 7.2 illustrates the trajectories of two example sign sequences. The signer is wearing a red shirt and colored gloves, a yellow glove on the right hand and a blue one on the left hand. Thus, color segmentation is applied to extract the position of the hand (the center of gravity), the texture and the contour during the training of the recognition agents. This information in combination with the bunch graph face detection introduced in section 5.1.1 is applied to start the automated learning from examples, which builds up the hand posture sign lexicons for the bunch graph matching (section 5.2.1) and contour matching (section 5.2.2), respectively. These sign lexicons serve to generate the HMM for each sign as explained in section 6.2.2.

(a) different (b) bat

Figure 7.2: BSL Trajectories — The images show the starting position of the signs “different” and “bat”, including the different trajectories of ten repetitions by the same signer.

7.2 Signer Dependent Experiments

In addition to the recognition experiments in Bowden et al. (2004) and Kadir et al. (2004), further investigations such as testing the ability of the presented system to handle the effect of co-articulation and the rejection of unknown sequences were undertaken. Hence, three main signer dependent experiments were set up. The first two experiments investigate the mean recognition rate. It is computed with a known start of the sign in the first and an unknown start in the second experiment. In the third experiment the recognition system was confronted with unknown signs and was tested on its rejection capability. The result is given by the positive rejection rate.


(a) Input image (b) Color sensor detection

Figure 7.3: Attention Agent — In order to demonstrate the portability to experiments where the sign is performed with ungloved hands, the color sensor of the attention agent merges the computed color similarities of the blue, yellow and skin color values (a) into one overall map (b). The image on the right shows the color information map, which is used for the processing described in chapter 4.

Both recognition experiments were performed using a leave-one-out procedure, which means that for the tested sign all the sequences excluding the one which is tested were used to build the HMMs as described in section 6.2.1. Thus, ten recognition experiments per sign have been carried out. In contrast to the recognition experiments, all ten repetitions were used to generate the HMMs during the evaluation of the positive rejection rate in the last experiment.
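
The evaluation protocol corresponds to a standard leave-one-out loop; train_hmms and recognize below are stand-ins for the training and recognition steps of chapter 6:

    def leave_one_out_rate(sequences, train_hmms, recognize):
        """Recognition rate for one sign: each of the ten repetitions is
        tested against HMMs built from the remaining nine."""
        hits = 0
        for i, test_seq in enumerate(sequences):
            training = sequences[:i] + sequences[i + 1:]
            model = train_hmms(training)
            if recognize(model, test_seq):
                hits += 1
        return hits / len(sequences)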

The integration scheme of coupling strong and weak features has been tested by running each recognition experiment three times. In order to estimate the value of each feature, the first run integrates only position and contour information for the sign language recognition. In the second run the combination of position and texture information is applied. Finally, all features are integrated to evaluate the advantage of the multi-cue information integration. During the detection of the ROIs, the attention agent's color sensor fuses the color information of the blue, yellow and skin color blobs as illustrated in fig. 7.3 to demonstrate the transferability to experiments where the sign is performed with ungloved hands.


Experiment      Contour   Texture   Combined
Known Start     83.08     86.81     91.43
Unknown Start   78.02     84.61     87.03

Table 7.2: Mean Recognition Rates (in %) — As shown, the combination of both weak features with the position as the strong feature achieves the highest recognition rates for both experiments. The recognition rates decrease for the unknown start experiment, but are only lowered by approximately 5% for the combination run.

Recognition Experiments

Both recognition experiments differ only with respect to the start of the sign performance. The attained recognition rates are presented in tab. 7.2. In the case of a known start of the sign, a mean recognition rate of 83.08% was achieved by integrating the position and contour information. Although much slower in terms of processing speed, a better result of 86.81% was gained using the texture information instead of the contour information. As expected, the best mean recognition rate of 91.43% was achieved by the combination of all three cues.

In order to simulate the co-articulation, the recognition system started 10 frames before the ground truth setting. Having only changed the start of the input sequence, the experiment showed the same trend of the computed mean recognition rates as described above. Using only position and contour, the mean recognition rate was 78.02%. Again, the bunch graph in combination with the position showed a higher mean recognition rate of 84.61%. As in the experiments carried out with known start, the integration of the two weak cues gained the best result in terms of recognition rates; in this case a mean recognition rate of 87.03% was obtained. Comparing both experiments, tab. 7.2 shows that there is just a 5% difference between a given and a self-controlled start of the recognition.

The distribution of the recognition rates is shown in fig. 7.4 (a) for the given start of the sign and in fig. 7.4 (b) for the co-articulation experiment. They are presented in a histogram style where the number of recognized signs is the entry in the recognition rate bins.


[Histograms of the number of signs per recognition-rate bin (0–100%) for the contour, texture, and combined runs: (a) Known Start of Signing, (b) Unknown Start of Signing.]

Figure 7.4: BSL Recognition Rates — Both histograms show the number of recognized signs as the entry in the recognition rate bins. Especially the distribution plotted in fig. 7.4 (b) demonstrates the advantage of fusing both cues, contour and texture.


For example, as shown in fig. 7.4 (a) for the combined case, 21 signs were recognized with a recognition rate of 90%. Taking the first set of experiments, a mean recognition rate of 90% and higher could be achieved for over 84% of the presented signs. The signs with lower recognition rates of 0 to 10% were dominated by very similar signs with a shorter sequence length.

The behavior of the recognition system is illustrated by discussing two results of the recognition experiments. Tab. 7.3 and tab. 7.4 show the results2 of the recognition experiments for the signs “computer”, “excited interested” and “live”. The sign “computer” demonstrates the advantage of the multi-cue integration. While the sign is not recognized using only the position and contour information, using texture instead provides a mean recognition rate of 60% for the known start experiment and 40% for the unknown start experiment, respectively. Combining both weak cues enhances the mean recognition rate to 50% for the unknown start experiment, while the result of 60% stays the same for the known start experiment. Therefore, the applied information integration proved to handle the combination of two information streams and shows that the result can even be better than taking the best of the two single cues.

However, a disadvantage of the recognition system is shown by the results for the “excited interested” sign. This sign is dominated by the shorter “live” sign, which is very similar, as depicted in fig. 7.5. The difference between the trajectories is small compared to the inter-sign variations that can occur in other signs, like the “different” and “bat” gesture trajectories given in fig. 7.2. This misclassification trap is caused by the self-organizing property of the system. All known signs are in a loop and are thus waiting to become active by passing the activation threshold ξstart. Therefore, as seen above, signs with a low recognition rate are likely to be dominated by similar shorter signs. This effect can be decreased by introducing grammar or other non-manual observations like facial expressions into future sign language recognition systems.

2The complete list of recognition failures is presented in appendix D.


Known Start (Contour)
Gesture               Recognized Signs and Number of Hits
computer              best 7, father 1, how 2
excited interested    ill 2, excited interested 2, cheque 1, live 5
live                  live 10

Known Start (Texture)
Gesture               Recognized Signs and Number of Hits
computer              difficult 1, best 2, computer 6, father 1
excited interested    excited interested 1, cheque 1, live 6, dog 2
live                  best 1, live 9

Known Start (Combined)
Gesture               Recognized Signs and Number of Hits
computer              difficult 1, best 1, in 1, computer 6, father 1
excited interested    ill 2, excited interested 2, cheque 1, live 5
live                  live 10

Table 7.3: Recognition Results Known Start — This table displays a selection of the recognition results for the experiments with known start. The input signs are presented in the left column and the corresponding recognition results are listed in the right column. The “Number of Hits” entry codes how often the sign has been recognized.


Unknown Start (Contour)
Gesture               Recognized Signs and Number of Hits
computer              best 8, father 1, how 1
excited interested    excited interested 2, cheque 1, live 6, angry 1
live                  angry 1, live 9

Unknown Start (Texture)
Gesture               Recognized Signs and Number of Hits
computer              later 1, difficult 2, best 3, computer 4
excited interested    cheque 1, live 6, angry 1, dog 2
live                  angry 1, live 9

Unknown Start (Combined)
Gesture               Recognized Signs and Number of Hits
computer              later 1, difficult 1, ill 1, in 1, computer 5, father 1
excited interested    ill 1, excited interested 3, in 1, cheque 1, live 3, angry 1
live                  angry 1, live 9

Table 7.4: Recognition Results Unknown Start — This table displays a selection of the recognition results for the experiments with unknown start. The input signs are presented in the left column and the corresponding recognition results are listed in the right column. The “Number of Hits” entry codes how often the sign has been recognized.


(a) excited interested (b) live

Figure 7.5: Similar Signs — The trajectories and static hand gestures of the signs excited interested (left) and live (right) are very similar. Therefore, the shorter sign live dominates the recognition of the excited interested performance. The integration of non-manual observations like a grammar or facial expressions should help to differentiate between similar signs.

Sign Rejection

The common HMM recognizer determines the model with the best likelihood to be the recognized sign. However, it is not guaranteed that the pattern is really similar to the reference sign unless the likelihood value is high enough. A simple thresholding of the likelihood is difficult to realize and becomes even more complicated in the case of sign language recognition, where all signs differ in their length. Instead of a fixed value, Lee and Kim (1999) introduce a threshold HMM which provides the needed threshold. Their threshold HMM is a model for all trained gestures and thus its likelihood is smaller than that of the dedicated gesture model. The authors demonstrate their model on a sign lexicon of 10 gestures. Applying the threshold model to a database of 91 signs causes several problems when generating the HMM. The first problem concerns the distribution of the stored observations. When merging the emission distributions of every sign, the attained observation distribution of the threshold HMM might be spread in a way that fills the whole observation space. The same holds true for the distribution of the transition probabilities. Thus, the threshold model might lose its characteristic of producing a higher likelihood if the sign is not known. Or, even worse, it shows a higher probability and thus will better explain the observation sequence.

Differing from the threshold model, the filler model presented in Eickeler and Rigoll (1998), Eickeler et al. (1998) is trained on arbitrary and other “garbage” movements. Their rejection experiments are performed on a set of 13 different gestures and thus pose the same problems as mentioned above.

The presented recognition system handles the sign rejection with the autonomy of the applied sign modules. If the presented image sequence is not known to the recognition system, none of the HMM Gesture recognition modules becomes active or is able to reach a confidence value which is high enough to confirm a recognized sign during the recognition process. Thus, the positive rejection rate represents the ability of the recognition system to reject a sign that is not included in its learned sign memory. The mean positive rejection rate was computed on the 910 experiments with known start and using all sign modules except the one of the running sign. The recognition system achieved a mean positive rejection rate of 27.8%.

The false acceptance of signs can be explained by the missing competition and thus missing suppression in the Decision Center. As the corresponding sign is not included, similar signs are able to achieve their minimal confidence value and are declared as the recognized gesture.

However, the presented recognition system does not depend on any filler or threshold HMM and thus allows signs to be added or deleted without further processing of existing data.

Discussion

The experiments conducted in the present work show that the MAS proved to be a reliable recognition system for signer dependent sign language recognition. The presented information integration of strong and weak features proved to work well. Although the texture feature was shown to perform better than the contour, the combination of both features improves the recognition. All experiments were run with the same set of parameters; there was no individual tuning by changing the used thresholds.

When comparing the results of the recognition system with other systems, it should be considered that, in contrast to speech recognition, there is no standardized benchmark for sign language recognition. Thus, recognition rates cannot be compared directly (Kraiss, 2006). Nevertheless, there are impressive results by Zieren and Kraiss (2005) of 99.3% on a database of 232 isolated signs and von Agris et al. (2006) with 97.9% for a database of 153 isolated signs.

As the data processed in the present work is partly included in the work of Bowden et al. (2004) and Kadir et al. (2004), their results are more comparable with the results of the present work than the ones mentioned above. Bowden et al. (2004) achieved a mean recognition rate of 97.67% for a sign lexicon of 43 signs, while Kadir et al. (2004) achieved a recognition rate of 92% for a lexicon of 164 words. The comparable recognition rate achieved in the present work is 91.43% on 91 signs and is therefore in line with the other systems. However, the strength of the MAS is the autonomy of the recognition process in order to handle the effect of co-articulation and the rejection of unknown signs. Both experiments have not been addressed in the works mentioned above.

7.3 Different Signer Experiments

Analysis of the hand motion reveals that the variations between different signers are larger than within one signer (von Agris et al., 2006). The authors also confirm that other manual features such as hand posture and location exhibit analogous variability.

Besides the problems which have to be solved for signer dependent sign language recognition, like temporal and spatial variations in the performance of the sign (which are even enlarged by the individual performance of each signer), the signer independent system has to deal with the different physiques of the signers as well. Two approaches can be applied to solve these problems.

The first approach is to build the recognition on features that are described in a way that the differences mentioned above are too small to disturb the coarse description of the feature. This approach is used in Eickeler and Rigoll (1998), where signer independence is achieved by using general features computed from difference (pixel change) images of the whole body region. Their recognition system proved to work with 13 different gestures. One problem resulting from the coarse feature description is the small number of gestures that can be distinguished by the recognition system.

The other approach concerns signer adaption methods as presented by von Agris et al. (2006). The authors extract geometric and dynamic features, which are adapted using methods taken from speech recognition research. In their experiments, three signers are used for training and one signer for testing. The system was tested on 153 isolated signs. Using a supervised adaption with 80 adaption sequences, their recognition rate shows an accuracy of 78.6%.


Signer independence in the present work is investigated by running the recognition experiment with a known start on a subset of the signs mentioned above. The MAS recognition system is not changed, it is the same as for the signer dependent experiments, except for the color sensor of the attention agent, which is set to skin color. Hence, the system has been trained on data where the signer is wearing gloves, and thus the applied recognition agents store contours and bunch graphs which were trained on the glove data. Therefore, these experiments not only test the signer independence of the recognition, i.e., the difference in performing the gesture and having a different body structure, but also indirectly test whether the applied object recognition methods generalize over identity.

The signer independence of the recognition has been tested on a subset of signs (“hello”, “meet”, “fast”, “know”) included in two data sets. In the first set the signs were performed by the same signer as in the experiments in section 7.2. Recorded in different sessions, the signs were cut out of a running sentence performed without colored gloves. The signs in the second set were performed by members of the Institut für Neuroinformatik, Ruhr-University of Bochum, Germany, who performed sign language for the first time.

Object tracking worked well on both data sets, as shown in fig. 7.6 for the first data set and in fig. 7.7 for the second set. However, the recognition system could only recognize the sign “meet” of the first data set and failed on all other presented sign sequences.

Discussion

Possible reasons for the failure of the signer independent recognition are the following. One major problem is the temporal and spatial variation between the training and the testing set. Temporal, due to the fact that the recorded signs of the first set, although recorded by the same signer as in the training set, are shorter than the shortest sequence in the training sets and could therefore not reach the required confidence threshold of the recognized sign. The spatial variation means that the position for performing the sign is outside the position variations of the trained position HMM.

Another difficulty concerns the feature extraction. The classification of the hand postures, based on data trained on the BSL database and therefore on colored gloves, is not adequate for signer independence.



Figure 7.6: Different Signer Detection “difficult” — This sequence of images was used as input for the signer independent recognition. As shown in the sequence, the tracking performed well. The problems that occurred concern the recognition of the hand posture and the length of the sequence, which is only half as long as the shortest sign sequence in the training set.



Figure 7.7: Different Signer Detection “fast” — This sequence was performed by an amateur who performed sign language for the first time. Therefore, temporal and spatial variations in the performance as well as the recognition failures of the hand posture did not allow the recognition. However, the tracking performed well in the cluttered environment. The first frame shows the system during its initialization phase; the face and the hands are not yet recognized and thus are coded with yellow boxes.


Chapter 8

Discussion and Outlook

8.1 Discussion

In order to realize the sign language recognition system, a software framework to design and test multi-agent systems has been built. The characteristics of the implemented multi-agent system are autonomous and cooperating units. Principles like divide and conquer, learning from examples and self-control have been applied for object tracking and sign language recognition. Both systems are further divided into smaller subsystems, which are realized as simultaneously running agents. The modular framework allows the recognition system to be easily extended. New signs are included by connecting their HMM Sensors to a Gesture HMM Integration module which is added to the HMM recognition agent.

The recognition of signs is realized by introducing a modification to the standard HMM architecture. The task of the HMMs is to store feature sequences and their variations. This data is compared with an incoming feature sequence in order to recognize the performed sign. The presented recognition system divides the input features into two types of information. Reliable features show temporal continuity and are more robust under variations between the observed and the learned data, while weaker features are not as robust and will therefore fail more often to be recognized from the observations. Both types of information channels are integrated by using a correlation and rewarding scheme. Another innovation is the competition of the learned signs during the recognition process.

In addition to satisfactory recognition results, the autonomy of the system allows the problem of co-articulation to be handled. Although the sign rejection experiments did not show the expected positive rejection rate, the ability of sign rejection is immanent in the design of the recognition system. Therefore it does not need extra modules like threshold or filler HMMs.

Only simple features like the position and hand posture have been applied. The present work does not include a grammar or a high-level description. These would be an interesting challenge for future projects and are discussed in section 8.2.

8.2 Outlook

The presented work can and should be further enhanced by systematically investigating the parameters which control the dynamics and the tracking process as well as the applied thresholds.

Although the system is designed to work online, and thus presents the most probable sign for each frame, the recognition processes are too expensive to run in real-time. This problem does not hold if only tracking is demanded; in this case the tracking runs in real-time. In order to speed up the recognition, the modular architecture allows the agents to be executed simultaneously on different computers. First tests using a CORBA interface have been done successfully.

Based on the presented work, sign language research can be continued in the following directions:

Online Learning

One would like a recognition system to be adaptive to sign variations as well as capable of learning new signs. The modular design of the present work favors an easy integration of new sign modules. The most challenging task is to build the whole system from scratch by starting with an empty HMM recognition agent. In this case, it would be realistic to start the learning with a defined start and end of the observed sign. Further, under the hypothesis of suitable color segmentation, even the hand posture sign lexicons for contour and texture could be built from scratch. Nevertheless, the first step should be to start learning by adding recognized signs to the corresponding sign modules. In a second step, the rejection capability of the system should be enhanced. This allows new gestures to be found and then added to the HMM recognition agent.

The applied HMM structure allows the observation distribution to be expanded by simply adding a new entry in the histogram for discrete observations or a new Gaussian in the case of a continuous observation, respectively. Alternatively, the weights of the existing Gaussian mixtures could be adjusted to avoid a distribution with too many Gaussians. New states can easily be added to the HMM if the new sign has more frames than the previously learned ones. However, a problem occurs if the new sign is shorter. In this case, the HMM might not reach the needed confidence threshold which is used to declare the probable occurrence of the sign. As the start of the sign is expected to be given, the computed overall quality might be used to determine the similarity of the observation under the condition that the sign module is active.

Integration of Non-Manual Information

The integration of facial expressions is often demanded and important for the full understanding of sign language. However, facial expressions are hard to recognize and should therefore be integrated as a weak feature, giving a reward if the observation matches the expected expression and changing nothing otherwise. Classification of facial expressions is treated in Tewes et al. (2005) and could be imported as a new HMM Sensor using a discrete feature description.

The consideration of grammar is an important feature for continuous sign language recognition. In the present work, the grammar would not be used to recognize the whole sentence; instead it would predict the appearance of a sign in the context of the previously seen observations. Like the facial expression, the grammar could be integrated as a weak cue and thus contribute a reward.

Person Independence

The most challenging task for sign language recognition systems is the generalization over signer identity. The signer independence capability of the present work could be realized by an enhancement of the hand posture recognition as well as an adaption of the position feature. The trajectories of the hands have to be adjusted to the characteristic behavior of the performing signer.

In order to improve the contour matching, the first task would be to enhance the contour extraction by using improved color segmentation. A second approach concerns the process of contour matching, which is described in Horn (2007). The author investigates the advantage of integrating contour and texture information as well as the detection of contours. The improvement of the bunch graph concept in order to allow a more generalized hand posture recognition is more difficult. Triesch and von der Malsburg (2002) showed that generalization could be achieved if the bunch graph stores the hand postures of multiple persons. As a second extension, the variations of the landmarks could be learned and stored as a special move, as shown for facial expressions in Tewes (2006). However, both enhancements of the bunch graph require more human interaction, at least for the initialization.

The adaption of the position information has to solve several problems, because the position of the trajectory in the signing space might differ not only by a linear shift; the whole performance could also show nonlinear variations in form and space. The approach of collecting more data from different signers to train the HMM is somewhat limited: as the variations get too broad, the distribution loses its characteristics and becomes less distinguishable from the other signs. Thus, it seems necessary to find a transformation that adapts the learned position information to the observed position sequence. This solution would require solving a global optimization over the learned signs during runtime. In order to limit the number of applied signs, the system would profit from the introduction of a grammar and the improved hand posture recognition mentioned above.


Appendix A

Democratic Integration

Democratic integration was first introduced in Triesch and von der Malsburg (2001a) and developed for robust face tracking. By using democratic integration, different cues like color, motion, motion prediction and pixel templates are integrated to agree on one result. After this decision each cue adapts towards the result agreed on. In particular, discordant cues are quickly suppressed and re-calibrated, while cues having been consistent with the result in the recent past are given a higher weight for future decisions.

The information fusion using democratic integration relies on two underlying assumptions. The first one is that the cues have to be statistically dependent, otherwise there is no point in trying to integrate them. The second assumption is that the environment has to exhibit a certain temporal continuity. Otherwise, any adaption would be useless.

In the following, the democratic integration will be explained in the context of the MAS (chapter 3) and the visual tracking (chapter 4) applied in the present work. Thus, each of the above mentioned cues is represented by a specific sensor, working on the two-dimensional attention region of the agent. Referring to fig. A.1, each sensor i provides a saliency map Mi(x, t) at time t that shows the image similarity at each coordinate x with an agent-specific and adaptable prototype template Pi(t). The prototype default template is extracted from the object when the agent is initialized. The integration is performed in the cueIntegrator layer by summing up the weighted saliency maps to an overall map R(x, t),

    R(x, t) = ∑_i r_i(t) M_i(x, t).    (A.1)

The weight ri(t) is part of the self-control of sensor i and is called reliability. All reliabilities included in the sensors of the agent are normalized and therefore sum to one, ∑_i r_i(t) = 1.


Figure A.1: Democratic Integration — Tracking agent in use. On the left we see the tracking result marked with the circle; the rectangle shows the border of the agent's search region. On the right we see the similarity maps created by the different sensors, from left to right: color, motion, motion prediction and pixel template. The fusion center shows the result of the information integration.

In order to find the current target position x(t), the overall similarity map is scanned for the maximum entry

    x(t) = arg max_x { R(x, t) }.    (A.2)

After the target position x(t) is determined, the tracking agent rates its success by analyzing the similarity value at position x(t) in the map R(x, t). If the value is above a threshold, the image point has been found and thus the object successfully tracked. Otherwise, the tracking agent has failed and lost the image point. Depending on the threshold and the similarity value at the target position, the agent is capable of updating the reliabilities and adapting its sensors to the new situation.

Adaption of Reliabilities

The benefit of each sensor can be evaluated by introducing a quality value q̃i(t), which computes how well the sensor i could predict the target position x(t), using

    q̃_i(t) = R( M_i(x(t), t) − ⟨M_i(x, t)⟩ ),    (A.3)

where ⟨. . .⟩ denotes an average over all positions x, and

    R(x) = { 0 : x < 0,  x : x ≥ 0 }    (A.4)

is the ramp function. Thus, the response of the sensor at the estimated target position is compared to its response average over the whole attention region of the agent. If the response at x(t) is above the average, the quality is given by the distance to that average, otherwise the quality is set to zero. Normalized qualities qi(t) are defined by:

    q_i(t) = q̃_i(t) / ∑_j q̃_j(t).    (A.5)

The tracking is made adaptive by defining dynamics based on the derivative of the reliability r and the time constant τ. The reliabilities are coupled to the normalized qualities by using:

    τ ṙ_i(t) = q_i(t) − r_i(t).    (A.6)

Hence, a sensor with a quality measurement higher than its current reliability will tend to increase its reliability, while a sensor with a quality lower than its current reliability will have its reliability lowered. Therefore, sensors whose quality remains zero will end up with a reliability of zero, i.e., will be completely suppressed. The time constant τ is considered to be sufficiently large. Thus, the dynamics of the ri(t) in eq. (A.6) react slowly to changes of the qi(t), which are expected to be spoiled by high-frequency noise. Due to the qi(t), the ri(t) are normalized as well: if the reliabilities are initialized such that their sum is one, the sum will remain one throughout. Hence, the reliabilities can be regarded as weights. Actually, the ri(t) code a running average of the qualities of the sensor. The q̃i(t), and hence the qi(t), depend on the distribution of the resulting similarity maps Mi(x, t) and on the position x of the maximal response in R(x, t). As R(x, t) is a superposition of the resulting Mi(x, t) weighted by ri(t), the qualities qi(t) are indirectly influenced by the reliabilities ri(t).
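
One fusion-and-adaptation step of eqs. (A.1)-(A.6) can be sketched as follows; the plain arg max of eq. (A.2), the Euler discretization of eq. (A.6) and the value of tau are simplifying assumptions (the modified target search described below is omitted):

    import numpy as np

    def democratic_integration_step(maps, reliabilities, tau=20.0):
        """Fuse the saliency maps M_i weighted by r_i (A.1), locate the target
        (A.2) and update the reliabilities via (A.3)-(A.6)."""
        R = sum(r * M for r, M in zip(reliabilities, maps))       # eq. (A.1)
        target = np.unravel_index(np.argmax(R), R.shape)          # eq. (A.2)

        # qualities: response at the target minus the map average,
        # clipped at zero by the ramp function, eqs. (A.3)/(A.4)
        q = np.array([max(M[target] - M.mean(), 0.0) for M in maps])
        if q.sum() > 0:
            q = q / q.sum()                                       # eq. (A.5)

        # Euler step of tau * dr/dt = q - r, eq. (A.6)
        reliabilities = reliabilities + (q - reliabilities) / tau
        return target, reliabilities

    # usage with two random saliency maps and equal initial reliabilities
    maps = [np.random.rand(60, 80), np.random.rand(60, 80)]
    target, r = democratic_integration_step(maps, np.array([0.5, 0.5]))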

Adaption of Prototypes

The dynamics introduced above allow the system to suppress discordant cues. However, the second adaption concerns the prototypes Pi of the sensors, which in the previous discussion were assumed to be fixed. The attached sensors change their prototypes in such a way that their outputs improve the total result. The adaption of the sensor prototype depends on the tracking result. If the target was found in the image, the prototype adapts towards the current parameters at the target position x(t). Otherwise it adapts towards the default template, initialized at the beginning, allowing a recalibration of discordant cues.


In order to explain the prototype dynamics, a function fi(I(x, t)) extracting a feature vector P̃i(x, t) from a fixed image region around x is introduced. Further, the feature vector extracted at x(t) is denoted by P̃i(t). Then the prototype dynamics are defined by:

    τ_i Ṗ_i(t) = P̃_i(t) − P_i(t) = f_i(I(x(t))) − P_i(t).    (A.7)

The time constants τi are different from the time constant τ which describes the adaption of the reliabilities in eq. (A.6). The saliency map of a sensor is computed by comparing its current prototype to feature vectors extracted from the agent's attention region. After the computation of R, the prototype moves towards the feature vector P̃i(t) extracted from the input image. If the scene is stationary, i.e., P̃i(t) is constant, then Pi(t) will converge to P̃i(t) with time constant τi. Thus, τi determines how quickly the prototype of a cue adapts towards changes in the object appearance.

Eq. (A.2) proved to work well on small objects (Triesch and von der Malsburg, 2001a; Kahler and Denzler, 2005) with a unimodal similarity map. Larger objects can create a multi-modal similarity map with more than one peak. Hence, the search for the target position in the present work is modified by thresholding the map. From the remaining peaks, the center of gravity of the segment in which the old target position lies, or of the segment with the minimum distance, is calculated and defines the current target position. In some cases the target position might not be located on the tracked object or might be a bad point to track. Nevertheless, the experiments showed that this was not the case for different tracking scenarios and that the tracking became more stable.


Appendix B

Gabor Wavelet Transformation

A Gabor jet J(x) = {J_j(x)} is a local feature which describes the distribution of gray values in an image I(x) around a given pixel position x. It codes the texture and is based on a wavelet transform, defined by the convolution

    J_j(x) = \int I(x') \, \psi_j(x - x') \, d^2x'    (B.1)

with a family of Gabor wavelets (Würtz, 1995)

    \psi_j(x) = \frac{k_j^2}{\sigma^2} \exp\left(-\frac{k_j^2 x^2}{2\sigma^2}\right) \left[\exp(i k_j x) - \exp\left(-\frac{\sigma^2}{2}\right)\right].    (B.2)

Thus, a Gabor wavelet is represented by a plane wave with wave vector k_j which is restricted by a Gaussian envelope function of width σ.

Since this is a wavelet transformation, the family of wavelets is self-similar in the sense that all wavelets can be generated from one mother wavelet by changing the scaling (ν) and orientation (µ).

This is done using a discrete set of different frequencies and orientations,

    k_j = \begin{pmatrix} k_{jx} \\ k_{jy} \end{pmatrix} = \begin{pmatrix} k_\nu \cos\varphi_\mu \\ k_\nu \sin\varphi_\mu \end{pmatrix}    (B.3)

    k_\nu = 2^{-\frac{\nu+2}{2}} \pi \quad \text{with } \nu \in [0, M-1]    (B.4)

    \varphi_\mu = \mu \frac{\pi}{B} \quad \text{with } \mu \in [0, B-1]    (B.5)

with index j = µ + νB, where B stands for the number of directions and M for the number of scales used to create the jet.
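For illustration, the wavelet family of eqs. (B.2)–(B.5) could be set up as in the following C++ sketch; the type and function names (WaveVector, makeWaveVectors, gaborWavelet) are illustrative and not taken from the thesis code.

#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

struct WaveVector { double kx, ky; };

// Discrete frequency/orientation grid of eqs. (B.3)-(B.5); index j = mu + nu * B.
std::vector<WaveVector> makeWaveVectors(int M, int B)
{
    const double pi = 3.14159265358979323846;
    std::vector<WaveVector> k(static_cast<std::size_t>(M) * B);
    for (int nu = 0; nu < M; ++nu) {
        const double k_nu = std::pow(2.0, -(nu + 2) / 2.0) * pi;   // k_nu = 2^(-(nu+2)/2) * pi
        for (int mu = 0; mu < B; ++mu) {
            const double phi_mu = mu * pi / B;                     // phi_mu = mu * pi / B
            k[static_cast<std::size_t>(nu) * B + mu] = { k_nu * std::cos(phi_mu),
                                                         k_nu * std::sin(phi_mu) };
        }
    }
    return k;
}

// Gabor wavelet of eq. (B.2) evaluated at image offset (x, y); the subtracted
// exp(-sigma^2 / 2) term makes the filter DC-free.
std::complex<double> gaborWavelet(const WaveVector& k, double sigma, double x, double y)
{
    const double k2  = k.kx * k.kx + k.ky * k.ky;
    const double x2  = x * x + y * y;
    const double env = (k2 / (sigma * sigma)) * std::exp(-k2 * x2 / (2.0 * sigma * sigma));
    const std::complex<double> wave(std::cos(k.kx * x + k.ky * y),
                                    std::sin(k.kx * x + k.ky * y));
    return env * (wave - std::exp(-sigma * sigma / 2.0));
}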


Convolving an image with a whole family of Gabor wavelets of different scale and orientation provides a set of complex coefficients at each pixel. Hence, a jet J is defined as a set {J_j} of B × M complex coefficients

    J_j = a_j \exp(i\phi_j)    (B.6)

having amplitudes a_j(x) that are slowly varying with position and phases φ_j(x) varying with the spatial frequency given by the wave vector k.

It has been shown that so-called simple cells located in the visual cortex of cats, whose responses depend on frequency and orientation of the visual stimuli, can be modelled as linear filters and also approximated using Gabor wavelets (Pollen and Ronner, 1981; Jones and Palmer, 1987).

Similarity Function for Comparing Gabor Jets

The similarity function for comparing Gabor jets applied in the present work has been proposed by Lades et al. (1993); Wiskott (1995):

    S_{abs}(\mathcal{J}, \mathcal{J}') = \frac{\sum_j a_j a'_j}{\sqrt{\sum_j a_j^2 \sum_j a'^2_j}}.    (B.7)

S_abs uses only the magnitudes a of the jet entries and has the form of a normalized scalar product. The function eq. (B.7) yields similarity values between zero and one.

With J a fixed jet, and J' = J'(x) the jet at position x in an image, S_abs(J, J'(x)) proved to be a smooth function. It is therefore used throughout the present work.
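A direct transcription of eq. (B.7) into C++ might look like the following sketch, which assumes that both jets were built with the same number of coefficients; the function name is illustrative.

#include <cmath>
#include <cstddef>
#include <vector>

// Normalized scalar product of eq. (B.7) on the jet magnitudes a_j.
// Both vectors are assumed to hold the same B x M magnitudes; the result
// lies between zero and one for non-negative magnitudes.
double similarityAbs(const std::vector<double>& a, const std::vector<double>& aPrime)
{
    double num = 0.0, normA = 0.0, normB = 0.0;
    for (std::size_t j = 0; j < a.size(); ++j) {
        num   += a[j] * aPrime[j];
        normA += a[j] * a[j];
        normB += aPrime[j] * aPrime[j];
    }
    if (normA == 0.0 || normB == 0.0) return 0.0;   // guard against empty jets
    return num / std::sqrt(normA * normB);
}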

Similarity Function for Comparing Model Graphs

The similarity s_G of two model graphs G and G'¹ with N nodes is computed by comparing their Gabor jets J_i and J'_i at node position i using eq. (B.7) and:

    s_\mathcal{G} = \frac{1}{N} \sum_{i=1}^{N} S_{abs}(\mathcal{J}_i, \mathcal{J}'_i).    (B.8)

This equation is modified for the comparison of a bunch graph B with a model graph G'. In this case only the jet entry b of B with the highest similarity to the model graph jet entry is chosen for the computation of the similarity s_B:

    s_\mathcal{B} = \frac{1}{N} \sum_{i=1}^{N} \max_b \left\{ S_{abs}(\mathcal{J}_{i,b}, \mathcal{J}'_i) \right\}.    (B.9)

¹ In the elastic graph matching, G' is extracted from the input image.
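The two graph similarities can be sketched as follows, under the simplifying assumption that each jet is represented only by its magnitude vector and that similarityAbs implements eq. (B.7) as in the sketch above; all names are illustrative.

#include <algorithm>
#include <cstddef>
#include <vector>

using JetMagnitudes = std::vector<double>;
double similarityAbs(const JetMagnitudes& a, const JetMagnitudes& b);  // eq. (B.7), see above

// Eq. (B.8): average jet similarity over the N nodes of two model graphs.
double graphSimilarity(const std::vector<JetMagnitudes>& G,
                       const std::vector<JetMagnitudes>& GPrime)
{
    double s = 0.0;
    for (std::size_t i = 0; i < G.size(); ++i)
        s += similarityAbs(G[i], GPrime[i]);
    return s / G.size();
}

// Eq. (B.9): for a bunch graph, each node holds several jets (one per bunch
// entry b); only the best-matching entry contributes to the average.
double bunchGraphSimilarity(const std::vector<std::vector<JetMagnitudes>>& B,
                            const std::vector<JetMagnitudes>& GPrime)
{
    double s = 0.0;
    for (std::size_t i = 0; i < B.size(); ++i) {
        double best = 0.0;
        for (const JetMagnitudes& bunchJet : B[i])
            best = std::max(best, similarityAbs(bunchJet, GPrime[i]));
        s += best;
    }
    return s / B.size();
}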


Applied Parameters of Gabor Jets

The present work applies the concept of bunch graph matching for face detection and the classification of hand postures, cp. chapter 5. Both cases use different sets of parameters to build the Gabor jets. While the face detection uses the parameters listed in tab. B.1, the jets which are applied to extract the texture information for the hand posture classification are chosen to code more details of the surrounding texture on smaller scales. These parameters are listed in tab. B.2.

Parameter   Value   Description
M           5       number of scales
B           8       number of directions
σ           2π      width of Gaussian envelope

Table B.1: Face Detection — Parameters which are applied for the Gabor jets in the face detection bunch graph matching.

Parameter   Value   Description
M           10      number of scales
B           16      number of directions
σ           1.1     width of Gaussian envelope

Table B.2: Hand Posture Classification — Parameters which are applied for the Gabor jets in the hand posture bunch graph classification.
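Purely as an illustration, the two parameter sets of tab. B.1 and tab. B.2 could be grouped as in the following sketch; the struct and variable names are assumptions and do not appear in the thesis.

// Illustrative grouping of the Gabor jet parameter sets.
struct GaborJetParameters {
    int    scales;      // M
    int    directions;  // B
    double sigma;       // width of the Gaussian envelope
};

const GaborJetParameters faceDetectionParams { 5,  8,  2.0 * 3.14159265358979323846 };  // tab. B.1
const GaborJetParameters handPostureParams   { 10, 16, 1.1 };                           // tab. B.2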


Appendix C

BSL Database

Table C.1: BSL Sign Database — This table lists the 91 signs provided by the British Sign Language database used in this thesis. Each sign has been recorded ten times. The temporal variations are given by the number of frames of the shortest, the longest, and the mean (taken over the ten repetitions) sign length.

Gesture                Temporal Variations in Frames
                       mean length   min length   max length

about                  25            18           38
age                    36            26           44
america                75            64           81
angry                  20            12           28
anxious                44            35           59
apple                  23            19           29
argue                  48            40           65
ask                    19            15           28
bad                    32            24           47
ball                   39            31           51
banana                 57            46           73
bat                    23            18           30
before                 26            20           46
believe                31            27           45
best                   18            13           23
black                  21            16           28
blind                  39            32           48
blue                   42            35           52


book                   28            20           49
bottle                 34            23           62
box                    38            29           46
boy                    31            20           45
british                39            32           58
brown                  40            34           50
busy                   28            18           44
but                    24            18           30
can                    21            14           25
cant                   34            30           44
car                    48            42           56
cat                    26            22           31
cheque                 31            22           39
class                  35            26           41
clever                 27            23           34
coffee                 40            37           45
collage                27            21           34
comunity               33            29           41
computer               69            64           79
confidence             33            29           40
confused               41            34           56
continue               36            26           44
comunication           47            38           56
dark                   25            20           31
deaf                   18            11           22
different              25            21           29
difficult              37            35           44
dog                    30            26           39
dont know              23            21           29
dont want              19            17           27
drunk                  19            16           23
eat                    34            30           42
english                44            36           54
excited interested     55            45           68
family                 57            52           65
fantastic              22            14           41
fast                   27            23           32
father                 30            25           38


forget                 27            20           39
friend                 45            40           52
from                   21            17           26
good                   20            16           29
green                  20            12           29
guilty                 29            24           35
handsome               38            33           47
hearing                16            14           23
hello                  41            34           48
here                   18            13           23
house                  30            26           33
how                    17            13           21
hungry                 47            41           58
I me                   17            14           20
ill                    19            15           26
in                     14            12           18
know                   42            41           47
lady                   30            24           42
language               20            18           24
last week              18            15           22
later                  35            32           41
laugh                  35            33           39
learn                  49            39           54
level                  21            17           26
light morning          23            13           28
like                   30            25           34
listen                 21            17           28
live                   34            18           41
long time ago          40            35           44
love                   21            16           26
man                    22            20           27
many                   28            20           38
maybe                  46            38           60
meet                   20            15           26
take care              31            20           35


Appendix D

BSL Experiments Results

The following tables list the recognition failures for the specific experiments. The input signs are presented in the left column and the corresponding recognition results are listed in the right column. The “Number of Hits” entry codes how often the sign has been recognized. The “NO WINNER” entries denote the case where no sign module reached the required confidence value.

Table D.1: Known Start — Position, Contour and Texture are integrated.

Gesture                Recognized Signs and Number of Hits

apple                  apple 9, can 1
ask                    hearing 1, ask 9
banana                 NO WINNER 1, about 3, banana 6
bat                    NO WINNER 1, here 1, bat 8
before                 before 9, last week 1
believe                NO WINNER 2, believe 6, fantastic 2
best                   NO WINNER 1, best 9
blind                  eat 2, blind 8
book                   book 9, in 1
bottle                 NO WINNER 1, bottle 9
boy                    boy 9, drunk 1
british                NO WINNER 1, british 9
but                    but 9, bad 1
cant                   cant 9, can 1
clever                 laugh 1, NO WINNER 1, clever 6, listen 2
computer               difficult 1, best 1, in 1, computer 6, father 1
comunity               comunity 9, fantastic 1
confidence             i me 1, confidence 9


continue               NO WINNER 1, continue 9
deaf                   hearing 9, deaf 1
different              NO WINNER 1, different 8, angry 1
dont know              deaf 3, dont know 7
eat                    eat 9, blind 1
excited interested     ill 2, excited interested 2, cheque 1, live 5
family                 family 9, english 1
father                 best 1, father 9
forget                 NO WINNER 1, forget 9
good                   here 3, good 7
hearing                hearing 9, deaf 1
house                  NO WINNER 1, house 9
laugh                  laugh 5, coffee 1, apple 4
like                   like 5, i me 5
love                   bottle 1, love 9
maybe                  here 1, maybe 9
take care              NO WINNER 1, takecare 8, can 1


Table D.3: Known Start — Position and Texture are integrated.

Gesture                Recognized Signs and Number of Hits

about                  about 9, father 1
anxious                comunity 5, in 1, anxious 3, fantastic 1
apple                  apple 9, age 1
argue                  live 1, argue 9
ask                    hearing 1, ask 9
banana                 about 5, in 2, banana 3
bat                    NO WINNER 2, bat 8
before                 before 7, last week 3
believe                NO WINNER 1, believe 4, last week 2, man 1, fantastic 1, black 1
best                   NO WINNER 1, best 9
blind                  eat 2, blind 8
book                   book 9, in 1
bottle                 NO WINNER 1, bottle 9
box                    how 1, box 9
british                best 1, british 9
cant                   NO WINNER 1, cant 9
car                    car 7, live 2, cheque 1
clever                 NO WINNER 1, coffee 1, clever 6, listen 2
computer               difficult 1, best 2, computer 6, father 1
comunity               comunity 9, fantastic 1
confidence             NO WINNER 1, i me 2, confidence 7
continue               NO WINNER 1, continue 9
deaf                   hearing 9, deaf 1
different              NO WINNER 2, different 8
difficult              difficult 9, father 1
dont know              NO WINNER 1, deaf 2, dont know 7
eat                    eat 9, blind 1
excited interested     excited interested 1, cheque 1, live 6, dog 2
family                 best 1, family 9
fast                   i me 1, fast 9
father                 best 1, father 9
forget                 forget 8, deaf 2
good                   here 4, good 6
green                  green 9, last week 1
here                   i me 1, here 8, from 1
house                  NO WINNER 1, house 9


i me                   NO WINNER 1, i me 9
later                  later 7, best 1, in 2
laugh                  laugh 2, coffee 2, apple 6
like                   like 3, i me 7
live                   best 1, live 9
love                   bottle 1, love 9
man                    i me 1, man 9
maybe                  here 1, maybe 9
take care              NO WINNER 2, eat 1, takecare 4, coffee 3


Table D.5: Known Start — Position and Contour are integrated.

Gesture                Recognized Signs and Number of Hits

age                    takecare 1, age 9
america                NO WINNER 6, america 4
anxious                anxious 9, light morning 1
apple                  apple 9, can 1
ask                    hearing 1, ask 9
bad                    good 1, bad 9
ball                   meet 4, ball 6
banana                 about 1, NO WINNER 1, best 1, in 2, banana 5
bat                    NO WINNER 1, bat 9
before                 before 9, last week 1
believe                NO WINNER 1, believe 6, deaf 1, fantastic 2
best                   NO WINNER 1, best 9
black                  takecare 1, black 9
blind                  eat 2, blind 8
bottle                 NO WINNER 1, bottle 9
box                    how 2, box 8
boy                    boy 9, drunk 1
british                ill 2, best 2, british 6
busy                   NO WINNER 1, busy 9
but                    but 9, bad 1
cant                   dont want 1, cant 9
clever                 laugh 1, NO WINNER 2, clever 5, listen 2
computer               best 7, father 1, how 2
comunity               comunity 8, in 1, fantastic 1
confidence             NO WINNER 1, i me 1, confidence 8
continue               NO WINNER 4, continue 6
deaf                   hearing 8, deaf 2
different              NO WINNER 3, different 7
dont know              NO WINNER 1, deaf 3, dont know 6
dont want              dont want 9, good 1
eat                    laugh 1, eat 6, continue 1, blind 1, age 1
excited interested     ill 2, excited interested 2, cheque 1, live 5
family                 NO WINNER 7, best 1, family 2
fast                   i me 1, fast 9
forget                 forget 8, but 1, deaf 1
from                   here 1, from 9
good                   here 2, good 7, from 1


guilty                 guilty 9, meet 1
hearing                hearing 9, deaf 1
house                  NO WINNER 1, ill 4, house 5
lady                   lady 9, coffee 1
language               NO WINNER 1, language 9
level                  NO WINNER 10
like                   like 2, i me 8
listen                 deaf 1, listen 9
long time ago          NO WINNER 1, long time ago 9
love                   NO WINNER 2, love 8
many                   NO WINNER 2, many 5, best 3
maybe                  here 2, maybe 8
take care              takecare 1, lady 7, coffee 1, can 1


Table D.7: Unknown Start — Position, Contour and Texture are integrated.

Gesture                Recognized Signs and Number of Hits

age                    can 2, blind 1, age 7
america                NO WINNER 2, america 8
angry                  live 1, angry 9
apple                  laugh 1, eat 1, can 1, apple 7
ask                    hearing 1, ask 9
banana                 about 1, best 1, in 1, banana 7
bat                    i me 1, here 1, bat 8
before                 before 9, last week 1
believe                believe 4, last week 1, drunk 1, fantastic 4
black                  deaf 1, can 1, blind 1, black 7
blue                   green 1, best 1, blue 8
box                    how 1, box 9
british                ill 1, best 3, british 6
but                    but 9, bad 1
cant                   cant 9, good 1
car                    car 8, live 1, cheque 1
cat                    cat 9, light morning 1
clever                 NO WINNER 1, clever 8, can 1
coffee                 hearing 1, coffee 8, deaf 1
computer               later 1, difficult 1, ill 1, in 1, computer 5, father 1
deaf                   hearing 6, deaf 4
difficult              difficult 9, best 1
dont know              deaf 4, dont know 5, listen 1
dont want              NO WINNER 1, dont want 9
eat                    eat 9, takecare 1
english                best 1, english 9
excited interested     ill 1, excited interested 3, in 1, cheque 1, live 3, angry 1
family                 NO WINNER 1, family 9
fast                   i me 3, fast 7
father                 best 1, father 9
forget                 hearing 1, forget 8, bad 1
friend                 friend 9, best 1
good                   here 1, good 7, bat 2
green                  i me 1, green 9
hearing                hearing 9, deaf 1


here                   i me 1, here 9
house                  bat 1, house 9
ill                    ill 9, best 1
know                   know 9, deaf 1
lady                   hearing 1, lady 9
language               ill 1, language 9
last week              green 1, last week 9
later                  later 9, best 1
laugh                  eat 2, coffee 6, apple 2
learn                  NO WINNER 1, learn 9
like                   like 6, i me 3, fantastic 1
listen                 deaf 1, listen 9
live                   angry 1, live 9
love                   bottle 1, love 9
man                    i me 1, man 7, apple 1, fantastic 1
maybe                  good 1, maybe 9
take care              takecare 3, lady 1, coffee 5, can 1


Table D.9: Unknown Start — Position and Texture are integrated.

Gesture                Recognized Signs and Number of Hits

about                  about 8, father 1, english 1
age                    blind 1, age 9
america                NO WINNER 1, best 1, america 8
anxious                comunity 2, in 1, anxious 6, fantastic 1
apple                  NO WINNER 1, eat 1, i me 1, apple 7
argue                  angry 1, argue 9
ask                    hearing 1, deaf 1, ask 8
ball                   meet 2, ball 8
banana                 about 6, in 2, banana 2
bat                    dont want 1, here 1, bat 8
before                 before 8, last week 2
believe                believe 4, last week 1, drunk 1, fantastic 4
black                  deaf 1, apple 1, black 8
box                    how 1, box 9
british                ill 3, best 1, british 6
but                    but 9, bad 1
cant                   cant 8, good 2
car                    car 7, cheque 1, live 2
clever                 NO WINNER 2, clever 8
coffee                 hearing 1, coffee 9
computer               later 1, difficult 2, best 3, computer 4
comunity               comunity 9, class 1
confidence             i me 1, confidence 9
deaf                   hearing 5, deaf 5
different              best 1, different 9
difficult              difficult 9, father 1
dont know              deaf 3, dont know 6, listen 1
eat                    eat 8, takecare 1, blind 1
excited interested     cheque 1, live 6, angry 1, dog 2
family                 NO WINNER 1, best 1, family 8
fast                   i me 2, fast 8
father                 best 3, father 7
friend                 difficult 2, friend 7, english 1
good                   here 2, good 5, bad 1, bat 2
handsome               NO WINNER 1, handsome 9
hearing                hearing 9, ask 1
here                   i me 2, here 8


house                  ill 1, house 9
know                   know 7, deaf 1, listen 2
language               language 9, level 1
last week              green 1, last week 9
later                  later 8, in 2
laugh                  coffee 8, apple 2
learn                  deaf 1, learn 9
like                   like 7, i me 3
listen                 deaf 1, listen 9
live                   angry 1, live 9
love                   bottle 1, love 9
man                    i me 1, man 8, fantastic 1
maybe                  here 1, maybe 8, from 1
take care              lady 1, coffee 9


Table D.11: Unknown Start — Position and Contour are integrated.

Gesture                Recognized Signs and Number of Hits

age                    deaf 1, can 1, age 8
america                NO WINNER 4, america 5, how 1
apple                  can 1, apple 9
ball                   meet 4, ball 6
banana                 about 1, NO WINNER 2, best 1, in 1, banana 5
bat                    here 2, bat 8
before                 before 9, last week 1
believe                NO WINNER 1, believe 4, deaf 1, fantastic 4
black                  eat 1, takecare 1, deaf 1, can 1, blind 1, black 5
blind                  can 1, blind 9
blue                   green 1, best 1, blue 8
box                    how 1, box 9
british                ill 2, best 4, british 4
busy                   NO WINNER 1, busy 9
but                    but 9, bad 1
cant                   dont want 2, NO WINNER 1, cant 6, deaf 1
can                    apple 3, can 7
car                    car 8, live 1, cheque 1
cheque                 angry 1, cheque 9
class                  class 9, meet 1
clever                 laugh 1, NO WINNER 1, clever 8
coffee                 hearing 1, coffee 8, deaf 1
computer               best 8, father 1, how 1
comunity               comunity 8, in 2
confidence             confidence 7, i me 2, can 1
continue               NO WINNER 5, continue 5
deaf                   hearing 5, deaf 5
difficult              difficult 8, best 2
dont know              deaf 2, dont know 6, ask 1, how 1
dont want              dont want 8, good 2
eat                    lady 1, eat 7, deaf 1, bat 1
english                how 2, english 8
excited interested     excited interested 2, cheque 1, live 6, angry 1
family                 NO WINNER 8, best 1, family 1
fantastic              NO WINNER 1, fantastic 9
fast                   i me 3, fantastic 1, fast 6
father                 about 1, father 9


forget                 forget 6, but 2, ask 1, bad 1
friend                 friend 9, how 1
from                   bat 1, from 9
good                   here 2, good 6, bad 1, from 1
green                  i me 2, green 8
guilty                 i me 3, guilty 7
handsome               handsome 9, deaf 1
hearing                hearing 9, deaf 1
here                   i me 1, here 9
house                  NO WINNER 2, ill 5, house 2, dog 1
ill                    ill 9, best 1
know                   know 8, deaf 2
lady                   lady 8, coffee 2
language               NO WINNER 5, language 5
last week              green 2, last week 8
later                  later 8, about 1, in 1
laugh                  laugh 7, can 2, age 1
level                  book 1, NO WINNER 7, best 1, level 1
like                   like 5, i me 3, guilty 2
listen                 deaf 1, listen 9
live                   angry 1, live 9
love                   NO WINNER 3, love 7
many                   many 7, best 3
man                    i me 3, man 7
maybe                  here 4, maybe 5, from 1
take care              takecare 2, lady 4, coffee 1, can 3


List of Figures

1.1 Trajectory variations
1.2 Sign Language Recognition Work Flow

2.1 BSL

3.1 System Architecture
3.2 MAS Objects
3.3 Design of an Agent

4.1 Tracking Sequence
4.2 Applied Agents
4.3 Democratic Integration
4.4 Meet gesture in BSL
4.5 Input images

5.1 Face detection using EGM
5.2 Construction of a bunch graph
5.3 BSL sign “different”
5.4 Bunch Graph with Background
5.5 Automatic model graph extraction
5.6 Shape Context

6.1 4 State Markov Chain
6.2 Bakis Model
6.3 HMM architectures
6.4 Estimation of HMM Parameters
6.5 Recognition Agent
6.6 Information processing
6.7 Rewarding Function ϱ
6.8 Recognition Modules

7.1 End of Sign Detection


7.2 BSL Trajectories
7.3 Attention Agent
7.4 BSL Recognition Rates
7.5 Similar Signs
7.6 Different Signer Detection “difficult”
7.7 Different Signer Detection “fast”

A.1 Democratic Integration


List of Tables

7.1 HMM Recognition Agent Parameters
7.2 Mean Recognition Rates
7.3 Recognition Results Known Start
7.4 Recognition Results Unknown Start

B.1 Face Detection
B.2 Hand Posture Classification

C.1 BSL Sign Database

D.1 Known Start
D.3 Known Start
D.5 Known Start
D.7 Unknown Start
D.9 Unknown Start
D.11 Unknown Start


Bibliography

Akyol, S. Nicht-intrusive Erkennung isolierter Gesten und Gebärden. PhD Thesis, Aachen, Techn. Hochsch., 2003.

Barczak, A.L.C. and Dadgostar, F. Real-time hand tracking using a set of cooperative classifiers based on Haar-like features. Research Letters in the Information and Mathematical Sciences, 5: 29–42, 2005.

Bauer, B. and Kraiss, K.-F. Video-based sign recognition using self-organizing subunits. In ICPR (2), 434–437. 2002.

Baum, L. E., Petrie, T., Soules, G., and Weiss, N. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1): 164–171, 1970.

Becker, M., Kefalea, E., Maël, E., von der Malsburg, C., Pagel, M., Triesch, J., Vorbrüggen, J.C., Würtz, R.P., and Zadel, S. GripSee: A gesture-controlled robot for object perception and manipulation. Autonomous Robots, 6(2): 203–221, 1999.

Belongie, S., Malik, J., and Puzicha, J. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4): 509–522, 2002.

Bigus, J. Intelligente Agenten mit Java programmieren. eCommerce und Informationsrecherche automatisieren. Addison-Wesley, 2001.

Bowden, R., Windridge, D., Kadir, T., Zisserman, A., and Brady, M. A linguistic feature vector for the visual interpretation of sign language. In ECCV (1), 390–401. 2004.

Bowden, R., Zisserman, A., Kadir, T., and Brady, M. Vision based interpretation of natural sign languages. 2003.


Bunke, H., Roth, M., and Schukat-Talamazzini, E.G. Off-line Cursive Handwriting Recognition Using Hidden Markov Models. Pattern Recognition, 28(9): 1399–1413, 1995.

Buschmann, F., Meunier, R., Rohnert, H., Sommerlad, P., and Stal, M. Pattern-Oriented Software Architecture, Volume 1: A System of Patterns. John Wiley & Sons, 1996.

Comon, P. Independent component analysis, a new concept? Signal Process., 36(3): 287–314, 1994. ISSN 0165-1684.

Derpanis, K. G. A review of vision-based hand gestures. Technical report, Centre for Vision Research, York University (Canada), 2004.

Eickeler, S., Kosmala, A., and Rigoll, G. Hidden Markov Model Based Continuous Online Gesture Recognition. In Int. Conference on Pattern Recognition (ICPR), 1206–1208. Brisbane, 1998.

Eickeler, S. and Rigoll, G. Continuous online gesture recognition based on hidden Markov models. Technical report, Faculty of Electrical Engineering - Computer Science, Gerhard-Mercator-University Duisburg, 1998.

Ferber, J. Multi-Agent Systems. Addison Wesley, 1999.

Franklin, S. and Graesser, A. Is it an agent, or just a program?: A taxonomy for autonomous agents. In ECAI ’96: Proceedings of the Workshop on Intelligent Agents III, Agent Theories, Architectures, and Languages, 21–35. Springer-Verlag, London, UK, 1997. ISBN 3-540-62507-0.

Gallese, V. and Goldman, A. Mirror neurons and the simulation theory of mind-reading. Trends in Cognitive Sciences, 2(12), 1998.

Gray, R. M. Vector quantization. In A. Waibel and K.-F. Lee, editors, Readings in Speech Recognition, 75–100. Kaufmann, San Mateo, CA, 1990.

Grobel, K. and Assan, M. Isolated sign language recognition using hidden Markov models. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 1, 162–167. 1997.

HamNoSys. http://www.sign-lang.uni-hamburg.de/projekte/hamnosys. URL http://www.sign-lang.uni-hamburg.de/Projekte/HamNoSys, 2007. [Online; accessed 21-February-2007].


Hienz, H., Bauer, B., and Kraiss, K.-F. HMM-based continuous sign language recognition using stochastic grammars. In GW ’99: Proceedings of the International Gesture Workshop on Gesture-Based Communication in Human-Computer Interaction, 185–196. Springer-Verlag, London, UK, 1999. ISBN 3-540-66935-3.

Horn, W. Detection of contours and their fusion with texture information for object recognition. PhD Thesis, Universität Bochum, 2007.

Iacoboni, M., Molnar-Szakacs, I., Gallese, V., Buccino, G., Mazziotta, J. C., and Rizzolatti, G. Grasping the intentions of others with one’s own mirror neuron system. PLoS Biol, 3(3), 2005. ISSN 1545-7885.

Johansson, G. Visual perception of biological motion and a model for its analysis. Perception and Psychophysics, 14(2): 201–211, 1973.

Jones, J. P. and Palmer, L. A. An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J Neurophysiol, 58(6): 1233–1258, 1987.

Jones, M.J. and Rehg, J.M. Statistical color models with application to skin detection. International Journal of Computer Vision, 46(1): 81–96, 2002.

Kadir, T., Bowden, R., Ong, E. J., and Zisserman, A. Minimal training, large lexicon, unconstrained sign language recognition. In Proceedings of the 15th British Machine Vision Conference, Kingston. 2004.

Kähler, O. and Denzler, J. Self-organizing and adaptive data fusion for 3d object tracking. In U. Brinkschulte, J. Becker, D. Fey, C. Hochberger, T. Martinetz, C. Müller-Schloer, H. Schmeck, T. Ungerer, and R. Würtz, editors, ARCS 2005 – System Aspects in Organic and Pervasive Computing – Workshops Proceedings, Innsbruck, Austria, March 14–17, 109–116. VDE Verlag, Berlin, Offenbach, 2005.

Keysers, C., Kohler, E., Umiltà, M., Nanetti, L., Fogassi, L., and Gallese, V. Audiovisual mirror neurons and action recognition. Exp Brain Res, 153: 628–636, 2003.

Kraiss, K.-F. Advanced man machine interaction. Springer, Berlin [u.a.], 2006. ISBN 3-540-30618-8.


Krüger, M., Schäfer, A., Tewes, A., and Würtz, R.P. Communicating agents architecture with applications in multimodal human computer interaction. In Peter Dadam and Manfred Reichert, editors, Informatik 2004, 2, 641–645. Gesellschaft für Informatik, 2004.

Lades, M., Vorbrüggen, J.C., Buhmann, J., Lange, J., von der Malsburg, C., Würtz, R.P., and Konen, W. Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42: 300–311, 1993.

Lee, H.-K. and Kim, J.H. An HMM-based threshold model approach for gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(10): 961–973, 1999.

Liu, X. and Chua, C.-S. Multi-agent activity recognition using observation decomposed hidden Markov models. Image Vision Comput., 24(2): 166–175, 2006.

Loos, H.S. and von der Malsburg, C. 1-Click Learning of Object Models for Recognition. In H.H. Bülthoff, S.-W. Lee, T.A. Poggio, and C. Wallraven, editors, Biologically Motivated Computer Vision 2002 (BMCV 2002), 2525 of Lecture Notes in Computer Science, 377–386. Springer Verlag, Tübingen, Germany, 2002.

Messer, K., Kittler, J., Sadeghi, M., Hamouz, M., Kostin, A., Cardinaux, F., Marcel, S., Bengio, S., Sanderson, C., Poh, N., Rodriguez, Y., Czyz, J., Vandendorpe, L., McCool, C., Lowther, S., Sridharan, S., Chandran, V., Palacios, R.P., Vidal, E., Bai, L., Shen, L., Wang, Y., Yueh-Hsuan, C., Hsien-Chang, L., Yi-Ping, H., Heinrichs, A., Müller, M., Tewes, A., von der Malsburg, C., Würtz, R.P., Wang, Z., Xue, F., Ma, Y., Yang, Q., Fang, C., Ding, X., Lucey, S., Goss, R., and Schneiderman, H. Face authentication test on the BANCA database. In Proceedings of ICPR 2004, Cambridge, 4, 523–532. 2004.

Meyn, S. and Tweedie, R. Markov Chains and Stochastic Stability. 1993.

Moeslund, T.B. and Granum, E. A survey of computer vision-based human motion capture. Computer Vision and Image Understanding: CVIU, 81(3): 231–268, 2001.

Müller-Schloer, C., von der Malsburg, C., and Würtz, R.P. Aktuelles Schlagwort: Organic Computing. Informatik Spektrum, 27(4): 332–336, 2004.


Nikraz, M., Caire, G., and Bahri, P.A. A methodology for the development of multi-agent systems using the JADE platform. Comput. Syst. Sci. Eng., 21(2), 2006.

Ong, E. and Bowden, R. A boosted classifier tree for hand shape detection. 2004.

Ong, S. C. W. and Ranganath, S. Automatic sign language analysis: A survey and the future beyond lexical meaning. IEEE Trans. Pattern Anal. Mach. Intell., 27(6): 873–891, 2005.

Pavlovic, V., Sharma, R., and Huang, T.S. Visual interpretation of hand gestures for human-computer interaction: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7): 677–695, 1997.

Phillips, P.J., Grother, P., Micheals, R., Blackburn, D.M., Tabassi, E., and Bone, M. Face recognition vendor test 2002. In AMFG ’03: Proceedings of the IEEE International Workshop on Analysis and Modeling of Faces and Gestures, page 44. IEEE Computer Society, Washington, DC, USA, 2003. ISBN 0-7695-2010-3.

Phillips, P.J., Moon, H.J., Rizvi, S.A., and Rauss, P.J. The FERET evaluation methodology for face-recognition algorithms. 22(10): 1090–1104, 2000.

Pollen, D.A. and Ronner, S.F. Phase relationships between adjacent simple cells in the visual cortex. Science, 212(4501): 1409–1411, 1981.

Rabiner, L. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2): 257–286, 1989. URL http://www.cs.berkeley.edu/~murphyk/Bayes/rabiner.pdf.

Rabiner, L. R. and Juang, B. H. Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice Hall, 1993.

RAD. Royal Association for Deaf People. URL http://www.royaldeaf.org.uk/, 2007. [Online; accessed 21-February-2007].

Rigoll, G., Kosmala, A., and Schuster, M. A New Approach to Video Sequence Recognition Based on Statistical Methods. In IEEE Int. Conference on Image Processing (ICIP), 839–842. Lausanne, 1996.


Rinne, M., Pötzsch, M., Eckes, C., and von der Malsburg, C. Designing Objects for Computer Vision: The Backbone of the Library FLAVOR. Internal Report IRINI 99-08, Institut für Neuroinformatik, Ruhr-Universität Bochum, 1999.

Rizvi, S. A., Phillips, P. J., and Moon, H. The FERET verification testing protocol for face recognition algorithms. In FG ’98: Proceedings of the 3rd International Conference on Face & Gesture Recognition, page 48. IEEE Computer Society, Washington, DC, USA, 1998. ISBN 0-8186-8344-9.

Rizzolatti, G. and Craighero, L. The mirror-neuron system. Annual Review of Neuroscience, 27(1): 169–192, 2004.

Schäfer, A. Versuche zum Lernen artikulierter Objekte aus Bildsequenzen. Shaker Verlag, Germany, 2006.

Schmidt, N.T. Konturmetrik als modellbasierter Term in der NURBS-Snakes Zielfunktion. Master’s Thesis, Universität Dortmund, 2006.

Schukat-Talamazzini, E. Automatische Spracherkennung - Grundlagen, statistische Modelle und effiziente Algorithmen. Vieweg, Braunschweig, 1995.

Shet, V. D., Prasad, V.S.N., Elgammal, A.M., Yacoob, Y., and Davis, L.S. Multi-cue exemplar-based nonparametric model for gesture recognition. In ICVGIP, 656–662. 2004.

signwriting.org. http://www.signwriting.org. URL http://www.signwriting.org, 2007. [Online; accessed 21-February-2007].

Spengler, M. and Schiele, B. Towards robust multi-cue integration for visual tracking. In ICVS, 93–106. 2001.

Starner, T. and Pentland, A. Visual recognition of American Sign Language using hidden Markov models. In International Workshop on Automatic Face and Gesture Recognition, 189–194. 1995.

Starner, T., Weaver, J., and Pentland, A. Real-time American Sign Language recognition using desk and wearable computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12): 1371–1375, 1998.


Steinhage, A. Tracking human hand movements by fusing early visual cues. In R. Würtz and M. Lappe, editors, Proceedings of the 4th Workshop On Dynamic Perception, 139–144. Akademische Verlagsgesellschaft GmbH, Berlin, 2002.

Stokoe, W.C., Jr. Sign Language Structure: An Outline of the Visual Communication Systems of the American Deaf. J. Deaf Stud. Deaf Educ., 10(1): 3–37, 2005.

Tanibata, N., Shimada, N., and Shirai, Y. Extraction of hand features for recognition of sign language words. 2002.

Tewes, A. A Flexible Object Model for Encoding and Matching Human Faces. Shaker Verlag, Germany, 2006.

Tewes, A., Würtz, R.P., and von der Malsburg, C. A flexible object model for recognising and synthesising facial expressions. In Takeo Kanade, Nalini Ratha, and Anil Jain, editors, Proceedings of the International Conference on Audio- and Video-based Biometric Person Authentication, LNCS, 81–90. Springer, 2005.

Titterington, D., Smith, A., and Makov, U. Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons, 1985.

Triesch, J. and von der Malsburg, C. Democratic integration: Self-organized integration of adaptive cues. Neural Computation, 13(9): 2049–2074, 2001a.

Triesch, J. and von der Malsburg, C. A system for person-independent hand posture recognition against complex backgrounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(12): 1449–1453, 2001b.

Triesch, J. and von der Malsburg, C. Classification of hand postures against complex backgrounds using elastic graph matching. Image and Vision Computing, 20(13): 937–943, 2002.

Ulomek, F.K. Klassifikation statischer Handgesten. Master’s Thesis, Physics Dept., Univ. of Bochum, Germany, 2007.

Umiltà, M.A., Kohler, E., Gallese, V., Fogassi, L., Fadiga, L., Keysers, C., and Rizzolatti, G. I know what you are doing: a neurophysiological study. Neuron, 31: 155–165, 2001.


Viterbi, A. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2): 260–269, 1967.

Vogler, C. and Metaxas, D.N. Adapting hidden Markov models for ASL recognition by using three-dimensional computer vision methods. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 156–161. 1997.

Vogler, C. and Metaxas, D.N. Parallel hidden Markov models for American Sign Language recognition. In ICCV (1), 116–122. 1999.

Vogler, C. and Metaxas, D.N. A framework for recognizing the simultaneous aspects of American Sign Language. Comput. Vis. Image Underst., 81(3): 358–384, 2001. ISSN 1077-3142.

Vogler, C. and Metaxas, D.N. Handshapes and movements: Multiple-channel American Sign Language recognition. In Gesture Workshop, 247–258. 2003.

von Agris, U., Schneider, D., Zieren, J., and Kraiss, K.-F. Rapid signer adaptation for isolated sign language recognition. In CVPRW ’06: Proceedings of the 2006 Conference on Computer Vision and Pattern Recognition Workshop, page 159. IEEE Computer Society, Washington, DC, USA, 2006.

von der Malsburg, C. Vision as an exercise in Organic Computing. In Peter Dadam and Manfred Reichert, editors, Informatik 2004, 2, 631–635. 2004.

Wikipedia. Software agent — Wikipedia, the free encyclopedia. 2007. [Online; accessed 21-February-2007].

Wiskott, L. Labeled Graphs and Dynamic Link Matching for Face Recognition and Scene Analysis, 53. Verlag Harri Deutsch, Thun, Frankfurt am Main, 1995.

Wiskott, L., Fellous, J.-M., Krüger, N., and von der Malsburg, C. Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7): 775–779, 1997.

Wooldridge, M. Introduction to MultiAgent Systems. John Wiley and Sons, 2002. ISBN 047149691X.


Wu, Y., Lin, J., and Huang, T.S. Analyzing and capturing articulated hand motion in image sequences. 27(12): 1910–1922, 2005.

Würtz, R.P. Multilayer Dynamic Link Networks for Establishing Image Point Correspondences and Visual Object Recognition, 41. Verlag Harri Deutsch, Thun, Frankfurt am Main, 1995.

Würtz, R.P. Organic Computing methods for face recognition. it – Information Technology, 47(4): 207–211, 2005.

Zahedi, M., Keysers, D., and Ney, H. Appearance-based recognition of words in American Sign Language. 511–519. 2005.

Zieren, J. and Kraiss, K.-F. Robust person-independent visual sign language recognition. In IbPRIA (1), 520–528. 2005.


Curriculum Vitae

Personal Data

Name                Maximilian Krüger

Date of birth       16 July 1974

Place of birth      Düsseldorf

Address             Hermannstraße 35, 44791 Bochum

Telephone           0234-3888939, 0234/32-25559

E-Mail              [email protected]

School Education

1981 — 1987         Renee-Sintenis-Grundschule, Berlin

1987 — 1992         Georg-Herwegh-Oberschule, Berlin

1992 — 1994         Gymnasium Blankenese, Hamburg

1994                Abitur

University Studies

1994 — 2001         Studies of geophysics at the Universität Hamburg

                    Diploma thesis: “Amplitude Versus Offset Untersuchungen von hochauflösenden seismischen Daten vom Bengal Schelf” (amplitude-versus-offset studies of high-resolution seismic data from the Bengal shelf), supervised by Prof. Dr. D. Gajewski


Professional Experience

1995 — 2000         Student research assistant at the Institut für Geophysik, Universität Hamburg

2000 — 2001         Geophysicist at Petrologic Geophysical Services GmbH

2002 —              Research associate at the Institut für Neuroinformatik, Lehrstuhl Systembiophysik of Prof. Dr. Christoph von der Malsburg, Ruhr-Universität Bochum