Application Notes

Gesture Recognition Based on Localist Attractor Networks with Application to Robot Control

Rui Yan, Keng Peng Tee, Yuanwei Chua, Haizhou Li, and Huajin Tang
Institute for Infocomm Research, SINGAPORE

IEEE Computational Intelligence Magazine, February 2012. Digital Object Identifier 10.1109/MCI.2011.2176767. Date of publication: 17 January 2012.

I. Introduction

In the past, the study of robotics predominantly focused on topics such as manipulation and navigation, while few robotic systems were designed with flexible user interfaces for interacting with people. Recently, the realization of social robots that can work in human environments has become an important research area, due to their potential for enhancing human well-being and quality of life [1]–[5] in terms of service, health care, nursing, and entertainment.

Since many social robots will be operated by non-expert users, it is essential for them to be equipped with natural interfaces for interaction with humans. One possible avenue is the use of gestures, a natural means of human communication. The arms become a comfortable, natural input and eliminate the need for mechanical input devices. Using gestures as an input to control a robot is not a new idea. Recently, researchers have studied the recognition of Tai Chi gestures [6], interaction with virtual agents [7], and the manipulation of objects on a table [8]. There are also a few works that used gestures to instruct a mobile robot [9]–[11].

Human gestures can be captured using a variety of hardware devices, such as the DataGlove [12]–[14], a mouse input device, a magnetic spatial position and orientation sensor, and a video camera [15], [16]. Compared to devices that require wearing a glove or sensor linked to the computer, using a video camera as the gesture input device can increase the user's autonomy. The drawback, however, is that the user needs to face the robot, and the system has difficulty tracking people in crowds.

One of the main difficulties in designing human-robot interfaces lies in letting the robot know the proper amount of human characteristics. Gesture signals are characterized by high dimensionality, nonlinear attributes, time variance, and susceptibility to environmental noise and disturbance. Intelligent learning techniques can be effectively utilized to handle the complexity and time variance of human characteristics [4]. A few works realize gesture recognition using intelligent learning approaches, including neural networks [17], the Hidden Markov Model (HMM) [6], and the Fuzzy Min-Max Neural Network (FMMNN) [18]–[20].

Given the challenges in human gesture recognition, it is desirable to develop a method that is highly accurate and robust based on a minimal set of examples or only a few stereotypes. It is also desirable that the recognition method not require a large number of training examples, so that users can flexibly define gestures according to the specifications of the environment or their own preferences.

The classical pattern recognition algorithms can be broadly categorized into two distinct classes: on one hand are approaches based on classification boundaries, such as the multilayer perceptron (MLP) and the support vector machine (SVM); on the other hand are prototype-based approaches, including K-nearest neighbor (KNN), learning vector quantization (LVQ), the self-organizing map (SOM), and adaptive resonance theory (ART). Algorithms based on classification boundaries generally require a large training set for the networks to learn an appropriate classification boundary. Though some improvements have been made to alleviate the dependence on the training set, a large training set is still preferred to achieve high accuracy. Thus, these methods are limited to offline recognition.

For prototype-based classifier algorithms, KNN relies on measuring the distance between the training vectors and their prototypes [21], and LVQ is a popular class of adaptive nearest-neighbor classifiers [22]. The SOM classifier operates in two modes, training and mapping: training builds a map that preserves the topological properties of the input space of the training prototypes via a competitive process, and mapping automatically classifies a new input vector [23]. ART is a two-layer structured network [24] in which the first layer performs a characterization process by extracting features, giving rise to activation in the feature representation field. The second layer performs a categorization process, associating the input pattern with a prototype of the category representation field via the connection weights between the two layers. A threshold parameter called vigilance is employed to determine whether the input pattern matches the prototype. A concrete example of an ART-based classifier for pattern classification can be found in [25]. Differing from methods based on classification boundaries, many prototype-based classifier algorithms do not require a large number of training examples and can be designed for online learning.

After decades of development, the aforementioned pattern recognition methods for gesture recognition are well established. However, in the context of pattern classification applications, the main difficulty is that the learning complexity grows exponentially with the dimensionality of the data. Although reducing the dimensionality is possible during feature extraction, if incomplete or erroneous features are extracted, the classification process is inherently limited in performance. An overview of deep learning approaches and research directions, which use new ideas associated with cognitive abilities for information representation, is provided in [26]. Furthermore, human cognition has been widely studied and applied to different fields; for example, a cognitively inspired game-playing framework was presented in [27] to develop cognitively plausible, pattern-based, human-like playing systems. In this paper, we are motivated to apply attractor networks to achieve cognitive memory properties. The attractor networks in [28]–[33] are memory based and can be formulated with auto-associative attractor dynamics to provide pattern recognition. In such a network, each attractor represents an individual stereotypical pattern. Before the start of the control tasks, a few sets of gesture motions covering all patterns are demonstrated to obtain the attractors. When a new gesture is given, the network activity converges to the attractor that is similar to the given gesture.

Similar to the above-mentioned prototype-based classifier algorithms, attractor networks also employ prototype patterns, which are stored in the network as memory patterns, and can realize online recognition. However, they do not require a distance measure to calculate the similarity between the input patterns and the stored patterns. Categorization, which assigns the input pattern to one of the stored patterns, is realized through a dynamic convergence process governed by the attractor dynamics.

In most attractor networks, knowledge of the stereotypes is globally distributed in the network connection weights, and there is no existing mechanism to design the attractors directly, i.e., there are no existing procedures that can robustly translate a specification of attractors into a set of weights [30], [31]. To overcome this limitation, a very practical model called the localist attractor network (LAN) was proposed in [30]. LAN avoids the translation procedure and directly uses the attractor information. LAN provides a unique combination of advantages: (i) the wiring of the architecture to any given attractor landscape is achieved by a well-defined mathematical model; and (ii) LAN has cognitive memory properties that are highly applicable in social robotics.

In this work, an online gesture recognition system is presented. The system first records the human gesture signals using sensors and extracts gesture features by an inverse kinematics mapping followed by a fast Fourier transform. The features are then memorized, and gesture recognition is realized with the help of localist attractor networks. LAN gesture recognition only requires a small amount of training data to memorize gesture patterns: users only need to demonstrate all patterns of the pre-defined gestural commands a few times before operating the control tasks. In the experiment, the proposed gesture recognition system is applied to control a robot. The remainder of the paper is organized as follows. Section II introduces the formulation of localist attractor networks. Our gesture recognition system is described in Section III. Experimental results are demonstrated in Section IV. Discussions and brief concluding remarks are given in Sections V and VI.

II. Formulation of Localist Attractor Networks

In this section, we present the LAN model [30] and its key advantages. The LAN model provides a simple procedure to wire up the architecture, since the model parameters have a clear mathematical interpretation that lets readers easily understand how the parameters control the qualitative behavior of the model. Additionally, spurious attractors can be eliminated, and the convergence and stability of the model can be proved.

Assume a localist attractor network consists of a set of $n$ state units and $m$ attractor units. Let $w_i$ be attractor unit $i$, representing the center of its attractor basin in state space, as shown in Fig. 1.

Denote by $\pi_i$ the connection weight (also called the pull or strength) of attractor unit $w_i$, and by $\sigma_i$ the width of its attractor basin. The selection of $\pi_i$ is based on a specification of the desired behavior of the net, while the value of $\sigma_i$ can be estimated from $w_i$ and the state space of its attractor basin. The activity of an attractor at time $t$, $q_i(t)$, reflects the normalized distance from its center to the current state $y(t)$:

$$q_i(t) = \frac{\pi_i\, g(y(t), w_i, \sigma_i)}{s}, \qquad g(y, w, \sigma) = \exp\!\left(-\frac{|y - w|^2}{2\sigma^2}\right), \qquad s = \sum_j \pi_j\, g(y(t), w_j, \sigma_j). \tag{1}$$

The structure of $q_i(t)$ is given in Fig. 2. It is obvious that the attractors form a layer of normalized radial-basis-function units.

For an initial value of the state $\varepsilon$, the state is pulled toward the attractors in proportion to their activity according to the following expression:

$$y(t+1) = \alpha(t)\,\varepsilon + \big(1 - \alpha(t)\big) \sum_i q_i(t)\, w_i, \tag{2}$$

where $\alpha$ trades off the external input and attractor influences. The external input $\varepsilon$ can be taken as the observation. A method is provided in [30] to reduce $\alpha$ gradually over time to achieve a persistent effect of the observation on the asymptotic state. $\alpha$ can be defined as follows:

$$\alpha(t) = \frac{\sigma_y^2(t)}{\sigma_y^2(t) + \sigma_z^2}, \tag{3}$$

where the attractor width $\sigma_y$ is not fixed, but is dynamic and model-determined:

$$\sigma_y^2(t) = \frac{1}{n} \sum_i q_i(t)\, |y(t) - w_i|^2. \tag{4}$$

In addition, a fixed parameter $\sigma_z$ is applied to describe the unreliability of the observation. If $\sigma_z = 0$, then the observation is wholly reliable, i.e., $y = \varepsilon$. The main architecture of the LAN network is shown in Fig. 3. In this architecture, $q_i(t)$, $i = 1, 2, \ldots, m$, can be calculated from Equation (1), and the network structure of $q_i(t)$ is shown in Fig. 2.

Remark 1: After selecting the parameters $w_i$, $\pi_i$, $\sigma_i$, and $\sigma_y$ in the LAN, the only remaining free parameter $\sigma_z$ determines how responsive the network is to the external input $\varepsilon$. For a general input signal that is not severely corrupted, we can choose a small value of $\sigma_z$ from the region $(0.05, 0.2)$. However, for a severely corrupted input, experiments in [30] discuss how the value of $\sigma_z$ affects two types of responses, spurious and adulterous. Spurious responses are those in which the final state $y(t)$, $t \to \infty$, in (2) does not converge to any attractor, while adulterous responses are those in which the state converges to a neighboring attractor instead of the correct one. The results demonstrate that the number of spurious responses increases as $\sigma_z$ decreases, while at the same time the rate of adulterous responses decreases. This implies that, for a severely corrupted input, a suitable value of $\sigma_z$ is needed to achieve a good trade-off between the two effects.

FIGURE 1 Attractors in attractor basins of the state space.

FIGURE 2 The network structure of $q_i(t)$: radial-basis units $g(\cdot)$ driven by the squared distances $|y(t) - w_i|^2$, weighted by the pulls $\pi_i$ and normalized by $s$.

FIGURE 3 The LAN network architecture: the input $\varepsilon$ and the attractor-weighted state $\sum_i q_i(t) w_i$ are combined through the gains $\alpha(t)$ and $1 - \alpha(t)$ to produce $y(t+1)$.

Remark 2: The motivation of the localist attractor is based on a generative model of the observed input given the underlying attractor distribution. The above mathematical interpretation of the model parameters stems from a convergence proof corresponding to a search for a maximum-likelihood interpretation of the observation [30]. LAN is formulated in a generative framework on a statistical basis to derive the state update law. The objective of the network is to minimize the free energy $F$, described for a particular observation $\varepsilon$ and the priors of the parameters $\sigma_y$, $\sigma_z$, and $w_i$:

$$F(q, y \mid \varepsilon) = \sum_i q_i \ln\frac{q_i}{\pi_i} + \frac{1}{2\sigma_z^2}\,|\varepsilon - y|^2 + \frac{1}{2\sigma_y^2} \sum_i q_i\,|y - w_i|^2 + n \ln(\sigma_y \sigma_z). \tag{5}$$

The update laws for the various parameters can be derived by minimizing $F$ with respect to each of them. If the priors $\pi_i$, $w_i$ and the generative noise in the observation $\sigma_z$ are constant, we obtain the update laws for $q_i$, $\alpha$, $\sigma_y$, and $y$ in (1)–(4) by minimizing $F$ with respect to these parameters.
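Since the update laws are just the stationarity conditions of $F$, they can be checked directly. As a sketch (using the common generative width $\sigma_y$ in place of the per-attractor $\sigma_i$ of (1)), setting $\partial F / \partial y = 0$ gives

$$\frac{1}{\sigma_z^2}\,(y - \varepsilon) + \frac{1}{\sigma_y^2} \sum_i q_i\,(y - w_i) = 0 \;\Longrightarrow\; y = \alpha\,\varepsilon + (1 - \alpha) \sum_i q_i w_i, \qquad \alpha = \frac{\sigma_y^2}{\sigma_y^2 + \sigma_z^2},$$

which is exactly (2)–(3); minimizing over $q$ subject to $\sum_i q_i = 1$ gives $q_i \propto \pi_i \exp(-|y - w_i|^2 / 2\sigma_y^2)$, the normalized form of (1); and $\partial F / \partial \sigma_y = 0$ yields $\sigma_y^2 = \frac{1}{n} \sum_i q_i |y - w_i|^2$, which is (4).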

We summarize the above computational procedure of LAN in the flow chart shown in Fig. 4. This LAN procedure will be used to realize human gesture recognition.

FIGURE 4 A flow chart of the LAN procedure to recognize human gestures: (1) choose the center $w_i$ ($i = 1, 2, \ldots, m$) of each attractor basin from a set of state units $\{X_1, X_2, \ldots, X_n\}$; (2) estimate the width $\sigma_i$ of the attractor basins; (3) set the connection weights $\pi_i$ and the unreliability parameter $\sigma_z$; (4) calculate $q_i$, $\sigma_y$, and $\alpha$ from Equations (1), (4), and (3); (5) iterate $y$ using Equation (2) until $y$ converges to one of the $w_i$, $i = 1, 2, \ldots, m$.
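To make the procedure concrete, the following minimal NumPy sketch implements the loop of Fig. 4. It is our own illustration, not the authors' Visual C++ implementation: the function name and defaults (e.g., sigma_z = 0.1) are assumptions taken from the experiments in Section IV, and Eq. (1) is evaluated in log space purely for numerical robustness.

```python
import numpy as np

def lan_recognize(eps, W, sigma, pi_, sigma_z=0.1, iters=100, tol=1e-9):
    """Iterate the LAN dynamics (1)-(4) from the observation `eps`.

    eps   : observation vector (e.g., a gesture feature C_f), shape (n,)
    W     : attractor centers w_i, shape (m, n)
    sigma : attractor widths sigma_i, shape (m,)
    pi_   : attractor pulls pi_i, shape (m,)
    Returns the index of the winning attractor and the final state.
    """
    y = eps.copy()                                     # initial state = observation
    n = y.size
    for _ in range(iters):
        d2 = np.sum((y - W) ** 2, axis=1)              # |y - w_i|^2 per attractor
        log_act = np.log(pi_) - d2 / (2.0 * sigma ** 2)
        log_act -= log_act.max()                       # stable evaluation of Eq. (1)
        q = np.exp(log_act)
        q /= q.sum()                                   # normalized activities q_i(t)
        sy2 = np.sum(q * d2) / n                       # Eq. (4): sigma_y^2(t)
        alpha = sy2 / (sy2 + sigma_z ** 2)             # Eq. (3): input/attractor trade-off
        y_new = alpha * eps + (1.0 - alpha) * (q @ W)  # Eq. (2): state update
        if np.linalg.norm(y_new - y) < tol:            # converged to an attractor
            y = y_new
            break
        y = y_new
    return int(np.argmin(np.sum((y - W) ** 2, axis=1))), y
```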

III. Gesture Recognition System

In this section, a gesture recognition system is developed with application to controlling a robot. First, users define the gestural commands; we then collect gesture signals from the sensor and extract gesture features. After that, the LAN procedure shown in Fig. 4 is used to recognize gestures. Finally, we realize gesture-based control of a robot from the recognition results. Details of the procedure are given in the following subsections.

A. Defining Gestural Commands

To use LAN for recognizing gestures, the attractors need to be determined first. The user is required to specify one or more gestures to obtain the data that will generate the attractors. Our system allows users to freely define a family of gestures that describe the available commands. Furthermore, to increase the usability of the system, a constraint on the gestural commands is that the start position of each gesture is similar to its end position. This enables users to issue commands smoothly, without being forced to hold their arms steady or stop between commands.

B. Data Acquisition

Human gestures can be captured using a variety of hardware devices. In this work, the human's gesture motion data are collected by a Measurand ShapeTape, an upper-body motion capture system comprising Arm Tape, thoracic orientation sensors, pelvic orientation sensors, and a data concentrator, as shown in Fig. 5. The Arm Tape is a fiber-optic-based 3D bend-and-twist sensor that measures arm movements, while inertial sensors measure torso movements. When a human user demonstrates a gesture motion, the motion data is captured at 80 Hz and consists of global 3D Cartesian position signals of the shoulders, elbows, and wrists, all with respect to a reference frame located at a point on the hip.

FIGURE 5 The ShapeTape configuration: Arm Tape, wrist harness, data concentrator, and thoracic and pelvic orientation sensors.

The collected gesture positions are defined by

$$\{X_t\}_{t=1}^{N} = \{X_{t,ls}, X_{t,le}, X_{t,lw}, X_{t,rs}, X_{t,re}, X_{t,rw}\}_{t=1}^{N},$$

which consists of six points. Positions of human gestures in Cartesian coordinates corresponding to the different joints are shown in Fig. 6. In the figure, $X_{t,ls} = (x_{t,ls}, y_{t,ls}, z_{t,ls})$, $X_{t,le} = (x_{t,le}, y_{t,le}, z_{t,le})$, and $X_{t,lw} = (x_{t,lw}, y_{t,lw}, z_{t,lw})$ correspond to the positions of the left shoulder, elbow, and wrist, while $X_{t,rs} = (x_{t,rs}, y_{t,rs}, z_{t,rs})$, $X_{t,re} = (x_{t,re}, y_{t,re}, z_{t,re})$, and $X_{t,rw} = (x_{t,rw}, y_{t,rw}, z_{t,rw})$ correspond to the positions of the right shoulder, elbow, and wrist. In the above equations, $t$ is the time elapsed from the beginning of the demonstration, and the position values are with respect to the torso frame.

C. Projection in a Latent Space

In order to find an optimal representation for the given gesture, we project the original postures from the task space to the joint space, i.e., project $\{X_t\}_{t=1}^{N}$ to

$$\{q_t\}_{t=1}^{N} = \{q_{t,l1}, q_{t,l2}, q_{t,l3}, q_{t,l4}, q_{t,r1}, q_{t,r2}, q_{t,r3}, q_{t,r4}\}_{t=1}^{N},$$

where $(q_{t,l1}, q_{t,l2}, q_{t,l3})$ are the pitch, yaw, and roll angles of the left shoulder, and $q_{t,l4}$ is the yaw angle of the left elbow. Similar definitions hold for $(q_{t,r1}, q_{t,r2}, q_{t,r3})$ and $q_{t,r4}$ (see Fig. 7).

The forward kinematics is represented by

$$X_t = f(q_t), \tag{6}$$

where $f(\cdot)$ is a nonlinear mapping from joint space to task space. The inverse kinematics problem is to solve for $q_t = f^{-1}(X_t)$. The inverse mapping $f^{-1}(\cdot)$, from the task space to the joint space, is nonlinear and one-to-many, since the dimension of the task space is higher than that of the joint space. Furthermore, it is ill-conditioned when the robot is near its singular configurations. Thus, we need a robust and smooth motion retargeting technique to solve the inverse mapping. In this work, singularity-robust modular inverse kinematics [34] is applied.

FIGURE 6 Positions of human gestures in Cartesian coordinates.

FIGURE 7 Joint angles of human gestures.

The algorithm is based on 2-DOF pan-tilt modules to obtain the joint angles. Each 4-DOF arm consists of two 2-DOF pan-tilt modules, based on the configuration shown in [34]. Fig. 8 illustrates a pan-tilt unit. Let $(x, y, z)$ be the endpoint position in the task space, expressed with respect to a local coordinate frame $(O_x, O_y, O_z)$ at the base of the pan-tilt joint. From the geometrical relationship, the pan angle $q_p$ and the tilt angle $q_t$ are given as follows:

$$q_p = \tan^{-1}\!\left(\frac{y}{x}\right), \qquad q_t = \tan^{-1}\!\left(\frac{-z}{x'}\right), \tag{7}$$

where

$$x' = x \cos q_p + y \sin q_p \tag{8}$$

is the x-coordinate of the new frame after the original frame is rotated by $q_p$ about the z-axis.

FIGURE 8 Pan-tilt module.
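As a small hedged sketch of Eqs. (7)–(8): the function below computes the two angles for a given endpoint. It is our illustration, not code from [34]; we use atan2 in place of the printed $\tan^{-1}$ ratio so the quadrant is resolved correctly.

```python
import math

def pan_tilt_angles(x, y, z):
    """Pan and tilt angles of a 2-DOF module for endpoint (x, y, z), per Eqs. (7)-(8)."""
    qp = math.atan2(y, x)                       # Eq. (7), pan angle
    xr = x * math.cos(qp) + y * math.sin(qp)    # Eq. (8), rotated x-coordinate
    qt = math.atan2(-z, xr)                     # Eq. (7), tilt angle
    return qp, qt
```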

For smooth motion, we solve the inverse kinematics in a modular fashion based on the derivative of the inverse tangent function of the relevant task-space variables. For robustness to kinematic singularity, we add a regularization parameter that vanishes whenever the task variables are outside a neighborhood of zero. Using the robust and smooth motion retargeting technique in [34], the rates of the pan angle $q_p$ and the tilt angle $q_t$ can be described as follows:

$$\dot{q}_p = \frac{x\,(\dot{y} - k e_y) - y\,(\dot{x} - k e_x)}{x^2 + y^2 + \epsilon(x, y)}, \tag{9}$$

$$\dot{q}_t = \frac{z\,(\dot{x}' - k e_{x'}) - x'\,(\dot{z} - k e_z)}{x'^2 + z^2}, \tag{10}$$

where $(x, y, z)$ is the estimated Cartesian position provided by Equation (6), as in [34], $e(d) = d - \hat{d}$ denotes the error of a task variable with respect to its estimate, and $\epsilon(x, y)$ is a time-varying regularization parameter defined as

$$\epsilon(x, y) = \begin{cases} 0, & x^2 + y^2 > b_2 \\ \epsilon_0, & x^2 + y^2 < b_1 \\ \epsilon_0 + a_1 (x^2 + y^2 - b_1)^3 + a_2 (x^2 + y^2 - b_1)^2, & b_1 \le x^2 + y^2 \le b_2 \end{cases} \tag{11}$$

where $b_1$, $b_2$, $\epsilon_0$ are small positive constants, and

$$a_1 = \frac{2\epsilon_0}{(b_2 - b_1)^3}, \qquad a_2 = \frac{-3\epsilon_0}{(b_2 - b_1)^2} \tag{12}$$

are obtained by considering boundary constraints when fitting a third-order polynomial between the two extremes of $\epsilon$.
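The regularization blend in (11)–(12) is easy to sanity-check in code. The sketch below is illustrative only: the function names, the default constants b1, b2, eps0, and the error inputs e_* are placeholders, not values from [34].

```python
def regularizer(x, y, b1=1e-4, b2=1e-2, eps0=1e-3):
    """Smooth cubic blend of Eqs. (11)-(12): eps0 near the singularity, 0 far away."""
    s = x ** 2 + y ** 2
    if s > b2:
        return 0.0
    if s < b1:
        return eps0
    a1 = 2.0 * eps0 / (b2 - b1) ** 3        # Eq. (12)
    a2 = -3.0 * eps0 / (b2 - b1) ** 2
    return eps0 + a1 * (s - b1) ** 3 + a2 * (s - b1) ** 2   # Eq. (11)

def pan_tilt_rates(x, y, z, xr, xdot, ydot, zdot, xrdot,
                   k=1.0, ex=0.0, ey=0.0, ez=0.0, exr=0.0):
    """Singularity-robust pan/tilt joint rates, Eqs. (9)-(10).

    xr and xrdot are the rotated x-coordinate of Eq. (8) and its rate;
    the gain k and errors e_* stand in for the feedback terms in [34].
    """
    qp_dot = (x * (ydot - k * ey) - y * (xdot - k * ex)) / (x**2 + y**2 + regularizer(x, y))
    qt_dot = (z * (xrdot - k * exr) - xr * (zdot - k * ez)) / (xr**2 + z**2)
    return qp_dot, qt_dot
```

Note how the cubic blend keeps the denominator of (9) bounded away from zero near $x = y = 0$ while leaving it untouched elsewhere, so the rates stay smooth across the switching boundaries $b_1$ and $b_2$.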

D. Extraction of Gesture Features

Fourier analysis is extremely useful for data analysis, as it breaks a signal down into constituent sinusoids of different frequencies. The Fourier transform gives a relationship between a signal in the time domain and its representation in the frequency domain. Being a transform, no information is created or lost in the process, so the original signal can be recovered from its Fourier transform, and vice versa. The fast Fourier transform (FFT) is a particularly useful algorithm for mapping data from the time domain to the frequency domain.

From the matrix

$$\{q_t\}_{t=1}^{N} = \{q_{t,l1}, q_{t,l2}, q_{t,l3}, q_{t,l4}, q_{t,r1}, q_{t,r2}, q_{t,r3}, q_{t,r4}\}_{t=1}^{N},$$

each column $\{q_{t,j}\}_{t=1}^{N}$, where $j = l1, l2, l3, l4, r1, r2, r3, r4$, represents the angle values of one joint at different times. Using the fast Fourier transform, the coefficients of the first $M$ lower frequencies of $\{q_{t,j}\}_{t=1}^{N}$ are obtained as

$$\{c_{k,j}\}_{k=1}^{2M} = [c_{1,j}, c_{2,j}, \ldots, c_{2M,j}].$$

Considering all joint values, the feature vector in the frequency domain becomes

$$C_f = \big[\{c_{k,l1}\}_{k=1}^{2M}, \{c_{k,l2}\}_{k=1}^{2M}, \ldots, \{c_{k,r4}\}_{k=1}^{2M}\big]^T.$$

E. Gesture Recognition Using Localist Attractor Networks (LAN)

In this subsection, we discuss the gesture recognition method using LAN. LAN gesture recognition, which has cognitive memory properties, only requires a small amount of training data and avoids a tedious training process.

Compared with static arm poses, dynamic gestures are more commonly used in human communication, and they provide additional freedom in designing gestural commands. Therefore, we employ dynamic gestures to control a robot. However, dynamic gestures are more difficult to handle, particularly in detecting the start and end points of a gesture. In this work, the gesture is detected using a velocity threshold: the gesture motion is considered to start when any joint velocity exceeds a certain threshold, and to end when the velocities stay below the threshold for a duration of time, as in the sketch below.
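A minimal sketch of this segmentation rule, with illustrative values only: the threshold and the hold duration (40 samples is 0.5 s at the 80 Hz sampling rate of Section III-B) are not from the paper.

```python
import numpy as np

def segment_gesture(qdot, thresh=0.2, hold=40):
    """qdot: joint velocities, shape (T, 8), e.g. sampled at 80 Hz.

    Returns (start, end) sample indices of the first detected gesture,
    or None if no gesture start is found.
    """
    active = np.max(np.abs(qdot), axis=1) > thresh   # any joint above threshold?
    start = None
    quiet = 0                                        # consecutive below-threshold samples
    for t, a in enumerate(active):
        if start is None:
            if a:
                start = t                            # motion begins: gesture start
        else:
            quiet = 0 if a else quiet + 1
            if quiet >= hold:                        # quiet long enough: gesture over
                return start, t - hold + 1           # end = first quiet sample
    return None if start is None else (start, len(active) - 1)
```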

Features of the defined gestural commands are obtained as a few vectors in the frequency domain after going through the above procedures of data acquisition, projection into the joint space, and the FFT. Assume that

$$\begin{aligned} &\{C_{f_{1,1}}, C_{f_{2,1}}, \ldots, C_{f_{i,1}}, \ldots, C_{f_{m,1}}\}, \\ &\{C_{f_{1,2}}, C_{f_{2,2}}, \ldots, C_{f_{i,2}}, \ldots, C_{f_{m,2}}\}, \\ &\qquad \vdots \\ &\{C_{f_{1,k}}, C_{f_{2,k}}, \ldots, C_{f_{i,k}}, \ldots, C_{f_{m,k}}\} \end{aligned} \tag{13}$$

represent $k$ sets of gesture features in the frequency domain corresponding to $m$ different stereotypical patterns. This means that we demonstrate the gesture motions $k$ times for each stereotypical pattern, so the total number of state units is $k \times m$. Thus

$$\begin{aligned} &\{C_{f_{1,1}}, C_{f_{1,2}}, \ldots, C_{f_{1,k}}\}, \\ &\qquad \vdots \\ &\{C_{f_{i,1}}, C_{f_{i,2}}, \ldots, C_{f_{i,k}}\}, \\ &\qquad \vdots \\ &\{C_{f_{m,1}}, C_{f_{m,2}}, \ldots, C_{f_{m,k}}\} \end{aligned} \tag{14}$$

are the attractor basins for the $m$ different stereotypical patterns. Now denote $\{C_{f_{1,c}}, C_{f_{2,c}}, \ldots, C_{f_{i,c}}, \ldots, C_{f_{m,c}}\}$ as the $m$ attractor units of the above $k$ sets of gesture features. The following Remark 3 explains a way of determining the attractors.

Remark 3: We choose the center of the attractor basin $\{C_{f_{1,1}}, C_{f_{1,2}}, \ldots, C_{f_{1,k}}\}$ as the first attractor $C_{f_{1,c}}$. Similarly, the centers of $\{C_{f_{i,1}}, \ldots, C_{f_{i,k}}\}$ and $\{C_{f_{m,1}}, \ldots, C_{f_{m,k}}\}$ are the $i$-th and $m$-th attractors, denoted by $C_{f_{i,c}}$ and $C_{f_{m,c}}$. By experiment, we find that all vectors in the attractor basin $\{C_{f_{i,1}}, C_{f_{i,2}}, \ldots, C_{f_{i,k}}\}$ always stay in a small region, which implies that the width $\sigma_i$ of this attractor basin is small. In the experiment, we also try demonstrating each gesture pattern only once. In this case, we can directly set $\{C_{f_{1,1}}, C_{f_{2,1}}, \ldots, C_{f_{i,1}}, \ldots, C_{f_{m,1}}\}$ as the $m$ attractors and choose a small value of $\sigma_i$ in the online gesture recognition system. The results demonstrate that the accuracy of the recognition results remains very high. This shortens the procedure of finding the centers of the attractor basins and also reduces the time needed to set up more gesture data.

Now set the attractor units $w_i = \{C_{f_{1,c}}, C_{f_{2,c}}, \ldots, C_{f_{i,c}}, \ldots, C_{f_{m,c}}\}$ in the LAN. Then, for a new vector $C_f$ denoting the feature of a new gesture, the LAN procedure shown in Fig. 4 can be used directly to find the pattern of the new gesture. The details of the LAN procedure involve the following steps:

Step 1. Decide $C_{f_{i,c}}$, $i = 1, 2, \ldots, m$, according to Remark 3, and set each $C_{f_{i,c}}$ as an attractor $w_i$ in the LAN network.

Step 2. Take $C_f$ as the current state $y$. It is obvious that $y$ is independent of the time $t$.

Step 3. Choose $y$ as the observation $\varepsilon$.

Step 4. Estimate the values of $\sigma_i$. From Remark 3, $\sigma_i$ is a small value and can be obtained from the attractor basins in (14). In the experiments, $\sigma_i \in (0.05, 0.15)$.

Step 5. Choose the values of $\pi_i$ and $\sigma_z$. The choice of $\pi_i$ is based on a specification of the desired behavior of the net, and the choice of $\sigma_z$ is discussed in Remark 1. In the experiments, we choose $\pi_i = 1$ and $\sigma_z$ in the region $(0.05, 0.2)$.

Step 6. Calculate the parameters $q_i$, $\sigma_y$, and $\alpha$ from Equations (1), (4), and (3).

Step 7. Iterate $y$ using Equation (2) until $y$ converges to one of the $w_i$, $i = 1, 2, \ldots, m$.

Thus, we realize the gesture recognition target using localist attractor networks.

In summary, the proposed gesture recognition system is illustrated in Fig. 9, and a sketch of how the pieces fit together is given below.

FIGURE 9 A flow chart of the gesture recognition system: a human gesture is sensed (measured data), projected into the latent space, transformed by feature extraction (transformed features), and recognized, producing the output; the system then starts on a new gesture.
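The following end-to-end sketch ties together the illustrative functions from the earlier sketches (gesture_features and lan_recognize). The synthetic stand-in data are assumptions for self-containment; real trajectories would come from the ShapeTape pipeline.

```python
import numpy as np
rng = np.random.default_rng(0)

# Synthetic stand-in data: k = 3 demos of m = 7 patterns, each an (N, 8)
# joint-angle trajectory.
m, k, N = 7, 3, 320
base = [rng.normal(size=(N, 8)) for _ in range(m)]
demos = [[b + 0.05 * rng.normal(size=(N, 8)) for _ in range(k)] for b in base]

# Steps 1-5: attractors = basin centers (Remark 3), small widths, unit pulls.
attractors = np.stack([
    np.mean([gesture_features(d) for d in pattern], axis=0)
    for pattern in demos
])
sigma = np.full(m, 0.1)   # Step 4: small basin widths
pulls = np.ones(m)        # Step 5: pi_i = 1

# Steps 6-7: classify a noisy repetition of pattern 4.
new_gesture = base[4] + 0.05 * rng.normal(size=(N, 8))
c_f = gesture_features(new_gesture)
winner, _ = lan_recognize(c_f, attractors, sigma, pulls, sigma_z=0.1)
print("Recognized pattern index:", winner)   # should recover index 4
```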

IV. Experiment

In this section, the proposed gesture recognition system is implemented in Visual C++ and the Webots software. Key concerns of the system include usability, as well as robustness to variations in individual gesture expressions. We show that the proposed system recognizes gestures and responds with corresponding actions in a realistic robotics task.

In this experiment, we define gestural commands for the robot to move in the forward, backward, left, and right directions, as well as to move faster, move slower, and stop. Measurand ShapeTape sensors acquire motion data as the user freely performs the gestural commands, which satisfy the constraints described in Subsection A of Section III. More specifically, the user performs the following gestures:

1. Direction: The 'Direction' gestures involve wave-like motions. For the right direction, starting from the initial arm position depicted in Fig. 10(a), the person moves the right arm to the position shown in Fig. 10(b), holds it there for a brief moment, and returns to the initial arm position. The gesture for the left direction is defined in a similar way. For the forward direction, also starting from the initial arm position, the person moves the arm onward to position (c) in Fig. 10, then continuously moves the arm to position (d), before returning to the initial arm position after a short pause. Similarly, for the backward direction, the person moves the arm from the initial position to position (e), then to position (f), and returns to the initial position.

2. Stop: This gesture is also a dynamic motion. To trigger the stop command, the user moves the arm from the initial position to the position shown in (g), holds it for about a second, and returns to the initial position.

3. Speed: Again, from the initial arm position, the operator brings the arm to position (h) and moves it rapidly (slowly) to command a faster (slower) speed, before returning to the initial arm position.

After defining the gestural commands, according to Section III, the procedure for gesture recognition is briefly summarized in the following steps:
1) Collect the gesture motion data from the Measurand ShapeTape sensor;
2) Project the data into a latent space to reduce the dimension;
3) Use the FFT to obtain the features of the gesture motions;
4) Execute the recognition algorithm based on LAN.

All of the above steps are performed in real time in Visual C++. As an example, consider the gesture templates in the projected joint space for the 'Stop', 'Right', 'Slow', and 'Fast' gestures. Fig. 11 gives the joint angles $\{q_{t,r1}, q_{t,r2}, q_{t,r3}, q_{t,r4}\}$ of the right arm. We also show the results in the frequency domain after the FFT. Choosing $M = 10$ in the FFT, the real and imaginary parts of the first 10 lower frequencies for the joint $q_{t,r1}$ are given in Fig. 12. As seen from this figure, the four gestures have different coefficients in the lower frequencies, implying that this characteristic can be used to recognize different gesture patterns.

FIGURE 11 Examples of gesture templates in the joint space ($q_{t,r1}$, $q_{t,r2}$, $q_{t,r3}$, $q_{t,r4}$ versus time): (a) a 'Stop' gesture, (b) a 'Right' gesture, (c) a 'Slow' gesture, and (d) a 'Fast' gesture.

FIGURE 12 Coefficients of the first 10 lower frequencies of joint $q_{t,r1}$ for the four gestures in Fig. 11.

We estimate $\sigma_i = 0.1$ from the experimental data, and choose $\pi_i = 1$ and $\sigma_z = 0.1$ in the LAN method. Gesture recognition can then be realized with the proposed seven-step procedure in Subsection E. For an example 'Fast' gesture, the convergence results are shown in Table 1: this 'Fast' gesture converges to the 'Fast' pattern of the gestural commands after 2 iterations. For the other gesture patterns, we have similar results, i.e., they converge to the defined gestures within 3 iterations. Furthermore, the parameters $\sigma_i$ and $\sigma_z$ have been changed to different values, such as $\sigma_i = 0.05$, $\sigma_z = 0.1$; $\sigma_i = 0.1$, $\sigma_z = 0.1$; $\sigma_i = 0.5$, $\sigma_z = 0.1$; $\sigma_i = 0.1$, $\sigma_z = 1.0$; and $\sigma_i = 0.2$, $\sigma_z = 0.8$. We find that convergence is reached within 3 iterations for the different parameter settings. As an example, Table 2 gives the recognition result for $\sigma_i = 0.2$ and $\sigma_z = 0.8$.

TABLE 1 Distances between the 'Fast' feature and the different gesture features in the attractor units.

ITERATION | LEFT | RIGHT | FORWARD | BACKWARD | STOP | SLOW | FAST
1 | 0.571409 | 0.715404 | 0.432162 | 0.550074 | 0.251911 | 0.141823 | 0.00177102
2 | 0.571475 | 0.715577 | 0.432505 | 0.550232 | 0.252284 | 0.143592 | 2.0182E-006
3 | 0.571475 | 0.715577 | 0.432505 | 0.550232 | 0.252284 | 0.143592 | 1.52191E-006
4 | 0.571475 | 0.715577 | 0.432505 | 0.550232 | 0.252284 | 0.143592 | 1.52179E-006

TABLE 2 Distances between the 'Stop' feature and the different gesture features in the attractor units.

ITERATION | LEFT | RIGHT | FORWARD | BACKWARD | STOP | SLOW | FAST
1 | 0.620498 | 0.764216 | 0.410772 | 0.690748 | 1.44589E-009 | 0.294372 | 0.259565
2 | 0.620498 | 0.764216 | 0.410772 | 0.690748 | 5.58826E-017 | 0.294372 | 0.259565
3 | 0.620498 | 0.764216 | 0.410772 | 0.690748 | 5.58826E-017 | 0.294372 | 0.259565

FIGURE 10 Sample gestural commands to control robots: (a) initial position, (b) right, (c)–(d) forward, (e)–(f) backward, (g) stop, and (h) speed.


Now, we realize a gesture-based interface to control a mobile robot by mapping gestural commands to actions. For example, when the human waves a hand to the left, the robot identifies the gesture and moves in the left direction. Using the Webots software, we simulate a mobile robot and evaluate the gesture recognition system by letting the robot move in different directions, change speed, or stop.

In order to test the ability of the system to recognize gestures despite variability of motion, the user is required to repeat the same gesture a few times. We also test how well the system recognizes different gestures in a random order. Table 3 shows the recognition accuracy obtained with the LAN approach. Overall, the LAN approach classifies 99.15% of the examples correctly. Out of 235 testing examples, two misclassification errors occurred, for the 'Right' and 'Forward' patterns. Based on records of the start and end times of each gesture, we attribute the failures to insufficient waiting time before the misclassified gestures were performed, which resulted in inaccurate detection of the gesture start time.

TABLE 3 Recognition results for the LAN algorithm (rows: given gestures; columns: recognized gestures).

GIVEN GESTURE | TOTAL | LEFT | RIGHT | FORWARD | BACKWARD | STOP | SLOW | FAST
LEFT | 30 | 30 | - | - | - | - | - | -
RIGHT | 30 | - | 29 | - | - | - | - | 1
FORWARD | 40 | - | - | 39 | - | - | - | 1
BACKWARD | 40 | - | - | - | 40 | - | - | -
STOP | 35 | - | - | - | - | 35 | - | -
SLOW | 25 | - | - | - | - | - | 25 | -
FAST | 35 | - | - | - | - | - | - | 35
TOTAL | 235 | 30 | 29 | 39 | 40 | 35 | 25 | 37

Fig. 13 illustrates how the user controls a mobile robot using a gesture as an input command. Moreover, we also asked several different users to participate in the experiment to test the system's ability. The recognition results are given in Table 4. The experimental results show that the system works well for the different users, and we find that operators can easily control the system.

TABLE 4 Recognition results for different users.

USER | NUMBER OF GESTURES | ERRORS | ACCURACY
1 | 200 | 2 | 99%
2 | 210 | 0 | 100%
3 | 180 | 0 | 100%
4 | 180 | 1 | 99.4%
5 | 150 | 0 | 100%

FIGURE 13 Gesture-based interface to control a mobile robot to move in the right direction or to stop: (a)–(b) the operator with a real-time monitor, directional antenna, and access point.

V. Discussion

In this work, human gestures are chosen to control the robot since gestures are a natural form of human communication, and thus they provide a natural interface for robots to interact with humans. A typical interface for robot control is a joystick, which sends low-level instructions to the robot, such as moving forward, turning in different directions, or performing complicated tasks in a remote location. However, joystick control is limited to pre-programmed functionalities, and it is difficult to define new functions according to changes in the robot's working conditions. In contrast, our proposed gesture-based control can easily realize changes in control functions by adding new gestures or changing gestural commands. Furthermore, the joystick only controls a single point of the robot (e.g., the end effector), while gesture-based control simultaneously handles multiple points.

In the experiment of Section IV, we used body-worn sensors to record human gesture data, which may constrain the user's freedom of movement. This limitation motivates us to study vision-based acquisition of gesture data. In particular, we use Microsoft's Kinect sensor to track human body motions.

Based on the same set of gestural commands as in Section IV, we record 50 gesture motions for each pattern. To realize the current gesture recognition target, we use position and orientation information in the time domain, provided by Kinect, for six joints: the shoulders, elbows, and hands of both the left and right arms. The FFT is used to extract features in the frequency domain. We arbitrarily choose one feature for each pattern to form the set of 7 attractors, denoted by $\{C_{f_{1,c}}, C_{f_{2,c}}, \ldots, C_{f_{i,c}}, \ldots, C_{f_{7,c}}\}$. Then we test the recognition accuracy on the remaining 49 sets of gestures based on LAN (the same seven steps in Subsection E of Section III). The result shows that LAN achieves about 95% recognition accuracy in this case. While we only give a brief discussion of offline gesture recognition here, it is straightforward to implement an interface similar to that in Section IV to perform online gesture recognition. Although Kinect provides an easy and simple way to track the human body, a drawback is that the user needs to always face the camera and stay within a certain distance of it. Thus, different gesture-capturing devices have different limitations, and a suitable device must be chosen to match the requirements and the target.

Currently, our approach has only been applied to a simulated mobile robot and evaluated in a series of experiments. As a next step, we will implement the proposed approach on an autonomous robot system to test its utility in the context of a realistic service robotics task (e.g., the clean-up tasks in [11]). The proposed system can also be applied in a virtual world for a person to interact with virtual agents.

VI. Conclusions

In this work, we proposed an online gesture recognition method based on LAN. It was employed to recognize human gestures using streams of feature vectors extracted from real-time sensory data. As an application, the gesture recognition system was used to instruct a robot to execute pre-defined commands such as moving in different directions, changing speed, and stopping. Experimental results showed high accuracy in controlling the mobile robot using gesture recognition. This system provides a flexible and easy-to-use human-robot interface: the user only needs to demonstrate all gesture patterns a few times before starting the control tasks, and the robot memorizes all patterns and recognizes a given gesture. Furthermore, to define a new pattern corresponding to a new control task, the user only needs to demonstrate that pattern. This is advantageous in the area of service robots, since many of them will be operated by non-expert users.

Acknowledgment

We would like to thank Mr. Yung Siang Liau and Mr. Jiayu Shen for their help during the experiments.

References

[1] R. D. Schraft and G. Schmierer, Serviceroboter (Lecture Notes in Control and Information Sciences). Berlin: Springer-Verlag, 1998.

[2] M. Hillman, "Rehabilitation robotics from past to present: A historical perspective," in Proc. ICORR, 2003.

[3] C. L. Lisetti, S. M. Brown, K. Alvarez, and A. H. Marpaung, "A social informatics approach to human–robot interaction with a service social robot," IEEE Trans. Syst., Man, Cybern. C, vol. 34, no. 2, pp. 195–209, 2004.

[4] Z. Z. Bien and H.-E. Lee, "Effective learning system techniques for human–robot interaction in service environment," Knowl.-Based Syst., vol. 20, pp. 439–456, 2007.

[5] G. Massera, E. Tuci, T. Ferrauto, and S. Nolfi, "The facilitatory role of linguistic instructions on developing manipulation skills," IEEE Comput. Intell. Mag., vol. 5, no. 3, pp. 33–42, 2010.

[6] W. Campbell, A. Becker, A. Azarbayejani, A. Bobick, and A. Pentland, "Invariant features for 3-D gesture recognition," MIT Media Lab., Perceptual Comput. Sec., Tech. Rep. 379, 1996.

[7] P. Maes, B. Blumberg, T. Darrell, and A. Pentland, "The ALIVE system: Full body interaction with animated autonomous agents," ACM Multimedia Syst., vol. 5, pp. 105–112, 1997.

[8] M. Becker, E. Kefalea, E. Maël, C. von der Malsburg, M. Pagel, J. Triesch, J. Vorbrüggen, R. Würtz, and S. Zadel, "GripSee: A gesture-controlled robot for object perception and manipulation," Autonomous Robots, vol. 6, no. 2, pp. 203–221, 1999.

[9] D. Kortenkamp, E. Huber, and P. Bonasso, "Recognizing and interpreting gestures on a mobile robot," in Proc. AAAI-96. Cambridge, MA: MIT Press, 1996, pp. 915–921.

[10] R. E. Kahn, M. J. Swain, P. N. Prokopowicz, and R. J. Firby, "Gesture recognition using the Perseus architecture," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, San Francisco, CA, 1996, pp. 734–741.

[11] S. Waldherr, R. Romero, and S. Thrun, "A gesture based interface for human–robot interaction," Autonomous Robots, vol. 9, pp. 151–173, 2000.

[12] J. P. Hale, "Anthropometric teleoperation: Controlling remote manipulators with the DataGlove," NASA-TM-103588, 1992.

[13] M. Brooks, "The DataGlove as a man–machine interface for robotics," in Proc. 2nd IARP Workshop Medical and Healthcare Robotics, U.K., 1989, pp. 213–225.

[14] D. J. Sturman and D. Zeltzer, "A design method for 'whole-hand' human–computer interaction," Trans. Inform. Syst., vol. 11, no. 3, pp. 219–238, 2000.

[15] A. Wilson and A. Bobick, "Learning visual behavior for gesture analysis," MIT Media Lab., Perceptual Comput. Sec., Tech. Rep. 337, 1995.

[16] G. R. McMillan, "The technology and applications of gesture-based control," in Proc. RTO EN-3, France, Oct. 1998, pp. 1–11.

[17] H. J. Boehme, A. Brakensiek, U. D. Braumann, M. Krabbes, and H. M. Gross, "Neural networks for gesture-based remote control of a mobile robot," in Proc. 1998 IEEE World Congr. Computational Intelligence WCCI'98 (IJCNN'98), Anchorage, AK, pp. 372–377.

[18] P. K. Simpson, "Fuzzy min–max neural network, part I: Classification," IEEE Trans. Neural Networks, vol. 3, no. 5, pp. 776–786, 1992.

[19] J. S. Kim, W. Jang, and Z. Bien, "A dynamic gesture recognition system for the Korean sign language (KSL)," IEEE Trans. Syst., Man, Cybern. B, vol. 26, no. 2, pp. 354–359, 1996.

[20] J. B. Kim and Z. Z. Bien, "Recognition of continuous Korean sign language using gesture tension model and soft computing technique," IEICE Trans. Inform. Syst., vol. E87-D, no. 5, pp. 1265–1270, 2002.

[21] F. Fernández and P. Isasi, "Local feature weighting in nearest prototype classification," IEEE Trans. Neural Networks, vol. 19, no. 1, pp. 40–53, 2008.

[22] S. Seo and K. Obermayer, "Soft learning vector quantization," Neural Comput., vol. 15, no. 7, pp. 1589–1604, 2003.

[23] T. Kohonen, Self-Organizing Maps. New York: Springer-Verlag, 2001.

[24] G. A. Carpenter and S. Grossberg, "ART 2: Self-organization of stable category recognition codes for analog input patterns," Appl. Opt., vol. 26, no. 23, pp. 4919–4930, 1987.

[25] D. Liu, Z. Pang, and S. R. Lloyd, "A neural network method for detection of obstructive sleep apnea and narcolepsy based on pupil size and EEG," IEEE Trans. Neural Networks, vol. 19, no. 2, pp. 308–318, 2008.

[26] I. Arel, D. Rose, and T. Karnowski, "Deep machine learning: A new frontier in artificial intelligence research," IEEE Comput. Intell. Mag., vol. 5, no. 4, pp. 13–18, 2010.

[27] J. Mańdziuk, "Towards cognitively plausible game playing systems," IEEE Comput. Intell. Mag., vol. 6, no. 2, pp. 38–51, 2011.

[28] L. M. Kay, L. R. Lancaster, and W. J. Freeman, "Reafference and attractors in the olfactory system during odor recognition," Int. J. Neural Syst., vol. 7, no. 4, pp. 489–495, 1996.

[29] D. J. Amit and N. Brunel, "Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex," Cerebral Cortex, vol. 7, no. 3, pp. 237–252, 1997.

[30] R. S. Zemel and M. C. Mozer, "Localist attractor networks," Neural Comput., vol. 13, pp. 1045–1064, 2001.

[31] H. T. Siegelmann, "Analog-symbolic memory that tracks via reconsolidation," Physica D, vol. 237, pp. 1207–1214, 2008.

[32] H. Tang, K. C. Tan, and E. J. Teoh, "Dynamics analysis and analog associative memory of networks with LT neurons," IEEE Trans. Neural Networks, vol. 17, no. 2, pp. 409–418, 2006.

[33] H. Tang, H. Li, and R. Yan, "Memory dynamics in attractor networks with saliency weights," Neural Comput., vol. 22, no. 7, pp. 1899–1926, 2010.

[34] K. P. Tee, R. Yan, Y. Chua, and Z. Huang, "Singularity-robust modular inverse kinematics for robotic gesture imitation," in Proc. IEEE Conf. Robotics and Biomimetics, Tianjin, China, 2010, pp. 920–925.
