
Recovering the Linguistic Components of the Manual Signs in American Sign Language

Liya Ding and Aleix M. Martinez
Dept. of Electrical and Computer Engineering
The Ohio State University
{dingl,aleix}@ece.osu.edu

Abstract

Manual signs in American Sign Language (ASL) are constructed using three building blocks – handshape, motion, and place of articulation. Only when these three have been successfully estimated can a sign be uniquely identified. Hence, the use of pattern recognition techniques that rely on only a subset of these is inappropriate. To achieve accurate classifications, the motion, the handshape and their three-dimensional position need to be recovered. In this paper, we define an algorithm to determine these three components from a single video sequence of two-dimensional pictures of a sign. We demonstrate the use of our algorithm in describing and recognizing a set of manual signs in ASL.

1. Introduction

Sign languages are used all over the world as a primary means of communication by deaf people. American Sign Language (ASL) is one such language. According to current estimates, it is used regularly by more than 500,000 people, and up to 2 million use it from time to time. There is thus a great need for systems that can be used with ASL (e.g., computer interfaces) or can serve as interpreters between ASL and English.

As with other sign languages, ASL has a manual component and a non-manual one (i.e., the face). The manual sign is further divided into three components: i) handshape, ii) motion, and iii) place of articulation [2, 13, 4]. Most manual signs can only be distinguished when all three components have been identified. An example is illustrated in Fig. 1. In this figure, the words (more accurately called concepts in ASL) “search” and “drink” share the same handshape, but have different motion and place of articulation; “family” and “class” share the same motion and place of articulation, but have a different handshape. Similarly, the concepts “onion” and “apple” would only be distinguished by the place of articulation – one at the mouth, the other at the chin.

If we are to build computers that can recognize ASL, it is imperative that we develop algorithms that can identify these three components of the manual sign. In this paper, we present innovative algorithms for obtaining the motion, handshape and their three-dimensional position from a single (frontal) video sequence of the sign. By position, we mean that we can identify the 3D location of the hand with respect to the face and torso of the signer. The motion is given by the path traveled by the dominant hand from start to end of the sign. The dominant hand is that which carries (most of) the meaning of the word – usually the right hand for right-handed people. Finally, the handshape is given by the linguistically significant fingers [2]. In each ASL sign, only a subset of fingers is actually of linguistic interest. While the position of the other fingers is irrelevant, that of the linguistically significant fingers is not. We address this issue as follows. First, we assume that while not all the fingers are visible in the video sequence of a sign, those that are linguistically significant are (even if only for a short period of time). This assumption is based on the observation that signers must provide the necessary information to observers to make a sign unambiguous. Therefore, the significant fingers ought to be visible at some point. With this assumption in place, we can use structure-from-motion algorithms to recover the 3D structure and motion of the hand. To accomplish this, though, we need to define algorithms that are robust to occlusions. This is necessary because, although the significant fingers may be visible for some interval of time, they may be occluded elsewhere. We thus need to extract as much information as possible from each segment of our sequence.

Once these three components of the sign are recovered, we can construct a feature space to represent each sign. This will also allow us to do recognition in new video sequences. This is in contrast to most computer vision systems, which only use a single feature for representation and recognition [10, 1, 5]. In several of these algorithms, the discriminant information is generally searched within a feature space constructed with appearance-based features such as images of pre-segmented hands [1], hand binary masks, and hand contours [12]. The other most typically used feature set is motion [15]. Then, for recognition, one generally uses Hidden Markov Models [12], Neural Networks [15] and Multiple Discriminant Analysis [1]. These methods are limited to the identification of signs clearly separated by the single feature in use. These methods are thus not scalable to large systems.

Figure 1. Example signs with same handshape, different motion (a-b); same motion, different handshapes (c-d).

Our first goal is thus to define a robust algorithm that can recover the handshape and motion trajectory of each manual sign as well as their 3D position. In our study, we allow not only for self-occlusions but also for imprecise localizations of the fiducial points. We restrict ourselves to the case where the handshape does not change from start to end, because this represents a sufficiently large number of the concepts in ASL [11] and allows us to use linear fitting algorithms. Derivations of our method are in Section 2. The motion path of the hand can then be obtained using solutions to the three-point resection problem [6]. Since these solutions are generally very sensitive to mis-localized feature points, we introduce a robustified version which searches for the most stable computation. Derivations for this method are in Section 3.

To recover the 3D position of the hand with respect to the face, we make use of a face detection algorithm. We define the distance from the face to the camera as the maximum distance traveled by the hand in all signs of that person. Using the perspective model, we can obtain the 3D position of the face in the camera coordinate system. Then, the 3D position of the hand can be described in the face coordinate system, providing the necessary information to discriminate signs [4, 13, 2]. Derivations are in Section 4. Experimental results are in Section 5, where we elaborate on the use of the three algorithms presented in this paper.

2. Handshape

We denote the set of all 3D-world hand points (in particular, we will be using the knuckles) as P_e = {p_1, ..., p_n}, where p_i = (x_i, y_i, z_i)^T specifies the three-dimensional coordinates of the ith feature point in the Euclidean coordinate system. As is well known, the image points in camera j are given by Q_j = A_j P_e + b_j, j = 1, 2, ..., m, where

Q_j = (\mathbf{q}_{j1}, \ldots, \mathbf{q}_{jn}) = \begin{pmatrix} u_{j1} & \cdots & u_{jn} \\ v_{j1} & \cdots & v_{jn} \end{pmatrix}   (1)

are the image points, and A_j and b_j are the parameters of the jth affine camera.

Since in our application the camera position does not change, Q_j here are the image points in the jth image of our video sequence. Our goal is to recover P_e with regard to the object (i.e., hand) coordinate system from the known Q_j, j = 1, 2, ..., m.

Jacobs [7] presents a method that uses the model defined above to recover the 3D shape even when not all the feature points are visible in all frames, i.e., with occlusions. Let us first represent the set of affine equations given above in the compact form D = AP, where

D = \begin{pmatrix} Q_1 \\ Q_2 \\ \vdots \\ Q_m \end{pmatrix}, \quad A = \begin{pmatrix} A_1 & \mathbf{b}_1 \\ A_2 & \mathbf{b}_2 \\ \vdots & \vdots \\ A_m & \mathbf{b}_m \end{pmatrix}, \quad P = \begin{pmatrix} \mathbf{p}_1 & \mathbf{p}_2 & \cdots & \mathbf{p}_n \\ 1 & 1 & \cdots & 1 \end{pmatrix}.

When there is neither noise nor missing data, D is of rank 4 or less, since it is the product of A (which has 4 columns) and P (which has 4 rows). If we consider a row vector of D as a point in R^n, all the points from D lie in a 4-dimensional subspace of R^n. This subspace, which is actually the row space of D, is denoted L. Any four linearly independent rows of D span L.
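As an illustration of this observation, the no-occlusion case reduces to a rank-4 factorization of D, which can be obtained with a truncated SVD. The following is a minimal Python/NumPy sketch; the function name and the synthetic data are ours, not the paper's:

```python
import numpy as np

def rank4_factorization(D):
    """Factor the 2m x n measurement matrix D into A (2m x 4) and P (4 x n).

    Assumes the affine model D = A P with no missing data, so rank(D) <= 4.
    """
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    A_hat = U[:, :4] * s[:4]     # 2m x 4 affine motion
    P_hat = Vt[:4, :]            # 4 x n affine shape (up to an affine ambiguity)
    return A_hat, P_hat

# Toy usage with synthetic data: m frames, n points.
m, n = 10, 12
A_true = np.random.randn(2 * m, 4)
P_true = np.vstack([np.random.randn(3, n), np.ones((1, n))])
D = A_true @ P_true
A_hat, P_hat = rank4_factorization(D)
print(np.allclose(A_hat @ P_hat, D))  # True: the product reproduces D
```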

When there is missing data in a row vector D_i (i = 1, 2, ..., 2m), all possible values (that can occupy this position) have to be considered. The possible points in this row vector create an affine subspace denoted E_i. Assume we have four rows D_h, D_i, D_j, D_l (h, i, j, l = 1, 2, ..., 2m, with h ≠ i ≠ j ≠ l), with or without missing data, and denote the set as F_k = {D_h, D_i, D_j, D_l}, k ∈ N. There is a total of n_f (n_f ∈ N) possible sets F_k.

If the four affine subspaces (E_h, E_i, E_j, E_l) corresponding to the four row vectors in F_k do not intersect, then L should be a subset of S_k = span(E_h, E_i, E_j, E_l), k = 1, 2, ..., n_f. Thus, L should be a subset of the intersection of all possible spans of this kind, S = ⋂_{k=1,...,n_f} S_k, and L ⊆ S. Unfortunately, with localization noise and errors caused by inaccurate modelling, this subset relation is not retained. This can be solved by using the null space [7]. That is, the orthogonal complement of S_k is denoted S_k^⊥. If we have the matrix representation of S_k^⊥ as N_k, then N = [N_1, N_2, ..., N_{n_f}] is the matrix representation of S^⊥, and the null space of N is S. Using the SVD of N = UWV^T, we can take the four columns of U corresponding to the four smallest singular values as the four rows of P. That is, we find the matrix representation P' of the subspace L that is closest to being the null space of N according to the Frobenius norm. In this case, note that one of the vectors spanning D's row space L is known to be a vector of all 1s (because, in homogeneous form, P has a row of 1s).
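A minimal Python/NumPy sketch of this null-space construction is given below. It assumes missing entries of D are marked with NaN and builds, for each selected set of rows F_k, a matrix whose rows span S_k (the known part of each row plus one standard basis vector per missing coordinate, together with the all-ones vector); the function names are ours, and the choice of the row sets F_k is left to the caller (e.g., the paired rows described below):

```python
import numpy as np

def span_of_rows(D, rows):
    """Rows spanning S_k for the selected rows of D (NaN marks missing data)."""
    n = D.shape[1]
    basis = []
    for r in rows:
        d = D[r].copy()
        miss = np.isnan(d)
        d[miss] = 0.0                    # known part of the row
        basis.append(d)
        for c in np.where(miss)[0]:      # one free direction per missing entry
            e = np.zeros(n)
            e[c] = 1.0
            basis.append(e)
    basis.append(np.ones(n))             # the all-ones row of P is known to lie in L
    return np.asarray(basis)

def recover_P(D, row_sets):
    """Estimate the 4 x n matrix P from the null spaces of the spans S_k."""
    N_blocks = []
    for rows in row_sets:
        B = span_of_rows(D, rows)                    # rows span S_k
        U, s, Vt = np.linalg.svd(B, full_matrices=True)
        rank = int(np.sum(s > 1e-10))
        N_blocks.append(Vt[rank:].T)                 # columns span S_k's complement
    N = np.hstack(N_blocks)                          # matrix representation of S_perp
    U, s, Vt = np.linalg.svd(N, full_matrices=True)
    return U[:, -4:].T                               # 4 rows spanning the estimate of L
```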


Figure 2. 3D handshape reconstructed from the 2D knuckle points tracked over a video sequence.

If an image point is missing (i.e., both its (u, v) coordinates are missing), it is beneficial to include the two rows corresponding to the same image in F_k when calculating N_k. Hence, to calculate N_k from S_k, we take F_k = {D_{2i−1}, D_{2i}, D_{2j−1}, D_{2j}, [1 · · · 1]^T}, i, j = 1, 2, ..., m, i ≠ j, for better stability. We also eliminate poorly conditioned vectors in the calculation of the null spaces N_k, k = 1, ..., n_f, which are then combined into a more stable solution for N. This generally improves the performance of the algorithm defined above in practice.

If a column of D has a large number of missing data, we may very well be unable to recover any 3-dimensional information from it. Note, however, that this can only happen when one of the fingers is occluded during the entire sequence, in which case we will assume that this finger is not linguistically significant. This means we do not need to recover its 3D shape, because it will be irrelevant for the analysis and understanding of the sign.

Once P has been properly estimated, we can use the non-missing data of each row of D to fill in the missing gaps as a linear combination of the rows of P. At the same time, we have decomposed the filled D into P and A = D P^+, where P^+ denotes the pseudo-inverse of P.
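This fill-in step can be sketched as a small least-squares problem per row (again a Python/NumPy sketch with our own naming; it assumes the 4 x n matrix P from the previous step and NaN-marked gaps in D):

```python
import numpy as np

def fill_missing_rows(D, P):
    """Fill NaN entries of each row of D as a linear combination of the rows of P."""
    D_filled = D.copy()
    for r in range(D.shape[0]):
        obs = ~np.isnan(D[r])
        # coefficients c (length 4) minimizing ||c P[:, obs] - D[r, obs]||
        c, *_ = np.linalg.lstsq(P[:, obs].T, D[r, obs], rcond=None)
        D_filled[r, ~obs] = c @ P[:, ~obs]
    A = D_filled @ np.linalg.pinv(P)      # affine motion from the pseudo-inverse of P
    return D_filled, A
```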

The above result P generates what is known as an affine shape. Two affine shapes are equivalent if there exists an affine transformation between them. To break this ambiguity, we can include the Euclidean constraints to find the Euclidean shape that best approximates the real one. One way to achieve this is to find a matrix H such that M_j H (j = 1, 2, ..., m) is orthographic, where M_j = (A_j b_j). This is so because orthographic projections do not carry the shape ambiguity mentioned above. The Euclidean shape can be recovered using the Cholesky decomposition and non-linear optimization. Fig. 2 shows an example result of the recovery of the 3D handshape “W”.
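The linear part of this metric (Euclidean) upgrade can be sketched as follows. This uses the standard orthographic metric constraints (equal norm and orthogonality of the two rows of each camera) rather than the paper's exact procedure, the names are ours, and the subsequent non-linear refinement mentioned above is omitted:

```python
import numpy as np

def metric_upgrade(A_hat, P_hat):
    """Upgrade an affine reconstruction (A_hat: 2m x 4, P_hat: 4 x n) to a metric one."""
    m = A_hat.shape[0] // 2
    M = A_hat[:, :3]                      # drop the translation column

    def sym_vec(a, b):
        # coefficients of a^T Q b for symmetric Q = [[q0,q3,q4],[q3,q1,q5],[q4,q5,q2]]
        return np.array([a[0]*b[0], a[1]*b[1], a[2]*b[2],
                         a[0]*b[1] + a[1]*b[0],
                         a[0]*b[2] + a[2]*b[0],
                         a[1]*b[2] + a[2]*b[1]])

    rows, rhs = [], []
    for j in range(m):
        a1, a2 = M[2*j], M[2*j + 1]
        rows.append(sym_vec(a1, a1) - sym_vec(a2, a2)); rhs.append(0.0)  # equal norms
        rows.append(sym_vec(a1, a2));                   rhs.append(0.0)  # orthogonality
    rows.append(sym_vec(M[0], M[0])); rhs.append(1.0)                    # fix the scale

    q, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)
    Q = np.array([[q[0], q[3], q[4]],
                  [q[3], q[1], q[5]],
                  [q[4], q[5], q[2]]])
    H = np.linalg.cholesky(Q)             # assumes Q is positive definite
    P_metric = np.linalg.inv(H) @ P_hat[:3, :]
    M_metric = M @ H
    return M_metric, P_metric
```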

3. Motion Reconstruction

We can now recover the 3D motion path of the hand by finding the pose difference between each pair of consecutive frames. That means we need to estimate the translation and rotation made by the object from frame to frame in the camera coordinate system. A typical solution is that given by the PnP resection problem [6]. Also, in this case, the appropriate model to be used is perspective projection.

In the three-point perspective pose estimation problem, there are three object points, p_1, p_2 and p_3, with camera coordinates p_i = (x_i, y_i, z_i)^T. Our goal is to recover these values for each of the points. Since the 3D shape of the object has already been recovered, the interpoint distances (namely, a = ||p_2 − p_1||, b = ||p_1 − p_3||, and c = ||p_3 − p_2||) can be easily calculated.

As is well known, the perspective model is given by

u_i = f \frac{x_i}{z_i}, \quad v_i = f \frac{y_i}{z_i}, \quad i = 1, 2, 3,   (2)

where q_i = (u_i, v_i)^T is the ith image point and f is the focal length of the camera. The object point is then in the direction specified by the unit vector

\mathbf{j}_i = \frac{1}{\sqrt{u_i^2 + v_i^2 + f^2}} \begin{pmatrix} u_i \\ v_i \\ f \end{pmatrix}, \quad i = 1, 2, 3.

Now, the task reduces to finding the scalars s_1, s_2 and s_3 such that p_i = s_i j_i, i = 1, 2, 3. The angles between these unit vectors can be calculated as

\cos\alpha = \mathbf{j}_2 \cdot \mathbf{j}_3, \quad \cos\beta = \mathbf{j}_1 \cdot \mathbf{j}_3, \quad \cos\gamma = \mathbf{j}_1 \cdot \mathbf{j}_2.   (3)
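These quantities are straightforward to compute from the tracked image points; a small Python/NumPy sketch (with our own function name, and the focal length f assumed known) is:

```python
import numpy as np

def viewing_directions_and_angles(q, f):
    """Unit viewing directions j_i and the cosines of the angles between them.

    q: 3 x 2 array of image points (u_i, v_i); f: focal length in pixels.
    """
    j = np.column_stack([q[:, 0], q[:, 1], np.full(3, float(f))])
    j /= np.linalg.norm(j, axis=1, keepdims=True)      # unit vectors j_1, j_2, j_3
    cos_alpha = j[1] @ j[2]
    cos_beta = j[0] @ j[2]
    cos_gamma = j[0] @ j[1]
    return j, (cos_alpha, cos_beta, cos_gamma)
```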

Grunert's solution to this problem is based on the substitutions s_2 = \mu s_1 and s_3 = \nu s_1, which allow us to reduce the three-point resection problem to a fourth-order polynomial in \nu:

A_4 \nu^4 + A_3 \nu^3 + A_2 \nu^2 + A_1 \nu + A_0 = 0,   (4)

where the coefficients A_4, A_3, A_2, A_1 and A_0 are functions of the interpoint distances a, b, c and the angles \alpha, \beta, \gamma between the \mathbf{j}_i [6].

Such polynomials are known to have zero, two or four real roots. For each real root \nu, we can calculate \mu, s_1, s_2, s_3 and the values of p_1, p_2 and p_3. To recover the translation and rotation of the hand points, we use the nine equations given by

\mathbf{p}_i = R\, {}^{w}\mathbf{p}_i + \mathbf{t}, \quad i = 1, 2, 3,   (5)

where {}^{w}p_i is the hand point described in the world coordinate system, and R and t = [t_x, t_y, t_z]^T are the rotation matrix and translation vector we want to recover. The nine entries of the rotation matrix are not independent. To reduce the 12 dependent parameters to 9 and solve from the nine equations, we use a two-step method as in [3]. Having the rotation matrix R, we further parameterize it using three rotation angles r_x, r_y, r_z. Altogether, we use six parameters, r_x, r_y, r_z, t_x, t_y, t_z, to represent the rotational and translational motion of each frame.
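As an illustration of this step, the sketch below recovers R and t from the point correspondences and converts R to three rotation angles. It uses a standard SVD-based absolute-orientation (Kabsch) solution rather than the two-step method of [3], and the angle convention is our own assumption:

```python
import numpy as np

def pose_from_points(p_cam, p_world):
    """R, t such that p_cam ~ R @ p_world + t, from 3 (or more) correspondences."""
    c_cam, c_world = p_cam.mean(axis=0), p_world.mean(axis=0)
    H = (p_world - c_world).T @ (p_cam - c_cam)
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T                    # proper rotation (det = +1)
    t = c_cam - R @ c_world
    return R, t

def rotation_angles(R):
    """Rotation angles (r_x, r_y, r_z), assuming R = R_z(r_z) R_y(r_y) R_x(r_x)."""
    r_y = np.arcsin(-R[2, 0])
    r_x = np.arctan2(R[2, 1], R[2, 2])
    r_z = np.arctan2(R[1, 0], R[0, 0])
    return r_x, r_y, r_z
```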


Since we regard the hand as a rigid object during the motion, the rotation and translation of any three-point group are identical. The polynomial defined in (4) may have more than one root. Unfortunately, in general, it is not known which of these roots corresponds to the solution of our problem. To solve this, we calculate all the solutions given by all possible combinations of three feature points, and describe them in a histogram. This allows us to select the result with the highest occurrence (i.e., with most votes) as our solution.

It is also known that geometric approaches to the three-point resection problem are very sensitive to localization errors. We now define an approach to address this issue.

We assume that the correct localization is close to that given by the user or by an automatic tracking algorithm. We generate a set of candidate hand points by moving each original fiducial about a neighborhood of p × p pixels. The solutions of Grunert's polynomials are then used to obtain all possible values of r_x, r_y, r_z, t_x, t_y and t_z. Each of these results is described in a histogram, and the interval I_0 with the most votes is selected. A wider interval centered at I_0 is then chosen. The median of the results within this new interval corresponds to our final solution. Note that voting was first used to eliminate the outliers from the solutions of Grunert's polynomials and, hence, our method is not affected by large deviations of the results. The median is used to select the best result among the correct solutions from different image point localizations. The robustness of this algorithm has been studied in [3].
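The voting-plus-median step can be sketched as below; `solve_candidate_poses` is a hypothetical placeholder for the P3P solutions computed over perturbed fiducials and point triplets, and the bin width and interval-widening factor are our own illustrative choices:

```python
import numpy as np

def robust_parameter(samples, bin_width=1.0, widen=3.0):
    """Pick a robust value of one motion parameter (e.g., r_x or t_x).

    samples: candidate values collected from all three-point combinations and
    all perturbed fiducial locations. A histogram vote selects the dominant
    interval I_0; the median inside a widened interval around I_0 is returned.
    """
    samples = np.asarray(samples, dtype=float)
    n_bins = max(int(np.ptp(samples) / bin_width), 1)
    hist, edges = np.histogram(samples, bins=n_bins)
    i0 = int(np.argmax(hist))                         # interval I_0 with most votes
    center = 0.5 * (edges[i0] + edges[i0 + 1])
    half = 0.5 * widen * (edges[1] - edges[0])        # widened interval around I_0
    inside = samples[np.abs(samples - center) <= half]
    return float(np.median(inside))

# Usage sketch (solve_candidate_poses is hypothetical):
# candidates = solve_candidate_poses(frame)   # dict of candidate lists per parameter
# pose = {name: robust_parameter(vals) for name, vals in candidates.items()}
```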

4. Place of articulation

To successfully represent the 3D handshape and motion trajectories recovered using the algorithms presented above, we need to be able to describe them with respect to the signer, not the camera. For this, we can set the face as the frame of reference. This is appropriate because the face provides a large amount of information and serves as the main center of attention [9]. For example, the sign “father” has handshape “5”, with the thumb tapping the forehead as the motion pattern. To represent this sign, we can use the center of the face as the origin of the 3D space. Without loss of generality, we can further assume the face defines the x-y plane of our 3D representation.

The cascade-based face detection method constructed from a set of Haar-like features [14] provides a fast and appropriate face detection algorithm for our application. Since the motion of the head during a sign is small, Gaussian models can be employed to fit the detection results. Given a model of the center of the face and its radius, false detections and poor detections can be readily corrected.
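One simple way to realize this correction (a sketch under our own assumptions about thresholds and data layout, not necessarily the paper's exact procedure) is to fit a Gaussian to the per-frame detections and replace outliers with the model mean:

```python
import numpy as np

def correct_face_detections(centers, radii, k=2.5):
    """Fit Gaussian models to the detected face centers/radii and fix outliers.

    centers: T x 2 array of (u_f, v_f) per frame; radii: length-T array of r_f.
    Detections farther than k standard deviations from the mean are replaced
    by the mean (the head barely moves during a sign, so this is reasonable).
    """
    mu_c, sd_c = centers.mean(axis=0), centers.std(axis=0) + 1e-6
    mu_r, sd_r = radii.mean(), radii.std() + 1e-6
    bad = (np.abs(centers - mu_c) > k * sd_c).any(axis=1) | (np.abs(radii - mu_r) > k * sd_r)
    centers_fixed, radii_fixed = centers.copy(), radii.copy()
    centers_fixed[bad] = mu_c
    radii_fixed[bad] = mu_r
    return centers_fixed, radii_fixed
```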

In our data set, extracted from [8], the signing person stands in a fixed position. Because there are no signs in ASL where the hand moves behind the face, for each person we can define the distance from the face to the camera as the maximum distance between the hand and the camera over all video sequences.

Figure 3. Face detection examples.

Once the face center [u_f, v_f]^T, the radius r_f, and the distance from the face to the camera Z_f are known, we can calculate the center of the face in the camera coordinate system from u_f = f X_f / Z_f and v_f = f Y_f / Z_f. We then define the x-y plane to be that provided by the face (which is from then on assumed to be a plane). Whenever the face is frontal, there will only be a translation between the camera coordinate system and that of the face, i.e., P_f = P_c − [X_f, Y_f, Z_f]^T. This is the most common case in our examples, since the subjects in [8] were asked to sign while looking at the camera. Using this approach, we can define the place of articulation with respect to the subjects' coordinate system. To normalize the 3D positions even further, we can scale this 3D space in the subjects' coordinate system so that the radius of the face is always a pre-specified constant. This provides scale invariance.
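A sketch of this change of coordinates and scale normalization (Python/NumPy, with our own function name and reference radius; it assumes a frontal face so that only a translation is needed):

```python
import numpy as np

def to_face_coordinates(P_cam, face_center_img, Z_f, f, r_f, r_ref=100.0):
    """Express 3D hand points in the face coordinate system, scale-normalized.

    P_cam: N x 3 hand points in camera coordinates.
    face_center_img: (u_f, v_f) image coordinates of the face center.
    Z_f: distance from the face to the camera; f: focal length.
    r_f: detected face radius in pixels; r_ref: reference radius (our choice).
    """
    u_f, v_f = face_center_img
    X_f, Y_f = u_f * Z_f / f, v_f * Z_f / f       # invert u_f = f X_f / Z_f, etc.
    P_face = P_cam - np.array([X_f, Y_f, Z_f])    # frontal face: translation only
    R_f = r_f * Z_f / f                           # 3D face radius from its image radius
    return P_face * (r_ref / R_f)                 # normalize so the face radius is fixed
```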

Figure 4. The places of articulation for the signs “water” and “Wednesday”.

Our 3D face model can be divided into eight regions: the forehead region, the two eye regions, the nose region, the mouth region, the jaw region and the two cheek regions. This division allows us to further represent the signs close to the face appropriately – allowing us, for example, to easily discriminate between “onion” and “apple”. At the same time, eye detection, nose detection and mouth detection could be employed to give a more accurate localization of the regions and a more detailed definition of the regions. This is illustrated in Fig. 4. The sign “water” happens with the tip of the index finger touching the jaw, and the sign “Wednesday” has a circular motion in front of the body (or face).

5. Experimental Results

In this section, we show some results using video sequences from the Purdue ASL Database [8]. This database contains 2,576 video sequences. The first set of these videos includes a large number of video clips of motion primitives and handshapes. In particular, there is a subset of video clips in which each person signs two words (concepts) with distinct meanings. Although the signs have different meanings, they share a common handshape. The difference is given by the motion and/or place of articulation. These are used here to test the performance of our algorithm.

In general, it is very difficult to track the knuckles of the hand automatically, mainly due to the limited image resolution, inevitable self-occlusions and lack of salient characteristics present in the video sequences. Since our goal is to test the performance of the three algorithms presented in this paper, we opted for a manual detection of the fiducials. This allows us to demonstrate the uses of our approach for representing and identifying signs.

Let us start by showing some example results of our algorithm. The signs “fruit” and “cat” share the same handshape and have a similar place of articulation. However, they are easily distinguished by their motion path. In Fig. 5, we can see the handshape of these two signs. A quantitative comparison further demonstrates their similarity. This is provided by a Euclidean distance between the knuckle points, once normalized by position and scale. Shift invariance is simply given by centering the position of the wrist. Then, we use a least-squares solution to match the rest of the two sets of points. The residual error (i.e., the sum of squared distances), R^2, provides the measure of dissimilarity, which in our case is close to zero. For recognition of the handshape, we can compare the 3D shape of the reconstructed handshape with trained handshape models in the database.
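A sketch of this comparison in Python/NumPy, with our own naming; we use an orthogonal Procrustes rotation as the least-squares match, which is an assumption about the exact alignment used:

```python
import numpy as np

def handshape_distance(P1, P2, wrist=0):
    """Residual R^2 between two 3D handshapes (N x 3 knuckle arrays).

    Both shapes are shifted so the wrist point is at the origin, scale-normalized,
    and then aligned with a least-squares (orthogonal Procrustes) rotation.
    """
    A = P1 - P1[wrist]
    B = P2 - P2[wrist]
    A = A / np.linalg.norm(A)              # scale normalization
    B = B / np.linalg.norm(B)
    M = A.T @ B
    U, _, Vt = np.linalg.svd(M)
    Omega = U @ Vt                         # optimal alignment (reflections not excluded)
    return np.sum((A @ Omega - B) ** 2)    # residual R^2
```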

Figure 5. The recovered handshapes of the signs “fruit” (left) and “cat” (right).

For the places of articulation, we center the 3D positions at the center of the face and normalize the radius of the face appropriately. The forehead region, eye regions, nose region, mouth region, chin region and cheek regions of the face are inferred from a (learned) face model. At the same time, eye detection, nose detection and mouth detection can be employed to give a more accurate and more detailed definition of the regions. For example, for the signs “fruit” and “cat,” the places of articulation are similar: “fruit” is articulated in the jaw region of the face, while “cat” starts in the jaw region with a small motion toward the right side of the face.

The motion of the sign “fruit” is a small rotation around the tips of the thumb and index finger (where the hand touches the face). The motion of the sign “cat” is a small repeated translation between the jaw region and the right-hand side of the jaw. The starting and ending points of the motion are detected as short pauses in signing (also known as zero-velocity crossings).
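Detecting such pauses can be sketched as thresholding the hand speed along the recovered trajectory (a minimal sketch; the threshold value is our own illustrative choice):

```python
import numpy as np

def motion_endpoints(trajectory, speed_thresh=2.0):
    """Start and end frame of the motion from a T x 3 hand trajectory.

    Frames whose speed falls below speed_thresh are treated as pauses
    (zero-velocity crossings); the motion is the span between the first
    and last frame whose speed exceeds the threshold.
    """
    speed = np.linalg.norm(np.diff(trajectory, axis=0), axis=1)
    moving = np.where(speed >= speed_thresh)[0]
    if moving.size == 0:
        return 0, len(speed)
    return int(moving[0]), int(moving[-1] + 1)
```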

In Figs. 6 and 7, we show two sequences of images and the corresponding projections of the reconstructed handshape. The 3D handshapes obtained by our algorithm (in the face coordinate system) are shown above each image. In addition, we provide the 3D trajectory recovered by the robust method presented in this paper. The direction of motion is marked along the trajectory. Since we have normalized the 3D representation with respect to the subjects' coordinate system, the trajectory can be used for recognition. The algorithm we employ is the least-squares solution presented above. In this case, we first discretize the paths into an equal number of evenly separated points. The residual of the least-squares fit provides the classification. This is sufficient to correctly classify the same signs as signed by other subjects.
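A sketch of this trajectory comparison (Python/NumPy, our own naming; the arc-length resampling and nearest-residual rule follow the description above, while the number of sample points is our own choice):

```python
import numpy as np

def resample_path(path, n_points=30):
    """Resample a T x 3 trajectory to n_points evenly spaced along its arc length."""
    seg = np.linalg.norm(np.diff(path, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    s_new = np.linspace(0.0, s[-1], n_points)
    return np.column_stack([np.interp(s_new, s, path[:, d]) for d in range(3)])

def classify_trajectory(path, templates):
    """Return the label of the template trajectory with the smallest residual."""
    x = resample_path(path)
    residuals = {label: np.sum((resample_path(t) - x) ** 2)
                 for label, t in templates.items()}
    return min(residuals, key=residuals.get)
```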

Figure 6. Reconstruction of the 3D handshape and hand trajectory for the sign “fruit.” Four example frames with the 3D handshapes recovered by our method and its 3D trajectory.


Figure 7. Reconstruction of the handshape and hand trajectory for the sign “cat.” Five example frames with the 3D handshapes recovered by our method and the 3D trajectory of the sign.

6. Conclusions

To successfully represent and recognize a large number of ASL signs, one needs to be able to recover their 3D position, handshape and motion trajectory. In this paper, we have presented a set of algorithms specifically designed to accomplish this. Since self-occlusions and imprecise fiducial detection are common in ASL, we have presented extensions of the structure-from-motion and resection algorithms that appropriately resolve these issues. We have also introduced the use of a face detector to identify the place of articulation of the sign. Together, these components allow us to uniquely identify a set of signs which diverge in only one of these three variables.

7. Acknowledgments

This research was supported in part by a grant from the National Institutes of Health.

References

[1] Y. Cui and J. Weng, “Appearance-Based Hand Sign Recognition from Intensity Image Sequences,” Computer Vision and Image Understanding, vol. 78, no. 2, pp. 157-176, 2000.

[2] D. Brentari, “A Prosodic Model of Sign Language Phonology,” MIT Press, 2000.

[3] L. Ding and A.M. Martinez, “Three-Dimensional Shape and Motion Reconstruction for the Analysis of American Sign Language,” in Proc. 2nd IEEE Workshop on Vision for Human Computer Interaction, 2006.

[4] K. Emmorey and J. Reilly (Eds.), “Language, Gesture, and Space,” Hillsdale, NJ: Lawrence Erlbaum, 1999.

[5] R.A. Foulds, “Piecewise Parametric Interpolation for Temporal Compression of Multijoint Movement Trajectories,” IEEE Transactions on Information Technology in Biomedicine, 10(1):199-206, 2006.

[6] R.M. Haralick, C. Lee, K. Ottenberg, and M. Nolle, “Review and Analysis of Solutions of the Three Point Perspective Pose Estimation Problem,” International Journal of Computer Vision, 13(3):331-356, 1994.

[7] D.W. Jacobs, “Linear Fitting with Missing Data for Structure-from-Motion,” in Proc. IEEE Computer Vision and Pattern Recognition, pp. 206-212, 1997.

[8] A.M. Martinez, R.B. Wilbur, R. Shay, and A.C. Kak, “The Purdue ASL Database for the Recognition of American Sign Language,” in Proc. IEEE Multimodal Interfaces, Pittsburgh (PA), November 2002.

[9] M.S. Messing and R. Campbell, “Gesture, Speech, and Sign,” Oxford University Press, 1999.

[10] S.C.W. Ong and S. Ranganath, “Automatic Sign Language Analysis: A Survey and the Future Beyond Lexical Meaning,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 27, no. 6, June 2005.

[11] W.C. Stokoe, D.C. Casterline, and C.G. Croneberg, “A Dictionary of American Sign Language on Linguistic Principles,” Linstok Press, 1976.

[12] N. Tanibata, N. Shimada, and Y. Shirai, “Extraction of Hand Features for Recognition of Sign Language Words,” in Proc. International Conf. on Vision Interface, pp. 391-398, 2002.

[13] R.B. Wilbur, “American Sign Language: Linguistic and Applied Dimensions,” Second Edition, Boston: Little, Brown, 1987.

[14] P. Viola and M. Jones, “Rapid Object Detection Using a Boosted Cascade of Simple Features,” in Proc. IEEE Computer Vision and Pattern Recognition, 2001.

[15] M. Yang, N. Ahuja, and M. Tabb, “Extraction of 2D Motion Trajectories and Its Application to Hand Gesture Recognition,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 8, pp. 1061-1074, Aug. 2002.