3D Model-Based Hand Tracking Using Stochastic Direct Search Method

John Y. Lin†, Ying Wu‡, Thomas S. Huang†

† University of Illinois at Urbana-Champaign, 405 N. Mathews, Urbana, IL 61801

‡ Northwestern University, 2145 Sheridan, Evanston, IL 60208

† {jy-lin,huang}@ifp.uiuc.edu, ‡ [email protected]

Abstract

Tracking articulated hand motion in a video sequence is a challenging problem in which the main difficulty arises from the complexity of searching for an optimal motion estimate in the high dimensional configuration space induced by the articulated motion. Considering that the complexity of this problem may be reduced by learning the lower dimensional manifold of the articulated motion in the configuration space, we propose a new representation for the nonlinear manifold of the articulated motion, together with a stochastic simplex algorithm that facilitates very efficient search. Contrary to traditional methods that represent the manifolds through clustering and transition matrix construction, we maintain the set of all training samples. To search for the configuration that best matches the input image, we combine the sequential Monte Carlo technique with the Nelder-Mead simplex search, which is efficient and effective when the gradient is not readily accessible. This new approach has been successfully applied to hand tracking, and our experiments show the efficiency and robustness of our algorithm.

1 Introduction

Capturing articulated hand motion from visual input is an important task in many fields, with applications including human motion understanding, gesture recognition, and motion capture for animation synthesis. Currently, the most reliable methods still require external sensors attached to the human body, such as special markers for motion capture systems, data gloves for hand motion capture, and magnetic sensors for virtual environment interaction. These devices are generally expensive and cumbersome.

On the other hand, vision-based techniques offer an inexpensive and non-invasive alternative for capturing articulated body motion. In this paper we present an approach for tracking the highly articulated motion of the human hand. The difficulty of tracking the human hand arises mainly from the large number of degrees of freedom (DOF) involved in the motion. As a result, estimating the correct hand motion and configuration parameters is equivalent to a search in a high dimensional space. Other difficulties include self-occlusion among the fingers and clutter from the image background, both of which introduce additional uncertainty into the estimation.

One strategy for tracking hand motion is the appearance-based approach [2, 7, 16, 17, 23], which attempts to estimate the hand state directly from image features. A nonlinear mapping is learned from a large number of training images. This approach can quickly estimate the hand configuration once the mapping is learned. However, it is difficult to determine the structure of the mapping function and the optimal set of training data.

The model-based approach is an alternative for estimating hand articulation [10, 11, 13, 14, 15, 18, 19, 20, 24]. The idea is to compare image features between projections of a 3D hand model and real hand images. The hand state is recovered from the configuration that generates the best match. With a well initialized hand model, this approach can produce a very accurate estimate. However, model-based tracking involves the estimation of parameters in a high dimensional space. The hand motion generally consists of 27 DOF: 6 for the global motion and 21 for the finger articulation.

The computational complexity can be reduced by observing that the hand motion is highly constrained. Previous work has shown that by incorporating motion constraints, the dimensionality of the feasible space can be significantly reduced [10, 11, 15]. To take advantage of the motion constraints in the model-based approach, we must address two key issues: 1) the representation of the feasible configuration space, and 2) an efficient and effective search algorithm associated with this representation. Previous work [3, 4, 6, 8] generally represents the manifold under a piecewise linear assumption. A clustering algorithm is first applied to identify locally similar patches, each of which is then approximated by a linear manifold in a lower dimensional space determined by PCA or other dimensionality reduction techniques. For object recognition, an affinity measure is defined between manifolds so as to maximize the class separability [3, 8]. For tracking, a transition probability table is constructed to model the motion dynamics [4, 6]. It is generally difficult to determine a suitable clustering algorithm and to construct an accurate manifold approximation.

Proceedings of the Sixth IEEE International Conference on Automatic Face and Gesture Recognition (FGR’04) 0-7695-2122-3/04 $ 20.00 © 2004 IEEE

In this paper, we propose a novel representation for the feasible configuration space using a set of discrete samples collected with a CyberGlove. Each sample corresponds to one set of joint angle parameters that defines the hand shape. By using the entire set of samples directly, we do not attempt to find an approximate parametrization of the lower dimensional manifolds embedded in the feasible space, thereby avoiding the estimation error due to incorrect assumptions and lossy dimensionality reduction techniques.

For searching in this discrete space, we propose to use a stochastic Nelder-Mead (NM) simplex search. The NM method is a classical direct search algorithm used when gradients are not available or cannot be evaluated. However, like all other bottom-up approaches, it is often trapped in local minima for nonconvex objective functions. Therefore we propose to combine it with the top-down multiple hypothesis approach and estimate the optima with several simplices.

In Sec 2 we describe the representation of the feasible configuration space in greater detail. Sec 3 gives the basic form of the NM simplex search and how we adapt it to search in our new representation. The multiple hypothesis version of the NM simplex search is then presented in Sec 4. Sec 5 and 6 describe how we combine global and local motion and how we obtain the objective function values. Finally, experimental results are shown in Sec 7 and conclusions are given in Sec 8.

2 Non-Parametric Representation of the Hand Configuration Space

The hand motion consists of global motion $M_G$ and local finger motion $M_L$. Global motion includes the 3D translation $t$ and rotation $R$ of the entire hand. $M_L$ is represented by the set of joint angles $\theta_i$, which has roughly 20 degrees of freedom (DOFs) [10]. Exhaustive search in this space without any prior knowledge of the feasible space is nearly impossible. Modeling the natural distribution of the feasible states is the key to reducing the computational complexity. One way to model the feasible space is by observing the piecewise smooth linear manifold structures that lie in a lower dimensional space. Wu et al. [24] learned the local manifold structure by collecting real hand motion data and applying PCA and other anatomical constraints to reduce the dimensionality to 7. Using the linear manifold for motion representation, they were able to track finger motions effectively. Stenger [20], on the other hand, also made use of the linear manifold approximation to help generate templates for constructing a tree structure representation of the configuration space. However, obtaining a good manifold parametrization is generally difficult. Furthermore, PCA can only model the global characteristics. It is not capable of identifying the local manifold structures and their true dimensionality.

We propose to adopt a nonparametric representation and model the entire feasible space directly from the set of $N$ collected CyberGlove samples. The entire feasible space $\Psi$ is defined by the set $\{\theta_i\}, i = 1, \ldots, N$, where each $\theta_i \in \mathbb{R}^{20}$ represents one sampled configuration. A kd-tree is then constructed so that, given any point $\theta \in \mathbb{R}^{20}$, we can quickly find an approximate nearest neighbor $\theta' \in \Psi$. One benefit of this representation is that no learning is required to find a closed form of the manifold representation of $\Psi$. Therefore, this approach avoids the error induced by incorrect approximations of the representation. Furthermore, the motion constraints are automatically embedded in this discrete model. Clearly, the model can be refined as more samples are collected. This advantage comes at the cost of computer memory, which is inexpensive nowadays.
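The nearest-neighbor lookup in $\Psi$ can be illustrated with a minimal pure-Python kd-tree. This is only a sketch under stated assumptions: the names `build_kdtree`, `nearest`, and `psi` are hypothetical, the 500 random vectors stand in for real glove samples, and the search shown is exact, whereas the text only requires an approximate neighbor.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree; each node is (point, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])            # cycle through the coordinates
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                   # median split keeps the tree balanced
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, query, depth=0, best=None):
    """Return the stored sample closest to `query` (exact search)."""
    if node is None:
        return best
    point, left, right = node
    if best is None or sq_dist(query, point) < sq_dist(query, best):
        best = point
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, depth + 1, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if diff ** 2 < sq_dist(query, best):
        best = nearest(far, query, depth + 1, best)
    return best

# Hypothetical stand-in for the glove data: N random 20-D joint-angle vectors.
random.seed(0)
psi = [tuple(random.uniform(0.0, 90.0) for _ in range(20)) for _ in range(500)]
tree = build_kdtree(psi)

theta = tuple(random.uniform(0.0, 90.0) for _ in range(20))
theta_prime = nearest(tree, theta)           # nearest feasible configuration
```

Adding more samples only requires rebuilding or extending the tree, which mirrors the remark that the model is refined as more data are collected.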

3 The Direct Search Algorithm

This section presents the basic Nelder-Mead search method and its adaptation for searching in the feasible configuration space $\Psi$ defined in Sec 2.

3.1 Nelder-Mead Simplex Algorithm

With the feasible configuration space $\Psi$ defined, a search algorithm must be designed that is appropriate for this structure. Given the estimated state $x_t^*$ at time $t$, the goal is to identify the $x_{t+1}^*$ that minimizes a given objective function $f(x)$. However, $\Psi$ has several properties that make it unfavorable for numerical optimization. First, because $\Psi$ uses a nonparametric representation in a high dimensional space, it is difficult to estimate the derivative of the objective function $f(x)$. Second, although we could define $f(x)$ for every $x$, it is difficult to obtain a closed form of $f(x)$ due to its nonlinear nature. Third, the representation has several discontinuities due to the existence of many infeasible configurations. Because of these properties, the gradient is not available in practice, and we must rule out algorithms that require first or second derivatives, such as gradient descent and Newton methods.

Having access only to the function values $f(x)$, one alternative is the Nelder-Mead (NM) search method [21], which belongs to the class of direct search methods. The NM method attempts to minimize a scalar-valued nonlinear function of $n$ real variables using only function values, without any derivative information, whether explicit or implicit. The NM method maintains at each iteration a nondegenerate simplex $S$, which is a geometric object defined by the convex hull of $n + 1$ points $\{x_0, \ldots, x_n\}$ in $\mathbb{R}^n$.

Through a sequence of elementary geometric transformations, the initial simplex moves, expands, or contracts towards the minimum. At each step, the worst vertex, that is, the one with the highest cost $x_{max} = \arg\max_{x \in S} f(x)$, is replaced by one with a smaller function value. In the case of visual hand tracking, the objective function is defined as the negative of the likelihood function $-p(z|x)$, where $z$ is the image observation and $x$ is the hand state.

The procedure for performing the NM search at each iteration is described in the following steps (see [21] for details). At the beginning of each iteration, the worst vertex $x_{max}$ is selected and the centroid of the remaining vertices, $\bar{x} = \frac{1}{n}\left(\sum_{i=0}^{n} x_i - x_{max}\right)$, is computed. The reflected point $x_r$ is then evaluated and, depending on $f(x_r)$, we perform one of the following operations to obtain the new vertex $x_{new}$, which replaces $x_{max}$.

• Reflect: $x_r = (1 + \alpha)\bar{x} - \alpha x_{max}$.

• Expand: $x_e = \gamma x_r + (1 - \gamma)\bar{x}$.

• Contract: $x_c = \beta x_{max} + (1 - \beta)\bar{x}$.

The NM iteration typically terminates when the simplex becomes sufficiently small or when a maximum number of iterations is reached.
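The reflect, expand, and contract operations above can be sketched as a short pure-Python routine. The text does not spell out the acceptance rules, so this sketch assumes the classical Nelder-Mead decisions and adds the usual shrink step when contraction fails; the coefficient values $\alpha = 1$, $\gamma = 2$, $\beta = 0.5$ are conventional defaults rather than values taken from the paper, and the function name `nelder_mead` is hypothetical.

```python
def nelder_mead(f, simplex, alpha=1.0, gamma=2.0, beta=0.5,
                eps=1e-12, max_iter=500):
    """Minimise f from `simplex`, a list of n+1 vertices in R^n."""
    simplex = [list(v) for v in simplex]
    n = len(simplex) - 1
    for _ in range(max_iter):
        simplex.sort(key=f)                      # best first, worst last
        best, worst = simplex[0], simplex[-1]
        # Terminate when the simplex is sufficiently small.
        size = max(sum((a - b) ** 2 for a, b in zip(v, best))
                   for v in simplex[1:])
        if size < eps:
            break
        # Centroid of all vertices except the worst one.
        cen = [sum(v[d] for v in simplex[:-1]) / n for d in range(n)]
        refl = [(1 + alpha) * c - alpha * w for c, w in zip(cen, worst)]
        if f(refl) < f(best):
            # Reflection beats the best vertex: try to expand further.
            exp = [gamma * r + (1 - gamma) * c for r, c in zip(refl, cen)]
            simplex[-1] = exp if f(exp) < f(refl) else refl
        elif f(refl) < f(simplex[-2]):
            simplex[-1] = refl                   # better than the second worst
        else:
            if f(refl) < f(worst):
                worst = simplex[-1] = refl       # keep the better of the two
            con = [beta * w + (1 - beta) * c for w, c in zip(worst, cen)]
            if f(con) < f(worst):
                simplex[-1] = con
            else:
                # Assumed fallback: shrink all vertices toward the best one.
                simplex = [[(b + x) / 2 for b, x in zip(best, v)]
                           for v in simplex]
    return min(simplex, key=f)

# Toy objective with its minimum at (1, -2).
f = lambda p: (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2
x_min = nelder_mead(f, [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

Only function values are compared, which is why the method remains usable when the likelihood has no accessible gradient.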

3.2 Two Stage NM Method

Given the hand state $\theta_t$, the NM method begins by generating an initial simplex $S_t^0$ around $\theta_t$, and the iterative procedure guides the search towards a minimum. Although the NM method is suitable for searching when the gradient is not easily accessible, its basic form does not take advantage of the motion constraints embedded in the feasible space $\Psi$. To incorporate the constraints in the search process, instead of performing an unconstrained simplex search in the continuous domain, we propose a two stage hierarchical NM search. At the coarse level, we restrict each simplex vertex $\theta_i$ to be one of the samples $\theta_j \in \Psi$. At the $k$th iteration, a new vertex $\theta_{new}$ is generated as described in Section 3.1, and a nearby configuration $\theta'$ is located to replace $\theta_{max}$ in the next simplex $S_t^{k+1}$:

$$\theta' = [\theta_{new}]^{+} = \arg\min_{\theta \in \Psi} \| \theta_{new} - \theta \|$$

Many algorithms and data structures have been designed for locating the nearest sample point in a set from any given location, such as the Voronoi diagram, the kd-tree, and ANN. In our experiments, we implemented the kd-tree structure. Since the NM method converges towards a minimum, the nearest neighbor does not need to be exact. By constraining the search to the discrete space $\Psi$, the hand motion constraints are automatically applied to the search.

Since the data we collected cannot possibly cover the entire feasible space, there exist gaps and discontinuities in $\Psi$. Searching only in the discrete domain does not guarantee convergence to an optimum; therefore, a second stage is needed to refine the search. In the first stage, the NM method begins with a larger simplex and ends with a more relaxed termination condition in the discrete domain. In the second stage, the algorithm continues the iteration in the continuous domain with a stricter termination condition.
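A minimal sketch of the two stage search follows. Several things here are assumptions made for illustration: a brute-force scan replaces the kd-tree, an integer grid stands in for $\Psi$, the acceptance rules and shrink step are the classical Nelder-Mead ones, and the continuous stage restarts from a small fresh simplex around the best discrete vertex (snapped vertices can coincide, which would leave a degenerate simplex) instead of continuing from the discrete simplex as in the text. The names `two_stage` and `to_psi` are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nelder_mead(f, simplex, project=None, eps=1e-12, max_iter=500,
                alpha=1.0, gamma=2.0, beta=0.5):
    """Nelder-Mead search; `project`, if given, snaps each new vertex into Psi."""
    snap = project if project else (lambda v: v)
    simplex = [snap(list(v)) for v in simplex]
    n = len(simplex) - 1
    for _ in range(max_iter):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        if max(sq_dist(v, best) for v in simplex[1:]) < eps:
            break                                # simplex is small enough
        cen = [sum(v[d] for v in simplex[:-1]) / n for d in range(n)]
        refl = snap([(1 + alpha) * c - alpha * w for c, w in zip(cen, worst)])
        if f(refl) < f(best):
            exp = snap([gamma * r + (1 - gamma) * c for r, c in zip(refl, cen)])
            simplex[-1] = exp if f(exp) < f(refl) else refl
        elif f(refl) < f(simplex[-2]):
            simplex[-1] = refl
        else:
            if f(refl) < f(worst):
                worst = simplex[-1] = refl
            con = snap([beta * w + (1 - beta) * c for w, c in zip(worst, cen)])
            if f(con) < f(worst):
                simplex[-1] = con
            else:                                # assumed shrink fallback
                simplex = [snap([(b + x) / 2 for b, x in zip(best, v)])
                           for v in simplex]
    return min(simplex, key=f)

def two_stage(f, start, project, big=3.0, small=0.5):
    """Coarse search restricted to Psi, then continuous refinement."""
    n = len(start)
    def around(c, r):                            # axis-aligned simplex at c
        return [list(c)] + [[c[d] + (r if d == i else 0.0) for d in range(n)]
                            for i in range(n)]
    # Stage 1: large simplex, relaxed termination, vertices snapped to Psi.
    coarse = nelder_mead(f, around(start, big), project=project,
                         eps=1.0, max_iter=50)
    # Stage 2: small simplex, strict termination, continuous domain.
    fine = nelder_mead(f, around(coarse, small), eps=1e-12)
    return coarse, fine

# Hypothetical Psi: an integer grid; the true minimum lies between samples.
psi = [(x, y) for x in range(-5, 6) for y in range(-5, 6)]
to_psi = lambda v: list(min(psi, key=lambda s: sq_dist(s, v)))
f = lambda p: (p[0] - 1.3) ** 2 + (p[1] + 2.4) ** 2
coarse, fine = two_stage(f, [-4.0, 4.0], to_psi)
```

The coarse result always lies in $\Psi$, while the refined result can fall in a gap between samples, which is the purpose of the second stage.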

4 Stochastic Simplex Tracking

Like all bottom-up search algorithms, the NM simplex algorithm presented in the previous section can fail when the cost function has multiple minima. Because of the noise present in image feature extraction and the nontrivial definition of the cost function (see Sec 6), the cost function cannot be expected to be convex. Furthermore, the search algorithm outputs only the maximum likelihood estimate $x_t^* = \arg\max_x p(z_t|x)$ when the cost function is defined as $f_t(x) = -p(z_t|x)$.

One way to tackle the multiple minima problem is to employ the multiple hypotheses approach. In the tracking literature, particle filtering has been widely used as an effective multiple hypotheses tracker to reduce ambiguities. The particle filter [1, 9] uses a probabilistic approach that estimates $x^*$ from a time-evolving pdf $p(x|z)$ through the Bayes formulation:

$$p(x_{t+1}|z_{t+1}) \propto p(z_{t+1}|x_{t+1})\, p(x_{t+1}|z_t) \tag{1}$$

where $x_t$ is the target state at time $t$ and $z_t = \{z_1, \ldots, z_t\}$ is the history of image observations. The posterior density $p(x_t|z_t)$ is represented using Monte Carlo simulation, with a set of $n$ random samples and weights $\{s_t^{(n)}, \pi_t^{(n)}\}$ that can approximate an arbitrary nonlinear multi-modal pdf. However, particle filtering techniques become impractical for high dimensional state spaces, as the number of samples required grows exponentially with the number of dimensions. To cope with this problem, Cham [5] proposes a semi-parametric representation, and Wu [24] uses importance sampling from an underlying distribution. Another problem with the particle filter is the degeneracy phenomenon, in which all but a few sample weights become insignificant over time [1].
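The weight update implied by Eq. (1) and the degeneracy problem can be illustrated on a toy one-dimensional sample set. This sketch assumes a Gaussian likelihood with made-up parameters and measures degeneracy with the effective sample size $N_{eff} = 1 / \sum_i (\pi^{(i)})^2$ used in the tutorial [1]; the function names are hypothetical.

```python
import math

def bayes_update(weights, likelihoods):
    """Multiply prior weights by likelihoods and renormalise (Eq. 1 on samples)."""
    unnorm = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def effective_sample_size(weights):
    """N_eff = 1 / sum(w_i^2) for normalised weights; small values mean degeneracy."""
    return 1.0 / sum(w * w for w in weights)

# 100 equally weighted samples of a scalar state (illustrative values only).
particles = [i / 10.0 for i in range(100)]
weights = [1.0 / len(particles)] * len(particles)

# Assumed Gaussian observation model centred on the measurement z.
z, sigma = 5.0, 0.5
lik = [math.exp(-(z - s) ** 2 / (2 * sigma ** 2)) for s in particles]

n_before = effective_sample_size(weights)
weights = bayes_update(weights, lik)
n_after = effective_sample_size(weights)   # far fewer samples carry weight
```

A single sharp observation already concentrates the weight on a small fraction of the samples, and the effect compounds as the state dimension grows.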

We propose a stochastic simplex search algorithm that combines the NM algorithm with the particle filtering tracking scheme. Instead of using a set of random samples to model the evolution of the pdf, we use a set of simplices, each generated from a mode of the pdf at time $t$, to locate the new modes at time $t + 1$ through the NM method. The new algorithm combines the advantages of both approaches to reduce the limitations of employing NM or particle filtering alone. First, the use of multiple hypotheses increases the chance of reaching the global minimum. Second, the prior $p(x_{t+1}|z_t)$ (Eq. 1) is included, giving a more accurate estimate of the current hand state. Third, since each simplex converges to a local minimum, the converged samples have more significant weights. Cham [5] also employed an iterative Gauss-Newton method to locate the modes of the likelihood $p(z_t|x_t)$. In our case, since we do not have access to the gradient, the NM simplex search is employed.

To approximate the tracking prior $p(x_{t+1}|z_t)$, we first draw random samples $x_t^i$ from $p(x_t|z_t)$. Then an initial simplex $S_0^i$ is generated from each $x_t^i$, which corresponds to a mode in $p(x_t|z_t)$. Next, the two stage simplex search is carried out to obtain a local minimum corresponding to a mode of the pdf:

$$S_{*'}^i = NM_d(S_0^i), \qquad S_*^i = NM_c(S_{*'}^i) \tag{2}$$

where $NM_d$ and $NM_c$ denote the discrete and continuous NM operations, respectively. Each operation takes an initial set of vertices and outputs a converged simplex. The search terminates when

$$\sum_{j=0}^{n} \| x_j^k - x_j^{k+1} \|^2 < \varepsilon \tag{3}$$

where $x_j^k \in S^k$ and $S^k$ is the simplex generated at the $k$th iteration. The new sample $x_{t+1}^i$ is the centroid of the converged simplex:

$$x_{t+1}^i = \frac{1}{n+1} \sum_{x_j \in S_*^i} x_j \tag{4}$$

The weight $\pi_{t+1}^i = p(z_{t+1}|x_{t+1})$, evaluated at $x_{t+1} = x_{t+1}^i$, is computed as described in Sec 6. This procedure plays a role similar to the prediction $p(x_{t+1}|z_t) = \int_{x_t} p(x_{t+1}|x_t)\, p(x_t|z_t)\, dx_t$, where the dynamics $p(x_{t+1}|x_t)$ are often modelled as a Gaussian. In the case of NM, the iterative procedure (Sec 3.1) automatically drives the simplex towards a new mode.
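One time step of the stochastic simplex search can be sketched as follows. This is an illustration under several assumptions: only the continuous stage $NM_c$ is run, the acceptance rules and shrink fallback are the classical Nelder-Mead ones, the two-mode likelihood is a made-up stand-in for the image likelihood of Sec 6, and `track_step` is a hypothetical name. Termination follows Eq. (3) and the new sample is the centroid of Eq. (4).

```python
import math
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nelder_mead(f, simplex, eps=1e-10, max_iter=500,
                alpha=1.0, gamma=2.0, beta=0.5):
    """Continuous Nelder-Mead; returns the converged simplex."""
    simplex = [list(v) for v in simplex]
    n = len(simplex) - 1
    for _ in range(max_iter):
        simplex.sort(key=f)
        old = [list(v) for v in simplex]
        best, worst = simplex[0], simplex[-1]
        cen = [sum(v[d] for v in simplex[:-1]) / n for d in range(n)]
        refl = [(1 + alpha) * c - alpha * w for c, w in zip(cen, worst)]
        if f(refl) < f(best):
            exp = [gamma * r + (1 - gamma) * c for r, c in zip(refl, cen)]
            simplex[-1] = exp if f(exp) < f(refl) else refl
        elif f(refl) < f(simplex[-2]):
            simplex[-1] = refl
        else:
            if f(refl) < f(worst):
                worst = simplex[-1] = refl
            con = [beta * w + (1 - beta) * c for w, c in zip(worst, cen)]
            if f(con) < f(worst):
                simplex[-1] = con
            else:                                # assumed shrink fallback
                simplex = [[(b + x) / 2 for b, x in zip(best, v)]
                           for v in simplex]
        # Eq. (3): total squared vertex displacement between iterations.
        if sum(sq_dist(u, v) for u, v in zip(old, simplex)) < eps:
            break
    return simplex

def track_step(samples, weights, likelihood, n_simplices, radius=0.5):
    """Draw samples, run one simplex search per sample, and re-weight."""
    cost = lambda x: -likelihood(x)              # f(x) = -p(z | x)
    new_samples, new_weights = [], []
    for x in random.choices(samples, weights=weights, k=n_simplices):
        n = len(x)
        s0 = [list(x)] + [[x[d] + (radius if d == i else 0.0) for d in range(n)]
                          for i in range(n)]     # initial simplex around x
        s_star = nelder_mead(cost, s0)
        centroid = [sum(v[d] for v in s_star) / (n + 1) for d in range(n)]  # Eq. (4)
        new_samples.append(centroid)
        new_weights.append(likelihood(centroid))
    return new_samples, new_weights

def likelihood(x, modes=((0.0, 0.0, 1.0), (4.0, 0.0, 0.5))):
    """Hypothetical two-mode likelihood: (centre_x, centre_y, height) per mode."""
    return sum(h * math.exp(-((x[0] - mx) ** 2 + (x[1] - my) ** 2) / 2.0)
               for mx, my, h in modes)

random.seed(0)
samples = [(0.5, 0.5), (3.5, -0.5), (1.0, -1.0), (4.5, 0.5)]
weights = [0.25, 0.25, 0.25, 0.25]
new_samples, new_weights = track_step(samples, weights, likelihood, n_simplices=4)
```

Each converged sample sits on a mode of the likelihood, so its weight is significant, in contrast to randomly propagated particles.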

5 Global and Local Motions

In [22], Wu proposed to estimate global and local hand motion simultaneously using a divide-and-conquer approach. Rather than estimating all the state parameters at once, the global and local motion parameters are estimated separately and combined in an iterative manner. Different algorithms can be used to estimate each set of parameters independently, such as the GA-based algorithm used in [22] and the iterative closest point algorithm employed in [12]. In our experiments, we apply the stochastic NM simplex search in the continuous space for global parameter estimation. Then the two stage simplex search (Sec. 4) is employed to recover the finger motion. These two steps are repeated until the results converge. Our experiments have shown that this approach reduces the computational complexity while still estimating the hand motion effectively.
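The alternation between global and local estimation can be sketched as block-coordinate minimization on a toy problem. Everything in this sketch is an assumption for illustration: the global and local parameters are reduced to one scalar each, the coupled quadratic cost is invented, and a golden-section line search stands in for the simplex searches used in the experiments.

```python
import math

def golden_section(f, lo, hi, tol=1e-9):
    """Minimise a unimodal scalar function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    while hi - lo > tol:
        a = hi - inv_phi * (hi - lo)
        b = lo + inv_phi * (hi - lo)
        if f(a) < f(b):
            hi = b                               # minimum lies in [lo, b]
        else:
            lo = a                               # minimum lies in [a, hi]
    return (lo + hi) / 2

def alternate(cost, g, l, tol=1e-6, max_sweeps=100):
    """Alternate global (g) and local (l) estimation until both settle."""
    for _ in range(max_sweeps):
        g_new = golden_section(lambda v: cost(v, l), -10.0, 10.0)      # fix local
        l_new = golden_section(lambda v: cost(g_new, v), -10.0, 10.0)  # fix global
        done = abs(g_new - g) < tol and abs(l_new - l) < tol
        g, l = g_new, l_new
        if done:
            break
    return g, l

# Invented coupled cost; its joint minimum is at g = 2.4, l = -1.6.
cost = lambda g, l: (g - 2.0) ** 2 + (l + 1.0) ** 2 + 0.5 * g * l
g_hat, l_hat = alternate(cost, 0.0, 0.0)
```

Each sub-problem is lower dimensional than the joint one, which is the source of the reduction in search cost.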

6 Model Matching

We employ edge and silhouette observations to measure the likelihood of a hypothesis, as in [12]. The cylinder hand model is first projected onto the image plane as described in [19] to obtain the edge points that define the projected shape. Assuming $K$ projected model samples are generated, edge detection is performed on the points along the normal at each sample. Assuming that $M$ edge points $\{z_m, m = 1, \ldots, M\}$ are observed and that the clutter is a Poisson process with density $\lambda$, then

$$p_e^k(z|x_k) \propto 1 + \frac{1}{\sqrt{2\pi}\,\sigma_e q \lambda} \sum_{m=1}^{M} \exp\left(-\frac{(z_m - x_k)^2}{2\sigma_e^2}\right)$$

We notice that edge points alone cannot provide a good likelihood estimate; therefore, we also consider the silhouette measurement. The segmented foreground pixels are XORed with the projected silhouette image, and the likelihood is computed as $p_s \propto \exp\left(-\frac{(A_I - A_M)^2}{2\sigma_s^2}\right)$. Since a well matched projection contributes a lower cost, we define the objective function at time $t$ to be the negative of the likelihood function:

$$f(x, z) = -p(z|x) \propto -p_s \prod_{k=1}^{K} p_e^k \tag{5}$$
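The objective of Eq. (5) can be sketched directly from the two likelihood terms. The sketch makes several assumptions: edge positions are scalar offsets along each normal, $A_I$ and $A_M$ are passed in as precomputed area measurements (the text does not define them further), $q$ is treated as a free constant because it is not defined in the text, and all numeric parameter values are placeholders.

```python
import math

def edge_likelihood(edge_points, x_k, sigma_e, q, lam):
    """Edge term p_e^k for one model sample x_k and the edges found on its normal."""
    norm = 1.0 / (math.sqrt(2 * math.pi) * sigma_e * q * lam)
    return 1.0 + norm * sum(math.exp(-(z - x_k) ** 2 / (2 * sigma_e ** 2))
                            for z in edge_points)

def silhouette_likelihood(area_image, area_model, sigma_s):
    """Silhouette term p_s from the mismatch between A_I and A_M."""
    return math.exp(-(area_image - area_model) ** 2 / (2 * sigma_s ** 2))

def cost(edge_obs, model_pts, area_image, area_model,
         sigma_e=2.0, sigma_s=50.0, q=0.1, lam=0.05):
    """Eq. (5): negative product of the silhouette term and all K edge terms."""
    p = silhouette_likelihood(area_image, area_model, sigma_s)
    for edges, x_k in zip(edge_obs, model_pts):
        p *= edge_likelihood(edges, x_k, sigma_e, q, lam)
    return -p

# Two model samples (K = 2); a well aligned hypothesis versus a poor one.
good = cost([[10.0], [20.0]], [10.0, 20.0], 1000.0, 1000.0)
bad = cost([[15.0], [28.0]], [10.0, 20.0], 1000.0, 1200.0)
```

A normal with no detected edge contributes a factor of exactly 1, so missing edges neither reward nor zero out a hypothesis.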

7 Experiments

In our experiments, we use a 3D hand model in which each finger phalanx is represented by a truncated cylinder. The hand projection is generated as described in [19] and is superimposed on the real hand image. All experiments are performed against a cluttered background. The first experiment demonstrates the robustness of our 3D model-based tracking when the hand undergoes both translation and rotation (Figure 1). The sequence shows that the 3D model can handle significant occlusions. The fingers are assumed rigid in this case. We use 10 simplex searches for each frame.

In the second video sequence, the fingers bend and extend while the hand moves simultaneously (Figure 2). This experiment shows the robustness of our articulation tracking algorithm. We use 30 simplices for finger articulation tracking and 10 simplices for global motion. We also tested the sequence using the CONDENSATION algorithm with 5000 samples, and that algorithm fails after about 10 frames. In addition to the superimposed model projection, a reconstructed 3D hand model is shown below each corresponding image for better visualization. The experimental results show that our algorithm is robust and successful in tracking complex hand motions in a cluttered environment.

8 Conclusions

This paper proposes to track articulated hand motion by addressing two key issues: 1) the representation of the feasible configuration space and 2) an efficient tracking algorithm associated with the representation. We propose to model the feasible space directly from real hand motion data. The advantage of this representation is that it automatically incorporates the motion constraints in the model. Furthermore, the errors induced by manifold approximation algorithms can be avoided.

In order to utilize this discrete representation for tracking hand motion, we propose a stochastic NM simplex search algorithm that is modified to work in the discrete space. One main advantage of the NM method is that it does not require knowledge of gradients, which in our case are difficult to obtain. Since direct search methods are often trapped in local optima, we incorporated the particle filtering framework into the simplex search. The experimental results show that our algorithm is robust in tracking hand motions against a cluttered background.

Acknowledgments

The authors thank Howard Huang for his numerous insightful comments. This work was supported in part by National Science Foundation Grant IIS-01-38965, and NSF IIS-0347877 for YW.

References

[1] S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for on-line non-linear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002.

[2] V. Athitsos and S. Sclaroff. Estimating 3D hand pose from a cluttered image. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume II, pages 432–439, Madison, 2003.

[3] R. Basri, D. Roth, and D. Jacobs. Clustering appearances of 3D objects. In Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pages 414–420, 1998.

[4] M. Brand. Shadow puppetry. In Proc. of IEEE Int’l Conf. Computer Vision, pages 1237–1244, 1999.

[5] T.-J. Cham and J. M. Rehg. A multiple hypothesis approach to figure tracking. In Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pages 239–245, 1999.

[6] A. Heap and D. Hogg. Wormholes in shape space: Tracking through discontinuous changes in shape. In Proc. of IEEE Int’l Conf. Computer Vision, pages 344–349, 1998.

[7] T. Heap and D. Hogg. Towards 3D hand tracking using a deformable model. In Proc. of IEEE Int’l Conf. Automatic Face and Gesture Recognition, pages 140–145, Killington, VT, 1996.

[8] J. Ho, M.-H. Yang, J. Lim, K.-C. Lee, and D. Kriegman. Clustering appearances of objects under varying illumination conditions. In Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, volume I, pages 11–18, 2003.

[9] M. Isard and A. Blake. Contour tracking by stochastic propagation of conditional density. In Proc. of European Conf. on Computer Vision, pages 343–356, Cambridge, UK, 1996.

[10] J. J. Kuch and T. S. Huang. Vision-based hand modeling and tracking for virtual teleconferencing and telecollaboration. In Proc. of IEEE Int’l Conf. on Computer Vision, pages 666–671, Cambridge, MA, June 1995.

[11] J. Lee and T. Kunii. Model-based analysis of hand posture. IEEE Computer Graphics and Applications, 15:77–86, Sept. 1995.

[12] J. Lin, Y. Wu, and T. S. Huang. Capturing human hand motion in image sequences. In Proc. of IEEE WMVC, pages 99–104, 2002.

[13] S. Lu, D. Metaxas, D. Samaras, and J. Oliensis. Using multiple cues for hand tracking and model refinement. In Proc. IEEE Int’l Conf. on Computer Vision, volume II, pages 443–450, Madison, 2003.

[14] V. Pavlovic, R. Sharma, and T. S. Huang. Visual interpretation of hand gestures for human computer interaction: A review. IEEE Trans. on PAMI, 19:677–695, July 1997.

[15] J. Rehg and T. Kanade. Model-based tracking of self-occluding articulated objects. In Proc. of IEEE Int’l Conf. Computer Vision, pages 612–617, 1995.

[16] R. Rosales, S. Sclaroff, and V. Athitsos. 3D hand pose reconstruction using specialized mappings. In Proc. IEEE Int’l Conf. on Computer Vision, Vancouver, Canada, July 2001.

[17] J. Segen and S. Kumar. Shadow gesture: 3D hand pose estimation using a single camera. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 479–485, 1999.

[18] N. Shimada, Y. Shirai, Y. Kuno, and J. Miura. Hand gesture estimation and model refinement using monocular camera - ambiguity limitation by inequality constraints. In Proc. of the 3rd Conf. on Face and Gesture Recognition, pages 268–273, 1998.

[19] B. Stenger, P. R. S. Mendonca, and R. Cipolla. Model-based 3D tracking of an articulated hand. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Hawaii, 2001.

[20] B. Stenger, A. Thayananthan, P. Torr, and R. Cipolla. Filtering using a tree-based estimator. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2003.

[21] F. Walters, L. R. Parker, S. L. Morgan, and S. N. Deming. Sequential Simplex Optimization. CRC Press, Boca Raton, USA, 1991.

[22] Y. Wu and T. S. Huang. Capturing articulated human hand motion: A divide-and-conquer approach. In Proc. of IEEE Int’l Conf. Computer Vision, pages 606–611, 1999.

[23] Y. Wu and T. S. Huang. View-independent recognition of hand postures. In Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, volume II, pages 88–94, 2000.

[24] Y. Wu, J. Lin, and T. S. Huang. Capturing natural hand articulation. In Proc. of IEEE Int’l Conf. Computer Vision, volume II, pages 426–432, 2001.

Figure 1: Tracking global hand motions involving translation and out-of-plane rotation. The projected model edge points are superimposed on the real hand image.

Figure 2: Simultaneously tracking finger articulation and global hand motion. The projected edge points are superimposed on the real hand image. Below each real hand image, a corresponding reconstructed 3D hand model is shown for better visualization.
